Using Linguistic Knowledge in Statistical Machine Translation

Zbib, Rabih M.

Active / Technical Report | Accession Number: ADA544288

Abstract:

In this thesis, we present methods for using linguistically motivated information to enhance the performance of statistical machine translation (SMT). One of the advantages of the statistical approach to machine translation is that it is largely language-agnostic. Machine learning models are used to automatically learn translation patterns from data. SMT can, however, be improved by using linguistic knowledge to address specific areas of the translation process, where translations would be hard to learn fully automatically. We present methods that use linguistic knowledge at various levels to improve statistical machine translation, focusing on Arabic-English translation as a case study. In the first part, morphological information is used to preprocess the Arabic text for Arabic-to-English and English-to-Arabic translation, which reduces the gap in the complexity of the morphology between Arabic and English. The second method addresses the issue of long-distance reordering in translation to account for the difference in the syntax of the two languages. In the third part, we show how additional local context information on the source side is incorporated, which helps reduce lexical ambiguity. Two methods are proposed for using binary decision trees to control the amount of context information introduced. These methods are successfully applied to the use of diacritized Arabic source in Arabic-to-English translation. The final method combines the outputs of an SMT system and a rule-based MT (RBMT) system, taking advantage of the flexibility of the statistical approach and the rich linguistic knowledge embedded in the rule-based MT system.

Author(s):

Zbib, Rabih M.

Author Organization(s):

MASSACHUSETTS INST OF TECH CAMBRIDGE

Descriptive Note:

Doctoral thesis

Pagination:

0164

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:

Approved For Public Release

Distribution Statement:

Approved For Public Release; Distribution Is Unlimited.

RECORD

Collection: TR

Identifying Numbers

Contract/Grant Number(s):

HR0011-06-C-0022

Monitor Series:

DARPA

Subject Terms

Joint Capability Areas:

JCA_1_Force Support; JCA_5_Command and Control; JCA_1.2_Force Preparation; JCA_1.2.1_Training; JCA_5.4_Decide; JCA_1.2.7_Experimentation; JCA_1.3_Human Capital Management; JCA_1.3.2_Personnel Management; JCA_5.3_Planning; JCA_5.2.2_Develop Knowledge and Situational Awareness; JCA_5.2_Understand; JCA_5.5.2_Task; JCA_5.5_Direct; JCA_8_Building Partnerships; JCA_1.4.2_Health Service Delivery; JCA_1.4_Force Support; JCA_8.2_Shape; JCA_1.2.3_Educating; JCA_1.2.5_Lessons Learned; JCA_1.3.1_Personnel and Family Support; JCA_6_Net Centric

Modernization Areas:

Fully Networked C3

Communities of Interest:

C4I

Descriptor(s):

*MACHINE TRANSLATION, THESES, STATISTICAL PROCESSES, ARABIC LANGUAGE, ENGLISH LANGUAGE, LEARNING MACHINES, LINGUISTICS

Field(s)/Group(s):

Linguistics

Keyword(s):

SMT(STATISTICAL MACHINE TRANSLATION)

Report Date:

2010 Sep 01

Creation Date:

2011 Jul 06