Domain Adaptation of Translation Models for Multilingual Applications
CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE
The performance of a statistical translation algorithm in multilingual applications such as cross-lingual information retrieval (CLIR) and machine translation (MT) depends on the quality, quantity, and proper domain matching of the training data. Traditionally, manual selection and customization of training resources has been the prevailing approach. In addition to being labor-intensive, this approach does not scale to the large quantity of heterogeneous resources that have recently become available, such as parallel text and bilingual thesauri in various domains. More importantly, manual customization does not offer a way to efficiently and effectively produce tailored translation models for a mixture of heterogeneous target documents in various domains, topics, languages, and genres. Translation models trained on a general domain do not work well in technical domains; models trained on written documents are not appropriate for spoken dialogue; models trained on manual transcripts can be sub-optimal for translating noisy transcripts produced by a speech recognizer; finally, models trained on a mixture of topics are not optimal for any of the topic-specific documents. We seek to address this challenge by automatically adapting translation models and implicitly parallel training resources to specific target domains or sub-domains.
- Numerical Mathematics