Building a Statistical Machine Translation System from Scratch: How Much Bang for the Buck Can We Expect?
UNIVERSITY OF SOUTHERN CALIFORNIA MARINA DEL REY INFORMATION SCIENCES INST
We report on our experience with building a statistical MT system from scratch, including the creation of a small parallel Tamil-English corpus, and the results of a task-based pilot evaluation of statistical MT systems trained on sets of ca. 1300 and ca. 5000 parallel sentences of Tamil and English data. Our results show that even with apparently incomprehensible system output, humans without any knowledge of Tamil can achieve performance rates as high as 86% accuracy for topic identification, 93% recall for document retrieval, and 64% recall on question answering, plus an additional 14% partially correct answers.