Accession Number:



Title:

Finding Relevant Data in a Sea of Languages

Descriptive Note:

Technical Report

Corporate Author:

MIT Lincoln Laboratory Lexington United States

Report Date:


Pagination or Media Count:



Abstract:

A cross-language search engine combines language identification, machine translation, information retrieval, and query-biased summarization techniques to enable English monolingual analysts to find foreign language documents relevant to their investigations. "About 6,000 languages are currently spoken in the world today," says Elizabeth Salesky of Lincoln Laboratory's Human Language Technology (HLT) Group. "Within the law enforcement community, there are not enough multilingual analysts who possess the necessary level of proficiency to understand and analyze content across these languages," she continues. This problem of too many languages and too few specialized analysts is one Salesky and her colleagues are now working to solve for law enforcement agencies, but their work has potential application for the Department of Defense and Intelligence Community. The research team is taking advantage of major advances in language recognition, speaker recognition, speech recognition, machine translation, and information retrieval to automate language processing tasks so that the limited number of linguists available for analyzing text and spoken foreign languages can be used more efficiently. "With HLT, an equivalent of 20 times more foreign language analysts are at your disposal," says Salesky. One area in which Laboratory researchers are focusing their efforts is cross-language information retrieval (CLIR). The Cross-LAnguage Search Engine, or CLASE, is a CLIR tool developed by the HLT Group for the Federal Bureau of Investigation (FBI). CLASE is a fusion of Laboratory research in language identification, machine translation, information retrieval, and query-biased summarization. CLASE enables English monolingual analysts to help search for and filter foreign language documents, tasks that have traditionally been restricted to foreign language analysts. Laboratory researchers considered three algorithmic approaches to CLIR that have emerged in the HLT research community.

Subject Categories:

  • Computer Programming and Software
  • Linguistics

Distribution Statement: