Accession Number:



Leveraging Small-Lexicon Language Models

Descriptive Note:

Technical Report, 01 Jul 2015, 31 Dec 2016

Corporate Author:

CRCL INC San Clemente United States

Personal Author(s):

Report Date:


Pagination or Media Count:



This final report describes the Leveraging Small-Lexicon Language Models project, contracted for 18 months under DARPA LORELEI. We focused on Asia-Pacific, a global hotspot of disaster risk with high language density but few electronic data resources and little off-the-shelf language technology. CRCL provided data and initial analysis for five major families with varied typology: Austroasiatic, Austronesian, Hmong-Mien, Kra-Dai, and Sino-Tibetan. Together these families include about 2,000 languages. We delivered more than 1,000 lects from some 500 distinct ISO 639-3 codes, including over 850,000 lexemes. Data came mainly from smallish, high-quality print lexicons developed for linguistic purposes (language sketch, survey, and comparative analysis); these are the only resources that are widely available throughout the region. Primary effort went to normalizing phonological transcription and semantic glossing using the MetaForm and MetaGloss frameworks we devised, identifying cognate sets, and producing various types of phonological and semantic analysis of the lexicons. We also distributed a multilingual HADR (humanitarian assistance and disaster relief) thesaurus of disaster-related terms. All language materials are available for re-use under the CC 4.0 license.

Subject Categories:

  • Linguistics

Distribution Statement: