Addressing Challenges of Machine Translation of Inuit Languages
US Army Research Laboratory Adelphi United States
Machine translation to and from polysynthetic languages, such as those of the Inuit language family, has largely been overlooked because their complex morphology has been a barrier to research in computational methodologies. Polysynthetic languages pack abundant semantic and grammatical information into single words, so their data sets are inherently extremely sparse, which makes them computationally challenging for typical word-based analysis. Here, we focus on Inuktitut, a polysynthetic language spoken in Canada that is one of the official languages of the Nunavut territory and is used in all its governmental and educational documentation. We discuss Inuktitut, highlighting its polysynthetic typology, word formation, grammatical complexity, morphophonemics, spelling, and dialect variation, and review how this complexity presents challenges for machine translation and morphological processing. We consider the following: improving the performance of a finite-state transducer morphological analyzer using various neural network approaches; using alternate subword units with a neural network architecture to improve over a baseline English-Inuktitut statistical machine translation system, and determining which subword unit yields the most improvement; using a pipelined English-Inuktitut translation system, featuring deep-representation morpheme sequences converted to surface forms, to compete with the best subword system; and using hierarchical structures over morphemes in a novel approach to improve over the best subword system.
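The sparsity argument above can be illustrated with a minimal sketch that compares the type-token ratio of a corpus counted as whole words against the same corpus counted as subword units. The toy corpus, the `+` segmentation marker, and the function name `type_token_ratio` are all assumptions made for illustration; the strings are invented placeholders rather than real Inuktitut morphemes, and the sketch does not reproduce any system from the report.

```python
def type_token_ratio(tokens):
    """Distinct types divided by total tokens; a value near 1.0 means
    almost every token is unique, i.e. the data are very sparse."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical toy corpus: each "word" is a stem plus suffixes joined by "+".
# These are invented placeholders, not real Inuktitut morphemes.
words = [
    "see+want+I",
    "see+not+I",
    "hear+want+you",
    "hear+not+I",
    "see+want+you",
    "hear+want+I",
]

# Word-level view: every full word is treated as one opaque token.
word_tokens = words

# Subword-level view: each word is split into its component units.
morph_tokens = [unit for word in words for unit in word.split("+")]

word_ttr = type_token_ratio(word_tokens)
morph_ttr = type_token_ratio(morph_tokens)

print(f"word-level type-token ratio:    {word_ttr:.2f}")
print(f"subword-level type-token ratio: {morph_ttr:.2f}")
```

In this toy setting every whole word occurs only once, while the subword units recur across words, which is the intuition behind exploring subword and morpheme-based units for translation.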