Accession Number : AD1038524


Title : Machine Translation Based Data Augmentation for Cantonese Keyword Spotting (Author's Manuscript)


Descriptive Note : Conference Paper


Corporate Author : LIMSI CNRS, Spoken Language Processing Group Orsay Cedex France


Personal Author(s) : Huang, Guangpu ; Gorin, Arseniy ; Gauvain, Jean-Luc ; Lamel, Lori


Full Text : https://apps.dtic.mil/dtic/tr/fulltext/u2/1038524.pdf


Report Date : 19 May 2016


Pagination or Media Count : 5


Abstract : This paper presents a method to improve the language model for a limited-resource language by using statistical machine translation from a related language to generate text data in the target language. In this work, the machine translation model is trained on a corpus of parallel Mandarin-Cantonese subtitles and used to translate a large set of Mandarin conversational telephone transcripts into Cantonese, the limited-resource target language. The translated transcripts are used to train a more robust language model for speech recognition and keyword search in Cantonese conversational telephone speech. This method enables the keyword search system to detect 1.5 times more out-of-vocabulary words and to achieve a 1.7% absolute improvement in actual term-weighted value.


Descriptors : automated speech recognition , artificial neural networks , machine translation , detection , translations , dictionaries , training , recognition , language , vocabulary , artificial intelligence software , computational linguistics , automatic


Subject Categories : Cybernetics


Distribution Statement : APPROVED FOR PUBLIC RELEASE