Machine Translation Based Data Augmentation for Cantonese Keyword Spotting (Author's Manuscript)

Huang, Guangpu; Gorin, Arseniy; Gauvain, Jean-Luc; Lamel, Lori

Active / Technical Report | Accession Number: AD1038524

Abstract:

This paper presents a method for improving the language model of a low-resource language by using statistical machine translation from a related language to generate text in the target language. The machine translation model is trained on a corpus of parallel Mandarin-Cantonese subtitles and then used to translate a large set of Mandarin conversational telephone transcripts into Cantonese, for which resources are limited. The translated transcripts are used to train a more robust language model for speech recognition and keyword search in Cantonese conversational telephone speech. With this method, the keyword search system detects 1.5 times more out-of-vocabulary words and achieves a 1.7 absolute improvement in actual term-weighted value (ATWV).
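The augmentation idea in the abstract can be illustrated with a toy sketch. It is not the authors' system: the paper uses a statistical machine translation model trained on parallel subtitles, whereas the sketch below substitutes a hypothetical word-for-word lexicon (`TOY_LEXICON`), and all sentences, keywords, and function names are invented for illustration. It only shows why adding translated text to a small target-language corpus can shrink the set of out-of-vocabulary keywords.

```python
from collections import Counter

# Hypothetical Mandarin -> Cantonese lexicon. ASSUMPTION: this stands in
# for the trained statistical MT model described in the abstract.
TOY_LEXICON = {"他们": "佢哋", "是": "係", "不": "唔", "的": "嘅", "什么": "乜嘢"}


def translate(tokens, lexicon):
    """Word-for-word substitution; words missing from the lexicon pass through."""
    return [lexicon.get(tok, tok) for tok in tokens]


def build_vocab(sentences):
    """Count word occurrences over a list of tokenized sentences."""
    return Counter(tok for sent in sentences for tok in sent)


def oov_keywords(keywords, vocab):
    """Return the keywords that never occur in the vocabulary."""
    return [kw for kw in keywords if kw not in vocab]


# Invented toy data: a tiny Cantonese seed corpus and a larger Mandarin one.
cantonese_seed = [["佢哋", "係", "學生"]]
mandarin_corpus = [["他们", "不", "是", "老师"], ["这", "是", "什么"]]

# Translate the Mandarin text and pool it with the Cantonese seed data.
translated = [translate(sent, TOY_LEXICON) for sent in mandarin_corpus]
baseline_vocab = build_vocab(cantonese_seed)
augmented_vocab = build_vocab(cantonese_seed + translated)

keywords = ["唔", "乜嘢", "老师", "電話"]
baseline_oov = oov_keywords(keywords, baseline_vocab)
augmented_oov = oov_keywords(keywords, augmented_vocab)
print(len(baseline_oov), len(augmented_oov))
```

In a real system the pooled counts would feed an n-gram language model rather than a bare vocabulary, and the translated text would typically be weighted against the in-domain data; those steps are omitted here.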

Author(s):

Huang, Guangpu; Gorin, Arseniy; Gauvain, Jean-Luc; Lamel, Lori

Author Organization(s):

LIMSI CNRS, Spoken Language Processing Group, Orsay Cedex, France

Descriptive Note:

Conference Paper

Supplementary Note:

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 20 Mar 2016 to 25 Mar 2016

Pagination:

0005

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:

Approved For Public Release

Distribution Statement:

Approved For Public Release

RECORD

Collection: TR

Identifying Numbers

Report Number(s):

IARPA/DC

Subject Terms

Joint Capability Areas:

JCA_1.2.1_Training; JCA_1.2.7_Experimentation

Modernization Areas:

AI and Machine Learning

Communities of Interest:

No COI(s) Identified

Descriptor(s):

automated speech recognition, artificial neural networks, machine translation, detection, translations, dictionaries, training, recognition, language, vocabulary, artificial intelligence software, computational linguistics, automatic

Field(s)/Group(s):

Cybernetics

Keyword(s):

IARPA Collection, Hidden Markov models, data models, dictionaries, keyword spotting, data augmentation, language modelling, low resourced languages

Report Date:

2016 May 19

Creation Date:

2016 May 19