On Using Written Language Training Data for Spoken Language Modeling
BBN Systems and Technologies Corp., Cambridge, MA
We attempted to improve recognition accuracy by reducing the inadequacies of the lexicon and language model. Specifically, we address the following three problems: (1) the best size for the lexicon, (2) conditioning written text for spoken language recognition, and (3) using additional training text from outside the test distribution. We found that increasing the lexicon from 20,000 words to 40,000 words reduced the percentage of words outside the vocabulary from over 2% to just 0.2%, thereby decreasing the error rate substantially, while the error rate on words already in the vocabulary did not increase substantially. We modified the language model training text by applying rules to simulate the differences between the training text and what people actually said. Finally, we found that using another three years of training text, even without the appropriate preprocessing, substantially improved the language model. We also tested these approaches on spontaneous news dictation and found similar improvements.

First, in Section 3, we explore the effect of increasing the vocabulary size on recognition accuracy in an unlimited-vocabulary task. Second, in Section 4, we consider ways to model the differences between the language model training text and the way people actually speak. Third, in Section 5, we show that simply increasing the amount of language model training text helps significantly.

2. THE WSJ CORPUS

The November 1993 ARPA Continuous Speech Recognition (CSR) evaluation was based on speech and language taken from the Wall Street Journal (WSJ). The standard language model was estimated from about 35 million words of text extracted from the WSJ from 1987 to 1989.
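The out-of-vocabulary (OOV) comparison above can be sketched as follows. This is an illustrative toy example, not the paper's code: the helper `oov_rate` and the tiny word lists are assumptions, standing in for the real 20,000- and 40,000-word lexicons and test transcriptions.

```python
# Hedged sketch: OOV rate of a test text against a fixed-size lexicon
# built from the most frequent training words (toy stand-in for the
# 20k- vs 40k-word lexicon comparison).
from collections import Counter

def oov_rate(test_words, lexicon):
    """Fraction of running test words not covered by the lexicon."""
    oov = sum(1 for w in test_words if w not in lexicon)
    return oov / len(test_words)

# Toy data (hypothetical): frequency-ranked lexicon from training text.
train = "the market rose sharply as traders bought shares".split()
test = "the market fell as investors sold shares overseas".split()

counts = Counter(train)
lexicon = {w for w, _ in counts.most_common(8)}

print(oov_rate(test, lexicon))
```

A larger lexicon lowers the OOV rate; the paper's point is that the accuracy gained on previously-OOV words outweighs any confusion added among in-vocabulary words.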
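Conditioning written text toward spoken forms can be illustrated with a few rewrite rules. These particular rules are assumptions for illustration only, not BBN's actual preprocessing; the real system's rules would be far more extensive.

```python
# Hedged illustration: simple rules rewriting written WSJ-style text
# toward the forms people actually say. The rule set is hypothetical.
import re

RULES = [
    # Order matters: handle "$N million" before the bare "$N" rule.
    (re.compile(r"\$(\d+) million"), r"\1 million dollars"),
    (re.compile(r"\$(\d+)"), r"\1 dollars"),
    (re.compile(r"\bMr\."), "mister"),
    (re.compile(r"\bCorp\."), "corporation"),
    (re.compile(r"%"), " percent"),
]

def condition(text):
    """Apply each rewrite rule in order to one line of training text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(condition("Mr. Smith said the Corp. earned $3 million, up 5%."))
```

Running such rules over the language model training text moves its word distribution closer to what speakers actually produce when reading or dictating.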