Some Statistical Opportunities in Speech and Language
UNIVERSITY OF SOUTHERN CALIFORNIA MARINA DEL REY INFORMATION SCIENCES INST
Text analysis is a hot topic, and for good reason. Text is more available than ever before. Just ten years ago, the one-million-word Brown Corpus (Francis and Kucera, 1982) was still considered large, but even then, there were much larger corpora in use, such as the 18-million-word Birmingham Corpus (Sinclair, 1987a, 1987b). These days, there are many places that regularly use samples of text running into the hundreds of millions of words, and it is very likely that billions of words will be available very soon. All of this data provides a great research opportunity. It is also possible these days to use corpus data much more effectively than it was in the 1950s, the last time that empiricism was in fashion. Text analysis focuses on broad, though possibly superficial, coverage of unrestricted text, rather than a deep analysis of a restricted domain. This pragmatic view toward coverage and performance distinguishes text analysis from so-called intelligent approaches such as natural language understanding. This approach has produced a number of tools, such as spelling correctors and part-of-speech taggers, that work on unrestricted text with reasonable accuracy and efficiency.