Robust Text Processing in Automated Information Retrieval
NEW YORK UNIV NY COURANT INST OF MATHEMATICAL SCIENCES
This paper outlines a prototype text retrieval system which uses relatively advanced natural language processing techniques to enhance the effectiveness of statistical document retrieval. The backbone of our system is a traditional retrieval engine which builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process users' natural language requests into effective search queries. The basic assumption of this design is that a term-based representation of contents is in principle sufficient to build an effective, if not optimal, search query out of any user's request. This has been confirmed by an experiment that compared the effectiveness of queries prepared by expert users with that of queries derived automatically from an initial narrative information request. In this paper we show that large-scale natural language processing (hundreds of millions of words and more) is not only required for better retrieval, but is also feasible, given appropriate resources. We report on selected preliminary results of experiments with a 500 MByte database of Wall Street Journal articles, as well as some earlier results with a smaller document collection.
- Information Science
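
The retrieval backbone described in the abstract (build an inverted index from pre-processed documents, then search and rank against a query) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the stopword list, the regular-expression tokenizer standing in for the natural language preprocessing, the tf-idf weighting, and the sample documents are not taken from the paper, which does not specify its ranking formula in this abstract.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stopword list; a crude stand-in for NLP-based term extraction.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "on", "for"}


def extract_terms(text):
    """Lowercase, tokenize on letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(extract_terms(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by a simple tf-idf sum (an assumed weighting scheme)."""
    scores = defaultdict(float)
    for term in extract_terms(query):
        postings = index.get(term, {})
        if not postings:
            continue  # query term does not occur in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up sample collection, for illustration only.
docs = {
    "d1": "Stock prices fell on Wall Street",
    "d2": "The bank raised interest rates",
    "d3": "Interest in stock funds grew as prices rose",
}
index = build_index(docs)
for doc_id, score in search(index, len(docs), "interest rates"):
    print(doc_id, round(score, 3))
```

For the query "interest rates", d2 ranks above d3 because it matches both query terms and "rates" occurs in only one document, so it carries a higher inverse document frequency. A system of the kind the abstract describes would replace `extract_terms` with linguistically motivated term extraction and would expand queries using the discovered inter-term dependencies.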