Accession Number:

ADA460240

Title:

Robust Text Processing in Automated Information Retrieval

Descriptive Note:

Corporate Author:

NEW YORK UNIV NY COURANT INST OF MATHEMATICAL SCIENCES

Personal Author(s):

Report Date:

1993-01-01

Pagination or Media Count:

12

Abstract:

This paper outlines a prototype text retrieval system which uses relatively advanced natural language processing techniques in order to enhance the effectiveness of statistical document retrieval. The backbone of our system is a traditional retrieval engine which builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract contents-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process users' natural language requests into effective search queries. The basic assumption of this design is that a term-based representation of contents is in principle sufficient to build an effective, if not optimal, search query out of any user's request. This has been confirmed by an experiment that compared the effectiveness of expert-user prepared queries with those derived automatically from an initial narrative information request. In this paper we show that large-scale natural language processing (hundreds of millions of words and more) is not only required for better retrieval, but is also doable, given appropriate resources. We report on selected preliminary results of experiments with a 500 MByte database of Wall Street Journal articles, as well as some earlier results with a smaller document collection.
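
The retrieval backbone described in the abstract (build an inverted index from pre-processed documents, then search and rank in response to a query) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `extract_terms`, `build_index`, and `search` are hypothetical, the stopword filter is only a crude stand-in for the natural language preprocessing the report describes, and the tf-idf scoring is a generic textbook choice, not necessarily the ranking used in the reported system.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stopword list; a stand-in for real NLP-based term extraction.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}


def extract_terms(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(extract_terms(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by the sum of tf * idf over the query terms (assumed scoring)."""
    scores = defaultdict(float)
    for term in extract_terms(query):
        postings = index.get(term, {})
        if not postings:
            continue  # the term occurs in no document
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest-scoring documents first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

To try it, pass `build_index` a dictionary mapping document identifiers to text, then call `search(index, len(docs), "some query")`; the result is a list of `(doc_id, score)` pairs in descending score order.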

Subject Categories:

  • Information Science

Distribution Statement:

APPROVED FOR PUBLIC RELEASE