Information Extraction Overview
Abstract:
The information explosion of the last decade has placed increasing demands on processing and analyzing large volumes of on-line data. In response, the Advanced Research Projects Agency (ARPA) has been supporting research to develop a new technology called information extraction. Information extraction is a type of document processing that captures and outputs factual information contained within a document. Like an information retrieval (IR) system, an information extraction system responds to a user's information need. Whereas an IR system identifies a subset of documents in a large text database (or, in a library scenario, a subset of resources in a library), an information extraction system identifies a subset of information within a document. This subset of information is not necessarily a summary or gist of the contents of the document. Rather, it corresponds to predefined generic types of information of interest and represents specific instances found in the text. For example, a user of a system may be interested in identifying and databasing information on all companies named within a set of documents, including companies not previously known to the user. An information extraction system can extract and output all of the occurrences of company names within a text with an accuracy of 75%. Moreover, it is possible to specify that the system extract only those companies of a certain type, such as Japanese companies or companies in the textile industry.
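
The company-name example can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the method of any ARPA-sponsored system: the suffix list, the KNOWN_ATTRIBUTES table, the company names, and the function names are all hypothetical, and a simple regular expression stands in for a real name recognizer. The sketch shows the two steps the abstract describes: extracting every company mention from a text, including companies not known in advance, and then restricting the output to companies of a certain type.

```python
import re
from dataclasses import dataclass

# Hypothetical suffix list: a toy stand-in for a real company-name recognizer.
COMPANY_SUFFIXES = ("Corp.", "Inc.", "Ltd.", "Co.")

# Hypothetical background knowledge. Companies missing from this table are
# still extracted; they simply carry no attributes to filter on.
KNOWN_ATTRIBUTES = {
    "Toray Industries Inc.": {"country": "Japan", "industry": "textiles"},
    "Acme Steel Corp.": {"country": "USA", "industry": "steel"},
}


@dataclass(frozen=True)
class CompanyMention:
    name: str   # the company name as it appears in the text
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention


# One or more capitalized words followed by a company suffix.
_SUFFIX_PATTERN = "|".join(re.escape(s) for s in COMPANY_SUFFIXES)
_COMPANY_RE = re.compile(r"(?:[A-Z][A-Za-z&]*\s)+(?:" + _SUFFIX_PATTERN + ")")


def extract_companies(text):
    """Return every company mention found in the text, in order."""
    return [
        CompanyMention(m.group(0), m.start(), m.end())
        for m in _COMPANY_RE.finditer(text)
    ]


def filter_by(mentions, **required):
    """Keep only mentions whose known attributes match every requirement."""
    kept = []
    for mention in mentions:
        attrs = KNOWN_ATTRIBUTES.get(mention.name, {})
        if all(attrs.get(key) == value for key, value in required.items()):
            kept.append(mention)
    return kept


text = (
    "Shares of Toray Industries Inc. rose after Acme Steel Corp. and "
    "newcomer Zenith Looms Ltd. announced a joint venture."
)
mentions = extract_companies(text)
japanese = filter_by(mentions, country="Japan")
```

Here extract_companies(text) finds all three companies, including Zenith Looms Ltd., which is absent from the background table, while filter_by(mentions, country="Japan") narrows the result to the single Japanese company. An IR system, by contrast, would return the whole document rather than these instances within it.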