Document Filtering Using Semantic Information from a Machine Readable Dictionary

Liddy, Elizabeth D.; Paik, Woojin; Yu, Edmund S.

Active / Technical Report | Accession Number: ADA457808

Abstract:

Large-scale information retrieval systems need to refine the flow of documents which will receive further fine-grained analysis to those documents with a high potential for relevance to their respective users. This paper reports on research we have conducted into the usefulness of semantic codes from a machine readable dictionary for filtering large sets of incoming documents for their broad subject appropriateness to a topic of interest. The Subject Field Coder produces a summary-level semantic representation of a text's contents by tagging each word in the document with the appropriate, disambiguated Subject Field Code (SFC). The within-document SFCs are normalized to produce a vector of the SFCs representing that document's contents. Queries are likewise represented as SFC vectors and then compared to SFC vectors of incoming documents, which are then ranked according to similarity to the query SFC vector. Only those documents whose SFC vectors exhibit a predetermined degree of similarity to the query SFC vector are passed to later system components for more refined representation and matching. The assignment of SFCs is fully automatic and efficient, and has been empirically tested as a reasonable approach for ranking documents from a very large incoming flow of documents. We report details of the implementation, as well as results of empirical testing of the Subject Field Coder on fifty queries.
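
The filtering pipeline described in the abstract (per-word SFC tags, a normalized SFC vector per document, similarity ranking against the query vector, and a cutoff) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not name the normalization or the similarity measure, so relative frequency and cosine similarity are assumed here, and the function names, the two-letter codes, and the threshold value are all hypothetical. Word-level tagging and disambiguation are taken as already done.

```python
from collections import Counter
from math import sqrt


def sfc_vector(codes):
    """Build a normalized SFC vector from a list of per-word SFC tags.

    Assumption: normalization is by relative frequency within the document.
    """
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts.

    Assumption: the abstract does not specify the similarity measure.
    """
    dot = sum(weight * v.get(code, 0.0) for code, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_documents(query_codes, docs, threshold=0.5):
    """Rank documents by similarity to the query and keep those at or above
    the cutoff. `docs` maps a document id to its list of SFC tags.
    The threshold value is a hypothetical placeholder.
    """
    query_vec = sfc_vector(query_codes)
    scored = [(doc_id, cosine(query_vec, sfc_vector(codes)))
              for doc_id, codes in docs.items()]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked if score >= threshold]


# Hypothetical SFC tags, invented for illustration only.
example_docs = {
    "d1": ["EC", "EC", "PL"],
    "d2": ["MD", "MD"],
    "d3": ["EC", "MD"],
}
passed = filter_documents(["EC", "EC", "PL"], example_docs, threshold=0.5)
```

In this toy run, a document sharing no codes with the query scores zero and is dropped, while documents above the cutoff are returned in ranked order for later, finer-grained matching.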

Author(s):

Liddy, Elizabeth D. ; Paik, Woojin ; Yu, Edmund S.

Author Organization(s):

SYRACUSE UNIV NY

Descriptive Note:

Conference paper

Supplementary Note:

Presented at the ACL Workshop on Very Large Corpora held in Columbus, OH on 22 Jun 1993. Published in the Proceedings of the ACL Workshop on Very Large Corpora, 1993.

Pagination:

0011

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:

Approved For Public Release

Distribution Statement:

Approved For Public Release; Distribution Is Unlimited.

RECORD

Collection: TR

Identifying Numbers

Contract/Grant Number(s):

91-F136100-00

Monitor Series:

DARPA

Subject Terms

Joint Capability Areas:

JCA_1_Force Support; JCA_1.2.6_Concepts; JCA_1.2_Force Preparation; JCA_5_Command and Control; JCA_5.3_Planning; JCA_8_Building Partnerships; JCA_6_Net Centric; JCA_1.2.3_Educating; JCA_3_Force Application; JCA_6.1_Information Transport; JCA_8.2_Shape; JCA_1.2.1_Training; JCA_8.1_Communicate

Modernization Areas:

AI and Machine Learning

Communities of Interest:

Engineered Resilient Systems

Descriptor(s):

*CODING, *DOCUMENTS, *WORDS(LANGUAGE), *INFORMATION RETRIEVAL, SYMPOSIA, SEMANTICS, CLASSIFICATION

Field(s)/Group(s):

Information Science, Linguistics, Computer Programming and Software

Keyword(s):

*SFC(SUBJECT FIELD CODES), *DOCUMENT FILTERING, *RELEVANCE, INFORMATION FILTERING

Report Date:

1993 Jun 01

Creation Date:

2006 Dec 13