Accession Number:

ADA470523

Title:

Acquaintance: Language-Independent Document Categorization by N-Grams

Descriptive Note:

Corporate Author:

DEPARTMENT OF DEFENSE FORT GEORGE G MEADE MD

Personal Author(s):

Report Date:

1995-11-01

Pagination or Media Count:

14

Abstract:

Acquaintance is a novel vector-space n-gram technique for categorizing documents. The technique is completely language-independent, highly garble-resistant, and computationally simple. An unoptimized version of the algorithm was used to process the TREC database in a very short time. Acquaintance combines the robustness of an n-gram-based algorithm with a novel vector-space model. It gauges similarity among documents on the basis of common features, permitting document categorization based on a common language, a common topic, or common subtopics. The algorithm is completely language- and topic-independent, and is resistant to garbling even at the 10 to 15 character level. Acquaintance is fully described in Damashek, 1995. The TREC-3 conference provided the first public demonstration and evaluation of this new technique, and TREC-4 provided an opportunity to test its usefulness on several types of text retrieval tasks.
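
The general idea of comparing documents by character n-grams in a vector space can be illustrated with a minimal sketch. This is not the Acquaintance algorithm itself, whose details are given in Damashek, 1995 and are not reproduced in this record. The function names, the n-gram length of 5, the lowercasing and whitespace normalization, and the use of plain cosine similarity are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams in a text.

    Assumption: text is lowercased and runs of whitespace are collapsed
    before counting. The n-gram length (default 5) is also an assumption.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty or too-short document shares nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    english = "The quick brown fox jumps over the lazy dog near the river bank."
    garbled = "The quick brwn fox jumps ovr the lazy dog near the rivr bank."
    french = "Le renard brun rapide saute par-dessus le chien paresseux."

    base = ngram_profile(english)
    print("english vs garbled:", cosine_similarity(base, ngram_profile(garbled)))
    print("english vs french: ", cosine_similarity(base, ngram_profile(french)))
```

Because the vectors are built from overlapping character sequences rather than whole words, no tokenizer or dictionary is needed, and a few corrupted characters disturb only the n-grams that overlap them. That is the intuition behind the language independence and garble resistance claimed in the abstract, under the assumptions stated above.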

Subject Categories:

  • Linguistics
  • Theoretical Mathematics

Distribution Statement:

APPROVED FOR PUBLIC RELEASE