Lucene for n-grams using the ClueWeb Collection

Technical Report | Accession Number: ADA517732

Abstract:

The ARSC team modified the Apache Lucene engine to accommodate go words taken from the Google Gigaword vocabulary of n-grams. The Category B subset of the ClueWeb collection was indexed with a divide-and-conquer method, working across the separate ClueWeb subsets for 1-, 2-, and 3-grams.
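
The general idea of go-word indexing with a divide-and-conquer merge can be illustrated with a minimal sketch. This is not the ARSC implementation and does not use the Lucene API; it is a toy in-memory inverted index written in plain Python. The function names (index_subset, merge_indexes, build_index), the sample documents, and the go-word list are all hypothetical and chosen only for illustration. The sketch assumes that a go-word list acts as an allow list (only listed n-grams are indexed), that each subset is indexed independently, and that the partial indexes are then merged separately for each n-gram length.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def index_subset(docs, go_words, n):
    """Index one subset for one n-gram length: n-gram -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            # Assumption: go words form an allow list, so only listed n-grams are kept.
            if gram in go_words:
                index[gram].add(doc_id)
    return index


def merge_indexes(partials):
    """Combine per-subset partial indexes by taking the union of posting sets."""
    merged = defaultdict(set)
    for partial in partials:
        for gram, doc_ids in partial.items():
            merged[gram] |= doc_ids
    return merged


def build_index(subsets, go_words, max_n=3):
    """Divide and conquer: index each subset on its own, then merge per n."""
    return {
        n: merge_indexes(index_subset(docs, go_words, n) for docs in subsets)
        for n in range(1, max_n + 1)
    }


# Hypothetical sample data: two small "subsets" and a tiny go-word list.
subsets = [
    {"d1": "open source search engine"},
    {"d2": "search engine for web pages"},
]
go_words = {"search", "search engine", "open source search"}
index = build_index(subsets, go_words)
```

In this sketch, index[2]["search engine"] contains both document ids because the partial 2-gram indexes from the two subsets are unioned at merge time, while a token such as "engine" is never indexed because it is not on the go-word list.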

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:
Approved For Public Release
Distribution Statement:
Approved For Public Release; Distribution Is Unlimited.

RECORD

Collection: TR