Lucene for n-grams using the ClueWeb Collection
Abstract:
The ARSC team modified the Apache Lucene engine to accommodate go words, taken from the Google Gigaword vocabulary of n-grams. The Category B subset of the ClueWeb collection was indexed by a divide-and-conquer method, working across the separate ClueWeb subsets for 1-, 2-, and 3-grams.
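The abstract does not describe the implementation, so the following is only a minimal sketch of the general idea, written in plain Python rather than against Lucene's API. It assumes that a go-word list acts as a whitelist (only n-grams in the list are indexed) and that "divide and conquer" means building a partial index per subset and per n-gram order, then merging the partial indexes. All names here (`ngrams`, `index_subset`, `merge_indexes`, `build_index`) are hypothetical and do not come from the report.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def index_subset(docs, go_words, n):
    """Build a partial inverted index for one subset and one n-gram order.

    Only n-grams that appear in the go-word list are kept (assumed behavior).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            if gram in go_words:
                index[gram].add(doc_id)
    return index


def merge_indexes(partials):
    """Merge partial indexes by taking the union of their posting sets."""
    merged = defaultdict(set)
    for partial in partials:
        for gram, doc_ids in partial.items():
            merged[gram] |= doc_ids
    return dict(merged)


def build_index(subsets, go_words, orders=(1, 2, 3)):
    """Index every subset separately for each n-gram order, then merge."""
    partials = [
        index_subset(docs, go_words, n)
        for docs in subsets
        for n in orders
    ]
    return merge_indexes(partials)


if __name__ == "__main__":
    # Toy data standing in for collection subsets and a go-word vocabulary.
    go_words = {"lucene", "search engine", "open source search"}
    subset_a = {"d1": "Lucene is an open source search engine"}
    subset_b = {"d2": "a search engine built on Lucene"}
    for gram, doc_ids in sorted(build_index([subset_a, subset_b], go_words).items()):
        print(gram, sorted(doc_ids))
```

Because each partial index depends only on its own subset and n-gram order, the partial builds in this sketch could run independently before the merge step, which is the usual motivation for a divide-and-conquer layout.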
Security Markings
DOCUMENT & CONTEXTUAL SUMMARY
Distribution:
Approved For Public Release
Distribution Statement:
Approved For Public Release; Distribution Is Unlimited.
RECORD
Collection: TR