Lucene for n-grams using the ClueWeb Collection
ALASKA UNIV ANCHORAGE ARCTIC REGION SUPERCOMPUTING CENTER
The ARSC team modified the Apache Lucene engine to accommodate "go words" taken from the Google Gigaword vocabulary of n-grams. The Category B subset of the ClueWeb collection was indexed with a divide-and-conquer method, working across separate ClueWeb subsets for 1-, 2-, and 3-grams.
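As a rough illustration of the go-word idea, the sketch below keeps only those 1-, 2-, and 3-grams of a text that appear in an allow-list vocabulary, grouped by n-gram length. It is a minimal standalone example, not the ARSC Lucene modification: the function names, the whitespace tokenizer, and the tiny GO_WORDS set are all hypothetical.

```python
from typing import Dict, Iterator, List, Set


def ngrams(tokens: List[str], n: int) -> Iterator[str]:
    """Yield every contiguous n-gram of `tokens` as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def go_word_terms(text: str, go_words: Set[str], max_n: int = 3) -> Dict[int, List[str]]:
    """Return the n-grams of `text` that appear in `go_words`, grouped by n.

    Assumption: simple lowercasing and whitespace splitting stand in for
    whatever analysis the real indexer performs.
    """
    tokens = text.lower().split()
    return {
        n: [gram for gram in ngrams(tokens, n) if gram in go_words]
        for n in range(1, max_n + 1)
    }


# Hypothetical go-word vocabulary; a real one would be loaded from an n-gram list.
GO_WORDS = {"search", "search engine", "open source search"}

terms = go_word_terms("Open source search engine", GO_WORDS)
for n, grams in terms.items():
    print(n, grams)
```

Grouping the surviving terms by n-gram length means each group could be handed to its own index, which loosely mirrors working on the 1-, 2-, and 3-gram subsets separately.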
- Information Science
- Computer Programming and Software
- Test Facilities, Equipment and Methods