Accession Number:

ADA517732

Title:

Lucene for n-grams using the ClueWeb Collection

Descriptive Note:

Conference paper

Corporate Author:

ALASKA UNIV ANCHORAGE ARCTIC REGION SUPERCOMPUTING CENTER

Report Date:

2009-11-01

Pagination or Media Count:

8

Abstract:

The ARSC team modified the Apache Lucene engine to accommodate go words taken from the Google Gigaword vocabulary of n-grams. The Category B subset of the ClueWeb collection was indexed with a divide-and-conquer method, working across the separate ClueWeb subsets for 1-, 2-, and 3-grams.

Subject Categories:

  • Information Science
  • Computer Programming and Software
  • Test Facilities, Equipment and Methods
  • Linguistics

Distribution Statement:

APPROVED FOR PUBLIC RELEASE