Accession Number:



Data Mining of Extremely Large Ad Hoc Data Sets to Produce Inverted Indices

Personal Author(s):

Corporate Author:

Naval Postgraduate School Monterey United States

Report Date:



The purpose of this study is to leverage existing Internet-sized ad hoc data sets by creating an inverted index that enables a robust search capability. In particular, this study focuses on the Common Crawl web corpus. The work involves exploring the tools and techniques needed to traverse this data set effectively, as well as producing the tools to create an inverted index relationship between the terms and websites found within web archive files. The primary tools used in this process are Apache Hadoop, Apache MapReduce, Amazon Web Services, and Java. Additionally, this thesis investigates methods to enhance this relationship with other information of interest. Specifically, an index was developed that contains the added component of term relative location. This inverted index relationship is an essential component of, and the first step in, creating a robust search capability for a very large ad hoc data set.
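
The inverted index relationship described in the abstract can be illustrated with a minimal in-memory Java sketch. It is not the thesis's implementation: the class name InvertedIndexSketch, the build method, the tokenizer, and the sample pages are hypothetical, and "term relative location" is assumed here to mean the zero-based token offset of a term within a page. In a Hadoop MapReduce job, the same idea would typically be split so that a map step emits (term, page and position) pairs and a reduce step merges them into one posting list per term.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {

    // Builds term -> (page id -> token offsets of that term within the page).
    // The token offset stands in for "term relative location" (an assumption).
    static Map<String, Map<String, List<Integer>>> build(Map<String, String> pages) {
        Map<String, Map<String, List<Integer>>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            int position = 0;
            // Hypothetical tokenizer: lowercase, split on non-alphanumeric runs.
            String text = page.getValue().toLowerCase(Locale.ROOT);
            for (String token : text.split("[^a-z0-9]+")) {
                if (token.isEmpty()) {
                    continue; // split() yields an empty first token if text starts with a separator
                }
                index.computeIfAbsent(token, t -> new TreeMap<>())
                     .computeIfAbsent(page.getKey(), u -> new ArrayList<>())
                     .add(position);
                position++;
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up sample pages; real input would come from web archive files.
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("a.example", "Common Crawl web corpus");
        pages.put("b.example", "web archive, web index");

        // Print each term with the pages and offsets where it occurs.
        build(pages).forEach((term, postings) ->
                System.out.println(term + " -> " + postings));
    }
}
```

Looking up a term in the resulting map returns every page that contains it together with the offsets at which it occurs, which is the information a search layer would need for phrase or proximity queries.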

Descriptive Note:

Technical Report



Subject Categories:

Communities Of Interest:

Distribution Statement:

Approved For Public Release;

File Size: