Data Mining of Extremely Large Ad-Hoc Data Sets to Produce Reverse Web-Link Graphs
Technical Report, 28 Dec 2015, 31 Mar 2017
Naval Postgraduate School Monterey United States
Data mining can be a valuable tool, particularly in the acquisition of military intelligence. As the second study within a larger Naval Postgraduate School research project using Amazon Web Services (AWS), this thesis focuses on mining a very large data set (32 TB) drawn from Common Crawl, an open web-crawl data set. Similar to previous studies, this research employs MapReduce (MR) for sorting and categorizing output value pairs. Our research, however, is the first to implement the basic Reverse Web-Link Graph (RWLG) algorithm as a search capability for web sites, with validation that it works correctly. A second goal is to extend the RWLG algorithm to process a full Common Crawl archive as input in a single MR job. To mitigate the out-of-memory error, we relate several environment variables to the Yet Another Resource Negotiator (YARN) architecture and provide some sample error-tracking methods. As a further contribution, this study considers limitations associated with using AWS, which inform our recommendations for future work.
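The basic idea behind a reverse web-link graph can be illustrated with a minimal in-memory sketch. This is not the report's implementation: the function names (map_links, reduce_links, reverse_web_link_graph) and the input shape (a dictionary from a source page to its outgoing links) are assumptions made for illustration, and a plain dictionary stands in for the shuffle phase that an MR framework would perform across a cluster.

```python
from collections import defaultdict


def map_links(source, targets):
    """Map step: emit one (target, source) pair per outgoing link."""
    for target in targets:
        yield target, source


def reduce_links(target, sources):
    """Reduce step: collect the distinct pages that link to one target."""
    return target, sorted(set(sources))


def reverse_web_link_graph(pages):
    """Run map, shuffle, and reduce in memory over {source: [targets]}."""
    grouped = defaultdict(list)  # hypothetical stand-in for the MR shuffle
    for source, targets in pages.items():
        for target, src in map_links(source, targets):
            grouped[target].append(src)
    return dict(reduce_links(t, s) for t, s in grouped.items())


# Illustrative input only; these hosts are placeholders, not Common Crawl data.
pages = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "a.example"],
}
result = reverse_web_link_graph(pages)
```

Looking up a site in the result, for example result["c.example"], returns the pages that link to it, which is the sense in which the reversed graph can serve as a search capability.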
- Computer Systems