TREC Dynamic Domain: Polar Science
University of Southern California, Los Angeles, United States
This paper outlines the creation of the Polar dataset within the TREC Dynamic Domain (TREC-DD) track. The techniques used to create the Polar dataset fall into two basic categories: information extraction using Apache Tika and information retrieval using Apache Nutch. First, we expanded the parsing capabilities of Apache Tika, an open source framework for text and metadata extraction, to provide more searchable content within Polar data repositories. Second, we used Apache Nutch, a distributed search engine that runs on top of Apache Hadoop, to crawl three prominent Polar data repositories: the National Science Foundation Advanced Cooperative Arctic Data and Information System (ACADIS), the National Snow and Ice Data Center (NSIDC) Arctic Data Explorer (ADE), and the National Aeronautics and Space Administration Antarctic Master Directory (AMD). Because finding data is often a primary challenge in scientific discovery, the inclusion of the Polar dataset in TREC-DD helps advance science through data discovery and provides TREC-DD with a new challenge in the realm of search relevancy.