Accession Number:



Building an Entity-Centric Stream Filtering Test Collection for TREC 2012

Descriptive Note:

Conference paper

Corporate Author:


Report Date:


Pagination or Media Count:



The Knowledge Base Acceleration (KBA) track in TREC 2012 focused on a single task: filter a time-ordered corpus for documents that are highly relevant to a predefined list of entities. KBA differs from previous filtering evaluations in two primary ways: the stream corpus is 100x larger than previous filtering collections, and the use of entities as topics enables systems to incorporate structured knowledge bases (KB), such as Wikipedia, as external data sources. A successful KBA system must do more than resolve the meaning of entity mentions by linking documents to the KB; it must also distinguish centrally relevant documents that are worth citing in the entity's Wikipedia (WP) article. This combines thinking from natural language processing (NLP) and information retrieval (IR). Filtering tracks in TREC have typically used queries based on topics described by a set of keyword queries or short descriptions, and annotators have generated relevance judgments based on their personal interpretation of the topic. For TREC 2012, we selected a set of filter topics based on Wikipedia entities: 27 people and 2 organizations. Such named entities are more familiar in NLP than IR. We also constructed an entirely new stream corpus spanning 4,973 consecutive hours from October 2011 through April 2012. It contains over 400M documents, which we augmented with named entity classification tagging for the 40% of the documents identified as English. Each document has a timestamp that places it in the stream. The 29 target entities were mentioned infrequently enough in the corpus that NIST assessors could judge the relevance of most of the mentioning documents (91%). Judgments for documents from before January 2012 were provided to TREC teams as training data for filtering documents from the remaining hours. Run submissions were evaluated against the assessor-generated list of citation-worthy documents. We present peak F1 scores averaged across the entities for all run submissions. High scoring system
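
The evaluation measure named in the abstract, peak F1 averaged across entities, can be illustrated with a minimal Python sketch. The abstract does not give the scoring procedure, so everything here is an assumption: each run is taken to assign a confidence score to every document it retains for an entity, F1 is averaged across entities at each confidence cutoff, and the peak is the highest such average. The names `f1_at_cutoff` and `peak_macro_f1`, the 0 to 1000 confidence scale, and the toy data are hypothetical and are not taken from the track's scoring tools.

```python
def f1_at_cutoff(scored_docs, relevant, cutoff):
    """F1 for one entity when documents with confidence >= cutoff are retained."""
    retained = {doc for doc, conf in scored_docs if conf >= cutoff}
    true_positives = len(retained & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retained)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


def peak_macro_f1(run, judgments, cutoffs):
    """Return (cutoff, f1) with the highest entity-averaged F1 over the cutoffs.

    Assumption: average across entities first, then take the maximum over cutoffs.
    """
    best_cutoff, best_f1 = None, -1.0
    for cutoff in cutoffs:
        per_entity = [
            f1_at_cutoff(run.get(entity, []), relevant, cutoff)
            for entity, relevant in judgments.items()
        ]
        mean_f1 = sum(per_entity) / len(per_entity)
        if mean_f1 > best_f1:
            best_cutoff, best_f1 = cutoff, mean_f1
    return best_cutoff, best_f1


# Hypothetical toy data: (document id, confidence) pairs per entity.
run = {
    "entity_a": [("d1", 900), ("d2", 500), ("d3", 100)],
    "entity_b": [("d4", 800), ("d5", 200)],
}
# Hypothetical assessor judgments: citation-worthy documents per entity.
judgments = {"entity_a": {"d1", "d2"}, "entity_b": {"d5"}}

best_cutoff, best_f1 = peak_macro_f1(run, judgments, cutoffs=[0, 500, 1000])
```

Averaging before taking the maximum means a single cutoff is applied to all entities; taking each entity's own peak first and averaging afterwards would be a different, more lenient reading of the same phrase.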

Subject Categories:

  • Information Science

Distribution Statement: