TREC Chemical IR Track 2009: A Distributed Dimensional Indexing Model for Chemical Patent Search
WISCONSIN UNIV-MILWAUKEE DEPT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
For the TREC-2009 Chemical IR Track, we explore the development of a distributed information retrieval system based on a dimensional data model. The indexing model supports named entity identification and aggregation of term statistics at multiple levels of patent structure, including individual words, sentences, claims, descriptions, abstracts, and titles. The system was deployed across 15 Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances and 15 Elastic Block Store (EBS) database shards to support efficient indexing and query processing of the relatively large index generated by indexing every individual word, excluding stop words, in the 100 GB collection of chemical patent documents. The query processing algorithm for technology survey search and prior art search uses information extraction techniques and locally aggregated term statistics to help disambiguate candidate entities and terms in context. Query processing for prior art search automatically generates a structured query based on the relative distinctiveness of individual terms and candidate entity phrases from the query patent's claims, abstract, and title sections. For both the technology survey and prior art search, we evaluated several probabilistic retrieval functions for integrating statistics of retrieved named entities with term statistics at multiple levels of document structure to identify relevant patents.
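As a rough illustration of the prior art query generation step, the sketch below ranks terms from a query patent's sections by a distinctiveness score and keeps the top few as a weighted query. The abstract does not say how distinctiveness is measured, so the smoothed inverse document frequency used here is an assumption, and the function names, the stop word list, and the per-section weights are all hypothetical. It also handles single terms only, not candidate entity phrases.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "for", "is", "with", "or"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def distinctiveness(term, doc_freq, num_docs):
    """Smoothed inverse document frequency (an assumed stand-in measure)."""
    return math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))


def build_structured_query(sections, doc_freq, num_docs, section_weights, top_k=5):
    """Return the top_k (term, weight) pairs for a query patent.

    sections:        maps a section name (e.g. "title") to its text
    doc_freq:        maps a term to the number of documents containing it
    section_weights: assumed per-section boosts; unlisted sections get 1.0
    """
    scores = Counter()
    for name, text in sections.items():
        boost = section_weights.get(name, 1.0)
        for term, tf in Counter(tokenize(text)).items():
            scores[term] += boost * tf * distinctiveness(term, doc_freq, num_docs)
    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

Given collection-wide document frequencies, a term that is rare in the collection but repeated across the title and claims of the query patent would rank ahead of common patent vocabulary such as "process" or "comprising". In a sharded deployment the document frequencies would have to be aggregated across shards before scoring; that step is left out of the sketch.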
- Information Science