Query-Structure Based Web Page Indexing
OTTAWA UNIV (ONTARIO)
Pagination or Media Count:
Indexing is a crucial technique for dealing with the massive amount of data present on the web. In our third participation in the web track at TREC 2012, we explore the idea of building an efficient query-based indexing system over Web page collection. Our prototype explores the trends in user queries and consequently indexes texts using particular attributes available in the documents. This paper provides an in-depth description of our approach for indexing web documents efficiently that is, topics available in the web documents are discovered with the assistance of knowledge available in Wikipedia. The well-defined articles in Wikipedia are shown to be valuable as a training set when indexing Webpages. Our complex index structure also records information from titles and urls, and pays attention to web domains. Our approach is designed to close the gaps in our approaches from the previous two years, for some queries. Our framework is able to efficiently index the 50 million pages available in the subset B of the ClueWeb09 collection. Our preliminary experiments on the TREC 2012 testing queries showed that our indexing scheme is robust and efficient for both indexing and retrieving relevant web pages, for both the ad-hoc and diversity task.
- Information Science