IIT Kharagpur at TREC 2008 Blog Track
INDIAN INST OF TECH KHARAGPUR
Blogs are often informally written, poorly structured, and filled with spelling errors, grammatical errors, and nontraditional content. Linguistic analysis of blogs is plagued by two additional problems: (1) the presence of spam blogs and spam comments, and (2) extraneous non-content, including blog-rolls, link-rolls, advertisements, and sidebars. Our document retrieval system was built on the Apache Lucene search engine. Lucene was able to index the whole Blog06 dataset and could retrieve documents very quickly. To decrease the size of the index, it was necessary to remove much of the noise in the HTML. Many of the documents had malformed HTML, which was corrected using the HTML Tidy utility. We used the qrels of the TREC 2006 and 2007 Blog Tracks to train the sentence-level subjectivity and polarity classifiers. This paper describes the authors' opinion retrieval system for the TREC 2008 Blog Track. The system contains five modules. The first module extracts the blog content from junk HTML, thereby decreasing the noise in the indexed content. The second module removes various kinds of spam content from real blogs. The third module retrieves relevant documents. The fourth module filters out opinionated documents, and the fifth module calculates the polarity of the sentiments in the documents. The final ranked retrieval runs were based on various combinations of settings in each module, so as to study the effect of each. Subjectivity and polarity were predicted using a complementary naive Bayes classifier.
- Information Science
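
The abstract mentions sentence-level subjectivity classification with a complementary naive Bayes classifier but does not describe the features, smoothing, or toolkit used. The following is only a minimal sketch of the general technique (usually called complement naive Bayes), assuming bag-of-words features, whitespace tokenization, and add-one smoothing. The function names and the four toy sentences are hypothetical and are not taken from the paper or from the TREC qrels.

```python
import math
from collections import Counter

# Hypothetical toy training data; the paper trained on TREC 2006/2007 qrels.
TOY_SENTENCES = [
    ("i love this phone it is great", "subjective"),
    ("this movie was terrible and boring", "subjective"),
    ("the phone was released in march", "objective"),
    ("the movie runs for two hours", "objective"),
]


def tokenize(sentence):
    """Lowercase whitespace tokenizer (an assumption, not the paper's)."""
    return sentence.lower().split()


def train_complement_nb(labeled_sentences, alpha=1.0):
    """Return, per class c, smoothed log P(term | all classes except c)."""
    class_counts = {}
    for sentence, label in labeled_sentences:
        class_counts.setdefault(label, Counter()).update(tokenize(sentence))

    vocab = set()
    for counts in class_counts.values():
        vocab.update(counts)

    weights = {}
    for label in class_counts:
        # Pool term counts from every class other than `label`.
        complement = Counter()
        for other, counts in class_counts.items():
            if other != label:
                complement.update(counts)
        total = sum(complement.values()) + alpha * len(vocab)
        weights[label] = {
            term: math.log((complement[term] + alpha) / total)
            for term in vocab
        }
    return weights


def classify(weights, sentence):
    """Choose the class whose complement fits the sentence worst.

    Terms unseen in training are skipped; if every term is unseen,
    all scores tie at zero and the first class is returned.
    """
    tokens = tokenize(sentence)

    def complement_score(label):
        return sum(weights[label].get(tok, 0.0) for tok in tokens)

    return min(weights, key=complement_score)


if __name__ == "__main__":
    model = train_complement_nb(TOY_SENTENCES)
    print(classify(model, "it is great"))
    print(classify(model, "released in march"))
```

Estimating each class from the complement of its training data, rather than from the class itself, tends to be more stable when classes are unevenly represented. The same sketch could be trained a second time with positive and negative labels to mirror the polarity step.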