Adaptive Web-page Content Identification
MITRE CORP BEDFORD MA
Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web-pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of Web-pages anywhere from 80% to 97% of the time, depending on experimental factors such as ensuring the absence of duplicate documents and application of the model against unseen sources.
- Information Science
- Operations Research
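
The sequence-labeling framing described in the abstract can be illustrated with a minimal sketch. It assumes a page has already been split into an ordered list of text blocks, and it decodes the best label sequence for a linear-chain model with the Viterbi algorithm. The features (`long`, `linky`), the weights in `EMIT` and `TRANS`, and the sample `PAGE` are all hypothetical stand-ins chosen for illustration; they are not the features or learned parameters of the reported model, and no training step is shown.

```python
# Illustrative sketch only: hand-set weights stand in for learned CRF parameters.
LABELS = ["content", "other"]

def features(block):
    """Hypothetical per-block features: block length and link density."""
    n = len(block["text"].split())
    link_density = block["link_words"] / n if n else 1.0
    return {
        "long": 1.0 if n >= 10 else 0.0,
        "linky": 1.0 if link_density > 0.5 else 0.0,
    }

# Assumed emission weights: how strongly each feature votes for each label.
EMIT = {
    "content": {"long": 2.0, "linky": -2.0},
    "other": {"long": -1.0, "linky": 1.5},
}

# Assumed transition weights: neighboring blocks tend to share a label.
TRANS = {
    ("content", "content"): 1.0, ("content", "other"): -0.5,
    ("other", "content"): -0.5, ("other", "other"): 1.0,
}

def emit_score(label, feats):
    return sum(EMIT[label][name] * value for name, value in feats.items())

def viterbi(blocks):
    """Return the highest-scoring label sequence for an ordered list of blocks."""
    if not blocks:
        return []
    feats = [features(b) for b in blocks]
    score = [{y: emit_score(y, feats[0]) for y in LABELS}]
    back = []
    for f in feats[1:]:
        row, ptr = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: score[-1][p] + TRANS[(p, y)])
            row[y] = score[-1][prev] + TRANS[(prev, y)] + emit_score(y, f)
            ptr[y] = prev
        score.append(row)
        back.append(ptr)
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda y: score[-1][y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical page: navigation, three article blocks, footer.
PAGE = [
    {"text": "Home News Sports", "link_words": 3},
    {"text": "The city council voted on Tuesday to approve the new transit budget.", "link_words": 0},
    {"text": "He declined to comment.", "link_words": 0},
    {"text": "The plan funds three additional bus routes starting early next year, officials said.", "link_words": 0},
    {"text": "About Contact Privacy", "link_words": 3},
]

if __name__ == "__main__":
    for label, block in zip(viterbi(PAGE), PAGE):
        print(f"{label:8s} {block['text'][:40]}")
```

In this sketch the short middle block has no feature evidence of its own, so its label is decided by its neighbors through the transition weights. That dependence on context is what distinguishes a sequence model from classifying each block in isolation.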