Accession Number:

ADA470494

Title:

Adaptive Web-page Content Identification

Descriptive Note:

Technical paper

Corporate Author:

MITRE CORP BEDFORD MA

Personal Author(s):

Report Date:

2007-07-01

Pagination or Media Count:

9

Abstract:

Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web-pages changes. In this work we treat content identification as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of Web-pages 80% to 97% of the time, depending on experimental factors such as whether duplicate documents are removed and whether the model is applied to unseen sources.

Subject Categories:

  • Information Science
  • Operations Research
  • Cybernetics

Distribution Statement:

APPROVED FOR PUBLIC RELEASE