Accession Number:

ADA459586

Title:

Exploiting Secondary Sources for Unsupervised Record Linkage

Descriptive Note:

Conference paper

Corporate Author:

UNIVERSITY OF SOUTHERN CALIFORNIA LOS ANGELES INFORMATION SCIENCES INSTITUTE

Report Date:

2004-01-01

Pagination or Media Count:

7

Abstract:

XML, Web services, and the Semantic Web have opened the door for new and exciting information integration applications. Information sources on the web are controlled by different organizations or people, utilize different text formats, and have varying inconsistencies. Therefore, any system that integrates information from different data sources must identify common entities from these sources. Data from many online sources does not contain enough information to accurately link the records using state-of-the-art record linkage systems. There is an inherent need for learning in these systems, most of the time requiring a user in the loop, to accurately link records across data sets. In this paper we describe a novel approach to exploiting additional data sources to design an unsupervised record linkage method. Our evaluation using real-world data sets shows that the performance of unsupervised learning in a record linkage system is on par with traditional supervised learning methods.

Subject Categories:

  • Information Science
  • Cybernetics

Distribution Statement:

APPROVED FOR PUBLIC RELEASE