Accession Number:

ADA478204

Title:

On-Line New Event Detection, Clustering, and Tracking

Descriptive Note:

Doctoral thesis

Corporate Author:

MASSACHUSETTS UNIV AMHERST DEPT OF COMPUTER SCIENCE

Personal Author(s):

Report Date:

1999-09-01

Pagination or Media Count:

170

Abstract:

In this work, we discuss and evaluate solutions to text classification problems associated with the events that are reported in on-line sources of news. We present solutions to three related classification problems: new event detection, event clustering, and event tracking. The primary focus of this thesis is new event detection, where the goal is to identify news stories that have not previously been reported, in a stream of broadcast news comprising radio, television, and newswire. We present an algorithm for new event detection, and analyze the effects of incorporating domain properties into the classification algorithm. We explore a solution that models the temporal relationship between news stories, and investigate the use of proper noun phrase extraction to capture the who, what, when, and where contained in news. Our results for new event detection suggest that previous approaches to document clustering provide a good basis for an approach to new event detection, and that further improvements to classification accuracy are obtained when the domain properties of broadcast news are modeled. New event detection is related to the problem of event clustering, where the goal is to group stories that discuss the same event. We investigate on-line clustering as an approach to new event detection, and re-evaluate existing cluster comparison strategies previously used for document retrieval. Our results suggest that these strategies produce different groupings of events, and that the on-line single-link strategy extended with a model for domain properties is faster and more effective than other approaches. In this dissertation, we also explore several text representation issues in the context of event tracking, where a classifier for an event is formulated from one or more sample stories. The classifier is used to monitor the subsequent news stream for documents related to the event.

Subject Categories:

  • Information Science
  • Cybernetics

Distribution Statement:

APPROVED FOR PUBLIC RELEASE