Accession Number:



Leveraging Topic Models to Develop Metrics for Evaluating the Quality of Narrative Threads Extracted from News Stories

Descriptive Note:

Journal Article - Open Access

Corporate Author:

Informatics Laboratory, Lockheed Martin Advanced Technology Laboratories Kennesaw United States

Report Date:


Pagination or Media Count:



Analysts and software systems are increasingly tasked with making sense of a growing amount of data to help their organizations make decisions involving risk and uncertainty. A key enabler of this work is the ability to quickly discover structure in large amounts of text such as news stories and blogs. Recent work in this area has shown it is possible to automatically link documents from a corpus together to build a narrative structure, called a story chain, without the need for prior domain knowledge. This approach is an unsupervised method that discovers large numbers of story chains of variable quality. In this paper, we describe and evaluate methods to identify the most coherent and informative story chains. We explore two types of topic model based analytics. The first type is a measure of representativeness that captures how well a story chain represents the corpus from which it was generated. This is done by comparing the similarity of topics found over time in a story chain against those expressed in the corpus during the same time period. Our hypothesis is that story chains that have similar topic expression to the corpus will convey narratives that are central to the corpus. This type of analytic could help an analyst quickly focus on the key narratives in a large corpus of documents. The second type is a measure of quality of a story chain and is composed of topic consistency and topic persistence measures. Our hypothesis is that high quality chains would be composed of sequences of stories that have clearly defined primary topics that persist across significant portions of the story chain. We used these analytics to predict the clarity of story chains within one of four categories 1 very clear narrative, 2 somewhat clear narrative, 3 somewhat unclear narrative, 4 very unclear narrative, and found we were able to train a data model to label story chains with the same label as human coders 77 of the time.

Subject Categories:

  • Information Science

Distribution Statement: