Unsupervised Topic Discovery by Anomaly Detection

Active / Technical Report | Accession Number: ADA589398

Abstract:

With the vast amount of information and public comment available online, it is of increasing interest to understand what is being said and which topics are trending. Government agencies, for example, want to know which policies concern the public without having to read through thousands of comments manually. Topic detection automatically identifies the topics in documents based on their information content, and it enhances many natural language processing tasks, including text summarization and information retrieval. Unsupervised topic detection, however, has always been a difficult task. Methods such as Latent Dirichlet Allocation (LDA) convert documents from word space into weighted sums over topic space, but they do not perform any form of classification, nor do they address how the generated topics relate to actual human-level topics. In this thesis we attempt a novel approach to unsupervised topic detection and classification: performing LDA and then clustering. We propose variations on the popular K-Means clustering algorithm to optimize the choice of centroids, and we perform experiments using Facebook data and the New York Times (NYT) corpus. Although the results were poor for the Facebook data, our method performed acceptably on the NYT data. The new clustering algorithms also performed slightly but consistently better than the standard K-Means algorithm.
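
As a rough illustration of the "LDA, then clustering" pipeline described in the abstract, the sketch below clusters document-topic proportion vectors with K-Means. It is a minimal sketch under stated assumptions, not the thesis's implementation: the `doc_topics` values are made-up stand-ins for LDA output, and the farthest-point seeding in `farthest_point_init` is only one simple, hypothetical way to vary the choice of initial centroids. The abstract does not specify which variations the authors actually propose.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_point_init(points, k):
    """Seed k centroids: start from the first point, then repeatedly take
    the point farthest from every centroid chosen so far.
    (Illustrative assumption, not the thesis's centroid-selection method.)"""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(nxt)
    return [list(c) for c in centroids]

def kmeans(points, k, iters=100):
    """Standard Lloyd iterations; returns (labels, centroids)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

# Hypothetical document-topic proportions (each row sums to 1),
# standing in for the per-document topic weights that LDA would produce.
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.10, 0.85],
    [0.10, 0.05, 0.85],
]

labels, centroids = kmeans(doc_topics, k=3)
print(labels)  # → [0, 0, 1, 1, 2, 2]
```

Each cluster label can then be read as a candidate document class. Whether those classes line up with human-level topics is exactly the question the experiments on the Facebook and NYT data address.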

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:
Approved For Public Release
Distribution Statement:
Approved For Public Release; Distribution Is Unlimited.

RECORD

Collection: TR