Unsupervised Topic Discovery by Anomaly Detection

Cheng, Leon

Active / Technical Report | Accession Number: ADA589398

Abstract:

With the vast amount of information and public comment available online, it is of increasing interest to understand what is being said and which topics are trending. Government agencies, for example, want to know which policies concern the public without having to read through thousands of comments manually. Topic detection automatically identifies the topics in documents based on their information content, and it enhances many natural language processing tasks, including text summarization and information retrieval. Unsupervised topic detection, however, has always been a difficult task. Methods such as Latent Dirichlet Allocation (LDA) convert documents from word space into document space (weighted sums over topic space), but they do not perform any form of classification, nor do they address how the generated topics relate to actual human-level topics. In this thesis we attempt a novel approach to unsupervised topic detection and classification by performing LDA and then clustering. We propose variations of the popular K-Means clustering algorithm to optimize the choice of centroids, and we perform experiments using Facebook data and the New York Times (NYT) corpus. Although the results were poor for the Facebook data, our method performed acceptably with the NYT data. The new clustering algorithms also performed slightly but consistently better than the standard K-Means algorithm.
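
The clustering step described in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced by LDA to a vector of topic proportions. The `doc_topics` values and the `kmeans` helper are hypothetical, made up for illustration. The code is plain K-Means seeded with the first k points, not the centroid-selection variants proposed in the thesis, which this record does not describe.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain K-Means (Lloyd's algorithm).

    Assumption for illustration: the first k points seed the centroids.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each document joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical document-topic proportions, standing in for LDA output.
# Each row is one document; each column is one of three topics.
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.10, 0.85],
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],
    [0.10, 0.05, 0.85],
]

labels, centroids = kmeans(doc_topics, k=3)
print(labels)
```

Each cluster label can then be read as a candidate topic class for the documents assigned to it. Because K-Means is sensitive to its starting centroids, the seeding rule used here is the part a centroid-selection variant would replace.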

Author(s):

Cheng, Leon

Author Organization(s):

NAVAL POSTGRADUATE SCHOOL MONTEREY CA

Descriptive Note:

Master's thesis

Pagination:

67

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:

Approved For Public Release

Distribution Statement:

Approved For Public Release; Distribution Is Unlimited.

RECORD

Collection: TR

Identifying Numbers

Monitor Series:

NPS

Subject Terms

Joint Capability Areas:

JCA_5_Command and Control; JCA_1.2.7_Experimentation; JCA_5.5_Direct; JCA_5.3_Planning; JCA_5.5.1_Communicate Intent and Guidance; JCA_1.2.5_Lessons Learned; JCA_5.5.2_Task; JCA_1.2.1_Training; JCA_5.2.2_Develop Knowledge and Situational Awareness; JCA_5.2_Understand; JCA_1.2.3_Educating

Modernization Areas:

Autonomy; AI and Machine Learning

Communities of Interest:

Autonomy

Descriptor(s):

*ANOMALIES, *DETECTION, *INFORMATION RETRIEVAL, ALGORITHMS, AUTOMATIC, CLASSIFICATION, CLUSTERING, DOCUMENTS, DORMANCY, HUMANS, IDENTIFICATION, INFORMATION PROCESSING, NATURAL LANGUAGE, ONLINE SYSTEMS, POLICIES, THESES, WEIGHTING FUNCTIONS, WORDS(LANGUAGE)

Field(s)/Group(s):

Information Science

Report Date:

2013 Sep 01

Creation Date:

2013 Dec 18