Augmenting Latent Dirichlet Allocation and Rank Threshold Detection with Ontologies
AIR FORCE INST OF TECH WRIGHT-PATTERSON AFB OH DEPT OF GRADUATE COMPUTER SCIENCE
In an increasingly data-rich environment, actionable information must be extracted, filtered, and correlated from massive amounts of disparate, often free-text, sources. The usefulness of the retrieved information depends on how we accomplish these steps and present the most relevant information to the analyst. One method for extracting information from free text is Latent Dirichlet Allocation (LDA), a document categorization technique that classifies documents into cohesive topics. Although LDA accounts for some implicit relationships such as synonymy (same meaning), it often ignores other semantic relationships such as polysemy (different meanings), hyponymy (subordinate), meronymy (part of), and troponymy (manner). To compensate for this deficiency, we incorporate explicit word ontologies, such as WordNet, into the LDA algorithm to account for various semantic relationships. Experiments over the 20 Newsgroups, NIPS, OHSUMED, and IED document collections demonstrate that incorporating such knowledge improves the perplexity measure over LDA alone for the given parameters.
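The abstract reports results in terms of perplexity but does not define it. As a rough illustration, the sketch below computes the commonly used held-out perplexity of a topic model, exp(-(total log-likelihood) / (total token count)), where a lower value indicates a better fit. The function name `perplexity` and the inputs `docs`, `theta`, and `phi` are assumptions made for this example and are not taken from the report, whose exact evaluation procedure may differ.

```python
import math


def perplexity(docs, theta, phi):
    """Held-out perplexity of a topic model (illustrative sketch).

    docs  : list of documents, each a list of integer word ids
    theta : theta[d][k] = p(topic k | document d)   (assumed given)
    phi   : phi[k][w]   = p(word w | topic k)       (assumed given)
    """
    num_topics = len(phi)
    log_likelihood = 0.0
    num_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Marginal word probability: mix the topic-word
            # distributions using the document's topic proportions.
            p_w = sum(theta[d][k] * phi[k][w] for k in range(num_topics))
            log_likelihood += math.log(p_w)
            num_tokens += 1
    # Exponentiated negative average log-likelihood per token.
    return math.exp(-log_likelihood / num_tokens)
```

Under this definition, a model that spreads probability uniformly over a vocabulary of V words has a perplexity of V, so any model that has learned useful topic structure should score below the vocabulary size.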
- Information Science