Exploring Dimensionality Reduction for Text Mining
Trident Scholar Project rept. no. 362
NAVAL ACADEMY ANNAPOLIS MD
Text mining is the extraction of important information from a collection of textual data sources. For instance, text mining can be used to discover related concepts or to categorize previously unseen documents. In this age of information overload, text mining applications can potentially yield tremendous benefits to both individuals and organizations. However, the effectiveness of text mining is limited by the large volume of textual data, as well as by its complex and noisy characteristics. Both of these challenges can be addressed with dimensionality reduction (DR). DR is the process of transforming a large amount of data into a much smaller, less noisy representation that preserves important relationships from the original data. DR techniques have been shown to effectively simplify large geometric datasets, but have yet to be adequately evaluated for textual data. This project evaluated five DR techniques (Principal Components Analysis, Multidimensional Scaling, Isomap, Locally Linear Embedding, and Laplace-Beltrami Diffusion Maps) from two distinct perspectives. First, the impact of each DR technique on automatic document classification was measured on corpora of scientific abstracts and news articles. For each technique, the dataset was reduced, then a standard linear, quadratic, or nearest neighbor classifier was used to assign categories to a test set of documents based upon a labeled training set. Results showed that, for any fixed number of dimensions used by the classifier, performing any kind of DR almost always improved classification accuracy compared to using the non-reduced data. Among the DR techniques, Isomap and Multidimensional Scaling were best able to reduce the data and eliminate noise, yielding improved accuracy. This suggests that these textual data sets lie primarily on a linear manifold for which the more complex non-linear techniques do not have an advantage.
- Information Science
- Statistics and Probability
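
The reduce-then-classify procedure described in the abstract can be illustrated with a minimal sketch. It assumes Principal Components Analysis (one of the five techniques named) computed through a singular value decomposition, followed by a one-nearest-neighbor classifier. The toy document-term matrix, the function names, the 60/20 train/test split, and the choice of two dimensions are all hypothetical, chosen only for illustration; this is not the project's code or data.

```python
import numpy as np


def pca_fit(X, k):
    """Return the training mean and the top-k principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]


def pca_transform(X, mean, components):
    """Project the rows of X onto the given principal directions."""
    return (X - mean) @ components.T


def nearest_neighbor_predict(train_X, train_y, test_X):
    """Give each test row the label of its closest training row."""
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    return train_y[d2.argmin(axis=1)]


def make_toy_corpus(rng, n_per_class=40, vocab=50):
    """Hypothetical stand-in for a document-term matrix with two topics."""
    # Each topic favors a different half of the vocabulary.
    rates_a = np.r_[np.full(vocab // 2, 3.0), np.full(vocab - vocab // 2, 0.5)]
    rates_b = rates_a[::-1]
    X = np.vstack([
        rng.poisson(rates_a, size=(n_per_class, vocab)),
        rng.poisson(rates_b, size=(n_per_class, vocab)),
    ]).astype(float)
    y = np.repeat([0, 1], n_per_class)
    return X, y


def run_experiment(k=2, seed=0):
    """Reduce to k dimensions, classify, and return (train shape, accuracy)."""
    rng = np.random.default_rng(seed)
    X, y = make_toy_corpus(rng)
    order = rng.permutation(len(y))
    train, test = order[:60], order[60:]

    # Fit the reduction on the labeled training set only, then apply it to both.
    mean, comps = pca_fit(X[train], k)
    Z_train = pca_transform(X[train], mean, comps)
    Z_test = pca_transform(X[test], mean, comps)

    pred = nearest_neighbor_predict(Z_train, y[train], Z_test)
    return Z_train.shape, float((pred == y[test]).mean())


if __name__ == "__main__":
    print(run_experiment())
```

Fitting the projection on the training documents alone keeps the test set unseen until classification. Swapping in another reduction technique would only require replacing `pca_fit` and `pca_transform`, though methods such as Isomap need an additional out-of-sample step that this sketch does not cover.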