Accession Number : AD1027466


Title :   Applications and Benefits for Big Data Sets Using Tree Distances and The T-SNE Algorithm


Descriptive Note : Technical Report


Corporate Author : Naval Postgraduate School Monterey United States


Personal Author(s) : Lee,Suyoung


Full Text : https://apps.dtic.mil/dtic/tr/fulltext/u2/1027466.pdf


Report Date : 01 Mar 2016


Pagination or Media Count : 81


Abstract : Modern data sets often consist of unstructured data and mixed data; that is, they include both numerical and categorical variables. Often, these data sets will include noise, redundancy, missing values and outliers. Clustering is one of the most important and widely used data analytic methods. However, clustering requires the ability to measure distances or dissimilarities, which are not defined in an obvious way for mixed data. Practitioners often use the Gower dissimilarity for this task. In this work, we use tree distances computed with Buttrey's treeClust package in R, as discussed by Buttrey and Whitaker in 2015, to process mixed data while also handling missing values and outliers. Visualization is also an important method for big data. We use the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, introduced by van der Maaten and Hinton in 2008, which visualizes high-dimensional data by assigning each data point a location in a two- or three-dimensional map. We also use popular visualization techniques grouped under the name multidimensional scaling. We compare the results from using the tree distance and the t-SNE algorithm to the results from using the Gower dissimilarity and multidimensional scaling. Unlike established dimensionality reduction techniques, which generally map from high dimensions directly to two (or three) dimensions, we explore a new approach in which the dimensionality reduction takes place in several separate steps. Our experiments show that our new techniques can outperform the established techniques in producing visualizations of high-dimensional mixed data.
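
For readers unfamiliar with the Gower dissimilarity that the abstract names as the usual baseline for mixed data, the following is a minimal Python sketch of its standard unweighted form. It is an illustration only and is not taken from the report: the function names `column_ranges` and `gower`, the use of `None` for missing values, and the toy records are all assumptions. Numeric variables contribute their absolute difference divided by the column range, categorical variables contribute 0 on a match and 1 otherwise, and variables missing in either record are skipped.

```python
from typing import List, Optional, Sequence, Union

Value = Union[float, str, None]  # assumption: None marks a missing value


def column_ranges(rows: Sequence[Sequence[Value]],
                  is_numeric: Sequence[bool]) -> List[Optional[float]]:
    """Return max - min for each numeric column and None for categorical ones."""
    ranges: List[Optional[float]] = []
    for k, numeric in enumerate(is_numeric):
        if not numeric:
            ranges.append(None)
            continue
        observed = [row[k] for row in rows if row[k] is not None]
        ranges.append(max(observed) - min(observed) if observed else 0.0)
    return ranges


def gower(a: Sequence[Value], b: Sequence[Value],
          ranges: Sequence[Optional[float]]) -> float:
    """Unweighted Gower dissimilarity between two mixed-type records."""
    total, used = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # skip variables missing in either record
        if r is None:
            total += 0.0 if x == y else 1.0  # categorical: simple mismatch
        else:
            total += abs(x - y) / r if r > 0 else 0.0  # numeric: range-scaled
        used += 1
    if used == 0:
        raise ValueError("no variable is observed in both records")
    return total / used


# Toy records (hypothetical): numeric, categorical, numeric with one missing value.
rows = [(1.0, "a", 10.0), (3.0, "b", None), (5.0, "a", 20.0)]
ranges = column_ranges(rows, [True, False, True])
print(gower(rows[0], rows[1], ranges))
```

Because every variable contributes a value between 0 and 1 regardless of its type or scale, the result is always in [0, 1]. The tree distances used in the report are a different construction and are not reproduced here.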


Descriptors :   algorithms , data management , trees(data structures) , operations research , theses


Subject Categories : Theoretical Mathematics


Distribution Statement : APPROVED FOR PUBLIC RELEASE