Fast Algorithms on Imperfect, Heterogeneous, Distributed Data
Technical Report,01 Aug 2012,28 Feb 2018
Georgia Tech Research Corporation Atlanta United States
Pagination or Media Count:
This project studies the design and implementation of several variants of NMF for text, graph and hybrid data analytics. It addresses challenges including solving new data analytics problems and improving the scalability of existing NMF algorithms. There are two major types of matrix representation of data feature-data matrix and similarity matrix. Previous work showed successful application of standard NMF for feature-data matrix to areas such as text mining and image analysis, and Symmetric NMF SymNMF for similarity matrix to areas such as graph clustering and community detection. In this work, a divide-and-conquer strategy is applied to both methods to improve their time complexity from cubic growth with respect to the reduced low rank to linear growth, resulting in DC-NMF and HierSymNMF2 methods. Extensive experiments on large scale real world data show improved performance of these two methods. Furthermore, in this work NMF and SymNMF are combined into one formulation called Joint-NMF, to analyze hybrid data that contains both text content and connection structure information. They developed an open source software called SmallK smallk.github.io which offers several variants of NMF for fast clustering and topic modeling.
- Information Science