Fast Nonparametric Machine Learning Algorithms for High-Dimensional Massive Data and Applications
CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE
Nonparametric methods have become increasingly popular in the statistics and probabilistic AI communities. One well-known nonparametric method is nearest-neighbor. It uses the observations in the training set T closest in input space to a query q to form the prediction for q. Specifically, when k of the observations in T are considered, it is called k-nearest-neighbor, or k-NN. Despite its simplicity, k-NN and its variants have been successful in many machine learning problems, including pattern recognition, text categorization, information retrieval, computational statistics, databases, and data mining. It is also used for estimating sample distributions and Bayes error. However, k-NN and many related nonparametric methods remain hampered by their computational complexity. Many spatial methods, such as metric-trees, have been proposed to alleviate the computational cost, but the effectiveness of these methods decreases as the number of dimensions of the feature vectors increases. From another direction, researchers have been developing ways to find approximate answers. The premise of this research is that in many cases it is not necessary to insist on the exact answer; instead, an approximate answer should be sufficient. In fact, some approximate methods show good performance in a number of applications, and some enjoy very sound theoretical guarantees. However, when facing hundreds or thousands of dimensions, many of these algorithms do not work well in practice.
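To make the k-NN prediction rule described above concrete, the following is a minimal brute-force sketch. The abstract does not fix a distance metric or a prediction rule, so Euclidean distance and a majority vote over the k nearest labels are assumptions here; the quadratic scan over the training set is exactly the computational cost that metric-trees and approximate methods aim to reduce.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest
    neighbors in `train`, a list of (feature_vector, label) pairs.

    Assumptions (not specified in the report): Euclidean distance,
    majority-vote classification."""
    # Brute force: compute the distance from the query to every training
    # observation -- O(|T|) distance computations per query.
    dists = [(math.dist(x, query), label) for x, label in train]
    # Keep the k observations closest in input space to the query.
    dists.sort(key=lambda d: d[0])
    nearest_labels = [label for _, label in dists[:k]]
    # Majority vote among the k neighbor labels.
    return Counter(nearest_labels).most_common(1)[0][0]
```

For example, with training points clustered near the origin labeled 'a' and points near (5, 5) labeled 'b', a query at (0.5, 0.5) is predicted 'a'. Spatial indexes such as metric-trees replace the linear scan with a tree traversal that prunes distant regions, but that pruning is precisely what degrades in high dimensions.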