Accelerating Exact k-means Algorithms with Geometric Reasoning

Pelleg, Dan; Moore, Andrew

Active / Technical Report | Accession Number: ADA374582

Abstract:

We present new algorithms for the k-means clustering problem. They use the kd-tree data structure to reduce the large number of nearest-neighbor queries issued by the traditional algorithm. Sufficient statistics are stored in the nodes of the kd-tree. An analysis of the geometry of the current cluster centers then yields a large reduction in the work needed to update the centers. Our algorithms behave exactly like the traditional k-means algorithm, and proofs of correctness are included. The kd-tree can also be used to initialize the k-means starting centers efficiently. Our algorithms can be easily extended to provide fast ways of computing the error of a given cluster assignment, regardless of the method by which those clusters were obtained. We also show how to use them in a setting that allows approximate clustering results, with the benefit of running faster. We have implemented and tested our algorithms on both real and simulated data. Results show a speedup factor of up to 170 on real astrophysical data, and superiority over the naive algorithm on simulated data in up to 5 dimensions. Our algorithms scale well with respect to the number of points and the number of centers, allowing for clustering with tens of thousands of centers.
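
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the report's implementation: the names `Node`, `dominates`, `accumulate`, `kmeans_step`, and `leaf_size` are hypothetical, and the pruning rule shown (testing one corner of a node's bounding box) is an assumed, simple geometric check chosen for illustration. Each kd-tree node caches a point count and a vector sum as its sufficient statistics, so a node whose bounding box can belong to only one center is credited to that center without visiting its points.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class Node:
    """kd-tree node caching sufficient statistics (count, vector sum, bounding box)."""
    def __init__(self, points, leaf_size=8, depth=0):
        dim = len(points[0])
        self.count = len(points)
        self.vec_sum = [sum(p[d] for p in points) for d in range(dim)]
        self.lo = [min(p[d] for p in points) for d in range(dim)]
        self.hi = [max(p[d] for p in points) for d in range(dim)]
        self.points, self.children = [], []
        if len(points) > leaf_size:
            axis = depth % dim                      # cycle through split axes
            ordered = sorted(points, key=lambda p: p[axis])
            mid = len(ordered) // 2
            self.children = [Node(ordered[:mid], leaf_size, depth + 1),
                             Node(ordered[mid:], leaf_size, depth + 1)]
        else:
            self.points = list(points)              # only leaves keep raw points

def dominates(c1, c2, node):
    """True if every point in the node's box is strictly closer to c1 than to c2."""
    # d(c2,x)^2 - d(c1,x)^2 is linear in x, so its minimum over the box is at
    # the corner pushed furthest in the direction c2 - c1 (assumed check).
    corner = [hi if b > a else lo
              for a, b, lo, hi in zip(c1, c2, node.lo, node.hi)]
    return sq_dist(c1, corner) < sq_dist(c2, corner)

def accumulate(node, centers, candidates, sums, counts):
    """Add the node's points to per-center sums, pruning centers that cannot win."""
    def box_dist(c):
        clamped = [min(max(x, lo), hi) for x, lo, hi in zip(c, node.lo, node.hi)]
        return sq_dist(c, clamped)
    best = min(candidates, key=lambda j: box_dist(centers[j]))
    keep = [j for j in candidates
            if j == best or not dominates(centers[best], centers[j], node)]
    if len(keep) == 1:
        # The whole box belongs to one center: use the cached statistics.
        counts[best] += node.count
        for d, s in enumerate(node.vec_sum):
            sums[best][d] += s
    elif node.children:
        for child in node.children:
            accumulate(child, centers, keep, sums, counts)
    else:
        for p in node.points:                       # leaf: fall back to direct search
            j = min(keep, key=lambda j: sq_dist(p, centers[j]))
            counts[j] += 1
            for d, x in enumerate(p):
                sums[j][d] += x

def kmeans_step(root, centers):
    """One k-means center update; empty clusters keep their previous center."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    accumulate(root, centers, list(range(len(centers))), sums, counts)
    return [[s / n for s in vs] if n else list(c)
            for vs, n, c in zip(sums, counts, centers)]
```

In this sketch a center is discarded only when it is strictly farther than another center from every point of the box, so each point is assigned to the same center a brute-force pass would choose; the resulting means can differ from the brute-force ones only by floating-point summation order.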

Author(s):

Pelleg, Dan ; Moore, Andrew

Author Organization(s):

CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE

Pagination:

0022

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:

Approved For Public Release

RECORD

Collection: TR

Identifying Numbers

Report Number(s):

CMU-CS-00-105

Monitor Series:

XD

Subject Terms

Joint Capability Areas:

JCA_5_Command and Control; JCA_1.2.7_Experimentation; JCA_1_Force Support; JCA_1.3_Human Capital Management; JCA_5.4_Decide; JCA_1.2.5_Lessons Learned; JCA_1.3.1_Personnel and Family Support

Modernization Areas:

Quantum Science and Computing

Communities of Interest:

C4I

Descriptor(s):

*ALGORITHMS, CLUSTERING, APPLIED MATHEMATICS, SET THEORY, POINT THEOREM, POINTS(MATHEMATICS), ANALYTIC GEOMETRY

Field(s)/Group(s):

Numerical Mathematics

Keyword(s):

*COMPUTATIONAL GEOMETRY

Report Date:

1999 Jan 01

Creation Date:

2000 Mar 22