Tuning a Major Part of a Clustering Algorithm.
PRINCETON UNIV NJ DEPT OF STATISTICS
Pagination or Media Count:
Most proposals for clustering algorithms have been based on introspection Few proposed algorithms have had their performance studied. This approach involves a striving to avoid comparing distances on remote parts of the data because metrics deserve only minimum trust, and b using a stochastically defined test bed to measure, and where possible understand, the performance of an evolving algorithm, with the intent of using our understanding to modify it in such a way as to improve its performance. This test bed involves 3 circular Gaussian samples, of size 50 each, centered at the vertices of an equilateral triangle of side tsigma. Its use assumes that a 3-group answer is being sought. Thus the only concerned is with a part of the clustering process. Our early algorithms begin to misbehave in the range 5 or t or 7. Our successive steps of improvement work at smaller and smaller t. The last version we have tried still performs usefully median misclassification about 16 at t 2.7, where knowledge of three populations would only let us hold misclassification to a median of 13.3. Comparison with a Gaussian maximum likelihood algorithm on the same set of triple samples shows only slightly better performance than for our algorithm.
- Statistics and Probability