Axiomatic Analysis of Co-occurrence Similarity Functions
CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE
Pagination or Media Count:
Finding similar items based on co-occurrence data is an important data mining task with applications ranging from recommender systems to keyword based advertising. A number of co-occurrence similarity functions have been proposed based on graph-theoretic, geometric, and statistical abstractions. Despite the variety of existing algorithms, however, there exists no formal methodology for analyzing their properties and comparing their benefits and limitations. At the same time, the wide range of applications and domains where co-occurrence-based similarity functions are deployed limits the conclusiveness of experimental evaluations beyond the narrow task typically considered by each method. This paper proposes an axiomatic approach to analyzing co-occurrence similarity functions. The approach is based on formulating general, domain-independent constraints that well-behaved methods must satisfy to avoid producing degenerate results. Such constraints are derived based on the impact that continuous aggregation of the co-occurrence data is expected to have on absolute or relative similarity estimates. Proposed constraint-based analysis is applied to several representative, popular similarity functions and reveals that, surprisingly, none of them satisfy all constraints unconditionally. The analysis leads to the design of a theoretically well-justified similarity function called Random Walk with Sink RWS. RWS is parameterized to satisfy the constraints unconditionally, with the parameterization having interesting probabilistic and graph-theoretic interpretations.
- Numerical Mathematics