Clustering Systems with Kolmogorov Complexity and MapReduce
Abstract:
In the field of value management, an important problem is quantifying the processes and capabilities of an organization's network and the machines within it. When the organization is large, ever-changing, and responding to new demands, it is difficult to know at any given time exactly what is being run on the machines. Accordingly, one could lose track of approved or, worse, unapproved or even malicious software as the machines become employed for various tasks. Moreover, the level of utilization of the machines may affect the maintenance and upkeep of the network. Our goal is to develop a tool that clusters the machines on a network in a meaningful way, using different attributes or features, and that does so autonomously in an efficient and scalable system. The solution developed implements, at its core, a streaming algorithm that takes meaningful operating data from a network in real time, compresses it, and sends it to a MapReduce clustering algorithm. The clustering algorithm uses a normalized compression distance to measure the similarity of two machines. The goal for this project was to implement the solution and measure the overall effectiveness of the clusters. The implementation was successful in creating a software tool that can compress the data, determine the normalized compression distance, and cluster the machines. More work, however, needs to be done in using our system to extract more quantitative meaning from the clusters generated.
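As a rough illustration of the normalized compression distance mentioned above, the sketch below uses the commonly cited formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as the compressor, the function names, and the sample data are assumptions made for illustration only; the abstract does not say which compressor or data format the tool uses.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both inputs
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical operating data from three machines.
    machine_a = b"sshd cpu=1 mem=20\nnginx cpu=7 mem=110\n" * 100
    machine_b = b"sshd cpu=1 mem=21\nnginx cpu=8 mem=112\n" * 100
    rng = random.Random(0)
    machine_c = bytes(rng.getrandbits(8) for _ in range(4000))  # unrelated data

    print("a vs b:", ncd(machine_a, machine_b))
    print("a vs c:", ncd(machine_a, machine_c))
```

Values near 0 suggest that the two inputs share most of their structure, while values near 1 suggest they share little. A clustering step could take these pairwise distances as its input.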