Accession Number:

ADA456094

Title:

Tools for Large Graph Mining

Descriptive Note:

Doctoral thesis

Corporate Author:

CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER SCIENCE

Personal Author(s):

Report Date:

2005-06-01

Pagination or Media Count:

118.0

Abstract:

Graphs show up in a surprisingly diverse set of disciplines, ranging from computer networks to sociology, biology, ecology and many more. How do such normal graphs look like How can we spot abnormal subgraphs within them Which nodesedges are suspicious How does a virus spread over a graph Answering these questions is vital for outlier detectionsuch as terrorist cells, money laundering rings, forecasting, simulationshow well will a new protocol work on a realistic computer network, immunization campaigns and many other applications. We attempt to answer these questions in two parts, First, we answer questions targeted at applications what patternsproperties of a graph are important for solving specific problems Here, we investigate the propagation behavior of a computer virus over a network, and find a simple formula for the epidemic thresholdbeyond which any viral outbreak might become an epidemic. We find an information survival threshold which determines whether, in a sensor or P2P network with failing nodes and links, a piece of information will survive or not. We also develop a scalable, parameter-free method for finding groups of similar nodes in a graph, corresponding to homogeneous regionsor CrossAssociations in the binary adjacency matrix of the graph. This can help navigate the structure of the graph, and find un-obvious patterns. In the second part of our work, we investigate recurring patters in real-world graphs, to gain a deeper understanding of their structure. This leads to the development of the R-MAT model of graph generation for creating synthetic but realistic graphs, which match many of the patterns found in real-world graphs, including power-law and lognormal degree distributions, small diameter and community effects,

Subject Categories:

  • Statistics and Probability
  • Computer Systems

Distribution Statement:

APPROVED FOR PUBLIC RELEASE