Accession Number:

ADA458752

Title:

Model-Based Clustering and Data Transformations for Gene Expression Data

Descriptive Note:

Technical rept.

Corporate Author:

GEORGE WASHINGTON UNIV WASHINGTON DC DEPT OF STATISTICS

Report Date:

2001-04-30

Pagination or Media Count:

41

Abstract:

Clustering is a useful exploratory technique for the analysis of gene expression data, and many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. Model-based clustering assumes that the data are generated by a finite mixture of underlying probability distributions, such as multivariate normal distributions. This Gaussian mixture model has been shown to be a powerful tool for many applications. In addition, the issues of selecting a good clustering method and determining the correct number of clusters are reduced to model selection problems in the probability framework. We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach has superior performance on our synthetic data sets, consistently selecting the correct model and the right number of clusters.

Subject Categories:

  • Genetic Engineering and Molecular Biology
  • Operations Research

Distribution Statement:

APPROVED FOR PUBLIC RELEASE