Accession Number:

ADA526660

Title:

PIPA: A High-Throughput Pipeline for Protein Function Annotation

Descriptive Note:

Conference paper

Corporate Author:

BIOTECHNOLOGY HPC SOFTWARE APPLICATIONS INST FORT DETRICK MD

Report Date:

2008-07-01

Pagination or Media Count:

12

Abstract:

Traditional experimental methods for determining the functions of proteins encoded in genomic sequences cannot keep pace with the avalanche of sequence data produced by new high-throughput sequencing technologies. This has prompted the development of numerous bioinformatics approaches for automated protein function annotation. However, these approaches frequently use different function classification terminologies, which precludes the integration of multisource predictions. We developed the Pipeline for Protein Annotation (PIPA), a genome-wide protein function annotation pipeline that runs in a high-performance computing environment. PIPA integrates different tools and employs the Gene Ontology (GO) to provide consistent annotation and resolve prediction conflicts. PIPA has three modules that allow for easy development of specialized databases and integration of various bioinformatics tools. The first module, the pipeline execution module, consists of programs that give the user access to and control of the pipeline's parallel execution of multiple jobs, each searching a particular database for a chunk of the input data. The execution module wraps the second module, the core pipeline module. The main components of the core module are the integrated resources, the program for terminology conversion to GO, and the consensus annotation program. The third module, the preprocessing module, contains the program for customized generation of protein function databases and the GO-mapping generation program, which creates the GO mappings used by the terminology conversion program. The current implementation of PIPA annotates protein functions by combining the results of an in-house-developed database for enzyme catalytic function prediction (CatFam) with the results of multiple integrated resources.
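
The flow described in the abstract (split the input into chunks, convert each tool's terms to GO, then build a consensus) can be illustrated with a minimal sketch. This is not PIPA's code. The tool names, the term-to-GO table, and the function names chunk, to_go, and consensus are all hypothetical, and the simple "at least min_votes sources agree" rule stands in for PIPA's consensus annotation program, whose actual algorithm is not described in this record.

```python
from collections import defaultdict

# Hypothetical mapping from (tool, tool-specific term) to GO identifiers.
# Illustrative only; a real table would come from a GO-mapping generation step.
TERM_TO_GO = {
    ("toolA", "kinase"): {"GO:0016301"},
    ("toolB", "phosphotransferase"): {"GO:0016301"},
    ("toolC", "hydrolase"): {"GO:0016787"},
    ("toolA", "hydrolase"): {"GO:0016787"},
}


def chunk(items, size):
    """Split the input into fixed-size chunks, one per parallel job."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def to_go(source, term):
    """Convert a tool-specific term to GO identifiers (empty set if unmapped)."""
    return TERM_TO_GO.get((source, term), set())


def consensus(predictions, min_votes=2):
    """Keep, per protein, the GO terms supported by at least min_votes sources.

    predictions is an iterable of (protein_id, source, term) tuples.
    Assumed voting rule, not PIPA's actual conflict-resolution method.
    """
    support = defaultdict(lambda: defaultdict(set))
    for protein, source, term in predictions:
        terms = support[protein]  # register the protein even if nothing maps
        for go_id in to_go(source, term):
            terms[go_id].add(source)
    return {
        protein: sorted(g for g, srcs in terms.items() if len(srcs) >= min_votes)
        for protein, terms in support.items()
    }


if __name__ == "__main__":
    preds = [
        ("P1", "toolA", "kinase"),
        ("P1", "toolB", "phosphotransferase"),
        ("P1", "toolC", "hydrolase"),
        ("P2", "toolA", "hydrolase"),
    ]
    print(chunk(["P1", "P2", "P3"], 2))
    print(consensus(preds))
```

Because every prediction is converted to GO before voting, two tools that use different vocabularies for the same function can still reinforce each other, which is the integration problem the abstract identifies.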

Subject Categories:

  • Information Science
  • Genetic Engineering and Molecular Biology
  • Computer Programming and Software

Distribution Statement:

APPROVED FOR PUBLIC RELEASE