Accession Number:

AD1099688

Title:

RefSeq Database Growth Influences the Accuracy of k-mer-Based Lowest Common Ancestor Species Identification

Descriptive Note:

Journal Article - Open Access

Corporate Author:

University of Maryland, Center for Bioinformatics and Computational Biology, College Park, United States

Report Date:

2018-10-30

Pagination or Media Count:

10

Abstract:

In order to determine the role of the database in taxonomic sequence classification, we examine the influence of the database over time on k-mer-based lowest common ancestor taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.

Subject Categories:

  • Information Science
  • Biology

Distribution Statement:

APPROVED FOR PUBLIC RELEASE