Elsevier

Gene

Volume 416, Issues 1–2, 15 June 2008, Pages 44-47
Questionable 16S ribosomal RNA gene annotations are frequent in completed microbial genomes

https://doi.org/10.1016/j.gene.2008.02.023

Abstract

According to recent reports, many ribosomal RNA gene annotations are still questionable, and the use of inappropriate annotation tools has been blamed. However, we believe that the abundance of partial 16S rRNA sequences in the databases, mainly generated by culture-independent PCR methods, is another main cause of ambiguous 16S rRNA annotations. To examine the current status of 16S rRNA gene annotations in complete microbial genomes, we used as a criterion the conserved anti-SD sequence, which is located at the 3′ end of the 16S rRNA gene and is commonly missed by culture-independent PCR methods. In our large survey, 859 16S rRNA gene sequences from the complete microbial genomes of 252 different species were inspected, and 67 species (234 genes) were found to have ambiguous annotations. The common anti-SD sequence and other conserved 16S rRNA sequence features could be detected in the downstream intergenic regions of almost every questionable sequence, indicating that many of the 16S rRNA genes were annotated incorrectly. Furthermore, we found that more than 91.5% of the 93,716 16S rRNA sequences available in the main databases are partial sequences. We also performed a BLAST analysis for every questionable rRNA sequence, and most of the best hits were partial rRNA sequences. These results indicate that partial sequences are prevalent in the databases and that they have significantly affected the accuracy of microbial genome annotation. We suggest that 16S rRNA genes in newly completed microbial genomes be annotated in more detail, and that revision of questionable rRNA annotations commence as soon as possible.

Introduction

Genome annotation remains one of the most demanding steps in genome sequencing projects, and wrong, ambiguous, inconsistent and missing gene annotations are still frequently found in the various public databases (Bocs et al., 2002). Over-reliance on automated annotation software, the use of out-of-date or invalidated databases as references, and human error are all possible causes of annotation errors (reviewed by White, 2004, Jones et al., 2007). Even worse, questionable annotations have been used as references for subsequent annotation projects, which has propagated many more errors throughout the databases (Brenner, 1999, White, 2004). It is therefore necessary not only to annotate genomes carefully but also to revise these annotations regularly.

In genome annotation, the prediction of an intact 16S rRNA gene is strategically different from that of a protein-coding gene. This is because the prediction of the ends of a 16S rRNA gene, which lacks start and stop codon features, is usually performed using sequence similarity searches (Jones et al., 2007). Although this similarity search is quite convenient, due to the strong conservation of the gene sequence and the recent addition of numerous 16S rRNA sequences in the databases, the annotations of 16S rRNA genes are often found to be questionable (Lagesen et al., 2007).

Apart from the use of inappropriate tools for genome annotation (Lagesen et al., 2007), there is another important factor that has long been ignored: the negative impact of incomplete sequences in the databases. Culture-independent molecular methods with universal primers are commonly used to discover 16S rRNA genes. However, these methods cannot cover both ends of the gene, so new rRNA sequences obtained in this way are only partially sequenced. Moreover, many sequences were only partially sequenced for cost reasons, and because full-length sequences are not necessary in some studies, such as phylogenetic analyses. Hence a large number of partial sequences have been introduced into the databases, where they compete with complete rRNA sequences in sequence similarity searches. As a result, newly sequenced genes can be annotated ambiguously unless annotators guard against this.

In our large-scale survey of the anti-Shine–Dalgarno (anti-SD) sequence, we demonstrated that a significant number of 16S rRNA genes were annotated ambiguously in many completed microbial genomes. The anti-SD sequence, CCTCC, is one of the most significant and conserved features of 16S rRNA genes. Functionally, this sequence binds directly to the SD sequence in the 5′ untranslated regions of mRNAs at the translation initiation step, leading to precise translation (Shine and Dalgarno, 1974, Hui and de Boer, 1987, Jacob et al., 1987, Calogero et al., 1988). SD/anti-SD binding and the conservation of both sequences are critical not only to translation initiation but also to translation efficiency (reviewed by Kozak, 1983, Jacques and Dreyfus, 1990, McCarthy and Brimacombe, 1994, Schmitt et al., 1996). Because the anti-SD sequence is located at the 3′ end of the gene, it is likely to be excluded from 16S rDNAs obtained by PCR. In this large survey of 16S ribosomal RNA genes, we directly showed that the complete microbial genomes of at least 67 out of 252 different species were annotated inaccurately. We also found that partial 16S rRNA sequences are one of the main causes of incorrect 16S rRNA gene annotations. This result strongly suggests that 16S rRNA gene annotations in completed genomes should be used with much more caution, in order to reduce the number of errors that have accumulated in the databases.

Collection of genomic sequence data

In this study, we conducted a large-scale survey of the anti-SD sequence, CCTCC, in the 16S ribosomal RNA genes of complete microbial genomes from 252 different species. To avoid bias towards any particular species, only one representative genome was selected for each microbial species. A total of 859 16S rRNA gene sequences from the National Center for Biotechnology Information (NCBI) GenBank database were examined (www.ncbi.nlm.nih.gov). A detailed list of these

Numerous 16S ribosomal RNA genes in complete microbial genomes are annotated ambiguously.

To examine the current status of 16S rRNA gene annotations in the public databases, we downloaded the complete microbial genomes of 252 different species from the NCBI website, and individually inspected 859 16S rRNA gene sequences (as described above in the Materials and methods section).
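The inspection described above can be illustrated with a minimal Python sketch. It is not the authors' procedure, which this excerpt does not give in detail; the function names, the 20-base tail window, the 200-base downstream limit, and the toy sequences are all assumptions made for illustration. The sketch checks whether an annotated 16S rRNA gene ends with the anti-SD motif CCTCC and, if not, looks for the motif in the downstream flanking region.

```python
ANTI_SD = "CCTCC"  # anti-SD motif as written on the DNA sense strand


def _normalize(seq):
    """Upper-case a sequence and write RNA 'U' as DNA 'T'."""
    return seq.upper().replace("U", "T")


def has_anti_sd(gene_seq, window=20):
    """True if the motif lies within the last `window` bases of the annotated gene.

    The window size is an illustrative assumption, not a value from the paper.
    """
    return ANTI_SD in _normalize(gene_seq)[-window:]


def extension_to_anti_sd(gene_seq, downstream, window=20, max_extension=200):
    """Bases by which the annotated 3' end must be extended to include the motif.

    Returns 0 if the annotated gene already contains the motif near its 3' end,
    None if the motif is not found within `max_extension` downstream bases.
    """
    if has_anti_sd(gene_seq, window):
        return 0
    tail = _normalize(gene_seq)[-window:]
    flank = _normalize(downstream)[:max_extension]
    # Search tail + flank together so a motif straddling the boundary is found.
    pos = (tail + flank).find(ANTI_SD)
    if pos == -1:
        return None
    return pos + len(ANTI_SD) - len(tail)


# Toy sequences (hypothetical, not taken from any genome in the survey).
complete = "ACGT" * 10 + "GGATCACCTCCTTA"
truncated = "ACGT" * 10 + "GGATCA"
flank = "CCTCCTTAAGGT"

print(has_anti_sd(complete))                   # annotated end includes the motif
print(extension_to_anti_sd(truncated, flank))  # motif found just downstream
```

A gene for which `extension_to_anti_sd` returns a positive number would be a candidate for a truncated 3′ annotation. A real survey would also need to handle genes on the reverse strand and confirm other conserved 16S rRNA features, which this sketch does not attempt.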

We detected questionable annotations in 67 species (234 genes) among the 252 microbial genomes. (This result is shown in Table 1 and more details are available in Suppl. Table 1 in the supplementary material.)
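To put these counts in proportion, the short calculation below expresses them as fractions of the totals reported in this article. The variable names are illustrative only; the four numbers are the ones given in the text.

```python
# Counts reported in the text.
species_total, species_flagged = 252, 67
genes_total, genes_flagged = 859, 234

species_fraction = species_flagged / species_total
gene_fraction = genes_flagged / genes_total

print(f"Species with questionable annotations: {species_fraction:.1%}")
print(f"Genes with questionable annotations:   {gene_fraction:.1%}")
```

In both cases the fraction is a little over one quarter, so the questionable annotations are not confined to a handful of genomes.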

Conclusion

Computational biology has become more and more powerful, and it has made an enormous impact on modern biology and greatly increased the size of biological data. However, there are increasing amounts of ambiguous data or compounded errors which affect the accuracy of the databases. This study provides proof that many of the partial 16S rRNA coding sequences have been annotated as genomic non-coding regions. Many of these were annotated long ago (Suppl. Table 2) and there has been little effort

Acknowledgements

This project was financially supported by the Research Center of Biodiversity and the Institute of Plant and Microbial Biology, Academia Sinica, Taiwan. We would also like to thank Yi-Ting Yu for data collection and Hsiao-Chi Chen for proofreading and editing.

1

These authors contributed equally to this work.