ABSTRACT
Massive parallel sequencing of RNA transcripts by the next generation technology (RNA-Seq) is a powerful method of generating critically important data for discovery of structure and function of eukaryotic genes. The transcripts may or may not carry protein-coding regions. If protein coding region is present, it should be a continuous (spliced) open reading frame. Gene finding in transcripts can be done by statistical (alignment-free) as well as by alignment based methods. We describe a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions, complete or incomplete, in RNA transcripts assembled from RNA-Seq reads. Important feature of GeneMarkS-T is unsupervised estimation of parameters of the algorithm that makes unnecessary several conventional steps used in the gene prediction protocols, most importantly the manually curated preparation of training sets. We demonstrate that i/the GeneMarkS-T self-training is robust with respect to the presence of errors in assembled transcripts and ii/accuracy of GeneMarkS-T in identification of protein-coding regions and, particularly, in prediction of gene starts compares favorably to other existing methods.
Index Terms
- Identification of protein coding regions in RNA transcripts
Recommendations
Computational discovery of human coding and non-coding transcripts with conserved splice sites
Motivation: Long non-coding RNAs (lncRNAs) resemble protein-coding mRNAs but do not encode proteins. Most lncRNAs are under lower sequence constraints than protein-coding genes and lack conserved secondary structures, making it hard to predict them ...
Computational identification of evolutionarily conserved exons
RECOMB '04: Proceedings of the eighth annual international conference on Research in computational molecular biologyPhylogenetic hidden Markov models (phylo-HMMs) have recently been proposed as a means for addressing a multi-species version of the ab initio gene prediction problem. These models allow sequence divergence, a phylogeny, patterns of substitution, and ...
A new approach for gene prediction using comparative sequence analysis
SAC '05: Proceedings of the 2005 ACM symposium on Applied computingThe availability of large fragments of genomic DNA makes it possible to apply comparative genomics for identification of protein-coding regions. In this work, a comparative analysis is conducted on homologous genomic sequences of organisms with ...
Comments