ABSTRACT
In biological sequences, tandem repeats consist of tens to hundreds of residues of a repeated pattern, such as atgatgatgatgatg ('atg' repeated), and are often the result of replication slippage. Over time, these repeats decay, so that the original sharp pattern of repetition is partly obscured. Even degenerate repeats pose a problem for sequence annotation: when two sequences contain similar repetitive patterns, the result can be a false signal of sequence homology. We describe an implementation of a new hidden Markov model for detecting tandem repeats that shows substantially improved sensitivity in labeling decayed repetitive regions, maintains low and reliable false annotation rates across a wide range of sequence compositions, and produces scores that follow a stable distribution. On typical genomic sequence, the time and memory requirements of the resulting tool (ULTRA) are competitive with those of the most heavily used tool for repeat masking (TRF). ULTRA is released under an open source license and lays the groundwork for inclusion of the model in sequence alignment tools and annotation pipelines.
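To make the notion of a tandem repeat concrete, the following is a minimal Python sketch of a naive detector for exact repeats. It is not ULTRA's hidden Markov model and not TRF's algorithm; the function name `find_exact_tandem_repeats` and the parameters `max_period` and `min_copies` are hypothetical, chosen only for illustration. It finds maximal stretches in which every residue equals the residue one period earlier.

```python
def find_exact_tandem_repeats(seq, max_period=6, min_copies=3):
    """Naive exact tandem repeat scan (illustrative only, not ULTRA's model).

    Returns a list of (start, end, unit) tuples, with `end` exclusive.
    A region is reported when it has the given period and spans at least
    `min_copies` full copies of the unit.
    """
    seq = seq.lower()
    hits = []
    for period in range(1, max_period + 1):
        run_start = None  # first index j of the current run where seq[j] == seq[j - period]
        # Iterate one position past the end so a run reaching the end is closed.
        for j in range(period, len(seq) + 1):
            matches = j < len(seq) and seq[j] == seq[j - period]
            if matches and run_start is None:
                run_start = j
            elif not matches and run_start is not None:
                start = run_start - period  # the periodic region includes the first unit
                if j - start >= min_copies * period:
                    hits.append((start, j, seq[start:start + period]))
                run_start = None
    return hits


if __name__ == "__main__":
    print(find_exact_tandem_repeats("atgatgatgatgatg"))   # the example from the abstract
    print(find_exact_tandem_repeats("atgatgaggatgatg"))   # same repeat with one substitution
```

With the default parameters, the first call reports the whole 15-residue sequence as a repeat of 'atg', while the second call, in which a single substitution has broken the pattern, reports nothing. That gap is the motivation for a probabilistic model that tolerates decay. Note also that for long runs this sketch may report multiples of the true period (for example 'atgatg') as separate hits.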
Index Terms
- ULTRA: A Model Based Tool to Detect Tandem Repeats