ABSTRACT
This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs are composed of p ⪈ 2 parts separated by constrained spacers These algorithms use a suffix tree for fulfilling this task. They are efficient enough to be able to extract site consensus, such as promoter sequences, from a whole collection of non coding sequences extracted from a genome. In particular, their time complexity scales linearly with N2n where n is the average length of the sequences and N their number. An application with interesting results to the identification of promoter consensus sequences in bacterial genomes is shown.
- 1.O. O. Berg and P. H. yon Hippel. Seleetton of DNA binding sites by regulatory proteins. If. The binding specificity of cyclic AMP receptor protein to recognition sites. J. Mol. B~ol., 200:709-793, 1988.Google ScholarCross Ref
- 2.P. Bieganski, J. Riedl, J. V. Carl~, and E M. P~zel. Generalized suffix trees for biological sequence data: applications and implementations. In Proc. of the .27th Hau~a~ Int. Oonf. on Systems Sc~., pages 35-44. iEEE Comp,ter Society Press, 1994.Google Scholar
- 3.A. Brazrna, I. Jona~en, J. Vdo, and E Ukkonen. Predicting gene regulatory elements ;n sdzao on a genomic scale. Gcno,ne Research, 8:1202-1215, 1998.Google Scholar
- 4.L R. Cardon and G. D. Stormo. Expectation Maximizalion algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragqnents. J. Mol. Bwl., 223 139-170, 1992.Google ScholarCross Ref
- 5.B. Combrugghe, S. B,mby, and H. Buc. Cyclic AMP receptor protein: role in transcription activation Sczence, 224:831- 838, 1984.Google Scholar
- 6.Y. M Fraenkel, Y. Mandel, D Friedherg, and H Margalit. Identification of common motifs in ,nahgned DNA sequences: application to ~schcr~ch~a eoh trp regulon. Oomput. Appl B;osc;., 11.379-387, 1995.Google Scholar
- 7.D. J. Galas, M. Eggert, and M S Waterman. Rigorous pattern-recognition methods for DNA sequences. Analysts of promoter sequenee~q from Escher~ch~a colt J. Mot. B:ol., 18l~:117-128, 1985.Google Scholar
- 8.C. A. Gross, M. Lon~to, and R. Losiek. Bacterial sigma factors. In S. L. Knight and K. R. Yamamoto, editors, 7Yanscr~phonal Rcgulatlon, volume 1, pages 129-176. Cold Spring Harbor Laboratory Press, 1992.Google Scholar
- 9.D. Ousfield. Algorithms o~ Strings, 7Yees, and Sequences: Computer Smence and Computational B~oloBy. Cambridge University Press, 1997. Google ScholarDigital Library
- 10.J. D. Helmann. Compilation and analysis of Baedlus subtdts c~-dependent promoter sequences: evidence for extended contact between RNA polymerase and upstream promoter DNA. Nucleic Acids Res., ~a:2aal-2aao, lgos.Google Scholar
- 11.A. K!ingenhoff, K. Frech, K. Qimndt, and T. Werner. Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity. Bsotnformatscs 1, 15:180-186, 1999.Google Scholar
- 12.C. E. Lawrence and A. A. Reilly. An expectation maximization (F.M) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: struct., lunar., and genetics, 7:41-51, 1990.Google Scholar
- 13.B. Lewin. Genes VI. Oxford University Press, 1997.Google Scholar
- 14.E. M. McOreight. A space-economical suffix tree construction algorithm. J. ACM, 23:262-272, 1976. Google ScholarDigital Library
- 15.M. A. Mulder, H. Zappe, and L. M. 8teyn. Mycobacterial promoters. Tuber. L~ng Dis., 78:211-223, 1997.Google ScholarCross Ref
- 16.O. N. Ozoline, A. A. Deer, and M. V. Arkhipova. Noncanonical sequence elements in the promoter structure, cluster analysis of promoters recognized by Escher, ch,a cot: RNA polymerase. Pluelelc Acids Res., 25:4703-4709, 1998.Google ScholarCross Ref
- 17.W H. Press, S. A. Teukolsky, W. T Vetterling, and B. P. F!annery. Numerical Recipes in 0 : The Art of Sctent:fic Computing. Cambridge Univ. Press, 1993. Google ScholarDigital Library
- 18.M.T. Record, W. S. Reznikoff, M. L. Craig, K. L McQuade, and P. J. Schlax. Esc~erichia coli RNA polymerase a?~ promotets, and the kinetics of the steps of transcription initiation. ID F. C. Neidhardt, editor, Escher~ch~a colt and Salmonella, volume 1, pages 792-820. ASM Press, 1996.Google Scholar
- 19.M.-F Sagot. Spelling approximate repeated or common motifs using a suffix tree. In C. L. Lucchesi and A. V. Mourn, editors, LATIIV'98: TheoretscM In{ormat,cs, Lecture Notes in Computer Science, pages 111-127. Springer-Verlag, 1998. Google ScholarDigital Library
- 20.T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht. Information content of binding sites on nucleotide sequences. Y. Mot. Blot., 188:415-431, 1986.Google ScholarCross Ref
- 21.G. D. Stormo and O. W. Hartzell Ill. identifying proteinbinding rotes from unaligned DNA fragments. Proc. Natl. Acad. Scs. USA, 86:1183-1187, 1989.Google ScholarCross Ref
- 22.M. Tompa. An exact method for finding short motifs in sequences, with application go the ribosome binding site problem. In Seventh Interna~,onal Sympossum on Intelhgent Systems for Molecular Bsology, pages 262-271, Heidelberg, Germany, 1999. AAAI Press. Google Scholar
- 23.E. Ukkonen On-line construction of snmx-trees. Algorithm, ca, 14:249-260, 1995.Google Scholar
- 24.J. van Helden, B. Andre, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational a~alysis of oligonueleotide frequencies. J. Mol. B,ot., 281:827-842, 1998.Google Scholar
- 25.A. Vanet, L. Marsan, A. Labigne, and M.-F. Sagot. Inferring regulatory elements from a whole genome. An analysis of the o~~ family of promoter signals. }999. submitted to J. Mol. B~ol.Google Scholar
- 26.A. Vane{, L. Marsan, and M.-F. Sagot. Promoter sequences and algorithlnical methods for identifying them. Research m Macrab~otogy, 150:1-21, 1999. in press.Google Scholar
- 27.T. Werner. Models for prediction and recognition of eukaryotic promoters. Mature. Oenome, 10:168-175, 1999Google Scholar
- 28.F. Wolfertstetter, K. Frech, G. Hcrrmann, and T. Werner. Identification of functional elements in unaligned nucleic acid sequence~ by a novel tuple search algorithms. Oomput. Appl. B,osc:., 12:71-80, 1996.Google Scholar
- Extracting structured motifs using a suffix tree—algorithms and application to promoter consensus identification
Recommendations
Identification of specific sequence motifs in the upstream region of 242 human miRNA genes
We have identified novel over-represented and conserved motifs in the upstream regions of human and mouse miRNA stem-loop sequences by means of a new bioinformatic processing regimen. We observed sequence conservation -500bp upstream in 189 human and ...
An Efficient Algorithm for the Identification of Structured Motifs in DNA Promoter Sequences
We propose a new algorithm for identifying cis-regulatory modules in genomic sequences. The proposed algorithm, named RISO, uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and ...
Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization
Special issue on applications in molecular biologyThe MEME algorithm extends the expectation maximization (EM) algorithm for identifying motifs in unaligned biopolymer sequences. The aim of MEME is to discover new motifs in a set of biopolymer sequences where little or nothing is known in advance about ...
Comments