Extracting structured motifs using a suffix tree—algorithms and application to promoter consensus identification

Authors:
Laurent Marsan

Institut Gaspard Monge, Université de Marne la Vallée, 2, rue de la Butte Verte, 93160 - Noisy le Grand

Marie-France Sagot

Institut Gaspard Monge, Université de Marne la Vallée, 2, rue de la Butte Verte, 93160 - Noisy le Grand and Institut Pasteur, Service d'Informatique Scientifique, 28, rue du Dr. Roux, 75324 - Paris Cedex 15


RECOMB '00: Proceedings of the fourth annual international conference on Computational molecular biology, April 2000, Pages 210–219, https://doi.org/10.1145/332306.332553

Published: 8 April 2000


ABSTRACT

This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs are composed of p ≥ 2 parts separated by constrained spacers. Both algorithms use a suffix tree to accomplish this task. They are efficient enough to extract site consensus sequences, such as promoter sequences, from a whole collection of non-coding sequences extracted from a genome. In particular, their time complexity scales linearly with N²n, where n is the average length of the sequences and N is their number. An application to the identification of promoter consensus sequences in bacterial genomes is shown, with interesting results.
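
To make the notion of a structured motif concrete, the following minimal Python sketch checks where p ≥ 2 parts ("boxes") occur in order in a sequence, with the gap between consecutive boxes constrained to a given range. It is a brute-force illustration only and is not the suffix-tree algorithm of the paper: it matches boxes exactly, whereas a conserved motif would normally tolerate mismatches, and it verifies a given motif rather than extracting unknown ones. The function names, the `(min, max)` spacer encoding, and the example boxes are all assumptions made for illustration.

```python
from typing import List, Tuple


def find_structured_motif(
    seq: str, boxes: List[str], spacers: List[Tuple[int, int]]
) -> List[int]:
    """Return start positions in seq where all boxes occur in order.

    spacers[i] is the inclusive (min, max) gap allowed between the end of
    boxes[i] and the start of boxes[i + 1]. Matching is exact (assumption).
    """
    if len(boxes) < 2 or len(spacers) != len(boxes) - 1:
        raise ValueError("need p >= 2 boxes and p - 1 spacer ranges")

    def match_from(box_idx: int, pos: int) -> bool:
        box = boxes[box_idx]
        # A slice running past the end of seq is shorter than box, so no match.
        if seq[pos:pos + len(box)] != box:
            return False
        if box_idx == len(boxes) - 1:
            return True
        end = pos + len(box)
        d_min, d_max = spacers[box_idx]
        # Try every allowed spacer length before the next box.
        return any(
            match_from(box_idx + 1, end + d) for d in range(d_min, d_max + 1)
        )

    return [i for i in range(len(seq)) if match_from(0, i)]


def count_supporting_sequences(
    seqs: List[str], boxes: List[str], spacers: List[Tuple[int, int]]
) -> int:
    """Count how many sequences contain at least one occurrence of the motif."""
    return sum(1 for s in seqs if find_structured_motif(s, boxes, spacers))


if __name__ == "__main__":
    # Illustrative input: two boxes separated by a 17-base spacer.
    example = "AA" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
    print(find_structured_motif(example, ["TTGACA", "TATAAT"], [(16, 18)]))
```

The example boxes and the 16 to 18 base spacer range are chosen to resemble a two-part bacterial promoter and are not taken from the paper's results. Scanning every position of every sequence this way costs time proportional to the total input length for a single candidate motif; the point of the suffix-tree approach described in the abstract is to enumerate all candidate motifs efficiently rather than test them one by one.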




Published in
RECOMB '00: Proceedings of the fourth annual international conference on Computational molecular biology
April 2000
329 pages
ISBN: 1581131860
DOI: 10.1145/332306
Editors:
Ron Shamir, Tel-Aviv Univ., Israel
Satoru Miyano, Univ. of Tokyo, Tokyo, Japan
Sorin Istrail, Sandia National Labs
Pavel Pevzner, Univ. of Southern California
Michael Waterman, Univ. of Southern California
Copyright © 2000 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
consensus
model
motif extraction
promoter
structured motif
suffix tree
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 148 of 538 submissions, 28%