Simultaneously Learning DNA Motif along with Its Position and Sequence Rank Preferences through EM Algorithm

Zhang, ZhiZhuo; Chang, Cheng Wei; Hugo, Willy; Cheung, Edwin; Sung, Wing-Kin

doi:10.1007/978-3-642-29627-7_37

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 7262)

Included in the following conference series:

Annual International Conference on Research in Computational Molecular Biology

Abstract

Although de novo motifs can be discovered by mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. One way to improve accuracy is to consider additional binding features (i.e., position preference and sequence rank preference), but this information usually has to be supplied by the user. This paper presents a de novo motif discovery algorithm called SEME, which uses a pure probabilistic mixture model to model the motif’s binding features and uses expectation maximization (EM) algorithms to simultaneously learn the sequence motif and its position and sequence rank preferences, without requiring any prior knowledge from the user. SEME is both efficient and accurate thanks to two important techniques: variable motif length extension and importance sampling. Using 75 large-scale synthetic datasets, 32 metazoan compendium benchmark datasets and 164 ChIP-Seq libraries, we demonstrated the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to the more difficult problem of finding co-regulated TF (co-TF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct co-TF motifs, and its predicted co-TF motifs matched the known motifs more closely. Finally, we show that the learned position and sequence rank preferences of each co-TF reveal potential interaction mechanisms between the primary TF and the co-TF within these sites. Some of these findings were further validated by ChIP-Seq experiments on the co-TFs.
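
To make the idea of jointly learning a motif and its position preference with EM more concrete, the following is a minimal toy sketch and not the SEME algorithm itself. It assumes equal-length sequences, exactly one motif site per sequence, a uniform background, and a seed k-mer for initialization; it omits sequence rank preference, variable motif length extension and importance sampling. All function and parameter names (`em_motif_with_position`, `init_pwm`, `pseudo`, and so on) are hypothetical.

```python
ALPHABET = "ACGT"
BACKGROUND = 0.25  # assumed uniform background probability per base


def init_pwm(seed, hit=0.7):
    """Build a starting position weight matrix (PWM) from a seed k-mer."""
    miss = (1.0 - hit) / 3.0
    return [[hit if base == c else miss for base in ALPHABET] for c in seed]


def em_motif_with_position(seqs, seed, n_iter=20, pseudo=0.1):
    """Toy EM: refine a PWM and a per-offset position preference together.

    Assumes every sequence has the same length and holds exactly one site.
    """
    width = len(seed)
    n_pos = len(seqs[0]) - width + 1
    pwm = init_pwm(seed)
    pos_pref = [1.0 / n_pos] * n_pos  # start from a flat position preference

    for _ in range(n_iter):
        # Expected counts, started at a small pseudocount to avoid zeros.
        counts = [[pseudo] * 4 for _ in range(width)]
        pos_counts = [pseudo] * n_pos

        for s in seqs:
            # E-step: posterior over the site offset z in this sequence,
            # proportional to position preference times likelihood ratio.
            weights = []
            for z in range(n_pos):
                ratio = pos_pref[z]
                for j in range(width):
                    ratio *= pwm[j][ALPHABET.index(s[z + j])] / BACKGROUND
                weights.append(ratio)
            total = sum(weights)

            # Accumulate posterior-weighted counts for the M-step.
            for z, weight in enumerate(weights):
                post = weight / total
                pos_counts[z] += post
                for j in range(width):
                    counts[j][ALPHABET.index(s[z + j])] += post

        # M-step: renormalize expected counts into probabilities.
        pwm = [[c / sum(row) for c in row] for row in counts]
        total_pos = sum(pos_counts)
        pos_pref = [c / total_pos for c in pos_counts]

    return pwm, pos_pref


def consensus(pwm):
    """Return the most probable base at each motif column."""
    return "".join(ALPHABET[row.index(max(row))] for row in pwm)
```

Calling `em_motif_with_position(seqs, "ACGTAC")` on a list of equal-length DNA strings returns the refined PWM and a position preference vector with one probability per candidate offset. Both are learned in the same loop, which is the point of the sketch: the position preference sharpens the E-step posteriors, and those posteriors in turn update the preference.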

Author information

Authors and Affiliations

National University of Singapore, Singapore
ZhiZhuo Zhang, Willy Hugo & Wing-Kin Sung
Genome Institute of Singapore, Singapore
Cheng Wei Chang, Edwin Cheung & Wing-Kin Sung

Editor information

Editors and Affiliations

Computer Science Department, Tel-Aviv University, 69978, Tel-Aviv, Israel
Benny Chor

About this paper

Cite this paper

Zhang, Z., Chang, C.W., Hugo, W., Cheung, E., Sung, W.-K. (2012). Simultaneously Learning DNA Motif along with Its Position and Sequence Rank Preferences through EM Algorithm. In: Chor, B. (ed.) Research in Computational Molecular Biology. RECOMB 2012. Lecture Notes in Computer Science, vol 7262. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29627-7_37

DOI: https://doi.org/10.1007/978-3-642-29627-7_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29626-0
Online ISBN: 978-3-642-29627-7
eBook Packages: Computer Science, Computer Science (R0)
