Research Article | Public Access

Fast Approximate Score Computation on Large-Scale Distributed Data for Learning Multinomial Bayesian Networks

Published: 13 March 2019

Abstract

In this article, we focus on the problem of learning a Bayesian network over distributed data stored in a commodity cluster. Specifically, we address the challenge of computing the scoring function over distributed data in an efficient and scalable manner, which is a fundamental task during learning. While exact score computation can be done using MapReduce-style computation, our goal is to compute approximate scores much faster, with probabilistic error bounds, and in a scalable manner. We propose a novel approach designed to achieve the following: (a) decentralized score computation using the principle of gossiping; (b) lower resource consumption via a probabilistic approach for maintaining scores using the properties of a Markov chain; and (c) effective distribution of tasks during score computation on large datasets by synergistically combining well-known hashing techniques. We conduct a theoretical analysis of our approach in terms of the convergence speed of the statistics required for score computation, and of memory and network bandwidth consumption. We also discuss how our approach efficiently recomputes scores when new data arrive. We conducted a comprehensive evaluation of our approach and compared it with MapReduce-style computation using datasets of different characteristics on a 16-node cluster. When the MapReduce-style computation provided exact statistics for score computation, it was nearly 10 times slower than our approach. Although it ran faster on randomly sampled datasets than on the full datasets, it performed worse than our approach in terms of accuracy. Our approach achieved high accuracy (below 6% average relative error) in estimating the statistics for approximate score computation on all the tested datasets. In conclusion, it provides a feasible tradeoff between computation time and accuracy for fast approximate score computation on large-scale distributed data.
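The decentralized aggregation in (a) can be illustrated with the classic push-sum gossip protocol, on which gossip-based estimation of global statistics is commonly built. The sketch below is an illustrative simulation only, not the article's implementation: the node count, the per-node local counts, and the `push_sum_gossip` helper are hypothetical. Each node holds a local shard of one sufficient statistic (a count such as N_ijk in a multinomial score) and, by repeatedly halving and forwarding a (value, weight) pair to a random peer, converges to an estimate of the global count without any central coordinator.

```python
import random

def push_sum_gossip(local_counts, rounds=50, seed=0):
    """Simulate push-sum gossip: every node estimates the global sum of a
    statistic whose shards (local_counts) are spread across the nodes."""
    rng = random.Random(seed)
    n = len(local_counts)
    # Each node tracks (value, weight); value/weight converges to the
    # global average, since the totals of both are conserved every round.
    values = [float(c) for c in local_counts]
    weights = [1.0] * n
    for _ in range(rounds):
        new_v = [v / 2 for v in values]   # each node keeps half ...
        new_w = [w / 2 for w in weights]
        for i in range(n):
            j = rng.randrange(n)          # ... and pushes half to a random peer
            new_v[j] += values[i] / 2
            new_w[j] += weights[i] / 2
        values, weights = new_v, new_w
    # Scaling the average estimate by n recovers the global count.
    return [n * v / w for v, w in zip(values, weights)]

counts = [120, 80, 100, 95, 105, 110, 90, 100]  # hypothetical per-node shards
estimates = push_sum_gossip(counts)             # all close to sum(counts) = 800
```

After enough rounds every node's estimate agrees on the global count, which is what lets each node evaluate a decomposable score locally; the protocol's exponential convergence is the basis for the convergence-speed analysis mentioned above.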
Published in

ACM Transactions on Knowledge Discovery from Data, Volume 13, Issue 2 (April 2019), 342 pages
ISSN: 1556-4681, EISSN: 1556-472X
DOI: 10.1145/3319626

Copyright © 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 February 2018
• Revised: 1 November 2018
• Accepted: 1 November 2018
• Published: 13 March 2019


      Qualifiers

      • research-article
      • Research
      • Refereed
