Research Article | Public Access

Fast Approximate Score Computation on Large-Scale Distributed Data for Learning Multinomial Bayesian Networks

Published: 13 March 2019

Abstract

In this article, we focus on the problem of learning a Bayesian network over distributed data stored in a commodity cluster. Specifically, we address the challenge of computing the scoring function over distributed data in an efficient and scalable manner, which is a fundamental task during learning. While exact score computation can be done using MapReduce-style computation, our goal is to compute approximate scores much faster, with probabilistic error bounds, and in a scalable manner. We propose a novel approach designed to achieve the following: (a) decentralized score computation using the principle of gossiping; (b) lower resource consumption via a probabilistic approach for maintaining scores using the properties of a Markov chain; and (c) effective distribution of tasks during score computation on large datasets by synergistically combining well-known hashing techniques. We conduct a theoretical analysis of our approach in terms of the convergence speed of the statistics required for score computation, and of memory and network bandwidth consumption. We also discuss how our approach efficiently recomputes scores when new data arrive. We conducted a comprehensive evaluation of our approach and compared it with MapReduce-style computation using datasets of different characteristics on a 16-node cluster. When the MapReduce-style computation provided exact statistics for score computation, it was nearly 10 times slower than our approach. Although it ran faster on randomly sampled datasets than on the full datasets, it performed worse than our approach in terms of accuracy. Our approach achieved high accuracy (below 6% average relative error) in estimating the statistics for approximate score computation on all the tested datasets. In conclusion, it provides a feasible tradeoff between computation time and accuracy for fast approximate score computation on large-scale distributed data.
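The decentralized aggregation in (a) can be illustrated with the classic push-sum gossip protocol, on which gossip-based estimation of global statistics is commonly built. The sketch below is an illustrative simulation only, not the article's implementation: the node count, the per-node local counts, and the `push_sum_gossip` helper are hypothetical. Each node holds a local shard of one sufficient statistic (a count such as N_ijk in a multinomial score) and, by repeatedly halving and forwarding a (value, weight) pair to a random peer, converges to an estimate of the global count without any central coordinator.

```python
import random

def push_sum_gossip(local_counts, rounds=50, seed=0):
    """Simulate push-sum gossip: every node estimates the global sum of a
    statistic whose shards (local_counts) are spread across the nodes."""
    rng = random.Random(seed)
    n = len(local_counts)
    # Each node tracks (value, weight); value/weight converges to the
    # global average, since the totals of both are conserved every round.
    values = [float(c) for c in local_counts]
    weights = [1.0] * n
    for _ in range(rounds):
        new_v = [v / 2 for v in values]   # each node keeps half ...
        new_w = [w / 2 for w in weights]
        for i in range(n):
            j = rng.randrange(n)          # ... and pushes half to a random peer
            new_v[j] += values[i] / 2
            new_w[j] += weights[i] / 2
        values, weights = new_v, new_w
    # Scaling the average estimate by n recovers the global count.
    return [n * v / w for v, w in zip(values, weights)]

counts = [120, 80, 100, 95, 105, 110, 90, 100]  # hypothetical per-node shards
estimates = push_sum_gossip(counts)             # all close to sum(counts) = 800
```

After enough rounds every node's estimate agrees on the global count, which is what lets each node evaluate a decomposable score locally; the protocol's exponential convergence is the basis for the convergence-speed analysis mentioned above.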
Published in

ACM Transactions on Knowledge Discovery from Data, Volume 13, Issue 2 (April 2019), 342 pages
ISSN: 1556-4681, EISSN: 1556-472X
DOI: 10.1145/3319626

Copyright © 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 February 2018
• Revised: 1 November 2018
• Accepted: 1 November 2018
• Published: 13 March 2019


      Qualifiers

      • research-article
      • Research
      • Refereed
