research-article

Privacy-preserving decision trees over vertically partitioned data

Published: 27 October 2008

Abstract

Privacy and security concerns can prevent sharing of data, derailing data-mining projects. Distributed knowledge discovery, if done correctly, can alleviate this problem. We introduce a generalized privacy-preserving variant of the ID3 algorithm for vertically partitioned data distributed over two or more parties. Along with a proof of security, we discuss what would be necessary to make the protocols completely secure. We also provide experimental results, giving a first demonstration of the practical complexity of secure multiparty computation-based data mining.
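To make the setting concrete, here is a minimal sketch of what "vertically partitioned" means for ID3. It assumes a hypothetical four-record toy dataset in which party A holds one attribute and party B holds another attribute plus the class label. Everything is computed in the clear, so it offers no privacy and is not the paper's protocol. It only shows that the counts ID3 needs for information gain can be expressed as joint counts over per-party 0/1 indicator vectors, which is the quantity a privacy-preserving protocol would have to compute without revealing the vectors. All names (`party_a`, `party_b`, `indicator`, `joint_count`) are illustrative assumptions.

```python
from math import log2

# Hypothetical toy data: the same four records, split by column between two parties.
party_a = {"outlook": ["sunny", "sunny", "rain", "rain"]}
party_b = {
    "windy": [False, True, False, True],
    "play": ["yes", "no", "yes", "no"],  # class attribute, assumed held by party B
}


def indicator(column, value):
    """Return a 0/1 vector marking the records whose attribute equals `value`."""
    return [1 if v == value else 0 for v in column]


def joint_count(*vectors):
    """Count records selected by every vector (a dot product for two vectors).

    Computed in the clear here; a private variant would need a secure protocol.
    """
    return sum(all(bits) for bits in zip(*vectors))


def entropy(counts):
    """Shannon entropy (base 2) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def information_gain(attr_column, class_column):
    """ID3 information gain of one attribute with respect to the class."""
    classes = sorted(set(class_column))
    n = len(class_column)
    base = entropy([sum(indicator(class_column, c)) for c in classes])
    remainder = 0.0
    for value in sorted(set(attr_column)):
        attr_vec = indicator(attr_column, value)
        counts = [joint_count(attr_vec, indicator(class_column, c)) for c in classes]
        remainder += sum(counts) / n * entropy(counts)
    return base - remainder


gain_outlook = information_gain(party_a["outlook"], party_b["play"])
gain_windy = information_gain(party_b["windy"], party_b["play"])
print(gain_outlook, gain_windy)
```

In this toy data `windy` separates the classes perfectly while `outlook` does not, so ID3 would split on `windy` first. Note that the `outlook` gain depends on a vector held by one party and a class vector held by the other, which is where the parties would otherwise have to share data.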



          Reviews

          Richard CHBEIR

          Iterative dichotomiser 3 (ID3) is a classification algorithm that builds a decision tree from a fixed set of examples. This paper presents an interesting variant of the ID3 algorithm that can be used to classify vertically partitioned data while preserving the privacy of the participating sites and parties. This variant is practical and beneficial in various scenarios and applications. In the first section, the authors introduce their work and present a motivating scenario related to cancer treatments. In the second section, they show how to create the ID3 tree, explaining the required concepts and algorithms. Section 3 explains how the tree can be used, illustrated with an example on a weather dataset. Section 4 discusses the security proofs for the algorithms, and Section 5 addresses computational complexity. In Section 6, the authors present the implementation of the algorithm, with a set of experimental studies of its performance. In Section 7, they address the problem of securing the ID3 protocols through a theoretical study of a secure multiparty dot product protocol, complemented with an experimental study. Section 8 presents current approaches related to this work. Although the paper is very interesting, it remains very technical, and data mining skills are required to understand the concepts (particularly the main classification algorithms). A comparison with classical algorithms would have made the paper easier to understand. In addition, Sections 6 and 7 should have been merged and restructured.

          Online Computing Reviews Service


          • Published in

            ACM Transactions on Knowledge Discovery from Data  Volume 2, Issue 3
            October 2008
            124 pages
            ISSN:1556-4681
            EISSN:1556-472X
            DOI:10.1145/1409620

            Copyright © 2008 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 27 October 2008
            • Accepted: 1 August 2008
            • Revised: 1 May 2008
            • Received: 1 September 2007


            Qualifiers

            • research-article
            • Research
            • Refereed
