DOI: 10.1145/3448016.3457239 · SIGMOD Conference Proceedings · Research article

HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning

Published: 18 June 2021

ABSTRACT

Software systems that learn from user data with machine learning (ML) have become ubiquitous in recent years. Recent legislation such as the "General Data Protection Regulation" (GDPR) requires organisations that process personal data to delete user data upon request (enacting the "right to be forgotten"). This regulation not only requires the deletion of user data from databases, but also applies to ML models that have been trained on the stored data. We therefore argue that ML applications should allow users to unlearn their data from trained models in a timely manner. We explore how fast this unlearning can be done under the constraints imposed by real-world deployments, and introduce the problem of low-latency machine unlearning: maintaining a deployed ML model in place under the removal of a small fraction of its training samples, without retraining.

We propose HedgeCut, a classification model based on an ensemble of randomised decision trees, which is designed to answer unlearning requests with low latency. We detail how to efficiently implement HedgeCut with vectorised operators for decision tree learning. We conduct an experimental evaluation on five privacy-sensitive datasets, where we find that HedgeCut can unlearn training samples with a latency of around 100 microseconds and can answer up to 36,000 prediction requests per second, while providing training time and predictive accuracy similar to widely used implementations of tree-based ML models such as Random Forests.
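The core mechanism behind this style of unlearning can be sketched as follows. This is a minimal illustration under the simplifying assumption that each tree node only maintains per-class counts of the training samples that pass through it, so that a sample's contribution can be subtracted along its root-to-leaf path without retraining. HedgeCut's actual algorithm additionally tracks the robustness of split decisions and revises splits that a deletion would invalidate; all class and function names below are illustrative, not taken from the paper's code.

```python
# Hypothetical sketch: unlearning a sample from a single decision tree by
# decrementing the class counts stored along its root-to-leaf path.
# This omits HedgeCut's handling of non-robust splits and its vectorisation.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None):
        self.feature = feature      # None marks a leaf
        self.threshold = threshold
        self.left = left
        self.right = right
        self.class_counts = {}      # label -> number of training samples seen

def _walk(node, x, delta, y):
    """Adjust the count for label y by delta at every node on x's path."""
    while node is not None:
        node.class_counts[y] = node.class_counts.get(y, 0) + delta
        if node.feature is None:    # reached the leaf
            break
        node = node.left if x[node.feature] <= node.threshold else node.right

def insert(node, x, y):
    """Record a training sample's label along its path."""
    _walk(node, x, +1, y)

def unlearn(node, x, y):
    """Remove a sample's contribution in O(tree depth), without retraining."""
    _walk(node, x, -1, y)

def predict(node, x):
    """Return the majority class at the leaf reached by x."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return max(node.class_counts, key=node.class_counts.get)
```

Because only counts along one path change, an unlearning request touches O(depth) nodes per tree, which is what makes microsecond-scale latencies plausible for an ensemble of shallow randomised trees.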

Supplemental Material

3448016.3457239.mp4 (mp4, 46.3 MB)


Published in

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021, 2969 pages
ISBN: 978-1-4503-8343-1
DOI: 10.1145/3448016
Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States




Acceptance Rates

Overall acceptance rate: 785 of 4,003 submissions, 20%
