ABSTRACT
Software systems that learn from user data with machine learning (ML) have become ubiquitous in recent years. Recent legislation such as the "General Data Protection Regulation" (GDPR) requires organisations that process personal data to delete user data upon request (enacting the "right to be forgotten"). This regulation not only requires the deletion of user data from databases, but also applies to ML models that have been learned from the stored data. We therefore argue that ML applications should offer users the ability to unlearn their data from trained models in a timely manner. We explore how fast such unlearning can be done under the constraints imposed by real-world deployments, and introduce the problem of low-latency machine unlearning: maintaining a deployed ML model in-place under the removal of a small fraction of its training samples, without retraining.
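To make the notion of in-place unlearning concrete, here is a deliberately tiny illustration (our own example, not HedgeCut itself): a majority-vote classifier whose sufficient statistics are plain class counts, so removing a training sample is a constant-time decrement rather than a retraining run. The class and method names are illustrative assumptions.

```python
from collections import Counter

class MajorityClassifier:
    """Toy model with O(1) unlearning: predictions depend only on class
    counts, which can be updated in-place when a sample is removed.
    Illustrative sketch only -- HedgeCut applies a far more involved
    version of this idea to an ensemble of randomised decision trees."""

    def fit(self, labels):
        # The class counts are the model's only state ("sufficient statistics").
        self.counts = Counter(labels)
        return self

    def predict(self):
        # Predict the currently most frequent class.
        return self.counts.most_common(1)[0][0]

    def unlearn(self, label):
        # Removing a training sample just decrements its class count --
        # no access to the remaining training data, no retraining.
        self.counts[label] -= 1
```

For example, after fitting on `['a', 'a', 'b']` the model predicts `'a'`; unlearning the two `'a'` samples flips the prediction to `'b'` without ever touching the dataset again.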
We propose HedgeCut, a classification model based on an ensemble of randomised decision trees, designed to answer unlearning requests with low latency. We detail how to implement HedgeCut efficiently with vectorised operators for decision tree learning. In an experimental evaluation on five privacy-sensitive datasets, we find that HedgeCut unlearns training samples with a latency of around 100 microseconds and answers up to 36,000 prediction requests per second, while providing training times and predictive accuracy comparable to widely used implementations of tree-based ML models such as Random Forests.
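The abstract mentions vectorised operators for decision tree learning. As a rough illustration of that style of operator (a sketch under our own assumptions, not HedgeCut's actual implementation), the NumPy function below scores many candidate split thresholds for one feature in a single batch of array operations, computing the size-weighted Gini impurity of each split. The function name and array shapes are hypothetical.

```python
import numpy as np

def weighted_gini(feature, labels, thresholds):
    """Score all candidate thresholds at once, vectorised over thresholds.
    feature: (n,) feature values; labels: (n,) class labels;
    thresholds: (T,) candidate split points. Returns (T,) impurities."""
    # left[t, j] is True if sample j goes to the left child under threshold t.
    left = feature[None, :] <= thresholds[:, None]          # (T, n)
    n = feature.size
    n_left = left.sum(axis=1)                               # (T,)
    n_right = n - n_left
    classes = np.unique(labels)
    # Per-threshold class counts on the left side, shape (T, C).
    left_counts = np.stack(
        [(left & (labels == c)).sum(axis=1) for c in classes], axis=1)
    total_counts = np.array([(labels == c).sum() for c in classes])  # (C,)
    right_counts = total_counts[None, :] - left_counts      # (T, C)

    def gini(counts, sizes):
        # Gini impurity 1 - sum(p^2) per threshold; guard empty partitions.
        p = counts / np.maximum(sizes, 1)[:, None]
        return 1.0 - (p ** 2).sum(axis=1)

    # Size-weighted impurity of each candidate split.
    return (n_left * gini(left_counts, n_left)
            + n_right * gini(right_counts, n_right)) / n
```

For instance, on features `[1, 2, 3, 4]` with labels `[0, 0, 1, 1]`, the threshold 2.5 separates the classes perfectly and scores an impurity of 0, while the threshold 1.5 scores 1/3. Evaluating all thresholds as one matrix operation, rather than looping over them, is what makes this formulation amenable to SIMD-style execution.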
HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning