ABSTRACT
Software systems that learn from user data with machine learning (ML) have become ubiquitous in recent years. Recent legislation such as the "General Data Protection Regulation" (GDPR) requires organisations that process personal data to delete user data upon request (enacting the "right to be forgotten"). This regulation not only requires the deletion of user data from databases, but also applies to ML models that have been learned from the stored data. We therefore argue that ML applications should offer users the ability to unlearn their data from trained models in a timely manner. We explore how fast such unlearning can be done under the constraints imposed by real-world deployments, and introduce the problem of low-latency machine unlearning: maintaining a deployed ML model in-place under the removal of a small fraction of its training samples, without retraining.
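To make the notion of in-place unlearning concrete, here is a deliberately tiny illustration (our own example, not HedgeCut itself): a majority-vote classifier whose sufficient statistics are plain class counts, so removing a training sample is a constant-time decrement rather than a retraining run. The class and method names are illustrative assumptions.

```python
from collections import Counter

class MajorityClassifier:
    """Toy model with O(1) unlearning: predictions depend only on class
    counts, which can be updated in-place when a sample is removed.
    Illustrative sketch only -- HedgeCut applies a far more involved
    version of this idea to an ensemble of randomised decision trees."""

    def fit(self, labels):
        # The class counts are the model's only state ("sufficient statistics").
        self.counts = Counter(labels)
        return self

    def predict(self):
        # Predict the currently most frequent class.
        return self.counts.most_common(1)[0][0]

    def unlearn(self, label):
        # Removing a training sample just decrements its class count --
        # no access to the remaining training data, no retraining.
        self.counts[label] -= 1
```

For example, after fitting on `['a', 'a', 'b']` the model predicts `'a'`; unlearning the two `'a'` samples flips the prediction to `'b'` without ever touching the dataset again.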
We propose HedgeCut, a classification model based on an ensemble of randomised decision trees, designed to answer unlearning requests with low latency. We detail how to implement HedgeCut efficiently with vectorised operators for decision tree learning. In an experimental evaluation on five privacy-sensitive datasets, we find that HedgeCut unlearns training samples with a latency of around 100 microseconds and answers up to 36,000 prediction requests per second, while providing training times and predictive accuracy comparable to widely used implementations of tree-based ML models such as Random Forests.
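The abstract mentions vectorised operators for decision tree learning. As a rough illustration of that style of operator (a sketch under our own assumptions, not HedgeCut's actual implementation), the NumPy function below scores many candidate split thresholds for one feature in a single batch of array operations, computing the size-weighted Gini impurity of each split. The function name and array shapes are hypothetical.

```python
import numpy as np

def weighted_gini(feature, labels, thresholds):
    """Score all candidate thresholds at once, vectorised over thresholds.
    feature: (n,) feature values; labels: (n,) class labels;
    thresholds: (T,) candidate split points. Returns (T,) impurities."""
    # left[t, j] is True if sample j goes to the left child under threshold t.
    left = feature[None, :] <= thresholds[:, None]          # (T, n)
    n = feature.size
    n_left = left.sum(axis=1)                               # (T,)
    n_right = n - n_left
    classes = np.unique(labels)
    # Per-threshold class counts on the left side, shape (T, C).
    left_counts = np.stack(
        [(left & (labels == c)).sum(axis=1) for c in classes], axis=1)
    total_counts = np.array([(labels == c).sum() for c in classes])  # (C,)
    right_counts = total_counts[None, :] - left_counts      # (T, C)

    def gini(counts, sizes):
        # Gini impurity 1 - sum(p^2) per threshold; guard empty partitions.
        p = counts / np.maximum(sizes, 1)[:, None]
        return 1.0 - (p ** 2).sum(axis=1)

    # Size-weighted impurity of each candidate split.
    return (n_left * gini(left_counts, n_left)
            + n_right * gini(right_counts, n_right)) / n
```

For instance, on features `[1, 2, 3, 4]` with labels `[0, 0, 1, 1]`, the threshold 2.5 separates the classes perfectly and scores an impurity of 0, while the threshold 1.5 scores 1/3. Evaluating all thresholds as one matrix operation, rather than looping over them, is what makes this formulation amenable to SIMD-style execution.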
HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning