Abstract
We propose a method for the release of differentially private synthetic datasets. In many contexts, data contain sensitive values which cannot be released in their original form if individuals' privacy is to be protected. Synthetic data is a protection method that releases alternative values in place of the original ones, and differential privacy (DP) is a formal guarantee for quantifying the privacy loss. We propose a method that maximizes the distributional similarity of the synthetic data relative to the original data using a measure known as the pMSE, while guaranteeing \(\epsilon \)-DP. We relax common DP assumptions concerning the distribution and boundedness of the original data. We prove theoretical results for the privacy guarantee and provide simulations for the empirical failure rate of the theoretical results under typical computational limitations. We also provide simulations comparing the accuracy of linear regression coefficients estimated from the synthetic data with the accuracy achieved by non-DP synthetic data and by other DP methods. Additionally, our theoretical results extend a prior result for the sensitivity of the Gini Index to include continuous predictors.
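To make the pMSE utility measure concrete, the following is a minimal sketch, not the authors' implementation: it assumes equal-column numeric arrays for the original and synthetic data and uses a CART classifier to estimate the propensity of each record being synthetic. The function name and parameters are illustrative only.

```python
# Minimal pMSE sketch (illustrative only, not the paper's code).
# Assumptions: X_orig and X_synth are numeric arrays with identical columns;
# a CART model distinguishes original (label 0) from synthetic (label 1) rows.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pmse(X_orig, X_synth, max_depth=3):
    # Stack original and synthetic records with 0/1 labels.
    X = np.vstack([X_orig, X_synth])
    y = np.concatenate([np.zeros(len(X_orig)), np.ones(len(X_synth))])
    c = y.mean()  # proportion of synthetic rows (1/2 when sizes match)
    # Fit a CART model to predict which rows are synthetic.
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, y)
    p_hat = tree.predict_proba(X)[:, 1]  # propensity scores
    # pMSE: mean squared deviation of propensities from c.
    return float(np.mean((p_hat - c) ** 2))
```

A pMSE near zero indicates the classifier cannot tell the two datasets apart, i.e., high distributional similarity.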
Notes
1. We put the minus sign before the u function because our quality function decreases for more desirable outputs.
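To make the role of the minus sign concrete, here is a hedged sketch of an exponential-mechanism sampling step over a finite candidate set of synthesizer parameters, using quality \(-u\) and the sensitivity \(\varDelta u \le 1/n\) proved in the appendix. The names `thetas` and `expected_pmse` are hypothetical placeholders, not the paper's API.

```python
# Illustrative exponential-mechanism step with quality -u (hence the minus
# sign discussed in the note above). Hypothetical pieces: `thetas` is a
# finite candidate set of synthesizer parameters, and expected_pmse(X, th)
# approximates u(X, theta); sensitivity 1/n follows Theorem 4.1.
import numpy as np

def sample_theta(thetas, expected_pmse, X, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    delta_u = 1.0 / len(X)
    # Lower expected pMSE means a better candidate, so quality is -u.
    quality = np.array([-expected_pmse(X, th) for th in thetas])
    logits = epsilon * quality / (2.0 * delta_u)
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return thetas[rng.choice(len(thetas), p=probs)]
```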
Acknowledgments
This work is supported by the U.S. Census Bureau and NSF grants BCS-0941553 and SES-1534433 to the Department of Statistics at the Pennsylvania State University. Thanks to Bharath Sriperumbudur for special aid in deriving the final form of the theoretical proof.
Appendices
10 Appendix: Proof of Theorem 4.1
Proof
We first show that the sensitivity of the expected value, and of the approximation to it, can be bounded above by the supremum across all possible datasets \(X^s\) generated using \(\theta \). The sensitivity

\[ \varDelta u = \sup_{X, X'} \sup_{\theta} \big| u(X, \theta) - u(X', \theta) \big| \]

can be rewritten as

\[ \varDelta u = \sup_{X, X'} \sup_{\theta} \big| E_\theta[pMSE(X, X^s_\theta) \mid X, \theta] - E_\theta[pMSE(X', X^s_\theta) \mid X', \theta] \big|, \]

where \(u(X, \theta ) = E_\theta [pMSE(X, X^s_\theta )|X, \theta ]\). Since the absolute value is a convex function, we can apply Jensen's inequality and get

\[ \varDelta u \le \sup_{X, X'} \sup_{\theta} E_\theta \big[ \, \big| pMSE(X, X^s_\theta) - pMSE(X', X^s_\theta) \big| \; \big| \; X, X', \theta \big]. \]

Then by taking the supremum over any data set \(X^s_\theta \), we obtain

\[ \varDelta u \le \sup_{X, X', X^s_\theta} \big| pMSE(X, X^s_\theta) - pMSE(X', X^s_\theta) \big|. \]
This also bounds our approximation of the expected value that we propose to use in practice, since the supremum is also greater than or equal to the sample mean.
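Spelling out this step in my own notation (not taken verbatim from the paper), with \(\hat{u}\) denoting the sample mean of the pMSE over \(K\) synthetic draws \(X^s_1, \dots, X^s_K\):

```latex
% Monte Carlo approximation of u and its bound (illustrative notation):
\left| \hat{u}(X,\theta) - \hat{u}(X',\theta) \right|
  = \left| \frac{1}{K} \sum_{k=1}^{K}
      \Big( pMSE(X, X^s_k) - pMSE(X', X^s_k) \Big) \right|
  \le \sup_{X^s_\theta} \left| pMSE(X, X^s_\theta) - pMSE(X', X^s_\theta) \right|,
% by the triangle inequality, so the supremum bound covers the
% sample-mean approximation as well.
```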
Now writing this explicitly in terms of the CART model, we get

\[ \varDelta u \le \sup \frac{1}{2n} \left| \varSigma_{i=1}^{D+1}\, m_i \left( \frac{a_i}{m_i} - \frac{1}{2} \right)^2 - \varSigma_{i=1}^{D+1}\, m_i^\prime \left( \frac{a_i^\prime}{m_i^\prime} - \frac{1}{2} \right)^2 \right|, \]

where \(a_i\), \(m_i\), and D are defined as before, and \(a_i^\prime \) and \(m_i^\prime \) are the corresponding values for the model fit using \(X^\prime \). Expanding this we get

\[ \varDelta u \le \sup \frac{1}{2n} \left| \varSigma_{i=1}^{D+1} \left( \frac{a_i^2}{m_i} - a_i + \frac{m_i}{4} \right) - \varSigma_{i=1}^{D+1} \left( \frac{(a_i^\prime)^2}{m_i^\prime} - a_i^\prime + \frac{m_i^\prime}{4} \right) \right|, \]

and we can cancel the third terms because \(\varSigma _{i=1}^{D+1}m_i = \varSigma _{i=1}^{D+1}m_i^\prime \). When we multiply by 2n, the remaining inside term (with the sign of each summand flipped, which the absolute value absorbs) is equivalent to the sensitivity of the impurity, i.e.,

\[ \varDelta GI = \sup \left| \varSigma_{i=1}^{D+1} \left( a_i - \frac{a_i^2}{m_i} \right) - \varSigma_{i=1}^{D+1} \left( a_i^\prime - \frac{(a_i^\prime)^2}{m_i^\prime} \right) \right|. \]
By bounding the impurity, we bound the pMSE. Writing \(GI(X^{comb}) = \varSigma_{i=1}^{D+1}(a_i - a_i^2/m_i)\) and \(GI(X^{\prime comb}) = \varSigma_{i=1}^{D+1}(a_i^\prime - (a_i^\prime)^2/m_i^\prime)\) for the minimized impurity terms, we can rewrite the above as

\[ \varDelta GI = \sup_{X, X'} \left| GI(X^{comb}) - GI(X^{\prime comb}) \right|, \]

since the optimal CART model finds the minimum impurity across any D. The greatest possible difference then is the difference between these two optimums, and we can bound it above by comparing each optimum with the impurity obtained when the other dataset's optimal splits are kept fixed and only the node counts change, as made precise below.
Let \(X^{comb}\) and \(X^{\prime comb}\) be the combined data matrices as described in Algorithm 1, including the 0, 1 outcome variable. Recall that only one record has changed between \(X^{comb}\) and \(X^{\prime comb}\) (the total number of records staying fixed), and it is labeled 0. We know that for a given set of D optimal split points producing \(D + 1\) nodes on \(X^{comb}\), there are \(a_i\) records labeled 1 and \(\tilde{m}_i\) total records in each bin, such that \(\exists \; j \ne k \ne l_1 \ne ... \ne l_{D-1} \; s.t. \; \tilde{m}_j - m_j = m_k - \tilde{m}_k = 1, \; \tilde{m}_{l_v} = m_{l_v}\) for \(v = \{1,..., D - 1\}\). In the same way, for a given set of D optimal split points producing \(D + 1\) nodes on \(X^{\prime comb}\), there are \(a^\prime _i\) records labeled 1 and \(\tilde{m}^\prime _i\) total records in each bin, such that \(\exists \; j^\prime \ne k^\prime \ne l^\prime _1 \ne ... \ne l^\prime _{D-1} \; s.t. \; \tilde{m}^\prime _{j^\prime } - m^\prime _{j^\prime } = m^\prime _{k^\prime } - \tilde{m}^\prime _{k^\prime } = 1, \; \tilde{m}^\prime _{l^\prime _v} = m^\prime _{l^\prime _v}\) for \(v = \{1,..., D - 1\}\). Put simply, after changing one record, the discrete counts in the nodes change by at most one in two of the nodes and do not change in the other \(D - 1\) nodes.
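As a numerical sanity check on this counting argument, the brute-force sketch below (my own, not from the paper) moves a single 0-labeled record from node k to node j and confirms that the summed impurity term \(\varSigma_i (a_i - a_i^2/m_i)\) changes by less than 2 in absolute value over small node configurations.

```python
# Brute-force check (illustrative): moving one record labeled 0 from node k
# to node j changes sum_i (a_i - a_i^2 / m_i) by less than 2 in magnitude.
import itertools

def term(a, m):
    return a - a * a / m if m > 0 else 0.0  # per-node impurity term

worst = 0.0
for m_j, m_k in itertools.product(range(1, 20), repeat=2):
    for a_j in range(m_j + 1):      # a_j records labeled 1 in node j
        for a_k in range(m_k):      # a_k <= m_k - 1: node k holds a 0-record
            before = term(a_j, m_j) + term(a_k, m_k)
            after = term(a_j, m_j + 1) + term(a_k, m_k - 1)
            worst = max(worst, abs(after - before))
print(worst)  # stays below 2, consistent with Delta GI <= 2
```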
Due to the fact that the CART model produces the D splits that minimize the impurity, we know both that

\[ \varSigma_{i=1}^{D+1} \left( a_i^\prime - \frac{(a_i^\prime)^2}{m_i^\prime} \right) \le \varSigma_{i=1}^{D+1} \left( a_i - \frac{a_i^2}{\tilde{m}_i} \right) \qquad (14) \]

and

\[ \varSigma_{i=1}^{D+1} \left( a_i - \frac{a_i^2}{m_i} \right) \le \varSigma_{i=1}^{D+1} \left( a_i^\prime - \frac{(a_i^\prime)^2}{\tilde{m}_i^\prime} \right). \qquad (15) \]
The inequality (14) says that after changing one record, if new split points are chosen, the resulting impurity must be equal to or better than that obtained by simply keeping the previous splits and changing the counts. The inequality (15) says that the split points chosen first must be equal to or better than the new splits evaluated with the changed counts; if this were not the case, the first split points would never have been chosen in the first place. These two inequalities lead to the final step.
Because we have an absolute value, we consider two cases. Applying (14),

\[ GI(X^{\prime comb}) - GI(X^{comb}) \le \varSigma_{i=1}^{D+1} \left( a_i - \frac{a_i^2}{\tilde{m}_i} \right) - \varSigma_{i=1}^{D+1} \left( a_i - \frac{a_i^2}{m_i} \right) = \frac{a_j^2}{m_j(m_j+1)} - \frac{a_k^2}{m_k(m_k-1)} \le 2, \]

and applying (15),

\[ GI(X^{comb}) - GI(X^{\prime comb}) \le \varSigma_{i=1}^{D+1} \left( a_i^\prime - \frac{(a_i^\prime)^2}{\tilde{m}_i^\prime} \right) - \varSigma_{i=1}^{D+1} \left( a_i^\prime - \frac{(a_i^\prime)^2}{m_i^\prime} \right) = \frac{(a_{j^\prime}^\prime)^2}{m_{j^\prime}^\prime(m_{j^\prime}^\prime+1)} - \frac{(a_{k^\prime}^\prime)^2}{m_{k^\prime}^\prime(m_{k^\prime}^\prime-1)} \le 2. \]

We know the last step in each case because \(a_i \le m_i\), and \(\frac{n^2}{n(n-1)} \le 2\).
Finally, this gives us \(\varDelta GI \le 2 \implies \frac{\varDelta GI}{2n} = \varDelta u \le \frac{1}{n}\).
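For completeness, here is how this bound feeds the exponential mechanism of McSherry and Talwar; the display is my own summary, under the assumption (consistent with the note on the minus sign above) that the mechanism samples \(\theta\) with exponential weights on quality \(-u\):

```latex
% Exponential mechanism with quality -u and sensitivity Delta u <= 1/n:
\Pr[\mathcal{M}(X) = \theta]
  \;\propto\; \exp\!\left( \frac{\epsilon \, \big(-u(X,\theta)\big)}{2 \, \varDelta u} \right)
  \;=\; \exp\!\left( -\frac{\epsilon \, n \, u(X,\theta)}{2} \right),
% which satisfies epsilon-differential privacy by the standard
% exponential-mechanism argument.
```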
11 Appendix: Full Simulation Results
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Snoke, J., Slavković, A. (2018). pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity. In: Domingo-Ferrer, J., Montes, F. (eds.) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol. 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_10
DOI: https://doi.org/10.1007/978-3-319-99771-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99770-4
Online ISBN: 978-3-319-99771-1