
pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity

  • Conference paper
Privacy in Statistical Databases (PSD 2018)

Part of the book series: Lecture Notes in Computer Science, volume 11126

Abstract

We propose a method for the release of differentially private synthetic datasets. In many contexts, data contain sensitive values which cannot be released in their original form in order to protect individuals’ privacy. Synthetic data is a protection method that releases alternative values in place of the original ones, and differential privacy (DP) is a formal guarantee for quantifying the privacy loss. We propose a method that maximizes the distributional similarity of the synthetic data relative to the original data using a measure known as the pMSE, while guaranteeing \(\epsilon \)-DP. We relax common DP assumptions concerning the distribution and boundedness of the original data. We prove theoretical results for the privacy guarantee and provide simulations for the empirical failure rate of the theoretical results under typical computational limitations. We give simulations for the accuracy of linear regression coefficients generated from the synthetic data compared with the accuracy of non-DP synthetic data and other DP methods. Additionally, our theoretical results extend a prior result for the sensitivity of the Gini Index to include continuous predictors.
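As background, the pMSE statistic referenced throughout can be computed in a few lines. The following minimal sketch is our illustration, not the authors' code; it assumes scikit-learn's DecisionTreeClassifier as the propensity-score model and equal-sized original and synthetic samples. Stack the two datasets, label each row by source, fit a CART model, and average the squared deviations of the fitted propensities from 0.5.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pmse(original, synthetic, max_depth=3):
    """pMSE of `synthetic` relative to `original` (two (n, p) arrays).

    Stack the datasets, label original rows 0 and synthetic rows 1, fit
    a CART classifier to predict the label, and return the mean squared
    deviation of the fitted propensities from 0.5, the value expected
    when the two datasets are indistinguishable."""
    combined = np.vstack([original, synthetic])
    labels = np.concatenate([np.zeros(len(original)), np.ones(len(synthetic))])
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(combined, labels)
    propensities = tree.predict_proba(combined)[:, 1]
    return float(np.mean((propensities - 0.5) ** 2))
```

Smaller values indicate greater distributional similarity; the mechanism proposed in the paper selects synthesizer parameters so as to make this quantity small while satisfying \(\epsilon \)-DP.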


Notes

  1. We put the minus sign before the u function because our quality function decreases for more desirable outputs.
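To illustrate the footnote, here is a minimal sketch of an exponential-mechanism sampler that scores candidates by \(-u\); the finite candidate set and function names are hypothetical stand-ins, not part of the paper, and the sensitivity bound \(\varDelta u \le 1/n\) comes from Theorem 4.1.

```python
import numpy as np

def exponential_mechanism(candidates, u, epsilon, sensitivity):
    """Sample one candidate theta with probability proportional to
    exp(epsilon * (-u(theta)) / (2 * sensitivity)).  The minus sign
    appears because u, an expected pMSE, decreases for more desirable
    outputs, while the exponential mechanism favors high scores."""
    scores = np.array([-epsilon * u(theta) / (2.0 * sensitivity)
                       for theta in candidates])
    scores -= scores.max()              # stabilize before exponentiating
    probabilities = np.exp(scores)
    probabilities /= probabilities.sum()
    index = np.random.default_rng().choice(len(candidates), p=probabilities)
    return candidates[index]
```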


Acknowledgments

This work is supported by the U.S. Census Bureau and NSF grants BCS-0941553 and SES-1534433 to the Department of Statistics at the Pennsylvania State University. Thanks to Bharath Sriperumbudur for special aid in deriving the final form of the theoretical proof.

Author information

Corresponding author

Correspondence to Joshua Snoke.


Appendices

10 Appendix: Proof of Theorem 4.1

Proof

We first show that the expected value defining the quality function, as well as the approximation of it that we use in practice, can be bounded above by the supremum across all possible datasets \(X^s\) generated using \(\theta \).

$$\begin{aligned} \varDelta u = \underset{\theta }{sup} \; \underset{\delta (X,X^\prime )=1}{sup} \; |u(X, \theta ) - u(X^\prime , \theta )| \end{aligned}$$
(5)

can be rewritten as

$$\begin{aligned} \varDelta u = \underset{\theta }{sup} \; \underset{\delta (X,X^\prime )=1}{sup} \; |E_\theta [pMSE(X, X^s_\theta )|X,\theta ] - E_\theta [pMSE(X^\prime , X^s_\theta )|X^\prime ,\theta ]| \end{aligned}$$
(6)

where \(u(X, \theta ) = E_\theta [pMSE(X, X^s_\theta )|X, \theta ]\). Because \(X^s_\theta \) is generated using \(\theta \) alone, both expectations are taken with respect to the same distribution over \(X^s_\theta \); since the absolute value is a convex function, we can apply Jensen’s inequality and get

$$\begin{aligned} \le \underset{\theta }{sup} \; \underset{\delta (X,X^\prime )=1}{sup} \; E_\theta [|pMSE(X, X^s_\theta ) - pMSE(X^\prime , X^s_\theta )||X,\theta ]. \end{aligned}$$
(7)

Then, taking the supremum over all possible datasets \(X^s_\theta \), we obtain

$$\begin{aligned} \le \underset{X^s_\theta }{sup} \; \underset{\delta (X,X^\prime )=1}{sup} \; |pMSE(X,X^s_\theta ) - pMSE(X^\prime , X^s_\theta )|. \end{aligned}$$
(8)

This also bounds our approximation of the expected value that we propose to use in practice, since the supremum is also greater than or equal to the sample mean.
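In practice, this expectation is approximated by a sample mean over repeated synthetic draws. A minimal sketch, assuming a hypothetical synthesize(theta, n) sampler for the synthesizer parameterized by \(\theta \) and a pmse(X, Xs) implementation of the statistic:

```python
def approx_u(X, theta, synthesize, pmse, num_draws=20):
    """Monte Carlo approximation of u(X, theta) = E[pMSE(X, X^s_theta)]:
    draw several synthetic datasets from theta and average their pMSE
    against the observed data X."""
    total = 0.0
    for _ in range(num_draws):
        total += pmse(X, synthesize(theta, len(X)))
    return total / num_draws
```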

Now writing this explicitly in terms of the CART model, we get

$$\begin{aligned} \underset{a_i, \; m_i, \; a_i^\prime , \; m_i^\prime }{sup} \; \frac{1}{2n}\Bigg |\varSigma _{i = 1}^{D + 1} m_i\Big (\frac{a_i}{m_i} - 0.5\Big )^2 - m_i^\prime \Big (\frac{a_i^\prime }{m_i^\prime } - 0.5\Big )^2\Bigg | \end{aligned}$$
(9)

where \(a_i\), \(m_i\), and D are defined as before, and \(a_i^\prime \) and \(m_i^\prime \) are the corresponding values for the model fit using \(X^\prime \). Expanding this we get

$$\begin{aligned} \underset{a_i, \; m_i, \; a_i^\prime , \; m_i^\prime }{sup} \; \frac{1}{2n}\Bigg |\varSigma _{i = 1}^{D + 1} \Big (\frac{a_i^2}{m_i} - a_i + 0.25m_i\Big ) - \Big (\frac{a_i^{\prime 2}}{m_i^\prime } - a_i^\prime + 0.25m_i^\prime \Big )\Bigg | \end{aligned}$$
(10)

and we can cancel the third terms because \(\varSigma _{i=1}^{D+1}m_i = \varSigma _{i=1}^{D+1}m_i^\prime \). When we multiply by 2n, the remaining term inside the absolute value is equivalent to the sensitivity of the Gini impurity, i.e.,

$$\begin{aligned} \underset{a_i, \; m_i, \; a_i^\prime , \; m_i^\prime }{sup} \; \Bigg |GI(X, X^s, D) - GI(X^\prime , X^s, D) \Bigg | = \varDelta GI \end{aligned}$$
(11)
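The quantities in (9)–(11) are easy to compute directly from the leaf counts. In the sketch below (our illustration, not the paper's code), pmse_from_counts implements the leaf-count form of (9) and gini_impurity the Gini form; when the synthetic sample has n records, so that \(\varSigma a_i = n\) and \(\varSigma m_i = 2n\), the two satisfy \(pMSE = \frac{1}{4} - \frac{GI}{2n}\), so differences in pMSE between two fits equal differences in \(GI/(2n)\) up to sign, matching (11).

```python
import numpy as np

def pmse_from_counts(a, m, n):
    """Leaf-count form of the pMSE, as in (9): every record in leaf i
    receives propensity a_i / m_i, and the combined data has 2n rows."""
    a, m = np.asarray(a, dtype=float), np.asarray(m, dtype=float)
    return float(np.sum(m * (a / m - 0.5) ** 2) / (2.0 * n))

def gini_impurity(a, m):
    """Total Gini impurity of the leaves, Sigma_i a_i (1 - a_i / m_i)."""
    a, m = np.asarray(a, dtype=float), np.asarray(m, dtype=float)
    return float(np.sum(a * (1.0 - a / m)))

# Worked check with three leaves, n = 5 original and n = 5 synthetic rows:
a, m, n = [3, 1, 1], [4, 3, 3], 5
assert abs(pmse_from_counts(a, m, n) - (0.25 - gini_impurity(a, m) / (2 * n))) < 1e-12
```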

By bounding the impurity, we bound the pMSE. We can rewrite (11) as

$$\begin{aligned} \Bigg |\underset{D}{min} \; GI(X, X^s, D) - \underset{D}{min} \; GI(X^\prime , X^s, D) \Bigg | \end{aligned}$$
(12)

since the optimal CART model finds the minimum impurity across all choices of D. The greatest possible difference is then the difference between these two optima, which we can bound above by

$$\begin{aligned} \le \underset{D}{sup} \; \Bigg |GI(X, X^s, D) - GI(X^\prime , X^s, D) \Bigg |. \end{aligned}$$
(13)

Let \(X^{comb}\) and \(X^{\prime comb}\) be the combined data matrices as described in Algorithm 1, including the 0–1 outcome variable. Recall that only one record differs between \(X^{comb}\) and \(X^{\prime comb}\) (the total number of records stays fixed), and that record is labeled 0. Consider the D optimal split points producing \(D + 1\) nodes on \(X^{comb}\), with \(a_i\) records labeled 1 and \(m_i\) total records in each node; applying the same splits to \(X^{\prime comb}\) gives total counts \(\tilde{m}_i\) such that \(\exists \; j \ne k \ne l_1 \ne ... \ne l_{D-1} \; s.t. \; \tilde{m}_j - m_j = m_k - \tilde{m}_k = 1, \; \tilde{m}_{l_v} = m_{l_v}\) for \(v = \{1,..., D - 1\}\). In the same way, for the D optimal split points producing \(D + 1\) nodes on \(X^{\prime comb}\), with \(a^\prime _i\) records labeled 1 and \(m^\prime _i\) total records in each node, applying those splits to \(X^{comb}\) gives total counts \(\tilde{m}^\prime _i\) such that \(\exists \; j^\prime \ne k^\prime \ne l^\prime _1 \ne ... \ne l^\prime _{D-1} \; s.t. \; \tilde{m}^\prime _{j^\prime } - m^\prime _{j^\prime } = m^\prime _{k^\prime } - \tilde{m}^\prime _{k^\prime } = 1, \; \tilde{m}^\prime _{l^\prime _v} = m^\prime _{l^\prime _v}\) for \(v = \{1,..., D - 1\}\). Put simply: after changing one record, the discrete counts change by at most one in two of the nodes and do not change in the other \(D - 1\) nodes.

Because the CART model produces the D splits that minimize the impurity, we know both that

$$\begin{aligned} \varSigma _{i = 1}^{D + 1}{a^\prime _i\Big (1-\frac{a^\prime _i}{m^\prime _i}\Big )} \le \varSigma _{i = 1}^{D + 1}{a_i\Big (1-\frac{a_i}{\tilde{m}_i}\Big )} \end{aligned}$$
(14)

and

$$\begin{aligned} \varSigma _{i = 1}^{D + 1}{a_i\Big (1-\frac{a_i}{m_i}\Big )} \le \varSigma _{i = 1}^{D + 1}{a^\prime _i\Big (1-\frac{a^\prime _i}{\tilde{m}^\prime _i}\Big )}. \end{aligned}$$
(15)

Inequality (14) says that, after changing one record, fitting new split points must do at least as well as keeping the previous splits with the changed counts. Inequality (15) says that the original split points must do at least as well as the new splits applied to the original data; if that were not the case, the original split points would never have been chosen in the first place. These two facts lead to the final step.

Because we have an absolute value, we consider two cases. First suppose \(GI(X, X^s, D) \ge GI(X^\prime , X^s, D)\). Applying (15) and then the node-count relations above,

$$\begin{aligned} GI(X, X^s, D) - GI(X^\prime , X^s, D)&\le \varSigma _{i = 1}^{D + 1}{a^\prime _i\Big (1-\frac{a^\prime _i}{\tilde{m}^\prime _i}\Big )} - \varSigma _{i = 1}^{D + 1}{a^\prime _i\Big (1-\frac{a^\prime _i}{m^\prime _i}\Big )} \\&= \frac{a^{\prime 2}_{j^\prime }}{m^\prime _{j^\prime }(m^\prime _{j^\prime } + 1)} - \frac{a^{\prime 2}_{k^\prime }}{m^\prime _{k^\prime }(m^\prime _{k^\prime } - 1)} \le \max \Bigg \{\frac{a^{\prime 2}_{j^\prime }}{m^\prime _{j^\prime }(m^\prime _{j^\prime } + 1)}, \; \frac{a^{\prime 2}_{k^\prime }}{m^\prime _{k^\prime }(m^\prime _{k^\prime } - 1)}\Bigg \} \le 2 \end{aligned}$$

(16)

The last step we know because \(a_i \le m_i\) and \(\frac{n^2}{n(n-1)} \le 2\) for any \(n \ge 2\). In the second case, \(GI(X^\prime , X^s, D) \ge GI(X, X^s, D)\), and applying (14) symmetrically gives

$$\begin{aligned} GI(X^\prime , X^s, D) - GI(X, X^s, D)&\le \varSigma _{i = 1}^{D + 1}{a_i\Big (1-\frac{a_i}{\tilde{m}_i}\Big )} - \varSigma _{i = 1}^{D + 1}{a_i\Big (1-\frac{a_i}{m_i}\Big )} \\&= \frac{a^2_j}{m_j(m_j + 1)} - \frac{a^2_k}{m_k(m_k - 1)} \le \max \Bigg \{\frac{a^2_j}{m_j(m_j + 1)}, \; \frac{a^2_k}{m_k(m_k - 1)}\Bigg \} \le 2. \end{aligned}$$

(17)

Finally, this gives us \(\varDelta GI \le 2 \implies \frac{\varDelta GI}{2n} = \varDelta u \le \frac{1}{n}\).
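The final bound can also be sanity-checked numerically. The following sketch is our check, not the paper's code, and uses the simplification that the split points (and hence the \(a_i\)) are held fixed while one 0-labeled record moves between two leaves, as in the argument above; the Gini difference stays below 2 over random configurations.

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(a, m):
    """Total Gini impurity, Sigma_i a_i (1 - a_i / m_i)."""
    return float(np.sum(a * (1.0 - a / m)))

max_diff = 0.0
for _ in range(100_000):
    leaves = rng.integers(2, 7)                  # D + 1 leaves
    m = rng.integers(2, 50, size=leaves)         # total records per leaf
    a = rng.integers(1, m)                       # labeled-1 records, a_i < m_i
    j, k = rng.choice(leaves, size=2, replace=False)
    m_new = m.copy()
    m_new[j] += 1                                # the moved 0-labeled record
    m_new[k] -= 1                                # leaves leaf k for leaf j
    max_diff = max(max_diff, abs(gini(a, m) - gini(a, m_new)))

print(max_diff)  # stays below 2, consistent with Delta GI <= 2
```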

11 Appendix: Full Simulation Results

Fig. 4. Boxplots showing simulation results. The rows indicate the different coefficients, and the columns indicate different values of \(\epsilon \). Boxplots are also subdivided within methods by the tree depth (for the pMSE mechanism) and the bound (for the other methods).


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Snoke, J., Slavković, A. (2018). pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_10


  • DOI: https://doi.org/10.1007/978-3-319-99771-1_10


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99770-4

  • Online ISBN: 978-3-319-99771-1

  • eBook Packages: Computer Science, Computer Science (R0)
