
pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity

  • Conference paper
Privacy in Statistical Databases (PSD 2018)

Part of the book series: Lecture Notes in Computer Science, volume 11126

Abstract

We propose a method for the release of differentially private synthetic datasets. In many contexts, data contain sensitive values which cannot be released in their original form in order to protect individuals’ privacy. Synthetic data is a protection method that releases alternative values in place of the original ones, and differential privacy (DP) is a formal guarantee for quantifying the privacy loss. We propose a method that maximizes the distributional similarity of the synthetic data relative to the original data using a measure known as the pMSE, while guaranteeing \(\epsilon \)-DP. We relax common DP assumptions concerning the distribution and boundedness of the original data. We prove theoretical results for the privacy guarantee and provide simulations for the empirical failure rate of the theoretical results under typical computational limitations. We give simulations for the accuracy of linear regression coefficients generated from the synthetic data compared with the accuracy of non-DP synthetic data and other DP methods. Additionally, our theoretical results extend a prior result for the sensitivity of the Gini Index to include continuous predictors.
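As background, the pMSE statistic referenced throughout can be computed in a few lines. The following minimal sketch is our illustration, not the authors' code; it assumes scikit-learn's DecisionTreeClassifier as the propensity-score model and equal-sized original and synthetic samples. Stack the two datasets, label each row by source, fit a CART model, and average the squared deviations of the fitted propensities from 0.5.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pmse(original, synthetic, max_depth=3):
    """pMSE of `synthetic` relative to `original` (two (n, p) arrays).

    Stack the datasets, label original rows 0 and synthetic rows 1, fit
    a CART classifier to predict the label, and return the mean squared
    deviation of the fitted propensities from 0.5, the value expected
    when the two datasets are indistinguishable."""
    combined = np.vstack([original, synthetic])
    labels = np.concatenate([np.zeros(len(original)), np.ones(len(synthetic))])
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(combined, labels)
    propensities = tree.predict_proba(combined)[:, 1]
    return float(np.mean((propensities - 0.5) ** 2))
```

Smaller values indicate greater distributional similarity; the mechanism proposed in the paper selects synthesizer parameters so as to make this quantity small while satisfying \(\epsilon \)-DP.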


Notes

  1. We put the minus sign before the u function because our quality function decreases for more desirable outputs.
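To illustrate the footnote, here is a minimal sketch of an exponential-mechanism sampler that scores candidates by \(-u\); the finite candidate set and function names are hypothetical stand-ins, not part of the paper, and the sensitivity bound \(\varDelta u \le 1/n\) comes from Theorem 4.1.

```python
import numpy as np

def exponential_mechanism(candidates, u, epsilon, sensitivity):
    """Sample one candidate theta with probability proportional to
    exp(epsilon * (-u(theta)) / (2 * sensitivity)).  The minus sign
    appears because u, an expected pMSE, decreases for more desirable
    outputs, while the exponential mechanism favors high scores."""
    scores = np.array([-epsilon * u(theta) / (2.0 * sensitivity)
                       for theta in candidates])
    scores -= scores.max()              # stabilize before exponentiating
    probabilities = np.exp(scores)
    probabilities /= probabilities.sum()
    index = np.random.default_rng().choice(len(candidates), p=probabilities)
    return candidates[index]
```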


Acknowledgments

This work is supported by the U.S. Census Bureau and NSF grants BCS-0941553 and SES-1534433 to the Department of Statistics at the Pennsylvania State University. Thanks to Bharath Sriperumbudur for special aid in deriving the final form of the theoretical proof.

Author information

Corresponding author

Correspondence to Joshua Snoke.


Appendices

10 Appendix: Proof of Theorem 4.1

Proof

We first show that the expected value defining the quality function, as well as the approximation of it that we use in practice, can be bounded above by the supremum across all possible datasets \(X^s\) generated using \(\theta \).

$$\begin{aligned} \varDelta u = \underset{\theta }{sup} \; \underset{\delta (X,X^\prime )=1}{sup} \; |u(X, \theta ) - u(X^\prime , \theta )| \end{aligned}$$
(5)

can be rewritten as

$$\begin{aligned} \varDelta u = \underset{\theta }{sup} \; \underset{\delta (X,X^\prime )=1}{sup} \; |E_\theta [pMSE(X, X^s_\theta )|X,\theta ] - E_\theta [pMSE(X^\prime , X^s_\theta )|X^\prime ,\theta ]| \end{aligned}$$
(6)

where \(u(X, \theta ) = E_\theta [pMSE(X, X^s_\theta )|X, \theta ]\). Because \(X^s_\theta \) is generated using \(\theta \) alone, both expectations are taken with respect to the same distribution over \(X^s_\theta \); since the absolute value is a convex function, we can apply Jensen’s inequality and get

$$\begin{aligned} \le \underset{\theta }{sup} \; \underset{\delta (X,X^\prime )=1}{sup} \; E_\theta [|pMSE(X, X^s_\theta ) - pMSE(X^\prime , X^s_\theta )||X,\theta ]. \end{aligned}$$
(7)

Then, taking the supremum over all possible datasets \(X^s_\theta \), we obtain

$$\begin{aligned} \le \underset{X^s_\theta }{sup} \; \underset{\delta (X,X^\prime )=1}{sup} \; |pMSE(X,X^s_\theta ) - pMSE(X^\prime , X^s_\theta )|. \end{aligned}$$
(8)

This also bounds our approximation of the expected value that we propose to use in practice, since the supremum is also greater than or equal to the sample mean.
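In practice, this expectation is approximated by a sample mean over repeated synthetic draws. A minimal sketch, assuming a hypothetical synthesize(theta, n) sampler for the synthesizer parameterized by \(\theta \) and a pmse(X, Xs) implementation of the statistic:

```python
def approx_u(X, theta, synthesize, pmse, num_draws=20):
    """Monte Carlo approximation of u(X, theta) = E[pMSE(X, X^s_theta)]:
    draw several synthetic datasets from theta and average their pMSE
    against the observed data X."""
    total = 0.0
    for _ in range(num_draws):
        total += pmse(X, synthesize(theta, len(X)))
    return total / num_draws
```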

Now writing this explicitly in terms of the CART model, we get

$$\begin{aligned} \underset{a_i, \; m_i, \; a_i^\prime , \; m_i^\prime }{sup} \; \frac{1}{2n}\Bigg |\varSigma _{i = 1}^{D + 1} m_i\Big (\frac{a_i}{m_i} - 0.5\Big )^2 - m_i^\prime \Big (\frac{a_i^\prime }{m_i^\prime } - 0.5\Big )^2\Bigg | \end{aligned}$$
(9)

where \(a_i\), \(m_i\), and D are defined as before, and \(a_i^\prime \) and \(m_i^\prime \) are the corresponding values for the model fit using \(X^\prime \). Expanding this we get

$$\begin{aligned} \underset{a_i, \; m_i, \; a_i^\prime , \; m_i^\prime }{sup} \; \frac{1}{2n}\Bigg |\varSigma _{i = 1}^{D + 1} \Big (\frac{a_i^2}{m_i} - a_i + 0.25m_i\Big ) - \Big (\frac{a_i^{\prime 2}}{m_i^\prime } - a_i^\prime + 0.25m_i^\prime \Big )\Bigg | \end{aligned}$$
(10)

and we can cancel the third terms because \(\varSigma _{i=1}^{D+1}m_i = \varSigma _{i=1}^{D+1}m_i^\prime \). When we multiply by 2n, the remaining term inside the absolute value is equivalent to the sensitivity of the Gini impurity, i.e.,

$$\begin{aligned} \underset{a_i, \; m_i, \; a_i^\prime , \; m_i^\prime }{sup} \; \Bigg |GI(X, X^s, D) - GI(X^\prime , X^s, D) \Bigg | = \varDelta GI \end{aligned}$$
(11)
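The quantities in (9)–(11) are easy to compute directly from the leaf counts. In the sketch below (our illustration, not the paper's code), pmse_from_counts implements the leaf-count form of (9) and gini_impurity the Gini form; when the synthetic sample has n records, so that \(\varSigma a_i = n\) and \(\varSigma m_i = 2n\), the two satisfy \(pMSE = \frac{1}{4} - \frac{GI}{2n}\), so differences in pMSE between two fits equal differences in \(GI/(2n)\) up to sign, matching (11).

```python
import numpy as np

def pmse_from_counts(a, m, n):
    """Leaf-count form of the pMSE, as in (9): every record in leaf i
    receives propensity a_i / m_i, and the combined data has 2n rows."""
    a, m = np.asarray(a, dtype=float), np.asarray(m, dtype=float)
    return float(np.sum(m * (a / m - 0.5) ** 2) / (2.0 * n))

def gini_impurity(a, m):
    """Total Gini impurity of the leaves, Sigma_i a_i (1 - a_i / m_i)."""
    a, m = np.asarray(a, dtype=float), np.asarray(m, dtype=float)
    return float(np.sum(a * (1.0 - a / m)))

# Worked check with three leaves, n = 5 original and n = 5 synthetic rows:
a, m, n = [3, 1, 1], [4, 3, 3], 5
assert abs(pmse_from_counts(a, m, n) - (0.25 - gini_impurity(a, m) / (2 * n))) < 1e-12
```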

By bounding the impurity, we bound the pMSE. We can rewrite (11) as

$$\begin{aligned} \Bigg |\underset{D}{min} \; GI(X, X^s, D) - \underset{D}{min} \; GI(X^\prime , X^s, D) \Bigg | \end{aligned}$$
(12)

since the optimal CART model finds the minimum impurity across all choices of D. The greatest possible difference is then the difference between these two optima, which we can bound above by

$$\begin{aligned} \le \underset{D}{sup} \; \Bigg |GI(X, X^s, D) - GI(X^\prime , X^s, D) \Bigg |. \end{aligned}$$
(13)

Let \(X^{comb}\) and \(X^{\prime comb}\) be the combined data matrices as described in Algorithm 1, including the 0–1 outcome variable. Recall that only one record differs between \(X^{comb}\) and \(X^{\prime comb}\) (the total number of records stays fixed), and that record is labeled 0. Consider the D optimal split points producing \(D + 1\) nodes on \(X^{comb}\), with \(a_i\) records labeled 1 and \(m_i\) total records in each node; applying the same splits to \(X^{\prime comb}\) gives total counts \(\tilde{m}_i\) such that \(\exists \; j \ne k \ne l_1 \ne ... \ne l_{D-1} \; s.t. \; \tilde{m}_j - m_j = m_k - \tilde{m}_k = 1, \; \tilde{m}_{l_v} = m_{l_v}\) for \(v = \{1,..., D - 1\}\). In the same way, for the D optimal split points producing \(D + 1\) nodes on \(X^{\prime comb}\), with \(a^\prime _i\) records labeled 1 and \(m^\prime _i\) total records in each node, applying those splits to \(X^{comb}\) gives total counts \(\tilde{m}^\prime _i\) such that \(\exists \; j^\prime \ne k^\prime \ne l^\prime _1 \ne ... \ne l^\prime _{D-1} \; s.t. \; \tilde{m}^\prime _{j^\prime } - m^\prime _{j^\prime } = m^\prime _{k^\prime } - \tilde{m}^\prime _{k^\prime } = 1, \; \tilde{m}^\prime _{l^\prime _v} = m^\prime _{l^\prime _v}\) for \(v = \{1,..., D - 1\}\). Put simply: after changing one record, the discrete counts change by at most one in two of the nodes and do not change in the other \(D - 1\) nodes.

Because the CART model produces the D splits that minimize the impurity, we know both that

$$\begin{aligned} \varSigma _{i = 1}^{D + 1}{a^\prime _i\Big (1-\frac{a^\prime _i}{m^\prime _i}\Big )} \le \varSigma _{i = 1}^{D + 1}{a_i\Big (1-\frac{a_i}{\tilde{m}_i}\Big )} \end{aligned}$$
(14)

and

$$\begin{aligned} \varSigma _{i = 1}^{D + 1}{a_i\Big (1-\frac{a_i}{m_i}\Big )} \le \varSigma _{i = 1}^{D + 1}{a^\prime _i\Big (1-\frac{a^\prime _i}{\tilde{m}^\prime _i}\Big )}. \end{aligned}$$
(15)

Inequality (14) says that, after changing one record, fitting new split points must do at least as well as keeping the previous splits with the changed counts. Inequality (15) says that the original split points must do at least as well as the new splits applied to the original data; if that were not the case, the original split points would never have been chosen in the first place. These two facts lead to the final step.

Because we have an absolute value, we consider two cases. First suppose \(GI(X, X^s, D) \ge GI(X^\prime , X^s, D)\). Applying (15) and then the node-count relations above,

$$\begin{aligned} GI(X, X^s, D) - GI(X^\prime , X^s, D)&\le \varSigma _{i = 1}^{D + 1}{a^\prime _i\Big (1-\frac{a^\prime _i}{\tilde{m}^\prime _i}\Big )} - \varSigma _{i = 1}^{D + 1}{a^\prime _i\Big (1-\frac{a^\prime _i}{m^\prime _i}\Big )} \\&= \frac{a^{\prime 2}_{j^\prime }}{m^\prime _{j^\prime }(m^\prime _{j^\prime } + 1)} - \frac{a^{\prime 2}_{k^\prime }}{m^\prime _{k^\prime }(m^\prime _{k^\prime } - 1)} \le \max \Bigg \{\frac{a^{\prime 2}_{j^\prime }}{m^\prime _{j^\prime }(m^\prime _{j^\prime } + 1)}, \; \frac{a^{\prime 2}_{k^\prime }}{m^\prime _{k^\prime }(m^\prime _{k^\prime } - 1)}\Bigg \} \le 2 \end{aligned}$$

(16)

The last step we know because \(a_i \le m_i\) and \(\frac{n^2}{n(n-1)} \le 2\) for any \(n \ge 2\). In the second case, \(GI(X^\prime , X^s, D) \ge GI(X, X^s, D)\), and applying (14) symmetrically gives

$$\begin{aligned} GI(X^\prime , X^s, D) - GI(X, X^s, D)&\le \varSigma _{i = 1}^{D + 1}{a_i\Big (1-\frac{a_i}{\tilde{m}_i}\Big )} - \varSigma _{i = 1}^{D + 1}{a_i\Big (1-\frac{a_i}{m_i}\Big )} \\&= \frac{a^2_j}{m_j(m_j + 1)} - \frac{a^2_k}{m_k(m_k - 1)} \le \max \Bigg \{\frac{a^2_j}{m_j(m_j + 1)}, \; \frac{a^2_k}{m_k(m_k - 1)}\Bigg \} \le 2. \end{aligned}$$

(17)

Finally, this gives us \(\varDelta GI \le 2 \implies \frac{\varDelta GI}{2n} = \varDelta u \le \frac{1}{n}\).
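The final bound can also be sanity-checked numerically. The following sketch is our check, not the paper's code, and uses the simplification that the split points (and hence the \(a_i\)) are held fixed while one 0-labeled record moves between two leaves, as in the argument above; the Gini difference stays below 2 over random configurations.

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(a, m):
    """Total Gini impurity, Sigma_i a_i (1 - a_i / m_i)."""
    return float(np.sum(a * (1.0 - a / m)))

max_diff = 0.0
for _ in range(100_000):
    leaves = rng.integers(2, 7)                  # D + 1 leaves
    m = rng.integers(2, 50, size=leaves)         # total records per leaf
    a = rng.integers(1, m)                       # labeled-1 records, a_i < m_i
    j, k = rng.choice(leaves, size=2, replace=False)
    m_new = m.copy()
    m_new[j] += 1                                # the moved 0-labeled record
    m_new[k] -= 1                                # leaves leaf k for leaf j
    max_diff = max(max_diff, abs(gini(a, m) - gini(a, m_new)))

print(max_diff)  # stays below 2, consistent with Delta GI <= 2
```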

11 Appendix: Full Simulation Results

Fig. 4. Boxplots showing simulation results. The rows indicate the different coefficients, and the columns indicate different values of \(\epsilon \). Boxplots are also subdivided within methods by the tree depth (for the pMSE mechanism) and the bound (for the other methods).


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Snoke, J., Slavković, A. (2018). pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_10


  • DOI: https://doi.org/10.1007/978-3-319-99771-1_10


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99770-4

  • Online ISBN: 978-3-319-99771-1

  • eBook Packages: Computer Science, Computer Science (R0)
