ABSTRACT
Data cubes play an essential role in data analysis and decision support. In a data cube, data from a fact table is aggregated on subsets of the table's dimensions, forming a collection of smaller tables called cuboids. When the fact table includes sensitive data such as salary or diagnosis, publishing even a subset of its cuboids may compromise individuals' privacy. In this paper, we address this problem using differential privacy (DP), which provides provable privacy guarantees for individuals by adding noise to query answers. We choose an initial subset of cuboids to compute directly from the fact table, injecting DP noise as usual, and then compute the remaining cuboids from this initial set. Given a fixed privacy guarantee, we show that it is NP-hard to choose the initial set of cuboids so that the maximal noise over all published cuboids is minimized, or so that the number of cuboids with noise below a given threshold (precise cuboids) is maximized. We provide an efficient procedure, with running time polynomial in the number of cuboids, for selecting the initial set such that either the maximal noise over all published cuboids is within a factor (ln|L| + 1)^2 of the optimal, where |L| is the number of cuboids to be published, or the number of precise cuboids is within a factor (1 - 1/e) of the optimal. We also show how to enforce consistency among the published cuboids while simultaneously improving their utility (reducing error). In an empirical evaluation on real and synthetic data, we report the error of different publishing algorithms and show that our approaches significantly outperform the baselines.
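The two-step idea described above (add noise to an initial set of cuboids, then derive the rest from them) can be illustrated with a minimal sketch. This is not the paper's algorithm. It assumes a count measure, neighboring tables that differ by adding or removing one record (so releasing k initial cuboids has L1 sensitivity k), and Laplace noise of scale k/epsilon. The function names, the toy fact table, and the choice of a single initial cuboid are all hypothetical.

```python
import random
from collections import Counter
from itertools import product


def cuboid(fact_rows, dims):
    """Exact count cuboid: group the fact table by the given dimension indices."""
    return Counter(tuple(row[d] for d in dims) for row in fact_rows)


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_cuboid(fact_rows, dims, domains, scale, rng):
    """Add independent Laplace noise to every cell, including empty cells."""
    exact = cuboid(fact_rows, dims)
    cells = product(*(domains[d] for d in dims))
    return {cell: exact.get(cell, 0) + laplace(scale, rng) for cell in cells}


def roll_up(noisy, dims, target_dims):
    """Derive a coarser cuboid by summing noisy cells over the dropped dimensions."""
    keep = [dims.index(d) for d in target_dims]
    out = {}
    for cell, value in noisy.items():
        key = tuple(cell[i] for i in keep)
        out[key] = out.get(key, 0.0) + value
    return out


# Toy fact table (hypothetical): dimension 0 = sex, dimension 1 = age group.
domains = {0: ["F", "M"], 1: ["20s", "30s", "40s"]}
rows = [("F", "20s"), ("F", "30s"), ("M", "30s"), ("M", "30s"), ("F", "40s")]

rng = random.Random(0)
epsilon = 1.0
initial = [(0, 1)]               # assumed initial set: only the finest cuboid
scale = len(initial) / epsilon   # assumed sensitivity: one unit per initial cuboid

base = noisy_cuboid(rows, (0, 1), domains, scale, rng)
by_sex = roll_up(base, (0, 1), (0,))   # computed from noisy data, not from the fact table
```

Because `by_sex` is computed only from `base`, it costs no extra privacy budget and agrees with `base` by construction. The price is accumulated noise: each `by_sex` cell sums three independent Laplace samples, so its noise variance is three times that of a single `base` cell. Choosing the initial set trades this accumulation against the larger per-cell noise that comes from computing more cuboids directly.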
Index Terms
- Differentially private data cubes: optimizing noise sources and consistency