Abstract
Huge amounts of unstructured data, including images, video, audio, and text, are generated and shared everywhere, and protecting the sensitive personal information they contain, such as human faces, voiceprints, and authorship, is a challenge. Differential privacy is the standard privacy protection technology, providing rigorous privacy guarantees for various kinds of data. This survey summarizes and analyzes differential privacy solutions that protect unstructured data content before it is shared with untrusted parties. These methods represent unstructured data as vectors, obfuscate the vectors, and then reconstruct the data from the obfuscated vectors. We summarize the specific privacy models and mechanisms, together with the challenges they may face. We also discuss their privacy guarantees against AI attacks and their utility losses. Finally, we discuss several possible directions for future research.
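The represent, obfuscate, reconstruct pipeline described in the abstract can be illustrated with a minimal sketch. It is not a method taken from the survey: it assumes a plain Laplace mechanism applied to an L1-clipped vector, followed by nearest-neighbor reconstruction over a toy vocabulary. The function names (`clip_l1`, `obfuscate`, `reconstruct`), the two-dimensional vectors, and the values of `epsilon` and `bound` are all hypothetical choices made for illustration.

```python
import math
import random


def clip_l1(vec, bound):
    """Scale vec so that its L1 norm is at most `bound`."""
    norm = sum(abs(x) for x in vec)
    if norm <= bound:
        return list(vec)
    return [x * bound / norm for x in vec]


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def obfuscate(vec, epsilon, bound, rng):
    """Laplace mechanism on a clipped vector.

    After clipping, any two inputs differ by at most 2 * bound in L1 norm,
    so per-coordinate Laplace noise with scale 2 * bound / epsilon gives
    epsilon-differential privacy for the released vector.
    """
    clipped = clip_l1(vec, bound)
    scale = 2.0 * bound / epsilon
    return [x + laplace_noise(scale, rng) for x in clipped]


def reconstruct(noisy, vocabulary):
    """Map the noisy vector to the nearest known item (post-processing only)."""
    return min(vocabulary, key=lambda item: math.dist(noisy, vocabulary[item]))


# Toy vocabulary of 2-D "embeddings"; purely illustrative values.
vocabulary = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}

rng = random.Random(0)  # fixed seed only to make the demo repeatable
noisy = obfuscate(vocabulary["cat"], epsilon=5.0, bound=1.0, rng=rng)
released = reconstruct(noisy, vocabulary)
print(noisy, released)
```

Because reconstruction only reads the noisy vector, it is post-processing and does not weaken the guarantee obtained in `obfuscate`. A smaller `epsilon` gives stronger privacy but makes it more likely that the released item differs from the original, which is the privacy and utility trade-off the abstract refers to.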