ABSTRACT
Search engines play an important role on the Web, helping users find relevant resources and answers to their questions. Search logs are also of great utility to researchers: a number of recent research efforts have relied on them to build prediction and inference models, for applications ranging from economics and marketing to public health surveillance. However, companies rarely release search logs, in part because of the privacy issues involved, as such logs are inherently hard to anonymize. As a result, it is very difficult for researchers to gain access to search data, and those who do are fully dependent on the company providing it. Aiming to overcome these issues, this paper presents Private Data Donor (PDD), a decentralized and private-by-design platform providing crowd-sourced Web searches to researchers. We build on a cryptographic protocol for privacy-preserving data aggregation, and address a few practical challenges to make the system reliable when users disconnect or stop using the platform. We discuss how PDD can be used to build a flu monitoring model, and evaluate the impact of the privacy-preserving layer on the quality of the results. Finally, we present the implementation of our platform, as a browser extension and a server, and report on a pilot deployment with real users.