DOI: 10.1145/3308558.3313474
research-article

Privacy-Preserving Crowd-Sourcing of Web Searches with Private Data Donor

Published: 13 May 2019

ABSTRACT

Search engines play an important role on the Web, helping users find relevant resources and answers to their questions. At the same time, search logs can be of great utility to researchers. For instance, a number of recent research efforts have relied on them to build prediction and inference models, for applications ranging from economics and marketing to public health surveillance. However, companies rarely release search logs, in part because of the privacy issues involved: search logs are inherently hard to anonymize. As a result, it is very difficult for researchers to obtain search data, and those who do are fully dependent on the company providing it. To overcome these issues, this paper presents Private Data Donor (PDD), a decentralized and private-by-design platform providing crowd-sourced Web searches to researchers. We build on a cryptographic protocol for privacy-preserving data aggregation, and address a few practical challenges to make the system reliable when users disconnect or stop using the platform. We discuss how PDD can be used to build a flu monitoring model, and evaluate the impact of the privacy-preserving layer on the quality of the results. Finally, we present the implementation of our platform, as a browser extension and a server, and report on a pilot deployment with real users.
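
The abstract does not spell out the aggregation protocol, so the following is only a minimal sketch of one common idea behind privacy-preserving aggregation: each user adds a blinding value to a private count, and the blinding values are constructed so that they cancel when all reports are summed. The modulus, the function names, and the use of locally generated random numbers in place of a real key agreement between users are all assumptions made for illustration. This is not PDD's actual protocol.

```python
import secrets

MODULUS = 2**32  # hypothetical modulus; all arithmetic is done mod this value


def pairwise_masks(num_users: int) -> list[int]:
    """Return one blinding value per user; the values sum to 0 mod MODULUS."""
    masks = [0] * num_users
    for i in range(num_users):
        for j in range(i + 1, num_users):
            # Stand-in for a secret that users i and j would derive jointly
            # (for example via a key agreement); here it is simply random.
            shared = secrets.randbelow(MODULUS)
            masks[i] = (masks[i] + shared) % MODULUS
            masks[j] = (masks[j] - shared) % MODULUS
    return masks


def blind(count: int, mask: int) -> int:
    """Hide a private count behind the user's blinding value."""
    return (count + mask) % MODULUS


def aggregate(blinded_reports: list[int]) -> int:
    """Sum the blinded reports; the masks cancel, leaving the true total."""
    return sum(blinded_reports) % MODULUS


# Hypothetical per-user counts of searches matching some keyword.
counts = [3, 0, 5, 1]
masks = pairwise_masks(len(counts))
reports = [blind(c, m) for c, m in zip(counts, masks)]
total = aggregate(reports)
```

In this sketch the aggregator learns the total but only sees individual reports that are masked by random values. The sketch also shows why disconnecting users are a practical problem: if one report is missing, the remaining masks no longer cancel and the sum is meaningless.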


  • Published in

    WWW '19: The World Wide Web Conference
    May 2019
    3620 pages
    ISBN: 9781450366748
    DOI: 10.1145/3308558

    Copyright © 2019 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
