DOI: 10.1145/1281192.1281195
Article

Challenges in mining social network data: processes, privacy, and paradoxes

Published: 12 August 2007

ABSTRACT

The proliferation of rich social media, on-line communities, and collectively produced knowledge resources has accelerated the convergence of technological and social networks, producing environments that reflect both the architecture of the underlying information systems and the social structure of their members. In studying the consequences of these developments, we are faced with the opportunity to analyze social network data at unprecedented levels of scale and temporal resolution; this has led to a growing body of research at the intersection of the computing and social sciences.

We discuss some of the current challenges in the analysis of large-scale social network data, focusing on two themes in particular: the inference of social processes from data, and the problem of maintaining individual privacy in studies of social networks. While early research on this type of data focused on structural questions, recent work has extended this to consider the social processes that unfold within the networks. Particular lines of investigation have focused on processes in on-line social systems related to communication [1, 22], community formation [2, 8, 16, 23], information-seeking and collective problem-solving [20, 21, 18], marketing [12, 19, 24, 28], the spread of news [3, 17], and the dynamics of popularity [29]. There are a number of fundamental issues, however, for which we have relatively little understanding, including the extent to which the outcomes of these types of social processes are predictable from their early stages (see e.g. [29]), the differences between properties of individuals and properties of aggregate populations in these types of data, and the extent to which similar social phenomena in different domains have uniform underlying explanations.

The second theme we pursue is concerned with the problem of privacy. While much of the research on large-scale social systems has been carried out on data that is public, some of the richest emerging sources of social interaction data come from settings such as e-mail, instant messaging, or phone communication in which users have strong expectations of privacy. How can such data be made available to researchers while protecting the privacy of the individuals represented in the data? Many of the standard approaches here are variations on the principle of anonymization: the names of individuals are replaced with meaningless unique identifiers, so that the network structure is preserved while private information is suppressed.
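
As a rough illustration of the anonymization principle described above, the following Python sketch replaces each name in a toy edge list with a randomly assigned integer identifier. The function names `anonymize` and `degree_sequence` and the four-person network are hypothetical, introduced only for this example; the sketch is not taken from any of the cited systems.

```python
import random
from collections import Counter

def anonymize(edges, seed=None):
    """Replace each node name with a meaningless integer ID, keeping the edges."""
    rng = random.Random(seed)
    nodes = sorted({n for e in edges for n in e})
    ids = list(range(len(nodes)))
    rng.shuffle(ids)                      # IDs carry no information about names
    mapping = dict(zip(nodes, ids))       # kept secret by the data curator
    anon_edges = {frozenset((mapping[u], mapping[v])) for u, v in edges}
    return anon_edges, mapping

def degree_sequence(edge_list):
    """Sorted list of node degrees; a purely structural property of the graph."""
    deg = Counter()
    for e in edge_list:
        for n in e:
            deg[n] += 1
    return sorted(deg.values())

# Hypothetical toy network (names are illustrative only).
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "carol")]
anon_edges, mapping = anonymize(edges, seed=0)

# The names are gone, but structural properties survive the relabeling.
print(degree_sequence(edges) == degree_sequence(anon_edges))
```

The released graph contains only integers, yet every structural property of the original, such as the degree sequence, is unchanged. That preserved structure is what makes the data useful to researchers, and it is also what an adversary can exploit.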

In recent joint work with Lars Backstrom and Cynthia Dwork, we have identified some fundamental limitations on the power of network anonymization to ensure privacy [7]. In particular, we describe a family of attacks such that even from a single anonymized copy of a social network, it is possible for an adversary to learn whether edges exist or not between specific targeted pairs of nodes. The attacks are based on the uniqueness of small random subgraphs embedded in an arbitrary network, using ideas related to those found in arguments from Ramsey theory [6, 14]. Combined with other recent examples of privacy breaches in data containing rich textual or time-series information [9, 26, 27, 30], these results suggest that anonymization contains pitfalls even in very simple settings. In this way, our approach can be seen as a step toward understanding how techniques of privacy-preserving data mining (see e.g. [4, 5, 10, 11, 13, 15, 25] and the references therein) can inform how we think about the protection of even the most skeletal social network data.
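
The idea behind such an attack can be sketched in a few lines of Python. This is a heavily simplified toy version, not the construction from [7]: it assumes the adversary can create `k` fake accounts before release, uses a brute-force search over all node tuples instead of an efficient one, and runs on a hypothetical 8-node cycle standing in for the real network. The function names `plant`, `release`, and `attack` are invented for this example.

```python
import itertools
import random

def plant(real_edges, u, v, k, rng):
    """Adversary, before release: add k fake accounts wired with a random
    internal pattern, and link the first two to the targets u and v."""
    fake = [f"x{i}" for i in range(k)]
    internal = {frozenset((fake[i], fake[i + 1])) for i in range(k - 1)}  # path backbone
    for a, b in itertools.combinations(fake, 2):
        if rng.random() < 0.5:            # each remaining pair is a coin flip
            internal.add(frozenset((a, b)))
    external = {frozenset((fake[0], u)), frozenset((fake[1], v))}
    edges = {frozenset(e) for e in real_edges} | internal | external
    return edges, fake, internal

def release(edges, rng):
    """Curator: replace every label with a meaningless integer ID."""
    nodes = sorted({n for e in edges for n in e}, key=str)
    ids = list(range(len(nodes)))
    rng.shuffle(ids)
    relabel = dict(zip(nodes, ids))
    return {frozenset(relabel[n] for n in e) for e in edges}

def attack(anon_edges, fake, internal):
    """Adversary, after release: find the planted pattern by brute force and
    report whether the two targets are linked (None if ambiguous)."""
    adj = {}
    for e in anon_edges:
        a, b = tuple(e)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    k = len(fake)
    pattern = [[frozenset((fake[i], fake[j])) in internal for j in range(k)]
               for i in range(k)]
    # Expected degree: internal edges, plus one target link for the first two.
    want_deg = [sum(pattern[i]) + (1 if i < 2 else 0) for i in range(k)]
    answers = set()
    for cand in itertools.permutations(adj, k):
        if any(len(adj[cand[i]]) != want_deg[i] for i in range(k)):
            continue
        if any((cand[j] in adj[cand[i]]) != pattern[i][j]
               for i in range(k) for j in range(i + 1, k)):
            continue
        (t0,) = adj[cand[0]] - set(cand)  # the only neighbor outside the pattern
        (t1,) = adj[cand[1]] - set(cand)
        answers.add(t1 in adj[t0])
    return answers.pop() if len(answers) == 1 else None

real = [(i, (i + 1) % 8) for i in range(8)]   # toy "real" network: an 8-cycle
rng = random.Random(7)
edges, fake, internal = plant(real, u=0, v=1, k=4, rng=rng)
released = release(edges, rng)
print(attack(released, fake, internal))       # nodes 0 and 1 are adjacent in the cycle
```

The adversary never sees the relabeling, only the released edge set and its own record of the pattern it planted. In this toy setting the planted pattern is the only 4-node set that matches, so the adversary recovers the anonymized identities of the two targets and can check for an edge between them.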

Supplemental Material

p4-kleinberg-200.mov (mov, 125.6 MB)

p4-kleinberg-768.mov (mov, 421 MB)

References

  1. Lada A. Adamic and Eytan Adar. How to search a social network. Social Networks, 27(3):187--203, 2005.
  2. Lada A. Adamic, Orkut Buyukkokten, and Eytan Adar. A social network caught in the web. First Monday, 8(6), 2003.
  3. Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan M. Lukose. Implicit structure and the dynamics of blogspace. In Workshop on the Weblogging Ecosystem, 2004.
  4. Dakshi Agrawal and Charu C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proc. 20th ACM Symposium on Principles of Database Systems, 2001.
  5. Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In Proc. ACM SIGMOD International Conference on Management of Data, pages 439--450, 2000.
  6. Noga Alon and Joel Spencer. The Probabilistic Method. John Wiley & Sons, second edition, 2000.
  7. Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography. In Proc. 16th International World Wide Web Conference, 2007.
  8. Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: Membership, growth, and evolution. In Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
  9. Michael Barbaro and Tom Zeller Jr. A face is exposed for AOL searcher no. 4417749. New York Times, 9 August 2006.
  10. Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In Proc. 24th ACM Symposium on Principles of Database Systems, pages 128--138, 2005.
  11. Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Proc. 22nd ACM Symposium on Principles of Database Systems, pages 202--210, 2003.
  12. Pedro Domingos and Matt Richardson. Mining the network value of customers. In Proc. 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 57--66, 2001.
  13. Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proc. 3rd Theory of Cryptography Conference, pages 265--284, 2006.
  14. Paul Erdös. Some remarks on the theory of graphs. Bulletin of the AMS, 53:292--294, 1947.
  15. Alexandre V. Evfimievski, Johannes Gehrke, and Ramakrishnan Srikant. Limiting privacy breaches in privacy preserving data mining. In Proc. 22nd ACM Symposium on Principles of Database Systems, pages 211--222, 2003.
  16. Scott A. Golder, Dennis Wilkinson, and Bernardo A. Huberman. Rhythms of social interaction: Messaging within a massive online network. In Proc. 3rd International Conference on Communities and Technologies, 2007.
  17. Daniel Gruhl, David Liben-Nowell, R. V. Guha, and Andrew Tomkins. Information diffusion through blogspace. In Proc. 13th International World Wide Web Conference, 2004.
  18. Michael Kearns, Siddharth Suri, and Nick Montfort. An experimental study of the coloring problem on human subject networks. Science, 313(5788):824--827, 2006.
  19. David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence in a social network. In Proc. 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137--146, 2003.
  20. Jon Kleinberg. Complex networks and decentralized search algorithms. In Proc. International Congress of Mathematicians, 2006.
  21. Jon Kleinberg and Prabhakar Raghavan. Query incentive networks. In Proc. 46th IEEE Symposium on Foundations of Computer Science, pages 132--141, 2005.
  22. Gueorgi Kossinets and Duncan Watts. Empirical analysis of an evolving social network. Science, 311:88--90, 2006.
  23. Ravi Kumar, Jasmine Novak, and Andrew Tomkins. Structure and evolution of online social networks. In Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 611--617, 2006.
  24. Jure Leskovec, Lada Adamic, and Bernardo Huberman. The dynamics of viral marketing. In Proc. 7th ACM Conference on Electronic Commerce, 2006.
  25. Nina Mishra and Mark Sandler. Privacy via pseudorandom sketches. In Proc. 25th ACM Symposium on Principles of Database Systems, pages 143--152, 2006.
  26. Arvind Narayanan and Vitaly Shmatikov. How to break anonymity of the Netflix Prize dataset, October 2006. arXiv cs/0610105.
  27. Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. Anti-aliasing on the web. In Proc. 13th International World Wide Web Conference, pages 30--39, 2004.
  28. Matt Richardson and Pedro Domingos. Mining knowledge-sharing sites for viral marketing. In Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 61--70, 2002.
  29. Matthew Salganik, Peter Dodds, and Duncan Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311:854--856, 2006.
  30. Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. J. Law Med. Ethics, 25, 1997.

    • Published in

      KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
      August 2007
      1080 pages
      ISBN: 9781595936097
      DOI: 10.1145/1281192

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Article

      Acceptance Rates

      KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%. Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%.
