Abstract
Nodes in real-world networks organize into densely linked communities where edges appear with high concentration among the members of the community. Identifying such communities of nodes has proven to be a challenging task due to a plethora of definitions of network communities, intractability of methods for detecting them, and the issues with evaluation which stem from the lack of a reliable gold-standard ground-truth. In this paper, we distinguish between structural and functional definitions of network communities. Structural definitions of communities are based on connectivity patterns, like the density of connections between the community members, while functional definitions are based on (often unobserved) common function or role of the community members in the network. We argue that the goal of network community detection is to extract functional communities based on the connectivity structure of the nodes in the network. We then identify networks with explicitly labeled functional communities to which we refer as ground-truth communities. In particular, we study a set of 230 large real-world social, collaboration, and information networks where nodes explicitly state their community memberships. For example, in social networks, nodes explicitly join various interest-based social groups. We use such social groups to define a reliable and robust notion of ground-truth communities. We then propose a methodology, which allows us to compare and quantitatively evaluate how different structural definitions of communities correspond to ground-truth functional communities. We study 13 commonly used structural definitions of communities and examine their sensitivity, robustness and performance in identifying the ground-truth. We show that the 13 structural definitions are heavily correlated and naturally group into four classes. We find that two of these definitions, Conductance and Triad participation ratio, consistently give the best performance in identifying ground-truth communities. We also investigate a task of detecting communities given a single seed node. We extend the local spectral clustering algorithm into a heuristic parameter-free community detection method that easily scales to networks with more than 100 million nodes. The proposed method achieves 30 % relative improvement over current local clustering methods.
Similar content being viewed by others
References
Abrahao BD, Soundarajan S, Hopcroft JE, Kleinberg R (2012) On the separability of structural classes of communities. In KDD ’12: proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp 624–632
Ahn Y-Y, Bagrow JP, Lehmann S (2010) Link communities reveal multi-scale complexity in networks. Nature 466:761–764
Andersen R, Chung F, Lang K (2006) Local graph partitioning using PageRank vectors. In FOCS ’06: proceedings of the 47th annual IEEE symposium on foundations of computer science, pp 475–486
Andersen R, Lang K (2006) Communities from seed sets. In: WWW ’06 proceedings of the 15th international conference on, World Wide Web, pp 223–232
Backstrom L, Huttenlocher D, Kleinberg J, Lan X (2006) Group formation in large social networks: membership, growth and evolution. In KDD ’06: proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 44–54
Danon L, Duch J, Diaz-Guilera A, Arenas A (2005) Comparing community structure identification. J Stat Mech Theory Exp 29(09):P09008
Dhillon I, Guan Y, Kulis B (2007) Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans Pattern Anal Mach Intell 29(11):1944–1957
Feld SL (1981) The focused organization of social ties. Am J Sociol 86(5):1015–1035
Flake G, Lawrence S, Giles C (2000) Efficient identification of web communities. In KDD ’00: proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, pp 150–160
Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174
Fortunato S, Barthélemy M (2007) Resolution limit in community detection. Proc Nat Acad Sci USA 104(1):36–41
Girvan M, Newman M (2002) Community structure in social and biological networks. Proc Nat Acad Sci USA 99(12):7821–7826
Gleich DF, Seshadhri C (2012) Neighborhoods are good communities. In KDD ’12: proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp 597–605
Granovetter MS (1973) The strength of weak ties. Am J Sociol 78:1360–1380
Kairam S, Wang D, Leskovec J (2012) The life and death of online groups: predicting group growth and longevity. In WSDM ’12: ACM international conference on web search and data mining
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20:359–392
Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2:83–97
Leskovec J, Adamic L, Huberman B (2007) The dynamics of viral marketing. ACM Trans Web 1(1):5
Leskovec J, Lang K, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In WWW ’10: proceedings of the 19th international conference on World Wide Web
Leskovec J, Lang KJ, Dasgupta A, Mahoney MW (2009) Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math 6(1):29–123
Lin W, Kong X, Yu PS, Wu Q, Jia Y, Li C (2012) Community detection in incomplete information networks. In WWW ’12: proceedings of the 21st international conference on, World Wide Web, pp 341–350
Meilǎ M (2005) Comparing clusterings: an axiomatic view. In ICML ’05: proceedings of the 22nd international conference on machine learning. New York, NY, USA, pp 577–584
Mislove A, Marcon M, Gummadi KP, Druschel P, Bhattacharjee B (2007) Measurement and analysis of online social networks. In IMC ’07: proceedings of the 7th ACM SIGCOMM conference on internet, measurement, pp 29–42
Newman M (2006) Modularity and community structure in networks. Proc Nat Acad Sci USA 103(23):8577–8582
Newman M, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69:026113
Palla G, Derényi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818
Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D (2004) Defining and identifying communities in networks. Proc Nat Acad Sci USA 101(9):2658–2663
Ren Y, Kraut R, Kiesler S (2007) Applying common identity and bond theory to design of online communities. Organ Stud 28(3):377–408
Schaeffer S (2007) Graph clustering. Comp Sci Rev 1(1):27–64
Shi C, Yu PS, Cai Y, Yan Z, Wu B (2011) On selection of objective functions in multi-objective community detection. In CIKM ’11: proceedings of the 20th ACM international conference on information and, knowledge management, pp 2301–2304
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905
Spielman D, Teng S-H (2004) Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC ’04: proceedings of the 36th annual ACM symposium on theory of computing, pp 81–90
Sun Y, Yu Y, Han J (2009) Ranking-based clustering of heterogeneous information networks with star network schema. In KDD ’09: proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 797–806
von Luxburg U (2010) Clustering stability: an overview. Found Trends Mach Learn 2(3):235–274
Watts D, Strogatz S (1998) Collective dynamics of small-world networks. Nature 393:440–442
Xie J, Kelley S, Szymanski BK (2013) Overlapping community detection in networks: the state of the art and comparative study. ACM Comput Surv 45(4). Art no 43
Yang J, Leskovec J (2012) Community-affiliation graph model for overlapping network community detection. In ICDM ’12: proceedings of the 2012 IEEE international conference on data mining, pp 1170–1175
Yang J, Leskovec J (2012) Defining and evaluating network communities based on ground-truth. In ICDM ’12: proceedings of the 2012 IEEE international conference on data mining, pp 745–754
Yang J, Leskovec J (2013) Overlapping community detection at scale: a non-negative factorization approach. In WSDM ’13: proceedings of the sixth ACM international conference on web search and data mining, pp 587–596
Yang J, Leskovec J (2013) Structure and overlaps of communities in networks. ACM Trans Intell Syst Technol (to appear)
Acknowledgments
This research has been supported in part by NSF IIS-1016909, CNS-1010921, CAREER IIS-1149837, IIS- 1159679, ARO MURI, DARPA XDATA, DARPA GRAPHS, ARL AHPCRC, Okawa Foundation, Docomo, Boeing, Allyes, Volkswagen, Intel, Alfred P. Sloan Fellowship, and the Microsoft Faculty Fellowship.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Yang, J., Leskovec, J. Defining and evaluating network communities based on ground-truth. Knowl Inf Syst 42, 181–213 (2015). https://doi.org/10.1007/s10115-013-0693-z
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-013-0693-z