Skip to main content
Log in

Defining and evaluating network communities based on ground-truth

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Nodes in real-world networks organize into densely linked communities where edges appear with high concentration among the members of the community. Identifying such communities of nodes has proven to be a challenging task due to a plethora of definitions of network communities, intractability of methods for detecting them, and the issues with evaluation which stem from the lack of a reliable gold-standard ground-truth. In this paper, we distinguish between structural and functional definitions of network communities. Structural definitions of communities are based on connectivity patterns, like the density of connections between the community members, while functional definitions are based on (often unobserved) common function or role of the community members in the network. We argue that the goal of network community detection is to extract functional communities based on the connectivity structure of the nodes in the network. We then identify networks with explicitly labeled functional communities to which we refer as ground-truth communities. In particular, we study a set of 230 large real-world social, collaboration, and information networks where nodes explicitly state their community memberships. For example, in social networks, nodes explicitly join various interest-based social groups. We use such social groups to define a reliable and robust notion of ground-truth communities. We then propose a methodology, which allows us to compare and quantitatively evaluate how different structural definitions of communities correspond to ground-truth functional communities. We study 13 commonly used structural definitions of communities and examine their sensitivity, robustness and performance in identifying the ground-truth. We show that the 13 structural definitions are heavily correlated and naturally group into four classes. We find that two of these definitions, Conductance and Triad participation ratio, consistently give the best performance in identifying ground-truth communities. We also investigate a task of detecting communities given a single seed node. We extend the local spectral clustering algorithm into a heuristic parameter-free community detection method that easily scales to networks with more than 100 million nodes. The proposed method achieves 30 % relative improvement over current local clustering methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

References

  1. Abrahao BD, Soundarajan S, Hopcroft JE, Kleinberg R (2012) On the separability of structural classes of communities. In KDD ’12: proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp 624–632

  2. Ahn Y-Y, Bagrow JP, Lehmann S (2010) Link communities reveal multi-scale complexity in networks. Nature 466:761–764

    Article  Google Scholar 

  3. Andersen R, Chung F, Lang K (2006) Local graph partitioning using PageRank vectors. In FOCS ’06: proceedings of the 47th annual IEEE symposium on foundations of computer science, pp 475–486

  4. Andersen R, Lang K (2006) Communities from seed sets. In: WWW ’06 proceedings of the 15th international conference on, World Wide Web, pp 223–232

  5. Backstrom L, Huttenlocher D, Kleinberg J, Lan X (2006) Group formation in large social networks: membership, growth and evolution. In KDD ’06: proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 44–54

  6. Danon L, Duch J, Diaz-Guilera A, Arenas A (2005) Comparing community structure identification. J Stat Mech Theory Exp 29(09):P09008

    Google Scholar 

  7. Dhillon I, Guan Y, Kulis B (2007) Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans Pattern Anal Mach Intell 29(11):1944–1957

    Article  Google Scholar 

  8. Feld SL (1981) The focused organization of social ties. Am J Sociol 86(5):1015–1035

    Article  Google Scholar 

  9. Flake G, Lawrence S, Giles C (2000) Efficient identification of web communities. In KDD ’00: proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, pp 150–160

  10. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174

    Article  MathSciNet  Google Scholar 

  11. Fortunato S, Barthélemy M (2007) Resolution limit in community detection. Proc Nat Acad Sci USA 104(1):36–41

    Article  Google Scholar 

  12. Girvan M, Newman M (2002) Community structure in social and biological networks. Proc Nat Acad Sci USA 99(12):7821–7826

    Article  MathSciNet  MATH  Google Scholar 

  13. Gleich DF, Seshadhri C (2012) Neighborhoods are good communities. In KDD ’12: proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp 597–605

  14. Granovetter MS (1973) The strength of weak ties. Am J Sociol 78:1360–1380

    Article  Google Scholar 

  15. Kairam S, Wang D, Leskovec J (2012) The life and death of online groups: predicting group growth and longevity. In WSDM ’12: ACM international conference on web search and data mining

  16. Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20:359–392

    Article  MathSciNet  Google Scholar 

  17. Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2:83–97

    Article  Google Scholar 

  18. Leskovec J, Adamic L, Huberman B (2007) The dynamics of viral marketing. ACM Trans Web 1(1):5

    Article  Google Scholar 

  19. Leskovec J, Lang K, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In WWW ’10: proceedings of the 19th international conference on World Wide Web

  20. Leskovec J, Lang KJ, Dasgupta A, Mahoney MW (2009) Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math 6(1):29–123

    Article  MathSciNet  MATH  Google Scholar 

  21. Lin W, Kong X, Yu PS, Wu Q, Jia Y, Li C (2012) Community detection in incomplete information networks. In WWW ’12: proceedings of the 21st international conference on, World Wide Web, pp 341–350

  22. Meilǎ M (2005) Comparing clusterings: an axiomatic view. In ICML ’05: proceedings of the 22nd international conference on machine learning. New York, NY, USA, pp 577–584

  23. Mislove A, Marcon M, Gummadi KP, Druschel P, Bhattacharjee B (2007) Measurement and analysis of online social networks. In IMC ’07: proceedings of the 7th ACM SIGCOMM conference on internet, measurement, pp 29–42

  24. Newman M (2006) Modularity and community structure in networks. Proc Nat Acad Sci USA 103(23):8577–8582

    Article  Google Scholar 

  25. Newman M, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69:026113

    Article  Google Scholar 

  26. Palla G, Derényi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818

    Article  Google Scholar 

  27. Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D (2004) Defining and identifying communities in networks. Proc Nat Acad Sci USA 101(9):2658–2663

    Article  Google Scholar 

  28. Ren Y, Kraut R, Kiesler S (2007) Applying common identity and bond theory to design of online communities. Organ Stud 28(3):377–408

    Article  Google Scholar 

  29. Schaeffer S (2007) Graph clustering. Comp Sci Rev 1(1):27–64

    Article  MathSciNet  MATH  Google Scholar 

  30. Shi C, Yu PS, Cai Y, Yan Z, Wu B (2011) On selection of objective functions in multi-objective community detection. In CIKM ’11: proceedings of the 20th ACM international conference on information and, knowledge management, pp 2301–2304

  31. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905

    Article  Google Scholar 

  32. Spielman D, Teng S-H (2004) Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC ’04: proceedings of the 36th annual ACM symposium on theory of computing, pp 81–90

  33. Sun Y, Yu Y, Han J (2009) Ranking-based clustering of heterogeneous information networks with star network schema. In KDD ’09: proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 797–806

  34. von Luxburg U (2010) Clustering stability: an overview. Found Trends Mach Learn 2(3):235–274

    Google Scholar 

  35. Watts D, Strogatz S (1998) Collective dynamics of small-world networks. Nature 393:440–442

    Article  Google Scholar 

  36. Xie J, Kelley S, Szymanski BK (2013) Overlapping community detection in networks: the state of the art and comparative study. ACM Comput Surv 45(4). Art no 43

  37. Yang J, Leskovec J (2012) Community-affiliation graph model for overlapping network community detection. In ICDM ’12: proceedings of the 2012 IEEE international conference on data mining, pp 1170–1175

  38. Yang J, Leskovec J (2012) Defining and evaluating network communities based on ground-truth. In ICDM ’12: proceedings of the 2012 IEEE international conference on data mining, pp 745–754

  39. Yang J, Leskovec J (2013) Overlapping community detection at scale: a non-negative factorization approach. In WSDM ’13: proceedings of the sixth ACM international conference on web search and data mining, pp 587–596

  40. Yang J, Leskovec J (2013) Structure and overlaps of communities in networks. ACM Trans Intell Syst Technol (to appear)

Download references

Acknowledgments

This research has been supported in part by NSF IIS-1016909, CNS-1010921, CAREER IIS-1149837, IIS- 1159679, ARO MURI, DARPA XDATA, DARPA GRAPHS, ARL AHPCRC, Okawa Foundation, Docomo, Boeing, Allyes, Volkswagen, Intel, Alfred P. Sloan Fellowship, and the Microsoft Faculty Fellowship.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jaewon Yang.

A Appendix

A Appendix

See Figs. 10, 11, 12, 13, 14, 15.

Fig. 10
figure 10

Average metrics of top k communities by 6 scores. C conductance, T TPR, M modularity, F flake ODF, D FOMD, CR cut ratio

Fig. 11
figure 11

Average metrics of top k communities by 6 scores. C conductance, T TPR, M modularity, F flake ODF, D FOMD, CR cut ratio

Fig. 12
figure 12

\(Z\)-score of 6 scores versus the perturbation intensity for each null model. C conductance, T TPR, M modularity, F flake ODF, D FOMD, CR cut ratio

Fig. 13
figure 13

\(Z\)-score of 6 scores versus the perturbation intensity for each null model. C conductance, T TPR, M modularity, F flake ODF, D FOMD, CR cut ratio

Fig. 14
figure 14

\(Z\)-score of 6 scores versus the community size for each null model. C conductance, T TPR, M modularity, F flake ODF, D FOMD, CR cut ratio

Fig. 15
figure 15

\(Z\)-score of 6 scores versus the community size for each null model. C conductance, T TPR, M modularity, F flake ODF, D FOMD, CR cut ratio

Rights and permissions

Reprints and permissions

About this article

Cite this article

Yang, J., Leskovec, J. Defining and evaluating network communities based on ground-truth. Knowl Inf Syst 42, 181–213 (2015). https://doi.org/10.1007/s10115-013-0693-z

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-013-0693-z

Keywords

Navigation