DOI: 10.1145/2588555.2588574
research-article

Mining statistically significant connected subgraphs in vertex labeled graphs

Published: 18 June 2014

ABSTRACT

The steady growth of graph data in various applications has resulted in widespread research in finding significant substructures in a graph. In this paper, we address the problem of finding statistically significant connected subgraphs where the nodes of the graph are labeled. The labels may be either discrete, where they assume values from a pre-defined set, or continuous, where they assume values from a real domain and can be multi-dimensional. We motivate the problem citing applications in spatial co-location rule mining and outlier detection. We use the chi-square statistic as a measure for quantifying the statistical significance. Since the number of connected subgraphs in a general graph is exponential, the naive algorithm is impractical. We introduce the notion of contracting edges that merge vertices together to form a super-graph. We show that if the graph is dense enough to start with, the number of super-vertices is quite low, and therefore, running the naive algorithm on the super-graph is feasible. If the graph is not dense, we provide an algorithm to reduce the number of super-vertices further, thereby providing a trade-off between accuracy and time. Empirically, the chi-square value obtained by this reduction is always within 96% of the optimal value, while the time spent is only a fraction of that for the optimal. In addition, we also show that our algorithm is scalable and that it significantly enhances the ability to analyze real datasets.
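To make the naive baseline mentioned in the abstract concrete, the following is a minimal sketch for the discrete-label case only. It assumes the standard Pearson chi-square form (observed label counts in the subgraph against expected counts n * p under given label probabilities) and a brute-force enumeration of connected vertex subsets. The function names, the toy graph, and the label probabilities are illustrative assumptions, not taken from the paper, and the edge-contraction step is not shown.

```python
from itertools import combinations


def chi_square(labels, probs):
    """Pearson chi-square of observed label counts against expected counts n * p.

    Assumption: `probs` maps each discrete label to its null-model probability.
    """
    n = len(labels)
    total = 0.0
    for label, p in probs.items():
        observed = sum(1 for x in labels if x == label)
        expected = n * p
        total += (observed - expected) ** 2 / expected
    return total


def is_connected(vertices, adj):
    """Depth-first search restricted to the chosen vertex set."""
    vertices = set(vertices)
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in vertices and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == vertices


def best_connected_subgraph(adj, label, probs):
    """Naive search: test every non-empty vertex subset, score the connected ones.

    Exponential in the number of vertices, so only usable on tiny graphs.
    """
    best_score, best_set = -1.0, None
    nodes = sorted(adj)
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            if is_connected(subset, adj):
                score = chi_square([label[v] for v in subset], probs)
                if score > best_score:
                    best_score, best_set = score, set(subset)
    return best_score, best_set


# Hypothetical toy input: a path 0 - 1 - 2 - 3 with two equally likely labels.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
label = {0: "a", 1: "b", 2: "b", 3: "a"}
probs = {"a": 0.5, "b": 0.5}

score, vertices = best_connected_subgraph(adj, label, probs)
print(score, vertices)
```

On this toy path the two adjacent "b" vertices form the highest-scoring connected subgraph, with a chi-square value of 2.0. The loop visits all 2^n - 1 vertex subsets, which illustrates why the abstract calls the naive algorithm impractical on general graphs.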

    • Published in

      SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
      June 2014
      1645 pages
      ISBN:9781450323765
      DOI:10.1145/2588555

      Copyright © 2014 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      SIGMOD '14 paper acceptance rate: 107 of 421 submissions, 25%
      Overall acceptance rate: 785 of 4,003 submissions, 20%
