research-article

Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling

Authors:
Michael Mitzenmacher (Harvard University, Cambridge, MA, USA)
Jakub Pachocki (Carnegie Mellon University, Pittsburgh, USA)
Richard Peng (MIT, Cambridge, MA, USA)
Charalampos Tsourakakis (Harvard University, Cambridge, MA, USA)
Shen Chen Xu (Carnegie Mellon University, Pittsburgh, PA, USA)

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015, Pages 815–824
https://doi.org/10.1145/2783258.2783385

Published: 10 August 2015

ABSTRACT

Extracting dense subgraphs from large graphs is a key primitive in a variety of graph mining applications, ranging from mining social networks and the Web graph to bioinformatics [41]. In this paper we focus on a family of polynomial-time solvable formulations, known as the k-clique densest subgraph problem (k-Clique-DSP) [57]. When k=2, the problem becomes the well-known densest subgraph problem (DSP) [22, 31, 33, 39]. Our main contribution is a sampling scheme that gives a densest subgraph sparsifier, yielding a randomized algorithm that produces high-quality approximations while providing significant speedups and improved space complexity. We also extend this family of formulations to bipartite graphs by introducing the (p,q)-biclique densest subgraph problem ((p,q)-Biclique-DSP), and devise an exact algorithm that can treat both clique and biclique densities in a unified way.
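To make the objective concrete, here is a minimal Python sketch. It assumes the usual definition of k-clique density, namely the number of k-cliques contained in a vertex set S divided by |S|, which reduces to edges per vertex when k=2. It also assumes that the sampling scheme keeps each k-clique independently with some probability p and rescales counts by 1/p; the abstract does not spell out these details. The function names and the toy graph are hypothetical, and the brute-force enumeration is for illustration only, not the authors' algorithm.

```python
import random
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map {vertex: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_cliques(adj, k):
    """Brute-force enumeration of all k-cliques (fine only for tiny graphs)."""
    cliques = []
    for combo in combinations(sorted(adj), k):
        if all(v in adj[u] for u, v in combinations(combo, 2)):
            cliques.append(combo)
    return cliques


def k_clique_density(cliques, subset):
    """Assumed definition: (# of k-cliques inside subset) / |subset|."""
    subset = set(subset)
    inside = sum(1 for c in cliques if subset.issuperset(c))
    return inside / len(subset)


def sample_cliques(cliques, p, rng):
    """Assumed sparsifier: keep each k-clique independently with probability p."""
    return [c for c in cliques if rng.random() < p]


# Toy graph: a 5-clique on vertices 0..4 with a short path 4-5-6 attached.
edges = list(combinations(range(5), 2)) + [(4, 5), (5, 6)]
adj = build_graph(edges)

triangles = k_cliques(adj, 3)                  # all 3-cliques
dense = k_clique_density(triangles, range(5))  # density of the 5-clique
whole = k_clique_density(triangles, adj)       # density of the whole graph

p = 0.5
sampled = sample_cliques(triangles, p, random.Random(0))
# Rescale by 1/p so the sampled density estimates the true density.
estimate = k_clique_density(sampled, range(5)) / p

print(dense, whole, estimate)
```

On this toy graph the 5-clique contains all 10 triangles, so its triangle density is 10/5 = 2, higher than the 10/7 of the whole graph. The rescaled estimate from the sampled triangles varies with the random seed.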

As an example of performance, our sparsifying algorithm extracts the 5-clique densest subgraph, which is a large near-clique on 62 vertices, from a large collaboration network. Our algorithm achieves 100% accuracy over five runs, with an average speedup factor of over 10,000. Specifically, we reduce the running time from ∼2,107 seconds to an average of 0.15 seconds. We also use our methods to study how the k-clique densest subgraphs change as a function of time in time-evolving networks for various small values of k. We observe significant deviations between the experimental findings on real-world networks and stochastic Kronecker graphs, a random graph model that mimics real-world networks in certain aspects.
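As a quick sanity check on the reported speedup, the arithmetic can be reproduced in a few lines of Python. This assumes the baseline running time reads as roughly 2,107 seconds (a thousands separator, not a power of ten), which is the reading consistent with the stated factor of over 10,000. The variable names are illustrative.

```python
# Assumed reading of the reported baseline: about 2,107 seconds.
baseline_seconds = 2107.0
# Reported average running time of the sparsifying algorithm.
sparsified_seconds = 0.15

speedup = baseline_seconds / sparsified_seconds
print(f"speedup ~ {speedup:,.0f}x")  # prints "speedup ~ 14,047x"
```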

We believe that our work is a significant advance in routines with rigorous theoretical guarantees for scalable extraction of large near-cliques from networks.

Supplemental Material

p815.mp4 (MP4, 196.1 MB)

References

[1] http://www.avglab.com/soft/hipr.tar.
[2] http://snap.stanford.edu/data/index.html.
[3] http://grouplens.org/datasets.
[4] http://research.nii.ac.jp/~uno/codes.htm.
[5] Large Near-Clique Detection. http://tinyurl.com/o6y33g9.
[6] J. Abello, M. G. C. Resende, and S. Sudarsky. Massive quasi-clique detection. In LATIN, 2002.
[7] R. Andersen and K. Chellapilla. Finding dense subgraphs with size bounds. In WAW, 2009.
[8] A. Andoni, A. Gupta, and R. Krauthgamer. Towards (1+ε)-approximate flow sparsifiers. In SODA, pages 279--293. SIAM, 2014.
[9] A. Angel, N. Sarkas, N. Koudas, and D. Srivastava. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. In VLDB, 5(6), pages 574--585, Feb. 2012.
[10] Y. Asahiro, K. Iwama, H. Tamaki, and T. Tokuyama. Greedily finding a dense subgraph. In Journal of Algorithms, 34(2), 2000.
[11] G. D. Bader and C. W. Hogue. An automated method for finding molecular complexes in large protein interaction networks. In BMC Bioinformatics, 2003.
[12] B. Bahmani, R. Kumar, and S. Vassilvitskii. Densest subgraph in streaming and MapReduce. In VLDB, 5(5), pages 454--465, 2012.
[13] O. D. Balalau, F. Bonchi, T. Chan, F. Gullo, and M. Sozio. Finding subgraphs with maximum total density and limited overlap. In WSDM, pages 379--388. ACM, 2015.
[14] V. Batagelj and M. Zaversnik. An O(m) algorithm for cores decomposition of networks. In arXiv, arXiv:cs/0310049, 2003.
[15] A. A. Benczúr and D. R. Karger. Approximating s-t minimum cuts in Õ(n²) time. In STOC, pages 47--55, 1996.
[16] A. Bhaskara, M. Charikar, E. Chlamtac, U. Feige, and A. Vijayaraghavan. Detecting high log-densities: an O(n^{1/4}) approximation for densest k-subgraph. In STOC, pages 201--210, 2010.
[17] S. Bhattacharya, M. Henzinger, D. Nanongkai, and C. E. Tsourakakis. Space- and time-efficient algorithm for maintaining dense subgraphs on one-pass dynamic streams. In STOC, 2015 (to appear).
[18] I. M. Bomze, M. Budinich, P. M. Pardalos, and M. Pelillo. The maximum clique problem. In Handbook of Combinatorial Optimization, pages 1--74, 1999.
[19] F. Bonchi, F. Gullo, A. Kaltenbrunner, and Y. Volkovich. Core decomposition of uncertain graphs. In KDD, pages 1316--1325, 2014.
[20] C. Bron and J. Kerbosch. Algorithm 457: finding all cliques of an undirected graph. In Communications of the ACM, 16(9), pages 575--577, 1973.
[21] G. Buehrer and K. Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM, pages 95--106, 2008.
[22] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In APPROX, pages 84--95, 2000.
[23] J. Chen and Y. Saad. Dense subgraph extraction with application to community detection. In TKDE, vol. 24, pages 1216--1230, 2012.
[24] B. V. Cherkassky and A. V. Goldberg. On implementing the push-relabel method for the maximum flow problem. In Algorithmica, 19(4), pages 390--410, 1997.
[25] N. Chiba and T. Nishizeki. Arboricity and subgraph listing algorithms. In SIAM Journal on Computing, 14(1), pages 210--223, 1985.
[26] D. Eppstein. Arboricity and bipartite subgraph listing algorithms. In Information Processing Letters, 51(4), pages 207--211, 1994.
[27] D. Eppstein, M. Löffler, and D. Strash. Listing all maximal cliques in sparse graphs in near-optimal time. In ISAAC, 2010.
[28] U. Feige, G. Kortsarz, and D. Peleg. The dense k-subgraph problem. In Algorithmica, 29(3), 2001.
[29] I. Finocchi, M. Finocchi, and E. G. Fusco. Counting small cliques in MapReduce. In arXiv, arXiv:1403.0734, 2014.
[30] E. Fratkin, B. T. Naughton, D. L. Brutlag, and S. Batzoglou. MotifCut: regulatory motifs finding with maximum density subgraphs. In Bioinformatics, vol. 22(14), pages 150--157, 2006.
[31] G. Gallo, M. D. Grigoriadis, and R. E. Tarjan. A fast parametric maximum flow algorithm and applications. In SIAM Journal on Computing, 18(1), 1989.
[32] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB, pages 721--732, 2005.
[33] A. V. Goldberg. Finding a maximum density subgraph. Tech. report, UC Berkeley, 1984.
[34] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum-flow problem. In Journal of the ACM (JACM), 35(4), pages 921--940, 1988.
[35] J. Håstad. Clique is hard to approximate within n^{1-ε}. In Acta Mathematica, 182(1), 1999.
[36] R. Jin, Y. Xiang, N. Ruan, and D. Fuhry. 3-hop: a high-compression indexing scheme for reachability query. In SIGMOD, 2009.
[37] D. S. Johnson and M. A. Trick. Cliques, coloring, and satisfiability: second DIMACS implementation challenge. American Mathematical Soc., 1996.
[38] R. Kannan and V. Vinay. Analyzing the structure of large graphs. Manuscript, 1999.
[39] S. Khuller and B. Saha. On finding dense subgraphs. In ICALP, 2009.
[40] D. E. Knuth. Seminumerical algorithms. 2007.
[41] V. E. Lee, N. Ruan, R. Jin, and C. C. Aggarwal. A survey of algorithms for dense subgraph discovery. In Managing and Mining Graph Data, pages 303--336, Springer, 2010.
[42] Y. T. Lee and A. Sidford. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. In FOCS, pages 424--433, 2014.
[43] F. T. Leighton and A. Moitra. Extensions and limits to vertex sparsification. In STOC, pages 47--56, 2010.
[44] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. In The Journal of Machine Learning Research, vol. 11, pages 985--1042, 2010.
[45] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD, pages 177--187, 2005.
[46] K. Makino and T. Uno. New algorithms for enumerating all maximal cliques. In SWAT, pages 260--272, 2004.
[47] M. Mitzenmacher and E. Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge University Press, 2005.
[48] A. Moitra. Approximation algorithms for multicommodity-type problems with guarantees independent of the graph size. In FOCS, pages 3--12, 2009.
[49] J. B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. In Mathematical Programming, vol. 118(2), pages 237--251, 2009.
[50] R. Pagh and C. E. Tsourakakis. Colorful triangle counting and a MapReduce implementation. In Information Processing Letters, vol. 112(7), pages 277--281, 2012.
[51] A. E. Sariyuce, C. Seshadhri, A. Pinar, and U. V. Catalyurek. Finding the hierarchy of dense subgraphs using nucleus decompositions. In WWW, 2015.
[52] D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC, pages 81--90, 2004.
[53] M. Thorup and U. Zwick. Approximate distance oracles. In Journal of the ACM (JACM), vol. 52(1), pages 1--24, 2005.
[54] M. Thorup and U. Zwick. Spanners and emulators with sublinear distance errors. In SODA, pages 802--809, 2006.
[55] C. E. Tsourakakis, F. Bonchi, A. Gionis, F. Gullo, and M. Tsiarli. Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. In KDD, pages 104--112, 2013.
[56] C. E. Tsourakakis. Mathematical and Algorithmic Analysis of Network and Biological Data. PhD thesis, Carnegie Mellon University, 2013.
[57] C. E. Tsourakakis. The k-clique densest subgraph problem. In WWW, pages 1122--1132, 2015.
[58] C. E. Tsourakakis, M. N. Kolountzakis, and G. L. Miller. Triangle sparsifiers. In J. Graph Algorithms Appl., 15(6), pages 703--726, 2011.
[59] T. Uno. An efficient algorithm for solving pseudo clique enumeration problem. In Algorithmica, 56(1), 2010.
[60] N. Wang, J. Zhang, K.-L. Tan, and A. K. Tung. On triangulation-based dense neighborhood graph discovery. In VLDB, 4(2), pages 58--68, 2010.

Index Terms

1. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Graph algorithms

Published in
KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015
2378 pages
ISBN: 9781450336642
DOI: 10.1145/2783258
General Chairs: Longbing Cao (University of Technology, Sydney), Chengqi Zhang (University of Technology, Sydney)
Program Chairs: Thorsten Joachims (Cornell University), Geoff Webb (Monash University), Dragos D. Margineantu (Boeing Research), Graham Williams (Australian Taxation Office)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
dense subgraphs
graph mining
near-clique extraction
Acceptance Rates
KDD '15 Paper Acceptance Rate: 160 of 819 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%