ABSTRACT
Extracting dense subgraphs from large graphs is a key primitive in a variety of graph mining applications, ranging from mining social networks and the Web graph to bioinformatics [41]. In this paper we focus on a family of polynomial-time solvable formulations, known as the k-clique densest subgraph problem (k-Clique-DSP) [57]. When k=2, the problem becomes the well-known densest subgraph problem (DSP) [22, 31, 33, 39]. Our main contribution is a sampling scheme that yields a densest subgraph sparsifier, giving a randomized algorithm that produces high-quality approximations while providing significant speedups and improved space complexity. We also extend this family of formulations to bipartite graphs by introducing the (p,q)-biclique densest subgraph problem ((p,q)-Biclique-DSP), and devise an exact algorithm that can treat both clique and biclique densities in a unified way.
As an example of performance, our sparsifying algorithm extracts the 5-clique densest subgraph, which is a large near-clique on 62 vertices, from a large collaboration network. Our algorithm achieves 100% accuracy over five runs, with an average speedup factor of over 10,000: we reduce the running time from ∼2,107 seconds to an average running time of 0.15 seconds. We also use our methods to study how the k-clique densest subgraphs change as a function of time in time-evolving networks for various small values of k. We observe significant deviations between the experimental findings on real-world networks and stochastic Kronecker graphs, a random graph model that mimics real-world networks in certain aspects.
We believe that our work is a significant advance in routines with rigorous theoretical guarantees for scalable extraction of large near-cliques from networks.
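The abstract does not spell out the sampling scheme, so the following is only a minimal sketch under stated assumptions. It takes the k-clique density of a vertex set S to be the number of k-cliques inside S divided by |S|, assumes the sparsifier keeps each k-clique independently with probability p, and uses a simple greedy peeling heuristic in place of the exact algorithm mentioned above. All function names are hypothetical and chosen for illustration.

```python
import random


def build_adj(edges):
    """Build an undirected adjacency map {vertex: set(neighbours)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_cliques(adj, k):
    """List every k-clique once, as a tuple of vertices in increasing order."""
    cliques = []

    def extend(clique, candidates):
        if len(clique) == k:
            cliques.append(tuple(clique))
            return
        for v in sorted(candidates):
            # Keep only larger-labelled common neighbours so each clique appears once.
            extend(clique + [v], {u for u in candidates & adj[v] if u > v})

    extend([], set(adj))
    return cliques


def clique_density(cliques, subset):
    """Assumed definition: (number of listed k-cliques inside subset) / |subset|."""
    subset = set(subset)
    return sum(1 for c in cliques if subset.issuperset(c)) / len(subset)


def sample_cliques(cliques, p, seed=0):
    """Assumed sparsifier: keep each k-clique independently with probability p."""
    rng = random.Random(seed)
    return [c for c in cliques if rng.random() < p]


def greedy_peel(vertices, cliques):
    """Heuristic (not the exact algorithm): repeatedly drop the vertex that lies
    in the fewest remaining cliques and return the densest vertex set seen."""
    alive = set(vertices)
    remaining = list(cliques)
    best, best_density = set(alive), len(remaining) / len(alive)
    while len(alive) > 1:
        count = {v: 0 for v in alive}
        for c in remaining:
            for v in c:
                count[v] += 1
        victim = min(alive, key=lambda v: (count[v], v))
        alive.remove(victim)
        remaining = [c for c in remaining if victim not in c]
        density = len(remaining) / len(alive)
        if density > best_density:
            best, best_density = set(alive), density
    return best, best_density


if __name__ == "__main__":
    # Toy graph: a 5-clique on vertices 0..4 with a path 4-5-6-7 attached.
    edges = [(u, v) for u in range(5) for v in range(u + 1, 5)]
    edges += [(4, 5), (5, 6), (6, 7)]
    adj = build_adj(edges)
    triangles = k_cliques(adj, 3)
    sampled = sample_cliques(triangles, p=0.5)
    subset, sampled_density = greedy_peel(adj, sampled)
    # Dividing by p rescales the sampled density to an estimate of the true one.
    print(subset, sampled_density / 0.5)
```

In this sketch the peeling step runs on the sampled clique list rather than on the full one, which is where any time and space savings would come from; the quality of the result depends on the choice of p, which the abstract does not specify.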
References
- http://www.avglab.com/soft/hipr.tar.
- http://snap.stanford.edu/data/index.html.
- http://grouplens.org/datasets.
- http://research.nii.ac.jp/~uno/codes.htm.
- Large Near-Clique Detection. http://tinyurl.com/o6y33g9.
- J. Abello, M. G. C. Resende, and S. Sudarsky. Massive quasi-clique detection. In LATIN, 2002.
- R. Andersen and K. Chellapilla. Finding dense subgraphs with size bounds. In WAW, 2009.
- A. Andoni, A. Gupta, and R. Krauthgamer. Towards (1+ε)-approximate flow sparsifiers. In SODA, pages 279--293. SIAM, 2014.
- A. Angel, N. Sarkas, N. Koudas, and D. Srivastava. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. In VLDB, 5(6), pages 574--585, Feb. 2012.
- Y. Asahiro, K. Iwama, H. Tamaki, and T. Tokuyama. Greedily finding a dense subgraph. In Journal of Algorithms, 34(2), 2000.
- G. D. Bader and C. W. Hogue. An automated method for finding molecular complexes in large protein interaction networks. In BMC bioinformatics, 2003.
- B. Bahmani, R. Kumar, and S. Vassilvitskii. Densest subgraph in streaming and mapreduce. In VLDB, 5(5):454--465, 2012.
- O. D. Balalau, F. Bonchi, T. Chan, F. Gullo, and M. Sozio. Finding subgraphs with maximum total density and limited overlap. In WSDM, pages 379--388. ACM, 2015.
- V. Batagelj and M. Zaversnik. An O(m) algorithm for cores decomposition of networks. In Arxiv, arXiv.cs/0310049, 2003.
- A. A. Benczúr and D. R. Karger. Approximating s-t minimum cuts in Õ(n^2) time. In STOC, pages 47--55, 1996.
- A. Bhaskara, M. Charikar, E. Chlamtac, U. Feige, and A. Vijayaraghavan. Detecting high log-densities: an O(n^{1/4}) approximation for densest k-subgraph. In STOC, pages 201--210, 2010.
- S. Bhattacharya, M. Henzinger, D. Nanongkai, and C. E. Tsourakakis. Space- and time-efficient algorithm for maintaining dense subgraphs on one-pass dynamic streams. In STOC, 2015 (to appear).
- I. M. Bomze, M. Budinich, P. M. Pardalos, and M. Pelillo. The maximum clique problem. In Handbook of combinatorial optimization, pages 1--74, 1999.
- F. Bonchi, F. Gullo, A. Kaltenbrunner, and Y. Volkovich. Core decomposition of uncertain graphs. In KDD, pages 1316--1325, 2014.
- C. Bron and J. Kerbosch. Algorithm 457: finding all cliques of an undirected graph. In Communications of the ACM, 16(9):575--577, 1973.
- G. Buehrer and K. Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM, pages 95--106, 2008.
- M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In APPROX, pages 84--95, 2000.
- J. Chen and Y. Saad. Dense Subgraph Extraction with Application to Community Detection. In TKDE, vol. 24, pages 1216--1230, 2012.
- B. V. Cherkassky and A. V. Goldberg. On implementing the push-relabel method for the maximum flow problem. In Algorithmica, 19(4), pages 390--410, 1997.
- N. Chiba and T. Nishizeki. Arboricity and subgraph listing algorithms. In SIAM Journal on Computing, 14(1), pages 210--223, 1985.
- D. Eppstein. Arboricity and bipartite subgraph listing algorithms. In Information Processing Letters, 51(4), pages 207--211, 1994.
- D. Eppstein, M. Löffler, and D. Strash. Listing all maximal cliques in sparse graphs in near-optimal time. In ISAAC, 2010.
- U. Feige, G. Kortsarz, and D. Peleg. The dense k-subgraph problem. In Algorithmica, 29(3), 2001.
- I. Finocchi, M. Finocchi, and E. G. Fusco. Counting small cliques in mapreduce. In ArXiv arXiv:1403.0734, 2014.
- E. Fratkin, B. T. Naughton, D. L. Brutlag, and S. Batzoglou. Motifcut: regulatory motifs finding with maximum density subgraphs. In Bioinformatics, vol. 22(14), pages 150--157, 2006.
- G. Gallo, M. D. Grigoriadis, and R. E. Tarjan. A fast parametric maximum flow algorithm and applications. In SIAM Journal on Computing, 18(1), 1989.
- D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB, pages 721--732, 2005.
- A. V. Goldberg. Finding a maximum density subgraph. Tech. report, UC Berkeley, 1984.
- A. V. Goldberg and R. E. Tarjan. A new approach to the maximum-flow problem. In Journal of the ACM (JACM), 35(4), pages 921--940, 1988.
- J. Håstad. Clique is hard to approximate within n^{1-ε}. In Acta Mathematica, 182(1), 1999.
- R. Jin, Y. Xiang, N. Ruan, and D. Fuhry. 3-hop: a high-compression indexing scheme for reachability query. In SIGMOD, 2009.
- D. S. Johnson and M. A. Trick. Cliques, coloring, and satisfiability: second DIMACS implementation challenge. American Mathematical Soc., 1996.
- R. Kannan and V. Vinay. Analyzing the structure of large graphs, manuscript, 1999.
- S. Khuller and B. Saha. On finding dense subgraphs. In ICALP, 2009.
- D. E. Knuth. Seminumerical algorithms. 2007.
- V. E. Lee, N. Ruan, R. Jin, and C. C. Aggarwal. A survey of algorithms for dense subgraph discovery. In Managing and Mining Graph Data, pages 303--336, Springer, 2010.
- Y. T. Lee and A. Sidford. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. In FOCS, pages 424--433, 2014.
- F. T. Leighton and A. Moitra. Extensions and limits to vertex sparsification. In STOC, pages 47--56, 2010.
- J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. In The Journal of Machine Learning Research, vol. 11, pages 985--1042, 2010.
- J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD, pages 177--187, 2005.
- K. Makino and T. Uno. New algorithms for enumerating all maximal cliques. In SWAT, pages 260--272, 2004.
- M. Mitzenmacher and E. Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge University Press, 2005.
- A. Moitra. Approximation algorithms for multicommodity-type problems with guarantees independent of the graph size. In FOCS, pages 3--12, 2009.
- J. B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. In Mathematical Programming, vol. 118(2), pages 237--251, 2009.
- R. Pagh and C. E. Tsourakakis. Colorful triangle counting and a mapreduce implementation. In Information Processing Letters, vol. 112(7), pages 277--281, 2012.
- A. E. Sariyuce, C. Seshadhri, A. Pinar, and U. V. Catalyurek. Finding the hierarchy of dense subgraphs using nucleus decompositions. In WWW, 2015.
- D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC, pages 81--90, 2004.
- M. Thorup and U. Zwick. Approximate distance oracles. In Journal of the ACM (JACM), vol. 52(1), pages 1--24, 2005.
- M. Thorup and U. Zwick. Spanners and emulators with sublinear distance errors. In SODA, pages 802--809, 2006.
- C. E. Tsourakakis, F. Bonchi, A. Gionis, F. Gullo, and M. Tsiarli. Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. In KDD, pages 104--112, 2013.
- C. E. Tsourakakis. Mathematical and Algorithmic Analysis of Network and Biological Data. PhD thesis, Carnegie Mellon University, 2013.
- C. E. Tsourakakis. The k-clique densest subgraph problem. In WWW, pages 1122--1132, 2015.
- C. E. Tsourakakis, M. N. Kolountzakis, and G. L. Miller. Triangle sparsifiers. In J. Graph Algorithms Appl., 15(6), pages 703--726, 2011.
- T. Uno. An efficient algorithm for solving pseudo clique enumeration problem. In Algorithmica, 56(1), 2010.
- N. Wang, J. Zhang, K.-L. Tan, and A. K. Tung. On triangulation-based dense neighborhood graph discovery. In VLDB, 4(2), pages 58--68, 2010.
Index Terms
- Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling