CoFIM: A community-based framework for influence maximization on large-scale networks
Introduction
Influence maximization (IM) is a classic network optimization problem that originates from the area of viral marketing [1], [2]. Suppose a company has developed a new product and wants to promote it among its customers. The company plans to select some people and let them try the product for free, hoping that these people will recommend it to their family and friends. The expectation is that the product will then be widely adopted in the network through the “word-of-mouth” effect [3]. However, a company's budget is usually limited, so a natural question is: how should the “best” initial adopters be selected so that the product is adopted as widely as possible? Mathematically, this is the problem of finding k seed nodes in a network such that they trigger the largest cascade, which we call influence maximization.
The study of influence maximization relies on diffusion models [4]. Currently the most widely used models are the IC (Independent Cascade) model and the LT (Linear Threshold) model [5]. In both models, each node is in one of two states: active or inactive. Active nodes are those that have adopted the product and will propagate it to their neighbors. Inactive nodes are those that have not heard of the product or have declined to adopt it. Initially all nodes are inactive; then k seed nodes are selected and activated, and propagation starts from these k nodes.
IC model: In this model, at step t, each active node u tries to activate each of its inactive neighbors v, and succeeds with probability $p_{uv}$. Node u has only one chance to activate v: whether or not it succeeds, u makes no further attempt to activate v. If v is successfully activated, then from step $t + 1$ v is active and tries to activate its own inactive neighbors. If no node is activated at step T, the diffusion process stops.
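The IC process described above can be sketched as a single simulated cascade. This is a minimal illustration, not code from the paper: the dictionary-based graph format and the names `simulate_ic`, `graph`, and `prob` are assumptions made for this example.

```python
import random

def simulate_ic(graph, seeds, prob, rng):
    """Run one IC cascade and return the set of nodes active at the end.

    graph: dict mapping node -> list of out-neighbors (hypothetical format)
    prob:  dict mapping (u, v) -> activation probability p_uv
    rng:   a random.Random instance, so runs can be reproduced
    """
    active = set(seeds)
    frontier = list(seeds)  # nodes activated in the previous step
    while frontier:         # stop once a step activates no new node
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, []):
                # u appears in a frontier only once, so it gets exactly
                # one attempt on each neighbor that is still inactive
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

# Toy network: with probability 1 every reachable node is activated,
# with probability 0 only the seed stays active.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
certain = {(u, v): 1.0 for u in graph for v in graph[u]}
never = {(u, v): 0.0 for u in graph for v in graph[u]}
print(simulate_ic(graph, ["a"], certain, random.Random(0)))
print(simulate_ic(graph, ["a"], never, random.Random(0)))
```

Averaging the size of the returned set over many runs gives a Monte Carlo estimate of the expected spread of a seed set.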
LT model: In this model, the activation of node v depends on the set of its active neighbors. For each directed edge from u to v, there is a weight $b_{uv}$ indicating the influence of u on v. For each node v, the constraint $\sum_{u \text{ neighbor of } v} b_{uv} \le 1$ must be satisfied. Node v has an activation threshold $\theta_v$, which is between 0 and 1. Once the condition $\sum_{u \text{ active neighbor of } v} b_{uv} \ge \theta_v$ is satisfied, v becomes active. When no more nodes can be activated, the diffusion process stops.
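The LT rule can likewise be sketched in a few lines. This is an illustrative sketch under stated assumptions: the thresholds are passed in explicitly rather than sampled, and the names `simulate_lt`, `in_weights`, and `thresholds` are hypothetical.

```python
def simulate_lt(in_weights, seeds, thresholds):
    """Run the LT process to its fixed point and return the active set.

    in_weights: dict mapping v -> {u: b_uv} over in-neighbors u of v,
                with the weights of each v summing to at most 1
    thresholds: dict mapping v -> theta_v in [0, 1]
    """
    active = set(seeds)
    changed = True
    while changed:  # stop when a full sweep activates nobody
        changed = False
        for v, weights in in_weights.items():
            if v in active:
                continue
            # total weight contributed by currently active in-neighbors
            total = sum(b for u, b in weights.items() if u in active)
            if total >= thresholds[v]:
                active.add(v)
                changed = True
    return active

in_weights = {
    "a": {},
    "b": {"a": 0.6},
    "c": {"a": 0.3, "b": 0.3},
    "d": {"c": 0.4},
}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
print(simulate_lt(in_weights, ["a"], thresholds))
```

Nodes are updated within a sweep rather than in strictly synchronous steps. Because active nodes never become inactive, the final active set is the same either way; only the step at which a node activates may differ.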
The influence maximization problem was first formulated by Kempe et al. [5]. Given a network G(V, E), influence maximization aims to find a subset S of k vertices such that the diffusion originating from S causes the maximum cascade of influence, i.e., $S^* = \arg\max_{S \subseteq V,\, |S| = k} \sigma(S)$, where $\sigma(S)$ is an objective function evaluating the influence spread, defined as the expected number of active users in the network after the diffusion has stopped.
Kempe et al. [5] proved that under the traditional linear threshold (LT) and independent cascade (IC) diffusion models, this problem is NP-hard and the objective function is submodular. For an arbitrary function $\sigma(\cdot)$ that maps subsets of a finite ground set $U$ to non-negative real numbers, we say the function is monotone and submodular if it satisfies the following two properties: (1) monotone increasing, i.e., $\forall S \subseteq T \subseteq U$, $\sigma(S) \le \sigma(T)$; (2) diminishing returns, i.e., $\forall v \in U$ and $S \subseteq T \subseteq U$, we have $\sigma(S \cup \{v\}) - \sigma(S) \ge \sigma(T \cup \{v\}) - \sigma(T)$, which means that the marginal gain from adding an element to a set $S$ is at least as high as the marginal gain from adding the same element to a superset of $S$. Based on the mathematical properties of submodular functions [6], Kempe et al. proposed a “hill-climbing” greedy algorithm to solve this problem, which begins with an empty set $S$ and then iteratively adds to $S$ the element that brings the maximum marginal gain, until $|S| = k$. Kempe et al. proved that the solution returned by the greedy algorithm achieves a factor of $(1 - 1/e - \varepsilon)$ of the optimal solution under both the LT and IC models. Here $e$ is the base of the natural logarithm and $\varepsilon$ is any positive real number. The second term, $1/e$, originates from the submodularity of the objective function. The third term, $\varepsilon$, is the error of approximating the objective function using Monte Carlo simulations. Theoretically, the traditional greedy algorithm therefore guarantees about 63% of the optimal influence spread ($1 - 1/e \approx 0.632$). In real experiments, the solution provided by the greedy algorithm is quite close to the optimal solution. However, to obtain a good approximation of the objective function for a given seed set $S$, the greedy algorithm requires tens of thousands of Monte Carlo simulations, which seriously limits its application to large-scale networks.
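The hill-climbing procedure with a Monte Carlo estimate of the spread can be sketched as follows. This is a simplified illustration under assumptions: a single uniform activation probability `p` is used for every edge, and the names `ic_spread`, `estimate_sigma`, and `greedy_im` are hypothetical, not taken from Kempe et al. or from this paper.

```python
import random

def ic_spread(graph, seeds, p, rng):
    """Size of one simulated IC cascade with uniform edge probability p."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def estimate_sigma(graph, seeds, p, runs, rng):
    """Monte Carlo estimate of the expected spread sigma(seeds)."""
    return sum(ic_spread(graph, seeds, p, rng) for _ in range(runs)) / runs

def greedy_im(graph, k, p=0.1, runs=1000, seed=0):
    """Pick k seeds (k <= number of nodes), each maximizing the estimated gain."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        base = estimate_sigma(graph, chosen, p, runs, rng) if chosen else 0.0
        best, best_gain = None, float("-inf")
        for v in graph:  # every node must appear as a key of graph
            if v in chosen:
                continue
            gain = estimate_sigma(graph, chosen + [v], p, runs, rng) - base
            if gain > best_gain:
                best, best_gain = v, gain
        chosen.append(best)
    return chosen

# Two disjoint stars: a hub with five leaves and a hub with three leaves.
two_stars = {"h1": ["x1", "x2", "x3", "x4", "x5"], "h2": ["y1", "y2", "y3"]}
for leaf in [v for nbrs in list(two_stars.values()) for v in nbrs]:
    two_stars[leaf] = []
print(greedy_im(two_stars, k=2, p=1.0, runs=3))
```

Each round re-estimates the spread for every remaining candidate, so the cost is roughly k × |V| × runs cascade simulations. This is the cost that makes the plain greedy algorithm impractical on large networks.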
To address the time efficiency problem of the traditional greedy algorithm, a spectrum of algorithms has been proposed in recent years. Some works exploit submodularity, such as the CELF algorithm proposed by Leskovec et al. [7]. Others assume that influence spreads through the network only along shortest paths [8], [9], [10], so that the objective function can be computed exactly. Another way to reduce the time complexity is to simply select the top k nodes according to some heuristic metric [11], [12], [13], such as degree centrality or betweenness centrality. However, since these heuristic methods do not take the propagation model into account, they usually give poor solutions.
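To illustrate how submodularity can save evaluations, the following sketch implements a generic lazy greedy selection in the spirit of CELF. It is not the exact algorithm of [7]: the function `lazy_greedy`, the deterministic set-coverage objective, and the toy data are assumptions chosen so that the example is easy to check by hand.

```python
import heapq

def lazy_greedy(candidates, sigma, k):
    """Select up to k elements for a monotone submodular sigma, lazily.

    Returns the chosen elements and the number of marginal-gain evaluations.
    """
    chosen, current = [], sigma([])
    # Heap entries: (-gain, tie-break index, element, round the gain was computed in)
    heap = [(-(sigma([v]) - current), i, v, 0) for i, v in enumerate(candidates)]
    heapq.heapify(heap)
    gain_evals = len(heap)
    while heap and len(chosen) < k:
        neg_gain, i, v, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # The gain is up to date for the current seed set. By submodularity
            # the stale gains left in the heap are upper bounds, so v is the best.
            chosen.append(v)
            current += -neg_gain
        else:
            # Stale entry: recompute its gain and put it back in the heap.
            gain = sigma(chosen + [v]) - current
            gain_evals += 1
            heapq.heappush(heap, (-gain, i, v, len(chosen)))
    return chosen, gain_evals

# Toy submodular objective: number of items covered by the chosen sets.
sets = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1}}

def coverage(selected):
    return len(set().union(*(sets[v] for v in selected)))

print(lazy_greedy(list(sets), coverage, 2))
```

In the second round only the candidates at the top of the heap are re-evaluated, whereas the plain greedy algorithm would recompute the gain of every remaining candidate.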
In the age of big data, network sizes grow into the millions, if not billions. Traditional influence maximization methods either cannot handle large-scale networks or provide inaccurate solutions with low influence spread. To address these limitations, in this paper we propose CoFIM: a Community-based Framework for Influence Maximization on large-scale networks. In our framework the influence propagation process is divided into two phases: (i) seed expansion; and (ii) intra-community propagation. The first phase is the expansion of seed nodes among different communities at the beginning of diffusion. The second phase is the influence propagation within communities, which are independent of each other. Based on this framework, we derive a simplified evaluation form of the total influence spread that is submodular and can be computed efficiently. We then develop a fast greedy algorithm to select the seed nodes.
Experimental results on synthetic networks and nine large real-world datasets, including networks with millions of nodes and hundreds of millions of edges, show that our algorithm significantly outperforms other state-of-the-art methods in terms of time and memory efficiency, with almost no compromise on accuracy as evaluated by the influence spread.
The rest of this paper is organized as follows. Section 2 reviews the literature on influence maximization. Section 3 presents the preliminaries of the problem to be addressed. Section 4 introduces our CoFIM framework. Section 5 presents the evaluation framework, including the datasets, evaluation metrics, baseline methods, and experimental procedure. Experimental results are presented in Section 6. Section 7 concludes this paper.
Literature review
Since Kempe et al. [5] formally formulated the influence maximization problem and proposed the greedy algorithm, many research works have been published to tackle this problem. Generally, these works can be divided into four categories: (1) submodularity-based algorithms; (2) centrality-based heuristic algorithms; (3) influence path-based algorithms; and (4) community-based algorithms. Here we give a brief review of these research works.
Problem definition
The influence maximization problem was first formulated by Kempe et al. [5]. Under the IC or LT model, we use S to denote the seed set, i.e., the set of active nodes at step t = 0, and S(t) to denote the set of active nodes at step t. It is easy to see that the propagation stops when . Then the number of overall activated nodes after the propagation stopped can be represented by . Since the diffusion models are usually stochastic, we use σ(S) to denote the expected number of overall
Two-phases diffusion model
Given any network with non-overlapping community structure, we assume that the diffusion can be divided into two phases: (i) seed expansion and (ii) intra-community propagation. The first phase is the expansion of seed nodes across different communities. The second phase is influence diffusion within each community, with no inter-community propagation. We call the seed set S the first-order seeds and the neighbors of S the second-order seeds. We assume the first phase is the propagation
Real-world networks
We first evaluate the performance of our community-based influence maximization algorithm on nine real-world datasets, which span a spectrum of application areas and range from medium size to mega-scale, reflecting the volume and variety properties of big data. The largest dataset contains 3.1 million nodes and about 117 million edges. Two medium-sized datasets (NetHEPT and NetPHY) are downloaded from the website provided by Chen
Influence spread
We first compare the influence spread of different algorithms on nine real-world datasets, as shown in Fig. 1, where the x-axis represents the number of seed nodes we want to find and the y-axis represents the overall influence spread. From the results on the nine real-world datasets, we see that our CoFIM algorithm is always among the ones (together with TIM+ and IMM) providing the best performance in terms of influence spread. The IPA algorithm shows the worst performance on all the networks except
Conclusion
Influence maximization is a classic propagation optimization problem studied in the area of social network analysis and viral marketing. The simple greedy algorithm proposed by Kempe et al., though it provides a factor of $(1 - 1/e - \varepsilon)$ approximation to the optimal solution, cannot be applied to large-scale networks due to its low time efficiency. Other submodularity-based or node centrality-based heuristic algorithms either achieve limited improvement in time efficiency or cannot provide any
Acknowledgements
We appreciate the anonymous reviewer’s valuable comments. This work was partly supported by the Fundamental Research Funds for the Central Universities (No. 0903005203400, 0216005202068). This work was also partially supported by the foundation from the “China Equipment and Resource Sharing” project (No. 025-226009002, 226009003, Tsinghua University). This work was also partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU
References (54)
A note on maximizing a submodular set function subject to a knapsack constraint. Oper. Res. Lett. (2004).
A fast parallel modularity optimization algorithm (fpmqa) for community detection in online social network. Knowl. Based Syst. (2013).
Incorder: incremental density-based community detection in dynamic networks. Knowl. Based Syst. (2014).
Targeted revision: a learning-based approach for incremental community detection in dynamic networks. Physica A (2016).
Epidemic spreading on complex networks with overlapping and non-overlapping community structure. Physica A (2015).
Identifying influential nodes in complex networks with community structure. Knowl. Based Syst. (2013).
Cim: community-based influence maximization in social networks. ACM Trans. Intell. Syst. Technol. (TIST) (2014).
A note on maximizing the spread of influence in social networks. Inf. Process. Lett. (2011).
Ct-ic: continuously activated and time-restricted independent cascade model for viral marketing. Knowl. Based Syst. (2014).
How community structure influences epidemic spread in social networks. Physica A (2008).
Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark. Lett.
The dynamics of viral marketing. ACM Trans. Web (TWEB).
Social ties and word-of-mouth referral behavior. J. Consumer Res.
New product diffusion models in marketing: a review and directions for research. J. Market.
Maximizing the spread of influence through a social network. ACM 9th SIGKDD International Conference on Knowledge Discovery and Data Mining.
An analysis of approximations for maximizing submodular set functions-I. Math. Program.
Cost-effective outbreak detection in networks. ACM 13th SIGKDD International Conference on Knowledge Discovery and Data Mining.
Tractable models for information diffusion in social networks. Knowledge Discovery in Databases (PKDD).
Scalable influence maximization for prevalent viral marketing in large-scale social networks. ACM 16th SIGKDD International Conference on Knowledge Discovery and Data Mining.
Scalable and parallelizable processing of influence maximization for large-scale social networks. IEEE 29th International Conference on Data Engineering (ICDE).
Efficient influence maximization in social networks. ACM 15th SIGKDD International Conference on Knowledge Discovery and Data Mining.
A potential-based node selection strategy for influence maximization in a social network. Advanced Data Mining and Applications.
A new centrality measure for influence maximization in social networks. Pattern Recognition and Machine Intelligence.
Celf++: optimizing the greedy algorithm for influence maximization in social networks. ACM 20th International Conference Companion on World Wide Web.
Community structure in social and biological networks. Proc. Natl. Acad. Sci.
Uncovering the overlapping community structure of complex networks in nature and society. Nature.
Extraction and classification of dense communities in the web. ACM 16th International Conference on World Wide Web.