Elsevier

Knowledge-Based Systems

Volume 117, 1 February 2017, Pages 88-100
CoFIM: A community-based framework for influence maximization on large-scale networks

https://doi.org/10.1016/j.knosys.2016.09.029

Abstract

Influence maximization is a classic optimization problem studied in the area of social network analysis and viral marketing. Given a network, it is defined as the problem of finding k seed nodes so that the influence spread over the network is maximized. Kempe et al. proved that this problem is NP-hard and that the objective function is submodular, based on which a greedy algorithm was proposed to give a near-optimal solution. However, this simple greedy algorithm is time-consuming, which limits its application to large-scale networks, while faster heuristic algorithms generally cannot provide any performance guarantee. To solve this problem, in this paper we propose CoFIM, a community-based framework for influence maximization on large-scale networks. In our framework the influence propagation process is divided into two phases: (i) seed expansion; and (ii) intra-community propagation. The first phase is the expansion of seed nodes among different communities at the beginning of diffusion. The second phase is the influence propagation within communities, which are independent of each other. Based on the framework, we derive a simple evaluation form of the total influence spread which is submodular and can be efficiently computed. We then propose a fast algorithm to select the seed nodes.

Experimental results on synthetic and nine real-world large datasets including networks with millions of nodes and hundreds of millions of edges show that our algorithm achieves competitive results in influence spread as compared with state-of-the-art algorithms and it is much more efficient in terms of both time and memory usage.

Introduction

Influence maximization (IM) is a classic network optimization problem. It originates from the area of viral marketing [1], [2]. Consider a company that has developed a new product and wants to promote it among its customers. The company plans to select some people and let them try the product for free, hoping that these people will recommend it to their family and friends. The expectation is that the product will be widely adopted in the network through the “word-of-mouth” effect [3]. However, the budget of a company is usually limited, so a natural question is: how to select the “best” initial adopters so that the product is adopted as widely as possible? Mathematically speaking, it is defined as the problem of finding k seed nodes in a network such that they cause the maximum scale of cascading, which we call influence maximization.

The study of influence maximization relies on diffusion models [4]. Currently the most widely used models are the IC (Independent Cascade) model and the LT (Linear Threshold) model [5]. In both models, each node is in one of two states: active or inactive. Active nodes are those who have adopted the product and will propagate it to their neighbors. Inactive nodes are those who have not heard of the product or have declined to adopt it. Initially all nodes are inactive; then k seed nodes are selected to be activated and propagation starts from these k nodes.

IC model: In this model, at step t, each active node u tries to activate each of its inactive neighbors v and succeeds with probability $p_{uv}$. Node u has only one chance to activate v: whether or not it succeeds, u makes no further attempt to activate v. If v is successfully activated, then from step t+1 v is active and tries to activate its own inactive neighbors. If no node is activated at some step T, the diffusion process stops.
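
The IC process described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the adjacency-dict format, the function name `simulate_ic`, and the use of one uniform probability `p` for every edge (instead of a separate $p_{uv}$ per edge) are assumptions made here for brevity.

```python
import random

def simulate_ic(graph, seeds, p, rng):
    """Run one Independent Cascade diffusion and return the set of active nodes.

    graph: dict mapping each node to a list of out-neighbors (assumed format)
    p:     one activation probability shared by all edges (an assumption;
           the model allows a separate p_uv per edge)
    rng:   a random.Random instance, passed in so runs are reproducible
    """
    active = set(seeds)
    frontier = list(seeds)          # nodes activated in the previous step
    while frontier:                 # stop when a step activates nobody
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, []):
                # u appears in a frontier only once, so it gets exactly one
                # attempt on each neighbor v that is still inactive
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

# Hypothetical toy network: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
toy = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(len(simulate_ic(toy, [0], 0.5, random.Random(0))))
```

A single run is random; averaging the size of the returned set over many runs gives a Monte Carlo estimate of the expected spread from the chosen seeds.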

LT model: In this model, the activation of node v depends on the set of its active neighbors. Each directed edge from u to v carries a weight $b_{uv}$ indicating the influence of u on v. For each node v, the constraint $\sum_{u \text{ neighbor of } v} b_{uv} \le 1$ must be satisfied. Node v has an activation threshold $\theta_v$, which is between 0 and 1. Once the condition $\sum_{u \text{ active neighbor of } v} b_{uv} \ge \theta_v$ is satisfied, v becomes active. When no more nodes can be activated, the diffusion process stops.
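
As a rough sketch of the LT rule, the following Python function iterates until no further node crosses its threshold. The input format (a dict of incoming weights per node), the function name `simulate_lt`, and the toy weights and thresholds are illustrative assumptions, not taken from the paper. Thresholds are passed in explicitly, so the function itself is deterministic.

```python
def simulate_lt(in_weights, seeds, thresholds):
    """Run Linear Threshold diffusion to a fixed point; return the active set.

    in_weights: dict v -> {u: b_uv} for each in-neighbor u of v (assumed
                format); the weights into each v are assumed to sum to <= 1
    thresholds: dict v -> theta_v in [0, 1]
    """
    active = set(seeds)
    changed = True
    while changed:                  # stop when a full pass activates nobody
        changed = False
        for v, weights in in_weights.items():
            if v in active:
                continue
            # total weight arriving from currently active in-neighbors
            pressure = sum(b for u, b in weights.items() if u in active)
            if pressure >= thresholds[v]:
                active.add(v)
                changed = True
    return active

# Hypothetical toy instance: edges 0->1, 0->2, 1->2, 2->3
toy_w = {1: {0: 0.6}, 2: {0: 0.3, 1: 0.3}, 3: {2: 0.5}}
theta = {1: 0.5, 2: 0.5, 3: 0.9}
print(sorted(simulate_lt(toy_w, [0], theta)))
```

Nodes are updated within a pass rather than in synchronized steps. Because activation under LT is monotone (an active node never becomes inactive), the final active set is the same either way; only the step at which each node activates may differ.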

The influence maximization problem was first formulated by Kempe et al. [5]. Given a network G(V, E), influence maximization aims to find a subset S of k = |S| vertices such that the diffusion originating from S causes the maximum cascade of influence, i.e., $S^* = \arg\max_{S \subseteq V,\, |S| = k} \sigma(S)$, where σ(S) is an objective function evaluating the influence spread, defined as the expected number of active users in the network after the diffusion has stopped.

Kempe et al. [5] proved that under the traditional linear threshold (LT) and independent cascade (IC) diffusion models, this problem is NP-hard and the objective function is submodular. For an arbitrary function σ( · ) that maps subsets of a finite ground set U to non-negative real numbers, we say the function is monotone and submodular if it satisfies the following two properties: (1) monotone increasing, i.e., $\forall S \subseteq T$, $\sigma(S) \le \sigma(T)$; (2) diminishing returns, i.e., $\forall v \in V$, $S \subseteq T$, we have $\sigma(S \cup \{v\}) - \sigma(S) \ge \sigma(T \cup \{v\}) - \sigma(T)$, which means that the marginal gain from adding an element to a set S is at least as high as the marginal gain from adding the same element to a superset of S. Based on the mathematical properties of submodular functions [6], Kempe et al. proposed a “hill-climbing” greedy algorithm to solve this problem, which begins with an empty set S and then iteratively adds to S the element that brings the maximum marginal gain, until |S| = k. Kempe et al. proved that the solution provided by the greedy algorithm is within a factor of $(1 - 1/e - \varepsilon)$ of the optimal solution under both the LT and IC models. Here e is the base of the natural logarithm and ε is any positive real number. The term 1/e originates from the submodularity of the objective function. The term ε is the error of approximating the objective function using Monte Carlo simulations. Theoretically, the traditional greedy algorithm thus guarantees about 63% of the optimal influence spread, since $1 - 1/e \approx 0.632$. In real experiments, the solution provided by the greedy algorithm is quite close to the optimal solution. However, to obtain a good approximation of the objective function for a given seed set S, the greedy algorithm requires tens of thousands of Monte Carlo simulations, which seriously limits its application to large-scale networks.
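
The hill-climbing procedure can be sketched as follows, using Monte Carlo simulation of the IC model to estimate σ(S). This is an illustrative sketch rather than Kempe et al.'s implementation: the function names, the adjacency-dict format, the uniform edge probability `p`, and the small default number of runs are all assumptions chosen to keep the example short.

```python
import random

def ic_spread(graph, seeds, p, rng):
    """Size of one Independent Cascade run from `seeds` (uniform edge probability p)."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def estimate_sigma(graph, seeds, p, runs, rng):
    """Monte Carlo estimate of sigma(S): average spread over `runs` simulations."""
    return sum(ic_spread(graph, seeds, p, rng) for _ in range(runs)) / runs

def greedy_seeds(graph, k, p=0.1, runs=200, seed=0):
    """Hill-climbing: repeatedly add the node whose addition gives the largest
    estimated spread, which is the node with the largest estimated marginal gain."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    chosen, current = [], 0.0
    for _ in range(min(k, len(nodes))):
        best_node, best_value = None, -1.0
        for v in sorted(nodes - set(chosen)):
            value = estimate_sigma(graph, chosen + [v], p, runs, rng)
            if value > best_value:
                best_node, best_value = v, value
        chosen.append(best_node)
        current = best_value
    return chosen, current

# Hypothetical toy network: a star around node 0 plus a separate edge 4 -> 5
star = {0: [1, 2, 3], 4: [5]}
print(greedy_seeds(star, 2, p=0.5))
```

Every round re-estimates σ for each remaining candidate, so the cost is on the order of k × |V| × runs simulations. This makes the scalability problem noted above concrete.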

To address the time efficiency problem of the traditional greedy algorithm, a spectrum of algorithms has been proposed in recent years. Some works make use of submodularity, such as the CELF algorithm proposed by Leskovec et al. [7]. Other works assume that influence can spread through the network only along shortest paths [8], [9], [10], so that the objective function can be computed exactly. Another way to reduce the time complexity is to simply select the top k nodes according to some heuristic metric [11], [12], [13], such as degree centrality or betweenness centrality. However, since the heuristic methods take no account of propagation models, they usually give poor solutions.
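
To illustrate how submodularity can save evaluations, here is a sketch of lazy marginal-gain evaluation, the idea commonly associated with CELF. It is not Leskovec et al.'s implementation. The function `lazy_greedy`, the stand-in objective `coverage`, and the toy `reach` sets are hypothetical; a deterministic coverage function is used instead of a simulated influence spread so that the behaviour is easy to follow.

```python
import heapq

def lazy_greedy(candidates, k, sigma):
    """Select k elements using lazily re-evaluated marginal gains.

    sigma: callable taking a list of elements and returning a number; it is
           assumed to be monotone and submodular, otherwise laziness is unsafe.
    Elements must be mutually comparable (they break ties inside the heap).
    """
    chosen, base = [], sigma([])
    # heap entries: (-gain, element, round in which the gain was computed)
    heap = [(-(sigma([v]) - base), v, 0) for v in candidates]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(chosen) < k:
        neg_gain, v, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # this gain is up to date, and by submodularity every stale gain
            # below it could only shrink if recomputed, so v is the best choice
            chosen.append(v)
            base += -neg_gain
        else:
            gain = sigma(chosen + [v]) - base
            evaluations += 1
            heapq.heappush(heap, (-gain, v, len(chosen)))
    return chosen, evaluations

# Hypothetical stand-in objective: number of distinct items covered
reach = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2}}

def coverage(selected):
    return len(set().union(*(reach[v] for v in selected)))

print(lazy_greedy(sorted(reach), 2, coverage))
```

The saving comes from candidates whose stale gain is already below an up-to-date gain: they are never re-evaluated in that round, whereas plain greedy would re-evaluate every remaining candidate.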

In the age of big data, networks grow to millions, if not billions, of nodes. Traditional influence maximization methods either cannot handle large-scale networks or provide inaccurate solutions with low influence spread. To overcome these limitations, in this paper we propose CoFIM: a Community-based Framework for Influence Maximization on large-scale networks. In our framework the influence propagation process is divided into two phases: (i) seed expansion; and (ii) intra-community propagation. The first phase is the expansion of seed nodes among different communities at the beginning of diffusion. The second phase is the influence propagation within communities, which are independent of each other. Based on the framework, we derive a simplified evaluation form of the total influence spread which is submodular and can be efficiently computed. We then develop a fast greedy algorithm to select the seed nodes.

Experimental results on synthetic and nine real-world large datasets, including networks with millions of nodes and hundreds of millions of edges, show that our algorithm can significantly outperform other state-of-the-art methods in terms of time and memory efficiency with almost no compromise on accuracy as evaluated by the influence spread.

The rest of this paper is organized as follows. Section 2 reviews the literature on influence maximization. Section 3 elaborates the preliminaries of the problem to be addressed. Section 4 introduces our CoFIM framework. Section 5 presents the evaluation framework, including the datasets, evaluation metrics, baseline methods, and experimental procedure. Experiment results are presented in Section 6. Section 7 concludes this paper.

Section snippets

Literature review

Since Kempe et al. [5] formally formulated the influence maximization problem and proposed the greedy algorithm, many research works have been published to tackle this problem. Generally, these works can be divided into four categories: (1) submodularity-based algorithms; (2) centrality-based heuristic algorithms; (3) influence-path-based algorithms; and (4) community-based algorithms. Here we give a brief review of these research works.

Problem definition

The influence maximization problem was first formulated by Kempe et al. [5]. Under the IC or LT model, we use S to denote the seed set, i.e., the set of active nodes at step t=0, and S(t) to denote the set of nodes activated at step t. It is easy to see that the propagation stops when $S(t) = \emptyset$. Then the total number of activated nodes after the propagation has stopped can be represented by $\sum_{t \ge 0} |S(t)|$. Since the diffusion models are usually stochastic, we use σ(S) to denote the expected number of overall

Two-phases diffusion model

Given any network with non-overlapping community structure, we assume that the diffusion can be divided into two phases: (i) seed expansion and (ii) intra-community propagation. The first phase is the expansion of seed nodes across different communities. The second phase is influence diffusion within each community, with no inter-community propagation. We call the seed set S the first-order seeds and the neighbors of S the second-order seeds. We assume the first phase is the propagation

Real-world networks

We first evaluate the performance of our community-based influence maximization algorithm on nine real-world datasets, which span a spectrum of application areas and range from medium size to mega-scale, reflecting the volume and variety properties of big data. The largest dataset contains 3.1 million nodes and about 117 million edges. Two medium-sized datasets (NetHEPT and NetPHY) are downloaded from the website provided by Chen

Influence spread

We first compare the influence spread of different algorithms on nine real-world datasets, as shown in Fig. 1, where the x-axis represents the number of seed nodes to be selected and the y-axis represents the overall influence spread. From the results on the nine real-world datasets, we see that our CoFIM algorithm is always among the ones (together with TIM+ and IMM) providing the best performance in terms of influence spread. The IPA algorithm shows the worst performance on all the networks except

Conclusion

Influence maximization is a classic propagation optimization problem studied in the area of social network analysis and viral marketing. The simple greedy algorithm proposed by Kempe et al., although it provides a $(1 - 1/e - \varepsilon)$ approximation to the optimal solution, cannot be applied to large-scale networks due to its low time efficiency. Other submodularity-based or node-centrality-based heuristic algorithms either achieve limited improvement in time efficiency, or cannot provide any

Acknowledgements

We appreciate the anonymous reviewer’s valuable comments. This work was partly supported by the Fundamental Research Funds for the Central Universities (No. 0903005203400, 0216005202068). This work was also partially supported by the foundation from the “China Equipment and Resource Sharing” project (No. 025-226009002, 226009003, Tsinghua University). This work was also partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU

References (54)

  • J. Goldenberg et al.

    Talk of the network: a complex systems look at the underlying process of word-of-mouth

    Mark. Lett.

    (2001)
  • J. Leskovec et al.

    The dynamics of viral marketing

    ACM Trans. Web (TWEB)

    (2007)
  • J.J. Brown et al.

    Social ties and word-of-mouth referral behavior

    J. Consumer Res.

    (1987)
  • V. Mahajan et al.

    New product diffusion models in marketing: a review and directions for research

    J. Market.

    (1990)
  • D. Kempe et al.

    Maximizing the spread of influence through a social network

    ACM 9th SIGKDD International Conference on Knowledge Discovery and Data Mining

    (2003)
  • G.L. Nemhauser et al.

    An analysis of approximations for maximizing submodular set functions-I

    Math. Program.

    (1978)
  • J. Leskovec et al.

    Cost-effective outbreak detection in networks

    ACM 13th SIGKDD International Conference on Knowledge Discovery and Data Mining

    (2007)
  • M. Kimura et al.

    Tractable models for information diffusion in social networks

    Knowledge Discovery in Databases (PKDD)

    (2006)
  • W. Chen et al.

    Scalable influence maximization for prevalent viral marketing in large-scale social networks

    ACM 16th SIGKDD International Conference on Knowledge Discovery and Data Mining

    (2010)
  • J. Kim et al.

    Scalable and parallelizable processing of influence maximization for large-scale social networks

    IEEE 29th International Conference on Data Engineering (ICDE)

    (2013)
  • W. Chen et al.

    Efficient influence maximization in social networks

    ACM 15th SIGKDD International Conference on Knowledge Discovery and Data Mining

    (2009)
  • Y. Wang et al.

    A potential-based node selection strategy for influence maximization in a social network

    Advanced Data Mining and Applications

    (2009)
  • S. Kundu et al.

    A new centrality measure for influence maximization in social networks

    Pattern Recognition and Machine Intelligence

    (2011)
  • A. Goyal et al.

    CELF++: optimizing the greedy algorithm for influence maximization in social networks

    ACM 20th International Conference Companion on World Wide Web

    (2011)
  • M. Girvan et al.

    Community structure in social and biological networks

    Proc. Natl. Acad. Sci.

    (2002)
  • G. Palla et al.

    Uncovering the overlapping community structure of complex networks in nature and society

    Nature

    (2005)
  • Y. Dourisboure et al.

    Extraction and classification of dense communities in the web

    ACM 16th International Conference on World Wide Web

    (2007)