CoFIM: A community-based framework for influence maximization on large-scale networks

doi:10.1016/j.knosys.2016.09.029

Knowledge-Based Systems

Volume 117, 1 February 2017, Pages 88-100

https://doi.org/10.1016/j.knosys.2016.09.029

Abstract

Influence maximization is a classic optimization problem studied in social network analysis and viral marketing. Given a network, it is defined as the problem of finding k seed nodes so that the influence spread over the network is maximized. Kempe et al. proved that this problem is NP-hard and that the objective function is submodular, based on which a greedy algorithm was proposed to give a near-optimal solution. However, this simple greedy algorithm is time-consuming, which limits its application to large-scale networks, while heuristic algorithms generally cannot provide any performance guarantee. To solve this problem, in this paper we propose CoFIM, a community-based framework for influence maximization on large-scale networks. In our framework the influence propagation process is divided into two phases: (i) seed expansion; and (ii) intra-community propagation. The first phase is the expansion of seed nodes among different communities at the beginning of diffusion. The second phase is the influence propagation within communities, which are independent of each other. Based on the framework, we derive a simple evaluation form of the total influence spread which is submodular and can be efficiently computed. We then propose a fast algorithm to select the seed nodes.

Experimental results on synthetic and nine real-world large datasets including networks with millions of nodes and hundreds of millions of edges show that our algorithm achieves competitive results in influence spread as compared with state-of-the-art algorithms and it is much more efficient in terms of both time and memory usage.

Introduction

Influence maximization (IM) is a classic network optimization problem that originates from the area of viral marketing [1], [2]. Consider a company that has developed a new product and wants to promote it among customers. The company plans to select some people and let them try the product for free, hoping that these people will recommend it to their family and friends. The expectation is that the product will be widely adopted in the network through the “word-of-mouth” effect [3]. However, the budget of a company is usually limited, so a natural question is: how should the “best” initial adopters be selected so that the product is adopted as widely as possible? Mathematically speaking, this is the problem of finding k seed nodes in a network such that they cause the largest cascade, which we call influence maximization.

The study of influence maximization relies on diffusion models [4]. Currently the most widely used models are the IC (Independent Cascade) model and the LT (Linear Threshold) model [5]. In both models, each node is in one of two states: active or inactive. Active nodes are those who have adopted the product and will propagate it to their neighbors. Inactive nodes are those who have not heard of the product or have declined to adopt it. Initially all nodes are inactive; then k seed nodes are selected and activated, and propagation starts from these k nodes.

IC model: In this model, at step t, each active node u tries to activate each of its inactive neighbors v and succeeds with probability $p_{uv}$. Node u has only one chance to activate v; whether or not it succeeds, u makes no further attempt to activate v. If v is successfully activated, then from step $t + 1$ v is active and tries to activate its own inactive neighbors. If no node is activated at some step T, the diffusion process stops.
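The IC process described above can be sketched in a few lines of Python. This is only an illustrative simulation, not code from the paper: the adjacency format (a dict mapping each node u to a dict of out-neighbors v and probabilities p_uv) and the function name `simulate_ic` are assumptions made for the example.

```python
import random


def simulate_ic(graph, seeds, rng=None):
    """Run one Independent Cascade diffusion and return the set of active nodes.

    graph: dict mapping node u -> {v: p_uv} (hypothetical input format).
    seeds: iterable of nodes that are active at step 0.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(active)  # nodes activated in the previous step
    while frontier:  # the diffusion stops once a step activates no node
        newly_active = []
        for u in frontier:
            for v, p_uv in graph.get(u, {}).items():
                # u appears in the frontier exactly once, so it gets a
                # single attempt on each neighbour v that is still inactive.
                if v not in active and rng.random() < p_uv:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active
```

A single call gives one random outcome; averaging the size of the returned set over many calls approximates the expected spread of the seed set.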

LT model: In this model, the activation of node v depends on the set of its active neighbors. For each directed edge from u to v, there is a weight $b_{uv}$ indicating the influence of u on v. For each node v, the constraint $\sum_{u \text{ neighbor of } v} b_{uv} \leq 1$ must be satisfied. Node v has an activation threshold $θ_v$ between 0 and 1. Once the condition $\sum_{u \text{ active neighbor of } v} b_{uv} \geq θ_v$ is satisfied, v becomes active. When no more nodes can be activated, the diffusion process stops.
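A minimal sketch of the LT process follows, under stated assumptions: the input format (a dict mapping each node v to its in-neighbor weights b_uv) and the name `simulate_lt` are hypothetical, and thresholds are drawn uniformly at random when none are supplied, which is a common convention rather than something the text above specifies.

```python
import random


def simulate_lt(in_weights, seeds, thresholds=None, rng=None):
    """Run one Linear Threshold diffusion and return the set of active nodes.

    in_weights: dict mapping node v -> {u: b_uv}, with sum_u b_uv <= 1
                (hypothetical input format).
    thresholds: optional dict v -> theta_v; drawn uniformly if omitted.
    """
    rng = rng or random.Random()
    nodes = set(in_weights) | set(seeds)
    for nbrs in in_weights.values():
        nodes |= set(nbrs)
    if thresholds is None:
        thresholds = {v: rng.random() for v in nodes}
    active = set(seeds)
    while True:
        # A node activates once the total weight of its *active*
        # in-neighbours reaches its threshold.
        newly_active = {
            v
            for v in nodes - active
            if sum(b for u, b in in_weights.get(v, {}).items() if u in active)
            >= thresholds[v]
        }
        if not newly_active:  # no more nodes can be activated
            return active
        active |= newly_active
```

Fixing the thresholds makes the process deterministic, which is convenient for checking small examples by hand.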

The influence maximization problem was first formulated by Kempe et al. [5]. Given a network G(V, E), influence maximization aims to find a subset S of $k = |S|$ vertices such that the diffusion originating from S causes the maximum cascade of influence, i.e., $S^{*} = \arg\max_{S \subseteq V, |S| = k} σ(S)$, where σ(S) is an objective function evaluating the influence spread, defined as the expected number of active users in the network after the diffusion has stopped.

Kempe et al. [5] proved that under the traditional linear threshold (LT) and independent cascade (IC) diffusion models, this problem is NP-hard and the objective function is submodular. For an arbitrary function σ(·) that maps subsets of a finite ground set V to non-negative real numbers, we say the function is monotone and submodular if it satisfies the following two properties: (1) monotone increasing, i.e., ∀S ⊆ T, σ(S) ≤ σ(T); (2) diminishing returns, i.e., ∀v ∈ V and ∀S ⊆ T, we have $σ(S \cup \{v\}) - σ(S) \geq σ(T \cup \{v\}) - σ(T)$, which means that the marginal gain from adding an element to a set S is at least as high as the marginal gain from adding the same element to a superset of S. Based on the mathematical properties of submodular functions [6], Kempe et al. proposed a “hill-climbing” greedy algorithm to solve this problem, which begins with an empty set S and then iteratively adds to S the element that brings the maximum marginal gain, until $|S| = k$. Kempe et al. proved that the solution returned by the greedy algorithm is within a factor of ($1 - 1/e - ε$) of the optimal solution under both the LT and IC models. Here e is the base of the natural logarithm and ε is any positive real number. The second term, 1/e, originates from the submodularity of the objective function. The third term, ε, is the error of approximating the objective function using Monte Carlo simulations. Theoretically, the traditional greedy algorithm therefore guarantees about 63% of the optimal influence spread (1 − 1/e ≈ 0.632). In real experiments, the solution provided by the greedy algorithm is quite close to the optimal solution. However, to obtain a good approximation of the objective function for a given seed set S, the greedy algorithm requires tens of thousands of Monte Carlo simulations, which seriously limits its application to large-scale networks.
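The hill-climbing procedure can be illustrated with the following sketch, which estimates σ(S) under the IC model by averaging Monte Carlo runs. The function names, the graph format (node u mapped to {v: p_uv}) and the default of 1000 runs are assumptions chosen for the example; they are not taken from Kempe et al. or from this paper.

```python
import random


def ic_spread_once(graph, seeds, rng):
    """Number of active nodes after one random IC diffusion from `seeds`."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        newly_active = []
        for u in frontier:
            for v, p_uv in graph.get(u, {}).items():
                if v not in active and rng.random() < p_uv:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return len(active)


def estimate_spread(graph, seeds, runs, rng):
    """Monte Carlo estimate of the expected spread sigma(seeds)."""
    return sum(ic_spread_once(graph, seeds, rng) for _ in range(runs)) / runs


def greedy_im(graph, k, runs=1000, seed=0):
    """Pick k seeds by repeatedly adding the node with the largest estimated gain."""
    rng = random.Random(seed)
    nodes = set(graph)
    for nbrs in graph.values():
        nodes |= set(nbrs)
    chosen, value = [], 0.0
    for _ in range(min(k, len(nodes))):
        best_node, best_value = None, -1.0
        # sorted() only fixes the tie-breaking order; labels must be comparable.
        for v in sorted(nodes - set(chosen)):
            candidate = estimate_spread(graph, chosen + [v], runs, rng)
            if candidate > best_value:
                best_node, best_value = v, candidate
        chosen.append(best_node)
        value = best_value
    return chosen, value
```

Each of the k rounds re-estimates the spread for every remaining node, so the cost is roughly k · |V| · runs simulations, which is the bottleneck on large networks that the paragraph above describes.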

To improve the time efficiency of the traditional greedy algorithm, a spectrum of algorithms has been proposed in recent years. Some works exploit submodularity, such as the CELF algorithm proposed by Leskovec et al. [7]. Others assume that influence spreads through the network only along shortest paths [8], [9], [10], so that the objective function can be computed exactly. Another way to reduce the time complexity is to simply select the top k nodes according to heuristic metrics [11], [12], [13], such as degree centrality and betweenness centrality. However, since these heuristic methods do not take the propagation model into account, they usually give poor solutions.
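To show how submodularity can be exploited, here is a sketch of lazy evaluation in the spirit of CELF. It is not the published CELF implementation: `lazy_greedy` and its `objective` argument (any monotone submodular set function supplied by the caller) are hypothetical names introduced for illustration.

```python
import heapq


def lazy_greedy(candidates, k, objective):
    """CELF-style lazy greedy for a monotone submodular objective(frozenset).

    Returns (selected nodes, objective value, number of gain evaluations).
    """
    selected = []
    current = objective(frozenset())
    # Heap entries: (-marginal gain, node, |selected| when the gain was computed).
    heap = [(-(objective(frozenset([v])) - current), v, 0) for v in candidates]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(selected) < k:
        neg_gain, v, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # The gain is up to date. By submodularity every stale entry is an
            # upper bound on its true gain, so no other node can beat v.
            selected.append(v)
            current -= neg_gain
        else:
            # Stale entry: recompute the gain for the current seed set.
            gain = objective(frozenset(selected + [v])) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, v, len(selected)))
    return selected, current, evaluations
```

Because marginal gains can only shrink as the seed set grows, many nodes never need to be re-evaluated after the first round, which is where the savings over plain greedy come from.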

In the age of big data, networks grow to millions, if not billions, of nodes. Traditional influence maximization methods either cannot handle large-scale networks or provide inaccurate solutions with low influence spread. To address these limitations, in this paper we propose CoFIM: a Community-based Framework for Influence Maximization on large-scale networks. In our framework the influence propagation process is divided into two phases: (i) seed expansion; and (ii) intra-community propagation. The first phase is the expansion of seed nodes among different communities at the beginning of diffusion. The second phase is the influence propagation within communities, which are independent of each other. Based on the framework, we derive a simplified evaluation form of the total influence spread which is submodular and can be efficiently computed. We then develop a fast greedy algorithm to select the seed nodes.

Experimental results on synthetic and nine real-world large datasets, including networks with millions of nodes and hundreds of millions of edges, show that our algorithm can significantly outperform other state-of-the-art methods in terms of time and memory efficiency with almost no compromise on accuracy as evaluated by the influence spread.

The rest of this paper is organized as follows. Section 2 reviews the literature on influence maximization. Section 3 elaborates the preliminaries of the problem to be addressed. Section 4 introduces our CoFIM framework. Section 5 presents the evaluation framework, including the datasets, evaluation metrics, baseline methods, and experimental procedure. Experiment results are presented in Section 6. Section 7 concludes this paper.

Section snippets

Literature review

Since Kempe et al. [5] formally formulated the influence maximization problem and proposed the greedy algorithm, many research works have been published to tackle this problem. Generally, these works can be divided into four categories: (1) submodularity-based algorithms; (2) centrality-based heuristic algorithms; (3) influence-path-based algorithms; and (4) community-based algorithms. Here we give a brief review of these research works.

Problem definition

The influence maximization problem was first formulated by Kempe et al. [5]. Under the IC or LT model, we use S to denote the seed set, i.e., the set of active nodes at step $t = 0$, and S(t) to denote the set of nodes newly activated at step t. It is easy to see that the propagation stops when $S(t) = \emptyset$. Then the number of overall activated nodes after the propagation has stopped can be represented by $\sum_{t = 0}^{\infty} |S(t)|$. Since the diffusion models are usually stochastic, we use σ(S) to denote the expected number of overall

Two-phases diffusion model

Given any network with non-overlapping community structure, we assume that the diffusion can be divided into two phases: (i) seed expansion and (ii) intra-community propagation. The first phase is the expansion of seed nodes across different communities. The second phase is influence diffusion within each community, with no inter-community propagation. We call the seed set S the first-order seeds and the neighbors of S the second-order seeds. We assume the first phase is the propagation

Real-world networks

We first evaluate the performance of our community-based influence maximization algorithm on nine real-world datasets, which cover a spectrum of application areas and range from medium size to mega-scale, reflecting the volume and variety properties of big data. The largest dataset contains 3.1 million nodes and about 117 million edges. Two medium-sized datasets (NetHEPT and NetPHY) are downloaded from the website² provided by Chen

Influence spread

We first compare the influence spread of different algorithms on nine real-world datasets, as shown in Fig. 1, where the x-axis represents the number of seed nodes to be selected and the y-axis represents the overall influence spread. From the results on the nine real-world datasets, we see that our CoFIM algorithm is always among the ones (together with TIM+ and IMM) providing the best performance in terms of influence spread. The IPA algorithm shows the worst performance on all the networks except

Conclusion

Influence maximization is a classic propagation optimization problem studied in the area of social network analysis and viral marketing. The simple greedy algorithm proposed by Kempe et al., although it provides a factor of ($1 - 1/e - ε$) approximation to the optimal solution, cannot be applied to large-scale networks due to its low time efficiency. Other submodularity-based or node-centrality-based heuristic algorithms either achieve limited improvement in time efficiency or cannot provide any

Acknowledgements

We appreciate the anonymous reviewer’s valuable comments. This work was partly supported by the Fundamental Research Funds for the Central Universities (No. 0903005203400, 0216005202068). This work was also partially supported by the foundation from the “China Equipment and Resource Sharing” project (No. 025-226009002, 226009003, Tsinghua University). This work was also partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU

References (54)

M. Sviridenko. A note on maximizing a submodular set function subject to a knapsack constraint. Oper. Res. Lett. (2004)
Z. Bu et al. A fast parallel modularity optimization algorithm (FPMQA) for community detection in online social network. Knowl. Based Syst. (2013)
H. Sun et al. IncOrder: incremental density-based community detection in dynamic networks. Knowl. Based Syst. (2014)
J. Shang et al. Targeted revision: a learning-based approach for incremental community detection in dynamic networks. Physica A (2016)
J. Shang et al. Epidemic spreading on complex networks with overlapping and non-overlapping community structure. Physica A (2015)
X. Zhang et al. Identifying influential nodes in complex networks with community structure. Knowl. Based Syst. (2013)
Y.-C. Chen et al. CIM: community-based influence maximization in social networks. ACM Trans. Intell. Syst. Technol. (TIST) (2014)
E. Even-Dar et al. A note on maximizing the spread of influence in social networks. Inf. Process. Lett. (2011)
J. Kim et al. CT-IC: continuously activated and time-restricted independent cascade model for viral marketing. Knowl. Based Syst. (2014)
X. Wu et al. How community structure influences epidemic spread in social networks. Physica A (2008)
J. Goldenberg et al. Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark. Lett. (2001)
J. Leskovec et al. The dynamics of viral marketing. ACM Trans. Web (TWEB) (2007)
J.J. Brown et al. Social ties and word-of-mouth referral behavior. J. Consumer Res. (1987)
V. Mahajan et al. New product diffusion models in marketing: a review and directions for research. J. Market. (1990)
D. Kempe et al. Maximizing the spread of influence through a social network. ACM 9th SIGKDD International Conference on Knowledge Discovery and Data Mining (2003)
G.L. Nemhauser et al. An analysis of approximations for maximizing submodular set functions-I. Math. Program. (1978)
J. Leskovec et al. Cost-effective outbreak detection in networks. ACM 13th SIGKDD International Conference on Knowledge Discovery and Data Mining (2007)
M. Kimura et al. Tractable models for information diffusion in social networks. Knowledge Discovery in Databases (PKDD) (2006)
W. Chen et al. Scalable influence maximization for prevalent viral marketing in large-scale social networks. ACM 16th SIGKDD International Conference on Knowledge Discovery and Data Mining (2010)
J. Kim et al. Scalable and parallelizable processing of influence maximization for large-scale social networks. IEEE 29th International Conference on Data Engineering (ICDE) (2013)
W. Chen et al. Efficient influence maximization in social networks. ACM 15th SIGKDD International Conference on Knowledge Discovery and Data Mining (2009)
Y. Wang et al. A potential-based node selection strategy for influence maximization in a social network. Advanced Data Mining and Applications (2009)
S. Kundu et al. A new centrality measure for influence maximization in social networks. Pattern Recognition and Machine Intelligence (2011)
A. Goyal et al. CELF++: optimizing the greedy algorithm for influence maximization in social networks. ACM 20th International Conference Companion on World Wide Web (2011)
M. Girvan et al. Community structure in social and biological networks. Proc. Natl. Acad. Sci. (2002)
G. Palla et al. Uncovering the overlapping community structure of complex networks in nature and society. Nature (2005)
Y. Dourisboure et al. Extraction and classification of dense communities in the web. ACM 16th International Conference on World Wide Web (2007)

