Abstract

Community search is a query-oriented variant of the community detection problem whose goal is to retrieve a single community for a given set of query nodes. Most existing community search methods rely on handcrafted features, which limits their applicability. Our idea is motivated by recent advances in node embedding. Node embedding uses deep learning to obtain feature representations of nodes automatically from the graph structure and offers a new way to measure the distance between two nodes. In this paper, we propose a two-stage community search algorithm with a minimum spanning tree strategy based on node embedding. At the first stage, we propose a node embedding model, NEBRW, and map nodes to points in a low-dimensional vector space. At the second stage, we propose a new definition of community from the distance viewpoint, transform the community search problem into a variant of the minimum spanning tree problem, and uncover the target community with an improved Prim algorithm. We test our algorithm on both synthetic and real-world network datasets. The experimental results show that our algorithm is more effective for community search than the baselines.

1. Introduction

Community detection is one of the most popular problems in social network analysis, and its goal is to identify all communities in a network [1–3]. Discovering communities in social networks may offer insight into how the networks are organised and has many applications [4, 5]. However, there are many application scenarios in which we are interested in a particular community instead of all communities in a network. For example, in a scholar network such as DBLP, we may be interested in a group of data mining experts, but not all experts in that community are known [6]. As another example, in recommendation systems, to offer a tourist the most relevant and personalized local venue recommendations, his or her local interest community needs to be mined first [7]. Both of the above examples are instances of a query-oriented variant of the community detection problem in which only a single community needs to be detected [8–10]. The community search problem has also been studied as local community detection [11, 12] or seed set expansion [6, 13, 14].

The traditional community detection methods aim to enumerate all the communities in a network, and their running time is proportional to the size of the entire graph; thus, their efficiency is inadequate for community search, which aims to find one particular community [15]. To address this limitation, many research studies have been devoted to the community search problem. Luo et al. [16], Huang et al. [17], Ma et al. [18], and Liu et al. [19] study the scenario in which only a query node needs to be pre-known in the target community, but they sometimes perform poorly since they place no limitation on the size of the returned result. Kloumann and Kleinberg [6] and Clauset [20] study the scenario in which a researcher needs to pre-know the number of members in the target community, and Kloumann and Kleinberg [6] make the further assumption that a seed set of nodes in the target community also needs to be pre-known, which is hard to obtain in real applications.

In this paper, we focus on a particular case of the community search problem: for a graph G, given a query node q, the goal is to find the k nodes that are in the same community as q. Motivated by node embedding, which provides a new approach to learning node features directly from the graph structure, we propose a two-stage community search algorithm. At the first stage, we propose a Node Embedding model with a Biased Random Walk (NEBRW) based on the Skip-gram model and map nodes to points in a low-dimensional vector space. Moreover, we transform the proximity between each pair of nodes into a distance. At the second stage, we define the community of a query node as the set of nodes connected via the shortest total distance. The community search problem is thus transformed into a variant of the minimum spanning tree problem. For this purpose, we define a new measurement of the distance between two nodes and implement a new community search algorithm with a minimum spanning tree-based approach.

To sum up, our main contributions in this paper are summarized as follows:
(1) We propose a node embedding model, NEBRW, based on the Skip-gram model, map nodes to points in a low-dimensional vector space, and define a new measurement of the distance between two nodes.
(2) We propose a new definition of community from the distance viewpoint: a community is a group of nodes connected via the shortest total distance. With this definition, we transform the community search problem into a variant of the minimum spanning tree problem.
(3) Based on the above definition, we design a novel Community Search algorithm with a Minimum Spanning Tree approach (CSMST) and test it on both synthetic and real-world network datasets. The experimental results show that our algorithm is more effective at community search than the baselines.

The rest of the paper is organised as follows. Section 2 introduces related work. We give a formal definition of the community search problem in Section 3 and describe the detailed algorithm in Section 4. We report experimental results in Section 5, followed by conclusions in Section 6.

2. Related Work

Our work is partly inspired by the work on community search and partly by the work on node embedding. In this section, we review both lines of work.

2.1. Community Detection and Community Search

Community detection is an interesting problem in social network analysis, and various types of algorithms have been proposed, including modularity maximization models [3], hierarchical clustering models [21], and distance dynamics models [22]. The goal of community detection is to enumerate all the communities in a network, and recent work is reviewed in the literature [2, 4]. Community search is a query-oriented variant of the community detection problem, and the goal is to obtain a single community for a given set of query nodes [8, 10]. The traditional community detection methods aim to enumerate all the communities in a network; thus, their efficiency is inadequate for community search.

Community search has attracted a lot of attention, and many algorithms have been proposed; however, the problem definitions are not completely consistent. A mainstream line of effort focuses on querying the community of a single query node. Luo et al. [16] define a local modularity measure and identify the subgraph with the maximum value of this measure, starting from a query node, via a locally optimized approach. Huang et al. [17] introduce a similarity-based community quality function and design an algorithm, LTE, for revealing the natural community of a query node via local optimization of the measure. Different from Huang's similarity measure, which only considers adjacent nodes, Ma et al. [18] introduce a k-NS measure that also takes into account nonadjacent vertices within a bounded distance and propose a k-NS-based community search algorithm. In addition, Clauset [20] and Panagiotakis et al. [23] assume that the approximate size of the target community needs to be pre-known. Clauset [20] defines a local modularity measure and proposes an algorithm that identifies a community with a fixed number of nodes by maximizing this measure in a greedy fashion. Panagiotakis et al. [23] propose a flow propagation algorithm, FlowPro, to find the community surrounding a query node. Another line of effort focuses on finding the community of a set of query nodes. Kloumann and Kleinberg [6] study the scenario in which a researcher needs to pre-know an initial seed set of nodes from the target community.

Furthermore, there is another family of minimum spanning tree-based community detection algorithms. Saoud and Moussaoui [24] construct the minimum spanning tree of the network based on the dissimilarities between the endpoints of each edge, obtain groups of nodes by removing the edges with the highest dissimilarities, and then merge pairs of groups to identify the final community structure that maximizes modularity. To overcome the limitations of modularity maximization, Asmi et al. [25] propose a new algorithm that reveals communities in social networks based on a minimum spanning tree and the strength of similarity between two nodes.

2.2. Node Embedding

The key challenge in networked data mining is how to find a proper representation of the network structure that can be exploited by downstream tasks [19, 26]. Most existing data mining models are designed to handle vectorized data, so networked data cannot be fed into these models directly. Node embedding enables the automatic discovery of vector representations of nodes directly from the graph structure [27], and the relevant work is reviewed in the literature [28–30]. Besides homogeneous networks, several embedding approaches for heterogeneous networks [31] have also been proposed in recent years [27, 32, 33].

According to the literature [30], the output of network graph embedding includes node embedding, edge embedding, hybrid embedding, and whole-graph embedding. Our work belongs to node embedding. The work most related to ours is the family of word2vec-based node embedding algorithms, such as DeepWalk [34], node2vec [35], and NEMCNB [19]. By viewing nodes as words and random walk paths on networks as sentences, these methods generalize word embedding techniques from natural language processing, moving from lists of sentences to graphs [14]. These algorithms usually consist of two steps. First, node paths are generated by performing random walks on a network. Second, vector representations of nodes are learned by applying a word embedding technique.

Recently, research on incorporating node embedding into community detection has attracted great interest. One line of research learns low-dimensional vector representations of nodes from the network topology and feeds them as node features to clustering algorithms such as k-means. To improve community detection accuracy, Jin et al. [36] define a new pairwise Markov Random Field framework which not only utilizes network embedding but also uses the network topology to adjust the improper division of nodes.

Motivated by the above work, we propose a new node embedding model to learn vector representation of nodes and design a minimum spanning tree-based community search algorithm.

3. Problem Definition and Solution Approach

A network can be represented by a graph G = (V, E), where V is the set of nodes and E is the set of edges. A community in a network is a subgraph within which the nodes are in close proximity. The general community search problem is defined as follows.

Problem 1. (community search). For a given network G = (V, E), we are interested in a potential community C, but pre-know only one of its members, the query node q; the goal is to find out the other members of C.

The traditional way of quantifying the quality of a community focuses on the density of internal edges; e.g., the local modularity of Luo et al. [16] is the ratio of the number of internal edges to the number of external edges, and the measure of Clauset [20] is the fraction of boundary edges that are internal to the community. In this paper, we quantify the quality of a community, from a distance viewpoint, by the shortest total distance connecting all nodes in it. Based on this, we formally define the community search problem based on minimum spanning trees as follows.

Problem 2. (community search based on minimum spanning tree). For a network G = (V, E) with a distance function d(u, v) between nodes, given a query node q and a size constraint k, we seek to find a subgraph H = (V_H, E_H) of G, such that (1) V_H contains q; (2) |V_H| = k; (3) H is connected; (4) the total distance of the edges in H is minimized among all feasible choices of H.
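
For readability, Problem 2 can be restated compactly with the notation introduced above (H, V_H, E_H, d, q, and k; the symbols are reconstructed here for presentation):

    \begin{aligned}
    \min_{H=(V_H,E_H)\subseteq G}\ & \sum_{(u,v)\in E_H} d(u,v)\\
    \text{subject to}\ & q\in V_H,\quad |V_H|=k,\quad H\ \text{connected}.
    \end{aligned}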

We now discuss the problem of finding the subgraph H. Firstly, H is a connected subgraph with k nodes, so it has at least k − 1 edges. Moreover, when the total edge distance is minimized, H contains no redundant edges, so it has at most k − 1 edges. Putting these two facts together, H has exactly k − 1 edges. Secondly, consider the subgraph G' of G induced by V_H: for any nodes u, v ∈ V_H, if (u, v) ∈ E, then (u, v) is an edge of G'. When the total edge distance of H is minimized, H connects all nodes in V_H with the shortest total length; thus, H is the minimum spanning tree of the induced graph G'.

Based on the above discussion, the community search problem is transformed into a variant of the minimum spanning tree problem that starts from node q and spans only k nodes. However, our problem is different from the classic minimum spanning tree problem [37, 38], which is described as follows: given a set of nodes, connect them by a network having the smallest total edge length [38]. It differs from our problem in two aspects. The first aspect is that the classic minimum spanning tree connects all nodes in the network with the smallest total edge length, whereas we aim to connect only the nodes in V_H, which is a small part of V. The second is that, in the classic problem, the nodes to be connected are pre-known, but we do not know which nodes belong to the target community except that the query node q belongs to it and that it contains k nodes. Thus, we cannot adopt the classic minimum spanning tree algorithms directly. In our solution, we design an improved Prim algorithm to solve this problem.

To solve Problem 2, we propose a two-stage community search algorithm, CSMST, which consists of a node embedding stage and a community search stage. An illustration is shown in Figure 1.

At the first stage, we focus on the representation of complex networks. How to represent networked data is an important aspect when we apply data mining techniques to analyze network datasets. Instead of traditional handcrafted feature extraction based on domain experts’ knowledge, we learn vector representation of nodes automatically from the graph structure via node embedding technique.

At the second stage, we focus on the community search problem. Based on the vector representations of nodes obtained at the first stage, we define a distance measurement between pairs of nodes. We treat the community of a query node as the node set connected via the shortest total distance and implement a community search algorithm with a minimum spanning tree approach.

4. The Algorithm of CSMST

CSMST is an algorithm with two stages. At the first stage, we focus on the network representation problem and propose a Node Embedding model with a Biased Random Walk (NEBRW) to learn low-dimensional vector representations of nodes. Moreover, a distance measurement between nodes based on their vector representations is given. At the second stage, we identify the target community of a query node with a variant of the minimum spanning tree approach.

4.1. Our Node Embedding Algorithm

This is the first stage of our CSMST process. We first introduce a Skip-gram model for network and then give our node embedding model NEBRW.

4.1.1. Skip-Gram Model for Network

Given a network G = (V, E), the goal of node embedding is to learn a mapping f: V → R^d from nodes to a low-dimensional space, where d ≪ |V| and each node v ∈ V is associated with a real-valued d-dimensional vector f(v). By viewing nodes as words and random walk paths on G as a corpus, we can learn the embedding of nodes via the Skip-gram model [39].

Given a random walk W = (v_1, v_2, ..., v_l), the context of node v_i, denoted C(v_i), is the set of nodes in a window of size w centered at v_i, i.e., C(v_i) = {v_{i-w}, ..., v_{i-1}, v_{i+1}, ..., v_{i+w}}. The embedding of nodes is learned by maximizing the objective function

    max_f Σ_{v_i ∈ W} log Pr(C(v_i) | f(v_i)),   (1)

where we assume that the nodes in C(v_i) are conditionally independent of each other given f(v_i); thus, equation (1) can be expressed as

    max_f Σ_{v_i ∈ W} Σ_{u ∈ C(v_i)} log Pr(u | f(v_i)).   (2)
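
The form of Pr(u | f(v_i)) is not spelled out above. In Skip-gram-based node embedding methods such as DeepWalk and node2vec, it is commonly modeled as a softmax over node vectors and approximated in practice with hierarchical softmax or negative sampling; with f'(u) denoting the context ("output") vector of node u (notation introduced here only for illustration), a standard form is

    \Pr\big(u \mid f(v_i)\big) = \frac{\exp\big(f'(u)\cdot f(v_i)\big)}{\sum_{x \in V}\exp\big(f'(x)\cdot f(v_i)\big)}.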

The learned vectors are expected to be able to preserve as many properties of the network as possible; thus, they can be an alternative to traditional handcrafted features extracted from the graph [40].

4.1.2. Node Embedding Model NEBRW

In this section, we introduce our node embedding model NEBRW, a method for learning low-dimensional vector representations of nodes in a network based on the Skip-gram model. We learn node representations from a network in two steps, and the process is shown in Figure 2.

In lines 2 to 10, by simulating r random walks of fixed length l starting from each node, we obtain a list of node paths. In lines 11 to 13, by viewing nodes as words and the random walk paths on G as a corpus, we leverage the Skip-gram model to learn vector representations of the nodes in G. The detailed node embedding algorithm is shown in Algorithm 1.

Input: a network G = (V, E); walks per node r; walk length l;
window size w; dimension d
Output: vector representations of nodes in G
(1) begin
(2)  initialize walks = []
(3)  i = 0
(4)  while i < r do
(5)   foreach node v in V do
      // rw is a random walk function
(6)    path = rw(G, v, l)
(7)    append path into walks
(8)   end
(9)   i = i + 1
(10) end
(11) construct a corpus consisting of the sentences stored in walks
(12) use the Skip-gram model to learn the mapping f by treating walks as a corpus (window size w, dimension d)
(13) return f
(14) end
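
A minimal Python sketch of Algorithm 1 is given below. It assumes the walk function rw (a possible version is sketched after Algorithm 2) and uses gensim's Word2Vec (version 4+ API) in skip-gram mode as the Skip-gram learner; the default parameter values are illustrative, not the paper's settings.

    import networkx as nx
    from gensim.models import Word2Vec

    def nebrw_embed(G, r=10, l=80, w=10, d=128):
        # Lines 2-10: collect r fixed-length walks starting from every node.
        walks = []
        for _ in range(r):
            for v in G.nodes():
                walks.append(rw(G, v, l))
        # Lines 11-12: treat walks as sentences of string tokens and train
        # a Skip-gram model (sg=1) to obtain node vectors.
        corpus = [[str(u) for u in walk] for walk in walks]
        model = Word2Vec(corpus, vector_size=d, window=w, sg=1,
                         min_count=0, workers=4)
        # Line 13: return the learned mapping f as a node -> vector dict.
        return {v: model.wv[str(v)] for v in G.nodes()}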

The main difference among NEBRW, DeepWalk, and node2vec is that they adopt different random walk strategies. DeepWalk [34] uses a pure (uniform) random walk over networks. node2vec [35] adopts a biased random walk that captures the first-order and second-order proximity between nodes. In the NEBRW model, we adopt the closest-neighbor biased random walk strategy [19]. Formally, we use W = (v_1, v_2, ..., v_l) to denote a random walk of fixed length l, where v_i is the ith node in the walk. In the process of a random walk, suppose the current node is v_i and N(v_i) is the neighbor node set of v_i. For each neighbor node x of v_i, we use additional structural information to estimate the proximity between x and v_i (equation (3)).

The probability of a neighbor node x being selected as the next node v_{i+1} is proportional to this proximity estimate (equation (4)).
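
Concretely, "proportional to" means the proximity scores over the current node's neighbors are normalized into transition probabilities. Writing I(x, v_i) as a placeholder for the proximity score of equation (3) (the symbol is introduced here only for illustration), this reads

    \Pr(v_{i+1} = x \mid v_i) = \frac{I(x, v_i)}{\sum_{y \in N(v_i)} I(y, v_i)}, \qquad x \in N(v_i).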

The detailed biased random walk procedure is given as the function rw (Algorithm 2).

Function rw(G, s, l)
Input: network G = (V, E); start node s; walk length l
Output: node path P
(1) begin
(2)  initialize P = [s];
(3)  c = s;   // c is the current node
(4)  while |P| < l do
(5)   N = the neighbor set of c in G;
(6)   probs = [];
(7)   foreach node x in N do
(8)    compute the proximity between x and c and append its normalized value to probs;
(9)   end
(10)  randomly select a node in N with the probabilities in probs, denoted as u;
(11)  add u to P;
(12)  c = u;
(13) end
(14) return P;
(15) end
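
A possible Python sketch of the walk function is shown below. The exact proximity formula of equations (3) and (4) comes from [19] and is not reproduced in this paper's text; the shared-neighbor score used here is only an illustrative stand-in, not the authors' definition.

    import random
    import networkx as nx

    def rw(G, start, l):
        # One walk of length l; the next node is drawn with probability
        # proportional to a proximity score between the current node and
        # each neighbor. The score (shared neighbors + 1) is an assumed
        # stand-in for equations (3) and (4).
        walk = [start]
        cur = start
        while len(walk) < l:
            nbrs = list(G.neighbors(cur))
            if not nbrs:
                break
            scores = [len(set(G.neighbors(x)) & set(G.neighbors(cur))) + 1
                      for x in nbrs]
            cur = random.choices(nbrs, weights=scores, k=1)[0]
            walk.append(cur)
        return walk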

After mapping nodes to points in a low-dimensional vector space, the proximity of two nodes can be measured by their distance: the distance between two nodes grows in inverse proportion to their similarity. Therefore, the distance between nodes u and v is defined by formula (5), in which the dot product of f(u) and f(v) serves as the proximity score between nodes u and v.
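
As an illustration, the sketch below computes a node-to-node distance from the learned embeddings. Since formula (5) is not reproduced in the text, the reciprocal of the dot product is used here as an assumed form of "distance inversely proportional to proximity", not as the paper's exact definition.

    import numpy as np

    def node_distance(emb, u, v, eps=1e-9):
        # emb: dict mapping node -> embedding vector (e.g., from nebrw_embed)
        # Assumed reading of formula (5): distance = 1 / proximity, where the
        # proximity score is the dot product of the two embedding vectors.
        proximity = float(np.dot(emb[u], emb[v]))
        return 1.0 / max(proximity, eps)   # guard against non-positive scores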

4.2. The Algorithm of Community Search with Minimum Spanning Tree

This is the second stage of our CSMST process. Based on the learned vector representations of nodes, we compute the distance between pairs of nodes by formula (5) and propose a novel community search algorithm CSMST.

In what follows, we describe CSMST. According to the analysis of Problem 2, we identify the target community of a query node q by constructing a minimum spanning tree with k nodes. We initialize the target community as C = {q} and a shell node set S with the neighbors of q, and we expand C by iteratively adding, one node at a time, the shell node with the shortest distance to the current subgraph, until the number of nodes in C reaches k. Figure 3 shows an example of our minimum spanning tree algorithm: starting from a query node, a compact subgraph with 8 nodes is discovered by constructing a minimum spanning tree. The pseudocode of CSMST is shown in Algorithm 3.

Input: network G = (V, E) and its node embedding f; query node q; expected number of returned nodes k
Output: target community C
begin
 initialize C = {q}, S = ∅
 define a variable dist of map type   // dist[x] stores the shortest distance from x to the current C
 foreach node x in N(q) do
   dist[x] = d(q, x); add x to S
 end
 while |C| < k and S ≠ ∅ do
  find the node u in S such that dist[u] is minimum
  add u to C and remove u from S
  foreach node x in N(u)
  do
   if x ∈ C then
    del dist[x] if it exists
   end
   else
    dist[x] = min(dist[x], d(u, x)), treating a missing entry as +∞; add x to S
   end
  end
 end
 return C
end
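
A Python sketch of the CSMST expansion (the improved Prim-style step) is given below. It builds on the hypothetical nebrw_embed and node_distance helpers sketched earlier and is an illustration of the approach under those assumptions, not the authors' reference implementation; it uses a lazy-deletion heap in place of the explicit dist map, which yields the same greedy expansion.

    import heapq
    import networkx as nx

    def csmst(G, emb, q, k):
        # Grow a k-node community around q by repeatedly attaching the shell
        # node with the smallest embedding distance to the current tree.
        C = {q}
        heap = [(node_distance(emb, q, x), x) for x in G.neighbors(q)]
        heapq.heapify(heap)
        while len(C) < k and heap:
            dist, u = heapq.heappop(heap)
            if u in C:
                continue                      # stale entry for a node already added
            C.add(u)
            for x in G.neighbors(u):
                if x not in C:
                    heapq.heappush(heap, (node_distance(emb, u, x), x))
        return C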

5. Experiments

In this section, we evaluate the effectiveness of our community search algorithm CSMST on synthetic as well as real-world networks.

5.1. Experiment Setup

To validate the performance of CSMST, we give the experiment setup in this section.

5.1.1. Baselines

We compare CSMST against the following five community search algorithms to demonstrate the advantage of our community search method:
(1) Clauset's algorithm [20]: this is a classical community search algorithm which discovers the target community by greedily maximizing a local modularity metric.
(2) GMAC [18]: this is a classical similarity-based community search algorithm which uses the k-NS measure for node similarity. We fix its neighborhood parameter in the following experiments as suggested by the authors.
(3) FlowPro [23]: this is a representative community search algorithm based on flow propagation. When the algorithm converges, the flow stored in the nodes that belong to the community of the query node is higher than that stored in the nodes of other communities. The top-k nodes with the highest stored flow are chosen as the predicted community.
(4) NEMCNB [19]: this is a recently proposed community search algorithm which discovers the target community by iteratively moving the node from the shell node set to the target community that has the largest similarity with the nodes in the current community. The purpose of choosing NEMCNB as a baseline is to evaluate the effectiveness of retrieving communities with a minimum spanning tree strategy.
(5) MSTW [25]: this is a community detection algorithm, proposed by Asmi et al., based on a minimum spanning tree and the strength of similarity between two nodes. The purpose of choosing MSTW as a baseline is to evaluate the effectiveness of measuring node similarity with node embeddings.

For a fair comparison with the other community search algorithms, a few modifications are required. GMAC, NEMCNB, and MSTW do not use the number of nodes added to the predicted community as the stopping condition; thus, we naturally choose the top-k members of the algorithmic result as the predicted community. We also compare our node embedding model NEBRW in CSMST against the following two embedding baselines:
(1) DeepWalk [34]: this is the first node embedding algorithm, which generalized the advances of word embedding in natural language processing from sequences of words to graphs.
(2) node2vec [35]: this is another node embedding algorithm based on a biased random walk procedure that can explore neighborhoods in a BFS as well as a DFS fashion. To learn representations in which nodes that are close in the original network have similar embeddings, we set its return and in-out parameters in the following experiments as adopted by the authors in their experiments.

In the experiments, we use DeepWalk and node2vec to learn the vector representations of nodes and then retrieve communities with the minimum spanning tree strategy which is adopted by CSMST. The purpose of choosing DeepWalk and node2vec is to evaluate the effectiveness of our node embedding method NEBRW.

5.1.2. Datasets and Evaluation Metrics

We employ both synthetic and real-world networks for the evaluations. The widely used synthetic benchmark for community detection is a class of LFR benchmark networks introduced by Lancichinetti et al. [41]. We generate four groups of LFR benchmark networks, and in each group, there are ten networks.

In addition, we use four real-world network datasets to evaluate the performance of the community search algorithms. (1) The Zachary Karate Club network (Karate for short) [42], with 34 nodes and 78 edges, describes the friendships among 34 members of a karate club at a US university. (2) The NCAA football network (Football for short) [1], with 115 nodes and 613 edges, describes American football games between Division IA colleges during the regular season of Fall 2000. (3) The books about US politics network (Polbooks for short) [43], with 105 nodes and 441 edges, is a network of books about US politics published around the time of the 2004 presidential election and sold by Amazon.com. (4) The YouTube social network (YouTube for short) [44], with 1,134,890 nodes and 2,987,624 edges, is derived from the YouTube video-sharing website, which includes a social network.

Both the synthetic and real-world networks have ground-truth community structure. In the experiments, we use the same query node and the same number of returned nodes for the different algorithms. If an algorithmic result contains more correct nodes, it obtains higher precision and recall:

    precision = |C ∩ C'| / |C'|,   recall = |C ∩ C'| / |C|,

where C is the set of nodes in the ground-truth community of the query node and C' is the set of nodes returned by the community search algorithm. In the experiments, we set the number of returned nodes k = |C|, so the value of precision is equal to that of recall. We therefore use precision as the evaluation metric to compare algorithmic performance.
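
For concreteness, the evaluation can be computed as in the sketch below; it follows the standard set-based precision/recall definitions above, and the F-score it also returns is used later in Section 5.4.

    def evaluate(truth, predicted):
        # truth, predicted: sets of node ids (ground-truth community C and
        # algorithmic result C'). Returns precision, recall, and F-score.
        hit = len(truth & predicted)
        precision = hit / len(predicted) if predicted else 0.0
        recall = hit / len(truth) if truth else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall > 0 else 0.0)
        return precision, recall, f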

5.2. Evaluation on Synthetic Networks

The parameters of the LFR network generating model are as follows: the number of nodes, the average node degree, and the maximum node degree are set as in Table 1, and the other parameters except the mixing parameter μ are set to their default values. The mixing parameter μ is the fraction of each node's edges that lie outside its community; it controls the difficulty of community detection [4], and a larger μ results in lower community detection accuracy.

We generate four groups of LFR networks by varying the number of nodes and the mixing parameter μ. In each group, we vary μ from 0.05 to 0.5 with a step of 0.05 and obtain ten networks. The detailed parameter values are given in Table 1. In total, there are forty networks with ground-truth communities.
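
As an aside, LFR benchmark graphs of this kind can be generated, for example, with networkx; the parameter values in the sketch below are placeholders for illustration, not the values of Table 1.

    import networkx as nx

    # Illustrative settings only: n, degrees, and community sizes are
    # placeholders; tau1/tau2 are the LFR degree and community-size exponents.
    mu = 0.2
    G = nx.LFR_benchmark_graph(n=5000, tau1=2.0, tau2=1.5, mu=mu,
                               average_degree=10, max_degree=50,
                               min_community=20, max_community=100, seed=42)
    # The ground-truth community of a node q is stored as a node attribute:
    # community_of_q = G.nodes[q]["community"]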

For each algorithm on each dataset, we repeat the community search experiment n times (where n is the number of nodes in the network), starting once from each node, and then report the algorithm's average precision on this dataset. We evaluate our algorithm on these four groups of LFR network datasets, together with the five community search baselines and the two node embedding baselines. We use the same values of the common parameters (walks per node, walk length, dimension, and window size) for NEMCNB, CSMST, DeepWalk, and node2vec. The LFR30K and LFR50K networks are too big for FlowPro to handle because of its high time complexity; thus, we only compare CSMST with the other baselines on these two groups of networks. The experimental results are shown in Figures 4 and 5, respectively, and we can draw the following conclusions.

Firstly, combining Figures 4 and 5, we can see that increasing the mixing parameter μ leads to performance degradation due to the increased difficulty of community detection: the higher the mixing parameter of a network, the weaker its community structure. The empirical results of the community search algorithms on the four groups of LFR networks verify this.

Secondly, Figure 4 shows that, with the increase of μ, the performance of Clauset and MSTW drops rapidly, while the other algorithms degrade slowly. Compared with the other five community search baselines, CSMST achieves the best performance on the four groups of LFR networks, followed by NEMCNB, FlowPro, and GMAC. The main difference between CSMST and NEMCNB is the community expansion strategy, and the comparison shows that the minimum spanning tree strategy is better than the similarity-based strategy. The main difference between CSMST and MSTW is the node similarity measurement, and the comparison shows that the similarity measurement based on node embedding is better than that based only on network structure.

Thirdly, Figure 5 shows that our node embedding algorithm NEBRW outperforms DeepWalk and node2vec in community search experiments on the LFR networks.

5.3. Evaluation on Real-World Networks

We adopt the same experimental method on the real-world networks as on the synthetic networks and report the algorithmic average precision on these datasets. Firstly, we perform experiments on Karate, Football, and Polbooks. The common parameters (walks per node, walk length, dimension, and window size) are set to the same values for NEMCNB, CSMST, DeepWalk, and node2vec. The comparison results with both the community search and the node embedding baselines on these real-world network datasets are reported in Figures 6 and 7, respectively.

Then, we perform the experiment on YouTube, with the common parameters set in the same way as above. We compare only with Clauset, DeepWalk, and MSTW because the network is too big for the other algorithms to handle due to their high time complexity. The comparison results are reported in Figure 8.

Compared with the other five community search baselines, CSMST achieves the best performance on the Karate, Football, and YouTube datasets. On Polbooks, MSTW achieves the best performance; however, the differences among MSTW, CSMST, Clauset, and FlowPro are small.

Compared with DeepWalk and node2vec, NEBRW achieves the best performance on the Karate and Polbooks datasets. On the Football dataset, DeepWalk achieves the best performance; however, the differences among DeepWalk, NEBRW, and node2vec are small. This further indicates that the NEBRW model and the CSMST algorithm have an advantage in community search tasks.

5.4. Discussion of Parameter k

Parameter k is important in the definition of the community search problem; it is interpreted as the number of nodes in the target community. However, in some scenarios we do not pre-know it. In this section, we discuss the effect of parameter k on the CSMST algorithm. We choose three LFR5K networks with different mixing parameters as the test datasets and perform experiments by varying parameter k around the number of nodes in the ground-truth community. A larger k returns more correct nodes, which leads to a higher recall but a lower precision. Thus, in addition to the precision and recall metrics, we also adopt the F-score to measure algorithmic performance for different values of k. The experimental results are shown in Figure 9.
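
A sketch of this k-sweep, reusing the hypothetical csmst and evaluate helpers from the earlier sketches (the sweep range and step are illustrative), is as follows:

    def sweep_k(G, emb, q, truth):
        # truth: the ground-truth community (a set of nodes) of query node q.
        size = len(truth)
        for k in range(max(2, size // 2), 2 * size + 1, max(1, size // 10)):
            pred = csmst(G, emb, q, k)
            p, r, f = evaluate(truth, pred)
            print(k, round(p, 3), round(r, 3), round(f, 3))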

We discuss the experimental results in detail. Firstly, with the increase of the mixing parameter μ, the difficulty of community detection on the corresponding LFR networks increases, and the experimental results verify this point. Secondly, on each test network, the results show a consistent pattern: with the increase of parameter k, the precision decreases, the recall increases, and the F-score first increases and then decreases. The larger the parameter k is, the more nodes are returned; thus, the recall shows an upward tendency and the precision shows a downward tendency. The F-score achieves its maximum when parameter k is equal to the number of nodes in the target community.

6. Conclusion and Future Work

In this paper, we study communities from the viewpoint of distance and transform community search problem into a variant of minimum spanning tree problem. Moreover, we propose a node embedding model NEBRW based on Skip-gram and design a new community search algorithm CSMST via an improved Prim-based approach. Communities detected by CSMST are the node sets connected with minimum total distance. CSMST algorithm achieves good performance on both synthetic and real-world networks.

In the future, we will study the node embedding technique in heterogeneous social media networks and study the community search problem in heterogeneous networks.

Data Availability

The data and code used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The project was supported by the National Key R&D Program of China under Grant 2018YFB1004700, National Natural Science Foundation of China (61772122 and 61872074), and Fundamental Research Funds for the Universities of Heilongjiang Province of China (YWK10236200141).