1 Introduction

Networks provide a fundamental model for defining relationships between various entities. They have applications in diverse domains such as social computing [82], systems biology [76, 141], cyber-physical systems [148], recommender systems [74], language modeling [125], and network medicine [4]. A social network reflects relationships between people, a citation network relates research papers, a biological network can define protein–protein interactions, a word co-occurrence network captures linguistic relationships, and so on. Analysis and mining of these complex networks can generate insights that are valuable to both the scientific and the business community. Friend recommendation in social networks, protein function prediction from protein interaction networks, terrorist group identification from communication networks, and influential paper detection from citation networks are some typical examples. These tasks are usually formalized as link prediction [71], node classification [8], graph clustering [105], and influential node detection [79]. Performing these tasks on large real-world networks poses various challenges.

Traditional methods for network analysis are based on graph algorithms that operate on the adjacency matrix representation of the network. These methods are typically iterative, which results in high computational cost when applied to large networks. For example, most node classification approaches, such as the iterative classification algorithm (ICA) [91] and label propagation [149], are iterative in nature. Machine learning methods cannot be directly applied to networks because they assume the data to be independent and identically distributed (i.i.d.), which does not hold for graph-structured data. Using the sparse adjacency matrix as the feature representation is also impractical for machine learning. An alternative is to use hand-engineered features derived from network statistics and other measures [34], which is a time-consuming process. Traditional distributed computing platforms [26] are not well suited for parallel processing of graph-structured data. Several specialized distributed graph analytics platforms such as Pregel [78], Giraph [81] and GraphX [137] have been developed, but their efficiency is limited by complex phenomena of real-world networks such as the scale-free property and power-law degree distributions.

An interesting direction for applying machine learning to network data is to map the data to a low dimensional latent space and then apply traditional machine learning algorithms. This process of mapping network data to a vector space is known as network embedding. Many linear and non-linear dimensionality reduction methods [126] were initially used to generate network embeddings. Most of these methods are based on matrix factorization and hence suffer from scalability issues. More recently, the machine learning community has developed new theories and architectures for learning complex features from high dimensional data. These approaches are referred to as representation learning, which aims at finding a set of transformations that map the high dimensional data to a low dimensional manifold. With the success of representation learning in image [44, 60, 128, 134], speech [23, 40, 48], and natural language processing [19, 21, 108], researchers have applied these methods to network data with fruitful results.

Fig. 1 Subset of GitHub user interaction network

Given an input network, we can generate embeddings in different output formats, including node, edge, subgraph and whole-graph embeddings. Edge embedding aims to map the edges of a network to a latent space, and subgraph embedding attempts to map graph components (subgraph structures) to a vector space. Whole-graph embedding aims to generate the representation of a complete graph in vector space, and many works use graph kernel methods [3, 127] for this purpose. Node embedding, which represents the vertices of a graph in vector space, is the most widely studied problem and is the focus of this survey. Figure 1a shows a Gephi [5] visualization of a small subset of the GitHub user interaction network, and Fig. 1b shows its 2-D representation in vector space, generated by DeepWalk [96] and plotted using t-SNE [75]. Generating low dimensional vectors as node embeddings from a large real-world network is not straightforward. The vector representation should preserve the structural properties of the network, including the first order, second order and higher order proximities between nodes. Network data is highly sparse and usually non-linear, and the embedding algorithm should cope with such data. Many real-world networks contain millions of nodes and edges, so the embedding algorithm should be scalable. In practice, networks may also be heterogeneous, attributed, scale-free and dynamic, and the embedding method should adapt to all such situations.

A few efforts have already been made to survey [22, 38, 46, 89] the various approaches for network embedding. In this survey, we focus on recent methods for node embedding which are inspired by recent advancements in representation learning. We provide a taxonomy of node embedding methods based on the type of the network. Networks are classified into broader categories such as homogeneous networks, heterogeneous networks, attributed networks, signed networks, and dynamic networks. We discuss the common models of network representation learning and review the major works under each model with respect to each type of network. Further, we discuss the applications of network embedding along with the datasets used in network embedding research.

2 Terminologies and problem definition

Definition 1

A network is a graph \(G =(V,E)\), where \(V=\{v_1,v_2\ldots v_n\}\) is the set of vertices and \(e \in E\) is an edge between any two vertices. An adjacency matrix A defines the connectivity of G: \(A_{ij}=1\) if \(v_i\) and \(v_j\) are connected, else \(A_{ij}=0\).

Definition 2

A homogeneous network is a network \(G =(V,E)\), where each node \(v_i \in V\) belongs to the same type and each edge \(e_i \in E\) also belongs to the same type.

Definition 3

An attributed network can be defined as \(G_A=(V,E,A,F)\), where V is the set of vertices, E is the set of edges, A is the adjacency matrix and \(F \in R^{n \times k}\) is the attribute matrix, whose ith row denotes the k dimensional attribute vector of node i.

Definition 4

A heterogeneous network is a network \(G =(V,E)\), where the nodes \(v_i \in V\) and edges \(e_i \in E\) are associated with mapping functions \(f_v:V \rightarrow T_v\) and \(f_e:E \rightarrow T_e\), where \(T_v\) and \(T_e\) denote the entity and relationship types respectively.

Definition 5

A signed network is a network \(G =(V,E)\) with \(v \in V\) and \(e \in E\), where each edge carries a sign \(e_{ij}= +1\) or \(e_{ij}=-1\), denoting a positive or a negative link between \(v_i\) and \(v_j\).

Definition 6

A dynamic network can be defined as a series of snapshots \(G=\{G_1,G_2\ldots G_n\}\) where \(G_i=(V_i,E_i)\) and n is the number of snapshots.

Definition 7

First order proximity describes the pairwise proximity between vertices, defined by the edge weight \(e_{ij}\) between node \(v_i\) and node \(v_j\).

Definition 8

Second order proximity for a pair of nodes \(v_i\) and \(v_j\) is the proximity of neighborhood structures of the nodes \(v_i\) and \(v_j\).

Problem 1

Network embedding—Given a network \(G =(V,E)\), the task is to learn a transformation function \(f:v_i \rightarrow k_i \in R^d\), where \(d \ll \vert V \vert\), such that f preserves the first order, second order and higher order proximities of the network; d is the dimensionality of the real valued embedding vector.

3 Models of network embedding

Researchers have used various models for network embedding, including both linear and non-linear dimensionality reduction techniques. Models based on matrix factorization, models that combine random walk sampling with shallow neural networks, and deep neural architectures are the most commonly used. Other approaches model an optimization function based on the structure and the properties to be preserved, and solve it using gradient based methods.

3.1 Matrix factorization

Network embedding can be considered a structure-preserving dimensionality reduction process, which assumes that the input data lie in a low dimensional latent space. Network data can be represented in matrix form, including the adjacency matrix, the Laplacian matrix, the node transition probability matrix and many more. A matrix factorization can be applied to any of these matrices to generate node embeddings. Locally linear embedding [103], Laplacian eigenmaps [6], structure preserving embedding [107], graph factorization [2], GraRep [14] and HOPE [94] are some of the matrix factorization based methods for network embedding. A detailed survey of these methods can be found in [13, 22, 33]. In this paper, we focus on methods based on recent advancements in representation learning.
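To make the matrix factorization view concrete, the following Python snippet is a minimal sketch (not the exact procedure of any of the methods above) that factorizes the adjacency matrix of a toy graph with a truncated SVD and uses the scaled singular vectors as node embeddings; the graph, the embedding dimension and the scaling are assumptions made only for illustration.

```python
import numpy as np
import networkx as nx
from scipy.sparse.linalg import svds

# Toy graph; any undirected graph would do.
G = nx.karate_club_graph()
A = nx.adjacency_matrix(G).astype(float)

d = 8  # embedding dimension (assumed for the example)
# Truncated SVD: A ≈ U diag(s) Vt, keeping the d largest singular values.
U, s, Vt = svds(A, k=d)

# Scale the singular vectors to obtain d-dimensional node embeddings.
embeddings = U * np.sqrt(s)
print(embeddings.shape)  # (number of nodes, d)
```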

Fig. 2 A flow diagram of random walk based approaches for network embedding

3.2 Random walk

Capturing the network structure is a primary concern while generating node embeddings. A random walk is a well-known method which can capture the local structure of a graph. Although each row of the adjacency matrix is a node vector that defines the connectivity structure, analogous to a one-hot representation, it is very sparse and high dimensional. The word2vec model [85] succeeded in developing word representations by generating dense low dimensional vectors from sparse high dimensional one-hot vectors, using a shallow neural network architecture. Word2vec defines two neural architectures, the continuous bag-of-words model and the skip-gram model [86]. Training is done using stochastic gradient descent (SGD) [9], and word2vec uses two optimization strategies, hierarchical softmax and negative sampling, to speed up the training process. Many network embedding methods [29, 42, 92, 96, 101] are inspired by word2vec: they first apply a random walk on the network to generate node sequences, analogous to sentences in word2vec, and then use the skip-gram architecture to generate node embeddings. Random walk based approaches try to maximize the probability that nodes which tend to co-occur on truncated walks lie close in the embedding space. A general architecture of random walk based network embedding is shown in Fig. 2.
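The following Python sketch illustrates this two-step pipeline on a toy graph, using uniform truncated random walks and gensim's Word2Vec skip-gram as a stand-in for the shallow neural network; the walk length, number of walks per node, window size and dimensionality are arbitrary choices for the example, and parameter names follow gensim 4.x.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(G, start, length):
    """Generate one truncated uniform random walk starting at `start`."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(v) for v in walk]  # gensim expects sequences of string tokens

G = nx.karate_club_graph()
# 10 walks of length 40 starting from every node, analogous to a corpus of sentences.
walks = [random_walk(G, v, length=40) for _ in range(10) for v in G.nodes()]

# Skip-gram (sg=1) over the node sequences.
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, epochs=5)
embedding_of_node_0 = model.wv["0"]
```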

3.3 Deep architecture

Fig. 3 A flow diagram of deep architecture based approaches for network embedding

The aim of network embedding is to map the nodes from a high dimensional network space to a low dimensional feature space. Some works use specialized neural network models [84, 104], while many others use generalized models over graph-structured data to represent graphs in a Euclidean space. Network data is inherently non-linear, and using shallow neural network architectures for generating node embeddings may result in sub-optimal solutions. Deep neural networks [7, 63, 106] have been successfully used in various domains to learn multiple levels of feature representations from complex and non-linear data. To train large neural networks with more than one hidden layer, many theories and architectures have been proposed recently, including deep belief networks (DBN) with greedy layer-wise pre-training [49], deep convolutional neural networks (CNN) [60], long short-term memory networks (LSTM) [41], and generative adversarial networks (GAN) [37]. An autoencoder, a building block for training deep belief networks, is a three-layer neural network which reconstructs its input vectors at the output layer through a series of non-linear transformations. As an unsupervised feature learning technique, an autoencoder can generate a deep latent representation of the input data. Multiple autoencoders are stacked together to form a stacked autoencoder, which is used as the deep neural architecture for generating node embeddings in many works [15, 39, 129]. Convolutional neural networks, which are very popular in image processing tasks, cannot be applied directly to graphs, but some works [18, 27, 56, 57] use convolutional architectures based on spectral graph theory [47] to generate node embeddings. Generative adversarial networks (GANs) are deep neural architectures composed of two components, a generator and a discriminator, competing against each other. A few works [24, 130] on network embedding are inspired by GANs. A general architecture of deep architecture based network embedding is shown in Fig. 3.
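As a deliberately simplified sketch of the autoencoder idea (not the exact architecture of any method discussed later), the code below trains a small PyTorch autoencoder on the adjacency rows of a toy graph and takes the bottleneck layer as the node embedding; the layer sizes, activation, loss and training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn
import networkx as nx

class Autoencoder(nn.Module):
    def __init__(self, n_nodes, dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_nodes, 128), nn.ReLU(),
                                     nn.Linear(128, dim))
        self.decoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_nodes))

    def forward(self, x):
        z = self.encoder(x)          # bottleneck = node embedding
        return self.decoder(z), z

G = nx.karate_club_graph()
X = torch.tensor(nx.to_numpy_array(G), dtype=torch.float32)  # adjacency rows

model = Autoencoder(n_nodes=X.shape[0], dim=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    recon, z = model(X)
    loss = ((recon - X) ** 2).mean()  # plain reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()

embeddings = model(X)[1].detach()     # one 16-dimensional vector per node
```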

4 Network representation learning methods

In this section, we review the major works which come under each model with respect to each type of network. The classification of network embedding methods based on different types of networks is depicted in Fig. 4.

Fig. 4 Network embedding methods based on the types of networks

4.1 Homogeneous network

Most of the works on network embedding focus on non-attributed, static, unsigned homogeneous networks. Preserving the structural property of the network is the primary objective of homogeneous network embedding methods. Figure 5 shows the major works on homogeneous network embedding which are grouped under major models of network embedding. Table 1 shows a summary of the input, objective function, model used, and properties preserved by some of these methods.

Fig. 5 Homogeneous network embedding methods

Proximity preservation is the main objective of most random walk based methods. Among these, DeepWalk [96] gained a lot of attention, as it is inspired by the well-studied word2vec algorithm. The DeepWalk algorithm involves a two-step process: (1) a truncated random walk on the network generates sequences of vertices, which are analogous to sentences in word2vec; (2) a skip-gram model, built on a shallow neural network architecture, generates the node embeddings. The skip-gram is a generative model whose objective is to maximize the probability of the neighbors in the walk, given the representation of a vertex. For each node \(v_i\), skip-gram assigns a current d dimensional representation \(\phi (v_i )\in R^d\) and maximizes the co-occurrence probability of its neighbors in the walk to update this representation. The optimization becomes

$$\begin{aligned} \max \log P(v_{i-w},\ldots ,v_{i+w}|\phi (v_i)) \end{aligned}$$
(1)

where \(v_{i-w},\ldots ,v_{i+w}\) denote the neighbors of \(v_i\) in the node sequence, and w is the context size. Computing the softmax at the output layer of skip-gram is computationally expensive, and DeepWalk approximates it using two strategies, hierarchical softmax and negative sampling. These strategies reduce the time complexity of the skip-gram model and speed up the training process. Since the random walk is a sampling strategy, its time complexity is linear w.r.t. the number of edges. The complexity of the skip-gram architecture is proportional to \(O(C (D + D log_{2}(V )))\), where C is the context window size, D is the number of dimensions and \(log_{2}(V)\) is the cost of building the hierarchical softmax over V vertices. DeepWalk is parallelizable and can be implemented without knowledge of the entire graph, which makes it suitable for large-scale machine learning. DeepWalk motivated many subsequent works [17, 24, 29, 42, 92, 95, 101] and also acts as a baseline for various works in the area of network representation learning. Walklet [97] modifies the random walk used in DeepWalk by explicitly preserving the proximities between vertices during the walk, and shows that the resulting multi-scale representations improve the performance of multi-label classification. Max-Margin DeepWalk [121] and Discriminative Deep Random Walk [69] extend DeepWalk by coupling a classification objective with the embedding objective, and thereby also improve multi-label classification performance.

The quality of the network embedding can be further improved by preserving the structural equivalence of nodes along with the proximity information. Node2vec [42] works towards this goal by performing a biased random walk, which provides more flexibility in exploring neighborhoods than DeepWalk. During the random walk, node2vec interpolates between two sampling strategies, breadth-first search (BFS) and depth-first search (DFS), which traverse the search space by exploring community structures and structurally equivalent nodes in the network. Based on the random walk sequences, node2vec extends the skip-gram architecture to optimize the objective function

$$\begin{aligned} \max \sum _{v_t \in W_{v}}^N \sum _{-w\le t'\le w} \log P(v_{t'} \vert \phi (v_t)) \end{aligned}$$
(2)

where \(v_t\) is a node taken from the random walk sequence \(W_v\), \(v_{t'}\) is a neighbor of \(v_t\) within the window w, and \(\phi (v_t)\in \mathbb {R}^d\) is the feature representation of node \(v_t\). Node2vec incurs additional space and time complexity over DeepWalk, as it involves the BFS- and DFS-guided exploration during the random walk. Node2vec can preserve the structural equivalence of nodes in the network, but only within the limits of the context window size. Struc2vec [101] aims at preserving structural equivalence to a better extent by computing the structural similarity between each pair of vertices in the network. Struc2vec constructs a multilayer network, where each layer corresponds to a level in the hierarchy used to measure structural similarity, and then applies random walk sampling followed by skip-gram learning on the multilayer graph to generate the embedding of each vertex.
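Returning to node2vec, its second-order walk bias can be sketched in a few lines: given the previously visited node, the snippet below computes the unnormalized transition weights controlled by the return parameter p and the in-out parameter q. The parameter values and the toy graph are assumptions made for the example; a full implementation would also handle edge weights and alias sampling.

```python
import networkx as nx

def transition_weights(G, prev, curr, p=1.0, q=0.5):
    """Unnormalized node2vec transition weights from `curr`, given `prev`."""
    weights = {}
    for nxt in G.neighbors(curr):
        if nxt == prev:                 # returning to the previous node
            weights[nxt] = 1.0 / p
        elif G.has_edge(nxt, prev):     # distance 1 from prev: BFS-like move
            weights[nxt] = 1.0
        else:                           # distance 2 from prev: DFS-like move
            weights[nxt] = 1.0 / q
    return weights

G = nx.karate_club_graph()
print(transition_weights(G, prev=0, curr=1))
```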

Preserving higher order structural patterns in large-scale networks is a challenging issue. HARP [17] is a meta-strategy that achieves this goal. HARP can be used to improve state-of-the-art NRL algorithms [42, 96, 115] by helping them avoid local optima, as these models rely on non-convex optimization solved by SGD. HARP proceeds in three steps: graph coarsening, embedding, and representation refinement. In coarsening, a large network is compressed into smaller networks that preserve the original structure, using two strategies, edge collapsing and star collapsing. The embedding algorithm is then applied to the coarsest graph. The last step prolongs and refines the embedding from the coarsest to the finest graph. To perform refinement, HARP uses two strategies, multilevel hierarchical softmax and multilevel negative sampling. The overall time complexity of HARP (with DeepWalk) is \(O(\gamma |V |)\), where \(\gamma\) is the number of walks and |V| is the number of vertices. Experiments show that the HARP extension can improve the performance of DeepWalk, LINE, and Node2vec on multi-label classification.

Network structure is inherently non-linear, and using a shallow neural network for network embedding may lead to suboptimal solutions. SDNE [129] addresses this challenge by using a deep architecture, built with stacked autoencoders, to generate network embeddings. SDNE deploys a deep belief network that implements multiple layers of non-linear functions and maps the data into a non-linear feature space. In order to maintain the structure-preserving property and to address sparsity, SDNE trains a joint optimization function (Eq. 3) which preserves the first order and second order proximities: the second order proximity is preserved using the stacked autoencoders and the first order proximity using Laplacian eigenmaps.

$$\begin{aligned} L= & {} \gamma ' {\vert \vert (Y-X)\odot B\vert \vert }^2 + \alpha '\sum _{i=1,j=1}^{n} a_{ij} {\vert \vert \phi _i - \phi _j \vert \vert }_2^2 \nonumber \\&+ \frac{\lambda }{2}\sum _{k=1}^{K}\left( {\vert \vert W^{(k)}\vert \vert }_F^2+{\vert \vert \hat{W}^{(k)}\vert \vert }_F^2\right) \end{aligned}$$
(3)

Here X and Y are the input and the reconstructed data respectively, and the model minimizes the reconstruction error to capture the global structure. B is a penalty matrix used to handle the sparsity of the adjacency matrix \(A=(a_{ij})\). \(\phi _i\) and \(\phi _j\) are the feature representations of nodes i and j, and \(W^{(k)}\) is the weight matrix of the kth hidden layer of the autoencoder. The time complexity of SDNE is O(ncdi), where n is the number of vertices, d is the embedding dimension, c is the average degree of the network and i is the number of iterations.
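A minimal PyTorch sketch of this joint loss (Eq. 3) is given below; the tensors, the penalty mask B, the hyperparameter values and the list of weight matrices are placeholders introduced for illustration rather than details taken from the SDNE implementation.

```python
import torch

def sdne_loss(X, Y, phi, A, B, weights, gamma=1.0, alpha=0.1, lam=1e-4):
    """SDNE-style joint loss: masked reconstruction (2nd order) +
    Laplacian-eigenmaps term (1st order) + L2 weight regularization."""
    second_order = torch.norm((Y - X) * B) ** 2        # ||(Y - X) ⊙ B||^2
    pdist = torch.cdist(phi, phi) ** 2                 # ||phi_i - phi_j||^2
    first_order = (A * pdist).sum()                    # sum_ij a_ij ||phi_i - phi_j||^2
    reg = sum(torch.norm(W) ** 2 for W in weights)     # sum_k ||W(k)||_F^2
    return gamma * second_order + alpha * first_order + lam / 2 * reg
```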

In language modeling, an alternative method to generate word vectors is to compute low dimensional linear projections from the positive point-wise mutual information (PPMI) matrix [12, 68] of words and their contexts. DNGR [15] is inspired by [68]: it first uses a random surfing model to generate a probabilistic co-occurrence matrix that captures the graph structure. The PPMI matrix is then computed from the co-occurrence matrix. Instead of applying singular value decomposition (SVD) as in [14], DNGR applies a stacked denoising autoencoder to the PPMI matrix, which learns a non-linear mapping from high dimensional vertex vectors to low dimensional node embeddings. The authors of DNGR claim that using the probabilistic co-occurrence matrix is well suited for weighted networks and is less computationally expensive than sampling-based methods [42, 96]. The time complexity of DNGR is linear w.r.t. the number of vertices in the graph. The objective function of DNGR is defined as

$$\begin{aligned} {\arg \min _{\theta _{1},\theta _{2}} \sum _{i=1}^{n} {\vert \vert x_i-g_{\theta _{2}}(f_{\theta _{1}}(y_i))\vert \vert }^2} \end{aligned}$$
(4)

where \(x_{i}\) is the ith instance, \(y_{i}\) is the corrupted input data of \(x_{i}\), and \(f_{\theta _{1}}\) and \(g_{\theta _{2}}\) are the encoding and decoding functions of the autoencoder respectively.
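The PPMI step itself is straightforward to write down. Assuming a co-occurrence matrix C has already been produced (e.g., by the random surfing model), the following numpy sketch converts it into a PPMI matrix by comparing each joint probability with the product of its marginals and clipping negative values at zero; the small matrix is a made-up example.

```python
import numpy as np

def ppmi(C, eps=1e-12):
    """Positive pointwise mutual information of a co-occurrence matrix C."""
    total = C.sum()
    p_ij = C / total                        # joint probabilities
    p_i = p_ij.sum(axis=1, keepdims=True)   # row marginals
    p_j = p_ij.sum(axis=0, keepdims=True)   # column marginals
    pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
    return np.maximum(pmi, 0.0)             # keep only positive associations

C = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])
print(ppmi(C))
```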

A few efforts have been made to apply variants of CNN [27, 47] to representation learning on networks. GCN [56] is one such approach, which learns a function over the network that takes as input (1) an \(N \times K\) feature matrix, where N is the number of nodes and K is the number of input features, and (2) an adjacency matrix A. GCN produces an output Z, an \(N \times D\) matrix, where D is the number of output dimensions per node. GCN uses the layer-wise propagation rule

$$\begin{aligned} H_{(l+1)} = \sigma \left( \hat{D}^{-\frac{1}{2}}\hat{A} \hat{D}^{-\frac{1}{2}} H_{(l)} W_{(l)}\right) \end{aligned}$$
(5)

where \(W_{(l)}\) denotes the weight matrix of the lth layer, \(\hat{A}=A+I\), \(\hat{D}\) is the diagonal node degree matrix of \(\hat{A}\), and \(H_{(0)}=X\), the matrix of node attributes. The authors interpret GCN as a generalized version of the Weisfeiler–Lehman algorithm on graphs. The complexity of the convolution operation is O(efc), where e is the number of edges, f is the number of filters and c is the node feature dimension. FastGCN [18] is an enhancement over GCN in which graph convolutions are interpreted as integral transforms of embedding functions under probability measures, and the integrals are consistently estimated using Monte Carlo approaches. Parametric graph convolution [119] is another enhancement over GCN which generalizes the convolutional filter by adding a hyper-parameter that controls the filter size, and thereby improves the performance of GCN. Variational graph autoencoder (VGAE) [57] uses a variational autoencoder to learn latent representations of undirected graphs; the authors demonstrate this model with a graph convolutional network (GCN) encoder and a simple inner product decoder.
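A single propagation step of Eq. 5 can be written in a few lines of numpy, as sketched below; the toy graph, the random feature matrix, the weight initialization and the ReLU non-linearity are assumptions made for illustration rather than details prescribed by the GCN paper.

```python
import numpy as np
import networkx as nx

def gcn_layer(A, H, W):
    """One propagation step: relu(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)      # ReLU activation

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
H0 = np.random.rand(A.shape[0], 8)              # node features (assumed)
W0 = np.random.randn(8, 4) * 0.1
H1 = gcn_layer(A, H0, W0)                       # 4-dimensional node states
```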

GraphGAN [130] directly follows the GAN architecture and learns two models: (1) a generator which approximates the underlying connectivity distribution and generates fake vertex pairs to fool the discriminator, and (2) a discriminator which tries to distinguish the vertex pairs produced by the generator from the real ones. The objective of the discriminator is to maximize the log probability of assigning correct labels to real and generated samples, while the generator tries to minimize the log probability that the discriminator correctly identifies the generated samples. A sigmoid function and a softmax function are used as the discriminator and the generator respectively. The work also proposes an alternative called graph softmax, which improves on the plain softmax. The objective function of GraphGAN is modeled as a two-player minimax game with cost function

$$\begin{aligned} \begin{aligned}&{\min }_{\theta _G} {\max }_{\theta _D} V(G,D) = {\sum }_{c=1}^{V} (E_{v\sim p_{true}(. \vert v_{c})} [\log D(v,v_c;\theta _D)]\\&\qquad + E_{v\sim G(. \vert v_{c}; \theta _G)} [\log (1- D(v,v_c; \theta _D))]) \end{aligned} \end{aligned}$$
(6)

Here the generator G tries to generate vertices which resemble vertex \(v_{c}\)'s neighbors by approximating the underlying true connectivity distribution \(p_{true}(v|v_{c})\), and the discriminator D tries to discriminate the true neighbors of \(v_{c}\) from those generated by G by estimating the probability \(D(v,v_c;\theta _D)\) that an edge exists between v and \(v_{c}\). By alternately minimizing and maximizing the cost function, the optimal parameters for D and G can be learned; GraphGAN uses a policy gradient procedure to update the parameters. The time complexity of each iteration of GraphGAN is \(O(V \log V)\), where V is the number of vertices.

ANE [24] proposes a different approach which uses adversarial learning as a regularizer to learn more robust network representations. ANE employs a structure-preserving component and an adversarial learning component. For structure preservation, ANE uses a method called inductive DeepWalk (IDW). IDW performs random walks using a PPMI matrix to explore the neighborhood and optimizes a parameterized generator function to produce embeddings. The adversarial learning component consists of a generator and a discriminator, and shares the generator function with the structure-preserving component. The discriminator is first trained to separate samples drawn from a prior distribution from the embedding vectors. The generator parameters are then updated to fool the discriminator, thereby regularizing the embedding produced by the structure-preserving component.

Table 1 Summary of homogeneous network embedding methods

4.1.1 Other works

LINE—The objective of LINE [115] is to preserve first order and second order proximity during embedding. LINE first computes the joint probability between two vertices in two ways, one using edge weights and the other using node vectors. To preserve first order proximity, LINE defines an objective function that minimizes the distance between the two probability distributions. The objective function for preserving second order proximity is defined in a similar way. LINE uses an edge sampling strategy to speed up the computations.

NETMF—The works [42, 96, 115] lacked thorough theoretical analysis, which is provided by [100]. The work reveals that all these methods essentially perform implicit matrix factorization. By analysing the closed-form matrices of these methods, [100] discusses the relationships between them and their connection with the graph Laplacian. The authors also propose a method called NETMF which explicitly factorizes the closed-form implicit matrix of DeepWalk using singular value decomposition (SVD) to generate the node embeddings.

GraphSAGE—GraphSAGE [45] is an inductive representation learning method suitable for large graphs. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates node embeddings by sampling and aggregating features from a node's local neighborhood.

EP—Embedding propagation (EP) [35] is a network representation method inspired by label propagation. EP sends forward and backward messages between neighboring nodes: forward messages contain label representations, and backward messages contain the gradients that result from aggregating the label representations and applying a reconstruction loss. Node representations are then computed from the label representations.

4.2 Attributed network embedding

In most real-world networks, nodes or edges are associated with one or more attributes which provide semantic information. In this section, we cover methods which perform network embedding on such attributed networks [98, 99]. Figure 6 shows the major works on attributed network embedding. Table 2 shows a summary of the input, objective function, model used, and properties preserved by some of these methods.

Fig. 6 Attributed network embedding methods

Nodes of a network may have text associated with them. TADW [139] embeds networks by using both the structural information and the text information associated with the nodes. The work proves the equivalence of DeepWalk and a closed-form matrix factorization, and creates a PPMI matrix from vertex–context pairs for further processing. TADW performs inductive matrix completion [90] to incorporate text features into the PPMI matrix, followed by low-rank matrix factorization on the resulting matrix to generate the network embedding. The objective function of TADW is stated as

$$\begin{aligned} {\min }_{W,H} {\vert \vert M - W^THT \vert \vert }_F^2 + \frac{\lambda }{2}\left( {\vert \vert W\vert \vert }_F^2+{\vert \vert H\vert \vert }_F^2\right) \end{aligned}$$
(7)

where M and T are the word-context matrix and the text feature matrix respectively, and \({\min }_{W,H} {\vert \vert M - W^THT \vert \vert }_F^2\) represents the low rank matrix decomposition of M. The complexity of each iteration of minimizing over W and H is \(O(n_{0}(M)k+|V |f_{t}k+|V |k^2)\), where \(n_{0}(M)\) is the number of non-zero entries of M, \(f_{t}\) is the dimension of the text features, and k is the low rank of M.

Accelerated attributed network embedding (AANE) [51] is another approach which uses both connectivity and attribute information to perform network embedding. AANE models a joint optimization function with two components: a spectral clustering based strategy to preserve node proximities and a matrix factorization framework to approximate the attribute affinity matrix. Further, the authors provide a distributed algorithm to solve the optimization objective efficiently. The loss function of AANE is modeled as

$$\begin{aligned} \begin{aligned} {\min }_{H} {\big \Vert S - HH^T \big \Vert }_F^2 + \lambda {\sum }_{(i,j) \in \epsilon }w_{ij}({\big \Vert h_{i}-h_{j} \big \Vert }_F^2 )\end{aligned} \end{aligned}$$
(8)

Here S represents the attribute affinity matrix, H the embedding matrix, and \(h_{i}\) and \(h_{j}\) the vector representations of nodes i and j. The \({\min }_{H} {\vert \vert S - HH^T \vert \vert }_F^2\) component preserves the node attribute proximity and the \(\lambda \sum _{(i,j) \in \epsilon }w_{ij}({\vert \vert h_{i}-h_{j} \vert \vert }_F^2)\) component preserves the network structure proximity. The time complexity of AANE is \(O(nN_{A}+n^2 )\), where \(N_{A}\) is the number of non-zero entries in the attribute affinity matrix A and n is the number of nodes in the network.

In real-world networks like citation networks, a paper may have text associated with it, the category information the paper belongs to, and reference links to other papers. Such networks can be modeled as graphs with node structure, content, and labels. TriDNR [95] generates embeddings by exploiting all three levels. It uses ideas from DeepWalk and the paragraph vector algorithm [62] to embed node, text and label information. TriDNR models a joint optimization function which learns inter-node, node-content and label-content correlations, and training is done using SGD. It also uses hierarchical softmax to speed up the computations. The objective of TriDNR is to maximize the log-likelihood

$$\begin{aligned} \begin{aligned}&\max (1-\alpha ) {\sum }_{i=1}^N {\sum }_{s \in S} {\sum }_{-b \le j \le b, j\ne 0} \log P(v_{i+j}\vert v_{i})\\&\quad \quad +\,\alpha {\sum }_{i=1}^N {\sum }_{-b \le j \le b}\log P(w_j\vert v_{i}) \\&\quad \quad + \,\alpha {\sum }_{i=1}^{\vert L \vert } {\sum }_{-b \le j \le b}\log P(w_j\vert c_{i}) \end{aligned} \end{aligned}$$
(9)

The first component of the equation maximizes the likelihood of the neighboring nodes given the current node \(v_{i}\), the second component maximizes the probability of observing contextual words given the current node \(v_{i}\), and the third component maximizes the likelihood of observing the words given a class label \(c_i\). \(\alpha\) is a balance parameter that controls the proportion of network structure, text, and label information.

DeepGL [102] is a deep architecture which performs hierarchical representation learning on attributed networks. DeepGL first generates a set of base features by performing graphlet decomposition over higher order network motifs (graphlets). It then learns a set of relational feature operators which, when applied to the base features, generate higher level features. At each layer of the deep architecture, features from lower order subgraphs are combined using compositions of relational feature operators to produce higher order subgraph patterns. DeepGL is designed to be effective for network-based transfer learning tasks. The optimization function of DeepGL is stated as

$$\begin{aligned} \begin{aligned} arg {\max }_{x_{i} \notin X} \left\{K(y,x_i) -\beta {\sum }_{x_j \in X}K(x_i,x_j) \right\} \end{aligned} \end{aligned}$$
(10)

which aims to find features \(x_{i}\) that maximize their similarity to the label y while minimizing the similarity between \(x_{i}\) and the features \(x_{j}\) already in the collection X. The complexity of generating node embeddings with DeepGL is \(O(F(M + NF))\), where N, M, and F are the numbers of nodes, edges, and node features respectively.

The GCN methods presented in the previous section can also deal with attributed networks.

Table 2 Summary of attributed network embedding methods

4.2.1 Other works

DANE—DANE [70] generates representations from a network with both structure and attribute information. DANE provides an online learning model and extends the basic design to a distributed environment. It takes as input an adjacency matrix and a user feature matrix and generates two embeddings, Ea and Ex, using spectral clustering based methods. It then produces a single embedding representation by maximizing the correlation between Ea and Ex. DANE uses matrix perturbation theory to update Ea and Ex and to generate the updated embedding E.

LANE—Label informed attributed network embedding (LANE) [52] incorporates labels into the attributed network and maps the network into a low-dimensional representation by modeling their structural proximities and correlations.

CANE—Context-aware network embedding (CANE) [122] is another attributed network embedding method which learns multiple context-aware embeddings for a vertex according to the neighbors it interacts with.

NEEC—NEEC [53] aims at improving attributed network embedding by learning expert cognition and incorporating it into the embedding objective.

IIRL—IIRL [138] uses two notions, structure-close links and content-close links, to define the topological and attribute similarity between nodes. A joint optimization function is defined to preserve the proximity of structure-close and content-close links in the embedding space, and training is done using a gradient based algorithm.

4.3 Heterogeneous network embedding

Typically, some network mining tasks require the data to be modeled as heterogeneous networks [111], which involve nodes and edges of different types. For example, a citation network can be modeled as a heterogeneous network with authors, papers, and venues as nodes and the relationships between these types as edges. In this section, we cover methods which perform network embedding on heterogeneous networks. Figure 7 shows the major works on heterogeneous network embedding and Table 3 summarizes these methods.

Fig. 7 Heterogeneous network embedding methods

Metapath2vec [29] extends random walk and skip-gram based methods to heterogeneous networks. A meta path [112] is a path of the form \(V_1 \xrightarrow {R_1} V_2 \xrightarrow {R_2}\cdots V_t \xrightarrow {R_t} V_{t+1}\cdots \xrightarrow {R_{k-1}}V_k\), where \(R= R_1 \circ R_2\circ R_3\circ \cdots \circ R_{k-1}\) defines the composite relation between node types \(V_1\) and \(V_k\). Metapath2vec performs a meta path guided random walk over the heterogeneous network and generates paths which capture both the structural and the semantic relationships between different types of nodes. The resulting paths are fed to a heterogeneous skip-gram model which learns node representations by maximizing the probability of the heterogeneous context nodes given the input node. The objective function of the heterogeneous skip-gram is stated as

$$\begin{aligned} {arg \max _{\theta } \sum _{v \in V} \sum _{t \in T_{V}} \sum _{c_t\in N_{t(v)} } \log P(c_{t} \vert v;\, \theta )} \end{aligned}$$
(11)

where t denotes the node type, \(N_{t(v)}\) denotes the type-t neighborhood of node v, and \(P(c_{t} \vert v;\, \theta )\) is a softmax function which gives the probability of co-occurrence of the context-input pair. The time complexity of metapath2vec is the same as that of DeepWalk, as both use the skip-gram architecture for learning node representations. The work also introduces metapath2vec++, which makes negative sampling type-aware by maintaining a separate multinomial distribution for each node type in the output layer of the skip-gram model, thereby improving efficiency and representation quality.
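The sketch below illustrates a meta path guided walk on a toy heterogeneous graph in which every node carries a 'type' attribute; the node types, the meta path (author–paper–venue–paper–author) and the graph itself are assumptions made for the example. The generated sequences would then be fed to a (heterogeneous) skip-gram model, as in the homogeneous case.

```python
import random
import networkx as nx

def metapath_walk(G, start, metapath, length):
    """Random walk that only follows nodes matching the repeating meta path."""
    walk = [start]
    # metapath like ['A', 'P', 'V', 'P', 'A']; the first element matches `start`.
    for i in range(1, length):
        want = metapath[i % (len(metapath) - 1)]   # next required node type
        candidates = [n for n in G.neighbors(walk[-1])
                      if G.nodes[n]["type"] == want]
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk

# Toy author (A) - paper (P) - venue (V) network.
G = nx.Graph()
G.add_nodes_from(["a1", "a2"], type="A")
G.add_nodes_from(["p1", "p2"], type="P")
G.add_nodes_from(["v1"], type="V")
G.add_edges_from([("a1", "p1"), ("a2", "p2"), ("p1", "v1"), ("p2", "v1")])

print(metapath_walk(G, "a1", metapath=["A", "P", "V", "P", "A"], length=9))
```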

The main aim of HNE [16] is to map the multimodal objects in a heterogeneous network into a common space so that the similarity between objects can be computed directly. HNE considers a heterogeneous network with text–text, text–image, and image–image interactions as input. Text and image data are transformed into d-dimensional vectors and mapped to a latent space using linear transformations. An objective function is modeled to minimize the distance between objects that are topologically connected. The loss function of HNE is stated as

$$\begin{aligned} \begin{aligned}&{\min }_{U,V} \frac{1}{N_{II}} {\sum }_{v_i,v_j \in V_I}L(x_i,x_j)+ \lambda _3\big ({\vert \vert U\vert \vert }_F^2+{\vert \vert V}\vert \vert _F^2\big ) + \frac{\lambda _1}{N_{TT}}{\sum }_{v_i,v_j \in V_T}L(z_i,z_j)\\&\qquad +\,\frac{\lambda _2}{N_{IT}}{\sum }_{v_i \in V_I,v_j \in V_T}L(x_i,z_j) \end{aligned} \end{aligned}$$
(12)

where the components correspond to the losses over image–image links, text–text links and image–text links, together with a regularization term on the linear transformation matrices U and V. \(N_{II}\), \(N_{TT}\) and \(N_{IT}\) are the numbers of the three types of links, and \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\) are balancing parameters. HNE further proposes a deep architecture which maps the different modalities into a common subspace and constructs the feature representations: a CNN learns the image features, a fully connected layer learns the text features, a linear embedding layer maps the inputs into the common subspace, and a prediction layer computes the loss. Training is done using SGD.

The authors of LINE [115] extended their network embedding approach from homogeneous to heterogeneous networks with PTE [114]. PTE constructs a heterogeneous text network by combining a word–word network, a word–document network, and a word–label network. PTE then applies LINE to embed the three bipartite networks, and models a joint optimization function which collectively embeds them to generate a single heterogeneous text network embedding. The loss function of PTE is stated as

$$\begin{aligned} \begin{aligned}&minimize\, - {\sum }_{(i,j) \in E_{ww}} w_{ij}\log p(v_i \vert v_j) - {\sum }_{(i,j) \in E_{wd}} w_{ij}\log p(v_i \vert d_j) \\&\qquad \qquad - {\sum }_{(i,j) \in E_{wl}} w_{ij}\log p(v_i \vert l_j) \end{aligned} \end{aligned}$$
(13)

The first, second and third terms of the equation minimize the negative log-likelihood of co-occurrence of word–word pairs, word–document pairs and word–label pairs respectively. The authors provide two approaches to perform the learning process: (1) joint training and (2) pre-training followed by fine tuning.

HIN2Vec [32] is another work which uses a meta path based approach for representation learning in heterogeneous information networks. HIN2Vec first proposes a conceptual neural network architecture which is trained to learn the relationships between nodes by placing the possible meta paths at the output layer. The objective function of HIN2Vec is

$$\begin{aligned} \begin{aligned} {\max }_{O}{\sum }_{x,y,r \in D}\log O_{x,y,r}(x,y,r) \end{aligned} \end{aligned}$$
(14)

The function takes as input a pair of nodes x and y and a relationship r, and tries to maximize the probability of correctly predicting whether the relationship r holds between x and y. Further, the authors provide an enhanced neural network architecture which learns the node embeddings and the meta path embeddings jointly during training.

Table 3 Summary of heterogeneous network embedding methods

4.4 Signed networks

Signed networks [67, 116] are part of real social systems, where the relationship between entities can be either positive or negative. In this section, we cover methods which perform network embedding on signed networks. The major works on signed network embedding are listed in Fig. 8 and a summary of these methods is shown in Table 4.

Fig. 8 Signed network embedding methods

SIDE [55] is a network embedding method for signed directed networks. SIDE follows the random walk sampling strategy and the hierarchical optimization that are well exploited by language models. SIDE performs truncated random walks on the signed directed network and generates positive and negative node pairs based on structural balance theory. SIDE then models an optimization function which can be stated as

$$\begin{aligned} \begin{aligned}&maximize {\sum }_{(u,v) \in D}[-\log (P(u,v)) + {\sum }_{j=1}^n-\log P(u,v_j)]\\&\qquad \qquad +\frac{\lambda }{2}(\vert \vert b^{in,+}\vert \vert ^2 +\vert \vert b^{in,-}\vert \vert ^2+\vert \vert b^{out,+}\vert \vert ^2+\vert \vert b^{out,-}\vert \vert ^2) \end{aligned} \end{aligned}$$
(15)

The function tries to find the parameters that maximize the likelihood p(u, v) between two nodes u and v such that the likelihood is high for positively connected nodes and low for negatively connected nodes. The latter part of the objective function regularizes the bias terms. The time complexity of SIDE is linear w.r.t. the number of nodes in the network.

SiNE [133] is a deep learning architecture for signed network embedding. It uses structural balance theory, which assumes that a node is more similar to a node connected by a positive link than to a node connected by a negative link. SiNE first defines a similarity function between the d-dimensional representations of nodes, and then models an optimization function to learn the parameters of the similarity function, which is stated as

$$\begin{aligned} \begin{aligned}&{\min }_{X,x_0,\theta } \frac{1}{C} \left[ {\sum }_{(x_i,x_j,x_k) \in P} \max (0,f(x_i,x_k)+\delta - f(x_i,x_j)) \right. \\&\left. \qquad \qquad +\, {\sum }_{(x_i,x_j,x_0) \in P_0} \max (0,f(x_i,x_0)+ \delta _0 - f(x_i,x_j))\right] \\&\qquad \qquad +\,\alpha (R(\theta )+\vert \vert X\vert \vert _F^2+\vert \vert x_{0}\vert \vert _2^2) \end{aligned} \end{aligned}$$
(16)

where P is the set of triplets \((v_i, v_j, v_k)\) in which \(v_i\) and \(v_j\) have a positive link while \(v_i\) and \(v_k\) have a negative link, and \(P_0\) is the set of triplets \((v_i, v_j, v_0)\) in which \(v_i\) and \(v_j\) have a positive link while \(v_i\) and \(v_0\) have a negative link. C is the size of the training data and \(\theta\) is the set of parameters to learn. SiNE uses a deep neural architecture with two hidden layers to optimize the objective function.

Table 4 Summary of signed network embedding methods

SNE [140] is a log-bilinear model [88] for generating embeddings from signed networks. Given a path, SNE predicts the embedding of the target node v by linearly combining the feature vectors of the nodes in the path with their corresponding signed weight vectors. A scoring function is used to measure the similarity between the actual and the predicted representation. The optimization objective is

$$\begin{aligned} \begin{aligned} \max {\sum }_{v \in V} \log \frac{exp(s(v,h))}{{\sum }_{v^i \in V}exp(s(v^i,h))} \end{aligned} \end{aligned}$$
(17)

The objective is to maximize the log-likelihood of a target node v, given a path of nodes h and their edge types, using a softmax function. Attributed signed network embedding (SNEA) [132] is another work which addresses network embedding on signed networks with attributes associated with the nodes. SNEA first defines two optimization functions, one modeling the user attributes and the other modeling the signed network using structural balance theory. It then models a joint optimization function combining these components, and training is done using gradient descent.

4.5 Dynamic network embedding

Many real-world networks are dynamic and evolve over time [66, 109]. Between adjacent snapshots, new nodes and edges may be added and existing ones may disappear. In this section, we cover methods which perform network embedding on dynamic networks. Figure 9 shows the major works on dynamic network embedding and Table 5 summarizes these methods.

Fig. 9 Dynamic network embedding methods

The work in [147] aims at developing a temporal latent space model that can predict links over time based on a sequence of previous graph snapshots. The authors first model a quadratic loss function to learn the temporal latent space from a dynamic social network, which is stated as

$$\begin{aligned} \begin{aligned}&arg\, {\min }_{z_1,z_2,\ldots z_t}{\sum }_{\varGamma =1}^t\vert \vert G_\varGamma -Z_\varGamma Z_\varGamma ^T\vert \vert _F^2 + \lambda {\sum }_{\varGamma =1}^t\, {\sum }_{u}(1-Z_\varGamma (u)Z_{\varGamma -1}(u)^T)\\&\qquad subject \; to : \forall\,\, u,\varGamma ,Z_\varGamma \ge 0, Z_\varGamma (u)Z_{\varGamma }(u)^T=1 \end{aligned} \end{aligned}$$
(18)

where the first term is the matrix factorization of the adjacency matrices of the network snapshots \(G_\varGamma\), and the second term \(1-Z_\varGamma (u)Z_{\varGamma -1}(u)^T\) penalizes node u for a sudden change in its latent position. The loss function maintains temporal smoothness across the embeddings of consecutive snapshots by incorporating a temporal regularizer into a matrix factorization framework. A block-coordinate gradient descent (BCGD) algorithm is provided as a solution to the optimization problem. The authors also present two lemmas that prove the correctness of the method, followed by a thorough theoretical analysis of the solution. The time complexity of the BCGD algorithm is \(O(rk\sum _T(n+m_{T}))\), where n is the number of nodes, \(m_{T}\) is the number of edges in graph \(G_{T}\), k is the number of dimensions, and T is the number of timestamps. The authors further describe two variants of the algorithm, a local BCGD algorithm and an incremental BCGD algorithm, with local and incremental updates respectively. They compare the proposed methods with other latent space inference approaches in terms of inference time and memory consumption, and demonstrate the quality of the learned latent spaces in terms of their link prediction power.

Another perspective on dynamic networks is the temporal network [50, 93], whose edges are active only when an interaction happens between the nodes; these interactions may lead to a flow of information between nodes. Continuous-time dynamic network embedding (CTDNE) [92] develops embeddings for temporal networks by incorporating temporal dependencies into state-of-the-art methods. In a temporal network, each edge is labeled with a timestamp which denotes when the edge is active. CTDNE first performs a temporal random walk, in which edges are traversed in increasing order of timestamps, to generate time-aware node sequences. CTDNE then uses the skip-gram architecture to learn time preserving node embeddings from these sequences. The optimization objective of CTDNE is defined as

$$\begin{aligned} \begin{aligned} {\max }_{f} \log P_r(W_T =\{v_{i-w},\ldots ,v_{i+w}\}\setminus v_{i}\vert f(v_i)) \end{aligned} \end{aligned}$$
(19)

where \(v_{i-w},\ldots ,v_{i+w}\) are the neighboring vertices of vertex \(v_i\) and w is the context window size. The objective is to learn a function f which generates node embeddings such that vertices that co-occur in the temporal random walks lie close in the latent space. The time complexity of CTDNE is the same as that of DeepWalk, as both use the skip-gram architecture for learning node embeddings.
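A time-respecting walk of this kind can be sketched as follows: each edge carries a 'time' attribute, and every step only considers edges whose timestamp is not smaller than that of the previously traversed edge. The toy graph and timestamps are assumptions made for the example; the resulting sequences can then be fed to a skip-gram model exactly as in DeepWalk.

```python
import random
import networkx as nx

def temporal_walk(G, start, length):
    """Walk that traverses edges in non-decreasing order of their timestamps."""
    walk, t = [start], float("-inf")
    for _ in range(length - 1):
        # candidate edges leaving the current node that respect time ordering
        candidates = [(v, d["time"]) for v, d in G[walk[-1]].items()
                      if d["time"] >= t]
        if not candidates:
            break
        nxt, t = random.choice(candidates)
        walk.append(nxt)
    return walk

G = nx.Graph()
G.add_edge("a", "b", time=1)
G.add_edge("b", "c", time=2)
G.add_edge("c", "d", time=3)
G.add_edge("b", "d", time=1)

print(temporal_walk(G, "a", length=5))
```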

Table 5 Summary of dynamic network embedding methods

DynGEM [39] is a deep autoencoder based architecture for embedding dynamic networks, inspired by SDNE. Given n snapshots of a dynamic network, DynGEM incrementally builds the embedding of the snapshot at time \(t_n\) from the embedding of the snapshot at time \(t_{n-1}\). At each time step, DynGEM initializes the model from the previous time step and performs incremental learning, which reduces the time for convergence from the second snapshot onwards. Incremental learning can be viewed as a transfer learning task where the model only needs to learn the changes between two graph snapshots. Since the dynamic network may grow in size, DynGEM uses a heuristic, PropSize, to dynamically determine the number of hidden units required for each snapshot. The authors also provide several stability metrics for generating stable dynamic network embeddings. Unlike SDNE, which uses sigmoid activations and SGD for training, DynGEM uses ReLU in all autoencoder layers to support weighted graphs and Nesterov momentum with properly tuned hyperparameters for training. The loss function of DynGEM is stated as

$$\begin{aligned} L= & {} \gamma ' {\vert \vert (Y-X)\odot B\vert \vert }^2 + \alpha '\sum _{i=1,j=1}^{n} a_{ij} {\vert \vert \phi _i - \phi _j \vert \vert }_2^2\nonumber \\&+ \frac{\lambda }{2}\sum _{k=1}^{K}({\vert \vert W^{(k)}\vert \vert }_F^2+{\vert \vert \hat{W}^{(k)}\vert \vert }_F^2) \end{aligned}$$
(20)

where the first and second terms represent the second order and first order proximities respectively. DynGEM uses the same optimization objective as SDNE [129], but unlike SDNE, which operates on a static network, DynGEM optimizes the parameters of the objective at each time step, thereby learning the parameters across a series of snapshots. The time complexity of DynGEM is O(ncdit), where n is the number of vertices, d is the embedding dimension, c is the average degree of the network, i is the number of iterations, and t is the number of snapshots.

DynamicTriad [146] is another dynamic network embedding method which tries to preserve both the structure and the evolution pattern of the network. The aim is to capture the network dynamics and learn low dimensional vectors for each node at different time steps. The work considers triadic closure as an important driver of network evolution and uses it to preserve the temporal dynamics while generating embeddings. DynamicTriad models an optimization function with three components, stated as

$$\begin{aligned} \begin{aligned} {\sum }_{t=1}^T (L^t_{sh,1}+\beta _{0}L^t_{tr,2})+ \beta _{1} {\sum }_{t=1}^{T-1} {\sum }_{i=1}^N\vert \vert u^{t+1}_{i} -u^{t}_{i}\vert \vert ^2_2 \end{aligned} \end{aligned}$$
(21)

where \(L^t_{sh,1}\) is a loss term preserving the structural connectivity, \(L^t_{tr,2}\) is a loss term preserving the triadic closure process, and \(\sum _{t=1}^{T-1}\sum _{i=1}^N\vert \vert u^{t+1}_{i} -u^{t}_{i}\vert \vert ^2_2\) imposes temporal smoothness by minimizing the Euclidean distance between embedding vectors in adjacent time steps. DyRep [120] is another work which considers both topological evolution and temporal interactions, and aims to develop embeddings which encode both structural and temporal information.

5 Datasets for network representation learning

In this section, we discuss the major network datasets used in network representation learning research.

5.1 Social networks

BlogCatalog [117] This dataset is used in much of the network representation learning research. BlogCatalog is a social network denoting the relationships between blog authors listed on the BlogCatalog website. The topic category of each author can act as the node label. To model BlogCatalog as an attributed network, the tags and short descriptions of blogs can be used as user attributes. Users and groups can be considered as heterogeneous node types, forming a heterogeneous network with user-user and user-group edges.

Yelp This network represents user friendships in the Yelp social networking service. User reviews can be considered as attribute information. A heterogeneous network can be modeled using users (U), businesses (B), cities (C) and categories (T) as nodes, and friendships (U–U), user reviews (B–U), business cities (B–C), and business categories (B–T) as edges.

Flickr [117] The Flickr network denotes contacts between users of the photo sharing website Flickr. The interest groups of each user can be used as node labels. To model Flickr as an attributed network, the aggregated tags on a user's photos can be used as user attributes.

Youtube [118] A social network where users are linked if they share a common video. Users can be grouped based on their tastes, and these groups form the label of each user.

Facebook and Twitter [65] Social networks showing friend and follower relationships between users. They are usually used in works on networks with the scale-free property.

5.2 Citation networks

DBLP [113] Three datasets built from DBLP data (an author citation network, a paper citation network and a co-authorship network) are used in NRL research. The author citation network links authors when one author cites the other, the paper citation network links papers when one paper cites the other, and the co-authorship network links authors who co-author at least one paper. The paper title or abstract can be used as the attribute associated with each node.

ArXiv [66] Two datasets from arXiv, ArXiv GR-QC and ArXiv Astro-PH, are used in network representation learning research. Both are co-authorship networks where authors are linked if they co-author at least one paper.

Citeseer and Cora [83] Data from both Citeseer and Cora are used as paper citation networks, with the paper text denoting the node attributes.

Aminer computer science data [113] and database information system data [112] These datasets are commonly used to model heterogeneous networks with authors, papers, and venues as node types and with author–author, author–paper, and paper–venue edges.

5.3 Other networks

Wikipedia [77] A language network of word co-occurrences constructed from Wikipedia data, with POS tags as node labels. Wikieditor [140] is a signed network extracted from the Wikipedia dataset [61]; positive or negative edges are assigned based on the co-edit relationships between users.

PPI [11] The Protein–Protein Interaction network (PPI) is a biological network showing the interactions between proteins. Protein functions or post-translational modifications can be considered as node labels.

EpinionsFootnote 4 and SlashdotFootnote 5 Epinions is a user–user signed network constructed from the product review site Epinions. Positive and negative links between users indicate trust and distrust between them. The product reviews written by the users can be considered as the attributes. Slashdot is a technology news site that allows users to annotate other users as friends or foes, yielding a signed network.

Dynamic network datasets Collaboration network snapshots from the HEP-TH dataset [36], autonomous system communication network snapshots [64] from BGP (Border Gateway Protocol) logs, email communication network snapshots from the ENRON dataset [58], user collaboration network snapshots from Github data,Footnote 6 timestamped communication networks from Chinese Telecom and PPDai [146], and academic network snapshots from the Aminer datasetFootnote 7 are some of the dynamic network snapshots used to conduct experiments on representation learning for dynamic networks.

6 Applications of network representation learning

Researchers have applied network representation learning to various network mining applications and demonstrated performance improvements on such tasks over state-of-the-art methods. A pipeline of network embedding based network mining is shown in Fig. 10. In this section, we discuss the major applications of network representation learning.

Fig. 10 A pipeline of network embedding based network mining

6.1 Node classification

Node classification [8] is the process of assigning labels to the unlabeled nodes in a network by considering the labels of the labeled nodes and the topological structure of the network. The task is classified into single-label and multi-label node classification [59], depending upon the number of labels to be assigned to each node. A network embedding approach for node classification can be explained in three steps: (1) embed the network into a low dimensional space; (2) associate the known labels with the corresponding nodes, which form the training set; (3) train a liblinear [30] classifier to build the model, which can then be used to predict the labels of unlabeled nodes. The performance of the task can be measured using several evaluation measures such as micro-F1, macro-F1 and accuracy. Node classification has been widely used as a benchmark for testing the efficiency of network representation methods. The effect of network embedding on node classification was tested on different datasets by the various methods discussed in Section 4.1, and the results presented by the authors are summarized below.
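As an illustration of this pipeline, the following minimal sketch trains a linear classifier on learned embeddings and reports micro- and macro-F1. It assumes the embeddings and labels are available as NumPy arrays (the file names are hypothetical) and uses scikit-learn's liblinear-backed logistic regression as a stand-in for the classifiers used in the surveyed works:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# X: (num_nodes, dim) embedding matrix produced by any embedding method
# y: (num_nodes,) class labels of the labeled nodes (hypothetical files)
X = np.load("embeddings.npy")
y = np.load("labels.npy")

# Hold out half of the labeled nodes for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

clf = LogisticRegression(solver="liblinear")  # liblinear-style linear classifier
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print("micro-F1:", f1_score(y_test, pred, average="micro"))
print("macro-F1:", f1_score(y_test, pred, average="macro"))
```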

DeepWalk used social networks (BlogCatalog, Flickr, Youtube), Node2vec used social, biological, and language networks (BlogCatalog, P2P, Wikipedia), LINE used social, citation, and language networks (Flickr, Youtube, DBLP, Wikipedia), SDNE used social networks (BlogCatalog, Flickr, Youtube), GCN used citation networks (Citeseer, Cora, PubMed), HARP used social and citation networks (BlogCatalog, DBLP, Citeseer), ANE used citation and language networks (Cora, Citeseer, Wikipedia), GraphGAN used social and language networks (BlogCatalog, Wikipedia), and NETMF used social, biological, and language networks (BlogCatalog, Flickr, PPI, Wikipedia) for conducting experiments on the node classification problem on homogeneous networks. The effect of network embedding on node classification has been tested on attributed networks by TADW using citation and language networks (Cora, Citeseer, Wikipedia), AANE using social networks (BlogCatalog, Flickr, Youtube), DANE using social and citation networks (BlogCatalog, Flickr, DBLP, Epinions), IIRL using social and citation networks (BlogCatalog, Flickr, DBLP), and TriDNR using citation networks (Citeseer, DBLP). Node classification experiments were conducted on heterogeneous networks by Metapath2vec using citation networks (Aminer), and HIN2Vec using social and citation networks (BlogCatalog, Yelp, DBLP). The effect of network embedding on node classification has been tested on signed networks by SNE using a language network (Wikieditor), SiNE using social networks (Epinions, Slashdot), and SIDE using social and language networks (Epinions, Slashdot, Wikipedia), and on dynamic networks by DynamicTriad using communication and citation networks (Mobile, Loan, Aminer).

6.2 Link prediction

Link prediction [71, 73] is one of the most well-studied network mining tasks and has received greater attention in recent years due to its wide range of applications. The link prediction problem can be defined as follows: given a social network at time \(t_1\), the model needs to predict the edges that will be added to the network during the interval from the current time \(t_1\) to a given future time \(t_2\). In general, it can be related to the problem of inferring missing links from an observed network. Link prediction is useful in a variety of domains, such as social networks, where it recommends real-world friends, and genomics, where it discovers novel interactions between genes. The traditional method for link prediction is to define a similarity score between nodes based on similarity measures [71] like common neighbors, Adamic-Adar and preferential attachment. In a network embedding approach for link prediction, the nodes are first mapped into a low dimensional space. Then vector similarity measures like cosine similarity and nearest neighbor approximation can be used to score the predicted links. The performance of the link prediction task can be measured using evaluation measures such as precision and the area under the receiver operating characteristic curve (AUC) [73].
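The scoring step can be sketched as follows. This is a minimal illustration, not the procedure of any specific surveyed method: it assumes an embedding matrix, a list of candidate node pairs and their ground-truth labels (all hypothetical inputs), scores each pair by cosine similarity, and evaluates the ranking with AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Z: (num_nodes, dim) embedding matrix learned on the observed network
Z = np.load("embeddings.npy")             # hypothetical file
pairs = [(0, 5), (2, 7), (1, 3)]          # hypothetical candidate node pairs
labels = [1, 0, 1]                        # 1 = true future edge, 0 = sampled non-edge

# Score each candidate pair and measure ranking quality with AUC
scores = [cosine(Z[i], Z[j]) for i, j in pairs]
print("AUC:", roc_auc_score(labels, scores))
```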

Node2vec used social, biological, and citation networks (Facebook, PPI, ArXiv), and SDNE and GraphGAN used citation networks (ArXiv) for conducting link prediction experiments on homogeneous networks. Link prediction experiments were conducted on attributed networks by DeepGL using various network datasets available at the network repository,Footnote 8 and on heterogeneous networks by HIN2Vec using social and citation networks (BlogCatalog, Yelp, DBLP). The effect of network embedding on link prediction has been studied in signed networks by SNE using social and language networks (Slashdot, Wikieditor), SiNE using social networks (Epinions, Slashdot), SIDE using social and language networks (Epinions, Slashdot, Wiki), and SNEA using social networks (Epinions, Slashdot). Link prediction is an important challenge in dynamic networks, and the significance of using node representations for link prediction in dynamic networks was tested by TNE using various network datasets from the Koblenz Large Network Collection,Footnote 9 DynamicTriad using communication and citation networks (Mobile, Loan, Academic), DynGem using communication and citation networks (HEP-TH, ENRON, AS), CTDNE using various temporal network datasets, and DyRep using Github social network snapshots.

6.3 Network visualization

A network can be meaningfully visualized by creating a layout in 2-D space. In a network embedding approach for visualization, the learned node embeddings generated by the embedding algorithm are passed to a visualization tool (t-SNE [75], the TensorFlow embedding projector [1], a PCA plot) and visualized in a two-dimensional vector space. The visualization of the same dataset may differ across different embedding algorithms due to the differences in the properties preserved by each method.
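For instance, a minimal t-SNE sketch of this step, assuming the embeddings and optional node labels are available as NumPy arrays (hypothetical files) and using scikit-learn and matplotlib rather than any particular tool from the surveyed works:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

Z = np.load("embeddings.npy")   # (num_nodes, dim) learned node embeddings
y = np.load("labels.npy")       # optional node labels used only for coloring

# Project the embeddings to 2-D with t-SNE and plot, coloring points by label
Z_2d = TSNE(n_components=2, random_state=42).fit_transform(Z)
plt.scatter(Z_2d[:, 0], Z_2d[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE projection of node embeddings")
plt.show()
```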

Homogeneous networks are visualized by DNGR using a t-SNE visualization of the Wine datasetFootnote 10 and ANE using a t-SNE visualization of a paper citation network (DBLP). TriDNR gives a t-SNE visualization of an attributed citation network (Citeseer). Metapath2vec provides a TensorFlow embedding projector visualization of a heterogeneous network (Aminer). SNE provides a t-SNE visualization of a signed network (Wikieditor). Dynamic network snapshots are visualized by DynGem using a synthetic network (SYN) and DyRep using a user collaboration network (Github).

6.4 Node clustering

Node clustering [105] is the process of grouping the nodes of a network into different clusters such that densely connected subgraphs, which are only sparsely connected to each other, are separated into different clusters. Functional module identification [28] in PPI networks is a typical application of node clustering. Traditional approaches for graph clustering [105] include methods based on k-spanning trees, betweenness centrality, shared nearest neighbors and clique enumeration. In a network embedding based approach for node clustering, the nodes are first mapped to a low dimensional space, and vector space based clustering methods (e.g., k-means clustering) are applied to generate the node clusters. Accuracy (AC) and normalized mutual information (NMI) [110] are the commonly used measures for evaluating the performance of the node clustering task.
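A minimal sketch of this pipeline, assuming the embeddings and ground-truth community labels are available as NumPy arrays (hypothetical files) and using k-means and NMI from scikit-learn as stand-ins for the methods and measures named above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

Z = np.load("embeddings.npy")        # (num_nodes, dim) learned node embeddings
true_labels = np.load("labels.npy")  # ground-truth communities, if available

# Cluster in the embedding space; take k from the number of ground-truth communities
k = len(np.unique(true_labels))
pred_labels = KMeans(n_clusters=k, random_state=42).fit_predict(Z)

# Evaluate the clustering against the ground truth
print("NMI:", normalized_mutual_info_score(true_labels, pred_labels))
```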

Some works used node clustering as the benchmark for evaluating the quality of node embeddings. DNGR performed node clustering on a homogeneous language network (20-newsgroup network), DANE performed node clustering on attributed social and citation networks (BlogCatalog, Flickr, Epinions, DBLP), Metapath2vec and HNE performed node clustering on heterogeneous citation and social networks (Aminer, BlogCatalog), and SNEA performed node clustering on signed social networks (Epinions, Slashdot).

6.5 Other applications

Network representation learning is also applied in other areas of data mining and information retrieval. SDNE, HNE, and DynGem used network embedding for network reconstruction. GraphGAN used network embedding to build a recommender system using the Movielens dataset: a user–movie bipartite graph is constructed, and the learned representations of users and movies are used to recommend unwatched movies to each user. CUNE [143] aimed at enhancing recommender systems by incorporating social information derived from the user–item bipartite network with rating information. CUNE constructs a user interaction network from the user–item bipartite network, extracts implicit social information by embedding the nodes of the user interaction network, and finally learns an objective function that incorporates the top-k social links into the matrix factorization framework. Researchers [20] used network embedding (DeepWalk) to analyze Wikipedia pages for identifying historical analogies. The work in [54] aimed at predicting users' multiple interests from user interactions on health-related datasets. Other applications of network embedding include anomaly detection [39], multimodal search [16], information diffusion [10, 145], community detection [145], anchor-link prediction [80], emerging relation detection [144], sentiment link prediction [131], author identification [19], social relation extraction [123], and name disambiguation [142].
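As a simple illustration of embedding-based recommendation over a user–item bipartite graph (a generic sketch under assumed inputs, not GraphGAN's or CUNE's actual procedure; the file names, the toy interaction data and the dot-product scoring are all assumptions):

```python
import numpy as np

# U: (num_users, dim) user embeddings, V: (num_items, dim) item embeddings,
# both learned from a user-item bipartite graph (hypothetical files)
U = np.load("user_embeddings.npy")
V = np.load("item_embeddings.npy")
watched = {0: {1, 4}}   # items already consumed by user 0 (toy data)

def recommend(user_id: int, top_k: int = 5):
    """Rank items the user has not consumed yet by embedding dot-product score."""
    scores = V @ U[user_id]                 # one score per item
    ranked = np.argsort(-scores)            # items sorted by descending score
    seen = watched.get(user_id, set())
    return [int(i) for i in ranked if i not in seen][:top_k]

print(recommend(0))
```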

7 Conclusion and future works

As revolutionary advances in representation learning have achieved tremendous success in several application domains, the area of network mining has also been influenced by representation learning techniques, owing to their high-quality results and state-of-the-art performance. Various approaches based on representation learning have been developed to learn node representations from large and complex networks. In this paper, we build a taxonomy of network representation learning methods based on the type of networks and review the major research works that come under each category. We further discuss the various network datasets used in network representation learning research. Finally, we review the major applications of network embedding.

Network representation learning is a young and promising field with many unsolved challenges, which provide various directions for future work.

Preserving complex structure and properties of real-world networks: Most real-world networks are very complex and may contain higher order structures such as network motifs [87]. They also exhibit complex properties, including the scale-free property, hyperedges, and nodes with high betweenness centrality. Although some efforts have been made to handle the scale-free property [31] and hypernetworks [43, 124], significant improvements are still needed in these directions.

Complex network types: The taxonomy of network types that we provide in this review is not mutually exclusive. More complex network types can be modeled by combining these basic types. For example, a citation network can be modeled as a dynamic heterogeneous attributed network, which demands novel efforts in generating node embeddings.

Addressing the big graph challenge: Many real-world networks are very large, with millions of nodes and edges. Even though most embedding methods are designed to be highly scalable, a significant amount of work remains to adapt them to such huge networks. As network embedding is basically an optimization problem, large-scale optimization methods can be used to improve its scalability. Another interesting direction towards enhancing scalability is to develop new embedding strategies that can make use of large-scale graph processing platforms like Giraph and Graphx, or to parallelize the existing methods so that they work with these distributed computing platforms.

More applications: Most of the research on network embedding has focused on node classification, node clustering, and link prediction. Network mining is a fast growing field with many applications in various domains, so there is an exciting direction for further work in extending existing methods or developing novel embedding methods to solve more network mining tasks such as network evolution detection [135, 136], influential node detection [79], and network summarization [72].

A few efforts have already been made to learn hyperbolic embeddings [25] and to use deep reinforcement learning [100] for network embedding, and more work remains to be done in these significant directions.