Incorporating network structure with node contents for community detection on large networks using deep learning
Introduction
The development of the Internet has produced an ever wider variety of data, such as online comments, product reviews and co-author networks, which affect all aspects of people's lives, and the analysis of these data has therefore attracted growing attention from researchers in various fields. One hot topic in the study of such social media or online data is discovering the underlying structure with group effect, the so-called community structure. The vertices (users) of a network can be divided into groups such that connections are dense within each group but relatively sparse between groups. Individuals or users belonging to the same community share common profiles or have common interests. Identifying communities of similar users is very important and has been applied in many areas, e.g., sociology, biology and computer science. For example, in biology, different units belonging to an organization have related functions and are interconnected through special structures that characterize the overall effect of the organization: the interaction among a set of proteins in a cell can form an RNA polymerase for the transcription of genes. In computer science, finding the salient communities in an organization of people can provide a guide that helps with web marketing, with predicting the behavior of users belonging to a community, and with understanding the functions of a complex system [1].
For community detection, several methods have been put forward that can be cast as graph clustering. In normalized cut (n-cut) [4], the Laplacian matrix is the main object to be processed. The eigenvectors with non-zero eigenvalues, obtained by eigenvalue decomposition (EVD) of the graph Laplacian matrix, are treated as the graph representation. Some other works can also be transformed into spectral clustering. For example, the modularity maximization model [10] first constructs a graph that is based on feature vectors, and then solves for the top k eigenvectors as the network representation for clustering. Here, the modularity matrix can be regarded as playing the role of the graph Laplacian matrix. We observe that these two methods (i.e. modularity optimization and n-cut) can easily capture topology-based and content-based features, respectively, by EVD of the corresponding spectral matrices, as shown in Fig. 1. However, each of these methods is often limited in its ability to obtain the important information about the structure of the communities in networks. It has been shown that fusing the vertex information (i.e. node contents) with linkage information can overcome the problem of considering only topological structure (or only node contents) for community detection [2].
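As a minimal sketch of the two spectral matrices mentioned above, the following NumPy code builds a modularity matrix from an adjacency matrix and a Markov (row-normalised similarity) matrix from node contents, then takes the leading eigenvector of the modularity matrix. The toy graph, the binary attribute matrix `F`, the inner-product similarity `S`, and all function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def modularity_matrix(A):
    """B = A - k k^T / (2m) for a symmetric adjacency matrix A."""
    k = A.sum(axis=1)              # node degrees
    return A - np.outer(k, k) / k.sum()

def markov_matrix(S):
    """Row-normalise a non-negative similarity matrix S, i.e. D^{-1} S."""
    return S / S.sum(axis=1, keepdims=True)

def top_eigenvectors(M, k):
    """Eigenvectors of a symmetric matrix M for its k largest eigenvalues."""
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order]

# Hypothetical toy graph: two triangles (nodes 0-2 and 3-5) joined by one edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

# Hypothetical binary node attributes; inner products act as content similarity.
F = np.array([[1, 0, 1], [1, 0, 0], [1, 0, 1],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]], dtype=float)
S = F @ F.T

B = modularity_matrix(A)           # topology-based spectral matrix
P = markov_matrix(S)               # content-based spectral matrix

# Split nodes by the sign of the leading eigenvector of B.
leading = top_eigenvectors(B, 1)[:, 0]
labels = (leading > 0).astype(int)
```

On this toy graph the sign of the leading eigenvector separates the two triangles; for more than two communities, the top k eigenvectors would be passed to a clustering step instead.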
When considering both topological and content information for community discovery, we can directly combine the two objective functions into one in linear form. However, some classical graph embedding methods, such as locally linear embedding (LLE) [11], show that the relations among vertices in real-world networks are not necessarily linear. A model based on this linear combination strategy is therefore still limited on real-world networks. Moreover, although we could obtain a network representation by fusing the two types of information, optimizing the combined model is inefficient, because an appropriate ratio between the two kinds of information has to be tuned manually.
In recent years, deep learning has been used in many areas, such as speech recognition [6] and image classification [7]. As is well known, a neural network is a good framework for nonlinear computation, with elements that simulate the structure and properties of neurons [8]. Among neural network models, the autoencoder, as described by Ng [5], aims to extract features from the input data. We found that the autoencoder and spectral methods both aim to obtain a low-dimensional approximation of the corresponding matrix. Based on this similarity, we adopt the autoencoder as a breakthrough point to overcome the disadvantages of linear optimization and to incorporate these two different spectral methods.
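The low-dimensional approximation that spectral methods target can be illustrated with a truncated singular value decomposition, which gives the best rank-k approximation of a matrix in the Frobenius norm (the classical lower-rank approximation result). The sketch below is only an illustration on a random matrix; the function name and the matrix are assumptions, not part of the paper's method.

```python
import numpy as np

def low_rank_approximation(M, k):
    """Best rank-k approximation of M in Frobenius norm via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return M_k, s

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))    # stand-in for a spectral matrix

rank = 3
M_k, s = low_rank_approximation(M, rank)

# The reconstruction error equals the energy of the discarded singular values.
error = np.linalg.norm(M - M_k, "fro")
discarded = np.sqrt(np.sum(s[rank:] ** 2))
```

An autoencoder pursues the same goal of reconstructing its input from a narrow code, but with nonlinear encoding, which is the similarity the text appeals to.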
To take advantage of spectral methods while also incorporating both linkage and node content information, we propose an autoencoder-based method for community detection that uses normalized cut and modularity maximization. Our work is inspired by the theoretical similarity between the autoencoder and spectral methods in terms of capturing the intrinsic structure of the spectral matrix. The framework of our main idea is shown in Fig. 1. Because the autoencoder is an unsupervised learning method, we treat only the low-dimensional encoding in the hidden layer as the network representation. In our method, we adopt the modularity maximization model and normalized cut to portray linkage and content information, respectively, and construct the spectral matrices (i.e. the modularity matrix and the Markov matrix) as the input of the autoencoder. We design a unified objective function to obtain the best reconstruction of the combined matrix that consists of the modularity matrix and the Markov matrix, while using the autoencoder to obtain the best encoding in the hidden layer as the network representation, which is then used to find communities. Furthermore, by building a multi-layer autoencoder, we adopt a deep autoencoder to obtain a powerful representation by means of the deep structure, and combine it with the intrinsic information of the original data to improve community discovery. In total, our framework makes three main contributions:
- First, in theory both the autoencoder and spectral methods are related to the low-dimensional approximation of a specified matrix, here the modularity matrix and the Markov matrix. This study uses the autoencoder to obtain a low-dimensional encoding that best reconstructs the joint matrix consisting of the modularity matrix and the Markov matrix, and treats this encoding as the graph representation for community detection. The key point is an autoencoder-based method that seamlessly achieves the joint optimization of the modularity model and normalized cut.
- Second, this encoding provides a nonlinear way to integrate the linkage and content information, which helps to further improve the performance of community detection when both types of information are used. The autoencoder not only encodes the important factors of the data in the process of reducing the dimension, but also automatically learns the weights of the relationships among the various factors so as to minimize the reconstruction error. In this framework, therefore, the performance improvement of our method comes from its self-tuning characteristic and does not depend on adjusting a balance factor.
- Third, by stacking a series of autoencoders, we build a multi-layer autoencoder that improves the generalization ability of the encoding in the hidden layer. Benefiting from the deep structure, we obtain a powerful encoding with both topological and content information, which can effectively aid network community detection.
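The idea behind these contributions can be sketched as follows: a joint matrix built from the modularity matrix and the Markov matrix is fed to a small stack of autoencoders, and the last hidden layer is used as the node representation. This is a rough sketch under stated assumptions, not the authors' implementation: the horizontal concatenation `[B, P]`, the tanh encoder with linear decoder, the layer widths, the plain gradient-descent training, and the toy graph and attributes are all hypothetical choices made for illustration.

```python
import numpy as np

def train_autoencoder(X, hidden, epochs=500, lr=0.2, seed=0):
    """Train one tanh-encoder / linear-decoder layer on mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                  # hidden encoding
        E = H @ W2 + b2 - X                       # reconstruction error
        losses.append(float(np.mean(E ** 2)))
        gR = 2.0 * E / E.size                     # gradient w.r.t. reconstruction
        gZ = (gR @ W2.T) * (1.0 - H ** 2)         # back through tanh
        W2 -= lr * (H.T @ gR); b2 -= lr * gR.sum(axis=0)
        W1 -= lr * (X.T @ gZ); b1 -= lr * gZ.sum(axis=0)
    return (lambda Z: np.tanh(Z @ W1 + b1)), losses

# Hypothetical toy graph: two triangles joined by one edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
k = A.sum(axis=1)
B = A - np.outer(k, k) / k.sum()                  # modularity matrix

# Hypothetical binary node attributes and their row-normalised similarity.
F = np.array([[1, 0, 1], [1, 0, 0], [1, 0, 1],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]], dtype=float)
S = F @ F.T
P = S / S.sum(axis=1, keepdims=True)              # Markov matrix

X = np.hstack([B, P])                             # assumed form of the joint matrix

# Greedy layer-wise stacking: each layer encodes the previous layer's output.
embedding, loss_history = X, []
for width in (4, 2):                              # assumed layer widths
    encode, losses = train_autoencoder(embedding, width)
    embedding = encode(embedding)
    loss_history.append(losses)
```

The rows of `embedding` would then be passed to a clustering algorithm to assign nodes to communities. No balance factor between `B` and `P` appears here, since the network weights are learned from the reconstruction error alone.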
The rest of the paper is organized as follows. In Section 2, we give a brief review of the related work. The proposed framework and the relevant algorithms are introduced in Section 3. Next, the datasets and experimental settings are described in Section 4, followed by the experimental evaluation and an analysis of the balance factor, which together demonstrate the effectiveness of the proposed method. The paper is concluded in Section 5.
Section snippets
Related work
Relevant work on this topic falls into three categories: community detection using topological structure alone, community detection using content information alone, and community detection using the combination of links and node contents. As described above, it is not appropriate to determine node community memberships from the network topology or the content information alone. Combining topology and content improves community detection, as shown in studies [1], [2], [3], [17], [19], [20]. However, they
Framework of community detection using deep learning
To fully utilize the advantages of deep neural networks (DNNs) for combining network topology and content information, we re-examine the properties of the modularity maximization model and normalized cut, which are the leading models for community detection, and revisit the DNN framework to find an approach suited to a seamless combination of the different modalities. These two models seek a low-rank embedding to represent the community structure and reconstruct
Experiments
Here we compare our algorithm with some state-of-the-art community detection algorithms on a range of real-world networks. We also describe the baseline methods, networked datasets and experimental setups in detail.
Conclusion
In this paper, we proposed a new method that fuses topological and content information for community detection using a deep learning framework. This study is inspired by the similarity between the autoencoder and spectral methods in terms of a low-dimensional approximation of the spectral matrix. The proposed method provides an effective approach for finding a low-dimensional encoding of the community structure and for seamlessly achieving the collective optimization of modularity and normalized cut.
Acknowledgments
The work was supported by the National Basic Research Program of China (2013CB329301) and the Natural Science Foundation of China (61772361, 61503281, 61303110).
References (39)
- et al. Performance evaluation of deep feature learning for RGB-D image/video classification. Inf. Sci. (2017)
- et al. Min-wise independent permutations. J. Comput. Syst. Sci. (2000)
- et al. Network cross-validation for determining the number of communities in network data. J. Am. Stat. Assoc. (2017)
- et al. Using content and interactions for discovering communities in social networks.
- et al. Author2vec: learning author representations by combining content and link information.
- et al. Joint identification of network communities and semantics via integrative modeling of network topologies and node contents.
- et al. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (2000)
- Sparse autoencoder. CS294A Lecture Notes (2011)
- Deep learning: from speech recognition to language and multimodal processing. APSIPA Trans. Signal and Inf. Process. (2016)
- et al. Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. (1988)
- The approximation of one matrix by another of lower rank. Psychometrika
- Modularity and community structure in networks. Proc. Natl. Acad. Sci.
- Nonlinear dimensionality reduction by locally linear embedding. Science
- Block-LDA: jointly modeling entity-annotated text and entity-entity links.
- Clique percolation in random networks. Phys. Rev. Lett.
- Multi-assignment clustering for Boolean data.
- Comparing community structure identification. J. Stat. Mech. Theory Exp.
- Discovering social circles in ego networks. ACM Trans. Knowl. Discov. Data
- Community detection in networks with node features. Electron. J. Stat.
Jinxin Cao received his B.S. degree from Shandong Normal University, China, in 2010. Since 2011, he has been a post-graduate and Ph.D. joint program student in the School of Computer Science and Technology at Tianjin University, China. His research interests include data mining and the analysis of complex networks.
Di Jin received his B.S., M.S. and Ph.D. degrees from the College of Computer Science and Technology, Jilin University, China, in 2005, 2008 and 2012, respectively. Since 2012, he has been an associate professor at Tianjin University. His current research interests include artificial intelligence, complex network analysis, and network community detection.
Liang Yang received his Ph.D. degree from the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences in 2016. He has been an assistant professor in the School of Information Engineering, Tianjin University of Commerce. His current research interests include community detection, machine learning and computer vision.
Jianwu Dang graduated from Tsinghua University, China, in 1982, and received his M.S. from the same university in 1984. He worked at Tianjin University as a lecturer from 1984 to 1988. He was awarded a Ph.D. by Shizuoka University, Japan, in 1992. Since 2001, he has been with the Japan Advanced Institute of Science and Technology (JAIST). His research interests cover all fields of speech production, speech synthesis, and speech cognition.