Visualization and machine learning analysis of complex networks in hyperspherical space
Introduction
Complex networks represent a vast category of data systems describing the topological organization of many complex systems, ranging from social, technological and ecological to molecular ones [1], [2], [3], [4]. The representation of this type of data as networks provides information about the topological, spatial, and functional relations of the data. In mathematical terms, networks are graphs (simple, directed and/or weighted) in which nodes represent the entities of the system and edges represent relations between such entities. The simplest of all the possible representations of networked data is by means of simple graphs. In this case only the connectivity between entities is captured by the graph, excluding other structural factors such as directionality, nature of nodes and strength of relations. An important challenge in this modeling scenario is therefore to extract as much information as possible from this reduced representation of the data. Consequently, the use of data analysis techniques, such as machine learning and pattern recognition [5], is an important research area for networked data.
Machine learning aims at developing computational methods for “learning” from accumulated experience, either in a supervised or an unsupervised way [6], [7], [8]. In supervised learning the inference of concepts from the data is performed from a training set [9]. The learning process then constructs a mapping function from this training set, which can be applied to data not “seen” before by the model. These models correspond to either classification or regression. On the other hand, the main goal of unsupervised learning is to reveal intrinsic structures that are embedded within the data relationships [10]. In this case, the algorithms are designed to learn guided solely by the structure of the data provided, without any prior knowledge about it. The typical unsupervised learning techniques are: clustering [11], [12], [13], [14], outlier detection [15], [16], dimensionality reduction [17], and association [18].
An area of unsupervised machine learning on networked systems which has received a great deal of attention is graph/network clustering [19], [20], [21], [22], [23], [24], [25], [26], [27]. In general, the problem consists in the unsupervised detection of groups of nodes (known as communities in network theory [22], [23], [25], [26]) which are more similar to each other than to nodes outside these clusters. The main interest in network clustering is due to its numerous applications, making the problem of graph clustering a data-driven task. The most frequently used definition of community in networks is the one based on edge density. For instance, in her 2007 overview of graph clustering Schaeffer [20] recalls that “it is generally agreed upon that a subset of vertices forms a good cluster if the induced subgraph is dense, but there are relatively few connections from the included vertices to vertices in the rest of the graph”. In his seminal overview of 2010 Fortunato [23] pointed out that “communities in graphs are related, explicitly or implicitly, to the concept of edge density (inside versus outside the community)”. He makes clear the difference with data clustering, where “communities are sets of points which are “close” to each other, with respect to a measure of distance or similarity, defined for each pair of points”. More recently, Silva and Zhao [5] in their book tacitly define a community “as a subgraph whose vertices are densely connected within itself, but sparsely connected with the remainder of the network”. However, graphs representing real-world systems are complex enough that we need not restrict our definition of clusters to those based on edge density only. As a data-driven problem, our main task is to design methods that allow the detection of clusters of nodes/edges which are structurally similar to each other and that may contain important functional information about the processes taking place on real-world systems.
Let us consider here an example motivating the use of other definitions of clustering on graphs/networks. Suppose that there is strong empirical evidence that groups of fused triangles (pairs of triangles that share an edge) represent functional groups for certain classes of real-world networks. In Fig. 1 we illustrate a hypothetical network displaying three clusters of fused triangles represented in three different colors. Even by eye we can see that there are two “communities” according to the traditional definition based on edge density. This means that every method designed to detect density-communities will fail to detect the fused-triangle clusters in this network. It does not mean that a method designed to detect such triangle-based structures is better or worse than the ones designed to detect density-communities. They are simply designed to perform different tasks on the same dataset.
With the goal of enriching the structural information contained in graph clustering, a series of methods have been proposed which use the embedding of graphs in geometric spaces. For instance, Xiao and Hancock [28] embed graphs using the heat kernel and then, by equating the spectral heat kernel with its Gaussian form, approximate the Euclidean distance between nodes on the manifold. After this they perform principal component analysis (PCA) and demonstrate that it leads to well-defined graph clusters. Other approaches use tools from subspace analysis on a Grassmann manifold to produce low-dimensional representations of the original graphs which preserve important structural information [29]. Others embed the networks into hyperbolic space such that network community structure is obtained from node similarity in such an “underlying hidden metric space” [30]. From a pattern recognition perspective, the embedding of graphs into different spaces is a widely used technique [31], [32], [33], [34].
In general, these methods can be grouped under the umbrella of “geometric learning” methods [35]. Many of these algorithms are based on spectral techniques on graphs [36], [37]. Specifically, these approaches embed the vertices of the original graph into a low-dimensional space spanned by the top eigenvectors of a special matrix, and then carry out the clustering in that low-dimensional space [35].
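The spectral recipe described above can be illustrated with a minimal sketch. It assumes the “special matrix” is the adjacency matrix itself (graph Laplacians are another common choice), and it uses a hypothetical toy graph of two triangles joined by one edge. The function name `spectral_embedding` and the sign-based split are illustrative choices, not a method taken from the cited works.

```python
import numpy as np

def spectral_embedding(A, k):
    """Return the k leading eigenvectors of a symmetric matrix A as node coordinates."""
    A = np.asarray(A, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(A)       # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    return eigvecs[:, order]

# Hypothetical toy graph: triangle {0,1,2} and triangle {3,4,5} joined by the edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

X = spectral_embedding(A, 2)                   # each row is a node in a 2-D spectral space
# Crude clustering step (an assumption): split nodes by the sign of the second coordinate.
labels = (X[:, 1] > 0).astype(int)
```

In this toy case the second leading eigenvector changes sign between the two triangles, so the sign split recovers them. For larger graphs one would normally run a clustering algorithm on the rows of `X`.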
The goal of this paper is twofold. First, we propose a method for visualizing complex networks and graphs embedded in the communicability hyperspherical space. This goal is reached by using multidimensional scaling to reduce the dimensionality of the communicability space to three. The second goal is to use clustering methods to detect clusters of nodes having more communicability among them than with the rest of the nodes. In this case, again, instead of “imposing” an embedding of the network in a given manifold we consider the geometric space generated by the flow of “items” on a network in a diffusion-like process. This space is an $(n-1)$-dimensional Euclidean sphere, where n is the number of nodes of the graph. After testing the method on a few benchmark networks we embarked on the analysis of two real-world systems. One is a citation network and the other a network of gene co-participation in human genetic diseases. In the first case we discovered the existence of groups of authors representing a wide range of disciplines, mainly demarcated by their level of mathematization. In the second example we discovered a few genes which co-participate in neurological diseases and cancer, as well as in other groups of diseases and cancer.
Section snippets
Preliminaries
Here we follow standard notation and definitions in network theory (see for instance [2]). Let $G = (V, E)$ be a simple graph and let A be its adjacency matrix. We consider here undirected graphs, such that the associated adjacency matrix is symmetric and its eigenvalues are real. We label the eigenvalues of A in non-increasing order: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$. Since A is a real-valued, symmetric matrix, we can decompose A into $A = U \Lambda U^{T}$, where $\Lambda$ is a diagonal matrix containing the eigenvalues of A and
Hyperspherical embedding of networks
An important property of the communicability function of networks is that it induces an embedding of the network into a given Euclidean space. The important parameter in this case is the difference between the number of weighted closed walks that start at (and return to) the corresponding nodes u and v, and the number of weighted walks that start at node u (respectively v) and end at node v (respectively u). This difference, defined below, serves as a quantification of the
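A minimal numerical sketch of the quantity just described is given here under two stated assumptions. The first is that the communicability function is the matrix exponential $G = \exp(A)$, so that walks of length k are weighted by 1/k!. The second is that the squared distance between nodes u and v is $G_{uu} + G_{vv} - 2G_{uv}$. The definition used in the paper may include further parameters, and the function names below are hypothetical.

```python
import numpy as np

def communicability_matrix(A):
    """G = exp(A), computed from the spectral decomposition of a symmetric matrix A."""
    eigvals, U = np.linalg.eigh(np.asarray(A, dtype=float))
    return (U * np.exp(eigvals)) @ U.T         # U diag(exp(lambda)) U^T

def communicability_distance(A):
    """Pairwise distances assuming xi_uv^2 = G_uu + G_vv - 2 G_uv."""
    G = communicability_matrix(A)
    d = np.diag(G)                             # weighted closed walks at each node
    xi2 = d[:, None] + d[None, :] - 2.0 * G    # closed walks minus walks between u and v
    return np.sqrt(np.maximum(xi2, 0.0))       # clip tiny negative rounding errors

# Hypothetical example: the path graph 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
xi = communicability_distance(A)
```

The resulting matrix `xi` is symmetric with a zero diagonal, and it can be passed as a dissimilarity matrix to a dimensionality-reduction or clustering routine.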
Network visualization via nonmetric multidimensional scaling
The main goal of this section is to propose a method to visualize networks which are naturally embedded into a Euclidean hyperspherical space. The hyperspherical embedding induced by the communicability geometry of a network does not allow the network to be visualized directly, due to the high dimensionality of the embedding space. We therefore aim here to reduce the dimensionality of this space to a 3-dimensional (3D) Euclidean space, which allows us to visualize the network structure. We selected the 3D
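To make the dimensionality-reduction step concrete, the following sketch maps a matrix of pairwise distances to 3D coordinates. It deliberately uses classical (Torgerson) multidimensional scaling as a simpler stand-in; it is not the nonmetric variant named in the section title, which fits only the rank order of the dissimilarities. The function name `classical_mds` and the random input points are hypothetical.

```python
import numpy as np

def classical_mds(D, dim=3):
    """Classical MDS: coordinates whose Euclidean distances approximate D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:dim]    # keep the dim largest
    lam = np.clip(eigvals[order], 0.0, None)   # guard against small negative values
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical input: distances among 8 random points in a 5-dimensional space.
rng = np.random.default_rng(0)
P = rng.normal(size=(8, 5))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
Y = classical_mds(D, dim=3)                    # one 3D coordinate per point
```

The rows of `Y` can be plotted directly as node positions. When the input distances already come from points in three dimensions, classical MDS reproduces them exactly up to rotation and reflection.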
Cluster analysis
Our second goal in this paper is to propose a method for detecting clusters of nodes and edges in complex networks. In our context of network analysis, the problem of clustering in the multidimensional communicability space consists in having nodes close to each other if they share certain structural similarities which make them cluster together, while structurally dissimilar nodes are placed far apart in the 3D embedding studied here. The problem of clustering is one of the most popular
Conclusions
In this work we propose a way to extract network information by considering the Euclidean hyperdimensional representation that naturally emerges from the communicability function of relational data. It should be remarked that this “geometric learning” approach differs from others in the literature in the following respect. While many geometric learning methods are based on an imposed embedding of the network in a given space, here we exploit a natural embedding of the graph emerging from the flow of items
María Pereda received a Master Degree in Research in Process Systems Engineering from University of Valladolid (Spain) in 2010, and a Ph.D. degree in Process Systems Engineering from University of Valladolid (Spain) in 2014. She is currently a Postdoctoral researcher in the Computational Social Science and Humanities Group at RWTH Aachen University (Germany).
References (67)
- et al., Complex networks: structure and dynamics, Phys. Rep. (2006)
- et al., Machine learning on big data, Neurocomputing (2017)
- Data clustering: 50 years beyond k-means, Pattern Recognit. Lett. (2010)
- et al., On-line outlier detection and data cleaning, Comput. Chem. Eng. (2004)
- et al., Community discovery in networks with deep sparse filtering, Pattern Recognit. (2018)
- et al., Quantum-behaved discrete multi-objective particle swarm optimization for complex network clustering, Pattern Recognit. (2017)
- et al., Communities in networks, Not. Am. Math. Soc. (2009)
- et al., Community structure in graphs
- et al., Community detection in networks: a user guide, Phys. Rep. (2016)
- et al., Clustering on multi-layer graphs via subspace analysis on Grassmann manifolds, IEEE Trans. Signal Process. (2014)
- Surveying network community structure in the hidden metric space, Physica A
- Graph embedding in vector spaces by node attribute statistics, Pattern Recognit.
- Geometric deep learning: going beyond Euclidean data, IEEE Signal Process. Mag.
- Network properties revealed through matrix functions, SIAM Rev.
- Subgraph centrality in complex networks, Phys. Rev. E
- Emergence of scaling in random networks, Science
- Relative neighborhood graphs and their relatives, Proc. IEEE
- A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell.
- Community structure in social and biological networks, Proc. Natl. Acad. Sci.
- The structure and function of complex networks, SIAM Rev.
- The Structure of Complex Networks: Theory and Applications
- Complex Networks: Principles, Methods and Applications
- Machine Learning Complex Networks
- Machine learning: trends, perspectives, and prospects, Science
- A few useful things to know about machine learning, Commun. ACM
- Data Mining: Practical Machine Learning Tools and Techniques
- Chameleon: hierarchical clustering using dynamic modeling, Computer
- Algorithms for Clustering Data
- Finding Groups in Data: An Introduction to Cluster Analysis
- Dimensionality Reduction: A Comparative Review
Ernesto Estrada has been the Chair in Complexity Science at the University of Strathclyde since 2008. He is now ARAID Senior Researcher at the Institute of Applied Mathematics at the University of Zaragoza, Spain. He is also the Editor-in-Chief of the Journal of Complex Networks and has written two textbooks on networks published by Oxford University Press. He has published about 200 papers on networks and their applications in leading international journals and has been invited to major scientific conferences, including as a plenary speaker at the 2012 SIAM Annual Meeting.