Elsevier

Pattern Recognition

Volume 86, February 2019, Pages 320-331
Visualization and machine learning analysis of complex networks in hyperspherical space

https://doi.org/10.1016/j.patcog.2018.09.018

Highlights

  • Networks and graphs are naturally embedded in Euclidean hyperspheres.

  • Communicability embedding of networks/graphs reveals clusters in networks.

  • Nonmetric multidimensional scaling allows visualization of networks in 3D communicability space.

  • Communicability clusters of papers in a citation network reveal levels of mathematization.

  • Communicability clusters in a gene-gene network reveal genes that co-participate in cancer and other diseases.

Abstract

A complex network is a condensed representation of the relational topological framework of a complex system. A main reason for the existence of such networks is the transmission of items through the entities of these complex systems. Here, we consider a communicability function that accounts for the routes through which items flow on networks. Such a function induces a natural embedding of a network in a Euclidean high-dimensional sphere. We use one of the geometric parameters of this embedding, namely the angle between the position vectors of the nodes in the hyperspheres, to extract structural information from networks. First, we propose a simple method for visualizing networks by reducing the dimensionality of the communicability space to 3-dimensional spheres. Second, we use clustering analysis to group the nodes of the networks based on their similarities in terms of their capacity to successfully deliver information through the network. After testing these approaches on benchmark networks and comparing them with the most widely used network clustering methods, we analyze two real-world examples. In the first, consisting of a citation network, we discover citation groups that reflect the level of mathematics used in their publications. In the second, we discover groups of genes that coparticipate in human diseases, reporting a few genes that coparticipate in cancer and other diseases. Both examples emphasize the potential of the current methodology for the discovery of new patterns in relational data.

Introduction

Complex networks represent a vast category of data systems describing the topological organization of many complex systems, ranging from social, technological and ecological to molecular ones [1], [2], [3], [4]. The representation of this type of data as networks provides information about the topological, spatial, and functional relations of the data. In mathematical terms, networks are graphs–simple, directed and/or weighted–in which nodes represent the entities of the system and edges represent relations between such entities. The simplest of all the possible representations of networked data is by means of simple graphs. In this case only the connectivity between entities is captured by the graph, excluding other structural factors such as directionality, nature of nodes and strength of relations. Thus, an important challenge in this modeling scenario is to extract as much information as possible from this reduced representation of the data. Consequently, the application of data analysis techniques such as machine learning and pattern recognition [5] is an important research area for networked data.

Machine learning aims at developing computational methods for “learning” from accumulated experience, either in a supervised or an unsupervised way [6], [7], [8]. In supervised learning the inference of concepts from the data is performed from a training set [9]. The learning process then constructs a mapping function from this training set, which can be applied to data not “seen” before by the model. These models correspond to either classification or regression. On the other hand, the main goal of unsupervised learning is to reveal intrinsic structures that are embedded within the data relationships [10]. In this case, the algorithms are designed to learn guided solely by the structure of the data provided, without any prior knowledge about it. The typical unsupervised learning techniques are: clustering [11], [12], [13], [14], outlier detection [15], [16], dimensionality reduction [17], and association [18].

An area of unsupervised machine learning on networked systems which has received a great deal of attention is graph/network clustering [19], [20], [21], [22], [23], [24], [25], [26], [27]. In general, the problem consists of the unsupervised detection of groups of nodes–known as communities in network theory [22], [23], [25], [26]–which are more similar to one another than to nodes outside these clusters. The main interest in network clustering is due to its numerous applications, making the problem of graph clustering a data-driven task. The most frequently used definition of community in networks is the one based on edge density. For instance, in her 2007 overview of graph clustering Schaeffer [20] recalls that “it is generally agreed upon that a subset of vertices forms a good cluster if the induced subgraph is dense, but there are relatively few connections from the included vertices to vertices in the rest of the graph”. In his seminal overview of 2010 Fortunato [23] pointed out that “communities in graphs are related, explicitly or implicitly, to the concept of edge density (inside versus outside the community)”. He makes clear the difference with data clustering where “communities are sets of points which are “close” to each other, with respect to a measure of distance or similarity, defined for each pair of points”. More recently, Silva and Zhao [5] in their book tacitly define a community “as a subgraph whose vertices are densely connected within itself, but sparsely connected with the remainder of the network”. However, the complexity of graphs representing real-world systems is large enough that we need not restrict our definition of clusters to those based on edge density only. As this is a data-driven problem, our main task is to design methods that allow the detection of clusters of nodes/edges which are structurally similar to each other and that may contain important functional information about the processes taking place on real-world systems.

Let us consider here an example for motivating the use of other definitions of clustering on graphs/networks. Suppose that there is strong empirical evidence that groups of fused triangles–pairs of triangles that share an edge–represent functional groups for certain classes of real-world networks. In Fig. 1 we illustrate a hypothetical network displaying three clusters of fused triangles represented in three different colors. Even by eye we can see that there are two “communities” according to the traditional definition based on edge density. This means that every method designed to detect density-communities will fail to detect the fused-triangle clusters in this network. It does not mean that a method designed to detect such triangle-based structures is better or worse than the ones designed to detect density-communities. They are simply designed to perform different tasks on the same dataset. With the goal of enriching the structural information contained in graph clustering, a series of methods have been proposed which use the embedding of the graphs in geometric spaces. For instance, Xiao and Hancock [28] embed graphs using the heat kernel and then, by equating the spectral heat kernel and its Gaussian form, they are able to approximate the Euclidean distance between nodes on the manifold. After this they perform principal component analysis (PCA) and demonstrate that it leads to well-defined graph clusters. Other approaches use tools from subspace analysis on a Grassmann manifold to produce a low-dimensional representation of the original graphs which preserves important structural information [29]. Others embed the networks into hyperbolic space such that network community structure is obtained from node similarity in such “underlying hidden metric space” [30]. From a pattern recognition perspective, the embedding of graphs into different spaces is a widely used technique [31], [32], [33], [34].
In general, these methods can be grouped under the umbrella of “geometric learning” methods [35]. Many of these algorithms are based on spectral techniques on graphs [36], [37]. Specifically, these approaches propose to embed the vertices of the original graph into a low-dimensional space, which consists of the top eigenvectors of a special matrix, and then to carry out the clustering in such low-dimensional spaces [35].
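Returning to the fused-triangle example above, the structural definition (two triangles sharing an edge) can be made concrete with a short sketch. It is only an illustration of the definition, not a method from the paper: the function name `fused_triangle_pairs` and the toy edge list are hypothetical, and the network of Fig. 1 is not reproduced here.

```python
def fused_triangle_pairs(edges):
    """Map each edge shared by at least two triangles to the apex nodes of those triangles."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    fused = {}
    for u, v in edges:
        # Every common neighbour of u and v closes one triangle on the edge (u, v).
        common = adj[u] & adj[v]
        if len(common) >= 2:
            fused[tuple(sorted((u, v)))] = sorted(common)
    return fused


# Hypothetical toy graph: triangles (0, 1, 2) and (1, 2, 3) are fused along
# the edge 1-2, and node 4 hangs off node 3 without closing any triangle.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
fused = fused_triangle_pairs(edges)
print(fused)
```

A density-based community detector has no reason to single out the edge 1-2 in such a graph, which is the point the example in the text makes: the two notions of cluster answer different questions.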

The goal of this paper is twofold. First, we propose a method for visualizing complex networks and graphs embedded in the communicability hyperspherical space. This goal is reached by using multidimensional scaling to reduce the dimensionality of the communicability space to a three-dimensional one. The second goal is to use clustering methods to detect clusters of nodes having more communicability among them than with the rest of the nodes. In this case, again, instead of “imposing” an embedding of the network in a given manifold we consider the geometric space generated by the flow of “items” on a network in a diffusion-like process. This space is a Euclidean $(n-1)$-sphere, where $n$ is the number of nodes of the graph. After testing the method on a few benchmark networks, we analyze two real-world systems. One is a citation network and the other a network of gene co-participation in human genetic diseases. In the first case we discover the existence of groups of authors that represent a wide range of disciplines, demarcated mainly by their level of mathematization. In the second example we discover a few genes which co-participate in neurological diseases and cancer, as well as in other groups of diseases and cancer.

Section snippets

Preliminaries

Here we follow standard notation and definitions in network theory (see for instance [2]). Let $\Gamma = (V, E)$ be a simple graph and let $A$ be its adjacency matrix. We consider here undirected graphs such that the associated adjacency matrix is symmetric, and its eigenvalues are real. We label the eigenvalues of $A$ in non-increasing order: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$. Since $A$ is a real-valued, symmetric matrix, we can decompose $A$ into $A = U \Lambda U^T$, where $\Lambda$ is a diagonal matrix containing the eigenvalues of $A$ and $U = [\psi_1, \ldots, \psi_n]$
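For orientation, the spectral decomposition also gives a closed form for matrix functions of $A$. The block below is a sketch based on standard linear algebra, not text recovered from the truncated snippet. It assumes that the columns $\psi_j$ of $U$ are orthonormal eigenvectors, writes $\psi_j(p)$ for the $p$-th entry of $\psi_j$, and assumes that the communicability function is the matrix exponential, as is common in the literature on communicability.

```latex
\begin{align}
  A &= U \Lambda U^{T} = \sum_{j=1}^{n} \lambda_j \, \psi_j \psi_j^{T}, \\
  \exp(A) &= U \exp(\Lambda) \, U^{T}, \\
  \bigl(\exp(A)\bigr)_{pq} &= \sum_{j=1}^{n} \psi_j(p) \, \psi_j(q) \, e^{\lambda_j}.
\end{align}
```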

Hyperspherical embedding of networks

An important property of the communicability function of networks is that it induces an embedding of the network into a given Euclidean space. The important parameter in this case is the difference between the number of weighted closed walks that start at (and return to) the corresponding nodes $u$ and $v$, and the number of weighted walks that start at node $u$ (respectively $v$) and end at node $v$ (respectively $u$). This difference, which is defined below as $\xi_{uv}^2$, serves as a quantification of the
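The quantities described in this snippet can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' implementation: it takes the communicability matrix to be G = exp(A), reads the difference described above as ξ²_uv = G_uu + G_vv − 2 G_uv, and uses cos θ_uv = G_uv / sqrt(G_uu G_vv) for the angle between position vectors, which are the forms usually given in the communicability literature. The function names are hypothetical, and the truncated Taylor series is only adequate for small graphs.

```python
import math


def mat_mul(X, Y):
    """Plain dense matrix product for small square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def communicability(A, terms=40):
    """Approximate G = exp(A) by the truncated series sum_k A^k / k! (assumed definition)."""
    n = len(A)
    G = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in G]  # holds A^k / k!, starting with k = 0
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        G = [[G[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return G


def communicability_distance_sq(G, u, v):
    """Closed-walk weights at u and v minus the walk weights between u and v."""
    return G[u][u] + G[v][v] - 2.0 * G[u][v]


def communicability_angle(G, u, v):
    """Angle in degrees between the position vectors of u and v (assumed formula)."""
    c = G[u][v] / math.sqrt(G[u][u] * G[v][v])
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))


# Path graph 0 - 1 - 2 as a toy example.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
G = communicability(A)
print(communicability_distance_sq(G, 0, 2))
print(communicability_angle(G, 0, 1), communicability_angle(G, 1, 2))
```

For larger networks a library routine for the matrix exponential or an eigendecomposition would be the natural choice; the pure-Python series is used here only to keep the sketch dependency-free.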

Network visualization via nonmetric multidimensional scaling

The main goal of this section is to propose a method to visualize networks which are naturally embedded into a Euclidean hyperspherical space. The hyperspherical embedding induced by the communicability geometry of a network does not allow the corresponding network to be visualized directly, because of the high dimensionality of the embedding space. We therefore aim to reduce the dimensionality to a 3-dimensional (3D) Euclidean space, which allows us to visualize the network structure. We selected the 3D

Cluster analysis

Our second goal in this paper is to propose a method for detecting clusters of nodes and edges in complex networks. In our context of network analysis, the problem of clustering in the multidimensional communicability space consists in placing nodes close to each other if they share certain structural similarities that make them cluster together, while structurally dissimilar nodes are placed far apart in the 3D embedding studied here. The problem of clustering is one of the most popular

Conclusions

In this work we propose a way to extract network information by considering the Euclidean hyperdimensional representation that naturally emerges from the communicability function of relational data. It should be remarked that this “geometric learning” approach differs from others in the literature in the following respect. While many geometric learning methods are based on an imposed embedding of the network in a given space, here we exploit a natural embedding of the graph emerging from the flow of items

María Pereda received a Master Degree in Research in Process Systems Engineering from University of Valladolid (Spain) in 2010, and a Ph.D. degree in Process Systems Engineering from University of Valladolid (Spain) in 2014. She is currently a Postdoctoral researcher in the Computational Social Science and Humanities Group at RWTH Aachen University (Germany).

References (67)

  • L. Ma et al.

    Surveying network community structure in the hidden metric space

    Physica A

    (2012)
  • J. Gibert et al.

    Graph embedding in vector spaces by node attribute statistics

    Pattern Recognit.

    (2012)
  • M.M. Bronstein et al.

    Geometric deep learning: going beyond Euclidean data

    IEEE Signal Process. Mag.

    (2017)
  • A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm,...
  • E. Estrada et al.

    Network properties revealed through matrix functions

    SIAM Rev.

    (2010)
  • E. Estrada et al.

    Subgraph centrality in complex networks

    Phys. Rev. E

    (2005)
  • A.-L. Barabási et al.

    Emergence of scaling in random networks

    Science

    (1999)
  • J.W. Jaromczyk et al.

    Relative neighborhood graphs and their relatives

    Proc. IEEE

    (1992)
  • D.L. Davies et al.

    A cluster separation measure

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1979)
  • M. Girvan et al.

    Community structure in social and biological networks

    Proc. Natl. Acad. Sci.

    (2002)
  • M.E.J. Newman

    The structure and function of complex networks

    SIAM Rev.

    (2003)
  • E. Estrada

    The Structure of Complex Networks: Theory and Applications

    (2011)
  • V. Latora et al.

    Complex Networks: Principles, Methods and Applications

    (2017)
  • T. Silva et al.

    Machine Learning Complex Networks

    (2016)
  • M.I. Jordan et al.

    Machine learning: trends, perspectives, and prospects

    Science

    (2015)
  • P. Domingos

    A few useful things to know about machine learning

    Commun. ACM

    (2012)
  • I.H. Witten et al.

    Data Mining: Practical Machine Learning Tools and Techniques

    (2016)
  • G. Gan, C. Ma, J. Wu, Data clustering: theory, algorithms, and applications, Data clustering: theory, algorithms, and...
  • G. Karypis et al.

    Chameleon: hierarchical clustering using dynamic modeling

    Computer

    (1999)
  • A.K. Jain et al.

    Algorithms for Clustering Data

    (1988)
  • L. Kaufman et al.

    Finding Groups in Data: An Introduction to Cluster Analysis

    (2005)
  • C.-T. Lu, D. Chen, Y. Kou, Algorithms for spatial outlier detection,...
  • L. van der Maaten et al.

    Dimensionality Reduction: A Comparative Review

    (2007)

    Ernesto Estrada has been the Chair in Complexity Science at the University of Strathclyde since 2008. He is now ARAID Senior Researcher at the Institute of Applied Mathematics at the University of Zaragoza, Spain. He is also the Editor-in-Chief of the Journal of Complex Networks and has written two textbooks on networks published by Oxford University Press. He has published about 200 papers on networks and their applications in leading international journals and has been invited to major scientific conferences, including as a plenary speaker at the 2012 SIAM Annual Meeting.
