Elsevier

Neurocomputing

Volume 71, Issues 16–18, October 2008, Pages 3124-3130

Class structure visualization with semi-supervised growing self-organizing maps

https://doi.org/10.1016/j.neucom.2008.04.049

Abstract

We present a semi-supervised learning method for the growing self-organising map (GSOM) that allows fast visualisation of data class structure on the 2D feature map. Instead of discarding data with missing values, the network can be trained on data with up to 60% of class labels and 25% of attribute values missing, while still achieving class prediction accuracy of over 90% on the benchmark datasets used. The proposed algorithm is compared to three variants of semi-supervised K-means learning on four real-world benchmark datasets, showing comparable performance and better generalisation.

Introduction

When all information regarding the measurement values and the class of each entry is known, supervised learning is the primary technique used for building classifiers. The term supervised comes from the fact that, during training, the classifier's predictions are compared with the known results and the errors are fed back to the classifier to improve accuracy, like a supervisor guiding the training. In data mining terminology, supervised learning is also referred to as directed data mining.

The classification problem has the goal of maximising generalised classification accuracy, such that high prediction accuracy can be obtained for both the training data and new data. Beyond merely boosting classification accuracy, it can often be useful to exploit an understanding of the class structure in the labelled data. This can be done by supervised learning of topology-preserving networks such as self-organizing maps (SOM), where the complexity of the class structure, in terms of similarity and degree of overlap between classes, can be visually identified on the two-dimensional (2D) grid.

However, complete data, with all entries labelled and without missing measurements, are difficult and expensive to gather. It can therefore often occur that the collected data are incomplete, missing either measurement values or labels. In classical supervised learning these incomplete entries are discarded, but many algorithms can learn from partially labelled data (a dataset containing items with both complete and incomplete information) [3], [5], [7], [12] by combining unsupervised and supervised learning to make full use of the collected data. In many cases, a semi-supervised algorithm that uses both labelled and unlabelled data improves the performance of the resulting classifier. Learning from partially labelled data has therefore become an important area of research, and a recent workshop (the ICML 2005 Workshop on Learning with Partially Classified Training Data, Germany) was held with this as its theme.

Previous studies of the growing self-organizing map (GSOM) [1], [10], [11], [16] have all focused on unsupervised clustering tasks. In this paper, we propose to fuse a modified form of the supervised learning architecture proposed by Fritzke [9] with the GSOM [2], thus combining the advantages of a co-evolving topology-preserving network that provides instant data visualisation on the 2D network grid and a supervised learning network for class structure visualisation. The modifications to Fritzke's supervised learning architecture involve changes to the error calculation formula to enable processing of data with missing labels. With these modifications, the algorithm becomes semi-supervised: when all labels are present it behaves identically to a supervised one, and when all labels are missing it functions as an unsupervised one, thereby maximising the use of all information present in the data. Three good reasons for using GSOM as the topology-preserving network are:

  • dynamic allocation of nodes to accommodate both complex class structure and data similarity;

  • a constantly visualisable 2D grid for easier understanding of the complexity, overlaps and structure of the labelled data space;

  • SOM has demonstrated the ability to process data with missing measurement values (datasets with up to 25% missing values can still produce good clustering [13]), an ability that GSOM inherits.
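The graceful degradation from supervised to unsupervised learning can be sketched as follows. This is an illustrative stand-in, not the authors' exact error formula (which is given in Section 2 of the paper); the function names `masked_quantization` and `node_error` are ours, and the 0/1 class penalty is an assumption made purely for illustration.

```python
import numpy as np

def masked_quantization(x, w):
    """Quantization error over observed attributes only: the usual SOM
    trick for handling missing measurement values (encoded as NaN)."""
    mask = ~np.isnan(x)
    return float(np.sum((x[mask] - w[mask]) ** 2))

def node_error(x, w, y_true=None, y_pred=None):
    """Sketch of a semi-supervised error term. With a label present, a 0/1
    supervised penalty is added to the quantization error; with the label
    missing, only the unsupervised part remains, so unlabelled data still
    shape the map."""
    err = masked_quantization(x, w)
    if y_true is not None and y_pred != y_true:
        err += 1.0
    return err
```

With every label present this reduces to Fritzke-style supervised error accumulation; with every label absent it reduces to the standard unsupervised quantization error, mirroring the two limiting behaviours described above.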

The remaining sections of this paper are organised as follows. Section 2 describes the GSOM algorithm and the proposed semi-supervised learning architecture. In Section 3, we present the simulation results of the proposed algorithm which is applied to benchmark datasets, both synthetic and real world. Also in Section 3, we give some general discussions of the simulation results for each dataset used. Section 4 gives the conclusion and possible future directions for this paper.


Background and algorithm

The dynamic self-organising map [2] is a variant of Kohonen's SOM that can grow to an adaptive size and shape with a controllable spread. Initially, the GSOM network has only one lattice of nodes (e.g., four nodes for a rectangular lattice) as shown in Fig. 1(a). The GSOM algorithm consists of three phases of training: a growing phase, a rough-tuning phase and a fine-tuning phase. Prior to training, a growth control variable called the spread factor (SF, where SF ∈ [0, 1] and 0 represents
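As a concrete illustration of how the spread factor controls growth: in the GSOM formulation of Alahakoon et al. [2], SF is mapped to a growth threshold GT = -D ln(SF), where D is the data dimensionality, and a boundary node spawns new neighbours once its accumulated quantization error exceeds GT. The sketch below assumes that formulation; the function names are ours.

```python
import numpy as np

def growth_threshold(dim, spread_factor):
    """GT = -D * ln(SF): a low SF yields a high threshold (compact map),
    while SF near 1 yields a low threshold (widely spread map)."""
    if not 0.0 < spread_factor < 1.0:
        raise ValueError("spread factor must lie strictly between 0 and 1")
    return -dim * np.log(spread_factor)

def should_grow(accumulated_error, dim, spread_factor):
    """During the growing phase, a boundary node spawns new neighbours
    once its accumulated quantization error exceeds the threshold."""
    return accumulated_error > growth_threshold(dim, spread_factor)
```

Because GT depends only on SF and the dimensionality, the same SF produces maps of comparable spread across datasets of different sizes, which is what makes the spread controllable.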

Results and discussions

Prior to testing the semi-supervised learning algorithm, we first illustrate the visualisation of class structures in Section 3.1, as the result of combining topology preservation, data visualisation and classification (fully supervised learning, with no missing data). Since this dataset is only for illustration purposes, no testing set is generated.

Later in this section, the Iris flower and Wisconsin breast cancer benchmark datasets [6] are used to test the

Conclusions

In this paper, we present a semi-supervised learning algorithm for GSOM, which is tested against benchmark data. The proposed algorithm can train classifiers from incomplete data and provide class structure visualisation. Incompleteness, in the form of masked attribute values and/or class labels, is introduced into the datasets, and the classification results are compared to those obtained when using complete data and discarding missing data. When the percentage of masked

Arthur Hsu received his Bachelor degree in 1999 and PhD in 2006 from University of Melbourne, Australia. His main research interests are in the clustering, classification and optimisation methods. He has applied his work widely to different areas of scientific research, such as bioinformatics, wireless sensor networks and natural language processing.




Saman Halgamuge received his PhD from Darmstadt University of Technology, Germany, in 1995. He is currently a Professor in the University of Melbourne. Dr. Halgamuge has published in the areas of pattern recognition, stochastic modelling and optimisation, bioinformatics, bio-inspired computing and sensor networks with applications in mechatronics and bioengineering.
