Class structure visualization with semi-supervised growing self-organizing maps
Introduction
When all information regarding the measurement values and the class labels is known, supervised learning is the primary technique used to build classifiers. The term supervised reflects the fact that, during training, the classifier's predictions are compared with the known results and the errors are fed back to the classifier to improve its accuracy, much like a supervisor guiding the training. In data mining terminology, supervised learning is also referred to as directed data mining.
The classification problem has the goal of maximising the generalised classification accuracy, so that high prediction accuracy is obtained on both the training data and new data. Beyond merely boosting the classification accuracy, it is often useful to understand the class structure in the labelled data. This can be done by supervised learning of topology-preserving networks such as self-organizing maps (SOM), where the complexity of the class structure, in terms of similarity and degree of overlap between classes, can be identified visually on the two-dimensional (2D) grid.
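One common way to surface class structure on a trained map is to label each node with the majority class of the training samples mapped to it; overlap then shows up as intermixed labels on the 2D grid. The sketch below illustrates this idea for a generic rectangular map; the function name and array layout are our own assumptions, not the paper's notation.

```python
import numpy as np

def label_map_nodes(weights, data, labels, n_classes):
    """Assign each map node the majority class of the training
    samples mapped to it; nodes that win no sample stay -1.

    weights : (rows, cols, dim) array of node weight vectors
    data    : (n, dim) labelled training samples
    labels  : (n,) integer class labels in [0, n_classes)
    """
    rows, cols, dim = weights.shape
    votes = np.zeros((rows, cols, n_classes))
    flat = weights.reshape(-1, dim)
    for x, y in zip(data, labels):
        # best-matching unit: nearest node weight to the sample
        bmu = np.argmin(np.linalg.norm(flat - x, axis=1))
        votes[bmu // cols, bmu % cols, y] += 1
    # majority vote per node; -1 marks nodes with no mapped samples
    node_labels = np.where(votes.sum(axis=2) > 0,
                           votes.argmax(axis=2), -1)
    return node_labels
```

Plotting `node_labels` as a coloured grid gives an immediate picture of which classes sit where and where they overlap.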
However, complete data, with all entries labelled and no missing measurements, are often difficult and expensive to gather. It is therefore common for collected data to be incomplete, missing either measurement values or labels. In classical supervised learning, incomplete entries are simply discarded, but many algorithms can learn from partially labelled data (a dataset containing items with both complete and incomplete information) [3], [5], [7], [12], combining unsupervised and supervised learning to make full use of the collected data. In many cases, a semi-supervised algorithm that uses both labelled and unlabelled data improves the performance of the resulting classifier. Learning from partially labelled data has therefore become an important area of research, and a recent workshop, the ICML 2005 Workshop on Learning with Partially Classified Training Data (LPCTD), Germany, was held with this as its theme.
Previous studies of the growing self-organizing map (GSOM) [1], [10], [11], [16] have all focused on unsupervised clustering tasks. In this paper, we propose to fuse a modified form of the supervised learning architecture proposed by Fritzke [9] with the GSOM [2], thus combining a co-evolving topology-preserving network that provides immediate data visualisation on a 2D grid with a supervised learning network for class structure visualisation. The modifications to Fritzke's supervised learning architecture change the error calculation formula to enable processing of data with missing labels. With these modifications, the algorithm becomes semi-supervised: when all labels are present it behaves identically to a supervised algorithm, and when all labels are missing it functions as an unsupervised one, thereby maximising the use of all information present in the data. Three good reasons for using GSOM as the topology-preserving network are:
dynamic allocation of nodes to accommodate both complex class structures and data similarity;
constantly visualisable 2D grid for better and easier understanding of complexity, with overlaps and data structure in the labelled data space;
SOM has demonstrated the ability to process data with missing measurement values (datasets with up to 25% missing values can still produce good clustering [13]), an ability inherited by GSOM.
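The two kinds of incompleteness above can be handled in one per-sample error: distances are computed only over observed input dimensions, and a supervised output-error term is included only when the label is present. The paper's exact formula is not reproduced in this snippet, so the following is a minimal sketch of the idea; the mixing weight `alpha` and the function names are assumptions for illustration.

```python
import numpy as np

def partial_distance(x, w, mask):
    """Euclidean distance over the observed dimensions only.
    mask[i] is True where x[i] is an actual measurement."""
    if not mask.any():
        return 0.0
    d = x[mask] - w[mask]
    return float(np.sqrt(d @ d))

def node_error(x, w, mask, y=None, y_hat=None, alpha=0.5):
    """Hybrid error for one sample against one node.

    Unsupervised part: quantization error on observed inputs.
    Supervised part: class-output error, added only when the label
    y is available. With every label present the rule is fully
    supervised; with every label missing it reduces to the purely
    unsupervised case. alpha is an assumed mixing weight, not a
    value from the paper.
    """
    err = partial_distance(x, w, mask)
    if y is not None and y_hat is not None:
        err += alpha * float(np.linalg.norm(y - y_hat))
    return err
```

Because unlabelled samples simply contribute a zero supervised term, no data entry has to be discarded.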
The remaining sections of this paper are organised as follows. Section 2 describes the GSOM algorithm and the proposed semi-supervised learning architecture. In Section 3, we present the simulation results of the proposed algorithm applied to benchmark datasets, both synthetic and real world, together with a general discussion of the results for each dataset used. Section 4 gives the conclusions and possible future directions.
Background and algorithm
The dynamic self-organising map [2] is a variant of Kohonen's SOM that can grow to an adaptive size and shape with controllable spread. Initially, the GSOM network has only one lattice of nodes (e.g., four nodes for a rectangular lattice), as shown in Fig. 1(a). The GSOM algorithm consists of three phases of training: a growing phase, a rough-tuning phase and a fine-tuning phase. Prior to training, a growth control variable called the spread factor (SF, where 0 ≤ SF ≤ 1 and 0 represents
Results and discussions
Prior to testing the semi-supervised learning algorithm, we first illustrate the visualisation of class structures in Section 3.1, as the result of combining topology preservation, data visualisation and classification (fully supervised learning, with no missing data). Since this dataset is only for illustration purposes, no testing set is generated.
Later in this section, the Iris flower and Wisconsin breast cancer benchmark datasets [6] are used to test the
Conclusions
In this paper, we present a semi-supervised learning algorithm for GSOM, which is tested against benchmark data. The proposed semi-supervised learning algorithm can train classifiers from incomplete data and provide class structure visualisation. Incompleteness, in the form of masked attribute values and/or class labels, is introduced into the datasets, and the classification results are compared with those obtained when using complete data and when discarding incomplete entries. When the percentage of masked
Arthur Hsu received his Bachelor degree in 1999 and PhD in 2006 from the University of Melbourne, Australia. His main research interests are in clustering, classification and optimisation methods. He has applied his work widely to different areas of scientific research, such as bioinformatics, wireless sensor networks and natural language processing.
References (16)
- B. Fritzke, Growing cell structures—a self-organising network for unsupervised and supervised learning, Neural Networks (1994)
- et al., Enhancement of topology preservation and hierarchical dynamic self-organising maps for data visualisation, Int. J. Approx. Reason. (2003)
- Controlling the spread of dynamic self organising maps, Neural Comput. Appl. (2004)
- D. Alahakoon, S.K. Halgamuge, B. Srinivasan, Dynamic self-organising maps with controlled growth for knowledge...
- M.-R. Amini, P. Gallinari, The use of unlabeled data to improve supervised learning for text summarization, in:...
- S. Basu, A. Banerjee, R. Mooney, Semi-supervised clustering by seeding, in: Proceedings of 19th International...
- et al., Using manifold structure for partially labelled classification, Proc. Adv. Neural Inf. Process. Syst. (2002)
- C.L. Blake, C.J. Merz, UCI repository of machine learning databases...
Saman Halgamuge received his PhD from Darmstadt University of Technology, Germany, in 1995. He is currently a Professor at the University of Melbourne. Dr. Halgamuge has published in the areas of pattern recognition, stochastic modelling and optimisation, bioinformatics, bio-inspired computing and sensor networks, with applications in mechatronics and bioengineering.