Elsevier

Neural Networks

Volume 28, April 2012, Pages 90-105
A life-long learning vector quantization approach for interactive learning of multiple categories

https://doi.org/10.1016/j.neunet.2011.12.003

Abstract

We present a new method capable of learning multiple categories in an interactive and life-long learning fashion to approach the “stability–plasticity dilemma”. The problem of incremental learning of multiple categories is still largely unsolved, especially in the domain of cognitive robotics, which requires real-time and interactive learning. To achieve this life-long learning ability for a cognitive system, we propose a new learning vector quantization approach combined with a category-specific feature selection method that allows several metrical “views” on the representation space of each individual vector quantization node. These category-specific features are collected incrementally during the learning process, so that a balance between the correction of wrong representations and the stability of acquired knowledge is achieved. We demonstrate our approach on a difficult visual categorization task, where learning is applied to several complex-shaped objects rotated in depth.

Introduction

Humans are able to acquire and maintain knowledge during their complete lifetime. This outstanding ability is called life-long learning (Bagnall, 1990). In contrast, artificial neural networks are typically adapted only during their learning phase, and their weights, representing the learned knowledge, are fixed afterwards. Such a static learning architecture can be powerful in constrained and stationary environments but may not be suitable for technical applications like assistive robots or interactive agents, because these systems require continuous error correction and need to enlarge their knowledge base to operate in changing and unpredictable environments.

Our target is to propose a novel categorization approach that enables interactive and life-long learning in high-dimensional sensory feature spaces. This imposes particular requirements on the learning architecture: the assumption of an unpredictably changing learning environment forces the learning system to self-adapt its representation parameters. Scalability to a large number of categories requires efficient memory usage. Learning in direct interaction with humans requires real-time updates of the stored category representations. Finally, the ability to learn multiple categories at the same time is a great advantage for an efficient and natural human training dialog with the learning system; the isolated training of single categories would induce an explosion of individual category exemplars that must be shown.

The fundamental problem of life-long learning with artificial neural networks is the so-called “stability–plasticity dilemma”. Here the term plasticity refers to the ability of a learning system to incorporate newly acquired knowledge into its internal representation. Plasticity can be achieved with incremental neural networks like the growing neural gas (Fritzke, 1995). For this architecture the training process starts with a minimal network and iteratively increases the network size based on an insertion criterion; the final network size then reflects the complexity of the current learning task. However, already learned knowledge should also be preserved to guarantee the stability of previously learned information, which poses the largely unsolved stability problem. This challenge occurs if a network model is trained with a limited and changing training ensemble, as in life-long learning tasks, where it is infeasible to store all experiences over the complete operation time of the system. Therefore, in this paper we are particularly interested in incremental learning of representations under the condition that a particular training vector can only be accessed for a limited time period. Training with such a changing data ensemble typically causes the well-known “catastrophic forgetting” effect (French, 1999): with the incorporation of newly acquired knowledge, the previously learned knowledge quickly fades out. Closely related to this effect is “catastrophic interference” (McCloskey & Cohen, 1989): patterns of different categories that are similar in feature space confuse the learning process and overwrite earlier presented patterns.
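The growth dynamic just described, starting minimal and inserting nodes when an insertion criterion fires, can be sketched in a few lines. The following Python fragment is a deliberately simplified, hypothetical illustration of criterion-driven node insertion, not Fritzke's GNG itself (which additionally maintains topology edges, accumulated errors and node ages):

```python
import numpy as np

def train_incremental(data, insert_threshold=1.0, eta=0.2):
    """Criterion-driven growth: start with one node; insert a new node when
    the best-matching node is farther than insert_threshold from the input,
    otherwise adapt the best-matching node toward the input."""
    nodes = [np.array(data[0], dtype=float)]
    for x in data[1:]:
        x = np.asarray(x, dtype=float)
        dists = [np.linalg.norm(x - w) for w in nodes]
        best = int(np.argmin(dists))
        if dists[best] > insert_threshold:
            nodes.append(x.copy())                   # grow: new, plastic node
        else:
            nodes[best] += eta * (x - nodes[best])   # adapt existing node
    return nodes

# two well-separated clusters -> the sketch allocates one node per cluster
nodes = train_incremental([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
```

Such a sketch makes the dilemma tangible: plasticity comes from unconstrained insertion and adaptation, but nothing protects an old node from being dragged away or duplicated when the data ensemble changes.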

The requirements for life-long learning architectures also depend on the targeted recognition task. For identification tasks, where the target is the separation of a specific instance (e.g. one particular physical object) from all other instances, the combination of incremental learning with stability considerations for consolidated network parts is typically sufficient (Kirstein, Wersing, & Körner, 2008). In contrast, categorization tasks require a mapping from several object instances to a shared attribute (e.g. the basic shape). For visual categorization, where an individual object (e.g. a red–white car) typically belongs to several different categories at once, this means that a decoupled representation for each category (“red”, “white” and “car”) has to be learned. This decoupling provides a more efficient representation and higher generalization performance compared to object identification architectures, and it can be achieved using additional metrical adaptation or feature selection methods. However, because exemplars of a category are presented incrementally, considerable changes to the feature weighting and selection can occur. Therefore, for categorization tasks a balance between the stability of knowledge and the correction of wrong category representations must be found, which complicates the learning of such representations compared to identification tasks. Finally, we consider feature weighting and selection methods that make no a priori assumptions to be advantageous for learning arbitrary categories.

To satisfy our requirements for life-long learning we combine an exemplar-based neural network with a category-specific forward feature selection method, where the interactive and life-long learning of both parts is the major novelty of our proposed method. Although our approach is applicable to any kind of category, we concentrate in this paper on a challenging visual categorization task of complex-shaped objects rotated in depth. In the following we discuss related work addressing life-long learning, feature selection, visual categorization and online learning in more detail.

One of the first attempts to approach the “stability–plasticity dilemma” led to the development of the adaptive resonance theory (ART) and especially Fuzzy ARTMAP (Carpenter, Grossberg, Markuzon, Reynolds, & Rosen, 1992). This network architecture is widely accepted but is known to be sensitive to the noise level, the presentation order of the training data and the selection of the vigilance parameter (Polikar, Udpa, Udpa, & Honavar, 2001). This parameter controls the maximal size of the hypercubical receptive field of a single ART node and is therefore crucial for the generalization capability. Furthermore, ART is unsuited for the high-dimensional and sparse feature representations (Kirstein, Wersing, & Körner, 2008) required for our visual categorization task.

Like the ART network family, life-long learning architectures are typically based on exemplar-based learning techniques such as learning vector quantization (LVQ) (Kohonen, 1989) or growing neural gas (GNG) (Fritzke, 1995). Such neural architectures are beneficial for life-long learning because, for a specific input vector, the learning methods modify only small portions of the overall network. Stability is thus easier to achieve than with the multi-layer perceptron (MLP), where all weights are modified at each learning step. Furthermore, the learning in exemplar-based networks is commonly based on a similarity measure (e.g. the Euclidean distance), and the chosen metric has a strong impact on the generalization performance. To relax this dependency, metrical adaptation methods can be used that individually weight the different feature dimensions, as proposed for the generalized relevance learning vector quantization (GRLVQ) algorithm (Hammer & Villmann, 2002).
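The metrical adaptation idea behind GRLVQ can be made concrete with a relevance-weighted distance. The sketch below is our own minimal illustration (function names and toy values are ours): it performs an LVQ1-style attract/repel update under a fixed relevance vector, whereas full GRLVQ additionally adapts the relevances by gradient descent on its cost function:

```python
import numpy as np

def weighted_sq_dist(x, w, lam):
    """Relevance-weighted squared Euclidean distance."""
    return float(np.sum(lam * (x - w) ** 2))

def lvq_step(x, label, prototypes, proto_labels, lam, eta=0.1):
    """One LVQ1-style step under a fixed relevance vector lam: the winning
    prototype is attracted if its label matches, repelled otherwise."""
    dists = [weighted_sq_dist(x, w, lam) for w in prototypes]
    winner = int(np.argmin(dists))
    sign = 1.0 if proto_labels[winner] == label else -1.0
    prototypes[winner] += sign * eta * lam * (x - prototypes[winner])
    return winner

# toy setup: the relevance vector emphasizes feature 0 over feature 1
protos = np.array([[0.0, 0.0], [1.0, 1.0]])
proto_labels = [0, 1]
lam = np.array([0.8, 0.2])
win = lvq_step(np.array([0.9, 0.9]), 1, protos, proto_labels, lam)
```

Because the relevance weights also scale the update, dimensions judged irrelevant for a category barely move its prototypes, which is exactly the per-category “view” on the feature space discussed above.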

A common strategy for life-long learning architectures is the use of a node-specific learning rate combined with an incremental node insertion rule (Furao & Hasegawa, 2006; Hamker, 2001; Kirstein, Wersing, & Körner, 2008). This permits plasticity of newly inserted neurons, while the stability of matured neurons is preserved. The major drawback of these architectures is the inefficient separation of co-occurring categories, because typically the complete feature vector is used to represent the different classes and no assignment of feature vector parts to different classes is considered.
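A node-specific learning rate of this kind can be sketched as a win-count-dependent decay. The class below is a hypothetical, minimal variant; the harmonic decay schedule is our assumption, not the rule used in the cited papers:

```python
import numpy as np

class AgingPrototype:
    """Prototype whose learning rate decays with its number of wins, so a
    newly inserted node stays plastic while a matured node becomes stable.
    The harmonic decay schedule is an illustrative assumption."""

    def __init__(self, w, eta0=0.5):
        self.w = np.asarray(w, dtype=float)
        self.wins = 0
        self.eta0 = eta0

    @property
    def eta(self):
        # fully plastic at birth, increasingly stable with every win
        return self.eta0 / (1 + self.wins)

    def update(self, x):
        self.w += self.eta * (np.asarray(x, dtype=float) - self.w)
        self.wins += 1

node = AgingPrototype([0.0, 0.0])
eta_at_birth = node.eta
node.update([1.0, 1.0])
eta_after_one_win = node.eta
```

The learning rate halves after the first win, so later inputs can no longer drag a consolidated node far from its learned position.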

Other approaches to the “stability–plasticity dilemma” were proposed by Ozawa et al. (2005) and Polikar et al. (2001). Polikar et al. (2001) proposed the “Learn++” approach, which is based on the boosting technique (Schapire, 1990). This method combines several weak classifiers into a so-called strong classifier based on a majority-voting scheme, where the weak classifiers are incrementally added to the network and afterwards kept fixed. “Learn++” can therefore be used for life-long learning tasks, but for more complex learning problems a large number of such weak classifiers is required to represent the categories, which makes the method unsuitable for our desired interactive learning capability. In contrast, Ozawa et al. (2005) proposed storing representative input–output pairs in a long-term memory to stabilize an incrementally learning radial basis function (RBF) network. Their approach also includes a feature selection mechanism based on incremental principal component analysis, but no class-specific feature selection is applied to efficiently separate co-occurring categories.
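The majority-voting combination underlying Learn++-style ensembles can be illustrated in a few lines. This is a generic weighted-vote sketch, not the actual Learn++ algorithm, which also derives the classifier weights from training errors during boosting:

```python
import numpy as np

def weighted_majority_vote(weak_preds, weights):
    """Combine the label predictions of several fixed weak classifiers by a
    weighted majority vote. weak_preds: (n_classifiers, n_samples) labels."""
    weak_preds = np.asarray(weak_preds)
    labels = np.unique(weak_preds)
    votes = np.zeros((len(labels), weak_preds.shape[1]))
    for preds, w in zip(weak_preds, weights):
        for li, lab in enumerate(labels):
            votes[li] += w * (preds == lab)
    return labels[np.argmax(votes, axis=0)]

# three weak classifiers; the third carries more voting weight
combined = weighted_majority_vote([[0, 1, 1], [0, 0, 1], [1, 1, 1]],
                                  [1.0, 1.0, 2.5])
```

On the first sample the heavily weighted third classifier outvotes the other two, which illustrates both the appeal of the scheme and its cost: representing a complex category well requires accumulating many such fixed voters.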

In the context of text categorization, feature selection methods are a common technique for enhancing performance (Yang & Pedersen, 1997), whereas for visual categorization tasks feature selection has received distinctly less attention. One exception is approaches based on boosting (Viola & Jones, 2001), where feature selection is an integrated part of the learning method. Category-specific feature selection is an important part of our categorization approach: commonly, only a small subset of the extracted features is relevant for a specific category, while the other features are irrelevant or can even cause confusion. Furthermore, small category-specific feature subsets are beneficial with respect to computational cost, allowing fast interactive learning. Therefore, a brief overview of different feature selection techniques is given in the following.

There are basically three groups of feature selection methods, namely filter, wrapper and embedded methods (Guyon & Elisseeff, 2003). Filter methods (see Forman (2003) for an overview) are independent of the classifier used and commonly select a subset of features as a pre-processing step. The corresponding feature selection is typically based on some feature ranking method (Furey et al., 2000; Kira & Rendell, 1992), but the training of single-variable classifiers is also used. The second group of feature selection methods are wrapper methods (Kohavi & John, 1997). Similar to the filter approaches, wrapper methods are independent of the underlying recognition architecture, but they use the learning algorithm as a “black box” to rate different feature subsets (e.g. based on the training error). Because the learning method is incorporated to guide the feature selection process and to evaluate the different feature subsets, wrapper methods are considered to select better sets than filter methods (Guyon & Elisseeff, 2003). Wrapper methods can furthermore be categorized into backward and forward selection methods. Backward selection starts with the full set of features and iteratively eliminates irrelevant ones; its major advantage is that it can detect feature combinations very efficiently, enabling good performance even for less class-specific feature sets. In contrast, forward selection methods start with an empty set of features and incrementally add new features. Although forward selection methods require distinctly more class-specific single features, they are faster and thus preferable for interactive learning tasks. The last group of feature selection methods are the so-called embedded methods. Here the feature selection is an integrated part of the recognition architecture and is typically optimized together with the network parameters, so that these methods usually cannot be transferred to other learning approaches. One strategy of this group is to add sparsity constraints to the error function (Perkins, Lacker, & Theiler, 2003), which prune out irrelevant network connections.
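A forward wrapper selection loop of the kind favored here can be sketched as follows. The learner is treated as a black box via a scoring callback; the nearest-centroid scorer, the data and all parameter values are our own toy assumptions:

```python
import numpy as np

def forward_select(X, y, score_fn, max_feats=3):
    """Greedy forward wrapper selection: start with an empty set and add, one
    feature at a time, whichever candidate most improves the score returned
    by the black-box learner score_fn(X_subset, y); stop when nothing helps."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_feats:
        score, f = max((score_fn(X[:, selected + [c]], y), c) for c in remaining)
        if score <= best_score:
            break  # no candidate improves the wrapper score
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected

def centroid_score(Xs, y):
    """Toy black box: training accuracy of a nearest-centroid classifier."""
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = np.linalg.norm(Xs - c1, axis=1) < np.linalg.norm(Xs - c0, axis=1)
    return float(np.mean(pred.astype(int) == y))

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 4))
X[:, 2] += 4.0 * y            # only feature 2 separates the two categories
chosen = forward_select(X, y, centroid_score)
```

Each round costs one learner evaluation per remaining feature, which is why forward selection stays cheap enough for interactive use, while a backward variant would have to start from the full, high-dimensional set.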

In recent years many architectures dealing with categorization tasks have been proposed in the computer vision research field. Such category learning approaches can be partitioned into generative and discriminative models (Fritz, 2008). Generative probabilistic models, as proposed by Fei-Fei, Fergus, and Perona (2003), Fergus, Perona, and Zisserman (2003), Leibe, Leonardis, and Schiele (2004), or Mikolajczyk, Leibe, and Schiele (2006), first model the underlying joint probability P(x, t_c) for each category t_c and all training examples x individually and afterwards use Bayes' theorem to calculate the posterior class probability P(t_c | x) (Bishop, 2006). The advantages of generative models are that expert knowledge can be incorporated as prior information and that those models usually require only a few training examples to reach good categorization performance. In contrast, discriminative models directly learn the mapping from x to t_c based on a decision function Φ(x), or estimate the posterior class probability P(t_c | x) in a single step (Ng & Jordan, 2001). Common approaches in this group of categorization models are based on support vector machines (Heisele et al., 2001), boosting (Opelt et al., 2004; Viola & Jones, 2001) or SNoW (Agarwal, Awan, & Roth, 2004). Such discriminative models tend to achieve better categorization performance than generative models if a large ensemble of training examples is available (Ng & Jordan, 2001).
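The generative route, modeling a class-conditional density and inverting it with Bayes' theorem, can be written down directly for a one-dimensional toy case (the Gaussian class models and category names below are illustrative assumptions, not models from the cited papers):

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, class_models, priors):
    """Generative classification: model P(x | t_c) per category, then apply
    Bayes' theorem, P(t_c | x) = P(x | t_c) P(t_c) / sum_k P(x | t_k) P(t_k).
    class_models maps each category to the (mean, variance) of a 1-D Gaussian."""
    joint = {c: gaussian_pdf(x, *class_models[c]) * priors[c]
             for c in class_models}
    evidence = sum(joint.values())
    return {c: j / evidence for c, j in joint.items()}

# two illustrative categories with equal priors
models = {"car": (0.0, 1.0), "bike": (4.0, 1.0)}
priors = {"car": 0.5, "bike": 0.5}
post = posterior(0.5, models, priors)
```

A discriminative model would instead fit the decision function between the two categories directly, without ever representing P(x | t_c).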

In general, most of these categorization approaches are robust against partial occlusion and scale changes, and are able to deal with cluttered scenes. However, many models have only been demonstrated to work with data sets restricted to canonical views of the categories. Thomas et al. (2006) try to overcome this limitation by training several pose-specific implicit shape models (ISM) (Leibe et al., 2004) for each category. After the training of these ISMs, detected parts from neighboring pose-dependent ISMs are associated by so-called “activation links”, which then allow the detection of categories from many viewpoints. Additionally, categorization architectures are commonly designed for offline usage only, where the required training time is not important; this makes them unsuitable for our desired interactive training. Recent work by Fritz, Kruijff, and Schiele (2007) and Fei-Fei, Fergus, and Perona (2007) addresses this issue by proposing incremental clustering methods, which in principle allow interactive category learning, but these approaches are still restricted to canonical views of the categories.

The development of online and interactive learning systems has become increasingly popular in recent years (Arsenio, 2004; Roth et al., 2006; Steels & Kaplan, 2001; Wersing et al., 2007). Most of these methods have not been applied to categorization tasks, because their learning methods are unsuitable for a more abstract and variable category representation. The work of Skočaj et al. (2007) is of particular interest with respect to online and interactive learning of categories. It enables learning of several simple color and shape categories by selecting the single feature that describes the particular category most consistently. The corresponding category is then represented by the mean and variance of this selected feature (Skočaj et al., 2007) or, more recently, by an incremental kernel density estimate using mixtures of Gaussians (Skočaj, Kristan, & Leonardis, 2008). Although this architecture shares some targets with our proposed learning method, the restriction to a single feature only allows the representation of categories with little appearance variation, essentially because more complex categories typically require several features to adequately represent all category instances. To avoid this limitation we propose a forward feature selection process that incrementally selects an arbitrary number of features as they are required for the representation of a particular category.
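The single-feature scheme attributed to Skočaj et al. (2007) can be approximated by a small sketch: pick the feature that is most consistent over the category's examples and store its mean and variance. The lowest-variance selection rule and the 3-sigma membership test are our simplifying assumptions:

```python
import numpy as np

def fit_single_feature_category(X):
    """Represent a category by the mean and variance of the single feature
    that varies least over the category's training examples."""
    variances = X.var(axis=0)
    f = int(np.argmin(variances))
    return f, float(X[:, f].mean()), float(variances[f])

def belongs(x, model, n_sigma=3.0):
    """Membership test: the selected feature must lie within n_sigma standard
    deviations of the stored mean (the 3-sigma rule is our assumption)."""
    f, mu, var = model
    return abs(x[f] - mu) <= n_sigma * (var ** 0.5)

# toy "red" category: feature 0 (e.g. hue) is consistent, feature 1 varies
X = np.array([[0.90, 0.1], [0.92, 0.7], [0.88, 0.4], [0.91, 0.9]])
model = fit_single_feature_category(X)
```

The sketch also exposes the limitation discussed above: a category whose instances are consistent in no single feature, only in combinations of several, cannot be captured by one (mean, variance) pair.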

The manuscript is structured as follows: in Section 2 we introduce our category learning architecture, which enables interactive and life-long learning of arbitrary categories. In Section 3 we describe the feature extraction methods used to extract shape and color features. In Section 4 we show the application of our proposed learning method to a visual categorization task. Finally, we discuss the results and related work in Section 5 and give the pseudocode notation of our category learning method in the Appendix.

Section snippets

Incremental and life-long learning of categories

Our memory architecture is based on an exemplar-based incremental learning network combined with a forward feature selection method to allow life-long learning of arbitrary categories. Both parts are optimized together to find a balance between insertion of features and allocation of representation nodes, while using as little resources as possible. This is crucial for interactive learning with respect to the required computational costs. In the following we refer to this architecture as…

Feature extraction

We investigate the learning capabilities of our method based on a visual categorization task. Three feature extraction methods are used to provide shape and color information as illustrated in Fig. 4. Although features from different visual modalities are extracted, this qualitative separation of the extracted features is not given to the learning system as a priori information. For our categorization task we are particularly interested in discovering the structure of the categories from the…

Experimental results

In the following section our proposed cLVQ life-long learning architecture is compared with a single-layer perceptron (SLP), an incremental support vector machine (SVM) (Martinetz, Labusch, & Schneegaß, 2009) and two modified cLVQ versions, cGRLVQ and cLVQ. The comparison of the exemplar-based networks is done to measure the effect of the proposed feature weighting and feature selection method with respect to the categorization performance, number of allocated resources and required training…

Discussion

We have proposed an architecture for fast interactive life-long learning of arbitrary categories that is able to perform an incremental allocation of cLVQ nodes, automatic feature selection and feature weighting. This automatic control of the architecture complexity is crucial for interactive and life-long learning, where an exhaustive parameter search is not feasible. Additionally we use the proposed wrapper method for incremental feature selection, because the representation of categories…

Acknowledgment

The authors thank Stephan Hasler for providing the visualization for the parts-based features.

References (53)

  • Arsenio, A. M. (2004). Developmental learning on a humanoid robot. In Proc. international joint conference on neural...
  • Bagnall, R. G. (1990). Lifelong education: the institutionalisation of an illiberal and regressive ideology? Educational Philosophy and Theory.
  • Bishop, C. M. (2006). Pattern recognition and machine learning.
  • Carpenter, G. A., et al. (1992). Fuzzy ARTMAP: a neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks.
  • Cortes, C., et al. (1995). Support-vector networks. Machine Learning.
  • Fei-Fei, L., Fergus, R., & Perona, P. (2003). A Bayesian approach to unsupervised one-shot learning of object...
  • Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. In...
  • Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research.
  • Fritz, M. (2008). Modeling, representation and learning of visual categories. Ph.D. thesis. Technical University of...
  • Fritz, M., Kruijff, G.-J. M., & Schiele, B. (2007). Cross-modal learning of visual categories using different levels of...
  • Fritz, M., Leibe, B., Caputo, B., & Schiele, B. (2005). Integrating representative and discriminative models for object...
  • Fritzke, B. (1995). A growing neural gas network learns topologies.
  • Furey, T. S., et al. (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics.
  • Guyon, I., et al. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research.
  • Harris, C., & Stephens, M. (1988). A combined corner and edge detector. In Proc. Alvey vision conference (pp....
  • Hasler, S., Wersing, H., & Körner, E. (2007). A comparison of features in parts-based object recognition hierarchies....