Information Fusion

Volume 44, November 2018, Pages 78-96

A practical tutorial on autoencoders for nonlinear feature fusion: Taxonomy, models, software and guidelines

https://doi.org/10.1016/j.inffus.2017.12.007

Highlights

  • Autoencoders are a growing family of tools for nonlinear feature fusion.

  • A taxonomy of these methods is proposed, detailing each one of them.

  • Comparisons to other feature fusion techniques and applications are studied.

  • Guidelines on autoencoder design and example results are provided.

  • Available software for building autoencoders is summarized.

Abstract

Many of the existing machine learning algorithms, both supervised and unsupervised, depend on the quality of the input characteristics to generate a good model. The number of these variables is also important, since performance tends to decline as the input dimensionality increases, hence the interest in feature fusion techniques, which are able to produce feature sets that are more compact and higher level. A plethora of procedures to fuse original variables into new ones has been developed in the past decades. The most basic ones use linear combinations of the original variables, such as PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis), while others find manifold embeddings of lower dimensionality based on nonlinear combinations, such as Isomap or LLE (Locally Linear Embedding).

More recently, autoencoders (AEs) have emerged as an alternative to manifold learning for conducting nonlinear feature fusion. Dozens of AE models have been proposed lately, each with its own specific traits. Although many of them can be used to generate reduced feature sets through the fusion of the original ones, there are also AEs designed with other applications in mind.

The goal of this paper is to provide the reader with a broad view of what an AE is, how AEs are used for feature fusion, a taxonomy gathering a broad range of models, and how they relate to other classical techniques. In addition, a set of didactic guidelines on how to choose the proper AE for a given task is supplied, together with a discussion of the available software tools. Finally, two case studies illustrate the usage of AEs with datasets of handwritten digits and breast cancer.

Introduction

The development of the first machine learning techniques dates back to the middle of the 20th century, supported mainly by previously established statistical methods. By then, early research on how to emulate the functioning of the human brain through a machine was underway. The McCulloch and Pitts cell [1] was proposed back in 1943, and the Hebb rule [2] that the Perceptron [3] is founded on was stated in 1949. Therefore, it is not surprising that artificial neural networks (ANNs), especially since the backpropagation algorithm was rediscovered in 1986 by Rumelhart, Hinton and Williams [4], have become one of the essential machine learning models.

ANNs have been applied to several machine learning tasks, mostly following a supervised approach. As was mathematically demonstrated [5] in 1989, a multilayer feedforward ANN (MLP) is a universal approximator, hence its usefulness in classification and regression problems. However, a proper algorithm able to train an MLP with several hidden layers was not available, due to the vanishing gradient [6] problem. The gradient descent algorithm, first used for convolutional neural networks [7] and later for unsupervised learning [8], was one of the foundations of modern deep learning [9] methods.

Under the umbrella of deep learning, multiple techniques have emerged and evolved. These include DBNs (Deep Belief Networks) [10], CNNs (Convolutional Neural Networks) [11], RNNs (Recurrent Neural Networks) [12] as well as LSTMs (Long Short-Term Memory) [13] or AEs (autoencoders).

The most common architecture in unsupervised deep learning is the encoder-decoder [14]. Some techniques lack either the encoder or the decoder, and must resort to costly optimization algorithms to find a code, or to sampling methods to obtain a reconstruction, respectively. Unlike those, AEs include both parts in their structure, with the aim of making training easier and faster. In general terms, AEs are ANNs which produce encodings of the input data and are trained so that the corresponding decodings resemble the inputs as closely as possible.

AEs were first introduced [15] as a way of conducting pretraining in ANNs. Although mainly developed within the context of deep learning, not all AE models are necessarily ANNs with multiple hidden layers. As explained below, an AE can be a deep ANN, e.g. in the stacked AE configuration, or it can be a shallow ANN with a single hidden layer. See Section 2 for a more detailed introduction to AEs.
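To make the shallow case concrete, the following is a minimal sketch of a single-hidden-layer AE written with Keras. It is only an illustration under assumed settings: the 784-dimensional input (a flattened 28x28 image), the 32-unit code, and the chosen activations and loss are placeholders, not values prescribed by this paper.

    # Minimal single-hidden-layer autoencoder (sketch; sizes are assumptions).
    from tensorflow import keras
    from tensorflow.keras import layers

    input_dim = 784   # e.g. a flattened 28x28 image (assumed)
    code_dim = 32     # dimensionality of the learned encoding (assumed)

    inputs = keras.Input(shape=(input_dim,))
    code = layers.Dense(code_dim, activation="relu")(inputs)        # encoder
    outputs = layers.Dense(input_dim, activation="sigmoid")(code)   # decoder

    autoencoder = keras.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

    # Training is unsupervised: the network learns to reconstruct its own input.
    # autoencoder.fit(x_train, x_train, epochs=20, batch_size=256)

Note that the target passed during training is the input itself, which is what distinguishes this unsupervised setup from an ordinary supervised MLP.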

While many machine learning algorithms are able to work with raw input features, it is also true that, for the most part, their behavior degrades as the number of variables grows. This is mainly due to the problem known as the curse of dimensionality [16], which is also the justification for a field of study called feature engineering. Engineering of features started as a manual process, relying on an expert able to decide by observation which variables were better for the task at hand. Notwithstanding, automated feature selection [17] methods were soon available.

Feature selection is only one of the approaches to reducing input space dimensionality. Selecting the best subset of input variables is an NP-hard combinatorial problem. Moreover, feature selection techniques usually evaluate each variable independently, but it is known that variables which separately provide no useful information may do so when used together. For this reason other alternatives, primarily feature construction or extraction [18], emerged. In addition to these two denominations, feature selection and feature extraction, several other terms are frequently used when dealing with dimensionality reduction. The most common are as follows:

Feature engineering: This is probably the broadest term, encompassing most of the others. Feature engineering can be carried out by manual or automated means, and can be based on the selection of original characteristics or the construction of new ones through transformations.

Feature learning: This is the denomination used when the process of selecting among the existing features or constructing new ones is automated. Thus, we can perform both feature selection and feature extraction through algorithms such as the ones mentioned below. Despite the use of automatic methods, an expert is sometimes still needed to decide which algorithm is the most appropriate for the data traits at hand, to determine the optimum number of variables to extract, etc.

Representation learning: Although this term is sometimes used interchangeably with the previous one, it mostly refers to the use of ANNs to fully automate the feature generation process. Applying ANNs to learn distributed representations of concepts was proposed by Hinton in [21]. Today, learning representations is mainly linked to processing natural language, images and other signals with specific kinds of ANNs, such as CNNs [11].

Feature selection: Picking the most informative subset of variables started as a manual process, usually in the charge of domain experts. It can be considered a special case of feature weighting, as discussed in [23]. Although in certain fields the expert is still an important factor, nowadays the selection of variables is usually carried out by computer algorithms. These can operate in a supervised or unsupervised manner. The former approach usually relies on correlation or mutual information between input and output variables [24], [25], while the latter tends to avoid redundancy among features [26]. Feature selection is, overall, an essential strategy in the data preprocessing [22], [27] phase.

Feature extraction: The goal of this technique is to find a better data representation for the machine learning algorithm to be used, since the original representation might not be the best one. It can be approached both manually, in which case the term feature construction is commonly used, and automatically. Some elementary techniques such as normalization, discretization or scaling of variables, as well as basic transformations applied to certain data types,1 are also considered part of this field. New features can be extracted by finding linear combinations of the original ones, as in PCA (Principal Component Analysis) [29], [30] or LDA (Linear Discriminant Analysis) [31], as well as nonlinear combinations, as in Kernel PCA [32] or Isomap [33]. The latter are usually known as manifold learning [34] algorithms, and fall within the scope of nonlinear dimensionality reduction techniques [35]. Feature extraction methods can also be categorized as supervised (e.g. LDA) or unsupervised (e.g. PCA); a brief code sketch contrasting selection and extraction follows this list.

Feature fusion: This more recent term has emerged with the growth of multimedia data processing by machine learning algorithms, especially images, text and sound. As stated in [36], feature fusion methods aim to combine variables to remove redundant and irrelevant information. Manifold learning algorithms, and especially those based on ANNs, fall into this category.
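As a quick illustration of the difference between selecting original variables and extracting new ones, the following scikit-learn sketch applies both families of methods to a generic labeled dataset. The breast cancer data and the choice of ten output variables are placeholder assumptions for illustration only, not the experimental setup of the case studies.

    # Sketch: feature selection vs. feature extraction (assumed dataset and sizes).
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.manifold import Isomap

    X, y = load_breast_cancer(return_X_y=True)

    # Feature selection: keep the 10 original variables with the highest
    # mutual information with respect to the class labels (supervised).
    X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

    # Feature extraction: build 10 new variables as linear (PCA) or nonlinear
    # (Kernel PCA, Isomap) combinations of the originals (all unsupervised).
    X_pca = PCA(n_components=10).fit_transform(X)
    X_kpca = KernelPCA(n_components=10, kernel="rbf").fit_transform(X)
    X_iso = Isomap(n_components=10).fit_transform(X)

Selection preserves a subset of the original columns, while extraction replaces them with combinations of the originals; the latter is the setting in which AEs operate.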

Among the existing AE models there are several that are useful for performing feature fusion. This is the aim of the most basic one, which can be extended via several regularizations and adaptations to different kinds of data. These options will be explored throughout the present work, whose aim is to provide the reader with a didactic review of the inner workings of these distinct AE models and the ways they can be used to learn new representations of data.

The following are the main contributions of this paper:

  • A proposal of a global taxonomy of AEs dedicated to feature fusion.

  • Descriptions of these AE models including the necessary mathematical formulation and explanations.

  • A theoretical comparison between AEs and other popular feature fusion techniques.

  • A comprehensive review of other AE models as well as their applications.

  • A set of guidelines on how to design an AE, and several examples on how an AE may behave when its architecture and parameters are altered.

  • A summary of the available software for creating deep learning models and specifically AEs.

Additionally, we provide a case study with the well-known MNIST dataset [37], which gives the reader some intuition on the results provided by an AE with different architectures and parameters. The scripts to reproduce these experiments are provided in a repository, and their use is further described in Section 6.

The rest of this paper is structured as follows. The foundations and essential aspects of AEs are introduced in Section 2, including the proposal of a global taxonomy. Section 3 is devoted to thoroughly describing the AE models able to operate as feature fusion mechanisms and several models which have further applications. The relationship between these AE models and other feature fusion methods is portrayed in Section 4, while applications of different kinds of AEs are described in Section 5. Section 6 provides a set of guidelines on how to design an AE for the task at hand, followed by the software pieces where it can be implemented, as well as the case study with MNIST data. Concluding remarks can be found in Section 7. Lastly, an Appendix briefly describes the datasets used through the present work.

Section snippets

Autoencoder essentials

AEs are ANNs2 with a symmetric structure, where the middle layer represents an encoding of the input data. AEs are trained to reconstruct their input onto the output layer, while verifying certain restrictions which prevent them from simply copying the data along the network. Although the term autoencoder is the most popular nowadays, they were also known as autoassociative neural…

Autoencoders for feature fusion

As has already been established, AEs are tools originally designed for finding useful representations of data by learning nonlinear ways to combine their features. Usually, this leads to a lower-dimensional space, but different modifications can be applied in order to discover features which satisfy certain requirements. All of these possibilities are discussed in this section, which begins by establishing the foundations of the most basic AE, and later encompasses several diverse variants…
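As an informal illustration of how such modifications look in code, the sketch below adds an L1 activity penalty on the encoding layer (one common way to encourage sparse codes) and keeps a separate handle on the encoder so the learned features can be reused. The layer sizes, the penalty weight and the use of Keras are assumptions made for the example, not settings taken from the paper.

    # Sketch: a sparsity-regularized AE whose encoder is reused for feature fusion.
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    input_dim, code_dim = 784, 32                      # assumed sizes

    inputs = keras.Input(shape=(input_dim,))
    code = layers.Dense(code_dim, activation="relu",
                        activity_regularizer=regularizers.l1(1e-5))(inputs)
    outputs = layers.Dense(input_dim, activation="sigmoid")(code)

    autoencoder = keras.Model(inputs, outputs)         # trained to reconstruct
    encoder = keras.Model(inputs, code)                # shares the trained weights
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

    # After fitting on x_train (targets = inputs), the fused features are simply
    # z_train = encoder.predict(x_train), and can feed any downstream
    # classifier, regressor or clustering method.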

Comparison to other feature fusion techniques

AEs are only a few of a very diverse range of feature fusion methods [36]. These can be grouped according to whether they perform supervised or unsupervised learning. In the first case, they are usually known as distance metric learning techniques [95]. Some adversarial AEs, as well as AEs preserving class neighborhood structure [96], can be sorted into this category, since they are able to make use of the class information. However, this section focuses on the latter case, since most AEs are…

Applications in feature learning and beyond

The ability of AEs to perform feature fusion is useful for easing the learning of predictive models, improving classification and regression results, and also for facilitating unsupervised tasks that are harder to conduct in high-dimensional spaces, such as clustering. Some specific cases of these applications are portrayed within the following subsections, including:

  • Classification: reducing or transforming the training data in order to achieve better performance in a classifier.

  • Data…

Guidelines, software and examples on autoencoder design

This section attempts to guide the user through the process of designing an AE for a given problem, reviewing the range of choices available and their utility, then summarizing the available software for deep learning and outlining the steps needed to implement an AE. It also provides a case study with the MNIST dataset where the impact of several AE parameters is explored, as well as different AE types with identical parameter settings.
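To make this kind of design decision tangible, the following sketch builds a deeper, symmetric AE for MNIST-like 784-dimensional inputs. Every concrete value (layer widths, bottleneck size, activations, optimizer, epochs) is an illustrative choice of the sort this section discusses, not a value taken from the case study itself.

    # Sketch: a deeper symmetric AE; all hyperparameters are illustrative choices.
    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = keras.Input(shape=(784,))
    h = layers.Dense(256, activation="relu")(inputs)
    h = layers.Dense(64, activation="relu")(h)
    code = layers.Dense(16, activation="relu", name="code")(h)   # bottleneck
    h = layers.Dense(64, activation="relu")(code)
    h = layers.Dense(256, activation="relu")(h)
    outputs = layers.Dense(784, activation="sigmoid")(h)

    deep_ae = keras.Model(inputs, outputs)
    deep_ae.compile(optimizer="adam", loss="binary_crossentropy")
    # deep_ae.fit(x_train, x_train, epochs=30, batch_size=256,
    #             validation_data=(x_test, x_test))

Varying the bottleneck width, the encoder depth, the activation functions or the loss is the kind of exploration the MNIST case study carries out.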

Conclusions

As Pedro Domingos states in his famous tutorial [19], and as can be seen from the large number of publications on the subject, feature engineering is the key to obtaining good machine learning models, able to generalize and provide decent performance. This process consists of choosing the most relevant subset of features or combining some of them to create new ones. Automated fusion of features, especially when performed by nonlinear techniques, has proven to be very effective. Neural…

Acknowledgments

This work is supported by the Spanish National Research Projects TIN2015-68454-R and TIN2014-57251-P, and Project BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016.

References (160)

  • D.E. Rumelhart et al.

    Learning representations by back-propagating errors

    Nature

    (1986)
  • S. Hochreiter

    The vanishing gradient problem during learning recurrent neural nets and problem solutions

    Int. J. Uncertainty Fuzziness Knowl. Based Syst.

    (1998)
  • Y. LeCun et al.

    Backpropagation applied to handwritten zip code recognition

    Neural Comput.

    (1989)
  • G.E. Hinton et al.

    A fast learning algorithm for deep belief nets

    Neural Comput.

    (2006)
  • G.E. Hinton

    Deep belief networks

    Scholarpedia

    (2009)
  • Y. LeCun et al.

    The handbook of brain theory and neural networks

    Ch. Convolutional Networks for Images, Speech, and Time Series

    (1998)
  • R.J. Williams et al.

    A learning algorithm for continually running fully recurrent neural networks

    Neural Comput.

    (1989)
  • S. Hochreiter et al.

    Long short-term memory

    Neural Comput.

    (1997)
  • M. Ranzato et al.

    A unified energy-based framework for unsupervised learning

  • D.H. Ballard

    Modular learning in neural networks

    Proceedings of the Sixth National Conference on Artificial Intelligence - Volume 1, AAAI’87, AAAI Press

    (1987)
  • R. Bellman

    Dynamic Programming

    (1957)
  • H. Liu et al.

    Feature Extraction, Construction and Selection: A Data Mining Perspective

    (1998)
  • P. Domingos

    A few useful things to know about machine learning

    Commun. ACM

    (2012)
  • Y. Bengio et al.

    Representation learning: a review and new perspectives

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • G.E. Hinton

    Learning distributed representations of concepts

    Proceedings of the Eighth Annual Conference of the Cognitive Science Society

    (1986)
  • S. García et al.

    Data Preprocessing in Data Mining

    (2015)
  • D. Wettschereck et al.

    A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms

    (1997)
  • M.A. Hall

Correlation-based feature selection for machine learning, Ph.D. thesis

    (1999)
  • H. Peng et al.

    Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2005)
  • P. Mitra et al.

    Unsupervised feature selection using feature similarity

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2002)
  • I. Guyon et al.

    An Introduction to Feature Extraction

    (2006)
  • K. Pearson

    LIII. On lines and planes of closest fit to systems of points in space

    Philosoph. Mag. Series

    (1901)
  • H. Hotelling

    Analysis of a complex of statistical variables into principal components

    J. Edu. Psychol.

    (1933)
  • R.A. Fisher

    The statistical utilization of multiple measurements

    Ann. Hum. Genet.

    (1938)
  • B. Schölkopf et al.

    Nonlinear component analysis as a kernel eigenvalue problem

    Neural Comput.

    (1998)
  • J.B. Tenenbaum et al.

    A global geometric framework for nonlinear dimensionality reduction

    Science

    (2000)
  • L. Cayton

    Algorithms for Manifold Learning

    Technical report

    (2005)
  • J.A. Lee et al.

    Nonlinear Dimensionality Reduction

    (2007)
  • U.G. Mangai et al.

    A survey of decision fusion and feature fusion strategies for pattern classification

    IETE Tech. Rev.

    (2010)
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proc. IEEE

    (1998)
  • M.A. Kramer

    Nonlinear principal component analysis using autoassociative neural networks

    AlChE J.

    (1991)
  • H. Schwenk et al.

    Training methods for adaptive boosting of neural networks

    Advances in Neural Information Processing Systems

    (1998)
  • R. Hecht-Nielsen

    Replicator neural networks for universal optimal source coding

    Science

    (1995)
  • D. Chicco et al.

    Deep autoencoder neural networks for gene ontology annotation predictions

    Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM

    (2014)
  • H. Kamyshanska et al.

    On autoencoder scoring

  • L. Deng et al.

    Binary coding of speech spectrograms using a deep auto-encoder

    Eleventh Annual Conference of the International Speech Communication Association

    (2010)
  • P. Baldi

    Autoencoders, unsupervised learning, and deep architectures

    Proceedings of ICML Workshop on Unsupervised and Transfer Learning

    (2012)
  • D.E. Knuth

    Two notes on notation

    Am. Math. Monthly

    (1992)
  • X. Glorot et al.

    Domain adaptation for large-scale sentiment classification: a deep learning approach

    Proceedings of the 28th International Conference on Machine Learning (ICML-11)

    (2011)
  • Ç. Gülçehre et al.

    Knowledge matters: importance of prior information for optimization

    J. Mach. Learn. Res.

    (2016)