Information Fusion

Volume 44, November 2018, Pages 78-96

A practical tutorial on autoencoders for nonlinear feature fusion: Taxonomy, models, software and guidelines

https://doi.org/10.1016/j.inffus.2017.12.007

Highlights

  • Autoencoders are a growing family of tools for nonlinear feature fusion.

  • A taxonomy of these methods is proposed, detailing each one of them.

  • Comparisons to other feature fusion techniques and applications are studied.

  • Guidelines on autoencoder design and example results are provided.

  • Available software for building autoencoders is summarized.

Abstract

Many of the existing machine learning algorithms, both supervised and unsupervised, depend on the quality of the input characteristics to generate a good model. The number of these variables is also important, since performance tends to decline as the input dimensionality increases, hence the interest in feature fusion techniques, which are able to produce feature sets that are more compact and higher level. A plethora of procedures to fuse original variables into new ones has been developed in the past decades. The most basic ones use linear combinations of the original variables, such as PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis), while others find manifold embeddings of lower dimensionality based on nonlinear combinations, such as Isomap or LLE (Locally Linear Embedding).

More recently, autoencoders (AEs) have emerged as an alternative to manifold learning for conducting nonlinear feature fusion. Dozens of AE models have been proposed lately, each with its own specific traits. Although many of them can be used to generate reduced feature sets through the fusion of the original ones, there are also AEs designed with other applications in mind.

The goal of this paper is to provide the reader with a broad view of what an AE is, how AEs are used for feature fusion, a taxonomy gathering a broad range of models, and how they relate to other classical techniques. In addition, a set of didactic guidelines on how to choose the proper AE for a given task is supplied, together with a discussion of the available software tools. Finally, two case studies illustrate the usage of AEs with datasets of handwritten digits and breast cancer.

Introduction

The development of the first machine learning techniques dates back to the middle of the 20th century, supported mainly by previously established statistical methods. By then, early research on how to emulate the functioning of the human brain through a machine was underway. The McCulloch and Pitts cell [1] was proposed back in 1943, and the Hebb rule [2] that the Perceptron [3] is founded on was stated in 1949. Therefore, it is not surprising that artificial neural networks (ANNs), especially since the backpropagation algorithm was rediscovered in 1986 by Rumelhart, Hinton and Williams [4], have become one of the essential machine learning models.

ANNs have been applied to several machine learning tasks, mostly following a supervised approach. As was mathematically demonstrated [5] in 1989, a multilayer feedforward ANN (MLP) is a universal approximator, hence its usefulness in classification and regression problems. However, a proper algorithm able to train an MLP with several hidden layers was not available, due to the vanishing gradient [6] problem. The gradient descent algorithm, first used for convolutional neural networks [7] and later for unsupervised learning [8], was one of the foundations of modern deep learning [9] methods.

Under the umbrella of deep learning, multiple techniques have emerged and evolved. These include DBNs (Deep Belief Networks) [10], CNNs (Convolutional Neural Networks) [11], RNNs (Recurrent Neural Networks) [12] as well as LSTMs (Long Short-Term Memory) [13] or AEs (autoencoders).

The most common architecture in unsupervised deep learning is the encoder-decoder [14]. Some techniques lack either the encoder or the decoder, and must resort to costly optimization algorithms to find a code, or to sampling methods to obtain a reconstruction, respectively. Unlike those, AEs include both parts in their structure, with the aim of making training easier and faster. In general terms, AEs are ANNs which produce encodings of the input data and are trained so that the corresponding decodings resemble the inputs as closely as possible.

AEs were first introduced [15] as a way of conducting pretraining in ANNs. Although mainly developed within the context of deep learning, not all AE models are necessarily ANNs with multiple hidden layers. As explained below, an AE can be a deep ANN, e.g. in the stacked AE configuration, or it can be a shallow ANN with a single hidden layer. See Section 2 for a more detailed introduction to AEs.
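To make the shallow case concrete, the following is a minimal sketch of a single-hidden-layer AE written with Keras. It is only an illustration under assumed settings: the 784-dimensional input (a flattened 28x28 image), the 32-unit code, and the chosen activations and loss are placeholders, not values prescribed by this paper.

    # Minimal single-hidden-layer autoencoder (sketch; sizes are assumptions).
    from tensorflow import keras
    from tensorflow.keras import layers

    input_dim = 784   # e.g. a flattened 28x28 image (assumed)
    code_dim = 32     # dimensionality of the learned encoding (assumed)

    inputs = keras.Input(shape=(input_dim,))
    code = layers.Dense(code_dim, activation="relu")(inputs)        # encoder
    outputs = layers.Dense(input_dim, activation="sigmoid")(code)   # decoder

    autoencoder = keras.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

    # Training is unsupervised: the network learns to reconstruct its own input.
    # autoencoder.fit(x_train, x_train, epochs=20, batch_size=256)

Note that the target passed during training is the input itself, which is what distinguishes this unsupervised setup from an ordinary supervised MLP.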

While many machine learning algorithms are able to work with raw input features, it is also true that, for the most part, their behavior degrades as the number of variables grows. This is mainly due to the problem known as the curse of dimensionality [16], which is also the justification for a field of study called feature engineering. Engineering of features started as a manual process, relying on an expert able to decide by observation which variables were better for the task at hand. Notwithstanding, automated feature selection [17] methods were soon available.

Feature selection is only one of the approaches to reducing input space dimensionality. Selecting the best subset of input variables is an NP-hard combinatorial problem. Moreover, feature selection techniques usually evaluate each variable independently, but it is known that variables which separately provide no useful information may do so when used together. For this reason other alternatives, primarily feature construction or extraction [18], emerged. In addition to these two denominations, feature selection and feature extraction, several other terms are frequently used when dealing with dimensionality reduction. The most common are as follows:

Feature engineering: This is probably the broadest term, encompassing most of the others. Feature engineering can be carried out by manual or automated means, and can be based on the selection of original characteristics or the construction of new ones through transformations.

Feature learning: This is the denomination used when the process of selecting among the existing features or constructing new ones is automated. Thus, we can perform both feature selection and feature extraction through algorithms such as the ones mentioned below. Despite the use of automatic methods, an expert is sometimes still needed to decide which algorithm is the most appropriate for the data traits at hand, to determine the optimum number of variables to extract, etc.

Representation learning: Although this term is sometimes used interchangeably with the previous one, it mostly refers to the use of ANNs to fully automate the feature generation process. Applying ANNs to learn distributed representations of concepts was proposed by Hinton in [21]. Today, learning representations is mainly linked to processing natural language, images and other signals with specific kinds of ANNs, such as CNNs [11].

Feature selection: Picking the most informative subset of variables started as a manual process, usually in the charge of domain experts. It can be considered a special case of feature weighting, as discussed in [23]. Although in certain fields the expert is still an important factor, nowadays the selection of variables is usually carried out by computer algorithms. These can operate in a supervised or unsupervised manner. The former approach usually relies on correlation or mutual information between input and output variables [24], [25], while the latter tends to avoid redundancy among features [26]. Feature selection is, overall, an essential strategy in the data preprocessing [22], [27] phase.

Feature extraction: The goal of this technique is to find a better data representation for the machine learning algorithm to be used, since the original representation might not be the best one. It can be approached both manually, in which case the term feature construction is commonly used, and automatically. Some elementary techniques such as normalization, discretization or scaling of variables, as well as basic transformations applied to certain data types,1 are also considered part of this field. New features can be extracted by finding linear combinations of the original ones, as in PCA (Principal Component Analysis) [29], [30] or LDA (Linear Discriminant Analysis) [31], as well as nonlinear combinations, as in Kernel PCA [32] or Isomap [33]. The latter are usually known as manifold learning [34] algorithms, and fall within the scope of nonlinear dimensionality reduction techniques [35]. Feature extraction methods can also be categorized as supervised (e.g. LDA) or unsupervised (e.g. PCA); a brief code sketch contrasting selection and extraction follows this list.

Feature fusion: This more recent term has emerged with the growth of multimedia data processing by machine learning algorithms, especially images, text and sound. As stated in [36], feature fusion methods aim to combine variables to remove redundant and irrelevant information. Manifold learning algorithms, and especially those based on ANNs, fall into this category.
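As a quick illustration of the difference between selecting original variables and extracting new ones, the following scikit-learn sketch applies both families of methods to a generic labeled dataset. The breast cancer data and the choice of ten output variables are placeholder assumptions for illustration only, not the experimental setup of the case studies.

    # Sketch: feature selection vs. feature extraction (assumed dataset and sizes).
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.manifold import Isomap

    X, y = load_breast_cancer(return_X_y=True)

    # Feature selection: keep the 10 original variables with the highest
    # mutual information with respect to the class labels (supervised).
    X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

    # Feature extraction: build 10 new variables as linear (PCA) or nonlinear
    # (Kernel PCA, Isomap) combinations of the originals (all unsupervised).
    X_pca = PCA(n_components=10).fit_transform(X)
    X_kpca = KernelPCA(n_components=10, kernel="rbf").fit_transform(X)
    X_iso = Isomap(n_components=10).fit_transform(X)

Selection preserves a subset of the original columns, while extraction replaces them with combinations of the originals; the latter is the setting in which AEs operate.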

Among the existing AE models there are several that are useful for performing feature fusion. This is the aim of the most basic one, which can be extended via several regularizations and adaptations to different kinds of data. These options will be explored throughout the present work, whose aim is to provide the reader with a didactic review of the inner workings of these distinct AE models and the ways they can be used to learn new representations of data.

The following are the main contributions of this paper:

  • A proposal of a global taxonomy of AEs dedicated to feature fusion.

  • Descriptions of these AE models including the necessary mathematical formulation and explanations.

  • A theoretical comparison between AEs and other popular feature fusion techniques.

  • A comprehensive review of other AE models as well as their applications.

  • A set of guidelines on how to design an AE, and several examples on how an AE may behave when its architecture and parameters are altered.

  • A summary of the available software for creating deep learning models and specifically AEs.

Additionally, we provide a case study with the well-known MNIST dataset [37], which gives the reader some intuition on the results provided by an AE with different architectures and parameters. The scripts to reproduce these experiments are provided in a repository, and their use is further described in Section 6.

The rest of this paper is structured as follows. The foundations and essential aspects of AEs are introduced in Section 2, including the proposal of a global taxonomy. Section 3 is devoted to thoroughly describing the AE models able to operate as feature fusion mechanisms and several models which have further applications. The relationship between these AE models and other feature fusion methods is portrayed in Section 4, while applications of different kinds of AEs are described in Section 5. Section 6 provides a set of guidelines on how to design an AE for the task at hand, followed by the software pieces where it can be implemented, as well as the case study with MNIST data. Concluding remarks can be found in Section 7. Lastly, an Appendix briefly describes the datasets used through the present work.

Section snippets

Autoencoder essentials

AEs are ANNs2 with a symmetric structure, where the middle layer represents an encoding of the input data. AEs are trained to reconstruct their input onto the output layer, while verifying certain restrictions which prevent them from simply copying the data along the network. Although the term autoencoder is the most popular nowadays, they were also known as autoassociative neural…

Autoencoders for feature fusion

As has already been established, AEs are tools originally designed for finding useful representations of data by learning nonlinear ways to combine their features. Usually, this leads to a lower-dimensional space, but different modifications can be applied in order to discover features which satisfy certain requirements. All of these possibilities are discussed in this section, which begins by establishing the foundations of the most basic AE, and later encompasses several diverse variants…
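As an informal illustration of how such modifications look in code, the sketch below adds an L1 activity penalty on the encoding layer (one common way to encourage sparse codes) and keeps a separate handle on the encoder so the learned features can be reused. The layer sizes, the penalty weight and the use of Keras are assumptions made for the example, not settings taken from the paper.

    # Sketch: a sparsity-regularized AE whose encoder is reused for feature fusion.
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    input_dim, code_dim = 784, 32                      # assumed sizes

    inputs = keras.Input(shape=(input_dim,))
    code = layers.Dense(code_dim, activation="relu",
                        activity_regularizer=regularizers.l1(1e-5))(inputs)
    outputs = layers.Dense(input_dim, activation="sigmoid")(code)

    autoencoder = keras.Model(inputs, outputs)         # trained to reconstruct
    encoder = keras.Model(inputs, code)                # shares the trained weights
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

    # After fitting on x_train (targets = inputs), the fused features are simply
    # z_train = encoder.predict(x_train), and can feed any downstream
    # classifier, regressor or clustering method.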

Comparison to other feature fusion techniques

AEs are only a few of a very diverse range of feature fusion methods [36]. These can be grouped according to whether they perform supervised or unsupervised learning. In the first case, they are usually known as distance metric learning techniques [95]. Some adversarial AEs, as well as AEs preserving class neighborhood structure [96], can be sorted into this category, since they are able to make use of the class information. However, this section focuses on the latter case, since most AEs are…

Applications in feature learning and beyond

The ability of AEs to perform feature fusion is useful for easing the learning of predictive models, improving classification and regression results, and also for facilitating unsupervised tasks that are harder to conduct in high-dimensional spaces, such as clustering. Some specific cases of these applications are portrayed within the following subsections, including:

  • Classification: reducing or transforming the training data in order to achieve better performance in a classifier.

  • Data…

Guidelines, software and examples on autoencoder design

This section attempts to guide the user through the process of designing an AE for a given problem, reviewing the range of choices available and their utility, then summarizing the available software for deep learning and outlining the steps needed to implement an AE. It also provides a case study with the MNIST dataset where the impact of several AE parameters is explored, as well as different AE types with identical parameter settings.
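To make this kind of design decision tangible, the following sketch builds a deeper, symmetric AE for MNIST-like 784-dimensional inputs. Every concrete value (layer widths, bottleneck size, activations, optimizer, epochs) is an illustrative choice of the sort this section discusses, not a value taken from the case study itself.

    # Sketch: a deeper symmetric AE; all hyperparameters are illustrative choices.
    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = keras.Input(shape=(784,))
    h = layers.Dense(256, activation="relu")(inputs)
    h = layers.Dense(64, activation="relu")(h)
    code = layers.Dense(16, activation="relu", name="code")(h)   # bottleneck
    h = layers.Dense(64, activation="relu")(code)
    h = layers.Dense(256, activation="relu")(h)
    outputs = layers.Dense(784, activation="sigmoid")(h)

    deep_ae = keras.Model(inputs, outputs)
    deep_ae.compile(optimizer="adam", loss="binary_crossentropy")
    # deep_ae.fit(x_train, x_train, epochs=30, batch_size=256,
    #             validation_data=(x_test, x_test))

Varying the bottleneck width, the encoder depth, the activation functions or the loss is the kind of exploration the MNIST case study carries out.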

Conclusions

As Pedro Domingos states in his famous tutorial [19], and as can be seen from the large number of publications on the subject, feature engineering is the key to obtaining good machine learning models, able to generalize and provide decent performance. This process consists of choosing the most relevant subset of features or combining some of them to create new ones. Automated fusion of features, especially when performed by nonlinear techniques, has proven to be very effective. Neural…

Acknowledgments

This work is supported by the Spanish National Research Projects TIN2015-68454-R and TIN2014-57251-P, and Project BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016.

References (160)

  • D.E. Rumelhart et al.

    Learning representations by back-propagating errors

    Nature

    (1986)
  • S. Hochreiter

    The vanishing gradient problem during learning recurrent neural nets and problem solutions

    Int. J. Uncertainty Fuzziness Knowl. Based Syst.

    (1998)
  • Y. LeCun et al.

    Backpropagation applied to handwritten zip code recognition

    Neural Comput.

    (1989)
  • G.E. Hinton et al.

    A fast learning algorithm for deep belief nets

    Neural Comput.

    (2006)
  • G.E. Hinton

    Deep belief networks

    Scholarpedia

    (2009)
  • Y. LeCun et al.

    The handbook of brain theory and neural networks

    Ch. Convolutional Networks for Images, Speech, and Time Series

    (1998)
  • R.J. Williams et al.

    A learning algorithm for continually running fully recurrent neural networks

    Neural Comput.

    (1989)
  • S. Hochreiter et al.

    Long short-term memory

    Neural Comput.

    (1997)
  • M. Ranzato et al.

    A unified energy-based framework for unsupervised learning

  • D.H. Ballard

    Modular learning in neural networks

    Proceedings of the Sixth National Conference on Artificial Intelligence - Volume 1, AAAI’87, AAAI Press

    (1987)
  • R. Bellman

    Dynamic Programming

    (1957)
  • H. Liu et al.

    Feature Extraction, Construction and Selection: A Data Mining Perspective

    (1998)
  • P. Domingos

    A few useful things to know about machine learning

    Commun. ACM

    (2012)
  • Y. Bengio et al.

    Representation learning: a review and new perspectives

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • G.E. Hinton

    Learning distributed representations of concepts

    Proceedings of the Eighth Annual Conference of the Cognitive Science Society

    (1986)
  • S. García et al.

    Data Preprocessing in Data Mining

    (2015)
  • D. Wettschereck et al.

    A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms

    (1997)
  • M.A. Hall

Correlation-based feature selection for machine learning, Ph.D. thesis

    (1999)
  • H. Peng et al.

    Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2005)
  • P. Mitra et al.

    Unsupervised feature selection using feature similarity

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2002)
  • I. Guyon et al.

    An Introduction to Feature Extraction

    (2006)
  • K. Pearson

    LIII. On lines and planes of closest fit to systems of points in space

    Philosoph. Mag. Series

    (1901)
  • H. Hotelling

    Analysis of a complex of statistical variables into principal components

    J. Edu. Psychol.

    (1933)
  • R.A. Fisher

    The statistical utilization of multiple measurements

    Ann. Hum. Genet.

    (1938)
  • B. Schölkopf et al.

    Nonlinear component analysis as a kernel eigenvalue problem

    Neural Comput.

    (1998)
  • J.B. Tenenbaum et al.

    A global geometric framework for nonlinear dimensionality reduction

    Science

    (2000)
  • L. Cayton

    Algorithms for Manifold Learning

    Technical report

    (2005)
  • J.A. Lee et al.

    Nonlinear Dimensionality Reduction

    (2007)
  • U.G. Mangai et al.

    A survey of decision fusion and feature fusion strategies for pattern classification

    IETE Tech. Rev.

    (2010)
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proc. IEEE

    (1998)
  • M.A. Kramer

    Nonlinear principal component analysis using autoassociative neural networks

    AlChE J.

    (1991)
  • H. Schwenk et al.

    Training methods for adaptive boosting of neural networks

    Advances in Neural Information Processing Systems

    (1998)
  • R. Hecht-Nielsen

    Replicator neural networks for universal optimal source coding

    Science

    (1995)
  • D. Chicco et al.

    Deep autoencoder neural networks for gene ontology annotation predictions

    Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM

    (2014)
  • H. Kamyshanska et al.

    On autoencoder scoring

  • L. Deng et al.

    Binary coding of speech spectrograms using a deep auto-encoder

    Eleventh Annual Conference of the International Speech Communication Association

    (2010)
  • P. Baldi

    Autoencoders, unsupervised learning, and deep architectures

    Proceedings of ICML Workshop on Unsupervised and Transfer Learning

    (2012)
  • D.E. Knuth

    Two notes on notation

    Am. Math. Monthly

    (1992)
  • X. Glorot et al.

    Domain adaptation for large-scale sentiment classification: a deep learning approach

    Proceedings of the 28th International Conference on Machine Learning (ICML-11)

    (2011)
  • Ç. Gülçehre et al.

    Knowledge matters: importance of prior information for optimization

    J. Mach. Learn. Res.

    (2016)