Survey: Reservoir computing approaches to recurrent neural network training
Introduction
Artificial recurrent neural networks (RNNs) represent a large and varied class of computational models that are designed by more or less detailed analogy with biological brain modules. In an RNN numerous abstract neurons (also called units or processing elements) are interconnected by likewise abstracted synaptic connections (or links), which enable activations to propagate through the network. The characteristic feature of RNNs that distinguishes them from the more widely used feedforward neural networks is that the connection topology possesses cycles. The existence of cycles has a profound impact:
- An RNN may develop a self-sustained temporal activation dynamics along its recurrent connection pathways, even in the absence of input. Mathematically, this renders an RNN a dynamical system, while feedforward networks are functions.
- If driven by an input signal, an RNN preserves in its internal state a nonlinear transformation of the input history — in other words, it has a dynamical memory, and is able to process temporal context information.
This review article concerns a particular subset of RNN-based research in two aspects:
- RNNs are used for a variety of scientific purposes, and two major usages can be distinguished: RNNs can serve as models of biological brains, or as engineering tools for technical applications. The first usage belongs to the field of computational neuroscience, while the second frames RNNs in the realms of machine learning, the theory of computation, and nonlinear signal processing and control. While there are interesting connections between the two attitudes, this survey focuses on the latter, with occasional borrowings from the first.
- From a dynamical systems perspective, there are two main classes of RNNs. Models from the first class are characterized by an energy-minimizing stochastic dynamics and symmetric connections. The best known instantiations are Hopfield networks [1], [2], Boltzmann machines [3], [4], and the recently emerging Deep Belief Networks [5]. These networks are mostly trained in some unsupervised learning scheme. Typical targeted network functionalities in this field are associative memories, data compression, the unsupervised modeling of data distributions, and static pattern classification, where the model is run for multiple time steps per single input instance to reach some type of convergence or equilibrium (but see e.g., [6] for an extension to temporal data). The mathematical background is rooted in statistical physics. In contrast, the second big class of RNN models typically features a deterministic update dynamics and directed connections. Systems from this class implement nonlinear filters, which transform an input time series into an output time series. The mathematical background here is nonlinear dynamical systems. The standard training mode is supervised. This survey is concerned only with RNNs of this second type, and when we speak of RNNs later on, we will exclusively refer to such systems.
RNNs (of the second type) appear as highly promising and fascinating tools for nonlinear time series processing applications, mainly for two reasons. First, it can be shown that under fairly mild and general assumptions, such RNNs are universal approximators of dynamical systems [7]. Second, biological brain modules almost universally exhibit recurrent connection pathways too. Both observations indicate that RNNs should potentially be powerful tools for engineering applications.
Despite this widely acknowledged potential, and despite a number of successful academic and practical applications, the impact of RNNs in nonlinear modeling has remained limited for a long time. The main reason for this lies in the fact that RNNs are difficult to train by gradient-descent-based methods, which aim at iteratively reducing the training error. While a number of training algorithms have been proposed (a brief overview is given in Section 2.5), these all suffer from the following shortcomings:
- The gradual change of network parameters during learning drives the network dynamics through bifurcations [8]. At such points, the gradient information degenerates and may become ill-defined. As a consequence, convergence cannot be guaranteed.
- A single parameter update can be computationally expensive, and many update cycles may be necessary. This results in long training times, and renders RNN training feasible only for relatively small networks (on the order of tens of units).
- It is intrinsically hard to learn dependencies requiring long-range memory, because the necessary gradient information exponentially dissolves over time [9] (but see the Long Short-Term Memory networks [10] for a possible escape).
- Advanced training algorithms are mathematically involved and need to be parameterized by a number of global control parameters, which are not easily optimized. As a result, such algorithms need substantial skill and experience to be successfully applied.
In this situation of slow and difficult progress, in 2001 a fundamentally new approach to RNN design and training was proposed independently by Wolfgang Maass under the name of Liquid State Machines [11] and by Herbert Jaeger under the name of Echo State Networks [12]. This approach, which had predecessors in computational neuroscience [13] and subsequent ramifications in machine learning as the Backpropagation-Decorrelation [14] learning rule, is now increasingly often collectively referred to as Reservoir Computing (RC). The RC paradigm avoids the shortcomings of gradient-descent RNN training listed above, by setting up RNNs in the following way:
- A recurrent neural network is randomly created and remains unchanged during training. This RNN is called the reservoir. It is passively excited by the input signal and maintains in its state a nonlinear transformation of the input history.
- The desired output signal is generated as a linear combination of the neurons' signals from the input-excited reservoir. This linear combination is obtained by linear regression, using the teacher signal as a target.
Fig. 1 graphically contrasts previous methods of RNN training with the RC approach.
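To make these two steps concrete, the following minimal sketch (our own illustration, not code from the literature; the reservoir size, the tanh units, the spectral-radius rescaling, the ridge-regression readout, and the toy sine-prediction task are all assumptions) sets up an echo-state-style network in Python:

```python
import numpy as np

# Minimal echo-state-style sketch: fixed random reservoir, trained linear readout.
rng = np.random.default_rng(42)
n_in, n_res = 1, 100

# Step 1: create the reservoir randomly; it stays fixed during training.
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))        # rescale spectral radius (common heuristic)

def run_reservoir(u_seq):
    """Drive the reservoir with an input sequence and collect the state trajectory."""
    x = np.zeros(n_res)
    states = []
    for u in u_seq:
        x = np.tanh(W_in @ np.atleast_1d(u) + W @ x)
        states.append(x.copy())
    return np.array(states)                      # shape (T, n_res)

# Toy teacher data: one-step-ahead prediction of a sine wave.
T = 1000
u = np.sin(0.2 * np.arange(T))
y_target = np.sin(0.2 * (np.arange(T) + 1))

# Step 2: train the readout by linear (here: ridge) regression on the collected states.
washout = 100                                    # discard the initial transient
X, Y = run_reservoir(u)[washout:], y_target[washout:]
W_out = np.linalg.solve(X.T @ X + 1e-8 * np.eye(n_res), X.T @ Y)

print("train NRMSE:", np.sqrt(np.mean((X @ W_out - Y) ** 2) / np.var(Y)))
```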
Reservoir Computing methods have quickly become popular, as witnessed for instance by a theme issue of Neural Networks [15], and today constitute one of the basic paradigms of RNN modeling [16]. The main reasons for this development are the following:
Modeling accuracy. RC has starkly outperformed previous methods of nonlinear system identification, prediction and classification, for instance in predicting chaotic dynamics (three orders of magnitude improved accuracy [17]), nonlinear wireless channel equalization (two orders of magnitude improvement [17]), the Japanese Vowel benchmark (zero test error rate, previous best: 1.8% [18]), financial forecasting (winner of the international forecasting competition NN3), and in isolated spoken digits recognition (improvement of word error rate on benchmark from 0.6% of previous best system to 0.2% [19], and further to 0% test error in recent unpublished work).
Modeling capacity. RC is computationally universal for continuous-time, continuous-value real-time systems modeled with bounded resources (including time and value resolution) [20], [21].
Biological plausibility. Numerous connections of RC principles to architectural and dynamical properties of mammalian brains have been established. RC (or closely related models) provides explanations of why biological brains can carry out accurate computations with an “inaccurate” and noisy physical substrate [22], [23], especially accurate timing [24]; of the way in which visual information is superimposed and processed in primary visual cortex [25], [26]; of how cortico-basal pathways support the representation of sequential information; and RC offers a functional interpretation of the cerebellar circuitry [27], [28]. A central role is assigned to an RC circuit in a series of models explaining sequential information processing in human and primate brains, most importantly of speech signals [13], [29], [30], [31].
Extensibility and parsimony. A notorious conundrum of neural network research is how to extend previously learned models by new items without impairing or destroying previously learned representations (catastrophic interference [32]). RC offers a simple and principled solution: new items are represented by new output units, which are appended to the previously established output units of a given reservoir. Since the output weights of different output units are independent of each other, catastrophic interference is a non-issue.
These encouraging observations should not mask the fact that RC is still in its infancy, and significant further improvements and extensions are desirable. Specifically, simply creating a reservoir at random is unsatisfactory. It seems obvious that, when addressing a specific modeling task, a reservoir design adapted to the task will lead to better results than a naive random creation. Thus, the main stream of research in the field is today directed at understanding the effects of reservoir characteristics on task performance, and at developing suitable reservoir design and adaptation methods. New ways of reading out from the reservoirs, including combining them into larger structures, are also being devised and investigated. While the field has moved beyond the initial idea of a fixed, randomly created reservoir with only the readout trained, the defining feature of reservoir computing, which distinguishes it from other RNN training approaches, remains that the reservoir and the readout are produced and trained separately and differently.
This review offers a conceptual classification and a comprehensive survey of this research.
As is true for many areas of machine learning, methods in reservoir computing converge from different fields and come with different names. We would like to make a distinction here between these differently named “tradition lines”, which we like to call brands, and the actual finer-grained ideas on producing good reservoirs, which we will call recipes. Since recipes can be useful and mixed across different brands, this review focuses on classifying and surveying them. To be fair, it has to be said that the authors of this survey associate themselves mostly with the Echo State Networks brand, and thus, willingly or not, are influenced by its mindset.
Overview. We start by introducing a generic notational framework in Section 2. More specifically, we define what we mean by problem or task in the context of machine learning in Section 2.1. Then we define a general notation for expansion (or kernel) methods for both non-temporal (Section 2.2) and temporal (Section 2.3) tasks, introduce our notation for recurrent neural networks in Section 2.4, and outline classical training methods in Section 2.5. In Section 3 we detail the foundations of Reservoir Computing and proceed by naming the most prominent brands. In Section 4 we introduce our classification of the reservoir generation/adaptation recipes, which transcends the boundaries between the brands. Following this classification we then review universal (Section 5), unsupervised (Section 6), and supervised (Section 7) reservoir generation/adaptation recipes. In Section 8 we provide a classification and review the techniques for reading the outputs from the reservoirs reported in literature, together with discussing various practical issues of readout training. A final discussion (Section 9) wraps up the entire picture.
Formulation of the problem
Let a problem or a task in our context of machine learning be defined as a problem of learning a functional relation between a given input $\mathbf{u}(n) \in \mathbb{R}^{N_u}$ and a desired output $\mathbf{y}_{\mathrm{target}}(n) \in \mathbb{R}^{N_y}$, where $n = 1, \ldots, T$, and $T$ is the number of data points in the training dataset $\{(\mathbf{u}(n), \mathbf{y}_{\mathrm{target}}(n))\}$. A non-temporal task is one where the data points are independent of each other and the goal is to learn a function $\mathbf{y}(n) = \mathbf{y}(\mathbf{u}(n))$ such that $E(\mathbf{y}, \mathbf{y}_{\mathrm{target}})$ is minimized, where $E$ is an error measure, for instance, the normalized root-mean-square error (NRMSE).
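One standard form of the NRMSE, written in the notation above and normalized by the variance of the target (the exact form in the survey's Eq. (1) may differ in detail), is

$$\mathrm{NRMSE}(\mathbf{y}, \mathbf{y}_{\mathrm{target}}) = \sqrt{\frac{\big\langle \| \mathbf{y}(n) - \mathbf{y}_{\mathrm{target}}(n) \|^{2} \big\rangle}{\big\langle \| \mathbf{y}_{\mathrm{target}}(n) - \langle \mathbf{y}_{\mathrm{target}}(n) \rangle \|^{2} \big\rangle}},$$

where $\langle \cdot \rangle$ denotes averaging over the $T$ data points.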
Reservoir methods
Reservoir computing methods differ from the “traditional” designs and learning techniques listed above in that they make a conceptual and computational separation between a dynamic reservoir — an RNN as a nonlinear temporal expansion function — and a recurrence-free (usually linear) readout that produces the desired output from the expansion.
This separation is based on the understanding (common with kernel methods) that the expansion and the readout serve different purposes: the expansion function expands the input history
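As a concrete illustration of this separation (the basic ESN formulation, one common instantiation among several discussed later), the expansion and the readout can be written as

$$\mathbf{x}(n) = \tanh\!\big(\mathbf{W}^{\mathrm{in}} \mathbf{u}(n) + \mathbf{W} \mathbf{x}(n-1)\big), \qquad \mathbf{y}(n) = \mathbf{W}^{\mathrm{out}} \mathbf{x}(n),$$

where the reservoir weights $\mathbf{W}^{\mathrm{in}}$ and $\mathbf{W}$ are fixed and only $\mathbf{W}^{\mathrm{out}}$ is trained; variants also feed $\mathbf{u}(n)$ and a bias into the readout, or include output feedback.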
Our classification of reservoir recipes
The successes of RC methods on benchmarks (see the listing in Section 1), where they outperform classical fully trained RNNs, do not imply that randomly generated reservoirs are optimal and cannot be improved. In fact, “random” is almost by definition an antonym to “optimal”. The results rather indicate the need for some novel methods of training/generating the reservoirs that are very probably not a direct extension of the way the output is trained (as in BP). Thus besides application studies
Generic reservoir recipes
The most classical methods of producing reservoirs all fall into this category. All of them generate reservoirs randomly, with topology and weight characteristics depending on some preset parameters. Even though they are not optimized for a particular input $\mathbf{u}(n)$ or target $\mathbf{y}_{\mathrm{target}}(n)$, a good manual selection of the parameters is to some extent task-dependent, complying with the “no free lunch” principle just mentioned.
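As an illustration of such a generic recipe, the following sketch (our own; the parameter values are purely illustrative and in practice task-dependent) generates a sparse random reservoir and rescales it to a prescribed spectral radius:

```python
import numpy as np

def random_reservoir(n=500, density=0.02, spectral_radius=0.95, seed=0):
    """Generic random-reservoir recipe (illustrative): sparse random topology,
    uniform weights, rescaled to a prescribed spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, (n, n))
    W[rng.random((n, n)) > density] = 0.0        # keep roughly `density` of the links
    eig_max = max(abs(np.linalg.eigvals(W)))     # largest eigenvalue magnitude
    return W * (spectral_radius / eig_max) if eig_max > 0 else W
```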
Unsupervised reservoir adaptation
In this section we describe reservoir training/generation methods that try to optimize some measure defined on the activations $\mathbf{x}(n)$ of the reservoir, for a given input $\mathbf{u}(n)$, but regardless of the desired output $\mathbf{y}_{\mathrm{target}}(n)$. In Section 6.1 we survey measures that are used to estimate the quality of the reservoir, irrespective of the methods optimizing them. Then local (Section 6.2) and global (Section 6.3) unsupervised reservoir training methods are surveyed.
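One example of such an input-dependent but output-independent measure, reported in the ESN literature, is the eigenvalue spread of the reservoir-state correlation matrix, which indicates how well-conditioned the subsequent readout regression will be. A minimal sketch (our own, assuming the reservoir states have already been collected into a matrix):

```python
import numpy as np

def eigenvalue_spread(states):
    """Unsupervised reservoir-quality measure (illustrative): eigenvalue spread
    (condition number) of the empirical correlation matrix of reservoir states.
    `states` has shape (T, n_res); a smaller spread suggests a better-conditioned readout."""
    R = states.T @ states / len(states)          # empirical correlation matrix
    eig = np.linalg.eigvalsh(R)                  # R is symmetric, eigenvalues are real
    return eig.max() / max(eig.min(), 1e-12)     # guard against numerically zero eigenvalues
```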
Supervised reservoir pre-training
In this section we discuss methods for training reservoirs to perform a specific given task, i.e., not only the concrete input $\mathbf{u}(n)$, but also the desired output $\mathbf{y}_{\mathrm{target}}(n)$ is taken into account. Since a linear readout from a reservoir is quickly trained, the suitability of a candidate reservoir for a particular task (e.g., in terms of NRMSE (1)) is inexpensive to check. Notice that even for most methods of this class the explicit target signal is not technically required for training
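This cheapness suggests a brute-force supervised recipe: generate several candidate reservoirs, train a linear readout on each, and keep the best. A minimal sketch (our illustration; `build_and_run` is a hypothetical helper that creates a reservoir from a seed, drives it with the input, and returns the state matrix):

```python
import numpy as np

def nrmse(y, y_target):
    return np.sqrt(np.mean((y - y_target) ** 2) / np.var(y_target))

def select_reservoir(build_and_run, u, y_target, n_candidates=20, reg=1e-8):
    """Score candidate random reservoirs by the NRMSE of a ridge-regression readout
    and return (error, seed, readout weights) of the best one."""
    best = None
    for seed in range(n_candidates):
        X = build_and_run(seed, u)               # reservoir state matrix, shape (T, n_res)
        W_out = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y_target)
        err = nrmse(X @ W_out, y_target)
        if best is None or err < best[0]:
            best = (err, seed, W_out)
    return best
```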
Readouts from the reservoirs
Conceptually, training a readout from a reservoir is a common supervised non-temporal task of mapping the reservoir states $\mathbf{x}(n)$ to the desired outputs $\mathbf{y}_{\mathrm{target}}(n)$. This is a well investigated domain in machine learning, much more so than learning temporal mappings with memory. A large choice of methods is available, and in principle any of them can be applied. Thus we will only briefly go through the ones reported to be successful in the literature.
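For the most common choice, a linear readout trained by ridge (Tikhonov-regularized linear) regression, the closed-form solution is, in the convention where reservoir states are collected row-wise into a matrix $\mathbf{X}$ and the targets into $\mathbf{Y}$ (orientations vary across the literature),

$$\mathbf{W}^{\mathrm{out}} = \big(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}\big)^{-1} \mathbf{X}^{\top} \mathbf{Y},$$

where $\lambda \ge 0$ is a regularization parameter; $\lambda = 0$ recovers ordinary least squares.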
Discussion
The striking success of the original RC methods in outperforming fully trained RNNs in many (though not all) tasks established an important milestone, or even a turning point, in the research on RNN training. The fact that a randomly generated fixed RNN, from which only a linear readout is trained, consistently outperforms state-of-the-art RNN training methods had several consequences:
- First of all, it revealed that we do not really know how to train RNNs well, and that something new is needed. The error
Acknowledgments
This work is partially supported by Planet Intelligent Systems GmbH, a private company with an inspiring interest in fundamental research. The authors are also thankful to Benjamin Schrauwen, Michael Thon, and an anonymous reviewer of this journal for their helpful constructive feedback.
References (149)
- et al., A learning algorithm for Boltzmann machines, Cognitive Science (1985).
- et al., Approximation of dynamical systems by continuous time recurrent neural networks, Neural Networks (1993).
- et al., Special issue on echo state networks and liquid state machines — Editorial, Neural Networks (2007).
- et al., Optimization and applications of echo state networks with leaky-integrator neurons, Neural Networks (2007).
- et al., Timing in the absence of clocks: Encoding time in neural network states, Neuron (2007).
- et al., The cerebellum as a liquid state machine, Neural Networks (2007).
- et al., Neurological basis of language and sequential cognition: Evidence from simulation, aphasia, and ERP studies, Brain and Language (2003).
- et al., Identification of prosodic attitudes by a temporal recurrent network, Cognitive Brain Research (2003).
- et al., Analyzing the weight dynamics of recurrent learning algorithms, Neurocomputing (2005).
- et al., Improving reservoirs using intrinsic plasticity, Neurocomputing (2008).
- An experimental unification of reservoir computing methods, Neural Networks.
- Isolated word recognition with the liquid state machine: A case study, Information Processing Letters.
- Decoupled echo state networks with lateral inhibition, Neural Networks.
- Edge of chaos and prediction of computational performance for neural circuit models, Neural Networks.
- Hopfield network, Scholarpedia.
- Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences of the United States of America.
- Boltzmann machine, Scholarpedia.
- Reducing the dimensionality of data with neural networks, Science.
- Modeling human motion using binary latent variables.
- Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks.
- Learning to forget: Continual prediction with LSTM, Neural Computation.
- Real-time computing without stable states: A new framework for neural computation based on perturbations, Neural Computation.
- Complex sensory-motor sequence learning based on recurrent state representation and reinforcement learning, Biological Cybernetics.
- Echo state network, Scholarpedia.
- Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication, Science.
- A model for real-time computation in generic neural microcircuits.
- Principles of real-time computing with feedback applied to cortical microcircuit models.
- Temporal information transformed into a spatial code by a neural network with realistic properties, Science.
- A statistical analysis of information-processing properties of lamina-specific cortical microcircuit models, Cerebral Cortex.
- Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus, Journal of Neuroscience.
- Temporal dynamics of information content carried by neurons in the primary visual cortex.
- Dynamical working memory and timed responses: The role of reverberating loops in the olivo-cerebellar system, Neural Computation.
- A neurolinguistic model of grammatical construction processing, Journal of Cognitive Neuroscience.
- Catastrophic interference in connectionist networks.
- Detecting strange attractors in turbulence.
- A learning algorithm for continually running fully recurrent neural networks, Neural Computation.
- Learning internal representations by error propagation.
- Backpropagation through time: What it does and how to do it, Proceedings of the IEEE.
- New results on recurrent network training: Unifying the algorithms and accelerating convergence, IEEE Transactions on Neural Networks.
- Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks, IEEE Transactions on Neural Networks.
- Fast training of recurrent networks based on the EM algorithm, IEEE Transactions on Neural Networks.
- Long short-term memory, Neural Computation.
- Adaptive nonlinear system identification with echo state networks.
- Computer models and analysis tools for neural microcircuits.