Computer Science Review
Volume 3, Issue 3, August 2009, Pages 127-149

Survey

Reservoir computing approaches to recurrent neural network training

https://doi.org/10.1016/j.cosrev.2009.03.005

Abstract

Echo State Networks and Liquid State Machines introduced a new paradigm in artificial recurrent neural network (RNN) training, where an RNN (the reservoir) is generated randomly and only a readout is trained. The paradigm, becoming known as reservoir computing, greatly facilitated the practical application of RNNs and outperformed classical fully trained RNNs in many tasks. It has lately become a vivid research field with numerous extensions of the basic idea, including reservoir adaptation, thus broadening the initial paradigm to using different methods for training the reservoir and the readout. This review systematically surveys both current ways of generating/adapting the reservoirs and training different types of readouts. It offers a natural conceptual classification of the techniques, which transcends boundaries of the current “brand-names” of reservoir methods, and thus aims to help in unifying the field and providing the reader with a detailed “map” of it.

Introduction

Artificial recurrent neural networks (RNNs) represent a large and varied class of computational models that are designed by more or less detailed analogy with biological brain modules. In an RNN numerous abstract neurons (also called units or processing elements) are interconnected by likewise abstracted synaptic connections (or links), which enable activations to propagate through the network. The characteristic feature of RNNs that distinguishes them from the more widely used feedforward neural networks is that the connection topology possesses cycles. The existence of cycles has a profound impact:

  • An RNN may develop a self-sustained temporal activation dynamics along its recurrent connection pathways, even in the absence of input. Mathematically, this renders an RNN a dynamical system, whereas feedforward networks are functions.

  • If driven by an input signal, an RNN preserves in its internal state a nonlinear transformation of the input history — in other words, it has a dynamical memory, and is able to process temporal context information.

This review article concerns a particular subset of RNN-based research in two aspects:

  • RNNs are used for a variety of scientific purposes, and at least two major lines of RNN research exist: RNNs can be used to model biological brains, or they can serve as engineering tools for technical applications. The first usage belongs to the field of computational neuroscience, while the second frames RNNs within machine learning, the theory of computation, and nonlinear signal processing and control. While there are interesting connections between the two perspectives, this survey focuses on the latter, with occasional borrowings from the first.

  • From a dynamical systems perspective, there are two main classes of RNNs. Models from the first class are characterized by an energy-minimizing stochastic dynamics and symmetric connections. The best known instantiations are Hopfield networks [1], [2], Boltzmann machines [3], [4], and the recently emerging Deep Belief Networks [5]. These networks are mostly trained in some unsupervised learning scheme. Typical targeted network functionalities in this field are associative memories, data compression, the unsupervised modeling of data distributions, and static pattern classification, where the model is run for multiple time steps per single input instance to reach some type of convergence or equilibrium (but see e.g., [6] for extension to temporal data). The mathematical background is rooted in statistical physics. In contrast, the second big class of RNN models typically features a deterministic update dynamics and directed connections. Systems from this class implement nonlinear filters, which transform an input time series into an output time series. The mathematical background here is nonlinear dynamical systems. The standard training mode is supervised. This survey is concerned only with RNNs of this second type, and when we speak of RNNs later on, we will exclusively refer to such systems.

RNNs (of the second type) appear as highly promising and fascinating tools for nonlinear time series processing applications, mainly for two reasons. First, it can be shown that under fairly mild and general assumptions, such RNNs are universal approximators of dynamical systems [7]. Second, biological brain modules almost universally exhibit recurrent connection pathways too. Both observations indicate that RNNs should potentially be powerful tools for engineering applications.

Despite this widely acknowledged potential, and despite a number of successful academic and practical applications, the impact of RNNs in nonlinear modeling has remained limited for a long time. The main reason for this lies in the fact that RNNs are difficult to train by gradient-descent-based methods, which aim at iteratively reducing the training error. While a number of training algorithms have been proposed (a brief overview is given in Section 2.5), these all suffer from the following shortcomings:

  • The gradual change of network parameters during learning drives the network dynamics through bifurcations [8]. At such points, the gradient information degenerates and may become ill-defined. As a consequence, convergence cannot be guaranteed.

  • A single parameter update can be computationally expensive, and many update cycles may be necessary. This results in long training times, and renders RNN training feasible only for relatively small networks (in the order of tens of units).

  • It is intrinsically hard to learn dependencies requiring long-range memory, because the necessary gradient information exponentially dissolves over time [9] (but see the Long Short-Term Memory networks [10] for a possible escape).

  • Advanced training algorithms are mathematically involved and need to be parameterized by a number of global control parameters, which are not easily optimized. As a result, such algorithms need substantial skill and experience to be successfully applied.

In this situation of slow and difficult progress, in 2001 a fundamentally new approach to RNN design and training was proposed independently by Wolfgang Maass under the name of Liquid State Machines [11] and by Herbert Jaeger under the name of Echo State Networks [12]. This approach, which had predecessors in computational neuroscience [13] and subsequent ramifications in machine learning as the Backpropagation-Decorrelation [14] learning rule, is now increasingly often collectively referred to as Reservoir Computing (RC). The RC paradigm avoids the shortcomings of gradient-descent RNN training listed above, by setting up RNNs in the following way:

  • A recurrent neural network is randomly created and remains unchanged during training. This RNN is called the reservoir. It is passively excited by the input signal and maintains in its state a nonlinear transformation of the input history.

  • The desired output signal is generated as a linear combination of the neurons’ signals from the input-excited reservoir. This linear combination is obtained by linear regression, using the teacher signal as a target (see the code sketch after this list).
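To make these two steps concrete, the following is a minimal sketch of an echo state network in Python/NumPy. The dimensions, the spectral radius of 0.9, the washout length, the regularization constant, and the helper names (run_reservoir, train_readout, predict) are illustrative assumptions, not values or an implementation taken from this survey.

    import numpy as np

    rng = np.random.default_rng(42)

    # Illustrative sizes and constants (not prescribed by the survey).
    N_u, N_x, N_y = 1, 100, 1      # input, reservoir, and output dimensions
    washout, ridge = 100, 1e-8     # initial transient to discard; regularization strength

    # Step 1: create the reservoir randomly; it is never changed by training.
    W_in = rng.uniform(-0.5, 0.5, (N_x, N_u + 1))    # input (and bias) weights
    W = rng.uniform(-0.5, 0.5, (N_x, N_x))           # recurrent reservoir weights
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # rescale spectral radius to 0.9

    def run_reservoir(u_seq):
        """Drive the fixed reservoir with the input sequence and collect its states."""
        x = np.zeros(N_x)
        rows = []
        for u in u_seq:
            u1 = np.concatenate(([1.0], np.atleast_1d(u)))  # bias + input u(n)
            x = np.tanh(W_in @ u1 + W @ x)                  # reservoir state update
            rows.append(np.concatenate((u1, x)))            # readout sees [1; u(n); x(n)]
        return np.array(rows)

    # Step 2: train only the linear readout, here by ridge (regularized linear) regression.
    def train_readout(u_seq, y_target):
        X = run_reservoir(u_seq)[washout:]
        Y = np.asarray(y_target)[washout:].reshape(len(X), N_y)
        return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y).T

    def predict(u_seq, W_out):
        """The output y(n) is a linear combination of [1; u(n); x(n)] with learned weights."""
        return run_reservoir(u_seq) @ W_out.T

A typical use would be W_out = train_readout(u_train, y_train) followed by y = predict(u_test, W_out); in practice the spectral radius, input scaling, and regularization would be tuned to the task, as discussed later in this survey.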

Fig. 1 graphically contrasts previous methods of RNN training with the RC approach.

Reservoir Computing methods have quickly become popular, as witnessed for instance by a theme issue of Neural Networks [15], and today constitute one of the basic paradigms of RNN modeling [16]. The main reasons for this development are the following:

  • Modeling accuracy. RC has starkly outperformed previous methods of nonlinear system identification, prediction and classification, for instance in predicting chaotic dynamics (three orders of magnitude improved accuracy [17]), nonlinear wireless channel equalization (two orders of magnitude improvement [17]), the Japanese Vowel benchmark (zero test error rate, previous best: 1.8% [18]), financial forecasting (winner of the international forecasting competition NN3), and in isolated spoken digits recognition (the word error rate on a benchmark improved from 0.6% for the previous best system to 0.2% [19], and further to 0% test error in recent unpublished work).

  • Modeling capacity. RC is computationally universal for continuous-time, continuous-value real-time systems modeled with bounded resources (including time and value resolution) [20], [21].

  • Biological plausibility. Numerous connections of RC principles to architectural and dynamical properties of mammalian brains have been established. RC (or closely related models) provides explanations of why biological brains can carry out accurate computations with an “inaccurate” and noisy physical substrate [22], [23], especially accurate timing [24]; of the way in which visual information is superimposed and processed in primary visual cortex [25], [26]; of how cortico-basal pathways support the representation of sequential information; and RC offers a functional interpretation of the cerebellar circuitry [27], [28]. A central role is assigned to an RC circuit in a series of models explaining sequential information processing in human and primate brains, most importantly of speech signals [13], [29], [30], [31].

  • Extensibility and parsimony. A notorious conundrum of neural network research is how to extend previously learned models by new items without impairing or destroying previously learned representations (catastrophic interference [32]). RC offers a simple and principled solution: new items are represented by new output units, which are appended to the previously established output units of a given reservoir. Since the output weights of different output units are independent of each other, catastrophic interference is a non-issue.

These encouraging observations should not mask the fact that RC is still in its infancy, and significant further improvements and extensions are desirable. Specifically, simply creating a reservoir at random is unsatisfactory. It seems obvious that, when addressing a specific modeling task, a reservoir design adapted to the task will lead to better results than naive random creation. Thus, the main stream of research in the field today is directed at understanding the effects of reservoir characteristics on task performance, and at developing suitable reservoir design and adaptation methods. Also, new ways of reading out from the reservoirs, including combining them into larger structures, are being devised and investigated. While shifting away from the initial idea of a fixed, randomly created reservoir with only the readout trained, the defining feature of the reservoir computing paradigm (which distinguishes it from other RNN training approaches) remains that the reservoir and the readout are produced/trained separately and differently.

This review offers a conceptual classification and a comprehensive survey of this research.

As is true for many areas of machine learning, methods in reservoir computing converge from different fields and come with different names. We would like to make a distinction here between these differently named “tradition lines”, which we like to call brands, and the actual finer-grained ideas on producing good reservoirs, which we will call recipes. Since recipes can be useful and mixed across different brands, this review focuses on classifying and surveying them. To be fair, it has to be said that the authors of this survey associate themselves mostly with the Echo State Networks brand, and thus, willingly or not, are influenced by its mindset.

Overview. We start by introducing a generic notational framework in Section 2. More specifically, we define what we mean by problem or task in the context of machine learning in Section 2.1. Then we define a general notation for expansion (or kernel) methods for both non-temporal (Section 2.2) and temporal (Section 2.3) tasks, introduce our notation for recurrent neural networks in Section 2.4, and outline classical training methods in Section 2.5. In Section 3 we detail the foundations of Reservoir Computing and proceed by naming the most prominent brands. In Section 4 we introduce our classification of the reservoir generation/adaptation recipes, which transcends the boundaries between the brands. Following this classification we then review universal (Section 5), unsupervised (Section 6), and supervised (Section 7) reservoir generation/adaptation recipes. In Section 8 we provide a classification and review the techniques for reading the outputs from the reservoirs reported in literature, together with discussing various practical issues of readout training. A final discussion (Section 9) wraps up the entire picture.

Section snippets

Formulation of the problem

Let a problem or a task in our context of machine learning be defined as a problem of learning a functional relation between a given input \(u(n) \in \mathbb{R}^{N_u}\) and a desired output \(y_{\mathrm{target}}(n) \in \mathbb{R}^{N_y}\), where \(n = 1, \dots, T\), and \(T\) is the number of data points in the training dataset \(\{(u(n), y_{\mathrm{target}}(n))\}\). A non-temporal task is one where the data points are independent of each other and the goal is to learn a function \(y(n) = y(u(n))\) such that \(E(y, y_{\mathrm{target}})\) is minimized, where \(E\) is an error measure, for instance, the normalized root-mean-square error (NRMSE).
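For reference, the NRMSE that the full text introduces as Eq. (1) follows the standard definition (reconstructed here, not quoted verbatim from the paper):

\[
\mathrm{NRMSE}(y, y_{\mathrm{target}}) =
\sqrt{\frac{\big\langle \lVert y(n) - y_{\mathrm{target}}(n) \rVert^{2} \big\rangle}
{\big\langle \lVert y_{\mathrm{target}}(n) - \langle y_{\mathrm{target}}(n) \rangle \rVert^{2} \big\rangle}},
\]

where \(\langle \cdot \rangle\) denotes averaging over the \(T\) training points and \(\lVert \cdot \rVert\) the Euclidean norm.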

Reservoir methods

Reservoir computing methods differ from the “traditional” designs and learning techniques listed above in that they make a conceptual and computational separation between a dynamic reservoir — an RNN as a nonlinear temporal expansion function — and a recurrence-free (usually linear) readout that produces the desired output from the expansion.

This separation is based on the understanding (common with kernel methods) that \(x(\cdot)\) and \(y(\cdot)\) serve different purposes: \(x(\cdot)\) expands the input history \(u(n), u(n-1), \dots\) into a rich enough reservoir state \(x(n)\), while \(y(\cdot)\) combines this state into the desired output signal.
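To make the separation concrete, a typical echo-state-network-style instantiation of the expansion and the readout (written here in a simplified form without output feedback, as an illustration rather than the survey's general formulation) is

\[
x(n) = f\big(W^{\mathrm{in}} u(n) + W\, x(n-1)\big), \qquad
y(n) = W^{\mathrm{out}}\, [u(n); x(n)],
\]

where \(f\) is an element-wise sigmoid nonlinearity (e.g., \(\tanh\)), \(W^{\mathrm{in}}\) and \(W\) are the fixed random input and reservoir weight matrices, \([\cdot;\cdot]\) denotes vector concatenation, and only \(W^{\mathrm{out}}\) is learned.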

Our classification of reservoir recipes

The successes of applying RC methods to benchmarks (see the listing in Section 1) outperforming classical fully trained RNNs do not imply that randomly generated reservoirs are optimal and cannot be improved. In fact, “random” is almost by definition an antonym to “optimal”. The results rather indicate the need for some novel methods of training/generating the reservoirs that are very probably not a direct extension of the way the output is trained (as in BP). Thus besides application studies

Generic reservoir recipes

The most classical methods of producing reservoirs all fall into this category. All of them generate reservoirs randomly, with topology and weight characteristics depending on some preset parameters. Even though they are not optimized for a particular input \(u(n)\) or target \(y_{\mathrm{target}}(n)\), a good manual selection of the parameters is to some extent task-dependent, complying with the “no free lunch” principle just mentioned.
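As an illustration of such parameterized random generation, the following Python/NumPy sketch creates a sparse reservoir with a prescribed connectivity and spectral radius; the default parameter values are arbitrary examples, not recommendations from the survey.

    import numpy as np

    def random_reservoir(n_units=200, connectivity=0.1, spectral_radius=0.9, seed=0):
        """Generate a sparse random reservoir weight matrix with a given spectral radius."""
        rng = np.random.default_rng(seed)
        W = rng.uniform(-1.0, 1.0, (n_units, n_units))         # dense random weights
        mask = rng.random((n_units, n_units)) < connectivity   # keep a fraction of links
        W = np.where(mask, W, 0.0)                              # enforce sparse topology
        radius = np.max(np.abs(np.linalg.eigvals(W)))
        return W * (spectral_radius / radius)                   # rescale largest |eigenvalue|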

Unsupervised reservoir adaptation

In this section we describe reservoir training/generation methods that try to optimize some measure defined on the activations \(x(n)\) of the reservoir, for a given input \(u(n)\), but regardless of the desired output \(y_{\mathrm{target}}(n)\). In Section 6.1 we survey measures that are used to estimate the quality of the reservoir, irrespective of the methods optimizing them. Then local (Section 6.2) and global (Section 6.3) unsupervised reservoir training methods are surveyed.

Supervised reservoir pre-training

In this section we discuss methods for training reservoirs to perform a specific given task, i.e., not only the concrete input \(u(n)\), but also the desired output \(y_{\mathrm{target}}(n)\) is taken into account. Since a linear readout from a reservoir is quickly trained, the suitability of a candidate reservoir for a particular task (e.g., in terms of the NRMSE (1)) is inexpensive to check. Notice that even for most methods of this class the explicit target signal \(y_{\mathrm{target}}(n)\) is not technically required for training

Readouts from the reservoirs

Conceptually, training a readout from a reservoir is a common supervised non-temporal task of mapping \(x(n)\) to \(y_{\mathrm{target}}(n)\). This is a well investigated domain in machine learning, much more so than learning temporal mappings with memory. A large choice of methods is available, and in principle any of them can be applied. Thus we will only briefly go through the ones reported to be successful in the literature.
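As one concrete and widely used option (given here as a standard formulation, not as the survey's sole recommendation), the linear readout weights can be computed in closed form by ridge regression:

\[
W^{\mathrm{out}} = Y_{\mathrm{target}} X^{\top} \left( X X^{\top} + \beta I \right)^{-1},
\]

where the columns of \(X\) collect the signals entering the readout (reservoir and possibly input units) over the training period, the columns of \(Y_{\mathrm{target}}\) collect \(y_{\mathrm{target}}(n)\), and \(\beta \ge 0\) is a regularization constant (\(\beta = 0\) recovers ordinary linear regression via the pseudoinverse).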

Discussion

The striking success of the original RC methods in outperforming fully trained RNNs in many (though not all) tasks established an important milestone, or even a turning point, in the research on RNN training. The fact that a randomly generated fixed RNN, in which only a linear readout is trained, consistently outperforms state-of-the-art RNN training methods had several consequences:

  • First of all it revealed that we do not really know how to train RNNs well, and something new is needed. The error

Acknowledgments

This work is partially supported by Planet Intelligent Systems GmbH, a private company with an inspiring interest in fundamental research. The authors are also thankful to Benjamin Schrauwen, Michael Thon, and an anonymous reviewer of this journal for their helpful constructive feedback.

References (149)

  • David Verstraeten et al., An experimental unification of reservoir computing methods, Neural Networks (2007)
  • David Verstraeten et al., Isolated word recognition with the liquid state machine: A case study, Information Processing Letters (2005)
  • Yanbo Xue et al., Decoupled echo state networks with lateral inhibition, Neural Networks (2007)
  • Robert A. Legenstein et al., Edge of chaos and prediction of computational performance for neural circuit models, Neural Networks (2007)
  • John J. Hopfield, Hopfield network, Scholarpedia (2007)
  • John J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences of the United States of America (1982)
  • Geoffrey E. Hinton, Boltzmann machine, Scholarpedia (2007)
  • Geoffrey E. Hinton et al., Reducing the dimensionality of data with neural networks, Science (2006)
  • Graham W. Taylor et al., Modeling human motion using binary latent variables
  • Kenji Doya, Bifurcations in the learning of recurrent neural networks, in: Proceedings of IEEE International Symposium...
  • Yoshua Bengio et al., Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks (1994)
  • Felix A. Gers et al., Learning to forget: Continual prediction with LSTM, Neural Computation (2000)
  • Wolfgang Maass et al., Real-time computing without stable states: A new framework for neural computation based on perturbations, Neural Computation (2002)
  • Herbert Jaeger, The “echo state” approach to analysing and training recurrent neural networks, Technical Report GMD...
  • Peter F. Dominey, Complex sensory-motor sequence learning based on recurrent state representation and reinforcement learning, Biological Cybernetics (1995)
  • Jochen J. Steil, Backpropagation-decorrelation: Recurrent learning with O(N) complexity, in: Proceedings of the IEEE...
  • Herbert Jaeger, Echo state network, Scholarpedia (2007)
  • Herbert Jaeger et al., Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication, Science (2004)
  • David Verstraeten, Benjamin Schrauwen, Dirk Stroobandt, Reservoir-based techniques for speech recognition, in:...
  • Wolfgang Maass et al., A model for real-time computation in generic neural microcircuits
  • Wolfgang Maass et al., Principles of real-time computing with feedback applied to cortical microcircuit models
  • Dean V. Buonomano et al., Temporal information transformed into a spatial code by a neural network with realistic properties, Science (1995)
  • Stefan Haeusler et al., A statistical analysis of information-processing properties of lamina-specific cortical microcircuit models, Cerebral Cortex (2007)
  • Garrett B. Stanley et al., Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus, Journal of Neuroscience (1999)
  • Danko Nikolić et al., Temporal dynamics of information content carried by neurons in the primary visual cortex
  • Werner M. Kistler et al., Dynamical working memory and timed responses: The role of reverberating loops in the olivo-cerebellar system, Neural Computation (2002)
  • Peter F. Dominey et al., A neurolinguistic model of grammatical construction processing, Journal of Cognitive Neuroscience (2006)
  • Robert M. French, Catastrophic interference in connectionist networks
  • Floris Takens, Detecting strange attractors in turbulence
  • Ronald J. Williams et al., A learning algorithm for continually running fully recurrent neural networks, Neural Computation (1989)
  • David E. Rumelhart et al., Learning internal representations by error propagation
  • Paul J. Werbos, Backpropagation through time: What it does and how to do it, Proceedings of the IEEE (1990)
  • Amir F. Atiya et al., New results on recurrent network training: Unifying the algorithms and accelerating convergence, IEEE Transactions on Neural Networks (2000)
  • Gintaras V. Puškorius et al., Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks, IEEE Transactions on Neural Networks (1994)
  • Sheng Ma et al., Fast training of recurrent networks based on the EM algorithm, IEEE Transactions on Neural Networks (1998)
  • Sepp Hochreiter et al., Long short-term memory, Neural Computation (1997)
  • Herbert Jaeger, Short term memory in echo state networks, Technical Report GMD Report 152, German National Research...
  • Herbert Jaeger, Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the “echo state network”...
  • Herbert Jaeger, Adaptive nonlinear system identification with echo state networks
  • Thomas Natschläger et al., Computer models and analysis tools for neural microcircuits