Neurocomputing

Volume 428, 7 March 2021, Pages 291-307
Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting

https://doi.org/10.1016/j.neucom.2020.11.050

Abstract

Neural networks can achieve excellent results in a wide variety of applications. However, when they attempt to learn tasks sequentially, they tend to learn the new task while catastrophically forgetting previous ones. We propose a model that overcomes catastrophic forgetting in sequential reinforcement learning by combining ideas from continual learning in both the image classification domain and the reinforcement learning domain. This model features a dual memory system, which separates continual learning from reinforcement learning, and a pseudo-rehearsal system that “recalls” items representative of previous tasks via a deep generative network. Our model sequentially learns three Atari 2600 games without demonstrating catastrophic forgetting and continues to perform above human level on all three games. This result is achieved without demanding additional storage as the number of tasks increases, storing raw data, or revisiting past tasks. In comparison, previous state-of-the-art solutions are substantially more vulnerable to forgetting on these complex deep reinforcement learning tasks.

Introduction

There has been enormous growth in research around reinforcement learning since the development of Deep Q-Networks (DQNs) [1]. DQNs apply Q-learning to deep networks so that complicated reinforcement tasks can be learnt. However, as with most distributed models, DQNs can suffer from Catastrophic Forgetting (CF) [2], [3]. This is where a model has the tendency to forget previous knowledge as it learns new knowledge. Pseudo-rehearsal is a method for overcoming CF by rehearsing randomly generated examples of previous tasks, while learning on real data from a new task. Although pseudo-rehearsal methods have been widely used in image classification, they have been virtually unexplored in reinforcement learning. Solving CF in the reinforcement learning domain is essential if we want to achieve artificial agents that can continuously learn.

Continual learning is important to neural networks because CF limits their potential in numerous ways. For example, imagine a previously trained network whose function needs to be extended or partially changed. The typical solution would be to retrain the network on all of the previously learnt data that is still relevant, along with the data for the new function. This can be an expensive operation because previous datasets (which tend to be very large in deep learning) would need to be stored and retrained on. However, if a neural network could adequately perform continual learning, it would only be necessary for it to learn directly on data representing the new function. Furthermore, continual learning is also desirable because it allows the solution for multiple tasks to be compressed into a single network in which weights common to the tasks may be shared. This can also benefit the speed at which new tasks are learnt because useful features may already be present in the network.

Our Reinforcement-Pseudo-Rehearsal model (RePR) achieves continual learning in the reinforcement domain. It does so by utilising a dual memory system where a freshly initialised DQN is trained on the new task and then knowledge from this short-term network is transferred to a separate DQN containing long-term knowledge of all previously learnt tasks. A generative network is used to produce states (short sequences of data) representative of previous tasks which can be rehearsed while transferring knowledge of the new task. For each new task, the generative network is trained on pseudo-items produced by the previous generative network, alongside data from the new task. Therefore, the system can prevent CF without the need for a large memory store holding data from previously encountered training examples.
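As a rough illustration of this data-mixing step, the sketch below combines pseudo-states sampled from the previous generator with real states from the new task before the generative network is retrained. The helper names, the even per-task split and the NumPy array representation are assumptions for illustration only, not the paper's exact procedure.

    import numpy as np

    def build_generator_training_batch(previous_generator, new_task_states,
                                       batch_size, num_prev_tasks):
        """Mix pseudo-states for previously learnt tasks with real states from
        the new task, so the retrained generator retains old tasks while
        learning the new one (illustrative split: equal share per task)."""
        n_new = batch_size // (num_prev_tasks + 1)   # share allotted to the new task
        n_old = batch_size - n_new                   # share allotted to all previous tasks
        pseudo_states = previous_generator(n_old)    # "recalled" states for old tasks
        idx = np.random.choice(len(new_task_states), size=n_new, replace=False)
        real_states = new_task_states[idx]           # raw states from the current task only
        return np.concatenate([pseudo_states, real_states], axis=0)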

The reinforcement tasks learnt by RePR are Atari 2600 games. These games are considered complex because their input space is large, which currently requires reinforcement learning to be performed with deep neural networks (i.e. deep reinforcement learning). Applying pseudo-rehearsal methods to deep reinforcement learning is challenging because these reinforcement learning methods are notoriously unstable compared to image classification (due to the deadly triad [4]). In part, this is because target values are constantly changing during learning. We have found that using pseudo-rehearsal while learning these non-stationary targets is difficult because it increases the interference between new and old tasks. Furthermore, generative models struggle to produce high quality data resembling these reinforcement learning tasks, which can prevent important task knowledge from being learnt for the first time, as well as relearnt once it is forgotten.

Our RePR model applies pseudo-rehearsal to the difficult domain of deep reinforcement learning. RePR introduces a dual memory model suitable for reinforcement learning. This model is novel compared to previously used dual memory pseudo-rehearsal models in two important respects. Firstly, the model isolates reinforcement learning to the short-term system, so that the long-term system can use supervised learning (i.e. mean squared error) with fixed target values (converged on by the short-term network), preventing non-stationary target values from increasing the interference between new and old tasks. Importantly, this differs from previous applications of pseudo-rehearsal, where both the short-term and long-term systems learn with the same cross-entropy loss function. Secondly, RePR transfers knowledge between the two memory systems using real samples, rather than those produced by a generative model. This allows tasks to be learnt and retained to a higher performance in reinforcement learning. The source code for RePR can be found at https://bitbucket.org/catk1ns0n/repr_public/.
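To make the fixed-target consolidation step concrete, here is a minimal sketch of how such a loss could be written, assuming a frozen short-term network (st_net) supplies targets for new-task states and a frozen copy of the previous long-term network (prev_lt_net) supplies targets for generated pseudo-states; the weighting alpha and the network handles are illustrative assumptions rather than the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def long_term_consolidation_loss(lt_net, st_net, prev_lt_net,
                                     new_states, pseudo_states, alpha=0.5):
        """Supervised (mean squared error) loss for the long-term DQN with fixed
        targets: distil the short-term network on new-task states and rehearse
        the previous long-term network's outputs on generated pseudo-states."""
        with torch.no_grad():
            new_targets = st_net(new_states)          # converged Q-values for the new task
            old_targets = prev_lt_net(pseudo_states)  # Q-values to be retained for old tasks
        new_term = F.mse_loss(lt_net(new_states), new_targets)
        old_term = F.mse_loss(lt_net(pseudo_states), old_targets)
        return alpha * new_term + (1.0 - alpha) * old_term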

The main contributions of this paper are:

  • the first successful application of pseudo-rehearsal methods to complex deep reinforcement learning tasks;

  • above state-of-the-art performance when sequentially learning complex reinforcement tasks, without storing any raw data from previously learnt tasks;

  • empirical evidence demonstrating the need for a dual memory system as it facilitates new learning by separating the reinforcement learning system from the continual learning system.


Deep Q-learning

In deep Q-learning [1], the neural network is taught to predict the discounted reward that would be received from taking each one of the possible actions given the current state. More specifically, it minimises the following loss function:

L_{DQN} = \mathbb{E}_{(s_t, a_t, r_t, d_t, s_{t+1}) \sim U(D)}\left[\big(y_t - Q(s_t, a_t; \psi_t)\big)^2\right],

y_t = \begin{cases} r_t, & \text{if terminal at } t+1 \\ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \psi_t^{-}), & \text{otherwise} \end{cases}

where there exist two Q functions, a deep predictor network and a deep target network. The predictor's parameters \psi_t are updated continuously by stochastic
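For concreteness, a minimal PyTorch sketch of this loss is given below, assuming a predictor network, a periodically updated target network, and minibatches sampled uniformly from the replay memory; all names and the discount value are placeholders.

    import torch
    import torch.nn.functional as F

    def dqn_loss(predictor, target_net, batch, gamma=0.99):
        """One-step deep Q-learning loss on a minibatch (s_t, a_t, r_t, d_t, s_{t+1})
        drawn uniformly from the replay memory D."""
        s, a, r, done, s_next = batch
        q_sa = predictor(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t; psi_t)
        with torch.no_grad():
            max_next_q = target_net(s_next).max(dim=1).values      # max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; psi_t^-)
            y = r + gamma * (1.0 - done.float()) * max_next_q      # y_t reduces to r_t at terminal states
        return F.mse_loss(q_sa, y)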

The RePR model

RePR is a dual memory model which uses pseudo-rehearsal with a generative network to achieve sequential learning in reinforcement tasks. The first part of our dual memory model is the short-term memory (STM) system, which serves a role analogous to that of the hippocampus in learning and is used to learn
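Although the remainder of this section is truncated here, the overall cycle described in the introduction can be outlined roughly as follows; train_rl, consolidate and train_generator are hypothetical caller-supplied procedures standing in for the steps detailed in the full paper.

    def learn_task_sequence(tasks, make_dqn, train_rl, consolidate, train_generator):
        """Illustrative outline of a dual memory pseudo-rehearsal cycle:
        a freshly initialised short-term DQN learns each task with reinforcement
        learning, its knowledge is then consolidated into the long-term DQN while
        generated pseudo-states stand in for earlier tasks, and finally the
        generative network itself is refreshed."""
        lt_net, generator = None, None
        for task in tasks:
            st_net = train_rl(make_dqn(), task)                    # short-term system: RL on the new task only
            lt_net = consolidate(lt_net, st_net, task, generator)  # long-term system: distil + pseudo-rehearse
            generator = train_generator(generator, task)           # retrain generator on pseudo-items + new data
        return lt_net, generator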

Related work

This section will focus on methods for preventing CF in reinforcement learning and will generally concentrate on how to learn a new policy without forgetting those previously learnt for different tasks. There is a lot of related research outside of this domain (see [17] for a broad review), predominantly around continual learning in image classification. However, because these methods cannot be directly applied to complex reinforcement learning tasks, we have excluded them from this review.

Method

Our current research applies pseudo-rehearsal to deep Q-learning so that a DQN can be used to learn multiple Atari 2600 games in sequence. All agents select between 18 possible actions representing different combinations of joystick movements and pressing the fire button. Our DQN is based upon [1] with a few minor changes which we found helped the network learn the individual tasks more quickly. The specifics
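For reference, the sketch below shows a Q-network in the style of [1] with 18 action outputs; because the minor changes mentioned above are not listed in this snippet, the layer sizes simply follow the original architecture and should be read as an assumption.

    import torch.nn as nn

    class AtariDQN(nn.Module):
        """Convolutional Q-network in the style of Mnih et al. [1]: the input is a
        stack of four preprocessed 84x84 frames and the output is one Q-value per
        joystick/fire combination (18 actions)."""
        def __init__(self, num_actions=18):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                nn.Linear(512, num_actions),
            )

        def forward(self, x):
            # Pixel inputs are scaled to [0, 1] before the convolutional stack.
            return self.head(self.features(x / 255.0))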

RePR performance on CF

The first experiment investigates how well RePR compares to a lower and an upper baseline. The no-reh condition is the lower baseline because it does not contain a component to assist in retaining the previously learnt tasks. The reh condition is the upper baseline for RePR because it rehearses real items from previously learnt tasks and thus demonstrates how RePR would perform if its GAN could perfectly generate states from previous tasks to rehearse alongside learning the new task.

Discussion

Our experiments have demonstrated RePR to be an effective solution to CF when sequentially learning multiple tasks. To our knowledge, pseudo-rehearsal has not been used until now to successfully prevent CF on complex reinforcement learning tasks. RePR has advantages over popular weight constraint methods, such as EWC, because it does not constrain the network to retain similar weights when learning a new task. This allows the internal layers of the network to change according to new knowledge,

Conclusion

In conclusion, pseudo-rehearsal can be used with deep reinforcement learning methods to achieve continual learning. We have shown that our RePR model can be used to sequentially learn a number of complex reinforcement tasks, without scaling in complexity as the number of tasks increases and without revisiting or storing raw data from past tasks. Pseudo-rehearsal has major benefits over weight constraint methods as it is less restrictive on the network and this is supported by our experimental

CRediT authorship contribution statement

Craig Atkinson: Conceptualization, Methodology, Software, Investigation, Writing - original draft, Writing - review & editing, Visualization. Brendan McCane: Supervision, Writing - review & editing. Lech Szymanski: Supervision, Writing - review & editing. Anthony Robins: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X GPU used for this research. We also wish to acknowledge the use of New Zealand eScience Infrastructure (NeSI) high performance computing facilities. New Zealand’s national facilities are provided by NeSI and funded jointly by NeSI’s collaborator institutions and through the Ministry of Business, Innovation & Employment’s Research Infrastructure programme. URL https://www.nesi.org.nz.


References (41)

  • K. Louie et al., Temporally structured replay of awake hippocampal ensemble activity during rapid eye movement sleep, Neuron (2001)
  • G.I. Parisi et al., Continual lifelong learning with neural networks: A review, Neural Networks (2019)
  • W.C. Abraham et al., Memory retention - the synaptic stability versus plasticity dilemma, Trends in Neurosciences (2005)
  • V. Mnih et al., Human-level control through deep reinforcement learning, Nature (2015)
  • M. McCloskey, N.J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in:...
  • J. Kirkpatrick et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences (2017)
  • R.S. Sutton, A.G. Barto, Reinforcement learning: An introduction (2nd ed.), complete draft...
  • D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, in: Advances in Neural Information...
  • S.-A. Rebuffi et al., iCaRL: Incremental classifier and representation learning, IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • M. Riemer et al., Generative knowledge distillation for general purpose function compression, Neural Information Processing Systems Workshop on Teaching Machines, Robots, and Humans (2017)
  • A. Robins, Catastrophic forgetting, rehearsal and pseudorehearsal, Connection Science (1995)
  • S. Gais et al., Sleep transforms the cerebral trace of declarative memories, Proceedings of the National Academy of Sciences (2007)
  • C. Atkinson, B. McCane, L. Szymanski, A. Robins, Pseudo-recursal: Solving the catastrophic forgetting problem in deep...
  • H. Shin, J.K. Lee, J. Kim, J. Kim, Continual learning with deep generative replay, in: Advances in Neural Information...
  • I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative...
  • G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv e-prints...
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A.C. Courville, Improved training of Wasserstein GANs, in: Advances...
  • T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of GANs for improved quality, stability, and variation,...
  • A.A. Rusu, N.C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell,...
  • T. Kobayashi, Check regularization: Combining modularity and elasticity for memory consolidation


Craig Atkinson received his B.Sc. (Hons.) from the University of Otago, Dunedin, New Zealand, in 2017. He has just completed his doctorate in Computer Science at the University of Otago. His research interests include deep reinforcement learning and continual learning.

Brendan McCane received the B.Sc. (Hons.) and Ph.D. degrees from the James Cook University of North Queensland, Townsville City, Australia, in 1991 and 1996, respectively. He joined the Computer Science Department, University of Otago, Otago, New Zealand, in 1997. He served as the Head of the Department from 2007 to 2012. His current research interests include computer vision, pattern recognition, machine learning, and medical and biological imaging. He also enjoys reading, swimming, fishing and long walks on the beach with his dogs.

Lech Szymanski received the B.A.Sc. (Hons.) degree in computer engineering and the M.A.Sc. degree in electrical engineering from the University of Ottawa, Ottawa, ON, Canada, in 2001 and 2005, respectively, and the Ph.D. degree in computer science from the University of Otago, Otago, New Zealand, in 2012. He is currently a Lecturer at the Computer Science Department at the University of Otago. His research interests include machine learning, artificial neural networks, and deep architectures.

Anthony Robins completed his doctorate in cognitive science at the University of Sussex (UK) in 1989. He is currently a Professor of Computer Science at the University of Otago, New Zealand. His research interests include artificial neural networks, computational models of memory, and computer science education.
