Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting
Introduction
There has been enormous growth in research on reinforcement learning since the development of Deep Q-Networks (DQNs) [1]. DQNs apply Q-learning to deep networks so that complicated reinforcement tasks can be learnt. However, as with most distributed models, DQNs can suffer from Catastrophic Forgetting (CF) [2], [3]: the tendency of a model to forget previous knowledge as it learns new knowledge. Pseudo-rehearsal is a method for overcoming CF by rehearsing randomly generated examples of previous tasks while learning on real data from a new task. Although pseudo-rehearsal methods have been widely used in image classification, they have been virtually unexplored in reinforcement learning. Solving CF in the reinforcement learning domain is essential if we want artificial agents that can learn continuously.
Continual learning is important to neural networks because CF limits their potential in numerous ways. For example, imagine a previously trained network whose function needs to be extended or partially changed. The typical solution would be to train the neural network on all of the previously learnt data (that was still relevant) along with the data to learn the new function. This can be an expensive operation because previous datasets (which tend to be very large in deep learning) would need to be stored and retrained. However, if a neural network could adequately perform continual learning, it would only be necessary for it to directly learn on data representing the new function. Furthermore, continual learning is also desirable because it allows the solution for multiple tasks to be compressed into a single network where weights common to both tasks may be shared. This can also benefit the speed at which new tasks are learnt because useful features may already be present in the network.
Our Reinforcement-Pseudo-Rehearsal model (RePR) achieves continual learning in the reinforcement domain. It does so by utilising a dual memory system where a freshly initialised DQN is trained on the new task and then knowledge from this short-term network is transferred to a separate DQN containing long-term knowledge of all previously learnt tasks. A generative network is used to produce states (short sequences of data) representative of previous tasks which can be rehearsed while transferring knowledge of the new task. For each new task, the generative network is trained on pseudo-items produced by the previous generative network, alongside data from the new task. Therefore, the system can prevent CF without the need for a large memory store holding data from previously encountered training examples.
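The data-assembly step described above can be sketched as follows. This is an illustrative sketch under our own assumptions, not the released implementation: `generator` stands in for the generative network, `long_term_q` for the long-term network whose responses label the pseudo-items, and the mixing counts are arbitrary.

```python
# Hypothetical sketch of pseudo-rehearsal batch assembly: real examples from
# the new task are mixed with generated "pseudo-items" whose targets are the
# long-term network's own current responses, so rehearsing them preserves
# previously learnt behaviour. All names here are illustrative.
import random

def make_batch(new_data, generator, long_term_q, n_new, n_pseudo, seed=0):
    rng = random.Random(seed)
    # Real (state, target) pairs from the task currently being learnt.
    batch = [("new", s, y) for s, y in rng.sample(new_data, n_new)]
    # Generated states resembling past tasks, labelled by the long-term network.
    for _ in range(n_pseudo):
        s = generator()
        batch.append(("pseudo", s, long_term_q(s)))
    rng.shuffle(batch)
    return batch
```

Because the pseudo-items' targets come from the network itself, no raw data from previous tasks needs to be stored; only the generative model is kept.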
The reinforcement tasks learnt by RePR are Atari 2600 games. These games are considered complex because the input space of the games is large which currently requires reinforcement learning to use deep neural networks (i.e. deep reinforcement learning). Applying pseudo-rehearsal methods to deep reinforcement learning is challenging because these reinforcement learning methods are notoriously unstable compared to image classification (due to the deadly triad [4]). In part, this is because target values are consistently changing during learning. We have found that using pseudo-rehearsal while learning these non-stationary targets is difficult because it increases the interference between new and old tasks. Furthermore, generative models struggle to produce high quality data resembling these reinforcement learning tasks, which can prevent important task knowledge from being learnt for the first time, as well as relearnt once it is forgotten.
Our RePR model applies pseudo-rehearsal to the difficult domain of deep reinforcement learning. RePR introduces a dual memory model suitable for reinforcement learning. This model is novel compared to previously used dual memory pseudo-rehearsal models in two important aspects. Firstly, the model isolates reinforcement learning to the short-term system, so that the long-term system can use supervised learning (i.e. mean squared error) with fixed target values (converged on by the short-term network), preventing non-stationary target values from increasing the interference between new and old tasks. Importantly, this differs from previous applications of pseudo-rehearsal, where both the short-term and long-term systems learn with the same cross-entropy loss function. Secondly, RePR transfers knowledge between the dual memory system using real samples, rather than those produced by a generative model. This allows tasks to be learnt and retained to a higher performance in reinforcement learning. The source code for RePR can be found at https://bitbucket.org/catk1ns0n/repr_public/.
The main contributions of this paper are:
- the first successful application of pseudo-rehearsal methods to complex deep reinforcement learning tasks;
- above state-of-the-art performance when sequentially learning complex reinforcement tasks, without storing any raw data from previously learnt tasks;
- empirical evidence demonstrating the need for a dual memory system as it facilitates new learning by separating the reinforcement learning system from the continual learning system.
Deep Q-learning
In deep Q-learning [1], the neural network is taught to predict the discounted reward that would be received from taking each of the possible actions given the current state. More specifically, it minimises the loss function $L(\theta) = \mathbb{E}_{(s,a,r,s')}\big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\big]$, where there exist two Q functions: a deep predictor network $Q(\cdot\,;\theta)$ and a deep target network $Q(\cdot\,;\theta^-)$. The predictor's parameters $\theta$ are updated continuously by stochastic gradient descent, while the target network's parameters $\theta^-$ are periodically copied from the predictor.
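The loss above can be illustrated numerically. In this minimal pure-Python sketch, plain dictionaries stand in for the deep predictor and target networks; the names (`q_pred`, `q_targ`, `GAMMA`) are ours, not from the paper's code.

```python
# Sketch of the deep Q-learning loss: squared TD error between the predictor's
# estimate Q(s, a) and the bootstrapped target from the (frozen) target network.
GAMMA = 0.99

def td_target(q_targ, reward, next_state, terminal):
    """r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal states."""
    if terminal:
        return reward
    return reward + GAMMA * max(q_targ[next_state].values())

def q_loss(q_pred, q_targ, batch):
    """Mean squared TD error over (s, a, r, s', terminal) transitions.
    Only q_pred would be updated; q_targ is held fixed."""
    total = 0.0
    for s, a, r, s2, done in batch:
        y = td_target(q_targ, r, s2, done)
        total += (y - q_pred[s][a]) ** 2
    return total / len(batch)

# Toy usage: two states, two actions.
q_pred = {"s0": {"a0": 0.5, "a1": 0.2}, "s1": {"a0": 1.0, "a1": 0.0}}
q_targ = {"s0": {"a0": 0.4, "a1": 0.3}, "s1": {"a0": 0.9, "a1": 0.1}}
batch = [("s0", "a0", 1.0, "s1", False),   # bootstrapped target
         ("s1", "a0", 0.0, "s0", True)]    # terminal: target is just r
loss = q_loss(q_pred, q_targ, batch)
```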
The RePR model
RePR is a dual memory model which uses pseudo-rehearsal with a generative network to achieve sequential learning in reinforcement tasks. The first part of our dual memory model is the short-term memory (STM) system, which serves a role analogous to that of the hippocampus in learning and is used to learn the current task.
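The knowledge-transfer step this dual memory design implies can be sketched with scalar stand-ins. This is our own minimal illustration, not the paper's code: `ltm` is the long-term network, the new-task targets are the short-term DQN's converged outputs (and hence fixed), and the old-task targets are the previous long-term network's responses to generated pseudo-states.

```python
# Minimal sketch (assumed details) of RePR's long-term learning objective:
# plain mean squared error against fixed targets, combining knowledge of the
# new task (short-term network's outputs) with pseudo-rehearsal of old tasks
# (previous long-term network's outputs on generated states).

def distill_loss(ltm, new_states, stm_targets, pseudo_states, old_targets,
                 alpha=0.5):
    """alpha * MSE on new-task states + (1 - alpha) * MSE on pseudo-states.
    ltm maps a state to a predicted value; targets are fixed lookup tables."""
    new_term = sum((ltm(s) - stm_targets[s]) ** 2 for s in new_states) / len(new_states)
    old_term = sum((ltm(s) - old_targets[s]) ** 2 for s in pseudo_states) / len(pseudo_states)
    return alpha * new_term + (1 - alpha) * old_term
```

Because the targets in both terms are fixed, this step is ordinary supervised learning; the non-stationarity of Q-learning stays confined to the short-term system.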
Related work
This section focuses on methods for preventing CF in reinforcement learning, concentrating on how to learn a new policy without forgetting those previously learnt for different tasks. There is substantial related research outside this domain (see [17] for a broad review), predominantly on continual learning in image classification. However, because these methods cannot be directly applied to complex reinforcement learning tasks, we have excluded them from this review.
Method
Our current research applies pseudo-rehearsal to deep Q-learning so that a DQN can be used to learn multiple Atari 2600 games in sequence. All agents select between 18 possible actions representing different combinations of joystick movements and pressing the fire button. Our DQN is based upon [1] with a few minor changes which we found helped the network learn the individual tasks more quickly. The specifics of these changes are described below.
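The 18-action interface mentioned above corresponds to the Atari 2600 full action set: each of the nine joystick positions, with and without the fire button. A small illustrative helper (our naming, not the paper's) enumerates them:

```python
# Enumerate the 18 Atari 2600 actions: 9 joystick positions x fire on/off.
from itertools import product

DIRECTIONS = ["centre", "up", "down", "left", "right",
              "up-left", "up-right", "down-left", "down-right"]

def action_set():
    """Return all (direction, fire) combinations: 9 * 2 = 18 actions."""
    return [(d, fire) for d, fire in product(DIRECTIONS, [False, True])]
```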
RePR performance on CF
The first experiment investigates how well RePR compares to lower and upper baselines. The lower baseline condition does not contain a component to assist in retaining previously learnt tasks. The reh condition is the upper baseline for RePR because it rehearses real items from previously learnt tasks and thus demonstrates how RePR would perform if its GAN could perfectly generate states from previous tasks to rehearse alongside learning the new task.
Discussion
Our experiments have demonstrated RePR to be an effective solution to CF when sequentially learning multiple tasks. To our knowledge, pseudo-rehearsal has not been used until now to successfully prevent CF on complex reinforcement learning tasks. RePR has advantages over popular weight constraint methods, such as EWC, because it does not constrain the network to retain similar weights when learning a new task. This allows the internal layers of the network to change according to new knowledge.
Conclusion
In conclusion, pseudo-rehearsal can be used with deep reinforcement learning methods to achieve continual learning. We have shown that our RePR model can be used to sequentially learn a number of complex reinforcement tasks, without scaling in complexity as the number of tasks increases and without revisiting or storing raw data from past tasks. Pseudo-rehearsal has major benefits over weight constraint methods as it is less restrictive on the network, and this is supported by our experimental results.
CRediT authorship contribution statement
Craig Atkinson: Conceptualization, Methodology, Software, Investigation, Writing - original draft, Writing - review & editing, Visualization. Brendan McCane: Supervision, Writing - review & editing. Lech Szymanski: Supervision, Writing - review & editing. Anthony Robins: Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X GPU used for this research. We also wish to acknowledge the use of New Zealand eScience Infrastructure (NeSI) high performance computing facilities. New Zealand’s national facilities are provided by NeSI and funded jointly by NeSI’s collaborator institutions and through the Ministry of Business, Innovation & Employment’s Research Infrastructure programme. URL https://www.nesi.org.nz.
References (41)
- K. Louie, M.A. Wilson, Temporally structured replay of awake hippocampal ensemble activity during rapid eye movement sleep, Neuron (2001).
- G.I. Parisi et al., Continual lifelong learning with neural networks: A review, Neural Networks (2019).
- W.C. Abraham, A. Robins, Memory retention - the synaptic stability versus plasticity dilemma, Trends in Neurosciences (2005).
- V. Mnih et al., Human-level control through deep reinforcement learning, Nature (2015).
- M. McCloskey, N.J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in: ...
- J. Kirkpatrick et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences (2017).
- R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (2nd ed.), complete draft ...
- D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, in: Advances in Neural Information ...
- S.-A. Rebuffi et al., iCaRL: Incremental classifier and representation learning, IEEE Conference on Computer Vision and Pattern Recognition (2017).
- Generative knowledge distillation for general purpose function compression, Neural Information Processing Systems Workshop on Teaching Machines, Robots, and Humans (2017).
- A. Robins, Catastrophic forgetting, rehearsal and pseudorehearsal, Connection Science.
- Sleep transforms the cerebral trace of declarative memories, Proceedings of the National Academy of Sciences.
- Check regularization: Combining modularity and elasticity for memory consolidation.
Craig Atkinson received his B.Sc. (Hons.) from the University of Otago, Dunedin, New Zealand, in 2017. He has just completed his doctorate in Computer Science at the University of Otago. His research interests include deep reinforcement learning and continual learning.
Brendan McCane received the B.Sc. (Hons.) and Ph.D. degrees from the James Cook University of North Queensland, Townsville City, Australia, in 1991 and 1996, respectively. He joined the Computer Science Department, University of Otago, Otago, New Zealand, in 1997. He served as the Head of the Department from 2007 to 2012. His current research interests include computer vision, pattern recognition, machine learning, and medical and biological imaging. He also enjoys reading, swimming, fishing and long walks on the beach with his dogs.
Lech Szymanski received the B.A.Sc. (Hons.) degree in computer engineering and the M.A.Sc. degree in electrical engineering from the University of Ottawa, Ottawa, ON, Canada, in 2001 and 2005, respectively, and the Ph.D. degree in computer science from the University of Otago, Otago, New Zealand, in 2012. He is currently a Lecturer at the Computer Science Department at the University of Otago. His research interests include machine learning, artificial neural networks, and deep architectures.
Anthony Robins completed his doctorate in cognitive science at the University of Sussex (UK) in 1989. He is currently a Professor of Computer Science at the University of Otago, New Zealand. His research interests include artificial neural networks, computational models of memory, and computer science education.