Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting
Introduction
There has been enormous growth in research on reinforcement learning since the development of Deep Q-Networks (DQNs) [1]. DQNs apply Q-learning to deep networks so that complicated reinforcement tasks can be learnt. However, as with most distributed models, DQNs can suffer from Catastrophic Forgetting (CF) [2], [3]: the tendency of a model to forget previous knowledge as it learns new knowledge. Pseudo-rehearsal is a method for overcoming CF by rehearsing randomly generated examples of previous tasks while learning on real data from a new task. Although pseudo-rehearsal methods have been widely used in image classification, they have been virtually unexplored in reinforcement learning. Solving CF in the reinforcement learning domain is essential if we want artificial agents that can learn continuously.
Continual learning is important to neural networks because CF limits their potential in numerous ways. For example, imagine a previously trained network whose function needs to be extended or partially changed. The typical solution would be to train the neural network on all of the previously learnt data (that was still relevant) along with the data to learn the new function. This can be an expensive operation because previous datasets (which tend to be very large in deep learning) would need to be stored and retrained. However, if a neural network could adequately perform continual learning, it would only be necessary for it to directly learn on data representing the new function. Furthermore, continual learning is also desirable because it allows the solution for multiple tasks to be compressed into a single network where weights common to both tasks may be shared. This can also benefit the speed at which new tasks are learnt because useful features may already be present in the network.
Our Reinforcement-Pseudo-Rehearsal model (RePR) achieves continual learning in the reinforcement domain. It does so by utilising a dual memory system where a freshly initialised DQN is trained on the new task and then knowledge from this short-term network is transferred to a separate DQN containing long-term knowledge of all previously learnt tasks. A generative network is used to produce states (short sequences of data) representative of previous tasks which can be rehearsed while transferring knowledge of the new task. For each new task, the generative network is trained on pseudo-items produced by the previous generative network, alongside data from the new task. Therefore, the system can prevent CF without the need for a large memory store holding data from previously encountered training examples.
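The data-assembly step described above can be sketched as follows. This is an illustrative sketch under our own assumptions, not the released implementation: `generator` stands in for the generative network, `long_term_q` for the long-term network whose responses label the pseudo-items, and the mixing counts are arbitrary.

```python
# Hypothetical sketch of pseudo-rehearsal batch assembly: real examples from
# the new task are mixed with generated "pseudo-items" whose targets are the
# long-term network's own current responses, so rehearsing them preserves
# previously learnt behaviour. All names here are illustrative.
import random

def make_batch(new_data, generator, long_term_q, n_new, n_pseudo, seed=0):
    rng = random.Random(seed)
    # Real (state, target) pairs from the task currently being learnt.
    batch = [("new", s, y) for s, y in rng.sample(new_data, n_new)]
    # Generated states resembling past tasks, labelled by the long-term network.
    for _ in range(n_pseudo):
        s = generator()
        batch.append(("pseudo", s, long_term_q(s)))
    rng.shuffle(batch)
    return batch
```

Because the pseudo-items' targets come from the network itself, no raw data from previous tasks needs to be stored; only the generative model is kept.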
The reinforcement tasks learnt by RePR are Atari 2600 games. These games are considered complex because the input space of the games is large which currently requires reinforcement learning to use deep neural networks (i.e. deep reinforcement learning). Applying pseudo-rehearsal methods to deep reinforcement learning is challenging because these reinforcement learning methods are notoriously unstable compared to image classification (due to the deadly triad [4]). In part, this is because target values are consistently changing during learning. We have found that using pseudo-rehearsal while learning these non-stationary targets is difficult because it increases the interference between new and old tasks. Furthermore, generative models struggle to produce high quality data resembling these reinforcement learning tasks, which can prevent important task knowledge from being learnt for the first time, as well as relearnt once it is forgotten.
Our RePR model applies pseudo-rehearsal to the difficult domain of deep reinforcement learning. RePR introduces a dual memory model suitable for reinforcement learning. This model is novel compared to previously used dual memory pseudo-rehearsal models in two important aspects. Firstly, the model isolates reinforcement learning to the short-term system, so that the long-term system can use supervised learning (i.e. mean squared error) with fixed target values (converged on by the short-term network), preventing non-stationary target values from increasing the interference between new and old tasks. Importantly, this differs from previous applications of pseudo-rehearsal, where both the short-term and long-term systems learn with the same cross-entropy loss function. Secondly, RePR transfers knowledge between the dual memory system using real samples, rather than those produced by a generative model. This allows tasks to be learnt and retained to a higher performance in reinforcement learning. The source code for RePR can be found at https://bitbucket.org/catk1ns0n/repr_public/.
The main contributions of this paper are:
- the first successful application of pseudo-rehearsal methods to complex deep reinforcement learning tasks;
- above state-of-the-art performance when sequentially learning complex reinforcement tasks, without storing any raw data from previously learnt tasks;
- empirical evidence demonstrating the need for a dual memory system as it facilitates new learning by separating the reinforcement learning system from the continual learning system.
Deep Q-learning
In deep Q-learning [1], the neural network is taught to predict the discounted reward that would be received from taking each of the possible actions given the current state. More specifically, it minimises the loss function $L(\theta) = \mathbb{E}_{(s,a,r,s')}\big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\big]$, where there exist two Q functions: a deep predictor network $Q(\cdot\,;\theta)$ and a deep target network $Q(\cdot\,;\theta^-)$. The predictor's parameters $\theta$ are updated continuously by stochastic gradient descent, while the target network's parameters $\theta^-$ are periodically copied from the predictor.
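The loss above can be illustrated numerically. In this minimal pure-Python sketch, plain dictionaries stand in for the deep predictor and target networks; the names (`q_pred`, `q_targ`, `GAMMA`) are ours, not from the paper's code.

```python
# Sketch of the deep Q-learning loss: squared TD error between the predictor's
# estimate Q(s, a) and the bootstrapped target from the (frozen) target network.
GAMMA = 0.99

def td_target(q_targ, reward, next_state, terminal):
    """r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal states."""
    if terminal:
        return reward
    return reward + GAMMA * max(q_targ[next_state].values())

def q_loss(q_pred, q_targ, batch):
    """Mean squared TD error over (s, a, r, s', terminal) transitions.
    Only q_pred would be updated; q_targ is held fixed."""
    total = 0.0
    for s, a, r, s2, done in batch:
        y = td_target(q_targ, r, s2, done)
        total += (y - q_pred[s][a]) ** 2
    return total / len(batch)

# Toy usage: two states, two actions.
q_pred = {"s0": {"a0": 0.5, "a1": 0.2}, "s1": {"a0": 1.0, "a1": 0.0}}
q_targ = {"s0": {"a0": 0.4, "a1": 0.3}, "s1": {"a0": 0.9, "a1": 0.1}}
batch = [("s0", "a0", 1.0, "s1", False),   # bootstrapped target
         ("s1", "a0", 0.0, "s0", True)]    # terminal: target is just r
loss = q_loss(q_pred, q_targ, batch)
```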
The RePR model
RePR is a dual memory model which uses pseudo-rehearsal with a generative network to achieve sequential learning in reinforcement tasks. The first part of our dual memory model is the short-term memory (STM) system, which serves a role analogous to that of the hippocampus in learning and is used to learn the current task.
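The knowledge-transfer step this dual memory design implies can be sketched with scalar stand-ins. This is our own minimal illustration, not the paper's code: `ltm` is the long-term network, the new-task targets are the short-term DQN's converged outputs (and hence fixed), and the old-task targets are the previous long-term network's responses to generated pseudo-states.

```python
# Minimal sketch (assumed details) of RePR's long-term learning objective:
# plain mean squared error against fixed targets, combining knowledge of the
# new task (short-term network's outputs) with pseudo-rehearsal of old tasks
# (previous long-term network's outputs on generated states).

def distill_loss(ltm, new_states, stm_targets, pseudo_states, old_targets,
                 alpha=0.5):
    """alpha * MSE on new-task states + (1 - alpha) * MSE on pseudo-states.
    ltm maps a state to a predicted value; targets are fixed lookup tables."""
    new_term = sum((ltm(s) - stm_targets[s]) ** 2 for s in new_states) / len(new_states)
    old_term = sum((ltm(s) - old_targets[s]) ** 2 for s in pseudo_states) / len(pseudo_states)
    return alpha * new_term + (1 - alpha) * old_term
```

Because the targets in both terms are fixed, this step is ordinary supervised learning; the non-stationarity of Q-learning stays confined to the short-term system.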
Related work
This section focuses on methods for preventing CF in reinforcement learning, concentrating on how to learn a new policy without forgetting those previously learnt for different tasks. There is substantial related research outside this domain (see [17] for a broad review), predominantly on continual learning in image classification. However, because these methods cannot be directly applied to complex reinforcement learning tasks, we have excluded them from this review.
Method
Our current research applies pseudo-rehearsal to deep Q-learning so that a DQN can be used to learn multiple Atari 2600 games in sequence. All agents select between 18 possible actions representing different combinations of joystick movements and pressing the fire button. Our DQN is based upon [1] with a few minor changes which we found helped the network learn the individual tasks more quickly. The specifics of these changes are described below.
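The 18-action interface mentioned above corresponds to the Atari 2600 full action set: each of the nine joystick positions, with and without the fire button. A small illustrative helper (our naming, not the paper's) enumerates them:

```python
# Enumerate the 18 Atari 2600 actions: 9 joystick positions x fire on/off.
from itertools import product

DIRECTIONS = ["centre", "up", "down", "left", "right",
              "up-left", "up-right", "down-left", "down-right"]

def action_set():
    """Return all (direction, fire) combinations: 9 * 2 = 18 actions."""
    return [(d, fire) for d, fire in product(DIRECTIONS, [False, True])]
```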
RePR performance on CF
The first experiment investigates how well RePR compares to lower and upper baselines. The lower baseline condition does not contain a component to assist in retaining previously learnt tasks. The reh condition is the upper baseline for RePR because it rehearses real items from previously learnt tasks and thus demonstrates how RePR would perform if its GAN could perfectly generate states from previous tasks to rehearse alongside learning the new task.
Discussion
Our experiments have demonstrated RePR to be an effective solution to CF when sequentially learning multiple tasks. To our knowledge, pseudo-rehearsal has not been used until now to successfully prevent CF on complex reinforcement learning tasks. RePR has advantages over popular weight constraint methods, such as EWC, because it does not constrain the network to retain similar weights when learning a new task. This allows the internal layers of the network to change according to new knowledge.
Conclusion
In conclusion, pseudo-rehearsal can be used with deep reinforcement learning methods to achieve continual learning. We have shown that our RePR model can be used to sequentially learn a number of complex reinforcement tasks, without scaling in complexity as the number of tasks increases and without revisiting or storing raw data from past tasks. Pseudo-rehearsal has major benefits over weight constraint methods as it is less restrictive on the network, and this is supported by our experimental results.
CRediT authorship contribution statement
Craig Atkinson: Conceptualization, Methodology, Software, Investigation, Writing - original draft, Writing - review & editing, Visualization. Brendan McCane: Supervision, Writing - review & editing. Lech Szymanski: Supervision, Writing - review & editing. Anthony Robins: Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X GPU used for this research. We also wish to acknowledge the use of New Zealand eScience Infrastructure (NeSI) high performance computing facilities. New Zealand’s national facilities are provided by NeSI and funded jointly by NeSI’s collaborator institutions and through the Ministry of Business, Innovation & Employment’s Research Infrastructure programme. URL https://www.nesi.org.nz.
References (41)
- K. Louie, M.A. Wilson, Temporally structured replay of awake hippocampal ensemble activity during rapid eye movement sleep, Neuron (2001).
- G.I. Parisi et al., Continual lifelong learning with neural networks: A review, Neural Networks (2019).
- W.C. Abraham, A. Robins, Memory retention - the synaptic stability versus plasticity dilemma, Trends in Neurosciences (2005).
- V. Mnih et al., Human-level control through deep reinforcement learning, Nature (2015).
- M. McCloskey, N.J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in: ...
- J. Kirkpatrick et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences (2017).
- R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (2nd ed.), complete draft ...
- D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, in: Advances in Neural Information ...
- S.-A. Rebuffi et al., iCaRL: Incremental classifier and representation learning, IEEE Conference on Computer Vision and Pattern Recognition (2017).
- Generative knowledge distillation for general purpose function compression, Neural Information Processing Systems Workshop on Teaching Machines, Robots, and Humans (2017).
- A. Robins, Catastrophic forgetting, rehearsal and pseudorehearsal, Connection Science.
- Sleep transforms the cerebral trace of declarative memories, Proceedings of the National Academy of Sciences.
- Check regularization: Combining modularity and elasticity for memory consolidation.
Craig Atkinson received his B.Sc. (Hons.) from the University of Otago, Dunedin, New Zealand, in 2017. He has just completed his doctorate in Computer Science at the University of Otago. His research interests include deep reinforcement learning and continual learning.
Brendan McCane received the B.Sc. (Hons.) and Ph.D. degrees from the James Cook University of North Queensland, Townsville City, Australia, in 1991 and 1996, respectively. He joined the Computer Science Department, University of Otago, Otago, New Zealand, in 1997. He served as the Head of the Department from 2007 to 2012. His current research interests include computer vision, pattern recognition, machine learning, and medical and biological imaging. He also enjoys reading, swimming, fishing and long walks on the beach with his dogs.
Lech Szymanski received the B.A.Sc. (Hons.) degree in computer engineering and the M.A.Sc. degree in electrical engineering from the University of Ottawa, Ottawa, ON, Canada, in 2001 and 2005, respectively, and the Ph.D. degree in computer science from the University of Otago, Otago, New Zealand, in 2012. He is currently a Lecturer at the Computer Science Department at the University of Otago. His research interests include machine learning, artificial neural networks, and deep architectures.
Anthony Robins completed his doctorate in cognitive science at the University of Sussex (UK) in 1989. He is currently a Professor of Computer Science at the University of Otago, New Zealand. His research interests include artificial neural networks, computational models of memory, and computer science education.