ABSTRACT
We consider the task of reinforcement learning in an environment in which rare significant events occur independently of the actions selected by the controlling agent. If these events are sampled according to their natural probability of occurring, conventional reinforcement learning algorithms are likely to converge slowly and to produce high-variance estimates. In this work, we assume access to a simulator in which the rare-event probabilities can be artificially altered. Importance sampling can then be used to correct for the altered sampling distribution when learning from the simulation data. We introduce algorithms for policy evaluation, using both tabular and function approximation representations of the value function, and prove that in both cases the algorithms converge. In the tabular case, we also analyze the bias and variance of our approach compared to TD-learning. We empirically evaluate the performance of the algorithms on random Markov Decision Processes, as well as on a large network planning task.
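To make the approach concrete, the sketch below shows one plausible instantiation of the tabular policy-evaluation case: a TD(0) update in which each simulated transition is reweighted by the likelihood ratio between the natural and the artificially inflated rare-event probabilities. This is an illustrative reading of the abstract, not the paper's exact algorithm; the `simulate_step` interface, and the assumption of a single action-independent rare event per transition, are introduced here purely for the example.

```python
import numpy as np

def rare_event_td0(simulate_step, n_states, p_nat, p_sim,
                   gamma=0.95, alpha=0.05, n_steps=100_000, seed=0):
    """Tabular TD(0) with an importance-sampling correction.

    Sketch only: assumes one action-independent rare event per step,
    occurring with natural probability p_nat but simulated with an
    inflated probability p_sim. `simulate_step(state, rng, p_sim)` is
    a hypothetical interface returning (next_state, reward, event).
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)
    s = 0  # arbitrary start state of the simulated chain
    for _ in range(n_steps):
        s_next, r, event = simulate_step(s, rng, p_sim)
        # Likelihood ratio of the rare-event outcome under the natural
        # dynamics versus the simulator's altered dynamics.
        rho = p_nat / p_sim if event else (1.0 - p_nat) / (1.0 - p_sim)
        # Weighted TD(0) update: in expectation under the simulator,
        # rho * (r + gamma * V[s_next]) equals the TD target under the
        # natural (unaltered) rare-event probability.
        V[s] += alpha * (rho * (r + gamma * V[s_next]) - V[s])
        s = s_next
    return V
```

The design choice worth noting is that only the target is reweighted: inflating `p_sim` makes the rare event common in simulation, so its contribution to the value estimate is observed often, while `rho` removes the resulting bias. Pushing `p_sim` far above `p_nat` reduces the variance caused by rarely observing the event, but increases the variance of the weights themselves, so the two effects must be balanced.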
Index Terms
- Reinforcement learning in the presence of rare events