Abstract
Many reinforcement learning approaches can be formulated using the theory of Markov decision processes and the associated method of dynamic programming (DP). The value of this theoretical understanding, however, is tempered by many practical concerns. One important question is whether DP-based approaches that use function approximation rather than lookup tables can avoid catastrophic effects on performance. This note presents a result of Bertsekas (1987) which guarantees that small errors in the approximation of a task's optimal value function cannot produce arbitrarily bad performance when actions are selected by a greedy policy. We derive an upper bound on performance loss that is slightly tighter than that in Bertsekas (1987), and we show the extension of the bound to Q-learning (Watkins, 1989). These results provide a partial theoretical rationale for the approximation of value functions, an issue of great practical importance in reinforcement learning.
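The bound itself is compact enough to state here. As a hedged sketch (the notation below is assumed for exposition, not quoted verbatim from the paper): if an approximation V̂ of the optimal value function V* is within ε of V* in the sup norm, in a discounted MDP with discount factor γ in [0, 1), and π is the greedy policy with respect to V̂, then the loss bound takes the form

```latex
% Sketch of the bound's standard form; \hat{V} approximates V^*,
% \pi is greedy with respect to \hat{V}, \gamma \in [0,1) is the discount factor.
\|\hat{V} - V^*\|_\infty \le \varepsilon
\quad \Longrightarrow \quad
V^*(s) - V^{\pi}(s) \;\le\; \frac{2\gamma\varepsilon}{1-\gamma}
\quad \text{for all states } s,
```

and an analogue of the same O(ε/(1−γ)) form covers greedy action selection from an approximate Q-function, which is the Q-learning case the abstract mentions.

The bound is also easy to probe numerically. The following is a minimal sketch under assumed conditions (a small randomly generated MDP; every name and parameter here is illustrative, not from the paper): it computes V* by value iteration, perturbs it by noise bounded by ε, acts greedily with respect to the perturbed values, and compares the resulting loss against 2γε/(1−γ).

```python
import numpy as np

# Illustrative toy MDP (not from the paper): nS states, nA actions,
# random transition kernel P[s, a, s'] and rewards R[s, a].
rng = np.random.default_rng(0)
nS, nA, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # each row sums to 1 over s'
R = rng.uniform(0.0, 1.0, size=(nS, nA))

def value_iteration(tol=1e-12):
    """Compute V* by iterating the Bellman optimality operator."""
    V = np.zeros(nS)
    while True:
        V_new = (R + gamma * P @ V).max(axis=1)  # P @ V has shape (nS, nA)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_value(pi):
    """Exact value of a deterministic policy: solve (I - gamma P_pi) V = R_pi."""
    P_pi = P[np.arange(nS), pi]
    R_pi = R[np.arange(nS), pi]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

V_star = value_iteration()
eps = 0.05
V_hat = V_star + rng.uniform(-eps, eps, size=nS)  # ||V_hat - V*||_inf <= eps
pi_greedy = np.argmax(R + gamma * P @ V_hat, axis=1)
loss = np.max(V_star - policy_value(pi_greedy))
print(f"worst-case loss {loss:.6f} vs. bound {2 * gamma * eps / (1 - gamma):.6f}")
```

On any such instance the printed loss should fall at or below the bound; tightness requires specially constructed examples, so random MDPs typically leave a large gap.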
References
Anderson, C.W. (1986). Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, Department of Computer and Information Science, University of Massachusetts, Amherst, MA 01003.
Barto, A.G., Bradtke, S.J., and Singh, S.P. (1991). Real-time learning and control using asynchronous dynamic programming. Technical Report TR-91-57, Department of Computer Science, University of Massachusetts.
Barto, A.G., Sutton, R.S., and Anderson, C.W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13 (5), 834–846.
Barto, A.G., Sutton, R.S., and Watkins, C.J.C.H. (1990). Learning and sequential decision making. In M. Gabriel and J. Moore (Eds.), Learning and Computational Neuroscience: Foundations of Adaptive Networks, chapter 13. Cambridge, MA: Bradford Books/MIT Press.
Bertsekas, D.P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs, NJ: Prentice Hall.
Bradtke, S.J. (1993). Reinforcement learning applied to linear quadratic regulation. In S.J. Hanson, J.D. Cowan, and C.L. Giles (Eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann.
Porteus, E. (1971). Some bounds for discounted sequential decision processes. Management Science, 19, 7–11.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In B.W. Porter and R.H. Mooney (Eds.), Machine Learning: Proceedings of the Seventh International Conference (ML90) (pp. 216–224). San Mateo, CA: Morgan Kaufmann.
Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8 (3/4), 257–277.
Watkins, C.J.C.H. and Dayan, P. (1992). Q-learning. Machine Learning, 8 (3/4), 279–292.
Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, King's College, University of Cambridge, Cambridge, England.
Werbos, P.J. (1987). Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17 (1), 7–20.
Williams, R.J. and Baird, L.C. (1993). Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems. Technical Report NU-CCS-93-11, Northeastern University, College of Computer Science, Boston, MA 02115.
Cite this article
Singh, S.P., Yee, R.C. An Upper Bound on the Loss from Approximate Optimal-Value Functions. Machine Learning 16, 227–233 (1994). https://doi.org/10.1023/A:1022693225949