
Approximate policy iteration: a survey and some new methods


Abstract

We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation when done by the projected equation/TD approach may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.
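As a concrete illustration of the "matrix inversion" style of policy evaluation mentioned in the abstract, the sketch below implements the standard LSTD(0) estimator for a fixed policy with a linear cost approximation J ≈ Φr: it accumulates sample estimates of the matrix and vector of the projected Bellman equation from a simulated trajectory and then solves the resulting linear system. This is a minimal sketch under stated assumptions, not the paper's implementation; the function name lstd0_policy_evaluation, the (state, cost, next_state) sample format, the small ridge term (a crude stand-in for the regression/regularization remedy discussed in the abstract), and the toy two-state chain in the usage example are all hypothetical choices made for illustration.

```python
import numpy as np

def lstd0_policy_evaluation(trajectory, phi, gamma, ridge=1e-6):
    """Estimate r in J(s) ≈ phi(s)·r for a fixed policy via LSTD(0) (illustrative sketch)."""
    trajectory = list(trajectory)
    n = len(phi(trajectory[0][0]))
    A = np.zeros((n, n))  # sample estimate of C = Phi' Xi (I - gamma P) Phi
    b = np.zeros(n)       # sample estimate of d = Phi' Xi g
    for s, cost, s_next in trajectory:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += cost * f
    # "Matrix inversion" step: solve the sampled projected Bellman equation.
    # The ridge term guards against nearly singular systems (a crude stand-in
    # for the regression/regularization method surveyed in the paper).
    return np.linalg.solve(A + ridge * np.eye(n), b)

# Tiny usage example: a two-state Markov chain with one indicator feature per
# state (tabular case), so LSTD should approximately recover J = (I - gamma P)^{-1} g.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = np.array([[0.9, 0.1], [0.2, 0.8]])  # transition probabilities under the policy
    g = np.array([1.0, 0.0])                # expected one-stage cost per state
    phi = lambda s: np.eye(2)[s]            # indicator features (hypothetical)
    s, samples = 0, []
    for _ in range(5000):
        s_next = int(rng.choice(2, p=P[s]))
        samples.append((s, g[s], s_next))
        s = s_next
    print(lstd0_policy_evaluation(samples, phi, gamma=0.95))
```

An iterative, LSPE-style counterpart would, roughly speaking, replace the single solve with repeated regularized least-squares updates as samples arrive; that variant is omitted here.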




Additional information

This work was supported by the National Science Foundation (No. ECCS-0801549), the LANL Information Science and Technology Institute, and the Air Force (No. FA9550-10-1-0412).

Dimitri P. BERTSEKAS studied engineering at the National Technical University of Athens, Greece, obtained his M.S. degree in Electrical Engineering at the George Washington University, Washington, D.C., in 1969, and his Ph.D. in System Science in 1971 at the Massachusetts Institute of Technology. He has held faculty positions with the Engineering-Economic Systems Department, Stanford University (1971–1974) and the Electrical Engineering Department of the University of Illinois, Urbana (1974–1979). Since 1979 he has been teaching at the Electrical Engineering and Computer Science Department of the Massachusetts Institute of Technology (MIT), where he is currently McAfee Professor of Engineering. He consults regularly with private industry and has held editorial positions in several journals. His research at MIT spans several fields, including optimization, control, large-scale computation, and data communication networks, and is closely tied to his teaching and book authoring activities. He has written numerous research papers and fourteen books, several of which are used as textbooks in MIT classes.

He was awarded the INFORMS 1997 Prize for Research Excellence in the Interface Between Operations Research and Computer Science for his book ‘Neuro-dynamic Programming’ (coauthored with John Tsitsiklis), the 2000 Greek National Award for Operations Research, the 2001 ACC John R. Ragazzini Education Award, and the 2009 INFORMS Expository Writing Award. In 2001, he was elected to the United States National Academy of Engineering.


About this article

Cite this article

Bertsekas, D.P. Approximate policy iteration: a survey and some new methods. J. Control Theory Appl. 9, 310–335 (2011). https://doi.org/10.1007/s11768-011-1005-3
