
Approximate policy iteration: a survey and some new methods


Abstract

We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation when done by the projected equation/TD approach may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.
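As a concrete illustration of the "matrix inversion" style of policy evaluation mentioned in the abstract, the sketch below implements the standard LSTD(0) estimator for a fixed policy with a linear cost approximation J ≈ Φr: it accumulates sample estimates of the matrix and vector of the projected Bellman equation from a simulated trajectory and then solves the resulting linear system. This is a minimal sketch under stated assumptions, not the paper's implementation; the function name lstd0_policy_evaluation, the (state, cost, next_state) sample format, the small ridge term (a crude stand-in for the regression/regularization remedy discussed in the abstract), and the toy two-state chain in the usage example are all hypothetical choices made for illustration.

```python
import numpy as np

def lstd0_policy_evaluation(trajectory, phi, gamma, ridge=1e-6):
    """Estimate r in J(s) ≈ phi(s)·r for a fixed policy via LSTD(0) (illustrative sketch)."""
    trajectory = list(trajectory)
    n = len(phi(trajectory[0][0]))
    A = np.zeros((n, n))  # sample estimate of C = Phi' Xi (I - gamma P) Phi
    b = np.zeros(n)       # sample estimate of d = Phi' Xi g
    for s, cost, s_next in trajectory:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += cost * f
    # "Matrix inversion" step: solve the sampled projected Bellman equation.
    # The ridge term guards against nearly singular systems (a crude stand-in
    # for the regression/regularization method surveyed in the paper).
    return np.linalg.solve(A + ridge * np.eye(n), b)

# Tiny usage example: a two-state Markov chain with one indicator feature per
# state (tabular case), so LSTD should approximately recover J = (I - gamma P)^{-1} g.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = np.array([[0.9, 0.1], [0.2, 0.8]])  # transition probabilities under the policy
    g = np.array([1.0, 0.0])                # expected one-stage cost per state
    phi = lambda s: np.eye(2)[s]            # indicator features (hypothetical)
    s, samples = 0, []
    for _ in range(5000):
        s_next = int(rng.choice(2, p=P[s]))
        samples.append((s, g[s], s_next))
        s = s_next
    print(lstd0_policy_evaluation(samples, phi, gamma=0.95))
```

An iterative, LSPE-style counterpart would, roughly speaking, replace the single solve with repeated regularized least-squares updates as samples arrive; that variant is omitted here.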




Additional information

This work was supported by the National Science Foundation (No. ECCS-0801549), the LANL Information Science and Technology Institute, and the Air Force (No. FA9550-10-1-0412).

Dimitri P. BERTSEKAS studied engineering at the National Technical University of Athens, Greece, obtained his M.S. degree in Electrical Engineering at the George Washington University, Washington, D.C., in 1969, and his Ph.D. in System Science in 1971 at the Massachusetts Institute of Technology. He has held faculty positions with the Engineering-Economic Systems Department, Stanford University (1971–1974) and the Electrical Engineering Department of the University of Illinois, Urbana (1974–1979). Since 1979 he has been teaching at the Electrical Engineering and Computer Science Department of the Massachusetts Institute of Technology (MIT), where he is currently McAfee Professor of Engineering. He consults regularly with private industry and has held editorial positions in several journals. His research at MIT spans several fields, including optimization, control, large-scale computation, and data communication networks, and is closely tied to his teaching and book authoring activities. He has written numerous research papers and fourteen books, several of which are used as textbooks in MIT classes.

He was awarded the INFORMS 1997 Prize for Research Excellence in the Interface Between Operations Research and Computer Science for his book ‘Neuro-dynamic Programming’ (coauthored with John Tsitsiklis), the 2000 Greek National Award for Operations Research, the 2001 ACC John R. Ragazzini Education Award, and the 2009 INFORMS Expository Writing Award. In 2001, he was elected to the United States National Academy of Engineering.


About this article

Cite this article

Bertsekas, D.P. Approximate policy iteration: a survey and some new methods. J. Control Theory Appl. 9, 310–335 (2011). https://doi.org/10.1007/s11768-011-1005-3
