Abstract
The actor-critic algorithm of Barto and others for simulation-based optimization of Markov decision processes is cast as a two-time-scale stochastic approximation. Its convergence is analysed, approximation issues are discussed, and an illustrative example is presented.
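The two-time-scale structure referred to in the abstract, a critic updated with larger step sizes than the actor so that the actor effectively sees a quasi-converged critic, can be sketched in a few lines. The following is a minimal illustrative sketch only, not the paper's algorithm: the two-state MDP, the softmax policy parametrisation, and the step-size choices a_n = n^{-0.6} (critic, fast) and b_n = n^{-1} (actor, slow, with b_n/a_n → 0) are all assumptions made for the example.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative, not from the paper):
# P[a][s, s'] are transition probabilities, R[s, a] are rewards.
rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
gamma = 0.9

V = np.zeros(2)           # critic: state-value estimates
theta = np.zeros((2, 2))  # actor: action preferences (softmax policy)

def policy(s):
    """Softmax policy over the action preferences in state s."""
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for n in range(1, 200001):
    a_n = 1.0 / n ** 0.6   # fast (critic) step size
    b_n = 1.0 / n          # slow (actor) step size; b_n / a_n -> 0
    pi = policy(s)
    a = rng.choice(2, p=pi)
    s_next = rng.choice(2, p=P[a][s])
    delta = R[s, a] + gamma * V[s_next] - V[s]   # TD error
    V[s] += a_n * delta                          # critic moves on the fast scale
    grad = -pi
    grad[a] += 1.0                               # grad of log pi(a|s) w.r.t. theta[s]
    theta[s] += b_n * delta * grad               # actor moves on the slow scale
    s = s_next
```

Because b_n/a_n → 0, the critic tracks the value of the current (slowly varying) policy, which is the singular-perturbation viewpoint underlying the convergence analysis. In this toy MDP the learned policy comes to favour the rewarded action in each state.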
References
Abounadi J, Bertsekas D, Borkar V 1996a ODE analysis of stochastic algorithms involving sup-norm non-expansive maps (preprint)
Abounadi J, Bertsekas D, Borkar V 1996b Q-learning algorithms for average cost problems (preprint)
Barto A, Sutton R, Anderson C 1983 Neuron-like elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 13: 835–846
Barto A, Bradtke S, Singh S 1995 Learning to act using real-time dynamic programming. Artif. Intell. (Special Issue on Computational Theories of Interaction and Agency) 72: 81–138
Benveniste A, Metivier M, Priouret P 1990 Adaptive algorithms and stochastic approximations (Berlin-Heidelberg: Springer-Verlag)
Bertsekas D, Tsitsiklis J 1989 Parallel and distributed computation: Numerical methods (Englewood Cliffs, NJ: Prentice Hall)
Borkar V 1994 Asynchronous stochastic approximation. SIAM J. Control Optim. (to appear)
Borkar V 1996 Stochastic approximation with two time scales. Syst. Control Lett. 29: 291–294
Brandière O, Duflo M 1996 Les algorithmes stochastiques contournent-ils les pièges? Ann. Inst. Henri Poincaré 32: 395–427
Chazan D, Miranker W 1969 Chaotic relaxation. Linear Algebra Appl. 2: 199–222
Hirsch M 1989 Convergent activation dynamics in continuous time networks. Neural Networks 2: 331–349
Keerthi S S, Ravindran B 1994 A tutorial survey of reinforcement learning. Sādhanā 19: 851–889
Konda V 1996 Learning algorithms for Markov decision processes. Master's thesis, Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore
Kushner H, Clark D 1978 Stochastic approximation for constrained and unconstrained systems (New York: Springer-Verlag)
Neveu J 1975 Discrete parameter martingales (Amsterdam: North Holland)
Pemantle R 1990 Non-convergence to unstable points in urn models and stochastic approximations. Ann. Probab. 18: 698–712
Polyak B 1990 New method of stochastic approximation type. Autom. Remote Control 51: 937–946
Puterman M 1994 Markov decision processes (New York: John Wiley)
Santharam G, Sastry P S 1997 A reinforcement learning neural network for adaptive control of Markov chains. IEEE Trans. Syst. Man Cybern. 27: 588–600
Schäl M 1987 Estimation and control of discounted dynamic programming. Stochastics 20: 51–71
Schweitzer P, Seidman A 1985 Generalized polynomial approximations in Markovian decision processes. J. Math. Anal. Appl. 110: 568–582
Tsitsiklis J 1994 Asynchronous stochastic approximation and Q-learning. Mach. Learning 16: 185–202
Tsitsiklis J, Van Roy B 1996 Feature-based methods for large scale dynamic programming. Mach. Learning 22: 59–94
Walrand J 1988 Introduction to queueing networks (Englewood Cliffs, NJ: Prentice Hall)
Watkins C 1989 Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, England
Watkins C, Dayan P 1992 Q-learning. Mach. Learning 8: 279–292
Williams R, Baird L III 1990 A mathematical analysis of actor-critic architectures for learning optimal controls through incremental dynamic programming. In Proc. Sixth Yale Workshop on Adaptive and Learning Systems, New Haven, CT, pp 96–101
Yoshizawa T 1966 Stability theory by Liapunov's second method (Tokyo: Mathematical Society of Japan)
Cite this article
Borkar, V.S., Konda, V.R. The actor-critic algorithm as multi-time-scale stochastic approximation. Sadhana 22, 525–543 (1997). https://doi.org/10.1007/BF02745577