Abstract
The actor-critic algorithm of Barto and others for simulation-based optimization of Markov decision processes is cast as a two-time-scale stochastic approximation. Its convergence is analysed, approximation issues are discussed, and an illustrative example is presented.
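The two-time-scale structure referred to in the abstract, a critic updated with larger step sizes than the actor so that the actor effectively sees a quasi-converged critic, can be sketched in a few lines. The following is a minimal illustrative sketch only, not the paper's algorithm: the two-state MDP, the softmax policy parametrisation, and the step-size choices a_n = n^{-0.6} (critic, fast) and b_n = n^{-1} (actor, slow, with b_n/a_n → 0) are all assumptions made for the example.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative, not from the paper):
# P[a][s, s'] are transition probabilities, R[s, a] are rewards.
rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
gamma = 0.9

V = np.zeros(2)           # critic: state-value estimates
theta = np.zeros((2, 2))  # actor: action preferences (softmax policy)

def policy(s):
    """Softmax policy over the action preferences in state s."""
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for n in range(1, 200001):
    a_n = 1.0 / n ** 0.6   # fast (critic) step size
    b_n = 1.0 / n          # slow (actor) step size; b_n / a_n -> 0
    pi = policy(s)
    a = rng.choice(2, p=pi)
    s_next = rng.choice(2, p=P[a][s])
    delta = R[s, a] + gamma * V[s_next] - V[s]   # TD error
    V[s] += a_n * delta                          # critic moves on the fast scale
    grad = -pi
    grad[a] += 1.0                               # grad of log pi(a|s) w.r.t. theta[s]
    theta[s] += b_n * delta * grad               # actor moves on the slow scale
    s = s_next
```

Because b_n/a_n → 0, the critic tracks the value of the current (slowly varying) policy, which is the singular-perturbation viewpoint underlying the convergence analysis. In this toy MDP the learned policy comes to favour the rewarded action in each state.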
References
Abounadi J, Bertsekas D, Borkar V 1996a ODE analysis of stochastic algorithms involving sup-norm non-expansive maps (preprint)
Abounadi J, Bertsekas D, Borkar V 1996b Q-learning algorithms for average cost problems (preprint)
Barto A, Sutton R, Anderson C 1983 Neuron-like elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 13: 835–846
Barto A, Bradtke S, Singh S 1995 Learning to act using real-time dynamic programming. Artif. Intell. (Special Issue on Computational Theories of Interaction and Agency) 72: 81–138
Benveniste A, Metivier M, Priouret P 1990 Adaptive algorithms and stochastic approximations (Berlin-Heidelberg: Springer-Verlag)
Bertsekas D, Tsitsiklis J 1989 Parallel and distributed computation: Numerical methods (Englewood Cliffs, NJ: Prentice Hall)
Borkar V 1994 Asynchronous stochastic approximation. SIAM J. Control Optim. (to appear)
Borkar V 1996 Stochastic approximation with two time scales. Syst. Control Lett. 29: 291–294
Brandière O, Duflo M 1996 Les algorithmes stochastiques contournent-ils les pièges? Ann. Inst. Henri Poincaré 32: 395–427
Chazan D, Miranker W 1969 Chaotic relaxation. Linear Algebra Appl. 2: 199–222
Hirsch M 1989 Convergent activation dynamics in continuous time networks. Neural Networks 2: 331–349
Keerthi S S, Ravindran B 1994 A tutorial survey of reinforcement learning. Sādhanā 19: 851–889
Konda V 1996 Learning algorithms for Markov decision processes. Master's thesis, Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore
Kushner H, Clark D 1978 Stochastic approximation for constrained and unconstrained systems (New York: Springer-Verlag)
Neveu J 1975 Discrete parameter martingales (Amsterdam: North Holland)
Pemantle R 1990 Non-convergence to unstable points in urn models and stochastic approximations. Ann. Probab. 18: 698–712
Polyak B 1990 New method of stochastic approximation type. Autom. Remote Control 51: 937–946
Puterman M 1994 Markov decision processes (New York: John Wiley)
Santharam G, Sastry P S 1997 A reinforcement learning neural network for adaptive control of Markov chains. IEEE Trans. Syst. Man Cybern. 27: 588–600
Schäl M 1987 Estimation and control of discounted dynamic programming. Stochastics 20: 51–71
Schweitzer P, Seidman A 1985 Generalized polynomial approximations in Markovian decision processes. J. Math. Anal. Appl. 110: 568–582
Tsitsiklis J 1994 Asynchronous stochastic approximation and Q-learning. Mach. Learning 16: 185–202
Tsitsiklis J, Van Roy B 1996 Feature-based methods for large scale dynamic programming. Mach. Learning 22: 59–94
Walrand J 1988 Introduction to queueing networks (Englewood Cliffs, NJ: Prentice Hall)
Watkins C 1989 Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, England
Watkins C, Dayan P 1992 Q-learning. Mach. Learning 8: 279–292
Williams R, Baird L III 1990 A mathematical analysis of actor-critic architectures for learning optimal controls through incremental dynamic programming. In Proc. Sixth Yale Workshop on Adaptive and Learning Systems, New Haven, CT, pp 96–101
Yoshizawa T 1966 Stability theory by Liapunov's second method (Tokyo: Mathematical Society of Japan)
Cite this article
Borkar, V.S., Konda, V.R. The actor-critic algorithm as multi-time-scale stochastic approximation. Sadhana 22, 525–543 (1997). https://doi.org/10.1007/BF02745577