The actor-critic algorithm as multi-time-scale stochastic approximation

  • Special Issue On Optimization
Sadhana

Abstract

The actor-critic algorithm of Barto and others for simulation-based optimization of Markov decision processes is cast as a two-time-scale stochastic approximation. Convergence analysis, approximation issues, and an example are studied.

References

  • Abounadi J, Bertsekas D, Borkar V 1996a ODE analysis of stochastic algorithms involving supnorm non-expansive maps (preprint)

  • Abounadi J, Bertsekas D, Borkar V 1996b Q-learning algorithms for average cost problems (preprint)

  • Barto A, Sutton R, Anderson C 1983 Neuron-like elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 13: 835–846

  • Barto A, Bradtke S, Singh S 1995 Learning to act using real-time dynamic programming. Artif. Intell. (Special Issue on Computational Theories of Interaction and Agency) 72: 81–138

  • Benveniste A, Metivier M, Priouret P 1990 Adaptive algorithms and stochastic approximations (Berlin-Heidelberg: Springer-Verlag)

  • Bertsekas D, Tsitsiklis J 1989 Parallel and distributed computation: Numerical methods (Englewood Cliffs, NJ: Prentice Hall)

  • Borkar V 1994 Asynchronous stochastic approximation. SIAM J. Control Optimization (to appear)

  • Borkar V 1996 Stochastic approximation with two time scales. Syst. Control Lett. 29: 291–294

  • Brandière O, Duflo M 1996 Les algorithmes stochastiques contournent-ils les pièges? Ann. Inst. Henri Poincaré 32: 395–427

  • Chazan D, Miranker W 1969 Chaotic relaxation. Linear Algebra Appl. 2: 199–222

  • Hirsch M 1989 Convergent activation dynamics in continuous time networks. Neural Networks 2: 331–349

  • Keerthi S S, Ravindran B 1994 A tutorial survey of reinforcement learning. Sādhanā 19: 851–889

  • Konda V 1996 Learning algorithms for Markov decision processes. Master’s thesis, Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore

  • Kushner H, Clark D 1978 Stochastic approximation for constrained and unconstrained systems (New York: Springer-Verlag)

  • Neveu J 1975 Discrete parameter martingales (Amsterdam: North Holland)

  • Pemantle R 1990 Non-convergence to unstable points in urn models and stochastic approximations. Ann. Probab. 18: 698–712

  • Polyak B 1990 New method of stochastic approximation type. Autom. Remote Control 51: 937–946

  • Puterman M 1994 Markov decision processes (New York: John Wiley)

  • Santharam G, Sastry P S 1997 A reinforcement learning neural network for adaptive control of Markov chains. IEEE Trans. Syst. Man Cybern. 27: 588–600

  • Schäl M 1987 Estimation and control of discounted dynamic programming. Stochastics 20: 51–71

  • Schweitzer P, Seidmann A 1985 Generalized polynomial approximations in Markovian decision processes. J. Math. Anal. Appl. 110: 568–582

  • Tsitsiklis J 1994 Asynchronous stochastic approximation and Q-learning. Mach. Learning 16: 185–202

  • Tsitsiklis J, Van Roy B 1996 Feature-based methods for large scale dynamic programming. Mach. Learning 22: 59–94

  • Walrand J 1988 Introduction to queueing networks (Englewood Cliffs, NJ: Prentice Hall)

  • Watkins C 1989 Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, England

  • Watkins C, Dayan P 1992 Q-learning. Mach. Learning 8: 279–292

  • Williams R, Baird L III 1990 A mathematical analysis of actor-critic architectures for learning optimal controls through incremental dynamic programming. In Proc. Sixth Yale Workshop on Adaptive and Learning Systems, New Haven, CT, pp 96–101

  • Yoshizawa T 1966 Stability theory by Liapunov’s second method (Tokyo: Mathematical Society of Japan)

Cite this article

Borkar, V.S., Konda, V.R. The actor-critic algorithm as multi-time-scale stochastic approximation. Sadhana 22, 525–543 (1997). https://doi.org/10.1007/BF02745577

  • DOI: https://doi.org/10.1007/BF02745577
