Abstract
The goal of this paper is two-fold. First, we present a sensitivity point of view on the optimization of Markov systems. We show that Markov decision processes (MDPs) and the policy-gradient approach, or perturbation analysis (PA), can be derived easily from two fundamental sensitivity formulas, and that such formulas can be constructed flexibly, from first principles, with performance potentials as building blocks. Second, with this sensitivity view we propose an event-based optimization approach, comprising event-based sensitivity analysis and event-based policy iteration. This approach exploits the special features of a system characterized by events, and illustrates how the potentials can be aggregated using these features and how the aggregated potentials can be used in policy iteration. Compared with the traditional MDP approach, the event-based approach has several advantages: the number of aggregated potentials may scale with the system size even though the number of states grows exponentially in it, which reduces the policy space and saves computation; the approach does not require actions at different states to be independent; and it exploits the special features of a system and does not need the exact transition probability matrix. The main ideas of the approach are illustrated by an admission control problem.
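The abstract's building blocks, performance potentials obtained from the Poisson equation and policy iteration driven by them, can be sketched for a plain (non-event-based) MDP as follows. This is an illustrative outline only, not the paper's method: the function names (`potentials`, `policy_iteration`) and the tiny two-action example are our own, and the event-based aggregation itself is not shown.

```python
import numpy as np

def potentials(P, f):
    """Average reward eta and performance potentials g of an ergodic
    chain with transition matrix P and reward vector f, via the
    Poisson equation (I - P) g + eta * e = f."""
    n = len(f)
    # stationary distribution: pi P = pi, pi e = 1
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    eta = float(pi @ f)
    # one standard solution of the Poisson equation:
    # g = (I - P + e pi)^{-1} f
    g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), f)
    return eta, g

def policy_iteration(trans, reward, n_states, n_actions, max_iter=100):
    """trans[a][i]: next-state distribution from state i under action a;
    reward[a][i]: one-step reward.  Evaluate the current policy's
    potentials, then improve greedily against them."""
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iter):
        P = np.array([trans[policy[i]][i] for i in range(n_states)])
        f = np.array([reward[policy[i]][i] for i in range(n_states)])
        eta, g = potentials(P, f)
        # improvement: at state i pick a maximizing
        # f(i, a) + sum_j p(j | i, a) g(j)
        new_policy = np.array([
            max(range(n_actions),
                key=lambda a: reward[a][i] + trans[a][i] @ g)
            for i in range(n_states)
        ])
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, eta, g
```

The improvement step compares actions through the potentials of the states they lead to; in the event-based approach of the paper, the same comparison is made with potentials aggregated over the states at which an event can occur.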
Additional information
Supported in part by a grant from Hong Kong UGC.
Cite this article
Cao, XR. Basic Ideas for Event-Based Optimization of Markov Systems. Discrete Event Dyn Syst 15, 169–197 (2005). https://doi.org/10.1007/s10626-004-6211-4