
Combining expert advice in reactive environments

Published: 01 September 2006

Abstract

“Experts algorithms” constitute a methodology for choosing actions repeatedly, when the rewards depend both on the choice of action and on the unknown current state of the environment. An experts algorithm has access to a set of strategies (“experts”), each of which may recommend which action to choose. The algorithm learns how to combine the recommendations of individual experts so that, in the long run, for any fixed sequence of states of the environment, it does as well as the best expert would have done relative to the same sequence. This methodology may not be suitable for situations where the evolution of the environment's state depends on past chosen actions, as is usually the case, for example, in a repeated non-zero-sum game.

A general exploration-exploitation experts method is presented, along with a proper definition of value. The definition is shown to be adequate in that it both captures the impact of an expert's actions on the environment and is learnable. The new experts method is quite different from previously proposed experts algorithms. It represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player's actions on the environment. The importance of this shift is demonstrated by the fact that the algorithm is capable of inducing cooperation in the repeated Prisoner's Dilemma game, whereas previous experts algorithms converge to suboptimal, non-cooperative play. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases. Complexity and performance bounds are proven.
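The algorithm itself is specified in the body of the article. As a rough, non-authoritative illustration of the underlying idea, the sketch below follows each available expert for a long phase against a reactive opponent (Tit-for-Tat in the repeated Prisoner's Dilemma) and estimates its long-run average reward, so that the expert's effect on the environment is reflected in its estimated value. This is not the authors' exact procedure; all names and parameters (TitForTatOpponent, phase_len, the two example experts) are illustrative assumptions.

```python
# Illustrative sketch of phase-based ("explore then exploit") expert evaluation
# in a reactive environment. NOT the authors' exact algorithm; it only shows why
# estimating an expert's value by following it for a long phase (letting the
# environment react) can favor cooperation in the repeated Prisoner's Dilemma,
# where myopic regret minimization converges to mutual defection.

C, D = "C", "D"

# Row player's Prisoner's Dilemma payoffs: (my action, opponent action) -> reward.
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}


class TitForTatOpponent:
    """A simple reactive environment: repeats the player's previous action."""

    def __init__(self):
        self.last_player_action = C

    def act(self):
        return self.last_player_action

    def observe(self, player_action):
        self.last_player_action = player_action


# Two example experts: one always defects (the myopically dominant action),
# one reciprocates the opponent's last move (cooperates against Tit-for-Tat).
def always_defect(opponent_last):
    return D


def reciprocate(opponent_last):
    return C if opponent_last is None or opponent_last == C else D


def run_phase(expert, phase_len=1000):
    """Follow a single expert for a whole phase and return its average reward.

    Because the opponent reacts to the actions actually played, this estimate
    captures the expert's long-term effect on the environment, not just its
    one-shot value against a fixed sequence of opponent actions.
    """
    opponent = TitForTatOpponent()
    opponent_last = None
    total = 0.0
    for _ in range(phase_len):
        opp_action = opponent.act()
        my_action = expert(opponent_last)
        total += PAYOFF[(my_action, opp_action)]
        opponent.observe(my_action)
        opponent_last = opp_action
    return total / phase_len


if __name__ == "__main__":
    experts = {"always_defect": always_defect, "reciprocate": reciprocate}
    # Exploration: estimate each expert's long-run average reward in its own phase.
    values = {name: run_phase(fn) for name, fn in experts.items()}
    print(values)
    # Exploitation: commit to the expert with the best estimated long-run value.
    print("exploit:", max(values, key=values.get))
```

Against this reactive opponent, the reciprocating expert earns roughly 3 per round while the always-defect expert earns roughly 1, which is precisely the long-term effect on the environment that a myopic, regret-minimizing comparison over a fixed sequence of opponent actions would miss.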



• Published in

  Journal of the ACM, Volume 53, Issue 5 (September 2006), 173 pages
  ISSN: 0004-5411
  EISSN: 1557-735X
  DOI: 10.1145/1183907

          Copyright © 2006 ACM


• Publisher

  Association for Computing Machinery, New York, NY, United States
