Abstract
“Experts algorithms” constitute a methodology for choosing actions repeatedly, when the rewards depend both on the choice of action and on the unknown current state of the environment. An experts algorithm has access to a set of strategies (“experts”), each of which may recommend which action to choose. The algorithm learns how to combine the recommendations of individual experts so that, in the long run, for any fixed sequence of states of the environment, it does as well as the best expert would have done relative to the same sequence. This methodology may not be suitable for situations where the evolution of states of the environment depends on past chosen actions, as is usually the case, for example, in a repeated non-zero-sum game.

A general exploration-exploitation experts method is presented, along with a proper definition of value. The definition is shown to be adequate in that it both captures the impact of an expert's actions on the environment and is learnable. The new experts method is quite different from previously proposed experts algorithms. It represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player's actions on the environment. The importance of this shift is demonstrated by the fact that this algorithm is capable of inducing cooperation in the repeated Prisoner's Dilemma game, whereas previous experts algorithms converge to suboptimal non-cooperative play. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases. Complexity and performance bounds are proven.
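To make the paradigm shift concrete, the following is a minimal sketch, in Python, of a phase-based explore/exploit experts scheme playing the repeated Prisoner's Dilemma against a reactive opponent. It is an illustration of the general idea, not the paper's actual algorithm: the tit-for-tat opponent model, the two experts, the phase length, and the constant exploration rate are all assumptions made for the example.

```python
import random

# Row player's payoffs in the Prisoner's Dilemma; 'C' = cooperate, 'D' = defect.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

# Two illustrative experts: strategies mapping the opponent's last action
# to a recommended action.
def always_defect(opp_last):
    return 'D'

def tit_for_tat(opp_last):
    return 'C' if opp_last is None else opp_last

EXPERTS = [always_defect, tit_for_tat]

def opponent(my_last):
    """Hypothetical reactive environment: an opponent who plays tit-for-tat."""
    return 'C' if my_last is None else my_last

def phased_experts(num_phases=200, phase_len=50, explore_frac=0.1, seed=0):
    """Sketch of a phase-based explore/exploit experts scheme.

    In each phase the algorithm either explores (follows a uniformly random
    expert) or exploits (follows the expert with the highest estimated
    long-run average reward).  An expert's value is estimated from the
    average reward of the phases in which it was followed, so the estimate
    reflects how the environment reacts to that expert's play.
    """
    rng = random.Random(seed)
    value = [0.0] * len(EXPERTS)   # estimated long-run average reward per expert
    counts = [0] * len(EXPERTS)    # number of phases each expert was followed
    for _ in range(num_phases):
        if rng.random() < explore_frac or 0 in counts:
            i = rng.randrange(len(EXPERTS))                        # explore
        else:
            i = max(range(len(EXPERTS)), key=lambda j: value[j])   # exploit
        # Follow expert i for a whole phase, so its effect on the opponent's
        # behavior shows up in the observed rewards.
        my_last = opp_last = None
        total = 0.0
        for _ in range(phase_len):
            a = EXPERTS[i](opp_last)
            b = opponent(my_last)
            total += PAYOFF[(a, b)]
            my_last, opp_last = a, b
        counts[i] += 1
        value[i] += (total / phase_len - value[i]) / counts[i]  # running mean
    return value

print(phased_experts())
```

Because each expert is evaluated over whole phases, its estimated value captures the opponent's reaction to it: against a tit-for-tat opponent the cooperative expert's long-run average (about 3 per round) exceeds the defecting expert's (about 1), whereas a per-round regret minimizer, for which defection dominates in every single round, would be drawn into mutual defection.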