Abstract
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that have so far been lacking in the literature for the Bernoulli case.
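For context, the Lai and Robbins lower bound referenced in the abstract states that, for Bernoulli rewards, any uniformly efficient policy must suffer cumulative regret that grows at least logarithmically, with a problem-dependent constant driven by the Kullback-Leibler divergence between arm means. A minimal restatement, with μ* the best mean and μ_a the mean of arm a (see Lai and Robbins, 1985, for the precise conditions):

```latex
\liminf_{T\to\infty} \frac{\mathbb{E}[R_T]}{\ln T}
  \;\ge\; \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\operatorname{KL}(\mu_a, \mu^*)},
\qquad
\operatorname{KL}(p, q) = p \ln\frac{p}{q} + (1-p) \ln\frac{1-p}{1-q}.
```

Thompson Sampling itself is simple to state in the Bernoulli case: maintain a Beta posterior over each arm's mean, draw one sample from each posterior, and play the arm with the largest draw. The sketch below is a minimal illustration of this standard Beta-Bernoulli scheme, not the authors' experimental code; the uniform Beta(1,1) prior, the simulated environment, and the function name are assumptions made for illustration.

```python
import random

def thompson_sampling(arm_probs, horizon, seed=0):
    """Minimal Beta-Bernoulli Thompson Sampling sketch (illustrative only).

    arm_probs : true Bernoulli means, used only to simulate rewards
    horizon   : number of rounds T
    Returns the cumulative (pseudo-)regret after `horizon` rounds.
    """
    rng = random.Random(seed)
    k = len(arm_probs)
    successes = [0] * k   # S_a: number of 1-rewards observed on arm a
    failures = [0] * k    # F_a: number of 0-rewards observed on arm a
    best_mean = max(arm_probs)
    regret = 0.0

    for _ in range(horizon):
        # Sample theta_a ~ Beta(S_a + 1, F_a + 1) for each arm
        # (uniform Beta(1,1) prior), then play the argmax.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])

        # Simulate a Bernoulli reward and update that arm's posterior.
        reward = 1 if rng.random() < arm_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best_mean - arm_probs[arm]

    return regret

if __name__ == "__main__":
    # Hypothetical two-armed instance with gap 0.1.
    print(thompson_sampling([0.5, 0.6], horizon=10000))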
References
Agrawal, S., Goyal, N.: Analysis of Thompson sampling for the multi-armed bandit problem. In: Conference on Learning Theory, COLT (2012)
Audibert, J.-Y., Bubeck, S.: Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research 11, 2785–2836 (2010)
Audibert, J.-Y., Munos, R., Szepesvári, C.: Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science 410(19), 1876–1902 (2009)
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2), 235–256 (2002)
Chapelle, O., Li, L.: An empirical evaluation of Thompson sampling. In: NIPS (2011)
Garivier, A., Cappé, O.: The KL-UCB algorithm for bounded stochastic bandits and beyond. In: Conference on Learning Theory, COLT (2011)
Granmo, O.C.: Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics 3(2), 207–234 (2010)
Honda, J., Takemura, A.: An asymptotically optimal bandit algorithm for bounded support models. In: Conference on Learning Theory, COLT (2010)
Kaufmann, E., Garivier, A., Cappé, O.: On Bayesian upper-confidence bounds for bandit problems. In: AISTATS (2012)
Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1), 4–22 (1985)
Maillard, O.-A., Munos, R., Stoltz, G.: A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In: Conference on Learning Theory, COLT (2011)
May, B.C., Korda, N., Lee, A., Leslie, D.: Optimistic Bayesian sampling in contextual bandit problems. Journal of Machine Learning Research 13, 2069–2106 (2012)
Salomon, A., Audibert, J.-Y.: Deviations of stochastic bandit regret. In: Kivinen, J., Szepesvári, C., Ukkonen, E., Zeugmann, T. (eds.) ALT 2011. LNCS, vol. 6925, pp. 159–173. Springer, Heidelberg (2011)
Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294 (1933)