Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis

  • Conference paper
Algorithmic Learning Theory (ALT 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7568)

Abstract

The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that have been lacking in the literature until now for the Bernoulli case.
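To make the abstract's setting concrete, here is a minimal sketch of Thompson Sampling on a Bernoulli bandit, together with the constant that appears in the Lai and Robbins lower bound, which states that for any uniformly good policy the regret divided by log T is asymptotically at least the sum over suboptimal arms of (mu* - mu_a) / KL(mu_a, mu*). The function names, the uniform Beta(1, 1) prior, and the arm means used below are illustrative assumptions, not taken from the paper.

```python
import math
import random


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_constant(means):
    """Sum over suboptimal arms of (mu* - mu_a) / KL(mu_a, mu*)."""
    best = max(means)
    return sum((best - m) / kl_bernoulli(m, best) for m in means if m < best)


def thompson_sampling(means, horizon, seed=0):
    """Run Thompson Sampling with an assumed uniform Beta(1, 1) prior on each arm.

    `means` are the true success probabilities, used only to simulate rewards
    and to compute the (pseudo-)regret. Returns (cumulative regret, pull counts).
    """
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior and play the argmax.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)
        ]
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
        regret += best - means[arm]
    return regret, pulls
```

For example, `thompson_sampling([0.1, 0.9], 2000)` returns the simulated regret and per-arm pull counts, and `lai_robbins_constant([0.1, 0.9]) * math.log(2000)` gives the corresponding asymptotic benchmark. At a finite horizon the two are only loosely comparable, since the lower bound is a statement about the limit as T grows.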

References

  1. Agrawal, S., Goyal, N.: Analysis of Thompson sampling for the multi-armed bandit problem. In: Conference on Learning Theory, COLT (2012)

  2. Audibert, J.-Y., Bubeck, S.: Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research 11, 2785–2836 (2010)

  3. Audibert, J.-Y., Munos, R., Szepesvári, C.: Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science 410(19), 1876–1902 (2009)

  4. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2), 235–256 (2002)

  5. Chapelle, O., Li, L.: An empirical evaluation of Thompson sampling. In: NIPS (2011)

  6. Garivier, A., Cappé, O.: The KL-UCB algorithm for bounded stochastic bandits and beyond. In: Conference on Learning Theory, COLT (2011)

  7. Granmo, O.C.: Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics 3(2), 207–234 (2010)

  8. Honda, J., Takemura, A.: An asymptotically optimal bandit algorithm for bounded support models. In: Conference on Learning Theory, COLT (2010)

  9. Kaufmann, E., Garivier, A., Cappé, O.: On Bayesian upper-confidence bounds for bandit problems. In: AISTATS (2012)

  10. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1), 4–22 (1985)

  11. Maillard, O.-A., Munos, R., Stoltz, G.: A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In: Conference on Learning Theory, COLT (2011)

  12. May, B.C., Korda, N., Lee, A., Leslie, D.: Optimistic Bayesian sampling in contextual bandit problems. Journal of Machine Learning Research 13, 2069–2106 (2012)

  13. Salomon, A., Audibert, J.-Y.: Deviations of Stochastic Bandit Regret. In: Kivinen, J., Szepesvári, C., Ukkonen, E., Zeugmann, T. (eds.) ALT 2011. LNCS, vol. 6925, pp. 159–173. Springer, Heidelberg (2011)

  14. Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294 (1933)

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kaufmann, E., Korda, N., Munos, R. (2012). Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis. In: Bshouty, N.H., Stoltz, G., Vayatis, N., Zeugmann, T. (eds) Algorithmic Learning Theory. ALT 2012. Lecture Notes in Computer Science, vol 7568. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34106-9_18

  • DOI: https://doi.org/10.1007/978-3-642-34106-9_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34105-2

  • Online ISBN: 978-3-642-34106-9

  • eBook Packages: Computer Science, Computer Science (R0)
