Abstract
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that have so far been lacking in the literature for the Bernoulli case.
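For context, the Lai and Robbins lower bound referenced in the abstract states that, for Bernoulli rewards, any uniformly efficient policy must suffer cumulative regret that grows at least logarithmically, with a problem-dependent constant driven by the Kullback-Leibler divergence between arm means. A minimal restatement, with μ* the best mean and μ_a the mean of arm a (see Lai and Robbins, 1985, for the precise conditions):

```latex
\liminf_{T\to\infty} \frac{\mathbb{E}[R_T]}{\ln T}
  \;\ge\; \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\operatorname{KL}(\mu_a, \mu^*)},
\qquad
\operatorname{KL}(p, q) = p \ln\frac{p}{q} + (1-p) \ln\frac{1-p}{1-q}.
```

Thompson Sampling itself is simple to state in the Bernoulli case: maintain a Beta posterior over each arm's mean, draw one sample from each posterior, and play the arm with the largest draw. The sketch below is a minimal illustration of this standard Beta-Bernoulli scheme, not the authors' experimental code; the uniform Beta(1,1) prior, the simulated environment, and the function name are assumptions made for illustration.

```python
import random

def thompson_sampling(arm_probs, horizon, seed=0):
    """Minimal Beta-Bernoulli Thompson Sampling sketch (illustrative only).

    arm_probs : true Bernoulli means, used only to simulate rewards
    horizon   : number of rounds T
    Returns the cumulative (pseudo-)regret after `horizon` rounds.
    """
    rng = random.Random(seed)
    k = len(arm_probs)
    successes = [0] * k   # S_a: number of 1-rewards observed on arm a
    failures = [0] * k    # F_a: number of 0-rewards observed on arm a
    best_mean = max(arm_probs)
    regret = 0.0

    for _ in range(horizon):
        # Sample theta_a ~ Beta(S_a + 1, F_a + 1) for each arm
        # (uniform Beta(1,1) prior), then play the argmax.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])

        # Simulate a Bernoulli reward and update that arm's posterior.
        reward = 1 if rng.random() < arm_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best_mean - arm_probs[arm]

    return regret

if __name__ == "__main__":
    # Hypothetical two-armed instance with gap 0.1.
    print(thompson_sampling([0.5, 0.6], horizon=10000))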
References
Agrawal, S., Goyal, N.: Analysis of Thompson sampling for the multi-armed bandit problem. In: Conference on Learning Theory, COLT (2012)
Audibert, J.-Y., Bubeck, S.: Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research 11, 2785–2836 (2010)
Audibert, J.-Y., Munos, R., Szepesvári, C.: Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science 410(19), 1876–1902 (2009)
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2), 235–256 (2002)
Chapelle, O., Li, L.: An empirical evaluation of Thompson sampling. In: NIPS (2011)
Garivier, A., Cappé, O.: The KL-UCB algorithm for bounded stochastic bandits and beyond. In: Conference on Learning Theory, COLT (2011)
Granmo, O.C.: Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics 3(2), 207–234 (2010)
Honda, J., Takemura, A.: An asymptotically optimal bandit algorithm for bounded support models. In: Conference on Learning Theory, COLT (2010)
Kaufmann, E., Garivier, A., Cappé, O.: On Bayesian upper-confidence bounds for bandit problems. In: AISTATS (2012)
Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1), 4–22 (1985)
Maillard, O.-A., Munos, R., Stoltz, G.: A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In: Conference on Learning Theory, COLT (2011)
May, B.C., Korda, N., Lee, A., Leslie, D.: Optimistic Bayesian sampling in contextual bandit problems. Journal of Machine Learning Research 13, 2069–2106 (2012)
Salomon, A., Audibert, J.-Y.: Deviations of stochastic bandit regret. In: Kivinen, J., Szepesvári, C., Ukkonen, E., Zeugmann, T. (eds.) ALT 2011. LNCS, vol. 6925, pp. 159–173. Springer, Heidelberg (2011)
Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294 (1933)