Abstract
In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence of trials to maximize the total payoff of the chosen strategies. While the performance of bandit algorithms with a small finite strategy set is well understood, bandit problems with large strategy sets are still a topic of active investigation, motivated by practical applications such as online auctions and web advertising. The goal of such research is to identify broad and natural classes of strategy sets and payoff functions that enable the design of efficient solutions.
In this work, we study a general setting for the multi-armed bandit problem in which the strategies form a metric space and the payoff function satisfies a Lipschitz condition with respect to the metric. We refer to this problem as the Lipschitz MAB problem. We present a complete solution for the multi-armed bandit problem in this setting: for every metric space, we define an isometry invariant that bounds from below the regret of any Lipschitz MAB algorithm for this metric space, and we present an algorithm that comes arbitrarily close to meeting this bound. Furthermore, our technique gives even better results for benign payoff functions. We also address the full-feedback (“best expert”) version of the problem, in which the payoffs of all arms are revealed after every round.
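The Lipschitz condition makes even naive discretization workable: a uniform grid of arms approximates the best strategy to within the grid spacing, and a standard finite-armed algorithm such as UCB1 can then be run on the grid. Below is a minimal sketch of this baseline (not the paper's algorithm); the payoff function `mu`, the grid size `K`, the horizon `T`, and the bounded-noise model are all illustrative assumptions.

```python
import math
import random

def ucb_on_grid(mu, K=20, T=5000, seed=0):
    """UCB1 over a uniform K-point discretization of [0, 1].

    mu: a Lipschitz mean-payoff function on [0, 1]; each pull of arm x
    returns mu(x) plus bounded noise. Returns the average reward collected.
    """
    rng = random.Random(seed)
    arms = [k / (K - 1) for k in range(K)]   # fixed grid of strategies
    counts = [0] * K                         # number of pulls per arm
    means = [0.0] * K                        # empirical mean reward per arm
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1                        # play each arm once to initialize
        else:
            # UCB1 index: empirical mean plus confidence radius
            i = max(range(K),
                    key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = mu(arms[i]) + rng.uniform(-0.1, 0.1)  # noisy payoff
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]    # incremental mean update
        total += r
    return total / T

# Example: a 1-Lipschitz payoff function peaked at x = 0.3.
avg = ucb_on_grid(lambda x: 1.0 - abs(x - 0.3))
```

With roughly K ≈ T^(1/3) grid points, this style of fixed discretization is known to achieve regret of order T^(2/3) (up to logarithmic factors) for Lipschitz bandits on an interval; the algorithms studied in this line of work improve on it by refining the discretization adaptively in promising regions.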
References
- Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. 2011. Improved algorithms for linear stochastic bandits. In Proceedings of the 25th Advances in Neural Information Processing Systems (NIPS’11). 2312--2320.
- Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. 2008. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Conference on Learning Theory (COLT’08). 263--274.
- Ittai Abraham and Dahlia Malkhi. 2005. Name independent routing for growth bounded networks. In Proceedings of the 17th ACM Symposium on Parallel Algorithms and Architectures (SPAA’05). 49--55.
- Rajeev Agrawal. 1995. The continuum-armed bandit problem. SIAM J. Control Optimiz. 33, 6 (1995), 1926--1951.
- Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. 2016. A near-optimal exploration-exploitation approach for assortment selection. In Proceedings of the 17th ACM Conference on Economics and Computation (ACM EC’16). 599--600.
- Shipra Agrawal and Nikhil R. Devanur. 2014. Bandits with concave rewards and convex knapsacks. In Proceedings of the 15th ACM Conference on Economics and Computation (ACM EC’14).
- Kareem Amin, Michael Kearns, and Umar Syed. 2011. Bandits, query learning, and the haystack dimension. In Proceedings of the 24th Conference on Learning Theory (COLT’11).
- Jean-Yves Audibert and Sébastien Bubeck. 2010. Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11 (2010), 2785--2836.
- Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. 2009. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410 (2009), 1876--1902.
- Peter Auer. 2002. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 3 (2002), 397--422.
- Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47, 2--3 (2002), 235--256.
- Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2002. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32, 1 (2002), 48--77.
- Peter Auer and Ronald Ortner. 2010. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61 (2010), 55--65.
- Peter Auer, Ronald Ortner, and Csaba Szepesvári. 2007. Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the 20th Conference on Learning Theory (COLT’07). 454--468.
- Baruch Awerbuch and Robert Kleinberg. 2008. Online linear optimization and adaptive routing. J. Comput. Syst. Sci. 74, 1 (Feb. 2008), 97--114.
- Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. 2014. Online stochastic optimization under correlated bandit feedback. In Proceedings of the 31st International Conference on Machine Learning (ICML’14). 1557--1565.
- Moshe Babaioff, Shaddin Dughmi, Robert D. Kleinberg, and Aleksandrs Slivkins. 2015. Dynamic pricing with limited supply. ACM Trans. Econ. Comput. 3, 1 (2015), 4.
- Moshe Babaioff, Robert Kleinberg, and Aleksandrs Slivkins. 2015. Truthful mechanisms with implicit payment computation. J. ACM 62, 2 (2015), 10.
- Moshe Babaioff, Yogeshwer Sharma, and Aleksandrs Slivkins. 2014. Characterizing truthful multi-armed bandit mechanisms. SIAM J. Comput. 43, 1 (2014), 194--230.
- Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. 2018. Bandits with knapsacks. J. ACM 65, 3 (2018).
- Dirk Bergemann and Juuso Välimäki. 2006. Bandit problems. In The New Palgrave Dictionary of Economics, 2nd ed., Steven Durlauf and Larry Blume (Eds.). Macmillan Press.
- Donald Berry and Bert Fristedt. 1985. Bandit Problems: Sequential Allocation of Experiments. Chapman & Hall.
- Donald A. Berry, Robert W. Chen, Alan Zame, David C. Heath, and Larry A. Shepp. 1997. Bandit problems with infinitely many arms. Ann. Stat. 25, 5 (1997), 2103--2116.
- Omar Besbes and Assaf Zeevi. 2009. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operat. Res. 57, 6 (2009), 1407--1420.
- Avrim Blum. 1997. Empirical support for winnow and weighted-majority-based algorithms: Results on a calendar scheduling domain. Mach. Learn. 26 (1997), 5--23.
- Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. 2003. Online learning in online auctions. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms (SODA’03). 202--204.
- Sébastien Bubeck and Nicolò Cesa-Bianchi. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends Mach. Learn. 5, 1 (2012).
- Sébastien Bubeck and Rémi Munos. 2010. Open loop optimistic planning. In Proceedings of the 23rd Conference on Learning Theory (COLT’10). 477--489.
- Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. 2008. Online optimization in X-armed bandits. In Proceedings of the 21st Advances in Neural Information Processing Systems (NIPS’08). 201--208.
- Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. 2011. Online optimization in X-armed bandits. J. Mach. Learn. Res. 12 (2011), 1587--1627.
- Sébastien Bubeck, Gilles Stoltz, and Jia Yuan Yu. 2011. Lipschitz bandits without the Lipschitz constant. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT’11). 144--158.
- Adam Bull. 2015. Adaptive-treed bandits. Bernoulli J. Stat. 21, 4 (2015), 2289--2307.
- G. Cantor. 1883. Über unendliche, lineare Punktmannichfaltigkeiten, 4. Math. Ann. 21 (1883), 51--58.
- Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. 1997. How to use expert advice. J. ACM 44, 3 (1997), 427--485.
- Nicolò Cesa-Bianchi and Gábor Lugosi. 2006. Prediction, Learning, and Games. Cambridge University Press.
- Hubert T.-H. Chan, Anupam Gupta, Bruce M. Maggs, and Shuheng Zhou. 2005. On hierarchical routing in bounded-growth metrics. In Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms (SODA’05). 762--771.
- Richard Cole and Lee-Ad Gottlieb. 2006. Searching dynamic point sets in spaces with bounded doubling dimension. In Proceedings of the 38th ACM Symposium on Theory of Computing (STOC’06). 574--583.
- Eric Cope. 2009. Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces. IEEE Trans. Auto. Control 54, 6 (2009), 1243--1253.
- Thomas M. Cover and Joy A. Thomas. 1991. Elements of Information Theory. John Wiley & Sons, New York.
- Varsha Dani, Thomas P. Hayes, and Sham Kakade. 2007. The price of bandit information for online optimization. In Proceedings of the 20th Advances in Neural Information Processing Systems (NIPS’07).
- Varsha Dani, Thomas P. Hayes, and Sham Kakade. 2008. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Conference on Learning Theory (COLT’08). 355--366.
- Thomas Desautels, Andreas Krause, and Joel Burdick. 2012. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. In Proceedings of the 29th International Conference on Machine Learning (ICML’12).
- Nikhil Devanur and Sham M. Kakade. 2009. The price of truthfulness for pay-per-click auctions. In Proceedings of the 10th ACM Conference on Electronic Commerce (EC’09). 99--106.
- Abraham Flaxman, Adam Kalai, and H. Brendan McMahan. 2005. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms (SODA’05). 385--394.
- Christodoulos A. Floudas. 1999. Deterministic Global Optimization: Theory, Algorithms and Applications. Kluwer Academic Publishers.
- Yoav Freund, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. 1997. Using and combining predictors that specialize. In Proceedings of the 29th ACM Symposium on Theory of Computing (STOC’97). 334--343.
- Aurélien Garivier and Olivier Cappé. 2011. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Conference on Learning Theory (COLT’11).
- E. N. Gilbert. 1952. A comparison of signalling alphabets. Bell Syst. Tech. J. 31 (May 1952), 504--522.
- John Gittins, Kevin Glazebrook, and Richard Weber. 2011. Multi-Armed Bandit Allocation Indices. John Wiley & Sons.
- Anupam Gupta, Mike Dinitz, and Kanat Tangwongsan. 2007. Private communication.
- Anupam Gupta, Robert Krauthgamer, and James R. Lee. 2003. Bounded geometries, fractals, and low-distortion embeddings. In Proceedings of the 44th IEEE Symposium on Foundations of Computer Science (FOCS’03). 534--543.
- Elad Hazan and Satyen Kale. 2011. Better algorithms for benign bandits. J. Mach. Learn. Res. 12 (2011), 1287--1311.
- Elad Hazan and Nimrod Megiddo. 2007. Online learning with prior information. In Proceedings of the 20th Conference on Learning Theory (COLT’07). 499--513.
- J. Heinonen. 2001. Lectures on Analysis on Metric Spaces. Springer-Verlag, New York.
- Kirsten Hildrum, John Kubiatowicz, and Satish Rao. 2004. Object location in realistic networks. In Proceedings of the 16th ACM Symposium on Parallel Algorithms and Architectures (SPAA’04). 25--35.
- Chien-Ju Ho, Aleksandrs Slivkins, and Jennifer Wortman Vaughan. 2016. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. J. Artific. Intell. Res. 55 (2016), 317--359.
- Junya Honda and Akimichi Takemura. 2010. An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of the 23rd Conference on Learning Theory (COLT’10).
- D. R. Karger and M. Ruhl. 2002. Finding nearest neighbors in growth-restricted metrics. In Proceedings of the 34th ACM Symposium on Theory of Computing (STOC’02). 63--66.
- Jon Kleinberg, Aleksandrs Slivkins, and Tom Wexler. 2009. Triangulation and embedding using small sets of beacons. J. ACM 56, 6 (Sept. 2009).
- Robert Kleinberg. 2004. Nearly tight bounds for the continuum-armed bandit problem. In Proceedings of the 18th Advances in Neural Information Processing Systems (NIPS’04).
- Robert Kleinberg. 2005. Online Decision Problems with Large Strategy Sets. Ph.D. Dissertation. MIT.
- Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. 2008. Regret bounds for sleeping experts and bandits. In Proceedings of the 21st Conference on Learning Theory (COLT’08). 425--436.
- Robert Kleinberg and Aleksandrs Slivkins. 2010. Sharp dichotomies for regret minimization in metric spaces. In Proceedings of the 21st ACM-SIAM Symposium on Discrete Algorithms (SODA’10).
- Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. 2008. Multi-armed bandits in metric spaces. In Proceedings of the 40th ACM Symposium on Theory of Computing (STOC’08). 681--690.
- Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. 2008. Multi-Armed Bandits in Metric Spaces. Technical report. Retrieved from http://arxiv.org/abs/0809.4882.
- Robert D. Kleinberg and Frank T. Leighton. 2003. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of the 44th IEEE Symposium on Foundations of Computer Science (FOCS’03).
- Levente Kocsis and Csaba Szepesvári. 2006. Bandit-based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning (ECML’06). 282--293.
- Andreas Krause and Cheng Soon Ong. 2011. Contextual Gaussian process bandit optimization. In Proceedings of the 25th Advances in Neural Information Processing Systems (NIPS’11). 2447--2455.
- Tze Leung Lai and Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6 (1985), 4--22.
- Tyler Lu, Dávid Pál, and Martin Pál. 2010. Showing relevant ads via Lipschitz context multi-armed bandits. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS’10).
- Stefan Magureanu, Richard Combes, and Alexandre Proutiere. 2014. Lipschitz bandits: Regret lower bound and optimal algorithms. In Proceedings of the 27th Conference on Learning Theory (COLT’14). 975--999.
- Odalric-Ambrym Maillard and Rémi Munos. 2010. Online learning in adversarial Lipschitz environments. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD’10). 305--320.
- Odalric-Ambrym Maillard and Rémi Munos. 2011. Adaptive bandits: Towards the best history-dependent strategy. In Proceedings of the 24th Conference on Learning Theory (COLT’11).
- S. Mazurkiewicz and W. Sierpiński. 1920. Contribution à la topologie des ensembles dénombrables. Fund. Math. 1 (1920), 17--27.
- Manor Mendel and Sariel Har-Peled. 2005. Fast construction of nets in low dimensional metrics, and their applications. In Proceedings of the 21st ACM Symposium on Computational Geometry (SoCG’05). 150--158.
- Stanislav Minsker. 2013. Estimation of extreme values and associated level sets of a regression function via selective sampling. In Proceedings of the 26th Conference on Learning Theory (COLT’13). 105--121.
- Michael Mitzenmacher and Eli Upfal. 2005. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press.
- Rémi Munos. 2011. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Proceedings of the 25th Advances in Neural Information Processing Systems (NIPS’11). 783--791.
- Rémi Munos. 2014. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Found. Trends Mach. Learn. 7, 1 (2014), 1--129.
- Rémi Munos and Pierre-Arnaud Coquelin. 2007. Bandit algorithms for tree search. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI’07).
- Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti, and Vanja Josifovski. 2007. Bandits for taxonomies: A model-based approach. In Proceedings of the SIAM International Conference on Data Mining (SDM’07).
- Sandeep Pandey, Deepayan Chakrabarti, and Deepak Agarwal. 2007. Multi-armed bandit problems with dependent arms. In Proceedings of the 24th International Conference on Machine Learning (ICML’07).
- Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. 2008. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning (ICML’08). 784--791.
- Herbert Robbins. 1952. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58 (1952), 527--535.
- Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. 2000. A metric for distributions with applications to image databases. Int. J. Comput. Vision 40, 2 (2000), 99--121.
- Manfred Schroeder. 1991. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W. H. Freeman and Co.
- Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
- Aleksandrs Slivkins. 2007. Distance estimation and object location via rings of neighbors. Distrib. Comput. 19, 4 (Mar. 2007), 313--333.
- Aleksandrs Slivkins. 2007. Towards fast decentralized construction of locality-aware overlay networks. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing (PODC’07). 89--98.
- Aleksandrs Slivkins. 2011. Multi-armed bandits on implicit metric spaces. In Proceedings of the 25th Advances in Neural Information Processing Systems (NIPS’11).
- Aleksandrs Slivkins. 2014. Contextual bandits with similarity information. J. Mach. Learn. Res. 15, 1 (2014), 2533--2568.
- Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. 2013. Ranked bandits in metric spaces: Learning optimally diverse rankings over large document collections. J. Mach. Learn. Res. 14 (Feb. 2013), 399--436.
- Aleksandrs Slivkins and Eli Upfal. 2008. Adapting to a changing environment: The Brownian restless bandits. In Proceedings of the 21st Conference on Learning Theory (COLT’08). 343--354.
- Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. 2010. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML’10). 1015--1022.
- Michel Talagrand. 2005. The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer.
- Kunal Talwar. 2004. Bypassing the embedding: Algorithms for low-dimensional metrics. In Proceedings of the 36th ACM Symposium on Theory of Computing (STOC’04). 281--290.
- William R. Thompson. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 3--4 (1933), 285--294.
- Michal Valko, Alexandra Carpentier, and Rémi Munos. 2013. Stochastic simultaneous optimistic optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML’13). 19--27.
- R. R. Varshamov. 1957. Estimate of the number of signals in error correcting codes. Doklady Akademii Nauk 117 (1957), 739--741.
- V. Vovk. 1998. A game of prediction with expert advice. J. Comput. Syst. Sci. 56, 2 (1998), 153--173.
- Yizao Wang, Jean-Yves Audibert, and Rémi Munos. 2008. Algorithms for infinitely many-armed bandits. In Proceedings of the 21st Advances in Neural Information Processing Systems (NIPS’08). 1729--1736.
- Zizhuo Wang, Shiming Deng, and Yinyu Ye. 2014. Close the gaps: A learning-while-doing algorithm for single-product revenue management problems. Operat. Res. 62, 2 (2014), 318--331.
Index Terms
- Bandits and Experts in Metric Spaces