ABSTRACT
Methods for bandit learning from user interactions often require a model of the reward a certain context-action pair will yield – for example, the probability of a click on a recommendation. This common machine learning task is highly non-trivial, as the data-generating process for contexts and actions is often skewed by the recommender system itself. Indeed, when the recommendation policy deployed at data-collection time does not pick its actions uniformly at random, the logged data suffer from a selection bias that can impede effective reward modelling. This, in turn, makes off-policy learning – the typical setup in industry – particularly challenging.
In this work, we propose and validate a general pessimistic reward modelling approach for off-policy learning in recommendation. Bayesian uncertainty estimates allow us to express scepticism about our own reward model, which can in turn be used to generate a conservative decision rule. We show how this alleviates a well-known decision-making phenomenon known as the Optimiser's Curse, and draw parallels with existing work on pessimistic policy learning. Leveraging the closed-form expressions for both the posterior mean and variance when a ridge regressor models the reward, we show how to apply pessimism effectively and efficiently to an off-policy recommendation use-case. Empirical observations in a wide range of environments show that being conservative in decision-making leads to a significant and robust increase in recommendation performance. The merits of our approach are most pronounced in realistic settings with limited logging randomisation, limited training samples, and larger action spaces.
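To make the idea concrete, here is a minimal sketch of a pessimistic decision rule built on the closed-form ridge posterior, not the paper's exact implementation. All data and hyperparameters (`lam`, `sigma2`, `alpha`, the toy logging policy) are illustrative assumptions: a ridge regressor's posterior mean and covariance are computed in closed form, and each action is scored by a lower confidence bound on its predicted reward rather than by the posterior mean alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Closed-form Bayesian ridge posterior over reward-model weights.
# With prior theta ~ N(0, (sigma2 / lam) * I) and Gaussian noise sigma2:
#   A = X^T X + lam * I,  mean = A^{-1} X^T y,  cov = sigma2 * A^{-1}.
def fit_ridge_posterior(X, y, lam=1.0, sigma2=1.0):
    d = X.shape[1]
    A_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
    return A_inv @ X.T @ y, sigma2 * A_inv

# Pessimistic decision rule: score every candidate action by a lower
# confidence bound (posterior mean minus alpha posterior std. deviations).
def pessimistic_argmax(action_feats, mean, cov, alpha=1.0):
    mu = action_feats @ mean                       # posterior mean reward
    var = np.einsum("ad,de,ae->a", action_feats, cov, action_feats)
    return int(np.argmax(mu - alpha * np.sqrt(var)))

# Toy simulation (illustrative, not from the paper): a skewed logging
# policy over-samples a few actions, mimicking the selection bias above.
d, n_actions, n_logs = 5, 20, 300
theta_true = rng.normal(size=d)
actions = rng.normal(size=(n_actions, d))          # action features
log_probs = np.exp(2.0 * rng.normal(size=n_actions))
log_probs /= log_probs.sum()
idx = rng.choice(n_actions, size=n_logs, p=log_probs)
X = actions[idx]
y = X @ theta_true + rng.normal(scale=1.0, size=n_logs)

mean, cov = fit_ridge_posterior(X, y)
greedy = int(np.argmax(actions @ mean))            # trusts point estimates
pess = pessimistic_argmax(actions, mean, cov, alpha=2.0)
for name, a in [("greedy", greedy), ("pessimistic", pess)]:
    print(f"{name:12s} predicted={actions[a] @ mean:+.3f} "
          f"true={actions[a] @ theta_true:+.3f}")
```

In this sketch, the greedy rule tends to select whichever action has the most inflated estimate, so its predicted reward systematically exceeds its true reward (the Optimiser's Curse). Subtracting a multiple of the posterior standard deviation penalises under-explored actions, which is exactly where post-decision disappointment arises.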