Abstract
Digital experiments are routinely used to test the value of a treatment relative to a status-quo control setting, for instance, a new search relevance algorithm for a website or a new results layout for a mobile app. As digital experiments have become pervasive across organizations and a wide variety of research areas, their growth has created new challenges for experimentation platforms. One challenge is that experiments often focus on the average treatment effect (ATE) without explicitly considering differences across major sub-groups, that is, heterogeneous treatment effects (HTEs). This is especially problematic because ATEs have decreased in many organizations as the more obvious benefits have already been realized. However, questions remain about how pervasive user HTEs are and how best to detect them. We propose a framework for detecting and analyzing user HTEs in digital experiments. Our framework combines an array of user characteristics with double machine learning. Analysis of 27 real-world experiments spanning 1.76 billion sessions, together with simulated data, demonstrates the effectiveness of our detection method relative to existing techniques. We also find that transaction, demographic, engagement, satisfaction, and lifecycle characteristics exhibit statistically significant HTEs in 10% to 20% of our real-world experiments. This underscores the importance of considering user heterogeneity when analyzing experiment results: when heterogeneity is ignored, opportunities to personalize features and experiences are missed, which reduces effectiveness. In terms of the number of experiments and user sessions, we are not aware of any study that has examined user HTEs at this scale. Our findings have important implications for information retrieval, user modeling, platforms, and digital experience contexts, in which online experiments are often used to evaluate the effectiveness of design artifacts.
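To make the idea of combining a user characteristic with double machine learning concrete, the following is a minimal sketch and not the framework described above. It assumes a single binary user characteristic `x` (a hypothetical segment flag), a randomized binary treatment `t`, and a simulated outcome `y`. The nuisance models are plain within-segment means fitted with two-fold cross-fitting, and the treatment effect is modeled as `theta0 + theta1 * x`, so a non-zero `theta1` indicates an HTE. All function names, parameters, and data here are illustrative assumptions; inference (standard errors, multiple-testing control) is omitted.

```python
import random


def simulate(n, ate_base, hte, seed=0):
    """Simulate a randomized experiment (hypothetical data, not from the paper)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.3 else 0   # binary user characteristic
        t = 1 if rng.random() < 0.5 else 0   # randomized treatment assignment
        # Outcome: baseline depends on x; treatment effect is ate_base + hte * x.
        y = 1.0 + 2.0 * x + t * (ate_base + hte * x) + rng.gauss(0.0, 1.0)
        rows.append({"x": x, "t": t, "y": y})
    return rows


def group_means(rows, key):
    """Mean of rows[i][key] within each segment x in {0, 1}."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for r in rows:
        sums[r["x"]] += r[key]
        counts[r["x"]] += 1
    return {x: sums[x] / counts[x] for x in sums}


def dml_hte(rows, n_folds=2):
    """Partialling-out estimate of (theta0, theta1) with cross-fitting."""
    residuals = []
    for k in range(n_folds):
        train = [r for i, r in enumerate(rows) if i % n_folds != k]
        test = [r for i, r in enumerate(rows) if i % n_folds == k]
        # Nuisance estimates E[y | x] and E[t | x], fitted on the other fold(s).
        mean_y = group_means(train, "y")
        mean_t = group_means(train, "t")
        for r in test:
            residuals.append(
                (r["y"] - mean_y[r["x"]], r["t"] - mean_t[r["x"]], r["x"])
            )
    # Least squares of y_res on [t_res, t_res * x] without an intercept,
    # solved through the 2x2 normal equations.
    a11 = sum(tr * tr for _, tr, _ in residuals)
    a12 = sum(tr * tr * x for _, tr, x in residuals)
    a22 = sum((tr * x) ** 2 for _, tr, x in residuals)
    b1 = sum(yr * tr for yr, tr, _ in residuals)
    b2 = sum(yr * tr * x for yr, tr, x in residuals)
    det = a11 * a22 - a12 * a12
    theta0 = (b1 * a22 - b2 * a12) / det
    theta1 = (a11 * b2 - a12 * b1) / det
    return theta0, theta1


if __name__ == "__main__":
    data = simulate(20000, ate_base=0.1, hte=0.5, seed=1)
    theta0, theta1 = dml_hte(data)
    print(f"effect for x=0: {theta0:.3f}, extra effect for x=1: {theta1:.3f}")
```

In this sketch the pooled ATE would blend the two segments, whereas `theta1` isolates how much the effect differs for users with `x = 1`. A realistic analysis would replace the segment means with flexible learners over many characteristics and would test `theta1` for significance.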