ABSTRACT
Understanding how humans perceive AI teammates is an important foundation for understanding human-AI teams more generally. Extending relevant work from cognitive science, we propose a framework based on item response theory (IRT) for modeling these perceptions. We apply this framework to real-world experiments in which each participant works alongside another person or an AI agent in a question-answering setting, repeatedly assessing their teammate’s performance. Using this experimental data, we demonstrate how the framework can be used to test research questions about people’s perceptions of both AI agents and other people. We contrast mental models of AI teammates with those of human teammates, characterizing the dimensionality of these mental models, their development over time, and the influence of participants’ own self-perception. Our results indicate that people expect AI agents’ performance to be significantly better on average than that of other humans, with less variation across different types of problems. We conclude by discussing the implications of these findings for human-AI interaction.
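For readers unfamiliar with item response theory, its core building block is a logistic model linking a respondent's latent ability to the probability of answering an item correctly. The sketch below shows the standard two-parameter logistic (2PL) IRT model; it is a generic illustration of IRT, not the paper's actual perception model, and the parameter values are made up for the example.

```python
import math

def irt_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic IRT model.

    theta: latent ability of the respondent
    a:     item discrimination (how sharply success probability
           changes around the difficulty point)
    b:     item difficulty

    Returns the probability of a correct response.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty, the success probability is 0.5
# regardless of discrimination.
p_at_difficulty = irt_2pl(theta=1.0, a=1.5, b=1.0)  # -> 0.5

# Higher ability relative to difficulty raises the success probability.
p_easy = irt_2pl(theta=2.0, a=1.0, b=0.0)
```

In the paper's setting, an analogous latent-variable formulation lets participants' repeated assessments of a teammate be modeled as noisy observations of a perceived ability, which is what allows perceptions of AI and human teammates to be compared on a common scale.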