Towards human-like spoken dialogue systems
Introduction
The evaluation and development of spoken dialogue systems is a complex undertaking, and much effort is expended on making it manageable. Research and industry endeavours in the area often seek to compare versions of existing systems, or to compare component technologies, in order to find the best methods – where “best” is defined as most efficient. Sometimes, user satisfaction is used as an alternative, more human-centred metric, but as the systems under scrutiny are often designed to help users perform some task, user satisfaction and efficiency are highly correlated. Much effort is also spent on minimising the cost of evaluation, for example by designing evaluation methods that generalise over systems and can be re-used (e.g. Dybkjær et al., 2004; Walker et al., 2000; see also Möller et al., 2007 for an overview); by automating the evaluations (e.g. Bechet et al., 2004; Glass et al., 2000); or by utilising simulations instead of users in order to make low-cost repeat studies (e.g. Georgila et al., 2006; Schatzmann et al., 2005).
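The simulation-based approach can be illustrated with a minimal sketch. The task, slot names, and behaviour rules below are purely hypothetical, not taken from any of the cited systems: a rule-based simulated user interacts repeatedly with a simple slot-filling dialogue loop, giving low-cost repeat runs of an efficiency metric such as turns-to-completion.

```python
import random

# Hypothetical slot-filling task: the system must collect these slots.
SLOTS = ["origin", "destination", "date"]

def simulated_user(prompted_slot, goal):
    """Rule-based simulated user: answers the prompted slot,
    occasionally staying silent to mimic a recognition failure."""
    if random.random() < 0.1:  # 10% simulated no-input, an assumed rate
        return None
    return goal[prompted_slot]

def run_dialogue(goal, max_turns=10):
    """One system/simulated-user episode; returns turns used, or None on failure."""
    filled = {}
    for turn in range(1, max_turns + 1):
        missing = [s for s in SLOTS if s not in filled]
        if not missing:
            return turn - 1            # task completed
        answer = simulated_user(missing[0], goal)
        if answer is not None:
            filled[missing[0]] = answer
    return None                        # gave up

random.seed(0)
goal = {"origin": "Stockholm", "destination": "Malmo", "date": "Friday"}
runs = [run_dialogue(goal) for _ in range(1000)]
completed = [r for r in runs if r is not None]
print(f"completion rate: {len(completed)/len(runs):.2f}")
print(f"mean turns to completion: {sum(completed)/len(completed):.2f}")
```

Because the simulated user is cheap to run, the same 1000-dialogue study can be repeated after every change to the dialogue strategy – which is precisely the appeal, and the limitation, of replacing users with simulations.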
In this paper, we look at the particular issues involved in evaluating and developing human-like spoken dialogue systems – systems that aim to mimic human conversation as closely as possible. The discussion is limited to collaborative spoken dialogue, although the reasoning may hold for a wider set of interaction types, for example text chat. Discussing human-like spoken dialogue systems implicitly requires that we formulate what “human-like” means, and the next two sections provide background on the concept of human-likeness. The first proposes an analysis of how users perceive spoken dialogue systems in terms of other, more familiar things; the second gives a brief overview of the pros and cons of striving for human-likeness in spoken dialogue systems. The three sections after that deal with, in turn, how “increased human-likeness” can be understood, how to gather the experimental data needed to evaluate a component intended to increase human-likeness, and how to analyse those data.
Two faces of spoken dialogue systems
Spoken dialogue system research is often guided by a wish to achieve more natural interaction. The term “natural” is somewhat problematic and rarely defined, but is generally taken to mean something like “more like human–human interaction”. For example, Jokinen (2003) talks about “computers that mimic human interaction” and Boyce (2000) says “the act of using natural speech as input mechanism makes the computer seem more human-like”. We will use “human-like” to mean “more like human–human interaction”.
Opting for human-likeness
Before turning our attention entirely to the design and evaluation of human-like spoken dialogue systems, let us apply the metaphor distinction to current spoken dialogue systems, and pre-emptively address some objections to the endeavour of increasing human-likeness in spoken dialogue systems.
Towards human-likeness
Whether one subscribes to a metaphoric view or not, one may opt to aim for spoken dialogue systems with increased human-likeness, raising the question of how to proceed. What technology must be developed, and how is it evaluated to ensure that it brings human-likeness and not something else?
Eliciting pragmatic evidence
In the following, we will discuss techniques to elicit the pragmatic data needed for evaluating user responses against human-likeness gold standards. The methods we discuss – Wizard-of-Oz variations, human–human data manipulation, and micro-domains – are all commonly used to collect data in order to build models of human–computer dialogue, and references to such usage are provided for completeness. Note that when these methods are used in the traditional sense, it is important to ensure that
Analysing the experiment data
Having collected data on how users respond to a component, we are left with the task of evaluating the data. To reiterate, with reference to Fig. 2, the human-likeness criterion is that C1’s behaviour in DHC should resemble H2’s behaviour in DHH, and that H1’s behaviour in DHC should resemble H1’s behaviour in DHH, as illustrated in Fig. 3. There are several approaches to the task of measuring this.
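One such approach can be sketched as a distributional comparison: compare the empirical distribution of some surface feature of C1’s behaviour in DHC against that of H2’s behaviour in DHH, for instance with the Jensen–Shannon divergence. The feature (utterance length in words), the data, and the contrast with a verbose baseline below are illustrative assumptions, not the paper’s own measure.

```python
from collections import Counter
import math

def js_divergence(sample_a, sample_b):
    """Jensen-Shannon divergence between the empirical distributions of two
    samples of a discrete feature (0 = identical, log 2 = disjoint support)."""
    pa, pb = Counter(sample_a), Counter(sample_b)
    na, nb = len(sample_a), len(sample_b)
    div = 0.0
    for x in set(pa) | set(pb):
        p, q = pa[x] / na, pb[x] / nb
        m = (p + q) / 2
        if p:
            div += 0.5 * p * math.log(p / m)
        if q:
            div += 0.5 * q * math.log(q / m)
    return div

# Hypothetical feature: utterance length in words, per speaker and corpus.
h2_in_dhh = [3, 4, 4, 5, 3, 4, 6, 4]            # human H2, human-human data
c1_in_dhc = [3, 4, 5, 4, 3, 4, 4, 6]            # system C1, human-computer data
c1_baseline = [12, 15, 11, 14, 13, 12, 15, 14]  # a verbose baseline system

print(js_divergence(h2_in_dhh, c1_in_dhc))      # small: behaviours resemble
print(js_divergence(h2_in_dhh, c1_baseline))    # large: behaviours differ
```

A low divergence against the human–human gold standard is then taken as evidence of human-like behaviour on that feature; the same comparison applies to H1’s behaviour across DHC and DHH.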
Future work
It would be especially gratifying to make use of known human–human dialogue phenomena in this type of evaluation. The Lombard reflex can serve as an example. It is known that under noisy conditions, speakers change their voice in a number of ways: they raise their vocal intensity and their pitch, among other changes. If noise is added to a human–computer dialogue, a human-like computer would be expected to exhibit the same changes (so that C1’s behaviour in DHC should resemble H2’s behaviour in DHH). This could be
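A Lombard-style check of this kind could be sketched as follows. The per-utterance mean F0 values are invented for illustration; the idea is simply to compare the direction and magnitude of the quiet-to-noisy shift for human speakers and for the system.

```python
def lombard_shift(quiet_values, noisy_values):
    """Relative change of a vocal feature (e.g. mean F0 or intensity)
    from the quiet condition to the noisy condition."""
    mean_quiet = sum(quiet_values) / len(quiet_values)
    mean_noisy = sum(noisy_values) / len(noisy_values)
    return (mean_noisy - mean_quiet) / mean_quiet

# Hypothetical per-utterance mean F0 (Hz) measurements.
human_quiet = [118, 122, 120, 119]
human_noisy = [131, 135, 133, 130]    # humans raise their pitch in noise
system_quiet = [115, 117, 116, 118]
system_noisy = [127, 130, 128, 129]   # a human-like system should do the same

human_shift = lombard_shift(human_quiet, human_noisy)
system_shift = lombard_shift(system_quiet, system_noisy)
print(f"human shift: {human_shift:+.1%}, system shift: {system_shift:+.1%}")
# A shift of the same sign and similar magnitude would count as
# human-like behaviour on this phenomenon.
```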
Conclusion
This paper has presented an overview of methods to collect data on how users respond to techniques intended to increase human-likeness in spoken dialogue systems and to analyse the results. Some of these represent fairly traditional ways of accomplishing this, such as Wizard-of-Oz studies with the subjects being the objects of study; or having subjects judge manipulated interactions off-line. Other methods have added a measure of innovation, including studying the wizards as subjects, allowing
Acknowledgements
The call routing experiments took place at TeliaSonera, Sweden. They would not have been possible had it not been for the ASR 90 200 pilot team that provided the data collection tools for the skilled wizards. Note also that Anders Lindström at TeliaSonera took part in designing the second experiment. Finally, our heartfelt thanks to everybody in the research group at Speech, Music and Hearing at KTH, the two anonymous reviewers, and to colleagues everywhere for valuable input and support. Part
References (126)
- et al., 1993. User representations of computer systems in human–computer speech interaction. Internat. J. Man–Machine Studies.
- et al., 1995. The Philips automatic train timetable information system. Speech Comm.
- et al., 1995. Interaction and feedback in a spoken language system: a theoretical framework. Knowledge-Based Syst.
- 1994. Managing problems in speaking. Speech Comm.
- 1984. Voice-input aids for the physically handicapped. Internat. J. Man–Machine Stud.
- et al., 2004. Evaluation and usability of multimodal spoken dialogue systems. Speech Comm.
- et al., 1991. Simulating speech systems. Comput. Speech Lang.
- et al., 2007. Evaluating spoken dialogue systems according to de-facto standards: a case study. Comput. Speech Lang.
- et al., 2006. User evaluation of the SYNFACE talking head telephone. Lect. Notes Comput. Sci.
- et al., 1995. The TRAINS project: a case study in defining a conversational planning agent. J. Exp. Technol. Artif. Intell.
- Towards conversational human–computer interaction. AI Magaz.
- How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues.
- Designing Interactive Speech Systems.
- Put that there: voice and gesture at the graphics interface. Comp. Graph.
- Spoken natural language dialogue systems: user interface issues for the future.
- Natural spoken dialogue systems for telephony applications. Comm. ACM.
- A statistical analysis of on-off patterns in 16 conversations. Bell Syst. Tech. J.
- Body language: lessons from the near-human.
- Negotiated collusion: modeling social language and its relationship effects in intelligent agents. User Model. Adapt. Interf.
- Interactive human communication: some lessons learned from laboratory experiments.
- Conversational Organization: Interaction Between Speakers and Hearers.
- Multimodal feedback cues in human–machine interactions.