
Speech Communication

Volume 50, Issues 8–9, August–September 2008, Pages 630-645

Towards human-like spoken dialogue systems

https://doi.org/10.1016/j.specom.2008.04.002

Abstract

This paper presents an overview of methods that can be used to collect and analyse data on user responses to spoken dialogue system components intended to increase human-likeness, and to evaluate how well the components succeed in reaching that goal. Wizard-of-Oz variations, human–human data manipulation, and micro-domains are discussed in this context, as is the use of third-party reviewers to get a measure of the degree of human-likeness. We also present the two-way mimicry target, a model for measuring how well a human–computer dialogue mimics or replicates some aspect of human–human dialogue, including human flaws and inconsistencies. Although we have added a measure of innovation, none of the techniques is new in its entirety. Taken together and described from a human-likeness perspective, however, they form a set of tools that may widen the path towards human-like spoken dialogue systems.

Introduction

The evaluation and development of spoken dialogue systems is a complex undertaking, and much effort is expended on making it manageable. Research and industry endeavours in the area often seek to compare versions of existing systems, or to compare component technologies, in order to find the best methods – where “best” is defined as most efficient. Sometimes, user satisfaction is used as an alternative, more human-centred metric, but as the systems under scrutiny are often designed to help users perform some task, user satisfaction and efficiency are highly correlated. Much effort is also spent on minimising the cost of evaluation, for example by designing evaluation methods that will generalise over systems and that may be re-used (e.g. Dybkjær et al., 2004, Walker et al., 2000; see also Möller et al., 2007 for an overview); by automating the evaluations (e.g. Bechet et al., 2004, Glass et al., 2000); or by utilising simulations instead of users in order to make low-cost repeat studies (e.g. Georgila et al., 2006, Schatzmann et al., 2005).

In this paper, we look at the particular issues involved in evaluating and developing human-like spoken dialogue systems – systems that aim to mimic human conversation as closely as possible. The discussion is limited to collaborative spoken dialogue, although the reasoning may hold for a wider set of interaction types, for example text chats. Discussing human-like spoken dialogue systems implicitly requires that we formulate what “human-like” means, and the next two sections provide background on the concept of human-likeness. The first proposes an analysis of how users perceive spoken dialogue systems in terms of other, more familiar things; the second gives a brief overview of the pros and cons of striving for human-likeness in spoken dialogue systems. The remaining three sections deal with, in turn, how “increased human-likeness” can be understood, how to gather the experimental data needed to evaluate a component intended to increase human-likeness, and how to analyse that data.

Section snippets

Two faces of spoken dialogue systems

Spoken dialogue system research is often guided by a wish to achieve more natural interaction. The term “natural” is somewhat problematic and rarely defined, but is generally taken to mean something like “more like human–human interaction”. For example, Jokinen (2003) talks about “computers that mimic human interaction” and Boyce (2000) says “the act of using natural speech as input mechanism makes the computer seem more human-like”. We will use “human-like” to mean “more like human–human …

Opting for human-likeness

Before turning our attention entirely to the design and evaluation of human-like spoken dialogue systems, let us apply the metaphor distinction to current spoken dialogue systems, and pre-emptively address some objections to the endeavour of increasing human-likeness in spoken dialogue systems.

Towards human-likeness

Whether one subscribes to a metaphoric view or not, one may opt to aim for spoken dialogue systems with increased human-likeness, which raises the question of how to proceed. What technology must be developed, and how should it be evaluated to ensure that it brings human-likeness and not something else?

Eliciting pragmatic evidence

In the following, we will discuss techniques to elicit the pragmatic data needed for evaluating user responses against human-likeness gold standards. The methods we discuss – Wizard-of-Oz variations, human–human data manipulation, and micro-domains – are all commonly used to collect data in order to build models of human–computer dialogue, and references to such usage are provided for completeness. Note that when these methods are used in the traditional sense, it is important to ensure that …

Analysing the experiment data

Having collected data on how users respond to a component, we are left with the task of evaluating the data. To reiterate, with reference to Fig. 2, the human-likeness criterion is that C1’s behaviour in DHC should resemble H2’s behaviour in DHH, and that H1’s behaviour in DHC should resemble H1’s behaviour in DHH, as illustrated by DHC in Fig. 3. There are several approaches to the task of measuring this.
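As one concrete (and deliberately simplified) way of quantifying such resemblance, one might compare the distribution of some measurable feature – say, turn-transition latencies – for C1 in the human–computer dialogues against H2 in the matched human–human dialogues. The sketch below illustrates this in Python; the data values and the choice of latency as the feature are hypothetical and not taken from the paper.

```python
import statistics

def cohens_d(sample_a, sample_b):
    """Cohen's d: difference of means in pooled-standard-deviation units."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

# Hypothetical turn-transition latencies (seconds):
# the system C1 in human-computer dialogues (DHC) vs.
# the human interlocutor H2 in matched human-human dialogues (DHH).
latency_c1_dhc = [0.21, 0.35, 0.28, 0.40, 0.25]
latency_h2_dhh = [0.24, 0.31, 0.29, 0.38, 0.27]

d = cohens_d(latency_c1_dhc, latency_h2_dhh)
# A small |d| is (weak) evidence that C1 resembles H2 on this one dimension.
```

A low effect size on a single feature of course says little about overall human-likeness; the point is only that a resemblance criterion of this kind can be turned into per-feature comparisons that are open to statistical testing.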

Future work

It would be especially gratifying to make use of known human–human dialogue phenomena in this type of evaluation. The Lombard reflex can serve as an example. It is known that under noisy conditions, speakers change their voice in a number of ways: they speak louder, raise their pitch, and so on. If noise is added to a human–computer dialogue, a human-like computer would be expected to exhibit the same changes (so that C1’s behaviour in DHC resembles H2’s behaviour in DHH). This could be …
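As a minimal illustration of the kind of check envisioned here, the sketch below (in Python, with invented F0 values – the paper reports no such numbers) tests whether a system voice raises its mean pitch after noise onset, i.e. whether it shows a Lombard-like adaptation:

```python
import statistics

# Hypothetical per-utterance mean F0 (Hz) for the system voice C1,
# before and after background noise is added to the dialogue.
f0_before_noise = [118.0, 122.0, 120.0, 119.0, 121.0]
f0_after_noise = [131.0, 135.0, 129.0, 133.0, 134.0]

f0_rise = statistics.mean(f0_after_noise) - statistics.mean(f0_before_noise)

# Under the Lombard reflex, human speakers raise their pitch (and loudness)
# in noise, so a human-like system should show a positive rise here as well.
shows_lombard_like_rise = f0_rise > 0
```

In a real study the F0 values would be extracted from recordings with a pitch tracker, and loudness, speech rate, and other Lombard-related features would be checked in the same way.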

Conclusion

This paper has presented an overview of methods to collect data on how users respond to techniques intended to increase human-likeness in spoken dialogue systems and to analyse the results. Some of these represent fairly traditional ways of accomplishing this, such as Wizard-of-Oz studies with the subjects being the objects of study; or having subjects judge manipulated interactions off-line. Other methods have added a measure of innovation, including studying the wizards as subjects, allowing …

Acknowledgements

The call routing experiments took place at TeliaSonera, Sweden. They would not have been possible had it not been for the ASR 90 200 pilot team that provided the data collection tools for the skilled wizards. Note also that Anders Lindström at TeliaSonera took part in designing the second experiment. Finally, our heartfelt thanks to everybody in the research group at Speech, Music and Hearing at KTH, the two anonymous reviewers, and to colleagues everywhere for valuable input and support. Part …

References (126)

  • J.F. Allen et al., Towards conversational human–computer interaction, AI Magaz. (2001)
  • Allwood, J., Haglund, B., 1992. Communicative activity analysis of a wizard of Oz experiment. Technical Report,...
  • B. Balentine et al., How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues (2001)
  • Bechet, F., Riccardi, G., Hakkani-Tur, D., 2004. Mining spoken dialogue corpora for system evaluation and modeling. In:...
  • N. Bernsen et al., Designing Interactive Speech Systems (1998)
  • Berry, G.A., Pavlovic, V.I., Huang, T.S., 1998. A multimodal human–computer interface for the control of a virtual...
  • Bertenstam, J., Beskow, J., Blomberg, M., Carlson, R., Elenius, K., Granström, B., Gustafson, J., Hunnicutt, S.,...
  • Blomberg, M., Carlson, R., Elenius, K., Gustafson, J., Granström, B., Hunnicutt, S., Lindell, R., Neovius, L., 1993. An...
  • Bohus, D., Rudnicky, A., 2002. LARRI: a language-based maintenance and repair assistant. In: Proc. ISCA Workshop...
  • R. Bolt, Put that there: voice and gesture at the graphics interface, Comp. Graph. (1980)
  • S.J. Boyce, Spoken natural language dialogue systems: user interface issues for the future
  • S.J. Boyce, Natural spoken dialogue systems for telephony applications, Comm. ACM (2000)
  • Boye, J., Wirén, M., 2007. Multi-slot semantics for natural-language call routing systems. In: Proc. Workshop on...
  • Boye, J., Wiren, M., Rayner, M., Lewin, I., Carter, D., Becket, R., 1999. Language-processing strategies and...
  • P.T. Brady, A statistical analysis of on-off patterns in 16 conversations, Bell Syst. Tech. J. (1968)
  • Brennan, S.E., 1996. Lexical entrainment in spontaneous dialog. In: Proc. ISSD, pp....
  • Brockett, C., Dolan, W., 2005. Echo chamber: a game for eliciting a colloquial paraphrase corpus. In: AAAI 2005 Spring...
  • J. Cassell, Body language: lessons from the near-human
  • J. Cassell et al., Negotiated collusion: modeling social language and its relationship effects in intelligent agents, User Model. Adapt. Interf. (2002)
  • Cassell, J., Bickmore, T., Billinghurst, M., Campbell, L., Chang, K., Vilhjálmsson, H., Yan, H., 1999. Embodiment in...
  • Cassell, J., Ananny, M., Basu, A., Bickmore, T., Chong, P., Mellis, D., Ryokai, K., Vilhjalmsson, H., Smith, J., Yan,...
  • Cassell, J., Stocky, T., Bickmore, T., Gao, Y., Nakano, Y., Ryokai, K., 2002. MACK: media lab autonomous conversational...
  • A. Chapanis, Interactive human communication: some lessons learned from laboratory experiments
  • Dahlbäck, N., Jönsson, A., Ahrenberg, L., 1993. Wizard of Oz studies – why and how. In: Proc. 1993 International...
  • Dautenhahn, K., Woods, S., Kaouri, C., Walters, M., Koay, K., Werry, I., 2005. What is a robot companion – friend,...
  • Dybkjaer, H., Bernsen, N., Dybkjaer, L., 1993. Wizard-of-Oz and the trade-off between naturalness and recognizer...
  • Edlund, J., Beskow, J., 2007. Pushy versus meek – using avatars to influence turn-taking behaviour. In: Proc....
  • Edlund, J., Hjalmarsson, A., 2005. Applications of distributed dialogue systems: the KTH Connector. In: Proc. ISCA...
  • Edlund, J., Heldner, M., Gustafson, J., 2006. Two faces of spoken dialogue systems. In: Interspeech 2006 – ICSLP...
  • Eklund, R., 2004. Disfluency in Swedish human–human and human–machine travel booking dialogues. Doctoral dissertation,...
  • Fischer, K., 2006a. The role of users’ preconceptions in talking to computers and robots. In: Proc. Workshop on ‘How...
  • Fischer, K., 2006b. What Computer Talk Is and Is not: Human–Computer Conversation as Intercultural Communication....
  • Georgila, K., Henderson, J., Lemon, O., 2006. User simulation for spoken dialogue systems: learning and evaluation. In:...
  • Glass, J., Polifroni, J., Seneff, S., Zue, V., 2000. Data collection and performance evaluation of spoken dialogue...
  • C. Goodwin, Conversational Organization: Interaction Between Speakers and Hearers (1981)
  • B. Granström et al., Multimodal feedback cues in human–machine interactions
  • Gratch, J., Okhmatovskaia, A., Lamothe, F., Marsella, S., Morales, M., van der Werf, R.J., Morency, L-P., 2006. Virtual...
  • Gustafson, J., Bell, L., 2000. Speech Technology on Trial: Experiences from the August System. Natural Language...
  • Gustafson, J., Sjölander, K., 2002. Voice transformations for improving children’s speech recognition in a publicly...
  • Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granström, B., House, D., Wirén, M., 2000....