Neurocomputing

Volume 388, 7 May 2020, Pages 102-109

Dialogue systems with audio context

https://doi.org/10.1016/j.neucom.2019.12.126

Abstract

Research on building dialogue systems that converse naturally with humans has recently attracted a lot of attention. Most work in this area assumes text-based conversation, where the user message is modeled as a sequence of words from a vocabulary. Real-world human conversation, in contrast, involves other modalities, such as voice, facial expression and body language, which can influence the conversation significantly in certain scenarios. In this work, we explore the impact of incorporating the audio features of the user message into generative dialogue systems. Specifically, we first design an auxiliary response retrieval task for audio representation learning. Then, we use word-level modality fusion to incorporate the audio features as additional context in our main generative model. Experiments show that our audio-augmented model outperforms its audio-free counterpart on perplexity, response diversity and human evaluation.

Introduction

In recent years, data-driven approaches to building conversation models have been made possible by the proliferation of social media conversation data and the increase in computing power. Given a large amount of conversation data, very natural-sounding dialogue systems can be built by learning a mapping from textual context to response using powerful machine learning models [1], [2], [3], [4]. Specifically, in the popular sequence-to-sequence (Seq2Seq) learning framework, the textual context, modeled as a sequence of words from a vocabulary, is encoded into a context vector by a recurrent neural network (RNN). This context vector serves as the initial state of another RNN, which decodes the whole response one token at a time.
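To make this setup concrete, the following is a minimal PyTorch sketch of such an encoder-decoder, in which the encoder's final hidden state initializes the decoder; all layer sizes, names and the toy inputs are illustrative assumptions, not the configuration used in the paper.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, context_ids, response_ids):
        # Encode the textual context into a single context vector.
        _, h = self.encoder(self.embed(context_ids))
        # The context vector initializes the decoder, which predicts the
        # response one token at a time (teacher forcing during training).
        dec_out, _ = self.decoder(self.embed(response_ids), h)
        return self.out(dec_out)  # (batch, resp_len, vocab_size) logits

logits = Seq2Seq()(torch.randint(0, 10000, (2, 12)),   # toy context batch
                   torch.randint(0, 10000, (2, 9)))    # toy response batch

In practice, the decoder is trained with teacher forcing on the reference response and decoded greedily or with beam search at inference time.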

This setting, however, is oversimplified compared to real-world human conversation, which is naturally a multimodal process [5], [6]. Information can be communicated through voice [7], body language [8] and facial expression [9]. In some cases, the same words can carry very different meanings depending on information expressed through other modalities.

In this work, we are interested in audio signals in conversation. Audio signals naturally carry emotional information. For example, “Oh, my god!” generally expresses surprise. Depending on the tone of voice, however, it can also carry a wide range of other emotions, including fear, anger and happiness. Audio signals can have strong semantic functions as well. They may augment or alter the meaning expressed in text. For example, “Oh, that’s great!” usually shows a positive attitude. Spoken with a contemptuous tone, however, the same utterance can be construed as sarcastic. Stress also plays a role in semantics: “I think *she* stole your money” emphasizes the speaker’s opinion on the identity of the thief, while “I think she stole *your* money” emphasizes the speaker’s opinion on the identity of the victim.

Therefore, while identical from a written point of view, utterances may acquire different meanings based solely on audio information. Empowering a dialogue system with such information is necessary to interpret an utterance correctly and generate an appropriate response.

In this work, we explore dialogue generation augmented by audio context under the commonly-used Seq2Seq framework. Firstly, because of the noisiness of the audio signal and the high dimensionality of raw audio features, we design an auxiliary response classification task to learn a suitable audio representation for our dialogue generation objective. Secondly, we use word-level modality fusion to integrate the audio features into the Seq2Seq framework. We design experiments to test how well our model can generate appropriate responses corresponding to the emotion and emphasis expressed in the audio.
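As a hedged sketch, one plausible form of word-level modality fusion concatenates an audio feature vector aligned to each word with that word's embedding before the encoder RNN; the dimensions and names below are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class FusedEncoder(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, audio_dim=64, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The RNN consumes the concatenation of word embedding and audio feature.
        self.rnn = nn.GRU(emb_dim + audio_dim, hid_dim, batch_first=True)

    def forward(self, word_ids, audio_feats):
        # word_ids:    (batch, seq_len) token indices
        # audio_feats: (batch, seq_len, audio_dim), one learned vector per word
        fused = torch.cat([self.embed(word_ids), audio_feats], dim=-1)
        _, h = self.rnn(fused)
        return h  # context vector that also carries audio information

enc = FusedEncoder()
context = enc(torch.randint(0, 10000, (2, 12)), torch.randn(2, 12, 64))

Under this scheme the decoder can remain unchanged, so the audio signal influences generation only through the fused context representation.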

In summary, this paper makes the following contributions:

  • (i) To the best of our knowledge, this work is the first attempt to use audio features of the user message in neural conversation generation. Our model outperforms the baseline audio-free model in terms of perplexity, diversity and human evaluation.

  • (ii) We perform extensive experiments on the trained model, which show that it captures the following conversational phenomena: (1) vocally emphasized words in an utterance are relatively important to response generation; (2) the emotion expressed in the audio of an utterance influences the response.

Section snippets

Related work

The availability of massive text-based conversation data has driven strong interest in building dialogue systems with data-driven methods. The Seq2Seq model, in particular, has been widely used due to its success in text generation tasks such as machine translation [10], video captioning [11] and abstractive text summarization [12]. Seq2Seq employs an encoder–decoder framework, where the conversational context is encoded into a vector representation and then fed to the decoder to generate the response [13].

Audio representation learning

Raw features extracted from audio sequences are high-dimensional and noisy. They are not suitable as direct input to the generative dialogue model. Therefore, we need an audio representation learning method that reduces the dimensionality of the audio features and makes them suitable for the dialogue generation task.

For this purpose, we design an auxiliary response classification task based solely on audio features.

Specifically, we construct a set of ⟨context, response, label⟩ triples, where label is binary
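A minimal sketch of how such binary triples and a matching audio-response scorer could be set up is given below, assuming random negative sampling and simple linear projections; these choices are illustrative and not necessarily the paper's exact design.

import random
import torch
import torch.nn as nn

def build_triples(pairs):
    """pairs: list of (context_audio_feat, response_feat) tensors.
    Returns (audio, response, label) triples, pairing each context with its
    true response (label 1) and with a randomly sampled response (label 0)."""
    triples = []
    all_responses = [resp for _, resp in pairs]
    for audio, resp in pairs:
        triples.append((audio, resp, 1))                          # matching pair
        triples.append((audio, random.choice(all_responses), 0))  # random negative
    return triples

class AudioResponseScorer(nn.Module):
    """Binary classifier: does this candidate response fit this audio context?"""
    def __init__(self, audio_dim=128, resp_dim=128, hid_dim=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hid_dim)  # learned audio representation
        self.resp_proj = nn.Linear(resp_dim, hid_dim)
        self.cls = nn.Linear(2 * hid_dim, 1)

    def forward(self, audio_feat, resp_feat):
        a = torch.tanh(self.audio_proj(audio_feat))
        r = torch.tanh(self.resp_proj(resp_feat))
        return self.cls(torch.cat([a, r], dim=-1))  # logit for the binary label

# Toy usage: train with BCEWithLogitsLoss on the constructed triples.
scorer = AudioResponseScorer()
logit = scorer(torch.randn(1, 128), torch.randn(1, 128))

After training on this auxiliary objective, the learned audio projection can serve as the compact audio representation passed to the generative model.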

Dataset

Most of the existing and consolidated datasets used in research on dialogue systems come with textual content only [26], [30]. The predominance of text-only datasets can be seen as a consequence of both the ease with which this type of data can be acquired and the lack of real demand for multimodal conversation data until recent times. Fortunately, along with the growing interest in multimodal systems, there has also been a proliferation of datasets fit for our task. We

Conclusion

In this work, we augmented the common Seq2Seq dialogue model with audio features and showed that the resulting model outperforms the audio-free baseline on several evaluation metrics. It also captures interesting audio-related conversation phenomena.

Although using only text in dialogue systems is a good-enough approximation in many scenarios, other modalities (e.g., video and audio) have to be integrated before automatic dialogue systems can reach human performance. Our work belongs to such

CRediT authorship contribution statement

Tom Young: Conceptualization, Methodology, Software, Writing - original draft. Vlad Pandelea: Data curation, Software. Soujanya Poria: Conceptualization. Erik Cambria: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project #A18A2b0046).


References (43)

  • Y. Gu et al., Human conversation analysis using attentive multimodal networks with hierarchical encoder-decoder, Proceedings of the 2018 ACM Multimedia Conference, 2018.
  • T. Young et al., Augmenting end-to-end dialogue systems with commonsense knowledge, Proceedings of AAAI 2018, 2018.
  • L. Shao et al., Generating long and diverse responses with neural conversation models, CoRR, 2017.
  • N. Majumder et al., DialogueRNN: an attentive RNN for emotion detection in conversations, Proceedings of AAAI 2019, 2019.
  • H. Xu et al., End-to-end latent-variable task-oriented dialogue system with exact log-likelihood optimization, World Wide Web, 2020.
  • I. Chaturvedi et al., Fuzzy commonsense reasoning for multimodal sentiment analysis, Pattern Recognit. Lett., 2019.
  • R. Kingdon, The semantic functions of stress and tone, ELT J., 1949.
  • J. Streeck et al., Embodied Interaction: Language and Body in the Material World, 2011.
  • A. Takeuchi et al., Communicative facial displays as a new conversational modality, Proceedings of the INTERACT’93 and CHI’93 Conference on Human Factors in Computing Systems, 1993.
  • D. Bahdanau et al., Neural machine translation by jointly learning to align and translate, CoRR, 2014.
  • S. Venugopalan et al., Sequence to sequence - video to text, Proceedings of the IEEE International Conference on Computer Vision, 2015.
  • R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al., Abstractive Text Summarization Using Sequence-to-Sequence RNNs...
  • O. Vinyals, Q. Le, A Neural Conversational Model, arXiv:1506.05869...
  • J. Li, M. Galley, C. Brockett, J. Gao, B. Dolan, A Diversity-Promoting Objective Function for Neural Conversation...
  • H. Zhou et al., Emotional chatting machine: emotional conversation generation with internal and external memory, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • J. Gu et al., Incorporating copying mechanism in sequence-to-sequence learning, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.
  • N. Mostafazadeh, C. Brockett, B. Dolan, M. Galley, J. Gao, G.P. Spithourakis, L. Vanderwende, Image-Grounded...
  • H. Alamri, V. Cartillier, R.G. Lopes, A. Das, J. Wang, I. Essa, D. Batra, D. Parikh, A. Cherian, T.K. Marks, et al.,...
  • C. Hori, H. Alamri, J. Wang, G. Winchern, T. Hori, A. Cherian, T.K. Marks, V. Cartillier, R.G. Lopes, A. Das, et al.,...
  • A. Saha et al., Towards building large scale multimodal domain-aware conversation systems, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • S. Agarwal, O. Dusek, I. Konstas, V. Rieser, Improving Context Modelling in Multimodal Dialogue Generation,...

Tom Young received his Bachelor's degree from Beijing Institute of Technology in 2018. Currently, he is a Ph.D. student under the supervision of Erik Cambria in the School of Computer Science and Engineering at Nanyang Technological University. His main research interests are dialogue systems, deep learning, and computer vision. Specifically, he applies memory-augmented neural networks to model human conversation. He is interested in expanding current chatbot systems to handle more complex environments with higher accuracy.

Vlad Pandelea received his Bachelor and Master of Science in Computer Science from Pisa University in 2017 and 2019, respectively. Since 2020, he has been a PhD student at NTU under the supervision of Erik Cambria. His thesis focuses on the exploitation of multimodal information for dialogue systems. In addition to dialogue systems, his research interest lies in data analytics and in the application of deep learning techniques to a variety of fields, including sentiment analysis, time series and point processes.

Soujanya Poria received his B.Eng. in Computer Science from Jadavpur University (India) in 2013. In the same year, he received the best undergraduate thesis and researcher award and was awarded a gold-plated silver medal from Jadavpur University and Tata Consultancy Services for his final-year project during his undergraduate course. In 2017, Soujanya received his Ph.D. in Computing Science and Mathematics from the University of Stirling (UK) under the co-supervision of Amir Hussain and Erik Cambria. Soon after, he joined Nanyang Technological University as a Research Scientist in the School of Computer Science and Engineering. Later, in 2019, he joined Singapore University of Technology and Design (SUTD), where he is now conducting research on aspect-based sentiment analysis in multiple domains and different modalities as an Assistant Professor.

Erik Cambria is the Founder of SenticNet, a Singapore-based company offering B2B sentiment analysis services, and an Associate Professor at NTU, where he also holds the appointment of Provost Chair in Computer Science and Engineering. Prior to joining NTU, he worked at Microsoft Research Asia and HP Labs India and earned his Ph.D. through a joint programme between the University of Stirling and MIT Media Lab. He is the recipient of many awards, e.g., the 2018 AI's 10 to Watch and the 2019 IEEE Outstanding Early Career award, and is often featured in the news, e.g., Forbes. He is Associate Editor of several journals, e.g., NEUCOM, INFFUS, KBS, IEEE CIM and IEEE Intelligent Systems (where he manages the Department of Affective Computing and Sentiment Analysis), and is involved in many international conferences as PC member, program chair, and speaker.
