Dialogue systems with audio context
Introduction
In recent years, data-driven approaches to building conversation models have been made possible by the proliferation of social media conversation data and the increase in computing power. Given a large amount of conversation data, natural-sounding dialogue systems can be built by learning a mapping from textual context to response with powerful machine learning models [1], [2], [3], [4]. Specifically, in the popular sequence-to-sequence (Seq2Seq) learning framework, the textual context, modeled as a sequence of words from a vocabulary, is encoded into a context vector by a recurrent neural network (RNN). This context vector serves as the initial state of another RNN, which decodes the response one token at a time.
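The encode-then-decode process described above can be illustrated with a minimal sketch. All dimensions, parameter names, and the vanilla-RNN cell are illustrative choices for exposition; a real system uses learned parameters and gated cells such as LSTMs or GRUs:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 10   # toy vocabulary size
EMB = 8      # embedding dimension
HID = 16     # RNN hidden-state dimension

# Toy parameters, randomly initialized; a trained system learns these.
E = rng.normal(scale=0.1, size=(VOCAB, EMB))          # embedding table
W_enc = rng.normal(scale=0.1, size=(HID, EMB + HID))  # encoder RNN
W_dec = rng.normal(scale=0.1, size=(HID, EMB + HID))  # decoder RNN
W_out = rng.normal(scale=0.1, size=(VOCAB, HID))      # output projection

def rnn_step(W, x, h):
    """One vanilla-RNN step: h' = tanh(W [x; h])."""
    return np.tanh(W @ np.concatenate([x, h]))

def encode(token_ids):
    """Encode the textual context into a single context vector."""
    h = np.zeros(HID)
    for t in token_ids:
        h = rnn_step(W_enc, E[t], h)
    return h

def decode(context, max_len=5, bos=0):
    """Greedy decoding: the context vector initializes the decoder state,
    and the response is emitted one token at a time."""
    h, tok, out = context, bos, []
    for _ in range(max_len):
        h = rnn_step(W_dec, E[tok], h)
        tok = int(np.argmax(W_out @ h))  # greedy choice of next token
        out.append(tok)
    return out

context_vec = encode([3, 1, 4])  # the "textual context" as token ids
response = decode(context_vec)
print(len(context_vec), len(response))  # 16 5
```

With random weights the generated tokens are meaningless; the sketch only shows the information flow from context to response.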
This setting, however, is oversimplified compared to real-world human conversation, which is naturally a multimodal process [5], [6]. Information can be communicated through voice [7], body language [8] and facial expression [9]. In some cases, the same words can carry very different meanings depending on information expressed through other modalities.
In this work, we are interested in audio signals in conversation. Audio signals naturally carry emotional information. For example, “Oh, my god!” generally expresses surprise. Depending on the tone of voice, however, it can also carry a wide range of other emotions, including fear, anger and happiness. Audio signals can have strong semantic functions as well: they may augment or alter the meaning expressed in text. For example, “Oh, that’s great!” usually signals a positive attitude. Delivered in a contemptuous tone, however, the same utterance can be construed as sarcastic. Stress also plays a role in semantics: “I think SHE stole your money” emphasizes the speaker’s opinion on the identity of the thief, while “I think she stole YOUR money” emphasizes the speaker’s opinion on the identity of the victim.
Therefore, while identical from a written point of view, utterances may acquire different meanings based solely on audio information. Empowering a dialogue system with such information is necessary to interpret an utterance correctly and generate an appropriate response.
In this work, we explore dialogue generation augmented by audio context under the commonly-used Seq2Seq framework. First, because the audio signal is noisy and raw audio features are high-dimensional, we design an auxiliary response classification task to learn a suitable audio representation for our dialogue generation objective. Second, we use word-level modality fusion to integrate audio features into the Seq2Seq framework. We design experiments to test how well our model can generate appropriate responses corresponding to the emotion and emphasis expressed in the audio.
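Word-level modality fusion can be sketched as follows. The idea, as described above, is that each word of the context comes with an aligned audio feature vector, and the two are combined before entering the encoder; the concatenation scheme, dimensions, and names here are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

EMB, AUD, HID = 8, 4, 16  # illustrative dimensions

E = rng.normal(scale=0.1, size=(20, EMB))                # word embeddings
W_enc = rng.normal(scale=0.1, size=(HID, EMB + AUD + HID))

def encode_fused(token_ids, audio_feats):
    """Word-level fusion: concatenate each word embedding with the
    audio feature vector aligned to that word, then feed the fused
    vector into the encoder RNN."""
    assert len(token_ids) == len(audio_feats)
    h = np.zeros(HID)
    for t, a in zip(token_ids, audio_feats):
        fused = np.concatenate([E[t], a])          # [word ; audio]
        h = np.tanh(W_enc @ np.concatenate([fused, h]))
    return h

tokens = [3, 7, 2]
audio = [rng.normal(size=AUD) for _ in tokens]  # one vector per word
ctx = encode_fused(tokens, audio)
print(ctx.shape)  # (16,)
```

The resulting context vector can then initialize the decoder exactly as in the audio-free Seq2Seq model, so fusion changes only the encoder input.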
In summary, this paper makes the following contributions:
- (i)
To the best of our knowledge, this work is the first attempt to use audio features of the user message in neural conversation generation. Our model outperforms the baseline audio-free model in terms of perplexity, diversity and human evaluation.
- (ii)
We perform extensive experiments on the trained model which show that our model captures the following phenomena in conversation: (1) Vocally emphasized words in an utterance are relatively important to response generation. (2) Emotion expressed in the audio of an utterance has influence on the response.
Related work
Massive text-based conversation data has driven a strong interest in building dialogue systems with data-driven methods. The Seq2Seq model, in particular, has been widely used due to its success in text generation tasks such as machine translation [10], video captioning [11] and abstractive text summarization [12]. Seq2Seq employs an encoder–decoder framework, where the conversational context is encoded into a vector representation and, then, fed to the decoder to generate the response [13].
Audio representation learning
Raw features extracted from audio sequences are high-dimensional and noisy, and are therefore not well suited as direct input to the dialogue generation model. We thus need an audio representation learning method that reduces the dimensionality of the audio features and makes them suitable for the dialogue generation task.
For this purpose, we design an auxiliary response classification task based solely on audio features.
Specifically, we construct a set of <context, response, label> triples, where the label is binary.
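The construction of such binary-labeled triples can be sketched as below. The source does not specify how negatives are produced, so this sketch assumes the standard negative-sampling scheme for auxiliary response classification: a positive triple pairs a context with its true response, and a negative triple pairs it with a response sampled from a different dialogue:

```python
import random

random.seed(0)

# Toy (context, response) pairs; in the paper the context would
# additionally carry audio features rather than plain text only.
dialogues = [
    ("how are you", "fine thanks"),
    ("what time is it", "about noon"),
    ("where are you going", "to the station"),
]

def make_triples(pairs, neg_per_pos=1):
    """Build <context, response, label> triples.
    label = 1: the true response to the context.
    label = 0: a response sampled from another dialogue
    (assumed negative-sampling scheme, not from the paper)."""
    triples = []
    for i, (ctx, resp) in enumerate(pairs):
        triples.append((ctx, resp, 1))
        for _ in range(neg_per_pos):
            j = random.choice([k for k in range(len(pairs)) if k != i])
            triples.append((ctx, pairs[j][1], 0))
    return triples

data = make_triples(dialogues)
print(len(data))  # 6 triples: 3 positive, 3 negative
```

A classifier trained to predict the label from audio features of the context is then forced to compress those features into a representation that is predictive of response appropriateness, which is what the dialogue generator needs.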
Dataset
Most of the existing, consolidated datasets used in dialogue system research come with textual content only [26], [30]. The predominance of text-only datasets is a consequence both of the ease with which this type of data can be acquired and of the lack of real demand for multimodal conversation data until recently. Fortunately, along with the growing interest in multimodal systems, there has been a proliferation of datasets fit for our task.
Conclusion
In this work, we augmented the common Seq2Seq dialogue model with audio features and showed that the resulting model outperforms the audio-free baseline on several evaluation metrics. It also captures interesting audio-related conversation phenomena.
Although using only text in dialogue systems is a good-enough approximation in many scenarios, other modalities (e.g., video and audio) have to be integrated before automatic dialogue systems can reach human performance. Our work belongs to such efforts.
CRediT authorship contribution statement
Tom Young: Conceptualization, Methodology, Software, Writing - original draft. Vlad Pandelea: Data curation, Software. Soujanya Poria: Conceptualization. Erik Cambria: Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project #A18A2b0046).
References (43)
- et al., Human conversation analysis using attentive multimodal networks with hierarchical encoder-decoder, Proceedings of the 2018 ACM Multimedia Conference (2018)
- et al., Augmenting end-to-end dialogue systems with commonsense knowledge, Proceedings of the 2018 AAAI (2018)
- et al., Generating long and diverse responses with neural conversation models, CoRR (2017)
- et al., DialogueRNN: an attentive RNN for emotion detection in conversations, Proceedings of the 2019 AAAI (2019)
- et al., End-to-end latent-variable task-oriented dialogue system with exact log-likelihood optimization, World Wide Web (2020)
- et al., Fuzzy commonsense reasoning for multimodal sentiment analysis, Pattern Recognit. Lett. (2019)
- The semantic functions of stress and tone, ELT J. (1949)
- et al., Embodied Interaction: Language and Body in the Material World (2011)
- et al., Communicative facial displays as a new conversational modality, Proceedings of the INTERACT'93 and CHI'93 Conference on Human Factors in Computing Systems (1993)
- et al., Neural machine translation by jointly learning to align and translate, CoRR (2014)
- Sequence to sequence - video to text, Proceedings of the IEEE International Conference on Computer Vision
- Emotional chatting machine: emotional conversation generation with internal and external memory, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence
- Incorporating copying mechanism in sequence-to-sequence learning, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Towards building large scale multimodal domain-aware conversation systems, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence
Tom Young got his Bachelor from Beijing Institute of Technology in 2018. Currently, he is a Ph.D. student under the supervision of Erik Cambria in the School of Computer Science and Engineering at Nanyang Technological University. His main research interests are dialogue systems, deep learning, and computer vision. Specifically, he applies memory-augmented neural networks to model human conversation. He is interested in expanding current chatbot systems to handle more complex environments with higher accuracy.
Vlad Pandelea received his Bachelor and Master of Science in Computer Science from Pisa University in 2017 and 2019, respectively. Since 2020, he has been a Ph.D. student at NTU under the supervision of Erik Cambria. His thesis focuses on the exploitation of multimodal information for dialogue systems. In addition to dialogue systems, his research interest lies in data analytics and in the application of deep learning techniques to a variety of fields, including sentiment analysis, time series and point processes.
Soujanya Poria received his B.Eng. in Computer Science from Jadavpur University (India) in 2013. In the same year, he received the best undergraduate thesis and researcher award and was awarded Gold Plated Silver medal from Jadavpur University and Tata Consultancy Service for his final year project during his undergraduate course. In 2017, Soujanya got his Ph.D. in Computing Science and Mathematics from the University of Stirling (UK) under the co-supervision of Amir Hussain and Erik Cambria. Soon after, he joined Nanyang Technological University as a Research Scientist in the School of Computer Science and Engineering. Later in 2019, he joined Singapore University of Technology and Design (SUTD), where he is now conducting research on aspect-based sentiment analysis in multiple domains and different modalities as an Assistant Professor.
Erik Cambria is the Founder of SenticNet, a Singapore-based company offering B2B sentiment analysis services, and an Associate Professor at NTU, where he also holds the appointment of Provost Chair in Computer Science and Engineering. Prior to joining NTU, he worked at Microsoft Research Asia and HP Labs India and earned his Ph.D. through a joint programme between the University of Stirling and MIT Media Lab. He is the recipient of many awards, e.g., the 2018 AI's 10 to Watch and the 2019 IEEE Outstanding Early Career award, and is often featured in the news, e.g., Forbes. He is Associate Editor of several journals, e.g., NEUCOM, INFFUS, KBS, IEEE CIM and IEEE Intelligent Systems (where he manages the Department of Affective Computing and Sentiment Analysis), and is involved in many international conferences as PC member, program chair, and speaker.