Neurocomputing

Volume 388, 7 May 2020, Pages 102-109

Dialogue systems with audio context

https://doi.org/10.1016/j.neucom.2019.12.126

Abstract

Research on building dialogue systems that converse naturally with humans has recently attracted a lot of attention. Most work in this area assumes text-based conversation, where the user message is modeled as a sequence of words from a vocabulary. Real-world human conversation, in contrast, involves other modalities, such as voice, facial expression and body language, which can influence the conversation significantly in certain scenarios. In this work, we explore the impact of incorporating the audio features of the user message into generative dialogue systems. Specifically, we first design an auxiliary response retrieval task for audio representation learning. Then, we use word-level modality fusion to incorporate the audio features as additional context in our main generative model. Experiments show that our audio-augmented model outperforms its audio-free counterpart on perplexity, response diversity and human evaluation.

Introduction

In recent years, data-driven approaches to building conversation models have been made possible by the proliferation of social media conversation data and the increase in computing power. Given a large amount of conversation data, very natural-sounding dialogue systems can be built by learning a mapping from textual context to response using powerful machine learning models [1], [2], [3], [4]. Specifically, in the popular sequence-to-sequence (Seq2Seq) learning framework, the textual context, modeled as a sequence of words from a vocabulary, is encoded into a context vector by a recurrent neural network (RNN). This context vector serves as the initial state of another RNN, which decodes the whole response one token at a time.
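To make this setup concrete, the following is a minimal PyTorch sketch of such an encoder-decoder, in which the encoder's final hidden state initializes the decoder; all layer sizes, names and the toy inputs are illustrative assumptions, not the configuration used in the paper.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, context_ids, response_ids):
        # Encode the textual context into a single context vector.
        _, h = self.encoder(self.embed(context_ids))
        # The context vector initializes the decoder, which predicts the
        # response one token at a time (teacher forcing during training).
        dec_out, _ = self.decoder(self.embed(response_ids), h)
        return self.out(dec_out)  # (batch, resp_len, vocab_size) logits

logits = Seq2Seq()(torch.randint(0, 10000, (2, 12)),   # toy context batch
                   torch.randint(0, 10000, (2, 9)))    # toy response batch

In practice, the decoder is trained with teacher forcing on the reference response and decoded greedily or with beam search at inference time.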

This setting, however, is oversimplified compared to real-world human conversation, which is naturally a multimodal process [5], [6]. Information can be communicated through voice [7], body language [8] and facial expression [9]. In some cases, the same words can carry very different meanings depending on information expressed through other modalities.

In this work, we are interested in audio signals in conversation. Audio signals naturally carry emotional information. For example, “Oh, my god!” generally expresses surprise. Depending on the tone of voice, however, it can also carry a wide range of other emotions, including fear, anger and happiness. Audio signals can have strong semantic functions as well. They may augment or alter the meaning expressed in text. For example, “Oh, that’s great!” usually shows a positive attitude. Spoken with a contemptuous tone, however, the same utterance can be construed as sarcastic. Stress also plays a role in semantics: “I think *she* stole your money” emphasizes the speaker’s opinion on the identity of the thief, while “I think she stole *your* money” emphasizes the speaker’s opinion on the identity of the victim.

Therefore, while identical from a written point of view, utterances may acquire different meanings based solely on audio information. Empowering a dialogue system with such information is necessary to interpret an utterance correctly and generate an appropriate response.

In this work, we explore dialogue generation augmented by audio context under the commonly-used Seq2Seq framework. Firstly, because of the noisiness of the audio signal and the high dimensionality of raw audio features, we design an auxiliary response classification task to learn a suitable audio representation for our dialogue generation objective. Secondly, we use word-level modality fusion to integrate the audio features into the Seq2Seq framework. We design experiments to test how well our model can generate appropriate responses corresponding to the emotion and emphasis expressed in the audio.
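As a hedged sketch, one plausible form of word-level modality fusion concatenates an audio feature vector aligned to each word with that word's embedding before the encoder RNN; the dimensions and names below are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class FusedEncoder(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, audio_dim=64, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The RNN consumes the concatenation of word embedding and audio feature.
        self.rnn = nn.GRU(emb_dim + audio_dim, hid_dim, batch_first=True)

    def forward(self, word_ids, audio_feats):
        # word_ids:    (batch, seq_len) token indices
        # audio_feats: (batch, seq_len, audio_dim), one learned vector per word
        fused = torch.cat([self.embed(word_ids), audio_feats], dim=-1)
        _, h = self.rnn(fused)
        return h  # context vector that also carries audio information

enc = FusedEncoder()
context = enc(torch.randint(0, 10000, (2, 12)), torch.randn(2, 12, 64))

Under this scheme the decoder can remain unchanged, so the audio signal influences generation only through the fused context representation.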

In summary, this paper makes the following contributions:

  • (i) To the best of our knowledge, this work is the first attempt to use audio features of the user message in neural conversation generation. Our model outperforms the baseline audio-free model in terms of perplexity, diversity and human evaluation.

  • (ii) We perform extensive experiments on the trained model, which show that it captures the following conversational phenomena: (1) vocally emphasized words in an utterance are relatively important to response generation; (2) the emotion expressed in the audio of an utterance influences the response.

Section snippets

Related work

The availability of massive text-based conversation data has driven strong interest in building dialogue systems with data-driven methods. The Seq2Seq model, in particular, has been widely used due to its success in text generation tasks such as machine translation [10], video captioning [11] and abstractive text summarization [12]. Seq2Seq employs an encoder–decoder framework, where the conversational context is encoded into a vector representation and then fed to the decoder to generate the response [13].

Audio representation learning

Raw features extracted from audio sequences are high-dimensional and noisy. They are not suitable as direct input to the generative dialogue model. Therefore, we need an audio representation learning method that reduces the dimensionality of the audio features and makes them suitable for the dialogue generation task.

For this purpose, we design an auxiliary response classification task based solely on audio features.

Specifically, we construct a set of ⟨context, response, label⟩ triples, where label is binary
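A minimal sketch of how such binary triples and a matching audio-response scorer could be set up is given below, assuming random negative sampling and simple linear projections; these choices are illustrative and not necessarily the paper's exact design.

import random
import torch
import torch.nn as nn

def build_triples(pairs):
    """pairs: list of (context_audio_feat, response_feat) tensors.
    Returns (audio, response, label) triples, pairing each context with its
    true response (label 1) and with a randomly sampled response (label 0)."""
    triples = []
    all_responses = [resp for _, resp in pairs]
    for audio, resp in pairs:
        triples.append((audio, resp, 1))                          # matching pair
        triples.append((audio, random.choice(all_responses), 0))  # random negative
    return triples

class AudioResponseScorer(nn.Module):
    """Binary classifier: does this candidate response fit this audio context?"""
    def __init__(self, audio_dim=128, resp_dim=128, hid_dim=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hid_dim)  # learned audio representation
        self.resp_proj = nn.Linear(resp_dim, hid_dim)
        self.cls = nn.Linear(2 * hid_dim, 1)

    def forward(self, audio_feat, resp_feat):
        a = torch.tanh(self.audio_proj(audio_feat))
        r = torch.tanh(self.resp_proj(resp_feat))
        return self.cls(torch.cat([a, r], dim=-1))  # logit for the binary label

# Toy usage: train with BCEWithLogitsLoss on the constructed triples.
scorer = AudioResponseScorer()
logit = scorer(torch.randn(1, 128), torch.randn(1, 128))

After training on this auxiliary objective, the learned audio projection can serve as the compact audio representation passed to the generative model.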

Dataset

Most of the existing and consolidated datasets used in research on dialogue systems come with textual content only [26], [30]. The predominance of text-only datasets can be seen as a consequence of both the ease with which this type of data can be acquired and the lack of real demand for multimodal conversation data until recent times. Fortunately, along with the growing interest in multimodal systems, there has also been a proliferation of datasets fit for our task. We

Conclusion

In this work, we augmented the common Seq2Seq dialogue model with audio features and showed that the resulting model outperforms the audio-free baseline on several evaluation metrics. It also captures interesting audio-related conversation phenomena.

Although using only text in dialogue systems is a good-enough approximation in many scenarios, other modalities (e.g., video and audio) have to be integrated before automatic dialogue systems can reach human performance. Our work belongs to such

CRediT authorship contribution statement

Tom Young: Conceptualization, Methodology, Software, Writing - original draft. Vlad Pandelea: Data curation, Software. Soujanya Poria: Conceptualization. Erik Cambria: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project #A18A2b0046).


References (43)

  • Y. Gu et al., Human conversation analysis using attentive multimodal networks with hierarchical encoder-decoder, Proceedings of the 2018 ACM Multimedia Conference, 2018.
  • T. Young et al., Augmenting end-to-end dialogue systems with commonsense knowledge, Proceedings of AAAI 2018, 2018.
  • L. Shao et al., Generating long and diverse responses with neural conversation models, CoRR, 2017.
  • N. Majumder et al., DialogueRNN: an attentive RNN for emotion detection in conversations, Proceedings of AAAI 2019, 2019.
  • H. Xu et al., End-to-end latent-variable task-oriented dialogue system with exact log-likelihood optimization, World Wide Web, 2020.
  • I. Chaturvedi et al., Fuzzy commonsense reasoning for multimodal sentiment analysis, Pattern Recognit. Lett., 2019.
  • R. Kingdon, The semantic functions of stress and tone, ELT J., 1949.
  • J. Streeck et al., Embodied Interaction: Language and Body in the Material World, 2011.
  • A. Takeuchi et al., Communicative facial displays as a new conversational modality, Proceedings of the INTERACT’93 and CHI’93 Conference on Human Factors in Computing Systems, 1993.
  • D. Bahdanau et al., Neural machine translation by jointly learning to align and translate, CoRR, 2014.
  • S. Venugopalan et al., Sequence to sequence - video to text, Proceedings of the IEEE International Conference on Computer Vision, 2015.
  • R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al., Abstractive Text Summarization Using Sequence-to-Sequence RNNs...
  • O. Vinyals, Q. Le, A Neural Conversational Model, arXiv:1506.05869...
  • J. Li, M. Galley, C. Brockett, J. Gao, B. Dolan, A Diversity-Promoting Objective Function for Neural Conversation...
  • H. Zhou et al., Emotional chatting machine: emotional conversation generation with internal and external memory, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • J. Gu et al., Incorporating copying mechanism in sequence-to-sequence learning, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.
  • N. Mostafazadeh, C. Brockett, B. Dolan, M. Galley, J. Gao, G.P. Spithourakis, L. Vanderwende, Image-Grounded...
  • H. Alamri, V. Cartillier, R.G. Lopes, A. Das, J. Wang, I. Essa, D. Batra, D. Parikh, A. Cherian, T.K. Marks, et al.,...
  • C. Hori, H. Alamri, J. Wang, G. Winchern, T. Hori, A. Cherian, T.K. Marks, V. Cartillier, R.G. Lopes, A. Das, et al.,...
  • A. Saha et al., Towards building large scale multimodal domain-aware conversation systems, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • S. Agarwal, O. Dusek, I. Konstas, V. Rieser, Improving Context Modelling in Multimodal Dialogue Generation,...

Tom Young received his Bachelor's degree from Beijing Institute of Technology in 2018. Currently, he is a Ph.D. student under the supervision of Erik Cambria in the School of Computer Science and Engineering at Nanyang Technological University. His main research interests are dialogue systems, deep learning, and computer vision. Specifically, he applies memory-augmented neural networks to model human conversation. He is interested in expanding current chatbot systems to handle more complex environments with higher accuracy.

Vlad Pandelea received his Bachelor and Master of Science in Computer Science from Pisa University in 2017 and 2019, respectively. Since 2020, he has been a PhD student at NTU under the supervision of Erik Cambria. His thesis focuses on the exploitation of multimodal information for dialogue systems. In addition to dialogue systems, his research interest lies in data analytics and in the application of deep learning techniques to a variety of fields, including sentiment analysis, time series and point processes.

Soujanya Poria received his B.Eng. in Computer Science from Jadavpur University (India) in 2013. In the same year, he received the best undergraduate thesis and researcher award and was awarded a gold-plated silver medal from Jadavpur University and Tata Consultancy Services for his final-year project during his undergraduate course. In 2017, Soujanya received his Ph.D. in Computing Science and Mathematics from the University of Stirling (UK) under the co-supervision of Amir Hussain and Erik Cambria. Soon after, he joined Nanyang Technological University as a Research Scientist in the School of Computer Science and Engineering. Later, in 2019, he joined Singapore University of Technology and Design (SUTD), where he is now conducting research on aspect-based sentiment analysis in multiple domains and different modalities as an Assistant Professor.

Erik Cambria is the Founder of SenticNet, a Singapore-based company offering B2B sentiment analysis services, and an Associate Professor at NTU, where he also holds the appointment of Provost Chair in Computer Science and Engineering. Prior to joining NTU, he worked at Microsoft Research Asia and HP Labs India and earned his Ph.D. through a joint programme between the University of Stirling and MIT Media Lab. He is the recipient of many awards, e.g., the 2018 AI's 10 to Watch and the 2019 IEEE Outstanding Early Career award, and is often featured in the news, e.g., Forbes. He is Associate Editor of several journals, e.g., NEUCOM, INFFUS, KBS, IEEE CIM and IEEE Intelligent Systems (where he manages the Department of Affective Computing and Sentiment Analysis), and is involved in many international conferences as PC member, program chair, and speaker.
