User behavior fusion in dialog management with multi-modal history cues

Published in: Multimedia Tools and Applications

Abstract

Making a talking avatar sensitive to user behaviors enhances the user experience in human-computer interaction (HCI). In this study, we combine the user's multi-modal behaviors with their historical information in dialog management (DM) to make the avatar sensitive not only to the user's explicit behavior (speech commands) but also to supporting expressions (emotion, gesture, etc.). According to the different contributions of facial expression, gesture and head motion to speech comprehension, we divide the user's multi-modal behaviors into three categories: complementation, conflict and independence. The behavior category is first obtained automatically from a short-term and time-dynamic (STTD) fusion model with audio-visual input, and each category leads to a different avatar response in later dialog turns. Conflict behavior usually reflects an ambiguous user intention (for example, the user says "no" while smiling); in this case a trial-and-error schema is adopted to resolve the conversational ambiguity. For the subsequent dialog process, we divide the avatar's dialog states into four types: "Ask", "Answer", "Chat" and "Forget". When complementation or independence behavior is detected, the user's supporting expression, together with the explicit behavior, serves as a trigger for topic maintenance or transfer among the four dialog states. In the first part of the experiments, we evaluate the reliability of the STTD model for user behavior classification. Based on the proposed dialog management and the STTD model, we then construct a driving-route information query system by connecting the behavior-sensitive dialog management (BSDM) to a 3D talking avatar. Conversation records of the avatar with different users show that the BSDM enables the avatar to understand and respond to the users' facial expressions, emotional voice and gestures, which improves the user experience in multi-modal human-computer conversation.
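
To make the dialog control described above concrete, the sketch below (Python, not taken from the paper) shows one plausible way the three behavior categories and the four dialog states could be wired together. The feature names, the relevance threshold and the decision rules are illustrative assumptions, not the authors' STTD fusion model or BSDM implementation.

```python
# Hypothetical sketch: behavior-category fusion driving dialog-state decisions.
from enum import Enum, auto

class Behavior(Enum):
    COMPLEMENTATION = auto()   # visual cues reinforce the speech command
    CONFLICT = auto()          # visual cues contradict the speech command
    INDEPENDENCE = auto()      # visual cues unrelated to the speech command

class DialogState(Enum):
    ASK = auto()
    ANSWER = auto()
    CHAT = auto()
    FORGET = auto()

def classify_behavior(speech_polarity: float, visual_polarity: float,
                      relevance: float, tau: float = 0.3) -> Behavior:
    """Toy stand-in for the STTD fusion step: compare the affective polarity
    of the audio and visual channels over a short time window."""
    if relevance < tau:                        # visual cue unrelated to speech
        return Behavior.INDEPENDENCE
    if speech_polarity * visual_polarity >= 0: # channels agree in sign
        return Behavior.COMPLEMENTATION
    return Behavior.CONFLICT                   # channels disagree (e.g. "no" + smile)

def next_state(current: DialogState, behavior: Behavior,
               user_wants_new_topic: bool) -> DialogState:
    """Topic maintenance vs. transfer among the four dialog states."""
    if behavior is Behavior.CONFLICT:
        # Ambiguous intention: trial-and-error, ask again to disambiguate.
        return DialogState.ASK
    if behavior is Behavior.COMPLEMENTATION and not user_wants_new_topic:
        # Explicit and supporting cues agree: keep the topic and answer.
        return DialogState.ANSWER
    # Independence or an explicit topic change: transfer the topic.
    return DialogState.CHAT if user_wants_new_topic else DialogState.FORGET

# Example: user says "no" (negative speech) while smiling (positive face).
b = classify_behavior(speech_polarity=-0.8, visual_polarity=0.6, relevance=0.9)
print(b, next_state(DialogState.ANSWER, b, user_wants_new_topic=False))
# -> Behavior.CONFLICT DialogState.ASK
```

In this toy run, negative speech polarity combined with a positive facial polarity is classified as conflict, which triggers the trial-and-error "Ask" turn described in the abstract.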

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61273288, No. 61233009, No. 61203258, No. 61530503, No. 61332017, No. 61375027), and the Major Program for the National Social Science Fund of China (13&ZD189).

Author information

Corresponding author

Correspondence to Minghao Yang.

About this article

Cite this article

Yang, M., Tao, J., Chao, L. et al. User behavior fusion in dialog management with multi-modal history cues. Multimed Tools Appl 74, 10025–10051 (2015). https://doi.org/10.1007/s11042-014-2161-5
