Abstract
Making a talking avatar sensitive to user behaviors enhances the user experience in human-computer interaction (HCI). In this study, we combine the user's multi-modal behaviors with their historical information in dialog management (DM), so that the avatar responds not only to the user's explicit behavior (speech commands) but also to supporting expressions (emotion, gesture, etc.). In the dialog management, according to the different contributions that facial expression, gesture and head motion make to speech comprehension, we divide the user's multi-modal behaviors into three categories: complementation, conflict and independence. The behavior category is first obtained automatically from a short-term and time-dynamic (STTD) fusion model with audio-visual input, and each category leads to a different avatar response in later dialog turns. Conflict behavior usually reflects an ambiguous user intention (for example, the user says "no" while smiling); in this case, a trial-and-error scheme is adopted to eliminate the conversational ambiguity. For the subsequent dialog process, we divide the avatar's dialog states into four types: "Ask", "Answer", "Chat" and "Forget". When complementation or independence behaviors are detected, the user's supporting expression, together with the explicit behavior, serves as a trigger for topic maintenance or transfer among the four dialog states. In the first part of the experiments, we evaluate the reliability of the STTD model for user behavior classification. Based on the proposed dialog management and the STTD model, we then construct a driving-route information query system by connecting the user behavior sensitive dialog management (BSDM) to a 3D talking avatar. Conversation records between the avatar and different users show that BSDM enables the avatar to understand and be sensitive to users' facial expressions, emotional voice and gestures, which improves the user experience in multi-modal human-computer conversation.
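The abstract describes a rule-like mapping from fused behavior categories to dialog-state transitions. Below is a minimal sketch of that mapping, assuming the STTD fusion model is an upstream black box that has already produced a behavior label; the Behavior and DialogState enums, the next_state function, and the concrete transition rules are illustrative assumptions, not the paper's actual implementation.

```python
from enum import Enum

class Behavior(Enum):
    COMPLEMENTATION = "complementation"  # expression reinforces the speech command
    CONFLICT = "conflict"                # expression contradicts the speech command
    INDEPENDENCE = "independence"        # expression unrelated to the speech command

class DialogState(Enum):
    ASK = "Ask"
    ANSWER = "Answer"
    CHAT = "Chat"
    FORGET = "Forget"

def next_state(behavior: Behavior, expression: str) -> DialogState:
    """Map the fused behavior category (plus the supporting expression)
    to the avatar's next dialog state. Hypothetical policy."""
    if behavior is Behavior.CONFLICT:
        # Ambiguous intention, e.g. the user says "no" while smiling:
        # trial-and-error, so re-ask to confirm before acting.
        return DialogState.ASK
    if behavior is Behavior.COMPLEMENTATION:
        # Speech and expression agree: answer the query, keep the topic.
        return DialogState.ANSWER
    # Independence: the expression does not bear on the command, so it may
    # trigger topic transfer (casual chat) or dropping the current topic.
    if expression in ("bored", "annoyed"):
        return DialogState.FORGET
    return DialogState.CHAT

# Example: the STTD model labels "no" + smile as a conflict.
print(next_state(Behavior.CONFLICT, "smiling"))  # DialogState.ASK
```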
Acknowledgments
This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61273288, No. 61233009, No. 61203258, No. 61530503, No. 61332017, No. 61375027), and the Major Program for the National Social Science Fund of China (13&ZD189).
Cite this article
Yang, M., Tao, J., Chao, L. et al. User behavior fusion in dialog management with multi-modal history cues. Multimed Tools Appl 74, 10025–10051 (2015). https://doi.org/10.1007/s11042-014-2161-5