User behavior fusion in dialog management with multi-modal history cues

Published in: Multimedia Tools and Applications

Abstract

Making a talking avatar sensitive to user behaviors enhances the user experience in human-computer interaction (HCI). In this study, we combine the user's multi-modal behaviors with their historical information in dialog management (DM) to make the avatar sensitive not only to the user's explicit behavior (speech commands) but also to supporting expressions (emotion, gesture, etc.). According to the different contributions of facial expression, gesture and head motion to speech comprehension, we divide the user's multi-modal behaviors into three categories: complementation, conflict and independence. The behavior category is first obtained automatically from a short-term and time-dynamic (STTD) fusion model with audio-visual input, and each category leads to a different avatar response in later dialog turns. Conflict behavior usually reflects an ambiguous user intention (for example, the user says "no" while smiling); in this case a trial-and-error schema is adopted to resolve the conversational ambiguity. For the subsequent dialog process, we divide the avatar's dialog states into four types: "Ask", "Answer", "Chat" and "Forget". When complementation or independence behavior is detected, the user's supporting expression, together with the explicit behavior, serves as a trigger for topic maintenance or transfer among the four dialog states. In the first part of the experiments, we evaluate the reliability of the STTD model for user behavior classification. Based on the proposed dialog management and the STTD model, we then construct a driving-route information query system by connecting the behavior-sensitive dialog management (BSDM) to a 3D talking avatar. Conversation records of the avatar with different users show that the BSDM enables the avatar to understand and respond to the users' facial expressions, emotional voice and gestures, which improves the user experience in multi-modal human-computer conversation.
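
To make the dialog control described above concrete, the sketch below (Python, not taken from the paper) shows one plausible way the three behavior categories and the four dialog states could be wired together. The feature names, the relevance threshold and the decision rules are illustrative assumptions, not the authors' STTD fusion model or BSDM implementation.

```python
# Hypothetical sketch: behavior-category fusion driving dialog-state decisions.
from enum import Enum, auto

class Behavior(Enum):
    COMPLEMENTATION = auto()   # visual cues reinforce the speech command
    CONFLICT = auto()          # visual cues contradict the speech command
    INDEPENDENCE = auto()      # visual cues unrelated to the speech command

class DialogState(Enum):
    ASK = auto()
    ANSWER = auto()
    CHAT = auto()
    FORGET = auto()

def classify_behavior(speech_polarity: float, visual_polarity: float,
                      relevance: float, tau: float = 0.3) -> Behavior:
    """Toy stand-in for the STTD fusion step: compare the affective polarity
    of the audio and visual channels over a short time window."""
    if relevance < tau:                        # visual cue unrelated to speech
        return Behavior.INDEPENDENCE
    if speech_polarity * visual_polarity >= 0: # channels agree in sign
        return Behavior.COMPLEMENTATION
    return Behavior.CONFLICT                   # channels disagree (e.g. "no" + smile)

def next_state(current: DialogState, behavior: Behavior,
               user_wants_new_topic: bool) -> DialogState:
    """Topic maintenance vs. transfer among the four dialog states."""
    if behavior is Behavior.CONFLICT:
        # Ambiguous intention: trial-and-error, ask again to disambiguate.
        return DialogState.ASK
    if behavior is Behavior.COMPLEMENTATION and not user_wants_new_topic:
        # Explicit and supporting cues agree: keep the topic and answer.
        return DialogState.ANSWER
    # Independence or an explicit topic change: transfer the topic.
    return DialogState.CHAT if user_wants_new_topic else DialogState.FORGET

# Example: user says "no" (negative speech) while smiling (positive face).
b = classify_behavior(speech_polarity=-0.8, visual_polarity=0.6, relevance=0.9)
print(b, next_state(DialogState.ANSWER, b, user_wants_new_topic=False))
# -> Behavior.CONFLICT DialogState.ASK
```

In this toy run, negative speech polarity combined with a positive facial polarity is classified as conflict, which triggers the trial-and-error "Ask" turn described in the abstract.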

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61273288, No. 61233009, No. 61203258, No. 61530503, No. 61332017, No. 61375027), and the Major Program for the National Social Science Fund of China (13&ZD189).

Author information

Corresponding author

Correspondence to Minghao Yang.

About this article

Cite this article

Yang, M., Tao, J., Chao, L. et al. User behavior fusion in dialog management with multi-modal history cues. Multimed Tools Appl 74, 10025–10051 (2015). https://doi.org/10.1007/s11042-014-2161-5
