Abstract
We introduce gesture controllers, a method for animating the body language of avatars engaged in live spoken conversation. A gesture controller is an optimal-policy controller that schedules gesture animations in real time based on acoustic features in the user's speech. The controller consists of an inference layer, which infers a distribution over a set of hidden states from the speech signal, and a control layer, which selects the optimal motion based on the inferred state distribution. The inference layer, consisting of a specialized conditional random field, learns the hidden structure in body language style and associates it with acoustic features in speech. The control layer uses reinforcement learning to construct an optimal policy for selecting motion clips from a distribution over the learned hidden states. The modularity of the proposed method allows customization of a character's gesture repertoire, animation of non-human characters, and the use of additional inputs such as speech recognition or direct user control.
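The two-layer structure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain forward-filter update stands in for the specialized conditional random field, and a hand-filled value table stands in for the policy that the paper learns with reinforcement learning. The function names `update_belief` and `select_clip`, the two hidden states, the clip names, and all numbers are hypothetical.

```python
def update_belief(belief, transition, likelihood):
    """Inference-layer stand-in: one forward-filter step over hidden states.

    belief[i]        -- current probability of hidden state i
    transition[i][j] -- assumed probability of moving from state i to state j
    likelihood[j]    -- assumed fit of the current acoustic features to state j
    """
    n = len(belief)
    # Predict the next state distribution from the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Weight each state by how well it explains the current speech frame.
    weighted = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(weighted)
    if total == 0:
        # No state explains the observation; keep the prediction unchanged.
        return predicted
    return [w / total for w in weighted]


def select_clip(belief, values):
    """Control-layer stand-in: pick the clip with the highest expected value.

    values[clip][i] is an assumed value of playing `clip` in hidden state i.
    """
    expected = {
        clip: sum(b * v for b, v in zip(belief, per_state))
        for clip, per_state in values.items()
    }
    return max(expected, key=expected.get)


# Hypothetical two-state example: state 0 = "calm", state 1 = "emphatic".
transition = [[0.9, 0.1],
              [0.2, 0.8]]
values = {"rest": [1.0, 0.0],
          "beat": [0.0, 1.0]}

belief = [0.5, 0.5]
# An acoustic frame that fits the "emphatic" state much better.
new_belief = update_belief(belief, transition, likelihood=[0.1, 0.9])
chosen = select_clip(new_belief, values)
print(new_belief, chosen)
```

Keeping the two functions separate mirrors the modularity the abstract mentions: the value table can be swapped to change the gesture repertoire without touching the inference step.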
Supplemental Material
The auxiliary material contains the accompanying video showing various gesture controllers.
Published in SIGGRAPH '10: ACM SIGGRAPH 2010 papers.