ABSTRACT
We present a novel framework that automatically generates natural gesture motions accompanying speech from audio utterances. Built on a Bi-Directional LSTM network, our model learns speech-gesture relationships with both backward and forward consistency over long time spans. At each time step, the network regresses a full 3D skeletal pose of a human from perceptual features extracted from the input audio. We then apply combined temporal filters to smooth the generated pose sequences. We train the network on a speech-gesture dataset recorded with a headset microphone and marker-based motion capture. We validated our approach with a subjective evaluation that compared the generated gestures against "original" human gestures and "mismatched" human gestures taken from a different utterance. The results show that our generated gestures are rated significantly better than the "mismatched" gestures with respect to time consistency, and marginally significantly better with respect to semantic consistency.
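To make the described pipeline concrete, below is a minimal sketch of the two stages the abstract names: a bidirectional LSTM that regresses a per-frame 3D skeletal pose from audio features, followed by temporal smoothing. This is an illustration, not the authors' implementation; the feature dimensionality (MFCC-like coefficients per frame), hidden size, joint count, and the moving-average filter standing in for the paper's "combined temporal filters" are all assumptions.

```python
import torch
import torch.nn as nn

class SpeechToGestureBLSTM(nn.Module):
    """Sketch of a speech-to-gesture regressor (assumed architecture)."""

    def __init__(self, n_audio_features=26, hidden_size=256, n_joints=20):
        super().__init__()
        # A bidirectional LSTM provides both backward and forward
        # context over the utterance, as the abstract emphasizes.
        self.blstm = nn.LSTM(
            input_size=n_audio_features,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )
        # Linear readout regresses a full pose (x, y, z per joint)
        # at every time step.
        self.readout = nn.Linear(2 * hidden_size, 3 * n_joints)

    def forward(self, audio_features):
        # audio_features: (batch, time, n_audio_features)
        context, _ = self.blstm(audio_features)
        return self.readout(context)  # (batch, time, 3 * n_joints)

def smooth_poses(poses, window=5):
    # Moving-average temporal filter over the time axis, used here
    # only as a stand-in for the paper's combined temporal filters.
    # poses: (batch, time, pose_dim)
    pose_dim = poses.shape[-1]
    kernel = torch.ones(pose_dim, 1, window) / window
    x = poses.transpose(1, 2)  # (batch, pose_dim, time)
    x = nn.functional.pad(
        x, (window // 2, window - 1 - window // 2), mode="replicate"
    )
    x = nn.functional.conv1d(x, kernel, groups=pose_dim)
    return x.transpose(1, 2)

# Usage: regress poses for a 100-frame feature sequence, then smooth.
model = SpeechToGestureBLSTM()
features = torch.randn(1, 100, 26)     # e.g. MFCC-like frames
poses = smooth_poses(model(features))  # (1, 100, 60)
```

In training, such a regressor would typically minimize a mean-squared error between predicted and motion-captured joint positions per frame; the smoothing stage is applied only to the generated sequences at inference time.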