DOI: 10.1145/3267851.3267878
Research article

Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network

Published: 05 November 2018

ABSTRACT

We present a novel framework for automatically generating natural gesture motions that accompany speech, directly from audio utterances. Built on a bi-directional LSTM network, our model learns speech-gesture relationships with both backward and forward consistency over long time spans. At each time step, the network regresses a full 3D skeletal pose of a human from perceptual features extracted from the input audio; we then apply combined temporal filters to smooth the generated pose sequences. To train the network, we use a speech-gesture dataset recorded with a headset and marker-based motion capture. We validated our approach with a subjective evaluation, comparing our output against "original" human gestures and "mismatched" human gestures taken from a different utterance. The results show that our generated gestures are rated significantly better than the "mismatched" gestures with respect to time consistency, and marginally significantly better with respect to semantic consistency.
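The abstract outlines the pipeline: per-frame audio features feed a bi-directional LSTM, a per-frame regression layer maps the hidden states to a 3D skeletal pose, and a temporal filter smooths the output sequence. The sketch below illustrates that shape of model in PyTorch. It is not the authors' code: the feature dimension, hidden size, joint count, and the simple moving-average smoother (a stand-in for the paper's combined temporal filters) are illustrative assumptions.

    # Minimal sketch (not the authors' implementation) of a speech-to-gesture
    # regressor: bidirectional LSTM over audio features, per-frame pose output,
    # followed by simple temporal smoothing. Dimensions are assumed values.
    import torch
    import torch.nn as nn

    class SpeechToGesture(nn.Module):
        def __init__(self, audio_dim=26, hidden_dim=256, num_joints=20):
            super().__init__()
            # Bidirectional LSTM captures both forward and backward context
            # over the whole utterance, as the abstract describes.
            self.blstm = nn.LSTM(audio_dim, hidden_dim, num_layers=1,
                                 batch_first=True, bidirectional=True)
            # Per-frame regression from concatenated forward/backward states
            # to a flattened 3D pose (x, y, z for each joint).
            self.pose_head = nn.Linear(2 * hidden_dim, num_joints * 3)

        def forward(self, audio_feats):
            # audio_feats: (batch, time, audio_dim), e.g. MFCC-like features
            h, _ = self.blstm(audio_feats)
            return self.pose_head(h)      # (batch, time, num_joints * 3)

    def smooth_poses(poses, kernel=5):
        # Illustrative smoothing: moving average along the time axis,
        # standing in for the combined temporal filters in the paper.
        pad = kernel // 2
        x = poses.transpose(1, 2)                     # (batch, pose_dim, time)
        x = nn.functional.avg_pool1d(x, kernel, stride=1, padding=pad)
        return x.transpose(1, 2)

    if __name__ == "__main__":
        model = SpeechToGesture()
        feats = torch.randn(2, 100, 26)               # 2 utterances, 100 frames
        poses = smooth_poses(model(feats))
        print(poses.shape)                            # torch.Size([2, 100, 60])

Training such a model would typically minimize a mean-squared error between the predicted and motion-captured poses; the smoothing step is applied only to the generated sequences at inference time.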


Published in

IVA '18: Proceedings of the 18th International Conference on Intelligent Virtual Agents
November 2018, 381 pages
ISBN: 9781450360135
DOI: 10.1145/3267851
Copyright © 2018 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


          Acceptance Rates

IVA '18 paper acceptance rate: 17 of 82 submissions (21%). Overall acceptance rate: 53 of 196 submissions (27%).
