ABSTRACT
We present a novel framework that automatically generates natural gesture motions accompanying speech from audio utterances. Built on a Bi-Directional LSTM network, our model learns speech-gesture relationships with both backward and forward consistency over long time spans. At each time step, the network regresses a full 3D skeletal pose of a human from perceptual features extracted from the input audio. We then apply combined temporal filters to smooth the generated pose sequences. We train the network on a speech-gesture dataset recorded with a headset microphone and marker-based motion capture. We validated our approach with a subjective evaluation that compared the generated gestures against "original" human gestures and "mismatched" human gestures taken from a different utterance. The results show that our generated gestures are rated significantly better than the "mismatched" gestures with respect to time consistency, and marginally significantly better with respect to semantic consistency.
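To make the described pipeline concrete, below is a minimal sketch of the two stages the abstract names: a bidirectional LSTM that regresses a per-frame 3D skeletal pose from audio features, followed by temporal smoothing. This is an illustration, not the authors' implementation; the feature dimensionality (MFCC-like coefficients per frame), hidden size, joint count, and the moving-average filter standing in for the paper's "combined temporal filters" are all assumptions.

```python
import torch
import torch.nn as nn

class SpeechToGestureBLSTM(nn.Module):
    """Sketch of a speech-to-gesture regressor (assumed architecture)."""

    def __init__(self, n_audio_features=26, hidden_size=256, n_joints=20):
        super().__init__()
        # A bidirectional LSTM provides both backward and forward
        # context over the utterance, as the abstract emphasizes.
        self.blstm = nn.LSTM(
            input_size=n_audio_features,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )
        # Linear readout regresses a full pose (x, y, z per joint)
        # at every time step.
        self.readout = nn.Linear(2 * hidden_size, 3 * n_joints)

    def forward(self, audio_features):
        # audio_features: (batch, time, n_audio_features)
        context, _ = self.blstm(audio_features)
        return self.readout(context)  # (batch, time, 3 * n_joints)

def smooth_poses(poses, window=5):
    # Moving-average temporal filter over the time axis, used here
    # only as a stand-in for the paper's combined temporal filters.
    # poses: (batch, time, pose_dim)
    pose_dim = poses.shape[-1]
    kernel = torch.ones(pose_dim, 1, window) / window
    x = poses.transpose(1, 2)  # (batch, pose_dim, time)
    x = nn.functional.pad(
        x, (window // 2, window - 1 - window // 2), mode="replicate"
    )
    x = nn.functional.conv1d(x, kernel, groups=pose_dim)
    return x.transpose(1, 2)

# Usage: regress poses for a 100-frame feature sequence, then smooth.
model = SpeechToGestureBLSTM()
features = torch.randn(1, 100, 26)     # e.g. MFCC-like frames
poses = smooth_poses(model(features))  # (1, 100, 60)
```

In training, such a regressor would typically minimize a mean-squared error between predicted and motion-captured joint positions per frame; the smoothing stage is applied only to the generated sequences at inference time.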