ABSTRACT
This paper presents a neural network model that generates a virtual violinist's 3-D skeleton movements from music audio. Improving on the conventional recurrent neural network models used in previous work to generate 2-D skeleton data, the proposed model incorporates an encoder-decoder architecture together with a self-attention mechanism to model the complicated dynamics of body movement sequences. To facilitate optimization of the self-attention model, beat tracking is applied to determine effective sizes and boundaries of the training examples. The decoder is accompanied by a refining network and a bowing-attack inference mechanism to emphasize right-hand behavior and bowing-attack timing. Both objective and subjective evaluations reveal that the proposed model outperforms state-of-the-art methods. To the best of our knowledge, this work represents the first attempt to generate 3-D violinists' body movements while considering key features of musical body movement.
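The self-attention mechanism mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; it is a generic scaled dot-product self-attention pass in NumPy over a sequence of per-frame features, with hypothetical dimensions, showing how every frame attends to every other frame when modeling movement dynamics.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of frame features.

    x: (T, d) sequence of per-frame features (e.g. audio or pose embeddings).
    w_q, w_k, w_v: (d, d) projection matrices (learned in a real model).
    Returns the attended sequence, shape (T, d).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # (T, T) pairwise frame affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key frames
    return weights @ v

rng = np.random.default_rng(0)
T, d = 8, 16                                        # hypothetical sequence length / feature size
x = rng.standard_normal((T, d))
w = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
y = self_attention(x, *w)
print(y.shape)  # (8, 16): same length as the input sequence
```

In a full model, such attention layers would sit inside the encoder and decoder stacks, letting each output frame condition on the entire input window rather than only on recent frames as in a recurrent model.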
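The beat-aligned segmentation of training examples can likewise be sketched. The function below is an assumption about how such segmentation might work, not the paper's code: given frame indices of detected beats (which in practice could come from a beat tracker such as `librosa.beat.beat_track`), it cuts the sequence into windows spanning a fixed number of beats, so example boundaries fall on musically meaningful points.

```python
def beat_aligned_windows(n_frames, beat_frames, beats_per_window=4):
    """Split a frame sequence into training windows whose boundaries fall
    on tracked beats, so each example spans a fixed number of beats.

    n_frames: total number of frames in the sequence.
    beat_frames: sorted frame indices of detected beats.
    Returns a list of (start, end) frame ranges.
    """
    windows = []
    for i in range(0, len(beat_frames) - beats_per_window, beats_per_window):
        start = beat_frames[i]
        end = min(beat_frames[i + beats_per_window], n_frames)
        windows.append((start, end))
    return windows

# Hypothetical beat positions at exactly 30 frames per beat.
beats = list(range(0, 300, 30))
print(beat_aligned_windows(300, beats, beats_per_window=4))
# -> [(0, 120), (120, 240)]
```

Aligning windows to beats in this way keeps the effective example size roughly uniform under tempo changes, which is the stated motivation for using beat tracking during training.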
Index Terms
- Temporally Guided Music-to-Body-Movement Generation