DOI: 10.1145/3394171.3413848
Research article

Temporally Guided Music-to-Body-Movement Generation

Published: 12 October 2020

ABSTRACT

This paper presents a neural network model that generates a virtual violinist's 3-D skeleton movements from music audio. Improving on the conventional recurrent neural network models used in previous work to generate 2-D skeleton data, the proposed model incorporates an encoder-decoder architecture and a self-attention mechanism to model the complex dynamics of body movement sequences. To facilitate optimization of the self-attention model, beat tracking is applied to determine effective sizes and boundaries for the training examples. The decoder is accompanied by a refining network and a bowing-attack inference mechanism that emphasize right-hand behavior and bowing-attack timing. Both objective and subjective evaluations show that the proposed model outperforms state-of-the-art methods. To the best of our knowledge, this work represents the first attempt to generate 3-D violinists' body movements while considering key features of musical body movement.
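The abstract describes the pipeline only at a high level. As a rough illustration, the sketch below shows one way its components (beat-aligned segmentation of the audio, a self-attention encoder-decoder, a refining step, and bowing-attack inference) could fit together. This is not the authors' implementation: `beat_aligned_segments`, `Audio2Skeleton`, the joint count, the mel-spectrogram features, and every layer size are hypothetical stand-ins, and the refining network and bowing-attack head are reduced to single linear layers.

```python
# Minimal sketch of the pipeline in the abstract, assuming librosa + PyTorch.
# All names and hyperparameters are illustrative assumptions, not the paper's.
import librosa
import numpy as np
import torch
import torch.nn as nn

N_JOINTS = 15   # assumed number of skeleton joints
N_MELS = 128    # assumed audio feature dimension

def beat_aligned_segments(path):
    """Split audio features at tracked beats so each training example
    starts and ends on a beat boundary, as the abstract suggests."""
    y, sr = librosa.load(path)
    _, beats = librosa.beat.beat_track(y=y, sr=sr)   # beat frame indices
    # Same default hop length as beat_track, so frame indices line up.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS)
    return [s for s in np.split(mel, beats, axis=1) if s.shape[1] > 0]

class Audio2Skeleton(nn.Module):
    """Encoder-decoder with self-attention mapping audio frames to
    3-D joint positions plus a per-frame bowing-attack logit."""
    def __init__(self, d_model=256, nhead=4, nlayers=2):
        super().__init__()
        self.proj = nn.Linear(N_MELS, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)  # self-attention
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.pose_head = nn.Linear(d_model, N_JOINTS * 3)     # coarse pose
        self.refine = nn.Linear(N_JOINTS * 3, N_JOINTS * 3)   # refining-network stand-in
        self.bow_head = nn.Linear(d_model, 1)                 # bowing-attack inference

    def forward(self, mel):                # mel: (batch, frames, N_MELS)
        h = self.encoder(self.proj(mel))   # attend within a beat-aligned segment
        h, _ = self.decoder(h)
        pose = self.pose_head(h)
        pose = pose + self.refine(pose)    # residual refinement of the raw pose
        pose = pose.reshape(pose.shape[0], pose.shape[1], N_JOINTS, 3)
        return pose, self.bow_head(h)
```

For example, `Audio2Skeleton()(torch.randn(1, 120, N_MELS))` would yield a (1, 120, 15, 3) pose tensor and a (1, 120, 1) bowing-attack logit under these assumed dimensions.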

Supplemental Material

3394171.3413848.mp4 (mp4, 141.2 MB)


Published in

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020, 4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171

            Copyright © 2020 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


            Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions (24%)

