ABSTRACT
This paper presents a neural network model that generates a virtual violinist's 3-D skeleton movements from music audio. Improving on the conventional recurrent neural network models used in previous work to generate 2-D skeleton data, the proposed model incorporates an encoder-decoder architecture together with a self-attention mechanism to model the complicated dynamics of body movement sequences. To facilitate optimization of the self-attention model, beat tracking is applied to determine effective sizes and boundaries of the training examples. The decoder is accompanied by a refining network and a bowing-attack inference mechanism to emphasize right-hand behavior and bowing-attack timing. Both objective and subjective evaluations reveal that the proposed model outperforms state-of-the-art methods. To the best of our knowledge, this work represents the first attempt to generate 3-D violinists' body movements while considering key features of musical body movement.
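The self-attention mechanism mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; it is a generic scaled dot-product self-attention pass in NumPy over a sequence of per-frame features, with hypothetical dimensions, showing how every frame attends to every other frame when modeling movement dynamics.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of frame features.

    x: (T, d) sequence of per-frame features (e.g. audio or pose embeddings).
    w_q, w_k, w_v: (d, d) projection matrices (learned in a real model).
    Returns the attended sequence, shape (T, d).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # (T, T) pairwise frame affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key frames
    return weights @ v

rng = np.random.default_rng(0)
T, d = 8, 16                                        # hypothetical sequence length / feature size
x = rng.standard_normal((T, d))
w = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
y = self_attention(x, *w)
print(y.shape)  # (8, 16): same length as the input sequence
```

In a full model, such attention layers would sit inside the encoder and decoder stacks, letting each output frame condition on the entire input window rather than only on recent frames as in a recurrent model.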
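The beat-aligned segmentation of training examples can likewise be sketched. The function below is an assumption about how such segmentation might work, not the paper's code: given frame indices of detected beats (which in practice could come from a beat tracker such as `librosa.beat.beat_track`), it cuts the sequence into windows spanning a fixed number of beats, so example boundaries fall on musically meaningful points.

```python
def beat_aligned_windows(n_frames, beat_frames, beats_per_window=4):
    """Split a frame sequence into training windows whose boundaries fall
    on tracked beats, so each example spans a fixed number of beats.

    n_frames: total number of frames in the sequence.
    beat_frames: sorted frame indices of detected beats.
    Returns a list of (start, end) frame ranges.
    """
    windows = []
    for i in range(0, len(beat_frames) - beats_per_window, beats_per_window):
        start = beat_frames[i]
        end = min(beat_frames[i + beats_per_window], n_frames)
        windows.append((start, end))
    return windows

# Hypothetical beat positions at exactly 30 frames per beat.
beats = list(range(0, 300, 30))
print(beat_aligned_windows(300, beats, beats_per_window=4))
# -> [(0, 120), (120, 240)]
```

Aligning windows to beats in this way keeps the effective example size roughly uniform under tempo changes, which is the stated motivation for using beat tracking during training.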
Index Terms
- Temporally Guided Music-to-Body-Movement Generation