research-article

Skeleton-Aided Articulated Motion Generation

Authors: Yichao Yan, Jingwei Xu, Bingbing Ni, Wendong Zhang, Xiaokang Yang

Shanghai Jiao Tong University, Shanghai, China

MM '17: Proceedings of the 25th ACM international conference on Multimedia, October 2017, Pages 199–207
https://doi.org/10.1145/3123266.3123277

Published: 19 October 2017

ABSTRACT

This work makes the first attempt to generate an articulated human motion sequence from a single image. On the one hand, we use paired inputs, with human skeleton information as the motion embedding and a single human image as the appearance reference, to generate novel motion frames with a conditional GAN framework. On the other hand, we employ a triplet loss to enforce appearance smoothness between consecutive frames. Because the proposed framework jointly exploits the image appearance space and the articulated/kinematic motion space, it generates realistic articulated motion sequences, in contrast to most previous video generation methods, which yield blurred motion effects. We test our model on two human action datasets, KTH and Human3.6M, and the proposed framework generates promising results on both.
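
The triplet loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so everything here is an assumption made for illustration: the function names `squared_distance` and `triplet_loss`, the squared Euclidean distance, the `margin` value, the use of plain feature vectors, and the choice of a consecutive frame as the positive and a more distant frame as the negative.

```python
# Illustrative sketch only; not the paper's implementation.
# Frames are represented by hypothetical feature vectors (lists of floats).

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge-style triplet loss.

    anchor:   features of frame t (assumed)
    positive: features of the consecutive frame t+1 (assumed)
    negative: features of a temporally distant frame (assumed)
    margin:   hypothetical margin hyperparameter
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


if __name__ == "__main__":
    frame_t = [0.0, 0.0]
    frame_next = [0.0, 1.0]   # close to frame_t
    frame_far = [3.0, 4.0]    # far from frame_t
    print(triplet_loss(frame_t, frame_next, frame_far))
    print(triplet_loss(frame_t, frame_far, frame_next))
```

Under these assumptions the loss is zero once the consecutive frame is closer to the anchor than the distant frame by at least the margin, which is one way to encourage smooth appearance across neighbouring frames.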

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
        Appearance and texture representations
      2. Computer vision tasks
        Activity recognition and understanding

Published in
MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266
General Chairs: Qiong Liu (FXPAL, USA), Rainer Lienhart (Universität Augsburg, Germany), Haohong Wang (TCL America, USA)
Program Chairs: Sheng-Wei "Kuan-Ta" Chen (Academia Sinica, Taiwan), Susanne Boll (University of Oldenburg, Germany), Phoebe Chen (La Trobe University, Australia), Gerald Friedland (Lawrence Livermore National Lab, USA), Jia Li (Google, USA), Shuicheng Yan (Qihoo 360, China)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
motion generation
skeleton aid
video analysis
Qualifiers
- research-article
Acceptance Rates
MM '17 Paper Acceptance Rate: 189 of 684 submissions, 28%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%