ABSTRACT
This work makes the first attempt to generate an articulated human motion sequence from a single image. First, we use paired inputs, human skeleton information as the motion embedding and a single human image as the appearance reference, to generate novel motion frames with a conditional GAN. Second, we employ a triplet loss to encourage appearance smoothness between consecutive frames. Because the proposed framework jointly exploits the image appearance space and the articulated/kinematic motion space, it generates realistic articulated motion sequences, in contrast to most previous video generation methods, which yield blurred motion effects. We test our model on two human action datasets, KTH and Human3.6M, and the framework generates promising results on both.
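The abstract does not spell out how the triplet loss is formed, so the following is only a minimal sketch of a generic hinge-style triplet loss, not the authors' formulation. It assumes that the anchor is a generated frame, the positive is its temporally adjacent frame, and the negative is a temporally distant frame; the function name `triplet_loss`, the squared Euclidean distance, and the `margin` value are all illustrative choices.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic hinge-style triplet loss on frames or frame features.

    Assumption (not from the paper): `positive` is the frame adjacent to
    `anchor`, `negative` is a frame far away in time. The loss is zero once
    the anchor is closer to the positive than to the negative by `margin`.
    """
    a = np.asarray(anchor, dtype=np.float64).reshape(-1)
    p = np.asarray(positive, dtype=np.float64).reshape(-1)
    n = np.asarray(negative, dtype=np.float64).reshape(-1)
    d_pos = np.sum((a - p) ** 2)  # squared distance to the adjacent frame
    d_neg = np.sum((a - n) ** 2)  # squared distance to the distant frame
    return float(max(0.0, d_pos - d_neg + margin))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_t = rng.random((64, 64, 3))                       # hypothetical generated frame
    frame_next = frame_t + 0.01 * rng.random((64, 64, 3))   # nearly identical neighbour
    frame_far = rng.random((64, 64, 3))                     # unrelated, distant frame
    print(triplet_loss(frame_t, frame_next, frame_far))
```

Under these assumptions, minimizing the loss pulls consecutive frames together in appearance while keeping them distinguishable from distant frames, which is one way to read the "appearance smoothness" objective described above.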
Index Terms
- Skeleton-Aided Articulated Motion Generation