Research article
DOI: 10.1145/3123266.3123277

Skeleton-Aided Articulated Motion Generation

Published: 19 October 2017

ABSTRACT

This work makes the first attempt to generate an articulated human motion sequence from a single image. On the one hand, we use paired inputs, human skeleton information as a motion embedding and a single human image as an appearance reference, to generate novel motion frames within a conditional GAN framework. On the other hand, a triplet loss is employed to enforce appearance smoothness between consecutive frames. Because the proposed framework jointly exploits the image appearance space and the articulated/kinematic motion space, it generates realistic articulated motion sequences, in contrast to most previous video generation methods, which tend to yield blurred motion. We evaluate our model on two human action datasets, KTH and Human3.6M, and the proposed framework produces very promising results on both.
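The abstract describes two concrete ingredients: a generator conditioned on paired inputs (a skeleton map for motion and a single image for appearance) and a triplet loss that encourages appearance smoothness across consecutive frames. The following minimal sketch illustrates both ideas; it is not the authors' implementation, and the network shape, channel counts, and the pixel-space triplet formulation are illustrative assumptions only.

# Minimal sketch (assumptions, not the paper's code) of (1) a generator
# conditioned on a skeleton map plus a reference appearance image, and
# (2) a triplet loss encouraging appearance smoothness between consecutive frames.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkeletonConditionedGenerator(nn.Module):
    """Generates a frame from a rendered skeleton map and a reference image."""

    def __init__(self, skeleton_channels=1, image_channels=3, base_channels=64):
        super().__init__()
        in_channels = skeleton_channels + image_channels
        # Toy encoder-decoder; the paper's generator is a full image-to-image
        # network, not this small stack.
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base_channels, base_channels * 2, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base_channels, image_channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, skeleton_map, reference_image):
        # Paired input: skeleton as motion embedding, image as appearance reference.
        x = torch.cat([skeleton_map, reference_image], dim=1)
        return self.net(x)


def appearance_triplet_loss(frame_t, frame_t_plus_1, other_frame, margin=1.0):
    """Pull consecutive generated frames together in appearance and push a
    frame with a different appearance away (a simple pixel-space stand-in)."""
    a = frame_t.flatten(start_dim=1)
    p = frame_t_plus_1.flatten(start_dim=1)
    n = other_frame.flatten(start_dim=1)
    return F.triplet_margin_loss(a, p, n, margin=margin)


# Example shapes (batch of 4, 64x64 frames):
# skeleton = torch.randn(4, 1, 64, 64); reference = torch.randn(4, 3, 64, 64)
# frame = SkeletonConditionedGenerator()(skeleton, reference)  # -> (4, 3, 64, 64)

In a setup like this, the triplet's negative would typically be a frame with a different appearance (another subject, or a temporally distant frame), so minimizing the loss pulls consecutive generated frames toward the same appearance while keeping them distinguishable from unrelated frames.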


Published in
MM '17: Proceedings of the 25th ACM International Conference on Multimedia
October 2017
2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher
Association for Computing Machinery, New York, NY, United States



        Acceptance Rates

MM '17 Paper Acceptance Rate: 189 of 684 submissions, 28%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%
