SeqHAND: RGB-Sequence-Based 3D Hand Pose and Shape Estimation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12357)

Abstract

3D hand pose estimation from RGB images has been studied for a long time. Most studies, however, have performed frame-by-frame estimation on independent static images. In this paper, we attempt not only to consider the appearance of a hand but also to incorporate the temporal movement information of a hand in motion into the learning framework, which requires a large-scale dataset of sequential RGB hand images. We propose a novel method that generates a synthetic dataset mimicking natural human hand movements by re-engineering the annotations of an existing static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework that exploits visuo-temporal features from sequential synthetic hand images and encourages smooth estimations through temporal consistency constraints. Our novel training strategy of detaching the recurrent layer of the framework during domain finetuning from synthetic to real data preserves the visuo-temporal features learned from sequential synthetic hand images. The sequentially estimated hand poses consequently form natural and smooth hand movements, which leads to more robust estimations. Utilizing temporal information for 3D hand pose estimation significantly enhances general pose estimation: our method outperforms state-of-the-art methods in experiments on hand pose estimation benchmarks.
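To make the three ingredients above concrete (a recurrent estimator over per-frame features, a temporal-consistency constraint, and a detached recurrent layer during synthetic-to-real finetuning), the following PyTorch-style sketch illustrates one possible arrangement. The module sizes, the LSTM choice, the loss weight, and the reading of "detaching" as freezing the layer's parameters are assumptions, not the authors' implementation.

    # Illustrative sketch only: a CNN backbone per frame, an LSTM that
    # accumulates visuo-temporal features, a smoothness penalty on consecutive
    # pose estimates, and "detaching" read as freezing the recurrent layer for
    # synthetic-to-real finetuning.
    import torch
    import torch.nn as nn

    class SeqPoseEstimator(nn.Module):
        def __init__(self, feat_dim=512, pose_dim=51):       # pose_dim: assumed
            super().__init__()
            self.backbone = nn.Sequential(                    # stand-in CNN encoder
                nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim), nn.ReLU())
            self.rnn = nn.LSTM(feat_dim, feat_dim, batch_first=True)
            self.head = nn.Linear(feat_dim, pose_dim)

        def forward(self, frames):                            # frames: (B, T, 3, H, W)
            b, t = frames.shape[:2]
            feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
            hidden, _ = self.rnn(feats)                       # visuo-temporal features
            return self.head(hidden)                          # (B, T, pose_dim)

    def sequence_loss(pred, target, lam_smooth=0.1):
        """Per-frame regression loss plus a temporal-consistency term that
        penalises large pose changes between consecutive frames."""
        data_term = (pred - target).abs().mean()
        smooth_term = (pred[:, 1:] - pred[:, :-1]).abs().mean()
        return data_term + lam_smooth * smooth_term

    # Synthetic-to-real finetuning: freeze the recurrent layer so the
    # visuo-temporal features learned from synthetic sequences are preserved.
    model = SeqPoseEstimator()
    for p in model.rnn.parameters():
        p.requires_grad = False

    # Example: 8-frame clips of 128x128 crops (shapes chosen for illustration).
    clips = torch.randn(2, 8, 3, 128, 128)
    poses = model(clips)                                      # (2, 8, 51)

Under this reading, the backbone and regression head continue to adapt to real images during finetuning while the recurrent dynamics learned from synthetic sequences remain fixed.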

J. Yang and H. J. Chang—Equal Contribution.

Notes

  1. Note that direct random sampling from the continuous pose parameter space \(\theta \in \mathbb{R}\) does not assure the diversity and authenticity of poses [31].

  2. Although we can generate as many synthetic samples as we want, our SeqHand dataset contains 400K/10K samples used for training/validation; a minimal sketch of pose-flow generation follows these notes.
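As a companion to the notes above, the following NumPy sketch shows one way static annotations could be re-engineered into a pose-flow by interpolating between two annotated pose vectors. The interpolation scheme, sequence length, pose dimensionality, and jitter term are hypothetical choices rather than the paper's exact procedure.

    # Illustrative sketch only: build a short, smoothly varying pose sequence
    # ("pose-flow") by interpolating between two static pose annotations.
    import numpy as np

    def make_pose_flow(pose_a, pose_b, n_frames=16, jitter=0.0):
        """pose_a, pose_b: pose parameter vectors taken from a static dataset."""
        steps = np.linspace(0.0, 1.0, n_frames)[:, None]      # (n_frames, 1)
        flow = (1.0 - steps) * pose_a[None] + steps * pose_b[None]
        if jitter > 0.0:                                      # optional small noise
            flow += np.random.normal(scale=jitter, size=flow.shape)
        return flow                                           # (n_frames, pose_dim)

    # Example: a 16-frame flow between two hypothetical 45-D pose annotations.
    flow = make_pose_flow(np.zeros(45), np.random.uniform(-0.5, 0.5, size=45))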

References

  1. Baek, S., Kim, K.I., Kim, T.K.: Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1067–1076 (2019)

  2. Bambach, S., Lee, S., Crandall, D.J., Yu, C.: Lending a hand: detecting hands and recognizing activities in complex egocentric interactions. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1949–1957 (2015)

  3. Boukhayma, A., Bem, R.D., Torr, P.H.: 3D hand shape and pose from images in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10843–10852 (2019)

  4. Cai, Y., Ge, L., Cai, J., Yuan, J.: Weakly-supervised 3D hand pose estimation from monocular RGB images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 678–694. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_41

  5. Cai, Y., et al.: Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks (2019)

  6. de Campos, T.E., Murray, D.W.: Regression-based hand pose estimation from multiple cameras 1, 782–789 (2006)

  7. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)

  8. Chang, H.J., Garcia-Hernando, G., Tang, D., Kim, T.K.: Spatio-temporal hough forest for efficient detection-localisation-recognition of fingerwriting in egocentric camera. Comput. Vis. Image Understand. 148, 87–96 (2016)

  9. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

  10. Garcia-Hernando, G., Yuan, S., Baek, S., Kim, T.K.: First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 409–419 (2018)

  11. Ge, L., et al.: 3D hand shape and pose estimation from a single RGB image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10833–10842 (2019)

  12. Gomez-Donoso, F., Orts-Escolano, S., Cazorla, M.: Large-scale multiview 3D hand pose dataset. arXiv preprint arXiv:1707.03742 (2017)

  13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  14. Hu, Z., Hu, Y., Liu, J., Wu, B., Han, D., Kurfess, T.: A CRNN module for hand pose estimation. Neurocomputing 333, 157–168 (2019)

  15. Iqbal, U., Molchanov, P., Breuel, T., Gall, J., Kautz, J.: Hand pose estimation via latent 2.5D heatmap regression. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 118–134. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6_8

  16. Jang, Y., Noh, S., Chang, H.J., Kim, T., Woo, W.: 3D finger cape: clicking action and position estimation under self-occlusions in egocentric viewpoint. IEEE Trans. Vis. Comput. Graph. 21(4), 501–510 (2015)

  17. Khamis, S., Taylor, J., Shotton, J., Keskin, C., Izadi, S., Fitzgibbon, A.: Learning an efficient model of hand shape variation from depth images, pp. 2540–2548 (2015)

  18. Kirkpatrick, J., et al.: Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114(13), 3521–3526 (2017)

  19. Le, T.H.N., Quach, K.G., Zhu, C., Duong, C.N., Luu, K., Savvides, M.: Robust hand detection and classification in vehicles and in the wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1203–1210. IEEE (2017)

  20. Lee, M., Lee, S., Son, S., Park, G., Kwak, N.: Motion feature network: fixed motion filter for action recognition. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 392–408. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_24

  21. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(Nov), 2579–2605 (2008)

  22. Madadi, M., Escalera, S., Carruesco, A., Andujar, C., Baró, X., Gonzàlez, J.: Top-down model fitting for hand pose recovery in sequences of depth images. Image Vis. Comput. 79, 63–75 (2018)

  23. Malik, J., et al.: DeepHPS: end-to-end estimation of 3D hand pose and shape by learning from synthetic depth. In: 2018 International Conference on 3D Vision (3DV), pp. 110–119. IEEE (2018)

  24. Mueller, F., et al.: GANerated hands for real-time 3D hand tracking from monocular RGB, pp. 49–59 (2018)

  25. Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas, D., Theobalt, C.: Real-time hand tracking under occlusion from an egocentric RGB-D sensor, pp. 1284–1293 (2017)

  26. Oberweger, M., Riegler, G., Wohlhart, P., Lepetit, V.: Efficiently creating 3D training data for fine hand pose estimation, pp. 4957–4965 (2016)

  27. Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints, pp. 2088–2095 (2011)

  28. Panteleris, P., Argyros, A.: Back to RGB: 3D tracking of hands and hand-object interactions based on short-baseline stereo, pp. 575–584 (2017)

  29. Panteleris, P., Oikonomidis, I., Argyros, A.: Using a single RGB frame for real time 3D hand pose estimation in the wild. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 436–445. IEEE (2018)

  30. Remilekun Basaru, R., Slabaugh, G., Alonso, E., Child, C.: Hand pose estimation using deep stereovision and Markov-chain Monte Carlo, pp. 595–603 (2017)

  31. Romero, J., Tzionas, D., Black, M.J.: Embodied hands: modeling and capturing hands and bodies together. ACM Trans. Graph. (TOG) 36(6), 245 (2017)

  32. Rosales, R., Athitsos, V., Sigal, L., Sclaroff, S.: 3D hand pose reconstruction using specialized mappings, vol. 1, pp. 378–385 (2001)

  33. Sharp, T., et al.: Accurate, robust, and flexible real-time hand tracking. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 3633–3642 (2015)

  34. Simon, T., Joo, H., Matthews, I., Sheikh, Y.: Hand keypoint detection in single images using multiview bootstrapping, pp. 1145–1153 (2017)

  35. Spurr, A., Song, J., Park, S., Hilliges, O.: Cross-modal deep variational hand pose estimation, pp. 89–98 (2018)

  36. Sridhar, S., Mueller, F., Zollhöfer, M., Casas, D., Oulasvirta, A., Theobalt, C.: Real-time joint tracking of a hand manipulating an object from RGB-D input. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 294–310. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_19

  37. Sridhar, S., Rhodin, H., Seidel, H.P., Oulasvirta, A., Theobalt, C.: Real-time hand tracking using a sum of anisotropic Gaussians model 1, 319–326 (2014)

  38. Sun, X., Wei, Y., Liang, S., Tang, X., Sun, J.: Cascaded hand pose regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 824–832 (2015)

  39. Sundermeyer, M., Schlüter, R., Ney, H.: LSTM neural networks for language modeling. In: Thirteenth Annual Conference of the International Speech Communication Association (2012)

  40. Tang, D., Chang, H.J., Tejani, A., Kim, T.K.: Latent regression forest: structured estimation of 3D articulated hand posture, pp. 3786–3793 (2014)

  41. Taylor, J., et al.: User-specific hand modeling from monocular depth sequences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 644–651 (2014)

  42. Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. (ToG) 33(5), 169 (2014)

  43. Wu, Y., Ji, W., Li, X., Wang, G., Yin, J., Wu, F.: Context-aware deep spatiotemporal network for hand pose estimation from depth images. IEEE Trans. Cybern. (2018)

  44. Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, pp. 802–810 (2015)

  45. Yang, L., Yao, A.: Disentangling latent hands for image synthesis and pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9877–9886 (2019)

  46. Yuan, S., Ye, Q., Stenger, B., Jain, S., Kim, T.K.: BigHand2.2M benchmark: hand pose dataset and state of the art analysis, pp. 4866–4874 (2017)

  47. Zhang, J., Jiao, J., Chen, M., Qu, L., Xu, X., Yang, Q.: 3D hand pose tracking and estimation using stereo matching. arXiv preprint arXiv:1610.07214 (2016)

  48. Zhang, J., Jiao, J., Chen, M., Qu, L., Xu, X., Yang, Q.: A hand pose tracking benchmark from stereo matching, pp. 982–986 (2017)

  49. Zhang, X., Li, Q., Zhang, W., Zheng, W.: End-to-end hand mesh recovery from a monocular RGB image. arXiv preprint arXiv:1902.09305 (2019)

  50. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232 (2017)

  51. Zimmermann, C., Brox, T.: Learning to estimate 3D hand pose from single RGB images, pp. 4903–4911 (2017)

  52. Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: FreiHAND: a dataset for markerless capture of hand pose and shape from single RGB images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 813–822 (2019)

Acknowledgement

This work was supported by IITP grant funded by the Korea government (MSIT) (No. 2019-0-01367, Babymind) and Next-Generation Information Computing Development Program through the NRF of Korea (2017M3C4A7077582).

Author information

Corresponding author

Correspondence to Nojun Kwak.

Electronic supplementary material

Below are the links to the electronic supplementary material.

Supplementary material 1 (pdf 100 KB)

Supplementary material 2 (mp4 53598 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, J., Chang, H.J., Lee, S., Kwak, N. (2020). SeqHAND: RGB-Sequence-Based 3D Hand Pose and Shape Estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12357. Springer, Cham. https://doi.org/10.1007/978-3-030-58610-2_8

  • DOI: https://doi.org/10.1007/978-3-030-58610-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58609-6

  • Online ISBN: 978-3-030-58610-2
