A spatiotemporal attention-based ResC3D model for large-scale gesture recognition

  • Special Issue Paper
  • Machine Vision and Applications

Abstract

Abnormal gesture recognition has many applications in visual surveillance, crowd behavior analysis, and sensitive video content detection. However, recognizing dynamic gestures in large-scale videos remains challenging because of gesture-irrelevant factors such as variations in illumination, movement path, and background. In this paper, we propose a spatiotemporal attention-based ResC3D model for abnormal gesture recognition in large-scale videos. One key idea is to find a compact and effective representation of the gesture in both spatial and temporal contexts. To eliminate the influence of gesture-irrelevant factors, we first apply enhancement techniques such as Retinex and a hybrid median filter to improve the quality of the RGB and depth inputs. We then design a spatiotemporal attention scheme that focuses on the most valuable cues related to the moving parts of the gesture. On top of these representations, a ResC3D network, which combines the advantages of the residual network and the C3D model, is developed to extract features, together with a canonical correlation analysis-based fusion scheme for blending features from different modalities. The method is evaluated on the Chalearn IsoGD dataset. Experiments demonstrate the effectiveness of each module and show that the final accuracy reaches 68.14%, which outperforms other state-of-the-art methods, including our earlier work at the 2017 Chalearn Looking at People Workshop of ICCV.
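As an illustration of one of the enhancement steps named in the abstract, the following is a minimal sketch of a hybrid median filter in plain Python. The abstract does not give the window size, the border handling, or which input the filter is applied to, so the 3x3 window, the unchanged border pixels, and the function name `hybrid_median_filter` are all assumptions made for this example and not the authors' implementation.

```python
from statistics import median


def hybrid_median_filter(image):
    """Apply a 3x3 hybrid median filter to a 2D list of numbers.

    Assumptions of this sketch: a 3x3 window, and border pixels
    copied unchanged because they lack a full neighbourhood.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    output = [row[:] for row in image]  # copy so the input is not modified

    for y in range(1, height - 1):
        for x in range(1, width - 1):
            center = image[y][x]
            # Plus-shaped neighbourhood: centre with its horizontal
            # and vertical neighbours.
            plus = [center,
                    image[y - 1][x], image[y + 1][x],
                    image[y][x - 1], image[y][x + 1]]
            # X-shaped neighbourhood: centre with its diagonal neighbours.
            cross = [center,
                     image[y - 1][x - 1], image[y - 1][x + 1],
                     image[y + 1][x - 1], image[y + 1][x + 1]]
            # Output is the median of the two sub-medians and the centre.
            output[y][x] = median([median(plus), median(cross), center])
    return output


if __name__ == "__main__":
    noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
    line = [[0, 9, 0], [0, 9, 0], [0, 9, 0]]
    print(hybrid_median_filter(noisy))
    print(hybrid_median_filter(line))
```

In the two small inputs above, the isolated outlier at the centre of `noisy` is replaced by the surrounding value, while the one-pixel-wide vertical line in `line` survives. A plain 3x3 median would erase that line, which is the usual reason for choosing the hybrid variant when thin structures and edges matter.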

Author information

Corresponding author

Correspondence to Qiguang Miao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work was jointly supported by the National Key R&D Program of China under Grant No. 2018YFC0807500, the National Natural Science Foundation of China under Grant Nos. 61772396, 61472302, and 61772392, the Fundamental Research Funds for the Central Universities under Grant Nos. JB170306, JB170304, and JBF180301, the Xi’an Key Laboratory of Big Data and Intelligent Vision under Grant No. 201805053ZD4CG37, and the Fundamental Research Funds for the Central Universities and the Innovation Fund of Xidian University.

About this article

Cite this article

Li, Y., Miao, Q., Qi, X. et al. A spatiotemporal attention-based ResC3D model for large-scale gesture recognition. Machine Vision and Applications 30, 875–888 (2019). https://doi.org/10.1007/s00138-018-0996-x

  • DOI: https://doi.org/10.1007/s00138-018-0996-x
