Global Regularizer and Temporal-Aware Cross-Entropy for Skeleton-Based Early Action Recognition

  • Conference paper
  • Computer Vision – ACCV 2018 (ACCV 2018)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11364)

Abstract

In this paper, we propose a new approach to recognizing the class label of an action from skeleton sequences before the action is fully performed. Compared to action recognition, which uses fully observed action sequences, early action recognition with partial sequences is much more challenging, mainly because: (1) the global information of a long-term action is not available in a partial sequence, and (2) the partial sequences at different observation ratios of an action contain a number of sub-actions with diverse motion information. To address the first challenge, we introduce a global regularizer to learn a hidden feature space in which the statistical properties of the partial sequences are similar to those of the full sequences. To address the second challenge, we introduce a temporal-aware cross-entropy loss, which achieves better prediction performance. We evaluate the proposed method on three challenging skeleton datasets. Experimental results show the superiority of the proposed method for skeleton-based early action recognition.
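The abstract names two mechanisms: a global regularizer that aligns the statistics of partial- and full-sequence features in a hidden space, and a cross-entropy loss that accounts for how much of the action has been observed. The PyTorch sketch below illustrates one plausible reading of both ideas; the GRU encoder, the moment-matching regularizer, the ratio-based loss weighting, and the 0.1 regularization weight are all illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonEncoder(nn.Module):
    """Toy GRU encoder/classifier over skeleton sequences of shape (B, T, D)."""
    def __init__(self, input_dim=75, hidden_dim=256, num_classes=60):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        _, h = self.gru(x)       # h: (1, B, hidden_dim), last hidden state
        feat = h.squeeze(0)      # sequence-level hidden feature
        return feat, self.classifier(feat)

def global_regularizer(partial_feat, full_feat):
    # Moment-matching stand-in for the global regularizer: pull the batch
    # statistics (mean and variance) of partial-sequence features towards
    # those of full-sequence features.
    mean_gap = (partial_feat.mean(dim=0) - full_feat.mean(dim=0)).pow(2).sum()
    var_gap = (partial_feat.var(dim=0) - full_feat.var(dim=0)).pow(2).sum()
    return mean_gap + var_gap

def temporal_aware_ce(logits, labels, obs_ratio):
    # Illustrative temporal weighting: scale each sample's cross-entropy by
    # its observation ratio, so heavily truncated (ambiguous) inputs are
    # penalized less than nearly complete ones.
    ce = F.cross_entropy(logits, labels, reduction='none')  # (B,)
    return (obs_ratio * ce).mean()

# Minimal training step on random data.
model = SkeletonEncoder()
full = torch.randn(8, 40, 75)            # 8 full sequences, 40 frames each
labels = torch.randint(0, 60, (8,))
obs_ratio = 0.5                          # observe only the first half
partial = full[:, : int(40 * obs_ratio)]

partial_feat, partial_logits = model(partial)
full_feat, _ = model(full)

loss = (temporal_aware_ce(partial_logits, labels, torch.full((8,), obs_ratio))
        + 0.1 * global_regularizer(partial_feat, full_feat))
loss.backward()
```

The point of the sketch is the training signal: partial sequences are classified with a loss that discounts heavily truncated inputs, while their hidden features are simultaneously pulled towards the statistics of fully observed sequences.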



Acknowledgment

We gratefully acknowledge NVIDIA for providing a Titan Xp GPU for the experiments involved in this research. This work was partially supported by Australian Research Council grants DP150100294, DP150104251, and DE120102960.

Author information

Corresponding author

Correspondence to Qiuhong Ke.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Ke, Q. et al. (2019). Global Regularizer and Temporal-Aware Cross-Entropy for Skeleton-Based Early Action Recognition. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds.) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol. 11364. Springer, Cham. https://doi.org/10.1007/978-3-030-20870-7_45


  • DOI: https://doi.org/10.1007/978-3-030-20870-7_45

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20869-1

  • Online ISBN: 978-3-030-20870-7

  • eBook Packages: Computer Science; Computer Science (R0)
