Abstract
In this paper, we propose a new approach to recognizing the class label of an action from skeleton sequences before the action has been fully performed. Compared to action recognition, which uses fully observed action sequences, early action recognition from partial sequences is much more challenging, mainly because: (1) the global information of a long-term action is not available in a partial sequence, and (2) the partial sequences of an action at different observation ratios contain a number of sub-actions with diverse motion information. To address the first challenge, we introduce a global regularizer to learn a hidden feature space in which the statistical properties of partial sequences are similar to those of full sequences. To address the second challenge, we introduce a temporal-aware cross-entropy, which yields better prediction performance. We evaluate the proposed method on three challenging skeleton datasets. Experimental results demonstrate the superiority of the proposed method for skeleton-based early action recognition.
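The two components described above can be sketched in simplified form. The snippet below is an illustrative NumPy sketch, not the paper's exact formulation: the global regularizer is assumed here to match first- and second-order feature statistics (mean and variance) of partial and full sequences, and the temporal-aware cross-entropy is assumed to weight each sample's loss by its observation ratio. The function names and the specific statistics are the author's assumptions for illustration.

```python
import numpy as np

def global_regularizer(partial_feats, full_feats):
    """Penalize the gap between the hidden-feature statistics of
    partial sequences and those of full sequences. A simple
    moment-matching sketch (mean + variance); the paper's exact
    statistic is not specified in the abstract."""
    mean_gap = np.sum((partial_feats.mean(axis=0) - full_feats.mean(axis=0)) ** 2)
    var_gap = np.sum((partial_feats.var(axis=0) - full_feats.var(axis=0)) ** 2)
    return float(mean_gap + var_gap)

def temporal_aware_cross_entropy(probs, labels, obs_ratios):
    """Cross-entropy weighted by observation ratio, so predictions
    made from longer observations contribute more to the loss
    (one plausible form of temporal awareness, assumed here).
    probs: (N, C) softmax outputs; labels: (N,) int class ids;
    obs_ratios: (N,) values in (0, 1]."""
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(obs_ratios * nll))
```

In a full training loop, the two terms would be combined as `loss = temporal_aware_cross_entropy(...) + lambda * global_regularizer(...)`, with the trade-off weight `lambda` tuned on validation data.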
Acknowledgment
We gratefully acknowledge NVIDIA for providing a Titan XP GPU for the experiments involved in this research. This work was partially supported by Australian Research Council grants DP150100294, DP150104251, and DE120102960.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Ke, Q., et al.: Global Regularizer and Temporal-Aware Cross-Entropy for Skeleton-Based Early Action Recognition. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds.) Computer Vision – ACCV 2018. LNCS, vol. 11364. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20870-7_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20869-1
Online ISBN: 978-3-030-20870-7