Abstract
Most existing graph convolutional network-based action recognition methods use an adaptive mechanism to learn action features from a skeleton sequence. Although this mechanism improves recognition accuracy to some extent, its performance is still limited by the initial skeleton topology, which connects skeletal joints according to the natural human body structure. In addition, the semantic information of skeletal joints is naturally informative and discriminative for action recognition tasks, but its inclusion has rarely been investigated in existing methods. To address these problems, we propose a novel multistream, effective skeleton topology and semantics-guided adaptive graph convolution network for action recognition. By comparing several different topological graphs, we design an elbow- and knee-centric topology that forms the input to the adaptive graph convolutional network. Moreover, we explicitly embed high-level semantic skeletal information into this network to enhance its feature representation capabilities. Finally, we study the positional relationships between the joints and the center of gravity within each frame to generate relative position data. These data are combined with the joint data, bone data, and their corresponding motion information in a multistream network to further improve action recognition accuracy. Extensive experiments show that the proposed method achieves state-of-the-art performance.
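The multistream input described above can be illustrated with a short sketch. The helper names below are hypothetical, and the center of gravity is approximated as the mean joint position per frame, which is an assumption rather than the paper's exact definition:

```python
import numpy as np

def relative_position_stream(joints):
    """joints: array of shape (T, V, C) -- T frames, V joints, C=3 coordinates.
    Returns each joint's position relative to the per-frame center of gravity,
    here approximated as the mean of all joint positions (an assumption)."""
    center = joints.mean(axis=1, keepdims=True)   # (T, 1, C)
    return joints - center                        # (T, V, C)

def bone_stream(joints, parents):
    """Bone vectors: each joint minus its parent joint in the skeleton topology.
    parents: array of shape (V,) giving each joint's parent index."""
    return joints - joints[:, parents, :]

def motion_stream(data):
    """Temporal motion: frame-to-frame differences, zero-padded at the last frame."""
    motion = np.zeros_like(data)
    motion[:-1] = data[1:] - data[:-1]
    return motion
```

In a multistream design of this kind, each stream (joint, bone, relative position, and their motions) is typically fed to its own network branch, and the classification scores are fused at the end.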
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable and insightful comments on an earlier version of this manuscript. This work was supported by the Natural Science Foundation of China (No. 61871196, 61902330, 62001176); the National Key Research and Development Program of China (No. 2019YFC1604700); the Natural Science Foundation of Fujian Province of China (No. 2019J01082 and 2020J01085); and the Promotion Program for Young and Middle-aged Teachers in Science and Technology Research of Huaqiao University (ZQN-YX601).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Qiu, ZX., Zhang, HB., Deng, WM. et al. Effective skeleton topology and semantics-guided adaptive graph convolution network for action recognition. Vis Comput 39, 2191–2203 (2023). https://doi.org/10.1007/s00371-022-02473-7