Abstract
Action understanding in videos is a challenging task that has attracted widespread attention in recent years. Most current methods localize bounding boxes of actors at the frame level and then track or link these detections to form action tubes across frames. Such methods often focus on exploiting temporal context in videos while neglecting the importance of the detector itself. In this paper, we present a two-stream enhanced framework for action detection. Specifically, we devise appearance and motion detectors in a two-stream manner, which take k consecutive RGB frames and optical flow images as input, respectively. To improve feature representation capability, an anchor refinement sub-module with feature alignment is introduced into the two-stream architecture to generate flexible anchor cuboids. Meanwhile, a hierarchical fusion strategy is used to concatenate intermediate feature maps for capturing fast-moving subjects. Moreover, layer normalization with skip connections is adopted to reduce the internal covariate shift between network layers, which makes the training process simple and effective. Compared with state-of-the-art methods, the proposed approach yields impressive performance gains on three prevailing datasets, UCF-Sports, UCF-101 and J-HMDB, which confirms the effectiveness of our enhanced detector for action detection.
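The layer-normalization-with-skip-connection idea mentioned above can be illustrated in isolation. The following is a minimal NumPy sketch, not the authors' implementation: the function names and the toy transform are hypothetical, and a real detector would apply this inside a deep convolutional network rather than to 1-D feature vectors.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean and unit variance
    # over its channel (last) axis, reducing internal covariate shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def block_with_skip(x, transform):
    # Skip connection: the block's transform is applied to the
    # normalized input, and its output is added back to the input,
    # keeping activation statistics stable between layers.
    return x + transform(layer_norm(x))

# Toy example: a 2-sample batch of 4-channel features and a
# hypothetical transform that simply scales its input.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
y = block_with_skip(x, lambda z: 0.5 * z)
```

Because the residual path carries `x` through unchanged, the block behaves close to identity at initialization, which is one reason such connections tend to make training simpler.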
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China (Grant nos. 61572251, 61572162, 61702144 and 61802095), the Natural Science Foundation of Zhejiang Province (LQ17F020003), and the Key Science and Technology Project Foundation of Zhejiang Province (2018C01012).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Zhang, M., Hu, H., Li, Z. et al. Action detection with two-stream enhanced detector. Vis Comput 39, 1193–1204 (2023). https://doi.org/10.1007/s00371-021-02397-8