
Action detection with two-stream enhanced detector

Original article · The Visual Computer

Abstract

Action understanding in videos is a challenging task that has attracted widespread attention in recent years. Most current methods localize bounding boxes of actors at the frame level and then track or link these detections to form action tubes across frames. These methods often focus on exploiting temporal context in videos while neglecting the importance of the detector itself. In this paper, we present a two-stream enhanced framework for action detection. Specifically, we devise appearance and motion detectors in a two-stream manner, which take k consecutive RGB frames and optical-flow images as input, respectively. To improve feature representation capability, an anchor refinement sub-module with feature alignment is introduced into the two-stream architecture to generate flexible anchor cuboids. Meanwhile, a hierarchical fusion strategy concatenates intermediate feature maps to capture fast-moving subjects. Moreover, layer normalization with a skip connection is adopted to reduce internal covariate shift between network layers, which makes the training process simple and effective. Compared to state-of-the-art methods, the proposed approach yields impressive performance gains on three prevailing datasets: UCF-Sports, UCF-101 and J-HMDB, which confirms the effectiveness of our enhanced detector for action detection.
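
To make the architecture concrete, below is a minimal PyTorch sketch of two ideas the abstract highlights: a two-stream detector whose appearance branch takes k stacked RGB frames and whose motion branch takes the corresponding optical-flow images, with hierarchical fusion by concatenating intermediate feature maps, and layer normalization wrapped in a skip connection. All layer counts, channel widths and the detection head are illustrative assumptions; the anchor-refinement sub-module with feature alignment and the tube-linking step are omitted, so this is a sketch of the general technique rather than the authors' implementation.

    import torch
    import torch.nn as nn

    class LayerNormSkip(nn.Module):
        """Layer normalization with a skip connection (exact placement in
        the backbone is an assumption)."""
        def __init__(self, channels):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            # GroupNorm with a single group normalizes over (C, H, W) per
            # sample, i.e. layer normalization for conv feature maps.
            self.norm = nn.GroupNorm(1, channels)

        def forward(self, x):
            # Skip connection around the normalized convolution branch.
            return x + self.norm(self.conv(x))

    class TwoStreamDetector(nn.Module):
        """Skeleton of a two-stream detector: an appearance branch over k
        stacked RGB frames, a motion branch over k optical-flow images, and
        hierarchical fusion by concatenating intermediate feature maps."""
        def __init__(self, k=6, channels=64, depth=3):
            super().__init__()
            self.rgb_stem = nn.Conv2d(3 * k, channels, 3, padding=1)   # k RGB frames stacked on channels
            self.flow_stem = nn.Conv2d(2 * k, channels, 3, padding=1)  # k flow fields (dx, dy)
            self.rgb_blocks = nn.ModuleList(LayerNormSkip(channels) for _ in range(depth))
            self.flow_blocks = nn.ModuleList(LayerNormSkip(channels) for _ in range(depth))
            # Hierarchical fusion: every intermediate map from both streams is kept.
            self.fuse = nn.Conv2d(channels * 2 * depth, channels, 1)
            self.head = nn.Conv2d(channels, 4 + 1, 1)  # box offsets + actionness score per location

        def forward(self, rgb, flow):
            a, m = self.rgb_stem(rgb), self.flow_stem(flow)
            feats = []
            for ab, mb in zip(self.rgb_blocks, self.flow_blocks):
                a, m = ab(a), mb(m)
                feats += [a, m]  # collect features at every depth level
            fused = self.fuse(torch.cat(feats, dim=1))
            return self.head(fused)

    # Example: k = 6 frames at 224 x 224 resolution.
    det = TwoStreamDetector(k=6)
    out = det(torch.randn(1, 18, 224, 224), torch.randn(1, 12, 224, 224))
    print(out.shape)  # torch.Size([1, 5, 224, 224])

Using GroupNorm with one group is a common way to apply layer normalization to convolutional feature maps; per-frame detections produced by such a head would still need to be linked into action tubes across frames, as the paper describes.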

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (Grant nos. 61572251, 61572162, 61702144 and 61802095), the Natural Science Foundation of Zhejiang Province (LQ17F020003), and the Key Science and Technology Project Foundation of Zhejiang Province (2018C01012).

Author information

Corresponding author

Correspondence to Haiyang Hu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, M., Hu, H., Li, Z. et al. Action detection with two-stream enhanced detector. Vis Comput 39, 1193–1204 (2023). https://doi.org/10.1007/s00371-021-02397-8
