
Action Recognition in Videos Using Multi-stream Convolutional Neural Networks

Chapter in Deep Learning Applications

Abstract

Human action recognition aims to classify trimmed videos according to the action performed by one or more agents. It has applications in a large variety of tasks, such as surveillance systems, smart homes, health monitoring, and human-computer interaction. Despite the significant progress achieved with image-based deep networks, video understanding still faces challenges in modeling spatiotemporal relations, and the inclusion of temporal information in a network can substantially increase its training cost. To address this issue, we explore complementary handcrafted features to feed pre-trained two-dimensional (2D) networks in a multi-stream fashion. In addition to the commonly used RGB and optical flow streams, we propose a stream based on visual rhythm images, which encode long-term information. Previous works have shown that both the RGB and optical flow streams benefit from pre-training on ImageNet, since these modalities preserve object shapes to a certain extent. The visual rhythm, on the other hand, severely deforms the silhouettes of actors and objects. We therefore develop a different pre-training procedure for this stream, using visual rhythm images extracted from a large and challenging video dataset, Kinetics.
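To make the visual rhythm representation concrete, the sketch below builds one by sampling a single pixel line from every frame and stacking the lines along the time axis, yielding an ordinary 2D image that a pre-trained network can consume. This is a minimal illustration only, assuming OpenCV and NumPy; the function name visual_rhythm, the fixed central-line choice, and the 224x224 output size are simplifications for exposition rather than the chapter's exact extraction procedure.

```python
import cv2
import numpy as np

def visual_rhythm(video_path, mode="horizontal", out_size=(224, 224)):
    """Stack one pixel line per frame into a single 2D rhythm image."""
    cap = cv2.VideoCapture(video_path)
    lines = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        # Horizontal rhythm samples the central row of each frame;
        # vertical rhythm samples the central column.
        line = frame[h // 2, :, :] if mode == "horizontal" else frame[:, w // 2, :]
        lines.append(line)
    cap.release()
    if not lines:
        raise ValueError(f"no frames decoded from {video_path}")
    # Row t of the stacked image is the line sampled from frame t,
    # so the vertical axis of the result represents time.
    rhythm = np.stack(lines, axis=0)
    # Resize to the fixed input resolution expected by a 2D CNN.
    return cv2.resize(rhythm, out_size)
```

Because each row of the resulting image comes from a different instant, object silhouettes are sheared across time; this is why ImageNet pre-training, which relies on intact object shapes, transfers poorly to this stream and motivates the Kinetics-based pre-training described above.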



Acknowledgements

The authors thank FAPESP (grants #2017/09160-1 and #2017/12646-3), CNPq (grant #305169/2015-7), CAPES, and FAPEMIG for their financial support. The authors are also grateful to NVIDIA for the donation of a GPU as part of the GPU Grant Program.

Author information


Corresponding author

Correspondence to Helena de Almeida Maia.



Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

de Almeida Maia, H. et al. (2020). Action Recognition in Videos Using Multi-stream Convolutional Neural Networks. In: Wani, M., Kantardzic, M., Sayed-Mouchaweh, M. (eds) Deep Learning Applications. Advances in Intelligent Systems and Computing, vol 1098. Springer, Singapore. https://doi.org/10.1007/978-981-15-1816-4_6

