
Action Recognition in Videos Using Multi-stream Convolutional Neural Networks

Chapter in Deep Learning Applications

Abstract

Human action recognition aims to classify trimmed videos according to the action performed by one or more agents. It has applications in a large variety of tasks, such as surveillance systems, smart homes, health monitoring, and human-computer interaction. Despite the significant progress achieved with image-based deep networks, video understanding still faces challenges in modeling spatiotemporal relations, and the inclusion of temporal information in a network can substantially increase its training cost. To address this issue, we explore complementary handcrafted features to feed pre-trained two-dimensional (2D) networks in a multi-stream fashion. In addition to the commonly used RGB and optical flow streams, we propose a stream based on visual rhythm images, which encode long-term information. Previous works have shown that both the RGB and optical flow streams benefit from pre-training on ImageNet, since these modalities preserve object shapes to a certain extent. The visual rhythm, on the other hand, severely deforms the silhouettes of actors and objects. We therefore develop a different pre-training procedure for this stream, using visual rhythm images extracted from a large and challenging video dataset, Kinetics.
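To make the visual rhythm representation concrete, the sketch below builds one by sampling a single pixel line from every frame and stacking the lines along the time axis, yielding an ordinary 2D image that a pre-trained network can consume. This is a minimal illustration only, assuming OpenCV and NumPy; the function name visual_rhythm, the fixed central-line choice, and the 224x224 output size are simplifications for exposition rather than the chapter's exact extraction procedure.

```python
import cv2
import numpy as np

def visual_rhythm(video_path, mode="horizontal", out_size=(224, 224)):
    """Stack one pixel line per frame into a single 2D rhythm image."""
    cap = cv2.VideoCapture(video_path)
    lines = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        # Horizontal rhythm samples the central row of each frame;
        # vertical rhythm samples the central column.
        line = frame[h // 2, :, :] if mode == "horizontal" else frame[:, w // 2, :]
        lines.append(line)
    cap.release()
    if not lines:
        raise ValueError(f"no frames decoded from {video_path}")
    # Row t of the stacked image is the line sampled from frame t,
    # so the vertical axis of the result represents time.
    rhythm = np.stack(lines, axis=0)
    # Resize to the fixed input resolution expected by a 2D CNN.
    return cv2.resize(rhythm, out_size)
```

Because each row of the resulting image comes from a different instant, object silhouettes are sheared across time; this is why ImageNet pre-training, which relies on intact object shapes, transfers poorly to this stream and motivates the Kinetics-based pre-training described above.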



Acknowledgements

The authors thank FAPESP (grants #2017/09160-1 and #2017/12646-3), CNPq (grant #305169/2015-7), CAPES, and FAPEMIG for their financial support. The authors are also grateful to NVIDIA for the donation of a GPU as part of the GPU Grant Program.

Author information


Corresponding author

Correspondence to Helena de Almeida Maia.



Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

de Almeida Maia, H. et al. (2020). Action Recognition in Videos Using Multi-stream Convolutional Neural Networks. In: Wani, M., Kantardzic, M., Sayed-Mouchaweh, M. (eds) Deep Learning Applications. Advances in Intelligent Systems and Computing, vol 1098. Springer, Singapore. https://doi.org/10.1007/978-981-15-1816-4_6

