Abstract
In the multimedia domain, understanding video content requires rich semantic features comprising both static and temporal information: static information carries spatial clues, while temporal information carries short-term as well as long-term motion clues. Extracting these visual features is a challenging task, and conventional handcrafted feature extraction methods are inefficient for analysing them. Inspired by the significant progress of deep learning models in the image domain, this research extends the design and exploration of deep neural models to the video domain. We address the modelling of spatio-temporal features in two ways. First, we use various convolutional neural networks (CNNs) and their pipelines to investigate video features, fusion strategies, and their results. Second, we overcome the limitations of CNNs in capturing long-term motion clues by using sequential learning models, such as the long short-term memory (LSTM) network, in conjunction with multimodal fusion techniques. The main objective of this study is to explore spatial-temporal clues, survey the models available for the problem, and identify the most promising approaches.
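The CNN-plus-LSTM pipeline summarised above can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the layer sizes, the tiny stand-in convolutional stack, and the class name `CNNLSTM` are all hypothetical. A per-frame CNN extracts spatial clues, and an LSTM aggregates them over time to capture long-term motion clues.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Sketch of a CNN + LSTM video classifier (hypothetical sizes)."""
    def __init__(self, num_classes=10, feat_dim=128, hidden=64):
        super().__init__()
        # Per-frame spatial feature extractor (a stand-in for a deep CNN
        # such as VGG or ResNet applied to each frame).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # LSTM models long-term temporal clues across the frame sequence.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):
        # clips: (batch, time, channels, height, width)
        b, t = clips.shape[:2]
        # Fold time into the batch dim so the CNN sees individual frames.
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        # Classify from the last time step's hidden state.
        return self.head(out[:, -1])

model = CNNLSTM()
logits = model(torch.randn(2, 8, 3, 64, 64))  # 2 clips of 8 frames each
print(logits.shape)  # torch.Size([2, 10])
```

In a multimodal fusion setting, a second stream (e.g. one operating on optical flow) would be run in parallel and its scores or features combined with this RGB stream.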
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Suresha, M., Kuppa, S., Raghukumar, D.S. (2022). Deep Learning Approaches for Spatio-Temporal Clues Modelling. In: Tavares, J.M.R.S., Dutta, P., Dutta, S., Samanta, D. (eds) Cyber Intelligence and Information Retrieval. Lecture Notes in Networks and Systems, vol 291. Springer, Singapore. https://doi.org/10.1007/978-981-16-4284-5_30
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-4283-8
Online ISBN: 978-981-16-4284-5
eBook Packages: Intelligent Technologies and Robotics (R0)