Abstract
In the multimedia domain, understanding video content requires rich semantic features comprising both static and temporal information: static information carries spatial clues, while temporal information carries short-term as well as long-term motion clues. Extracting these visual features is a challenging task, and conventional handcrafted feature extraction methods are inefficient for analysing them. Inspired by the significant progress of deep learning models in the image domain, this research extends the design and exploration of deep neural models to the video domain. We address the modelling of spatio-temporal features in two ways. First, we use various convolutional neural networks (CNNs) and their pipelines to investigate video features, fusion strategies, and their results. Second, we overcome the limitations of CNNs in capturing long-term motion clues by using sequential learning models, such as the long short-term memory (LSTM) network, in conjunction with multimodal fusion techniques. The main objective of this study is to explore spatial-temporal clues, survey the models available for the problem, and identify the most promising approaches.
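The CNN-plus-LSTM pipeline summarised above can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the layer sizes, the tiny stand-in convolutional stack, and the class name `CNNLSTM` are all hypothetical. A per-frame CNN extracts spatial clues, and an LSTM aggregates them over time to capture long-term motion clues.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Sketch of a CNN + LSTM video classifier (hypothetical sizes)."""
    def __init__(self, num_classes=10, feat_dim=128, hidden=64):
        super().__init__()
        # Per-frame spatial feature extractor (a stand-in for a deep CNN
        # such as VGG or ResNet applied to each frame).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # LSTM models long-term temporal clues across the frame sequence.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):
        # clips: (batch, time, channels, height, width)
        b, t = clips.shape[:2]
        # Fold time into the batch dim so the CNN sees individual frames.
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        # Classify from the last time step's hidden state.
        return self.head(out[:, -1])

model = CNNLSTM()
logits = model(torch.randn(2, 8, 3, 64, 64))  # 2 clips of 8 frames each
print(logits.shape)  # torch.Size([2, 10])
```

In a multimodal fusion setting, a second stream (e.g. one operating on optical flow) would be run in parallel and its scores or features combined with this RGB stream.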
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Suresha, M., Kuppa, S., Raghukumar, D.S. (2022). Deep Learning Approaches for Spatio-Temporal Clues Modelling. In: Tavares, J.M.R.S., Dutta, P., Dutta, S., Samanta, D. (eds) Cyber Intelligence and Information Retrieval. Lecture Notes in Networks and Systems, vol 291. Springer, Singapore. https://doi.org/10.1007/978-981-16-4284-5_30
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-4283-8
Online ISBN: 978-981-16-4284-5
eBook Packages: Intelligent Technologies and Robotics (R0)