
Deep Learning Approaches for Spatio-Temporal Clues Modelling

  • Conference paper
Cyber Intelligence and Information Retrieval

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 291)


Abstract

In the multimedia domain, understanding video content requires rich semantic features comprising static and temporal information: static information carries spatial clues, while temporal information carries both short-term and long-term motion clues. Exploring these visual features is a challenging task, and conventional handcrafted feature extraction methods are inefficient for analysing them. Inspired by the significant progress of deep learning models in the image domain, this research extends the design and exploration of deep neural models to the video domain. We address the evolution of spatio-temporal features in two ways in this paper. First, we use various convolutional neural networks (CNNs) and their pipelines to investigate video features, fusion strategies, and their results. Second, the drawbacks of CNNs in capturing long-term motion clues are overcome by using sequential learning models such as the long short-term memory (LSTM) network in conjunction with multimodal fusion techniques. The main objective of this study is to explore spatio-temporal clues and the different models available for the problem, and to determine the most promising approaches.
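
As a minimal sketch of the pipeline the abstract describes (not from the paper itself; it assumes PyTorch, and all module names, layer sizes, and the fusion scheme are illustrative assumptions), a per-frame CNN can extract spatial clues, an LSTM over the frame features can model long-term motion clues, and the two streams can be combined by late, score-level fusion:

```python
# Hypothetical sketch: per-frame CNN (spatial clues) + LSTM (long-term motion
# clues) with late score-level fusion. Shapes and sizes are illustrative only.
import torch
import torch.nn as nn


class SpatialCNN(nn.Module):
    """Turns each RGB frame into a feature vector of spatial clues."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> (N, 64, 1, 1)
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):  # x: (N, 3, H, W)
        return self.fc(self.conv(x).flatten(1))


class CNNLSTMClassifier(nn.Module):
    """Spatial stream (frame-averaged CNN features) fused with a temporal
    stream (LSTM over the per-frame CNN features)."""

    def __init__(self, num_classes, feat_dim=256, hidden=128):
        super().__init__()
        self.cnn = SpatialCNN(feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.spatial_head = nn.Linear(feat_dim, num_classes)   # static clues
        self.temporal_head = nn.Linear(hidden, num_classes)    # motion clues

    def forward(self, clip):  # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)    # (B, T, feat_dim)
        seq_out, _ = self.lstm(feats)
        spatial_logits = self.spatial_head(feats.mean(dim=1))  # average frames
        temporal_logits = self.temporal_head(seq_out[:, -1])   # last time step
        return (spatial_logits + temporal_logits) / 2          # late fusion


# Usage: a random batch of two 8-frame 64x64 clips, 10 action classes.
model = CNNLSTMClassifier(num_classes=10)
logits = model(torch.randn(2, 8, 3, 64, 64))  # -> shape (2, 10)
```

In practice the spatial stream would typically start from a CNN pretrained on images, and the fixed averaging of the two score vectors could be replaced by a learned fusion layer; both choices here are placeholders.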

Author information

Correspondence to M. Suresha.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Suresha, M., Kuppa, S., Raghukumar, D.S. (2022). Deep Learning Approaches for Spatio-Temporal Clues Modelling. In: Tavares, J.M.R.S., Dutta, P., Dutta, S., Samanta, D. (eds) Cyber Intelligence and Information Retrieval. Lecture Notes in Networks and Systems, vol 291. Springer, Singapore. https://doi.org/10.1007/978-981-16-4284-5_30
