
Neurocomputing

Volume 287, 26 April 2018, Pages 68-83

A deep-learning based feature hybrid framework for spatiotemporal saliency detection inside videos

https://doi.org/10.1016/j.neucom.2018.01.076

Abstract

Although research on saliency detection and visual attention has been active in recent years, most existing work focuses on still images rather than video-based saliency. In this paper, a deep-learning based hybrid spatiotemporal saliency feature extraction framework is proposed for saliency detection from video footage. The deep learning model is used to extract high-level features from raw video data, which are then integrated with other high-level features. The deep learning network is found to be considerably more effective at extracting hidden features than conventional handcrafted methods. This work demonstrates the effectiveness of hybrid high-level features for saliency detection in video. Rather than using a single static image, the proposed deep learning model takes several consecutive frames as input, so that both spatial and temporal characteristics are considered when computing saliency maps. The efficacy of the proposed hybrid feature framework is evaluated on five databases of complex scenes with recorded human gaze. Experimental results show that the proposed model outperforms five other state-of-the-art video saliency detection approaches. In addition, the proposed framework is found useful for other video-content based applications such as video highlight detection. As a result, a large movie clip dataset together with labeled video highlights has been generated.

Introduction

Visual saliency has been an important and popular research topic in image processing for decades, with the sole purpose of mimicking biological visual perception for machine vision applications. Substantial interest in the field is evidenced by the vast volume of publications over the last two decades, including applications of the saliency concept to image/video compression and recognition [1], [2], [3], [4], [5], [6], automatic image cropping [7], non-photorealistic rendering [8], adaptive image display on small devices [9], movie summarization [10], shot detection [11], human–robot interaction [12], and detection of multi-class geospatial targets [13], [14].

Historically, saliency detection research was initiated by Treisman and Gelade in 1980 [15], who proposed the "Feature Integration Theory" to illustrate how visual attention is attracted by features in imagery. Itti and Koch's model triggered strong interest in this field of research, including the use of low-level features to map salient regions/objects in the image scene [16]. He et al. [17] proposed a biologically inspired saliency model using high-level object and contextual features for saliency detection, based on Judd's concept [18]. Further extension of research along this line was reported by Goferman et al. [19], who emphasized that four important factors, namely local low-level features, global considerations, visual organization and high-level factors, strongly affect saliency detection. The methodology for feature extraction has also been improved.

Despite intensive research on image based saliency detection, video saliency has not been addressed until recent years. In fact, video saliency is quite different from that of still images, mainly because of the very limited frame-to-frame interval within which features in the scene can draw the observer's attention. Although image-based saliency models have been extended to video streams, for example by using temporal intensity and orientation contrasts as dynamic features [20], [21], a better framework is needed for more efficient saliency detection from video footage.

While most work in the field has focused on low-level features, human attention prediction is considered to be dominated by high-level features such as objects, actions and events. Rudoy et al. [22] employed viewers' gaze directions and their actions as cues to locate salient features, as opposed to the conventional image-based pixel feature extraction method. Han et al. [23] proposed that meaningful objects are important to saliency detection. Based on visual attention and eye movement data, a video saliency detection model was trained and found to outperform other state-of-the-art algorithms.

On one hand, conventional handcrafted features have proven their success in existing approaches and applications. On the other hand, deep learning networks have shown great potential in computer vision, particularly for modeling human perception on large-scale data and more complicated problems. It is our intention here to combine these approaches to address the challenges of video saliency detection. Some papers [24], [25] report that combining deep learning based features and handcrafted features is effective for saliency detection. However, these methods use only a single image and do not consider temporal information in video. Wang et al. [26] capture spatial and temporal saliency via fully convolutional networks (FCNs) from frame pairs, but a single frame pair is not enough to capture the visual persistence phenomenon that occurs when watching videos. Different from the works mentioned above, we propose a novel hybrid framework of deep learning and handcrafted features for spatial dynamic attention in video, using seven consecutive frames as input.
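
To make the seven-frame input concrete, the sketch below shows one way a convolutional network can consume a stack of seven consecutive frames and produce a coarse spatial saliency map. It is a minimal illustration under assumed layer sizes; the class name SevenFrameSaliencyNet and all dimensions are our own placeholders rather than the architecture proposed in this paper.

```python
# Minimal sketch (not the authors' exact network): a CNN that consumes a stack of
# seven consecutive RGB frames (7 x 3 = 21 input channels) and predicts a coarse
# spatial saliency map. All layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SevenFrameSaliencyNet(nn.Module):
    def __init__(self, num_frames: int = 7):
        super().__init__()
        in_channels = num_frames * 3          # stack RGB frames along the channel axis
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(64, 1, kernel_size=1)   # one-channel saliency response

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W) -> flatten the temporal axis into channels
        b, t, c, h, w = frames.shape
        x = self.encoder(frames.reshape(b, t * c, h, w))
        return torch.sigmoid(self.head(x))            # coarse map at 1/4 resolution

# Usage: a batch of two 7-frame clips at 224x224 resolution.
clips = torch.rand(2, 7, 3, 224, 224)
print(SevenFrameSaliencyNet()(clips).shape)           # torch.Size([2, 1, 56, 56])
```

Stacking frames along the channel axis is only one of several options; the temporal branch described later additionally relies on a 3D CNN, which convolves over the temporal axis explicitly.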

This paper focuses on the eye fixation prediction task for video streams. A deep learning based combined feature framework is proposed to predict spatial and temporal saliency regions and visual dispersion in video sequences. The features are extracted via an effective deep learning model and then integrated with other handcrafted features. The effectiveness of these combined high-level features for saliency detection from video streams is assessed using five publicly available eye gaze datasets. In addition, a clip-vote dataset of 596 movie clips with votes from 70 observers has also been employed to validate the applicability of the proposed approach for highlight extraction from movie streams.

Although research on saliency detection and visual attention has received increasing attention in recent years, most existing work focuses on still images rather than video based saliency detection. In this paper, we propose a deep learning based hybrid spatiotemporal feature framework for saliency prediction from video streams. The main contributions of the present work can be summarized as follows:

  • A hybrid feature framework is proposed for saliency detection in video. Low-level features extracted from convolutional neural networks are found to be more effective than commonly used handcrafted features such as intensity, color, orientation, and texture for saliency detection. The integration of high-level features with the low-level ones, as well as the use of a customized classifier rather than the one built into the CNN, proves to be a very useful supplement to our framework (a minimal illustrative sketch is given after this list). The performance of this hybrid feature framework is validated on five video datasets.

  • A CNN based feature hybrid method is proposed for spatial saliency detection using seven consecutive raw frames.

  • A 3D CNN with high-level object features, a scene complexity feature, and a cluster weighted model are employed for temporal dynamic attention detection.

  • In addition, based on the proposed TDA model, a movie clip dataset with subjective ranking of highlight levels is constructed. To the best of our knowledge, this is the first such dataset in this field. As shown in this work, it may also be useful for semantic video analysis.

  • Experimental results show that the proposed hybrid feature framework outperforms five state-of-the-art methods for saliency detection on five public eye fixation databases.
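
As referenced in the first contribution above, the sketch below illustrates the general hybrid-feature idea: activations taken from a CNN are concatenated with handcrafted descriptors and passed to a customized predictor outside the network. The feature dimensions and the choice of a support vector regressor are illustrative assumptions, not the exact pipeline used in this work.

```python
# Minimal sketch of the hybrid-feature idea (feature sizes and the SVR predictor are
# illustrative assumptions, not the paper's pipeline): CNN activations are concatenated
# with handcrafted descriptors and scored by a separate, customized regressor.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_patches = 200

cnn_features = rng.random((n_patches, 64))        # e.g. pooled activations of a conv layer
handcrafted = rng.random((n_patches, 10))         # e.g. intensity/colour/orientation statistics
hybrid = np.hstack([cnn_features, handcrafted])   # hybrid feature vector per image patch

fixation_density = rng.random(n_patches)          # placeholder ground-truth saliency per patch

# A customized regressor replaces the CNN's built-in classification head.
model = SVR(kernel="rbf").fit(hybrid, fixation_density)
print(model.predict(hybrid).shape)                # (200,)
```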

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 provides an overview of the proposed approach, including the definition of spatiotemporal attention and ground truth determination. Sections 4 and 5 discuss the spatial dynamic attention model and the temporal dynamic attention model, respectively, along with details of feature extraction. Section 6 presents the experimental results and discussion on five publicly available datasets. In Section 7, experiments on our constructed movie clip database are reported. Finally, concluding remarks are drawn in Section 8.

Section snippets

Related work

Saliency detection models can in general be categorized into visual attention prediction and salient object detection. In this paper, we propose a deep learning framework for predicting eye fixation locations where a human observer may fixate [27], [28], [29]. Itti and Koch used low-level features to map salient regions/objects in the image scene [16]. Koch and Ullman [28] introduced a feed-forward bottom-up model to combine features in the form of a saliency map to represent the most

Introduction of spatiotemporal dynamic attention

In recent years, saliency detection from static images has been intensively investigated. Deduced from several image cues, Koch and Ullman [28] introduced the concept of a saliency map to represent the conspicuous regions of an image. By using eye-tracking devices to capture viewers' most attended regions of interest when looking at specified images, several eye fixation datasets have been constructed, which have facilitated a number of methods for supervised-learning based image

SDA deep learning

Based on the previously introduced spatiotemporal dynamic attention framework, an SDA prediction model is proposed in this section and detailed as follows.

TDA deep learning

In this section, we explain how temporal dynamic attention is predicted using the proposed framework. It is well known that viewers' gaze locations are not constant when watching video. The distribution of gaze locations can be very dispersed, particularly when the video/scene is not very interesting; however, viewers' gaze tends to be focused on the same place when an interesting event is happening. This phenomenon is known as visual
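
A simple way to quantify the dispersion just described is to measure how far viewers' fixation points spread around their common centroid in each frame. The statistic below is an illustrative assumption rather than the paper's TDA measure; it only shows that focused and dispersed gaze patterns are easy to separate numerically.

```python
# Illustrative dispersion statistic (an assumption, not the paper's TDA measure):
# the mean distance of viewers' fixation points from their centroid for one frame.
import numpy as np

def fixation_dispersion(fixations: np.ndarray) -> float:
    """fixations: (num_viewers, 2) array of (x, y) gaze coordinates for one frame."""
    centroid = fixations.mean(axis=0)
    return float(np.linalg.norm(fixations - centroid, axis=1).mean())

# Focused frame: fixations cluster tightly around one location.
focused = np.array([[100, 120], [102, 118], [99, 121], [101, 119]], dtype=float)
# Dispersed frame: fixations spread across the scene.
dispersed = np.array([[20, 30], [300, 40], [150, 200], [280, 230]], dtype=float)

print(fixation_dispersion(focused))    # small value -> attention is concentrated
print(fixation_dispersion(dispersed))  # large value -> attention is dispersed
```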

Learning and experimental results

To verify the effectiveness of the proposed models for SDA/TDA computation, intensive experiments were performed on five datasets: VAGBA [54], Lübeck INB [55], IVB [56], CRCNS [57] and DIEM [58]. All five eye fixation datasets can be downloaded from the Internet. The main contents of these datasets are summarized and compared in Table 1, where three of them (IVB, VAGBA and Lübeck INB) had been used in [23]. The reason to consider two more eye tracking datasets in our

Film clips ranking based database

In this group of experiments, the proposed spatiotemporal video saliency model is applied to predict movie highlights. Results on a large clip-based database collected by us are reported below to further validate the efficacy of the proposed methodologies.
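
As a hedged illustration of how a per-frame attention signal could be linked to clip-level highlights (not the pipeline actually reported in Section 7), the sketch below aggregates frame scores into a clip score and compares the resulting ranking with observer votes via a rank correlation; all data here are placeholders.

```python
# Hedged illustration (not the paper's highlight pipeline): rank movie clips by an
# aggregated per-frame attention score and compare against observers' votes.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
num_clips, frames_per_clip = 10, 120

# Placeholder per-frame attention scores for each clip (stand-in for model output).
frame_scores = rng.random((num_clips, frames_per_clip))

# Clip-level highlight score: mean of the 10% most-attended frames in each clip.
k = frames_per_clip // 10
clip_scores = np.sort(frame_scores, axis=1)[:, -k:].mean(axis=1)

# Placeholder observer votes per clip (stand-in for the clip-vote dataset labels).
votes = rng.integers(0, 70, size=num_clips)

rho, p_value = spearmanr(clip_scores, votes)
print(f"Spearman correlation between predicted and voted rankings: {rho:.2f}")
```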

Conclusion

As shown in Fig. 1, a deep-learning based hybrid feature extraction framework is proposed to address the problem of video saliency description and characterization. By combining the proposed novel deep learning networks with conventional methods as feature extractors, the resulting hybrid features are used to predict spatial and temporal saliency. Two detailed applications are implemented for the detection of spatiotemporal dynamic attention. Novel deep learning network architectures for

Acknowledgments

The authors greatly thank the Editors and anonymous reviewers for their constructive comments, which helped to further improve the clarity and quality of this paper. The authors wish to acknowledge the support received from the National Natural Science Foundation of China under grants 61572351 and 61772360, and from the joint project funded by the Royal Society of Edinburgh and NSFC under grant 61211130125.

References (125)

  • X. Lu et al., Sparse coding for image denoising using spike and slab prior, Neurocomputing (2013).
  • X. Lu et al., Image reconstruction by an alternating minimization, Neurocomputing (2011).
  • D. Walther et al., Modeling attention to salient proto objects, Neural Netw. (2006).
  • R. Rao et al., Eye movements in iconic visual search, Vis. Res. (2002).
  • M. Pomplun, Saccadic selectivity in complex visual search displays, Vis. Res. (2006).
  • U. Rutishauser et al., Is bottom-up attention useful for object recognition?
  • Z. Wang et al., Foveation scalable video coding with automatic fixation selection, IEEE Trans. Image Process. (2003).
  • W.S. Geisler et al., Real-time foveated multi-resolution system for low-bandwidth video communication.
  • X. Song et al., Semi-supervised feature selection via hierarchical regression for web image classification, Multimedia Syst. (2016).
  • A. Santella et al., Gaze-based interaction for semi-automatic photo cropping.
  • D. DeCarlo et al., Stylization and abstraction of photographs, ACM Trans. Gr. (2002).
  • M. Rubinstein et al., Improved seam carving for video retargeting, ACM Trans. Gr. (2008).
  • S. Marat et al., Video summarization using a visual attention model.
  • S. Marat et al., Modeling spatio-temporal saliency to predict gaze direction for short videos, Int. J. Comput. Vis. (2009).
  • C. Muhl et al., On constructing a communicative space in HRI.
  • J. Harel et al., Graph-based visual saliency.
  • S. He et al., A biologically inspired computational model for image saliency detection.
  • T. Judd et al., Learning to predict where humans look.
  • S. Goferman et al., Context-aware saliency detection, IEEE Trans. Pattern Anal. Mach. Intell. (2012).
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998).
  • C. Siagian et al., Rapid biologically-inspired scene classification using features shared with visual attention, IEEE Trans. Pattern Anal. Mach. Intell. (2007).
  • D. Rudoy et al., Learning video saliency from human gaze using candidate selection.
  • G. Li et al., Visual saliency detection based on multiscale deep CNN features, IEEE Trans. Image Process. (2016).
  • H. Lia et al., CNN for saliency detection with low-level feature integration, Neurocomputing (2017).
  • W. Wang, J. Shen, L. Shao, Deep learning for video saliency detection, 2017, arXiv preprint...
  • M. Dorr et al., Variability of eye movements when viewing dynamic natural scenes, J. Vis. (2010).
  • C. Koch et al., Shifts in selective visual attention: towards the underlying neural circuitry, Hum. Neurobiol. (1986).
  • H. Chua et al., Cultural variation in eye movements during scene perception.
  • R. Milanese, Detecting salient regions in an image: from biological evidence to computer implementation (1993).
  • S. Baluja et al., Using a saliency map for active spatial selective attention: implementation & initial results.
  • Y. Zhang et al., Saliency detection by combining spatial and spectral information, Opt. Lett. (2013).
  • J. Han et al., An object-oriented visual saliency detection framework based on sparse coding representations, IEEE Trans. Circuits Syst. Video Technol. (2013).
  • A. Borji et al., State-of-the-art in visual attention modeling, IEEE Trans. Pattern Anal. Mach. Intell. (2012).
  • W. Kim et al., Spatiotemporal saliency detection and its applications in static and dynamic scenes, IEEE Trans. Circuits Syst. Video Technol. (2011).
  • X. Hou et al., Saliency detection: a spectral residual approach.
  • C. Guo et al., Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform.
  • X. Cui et al., Temporal spectral residual: fast motion saliency detection.
  • X. Hou et al., Dynamic visual attention: searching for coding length increments.
  • D. Zhang et al., Revealing event saliency in unconstrained video collection, IEEE Trans. Image Process. (2017).
  • W. Wang et al., Consistent video saliency using local gradient flow optimization and global refinement, IEEE Trans. Image Process. (2015).

    Zheng Wang received the Ph.D. degree in Computer Science from Tianjin University (TJU), Tianjin, China, in 2009. He is now an associate professor in School of Computer Software, TJU. He once was a visiting scholar of INRIA institute, France, from 2007 to 2008. His current research interests include video analysis, hyperspectral imaging, and computer graphics.

    Jinchang Ren received his B.Eng. degree in computer software, M.Eng. in image processing, and D.Eng. in computer vision, all from Northwestern Polytechnical University (NWPU), China. He was also awarded a Ph.D. in Electronic Imaging and Media Communication from Bradford University, U.K. He is currently with the Dept. of Electronic and Electrical Engineering, University of Strathclyde. His research interests focus mainly on visual computing and multimedia signal processing, especially on semantic content extraction for video analysis and understanding, and on hyperspectral imaging.

    Dong Zhang is currently working towards his M.S. degree in the School of Computer Science and Technology at Tianjin University. His current research interests mainly focus on computer image processing, especially machine learning.

    Meijun Sun received the Ph.D. degree in Computer Science from Tianjin University (TJU), Tianjin, China, in 2009. She is now an associate professor in School of Computer Science and Technology, TJU. She once was a visiting scholar of INRIA institute, France, from 2007 to 2008. Her current research interests include computer graphics, hyperspectral imaging, and image processing.

    Jianmin Jiang (Co-corresponding Author) received a Ph.D. from the University of Nottingham, UK, in 1994. From 1997 to 2001, he worked as a full professor of Computing at the University of Glamorgan, Wales, UK. In 2002, he joined the University of Bradford, UK, as a Chair Professor of Digital Media, and Director of the Digital Media & Systems Research Institute. He worked at the University of Surrey, UK, as a full professor during 2010–2015 and as a distinguished professor (1000-plan) at Tianjin University, China, during 2010–2013. He is currently a distinguished professor and director of the Research Institute for Future Media Computing at the College of Computer Science & Software Engineering, Shenzhen University, China. He has been a chartered engineer, fellow of IEE/IET, fellow of RSA, member of EPSRC College in the UK, and EU FP-6/7 evaluator.
