Elsevier

Signal Processing

Volume 120, March 2016, Pages 761-766

Multimodal video classification with stacked contractive autoencoders

https://doi.org/10.1016/j.sigpro.2015.01.001

Highlights

  • A two-stage framework for multimodal video classification is proposed.

  • The model is built based on stacked contractive autoencoders.

  • The first stage is single modal pre-training.

  • The second stage is multimodal fine-tuning.

  • The objective functions are optimized by stochastic gradient descent.

Abstract

In this paper we propose a multimodal feature learning mechanism based on deep networks (i.e., stacked contractive autoencoders) for video classification. Considering the three modalities in video, i.e., image, audio and text, we first build one Stacked Contractive Autoencoder (SCAE) for each modality, whose outputs are then joined together and fed into a Multimodal Stacked Contractive Autoencoder (MSCAE). The first stage preserves intra-modality semantic relations and the second stage discovers inter-modality semantic correlations. Experiments on a real-world dataset demonstrate that the proposed approach achieves better performance than state-of-the-art methods.

Introduction

With the rapid progress of storage devices, the Internet, and social networks, a huge amount of video data is being generated. How to index and search these videos effectively is an increasingly active research issue in the multimedia community. To bridge the semantic gap between low-level features and high-level semantics, automatic video annotation and classification have emerged as important techniques for efficient video retrieval [1], [2], [3].

Typical approaches to video annotation and classification apply machine learning methods using only image features extracted from the keyframes of video clips. In fact, video consists of three modalities, namely image, audio and text. Keyframe image features express only the visual aspect, whereas auditory and textual features are equally important for understanding video semantics. A great deal of research has therefore focused on utilizing multimodal features for a better understanding of video semantics [4], [5]. Thus, multimodal integration in video may compensate for the limitations of learning from any single modality.

There are also many other multimodal learning strategies. One group focuses on multi-modal or cross-modal retrieval, which learns to map high-dimensional heterogeneous features into a common low-dimensional latent space [6], [7], [8], [9]. Another group is composed of graph-based models, which generate geometric descriptors from multi-channel or multi-sensor data to improve image or video analysis [10], [11], [12], [13], [14], [15], [16]. However, these methods are discriminative and rely on a supervised setting, which requires a large amount of labeled data and wastes abundant unlabeled data. Collecting labeled data is time-consuming and labor intensive. Thus, discovering, with unsupervised learning alone, data representations that make it easier to extract useful information when building classifiers has become a major challenge.

Recently, deep learning methods have attracted tremendous interest from researchers. The breakthrough in deep learning was initiated by Hinton and quickly followed up in the same year [17], [18], [19], with many more works to follow. A central idea, referred to as greedy layerwise unsupervised pre-training, is to learn a hierarchy of features one level at a time, using unsupervised feature learning to learn a new transformation at each level that is composed with the previously learned transformations; essentially, each iteration of unsupervised feature learning adds one layer of weights to a deep neural network. Finally, the set of layers can be combined to initialize a deep supervised predictor [20]. Methods that have been considered include Deep Belief Networks (DBNs) [17] with Restricted Boltzmann Machines (RBMs), autoencoders [18], Convolutional Neural Networks (CNNs) [19], and so on.
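
As background for the model in this paper, the following is a minimal sketch of greedy layer-wise pre-training with plain autoencoders, written in PyTorch; the layer sizes, learning rate, epoch count and the helper name pretrain_stack are illustrative assumptions rather than details from the paper.

    # Greedy layer-wise unsupervised pre-training (sketch): each layer is an
    # autoencoder trained to reconstruct the codes of the previously trained
    # layers; the learned encoders are then stacked into one deep network.
    import torch
    import torch.nn as nn

    def pretrain_stack(data, layer_sizes, epochs=20, lr=0.1):
        encoders, codes = [], data
        for d_in, d_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
            enc = nn.Sequential(nn.Linear(d_in, d_hid), nn.Sigmoid())
            dec = nn.Sequential(nn.Linear(d_hid, d_in), nn.Sigmoid())
            opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
            for _ in range(epochs):
                opt.zero_grad()
                loss = nn.functional.mse_loss(dec(enc(codes)), codes)
                loss.backward()
                opt.step()
            encoders.append(enc)            # keep the trained encoder
            codes = enc(codes).detach()     # its codes feed the next layer
        return nn.Sequential(*encoders)     # stacked encoders initialize a deep net

    # usage with illustrative sizes: 500 random 64-d inputs, hidden layers of 32 and 16
    deep_net = pretrain_stack(torch.rand(500, 64), [64, 32, 16])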

Deep learning has been successfully applied to unsupervised feature learning not only for a single modality, but also for multiple modalities [21], [22], [23], [24]. However, these approaches learn deep networks with only two modalities, e.g., image-text or audio-image pairs. This paper explores useful feature representations by fusing the three modalities of video into a joint representation that reflects the intrinsic semantics of the video data. Specifically, we first build one Stacked Contractive Autoencoder (SCAE) for each modality, whose outputs are joined together and fed into a Multimodal Stacked Contractive Autoencoder (MSCAE). The Contractive Autoencoder (CAE) is a representation-learning algorithm that captures local manifold structure and has the potential for non-local generalization [25]. It is therefore well suited to multimedia data, such as video and images, which intrinsically lie on low-dimensional manifolds. The proposed method has two stages: the first stage preserves intra-modality semantic relations and the second stage discovers inter-modality semantic correlations. Compared with existing supervised learners, our method requires a minimal amount of prior knowledge about the training data.
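
For concreteness, the contractive penalty of a CAE [25] can be sketched as follows: for a sigmoid encoder h = s(Wx + b), the squared Frobenius norm of the Jacobian of h with respect to x has the closed form sum_j (h_j(1 - h_j))^2 * sum_i W_ji^2, which is added to the reconstruction error. The tensor shapes, the weight lam and the helper name contractive_loss in this PyTorch sketch are illustrative assumptions.

    import torch

    def contractive_loss(x, W, b, W_dec, b_dec, lam=0.1):
        # encoder and decoder with sigmoid activations
        h = torch.sigmoid(x @ W.t() + b)            # (batch, d_h)
        r = torch.sigmoid(h @ W_dec.t() + b_dec)    # (batch, d_x), reconstruction
        recon = ((r - x) ** 2).sum(dim=1).mean()    # reconstruction error
        # closed-form squared Frobenius norm of the encoder Jacobian
        dh = (h * (1.0 - h)) ** 2                   # (batch, d_h)
        w_sq = (W ** 2).sum(dim=1)                  # (d_h,)
        contraction = (dh * w_sq).sum(dim=1).mean()
        return recon + lam * contraction

    # usage with illustrative shapes: a batch of 8 inputs of dimension 64, 32 hidden units
    x = torch.rand(8, 64)
    W = torch.randn(32, 64, requires_grad=True); b = torch.zeros(32, requires_grad=True)
    W_dec = torch.randn(64, 32, requires_grad=True); b_dec = torch.zeros(64, requires_grad=True)
    contractive_loss(x, W, b, W_dec, b_dec).backward()   # gradients for SGD updates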

Moreover, the deep architecture used in the proposed algorithm has advantages over shallow methods for video semantic analysis. For example, [26] used one hidden layer to learn an intermediate representation from video features. Such single-layer learning is limited, whereas a deep network is able to discover more abstract, higher-level semantic features from multimedia data. In addition, the authors of [27], [28] have studied the fusion and adaptation of multiple features for Multimedia Event Detection (MED), but their work focuses mostly on various image features, ignoring the other modalities in video.

The remainder of this paper is organized as follows. We first provide some background to build our model in Section 2. The proposed framework is introduced in Section 3. Section 4 reports the experimental analysis. Finally, we summarize the conclusion and future work in Section 5.

Section snippets

Autoencoder

An autoencoder is a special neural network consisting of three layers: the input layer, the hidden layer, and the reconstruction layer, which sets the target values to be equal to the input (as shown in Fig. 1). It is composed of two parts: (1) Encoder: a deterministic mapping $f$ that transforms an input $x \in \mathbb{R}^{d_x}$ into a hidden representation $y \in \mathbb{R}^{d_h}$: $y = f(x) = s_f(Wx + b_h)$. (2) Decoder: the resulting hidden representation $y$ is then mapped back to a reconstruction $r \in \mathbb{R}^{d_x}$ in input space by another mapping function
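
The snippet above cuts off before the decoder equation; as a rough numpy illustration, the sketch below evaluates the encoder mapping $y = s_f(Wx + b_h)$ together with a decoder of the standard form $r = s_g(W'y + b_r)$, where the tied weights $W' = W^T$, the sigmoid activations and the dimensions are assumptions rather than details taken from the paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    d_x, d_h = 64, 32                            # illustrative dimensions
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(d_h, d_x))   # encoder weights
    b_h, b_r = np.zeros(d_h), np.zeros(d_x)

    x = rng.random(d_x)                          # one input vector
    y = sigmoid(W @ x + b_h)                     # encoder: y = s_f(W x + b_h)
    r = sigmoid(W.T @ y + b_r)                   # decoder: r = s_g(W' y + b_r), tied weights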

Methodology

In this section, we introduce a framework for multimodal video classification. The two-stage training algorithm learns a set of parameters such that the mapped latent features capture both intra-modal and inter-modal semantics well.
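
As a rough illustration of this two-stage scheme (not the paper's exact objective, which uses contractive penalties and the real TRECVID features), the PyTorch sketch below pre-trains one autoencoder per modality and then trains a joint autoencoder on the concatenated single-modal codes; all dimensions, epoch counts, the plain reconstruction loss and the helper name train_autoencoder are assumptions.

    import torch
    import torch.nn as nn

    def train_autoencoder(data, d_hidden, epochs=20, lr=0.1):
        d_in = data.shape[1]
        enc = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())
        dec = nn.Sequential(nn.Linear(d_hidden, d_in), nn.Sigmoid())
        opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(dec(enc(data)), data)
            loss.backward()
            opt.step()
        return enc

    # Stage 1: single-modal pre-training (random tensors stand in for real features)
    image, audio, text = torch.rand(200, 128), torch.rand(200, 40), torch.rand(200, 300)
    enc_img, enc_aud, enc_txt = (train_autoencoder(image, 64),
                                 train_autoencoder(audio, 20),
                                 train_autoencoder(text, 100))

    # Stage 2: multimodal fine-tuning on the concatenated single-modal codes
    joint = torch.cat([enc_img(image), enc_aud(audio), enc_txt(text)], dim=1).detach()
    enc_joint = train_autoencoder(joint, 64)     # shared multimodal representation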

Dataset

The experimental data are mainly based on the collection of the TREC Video Retrieval Evaluation (TRECVID) [29] provided by the National Institute of Standards and Technology (NIST). We use the TRECVID 2005 video dataset, which is composed of about 168 hours of multilingual digital video captured from LBC (Arabic), CCTV4, NTDTV (Chinese), and CNN, NBC, MSNBC (English). To avoid handling multiple languages, we select only the videos broadcast in English. We then partition the whole dataset into a training

Conclusion and future work

Multimodal integration plays an important role in video semantic classification. Based on this observation, we propose a two-stage learning framework with stacked contractive autoencoders. By considering both intra-modal and inter-modal semantics, we learn a set of effective SCAEs for feature mapping, from single-modal pre-training to multimodal fine-tuning. Experimental results show that our approach improves video classification accuracy compared to other deep and shallow models.

In the recent

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61100084, 61202197, 61303133), Zhejiang Province Department of Education Fund (No. Y201223321) and Zhejiang Provincial Natural Science Foundation of China (No. LQ13F020003).

References (34)

  • L. Zhang et al.

    A fine-grained image categorization system by cellet-encoded spatial pyramid modeling

    IEEE Trans. Ind. Electron.

    (2014)
  • L. Zhang et al.

    Recognizing architecture styles by hierarchical sparse coding of blocklets

    Inf. Sci.

    (2014)
  • L. Zhang et al.

    Fast multi-view segment graph kernel for object classification

    Signal Process.

    (2013)
  • M. Wang et al.

    Beyond distance measurement: constructing neighborhood similarity for video annotation

    IEEE Trans. Multimed.

    (2009)
  • M. Wang et al.

    Assistive tagging: a survey of multimedia tagging with human–computer joint exploration

    ACM Comput. Surv.

    (2012)
  • G. Li et al.

    In-video product annotation with web information mining

    ACM Trans. Multimed. Comput. Commun. Appl.

    (2012)
  • Y. Yang et al.

    Multi-feature fusion via hierarchical regression for multimedia analysis

    IEEE Trans. Multimed.

    (2013)
  • M. Wang et al.

    Unified video annotation via multigraph learning

    IEEE Trans. Circuits Syst. Video Technol.

    (2009)
  • M. Bronstein, A. Bronstein, F. Michel, N. Paragios, Data fusion through cross-modality metric learning using...
  • Y. Zhuang et al.

    Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval

    IEEE Trans. Multimed.

    (2008)
  • L. Zhang et al.

    Feature correlation hypergraph: exploiting high-order potentials for multimodal recognition

    IEEE Trans. Cybern.

    (2014)
  • M. Wang et al.

    Multimodal graph-based reranking for web image search

    IEEE Trans. Image Process.

    (2012)
  • L. Zhang et al.

    Fusion of multichannel local and global structural cues for photo aesthetics evaluation

    IEEE Trans. Image Process.

    (2014)
  • Y. Xia et al.

    Parallelized fusion on multisensor transportation data: a case study in CyberITS

    Int. J. Intell. Syst.

    (2013)
  • Y. Xia, J. Hu, M.D. Fontaine, A cyber-its framework for massive traffic data analysis using cyber infrastructure, Sci....
  • L. Zhang et al.

    Discovering discriminative graphlets for aerial image categories recognition

    IEEE Trans. Image Process.

    (2013)
  • G. E. Hinton, S. Osindero, A fast learning algorithm for deep belief nets, Neural Comput. 18 (2006)...