Elsevier

Signal Processing

Volume 120, March 2016, Pages 761-766

Multimodal video classification with stacked contractive autoencoders

https://doi.org/10.1016/j.sigpro.2015.01.001

Highlights

  • A two-stage framework for multimodal video classification is proposed.

  • The model is built based on stacked contractive autoencoders.

  • The first stage is single modal pre-training.

  • The second stage is multimodal fine-tuning.

  • The objective functions are optimized by stochastic gradient descent.

Abstract

In this paper we propose a multimodal feature learning mechanism based on deep networks (i.e., stacked contractive autoencoders) for video classification. Considering the three modalities in video, i.e., image, audio and text, we first build one Stacked Contractive Autoencoder (SCAE) for each modality, whose outputs are then joined together and fed into a Multimodal Stacked Contractive Autoencoder (MSCAE). The first stage preserves intra-modality semantic relations and the second stage discovers inter-modality semantic correlations. Experiments on a real-world dataset demonstrate that the proposed approach achieves better performance than state-of-the-art methods.

Introduction

With the rapid progress of storage devices, the Internet, and social networks, a huge amount of video data is being generated. How to index and search these videos effectively is an increasingly active research issue in the multimedia community. To bridge the semantic gap between low-level features and high-level semantics, automatic video annotation and classification have emerged as important techniques for efficient video retrieval [1], [2], [3].

Typical approaches to video annotation and classification apply machine learning methods using only image features extracted from the keyframes of video clips. In fact, video consists of three modalities, namely image, audio and text. Keyframe image features express only the visual aspect, whereas auditory and textual features are equally important for understanding video semantics. A great deal of research has therefore focused on utilizing multimodal features for a better understanding of video semantics [4], [5]. Thus, multimodal integration in video may compensate for the limitations of learning from any single modality.

There are also many other multimodal learning strategies. One group focuses on multi-modal or cross-modal retrieval, which learns to map high-dimensional heterogeneous features into a common low-dimensional latent space [6], [7], [8], [9]. Another group is composed of graph-based models, which generate geometric descriptors from multi-channel or multi-sensor data to improve image or video analysis [10], [11], [12], [13], [14], [15], [16]. However, these methods are discriminative and rely on a supervised setting, which requires a large amount of labeled data and wastes abundant unlabeled data. Collecting labeled data is time-consuming and labor intensive. Thus, discovering, with unsupervised learning alone, data representations that make it easier to extract useful information when building classifiers has become a major challenge.

Recently, deep learning methods have attracted tremendous interest from researchers. The breakthrough in deep learning was initiated by Hinton and quickly followed up in the same year [17], [18], [19], with many more works to follow. A central idea, referred to as greedy layerwise unsupervised pre-training, is to learn a hierarchy of features one level at a time, using unsupervised feature learning to learn a new transformation at each level that is composed with the previously learned transformations; essentially, each iteration of unsupervised feature learning adds one layer of weights to a deep neural network. Finally, the set of layers can be combined to initialize a deep supervised predictor [20]. Methods that have been considered include Deep Belief Networks (DBNs) [17] with Restricted Boltzmann Machines (RBMs), autoencoders [18], Convolutional Neural Networks (CNNs) [19], and so on.
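
As background for the model in this paper, the following is a minimal sketch of greedy layer-wise pre-training with plain autoencoders, written in PyTorch; the layer sizes, learning rate, epoch count and the helper name pretrain_stack are illustrative assumptions rather than details from the paper.

    # Greedy layer-wise unsupervised pre-training (sketch): each layer is an
    # autoencoder trained to reconstruct the codes of the previously trained
    # layers; the learned encoders are then stacked into one deep network.
    import torch
    import torch.nn as nn

    def pretrain_stack(data, layer_sizes, epochs=20, lr=0.1):
        encoders, codes = [], data
        for d_in, d_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
            enc = nn.Sequential(nn.Linear(d_in, d_hid), nn.Sigmoid())
            dec = nn.Sequential(nn.Linear(d_hid, d_in), nn.Sigmoid())
            opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
            for _ in range(epochs):
                opt.zero_grad()
                loss = nn.functional.mse_loss(dec(enc(codes)), codes)
                loss.backward()
                opt.step()
            encoders.append(enc)            # keep the trained encoder
            codes = enc(codes).detach()     # its codes feed the next layer
        return nn.Sequential(*encoders)     # stacked encoders initialize a deep net

    # usage with illustrative sizes: 500 random 64-d inputs, hidden layers of 32 and 16
    deep_net = pretrain_stack(torch.rand(500, 64), [64, 32, 16])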

Deep learning has been successfully applied to unsupervised feature learning not only for a single modality, but also for multiple modalities [21], [22], [23], [24]. However, these approaches learn deep networks with only two modalities, e.g., image-text or audio-image pairs. This paper explores useful feature representations by fusing the three modalities of video into a joint representation that reflects the intrinsic semantics of the video data. Specifically, we first build one Stacked Contractive Autoencoder (SCAE) for each modality, whose outputs are joined together and fed into a Multimodal Stacked Contractive Autoencoder (MSCAE). The Contractive Autoencoder (CAE) is a representation-learning algorithm that captures local manifold structure and has the potential for non-local generalization [25]. It is therefore well suited to multimedia data, such as video and images, which intrinsically lie on low-dimensional manifolds. The proposed method has two stages: the first stage preserves intra-modality semantic relations and the second stage discovers inter-modality semantic correlations. Compared with existing supervised learners, our method requires a minimal amount of prior knowledge about the training data.
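
For concreteness, the contractive penalty of a CAE [25] can be sketched as follows: for a sigmoid encoder h = s(Wx + b), the squared Frobenius norm of the Jacobian of h with respect to x has the closed form sum_j (h_j(1 - h_j))^2 * sum_i W_ji^2, which is added to the reconstruction error. The tensor shapes, the weight lam and the helper name contractive_loss in this PyTorch sketch are illustrative assumptions.

    import torch

    def contractive_loss(x, W, b, W_dec, b_dec, lam=0.1):
        # encoder and decoder with sigmoid activations
        h = torch.sigmoid(x @ W.t() + b)            # (batch, d_h)
        r = torch.sigmoid(h @ W_dec.t() + b_dec)    # (batch, d_x), reconstruction
        recon = ((r - x) ** 2).sum(dim=1).mean()    # reconstruction error
        # closed-form squared Frobenius norm of the encoder Jacobian
        dh = (h * (1.0 - h)) ** 2                   # (batch, d_h)
        w_sq = (W ** 2).sum(dim=1)                  # (d_h,)
        contraction = (dh * w_sq).sum(dim=1).mean()
        return recon + lam * contraction

    # usage with illustrative shapes: a batch of 8 inputs of dimension 64, 32 hidden units
    x = torch.rand(8, 64)
    W = torch.randn(32, 64, requires_grad=True); b = torch.zeros(32, requires_grad=True)
    W_dec = torch.randn(64, 32, requires_grad=True); b_dec = torch.zeros(64, requires_grad=True)
    contractive_loss(x, W, b, W_dec, b_dec).backward()   # gradients for SGD updates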

Moreover, the deep architecture used in the proposed algorithm has advantages over shallow methods for video semantic analysis. For example, [26] used one hidden layer to learn an intermediate representation from video features. Such single-layer learning is limited, whereas a deep network is able to discover more abstract, higher-level semantic features from multimedia data. In addition, the authors of [27], [28] have studied the fusion and adaptation of multiple features for Multimedia Event Detection (MED), but their work focuses mostly on various image features, ignoring the other modalities in video.

The remainder of this paper is organized as follows. We first provide some background to build our model in Section 2. The proposed framework is introduced in Section 3. Section 4 reports the experimental analysis. Finally, we summarize the conclusion and future work in Section 5.

Section snippets

Autoencoder

An autoencoder is a special neural network consisting of three layers: the input layer, the hidden layer, and the reconstruction layer, which sets the target values to be equal to the input (as shown in Fig. 1). It is composed of two parts: (1) Encoder: a deterministic mapping $f$ that transforms an input $x \in \mathbb{R}^{d_x}$ into a hidden representation $y \in \mathbb{R}^{d_h}$: $y = f(x) = s_f(Wx + b_h)$. (2) Decoder: the resulting hidden representation $y$ is then mapped back to a reconstruction $r \in \mathbb{R}^{d_x}$ in input space by another mapping function
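
The snippet above cuts off before the decoder equation; as a rough numpy illustration, the sketch below evaluates the encoder mapping $y = s_f(Wx + b_h)$ together with a decoder of the standard form $r = s_g(W'y + b_r)$, where the tied weights $W' = W^T$, the sigmoid activations and the dimensions are assumptions rather than details taken from the paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    d_x, d_h = 64, 32                            # illustrative dimensions
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(d_h, d_x))   # encoder weights
    b_h, b_r = np.zeros(d_h), np.zeros(d_x)

    x = rng.random(d_x)                          # one input vector
    y = sigmoid(W @ x + b_h)                     # encoder: y = s_f(W x + b_h)
    r = sigmoid(W.T @ y + b_r)                   # decoder: r = s_g(W' y + b_r), tied weights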

Methodology

In this section, we introduce a framework for multimodal video classification. The two-stage training algorithm learns a set of parameters such that the mapped latent features capture both intra-modal and inter-modal semantics well.
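
As a rough illustration of this two-stage scheme (not the paper's exact objective, which uses contractive penalties and the real TRECVID features), the PyTorch sketch below pre-trains one autoencoder per modality and then trains a joint autoencoder on the concatenated single-modal codes; all dimensions, epoch counts, the plain reconstruction loss and the helper name train_autoencoder are assumptions.

    import torch
    import torch.nn as nn

    def train_autoencoder(data, d_hidden, epochs=20, lr=0.1):
        d_in = data.shape[1]
        enc = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())
        dec = nn.Sequential(nn.Linear(d_hidden, d_in), nn.Sigmoid())
        opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(dec(enc(data)), data)
            loss.backward()
            opt.step()
        return enc

    # Stage 1: single-modal pre-training (random tensors stand in for real features)
    image, audio, text = torch.rand(200, 128), torch.rand(200, 40), torch.rand(200, 300)
    enc_img, enc_aud, enc_txt = (train_autoencoder(image, 64),
                                 train_autoencoder(audio, 20),
                                 train_autoencoder(text, 100))

    # Stage 2: multimodal fine-tuning on the concatenated single-modal codes
    joint = torch.cat([enc_img(image), enc_aud(audio), enc_txt(text)], dim=1).detach()
    enc_joint = train_autoencoder(joint, 64)     # shared multimodal representation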

Dataset

The experimental data are mainly based on the collection of the TREC Video Retrieval Evaluation (TRECVID) [29] provided by the National Institute of Standards and Technology (NIST). We use the TRECVID 2005 video dataset, which is composed of about 168 hours of multilingual digital video captured from LBC (Arabic), CCTV4, NTDTV (Chinese), and CNN, NBC, MSNBC (English). To avoid handling multiple languages, we select only the videos broadcast in English. We then partition the whole dataset into a training

Conclusion and future work

Multimodal integration plays an important role in video semantic classification. Based on this observation, we propose a two-stage learning framework with stacked contractive autoencoders. By considering both intra-modal and inter-modal semantics, we learn a set of effective SCAEs for feature mapping, from single-modal pre-training to multimodal fine-tuning. Experimental results show that our approach improves video classification accuracy compared to other deep and shallow models.

In the recent

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61100084, 61202197, 61303133), Zhejiang Province Department of Education Fund (No. Y201223321) and Zhejiang Provincial Natural Science Foundation of China (No. LQ13F020003).

References (34)

  • L. Zhang et al.

    A fine-grained image categorization system by cellet-encoded spatial pyramid modeling

    IEEE Trans. Ind. Electron.

    (2014)
  • L. Zhang et al.

    Recognizing architecture styles by hierarchical sparse coding of blocklets

    Inf. Sci.

    (2014)
  • L. Zhang et al.

    Fast multi-view segment graph kernel for object classification

    Signal Process.

    (2013)
  • M. Wang et al.

    Beyond distance measurement: constructing neighborhood similarity for video annotation

    IEEE Trans. Multimed.

    (2009)
  • M. Wang et al.

    Assistive tagging: a survey of multimedia tagging with human–computer joint exploration

    ACM Comput. Surv.

    (2012)
  • G. Li et al.

    In-video product annotation with web information mining

    ACM Trans. Multimed. Comput. Commun. Appl.

    (2012)
  • Y. Yang et al.

    Multi-feature fusion via hierarchical regression for multimedia analysis

    IEEE Trans. Multimed.

    (2013)
  • M. Wang et al.

    Unified video annotation via multigraph learning

    IEEE Trans. Circuits Syst. Video Technol.

    (2009)
  • M. Bronstein, A. Bronstein, F. Michel, N. Paragios, Data fusion through cross-modality metric learning using...
  • Y. Zhuang et al.

    Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval

    IEEE Trans. Multimed.

    (2008)
  • L. Zhang et al.

    Feature correlation hypergraph: exploiting high-order potentials for multimodal recognition

    IEEE Trans. Cybern.

    (2014)
  • M. Wang et al.

    Multimodal graph-based reranking for web image search

    IEEE Trans. Image Process.

    (2012)
  • L. Zhang et al.

    Fusion of multichannel local and global structural cues for photo aesthetics evaluation

    IEEE Trans. Image Process.

    (2014)
  • Y. Xia et al.

    Parallelized fusion on multisensor transportation data: a case study in CyberITS

    Int. J. Intell. Syst.

    (2013)
  • Y. Xia, J. Hu, M.D. Fontaine, A cyber-its framework for massive traffic data analysis using cyber infrastructure, Sci....
  • L. Zhang et al.

    Discovering discriminative graphlets for aerial image categories recognition

    IEEE Trans. Image Process.

    (2013)
  • G. E. Hinton, S. Osindero, A fast learning algorithm for deep belief nets, Neural Comput. 18 (2006)...