ABSTRACT
This paper studies deep network architectures for video classification. We propose a multi-stream framework that fully exploits the rich multimodal information in videos. Specifically, we first train three Convolutional Neural Networks to model spatial, short-term motion, and audio cues, respectively. Long Short-Term Memory networks are then adopted to capture long-term temporal dynamics. Using the per-class outputs of the individual streams, we mine the class relationships hidden in the trained models. The automatically discovered relationships then serve as a prior in the multi-stream multi-class fusion process, indicating which and how much information is needed from the remaining classes, so that the optimal fusion weights for the final scores of each class can be determined adaptively. Our contributions are two-fold. First, the multi-stream framework exploits multimodal features that are more comprehensive than those previously attempted. Second, the proposed fusion method not only learns the best weights of the multiple network streams for each class, but also takes class relationships into account, a cue known to be helpful in multi-class visual classification tasks. Our framework produces significantly better results than the state of the art on two popular benchmarks: 92.2% on UCF-101 (without using audio) and 84.9% on Columbia Consumer Videos.
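The fusion idea described above can be sketched in code. The snippet below is a minimal illustration, not the paper's actual formulation: it assumes per-class stream weights are already learned and represents the class-relationship prior as a hypothetical row-stochastic matrix that mixes each class's fused score with evidence borrowed from related classes. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def fuse_streams(stream_scores, weights, class_rel):
    """Fuse per-stream class scores with per-class stream weights and a
    class-relationship prior (illustrative sketch, not the paper's method).

    stream_scores: (S, C) array, one score vector per stream.
    weights:       (S, C) array, weight of stream s for class c
                   (each column assumed to sum to 1).
    class_rel:     (C, C) row-stochastic matrix; entry (c, c') says how
                   much class c borrows from class c'.
    Returns a (C,) fused score vector.
    """
    # Per-class weighted combination of the streams.
    fused = (stream_scores * weights).sum(axis=0)  # shape (C,)
    # Mix in related classes' evidence via the relationship prior.
    return class_rel @ fused
```

With an identity `class_rel` this reduces to plain class-wise weighted late fusion; off-diagonal entries let a class's final score draw on correlated classes, which is the role the discovered relationships play in the proposed fusion.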