ABSTRACT
This paper studies deep network architectures for video classification. We propose a multi-stream framework that fully exploits the rich multimodal information in videos. Specifically, we first train three Convolutional Neural Networks to model spatial, short-term motion, and audio cues, respectively. Long Short-Term Memory networks are then adopted to capture long-term temporal dynamics. Using the per-class outputs of the individual streams, we mine the class relationships hidden in the trained models. The automatically discovered relationships then serve as a prior in the multi-stream multi-class fusion process, indicating which and how much information is needed from the remaining classes, so that the optimal fusion weights for the final scores of each class can be determined adaptively. Our contributions are two-fold. First, the multi-stream framework exploits multimodal features that are more comprehensive than those previously attempted. Second, the proposed fusion method not only learns the best weights of the multiple network streams for each class, but also takes class relationships into account, a cue known to be helpful in multi-class visual classification tasks. Our framework produces significantly better results than the state of the art on two popular benchmarks: 92.2% on UCF-101 (without using audio) and 84.9% on Columbia Consumer Videos.
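The fusion idea described above can be sketched in code. The snippet below is a minimal illustration, not the paper's actual formulation: it assumes per-class stream weights are already learned and represents the class-relationship prior as a hypothetical row-stochastic matrix that mixes each class's fused score with evidence borrowed from related classes. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def fuse_streams(stream_scores, weights, class_rel):
    """Fuse per-stream class scores with per-class stream weights and a
    class-relationship prior (illustrative sketch, not the paper's method).

    stream_scores: (S, C) array, one score vector per stream.
    weights:       (S, C) array, weight of stream s for class c
                   (each column assumed to sum to 1).
    class_rel:     (C, C) row-stochastic matrix; entry (c, c') says how
                   much class c borrows from class c'.
    Returns a (C,) fused score vector.
    """
    # Per-class weighted combination of the streams.
    fused = (stream_scores * weights).sum(axis=0)  # shape (C,)
    # Mix in related classes' evidence via the relationship prior.
    return class_rel @ fused
```

With an identity `class_rel` this reduces to plain class-wise weighted late fusion; off-diagonal entries let a class's final score draw on correlated classes, which is the role the discovered relationships play in the proposed fusion.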