DOI: 10.1145/2964284.2964328

Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification

Published: 01 October 2016

ABSTRACT

This paper studies deep network architectures to address the problem of video classification. A multi-stream framework is proposed to fully utilize the rich multimodal information in videos. Specifically, we first train three Convolutional Neural Networks to model spatial, short-term motion, and audio clues, respectively. Long Short-Term Memory (LSTM) networks are then adopted to explore long-term temporal dynamics. Given the outputs of the individual streams on multiple classes, we propose to mine the class relationships hidden in the data from the trained models. The automatically discovered relationships are then leveraged as a prior in the multi-stream multi-class fusion process, indicating which and how much information is needed from the remaining classes, in order to adaptively determine the optimal fusion weights for generating the final score of each class. Our contributions are two-fold. First, the multi-stream framework is able to exploit multimodal features that are more comprehensive than those previously attempted. Second, our proposed fusion method not only learns the best weights of the multiple network streams for each class, but also takes class relationships into account, which are known to be a helpful clue in multi-class visual classification tasks. Our framework produces significantly better results than the state of the art on two popular benchmarks: 92.2% on UCF-101 (without using audio) and 84.9% on Columbia Consumer Videos.
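
The abstract only sketches the fusion idea at a high level. As a rough illustration (not the authors' implementation), the Python snippet below shows one way per-class stream weights and a class-relationship matrix could be combined at the score level: each stream's per-class scores are mixed with class-specific weights, and the result is then redistributed across classes according to their relationships. All variable names, shapes, and the uniform/identity initializations are hypothetical placeholders; in the paper both the weights and the class relationships are learned from the trained models rather than fixed.

    import numpy as np

    # Hypothetical setup: S streams (e.g., spatial, motion, audio),
    # N videos, C classes. stream_scores stands in for each stream's
    # per-class outputs (e.g., softmax scores).
    S, N, C = 3, 4, 5
    rng = np.random.default_rng(0)
    stream_scores = rng.random((S, N, C))

    # Per-stream, per-class fusion weights. Uniform here; in the paper
    # these are learned, with discovered class relationships as a prior.
    W = np.full((S, C), 1.0 / S)

    # Class-relationship matrix R (C x C): R[j, k] controls how much class
    # k's fused score contributes to class j. Identity = ignore other classes.
    R = np.eye(C)

    def fuse_scores(stream_scores, W, R):
        """Class-specific weighted fusion of streams, then class-relationship mixing."""
        # fused[n, c] = sum_s W[s, c] * stream_scores[s, n, c]
        fused = np.einsum('snc,sc->nc', stream_scores, W)
        # final[n, j] = sum_k R[j, k] * fused[n, k]
        return fused @ R.T

    final = fuse_scores(stream_scores, W, R)
    print(final.shape)           # (4, 5)
    print(final.argmax(axis=1))  # predicted class index per video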

    • Published in

      MM '16: Proceedings of the 24th ACM international conference on Multimedia
      October 2016
      1542 pages
      ISBN: 9781450336031
      DOI: 10.1145/2964284

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 October 2016


      Qualifiers

      • research-article

      Acceptance Rates

      MM '16 Paper Acceptance Rate: 52 of 237 submissions, 22%. Overall Acceptance Rate: 995 of 4,171 submissions, 24%.
