DOI: 10.1145/3474085.3475298

Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Published: 17 October 2021

ABSTRACT

Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video using only video-level categorical supervision. Previous works use the appearance and motion features extracted from a pre-trained feature encoder directly, e.g., via feature concatenation or score-level fusion. In this work, we argue that the features extracted from pre-trained extractors, e.g., I3D, are trained for trimmed-video action classification rather than tailored to the WS-TAL task, leading to inevitable redundancy and sub-optimal performance. Therefore, feature re-calibration is needed to reduce the task-irrelevant information redundancy. Here, we propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we introduce two identical cross-modal consensus modules (CCMs), each of which designs a cross-modal attention mechanism to filter out task-irrelevant redundancy using the global information from the main modality and the cross-modal local information from the auxiliary modality. Moreover, we further explore inter-modality consistency: we treat the attention weights derived from each CCM as pseudo targets for the attention weights derived from the other CCM, maintaining consistency between the predictions of the two CCMs and forming a mutual learning scheme. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, where our method achieves state-of-the-art results. The experimental results show that the proposed cross-modal consensus module produces more representative features for temporal action localization.
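For concreteness, below is a minimal PyTorch sketch of the two ideas the abstract describes: a CCM-style gating that re-calibrates one modality's features using global context from the main modality and local context from the auxiliary modality, and a mutual learning consistency term between the two modules' attention weights. The class and function names, the linear projections, the sigmoid gating, and the MSE form of the consistency loss are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalConsensusModule(nn.Module):
    """Illustrative cross-modal consensus module (CCM) sketch.

    Re-calibrates the main modality's snippet features with a channel-wise
    attention computed from (a) the main modality's global context and
    (b) the auxiliary modality's local, per-snippet features.
    """

    def __init__(self, dim=1024):
        super().__init__()
        self.global_proj = nn.Linear(dim, dim)  # global descriptor of the main modality
        self.local_proj = nn.Linear(dim, dim)   # local descriptor of the auxiliary modality

    def forward(self, main, aux):
        # main, aux: (B, T, D) snippet-level features, e.g. RGB and optical flow.
        global_ctx = self.global_proj(main.mean(dim=1, keepdim=True))  # (B, 1, D)
        local_ctx = self.local_proj(aux)                               # (B, T, D)
        # Sigmoid gate in [0, 1]: low values suppress task-irrelevant channels.
        attn = torch.sigmoid(global_ctx * local_ctx)                   # (B, T, D)
        return main * attn, attn


def mutual_learning_loss(attn_a, attn_b):
    """Inter-modality consistency: each CCM's attention serves as a detached
    pseudo target for the other, so each side learns from the other's output."""
    return F.mse_loss(attn_a, attn_b.detach()) + F.mse_loss(attn_b, attn_a.detach())


# Usage with two identical CCMs, one per modality pairing (shapes assumed):
rgb, flow = torch.randn(2, 100, 1024), torch.randn(2, 100, 1024)
ccm_rgb, ccm_flow = CrossModalConsensusModule(), CrossModalConsensusModule()
rgb_recal, attn_rgb = ccm_rgb(rgb, flow)     # main: RGB, auxiliary: flow
flow_recal, attn_flow = ccm_flow(flow, rgb)  # main: flow, auxiliary: RGB
consistency = mutual_learning_loss(attn_rgb, attn_flow)
```

Under these assumptions, the re-calibrated features (rgb_recal, flow_recal) would feed the downstream localization head, and the consistency term would be added to the video-level classification loss.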


Supplemental Material

ACMMM2021.mp4 (MP4, 210 MB)

References

  1. Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. 2020. Self-supervised learning of audio-visual objects from video. arXiv (2020).
  2. Humam Alwassel, Silvio Giancola, and Bernard Ghanem. 2020. TSP: Temporally-sensitive pretraining of video encoders for localization tasks. arXiv preprint arXiv:2011.11479 (2020).
  3. Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR.
  4. Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A. Ross, Jia Deng, and Rahul Sukthankar. 2018. Rethinking the Faster R-CNN architecture for temporal action localization. In CVPR.
  5. Junsuk Choe and Hyunjung Shim. 2019. Attention-based dropout layer for weakly supervised object localization. In CVPR.
  6. Cheng Deng, Zhaojia Chen, Xianglong Liu, Xinbo Gao, and Dacheng Tao. 2018. Triplet-based deep hashing network for cross-modal retrieval. TIP (2018).
  7. Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR.
  8. Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. 2021. MIST: Multiple instance self-training framework for video anomaly detection. In CVPR.
  9. Guoqiang Gong, Xinghan Wang, Yadong Mu, and Qi Tian. 2020. Learning temporal co-attention models for unsupervised video action localization. In CVPR.
  10. Fa-Ting Hong, Xuanteng Huang, Wei-Hong Li, and Wei-Shi Zheng. 2020. MINI-Net: Multiple instance ranking network for video highlight detection. In ECCV.
  11. Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In CVPR.
  12. Ashraful Islam, Chengjiang Long, and Richard J. Radke. 2021. A hybrid attention mechanism for weakly-supervised temporal action localization. arXiv (2021).
  13. Ashraful Islam and Richard Radke. 2020. Weakly supervised temporal action localization using deep metric learning. In WACV.
  14. Mihir Jain, Amir Ghodrati, and Cees G. M. Snoek. 2020. ActionBytes: Learning from trimmed videos to localize actions. In CVPR.
  15. Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. 2014. THUMOS Challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/.
  16. Ya Jing, Wei Wang, Liang Wang, and Tieniu Tan. 2020. Cross-modal cross-domain moment alignment network for person search. In CVPR.
  17. Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The Kinetics human action video dataset. arXiv (2017).
  18. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv (2014).
  19. Pilhyeon Lee, Youngjung Uh, and Hyeran Byun. 2020. Background suppression network for weakly-supervised temporal action localization. In AAAI.
  20. Pilhyeon Lee, Jinglu Wang, Yan Lu, and Hyeran Byun. 2021. Weakly-supervised temporal action localization by uncertainty modeling. arXiv (2021).
  21. Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, and Jingjing Liu. 2021. Less is more: ClipBERT for video-and-language learning via sparse sampling. In CVPR, 7331--7341.
  22. Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. 2018. BSN: Boundary sensitive network for temporal action proposal generation. In ECCV.
  23. Daochang Liu, Tingting Jiang, and Yizhou Wang. 2019. Completeness modeling and context separation for weakly supervised temporal action localization. In CVPR.
  24. Ziyi Liu, Le Wang, Qilin Zhang, Wei Tang, Junsong Yuan, Nanning Zheng, and Gang Hua. 2021. ACSNet: Action-context separation network for weakly supervised temporal action localization. In AAAI.
  25. Zhekun Luo, Devin Guillory, Baifeng Shi, Wei Ke, Fang Wan, Trevor Darrell, and Huijuan Xu. 2020. Weakly-supervised action localization with expectation-maximization multi-instance learning. arXiv (2020).
  26. Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, and Zheng Shou. 2020. SF-Net: Single-frame supervision for temporal action localization. In ECCV.
  27. Kyle Min and Jason J. Corso. 2020. Adversarial background-aware loss for weakly-supervised temporal activity localization. In ECCV.
  28. Jonathan Munro and Dima Damen. 2020. Multi-modal domain adaptation for fine-grained action recognition. In CVPR.
  29. Sanath Narayan, Hisham Cholakkal, Fahad Shahbaz Khan, and Ling Shao. 2019. 3C-Net: Category count and center loss for weakly-supervised action localization. In ICCV.
  30. Megha Nawhal and Greg Mori. 2021. Activity graph transformer for temporal action localization. arXiv (2021).
  31. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal deep learning. In ICML.
  32. Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. 2018. Weakly supervised action localization by sparse temporal pooling network. In CVPR.
  33. Alejandro Pardo, Humam Alwassel, Fabian Caba, Ali Thabet, and Bernard Ghanem. 2021. RefineLoc: Iterative refinement for weakly-supervised action localization. In WACV.
  34. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv (2019).
  35. Sujoy Paul, Sourya Roy, and Amit K. Roy-Chowdhury. 2018. W-TALC: Weakly-supervised temporal activity localization and classification. In ECCV.
  36. Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, and Dahua Lin. 2020. A local-to-global approach to multi-modal movie scene segmentation. In CVPR.
  37. Baifeng Shi, Qi Dai, Yadong Mu, and Jingdong Wang. 2020. Weakly-supervised action localization by generative attention modeling. In CVPR.
  38. Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, and Shih-Fu Chang. 2018. AutoLoc: Weakly-supervised temporal action localization in untrimmed videos. In ECCV.
  39. Zheng Shou, Dongang Wang, and Shih-Fu Chang. 2016. Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR.
  40. Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world anomaly detection in surveillance videos. In CVPR, 6479--6488.
  41. Abhinav Valada, Rohit Mohan, and Wolfram Burgard. 2019. Self-supervised model adaptation for multimodal semantic segmentation. IJCV (2019).
  42. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv (2017).
  43. Daixin Wang, Peng Cui, Mingdong Ou, and Wenwu Zhu. 2015. Learning compact hash codes for multimodal representations using orthogonal deep structure. IEEE Transactions on Multimedia (2015).
  44. Zhenzhen Wang, Weixiang Hong, Yap-Peng Tan, and Junsong Yuan. 2019. Pruning 3D filters for accelerating 3D ConvNets. IEEE Transactions on Multimedia (2019).
  45. Dan Xu, Wanli Ouyang, Elisa Ricci, Xiaogang Wang, and Nicu Sebe. 2017. Learning cross-modal deep representations for robust pedestrian detection. In CVPR.
  46. Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. 2018. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In CVPR.
  47. Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. 2015. Learning deep representations of appearance and motion for anomalous event detection. In BMVC.
  48. Mengmeng Xu, Juan-Manuel Pérez-Rúa, Victor Escorcia, Brais Martinez, Xiatian Zhu, Li Zhang, Bernard Ghanem, and Tao Xiang. 2020. Boundary-sensitive pre-training for temporal localization in videos. arXiv preprint arXiv:2011.10830 (2020).
  49. Yunlu Xu, Chengwei Zhang, Zhanzhan Cheng, Jianwen Xie, Yi Niu, Shiliang Pu, and Fei Wu. 2019. Segregated temporal assembly recurrent networks for weakly supervised multiple action detection. In AAAI.
  50. Ling-An Zeng, Fa-Ting Hong, Wei-Shi Zheng, Qi-Zhi Yu, Wei Zeng, Yao-Wei Wang, and Jian-Huang Lai. 2020. Hybrid dynamic-static context-aware attention network for action assessment in long videos. In ACM MM.
  51. Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, and Chuang Gan. 2019. Graph convolutional networks for temporal action localization. In ICCV.
  52. Yuanhao Zhai, Le Wang, Wei Tang, Qilin Zhang, Junsong Yuan, and Gang Hua. 2020. Two-stream consensus network for weakly-supervised temporal action localization. In ECCV.
  53. Xiao-Yu Zhang, Haichao Shi, Changsheng Li, and Peng Li. 2020. Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos. In AAAI.
  54. Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. 2017. Temporal action detection with structured segment networks. In ICCV.

Published in

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021, 5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

      Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 17 October 2021


      Qualifiers

      • research-article

      Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%

