ABSTRACT
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video using only video-level categorical supervision. Previous works use the appearance and motion features extracted from a pre-trained feature encoder directly, e.g., via feature concatenation or score-level fusion. In this work, we argue that features extracted by pre-trained encoders such as I3D, which are trained for trimmed-video action classification rather than for the WS-TAL task, inevitably contain task-irrelevant redundancy and lead to sub-optimal results. Therefore, feature re-calibration is needed to reduce this task-irrelevant redundancy. To this end, we propose a cross-modal consensus network (CO2-Net). CO2-Net contains two identical cross-modal consensus modules (CCMs), each of which applies a cross-modal attention mechanism that filters out task-irrelevant redundancy using the global information from the main modality and the local information from the auxiliary modality. Moreover, we further exploit inter-modality consistency: the attention weights produced by each CCM serve as pseudo targets for the attention weights produced by the other CCM, keeping the two predictions consistent in a mutual learning manner. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, and achieve state-of-the-art results. The experimental results show that the proposed cross-modal consensus module produces more representative features for temporal action localization.
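To make the mechanism described above concrete, the following is a minimal numpy sketch of one cross-modal consensus module and the mutual-consistency term. It is a simplified illustration under our own assumptions, not the paper's exact architecture: the projection matrices `w_g`/`w_l`, the temporal average pooling, the element-wise sigmoid gate, and the symmetric MSE consistency term are all stand-ins for the learned components of the real model.

```python
import numpy as np

def ccm_attention(main, aux, w_g, w_l):
    """One cross-modal consensus module (CCM), sketched.

    main: (T, D) snippet features of the main modality (e.g. RGB)
    aux:  (T, D) snippet features of the auxiliary modality (e.g. optical flow)
    w_g, w_l: (D, D) hypothetical projections standing in for the
              module's learned layers.

    Returns the re-calibrated main-modality features and the
    attention weights, each entry in (0, 1).
    """
    # Global descriptor of the main modality (temporal average pooling).
    global_main = main.mean(axis=0)            # (D,)
    # Fuse it with the local, per-snippet auxiliary features.
    fused = aux @ w_l + global_main @ w_g      # (T, D), broadcast over T
    attn = 1.0 / (1.0 + np.exp(-fused))        # sigmoid gate in (0, 1)
    # Suppress task-irrelevant channels of the main modality.
    return main * attn, attn

def consistency_loss(attn_a, attn_b):
    """Mutual-learning term: each CCM's attention acts as the pseudo
    target of the other; a symmetric MSE stands in here."""
    return float(np.mean((attn_a - attn_b) ** 2))
```

In use, one CCM would take RGB as the main modality and flow as the auxiliary, the second CCM the reverse, and `consistency_loss` would be added to the training objective to couple the two attention maps.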
Index Terms
- Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization