Abstract
Cross-modal retrieval between texts and videos has attracted sustained research interest in the multimedia community. Existing studies follow a trend of learning a joint embedding space in which to measure the distance between text and video representations. In common practice, the video representation is constructed by feeding clips into 3D convolutional neural networks to extract coarse-grained global visual features. Several studies have also attempted to align the local objects in a video with the text. However, these representations share a common drawback: they neglect the rich fine-grained relation features that capture spatial-temporal object interactions, which help map textual entities in real-world retrieval systems. To tackle this problem, we propose the adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that adopts both fine-grained local relation features and coarse-grained global features to bridge the text and video modalities. Building on this visual representation, we further integrate an adversarial learning strategy into AME-Net to narrow the domain gap between text and video representations. In summary, we contribute AME-Net with an adversarial learning strategy for learning a better joint embedding space, and experimental results on the MSR-VTT and YouCook2 datasets demonstrate that our proposed framework consistently outperforms state-of-the-art methods.
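The abstract does not spell out how the adversarial alignment is implemented. Below is a minimal sketch of one common way to realize such a modality-alignment objective: a discriminator tries to tell text embeddings from video embeddings, while a gradient-reversal layer (in the spirit of unsupervised domain adaptation by backpropagation) trains the encoders to fool it. All class names, dimensions, and the loss-weighting parameter `lambd` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; negates and scales gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows back into the text/video encoders.
        return -ctx.lambd * grad_output, None

class ModalityDiscriminator(nn.Module):
    """Predicts whether an embedding came from the text or the video encoder."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(),
            nn.Linear(dim // 2, 2),  # two classes: text vs. video
        )

    def forward(self, emb, lambd=1.0):
        # Gradient reversal makes the encoders adversarial to this classifier.
        return self.net(GradientReversal.apply(emb, lambd))

disc = ModalityDiscriminator(dim=512)
ce = nn.CrossEntropyLoss()

def adversarial_loss(text_emb, video_emb, lambd=1.0):
    """Modality-confusion loss, added to the usual ranking loss on the joint space."""
    logits = torch.cat([disc(text_emb, lambd), disc(video_emb, lambd)])
    labels = torch.cat([
        torch.zeros(len(text_emb), dtype=torch.long),  # 0 = text
        torch.ones(len(video_emb), dtype=torch.long),  # 1 = video
    ])
    return ce(logits, labels)
```

In a setup like this, the total objective would combine a cross-modal ranking loss (e.g., triplet loss with hard negatives) with the adversarial term above, so that matched text-video pairs are pulled together while the two modalities become statistically indistinguishable in the joint space.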