Abstract
Cross-modal retrieval between texts and videos has attracted sustained research interest in the multimedia community. Existing studies follow a trend of learning a joint embedding space in which to measure the distance between text and video representations. In common practice, the video representation is constructed by feeding clips into 3D convolutional neural networks to extract coarse-grained global visual features. Several studies have also attempted to align the local objects in a video with the text. However, these representations share a common drawback: they neglect the rich fine-grained relation features that capture spatial-temporal object interactions, which help map textual entities in real-world retrieval systems. To tackle this problem, we propose the adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that adopts both fine-grained local relation features and coarse-grained global features to bridge the text and video modalities. Building on this visual representation, we further integrate an adversarial learning strategy into AME-Net to narrow the domain gap between text and video representations. In summary, we contribute AME-Net with an adversarial learning strategy for learning a better joint embedding space, and experimental results on the MSR-VTT and YouCook2 datasets demonstrate that our proposed framework consistently outperforms state-of-the-art methods.
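The abstract does not spell out how the adversarial alignment is implemented. Below is a minimal sketch of one common way to realize such a modality-alignment objective: a discriminator tries to tell text embeddings from video embeddings, while a gradient-reversal layer (in the spirit of unsupervised domain adaptation by backpropagation) trains the encoders to fool it. All class names, dimensions, and the loss-weighting parameter `lambd` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; negates and scales gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows back into the text/video encoders.
        return -ctx.lambd * grad_output, None

class ModalityDiscriminator(nn.Module):
    """Predicts whether an embedding came from the text or the video encoder."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(),
            nn.Linear(dim // 2, 2),  # two classes: text vs. video
        )

    def forward(self, emb, lambd=1.0):
        # Gradient reversal makes the encoders adversarial to this classifier.
        return self.net(GradientReversal.apply(emb, lambd))

disc = ModalityDiscriminator(dim=512)
ce = nn.CrossEntropyLoss()

def adversarial_loss(text_emb, video_emb, lambd=1.0):
    """Modality-confusion loss, added to the usual ranking loss on the joint space."""
    logits = torch.cat([disc(text_emb, lambd), disc(video_emb, lambd)])
    labels = torch.cat([
        torch.zeros(len(text_emb), dtype=torch.long),  # 0 = text
        torch.ones(len(video_emb), dtype=torch.long),  # 1 = video
    ])
    return ce(logits, labels)
```

In a setup like this, the total objective would combine a cross-modal ranking loss (e.g., triplet loss with hard negatives) with the adversarial term above, so that matched text-video pairs are pulled together while the two modalities become statistically indistinguishable in the joint space.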