Abstract
As an indispensable part of cross-media analysis, comprehending heterogeneous data remains challenging in visual question answering (VQA), visual captioning, and cross-modality retrieval, where bridging the semantic gap between the two modalities is still difficult. In this article, we address cross-modality retrieval with a cross-modal learning model based on joint correlative calculation learning. First, an auto-encoder embeds the visual features by minimizing the feature-reconstruction error, and a multi-layer perceptron (MLP) models the textual feature embedding. We then design a joint loss function that optimizes both the intra- and the inter-modal correlations of image-sentence pairs: the reconstruction loss of the visual features, the relevant-similarity loss of matched pairs, and the triplet loss between positive and negative examples. The joint loss is optimized over a batch score matrix, using all mutually mismatched pairs in the batch to improve performance. Experiments on retrieval tasks demonstrate the effectiveness of the proposed method: it achieves performance comparable to the state of the art on three benchmarks, i.e., Flickr8k, Flickr30k, and MS-COCO.
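The abstract describes a joint loss combining three terms: a visual reconstruction loss, a similarity loss for matched image-sentence pairs, and a triplet loss computed against all mismatched pairs via a batch score matrix. The following NumPy sketch illustrates one plausible form of such a loss; the function name, weighting factors `alpha`/`beta`, and the cosine-style score matrix are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def joint_loss(img_emb, txt_emb, img_recon, img_feat, margin=0.2, alpha=1.0, beta=1.0):
    """Illustrative sketch of a joint correlative loss over a batch.

    img_emb, txt_emb   : (B, D) embeddings of matched image-sentence pairs
                         (row i of each is a matched pair)
    img_recon, img_feat: auto-encoder reconstruction and original visual features
    margin, alpha, beta: assumed hyper-parameters (not from the paper)
    """
    B = img_emb.shape[0]
    # 1) Reconstruction loss of the visual features (auto-encoder term).
    l_rec = np.mean((img_recon - img_feat) ** 2)
    # 2) Batch score matrix: S[i, j] scores image i against sentence j;
    #    matched pairs lie on the diagonal.
    S = img_emb @ txt_emb.T
    pos = np.diag(S)
    # Relevant-similarity loss: push matched-pair scores toward 1.
    l_pair = np.mean(1.0 - pos)
    # 3) Triplet loss over ALL mismatched pairs in the batch, both directions.
    mask = 1.0 - np.eye(B)                                      # exclude the diagonal
    l_i2t = np.maximum(0.0, margin + S - pos[:, None]) * mask   # image -> wrong sentence
    l_t2i = np.maximum(0.0, margin + S - pos[None, :]) * mask   # sentence -> wrong image
    l_trip = (l_i2t.sum() + l_t2i.sum()) / (B * (B - 1))
    return l_rec + alpha * l_pair + beta * l_trip
```

With identity embeddings (every matched pair scores 1, every mismatched pair 0) and a perfect reconstruction, all three terms vanish, which is the intended optimum of such an objective.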
Index Terms: Cross-Modality Retrieval by Joint Correlation Learning