Abstract
As an indispensable part of cross-media analysis, comprehending heterogeneous data remains challenging in visual question answering (VQA), visual captioning, and cross-modality retrieval, where bridging the semantic gap between the two modalities is still difficult. In this article, we address cross-modality retrieval with a cross-modal learning model based on joint correlative calculation learning. First, an auto-encoder embeds the visual features by minimizing the feature-reconstruction error, and a multi-layer perceptron (MLP) models the textual feature embedding. We then design a joint loss function that optimizes both the intra- and the inter-modal correlations of image-sentence pairs: the reconstruction loss of the visual features, the relevant-similarity loss of matched pairs, and the triplet loss between positive and negative examples. The joint loss is optimized over a batch score matrix, using all mutually mismatched pairs in the batch to improve performance. Experiments on retrieval tasks demonstrate the effectiveness of the proposed method: it achieves performance comparable to the state of the art on three benchmarks, i.e., Flickr8k, Flickr30k, and MS-COCO.
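The abstract describes a joint loss combining three terms: a visual reconstruction loss, a similarity loss for matched image-sentence pairs, and a triplet loss computed against all mismatched pairs via a batch score matrix. The following NumPy sketch illustrates one plausible form of such a loss; the function name, weighting factors `alpha`/`beta`, and the cosine-style score matrix are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def joint_loss(img_emb, txt_emb, img_recon, img_feat, margin=0.2, alpha=1.0, beta=1.0):
    """Illustrative sketch of a joint correlative loss over a batch.

    img_emb, txt_emb   : (B, D) embeddings of matched image-sentence pairs
                         (row i of each is a matched pair)
    img_recon, img_feat: auto-encoder reconstruction and original visual features
    margin, alpha, beta: assumed hyper-parameters (not from the paper)
    """
    B = img_emb.shape[0]
    # 1) Reconstruction loss of the visual features (auto-encoder term).
    l_rec = np.mean((img_recon - img_feat) ** 2)
    # 2) Batch score matrix: S[i, j] scores image i against sentence j;
    #    matched pairs lie on the diagonal.
    S = img_emb @ txt_emb.T
    pos = np.diag(S)
    # Relevant-similarity loss: push matched-pair scores toward 1.
    l_pair = np.mean(1.0 - pos)
    # 3) Triplet loss over ALL mismatched pairs in the batch, both directions.
    mask = 1.0 - np.eye(B)                                      # exclude the diagonal
    l_i2t = np.maximum(0.0, margin + S - pos[:, None]) * mask   # image -> wrong sentence
    l_t2i = np.maximum(0.0, margin + S - pos[None, :]) * mask   # sentence -> wrong image
    l_trip = (l_i2t.sum() + l_t2i.sum()) / (B * (B - 1))
    return l_rec + alpha * l_pair + beta * l_trip
```

With identity embeddings (every matched pair scores 1, every mismatched pair 0) and a perfect reconstruction, all three terms vanish, which is the intended optimum of such an objective.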
Index Terms: Cross-Modality Retrieval by Joint Correlation Learning