Modality-Dependent Cross-Media Retrieval

Published: 22 March 2016

Abstract

In this article, we investigate cross-media retrieval between images and text, that is, using an image to search for text (I2T) and using text to search for images (T2I). Existing cross-media retrieval methods usually learn a single couple of projections, by which the original features of images and text are projected into a common latent space to measure content similarity. However, using the same projections for the two different retrieval tasks (I2T and T2I) may force a tradeoff between their respective performances rather than achieving the best performance on each. Unlike previous works, we propose a modality-dependent cross-media retrieval (MDCR) model, in which two couples of projections are learned, one for each retrieval task, instead of a single couple. Specifically, by jointly optimizing the correlation between images and text and the linear regression from one modal space (image or text) to the semantic space, two couples of mappings are learned that project images and text from their original feature spaces into two common latent subspaces (one for I2T and the other for T2I). Extensive experiments show the superiority of the proposed MDCR compared with other methods. In particular, based on a 4,096-dimensional convolutional neural network (CNN) visual feature and a 100-dimensional Latent Dirichlet Allocation (LDA) textual feature, the proposed method achieves an mAP score of 41.5%, a new state-of-the-art result on the Wikipedia dataset.
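To make the formulation above concrete, the following is a minimal NumPy sketch of one plausible reading of the MDCR objective: a correlation term that couples the projected image and text features, plus a task-specific linear-regression term that ties either the image projection (for I2T) or the text projection (for T2I) to the semantic label space. The function name, the ridge regularization, the hyperparameters (lam, gamma, lr, n_iters), and the plain gradient-descent solver are illustrative assumptions, not the authors' actual optimization procedure.

```python
import numpy as np


def learn_mdcr_projections(X, Y, S, task="I2T", lam=0.5, gamma=1e-3,
                           lr=1e-3, n_iters=500, seed=0):
    """Learn one couple of projections (U for images, V for text).

    X: (n, dx) image features (e.g., 4,096-d CNN); Y: (n, dy) text features
    (e.g., 100-d LDA); S: (n, c) one-hot semantic labels. The latent space is
    taken to be the c-dimensional semantic space. lam weights the image-text
    correlation term against the task-specific regression term, and gamma is
    an assumed ridge penalty; all are illustrative hyperparameters.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((X.shape[1], S.shape[1])) * 0.01
    V = rng.standard_normal((Y.shape[1], S.shape[1])) * 0.01
    for _ in range(n_iters):
        XU, YV = X @ U, Y @ V
        corr = XU - YV                       # correlation (coupling) residual
        if task == "I2T":                    # regress the image projection onto labels
            res_u, res_v = XU - S, np.zeros_like(YV)
        else:                                # "T2I": regress the text projection onto labels
            res_u, res_v = np.zeros_like(XU), YV - S
        grad_u = 2 * X.T @ (lam * corr + (1 - lam) * res_u) + 2 * gamma * U
        grad_v = 2 * Y.T @ (-lam * corr + (1 - lam) * res_v) + 2 * gamma * V
        U -= lr * grad_u
        V -= lr * grad_v
    return U, V


if __name__ == "__main__":
    # Tiny synthetic check: learn the I2T projections and rank all texts
    # for one query image by cosine similarity in the shared latent space.
    rng = np.random.default_rng(1)
    n, dx, dy, c = 200, 64, 32, 10
    X, Y = rng.standard_normal((n, dx)), rng.standard_normal((n, dy))
    S = np.eye(c)[rng.integers(0, c, size=n)]
    U, V = learn_mdcr_projections(X, Y, S, task="I2T")
    query = X[0] @ U
    texts = Y @ V
    scores = texts @ query / (np.linalg.norm(texts, axis=1)
                              * np.linalg.norm(query) + 1e-12)
    print("Top-5 retrieved text indices:", np.argsort(-scores)[:5])
```

At retrieval time, the couple learned with task="I2T" would be used when the query is an image and the couple learned with task="T2I" when the query is text, which is the modality-dependent aspect described in the abstract.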


Published in

ACM Transactions on Intelligent Systems and Technology, Volume 7, Issue 4
Special Issue on Crowd in Intelligent Systems, Research Note/Short Paper and Regular Papers
July 2016, 498 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/2906145
Editor: Yu Zheng

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 March 2016
      • Accepted: 1 May 2015
      • Revised: 1 March 2015
      • Received: 1 April 2014
Published in TIST Volume 7, Issue 4


      Qualifiers

      • note
      • Research
      • Refereed
