Abstract
In this article, we investigate cross-media retrieval between images and text, that is, using an image to search for text (I2T) and using text to search for images (T2I). Existing cross-media retrieval methods usually learn a single couple of projections, by which the original features of images and text are projected into a common latent space where content similarity can be measured. However, using the same projections for the two different retrieval tasks (I2T and T2I) may force a tradeoff between their respective performances rather than achieving the best performance on each. Different from previous works, we propose a modality-dependent cross-media retrieval (MDCR) model, in which two couples of projections are learned, one for each cross-media retrieval task, instead of a single couple. Specifically, by jointly optimizing the correlation between images and text and the linear regression from one modal space (image or text) to the semantic space, two couples of mappings are learned that project images and text from their original feature spaces into two common latent subspaces (one for I2T and the other for T2I). Extensive experiments show the superiority of the proposed MDCR compared with other methods. In particular, based on the 4,096-dimensional convolutional neural network (CNN) visual feature and the 100-dimensional latent Dirichlet allocation (LDA) textual feature, the proposed method achieves an mAP score of 41.5%, a new state-of-the-art result on the Wikipedia dataset.
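To make the idea concrete, the task-specific objective described above (a correlation term between the projected image and text features, plus a linear regression from one modality to the semantic space) can be sketched with alternating least squares. This is a minimal illustration, not the authors' exact algorithm: the weighting `lam`, the initialization, and the omission of any regularization terms are assumptions made here for simplicity.

```python
import numpy as np

def mdcr_i2t(X, Y, S, lam=0.5, iters=20, seed=0):
    """Sketch of the I2T objective:
        lam * ||X U - Y V||^2 + (1 - lam) * ||X U - S||^2
    where X are image features (n x dx), Y text features (n x dy),
    and S semantic (label indicator) vectors (n x k).
    Returns projections U (dx x k) and V (dy x k)."""
    rng = np.random.default_rng(seed)
    k = S.shape[1]
    V = 0.01 * rng.standard_normal((Y.shape[1], k))
    for _ in range(iters):
        # Fix V, solve for U: both terms involve X U, so stack
        # them into one weighted least-squares problem.
        A = np.vstack([np.sqrt(lam) * X, np.sqrt(1.0 - lam) * X])
        b = np.vstack([np.sqrt(lam) * (Y @ V), np.sqrt(1.0 - lam) * S])
        U = np.linalg.lstsq(A, b, rcond=None)[0]
        # Fix U, solve for V: only the correlation term involves V.
        V = np.linalg.lstsq(Y, X @ U, rcond=None)[0]
    return U, V
```

A second, independent couple of projections would be learned the same way for T2I, with the regression term tied to the text modality (`||Y V - S||^2`); retrieval then scores query/candidate pairs by, e.g., cosine similarity between `x @ U` and `y @ V` in the task's latent subspace.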
Index Terms
- Modality-Dependent Cross-Media Retrieval