Modality-Dependent Cross-Media Retrieval

Published: 22 March 2016

Abstract

In this article, we investigate cross-media retrieval between images and text, that is, using an image to search for text (I2T) and using text to search for images (T2I). Existing cross-media retrieval methods usually learn a single couple of projections, by which the original features of images and text are projected into a common latent space to measure content similarity. However, using the same projections for the two different retrieval tasks (I2T and T2I) may force a tradeoff between their respective performances rather than achieving the best performance on each. Unlike previous works, we propose a modality-dependent cross-media retrieval (MDCR) model, in which two couples of projections are learned, one for each retrieval task, instead of a single couple. Specifically, by jointly optimizing the correlation between images and text and the linear regression from one modal space (image or text) to the semantic space, two couples of mappings are learned that project images and text from their original feature spaces into two common latent subspaces (one for I2T and the other for T2I). Extensive experiments show the superiority of the proposed MDCR compared with other methods. In particular, based on a 4,096-dimensional convolutional neural network (CNN) visual feature and a 100-dimensional Latent Dirichlet Allocation (LDA) textual feature, the proposed method achieves an mAP score of 41.5%, a new state-of-the-art result on the Wikipedia dataset.
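To make the formulation above concrete, the following is a minimal NumPy sketch of one plausible reading of the MDCR objective: a correlation term that couples the projected image and text features, plus a task-specific linear-regression term that ties either the image projection (for I2T) or the text projection (for T2I) to the semantic label space. The function name, the ridge regularization, the hyperparameters (lam, gamma, lr, n_iters), and the plain gradient-descent solver are illustrative assumptions, not the authors' actual optimization procedure.

```python
import numpy as np


def learn_mdcr_projections(X, Y, S, task="I2T", lam=0.5, gamma=1e-3,
                           lr=1e-3, n_iters=500, seed=0):
    """Learn one couple of projections (U for images, V for text).

    X: (n, dx) image features (e.g., 4,096-d CNN); Y: (n, dy) text features
    (e.g., 100-d LDA); S: (n, c) one-hot semantic labels. The latent space is
    taken to be the c-dimensional semantic space. lam weights the image-text
    correlation term against the task-specific regression term, and gamma is
    an assumed ridge penalty; all are illustrative hyperparameters.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((X.shape[1], S.shape[1])) * 0.01
    V = rng.standard_normal((Y.shape[1], S.shape[1])) * 0.01
    for _ in range(n_iters):
        XU, YV = X @ U, Y @ V
        corr = XU - YV                       # correlation (coupling) residual
        if task == "I2T":                    # regress the image projection onto labels
            res_u, res_v = XU - S, np.zeros_like(YV)
        else:                                # "T2I": regress the text projection onto labels
            res_u, res_v = np.zeros_like(XU), YV - S
        grad_u = 2 * X.T @ (lam * corr + (1 - lam) * res_u) + 2 * gamma * U
        grad_v = 2 * Y.T @ (-lam * corr + (1 - lam) * res_v) + 2 * gamma * V
        U -= lr * grad_u
        V -= lr * grad_v
    return U, V


if __name__ == "__main__":
    # Tiny synthetic check: learn the I2T projections and rank all texts
    # for one query image by cosine similarity in the shared latent space.
    rng = np.random.default_rng(1)
    n, dx, dy, c = 200, 64, 32, 10
    X, Y = rng.standard_normal((n, dx)), rng.standard_normal((n, dy))
    S = np.eye(c)[rng.integers(0, c, size=n)]
    U, V = learn_mdcr_projections(X, Y, S, task="I2T")
    query = X[0] @ U
    texts = Y @ V
    scores = texts @ query / (np.linalg.norm(texts, axis=1)
                              * np.linalg.norm(query) + 1e-12)
    print("Top-5 retrieved text indices:", np.argsort(-scores)[:5])
```

At retrieval time, the couple learned with task="I2T" would be used when the query is an image and the couple learned with task="T2I" when the query is text, which is the modality-dependent aspect described in the abstract.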


Published in

ACM Transactions on Intelligent Systems and Technology, Volume 7, Issue 4
Special Issue on Crowd in Intelligent Systems, Research Note/Short Paper and Regular Papers
July 2016, 498 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/2906145
Editor: Yu Zheng

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 March 2016
      • Accepted: 1 May 2015
      • Revised: 1 March 2015
      • Received: 1 April 2014
Published in TIST Volume 7, Issue 4


      Qualifiers

      • note
      • Research
      • Refereed
