
Cross-Modality Retrieval by Joint Correlation Learning

Published: 03 July 2019

Abstract

Comprehending heterogeneous data is an indispensable step in cross-media analysis and poses challenges in visual question answering (VQA), visual captioning, and cross-modality retrieval, where bridging the semantic gap between the visual and textual modalities remains difficult. In this article, we address cross-modality retrieval with a cross-modal learning model based on joint correlative calculation learning. First, an auto-encoder embeds the visual features by minimizing the feature-reconstruction error, and a multi-layer perceptron (MLP) embeds the textual features. Then we design a joint loss function that optimizes both the intra- and the inter-correlations among image-sentence pairs, i.e., the reconstruction loss of the visual features, the similarity loss of matched pairs, and the triplet loss between positive and negative examples. The joint loss is optimized over a batch score matrix, so all mutually mismatched pairs in a batch serve as negatives and strengthen training. Our experiments on retrieval tasks demonstrate the effectiveness of the proposed method: it achieves performance comparable to the state-of-the-art on three benchmarks, i.e., Flickr8k, Flickr30k, and MS-COCO.
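To make the loss formulation concrete, the following is a minimal PyTorch sketch of the three terms described above. It is an illustration under stated assumptions, not the authors' implementation: the names (ImageAutoEncoder, TextMLP, joint_loss), the layer sizes, and the hyper-parameters (margin, alpha, beta) are all hypothetical placeholders.

```python
# Hypothetical sketch of the joint loss described in the abstract (not the
# authors' code). Architectures and hyper-parameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageAutoEncoder(nn.Module):
    """Embeds visual features; the decoder enables a reconstruction loss."""
    def __init__(self, vis_dim=4096, emb_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vis_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, vis_dim))

    def forward(self, v):
        z = self.encoder(v)
        return z, self.decoder(z)

class TextMLP(nn.Module):
    """Embeds textual features with a multi-layer perceptron."""
    def __init__(self, txt_dim=300, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(txt_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, emb_dim))

    def forward(self, t):
        return self.net(t)

def joint_loss(v, t, img_ae, txt_mlp, margin=0.2, alpha=1.0, beta=1.0):
    """Reconstruction + paired-similarity + triplet terms over one batch.

    scores[i, j] is the cosine similarity between image i and sentence j;
    the off-diagonal entries are the mutually mismatched pairs used as
    negatives.
    """
    z_v, v_rec = img_ae(v)
    z_t = txt_mlp(t)
    z_v, z_t = F.normalize(z_v, dim=1), F.normalize(z_t, dim=1)

    recon = F.mse_loss(v_rec, v)          # visual reconstruction loss
    scores = z_v @ z_t.t()                # batch score matrix
    pos = scores.diag()                   # matched image-sentence pairs
    sim = (1 - pos).mean()                # similarity loss of matched pairs

    # Triplet loss: every off-diagonal entry is a negative for its row/column.
    cost_s = (margin + scores - pos.unsqueeze(1)).clamp(min=0)  # image -> text
    cost_i = (margin + scores - pos.unsqueeze(0)).clamp(min=0)  # text -> image
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    triplet = (cost_s.masked_fill(mask, 0).mean()
               + cost_i.masked_fill(mask, 0).mean())

    return recon + alpha * sim + beta * triplet

# Example usage with random features standing in for CNN image features and
# sentence embeddings (batch of 32):
img_ae, txt_mlp = ImageAutoEncoder(), TextMLP()
v, t = torch.randn(32, 4096), torch.randn(32, 300)
loss = joint_loss(v, t, img_ae, txt_mlp)
loss.backward()
```

Using every off-diagonal entry of the batch score matrix as a negative mirrors the abstract's use of all mutually mismatched pairs; in practice the relative weights of the three terms would be tuned on a validation split.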



    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 15, Issue 2s
      Special Section on Cross-Media Analysis for Visual Question Answering, Special Section on Big Data, Machine Learning and AI Technologies for Art and Design and Special Section on MMSys/NOSSDAV 2018
      April 2019
      381 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3343360

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 July 2019
      • Accepted: 1 February 2019
      • Revised: 1 January 2019
      • Received: 1 June 2018
Published in TOMM Volume 15, Issue 2s
