DOI: 10.1145/3343031.3351053

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

Published: 15 October 2019

ABSTRACT

Cross-modal retrieval has become a popular research topic in recent years because of its theoretical and practical significance. This paper proposes a new technique for learning a deep visual-semantic embedding that is more effective and interpretable for cross-modal retrieval. The proposed method employs a two-stage strategy. In the first stage, deep mutual information estimation is incorporated into the objective to maximize the mutual information between the input data and its embedding. In the second stage, an expelling branch is added to the network to disentangle modality-exclusive information from the learned representation. This reduces the influence of modality-exclusive information on the common-subspace representation and improves the interpretability of the learned features. Extensive experiments on two large-scale benchmark datasets demonstrate that our method learns a better visual-semantic embedding and achieves state-of-the-art cross-modal retrieval results.
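To make the two-stage objective concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: the network sizes, the Jensen-Shannon lower bound used as the neural mutual-information estimate, the bidirectional hinge ranking loss, and the gradient-reversal modality classifier that stands in for the expelling branch are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps one modality's input features into the shared embedding space."""
    def __init__(self, in_dim, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class MICritic(nn.Module):
    """Scores (input, embedding) pairs for the neural mutual-information estimate."""
    def __init__(self, in_dim, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + emb_dim, 512), nn.ReLU(),
                                 nn.Linear(512, 1))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))

def jsd_mi_bound(critic, x, z):
    """Jensen-Shannon MI lower bound: joint pairs vs. shuffled (marginal) pairs."""
    joint = critic(x, z)
    marginal = critic(x, z[torch.randperm(z.size(0), device=z.device)])
    return (-F.softplus(-joint)).mean() - F.softplus(marginal).mean()

def ranking_loss(v, t, margin=0.2):
    """Bidirectional hinge ranking loss on a cosine-similarity matrix."""
    sim = v @ t.t()                                  # sim[i, j] = <image i, text j>
    diag = sim.diag().view(-1, 1)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t = F.relu(margin + sim - diag).masked_fill(mask, 0)      # negative texts
    cost_v = F.relu(margin + sim - diag.t()).masked_fill(mask, 0)  # negative images
    return cost_t.mean() + cost_v.mean()

def stage1_loss(img_feat, txt_feat, img_enc, txt_enc, img_critic, txt_critic, lam=0.1):
    """Stage 1: retrieval objective plus MI maximization between inputs and embeddings."""
    v, t = img_enc(img_feat), txt_enc(txt_feat)
    mi = jsd_mi_bound(img_critic, img_feat, v) + jsd_mi_bound(txt_critic, txt_feat, t)
    return ranking_loss(v, t) - lam * mi             # maximizing MI = subtracting the bound

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad

def expelling_loss(modality_clf, v, t):
    """Stage 2 stand-in: an adversarial branch predicts which modality an embedding came
    from; reversed gradients push modality-exclusive cues out of the shared space."""
    z = torch.cat([GradReverse.apply(v), GradReverse.apply(t)], dim=0)
    labels = torch.cat([torch.zeros(v.size(0)), torch.ones(t.size(0))]).long().to(z.device)
    return F.cross_entropy(modality_clf(z), labels)

In this sketch, stage one trains the two encoders with ranking_loss minus a weighted MI bound, while stage two adds expelling_loss so that reversed gradients discourage modality-specific cues from surviving in the shared embedding; the architecture, losses, and weights used in the paper may differ.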

Published in

      MM '19: Proceedings of the 27th ACM International Conference on Multimedia
      October 2019
      2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031

      Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 15 October 2019


      Qualifiers

      • research-article

      Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions, 27%. Overall acceptance rate: 995 of 4,171 submissions, 24%.
