ABSTRACT
Cross-modal retrieval has attracted considerable research attention in recent years owing to its theoretical and practical significance. This paper proposes a new technique for learning a deep visual-semantic embedding that is more effective and interpretable for cross-modal retrieval. The proposed method employs a two-stage strategy. In the first stage, deep mutual information estimation is incorporated into the objective to maximize the mutual information between the input data and its embedding. In the second stage, an expelling branch is added to the network to disentangle modality-exclusive information from the learned representations. This reduces the impact of modality-exclusive information on the common-subspace representation and improves the interpretability of the learned features. Extensive experiments on two large-scale benchmark datasets demonstrate that our method learns better visual-semantic embeddings and achieves state-of-the-art cross-modal retrieval results.
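To make the two-stage objective concrete, below is a minimal PyTorch sketch. Stage one maximizes a MINE-style Donsker-Varadhan lower bound on the mutual information between each input and its common-subspace embedding; stage two realizes the expelling branch as a gradient-reversed modality classifier, which is one plausible reading of "expelling" modality-exclusive information, not necessarily the paper's exact formulation. All module names (Encoder, StatNet, GradReverse, expel_head), feature dimensions, and hyperparameters are illustrative assumptions.

```python
# A minimal sketch of the two-stage objective, assuming pre-extracted
# image features (2048-d) and text features (300-d). Names, dimensions,
# and the gradient-reversal expelling branch are illustrative assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward pass,
    so the encoders learn to remove whatever the expelling head can predict."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad

class Encoder(nn.Module):
    """Maps modality-specific features into the common subspace."""
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class StatNet(nn.Module):
    """Statistics network T(x, z) for the MINE lower bound on I(X; Z)."""
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + emb_dim, 512), nn.ReLU(),
                                 nn.Linear(512, 1))
    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))

def mine_lower_bound(T, x, z):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)]."""
    joint = T(x, z).mean()
    z_shuf = z[torch.randperm(z.size(0))]  # shuffle to simulate the marginals
    marginal = torch.logsumexp(T(x, z_shuf), dim=0).squeeze() - math.log(z.size(0))
    return joint - marginal

img_enc, txt_enc = Encoder(2048), Encoder(300)
T_img, T_txt = StatNet(2048), StatNet(300)
expel_head = nn.Linear(256, 2)  # tries to predict the source modality
params = (list(img_enc.parameters()) + list(txt_enc.parameters()) +
          list(T_img.parameters()) + list(T_txt.parameters()) +
          list(expel_head.parameters()))
opt = torch.optim.Adam(params, lr=1e-4)

imgs, txts = torch.randn(32, 2048), torch.randn(32, 300)  # dummy paired batch
z_i, z_t = img_enc(imgs), txt_enc(txts)

# Stage 1: maximize MI between each input and its embedding
# (minimize the negative lower bound).
loss_mi = -(mine_lower_bound(T_img, imgs, z_i) +
            mine_lower_bound(T_txt, txts, z_t))

# Stage 2: the expelling head recovers the modality from the gradient-reversed
# embedding; via the reversed gradients, the encoders learn to strip that
# modality-exclusive information from the common subspace.
z = torch.cat([z_i, z_t], dim=0)
labels = torch.cat([torch.zeros(32), torch.ones(32)]).long()
loss_expel = F.cross_entropy(expel_head(GradReverse.apply(z)), labels)

opt.zero_grad()
(loss_mi + loss_expel).backward()
opt.step()
```

In practice the two stages would be trained sequentially rather than with a single summed loss as shown here; the joint step above merely keeps the sketch compact.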