Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning

Liang, Weixin; Zhang, Yuhui; Kwon, Yongchan; Yeung, Serena; Zou, James

arXiv:2203.02053 (cs)

[Submitted on 3 Mar 2022 (v1), last revised 19 Oct 2022 (this version, v2)]

Abstract: We present modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g. images and text) are embedded at arm's length in their shared representation in multi-modal models such as CLIP. Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. In model initialization, we show empirically and theoretically that the representation of a common deep neural network is restricted to a narrow cone. As a consequence, in a multi-modal model with two encoders, the representations of the two modalities are clearly apart when the model is initialized. During optimization, contrastive learning keeps the different modalities separate by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness. Our code and data are available at this https URL
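
To make the two geometric quantities in the abstract concrete, the following is a minimal sketch, not the authors' released code. It assumes the gap is summarized as the Euclidean distance between the centroids of the L2-normalized image and text embeddings, and that the narrowness of a cone is summarized by the mean pairwise cosine similarity; both summaries, and the function names `modality_gap` and `mean_pairwise_cosine`, are illustrative assumptions.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(image_embs, text_embs):
    """Assumed gap measure: distance between the two modality centroids
    after projecting every embedding onto the unit sphere."""
    img_center = centroid([l2_normalize(v) for v in image_embs])
    txt_center = centroid([l2_normalize(v) for v in text_embs])
    return math.dist(img_center, txt_center)


def mean_pairwise_cosine(vectors):
    """Assumed cone measure: average cosine similarity over all pairs.
    Values near 1 mean the embeddings occupy a narrow cone."""
    unit = [l2_normalize(v) for v in vectors]
    sims = [sum(a * b for a, b in zip(u, w)) for u, w in combinations(unit, 2)]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Hypothetical toy embeddings; real ones would come from a model such as CLIP.
    image_embs = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.3]]
    text_embs = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.0]]
    print("gap:", modality_gap(image_embs, text_embs))
    print("image cone:", mean_pairwise_cosine(image_embs))
    print("text cone:", mean_pairwise_cosine(text_embs))
```

In this toy setup each modality has a high within-modality cosine similarity while the two centroids sit far apart, which is the qualitative pattern the abstract describes.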

Comments:	Published at NeurIPS 2022. Code and data are available at this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2203.02053 [cs.CL]
	(or arXiv:2203.02053v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2203.02053

Submission history

From: Weixin Liang
[v1] Thu, 3 Mar 2022 22:53:54 UTC (9,354 KB)
[v2] Wed, 19 Oct 2022 20:39:13 UTC (17,326 KB)
