ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

Subramanian, Sanjay; Merrill, William; Darrell, Trevor; Gardner, Matt; Singh, Sameer; Rohrbach, Anna

arXiv:2204.05991 (cs)

[Submitted on 12 Apr 2022 (v1), last revised 2 May 2022 (this version, v2)]

Abstract: Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP's contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. Thus, the second component of ReCLIP is a spatial relation resolver that handles several types of spatial relations. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game imagery), ReCLIP's relative improvement over supervised ReC models trained on real images is 8%.
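
The region-scoring idea in the abstract (isolate each object proposal by cropping and by blurring, then score each isolated view against the expression) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a grayscale image stored as nested lists, a mean filter standing in for a real blur, scores from the two views combined by simple summation, and a caller-supplied `score_fn` standing in for CLIP's image-text similarity. All function names here are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def crop(image: Image, box: Box) -> Image:
    """Keep only the pixels inside the proposal box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def box_blur(image: Image, radius: int = 1) -> Image:
    """Mean filter over a square window (assumed stand-in for a real blur)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [image[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def blur_outside(image: Image, box: Box, radius: int = 1) -> Image:
    """Blur everything except the proposal box, which stays sharp."""
    x0, y0, x1, y1 = box
    blurred = box_blur(image, radius)
    return [
        [image[y][x] if (x0 <= x < x1 and y0 <= y < y1) else blurred[y][x]
         for x in range(len(image[0]))]
        for y in range(len(image))
    ]


def pick_proposal(
    image: Image,
    boxes: Sequence[Box],
    expression: str,
    score_fn: Callable[[Image, str], float],  # hypothetical image-text scorer
) -> int:
    """Return the index of the box whose isolated views score highest."""
    totals = []
    for box in boxes:
        views = [crop(image, box), blur_outside(image, box)]
        # Assumption: the two views' scores are combined by summation.
        totals.append(sum(score_fn(view, expression) for view in views))
    return max(range(len(boxes)), key=totals.__getitem__)
```

In practice `score_fn` would wrap an image-text model such as CLIP. The sketch keeps it as a plain callable so the isolation logic can be run and inspected without any model dependency.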

Comments:	ACL 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2204.05991 [cs.CV]
	(or arXiv:2204.05991v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2204.05991

Submission history

From: Sanjay Subramanian
[v1] Tue, 12 Apr 2022 17:55:38 UTC (30,690 KB)
[v2] Mon, 2 May 2022 20:08:17 UTC (30,690 KB)
