Refer to Anything with Vision-Language Prompts

University of Illinois Urbana-Champaign, Adobe Research

New Task: Omnimodal Referring Expression Segmentation (ORES)


In Omnimodal Referring Expression Segmentation (ORES), we incorporate visual references into text prompts, enabling more flexible and intuitive interaction. ORES supports applications such as precisely localized image editing.
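To make the prompt format concrete, the sketch below shows one way an ORES-style query could pair a text expression with visual reference masks. The `ORESQuery` class and its fields are our own illustrative assumptions, not the released RAS interface; only the `<mask-ref>` placeholder token comes from the results table on this page.

```python
# Minimal sketch (assumed interface, not the released RAS API) of how an
# ORES-style query could pair a text expression with visual reference masks.
# Each <mask-ref> placeholder in the text stands for one reference mask; the
# model is asked to return the group of masks described by the combined prompt.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class ORESQuery:
    image: np.ndarray                  # H x W x 3 input image
    text: str                          # expression containing <mask-ref> placeholders
    reference_masks: List[np.ndarray] = field(default_factory=list)  # one H x W bool mask per placeholder

    def validate(self) -> None:
        # Every <mask-ref> token must be backed by exactly one reference mask.
        n_refs = self.text.count("<mask-ref>")
        if n_refs != len(self.reference_masks):
            raise ValueError(f"{n_refs} <mask-ref> tokens but {len(self.reference_masks)} reference masks")

# Example: segment all cups on the same table as a plate that the user
# indicated with a mask (e.g., a click or scribble) rather than with words.
image = np.zeros((480, 640, 3), dtype=np.uint8)
plate_mask = np.zeros((480, 640), dtype=bool)
query = ORESQuery(image=image, text="all cups on the same table as <mask-ref>", reference_masks=[plate_mask])
query.validate()
```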

New Model: Refer to Any Segmentation Mask Group (RAS)


We propose Refer to Any Segmentation Mask Group (RAS), a new multimodal model that augments segmentation models with complex multimodal interaction and comprehension through a mask-centric design.
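As an illustration of what a mask-centric design could look like in code, the sketch below assumes that a segmentation model proposes candidate masks and that a lightweight head scores each candidate for membership in the referred group. This is our assumption for illustration only; `MaskGroupSelector` and its dimensions are hypothetical, not the actual RAS architecture.

```python
# Illustrative sketch of a mask-centric selection head (hypothetical names and
# shapes; not the actual RAS architecture). Candidate masks come from an
# off-the-shelf segmentation model; each mask feature is fused with a pooled
# vision-language prompt feature and scored for membership in the referred group.
import torch
import torch.nn as nn

class MaskGroupSelector(nn.Module):
    def __init__(self, mask_dim: int = 256, prompt_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.mask_proj = nn.Linear(mask_dim, hidden_dim)      # embed each candidate mask
        self.prompt_proj = nn.Linear(prompt_dim, hidden_dim)  # embed the omnimodal prompt
        self.scorer = nn.Linear(hidden_dim, 1)                 # per-mask membership logit

    def forward(self, mask_feats: torch.Tensor, prompt_feat: torch.Tensor) -> torch.Tensor:
        # mask_feats: (N, mask_dim), one feature per candidate mask
        # prompt_feat: (prompt_dim,), pooled feature of the text + visual-reference prompt
        fused = torch.tanh(self.mask_proj(mask_feats) + self.prompt_proj(prompt_feat))
        return self.scorer(fused).squeeze(-1)  # (N,) logits

# Candidates whose score clears a threshold form the predicted mask group.
selector = MaskGroupSelector()
logits = selector(torch.randn(20, 256), torch.randn(256))
group_ids = (logits.sigmoid() > 0.5).nonzero(as_tuple=True)[0]
```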

New Datasets: MaskGroups-2M & MaskGroups-HQ

MaskGroups-HQ Examples

We curate two new datasets, MaskGroups-2M and MaskGroups-HQ, for large-scale training and high-quality finetuning/evaluation of RAS, respectively.
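For a sense of what a single training or evaluation example might contain, here is a purely illustrative record layout; the field names are our assumptions and do not reflect the released MaskGroups schema.

```python
# Purely illustrative record layout (field names are assumptions, not the
# released MaskGroups-2M / MaskGroups-HQ schema): an image, candidate masks,
# a referring expression with an optional visual reference, and the subset of
# masks that forms the referred group.
example_record = {
    "image": "images/000123.jpg",
    "candidate_masks": ["masks/000123_0.png", "masks/000123_1.png", "masks/000123_2.png"],
    "expression": "the two dogs closest to <mask-ref>",
    "reference_mask": "masks/000123_ref.png",  # visual reference used in the prompt (may be absent)
    "target_mask_ids": [0, 2],                 # candidate masks forming the referred group
}
```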

New SOTA: Best Model Across RES, GRES, and ORES

Model        RES             GRES              ORES w/o <mask-ref>   ORES w/ <mask-ref>   ORES Overall
Prev. SOTA   77.1 (PSALM)    67.8 (SAM4MLLM)   49.6 (GSVA)           N/A*                 N/A*
Ours         77.8            71.8              74.6                  68.8                 73.1

* No previous method can understand visual reference prompts.