Refer to Anything with Vision-Language Prompts

University of Illinois Urbana-Champaign, Adobe

Segmentation from referring expressions & visual references enables more intuitive interactions.

📍New Task: Omnimodal Referring Expression Segmentation

We incorporate visual references into text prompts in the new task of Omnimodal Referring Expression Segmentation (ORES), enabling more flexible and intuitive interactions. ORES supports applications such as precisely localized image editing.
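
As a rough illustration, an omnimodal prompt pairs free-form text with binary masks that serve as visual references, marked in the text by <mask-ref> placeholders. The sketch below shows a minimal, hypothetical interface: the segment function and its argument layout are assumptions for illustration only, not the released RAS API.

```python
import numpy as np

def segment(image: np.ndarray, text: str, mask_refs: list[np.ndarray]) -> list[np.ndarray]:
    """Hypothetical ORES interface: return binary masks for every object the prompt refers to.

    `text` is a referring expression that may contain <mask-ref> placeholders;
    the i-th placeholder is grounded by `mask_refs[i]`, a binary mask over the
    image that serves as a visual reference.
    """
    raise NotImplementedError("stand-in for a trained ORES model such as RAS")

# Text-only referring expression (classic RES/GRES style):
#   segment(image, "the two people walking a dog", mask_refs=[])
# Omnimodal prompt mixing text with a visual reference region:
#   segment(image, "all chairs in the same style as <mask-ref>", mask_refs=[reference_mask])
```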

🚀New Model: Refer to Any Segmentation Mask Group

We propose a new multimodal model, "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interaction and comprehension capabilities through a mask-centric design.
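
To make the mask-centric design concrete, here is a minimal sketch, assuming hypothetical components (propose_masks, encode_prompt, encode_mask, score): class-agnostic mask proposals serve as the basic units, and the model selects the group of masks that matches the omnimodal prompt. This illustrates the idea only and is not the actual RAS implementation.

```python
def refer_to_mask_group(image, text, mask_refs, model, threshold=0.5):
    """Select the group of proposed masks that matches an omnimodal prompt.

    All `model` methods below are hypothetical stand-ins for a mask-centric
    multimodal model; they are not the released RAS API.
    """
    proposals = model.propose_masks(image)                    # class-agnostic mask proposals
    prompt_emb = model.encode_prompt(image, text, mask_refs)  # joint text + visual-reference embedding
    group = []
    for mask in proposals:
        score = model.score(prompt_emb, model.encode_mask(image, mask))
        if score >= threshold:  # a mask joins the referred group if it matches the prompt
            group.append(mask)
    return group
```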

💎New Datasets: MaskGroups-2M & MaskGroups-HQ

MaskGroups-HQ Examples

We curate two new datasets, MaskGroups-2M and MaskGroups-HQ, for large-scale training and high-quality finetuning/evaluation of RAS, respectively.

🏆New SOTA: Best Model Across RES, GRES, and ORES

| Model | RES | GRES | ORES (w/o <mask-ref>) | ORES (w/ <mask-ref>) | ORES (Overall) |
| --- | --- | --- | --- | --- | --- |
| Prev. SOTA | 77.1 (PSALM) | 67.8 (SAM4MLLM) | 49.6 (GSVA) | N/A* | N/A* |
| RAS (Ours) | 77.8 | 71.8 | 74.6 | 68.8 | 73.1 |

* No previous method can understand visual reference prompts, so these entries are not applicable.

Qualitative comparison

RAS accurately follows complex prompts to segment targets, even when they are small or occluded.

📚BibTeX

@article{cao2025refer,
  title={Refer to Anything with Vision-Language Prompts},
  author={Shengcao Cao and Zijun Wei and Jason Kuen and Kangning Liu and Lingzhi Zhang and Jiuxiang Gu and HyunJoon Jung and Liang-Yan Gui and Yu-Xiong Wang},
  journal={arXiv preprint arXiv:2506.05342},
  year={2025}
}