Refer to Anything with Vision-Language Prompts

University of Illinois Urbana-Champaign, Adobe

Segmentation from referring expressions & visual references enables more intuitive interactions.

📍New Task: Omnimodal Referring Expression Segmentation

We incorporate visual references into text prompts in the new task of Omnimodal Referring Expression Segmentation (ORES), enabling more flexible and intuitive interactions. ORES supports applications such as precisely localized image editing.
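
As a rough illustration, an omnimodal prompt pairs free-form text with binary masks that serve as visual references, marked in the text by <mask-ref> placeholders. The sketch below shows a minimal, hypothetical interface: the segment function and its argument layout are assumptions for illustration only, not the released RAS API.

```python
import numpy as np

def segment(image: np.ndarray, text: str, mask_refs: list[np.ndarray]) -> list[np.ndarray]:
    """Hypothetical ORES interface: return binary masks for every object the prompt refers to.

    `text` is a referring expression that may contain <mask-ref> placeholders;
    the i-th placeholder is grounded by `mask_refs[i]`, a binary mask over the
    image that serves as a visual reference.
    """
    raise NotImplementedError("stand-in for a trained ORES model such as RAS")

# Text-only referring expression (classic RES/GRES style):
#   segment(image, "the two people walking a dog", mask_refs=[])
# Omnimodal prompt mixing text with a visual reference region:
#   segment(image, "all chairs in the same style as <mask-ref>", mask_refs=[reference_mask])
```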

🚀New Model: Refer to Any Segmentation Mask Group

We propose a new multimodal model, "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interaction and comprehension capabilities through a mask-centric design.
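
To make the mask-centric design concrete, here is a minimal sketch, assuming hypothetical components (propose_masks, encode_prompt, encode_mask, score): class-agnostic mask proposals serve as the basic units, and the model selects the group of masks that matches the omnimodal prompt. This illustrates the idea only and is not the actual RAS implementation.

```python
def refer_to_mask_group(image, text, mask_refs, model, threshold=0.5):
    """Select the group of proposed masks that matches an omnimodal prompt.

    All `model` methods below are hypothetical stand-ins for a mask-centric
    multimodal model; they are not the released RAS API.
    """
    proposals = model.propose_masks(image)                    # class-agnostic mask proposals
    prompt_emb = model.encode_prompt(image, text, mask_refs)  # joint text + visual-reference embedding
    group = []
    for mask in proposals:
        score = model.score(prompt_emb, model.encode_mask(image, mask))
        if score >= threshold:  # a mask joins the referred group if it matches the prompt
            group.append(mask)
    return group
```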

💎New Datasets: MaskGroups-2M & MaskGroups-HQ

MaskGroups-HQ Examples

We curate two new datasets, MaskGroups-2M and MaskGroups-HQ, for large-scale training and high-quality finetuning/evaluation of RAS, respectively.

🏆New SOTA: Best Model Across RES, GRES, and ORES

| Model | RES | GRES | ORES (w/o <mask-ref>) | ORES (w/ <mask-ref>) | ORES (Overall) |
| --- | --- | --- | --- | --- | --- |
| Prev. SOTA | 77.1 (PSALM) | 67.8 (SAM4MLLM) | 49.6 (GSVA) | N/A* | N/A* |
| RAS (Ours) | 77.8 | 71.8 | 74.6 | 68.8 | 73.1 |

* No previous method can understand visual reference prompts, so these entries are not applicable.

Qualitative comparison

RAS accurately follows complex prompts to segment targets, even when they are small or occluded.

📚BibTeX

@article{cao2025refer,
  title={Refer to Anything with Vision-Language Prompts},
  author={Shengcao Cao and Zijun Wei and Jason Kuen and Kangning Liu and Lingzhi Zhang and Jiuxiang Gu and HyunJoon Jung and Liang-Yan Gui and Yu-Xiong Wang},
  journal={arXiv preprint arXiv:2506.05342},
  year={2025}
}