Rethinking Cross-Modal Interaction for Efficient Referring Image Segmentation / Cuttano, Claudia; Pistilli, Francesca; Cermelli, Fabio; Averta, Giuseppe. - In: IEEE ROBOTICS AND AUTOMATION LETTERS. - ISSN 2377-3766. - 10:8(2025), pp. 7811-7818. [10.1109/lra.2025.3579604]
Rethinking Cross-Modal Interaction for Efficient Referring Image Segmentation
Cuttano, Claudia; Pistilli, Francesca; Cermelli, Fabio; Averta, Giuseppe
2025
Abstract
Referring Image Segmentation (RIS), the task of finding and segmenting objects in an image conditioned on a natural language description, is crucial for human-robot collaboration. However, current RIS methods often rely on computationally intensive Transformer-based self-attention for visual-text alignment, which hinders deployment on robots, especially those with limited computational resources. Indeed, beyond accuracy, practical robotic applications demand efficient models with small footprints. This letter introduces ERIS, an Efficient RIS approach designed for real-world deployment. ERIS achieves effective multi-modal interaction through a novel dual-branch architecture: a Visual Text Alignment branch and a Text Visual Refinement branch. This design implements bilateral alignment between the textual and visual modalities without the computational burden of self-attention. Of note, the progressive alignment in ERIS enhances interpretability, revealing how textual cues guide segmentation. For the sake of efficiency, our alignment strategy produces structured embeddings that can be directly mapped into the final segmentation mask, with no need for additional segmentation heads. As a result, the footprint of ERIS scales linearly with the number of visual and text tokens, making it suitable for both cloud-based and edge deployment. Experimental results demonstrate that ERIS achieves competitive or superior performance compared to state-of-the-art methods while significantly reducing computational cost, proving that efficiency and accuracy are not mutually exclusive.
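The efficiency claim in the abstract — alignment that scales linearly in the number of visual and text tokens, rather than quadratically as with joint self-attention — can be illustrated with a toy complexity sketch. This is an illustration under generic assumptions, not the ERIS implementation: `cross_attention` and `attention_flops` are hypothetical helpers written for this example only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text):
    """Align N visual tokens (N, d) to M text tokens (M, d).

    The score matrix is (N, M): for a fixed, short referring expression
    (M small), cost grows linearly with the number of visual tokens N.
    """
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)   # (N, M)
    return softmax(scores, axis=-1) @ text  # (N, d)

def attention_flops(n_vis, n_txt, d, mode):
    """Rough multiply-add count for the attention score/value products."""
    if mode == "cross":   # visual attends to text only: O(N * M * d)
        return 2 * n_vis * n_txt * d
    if mode == "self":    # concatenated sequence attends to itself: O((N + M)^2 * d)
        n = n_vis + n_txt
        return 2 * n * n * d
    raise ValueError(mode)

# Doubling the visual tokens doubles the cross-attention cost,
# but more than triples the joint self-attention cost here.
cheap = attention_flops(200, 20, 64, "cross")
cheap2x = attention_flops(400, 20, 64, "cross")
dear = attention_flops(200, 20, 64, "self")
dear2x = attention_flops(400, 20, 64, "self")
```

Under this accounting, `cheap2x / cheap` is exactly 2, while `dear2x / dear` is (420/220)² ≈ 3.6 — the gap that motivates avoiding self-attention when visual token counts grow with image resolution.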
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/3005531
Warning: the displayed data have not been validated by the university.
