Your ViT is Secretly an Image Segmentation Model / Kerssies, Tommie; Cavagnero, Niccolo; Hermans, Alexander; Norouzi, Narges; Averta, Giuseppe; Leibe, Bastian; Dubbelman, Gijs; De Geus, Daan. - (2025), pp. 25303-25313. (Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 (CVPR), held in Nashville (USA), June 11th-15th, 2025) [10.1109/CVPR52734.2025.02356].

Your ViT is Secretly an Image Segmentation Model

Niccolo Cavagnero; Giuseppe Averta
2025

Abstract

Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-Only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4× faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.
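The abstract describes the architecture only at a high level: no adapter, no pixel decoder, no separate Transformer decoder, just the ViT itself plus lightweight prediction heads. As a concrete illustration, the following minimal PyTorch sketch shows what an encoder-only mask Transformer of this kind could look like: plain ViT blocks throughout, learnable query tokens appended to the patch tokens in the last few blocks, and per-query masks read out as dot products between query and patch embeddings. All module names, dimensions, and design details (e.g., EncoderOnlySegmenter, num_query_blocks, the dot-product mask head) are illustrative assumptions, not the authors' reference implementation; the official code is available at the link in the abstract.

# Minimal, illustrative sketch of an encoder-only mask Transformer built from
# plain ViT blocks (PyTorch). The abstract does not spell out the mechanism, so
# every name, dimension, and design detail below is an assumption made for
# illustration only.
import torch
import torch.nn as nn


class ViTBlock(nn.Module):
    """A standard pre-norm Transformer encoder block (self-attention + MLP)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)
        x = x + a
        return x + self.mlp(self.norm2(x))


class EncoderOnlySegmenter(nn.Module):
    """Hypothetical encoder-only segmenter: learnable queries join the patch
    tokens for the last `num_query_blocks` ViT blocks; masks are the dot
    products between projected query tokens and patch tokens."""
    def __init__(self, dim=384, depth=12, num_queries=100, num_classes=150,
                 patch=16, img_size=224, num_query_blocks=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.grid = img_size // patch
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        self.blocks = nn.ModuleList([ViTBlock(dim) for _ in range(depth)])
        self.num_query_blocks = num_query_blocks
        self.queries = nn.Parameter(torch.zeros(1, num_queries, dim))
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_proj = nn.Linear(dim, dim)

    def forward(self, images):
        x = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos
        # Early blocks see only patch tokens, exactly like a plain ViT.
        for blk in self.blocks[: -self.num_query_blocks]:
            x = blk(x)
        # Last blocks jointly process patch tokens and learnable query tokens.
        q = self.queries.expand(x.size(0), -1, -1)
        tokens = torch.cat([q, x], dim=1)
        for blk in self.blocks[-self.num_query_blocks:]:
            tokens = blk(tokens)
        q, x = tokens[:, : q.size(1)], tokens[:, q.size(1):]
        # Per-query class logits and per-query mask logits over the patch grid.
        class_logits = self.class_head(q)
        mask_logits = torch.einsum("bqc,bpc->bqp", self.mask_proj(q), x)
        b = images.size(0)
        return class_logits, mask_logits.view(b, -1, self.grid, self.grid)


if __name__ == "__main__":
    model = EncoderOnlySegmenter()
    cls, masks = model(torch.randn(2, 3, 224, 224))
    print(cls.shape, masks.shape)  # (2, 100, 151), (2, 100, 14, 14)

Note that nothing in this sketch adds multi-scale feature generation or fusion: the only task-specific parts are the query tokens and two small heads, which is the kind of architectural simplicity the abstract credits for the reported speed-up.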
ISBN: 979-8-3315-4364-8
Files in this record:

2503.19108v1.pdf (open access)
Type: 2. Post-print / Author's Accepted Manuscript
License: Public - All rights reserved
Size: 6.04 MB
Format: Adobe PDF

Your_ViT_is_Secretly_an_Image_Segmentation_Model.pdf (restricted access)
Type: 2a. Post-print, published version / Version of Record
License: Non-public - Private/restricted access
Size: 1.98 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3001423