
Detecting and Mitigating Challenges in Zero-Shot Video Summarization with Video LLMs / Cagliero, Luca; Vaiani, Lorenzo; Pastor, Eliana; Koudounas, Alkis; Baralis, Elena; Mazzia, Vittorio; Pollastrini, Sandro; Gueudre, Thomas; Giollo, Manuel; Amberti, Daniele; Wu, Yue. - (2025), pp. 286-301. (Paper presented at the 63rd Annual Meeting of the Association for Computational Linguistics: ACL 2025, held in Vienna (AT), 27 Jul - 1 Aug 2025).

Detecting and Mitigating Challenges in Zero-Shot Video Summarization with Video LLMs

Cagliero, Luca; Vaiani, Lorenzo; Pastor, Eliana; Koudounas, Alkis; Baralis, Elena; Mazzia, Vittorio; Pollastrini, Sandro; Gueudre, Thomas; Giollo, Manuel; Amberti, Daniele; Wu, Yue
2025

Abstract

Video summarization aims to generate a condensed textual version of an original video. Summaries may consist of either plain text or a shortlist of salient events, possibly including temporal or spatial references. Video Large Language Models (VLLMs) exhibit impressive zero-shot capabilities in video analysis. However, their performance varies significantly according to the LLM prompt, the characteristics of the video, and the properties of the training data and LLM architecture. In this work, we thoroughly evaluate the zero-shot summarization performance of four state-of-the-art open-source VLLMs specifically designed to address spatial and temporal reasoning. In light of the detected summarization issues, we propose different cost-effective mitigation strategies, based on Chain-of-Thought prompting, that involve the injection of knowledge extracted by external, lightweight models. To perform the VLLM evaluation, we design a new video summarization benchmark consisting of 100 videos with varying characteristics in terms of domain, duration, and spatio-temporal properties. Videos are manually annotated by three independent human experts with plain text, event-based, and spatio-temporal summaries. The experimental evaluation shows that VLLMs significantly benefit from prompting a list of recognized actions, whereas injecting automatically recognized objects and scene changes improves spatially contextualized and event-based summaries, respectively, in specific cases.
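As an illustration of the knowledge-injection strategy described in the abstract, the following is a minimal sketch of how a Chain-of-Thought summarization prompt might be augmented with actions recognized by an external, lightweight model. This is not the authors' implementation: the prompt wording, the function name, and the example action list are assumptions made purely for illustration.

```python
# Minimal sketch of CoT prompting with injected action knowledge.
# NOTE: this is NOT the paper's implementation; the prompt wording
# and the action-recognizer output below are hypothetical.

from typing import List

def build_cot_summarization_prompt(actions: List[str]) -> str:
    """Compose a Chain-of-Thought prompt that injects actions
    recognized by an external, lightweight model (assumed given)."""
    action_hints = "; ".join(actions)
    return (
        "You are given a video to summarize.\n"
        f"Recognized actions (from an external model): {action_hints}.\n"
        "Step 1: List the salient events in temporal order.\n"
        "Step 2: Note where each event occurs (spatial context).\n"
        "Step 3: Write a concise plain-text summary grounded in Steps 1-2."
    )

if __name__ == "__main__":
    # Hypothetical output of a lightweight action-recognition model.
    recognized = ["person opens door", "person pours coffee", "dog enters room"]
    print(build_cot_summarization_prompt(recognized))
```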
ISBN: 979-8-89176-256-5
Files in this record:

File: 2025.findings-acl.16.pdf
Access: open access
Type: 2a Post-print publisher version / Version of Record
License: Creative Commons
Size: 283.08 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3002216