Sport Athletes Re-Identification in Broadcast Images via 3D Pose Estimation / Rossi, Luca Francesco; Sanna, Andrea; Manuri, Federico; Donna Bianco, Mattia. - (In press). (Paper presented at the 14th EAI International Conference: ArtsIT, Interactivity & Game Creation, held in Dubai (UAE), 7-9 November 2025).

Sport Athletes Re-Identification in Broadcast Images via 3D Pose Estimation

Rossi, Luca Francesco; Sanna, Andrea; Manuri, Federico; Donna Bianco, Mattia
In press

Abstract

Athlete Re-Identification (ReID) in sports broadcast footage presents unique challenges due to occlusions, similar uniforms, and limited resolution. This work investigates whether ReID performance can be enhanced by incorporating 3D pose information into a Multiscale Vision Transformer (MViT) architecture. MotionBERT, a state-of-the-art monocular 3D human pose estimator, is used to extract joint coordinates from single-view broadcast images, and these pose features are injected into the visual pipeline of an MViTv2 backbone. This design aims to guide the ReID model toward semantically meaningful body parts, thereby improving feature discrimination in visually ambiguous scenarios. The system is evaluated on the DeepSportRadar basketball dataset; the results indicate that pose-injected embeddings yield a substantial improvement in mean Average Precision (mAP) on the test set (from 77.1% to 80.6%), along with minor gains in top-k Cumulative Matching Characteristic (CMC) ranks. A t-SNE-based qualitative analysis of the learned embedding space reveals increased cluster compactness and inter-class separability when 3D pose cues are used, confirming their contribution to semantic alignment. These findings demonstrate that, even in monocular, low-resource settings, 3D pose information can be effectively integrated into ReID pipelines to boost both precision and robustness.
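The record does not include code, so the following is only a minimal, hypothetical sketch of the kind of pose injection the abstract describes: 3D joint coordinates (as a MotionBERT-style estimator would produce, assumed here as 17 joints with x, y, z values) are encoded by a small MLP and fused with the visual embedding of a backbone standing in for MViTv2. The class name PoseInjectedReID, the feature dimensions, and the concatenation-plus-projection fusion are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PoseInjectedReID(nn.Module):
    """Fuses a visual embedding with an embedding of 3D joint coordinates."""

    def __init__(self, visual_backbone: nn.Module, visual_dim: int = 768,
                 num_joints: int = 17, pose_dim: int = 128, embed_dim: int = 512):
        super().__init__()
        self.backbone = visual_backbone  # stand-in for an MViTv2 feature extractor
        # Small MLP that encodes the flattened (x, y, z) joint coordinates.
        self.pose_encoder = nn.Sequential(
            nn.Linear(num_joints * 3, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, pose_dim),
        )
        # Projection head that mixes visual and pose cues into a single ReID embedding.
        self.fusion = nn.Sequential(
            nn.Linear(visual_dim + pose_dim, embed_dim),
            nn.BatchNorm1d(embed_dim),
        )

    def forward(self, images: torch.Tensor, joints_3d: torch.Tensor) -> torch.Tensor:
        # images:    (B, 3, H, W) athlete crops from broadcast frames
        # joints_3d: (B, num_joints, 3) coordinates from a monocular pose estimator
        visual_feat = self.backbone(images)                  # (B, visual_dim)
        pose_feat = self.pose_encoder(joints_3d.flatten(1))  # (B, pose_dim)
        embedding = self.fusion(torch.cat([visual_feat, pose_feat], dim=1))
        # L2-normalised embeddings are compared by cosine/Euclidean distance
        # when ranking gallery candidates for each query.
        return F.normalize(embedding, dim=1)


if __name__ == "__main__":
    # Dummy backbone so the sketch runs end-to-end without pretrained MViTv2 weights.
    dummy_backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(768))
    model = PoseInjectedReID(dummy_backbone)
    imgs = torch.randn(4, 3, 256, 128)
    joints = torch.randn(4, 17, 3)
    print(model(imgs, joints).shape)  # torch.Size([4, 512])

In a setup of this kind, the resulting embeddings would be the vectors evaluated with mAP and CMC ranks and visualised with t-SNE, as described in the abstract; the actual fusion point inside the MViTv2 pipeline may differ in the paper.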
Files in this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3003495