Leveraging Text-to-Video Diffusion Models for Synthetic Gait Dataset Generation and Appearance-Robust Person Recognition / Boscolo, F., Lamberti, F. - Electronic. - (In press). (22nd International Conference on Advanced Visual and Signal-Based Systems (AVSS), Lecce (IT), Aug. 31 - Sep. 3, 2026).

Leveraging Text-to-Video Diffusion Models for Synthetic Gait Dataset Generation and Appearance-Robust Person Recognition

Boscolo, Federico;Lamberti, Fabrizio
In press

Abstract

Gait recognition accuracy drops by 10-25% when subjects wear coats or carry bags, yet collecting diverse training data is costly. This paper proposes a text-to-video diffusion pipeline to synthesize realistic gait data. Unlike prior silhouette-generation approaches, we produce full video sequences with identity-consistent faces, enabling direct application to multimodal face-gait access control systems. Using a custom three-stage pipeline, we generate 200 synthetic identities across Normal Walking (NM), Coat-wearing (CL), and Bag-carrying (BG) conditions in indoor and outdoor environments. Sim-to-real transfer learning experiments on CASIA-B show that synthetic pre-training followed by fine-tuning on real NM data boosts CL accuracy from 31% to 95%. This matches fully supervised baselines without requiring real appearance-varied data, and it enables a balanced 50-50 fusion of face and gait modalities that reaches 99.6% accuracy.
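
The abstract does not say how the balanced 50-50 fusion of face and gait is computed. As a purely illustrative sketch, the snippet below assumes score-level fusion: each modality yields a cosine similarity between a probe embedding and an enrolled embedding, and the two scores are combined with equal weights. The function names (`cosine_similarity`, `fuse_scores`, `identify`), the embedding format, and the gallery layout are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate embedding: treat as no similarity
    return dot / (norm_a * norm_b)


def fuse_scores(face_score, gait_score, face_weight=0.5):
    """Weighted-sum fusion; face_weight=0.5 gives the balanced 50-50 case."""
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must lie in [0, 1]")
    return face_weight * face_score + (1.0 - face_weight) * gait_score


def identify(probe_face, probe_gait, gallery, face_weight=0.5):
    """Return (identity, fused_score) of the best match in the gallery.

    gallery is assumed to map identity -> (face_embedding, gait_embedding).
    """
    best_id, best_score = None, float("-inf")
    for identity, (face_emb, gait_emb) in gallery.items():
        score = fuse_scores(
            cosine_similarity(probe_face, face_emb),
            cosine_similarity(probe_gait, gait_emb),
            face_weight,
        )
        if score > best_score:
            best_id, best_score = identity, score
    return best_id, best_score


# Toy example with 2-D embeddings (hypothetical values, not real data).
gallery = {
    "alice": ([1.0, 0.0], [1.0, 0.0]),
    "bob": ([0.0, 1.0], [0.0, 1.0]),
}
match, score = identify([1.0, 0.0], [0.6, 0.8], gallery)
```

In this toy case the face score favors "alice" while the gait score slightly favors "bob"; with equal weights, neither modality alone decides the match. In such a scheme, equal weighting is mainly reasonable when both modalities are comparably reliable, which is the situation the abstract describes once gait accuracy under coat-wearing conditions improves.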
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3012337