Gait recognition accuracy drops by 10-25% when subjects wear coats or carry bags, yet collecting diverse training data is costly. This paper proposes a text-to-video diffusion pipeline that synthesizes realistic gait data. Unlike prior silhouette-generation approaches, we produce full video sequences with identity-consistent faces, enabling direct application to multimodal face-gait access control systems. Using a custom three-stage pipeline, we generate 200 synthetic identities across Normal Walking (NM), Coat-wearing (CL), and Bag-carrying (BG) conditions in indoor and outdoor environments. Sim-to-real transfer learning on CASIA-B shows that synthetic pre-training followed by fine-tuning on real NM data raises CL accuracy from 31% to 95%. This matches fully supervised baselines without requiring real appearance-varied data, and it enables a balanced 50-50 fusion of face and gait modalities that reaches 99.6% accuracy.
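The abstract does not say how the balanced 50-50 fusion is computed. As a minimal sketch, assuming score-level fusion by weighted sum after min-max normalization (a common choice, not necessarily the authors' method), it could look like the following. The function names, the `face_weight` parameter, and the example scores are all hypothetical.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1] so both modalities share a common range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are identical, so they carry no ranking information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(face_scores, gait_scores, face_weight=0.5):
    """Weighted-sum fusion; face_weight=0.5 is the balanced 50-50 case.

    Assumption: both lists hold similarity scores (higher is better) of one
    probe against the same enrolled identities, in the same order.
    """
    if len(face_scores) != len(gait_scores):
        raise ValueError("both modalities must score the same identities")
    face = min_max_normalize(face_scores)
    gait = min_max_normalize(gait_scores)
    return [face_weight * f + (1.0 - face_weight) * g
            for f, g in zip(face, gait)]


# Hypothetical scores of one probe against three enrolled identities.
face_scores = [0.2, 0.9, 0.4]
gait_scores = [0.1, 0.5, 0.7]

fused = fuse_scores(face_scores, gait_scores)
best_match = max(range(len(fused)), key=lambda i: fused[i])
```

With equal weights neither modality dominates the decision, which only makes sense when both are comparably reliable. Shifting `face_weight` away from 0.5 would favour one modality over the other.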
Leveraging Text-to-Video Diffusion Models for Synthetic Gait Dataset Generation and Appearance-Robust Person Recognition / Boscolo, F., Lamberti, F. - ELECTRONIC. - (In press). (22nd International Conference on Advanced Visual and Signal-Based Systems (AVSS) Lecce (IT) Aug. 31 - Sep. 1-2-3, 2026).
Leveraging Text-to-Video Diffusion Models for Synthetic Gait Dataset Generation and Appearance-Robust Person Recognition
Boscolo, Federico;Lamberti, Fabrizio
In press
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/3012337
Warning: the data shown have not been validated by the university.
