Exploring the Adaptability of Large Speech Models to Non-Verbal Vocalization Task / Márquez Villacis, Juan José; D'Asaro, Federico; Rizzo, Giuseppe; Bottino, Andrea. - (In press). (Paper presented at CLiC-it 2025 – Eleventh Italian Conference on Computational Linguistics, held in Cagliari (ITA), September 24-26, 2025).
Exploring the Adaptability of Large Speech Models to Non-Verbal Vocalization Task
Márquez Villacis, Juan José; D'Asaro, Federico; Rizzo, Giuseppe; Bottino, Andrea
In press
Abstract
Large Speech Models (LSMs), pre-trained on extensive speech corpora, have recently emerged as powerful foundations in the audio processing field, demonstrating strong transfer capabilities to downstream tasks such as speaker identification and emotion recognition. However, while these models excel on speech-centric tasks, limited research has investigated their adaptability to Non-Verbal Vocalization (NVV) tasks, which involve vocal bursts like laughter, sighs, shrieks, and moans. In this work, we examine how well LSMs, specifically Wav2Vec 2.0, HuBERT, WavLM, and Whisper, can be adapted to NVV tasks. We conduct experiments using both linear probing to evaluate the pre-trained knowledge relevant to NVVs, and Parameter-Efficient Fine-Tuning (PEFT) techniques, including LoRA, Adapters, and Prompt Tuning. Experimental results on several NVV datasets (ASVP-ESD, CNVVE, Non-Verbal Vocalization Dataset, ReCANVo, VIVAE) indicate that Whisper-based models consistently achieve superior performance, which is further enhanced through the application of LoRA. Additionally, our layer-wise analysis reveals that applying PEFT specifically to layers with lower NVV information is key to effective model adaptation, providing valuable insights for optimizing fine-tuning strategies in future work.
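As a rough illustration of the fine-tuning setup the abstract describes, the sketch below applies LoRA to a Whisper encoder and attaches a linear classification head for NVV labels, using the Hugging Face `transformers` and `peft` libraries. The checkpoint name, LoRA rank, target modules, and class count are illustrative assumptions, not the paper's reported configuration.

```python
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel
from peft import LoraConfig, get_peft_model

# Hypothetical configuration: checkpoint, LoRA rank, and class count are
# illustrative choices, not the paper's reported setup.
MODEL_NAME = "openai/whisper-base"
NUM_CLASSES = 6  # e.g. laughter, sigh, shriek, moan, ...

feature_extractor = WhisperFeatureExtractor.from_pretrained(MODEL_NAME)
encoder = WhisperModel.from_pretrained(MODEL_NAME).encoder
hidden_size = encoder.config.hidden_size  # d_model of the encoder

# Inject LoRA adapters into the attention projections; choosing *which*
# encoder layers to adapt is what the paper's layer-wise analysis probes.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)
encoder = get_peft_model(encoder, lora_config)  # freezes base weights

class NVVClassifier(nn.Module):
    """LoRA-adapted Whisper encoder + mean pooling + linear head."""
    def __init__(self, encoder, hidden_size, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, input_features):
        hidden = self.encoder(input_features).last_hidden_state
        pooled = hidden.mean(dim=1)  # average over time frames
        return self.head(pooled)

model = NVVClassifier(encoder, hidden_size, NUM_CLASSES)

# Usage: log-mel features in, class logits out.
# inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
# logits = model(inputs.input_features)
```

With this setup only the LoRA matrices and the linear head are trainable, which is what makes the approach parameter-efficient; restricting `target_modules` to specific encoder layers would mirror the layer-wise adaptation the paper investigates.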
| File | Type | License | Size | Format |
|---|---|---|---|---|
| CLIC_it_2025_NVV.pdf (open access) | 2. Post-print / Author's Accepted Manuscript | Creative Commons | 837.36 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/3002059