Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition / D'Asaro, Federico; Márquez Villacís, Juan José; Rizzo, Giuseppe; Bottino, Andrea. - (In press). (Paper presented at CLiC-it 2024 – Tenth Italian Conference on Computational Linguistics).

Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition

D'Asaro, Federico; Rizzo, Giuseppe; Bottino, Andrea
In press

Abstract

Large Speech Models (LSMs), pre-trained on extensive unlabeled data using Self-Supervised Learning (SSL) or Weakly-Supervised Learning (WSL), are increasingly employed for tasks like Speech Emotion Recognition (SER). Their capability to extract general-purpose features makes them a strong alternative to low-level descriptors. Most studies focus on English, with limited research on other languages. We evaluate English-Only and Multilingual LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for SER in eight languages, stacking three alternative downstream classifiers of increasing complexity, named Linear, Non-Linear, and Multi-Layer, on top of the LSMs. Results indicate that Whisper models perform best with a simple linear classifier on features from the last transformer layer, while Wav2Vec 2.0 models benefit from features taken at middle and early transformer layers. When comparing English-Only and Multilingual LSMs, we find that Whisper models benefit from multilingual pre-training: they excel in Italian, Canadian French, French, Spanish, and German, and perform competitively on Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving the highest performance in Greek and Egyptian Arabic.
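As a concrete illustration of the pipeline the abstract describes, the Python sketch below shows how a frozen Wav2Vec 2.0 model can serve as a layer-wise feature extractor with a Linear downstream head. It is a minimal sketch, not the authors' implementation: the checkpoint name, the mean-pooling over time, the example layer index, and the 8-class output size are all illustrative assumptions; only the overall scheme (frozen LSM, per-layer features, linear classifier) comes from the abstract.

    import torch
    from transformers import AutoFeatureExtractor, Wav2Vec2Model

    # Load a Wav2Vec 2.0 checkpoint as a frozen feature extractor
    # (checkpoint name is illustrative; the paper compares English-Only
    # and multilingual models such as XLS-R).
    model_name = "facebook/wav2vec2-base"
    feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
    model = Wav2Vec2Model.from_pretrained(model_name)
    model.eval()

    def layer_features(waveform, sampling_rate=16000, layer=6):
        """Mean-pool the hidden states of one transformer layer.

        The paper reports that Wav2Vec 2.0 works best with early/middle
        layers; layer=6 here is only an example choice.
        """
        inputs = feature_extractor(waveform, sampling_rate=sampling_rate,
                                   return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the CNN front-end output; index `layer`
        # selects a transformer block. Averaging over time yields one
        # feature vector per utterance.
        return outputs.hidden_states[layer].mean(dim=1)

    # The simplest downstream head from the paper: a Linear classifier
    # over the pooled features (8 emotion classes is an assumption).
    num_emotions = 8
    classifier = torch.nn.Linear(model.config.hidden_size, num_emotions)
    # logits = classifier(layer_features(waveform))

For Whisper models, the analogous sketch would pool the encoder's last-layer hidden states instead, in line with the reported finding that a simple linear classifier on last-layer Whisper features performs best.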
Files in this product:
File: manuscript-clic2024.pdf (open access)
Type: 2. Post-print / Author's Accepted Manuscript
License: Creative Commons
Size: 1.38 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2992803