Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition

Federico D’asaro, Juan José Márquez Villacís, Giuseppe Rizzo, Andrea Bottino


Abstract
Large Speech Models (LSMs), pre-trained on extensive unlabeled data using Self-Supervised Learning (SSL) or Weakly-Supervised Learning (WSL), are increasingly employed for tasks like Speech Emotion Recognition (SER). Their capability to extract general-purpose features makes them a strong alternative to low-level descriptors. Most studies focus on English, with limited research on other languages. We evaluate English-Only and Multilingual LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for SER in eight languages. We stack three alternative downstream classifiers of increasing complexity, named Linear, Non-Linear, and Multi-Layer, on top of the LSMs. Results indicate that Whisper models perform best with a simple linear classifier using features from the last transformer layer, while Wav2Vec 2.0 models benefit from features from the middle and early transformer layers. When comparing English-Only and Multilingual LSMs, we find that Whisper models benefit from multilingual pre-training, excelling in Italian, Canadian French, French, Spanish, and German, and performing competitively in Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving the highest performance in Greek and Egyptian Arabic.
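The simplest setup described in the abstract, a linear classifier over features taken from one transformer layer of an LSM used as a feature extractor, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the mean pooling over time, the array shapes, and the random placeholder features are all assumptions made for the example, and no real Wav2Vec 2.0 or Whisper model is loaded.

```python
import numpy as np


def pool_layer(hidden_states: np.ndarray, layer: int) -> np.ndarray:
    """Mean-pool the frame-level features of one transformer layer over time.

    hidden_states has shape (num_layers, num_frames, dim); the pooling
    strategy is an assumption, not taken from the paper.
    """
    return hidden_states[layer].mean(axis=0)


def linear_classifier(features: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> int:
    """Return the index of the highest-scoring emotion class (linear probe)."""
    logits = features @ weights + bias
    return int(np.argmax(logits))


# Placeholder features standing in for the per-layer outputs of a speech model
# (hypothetical sizes: 13 layers, 50 frames, 768-dimensional features).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(13, 50, 768))

# Hypothetical, untrained classifier parameters for 4 emotion classes.
weights = rng.normal(size=(768, 4))
bias = np.zeros(4)

last_layer_features = pool_layer(hidden_states, layer=-1)  # last transformer layer
mid_layer_features = pool_layer(hidden_states, layer=6)    # a middle layer

predicted_class = linear_classifier(last_layer_features, weights, bias)
```

Changing the `layer` argument is how the layer comparison mentioned in the abstract (last layer versus middle and early layers) would be expressed in this sketch; in practice the weights would be trained on labeled emotion data rather than sampled at random.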
Anthology ID:
2024.clicit-1.31
Volume:
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
Month:
December
Year:
2024
Address:
Pisa, Italy
Editors:
Felice Dell'Orletta, Alessandro Lenci, Simonetta Montemagni, Rachele Sprugnoli
Venue:
CLiC-it
Publisher:
CEUR Workshop Proceedings
Pages:
258–265
URL:
https://aclanthology.org/2024.clicit-1.31/
Cite (ACL):
Federico D’asaro, Juan José Márquez Villacís, Giuseppe Rizzo, and Andrea Bottino. 2024. Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), pages 258–265, Pisa, Italy. CEUR Workshop Proceedings.
Cite (Informal):
Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition (D’asaro et al., CLiC-it 2024)
PDF:
https://aclanthology.org/2024.clicit-1.31.pdf