Automatic Partitioning of a Code-Switched Speech Corpus Using Mixed-Integer Programming

Joshua Miles Jansen van Vüren, Febe de Wet, Thomas Niesler


Abstract
Training, development and test partitions of speech corpora are usually defined by hand. However, for the dataset under investigation, which contains a large number of speakers, eight different languages and code-switching between all the languages, manual partitioning is not feasible. We therefore treat partitioning as a resource allocation problem and propose to solve it automatically and optimally using mixed-integer linear programming. With this approach, we partition a new 41.6-hour multilingual corpus of code-switched speech into training, development and test partitions while maintaining a fixed number of speakers and a specific amount of code-switched speech in the development and test partitions. For this newly partitioned corpus, we present baseline speech recognition results using a state-of-the-art multilingual transformer model (Wav2Vec2-XLS-R) and show that excluding very short utterances (< 1 s) substantially improves speech recognition performance.
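
Illustrative sketch (not taken from the paper): the abstract describes partitioning as a resource allocation problem solved with mixed-integer linear programming. The toy example below shows one way such a problem could be posed using SciPy's `milp` solver. The speaker data, the target values, the variable layout and the absolute-deviation objective are all assumptions made for illustration; the authors' actual formulation may differ.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Hypothetical toy data: minutes of code-switched speech per speaker.
cs_minutes = np.array([4.0, 6.0, 5.0, 5.0, 10.0, 12.0])
PARTS = ("train", "dev", "test")
HELD_OUT_SPEAKERS = 2   # assumed fixed speaker count for dev and for test
TARGET_CS = 10.0        # assumed code-switched minutes wanted in dev and in test

S, P = len(cs_minutes), len(PARTS)
n_x = S * P             # binary x[s, p], flattened to index s * P + p
n = n_x + 2             # plus one continuous deviation variable each for dev, test

rows, lower, upper = [], [], []

def add_constraint(row, lo, hi):
    """Record one linear constraint lo <= row @ variables <= hi."""
    rows.append(row)
    lower.append(lo)
    upper.append(hi)

# Every speaker is assigned to exactly one partition.
for s in range(S):
    row = np.zeros(n)
    row[s * P:(s + 1) * P] = 1
    add_constraint(row, 1, 1)

for k, p in enumerate((1, 2)):          # p = 1 is dev, p = 2 is test
    # Fixed number of speakers in this held-out partition.
    count = np.zeros(n)
    count[p:n_x:P] = 1
    add_constraint(count, HELD_OUT_SPEAKERS, HELD_OUT_SPEAKERS)

    # |code-switched minutes - target| <= deviation, written as two inequalities.
    cs_row = np.zeros(n)
    cs_row[p:n_x:P] = cs_minutes
    above = cs_row.copy()
    above[n_x + k] = 1                  # cs_sum + deviation >= target
    add_constraint(above, TARGET_CS, np.inf)
    below = cs_row.copy()
    below[n_x + k] = -1                 # cs_sum - deviation <= target
    add_constraint(below, -np.inf, TARGET_CS)

# Objective: minimise the summed deviations from the code-switching target.
c = np.zeros(n)
c[n_x:] = 1

integrality = np.zeros(n)
integrality[:n_x] = 1                   # assignment variables are integer (binary)
bounds = Bounds(np.zeros(n), np.concatenate([np.ones(n_x), [np.inf, np.inf]]))

res = milp(
    c=c,
    constraints=LinearConstraint(np.array(rows), lower, upper),
    integrality=integrality,
    bounds=bounds,
)

# Recover the partition of each speaker from the binary assignment matrix.
assignment = np.rint(res.x[:n_x]).reshape(S, P).argmax(axis=1)
partition = {
    name: [s for s in range(S) if assignment[s] == p]
    for p, name in enumerate(PARTS)
}
print(partition, "total deviation:", res.fun)
```

In this sketch the continuous deviation variables make the problem genuinely mixed-integer, and keeping all of a speaker's speech in one partition follows from assigning speakers rather than utterances. A real corpus would add further terms, for example per-language durations, in the same way.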
Anthology ID:
2024.lrec-main.174
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
1944–1952
URL:
https://aclanthology.org/2024.lrec-main.174
Cite (ACL):
Joshua Miles Jansen van Vüren, Febe de Wet, and Thomas Niesler. 2024. Automatic Partitioning of a Code-Switched Speech Corpus Using Mixed-Integer Programming. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1944–1952, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Automatic Partitioning of a Code-Switched Speech Corpus Using Mixed-Integer Programming (Jansen van Vüren et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.174.pdf