XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words

Robin Algayres; Pablo Diego-Simon; Benoît Sagot; Emmanuel Dupoux

doi:10.18653/v1/2023.findings-emnlp.810

Abstract

Because the speech stream contains no explicit word boundaries, segmenting spoken sentences into word units without text supervision is particularly challenging. In this work, we leverage the most recent self-supervised speech models, which have been shown to adapt quickly to new tasks through fine-tuning, even in low-resource conditions. Taking inspiration from semi-supervised learning, we fine-tune an XLS-R model to predict word boundaries that are themselves produced by top-tier speech segmentation systems: DPDP, VG-HuBERT and DP-Parse. Once XLS-R is fine-tuned, it is used to infer new word boundary labels, which are used in turn for another fine-tuning step. Our method consistently improves the performance of each system and sets a new state of the art that is, on average, 130% higher than the previous one, as measured by the F1 score on correctly discovered word tokens on five corpora featuring different languages. Finally, our system can segment speech from languages unseen during fine-tuning in a zero-shot fashion.
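As an illustration of the metric named in the abstract, the following is a minimal sketch of an F1 score over word tokens. It rests on assumptions that the abstract does not state: boundaries are given as integer frame indices, and a token counts as correctly discovered only when both of its boundaries match a reference token exactly. The function names are hypothetical, and the evaluation actually used in the paper may differ (for example, by allowing a tolerance window).

```python
def boundaries_to_tokens(boundaries):
    """Turn a list of boundary positions into a set of (start, end) tokens."""
    b = sorted(set(boundaries))
    return set(zip(b[:-1], b[1:]))


def token_f1(gold_boundaries, predicted_boundaries):
    """F1 over word tokens, assuming exact matching of both token edges."""
    gold = boundaries_to_tokens(gold_boundaries)
    pred = boundaries_to_tokens(predicted_boundaries)
    hits = len(gold & pred)  # tokens whose start and end both match
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of three tokens is recovered exactly.
score = token_f1([0, 10, 25, 40], [0, 10, 30, 40])
```

In this toy example a single misplaced boundary breaks the two tokens that share it, which is why a token-level score is stricter than a boundary-level one.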

Anthology ID: 2023.findings-emnlp.810
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 12103–12112
URL: https://aclanthology.org/2023.findings-emnlp.810
DOI: 10.18653/v1/2023.findings-emnlp.810
Cite (ACL): Robin Algayres, Pablo Diego-Simon, Benoît Sagot, and Emmanuel Dupoux. 2023. XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12103–12112, Singapore. Association for Computational Linguistics.
Cite (Informal): XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words (Algayres et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.810.pdf
