Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection

Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, Matteo Negri


Abstract
While crowdsourcing is an established solution for facilitating and scaling the collection of speech data, the involvement of non-experts necessitates protocols to ensure final data quality. To reduce the costs of these essential controls, this paper investigates the use of Speech Foundation Models (SFMs) to automate the validation process, examining for the first time the cost/quality trade-off in data acquisition. Experiments conducted on French, German, and Korean data demonstrate that SFM-based validation has the potential to reduce reliance on human validation, resulting in an estimated cost saving of over 40.0% without degrading final data quality. These findings open new opportunities for more efficient, cost-effective, and scalable speech data acquisition.
Anthology ID:
2025.coling-main.455
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
6816–6826
URL:
https://aclanthology.org/2025.coling-main.455/
Cite (ACL):
Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, and Matteo Negri. 2025. Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6816–6826, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection (Lee et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.455.pdf