Crowd-sourcing for Less-resourced Languages: Lingua Libre for Polish

Mathilde Hutin, Marc Allassonnière-Tang


Abstract
Oral corpora for linguistic inquiry are frequently built from news, radio, and/or TV broadcasts, and sometimes from laboratory recordings. Most existing corpora are restricted to languages for which a large amount of data is available. Furthermore, such corpora are not always accessible under a free open-access license. We propose a crowd-sourced alternative to fill this gap. Lingua Libre is the participatory linguistic media library hosted by Wikimedia France. It includes recordings in more than 140 languages, provided by more than 750 speakers worldwide, who voluntarily recorded word entries in their native language and made them available under a Creative Commons license. In the present study, we take Polish, a less-resourced language in terms of phonetic data, as an example, and compare phonetic observations based on the Lingua Libre data with those reported in previous linguistic studies. We observe that the data from Lingua Libre partially matches the phonetic inventory of Polish as described in previous studies, but that the acoustic values are less precise, thus showing both the potential and the limitations of Lingua Libre for phonetic research.
Anthology ID:
2022.sigul-1.6
Volume:
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Maite Melero, Sakriani Sakti, Claudia Soria
Venue:
SIGUL
SIG:
SIGUL
Publisher:
European Language Resources Association
Pages:
41–47
URL:
https://aclanthology.org/2022.sigul-1.6
Cite (ACL):
Mathilde Hutin and Marc Allassonnière-Tang. 2022. Crowd-sourcing for Less-resourced Languages: Lingua Libre for Polish. In Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, pages 41–47, Marseille, France. European Language Resources Association.
Cite (Informal):
Crowd-sourcing for Less-resourced Languages: Lingua Libre for Polish (Hutin & Allassonnière-Tang, SIGUL 2022)
PDF:
https://aclanthology.org/2022.sigul-1.6.pdf