Corpus Creation and Automatic Alignment of Historical Dutch Dialect Speech

Martijn Bentum, Eric Sanders, Antal P.J. van den Bosch, Douwe Zeldenrust, Henk van den Heuvel


Abstract
The Dutch Dialect Database (also known as the ‘Nederlandse Dialectenbank’) contains dialectal variations of Dutch that were recorded all over the Netherlands in the second half of the twentieth century. A subset of these recordings, about 300 hours in total, was enriched with manual orthographic transcriptions that use non-standard approximations of dialectal speech. In this paper, we describe the creation of a corpus containing both the audio recordings and their corresponding transcriptions, and we focus on our method for aligning the recordings with the transcriptions and the metadata.
Anthology ID:
2024.lrec-main.357
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
4021–4029
URL:
https://aclanthology.org/2024.lrec-main.357
Cite (ACL):
Martijn Bentum, Eric Sanders, Antal P.J. van den Bosch, Douwe Zeldenrust, and Henk van den Heuvel. 2024. Corpus Creation and Automatic Alignment of Historical Dutch Dialect Speech. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4021–4029, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Corpus Creation and Automatic Alignment of Historical Dutch Dialect Speech (Bentum et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.357.pdf