2022
A Speech Recognizer for Frisian/Dutch Council Meetings
Martijn Bentum | Louis ten Bosch | Henk van den Heuvel | Simone Wills | Domenique van der Niet | Jelske Dijkstra | Hans Van de Velde
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We developed a bilingual Frisian/Dutch speech recognizer for council meetings in Fryslân (the Netherlands). Both Frisian and Dutch are spoken during these meetings, and code-switching between the two languages occurs frequently. The new speech recognizer is based on FAME!, an existing speech recognizer for Frisian and Dutch that was trained and tested on historical radio broadcasts. Adapting a speech recognizer to the council meeting domain is challenging because of acoustic background noise, speaker overlap and the jargon typically used in council meetings. To train the new recognizer, we used the radio broadcast materials from the development of the FAME! recognizer and added newly created, manually transcribed audio recordings of council meetings from eleven Frisian municipalities, the Frisian provincial council and the Frisian water board. The council meeting recordings consist of 49 hours of speech: 26 hours of Frisian and 23 hours of Dutch. Furthermore, from the same sources, we obtained council meeting texts containing 11 million words: 1.1 million Frisian words and 9.9 million Dutch words. We describe the methods used to train the new recognizer, report the observed word error rates, and analyze the remaining errors.
PoS Tagging, Lemmatization and Dependency Parsing of West Frisian
Wilbert Heeringa | Gosse Bouma | Martha Hofman | Jelle Brouwer | Eduard Drenth | Jan Wijffels | Hans Van de Velde
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present a lemmatizer/PoS tagger/dependency parser for West Frisian, trained on a corpus of 44,714 words in 3,126 sentences annotated according to the guidelines of Universal Dependencies version 2. PoS tags were assigned by applying a Dutch PoS tagger either to a Dutch word-by-word translation or to the sentences of a Dutch parallel text. The best results were obtained with word-by-word translations created by the previous version of the Frisian translation program Oersetter. Morphological and syntactic annotations were likewise generated on the basis of a Dutch word-by-word translation. The performance of the lemmatizer/tagger/annotator trained with default parameters was compared to the performance obtained with the parameter values used for training on the LassySmall UD 2.5 corpus. We study the effects of different hyperparameter settings on the accuracy of the annotation pipeline. The Frisian lemmatizer/PoS tagger/dependency parser is released as a web app and as a web service.
2016
A Longitudinal Bilingual Frisian-Dutch Radio Broadcast Database Designed for Code-Switching Research
Emre Yilmaz | Maaike Andringa | Sigrid Kingma | Jelske Dijkstra | Frits van der Kuip | Hans Van de Velde | Frederik Kampstra | Jouke Algra | Henk van den Heuvel | David van Leeuwen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present a new speech database containing 18.5 hours of annotated radio broadcasts in the Frisian language. Frisian is mostly spoken in the province of Fryslân and is the second official language of the Netherlands. The recordings were collected from the archives of Omrop Fryslân, the regional public broadcaster of the province. The database spans almost 50 years. Native speakers of Frisian are mostly bilingual and often code-switch in daily conversations due to the extensive influence of the Dutch language. Given the longitudinal and code-switching nature of the data, an appropriate annotation protocol was designed, and the data was manually annotated with orthographic transcriptions, speaker identities, dialect information, code-switching details and background noise/music information.