Monika Rind-Pawlowski


2024

Speech Recognition Corpus of the Khinalug Language for Documenting Endangered Languages
Zhaolin Li | Monika Rind-Pawlowski | Jan Niehues
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Automatic Speech Recognition (ASR) can be a valuable tool for documenting endangered languages. However, building ASR tools for these languages poses several difficult research challenges, notably data scarcity. In this paper, we show the whole process of creating a useful ASR tool for language documentation scenarios. We publish the first speech corpus for Khinalug, an endangered language spoken in Northern Azerbaijan. The corpus consists of 2.67 hours of labeled data from recordings of spontaneous speech about various topics. As Khinalug is an extremely low-resource language, we investigate the benefits of multilingual models for self-supervised learning and supervised learning, and achieve a Character Error Rate (CER) of 6.65 points and a Word Error Rate (WER) of 25.53 points. The benefits of multilingual models are further validated through experiments with three additional under-resourced languages. Lastly, this work conducts quality assessments with linguists on new recordings to investigate the model’s usefulness in language documentation. We observe an evident degradation on new recordings, indicating the importance of enhancing model robustness. In addition, we find that inaudible content is the main cause of wrong ASR predictions, suggesting related work on incorporating contextual information.

2018

Universal Morphologies for the Caucasus region
Christian Chiarcos | Kathrin Donandt | Maxim Ionov | Monika Rind-Pawlowski | Hasmik Sargsian | Jesse Wichers Schreur | Frank Abromeit | Christian Fäth
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)