Gamli - Icelandic Oral History Corpus: Design, Collection and Evaluation

Luke O’Brien, Finnur Ingimundarson, Jón Guðnasson, Steinþór Steingrímsson


Abstract
We present Gamli, an ASR corpus for Icelandic oral histories, the first of its kind for this language, derived from the Ísmús ethnographic collection. Corpora for oral histories differ in various ways from corpora for general ASR: they contain spontaneous speech, multiple speakers per channel, noisy environments, the effects of historic recording equipment, and typically a large proportion of elderly speakers. Gamli contains 146 hours of aligned speech and transcripts, split into a training set and a test set. We describe our approach for creating the transcripts, through both OCR of previous transcripts and post-editing of ASR output. We also describe our approach for aligning, segmenting, and filtering the corpus and finally training a Kaldi ASR system, which achieves a 22.4% word error rate (WER) on the Gamli test set, a substantial improvement over the 58.4% WER of a baseline general ASR system for Icelandic.
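For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are illustrative assumptions and are not taken from the paper; the authors' actual scoring setup (for example, text normalization or the Kaldi scoring scripts) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which the error rate falls relative to the baseline."""
    return (baseline_wer - improved_wer) / baseline_wer
```

Applied to the two figures in the abstract, `relative_reduction(58.4, 22.4)` gives a relative WER reduction of roughly 62%, which is one way to quantify the reported improvement.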
Anthology ID:
2023.nodalida-1.59
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
601–609
URL:
https://aclanthology.org/2023.nodalida-1.59
Cite (ACL):
Luke O’Brien, Finnur Ingimundarson, Jón Guðnasson, and Steinþór Steingrímsson. 2023. Gamli - Icelandic Oral History Corpus: Design, Collection and Evaluation. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 601–609, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Gamli - Icelandic Oral History Corpus: Design, Collection and Evaluation (O’Brien et al., NoDaLiDa 2023)
PDF:
https://aclanthology.org/2023.nodalida-1.59.pdf