Per Egil Kummervold


2023

A Large Norwegian Dataset for Weak Supervision ASR
Per Erik Solberg | Pierre Beauguitte | Per Egil Kummervold | Freddy Wetjen
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

With the advent of weakly supervised ASR systems like Whisper, it is possible to train ASR systems on non-verbatim transcriptions. This paper describes an effort to create a large Norwegian dataset for weakly supervised ASR from parliamentary recordings. Audio from Stortinget, the Norwegian parliament, is segmented and transcribed with an existing ASR system. An algorithm retrieves transcripts of these segments from Stortinget’s official proceedings using the Levenshtein edit distance between the ASR output and the proceedings text. In that way, a dataset of more than 5000 hours of transcribed speech is produced with limited human effort. Since parliamentary data is public domain, the dataset can be shared freely without any restrictions.
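The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say whether the distance is computed over characters or words, nor how candidate spans are chosen. The sketch assumes a word-level Levenshtein distance and a naive sliding window over the proceedings text. The function names `levenshtein` and `best_match` and the `slack` parameter are hypothetical.

```python
def levenshtein(a, b):
    """Levenshtein edit distance between two sequences (e.g. lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x from a
                            curr[j - 1] + 1,      # insert y into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def best_match(asr_words, proceedings_words, slack=2):
    """Find the proceedings span closest to the ASR output.

    Assumption: candidate spans differ in length from the ASR output by
    at most `slack` words. Returns (distance, start, end), or None if
    no candidate span fits.
    """
    n = len(asr_words)
    best = None
    for start in range(len(proceedings_words)):
        for length in range(max(1, n - slack), n + slack + 1):
            end = start + length
            if end > len(proceedings_words):
                break
            d = levenshtein(asr_words, proceedings_words[start:end])
            if best is None or d < best[0]:
                best = (d, start, end)
    return best


# Toy example: the ASR output drops one word that the proceedings contain.
proceedings = ("the meeting is opened i would like to thank "
               "the representative for the question").split()
asr = "i would like to thank representative for the question".split()
distance, start, end = best_match(asr, proceedings)
print(distance, " ".join(proceedings[start:end]))
```

This brute-force search costs roughly the proceedings length times the segment length squared, so it only suits short texts. At the scale of thousands of hours, candidate spans would presumably have to be narrowed first, for instance by meeting date or by position in the recording.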