A Large Norwegian Dataset for Weak Supervision ASR

Per Erik Solberg, Pierre Beauguitte, Per Egil Kummervold, Freddy Wetjen


Abstract
With the advent of weakly supervised ASR systems like Whisper, it is possible to train ASR systems on non-verbatim transcriptions. This paper describes an effort to create a large Norwegian dataset for weakly supervised ASR from parliamentary recordings. Audio from Stortinget, the Norwegian parliament, is segmented and transcribed with an existing ASR system. An algorithm retrieves transcripts of these segments from Stortinget’s official proceedings using the Levenshtein edit distance between the ASR output and the proceedings text. In that way, a dataset of more than 5000 hours of transcribed speech is produced with limited human effort. Since parliamentary data is public domain, the dataset can be shared freely without any restrictions.
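The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract says only that the Levenshtein edit distance between ASR output and proceedings text is used, so everything else here is an assumption made for illustration: the word-level comparison, the brute-force window search, the helper names `edit_distance` and `best_matching_span`, the `slack` parameter, and the example sentences. It is not the authors' algorithm.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def best_matching_span(asr_words, proceedings_words, slack=2):
    """Return (start, end, distance) of the proceedings span closest to the ASR output.

    Hypothetical helper: tries every start position and every window length
    within `slack` words of the ASR segment length. Brute force, for
    illustration only; it would be too slow for thousands of hours of audio.
    """
    n = len(asr_words)
    best = None
    for start in range(len(proceedings_words)):
        for length in range(max(1, n - slack), n + slack + 1):
            end = start + length
            if end > len(proceedings_words):
                break
            d = edit_distance(asr_words, proceedings_words[start:end])
            if best is None or d < best[2]:
                best = (start, end, d)
    return best


# Invented example: the ASR output omits a word that the proceedings contain.
asr = "jeg vil takke representanten for spørsmålet".split()
proceedings = ("president jeg vil gjerne takke representanten for "
               "spørsmålet det er en viktig sak").split()

start, end, dist = best_matching_span(asr, proceedings)
print(" ".join(proceedings[start:end]), dist)
```

The span with the lowest distance is taken as the transcript of the audio segment. A real pipeline would presumably also discard segments whose best distance is too high relative to their length, but that threshold is not stated in the abstract.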
Anthology ID:
2023.resourceful-1.7
Volume:
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)
Month:
May
Year:
2023
Address:
Tórshavn, the Faroe Islands
Editors:
Nikolai Ilinykh, Felix Morger, Dana Dannélls, Simon Dobnik, Beáta Megyesi, Joakim Nivre
Venue:
RESOURCEFUL
Publisher:
Association for Computational Linguistics
Pages:
48–52
URL:
https://aclanthology.org/2023.resourceful-1.7
Cite (ACL):
Per Erik Solberg, Pierre Beauguitte, Per Egil Kummervold, and Freddy Wetjen. 2023. A Large Norwegian Dataset for Weak Supervision ASR. In Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023), pages 48–52, Tórshavn, the Faroe Islands. Association for Computational Linguistics.
Cite (Informal):
A Large Norwegian Dataset for Weak Supervision ASR (Solberg et al., RESOURCEFUL 2023)
PDF:
https://aclanthology.org/2023.resourceful-1.7.pdf