Jón Guðnason

Also published as: Jon Gudnason


2024

Samrómur Milljón: An ASR Corpus of One Million Verified Read Prompts in Icelandic
Carlos Daniel Hernandez Mena | Þorsteinn Daði Gunnarsson | Jon Gudnason
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The platform samromur.is, or “Samrómur” for short, is a crowdsourcing web application built on Mozilla’s Common Voice, designed to accumulate speech data for the advancement of language technologies in Icelandic. Over the years, Samrómur has proven to be remarkably successful in amassing a significant number of high-quality audio clips from thousands of users. However, the challenge of manually verifying the entirety of the collected data has hindered its effective exploitation, especially in the realm of Automatic Speech Recognition (ASR), its original purpose. In this paper, we introduce the “Samrómur Milljón” corpus, an ASR dataset comprising one million audio clips from Samrómur. These clips have been automatically verified using state-of-the-art speech recognition systems such as NeMo, Wav2Vec2, and Whisper. Additionally, we present the ASR results obtained from creating acoustic models based on Samrómur Milljón. These results demonstrate significant promise when compared to other acoustic models trained with a similar volume of Icelandic data from different sources.

2023

ASR Language Resources for Faroese
Carlos Hernández Mena | Annika Simonsen | Jon Gudnason
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

The aim of this work is to present a set of novel language resources in Faroese suitable for the field of Automatic Speech Recognition, including: an ASR corpus comprising 109 hours of transcribed speech data; acoustic models in systems such as WAV2VEC2, NVIDIA-NeMo, Kaldi and PocketSphinx; a set of n-gram language models; and a set of pronunciation dictionaries with two different variants of Faroese. We also show comparison results between the distinct acoustic models presented here. All the resources presented in this document are publicly available under Creative Commons licences.

Microservices at Your Service: Bridging the Gap between NLP Research and Industry
Tiina Lindh-Knuutila | Hrafn Loftsson | Pedro Alonso Doval | Sebastian Andersson | Bjarni Barkarson | Héctor Cerezo-Costas | Jon Gudnason | Jökull Gylfason | Jarmo Hemminki | Heiki-Jaan Kaalep
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

This paper describes a collaborative European project whose aim was to gather open source Natural Language Processing (NLP) tools and make them accessible as running services and easy to try out in the European Language Grid (ELG). The motivation of the project was to increase accessibility for more European languages and make it easier for developers to use the underlying tools in their own applications. The project resulted in the containerization of 60 existing NLP tools for 16 languages, all of which are now running as easily testable services in the ELG platform.

Standardising Pronunciation for a Grapheme-to-Phoneme Converter for Faroese
Sandra Lamhauge | Iben Debess | Carlos Hernández Mena | Annika Simonsen | Jon Gudnason
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Pronunciation dictionaries allow computational modelling of the pronunciation of words in a certain language and are widely used in speech technologies, especially in the fields of speech recognition and synthesis. On the other hand, a grapheme-to-phoneme tool is a generalization of a pronunciation dictionary that is not limited to a given and finite vocabulary. In this paper, we present a set of standardized phonological rules for the Faroese language; we introduce FARSAMPA, a machine-readable character set suitable for phonetic transcription of Faroese; and we present a set of grapheme-to-phoneme models for Faroese, which are publicly available and shared under a Creative Commons license. We present the G2P converter and evaluate its performance. The evaluation shows reliable results that demonstrate the quality of the data.

2022

An Open Source Web Reader for Under-Resourced Languages
Judy Fong | Þorsteinn Daði Gunnarsson | Sunneva Þorsteinsdóttir | Gunnar Thor Örnólfsson | Jon Gudnason
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

We have developed an open source web reader in Iceland for under-resourced languages. The web reader was developed due to the need for a free, good-quality web reader for languages that fall outside the scope of commercially available web readers. It relies on a text-to-speech (TTS) pipeline accessed via a cloud service. The web reader was developed using the Icelandic TTS voices Alfur and Dilja, but could be connected to any language which has a TTS pipeline. The design of our web reader focuses on functionality, adaptability and user friendliness. Therefore, the web reader’s feature set heavily overlaps with the minimal features necessary to provide a good web reading experience while still being extensible enough to be adapted to work for other languages, high-resourced and under-resourced. The web reader works well on all the major web browsers and complies with the Web Content Accessibility Guidelines 2.0 at Level AA (Acceptable), meaning that it works well for the largest user groups: speakers of under-resourced languages who have visual impairments or difficulty reading. The code for our web reader is published under an Apache 2.0 license at https://github.com/cadia-lvl/WebRICE, which includes a simple demo of the project.

Samrómur Children: An Icelandic Speech Corpus
Carlos Daniel Hernandez Mena | David Erik Mollberg | Michal Borský | Jón Guðnason
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Samrómur Children is an Icelandic speech corpus intended for the field of automatic speech recognition. It contains 131 hours of read speech from Icelandic children aged between 4 and 17 years. The test portion was meticulously selected to cover as wide a range of ages as possible; we aimed to have exactly the same amount of data per age range. The speech was collected with the crowd-sourcing platform Samrómur.is, which is inspired by Mozilla’s Common Voice project. The corpus was developed within the framework of the “Language Technology Programme for Icelandic 2019-2023”; the goal of the project is to make Icelandic available in language-technology applications. Samrómur Children is the first corpus in Icelandic with children’s voices for public use under a Creative Commons license. Additionally, we present baseline experiments and results using Kaldi.

Samrómur: Crowd-sourcing large amounts of data
Staffan Hedström | David Erik Mollberg | Ragnheiður Þórhallsdóttir | Jón Guðnason
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This contribution describes the collection of a large and diverse corpus for speech recognition and similar tools using crowd-sourced donations. We have built a collection platform inspired by Mozilla Common Voice and specialized it to our needs. We discuss the importance of engaging the community and motivating it to contribute, in our case through competitions. Given the incentive and a platform to easily read in large amounts of utterances, we have observed four cases of speakers freely donating over 10 thousand utterances. We have also seen that women are keener to participate in these events throughout all age groups. Manually verifying a large corpus is a monumental task and we attempt to automatically verify parts of the data using tools like Marosijo and the Montreal Forced Aligner. The method proved helpful, especially for detecting invalid utterances and halving the work needed from crowd-sourced verification.

National Language Technology Platform for Public Administration
Marko Tadić | Daša Farkaš | Matea Filko | Artūrs Vasiļevskis | Andrejs Vasiļjevs | Jānis Ziediņš | Željka Motika | Mark Fishel | Hrafn Loftsson | Jón Guðnason | Claudia Borg | Keith Cortis | Judie Attard | Donatienne Spiteri
Proceedings of the Workshop Towards Digital Language Equality within the 13th Language Resources and Evaluation Conference

This article presents the work in progress on a collaborative project of several European countries to develop a National Language Technology Platform (NLTP). The project aims at combining the most advanced Language Technology tools and solutions in a new, state-of-the-art, Artificial Intelligence-driven National Language Technology Platform for five EU/EEA official and lower-resourced languages.

2021

Creating Data in Icelandic for Text Normalization
Helga Svala Sigurðardóttir | Anna Björk Nikulásdóttir | Jón Guðnason
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

There is no natural way to acquire normalized data so we try to create good enough data to attempt more advanced methods for text normalization. We manually annotated the first normalized corpus in Icelandic, 40,000 sentences, and developed Regína, a rule-based system for text normalization. Regína gets 90.83% accuracy compared to the manually annotated corpus on non-standard words. Regína showed a significant improvement in accuracy when compared to an older normalization system for Icelandic. The normalized corpus and Regína will be released as open source.

Talrómur: A large Icelandic TTS corpus
Atli Sigurgeirsson | Þorsteinn Gunnarsson | Gunnar Örnólfsson | Eydís Magnúsdóttir | Ragnheiður Þórhallsdóttir | Stefán Jónsson | Jón Guðnason
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We present Talrómur, a large high-quality Text-To-Speech (TTS) corpus for the Icelandic language. This multi-speaker corpus contains recordings from 4 male speakers and 4 female speakers spanning a wide range of ages and speaking styles. The corpus consists of 122,417 single utterance recordings equating to approximately 213 hours of voice data. All speakers read from the same script, which has a high coverage of possible Icelandic diphones. Manual analysis of 15,956 utterances indicates that the corpus has a reading mistake rate no higher than 0.25%. We additionally present results from subjective evaluations of the different voices with regard to intelligibility, likeability and trustworthiness.

2020

Language Technology Programme for Icelandic 2019-2023
Anna Nikulásdóttir | Jón Guðnason | Anton Karl Ingason | Hrafn Loftsson | Eiríkur Rögnvaldsson | Einar Freyr Sigurðsson | Steinþór Steingrímsson
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we describe a new national language technology programme for Icelandic. The programme, which spans a period of five years, aims at making Icelandic usable in communication and interactions in the digital world, by developing accessible, open-source language resources and software. The research and development work within the programme is carried out by a consortium of universities, institutions, and private companies, with a strong emphasis on cooperation between academia and industry. Five core projects will be the main content of the programme: language resources, speech recognition, speech synthesis, machine translation, and spell and grammar checking. We also describe other national language technology programmes and give an overview of the history of language technology in Iceland.

Samrómur: Crowd-sourcing Data Collection for Icelandic Speech Recognition
David Erik Mollberg | Ólafur Helgi Jónsson | Sunneva Þorsteinsdóttir | Steinþór Steingrímsson | Eydís Huld Magnúsdóttir | Jon Gudnason
Proceedings of the Twelfth Language Resources and Evaluation Conference

This contribution describes an ongoing project of speech data collection, using the web application Samrómur, which is built upon Common Voice, Mozilla Foundation’s web platform for open-source voice collection. The goal of the project is to build a large-scale speech corpus for Automatic Speech Recognition (ASR) for Icelandic. Upon completion, Samrómur will be the largest open speech corpus for Icelandic collected from the public domain. We discuss the methods used for the crowd-sourcing effort and show the importance of marketing and good media coverage when launching a crowd-sourcing campaign. Preliminary results exceed our expectations, and in one month we collected data that we had estimated would take three months to obtain. Furthermore, our initial dataset of around 45 thousand utterances has good demographic coverage, is gender-balanced, and has a proper age distribution. We also report on the task of validating the recordings, which we have not promoted but in which volunteers have nonetheless invested numerous hours.

Manual Speech Synthesis Data Acquisition - From Script Design to Recording Speech
Atli Sigurgeirsson | Gunnar Örnólfsson | Jón Guðnason
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

In this paper we present the work of collecting a large amount of high-quality speech synthesis data for Icelandic. Eight speakers will be recorded for 20 hours each. A script design strategy is proposed and three scripts, varying in length, have been generated to maximize diphone coverage. The largest reading script contains 14,400 prompts and includes 87.3% of all Icelandic diphones at least once and 81% of all Icelandic diphones at least twenty times. A recording client was developed to facilitate recording sessions. The client supports easily importing scripts and maintaining multiple collections in parallel. The recorded data can be downloaded straight from the client. Recording sessions are carried out in a professional studio under supervision and started in October 2019. As of writing, 58.7 hours of high-quality speech data have been collected. The scripts, the recording software and the speech data will later be released under a CC-BY 4.0 license.

2018

Open ASR for Icelandic: Resources and a Baseline System
Anna Björk Nikulásdóttir | Inga Rún Helgadóttir | Matthías Pétursson | Jón Guðnason
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Risamálheild: A Very Large Icelandic Text Corpus
Steinþór Steingrímsson | Sigrún Helgadóttir | Eiríkur Rögnvaldsson | Starkaður Barkarson | Jón Guðnason
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Málrómur: A Manually Verified Corpus of Recorded Icelandic Speech
Steinþór Steingrímsson | Jón Guðnason | Sigrún Helgadóttir | Eiríkur Rögnvaldsson
Proceedings of the 21st Nordic Conference on Computational Linguistics