Torbjørn Svendsen

2025

Match ‘em: Multi-Tiered Alignment for Error Analysis in ASR
Phoebe Parsons | Knut Kvale | Torbjørn Svendsen | Giampiero Salvi
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

We introduce “Match ‘em”: a new framework for aligning output from automatic speech recognition (ASR) with reference transcriptions. This allows a more detailed analysis of errors produced by end-to-end ASR systems compared to word error rate (WER). Match ‘em performs the alignment on both the word and character level; each relying on information from the other to provide the most meaningful global alignment. At the character level, we define a speech production motivated character similarity metric. At the word level, we rely on character similarities to define word similarity and, additionally, we reconcile compounding (insertion or deletion of spaces). We evaluated Match ‘em on transcripts of three European languages produced by wav2vec2 and Whisper. We show that Match ‘em results in more similar word substitution pairs and that compound reconciling can capture a broad range of spacing errors. We believe Match ‘em to be a valuable tool for ASR error analysis across many languages.

pdf bib abs

Adding Metadata to Existing Parliamentary Speech Corpus
Phoebe Parsons | Per Erik Solberg | Knut Kvale | Torbjørn Svendsen | Giampiero Salvi
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Parliamentary proceedings are convenient data sources for creating corpora for speech technology. Given its public nature, there is an abundance of extra information about the speakers that can be legally and ethically harvested to enrich this kind of corpora. This paper describes the methods we have used to add speaker metadata to the Stortinget Speech Corpus (SSC) containing over 5,000 hours of Norwegian speech with non-verbatim transcripts but without speaker metadata. The additional metadata for each speech segment includes speaker ID, gender, date of birth, municipality of birth, and counties represented. We also infer speaker dialect from their municipality of birth using a manually designed mapping between municipalities and Norwegian dialects. We provide observations on the SSC data and give suggestions for how it may be used for tasks other than speech recognition. Finally, we demonstrate the utility of this new metadata through a dialect identification task. The described methods can be adapted to add metadata information to parliamentary corpora in other languages.

2024

pdf bib abs

This paper reports on the experience collecting a number of corpora of Nordic languages spoken by children. The aim of the data collection is providing annotated data to develop and evaluate computer assisted pronunciation assessment systems both for non-native children learning a Nordic language (L2) and for L1 children with speech sound disorder (SSD). The paper presents the challenges encountered recording and annotating data for Finnish, Swedish and Norwegian, as well as the ethical considerations related with making this data publicly available. We hope that sharing this experience will encourage others to collect similar data for other languages. Of the different data collections, we were able to make the Norwegian corpus publicly available in the hope that it will serve as a reference in pronunciation assessment research.

2023

pdf bib abs

A character-based analysis of impacts of dialects on end-to-end Norwegian ASR
Phoebe Parsons | Knut Kvale | Torbjørn Svendsen | Giampiero Salvi
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We present a method for analyzing character errors for use with character-based, end-to-end ASR systems, as used herein for investigating dialectal speech. As end-to-end systems are able to produce novel spellings, there exists a possibility that the spelling variants produced by these systems can capture phonological information beyond the intended target word. We therefore first introduce a way of guaranteeing that similar words and characters are paired during alignment, thus ensuring that any resulting analysis of character errors is founded on sound substitutions. Then, from such a careful character alignment, we find trends in system-generated spellings that align with known phonological features of Norwegian dialects, in particular, “r” and “l” confusability and voiceless stop lenition. Through this analysis, we demonstrate that cues from acoustic dialectal features can influence the output of an end-to-end ASR systems.

pdf bib abs

Improving Generalization of Norwegian ASR with Limited Linguistic Resources
Per Erik Solberg | Pablo Ortiz | Phoebe Parsons | Torbjørn Svendsen | Giampiero Salvi
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

With large amounts of training data, it is possible to train ASR models that generalize well across speakers and domains. But how do you train robust models when there is a limited amount of available training data? In the experiments reported here, we fine-tuned a pre-trained wav2vec2 ASR model on two transcribed, Norwegian speech datasets, one with parliamentary speech and one with radio recordings, as well as on combinations of the two datasets. We subsequently tested these models on different test sets with planned and unplanned speech and with speakers of various dialects. Our results show that models trained on combinations of the two datasets generalize better to new data than the single-dataset models, even when the length of the training data is the same. Our lexical analysis sheds light on the type of mistakes made by the models and on the importance of consistent standardization when training combined models of this kind.

2010

pdf bib abs

Spontal-N is a corpus of spontaneous, interactional Norwegian. To our knowledge, it is the first corpus of Norwegian in which the majority of speakers have spent significant parts of their lives in Sweden, and in which the recorded speech displays varying degrees of interference from Swedish. The corpus consists of studio quality audio- and video-recordings of four 30-minute free conversations between acquaintances, and a manual orthographic transcription of the entire material. On basis of the orthographic transcriptions, we automatically annotated approximately 50 percent of the material on the phoneme level, by means of a forced alignment between the acoustic signal and pronunciations listed in a dictionary. Approximately seven percent of the automatic transcription was manually corrected. Taking the manual correction as a gold standard, we evaluated several sources of pronunciation variants for the automatic transcription. Spontal-N is intended as a general purpose speech resource that is also suitable for investigating phonetic detail.

pdf bib abs

NameDat: A Database of English Proper Names Spoken by Native Norwegians
Line Adde | Torbjørn Svendsen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes the design and collection of NameDat, a database containing English proper names spoken by native Norwegians. The database was designed to cover the typical acoustic and phonetic variations that appear when Norwegians pronounce English names. The intended use of the database is acoustic and lexical modeling of these phonetic variations. The English names in the database have been enriched with several annotation tiers. The recorded names were selected according to three selection criteria: the familiarity of the name, the expected recognition performance and the coverage of non-native phonemes. The validity of the manual annotations was verified by means of an automatic recognition experiment of non-native names. The experiment showed that the use of the manual transcriptions from NameDat yields an increase in recognition performance over automatically generated transcriptions.

2008

pdf bib abs

RUNDKAST: an Annotated Norwegian Broadcast News Speech Corpus
Ingunn Amdal | Ole Morten Strand | Jørn Almberg | Torbjørn Svendsen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes the Norwegian broadcast news speech corpus RUNDKAST. The corpus contains recordings of approximately 77 hours of broadcast news shows from the Norwegian broadcasting company NRK. The corpus covers both read and spontaneous speech as well as spontaneous dialogues and multipart discussions, including frequent occurrences of non-speech material (e.g. music, jingles). The recordings have large variations in speaking styles, dialect use and recording/transmission quality. RUNDKAST has been annotated for research in speech technology. The entire corpus has been manually segmented and transcribed using hierarchical levels. A subset of one hour of read and spontaneous speech from 10 different speakers has been manually annotated using broad phonetic labels. We provide a description of the database content, the annotation tools and strategies, and the conventions used for the different levels of annotation. A corpus of this kind has up to this point not been available for Norwegian, but is considered a necessary part of the infrastructure for language technology research in Norway. The RUNDKAST corpus is planned to be included in a future national Norwegian language resource bank.

2006

pdf bib abs

FonDat1: A Speech Synthesis Corpus for Norwegian
Ingunn Amdal | Torbjørn Svendsen
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the Norwegian speech database FonDat1 designedfor development and assessment of Norwegian unit selection speechsynthesis. The quality of unit selection speech synthesis systems depends highly on the database used. The database should contain sufficient phonemicand prosodic coverage. High quality unit selection synthesis alsorequires that the database is annotated with accurate information about identity and position of the units. Traditionally this involves much manual work, either by hand labelingthe entire database or by correcting automatic annotations. We are working on methods for a complete automation of the annotationprocess. To validate these methods a realistic unit selectionsynthesis database is needed. In addition to serve as a testbed for annotation tools and synthesisexperiments, the process of producing the database using automaticmethods is in itself an important result. FonDat1 contains studio recordings of approximately 2000 sentencesread by two professional speakers, one male and one female. 10% ofthe database is manually annotated.

2002

pdf bib

Evaluation of Pronunciation Variants in the ASR Lexicon for Different Speaking Styles
Ingunn Amdal | Torbjørn Svendsen
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Co-authors

Venues

Fix author