Andre Kåsen


2024

pdf bib
Aligning the Norwegian UD Treebank with Entity and Coreference Information
Tollef Emil Jørgensen | Andre Kåsen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper presents a merged collection of entity and coreference annotated data grounded in the Universal Dependencies (UD) treebanks for the two written forms of Norwegian: Bokmål and Nynorsk. The aligned and converted corpora are the Norwegian Named Entities (NorNE) and Norwegian Anaphora Resolution Corpus (NARC). While NorNE is aligned with an older version of the treebank, NARC is misaligned and requires extensive transformation from the original annotations to the UD structure and CoNLL-U format. Here, we demonstrate the conversion and alignment processes, along with an analysis of discovered issues and errors in the data, some of which include data split overlaps in the original treebank. These procedures and the developed system may prove helpful for future work on processing and aligning data from universal dependencies. The merged corpora comprise the first Norwegian UD treebank enriched with named entities and coreference information, supporting the standardized format for the CorefUD initiative.

2022

pdf bib
Annotating Norwegian language varieties on Twitter for Part-of-speech
Petter Mæhlum | Andre Kåsen | Samia Touileb | Jeremy Barnes
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects

Norwegian Twitter data poses an interesting challenge for Natural Language Processing (NLP) tasks. These texts are difficult for models trained on standardized text in one of the two Norwegian written forms (Bokmål and Nynorsk), as they contain both the typical variation of social media text, as well as a large amount of dialectal variety. In this paper we present a novel Norwegian Twitter dataset annotated with POS-tags. We show that models trained on Universal Dependency (UD) data perform worse when evaluated against this dataset, and that models trained on Bokmål generally perform better than those trained on Nynorsk. We also see that performance on dialectal tweets is comparable to the written standards for some models. Finally we perform a detailed analysis of the errors that models commonly make on this data.

pdf bib
The Norwegian Dialect Corpus Treebank
Andre Kåsen | Kristin Hagen | Anders Nøklestad | Joel Priestly | Per Erik Solberg | Dag Trygve Truslew Haug
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents the NDC Treebank of spoken Norwegian dialects in the Bokmål variety of Norwegian. It consists of dialect recordings made between 2006 and 2012 which have been digitised, segmented, transcribed and subsequently annotated with morphological and syntactic analysis. The nature of the spoken data gives rise to various challenges both in segmentation and annotation. We follow earlier efforts for Norwegian, in particular the LIA Treebank of spoken dialects transcribed in the Nynorsk variety of Norwegian, in the annotation principles to ensure interusability of the resources. We have developed a spoken language parser on the basis of the annotated material and report on its accuracy both on a test set across the dialects and by holding out single dialects.

pdf bib
NARCNorwegian Anaphora Resolution Corpus
Petter Mæhlum | Dag Haug | Tollef Jørgensen | Andre Kåsen | Anders Nøklestad | Egil Rønningstad | Per Erik Solberg | Erik Velldal | Lilja Øvrelid
Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference

We present the Norwegian Anaphora Resolution Corpus (NARC), the first publicly available corpus annotated with anaphoric relations between noun phrases for Norwegian. The paper describes the annotated data for 326 documents in Norwegian Bokmål, together with inter-annotator agreement and discussions of relevant statistics. We also present preliminary modelling results which are comparable to existing corpora for other languages, and discuss relevant problems in relation to both modelling and the annotations themselves.

2020

pdf bib
Comparing Methods for Measuring Dialect Similarity in Norwegian
Janne Johannessen | Andre Kåsen | Kristin Hagen | Anders Nøklestad | Joel Priestley
Proceedings of the Twelfth Language Resources and Evaluation Conference

The present article presents four experiments with two different methods for measuring dialect similarity in Norwegian: the Levenshtein method and the neural long short term memory (LSTM) autoencoder network, a machine learning algorithm. The visual output in the form of dialect maps is then compared with canonical maps found in the dialect literature. All of this enables us to say that one does not need fine-grained transcriptions of speech to replicate classical classification patterns.

pdf bib
Progress of the PRINCIPLE Project: Promoting MT for Croatian, Icelandic, Irish and Norwegian
Andy Way | Petra Bago | Jane Dunne | Federico Gaspari | Andre Kåsen | Gauti Kristmannsson | Helen McHugh | Jon Arild Olsen | Dana Davis Sheridan | Páraic Sheridan | John Tinsley
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper updates the progress made on the PRINCIPLE project, a 2-year action funded by the European Commission under the Connecting Europe Facility (CEF) programme. PRINCIPLE focuses on collecting high-quality language resources for Croatian, Icelandic, Irish and Norwegian, which have been identified as low-resource languages, especially for building effective machine translation (MT) systems. We report initial achievements of the project and ongoing activities aimed at promoting the uptake of neural MT for the low-resource languages of the project.

2019

pdf bib
Tagging a Norwegian Dialect Corpus
Andre Kåsen | Anders Nøklestad | Kristin Hagen | Joel Priestley
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This paper describes an evaluation of five data-driven part-of-speech (PoS) taggers for spoken Norwegian. The taggers all rely on different machine learning mechanisms: decision trees, hidden Markov models (HMMs), conditional random fields (CRFs), long-short term memory networks (LSTMs), and convolutional neural networks (CNNs). We go into some of the challenges posed by the task of tagging spoken, as opposed to written, language, and in particular a wide range of dialects as is found in the recordings of the LIA (Language Infrastructure made Accessible) project. The results show that the taggers based on either conditional random fields or neural networks perform much better than the rest, with the LSTM tagger getting the highest score.

2018

pdf bib
The LIA Treebank of Spoken Norwegian Dialects
Lilja Øvrelid | Andre Kåsen | Kristin Hagen | Anders Nøklestad | Per Erik Solberg | Janne Bondi Johannessen
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)