Sjur Moshagen

Also published as: Sjur N. Moshagen, Sjur Nørstebø Moshagen


2022

Unmasking the Myth of Effortless Big Data - Making an Open Source Multi-lingual Infrastructure and Building Language Resources from Scratch
Linda Wiechetek | Katri Hiovain-Asikainen | Inga Lill Sigga Mikkelsen | Sjur Moshagen | Flammie Pirinen | Trond Trosterud | Børre Gaup
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Machine learning (ML) approaches have dominated NLP during the last two decades. From machine translation and speech technology, ML tools are now also in use for spellchecking and grammar checking, with a blurry distinction between the two. We unmask the myth of effortless big data by illuminating the efforts and time that lie behind building a multi-purpose corpus with regard to collecting, mark-up and building from scratch. We also discuss what kind of language technology minority languages actually need, and to what extent the dominating paradigm has been able to deliver these tools. In this context we present our alternative to corpus-based language technology, which is knowledge-based language technology, and we show how this approach can provide language technology solutions for languages that are outside the reach of machine learning procedures. We present a stable and mature infrastructure (GiellaLT) containing more than a hundred languages and building a number of language technology tools that are useful for language communities.

Building Open-source Speech Technology for Low-resource Minority Languages with Sámi as an Example – Tools, Methods and Experiments
Katri Hiovain-Asikainen | Sjur Moshagen
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

This paper presents a work-in-progress report of an open-source speech technology project for indigenous Sami languages. A less detailed description of this work has been presented in a more general paper about the whole GiellaLT language infrastructure, submitted to the LREC 2022 main conference. At this stage, we have designed and collected a text corpus specifically for developing speech technology applications, namely Text-to-speech (TTS) and Automatic speech recognition (ASR) for the Lule and North Sami languages. We have also piloted and experimented with different speech synthesis technologies using a miniature speech corpus as well as developed tools for effective processing of large spoken corpora. Additionally, we discuss effective and mindful use of the speech corpus and also possibilities to use found/archive materials for training an ASR model for these languages.

2019

Is this the end? Two-step tokenization of sentence boundaries
Linda Wiechetek | Sjur Nørstebø Moshagen | Thomas Omma
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

Seeing more than whitespace — Tokenisation and disambiguation in a North Sámi grammar checker
Linda Wiechetek | Sjur Nørstebø Moshagen | Kevin Brubeck Unhammer
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2018

Modeling Northern Haida Verb Morphology
Jordan Lachler | Lene Antonsen | Trond Trosterud | Sjur Moshagen | Antti Arppe
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

A Morphological Parser for Odawa
Dustin Bowers | Antti Arppe | Jordan Lachler | Sjur Moshagen | Trond Trosterud
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2014

Modeling the Noun Morphology of Plains Cree
Conor Snoek | Dorothy Thunder | Kaidi Lõo | Antti Arppe | Jordan Lachler | Sjur Moshagen | Trond Trosterud
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages

2013

Building an Open-Source Development Infrastructure for Language Technology Projects
Sjur N. Moshagen | Tommi Pirinen | Trond Trosterud
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2007

Usage of XSL Stylesheets for the Annotation of the Sámi Language Corpora
Saara Huhmarniemi | Sjur N. Moshagen | Trond Trosterud
Proceedings of the Linguistic Annotation Workshop

1996

A Sign Expansion Approach to Dynamic, Multi-Purpose Lexicons
Jon Atle Gulla | Sjur Nørstebø Moshagen
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics