Trond Trosterud


2025

The article presents a finite-state transducer and spellchecker for Mansi, an Ob-Ugric Uralic language spoken in northwestern Siberia. Mansi has a rich but mostly agglutinative morphology, with a morphophonology dominated by sandhi phenomena. With a small set of morphophonological rules (32 twolc rules) and a lexicon consisting of 12,000 Mansi entries and a larger set of proper nouns, we were able to build a transducer covering 98.9 % of a large (700k) newspaper corpus. Being part of the GiellaLT infrastructure, the transducer was turned into a spellchecker. The most common spelling error in Mansi is the omission of length marks on vowels, and for the 1,000 most common words containing long vowels, the spellchecker was able to give a correct suggestion among the top five in 98.3 % of the cases, and as the first suggestion in 91.3 % of the cases.
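The length-mark correction described above can be illustrated with a much-simplified sketch: generate candidates by restoring long-vowel marks and keep those found in the lexicon, ranked by the number of changes. The toy lexicon and Latin-script vowel mapping below are purely illustrative; the real speller is generated from the GiellaLT FST and works on Cyrillic Mansi.

```python
from itertools import product

# Hypothetical toy lexicon and long-vowel mapping, for illustration only.
LEXICON = {"nē", "nēn", "tēn", "ten"}
LONG = {"a": "ā", "e": "ē", "i": "ī", "o": "ō", "u": "ū"}

def length_candidates(word: str) -> list[str]:
    """Suggest lexicon words obtainable by adding vowel length marks,
    ranked by how few vowels had to be changed."""
    options = [(c, LONG[c]) if c in LONG else (c,) for c in word]
    scored = []
    for chars in product(*options):
        cand = "".join(chars)
        if cand in LEXICON:
            changes = sum(a != b for a, b in zip(word, cand))
            scored.append((changes, cand))
    return [w for _, w in sorted(scored)]

print(length_candidates("ten"))  # → ['ten', 'tēn']
```

Since only length marks are added, candidate generation stays cheap even for longer words, and ranking by the number of restored marks naturally puts the most conservative correction first.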
This article presents a grammatical analyser, disambiguator and dependency analyser for Tokelau. The grammatical analyser is written as a finite-state transducer (FST), whereas the disambiguator and dependency analyser are written in Constraint Grammar (CG), both within the GiellaLT infrastructure. Contrary to most languages analysed within this framework, Tokelau, a Polynesian language, is predominantly isolating, with reduplication and affixation as the main morphological processes. The article discusses how FST and CG deal with Polynesian languages.
This article gives an overview of the state of the art in language technology tools for Balto-Finnic minority languages, i.e., Balto-Finnic languages other than Estonian and Finnish. For simplicity, we will use the term Finnic in this article when referring to all members of this language branch except the Estonian and Finnish literary languages. All in all, there are nine standardised languages represented in existing language technology infrastructures with keyboards, grammatical language models, proofing tools, annotated corpora and (for one of the languages) extensive ICALL programs. This article presents these tools and resources, discusses the relation between language models and proofing tool quality, as well as the (potential) impact of these tools on the respective language communities. The article rounds off with a discussion of prospects for future development.
Optical Character Recognition (OCR) is crucial to the National Library of Norway’s (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN’s collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN’s collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
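OCR quality of the kind evaluated above is conventionally measured as character error rate (CER): the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. A minimal self-contained sketch of that metric (the actual evaluation pipeline is not specified in the abstract):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance between the OCR
    output and the ground-truth text, normalised by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        cur = [i]
        for j, h in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(reference)

print(cer("sámi", "sami"))  # one substitution in four characters → 0.25
```

Note that diacritics count as full character errors under this metric, which is one reason OCR models trained on majority-language data score poorly on Sámi orthographies.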

2024

Creating language technology based on language data has become very popular with the recent advances of large language models and neural network technologies. This makes language resources very valuable, and especially in the case of indigenous languages, the scarce resources are even more precious. Given the good results for English obtained by simply fetching everything available from the internet and feeding it to neural networks, there has been more work on doing the same for all languages. However, indigenous language resources as they exist on the web are not comparable, in the sense that they do not necessarily encode the most recent normativised language in all its aspects. The problem is compounded by the fact that the people who work on the models often do not understand the texts that are input to or output by them. Corpora also have intellectual property rights and copyrights that are not respected. Furthermore, the web is increasingly filled with language-model-generated texts. In this article we describe an ethical and sustainable way to work with indigenous languages.

2023

Many errors in Faroese writing are linked to the letter ð, a letter which has no corresponding phoneme and is always omitted in pronunciation intervocalically and word-finally after a vowel. It plays an important role in the written language, disambiguating forms that are homophones but not homographs, such as the infinitive kasta ‘throw’ and its participle kastað. Since adding a hypercorrect ð, or erroneously omitting it, often results in an existing word, these errors cannot be caught by ordinary spellcheckers. The article presents a grammar checker targeting ð errors, and discusses challenges related to false alarms.
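The reason a wordlist speller is powerless here can be shown with a toy example: both kasta and kastað are valid words, so the missing ð in “hevur kasta” is only detectable from context. The single context rule below is a hypothetical, much-simplified stand-in for the actual Constraint Grammar rules of the checker.

```python
# Toy illustration, not the actual GiellaLT grammar checker.
WORDLIST = {"hann", "hevur", "kasta", "kastað", "bóltin"}

def wordlist_speller(sentence: list[str]) -> list[str]:
    """A plain speller: flag only words missing from the wordlist."""
    return [w for w in sentence if w not in WORDLIST]

def context_checker(sentence: list[str]) -> list[str]:
    """One hypothetical contextual rule: after the auxiliary 'hevur',
    the participle in -ð is required, so flag the bare infinitive."""
    errors = []
    for prev, word in zip(sentence, sentence[1:]):
        if prev == "hevur" and word == "kasta":
            errors.append(word + " → kastað")
    return errors

sent = ["hann", "hevur", "kasta", "bóltin"]
print(wordlist_speller(sent))  # → [] : the speller sees nothing wrong
print(context_checker(sent))   # → ['kasta → kastað']
```

Context rules of this kind are also the source of the false alarms the article discusses: the rule fires whenever its pattern matches, whether or not the writer actually intended the participle.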
We present GramDivvun, the first Inari Saami grammar checker for L2 users. The grammar checker is an important tool in the revitalisation of the language, in particular for strengthening the literary language. As the Inari Saami language community needs language tools predominantly for language learners, the focus is on grammatical interference errors made by (mostly Finnish-speaking) learners. Six of these error types are covered in the first version of the grammar checker. For non-proofread text written by inexperienced writers, precision is good, at 73 %. For text by experienced writers and proofread text, alarms are rare and precision considerably lower, 19.5 % on average, though varying considerably between the error types. The paper discusses reasons for this variation. Future plans include improving results by means of increased testing, especially for complex sentences, and eventually including more error types.

2022

Machine learning (ML) approaches have dominated NLP during the last two decades. Beyond machine translation and speech technology, ML tools are now also in use for spellchecking and grammar checking, with a blurry distinction between the two. We unmask the myth of effortless big data by illuminating the effort and time that lie behind building a multi-purpose corpus, with regard to collection, mark-up and building from scratch. We also discuss what kind of language technology minority languages actually need, and to what extent the dominating paradigm has been able to deliver these tools. In this context we present our alternative to corpus-based language technology, namely knowledge-based language technology, and we show how this approach can provide language technology solutions for languages outside the reach of machine learning procedures. We present a stable and mature infrastructure (GiellaLT) covering more than a hundred languages and providing a number of language technology tools that are useful for language communities.

2010

Grammatical approaches to language technology are often considered less optimal than statistical approaches in multilingual settings, where large-scale portability becomes an important issue. The present paper argues that there is a notable gain in reusing grammatical resources when porting technology to new languages. The pivot language is North Sámi, and the paper discusses portability with respect to the closely related Lule and South Sámi, and to the unrelated Faroese and Greenlandic languages.

2009

We describe the development of a two-way shallow-transfer machine translation system between Norwegian Nynorsk and Norwegian Bokmål built on the Apertium platform, using the Free and Open Source resources Norsk Ordbank and the Oslo–Bergen Constraint Grammar tagger. We detail the integration of these and other resources in the system along with the construction of the lexical and structural transfer, and evaluate the translation quality in comparison with another system. Finally, some future work is suggested.
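At its core, lexical transfer between two closely related written standards replaces source-language lexical units with target-language equivalents. The sketch below is a drastic word-for-word simplification with an invented miniature dictionary; the actual Apertium system operates on morphological analyses from the tagger and applies structural transfer rules on top of the bilingual lexicon.

```python
# Illustrative miniature Nynorsk → Bokmål dictionary (not from Norsk Ordbank).
NN_TO_NB = {"eg": "jeg", "ikkje": "ikke", "noko": "noe", "heile": "hele"}

def transfer(sentence: str) -> str:
    """Word-for-word lexical transfer; unknown words pass through unchanged."""
    return " ".join(NN_TO_NB.get(w, w) for w in sentence.split())

print(transfer("eg har ikkje sett noko"))  # → "jeg har ikke sett noe"
```

Passing unknown words through unchanged is a reasonable default precisely because the two standards share most of their vocabulary; the hard cases, which motivate the structural transfer and the CG tagger, are the forms where inflection or word order differs.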
