Maja Lisa Kappfjell

2025

The world’s first South Sámi TTS – a revitalisation effort of an endangered language by reviving a legacy voice
Katri Hiovain-Asikainen | Thomas B. Kjærstad | Maja Lisa Kappfjell | Sjur N. Moshagen
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages

South Sámi (ISO 639: SMA) is a severely endangered language spoken by the South Sámi people in Norway and Sweden. Estimates of the number of speakers vary from 500 to 600. Recent advances in speech technology and the general increase in popularity of spoken language and audio content have facilitated the development of modern speech technology tools also for minority languages, such as the Sámi languages. The current paper documents the development process of the world’s first South Sámi text-to-speech (TTS) system, using only digitized archive materials from 1989–1993 as the training material. To reach an end-user suitable quality of the TTS, we have used a neural, end-to-end approach with a rule-based text processing module. The aim of our project is to contribute to the language revitalization by offering tools for language users to use spoken language in new contexts. Since the modern written standard of South Sámi was established as late as in 1978, the rise of speech technology might encourage language use even for people who are not accustomed to the written standar.

pdf bib abs

How to Create Treebanks without Human Annotators – An Indigenous Language Grammar Checker for Treebank Construction
Linda Wiechetek | Flammie A Pirinen | Maja Lisa Kappfjell
Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)

Creating treebanks for low resource languages is an important task. However, low resource Indigenous language contexts have not only limited resources in terms of text data, but also limited human resources that are available for linguistic annotation. We suggest a work-around by applying a Constraint Grammar operated rule-based dependency parser to do the work of creating a marked-up treebank. However, due to a lot of noise, meaning spelling and grammatical errors in South Sámi written texts, this tool often fails to create complete and correct trees. As a fix to this, we created a grammar checking tool for the most common South Sámi grammatical error types, which improves the quality of the dependency parser significantly. As both literacy and normative standards for most Indigenous languages are much more recent than for majority languages, spelling and grammatical variation and errors are a common source of noise, and the application of a correction tool like ours can be useful in the construction of treebanks for these languages.

2024

pdf bib abs

The Ethical Question – Use of Indigenous Corpora for Large Language Models
Linda Wiechetek | Flammie Pirinen | Maja Lisa Kappfjell | Trond Trosterud | Børre Gaup | Sjur Nørstebø Moshagen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Creating language technology based on language data has become very popular with the recent advances of large language models and neural network technologies. This makes language resources very valuable, and especially in case of indigenous languages, the scarce resources are even more precious. Given the good results of simply fetching everything you can from the internet and feeding it to neural networks in English, there has been more work on doing the same for all languages. However, indigenous language resources as they are on the web are not comparable in that they would encode the most recent normativised language in all its aspects. This problematic is further due to not understanding the texts input to models or output by models by the people who work on them. Corpora also have intelligent property rights and copyrights that are not respected. Furthermore, the web is filled with the result of language model -generated texts. In this article we describe an ethical and sustainable way to work with indigenous languages.

2023

pdf bib abs

A South Sámi Grammar Checker For Stopping Language Change
Linda Wiechetek | Maja Lisa Kappfjell
Proceedings of the NoDaLiDa 2023 Workshop on Constraint Grammar - Methods, Tools and Applications

We have released and evaluated the first South Sámi grammar checker GramDivvun. It corrects two frequent error types that are caused by and causing language change and a loss of the language’s morphological richness. These general error types comprise a number of errors regarding the adjective paradigm (confusion of attributive and predicative forms) and the negation paradigm. In addition, our work includes a classification of common error types regarding the adjective and negation paradigms and lead to extensive grammatical error mark-up of our gold corpus. We achieve precisions above 71% for both adjective and negation error correction.

Co-authors

Thomas B. Kjærstad 1

Trond Trosterud 1

Venues

SyntaxFest1

TLT1

Fix author