2025
Divvunspell—Finite-State Spell-Checking and Correction on Modern Platforms
Flammie A Pirinen | Sjur Nørstebø Moshagen
Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP
Spell-checking and correction is one of the key applications of natural language support. Historically, for the biggest and less morphologically complex languages, spell-checking and correction could be implemented by relatively simple means; for morphologically complex and low-resource languages, however, such solutions were often suboptimal. Finite-state methods are the state of the art in rule-based natural language processing, and they have also been used effectively for spell-checking and correction. In this article, we show some recent developments of a finite-state spell-checker implementation that works with modern operating systems and platforms.
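The abstract only names the technique, so, purely as an illustration and not as the Divvunspell implementation itself (which builds on weighted finite-state transducers), here is a minimal Python sketch of the underlying idea: the lexicon is stored as an acyclic automaton (a plain trie below) and correction candidates are found by traversing it with a bounded edit distance. All identifiers and the toy lexicon are invented for this example.

```python
# Toy illustration of finite-state-style spell correction: the lexicon is a
# trie (an acyclic automaton) and corrections are found by traversing it
# while tracking Levenshtein distance. Not the Divvunspell implementation.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def suggest(root, word, max_dist=2):
    """Return (candidate, distance) pairs within max_dist edits of word."""
    results = []
    first_row = list(range(len(word) + 1))  # first row of the DP table

    def walk(node, prefix, prev_row):
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for i, wch in enumerate(word, start=1):
                row.append(min(row[i - 1] + 1,                    # insertion
                               prev_row[i] + 1,                   # deletion
                               prev_row[i - 1] + (wch != ch)))    # substitution
            if child.is_word and row[-1] <= max_dist:
                results.append((prefix + ch, row[-1]))
            if min(row) <= max_dist:            # prune branches that cannot recover
                walk(child, prefix + ch, row)

    walk(root, "", first_row)
    return sorted(results, key=lambda x: x[1])

lexicon = build_trie(["giella", "giellatekno", "sámi", "teksta"])
print(suggest(lexicon, "gielal", max_dist=2))   # -> [('giella', 2)]
```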
2024
Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets
Flammie A Pirinen
Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages
The current trends in natural language processing strongly favor large language models and generative AI as the basis for everything. For Uralic languages that are not widely present in publicly available data on the Internet, this can be problematic. In the current computational linguistic scene, it is very important to have your language represented in popular datasets: languages that are included in well-known datasets are also included in shared tasks, products by large technology corporations, and so forth. Such inclusion is especially important for under-resourced, under-studied minority and Indigenous languages, which will otherwise easily be forgotten. In this article, we present the resources that are often deemed necessary for the digital presence of a language in a world obsessed with large language models. We show that there are methods and tricks available to alleviate the lack of data and the lack of creators and annotators of such data, some more successful than others.
The Ethical Question – Use of Indigenous Corpora for Large Language Models
Linda Wiechetek | Flammie A. Pirinen | Børre Gaup | Trond Trosterud | Maja Lisa Kappfjell | Sjur Moshagen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Creating language technology based on language data has become very popular with the recent advances of large language models and neural network technologies. This makes language resources very valuable, and in the case of indigenous languages, the scarce resources are even more precious. Given the good results obtained for English by simply fetching everything available from the internet and feeding it to neural networks, there has been more work on doing the same for all languages. However, indigenous language resources as they exist on the web are not comparable, in that they do not necessarily encode the most recent normativised language in all its aspects. The problem is compounded by the fact that the people who work on the models do not understand the texts that go into or come out of them. Corpora also carry intellectual property rights and copyrights that are not respected. Furthermore, the web is increasingly filled with texts generated by language models. In this article we describe an ethical and sustainable way to work with indigenous languages.
2022
Reusing a Multi-lingual Setup to Bootstrap a Grammar Checker for a Very Low Resource Language without Data
Inga Lill Sigga Mikkelsen | Linda Wiechetek | Flammie A Pirinen
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Grammar checkers (GEC) are needed for digital language survival. Very low-resource languages like Lule Sámi, with fewer than 3,000 speakers, need to hurry to build these tools, but do not have the big corpus data required for the construction of machine-learning tools. We present a rule-based tool and a workflow in which the work done for a related language can speed up the process. We use an existing grammar to infer rules for the new language, and we do not need a large gold corpus of annotated grammar errors; instead, a smaller corpus of regression tests is built while developing the tool. We present a test case for Lule Sámi reusing resources from North Sámi, show how we achieve a categorisation of the most frequent errors, and present a preliminary evaluation of the system. We hope this serves as an inspiration for small languages that need advanced tools in a limited amount of time, but do not have big data.
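The workflow above relies on a small corpus of regression tests rather than a large annotated error corpus. The actual project has its own test tooling; the following hypothetical Python sketch only illustrates the general idea: each test case pairs an input sentence with the error the checker is expected to flag and the suggested correction, and the harness reports how many expectations the checker meets. The check() callable and the toy example data are invented stand-ins, not the real grammar checker.

```python
# Hypothetical sketch of a regression-test harness for a rule-based grammar
# checker: each case records an input sentence, the substring that should be
# flagged, and the correction that should be offered.

from typing import Callable, List, NamedTuple, Tuple

class TestCase(NamedTuple):
    sentence: str
    flagged: str      # substring the checker should mark as an error
    correction: str   # suggestion the checker should offer

# A checker maps a sentence to (flagged_substring, suggested_correction) pairs.
Checker = Callable[[str], List[Tuple[str, str]]]

def run_regression(cases: List[TestCase], check: Checker) -> None:
    failed = 0
    for case in cases:
        found = check(case.sentence)
        if (case.flagged, case.correction) not in found:
            failed += 1
            print(f"FAIL: {case.sentence!r}: expected "
                  f"{case.flagged!r} -> {case.correction!r}, got {found}")
    print(f"{len(cases) - failed}/{len(cases)} regression tests passed")

if __name__ == "__main__":
    # Toy checker that knows a single agreement rule, for demonstration only.
    def toy_checker(sentence: str) -> List[Tuple[str, str]]:
        return [("is", "are")] if "dogs is" in sentence else []

    cases = [TestCase("The dogs is barking.", "is", "are")]
    run_regression(cases, toy_checker)
```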
2021
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages
Flammie A Pirinen | Timofey Arhangelskiy | Trond Trosterud | Michael Rießler
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages
No more fumbling in the dark - Quality assurance of high-level NLP tools in a multi-lingual infrastructure
Linda Wiechetek | Flammie A Pirinen | Børre Gaup | Thomas Omma
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages
Numerals and what counts
Jack Rueter | Niko Partanen | Flammie A. Pirinen
Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021)