Linda Wiechetek


2022

Reusing a Multi-lingual Setup to Bootstrap a Grammar Checker for a Very Low Resource Language without Data
Inga Lill Sigga Mikkelsen | Linda Wiechetek | Flammie A Pirinen
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

Grammar checkers (GEC) are needed for digital language survival. Very low resource languages like Lule Sámi, with fewer than 3,000 speakers, need to build these tools quickly, but do not have the large corpora that are required for the construction of machine learning tools. We present a rule-based tool and a workflow where the work done for a related language can speed up the process. We use an existing grammar to infer rules for the new language, and we do not need a large gold corpus of annotated grammar errors; instead, a smaller corpus of regression tests is built while developing the tool. We present a test case for Lule Sámi reusing resources from North Sámi, show how we achieve a categorisation of the most frequent errors, and present a preliminary evaluation of the system. We hope this serves as an inspiration for small languages that need advanced tools in a limited amount of time, but do not have big data.

2021

Rules Ruling Neural Networks - Neural vs. Rule-Based Grammar Checking for a Low Resource Language
Linda Wiechetek | Flammie Pirinen | Mika Hämäläinen | Chiara Argese
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

We investigate both rule-based and machine learning methods for the task of compound error correction and evaluate their efficiency for North Sámi, a low resource language. The lack of error-free data needed for a neural approach is a challenge to the development of these tools, which is not shared by bigger languages. In order to compensate for that, we used a rule-based grammar checker to remove erroneous sentences and insert compound errors by splitting correct compounds. We describe how we set up the error detection rules, and how we train a bi-RNN based neural network. The precision of the rule-based model tested on a corpus with real errors (81.0%) is slightly better than that of the neural model (79.4%). The rule-based model is also more flexible with regard to fixing specific errors requested by the user community. However, the neural model has better recall (98%). The results suggest that an approach that combines the advantages of both models would be desirable in the future. Our tools and data sets are open-source and freely available on GitHub and Zenodo.
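The synthetic-error step mentioned in the abstract (inserting compound errors by splitting correct compounds) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the `segmentations` lookup, and the English toy data are all assumptions made for illustration. In a real setup the compound boundaries would presumably come from a morphological analyser rather than a hand-written dictionary.

```python
import random

def split_compound(word, parts, rng):
    """Return `word` with a space inserted at one compound boundary.

    `parts` is a hypothetical list of compound constituents whose
    concatenation equals `word`; words without a known boundary are
    returned unchanged.
    """
    if len(parts) < 2 or "".join(parts) != word:
        return word
    cut = rng.randrange(1, len(parts))  # pick one internal boundary
    return "".join(parts[:cut]) + " " + "".join(parts[cut:])

def make_training_pair(tokens, segmentations, rng):
    """Build an (erroneous, correct) sentence pair from a correct sentence."""
    noisy = [split_compound(t, segmentations.get(t, [t]), rng) for t in tokens]
    return " ".join(noisy), " ".join(tokens)

# Toy example (English stand-in data, not from the paper).
rng = random.Random(0)
segmentations = {"football": ["foot", "ball"]}
pair = make_training_pair(["the", "football", "match"], segmentations, rng)
print(pair)
```

Each pair couples a sentence containing an artificially split compound with its correct original, which is the kind of parallel data a correction model could be trained on.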

No more fumbling in the dark - Quality assurance of high-level NLP tools in a multi-lingual infrastructure
Linda Wiechetek | Flammie A Pirinen | Børre Gaup | Thomas Omma
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

2020

Morphological Disambiguation of South Sámi with FSTs and Neural Networks
Mika Hämäläinen | Linda Wiechetek
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

We present a method for morphological disambiguation of South Sámi, which is an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and some synthetically generated South Sámi data. The disambiguation is done on the level of morphological tags, ignoring word forms and lemmas; this makes it possible to use North Sámi training data for South Sámi without the need for a bilingual dictionary or aligned word embeddings. Our approach requires only minimal resources for South Sámi, which makes it applicable to other endangered languages as well.

2019

Is this the end? Two-step tokenization of sentence boundaries
Linda Wiechetek | Sjur Nørstebø Moshagen | Thomas Omma
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

Seeing more than whitespace — Tokenisation and disambiguation in a North Sámi grammar checker
Linda Wiechetek | Sjur Nørstebø Moshagen | Kevin Brubeck Unhammer
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2010

Reusing Grammatical Resources for New Languages
Lene Antonsen | Trond Trosterud | Linda Wiechetek
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Grammatical approaches to language technology are often considered less optimal than statistical approaches in multilingual settings, where large-scale portability becomes an important issue. The present paper argues that there is a notable gain in reusing grammatical resources when porting technology to new languages. The pivot language is North Sámi, and the paper discusses portability with respect to the closely related Lule and South Sámi, and to the unrelated Faroese and Greenlandic languages.

2009

Developing Prototypes for Machine Translation between Two Sami Languages
Francis M. Tyers | Linda Wiechetek | Trond Trosterud
Proceedings of the 13th Annual conference of the European Association for Machine Translation