Janine Siewert

2025

Adapting Definition Modeling for New Languages: A Case Study on Belarusian
Daniela Kazakouskaya | Timothee Mickus | Janine Siewert
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)

Definition modeling, the task of generating new definitions for words in context, holds great prospect as a means to assist the work of lexicographers in documenting a broader variety of lects and languages, yet much remains to be done in order to assess how we can leverage pre-existing models for as-of-yet unsupported languages. In this work, we focus on adapting existing models to Belarusian, for which we propose a novel dataset of 43,150 definitions. Our experiments demonstrate that adapting a definition modeling system requires minimal amounts of data, but that there currently are gaps in what automatic metrics do capture.

Towards a Cross-Dialectal Dictionary for Low German (Low Saxon)
Christian Chiarcos | Janine Siewert | Tabea Gröger | Christian Fäth
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Long and Short Papers

DUDU: A Treebank for Ottoman Turkish in UD Style
Enes Yılandiloğlu | Janine Siewert
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

This paper introduces a recently released Ottoman Turkish (ota) treebank in Universal Dependencies (UD) style, DUDU. The DUDU Treebank consists of 1,064 automatically annotated and manually corrected sentences. The texts were manually collected from various academic or literary sources available on the Internet. Following preprocessing, the sentences were annotated using a MaChAmp-based neural network model utilizing the large language model (LLM) architecture and manually corrected. The treebank became publicly available with the 2.14 release, and future steps involve expanding the treebank with more data and refining the annotation scheme. The treebank is the first and only treebank that utilizes the IJMES transliteration alphabet. The treebank not only gives insight into Ottoman Turkish lexically, morphologically, and syntactically, but also provides a small but robust test set for future computational models for Ottoman Turkish.

2024

AXOLOTL’24 Shared Task on Multilingual Explainable Semantic Change Modeling
Mariia Fedorova | Timothee Mickus | Niko Partanen | Janine Siewert | Elena Spaziani | Andrey Kutuzov
Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change

The Low Saxon LSDC Dataset at Universal Dependencies
Janine Siewert | Jack Rueter
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present an extension of the Low Saxon Universal Dependencies dataset and discuss a few annotation-related challenges. Low Saxon is a West-Germanic low-resource language that lacks a common standard and therefore poses challenges for NLP. The 1,000 sentences in our dataset cover the last 200 years and 8 of the 9 major dialects. They are presented both in original and in normalised spelling and two lemmata are provided: A Modern Low Saxon lemma and a Middle Low Saxon lemma. Several annotation-related issues result from dialectal variation in morphological categories, and we explain differences in the pronoun, gender, case, and mood system. Furthermore, we take up three syntactic constructions that do not occur in Standard Dutch or Standard German: the possessive dative, pro-drop in pronominal adverbs, and complementiser doubling in subordinate interrogative clauses. These constructions are also rare in the other Germanic UD datasets and have not always been annotated consistently.

2023

Changing usage of Low Saxon auxiliary and modal verbs
Janine Siewert | Martijn Wieling | Yves Scherrer
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

We investigate the usage of auxiliary and modal verbs in Low Saxon dialects from both Germany and the Netherlands based on word vectors, and compare developments in the modern language to Middle Low Saxon. Although most of these function words have not been affected by lexical replacement, changes in usage that likely at least partly result from contact with the state languages can still be observed.

Lemmatization Experiments on Two Low-Resourced Languages: Low Saxon and Occitan
Aleksandra Miletić | Janine Siewert
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

We present lemmatization experiments on the unstandardized low-resourced languages Low Saxon and Occitan using two machine-learning-based approaches represented by MaChAmp and Stanza. We show different ways to increase training data by leveraging historical corpora, small amounts of gold data and dictionary information, and discuss the usefulness of this additional data. In the results, we find some differences in the performance of the models depending on the language. This variation is likely to be partly due to differences in the corpora we used, such as the amount of internal variation. However, we also observe common tendencies, for instance that sequential models trained only on gold-annotated data often yield the best overall performance and generalize better to unknown tokens.

2022

Low Saxon dialect distances at the orthographic and syntactic level
Janine Siewert | Yves Scherrer | Martijn Wieling
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

We compare five Low Saxon dialects from the 19th and 21st century from Germany and the Netherlands with each other as well as with modern Standard Dutch and Standard German. Our comparison is based on character n-grams on the one hand and PoS n-grams on the other and we show that these two lead to different distances. Particularly in the PoS-based distances, one can observe all of the 21st century Low Saxon dialects shifting towards the modern majority languages.

2021

Towards a balanced annotated Low Saxon dataset for diachronic investigation of dialectal variation
Janine Siewert | Yves Scherrer | Jörg Tiedemann
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

2020

LSDC - A comprehensive dataset for Low Saxon Dialect Classification
Janine Siewert | Yves Scherrer | Martijn Wieling | Jörg Tiedemann
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

We present a new comprehensive dataset for the unstandardised West-Germanic language Low Saxon covering the last two centuries, the majority of modern dialects and various genres, which will be made openly available in connection with the final version of this paper. Since so far no such comprehensive dataset of contemporary Low Saxon exists, this provides a great contribution to NLP research on this language. We also test the use of this dataset for dialect classification by training a few baseline models comparing statistical and neural approaches. The performance of these models shows that in spite of an imbalance in the amount of data per dialect, enough features can be learned for a relatively high classification accuracy.