Colin Swaelens

2025

Lemmatisation & Morphological Analysis of Unedited Greek: Do Simple Tasks Need Complex Solutions?
Colin Swaelens | Ilse De Vos | Els Lefever
Findings of the Association for Computational Linguistics: ACL 2025

Fine-tuning transformer-based models for part-of-speech tagging of unedited Greek text has outperformed traditional systems. However, when applied to lemmatisation or morphological analysis, fine-tuning has not yet achieved competitive results. This paper explores various approaches to combine morphological features to both reduce label complexity and enhance multi-task training. Specifically, we group three nominal features into a single label, and combine the three most distinctive features of verbs into another unified label. These combined labels are used to fine-tune DBBERT, a BERT model pre-trained on both ancient and modern Greek. Additionally, we experiment with joint training – both among these labels and in combination with POS tagging – within a multi-task framework to improve performance by transferring parameters. To evaluate our models, we use a manually annotated gold standard from the Database of Byzantine Book Epigrams. Our results show a nearly 9 pp. improvement, demonstrating that multi-task learning is a promising approach for linguistic annotation in less standardised corpora.

2024

Lemmatisation of Medieval Greek: Against the Limits of Transformer’s Capabilities?
Colin Swaelens | Pranaydeep Singh | Ilse de Vos | Els Lefever
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper presents preliminary experiments for the lemmatisation of unedited, Byzantine Greek epigrams. This type of Greek is quite different from its classical ancestor, mostly because of its orthographic inconsistencies. Existing lemmatisation algorithms display an accuracy drop of around 30pp when tested on these Byzantine book epigrams. We conducted seven different lemmatisation experiments, which were either transformer-based or based on neural edit-trees. The best performing lemmatiser was a hybrid method combining transformer-based embeddings with a dictionary look-up. We compare our results with existing lemmatisers, and provide a detailed error analysis revealing why unedited, Byzantine Greek is so challenging for lemmatisation.

2023

Medieval Social Media: Manual and Automatic Annotation of Byzantine Greek Marginal Writing
Colin Swaelens | Ilse De Vos | Els Lefever
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)

In this paper, we present the interim results of a transformer-based annotation pipeline for Ancient and Medieval Greek. As the texts in the Database of Byzantine Book Epigrams have not been normalised, they pose more challenges for manual and automatic annotation than normalised Ancient Greek texts do. As a result, the existing annotation tools perform poorly. We compiled three data sets for the development of an automatic annotation tool and carried out an inter-annotator agreement study, with a promising agreement score. The experimental results show that our part-of-speech tagger yields accuracy scores that are almost 50 percentage points higher than the widely used rule-based system Morpheus. In addition, error analysis revealed problems related to phenomena also occurring in current social media language.

Evaluating Existing Lemmatisers on Unedited Byzantine Greek Poetry
Colin Swaelens | Ilse De Vos | Els Lefever
Proceedings of the Ancient Language Processing Workshop

This paper reports on the results of a comparative evaluation in view of the development of a new lemmatizer for unedited, Byzantine Greek texts. For the experiment, the performance of four existing lemmatizers, all pre-trained on Ancient Greek texts, was evaluated on how well they could handle texts stemming from the Middle Ages and displaying quite a few peculiarities. The aim of this study is to gain insight into the pitfalls of existing lemmatisation approaches as well as the specific challenges of our Byzantine Greek corpus, in order to develop a lemmatizer that can cope with its peculiarities. The results of the experiment show an accuracy drop of 20pp. on our corpus, which is further investigated in a qualitative error analysis.
