2024
pdf
bib
abs
“Gotta catch ‘em all!”: Retrieving people in Ancient Greek texts combining transformer models and domain knowledge
Marijke Beersmans
|
Alek Keersmaekers
|
Evelien de Graaf
|
Tim Van de Cruys
|
Mark Depauw
|
Margherita Fantoli
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
In this paper, we present a study of transformer-based Named Entity Recognition (NER) as applied to Ancient Greek texts, with an emphasis on retrieving personal names. Recent research shows that, while the task remains difficult, the use of transformer models results in significant improvements. We, therefore, compare the performance of four transformer models on the task of NER for the categories of people, locations and groups, and add an out-of-domain test set to the existing datasets. Results on this set highlight the shortcomings of the models when confronted with a random sample of sentences. To be able to more straightforwardly integrate domain and linguistic knowledge to improve performance, we narrow down our approach to the category of people. The task is simplified to a binary PERS/MISC classification on the token level, starting from capitalised words. Next, we test the use of domain and linguistic knowledge to improve the results. We find that including simple gazetteer information as a binary mask has a marginally positive effect on newly annotated data and that treebanks can be used to help identify multi-word individuals if they are scarcely or inconsistently annotated in the available training data. The qualitative error analysis identifies the potential for improvement in both manual annotation and the inclusion of domain and linguistic knowledge in the transformer models.
pdf
bib
abs
Adapting transformer models to morphological tagging of two highly inflectional languages: a case study on Ancient Greek and Latin
Alek Keersmaekers
|
Wouter Mercelis
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
Natural language processing for Greek and Latin, inflectional languages with small corpora, requires special techniques. For morphological tagging, transformer models show promising potential, but the best approach to use these models is unclear. For both languages, this paper examines the impact of using morphological lexica, training different model types (a single model with a combined feature tag, multiple models for separate features, and a multi-task model for all features), and adding linguistic constraints. We find that, although simply fine-tuning transformers to predict a monolithic tag may already yield decent results, each of these adaptations can further improve tagging accuracy.
2023
pdf
bib
abs
Word Sense Disambiguation for Ancient Greek: Sourcing a training corpus through translation alignment
Alek Keersmaekers
|
Wouter Mercelis
|
Toon Van Hal
Proceedings of the Ancient Language Processing Workshop
This paper seeks to leverage translations of Ancient Greek texts to enhance the performance of automatic word sense disambiguation (WSD). Satisfactory WSD in Ancient Greek is achievable, provided that the system can rely on annotated data. This study, acknowledging the challenges of manually assigning meanings to every Greek lemma, explores the strategies to derive WSD data from parallel texts using sentence and word alignment. Our results suggest that, assuming the condition of high word frequency is met, this technique permits us to automatically produce a significant volume of annotated data, although there are still significant obstacles when trying to automate this process.
2022
pdf
bib
abs
In Search of the Flocks: How to Perform Onomasiological Queries in an Ancient Greek Corpus?
Alek Keersmaekers
|
Toon Van Hal
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
This paper explores the possibilities of onomasiologically querying corpus data of Ancient Greek. The significance of the onomasiological approach has been highlighted in recent studies, yet the possibilities of performing ‘word-finding’ investigations into corpus data have not been dealt with in depth. The case study chosen focuses on collective nouns denoting animate groups (such as flocks of people, herds of cattle). By relying on a large automatically annotated corpus of Ancient Greek and on token-based vector information, a longlist of collective nouns was compiled through morpho-syntactic extraction and successive clustering procedures. After reducing this longlist to a shortlist, the results obtained are evaluated. In general, we find that πλῆθος can be considered to be the default collective noun of both humans and animals, becoming especially prominent during the Hellenistic period. In addition, specific tendencies in the use of collective nouns are discerned for specific semantic classes (e.g. gods and insects) and over time. Throughout the paper, special attention is paid to methodological issues related to onomasiologically searching.
pdf
bib
abs
An ELECTRA Model for Latin Token Tagging Tasks
Wouter Mercelis
|
Alek Keersmaekers
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
This report describes the KU Leuven / Brepols-CTLO submission to EvaLatin 2022. We present the results of our current small Latin ELECTRA model, which will be expanded to a larger model in the future. For the lemmatization task, we combine a neural token-tagging approach with the in-house rule-based lemma lists from Brepols’ ReFlex software. The results are decent, but suffer from inconsistencies between Brepols’ and EvaLatin’s definitions of a lemma. For POS-tagging, the results come up just short from the first place in this competition, mainly struggling with proper nouns. For morphological tagging, there is much more room for improvement. Here, the constraints added to our Multiclass Multilabel model were often not tight enough, causing missing morphological features. We will further investigate why the combination of the different morphological features, which perform fine on their own, leads to issues.
2021
pdf
bib
abs
The GLAUx corpus: methodological issues in designing a long-term, diverse, multi-layered corpus of Ancient Greek
Alek Keersmaekers
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021
This paper describes the GLAUx project (“the Greek Language Automated”), an ongoing effort to develop a large long-term diachronic corpus of Greek, covering sixteen centuries of literary and non-literary material annotated with NLP methods. After providing an overview of related corpus projects and discussing the general architecture of the corpus, it zooms in on a number of larger methodological issues in the design of historical corpora. These include the encoding of textual variants, handling extralinguistic variation and annotating linguistic ambiguity. Finally, the long- and short-term perspectives of this project are discussed.
2020
pdf
bib
abs
Automatic semantic role labeling in Ancient Greek using distributional semantic modeling
Alek Keersmaekers
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
This paper describes a first attempt to automatic semantic role labeling in Ancient Greek, using a supervised machine learning approach. A Random Forest classifier is trained on a small semantically annotated corpus of Ancient Greek, annotated with a large amount of linguistic features, including form of the construction, morphology, part-of-speech, lemmas, animacy, syntax and distributional vectors of Greek words. These vectors turned out to be more important in the model than any other features, likely because they are well suited to handle a low amount of training examples. Overall labeling accuracy was 0.757, with large differences with respect to the specific role that was labeled and with respect to text genre. Some ways to further improve these results include expanding the amount of training examples, improving the quality of the distributional vectors and increasing the consistency of the syntactic annotation.
2019
pdf
bib
Creating, Enriching and Valorizing Treebanks of Ancient Greek
Alek Keersmaekers
|
Wouter Mercelis
|
Colin Swaelens
|
Toon Van Hal
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)