Evelien de Graaf
2024
“Gotta catch ‘em all!”: Retrieving people in Ancient Greek texts combining transformer models and domain knowledge
Marijke Beersmans
|
Alek Keersmaekers
|
Evelien de Graaf
|
Tim Van de Cruys
|
Mark Depauw
|
Margherita Fantoli
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
In this paper, we present a study of transformer-based Named Entity Recognition (NER) as applied to Ancient Greek texts, with an emphasis on retrieving personal names. Recent research shows that, while the task remains difficult, the use of transformer models results in significant improvements. We, therefore, compare the performance of four transformer models on the task of NER for the categories of people, locations and groups, and add an out-of-domain test set to the existing datasets. Results on this set highlight the shortcomings of the models when confronted with a random sample of sentences. To be able to more straightforwardly integrate domain and linguistic knowledge to improve performance, we narrow down our approach to the category of people. The task is simplified to a binary PERS/MISC classification on the token level, starting from capitalised words. Next, we test the use of domain and linguistic knowledge to improve the results. We find that including simple gazetteer information as a binary mask has a marginally positive effect on newly annotated data and that treebanks can be used to help identify multi-word individuals if they are scarcely or inconsistently annotated in the available training data. The qualitative error analysis identifies the potential for improvement in both manual annotation and the inclusion of domain and linguistic knowledge in the transformer models.
2023
Training and Evaluation of Named Entity Recognition Models for Classical Latin
Marijke Beersmans
|
Evelien de Graaf
|
Tim Van de Cruys
|
Margherita Fantoli
Proceedings of the Ancient Language Processing Workshop
We evaluate the performance of various models on the task of named entity recognition (NER) for classical Latin. Using an existing dataset, we train two transformer-based LatinBERT models and one shallow conditional random field (CRF) model. The performance is assessed using both standard metrics and a detailed manual error analysis, and compared to the results obtained by different already released Latin NER tools. Both analyses demonstrate that the BERT models achieve a better f1-score than the other models. Furthermore, we annotate new, unseen data for further evaluation of the models, and we discuss the impact of annotation choices on the results.
2022
AGILe: The First Lemmatizer for Ancient Greek Inscriptions
Evelien de Graaf
|
Silvia Stopponi
|
Jasper K. Bos
|
Saskia Peels-Matthey
|
Malvina Nissim
Proceedings of the Thirteenth Language Resources and Evaluation Conference
To facilitate corpus searches by classicists as well as to reduce data sparsity when training models, we focus on the automatic lemmatization of ancient Greek inscriptions, which have not received as much attention in this sense as literary text data has. We show that existing lemmatizers for ancient Greek, trained on literary data, are not performant on epigraphic data, due to major language differences between the two types of texts. We thus train the first inscription-specific lemmatizer achieving above 80% accuracy, and make both the models and the lemmatized data available to the community. We also provide a detailed error analysis highlighting peculiarities of inscriptions which again highlights the importance of a lemmatizer dedicated to inscriptions.
Search
Fix data
Co-authors
- Marijke Beersmans 2
- Margherita Fantoli 2
- Tim Van de Cruys 2
- Jasper K. Bos 1
- Mark Depauw 1
- show all...