“Gotta catch ‘em all!”: Retrieving people in Ancient Greek texts combining transformer models and domain knowledge

Marijke Beersmans, Alek Keersmaekers, Evelien Graaf, Tim Van De Cruys, Mark Depauw, Margherita Fantoli


Abstract
In this paper, we present a study of transformer-based Named Entity Recognition (NER) as applied to Ancient Greek texts, with an emphasis on retrieving personal names. Recent research shows that, while the task remains difficult, the use of transformer models results in significant improvements. We therefore compare the performance of four transformer models on the task of NER for the categories of people, locations and groups, and add an out-of-domain test set to the existing datasets. Results on this set highlight the shortcomings of the models when confronted with a random sample of sentences. To integrate domain and linguistic knowledge more straightforwardly and thereby improve performance, we narrow down our approach to the category of people. The task is simplified to a binary PERS/MISC classification at the token level, starting from capitalised words. Next, we test the use of domain and linguistic knowledge to improve the results. We find that including simple gazetteer information as a binary mask has a marginally positive effect on newly annotated data and that treebanks can be used to help identify multi-word individuals if they are scarcely or inconsistently annotated in the available training data. The qualitative error analysis identifies the potential for improvement in both manual annotation and the inclusion of domain and linguistic knowledge in the transformer models.
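The two preprocessing ideas named in the abstract, starting from capitalised words and encoding gazetteer membership as a binary mask, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy sentence, and the exact-match lookup against a small set of name forms are all assumptions made for illustration, and the abstract does not specify how the mask is fed to the transformer models.

```python
from typing import List, Set


def capitalised_candidates(tokens: List[str]) -> List[int]:
    """Return the indices of tokens whose first character is uppercase.

    Hypothetical helper: these are the tokens that would then receive
    a binary PERS/MISC label.
    """
    return [i for i, tok in enumerate(tokens) if tok[:1].isupper()]


def gazetteer_mask(tokens: List[str], gazetteer: Set[str]) -> List[int]:
    """Return 1 for each token found in the gazetteer and 0 otherwise.

    Assumption: exact string match against a set of known name forms.
    """
    return [1 if tok in gazetteer else 0 for tok in tokens]


# Toy sentence ("Socrates and Plato in Athens"), for illustration only.
tokens = ["Σωκράτης", "καὶ", "Πλάτων", "ἐν", "Ἀθήναις"]

# Toy gazetteer of personal names, built from the tokens themselves so the
# strings match exactly regardless of Unicode normalisation.
toy_gazetteer = {tokens[0], tokens[2]}

candidates = capitalised_candidates(tokens)
mask = gazetteer_mask(tokens, toy_gazetteer)
```

In this toy sentence the place name is capitalised but absent from the gazetteer, so it remains a candidate that the classifier would need to label MISC.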
Anthology ID:
2024.ml4al-1.16
Volume:
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
Month:
August
Year:
2024
Address:
Hybrid in Bangkok, Thailand and online
Editors:
John Pavlopoulos, Thea Sommerschield, Yannis Assael, Shai Gordin, Kyunghyun Cho, Marco Passarotti, Rachele Sprugnoli, Yudong Liu, Bin Li, Adam Anderson
Venues:
ML4AL | WS
Publisher:
Association for Computational Linguistics
Pages:
152–164
URL:
https://aclanthology.org/2024.ml4al-1.16
Cite (ACL):
Marijke Beersmans, Alek Keersmaekers, Evelien Graaf, Tim Van De Cruys, Mark Depauw, and Margherita Fantoli. 2024. “Gotta catch ‘em all!”: Retrieving people in Ancient Greek texts combining transformer models and domain knowledge. In Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), pages 152–164, Hybrid in Bangkok, Thailand and online. Association for Computational Linguistics.
Cite (Informal):
“Gotta catch ‘em all!”: Retrieving people in Ancient Greek texts combining transformer models and domain knowledge (Beersmans et al., ML4AL-WS 2024)
PDF:
https://aclanthology.org/2024.ml4al-1.16.pdf