Krzysztof Nowak


2022

Transformer-based Part-of-Speech Tagging and Lemmatization for Latin
Krzysztof Wróbel | Krzysztof Nowak
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

The paper presents a submission to the EvaLatin 2022 shared task. Our system places first for lemmatization, part-of-speech tagging, and morphological tagging in both the closed and open modalities. The results for the cross-genre and cross-time sub-tasks show that the system handles the diachronic and diastratic variation of Latin. The architecture employs state-of-the-art transformer models: XLM-RoBERTa large for part-of-speech and morphological tagging, and ByT5 small for lemmatization. The paper features a thorough discussion of part-of-speech and lemmatization errors, which shows how system performance may be improved for Classical, Medieval, and Neo-Latin texts.