Unsupervised Authorship Attribution for Medieval Latin Using Transformer-Based Embeddings

Loic De Langhe, Orphee De Clercq, Veronique Hoste


Abstract
We explore the potential of employing transformer-based embeddings in an unsupervised authorship attribution task for medieval Latin. The development of Large Language Models (LLMs) and recent advances in transfer learning alleviate many of the traditional issues associated with authorship attribution in lower-resourced (ancient) languages. Despite this, these methods remain heavily understudied within this domain. Concretely, we generate strong contextual embeddings using a variety of mono- and multilingual transformer models and use these as input for two unsupervised clustering methods: a standard agglomerative clustering algorithm and a self-organizing map. We show that these transformer-based embeddings can be used to generate high-quality and interpretable clusterings, offering an attractive alternative to traditional feature-based methods.
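
As a rough illustration of the pipeline described in the abstract, the sketch below embeds a handful of toy Latin passages with a multilingual transformer and groups the resulting vectors with agglomerative clustering. This is a minimal sketch, not the authors' setup: the model name (bert-base-multilingual-cased), the mean-pooling strategy, the example sentences, and the cluster count are all illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import AgglomerativeClustering

# Toy passages standing in for real medieval Latin documents.
texts = [
    "In principio creavit Deus caelum et terram.",
    "Terra autem erat inanis et vacua.",
    "Gallia est omnis divisa in partes tres.",
    "Arma virumque cano, Troiae qui primus ab oris.",
]

model_name = "bert-base-multilingual-cased"  # assumption: any suitable mono- or multilingual model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state             # (batch, seq_len, hidden_dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()  # ignore padding when pooling
    embeddings = (hidden * mask).sum(1) / mask.sum(1)   # mean-pooled document embeddings

# Cluster the embeddings; n_clusters would equal the number of candidate authors.
# (Older scikit-learn versions use affinity= instead of metric=.)
clusterer = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
labels = clusterer.fit_predict(embeddings.numpy())
print(labels)  # e.g. [0 0 1 1] if the two "authors" separate cleanly

The self-organizing map mentioned in the abstract could be substituted for the clustering step with a library such as MiniSom; in either case, the induced clusters would then be compared against the known authors to assess attribution quality.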
Anthology ID:
2024.lt4hala-1.8
Volume:
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Rachele Sprugnoli, Marco Passarotti
Venues:
LT4HALA | WS
Publisher:
ELRA and ICCL
Pages:
57–64
URL:
https://aclanthology.org/2024.lt4hala-1.8
Cite (ACL):
Loic De Langhe, Orphee De Clercq, and Veronique Hoste. 2024. Unsupervised Authorship Attribution for Medieval Latin Using Transformer-Based Embeddings. In Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024, pages 57–64, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Unsupervised Authorship Attribution for Medieval Latin Using Transformer-Based Embeddings (De Langhe et al., LT4HALA-WS 2024)
PDF:
https://aclanthology.org/2024.lt4hala-1.8.pdf