A Cross-model Study on Learning Romanian Parts of Speech with Transformer Models

Radu Ion, Verginica Barbu Mititelu, Vasile Păiş, Elena Irimia, Valentin Badea


Abstract
This paper attempts to determine experimentally whether POS tagging of unseen words achieves accuracy comparable to that of words rarely seen in the training set (i.e. frequency less than 5) or seen more frequently (i.e. frequency greater than 10). To compare accuracies objectively, we use the odds ratio statistic and its confidence interval to show that the odds of being correct on unseen words are close to the odds of being correct on rarely seen words. For training the POS taggers, we use different Romanian BERT models that are freely available on HuggingFace.
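The odds ratio comparison the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' code: the counts are hypothetical, and a standard Wald confidence interval on the log odds ratio is assumed (if the interval contains 1.0, the odds of a correct tag do not differ significantly between the two word groups).

```python
import math

def odds_ratio_ci(correct_a, wrong_a, correct_b, wrong_b, z=1.96):
    """Odds ratio of group A vs. group B being tagged correctly,
    with a Wald confidence interval (z=1.96 gives ~95%)."""
    or_ = (correct_a * wrong_b) / (wrong_a * correct_b)
    # Standard error of ln(OR) from the four cell counts.
    se = math.sqrt(1/correct_a + 1/wrong_a + 1/correct_b + 1/wrong_b)
    log_or = math.log(or_)
    lo = math.exp(log_or - z * se)
    hi = math.exp(log_or + z * se)
    return or_, lo, hi

# Hypothetical counts: unseen words (900 correct / 100 wrong) vs.
# rarely seen words (920 correct / 80 wrong).
or_, lo, hi = odds_ratio_ci(900, 100, 920, 80)
# A CI spanning 1.0 means the two groups' odds of correctness
# are not significantly different at the chosen level.
```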
Anthology ID:
2024.clib-1.1
Volume:
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)
Month:
September
Year:
2024
Address:
Sofia, Bulgaria
Venue:
CLIB
Publisher:
Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences
Pages:
6–13
URL:
https://aclanthology.org/2024.clib-1.1
Cite (ACL):
Radu Ion, Verginica Barbu Mititelu, Vasile Păiş, Elena Irimia, and Valentin Badea. 2024. A Cross-model Study on Learning Romanian Parts of Speech with Transformer Models. In Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024), pages 6–13, Sofia, Bulgaria. Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences.
Cite (Informal):
A Cross-model Study on Learning Romanian Parts of Speech with Transformer Models (Ion et al., CLIB 2024)
PDF:
https://aclanthology.org/2024.clib-1.1.pdf