A Neural Pipeline for POS-tagging and Lemmatizing Cuneiform Languages

Aleksi Sahala, Krister Lindén


Abstract
We present a pipeline for POS-tagging and lemmatizing cuneiform languages and evaluate its performance on Sumerian, first millennium Babylonian, Neo-Assyrian and Urartian texts extracted from Oracc. The system achieves a POS-tagging accuracy of 95-98% and a lemmatization accuracy of 94-96%, depending on the language or dialect. For out-of-vocabulary (OOV) words only, the current version predicts correct POS-tags for 83-91% and correct lemmata for 68-84% of the input words. Compared with the earlier version, the current one has about 10% higher accuracy in OOV lemmatization and POS-tagging due to better neural network performance. We also tested the system on the PROIEL Ancient Greek and Latin treebanks, achieving results similar to those for the cuneiform languages.
Anthology ID:
2023.alp-1.23
Volume:
Proceedings of the Ancient Language Processing Workshop
Month:
September
Year:
2023
Address:
Varna, Bulgaria
Editors:
Adam Anderson, Shai Gordin, Bin Li, Yudong Liu, Marco C. Passarotti
Venues:
ALP | WS
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
203–212
URL:
https://aclanthology.org/2023.alp-1.23
Cite (ACL):
Aleksi Sahala and Krister Lindén. 2023. A Neural Pipeline for POS-tagging and Lemmatizing Cuneiform Languages. In Proceedings of the Ancient Language Processing Workshop, pages 203–212, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
A Neural Pipeline for POS-tagging and Lemmatizing Cuneiform Languages (Sahala & Lindén, ALP-WS 2023)
PDF:
https://aclanthology.org/2023.alp-1.23.pdf