2024
pdf
bib
abs
Neural Lemmatization and POS-tagging models for Coptic, Demotic and Earlier Egyptian
Aleksi Sahala
|
Eliese-Sophia Lincke
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
We present models for lemmatizing and POS-tagging Earlier Egyptian, Coptic and Demotic to test the performance of our pipeline for the ancient languages of Egypt. Of these languages, Demotic and Egyptian are known to be difficult to annotate due to their high extent of ambiguity. We report lemmatization accuracy of 86%, 91% and 99%, and XPOS-tagging accuracy of 89%, 95% and 98% for Earlier Egyptian, Demotic and Coptic, respectively.
2023
pdf
bib
abs
Using Word Embeddings for Identifying Emotions Relating to the Body in a Neo-Assyrian Corpus
Ellie Bennett
|
Aleksi Sahala
Proceedings of the Ancient Language Processing Workshop
Research into emotions is a developing field within Assyriology, and NLP tools for Akkadian texts offers a new perspective on the data. In this submission, we use PMI-based word embeddings to explore the relationship between parts of the body and emotions. Using data downloaded from Oracc, we ask which parts of the body were semantically linked to emotions. We do this through examining which of the top 10 results for a body part could be used to express emotions. After identifying two words for the body that have the most emotion words in their results list (libbu and kabattu), we then examine whether the emotion words in their results lists were indeed used in this manner in the Neo-Assyrian textual corpus. The results indicate that of the two body parts, kabattu was semantically linked to happiness and joy, and had a secondary emotional field of anger.
pdf
bib
abs
A Neural Pipeline for POS-tagging and Lemmatizing Cuneiform Languages
Aleksi Sahala
|
Krister Lindén
Proceedings of the Ancient Language Processing Workshop
We presented a pipeline for POS-tagging and lemmatizing cuneiform languages and evaluated its performance on Sumerian, first millennium Babylonian, Neo-Assyrian and Urartian texts extracted from Oracc. The system achieves a POS-tagging accuracy between 95-98% and a lemmatization accuracy of 94-96% depending on the language or dialect. For OOV words only, the current version can predict correct POS-tags for 83-91%, and lemmata for 68-84% of the input words. Compared with the earlier version, the current one has about 10% higher accuracy in OOV lemmatization and POS-tagging due to better neural network performance. We also tested the system for lemmatizing and POS-tagging the PROIEL Ancient Greek and Latin treebanks, achieving results similar to those with the cuneiform languages.
2020
pdf
bib
abs
Automated Phonological Transcription of Akkadian Cuneiform Text
Aleksi Sahala
|
Miikka Silfverberg
|
Antti Arppe
|
Krister Lindén
Proceedings of the Twelfth Language Resources and Evaluation Conference
Akkadian was an East-Semitic language spoken in ancient Mesopotamia. The language is attested on hundreds of thousands of cuneiform clay tablets. Several Akkadian text corpora contain only the transliterated text. In this paper, we investigate automated phonological transcription of the transliterated corpora. The phonological transcription provides a linguistically appealing form to represent Akkadian, because the transcription is normalized according to the grammatical description of a given dialect and explicitly shows the Akkadian renderings for Sumerian logograms. Because cuneiform text does not mark the inflection for logograms, the inflected form needs to be inferred from the sentence context. To the best of our knowledge, this is the first documented attempt to automatically transcribe Akkadian. Using a context-aware neural network model, we are able to automatically transcribe syllabic tokens at near human performance with 96% recall @ 3, while the logogram transcription remains more challenging at 82% recall @ 3.
pdf
bib
abs
BabyFST - Towards a Finite-State Based Computational Model of Ancient Babylonian
Aleksi Sahala
|
Miikka Silfverberg
|
Antti Arppe
|
Krister Lindén
Proceedings of the Twelfth Language Resources and Evaluation Conference
Akkadian is a fairly well resourced extinct language that does not yet have a comprehensive morphological analyzer available. In this paper we describe a general finite-state based morphological model for Babylonian, a southern dialect of the Akkadian language, that can achieve a coverage up to 97.3% and recall up to 93.7% on lemmatization and POS-tagging task on token level from a transcribed input. Since Akkadian word forms exhibit a high degree of morphological ambiguity, in that only 20.1% of running word tokens receive a single unambiguous analysis, we attempt a first pass at weighting our finite-state transducer, using existing extensive Akkadian corpora which have been partially validated for their lemmas and parts-of-speech but not the entire morphological analyses. The resultant weighted finite-state transducer yields a moderate improvement so that for 57.4% of the word tokens the highest ranked analysis is the correct one. We conclude with a short discussion on how morphological ambiguity in the analysis of Akkadian could be further reduced with improvements in the training data used in weighting the finite-state transducer as well as through other, context-based techniques.
pdf
bib
Akkadian Treebank for early Neo-Assyrian Royal Inscriptions
Mikko Luukko
|
Aleksi Sahala
|
Sam Hardwick
|
Krister Lindén
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories