Aleksi Sahala

2025

Neural Models for Lemmatization and POS-Tagging of Earlier and Late Egyptian (Supporting Hieroglyphic Input) and Demotic
Aleksi Sahala | Eliese-Sophia Lincke
Proceedings of the Second Workshop on Ancient Language Processing

We present updated models for BabyLemmatizer for lemmatizing and POS-tagging Demotic, Late Egyptian and Earlier Egyptian, with support for using hieroglyphs as input. In this paper, we also use data that has not been cleaned of breakages. We achieve a consistent UPOS tagging accuracy of 94% or higher and an XPOS tagging accuracy of 93% or higher for all languages. For lemmatization, which is challenging in all of our test languages due to extensive ambiguity, we demonstrate accuracies from 77% up to 92% depending on the language and the input script.

EvaCun 2025 Shared Task: Lemmatization and Token Prediction in Akkadian and Sumerian using LLMs
Shai Gordin | Aleksi Sahala | Shahar Spencer | Stav Klein
Proceedings of the Second Workshop on Ancient Language Processing

The EvaCun 2025 Shared Task, organized as part of the ALP 2025 workshop and co-located with NAACL 2025, explores how Large Language Models (LLMs) and transformer-based models can be used to improve lemmatization and token prediction for low-resource ancient cuneiform texts. This year our datasets focused on the best-attested ancient Near Eastern languages written in cuneiform, namely Akkadian and Sumerian. We took advantage of datasets never before used at scale in NLP tasks, primarily first millennium literature (i.e. “Canonical”) provided by the Electronic Babylonian Library (eBL), and Old Babylonian letters and archival texts provided by Archibab. We aim to encourage the development of new computational methods to better analyze and reconstruct cuneiform inscriptions, pushing NLP forward for ancient and low-resource languages. Three teams competed in the lemmatization subtask and one in the token prediction subtask. Each subtask was evaluated alongside a baseline model provided by the organizers.

2024

Neural Lemmatization and POS-tagging models for Coptic, Demotic and Earlier Egyptian
Aleksi Sahala | Eliese-Sophia Lincke
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)

We present models for lemmatizing and POS-tagging Earlier Egyptian, Coptic and Demotic to test the performance of our pipeline for the ancient languages of Egypt. Of these languages, Demotic and Egyptian are known to be difficult to annotate due to their high extent of ambiguity. We report lemmatization accuracy of 86%, 91% and 99%, and XPOS-tagging accuracy of 89%, 95% and 98% for Earlier Egyptian, Demotic and Coptic, respectively.

2023

Using Word Embeddings for Identifying Emotions Relating to the Body in a Neo-Assyrian Corpus
Ellie Bennett | Aleksi Sahala
Proceedings of the Ancient Language Processing Workshop

Research into emotions is a developing field within Assyriology, and NLP tools for Akkadian texts offer a new perspective on the data. In this submission, we use PMI-based word embeddings to explore the relationship between parts of the body and emotions. Using data downloaded from Oracc, we ask which parts of the body were semantically linked to emotions. We do this by examining which of the top 10 results for a body part could be used to express emotions. After identifying the two words for the body that have the most emotion words in their results lists (libbu and kabattu), we then examine whether the emotion words in their results lists were indeed used in this manner in the Neo-Assyrian textual corpus. The results indicate that of the two body parts, kabattu was semantically linked to happiness and joy, and had a secondary emotional field of anger.

A Neural Pipeline for POS-tagging and Lemmatizing Cuneiform Languages
Aleksi Sahala | Krister Lindén
Proceedings of the Ancient Language Processing Workshop

We present a pipeline for POS-tagging and lemmatizing cuneiform languages and evaluate its performance on Sumerian, first millennium Babylonian, Neo-Assyrian and Urartian texts extracted from Oracc. The system achieves a POS-tagging accuracy of 95-98% and a lemmatization accuracy of 94-96% depending on the language or dialect. For OOV words only, the current version can predict correct POS-tags for 83-91%, and lemmata for 68-84% of the input words. Compared with the earlier version, the current one has about 10% higher accuracy in OOV lemmatization and POS-tagging due to better neural network performance. We also tested the system for lemmatizing and POS-tagging the PROIEL Ancient Greek and Latin treebanks, achieving results similar to those with the cuneiform languages.

2020

Automated Phonological Transcription of Akkadian Cuneiform Text
Aleksi Sahala | Miikka Silfverberg | Antti Arppe | Krister Lindén
Proceedings of the Twelfth Language Resources and Evaluation Conference

Akkadian was an East-Semitic language spoken in ancient Mesopotamia. The language is attested on hundreds of thousands of cuneiform clay tablets. Several Akkadian text corpora contain only the transliterated text. In this paper, we investigate automated phonological transcription of the transliterated corpora. The phonological transcription provides a linguistically appealing form to represent Akkadian, because the transcription is normalized according to the grammatical description of a given dialect and explicitly shows the Akkadian renderings for Sumerian logograms. Because cuneiform text does not mark the inflection for logograms, the inflected form needs to be inferred from the sentence context. To the best of our knowledge, this is the first documented attempt to automatically transcribe Akkadian. Using a context-aware neural network model, we are able to automatically transcribe syllabic tokens at near human performance with 96% recall @ 3, while the logogram transcription remains more challenging at 82% recall @ 3.

BabyFST - Towards a Finite-State Based Computational Model of Ancient Babylonian
Aleksi Sahala | Miikka Silfverberg | Antti Arppe | Krister Lindén
Proceedings of the Twelfth Language Resources and Evaluation Conference

Akkadian is a fairly well-resourced extinct language that does not yet have a comprehensive morphological analyzer available. In this paper we describe a general finite-state based morphological model for Babylonian, a southern dialect of the Akkadian language, that can achieve coverage of up to 97.3% and recall of up to 93.7% on a token-level lemmatization and POS-tagging task from transcribed input. Since Akkadian word forms exhibit a high degree of morphological ambiguity, in that only 20.1% of running word tokens receive a single unambiguous analysis, we attempt a first pass at weighting our finite-state transducer, using existing extensive Akkadian corpora which have been partially validated for their lemmas and parts-of-speech but not for the entire morphological analyses. The resultant weighted finite-state transducer yields a moderate improvement, so that for 57.4% of the word tokens the highest ranked analysis is the correct one. We conclude with a short discussion on how morphological ambiguity in the analysis of Akkadian could be further reduced with improvements in the training data used in weighting the finite-state transducer as well as through other, context-based techniques.

Akkadian Treebank for early Neo-Assyrian Royal Inscriptions
Mikko Luukko | Aleksi Sahala | Sam Hardwick | Krister Lindén
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories