Shai Gordin

2025

From Clay to Code: Transforming Hittite Texts for Machine Learning
Emma Yavasan | Shai Gordin
Proceedings of the Second Workshop on Ancient Language Processing

This paper presents a comprehensive methodology for transforming XML-encoded Hittite cuneiform texts into computationally accessible formats for machine learning applications. Drawing from a corpus of 8,898 texts (558,349 tokens in total) encompassing 145 cataloged genres and compositions, we develop a structured approach to preserve both linguistic and philological annotations while enabling computational analysis. Our methodology addresses key challenges in ancient language processing, including the handling of fragmentary texts, multiple language layers, and complex annotation systems. We demonstrate the application of our corpus through experiments with T5 models, achieving significant improvements in Hittite-to-German translation (ROUGE-1: 0.895) while identifying limitations in morphological glossing tasks. This work establishes a standardized, machine-readable dataset in Hittite cuneiform, which also balances philological accuracy with the current state of the art.

EvaCun 2025 Shared Task: Lemmatization and Token Prediction in Akkadian and Sumerian using LLMs
Shai Gordin | Aleksi Sahala | Shahar Spencer | Stav Klein
Proceedings of the Second Workshop on Ancient Language Processing

The EvaCun 2025 Shared Task, organized as part of the ALP 2025 workshop and co-located with NAACL 2025, explores how Large Language Models (LLMs) and transformer-based models can be used to improve lemmatization and token prediction tasks for low-resource ancient cuneiform texts. This year our datasets focused on the best attested ancient Near Eastern languages written in cuneiform, namely Akkadian and Sumerian. However, we drew on datasets never before used at scale in NLP tasks, primarily first-millennium literature (i.e. “Canonical”) provided by the Electronic Babylonian Library (eBL), and Old Babylonian letters and archival texts provided by Archibab. We aim to encourage the development of new computational methods to better analyze and reconstruct cuneiform inscriptions, pushing NLP forward for ancient and low-resource languages. Three teams competed in the lemmatization subtask and one in the token prediction subtask. Each subtask was evaluated alongside a baseline model provided by the organizers.

Assignment of account type to proto-cuneiform economic texts with Multi-Class Support Vector Machines
Piotr Zadworny | Shai Gordin
Proceedings of the Second Workshop on Ancient Language Processing

We investigate the use of machine learning for classifying proto-cuneiform economic texts (3,500-3,000 BCE), leveraging Multi-Class Support Vector Machines (MSVM) to assign text type based on content. Proto-cuneiform presents unique challenges, as it does not encode spoken language, yet is transcribed into linear formats that obscure original structural elements. We address this by reformatting transcriptions, experimenting with different tokenization strategies, and optimizing feature extraction. Our workflow achieves high labeling reliability and enables significant metadata enrichment. In addition to improving digital corpus organization, our approach opens up the possibility of identifying economic institutions in ancient Mesopotamian archives, providing a new tool for Assyriological research.

2024

CuReD: Deep Learning Optical Character Recognition for Cuneiform Text Editions and Legacy Materials
Shai Gordin | Morris Alper | Avital Romach | Luis Saenz Santos | Naama Yochai | Roey Lalazar
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)

Cuneiform documents, the earliest known form of writing, are prolific textual sources of the ancient past. Experts publish editions of these texts in transliteration using specialized typesetting, but most remain inaccessible for computational analysis in traditional printed books or legacy materials. Off-the-shelf OCR systems are insufficient for digitization without adaptation. We present CuReD (Cuneiform Recognition-Documents), a deep learning-based human-in-the-loop OCR pipeline for digitizing scanned transliterations of cuneiform texts. CuReD has a character error rate of 9% on clean data and 11% on representative scans. We digitized a challenging sample of transliterated cuneiform documents, as well as lexical index cards from the University of Pennsylvania Museum, demonstrating the feasibility of our platform for enabling computational analysis and bolstering machine-readable cuneiform text datasets. Our results provide the first human-in-the-loop pipeline and interface for digitizing transliterated cuneiform sources and legacy materials, enabling the enrichment of digital sources for these low-resource languages.

2023

Proceedings of the Ancient Language Processing Workshop
Adam Anderson | Shai Gordin | Bin Li | Yudong Liu | Marco C. Passarotti
Proceedings of the Ancient Language Processing Workshop

2021

Word Sense Induction with Attentive Context Clustering
Moshe Stekel | Amos Azaria | Shai Gordin
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

In this paper, we present ACCWSI (Attentive Context Clustering WSI), a method for Word Sense Induction suitable for languages with limited resources. Pretrained on a small corpus and given an ambiguous word (the query word) and a set of excerpts that contain it, ACCWSI uses an attention mechanism to generate context-aware embeddings that distinguish between the different senses assigned to the query word. These embeddings are then clustered to group the main common uses of the query word. This method demonstrates practical applicability for shedding light on the meanings of ambiguous words in ancient languages, such as Classical Hebrew.
