Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
John Pavlopoulos
Thea Sommerschield
Yannis Assael
Shai Gordin
Kyunghyun Cho
Marco Passarotti
Rachele Sprugnoli
Yudong Liu
Bin Li
Adam Anderson
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
CuReD: Deep Learning Optical Character Recognition for Cuneiform Text Editions and Legacy Materials
Shai Gordin
Morris Alper
Avital Romach
Luis Saenz Santos
Naama Yochai
Roey Lalazar
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
Cuneiform documents, the earliest known form of writing, are prolific textual sources of the ancient past. Experts publish editions of these texts in transliteration using specialized typesetting, but most remain inaccessible for computational analysis in traditional printed books or legacy materials. Off-the-shelf OCR systems are insufficient for digitization without adaptation. We present CuReD (Cuneiform Recognition-Documents), a deep learning-based human-in-the-loop OCR pipeline for digitizing scanned transliterations of cuneiform texts. CuReD has a character error rate of 9% on clean data and 11% on representative scans. We digitized a challenging sample of transliterated cuneiform documents, as well as lexical index cards from the University of Pennsylvania Museum, demonstrating the feasibility of our platform for enabling computational analysis and bolstering machine-readable cuneiform text datasets. Our results provide the first human-in-the-loop pipeline and interface for digitizing transliterated cuneiform sources and legacy materials, enabling the enrichment of digital sources of these low-resource languages.
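The character error rate quoted above is conventionally defined as the character-level edit distance between the OCR output and a reference transcription, divided by the length of the reference. The abstract does not say how CuReD normalizes text before scoring, so the sketch below is only an illustration of the standard metric: the function names, the NFC normalization step, and the sample strings are assumptions, not part of the CuReD pipeline.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length (standard CER definition)."""
    # Assumption: compare NFC-normalized text so that a diacritic such as
    # "š" counts as one character regardless of how it was encoded.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Made-up example: an OCR output that dropped two diacritics.
cer = character_error_rate("šarru rabû", "sarru rabu")
print(f"CER = {cer:.2f}")
```

Diacritics matter a great deal in transliterated cuneiform, so whether they are normalized, stripped, or compared exactly can shift a reported error rate noticeably.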
Proceedings of the Ancient Language Processing Workshop
Adam Anderson
Shai Gordin
Bin Li
Yudong Liu
Marco C. Passarotti
Proceedings of the Ancient Language Processing Workshop
Word Sense Induction with Attentive Context Clustering
Moshe Stekel
Amos Azaria
Shai Gordin
Proceedings of the Workshop on Natural Language Processing for Digital Humanities
In this paper, we present ACCWSI (Attentive Context Clustering WSI), a method for Word Sense Induction, suitable for languages with limited resources. Pretrained on a small corpus and given an ambiguous word (query word) and a set of excerpts that contain it, ACCWSI uses an attention mechanism for generating context-aware embeddings, distinguishing between the different senses assigned to the query word. These embeddings are then clustered to provide groups of main common uses of the query word. This method demonstrates practical applicability for shedding light on the meanings of ambiguous words in ancient languages, such as Classical Hebrew.
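The two stages described above (attention-based, context-aware embeddings followed by clustering) can be illustrated with a toy sketch. This is not the ACCWSI architecture: the abstract does not specify the form of attention or the clustering algorithm, so dot-product attention pooling and a greedy cosine-threshold grouping are used here purely as stand-ins, and all vectors, names, and the threshold are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_context_embedding(query_vec, context_vecs):
    """Attention-weighted average of the context vectors of one excerpt."""
    weights = softmax([dot(query_vec, c) for c in context_vecs])
    dim = len(query_vec)
    return [sum(w * c[d] for w, c in zip(weights, context_vecs))
            for d in range(dim)]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def cluster_by_similarity(embeddings, threshold=0.8):
    """Greedy grouping: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of excerpt indices
    for i, emb in enumerate(embeddings):
        for members in clusters:
            if cosine(emb, embeddings[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical 2-D vectors: one for the query word, and the context-word
# vectors of four excerpts that contain it.
query = [1.0, 1.0]
excerpts = [
    [[2.0, 0.0], [1.5, 0.1]],  # contexts pointing toward one sense
    [[1.8, 0.2], [2.0, 0.1]],
    [[0.0, 2.0], [0.1, 1.7]],  # contexts pointing toward another sense
    [[0.2, 1.9], [0.0, 2.0]],
]
embeddings = [attentive_context_embedding(query, ctx) for ctx in excerpts]
clusters = cluster_by_similarity(embeddings)
print(clusters)
```

In this toy setup each resulting cluster corresponds to one induced sense of the query word; a real system would learn the embeddings and attention parameters from the pretraining corpus rather than use hand-picked vectors.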