M. Willis Monroe


2024

Towards Fast Cognate Alignment on Imbalanced Data
Logan Born | M. Willis Monroe | Kathryn Kelley | Anoop Sarkar
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024

Cognate alignment models purport to enable decipherment, but their speed and need for clean data can make them unsuitable for realistic decipherment problems. We seek to draw attention to these shortcomings in the hopes that future work may avoid them, and we outline two techniques which begin to overcome the described problems.

2023

Disambiguating Numeral Sequences to Decipher Ancient Accounting Corpora
Logan Born | M. Willis Monroe | Kathryn Kelley | Anoop Sarkar
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

A numeration system encodes abstract numeric quantities as concrete strings of written characters. The numeration systems used by modern scripts tend to be precise and unambiguous, but this was not so for the ancient and partially-deciphered proto-Elamite (PE) script, where written numerals can have up to four distinct readings depending on the system that is used to read them. We consider the task of disambiguating between these readings in order to determine the values of the numeric quantities recorded in this corpus. We algorithmically extract a list of possible readings for each PE numeral notation, and contribute two disambiguation techniques based on structural properties of the original documents and classifiers learned with the bootstrapping algorithm. We also contribute a test set for evaluating disambiguation techniques, as well as a novel approach to cautious rule selection for bootstrapped classifiers. Our analysis confirms existing intuitions about this script and reveals previously-unknown correlations between tablet content and numeral magnitude. This work is crucial to understanding and deciphering PE, as the corpus is heavily accounting-focused and contains many more numeric tokens than tokens of text.

Learning the Character Inventories of Undeciphered Scripts Using Unsupervised Deep Clustering
Logan Born | M. Willis Monroe | Kathryn Kelley | Anoop Sarkar
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

A crucial step in deciphering a text is to identify which set of characters was used to write it. This requires grouping character tokens according to visual and contextual features, which can be challenging for human analysts when the number of tokens or underlying types is large. Prior work has shown that this process can be automated by clustering dense representations of character images, in a task which we call “script clustering”. In this work, we present novel architectures which exploit varying degrees of contextual and visual information to learn representations for use in script clustering. We evaluate on a range of modern and ancient scripts, and find that our models produce representations which are more effective for script recovery than the current state-of-the-art, despite using just ~2% as many parameters. Our analysis fruitfully applies these models to assess hypotheses about the character inventory of the partially deciphered proto-Elamite script.

2021

Compositionality of Complex Graphemes in the Undeciphered Proto-Elamite Script using Image and Text Embedding Models
Logan Born | Kathryn Kelley | M. Willis Monroe | Anoop Sarkar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021