Logan Born

2024

pdf bib abs
Towards Fast Cognate Alignment on Imbalanced Data
Logan Born | M. Willis Monroe | Kathryn Kelley | Anoop Sarkar
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024

Cognate alignment models purport to enable decipherment, but their speed and need for clean data can make them unsuitable for realistic decipherment problems. We seek to draw attention to these shortcomings in the hopes that future work may avoid them, and we outline two techniques which begin to overcome the described problems.

2023

pdf bib abs
Decipherment as Regression: Solving Historical Substitution Ciphers by Learning Symbol Recurrence Relations
Nishant Kambhatla | Logan Born | Anoop Sarkar
Findings of the Association for Computational Linguistics: EACL 2023

Solving substitution ciphers involves mapping sequences of cipher symbols to fluent text in a target language. This has conventionally been formulated as a search problem, to find the decipherment key using a character-level language model to constrain the search space. This work instead frames decipherment as a sequence prediction task, using a Transformer-based causal language model to learn recurrences between characters in a ciphertext. We introduce a novel technique for transcribing arbitrary substitution ciphers into a common recurrence encoding. By leveraging this technique, we (i) create a large synthetic dataset of homophonic ciphers using random keys, and (ii) train a decipherment model that predicts the plaintext sequence given a recurrence-encoded ciphertext. Our method achieves strong results on synthetic 1:1 and homophonic ciphers, and cracks several real historic homophonic ciphers. Our analysis shows that the model learns recurrence relations between cipher symbols and recovers decipherment keys in its self-attention.

pdf bib abs
Disambiguating Numeral Sequences to Decipher Ancient Accounting Corpora
Logan Born | M. Willis Monroe | Kathryn Kelley | Anoop Sarkar
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

A numeration system encodes abstract numeric quantities as concrete strings of written characters. The numeration systems used by modern scripts tend to be precise and unambiguous, but this was not so for the ancient and partially-deciphered proto-Elamite (PE) script, where written numerals can have up to four distinct readings depending on the system that is used to read them. We consider the task of disambiguating between these readings in order to determine the values of the numeric quantities recorded in this corpus. We algorithmically extract a list of possible readings for each PE numeral notation, and contribute two disambiguation techniques based on structural properties of the original documents and classifiers learned with the bootstrapping algorithm. We also contribute a test set for evaluating disambiguation techniques, as well as a novel approach to cautious rule selection for bootstrapped classifiers. Our analysis confirms existing intuitions about this script and reveals previously-unknown correlations between tablet content and numeral magnitude. This work is crucial to understanding and deciphering PE, as the corpus is heavily accounting-focused and contains many more numeric tokens than tokens of text.

pdf bib abs
Learning the Character Inventories of Undeciphered Scripts Using Unsupervised Deep Clustering
Logan Born | M. Willis Monroe | Kathryn Kelley | Anoop Sarkar
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

A crucial step in deciphering a text is to identify what set of characters were used to write it. This requires grouping character tokens according to visual and contextual features, which can be challenging for human analysts when the number of tokens or underlying types is large. Prior work has shown that this process can be automated by clustering dense representations of character images, in a task which we call “script clustering”. In this work, we present novel architectures which exploit varying degrees of contextual and visual information to learn representations for use in script clustering. We evaluate on a range of modern and ancient scripts, and find that our models produce representations which are more effective for script recovery than the current state-of-the-art, despite using just ~2% as many parameters. Our analysis fruitfully applies these models to assess hypotheses about the character inventory of the partially-deciphered proto-Elamite script.

pdf bib abs
Learning Nearest Neighbour Informed Latent Word Embeddings to Improve Zero-Shot Machine Translation
Nishant Kambhatla | Logan Born | Anoop Sarkar
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

Multilingual neural translation models exploit cross-lingual transfer to perform zero-shot translation between unseen language pairs. Past efforts to improve cross-lingual transfer have focused on aligning contextual sentence-level representations. This paper introduces three novel contributions to allow exploiting nearest neighbours at the token level during training, including: (i) an efficient, gradient-friendly way to share representations between neighboring tokens; (ii) an attentional semantic layer which extracts latent features from shared embeddings; and (iii) an agreement loss to harmonize predictions across different sentence representations. Experiments on two multilingual datasets demonstrate consistent gains in zero shot translation over strong baselines.

2022

pdf bib abs
Auxiliary Subword Segmentations as Related Languages for Low Resource Multilingual Translation
Nishant Kambhatla | Logan Born | Anoop Sarkar
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

We propose a novel technique that combines alternative subword tokenizations of a single source-target language pair that allows us to leverage multilingual neural translation training methods. These alternate segmentations function like related languages in multilingual translation. Overall this improves translation accuracy for low-resource languages and produces translations that are lexically diverse and morphologically rich. We also introduce a cross-teaching technique which yields further improvements in translation accuracy and cross-lingual transfer between high- and low-resource language pairs. Compared to other strong multilingual baselines, our approach yields average gains of +1.7 BLEU across the four low-resource datasets from the multilingual TED-talks dataset. Our technique does not require additional training data and is a drop-in improvement for any existing neural translation system.

pdf bib abs
CipherDAug: Ciphertext based Data Augmentation for Neural Machine Translation
Nishant Kambhatla | Logan Born | Anoop Sarkar
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a novel data-augmentation technique for neural machine translation based on ROT-k ciphertexts. ROT-k is a simple letter substitution cipher that replaces a letter in the plaintext with the kth letter after it in the alphabet. We first generate multiple ROT-k ciphertexts using different values of k for the plaintext which is the source side of the parallel data. We then leverage this enciphered training data along with the original parallel data via multi-source training to improve neural machine translation. Our method, CipherDAug, uses a co-regularization-inspired training procedure, requires no external data sources other than the original training data, and uses a standard Transformer to outperform strong data augmentation techniques on several datasets by a significant margin. This technique combines easily with existing approaches to data augmentation, and yields particularly strong results in low-resource settings.

pdf bib abs
Sequence Models for Document Structure Identification in an Undeciphered Script
Logan Born | M. Monroe | Kathryn Kelley | Anoop Sarkar
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

This work describes the first thorough analysis of “header” signs in proto-Elamite, an undeciphered script from 3100-2900 BCE. Headers are a category of signs which have been provisionally identified through painstaking manual analysis of this script by domain experts. We use unsupervised neural and statistical sequence modeling techniques to provide new and independent evidence for the existence of headers, without supervision from domain experts. Having affirmed the existence of headers as a legitimate structural feature, we next arrive at a richer understanding of their possible meaning and purpose by (i) examining which features predict their presence; (ii) identifying correlations between these features and other document properties; and (iii) examining cases where these features predict the presence of a header in texts where domain experts do not expect one (or vice versa). We provide more concrete processes for labeling headers in this corpus and a clearer justification for existing intuitions about document structure in proto-Elamite.

2021

pdf bib
Compositionality of Complex Graphemes in the Undeciphered Proto-Elamite Script using Image and Text Embedding Models
Logan Born | Kathryn Kelley | M. Willis Monroe | Anoop Sarkar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2019

pdf bib abs
Sign Clustering and Topic Extraction in Proto-Elamite
Logan Born | Kate Kelley | Nishant Kambhatla | Carolyn Chen | Anoop Sarkar
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We describe a first attempt at using techniques from computational linguistics to analyze the undeciphered proto-Elamite script. Using hierarchical clustering, n-gram frequencies, and LDA topic models, we both replicate results obtained by manual decipherment and reveal previously-unobserved relationships between signs. This demonstrates the utility of these techniques as an aid to manual decipherment.

2018

pdf bib abs
Prefix Lexicalization of Synchronous CFGs using Synchronous TAG
Logan Born | Anoop Sarkar
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We show that an epsilon-free, chain-free synchronous context-free grammar (SCFG) can be converted into a weakly equivalent synchronous tree-adjoining grammar (STAG) which is prefix lexicalized. This transformation at most doubles the grammar’s rank and cubes its size, but we show that in practice the size increase is only quadratic. Our results extend Greibach normal form from CFGs to SCFGs and prove new formal properties about SCFG, a formalism with many applications in natural language processing.

Co-authors

Carolyn Chen 1

M. Monroe 1

Venues

cawl3
acl2
findings2
eamt1
ws1
show all...

iwslt1

latech1

emnlp1