Logan Born


2022

pdf bib
Auxiliary Subword Segmentations as Related Languages for Low Resource Multilingual Translation
Nishant Kambhatla | Logan Born | Anoop Sarkar
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

We propose a novel technique that combines alternative subword tokenizations of a single source-target language pair that allows us to leverage multilingual neural translation training methods. These alternate segmentations function like related languages in multilingual translation. Overall this improves translation accuracy for low-resource languages and produces translations that are lexically diverse and morphologically rich. We also introduce a cross-teaching technique which yields further improvements in translation accuracy and cross-lingual transfer between high- and low-resource language pairs. Compared to other strong multilingual baselines, our approach yields average gains of +1.7 BLEU across the four low-resource datasets from the multilingual TED-talks dataset. Our technique does not require additional training data and is a drop-in improvement for any existing neural translation system.

pdf bib
CipherDAug: Ciphertext based Data Augmentation for Neural Machine Translation
Nishant Kambhatla | Logan Born | Anoop Sarkar
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a novel data-augmentation technique for neural machine translation based on ROT-k ciphertexts. ROT-k is a simple letter substitution cipher that replaces a letter in the plaintext with the kth letter after it in the alphabet. We first generate multiple ROT-k ciphertexts using different values of k for the plaintext which is the source side of the parallel data. We then leverage this enciphered training data along with the original parallel data via multi-source training to improve neural machine translation. Our method, CipherDAug, uses a co-regularization-inspired training procedure, requires no external data sources other than the original training data, and uses a standard Transformer to outperform strong data augmentation techniques on several datasets by a significant margin. This technique combines easily with existing approaches to data augmentation, and yields particularly strong results in low-resource settings.

2021

pdf bib
Compositionality of Complex Graphemes in the Undeciphered Proto-Elamite Script using Image and Text Embedding Models
Logan Born | Kathryn Kelley | M. Willis Monroe | Anoop Sarkar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2019

pdf bib
Sign Clustering and Topic Extraction in Proto-Elamite
Logan Born | Kate Kelley | Nishant Kambhatla | Carolyn Chen | Anoop Sarkar
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We describe a first attempt at using techniques from computational linguistics to analyze the undeciphered proto-Elamite script. Using hierarchical clustering, n-gram frequencies, and LDA topic models, we both replicate results obtained by manual decipherment and reveal previously-unobserved relationships between signs. This demonstrates the utility of these techniques as an aid to manual decipherment.

2018

pdf bib
Prefix Lexicalization of Synchronous CFGs using Synchronous TAG
Logan Born | Anoop Sarkar
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We show that an epsilon-free, chain-free synchronous context-free grammar (SCFG) can be converted into a weakly equivalent synchronous tree-adjoining grammar (STAG) which is prefix lexicalized. This transformation at most doubles the grammar’s rank and cubes its size, but we show that in practice the size increase is only quadratic. Our results extend Greibach normal form from CFGs to SCFGs and prove new formal properties about SCFG, a formalism with many applications in natural language processing.