Elizabeth Nielsen

2025

Alligators All Around: Mitigating Lexical Confusion in Low-resource Machine Translation
Elizabeth Nielsen | Isaac Rayburn Caswell | Jiaming Luo | Colin Cherry
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Current machine translation (MT) systems for low-resource languages have a particular failure mode: When translating words in a given domain, they tend to confuse words within that domain. So, for example, “lion” might be translated as “alligator”, and “orange” might be rendered as “purple.” We propose a recall-based metric for measuring this problem and show that the problem exists in 122 low-resource languages. We then show that this problem can be mitigated by using a large language model (LLM) to post-edit the MT output, specifically by including the entire GATITOS lexicon for the relevant language as a very long context prompt. We show gains in average ChrF score over the set of 122 languages, and we show that the recall score for relevant lexical items also improves. Finally, we demonstrate that a small dedicated MT system with a general-purpose LLM as a post-editor outperforms a lexicon-based RAG-LLM translator, suggesting a new paradigm for LLM use.
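
The abstract above does not define its recall-based metric, so the following is only a minimal sketch of one plausible reading: for each sentence, check whether the expected target-language terms appear in the MT output, and report the fraction found. The function name `lexical_recall`, the whitespace tokenization, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
def lexical_recall(hypotheses, expected_terms):
    """Fraction of expected terms found in the matching hypothesis.

    hypotheses: list of MT output strings.
    expected_terms: list of lists; expected_terms[i] holds the target-side
        terms that a correct translation of sentence i should contain.
    Both the whitespace tokenization and exact matching are simplifying
    assumptions; a real metric would likely handle morphology.
    """
    hits = 0
    total = 0
    for hyp, terms in zip(hypotheses, expected_terms):
        hyp_tokens = set(hyp.lower().split())
        for term in terms:
            total += 1
            if term.lower() in hyp_tokens:
                hits += 1
    # Avoid division by zero when no terms are expected.
    return hits / total if total else 0.0


# Toy example: the first output confuses "lion" with "alligator".
outputs = ["the alligator sleeps", "an orange door"]
expected = [["lion"], ["orange"]]
print(lexical_recall(outputs, expected))
```

Under this reading, a within-domain confusion such as "alligator" for "lion" lowers recall even when a surface-overlap score like ChrF changes only slightly.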

We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock machine translation for low-resource languages (LRLs). SMOL has been translated into 123 under-resourced languages (125 language pairs), including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOLSENT, a set of sentences chosen for broad unique token coverage, and SMOLDOC, a document-level source focusing on broad topic coverage. They join the already released GATITOS for a trifecta of paragraph, sentence, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust chrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOLDOC, yielding the first factuality datasets for most of these languages.

2023

Distinguishing Romanized Hindi from Romanized Urdu
Elizabeth Nielsen | Christo Kirov | Brian Roark
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

We examine the task of distinguishing between Hindi and Urdu when those languages are romanized, i.e., written in the Latin script. Both languages are widely informally romanized, and to the extent that they are identified in the Latin script by language identification systems, they are typically conflated. In the absence of large labeled collections of such text, we consider methods for generating training data. Beginning with a small set of seed words, each of which is strongly indicative of one of the languages versus the other, we prompt a pretrained large language model (LLM) to generate romanized text. Treating text generated from an Urdu prompt as one class and text generated from a Hindi prompt as the other class, we build a binary language identification (LangID) classifier. We demonstrate that the resulting classifier distinguishes manually romanized Urdu Wikipedia text from manually romanized Hindi Wikipedia text far better than chance. We use this classifier to estimate the prevalence of Urdu in a large collection of text labeled as romanized Hindi that has been used to train large language models. These techniques can be applied to bootstrap classifiers in other cases where a dataset is known to contain multiple distinct but related classes, such as different dialects of the same language, but for which labels cannot easily be obtained.

Spelling convention sensitivity in neural language models
Elizabeth Nielsen | Christo Kirov | Brian Roark
Findings of the Association for Computational Linguistics: EACL 2023

We examine whether large neural language models, trained on very large collections of varied English text, learn the potentially long-distance dependency of British versus American spelling conventions, i.e., whether spelling is consistently one or the other within model-generated strings. In contrast to long-distance dependencies in non-surface underlying structure (e.g., syntax), spelling consistency is easier to measure both in LMs and the text corpora used to train them, which can provide additional insight into certain observed model behaviors. Using a set of probe words unique to either British or American English, we first establish that training corpora exhibit substantial (though not total) consistency. A large T5 language model does appear to internalize this consistency, though only with respect to observed lexical items (not nonce words with British/American spelling patterns). We further experiment with correcting for biases in the training data by fine-tuning T5 on synthetic data that has been debiased, and find that fine-tuned T5 remains only somewhat sensitive to spelling consistency. Further experiments show GPT2 to be similarly limited.

2022

2020

The role of context in neural pitch accent detection in English
Elizabeth Nielsen | Mark Steedman | Sharon Goldwater
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Prosody is a rich information source in natural language, serving as a marker for phenomena such as contrast. In order to make this information available to downstream tasks, we need a way to detect prosodic events in speech. We propose a new model for pitch accent detection, inspired by the work of Stehwien et al. (2018), who presented a CNN-based model for this task. Our model makes greater use of context by using full utterances as input and adding an LSTM layer. We find that these innovations lead to an improvement from 87.5% to 88.7% accuracy on pitch accent detection on American English speech in the Boston University Radio News Corpus, a state-of-the-art result. We also find that a simple baseline that just predicts a pitch accent on every content word yields 82.2% accuracy, and we suggest that this is the appropriate baseline for this task. Finally, we conduct ablation tests that show pitch is the most important acoustic feature for this task and this corpus.
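
The content-word baseline mentioned in the abstract above can be illustrated with a short sketch. The abstract does not say how content words are identified, so this version assumes part-of-speech tags are already available and treats a fixed set of Universal-Dependencies-style tags as content words; the tag set, function names, and toy data are all hypothetical.

```python
# Assumption: these POS tags count as "content words". The paper's actual
# definition is not given in the abstract.
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def predict_accents(tagged_words):
    """Predict a pitch accent on every content word, and nowhere else.

    tagged_words: list of (word, pos_tag) pairs.
    Returns a list of booleans, one per word.
    """
    return [tag in CONTENT_TAGS for _, tag in tagged_words]


def accuracy(predicted, gold):
    """Word-level accuracy of predicted accent labels against gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy utterance with made-up gold labels: the verb "sat" is unaccented here,
# so the baseline gets one of five words wrong.
utterance = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
             ("on", "ADP"), ("mats", "NOUN")]
gold_labels = [False, True, False, False, True]
print(accuracy(predict_accents(utterance), gold_labels))
```

A rule this simple uses no acoustic input at all, which is what makes it a useful reference point for models that do.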