David Yarowsky

2025

DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models
Niyati Bafna | Emily Chang | Nathaniel Romney Robinson | David R. Mortensen | Kenton Murray | David Yarowsky | Hale Sirin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most of the world’s languages and dialects are low-resource, and lack support in mainstream machine translation (MT) models. However, many of them have a closely-related high-resource language (HRL) neighbor, and differ in linguistically regular ways from it. This underscores the importance of model robustness to dialectal variation and cross-lingual generalization to the HRL dialect continuum. We present DialUp, consisting of a training-time technique for adapting a pretrained model to dialectal data (M–>D), and an inference-time intervention adapting dialectal data to the model expertise (D–>M). M–>D induces model robustness to potentially unseen and unknown dialects by exposure to synthetic data exemplifying linguistic mechanisms of dialectal variation, whereas D–>M treats dialectal divergence for known target dialects. These methods show considerable performance gains for several dialects from four language families, and modest gains for two other language families. We also conduct feature and error analyses, which show that language varieties with low baseline MT performance are more likely to benefit from these approaches.

pdf bib abs

JHU’s Submission to the AmericasNLP 2025 Shared Task on the Creation of Educational Materials for Indigenous Languages
Tom Lupicki | Lavanya Shankar | Kaavya Chaparala | David Yarowsky
Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

This paper presents JHU’s submission to the AmericasNLP shared task on the creation of educational materials for Indigenous languages. The task involves transforming a base sentence given one or more tags that correspond to grammatical features, such as negation or tense. The task also spans four languages: Bribri, Maya, Guaraní, and Nahuatl. We experiment with augmenting prompts to large language models with different information, chain of thought prompting, ensembling large language models by majority voting, and training a pointer-generator network. Our System 1, an ensemble of large language models, achieves the best performance on Maya and Guaraní, building upon the previous successes in leveraging large language models for this task and highlighting the effectiveness of ensembling large language models.

pdf bib abs

The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure
Niyati Bafna | Tianjian Li | Kenton Murray | David R. Mortensen | David Yarowsky | Hale Sirin | Daniel Khashabi
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages, but the causes for this are not well-understood. We first demonstrate the existence of an implicit task-solving→translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner, and subsequently translates answer concepts into the intended target language. We hypothesize that the failure of the translation stage, despite task-solving success, is an important culprit for the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We quantify the extent to which either stage in the pipeline is responsible for final failure for a word translation task across 108 language pairs, and find that the translation barrier explains a dominant portion of error for a majority of language pairs, and is especially severe for low-resource target languages. Our results highlight an important bottleneck for end-to-end multilingual generation, relevant for future work seeking to improve multilinguality in LLMs.

pdf bib abs

We present the Johns Hopkins University’s submission to the 2025 IWSLT Low-Resource Task. We competed on all 10 language pairs. Our approach centers around ensembling methods – specifically Minimum Bayes Risk Decoding. We find that such ensembling often improves performance only slightly over the best performing stand-alone model, and that in some cases it can even hurt performance slightly.

2024

pdf bib abs

Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization
Niyati Bafna | Kenton Murray | David Yarowsky
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

While large language models exhibit certain cross-lingual generalization capabilities, they suffer from performance degradation (PD) on unseen closely-related languages (CRLs) and dialects relative to their high-resource language neighbour (HRLN). However, we currently lack a fundamental understanding of what kinds of linguistic distances contribute to PD, and to what extent. Furthermore, studies of cross-lingual generalization are confounded by unknown quantities of CRL language traces in the training data, and by the frequent lack of availability of evaluation data in lower-resource related languages and dialects. To address these issues, we model phonological, morphological, and lexical distance as Bayesian noise processes to synthesize artificial languages that are controllably distant from the HRLN. We analyse PD as a function of underlying noise parameters, offering insights on model robustness to isolated and composed linguistic phenomena, and the impact of task and HRL characteristics on PD. We calculate parameter posteriors on real CRL-HRLN pair data and show that they follow computed trends of artificial languages, demonstrating the viability of our noisers. Our framework offers a cheap solution for estimating task performance on an unseen CRL given HRLN performance using its posteriors, as well as for diagnosing observed PD on a CRL in terms of its linguistic distances from its HRLN, and opens doors to principled methods of mitigating performance degradation.

pdf bib abs

Pointer-Generator Networks for Low-Resource Machine Translation: Don’t Copy That!
Niyati Bafna | Philipp Koehn | David Yarowsky
Proceedings of the Fifth Workshop on Insights from Negative Results in NLP

While Transformer-based neural machine translation (NMT) is very effective in high-resource settings, many languages lack the necessary large parallel corpora to benefit from it. In the context of low-resource (LR) MT between two closely-related languages, a natural intuition is to seek benefits from structural “shortcuts”, such as copying subwords from the source to the target, given that such language pairs often share a considerable number of identical words, cognates, and borrowings. We test Pointer-Generator Networks for this purpose for six language pairs over a variety of resource ranges, and find weak improvements for most settings. However, analysis shows that the model does not show greater improvements for closely-related vs. more distant language pairs, or for lower resource ranges, and that the models do not exhibit the expected usage of the mechanism for shared subwords. Our discussion of the reasons for this behaviour highlights several general challenges for LR NMT, such as modern tokenization strategies, noisy real-world conditions, and linguistic complexities. We call for better scrutiny of linguistically motivated improvements to NMT given the blackbox nature of Transformer models, as well as for a focus on the above problems in the field.

2022

pdf bib abs

Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages
Georgie Botev | Arya D. McCarthy | Winston Wu | David Yarowsky
Proceedings of the 29th International Conference on Computational Linguistics

This paper presents a detailed foundational empirical case study of the nature of out-of-vocabulary words encountered in modern text in a moderate-resource language such as Bulgarian, and a multi-faceted distributional analysis of the underlying word-formation processes that can aid in their compositional translation, tagging, parsing, language modeling, and other NLP tasks. Given that out-of-vocabulary (OOV) words generally present a key open challenge to NLP and machine translation systems, especially toward the lower limit of resource availability, there are useful practical insights, as well as corpus-linguistic insights, from both a detailed manual and automatic taxonomic analysis of the types, multidimensional properties, and processing potential for multiple representative OOV data samples.

pdf bib abs

Known Words Will Do: Unknown Concept Translation via Lexical Relations
Winston Wu | David Yarowsky
Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)

Translating into low-resource languages is challenging due to the scarcity of training data. In this paper, we propose a probabilistic lexical translation method that bridges through lexical relations including synonyms, hypernyms, hyponyms, and co-hyponyms. This method, which only requires a dictionary like Wiktionary and a lexical database like WordNet, enables the translation of unknown vocabulary into low-resource languages for which we may only know the translation of a related concept. Experiments on translating a core vocabulary set into 472 languages, most of them low-resource, show the effectiveness of our approach.

pdf bib abs

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

pdf bib abs

On the Robustness of Cognate Generation Models
Winston Wu | David Yarowsky
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We evaluate two popular neural cognate generation models’ robustness to several types of human-plausible noise (deletion, duplication, swapping, and keyboard errors, as well as a new type of error, phonological errors). We find that duplication and phonological substitution is least harmful, while the other types of errors are harmful. We present an in-depth analysis of the models’ results with respect to each error type to explain how and why these models perform as they do.

2021

pdf bib abs

On Pronunciations in Wiktionary: Extraction and Experiments on Multilingual Syllabification and Stress Prediction
Winston Wu | David Yarowsky
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

We constructed parsers for five non-English editions of Wiktionary, which combined with pronunciations from the English edition, comprises over 5.3 million IPA pronunciations, the largest pronunciation lexicon of its kind. This dataset is a unique comparable corpus of IPA pronunciations annotated from multiple sources. We analyze the dataset, noting the presence of machine-generated pronunciations. We develop a novel visualization method to quantify syllabification. We experiment on the new combined task of multilingual IPA syllabification and stress prediction, finding that training a massively multilingual neural sequence-to-sequence model with copy attention can improve performance on both high- and low-resource languages, and multi-task training on stress prediction helps with syllabification.

David Yarowsky

2025

2024

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1997

1995

1994

1993

1992

Co-authors

Venues