Vera Axelrod


XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Sebastian Ruder | Jonathan Clark | Alexander Gutkin | Mihir Kale | Min Ma | Massimo Nicosia | Shruti Rijhwani | Parker Riley | Jean-Michel Sarr | Xinyi Wang | John Wieting | Nitish Gupta | Anna Katanova | Christo Kirov | Dana Dickinson | Brian Roark | Bidisha Samanta | Connie Tao | David Adelani | Vera Axelrod | Isaac Caswell | Colin Cherry | Dan Garrette | Reeve Ingle | Melvin Johnson | Dmitry Panteleev | Partha Talukdar
Findings of the Association for Computational Linguistics: EMNLP 2023

Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) — languages for which NLP research is particularly far behind in meeting user needs — it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks — tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.


Flexible text generation for counterfactual fairness probing
Zee Fryer | Vera Axelrod | Ben Packer | Alex Beutel | Jilin Chen | Kellie Webster
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

A common approach for testing fairness issues in text-based classifiers is through the use of counterfactuals: does the classifier output change if a sensitive attribute in the input is changed? Existing counterfactual generation methods typically rely on wordlists or templates, producing simple counterfactuals that fail to take into account grammar, context, or subtle sensitive attribute references, and could miss issues that the wordlist creators had not considered. In this paper, we introduce a task for generating counterfactuals that overcomes these shortcomings, and demonstrate how large language models (LLMs) can be leveraged to accomplish this task. We show that this LLM-based method can produce complex counterfactuals that existing methods cannot, comparing the performance of various counterfactual generation methods on the Civil Comments dataset and showing their value in evaluating a toxicity classifier.


Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns
Kellie Webster | Marta Recasens | Vera Axelrod | Jason Baldridge
Transactions of the Association for Computational Linguistics, Volume 6

Coreference resolution is an important task for natural language understanding, and the resolution of ambiguous pronouns a longstanding challenge. Nonetheless, existing corpora do not capture ambiguous pronouns in sufficient volume or diversity to accurately indicate the practical utility of models. Furthermore, we find gender bias in existing corpora and systems favoring masculine entities. To address this, we present and release GAP, a gender-balanced labeled corpus of 8,908 ambiguous pronoun–name pairs sampled to provide diverse coverage of challenges posed by real-world text. We explore a range of baselines that demonstrate the complexity of the challenge, the best achieving just 66.9% F1. We show that syntactic structure and continuous neural models provide promising, complementary cues for approaching the challenge.