2024
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP
Marius Mosbach | Vagrant Gautam | Tomás Vergara Browne | Dietrich Klakow | Mor Geva
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Interpretability and analysis (IA) research is a growing subfield within NLP with the goal of developing a deeper understanding of the behavior or inner workings of NLP systems and methods. Despite growing interest in the subfield, a criticism of this work is that it lacks actionable insights and therefore has little impact on NLP. In this paper, we seek to quantify the impact of IA research on the broader field of NLP. We approach this with a mixed-methods analysis of: (1) a citation graph of 185K+ papers built from all papers published at ACL and EMNLP conferences from 2018 to 2023, and their references and citations, and (2) a survey of 138 members of the NLP community. Our quantitative results show that IA work is well-cited outside of IA, and central in the NLP citation graph. Through qualitative analysis of survey responses and manual annotation of 556 papers, we find that NLP researchers build on findings from IA work, perceive it as important for progress in NLP and in multiple subfields, and rely on its findings and terminology for their own work. Many novel methods are proposed based on IA findings and highly influenced by them, but highly influential non-IA work cites IA findings without being driven by them. We end by summarizing what is missing in IA work today and provide a call to action, to pave the way for a more impactful future of IA research.
Understanding “Democratization” in NLP and ML Research
Arjun Subramonian | Vagrant Gautam | Dietrich Klakow | Zeerak Talat
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent improvements in natural language processing (NLP) and machine learning (ML) and increased mainstream adoption have led to researchers frequently discussing the “democratization” of artificial intelligence. In this paper, we seek to clarify how democratization is understood in NLP and ML publications, through large-scale mixed-methods analyses of papers using the keyword “democra*” published in NLP and adjacent venues. We find that democratization is most frequently used to convey (ease of) access to or use of technologies, without meaningfully engaging with theories of democratization, while research using other invocations of “democra*” tends to be grounded in theories of deliberation and debate. Based on our findings, we call for researchers to enrich their use of the term democratization with appropriate theory, towards democratic technologies beyond superficial access.
Stop! In the Name of Flaws: Disentangling Personal Names and Sociodemographic Attributes in NLP
Vagrant Gautam | Arjun Subramonian | Anne Lauscher | Os Keyes
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
Personal names simultaneously differentiate individuals and categorize them in ways that are important in a given society. While the natural language processing community has thus associated personal names with sociodemographic characteristics in a variety of tasks, researchers have engaged to varying degrees with the established methodological problems in doing so. To guide future work that uses names and sociodemographic characteristics, we provide an overview of relevant research: first, we present an interdisciplinary background on names and naming. We then survey the issues inherent to associating names with sociodemographic attributes, covering problems of validity (e.g., systematic error, construct validity), as well as ethical concerns (e.g., harms, differential impact, cultural insensitivity). Finally, we provide guiding questions along with normative recommendations to avoid validity and ethical pitfalls when dealing with names and sociodemographic characteristics in natural language processing.
What explains the success of cross-modal fine-tuning with ORCA?
Paloma Garcia De Herreros | Vagrant Gautam | Philipp Slusallek | Dietrich Klakow | Marius Mosbach
Proceedings of the Fifth Workshop on Insights from Negative Results in NLP
ORCA (Shen et al., 2023) is a recent technique for cross-modal fine-tuning, i.e., applying pre-trained transformer models to modalities beyond their training data. The technique consists primarily of training an embedder and fine-tuning the embedder and model. Despite its high performance on a variety of downstream tasks, we do not understand precisely how each of these components contributes to ORCA’s success. Therefore, we run a series of ablations and find that embedder training does not help 2D tasks at all, contrary to what the original paper posits. In 1D tasks, some amount of embedder training is necessary but more is not better. In 4 out of 6 datasets we experiment with, it is model fine-tuning that makes the biggest difference. Through our ablations and baselines, we contribute a better understanding of the individual components of ORCA.
WinoPron: Revisiting English Winogender Schemas for Consistency, Coverage, and Grammatical Case
Vagrant Gautam | Julius Steuer | Eileen Bingert | Ray Johns | Anne Lauscher | Dietrich Klakow
Proceedings of The Seventh Workshop on Computational Models of Reference, Anaphora and Coreference
The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis
Miaoran Zhang | Vagrant Gautam | Mingyang Wang | Jesujoba Alabi | Xiaoyu Shen | Dietrich Klakow | Marius Mosbach
Findings of the Association for Computational Linguistics: ACL 2024
In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models including Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.
2023
A Lightweight Method to Generate Unanswerable Questions in English
Vagrant Gautam | Miaoran Zhang | Dietrich Klakow
Findings of the Association for Computational Linguistics: EMNLP 2023
If a question cannot be answered with the available information, robust systems for question answering (QA) should know *not* to answer. One way to build QA models that do this is with additional training data comprised of unanswerable questions, created either by employing annotators or through automated methods for unanswerable question generation. To show that the model complexity of existing automated approaches is not justified, we examine a simpler data augmentation method for unanswerable question generation in English: performing antonym and entity swaps on answerable questions. Compared to the prior state-of-the-art, data generated with our training-free and lightweight strategy results in better models (+1.6 F1 points on SQuAD 2.0 data with BERT-large), and has higher human-judged relatedness and readability. We quantify the raw benefits of our approach compared to no augmentation across multiple encoder models, using different amounts of generated data, and also on TydiQA-MinSpan data (+9.3 F1 points with BERT-large). Our results establish swaps as a simple but strong baseline for future work.
2021
Avengers, Ensemble! Benefits of ensembling in grapheme-to-phoneme prediction
Vagrant Gautam | Wang Yau Li | Zafarullah Mahmood | Fred Mailhot | Shreekantha Nadig | Riqiang Wang | Nathan Zhang
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
We describe three baseline-beating systems for the high-resource English-only sub-task of the SIGMORPHON 2021 Shared Task 1: a small ensemble that Dialpad’s speech recognition team uses internally, a well-known off-the-shelf model, and a larger ensemble model comprising these and others. We additionally discuss the challenges related to the provided data, along with the processing steps we took.