Ian Berlot-Attwell

2026

Is This LLM Library Learning? Evaluation Must Account For Compute and Behaviour
Ian Berlot-Attwell | Tobias Sesterhenn | Frank Rudzicz | Xujie Si
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The in-context learning (ICL) coding, reasoning, and tool-using ability of LLMs has spurred interest in library learning (i.e., the creation and exploitation of reusable and composable functions, tools, or lemmas). Such systems often promise improved task performance and computational efficiency by caching reasoning (i.e., storing generated tools) - all without finetuning. However, we find strong reasons to be skeptical. Specifically, we identify a serious evaluation flaw present in a large number of ICL library learning works: these works do not correct for the difference in computational cost between baseline and library learning systems. Studying three separately published ICL library learning systems, we find that all of them fail to consistently outperform the simple baseline of prompting the model - improvements in task accuracy often vanish or reverse once computational cost is accounted for. Furthermore, we perform an in-depth examination of one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning. We find no evidence of the direct reuse of learned lemmas, and find evidence against the soft reuse of learned lemmas (i.e., reuse by modifying relevant examples).Our findings suggest that a serious re-examination of the effectiveness of ICL LLM-based library learning is required, as is much stronger standards for evaluation. An equal computational budget must be used for baselines, alongside behavioural analysis.

2024

pdf bib abs

Attribute Diversity Determines the Systematicity Gap in VQA
Ian Berlot-Attwell | Kumar Krishna Agrawal | Annabelle Michael Carrell | Yash Sharma | Naomi Saphra
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Although modern neural networks often generalize to new combinations of familiar concepts, the conditions that enable such compositionality have long been an open question. In this work, we study the systematicity gap in visual question answering: the performance difference between reasoning on previously seen and unseen combinations of object attributes. To test, we introduce a novel diagnostic dataset, CLEVR-HOPE. We find that the systematicity gap is not reduced by increasing the quantity of training data, but is reduced by increasing the diversity of training data. In particular, our experiments suggest that the more distinct attribute type combinations are seen during training, the more systematic we can expect the resulting model to be.

2023

pdf bib abs

Data augmentation is an important method for evaluating the robustness of and enhancing the diversity of training data for natural language processing (NLP) models. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically-valid style, syntax changes, as well as artificial constructs that are unambiguous to humans. We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find different models to be differently challenged on different tasks, with quasi-systematic score decreases. The infrastructure, datacards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP.

2022

pdf bib abs

Relevance in Dialogue: Is Less More? An Empirical Comparison of Existing Metrics, and a Novel Simple Metric
Ian Berlot-Attwell | Frank Rudzicz
Proceedings of the 4th Workshop on NLP for Conversational AI

In this work, we evaluate various existing dialogue relevance metrics, find strong dependency on the dataset, often with poor correlation with human scores of relevance, and propose modifications to reduce data requirements and domain sensitivity while improving correlation. Our proposed metric achieves state-of-the-art performance on the HUMOD dataset while reducing measured sensitivity to dataset by 37%-66%. We achieve this without fine-tuning a pretrained language model, and using only 3,750 unannotated human dialogues and a single negative example. Despite these limitations, we demonstrate competitive performance on four datasets from different domains. Our code, including our metric and experiments, is open sourced.

2020

pdf bib abs

Exploring Text Specific and Blackbox Fairness Algorithms in Multimodal Clinical NLP
John Chen | Ian Berlot-Attwell | Xindi Wang | Safwan Hossain | Frank Rudzicz
Proceedings of the 3rd Clinical Natural Language Processing Workshop

Clinical machine learning is increasingly multimodal, collected in both structured tabular formats and unstructured forms such as free text. We propose a novel task of exploring fairness on a multimodal clinical dataset, adopting equalized odds for the downstream medical prediction tasks. To this end, we investigate a modality-agnostic fairness algorithm - equalized odds post processing - and compare it to a text-specific fairness algorithm: debiased clinical word embeddings. Despite the fact that debiased word embeddings do not explicitly address equalized odds of protected groups, we show that a text-specific approach to fairness may simultaneously achieve a good balance of performance classical notions of fairness. Our work opens the door for future work at the critical intersection of clinical NLP and fairness.