Lucas Weber

2025

Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs’ ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long interaction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.

pdf bib abs

Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy
Paramita Mirza | Lucas Weber | Fabian Küch
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this paper, we demonstrate that data selection can be both—efficient and universal—by using a multi-step pipeline in which we efficiently bin data points into groups, estimate quality using specialized models, and score difficulty with a robust, lightweight method. Task-based categorization allows us to control the composition of our final data—crucial for finetuning multi-purpose models. To guarantee diversity, we improve upon previous work using embedding models and a clustering algorithm. This integrated strategy enables high-performance fine-tuning with minimal overhead.

2024

pdf bib abs

Interpretability of Language Models via Task Spaces
Lucas Weber | Jaap Jumelet | Elia Bruni | Dieuwke Hupkes
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The usual way to interpret language models (LMs) is to test their performance on different benchmarks and subsequently infer their internal processes.In this paper, we present an alternative approach, concentrating on the _quality_ of LM processing, with a focus on their language abilities.To this end, we construct ‘linguistic task spaces’ – representations of an LM’s language conceptualisation – that shed light on the connections LMs draw between language phenomena.Task spaces are based on the interactions of the learning signals from different linguistic phenomena, which we assess via a method we call ‘similarity probing’.To disentangle the learning signals of linguistic phenomena, we further introduce a method called ‘fine-tuning via gradient differentials’ (FTGD).We apply our methods to language models of three different scales and find that larger models generalise better to overarching general concepts for linguistic tasks, making better use of their shared structure. Further, the distributedness of linguistic processing increases with pre-training through increased parameter sharing between related linguistic tasks. The overall generalisation patterns are mostly stable throughout training and not marked by incisive stages, potentially explaining the lack of successful curriculum strategies for LMs.

2023

pdf bib abs

Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning
Lucas Weber | Elia Bruni | Dieuwke Hupkes
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

Finding the best way of adapting pre-trained language models to a task is a big challenge in current NLP. Just like the previous generation of task-tuned models (TT), models that are adapted to tasks via in-context-learning (ICL) or instruction tuning (IT) are robust in some setups, but not in others. Here, we present a detailed analysis of which design choices cause instabilities and inconsistencies in LLM predictions. First, we show how spurious correlations between input distributions and labels – a known issue in TT models – form only a minor problem for prompted models. Then we engage in a systematic, holistic evaluation of different factors that have been found to influence predictions in a prompting setup. We test all possible combinations of a range of factors on both vanilla and instruction-tuned LLMs of different scale, and statistically analyse the results to show which factors are the most influential, the most interactive or the most stable. From our results, we deduce which factors can be used without precautions, should be avoided or handled with care in most settings.

2021

pdf bib abs

Language Modelling as a Multi-Task Problem
Lucas Weber | Jaap Jumelet | Elia Bruni | Dieuwke Hupkes
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

In this paper, we propose to study language modelling as a multi-task problem, bringing together three strands of research: multi-task learning, linguistics, and interpretability. Based on hypotheses derived from linguistic theory, we investigate whether language models adhere to learning principles of multi-task learning during training. To showcase the idea, we analyse the generalisation behaviour of language models as they learn the linguistic concept of Negative Polarity Items (NPIs). Our experiments demonstrate that a multi-task setting naturally emerges within the objective of the more general task of language modelling. We argue that this insight is valuable for multi-task learning, linguistics and interpretability research and can lead to exciting new findings in all three domains.

Co-authors

Nathanaël Carraz Rakotonirina 1

Alberto Testoni 1

Marco Del Tredici 1

Venues

Fix author