Marco Kuhlmann

2025

pdf bib abs
Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency
Ehsan Doostmohammadi | Marco Kuhlmann
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between query and retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query–context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but substantially improves test-time perplexity and accelerates model learning above a critical threshold. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We specifically generate synthetic context through paraphrasing queries. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.

pdf bib abs
Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion
Denitsa Saynova | Lovisa Hagström | Moa Johansson | Richard Johansson | Marco Kuhlmann
Findings of the Association for Computational Linguistics: ACL 2025

Language models (LMs) can make a correct prediction based on many possible signals in a prompt, not all corresponding to recall of factual associations. However, current interpretations of LMs fail to take this into account. For example, given the query “Astrid Lindgren was born in” with the corresponding completion “Sweden”, no difference is made between whether the prediction was based on knowing where the author was born or assuming that a person with a Swedish-sounding name was born in Sweden. In this paper, we present a model-specific recipe - PrISM - for constructing datasets with examples of four different prediction scenarios: generic language modeling, guesswork, heuristics recall and exact fact recall. We apply two popular interpretability methods to the scenarios: causal tracing (CT) and information flow analysis. We find that both yield distinct results for each scenario. Results for exact fact recall and generic language modeling scenarios confirm previous conclusions about the importance of mid-range MLP sublayers for fact recall, while results for guesswork and heuristics indicate a critical role of late last token position MLP sublayers. In summary, we contribute resources for a more extensive and granular study of fact completion in LMs, together with analyses that provide a more nuanced understanding of how LMs process fact-related queries.

2024

pdf bib abs
How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?
Ehsan Doostmohammadi | Oskar Holmström | Marco Kuhlmann
Findings of the Association for Computational Linguistics: EMNLP 2024

Work on instruction-tuned Large Language Models (LLMs) has used automatic methods based on text overlap and LLM judgments as cost-effective alternatives to human evaluation. In this paper, we perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks. In evaluating how well automatic methods align with human evaluations, correlation metrics are the most commonly employed method despite their inherent limitations when dealing with ties and different scales. To address these shortcomings, we use Pairwise Accuracy as an alternative to standard correlation measures. We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent. Specifically, the simple ROUGE-L metric correlates very well with human ratings for short-answer English tasks but is unreliable in free-form generation tasks and cross-lingual scenarios. The effectiveness of the more advanced method of using GPT-4 as a judge diminishes significantly if reference answers are not included in the prompt, which is the scenario where this method has the potential to provide the most value compared to other metrics. Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.

pdf bib abs
Properties and Challenges of LLM-Generated Explanations
Jenny Kunz | Marco Kuhlmann
Proceedings of the Third Workshop on Bridging Human--Computer Interaction and Natural Language Processing

The self-rationalising capabilities of large language models (LLMs) have been explored in restricted settings, using task-specific data sets.However, current LLMs do not (only) rely on specifically annotated data; nonetheless, they frequently explain their outputs.The properties of the generated explanations are influenced by the pre-training corpus and by the target data used for instruction fine-tuning.As the pre-training corpus includes a large amount of human-written explanations “in the wild”, we hypothesise that LLMs adopt common properties of human explanations.By analysing the outputs for a multi-domain instruction fine-tuning data set, we find that generated explanations show selectivity and contain illustrative elements, but less frequently are subjective or misleading.We discuss reasons and consequences of the properties’ presence or absence. In particular, we outline positive and negative implications depending on the goals and user groups of the self-rationalising system.

2023

pdf bib abs
Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models
Ehsan Doostmohammadi | Tobias Norlund | Marco Kuhlmann | Richard Johansson
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Augmenting language models with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art Retro model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in Retro with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.

pdf bib abs
On the Generalization Ability of Retrieval-Enhanced Transformers
Tobias Norlund | Ehsan Doostmohammadi | Richard Johansson | Marco Kuhlmann
Findings of the Association for Computational Linguistics: EACL 2023

Recent work on the Retrieval-Enhanced Transformer (RETRO) model has shown impressive results: off-loading memory from trainable weights to a retrieval database can significantly improve language modeling and match the performance of non-retrieval models that are an order of magnitude larger in size. It has been suggested that at least some of this performance gain is due to non-trivial generalization based on both model weights and retrieval. In this paper, we try to better understand the relative contributions of these two components. We find that the performance gains from retrieval to a very large extent originate from overlapping tokens between the database and the test data, suggesting less of non-trivial generalization than previously assumed. More generally, our results point to the challenges of evaluating the generalization of retrieval-augmented language models such as RETRO, as even limited token overlap may significantly decrease test-time loss. We release our code and model at https://github.com/TobiasNorlund/retro

pdf bib abs
Bridging the Resource Gap: Exploring the Efficacy of English and Multilingual LLMs for Swedish
Oskar Holmström | Jenny Kunz | Marco Kuhlmann
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

Large language models (LLMs) have substantially improved natural language processing (NLP) performance, but training these models from scratch is resource-intensive and challenging for smaller languages. With this paper, we want to initiate a discussion on the necessity of language-specific pre-training of LLMs.We propose how the “one model-many models” conceptual framework for task transfer can be applied to language transfer and explore this approach by evaluating the performance of non-Swedish monolingual and multilingual models’ performance on tasks in Swedish.Our findings demonstrate that LLMs exposed to limited Swedish during training can be highly capable and transfer competencies from English off-the-shelf, including emergent abilities such as mathematical reasoning, while at the same time showing distinct culturally adapted behaviour. Our results suggest that there are resourceful alternatives to language-specific pre-training when creating useful LLMs for small languages.

2022

pdf bib abs
Human Ratings Do Not Reflect Downstream Utility: A Study of Free-Text Explanations for Model Predictions
Jenny Kunz | Martin Jirenius | Oskar Holmström | Marco Kuhlmann
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Models able to generate free-text rationales that explain their output have been proposed as an important step towards interpretable NLP for “reasoning” tasks such as natural language inference and commonsense question answering. However, the relative merits of different architectures and types of rationales are not well understood and hard to measure. In this paper, we contribute two insights to this line of research: First, we find that models trained on gold explanations learn to rely on these but, in the case of the more challenging question answering data set we use, fail when given generated explanations at test time. However, additional fine-tuning on generated explanations teaches the model to distinguish between reliable and unreliable information in explanations. Second, we compare explanations by a generation-only model to those generated by a self-rationalizing model and find that, while the former score higher in terms of validity, factual correctness, and similarity to gold explanations, they are not more useful for downstream classification. We observe that the self-rationalizing model is prone to hallucination, which is punished by most metrics but may add useful context for the classification step.

pdf bib abs
Tractable Parsing for CCGs of Bounded Degree
Lena Katharina Schiffer | Marco Kuhlmann | Giorgio Satta
Computational Linguistics, Volume 48, Issue 3 - September 2022

Unlike other mildly context-sensitive formalisms, Combinatory Categorial Grammar (CCG) cannot be parsed in polynomial time when the size of the grammar is taken into account. Refining this result, we show that the parsing complexity of CCG is exponential only in the maximum degree of composition. When that degree is fixed, parsing can be carried out in polynomial time. Our finding is interesting from a linguistic perspective because a bounded degree of composition has been suggested as a universal constraint on natural language grammar. Moreover, ours is the first complexity result for a version of CCG that includes substitution rules, which are used in practical grammars but have been ignored in theoretical work.

pdf bib abs
Where Does Linguistic Information Emerge in Neural Language Models? Measuring Gains and Contributions across Layers
Jenny Kunz | Marco Kuhlmann
Proceedings of the 29th International Conference on Computational Linguistics

Probing studies have extensively explored where in neural language models linguistic information is located. The standard approach to interpreting the results of a probing classifier is to focus on the layers whose representations give the highest performance on the probing task. We propose an alternative method that asks where the task-relevant information emerges in the model. Our framework consists of a family of metrics that explicitly model local information gain relative to the previous layer and each layer’s contribution to the model’s overall performance. We apply the new metrics to two pairs of syntactic probing tasks with different degrees of complexity and find that the metrics confirm the expected ordering only for one of the pairs. Our local metrics show a massive dominance of the first layers, indicating that the features that contribute the most to our probing tasks are not as high-level as global metrics suggest.

pdf bib abs
On the Effects of Video Grounding on Language Models
Ehsan Doostmohammadi | Marco Kuhlmann
Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models

Transformer-based models trained on text and vision modalities try to improve the performance on multimodal downstream tasks or tackle the problem Transformer-based models trained on text and vision modalities try to improve the performance on multimodal downstream tasks or tackle the problem of lack of grounding, e.g., addressing issues like models’ insufficient commonsense knowledge. While it is more straightforward to evaluate the effects of such models on multimodal tasks, such as visual question answering or image captioning, it is not as well-understood how these tasks affect the model itself, and its internal linguistic representations. In this work, we experiment with language models grounded in videos and measure the models’ performance on predicting masked words chosen based on their imageability. The results show that the smaller model benefits from video grounding in predicting highly imageable words, while the results for the larger model seem harder to interpret.of lack of grounding, e.g., addressing issues like models’ insufficient commonsense knowledge. While it is more straightforward to evaluate the effects of such models on multimodal tasks, such as visual question answering or image captioning, it is not as well-understood how these tasks affect the model itself, and its internal linguistic representations. In this work, we experiment with language models grounded in videos and measure the models’ performance on predicting masked words chosen based on their imageability. The results show that the smaller model benefits from video grounding in predicting highly imageable words, while the results for the larger model seem harder to interpret.

2021

pdf bib abs
Test Harder than You Train: Probing with Extrapolation Splits
Jenny Kunz | Marco Kuhlmann
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Previous work on probing word representations for linguistic knowledge has focused on interpolation tasks. In this paper, we instead analyse probes in an extrapolation setting, where the inputs at test time are deliberately chosen to be ‘harder’ than the training examples. We argue that such an analysis can shed further light on the open question whether probes actually decode linguistic knowledge, or merely learn the diagnostic task from shallow features. To quantify the hardness of an example, we consider scoring functions based on linguistic, statistical, and learning-related criteria, all of which are applicable to a broad range of NLP tasks. We discuss the relative merits of these criteria in the context of two syntactic probing tasks, part-of-speech tagging and syntactic dependency labelling. From our theoretical and experimental analysis, we conclude that distance-based and hard statistical criteria show the clearest differences between interpolation and extrapolation settings, while at the same time being transparent, intuitive, and easy to control.

2020

pdf bib abs
Classifier Probes May Just Learn from Linear Context Features
Jenny Kunz | Marco Kuhlmann
Proceedings of the 28th International Conference on Computational Linguistics

Classifiers trained on auxiliary probing tasks are a popular tool to analyze the representations learned by neural sentence encoders such as BERT and ELMo. While many authors are aware of the difficulty to distinguish between “extracting the linguistic structure encoded in the representations” and “learning the probing task,” the validity of probing methods calls for further research. Using a neighboring word identity prediction task, we show that the token embeddings learned by neural sentence encoders contain a significant amount of information about the exact linear context of the token, and hypothesize that, with such information, learning standard probing tasks may be feasible even without additional linguistic structure. We develop this hypothesis into a framework in which analysis efforts can be scrutinized and argue that, with current models and baselines, conclusions that representations contain linguistic structure are not well-founded. Current probing methodology, such as restricting the classifier’s expressiveness or using strong baselines, can help to better estimate the complexity of learning, but not build a foundation for speculations about the nature of the linguistic structure encoded in the learned representations.

pdf bib abs
End-to-End Negation Resolution as Graph Parsing
Robin Kurtz | Stephan Oepen | Marco Kuhlmann
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

We present a neural end-to-end architecture for negation resolution based on a formulation of the task as a graph parsing problem. Our approach allows for the straightforward inclusion of many types of graph-structured features without the need for representation-specific heuristics. In our experiments, we specifically gauge the usefulness of syntactic information for negation resolution. Despite the conceptual simplicity of our architecture, we achieve state-of-the-art results on the Conan Doyle benchmark dataset, including a new top result for our best model.

2019

pdf bib
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning
Stephan Oepen | Omri Abend | Jan Hajic | Daniel Hershcovich | Marco Kuhlmann | Tim O’Gorman | Nianwen Xue
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

The 2019 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks. Five distinct approaches to the representation of sentence meaning in the form of directed graph were represented in the training and evaluation data for the task, packaged in a uniform abstract graph representation and serialization. The task received submissions from eighteen teams, of which five do not participate in the official ranking because they arrived after the closing deadline, made use of additional training data, or involved one of the task co-organizers. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software are available from the task web site at: http://mrp.nlpl.eu

pdf bib abs
Improving Semantic Dependency Parsing with Syntactic Features
Robin Kurtz | Daniel Roxbo | Marco Kuhlmann
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing

We extend a state-of-the-art deep neural architecture for semantic dependency parsing with features defined over syntactic dependency trees. Our empirical results show that only gold-standard syntactic information leads to consistent improvements in semantic parsing accuracy, and that the magnitude of these improvements varies with the specific combination of the syntactic and the semantic representation used. In contrast, automatically predicted syntax does not seem to help semantic parsing. Our error analysis suggests that there is a significant overlap between syntactic and semantic representations.

2018

pdf bib abs
On the Complexity of CCG Parsing
Marco Kuhlmann | Giorgio Satta | Peter Jonsson
Computational Linguistics, Volume 44, Issue 3 - September 2018

We study the parsing complexity of Combinatory Categorial Grammar (CCG) in the formalism of Vijay-Shanker and Weir (1994). As our main result, we prove that any parsing algorithm for this formalism will take in the worst case exponential time when the size of the grammar, and not only the length of the input sentence, is included in the analysis. This sets the formalism of Vijay-Shanker and Weir (1994) apart from weakly equivalent formalisms such as Tree Adjoining Grammar, for which parsing can be performed in time polynomial in the combined size of grammar and input sentence. Our results contribute to a refined understanding of the class of mildly context-sensitive grammars, and inform the search for new, mildly context-sensitive versions of CCG.

2017

pdf bib
Proceedings of the 13th International Workshop on Tree Adjoining Grammars and Related Formalisms
Marco Kuhlmann | Tatjana Scheffler
Proceedings of the 13th International Workshop on Tree Adjoining Grammars and Related Formalisms

pdf bib abs
Exploiting Structure in Parsing to 1-Endpoint-Crossing Graphs
Robin Kurtz | Marco Kuhlmann
Proceedings of the 15th International Conference on Parsing Technologies

Deep dependency parsing can be cast as the search for maximum acyclic subgraphs in weighted digraphs. Because this search problem is intractable in the general case, we consider its restriction to the class of 1-endpoint-crossing (1ec) graphs, which has high coverage on standard data sets. Our main contribution is a characterization of 1ec graphs as a subclass of the graphs with pagenumber at most 3. Building on this we show how to extend an existing parsing algorithm for 1-endpoint-crossing trees to the full class. While the runtime complexity of the extended algorithm is polynomial in the length of the input sentence, it features a large constant, which poses a challenge for practical implementations.

2016

pdf bib
Squibs: Towards a Catalogue of Linguistic Graph Banks
Marco Kuhlmann | Stephan Oepen
Computational Linguistics, Volume 42, Issue 4 - December 2016

We announce a new language resource for research on semantic parsing, a large, carefully curated collection of semantic dependency graphs representing multiple linguistic traditions. This resource is called SDP~2016 and provides an update and extension to previous versions used as Semantic Dependency Parsing target representations in the 2014 and 2015 Semantic Evaluation Exercises. For a common core of English text, this third edition comprises semantic dependency graphs from four distinct frameworks, packaged in a unified abstract format and aligned at the sentence and token levels. SDP 2016 is the first general release of this resource and available for licensing from the Linguistic Data Consortium in May 2016. The data is accompanied by an open-source SDP utility toolkit and system results from previous contrastive parsing evaluations against these target representations.

2015

pdf bib
Lexicalization and Generative Power in CCG
Marco Kuhlmann | Alexander Koller | Giorgio Satta
Computational Linguistics, Volume 41, Issue 2 - June 2015

pdf bib abs
Parsing to Noncrossing Dependency Graphs
Marco Kuhlmann | Peter Jonsson
Transactions of the Association for Computational Linguistics, Volume 3

We study the generalization of maximum spanning tree dependency parsing to maximum acyclic subgraphs. Because the underlying optimization problem is intractable even under an arc-factored model, we consider the restriction to noncrossing dependency graphs. Our main contribution is a cubic-time exact inference algorithm for this class. We extend this algorithm into a practical parser and evaluate its performance on four linguistic data sets used in semantic dependency parsing. We also explore a generalization of our parsing framework to dependency graphs with pagenumber at most k and show that the resulting optimization problem is NP-hard for k ≥ 2.

pdf bib
Proceedings of the 14th Meeting on the Mathematics of Language (MoL 2015)
Marco Kuhlmann | Makoto Kanazawa | Gregory M. Kobele
Proceedings of the 14th Meeting on the Mathematics of Language (MoL 2015)

2014

pdf bib abs
A New Parsing Algorithm for Combinatory Categorial Grammar
Marco Kuhlmann | Giorgio Satta
Transactions of the Association for Computational Linguistics, Volume 2

We present a polynomial-time parsing algorithm for CCG, based on a new decomposition of derivations into small, shareable parts. Our algorithm has the same asymptotic complexity, O(n6), as a previous algorithm by Vijay-Shanker and Weir (1993), but is easier to understand, implement, and prove correct.

pdf bib
Linköping: Cubic-Time Graph Parsing with a Simple Scoring Scheme
Marco Kuhlmann
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2013

pdf bib
Mildly Non-Projective Dependency Grammar
Marco Kuhlmann
Computational Linguistics, Volume 39, Issue 2 - June 2013

pdf bib abs
Efficient Parsing for Head-Split Dependency Trees
Giorgio Satta | Marco Kuhlmann
Transactions of the Association for Computational Linguistics, Volume 1

Head splitting techniques have been successfully exploited to improve the asymptotic runtime of parsing algorithms for projective dependency trees, under the arc-factored model. In this article we extend these techniques to a class of non-projective dependency trees, called well-nested dependency trees with block-degree at most 2, which has been previously investigated in the literature. We define a structural property that allows head splitting for these trees, and present two algorithms that improve over the runtime of existing algorithms at no significant loss in coverage.

pdf bib
Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13)
András Kornai | Marco Kuhlmann
Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13)

Marco Kuhlmann

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2007

2006

2004

Co-authors

Venues