Kees van Deemter

Also published as: Kees Van Deemter

2025

Incorporating Formulaicness in the Automatic Evaluation of Naturalness: A Case Study in Logic-to-Text Generation
Eduardo Calò | Guanyi Chen | Elias Stengel-Eskin | Albert Gatt | Kees van Deemter
Proceedings of the 18th International Natural Language Generation Conference

Data-to-text natural language generation (NLG) models may produce outputs that closely mirror the structure of their input. We introduce formulaicness as a measure of the output-to-input structural resemblance, proposing it as an enhancement for reference-less naturalness evaluation. Focusing on logic-to-text generation, we construct a dataset and train a regressor to predict formulaicness scores. We collect human judgments on naturalness and examine how incorporating formulaicness into existing metrics affects alignment with these judgments.

pdf bib abs

Annotating Hallucinations in Question-Answering using Rewriting
Xu Liu | Guanyi Chen | Kees van Deemter | Tingting He
Proceedings of the 18th International Natural Language Generation Conference

Hallucinations pose a persistent challenge in open-ended question answering (QA). Traditional annotation methods, such as span-labelling, suffer from inconsistency and limited coverage. In this paper, we propose a rewriting-based framework as a new perspective on hallucinations in open-ended QA. We report on an experiment in which annotators are instructed to rewrite LLM-generated answers directly to ensure factual accuracy, with edits automatically recorded. Using the Chinese portion of the Mu-SHROOM dataset, we conduct a controlled rewriting experiment, comparing fact-checking tools (Google vs. GPT-4o), and analysing how tool choice, annotator background, and question openness influence rewriting behaviour. We find that rewriting leads to more hallucinations being identified, with higher inter-annotator agreement, than span-labelling.

2024

pdf bib abs

Intrinsic Task-based Evaluation for Referring Expression Generation
Guanyi Chen | Fahime Same | Kees Van Deemter
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, a human evaluation study of Referring Expression Generation (REG) models had an unexpected conclusion: on WEBNLG, Referring Expressions (REs) generated by the state-of-the-art neural models were not only indistinguishable from the REs in WEBNLG but also from the REs generated by a simple rule-based system. Here, we argue that this limitation could stem from the use of a purely ratings-based human evaluation (which is a common practice in Natural Language Generation). To investigate these issues, we propose an intrinsic task-based evaluation for REG models, in which, in addition to rating the quality of REs, participants were asked to accomplish two meta-level tasks. One of these tasks concerns the referential success of each RE; the other task asks participants to suggest a better alternative for each RE. The outcomes suggest that, in comparison to previous evaluations, the new evaluation protocol assesses the performance of each REG model more comprehensively and makes the participants’ ratings more reliable and discriminable.

pdf bib abs

The Pitfalls of Defining Hallucination
Kees van Deemter
Computational Linguistics, Volume 50, Issue 2 - June 2023

Despite impressive advances in Natural Language Generation (NLG) and Large Language Models (LLMs), researchers are still unclear about important aspects of NLG evaluation. To substantiate this claim, I examine current classifications of hallucination and omission in data-text NLG, and I propose a logic-based synthesis of these classfications. I conclude by highlighting some remaining limitations of all current thinking about hallucination and by discussing implications for LLMs.

pdf bib abs

Computational Modelling of Plurality and Definiteness in Chinese Noun Phrases
Yuqi Liu | Guanyi Chen | Kees van Deemter
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Theoretical linguists have suggested that some languages (e.g., Chinese and Japanese) are “cooler” than other languages based on the observation that the intended meaning of phrases in these languages depends more on their contexts. As a result, many expressions in these languages are shortened, and their meaning is inferred from the context. In this paper, we focus on the omission of the plurality and definiteness markers in Chinese noun phrases (NPs) to investigate the predictability of their intended meaning given the contexts. To this end, we built a corpus of Chinese NPs, each of which is accompanied by its corresponding context, and by labels indicating its singularity/plurality and definiteness/indefiniteness. We carried out corpus assessments and analyses. The results suggest that Chinese speakers indeed drop plurality and definiteness markers very frequently. Building on the corpus, we train a bank of computational models using both classic machine learning models and state-of-the-art pre-trained language models to predict the plurality and definiteness of each NP. We report on the performance of these models and analyse their behaviours.

2023

pdf bib abs

Dimensions of Explanatory Value in NLP Models
Kees van Deemter
Computational Linguistics, Volume 49, Issue 3 - September 2023

Performance on a dataset is often regarded as the key criterion for assessing NLP models. I argue for a broader perspective, which emphasizes scientific explanation. I draw on a long tradition in the philosophy of science, and on the Bayesian approach to assessing scientific theories, to argue for a plurality of criteria for assessing NLP models. To illustrate these ideas, I compare some recent models of language production with each other. I conclude by asking what it would mean for institutional policies if the NLP community took these ideas onboard.

pdf bib abs

Challenges in Reproducing Human Evaluation Results for Role-Oriented Dialogue Summarization
Takumi Ito | Qixiang Fang | Pablo Mosteiro | Albert Gatt | Kees van Deemter
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

There is a growing concern regarding the reproducibility of human evaluation studies in NLP. As part of the ReproHum campaign, we conducted a study to assess the reproducibility of a recent human evaluation study in NLP. Specifically, we attempted to reproduce a human evaluation of a novel approach to enhance Role-Oriented Dialogue Summarization by considering the influence of role interactions. Despite our best efforts to adhere to the reported setup, we were unable to reproduce the statistical results as presented in the original paper. While no contradictory evidence was found, our study raises questions about the validity of the reported statistical significance results, and/or the comprehensiveness with which the original study was reported. In this paper, we provide a comprehensive account of our reproduction study, detailing the methodologies employed, data collection, and analysis procedures. We discuss the implications of our findings for the broader issue of reproducibility in NLP research. Our findings serve as a cautionary reminder of the challenges in conducting reproducible human evaluations and prompt further discussions within the NLP community.

pdf bib abs

Models of reference production: How do they withstand the test of time?
Fahime Same | Guanyi Chen | Kees van Deemter
Proceedings of the 16th International Natural Language Generation Conference

In recent years, many NLP studies have focused solely on performance improvement. In this work, we focus on the linguistic and scientific aspects of NLP. We use the task of generating referring expressions in context (REG-in-context) as a case study and start our analysis from GREC, a comprehensive set of shared tasks in English that addressed this topic over a decade ago. We ask what the performance of models would be if we assessed them (1) on more realistic datasets, and (2) using more advanced methods. We test the models using different evaluation metrics and feature selection experiments. We conclude that GREC can no longer be regarded as offering a reliable assessment of models’ ability to mimic human reference production, because the results are highly impacted by the choice of corpus and evaluation metrics. Our results also suggest that pre-trained language models are less dependent on the choice of corpus than classic Machine Learning models, and therefore make more robust class predictions.

pdf bib abs

HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales
Michele Cafagna | Kees van Deemter | Albert Gatt
Proceedings of the 16th International Natural Language Generation Conference

Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. “people eating food in a park”. Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning, with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict (“people at a holiday resort”) and the actions they perform (“people having a picnic”). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, a dataset extending 14997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically, by combining each of the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.

pdf bib abs

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

pdf bib abs

Is Shortest Always Best? The Role of Brevity in Logic-to-Text Generation
Eduardo Calò | Jordi Levy | Albert Gatt | Kees Van Deemter
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

Some applications of artificial intelligence make it desirable that logical formulae be converted computationally to comprehensible natural language sentences. As there are many logical equivalents to a given formula, finding the most suitable equivalent to be used as input for such a “logic-to-text” generation system is a difficult challenge. In this paper, we focus on the role of brevity: Are the shortest formulae the most suitable? We focus on propositional logic (PL), framing formula minimization (i.e., the problem of finding the shortest equivalent of a given formula) as a Quantified Boolean Formulae (QBFs) satisfiability problem. We experiment with several generators and selection strategies to prune the resulting candidates. We conduct exhaustive automatic and human evaluations of the comprehensibility and fluency of the generated texts. The results suggest that while, in many cases, minimization has a positive impact on the quality of the sentences generated, formula minimization may ultimately not be the best strategy.

2022

pdf bib abs

Non-neural Models Matter: a Re-evaluation of Neural Referring Expression Generation Systems
Fahime Same | Guanyi Chen | Kees Van Deemter
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In recent years, neural models have often outperformed rule-based and classic Machine Learning approaches in NLG. These classic approaches are now often disregarded, for example when new neural models are evaluated. We argue that they should not be overlooked, since, for some tasks, well-designed non-neural approaches achieve better performance than neural ones. In this paper, the task of generating referring expressions in linguistic context is used as an example. We examined two very different English datasets (WEBNLG and WSJ), and evaluated each algorithm using both automatic and human evaluations. Overall, the results of these evaluations suggest that rule-based systems with simple rule sets achieve on-par or better performance on both datasets compared to state-of-the-art neural REG systems. In the case of the more realistic dataset, WSJ, a machine learning-based system with well-designed linguistic features performed best. We hope that our work can encourage researchers to consider non-neural models in future.

Kees van Deemter

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2013

2012

2011

2010

2009

2008

2007

2006

2005

2003

2002

2001

2000

1999

1998

1997

1990

Co-authors

Venues