Simone Balloccu

2026

Span annotation - annotating specific text features at the span level - can be used to evaluate texts where single-score metrics fail to provide actionable feedback. Until recently, span annotation was done by human annotators or fine-tuned models. In this paper, we study whether large language models (LLMs) can serve as an alternative to human annotators. We compare the abilities of LLMs to skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We show that overall, LLMs have only moderate inter-annotator agreement (IAA) with human annotators. However, we demonstrate that LLMs make errors at a similar rate as skilled crowdworkers. LLMs also produce annotations at a fraction of the cost per output annotation. We release the dataset of over 40k model and human span annotations for further research.

pdf bib abs

ARQA: A Benchmark for Grounded Table–Text QA in Enterprise Annual Reports
Ruilong Wang | Simone Balloccu
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

Annual reports communicate corporate performance to stakeholders through dense tables and explanatory text, with rich grounding signals making automated reasoning challenging. Existing QA benchmarks focus on retrieval or single-modality reasoning and rarely require justification for answers with both textual and tabular evidence. We introduce ARQA (Annual Report QA), a benchmark of ~2.5K QA pairs spanning ten fiscal years of automotive enterprise annual reports and three reasoning families — Lookup, Arithmetic, and Insight. Data are produced via a planner–generator pipeline, deterministically verified and recomputed, and fully reviewed by domain experts. We evaluate state-of-the-art instruction-tuned language models on ARQA, showing strong factual retrieval but persistent weaknesses in grounded arithmetic and causal reasoning. We release ARQA and its evaluation toolkit to facilitate research on auditable, evidence-first reasoning over enterprise documents. (https://github.com/RuilongWang/ARQA-Benchmark/)

2025

pdf bib abs

When LLMs Can’t Help: Real-World Evaluation of LLMs in Nutrition
Karen Jia-Hui Li | Simone Balloccu | Ondrej Dusek | Ehud Reiter
Proceedings of the 18th International Natural Language Generation Conference

The increasing trust in large language models (LLMs), especially in the form of chatbots, is often undermined by the lack of their extrinsic evaluation. This holds particularly true in nutrition, where randomised controlled trials (RCTs) are the gold standard, and experts demand them for evidence-based deployment. LLMs have shown promising results in this field, but these are limited to intrinsic setups. We address this gap by running the first RCT involving LLMs for nutrition. We augment a rule-based chatbot with two LLM-based features: (1) message rephrasing for conversational variety and engagement, and (2) nutritional counselling through a fine-tuned model. In our seven-week RCT (n=81), we compare chatbot variants with and without LLM integration. We measure effects on dietary outcome, emotional well-being, and engagement. Despite our LLM-based features performing well in intrinsic evaluation, we find that they did not yield consistent benefits in real-world deployment. These results highlight critical gaps between intrinsic evaluations and real-world impact, emphasising the need for interdisciplinary, human-centred approaches.

pdf bib abs

Hallucinations are one of the most pressing challenges for large language models (LLMs). While numerous methods have been proposed to detect and mitigate them automatically, human evaluation continues to serve as the gold standard. However, these human evaluations of hallucinations show substantial variation in definitions, terminology, and evaluation practices. In this paper, we survey 64 studies involving human evaluation of hallucination published between 2019 and 2024, to investigate how hallucinations are currently defined and assessed. Our analysis reveals a lack of consistency in definitions and exposes several concerning methodological shortcomings. Crucial details, such as evaluation guidelines, user interface design, inter-annotator agreement metrics, and annotator demographics, are frequently under-reported or omitted altogether.

2024

pdf bib abs

ReproHum #0043-4: Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility
Mateusz Lango | Patricia Schmidtova | Simone Balloccu | Ondrej Dusek
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

In this paper, we describe several reproductions of a human evaluation experiment measuring the quality of automatic dialogue summarization (Feng et al., 2021). We investigate the impact of the annotators’ highest level of education, field of study, and native language on the evaluation of the informativeness of the summary. We find that the evaluation is relatively consistent regardless of these factors, but the biggest impact seems to be a prior specific background in natural language processing (as opposed to, e.g. a background in computer sci- ence). We also find that the experiment setup (asking for single vs. multiple criteria) may have an impact on the results.

pdf bib abs

Automatic metrics are extensively used to evaluate Natural Language Processing systems. However, there has been increasing focus on how the are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.

pdf bib abs

factgenie: A Framework for Span-based Evaluation of Generated Texts
Zdeněk Kasner | Ondrej Platek | Patricia Schmidtova | Simone Balloccu | Ondrej Dusek
Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations

We present ‘factgenie‘: a framework for annotating and visualizing word spans in textual model outputs. Annotations can capture various span-based phenomena such as semantic inaccuracies or irrelevant text. With ‘factgenie‘, the annotations can be collected both from human crowdworkers and large language models. Our framework consists of a web interface for data visualization and gathering text annotations, powered by an easily extensible codebase.

pdf bib

Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Simone Balloccu | Anya Belz | Rudali Huidrom | Ehud Reiter | Joao Sedoc | Craig Thomson
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

pdf bib

pdf bib abs

Large Language Models (LLMs) are being employed by end-users for various tasks, including sensitive ones such as health counseling, disregarding potential safety concerns. It is thus necessary to understand how adequately LLMs perform in such domains. We conduct a case study on ChatGPT in nutrition counseling, a popular use-case where the model supports a user with their dietary struggles. We crowd-source real-world diet-related struggles, then work with nutrition experts to generate supportive text using ChatGPT. Finally, experts evaluate the safety and text quality of ChatGPT’s output. The result is the HAI-coaching dataset, containing ~2.4K crowdsourced dietary struggles and ~97K corresponding ChatGPT-generated and expert-annotated supportive texts. We analyse ChatGPT’s performance, discovering potentially harmful behaviours, especially for sensitive topics like mental health. Finally, we use HAI-coaching to test open LLMs on various downstream tasks, showing that even the latest models struggle to achieve good performance. HAI-coaching is available at https://github.com/uccollab/hai-coaching/

pdf bib abs

Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
Simone Balloccu | Patrícia Schmidtová | Mateusz Lango | Ondrej Dusek
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where modelsare iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI’s GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI’s data usage policy, we extensively document the amount of data leaked to these models during the first year after the model’s release. We report that these models have been globally exposed to ∼4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project on https://leak-llm.github.io/, where other researchers can contribute to our efforts.

2023

pdf bib abs

Are Experts Needed? On Human Evaluation of Counselling Reflection Generation
Zixiu Wu | Simone Balloccu | Ehud Reiter | Rim Helaoui | Diego Reforgiato Recupero | Daniele Riboni
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reflection is a crucial counselling skill where the therapist conveys to the client their interpretation of what the client said. Language models have recently been used to generate reflections automatically, but human evaluation is challenging, particularly due to the cost of hiring experts. Laypeople-based evaluation is less expensive and easier to scale, but its quality is unknown for reflections. Therefore, we explore whether laypeople can be an alternative to experts in evaluating a fundamental quality aspect: coherence and context-consistency. We do so by asking a group of laypeople and a group of experts to annotate both synthetic reflections and human reflections from actual therapists. We find that both laypeople and experts are reliable annotators and that they have moderate-to-strong inter-group correlation, which shows that laypeople can be trusted for such evaluations. We also discover that GPT-3 mostly produces coherent and consistent reflections, and we explore changes in evaluation results when the source of synthetic reflections changes to GPT-3 from the less powerful GPT-2.

2022

pdf bib

Comparing informativeness of an NLG chatbot vs graphical app in diet-information domain
Simone Balloccu | Ehud Reiter
Proceedings of the 15th International Conference on Natural Language Generation

pdf bib abs

Towards In-Context Non-Expert Evaluation of Reflection Generation for Counselling Conversations
Zixiu Wu | Simone Balloccu | Rim Helaoui | Diego Reforgiato Recupero | Daniele Riboni
Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Reflection is an essential counselling strategy, where the therapist listens actively and responds with their own interpretation of the client’s words. Recent work leveraged pre-trained language models (PLMs) to approach reflection generation as a promising tool to aid counsellor training. However, those studies used limited dialogue context for modelling and simplistic error analysis for human evaluation. In this work, we take the first step towards addressing those limitations. First, we fine-tune PLMs on longer dialogue contexts for reflection generation. Then, we collect free-text error descriptions from non-experts about generated reflections, identify common patterns among them, and accordingly establish discrete error categories using thematic analysis. Based on this scheme, we plan for future work a mass non-expert error annotation phase for generated reflections followed by an expert-based validation phase, namely “whether a coherent and consistent response is a good reflection”.

pdf bib abs

Beyond calories: evaluating how tailored communication reduces emotional load in diet-coaching
Simone Balloccu | Ehud Reiter
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

Dieting is a behaviour change task that is difficult for many people to conduct successfully. This is due to many factors, including stress and cost. Mobile applications offer an alternative to traditional coaching. However, previous work on apps evaluation only focused on dietary outcomes, ignoring users’ emotional state despite its influence on eating habits. In this work, we introduce a novel evaluation of the effects that tailored communication can have on the emotional load of dieting. We implement this by augmenting a traditional diet-app with affective NLG, text-tailoring and persuasive communication techniques. We then run a short 2-weeks experiment and check dietary outcomes, user feedback of produced text and, most importantly, its impact on emotional state, through PANAS questionnaire. Results show that tailored communication significantly improved users’ emotional state, compared to an app-only control group.