Albert Gatt

2026

A Logic-Based Approach to Hallucinations in Data-to-Text NLG: Experiments with Human and LLM Annotators
Eduardo Calò | Saad Mahamood | Albert Gatt | Kees Van Deemter
Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)

Hallucinations are a persistent challenge in natural language generation, including data-to-text. van Deemter (2024) introduced a framework based on the relation of logical consequence ("follows from"), which divides all data-to-text hallucinations into seven disjoint categories. We examine whether human annotators and large language models are able to apply the framework, in two data-to-text domains. Results suggest that the framework is applicable, although there are significant domain-dependent variations, as well as discrepancies between human and model judgments. We also uncover several issues that should inform future work on hallucination.

pdf bib abs

Breaking the Script: Do Role-Playing Agents Maintain Goal Alignment under Distraction?
Dongxu Lu | Albert Gatt | Johan Jeuring
Proceedings of the 27th Annual Meeting of the Special Interest Group on Discourse and Dialogue

As large language models (LLMs) are increasingly deployed as role-playing agents in educational and professional training simulations, their susceptibility to user-induced distraction threatens their pedagogical utility. We formalise goal-competing distraction as a controlled evaluation paradigm and introduce a simulation framework that captures both immediate reactions and multi-turn trajectories under targeted distraction, using an LLM-based user simulator. Building upon this framework, we evaluate agent behaviour across three models: Gemini-2.0-Flash, Llama-3.3-70B-Instruct, and Llama-3.1-8B-Instruct. Our findings reveal a critical trade-off dependent on model scale. While larger models tend to remain socially responsive and more frequently engage with distractor topics, the smaller model shows rigid goal adherence by resisting and rejecting distraction. Although redirection is the most common initial response, subsequent trajectories diverge substantially. The inclusion of explicit dialogue state demonstrates model-dependent effects, acting as a stabilising anchor for smaller models but providing limited benefit for larger ones. These results suggest that maintaining goal alignment in role-playing agents requires explicitly managing the trade-off between conversational responsiveness and goal adherence.

pdf bib abs

Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue
Nan Li | Albert Gatt | Massimo Poesio
Proceedings of the 27th Annual Meeting of the Special Interest Group on Discourse and Dialogue

In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-language models (VLMs) can distinguish what could be shared from what has been shared between dialogue participants through grounding. We formulate this as an interpretation-matching task on 13,077 annotated reference expressions from HCRC MapTask dialogues, and evaluate VLMs under systematically controlled manipulations of dialogue context and map-information access. Our results show that providing authentic map images improves overall performance but shifts models toward over-predicting alignment. Textual descriptions of the same map content reproduce this bias, while non-informative images suppress alignment predictions entirely, indicating that the bias is driven by task-relevant map content, not the visual channel. This improvement comes at the cost of degraded accuracy on non-aligned cases. Calibration analysis and reference-chain tracking further suggest that models rely on static referential cues on the maps rather than tracking how grounding unfolds through dialogue history. We observe these patterns most clearly in Qwen3-VL-8B-Instruct and, to varying degrees, in four additional models from two architecture families. In models that exhibit the bias, map content, whether presented visually or textually, is treated as evidence of mutual understanding, conflating potential with established common ground.

pdf bib abs

Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning
Max Schaffelder | Albert Gatt
Findings of the Association for Computational Linguistics: ACL 2026

As synthetic data becomes widely used in language model development, understanding its impact on model behavior is crucial. This paper investigates the impact of the diversity of sources of synthetic data on fine-tuned large language models. We focus on three key dimensions: distribution collapse, adversarial robustness, and self-preference bias. Our findings reveal that fine-tuning models on synthetic data from diverse sources can mitigate distribution collapse, preserving the breadth of the output distribution and the diversity of the output text. Furthermore, while both human and synthetic fine-tuning data can remove safeguards, we observe a tendency for higher output quality in the latter case, thus making outputs potentially more usable and dangerous. Finally, we also find evidence that fine-tuning reduces self-preference bias, with human data being the most effective, followed by multi-source synthetic data. All code is available at https://github.com/maxschaffelder/synthetic_data_diversity.

pdf bib abs

When Models Decide and When They Bind: A Two-Stage Computation for Multiple-Choice Question Answering
Hugh Mee Wong | Rick Nouwen | Albert Gatt
Findings of the Association for Computational Linguistics: ACL 2026

Multiple-choice question answering (MCQA) is easy to evaluate but adds a meta-task: models must both solve the problem and output the symbol that represents the answer, conflating reasoning errors with symbol-binding failures. We study how language models implement MCQA internally using representational analyses (PCA, linear probes) as well as causal interventions. We find that option-boundary (newline) residual states often contain strong linearly decodable signals related to per-option correctness. Winner-identity probing reveals a two-stage progression: the winning content position becomes decodable immediately after the final option is processed, while the output symbol is represented closer to the answer emission position. Tests under symbol and content permutations support a two-stage mechanism in which models first select a winner in content space and then bind or route that winner to the appropriate symbol to emit.

pdf bib abs

BERT, are you paying attention? Attention regularization with human-annotated rationales
Elize Herrewijnen | Dong Nguyen | Floris Bex | Albert Gatt
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Attention regularisation aims to supervise the attention patterns in language models like BERT. Various studies have shown that using human-annotated rationales, in the form of highlights that explain why a text has a specific label, can have positive effects on model generalisability. In this work, we ask to what extent attention regularisation with human-annotated rationales improve model performance and model robustness, as well as susceptibility to spurious correlations. We compare regularisation on human rationales with randomly selected tokens, a baseline which has hitherto remained unexplored.Our results suggest that often, attention regularisation with randomly selected tokens yields similar improvements to attention regularisation with human-annotated rationales. Nevertheless, we find that human-annotated rationales surpass randomly selected tokens when it comes to reducing model sensitivity to strong spurious correlations.

2025

pdf bib

Hit or Be Hit: Tests of (Pre)Compositional Abilities in Vision and Language Models
Mădălina Zgreabăn | Albert Gatt | Pablo Mosteiro
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Long and Short Papers

pdf bib abs

Incorporating Formulaicness in the Automatic Evaluation of Naturalness: A Case Study in Logic-to-Text Generation
Eduardo Calò | Guanyi Chen | Elias Stengel-Eskin | Albert Gatt | Kees van Deemter
Proceedings of the 18th International Natural Language Generation Conference

Data-to-text natural language generation (NLG) models may produce outputs that closely mirror the structure of their input. We introduce formulaicness as a measure of the output-to-input structural resemblance, proposing it as an enhancement for reference-less naturalness evaluation. Focusing on logic-to-text generation, we construct a dataset and train a regressor to predict formulaicness scores. We collect human judgments on naturalness and examine how incorporating formulaicness into existing metrics affects alignment with these judgments.

pdf bib abs

Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues
Dongxu Lu | Johan Jeuring | Albert Gatt
Proceedings of the 18th International Natural Language Generation Conference

Evaluating large language models (LLMs) in long-form, knowledge-grounded role-play dialogues remains challenging. This study compares LLM-generated and human-authored responses in multi-turn professional training simulations through human evaluation (N = 38) and automated LLM-as-a-judge assessment. Human evaluation revealed significant degradation in LLM-generated response quality across turns, particularly in naturalness, context maintenance and overall quality, while human-authored responses progressively improved. In line with this finding, participants also indicated a consistent preference for human-authored dialogue. These human judgements were validated by our automated LLM-as-a-judge evaluation, where GEMINI 2.0 FLASH achieved strong alignment with human evaluators on both zero-shot pairwise preference and stochastic 6-shot construct ratings, confirming the widening quality gap between LLM and human responses over time. Our work contributes a multi-turn benchmark exposing LLM degradation in knowledge-grounded role-play dialogues and provides a validated hybrid evaluation framework to guide the reliable integration of LLMs in training simulations.

pdf bib abs

References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation
Silvia Casola | Yang Janet Liu | Siyao Peng | Oliver Kraus | Albert Gatt | Barbara Plank
Proceedings of the 18th International Natural Language Generation Conference

Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of the reference set on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.

pdf bib abs

VAQUUM: Are Vague Quantifiers Grounded in Visual Data?
Hugh Mee Wong | Rick Nouwen | Albert Gatt
Findings of the Association for Computational Linguistics: ACL 2025

Vague quantifiers such as “a few” and “many” are influenced by various contextual factors, including the number of objects present in a given context. In this work, we evaluate the extent to which vision-and-language models (VLMs) are compatible with humans when producing or judging the appropriateness of vague quantifiers in visual contexts. We release a novel dataset, VAQUUM, containing 20,300 human ratings on quantified statements across a total of 1089 images. Using this dataset, we compare human judgments and VLM predictions using three different evaluation methods. Our findings show that VLMs, like humans, are influenced by object counts in vague quantifier use. However, we find significant inconsistencies across models in different evaluation settings, suggesting that judging and producing vague quantifiers rely on two different processes. We release our dataset and code at https://github.com/hughmee/vaquum.

pdf bib abs

Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Yingjin Song | Yupei Du | Denis Paperno | Albert Gatt
Findings of the Association for Computational Linguistics: ACL 2025

This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied with a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.

pdf bib abs

FTFT: Efficient and Robust Fine-Tuning by Transferring Training Dynamics
Yupei Du | Albert Gatt | Dong Nguyen
Proceedings of the 31st International Conference on Computational Linguistics

Despite the massive success of fine-tuning Pre-trained Language Models (PLMs), they remain susceptible to out-of-distribution input. Dataset cartography is a simple yet effective dual-model approach that improves the robustness of fine-tuned PLMs. It involves fine-tuning a model on the original training set (i.e. reference model), selecting a subset of important training instances based on the training dynamics, % of the reference model, and fine-tuning again only on these selected examples (i.e. main model). However, this approach requires fine-tuning the same model twice, which is computationally expensive for large PLMs. In this paper, we show that 1) training dynamics are highly transferable across model sizes and pre-training methods, and that 2) fine-tuning main models using these selected training instances achieves higher training efficiency than empirical risk minimization (ERM). Building on these observations, we propose a novel fine-tuning approach: Fine-Tuning by transFerring Training dynamics (FTFT). Compared with dataset cartography, FTFT uses more efficient reference models and aggressive early stopping. FTFT achieves robustness improvements over ERM while lowering the training cost by up to ~50%

There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments, raising questions about the validity of these evaluations, as well as their reproducibility in the case of proprietary models. We provide JUDGE-BENCH, an extensible collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show substantial variance across models and datasets. Models are reliable evaluators on some tasks, but overall display substantial variability depending on the property being evaluated, the expertise level of the human judges, and whether the language is human or model-generated. We conclude that LLMs should be carefully validated against human judgments before being used as evaluators.

pdf bib abs

Disentangling the Roles of Representation and Selection in Data Pruning
Yupei Du | Yingjin Song | Hugh Mee Wong | Daniil Ignatev | Albert Gatt | Dong Nguyen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Data pruning—selecting small but impactful subsets—offers a promising way to efficiently scale NLP model training. However, existing methods often involve many different design choices, which have not been systematically studied. This limits future developments. In this work, we decompose data pruning into two key components: data representation and selection algorithm, and systematically analyze their influence on selected instances. Our theoretical and empirical results highlight the crucial role of representations: better representations, e.g., training gradients, generally lead to better selected instances, regardless of the chosen selection algorithm. Furthermore, different selection algorithms excel in different settings, and none consistently outperform the others. Moreover, the selection algorithms do not always align with their intended objectives: for example, algorithms designed for the same objective can select drastically different instances, highlighting the need for careful evaluation.

2024

pdf bib abs

Summarizing Long Regulatory Documents with a Multi-Step Pipeline
Mika Sie | Ruby Beek | Michiel Bots | Sjaak Brinkkemper | Albert Gatt
Proceedings of the Natural Legal Language Processing Workshop 2024

Due to their length and complexity, long regulatory texts are challenging to summarize. To address this, a multi-step extractive-abstractive architecture is proposed to handle lengthy regulatory documents more effectively. In this paper, we show that the effectiveness of a two-step architecture for summarizing long regulatory texts varies significantly depending on the model used. Specifically, the two-step architecture improves the performance of decoder-only models. For abstractive encoder-decoder models with short context lengths, the effectiveness of an extractive step varies, whereas for long-context encoder-decoder models, the extractive step worsens their performance. This research also highlights the challenges of evaluating generated texts, as evidenced by the differing results from human and automated evaluations. Most notably, human evaluations favoured language models pretrained on legal text, while automated metrics rank general-purpose language models higher. The results underscore the importance of selecting the appropriate summarization strategy based on model architecture and context length.

pdf bib abs

Automatic metrics are extensively used to evaluate Natural Language Processing systems. However, there has been increasing focus on how the are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.

pdf bib abs

Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning
Yingjin Song | Denis Paperno | Albert Gatt
Proceedings of the 17th International Natural Language Generation Conference

Visual storytelling systems generate multi-sentence stories from image sequences. In this task, capturing contextual information and bridging visual variation bring additional challenges. We propose a simple yet effective framework that leverages the generalization capabilities of pretrained foundation models, only training a lightweight vision-language mapping network to connect modalities, while incorporating context to enhance coherence. We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness. Extensive experimental results, across both automatic metrics and human evaluations, demonstrate that the stories generated by our framework are diverse, coherent, informative, and interesting.

pdf bib abs

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences
Leonardo Bertolazzi | Albert Gatt | Raffaella Bernardi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The reasoning abilities of Large Language Models (LLMs) are becoming a central focus of study in NLP. In this paper, we consider the case of syllogistic reasoning, an area of deductive reasoning studied extensively in logic and cognitive psychology. Previous research has shown that pre-trained LLMs exhibit reasoning biases, such as content effects, avoid answering that no conclusion follows, align with human difficulties, and struggle with multi-step reasoning. We contribute to this research line by systematically investigating the effects of chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) on syllogistic reasoning, considering syllogisms with conclusions that support or violate world knowledge and with multiple premises. Crucially, we go beyond the standard focus on accuracy, with an in-depth analysis of the conclusions generated by the models. Our results suggest that the behavior of pre-trained LLMs can be explained by heuristics studied in cognitive science and that both ICL and SFT improve model performance on valid inferences, although only the latter can mitigate most reasoning biases while being consistent.

pdf bib abs

How and where does CLIP process negation?
Vincent Quantmeyer | Pablo Mosteiro | Albert Gatt
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)

Various benchmarks have been proposed to test linguistic understanding in pre-trained vision & language (VL) models. Here we build on the existence task from the VALSE benchmark (Parcalabescu et al., 2022) which we use to test models’ understanding of negation, a particularly interesting issue for multimodal models. However, while such VL benchmarks are useful for measuring model performance, they do not reveal anything about the internal processes through which these models arrive at their outputs in such visio-linguistic tasks. We take inspiration from the growing literature on model interpretability to explain the behaviour of VL models on the understanding of negation. Specifically, we approach these questions through an in-depth analysis of the text encoder in CLIP (Radford et al., 2021), a highly influential VL model. We localise parts of the encoder that process negation and analyse the role of attention heads in this task. Our contributions are threefold. We demonstrate how methods from the language model interpretability literature (e.g., causal tracing) can be translated to multimodal models and tasks; we provide concrete insights into how CLIP processes negation on the VALSE existence task; and we highlight inherent limitations in the VALSE dataset as a benchmark for linguistic understanding.

2023

pdf bib abs

Is Shortest Always Best? The Role of Brevity in Logic-to-Text Generation
Eduardo Calò | Jordi Levy | Albert Gatt | Kees Van Deemter
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

Some applications of artificial intelligence make it desirable that logical formulae be converted computationally to comprehensible natural language sentences. As there are many logical equivalents to a given formula, finding the most suitable equivalent to be used as input for such a “logic-to-text” generation system is a difficult challenge. In this paper, we focus on the role of brevity: Are the shortest formulae the most suitable? We focus on propositional logic (PL), framing formula minimization (i.e., the problem of finding the shortest equivalent of a given formula) as a Quantified Boolean Formulae (QBFs) satisfiability problem. We experiment with several generators and selection strategies to prune the resulting candidates. We conduct exhaustive automatic and human evaluations of the comprehensibility and fluency of the generated texts. The results suggest that while, in many cases, minimization has a positive impact on the quality of the sentences generated, formula minimization may ultimately not be the best strategy.

We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus represents 26 languages now. All monolingual corpora therein use Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts impact on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

pdf bib abs

The WebNLG task consists of mapping a knowledge graph to a text verbalising the con- tent of that graph. The 2017 WebNLG edi- tion required participating systems to gener- ate English text from a set of DBpedia triples, while the 2020 WebNLG+ challenge addition- ally included generation into Russian and se- mantic parsing of English and Russian texts. In contrast, WebNLG 2023 focuses on four under-resourced languages which are severely under-represented in research on text genera- tion, namely Breton, Irish, Maltese and Welsh. In addition, WebNLG 2023 once again includes Russian. In this paper, we present the organi- sation of the shared task (data, timeline, eval- uation), briefly describe the participating sys- tems and summarise results for participating systems.

pdf bib

The Scenario Refiner: Grounding subjects in images at the morphological level
Claudia C. Tagliaferri | Denis Paperno | Albert Gatt | Sofia Axioti
Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing

pdf bib abs

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

pdf bib abs

HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales
Michele Cafagna | Kees van Deemter | Albert Gatt
Proceedings of the 16th International Natural Language Generation Conference

Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. “people eating food in a park”. Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning, with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict (“people at a holiday resort”) and the actions they perform (“people having a picnic”). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, a dataset extending 14997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically, by combining each of the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.

pdf bib abs

Challenges in Reproducing Human Evaluation Results for Role-Oriented Dialogue Summarization
Takumi Ito | Qixiang Fang | Pablo Mosteiro | Albert Gatt | Kees van Deemter
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

There is a growing concern regarding the reproducibility of human evaluation studies in NLP. As part of the ReproHum campaign, we conducted a study to assess the reproducibility of a recent human evaluation study in NLP. Specifically, we attempted to reproduce a human evaluation of a novel approach to enhance Role-Oriented Dialogue Summarization by considering the influence of role interactions. Despite our best efforts to adhere to the reported setup, we were unable to reproduce the statistical results as presented in the original paper. While no contradictory evidence was found, our study raises questions about the validity of the reported statistical significance results, and/or the comprehensiveness with which the original study was reported. In this paper, we provide a comprehensive account of our reproduction study, detailing the methodologies employed, data collection, and analysis procedures. We discuss the implications of our findings for the broader issue of reproducibility in NLP research. Our findings serve as a cautionary reminder of the challenges in conducting reproducible human evaluations and prompt further discussions within the NLP community.

2022

pdf bib abs

Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
Michele Cafagna | Kees van Deemter | Albert Gatt
Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS)

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state of the art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.

pdf bib abs

Current image description generation models do not transfer well to the task of describing human faces. To encourage the development of more human-focused descriptions, we developed a new data set of facial descriptions based on the CelebA image data set. We describe the properties of this data set, and present results from a face description generator trained on it, which explores the feasibility of using transfer learning from VGGFace/ResNet CNNs. Comparisons are drawn through both automated metrics and human evaluation by 76 English-speaking participants. The descriptions generated by the VGGFace-LSTM + Attention model are closest to the ground truth according to human evaluation whilst the ResNet-LSTM + Attention model obtained the highest CIDEr and CIDEr-D results (1.252 and 0.686 respectively). Together, the new data set and these experimental results provide data and baselines for future work in this area.

pdf bib

Proceedings of the 2nd Workshop on People in Vision, Language, and the Mind
Patrizia Paggio | Albert Gatt | Marc Tanti
Proceedings of the 2nd Workshop on People in Vision, Language, and the Mind

pdf bib abs

Automatic Classification of Legal Violations in Cookie Banner Texts
Marieke Van Hofslot | Almila Akdag Salah | Albert Gatt | Cristiana Santos
Proceedings of the Natural Legal Language Processing Workshop 2022

Cookie banners are designed to request consent from website visitors for their personal data. Recent research suggest that a high percentage of cookie banners violate legal regulations as defined by the General Data Protection Regulation (GDPR) and the ePrivacy Directive. In this paper, we focus on language used in these cookie banners, and whether these violations can be automatically detected, or not. We make use of a small cookie banner dataset that is annotated by five experts for legal violations and test it with state of the art classification models, namely BERT, LEGAL-BERT, BART in a zero-shot setting and BERT with LIWC embeddings. Our results show that none of the models outperform the others in all classes, but in general, BERT and LEGAL-BERT provide the highest accuracy results (70%-97%). However, they are influenced by the small size and the unbalanced distributions in the dataset.

pdf bib abs

Enhancing and Evaluating the Grammatical Framework Approach to Logic-to-Text Generation
Eduardo Calò | Elze van der Werf | Albert Gatt | Kees van Deemter
Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Logic-to-text generation is an important yet underrepresented area of natural language generation (NLG). In particular, most previous works on this topic lack sound evaluation. We address this limitation by building and evaluating a system that generates high-quality English text given a first-order logic (FOL) formula as input. We start by analyzing the performance of Ranta (2011)’s system. Based on this analysis, we develop an extended version of the system, which we name LoLa, that performs formula simplification based on logical equivalences and syntactic transformations. We carry out an extensive evaluation of LoLa using standard automatic metrics and human evaluation. We compare the results against a baseline and Ranta (2011)’s system. The results show that LoLa outperforms the other two systems in most aspects.

This paper presents the Multitask, Multilingual, Multimodal Language Generation COST Action – Multi3Generation (CA18231), an interdisciplinary network of research groups working on different aspects of language generation. This “meta-paper” will serve as reference for citations of the Action in future publications. It presents the objectives, challenges and a the links for the achieved outcomes.

pdf bib abs

Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese
Kurt Micallef | Albert Gatt | Marc Tanti | Lonneke van der Plas | Claudia Borg
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT – Maltese – with a range of pre-training set ups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks – dependency parsing, part-of-speech tagging, and named-entity recognition – and one semantic classification task – sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough to make significant leaps in performance over Wikipedia-trained models. We pre-train and compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pretrained multilingual BERT (mBERTu). The models achieve state-of-the-art performance on these tasks, despite the new corpus being considerably smaller than typically used corpora for high-resourced languages. On average, BERTu outperforms or performs competitively with mBERTu, and the largest gains are observed for higher-level tasks.

pdf bib abs

Visually Grounded Interpretation of Noun-Noun Compounds in English
Inga Lang | Lonneke Plas | Malvina Nissim | Albert Gatt
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Noun-noun compounds (NNCs) occur frequently in the English language. Accurate NNC interpretation, i.e. determining the implicit relationship between the constituents of a NNC, is crucial for the advancement of many natural language processing tasks. Until now, computational NNC interpretation has been limited to approaches involving linguistic representations only. However, much research suggests that grounding linguistic representations in vision or other modalities can increase performance on this and other tasks. Our work is a novel comparison of linguistic and visuo-linguistic representations for the task of NNC interpretation. We frame NNC interpretation as a relation classification task, evaluating on a large, relationally-annotated NNC dataset. We combine distributional word vectors with image vectors to investigate how visual information can help improve NNC interpretation systems. We find that adding visual vectors increases classification performance on our dataset in many cases.

pdf bib abs

VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Letitia Parcalabescu | Michele Cafagna | Lilitta Muradjan | Anette Frank | Iacer Calixto | Albert Gatt
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.

2021

pdf bib abs

Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Letitia Parcalabescu | Albert Gatt | Anette Frank | Iacer Calixto
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations on specific phenomena.

pdf bib abs

Entity-Based Semantic Adequacy for Data-to-Text Generation
Juliette Faille | Albert Gatt | Claire Gardent
Findings of the Association for Computational Linguistics: EMNLP 2021

While powerful pre-trained language models have improved the fluency of text generation models, semantic adequacy -the ability to generate text that is semantically faithful to the input- remains an unsolved issue. In this paper, we introduce a novel automatic evaluation metric, Entity-Based Semantic Adequacy, which can be used to assess to what extent generation models that verbalise RDF (Resource Description Framework) graphs produce text that contains mentions of the entities occurring in the RDF input. This is important as RDF subject and object entities make up 2/3 of the input. We use our metric to compare 25 models from the WebNLG Shared Tasks and we examine correlation with results from human evaluations of semantic adequacy. We show that while our metric correlates with human evaluation scores, this correlation varies with the specifics of the human evaluation setup. This suggests that in order to measure the entity-based adequacy of generated texts, an automatic metric such as the one proposed here might be more reliable, as less subjective and more focused on correct verbalisation of the input, than human evaluation measures.

pdf bib abs

On the Language-specificity of Multilingual BERT and the Impact of Fine-tuning
Marc Tanti | Lonneke van der Plas | Claudia Borg | Albert Gatt
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Recent work has shown evidence that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific and a language-neutral one. This paper analyses the relationship between them, in the context of fine-tuning on two tasks – POS tagging and natural language inference – which require the model to bring to bear different degrees of language-specific knowledge. Visualisations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result that is supported by evidence from language identification experiments. However, further experiments on ‘unlearning’ language-specific representations using gradient reversal and iterative adversarial learning are shown not to add further improvement to the language-independent component over and above the effect of fine-tuning. The results presented here suggest that the process of fine-tuning causes a reorganisation of the model’s limited representational capacity, enhancing language-independent representations at the expense of language-specific ones.

2020

pdf bib

Proceedings of LREC2020 Workshop "People in language, vision and the mind" (ONION2020)
Patrizia Paggio | Albert Gatt | Roman Klinger
Proceedings of LREC2020 Workshop "People in language, vision and the mind" (ONION2020)

pdf bib abs

Towards Harnessing Natural Language Generation to Explain Black-box Models
Ettore Mariotti | Jose M. Alonso | Albert Gatt
2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence

The opaque nature of many machine learning techniques prevents the wide adoption of powerful information processing tools for high stakes scenarios. The emerging field eXplainable Artificial Intelligence (XAI) aims at providing justifications for automatic decision-making systems in order to ensure reliability and trustworthiness in the users. For achieving this vision, we emphasize the importance of a natural language textual modality as a key component for a future intelligent interactive agent. We outline the challenges of XAI and review a set of publications that work in this direction.

pdf bib abs

The Natural Language Pipeline, Neural Text Generation and Explainability
Juliette Faille | Albert Gatt | Claire Gardent
2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence

End-to-end encoder-decoder approaches to data-to-text generation are often black boxes whose predictions are difficult to explain. Breaking up the end-to-end model into sub-modules is a natural way to address this problem. The traditional pre-neural Natural Language Generation (NLG) pipeline provides a framework for breaking up the end-to-end encoder-decoder. We survey recent papers that integrate traditional NLG submodules in neural approaches and analyse their explainability. Our survey is a first step towards building explainable neural NLG models.

pdf bib abs

Maltese, the national language of Malta, is spoken by approximately 500,000 people. Speech processing for Maltese is still in its early stages of development. In this paper, we present the first spoken Maltese corpus designed purposely for Automatic Speech Recognition (ASR). The MASRI-HEADSET corpus was developed by the MASRI project at the University of Malta. It consists of 8 hours of speech paired with text, recorded by using short text snippets in a laboratory environment. The speakers were recruited from different geographical locations all over the Maltese islands, and were roughly evenly distributed by gender. This paper also presents some initial results achieved in baseline experiments for Maltese ASR using Sphinx and Kaldi. The MASRI HEADSET Corpus is publicly available for research/academic purposes.

pdf bib abs

Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis
Stavros Assimakopoulos | Rebecca Vella Muskat | Lonneke van der Plas | Albert Gatt
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper presents a novel scheme for the annotation of hate speech in corpora of Web 2.0 commentary. The proposed scheme is motivated by the critical analysis of posts made in reaction to news reports on the Mediterranean migration crisis and LGBTIQ+ matters in Malta, which was conducted under the auspices of the EU-funded C.O.N.T.A.C.T. project. Based on the realisation that hate speech is not a clear-cut category to begin with, appears to belong to a continuum of discriminatory discourse and is often realised through the use of indirect linguistic means, it is argued that annotation schemes for its detection should refrain from directly including the label ‘hate speech,’ as different annotators might have different thresholds as to what constitutes hate speech and what not. In view of this, we propose a multi-layer annotation scheme, which is pilot-tested against a binary ±hate speech classification and appears to yield higher inter-annotator agreement. Motivating the postulation of our scheme, we then present the MaNeCo corpus on which it will eventually be used; a substantial corpus of on-line newspaper comments spanning 10 years.

pdf bib

Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation
Daniel Sánchez | Raquel Hervás | Albert Gatt
Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation

pdf bib abs

Earlier research has shown that evaluation metrics based on textual similarity (e.g., BLEU, CIDEr, Meteor) do not correlate well with human evaluation scores for automatically generated text. We carried out an experiment with Chinese speakers, where we systematically manipulated image descriptions to contain different kinds of errors. Because our manipulated descriptions form minimal pairs with the reference descriptions, we are able to assess the impact of different kinds of errors on the perceived quality of the descriptions. Our results show that different kinds of errors elicit significantly different evaluation scores, even though all erroneous descriptions differ in only one character from the reference descriptions. Evaluation metrics based solely on textual similarity are unable to capture these differences, which (at least partially) explains their poor correlation with human judgments. Our work provides the foundations for future work, where we aim to understand why different errors are seen as more or less severe.

pdf bib abs

Unmasking Contextual Stereotypes: Measuring and Mitigating BERT’s Gender Bias
Marion Bartl | Malvina Nissim | Albert Gatt
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing

Contextualized word embeddings have been replacing standard embeddings as the representational knowledge source of choice in NLP systems. Since a variety of biases have previously been found in standard word embeddings, it is crucial to assess biases encoded in their replacements as well. Focusing on BERT (Devlin et al., 2018), we measure gender bias by studying associations between gender-denoting target words and names of professions in English and German, comparing the findings with real-world workforce statistics. We mitigate bias by fine-tuning BERT on the GAP corpus (Webster et al., 2018), after applying Counterfactual Data Substitution (CDS) (Maudslay et al., 2019). We show that our method of measuring bias is appropriate for languages such as English, but not for languages with a rich morphology and gender-marking, such as German. Our results highlight the importance of investigating bias and mitigation techniques cross-linguistically,especially in view of the current emphasis on large-scale, multilingual language models.

pdf bib abs

An ongoing debate in the NLG community concerns the best way to evaluate systems, with human evaluation often being considered the most reliable method, compared to corpus-based metrics. However, tasks involving subtle textual differences, such as style transfer, tend to be hard for humans to perform. In this paper, we propose an evaluation method for this task based on purposely-trained classifiers, showing that it better reflects system differences than traditional metrics such as BLEU.

pdf bib

Datasets and Models for Authorship Attribution on Italian Personal Writings
Gaetana Ruggiero | Albert Gatt | Malvina Nissim
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

2019

pdf bib abs

Best practices for the human evaluation of automatically generated text
Chris van der Lee | Albert Gatt | Emiel van Miltenburg | Sander Wubben | Emiel Krahmer
Proceedings of the 12th International Conference on Natural Language Generation

Currently, there is little agreement as to how Natural Language Generation (NLG) systems should be evaluated. While there is some agreement regarding automatic metrics, there is a high degree of variation in the way that human evaluation is carried out. This paper provides an overview of how human evaluation is currently conducted, and presents a set of best practices, grounded in the literature. With this paper, we hope to contribute to the quality and consistency of human evaluations in NLG.

pdf bib abs

Visually grounded generation of entailments from premises
Somayeh Jafaritazehjani | Albert Gatt | Marc Tanti
Proceedings of the 12th International Conference on Natural Language Generation

Natural Language Inference (NLI) is the task of determining the semantic relationship between a premise and a hypothesis. In this paper, we focus on the generation of hypotheses from premises in a multimodal setting, to generate a sentence (hypothesis) given an image and/or its description (premise) as the input. The main goals of this paper are (a) to investigate whether it is reasonable to frame NLI as a generation task; and (b) to consider the degree to which grounding textual premises in visual information is beneficial to generation. We compare different neural architectures, showing through automatic and human evaluation that entailments can indeed be generated successfully. We also show that multimodal models outperform unimodal models in this task, albeit marginally

pdf bib abs

You Write like You Eat: Stylistic Variation as a Predictor of Social Stratification
Angelo Basile | Albert Gatt | Malvina Nissim
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Inspired by Labov’s seminal work on stylisticvariation as a function of social stratification,we develop and compare neural models thatpredict a person’s presumed socio-economicstatus, obtained through distant supervision,from their writing style on social media. Thefocus of our work is on identifying the mostimportant stylistic parameters to predict socio-economic group. In particular, we show theeffectiveness of morpho-syntactic features aspredictors of style, in contrast to lexical fea-tures, which are good predictors of topic

2018

pdf bib abs

Specificity measures and reference
Albert Gatt | Nicolás Marín | Gustavo Rivas-Gervilla | Daniel Sánchez
Proceedings of the 11th International Conference on Natural Language Generation

In this paper we study empirically the validity of measures of referential success for referring expressions involving gradual properties. More specifically, we study the ability of several measures of referential success to predict the success of a user in choosing the right object, given a referring expression. Experimental results indicate that certain fuzzy measures of success are able to predict human accuracy in reference resolution. Such measures are therefore suitable for the estimation of the success or otherwise of a referring expression produced by a generation algorithm, especially in case the properties in a domain cannot be assumed to have crisp denotations.

pdf bib abs

Meteorologists and Students: A resource for language grounding of geographical descriptors
Alejandro Ramos-Soto | Ehud Reiter | Kees van Deemter | Jose M. Alonso | Albert Gatt
Proceedings of the 11th International Conference on Natural Language Generation

We present a data resource which can be useful for research purposes on language grounding tasks in the context of geographical referring expression generation. The resource is composed of two data sets that encompass 25 different geographical descriptors and a set of associated graphical representations, drawn as polygons on a map by two groups of human subjects: teenage students and expert meteorologists.

pdf bib

Proceedings of the 11th International Conference on Natural Language Generation
Emiel Krahmer | Albert Gatt | Martijn Goudbeek
Proceedings of the 11th International Conference on Natural Language Generation

pdf bib

pdf bib abs

Capturing semantic relations between sentences, such as entailment, is a long-standing challenge for computational semantics. Logic-based models analyse entailment in terms of possible worlds (interpretations, or situations) where a premise P entails a hypothesis H iff in all worlds where P is true, H is also true. Statistical models view this relationship probabilistically, addressing it in terms of whether a human would likely infer H from P. In this paper, we wish to bridge these two perspectives, by arguing for a visually-grounded version of the Textual Entailment task. Specifically, we ask whether models can perform better if, in addition to P and H, there is also an image (corresponding to the relevant “world” or “situation”). We use a multimodal version of the SNLI dataset (Bowman et al., 2015) and we compare “blind” and visually-augmented models of textual entailment. We show that visual information is beneficial, but we also conduct an in-depth error analysis that reveals that current multimodal models are not performing “grounding” in an optimal fashion.

2017

pdf bib abs

System using BiLSTM and max pooling. Embedding is enhanced by POS, character and dependency info.

pdf bib abs

What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?
Marc Tanti | Albert Gatt | Kenneth Camilleri
Proceedings of the 10th International Conference on Natural Language Generation

Image captioning has evolved into a core task for Natural Language Generation and has also proved to be an important testbed for deep learning approaches to handling multimodal representations. Most contemporary approaches rely on a combination of a convolutional network to handle image features, and a recurrent network to encode linguistic information. The latter is typically viewed as the primary “generation” component. Beyond this high-level characterisation, a CNN+RNN model supports a variety of architectural designs. The dominant model in the literature is one in which visual features encoded by a CNN are “injected” as part of the linguistic encoding process, driving the RNN’s linguistic choices. By contrast, it is possible to envisage an architecture in which visual and linguistic features are encoded separately, and merged at a subsequent stage. In this paper, we address two related questions: (1) Is direct injection the best way of combining multimodal information, or is a late merging alternative better for the image captioning task? (2) To what extent should a recurrent network be viewed as actually generating, rather than simply encoding, linguistic information?

pdf bib abs

Morphological Analysis for the Maltese Language: The challenges of a hybrid system
Claudia Borg | Albert Gatt
Proceedings of the Third Arabic Natural Language Processing Workshop

Maltese is a morphologically rich language with a hybrid morphological system which features both concatenative and non-concatenative processes. This paper analyses the impact of this hybridity on the performance of machine learning techniques for morphological labelling and clustering. In particular, we analyse a dataset of morphologically related word clusters to evaluate the difference in results for concatenative and non-concatenative clusters. We also describe research carried out in morphological labelling, with a particular focus on the verb category. Two evaluations were carried out, one using an unseen dataset, and another one using a gold standard dataset which was manually labelled. The gold standard dataset was split into concatenative and non-concatenative to analyse the difference in results between the two morphological systems.

2015

pdf bib

Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)
Anya Belz | Albert Gatt | François Portet | Matthew Purver
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

2014

bib abs

Crowd-sourcing evaluation of automatically acquired, morphologically related word groupings
Claudia Borg | Albert Gatt
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The automatic discovery and clustering of morphologically related words is an important problem with several practical applications. This paper describes the evaluation of word clusters carried out through crowd-sourcing techniques for the Maltese language. The hybrid (Semitic-Romance) nature of Maltese morphology, together with the fact that no large-scale lexical resources are available for Maltese, make this an interesting and challenging problem.

pdf bib

Learning when to point: A data-driven approach
Albert Gatt | Patrizia Paggio
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf bib

What and Where: An Empirical Investigation of Pointing Gestures and Descriptions in Multimodal Referring Actions
Albert Gatt | Patrizia Paggio
Proceedings of the 14th European Workshop on Natural Language Generation

pdf bib

Proceedings of the 14th European Workshop on Natural Language Generation
Albert Gatt | Horacio Saggion
Proceedings of the 14th European Workshop on Natural Language Generation

2012

bib abs

Incorporating an Error Corpus into a Spellchecker for Maltese
Michael Rosner | Albert Gatt | Andrew Attard | Jan Joachimsen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper discusses the ongoing development of a new Maltese spell checker, highlighting the methodologies which would best suit such a language. We thus discuss several previous attempts, highlighting what we believe to be their weakest point: a lack of attention to context. Two developments are of particular interest, both of which concern the availability of language resources relevant to spellchecking: (i) the Maltese Language Resource Server (MLRS) which now includes a representative corpus of c. 100M words extracted from diverse documents including the Maltese Legislation, press releases and extracts from Maltese web-pages and (ii) an extensive and detailed corpus of spelling errors that was collected whilst part of the MLRS texts were being prepared. We describe the structure of these resources as well as the experimental approaches focused on context that we are now in a position to adopt. We describe the framework within which a variety of different approaches to spellchecking and evaluation will be carried out, and briefly discuss the first baseline system we have implemented. We conclude the paper with a roadmap for future improvements.

bib abs

A Repository of Data and Evaluation Resources for Natural Language Generation
Anja Belz | Albert Gatt
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Starting in 2007, the field of natural language generation (NLG) has organised shared-task evaluation events every year, under the Generation Challenges umbrella. In the course of these shared tasks, a wealth of data has been created, along with associated task definitions and evaluation regimes. In other contexts too, sharable NLG data is now being created. In this paper, we describe the online repository that we have created as a one-stop resource for obtaining NLG task materials, both from Generation Challenges tasks and from other sources, where the set of materials provided for each task consists of (i) task definition, (ii) input and output data, (iii) evaluation software, (iv) documentation, and (v) publications reporting previous results.

2011

pdf bib

Generation Challenges 2011 Preface
Anja Belz | Albert Gatt | Alexander Koller | Kristina Striegnitz
Proceedings of the 13th European Workshop on Natural Language Generation

pdf bib

If it may have happened before, it happened, but not necessarily before
Albert Gatt | François Portet
Proceedings of the 13th European Workshop on Natural Language Generation

pdf bib

What is in a text and what does it do: Qualitative Evaluations of an NLG system – the BT-Nurse – using content analysis and discourse analysis
Rahul Sambaraju | Ehud Reiter | Robert Logie | Andy Mckinlay | Chris McVittie | Albert Gatt | Cindy Sykes
Proceedings of the 13th European Workshop on Natural Language Generation

pdf bib

Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop
Anja Belz | Roger Evans | Albert Gatt | Kristina Striegnitz
Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop

2010

pdf bib

Generation Challenges 2010 Preface
Anja Belz | Albert Gatt | Alexander Koller
Proceedings of the 6th International Natural Language Generation Conference

pdf bib

Textual Properties and Task-based Evaluation: Investigating the Role of Surface Properties, Structure and Content
Albert Gatt | François Portet
Proceedings of the 6th International Natural Language Generation Conference

2009

pdf bib

The GREC Main Subject Reference Generation Challenge 2009: Overview and Evaluation Results
Anja Belz | Eric Kow | Jette Viethen | Albert Gatt
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

pdf bib

The TUNA-REG Challenge 2009: Overview and Evaluation Results
Albert Gatt | Anja Belz | Eric Kow
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf bib

Generation Challenges 2009: Preface
Anja Belz | Albert Gatt
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf bib

A Hearer-Oriented Evaluation of Referring Expression Generation
Imtiaz Hussain Khan | Kees van Deemter | Graeme Ritchie | Albert Gatt | Alexandra A. Cleland
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf bib

SimpleNLG: A Realisation Engine for Practical Applications
Albert Gatt | Ehud Reiter
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf bib

Text Content and Task Performance in the Evaluation of a Natural Language Generation System
Albert Gatt | François Portet
Proceedings of the International Conference RANLP-2009

pdf bib abs

Le projet BabyTalk : génération de texte à partir de données hétérogènes pour la prise de décision en unité néonatale
François Portet | Albert Gatt | Jim Hunter | Ehud Reiter | Somayajulu Sripada
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Notre société génère une masse d’information toujours croissante, que ce soit en médecine, en météorologie, etc. La méthode la plus employée pour analyser ces données est de les résumer sous forme graphique. Cependant, il a été démontré qu’un résumé textuel est aussi un mode de présentation efficace. L’objectif du prototype BT-45, développé dans le cadre du projet Babytalk, est de générer des résumés de 45 minutes de signaux physiologiques continus et d’événements temporels discrets en unité néonatale de soins intensifs (NICU). L’article présente l’aspect génération de texte de ce prototype. Une expérimentation clinique a montré que les résumés humains améliorent la prise de décision par rapport à l’approche graphique, tandis que les textes de BT-45 donnent des résultats similaires à l’approche graphique. Une analyse a identifié certaines des limitations de BT-45 mais en dépit de cellesci, notre travail montre qu’il est possible de produire automatiquement des résumés textuels efficaces de données complexes.