Michela Lorandi


2024

pdf bib
High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models
Michela Lorandi | Anya Belz
Findings of the Association for Computational Linguistics: EACL 2024

The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well resourced languages. We explore the extent to which pretrained large language models (LLMs) can bridge this gap, via the example of data-to-text generation for Irish, Welsh, Breton and Maltese. We test LLMs on these under-resourced languages and English, in a range of scenarios. We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins, as measured by both automatic and human evaluations. For all our languages, human evaluation shows on-a-par performance with humans for our best systems, but BLEU scores collapse compared to English, casting doubt on the metric’s suitability for evaluating non-task-specific systems. Overall, our results demonstrate the great potential of LLMs to bridge the performance gap for under-resourced languages.

pdf bib
Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
Michela Lorandi | Anya Belz
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

Rerunning a metric-based evaluation should be more straightforward and results should be closer than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. As this brief report of our efforts to rerun a metric-based evaluation of a set of multi-aspect controllable text generation (CTG) techniques shows however, such reruns of evaluations do not always produce results that are the same as the original results, and can reveal errors in the orginal work.

pdf bib
Differences in Semantic Errors Made by Different Types of Data-to-text Systems
Rudali Huidrom | Anya Belz | Michela Lorandi
Proceedings of the 17th International Natural Language Generation Conference

In this paper, we investigate how different semantic, or content-related, errors made by different types of data-to-text systems differ in terms of number and type. In total, we examine 15 systems: three rule-based and 12 neural systems including two large language models without training or fine-tuning. All systems were tested on the English WebNLG dataset version 3.0. We use a semantic error taxonomy and the brat annotation tool to obtain word-span error annotations on a sample of system outputs. The annotations enable us to establish how many semantic errors different (types of) systems make and what specific types of errors they make, and thus to get an overall understanding of semantic strengths and weaknesses among various types of NLG systems. Among our main findings, we observe that symbolic (rule and template-based) systems make fewer semantic errors overall, non-LLM neural systems have better fluency and data coverage, but make more semantic errors, while LLM-based systems require improvement particularly in addressing superfluous.

pdf bib
Filling Gaps in Wikipedia: Leveraging Data-to-Text Generation to Improve Encyclopedic Coverage of Underrepresented Groups
Simon Mille | Massimiliano Pronesti | Craig Thomson | Michela Lorandi | Sophie Fitzpatrick | Rudali Huidrom | Mohammed Sabry | Amy O’Riordan | Anya Belz
Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations

Wikipedia is known to have systematic gaps in its coverage that correspond to under-resourced languages as well as underrepresented groups. This paper presents a new tool to support efforts to fill in these gaps by automatically generating draft articles and facilitating post-editing and uploading to Wikipedia. A rule-based generator and an input-constrained LLM are used to generate two alternative articles, enabling the often more fluent, but error-prone, LLM-generated article to be content-checked against the more reliable, but less fluent, rule-generated article.

pdf bib
DCU-NLG-PBN at the GEM’24 Data-to-Text Task: Open-Source LLM PEFT-Tuning for Effective Data-to-Text Generation
Michela Lorandi | Anya Belz
Proceedings of the 17th International Natural Language Generation Conference: Generation Challenges

LLMs have been used in various tasks with impressive success, including data-to-text generation. However, one concern when LLMs are compared to alternative methods is data contamination, in other words, for many datasets the data used in training these models may have included publicly available test sets. In this paper, we explore the performance of LLMs using newly constructed datasets in the context of data-to-text generation for English, Chinese, German, Russian, Spanish, Korean, Hindi, Swahili, and Arabic. We performed a testing phase to evaluate a range of prompt types and a fine-tuning technique on Mistral 7B and Falcon 40B. We then fully evaluated the most promising system for each scenario: (i) LLM prompting in English followed by translation, and (ii) LLM PEFT-tuning in English followed by translation. We find that fine-tuning Mistral outperforms all other tested systems and achieves performance close to GPT-3.5. The few-shot prompting with a dynamic selection of examples achieves higher results among prompting. The human evaluation to be carried out by the shared-task organisers will provide insight into the performance of the new datasets. In conclusion, we observed how the fine-tuning of an open-source LLM can achieve good performance close to state-of-the-art closed-source LLM while using considerably fewer resources.

2023

pdf bib
The 2023 WebNLG Shared Task on Low Resource Languages. Overview and Evaluation Results (WebNLG 2023)
Liam Cripwell | Anya Belz | Claire Gardent | Albert Gatt | Claudia Borg | Marthese Borg | John Judge | Michela Lorandi | Anna Nikiforovskaya | William Soto Martinez
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)

The WebNLG task consists of mapping a knowledge graph to a text verbalising the con- tent of that graph. The 2017 WebNLG edi- tion required participating systems to gener- ate English text from a set of DBpedia triples, while the 2020 WebNLG+ challenge addition- ally included generation into Russian and se- mantic parsing of English and Russian texts. In contrast, WebNLG 2023 focuses on four under-resourced languages which are severely under-represented in research on text genera- tion, namely Breton, Irish, Maltese and Welsh. In addition, WebNLG 2023 once again includes Russian. In this paper, we present the organi- sation of the shared task (data, timeline, eval- uation), briefly describe the participating sys- tems and summarise results for participating systems.

pdf bib
Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5: A Bit of Help Needed from Google Translate (WebNLG 2023)
Michela Lorandi | Anya Belz
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)

LLMs are great at tasks involving English which dominates in their training data. We explore their ability to address tasks involving languages that are severely under-represented in their training data. More specifically, we do this in the context of data-to-text generation for Irish, Maltese, Welsh and Breton. During the prompt-engineering phase we tested GPT-3.5 and~4 with a range of prompt types and formats on a small sample of example input/output pairs. We then fully evaluated the two most promising prompts in two scenarios: (i) direct generation into the under-resourced languages, and (ii) generation into English followed by translation into the under-resourced languages. We find that few-shot prompting works better for direct generation into under-resourced languages, but that the difference disappears when pivoting via English. The few-shot + translation system variants were submitted to the WebNLG 2023 shared task where they outperformed all other systems by substantial margins in all languages on all automatic metrics. We conclude that good performance can be achieved with state-of-the-art LLMs out-of-the box for under-resourced languages. However, best results (for Welsh) of BLEU 25.12, ChrF++ 0.55, and TER 0.64 are well below the lowest ranked English system at WebNLG’20 with BLEU 0.391, ChrF++ 0.579, and TER 0.564.

pdf bib
How to Control Sentiment in Text Generation: A Survey of the State-of-the-Art in Sentiment-Control Techniques
Michela Lorandi | Anya Belz
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Recent advances in the development of large Pretrained Language Models, such as GPT, BERT and Bloom, have achieved remarkable performance on a wide range of different NLP tasks. However, when used for text generation tasks, these models still have limitations when it comes to controlling the content and style of the generated text, often producing content that is incorrect, irrelevant, or inappropriate in the context of a given task. In this survey paper, we explore methods for controllable text generation with a focus on sentiment control. We systematically collect papers from the ACL Anthology, create a categorisation scheme based on different control techniques and controlled attributes, and use the scheme to categorise and compare methods. The result is a detailed and comprehensive overview of state-of-the-art techniques for sentiment-controlled text generation categorised on the basis of how the control is implemented and what attributes are controlled and providing a clear idea of their relative strengths and weaknesses.

2022

pdf bib
Evaluation of Response Generation Models: Shouldn’t It Be Shareable and Replicable?
Seyed Mahed Mousavi | Gabriel Roccabruna | Michela Lorandi | Simone Caldarella | Giuseppe Riccardi
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Human Evaluation (HE) of automatically generated responses is necessary for the advancement of human-machine dialogue research. Current automatic evaluation measures are poor surrogates, at best. There are no agreed-upon HE protocols and it is difficult to develop them. As a result, researchers either perform non-replicable, non-transparent and inconsistent procedures or, worse, limit themselves to automated metrics. We propose to standardize the human evaluation of response generation models by publicly sharing a detailed protocol. The proposal includes the task design, annotators recruitment, task execution, and annotation reporting. Such protocol and process can be used as-is, as-a-whole, in-part, or modified and extended by the research community. We validate the protocol by evaluating two conversationally fine-tuned state-of-the-art models (GPT-2 and T5) for the complex task of personalized response generation. We invite the community to use this protocol - or its future community amended versions - as a transparent, replicable, and comparable approach to HE of generated responses.