2024
High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models
Michela Lorandi | Anya Belz
Findings of the Association for Computational Linguistics: EACL 2024
The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well-resourced languages. We explore the extent to which pretrained large language models (LLMs) can bridge this gap, via the example of data-to-text generation for Irish, Welsh, Breton and Maltese. We test LLMs on these under-resourced languages and English, in a range of scenarios. We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins, as measured by both automatic and human evaluations. For all our languages, human evaluation shows that our best systems perform on a par with humans, but BLEU scores collapse compared to English, casting doubt on the metric’s suitability for evaluating non-task-specific systems. Overall, our results demonstrate the great potential of LLMs to bridge the performance gap for under-resourced languages.
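To make the setup concrete, here is a minimal sketch of prompting an out-of-the-box LLM to verbalise a small set of triples in one of the target languages. The model name, prompt wording and triples are illustrative assumptions, not the paper's exact configuration; it requires the openai Python package and an API key in the environment.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Illustrative input: WebNLG-style subject | property | object triples.
    triples = [
        ("Dublin", "country", "Ireland"),
        ("Dublin", "population", "592713"),
    ]
    facts = "\n".join(f"{s} | {p} | {o}" for s, p, o in triples)

    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Verbalise the following facts as fluent Irish text:\n{facts}",
        }],
    )
    print(response.choices[0].message.content)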
Differences in Semantic Errors Made by Different Types of Data-to-text Systems
Rudali Huidrom | Anya Belz | Michela Lorandi
Proceedings of the 17th International Natural Language Generation Conference
In this paper, we investigate how the semantic, or content-related, errors made by different types of data-to-text systems differ in number and type. In total, we examine 15 systems: three rule-based and 12 neural systems, including two large language models used without training or fine-tuning. All systems were tested on the English WebNLG dataset version 3.0. We use a semantic error taxonomy and the brat annotation tool to obtain word-span error annotations on a sample of system outputs. The annotations enable us to establish how many semantic errors different (types of) systems make and what specific types of errors they make, and thus to get an overall understanding of semantic strengths and weaknesses among various types of NLG systems. Among our main findings, we observe that symbolic (rule- and template-based) systems make fewer semantic errors overall; non-LLM neural systems have better fluency and data coverage, but make more semantic errors; and LLM-based systems require improvement particularly in addressing superfluous content.
Filling Gaps in Wikipedia: Leveraging Data-to-Text Generation to Improve Encyclopedic Coverage of Underrepresented Groups
Simon Mille | Massimiliano Pronesti | Craig Thomson | Michela Lorandi | Sophie Fitzpatrick | Rudali Huidrom | Mohammed Sabry | Amy O’Riordan | Anya Belz
Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations
Wikipedia is known to have systematic gaps in its coverage that correspond to under-resourced languages as well as underrepresented groups. This paper presents a new tool to support efforts to fill in these gaps by automatically generating draft articles and facilitating post-editing and uploading to Wikipedia. A rule-based generator and an input-constrained LLM are used to generate two alternative articles, enabling the often more fluent, but error-prone, LLM-generated article to be content-checked against the more reliable, but less fluent, rule-generated article.
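The content-checking idea can be illustrated with a deliberately naive sketch (this is not the demo tool's actual method): flag salient items, here numbers and capitalised tokens, that appear in the LLM draft but not in the more reliable rule-generated draft, as candidates for post-editing.

    import re

    def salient_items(text: str) -> set[str]:
        # Numbers and capitalised tokens as a crude proxy for content units.
        return set(re.findall(r"\b(?:[A-Z][a-zA-Z]+|\d[\d.,]*)\b", text))

    def flag_unsupported(llm_draft: str, rule_draft: str) -> set[str]:
        # Items present only in the LLM draft may be hallucinated content.
        return salient_items(llm_draft) - salient_items(rule_draft)

    print(flag_unsupported(
        "Ada Lovelace was born in 1815 in London.",
        "Ada Lovelace was born in 1815.",
    ))  # -> {'London'}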
Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
Michela Lorandi | Anya Belz
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Rerunning a metric-based evaluation should be more straightforward, and results should be closer, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. However, as this brief report of our efforts to rerun a metric-based evaluation of a set of multi-aspect controllable text generation (CTG) techniques shows, such reruns do not always produce results that are the same as the original results, and can reveal errors in the original work.
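One common source of such divergence is unreported metric configuration. The sketch below (illustrative, not the paper's setup) shows how the same outputs receive different sacreBLEU scores under different tokenizer settings, which is enough to make a rerun miss the original numbers.

    import sacrebleu

    hyps = ["The cat sat on the mat."]
    refs = [["The cat is sitting on the mat."]]

    default = sacrebleu.corpus_bleu(hyps, refs)  # default '13a' tokenizer
    char_level = sacrebleu.corpus_bleu(hyps, refs, tokenize="char")

    print(f"13a tokenizer:  {default.score:.2f}")
    print(f"char tokenizer: {char_level.score:.2f}")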
2023
How to Control Sentiment in Text Generation: A Survey of the State-of-the-Art in Sentiment-Control Techniques
Michela Lorandi | Anya Belz
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Recent advances in the development of large pretrained language models, such as GPT, BERT and Bloom, have achieved remarkable performance on a wide range of different NLP tasks. However, when used for text generation tasks, these models still have limitations when it comes to controlling the content and style of the generated text, often producing content that is incorrect, irrelevant, or inappropriate in the context of a given task. In this survey paper, we explore methods for controllable text generation with a focus on sentiment control. We systematically collect papers from the ACL Anthology, create a categorisation scheme based on different control techniques and controlled attributes, and use the scheme to categorise and compare methods. The result is a detailed and comprehensive overview of state-of-the-art techniques for sentiment-controlled text generation, categorised by how the control is implemented and what attributes are controlled, providing a clear idea of their relative strengths and weaknesses.
The 2023 WebNLG Shared Task on Low Resource Languages. Overview and Evaluation Results (WebNLG 2023)
Liam Cripwell | Anya Belz | Claire Gardent | Albert Gatt | Claudia Borg | Marthese Borg | John Judge | Michela Lorandi | Anna Nikiforovskaya | William Soto Martinez
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)
The WebNLG task consists of mapping a knowledge graph to a text verbalising the content of that graph. The 2017 WebNLG edition required participating systems to generate English text from a set of DBpedia triples, while the 2020 WebNLG+ challenge additionally included generation into Russian and semantic parsing of English and Russian texts. In contrast, WebNLG 2023 focuses on four under-resourced languages which are severely under-represented in research on text generation, namely Breton, Irish, Maltese and Welsh. In addition, WebNLG 2023 once again includes Russian. In this paper, we present the organisation of the shared task (data, timeline, evaluation), briefly describe the participating systems and summarise their results.
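For illustration, a WebNLG input is a set of subject | property | object triples and the target output is a text verbalising them; the following pair is in the style of earlier English editions of the data, not taken from the 2023 set:

    Input triples:
      Alan_Bean | occupation | Test_pilot
      Alan_Bean | was_a_crew_member_of | Apollo_12
    Target text: Alan Bean was a test pilot and a crew member of Apollo 12.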
Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5: A Bit of Help Needed from Google Translate (WebNLG 2023)
Michela Lorandi | Anya Belz
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)
LLMs are great at tasks involving English, which dominates their training data. We explore their ability to address tasks involving languages that are severely under-represented in their training data. More specifically, we do this in the context of data-to-text generation for Irish, Maltese, Welsh and Breton. During the prompt-engineering phase we tested GPT-3.5 and 4 with a range of prompt types and formats on a small sample of example input/output pairs. We then fully evaluated the two most promising prompts in two scenarios: (i) direct generation into the under-resourced languages, and (ii) generation into English followed by translation into the under-resourced languages. We find that few-shot prompting works better for direct generation into under-resourced languages, but that the difference disappears when pivoting via English. The few-shot + translation system variants were submitted to the WebNLG 2023 shared task, where they outperformed all other systems by substantial margins in all languages on all automatic metrics. We conclude that good performance can be achieved with state-of-the-art LLMs out of the box for under-resourced languages. However, best results (for Welsh) of BLEU 25.12, ChrF++ 0.55, and TER 0.64 are well below the lowest ranked English system at WebNLG’20 with BLEU 0.391, ChrF++ 0.579, and TER 0.564.
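The two scenarios can be summarised in a short sketch (illustrative code, not the authors' implementation; generate stands in for a few-shot GPT-3.5/4 call and translate for a Google Translate call, both hypothetical placeholders here):

    def generate(triples: str, language: str) -> str:
        # Placeholder for a few-shot LLM call that verbalises triples in `language`.
        raise NotImplementedError

    def translate(text: str, target_language: str) -> str:
        # Placeholder for a machine translation call.
        raise NotImplementedError

    def scenario_direct(triples: str, language: str) -> str:
        # (i) direct generation into the under-resourced language
        return generate(triples, language)

    def scenario_pivot(triples: str, language: str) -> str:
        # (ii) generate in English, then translate into the target language
        return translate(generate(triples, "English"), language)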
2022
Evaluation of Response Generation Models: Shouldn’t It Be Shareable and Replicable?
Seyed Mahed Mousavi | Gabriel Roccabruna | Michela Lorandi | Simone Caldarella | Giuseppe Riccardi
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Human Evaluation (HE) of automatically generated responses is necessary for the advancement of human-machine dialogue research. Current automatic evaluation measures are poor surrogates, at best. There are no agreed-upon HE protocols and it is difficult to develop them. As a result, researchers either perform non-replicable, non-transparent and inconsistent procedures or, worse, limit themselves to automated metrics. We propose to standardize the human evaluation of response generation models by publicly sharing a detailed protocol. The proposal includes the task design, annotator recruitment, task execution, and annotation reporting. The protocol and process can be used as is, in whole or in part, or modified and extended by the research community. We validate the protocol by evaluating two conversationally fine-tuned state-of-the-art models (GPT-2 and T5) for the complex task of personalized response generation. We invite the community to use this protocol - or its future community-amended versions - as a transparent, replicable, and comparable approach to HE of generated responses.