Simon Mille


2024

pdf bib
Proceedings of the 2nd Workshop on Practical LLM-assisted Data-to-Text Generation
Simone Balloccu | Zdeněk Kasner | Ondřej Plátek | Patrícia Schmidtová | Kristýna Onderková | Mateusz Lango | Ondřej Dušek | Lucie Flek | Ehud Reiter | Dimitra Gkatzia | Simon Mille
Proceedings of the 2nd Workshop on Practical LLM-assisted Data-to-Text Generation

pdf bib
QCET: An Interactive Taxonomy of Quality Criteria for Comparable and Repeatable Evaluation of NLP Systems
Anya Belz | Simon Mille | Craig Thomson | Rudali Huidrom
Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations

Four years on from two papers (Belz et al., 2020; Howcroft et al., 2020) that first called out the lack of standardisation and comparability in the quality criteria assessed in NLP system evaluations, researchers still use widely differing quality criteria names and definitions, meaning that it continues to be unclear when the same aspect of quality is being assessed in two evaluations. While normalised quality criteria were proposed at the time, the list was unwieldy and using it came with a steep learning curve. In this demo paper, our aim is to address these issues with an interactive taxonomy tool that enables quick perusal and selection of the quality criteria, and provides decision support and examples of use at each node.

pdf bib
Filling Gaps in Wikipedia: Leveraging Data-to-Text Generation to Improve Encyclopedic Coverage of Underrepresented Groups
Simon Mille | Massimiliano Pronesti | Craig Thomson | Michela Lorandi | Sophie Fitzpatrick | Rudali Huidrom | Mohammed Sabry | Amy O’Riordan | Anya Belz
Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations

Wikipedia is known to have systematic gaps in its coverage that correspond to under-resourced languages as well as underrepresented groups. This paper presents a new tool to support efforts to fill in these gaps by automatically generating draft articles and facilitating post-editing and uploading to Wikipedia. A rule-based generator and an input-constrained LLM are used to generate two alternative articles, enabling the often more fluent, but error-prone, LLM-generated article to be content-checked against the more reliable, but less fluent, rule-generated article.

pdf bib
Proceedings of the 17th International Natural Language Generation Conference: Tutorial Abstract
Anya Belz | João Sedoc | Craig Thomson | Simon Mille | Rudali Huidrom
Proceedings of the 17th International Natural Language Generation Conference: Tutorial Abstract

pdf bib
The INLG 2024 Tutorial on Human Evaluation of NLP System Quality: Background, Overall Aims, and Summaries of Taught Units
Anya Belz | João Sedoc | Craig Thomson | Simon Mille | Rudali Huidrom
Proceedings of the 17th International Natural Language Generation Conference: Tutorial Abstract

Following numerous calls in the literature for improved practices and standardisation in human evaluation in Natural Language Processing over the past ten years, we held a tutorial on the topic at the 2024 INLG Conference. The tutorial addressed the structure, development, design, implementation, execution and analysis of human evaluations of NLP system quality. Hands-on practical sessions were run, designed to facilitate assimilation of the material presented. Slides, lecture recordings, code and data have been made available on GitHub (https://github.com/Human-Evaluation-Tutorial/INLG-2024-Tutorial). In this paper, we provide summaries of the content of the eight units of the tutorial, alongside its research context and aims.

pdf bib
On the Role of Summary Content Units in Text Summarization Evaluation
Marcel Nawrath | Agnieszka Nowak | Tristan Ratz | Danilo Walenta | Juri Opitz | Leonardo Ribeiro | João Sedoc | Daniel Deutsch | Simon Mille | Yixin Liu | Sebastian Gehrmann | Lining Zhang | Saad Mahamood | Miruna Clinciu | Khyathi Chandu | Yufang Hou
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

At the heart of the Pyramid evaluation method for text summarization lie human-written summary content units (SCUs). These SCUs are concise sentences that decompose a summary into small facts. Such SCUs can be used to judge the quality of a candidate summary, possibly partially automated via natural language inference (NLI) systems. Interestingly, with the aim to fully automate the Pyramid evaluation, Zhang and Bansal (2021) show that SCUs can be approximated by automatically generated semantic role triplets (STUs). However, several questions currently lack answers, in particular: i) Are there other ways of approximating SCUs that can offer advantages? ii) Under which conditions are SCUs (or their approximations) offering the most value? In this work, we examine two novel strategies to approximate SCUs: generating SCU approximations from AMR meaning representations (SMUs) and from large language models (SGUs), respectively. We find that while STUs and SMUs are competitive, the best approximation quality is achieved by SGUs. We also show through a simple sentence-decomposition baseline (SSUs) that SCUs (and their approximations) offer the most value when ranking short summaries, but may not help as much when ranking systems or longer summaries.

2023

pdf bib
A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization
Lining Zhang | Simon Mille | Yufang Hou | Daniel Deutsch | Elizabeth Clark | Yixin Liu | Saad Mahamood | Sebastian Gehrmann | Miruna Clinciu | Khyathi Raghavi Chandu | João Sedoc
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization. Thus, we investigate the recruitment of high-quality Amazon Mechanical Turk workers via a two-step pipeline. We show that we can successfully filter out subpar workers before they carry out the evaluations and obtain high-agreement annotations with similar constraints on resources. Although our workers demonstrate a strong consensus among themselves and with CloudResearch workers, their alignment with expert judgments on a subset of the data is weaker than expected, indicating that further training on correctness is needed. This paper nevertheless serves as a best-practice guide for recruiting qualified annotators for other challenging annotation tasks.

pdf bib
Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Simon Mille
Findings of the Association for Computational Linguistics: ACL 2023

Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirement of all evaluations, but in particular where they are used for meta-evaluation, is that they should support the same conclusions if repeated. However, the reproducibility of human evaluations is virtually never queried, let alone formally tested, in NLP, which means that their repeatability and the reproducibility of their results is currently an open question. This focused contribution reports our review of human evaluation experiments reported in NLP papers over the past five years, which we assessed in terms of their ability to be rerun. Overall, we estimate that just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitive barriers to repetition, and (ii) sufficient information about experimental design is publicly available for rerunning them. Our estimate goes up to about 20% when author help is sought. We complement this investigation with a survey of results concerning the reproducibility of human evaluations where those are repeatable in the first place. Here we find worryingly low degrees of reproducibility, both in terms of similarity of scores and of findings supported by them. We summarise what insights can be gleaned so far regarding how to make human evaluations in NLP more repeatable and more reproducible.

pdf bib
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Kaustubh Dhole | Varun Gangal | Sebastian Gehrmann | Aadesh Gupta | Zhenhao Li | Saad Mahamood | Abinaya Mahadiran | Simon Mille | Ashish Shrivastava | Samson Tan | Tongshang Wu | Jascha Sohl-Dickstein | Jinho Choi | Eduard Hovy | Ondřej Dušek | Sebastian Ruder | Sajant Anand | Nagender Aneja | Rabin Banjade | Lisa Barthe | Hanna Behnke | Ian Berlot-Attwell | Connor Boyle | Caroline Brun | Marco Antonio Sobrevilla Cabezudo | Samuel Cahyawijaya | Emile Chapuis | Wanxiang Che | Mukund Choudhary | Christian Clauss | Pierre Colombo | Filip Cornell | Gautier Dagan | Mayukh Das | Tanay Dixit | Thomas Dopierre | Paul-Alexis Dray | Suchitra Dubey | Tatiana Ekeinhor | Marco Di Giovanni | Tanya Goyal | Rishabh Gupta | Louanes Hamla | Sang Han | Fabrice Harel-Canada | Antoine Honoré | Ishan Jindal | Przemysław Joniak | Denis Kleyko | Venelin Kovatchev | Kalpesh Krishna | Ashutosh Kumar | Stefan Langer | Seungjae Ryan Lee | Corey James Levinson | Hualou Liang | Kaizhao Liang | Zhexiong Liu | Andrey Lukyanenko | Vukosi Marivate | Gerard de Melo | Simon Meoni | Maxine Meyer | Afnan Mir | Nafise Sadat Moosavi | Niklas Meunnighoff | Timothy Sum Hon Mun | Kenton Murray | Marcin Namysl | Maria Obedkova | Priti Oli | Nivranshu Pasricha | Jan Pfister | Richard Plant | Vinay Prabhu | Vasile Pais | Libo Qin | Shahab Raji | Pawan Kumar Rajpoot | Vikas Raunak | Roy Rinberg | Nicholas Roberts | Juan Diego Rodriguez | Claude Roux | Vasconcellos Samus | Ananya Sai | Robin Schmidt | Thomas Scialom | Tshephisho Sefara | Saqib Shamsi | Xudong Shen | Yiwen Shi | Haoyue Shi | Anna Shvets | Nick Siegel | Damien Sileo | Jamie Simon | Chandan Singh | Roman Sitelew | Priyank Soni | Taylor Sorensen | William Soto | Aman Srivastava | Aditya Srivatsa | Tony Sun | Mukund Varma | A Tabassum | Fiona Tan | Ryan Teehan | Mo Tiwari | Marie Tolkiehn | Athena Wang | Zijian Wang | Zijie Wang | Gloria Wang | Fuxuan Wei | Bryan Wilie | Genta Indra Winata | Xinyu Wu | Witold Wydmanski | Tianbao Xie | Usama Yaseen | Michael Yee | Jing Zhang | Yue Zhang
Northern European Journal of Language Technology, Volume 9

Data augmentation is an important method for evaluating the robustness of and enhancing the diversity of training data for natural language processing (NLP) models. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically-valid style, syntax changes, as well as artificial constructs that are unambiguous to humans. We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find different models to be differently challenged on different tasks, with quasi-systematic score decreases. The infrastructure, datacards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP.
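To make the transformation/filter split concrete, here is a minimal, self-contained sketch in the spirit of the framework; the class names, method signatures and example operations below are illustrative assumptions and do not reproduce NL-Augmenter's actual API.

from typing import List


class AdjacentSwapTransformation:
    # Toy perturbation: swap two adjacent characters to simulate typing noise.
    def generate(self, sentence: str) -> List[str]:
        if len(sentence) < 2:
            return [sentence]
        chars = list(sentence)
        i = len(chars) // 2
        chars[i - 1], chars[i] = chars[i], chars[i - 1]
        return ["".join(chars)]


class LengthFilter:
    # Toy filter: keep only sentences within a word-count range.
    def __init__(self, min_words: int = 3, max_words: int = 40):
        self.min_words = min_words
        self.max_words = max_words

    def filter(self, sentence: str) -> bool:
        return self.min_words <= len(sentence.split()) <= self.max_words


if __name__ == "__main__":
    sentence = "Data augmentation probes model robustness."
    print(AdjacentSwapTransformation().generate(sentence))
    print(LengthFilter().filter(sentence))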

pdf bib
Generating Irish Text with a Flexible Plug-and-Play Architecture
Simon Mille | Elaine Uí Dhonnchadha | Lauren Cassidy | Brian Davis | Stamatia Dasiopoulou | Anya Belz
Proceedings of the 2nd Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning

In this paper, we describe M-FleNS, a multilingual flexible plug-and-play architecture designed to accommodate neural and symbolic modules, and initially instantiated with rule-based modules. We focus on using M-FleNS for the specific purpose of building new resources for Irish, a language currently under-represented in the NLP landscape. We present the general M-FleNS framework and how we use it to build an Irish Natural Language Generation system for verbalising part of the DBpedia ontology and building a multilayered dataset with rich linguistic annotations. Via automatic and human assessments of the output texts we show that with very limited resources we are able to create a system that reaches high levels of fluency and semantic accuracy, while having very low energy and memory requirements.

pdf bib
Mod-D2T: A Multi-layer Dataset for Modular Data-to-Text Generation
Simon Mille | Francois Lareau | Stamatia Dasiopoulou | Anya Belz
Proceedings of the 16th International Natural Language Generation Conference

Rule-based text generators lack the coverage and fluency of their neural counterparts, but have two big advantages over them: (i) they are entirely controllable and do not hallucinate; and (ii) they can fully explain how an output was generated from an input. In this paper we leverage these two advantages to create large and reliable synthetic datasets with multiple human-intelligible intermediate representations. We present the Modular Data-to-Text (Mod-D2T) Dataset which incorporates ten intermediate-level representations between input triple sets and output text; the mappings from one level to the next can broadly be interpreted as the traditional modular tasks of an NLG pipeline. We describe the Mod-D2T dataset, evaluate its quality via manual validation and discuss its applications and limitations. Data, code and documentation are available at https://github.com/mille-s/Mod-D2T.

pdf bib
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges
Simon Mille
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

pdf bib
A Pipeline for Extracting Abstract Dependency Templates for Data-to-Text Natural Language Generation
Simon Mille | Josep Ricci | Alexander Shvets | Anya Belz
Proceedings of the Seventh International Conference on Dependency Linguistics (Depling, GURT/SyntaxFest 2023)

We present work in progress that aims to address the coverage issue faced by rule-based text generators. We propose a pipeline for extracting abstract dependency templates (predicate-argument structures) from Wikipedia text to be used as input for generating text from structured data with the FORGe system. The pipeline comprises three main components: (i) candidate sentence retrieval, (ii) clause extraction, ranking and selection, and (iii) conversion to predicate-argument form. We present an approach and a preliminary evaluation for the ranking and selection module.

pdf bib
DCU/TCD-FORGe at WebNLG’23: Irish rules! (WebNLG 2023)
Simon Mille | Elaine Uí Dhonnchadha | Stamatia Dasiopoulou | Lauren Cassidy | Brian Davis | Anya Belz
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)

In this paper, we describe the submission of Dublin City University (DCU) and Trinity College Dublin (TCD) for the WebNLG 2023 shared task. We present a fully rule-based pipeline for generating Irish texts from DBpedia triple sets which comprises four components: triple lexicalisation, generation of non-inflected Irish text, inflection generation, and post-processing.

2022

pdf bib
Quantified Reproducibility Assessment of NLP Results
Anya Belz | Maja Popovic | Simon Mille
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology. QRA produces a single score estimating the degree of reproducibility of a given system and evaluation measure, on the basis of the scores from, and differences between, different reproductions. We test QRA on 18 different system and evaluation measure combinations (involving diverse NLP tasks and types of evaluation), for each of which we have the original results and one to seven reproduction results. The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but also of different, original studies. We find that the proposed method facilitates insights into causes of variation between reproductions, and as a result, allows conclusions to be drawn about what aspects of system and/or evaluation design need to be changed in order to improve reproducibility.
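As a rough illustration of the kind of score QRA produces, the sketch below computes a coefficient of variation over an original score and its reproductions; the paper's exact formulation (including any small-sample correction) may differ, so treat this function as an assumption-laden stand-in rather than the official QRA measure.

from statistics import mean, stdev

def coefficient_of_variation(scores, small_sample_correction=True):
    # Sample standard deviation relative to the mean, expressed as a percentage.
    cv = 100.0 * stdev(scores) / mean(scores)
    if small_sample_correction:
        # A common unbiased-CV adjustment for few reproductions (assumption).
        cv *= 1.0 + 1.0 / (4.0 * len(scores))
    return cv

# Example: an original BLEU score and two reproduction scores.
print(round(coefficient_of_variation([27.4, 26.9, 28.1]), 2))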

pdf bib
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann | Abhik Bhattacharjee | Abinaya Mahendiran | Alex Wang | Alexandros Papangelis | Aman Madaan | Angelina Mcmillan-major | Anna Shvets | Ashish Upadhyay | Bernd Bohnet | Bingsheng Yao | Bryan Wilie | Chandra Bhagavatula | Chaobin You | Craig Thomson | Cristina Garbacea | Dakuo Wang | Daniel Deutsch | Deyi Xiong | Di Jin | Dimitra Gkatzia | Dragomir Radev | Elizabeth Clark | Esin Durmus | Faisal Ladhak | Filip Ginter | Genta Indra Winata | Hendrik Strobelt | Hiroaki Hayashi | Jekaterina Novikova | Jenna Kanerva | Jenny Chim | Jiawei Zhou | Jordan Clive | Joshua Maynez | João Sedoc | Juraj Juraska | Kaustubh Dhole | Khyathi Raghavi Chandu | Laura Perez Beltrachini | Leonardo F . R. Ribeiro | Lewis Tunstall | Li Zhang | Mahim Pushkarna | Mathias Creutz | Michael White | Mihir Sanjay Kale | Moussa Kamal Eddine | Nico Daheim | Nishant Subramani | Ondrej Dusek | Paul Pu Liang | Pawan Sasanka Ammanamanchi | Qi Zhu | Ratish Puduppully | Reno Kriz | Rifat Shahriyar | Ronald Cardenas | Saad Mahamood | Salomey Osei | Samuel Cahyawijaya | Sanja Štajner | Sebastien Montella | Shailza Jolly | Simon Mille | Tahmid Hasan | Tianhao Shen | Tosin Adewumi | Vikas Raunak | Vipul Raheja | Vitaly Nikolaev | Vivian Tsai | Yacine Jernite | Ying Xu | Yisi Sang | Yixin Liu | Yufang Hou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Evaluations in machine learning rarely use the latest metrics, datasets, or human evaluation in favor of remaining compatible with prior work. This compatibility, often facilitated through leaderboards, thus leads to outdated but standardized evaluation practices. We posit that the standardization is taking place in the wrong spot. Evaluation infrastructure should enable researchers to use the latest methods; what should be standardized instead is how these new evaluation advances are incorporated. We introduce GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark, which uses a modular infrastructure for dataset, model, and metric developers to benefit from each other’s work. GEMv2 supports 40 documented datasets in 51 languages, ongoing online evaluation for all datasets, and our interactive tools make it easier to add new datasets to the living benchmark.
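A minimal sketch of the "single line of code" idea, assuming the GEM datasets remain available on the Hugging Face Hub under the GEM/ prefix; the dataset name and field names below are examples rather than verified identifiers.

from datasets import load_dataset

# Assumed dataset ID and field names; adjust to whatever the GEM hub actually exposes.
web_nlg = load_dataset("GEM/web_nlg_en", split="validation")
example = web_nlg[0]
print(example["input"], "->", example["target"])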

2021

pdf bib
Proceedings of the Sixth International Conference on Dependency Linguistics (Depling, SyntaxFest 2021)
Nicolas Mazziotta | Simon Mille
Proceedings of the Sixth International Conference on Dependency Linguistics (Depling, SyntaxFest 2021)

pdf bib
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann | Tosin Adewumi | Karmanya Aggarwal | Pawan Sasanka Ammanamanchi | Anuoluwapo Aremu | Antoine Bosselut | Khyathi Raghavi Chandu | Miruna-Adriana Clinciu | Dipanjan Das | Kaustubh Dhole | Wanyu Du | Esin Durmus | Ondřej Dušek | Chris Chinenye Emezue | Varun Gangal | Cristina Garbacea | Tatsunori Hashimoto | Yufang Hou | Yacine Jernite | Harsh Jhamtani | Yangfeng Ji | Shailza Jolly | Mihir Kale | Dhruv Kumar | Faisal Ladhak | Aman Madaan | Mounica Maddela | Khyati Mahajan | Saad Mahamood | Bodhisattwa Prasad Majumder | Pedro Henrique Martins | Angelina McMillan-Major | Simon Mille | Emiel van Miltenburg | Moin Nadeem | Shashi Narayan | Vitaly Nikolaev | Andre Niyongabo Rubungo | Salomey Osei | Ankur Parikh | Laura Perez-Beltrachini | Niranjan Ramesh Rao | Vikas Raunak | Juan Diego Rodriguez | Sashank Santhanam | João Sedoc | Thibault Sellam | Samira Shaikh | Anastasia Shimorina | Marco Antonio Sobrevilla Cabezudo | Hendrik Strobelt | Nishant Subramani | Wei Xu | Diyi Yang | Akhila Yerukola | Jiawei Zhou
Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for the 2021 shared task at the associated GEM Workshop.

pdf bib
Text-in-Context: Token-Level Error Detection for Table-to-Text Generation
Zdeněk Kasner | Simon Mille | Ondřej Dušek
Proceedings of the 14th International Conference on Natural Language Generation

We present our Charles-UPF submission for the Shared Task on Evaluating Accuracy in Generated Texts at INLG 2021. Our system can detect the errors automatically using a combination of a rule-based natural language generation (NLG) system and pretrained language models (LMs). We first utilize a rule-based NLG system to generate sentences with facts that can be derived from the input. For each sentence we evaluate, we select a subset of facts which are relevant by measuring semantic similarity to the sentence in question. Finally, we finetune a pretrained language model on annotated data along with the relevant facts for fine-grained error detection. On the test set, we achieve 69% recall and 75% precision with a model trained on a mixture of human-annotated and synthetic data.
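The fact-selection step described above can be pictured with the toy sketch below, in which a crude lexical-overlap score stands in for the semantic similarity measure the system actually uses; the facts and sentence are invented for illustration.

def jaccard(a: str, b: str) -> float:
    # Crude lexical overlap, used here only as a stand-in for semantic similarity.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

facts = [
    "The Lakers scored 110 points.",
    "The game was played in Boston.",
    "LeBron James had 25 assists.",
]
sentence = "LeBron James finished with 25 assists for the Lakers."

# Keep the two facts most similar to the sentence as context for error detection.
relevant = sorted(facts, key=lambda f: jaccard(f, sentence), reverse=True)[:2]
print(relevant)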

pdf bib
Another PASS: A Reproduction Study of the Human Evaluation of a Football Report Generation System
Simon Mille | Thiago Castro Ferreira | Anya Belz | Brian Davis
Proceedings of the 14th International Conference on Natural Language Generation

This paper reports results from a reproduction study in which we repeated the human evaluation of the PASS Dutch-language football report generation system (van der Lee et al., 2017). The work was carried out as part of the ReproGen Shared Task on Reproducibility of Human Evaluations in NLG, in Track A (Paper 1). We aimed to repeat the original study exactly, with the main difference that a different set of evaluators was used. We describe the study design, present the results from the original and the reproduction study, and then compare and analyse the differences between the two sets of results. For the two ‘headline’ results of average Fluency and Clarity, we find that in both studies, the system was rated more highly for Clarity than for Fluency, and Clarity had higher standard deviation. Clarity and Fluency ratings were higher, and their standard deviations lower, in the reproduction study than in the original study by substantial margins. Clarity had a higher degree of reproducibility than Fluency, as measured by the coefficient of variation. Data and code are publicly available.

pdf bib
Assessing the Syntactic Capabilities of Transformer-based Multilingual Language Models
Laura Pérez-Mayos | Alba Táboas García | Simon Mille | Leo Wanner
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)
Thiago Castro Ferreira | Claire Gardent | Nikolai Ilinykh | Chris van der Lee | Simon Mille | Diego Moussallem | Anastasia Shimorina
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

pdf bib
A Case Study of NLG from Multimedia Data Sources: Generating Architectural Landmark Descriptions
Simon Mille | Spyridon Symeonidis | Maria Rousi | Montserrat Marimon Felipe | Klearchos Stavrothanasopoulos | Petros Alvanitopoulos | Roberto Carlini Salguero | Jens Grivolla | Georgios Meditskos | Stefanos Vrochidis | Leo Wanner
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

In this paper, we present a pipeline system that generates architectural landmark descriptions using textual, visual and structured data. The pipeline comprises five main components: (i) a textual analysis component, which extracts information from Wikipedia pages; (ii) a visual analysis component, which extracts information from copyright-free images; (iii) a retrieval component, which gathers relevant (property, subject, object) triples from DBpedia; (iv) a fusion component, which stores the contents from the different modalities in a Knowledge Base (KB) and resolves the conflicts that stem from using different sources of information; (v) an NLG component, which verbalises the resulting contents of the KB. We show that thanks to the addition of other modalities, we can make the verbalisation of DBpedia triples more relevant and/or inspirational.
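For the retrieval component, one plausible way to gather (property, subject, object) triples is a SPARQL query against the public DBpedia endpoint, as in the sketch below; the endpoint URL and example resource are standard DBpedia conventions and are not taken from the paper.

from SPARQLWrapper import SPARQLWrapper, JSON

# Query the public DBpedia endpoint for triples about an example landmark.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?p ?o
    WHERE { <http://dbpedia.org/resource/Eiffel_Tower> ?p ?o }
    LIMIT 20
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["p"]["value"], row["o"]["value"])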

pdf bib
A General Benchmarking Framework for Text Generation
Diego Moussallem | Paramjot Kaur | Thiago Ferreira | Chris van der Lee | Anastasia Shimorina | Felix Conrads | Michael Röder | René Speck | Claire Gardent | Simon Mille | Nikolai Ilinykh | Axel-Cyrille Ngonga Ngomo
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

The RDF-to-text task has recently gained substantial attention due to the continuous growth of RDF knowledge graphs in number and size. Recent studies have focused on systematically comparing RDF-to-text approaches on benchmarking datasets such as WebNLG. Although some evaluation tools have already been proposed for text generation, none of the existing solutions abides by the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles and involves RDF data for the knowledge extraction task. In this paper, we present BENG, a FAIR benchmarking platform for Natural Language Generation (NLG) and Knowledge Extraction systems with a focus on RDF data. BENG builds upon the successful benchmarking platform GERBIL, is open source, and is publicly available along with the data it contains.

pdf bib
The 2020 Bilingual, Bi-Directional WebNLG+ Shared Task: Overview and Evaluation Results (WebNLG+ 2020)
Thiago Castro Ferreira | Claire Gardent | Nikolai Ilinykh | Chris van der Lee | Simon Mille | Diego Moussallem | Anastasia Shimorina
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

WebNLG+ offers two challenges: (i) mapping sets of RDF triples to English or Russian text (generation) and (ii) converting English or Russian text to sets of RDF triples (semantic parsing). Compared to the eponymous WebNLG challenge, WebNLG+ provides an extended dataset that enables the training, evaluation, and comparison of microplanners and semantic parsers. In this paper, we present the results of the generation and semantic parsing tasks for both English and Russian and provide a brief description of the participating systems.

pdf bib
Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
David M. Howcroft | Anya Belz | Miruna-Adriana Clinciu | Dimitra Gkatzia | Sadid A. Hasan | Saad Mahamood | Simon Mille | Emiel van Miltenburg | Sashank Santhanam | Verena Rieser
Proceedings of the 13th International Conference on Natural Language Generation

Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.

pdf bib
Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing
Anya Belz | Simon Mille | David M. Howcroft
Proceedings of the 13th International Conference on Natural Language Generation

Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs. This has serious implications for reproducibility testing and meta-evaluation, in particular given that human evaluation is considered the gold standard against which the trustworthiness of automatic metrics is gauged. Using examples from NLG, we propose a classification system for evaluations based on disentangling (i) what is being evaluated (which aspect of quality), and (ii) how it is evaluated, in terms of specific (a) evaluation modes and (b) experimental designs. We show that this approach provides a basis for determining comparability, hence for comparison of evaluations across papers, meta-evaluation experiments, and reproducibility testing.

pdf bib
Proceedings of the Third Workshop on Multilingual Surface Realisation
Anya Belz | Bernd Bohnet | Thiago Castro Ferreira | Yvette Graham | Simon Mille | Leo Wanner
Proceedings of the Third Workshop on Multilingual Surface Realisation

pdf bib
The Third Multilingual Surface Realisation Shared Task (SR’20): Overview and Evaluation Results
Simon Mille | Anya Belz | Bernd Bohnet | Thiago Castro Ferreira | Yvette Graham | Leo Wanner
Proceedings of the Third Workshop on Multilingual Surface Realisation

This paper presents results from the Third Shared Task on Multilingual Surface Realisation (SR’20), which was organised as part of the COLING’20 Workshop on Multilingual Surface Realisation. As in SR’18 and SR’19, the shared task comprised two tracks: (1) a Shallow Track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (2) a Deep Track where additionally, functional words and morphological information were removed. Moreover, each track had two subtracks: (a) restricted-resource, where only the data provided or approved as part of a track could be used for training models, and (b) open-resource, where any data could be used. The Shallow Track was offered in eleven languages, the Deep Track in three. Systems were evaluated using both automatic metrics and direct assessment by human evaluators in terms of Readability and Meaning Similarity to reference outputs. We present the evaluation results, along with descriptions of the SR’20 tracks, data and evaluation methods, as well as brief summaries of the participating systems. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

2019

pdf bib
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)
Simon Mille | Anja Belz | Bernd Bohnet | Yvette Graham | Leo Wanner
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

pdf bib
The Second Multilingual Surface Realisation Shared Task (SR’19): Overview and Evaluation Results
Simon Mille | Anja Belz | Bernd Bohnet | Yvette Graham | Leo Wanner
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

We report results from the SR’19 Shared Task, the second edition of a multilingual surface realisation task organised as part of the EMNLP’19 Workshop on Multilingual Surface Realisation. As in SR’18, the shared task comprised two tracks with different levels of complexity: (a) a shallow track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a deep track where additionally, functional words and morphological information were removed. The shallow track was offered in eleven, and the deep track in three languages. Systems were evaluated (a) automatically, using a range of intrinsic metrics, and (b) by human judges in terms of readability and meaning similarity. This report presents the evaluation results, along with descriptions of the SR’19 tracks, data and evaluation methods. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

pdf bib
Back-Translation as Strategy to Tackle the Lack of Corpus in Natural Language Generation from Semantic Representations
Marco Antonio Sobrevilla Cabezudo | Simon Mille | Thiago Pardo
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

This paper presents an exploratory study that aims to evaluate the usefulness of back-translation in Natural Language Generation (NLG) from semantic representations for non-English languages. Specifically, Abstract Meaning Representation and Brazilian Portuguese (BP) are chosen as the semantic representation and language, respectively. Two methods (based on Statistical and Neural Machine Translation) are evaluated on two datasets (one automatically generated and one human-generated) to compare their performance in a realistic setting. In addition, several data cuts based on quality measures are performed to evaluate whether data quality matters for NLG. Results show that there are still many improvements to be made, but that this is a promising approach.

pdf bib
Teaching FORGe to Verbalize DBpedia Properties in Spanish
Simon Mille | Stamatia Dasiopoulou | Beatriz Fisas | Leo Wanner
Proceedings of the 12th International Conference on Natural Language Generation

Statistical generators increasingly dominate the research in NLG. However, grammar-based generators that are grounded in a solid linguistic framework remain very competitive, especially for generation from deep knowledge structures. Furthermore, if built modularly, they can be ported to other genres and languages with a limited amount of work, without the need to annotate a considerable amount of training data. One of these generators is FORGe, which is based on the Meaning-Text Model. In the recent WebNLG challenge (the first comprehensive task addressing the mapping of RDF triples to text), FORGe ranked first with respect to overall quality in human evaluation. We extend the coverage of FORGe’s open-source grammatical and lexical resources for English, so as to further improve the English texts, and port them to Spanish, to achieve a comparable quality. This confirms that, as already observed in the case of SimpleNLG, a robust universal grammar-driven framework and a systematic organization of the linguistic resources can be an adequate choice for NLG applications.

2018

pdf bib
Proceedings of the First Workshop on Multilingual Surface Realisation
Simon Mille | Anja Belz | Bernd Bohnet | Emily Pitler | Leo Wanner
Proceedings of the First Workshop on Multilingual Surface Realisation

pdf bib
The First Multilingual Surface Realisation Shared Task (SR’18): Overview and Evaluation Results
Simon Mille | Anja Belz | Bernd Bohnet | Yvette Graham | Emily Pitler | Leo Wanner
Proceedings of the First Workshop on Multilingual Surface Realisation

We report results from the SR’18 Shared Task, a new multilingual surface realisation task organised as part of the ACL’18 Workshop on Multilingual Surface Realisation. As in its English-only predecessor task SR’11, the shared task comprised two tracks with different levels of complexity: (a) a shallow track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a deep track where additionally, functional words and morphological information were removed. The shallow track was offered in ten, and the deep track in three languages. Systems were evaluated (a) automatically, using a range of intrinsic metrics, and (b) by human judges in terms of readability and meaning similarity. This report presents the evaluation results, along with descriptions of the SR’18 tracks, data and evaluation methods. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

pdf bib
Underspecified Universal Dependency Structures as Inputs for Multilingual Surface Realisation
Simon Mille | Anja Belz | Bernd Bohnet | Leo Wanner
Proceedings of the 11th International Conference on Natural Language Generation

In this paper, we present the datasets used in the Shallow and Deep Tracks of the First Multilingual Surface Realisation Shared Task (SR’18). For the Shallow Track, data in ten languages has been released: Arabic, Czech, Dutch, English, Finnish, French, Italian, Portuguese, Russian and Spanish. For the Deep Track, data in three languages is made available: English, French and Spanish. We describe in detail how the datasets were derived from the Universal Dependencies V2.0, and report on an evaluation of the Deep Track input quality. In addition, we examine the motivation for, and likely usefulness of, deriving NLG inputs from annotations in resources originally developed for Natural Language Understanding (NLU), and assess whether the resulting inputs supply enough information of the right kind for the final stage in the NLG process.
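The Shallow Track conversion can be approximated as in the sketch below: parse a CoNLL-U sentence, keep lemmas and the dependency tree, and discard word order. This is only an illustration under those assumptions, not the task's official data-preparation script.

import random
import conllu

CONLLU = """\
# text = The cat sleeps.
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""

sentence = conllu.parse(CONLLU)[0]
# Keep lemma, PoS and dependency information; drop surface forms and word order.
nodes = [(tok["id"], tok["lemma"], tok["upos"], tok["head"], tok["deprel"])
         for tok in sentence]
random.shuffle(nodes)
for node in nodes:
    print(node)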

pdf bib
Sentence Packaging in Text Generation from Semantic Graphs as a Community Detection Problem
Alexander Shvets | Simon Mille | Leo Wanner
Proceedings of the 11th International Conference on Natural Language Generation

An increasing amount of research tackles the challenge of text generation from abstract ontological or semantic structures, which are by their very nature potentially large connected graphs. These graphs must be “packaged” into sentence-wise subgraphs. We interpret the problem of sentence packaging as a community detection problem with post-optimization. Experiments on the texts of the VerbNet/FrameNet structure-annotated Penn Treebank, which have been converted into graphs by a coreference merge using Stanford CoreNLP, show a high F1-score of 0.738.
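The general recipe can be illustrated with networkx's off-the-shelf community detection, as in the toy sketch below; the graph, the chosen algorithm and the absence of post-optimization are simplifying assumptions, not the paper's actual pipeline.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# A tiny stand-in for a large semantic graph produced by a coreference merge.
g = nx.Graph()
g.add_edges_from([
    ("buy", "John"), ("buy", "car"), ("car", "red"),
    ("drive", "Mary"), ("drive", "truck"), ("truck", "old"),
    ("buy", "drive"),  # weak bridge between the two fact clusters
])

# Each detected community is a candidate sentence-sized subgraph ("package").
for community in greedy_modularity_communities(g):
    print(sorted(community))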

2017

pdf bib
FORGe at SemEval-2017 Task 9: Deep sentence generation based on a sequence of graph transducers
Simon Mille | Roberto Carlini | Alicia Burga | Leo Wanner
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We present the contribution of Universitat Pompeu Fabra’s NLP group to the SemEval Task 9.2 (AMR-to-English Generation). The proposed generation pipeline comprises: (i) a series of rule-based graph-transducers for the syntacticization of the input graphs and the resolution of morphological agreements, and (ii) an off-the-shelf statistical linearization component.

pdf bib
Shared Task Proposal: Multilingual Surface Realization Using Universal Dependency Trees
Simon Mille | Bernd Bohnet | Leo Wanner | Anja Belz
Proceedings of the 10th International Conference on Natural Language Generation

We propose a shared task on multilingual Surface Realization, i.e., on mapping unordered and uninflected universal dependency trees to correctly ordered and inflected sentences in a number of languages. A second deeper input will be available in which, in addition, functional words, fine-grained PoS and morphological information will be removed from the input trees. The first shared task on Surface Realization was carried out in 2011 with a similar setup, with a focus on English. We think that it is time for relaunching such a shared task effort in view of the arrival of Universal Dependencies annotated treebanks for a large number of languages on the one hand, and the increasing dominance of Deep Learning, which proved to be a game changer for NLP, on the other hand.

pdf bib
A demo of FORGe: the Pompeu Fabra Open Rule-based Generator
Simon Mille | Leo Wanner
Proceedings of the 10th International Conference on Natural Language Generation

This demo paper presents the multilingual deep sentence generator developed by the TALN group at Universitat Pompeu Fabra, implemented as a series of rule-based graph-transducers for the syntacticization of the input graphs, the resolution of morphological agreements, and the linearization of the trees.

2015

pdf bib
Data-driven sentence generation with non-isomorphic trees
Miguel Ballesteros | Bernd Bohnet | Simon Mille | Leo Wanner
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Visualizing Deep-Syntactic Parser Output
Juan Soler-Company | Miguel Ballesteros | Bernd Bohnet | Simon Mille | Leo Wanner
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

pdf bib
Towards a multi-layered dependency annotation of Finnish
Alicia Burga | Simon Mille | Anton Granvik | Leo Wanner
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

2014

pdf bib
Classifiers for data-driven deep sentence generation
Miguel Ballesteros | Simon Mille | Leo Wanner
Proceedings of the 8th International Natural Language Generation Conference (INLG)

pdf bib
Deep-Syntactic Parsing
Miguel Ballesteros | Bernd Bohnet | Simon Mille | Leo Wanner
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf bib
Exploring Morphosyntactic Annotation over a Spanish Corpus for Dependency Parsing
Miguel Ballesteros | Simon Mille | Alicia Burga
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

pdf bib
AnCora-UPF: A Multi-Level Annotation of Spanish
Simon Mille | Alicia Burga | Leo Wanner
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

2012

pdf bib
Towards a Surface Realization-Oriented Corpus Annotation
Leo Wanner | Simon Mille | Bernd Bohnet
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference

pdf bib
The Surface Realisation Task: Recent Developments and Future Plans
Anja Belz | Bernd Bohnet | Simon Mille | Leo Wanner | Michael White
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference

pdf bib
How Does the Granularity of an Annotation Scheme Influence Dependency Parsing Performance?
Simon Mille | Alicia Burga | Gabriela Ferraro | Leo Wanner
Proceedings of COLING 2012: Posters

pdf bib
Text Simplification Tools for Spanish
Stefan Bott | Horacio Saggion | Simon Mille
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we describe the development of a text simplification system for Spanish. Text simplification is the adaptation of a text to the special needs of certain groups of readers, such as language learners, people with cognitive difficulties and elderly people, among others. There is a clear need for simplified texts, but manual production and adaptation of existing texts is labour intensive and costly. Automatic simplification is a field which attracts growing attention in Natural Language Processing, but, to the best of our knowledge, there are no simplification tools for Spanish. We present a prototype for automatic simplification, which shows that the most important structural simplification operations can be successfully treated with an approach based on rules which can potentially be improved by statistical methods. For the development of this prototype we carried out a corpus study which aims at identifying the operations a text simplification system needs to carry out in order to produce an output similar to what human editors produce when they simplify texts.

2011

pdf bib
<StuMaBa>: From Deep Representation to Surface
Bernd Bohnet | Simon Mille | Benoît Favre | Leo Wanner
Proceedings of the 13th European Workshop on Natural Language Generation

2010

pdf bib
Broad Coverage Multilingual Deep Sentence Generation with a Stochastic Multi-Level Realizer
Bernd Bohnet | Leo Wanner | Simon Mille | Alicia Burga
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Syntactic Dependencies for Multilingual and Multilevel Corpus Annotation
Simon Mille | Leo Wanner
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The relevance of syntactic dependency annotated corpora is nowadays unquestioned. However, a broad debate on the optimal set of dependency relation tags has not yet taken place. As a result, largely varying tag sets of largely varying sizes are used in different annotation initiatives. We propose a hierarchical dependency structure annotation schema that is more detailed and more flexible than the known annotation schemata. The schema allows us to choose the level of the desired detail of annotation, which facilitates the use of the schema for corpus annotation for different languages and for different NLP applications. Thanks to the inclusion of semantico-syntactic tags into the schema, we can annotate a corpus not only with syntactic dependency structures, but also with valency patterns as they are usually found in separate treebanks such as PropBank and NomBank. Semantico-syntactic tags and the level of detail of the schema furthermore facilitate the derivation of deep-syntactic and semantic annotations, leading to truly multilevel annotated dependency corpora. Such multilevel annotations can be readily used for the task of ML-based acquisition of grammar resources that map between the different levels of linguistic representation ― something which forms part of, for instance, any natural language text generator.

2008

pdf bib
Multilingual summarization in practice: the case of patent claims
Simon Mille | Leo Wanner
Proceedings of the 12th Annual Conference of the European Association for Machine Translation

pdf bib
Making Text Resources Accessible to the Reader: the Case of Patent Claims
Simon Mille | Leo Wanner
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Hardly any other kind of text structure is as notoriously difficult to read as patents. This is first of all due to their abstract vocabulary and their very complex syntactic constructions. Especially the claims in a patent are a challenge: in accordance with international patent writing regulations, each claim must be rendered in a single sentence. As a result, sentences with more than 200 words are not uncommon. Therefore, paraphrasing of the claims in terms the user can understand is in high demand. We present a rule-based paraphrasing module that realizes paraphrasing of patent claims in English as a rewriting task. Prior to the rewriting proper, the module involves stages of simplification and of discourse and syntactic analysis. The rewriting makes use of a full-fledged text generator and consists of a number of genuine generation tasks such as aggregation, selection of referring expressions, choice of discourse markers and syntactic generation. As generator, we use the MATE workbench, which is based on the Meaning-Text Theory of linguistics.