Craig Thomson - ACL Anthology

Craig Thomson

2025

Standard Quality Criteria Derived from Current NLP Evaluations for Guiding Evaluation Design and Grounding Comparability and AI Compliance Assessments
Anya Belz | Simon Mille | Craig Thomson
Findings of the Association for Computational Linguistics: ACL 2025

Research shows that two evaluation experiments reporting results for the same quality criterion name (e.g. Fluency) do not necessarily evaluate the same aspect of quality. Not knowing when two evaluations are comparable in this sense means we currently lack the ability to draw conclusions based on multiple independently conducted evaluations. It is hard to see how this issue can be fully addressed other than by the creation of a standard set of quality criterion names and definitions that the evaluations in use in NLP can be grounded in. Taking a descriptivist approach, the QCET Quality Criteria for Evaluation Taxonomy derives a standard set of 114 quality criterion names and definitions from three surveys of a combined total of 933 evaluation experiments in NLP, and structures them into a reference taxonomy. We present QCET and its uses in (i) establishing comparability of existing evaluations, (ii) guiding the design of new evaluations, and (iii) assessing regulation compliance.

Evolving Stances on Reproducibility: A Longitudinal Study of NLP and ML Researchers’ Views and Experience of Reproducibility
Craig Thomson | Ehud Reiter | João Sedoc | Anya Belz
Findings of the Association for Computational Linguistics: EMNLP 2025

Over the past 10 years in NLP/ML, as in other fields of science, there has been growing interest in, and work on, reproducibility and methods for improving it. Identical experiments producing different results can be due to variation between samples of evaluation items or evaluators, but it can also be due to poor experimental practice. Both can be mitigated by bringing multiple comparable studies together in systematic reviews that can draw conclusions beyond the level of the individual studies, but such systematic reviews barely exist in NLP/ML. The alternative is to focus on improving experimental practice and study-level reproducibility, and the first step in this direction is awareness of the importance of reproducibility and knowledge of how to improve it. Here we aim to assess (i) what NLP/ML practitioners’ current views and experience of reproducibility are, and (ii) to what extent they have changed over the past two years, a period of rapidly growing interest in reproducibility. We report, for the first time, results from two identical surveys, the first carried out in 2022 and the second in 2024, each time surveying 149 NLP and ML researchers. The results from the 2024 survey address (i) above. We then compare the results of the two surveys in order to address (ii) above. We find that views and experience overall are moving towards better practice and appreciation of reproducibility.

HEDS 3.0: The Human Evaluation Data Sheet Version 3.0
Anya Belz | Craig Thomson
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

This paper presents a new version of the Human Evaluation Datasheet (HEDS), numbered 3.0. This update is the result of our experience using HEDS in the context of numerous recent human evaluation experiments, including reproduction studies, and of feedback collected from other researchers. Our main overall goal was to improve clarity, and to enable users to complete the datasheet more consistently and comparably. The HEDS 3.0 package consists of the digital data sheet, documentation, and code for exporting completed data sheets as LaTeX files, all available from the HEDS 3.0 GitHub.

The 2025 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz | Craig Thomson | Javier González Corbelle | Malo Ruelle
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

This paper presents an overview of, and the results from, the 2025 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’25) which followed on from four previous shared tasks on reproducibility of evaluations, ReproNLP’24, ReproNLP’23, ReproGen’22 and ReproGen’21. This shared task series forms part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP and machine learning, against a backdrop of increasing recognition of the importance of the topic across the two fields. We describe the ReproNLP’25 shared task, summarise results from the reproduction studies submitted, and provide additional comparative analysis of their results, including for the first time additional, ‘sanity-check’ evaluations by LLMs.

Assessing Semantic Consistency in Data‐to‐Text Generation: A Meta-Evaluation of Textual, Semantic and Model-Based Metrics
Rudali Huidrom | Michela Lorandi | Simon Mille | Craig Thomson | Anya Belz
Proceedings of the 18th International Natural Language Generation Conference

Ensuring semantic consistency between semantic-triple inputs and generated text is crucial in data‐to‐text generation, but continues to pose challenges both during generation and in evaluation. In order to assess how accurately semantic consistency can currently be assessed, we meta-evaluate 29 different evaluation methods in terms of their ability to predict human semantic-consistency ratings. The evaluation methods include embeddings‐based, overlap‐based, and edit‐distance metrics, as well as learned regressors and a prompted ‘LLM‐as‐judge’ protocol. We meta-evaluate on two datasets: the WebNLG 2017 human evaluation dataset, and a newly created WebNLG-style dataset that none of the methods can have seen during training. We find that none of the traditional textual similarity metrics or the pre-Transformer model-based metrics are suitable for the task of semantic consistency assessment. LLM-based methods perform well on the whole, but best correlations with human judgments still lag behind those seen in other text generation tasks.

Combler les lacunes de Wikipédia : tirer parti de la génération de texte pour améliorer la couverture encyclopédique des groupes sous-représentés
Simon Mille | Massimiliano Pronesti | Craig Thomson | Michela Lorandi | Sophie Fitzpatrick | Rudali Huidrom | Mohammed Sabry | Amy O’Riordan | Anya Belz
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Wikipedia has systematic gaps in its coverage of under-resourced languages as well as underrepresented groups (for example, women). This paper presents a new tool to support efforts to fill these gaps by automatically generating the beginnings of articles in English, French and Irish, and by facilitating post-editing and uploading to Wikipedia. A rule-based generator and an LLM are used to generate two alternative articles from DBpedia or Wikidata knowledge graphs selected by the user, enabling the LLM-generated article, often more fluent but more error-prone, to be content-checked against the rule-generated article, which is more reliable but less fluent. The code for the tool is available at https://github.com/dcu-nlg/wiki-gen-demo and it is currently deployed at http://ec2-18-224-151-90.us-east-2.compute.amazonaws.com:3000/.

2024

Common Flaws in Running Human Evaluation Experiments in NLP
Craig Thomson | Ehud Reiter | Anya Belz
Computational Linguistics, Volume 50, Issue 2 - June 2024

While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic process. In this squib, we describe the types of flaws we discovered, which include coding errors (e.g., loading the wrong system outputs to evaluate), failure to follow standard scientific practice (e.g., ad hoc exclusion of participants and responses), and mistakes in reported numerical results (e.g., reported numbers not matching experimental data). If these problems are widespread, it would have worrying implications for the rigor of NLP evaluation experiments as currently conducted. We discuss what researchers can do to reduce the occurrence of such flaws, including pre-registration, better code development practices, increased testing and piloting, and post-publication addressing of errors.

Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Simone Balloccu | Anya Belz | Rudali Huidrom | Ehud Reiter | João Sedoc | Craig Thomson
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

The 2024 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz | Craig Thomson
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

This paper presents an overview of, and the results from, the 2024 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’24), following on from three previous shared tasks on reproducibility of evaluations in NLP, ReproNLP’23, ReproGen’22 and ReproGen’21. This shared task series forms part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP and machine learning, against a backdrop of increasing recognition of the importance of reproducibility across the two fields. We describe the ReproNLP’24 shared task, summarise results from the reproduction studies submitted, and provide additional comparative analysis of their results.

(Mostly) Automatic Experiment Execution for Human Evaluations of NLP Systems
Craig Thomson | Anya Belz
Proceedings of the 17th International Natural Language Generation Conference

Human evaluation is widely considered the most reliable form of evaluation in NLP, but recent research has shown it to be riddled with mistakes, often as a result of manual execution of tasks. This paper argues that such mistakes could be avoided if we were to automate, as much as is practical, the process of performing experiments for human evaluation of NLP systems. We provide a simple methodology that can improve both the transparency and reproducibility of experiments. We show how the sequence of component processes of a human evaluation can be defined in advance, facilitating full or partial automation, detailed preregistration of the process, and research transparency and repeatability.

QCET: An Interactive Taxonomy of Quality Criteria for Comparable and Repeatable Evaluation of NLP Systems
Anya Belz | Simon Mille | Craig Thomson | Rudali Huidrom
Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations

Four years on from two papers (Belz et al., 2020; Howcroft et al., 2020) that first called out the lack of standardisation and comparability in the quality criteria assessed in NLP system evaluations, researchers still use widely differing quality criteria names and definitions, meaning that it continues to be unclear when the same aspect of quality is being assessed in two evaluations. While normalised quality criteria were proposed at the time, the list was unwieldy and using it came with a steep learning curve. In this demo paper, our aim is to address these issues with an interactive taxonomy tool that enables quick perusal and selection of the quality criteria, and provides decision support and examples of use at each node.

Filling Gaps in Wikipedia: Leveraging Data-to-Text Generation to Improve Encyclopedic Coverage of Underrepresented Groups
Simon Mille | Massimiliano Pronesti | Craig Thomson | Michela Lorandi | Sophie Fitzpatrick | Rudali Huidrom | Mohammed Sabry | Amy O’Riordan | Anya Belz
Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations

Wikipedia is known to have systematic gaps in its coverage that correspond to under-resourced languages as well as underrepresented groups. This paper presents a new tool to support efforts to fill in these gaps by automatically generating draft articles and facilitating post-editing and uploading to Wikipedia. A rule-based generator and an input-constrained LLM are used to generate two alternative articles, enabling the often more fluent, but error-prone, LLM-generated article to be content-checked against the more reliable, but less fluent, rule-generated article.

Proceedings of the 17th International Natural Language Generation Conference: Tutorial Abstract
Anya Belz | João Sedoc | Craig Thomson | Simon Mille | Rudali Huidrom
Proceedings of the 17th International Natural Language Generation Conference: Tutorial Abstract

The INLG 2024 Tutorial on Human Evaluation of NLP System Quality: Background, Overall Aims, and Summaries of Taught Units
Anya Belz | João Sedoc | Craig Thomson | Simon Mille | Rudali Huidrom
Proceedings of the 17th International Natural Language Generation Conference: Tutorial Abstract

Following numerous calls in the literature for improved practices and standardisation in human evaluation in Natural Language Processing over the past ten years, we held a tutorial on the topic at the 2024 INLG Conference. The tutorial addressed the structure, development, design, implementation, execution and analysis of human evaluations of NLP system quality. Hands-on practical sessions were run, designed to facilitate assimilation of the material presented. Slides, lecture recordings, code and data have been made available on GitHub (https://github.com/Human-Evaluation-Tutorial/INLG-2024-Tutorial). In this paper, we provide summaries of the content of the eight units of the tutorial, alongside its research context and aims.

2023

Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Simon Mille
Findings of the Association for Computational Linguistics: ACL 2023

Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirement of all evaluations, but in particular where they are used for meta-evaluation, is that they should support the same conclusions if repeated. However, the reproducibility of human evaluations is virtually never queried, let alone formally tested, in NLP which means that their repeatability and the reproducibility of their results is currently an open question. This focused contribution reports our review of human evaluation experiments reported in NLP papers over the past five years which we assessed in terms of their ability to be rerun. Overall, we estimate that just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitive barriers to repetition, and (ii) sufficient information about experimental design is publicly available for rerunning them. Our estimate goes up to about 20% when author help is sought. We complement this investigation with a survey of results concerning the reproducibility of human evaluations where those are repeatable in the first place. Here we find worryingly low degrees of reproducibility, both in terms of similarity of scores and of findings supported by them. We summarise what insights can be gleaned so far regarding how to make human evaluations in NLP more repeatable and more reproducible.

Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
Anya Belz | Maja Popović | Ehud Reiter | Craig Thomson | João Sedoc
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

The 2023 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz | Craig Thomson
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

This paper presents an overview of, and the results from, the 2023 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’23), following on from two previous shared tasks on reproducibility of evaluations in NLG, ReproGen’21 and ReproGen’22. This shared task series forms part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP and machine learning, all against a background of an interest in reproducibility that continues to grow in the two fields. This paper describes the ReproNLP’23 shared task, summarises results from the reproduction studies submitted, and provides comparative analysis of the results.

Enhancing factualness and controllability of Data-to-Text Generation via data Views and constraints
Craig Thomson | Clement Rebuffel | Ehud Reiter | Laure Soulier | Somayajulu Sripada | Patrick Gallinari
Proceedings of the 16th International Natural Language Generation Conference

Neural data-to-text systems lack the control and factual accuracy required to generate useful and insightful summaries of multidimensional data. We propose a solution in the form of data views, where each view describes an entity and its attributes along specific dimensions. A sequence of views can then be used as a high-level schema for document planning, with the neural model handling the complexities of micro-planning and surface realization. We show that our view-based system retains factual accuracy while offering high-level control of output that can be tailored based on user preference or other norms within the domain.

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Gavin Abercrombie | Jose M. Alonso-Moral | Mohammad Arvan | Anouck Braggaar | Mark Cieliebak | Elizabeth Clark | Kees van Deemter | Tanvi Dinkar | Ondřej Dušek | Steffen Eger | Qixiang Fang | Mingqi Gao | Albert Gatt | Dimitra Gkatzia | Javier González-Corbelle | Dirk Hovy | Manuela Hürlimann | Takumi Ito | John D. Kelleher | Filip Klubicka | Emiel Krahmer | Huiyuan Lai | Chris van der Lee | Yiru Li | Saad Mahamood | Margot Mieskes | Emiel van Miltenburg | Pablo Mosteiro | Malvina Nissim | Natalie Parde | Ondřej Plátek | Verena Rieser | Jie Ruan | Joel Tetreault | Antonio Toral | Xiaojun Wan | Leo Wanner | Lewis Watson | Diyi Yang
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

The 2023 WebNLG Shared Task on Low Resource Languages. Overview and Evaluation Results (WebNLG 2023)
Liam Cripwell | Anya Belz | Claire Gardent | Albert Gatt | Claudia Borg | Marthese Borg | John Judge | Michela Lorandi | Anna Nikiforovskaya | William Soto-Martinez | Craig Thomson
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)

The WebNLG task consists of mapping a knowledge graph to a text verbalising the content of that graph. The 2017 WebNLG edition required participating systems to generate English text from a set of DBpedia triples, while the 2020 WebNLG+ challenge additionally included generation into Russian and semantic parsing of English and Russian texts. In contrast, WebNLG 2023 focuses on four under-resourced languages which are severely under-represented in research on text generation, namely Breton, Irish, Maltese and Welsh. In addition, WebNLG 2023 once again includes Russian. In this paper, we present the organisation of the shared task (data, timeline, evaluation), briefly describe the participating systems and summarise results for participating systems.

Barriers and enabling factors for error analysis in NLG research
Emiel van Miltenburg | Miruna Clinciu | Ondřej Dušek | Dimitra Gkatzia | Stephanie Inglis | Leo Leppänen | Saad Mahamood | Stephanie Schoch | Craig Thomson | Luou Wen
Northern European Journal of Language Technology, Volume 9

Earlier research has shown that few studies in Natural Language Generation (NLG) evaluate their system outputs using an error analysis, despite known limitations of automatic evaluation metrics and human ratings. This position paper takes the stance that error analyses should be encouraged, and discusses several ways to do so. This paper is based on our shared experience as authors as well as a survey we distributed as a means of public consultation. We provide an overview of existing barriers to carrying out error analyses, and propose changes to improve error reporting in the NLG literature.

2022

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann | Abhik Bhattacharjee | Abinaya Mahendiran | Alex Wang | Alexandros Papangelis | Aman Madaan | Angelina McMillan-Major | Anna Shvets | Ashish Upadhyay | Bernd Bohnet | Bingsheng Yao | Bryan Wilie | Chandra Bhagavatula | Chaobin You | Craig Thomson | Cristina Garbacea | Dakuo Wang | Daniel Deutsch | Deyi Xiong | Di Jin | Dimitra Gkatzia | Dragomir Radev | Elizabeth Clark | Esin Durmus | Faisal Ladhak | Filip Ginter | Genta Indra Winata | Hendrik Strobelt | Hiroaki Hayashi | Jekaterina Novikova | Jenna Kanerva | Jenny Chim | Jiawei Zhou | Jordan Clive | Joshua Maynez | João Sedoc | Juraj Juraska | Kaustubh Dhole | Khyathi Raghavi Chandu | Laura Perez-Beltrachini | Leonardo F. R. Ribeiro | Lewis Tunstall | Li Zhang | Mahim Pushkarna | Mathias Creutz | Michael White | Mihir Sanjay Kale | Moussa Kamal Eddine | Nico Daheim | Nishant Subramani | Ondřej Dušek | Paul Pu Liang | Pawan Sasanka Ammanamanchi | Qi Zhu | Ratish Puduppully | Reno Kriz | Rifat Shahriyar | Ronald Cardenas | Saad Mahamood | Salomey Osei | Samuel Cahyawijaya | Sanja Štajner | Sebastien Montella | Shailza Jolly | Simon Mille | Tahmid Hasan | Tianhao Shen | Tosin Adewumi | Vikas Raunak | Vipul Raheja | Vitaly Nikolaev | Vivian Tsai | Yacine Jernite | Ying Xu | Yisi Sang | Yixin Liu | Yufang Hou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Evaluations in machine learning rarely use the latest metrics, datasets, or human evaluation in favor of remaining compatible with prior work. The compatibility, often facilitated through leaderboards, thus leads to outdated but standardized evaluation practices. We pose that the standardization is taking place in the wrong spot. Evaluation infrastructure should enable researchers to use the latest methods and what should be standardized instead is how to incorporate these new evaluation advances. We introduce GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark which uses a modular infrastructure for dataset, model, and metric developers to benefit from each other’s work. GEMv2 supports 40 documented datasets in 51 languages, ongoing online evaluation for all datasets, and our interactive tools make it easier to add new datasets to the living benchmark.

The Accuracy Evaluation Shared Task as a Retrospective Reproduction Study
Craig Thomson | Ehud Reiter
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

We investigate the data collected for the Accuracy Evaluation Shared Task as a retrospective reproduction study. The shared task was based upon errors found by human annotation of computer generated summaries of basketball games. Annotation was performed in three separate stages, with texts taken from the same three systems and checked for errors by the same three annotators. We show that the mean count of errors was consistent at the highest level for each experiment, with increased variance when looking at per-system and/or per-error-type breakdowns.

2021

Underreporting of errors in NLG output, and what to do about it
Emiel van Miltenburg | Miruna Clinciu | Ondřej Dušek | Dimitra Gkatzia | Stephanie Inglis | Leo Leppänen | Saad Mahamood | Emma Manning | Stephanie Schoch | Craig Thomson | Luou Wen
Proceedings of the 14th International Conference on Natural Language Generation

We observe a severe under-reporting of the different kinds of errors that Natural Language Generation systems make. This is a problem, because mistakes are an important indicator of where systems should still be improved. If authors only report overall performance metrics, the research community is left in the dark about the specific weaknesses that are exhibited by ‘state-of-the-art’ research. Next to quantifying the extent of error under-reporting, this position paper provides recommendations for error identification, analysis and reporting.

Generation Challenges: Results of the Accuracy Evaluation Shared Task
Craig Thomson | Ehud Reiter
Proceedings of the 14th International Conference on Natural Language Generation

The Shared Task on Evaluating Accuracy focused on techniques (both manual and automatic) for evaluating the factual accuracy of texts produced by neural NLG systems, in a sports-reporting domain. Four teams submitted evaluation techniques for this task, using very different approaches and techniques. The best-performing submissions did encouragingly well at this difficult task. However, all automatic submissions struggled to detect factual errors which are semantically or pragmatically complex (for example, based on incorrect computation or inference).

2020

Studying the Impact of Filling Information Gaps on the Output Quality of Neural Data-to-Text
Craig Thomson | Zhijie Zhao | Somayajulu Sripada
Proceedings of the 13th International Conference on Natural Language Generation

It is unfair to expect neural data-to-text to produce high quality output when there are gaps between system input data and information contained in the training text. Thomson et al. (2020) identify and narrow information gaps in Rotowire, a popular data-to-text dataset. In this paper, we describe a study which finds that a state-of-the-art neural data-to-text system produces higher quality output, according to the information extraction (IE) based metrics, when additional input data is carefully selected from this newly available source. It remains to be shown, however, whether IE metrics used in this study correlate well with humans in judging text quality.

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems
Craig Thomson | Ehud Reiter
Proceedings of the 13th International Conference on Natural Language Generation

Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.

Shared Task on Evaluating Accuracy
Ehud Reiter | Craig Thomson
Proceedings of the 13th International Conference on Natural Language Generation

We propose a shared task on methodologies and algorithms for evaluating the accuracy of generated texts, specifically summaries of basketball games produced from basketball box score and other game data. We welcome submissions based on protocols for human evaluation, automatic metrics, as well as combinations of human evaluations and metrics.

SportSett:Basketball - A robust and maintainable data-set for Natural Language Generation
Craig Thomson | Ehud Reiter | Somayajulu Sripada
Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation

2018

Comprehension Driven Document Planning in Natural Language Generation Systems
Craig Thomson | Ehud Reiter | Somayajulu Sripada
Proceedings of the 11th International Conference on Natural Language Generation

This paper proposes an approach to NLG system design which focuses on generating output text which can be more easily processed by the reader. Ways in which cognitive theory might be combined with existing NLG techniques are discussed and two simple experiments in content ordering are presented.

Co-authors

Ondřej Dušek 4

Dimitra Gkatzia 4

Michela Lorandi 4

Saad Mahamood 4

Somayajulu Sripada 4

Emiel Van Miltenburg 3

Elizabeth Clark 2

Miruna Clinciu 2

Sophie Fitzpatrick 2

Javier González Corbelle 2

Stephanie Inglis 2

Leo Leppänen 2

Amy O’Riordan 2

Massimiliano Pronesti 2

Mohammed Sabry 2

Stephanie Schoch 2

Gavin Abercrombie 1

Tosin Adewumi 1

Jose M. Alonso-Moral 1

Pawan Sasanka Ammanamanchi 1

Mohammad Arvan 1

Simone Balloccu 1

Chandra Bhagavatula 1

Abhik Bhattacharjee 1

Marthese Borg 1

Anouck Braggaar 1

Samuel Cahyawijaya 1

Ronald Cardenas 1

Khyathi Raghavi Chandu 1

Mark Cieliebak 1

Mathias Creutz 1

Liam Cripwell 1

Daniel Deutsch 1

Kaustubh Dhole 1

Moussa Kamal Eddine 1

Patrick Gallinari 1

Cristina Garbacea 1

Claire Gardent 1

Sebastian Gehrmann 1

Hiroaki Hayashi 1

Manuela Huerlimann 1

Yacine Jernite 1

Shailza Jolly 1

Juraj Juraska 1

Mihir Sanjay Kale 1

Jenna Kanerva 1

John Kelleher 1

Filip Klubicka 1

Emiel Krahmer 1

Faisal Ladhak 1

Paul Pu Liang 1

Abinaya Mahendiran 1

Joshua Maynez 1

Angelina McMillan-Major 1

Margot Mieskes 1

Sebastien Montella 1

Pablo Mosteiro 1

Anna Nikiforovskaya 1

Vitaly Nikolaev 1

Malvina Nissim 1

Jekaterina Novikova 1

Alexandros Papangelis 1

Natalie Parde 1

Laura Perez-Beltrachini 1

Ondřej Plátek 1

Maja Popović 1

Ratish Puduppully 1

Mahim Pushkarna 1

Dragomir Radev 1

Clement Rebuffel 1

Leonardo F. R. Ribeiro 1

Verena Rieser 1

Rifat Shahriyar 1

William Soto-Martinez 1

Laure Soulier 1

Hendrik Strobelt 1

Nishant Subramani 1

Joel Tetreault 1

Antonio Toral 1

Lewis Tunstall 1

Ashish Upadhyay 1

Michael White 1

Genta Indra Winata 1

Bingsheng Yao 1

Kees van Deemter 1

Chris van der Lee 1

Sanja Štajner 1

Venues

JEP/TALN/RECITAL 1