Ehud Reiter


2024

pdf bib
Proceedings of the 2nd Workshop on Practical LLM-assisted Data-to-Text Generation
Simone Balloccu | Zdeněk Kasner | Ondřej Plátek | Patrícia Schmidtová | Kristýna Onderková | Mateusz Lango | Ondřej Dušek | Lucie Flek | Ehud Reiter | Dimitra Gkatzia | Simon Mille
Proceedings of the 2nd Workshop on Practical LLM-assisted Data-to-Text Generation

pdf bib
Ask the experts: sourcing a high-quality nutrition counseling dataset through Human-AI collaboration
Simone Balloccu | Ehud Reiter | Karen Jia-Hui Li | Rafael Sargsyan | Vivek Kumar | Diego Reforgiato | Daniele Riboni | Ondrej Dusek
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) are being employed by end-users for various tasks, including sensitive ones such as health counseling, disregarding potential safety concerns. It is thus necessary to understand how adequately LLMs perform in such domains. We conduct a case study on ChatGPT in nutrition counseling, a popular use-case where the model supports a user with their dietary struggles. We crowd-source real-world diet-related struggles, then work with nutrition experts to generate supportive text using ChatGPT. Finally, experts evaluate the safety and text quality of ChatGPT’s output. The result is the HAI-coaching dataset, containing ~2.4K crowdsourced dietary struggles and ~97K corresponding ChatGPT-generated and expert-annotated supportive texts. We analyse ChatGPT’s performance, discovering potentially harmful behaviours, especially for sensitive topics like mental health. Finally, we use HAI-coaching to test open LLMs on various downstream tasks, showing that even the latest models struggle to achieve good performance. HAI-coaching is available at https://github.com/uccollab/hai-coaching/

pdf bib
Linguistically Communicating Uncertainty in Patient-Facing Risk Prediction Models
Adarsa Sivaprasad | Ehud Reiter
Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024)

This paper addresses the unique challenges associated with uncertainty quantification in AI models when applied to patient-facing contexts within healthcare. Unlike traditional eXplainable Artificial Intelligence (XAI) methods tailored for model developers or domain experts, additional considerations of communicating in natural language, its presentation and evaluating understandability are necessary. We identify the challenges in communication model performance, confidence, reasoning and unknown knowns using natural language in the context of risk prediction. We propose a design aimed at addressing these challenges, focusing on the specific application of in-vitro fertilisation outcome prediction.

pdf bib
Improving Factual Accuracy of Neural Table-to-Text Output by Addressing Input Problems in ToTTo
Barkavi Sundararajan | Yaji Sripada | Ehud Reiter
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Neural Table-to-Text models tend to hallucinate, producing texts that contain factual errors. We investigate whether such errors in the output can be traced back to problems with the input. We manually annotated 1,837 texts generated by multiple models in the politics domain of the ToTTo dataset. We identify the input problems that are responsible for many output errors and show that fixing these inputs reduces factual errors by between 52% and 76% (depending on the model). In addition, we observe that models struggle in processing tabular inputs that are structured in a non-standard way, particularly when the input lacks distinct row and column values or when the column headers are not correctly mapped to corresponding values.

pdf bib
Common Flaws in Running Human Evaluation Experiments in NLP
Craig Thomson | Ehud Reiter | Anya Belz
Computational Linguistics, Volume 50, Issue 2 - June 2023

While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic process. In this squib, we describe the types of flaws we discovered, which include coding errors (e.g., loading the wrong system outputs to evaluate), failure to follow standard scientific practice (e.g., ad hoc exclusion of participants and responses), and mistakes in reported numerical results (e.g., reported numbers not matching experimental data). If these problems are widespread, it would have worrying implications for the rigor of NLP evaluation experiments as currently conducted. We discuss what researchers can do to reduce the occurrence of such flaws, including pre-registration, better code development practices, increased testing and piloting, and post-publication addressing of errors.

pdf bib
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Simone Balloccu | Anya Belz | Rudali Huidrom | Ehud Reiter | Joao Sedoc | Craig Thomson
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

2023

pdf bib
Are Experts Needed? On Human Evaluation of Counselling Reflection Generation
Zixiu Wu | Simone Balloccu | Ehud Reiter | Rim Helaoui | Diego Reforgiato Recupero | Daniele Riboni
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reflection is a crucial counselling skill where the therapist conveys to the client their interpretation of what the client said. Language models have recently been used to generate reflections automatically, but human evaluation is challenging, particularly due to the cost of hiring experts. Laypeople-based evaluation is less expensive and easier to scale, but its quality is unknown for reflections. Therefore, we explore whether laypeople can be an alternative to experts in evaluating a fundamental quality aspect: coherence and context-consistency. We do so by asking a group of laypeople and a group of experts to annotate both synthetic reflections and human reflections from actual therapists. We find that both laypeople and experts are reliable annotators and that they have moderate-to-strong inter-group correlation, which shows that laypeople can be trusted for such evaluations. We also discover that GPT-3 mostly produces coherent and consistent reflections, and we explore changes in evaluation results when the source of synthetic reflections changes to GPT-3 from the less powerful GPT-2.

pdf bib
Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Simon Mille
Findings of the Association for Computational Linguistics: ACL 2023

Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirementof all evaluations, but in particular where they are used for meta-evaluation, is that they should support the same conclusions if repeated. However, the reproducibility of human evaluations is virtually never queried, let alone formally tested, in NLP which means that their repeatability and the reproducibility of their results is currently an open question. This focused contribution reports our review of human evaluation experiments reported in NLP papers over the past five years which we assessed in terms oftheir ability to be rerun. Overall, we estimatethat just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitivebarriers to repetition, and (ii) sufficient information about experimental design is publicly available for rerunning them. Our estimate goesup to about 20% when author help is sought. We complement this investigation with a survey of results concerning the reproducibilityof human evaluations where those are repeatable in the first place. Here we find worryinglylow degrees of reproducibility, both in terms ofsimilarity of scores and of findings supportedby them. We summarise what insights can begleaned so far regarding how to make humanevaluations in NLP more repeatable and morereproducible.

pdf bib
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
Anya Belz | Maja Popović | Ehud Reiter | Craig Thomson | João Sedoc
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

pdf bib
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Gavin Abercrombie | Jose M. Alonso-Moral | Mohammad Arvan | Anouck Braggaar | Mark Cieliebak | Elizabeth Clark | Kees van Deemter | Tanvi Dinkar | Ondřej Dušek | Steffen Eger | Qixiang Fang | Mingqi Gao | Albert Gatt | Dimitra Gkatzia | Javier González-Corbelle | Dirk Hovy | Manuela Hürlimann | Takumi Ito | John D. Kelleher | Filip Klubicka | Emiel Krahmer | Huiyuan Lai | Chris van der Lee | Yiru Li | Saad Mahamood | Margot Mieskes | Emiel van Miltenburg | Pablo Mosteiro | Malvina Nissim | Natalie Parde | Ondřej Plátek | Verena Rieser | Jie Ruan | Joel Tetreault | Antonio Toral | Xiaojun Wan | Leo Wanner | Lewis Watson | Diyi Yang
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

pdf bib
Enhancing factualness and controllability of Data-to-Text Generation via data Views and constraints
Craig Thomson | Clement Rebuffel | Ehud Reiter | Laure Soulier | Somayajulu Sripada | Patrick Gallinari
Proceedings of the 16th International Natural Language Generation Conference

Neural data-to-text systems lack the control and factual accuracy required to generate useful and insightful summaries of multidimensional data. We propose a solution in the form of data views, where each view describes an entity and its attributes along specific dimensions. A sequence of views can then be used as a high-level schema for document planning, with the neural model handling the complexities of micro-planning and surface realization. We show that our view-based system retains factual accuracy while offering high-level control of output that can be tailored based on user preference or other norms within the domain.

2022

pdf bib
Proceedings of the First Workshop on Natural Language Generation in Healthcare
Emiel Krahmer | Kathy McCoy | Ehud Reiter
Proceedings of the First Workshop on Natural Language Generation in Healthcare

pdf bib
DrivingBeacon: Driving Behaviour Change Support System Considering Mobile Use and Geo-information
Jawwad Baig | Guanyi Chen | Chenghua Lin | Ehud Reiter
Proceedings of the First Workshop on Natural Language Generation in Healthcare

Natural Language Generation has been proved to be effective and efficient in constructing health behaviour change support systems. We are working on DrivingBeacon, a behaviour change support system that uses telematics data from mobile phone sensors to generate weekly data-to-text feedback reports to vehicle drivers. The system makes use of a wealth of information such as mobile phone use while driving, geo-information, speeding, rush hour driving to generate the feedback. We present results from a real-world evaluation where 8 drivers in UK used DrivingBeacon for 4 weeks. Results are promising but not conclusive.

pdf bib
Comparing informativeness of an NLG chatbot vs graphical app in diet-information domain
Simone Balloccu | Ehud Reiter
Proceedings of the 15th International Conference on Natural Language Generation

pdf bib
The 2022 ReproGen Shared Task on Reproducibility of Evaluations in NLG: Overview and Results
Anya Belz | Anastasia Shimorina | Maja Popović | Ehud Reiter
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

Against a background of growing interest in reproducibility in NLP and ML, and as part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP, we organised the second shared task on reproducibility of evaluations in NLG, ReproGen 2022. This paper describes the shared task, summarises results from the reproduction studies submitted, and provides further comparative analysis of the results. Out of six initial team registrations, we received submissions from five teams. Meta-analysis of the five reproduction studies revealed varying degrees of reproducibility, and allowed further tentative conclusions about what types of evaluation tend to have better reproducibility.

pdf bib
The Accuracy Evaluation Shared Task as a Retrospective Reproduction Study
Craig Thomson | Ehud Reiter
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

We investigate the data collected for the Accuracy Evaluation Shared Task as a retrospective reproduction study. The shared task was based upon errors found by human annotation of computer generated summaries of basketball games. Annotation was performed in three separate stages, with texts taken from the same three systems and checked for errors by the same three annotators. We show that the mean count of errors was consistent at the highest level for each experiment, with increased variance when looking at per-system and/or per-error- type breakdowns.

pdf bib
Error Analysis of ToTTo Table-to-Text Neural NLG Models
Barkavi Sundararajan | Somayajulu Sripada | Ehud Reiter
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

We report error analysis of outputs from seven Table-to-Text generation models fine-tuned on ToTTo, an open-domain English language dataset. A manual error annotation of a subset of outputs (a total of 5,278 sentences) belonging to the topic of Politics generated by these seven models has been carried out. Our error annotation focused on eight categories of errors. The error analysis shows that more than 45% of sentences from each of the seven models have been error-free. It uncovered some of the specific classes of errors such as WORD errors that are the dominant errors in all the seven models, NAME and NUMBER errors are more committed by two of the GeM benchmark models, whereas DATE-DIMENSION and OTHER category of errors are more common in our Table-to-Text models.

pdf bib
User-Driven Research of Medical Note Generation Software
Tom Knoll | Francesco Moramarco | Alex Papadopoulos Korfiatis | Rachel Young | Claudia Ruffini | Mark Perera | Christian Perstl | Ehud Reiter | Anya Belz | Aleksandar Savkov
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A growing body of work uses Natural Language Processing (NLP) methods to automatically generate medical notes from audio recordings of doctor-patient consultations. However, there are very few studies on how such systems could be used in clinical practice, how clinicians would adjust to using them, or how system design should be influenced by such considerations. In this paper, we present three rounds of user studies, carried out in the context of developing a medical note generation system. We present, analyse and discuss the participating clinicians’ impressions and views of how the system ought to be adapted to be of value to them. Next, we describe a three-week test run of the system in a live telehealth clinical practice. Major findings include (i) the emergence of five different note-taking behaviours; (ii) the importance of the system generating notes in real time during the consultation; and (iii) the identification of a number of clinical use cases that could prove challenging for automatic note generation systems.

pdf bib
Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation
Francesco Moramarco | Alex Papadopoulos Korfiatis | Mark Perera | Damir Juric | Jack Flann | Ehud Reiter | Anya Belz | Aleksandar Savkov
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient’s clinical safety. To address this we present an extensive human evaluation study of consultation notes where 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study with 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore. All our findings and annotations are open-sourced.

pdf bib
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)
Anya Belz | Maja Popović | Ehud Reiter | Anastasia Shimorina
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

pdf bib
Beyond calories: evaluating how tailored communication reduces emotional load in diet-coaching
Simone Balloccu | Ehud Reiter
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

Dieting is a behaviour change task that is difficult for many people to conduct successfully. This is due to many factors, including stress and cost. Mobile applications offer an alternative to traditional coaching. However, previous work on apps evaluation only focused on dietary outcomes, ignoring users’ emotional state despite its influence on eating habits. In this work, we introduce a novel evaluation of the effects that tailored communication can have on the emotional load of dieting. We implement this by augmenting a traditional diet-app with affective NLG, text-tailoring and persuasive communication techniques. We then run a short 2-weeks experiment and check dietary outcomes, user feedback of produced text and, most importantly, its impact on emotional state, through PANAS questionnaire. Results show that tailored communication significantly improved users’ emotional state, compared to an app-only control group.

pdf bib
Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation
Aleksandar Savkov | Francesco Moramarco | Alex Papadopoulos Korfiatis | Mark Perera | Anya Belz | Ehud Reiter
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Evaluating automatically generated text is generally hard due to the inherently subjective nature of many aspects of the output quality. This difficulty is compounded in automatic consultation note generation by differing opinions between medical experts both about which patient statements should be included in generated notes and about their respective importance in arriving at a diagnosis. Previous real-world evaluations of note-generation systems saw substantial disagreement between expert evaluators. In this paper we propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists, which are created in a preliminary step and then used as a common point of reference during quality assessment. We observed good levels of inter-annotator agreement in a first evaluation study using the protocol; further, using Consultation Checklists produced in the study as reference for automatic metrics such as ROUGE or BERTScore improves their correlation with human judgements compared to using the original human note.

2021

pdf bib
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)
Anya Belz | Shubham Agarwal | Yvette Graham | Ehud Reiter | Anastasia Shimorina
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

pdf bib
Towards Objectively Evaluating the Quality of Generated Medical Summaries
Francesco Moramarco | Damir Juric | Aleksandar Savkov | Ehud Reiter
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

We propose a method for evaluating the quality of generated text by asking evaluators to count facts, and computing precision, recall, f-score, and accuracy from the raw counts. We believe this approach leads to a more objective and easier to reproduce evaluation. We apply this to the task of medical report summarisation, where measuring objective quality and accuracy is of paramount importance.

pdf bib
A Preliminary Study on Evaluating Consultation Notes With Post-Editing
Francesco Moramarco | Alex Papadopoulos Korfiatis | Aleksandar Savkov | Ehud Reiter
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

Automatic summarisation has the potential to aid physicians in streamlining clerical tasks such as note taking. But it is notoriously difficult to evaluate these systems and demonstrate that they are safe to be used in a clinical setting. To circumvent this issue, we propose a semi-automatic approach whereby physicians post-edit generated notes before submitting them. We conduct a preliminary study on the time saving of automatically generated consultation notes with post-editing. Our evaluators are asked to listen to mock consultations and to post-edit three generated notes. We time this and find that it is faster than writing the note from scratch. We present insights and lessons learnt from this experiment.

pdf bib
A Systematic Review of Reproducibility Research in Natural Language Processing
Anya Belz | Shubham Agarwal | Anastasia Shimorina | Ehud Reiter
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results. The past few years have seen an impressive range of new initiatives, events and active research in the area. However, the field is far from reaching a consensus about how reproducibility should be defined, measured and addressed, with diversity of views currently increasing rather than converging. With this focused contribution, we aim to provide a wide-angle, and as near as possible complete, snapshot of current work on reproducibility in NLP,

pdf bib
Proceedings of the 14th International Conference on Natural Language Generation
Anya Belz | Angela Fan | Ehud Reiter | Yaji Sripada
Proceedings of the 14th International Conference on Natural Language Generation

pdf bib
Explaining Decision-Tree Predictions by Addressing Potential Conflicts between Predictions and Plausible Expectations
Sameen Maruf | Ingrid Zukerman | Ehud Reiter | Gholamreza Haffari
Proceedings of the 14th International Conference on Natural Language Generation

We offer an approach to explain Decision Tree (DT) predictions by addressing potential conflicts between aspects of these predictions and plausible expectations licensed by background information. We define four types of conflicts, operationalize their identification, and specify explanatory schemas that address them. Our human evaluation focused on the effect of explanations on users’ understanding of a DT’s reasoning and their willingness to act on its predictions. The results show that (1) explanations that address potential conflicts are considered at least as good as baseline explanations that just follow a DT path; and (2) the conflict-based explanations are deemed especially valuable when users’ expectations disagree with the DT’s predictions.

pdf bib
Generation Challenges: Results of the Accuracy Evaluation Shared Task
Craig Thomson | Ehud Reiter
Proceedings of the 14th International Conference on Natural Language Generation

The Shared Task on Evaluating Accuracy focused on techniques (both manual and automatic) for evaluating the factual accuracy of texts produced by neural NLG systems, in a sports-reporting domain. Four teams submitted evaluation techniques for this task, using very different approaches and techniques. The best-performing submissions did encouragingly well at this difficult task. However, all automatic submissions struggled to detect factual errors which are semantically or pragmatically complex (for example, based on incorrect computation or inference).

pdf bib
The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results
Anya Belz | Anastasia Shimorina | Shubham Agarwal | Ehud Reiter
Proceedings of the 14th International Conference on Natural Language Generation

The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices relating to evaluation. Much of the work on reproducibility has so far focused on metric scores, with reproducibility of human evaluation results receiving far less attention. As part of a research programme designed to develop theory and practice of reproducibility assessment in NLP, we organised the first shared task on reproducibility of human evaluations, ReproGen 2021. This paper describes the shared task in detail, summarises results from each of the reproduction studies submitted, and provides further comparative analysis of the results. Out of nine initial team registrations, we received submissions from four teams. Meta-analysis of the four reproduction studies revealed varying degrees of reproducibility, and allowed very tentative first conclusions about what types of evaluation tend to have better reproducibility.

2020

pdf bib
Explaining Bayesian Networks in Natural Language: State of the Art and Challenges
Conor Hennessy | Alberto Bugarín | Ehud Reiter
2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence

In order to increase trust in the usage of Bayesian Networks and to cement their role as a model which can aid in critical decision making, the challenge of explainability must be faced. Previous attempts at explaining Bayesian Networks have largely focused on graphical or visual aids. In this paper we aim to highlight the importance of a natural language approach to explanation and to discuss some of the previous and state of the art attempts of the textual explanation of Bayesian Networks. We outline several challenges that remain to be addressed in the generation and validation of natural language explanations of Bayesian Networks. This can serve as a reference for future work on natural language explanations of Bayesian Networks.

pdf bib
Arabic NLG Language Functions
Wael Abed | Ehud Reiter
Proceedings of the 13th International Conference on Natural Language Generation

The Arabic language has very limited supports from NLG researchers. In this paper, we explain the challenges of the core grammar, provide a lexical resource, and implement the first language functions for the Arabic language. We did a human evaluation to evaluate our functions in generating sentences from the NADA Corpus.

pdf bib
A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems
Craig Thomson | Ehud Reiter
Proceedings of the 13th International Conference on Natural Language Generation

Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.

pdf bib
Shared Task on Evaluating Accuracy
Ehud Reiter | Craig Thomson
Proceedings of the 13th International Conference on Natural Language Generation

We propose a shared task on methodologies and algorithms for evaluating the accuracy of generated texts, specifically summaries of basketball games produced from basketball box score and other game data. We welcome submissions based on protocols for human evaluation, automatic metrics, as well as combinations of human evaluations and metrics.

pdf bib
ReproGen: Proposal for a Shared Task on Reproducibility of Human Evaluations in NLG
Anya Belz | Shubham Agarwal | Anastasia Shimorina | Ehud Reiter
Proceedings of the 13th International Conference on Natural Language Generation

Across NLP, a growing body of work is looking at the issue of reproducibility. However, replicability of human evaluation experiments and reproducibility of their results is currently under-addressed, and this is of particular concern for NLG where human evaluations are the norm. This paper outlines our ideas for a shared task on reproducibility of human evaluations in NLG which aims (i) to shed light on the extent to which past NLG evaluations are replicable and reproducible, and (ii) to draw conclusions regarding how evaluations can be designed and reported to increase replicability and reproducibility. If the task is run over several years, we hope to be able to document an overall increase in levels of replicability and reproducibility over time.

pdf bib
SportSett:Basketball - A robust and maintainable data-set for Natural Language Generation
Craig Thomson | Ehud Reiter | Somayajulu Sripada
Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation

pdf bib
Iterative Neural Scoring of Validated Insight Candidates
Allmin Susaiyah | Aki Härmä | Ehud Reiter | Milan Petković
Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation

pdf bib
How are you? Introducing stress-based text tailoring
Simone Balloccu | Ehud Reiter | Alexandra Johnstone | Claire Fyfe
Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation

2019

pdf bib
Natural Language Generation Challenges for Explainable AI
Ehud Reiter
Proceedings of the 1st Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence (NL4XAI 2019)

2018

pdf bib
A Structured Review of the Validity of BLEU
Ehud Reiter
Computational Linguistics, Volume 44, Issue 3 - September 2018

The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique—in other words, whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outside of MT, for evaluation of individual texts, or for scientific hypothesis testing.

pdf bib
Comprehension Driven Document Planning in Natural Language Generation Systems
Craig Thomson | Ehud Reiter | Somayajulu Sripada
Proceedings of the 11th International Conference on Natural Language Generation

This paper proposes an approach to NLG system design which focuses on generating output text which can be more easily processed by the reader. Ways in which cognitive theory might be combined with existing NLG techniques are discussed and two simple experiments in content ordering are presented.

pdf bib
Generating Summaries of Sets of Consumer Products: Learning from Experiments
Kittipitch Kuptavanich | Ehud Reiter | Kees Van Deemter | Advaith Siddharthan
Proceedings of the 11th International Conference on Natural Language Generation

We explored the task of creating a textual summary describing a large set of objects characterised by a small number of features using an e-commerce dataset. When a set of consumer products is large and varied, it can be difficult for a consumer to understand how the products in the set differ; consequently, it can be challenging to choose the most suitable product from the set. To assist consumers, we generated high-level summaries of product sets. Two generation algorithms are presented, discussed, and evaluated with human users. Our evaluation results suggest a positive contribution to consumers’ understanding of the domain.

pdf bib
Meteorologists and Students: A resource for language grounding of geographical descriptors
Alejandro Ramos-Soto | Ehud Reiter | Kees van Deemter | Jose Alonso | Albert Gatt
Proceedings of the 11th International Conference on Natural Language Generation

We present a data resource which can be useful for research purposes on language grounding tasks in the context of geographical referring expression generation. The resource is composed of two data sets that encompass 25 different geographical descriptors and a set of associated graphical representations, drawn as polygons on a map by two groups of human subjects: teenage students and expert meteorologists.

2017

pdf bib
Proceedings of the 10th International Conference on Natural Language Generation
Jose M. Alonso | Alberto Bugarín | Ehud Reiter
Proceedings of the 10th International Conference on Natural Language Generation

pdf bib
A Commercial Perspective on Reference
Ehud Reiter
Proceedings of the 10th International Conference on Natural Language Generation

I briefly describe some of the commercial work which XXX is doing in referring expression algorithms, and highlight differences between what is commercially important (at least to XXX) and the NLG research literature. In particular, XXX is less interested in generic reference algorithms than in high-quality algorithms for specific types of references, such as components of machines, named entities, and dates.

pdf bib
Textually Summarising Incomplete Data
Stephanie Inglis | Ehud Reiter | Somayajulu Sripada
Proceedings of the 10th International Conference on Natural Language Generation

Many data-to-text NLG systems work with data sets which are incomplete, ie some of the data is missing. We have worked with data journalists to understand how they describe incomplete data, and are building NLG algorithms based on these insights. A pilot evaluation showed mixed results, and highlighted several areas where we need to improve our system.

2016

pdf bib
Absolute and Relative Properties in Geographic Referring Expressions
Rodrigo de Oliveira | Somayajulu Sripada | Ehud Reiter
Proceedings of the 9th International Natural Language Generation conference

2015

pdf bib
Designing an Algorithm for Generating Named Spatial References
Rodrigo de Oliveira | Yaji Sripada | Ehud Reiter
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

pdf bib
Creating Textual Driver Feedback from Telemetric Data
Daniel Braun | Ehud Reiter | Advaith Siddharthan
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

2014

pdf bib
Generating Annotated Graphs using the NLG Pipeline Architecture
Saad Mahamood | William Bradshaw | Ehud Reiter
Proceedings of the 8th International Natural Language Generation Conference (INLG)

2013

pdf bib
Generating Expressions that Refer to Visible Objects
Margaret Mitchell | Kees van Deemter | Ehud Reiter
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
MIME - NLG in Pre-Hospital Care
Anne Schneider | Alasdair Mort | Chris Mellish | Ehud Reiter | Phil Wilson | Pierre-Luc Vaudry
Proceedings of the 14th European Workshop on Natural Language Generation

pdf bib
MIME- NLG Support for Complex and Unstable Pre-hospital Emergencies
Anne Schneider | Alasdair Mort | Chris Mellish | Ehud Reiter | Phil Wilson | Pierre-Luc Vaudry
Proceedings of the 14th European Workshop on Natural Language Generation

2012

pdf bib
Working with Clinicians to Improve a Patient-Information NLG System
Saad Mahamood | Ehud Reiter
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference

2011

pdf bib
Task-Based Evaluation of NLG Systems: Control vs Real-World Context
Ehud Reiter
Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop

pdf bib
Generating Affective Natural Language for Parents of Neonatal Infants
Saad Mahamood | Ehud Reiter
Proceedings of the 13th European Workshop on Natural Language Generation

pdf bib
What is in a text and what does it do: Qualitative Evaluations of an NLG system – the BT-Nurse – using content analysis and discourse analysis
Rahul Sambaraju | Ehud Reiter | Robert Logie | Andy Mckinlay | Chris McVittie | Albert Gatt | Cindy Sykes
Proceedings of the 13th European Workshop on Natural Language Generation

pdf bib
Two Approaches for Generating Size Modifiers
Margaret Mitchell | Kees van Deemter | Ehud Reiter
Proceedings of the 13th European Workshop on Natural Language Generation

2010

pdf bib
Using NLG and Sensors to Support Personal Narrative for Children with Complex Communication Needs
Rolf Black | Joseph Reddington | Ehud Reiter | Nava Tintarev | Annalu Waller
Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies

pdf bib
Automatic generation of conversational utterances and narrative for Augmentative and Alternative Communication: a prototype system
Martin Dempster | Norman Alm | Ehud Reiter
Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies

pdf bib
Natural Reference to Objects in a Visual Domain
Margaret Mitchell | Kees van Deemter | Ehud Reiter
Proceedings of the 6th International Natural Language Generation Conference

2009

pdf bib
Le projet BabyTalk : génération de texte à partir de données hétérogènes pour la prise de décision en unité néonatale
François Portet | Albert Gatt | Jim Hunter | Ehud Reiter | Somayajulu Sripada
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Notre société génère une masse d’information toujours croissante, que ce soit en médecine, en météorologie, etc. La méthode la plus employée pour analyser ces données est de les résumer sous forme graphique. Cependant, il a été démontré qu’un résumé textuel est aussi un mode de présentation efficace. L’objectif du prototype BT-45, développé dans le cadre du projet Babytalk, est de générer des résumés de 45 minutes de signaux physiologiques continus et d’événements temporels discrets en unité néonatale de soins intensifs (NICU). L’article présente l’aspect génération de texte de ce prototype. Une expérimentation clinique a montré que les résumés humains améliorent la prise de décision par rapport à l’approche graphique, tandis que les textes de BT-45 donnent des résultats similaires à l’approche graphique. Une analyse a identifié certaines des limitations de BT-45 mais en dépit de cellesci, notre travail montre qu’il est possible de produire automatiquement des résumés textuels efficaces de données complexes.

pdf bib
An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems
Ehud Reiter | Anja Belz
Computational Linguistics, Volume 35, Number 4, December 2009

pdf bib
Using NLG to Help Language-Impaired Users Tell Stories and Participate in Social Dialogues
Ehud Reiter | Ross Turner | Norman Alm | Rolf Black | Martin Dempster | Annalu Waller
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf bib
Generating Approximate Geographic Descriptions
Ross Turner | Yaji Sripada | Ehud Reiter
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf bib
SimpleNLG: A Realisation Engine for Practical Applications
Albert Gatt | Ehud Reiter
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

2008

pdf bib
Using Spatial Reference Frames to Generate Grounded Textual Summaries of Georeferenced Data
Ross Turner | Somayajulu Sripada | Ehud Reiter | Ian Davy
Proceedings of the Fifth International Natural Language Generation Conference

pdf bib
The Importance of Narrative and Other Lessons from an Evaluation of an NLG System that Summarises Clinical Data
Ehud Reiter | Albert Gatt | François Portet | Marian van der Meulen
Proceedings of the Fifth International Natural Language Generation Conference

2007

pdf bib
Last Words: The Shrinking Horizons of Computational Linguistics
Ehud Reiter
Computational Linguistics, Volume 33, Number 2, June 2007

pdf bib
The attribute selection for generation of referring expressions challenge. [Introduction to Shared Task Evaluation Challenge.]
Anja Belz | Albert Gatt | Ehud Reiter | Jette Viethen
Proceedings of the Workshop on Using corpora for natural language generation

pdf bib
An Architecture for Data-to-Text Systems
Ehud Reiter
Proceedings of the Eleventh European Workshop on Natural Language Generation (ENLG 07)

pdf bib
A Comparison of Hedged and Non-hedged NLG Texts
Saad Mahamood | Ehud Reiter | Chris Mellish
Proceedings of the Eleventh European Workshop on Natural Language Generation (ENLG 07)

2006

pdf bib
Comparing Automatic and Human Evaluation of NLG Systems
Anja Belz | Ehud Reiter
11th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Generating Spatio-Temporal Descriptions in Pollen Forecasts
Ross Turner | Somayajulu Sripada | Ehud Reiter | Ian P Davy
Demonstrations

pdf bib
GENEVAL: A Proposal for Shared-task Evaluation in NLG
Ehud Reiter | Anja Belz
Proceedings of the Fourth International Natural Language Generation Conference

2005

pdf bib
Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05)
Graham Wilcock | Kristiina Jokinen | Chris Mellish | Ehud Reiter
Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05)

pdf bib
Evaluation of an NLG System using Post-Edit Data: Lessons Learnt
Somayajulu Sripada | Ehud Reiter | Lezan Hawizy
Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05)

pdf bib
Generating Readable Texts for Readers with Low Basic Skills
Sandra Williams | Ehud Reiter
Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05)

2003

pdf bib
Summarizing Neonatal Time Series Data
Somayajulu G. Sripada | Ehud Reiter | Jim Hunter | Jin Yu
10th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Learning the Meaning and Usage of Time Phrases from a Parallel Text-Data Corpus
Ehud Reiter | Somayajulu Sripada
Proceedings of the HLT-NAACL 2003 Workshop on Learning Word Meaning from Non-Linguistic Data

pdf bib
Proceedings of the 9th European Workshop on Natural Language Generation (ENLG-2003) at EACL 2003
Ehud Reiter | Helmut Horacek | Kees van Deemter
Proceedings of the 9th European Workshop on Natural Language Generation (ENLG-2003) at EACL 2003

pdf bib
Acquiring and Using Limited User Models in NLG
Ehud Reiter | Somayajulu Sripada | Sandra Williams
Proceedings of the 9th European Workshop on Natural Language Generation (ENLG-2003) at EACL 2003

pdf bib
Experiments with discourse-level choices and readability
Sandra Williams | Ehud Reiter | Liesl Osman
Proceedings of the 9th European Workshop on Natural Language Generation (ENLG-2003) at EACL 2003

2002

pdf bib
Should Corpora Texts Be Gold Standards for NLG?
Ehud Reiter | Somayajulu Sripada
Proceedings of the International Natural Language Generation Conference

pdf bib
Squibs and Discussions: Human Variation and Lexical Choice
Ehud Reiter | Somayajulu Sripada
Computational Linguistics, Volume 28, Number 4, December 2002

2001

pdf bib
A Two-Staged Model For Content Determination
Somayajula G. Sripada | Ehud Reiter | Jim Hunter | Jin Yu
Proceedings of the ACL 2001 Eighth European Workshop on Natural Language Generation (EWNLG)

pdf bib
Using a Randomised Controlled Clinical Trial to Evaluate an NLG System
Ehud Reiter | Roma Robertson | A. Scott Lennox | Liesl Osman
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

2000

pdf bib
Knowledge Acquisition for Natural Language Generation
Ehud Reiter | Roma Robertson | Liesl Osman
INLG’2000 Proceedings of the First International Conference on Natural Language Generation

pdf bib
Pipelines and size constraints
Ehud Reiter
Computational Linguistics, Volume 26, Number 2, June 2000

1997

pdf bib
Customizable Descriptions of Object-Oriented Models
Benoit Lavoie | Owen Rambow | Ehud Reiter
Fifth Conference on Applied Natural Language Processing

pdf bib
Tailored Patient Information: Some Issues and Questions
Ehud Reiter
From Research to Commercial Applications: Making NLP Work in Practice

1996

pdf bib
The ModelExplainer
Benoit Lavoie | Owen Rambow | Ehud Reiter
Eighth International Natural Language Generation Workshop (Posters and Demonstrations)

1994

pdf bib
Has a Consensus NL Generation Architecture Appeared, and is it Psycholinguistically Plausible?
Ehud Reiter
Proceedings of the Seventh International Workshop on Natural Language Generation

1992

pdf bib
Automatic Generation of On-Line Documentation in the IDAS Project
Ehud Reiter | Chris Mellish | John Levine
Third Conference on Applied Natural Language Processing

pdf bib
A Fast Algorithm for the Generation of Referring Expressions
Ehud Reiter | Robert Dale
COLING 1992 Volume 1: The 14th International Conference on Computational Linguistics

pdf bib
Using Classification to Generate Text
Ehud Reiter | Chris Mellish
30th Annual Meeting of the Association for Computational Linguistics

1990

pdf bib
A New Model for Lexical Choice for Open-Class Words
Ehud Reiter
Proceedings of the Fifth International Workshop on Natural Language Generation

pdf bib
The Computational Complexity of Avoiding Conversational Implicatures
Ehud Reiter
28th Annual Meeting of the Association for Computational Linguistics

Search
Co-authors