Ehud Reiter

2024

pdf bib abs
Linguistically Communicating Uncertainty in Patient-Facing Risk Prediction Models
Adarsa Sivaprasad | Ehud Reiter
Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024)

This paper addresses the unique challenges associated with uncertainty quantification in AI models when applied to patient-facing contexts within healthcare. Unlike traditional eXplainable Artificial Intelligence (XAI) methods tailored for model developers or domain experts, additional considerations of communicating in natural language, its presentation and evaluating understandability are necessary. We identify the challenges in communication model performance, confidence, reasoning and unknown knowns using natural language in the context of risk prediction. We propose a design aimed at addressing these challenges, focusing on the specific application of in-vitro fertilisation outcome prediction.

pdf bib
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Simone Balloccu | Anya Belz | Rudali Huidrom | Ehud Reiter | Joao Sedoc | Craig Thomson
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

pdf bib abs
Common Flaws in Running Human Evaluation Experiments in NLP
Craig Thomson | Ehud Reiter | Anya Belz
Computational Linguistics, Volume 50, Issue 2 - June 2023

While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic process. In this squib, we describe the types of flaws we discovered, which include coding errors (e.g., loading the wrong system outputs to evaluate), failure to follow standard scientific practice (e.g., ad hoc exclusion of participants and responses), and mistakes in reported numerical results (e.g., reported numbers not matching experimental data). If these problems are widespread, it would have worrying implications for the rigor of NLP evaluation experiments as currently conducted. We discuss what researchers can do to reduce the occurrence of such flaws, including pre-registration, better code development practices, increased testing and piloting, and post-publication addressing of errors.

pdf bib abs
Improving Factual Accuracy of Neural Table-to-Text Output by Addressing Input Problems in ToTTo
Barkavi Sundararajan | Yaji Sripada | Ehud Reiter
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Neural Table-to-Text models tend to hallucinate, producing texts that contain factual errors. We investigate whether such errors in the output can be traced back to problems with the input. We manually annotated 1,837 texts generated by multiple models in the politics domain of the ToTTo dataset. We identify the input problems that are responsible for many output errors and show that fixing these inputs reduces factual errors by between 52% and 76% (depending on the model). In addition, we observe that models struggle in processing tabular inputs that are structured in a non-standard way, particularly when the input lacks distinct row and column values or when the column headers are not correctly mapped to corresponding values.

2023

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

pdf bib abs
Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Simon Mille
Findings of the Association for Computational Linguistics: ACL 2023

Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirementof all evaluations, but in particular where they are used for meta-evaluation, is that they should support the same conclusions if repeated. However, the reproducibility of human evaluations is virtually never queried, let alone formally tested, in NLP which means that their repeatability and the reproducibility of their results is currently an open question. This focused contribution reports our review of human evaluation experiments reported in NLP papers over the past five years which we assessed in terms oftheir ability to be rerun. Overall, we estimatethat just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitivebarriers to repetition, and (ii) sufficient information about experimental design is publicly available for rerunning them. Our estimate goesup to about 20% when author help is sought. We complement this investigation with a survey of results concerning the reproducibilityof human evaluations where those are repeatable in the first place. Here we find worryinglylow degrees of reproducibility, both in terms ofsimilarity of scores and of findings supportedby them. We summarise what insights can begleaned so far regarding how to make humanevaluations in NLP more repeatable and morereproducible.

pdf bib
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
Anya Belz | Maja Popović | Ehud Reiter | Craig Thomson | João Sedoc
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

pdf bib abs
Are Experts Needed? On Human Evaluation of Counselling Reflection Generation
Zixiu Wu | Simone Balloccu | Ehud Reiter | Rim Helaoui | Diego Reforgiato Recupero | Daniele Riboni
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reflection is a crucial counselling skill where the therapist conveys to the client their interpretation of what the client said. Language models have recently been used to generate reflections automatically, but human evaluation is challenging, particularly due to the cost of hiring experts. Laypeople-based evaluation is less expensive and easier to scale, but its quality is unknown for reflections. Therefore, we explore whether laypeople can be an alternative to experts in evaluating a fundamental quality aspect: coherence and context-consistency. We do so by asking a group of laypeople and a group of experts to annotate both synthetic reflections and human reflections from actual therapists. We find that both laypeople and experts are reliable annotators and that they have moderate-to-strong inter-group correlation, which shows that laypeople can be trusted for such evaluations. We also discover that GPT-3 mostly produces coherent and consistent reflections, and we explore changes in evaluation results when the source of synthetic reflections changes to GPT-3 from the less powerful GPT-2.

pdf bib abs
Enhancing factualness and controllability of Data-to-Text Generation via data Views and constraints
Craig Thomson | Clement Rebuffel | Ehud Reiter | Laure Soulier | Somayajulu Sripada | Patrick Gallinari
Proceedings of the 16th International Natural Language Generation Conference

Neural data-to-text systems lack the control and factual accuracy required to generate useful and insightful summaries of multidimensional data. We propose a solution in the form of data views, where each view describes an entity and its attributes along specific dimensions. A sequence of views can then be used as a high-level schema for document planning, with the neural model handling the complexities of micro-planning and surface realization. We show that our view-based system retains factual accuracy while offering high-level control of output that can be tailored based on user preference or other norms within the domain.

2022

pdf bib
Proceedings of the First Workshop on Natural Language Generation in Healthcare
Emiel Krahmer | Kathy McCoy | Ehud Reiter
Proceedings of the First Workshop on Natural Language Generation in Healthcare

pdf bib abs
DrivingBeacon: Driving Behaviour Change Support System Considering Mobile Use and Geo-information
Jawwad Baig | Guanyi Chen | Chenghua Lin | Ehud Reiter
Proceedings of the First Workshop on Natural Language Generation in Healthcare

Natural Language Generation has been proved to be effective and efficient in constructing health behaviour change support systems. We are working on DrivingBeacon, a behaviour change support system that uses telematics data from mobile phone sensors to generate weekly data-to-text feedback reports to vehicle drivers. The system makes use of a wealth of information such as mobile phone use while driving, geo-information, speeding, rush hour driving to generate the feedback. We present results from a real-world evaluation where 8 drivers in UK used DrivingBeacon for 4 weeks. Results are promising but not conclusive.

pdf bib
Comparing informativeness of an NLG chatbot vs graphical app in diet-information domain
Simone Balloccu | Ehud Reiter
Proceedings of the 15th International Conference on Natural Language Generation

pdf bib abs
The 2022 ReproGen Shared Task on Reproducibility of Evaluations in NLG: Overview and Results
Anya Belz | Anastasia Shimorina | Maja Popović | Ehud Reiter
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

Against a background of growing interest in reproducibility in NLP and ML, and as part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP, we organised the second shared task on reproducibility of evaluations in NLG, ReproGen 2022. This paper describes the shared task, summarises results from the reproduction studies submitted, and provides further comparative analysis of the results. Out of six initial team registrations, we received submissions from five teams. Meta-analysis of the five reproduction studies revealed varying degrees of reproducibility, and allowed further tentative conclusions about what types of evaluation tend to have better reproducibility.

pdf bib abs
The Accuracy Evaluation Shared Task as a Retrospective Reproduction Study
Craig Thomson | Ehud Reiter
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

We investigate the data collected for the Accuracy Evaluation Shared Task as a retrospective reproduction study. The shared task was based upon errors found by human annotation of computer generated summaries of basketball games. Annotation was performed in three separate stages, with texts taken from the same three systems and checked for errors by the same three annotators. We show that the mean count of errors was consistent at the highest level for each experiment, with increased variance when looking at per-system and/or per-error- type breakdowns.

pdf bib abs
Error Analysis of ToTTo Table-to-Text Neural NLG Models
Barkavi Sundararajan | Somayajulu Sripada | Ehud Reiter
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

We report error analysis of outputs from seven Table-to-Text generation models fine-tuned on ToTTo, an open-domain English language dataset. A manual error annotation of a subset of outputs (a total of 5,278 sentences) belonging to the topic of Politics generated by these seven models has been carried out. Our error annotation focused on eight categories of errors. The error analysis shows that more than 45% of sentences from each of the seven models have been error-free. It uncovered some of the specific classes of errors such as WORD errors that are the dominant errors in all the seven models, NAME and NUMBER errors are more committed by two of the GeM benchmark models, whereas DATE-DIMENSION and OTHER category of errors are more common in our Table-to-Text models.

pdf bib abs
Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation
Francesco Moramarco | Alex Papadopoulos Korfiatis | Mark Perera | Damir Juric | Jack Flann | Ehud Reiter | Anya Belz | Aleksandar Savkov
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient’s clinical safety. To address this we present an extensive human evaluation study of consultation notes where 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study with 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore. All our findings and annotations are open-sourced.

A growing body of work uses Natural Language Processing (NLP) methods to automatically generate medical notes from audio recordings of doctor-patient consultations. However, there are very few studies on how such systems could be used in clinical practice, how clinicians would adjust to using them, or how system design should be influenced by such considerations. In this paper, we present three rounds of user studies, carried out in the context of developing a medical note generation system. We present, analyse and discuss the participating clinicians’ impressions and views of how the system ought to be adapted to be of value to them. Next, we describe a three-week test run of the system in a live telehealth clinical practice. Major findings include (i) the emergence of five different note-taking behaviours; (ii) the importance of the system generating notes in real time during the consultation; and (iii) the identification of a number of clinical use cases that could prove challenging for automatic note generation systems.

pdf bib
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)
Anya Belz | Maja Popović | Ehud Reiter | Anastasia Shimorina
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

pdf bib abs
Beyond calories: evaluating how tailored communication reduces emotional load in diet-coaching
Simone Balloccu | Ehud Reiter
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

Dieting is a behaviour change task that is difficult for many people to conduct successfully. This is due to many factors, including stress and cost. Mobile applications offer an alternative to traditional coaching. However, previous work on apps evaluation only focused on dietary outcomes, ignoring users’ emotional state despite its influence on eating habits. In this work, we introduce a novel evaluation of the effects that tailored communication can have on the emotional load of dieting. We implement this by augmenting a traditional diet-app with affective NLG, text-tailoring and persuasive communication techniques. We then run a short 2-weeks experiment and check dietary outcomes, user feedback of produced text and, most importantly, its impact on emotional state, through PANAS questionnaire. Results show that tailored communication significantly improved users’ emotional state, compared to an app-only control group.

pdf bib abs
Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation
Aleksandar Savkov | Francesco Moramarco | Alex Papadopoulos Korfiatis | Mark Perera | Anya Belz | Ehud Reiter
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Evaluating automatically generated text is generally hard due to the inherently subjective nature of many aspects of the output quality. This difficulty is compounded in automatic consultation note generation by differing opinions between medical experts both about which patient statements should be included in generated notes and about their respective importance in arriving at a diagnosis. Previous real-world evaluations of note-generation systems saw substantial disagreement between expert evaluators. In this paper we propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists, which are created in a preliminary step and then used as a common point of reference during quality assessment. We observed good levels of inter-annotator agreement in a first evaluation study using the protocol; further, using Consultation Checklists produced in the study as reference for automatic metrics such as ROUGE or BERTScore improves their correlation with human judgements compared to using the original human note.

2021

pdf bib
Proceedings of the 14th International Conference on Natural Language Generation
Anya Belz | Angela Fan | Ehud Reiter | Yaji Sripada
Proceedings of the 14th International Conference on Natural Language Generation

pdf bib abs
Explaining Decision-Tree Predictions by Addressing Potential Conflicts between Predictions and Plausible Expectations
Sameen Maruf | Ingrid Zukerman | Ehud Reiter | Gholamreza Haffari
Proceedings of the 14th International Conference on Natural Language Generation

We offer an approach to explain Decision Tree (DT) predictions by addressing potential conflicts between aspects of these predictions and plausible expectations licensed by background information. We define four types of conflicts, operationalize their identification, and specify explanatory schemas that address them. Our human evaluation focused on the effect of explanations on users’ understanding of a DT’s reasoning and their willingness to act on its predictions. The results show that (1) explanations that address potential conflicts are considered at least as good as baseline explanations that just follow a DT path; and (2) the conflict-based explanations are deemed especially valuable when users’ expectations disagree with the DT’s predictions.

pdf bib abs
Generation Challenges: Results of the Accuracy Evaluation Shared Task
Craig Thomson | Ehud Reiter
Proceedings of the 14th International Conference on Natural Language Generation

The Shared Task on Evaluating Accuracy focused on techniques (both manual and automatic) for evaluating the factual accuracy of texts produced by neural NLG systems, in a sports-reporting domain. Four teams submitted evaluation techniques for this task, using very different approaches and techniques. The best-performing submissions did encouragingly well at this difficult task. However, all automatic submissions struggled to detect factual errors which are semantically or pragmatically complex (for example, based on incorrect computation or inference).

pdf bib abs
The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results
Anya Belz | Anastasia Shimorina | Shubham Agarwal | Ehud Reiter
Proceedings of the 14th International Conference on Natural Language Generation

The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices relating to evaluation. Much of the work on reproducibility has so far focused on metric scores, with reproducibility of human evaluation results receiving far less attention. As part of a research programme designed to develop theory and practice of reproducibility assessment in NLP, we organised the first shared task on reproducibility of human evaluations, ReproGen 2021. This paper describes the shared task in detail, summarises results from each of the reproduction studies submitted, and provides further comparative analysis of the results. Out of nine initial team registrations, we received submissions from four teams. Meta-analysis of the four reproduction studies revealed varying degrees of reproducibility, and allowed very tentative first conclusions about what types of evaluation tend to have better reproducibility.

pdf bib abs
A Systematic Review of Reproducibility Research in Natural Language Processing
Anya Belz | Shubham Agarwal | Anastasia Shimorina | Ehud Reiter
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results. The past few years have seen an impressive range of new initiatives, events and active research in the area. However, the field is far from reaching a consensus about how reproducibility should be defined, measured and addressed, with diversity of views currently increasing rather than converging. With this focused contribution, we aim to provide a wide-angle, and as near as possible complete, snapshot of current work on reproducibility in NLP,

pdf bib
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)
Anya Belz | Shubham Agarwal | Yvette Graham | Ehud Reiter | Anastasia Shimorina
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

pdf bib abs
Towards Objectively Evaluating the Quality of Generated Medical Summaries
Francesco Moramarco | Damir Juric | Aleksandar Savkov | Ehud Reiter
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

We propose a method for evaluating the quality of generated text by asking evaluators to count facts, and computing precision, recall, f-score, and accuracy from the raw counts. We believe this approach leads to a more objective and easier to reproduce evaluation. We apply this to the task of medical report summarisation, where measuring objective quality and accuracy is of paramount importance.

pdf bib abs
A Preliminary Study on Evaluating Consultation Notes With Post-Editing
Francesco Moramarco | Alex Papadopoulos Korfiatis | Aleksandar Savkov | Ehud Reiter
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

Automatic summarisation has the potential to aid physicians in streamlining clerical tasks such as note taking. But it is notoriously difficult to evaluate these systems and demonstrate that they are safe to be used in a clinical setting. To circumvent this issue, we propose a semi-automatic approach whereby physicians post-edit generated notes before submitting them. We conduct a preliminary study on the time saving of automatically generated consultation notes with post-editing. Our evaluators are asked to listen to mock consultations and to post-edit three generated notes. We time this and find that it is faster than writing the note from scratch. We present insights and lessons learnt from this experiment.

2020

pdf bib abs
Arabic NLG Language Functions
Wael Abed | Ehud Reiter
Proceedings of the 13th International Conference on Natural Language Generation

The Arabic language has very limited supports from NLG researchers. In this paper, we explain the challenges of the core grammar, provide a lexical resource, and implement the first language functions for the Arabic language. We did a human evaluation to evaluate our functions in generating sentences from the NADA Corpus.

pdf bib abs
A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems
Craig Thomson | Ehud Reiter
Proceedings of the 13th International Conference on Natural Language Generation

Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.

pdf bib abs
Shared Task on Evaluating Accuracy
Ehud Reiter | Craig Thomson
Proceedings of the 13th International Conference on Natural Language Generation

We propose a shared task on methodologies and algorithms for evaluating the accuracy of generated texts, specifically summaries of basketball games produced from basketball box score and other game data. We welcome submissions based on protocols for human evaluation, automatic metrics, as well as combinations of human evaluations and metrics.

pdf bib abs
ReproGen: Proposal for a Shared Task on Reproducibility of Human Evaluations in NLG
Anya Belz | Shubham Agarwal | Anastasia Shimorina | Ehud Reiter
Proceedings of the 13th International Conference on Natural Language Generation

Across NLP, a growing body of work is looking at the issue of reproducibility. However, replicability of human evaluation experiments and reproducibility of their results is currently under-addressed, and this is of particular concern for NLG where human evaluations are the norm. This paper outlines our ideas for a shared task on reproducibility of human evaluations in NLG which aims (i) to shed light on the extent to which past NLG evaluations are replicable and reproducible, and (ii) to draw conclusions regarding how evaluations can be designed and reported to increase replicability and reproducibility. If the task is run over several years, we hope to be able to document an overall increase in levels of replicability and reproducibility over time.

pdf bib
SportSett:Basketball - A robust and maintainable data-set for Natural Language Generation
Craig Thomson | Ehud Reiter | Somayajulu Sripada
Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation

pdf bib
Iterative Neural Scoring of Validated Insight Candidates
Allmin Susaiyah | Aki Härmä | Ehud Reiter | Milan Petković
Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation

pdf bib
How are you? Introducing stress-based text tailoring
Simone Balloccu | Ehud Reiter | Alexandra Johnstone | Claire Fyfe
Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation

pdf bib abs
Explaining Bayesian Networks in Natural Language: State of the Art and Challenges
Conor Hennessy | Alberto Bugarín | Ehud Reiter
2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence

In order to increase trust in the usage of Bayesian Networks and to cement their role as a model which can aid in critical decision making, the challenge of explainability must be faced. Previous attempts at explaining Bayesian Networks have largely focused on graphical or visual aids. In this paper we aim to highlight the importance of a natural language approach to explanation and to discuss some of the previous and state of the art attempts of the textual explanation of Bayesian Networks. We outline several challenges that remain to be addressed in the generation and validation of natural language explanations of Bayesian Networks. This can serve as a reference for future work on natural language explanations of Bayesian Networks.

2019

pdf bib
Natural Language Generation Challenges for Explainable AI
Ehud Reiter
Proceedings of the 1st Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence (NL4XAI 2019)

2018

pdf bib abs
Comprehension Driven Document Planning in Natural Language Generation Systems
Craig Thomson | Ehud Reiter | Somayajulu Sripada
Proceedings of the 11th International Conference on Natural Language Generation

This paper proposes an approach to NLG system design which focuses on generating output text which can be more easily processed by the reader. Ways in which cognitive theory might be combined with existing NLG techniques are discussed and two simple experiments in content ordering are presented.

pdf bib abs
Generating Summaries of Sets of Consumer Products: Learning from Experiments
Kittipitch Kuptavanich | Ehud Reiter | Kees Van Deemter | Advaith Siddharthan
Proceedings of the 11th International Conference on Natural Language Generation

We explored the task of creating a textual summary describing a large set of objects characterised by a small number of features using an e-commerce dataset. When a set of consumer products is large and varied, it can be difficult for a consumer to understand how the products in the set differ; consequently, it can be challenging to choose the most suitable product from the set. To assist consumers, we generated high-level summaries of product sets. Two generation algorithms are presented, discussed, and evaluated with human users. Our evaluation results suggest a positive contribution to consumers’ understanding of the domain.

pdf bib abs
Meteorologists and Students: A resource for language grounding of geographical descriptors
Alejandro Ramos-Soto | Ehud Reiter | Kees van Deemter | Jose Alonso | Albert Gatt
Proceedings of the 11th International Conference on Natural Language Generation

We present a data resource which can be useful for research purposes on language grounding tasks in the context of geographical referring expression generation. The resource is composed of two data sets that encompass 25 different geographical descriptors and a set of associated graphical representations, drawn as polygons on a map by two groups of human subjects: teenage students and expert meteorologists.

pdf bib abs
A Structured Review of the Validity of BLEU
Ehud Reiter
Computational Linguistics, Volume 44, Issue 3 - September 2018

The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique—in other words, whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outside of MT, for evaluation of individual texts, or for scientific hypothesis testing.

2017

pdf bib
Proceedings of the 10th International Conference on Natural Language Generation
Jose M. Alonso | Alberto Bugarín | Ehud Reiter
Proceedings of the 10th International Conference on Natural Language Generation

pdf bib abs
A Commercial Perspective on Reference
Ehud Reiter
Proceedings of the 10th International Conference on Natural Language Generation

I briefly describe some of the commercial work which XXX is doing in referring expression algorithms, and highlight differences between what is commercially important (at least to XXX) and the NLG research literature. In particular, XXX is less interested in generic reference algorithms than in high-quality algorithms for specific types of references, such as components of machines, named entities, and dates.

pdf bib abs
Textually Summarising Incomplete Data
Stephanie Inglis | Ehud Reiter | Somayajulu Sripada
Proceedings of the 10th International Conference on Natural Language Generation

Many data-to-text NLG systems work with data sets which are incomplete, ie some of the data is missing. We have worked with data journalists to understand how they describe incomplete data, and are building NLG algorithms based on these insights. A pilot evaluation showed mixed results, and highlighted several areas where we need to improve our system.

Notre société génère une masse d’information toujours croissante, que ce soit en médecine, en météorologie, etc. La méthode la plus employée pour analyser ces données est de les résumer sous forme graphique. Cependant, il a été démontré qu’un résumé textuel est aussi un mode de présentation efficace. L’objectif du prototype BT-45, développé dans le cadre du projet Babytalk, est de générer des résumés de 45 minutes de signaux physiologiques continus et d’événements temporels discrets en unité néonatale de soins intensifs (NICU). L’article présente l’aspect génération de texte de ce prototype. Une expérimentation clinique a montré que les résumés humains améliorent la prise de décision par rapport à l’approche graphique, tandis que les textes de BT-45 donnent des résultats similaires à l’approche graphique. Une analyse a identifié certaines des limitations de BT-45 mais en dépit de cellesci, notre travail montre qu’il est possible de produire automatiquement des résumés textuels efficaces de données complexes.