The evaluation of Natural Language Generation (NLG) systems has recently attracted considerable interest in the research community, since it must address several challenging aspects, such as the readability of the generated texts, their adequacy to the user within a particular context and moment, and linguistic quality issues (e.g., correctness, coherence, understandability), among others. In this paper, we propose a novel technique for evaluating NLG systems that is inspired by the triangular test used in the field of sensory analysis. This technique allows us to compare two texts generated by different subjects and to i) determine whether statistically significant differences between them are detected when they are evaluated by humans and ii) quantify to what extent the number of evaluators influences the sensitivity of the results. As a proof of concept, we apply this evaluation technique to a real use case in the field of meteorology, showing the advantages and disadvantages of our proposal.
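The abstract above does not spell out the statistical machinery, but the core of a triangular test is a simple binomial comparison: each evaluator receives three texts, two from one source and one from the other, and must identify the odd one out, which happens by chance with probability 1/3. The sketch below uses hypothetical evaluator counts and SciPy's `binomtest` to illustrate how significance could be checked and why the number of evaluators matters; it is not code from the paper.

```python
# Minimal sketch of the triangular test's significance check (hypothetical
# counts, not results from the paper). Each evaluator sees three texts, two
# from one source and one from the other, and must pick the odd one out;
# under the null hypothesis of no perceptible difference, a correct pick
# happens with probability 1/3.
from scipy.stats import binomtest

def triangle_test(correct: int, evaluators: int, alpha: float = 0.05) -> bool:
    """Return True if the two text sources differ significantly at level alpha."""
    result = binomtest(correct, evaluators, p=1 / 3, alternative="greater")
    return result.pvalue < alpha

# With 24 evaluators, 14 correct identifications (58%) reach significance at
# alpha = 0.05; with only 6 evaluators, even 4 correct picks (67%) do not.
print(triangle_test(14, 24), triangle_test(4, 6))
```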
Despite recent efforts to review current human evaluation practices in natural language generation (NLG) research, the lack of reported question wording and the potential for framing effects or cognitive biases to influence results have been widely overlooked. In this opinion paper, we detail three possible framing effects and cognitive biases that could affect human evaluation in NLG. Based on this, we make a call for increased transparency in human evaluation for NLG and propose the concept of human evaluation statements. We make several recommendations about design details to report that could potentially influence results, such as question wording, and suggest that reporting pertinent design details can help increase the comparability of studies as well as the reproducibility of results.
NLG researchers often use uncontrolled corpora to train and evaluate their systems, relying on textual similarity metrics such as BLEU. This position paper argues in favour of two alternative evaluation strategies, based on grammars or rule-based systems. These strategies are particularly useful for identifying the strengths and weaknesses of different systems. We contrast our proposals with the (extended) WebNLG dataset, which turns out to have a skewed distribution of predicates, and we predict that this distribution affects the quality of the predictions of systems trained on this data. However, this hypothesis can only be thoroughly tested (without any confounds) once we are able to systematically manipulate the skewness of the data, using a rule-based approach.
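The abstract above does not say how the skew of the predicate distribution was measured; as one hypothetical way to quantify it, the sketch below counts predicate frequencies over (subject, predicate, object) triples of the kind found in WebNLG and reports the share covered by the five most frequent predicates together with the normalised entropy of the distribution. The toy triples are invented placeholders, not WebNLG data.

```python
# Rough sketch (not code from the paper): quantify how skewed a dataset's
# predicate distribution is, given (subject, predicate, object) triples.
from collections import Counter
import math

def predicate_skew(triples):
    counts = Counter(pred for _, pred, _ in triples)
    total = sum(counts.values())
    top5_share = sum(c for _, c in counts.most_common(5)) / total
    # Normalised entropy: 1.0 means a perfectly uniform distribution,
    # values near 0 mean a few predicates dominate the data.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    norm_entropy = entropy / math.log2(len(counts)) if len(counts) > 1 else 0.0
    return {"top5_share": top5_share, "normalised_entropy": norm_entropy}

# Invented placeholder triples; in practice these would be read from the corpus.
triples = [("ent1", "birthPlace", "ent2"), ("ent3", "birthPlace", "ent4"),
           ("ent5", "birthPlace", "ent6"), ("ent7", "leaderName", "ent8")]
print(predicate_skew(triples))
```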
We present NUBIA, a methodology to build automatic evaluation metrics for text generation using only machine learning models as core components. A typical NUBIA model is composed of three modules: a neural feature extractor, an aggregator and a calibrator. We demonstrate an implementation of NUBIA showing competitive performance with state-of-the-art metrics used to evaluate machine translation and state-of-the-art results for image caption quality evaluation. In addition to strong performance, NUBIA models have the advantage of being modular and improving in synergy with advances in text generation models.
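NUBIA's concrete models are not detailed in the abstract above; the sketch below only mirrors the three-module layout it describes (feature extractor, aggregator, calibrator), using deliberately simple stand-in features and toy training data rather than the neural components of the actual implementation.

```python
# Minimal sketch of the three-module design named in the abstract above
# (feature extractor -> aggregator -> calibrator). The features, training
# pairs and human scores below are hypothetical stand-ins, not the neural
# components of the actual NUBIA implementation.
import numpy as np
from sklearn.linear_model import LinearRegression

def extract_features(reference: str, candidate: str) -> np.ndarray:
    """Stand-in for the neural feature extractor; in NUBIA this module would
    produce scores from pretrained models rather than surface statistics."""
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    overlap = len(ref & cand) / max(len(ref | cand), 1)
    length_ratio = min(len(ref), len(cand)) / max(len(ref), len(cand), 1)
    return np.array([overlap, length_ratio])

# Aggregator: a regressor trained to map feature vectors to (toy) human scores.
pairs = [("the cat sat on the mat", "the cat sat on the mat"),
         ("the cat sat on the mat", "a dog ran in the park")]
human_scores = np.array([1.0, 0.1])  # hypothetical quality judgements
X = np.array([extract_features(r, c) for r, c in pairs])
aggregator = LinearRegression().fit(X, human_scores)

def calibrate(raw_score: float) -> float:
    """Calibrator: clamp the aggregated score into a bounded [0, 1] range."""
    return float(np.clip(raw_score, 0.0, 1.0))

features = extract_features("the cat sat on the mat", "the cat sat on a mat")
print(calibrate(aggregator.predict(features.reshape(1, -1))[0]))
```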
An ongoing debate in the NLG community concerns the best way to evaluate systems, with human evaluation often considered the most reliable method compared to corpus-based metrics. However, tasks involving subtle textual differences, such as style transfer, tend to be hard for humans to evaluate. In this paper, we propose an evaluation method for this task based on purposely-trained classifiers, showing that it better reflects system differences than traditional metrics such as BLEU.
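As a rough illustration of classifier-based evaluation for style transfer (not the paper's exact setup), the sketch below trains a text classifier on gold examples of the source and target styles and scores a system by the fraction of its outputs that the classifier labels as target-style; the training sentences and system outputs are invented toy data.

```python
# Rough illustration of classifier-based evaluation for style transfer
# (invented toy sentences, not the paper's data or its purposely-trained
# classifiers). Label 1 = target style, label 0 = source style.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["thou art most kind", "prithee sit with me", "wherefore dost thou weep",
               "you are very kind", "please sit with me", "why are you crying"]
train_labels = [1, 1, 1, 0, 0, 0]

style_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
style_clf.fit(train_texts, train_labels)

# The system under evaluation is scored by the fraction of its outputs that
# the classifier labels as target-style (higher = more successful transfer).
system_outputs = ["thou art very kind", "prithee, why dost thou weep"]
transfer_score = style_clf.predict(system_outputs).mean()
print(f"target-style rate: {transfer_score:.2f}")
```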