Lewis N. Watson

2024

pdf bib abs
ReproHum #0712-01: Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation
Lewis N. Watson | Dimitra Gkatzia
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

Reproducibility is a cornerstone of scientific research, ensuring the reliability and generalisability of findings. The ReproNLP Shared Task on Reproducibility of Evaluations in NLP aims to assess the reproducibility of human evaluation studies. This paper presents a reproduction study of the human evaluation experiment presented in “Hierarchical Sketch Induction for Paraphrase Generation” by Hosking et al. (2022). The original study employed a human evaluation on Amazon Mechanical Turk, assessing the quality of paraphrases generated by their proposed model using three criteria: meaning preservation, fluency, and dissimilarity. In our reproduction study, we focus on the meaning preservation criterion and utilise the Prolific platform for participant recruitment, following the ReproNLP challenge’s common approach to reproduction. We discuss the methodology, results, and implications of our reproduction study, comparing them to the original findings. Our findings contribute to the understanding of reproducibility in NLP research and highlights the potential impact of platform changes and evaluation criteria on the reproducibility of human evaluation studies.

pdf bib abs
Exploring the impact of data representation on neural data-to-text generation
David M. Howcroft | Lewis N. Watson | Olesia Nedopas | Dimitra Gkatzia
Proceedings of the 17th International Natural Language Generation Conference

A relatively under-explored area in research on neural natural language generation is the impact of the data representation on text quality. Here we report experiments on two leading input representations for data-to-text generation: attribute-value pairs and Resource Description Framework (RDF) triples. Evaluating the performance of encoder-decoder seq2seq models as well as recent large language models (LLMs) with both automated metrics and human evaluation, we find that the input representation does not seem to have a large impact on the performance of either purpose-built seq2seq models or LLMs. Finally, we present an error analysis of the texts generated by the LLMs and provide some insights into where these models fail.

Co-authors

Venues

Fix author