2024
pdf
bib
abs
Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation
Jie Ruan
|
Wenqing Wang
|
Xiaojun Wan
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Human evaluation serves as the gold standard for assessing the quality of Natural Language Generation (NLG) systems. Nevertheless, the evaluation guideline, as a pivotal element ensuring reliable and reproducible human assessment, has received limited attention. Our investigation revealed that only 29.84% of recent papers involving human evaluation at top conferences release their evaluation guidelines, with vulnerabilities identified in 77.09% of these guidelines. Unreliable evaluation guidelines can yield inaccurate assessment outcomes, potentially impeding the advancement of NLG in the right direction. To address these challenges, we take an initial step towards reliable evaluation guidelines and propose the first human evaluation guideline dataset by collecting annotations of guidelines extracted from existing papers as well as generated via Large Language Models (LLMs). We then introduce a taxonomy of eight vulnerabilities and formulate a principle for composing evaluation guidelines. Furthermore, a method for detecting guideline vulnerabilities has been explored using LLMs, and we offer a set of recommendations to enhance reliability in human evaluation. The annotated human evaluation guideline dataset and code for the vulnerability detection method are publicly available online.
pdf
bib
abs
ReproHum #0087-01: A Reproduction Study of the Human Evaluation of the Coverage of Fact Checking Explanations
Mingqi Gao
|
Jie Ruan
|
Xiaojun Wan
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
We present a reproduction study of the human evaluation of the coverage of fact checking explanations conducted by Atanasova et al. (2020), as a team in Track B of ReproNLP 2024. The setup of our reproduction study is almost the same as the original study, with some necessary modifications to the evaluation guideline and annotation interface. Our reproduction achieves a higher IAA of 0.20 compared to the original study’s 0.12, but discovers a mismatch between the IAA calculated by us with the raw annotation in the original study and the IAA reported in the original paper. Additionally, our reproduction results on the ranks of three types of explanations are drastically different from the original experiment, rendering that one important conclusion in the original paper cannot be confirmed at all. The case study illustrates that the annotators in the reproduction study may understand the quality criterion differently from the annotators in the original study.
pdf
bib
abs
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation
Xunjian Yin
|
Xu Zhang
|
Jie Ruan
|
Xiaojun Wan
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In recent years, substantial advancements have been made in the development of large language models, achieving remarkable performance across diverse tasks.To evaluate the knowledge ability of language models, previous studies have proposed lots of benchmarks based on question-answering pairs.We argue that it is not reliable and comprehensive to evaluate language models with a fixed question or limited paraphrases as the query, since language models are sensitive to prompt.Therefore, we introduce a novel concept named knowledge boundary to encompass both prompt-agnostic and prompt-sensitive knowledge within language models.Knowledge boundary avoids prompt sensitivity in language model evaluations, rendering them more dependable and robust.To explore the knowledge boundary for a given model, we propose projected gradient descent method with semantic constraints, a new algorithm designed to identify the optimal prompt for each piece of knowledge.Experiments demonstrate a superior performance of our algorithm in computing the knowledge boundary compared to existing methods.Furthermore, we evaluate the ability of multiple language models in several domains with knowledge boundary.
2023
pdf
bib
abs
A Reproduction Study of the Human Evaluation of Role-Oriented Dialogue Summarization Models
Mingqi Gao
|
Jie Ruan
|
Xiaojun Wan
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
This paper reports a reproduction study of the human evaluation of role-oriented dialogue summarization models, as part of the ReproNLP Shared Task 2023 on Reproducibility of Evaluations in NLP. We outline the disparities between the original study’s experimental design and our reproduction study, along with the outcomes obtained. The inter-annotator agreement within the reproduction study is observed to be lower, measuring 0.40 as compared to the original study’s 0.48. Among the six conclusions drawn in the original study, four are validated in our reproduction study. We confirm the effectiveness of the proposed approach on the overall metric, albeit with slightly poorer relative performance compared to the original study. Furthermore, we raise an open-ended inquiry: how can subjective practices in the original study be identified and addressed when conducting reproduction studies?
pdf
bib
abs
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz
|
Craig Thomson
|
Ehud Reiter
|
Gavin Abercrombie
|
Jose M. Alonso-Moral
|
Mohammad Arvan
|
Anouck Braggaar
|
Mark Cieliebak
|
Elizabeth Clark
|
Kees van Deemter
|
Tanvi Dinkar
|
Ondřej Dušek
|
Steffen Eger
|
Qixiang Fang
|
Mingqi Gao
|
Albert Gatt
|
Dimitra Gkatzia
|
Javier González-Corbelle
|
Dirk Hovy
|
Manuela Hürlimann
|
Takumi Ito
|
John D. Kelleher
|
Filip Klubicka
|
Emiel Krahmer
|
Huiyuan Lai
|
Chris van der Lee
|
Yiru Li
|
Saad Mahamood
|
Margot Mieskes
|
Emiel van Miltenburg
|
Pablo Mosteiro
|
Malvina Nissim
|
Natalie Parde
|
Ondřej Plátek
|
Verena Rieser
|
Jie Ruan
|
Joel Tetreault
|
Antonio Toral
|
Xiaojun Wan
|
Leo Wanner
|
Lewis Watson
|
Diyi Yang
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.