Reference-based Metrics Disprove Themselves in Question Generation

Bang Nguyen; Mengxia Yu; Yun Huang; Meng Jiang

doi:10.18653/v1/2024.findings-emnlp.798

Abstract

Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicate the annotation process and collect another reference. A good metric is expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisting of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntax or semantics of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.
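
The sanity check described in the abstract (a human-validated question should score no worse than generated questions) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: `token_f1` is a simple unigram-overlap stand-in for a reference-based metric (it is neither BLEU nor BERTScore), `human_ranked_no_worse` is a hypothetical helper name, and the example questions are made up rather than taken from SQuAD or HotpotQA.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a stand-in for a reference-based metric (assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def human_ranked_no_worse(metric, reference, human_question, generated_questions):
    """Return True if the metric scores the human-validated question
    at least as high as every generated question (hypothetical helper)."""
    human_score = metric(human_question, reference)
    return all(human_score >= metric(g, reference) for g in generated_questions)


# Made-up example: the original reference, a second human-written question
# for the same answer, and a flawed generated question that copies the
# reference's wording but drops the entity that makes it answerable.
reference = "Who wrote the novel Dracula ?"
second_human = "Which author is Dracula by ?"
generated = ["Who wrote the novel ?"]

passes = human_ranked_no_worse(token_f1, reference, second_human, generated)
print(passes)
```

In this toy case the check fails: the overlap-based stand-in rewards the flawed generated question for reusing the reference's wording and penalizes the valid human paraphrase, which is the kind of behavior the abstract reports for reference-based metrics.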

Anthology ID: 2024.findings-emnlp.798
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 13651–13666
URL: https://aclanthology.org/2024.findings-emnlp.798/
DOI: 10.18653/v1/2024.findings-emnlp.798
Cite (ACL): Bang Nguyen, Mengxia Yu, Yun Huang, and Meng Jiang. 2024. Reference-based Metrics Disprove Themselves in Question Generation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13651–13666, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Reference-based Metrics Disprove Themselves in Question Generation (Nguyen et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.798.pdf