CHIE: Generative MRC Evaluation for in-context QA with Correctness, Helpfulness, Irrelevancy, and Extraneousness Aspects

Wannaphong Phatthiyaphaibun, Surapon Nonesung, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Ekapol Chuangsuwanich, Sarana Nutanong


Abstract
The evaluation of generative models in Machine Reading Comprehension (MRC) presents distinct difficulties, as traditional metrics like BLEU, ROUGE, METEOR, Exact Match, and F1 score often struggle to capture the nuanced and diverse responses these models produce. While embedding-based metrics such as BERTScore and BARTScore focus on semantic similarity, they still fail to fully address aspects such as recognizing additional helpful information and rewarding contextual faithfulness. Recent advances in large language model (LLM) based metrics offer more fine-grained evaluations, but challenges such as score clustering remain. This paper introduces a multi-aspect evaluation framework, CHIE, incorporating aspects of Correctness, Helpfulness, Irrelevance, and Extraneousness. Our approach, which uses binary categorical values rather than continuous rating scales, aligns well with human judgments, indicating its potential as a comprehensive and effective evaluation method.
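To make the idea of binary, per-aspect labels concrete, the following is a minimal illustrative sketch and not the authors' implementation. The class `ChieLabels`, the function `per_aspect_agreement`, and the example labels are all hypothetical. The polarity of each label and the use of simple agreement rate as the comparison with human judgments are assumptions, since the abstract does not specify either.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ChieLabels:
    """Hypothetical container: one binary label per CHIE aspect."""
    correctness: bool
    helpfulness: bool
    irrelevance: bool
    extraneousness: bool


def per_aspect_agreement(predicted, human):
    """Fraction of answers on which two label sets agree, per aspect.

    Simple agreement rate is an assumption chosen for illustration;
    the abstract does not state which agreement measure is used.
    """
    if len(predicted) != len(human) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    agreement = {}
    for field in fields(ChieLabels):
        matches = sum(
            getattr(p, field.name) == getattr(h, field.name)
            for p, h in zip(predicted, human)
        )
        agreement[field.name] = matches / len(predicted)
    return agreement


# Made-up labels for two answers, purely for illustration.
predicted = [
    ChieLabels(True, True, False, False),
    ChieLabels(True, False, False, True),
]
human = [
    ChieLabels(True, True, False, False),
    ChieLabels(False, False, False, True),
]
scores = per_aspect_agreement(predicted, human)
```

Because each aspect is a yes/no decision, agreement can be reported separately per aspect, which is one way a binary scheme avoids the score clustering that continuous rating scales can show.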
Anthology ID:
2024.genbench-1.10
Volume:
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Dieuwke Hupkes, Verna Dankers, Khuyagbaatar Batsuren, Amirhossein Kazemnejad, Christos Christodoulopoulos, Mario Giulianelli, Ryan Cotterell
Venue:
GenBench
Publisher:
Association for Computational Linguistics
Pages:
154–164
URL:
https://aclanthology.org/2024.genbench-1.10
Cite (ACL):
Wannaphong Phatthiyaphaibun, Surapon Nonesung, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Ekapol Chuangsuwanich, and Sarana Nutanong. 2024. CHIE: Generative MRC Evaluation for in-context QA with Correctness, Helpfulness, Irrelevancy, and Extraneousness Aspects. In Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP, pages 154–164, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
CHIE: Generative MRC Evaluation for in-context QA with Correctness, Helpfulness, Irrelevancy, and Extraneousness Aspects (Phatthiyaphaibun et al., GenBench 2024)
PDF:
https://aclanthology.org/2024.genbench-1.10.pdf