Surapon Nonesung
2024
CHIE: Generative MRC Evaluation for in-context QA with Correctness, Helpfulness, Irrelevancy, and Extraneousness Aspects
Wannaphong Phatthiyaphaibun | Surapon Nonesung | Peerat Limkonchotiwat | Can Udomcharoenchaikit | Jitkapat Sawatphol | Ekapol Chuangsuwanich | Sarana Nutanong
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
The evaluation of generative models in Machine Reading Comprehension (MRC) presents distinct difficulties, as traditional metrics like BLEU, ROUGE, METEOR, Exact Match, and F1 score often struggle to capture nuanced and diverse responses. While embedding-based metrics such as BERTScore and BARTScore focus on semantic similarity, they still fail to fully address aspects such as recognizing additional helpful information and rewarding contextual faithfulness. Recent advances in large language model (LLM) based metrics offer more fine-grained evaluations, but challenges such as score clustering remain. This paper introduces a multi-aspect evaluation framework, CHIE, incorporating aspects of Correctness, Helpfulness, Irrelevance, and Extraneousness. Our approach, which uses binary categorical values rather than continuous rating scales, aligns well with human judgments, indicating its potential as a comprehensive and effective evaluation method.