Adaptations of ROUGE and BLEU to Better Evaluate Machine Reading Comprehension Task

An Yang, Kai Liu, Jing Liu, Yajuan Lyu, Sujian Li


Abstract
Current evaluation metrics for question-answering-based machine reading comprehension (MRC) systems, such as ROUGE and BLEU, generally focus on the lexical overlap between candidate and reference answers. However, these metrics can be biased for specific question types, especially questions that ask for yes-no opinions or entity lists. In this paper, we adapt the metrics so that n-gram overlap correlates better with human judgment on answers to these two question types. Statistical analysis supports the effectiveness of our approach. Our adaptations may provide positive guidance for the development of real-world MRC systems.
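To make the bias on yes-no questions concrete, the sketch below computes a plain LCS-based ROUGE-L F1 score over whitespace tokens. It is only an illustration of unadapted lexical overlap, not the adaptation proposed in the paper; the helper names (`lcs_length`, `rouge_l_f1`) and the example sentences are assumptions made up for this illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: balanced F1 of LCS precision and recall.

    Assumption: lowercased whitespace tokenization and equal weighting
    of precision and recall, chosen only to keep the sketch short.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers to a yes-no question (invented for illustration).
reference = "yes it is safe to eat"
agreeing = "yes it is safe"           # same opinion, shorter wording
contradicting = "no it is safe to eat"  # opposite opinion, similar wording

score_agreeing = rouge_l_f1(agreeing, reference)
score_contradicting = rouge_l_f1(contradicting, reference)
print(score_agreeing, score_contradicting)
```

Under these assumptions the contradicting answer receives a higher overlap score than the agreeing one, because the single opinion word carries no more weight than any other token. This is the kind of mismatch with human judgment that the abstract describes.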
Anthology ID:
W18-2611
Volume:
Proceedings of the Workshop on Machine Reading for Question Answering
Month:
July
Year:
2018
Address:
Melbourne, Australia
Venues:
ACL | WS
Publisher:
Association for Computational Linguistics
Pages:
98–104
URL:
https://aclanthology.org/W18-2611
DOI:
10.18653/v1/W18-2611
Cite (ACL):
An Yang, Kai Liu, Jing Liu, Yajuan Lyu, and Sujian Li. 2018. Adaptations of ROUGE and BLEU to Better Evaluate Machine Reading Comprehension Task. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 98–104, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Adaptations of ROUGE and BLEU to Better Evaluate Machine Reading Comprehension Task (Yang et al., 2018)
PDF:
https://aclanthology.org/W18-2611.pdf
Data:
DuReader, MS MARCO, SQuAD