Improving Image Captioning Evaluation by Considering Inter References Variance

Yanzhi Yi, Hangyu Deng, Jinglu Hu


Abstract
Evaluating image captions is challenging, partly because every image admits multiple correct captions. Most existing one-to-one metrics penalize mismatches between a reference caption and a generated caption without considering the intrinsic variance among ground-truth captions. This usually leads to over-penalization and thus poor correlation with human judgment. The most recent one-to-one metric, BERTScore, achieves high correlation with human judgment on system-level tasks, but some of its issues can be fixed for better performance. In this paper, we propose a novel metric based on BERTScore that handles this challenge and extends BERTScore with a few new features suited to image captioning evaluation. Experimental results show that our metric achieves state-of-the-art correlation with human judgment.
Anthology ID:
2020.acl-main.93
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
985–994
URL:
https://aclanthology.org/2020.acl-main.93
DOI:
10.18653/v1/2020.acl-main.93
Cite (ACL):
Yanzhi Yi, Hangyu Deng, and Jinglu Hu. 2020. Improving Image Captioning Evaluation by Considering Inter References Variance. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 985–994, Online. Association for Computational Linguistics.
Cite (Informal):
Improving Image Captioning Evaluation by Considering Inter References Variance (Yi et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-main.93.pdf
Video:
http://slideslive.com/38929015