Quality Estimation for Image Captions Based on Large-scale Human Evaluations

Tomer Levinboim, Ashish V. Thapliyal, Piyush Sharma, Radu Soricut


Abstract
Automatic image captioning has improved significantly over the last few years, but the problem is far from solved, with state-of-the-art models still often producing low-quality captions when used in the wild. In this paper, we focus on the task of Quality Estimation (QE) for image captions, which attempts to model caption quality from a human perspective and *without* access to ground-truth references, so that it can be applied at prediction time to detect low-quality captions produced on *previously unseen images*. For this task, we develop a human evaluation process that collects coarse-grained caption annotations from crowdsourced users, which we then use to build a large-scale dataset spanning more than 600k caption quality ratings. We carefully validate the quality of the collected ratings and establish baseline models for this new QE task. Finally, we collect fine-grained caption quality annotations from trained raters, and use them to demonstrate that QE models trained over the coarse ratings can effectively detect and filter out low-quality image captions, thereby improving the user experience of captioning systems.
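The prediction-time filtering described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `qe_score` stands in for a trained QE model (here a trivial length heuristic, used only so the example runs), and the threshold value is an assumption.

```python
def qe_score(caption: str) -> float:
    """Hypothetical stand-in for a trained QE model, which would score a
    caption's quality without ground-truth references. Here we use a
    trivial length-based heuristic purely to make the sketch runnable."""
    words = caption.split()
    return min(len(words) / 10.0, 1.0)


def filter_captions(captions, threshold=0.5):
    """Keep only captions whose estimated quality meets the threshold,
    suppressing low-quality outputs before they reach the user."""
    return [c for c in captions if qe_score(c) >= threshold]


captions = [
    "a dog",                                      # short, uninformative -> low score
    "a brown dog running across a grassy field",  # descriptive -> high score
]
print(filter_captions(captions))
```

In a real system, `qe_score` would be replaced by the learned QE model trained on the coarse-grained human ratings, and the threshold tuned against the fine-grained annotations.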
Anthology ID:
2021.naacl-main.253
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
3157–3166
URL:
https://aclanthology.org/2021.naacl-main.253
DOI:
10.18653/v1/2021.naacl-main.253
Cite (ACL):
Tomer Levinboim, Ashish V. Thapliyal, Piyush Sharma, and Radu Soricut. 2021. Quality Estimation for Image Captions Based on Large-scale Human Evaluations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3157–3166, Online. Association for Computational Linguistics.
Cite (Informal):
Quality Estimation for Image Captions Based on Large-scale Human Evaluations (Levinboim et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-main.253.pdf
Video:
https://aclanthology.org/2021.naacl-main.253.mp4
Code:
google-research-datasets/Image-Caption-Quality-Dataset
Data:
Image Caption Quality Dataset, Conceptual Captions