How Do Image Description Systems Describe People? A Targeted Assessment of System Competence in the PEOPLE-domain

Emiel van Miltenburg


Abstract
Evaluations of image description systems are typically domain-general: generated descriptions for the held-out test images are either compared to a set of reference descriptions (using automated metrics), or rated by human judges on one or more Likert scales (for fluency, overall quality, and other quality criteria). While useful, these evaluations do not tell us anything about the kinds of image descriptions that systems are able to produce. Or, phrased differently, these evaluations do not tell us anything about the cognitive capabilities of image description systems. This paper proposes a different kind of assessment that quantifies the extent to which these systems are able to describe humans. The assessment is based on a manual characterisation (a context-free grammar) of English entity labels in the PEOPLE domain, which determines the range of possible outputs. We examined 9 systems to see what kinds of labels they actually use. We found that these systems use only a small subset of the possible modifiers: at most 13 different kinds of modifiers are used (e.g. tall and short modify HEIGHT, sad and happy modify MOOD), while 27 kinds of modifiers are never used. Future research could study these semantic dimensions in more detail.
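The idea of counting which kinds of modifiers a system uses can be illustrated with a minimal sketch. It is not the paper's context-free grammar: it replaces the grammar with a simple lexicon lookup and whitespace tokenisation. Only the HEIGHT and MOOD categories and their four example words come from the abstract; the head-noun list and the function names `modifier_categories_used` and `unused_categories` are assumptions made for illustration.

```python
from collections import defaultdict

# Toy lexicon: modifier word -> semantic category.
# These four entries are the examples given in the abstract;
# a real inventory would be far larger.
MODIFIER_CATEGORIES = {
    "tall": "HEIGHT",
    "short": "HEIGHT",
    "sad": "MOOD",
    "happy": "MOOD",
}

# Hypothetical head nouns for the PEOPLE domain (assumed, not from the paper).
HEAD_NOUNS = {"man", "woman", "boy", "girl", "person", "child"}


def modifier_categories_used(descriptions):
    """Map each category to the modifiers seen directly before a head noun.

    Uses lowercase whitespace tokenisation, so punctuation attached to a
    word will prevent a match.
    """
    used = defaultdict(set)
    for description in descriptions:
        tokens = description.lower().split()
        for i, token in enumerate(tokens):
            if token not in HEAD_NOUNS:
                continue
            # Walk left from the head noun over consecutive known modifiers.
            j = i - 1
            while j >= 0 and tokens[j] in MODIFIER_CATEGORIES:
                used[MODIFIER_CATEGORIES[tokens[j]]].add(tokens[j])
                j -= 1
    return dict(used)


def unused_categories(descriptions):
    """Return the lexicon categories that never occur in the descriptions."""
    all_categories = set(MODIFIER_CATEGORIES.values())
    return all_categories - set(modifier_categories_used(descriptions))
```

Calling `unused_categories` on the output of one system gives the categories from the toy lexicon that the system never produced, which mirrors, in a much reduced form, the used versus never-used comparison reported in the abstract.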
Anthology ID:
2020.lantern-1.4
Volume:
Proceedings of the Second Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
Month:
December
Year:
2020
Address:
Barcelona, Spain
Editors:
Aditya Mogadala, Sandro Pezzelle, Dietrich Klakow, Marie-Francine Moens, Zeynep Akata
Venue:
LANTERN
Publisher:
Association for Computational Linguistics
Pages:
30–36
URL:
https://aclanthology.org/2020.lantern-1.4
Cite (ACL):
Emiel van Miltenburg. 2020. How Do Image Description Systems Describe People? A Targeted Assessment of System Competence in the PEOPLE-domain. In Proceedings of the Second Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN), pages 30–36, Barcelona, Spain. Association for Computational Linguistics.
Cite (Informal):
How Do Image Description Systems Describe People? A Targeted Assessment of System Competence in the PEOPLE-domain (van Miltenburg, LANTERN 2020)
PDF:
https://aclanthology.org/2020.lantern-1.4.pdf
Code
 evanmiltenburg/analysepeopledescriptions
Data
Flickr30k, MS COCO, Visual Genome