Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model

Sheng Cheng, Maitreya Patel, Yezhou Yang


Abstract
Despite advancements in text-to-image models, generating images that precisely align with textual descriptions remains challenging due to misalignment in the training data. In this paper, we analyze the critical role of caption precision and recall in text-to-image model training. Our analysis of human-annotated captions shows that both precision and recall are important for text-image alignment, but precision has the more significant impact. Leveraging these insights, we utilize Large Vision Language Models to generate synthetic captions for training. Models trained with these synthetic captions show behavior similar to those trained on human-annotated captions, underscoring the potential of synthetic data in text-to-image training.
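The abstract's precision/recall framing can be made concrete with a small sketch. The snippet below treats a caption as a set of extracted concepts and scores it against the ground-truth concepts present in the image: precision penalizes hallucinated details, recall penalizes omissions. The concept-extraction step and this set-based matching rule are assumptions for illustration only, not the paper's exact metric definitions.

```python
# Hypothetical sketch of set-based caption precision and recall.
# Concept extraction and the matching rule are assumptions for
# illustration; the paper's own metric definitions may differ.

def caption_precision_recall(caption_concepts: set, image_concepts: set):
    """Score a caption against the concepts actually present in the image.

    Precision: fraction of the caption's concepts that are in the image
               (penalizes hallucinated details).
    Recall:    fraction of the image's concepts the caption mentions
               (penalizes omitted details).
    """
    if not caption_concepts or not image_concepts:
        return 0.0, 0.0
    true_positives = caption_concepts & image_concepts
    precision = len(true_positives) / len(caption_concepts)
    recall = len(true_positives) / len(image_concepts)
    return precision, recall


# Example: the caption hallucinates "dog" (hurting precision) and
# omits "tree" (hurting recall).
image_concepts = {"cat", "sofa", "tree"}
caption_concepts = {"cat", "sofa", "dog"}
print(caption_precision_recall(caption_concepts, image_concepts))
# -> (0.666..., 0.666...)
```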
Anthology ID:
2024.findings-emnlp.211
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3703–3709
URL:
https://aclanthology.org/2024.findings-emnlp.211
Cite (ACL):
Sheng Cheng, Maitreya Patel, and Yezhou Yang. 2024. Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3703–3709, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model (Cheng et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.211.pdf