Content-Specific Humorous Image Captioning Using Incongruity Resolution Chain-of-Thought

Kohtaro Tanaka, Kohei Uehara, Lin Gu, Yusuke Mukuta, Tatsuya Harada


Abstract
Although automated image captioning methods have benefited considerably from the development of large language models (LLMs), generating humorous captions remains a challenging task. Humorous captions written by humans are unique to the image and reflect its content, whereas captions generated by previous captioning models tend to be generic. We therefore propose incongruity-resolution chain-of-thought (IRCoT), a novel prompting framework that creates content-specific resolutions from fine details extracted from an image. Furthermore, we integrate logit bias and negative sampling to suppress the output of generic resolutions. Experiments with GPT-4V demonstrate that the proposed framework effectively generates humorous captions tailored to the content of the specific input image.
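The abstract does not say how logit bias is applied, so the following is only a minimal sketch of the general idea, not the authors' implementation: a negative bias is added to the logits of tokens considered generic before the softmax, which shifts probability mass toward content-specific tokens. The token names, logit values, bias magnitude, and the helper names `softmax` and `apply_logit_bias` are all hypothetical and chosen purely for illustration.

```python
import math


def softmax(logits):
    """Convert a {token: logit} mapping into a probability distribution."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def apply_logit_bias(logits, bias):
    """Add a per-token bias to the logits; tokens without a bias are unchanged."""
    return {tok: v + bias.get(tok, 0.0) for tok, v in logits.items()}


# Hypothetical next-token logits for a caption (illustrative values only).
logits = {"funny": 2.0, "weird": 1.5, "umbrella": 1.0, "pigeon": 0.5}

# Assumed set of "generic" tokens to suppress, with an assumed bias strength.
generic_bias = {"funny": -5.0, "weird": -5.0}

before = softmax(logits)
after = softmax(apply_logit_bias(logits, generic_bias))

for tok in logits:
    print(f"{tok:>8}: {before[tok]:.3f} -> {after[tok]:.3f}")
```

In this toy setting the generic tokens lose most of their probability while the image-specific ones gain it; a hosted model would apply the same adjustment through whatever bias parameter its API exposes.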
Anthology ID:
2024.findings-naacl.152
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2348–2367
URL:
https://aclanthology.org/2024.findings-naacl.152
Cite (ACL):
Kohtaro Tanaka, Kohei Uehara, Lin Gu, Yusuke Mukuta, and Tatsuya Harada. 2024. Content-Specific Humorous Image Captioning Using Incongruity Resolution Chain-of-Thought. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2348–2367, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Content-Specific Humorous Image Captioning Using Incongruity Resolution Chain-of-Thought (Tanaka et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.152.pdf
Copyright:
 2024.findings-naacl.152.copyright.pdf