Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor

Ashwin Baluja


Abstract
While Large Language Models (LLMs) have demonstrated impressive natural language understanding across a range of text-based tasks, understanding humor has remained a persistent challenge. Humor is frequently multimodal, relying not only on the meaning of the words but also on their pronunciation and even the speaker's intonation. In this study, we explore a simple multimodal prompting approach to humor understanding and explanation. We present an LLM with both the text and the spoken form of a joke, the latter generated with an off-the-shelf text-to-speech (TTS) system. Using these multimodal cues improves humor explanations over text-only prompts across all tested datasets.
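The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the TTS call and the multimodal model call are stubbed with hypothetical names (`synthesize_speech`, `query_multimodal_llm`), since the abstract does not specify a particular API. The sketch only shows how a joke's text and its spoken form would be paired into one prompt.

```python
import base64


def synthesize_speech(text: str) -> bytes:
    """Hypothetical stand-in for an off-the-shelf TTS system.

    A real system (e.g. a cloud TTS API) would return audio bytes;
    here we return a placeholder so the sketch is self-contained.
    """
    return text.encode("utf-8")


def build_multimodal_prompt(joke: str) -> list:
    """Pair a joke's text with its spoken form for a multimodal LLM.

    The exact message schema varies by model provider; this generic
    list-of-parts structure is an assumption for illustration.
    """
    audio = synthesize_speech(joke)
    return [
        {"type": "text",
         "text": f"Explain why this joke is funny: {joke}"},
        {"type": "audio",
         "data": base64.b64encode(audio).decode("ascii")},
    ]


prompt = build_multimodal_prompt(
    "Why did the scarecrow win an award? "
    "He was outstanding in his field."
)
print(len(prompt))  # two parts: the joke text and its audio rendering
```

In practice, `prompt` would be sent to a multimodal LLM (a hypothetical `query_multimodal_llm(prompt)` call), and the text-only baseline would send just the first part.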
Anthology ID:
2025.chum-1.2
Volume:
Proceedings of the 1st Workshop on Computational Humor (CHum)
Month:
January
Year:
2025
Address:
Online
Editors:
Christian F. Hempelmann, Julia Rayz, Tiansi Dong, Tristan Miller
Venues:
chum | WS
Publisher:
Association for Computational Linguistics
Pages:
9–17
URL:
https://aclanthology.org/2025.chum-1.2/
Cite (ACL):
Ashwin Baluja. 2025. Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor. In Proceedings of the 1st Workshop on Computational Humor (CHum), pages 9–17, Online. Association for Computational Linguistics.
Cite (Informal):
Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor (Baluja, chum 2025)
PDF:
https://aclanthology.org/2025.chum-1.2.pdf