Can Language Models Laugh at YouTube Short-form Videos?

As short-form funny videos on social networks gain popularity, there is a growing demand for AI models to understand them in order to communicate better with humans. Unfortunately, previous video humor datasets target specific domains, such as speeches or sitcoms, and mostly focus on verbal cues. We curate a user-generated dataset of 10K multimodal funny videos from YouTube, called ExFunTube. Using a video filtering pipeline with GPT-3.5, we verify both verbal and visual elements contributing to humor. After filtering, we annotate each video with timestamps and text explanations for funny moments. Our ExFunTube is unique over existing datasets in that our videos cover a wide range of domains with various types of humor that necessitate a multimodal understanding of the content. Also, we develop a zero-shot video-to-text prompting to maximize video humor understanding of large language models (LLMs). With three different evaluation methods using automatic scores, rationale quality experiments, and human evaluations, we show that our prompting significantly improves LLMs' ability for humor explanation.


Introduction
Today, a huge number of short-form funny videos are popularly circulated on social media platforms. Although humor often triggers instant laughter, understanding humor is not a straightforward process. Numerous studies (Hazlitt, 1845; Kant, 1786; Nerhardt, 1970; Jones, 1970; Shultz, 1972; Suls, 1972, 1983) have explored the cognitive process of humor appreciation. For instance, Hazlitt (1845) and Kant (1786) propose the incongruity theory, asserting that incongruity provokes laughter. Nerhardt (1970) further develops the idea by defining the discrepancy between expectation and content, such as punchlines or cartoons. Suls (1972) suggests the incongruity-resolution theory, positing that humor arises only when the incongruity is resolved by retrieving information from the joke, cartoon, or the perceiver's own knowledge. Since a sufficient understanding of the context is required to perceive and further resolve the incongruity, understanding humor can be challenging. Nevertheless, if AI models can understand humor, they could interact more effectively with humans by providing empathetic responses based on users' sense of humor. Furthermore, if the models understand short-form funny videos, they can recommend videos based on users' preferences or even generate witty titles based on video contexts.
Several studies (Hasan et al., 2019; Castro et al., 2019; Patro et al., 2021; Kumar et al., 2022) have collected humorous video datasets to investigate whether models can understand if a video is funny or not. However, these datasets have been gathered from limited domains, such as speeches or sitcoms. For example, Hasan et al. (2019) collect videos from TED, where there is a single speaker, and visual cues are restricted to gestures or facial expressions. Castro et al. (2019) build the MUStARD dataset from four sitcoms, mainly from "Friends" and "The Big Bang Theory," and Patro et al. (2021) collect the MHD dataset from the sitcom "The Big Bang Theory." However, in sitcoms, fixed actors follow a predetermined script on a constructed set, and the punchline plays a crucial role, so the visual elements may contribute less to humor. Moreover, the aforementioned video datasets only have binary labels indicating whether the content is humorous or not. As binary classification may not evaluate whether a model truly understands the humor in a video, Kumar et al. (2022) collect WITS with annotated text explanations. However, this dataset is limited to sarcasm, a specific form of humor, and focuses on sarcasm explanation in dialogue. This highlights the need for a humor explanation dataset that gives more weight to visual elements and covers general humor.
To this end, we curate ExFunTube, a dataset of funny, short-form videos with explanations. These videos are collected from user-generated YouTube videos, which are shared on the "r/youtubehaiku" subreddit. In this subreddit, users upload short-form funny videos, typically up to 30 seconds long. We develop a video filtering pipeline with GPT-3.5 (Ouyang et al., 2022), designed to exclude the videos with minimal visual impact on humor. Then, we annotate the collected videos with timestamps and text explanations of funny moments, as exemplified in Figure 1.
Recent LLMs show great performance for explaining humor present in text to some extent (Chowdhery et al., 2022). Inspired by the recent research on multimodal-informed prompting (Zeng et al., 2022), we convert video content into text, leveraging various zero-shot models on diverse modalities of the video. We provide LLMs with the text prompt as a linguistic summary of video content. Specifically, we consider two modalities of the video content: visual and audio. From the visual modality, we obtain dense video descriptions. From the audio modality, we acquire speech transcripts and sound labels. Finally, we chronologically integrate them into a text prompt that can maximize LLMs' ability for humor explanation.
Since evaluating a model's ability to explain humor is challenging, we report our results in three different ways: model-based automatic scores, rationale quality metrics with the moment localization task, and human evaluation. First, we report model-based metrics instead of those using word overlap. Second, we conduct a rationale quality experiment, which assesses the quality of explanations from the accuracy of predicting gold labels (Wiegreffe et al., 2021). Finally, we carry out human evaluations with sampled test examples. Through these three different results, our prompting approach considerably improves the humor explanation performance of three important LLMs: zero-shot GPT-3.5 and finetuned T5 (Raffel et al., 2020) and BART (Lewis et al., 2020).
To summarize, our key contributions are:
1. We curate ExFunTube, a dataset of 10,136 user-generated, funny short-form videos. Each video is annotated with timestamps and explanations of funny moments.
As compared in Table 1, our ExFunTube is unique over existing datasets in that our videos cover a wide range of domains with various types of humor that necessitate a multimodal understanding of the content.
2. We design a zero-shot video-to-text prompting that converts video content into text to maximize LLMs' ability to explain video humor.
3. With three different evaluation methods of model-based automatic scores, rationale quality scores, and human evaluations, we verify that our prompting improves LLMs' performance on humor explanation.

Related work
A recent study (2023) has collected TED talks and sitcoms with explanations for why audiences laugh.

Natural Language Explanation. As tasks of interest become increasingly complex, predicting labels may not be enough to evaluate the models' true understanding. Thus, some works make models explain their decisions as an alternative. For instance, FLUTE (Chakrabarty et al., 2022) augments e-SNLI (Camburu et al., 2018) to curate figurative texts with labels for natural language inference (NLI) tasks and evaluate model-generated explanations. To evaluate model explanations, they utilize a rationale quality metric suggested by Wiegreffe et al. (2021). As word-overlap scores may be insufficient for the evaluation of explanation, Wiegreffe et al. (2021) propose a rationale quality metric that calculates the difference of prediction scores for gold labels when rationales are provided or not: Acc(IR → O) − Acc(I → O), where I, R, and O denote input, rationale, and gold label, respectively. In addition, Sun et al. (2022) evaluate explanations by comparing the accuracy of joke classification with and without explanations: Acc(IE → O) − Acc(I → O), where E denotes explanation. We introduce a moment localization task to compute the rationale quality score of the video explanation.

Modular Vision-Language Learning. As pretrained models become larger and are trained with extensive datasets, various multimodal comprehension tasks have been tackled by composing these pretrained models. One approach is to transform visual information into discrete text words (Zeng et al., 2022; Yang et al., 2022; Wang et al., 2022b). Zeng et al. (2022) propose a modular framework that leverages an LLM to construct the input text for the subsequent model based on the output of multimodal models in the previous stage. They demonstrate performance improvements in image captioning and visual question answering (VQA) tasks. Another approach connects pretrained models through continuous feature embeddings (Patro et al., 2021; Alayrac et al., 2022; Tiong et al., 2022). Li et al. (2023a) pretrain additional lightweight modules that bridge the frozen image encoder and LLMs to eliminate the modality gap between the two frozen pretrained models. Tewel et al. (2022) connect the frozen image encoder with the frozen language decoder and evolve additional pseudo tokens during inference time to perform the video captioning task. Recently, there have been efforts to integrate these two different approaches. Li et al. (2023b) introduce VideoChat, a chat-centric video understanding system consisting of two modules: VideoChat-Text and VideoChat-Embed. The former generates text descriptions from the video and the latter encodes the video as embeddings. These text descriptions and embeddings are combined with a received question to form a prompt, based on which the LLM generates a response.
In our work, we combine vision-language pretrained models with LLMs through text for two uses: (i) video filtering for collecting multimodal funny videos and (ii) video-to-text generation to provide LLMs with a prompt of video content.

The ExFunTube Dataset
The ExFunTube dataset comprises 10,136 videos, each annotated with timestamps of funny moments and corresponding explanations describing why each moment is humorous. The purpose of this dataset is to evaluate the models' ability to explain why a given video is funny as a measure of understanding video humor.

Video Collection and Filtering
We initially crawl all 220K videos shared on the subreddit "r/youtubehaiku," where people share humorous short-form YouTube videos lasting up to 30 seconds. To ensure multimodal humor in videos, we design a four-step filtering pipeline that selects videos with both visual and verbal elements contributing to humor, as shown in Figure 2.
Video Caption and Transcript. In the first step (Figure 2 (a)), we obtain a transcript and a video caption to describe the verbal and visual elements of a video clip, respectively. We extract a video caption using a zero-shot video captioning model (Tewel et al., 2022). Since our dataset contains diverse videos such as animations and edited videos not present in previous video datasets, we choose a model that utilizes both CLIP (Radford et al., 2021) and GPT-2 (Radford et al., 2019), which are pretrained on huge Web-sourced data. We transcribe audio from the video clip using the speech-to-text model Whisper (Radford et al., 2022). We remove videos with no speech or in languages other than English.
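As a concrete illustration, here is a minimal sketch of this transcription-and-language check, assuming the clips are available as local video files; the helper name keep_video and the control flow are ours, not part of any released pipeline.

import whisper

# Same checkpoint as listed in Appendix A; smaller checkpoints also work for a quick test.
asr_model = whisper.load_model("large-v2")

def keep_video(video_path):
    # Returns True only if the clip contains English speech.
    result = asr_model.transcribe(video_path)
    if not result["text"].strip():      # no speech detected
        return False
    if result["language"] != "en":      # non-English speech
        return False
    return True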
Multimodal Humor. Our goal is to collect the videos that are funny from both verbal and visual elements, instead of funny from only one modality. Thus, as shown in Figure 2 (b), we first verify that the video is verbally funny; we do this by checking whether GPT-3.5 can find a funny utterance given a pair of the video caption and the transcript. If GPT-3.5 detects no funny utterances, we filter out the video. Next, as shown in Figure 2 (c), we again prompt GPT-3.5 to find a funny utterance with only a transcript (i.e., no video caption). If no funny utterance is detected, then we accept this video. The rationale is that the humor of this video is multimodal; the visual caption is required to identify the fun in the video. Otherwise, if GPT-3.5 can find a funny utterance in this case, we perform a further inspection as follows.

Difference in Explanations.
In the last step (Figure 2 (d)), GPT-3.5 is prompted to generate explanations in one sentence for the two cases: when given both a video caption and a transcript and when given only a transcript. We then measure the similarity between the two explanations using the SentBERT score (Reimers and Gurevych, 2019), which embeds each sentence and calculates the cosine similarity of their embeddings. The reason for adopting the SentBERT score is that it can reflect the semantics of the entire sentence. If the score is higher than the threshold, we exclude the video since the video caption does not contribute to the humor explanation. Otherwise, the video is accepted.
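Below is a minimal sketch of steps (c) and (d) for a video that has already passed step (b); the GPT-3.5 call is abstracted behind a hypothetical ask_gpt35 helper, and the SentBERT checkpoint, prompt wording, and threshold value are illustrative placeholders rather than the exact ones used in our pipeline.

from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-mpnet-base-v2")   # placeholder SentBERT checkpoint
SIM_THRESHOLD = 0.75                               # placeholder threshold

def is_multimodal_humor(transcript, caption, ask_gpt35):
    # Step (c): is the humor already detectable from the transcript alone?
    answer = ask_gpt35(
        "Transcript:\n" + transcript +
        "\nIs there a funny utterance in this transcript? Answer yes or no.")
    if answer.strip().lower().startswith("no"):
        return True   # the caption was needed in step (b), so the humor is multimodal

    # Step (d): compare one-sentence explanations with and without the caption.
    exp_both = ask_gpt35(
        "Video caption: " + caption + "\nTranscript: " + transcript +
        "\nExplain in one sentence why this video is funny.")
    exp_text = ask_gpt35(
        "Transcript: " + transcript +
        "\nExplain in one sentence why this video is funny.")
    embeddings = sbert.encode([exp_both, exp_text], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    # If the caption barely changes the explanation, the humor is not multimodal.
    return similarity < SIM_THRESHOLD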

Rationale of Our Pipeline.
There has yet to be a method to gauge the extent and manner in which visual elements contribute to humor. In other benchmarks, the multimodality of datasets has been validated by analyzing the performance gap when visual information is either provided or not (Hasan et al., 2019; Patro et al., 2021; Kumar et al., 2022). Similarly, we collect videos that exhibit differences in the assigned task (i.e., identifying humorous utterances by GPT-3.5) with or without visual information. In the field of NLI, previous works (Liu et al., 2022; Wiegreffe et al., 2022; Chakrabarty et al., 2022) leverage the power of LLMs such as GPT-3 (Brown et al., 2020) in creating figurative language examples or explanations for them. Likewise, we use GPT-3.5 to check the difference between generated explanations. To the best of our knowledge, this is the first approach that employs explanations for curating a dataset. Thanks to the pipeline, we can collect 21K high-quality multimodal humorous videos.
Postprocessing. To ensure that our dataset does not contain any disrespectful or harmful content towards individuals or animals, we conduct a thorough manual review of all 21K videos. We filter out the videos using the five criteria based on the safety objectives outlined by Thoppilan et al. (2022): (i) Discrimination: videos displaying discrimination based on race, gender, sexual orientation, age, or disability. (ii) Animal cruelty: videos depicting acts of animal cruelty, such as a cat falling. (iii) Dangerous goods, services, activities, or self-harm: videos featuring dangerous content like drugs, violence, or bullying. (iv) Obscenities or profanities: videos containing explicit language or sexual actions. (v) Shocking content: videos that include shocking content, such as gunshots or explosions. After the filtering, about 50% of the videos are removed, and we are left with 10,136 videos.

Data annotations
We crowdsource via Amazon Mechanical Turk (AMT) to annotate start and end timestamps of funny moments and provide text explanations for each moment. To participate in our dataset annotation, workers must meet the following criteria: a HIT approval rate of 99% or higher, a total of more than 10,000 approved HITs, and be located in one of the countries of AU, CA, GB, NZ, or US.
We conduct a qualification test for these workers, selecting those who can effectively explain humor. Out of 219 workers, only 60 pass the qualification test, reflecting the rigor of our selection.
For each video, we instruct one worker first to identify up to three funny moments within a video (up to 30 seconds long) and then annotate why each moment is funny. To make workers explain both humor elements and justifications, we provide a recommended format: "[What is funny]. It is funny because [Why funny]". We only accept responses including both descriptions (What) and justifications (Why) and reject those that lack either. Given the difficulty of the task, we offer detailed feedback to the workers, helping them improve their performance with a high annotation standard.
As a result, we obtain 11,166 explanations, each paired with the start and end timestamps of a moment. Explanations are 44.3 words long on average. Out of 10,136 videos, 9,222 contain one funny moment, 798 contain two, and 116 contain three. Most videos contain a single funny moment since videos are typically shorter than 30 seconds. However, given the varied content in each video, there can be any number of funny moments.

Approach
We explore an approach to explaining video humor. Our idea is first to convert the video content into fine-grained text and then take advantage of recent powerful LLMs in a zero-shot manner. We aim to extract as much information from videos into text as possible. Figure 3 shows our zero-shot video-to-text prompting that converts the video content into a text input to LLMs.

Fine-grained Text Prompts
Videos contain visual and audio modalities. The audio is further split into speech and sound. For each component, we initially generate text descriptions using state-of-the-art zero-shot models. Then, we arrange the text descriptions in chronological order and use them as a prompt.
Visual. To obtain high-quality text descriptions of the visual modality, we (i) segment the video, (ii) generate multiple frame captions, and (iii) retrieve the best-matching caption with a video-to-text model.
Example prompt from Figure 3 (excerpt): "Please generate an explanation of why a video is funny, given visual descriptions and dialogue (or monologue), and an audio description for the video. Explain as if you were watching the video. Visual descriptions and utterances will be given in chronological order. Visual descriptions and utterances (chronologically): (Speech, Panting, Dog) Scene: Speaker 1: 'Hey Luke, sit. Luke, dandelion. AHHHH!' A white dog with blue eyes being fed some kind of flower."

First, we employ PySceneDetect to divide a video into a set of N segments based on visual changes. During the filtering pipeline (§3.1), the speech-to-text model Whisper generates timestamps for each utterance. We also use them to split the segments further, resulting in more fine-grained and semantically meaningful video segments.
Next, we extract frames at a rate of 5 fps from each of the N video segments. We generate K (= 20) captions per frame using the image captioning model BLIP-2 (Li et al., 2023a) with a "Who is doing what?" prompt, which can enhance action detection. We then have a frame caption corpus (# frames × K captions) per segment. Subsequently, we use the video-to-text model InternVideo (Wang et al., 2022a) to retrieve the caption that best matches each video segment from the respective frame corpus. Finally, we obtain one caption per segment, resulting in a total of N captions, which are fine-grained descriptions of the visual component.
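A rough sketch of this visual branch is shown below; frame extraction is assumed to have been done already, the BLIP-2 checkpoint and prompt string are illustrative, and internvideo_score is a hypothetical stand-in for InternVideo's video-text similarity.

from transformers import Blip2Processor, Blip2ForConditionalGeneration

K = 20  # caption candidates per frame, as in the main text
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def describe_segment(frames, internvideo_score):
    # frames: PIL images sampled at 5 fps from one video segment.
    # internvideo_score: stand-in for InternVideo's video-text similarity,
    # returning one score per candidate caption.
    candidates = []
    for frame in frames:
        inputs = processor(images=frame,
                           text="Question: Who is doing what? Answer:",
                           return_tensors="pt")
        out = blip2.generate(**inputs, do_sample=True, top_p=0.9,
                             num_return_sequences=K, max_new_tokens=30)
        candidates.extend(processor.batch_decode(out, skip_special_tokens=True))
    scores = internvideo_score(frames, candidates)
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best].strip()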
Speech. We transcribe audio with Whisper (Radford et al., 2022) as done in our video filtering pipeline. We then predict the number of speakers and assign a speaker to each utterance utilizing ChatGPT (OpenAI, 2023). This speaker separation supports a deeper understanding of the dialogue.
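A minimal sketch of this speaker separation, with ChatGPT abstracted behind a hypothetical ask_chatgpt helper and an illustrative prompt wording:

def assign_speakers(utterances, ask_chatgpt):
    # utterances: Whisper transcript lines in chronological order.
    numbered = "\n".join(f"{i + 1}. {u}" for i, u in enumerate(utterances))
    prompt = ("The following utterances come from a short video, in order:\n"
              + numbered +
              "\nPredict how many speakers there are, and rewrite each utterance "
              "as 'Speaker N: <utterance>' while keeping the original order.")
    return ask_chatgpt(prompt)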
Sound. We extract sound tags to provide more context. We use an audio tagging model (Schmid et al., 2022) to classify the entire audio stream. We select the top three predicted tags whose confidence exceeds a threshold of 0.3. We concatenate the tags and insert them at the beginning of the prompt. This gives the model a sense of the overall atmosphere of the video.

Prompt Configuration and LLMs
After extracting text from visual, speech, and sound, we configure the prompt as in the example of Figure 3. The prompt starts with a predefined text ("Please generate ...") that instructs LLMs to explain as if they were watching the video. We then include sound tags enclosed in parentheses and arrange the extracted text of speech and visuals for each video segment chronologically. To distinguish between video segments, we begin each segment with "Scene: ". Finally, we ask LLMs to generate an explanation of up to three sentences.
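The assembly itself is plain string manipulation; the sketch below illustrates it under the assumption that per-segment captions, speaker-labelled utterances, and sound tags are already available (the instruction text is abridged from Figure 3, and the final request sentence is paraphrased).

INSTRUCTION = ("Please generate an explanation of why a video is funny, given visual "
               "descriptions and dialogue (or monologue), and an audio description for "
               "the video. Explain as if you were watching the video. Visual "
               "descriptions and utterances will be given in chronological order.")

def build_prompt(sound_tags, segments):
    # segments: list of (speaker_labelled_utterances, visual_caption) per video segment,
    # already in chronological order.
    lines = [INSTRUCTION, "",
             "Visual descriptions and utterances (chronologically):",
             "(" + ", ".join(sound_tags) + ")"]
    for utterances, caption in segments:
        lines.append("Scene: " + " ".join(utterances) + " " + caption)
    lines.append("")
    lines.append("Explain why the video is funny in up to three sentences.")
    return "\n".join(lines)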

Experiments
We experiment with different models to see how well they explain the humor in the ExFunTube videos. We evaluate the models in three ways: model-based automatic scores, rationale quality experiments, and human evaluation.
Table 2: Humor explanation results in terms of automatic scores (SentBERT and ROSCOE), rationale quality scores, and human rating. In the automatic scores, @K shows the proportion of test explanations whose scores are higher than K, and the mean column is the average score of each metric. For rationale quality scores with funny moment localization, we adopt two IoU thresholds, 0.3 and 0.5; lower scores are better. For human rating, five workers rate each of 100 randomly selected test videos from No (0), Weak No (0.25), Neutral (0.5), Weak Yes (0.75), to Yes (1).
After excluding the highest and lowest scores, the remaining scores are averaged.
Baselines. We compare four types of models: (i) Text-only LLMs generate explanations given only the transcript; we use T5 and BART with finetuning and GPT-3.5 as a zero-shot model. (ii) MAF (Kumar et al., 2022) is a multimodal end-to-end model designed for video sarcasm explanation. It generates explanations by receiving features of the three components (visual, speech, and audio).
We train the model on our dataset. (iii) VideoChat-Text (Li et al., 2023b) is a multimodal prompting framework that textualizes video information into text, including video/clip captions, objects contained in the video, and a transcript. Given the prompt, GPT-3.5 generates explanations in a zero-shot manner. (iv) LLMs with our prompting generate explanations given a prompt created by our zero-shot video-to-text prompting, using the same LLMs as (i): T5, BART, and GPT-3.5. Note that T5 and BART are finetuned to generate explanations given the generated prompts, while GPT-3.5 generates in a zero-shot manner.

Explanation Generation. For all models finetuned on our dataset, we employ K-fold cross-validation as follows. We divide the entire dataset of 10,136 videos into five equal-sized subsets. In each iteration, we train the model on three subsets, use one subset for validation, and test on the remaining subset. We repeat this process five times, rotating the test subset in each iteration. Finally, we obtain predicted explanations for the entire set.
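The rotation logic can be sketched as follows, where train_fn is a placeholder for finetuning T5 or BART on the given indices and returning an inference function.

import numpy as np

def five_fold_predictions(dataset, train_fn, seed=0):
    # train_fn(train_idx, val_idx) -> predict_fn is a placeholder for finetuning
    # a model and returning its inference function.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(dataset)), 5)
    predictions = {}
    for k in range(5):
        test_idx = folds[k]
        val_idx = folds[(k + 1) % 5]
        train_idx = np.concatenate(
            [folds[j] for j in range(5) if j != k and j != (k + 1) % 5])
        predict_fn = train_fn(train_idx, val_idx)
        for i in test_idx:
            predictions[int(i)] = predict_fn(dataset[int(i)])
    return predictions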
Evaluation. To compare the predicted explanation with the gold explanation for each video, we concatenate the explanations of all funny moments into a single, unified explanation. For more details on experiments, please refer to the Appendix.

Results of Model-based Automatic Scores
Since metrics based on word overlap may fail to reflect faithfulness and plausibility, as highlighted by Sun et al. (2022), we evaluate explanations using two model-based scores: the SentBERT score and ROSCOE (Golovneva et al., 2022). ROSCOE is a suite of metrics designed to evaluate the reasoning process within chain-of-thought prompting (Wei et al., 2022). It is suitable for our explanation task since our goal is to uncover the reason for laughter (i.e., why is the video humorous?). Among the various scores provided by ROSCOE, we use the reasoning alignment (RA) score, which computes the contextual similarity between the hypothesis and reasoning.
Table 2 reports the model-based automatic scores of different methods. We show not only the mean metric values but also the proportions of the test set with scores higher than various thresholds; @K represents the proportion of data points with scores equal to or greater than K.
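Given per-example scores, the mean and @K columns can be computed with a few lines; the thresholds below are illustrative.

import numpy as np

def at_k(scores, thresholds):
    # scores: one SentBERT (or ROSCOE-RA) score per test explanation.
    scores = np.asarray(scores, dtype=float)
    stats = {"mean": float(scores.mean())}
    for k in thresholds:
        stats[f"@{k}"] = float((scores >= k).mean())
    return stats

# e.g., at_k(sentbert_scores, thresholds=(0.5, 0.7)) for the SentBERT columns.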
The results show that, except for SentBERT @0.7, GPT-3.5 with our prompting reaches the best performance. Especially, the SentBERT and ROSCOE scores with our prompting are higher than those with text-only baselines in all cases. In addition, our method outperforms the multimodal end-to-end baseline MAF and the multimodal zero-shot prompting baseline VideoChat-Text. The comparison of @K metrics shows even more significant differences, particularly for SentBERT @0.5 and ROSCOE @0.8, where the performance margin ranges from 0.1 (BART) to 0.27 (GPT-3.5) compared to the text-only baselines. This means that using transcripts alone may not be sufficient to understand the humor in our videos.

Results of Rationale Quality Scores
We conduct a rationale quality experiment following Wiegreffe et al. (2021) and Sun et al. (2022). Since our dataset consists of videos, unlike theirs, we adapt the experimentation scheme by evaluating the rationale quality through a moment localization task, which aims at predicting funny moments defined by their start and end timestamps in a video given the text explanation.
We use QD-DETR (Moon et al., 2023) as a localizer and divide the entire dataset into 8:1:1 splits for training (8,110), validation (1,013), and testing (1,013). During training, the localizer learns to predict the gold timestamps given a gold explanation. At inference, we compute the rationale quality as the prediction difference of the localizer between when given a model-generated explanation and when given a gold explanation.
Let E_M be a model-generated explanation, E_G be a gold explanation, and τ be a threshold. For each test data point, we calculate the maximum IoU from the top 5 candidates given E_M or E_G, respectively denoted as IoU_M and IoU_G. We use the top 5 since there can be at most three funny moments in a single video and the localization predictions can overlap with each other. We compute the difference when IoU_G > τ. The final score S_τ is the sum of differences over all test data: S_τ = Σ_{i=1}^{N} 1(IoU_G^i > τ) · (IoU_G^i − IoU_M^i), where N is the number of test data points and 1(·) is the indicator function.
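Assuming the localizer's top-5 predictions have already been converted into a maximum IoU per test video for both explanation types, the score can be computed as in the sketch below.

def rationale_quality(iou_gold, iou_model, tau):
    # iou_gold / iou_model: per test video, the maximum IoU over the top-5 moments
    # predicted from the gold and the model-generated explanation, respectively.
    score = 0.0
    for g, m in zip(iou_gold, iou_model):
        if g > tau:              # only count videos the gold explanation can localize
            score += g - m       # smaller totals mean model explanations are closer to gold
    return score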
Table 2 shows the results when the IoU threshold τ is set to 0.3 and 0.5. A lower score is better as it is closer to the gold standard. For each LLM, the performance improves when our prompting is included compared to the corresponding text-only version. In particular, our approach improves GPT-3.5 the most, with a score gap of 13.3 at the 0.3 threshold and 13.2 at 0.5. Again, the performance of all LLMs with our prompting is better than MAF and VideoChat-Text.

Results of Human Evaluations
For human evaluation, we employ 10 AMT workers using the same criteria as in the dataset annotation but excluding those who already participated in the annotation. We randomly select 100 videos and evaluate explanations generated by all models except the T5 and VideoChat-Text baselines, which show worse automatic scores than the other text-only or multimodal baselines. We obtain human evaluations with two methods: rating and comparison.
For the rating, workers are asked to rate each explanation according to No (0), Weak No (0.25), Neutral (0.5), Weak Yes (0.75), and Yes (1) and check any shortcomings. We ask five workers for each explanation, exclude the highest and lowest scores, and take the average. For the comparison, workers compare GPT-3.5 with our prompting to (1) text-only GPT-3.5, (2) MAF, and (3) Gold explanations and choose the better explanation. We ask five workers for each pair of comparisons.
The rating results are presented on the far right of Table 2. The scores of BART and GPT-3.5 increase by about 0.1 when our prompting is included. The comparison results are presented in Figure 4. The number of votes for text-only GPT-3.5 is significantly lower than that for GPT-3.5 with our prompting, indicating that visual information is valuable and that our prompting conveys it effectively. In both rating and comparison, MAF shows lower performance than the text-only models despite being a multimodal model. This suggests that providing visual information as text to LLMs can be more effective than training a multimodal model end-to-end. Moreover, GPT-3.5 with our prompting, which shows the best results, still scores lower than Gold, indicating that understanding and explaining the humor in our dataset remains unsolved.

Analyzing LLMs with Humor Taxonomy
We classify our dataset into a total of 20 humor categories referring to Martin and Ford (2018) and Buijzen and Valkenburg (2004), and observe the performance of baselines across the humor taxonomy. We provide ChatGPT with the 20 categories along with a brief description and one example (i.e., one-shot learning) and instruct ChatGPT to classify each video based on the given explanation. Thanks to ChatGPT's powerful in-context learning capability, we effectively classify 10,136 videos based on their corresponding explanations.
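A sketch of this one-shot classification is shown below; the category list and the example are placeholders (the real prompt contains all 20 classes with short descriptions), and ask_chatgpt is a hypothetical helper around the ChatGPT API.

CATEGORIES = ["Slapstick", "Visual gags", "Clownish humor", "Irony", "..."]  # 20 classes in total

ONE_SHOT_EXAMPLE = ("Explanation: A man slips on a banana peel right after bragging "
                    "that he never falls.\nCategory: Slapstick")

def classify_humor(explanation, ask_chatgpt):
    prompt = ("Classify the humor of a video into exactly one of these categories:\n"
              + "\n".join("- " + c for c in CATEGORIES) + "\n\n"
              + ONE_SHOT_EXAMPLE + "\n\n"
              + "Explanation: " + explanation + "\nCategory:")
    return ask_chatgpt(prompt).strip()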
Figure 5 shows the models' performance by humor category. Excluding the Jokes and Self-deprecating classes, the performance increases with our prompting in all categories. In particular, the performance significantly increases in Clownish humor, Visual gags, and Slapsticks, which heavily reflect visual elements. This indicates that our zero-shot video-to-text prompting effectively conveys visual elements to the LLM.

Ablation Study
We compare the importance of each modality in humor explanation. Table 3 presents the SentBERT and ROSCOE scores when the visual, speech, and sound components are removed from the prompt one by one. For GPT-3.5 with our prompting, the performance without the visual component drops as much as when the speech is removed, indicating that the visual component plays an important role in our dataset. Moreover, the performance decreases when any of the components is removed, which suggests that all three components are crucial for understanding and explaining the humorous videos in our dataset. Additional ablation studies are presented in the Appendix.

Conclusion
We introduced ExFunTube, a dataset consisting of 10,136 user-generated videos annotated with timestamps and explanations of funny moments.
Our dataset aims to assess how well AI models understand and explain video humor. We devised a zero-shot video-to-text prompting to make existing LLMs better explain the video content. With three different evaluation methods, we demonstrated that the humor in our dataset is multimodal and that our prompting maximizes LLMs' ability to generate explanations. However, as the performance still falls short of human levels, our dataset remains sufficiently challenging and calls for future research. Furthermore, future work can consider training models with user feedback for personalized humor understanding.

A Experimental Details
Video Filtering Pipeline. In the video filtering pipeline, we utilize a zero-shot video captioning model from Tewel et al. (2022), the speech-to-text model Whisper (Radford et al., 2022), and GPT-3.5 (Ouyang et al., 2022). For the video captioning model, we optimize pseudo tokens for 25 iterations at inference time to guide the pretrained GPT-2 (Radford et al., 2019) with the CLIP ViT-L/14 image encoder (Radford et al., 2021). We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 0.008 and an L2 weight decay of 0.003. For Whisper, we use the large-v2 model. For GPT-3.5, we use text-davinci-003 and set the temperature to 0 for funny utterance detection and 0.3 for explanation generation.
Video-to-Text Prompting. During the prompting stage, we use BLIP-2 (Li et al., 2023a), InternVideo (Wang et al., 2022a), Whisper, ChatGPT (OpenAI, 2023), and an audio tagging model from Schmid et al. (2022). We use the COCO-pretrained BLIP-2 model with nucleus sampling. For InternVideo, we use CLIP ViT-L/14 as the image encoder. We set the temperature to 0.3 for ChatGPT, and we use the mn40_as model for audio tagging.
Explanation Generation. To generate explanations with baseline models, we finetune T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) with a batch size of 4 for 5 epochs. We use the AdamW optimizer with a learning rate of 2e-5 and an L2 weight decay of 0.01. Additionally, we train MAF (Kumar et al., 2022), a multimodal end-to-end model with an adaptor added to BART, with a batch size of 4 for 20 epochs. We use the AdamW optimizer with an L2 weight decay of 1e-4, and the learning rate is set to 5e-8 for the BART parameters and 5e-7 for the remaining parameters. We use BART Large for all models.
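For reference, a minimal finetuning configuration with these hyperparameters might look as follows (dataset loading and tokenization omitted; the output directory is a placeholder).

from transformers import (BartForConditionalGeneration, BartTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

args = Seq2SeqTrainingArguments(
    output_dir="exfuntube-bart",          # placeholder path
    per_device_train_batch_size=4,
    num_train_epochs=5,
    learning_rate=2e-5,
    weight_decay=0.01,                    # L2 weight decay with AdamW (the default optimizer)
)

# trainer = Seq2SeqTrainer(model=model, args=args, tokenizer=tokenizer,
#                          train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()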
Rationale Quality Experiments. For the rationale quality experiments with moment localization, we train QD-DETR (Moon et al., 2023) with a batch size of 128 for 200 epochs. We use the AdamW optimizer with a learning rate of 1e-5 and an L2 weight decay of 1e-4. We optimize with the moment retrieval loss consisting of the L1 loss, the cross-entropy loss, and the generalized IoU loss. We use loss balancing terms of 10, 1, and 2 for each of them, respectively. We do not use the saliency loss. We use the bert-base-uncased model (Devlin et al., 2019) as the text encoder with the max query length set to 400 and CLIP ViT-L/14 as the video encoder. We sample video frames at a rate of 1 fps. Except for the aforementioned hyperparameters, we use the default values for all models.

B Additional Ablation Study
We conduct ablation experiments on BART and T5 with our prompting as well, and the results are as shown in Table 4. Similar to the results of GPT-3.5 with our prompting, using all modalities achieves the best performance, and there is a certain degree of performance decrease when the visual component is removed.

C Crowdsourcing Details
We use three different user interfaces of Amazon Mechanical Turk (AMT) for (i) annotating the timestamps and explanations of funny moments, and the human evaluation of (ii) rating and (iii) comparison, as shown in Figures 6-8, respectively. We guarantee AMT workers receive fair wages of approximately $18 per hour. Additionally, we allocate about $2 as compensation for each data point and grant additional wages to workers contributing extended time and effort.

D Case Study
T5
A man is sitting in a chair and he says "look what you did to the dandelion" and then he screams. It is funny because it is unexpected and because he is screaming at the same time.

BART The video is funny because it presents a close up capture of a dandelion that has been knocked over by a man that is attempting to give it a flower. Furthermore, this scene is comical because the man featured seems to be comically flailing his arms and legs in such a way that makes it appear as if he is trying to dance to the beat of the song playing the background. The video is also funny when the man says, "I was just trying to give him a flower," because it is clear that he is not in the mood to dance, but instead is merely trying to get the attention of his companion.

GPT-3.5
The video is funny because of the contrast between the speaker's expectation of Luke's reaction to the dandelion and his actual reaction. The speaker's shock and dismay at Luke's destruction of the dandelion is comically exaggerated, and the speaker's surprise is punctuated by the loud "Aye!" when the dandelion is destroyed. The speaker's over-the-top reaction to the destruction of the dandelion is what makes the video funny.
MAF The video is funny because it presents a clip of a man attempting to give a dandelion to another man, but instead of giving him a flower, the man throws it to the ground, which is unexpected. Furthermore, it is comical because it is unexpected to see a man throw a flower at another man.

T5 w/ P
A man is holding dandelions and a dog is shown eating them. It is funny because the dog is so aggressive and the man is trying to get him to sit, but he is not doing so well.

BART w/ P
A man gives his dog a dandelion and the dog starts licking it. It is funny because it is unexpected, and also because the dog seems to enjoy it.

GPT-3.5 w/ P
The video is funny because the white dog, Luke, is initially obedient when told to sit, but then proceeds to eat the dandelion instead of just sniffing it. This causes another person to scream in shock, leading to a chain reaction of laughter and more dogs being fed dandelions. The unexpected behavior of the dogs and the exaggerated reactions of the people make the video humorous.
Gold It's funny because the white dog is shown a dandelion and then the dog eats the dandelion unexpectedly. It's funny because the black and white dog is shown a dandelion and does the same thing as the dog and eats the dandelion. Also, the man's intention was simply to give the dog a flower, not for the dog to eat it. The dog turns and notices the dandelion, then goes over and eats the dandelion from the man's hand. It's funny because of the man's exaggerated reaction.

T5
The video is funny because it presents a comical moment where a man is attempting to push his body to the point where he is unable to do so. Furthermore, the man's remark is comical because he is attempting to make a joke about how hard he is pushing his body to the point where he is unable to do so.
BART The video is funny because it presents a close up capture of a female individual that appears to be attempting to "push a lot more." Furthermore, this scene is comical because it is clear that the individual featured is attempting to demonstrate a "push-a-thon" in order to "get the muscles developed," which is why it is laughable to see that the woman featured is actually attempting to push herself to the extent that she is actually "pushing" herself.

GPT-3.5
The speaker's enthusiasm and repetition of the phrase "muscles" is comical, as if they are trying to emphasize the importance of the muscles. The speaker's exaggerated emphasis on the word "pushed" also adds to the humor as it implies that the person they are referring to is putting in a lot of effort. The speaker's enthusiasm and exaggeration makes the video funny.
MAF The video is funny because it presents a clip of a man saying, "She pushed a lot more." It is comical because it is unexpected to see a man say this, and also because the video is edited to make it seem as if the man is implying that the woman pushed him harder than she actually did.

T5 w/ P
A man is riding a lawnmower and he is talking about getting his muscles developed, and then he says "she pushed a lot more." It is funny because he is talking about pushing his muscles, and also because he is talking about pushing his muscles.

BART w/ P
A man is riding a lawnmower next to a bunch of sheep, and he says, "She pushed a lot more." It is funny because it is unexpected to see someone push a mower like that, and also because the man says it in a way that sounds like he is bragging.

GPT-3.5 w/ P
The visual of a person riding a lawnmower with a herd of sheep is humorous in itself. The dialogue between the two speakers adds to the comedic effect as they discuss the man's muscles and how much he has pushed. The absurdity of the situation makes it even funnier.
Gold It's funny to me because the implication is that lawnmower is sheep powered, because the sheep are pushing it, instead of horse powered. It's also weird to see sheep doing that.
Get the muscles developed. And the muscles. She pushed a lot more.
Figure 14: (Correct) An example of explanation generation results. LLMs with our prompting describe the verbal ("She pushed a lot more") and visual elements ("a person riding a lawnmower with a herd of sheep") contributing to humor.

T5
The video is funny because it features a man who is attempting to enter an aquarium by yelling "Aquarium, Aquarium, Aquarium, Aquarium."

BART
It's funny because the narrator keeps saying "I want to go to the aquarium" over and over again, as if he's trying to convince the audience that they should go to an aquarium.

GPT-3.5
The speaker in the video is saying the same phrase over and over again in a silly, sing-song voice. The repetition of the phrase is humorous, and the speaker's enthusiasm for wanting to go to the aquarium is endearing. The way they keep asking "somebody" to take them to the aquarium is especially funny.

MAF
It's funny because the man is talking about going to the aquarium, but he says he wants to go to someone else's aquarium.

T5 w/ P
A man is holding a goldfish in his hand and he says "Aquarium, aquarium, I wanna go to the aquarium." The fish then starts to squirm and he says "Okay, aquarium, somebody." It is funny because the man is trying to get the fish to go to the aquarium, but the fish is not interested in going.

BART w/ P
It's funny because the narrator keeps saying "aquarium" over and over again, but the fish doesn't seem to care.

GPT-3.5 w/ P
The video is funny because goldfish is portrayed as a character that wants to go to the aquarium. The goldfish is seen swimming around the tank and looking up at the camera with its big eyes, as if it is asking for help. The video ends with an ad for the Oregon Coast Aquarium in Newport, making it even funnier.

Gold
The goldfish is swimming around in his bowl saying, "I want to go to the aquarium." It is funny because he is a goldfish and he's already in a tank full of water.
Hey, Mr. Bumblebee. Thanks for the honey. High five. Oh, yeah, there.
Figure 15: (Correct) An example of explanation generation results. Except for LLMs with our prompting, the presence of a goldfish (or fish) is not mentioned. Particularly, GPT-3.5 with our prompting accurately describes the character and behavior of the goldfish, and even mentions the content of the concluding advertisement.

T5
The man is describing a baguette that he bought and he says he liked the color, and then he says "some people said they like it, some people said they don't like it." It is funny because he is implying that people don't like the color of the baguette.
BART A man asks a woman what color is her baguette. She replies, "I had a mixed reviews on the color, some people said they don't like it." The man then asks her, "And I said, well what color?" It is funny because it is unexpected.

GPT-3.5
The speaker's response to the question of whether people like the color or not is hilarious.
The unexpected and absurd question of what color their baguette is catches the audience off guard and elicits a burst of laughter. The speaker's deadpan delivery of the line only adds to the humor.
MAF A man says "I had a mixed reviews on the color of my baguette." It is funny because it is unexpected.

T5 w/ P
The man in the cartoon says "I like the color, you like the color?" and then the car is towed away. It's funny because the car is a very bright color and the man in the cartoon is a very nerdy looking guy.

BART w/ P
A man says "I had a mixed reviews on the color" while looking at a car. This is funny because it's a play on the word mixed.

GPT-3.5 w/ P
The video is funny because the speaker is making a joke about the color of the car, asking what color the person's baguette is in comparison. The visuals add to the humor, showing the man in sunglasses and orange shirt next to the orange sports car, and the tow truck driver with his tool. The sound of the car at the beginning also adds to the comedic effect.

Gold
The video is funny because it features an animated clip featuring a man that is towing a luxury car while in the presence of its presumably wealthy owner that makes the sly comment in stating, "I had mixed reviews on the color. Some people say they like it, some people said they don't like it. I said well, what color is your Baguette?" This remark is laughable because it implies that the man is rich and the person that is questioning the car color is not and is likely not in ownership of a Baguette, so therefore, their opinion is irrelevant.
I like the color. You like the color? Yeah. I had a mixed reviews on the color. Some people said they like it, some people said they don't like it. And I said, well what color is your baguette?
Figure 16: (Correct) An example of explanation generation results. Text-only LLMs do not mention a car that has a similar color to a Baguette. Meanwhile, LLMs with our prompting provide details about the car in the scene. Note that GPT-3.5 with our prompting can explain the sarcasm related to the baguette like in Gold.
Timestamps & explanations of the funny moments:
2s ~ 4s It's funny because the white dog is shown a dandelion and then the dog eats the dandelion unexpectedly.
8s ~ 10s It's funny because the black and white dog is shown a dandelion and does the same thing as the dog and eats the dandelion. Also, the man's intention was simply to give the dog a flower, not for the dog to eat it.
17s ~ 20s The dog turns and notices the dandelion, then goes over and eats the dandelion from the man's hand. It's funny because of the man's exaggerated reaction.
did to the dandelion. I was trying to give him a flower.

Figure 1: An example from the ExFunTube dataset. We curate funny short-form videos in various domains through a filtering pipeline that verifies both verbal and visual elements contributing to humor. Each video is annotated with timestamps and explanations for funny moments. In this example, three funny moments are identified.

Figure 2: The video filtering pipeline selects multimodal funny videos. Red boxes display the actual prompts provided to GPT-3.5. See the details in §3.1. (a) We generate a transcript and a caption from the input video. (b) Via GPT-3.5 prompting, we filter out the video that is not funny from the transcript and caption. (c) The video is accepted if it is funny from both the transcript and caption but not from the transcript only, since its humor is multimodal. (d) GPT-3.5 generates humor explanations with or without the video caption. We remove the video if the two explanations are too similar, since its humor is not multimodal. Examples for each case are presented in the Appendix.
Figure 3: (a) A zero-shot video-to-text prompting for converting video content into fine-grained text (§4.1). For the visual modality, the video is first divided into N segments, for each of which many possible captions are generated, and the best one is chosen finally. For the audio modality, a transcript with speaker separation and sound tags are obtained. (b) The fine-grained text is configured as an input prompt to LLMs (§4.2).

Figure 5: Explanation performance according to humor taxonomy. We categorize all videos into 20 humor classes and compare the performance of eight different baselines in terms of the SentBERT score. The humor taxonomy is arranged in descending order of proportion in our dataset.

Figures 9-12 show representative videos accepted or excluded by our video filtering pipeline. Figures 13-18 provide several examples of humor explanations that our baseline models actually generate. We color-code relevant (blue) and irrelevant (red) information contained in the generated explanations. LLMs with our prompting, especially GPT-3.5, correctly explain the funny moments in Figures 13-16, while text-only LLMs and MAF fail to. All the models fail to explain the humorous moments in Figures 17-18.

Figure 6: A user interface for annotating timestamps and explanations of humorous moments. Workers are asked to watch a video, identify up to three funny moments, and provide the start/end timestamps along with the explanation for each moment.

Figure 7: A user interface for human evaluation through rating. Workers are asked to rate the explanation on a scale of No, Weak No, Neutral, Weak Yes, to Yes, and to choose any shortcomings if present.

Figure 8: A user interface for human evaluation through comparison. Workers are asked to compare GPT-3.5 with our prompting to text-only GPT-3.5, MAF, and Gold, respectively, and select the superior one.
Figure 9: An example of a video excluded in the second step (Figure 2 (b)) of the filtering pipeline.

Funny
Is that a magnifying glass? Yep. Are you feeling warm all of a sudden? Video shows a man in the insect world, and it's.

Figure 11: An example of a video accepted in the third step (Figure 2 (c)) of the filtering pipeline.

Figure 12: An example of a video excluded in the fourth step (Figure 2 (d)) of the filtering pipeline.

Figure 13: (Correct) An example of explanation generation results. GPT-3.5 with our prompting correctly describes the unexpected behavior of dogs and the exaggeration of the people that provoke laughter.

Table 1: Comparison of our ExFunTube with previous humor datasets: ExPUN (Sun et al., 2022), AVH&FOR (Chandrasekaran et al., 2016), NYCC (Hessel et al., 2022), MORE (Desai et al., 2022), MUStARD (Castro et al., 2019), WITS (Kumar et al., 2022), UR-FUNNY (Hasan et al., 2019), and MHD (Patro et al., 2021). In the Modality column, I, V, A, and T denote image, video, audio, and text, respectively. The #Data Points column shows only the number of positive (humorous) data points. The Data Config column specifies the composition of each data point. The Exp column indicates the presence of annotated explanations. In the Task column, Exp and BC are abbreviations of explanation generation and binary classification, respectively.

Table 3: Ablation results of GPT-3.5 with our prompting measured by SentBERT and ROSCOE scores when each modality component is removed. V, T, and A denote visual, speech, and sound, respectively.

Table 4: Ablation results of T5 and BART with our prompting measured by SentBERT and ROSCOE scores when each modality component is removed. V, T, and A denote visual, speech, and sound, respectively.