MultiMET: A Multimodal Dataset for Metaphor Understanding

Metaphor involves not only a linguistic phenomenon, but also a cognitive phenomenon structuring human thought, which makes understanding it challenging. As a means of cognition, metaphor is rendered by more than texts alone, and multimodal information in which vision/audio content is integrated with the text can play an important role in expressing and understanding metaphor. However, previous metaphor processing and understanding has focused on texts, partly due to the unavailability of large-scale datasets with ground truth labels of multimodal metaphor. In this paper, we introduce MultiMET, a novel multimodal metaphor dataset to facilitate understanding metaphorical information from multimodal text and image. It contains 10,437 text-image pairs from a range of sources with multimodal annotations of the occurrence of metaphors, domain relations, sentiments metaphors convey, and author intents. MultiMET opens the door to automatic metaphor understanding by investigating multimodal cues and their interplay. Moreover, we propose a range of strong baselines and show the importance of combining multimodal cues for metaphor understanding. MultiMET will be released publicly for research.


Introduction
Metaphor is frequently employed in human language and its ubiquity in everyday communication has been established in empirical studies (Cameron, 2003; Steen, 2010; Shutova et al., 2010). Since Lakoff and Johnson (1980) introduced conceptual metaphor theory (CMT), metaphor has been regarded as not only a linguistic, but also a cognitive phenomenon for structuring human thought. Individuals use one usually concrete concept in metaphors to render another usually abstract one for reasoning and communication. For example, in the metaphorical utterance "knowledge is treasure," knowledge is viewed in terms of treasure to express that knowledge can be valuable. According to CMT, metaphor involves the mapping process by which a target domain is conceptualized or understood in terms of a source domain.
As a means of cognition and communication, metaphor can occur in more modes than text alone. Multimodal information in which vision/audio content is integrated with the text can also contribute to metaphoric conceptualization (Forceville and Urios-Aparisi, 2009; Ventola et al., 2004). A multimodal metaphor is defined as a mapping of domains from different modes such as text and image, text and sound, or image and sound (Forceville and Urios-Aparisi, 2009). For example, in Figure 1 (a), the metaphorical message of fire in the sky is conveyed by a mapping between the target domain "sky" (sunset) and the source domain "fire" from two modalities. Figure 1 (b) offers another example with the metaphor of lungs made from cigarettes, in which a relation is triggered between two different entities, lung and cigarette, with the perceptual idea that smoking causes lung cancer. The source domain "cigarette" comes from the image, while the target domain "lung" appears in both text and image. Understanding multimodal metaphor requires decoding metaphorical messages and involves many cognitive efforts, such as identifying the semantic relationship between two domains (Coulson and Van Petten, 2002; Yang et al., 2013), interpreting authorial intent from multimodal messages (Evan Nelson, 2008), and analyzing the sentiment metaphors convey (Ervas, 2019), all of which might be difficult for computers to do.
Qualitative studies have investigated the interplay between different modes underlying the understanding of multimodal metaphors in communicative environments such as advertisements (Forceville et al., 2017; Urios-Aparisi, 2009), movies (Forceville, 2016; Kappelhoff and Müller, 2011), songs (Forceville and Urios-Aparisi, 2009; Way and McKerrell, 2017), and cartoons (Refaie, 2003; Xiufeng, 2013). In particular, with the development of mass communication, texts nowadays are often combined with other modalities such as images and videos to achieve a vivid, appealing, persuasive, or aesthetic effect for the audience. This rapidly growing trend toward multimodality requires a shift to extend metaphor studies from monomodality to multimodality, as well as from theory-driven analysis to data-driven empirical testing for in-depth metaphor understanding.
Despite the potential and importance of multimodal information for metaphor research, there has been little work on the automatic understanding of multimodal metaphors. While a number of approaches to metaphor processing have been proposed with a focus on text in the NLP community (Shutova et al., 2010; Mohler et al., 2013; Jang et al., 2015, 2017; Pramanick et al., 2018; Liu et al., 2020), multimodal metaphors have not received the full attention they deserve, partly due to the severe lack of multimodal metaphor datasets, whose creation is challenging, time-consuming, and labor-intensive.
To overcome the above limitations, we propose a novel multimodal metaphor dataset (MultiMET) consisting of text-image pairs (text and its corresponding image counterparts) manually annotated for metaphor understanding. MultiMET will expand metaphor understanding from monomodality to multimodality and help to improve the performance of automatic metaphor comprehension systems by investigating multimodal cues. Our main contributions are as follows:
• We create a novel multimodal dataset consisting of 10,437 text-image pair samples from a range of sources including social media (Twitter and Facebook) and advertisements. MultiMET will be released publicly for research.
• We present fine-grained manual multimodal annotations of the occurrence of metaphors, metaphor category, what sentiment metaphors evoke, and author intent. The quality control and agreement analyses for multiple annotators are described.
• We quantitatively show the role of textual and visual modalities in metaphor detection, and whether and to what extent metaphor affects the distribution of sentiment and intention, thereby exploring the mechanism of multimodal metaphor.
• We propose three tasks to evaluate fine-grained multimodal metaphor understanding abilities, including metaphor detection, sentiment analysis, and intent detection in multimodal metaphor. A range of baselines with benchmark results are reported to show the potential and usefulness of MultiMET for future research.
Related Work

Metaphor Datasets
Although datasets of multimodal metaphors are scarce, a variety of monomodal datasets for metaphor studies have been created in recent years. Table 1 lists these datasets with their properties. Numerous text metaphor datasets have been published for metaphor processing in the NLP community including several popular ones, e.g., the VU Amsterdam Metaphor Corpus (VUAMC) (Steen, 2010), TroFi Example Base (Birke and Sarkar, 2006), and MOH-X (Mohammad et al., 2016). The largest one, VUAMC, consists of over 10,000 samples spread across 16,000 sentences, while others contain fewer than 5,000 samples. However, most existing metaphor datasets contain only textual data. Image metaphor datasets are few and limited in both size and scope, such as VisMet, which is a visual metaphor online resource consisting of only 353 image samples. Although  constructed both text and image samples, their images were obtained by using a given phrase and queried Google

Metaphor Understanding
Automatic metaphor understanding requires accomplishing certain tasks to decode metaphorical messages. In this paper, we focus on three important tasks for NLP in understanding metaphor: metaphor detection, sentiment analysis, and author intent detection. There has been increasing interest in NLP in various approaches to metaphor detection based on monomodal text. Early metaphor studies focused on hand-constructed knowledge and machine learning techniques (Mason, 2004; Turney et al., 2011; Tsvetkov et al., 2014; Hovy et al., 2013). Others have also used distributional clustering (Shutova et al., 2013) and unsupervised approaches (Mao et al., 2018). More recently, deep learning models have been explored to understand metaphor. However, little has been explored in multimodal metaphor detection except by , who are among the very few to explore the fusion of textual and image modalities to detect multimodal metaphor. Their results demonstrate the positive effect of combining textual and image features for metaphor detection.
However, in their work, image features are extracted from a small set of constructed examples rather than from natural samples of texts integrated with images, such as those in MultiMET. In addition, apart from multimodal metaphor detection, tasks related to metaphor understanding such as sentiment detection and author intent detection in multimodal metaphor have rarely been studied, although similar multimodal studies exist for other tasks (Wang et al., 2017; Zadeh et al., 2017; Kruk et al., 2019).

Data Collection
With the goal of creating a large-scale multimodal metaphor dataset to support research on understanding metaphors, we collected data that contains both text and image from a range of sources including social media (Twitter and Facebook) and advertisements. Table 2 shows an overview of the dataset statistics.
Social Media. To collect potential metaphorical samples from Twitter and Facebook, we retrieved posts by querying the hashtags "metaphor" or "metaphorical". We collected publicly available Twitter and Facebook posts using Twitter and Facebook APIs complying with Twitter and Facebook's terms of service. What an author labels as metaphorical is not always aligned with the definition of metaphor used in our study. To collect metaphors whose nature accorded with what we define as multimodal metaphors, we re-annotated each candidate post that its author had tagged as metaphor with a hashtag as "metaphorical or literal", following the procedure described in the annotation sections below.
Advertisements. Based on our review of the linguistic literature on multimodal metaphor, we focused on an important source that is the main context of study: advertisements. Metaphorical messages abound in advertisements, which offer a natural and rich resource of data on metaphor and how textual and visual factors combine and interact (Sobrino, 2017; Forceville et al., 2017). We collected     (Ye et al., 2019). To obtain the textual information, we extracted the text embedded in the images using the API provided by Baidu AI. After that, human annotators corrected inaccurately extracted text, removed any blurred text, and obtained text + image pairs from advertisements.

Data Filter
For text data, we removed external links and mentions (@username); we removed non-English text using the LANGID (Lui and Baldwin, 2012) library to label each piece of data with a language tag; we removed strange symbols such as emojis; we removed "metaphor" or "metaphoric" when they were regular words rather than hashtags, because explicit metaphorical expressions are not our interest (e.g., "This metaphor is very appropriate"); we removed text with fewer than 3 words or more than 40 words. For image data, we removed text-based images (all the words are in the image), as well as images with low resolution. Because this task is about multimodal metaphor, it is necessary to keep the data consistent across modalities. In other words, either both the image and the text of a pair are removed, or neither. In addition, in the de-duplication step, we removed a sample only when both its text and its image were repeated.
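The text-side filters and the pair-level de-duplication described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the regular expressions, the use of the Unicode category "So" as a stand-in for "strange symbols such as emojis", and the function names are all hypothetical. Language identification, the handling of the plain words "metaphor"/"metaphoric", and the image-side filters are omitted.

```python
import re
import unicodedata

# Assumed patterns for external links and @username mentions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_text(text):
    """Strip links, mentions, and symbol characters, then normalize spaces."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Assumption: characters in Unicode category "So" (Symbol, other),
    # which covers most emojis, approximate the "strange symbols".
    text = "".join(ch for ch in text if unicodedata.category(ch) != "So")
    return " ".join(text.split())


def keep_text(text, min_words=3, max_words=40):
    """Keep texts with at least 3 and at most 40 words."""
    return min_words <= len(text.split()) <= max_words


def filter_pairs(pairs):
    """Filter (text, image_id) pairs, keeping the modalities consistent.

    A pair is dropped as a whole when its text fails a filter, and a
    duplicate is dropped only when both text and image are repeated.
    """
    seen = set()
    kept = []
    for text, image_id in pairs:
        cleaned = clean_text(text)
        if not keep_text(cleaned):
            continue  # the image is dropped together with the text
        key = (cleaned, image_id)
        if key in seen:
            continue  # same text and same image: a true duplicate
        seen.add(key)
        kept.append(key)
    return kept


pairs = [
    ("Knowledge is treasure @bob https://t.co/x", "img1"),
    ("Knowledge is treasure @bob https://t.co/x", "img1"),  # duplicate pair
    ("Knowledge is treasure", "img2"),  # same text, different image
    ("Too short", "img3"),  # fewer than 3 words
]
filtered = filter_pairs(pairs)
```

In this sketch the second pair is removed as a duplicate, while the third survives because only its text, not its image, repeats.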

Annotation Model
We annotated the text-image pairs with the occurrence of metaphors (literal or metaphorical); (if metaphorical) relations of target and source domain (target/source: target/source vocabulary in text or verbalized target/source vocabulary in image); target/source modality (text, image, or text + image); metaphor category (text-dominant, image-dominant, or complementary); sentiment category (the sentiment metaphors evoke, namely very negative, negative, neutral, positive, or very positive); and author intents (descriptive, expressive, persuasive, or other). The annotation model was AnnotationModel = (Occurrence, Target, Source, TargetModality, SourceModality, MetaphorCategory, SentimentCategory, Intent, DataSource). Figure 3 is an annotation example.
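As an illustration of how one annotated record could be represented in code, the sketch below maps the nine-field annotation tuple onto a Python dataclass. The class layout, the lowercase label strings, and the validation rules are assumptions made for this example; the released dataset may use a different schema. The field values in the example record are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Optional

# Label sets taken from the annotation scheme; the string spellings are assumed.
MODALITIES = {"text", "image", "text+image"}
METAPHOR_CATEGORIES = {"text-dominant", "image-dominant", "complementary"}
SENTIMENTS = {"very negative", "negative", "neutral", "positive", "very positive"}
INTENTS = {"descriptive", "expressive", "persuasive", "other"}


@dataclass
class AnnotationModel:
    occurrence: str                           # "literal" or "metaphorical"
    sentiment_category: str
    intent: str
    data_source: str
    # The fields below are only filled in for metaphorical samples.
    target: Optional[str] = None
    source: Optional[str] = None
    target_modality: Optional[str] = None
    source_modality: Optional[str] = None
    metaphor_category: Optional[str] = None

    def __post_init__(self):
        if self.occurrence not in {"literal", "metaphorical"}:
            raise ValueError("unknown occurrence label")
        if self.sentiment_category not in SENTIMENTS:
            raise ValueError("unknown sentiment category")
        if self.intent not in INTENTS:
            raise ValueError("unknown intent")
        if self.occurrence == "metaphorical":
            if not self.target or not self.source:
                raise ValueError("metaphorical samples need target and source")
            if (self.target_modality not in MODALITIES
                    or self.source_modality not in MODALITIES):
                raise ValueError("unknown target/source modality")
            if self.metaphor_category not in METAPHOR_CATEGORIES:
                raise ValueError("unknown metaphor category")


# Hypothetical record loosely modeled on the lung/cigarette example.
example = AnnotationModel(
    occurrence="metaphorical",
    sentiment_category="negative",
    intent="persuasive",
    data_source="advertisements",
    target="lung",
    source="cigarette",
    target_modality="text+image",
    source_modality="image",
    metaphor_category="complementary",
)
```

Making the domain fields optional reflects the "(if metaphorical)" condition in the scheme: literal samples carry only occurrence, sentiment, intent, and data source.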

Metaphor Annotation
Metaphor category. There are a variety of ways in which texts and images are combined in multimodal content (Hendricks et al., 2016). Based on our review of the literature and observation of the samples in our dataset, we follow Tasić and Stamenković (2015) and divide multimodal metaphor into three categories: text dominant, image dominant, and complementary. Sometimes metaphors are expressed through texts with a mapping between source and target domains while the accompanying images serve as a visual illustration of the metaphors in the text, which is text dominant. As in Figure 2 (a), the text itself is sufficient to convey metaphorical information and can be identified as a metaphorical expression. "Highway" is a visual illustration of the source domain in a textual modality. By contrast, in the image dominant category, images play the dominant role in conveying metaphorical information and they provide sufficient information for readers to understand the metaphors. In Figure 2 (b), where we see the metaphorical message "Beetle (cars) are blood cells," the text enriches the understanding of metaphorical meaning by adding an explanation "your heart beats faster" to the visual manifestation. The complementary category involves a roughly equal role of texts and images in rendering metaphorical information. The understanding of metaphor depends on the interaction of and balance between different modalities. If texts and images are interpreted separately, metaphors cannot be understood. In Figure 2 (c), when people read the text, "A kitten is kissing a flower," and the inside text "Butterflies are not insects," they do not realize the metaphorical use until they observe the butterfly in the corresponding image and infer that the target "butterfly" is expressed in terms of the source "flower".
Metaphorical or literal. Our annotations focus on the dimension of expression, which involves identification of metaphorical and literal expressions by verbal means and visual means (Forceville, 1996; Phillips and McQuarrie, 2004). The metaphor annotation takes place at the relational level, which involves the identification of metaphorical relations between source and target domain expressions. For the text modality, source and target domain expressions are the source and target domain words used in metaphorical texts. For the image modality, source and target domain expressions are verbalizations of the source and target domains shown in the visual modality. That is, the annotation of metaphorical relations represented in the image modality involves the verbalization of the metaphor's domains. Annotations involve naming and labeling what is linguistically familiar. Unlike the text modality, which relies on explicit linguistic cues, metaphorical relations in the image modality are annotated based on perceptions of visual units, which determine the linguistic familiarity of images as well as existing words in the metaphor's domains. Following Šorm and Steen (2018), annotators identified the metaphorical text+image pairs by looking at the incongruous units and explaining one non-reversible "A is B" identity relation, where two domains were expressed by different modalities.

Intent and Sentiment Annotation
Interpreting authorial intent from multimodal messages in metaphor is important for understanding metaphors. As mentioned above, within CMT, the essence of metaphor is using one thing from a source domain to express and describe another from a target domain. This implies that one important intent of creating metaphor could be to enable readers to understand the entities being described better. "Perceptual resemblance" is a major means of triggering a metaphorical relation between two different entities (Forceville and Urios-Aparisi, 2009). We name it descriptive intent, which involves visual and textual representations regarding the object, event, concept, information, action or character, etc. Moreover, in modern times, the increasing ubiquity of multimodal metaphors means that people cannot ignore their power of persuasion (Urios-Aparisi, 2009). People often leverage metaphor in communication environments such as advertisements and social media to persuade readers to buy or do things. We name this intent persuasive. In addition, inspired by a variety of arousing, humorous, or aesthetic effects of metaphors (Christmann et al., 2011), the expressive intent is included in our intent annotation with an enlarged definition: expressing attitude, thought, emotion, feeling, attachment, etc. Based on these factors as well as investigation of the samples in our dataset, we generalized their taxonomy and listed the categories of author intent in metaphor as descriptive, persuasive, expressive, and others.
Numerous studies show that metaphorical language frequently expresses sentiments or emotions implicitly (Goatly, 2007; Kövecses, 1995, 2003). Compared to literal expressions, metaphors elicit more emotional activation of the human brain in the same context (Citron and Goldberg, 2014). Thus we also added sentiment to our annotation, to test whether the sentiment impact of metaphors is stronger than that of literal messages from a multimodal perspective. The sentiment was placed in one of the five categories of very negative, negative, neutral, positive, or very positive.

Annotation Process
We took two independent annotation approaches for two different types of tasks: selecting types of sentiment and intent, and annotating metaphor. To select the options for sentiment and intent, we used a majority vote through CrowdFlower, the crowdsourcing platform. The participants were randomly presented with both the text and vision components with the instruction on the
The annotation of metaphors includes metaphor occurrence, metaphor category, and domain relation annotation. For metaphor annotation, we used expert annotators to complete the challenging annotation task, which required a relatively deep understanding of metaphorical units and the verbalization of domains in images. The annotator team comprised five annotators who are postgraduate student researchers majoring in computational linguistics with metaphor study backgrounds. The annotators formed two groups of two, plus one extra person. Using cross-validation, the two-member groups annotated, and the fifth person intervened if they disagreed.

Quality Control and Inter-Annotator Agreement
Annotations of multimodal metaphors rely on annotators' opinions and introspection, which might be subjective. Thus we took different measures for the different types of annotations to achieve high-quality annotation. To select options, we established strict criteria for the choice of category. Each text-image pair was annotated by at least 10 annotators and we used a majority vote through CrowdFlower, the crowdsourcing platform. Following Shutova (2017), we chose the category on which 70% or more annotators agreed as the answer to each question (final decision) to provide high confidence of annotation. For metaphor annotation, we added a guideline course, detailed instructions, and many samples, and we held regular meetings to discuss annotation problems and matters that needed attention. The guidelines were revised three times when new problems emerged or better methods were found. The kappa score, κ, was used to measure inter-annotator agreement (Fleiss, 1971). The agreement on the identification of literal or metaphorical was κ = 0.67; on the identification of text dominant, image dominant, or complementary it was κ = 0.79; and on the identification of source and target domain relations it was κ = 0.58, which indicates moderate to substantial agreement.
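The two agreement mechanisms described above, the 70% threshold on crowd votes and Fleiss' kappa, can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and returning `None` when no category reaches the threshold is an assumption, since the text does not say how such cases were handled. The kappa function follows the standard Fleiss (1971) definition and assumes every item is rated by the same number of annotators.

```python
from collections import Counter


def decide_label(votes, threshold=0.7):
    """Return the top category if at least `threshold` of the votes agree.

    Returning None otherwise is an assumption made for this sketch.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= threshold else None


def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of per-item category counts.

    ratings[i][j] is the number of annotators who assigned item i to
    category j; every row must sum to the same number of annotators n.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement for each item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Ten crowd votes on the sentiment of one text-image pair.
accepted = decide_label(["positive"] * 7 + ["neutral"] * 3)
rejected = decide_label(["positive"] * 6 + ["neutral"] * 4)

# Two annotators, two items, two categories, full agreement on both items.
perfect = fleiss_kappa([[2, 0], [0, 2]])
```

With seven of ten votes the label is accepted, with six of ten it is not, and the fully consistent toy table yields a kappa of 1.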

Dataset Analysis
Metaphor Category. We analyzed the role of textual and visual modalities in detecting metaphors. Figure 4 (a) shows that, among the three kinds of multimodal metaphors, the complementary category, which requires the interplay of textual and visual modalities to understand the metaphorical meaning, accounts for the largest proportion, followed by the text-dominant and image-dominant categories. This shows that visual factors are similarly important in detecting metaphors. We therefore present a quantitative study of the role of textual and visual modalities in metaphor detection through human annotations and confirm the role and contribution of visuals in metaphor occurrence in natural language.
Author Intent. Figure 4 (b) shows that expressive and persuasive intentions occur most frequently in the metaphorical data, whereas descriptive intention occurs most frequently in the non-metaphorical data. This suggests that, on the one hand, we are more likely to use metaphorical expressions when expressing feelings or emotions, or when trying to persuade others. On the other hand, we tend to use literal expressions to make relatively objective statements.
Sentiment. Figure 4 (c) shows that there are some differences in the distribution of sentiment between the metaphorical data and non-metaphorical data. In the non-metaphorical data, neutral sentiment accounted for the largest proportion (51%), followed by positive (33%), very positive (7%), negative (7%), and very negative (2%) sentiment. In the metaphorical data, positive sentiment accounted for the largest proportion (42%), followed by neutral (39%), very positive (8%), negative (8%), and very negative (3%) sentiment. Non-neutral sentiments are thus more common in metaphorical than in non-metaphorical expressions (61% versus 49%), which suggests that metaphors are more frequently used to convey sentiments. Our findings accord with the results of previous studies on monomodal textual metaphors that metaphors convey more sentiments or emotions than literal text (Mohammad et al., 2016). We confirm the stronger emotional impact of metaphors than literal messages from a multimodal perspective.
In positive sentiment, the most common words in the source domain are person, face, and flower; the most common words in the target domain are love, life, and success. In negative sentiment, heart, food, and smoke are the most common words in the source domain, and the world, disaster, and life are the most common words in the target domain. This suggests that sentiment tendency is related, to some extent, to the choice of source and target domains.
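The comparison of non-neutral sentiment between the two subsets follows directly from the percentages reported for Figure 4 (c). The short sketch below reproduces that arithmetic; the dictionary layout and the function name are choices made for this illustration, and the labels use the five-point scale from the annotation scheme.

```python
# Percentages reported for Figure 4 (c), keyed by the five sentiment categories.
NON_METAPHORICAL = {
    "very negative": 2, "negative": 7, "neutral": 51,
    "positive": 33, "very positive": 7,
}
METAPHORICAL = {
    "very negative": 3, "negative": 8, "neutral": 39,
    "positive": 42, "very positive": 8,
}


def non_neutral_share(distribution):
    """Sum the percentages of every category except neutral."""
    return sum(pct for label, pct in distribution.items() if label != "neutral")


literal_share = non_neutral_share(NON_METAPHORICAL)   # 49
metaphor_share = non_neutral_share(METAPHORICAL)      # 61
```

Both distributions sum to 100%, and the non-neutral share is 12 percentage points higher in the metaphorical data.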

Experiment
For the dataset constructed for this paper, we propose three tasks and provide their baselines, namely multimodal metaphor detection, multimodal metaphor sentiment analysis, and multimodal metaphor author intent detection.
We used the model shown in Figure 5 to detect metaphors, metaphorical sentiments, and metaphorical intentions. For text input, we used a text encoder to encode the text and to get the feature vector of the text. This paper used two different methods to encode the text, namely the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2019) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks (Medsker and Jain, 2001). Similarly, for image input, we used an image encoder to extract image features. We used three different pre-trained image models: VGG16 (Simonyan and Zisserman, 2014), ResNet50 (He et al., 2016), and EfficientNet (Tan and Le, 2019). These methods have been widely used by researchers in feature extraction for various tasks.
After obtaining the text feature vector and the image feature vector, we used four different feature fusion methods to combine the vectors, namely concatenation (Suryawanshi et al., 2020), element-wise multiplication (Mai et al., 2020), element-wise addition (Cai et al., 2019), and element-wise maximum (Das, 2019). Finally, we fed the fused vector into a fully connected layer and obtained the probabilities of the different categories through the softmax activation function.
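The four fusion operations and the final classification step can be illustrated with plain Python lists in place of tensors. This is a didactic sketch, not the authors' PyTorch model: the function names, the toy feature vectors, and the weights are hypothetical, and the three element-wise methods assume that the text and image vectors have the same length.

```python
import math


def fuse(text_vec, image_vec, method="concat"):
    """Combine a text feature vector and an image feature vector."""
    if method == "concat":
        return list(text_vec) + list(image_vec)
    if len(text_vec) != len(image_vec):
        raise ValueError("element-wise fusion needs equal-length vectors")
    if method == "multiply":
        return [t * v for t, v in zip(text_vec, image_vec)]
    if method == "add":
        return [t + v for t, v in zip(text_vec, image_vec)]
    if method == "max":
        return [max(t, v) for t, v in zip(text_vec, image_vec)]
    raise ValueError("unknown fusion method: " + method)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, bias):
    """Fully connected layer followed by softmax.

    weights has one row per class; each row has len(fused) entries.
    """
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy 2-dimensional features and a 2-class layer (all values hypothetical).
text_vec, image_vec = [1.0, 2.0], [3.0, 0.5]
fused = fuse(text_vec, image_vec, "concat")            # 4 values
weights = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
probs = classify(fused, weights, bias=[0.0, 0.0])
```

Concatenation doubles the input size of the fully connected layer, whereas the element-wise methods keep it unchanged, so the layer's weight shape depends on the fusion method chosen.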

Experiment Settings
We used PyTorch (Paszke et al., 2019) to build the model. The pre-trained models are available in PyTorch. The word embeddings were trained on a Wikipedia dataset with GloVe (Pennington et al., 2014). At the start of training, we did not update the parameters of the pre-trained models. Once the model began to converge, we updated the parameters of the pre-trained models with the training data; this staged schedule was intended to avoid overfitting. We used the Adam optimizer (Kingma and Ba, 2014) to optimize the loss function, and gradient clipping to avoid gradient explosion. Other hyper-parameter settings are shown in Table 3.
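Gradient clipping, mentioned above as the safeguard against gradient explosion, is commonly implemented by rescaling the gradients when their overall norm exceeds a limit. The sketch below shows this norm-based variant on a flat list of gradient values. Whether the authors clipped by norm or by value, and which threshold they used, is not stated, so both the variant and the `max_norm` value here are assumptions; in PyTorch the analogous utility is `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the limit, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# A gradient with L2 norm 5.0 is scaled down to norm 1.0.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
# A small gradient passes through untouched.
unchanged = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Rescaling preserves the direction of the update and only shortens its length, which is why the clipped example keeps the 3:4 ratio between its components.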

Results
The classification results are shown in Table 4. "Random" denotes a baseline that makes random predictions on the data. In general, the model performed best on metaphor detection, followed by metaphor intention detection, and finally metaphor sentiment detection. For image and multimodal classification, ResNet50 performed best, followed by VGG16, and finally EfficientNet. A likely reason is that ResNet mitigates the vanishing gradient problem through residual connections. For text and multimodal classification, BERT performed better than Bi-LSTM. BERT is pre-trained on a large-scale corpus and then fine-tuned on our three tasks and data through transfer learning, so it can achieve better performance. From the perspective of different features, multimodal features perform best, followed by text-only features, and finally image-only features. Multimodal fusion helps to improve the classification performance by 6%. This shows that the combination of image and text features is indeed helpful for the detection and understanding of metaphors, especially the detection of sentiments and intentions in metaphors. The results also show the importance of the text modality: without a text description, it is difficult to detect metaphors correctly using only visual data.
To verify the influence of feature fusion on classification, we compared four different feature fusion methods. The results are shown in Table 5. Concatenating the image and text features produces the highest accuracy. This suggests that concatenation can make full use of the complementarity between the modalities and reduce the noise generated by fusing them, which improves detection. In contrast, the other three fusion methods appear less able to reduce the noise introduced by the different modalities, which interferes with the training of the model. Overall, the multimodal model that combines BERT text features and ResNet50 image features through concatenation performs best on our three tasks.

Conclusion
This paper presents the creation of a novel resource, a large-scale multimodal metaphor dataset, MultiMET, with manual fine-grained annotation for metaphor understanding and research. Our dataset enables the quantitative study of the interplay of modalities for metaphor detection and confirms the contribution of visuals in metaphor occurrence in natural language. It also offers a set of baseline results for various tasks and shows the importance of combining multimodal cues for metaphor understanding. We hope MultiMET provides future researchers with valuable multimodal training data for the challenging tasks of multimodal metaphor processing and understanding, ranging from metaphor detection to sentiment analysis of metaphor. We also hope that MultiMET will help to expand metaphor research from monomodality to multimodality, improve the performance of automatic metaphor understanding systems, and contribute to the in-depth understanding and research development of metaphors. The dataset will be publicly available for research.

Ethical Considerations
This research was granted ethical approval by our Institutional Review Board (Approval code: DU-TIEE190725 01). We collected publicly available Twitter and Facebook data using Twitter and Facebook APIs complying with Twitter and Facebook's terms of service. We did not store any personal data (e.g., user IDs, usernames) and we annotated the data without knowledge of individual identities.
We annotated all our data using two independent approaches (expert based and crowdsourcing based) for two different types of tasks: the annotation of metaphor and the selection of types of sentiment and intent. For metaphor annotation, a deep understanding of metaphorical units was necessary. This challenging task was completed by five researchers who were involved in this project. To annotate sentiment and intent, we used CrowdFlower, the crowdsourcing platform. To ensure that crowd workers were fairly compensated, we paid them at a rate of 15 USD per hour, which is a fair and reasonable rate of pay for crowdsourcing (Whiting et al., 2019). We launched small pilots through CrowdFlower. In the pilot, a sentiment judgment took 43 seconds on average, and crowd workers were thus paid 0.18 USD per judgment, in accordance with an hourly wage of 15 USD. Similarly, an author intent judgment took 23 seconds on average, and we thus paid 0.10 USD per judgment, corresponding to an hourly wage of 15 USD.
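The per-judgment payments follow from the average pilot durations and the hourly wage. The sketch below reproduces the calculation; the function name is hypothetical and rounding to whole cents is an assumption about how the final amounts were derived.

```python
def pay_per_judgment(avg_seconds, hourly_rate_usd=15.0):
    """Convert an average task duration into a per-judgment payment in USD."""
    return round(avg_seconds / 3600 * hourly_rate_usd, 2)


sentiment_pay = pay_per_judgment(43)  # 43 s at 15 USD/hour
intent_pay = pay_per_judgment(23)     # 23 s at 15 USD/hour
```

At 15 USD per hour, 43 seconds corresponds to about 0.179 USD and 23 seconds to about 0.096 USD, which round to the reported 0.18 USD and 0.10 USD.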