FLUTE: Figurative Language Understanding through Textual Explanations

Figurative language understanding has been recently framed as a recognizing textual entailment (RTE) task (a.k.a. natural language inference (NLI)). However, similar to classical RTE/NLI datasets, these benchmarks suffer from spurious correlations and annotation artifacts. To tackle this problem, work on NLI has built explanation-based datasets such as e-SNLI, allowing us to probe whether language models are right for the right reasons. Yet no such data exists for figurative language, making it harder to assess genuine understanding of such expressions. To address this issue, we release FLUTE, a dataset of 9,000 figurative NLI instances with explanations, spanning four categories: Sarcasm, Simile, Metaphor, and Idioms. We collect the data through a Human-AI collaboration framework based on GPT-3, crowd workers, and expert annotators. We show how utilizing GPT-3 in conjunction with human annotators (novices and experts) can aid in scaling up the creation of datasets even for such complex linguistic phenomena as figurative language. The baseline performance of the T5 model fine-tuned on FLUTE shows that our dataset can bring us a step closer to developing models that understand figurative language through textual explanations.


Introduction
Figurative language such as metaphors, similes, or sarcasm plays an important role in enriching human communication, allowing us to express complex ideas and emotions in an implicit way (Roberts and Kreuz, 1994; Fussell and Moss, 1998). However, understanding figurative language still remains a bottleneck for natural language processing (Shutova, 2011). Recently, Jhamtani et al. (2021) showed that when faced with dialog contexts containing figurative language, some models show very large drops in performance compared to contexts without figurative language. Even as Transformer-based pre-trained language models (LMs) grow larger (Brown et al., 2020; Raffel et al., 2020), they are still unable to comprehend the physical world, cultural knowledge, or social context in which figurative language is embedded (Bisk et al., 2020).
In recent years, there have been several benchmarks dedicated to figurative language understanding, which generally frame "understanding" as a recognizing textual entailment (a.k.a. natural language inference (NLI)) task: deciding whether one sentence (premise) entails or contradicts another (hypothesis) (Chakrabarty et al., 2021; Stowe et al., 2022; Srivastava et al., 2022). However, similar to general NLI datasets, these benchmarks suffer from spurious correlations and annotation artifacts (McCoy et al., 2019; Poliak et al., 2018b). These can allow large language models (LLMs) to achieve near human-level performance on in-domain test sets, yet turn brittle when evaluated on out-of-domain or adversarial examples (Glockner et al., 2018; Ribeiro et al., 2016, 2020). To tackle these problems, research in NLI has argued that it is not enough to correctly predict the entail/contradict label; models should also explain the decision using natural language explanations that are comprehensible to an end-user assessing the model's reliability (Camburu et al., 2018; Majumder et al., 2021; Wiegreffe et al., 2021a), leading to novel datasets such as e-SNLI (Camburu et al., 2018). Yet, there is no such dataset for figurative language, hindering our ability to assess LLMs' genuine understanding of figurative language.
Table 1: FLUTE examples of figurative text (hypothesis) and their respective literal entailment (E) and contradiction (C) premises, along with the associated explanations. For simile, metaphor, and idiom, figurative examples are the hypothesis, whereas for sarcasm we have both figurative and literal hypotheses (see Section 2).

Sarcasm
Hypothesis (E): It's so annoying to have to hear my next door neighbors argue all the time in our shared hallway.
Explanation: The sound of arguing neighbors can often be very disruptive and if it happens all the time in a common space like a shared hallway it is natural to find it annoying.
Hypothesis (C): It's so pleasant to have to hear my next door neighbors argue all the time in our shared hallway.
Explanation: The sound of arguing neighbors can often be very disruptive and so someone considering it to be pleasant is not really accurate.

Simile
Premise (E): The assembly hall was now hot and moist, more so than usual.
Hypothesis: In fact, the assembly hall was now like a steam sauna.
Explanation: A sauna is a hot and moist environment, so the simile is saying that the hall is even hotter and more moist than usual.
Premise (C): The assembly hall was now cold and dry, more so than usual.
Explanation: A steam sauna is a small room or hut where people go to sweat in steam, so it would be hot and humid, not cold and dry.

Metaphor
Premise (E): He mentally assimilated the knowledge or beliefs of his tribe.
Hypothesis: He absorbed the knowledge or beliefs of his tribe.
Explanation: To absorb something is to take it in and make it part of yourself.
Premise (C): He utterly decimated his tribe's most deeply held beliefs.
Explanation: Absorbed typically means to take in or take up something, while "utterly decimated" means to destroy completely.

Idiom
Premise (E): Lady Southridge was wringing her hands, trying hard and desperately to salvage the bleak and miserable situation so that it somehow looks positive.
Hypothesis: Lady Southridge was wringing her hands, trying to grasp at straws.
Explanation: To grasp at straws means to make a desperate attempt to salvage a bad situation, which is exactly what Lady Southridge is trying to do.
Premise (C): Lady Southridge was wringing her hands, doing absolutely nothing to overturn the bleak and miserable situation so that it somehow looks positive.
Explanation: To grasp at straws means to make a desperate attempt to salvage a bad situation, but the sentence describes not doing anything to change the situation.

In this paper, we make several contributions towards the goal of building models and assessing their ability to understand figurative language:

• FLUTE: a new benchmark for figurative language understanding through textual explanations. FLUTE contains 9,000 high-quality <literal, figurative> sentence pairs with entail/contradict labels and the associated explanations. The benchmark spans four types of figurative language: sarcasm, simile, metaphor, and idiom. Table 1 shows examples from our dataset. A noteworthy property of FLUTE is that both the entailment/contradiction labels and the explanations are w.r.t. the figurative language expression (i.e., metaphor, simile, idiom) rather than other parts of the sentence.
• A scalable model-in-the-loop approach for building FLUTE. Model-in-the-loop approaches (i.e., GPT-3 (Brown et al., 2020) and crowdsourcing) have recently been proposed to generate NLI datasets, as well as free-form textual explanations (a.k.a. natural language explanations (Camburu et al., 2018)) for model decisions (Liu et al., 2022a; Wiegreffe et al., 2021a). For figurative language, Ghosh et al. (2020) have shown that crowdworkers mostly perform minimal edits to generate a literal sentence from a sarcastic one (e.g., using negation or antonyms), which can lead to trivial examples easily classified by LLMs (Chakrabarty et al., 2021). Thus, for building FLUTE, we leverage the power of GPT-3 to generate diverse and high-quality literal text (paraphrases/contradictions and/or explanations) using few-shot prompting, coupled with minimal human involvement (e.g., crowdworkers to minimally edit a literal sentence to make it sarcastic, and experts for judging and minimally editing GPT-3 output to ensure quality control) (Section 2).
• A comprehensive set of experiments to assess FLUTE's usefulness towards building models that understand figurative language. We propose a setup inspired by instruction-based learning (Mishra et al., 2021; Sanh et al., 2021; Wei et al., 2021) and train a T5 (Raffel et al., 2020) model to jointly predict the label (entail/contradict) and the explanation. We train two variants: T5 trained on the e-SNLI dataset (Camburu et al., 2018) and T5 trained on FLUTE. We evaluate both models on the FLUTE test set (Section 3.2), and propose extensive automatic and human evaluation experiments to assess model understanding through explanations (Section 3.2). We show that the model trained on FLUTE produces higher-quality explanations than the model trained on e-SNLI (Section 4).
Our code, experimental setup, and data are publicly available.

2 Model-in-the-loop for building FLUTE

FLUTE consists of pairs of premises (literal sentences) and hypotheses (figurative sentences), with the corresponding entailment or contradiction labels (NLI instances), along with explanations for each instance (Table 1). We describe the model-in-the-loop methods for creating premise-hypothesis pairs for each type of figurative language (Section 2.1) and the associated explanations (Section 2.2).
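To make the dataset format concrete, a single FLUTE instance can be represented as follows. This is an illustrative sketch only; the field names are our own, not an official schema of the released data.

```python
from dataclasses import dataclass

@dataclass
class FluteInstance:
    premise: str      # literal sentence
    hypothesis: str   # figurative sentence (literal or sarcastic for sarcasm)
    label: str        # "Entailment" or "Contradiction" (FLUTE has no Neutral)
    explanation: str  # free-text justification of the label
    phenomenon: str   # one of "sarcasm", "simile", "metaphor", "idiom"

# Example instance, taken from the simile row of Table 1
example = FluteInstance(
    premise="The assembly hall was now hot and moist, more so than usual.",
    hypothesis="In fact, the assembly hall was now like a steam sauna.",
    label="Entailment",
    explanation="A sauna is a hot and moist environment, so the simile is "
                "saying that the hall is even hotter and more moist than usual.",
    phenomenon="simile",
)
```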

Sarcasm
When asked to generate literal equivalents of sarcastic sentences, crowdworkers on Amazon Mechanical Turk (MTurk) usually perform trivial rephrasings at the word/phrase level (Ghosh et al., 2020), which can lead to NLI datasets for sarcasm understanding on which LLMs achieve near-human performance (95%) due to simple lexical cues, such as negation or antonyms (Chakrabarty et al., 2021). Additionally, in many cases sarcasm data is collected from Twitter using hashtags (e.g., #sarcasm), which can be noisy and lack diversity.
To address these issues we take a model-in-the-loop approach: given a literal sentence, we first use GPT-3 with few-shot prompting to generate a literal paraphrase, and then use crowdworkers to minimally edit this new literal sentence to form a sarcastic one (Figure 1b). We then pair the original literal sentence with the generated literal paraphrase as an entailment pair, and with the sarcastic one as a contradiction pair. Below we describe these two steps.
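The pairing step described above can be sketched as follows. The helper is hypothetical; it assumes the paraphrase and the sarcastic rewrite have already been produced by GPT-3 and the crowdworkers, respectively.

```python
def build_sarcasm_pairs(literal, paraphrase, sarcastic):
    """Pair the original literal sentence with the GPT-3 paraphrase
    (entailment) and with the crowdworker-edited sarcastic sentence
    (contradiction)."""
    return [
        {"premise": literal, "hypothesis": paraphrase, "label": "Entailment"},
        {"premise": literal, "hypothesis": sarcastic, "label": "Contradiction"},
    ]

pairs = build_sarcasm_pairs(
    "My next door neighbors argue all the time in our shared hallway.",
    "It's so annoying to have to hear my next door neighbors argue "
    "all the time in our shared hallway.",
    "It's so pleasant to have to hear my next door neighbors argue "
    "all the time in our shared hallway.",
)
```

Note that both NLI instances share the same literal premise; only the hypothesis and label differ.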
Entailment Pairs. Jointly modeling emotion and sarcasm has been shown to benefit sarcasm detection (Chauhan et al., 2020). Thus, we select the literal sentences from the Empathetic Dialogue dataset (Rashkin et al., 2019). Each conversation in the dataset is grounded in a situation with a provided emotion label. We select literal sentences labeled with negative emotions such as angry, afraid, and embarrassed. Including the emotion in the GPT-3 prompts serves two purposes: (a) the generated paraphrases are complex and often more creative than the original literal input, and (b) it is easier for crowdworkers to transform paraphrases with emotional content into sarcastic counterparts with minimal edits. To generate the literal paraphrases, we provide the literal sentence and the associated emotion in the prompt and ask GPT-3 to paraphrase the input (top part of Figure 1b; see prompt in Appendix A.1.2). Every paraphrase generated by GPT-3 is verified by three experts. If the quality of the generated paraphrase is deemed insufficient, it is resampled from the model. Any individual example undergoes at most three rounds of sampling, with 15% of them judged as appropriate upon the first round.
Contradiction Pairs. We recruit crowd workers on MTurk to convert the manually checked GPT-3-generated literal paraphrases into sarcastic sentences. The workers were provided with the paraphrases and instructed to make minimal edits (e.g., through negations or antonyms) to generate sarcastic sentences. We conducted a qualification test and recruited 29 distinct workers from the original set of 85 workers. We assign two independent workers to every paraphrase input. The resulting sarcastic outputs are verified by three experts; 25% of instances were deemed of insufficient quality and edited by the experts. Consider the sarcasm example in Table 1. The literal input "My next door . . ." is the premise. The first step generated the paraphrase hypothesis by adding the implicitly stated emotion "annoyed" and paraphrasing. Next, Turkers modified the paraphrase into its sarcastic counterpart by replacing "annoying" (the emotion word) with its antonym "pleasant".
The final FLUTE sarcasm benchmark consists of 2,678 sarcastic sentences, each contradicted by the corresponding original literal sentence (Table 2).

Simile
Entailment Pairs. We start by extracting sentences containing similes from Chakrabarty et al. (2021, 2022). To generate an entailment pair, we perform two steps (see Fig. 1a): 1) given the sentence that contains a simile, create an auxiliary literal sentence by simply replacing the simile with the simile's property (e.g., replace "like a steam sauna" with "hot and moist"); 2) given the auxiliary literal sentence and the property, prompt GPT-3 to generate a literal paraphrase consistent with the property (see prompt in Appendix A.2.3). Experts deemed 720 generated instances satisfactory. To illustrate the pipeline, see the simile example in Table 1: "In fact the assembly hall was now like a steam sauna". The object "steam sauna" is replaced with its property "hot and moist", and the resulting sentence is fed into GPT-3 to generate the paraphrase "The assembly hall was now hot and moist, more so than usual".
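Step 1 of the simile pipeline amounts to a phrase substitution. A minimal sketch, with illustrative function and argument names (the real pipeline identifies the simile phrase and its property from annotated data):

```python
def simile_to_literal(sentence, simile_phrase, prop):
    """Replace the simile phrase (e.g., 'like a steam sauna') with its
    salient property (e.g., 'hot and moist') to form the auxiliary
    literal sentence that is then fed to GPT-3 for paraphrasing."""
    return sentence.replace(simile_phrase, prop)

aux = simile_to_literal(
    "In fact, the assembly hall was now like a steam sauna.",
    "like a steam sauna",
    "hot and moist",
)
# aux == "In fact, the assembly hall was now hot and moist."
```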
Contradiction Pairs. To generate the contradictions, we prompt GPT-3 to invert the meaning of the literal paraphrase (see above) w.r.t. the property (see prompt in Appendix A.2.4). Out of 720 generated contradictions, experts deemed 642 satisfactory. We further selected 108 challenging <simile, entailment, contradiction> instances from Liu et al. (2022b).

Metaphors
Entailment Pairs. We follow a similar model-in-the-loop approach as the one for similes (Figure 1a). We manually select a total of 750 metaphors from the following datasets: Chakrabarty et al. (2021); Srivastava et al. (2022); Stowe et al. (2022). Next, we prompt GPT-3 to generate paraphrases given the metaphoric sentences (see prompt in Appendix A.4.2). Although the original datasets contain literal equivalents of the metaphoric sentences, they are not fully adequate for our purposes because: (a) not all metaphor examples have a literal counterpart, and (b) often, the literal counterpart is a minimal modification (one-word edit) of the metaphor (Srivastava et al., 2022; Chakrabarty et al., 2021), which can lead to trivial examples for LLMs.
Contradiction Pairs. To generate contradiction pairs, we start with the GPT-3-generated literal sentence that entails the metaphoric sentence. Consider the metaphor in Table 1, "He absorbed the knowledge or beliefs of his tribe" (taken from Stowe et al. (2022)). In the original dataset, the non-entailment counterpart is "He absorbed the beverages of his tribe", which is created by using the verb in a different sense (the literal sense of "absorb") to fit the context of beverage drinking. On the contrary, since we are interested in generating instances that contradict the metaphor itself, a more appropriate modification would be "He utterly decimated his tribe's most deeply held beliefs" (more examples are in Table 6 in the Appendix). We follow the same method to generate contradiction examples (using GPT-3) as for similes (see prompt in Appendix A.4.3).
Both paraphrases and contradictions are verified by three experts and edited when required. Our FLUTE benchmark contains 750 entailment and 750 contradiction pairs for metaphors (Table 2).

Idioms
Observing the successful generation of paraphrases and contradictions by GPT-3 for similes and metaphors, we jointly generate paraphrases along with contradictions using GPT-3 (see Figure 1a, blue dotted lines). We provide the idiom and its meaning in the prompt (see prompt in Appendix A.6). Three experts manually verified all the generated sentences and edited 23% of the generations. We found that jointly generating paraphrases and contradictions greatly eased the data creation process and resulted in relatively high-quality generations.
The FLUTE benchmark consists of 1,000 entailment and 1,000 contradiction pairs for idioms (Table 2).

FLUTE: Generating Textual Explanations
Our task requires that the model not only correctly infer the label, but also explain why a given premise entails or contradicts the hypothesis. Towards this goal, we generate textual explanations for every <premise, hypothesis> pair.
For simile, metaphor, and sarcasm, we provide the premise, hypothesis, and label (entailment or contradiction) and prompt GPT-3 to generate an explanation. We provide a natural language instruction followed by several examples. We generate entailment and contradiction explanations separately.
For idioms, the idiom meaning in the seed dataset already serves as a good explanation. Thus, we utilize the provided idiom meaning to jointly generate the explanations for the entailing and contradicting premises using GPT-3. Hence, for idiom data, in addition to the premise, hypothesis, and labels, we also provide the idiom itself and its meaning in the prompt.
LLMs such as GPT-3 have been scrutinized heavily because they can mimic or amplify societal bias (Sheng et al., 2021), religious stereotypes (Abid et al., 2021), and gender stereotypes (Borchers et al., 2022). For example, applications such as story generation often emulate societal bias by including more masculine characters and following social stereotypes based on their training data (Lucy and Bamman, 2021). However, this bias is not evident in the FLUTE dataset, probably because the model is told to explain specific figurative instances rather than to write creatively. In addition, we reuse some standard datasets (Chakrabarty et al., 2021; Stowe et al., 2022; Srivastava et al., 2022) for FLUTE, which have probably already been filtered for provocative content. Finally, experts manually verified all explanations to ensure their correctness and their ability to explain the essence of the entailment or contradiction in reasonable detail, rather than following a simple template (see Table 1). In cases where explanations were not accurate, experts edited them to ensure they are coherent, logically consistent, and grammatical. For explanations pertaining to sarcasm+paraphrase, experts edited a total of 21% of the generated explanations, while for simile, metaphor, and idiom it was 27%, 40%, and 10%, respectively, which further demonstrates the potential of using GPT-3 to significantly reduce the human effort that goes into collecting textual explanation datasets. See Appendix A for details on hyperparameters and prompts.
3 Experimental Setup

Models
Prior works in explainability have trained two types of models. Pipeline models map an input to a rationale (I → R), and then a rationale to an output (R → O). Joint self-rationalizing models map an input to an output and rationale (I → OR). Recently, Wiegreffe et al. (2021b) exposed the shortcomings of free-text pipelines and empirically showed that joint model rationales are more indicative of labels. Following this, we fine-tune a joint self-rationalizing T5 model. Taking advantage of the text-to-text format of T5 (Raffel et al., 2020) and the recent success of instruction-based models (Sanh et al., 2021; Wei et al., 2021), we design the following instruction for a given literal premise (P) and a figurative hypothesis (H): Does the sentence "P" entail or contradict the sentence "H"? Please answer between "Entails" or "Contradicts" and explain your decision in a sentence.
The above instruction is fed to the encoder of T5. The decoder outputs the label followed by the rationale. We fine-tune T5 in two setups: in the first, we fine-tune on e-SNLI (Camburu et al., 2018), and in the second, we fine-tune on FLUTE.
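As a sketch, the instruction string can be assembled as follows (the helper name is illustrative, not part of our released code):

```python
def make_instruction(premise, hypothesis):
    """Format a <premise, hypothesis> pair as the natural-language
    instruction fed to the T5 encoder; the decoder then emits the
    label followed by the rationale."""
    return (
        f'Does the sentence "{premise}" entail or contradict the sentence '
        f'"{hypothesis}"? Please answer between "Entails" or "Contradicts" '
        "and explain your decision in a sentence."
    )

prompt = make_instruction(
    "He mentally assimilated the knowledge or beliefs of his tribe.",
    "He absorbed the knowledge or beliefs of his tribe.",
)
```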
T5 e-SNLI: The e-SNLI dataset (Camburu et al., 2018) comes with supervised ground-truth labels and rationales. We fine-tune the 3B version of T5 on e-SNLI for one epoch with a batch size of 1024 and the AdamW optimizer with a learning rate of 1e-4. We remove the Neutral examples from e-SNLI because our test data does not have such a category. We take the longest explanation per example in e-SNLI, since our data has only one reference explanation. When an explanation consists of more than one sentence, we join the sentences using 'and', since our data contains single-sentence explanations. This leaves us with 366,603 training and 6,607 validation examples.
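The e-SNLI filtering described above can be sketched as follows. The field names are hypothetical, and the period-based sentence split is a crude stand-in for a proper sentence tokenizer:

```python
def preprocess_esnli(examples):
    """Drop Neutral examples; keep the longest explanation per example,
    joining multi-sentence explanations with 'and'."""
    out = []
    for ex in examples:
        if ex["label"] == "neutral":
            continue  # FLUTE has no Neutral category
        longest = max(ex["explanations"], key=len)
        # crude sentence split on periods; real code would use a tokenizer
        parts = [s.strip() for s in longest.rstrip(".").split(".") if s.strip()]
        out.append({"label": ex["label"],
                    "explanation": " and ".join(parts) + "."})
    return out

data = [
    {"label": "neutral", "explanations": ["Irrelevant."]},
    {"label": "entailment",
     "explanations": ["Short one.",
                      "A person outside is outdoors. They are not inside."]},
]
processed = preprocess_esnli(data)
# one example survives, with its two sentences joined by 'and'
```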
T5 FLUTE: We fine-tune the 3B version of T5 for 10 epochs with a batch size of 1024 and the AdamW optimizer with a learning rate of 1e-4, in a multitask fashion with data from all four types of figurative language combined. Our training data consists of 7,035 samples, which is 50x smaller than e-SNLI. For validation we use 500 examples, which are used to select the best checkpoint based on loss.

Evaluation Setup
To evaluate the above models, we built a test set by randomly selecting 750 instances (i.e., <premise, hypothesis> pairs with associated explanations) from the sarcasm dataset, and 250 examples each from simile, metaphor and idiom datasets, for a total of 1,500 instances.
Below we describe several automatic metrics and human evaluations we consider to assess the models' ability to understand figurative language.
Automatic Metrics. To judge the quality of the explanations, we compute the average of BERTScore (Zhang et al., 2020) and BLEURT (Sellam et al., 2020), which we refer to as the explanation score (between 0 and 100). Instead of reporting only label accuracy, we report label accuracy at three thresholds of explanation score (0, 50, and 60). Accuracy@0 is equivalent to simply computing label accuracy, while Accuracy@50 counts as correct only the correctly predicted labels that achieve an explanation score greater than 50.
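The thresholded metric can be computed as follows. This is a sketch: the stand-in scores replace actual BERTScore/BLEURT model calls.

```python
def explanation_score(bertscore, bleurt):
    """Average of BERTScore and BLEURT, rescaled to 0-100."""
    return 100 * (bertscore + bleurt) / 2

def accuracy_at(preds, golds, exp_scores, threshold):
    """Accuracy@t: a prediction counts as correct only if the label
    matches AND its explanation score exceeds the threshold."""
    correct = sum(
        1 for p, g, s in zip(preds, golds, exp_scores)
        if p == g and s > threshold
    )
    return correct / len(golds)

# toy predictions with stand-in explanation scores
preds = ["Entails", "Contradicts", "Entails"]
golds = ["Entails", "Contradicts", "Contradicts"]
scores = [72.0, 41.0, 90.0]
acc0 = accuracy_at(preds, golds, scores, 0)    # plain label accuracy: 2/3
acc50 = accuracy_at(preds, golds, scores, 50)  # only the first survives: 1/3
```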
Rationale Quality. Human simulatability (Doshi-Velez and Kim, 2017) has a rich history in machine learning interpretability research as a reliable measure of rationale quality from the lens of utility to an end-user. Simulatability measures the additional predictive ability a rationale R provides over the input I for a given label O, computed as the difference between task performance when a rationale is given as input and when it is not (IR → O vs. I → O).

Table 3: Accuracy scores across four figurative language types at varying thresholds of explanation score, along with human evaluation scores H_score, Yes% (higher is better), and No% (lower is better) for explanations generated by T5 fine-tuned on e-SNLI (T5 e-SNLI) and T5 fine-tuned on FLUTE (T5 FLUTE). p < 0.001 via Wilcoxon signed-rank test for all bolded results.

Human Evaluation. We show crowd workers the two sentences along with the predicted label and explanation, and ask: given the two sentences, does the explanation justify the answer above? We provide four options: Yes (1), Weak Yes (2/3), Weak No (1/3), and No (0). For each explanation, we average the scores of the three annotators and report the sample average in Table 3 as H_score. If the answer is anything other than Yes, we ask annotators to categorize the shortcomings of the explanation: Insufficient Justification, Too Trivial, Too Verbose, Untrue to Input, Violates Common Sense (Majumder et al., 2021).
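The simulatability measure described above is a difference of two accuracies. A minimal sketch, where the `predict` callable stands in for any trained proxy model:

```python
def accuracy(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def simulatability(predict, inputs, rationales, golds):
    """IR -> O accuracy minus I -> O accuracy: how much predictive
    signal the rationale adds over the input alone."""
    with_rationale = [predict(i + " " + r) for i, r in zip(inputs, rationales)]
    input_only = [predict(i) for i in inputs]
    return accuracy(with_rationale, golds) - accuracy(input_only, golds)

# toy predictor: answers "Contradicts" only when the text mentions "opposite"
toy = lambda text: "Contradicts" if "opposite" in text else "Entails"
gap = simulatability(toy, ["premise and hypothesis"],
                     ["these have opposite meanings"], ["Contradicts"])
# gap == 1.0: the rationale supplies signal the input alone lacks
```

A negative gap, as reported for model-predicted rationales, means the rationale introduces noise that hurts the proxy's prediction.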
Three workers were recruited for each instance, and the inter-annotator agreement between the workers, measured using Krippendorff's α (Krippendorff, 2011), is 0.45, indicating moderate agreement. See Appendix B for details on the H_score computation and Figure 4 for a screenshot of the MTurk task interface.
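The H_score aggregation described above maps the four options to numeric values, averages over the three annotators per explanation, then averages over the sample. A sketch:

```python
OPTION_VALUES = {"Yes": 1.0, "Weak Yes": 2 / 3, "Weak No": 1 / 3, "No": 0.0}

def h_score(annotations):
    """annotations: one list of annotator options per explanation.
    Average per explanation, then average over all explanations."""
    per_item = [sum(OPTION_VALUES[a] for a in item) / len(item)
                for item in annotations]
    return sum(per_item) / len(per_item)

score = h_score([
    ["Yes", "Yes", "Weak Yes"],  # (1 + 1 + 2/3) / 3 = 8/9
    ["Weak No", "No", "No"],     # (1/3 + 0 + 0) / 3 = 1/9
])
# score == 0.5
```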

Results and Discussion
Table 3 shows accuracy at varying explanation score thresholds. A threshold of 0 does not account for the quality of the textual explanation and is equivalent to simply reporting label accuracy. When we increase the threshold to 50, accuracy drops almost by half for T5 e-SNLI, showing that most explanations generated by the model trained on e-SNLI are of poor quality. Increasing the threshold to 60 decreases accuracy further, demonstrating that models like T5 fine-tuned on e-SNLI still struggle to generate correct explanations even when the label predictions are correct. In contrast, Table 3 shows that the accuracy scores for T5 FLUTE are significantly higher for each type of figurative language, indicating the higher quality of explanations achieved by fine-tuning the model on our dataset.
In terms of Rationale Quality, the model fine-tuned on e-SNLI generates explanations that are less useful for gold label prediction compared to the model trained on FLUTE. Simulatability scores obtained using predicted rationales are negative compared to gold rationales, since rationales generated by the model are still imperfect and may introduce noise that negatively affects the prediction. The positive score of gold rationales indicates that the explanations provided in the FLUTE dataset are indeed useful for the model to make predictions. Table 3 also presents the H_score scores for explanations collected from human evaluation (see Appendix B and prior work (Kayser et al., 2021; Majumder et al., 2021) for details on computation). We notice the scores for T5 FLUTE are 51.1, 41.3, 24.9, and 22.2 points better than for T5 e-SNLI for Sarcasm, Simile, Metaphor, and Idiom (in that order). Crowd workers answered with a solid 'Yes' (that the explanation justifies the label) in 43.4% more cases on average when presented explanations from T5 FLUTE compared to T5 e-SNLI. Likewise, they answered 'No' in 28.5% fewer cases on average when presented explanations from T5 FLUTE compared to T5 e-SNLI. These results demonstrate a notably higher quality of explanations generated by the model fine-tuned on FLUTE compared to the model fine-tuned on e-SNLI, despite the significantly larger size of the latter dataset.

Table 5: Examples of gold and model-generated explanations.

Premise: The camp life really had energized him.
Hypothesis: Life in the camp wilted him.
Gold_Exp: To wilt means to lose vitality or strength, while to energize means to give strength or energy to.
T5 e-SNLI: If the camp life energized him, then he is not wilted.
T5 FLUTE: To wilt something means to lose its energy or vitality, while to have energized something mean to feel a sense of excitement or life.

Premise: No one came to my wedding.
Hypothesis: No big deal that there were no guests at my wedding reception considering it's the only one I'll ever have.
Label: Contradiction.
Gold_Exp: A wedding reception is supposed to be a joyous event with many guests, so not having any guests at it would make it a very uneventful and pointless one.
T5 e-SNLI: No one is same as No big deal.
T5 FLUTE: A wedding reception is a celebration of the couple's union and is usually attended by family and friends so not having guests there would be seen as snub.
Figure 2 shows that for cases where crowd workers did not choose an absolute 'Yes', the most common error category is Insufficient Justification. The percentage of explanations for which shortcomings were identified is higher or the same across all categories for T5 e-SNLI. The second example ("No one is same . . .") in Table 5 shows an insufficient explanation generated by T5 e-SNLI. T5 e-SNLI-generated explanations were also more frequently marked as Too Trivial. Often, they do not explain the reasoning but rather follow a standard template, "if A then not B" or "if A then B", as in the first example in Table 5 for T5 e-SNLI. We share more examples of erroneous explanations in Table 8 in the Appendix.

Related Work
In recent years, evaluating how well RTE models can capture specific linguistic phenomena such as figurative language has attracted considerable attention in NLP. Compared to prior benchmarks, FLUTE has better diversity and also contains explanations for each NLI instance, as well as multiple expert checks to ensure higher quality (see Section 2). A portion of FLUTE's metaphors are based on the Big-bench corpora (Srivastava et al., 2022) as well as the IMPLI dataset (Stowe et al., 2022), which is inspired by Zhou et al. (2021)'s prior work on paired idiomatic and literal data. However, there are several distinctions between these datasets and FLUTE. First, not all the metaphors in Big-bench have literal paraphrases and contradictions. Second, in the case of both Big-bench and IMPLI, the non-entailment examples are created via minimal edits to the original metaphor, often resulting in neutral examples or in contradictions of the non-metaphoric part of the sentence. In FLUTE, we ensure that the non-entailment examples contradict the metaphor itself (Table 6).
One of the motivations for pairing <literal, figurative> sentences with NLEs, as FLUTE does, is to evaluate models' ability to explain their decisions. Recent datasets such as CoS-E (Rajani et al., 2019), Movie Reviews (Zaidan and Eisner, 2008), and e-SNLI (Camburu et al., 2018) have been released in a similar vein. Recent work has also leveraged large language models to explain humor in image captions (Hessel et al., 2022) or sarcasm in dialogues (Kumar et al., 2022). The e-SNLI dataset (i.e., NLEs of the entailment relations in the SNLI dataset) has been used in related work (Narang et al., 2020; Yordanov et al., 2021; Majumder et al., 2021; Feng et al., 2022) for explanation generation. In contrast to e-SNLI, which was created via crowdsourcing, we rely on a model-in-the-loop framework for FLUTE, influenced by Wiegreffe et al. (2021a). We have utilized the e-SNLI dataset for explanation generation and observed that a T5 model trained on FLUTE performs notably better.

Conclusion
We release FLUTE, a dataset for figurative language understanding spanning Sarcasm, Similes, Metaphors, and Idioms, collected via a model-in-the-loop framework. To encourage genuine understanding of figurative language, our data also contains free-form textual explanations. In baseline experiments with state-of-the-art benchmark models (i.e., models trained on the e-SNLI dataset), we observe that those models perform poorly. In contrast, the performance of the T5 model fine-tuned on FLUTE shows that our dataset can bring us a step closer to developing models that understand figurative language through textual explanations. We hope our research on explanation generation for figurative language will open a fruitful future direction, and that our dataset will serve as a challenging testbed for experimentation.

Limitations
While we focused on four types of figurative language and generated a diverse dataset, we believe it is just a first step towards capturing figurative NLI instances and their explanations, since figurative language can draw on a wide variety of cultural knowledge and contexts. Although the sarcasm portion captures the most common type of incongruity between sarcastic context and sentiment, sarcasm can manifest in many different forms (situational, underplayed, or dramatic) for which examples and explanations will differ. Finally, this study does not explicitly focus on the faithfulness of model-generated Natural Language Explanations; we hope to evaluate their faithfulness using methods described in contemporaneous literature on the faithfulness of NLEs (Sia et al., 2022; Chan et al., 2022). While GPT-3 did not generate any examples of societal bias in this study, prior research has investigated the reliability and faithfulness of generations (Wiegreffe et al., 2021a). Likewise, we could also conduct a human study asking specific questions (e.g., whether the explanations mimic any bias, are credible, etc.); we leave this for future study.

References

Shivani Kumar, Atharva Kulkarni, Md Shad Akhtar, and Tanmoy Chakraborty. 2022. When did you become so smart, oh wise one?! Sarcasm explanation in multi-modal multi-party dialogues. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5956-5968, Dublin, Ireland. Association for Computational Linguistics.

Emmy Liu, Chen Cui, Kenneth Zheng, and Graham Neubig. 2022b. Testing the ability of language models to interpret figurative language.

A Appendix
In this section we report the details of the experiments, e.g., the hyperparameters used for GPT-3-based generations, examples, and the prompts for each type of figurative language.
A.1 Sarcasm dataset

A.1.1 Hyperparameters for the sarcasm dataset

We use the GPT-3-Davinci-001 model for the auxiliary paraphrase generations from which crowd workers create sarcasm.

A.1.2 Prompts for generation of Paraphrase from which Sarcasm is created
You will be presented with examples of some literal input sentences and their creative paraphrases.
For each example, we also provide the associated emotion. Your task is then to generate a creative paraphrase for a given literal sentence, where the creative paraphrase should reflect the associated emotion without changing its meaning. Make sure to use some sort of humor and commonsense about everyday events and concepts.
1) Literal: A lot of people have got engaged recently.

Emotion: surprised
Creative Paraphrase: The way all the couples are pairing off lately and naming the big day, I think Cupid's really busy.

You will be presented with examples of two sentences, typically a premise along with an entailing paraphrase of the premise, called the hypothesis. Your task is to generate natural language explanations to justify the Entailment between the premise and the hypothesis.
1) Premise: Awful seeing a naked man run through my neighborhood.
Hypothesis: The sight of a man running through my neighborhood sans clothes was pretty disgusting.
Explanation: It is socially unacceptable to not wear clothes and step out of one's house so seeing a man who is running naked in the neighborhood is pretty shameful and disgusting.
2) Premise: My mother didn't cook her chicken all the way through at dinner the other night.
Hypothesis: The fact that my mother didn't cook her chicken all the way through at dinner makes me feel like I'm going to vomit.
Explanation: Eating undercooked chicken can cause food poisoning and so finding out that the chicken at dinner wasn't cooked all the way through often makes people throw up . . .

A.1.4 Prompts for generation of Explanation for Sarcasm (Contradiction)
You will be presented with examples of some literal and sarcastic sentences. Your task is then to write explanations to justify why it is sarcastic w.r.t. the literal.
1) Literal: When I moved into my apartment it was full of bugs
Sarcasm: I absolutely loved when I moved into my apartment and found it crawling with bugs.
Explanation: Bugs are usually disgusting and most people are terrified of them therefore it is unlikely to love seeing someone's apartment infested by them.
2) Literal: I've been hearing some strange noises around the house at night.
Sarcasm: I am completely comforted by the weird noises I keep hearing around the house at night.
Explanation: Hearing weird noises around the house at night could invoke a potential danger such as a robbery or someone breaking in with malicious intent which makes someone scared rather than comforted. . . .

A.2.2 Challenging instances for the simile dataset
To select challenging instances from the FigQA dataset (Liu et al., 2022b), we use RoBERTa fine-tuned on SNLI, MNLI, FEVER-NLI, and ANLI (R1, R2, R3). We choose <simile, literal, contradiction> instances by taking the average between the RoBERTa logit for entailment given <simile, literal> as input and the logit for contradiction given the <simile, contradiction> input. We then choose the instances with the lowest such score. In this way, we are able to select instances for which RoBERTa is essentially confusing entailment and contradiction.
From our observation, these are usually ironic similes, e.g. for the simile I'm sharp as a pillow and the literal sentence I'm not sharp, RoBERTa would have a low logit for entailment, while for the sentence I'm sharp it would have a low logit for contradiction.
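The selection criterion above can be sketched as follows (a minimal sketch: the toy logit values and instance names are hypothetical stand-ins for real RoBERTa-NLI outputs, and `difficulty_score` is an illustrative name, not from the original pipeline):

```python
# Sketch of the challenging-instance selection described above,
# using hypothetical toy logits in place of real RoBERTa-NLI outputs.

def difficulty_score(entail_logit: float, contra_logit: float) -> float:
    """Average of the entailment logit for <simile, literal> and the
    contradiction logit for <simile, contradiction>; lower = harder."""
    return (entail_logit + contra_logit) / 2.0

# (instance id, entailment logit, contradiction logit) -- toy values
instances = [
    ("ironic-simile", -2.1, -1.8),  # model confuses the two labels
    ("plain-simile", 3.5, 4.0),     # model handles this pair easily
]

# Instances with the lowest average score are kept as "challenging".
ranked = sorted(instances, key=lambda x: difficulty_score(x[1], x[2]))
hardest = ranked[0][0]
```

Under this scoring, ironic similes surface first because the model assigns low scores to both the correct entailment and the correct contradiction.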

A.2.3 Prompts for Simile Paraphrase Generation
You will be presented with examples of some literal input sentences and their creative paraphrases. You will also be presented with words that need to be preserved. Your task is to generate a creative paraphrase for a given literal sentence, consistent in meaning. DO NOT CHANGE words after the "Preserve:" keyword.
1. Sentence: overwhelmingly, it began to draw him in.
Preserve: overwhelmingly
Creative Paraphrase: He was overwhelmingly obsessed with it. . . .

A.2.4 Prompts for Simile Contradiction Generation
You will be presented with a Sentence and a Property. Your goal is to invert the meaning of the Sentence with respect to the Property via a minimal edit.
1. Sentence: The place looked impenetrable and inescapable.
Property: impenetrable and inescapable
Inversion: This place looked easy to walk into and exit from. . . .

A.2.5 Prompts for Simile Contradiction Explanation Generation
You will be provided with a Simile and a contradictory sentence after the word "Contradiction". Your task is to explain why the contradictory sentence contradicts the Simile.
1. Simile: like a psychic whirlpool, it began to draw him in.
Contradiction: Mildly, it began to draw him in
Explanation: A whirlpool is a strong current, so a psychic whirlpool drawing him in indicates that it was drawing him in intensely, rather than mildly. . . .

A.3 Prompts for Simile Entailment Explanation Generation
You will be presented with a sentence containing a simile (Simile Sentence) and an entailing sentence (Entail Sentence). Please provide an explanation for why the Simile Sentence is implied by the Entail Sentence.
1) Simile Sentence: The place looked like a fortress
Entail Sentence: The place looked impenetrable and inescapable
Explanation: A fortress is a military stronghold, hence it would be very hard to walk into, or in other words impenetrable. . . .

A.4.2 Prompts for Metaphor Paraphrase Generation
You will be presented with examples of some metaphor input sentences and their creative paraphrases. Your task is to generate a creative paraphrase for a given literal sentence consistent in meaning.
1. Sentence: A golden sun shines high in the sky.
Creative Paraphrase: a very bright sun shines high in the sky. . . .

A.4.3 Prompts for Metaphor Contradiction Generation
You will be presented with examples of some literal input sentences and their contradictions. Your task is then to generate a contradiction of the new sentences via a minimal edit.
1. Sentence: The company released him after many years of service.
Contradiction: The company hired him after many years of service. . . .

A.4.4 Prompts for Metaphor Entailment Explanation Generation
You will be presented with examples of sentences containing a metaphor along with an entailing paraphrase of the sentence. Your task is to generate natural language explanations to justify the Entailment with the input.
1. Metaphor: Krishna is an early bird.
Entailment: Krishna wakes up early every day.
Explanation: Early bird means a person who wakes up early in the morning. . . .

A.4.5 Prompts for Metaphor Contradiction Explanation Generation
You will be provided with a sentence containing a Metaphor and a contradictory sentence after the word "Contradiction". Your task is to explain why the contradictory sentence contradicts the Metaphor.

A.5 Idiom dataset
A.5.1 Hyperparameters for the idiom dataset
We use the GPT-3-Davinci-002 model for idiom data generations. To jointly generate the paraphrase and contradiction, we use the following hyperparameters: temperature=1, max_tokens=200, top_p=0.9, best_of=1, frequency_penalty=0.5, presence_penalty=0.5, stop=[".."].
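For reference, the hyperparameters above correspond to a completion request along the following lines (a minimal sketch against the legacy OpenAI completions interface; the `build_idiom_request` helper and the placeholder prompt are illustrative, and no request is actually issued here):

```python
# Hypothetical request parameters mirroring the hyperparameters above.
# This only builds the parameter dict; actually sending it would need
# an API key and the (legacy) OpenAI completions endpoint.
def build_idiom_request(prompt: str) -> dict:
    return {
        "model": "text-davinci-002",    # GPT-3-Davinci-002
        "prompt": prompt,
        "temperature": 1,
        "max_tokens": 200,
        "top_p": 0.9,
        "best_of": 1,
        "frequency_penalty": 0.5,
        "presence_penalty": 0.5,
        "stop": [".."],                 # truncate at the ".." delimiter
    }

params = build_idiom_request("Sentence: ...\nIdiom: ...\nParaphrase:")
```

The ".." stop sequence matches the delimiter used at the end of each in-context example, so generation halts after one paraphrase/negation pair.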

A.6 Prompts for joint paraphrase and contradiction generation for idioms
You will be presented with examples of some input sentences containing an idiom. You will be provided with the meaning of the idiom. Your task is to first generate a paraphrase that complies with the meaning of the idiom, and then generate a negation of the paraphrase that contradicts the meaning of the idiom. Please look at the span within bold tags when performing the paraphrase and negation.
1) Sentence: He looked great, and he was smiling <b>to beat the band</b>.

Idiom: to beat the band
Meaning: To a huge or the greatest possible extent or degree.
Paraphrase: He looked awesome and was smiling <b>hilariously in an uncontrollable manner</b>
Negation: He looked awesome and was smiling <b>in a very coy and restrained manner</b>. . . .

A.6.1 Prompts for joint generation of Explanation for idioms
You will be presented with examples of sentences containing an idiom along with an entailing and a contradictory paraphrase of the sentence. Your task is to generate natural language explanations to justify the Entailment or Contradiction with the input.
1) Sentence: Not to share the bank with the table, or to take some minor part of it, but to go the whole hog.

Idiom: go the whole hog
Meaning: To do something as thoroughly as possible or without restraint.
Entailment: Not to share the bank with the table, or to take some minor part of it, but to take it all for themselves without any restraint.
Contradiction: Not to share the bank with the table, or to take some minor part of it, but to show some restraint and not go overboard.
Entail_Explanation: Usually to go the whole hog refers to doing something as thoroughly as possible, taking it all for oneself or without any restraint.
Contra_Explanation: Usually to go the whole hog refers to doing something as thoroughly as possible, without any sort of restraint, and is often characterized by being extreme or overboard.
2) Entail_Explanation: To go through the mill in the context here means that vanity publishing will abuse and treat the work of literature very poorly.
Contra_Explanation: Usually when we say go through the mill it does not mean something being celebrated and treated well, but instead being abused and treated poorly, which is what is being said here about what vanity publishing does to the work of literature. . . .

B Details of Human evaluation
We follow Kayser et al. (2021) in all the human evaluation procedures below. Refer to Figure 4 for an example of the interface for crowd workers. We first ask the crowd workers to identify the relationship between the literal and the figurative sentence (whether it is a contradiction or an entailment). Using in-browser checks, we ensure that the crowd workers understood the NLI pair by only accepting the submission if the relationship was identified correctly.
Then, we provide two explanations: one generated by T5 FLUTE and one generated by T5 e-SNLI. The crowd workers do not know which one is which. For each NLE, we ask: Given the two sentences, does the explanation justify the answer above?, and provide four options: Yes, Weak Yes, Weak No, and No. We also ask workers to describe the shortcomings of an explanation if they selected a score lower than Yes. The workers have the following options to choose from, following prior work (Majumder et al., 2021; Kayser et al., 2021): Violates Common Sense, Insufficient Justification, Untrue to Input, Too Trivial, Too Verbose. Some examples of these shortcomings are provided in Table 8. In addition to Figure 2, we provide the bar plot of the number of shortcomings as a percentage of the sample by figurative language type in Figure 5.
We map the answers to 1, 2/3, 1/3, and 0, respectively. Then, we compute the average score across the 3 workers per entry, and the sample average per figurative language type for the corresponding model.
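The aggregation above can be sketched as follows (a minimal sketch; the `SCORE` mapping follows the values stated in the text, while the three sample ratings are hypothetical):

```python
# Sketch of the crowd-worker score aggregation described above.
from statistics import mean

# Answer-to-score mapping: Yes -> 1, Weak Yes -> 2/3, Weak No -> 1/3, No -> 0
SCORE = {"Yes": 1.0, "Weak Yes": 2 / 3, "Weak No": 1 / 3, "No": 0.0}

def entry_score(answers):
    """Average score across the workers rating one entry."""
    return mean(SCORE[a] for a in answers)

# Hypothetical ratings from three workers for one explanation:
s = entry_score(["Yes", "Weak Yes", "Weak No"])  # -> 2/3
```

Per-type model scores are then sample averages of these per-entry scores.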

Violates Common Sense
Premise: The gesture was awkward.
Hypothesis: The gesture was like a ballet move
Label: Contradiction
T5 e-SNLI explanation: A gesture is not a ballet move.
Gold explanation: A ballet move is graceful and fluid, while an awkward gesture is not graceful or fluid.

Insufficient Justification
Premise: I stayed in a house with a rat problem for a while, it was not fun.
Hypothesis: I simply cannot wait to stay in a house again that had a rat problem because it was not only a lot of fun but also super enjoyable
Label: Contradiction
T5 e-SNLI explanation: It was either not fun or not fun.
Gold explanation: Rats are often associated with being filthy and unhygienic and so staying in a house with rat problems would not be fun at all.

Untrue to Input
Premise: Even though he was strong in his magic, it felt weak compared to the inferno I'd faced when I killed Mab.
Hypothesis: He was strong in his magic but, compared to the blazing inferno that I'd faced when I'd killed Mab, his power felt like a candle flame.
Label: Entailment
T5 e-SNLI explanation: The candle flame is the inferno.
Gold explanation: A candle flame is a very small flame, which is weak compared to a blazing inferno.

Too Trivial
Premise: He ran fast.
Hypothesis: He ran like an olympic sprinter.
Label: Entailment
T5 e-SNLI explanation: He ran fast like an olympic sprinter.
Gold explanation: An olympic sprinter is a very fast person, so if he ran like a sprinter, it would mean that he ran fast.

Too Verbose
Premise: She asked the witnesses for information.
Hypothesis: She petitioned the witnesses for information.
Label: Entailment
T5 e-SNLI explanation: She asked the witnesses for information and petitioned the witness.
Gold explanation: Petitioning someone means asking them for information, which is what she did in this sentence.

Figure 2: Bar plot of the number of crowd worker-identified shortcomings of explanations generated by T5 e-SNLI and T5 FLUTE, by shortcoming and model type, as percent of the sample (lower means fewer shortcomings). An extended plot by figurative language type is available in Appendix, Figure 5.

2) Literal: We have enough candles mom
Emotion: annoyed
Creative Paraphrase: I think the Catholic church is going to have to canonize a whole new generation of saints to justify our candle use mom . . .

A.1.3 Prompts for generation of Explanation for Paraphrase from which Sarcasm is created (Entailment)

Figure 5: Bar plot of the number of crowd worker-identified shortcomings of explanations generated by T5 e-SNLI and T5 FLUTE, by type of shortcoming, figurative language type, and type of model, as percent of the sample (lower means fewer shortcomings).

Table 2: Dataset statistics showing the distribution of figurative language across FLUTE.


Table 4: Rationale quality (p < 0.001 via Wilcoxon signed-rank test; higher is better) on the FLUTE test set, using the accuracy of IR → O and I → O models trained on e-SNLI and FLUTE, respectively. We use the predicted rationale R obtained from the respective I → OR models T5 e-SNLI and T5 FLUTE, as well as the gold rationale (R*). Rationale Quality and accuracy are abbreviated as RQ and Ac.

Table 5: Examples of T5 e-SNLI and T5 FLUTE model-generated explanations vs. gold explanations for NLI involving metaphor (top) and sarcasm (bottom). More examples in Table 7 in the Appendix.
1. Metaphor: Joseph has the heart of a lion.

Table 6: Examples of metaphors and their contradictions (from prior work and generated for this paper). Note that examples from prior work alter the metaphor sentence by adding words that fit its context, whereas in this work we generate examples that contradict the metaphor.

2) Sentence: I told her almost everything, including my reticence about seeing this work of literature go through the mill which was vanity publishing.
Idiom: go through the mill

Table 8: Examples of shortcomings of T5 e-SNLI explanations. For this table, T5 e-SNLI explanations were sorted by the most crowd worker votes for the respective shortcoming.