Figurative Language in Recognizing Textual Entailment

We introduce a collection of recognizing textual entailment (RTE) datasets focused on figurative language. We leverage five existing datasets annotated for a variety of figurative language -- simile, metaphor, and irony -- and frame them into over 12,500 RTE examples. We evaluate how well state-of-the-art models trained on popular RTE datasets capture different aspects of figurative language. Our results and analyses indicate that these models might not sufficiently capture figurative language, struggling to perform pragmatic inference and reasoning about world knowledge. Ultimately, our datasets provide a challenging testbed for evaluating RTE models.

Recognizing Textual Entailment (RTE), the task of identifying whether one sentence (context) likely entails another (hypothesis), is often used as a proxy to evaluate how well Natural Language Processing (NLP) systems understand natural language (Cooper et al., 1996; Dagan et al., 2006; Bowman et al., 2015). Figurative language is defined as any figure of speech which depends on a non-literal meaning of some or all of the words used. Thus, understanding figurative language can be framed as an RTE task (figurative language expression vs. intended meaning), where the figurative language expression is the context and the intended meaning is the hypothesis in an RTE framework (see examples in Table 1).

Simile    ◮ I start to prowl across the room like a tightrope walker on dental floss.
            I start to prowl across the room recklessly. ✗
          ◮ They had shut him in a basement that looked like a freight elevator.
            They had shut him in a basement that looked dangerously claustrophobic. ✓
Metaphor  ◮ He weathered the costs for the accident.
            He avoided the costs for the accident. ✗
          ◮ The bus bolted down the road.
            The bus paced down the road. ✓
Irony     ◮ Made $174 this month, gonna buy a yacht!
            I don't make much money. ✗
          ◮ Fans seem restless, gee, don't understand them.
            Fans seem restless - don't know the reason behind it. ✓

Table 1: Example RTE pairs focused on similes, metaphors, and irony that RoBERTa incorrectly labels. ◮ indicates a context and the following sentence is its corresponding hypothesis. ✓ and ✗ respectively indicate that the context entails, or does not entail, the hypothesis. Bold text represents similes and metaphors, and italics represent their entailed/not-entailed interpretations (top two rows).
We investigate how suitable state-of-the-art RTE models trained on current RTE datasets are for capturing figurative language. We focus on three specific types of figurative language: similes, metaphors, and irony. Similes evoke comparisons between two seemingly different objects, metaphors expand the imagination beyond the literal narrative, and irony conveys the opposite of what is said.
We leverage five existing datasets annotated for these types of figurative language to create over 12,500 RTE examples that require understanding or identifying these phenomena. We evaluate how well standard neural RTE models capture these aspects of figurative language. Our results demonstrate that, although systems trained on a popular RTE dataset may capture some aspects of various types of figurative language, they fail on cases where the interpretation relies on pragmatic inference and reasoning about world knowledge. We release the code and the data at https://github.com/tuhinjubcse/Figurative-NLI.
We are not the first to explore figurative language in RTE. Agerri (2008) analyzes examples in the Pascal RTE-1 (Dagan et al., 2006) and RTE-2 (Bar-Haim et al., 2006) datasets that require understanding metaphors, and Agerri et al. (2008) present an approach for RTE systems to process metaphors. Poliak et al. (2018)'s diverse collection of RTE datasets includes examples based on figurative language, but focuses only on identifying puns.

Dataset Creation
We create RTE test sets that focus on similes, metaphors, and irony. We provide further background for these types of figurative language and describe the methods used for creating these test sets. Table 2 reports the final test sets' statistics.

Simile
Comparisons are inherent linguistic devices that express the likeness of two entities, concepts, or ideas. When used figuratively, comparisons are called similes. Similes are used to spark the reader's imagination by making descriptions more emphatic or vivid (Paul et al., 1970). Similes use a common PROPERTY to compare two concepts, often referred to as the TOPIC (the logical subject) and the VEHICLE (the logical object of comparison). For example, in the simile "Love is like a unicorn", love (TOPIC) is compared to a unicorn (VEHICLE), portraying the implicit property "rare". Recently, Chakrabarty et al. (2020) released a test set of 150 literal sentences from the subreddits r/WritingPrompts and r/Funny, each aligned with two human-written paraphrases with similes that retain the original meaning.
To create our RTE test set that focuses on similes, we treat these simile-literal aligned sentences as entailed context-hypothesis pairs. Given a literal input, "They had shut him in a basement that looked dangerously claustrophobic", an expert annotator re-framed it as "They had shut him in a basement that looked like a freight elevator". We use the original (literal, simile) pairs as Entailed examples and create Not-Entailed examples by replacing the literal verb/property with its antonym. For instance, given an existing context-hypothesis pair expressing Entailment, "An ordinary citizen coming to power in this way is like a green moon." → "An ordinary citizen coming to power in this way is unprecedented", we alter "unprecedented" to "common" to create a Not-Entailed (NE) instance.
Metaphor

Understanding metaphors requires comprehending abstract concepts and making connections between seemingly unrelated ideas to appropriately deviate from literal meaning (Gutierrez et al., 2016; Mohammad et al., 2016; Kintsch and Bowles, 2002; Glucksberg, 1998). When generating metaphoric paraphrases, Chakrabarty et al. (2021) created a diverse test set of 150 literal sentences curated from different domains and genres and asked two expert annotators to create metaphorical sentences, resulting in a total of 300 metaphorical examples. The expert annotators re-framed the literal sentences independently by replacing the literal verb with a metaphorical verb. For instance, an expert re-framed the literal sentence "The tax cut will help the economy" as "The tax cut will fertilize the economy".
Since the most frequent type of metaphor is expressed by verbs (Martin, 2006; Steen, 2010), these literal and metaphorical paraphrases differ only by the verb they use. In an RTE framework, we treat these metaphorical-literal pairs as entailed context-hypothesis examples. To create Not-Entailed examples, we generate hypotheses by manually swapping the literal verb in the entailed hypothesis with its antonym. Note that for both similes and metaphors, automatic substitution using available lexicons is problematic, as it often leads to ungrammatical sentences; manually replacing the words with their antonyms guarantees a high-quality test set. The Not-Entailed examples created with antonyms for similes and metaphors contain both Neutral and Contradiction classes. Lexical replacement using antonyms naturally yields high-quality contradiction examples. In contrast, creating neutral examples by lexical perturbation is challenging and, if not done properly, can lead to grammatical errors or incoherent sentences.
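To make the recasting concrete, the following is a minimal sketch of the construction for similes and metaphors, assuming aligned (figurative, literal) pairs and a small hand-curated antonym map as stand-ins for the manual expert annotation described above; the data structures, the "help" → "hurt" substitution, and variable names are illustrative, not our released code.

```python
# A minimal sketch of recasting aligned (figurative, literal) pairs into
# RTE examples. The aligned pairs and the antonym map are illustrative
# stand-ins; in the actual datasets both the paraphrases and the antonym
# substitutions were produced manually by expert annotators.
from typing import NamedTuple

class RTEExample(NamedTuple):
    context: str     # the figurative sentence (simile or metaphor)
    hypothesis: str  # a literal interpretation of the context
    label: str       # "entailed" or "not_entailed"

aligned_pairs = [
    ("An ordinary citizen coming to power in this way is like a green moon.",
     "An ordinary citizen coming to power in this way is unprecedented."),
    ("The tax cut will fertilize the economy.",
     "The tax cut will help the economy."),
]

# Hand-curated antonyms for the literal property/verb. Automatic lexicon
# lookup is avoided because it often produces ungrammatical sentences.
# ("help" -> "hurt" is a hypothetical entry added for illustration.)
antonyms = {"unprecedented": "common", "help": "hurt"}

examples = []
for figurative, literal in aligned_pairs:
    # The original aligned pair is an Entailed example.
    examples.append(RTEExample(figurative, literal, "entailed"))
    # Flipping the literal verb/property yields a Not-Entailed example.
    # (Coarse substring match; the real substitutions were done by hand.)
    for word, antonym in antonyms.items():
        if word in literal:
            flipped = literal.replace(word, antonym)
            examples.append(RTEExample(figurative, flipped, "not_entailed"))

for ex in examples:
    print(f"[{ex.label}] {ex.context} -> {ex.hypothesis}")
```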

Irony
When using irony, speakers usually mean the opposite of what they say (Sperber and Wilson, 1981; Dews et al., 2007). We develop different test sets focusing on whether RTE models should understand the conveyed meaning of ironic examples or should identify the speaker's ironic intent (i.e., identify whether an utterance is ironic), given the hypothesis that the speaker was ironic.
Understanding Ironic Meaning (IMeaning). Peled and Reichart (2017) used skilled annotators to create a parallel dataset of tweets with verbal irony and their non-ironic rephrasings (15K pairs). Annotators also had the option to copy the original tweet or to merely paraphrase it, in case the ironic intent was not easy to identify. Likewise, Ghosh et al. (2020) released a parallel dataset of speakers' ironic messages (S_im) and hearers' interpretations (H_int) of the speaker's intended meaning. This dataset (S_im-H_int) contains 4,761 ironic-literal pairs. We use both datasets in our experiments and henceforth denote them as SIGN and S_im-H_int, respectively. For both datasets, the original ironic messages are treated as the contexts and the intended meanings as the hypotheses. However, not all RTE contexts contradict their corresponding hypotheses. For instance, Peled and Reichart (2017) allowed annotators not to rephrase the ironic sentences with their opposite intended meanings (in case the sarcastic or ironic intent was not clear). Thus, for evaluation purposes (see Table 4), we annotated a subset of 2,000 random pairs from SIGN and evaluated the RTE models on that subset.

Experimental Setup

We evaluate three RTE models: NBoW, InferSent, and RoBERTa (Liu et al., 2019). In NBoW, word embeddings for contexts and hypotheses are averaged separately, and their concatenation is passed to a logistic regression softmax classifier. InferSent encodes the context and hypotheses independently using a BiLSTM, then their sentence representations are fed to an MLP. For RoBERTa, we use the model fine-tuned on MNLI from the Transformers library (Wolf et al., 2020). We expect models trained on MNLI to capture some forms of figurative language that often appear in works of fiction, conversations, speeches, and magazines like Slate.

Table 4 reports the models' accuracy on our figurative language RTE datasets. We observe that for similes, metaphors, and ironic meaning, RoBERTa-large drastically outperforms the other two models. For the irony datasets, NBoW outperforms InferSent. While all models perform poorly on IIntent, InferSent's very low accuracy stands out. The low performance might be due to the templatic nature of this recast dataset, which might be very different from the MNLI training data. We now turn to an in-depth analysis of RoBERTa's performance across these datasets.
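As a concrete illustration of this setup, here is a minimal sketch of scoring one context-hypothesis pair with a RoBERTa-large checkpoint fine-tuned on MNLI via the Transformers library; the public roberta-large-mnli checkpoint is assumed, and the example pair is taken from Table 1.

```python
# A minimal sketch of scoring a context-hypothesis pair with RoBERTa-large
# fine-tuned on MNLI, using the Hugging Face Transformers library.
# Assumes the public "roberta-large-mnli" checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
model.eval()

context = "The bus bolted down the road."    # metaphorical context
hypothesis = "The bus paced down the road."  # literal hypothesis

# The tokenizer pairs the two sentences the way the model expects.
inputs = tokenizer(context, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1).squeeze()

# Label order for this checkpoint: 0=contradiction, 1=neutral, 2=entailment.
for label, p in zip(["contradiction", "neutral", "entailment"], probs.tolist()):
    print(f"{label}: {p:.3f}")
```

For the two-way Entailed/Not-Entailed evaluation used here, the contradiction and neutral probabilities can be collapsed into a single not-entailed score.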

Results and Discussions
Ironic Meaning. RoBERTa-large attains over 90% accuracy on the two datasets focused on ironic meaning. When analyzing these examples, we find that a vast majority of the hypotheses in both datasets use lexical antonyms ("flattering" ↔ "disgusting") or negation ("is great" ↔ "is not great") to represent the intended meaning. Thus, the presence of antonyms might be enough for RoBERTa to correctly predict that the hypothesis is not entailed by the context. However, this does not hold for hypotheses where the intended meanings were represented via more complex rephrasing. Ghosh et al. (2020) conducted a thorough study of the linguistic strategies that annotators used for the rephrasing tasks. They presented a linguistically motivated typology of the strategies (e.g., "Lexical and phrasal antonyms", "Negation", "Weakening the intensity of sentiment", "Interrogative to Declarative Transformation", "Counterfactual Desiderative Constructions", and "Pragmatic Inference") and empirically validated the strategies over the SIGN and S_im-H_int datasets. During our analysis, we observe that for the vast majority of cases where RoBERTa predicts incorrectly, the examples contain rhetorical questions ("nice having finals on birthday?" ↔ "do not like finals . . . "), pragmatic inferences ("Made $174 this month . . . a yacht!" ↔ "I don't make much money"), or desiderative constructions of [I wish] (that) ("glad you related the news" ↔ "[I wish] that you have told me sooner"). We also observe that RoBERTa-large's predictions are regularly incorrect when the ironic messages contain certain irony markers (Ghosh and Muresan, 2018), such as metaphor ("shoe smell like bed of roses" ↔ "smells bad"), alternate spellings where the speaker overstates the magnitude of an ironic event ("dancing in heels is grrrrreat" ↔ ". . . hurts your feet"), or hashtags composed of multi-word expressions that capture the irony ("god bless you . . . #notinthemood").

Simile    ◮ Your guardian angel is just a little too much like a nerd at a comic convention.
            Your guardian angel is just a little too enthusiastic.  Gold: ✓  Pred: ✗
          ◮ Growing up, people always thought you were like a social pariah.
            Growing up, people always thought you were ordinary.  Gold: ✗  Pred: ✓
          ◮ They all agree the books are good reads, but they are like pseudo science fiction.
            They all agree the books are good reads, but they are too unrealistic.  Gold: ✓  Pred: ✗
Metaphor  ◮ The smell of smoke carpeted on the delinquent.
            The smell of smoke took off on the delinquent.  Gold: ✗  Pred: ✓
          ◮ As they strike the ground, they are effaced.
            As they strike the ground, they are remembered.  Gold: ✗  Pred: ✓
          ◮ The avalanche pulverized anything standing in its way.
            The avalanche protected anything standing in its way.  Gold: ✗  Pred: ✓
Irony     ◮ Life was never been perfect and would never be.
            Life has never been perfect and would never be.  Gold: ✓  Pred: ✗
          ◮ The highlight of my day figuring out how to make contact sheets . . . such a boring life.
            My entire day was occupied in making contact sheets in design such a waste.  Gold: ✓  Pred: ✗
          ◮ Gotta read 70ish+ pages today #great #mysundayfunday #thisshouldbefun.
            I have to read 70ish+ pages today. This is bad.  Gold: ✗  Pred: ✓

Table 5: Examples from our Simile, Metaphor, and Irony datasets where RoBERTa-large fine-tuned on MNLI fails to classify the sentence pairs correctly. Gold and Pred denote the true label and the predicted label, respectively. ◮ indicates a context and the following sentence is its corresponding hypothesis. ✓ and ✗ respectively indicate that the context entails, or does not entail, the hypothesis.
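The antonym observation above can be probed programmatically. Below is a small, hypothetical analysis helper (not part of our released code) that flags context-hypothesis pairs containing a WordNet antonym pair, assuming NLTK with the WordNet data installed:

```python
# A hypothetical probe for the antonym analysis above: flag
# context-hypothesis pairs that contain a WordNet antonym pair.
# Assumes: pip install nltk, then nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def has_antonym_pair(context: str, hypothesis: str) -> bool:
    """Return True if some hypothesis word has a WordNet antonym
    appearing among the context words (a coarse lexical check)."""
    ctx_words = {w.strip(".,!?").lower() for w in context.split()}
    for word in hypothesis.lower().split():
        word = word.strip(".,!?")
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                for antonym in lemma.antonyms():
                    if antonym.name().lower() in ctx_words:
                        return True
    return False

# Pairs whose hypotheses rely on antonyms are flagged; pairs that need
# pragmatic inference (like the yacht example above) are not.
print(has_antonym_pair("The service was fast.", "The service was slow."))  # True
print(has_antonym_pair("Made $174 this month, gonna buy a yacht!",
                       "I don't make much money."))                        # False
```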
Simile. Likewise, for the simile dataset, we notice that RoBERTa-large often fails to reason with implicit knowledge about the physical and visual world (Table 5). This is in line with Weir et al. (2020)'s finding that transformer-based contextual language models poorly capture knowledge grounded in visual perception. For example, RoBERTa-large incorrectly predicts that the context "You wake one morning to find your entire family lying like gray slabs of cement" does not entail the hypothesis "You wake one morning to find your entire family lying unconscious". Nevertheless, RoBERTa-large correctly predicts that "my eyes teared up . . . turning like a ripening tomato" entails "my eyes teared up . . . face turning red". We hypothesize that here RoBERTa-large was able to identify the association between "ripening tomato" and "red", which resulted in the correct prediction.

Conclusion
To understand the figurative language inference capabilities of RTE models, we introduce datasets adapted from existing corpora focusing on similes, metaphors, and irony. By testing models trained on MNLI, we find that while the RoBERTa-large model is able to capture some aspects of figurative language, it fails when the interpretation requires world knowledge and pragmatic inference. We hope this work will spark additional interest in the research community to incorporate and test for figurative language in their NLU systems.