Investigating Robustness of Dialog Models to Popular Figurative Language Constructs

Humans often employ figurative language in communication, including during interactions with dialog systems. Thus, it is important for real-world dialog systems to be able to handle popular figurative language constructs like metaphor and simile. In this work, we analyze the performance of existing dialog models in situations where the input dialog context exhibits figurative language use. We observe large gaps in the handling of figurative language when evaluating the models on two open domain dialog datasets. When faced with dialog contexts containing figurative language, some models show very large drops in performance compared to contexts without figurative language. We encourage future research in dialog modeling to separately analyze and report results on figurative language in order to better test model capabilities relevant to real-world use. Finally, we propose lightweight solutions that help existing models become more robust to figurative language by simply using an external resource to translate figurative language to literal (non-figurative) forms while preserving the meaning to the best extent possible.


Introduction
Humans frequently employ figurative language such as metaphors (Carbonell, 1982) and idioms (Jackendoff, 1995) for effective and/or stylistic communication. Thus, dialog models interacting with humans should be equipped to handle these forms of communication. However, understanding figurative language can be challenging for machines, since figurative constructions often exhibit non-compositional semantics and may rely on shared cultural and common-sense knowledge (Carbonell and Minton, 1983). For example, a powerful GPT2 model fine-tuned on the DailyDialog dataset is unable to handle the metaphor 'built on the sand' (Figure 1), and its response seems to rely on the unintended literal sense of 'sand'.

* HJ and VG contributed equally to this paper. Order decided by coin flip.
In this work, we investigate the performance of existing dialog models when faced with inputs containing figurative language. (1) First, we identify the subsets of existing datasets (such as DailyDialog (Li et al., 2017) and PersonaChat (Zhang et al., 2018)) which exhibit figurative language use such as metaphors and similes. We observe that the performance of all the dialog models under consideration is lower on these figurative subsets than on the dataset as a whole. (2) Second, we gather manually written literal (non-figurative) equivalents of the dialog utterances in DailyDialog and PersonaChat which exhibit figurative language use. For example, a literal version of 'on the sand' can be 'unstable' (Figure 1). We observe that the performance of dialog models improves when these literal equivalents are used in place of the figurative language. We release the resulting datasets, and encourage that new dialog models be tested separately on such datasets to understand and measure their ability to handle figurative language.
(3) Finally, we propose a simple defense against occurrences of figurative language in dialog contexts. More specifically, we use existing classifiers to detect the presence of certain types of figurative language in dialog contexts, and use dictionary lookups to transform them into their literal counterparts before feeding them to the dialog models. The proposed technique is lightweight, does not require any retraining of the models, and is effective, though gaps still remain, leaving scope for interesting future explorations.

Figurative Language In Open Domain Dialog
We experiment with the DailyDialog (DD) dataset (Li et al., 2017), an open domain dialog corpus with 13.1K conversations on colloquial topics like Tourism, Health, etc., of which 1K dialogs (6.74K utterances) form the test split. To carry out the desired analysis, we first need to identify the utterances in the dataset that contain figurative language. To achieve high-precision labeling, we rely on manual annotations instead of external figurative language detectors/classifiers. The task was performed manually by two graduate students (native English speakers) studying in a university with English as the primary language of instruction. Additionally, we request the annotators to also write down literal equivalents of the utterances containing figurative language. We release the resulting subset of DailyDialog as DailyDialog-Figurative (DD-Fig), consisting of those dialog instances which contain figurative language, along with two manually written literal versions of each such utterance. Though figurative constructs are only mildly frequent at the utterance level (2.2% in DD), their frequency of occurring at least once in a dialog is ≈6 times higher (13.1%). This means that model sensitivity to figurative constructs is a more significant issue than the mere utterance-level frequency suggests, since figurative constructs occurring anywhere in a dialog potentially affect all model responses for that dialog. Additionally, handling figurative constructs is still critical to robust long-tailed performance (Bamman, 2017), which matters for worst-case behaviour and user satisfaction (Goel et al., 2010; Ilievski, 2019).

We evaluate several dialog models on DD, such as Seq2Seq and CVAE baselines (Lowe et al., 2015), and GPT2-medium (Radford et al., 2019) fine-tuned on DD. To report automated metrics, we use the multi-reference annotation set collected by Gupta et al. (2019). However, automated metrics may not be well correlated with output response quality (Sai et al., 2020; Gangal et al., 2021). Therefore, we also carry out human evaluations, wherein human annotators (on Amazon Mechanical Turk) are asked to judge the appropriateness of a dialog response on a 1-5 Likert scale.

We observe that most of the models perform much worse on DD-Fig (Table 1). We also note some substantial changes in the relative ranks of the models when evaluating only on the subset with figurative language use compared to the entire test split. For instance, as per Meteor and human ratings, Seq2Seq performs better than CVAE on the figurative subset, while doing worse on the complete test dataset. Such changes in relative ranks further highlight the need to separately report results on the proposed data subsets. Interestingly, Seq2Seq improves its relative rankings in general on the DD-Fig subset; we hypothesize this is because it generates very generic responses with very little information overlap with the context.
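As a minimal sketch of how the utterance-level and dialog-level frequencies relate (the data layout here is illustrative, not the released DD-Fig format):

```python
def figurative_frequency(dialogs):
    """Compute utterance-level and dialog-level figurative frequency.

    `dialogs` is an assumed list of dialogs, each a list of
    (utterance_text, is_figurative) pairs from the manual annotation.
    """
    n_utts = sum(len(d) for d in dialogs)
    n_fig_utts = sum(is_fig for d in dialogs for _, is_fig in d)
    n_fig_dialogs = sum(any(is_fig for _, is_fig in d) for d in dialogs)
    return {
        "utterance_level": n_fig_utts / n_utts,        # ~2.2% for DD
        "dialog_level": n_fig_dialogs / len(dialogs),  # ~13.1% for DD
    }
```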
Does replacing figurative language with semantically equivalent literal translations lead to better performance? The above analysis only reports a correlation between performance and the presence of figurative language in the context; there could be confounding factors involved. Thus, to make a more direct comparison, we next compare results when using figurative contexts versus their literal counterparts. To perform this experiment, we utilize the human-written literal versions in DD-Fig, and experiment with the GPT2 model (the best performing model as per human rating on the overall dataset). We report results under two setups: (1) when figurative language is present in the last utterance of the dialog history, and (2) when figurative language is present anywhere in the dialog history. Human ratings are collected using the same procedure as described for Table 1. Table 2 shows the main results. For some metrics such as Bleu-4, models perform more than 5 times better when fed literal translations instead of figurative language. Between the two setups, we observe a slightly higher impact (as per human evaluation ratings) when one or more figurative constructs occur anywhere in the dialog history.
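A rough sketch of how the two setups and the literalized control inputs can be assembled from DD-Fig (the field names are hypothetical, not the released schema):

```python
# Assumed instance layout: each dialog history is a list of dicts holding the
# original "figurative" text and a human-written "literal" rewrite
# (None if the utterance contains no figurative language).

def build_setups(instances):
    """Setup (1): figurative construct in the last utterance; setup (2): anywhere in the history."""
    setup1 = [x for x in instances if x["history"][-1]["literal"] is not None]
    setup2 = [x for x in instances if any(u["literal"] is not None for u in x["history"])]
    return setup1, setup2

def literalized_history(instance):
    """Swap each figurative utterance for its literal rewrite to form the control context."""
    return [u["literal"] or u["figurative"] for u in instance["history"]]
```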
Experiments with PersonaChat (PC): PersonaChat (Zhang et al., 2018) is a persona-grounded dialog corpus, with 1K dialogs (7.8K utterances) forming the test split. We consider a GPT model fine-tuned on the train split of the PC dataset. We follow the same tagging and annotation procedure for the test split, and refer to the resulting dataset as PersonaChat-Figurative (PC-Fig). Results in Table 2 demonstrate a reduction in the performance of the dialog model under human evaluation. (Automated overlap metrics are considered unreliable in the case of PC, since PC contains only one reference per dialog context.)
We notice that, compared to DailyDialog, PersonaChat utterances tend to be shorter, more informal, and highly spoken-language-like, with fast topic transitions. When figurative language in such contexts is replaced with its literal counterpart, the inserted literal text, which is typically English of a more formal and written variety, ends up much more out of sync with the context in the lexico-syntactic/stylistic sense than it does for DailyDialog. This has a slight downward effect on the metrics, offsetting some of the gains from replacing the figurative language.

Mitigation
We propose a lightweight mitigation approach wherein we use existing resources to detect, and then construct literal translations for, two popular figurative constructs: metaphors and idioms. The proposed mitigation approach therefore does not require any retraining of dialog models.

Metaphor Detection Through a Trained Classifier: To better generalize to external data via recent contextual models, we skip using the model of Gao et al. (2018), and instead learn a classifier C_bert^met by fine-tuning the bert-base-uncased (Devlin et al., 2018) checkpoint from Wolf et al. (2019). On the VUA metaphor corpus, C_bert^met achieves a test F1 of 0.724, close to the 0.726 reported by Gao et al. (2018). Next, we run each test utterance in the dialog dataset through C_bert^met to obtain its probability p_met of being metaphorical. To retain only the more reliable predictions, especially considering the domain shift w.r.t. VUA, we keep only utterances with p_met > 0.9. The set of metaphorical utterances thus identified is D_auto^met.

Idiom Detection Through Lexicon: Idioms are frequently used expressions with a fixed, typically non-compositional meaning understood by most native speakers through cultural precedent. We curate a lexicon of 2048 commonly used idioms (e.g., filling his shoes) from an online source (https://www.englishclub.com/ref/Idioms/; see Appendix A.2 for more details). All utterances containing at least one lexicon entry as a substring form the set of automatically detected idiomatic utterances, D_idiom^auto.

We take the union of the two sets to form D_auto^fig = D_auto^met ∪ D_auto^idiom. D_auto^fig constitutes 1520 of 6740 utterances for DD (22.5%) and 911 of 7801 utterances for PC (11.7%).
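A condensed sketch of this detection step, assuming a sequence-classification head on the fine-tuned BERT checkpoint and a plain-text idiom lexicon (the paths, file format, and positive-class index are our assumptions, not artifacts released with the paper):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical artifacts: a bert-base-uncased checkpoint fine-tuned for
# utterance-level metaphor detection (C_bert^met) and a one-idiom-per-line lexicon.
tokenizer = AutoTokenizer.from_pretrained("path/to/c_bert_met")
model = AutoModelForSequenceClassification.from_pretrained("path/to/c_bert_met")
model.eval()

with open("idiom_lexicon.txt") as f:
    idioms = {line.strip().lower() for line in f if line.strip()}

def p_met(utterance: str) -> float:
    """Probability of the utterance being metaphorical (class index 1 assumed)."""
    inputs = tokenizer(utterance, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def detect_figurative(utterances, threshold=0.9):
    """D_auto^fig = D_auto^met (p_met > threshold) ∪ D_auto^idiom (lexicon substring match)."""
    d_met = {u for u in utterances if p_met(u) > threshold}
    d_idiom = {u for u in utterances if any(idiom in u.lower() for idiom in idioms)}
    return d_met | d_idiom
```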
Dictionary Replacement: Wiktionary (Zesch et al., 2008), the collaboratively created online dictionary, provides curated entries for phrases with "idiomatic" usages (an overloaded use of the term, referring to several figurative phenomena at once and not just idioms proper). These entries encompass conventionalized metaphors (commonly used metaphors with a fixed, nearly universally accepted meaning), idioms, euphemisms, commonly used similes, etc. Each entry lists the surface form of the figurative construct paired with a gloss. Glosses are for the most part literal interpretations of the figurative construct; however, they often carry other details such as dialect ("US") or etymology ("archaic"), which we remove through simple regex-based rules. This allows direct use of the now-cleaned gloss as an in-context literal interpretation. Furthermore, we expand entries whose surface forms contain uninflected verb forms or unrealized pronouns (indicated by someone, one's, etc.), spawning one new entry per pronoun-inflection combination (using the pyinflect Python library). This yields a dictionary with 17,743 tuples of the form {fig_i, Lit(fig_i)} (see Appendix C for further analysis of the dictionary). Finally, for each detected utterance u ∈ D_auto^fig, each matched occurrence of fig_i is replaced by Lit(fig_i), for all 1 ≤ i ≤ n.
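A minimal sketch of this replacement step, assuming the cleaned Wiktionary entries are available as a {figurative phrase: literal gloss} mapping (the gloss-cleaning regex here is a simplification of the rules described above):

```python
import re

def clean_gloss(gloss: str) -> str:
    """Strip bracketed usage labels such as (US) or (archaic) from a Wiktionary gloss."""
    return re.sub(r"\([^)]*\)", "", gloss).strip()

def literalize(utterance: str, fig2lit: dict) -> str:
    """Replace each matched figurative construct fig_i with Lit(fig_i).

    Longer phrases are substituted first so that overlapping constructs
    do not clobber each other.
    """
    for fig in sorted(fig2lit, key=len, reverse=True):
        pattern = re.compile(re.escape(fig), flags=re.IGNORECASE)
        utterance = pattern.sub(clean_gloss(fig2lit[fig]), utterance)
    return utterance

# Example (the dictionary entry is illustrative):
# literalize("Their plan was built on the sand.", {"built on the sand": "unstable"})
# -> "Their plan was unstable."
```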

Results: From Table 3, we see that mitigation-based literalization leads to higher quality model responses as per most automatic metrics as well as human evaluation. Though the proposed approach offers only small improvements, it is lightweight in terms of time and memory complexity, and provides reasonably fluent and appropriate interpretations for the figurative constructs it covers, since these are sourced from the long-term, collaborative editing underlying Wiktionary. Table 4 shows examples where mitigation-based literalization of the figurative context improves model response quality.
Additionally, we observe that Rouge fails to correlate with Meteor in Table 2, but correlates in Table 3. One possible reason for this behavior is that Wiktionary uses dictionary-like, conservative literalizations, adding new words only as necessary, whereas the human annotators literalize more freely, without regard for word-choice fidelity. Meteor is more robust to variation in word choice, since it can capture synonymy and other limited surface-form variation; Rouge, being more sensitive to exact word choice, is immediately dampened on this account.
The proposed approach is based on simple rule-based procedures relying on existing resources, and thus there is scope for multiple future extensions. The detection portion of our approach uses an external classifier and a fixed lexicon to detect metaphors and idioms respectively, leading to D_auto^fig. Considering the utterances in DD-Fig as gold, we also measure the recall of this detection step.
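Recall here follows the usual set definition (our notation, with D_gold denoting the manually annotated DD-Fig utterances):

Recall = |D_auto^fig ∩ D_gold| / |D_gold|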

Related Work
Past work has explored fine-grained analysis and understanding of the performance of dialog models (Roller et al., 2020). Saleh et al. (2020) analyze open domain dialog systems for skills such as inferring contradictions and determining the topic of conversation, inter alia. Sankar et al. (2019) analyze the change in perplexity when applying certain perturbations to the dialog history. Past work has also analyzed dialog models from the point of view of safety from toxic language (Xu et al., 2020; Dinan et al., 2019) and gender biases (Dinan et al., 2020). Gao et al. (2020) analyze how well dialog models respond to utterances from infrequent sentence function types (e.g., Negative Declarative utterances like "I feel bad today."). Louis et al. (2020) propose to identify the categorical mapping of an indirect response with respect to a polar question in a task-oriented dialog setup. Challenges in handling metaphors and idioms have been explored in prior work on machine translation (Mohammad et al., 2016; Kordoni, 2018; Mao et al., 2018). Mao et al. (2018) propose a method to identify metaphors in English text and paraphrase them into their literal counterparts before translating to Chinese. Our work on analyzing dialog models against figurative language contexts is along a similar direction, though the task setup and the scope of figurative language involved are different. Figurative language generation has received reasonable attention, such as simile generation (Chakrabarty et al., 2020) and idiom generation (Zhou et al., 2021). Compared to these, our focus is on analyzing the capability of popular contemporary dialog models when faced with figurative language.

Conclusions
In this work, we demonstrate how existing dialog models fall short in handling figurative language use, and propose a lightweight mitigation technique to ameliorate this lacuna. We encourage future research in dialog modeling to separately analyze and report model performance on subsets exhibiting figurative language. The mitigation techniques used here are lightweight, but are not able to capture many occurrences of figurative language; future work could look into improved techniques for figurative language detection. Our work is limited to a couple of open domain dialog datasets in English. Similar analyses could be conducted on goal-oriented dialog setups and on datasets in other languages.

Ethical Considerations

Our experiments use publicly available dialog datasets, DailyDialog (Li et al., 2017) and PersonaChat (Zhang et al., 2018), as well as the multiple-references dataset from Gupta et al. (2019). We do collect human evaluation ratings using crowd-sourcing; however, we neither solicit, record, nor request any kind of personal or identity information from the annotators. Our work primarily performs experiments on dialog in English (Bender and Friedman, 2018). Dialog models are known to suffer from biases learnable from dialog training data, such as gender bias (Dinan et al., 2020). However, our work does not present or release any new models or model checkpoints; it is primarily concerned with more careful evaluation of a particular phenomenon (i.e., figurative language) and with a discussion of a lightweight mitigation strategy for the same.

A Automatic Detection - Further Details
A.1 Metaphor Detection

A.2 Idiom Detection
Lexicon construction is done through a two-step process. First, we curate the lexicon from the source mentioned above. Since many entries in this lexicon are templates like behind someone's back, which can have multiple realizations (e.g., behind her back), we use a rule-based procedure to expand such instances to all possible realizations by enumerating over all POS combinations (nominative pronoun, nominative pronoun-verb, verb-accusative pronoun, etc., as applicable). For this, we use the pyinflect library.
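A rough sketch of this expansion (the pronoun tables and verb tags are illustrative; only the pyinflect inflection lookup mirrors the actual tooling):

```python
from pyinflect import getInflection  # the same library used in our pipeline

# Illustrative pronoun fillers; the real procedure enumerates all applicable
# POS combinations (nominative pronoun, verb-accusative pronoun, etc.).
FILLERS = {
    "someone's": ["his", "her", "their", "my", "your"],
    "someone": ["him", "her", "them", "me", "you"],
    "one's": ["his", "her", "their", "my", "your"],
}

def expand_pronouns(template: str) -> list:
    """Expand e.g. "behind someone's back" -> "behind her back", "behind his back", ..."""
    variants = [template]
    for slot in sorted(FILLERS, key=len, reverse=True):  # longest slot first
        expanded = []
        for v in variants:
            if slot in v:
                expanded.extend(v.replace(slot, filler) for filler in FILLERS[slot])
            else:
                expanded.append(v)
        variants = expanded
    return sorted(set(variants))

def expand_verb(template: str, verb: str) -> list:
    """Spawn one variant per inflection of an uninflected head verb, e.g. fill -> filled/filling/fills."""
    out = set()
    for tag in ("VB", "VBZ", "VBG", "VBD", "VBN"):
        forms = getInflection(verb, tag=tag)
        if forms:
            out.add(template.replace(verb, forms[0]))
    return sorted(out)
```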

C Wiktionary Mitigation - Resource Description
The final postprocessed version of the resource contains 17,743 entries, distributed as 88.53% 'idioms', 8.08% euphemisms, and 3.39% similes. The 'idioms' here consist of both idioms proper and conventionalized metaphors; it is not easy to provide an exact breakdown since Wiktionary does not distinguish between the two.

C.1 Examples
D Compute Details

Typical inference time ranges from 10 to 20 minutes.
Number of parameters: GPT2 has approximately 120M parameters.
Hyper-parameter search: We varied the random seed and the learning rate when training GPT-based models. We use the validation loss to pick the best configuration. The best configuration uses an initial random seed of 123.

E Additional Data Collection Details
Annotation Framework: A screenshot of the annotation collection task is shared in Figure 2.
Quality Control: We restrict to annotators with a >90% HIT acceptance rate. We also perform spot checks and discard ratings from annotators who did not appear to adhere to the provided instructions.

F Qualitative Examples
In this section, we list qualitative examples from the various stages of analysis performed in our work. Table 7 shows how various models respond to example figurative contexts. Table 6 shows examples where figurative contexts are misinterpreted by models, with the response relying on the unintended literal sense of the construct.

Figure 2: Annotation framework to collect human judgement ratings on the appropriateness of a generated response with respect to the dialog context.