ePiC: Employing Proverbs in Context as a Benchmark for Abstract Language Understanding

While large language models have shown exciting progress on several NLP benchmarks, evaluating their ability for complex analogical reasoning remains under-explored. Here, we introduce a high-quality crowdsourced dataset of narratives for employing proverbs in context as a benchmark for abstract language understanding. The dataset provides fine-grained annotation of aligned spans between proverbs and narratives, and contains minimal lexical overlaps between narratives and proverbs, ensuring that models need to go beyond surface-level reasoning to succeed. We explore three tasks: (1) proverb recommendation and alignment prediction, (2) narrative generation for a given proverb and topic, and (3) identifying narratives with similar motifs. Our experiments show that neural language models struggle on these tasks compared to humans, and these tasks pose multiple learning challenges.


Introduction
Large language models (LLMs) (Devlin et al., 2019; Liu et al., 2019a; Raffel et al., 2019; Lan et al., 2019; Lewis et al., 2019) have led to a paradigm shift in NLP, and have shown exciting progress on benchmarks such as GLUE and SuperGLUE (Wang et al., 2019a). In particular, these include tasks such as reading comprehension, natural language inference and coreference resolution. Many of these tasks rely on semantic and syntactic reasoning, which has been mastered by these LLMs. For example, apart from improving on distributional semantics through contextualized embeddings (Ethayarajh, 2019), recent work has shown evidence that these models implicitly learn emergent concepts such as subject-verb agreement (Jawahar et al., 2019), semantic roles (Tenney et al., 2019) and dependency structures (Hewitt and Manning, 2019).

Proverb: Prevention is better than cure
Narrative: Instead of working on his English assignments weekly, he put them off until the last week before they were all due. He was able to finish them all over the final few days, but it took a lot of energy drinks and misery.
Narrative: There was once a disease that spread like wildfire. It killed people by the thousands. Doctors said wash your hands regularly and you'll be ok. Eventually an inoculation shot was created, and it worked, except on the people who refused to wash their hands and died waiting for their turn to get a shot.
Figure 1: We introduce ePiC, a crowdsourced dataset of narratives for employing proverbs in context. Our dataset contains narratives (N1 and N2) paired against proverbs (P) along with a fine-grained annotation of aligned spans between the narratives and proverbs. Aligned spans are shown with matching colors, and indicate correspondences in roles between proverbs and narratives. We explore three tasks: (1) proverb recommendation and alignment prediction (predict P given N1), (2) narrative generation for a given proverb and topic (generate N1 given P and K1), and (3) identifying narratives with similar motifs (e.g. identify N2 in a set of narratives given N1).
However, humans show an ability for deeper linguistic reasoning. We can identify people's intentions and goals (Douglas and Sutton, 2006), perform relational reasoning (Alexander et al., 2016) and find analogies in situations with little surface overlap (Holyoak, 2013). In particular, the ability to make verbal analogies in the form of proverbs is often noted as an indicator of human intelligence and literary ability (Penfield and Duru, 1988; Nippold et al., 2001). Proverbs are also repositories of information on culture, societal norms, values and folk wisdom (Raymond, 1956; White, 1987). Thus, AI systems need to understand and employ such knowledge. In this work, we investigate proverbs in narrative contexts as a testbed for evaluating abstract reasoning and analogical abilities of LLMs.
We introduce ePiC (employing Proverbs in Context), a high-quality crowdsourced dataset of narratives paired with proverbs. The dataset provides fine-grained annotation of aligned spans between proverbs and narratives, and is designed to minimize lexical overlap between narratives and proverbs. Figure 1 shows two examples of narratives for a proverb from our dataset, along with corresponding alignment annotations. We diverge from related extant resources on using proverbs (Wang et al., 2020; Tan et al., 2015, 2016) in terms of quality of narratives, direct supervision and fine-grained alignment annotations (footnote 2). We explore three tasks: (1) proverb retrieval (§5.1) and alignment prediction, (2) narrative generation for a given proverb and a set of keywords specifying a topic (§5.2), and (3) discovering narratives with similar motifs (§5.3). By benchmarking several LLMs, we find that existing models struggle with these tasks, suggesting substantial room for improvement in abstract reasoning. In particular, humans show much higher performance in many cases. Our dataset will be publicly hosted on the web along with a public leaderboard on first publication.
In §3, we describe the crowdsourced creation of our dataset and provide details of annotation tasks. In §4, we analyse the extent of lexical overlap and quantitatively evaluate the biases in our dataset. We also perform a human study to evaluate the quality of generated narratives. §5 describes the three tasks and provides details of experimental evaluation of LLMs for each task. We conclude with a discussion of future directions in §6, and a statement of ethics and broader impact relevant to our dataset in §7. Our contributions are:
• We introduce ePiC, a high-quality dataset for employing proverbs in context. It contains multiple narratives for English proverbs, and fine-grained annotation of aligned spans between them.
• We design three challenging tasks that require models to go beyond surface-level reasoning and provoke research towards making more socially grounded NLP systems.
• We benchmark the performance of several state-of-the-art large language models in our proposed tasks using our dataset.
Code and dataset will be available at https://github.com/sgdgp/epic.
Footnote 2: Existing datasets are automatically created by scraping web-text, and supervision is heuristic (based on co-occurrences of proverbs and contexts).

Related Work
Prior work has explored recommending Chinese idioms as context-based recommendation (Liu et al., 2019b) or as cloze-style reading comprehension tasks (Zheng et al., 2019). Learning to quote has been explored based on fiction (Tan et al., 2015, 2016) and noisy social media conversations from Twitter, Reddit or Weibo (Lee et al., 2016; Wang et al., 2020). In the most related prior work, authors explore a quote retrieval task borrowing inspiration from context-based recommendation systems (Huang et al., 2012; He et al., 2010). Wang et al. (2020) formulate learning to quote as a generation task by using topic modelling (Miao et al., 2017; Wang et al., 2019b) in a sequence-to-sequence network. While previous work has considered idioms, proverbs and common phrases as quotes, we specifically work with proverbs. In comparison to earlier datasets, our dataset is substantially superior in quality and supervision. Further, ePiC includes fine-grained annotations aligning parts of the proverb to parts of the narrative, which has significant possibilities for model training, evaluation and interpretability.

Dataset Creation
In this section we describe the steps involved in creating the dataset in detail.

Proverb collection
We obtained a candidate set of English proverbs by scraping the websites of 'The Phrase Finder' and WikiQuotes. Next, this set was manually pruned to remove lexical variations of the same proverb. This considerably filtered the candidate set, since many entries consisted of minor lexical or syntactic variations of the same proverb. This manual curation led to a set of 250 proverbs, which we consider in the current version of our dataset.

Collecting narratives
In the second step, we use Amazon Mechanical Turk to collect a diverse set of narratives corresponding to each proverb. We collect 10 narratives contributed by distinct turkers for each proverb, leading to a total of 2500 proverb-narrative pairs. We also ensure that no turker contributes a large number of narratives to alleviate annotator bias (Geva et al., 2019) (where models can overfit to annotator characteristics), while encouraging diversity in writing style and content. The turkers were asked to write short realistic stories, preferably within 100 words. Additionally, to avoid surface-form biases, turkers were encouraged to minimize lexical overlap and to not mention the proverb or parts of it in the narrative. This was done so that doing well on the tasks requires a detailed understanding of the narratives rather than reliance on surface-level cues. Turkers were paid 50 cents for each narrative for this task.

Span alignment annotation
In a third step, we solicit more fine-grained annotations between the narratives and the proverb in the form of aligned spans. For this task, we present proverb-narrative pairs to turkers and ask them to find contiguous spans in the narrative which align well with contiguous spans in the proverb. Turkers could submit up to 5 pairs of aligned spans per proverb-narrative pair. These pairs of aligned spans highlight the grounding of a proverb in the narrative (as previously shown in Figure 1). These annotations allow verification of the reasoning capabilities of various neural models by checking whether these models are able to identify these correspondences, and add interpretability to our tasks. Turkers were paid 25 cents for each proverb-narrative pair annotation for this task.
Table 1 shows the statistics of narrative collection for the proverbs. The narrative writing task was perceived as challenging yet interesting by most turkers due to (a) not having outlines about topics for the narrative beforehand and (b) the requirement of low lexical overlap with the proverb. Thus, the narrative writing task had a learning curve and some of the narratives submitted initially were not included in the dataset.

Dataset Analysis
Table 2 shows some statistics of the dataset collected through the process described in §3. We note that the average length of stories is about 65 words and between 4 and 5 sentences. In this section, we analyze the characteristics and biases of the ePiC dataset in detail. We discuss lexical similarity between narratives and proverbs in §4.1. §4.2 describes the diversity of the dataset in terms of events and entities, and details of sentiment analysis of proverbs and narratives. Finally, §4.3 discusses human evaluation of the quality of the dataset.
Table 3: Avg. Jaccard similarity and number of common n-grams between proverbs and narratives. Numbers in parentheses denote the corresponding statistics when the proverbs are randomly permuted and assigned to narratives.

Lexical overlap analysis
Using n-grams: We evaluate the extent of lexical overlap between proverbs and narratives by computing common n-grams between them. Table 3 reports the average Jaccard similarity score between n-gram sets of proverbs and narratives, and the average number of common n-grams. On average, there are 1.27 unigrams common between narratives and proverbs (including stopwords). In comparison, randomly permuting assignments of proverbs for narratives yields an average unigram Jaccard similarity of 0.0211 and 1.06 common unigrams. Thus, the overlap metrics in the dataset are comparable to those between unrelated texts. To evaluate diversity among narratives that correspond to a proverb, we compute the average Jaccard similarity between sets of unigrams for the narratives. This score is 0.107, which is comparable to a value of 0.098 for average unigram overlap between pairs of narratives corresponding to different proverbs. This suggests a high lexical diversity between the narratives in the dataset.
Table 4: Benchmarking proverb retrieval performance using word2vec and off-the-shelf LLMs ('base' versions).
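As an illustration of the n-gram overlap statistics reported in Table 3, the following is a minimal sketch of how the Jaccard similarity and the number of common n-grams for one proverb-narrative pair could be computed. The function names (ngrams, jaccard, overlap_stats) are hypothetical, and lowercasing with whitespace tokenization is an assumption; the tokenization actually used for Table 3 is not specified here.

```python
def ngrams(text, n=1):
    """Set of word n-grams in `text` (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_stats(proverb, narrative, n=1):
    """Return (Jaccard similarity, number of common n-grams) for one pair."""
    p, s = ngrams(proverb, n), ngrams(narrative, n)
    return jaccard(p, s), len(p & s)


# Toy pair (not taken from the dataset): only the unigram "you" is shared.
proverb = "look before you leap"
narrative = "you should check the water depth first"
unigram_stats = overlap_stats(proverb, narrative, n=1)
bigram_stats = overlap_stats(proverb, narrative, n=2)
```

Averaging these per-pair values over all pairs, and again over a random permutation of the proverb assignments, would give the two kinds of numbers contrasted in the text.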
Using distributional embeddings: We formulate a retrieval task to explore whether we can retrieve the correct proverb corresponding to a narrative only by using similarity in their distributional representations. We define similarity between a proverb and a narrative as the cosine similarity between the embeddings of the proverb and the narrative. We use (1) word2vec embeddings (Mikolov et al., 2013) and (2) contextual embeddings from LLMs to represent the proverb and narrative. We obtain the embeddings for a context c (where c can be a proverb or a narrative) as:
• Word2vec: average of word embeddings for tokens in c.
• BERT/RoBERTa: [CLS] token embedding on passing c through BERT/RoBERTa.
• DistilBERT/ALBERT: [CLS] token embedding on passing c through DistilBERT/ALBERT.
• T5/GPT-2 Encoder: sum of embeddings of tokens in c after passing through the encoder.
For each narrative, we retrieve the proverb that has the highest similarity. For this retrieval task, we report the accuracy and Mean Reciprocal Rank of the correct proverb in Table 4. We note that while all models perform better than random, the performance is very low when using the out-of-the-box representations. In §5, we explore learning-based methods for the same setup.
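The retrieval evaluation described above can be sketched as follows, assuming the embeddings have already been computed by one of the listed encoders and are given as plain lists of floats. The helper names (cosine, rank_of_gold, accuracy_and_mrr) and the toy vectors are hypothetical; this is an illustrative sketch rather than the evaluation code used for Table 4.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_gold(narrative_vec, proverb_vecs, gold_index):
    """1-based rank of the gold proverb when sorting by decreasing similarity."""
    sims = [cosine(narrative_vec, p) for p in proverb_vecs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order.index(gold_index) + 1


def accuracy_and_mrr(narrative_vecs, proverb_vecs, gold):
    """Top-1 accuracy and Mean Reciprocal Rank over all narratives."""
    ranks = [rank_of_gold(n, proverb_vecs, g)
             for n, g in zip(narrative_vecs, gold)]
    accuracy = sum(r == 1 for r in ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return accuracy, mrr


# Toy 2-d "embeddings" (assumed values, for illustration only).
proverb_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
narrative_vecs = [[0.9, 0.1], [0.2, 0.8]]
gold = [0, 2]  # index of the correct proverb for each narrative
accuracy, mrr = accuracy_and_mrr(narrative_vecs, proverb_vecs, gold)
```

In the toy example the first narrative ranks its gold proverb first and the second ranks it second, so the reciprocal ranks are 1 and 1/2.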

Data characteristics
Diversity of narrative events: Fig 2 shows the distribution of events in our dataset. Following Mostafazadeh et al. (2016) we find events as the hyponyms of the word 'event' or 'process' using WordNet (Fellbaum, 2010). We see that the top events comprise less than 3% of all events in our dataset, and the long tail of less frequent events shows the diversity of the dataset.
Sentiment analysis: To evaluate if there are biases in the data in terms of sentiment associations between proverbs and corresponding narratives (e.g., if negative sentiment proverbs only correspond to negative sentiments in narratives), we perform sentiment analysis of the narratives using VADER (Hutto and Gilbert, 2014). Figure 3 shows the average sentiment scores of the narratives corresponding to a proverb plotted against the sentiment score of the proverb. We find that the narratives are diverse in terms of their sentiment polarities. Quantitatively, the Pearson correlation between the average sentiment score of the narratives and the sentiment score of the proverb is 0.35, showing a positive correlation. An example of a proverb for which the narratives were close in sentiment scores to the proverb is 'a thing of beauty is a joy forever', while for 'there's no fool like an old fool' the sentiment polarity of narratives was on average opposite to that of the proverb. We also show the variance in terms of the number of positive and negative sentiment narratives (out of 10) for each proverb in Figure 4. We note that there are indeed a small number of proverbs for which all or most narratives lean towards a particular sentiment polarity. Quantitatively, for 23 proverbs, either 9 or all 10 of the narratives have a positive VADER sentiment score. These include: 'Nothing succeeds like success', 'Christmas comes but once a year' and 'Genius is one percent inspiration, ninety-nine percent perspiration'. There are 6 proverbs for which either 9 or all 10 narratives have a negative VADER sentiment score. These include: 'The wages of sin is death', 'Fish always stink from the head down' and 'Don't wash your dirty linen in public'. However, as seen in Figure 4, the vast majority of proverbs in the dataset are represented by narratives with both positive and negative sentiment polarities.
Gender distribution of entities: We analyse entities in our dataset to find the ratio of male and female mentions in the narratives. Based on an off-the-shelf neural coreference pipeline, we find that 61% of the mentions in the narratives are male, while 39% are female. Around 48% of the narratives have predominantly male mentions, 26% of the narratives have predominantly female mentions and the rest have an equal number of male and female mentions. The average number of words in narratives with predominantly male mentions and in those with predominantly female mentions was comparable (about 65 words).
Language complexity: To evaluate the diversity of language complexity in the narratives, we calculate the Flesch reading ease of the narratives. The highest reading score obtained was 112.1 (equivalent to 3rd grade reading levels) and the lowest was -41.5 (significantly above college graduate reading levels). The average score for the narratives in our dataset is 66.5 (equivalent to 8th/9th grade reading levels). This shows a considerable spread in the complexity of language in our dataset.
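For reference, the standard Flesch reading ease formula is 206.835 - 1.015 (words / sentences) - 84.6 (syllables / words). The sketch below applies it to pre-computed counts; the function name is hypothetical, and how words, sentences and syllables were counted for the scores above (i.e., which tool was used) is not stated here, so the counts are left as inputs.

```python
def flesch_reading_ease(num_words, num_sentences, num_syllables):
    """Standard Flesch reading ease score from raw counts.

    Higher scores mean easier text. The score is not clipped, so values
    above 100 or below 0 are possible for extreme texts.
    """
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts: 100 words, 5 sentences, 130 syllables.
example_score = flesch_reading_ease(100, 5, 130)
```

Because the score is unbounded, very short sentences with mostly monosyllabic words can exceed 100, and long sentences with polysyllabic words can fall below 0, which is consistent with the extremes of 112.1 and -41.5 reported above.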
Hate speech: We analyze collected narratives to check the presence of toxic or hate speech in the narratives. Using an off-the-shelf hate speech classifier (Davidson et al., 2017), we found no instances of hate speech in the dataset.

Human Evaluation of Dataset Quality
We perform a human evaluation of the narratives in our dataset on various criteria to judge the quality of our dataset. We perform this evaluation using the AMT platform. We randomly sample 250 proverb-narrative pairs and ask the turkers to evaluate the narratives on the following criteria:
• Relatedness: how closely the narrative reflects the meaning of the proverb (1: totally unrelated, 5: perfectly related)
• Interesting/Creative: how much the narrative is like a short creative or interesting story (1: very uninteresting/boring, 5: very creative/story-like)
• Fluency: grammatical correctness of the narrative (1: poor English with grammatical or spelling mistakes, 5: perfect English with no errors in writing)
• Overall rating
All the ratings are done on Likert scales from 1 to 5, where 1 is the lowest value for each criterion and 5 is the highest. Also, the rating value '3' was calibrated to be slightly leaning to the higher end of the scale (instead of neutral) so that the turkers take a clear stand on the polarity of each criterion. Table 5 shows the qualitative evaluation of our dataset. The average overall rating was 3.67 and the average pair-wise inter-annotator agreement for labelling a narrative as overall good vs overall poor (overall score >= 3 vs < 3) is 0.84. We also rate the quality of the aligned spans in our dataset similarly on a scale of 1 to 5. The average rating of the alignment between spans was 3.91 and the average pair-wise inter-annotator agreement for alignment as good vs poor (rating >= 3 vs < 3) is 0.86.
To qualitatively compare against prior work, we do a similar qualitative analysis by rating 200 randomly drawn samples from the "Reddit" dataset of quotations in context from Wang et al. (2020). Based on average Likert scores in Table 5, we find that ePiC is significantly superior (using t-test; p < 0.05) on all criteria. We also highlight the key differences between ePiC and prior work in Table 6.

Tasks & Evaluation
In this section, we provide details of tasks associated with our dataset. We introduce three tasks: (1) Proverb and Alignment Prediction, (2) Narrative Generation, and (3) Identifying narratives with similar motif. In the following subsections, we describe the tasks along with experimental setup and benchmark results.

Proverb and Alignment Prediction

Task details
In this task, the objective is to predict the correct proverb for a given narrative from the set of all 250 proverbs in the dataset. The motivation of this task is to test whether language models can abstract the underlying meaning of the narratives and make an analogy with the correct proverb from a large set of proverbs. In terms of applications, this task is related to proverb recommendation, which can be useful in creative writing assistants. The task is challenging as there might be multiple proverbs that are loosely related to the narrative context but are not completely consonant with subliminal themes in the narrative. An underlying assumption here is that a narrative would match well with exactly one proverb. We found this reasonable for most examples in the dataset.

Experiment Setup and Results
We consider two settings, predicting (1) Seen and (2) Unseen proverbs.
• Seen proverbs: In this setting, the set of proverbs in the train and test set are the same. We divide the set of narratives corresponding to each proverb into train and test in a 60/40 split. Thus, in this setting, the train set has 1500 and the test set has 1000 proverb-narrative pairs.
• Unseen proverbs: Here, we consider 150 proverbs in the train set and 100 proverbs in the test set (60/40 split on the set of proverbs). The sets of proverbs in the train and test split are disjoint. The sizes of the train and test splits are 1500 and 1000 respectively (since each proverb is paired with 10 narratives).
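The two settings above can be sketched as follows for a dataset stored as a mapping from each proverb to its list of 10 narratives. The function names, the data layout and the use of random shuffling with a fixed seed are assumptions made for illustration; how narratives and proverbs were actually assigned to the splits is not specified here.

```python
import random


def seen_split(narratives_by_proverb, train_per_proverb=6, seed=42):
    """60/40 split of the narratives of every proverb (same proverbs in both)."""
    rng = random.Random(seed)
    train, test = [], []
    for proverb, narratives in narratives_by_proverb.items():
        shuffled = list(narratives)
        rng.shuffle(shuffled)
        train += [(proverb, n) for n in shuffled[:train_per_proverb]]
        test += [(proverb, n) for n in shuffled[train_per_proverb:]]
    return train, test


def unseen_split(narratives_by_proverb, train_fraction=0.6, seed=42):
    """60/40 split of the proverbs themselves (disjoint proverb sets)."""
    rng = random.Random(seed)
    proverbs = sorted(narratives_by_proverb)
    rng.shuffle(proverbs)
    cut = round(train_fraction * len(proverbs))
    train = [(p, n) for p in proverbs[:cut] for n in narratives_by_proverb[p]]
    test = [(p, n) for p in proverbs[cut:] for n in narratives_by_proverb[p]]
    return train, test


# Placeholder data with the same shape as ePiC: 250 proverbs x 10 narratives.
data = {f"proverb_{i}": [f"narrative_{i}_{j}" for j in range(10)]
        for i in range(250)}
seen_train, seen_test = seen_split(data)
unseen_train, unseen_test = unseen_split(data)
```

With 250 proverbs and 10 narratives each, both functions produce 1500 training pairs and 1000 test pairs, matching the sizes stated above.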
Proverb prediction: Here we focus on only predicting the corresponding proverb for a narrative, without employing the span alignments in training or evaluation. For this, we fine-tune the retrieval models based on different LLMs previously described in §4. The cosine similarity between representations of proverbs and narratives is used to define logit scores for predicting a proverb given a narrative. We normalize the scores and train our model using cross-entropy loss. We fine-tune the pre-trained LLMs on our train set and report their best performance on our test set. To evaluate performance, we consider accuracy and Mean Reciprocal Rank as metrics. Table 7 shows proverb prediction performance for 'seen' and 'unseen' proverbs. We note that RoBERTa performs the best among all models for both the 'seen' and 'unseen' settings, and the performance for all models is consistently lower for unseen proverbs (as would be expected, since this task involves much greater generalization). Further, while the performance of all models is much better than chance, even the highest performance is only 25.8%. As we will see in §5.1.3, human performance for proverb prediction is much higher.
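The training objective described above can be illustrated with a small sketch in which cosine similarities act as logits and the loss is the cross-entropy against the gold proverb. Using a softmax to normalize the scores is an assumption (the text only says the scores are normalized), the function names are hypothetical, and in actual fine-tuning the vectors would come from the encoder and the loss would be back-propagated through it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def proverb_cross_entropy(narrative_vec, proverb_vecs, gold_index):
    """Cross-entropy of the gold proverb under softmax(cosine similarities)."""
    logits = [cosine(narrative_vec, p) for p in proverb_vecs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[gold_index]


# Toy vectors (assumed values): the narrative is closest to proverb 0.
proverb_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
narrative_vec = [0.9, 0.1]
loss_correct = proverb_cross_entropy(narrative_vec, proverb_vecs, 0)
loss_wrong = proverb_cross_entropy(narrative_vec, proverb_vecs, 2)
```

Since cosine similarities lie in [-1, 1], the resulting softmax is fairly flat; scaling the logits by a temperature is a common variation, but whether one was used here is not stated.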
Predicting proverbs and alignment jointly: We formulate this as a multi-task learning setup. We extend the models from the proverb prediction task by adding a component that predicts a span in the narrative given a span from the proverb and the narrative. This span prediction network takes the context of the proverb span and the complete sequence of token embeddings of the narrative. Using this, the model predicts the start and end token of the corresponding narrative span. We jointly train the model with multi-task learning of the two tasks, i.e., proverb and alignment prediction, on the 'seen' proverbs data split. For span prediction, we report precision, recall and F1. Recall in this case is the fraction of the tokens in the ground-truth narrative span present in the predicted span. Similarly, precision is the fraction of tokens in the predicted span present in the ground-truth span. We also report the accuracy for proverb prediction. Table 8 shows results for our approach. While in principle the two tasks should benefit from joint training, we observe that the performance on proverb prediction actually drops. Further, performance for alignment prediction is also seen to be low, indicating major scope for improvement in the individual tasks, and also in leveraging their interdependence.
Table 6: Comparing ePiC with previous works on learning to quote based on different characteristics of the data and the collection process. ePiC contains contexts in the form of narratives authored by crowdworkers explicitly for this task. In comparison, previous methods collect contexts and labels by mining existing text resources through heuristics (with no manual curation). We further provide fine-grained alignment annotation between the narratives and proverbs.
Table 8: Joint proverb and alignment prediction performance for seen proverbs using 'base' versions of LLMs.
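The token-level span metrics defined in the paragraph on joint prediction can be sketched as below. Representing a span as an inclusive (start, end) pair of token indices and the function name span_prf are assumptions made for illustration.

```python
def span_prf(pred_span, gold_span):
    """Token-overlap precision, recall and F1 for two spans.

    Each span is an inclusive (start, end) pair of token indices in the
    narrative (an assumed representation).
    """
    pred = set(range(pred_span[0], pred_span[1] + 1))
    gold = set(range(gold_span[0], gold_span[1] + 1))
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Predicted tokens 2..5, gold tokens 4..9: two tokens (4 and 5) overlap.
precision, recall, f1 = span_prf((2, 5), (4, 9))
```

In this example 2 of the 4 predicted tokens are correct and 2 of the 6 gold tokens are recovered.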

Qualitative analysis of retrieval models
We plot the prediction performance graph for each proverb using fine-tuned RoBERTa and BERT models in Figure 5 to explore if different LLMs are better at different types of proverbs. We see that, in fact, RoBERTa performs better than BERT in almost all the cases except for a very small number of narratives.
Figure 5: Heatmap showing the percentage of proverbs with various fine-tuned BERT and RoBERTa proverb prediction accuracies. An example interpretation of the heatmap: more than 15% of the proverbs have a RoBERTa prediction accuracy of 25% and a BERT prediction accuracy of 0%.
Some proverbs for which the accuracy is perfect (1.0) or near-perfect (0.75) with RoBERTa are 'a penny saved is a penny earned', 'look before you leap', and 'birds of a feather flock together'. Looking into the narratives for these proverbs in the test set, we find that a possible reason for this high performance is the presence of certain words or phrases which are synonymous with some words or phrases in the proverb. Examples of this include the presence of the word 'group' for the proverb 'birds of a feather flock together' and words like 'money', 'dollar' and 'expense' for the proverb 'a penny saved is a penny earned'. At the other end, some of the proverbs for which RoBERTa accuracy is 0 are 'ignorance is bliss', 'don't count your chickens before they are hatched', and 'prevention is better than cure'. There are also cases when the model is confused because of multiple topics being discussed in the narrative. For example, some narratives in the test set for the proverb 'life's not all beer and skittles' describe earning money the hard way. Even though this is not the main focus of the narrative, RoBERTa predicts 'time is money' for such narratives.

MCQ task for human performance comparison
We compare the proverb prediction performance of our fine-tuned LLMs against humans. Since predicting from a large set of 250 proverbs is infeasible for human subjects, we modify the task slightly. We frame proverb prediction as a multiple choice question (MCQ) task where, given a narrative, 5 proverbs are provided as choices. The set of choices includes the correct proverb and 4 other distractor proverbs. We choose the distractor proverbs from a mix of proverbs with the highest prediction probabilities, and proverbs that are assigned the most similar probabilities to the correct answer by the RoBERTa model. We provide examples of the MCQ task and distractor options in Appendix §A.1. We conduct a user study for this MCQ task on a random sample of 100 narratives. Table 9 shows human accuracy for this task compared to LLMs, which shows that humans are much better at the task. Additionally, the estimate for human performance is likely an under-estimate, since in many cases human subjects were unfamiliar with the meanings of some of the proverbs provided in the options. The average pair-wise inter-annotator agreement between human subjects for this task was 0.75 and the Cohen's kappa score was 0.42.
Semantically similar proverbs: Our chosen set of 250 proverbs in ePiC includes instances of proverbs that are semantically very similar, or even paraphrases (e.g., 'never judge a book by its cover' and 'appearances can be deceptive'). This can be problematic since the presence of semantically similar proverbs as different options in MCQ (and as different classes in the proverb classification task) can confuse both humans and automated models. To estimate the extent of this phenomenon, we perform an analysis of human errors on the aforementioned MCQ task. Out of 90 errors, we find that for 44 cases the chosen proverb was completely unrelated to the actual answer. In 12 out of these 44 cases, the unrelated proverb includes words related to words in the narrative. For 31 out of the remaining 46 cases, the chosen proverb seems related to the narrative at first glance, but is not aligned and thus not the best fit. For the remaining 15 cases (one-sixth of human errors), the chosen proverb would have been equally appropriate for the narrative. Future work can consider handling semantic similarity between proverbs explicitly and devise suitable evaluation metrics.

Narrative Generation

Task details
One of the important use-cases for NLP models in the creative writing domain is to use these models to generate content. We explore the task of generating narratives corresponding to a proverb and given topic (specified as a set of keywords). We benchmark the performance of two recently proposed state-of-the-art models in text generation, T5 and BART, by fine-tuning them on ePiC.

Experiments and Results
We divide our dataset into train and test splits similar to the proverb prediction task. Thus, we have 'seen' and 'unseen' settings for this task. We consider the set of verbs and named entities as the keywords for a narrative. We train our narrative generation model conditioned on the proverb and the keywords. For automatic evaluation of the generated narratives, we use BLEU, ROUGE-L and recall of the keywords mentioned in the generated narrative. Further, we perform human evaluation to assess the quality of the generated narratives. The human evaluation was conducted on AMT and considered the same criteria employed in Section 4.3. The semantics of each rating level for every criterion were kept the same as in Section 4.3. Table 10 shows automatic evaluation of the narratives generated by different models. We also provide samples of generated narratives in Appendix §A.2. When testing over both 'seen' and 'unseen' proverbs, BART performs better than T5 on the automatic evaluation metrics (BLEU, ROUGE-L and recall of keywords). Table 11 shows human evaluation of generated narratives using BART and T5 when tested over 'seen' proverbs. Low scores for BLEU and ROUGE-L in automatic metrics and low Likert ratings of the generated narratives indicate much scope for future improvement on this task.
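Of the automatic metrics above, keyword recall is the simplest to illustrate. The sketch below assumes single-word keywords, lowercasing and exact whitespace-token matching; the function name is hypothetical, and the matching rule actually used (for example, whether inflected verb forms count) is not specified here.

```python
def keyword_recall(keywords, generated_text):
    """Fraction of the given keywords that appear in the generated text.

    Assumes single-word keywords and exact match on lowercase
    whitespace-separated tokens.
    """
    tokens = set(generated_text.lower().split())
    keys = {k.lower() for k in keywords}
    if not keys:
        return 0.0
    return len(keys & tokens) / len(keys)


# Toy example (not from the dataset): "bought" is missing from the output.
example_recall = keyword_recall(
    ["saved", "Maria", "bought"],
    "Maria saved money for a whole year",
)
```

A stricter or looser matching rule (lemmatization, punctuation stripping) would change the numbers, so values from this sketch are not directly comparable to Table 10.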

Identifying Narratives with Similar Motifs

Task details
An important aspect of language understanding is the ability to make linguistic (and narrative) analogies, i.e., identifying 'similarity' between narratives (e.g., identifying two narratives that are variations on the 'Cinderella story' theme). Here, we explore the task of identifying narrative analogy by modeling 'similarity' between narratives based on proverbs illustrated by them. For this task, two narratives are taken to be similar if they are related to the same proverb.

Experiments and Results
For this task, we use the train and test split of 'seen' proverbs setup in the proverb prediction task. The aim is to find similar narratives for each narrative in the test split amongst all narratives in the test split. So for each narrative there are 3 other similar narratives (corresponding to the same proverb) in the test split (containing 1000 narratives).
Modelling similarity between narratives: We use the learned models from the proverb prediction task to obtain a probability distribution over the proverbs for each narrative. To model similarity, we compute the distance between the (vectors representing) two probability distributions using one of the following: (1) cosine distance; (2) Jensen-Shannon divergence; (3) L2 (Euclidean) distance; and (4) L1 (Manhattan) distance. We predict the narrative closest (in terms of the distance metric) to the input narrative as the most similar. Table 12 shows the accuracy of getting a similar narrative using different distance metrics and different fine-tuned LLMs.
Using cosine distance or Jensen-Shannon divergence as the distance metric on the probability distribution over proverbs predicted by the RoBERTa model performs best on this task. However, the overall performance of the models is still low and could benefit from training methods designed specifically for this task.
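The four distance measures and the nearest-narrative prediction can be sketched as follows, assuming each narrative is represented by a probability distribution over proverbs given as a list of floats. The function names and toy distributions are hypothetical, and the natural logarithm in the Jensen-Shannon divergence is an assumption (the base only rescales the values and does not change the ranking).

```python
import math


def cosine_distance(p, q):
    """1 minus the cosine similarity of the two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - (dot / norm if norm else 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute 0 by convention.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def l2_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def most_similar(index, distributions, distance):
    """Index of the narrative closest to narrative `index` (excluding itself)."""
    candidates = [i for i in range(len(distributions)) if i != index]
    return min(candidates,
               key=lambda i: distance(distributions[index], distributions[i]))


# Toy distributions over 3 proverbs for 3 narratives (assumed values).
dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
nearest = {d.__name__: most_similar(0, dists, d)
           for d in (cosine_distance, js_divergence, l2_distance, l1_distance)}
```

In the toy example all four measures agree that narrative 2 is closest to narrative 0; on real model outputs the measures can disagree, which is what the comparison in Table 12 examines.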

Conclusion and Future Work
We introduce ePiC, a high quality crowd-sourced dataset of narratives paired with proverbs and a suite of challenging tasks associated with this dataset. We show that these provide a challenging testbed for evaluating abstract reasoning and analogical abilities of LLMs. Future work can explore more sophisticated mechanisms to use alignment annotations in improving the performance for proverb prediction and model interpretability. Additionally, researchers can explore conditional narrative generation through more informative prompts than using keywords. ePiC can also be extended in the future by incorporating more proverbs, and adding more layers of complexity like sarcasm or adversarially creating harder narratives. Most of all, the development of similarly challenging resources and tasks can enable the possibility of socially grounded NLP systems.

Ethics and Broader Impact
In §4, we note that our dataset shows considerable differences in the distribution of gender of entities (61% male vs 39% female), whereas in the real world we expect the ratios to be about equally balanced. Systems that don't account for this bias might end up performing better on narratives with male entities than on those with female entities. However, we note that narratives with male and female entities show no differences in overall length or in the average number of mentions of those entities. For all the crowdsourcing tasks in this work, we limited the locale of eligible turkers to the US, Canada and the UK. Further, to ensure good-faith turkers, we required that the approval rate of the turkers be above 97%.
Our screening process has selection biases that likely over-sample narrative-writers from demographics that are over-represented on AMT (ethnically white, college-educated, lower-to-medium income and young) (Hitlin, 2016), and this is likely to have affected the topics and type of language usage in the collected narratives.
Finally, our investigation here has focused on traditional English proverbs, even while proverbs are universal in human languages and cultures (Penfield and Duru, 1988). This poses a real risk of development of AI models that better understand and employ specific types of figurative language than others. Such systems are likely to be less user-friendly to users that don't belong to specific social-cultural backgrounds. To mitigate these risks, but also since proverbs are universal repositories of culture-specific knowledge, future work should extend our effort to more equitably represent the variety and diversity of human thought and cultural experiences. Our investigation here, unfortunately, does not adequately do this. As the proverb goes, the road to hell is paved with good intentions.

Table 13: Tricky MCQ questions from the human evaluation task of proverb prediction. The samples show the challenges in the human evaluation task. In the case of narrative 1, turkers often pick choice D, which superficially seems related but is not correct. For narrative 2, the proverbs in choices C and D are quite close in meaning, which leads turkers to a wrong choice.

the pre-trained models (for example, the "bert-base-uncased" checkpoint for initialising BERT (Devlin et al., 2019)). For the proverb prediction models, we did not truncate any tokens from the proverb and set the maximum length of the narrative sequence to 230 tokens. We used the AdamW (Loshchilov and Hutter, 2018) optimizer, which is commonly used to train these models, for all models except T5 (Raffel et al., 2019); we used AdaFactor (Shazeer and Stern, 2018) to train our T5-based proverb prediction model. We set the learning rate to 0.00002 for training. The batch size was 16, except for T5, for which we reduced it to 4. The random seed for all experiments was 42. The proverb prediction models were trained for 25 epochs.
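For readers who want the proverb prediction settings above in one place, they can be collected into a small configuration object. This is only a sketch: the `TrainConfig` class, its field names, and the `proverb_prediction_config` helper are hypothetical, and the rule of detecting T5 by a "t5" name prefix is an assumption made for illustration. The values themselves are the ones reported above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the reported proverb prediction settings."""
    model_name: str
    optimizer: str = "AdamW"          # used for all models except T5
    learning_rate: float = 2e-5       # reported as 0.00002
    batch_size: int = 16              # reduced to 4 for T5
    epochs: int = 25
    max_narrative_tokens: int = 230   # proverbs themselves are not truncated
    seed: int = 42

def proverb_prediction_config(model_name: str) -> TrainConfig:
    # Assumption: T5 checkpoints are recognised by a "t5" name prefix.
    if model_name.lower().startswith("t5"):
        return TrainConfig(model_name, optimizer="AdaFactor", batch_size=4)
    return TrainConfig(model_name)

if __name__ == "__main__":
    print(proverb_prediction_config("bert-base-uncased"))
```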
The BART narrative generation model was trained for 15 epochs, by which point its loss had converged. T5 took longer and was trained for 25 epochs.
Software and hardware specifications All the models are coded using PyTorch 1.4.0 (Paszke et al., 2019; https://pytorch.org/) and related libraries such as numpy (Oliphant, 2006) and scipy (Virtanen et al., 2020). We run all experiments on a GeForce RTX 2080 GPU with 12 GB of memory. The system has 256 GB RAM and 40 CPU cores. The proverb prediction models typically take 2-5 minutes per epoch. The joint proverb and span prediction models take roughly 10 minutes per epoch. The narrative generation models take 10 minutes per epoch of training for BART and around 18 minutes for T5.