PIE: A Parallel Idiomatic Expression Corpus for Idiomatic Sentence Generation and Paraphrasing

Idiomatic expressions (IEs) play an important role in natural language and have long been a “pain in the neck” for NLP systems. Despite this, text generation tasks related to IEs remain largely under-explored. In this paper, we propose two new tasks, idiomatic sentence generation and idiomatic sentence paraphrasing, to fill this research gap. We introduce a curated dataset of 823 IEs and, as the primary resource for our tasks, a parallel corpus of sentences containing them paired with the same sentences in which the IEs are replaced by their literal paraphrases. Using automated and manual evaluation on our dataset, we benchmark existing deep learning models that have state-of-the-art performance on related tasks, in order to inspire further research on our proposed tasks. By establishing baseline models, we pave the way for more comprehensive and accurate modeling of IEs, both for generation and paraphrasing.


Introduction
Idiomatic expressions (IEs) make language natural. These are multiword expressions (MWEs) that are non-compositional because their meaning differs from the literal meaning of their constituent words taken together (Nunberg et al., 1994). Their use imparts naturalness and fluency (Wray and Perkins, 2000; Sprenger, 2003; Pawley and Syder, 2014; Schmitt and Schmitt, 2020), is prompted by pragmatic and topical functions in discourse (Simpson and Mendis, 2003) and often conveys a nuance in expression (stylistic enhancement) using imagery that is beyond what is available in the context (Nunberg et al., 1994). Idiomatic expressions, including phrasal verbs (e.g., carry out) and idioms (e.g., pull one's leg), are also an essential part of a native speaker's vocabulary and lexicon (Jackendoff, 1995). IEs constitute a ubiquitous part of daily language and social communication: they are primarily used in conversation, fiction and news (Biber et al., 1999), are frequently used by teachers when presenting their lessons to students (Kerbel and Grunwell, 1997) and occur cross-lingually (Baldwin et al., 2010; Nunberg et al., 1994). Their non-compositionality is the reason for their classical standing as "a pain in the neck" (Sag et al., 2002) and "hard going" (Rayson et al., 2010) for NLP.
1 The parallel corpus is available at https://github.com/zhjjn/MWE_PIE.git
The Oxford English Dictionary defines the phrasal verb (an IE) vote out as 'To turn (a person) out of office.' Using Google Translate to translate the topical slogan "vote them out!" into eight of the world's most spoken and relatively resource-rich languages yielded the results shown in Figure 1. As native speakers will attest, all the translations other than the Spanish one mean just the opposite, "vote for them!" This example, along with other studies on the computational processing of idioms and metaphors (Salton et al., 2014; Shao et al.; Shutova et al., 2013), reinforces the need for nuanced language processing, a grand challenge for NLP systems.
Gaining a deeper understanding of IEs and their literal counterparts is an important step toward this goal. In this paper, we introduce two novel tasks related to paraphrasing between literal and idiomatic expressions in unrestricted text: (1) idiomatic sentence simplification (ISS), to automatically paraphrase idiomatic expressions in text, and (2) idiomatic sentence generation (ISG), to replace a literal phrase in a sentence with a synonymous but more vivid phrase (e.g., an idiom). ISS directly addresses the need for text simplification in several application settings, including summarization (Klebanov et al., 2004) and parsing (Constant et al., 2017). Moreover, ISS may be helpful when an idiomatic expression does not have an exact counterpart in a target language. This is akin to the 'translation by paraphrase' strategy recommended for human translation when the source language idiom is obscure and non-existent in the target language (Baker, 2018). ISG, on the other hand, advances the area of text style transfer (Jhamtani et al., 2017; Gong et al., 2019) by bringing the as yet unexplored dimension of nuanced language to style transfer.
A second important component of this paper is the introduction of a new curated dataset of parallel idiomatic and literal sentences, in which the idiomatic expressions are paraphrased. It was created to advance progress in nuanced language processing and to serve as a testbed for the proposed tasks. Recent literature has explored several aspects of figurative and nonliteral language processing, including detecting and interpreting metaphors (Shutova, 2010b; Shutova et al., 2013), disambiguating IEs for their figurative or literal use in a given context (Constant et al., 2017; Savary et al., 2017; Liu and Hwa, 2019) and analyzing sarcasm (Muresan et al., 2016; Joshi et al., 2017; Ghosh et al., 2018), by using curated datasets of sentences with linguistic processes in the wild. These datasets are ill-suited for the proposed tasks because they consist of specific figurative constructions (metaphors) (Shutova, 2010a), do not cover multiple IEs (Cook et al., 2008; Korkontzelos et al., 2013), or are not parallel (Haagsma et al., 2020; Savary et al., 2017), underscoring the need for a new dataset.
The newly constructed dataset permits us to benchmark the performance of several state-of-the-art neural network architectures (seq2seq and pretrained+fine-tuned models, with and without copy-enrichment) that have demonstrated competitive performance in the related tasks of simplification and style transfer. Using automatic and manual evaluations of the outputs for the two tasks, we find that the existing models are inadequate for the proposed tasks. The sequence-to-sequence models clearly suffer from data sparsity, the added copy mechanism helps preserve the context that is not replaced, and the pretrained models, despite their prior knowledge, are still limited in their ability to paraphrase and generate. This leads us to discuss novel insights, applications and future directions for related research.
The main contributions of this work are summarized as follows.
1. We propose two new tasks related to idiomatic expressions: idiomatic sentence simplification and idiomatic sentence generation.
2. We introduce a curated dataset of 823 idiomatic expressions, replete with sentences containing these IEs in the wild and the same sentences where the IEs were replaced by their literal paraphrases.
3. We use the combination of the new dataset and the proposed tasks as a lens through which we gain novel insights about the capabilities of deep learning models for processing nuanced language generation and paraphrasing.

Task Definition
We propose two new tasks. The first, idiomatic sentence generation, transforms a literal sentence into a sentence involving idioms. Used frequently in everyday language, idioms are known to add color to expressions and improve the fluency of communication. Idiomatic rewriting improves the quality of text generation in that it can enhance textual diversity and convey abstract and complicated ideas in a succinct manner. For example, the idiomatic sentence "BP cut corners and violated safety requirements" conveys the same idea as its literal counterpart "BP saved time, money and energy and violated safety requirements", but in a more vivid and succinct manner. The second task, idiomatic sentence paraphrasing, simplifies sentences with idioms into literal expressions. As an example, in the sentence "It is certainly not a sensible move to cut corners with national security", the idiom cut corners is replaced by the literal counterpart save money. Because machine translation often suffers from idioms, paraphrasing them away can also benefit machine translation.
In this work, we distinguish our task of idiomatic sentence generation from idiom generation. While the latter task creates new idioms with novel word combinations, our study is to use existing idioms in a sentence and preserve the semantic meaning.
The task of idiomatic sentence paraphrasing is closely related to text simplification, which has mostly been studied as the related tasks of lexical paraphrasing and syntactic paraphrasing (Xu et al., 2015). Our task departs significantly from these related tasks, which centrally address style, in that (i) we aim for local synonymous paraphrasing by transforming not the entire sentence but a phrase in the sentence, and (ii) the transformation is not related to syntactic structures but to the complexity in meaning. We propose doing joint monolingual translation with simplification, which is similar in spirit to Agrawal and Carpuat (2020).
There are many technical challenges to performing these tasks. First, the task of idiomatic sentence paraphrasing involves identifying that an expression is an idiom and not a literal expression (e.g. black sheep) (Fazly et al., 2009; Korkontzelos et al., 2013; Liu and Hwa, 2019). Second, once identified, the IE may have multiple senses (e.g. tick off) and its appropriate sense will need to be identified before paraphrasing it. Third, an appropriate literal phrase will have to be generated to replace the IE. Finally, the literal phrase will have to be fit into the surrounding sentential context for a fluent construction. For idiomatic sentence generation, the context of the literal phrase could permit more than one candidate idiom (e.g. keep quiet). In this study, we assume that the input is already known to be an idiomatic sentence and leave it to future work to explore the task in conjunction with the identification step.

Related Work
The theme of this paper is naturally connected to three streams of text generation tasks: paraphrasing, style transfer and metaphoric expression generation. We discuss these tasks and the datasets used in them to study their similarities to and differences from our dataset and tasks.

Paraphrase
The aim of paraphrasing is to rewrite a given sentence while preserving its original meaning. Paraphrasing has been widely studied in recent research, and many datasets have been constructed to facilitate the task. PPDB (Ganitkevitch et al., 2013), MRPC, the Twitter URL Corpus (Lan et al., 2017), Quora and ParaNMT-50M (Wieting and Gimpel, 2017) have been the most commonly used datasets. The widely used Seq2Seq models have been successfully applied to paraphrasing (Prakash et al.). However, unlike paraphrasing a whole sentence or a literal-to-literal paraphrasing task, our proposed tasks are more constrained given the existence of idiomatic expressions. This renders the datasets used for the task of paraphrasing and the associated paraphrasing models inadequate for our task. Our dataset is created to fill this need and to advance a fundamental understanding of idiomatic text generation and paraphrasing. Research into our tasks and dataset can therefore also be used for paraphrasing when only part of the sentence needs to be paraphrased or when idioms need to be paraphrased.

Style Transfer
The task of style transfer can be defined as rewriting sentences into those with a target style. Recent research has primarily focused on sentiment manipulation and changes in writing styles (Jhamtani et al., 2017; Gong et al., 2019). Our proposed tasks differ from the style transfer studies in recent works because (i) our tasks retain a large portion of the input sentences while style transfer may need to completely change the input sentences, and (ii) our tasks explore the nuance component of style, an aspect heretofore unexplored. To test different models' performance on style transfer, several nonparallel corpora have been used (Yelp (Shen et al., 2017), Grammarly's Yahoo Answers Formality Corpus (Rao and Tetreault, 2018), Amazon Food Review dataset (McAuley and Leskovec, 2013) and Product Review dataset (He and McAuley, 2016)).
Despite their size, they lack the focus on IEs and are all non-parallel. This has led to the study of unsupervised methods for style transfer, including cross-aligned auto-encoder (Hu et al., 2017), VAE (Hu et al., 2017), Generative Adversarial Network (Zeng et al., 2020), reinforcement learning for constraints in style transfer (Xu et al., 2018; Gong et al., 2019) and pipeline models (Li et al., 2018; Sudhakar et al., 2019). Owing to the essential departure of our tasks from previously studied style transfer tasks, and the limitations of non-parallel corpora, we create our own parallel dataset that focuses on IEs.

Metaphoric Expression Generation
Prior work on automated metaphor processing has primarily focused on identification, interpretation and generation (Shutova, 2010b; Shutova et al., 2013; Abe et al., 2006). Data for this task is extremely sparse: there are no large-scale parallel corpora of literal and metaphoric paraphrases aimed at metaphor generation. The most useful one is that of Mohammad et al. (2016). However, their dataset has a small number (171) of metaphoric sentences extracted from WordNet. Early works on metaphor generation mainly focus on phrase-level metaphor and template-based generation (Terai and Nakagawa, 2010; Ovchinnikova et al., 2014). Recent works also explore the power of neural networks (Mao et al., 2018; Yu and Wan, 2019; Stowe et al., 2020). However, most of the research on metaphor generation suffers from the lack of parallel corpora.
Our proposed tasks share some similarities with metaphor generation but also differ from it. Instead of paraphrasing a single word, as most metaphor generation work does, our tasks often require a mapping between two multi-word expressions, which makes them more challenging.

Text Simplification
Text simplification aims to rewrite input sentences into lexically and/or syntactically simplified forms. The Simple Wikipedia Corpus (Zhu et al., 2010) and, more recently, the Newsela dataset (Xu et al., 2015) and the WikiLarge dataset (Zhang and Lapata, 2017) dominate the research area. Different machine learning models have also been explored for this task, including statistical machine translation models (Wubben et al., 2012), the Seq2Seq architecture (Nisioi et al., 2017) and the Transformer architecture (Zhao et al., 2018).
Departing from previous attempts at lexical or syntactic simplification, our proposed task of idiomatic sentence paraphrasing aims to simplify the nuance of non-compositional and figurative expressions thereby permitting a more literal understanding of the sentence.
We summarize the datasets of the related tasks in Table 1.

Building the Dataset
We describe the details of the data collection, data annotation, corpus analyses and comparisons with other existing corpora.

Data Collection
The Parallel Idiomatic Expression Corpus (PIE) consists of idiomatic expressions (IEs), their definitions, sentences containing the IEs and corresponding sentences where the IEs are replaced with their literal paraphrases. One instance of the dataset is shown in Figure 2.
We collected a list of 1042 popular IEs and their meanings from an educational website that has a broad coverage of frequently used IEs including phrasal verbs, idioms and proverbs. For a broad coverage of IEs we did not limit them to a specific syntactic category. The list was then split between the members of the research team, consisting of a native English speaker and three near-native English speakers. Some IEs such as "tick off" (Figure 2) have multiple senses. The annotators labeled the sense of IEs in given sentences according to the sense information from reliable sources including the Oxford English Dictionary, the Webster Dictionary and the Longman Dictionary of Contemporary English. IEs that were not available in any of the popular dictionaries were excluded from the dataset, as were proverbs that are independent clauses (e.g., the pen is mightier than the sword). To guarantee that each sense is well represented, the annotators collected at least 5 sentences for each sense of an IE from online sources (e.g., the Corpus of Contemporary American English, and examples listed in dictionaries).
The data collection step yielded a corpus with a total of 823 IEs and 5170 sentence pairs using these IEs (an average of 6.3 sentence pairs per idiom). We also note that every instance (idiomatic-literal pair) is only one sentence long. The corpus statistics are summarized in Table 2.
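To make the structure of a corpus instance concrete, the following minimal sketch represents one idiomatic-literal pair as a Python record and recomputes the average number of sentence pairs per idiom. The class name, the field names and the sense glosses of the toy instances are illustrative assumptions; they do not reflect the format of the released files.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PieInstance:
    """One idiomatic-literal sentence pair (hypothetical field names)."""
    idiom: str
    sense: str
    idiomatic_sentence: str
    literal_sentence: str


def pairs_per_idiom(instances):
    """Average number of sentence pairs per distinct idiom."""
    counts = defaultdict(int)
    for inst in instances:
        counts[inst.idiom] += 1
    return sum(counts.values()) / len(counts)


# Toy instances built from example sentences quoted in this paper.
toy = [
    PieInstance("sit on the fence", "delay making a decision",
                "You can't sit on the fence any longer .",
                "You can't delay making a decision any longer ."),
    PieInstance("cut corners", "save time, money and energy",
                "BP cut corners and violated safety requirements",
                "BP saved time, money and energy and violated safety requirements"),
    PieInstance("cut corners", "save money",
                "It is certainly not a sensible move to cut corners with national security",
                "It is certainly not a sensible move to save money with national security"),
]

toy_average = pairs_per_idiom(toy)         # 3 pairs over 2 idioms
corpus_average = round(5170 / 823, 1)      # figures reported for the full corpus
print(toy_average, corpus_average)
```

With the reported totals, 5170 / 823 is approximately 6.28, which rounds to the 6.3 pairs per idiom stated above.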

Data Annotation
In order to create the parallel dataset of idiomatic and literal sentences for the proposed tasks, a native English speaker was asked to rewrite each idiomatic sentence into its literal form, where the IE was replaced by a literal phrase. As part of this manual paraphrasing, the annotator was asked to paraphrase only the IE so as not to alter its meaning in the context of the sentence, to preserve the phrase's syntactic function and to conform to the sense definition. The rest of the sentence was to be left unchanged. The annotator was free to use the original sense definition when rewriting or to use paraphrases of the sense definition. After the first annotation pass, the researchers checked the literal sentences generated by the first annotator and corrected any errors.
To specify the span of the IE in each idiomatic sentence and that of the literal paraphrase in the corresponding literal sentence, BIO labels were used: B marks the beginning of the idiomatic expression (resp. the literal paraphrase), I the other words in the IE (resp. in the literal paraphrase) and O all the other words in the sentence. This labeling was done automatically, exploiting the fact that the only difference between a given idiomatic sentence and its literal sentence is the replacement of the idiom with the literal phrase. An example of a BIO-labeled sentence pair is shown in Figure 2.
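The automatic labeling described above can be sketched as follows. The text does not give the exact procedure, so this is only one plausible reading: whitespace tokenization is assumed, and the replaced span is located by stripping the longest common prefix and suffix of the two sentences. The function names are hypothetical.

```python
def differing_span(idiomatic_tokens, literal_tokens):
    """Locate the single span that differs between two token lists.

    Returns (start, end_in_idiomatic, end_in_literal), end-exclusive.
    """
    a, b = idiomatic_tokens, literal_tokens
    limit = min(len(a), len(b))
    start = 0
    while start < limit and a[start] == b[start]:
        start += 1
    suffix = 0
    while suffix < limit - start and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return start, len(a) - suffix, len(b) - suffix


def bio_labels(n_tokens, start, end):
    """B on the first token of the span, I on the rest, O elsewhere."""
    labels = ["O"] * n_tokens
    if start < end:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


idiomatic = "You can't sit on the fence any longer , you need to make up your mind .".split()
literal = "You can't delay making a decision any longer , you need to make up your mind .".split()

start, end_idiom, end_literal = differing_span(idiomatic, literal)
idiom_labels = bio_labels(len(idiomatic), start, end_idiom)
literal_labels = bio_labels(len(literal), start, end_literal)
print(list(zip(idiomatic, idiom_labels))[:7])
```

A caveat of this reading: if the idiom and its literal paraphrase happen to share their first or last word, the recovered span shrinks by that word, so a real implementation would need the idiom string as an additional cue.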

Corpus Analyses
We summarize the statistics of our PIE dataset in Table 2 and compare it with existing datasets in Table 1. We notice that the parallel sentences in our dataset are comparable in terms of sentence length, while simple sentences are much shorter in the text simplification dataset. This suggests that the tasks we propose may not result in significantly shorter sentences compared to their inputs, which constitutes a core departure from the task of text simplification. Moreover, the sentences in our dataset are longer on average than the sentences in existing datasets (with the exception of text simplification data). This can pose challenges to text generation models performing the tasks proposed in the paper.
We also report the percentage of n-grams in the literal sentences which do not appear in the idiomatic sentences as a measure of the difference between the idiomatic and literal sentences. As shown in Table 3, there is smaller variation between the source sentences and the target sentences in our dataset. This is again due to the nature of our task, which calls for a local paraphrasing (rewriting only a part of the sentence).
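The n-gram difference measure can be illustrated with a short sketch. It assumes whitespace tokenization and counts n-gram occurrences (tokens rather than types) in the literal sentence; the text does not specify either choice, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_pct(idiomatic_sentence, literal_sentence, n):
    """Percentage of literal-sentence n-grams absent from the idiomatic sentence."""
    seen = set(ngrams(idiomatic_sentence.split(), n))
    target = ngrams(literal_sentence.split(), n)
    if not target:
        return 0.0
    novel = sum(1 for gram in target if gram not in seen)
    return 100.0 * novel / len(target)


idiomatic = "You can't sit on the fence any longer , you need to make up your mind ."
literal = "You can't delay making a decision any longer , you need to make up your mind ."

unigram_pct = novel_ngram_pct(idiomatic, literal, 1)  # 4 of 17 unigrams are new
bigram_pct = novel_ngram_pct(idiomatic, literal, 2)   # 5 of 16 bigrams are new
print(round(unigram_pct, 2), bigram_pct)
```

Because only the replaced phrase and the n-grams straddling its boundaries are new, the percentages stay low, which is the local-paraphrasing property noted above.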
We note that IEs may be naturally ambiguous due to the existence of both figurative and literal senses, as also pointed out in previous works. A small portion of IEs in our dataset have multiple senses; one example is "tick off" in Figure 2. Table 4 presents the distribution of the senses of the IEs in our dataset. The average number of senses is 1.05, suggesting that the majority of IEs in our dataset are monosemous.

Dataset quality
Because the literal counterparts of the idiomatic sentences were manually created, the quality of our dataset may be called into question. We point out that, in an effort to quickly use sentences of good quality and in line with existing datasets for related tasks with idiomatic expressions (Haagsma et al., 2020; Korkontzelos et al., 2013), we collected idiomatic expressions in the wild. However, as acknowledged by previous dataset creation efforts, not all IEs occur equally frequently, which can result in a representation bias. In addition, finding true paraphrases of IEs in the wild is hard. In light of these practical data-related concerns, we resorted to a manual paraphrasing of the IEs as a trade-off between naturalness and representation. This idea of using non-natural instances is also influenced by successful recent approaches to training data collection and data augmentation using synthetic methods reported in severely resource-constrained domains such as machine translation (Sennrich et al., 2016) and clinical language processing (Ive et al., 2020).

Baselines
Translation Models: Considering that our tasks of idiomatic sentence generation and paraphrasing have not been studied before and that they are both text generation tasks, we first chose basic end-to-end models that have shown state-of-the-art performance on other text generation tasks. Accordingly, we used the LSTM-based Seq2Seq model (Sutskever et al., 2014) and the transformer architecture (Vaswani et al., 2017). These will be referred to as Translation Models.
Finally, we used a sequential model inspired by the retrieve-delete-generate pipeline (Sudhakar et al., 2019; Zhou et al., 2021) that showed competitive performance for style transfer. We note that novel instances of idiomatic sentences cannot be generated without previously encountering the IE. Considering this, we set up the pipeline model with a retrieval stage to retrieve an IE for a given literal sentence (resp. the correct sense given an idiomatic sentence). Toward this, a RoBERTa model for sentence classification was fine-tuned on our training data. The concatenation of the input sentence and the correct idiom or sense is considered a positive instance, and that of the input sentence and an irrelevant idiom or a different sense is considered a negative instance. Given all the concatenations of the input sentence with the idioms in our dataset, this stage aims to identify the correct one.
In the deletion stage, we deleted the literal phrase that should be replaced by the retrieved idiom (resp. deleted the IE in the given idiomatic sentence). Again, a RoBERTa model for sequence classification was fine-tuned on our training data with BIO labels. This stage assigns one of the BIO labels to each token in the input sentence and deletes the tokens labeled B or I.
In the generation stage, we combined the results from the retrieval and deletion stages and used a fine-tuned BART model to generate the final output: the literal sentences for the task of idiomatic sentence paraphrasing and the idiomatic sentences for the task of idiomatic sentence generation.
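The data flow through the three stages can be sketched without the neural components. The code below only shows how retrieval training instances might be built, how tokens labeled B or I are deleted, and how the two results might be combined as input to the generator. The separator string, the function names and the way the retrieved idiom is joined to the remaining context are assumptions made for illustration; the fine-tuned RoBERTa and BART models themselves are not reproduced here.

```python
SEP = " </s> "  # hypothetical separator between sentence and candidate


def retrieval_instances(sentence, gold_idiom, candidate_idioms):
    """Pair the sentence with every candidate; label 1 for the gold idiom, else 0."""
    return [(sentence + SEP + cand, int(cand == gold_idiom))
            for cand in candidate_idioms]


def delete_labeled_span(tokens, labels):
    """Deletion stage: drop every token whose BIO label is B or I."""
    return [tok for tok, lab in zip(tokens, labels) if lab == "O"]


def generator_input(retrieved, remaining_tokens):
    """Combine the retrieved idiom (or sense) with the remaining context."""
    return retrieved + SEP + " ".join(remaining_tokens)


literal = "You can't delay making a decision any longer , you need to make up your mind ."
tokens = literal.split()
labels = ["O", "O", "B", "I", "I", "I"] + ["O"] * 11  # as a deletion model would predict

instances = retrieval_instances(
    literal, "sit on the fence", ["sit on the fence", "cut corners", "tick off"])
remaining = delete_labeled_span(tokens, labels)
model_input = generator_input("sit on the fence", remaining)
print(model_input)
```

The sketch also makes the error propagation visible: a wrong retrieval or an imprecise deletion span changes the generator's input directly, so the final stage cannot recover information lost earlier.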

Experimental Setup
For all the models, the maximum sentence length was set to 128. The batch size and base learning rate were set to 32 and 5e-5, respectively. These models were all trained and run on the Google Colab platform.
For the translation models and copy models, the dimension of the hidden state vectors was set to 256 and the dimension of the word embeddings to 256. These baselines were trained with the parallel sentence pairs as appropriate, i.e., taking literal sentences as input and generating the corresponding idiomatic sentences or vice versa.
The baseline pretrained BART model was trained for 5 epochs and during inference a beam search with 5 beams was used with top-k set to 100 and top-p set to 0.5. The other hyper-parameters were set to their default values.
All the RoBERTa and BART models in the pipeline model were trained for 5 epochs. For the BART model, during inference, we used a beam search with 5 beams with top-k set to 100 and top-p set to 0.5. The other hyper-parameters were set to their default values.
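The reported settings can be collected in one place as below. This is only a bookkeeping sketch: the dictionary names and keys are hypothetical, and values that the text does not state (for example the optimizer, or the number of epochs for the translation and copy models) are deliberately left out.

```python
# Settings shared by all models, as reported in the text.
COMMON = {
    "max_sentence_length": 128,
    "batch_size": 32,
    "learning_rate": 5e-5,
}

# Translation models and copy models.
TRANSLATION_AND_COPY = {**COMMON, "hidden_size": 256, "embedding_size": 256}

# Decoding settings reported for BART (baseline and pipeline generator).
DECODING = {"num_beams": 5, "top_k": 100, "top_p": 0.5}

BART_BASELINE = {**COMMON, "epochs": 5, **DECODING}
PIPELINE_ROBERTA = {**COMMON, "epochs": 5}
PIPELINE_BART = {**COMMON, "epochs": 5, **DECODING}

for name, config in [("translation/copy", TRANSLATION_AND_COPY),
                     ("BART baseline", BART_BASELINE),
                     ("pipeline RoBERTa", PIPELINE_ROBERTA),
                     ("pipeline BART", PIPELINE_BART)]:
    print(name, config)
```

Top-k and top-p are usually sampling parameters; the text does not say whether sampling was combined with beam search, so the dictionaries record only the reported values.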

Evaluation
For automatic evaluation, ROUGE (Lin, 2004), BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007) and SARI (Xu et al., 2016) are used to measure the similarity between the generated sentences and the references. These metrics have been widely used in various text generation tasks such as paraphrasing, style transfer and text simplification. To measure linguistic quality, we use the pre-trained language model BERT to calculate perplexity scores, along with a recently proposed measure, GRUEN (Zhu and Bhat, 2020).
Considering that automatic evaluation cannot fully analyze the results, we use human evaluation as a complement to the automatic evaluation metrics. For each task, we randomly sampled 100 input sentences and the corresponding outputs of all baselines. Human annotations were collected with respect to context, style and fluency of generated sentences based on the following criteria.
(1) Context preservation measures how well the context surrounding the idiomatic/literal phrase is preserved in the output.
(2) Target inclusion checks whether the correct IE or literal phrase is used in the output.
(3) Fluency evaluates the fluency and readability of the output sentence including how appropriately the verb tense, noun and pronoun forms are used.
(4) Overall meaning evaluates the overall quality of the output sentence.
For each output sentence, two annotators with native-speaker-level English proficiency were asked to rate it on a scale from 1 to 6 in terms of context preservation, fluency and overall meaning. Higher scores indicate better quality. For target inclusion, they were asked to rate it on a scale from 1 to 3: a score of 1 denotes that the target phrase is not included in the output at all, 2 denotes partial inclusion, and 3 denotes complete inclusion. We report the average score over all samples for each baseline in each aspect.
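The aggregation of the human ratings can be sketched as follows. The rating values in the example are invented purely for illustration and are not results from this paper; the criterion keys and function name are likewise hypothetical. The sketch assumes that the ratings of both annotators are pooled before averaging, which the text does not state explicitly.

```python
from statistics import mean

# Allowed rating range per criterion, as described in the text.
SCALES = {"context": (1, 6), "target": (1, 3), "fluency": (1, 6), "overall": (1, 6)}


def average_ratings(ratings):
    """Average each criterion over all (sample, annotator) ratings of one baseline."""
    for rating in ratings:
        for criterion, (low, high) in SCALES.items():
            if not low <= rating[criterion] <= high:
                raise ValueError(f"{criterion} score {rating[criterion]} out of range")
    return {criterion: mean(r[criterion] for r in ratings) for criterion in SCALES}


# Invented ratings: two samples, each rated by two annotators.
example_ratings = [
    {"context": 6, "target": 3, "fluency": 5, "overall": 5},
    {"context": 5, "target": 3, "fluency": 5, "overall": 4},
    {"context": 4, "target": 1, "fluency": 3, "overall": 2},
    {"context": 5, "target": 1, "fluency": 3, "overall": 3},
]

averages = average_ratings(example_ratings)
print(averages)
```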

Results and Discussion
Results. We report the automatic and human evaluation results in Tables 5 and 6. More detailed results with all the metrics considered are in the appendix. On both tasks, going by the automatic metrics, the copy-enriched transformer, the pretrained BART model and the pipeline model perform better than the other baselines. Pretrained BART achieved the best performance in BLEU and GRUEN, and the pipeline model does best in SARI. As for human evaluation, BART and the pipeline again achieve the best performance among the baselines. While BART is the best in preserving contexts and achieving fluency, the pipeline is the best in idiom paraphrasing and generation. The overall agreement score for human evaluation is 0.76.
Model competence. BART and the pipeline model outperform the other baselines in that they leverage auxiliary information (large pretraining corpora and selective idiomatic expression information, respectively) that is not available to the other models. The benefit of the copy mechanism, which explicitly retains the contexts as required by our tasks, is shown in the corresponding gains in automatic and manual evaluation scores for both Seq2Seq and transformer models.
When it comes to the comparison between BART and the pipeline, BART does better in retaining the contexts surrounding idiomatic expressions, given its high context score in human evaluation, while the pipeline is better at handling the idiomatic part, i.e., target inclusion. Despite the reported superior performance of BART in related text generation tasks (Lewis et al., 2019), our experiments show that BART has limited capability in idiom paraphrasing and generation. The pipeline method, by virtue of error propagation from its retrieval and deletion modules, suffers in terms of both context preservation and fluency. For the task of idiomatic sentence generation, the accuracy of the retrieval module is 0.27 and the F1 score of the deletion module is 0.68. For the task of idiomatic sentence paraphrasing, the accuracy of the retrieval module is 0.96 and the F1 score of the deletion module is 0.85.
Comparison between two tasks. According to the human evaluation results in Table 6, both BART and the pipeline received higher scores for idiomatic sentence paraphrasing than for idiomatic sentence generation, suggesting that paraphrasing is the easier of the two tasks. This resonates with our intuitions as language users in that, given a lexical resource, paraphrasing an IE is easier than finding the right IE to replace a phrase.
Limitation of automatic metrics. Table 7 presents the correlation between automatic metrics and human judgements. None of the correlations between the automatic metrics and the human evaluation scores is high. For BLEU and SARI, which mainly measure overlapping tokens, synonymous idioms or literal phrases that differ from the reference receive no credit even when they are appropriate. For the GRUEN metric, which aims to measure text quality, the correlations with fluency and overall meaning are quite low. Therefore, more reliable automatic evaluation methods are needed.
Error analysis. For the task of idiomatic sentence generation, the primary challenge is in identifying the appropriate IE, which is hardest when the IE is highly non-compositional (e.g., bird of passage). Examples are presented in Table 11 in the Appendix. For the task of idiomatic sentence paraphrasing, one challenge is the difficulty of choosing the correct sense of the idiom. As shown in Table 12 in the Appendix, all the baseline models were unable to generate the correct literal phrases for "alpha and omega", which has two senses: "the beginning and the end" and "the principal element". We also noticed that the strong baselines, pretrained BART and the pipeline model, tend to use a short but inaccurate literal phrase when the correct one is long. The paraphrasing of "the bird of passage" in Table 12 is an example.
Applications. Research in the proposed tasks has many potential practical applications. 1) An idiomatic sentence paraphrasing tool would be of importance in several language processing settings encountered by humans and machines. The nonliteral and stylized meanings of multi-word expressions (MWEs) in general, and idioms in particular, pose two broad kinds of challenges. First, they affect readability in target populations. For instance, despite their intact structural language competence, individuals with Asperger syndrome, and more broadly those with autism spectrum disorder, are known to experience significant challenges understanding figurative language (idioms) in their native language (Kalandadze et al., 2018). It is also widely acknowledged that idiomatic expressions are some of the hardest aspects of language acquisition and processing for second language learners (Liontas, 2002; Ellis et al., 2008; Canut et al., 2020). Second, natural language processing systems are known to be negatively impacted by idioms in text; Salton et al. (2014), Shao et al. and Shutova et al. (2013) have shown the negative impact of idioms and metaphors on machine translation, leading to awkward or incorrect translations from English to other languages. Fruitful results of this task can lead to a system capable of recognizing and interpreting IEs in unrestricted text, a central component of any real-world NLP application (e.g., information retrieval, machine translation, question answering, information extraction, and opinion mining). 2) A realistic application of the idiomatic sentence generation task would be computer-aided style checking, where a post-processing tool could suggest a list of idioms to replace a literal phrase in a sentence. 3) True integration with an external NLP application would require combining the first step of IE identification with the subsequent paraphrasing, as done in Shutova et al. (2013); this combination can be a future direction for research.
Table 7: Instance-level Spearman's correlations between human and automatic evaluation for pretrained BART.
Literal sentence: You can't delay making a decision any longer , you need to make up your mind .
Idiomatic sentence: You can't sit on the fence any longer , you need to make up your mind .
S2I
Seq2Seq: You can't be in the obsession any night , you need to make up your plans
Transformer: you can't delay making a decision of any longer , you need to make your mind your mind .
Seq2Seq-copy: you can't sit sit the fence any , , you need to to up your .
Transformer-copy: you can't delay making a decision any longer , you need to make up your mind .
Pipeline: You can't delay making a decisione any longer, you make your mind.
BART: You can't delay making a decision any longer, you need to make up your own mind.
I2S
Seq2Seq: You can't wait on the money any rival , you need to make up your energy .
Transformer: you can't sit on the ? any longer , you need to make up your mind .
Seq2Seq-copy: you can't delay making any any any , you need to make your your mind .
Transformer-copy: you can't sit on the troublesome any longer , you need to make your mind .
Pipeline: You can't stay on the fence any longer, you need to make up your mind.
BART: You can't be indecisive any longer, you need to make up your mind.
Table 8: A sample of generated idiomatic sentences. Text in bold and italics red represents the idiomatic expressions correctly included in the outputs, text in bold blue represents the literal counterparts in the input sentences and text in underlined olive represents the idioms or literal phrases that are poorly generated.
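For reference, the instance-level Spearman correlation used in Table 7 to compare automatic metrics with human judgements can be computed as in the sketch below. It is a plain-Python version that assigns average ranks to ties and then takes the Pearson correlation of the ranks; the scores in the example are invented for illustration and are not results from this paper.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented per-sentence scores for four outputs of one model.
metric_scores = [0.9, 0.4, 0.7, 0.2]
human_scores = [6, 3, 5, 4]

rho = spearman(metric_scores, human_scores)
print(round(rho, 3))
```

The function is undefined when one of the inputs is constant (zero rank variance), which can happen for a criterion on which every sampled output receives the same human score.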

Conclusions
To conclude, in this paper we proposed two new tasks: idiomatic sentence generation and paraphrasing. We also presented PIE, the first parallel idiom corpus. We benchmarked existing end-to-end trained neural network models and a pipeline method on PIE and analyzed their performance on our tasks. Our experiments and analyses reveal the competence and shortcomings of available methods, underscoring the need for continued research on processing idiomatic expressions. Future work should explore richer model architectures and more reliable evaluation methods to improve performance.

Literal sentence
Joe , being one who is here today and gone tomorrow , stayed the night , had some rest and ate some food and left early the next morning .

Reference
Joe , being the bird of passage he is , stayed the night , had some rest and ate some food and left early the next morning .

Seq2Seq
First , being one , and putting the project going to be joined the ones , had some ice row and creating some people and creating some expensive of both the time .

Transformer
joe , being one who is here today and gone tomorrow , kept the night , had some rest and punched some food a great early .

Seq2Seq with copy
joe , being the bird of he he , , , , , , , some some some some and and and and the .

Transformer with copy
joe , being one who is here today and gone tomorrow , stayed the night , had a rest and ate food left the next early .

Pretrained BART
Joe, being one who is here today and gone tomorrow, stayed the night, had some rest and ate some food and left early the next morning.

Pipeline
cool heels joe, being one who is here today and gone tomorrow, stayed the night, and ate some food and left early the next morning.

Attribute
multiple meaning

Literal sentence
My life starts from you and ends at you , so you are my first and my last .

Reference
My life starts from you and ends at you , so you are my alpha and omega .

Seq2Seq
My friend from you and offensive , and yet you are my dream and my loved .

Literal sentence
You can't delay making a decision any longer , you need to make up your mind .

Reference
You can't sit on the fence any longer , you need to make up your mind .

Seq2Seq
You can't be in the obsession any night , you need to make up your plans .

Transformer
you can't delay making a decision of any longer , you need to make your mind your mind .

Seq2Seq with copy
you can't sit sit the fence any , , you need to to up your .

Transformer with copy
you can't delay making a decision any longer , you need to make up your mind .

Pretrained BART
You can't delay making a decision any longer, you need to make up your own mind.

Pipeline
You can't delay making a decisione any longer, you make your mind.

Attribute
low non-compositionality

Literal sentence
Finding the ruins of Babylon was the archaeologist 's greatest find .

Reference
Finding the ruins of Babylon was the archaeologist 's treasure trove .

Seq2Seq
Missing the aftermath of pouring down the cake 's share of the city .

Transformer
catching up with silver lining of the challenges 's volatility .

Seq2Seq with copy
finding the ruins of unk was the 's 's trove .

Transformer with copy
finding the ruins of babylon was the archaeologist 's greatest silver spoons .

Pretrained BART
Finding the ruins of Babylon was the archaeologist's greatest find.

Pipeline
Finding the ruins of babylon was the archaeologist' treasure trove.

Table 11: Samples of generated idiomatic sentences. Text in blue represents the idiomatic expressions correctly included in the outputs. Text in red represents the literal counterparts in the input sentences. Text in green represents the idioms that are poorly generated.
Attribute
high non-compositionality

Idiomatic sentence
Joe , being the bird of passage he is , stayed the night , had some rest and ate some food and left early the next morning .

Reference
Joe , being one who is here today and gone tomorrow , stayed the night , had some rest and ate some food and left early the next morning .

Seq2Seq
And , sitting the part of the Bieber he is , seemed the morning , he some smart and wound problems so well and gives early at the next morning .

Transformer
joe , being the guards of nowhere he is , the night the night , and had some dealers and left the morning left the next morning .

Seq2Seq with copy
joe , being one who here today and tomorrow tomorrow stayed stayed night , had some and and and and and left next next next .

Transformer with copy
joe , being the bird of energy is stayed , stayed the night , some rest and ate ate some food left the next morning .

Pretrained BART
Joe, being the traveler he is, stayed the night, had some rest and ate some food and left early the next morning.

Pipeline
joe, being the person he is, stayed the night, had some rest and ate some food and left early the next morning.

Attribute
multiple meaning

Idiomatic sentence
My life starts from you and ends with you , so you are my alpha and omega .

Reference
My life starts from you and ends with you , so you are my first and my last .

Seq2Seq
My life dreams from you and read your family at you , so you are .

Idiomatic sentence
You can't sit on the fence any longer , you need to make up your mind .

Reference
You can't delay making a decision any longer , you need to make up your mind .

Seq2Seq
You can't wait on the money any rival , you need to make up your energy .

Transformer
you can't sit on the ? any longer , you need to make up your mind .

Seq2Seq with copy
you can't delay making any any any , you need to make your your mind .

Transformer with copy
you ca n't sit on the troublesome any longer , you need to make your mind .

Pretrained BART
You can't be indecisive any longer, you need to make up your mind.

Pipeline
You can't stay on the fence any longer, you need to make up your mind.

Attribute
low non-compositionality

Idiomatic sentence
Finding the ruins of Babylon was the archaeologist 's treasure trove .

Reference
Finding the ruins of Babylon was the archaeologist 's greatest find .

Seq2Seq
Edward the trap of nature was the racial out of Robert .

Transformer
finding and hide of confiement was shocking 's legal code .

Seq2Seq with copy
finding the ruins of unk was the unk 's greatest find .

Transformer with copy
finding the ruins of babylon was the archaeologist's family members .

Pretrained BART
Finding the ruins of Babylon was the archaeologist's greatest find.

Pipeline
Finding the ruins of babylon was the archaeologist's trove.

Table 12: Samples of generated literal sentences. Text in red represents the appropriate literal phrases included in the outputs. Text in blue represents the idioms in the input sentences. Text in green represents the literal phrases that are poorly generated.