Non-compositional Expression Generation Based on Curriculum Learning and Continual Learning

Non-compositional expressions, by virtue of their non-compositionality, are a classic 'pain in the neck' for NLP systems. Unlike general language modeling and generation tasks, which are primarily compositional, generating non-compositional expressions is more challenging for current neural models, including large pre-trained language models. The main reasons are 1) their non-compositionality and 2) the limited data resources. Therefore, to make the best use of available data for modeling non-compositionality, we propose a dynamic curriculum learning framework that learns training examples from easy ones to harder ones, optimizing the learning step by step, but this arrangement suffers from the forgetting problem. To alleviate the forgetting problem brought about by the arrangement of training examples, we also apply a continual learning method within our curriculum learning framework. Our proposed method combines curriculum and continual learning to gradually improve the model's performance on the task of non-compositional expression generation. Experiments on idiomatic expression generation and metaphor generation affirm the effectiveness of our proposed curriculum learning framework and the application of continual learning. Our code is available at https://github.com/zhjjn/CL2Gen.git.


Introduction
Natural language has a common yet special class of constructions called non-compositional expressions that exhibit semantic non-compositionality, where the meaning of the expression cannot be inferred from that of its constituent words (e.g., metaphors and idioms) (Baldwin and Kim, 2010). They are commonly used for specific communicative intents (Moon et al., 1998; Baldwin and Kim, 2010) and are individually rare but collectively frequent, appearing across genres (Moon et al., 1998; Haagsma et al., 2020). Most of the idioms indexed by the Oxford Dictionary have a frequency of less than 1 per million in the Corpus of Contemporary American English (Rafatbakhsh and Ahmadi, 2019). They have been classically regarded as a "pain in the neck" for NLP systems (Sag et al., 2002), not only because of their non-compositionality but also because of their contextual semantic ambiguity (they can be used in a non-compositional or compositional sense depending on the context). Different NLP tasks related to non-compositional expressions have been studied, including sentiment analysis (Biddle et al., 2020), paraphrase generation (Zhou et al., 2021c), natural language inference (Chakrabarty et al., 2021a), metaphor detection (Su et al., 2020), and idiom usage recognition (Liu and Hwa, 2018). However, the generation of non-compositional expressions remains an important yet under-explored problem. Therefore, this paper focuses on the generation of non-compositional expressions.
As shown in Table 1, non-compositional expression generation aims to generate the correct non-compositional expression given a sentence in which the original non-compositional expression is masked. Its importance stems from the facts that 1) non-compositional expressions are an important part of everyday human language use, and 2) their use imparts naturalness, fluency, and stylistic enhancement. Therefore, the ability to generate non-compositional expressions renders machine-generated language more natural and human-like, whereas current SOTA pre-trained text generation models, pre-trained only on ordinary compositional language, tend to generate only compositional expressions (Zeng and Bhat, 2022). Moreover, in our experiments, a simply fine-tuned model fails to correctly generate the target idiom most of the time. Previously, only a few studies have focused on metaphor generation (Yu and Wan, 2019; Chakrabarty et al., 2020; Stowe et al., 2021), whereas other types of non-compositional expressions remain under-explored (e.g., idioms). The sparsity of literature and data resources presents challenges for the study of non-compositional expression generation.

Idiom
Input Sentence: It looks like the temperature is going to drop tonight , so be careful not to [MASK] .
Output Sentence: It looks like the temperature is going to drop tonight , so be careful not to catch a cold .

Metaphor
Input Sentence: The scream [MASK] the night .
Output Sentence: The scream pierced the night .

Table 1: Examples of input and output in our tasks. Non-compositional expressions are highlighted in bold red.
To better utilize the available data and alleviate the resource limitation, curriculum learning (Bengio et al., 2009) enables models to begin training on easier examples and proceed to examples of increasing difficulty. As such, curriculum learning consists of two core constituents: (1) deciding the level of learning difficulty for each example, and (2) scheduling the order of training examples based on that difficulty level. Curriculum learning has recently emerged as a promising direction in different fields, including computer vision (Weinshall et al., 2018; Wang et al., 2019; Li et al., 2020) and natural language processing (Platanios et al., 2019; Liu et al., 2020; Zhou et al., 2021d; Zhang et al., 2021). However, despite its relative success on computer vision tasks, the application of curriculum learning in natural language processing is still limited mainly to neural machine translation, which has rich data resources, while applications with limited data, such as non-compositional expression generation, remain under-explored.
To this end, we propose a novel curriculum learning framework for non-compositional expression generation to fill the research gap in generating non-compositional expressions, including both metaphors and idioms. Our study is the first to focus on this task and utilizes curriculum learning to alleviate the problem caused by limited data resources. In our work, we use the representation distance and the perplexity score as the difficulty measurement and a dynamic scheduling method to order the examples. Specifically, we observe that by ordering the training examples according to difficulty level, curriculum learning creates a gradual shift in the distribution of domain difficulty, which causes the well-known catastrophic forgetting problem (French, 1993), ignored in previous curriculum learning works. Therefore, we propose RE-GEM, a continual learning algorithm, to alleviate the forgetting of knowledge learned in the early stages.
Overall, our main contributions are as follows:
• We conduct the first study on non-compositional expression generation covering both metaphors and idioms.
• We propose a novel curriculum learning framework specifically designed for non-compositional expression generation that uses the distance between contextualized representations and word embeddings, together with the perplexity score, as a measure of difficulty level. The measure is dynamically updated during training, and the training examples are scheduled accordingly.
• We point out for the first time the forgetting problem caused by curriculum learning and propose a scheme, RE-GEM, to alleviate it.
• We evaluate our proposed framework on two tasks: idiomatic expression generation and metaphor generation. Experimental results on both tasks affirm the effectiveness of our framework. Detailed ablation studies and analyses are provided to support our claims.
Related Work

Curriculum Learning. First proposed by Bengio et al. (2009), curriculum learning enables machine learning model training to proceed gradually from easy examples to harder ones according to a measure of difficulty level for each example, thereby permitting better utilization of available data resources. With growing research interest, curriculum learning has been applied to different fields, including computer vision (Weinshall et al., 2018; Wang et al., 2019; Li et al., 2020) and natural language processing. Despite its benefits in computer vision tasks, including image classification (Weinshall et al., 2018), human attribute analysis (Wang et al., 2019), and visual question answering (Li et al., 2020), its applicability in NLP has been limited mainly to NMT (Platanios et al., 2019; Liu et al., 2020; Zhou et al., 2021d). As a result, existing curriculum learning methods, including difficulty measurements and scheduling strategies, are mainly designed for NMT, which differs greatly from the task of processing non-compositionality (non-compositional expression generation). To this end, we propose a curriculum learning method specifically designed for non-compositional expression generation.
Continual Learning. Continual learning enables models to learn new knowledge while preserving previously acquired knowledge from a data stream with a continuously changing distribution. However, due to the well-known problem of catastrophic forgetting (French, 1993), continual learning remains challenging for current neural models.
The same forgetting problem can also appear in curriculum learning: because curriculum learning rearranges the examples according to their difficulty levels, it naturally creates a training data stream with a continuously changing distribution over the difficulty domain and thus causes forgetting. Although the catastrophic forgetting problem has been explored in both computer vision (Rebuffi et al., 2017; Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2019) and natural language processing (Xu et al., 2018; Liu et al., 2019; Sun et al., 2019; Chen et al., 2015; Shu et al., 2016; Thompson et al., 2019), no available studies on curriculum learning mention this forgetting problem. Our work is the first attempt to point out the issue of forgetting in curriculum learning and to study mechanisms to alleviate it.

Framework
In this section, we introduce our proposed curriculum learning method for non-compositional expression generation. Curriculum learning for efficiently leveraging available data resources consists of two main parts: a measure of the difficulty of training instances, and an arrangement of the training examples using this measure. Accordingly, for non-compositional expression generation, we propose a data arrangement method that dynamically arranges the training examples according to a newly studied difficulty metric. In addition, because current large pre-trained language models are insufficient at processing non-compositional expressions (Dankers et al., 2022), non-compositional expressions that are difficult for LMs to understand will have a high perplexity score, and the distance between the representations of non-compositional expressions and those of their constituent words will be large. Therefore, we use a combination of the representation distance and the perplexity score as a measure of an example's difficulty.
Moreover, in our experiments, we observe that when the curriculum learning principle of arranging the training examples based on their difficulty levels is followed, the problem of forgetting arises due to the gradual shift of distribution in domain difficulty. Therefore, to alleviate this forgetting problem, we propose a simple yet effective continual learning method. Figure 1 shows the workflow of our proposed curriculum learning framework, whose details follow.

Difficulty Metrics
In this section, we define the difficulty metric used by our framework. Previous works on curriculum learning mainly focus on compositional language. Therefore, difficulty metrics proposed in prior works, including sentence length, word rarity, and embedding norm, cannot reflect the difficulty levels of non-compositional expressions in sentences. As mentioned in Section 1, due to non-compositionality, the meaning of a non-compositional expression differs from the meanings of its constituent words. Therefore, the distance between the ideal representation of the non-compositional expression and the representations of its constituent words will be large.
Utilizing this property, we first propose to use the distance between the contextualized representation of the non-compositional expression and the original word embeddings of its constituent words to reflect difficulty. A larger distance indicates that the model has already learned a representation of the expression instead of relying on the constituent words' embeddings, which means this non-compositional expression is easy for the model. Conversely, a smaller distance means the target non-compositional expression is difficult for the model. Therefore, the difficulty metric based on representation distance is calculated as follows:

d^r(Y) = −∥ l(Y) − Emb(Y) ∥

where l(·) is the final layer of the model and Emb(·) is the embedding layer of the model.
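The distance-based score above can be sketched numerically. The snippet below is a minimal illustration assuming the contextualized vector (from the final layer l(·)) and the constituent embeddings (from Emb(·)) have already been extracted; the function name, the mean-pooling of constituent embeddings, and the sign convention (larger distance implies lower difficulty) are our illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def representation_distance_difficulty(ctx_repr, const_embs):
    """Hypothetical sketch of the representation-distance difficulty score.

    ctx_repr:   final-layer contextualized vector of the expression, shape (d,)
    const_embs: embedding-layer vectors of the constituent words, shape (k, d)

    A larger distance means the model has learned a dedicated representation
    for the expression (an easy example), so difficulty is the negated distance.
    """
    mean_emb = const_embs.mean(axis=0)            # pool constituent embeddings
    distance = np.linalg.norm(ctx_repr - mean_emb)
    return -distance                              # larger distance -> lower difficulty

# Toy usage: an expression whose contextual vector has moved far from its
# constituents' embeddings is scored as easier (lower difficulty).
ctx = np.array([1.0, 0.0])
far_consts = np.array([[5.0, 5.0]])   # constituents far away -> easy
near_consts = np.array([[1.0, 0.1]])  # constituents nearby   -> hard
assert representation_distance_difficulty(ctx, far_consts) \
    < representation_distance_difficulty(ctx, near_consts)
```

In a full implementation the two vector sets would come from the backbone model's final hidden states and its input embedding matrix, respectively.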
Besides, as mentioned in Section 1, due to the rarity of non-compositional expressions in large-scale corpora relative to compositional expressions, large pre-trained models seldom see non-compositional expressions during pre-training, which results in their inability to accurately capture the semantics of these expressions. Therefore, we assess the difficulty of training examples based on the model's familiarity with the non-compositional expressions. Toward this, we propose to utilize the perplexity score as a measure of the difficulty of each training example. This stems from the idea that in language modeling, perplexity is used as a quality measure for language models, indicative of their ability to predict the next word in a sequence. A lower perplexity score for a sequence X means the language model assigns a higher probability to generating X. Therefore, a lower perplexity on a non-compositional expression indicates that the language model is more familiar with that expression, i.e., the expression is easier for the language model, and the model is thus more likely to generate it. The difficulty metric based on the perplexity score is calculated as follows:

d^p(Y) = exp( −(1/|Y|) Σ_{i=1}^{|Y|} log p_θ(y_i | y_{<i}, X) )

where Y is the target sentence in a training example and θ denotes the trainable parameters.
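The perplexity computation reduces to exponentiating the negative mean token log-probability. A minimal sketch, assuming per-token natural-log probabilities have already been obtained from the model's softmax outputs (the helper name and the toy inputs are ours):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-(1/N) * sum_i log p(y_i | y_<i)).

    token_logprobs: per-token natural-log probabilities of the target
    sequence under the language model (hypothetical helper input).
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# A model familiar with an expression assigns higher token probabilities,
# hence lower perplexity and thus a lower difficulty score.
familiar = [math.log(0.9)] * 4     # confident predictions
unfamiliar = [math.log(0.1)] * 4   # surprised predictions
assert perplexity(familiar) < perplexity(unfamiliar)
```

For a uniform per-token probability p, the perplexity is exactly 1/p, which is a quick sanity check for any implementation.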

Scheduling Strategy
Having ascertained the difficulty level of each training example, the traditional curriculum learning scheme re-arranges the training examples by difficulty level and fixes this order for the entire subsequent training process.

Continual Learning
Essentially, after the order of the training examples is re-arranged based on difficulty level, the curriculum learning scheme creates a gradual shift of the distribution over the difficulty domain, which causes the problem of catastrophic forgetting, currently ignored by previous studies of curriculum learning (Bengio et al., 2009; Weinshall et al., 2018; Wang et al., 2019; Li et al., 2020; Platanios et al., 2019; Liu et al., 2020; Zhou et al., 2021d).
During training, the model will first learn the examples with a lower difficulty level and then learn those with a higher difficulty in each epoch.In this process, some knowledge about the examples with a lower difficulty level will be forgotten.
To alleviate the forgetting problem, we propose RE-GEM, a modified version of GEM (Lopez-Paz and Ranzato, 2017), which fits our framework better than traditional continual learning methods do, for the following reason.
Traditional continual learning methods like GEM aim to alleviate the forgetting problem and thus sacrifice some learning ability on new data and part of the overall performance. This is done with the use of an episodic memory, M_k, containing randomly sampled training examples from the data P_k at time step k. When minimizing the loss at the current time step t, GEM treats the losses on M_k for each step k < t as constraints by preventing their increase. An improved version of GEM (Chaudhry et al., 2018) treats only the loss on a subset M_ref of all the episodic memories for steps k < t as a constraint instead of computing multiple losses. To guarantee loss reduction on this episodic memory subset M_ref, their implementation first computes the loss gradient vector g at the current step and then the loss gradient vector g_ref on the subset M_ref. Whenever the angle between g and g_ref is greater than 90°, the gradient g is projected to ĝ as:

ĝ = g − (g⊤g_ref / g_ref⊤g_ref) g_ref

and the parameters are updated based on ĝ, with the intent of avoiding the forgetting of previous data while learning from new data. In contrast, our main focus is to learn from new data first and then to alleviate the forgetting of previous data through RE-GEM. Therefore, when the angle between g and g_ref is greater than 90°, we instead project g_ref with respect to g as follows:

g̃_ref = g_ref − (g⊤g_ref / g⊤g) g

The parameters are then updated based on both g and g̃_ref, where g guarantees the successful learning of the current new data and g̃_ref does its best to alleviate the forgetting of previous data. We leave exploring other forms of gradient updates to future work.
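The two projections can be contrasted in a few lines of numpy. The A-GEM projection follows Chaudhry et al. (2018); the RE-GEM variant below is our reading of the description above (keep g intact, project the conflicting component out of g_ref), so the exact update in the released code may differ:

```python
import numpy as np

def agem_project(g, g_ref):
    """A-GEM (Chaudhry et al., 2018): if the current gradient g conflicts
    with the memory gradient g_ref, project g so it no longer increases
    the loss on the episodic memory."""
    dot = g @ g_ref
    if dot < 0:                                  # angle > 90 degrees
        g = g - (dot / (g_ref @ g_ref)) * g_ref
    return g

def regem_project(g, g_ref):
    """RE-GEM sketch (our interpretation): leave g untouched so learning
    on new data is never sacrificed, and instead project the conflicting
    component out of the memory gradient g_ref; update with both."""
    dot = g @ g_ref
    if dot < 0:
        g_ref = g_ref - (dot / (g @ g)) * g
    return g, g_ref

g = np.array([1.0, 0.0])
g_ref = np.array([-1.0, 1.0])                    # conflicting memory gradient
g_new, g_ref_new = regem_project(g, g_ref)
assert np.allclose(g_new, g)                     # new-data learning preserved
assert abs(float(g_new @ g_ref_new)) < 1e-9      # conflict removed
```

Note the asymmetry: A-GEM alters the new-data gradient to protect the memory, while the RE-GEM sketch alters the memory gradient to protect learning on new data, matching the stated design goal.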
The constrained optimization problem in Eq. 1 can be solved via the rule proposed in (Chaudhry et al., 2018):

ĝ = g − (g⊤g_ref / g_ref⊤g_ref) g_ref

Experiments

Datasets
We use two datasets focusing on two kinds of non-compositional expressions: MAGPIE (Haagsma et al., 2020) for idioms and MERMAID (Chakrabarty et al., 2021b) for metaphors. For the instances from MAGPIE, we mask the target idiom in each example; the position of the target idiom is provided in the dataset. The official training-development-testing splits are used. For MERMAID, the masked sentences are already provided and we use the available data splits.

Baselines
We test six baseline models and compare them with our proposed curriculum learning framework: the Vanilla model, which does not use any CL method, Competence-based CL (Platanios et al., 2019), Norm-based CL (Liu et al., 2020), and SGCL (Zhou et al., 2021d). Due to space limitations, their descriptions and experimental settings are provided in the Appendix.

Experimental Settings
For our framework, we utilize BART-base as the backbone model. For idiomatic expression generation, the model is trained with a batch size of 8 for 5 epochs. For metaphor generation, the model is trained with a batch size of 16 for 10 epochs. The Adam optimizer is used with a learning rate of 5 × 10⁻⁵. All other parameters are set to their defaults. All experiments are performed 5 times and the mean of the results is reported. Beam search is used for decoding.

Evaluation Metrics
Automatic Evaluation. Considering our focus on non-compositional expression generation, we use the widely used text generation evaluation metrics ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002). To automatically evaluate how well idiomatic expressions are generated, we extract the newly generated part of the output sentence and compare it with the target non-compositional expression in the reference via phrase-level BLEU and ROUGE scores, following (Zhou et al., 2021c). We also evaluate with a stricter metric, phrase-level Accuracy, in which a generated non-compositional expression is considered correct if and only if every word strictly matches the target expression. For the metaphoric expression generation task, corpus-level BLEU and ROUGE scores are used, following (Chakrabarty et al., 2021b). Because the target metaphoric expressions contain only one word and are measured by the ROUGE-1 score, we do not use the phrase-level scores described above.
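The strict phrase-level Accuracy metric amounts to token-exact matching between each generated expression and its target. A minimal re-implementation sketch (the function name and toy examples are ours, not from the released evaluation code):

```python
def phrase_accuracy(generated, references):
    """Strict phrase-level accuracy: a generated non-compositional
    expression counts as correct only if every word exactly matches
    the target expression."""
    correct = sum(
        1 for gen, ref in zip(generated, references)
        if gen.strip().split() == ref.strip().split()
    )
    return correct / len(references)

# Toy usage: a partial match ("catch cold") scores zero under the strict metric.
gens = ["catch a cold", "catch cold", "over the top"]
refs = ["catch a cold", "catch a cold", "over the top"]
assert phrase_accuracy(gens, refs) == 2 / 3
```

This strictness is what distinguishes Accuracy from the softer phrase-level BLEU/ROUGE scores, which still award partial credit for overlapping n-grams.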
Human Evaluation. We used 100 instances from the test sets of both tasks and collected the outputs of the 3 best methods ranked by automatic evaluation. For each output sentence, two native English speakers, blind to the systems being compared, were asked to rate the output. For idiom generation, we propose the following criteria: (1) Meaning ("Do the output and the reference mean the same thing?"), (2) Fitness ("Does the generated idiom make sense in the context?"), (3) Fluency ("How fluent, grammatical, well formed, and easy to understand are the generated utterances?"), and (4) Overall ("What is the overall quality of the generated utterances?"). For metaphor generation, the criteria described in (Chakrabarty et al., 2021b) are used. More details are in the Appendix.

As for the other baseline models, it is obvious that most of them are not competitive with the vanilla model. Some of the baseline methods even degrade the performance of the vanilla model.

Table 3 presents the results on the metaphor generation task. As in idiomatic expression generation, our proposed framework achieves the best performance across all evaluation metrics. Compared with the vanilla model, our framework outperforms it by 1.08 BLEU, 0.92 BLEU-2, 1.71 BLEU-4, 0.45 ROUGE-1, 0.91 ROUGE-2, and 0.45 ROUGE-L after only 1 training epoch. After 10 training epochs, the improvements increase to 3.98 BLEU, 2.67 BLEU-2, 1.97 BLEU-4, 2.12 ROUGE-1, 2.43 ROUGE-2, and 2.13 ROUGE-L. Table 5 presents the results of the human evaluation, which show that our method outperforms the other baselines and the vanilla model by a large margin on both the idiom and metaphor generation tasks.
It should be noted that for the metaphor generation task, the baseline curriculum learning methods' influence on performance is similar to that in the idiomatic expression generation task: most of the baseline methods are not competitive with the vanilla model, whereas our proposed curriculum learning framework shows an obvious improvement over it. Based on the performance on both tasks, we see that the baseline curriculum learning methods cannot effectively improve the vanilla model (and may even hurt its performance), whereas our proposed curriculum learning framework outperforms all the baseline models by reasonably large margins.

Analysis
Here we provide ablation studies on idiomatic expression generation to analyze the contributions of the different modules in our framework.

Difficulty measurement. As shown in Table 8, using our difficulty metric boosts the performance of the vanilla model, verifying the effectiveness of our proposed difficulty measurement. Compared with the vanilla model, using our difficulty metric alone improves performance by 2 points on accuracy, 0.71 on BLEU, and 0.28 on ROUGE, even without the scheduling method. In addition, compared with the difficulty measurements used in previous studies (e.g., sentence length, word rarity, and norm), ours shows a larger performance improvement (rows 3-6 in Table 8): it outperforms the best of sentence length, word rarity, and norm by 4 points on accuracy, 4.21 on BLEU, and 2.34 on ROUGE, demonstrating its superiority for non-compositional expression generation.
Dynamic scheduling strategy. The effectiveness of our dynamic scheduling strategy can be verified by comparing rows 2 and 7 in Table 8. With our difficulty metric, performance improves by 1.7 BLEU and 1.09 ROUGE points under dynamic scheduling compared with fixed scheduling, which confirms the effectiveness of our proposed dynamic scheduling method for curriculum learning.
Continual learning scheme. As stated in Section 3.3, curriculum learning causes a forgetting problem that has been ignored by previous studies, visible as sudden spikes in the training loss in Figure 2. To alleviate this problem, we propose the continual learning algorithm RE-GEM. As shown in rows 7 and 8 of Table 8, applying RE-GEM improves performance by 3 points on accuracy, 1.37 on BLEU, 3.64 on phrase-level BLEU, 1 on ROUGE, and 2.35 on phrase-level ROUGE when only the perplexity score and dynamic scheduling are used. Moreover, rows 8-11 show the advantage of our proposed RE-GEM over other continual learning methods, including ER (Robins, 1995), MIR (Aljundi et al., 2019), and AGEM (Chaudhry et al., 2018), when applied to curriculum learning. RE-GEM beats the best of ER, MIR, and AGEM by 3 points on accuracy, 3.24 on phrase-level BLEU, and 2.06 on phrase-level ROUGE, a trend persisting in Figure 2: whenever the training loss under curriculum learning with continual learning suddenly spikes, the peak when using RE-GEM is the lowest. This suggests that RE-GEM is more effective at alleviating forgetting while maintaining performance in curriculum learning, especially compared with traditional continual learning methods.

Conclusion and Future Work
In this paper, we use curriculum learning to make better use of the available data for non-compositional expression generation. We propose a novel curriculum learning framework that uses representation distance and perplexity score as the measure of difficulty level, together with a dynamic scheduling method, to better leverage the available training data. Furthermore, for the first time, we study a continual learning algorithm to alleviate the forgetting problem resulting from curriculum learning. Experiments on two non-compositional expression generation tasks, idiomatic expression generation and metaphor generation, show that the proposed curriculum learning framework effectively boosts performance, outperforming previously studied curriculum learning methods. Future work should explore other difficulty metrics, more effective scheduling methods, and continual learning schemes to further alleviate the forgetting problem, and should study them for other text generation problems.

Limitations
As stated previously, our proposed framework utilizes the perplexity score as a measure of difficulty, which is based on the observation that non-compositional expressions are low-resource compared with compositional expressions; current large pre-trained language models therefore assign low probabilities to non-compositional expressions because of their unfamiliarity with them. However, for compositional expressions, the perplexity score cannot be used to measure difficulty level, which limits our framework to non-compositional expression generation. Another limitation lies in the gradient update in RE-GEM. For the gradient computed on the current data and the gradient computed on the data in the memory, we use the same learning rate. This could be improved by setting different learning rates for gradients computed on different data.

A Baseline Models
Details about the baseline models are as follows. For SGCL (Zhou et al., 2021d), the authors proposed two ways of scheduling: fixed scheduling and dynamic scheduling. We use both scheduling methods as baseline models. More details about this baseline are described in (Zhou et al., 2021d).

B Implementation
Our experiments and implementation are based on the Transformers library and PyTorch.

C Experimental Details
All our experiments are conducted with 2 NVIDIA V100 GPUs.

D Human Evaluation
Here we provide more details of the human evaluation.

D.1 Idiom Generation
Given the reference sentences, annotators are expected to evaluate the quality of the generated sentences from three aspects: 1. Meaning: Do the output and the reference mean the same thing? If yes, the score should be 3. If they are similar but not exactly the same, the score should be 2. If they are not similar, the score should be 1.
2. Fitness: Does the generated idiom make sense in the context? If the generated idiom always makes sense, the score should be 4. If the generated idiom makes sense under some circumstances, the score should be 3. If the generated idiom only makes sense under extreme circumstances, the score should be 2. If the generated idiom is invalid, the score should be 1.
3. Fluency: Check whether the transferred sentence is fluent and readable on a scale of 1 to 5, ranging from "highly non-fluent" to "very fluent". This should take into account tense (present or past), number (singular or plural), and pronouns (himself, herself, someone, his, her, etc.).

D.2 Metaphor Generation
Given the input sentences and the reference sentences, annotators are expected to evaluate the quality of the generated sentences from four aspects, on a scale of 1-5 where 1 denotes the worst and 5 the best: 1. Fluency: How fluent, grammatical, well formed, and easy to understand are the generated utterances?
2. Meaning: Do the input and the output refer to or mean the same thing?
3. Creativity: How creative are the generated utterances?
4. Metaphoricity: How metaphoric are the generated utterances?

D.3 Number of Parameters
Since our proposed curriculum learning and continual learning methods do not introduce additional parameters, the number of parameters is identical to that of the underlying language model: 140M for BART-base.

D.4 Average Runtime
The whole training process for one epoch on two GPUs takes approximately 40 minutes, including 10 minutes for evaluating difficulties and 30 minutes for fine-tuning.

E Case Study
In Tables 9 and 10, we provide more generated examples from idiomatic expression generation and metaphor generation. Examples of different difficulty levels are selected for comparison.
For both tasks, we observe that most of the baseline models can correctly generate the target idiomatic expressions and metaphors when the example is easy for the model. However, on examples of medium difficulty, some baseline models start to generate wrong idiomatic expressions and metaphors, and on hard examples none of the baseline models generate the correct expressions; only our proposed method still does. These generated examples confirm that different examples have different difficulty levels for the models, which justifies the need for curriculum learning. Besides, they demonstrate that our proposed method works effectively by learning in an easy-to-hard order.

Medium
Input Sentence these were some of the qualities needed to [MASK] as a sportsmen even at the more modest levels .Target Sentence these were some of the qualities needed to make the grade as a sportsmen even at the more modest levels .Vanilla these were some of the qualities needed to rise to the occasion as a sportsmen even at the more modest levels .Competence + SL these were some of the qualities needed to stand up and be counted as a sportsmen even at the more modest levels .Competence + WR these were some of the qualities needed to make the killing as a sportsmen even at the more modest levels .Norm-based these were some of the qualities needed to rise to the occasion as a sportsmen even at the more modest levels .SGCL + fixed these were some of the qualities needed to make the killing as a sportsmen even at the more modest levels .SGCL + dynamic these were some of the qualities needed to rise to the occasion as a sportsmen even at the more modest levels .Ours these were some of the qualities needed to make the grade as a sportsmen even at the more modest levels .

Hard Input Sentence
Are the Americans going [MASK] again , or is this an indictment which we should place on trial ?Target Sentence Are the Americans going over the top again , or is this an indictment which we should place on trial ?Vanilla Are the Americans going behind the scenes again , or is this an indictment which we should place on trial ?Competence + SL Are the Americans going behind the scenes again , or is this an indictment which we should place on trial ?Competence + WR Are the Americans going behind the scenes again , or is this an indictment which we should place on trial ?

Norm-based
Are the Americans going behind the scenes again , or is this an indictment which we should place on trial ?SGCL + fixed Are the Americans going through the motions again , or is this an indictment which we should place on trial ?SGCL + dynamic Are the Americans going behind our backs again , or is this an indictment which we should place on trial ?Ours Are the Americans going over the top again , or is this an indictment which we should place on trial ?
Table 9: Samples of the generated sentences on MAGPIE, highlighting the correct and incorrect idioms. Easy, Medium, and Hard denote examples randomly selected from the beginning, the middle, and the end of the data, respectively, after ranking by difficulty level.

Figure 1 :
Figure 1: An overview of our framework and a comparison between our framework and other CL methods.
However, after the model has been trained on some examples, the perceived difficulty of each training example is expected to change. It is therefore unreasonable to keep the same order of training examples for the entire training process. To address this issue, prior work proposed a competence score to dynamically reflect the model's ability. However, the competence scores used in previous work, such as a value that simply increases with the time step (Platanios et al., 2019), are not actually comparable to the difficulty scores of training examples, because the two measure unrelated aspects. To better reflect the dynamic difficulty levels, we propose a dynamic scheduling method that re-arranges the training examples. After each training epoch, the difficulty score d for each training example is updated using the model trained in the most recent epoch:

d_n(Y) = d_r(Y) + d_p(Y)

where d_n(Y) is the difficulty score of training example (X; Y) after the model has been fine-tuned for n epochs, and d_p(Y) is computed with θ_n, the trainable parameters of the model that has been fine-tuned for n epochs. X denotes the input sentence with the target non-compositional expression masked, and Y is the target sentence. After the difficulty scores of all training examples have been updated, the training examples are re-arranged according to the new difficulty scores. The overall procedure is summarized in Algorithm 1:

Algorithm 1: PPLCL
Input: dataset P, model M, and number of epochs N
Output: fine-tuned model M*
1: D_0 = D(P, M)
2: Sort P based on the difficulty levels in D_0, resulting in a re-arranged P_0
3: for n = 1; n ≤ N do
4:     M_{θ_n} ⇐ TRAIN(P_{n−1})
5:     D_n = ∅, P*_n = ∅
6:     for (X; Y) ∈ P do
7:         d_n(Y) = d_r(Y) + d_p(Y)
8:         if d_n(Y) ≠ d_{n−1}(Y) then
9:             D_n ⇐ D_n ∪ {d_n(Y)}
...
15:    Sort P*_n based on D_n, resulting in P_n
16: end
17: return M* = M_{θ_N}
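The re-scoring loop of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the released implementation: `model_loss` stands in for the perplexity-based term d_p computed with the current model parameters, the `rarity` field stands in for the static term d_r, and `train_epoch` abstracts one epoch of fine-tuning; all of these names are illustrative.

```python
def difficulty(example, model_loss):
    # d_n(Y) = d_r(Y) + d_p(Y): a static rarity term plus a
    # model-dependent term that changes as the model is fine-tuned.
    return example["rarity"] + model_loss(example)

def pplcl(dataset, train_epoch, model_loss, num_epochs):
    """Dynamic curriculum: re-score and re-sort the data after every epoch."""
    # Initial ordering (D_0 / P_0): easiest examples first.
    order = sorted(dataset, key=lambda ex: difficulty(ex, model_loss))
    for _ in range(num_epochs):
        train_epoch(order)  # fine-tune on the current easy-to-hard order
        # Re-score every example with the freshly trained model, then re-sort,
        # so the next epoch sees the updated difficulty ranking.
        order = sorted(dataset, key=lambda ex: difficulty(ex, model_loss))
    return order
```

The key design point is that `difficulty` is re-evaluated with the updated model after every epoch, so the ordering can change between epochs instead of being fixed once before training.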

Figure 2 :
Figure 2: The X-axis represents the training steps and the Y-axis represents the training loss. (a) and (b) show the results on the MAGPIE and MERMAID datasets, respectively. 'vanilla' refers to the training loss of the vanilla model. 'ppl', 'agem', and 'regem' refer to the training loss of the model using our CL method without continual learning, with A-GEM, and with our RE-GEM, respectively.
As shown in Figure 2, compared with the training loss of the vanilla model, the training loss of the model using only curriculum learning without continual learning shows sudden peaks at the beginning of each epoch. This indicates that, once curriculum learning is applied, the model tends to forget the knowledge learned from the earlier (easier) examples in each epoch. Additionally, when the model moves from easier to harder examples, the knowledge learned from the easier examples benefits the learning of harder ones, so the training loss does not increase as the model learns harder examples within each epoch.
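The continual-learning component that counteracts these forgetting peaks can be illustrated with an A-GEM-style gradient projection; the sketch below shows only the projection step, with plain Python lists standing in for parameter gradients. How the reference examples are replayed (the part that RE-GEM modifies) is omitted, and the function name is illustrative.

```python
def agem_project(g, g_ref):
    """Project gradient g so it does not increase loss on replayed examples.

    g:     gradient on the current (harder) batch.
    g_ref: reference gradient on examples replayed from earlier (easier)
           parts of the curriculum.
    """
    dot = sum(a * b for a, b in zip(g, g_ref))
    if dot >= 0:  # no conflict with the replayed examples: keep g unchanged
        return list(g)
    # Conflicting directions: remove from g its component along g_ref,
    # so the update no longer moves against the earlier examples.
    scale = dot / sum(b * b for b in g_ref)
    return [a - scale * b for a, b in zip(g, g_ref)]
```

After projection, the update direction has a non-negative inner product with the reference gradient, which is what limits the loss spikes on previously learned (easier) examples when the curriculum moves on to harder ones.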
Moreover, our scheduling strategy only re-schedules training examples after each training epoch rather than after each batch, which limits the flexibility of scheduling: the order of training examples within each epoch remains fixed. More flexible and dynamic scheduling strategies should be explored.

Table 2 :
Performance of different methods on the MAGPIE dataset. Competence denotes using the competence score for scheduling. SL refers to using sentence length as the difficulty score; WR refers to using word rarity as the difficulty score. The best performance is in bold. Models trained for 5 epochs have converged. p-value reports the significance test between our method and the second-best method (SGCL + fixed).

As shown in Table 2, for the idiomatic expression generation task, our proposed framework achieves the best performance with respect to all the evaluation metrics.

Table 3 :
Performance of different methods on the MERMAID dataset. The best performance is in bold.

Table 4 :
Ablation study on the MAGPIE dataset. Diff refers to our difficulty metric. Fixed means the training examples are sorted only once before training and kept fixed during training. Dynamic refers to our dynamic scheduling strategy.

Table 5 :
Human evaluation results.The best performance is in bold.

Table 7 :
Results based on different backbone models on the MERMAID dataset.

Table 8 :
Results based on the full data of the MAGPIE dataset.

Input Sentence: you would n't get anything from anywhere else because they 've cut it [MASK] .
Target Sentence: you would n't get anything from anywhere else because they 've cut it to the bone .
Vanilla: you wouldn't get anything from anywhere else because they don't cut it at the end of the day .
Competence + SL: you wouldn't get anything from anywhere else because they've cut it by the back door .
Competence + WR: you would n't get anything from anywhere else because they 've cut it to the bone .
Norm-based: you would n't get anything from anywhere else because they 've cut it to the bone .
SGCL + fixed: you wouldn't get anything from anywhere else because they don't cut it at the end of the day .
SGCL + dynamic: you wouldn't get anything from anywhere else because they've cut it by the back door .
Ours: you would n't get anything from anywhere else because they 've cut it to the bone .