Edit Distance Based Curriculum Learning for Paraphrase Generation

Curriculum learning has improved the quality of neural machine translation, where only source-side features are considered in the metrics that determine the difficulty of translation. In this study, we apply curriculum learning to paraphrase generation for the first time. Unlike machine translation, paraphrase generation allows a certain level of semantic discrepancy between source and target, which results in diverse transformations ranging from lexical substitution to reordering of clauses. Hence, assessing the difficulty of a transformation requires considering both the source and target contexts. Experiments on formality transfer using GYAFC showed that our curriculum learning with edit distance improves the quality of paraphrase generation. Additionally, the proposed method improves the quality of difficult samples, which previous methods could not achieve.


Introduction
Paraphrase generation is a task that transforms expressions of an input sentence while retaining its meaning. While there are various subtasks in paraphrase generation, formality transfer (Rao and Tetreault, 2018; Niu et al., 2018; Kajiwara, 2019; Wang et al., 2019; Kajiwara et al., 2020; Zhang et al., 2020; Wang et al., 2020; Chawla and Yang, 2020) has been extensively studied. As paraphrase generation can be regarded as a machine translation task (Finch et al., 2004; Specia, 2010) within the same language, the same models (Bahdanau et al., 2015; Vaswani et al., 2017) have been applied to a monolingual parallel corpus.
Recent studies (Platanios et al., 2019) have shown that curriculum learning (Bengio et al., 2009) achieves faster convergence and improved translation quality in neural machine translation. Curriculum learning designs a training process that starts from easy training samples and gradually proceeds to difficult ones. In these previous studies, curriculum learning that uses source-side features, i.e., sentence length and word rarity, as metrics to determine difficulty has improved the quality of translation.
In this study, we adopt curriculum learning for the paraphrase generation task. Paraphrasing allows a certain level of semantic divergence between source and target sentences. For example, some paraphrases require only a small number of transformations, as shown in Table 1, while others require drastic transformations, as Table 2 shows. The former is easy because the target sentence can be generated by copying almost all of the input sentence's words. The latter is difficult because the input sentence requires replacement and reordering of clauses in addition to lexical and phrasal paraphrasing. Because of this property of paraphrase generation, estimating the difficulty of a transformation requires considering both the source and target contexts.
To address this problem, we propose to use an edit distance between a paraphrased sentence pair as a difficulty metric that approximates necessary amounts of transformations. We evaluate our method on a formality transfer task using Grammarly's Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault, 2018). The result of paraphrase generation from informal English to formal English confirmed the effectiveness of curriculum learning based on the edit distance. The detailed analysis revealed that the proposed method contributes to performance improvement in difficult samples regardless of the difficulty metrics, while sentence length and word rarity based methods degraded the performance.

Table 1: Examples of paraphrases with simple transformations.

Source Sentence | Target Sentence
Yeah I think it would be funny. | I think it would be funny.
I have one brother and three sisters. | I have one brother and three sisters.
Do you mean which is least horrible? | Do you mean which is the least horrible?
Their first two albums were pretty good. | Their first two albums were very good.

Early curriculum learning methods for neural machine translation considered only the difficulty of the training samples (Kocmi and Bojar, 2017; Zhang et al., 2018). These methods achieved faster convergence; however, they could not improve machine translation quality after convergence. Following these studies, Platanios et al. (2019) proposed a method that considers both the difficulty of the training samples and the model competence, which achieved both faster convergence and improvement in translation quality. This study is based on the model proposed by Platanios et al. (2019), who introduced model competence into machine translation. Their method defines d_i ∈ [0, 1], the difficulty score of the i-th training sample, and c(t) ∈ [0, 1], the model competence at training step t. The method trains the model using only training samples that are easier than the model competence at each training step. In other words, the number of available training samples increases as the training proceeds. Their method improved translation quality while reducing training time.
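Platanios et al. (2019) obtain difficulty scores in [0, 1] by ranking each raw score against the whole training set (its empirical CDF). A minimal sketch of such a normalization, with an illustrative function name of our choosing:

```python
import bisect

def normalize_difficulty(raw_scores):
    """Map raw difficulty scores into (0, 1] via the empirical CDF:
    d_i is the fraction of training samples no harder than sample i."""
    ordered = sorted(raw_scores)
    n = len(raw_scores)
    return [bisect.bisect_right(ordered, s) / n for s in raw_scores]
```

For example, `normalize_difficulty([3, 10, 7, 7])` yields `[0.25, 1.0, 0.75, 0.75]`: the hardest sample maps to 1.0 and ties share a rank.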
Platanios et al. (2019) defined the difficulty d(s_i) based on sentence length and word rarity. Here, an input sentence s_i consists of a word sequence {w_1, ..., w_{N_i}}. Considering that a long sentence is more difficult to translate than a shorter one, the sentence length is adopted as one of the metrics:

d_length(s_i) = N_i  (1)

Besides, they considered that words appearing infrequently in the training corpus are also difficult to translate because these words have fewer learning opportunities. Therefore, Platanios et al. (2019) adopted word rarity as the other metric:

d_rarity(s_i) = −Σ_{j=1}^{N_i} log p̂(w_j)  (2)
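The word-rarity metric can be sketched as follows, assuming whitespace-tokenized sentences and unigram probabilities estimated from the training corpus (function names are ours):

```python
import math
from collections import Counter

def estimate_unigram_prob(corpus):
    """Relative word frequencies over a tokenized training corpus."""
    counts = Counter(w for sentence in corpus for w in sentence.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def word_rarity(sentence, unigram_prob):
    """Difficulty as the negative log-likelihood of the sentence's words
    under the unigram distribution: rarer words yield a higher score."""
    return -sum(math.log(unigram_prob[w]) for w in sentence.split())
```

Sentences containing rare words thus receive large scores and are deferred to later stages of the curriculum.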
where p̂(w_j) is the empirical unigram probability of the word w_j estimated from the training corpus. The model competence c(t) is defined as:

c(t) = min(1, sqrt(t · (1 − c_0^2) / T + c_0^2))  (3)

where c_0 is the initial competence and T is the number of training steps estimated as necessary for convergence. They assumed that the competence is small at the beginning of training and increases monotonically as the training proceeds, reaching the maximum value 1 when t = T.
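The square-root competence schedule of Equation (3) is a one-liner; a minimal sketch (function name ours):

```python
import math

def competence(t, T, c0=0.01):
    """Square-root competence schedule of Platanios et al. (2019):
    starts at c0 when t = 0, grows monotonically, reaches 1.0 at t = T."""
    return min(1.0, math.sqrt(t * (1 - c0 ** 2) / T + c0 ** 2))
```

The square-root shape grows quickly at first, so harder samples are admitted early enough to matter, then levels off toward 1.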

Proposed Method
We approximate the difficulty of a transformation in paraphrase generation as the edit distance between the paraphrased sentence pair:

d_edit(s_i, t_i) = LevenshteinDistance(s_i, t_i)  (4)

where LevenshteinDistance(·, ·) computes the Levenshtein distance between the source sentence s_i and the target sentence t_i. The edit distance between sentences with simple transformations, such as those in Table 1, is small, and the edit distance between sentences with drastic rewriting, such as those in Table 2, is large. Hence, our curriculum learning starts training with paraphrases requiring a small number of transformations and gradually learns more dynamic transformations.
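The Levenshtein distance has a standard dynamic-programming formulation. A minimal sketch follows; it operates on whitespace tokens for illustration (the granularity, words vs. characters, is an assumption here, and the function name is ours):

```python
def levenshtein_distance(source, target):
    """Minimum number of token insertions, deletions, and substitutions
    required to turn `source` into `target` (word-level, row-by-row DP)."""
    s, t = source.split(), target.split()
    prev = list(range(len(t) + 1))  # distances for the empty source prefix
    for i, sw in enumerate(s, 1):
        curr = [i]
        for j, tw in enumerate(t, 1):
            cost = 0 if sw == tw else 1
            curr.append(min(prev[j] + 1,          # delete sw
                            curr[j - 1] + 1,      # insert tw
                            prev[j - 1] + cost))  # substitute sw -> tw
        prev = curr
    return prev[-1]
```

For the first pair in Table 1, the only edit is deleting "Yeah", so the distance is 1, placing that pair at the very start of the curriculum.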
Algorithm 1 Edit-distance based curriculum learning
1: for each training sample (s_i, t_i) ∈ D do            ▷ Difficulty computation
2:     Compute the difficulty score d_i by edit distance.
3: end for
4: for t = 1, ..., T do                                  ▷ Curriculum learning
5:     Compute the model competence c(t).
6:     Sample a data batch B_t uniformly from all s_i ∈ D such that d_i ≤ c(t).
7:     Train the neural machine translation model θ using B_t as input.
8: end for

We apply the edit-distance based difficulty metric to the competence-based curriculum learning (Platanios et al., 2019) framework. The entire algorithm is shown in Algorithm 1.
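The curriculum loop of Algorithm 1 can be sketched as follows, assuming the model's update function `train_step` and per-sample difficulties already normalized to [0, 1] are supplied by the surrounding training code (all names here are illustrative):

```python
import math
import random

def curriculum_train(dataset, difficulties, T, batch_size, train_step, c0=0.01):
    """Competence-based curriculum loop: at step t, sample a batch
    uniformly from the samples whose difficulty d_i <= competence c(t)."""
    for t in range(1, T + 1):
        c_t = min(1.0, math.sqrt(t * (1 - c0 ** 2) / T + c0 ** 2))
        # Restrict sampling to examples no harder than the current competence.
        eligible = [ex for ex, d in zip(dataset, difficulties) if d <= c_t]
        batch = random.sample(eligible, min(batch_size, len(eligible)))
        train_step(batch)
```

Because c(t) reaches 1 at t = T, the final steps sample from the full training set, so no data is permanently excluded.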

Experiment
We evaluate the performance of edit-distance based curriculum learning on a style transfer task: paraphrase generation from informal English to formal English using GYAFC (Rao and Tetreault, 2018).

Corpus and Evaluation Metric
GYAFC provides parallel sentences from two domains, Entertainment & Music (E&M) and Family & Relationships (F&R). Following Niu et al. (2018), we expand the training set by combining the sentences of both domains and adding the label 2formal or 2informal at the beginning of each input sentence. Statistics of the GYAFC corpus are shown in Table 3.
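This data expansion can be sketched as prepending a direction tag to each source sentence, following the 2formal/2informal labels above (the function name is ours, and the exact tag format is an assumption):

```python
def build_bidirectional_data(pairs):
    """Combine both transfer directions into one training set by
    prepending the target-style label to each source sentence."""
    data = []
    for informal, formal in pairs:
        data.append(("2formal " + informal, formal))
        data.append(("2informal " + formal, informal))
    return data
```

Each informal-formal pair thus contributes one training sample per direction, doubling the training set while letting a single model serve both directions.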
As preprocessing, we used the Moses toolkit (Koehn et al., 2007) for tokenization and punctuation normalization. We also used byte-pair encoding (Sennrich et al., 2016) to limit the number of token types to 16,000. On GYAFC, Rao and Tetreault (2018) reported that a correlation exists between manual annotation and BLEU (Papineni et al., 2002) scores for the informal-to-formal English transfer task. Hence, we used BLEU as the evaluation metric.

Setup
As the paraphrase generation model, we implemented a Transformer (Vaswani et al., 2017) model using Joey NMT (Kreutzer et al., 2019). Our Transformer model has four layers with a hidden size of 512 and four attention heads for both the encoder and decoder. We used word embeddings of 512 dimensions, tying the source, target, and output layer's weight matrices. We also applied dropout to the embeddings and hidden layers with a probability of 0.2. We trained using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.0002. The batch size was 4,096 tokens. We saved the model every 800 updates, applying early stopping with a patience of five.
To evaluate the effectiveness of the edit distance in curriculum learning (denoted as CL-ED), we compared against curriculum learning with sentence length (denoted as CL-SL) and word rarity (denoted as CL-WR). To compute the model competence with Equation (3), we need to set the two hyperparameters c_0 and T. We set c_0 to 0.01 and T to the number of training steps at which the Transformer model with ordinary training reaches 95% of the maximum BLEU score on the development set.

Results
The experimental results are shown in Table 4.

These results indicate that existing curriculum learning based on sentence length and word rarity is not effective for paraphrase generation. In contrast, curriculum learning with the edit distance was effective on both domains.

Discussion
We investigated for which types of sentences curriculum learning improved paraphrase quality. We divided each test set into three equal-sized classes, Easy, Medium, and Difficult (916 sentences each), using the difficulty metrics of sentence length, word rarity, and edit distance, respectively. We then computed a BLEU score for each class and calculated the improvement over Baseline. Figure 1 shows the BLEU score differences of CL-SL, CL-WR, and CL-ED compared to Baseline. Overall, the performance improvement on the Easy class is significant across the methods, which is intuitive because such sentences are easy to learn and are used for training throughout curriculum learning. CL-SL and CL-WR degraded the BLEU scores on the Medium class and even underperformed the baseline Transformer on the Difficult class. In contrast, CL-ED improved over Baseline even on the Difficult class, regardless of the difficulty metric.

Table 5 shows output examples. Baseline output almost the same sentence as the input without the necessary transformations. While CL-SL and CL-WR output a sentence that does not make sense, CL-ED, our method, successfully paraphrases the source sentence.
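The equal-sized split by difficulty can be sketched as follows (the function name and three-way default are ours; it assumes the test-set size is divisible by the number of classes, as with the 3 × 916 sentences above):

```python
def split_by_difficulty(samples, scores, n_classes=3):
    """Partition samples into equal-sized classes (Easy, Medium,
    Difficult) by sorting on a per-sample difficulty score."""
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    size = len(samples) // n_classes
    return [[samples[i] for i in order[k * size:(k + 1) * size]]
            for k in range(n_classes)]
```

Running the same split with each of the three difficulty scores (length, rarity, edit distance) gives the per-metric Easy/Medium/Difficult BLEU breakdown reported in Figure 1.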

Summary and Future Work
In this study, we applied the edit distance to curriculum learning for paraphrase generation. Experimental results on an informal-to-formal style transfer task confirmed the effectiveness of our method, particularly for paraphrasing difficult sentences.
Curriculum learning can be applied to any task for which reasonable metrics of task difficulty are available. Transfer learning using a pre-trained model (Devlin et al., 2019; Lewis et al., 2020) has significantly improved the performance of various natural language processing tasks. In transfer learning, fine-tuning samples similar to those in the pre-training corpus should be easier to learn. We plan to apply our edit-distance based curriculum learning to transfer learning.