Learning to Selectively Learn for Weakly-supervised Paraphrase Generation

Paraphrase generation is a longstanding NLP task with diverse applications in downstream NLP tasks. However, the effectiveness of existing efforts predominantly relies on large amounts of gold-labeled data. Though unsupervised endeavors have been proposed to alleviate this issue, they may fail to generate meaningful paraphrases due to the lack of supervision signals. In this work, we go beyond the existing paradigms and propose a novel approach to generate high-quality paraphrases with weak supervision data. Specifically, we tackle the weakly-supervised paraphrase generation problem by: (1) obtaining abundant weakly-labeled parallel sentences via retrieval-based pseudo paraphrase expansion; and (2) developing a meta-learning framework to progressively select valuable samples for fine-tuning a pre-trained language model, BART, on the sentential paraphrasing task. We demonstrate that our approach achieves significant improvements over existing unsupervised approaches, and is even comparable in performance with state-of-the-art supervised methods.


Introduction
Paraphrase generation is a fundamental NLP task that restates a text input in a different surface form while preserving its semantic meaning. It serves as a cornerstone in a wide spectrum of NLP applications, such as question answering (Dong et al., 2017), machine translation (Resnik et al., 2010), and semantic parsing (Berant and Liang, 2014). With the recent advances of neural sequence-to-sequence (Seq2Seq) architectures in the field of language generation, a growing body of literature has also applied Seq2Seq models to the sentential paraphrasing task.
Despite their promising results, collecting large amounts of parallel paraphrases is often time-consuming and requires intensive domain knowledge.* Therefore, the performance of supervised methods could be largely limited in real-world scenarios. Due to this problem, unsupervised paraphrase generation has recently received increasing attention, but its development is still in its infancy. Generally, sampling-based or editing-based approaches (Bowman et al., 2016; Miao et al., 2019) fail to incorporate valuable supervised knowledge, resulting in less coherent and controllable generated paraphrases (Liu et al., 2019). In this work, we propose going beyond the existing learning paradigms and investigate a novel research problem, weakly-supervised paraphrase generation, in order to push forward the performance boundary of sentential paraphrasing models with low-cost supervision signals.

* Work was done as an intern at Amazon Alexa AI.
As an understudied problem, weakly-supervised paraphrase generation is challenging mainly for the following reasons: (i) although weak supervision has been applied to different low-resource NLP tasks (Dehghani et al., 2017; Aprosio et al., 2019), for paraphrase generation it is unclear how to automatically acquire abundant weak supervision data that contains coherent, fluent, and diverse paraphrases; (ii) weakly-labeled paraphrases tend to be noisy and are not equally informative for building the generation model (Ren et al., 2018; Li et al., 2019a; Yoon et al., 2020); hence, selecting valuable parallel sentences from weakly-labeled data is vital for solving the studied problem; and (iii) the state-of-the-art paraphrasing methods are predominantly built upon traditional Seq2Seq models, and the necessity of learning from scratch largely magnifies the learning difficulty when dealing with scarce or noisy training data (Guu et al., 2018). Thus it is imperative to seek a more robust and knowledge-intensive backbone for learning with weakly-labeled paraphrases.
To address the aforementioned challenges, we present a novel approach for learning an effective paraphrasing model from weakly-supervised parallel data. By virtue of a simple yet effective pseudo paraphrase expansion module, for each input sentence we are able to obtain multiple similar sentences without unbearable labeling cost and treat them as paraphrases. To mitigate the inaccurate supervision signals within the weakly-labeled parallel data and build an effective paraphrasing model, we further select valuable parallel instances with a novel framework named Learning-To-Selectively-Learn (LTSL). Remarkably, LTSL leverages meta-learning to progressively exert the power of pre-trained language models, i.e., BERT (Devlin et al., 2018) and BART (Lewis et al., 2019), with weakly-labeled paraphrasing data. From a meta-learning perspective, the BERT-based gold data selector is meta-learned to select valuable samples from each batch of weakly paired sentences, in order to fine-tune and maximize the performance of the BART-grounded paraphrase generator. Afterwards, the paraphrase generation performance change on a small validation set is used to perform meta-optimization on the data selection meta-policy. This way, the two pre-trained components in LTSL, the gold data selector and the paraphrase generator, are able to reinforce each other by continuously learning on a pool of meta-selection tasks. To summarize, the major contribution of this work is three-fold:
• We investigate an understudied research problem, weakly-supervised paraphrase generation, which sheds light on the research of sentential paraphrasing under a low-resource setting.
• We develop a framework, LTSL, which is a new attempt at leveraging meta-learning to enhance pre-trained language models on paraphrase generation with low-cost weak supervision data.
• We conduct extensive experiments to illustrate the superiority of our approach over both supervised and unsupervised state-of-the-art methods on the task of paraphrase generation.

Related Work
Supervised Paraphrase Generation. With the fast development of deep learning techniques, neural Seq2Seq models have achieved superior performance over traditional paraphrase generation methods that rely on exploiting linguistic knowledge (McKeown, 1980, 1983) or utilizing statistical machine translation systems (Dolan et al., 2004; Bannard and Callison-Burch, 2005). Supervised paraphrasing methods are widely studied when ground-truth parallel sentences are available at training time. Among supervised efforts, Residual LSTM (Prakash et al., 2016) is one of the earliest works based on neural networks. Later work makes use of deep reinforcement learning, and Iyyer et al. (2018) and Chen et al. (2019a) leverage syntactic structures to produce better paraphrases. More recently, retrieval-augmented generation methods have also been investigated for paraphrase generation (Hashimoto et al., 2018; Kazemnejad et al., 2020; Lewis et al., 2020) and achieved promising performance.
Unsupervised Paraphrase Generation. Due to the burdensome labeling cost of supervised counterparts, unsupervised paraphrasing methods have drawn increasing research attention in the community. Methods based on variational autoencoders (VAE) were first proposed to generate paraphrases by sampling sentences from the learned latent space (Bowman et al., 2016; Bao et al., 2019; Fu et al., 2019), but the generated sentences are commonly less controllable. To tackle this issue, CGMH (Miao et al., 2019) uses Metropolis-Hastings sampling to add constraints on the decoder at inference time. Furthermore, researchers try to improve the generation performance in terms of semantic similarity, expression diversity, and language fluency by using simulated annealing (Liu et al., 2019), syntactic control (Huang and Chang, 2021), or dynamic blocking (Niu et al., 2020). In addition, pre-trained translation models have been explored to generate paraphrases via back-translation (Wieting et al., 2017; Guo et al., 2021). Still, those methods can hardly achieve results comparable with supervised approaches.
Learning with Weak Supervision. The profound success of machine learning systems largely benefits from abundant labeled data; however, their performance has been shown to degrade noticeably in the presence of inaccurate supervision signals (Hendrycks et al., 2018), especially in an adversarial environment (Reed et al., 2014). As one of the central problems in weak supervision, learning with noisy labels has received much research attention. Existing directions mainly focus on: estimating the noise transition matrix (Goldberger and Ben-Reuven, 2017; Patrini et al., 2017), designing robust loss functions or using regularizations (Ghosh et al., 2017; Li et al., 2017), correcting noisy labels (Tanaka et al., 2018; Zheng et al., 2021), and selecting or reweighting training examples (Ren et al., 2018; Chen et al., 2019b; Yoon et al., 2020). In general, the state-of-the-art methods usually exploit a small clean labeled dataset that is allowed under the low-resource setting (Mirzasoleiman et al., 2020). For instance, Gold Loss Correction (Hendrycks et al., 2018) uses a clean validation set to recover the label corruption matrix and re-train the predictor model with corrected labels. Learning to Reweight (Ren et al., 2018) proposes a single gradient descent step guided by validation set performance to reweight the training batch. Learning with noisy/weak supervision has drawn increasing attention in the NLP community (Qin et al., 2018; Feng et al., 2018; Ren et al., 2020), but it is seldom investigated in the field of paraphrase generation. In this work, we propose a new meta-learning framework that is capable of selecting valuable instances from abundant retrieved weakly-labeled sentence pairs.

Figure 1: Overview of our approach for weakly-supervised paraphrase generation. For the LTSL framework, the blue dashed rectangle represents a meta-selection task, while the green dashed line is the meta-optimization step.

Proposed Approach
Figure 1 illustrates our method for solving weakly-supervised paraphrase generation. In essence, there are two sub-tasks: (1) how to obtain abundant weakly-labeled parallel data from the unlabeled corpus; and (2) how to build a powerful paraphrase generation model from noisy weak supervision data. Formally, given a set of source sentences X = {x_i}_{i=1}^N without ground-truth paraphrases, we first obtain a weakly-labeled parallel dataset for enabling weak supervision. In this work, we aim to denoise the weak supervision data by selecting a subset of valuable instances; only a small validation set with ground-truth parallel sentences is allowed to be accessed, which is a common assumption in weakly-supervised learning (Ren et al., 2018). In the following subsections, we will introduce how to solve the main challenges with the proposed pseudo paraphrase expansion module and the meta-learning framework LTSL.

Pseudo Paraphrase Expansion
To enable weakly-supervised paraphrase generation, we first propose a plug-and-play pseudo paraphrase expansion module. Essentially, the function of this module is to obtain multiple weakly-labeled pseudo paraphrases that are similar or related to each input sentence x.

Expansion via Retrieval. Inspired by the success of retrieval-enhanced methods (Kazemnejad et al., 2020; Lewis et al., 2020) in text generation tasks, we propose to build a retrieval-based expansion module to obtain abundant pseudo parallel paraphrases D_pseudo. Given a source sentence x_i, this module automatically retrieves a neighborhood set N(x_i) consisting of the K most similar sentences {y_i^k}_{k=1}^K from a large unlabeled sentence corpus. Specifically, we adopt the simple yet effective retriever BM25 (Robertson and Zaragoza, 2009) in this work. In addition, we use Elasticsearch (Gormley and Tong, 2015) to create a fast search index for efficiently finding the sentences similar to an input sequence. Here we use the in-domain sentence corpus since it is commonly available in practice and provides better results, but our approach can flexibly be extended to open-domain corpora such as Wikipedia.

Further Discussion. It is worth mentioning that the main reasons for using BM25 rather than a trainable retriever are: (1) this module is not restricted to retrieval-based expansion; it is designed as a plug-and-play module that provides more flexibility for weak supervision; and (2) the model training can be more stable since the number of trainable parameters is largely reduced. In addition to the aforementioned retrieval-based method, our approach is also compatible with other expansion alternatives. For instance, we can adopt a domain-adapted paraphraser to generate weakly-labeled paraphrases. Due to its simplicity and learning efficiency, here we focus on retrieval-based expansion to enable weakly-supervised paraphrase generation, and we leave the exploration of other expansion methods for future study.
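To make the retrieval step concrete, below is a minimal, self-contained sketch of Okapi BM25 scoring and neighborhood retrieval. It is purely illustrative: the paper uses BM25 through an Elasticsearch index, while the function names, the whitespace tokenizer, and the default k1/b values here are our own simplifications.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` against `query` with Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    q_terms = query.lower().split()
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each term across the corpus.
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in q_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def retrieve_pseudo_paraphrases(x, corpus, K=5):
    """Return the K corpus sentences most similar to x (excluding x itself)."""
    scores = bm25_scores(x, corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])
    return [corpus[i] for i in ranked if corpus[i] != x][:K]
```

In practice the neighborhood N(x_i) would be fetched from an Elasticsearch index over millions of in-domain sentences; the exclusion of the query itself mirrors the fact that a sentence should not be its own pseudo paraphrase.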

Learning to Selectively Learn (LTSL)
The pseudo paraphrase expansion eliminates the dependency on large amounts of ground-truth labels; nonetheless, one critical challenge is that the obtained weak supervision data is inevitably noisy: although a weakly-labeled paraphrase is related to and conveys overlapping information with the input sentence, the pair is not parallel in the strict sense. As a result, directly using all the expanded pseudo paraphrase pairs for learning paraphrasing models is unlikely to be effective.

Architecture Overview. To address the aforementioned challenge, we propose a meta-learning framework named Learning-To-Selectively-Learn (LTSL), which is trained to learn a data selection meta-policy under weak supervision for building an effective paraphrasing model. LTSL consists of two components: (i) the meta-learner gold data selector f_θ(·) parameterized by θ, which determines the selection likelihoods of the training samples used to train the base model; and (ii) the base model paraphrase generator g_φ(·) with parameters φ, a pre-trained autoregressive model that generates a paraphrase given the input sentence. At its core, the meta-learned gold data selector learns to select highly valuable samples from each batch by measuring their ability to optimize the downstream paraphrase generator. Meanwhile, the parameters of the paraphrase generator can be updated with the meta-selected samples progressively.

Gold Data Selector (Meta-Learner). In order to represent each pair of parallel sentences, we adopt the widely recognized pre-trained model BERT (Devlin et al., 2018) to build the gold data selector. Specifically, for the i-th weakly-supervised paraphrase pair (x_i, y_i), its latent representation can be computed by:

z_i = BERT([CLS] x_i [SEP] y_i [SEP]),

where [CLS] and [SEP] are special start and separator tokens. We use the last layer's [CLS] token embedding as z_i.
Our gold data selector decides the value of the pair (x_i, y_i) for fine-tuning the pre-trained paraphrase generator by:

v_i = softmax(W_s z_i + b_s),

where W_s and b_s are learnable parameters. Here v_i is the probability distribution of whether to include the weakly-labeled sentence pair (x_i, y_i).
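As an illustration of the selection head described above, the following numpy sketch computes v_i = softmax(W_s z_i + b_s). The BERT encoder is stubbed out with a random vector standing in for the last-layer [CLS] embedding; the hidden size, initialization, and variable names are hypothetical choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size of the (stubbed) encoder; 768 for BERT-base in practice

# Learnable parameters of the selection head: two classes (select / discard).
W_s = rng.normal(scale=0.1, size=(2, d))
b_s = np.zeros(2)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def selection_probs(z_i):
    """v_i = softmax(W_s z_i + b_s): distribution over selecting the pair (x_i, y_i).
    z_i stands in for the last-layer [CLS] embedding of '[CLS] x_i [SEP] y_i [SEP]'."""
    return softmax(W_s @ z_i + b_s)

z = rng.normal(size=d)  # placeholder for a BERT [CLS] embedding
v = selection_probs(z)
```

The two-way softmax output is what the selection policy π_θ samples its keep/discard action from.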
Paraphrase Generator. The paraphrase generator could be built with any encoder-decoder backbone. In essence, the objective of the paraphrase generator is to maximize the conditional probability p_φ(y|x) over the selected training samples in D_train. As pre-trained language models are already equipped with extensive knowledge and have shown strong capability on a diverse set of generation tasks, we propose to use the pre-trained language model BART (Lewis et al., 2019) to build our paraphrase generator, which largely reduces the difficulty of learning from scratch. Specifically, the fine-tuning objective of BART is:

L(φ) = − Σ_{(x,y) ∈ D_train} log p_φ(y|x).

Meta Reinforcement Learning. Our proposed meta-learning framework LTSL aims to learn a discriminative data selection meta-policy for maximizing the performance of the paraphrase generator. We consider each batch of data samples in D_pseudo as a meta-selection task that contains a series of selective actions. As the data selection process is inherently non-differentiable, we adopt reinforcement learning (RL) to enable the meta-optimization. We describe the RL environment of each meta-selection task as follows:

STATE. The state s_t is a summarization of the learning environment at time step t, which encodes the following information: (i) the representation of the t-th weakly-paired sentences; and (ii) the average of the representations of all sentences selected up to time step t. The gold data selector takes the concatenated vector as input and outputs the probability distribution that indicates whether to select this instance or not.
ACTION. At each time step t, the action a_t ∈ {0, 1} decides whether to select the current weakly-labeled instance (x_t, y_t). Ideally, the gold data selector takes actions that select instances useful for the paraphrasing task and filter out noisy ones. Specifically, a_t = 1 indicates that the current weakly-labeled instance will be selected, and a_t = 0 that it will not. The action on each sentence pair is sampled according to the output of the selection policy π_θ(·|s_t).
REWARD. The function of the reward is to guide the meta-learner to select valuable training instances that improve the performance of the pre-trained generator. After each batch of selections, the accumulated reward of this meta-selection task is determined by the performance change of the paraphrase generator evaluated on the validation set D_dev. Note that we use perplexity instead of word-overlap metrics such as BLEU for evaluation, since it has been shown to be more efficient and stable for generation tasks (Zhao et al., 2020). For each meta-selection task, the policy network receives a delayed reward when it finishes all the selections, which is a commonly used design in the RL literature.
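A minimal sketch of this reward signal, under the assumption that the generator's validation performance is summarized by per-token negative log-likelihoods (perplexity is the exponential of their mean). The function names are illustrative, not from the paper; the reward is positive when fine-tuning on the selected samples lowers validation perplexity.

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def selection_reward(nlls_before, nlls_after):
    """Delayed reward for a meta-selection task: the drop in validation
    perplexity after fine-tuning the generator on the selected samples."""
    return perplexity(nlls_before) - perplexity(nlls_after)
```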

Meta-optimization
To optimize the data selection meta-policy, we aim to maximize the sum of expected rewards over each meta-selection task, which can be formulated as:

J(θ) = E_{π_θ} [ Σ_t r_t ],

where r_t is the reward at time step t and θ is the parameter of the meta-learner gold data selector. We update θ via policy gradient:

θ ← θ + α ∇_θ J(θ),

where α denotes the learning rate. With the obtained rewards, the gradient can be computed by:

∇_θ J(θ) ≈ (1/B) Σ_{t=1}^{B} r_t ∇_θ log π_θ(a_t | s_t),

where B is the number of instances in a batch. The details of the training process are shown in Algorithm 1. Specifically, we adapt the REINFORCE algorithm (Williams, 1992) to optimize the policy gradients and implement a reward baseline to lower the variance during training. In essence, the reward obtained on a small validation set is used to conduct meta-optimization on the data selection meta-policy. By learning on a pool of meta-selection tasks, the meta-learned gold data selector can select valuable instances for enhancing the paraphrase generator.

Algorithm 1 (sketch): for each batch of weakly-labeled pairs, sample selection actions with the gold data selector; fine-tune the pre-trained generator g_φ(y|x) with the selected samples to get g_φ'(y|x); calculate the reward r_t on the validation set by comparing g_φ(y|x) and g_φ'(y|x); and update the selector with the policy gradient. Finally, update the generator g_φ(y|x) with D_train and return the fine-tuned paraphrase generator g_φ'(y|x).

During the meta-learning process, the gold data selector and paraphrase generator can reinforce each other, which progressively improves the data selection policy and enhances the paraphrase generator.
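The REINFORCE-with-baseline update used above can be sketched as follows. This toy version replaces the BERT-based selector with a one-parameter logistic policy over a scalar "quality" feature and uses a synthetic delayed reward, so everything except the update rule itself is an assumption made for illustration.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reinforce_step(theta, states, reward_fn, baseline, alpha=0.1):
    """One meta-selection task: sample actions, receive a delayed reward,
    and update theta with REINFORCE using a moving-average baseline.
    Toy policy: pi_theta(a=1 | s) = sigmoid(theta * s)."""
    actions, grads = [], []
    for s in states:
        p = sigmoid(theta * s)
        a = 1 if random.random() < p else 0
        actions.append(a)
        # d/dtheta log pi(a|s) for a Bernoulli policy is (a - p) * s.
        grads.append((a - p) * s)
    r = reward_fn(actions)               # delayed reward after all selections
    advantage = r - baseline             # baseline lowers gradient variance
    theta += alpha * advantage * sum(grads) / len(states)
    baseline = 0.9 * baseline + 0.1 * r  # moving-average reward baseline
    return theta, baseline

# Toy environment: items with positive "quality" s should be selected.
states = [1.0, -1.0, 0.5, -0.5, 2.0, -2.0] * 5

def reward_fn_for(batch):
    def reward(actions):
        # Synthetic delayed reward: total quality of the selected items.
        return sum(a * s for a, s in zip(actions, batch))
    return reward

theta, baseline = 0.0, 0.0
for _ in range(200):
    random.shuffle(states)
    theta, baseline = reinforce_step(theta, states, reward_fn_for(states), baseline)
```

After training, theta is pushed positive, i.e., the policy learns to keep high-quality items and discard low-quality ones, mirroring how the gold data selector is rewarded for selections that improve validation perplexity.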

Experiments
In this section, we briefly introduce the experimental settings and conduct extensive experiments to corroborate the effectiveness of our approach. Further details of our experimental settings and implementations can be found in the Appendix.

Experimental Settings
Evaluation Datasets & Metrics. In our experiments, we evaluate our proposed approach on multiple widely used paraphrasing benchmark datasets.
Since the problem of weakly-supervised paraphrase generation remains largely under-studied in the community, we compare our approach with both supervised and unsupervised paraphrase generation methods. It is worth mentioning that, due to historical reasons, existing supervised and unsupervised methods use different data splits and evaluation metrics. To make a fair and comprehensive evaluation, we follow the setting of each line of work and conduct the comparison respectively. Specifically, we use the following datasets to compare with supervised methods:
• Quora-S: the Quora question pair dataset, which contains 260K non-parallel sentence pairs and 140K parallel paraphrases. We denote the version used by supervised methods as Quora-S. Following the setting of Kazemnejad et al. (2020), we randomly sample 100K, 30K, and 3K parallel sentences for training, test, and validation, respectively.
• Twitter: the Twitter URL paraphrasing corpus built by Lan et al. (2017). Following the setting of Kazemnejad et al. (2020), we sample 110K instances from about 670K automatically labeled data as our training set, and two non-overlapping subsets of 5K and 1K instances from the human-annotated data for the test and validation sets, respectively.
To compare our approach with unsupervised efforts, we adopt another two benchmark datasets:
• Quora-U: the version of the Quora dataset used by unsupervised paraphrasing methods. We follow the setting in Miao et al. (2019) and Liu et al. (2019) for a fair comparison and use 3K and 20K pairs for validation and test, respectively.
• MSCOCO: an image captioning dataset containing 500K+ paraphrase pairs for over 120K image captions. We follow the standard split (Lin et al., 2014) and evaluation protocols (Liu et al., 2019) in our experiments.
The detailed dataset statistics are summarized in Table 1. Notably, although all the datasets have ground-truth paraphrases, our approach does not use those in the training set, the same as unsupervised methods (Siddique et al., 2020). We only allow the model to access the parallel sentences in the validation set during the learning process. When comparing with supervised baselines, we follow previous works and adopt BLEU-n (Papineni et al., 2002) (up to n-grams) and ROUGE (Lin, 2004) scores as evaluation metrics; similarly, we use iBLEU (Sun and Zhou, 2012), BLEU (Post, 2018), and ROUGE scores for comparing with unsupervised methods.
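Since iBLEU may be less familiar than BLEU, here is an illustrative implementation: iBLEU rewards overlap with the reference while penalizing overlap with the source, iBLEU = α·BLEU(c, r) − (1−α)·BLEU(c, s) (Sun and Zhou, 2012). The smoothed sentence-level BLEU below is a simplified stand-in for the official scorers used in the paper, and the α default is only one common choice.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing (illustrative, not sacrebleu)."""
    c, r = candidate.split(), reference.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(r, n)
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        total = max(sum(cand.values()), 1)
        log_p += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(c) >= len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(log_p)

def ibleu(candidate, source, reference, alpha=0.8):
    """iBLEU: high when close to the reference, penalized for copying the source."""
    return alpha * bleu(candidate, reference) - (1 - alpha) * bleu(candidate, source)
```

The source-penalty term is what makes iBLEU suitable for paraphrasing: a copy-input baseline scores perfectly on BLEU against itself but is pushed down by the (1−α) term.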
Compared Methods. To show the superiority of our approach, we include both widely used and state-of-the-art paraphrase generation methods as our baselines. In general, those methods can be divided into two categories: (1) supervised methods that are trained with all the parallel sentences in the training corpus, including Residual LSTM (Prakash et al., 2016), Transformer (Vaswani et al., 2017), RbM, and retrieval-based methods such as RaE; and (2) unsupervised methods. In addition, we also include another weakly-supervised method, WS-BART, where we use the same BART model (Lewis et al., 2019) as in LTSL and directly fine-tune it with the clean validation set. Since this model is only fine-tuned with limited labeled data, we consider it a weakly-supervised baseline.

Automatic Evaluation Results
Paraphrase Generation. Table 2 summarizes the paraphrasing results of different methods. Overall, the results show that our approach LTSL achieves state-of-the-art performance on most metrics. Specifically, we can make the following observations from the table:
• LTSL outperforms most of the supervised baselines and achieves performance comparable to the state-of-the-art method (i.e., FSTE). In contrast with those supervised baselines that require large amounts of labeled parallel paraphrases, our approach LTSL delivers promising results with very low-cost supervision signals. This enables us to build an effective paraphrasing model under a real-world low-resource setting.
• Compared to unsupervised approaches, LTSL achieves large overall performance improvements, especially on iBLEU and BLEU scores. The main reason is that the sampling or editing mechanisms in those methods lack supervised knowledge from parallel sentences. As shown in (Niu et al., 2020), those methods even fall behind a simple copy-input baseline which directly copies the source sentence as the output. Our approach alleviates this weakness by leveraging the knowledge from weak supervision data and pre-trained language models.
• By virtue of the strength of pre-trained language models, the weakly-supervised baseline WS-BART performs relatively well compared to existing methods. However, it still falls behind our approach LTSL by a large margin, which demonstrates the effectiveness of our proposed learning framework.
Gold Data Selection. To further evaluate the effectiveness of the meta-learned data selection policy, we use the well-trained gold data selector to compute data values on weakly-supervised candidates unseen during training. Specifically, for each sentence in the test set, we retrieve the 50 most similar sentences and consider them weakly-labeled paraphrases; the gold data selector is then used to predict the quality of these paraphrase candidates. Note that we also include the adopted retriever, BM25, and a BERT-based Dense Passage Retriever (DPR) fine-tuned with the labeled parallel sentences from the validation set for comparison. We use the computed probabilities to rank all the candidates and report NDCG@K and Recall@K in Figure 2 (a) and (b), respectively. The figures show that the meta-learned gold data selector from LTSL selects ground-truth parallel data better than BM25 and DPR. This observation indicates that the meta-learned gold data selector can effectively generalize to unseen data and select valuable weakly-labeled instances.

Ablation & Parameter Study. We also conduct an ablation study on the two key components of our approach (w.r.t., pseudo paraphrase expansion and gold data selection); the performance drop after removing either component again verifies the importance of learning an effective data selection meta-policy. Meanwhile, when using a vanilla Transformer to build the paraphrase generator (w.r.t., w/o BART), the performance falls behind LTSL by a considerable gap, which shows the necessity of leveraging pre-trained language models. We also examine the effect of the parameter K on the final performance and show the results on the Quora-S dataset in Figure 3 (similar results can be observed for the other datasets). As we can see from the results, the performance peaks when K is set to 5 and then gradually decreases as K increases further. This shows that it is necessary to incorporate abundant weakly-labeled data; however, when more candidates are added, the noise introduced by the weakly-labeled data can impair the final performance.
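For completeness, binary-relevance versions of the two ranking metrics reported in Figure 2 can be computed as follows; this is a generic sketch of NDCG@K and Recall@K, not the paper's evaluation code.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in rel)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal
```

Here `ranked` would be the candidate paraphrases ordered by the selector's (or retriever's) scores, and `relevant` the ground-truth parallel sentences.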

Human Evaluation Results
To further illustrate the superior quality of the paraphrases generated by LTSL, we conduct subjective human evaluations. We randomly select 100 sentences from the Quora-U dataset and ask three human annotators to evaluate the top three performing methods under the unsupervised paraphrase generation setting. The annotators rate the paraphrases on a scale of 1-5 (1 being the worst and 5 the best) in terms of semantic coherence, language fluency, and expression diversity. Table 5 presents the average scores along with the inter-annotator agreement (measured by Cohen's kappa κ) for the three criteria. As shown in the table, our approach outperforms all the competing approaches from all three perspectives. Moreover, the inter-annotator agreement shows moderate or good agreement between raters when assessing the outputs of our model.

Case Study
Last, we showcase the paraphrases generated by different methods on the Quora-U dataset. As illustrated in Table 4, LTSL clearly produces more reasonable paraphrases than the other methods in terms of both closeness in meaning and difference in expression. For example, "How do I attract contributors for my project on Github?" is paraphrased as "How do I get more contributors on my GitHub project?". It is worth mentioning that existing unsupervised methods such as BackTrans cannot generate high-quality paraphrases in terms of diversity, mainly because of the shortage of supervision signals. On the contrary, our LTSL approach is able to generate highly fluent and diverse paraphrases by leveraging valuable weak supervision data and the knowledge of large-scale pre-trained language models.

Conclusion
In this work, we investigate the problem of paraphrase generation under the low-resource setting and propose a weakly-supervised approach. Through automatic and human evaluations, we demonstrate that our approach achieves state-of-the-art results on benchmark datasets. An interesting direction is to improve the generation performance by leveraging weakly-labeled data from different sources; we leave this as future work.