ParaMac: A General Unsupervised Paraphrase Generation Framework Leveraging Semantic Constraints and Diversifying Mechanisms



Introduction
Paraphrases are sentences that convey the same meaning with different forms of expression. Automatic paraphrase generation has been an essential task in natural language processing (NLP) since the early days of computational linguistics (McKeown, 1979), and has broad applications in downstream tasks including question answering (Dong et al., 2017), semantic parsing (Berant and Liang, 2014; Wu et al., 2021), and machine translation (Seraj et al., 2015). Additionally, paraphrase generation is a significant data augmentation method (Gao et al., 2020; Yu et al., 2020), which can benefit learning in low-resource settings.
Early works such as rule-based (McKeown, 1983; Barzilay and Lee, 2003) and thesaurus-based (Bolshakov and Gelbukh, 2004) methods generate paraphrases mainly by explicit manipulation of words, phrases, or sentences. But these methods usually perform poorly and are restricted by either heavy manual work or large language resources. Later, the sequence-to-sequence (Seq2Seq) paradigm was brought into paraphrase generation (Prakash et al., 2016). By training on parallel annotations, and combined with GAN (Yang et al., 2019) or VAE (Gupta et al., 2018), it greatly improved performance. However, these supervised methods depend heavily on large annotated parallel corpora, which are hard to acquire.
To overcome the difficulty of obtaining high-quality parallel corpora, researchers have recently turned to unsupervised approaches using Pre-trained Language Models (PLMs) (Niu et al., 2021; Liu et al., 2020; Hegde and Patil, 2020; Meng et al., 2021), owing to their great power in language modeling and understanding (Lin et al., 2019; Zhang et al., 2020). These existing works apply PLMs to paraphrase generation successfully and obtain good performance. However, they are still weak in either semantic equivalence or expression diversity, both of which are necessary for a qualified rewriting. For example, methods that generate paraphrases by reconstructing or editing the original sentence (Liu et al., 2020; Hegde and Patil, 2020) usually only change common words locally, ignoring other global expression factors (e.g., word order) and thus limiting diversity. On the other hand, methods that generate paraphrases from scratch (Meng et al., 2021; Niu et al., 2021) usually lack strong semantic constraints and thus suffer from an inevitable semantic divergence.
To tackle these problems, we propose a novel paraphrase generation framework called Paraphrase Machine (ParaMac), which leverages PLMs to generate paraphrases given an input and its context. For this framework, we propose multi-aspect equivalence constraints and multi-granularity diversifying mechanisms to produce varied expressions of the input while preserving the original meaning as tightly as possible. Specifically, we design the equivalence constraints from three aspects: the context, the keywords, and the overall semantics. On the other hand, we apply the diversifying mechanisms at three granularities: the word level, the phrase level, and the sentence level. All these constraints and mechanisms are combined to guarantee that the generated sentence preserves the semantics of the input with expressions as diverse as possible.
As shown in Figure 1, incorporating these constraints and mechanisms, ParaMac can utilize PLMs' linguistic ability to generate unsupervised paraphrase pairs effectively. We generate a high-quality paraphrase dataset called Paraphrase Net (ParaNet) in an unsupervised way, which enables us to train a Paraphrase Model (ParaMod) based on a Seq2Seq PLM (e.g., T5 (Raffel et al., 2020)).
By applying ParaMod directly to paraphrasing benchmarks (i.e., Quora and MSCOCO (Lin et al., 2014)) without any fine-tuning, we achieve a significant improvement over the previous SOTA (9.1% and 3.3% absolute in BLEU, respectively). After further fine-tuning on domain-specific training data, ParaMod lifts the absolute improvements to 18.0% and 4.6%. In addition, we highlight the framework's generality across downstream tasks, which we evaluate by applying ParaMod to language understanding and generation tasks for data generation or augmentation. We demonstrate that ParaMod is an excellent question generator that can cut down the manual work of question generation, and also a universal data augmentor that boosts performance on part of GLUE in the few-shot setting by an average of 2.0%.

Related Work
Supervised Approaches Typical supervised paraphrase generation methods mainly leverage annotated paraphrase pairs and neural Seq2Seq models (Prakash et al., 2016) such as LSTM (Hochreiter and Schmidhuber, 1997) or the Transformer (Vaswani et al., 2017). Subsequent methods based on the Seq2Seq encoder-decoder architecture attempted to improve performance by adding more constraints on generation. Bahdanau et al. (2015) added attention, and Cao et al. (2017) and Gu et al. (2016) added copy mechanisms to keep the model focused on the important parts of the input. VAE (Gupta et al., 2018) and GAN (Yang et al., 2019) enforced constraints from the model and training aspects, respectively, to avoid unrealistic output. Some works leveraging other supervised signals can be categorized as zero-shot generation. For example, Mallinson et al. (2017) and Wieting et al. (2017) generated paraphrases in a bilingual pivoting manner, later known as back-translation. Guo et al. (2019) proposed to train a unified model on multilingual parallel data to achieve one-step generation. Cai et al. (2021) further extended the pivoting idea from languages to other semantic forms and explored its feasibility. Some works (Iyyer et al., 2018; Sun et al., 2021) have also tried to incorporate syntactic structures to improve diversity.
Supervised methods can usually achieve good performance, but the primary obstacle is obtaining large-scale, high-quality parallel data.
Unsupervised Approaches Unsupervised methods are harder to categorize since they are much less explored. Liu et al. (2020) transformed paraphrase generation into an optimization problem and utilized certain objectives to reflect semantic equivalence and diversity. Siddique et al. (2020) shared a similar idea but optimized via deep reinforcement learning. Bowman et al. (2016) trained a VAE to reconstruct the input and sampled from the trained decoder to get its latent paraphrase. Roy and Grangier (2019) leveraged residual connections that allow an interpolation from a classical auto-encoder to a vector-quantized auto-encoder. Most recent works focus on Transformer-based PLMs. Meng et al. (2021) pre-trained a context-LM to generate paraphrase candidates with the regularization of context. Other works directly used PLMs (e.g., GPT-2 or BART) to generate paraphrases: Hegde and Patil (2020) used PLMs to reconstruct corrupted input, and Niu et al. (2021) brought up a new blocking algorithm during generation to prevent PLMs from copying and repeating.
These methods use PLMs in their pipelines to improve performance. However, their generation results often suffer from either semantic inconsistency or a lack of diversity.

ParaMac
In this section, we introduce our unsupervised paraphrase generation framework ParaMac. Our idea starts from two basic assumptions: first, a paraphrase must contain some key information of the original sentence, which we represent with keywords and which can be rendered in different expressions; second, there exists a different order of the keywords that can form a fluent sentence, as long as proper connecting parts are filled in between them.
Based on these two assumptions, our unsupervised paraphrase generation framework can be divided into three parts: 1) Keywords Processing: we extract the keywords of the input, then rephrase and reorder them; 2) PLM-based Generation: with the help of context regularization and the linguistic ability of PLMs, we connect the rephrased and reordered keywords into a fluent sentence as a paraphrase candidate; 3) Candidates Ranking: we use a series of metrics to select the best candidate.
To ensure semantic consistency during the process and the expression diversity of the outputs, we design multi-aspect equivalence constraints and multi-granularity diversifying mechanisms. Specifically, the equivalence constraints are:
• e1 Keywords constraint: keywords from the input are used as anchors in the output;
• e2 Context constraint: we use context information to reduce the generation output space;
• e3 Semantics constraint: we employ an automatic semantic evaluation on the candidates to rank their sentence-level semantics.
The diversifying mechanisms are embedded at three levels:
• d1 Word level: keywords are replaced with their synonyms;
• d2 Phrase level: the order of keywords is rearranged to change the possible collocations;
• d3 Sentence level: a diversity score is used to encourage different expressions.
Figure 2 provides an overview of the whole generation process and shows where exactly these constraints and mechanisms are applied.

Keywords Processing
Keywords processing consists of keywords extraction, filtering, substitution, and random permutation. In this paper, keywords refer to all the words and phrases extracted from the input.

Keywords Extraction & Filtering performs a coarse selection of the important information in the original sentence, using e1 . In this step, we leverage the Rake algorithm (Rose et al., 2010), an efficient keyword extraction algorithm based on word co-occurrence. Given the input sentence, it can return not only words but also phrases. To avoid missing important information, we also add the remaining nouns and verbs of the input. Formally, given an input sentence S, we get its keywords set {k_i}.
Next, we filter out some of the redundant, low-information keywords: the more keywords used in the later generation, the less diverse the output may be, and the higher the computational cost. We therefore relax the constraint by dropping a certain number of keywords.
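For illustration, this filtering step can be sketched as follows. The scorer is a stand-in for the PLM-based informativeness score p(S|k_i) described below, and the 0.15 drop rate follows the setting reported in Appendix A:

```python
from typing import Callable, Dict, List

def filter_keywords(
    keywords: List[str],
    score_fn: Callable[[str], float],
    drop_rate: float = 0.15,
) -> List[str]:
    """Drop the lowest-scoring fraction of keywords.

    `score_fn(k)` stands in for the PLM score p(S | k_i): how likely
    the keyword is to generate the whole input sentence.
    """
    n_drop = round(len(keywords) * drop_rate)
    if n_drop == 0:
        return list(keywords)
    ranked = sorted(keywords, key=score_fn, reverse=True)
    return ranked[: len(ranked) - n_drop]

# Toy informativeness scores standing in for p(S | k_i).
scores: Dict[str, float] = {
    "green tea": 0.9, "weight loss": 0.8, "lipton": 0.6, "assist": 0.3,
}
kept = filter_keywords(list(scores), scores.get)
# With 4 keywords, round(4 * 0.15) = 1 keyword is dropped: the
# lowest-scoring one ("assist").
```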
In this paper, we measure the informativeness of keyword k_i by computing p(S|k_i), where p(·|·) denotes the conditional generation score of the PLM. The intuition is that the more likely a keyword k_i is to generate the whole sentence, the more informative and representative it is. According to this score ranking, we filter out some of the low-scoring keywords.

Keywords Substitution & Random Permutation aims to increase diversity by rephrasing and reordering the keywords, using d1 and d2 .
The rephrasing, namely synonym substitution, is achieved by masking the target keyword and using a masked language model (MLM) such as BERT or T5 to predict the masked token. As shown in the substituting step of Figure 2, the second prediction of beam search is regarded as a semantically equivalent substitution. Note that although all keywords go through this operation, a keyword sometimes stays the same if we use T5 as the MLM. Additionally, some handcrafted blocking rules based on WordNet are used to prevent the MLM from replacing a word with one of opposite meaning (e.g., large sofa to small sofa, both of which can sometimes fit the context).
After the substitution, the keywords are randomly shuffled into many different orders and interleaved with span-mask tokens to form the final input to the PLM.
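A minimal sketch of the shuffle-and-mask step, using T5's real sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...) as the span masks. The exact input template is an assumption; the paper only states that keywords are connected with span-mask tokens and concatenated with the context:

```python
import random
from typing import List

def build_masked_input(keywords: List[str], context: str, seed: int = 0) -> str:
    """Shuffle the keywords into a random order and interleave T5
    sentinel tokens between (and around) them, then prepend the
    original sentence's context.
    """
    rng = random.Random(seed)
    order = list(keywords)
    rng.shuffle(order)
    parts: List[str] = []
    for i, kw in enumerate(order):
        parts.append(f"<extra_id_{i}>")  # span to be generated before kw
        parts.append(kw)
    parts.append(f"<extra_id_{len(order)}>")  # trailing span
    return context + " " + " ".join(parts)

template = build_masked_input(["final stage", "completion"], "She reported crisply:")
```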

PLM-based Generation
We use the powerful PLMs to generate fluent and meaningful sentences. Meanwhile, we use e1 and e2 to reduce the possible output space and prevent an overly free generation: they force the generated output to fit the original context and to contain the keywords of the input.
Specifically, we choose the bidirectional model T5-large as our PLM and leverage the span-corruption pre-training task of T5 described in Raffel et al. (2020). As shown in the PLM-based generation step of Figure 2, the input to T5 is formed by connecting the keywords with span-masking tokens and concatenating the result with the context of the original sentence S. Then, the PLM predicts the masked spans, and we fill the predictions back into the input. In this way, we ensure that the key information of S is kept in the output and that the model is context-aware during generation.
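The fill-back step can be sketched as follows, assuming T5's standard span-corruption output format, in which the decoder emits each predicted span after its sentinel token:

```python
import re
from typing import Dict

def fill_spans(masked_input: str, t5_output: str) -> str:
    """Splice T5's span predictions back into the sentinel-masked input.

    T5's span-corruption objective emits text of the form
    "<extra_id_0> span0 <extra_id_1> span1 ..."; we recover each span
    and substitute it for the matching sentinel in the input.
    """
    # Split the decoder output on sentinel tokens, keeping their indices;
    # the result alternates [prefix, idx, text, idx, text, ...].
    pieces = re.split(r"<extra_id_(\d+)>", t5_output)
    spans: Dict[str, str] = {}
    for i in range(1, len(pieces) - 1, 2):
        spans[pieces[i]] = pieces[i + 1].strip()

    def replace(m: re.Match) -> str:
        return spans.get(m.group(1), "")

    out = re.sub(r"<extra_id_(\d+)>", replace, masked_input)
    return re.sub(r"\s+", " ", out).strip()

filled = fill_spans(
    "<extra_id_0> keywords <extra_id_1> sentence <extra_id_2>",
    "<extra_id_0> Connect the <extra_id_1> into a fluent <extra_id_2> .",
)
# → "Connect the keywords into a fluent sentence ."
```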
The generated outputs are then evaluated as candidates in the next step.

Candidates Ranking
In this section, we describe the multiple scoring functions used to evaluate the quality of candidates. We consider the quality of a sentence from three aspects: semantic equivalence, fluency, and diversity.

Semantic Score
The semantic score s_sem is designed according to the third equivalence constraint e3 . Due to the scale of data we deal with, we use BERTScore (Zhang et al., 2020) as an automatic evaluation method. It measures the semantic similarity of a pair of sentences based on their token embeddings.
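BERTScore's core greedy-matching computation can be sketched over toy embedding vectors. The real metric obtains the embeddings from a PLM and applies idf weighting and baseline rescaling on top, which this sketch omits:

```python
import math
from typing import List, Sequence

Vec = Sequence[float]

def _cos(a: Vec, b: Vec) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def bertscore_f1(cand: List[Vec], ref: List[Vec]) -> float:
    """Greedy-matching F1 at the core of BERTScore: each token embedding
    is matched to its most similar counterpart in the other sentence.
    """
    precision = sum(max(_cos(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(_cos(r, c) for c in cand) for r in ref) / len(ref)
    return 2 * precision * recall / (precision + recall)

# Identical one-token "sentences" score a perfect 1.0.
same = bertscore_f1([(1.0, 0.0)], [(1.0, 0.0)])  # → 1.0
```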

Fluency Score
The fluency of generated sentences should also be highly valued, and we use the candidate's perplexity to reflect fluency. However, since lower perplexity is better, in practice we calculate the probability of the sentence as the metric s_flu. Specifically, we use a PLM's next-word prediction probabilities to compute s_flu for a candidate S' = (w_1, ..., w_n):

s_flu(S') = ∏_{t=1}^{n} p(w_t | w_1, ..., w_{t-1})    (1)
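A minimal sketch of Eq. (1). In ParaMac the next-word probabilities come from GPT-2 (Appendix A); here they are passed in directly to keep the sketch self-contained, and the product is computed in log space for numerical stability:

```python
import math
from typing import List

def fluency_score(token_probs: List[float]) -> float:
    """Sentence probability as the fluency metric s_flu (Eq. 1): the
    product of the LM's next-word prediction probabilities, computed
    as exp(sum of log-probs) to avoid underflow on long sentences.
    """
    return math.exp(sum(math.log(p) for p in token_probs))

s_flu = fluency_score([0.5, 0.25, 0.8])  # ≈ 0.1
```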

Diversity Score
This measurement is a direct expression of d3 , where we intend to encourage the diversity of the generated sentences; both wording and word order are considered. Inspired by Meng et al. (2021), we use the Jaccard distance to measure the difference between two sentences S_1 and S_2. Representing each sentence S as the set of its (word, position) pairs A_S = {(w, p_S(w))}, where p_S(w) is the position of word w in S, the score s_div is calculated as:

s_div(S_1, S_2) = 1 − |A_{S_1} ∩ A_{S_2}| / |A_{S_1} ∪ A_{S_2}|    (2)
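Eq. (2) can be implemented directly; treating a sentence as its (word, position) pairs follows the p_S(w) notation above, so that reordering the same words still counts as diversity:

```python
from typing import List, Set, Tuple

def diversity_score(s1: List[str], s2: List[str]) -> float:
    """Jaccard-distance diversity s_div (Eq. 2) over (word, position)
    pairs, so that both wording and word order contribute.
    """
    a: Set[Tuple[str, int]] = {(w, i) for i, w in enumerate(s1)}
    b: Set[Tuple[str, int]] = {(w, i) for i, w in enumerate(s2)}
    return 1.0 - len(a & b) / len(a | b)

# Same words, fully reordered: only ("b", 1) is shared, so the
# distance is high even though the vocabularies are identical.
d = diversity_score("a b c".split(), "c b a".split())  # → 0.8
```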

Comprehensive Score
Eventually, all three evaluation scores are taken into consideration. We use a linear combination of s_sem, s_flu, and s_div to calculate the final comprehensive score s_final:

s_final = λ_1 · s_sem + λ_2 · s_flu + λ_3 · s_div    (3)

where λ_1, λ_2, and λ_3 are weight parameters.
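Eq. (3) and the candidate selection can be sketched together; the default weights are the values reported in Appendix A:

```python
from typing import List, Tuple

def comprehensive_score(
    s_sem: float, s_flu: float, s_div: float,
    lambdas: Tuple[float, float, float] = (4.0, 8.0, 1.2),
) -> float:
    """Linear combination s_final (Eq. 3) of the three scores."""
    l1, l2, l3 = lambdas
    return l1 * s_sem + l2 * s_flu + l3 * s_div

def rank_candidates(cands: List[Tuple[str, float, float, float]]) -> str:
    """Return the candidate sentence with the highest s_final.
    Each entry is (sentence, s_sem, s_flu, s_div)."""
    return max(cands, key=lambda c: comprehensive_score(*c[1:]))[0]

best = rank_candidates([
    ("cand A", 0.9, 0.10, 0.3),  # s_final = 3.6 + 0.8 + 0.36 = 4.76
    ("cand B", 0.8, 0.20, 0.5),  # s_final = 3.2 + 1.6 + 0.60 = 5.40
])
# → "cand B"
```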

ParaNet and ParaMod
With the above framework, we can now choose a proper corpus from which to generate our paraphrase dataset ParaNet. As the generation process needs the context of the inputs, we choose the long-form BookCorpus (Zhu et al., 2015) as our input corpus. Since T5 uses BookCorpus as one of its pre-training corpora and we do not want ParaMac's PLM to have seen the input sentences before, we use a newly crawled version (Sep. 2020, by Shawn Presser) excluding the original BookCorpus, which leaves us 3551 books. The genres of these books include fiction, nonfiction, essays, poetry, plays, and screenplays, spanning up to 100 topics such as romance, science fiction, fantasy, thriller, and suspense.
From this subset of BookCorpus, we randomly sample 10k examples. Each input example contains 1) a complete sentence S with a length between 60 and 100 characters; 2) the context before and after S, each with an average length of 250 characters. Given the 10k examples, we generate ParaNet in an unsupervised way using ParaMac. Then, based on ParaNet, we train a Seq2Seq language model, ParaMod, which can generate a paraphrase given any sentence. Implementation details can be found in Appendix A.

Experiments
In this section, we evaluate our proposed paraphrase generation model ParaMod in both unsupervised and supervised settings.Furthermore, we use our ParaMod as a data augmentation and generation tool to validate its effectiveness in downstream NLP tasks.

Paraphrase Generation
ParaMod can be directly applied to datasets from different domains without further fine-tuning, so we consider this setting unsupervised. The supervised setting refers to further fine-tuning ParaMod on domain-specific data.
To demonstrate the generality of our model, we choose Quora and MSCOCO as evaluation datasets, representing interrogative and declarative sentences, respectively. The Quora dataset is popularly known as the Quora question pairs dataset. The latest version contains 149k parallel paraphrase question pairs and 260k non-parallel questions; we follow the split used in Liu et al. (2020) and Meng et al. (2021). MSCOCO (Lin et al., 2014) was originally used for image captioning, consisting of roughly 120k images annotated with five captions each. We follow the standard split (Lin et al., 2014).

Baselines and Metrics
For the unsupervised setting, we compare our model to the following baselines. We copy the results of these methods from their papers when they report the same experiments under the same settings.
The metrics used are the iBLEU score (Sun and Zhou, 2012), the BLEU score (Papineni et al., 2002), and the ROUGE score (Lin, 2004). BLEU and ROUGE are the most widely used metrics for sentence similarity. iBLEU extends BLEU by penalizing the similarity between the generated sentence and the input; we adopt the same weight parameter as Meng et al. (2021). Finally, following previous works (Chen et al., 2020; Gupta et al., 2018), we compute these metrics by generating multiple paraphrases and selecting the one with the highest iBLEU score.
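The iBLEU combination can be sketched as follows. `bleu_fn` stands in for any real BLEU implementation (e.g., sacrebleu), and `alpha = 0.8` is a common choice from the iBLEU literature, not necessarily the weight used in this paper; the toy unigram-precision BLEU below exists only to exercise the formula:

```python
from typing import Callable, List

def ibleu(
    candidate: str,
    references: List[str],
    source: str,
    bleu_fn: Callable[[str, List[str]], float],
    alpha: float = 0.8,
) -> float:
    """iBLEU (Sun and Zhou, 2012): reward similarity to the references
    and penalize similarity to the input sentence.
    """
    return (alpha * bleu_fn(candidate, references)
            - (1 - alpha) * bleu_fn(candidate, [source]))

def toy_bleu(cand: str, refs: List[str]) -> float:
    """Toy stand-in: best unigram overlap precision over the references."""
    c = cand.split()
    best = 0.0
    for r in refs:
        rset = set(r.split())
        best = max(best, sum(w in rset for w in c) / len(c))
    return best

# A candidate that copies the source verbatim gets penalized.
score = ibleu("the cat sat", ["a cat sat"], "the cat sat", toy_bleu)
```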

Unsupervised Results
The main results of unsupervised paraphrase generation are in Table 1. The proposed ParaMod outperforms the baselines on every metric. In particular, there is a huge lift in BLEU on Quora. On MSCOCO, there is a significant increase in ROUGE values and an approximately 3.3% improvement in BLEU. The iBLEU score is also comparable with previous unsupervised methods. This result shows the potential and value of our ParaNet. One advantage of our model is that we smoothly reuse the original pre-training task in the generation process, leaving no gap with the PLMs' pre-training.

Supervised Results
To demonstrate the power and generality of ParaMod, we randomly select only subsets of the whole training set. Specifically, we sample 500-example and 10k-example subsets from the Quora and MSCOCO training sets, then fine-tune ParaMod on these subsets for two epochs. The results of supervised paraphrase generation are shown in Table 2. On Quora, merely a 500-example, 2-epoch fine-tuning enables ParaMod to outperform all baselines, and a 10k-example, 2-epoch fine-tuning dramatically improves the performance on all metrics. On MSCOCO, the 500-example fine-tuning makes ParaMod comparable to the previous SOTA, and the further 10k-example fine-tuning outperforms it by a significant margin. This result shows the effectiveness of ParaMod as a general initialization for paraphrase models, and further validates the value of ParaNet.

Case Study
Here we provide some real examples generated by unsupervised models on Quora. Although ConRPG is the best baseline model, its model and code are both unavailable. Therefore, we choose the second-best CorruptLM as a comparison to our ParaMod. The examples are shown in Table 3. The output of ParaMod is more fluent, diverse, and semantically accurate.

Downstream Tasks
In this work, we particularly highlight our model's generality across downstream tasks, especially in low-resource settings. We consider two scenarios: the first is question generation for tasks like semantic parsing and question answering; the second is few-shot learning for NLP tasks.

Low-Resource Generation
In this experiment, we consider the application of Knowledge-based Question Answering (KBQA), which aims to answer a given natural-language question based on a knowledge base. Recently, one prominent approach to constructing datasets for KBQA is the synthesize-then-paraphrase pipeline (Lan et al., 2021). First, template questions are generated automatically, and then crowdsourced workers are recruited to paraphrase the template questions into natural ones (Wang et al., 2015; Gu et al., 2021; Cao et al., 2022). Although this two-stage paradigm makes constructing large-scale datasets possible, the time and human effort for paraphrasing are intensive and costly. We aim to validate the effectiveness of ParaMod for automatically paraphrasing questions for KBQA. We adopt the dataset KQA Pro (Cao et al., 2022), a large-scale complex KBQA dataset whose template questions are generated according to a synchronous grammar and then paraphrased by AMT workers. To validate the low-resource generation ability, we compare paraphrase models in the few-shot setting. We split off 10k question pairs as the test set, leaving the rest for training. The models are fine-tuned on subsets of the training set containing 100, 500, and 1k randomly selected examples, respectively. We compare our ParaMod with T5-base. The second-best unsupervised model CorruptLM is not used because it is not a general model and needs to be trained on domain-specific text, which we do not have in a low-resource setting.

Table 3:
Input | CorruptLM | ParaMod (ours)
Does Lipton green tea assist in weight loss? | Does green tea assist for weight loss? | Is lipton green tea an aid in weight loss?
Can you create another upwork account after suspension? | how do i add another upwork account | After the suspension, can you open another
Why is it that the American government is so corrupt? | Are the government corrupt? | Why is the American government so corrupt?
Are there any verified angel investors on quora? | What angel investors are on quora? | Does quora have any verified angel investors?
We use the same metrics as in the paraphrase generation experiments above. The results are shown in Table 4. ParaMod's performance greatly exceeds T5-base's. Especially when there are only very few gold human-written pairs, ParaMod is still able to achieve reasonably good generation quality.

Low-Resource Augmentation
In this experiment, we demonstrate that ParaMod is also a universal data augmentor. We follow Gao et al. (2021), who developed a prompt-tuning method for few-shot learning on GLUE. We follow their experimental setting, choosing part of the GLUE tasks, and expand their K-shot data with our ParaMod. The baselines are two other data augmentation methods: naive inserting/deleting/replacing by PLMs, and CorruptLM (trained on MSCOCO).
From the results shown in Table 5, we see that ParaMod improves performance on the chosen tasks by an average of 2.0% and also improves the stability of all tasks (smaller standard deviations). The naive augmentation and CorruptLM augmentation both cause a performance drop. The likely reason is that their augmented examples can be semantically inconsistent with the original example, and the few-shot model is very sensitive to out-of-distribution examples. It is also worth noting that performance drops as the augmenting size N increases for all augmentation methods. We consider this a normal phenomenon: the extra training examples created here are all semantically similar to the original ones, and training on too many similar examples can cause over-fitting.

Ablation Studies
In this section, we study various factors that can affect the performance of our ParaMod.

Paraphrase Pre-Training Data Size
We explore the influence of the size of ParaNet used to train ParaMod. In Table 6, we train T5-base for 3 epochs on subsets of ParaNet containing 25k, 50k, and 75k pairs, respectively, and evaluate the models on Quora. The results show that paraphrasing performance on Quora gradually rises as we increase the amount of training data, showing that our ParaNet is helpful for paraphrase generation.

Table 5: The results of few-shot learning on GLUE. We report mean (and standard deviation) performance over 5 different splits (Gao et al., 2021). N = 3 means we increase the original K-shot training set size to N times its size. The CorruptLM used in this experiment is trained on MSCOCO.

Conclusions
In this paper, we introduce ParaMac, a novel unsupervised paraphrase generation framework that utilizes PLMs' linguistic ability, combined with multi-aspect equivalence constraints and multi-granularity diversifying mechanisms, to improve generation quality in terms of semantic equivalence and expression diversity. Moreover, we demonstrate the generality and value of our general paraphrase model on several downstream tasks.

Limitations
In this section, we point out three limitations of our unsupervised paraphrase generation framework. First, the framework requires a relatively high-quality corpus, since paraphrase generation needs the context of the inputs. Second, our framework is relatively complicated: due to the limitations of the input data, ParaMac cannot directly perform generation on paraphrase benchmarks such as Quora, so to generate a paraphrase for an arbitrary sentence, we produce ParaNet and ParaMod to do the job. Finally, the construction of ParaNet is demanding on computing resources: in practice, it takes about 30-50s to produce a pair on a 24GB RTX 3090, depending on the implementation hyper-parameters.

A Implementations
In practice, our ParaMac works with a keyword filtering rate of round(|{k_i}| · 0.15), where {k_i} is the set of keywords. The number of keyword permutations is capped at 15. The PLM used for the fluency score is GPT-2. The base language model of BERTScore is a RoBERTa-large specially fine-tuned on the MNLI task (Williams et al., 2018) to increase accuracy. The weighting parameters in the comprehensive score are set to λ_1 = 4.0, λ_2 = 8.0, and λ_3 = 1.2. These values are set manually by experiment: we take 500 examples generated by ParaMac and calculate each score's standard deviation (std). To avoid one score totally overwhelming the others, we set each weight inversely proportional to its corresponding score's std.
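The inverse-std weight setting can be sketched as follows; the `scale` normalization is our own convenience, as the text only states inverse proportionality:

```python
import statistics
from typing import List

def set_weights(score_samples: List[List[float]], scale: float = 1.0) -> List[float]:
    """Set each λ_i inversely proportional to the standard deviation of
    its score over a sample of generations, so that no single score
    dominates the linear combination.
    """
    stds = [statistics.pstdev(samples) for samples in score_samples]
    return [scale / s for s in stds]

# A score with twice the spread gets half the weight.
w = set_weights([[0.0, 1.0], [0.0, 2.0]])  # → [2.0, 1.0]
```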
For ParaMod, we choose T5-base as the base model, using the 10k-pair ParaNet as the training set and training with the AdamW optimizer, a learning rate of 1e-4, β_1 = 0.9, and β_2 = 0.999. Due to GPU memory limitations, we set the batch size to 4 and the gradient accumulation steps to 4 as well. The model converges after about 20 hours of training for 25 epochs on a single 24GB RTX 3090, although we observe experimentally that five epochs of training yield the best performance.
All experiments are implemented with PyTorch and the transformers toolkit. We use the transformers Trainer class to build the Seq2Seq model code base.

B Error Analysis
In this section, we provide an error analysis of some bad cases in our experiments. Before the large-scale generation of ParaNet, we conducted a small-batch experiment and corrected some of the error modes by human observation. The observed errors fall into four types:

• Confusion of affirmative and negative. The model often ignores the negative suffix in aren't/isn't/haven't and outputs are/is/have. This is mitigated by adding n't as a special keyword when it appears in the sentence.
• Confusion of antonyms. In keyword substitution, the model sometimes fills in an antonym, e.g., large to small. This is prevented by using the word's synonym set in WordNet.
• Missing important parts. This is mainly caused by the incomplete keyword extraction of Rake. To alleviate this problem, we consider all nouns and verbs as extra keywords and lower the filtering rate to avoid dropping important information.
• Punctuation problems. Rake does not handle punctuation symbols well. Symbols are sometimes included in a keyword, which restricts its output position. Also, Rake splits words connected with "-": for example, the keyword "semi-fluidic nature" will be split into "semi" and "fluidic nature", which greatly harms the semantics after reordering. We add extra rules to avoid these problems.
Examples of these error types are listed accordingly in Table 8. The first example changes the original negative statement to an affirmative one; the second generates small rather than a synonym of large in the input; the third misses station in its keywords and thus generates output with different semantics; the last fails to stay consistent with the input because the keyword semi-fluidic is split.

Input | Output
Such a possibility hadn't even been discussed during the planning stages. | This was discussed during the planning stages as a possibility.
A large sofa was shoved against the wall, covered in a thin blanket. | A small sofa was wrapped in a soft blanket and tucked against the wall.
"As you command, controller," grudy said, and returned to his station. | "I returned to the controller," grudy said.
"Does the semi-fluidic nature of the crystals present a weakness in that regard?" | "Is there any weakness in the semi-crystalline structure in that regard?" the fluidic nature present.

Although error modes were corrected and avoided when spotted in this optimizing process, some semantic errors still exist in the final ParaNet we used. Nevertheless, these errors do not much affect the overall quality of the dataset. For an automatically generated parallel dataset, the quality of ParaNet can still be considered very good, as our experimental results also demonstrate.

Figure 1 :
Figure 1: We propose an unsupervised paraphrase generation framework named ParaMac. Based on it, a high-quality dataset named ParaNet is generated and used to train a general Seq2Seq paraphraser named ParaMod.
VAE Bowman et al. (2016) generated paraphrases by sampling from a continuous space.
CGMH Miao et al. (2019) utilized a Metropolis-Hastings sampling strategy for sentence generation.
UPSA Liu et al. (2020) introduced a novel approach that transforms paraphrase generation into an optimization problem.
DB Niu et al. (2021) brought up a blocking mechanism to diversify the output of PLMs.
CorruptLM Hegde and Patil (2020) utilized GPT-2 to generate paraphrases by teaching PLMs to recover the input from a corrupted sentence.
ConRPG Meng et al. (2021) proposed to generate paraphrases by using the context as a regularizer.
For the supervised setting, the baselines are:
ResLSTM Prakash et al. (2016) trained a stacked residual LSTM for Seq2Seq paraphrasing.
VAE-SVG-eq Gupta et al. (2018) combined VAE and LSTM to generate realistic paraphrases.
Transformer Vaswani et al. (2017) developed the Transformer with the attention mechanism.
DNPG Li et al. (2019) designed a multi-granularity Transformer-based model.
ConRPG The unsupervised model of Meng et al. (2021) can be further trained on supervised data.
LBoW Fu et al. (2019) used a discrete bag-of-words as the latent encoding for the encoder-decoder generation model.
SCSVED Chen et al. (2020) leveraged adversarial learning on a variational encoder-decoder to help keep semantics consistent.

Figure 2: Overview of ParaMac's three steps: 1. Keywords Processing, 2. PLM-based Generation, 3. Candidates Ranking. Example input S: "the final stage is nearing completion, controller", she reported crisply. Example output S': She reported crisply that she was nearing completion of the final stage of the project.

Table 2 :
Supervised paraphrase generation results. Baseline figures on Quora are mainly taken from Meng et al. (2021); figures on MSCOCO are mainly taken from Zhou and Bhat (2021). 500 and 10k indicate that the model is fine-tuned on the 500-example and 10k-example subsets, respectively.

Table 4 :
Results for paraphrasing template question to natural language question on KQA Pro.

Table 3 :
Examples of the generation output of our method and CorruptLM (Hegde and Patil, 2020) on Quora.

Table 6 :
The influence of training data size on ParaMod's performance.

Table 7 :
The influence of training epochs on ParaMod's performance.

Table 8 :
Examples of the error occurred in the generation of ParaNet.