ConRPG: Paraphrase Generation using Contexts as Regularizer

A long-standing issue with paraphrase generation is the lack of reliable supervision signals. In this paper, we propose a new unsupervised paradigm for paraphrase generation based on the assumption that the probabilities of generating two sentences with the same meaning given the same context should be the same. Inspired by this fundamental idea, we propose a pipelined system which consists of paraphrase candidate generation based on contextual language models, candidate filtering using scoring functions, and paraphrase model training based on the selected candidates. The proposed paradigm offers merits over existing paraphrase generation methods: (1) using the context regularizer on meanings, the model is able to generate massive amounts of high-quality paraphrase pairs; (2) the combination of the huge amount of paraphrase candidates and further diversity-promoting filtering yields paraphrases with more lexical and syntactic diversity; and (3) using human-interpretable scoring functions to select paraphrase pairs from candidates, the proposed framework provides a channel for developers to intervene with the data generation process, leading to a more controllable model. Experimental results across different tasks and datasets demonstrate that the proposed paradigm significantly outperforms existing paraphrase approaches in both supervised and unsupervised setups.


Introduction
Paraphrase generation (Prakash et al., 2016a; Cao et al., 2016; Ma et al., 2018; Wang et al., 2018) is the task of generating an output sentence that is semantically identical to a given input sentence but varies in lexicon or syntax. It is a long-standing problem in the field of natural language processing (NLP) (McKeown, 1979; Meteer and Shaked, 1988; Quirk et al., 2004; Bannard and Callison-Burch, 2005a; Chen and Dolan, 2011) and has fundamental applications to end tasks such as semantic parsing (Berant and Liang, 2014), language model pretraining (Lewis et al., 2020) and question answering (Dong et al., 2017).

1 To appear at EMNLP 2021.
A long-standing challenge with paraphrase generation is obtaining reliable supervision signals. One way to resolve this issue is to manually annotate paraphrase pairs, which is both labor-intensive and expensive. Existing labeled paraphrase datasets (Lin et al., 2014; Fader et al., 2013; Lan et al., 2017) are either small or restricted to narrow domains. For example, the Quora dataset contains 140K paraphrase pairs, a size insufficient to train a large neural model. As another example, paraphrases in the larger MSCOCO dataset (Lin et al., 2014) were originally collected as image captions for object recognition and then repurposed for paraphrase generation; the domain of MSCOCO is thus restricted to captions depicting visual scenes.
Unsupervised methods, such as reinforcement learning (Siddique et al., 2020) and auto-encoders (Bowman et al., 2016; Roy and Grangier, 2019), on the other hand, have exhibited the ability to generate paraphrases in the absence of annotated datasets. The core problem with existing unsupervised methods is the lack of an objective (or reward function in RL) that reliably measures the semantic relatedness between two diverse expressions in an unsupervised manner, with which the model could be trained to promote pairs with the same meaning but diverse expressions. For example, Hegde and Patil (2020) crafted unsupervised pseudo training examples by corrupting a sentence and feeding the corrupted version to a pretrained model as the input, with the original sentence as the output. Since the model is restricted to learning to reconstruct corrupted sentences, the generated paraphrases tend to be highly similar to the input sentences in both wording and word order. The issue in Hegde and Patil (2020) can be viewed as a microcosm of the problems in existing unsupervised methods: we wish sentences to be diverse in expression, but lack a reliable measurement to avoid meaning change when expressions change. Additionally, the sentence-corruption operation is hard to control.
In this work, we propose to address this issue with a new paradigm based on the assumption that the probabilities of generating two sentences with the same meaning given the same context should be the same. With this core idea in mind, we propose a pipelined system which consists of the following steps: (1) paraphrase candidate generation by decoding sentences given their context using a language generation model; (2) candidate filtering based on scoring functions; and (3) training a SEQ2SEQ paraphrase generation model on the selected candidates, which can later be finetuned on labeled datasets in the supervised setup or used directly for unsupervised paraphrase generation.
The proposed paradigm offers the following merits over existing methods: (1) using the context regularizer on meanings, the model is able to generate massive amounts of high-quality paraphrase pairs; and (2) using human-interpretable ranking scores to select paraphrase pairs from candidates, the proposed framework provides a channel for developers to intervene with the data generation process, leading to a more controllable paraphrase model. Extensive experiments across different datasets under both supervised and unsupervised setups demonstrate the effectiveness of the proposed model.


Related Work
Kazemnejad et al. (2020) proposed a retrieval-based approach that retrieves paraphrases from a large corpus. Sokolov and Filimonov (2020) cast paraphrase generation as a machine translation task. Other works extended the idea of bilingual pivoting for paraphrase generation, where the input sentence is first translated into a foreign language and then translated back as the paraphrase. Sokolov and Filimonov (2020) trained an MT model on multilingual parallel data and then finetuned it on parallel paraphrase data. Siddique et al. (2020) proposed to generate paraphrases using reinforcement learning, where rewarding criteria such as BLEU and ROUGE are optimized. Bowman et al. (2016) and Yang et al. (2019) used a generative framework for paraphrase generation, training a variational autoencoder (VAE) (Kingma and Welling, 2013) to optimize the lower bound of the reconstruction likelihood of an input sentence. Sentences sampled from the VAE's decoder can be regarded as paraphrases of the input sentence due to the reconstruction training objective. Fu et al. (2019) similarly adopted a generative method but worked at the bag-of-words level. Other works explored unsupervised paraphrase generation using vector quantised VAE (VQ-VAE) (Roy and Grangier, 2019), simulated annealing, or disentangled syntactic and semantic spaces (Bao et al., 2019).
More recently, large-scale language model pretraining has also been proven to benefit paraphrase generation in both supervised learning (Witteveen and Andrews, 2019) and unsupervised learning (Hegde and Patil, 2020). Krishna et al. (2020) proposed diverse paraphrasing by warping the input's meaning through attribute transfer.

Unsupervised Methods
To solicit large-scale paraphrase datasets, Bannard and Callison-Burch (2005b) used statistical machine translation methods to obtain paraphrases from parallel text, a technique scaled up by Ganitkevitch et al. (2013) to produce the Paraphrase Database (PPDB). Other works translate the non-English side of parallel text to obtain paraphrase pairs. Wieting and Gimpel (2017) collected a paraphrase dataset with millions of pairs via machine translation. Hu et al. (2019a,b) produced paraphrases from a bilingual corpus based on the techniques of negative constraints, inference sampling, and clustering. A line of work relevant to ours harnesses context to obtain sentence similarity, but it focuses on sentence similarity rather than paraphrase generation.

Figure 1: Overview of the proposed ConRPG framework, with the context-LM score, the diversity score and the generation score used for filtering.
Step 1: we first train a context-LM model that predicts the sentence probability in an autoregressive manner given contexts.
Step 2: the context-LM model is used to decode multiple candidate paraphrases with respect to a given context using diverse beam search.
Step 3: paraphrase candidates are filtered based on different scoring functions, i.e., the context-LM score, the diversity score and the generation score.
Step 4: the selected pair is used to train a SEQ2SEQ model, which can later be used for supervised finetuning or directly for unsupervised paraphrase generation.

Model
The key point of the proposed paradigm is to generate paraphrases based on the same context. This is done in the following pipelined system: (1) we first train a contextual language generation model (context-LM) that predicts sentences given left and right contexts; (2) the pretrained contextual generation model decodes multiple sentences given the same context, and the decoded sentences are treated as paraphrase candidates; (3) because decoded sentences can be extremely noisy, further filtering is needed; (4) given the selected paraphrase pairs, a SEQ2SEQ model (Sutskever et al., 2014) is trained using one sentence of each pair as the source and the other as the target; the SEQ2SEQ model can be used directly for paraphrase generation in the unsupervised setup, or used as an initialization to be further finetuned on labeled paraphrase datasets in the supervised setup. An overview of the proposed framework is depicted in Figure 1, the constituent units of which are detailed in order below.
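The four-stage pipeline above can be sketched in a few lines of Python. All names here (`context_lm.decode`, `score_pair`, `train_seq2seq`) are hypothetical placeholders standing in for the components described in the following subsections, not the authors' code:

```python
def conrpg_pipeline(corpus, context_lm, score_pair, train_seq2seq, top_k=1):
    """Sketch of ConRPG: decode candidates per context, filter, train SEQ2SEQ.

    `corpus` yields (left_context, sentence, right_context) triples;
    `context_lm.decode` and `score_pair` are assumed interfaces.
    """
    pairs = []
    for left_ctx, sent, right_ctx in corpus:
        # Step 2: decode several candidate sentences for the same context slot.
        candidates = context_lm.decode(left_ctx, right_ctx, num_candidates=8)
        # Step 3: score every candidate pair and keep only the best ones.
        scored = sorted(
            ((score_pair(s1, s2, left_ctx, right_ctx), s1, s2)
             for i, s1 in enumerate(candidates)
             for s2 in candidates[i + 1:]),
            reverse=True)
        pairs.extend((s1, s2) for _, s1, s2 in scored[:top_k])
    # Step 4: train a SEQ2SEQ paraphraser on the selected pairs.
    return train_seq2seq(pairs)
```

Keeping the top-1 pair per context mirrors the ρ = 1 selection ratio discussed later in the paper.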

Training context-LM
Let c_i = {w_{i,1}, w_{i,2}, ..., w_{i,n}} denote the i-th sentence within the given text, where n is the number of words in c_i. c_{i:j} denotes the i-th to j-th sentences. c_{<i} and c_{>i} respectively denote the preceding and subsequent context of c_i. Given contexts c_{<i} and c_{>i}, we first train a context-LM by maximizing p(c_i | c_{<i}, c_{>i}). The input is a sequence of words, and the input representation of each word is the sum of three embeddings: the sentence-position embedding, the token-position embedding and the word embedding. Predicting c_i proceeds word by word. We consider both left-to-right and right-to-left generation to optimize p(c_i | c_{<i}, c_{>i}), respectively given by the following objectives:

p(→c_i | c_{<i}, c_{>i}) = ∏_{t=1..n} p(w_{i,t} | w_{i,<t}, c_{<i}, c_{>i}),
p(←c_i | c_{<i}, c_{>i}) = ∏_{t=n..1} p(w_{i,t} | w_{i,>t}, c_{<i}, c_{>i}). (1)

p(c_i | c_{<i}, c_{>i}) models the forward probability from contexts to sentences. For two sentences with the same meaning, the probability of generating the contexts given each sentence should also be the same, which corresponds to the backward probability from sentences to contexts. This is akin to the bi-directional mutual-information based generation strategy (Fang et al., 2015; Li et al., 2016a; Li and Jurafsky, 2016; Wang et al., 2021). The backward probability is modeled by predicting preceding contexts given subsequent contexts, p(c_{<i} | c_i, c_{>i}), and predicting subsequent contexts given preceding contexts, p(c_{>i} | c_{<i}, c_i).
We implement the above models, i.e., p(→c_i | c_{<i}, c_{>i}), p(←c_i | c_{<i}, c_{>i}), p(c_{<i} | c_i, c_{>i}) and p(c_{>i} | c_{<i}, c_i), based on the SEQ2SEQ structure, trained on a subset of CommonCrawl containing 10 billion tokens in total. We use Transformers as the backbone (Vaswani et al., 2017), with the number of encoder blocks, the number of decoder blocks, the number of heads, d_model and d_ff set to 6, 6, 8, 512 and 2048, respectively. We use Adam (Kingma and Ba, 2014) for optimization, with a learning rate of 1e-4, β1 = 0.9 and β2 = 0.999. We consider a maximum of 800 tokens on each side as contexts.

Paraphrase Candidate Generation
Using the pretrained context-LM models, we generate potential paraphrases by decoding multiple outputs for a given context based only on p(→c_i | c_{<i}, c_{>i}). The other three contextual objectives, i.e., p(←c_i | c_{<i}, c_{>i}), p(c_{<i} | c_i, c_{>i}) and p(c_{>i} | c_{<i}, c_i), cannot be readily used at the decoding stage since computing them requires the target generation to be complete; they are instead used at the later reranking stage. We use the diverse decoding strategy of beam search (Li et al., 2016b) to generate diverse candidates. Decoded candidates are guaranteed to be fluent.
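The diversity-promoting beam search of Li et al. (2016b) penalizes each expansion by its rank among siblings from the same parent beam, pushing beams apart. A minimal single-step sketch; the `next_logprobs` callable and the `gamma` penalty weight are illustrative assumptions:

```python
def diverse_beam_step(beams, next_logprobs, beam_size, gamma=1.0):
    """One step of diversity-promoting beam search (simplified).

    `beams` is a list of (prefix, cumulative_logprob) pairs and
    `next_logprobs(prefix)` is an assumed callable returning a
    {token: logprob} dict for the next position.
    """
    candidates = []
    for prefix, score in beams:
        ranked = sorted(next_logprobs(prefix).items(),
                        key=lambda kv: kv[1], reverse=True)
        for rank, (tok, lp) in enumerate(ranked):
            # The diversity penalty grows with intra-sibling rank, so
            # lower-ranked expansions of the same parent are discouraged.
            candidates.append((prefix + [tok], score + lp - gamma * rank))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```

Iterating this step until an end token is produced yields the candidate set used in the filtering stage.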

Paraphrase Filtering
The decoded candidates cannot be used directly since (1) candidates often differ only by punctuation or minor morphological variations, with almost all words overlapping, and (2) many of them do not share the same meaning. We thus propose to further rank candidate pairs. The ranking model consists of three parts:

Context LM Score
For a pair of sentences s1 and s2 with the same meaning, the probabilities of generating them given the same context should be very close; likewise, the probabilities of predicting the left and right contexts given each of the two sentences should also be similar. The scoring function used to rank (s1, s2) thus consists of the following parts: (1) the differences in the probabilities of generating the two sentences given the contexts, i.e., the differences in (1/|s|) log p(→s | c_{<i}, c_{>i}) and in (1/|s|) log p(←s | c_{<i}, c_{>i}); and (2) the differences in the probabilities of generating the contexts given the two sentences, i.e., the differences in (1/|c_{<i}|) log p(c_{<i} | s, c_{>i}) and in (1/|c_{>i}|) log p(c_{>i} | c_{<i}, s).

Lexicon and Syntactic Diversity
Two identical sentences would obtain the optimal context-LM score, which does not serve our purpose since we wish paraphrases to be as diverse as possible. We consider two types of diversity: (1) lexical diversity, which encourages individual word or phrase replacements using synonyms; and (2) syntactic diversity, which encourages syntactic shifts such as heavy-NP shift. Lexical diversity is measured by the unigram-based Jaccard distance between the two sentences. Syntactic diversity is measured by the relative position change of shared unigrams. If s2 contains multiple copies of a word w in s1, we pick the nearest copy. Let pos_s(w) denote the position index of w in s and V(s) the set of unigrams in s. The combination of lexical and syntactic diversity is given by:

S_diversity = 1 − |V(s1) ∩ V(s2)| / |V(s1) ∪ V(s2)| + (1/|V(s1) ∩ V(s2)|) Σ_{w ∈ V(s1) ∩ V(s2)} | pos_{s1}(w)/|s1| − pos_{s2}(w)/|s2| |, (2)

where the first part is the unigram Jaccard distance and the second part is the relative position change of shared unigrams.
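The diversity score described above can be sketched as follows. Normalizing positions by sentence length and weighting the two terms equally are our assumptions:

```python
def diversity_score(s1, s2):
    """Unigram Jaccard distance (lexical diversity) plus the mean
    relative-position shift of shared unigrams (syntactic diversity)."""
    t1, t2 = s1.split(), s2.split()
    v1, v2 = set(t1), set(t2)
    jaccard = 1.0 - len(v1 & v2) / len(v1 | v2)
    shifts = []
    for w in v1 & v2:
        r1 = t1.index(w) / len(t1)  # relative position of w in s1
        # If s2 has several copies of w, pick the nearest one, as in the paper.
        r2 = min((j / len(t2) for j, t in enumerate(t2) if t == w),
                 key=lambda r: abs(r - r1))
        shifts.append(abs(r1 - r2))
    position_change = sum(shifts) / len(shifts) if shifts else 0.0
    return jaccard + position_change
```

Identical sentences score 0, fully disjoint sentences score 1, and word-order shuffles fall in between, matching the intent of Eq. (2).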

Mutual Generation Score
It is noteworthy that an intrinsic drawback of the proposed methodology (and of other paraphrase generation methods as well) is that two sentences that fit into the same context are not necessarily of exactly the same meaning, e.g., sentences with very similar general semantics that vary in specific details (e.g., numbers). Consider the two sentences I spent 5 dollars on this mug. vs. I spent 6 dollars on this mug. If one sentence fits into certain contexts, it is very likely that the other will also fit. The issue can be alleviated by considering more contexts, but the practical problem remains because our model can only consider a limited amount of context due to hardware limitations.
We propose a strategy to address this drawback, inspired by the famous line that "Happy families are all alike; every unhappy family is unhappy in its own way". Paraphrases share the same meaning in the vector space, so there should be a direct and easy mapping between them, whereas non-paraphrases differ in random ways. It is thus easier to predict a paraphrase given a sentence than to predict a specific non-paraphrase given that sentence. For example, p("six dollars"|"6 dollars") should be higher than the probability of generating a particular non-paraphrase, e.g., p("5 dollars"|"6 dollars"). This is because there are many ways to generate non-paraphrases, e.g., p("5 dollars"|"6 dollars"), p("7 dollars"|"6 dollars"), etc.; these non-paraphrases split the probability mass, making the probability of any individual non-paraphrase low. To this end, we train a SEQ2SEQ model (Sutskever et al., 2014) on 8 million pairs of decoded candidates using the Transformer-base architecture. Using this model, we compute the mutual generation score for any sentence pair (s1, s2) as follows:

S_generation = γ1 (1/|s1|) log p(s1|s2) + γ2 (1/|s2|) log p(s2|s1). (3)

A sentence pair with the same meaning should have a higher value of Eq. 3.
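Eq. 3 is straightforward to compute given the candidate-trained SEQ2SEQ model. In this sketch, `logp(x, y)` is an assumed helper returning log p(x|y) under that model, and equal γ weights are a placeholder (the paper learns them in the final ranker):

```python
def mutual_generation_score(s1, s2, logp, gamma1=0.5, gamma2=0.5):
    """Length-normalized mutual decoding score of Eq. (3).

    `logp(x, y)` is an assumed wrapper returning log p(x | y) under the
    SEQ2SEQ model trained on decoded candidate pairs.
    """
    n1, n2 = len(s1.split()), len(s2.split())
    return gamma1 * logp(s1, s2) / n1 + gamma2 * logp(s2, s1) / n2
```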

Final Ranking Model
The final ranking score is a linear combination of the scores above:

S_final = α · S_context-LM + β · S_diversity + γ · S_generation,

where α, β and γ are weight vectors over the component scores (eight parameters in total). We build a ranking model to learn these weights. To train the ranking model, we annotate a small proportion of data on Amazon Mechanical Turk. A Turker is first given a sentence (denoted by a) randomly picked from the candidate pool. The Turker is then given two other decoded sentences (b1 and b2) and asked to decide which one is a better paraphrase of a, in terms of three aspects: (1) semantics: whether the two sentences have the same meaning; (2) diversity: whether the two sentences are diverse in expression; and (3) fluency: whether the generated paraphrase is fluent. Ties are allowed and are later removed. We labeled a total of 2K pairs. Let b+ denote the paraphrase judged better by annotators, and b− denote the other. Based on the labeled dataset, a simple pairwise ranking model (Liu, 2011) is built for weight learning. It is worth noting that the filtering module provides a channel for developers to intervene with the data generation process, as developers can design their own scoring functions to generate paraphrases with specific features. This leads to a more controllable paraphrase model.
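The weight learning over b+/b− preferences can be sketched as a simple perceptron-style pairwise ranker. The exact loss and feature layout below are our assumptions, not the paper's specification:

```python
def train_pairwise_ranker(pairs, lr=0.1, epochs=50):
    """Learn linear weights from labelled preference pairs.

    Each item in `pairs` is (features_better, features_worse): the score
    vectors of the preferred paraphrase b+ and the rejected one b-.
    Whenever b+ is not ranked above b-, nudge the weights toward b+.
    """
    dim = len(pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for f_pos, f_neg in pairs:
            margin = sum(wi * (p - n) for wi, p, n in zip(w, f_pos, f_neg))
            if margin <= 0:  # wrong order (or tie): perceptron update
                for i in range(dim):
                    w[i] += lr * (f_pos[i] - f_neg[i])
    return w

def final_score(w, features):
    """Linear combination of the component scores with learned weights."""
    return sum(wi * fi for wi, fi in zip(w, features))
```

In practice the feature vector would hold the context-LM, diversity and generation components, giving the eight learned parameters mentioned above.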

Paraphrase Model Training
We select 10 million paraphrase pairs in total based on the criteria above, on which we train a SEQ2SEQ model for paraphrase generation, using one sentence of each pair as the input and the other as the output. We use Transformer-base (Vaswani et al., 2017) as the model backbone, optimized with Adam (Kingma and Ba, 2014) with a learning rate of 1e-4, β1 = 0.9, β2 = 0.98 and 4K warmup steps. The trained model can be used directly for paraphrase generation in the unsupervised setup (Roy and Grangier, 2019). For the supervised setup (Witteveen and Andrews, 2019; Kazemnejad et al., 2020; Hegde and Patil, 2020), where we have labeled pairs of source sentences and their paraphrases, we fine-tune the pretrained model on the supervised paraphrase pairs: we initialize the model with the pretrained one and run additional iterations on the supervised dataset, again using Adam with β1 = 0.9 and β2 = 0.98. Batch size, learning rate and the number of iterations are treated as hyperparameters, tuned on the dev set.
It is worth noting that the SEQ2SEQ model here differs from the SEQ2SEQ model in the filtering stage: the model here is trained on the remaining (filtered) paraphrase pairs and used for direct paraphrase generation, while the other is trained on the noisy candidate pairs and used for candidate filtering.

Datasets
We carry out experiments in both supervised and unsupervised setups. For the unsupervised setting, we use the Quora, Wikianswers (Fader et al., 2013), MSCOCO (Lin et al., 2014) and Twitter (Lan et al., 2017) datasets. For the supervised setting, we use the Quora and Wikianswers datasets.
• Quora: the Quora question pair dataset contains 140K parallel paraphrases and 260K non-parallel sentences. We follow the standard setup in Miao et al. (2019), where 3K and 30K paraphrase pairs are respectively used for validation and test.
• Twitter: the Twitter dataset (Lan et al., 2017), which originally contains 50K paraphrase pairs. We follow the standard data split.

Baselines and Metrics
We compare our proposed ConRPG model to the following existing paraphrase generation models. Unsupervised paraphrase generation baselines include:
• UPSA: treats unsupervised paraphrase generation as an optimization problem, with an objective combining semantic similarity, expression diversity and language fluency, optimized using simulated annealing.
• Corruption: Hegde and Patil (2020) proposed a strategy of corrupting input sentences by removing stop words and randomly shuffling and replacing 20% of the remaining words. We use BART (Lewis et al., 2019) as the backbone to generate targets given the corrupted inputs.
Results for VAE, Lag VAE, CGMH and UPSA on different datasets are copied from Miao et al. (2019). Supervised paraphrase generation baselines include:
• VAE-SVG-eq: both the encoder and the decoder are conditioned on the source input sentence so that more consistent paraphrases can be generated.
• Pointer: See et al. (2017) augmented the standard SEQ2SEQ model with a pointer mechanism which can copy source words from the input rather than decode from scratch.
Results for ResidualLSTM, VAE-SVG-eq, Pointer and Transformer on the various datasets are copied from prior work. For reference purposes, we also implement a BT baseline inspired by the idea of back-translation (Sennrich et al., 2016), using Transformer-large as the backbone. BT is trained end-to-end on WMT'14 En↔Fr. A paraphrase pair is obtained by pairing the English sentence in the original dataset with the translation of the French sentence; we then train a Transformer-large model on these paraphrase pairs. We evaluate all models using BLEU (Papineni et al., 2002), iBLEU (Sun and Zhou, 2012) and ROUGE scores (Lin, 2004). The iBLEU score penalizes the similarity of the generated paraphrase to the original input sentence. Concretely, the iBLEU score of a triple of sentences (s, r, c) is given by:

iBLEU(s, r, c) = α · BLEU(c, r) − (1 − α) · BLEU(c, s),

where s is the input sentence, r is the reference paraphrase and c is the generated paraphrase. α is set to 0.8 following prior works.
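The iBLEU computation is easy to reproduce given any sentence-level BLEU scorer (e.g., from sacrebleu); in this sketch the `bleu` callable is injected rather than implemented:

```python
def ibleu(source, reference, candidate, bleu, alpha=0.8):
    """iBLEU (Sun and Zhou, 2012): reward similarity to the reference
    while penalizing copying from the source.

    `bleu(hyp, ref)` is any sentence-level BLEU scorer in [0, 1];
    alpha=0.8 follows the setting used in the paper.
    """
    return (alpha * bleu(candidate, reference)
            - (1 - alpha) * bleu(candidate, source))
```

A candidate identical to its source is penalized even when it matches the reference, which is exactly the diversity pressure the metric is meant to add.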

In-domain Results
We first show the in-domain results in Table 1. As can be seen, across all datasets, the proposed ConRPG model significantly outperforms baselines in both supervised and unsupervised settings. For the supervised setting, ConRPG yields an approximately 2-point gain across the evaluation metrics over the strong DNPG baseline on both Quora and Wikianswers. We also observe that the BT model achieves competitive results, showing that back-translation can serve as a simple yet strong baseline for paraphrase generation. For the unsupervised setting, we observe substantial performance boosts from ConRPG over existing unsupervised methods, including the state-of-the-art UPSA model. It is also surprising that unsupervised ConRPG outperforms the supervised VAE-SVG-eq model and achieves results comparable to supervised baselines such as Transformer.

Domain-adapted Results
We test the domain adaptation ability of the proposed method on the Quora and Wikianswers datasets. Results are shown in Table 3. We can see that ConRPG significantly outperforms baselines in both settings, i.e. Quora→Wikianswers and Wikianswers→Quora, showing the better ability of ConRPG for domain adaptation.

Human Evaluation
To further validate the performance of the proposed model, we sample 400 sentences from the Quora test set for human evaluation. We assign each input sentence and its generated paraphrase to three human annotators at Amazon Mechanical Turk (AMT) with a > 95% HIT approval rate. Turkers are asked to evaluate the quality of generated paraphrases along the three aspects of semantics, diversity and fluency, as detailed in Section 3.3.4. Each paraphrase is labeled on a 5-point scale (Strongly Agree, Agree, Unsure, Disagree, Strongly Disagree) and assigned to three annotators. We evaluate three models: BT, Corruption, and the proposed ConRPG model. The Cohen's kappa scores (McHugh, 2012) for the three aspects are 0.55, 0.52 and 0.49, indicating moderate inter-annotator agreement.

Table 6 presents the influence of the context length used to train the context-LM on Wikianswers. As can be seen, performance is sensitive to context length, which can be explained by the fact that more context leads to significantly better language modeling. Table 7 presents the impact of the percentage of selected paraphrase pairs in the filtering process on the final performance on Wikianswers. We tune the ratio ρ, defined as the number of remaining paraphrase pairs divided by the number of input contexts for the context-LM. ρ = 1 is what we use in this work: selecting the top-1 paraphrase pair for each input context makes the number of remaining pairs equal to the number of input contexts. As expected, either too few or too many selected paraphrase pairs leads to worse performance: too few pairs lead to insufficient training, while too many introduce noise that harms the final performance. Balancing the percentage of selected paraphrase pairs is thus crucial for good final performance.

Effects of Different Modules
We are interested in the effectiveness of each module within the proposed framework. Table 8 shows the performance: (1) Removing the entire filtering module leads to the largest degradation in performance, which is in line with our expectation: with filtering, high-quality paraphrase pairs that both share the same meaning and are lexically diverse can be selected for training the final paraphrase generation model.
(2) Removing the backward objectives, i.e., p(c_{<i} | c_i, c_{>i}) and p(c_{>i} | c_{<i}, c_i), leads to the second largest performance reduction. This is because removing them greatly weakens the strength of the context regularization, introducing more noise into the subsequent paraphrase filtering phase.
(4) Removing the diversity score or the generation score also harms model performance. This observation verifies that using scores from different aspects significantly improves paraphrase quality.

Conclusion
In this paper, we propose ConRPG, a paradigm for paraphrase generation using contexts as a regularizer. ConRPG is based on the assumption that the probabilities of generating two sentences with the same meaning given the same context should be the same. We acknowledge that the current system is rather complicated, requiring multiple pipeline stages and modules to build. We will simplify the system in future work.

Table: example paraphrases generated by Corruption and ConRPG for sample inputs, with columns Input, Corruption and ConRPG.