Pushing Paraphrase Away from Original Sentence: A Multi-Round Paraphrase Generation Approach

In recent years, neural paraphrase generation based on Seq2Seq has achieved superior performance; however, the generated paraphrases still lack diversity. In this paper, we focus on improving the diversity between the generated paraphrase and the original sentence, i.e., making the generated paraphrase as different from the original sentence as possible. We propose BTmPG (Back-Translation guided multi-round Paraphrase Generation), which leverages multi-round paraphrase generation to improve diversity and employs back-translation to preserve semantic information. We evaluate BTmPG on two benchmark datasets. Both automatic and human evaluation show BTmPG can improve the diversity of paraphrases while preserving the semantics of the original sentence.


Introduction
Paraphrase generation, or sentence paraphrasing, is an important task in natural language processing: it requires rewriting a sentence while preserving its semantics. Paraphrase generation has been widely used in many downstream tasks such as QA systems, semantic parsing, and dialogue systems.
In recent years, deep learning techniques like sequence-to-sequence (Seq2Seq) models have achieved superior performance on natural language generation tasks (Zhao et al., 2010; Wubben et al., 2010). Many paraphrase models based on Seq2Seq have achieved inspiring results. For example, Prakash et al. (2016) leveraged stacked residual LSTM networks to generate paraphrases, and Gupta et al. (2018) proposed a deep generative framework based on the variational auto-encoder for paraphrase generation.
Though paraphrase generation models based on Seq2Seq have demonstrated advanced ability, the generated paraphrases still lack diversity, i.e., the output paraphrase makes only trivial changes to the original sentence. A good paraphrase of a sentence is one that is semantically similar to that sentence while being (very) syntactically and/or lexically different from it (Bhagat and Hovy, 2013). A paraphrase that is too similar to the original sentence is much less useful in many real applications.
In this paper, we focus on improving the diversity of generated paraphrases, i.e., making the generated paraphrase as different from the original sentence as possible. An intuitive but uninvestigated idea is to adopt multi-round paraphrase generation. Concretely, we first feed the original sentence into a paraphrase generation model to generate a paraphrase, and then use the generated paraphrase as the input of the model to generate a new paraphrase. As long as we leverage a paraphrase generation model with strong diversity, such as a variational auto-encoder (VAE) (Kingma and Welling, 2013), we can obtain a paraphrase as different as possible from the original sentence after multiple rounds of generation.
However, existing paraphrase models cannot ensure that the major semantics of the original sentence are preserved after multi-round paraphrase generation, especially models with strong diversity. As the number of paraphrasing rounds increases, the generated sentence becomes more and more different from the original sentence, and its semantics gradually drift away as well. To tackle this problem, we introduce back-translation to maintain the semantics of the paraphrase. Back-translation, which translates the generated sentence back into the original sentence, has been widely used in semi-supervised natural language generation (Zhao et al., 2020) and data augmentation (Li et al., 2020), and it can improve the robustness of machine translation systems (Li and Specia, 2019). We assume that a paraphrase with similar semantics can be translated back to the original sentence, so we can leverage back-translation to provide guidance for multi-round paraphrase generation.
Particularly, we propose Back-Translation guided multi-round Paraphrase Generation (BTmPG), which combines a neural paraphrase model with back-translation to generate paraphrases in a multi-round process. The contributions of our work are summarized as below: 1) We propose a new multi-round paraphrase generation method to generate diverse paraphrases that differ substantially from the original sentence, and leverage back-translation to preserve the major semantics during multi-round paraphrase generation. Our code is publicly available at https://github.com/L-Zhe/BTmPG.
2) Automatic and human evaluation results demonstrate that our method can substantially improve the diversity of generated paraphrases while preserving the semantics during multi-round paraphrase generation.

Related Work
Paraphrase generation or sentence paraphrasing can be seen as a monolingual translation task. Prakash et al. (2016) leveraged stacked residual LSTM networks to generate paraphrases. Gupta et al. (2018) found that deep generative models such as the variational auto-encoder can achieve better performance in paraphrase generation. Li et al. (2019) proposed DNPG to decompose a sentence into sentence-level and phrase-level patterns to make neural paraphrase generation more interpretable and controllable, and found DNPG can be adapted into an unsupervised domain adaptation method for paraphrase generation. Fu et al. (2019) proposed a new paraphrase model with latent bag of words. Wang et al. (2019) found that adding semantic information into the paraphrase model can significantly boost performance. Siddique et al. (2020) proposed an unsupervised paraphrase model within a deep reinforcement learning framework. Liu et al. (2020) regarded paraphrase generation as an optimization problem and proposed a sophisticated objective function. All the methods above focus on the generic quality of paraphrases and do not address the diversity of paraphrases.
There are also some methods focusing on improving the diversity of paraphrases. Gupta et al. (2018) leveraged a VAE to generate several different paraphrases by sampling the latent space. Kumar et al. (2019) provided a novel formulation of the problem in terms of monotone submodular function maximization to generate diverse paraphrases. Goyal and Durrett (2020) used syntactic transformations to softly "reorder" the source sentence and guide the paraphrase model. Thompson and Post (2020) introduced a simple paraphrase generation algorithm which discourages the production of n-grams that are present in the input, to prevent trivial copies or near copies. Note that the purpose of these works (Gupta et al., 2018; Kumar et al., 2019; Goyal and Durrett, 2020) is different from ours, while Thompson and Post (2020) shares the same purpose as our work, i.e., pushing the generated paraphrase away from the original sentence.

Model
In this section, we introduce the components of our model in detail. First, we define the paraphrase generation task and give an overview of our model. Next, we describe the paraphrase model and the back-translation model. Then, we show how to use the gumbel-softmax to connect the paraphrase model with the back-translation model. Finally, we describe the loss function and training process of our model in detail. Figure 1 shows an overview of our model.

Notations and Overview
Our model regards paraphrase generation as a monolingual translation task. We are given a paraphrase pair (S_0, P), where S_0 is the original/source sentence and P is the target paraphrase given in the dataset. As shown in Figure 1, we introduce a multi-round paraphrasing method. In the first round of generation, we feed S_0 into a paraphrase model to generate a paraphrase S_1. In the second round, we use S_1 as the input of the model to generate a new paraphrase S_2. And so forth: in the i-th round, we feed S_{i-1} into the paraphrase model to generate S_i.
Although multi-round generation can increase paraphrase diversity, the semantics of the paraphrase may change during generation. We thus introduce back-translation to tackle this problem, based on the assumption that a paraphrase whose semantic information has not changed can be translated back to the original sentence. In the first round, we calculate the loss between S_1 and P to train our paraphrase model. In the i-th round, we feed the generated paraphrase S_i into a back-translation model to produce Ŝ_i, and we optimize the cross-entropy loss between Ŝ_i and S_0. The back-translation model, which translates the i-th round paraphrase back to the original sentence, guides the paraphrase model to preserve semantics during multi-round generation.
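As a sketch, the training-time interaction of the two models can be summarized as below (the function names and the loss interface are ours; the actual paraphrase and back-translation networks are described in the following subsections):

```python
def multi_round_step(s0, target_p, paraphrase_model, bt_model, xent, rounds=2):
    """One BTmPG training step (sketch). Round 1 is supervised against the
    gold paraphrase P; later rounds are guided by back-translation to S_0."""
    losses = []
    s_prev = s0
    for i in range(1, rounds + 1):
        s_i = paraphrase_model(s_prev)          # round-i paraphrase
        if i == 1:
            losses.append(xent(s_i, target_p))  # loss between S_1 and P
        else:
            s_bt = bt_model(s_i)                # translate S_i back toward S_0
            losses.append(xent(s_bt, s0))       # cross-entropy against S_0
        s_prev = s_i                            # next round consumes S_i
    return sum(losses)
```

At inference time, the same loop is run without the loss terms, and any intermediate `s_i` can be returned as the round-i paraphrase.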
In addition, we introduce gumbel-softmax embedding to tackle the problem that a model with a sampling operation between different rounds of generation cannot be optimized by an SGD optimizer.

Paraphrase Model
We require sufficient diversity from the paraphrase model so that it is able to introduce enough changes in the paraphrase of each round. The VAE (Kingma and Welling, 2013; Rezende et al., 2014) is a deep generative model that allows learning rich, nonlinear representations of high-dimensional inputs. It can improve diversity by sampling from the latent space. Bowman et al. (2016) first applied the VAE to natural language generation. Our paraphrase model is based on a conditional VAE with LSTMs. The Transformer (Vaswani et al., 2017) has achieved excellent performance in many tasks, but our experiments show that it may cause the KL divergence to become 0, a phenomenon called posterior collapse, which means a decrease in diversity. So we do not employ the Transformer as encoder and decoder.
We define the embedding matrices of S_i and P as E_s^i = {e_s^1, e_s^2, · · · , e_s^{L_i}} and E_p = {e_p^1, e_p^2, · · · , e_p^M} respectively, where e_s^j, e_p^j ∈ R^{d_e} are the embedding vectors of the words in S_i and P, and d_e is the embedding dimension.

Encoder
The conditional VAE contains two encoders that share parameters: an original-sentence encoder and a paraphrase encoder. We first feed E_s^i into the original-sentence encoder to obtain its encoding h_s^i. Then we feed E_p and h_s^i into the paraphrase encoder to obtain its vector representation h_z. h_z is passed through two different feed-forward neural networks with parameters Φ to produce the mean μ and the variance σ² of the distribution of the latent space. We obtain the latent code z ∈ R^{d_z} by sampling from the latent space with reparameterization, where d_z is the dimension of the latent code.
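The sampling step above is the standard reparameterization trick; written out in the notation already defined (the names f_μ and f_σ for the two feed-forward networks are ours):

```latex
\mu = f_\mu(h_z; \Phi), \qquad
\log \sigma^2 = f_\sigma(h_z; \Phi), \qquad
z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I_{d_z})
```

Sampling the noise ε rather than z itself keeps the path from Φ to z differentiable, so the encoder can be trained by backpropagation.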

Decoder
We define the embedding matrix fed into the decoder as E_d = {e_d^1, e_d^2, · · · , e_d^N} ∈ R^{d_e×N}. Then, we concatenate z with each embedding vector e_d^i as the input of the decoder. The decoder also takes h_s^i as input. On top of the decoder outputs, an attention mechanism (Luong et al., 2015) and a copy mechanism (See et al., 2017) are leveraged as follows. First, we compute the attention weight p_a and the attention vector V_a.
Then, we leverage them to calculate the decoder probability p_d and the copy probability η, where || denotes the concatenation operation and σ is the sigmoid activation function. The final output probability of the decoder is as follows.
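The copy-mechanism equations were elided in extraction; a plausible reconstruction in the style of See et al. (2017), using the symbols already defined (the gate parameters W_η, b_η and the decoder state h_d are hypothetical names, and the interpolation form is our assumption):

```latex
\eta = \sigma\!\left(W_\eta \,[\, h_d \,\|\, V_a \,] + b_\eta\right), \qquad
p(w) = \eta\, p_d(w) + (1-\eta) \sum_{j:\, x_j = w} p_a^{\,j}
```

Here x_j ranges over the input tokens, so the second term lets the decoder copy words directly from the source sentence with probability mass taken from the attention distribution p_a.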

Loss Function of Paraphrase Model
The VAE with parameters Θ is trained by minimizing the following objective, where KL stands for the KL divergence. Eq. 4 is called the evidence lower bound (ELBO), which provides a lower bound of log p(P|S_i; Θ). Bowman et al. (2016) pointed out that variational inference for text generation often yields models that ignore their latent variables, a phenomenon called posterior collapse. This may cause low diversity of the generated sentences. To tackle this problem, we propose a diversity loss. We find that the diversity of a generated sentence is strongly affected by its first word; for example, the first word can determine the form of a question. Unfortunately, compared with questions beginning with "Is, May, Would", we are much more likely to collect questions beginning with "What, When, How". This leads to serious category imbalance when generating the first word. So we set the penalty coefficient of the first-word loss as follows.
where N_b is the batch size during the training process, n_{w_1} is the number of sentences beginning with w_1 in this batch, and e is Euler's number, which ensures the penalty coefficient is always no less than 1.
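Both equations referenced above were lost in extraction. The training objective is presumably the standard conditional ELBO, and one penalty coefficient consistent with the description (involving e and guaranteed to be at least 1, since n_{w_1} ≤ N_b) would be the following; both expressions are our reconstruction, not the paper's verbatim formulas:

```latex
\mathcal{L}_{vae}(\Theta, \Phi) =
  -\,\mathbb{E}_{z \sim q_\Phi(z \mid S_i, P)}\big[\log p_\Theta(P \mid z, S_i)\big]
  + \mathrm{KL}\big(q_\Phi(z \mid S_i, P) \,\big\|\, \mathcal{N}(0, I)\big)
```

```latex
\alpha_{w_1} = \ln\!\left(e \cdot \frac{N_b}{n_{w_1}}\right)
             = 1 + \ln\frac{N_b}{n_{w_1}} \;\ge\; 1
```

The coefficient α_{w_1} would multiply the cross-entropy term of the first token, upweighting sentences whose first word is rare in the batch.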

Back-Translation Model
The back-translation model aims to ensure the semantics of the generated paraphrase stay the same as the original sentence during multi-round generation: it translates S_i back to S_0. Different from the paraphrase model, which needs diversity, the back-translation model focuses on preserving semantics. We employ a Transformer (Vaswani et al., 2017) with copy mechanism as the back-translation model because of its excellent performance in many tasks.
The loss function of the back-translation model is as follows, where λ is a hyper-parameter and BTModel denotes the back-translation model.
There are two parts in the loss of the back-translation model: L_s^i and L_p. We assume the i-th round paraphrase can be translated back to the original sentence S_0 if its semantics are preserved, and thus we optimize L_s^i. Similarly, the paraphrase P can be translated back to the original sentence S_0 as well, so we also leverage L_p to train the back-translation model. This improves the generalization ability of the back-translation model, because the back-translation model tends to guide the paraphrase model to copy the original sentence without changes if we do not employ true paraphrase data to train it.
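The loss equations themselves are missing from the extracted text; a reconstruction consistent with the description (CE denotes token-level cross-entropy) would be:

```latex
\mathcal{L}_s^i = \mathrm{CE}\big(\mathrm{BTModel}(S_i),\, S_0\big), \qquad
\mathcal{L}_p = \mathrm{CE}\big(\mathrm{BTModel}(P),\, S_0\big), \qquad
\mathcal{L}_{bt} = \mathcal{L}_s^i + \lambda\, \mathcal{L}_p
```

The hyper-parameter λ balances guidance from the model's own round-i paraphrase against guidance from the gold paraphrase data.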

Gumble-Softmax Embedding
We employ gumbel-softmax embedding to connect the modules of our model. We first define an embedding operation as follows. For the probability p generated by the paraphrase model, we leverage the gumbel-softmax (Jang et al., 2017) to obtain its one-hot matrix without sampling from a multinomial distribution. Then we obtain the embedding matrix E as follows, where π is a k-dimensional multinomial distribution, g_1, g_2, · · · , g_k are i.i.d. samples drawn from Gumbel(0, 1), and τ is a hyper-parameter. There are three places in our model that leverage gumbel-softmax embedding. First, we use it to embed the output probability of the paraphrase model as the input of the next-round paraphrase model. Next, gumbel-softmax embedding is also used to connect the back-translation model with the paraphrase model. Figure 1 shows these two cases. Finally, it is used in the multi-round paraphrase generation process to replace teacher forcing. Generally, a Seq2Seq model employs teacher forcing for training, using the ground truth to guide the generation process. However, there is no ground truth in multi-round paraphrase generation; the model can only generate sentences autoregressively. We employ the gumbel-softmax to replace sampling in each step of the autoregressive process. Figure 2 shows this process.
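A minimal NumPy sketch of the gumbel-softmax relaxation used here (in the actual model this would be a differentiable op inside the training graph; the function and variable names are ours):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Soft one-hot sample from a categorical distribution (Jang et al., 2017).
    As tau -> 0 the output approaches a hard one-hot vector; larger tau is smoother."""
    rng = np.random.default_rng() if rng is None else rng
    # g_i ~ Gumbel(0, 1), via the inverse-CDF trick -log(-log(U)), U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(low=1e-10, high=1.0, size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))  # numerically stable softmax
    return y / y.sum(axis=-1, keepdims=True)

# The "gumbel-softmax embedding" then mixes word vectors with these soft weights,
# e.g. E = y @ W_embed (W_embed hypothetical), which stays differentiable,
# unlike a hard argmax lookup into the embedding table.
```

Because the relaxation replaces the discrete sampling step between rounds, gradients can flow from the back-translation model into the paraphrase model.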

Loss Function
We train the paraphrase model together with the back-translation model; the total loss of our model is as follows. Although we define a multi-round paraphrase model, we only train the first two rounds, because we find that training more rounds requires large computing resources but does not improve the model performance significantly. During inference, we can generate paraphrases for more than two rounds.
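The total-loss equation did not survive extraction. Given the two components defined in the preceding subsections, it is presumably a simple sum (the symbol names here are ours):

```latex
\mathcal{L} = \mathcal{L}_{para} + \mathcal{L}_{bt}
```

where L_para is the paraphrase-model loss (the VAE objective with the first-word diversity penalty) and L_bt is the back-translation loss, whose internal hyper-parameter λ already balances its two terms.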

Datasets
We evaluate our BTmPG model on two benchmark datasets. MSCOCO (Lin et al., 2014) contains human-annotated captions of over 120k images; each image has five captions from five different annotators. This dataset has been widely used in previous works (Prakash et al., 2016; Gupta et al., 2018; Cao and Wan, 2020). We sample MSCOCO following Prakash et al. (2016).
The Quora dataset is a question paraphrase dataset containing over 400k question pairs. Each pair is marked with a binary value indicating whether the two questions are truly duplicates of each other, so we select all question pairs with binary value 1 as the paraphrase dataset, about 150k question pairs in total. We randomly divide the training, validation and test sets. Table 1 provides statistics of the two benchmark datasets.

Evaluation Metrics
We use five widely-used metrics to evaluate paraphrases: BLEU4, self-BLEU, self-TER, BERTScore and p-BLEU. BLEU4 is widely used in generation tasks and measures how well the sentences generated by our model match the references. Some works also report ROUGE (Lin, 2004) or METEOR, but the role of these two metrics overlaps with BLEU4, as they all measure the degree of overlap between outputs and references. We therefore only report BLEU4 to evaluate the match between outputs and references.
We evaluate the difference between the output sentence and the original sentence with two metrics. One is self-BLEU, the BLEU4 score between the output sentence and the original sentence: the lower the self-BLEU, the more different the output is from the original. The other is self-TER. TER (Zaidan and Callison-Burch, 2010) evaluates the edit distance between two sentences; self-TER is the TER between the output sentence and the original sentence.
BERTScore was proposed by Zhang et al. (2020) to evaluate the semantic similarity between two sentences, and it has been widely used to measure semantic preservation in paraphrase generation (Cao and Wan, 2020). However, BERTScore may be problematic on our task, since even reference paraphrases receive relatively low scores; BERTScore is not perfect in measuring semantic relevance. But as far as we know, there is no better metric for evaluating semantic preservation, so we report BERTScore as a reference. Further evaluation of semantic relevance is given in the human evaluation.
We leverage p-BLEU (Cao and Wan, 2020) to evaluate the difference between outputs in different rounds. Concretely, for the outputs of k rounds {y_1, y_2, · · · , y_k}, p-BLEU is calculated as follows.
The lower p-BLEU means higher diversity between outputs in different rounds.
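The p-BLEU formula itself is missing above; taking it as the average pairwise BLEU between the k outputs (our reconstruction of Cao and Wan (2020)'s metric), it reads:

```latex
\text{p-BLEU} = \frac{1}{k(k-1)}
  \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} \mathrm{BLEU}(y_i, y_j)
```

Under this form, identical outputs in every round give p-BLEU of 100, and fully disjoint outputs drive it toward 0, matching the interpretation that lower p-BLEU means higher pairwise diversity.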
Notice that BLEU4 may not be suitable for our task, because we focus on the diversity of paraphrases: BLEU4 only measures how well outputs match the references, while a sentence usually has many valid paraphrases and the dataset provides only one reference. So we also perform human evaluation to assess the semantic relevance, readability and diversity of generated paraphrases.

Baseline
As our model focuses on the diversity of paraphrases, we mainly compare it with VAE-SVG-eq (Gupta et al., 2018), DiPS (Kumar et al., 2019), SOW-REAP (Goyal and Durrett, 2020) and the decoding method proposed by Thompson and Post (2020). The last method penalizes n-grams appearing in the original sentence to make the paraphrase different from the original sentence and enhance diversity; we mark this method as N-gram Penalty. We employ two different hyper-parameter settings provided by the authors: a low penalty on n-grams and a high penalty. In addition, we also compare our model with the Transformer and Transformer-copy.

Training Details
For both datasets, we truncate all the sentences longer than 20 words and maintain a vocabulary size of 25k. During testing, we replace UNK with the original word with the highest copy probability.
For the paraphrase model, we leverage a 2-layer LSTM. We set the embedding dimension d_e to 300, the hidden size d_h of the LSTM to 512, and the latent code dimension d_z to 128. For the back-translation model, we leverage Transformer-copy with a 3-layer encoder and decoder; we set the model size to 450 and the number of heads in multi-head attention to 9. We set λ to 1, which will be discussed in our ablation study. For the hyper-parameter τ in the gumbel-softmax, we follow Nie et al. (2019) and adjust τ over iterations via an exponential policy: τ = τ_max^{−n_e/N_e}, where n_e is the current epoch and N_e is the total number of epochs. We set τ_max to 5. We train our model for 30 epochs with a batch size of 50, and we select the model of the final epoch to generate paraphrases on the test set.

Table 2 shows the results of automatic evaluation. Our model substantially improves BERTScore in the first round of paraphrase generation and generally achieves state-of-the-art performance. The value of self-BLEU is significantly reduced as the number of paraphrase generation rounds increases, while semantics are maintained.

Automatic Evaluation
For both datasets, the first-round paraphrases of our model achieve the highest BERTScore among all models. This is because the back-translation model provides sufficient semantic guidance for the paraphrase model. As the round number increases, the values of self-BLEU and self-TER drop significantly, which means the paraphrases our model generates become more and more different from the original sentences, while BERTScore still maintains a relatively high value. (A slight reduction of BERTScore is acceptable, as BERTScore is not perfect in measuring semantic relevance.) We find that the paraphrase generated in the fifth round strikes a good balance between diversity and relevance.
DiPS obtains a BERTScore similar to our round-5 generation, while its outputs lack diversity compared with ours. To explore the pairwise diversity of our model's outputs in different rounds, we also calculate the p-BLEU values for VAE-SVG-eq and our model (p-BLEU is not suitable for the other models). For VAE-SVG-eq, we generate 10 outputs by randomly sampling the latent space; for our model, we take the outputs of the first 10 rounds. Table 3 shows the results: the p-BLEU value of our model is much lower than that of VAE-SVG-eq, which means our model is better at generating multiple diversified paraphrases.

Ablation Study
In this section, we explore the role of the back-translation model in preserving semantics. We vary the hyper-parameter λ from 0 to 5; a bigger λ means back-translation provides more semantic guidance to the paraphrase model, and λ = 0 means we remove the back-translation model entirely. We generate paraphrases for 20 rounds and calculate the values of BERTScore. To explore the effect of leveraging another paraphrase model in the multi-round generation framework, we also run VAE-SVG-eq in a multi-round generation process for 20 rounds on Quora and compute the values of BERTScore. Figure 3 shows the trend of BERTScore as the round number increases. Compared with VAE-SVG-eq, our improved VAE model clearly preserves semantics better. Back-translation substantially improves the lower bound of BERTScore, which means it helps preserve the semantics during multi-round paraphrase generation.
We also calculate p-BLEU for the paraphrases of the first 10 rounds under different λ. Table 4 shows the results. Although back-translation helps preserve semantics, a higher λ leads to a lack of diversity in the paraphrases; it is therefore wise to select an appropriate λ according to the actual requirement.

Human Evaluation
We perform human evaluation of system outputs with respect to three aspects: relevancy, fluency and diversity. Relevancy indicates whether the semantics of the output and the original sentence are identical. Fluency indicates the readability of output sentences. Diversity indicates the lexical and syntactic differences between output sentences and original sentences; we thus use two indicators, for lexical and syntactic diversity respectively.
We randomly sample 100 sentences from each test set, yielding a total of 200 sentences for evaluation. We employ 6 graduate students to rate each instance and ensure every instance is rated by at least three judges. From the table, we can see that the first-round paraphrases preserve more semantics of the original sentences but lack diversity. As the round number increases, the relevancy score decreases slightly, but the diversity scores increase substantially. Fluency may be influenced by diversity, because humans may perceive a slight decrease in fluency as diversity increases. Compared with other models, our model can generate paraphrases with high diversity while maintaining semantics and fluency well. Previous models like SOW-REAP and DiPS cannot maintain the semantics, though they can produce paraphrases with relatively high diversity.

Case Study
We perform case studies to better understand model performance. Table 6 shows an example from Quora, including the paraphrases of the first 15 rounds. This case shows how our model modifies sentences during the multi-round paraphrase generation process: as the round number increases, the difference between the generated paraphrase and the original sentence becomes larger, while the paraphrase still preserves the major semantics of the original sentence.

Conclusion
In this paper, we focus on improving the diversity of generated paraphrases, i.e., making the generated paraphrase as different from the original sentence as possible. We propose a multi-round paraphrase generation method, BTmPG, guided by back-translation. Both automatic and human evaluation results show that our method can generate diverse paraphrases while maintaining semantics, and the ablation study shows back-translation is very helpful for preserving semantics. In the future, we will explore other methods, such as GANs, to improve paraphrase diversity. We will also test our method on languages other than English.