Keep the Primary, Rewrite the Secondary: A Two-Stage Approach for Paraphrase Generation

Paraphrase generation is an important and challenging NLG problem. In this work, we propose a new Identification-then-Aggregation (IA) framework to tackle this task. In the identification step, the input tokens are sorted into two groups by a novel Primary/Secondary Identification (PSI) algorithm. In the aggregation step, these groups are separately encoded before being aggregated by a custom-designed decoder, which autoregressively generates the paraphrased sentence. In extensive experiments on two benchmark datasets, we demonstrate that our model outperforms previous studies by a notable margin. We also show that the proposed approach can generate paraphrases in an interpretable and controllable way.


Introduction
Paraphrases are texts (often sentences) that share the same meaning but differ in word choice and ordering. Automatic generation of paraphrases is a longstanding problem that is important to many downstream NLP applications such as question answering (Dong et al., 2017; Buck et al., 2018), machine translation (Cho et al., 2014), and semantic parsing (Su and Yan, 2017).

Most early research adopts the sequence-to-sequence model (Prakash et al., 2016; Cao et al., 2017; Li et al., 2018) to map the input text to its paraphrase, processing and generating each word in a uniform way. Rather than processing each word uniformly, some recent studies tackle this task in a decomposable manner. For instance, Li et al. (2019) adopt an external word aligner to extract paraphrasing patterns at different levels of granularity and then perform generation. Fu et al. (2019) first use source words to predict their neighbors and then organize the predicted neighbors into a complete sentence.

In this work, we investigate decomposable paraphrase generation from a different perspective. Specifically, we use a non-parametric approach to label each token in an input sentence as either (i) primary or (ii) secondary. Intuitively, the primary content of a sentence is the factual information that defines the shared meaning of the paraphrase pair; all other content is deemed secondary and typically controls the structure of the sentence. In practice, this distinction is made by an algorithm that decides whether tokens are primary or secondary, as described in §3. To illustrate our idea, Figure 1 shows examples sampled from the Quora and MSCOCO (Lin et al., 2014) datasets. We see that, in many cases, the paraphrase pairs maintain similar primary content (e.g., the phrases "baby elephants" and "baby elephant" in the first example) while the secondary content can be rephrased in several different ways.
Based on the above observation, we propose an Identification-then-Aggregation (IA) framework to address the paraphrase generation task. Given an input sentence, generating a paraphrase follows a two-stage process. First, the primary and secondary content of the input sentence is identified via a novel Primary/Secondary Identification (PSI) algorithm based on a common non-parametric rank correlation coefficient. Second, a new neural paraphrase generation model aggregates the identified information and generates the result. Specifically, the proposed model consists of (1) two encoders which separately process the identified primary and secondary content, and (2) an aggregation decoder which integrates the processed results and generates the paraphrased sentence.
We test the proposed approach on two benchmark datasets with automatic and human evaluation. The results show that our approach outperforms previous studies and can generate paraphrases in an interpretable and controllable way.

Related Work
The automatic generation of paraphrases is important for many downstream NLP applications and has attracted a number of different approaches. Early research includes rule-based approaches (McKeown, 1979; Meteer and Shaked, 1988) and data-driven methods (Madnani and Dorr, 2010). With the advances of neural networks, recent approaches treat this problem as a sequence-to-sequence language generation task. Prakash et al. (2016) proposed to modify the network structure to improve generation quality. Cao et al. (2017), Wang et al. (2019), and Kazemnejad et al. (2020) proposed to improve model performance by leveraging external resources, including a phrase dictionary, semantic annotations, and an off-the-shelf pre-trained neural retriever. Other works adopt techniques such as reinforcement learning (Li et al., 2018) and unsupervised learning (Roy and Grangier, 2019) for this task.
While achieving satisfactory results, the above methods do not offer users a way to control the generation process in a fine-grained manner. To incorporate controllability into the generation model, different approaches have been proposed. Iyyer et al. (2018) trained the model to produce a paraphrased sentence with a given syntax. Li et al. (2019) proposed to adopt an external word aligner to train the model to generate paraphrases at different levels of granularity. In Fu et al. (2019)'s work, the model generates paraphrases by planning the neighbors of words and then realizing the complete sentence.

Primary/Secondary Identification
Given an input sentence, our goal is to identify the primary content that is likely to appear in the paraphrased sentence. To this end, we propose a Primary/Secondary Identification (PSI) approach which dynamically evaluates the importance of different parts of the input sentence. The parts with high importance are deemed the primary content, while the remaining parts are deemed the secondary content.
Token Importance Formally, given a paraphrase pair X and Y, we define their pairwise similarity as F(X, Y). To determine the importance of the i-th token x_i of X in relation to Y, we first compute the pairwise similarity between X' = X ⊖ x_i and Y as F(X', Y), where the ⊖ operator removes the token x_i from X. We assume that if the token x_i belongs to the primary content that is maintained in both X and Y, then removing it from X will cause a significant drop in the pairwise similarity between X and Y. Based on this assumption, we measure the importance of x_i as the ratio of change in the pairwise similarity score:

    G(x_i; X, Y) = (F(X, Y) − F(X ⊖ x_i, Y)) / F(X, Y).    (1)

Intuitively, a higher G(x_i; X, Y) means a larger decrease in the pairwise similarity, indicating a higher importance of the token x_i, and vice versa.
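The importance score in Eq. (1) can be sketched as follows. This is an illustrative implementation: a simple token-overlap Jaccard similarity stands in for the paper's embedding-based F(·, ·), and the example sentences are hypothetical.

```python
from typing import Callable, List

def token_importance(i: int, X: List[str], Y: List[str],
                     F: Callable[[List[str], List[str]], float]) -> float:
    """G(x_i; X, Y): relative drop in pairwise similarity when x_i is removed."""
    X_minus = X[:i] + X[i + 1:]            # X' = X with token x_i removed
    base = F(X, Y)
    return (base - F(X_minus, Y)) / base   # ratio of change

def jaccard(X: List[str], Y: List[str]) -> float:
    """Stand-in similarity; the paper uses Spearman's rho over embeddings."""
    a, b = set(X), set(Y)
    return len(a & b) / len(a | b)

X = "what do baby elephants eat".split()
Y = "what is the diet of a baby elephant".split()
for i, tok in enumerate(X):
    print(f"{tok}\t{token_importance(i, X, Y, jaccard):+.3f}")
```

Note that removing a token shared by both sentences (e.g., "what") lowers the similarity and yields a positive score, while removing an unshared token can even raise the similarity, yielding a negative score.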
Similarity Measurement We now describe the details of the function F(·, ·). Inspired by Zhelezniak et al. (2019), we measure the pairwise similarity between X and Y based on a non-parametric rank correlation coefficient. Specifically, given X and Y, we first transform them into the representation matrices M(X) ∈ R^{|X|×D} and M(Y) ∈ R^{|Y|×D} via D-dimensional pretrained embeddings. Then, the matrices are mapped into fixed-size context vectors x̂ ∈ R^{1×D} and ŷ ∈ R^{1×D} via an element-wise max-pooling operation. Finally, the pairwise similarity F(X, Y) is measured using Spearman's correlation coefficient ρ of the context vectors x̂ and ŷ:

    F(X, Y) = ρ(x̂, ŷ) = cov(r[x̂], r[ŷ]) / (σ_{r[x̂]} σ_{r[ŷ]}),

where r[x̂_j] denotes the integer rank of x̂_j in the context vector x̂ (and similarly for r[ŷ_j]). For illustration, Table 1 shows sentences sampled from the Quora and MSCOCO datasets along with their pairwise similarities. The numerical results are highly correlated with human judgement, which empirically demonstrates the effectiveness of our measurement method.

Table 1: Examples of different sentence pairs (X, Y) and their corresponding pairwise similarity scores F(X, Y):
• "Two brown bears walking through a green, grassy area." (0.743)
• "A simple plain clear vase with a dead twig and water inside." (0.439)
• "A man using a phone next to a motorcycle." (0.128)
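The similarity function F can be sketched as below. Random toy vectors stand in for the pretrained D-dimensional embeddings, and the sentences are illustrative; Spearman's ρ is computed as the Pearson correlation of the integer ranks, matching the definition above.

```python
import numpy as np

def ranks(v: np.ndarray) -> np.ndarray:
    """Integer rank r[v_j] of each element within a vector."""
    return v.argsort().argsort()

def pairwise_similarity(X, Y, emb) -> float:
    """F(X, Y): Spearman's rho between the element-wise max-pooled
    context vectors of the two token sequences."""
    x_hat = np.stack([emb[t] for t in X]).max(axis=0)  # M(X) -> pooled context
    y_hat = np.stack([emb[t] for t in Y]).max(axis=0)  # M(Y) -> pooled context
    # Pearson correlation of integer ranks equals Spearman's rank correlation.
    return float(np.corrcoef(ranks(x_hat), ranks(y_hat))[0, 1])

# Toy random vectors stand in for pretrained embeddings (D = 300).
rng = np.random.default_rng(0)
X = "two brown bears walking through a grassy area".split()
Y = "a pair of brown bears in a green field".split()
emb = {t: rng.normal(size=300) for t in set(X) | set(Y)}

print(round(pairwise_similarity(X, Y, emb), 3))
```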
Algorithm 1: Primary/Secondary Identification
  Input: input sentence X = (x_1, ..., x_|X|); paraphrased sentence Y = (y_1, ..., y_|Y|); primary-content threshold α_p; importance measurement function G(·; ·, ·).

Putting this together, the detailed procedure for splitting the input sentence X into the primary content X_p and the secondary content X_s is given in Algorithm 1, where the token [MASK] is used as a special placeholder and the threshold α_p is tuned based on performance on the validation set. The joinmask(·) operation joins consecutive [MASK] tokens into a single [MASK] token. We note that the incorporation of the [MASK] token is crucial: it allows the generation model to access the original source-sentence structure by simply overlapping the primary and secondary content. In our experiments, we found that removing [MASK] from the identified content causes a significant drop in model performance, as the model can no longer access the original sentence structure.
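Algorithm 1 can be sketched as follows. The scoring function G is passed in as a callable, and the thresholding and masking behavior follow the description above; the exact handling of ties is our assumption.

```python
MASK = "[MASK]"

def joinmask(tokens):
    """Join runs of consecutive [MASK] tokens into a single [MASK]."""
    out = []
    for t in tokens:
        if t == MASK and out and out[-1] == MASK:
            continue
        out.append(t)
    return out

def psi(X, Y, G, alpha_p):
    """Split X into primary content X_p and secondary content X_s.
    Tokens with importance G(x_i; X, Y) >= alpha_p are primary; positions
    removed from each stream become [MASK] placeholders, so the original
    sentence structure can be recovered by overlapping the two streams."""
    scores = [G(i, X, Y) for i in range(len(X))]
    X_p = joinmask([t if s >= alpha_p else MASK for t, s in zip(X, scores)])
    X_s = joinmask([t if s < alpha_p else MASK for t, s in zip(X, scores)])
    return X_p, X_s

# Hypothetical precomputed importance scores for a 5-token sentence.
scores = [0.50, -0.10, 0.60, 0.40, 0.02]
X = "what do baby elephants eat".split()
X_p, X_s = psi(X, None, lambda i, X, Y: scores[i], alpha_p=0.1)
print(X_p)   # primary stream with [MASK] placeholders
print(X_s)   # secondary stream with [MASK] placeholders
```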
In Figure 2, we show the computed results from PSI of an example presented in Figure 1. We can see that the primary content is effectively identified.
Figure 2: For each token, the score from the PSI algorithm is presented. The words in red are the identified primary content; the remaining words make up the secondary content.

Inference During inference, given an input sentence, the primary and secondary content cannot be directly identified as X_p, X_s = PSI(X, Y), since we do not have access to the target sentence Y. To this end, we propose two alternative approaches. In the first, we simply identify the primary and secondary content using the input sentence X alone, as X_p, X_s = PSI(X, X). In the second, we train a neural sequence tagger S on the labels provided by PSI(X, Y), and then extract the content from the input as X_p, X_s = S(X). In the experiment section, we provide a more detailed comparison of these approaches.
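The sequence tagger S can be sketched as a small PyTorch module. The bidirectional architecture and embedding size here are illustrative assumptions; the paper only specifies a 2-layer LSTM with hidden size 512.

```python
import torch
import torch.nn as nn

class PrimaryTagger(nn.Module):
    """Sketch of the tagger S: labels each input token as primary (1)
    or secondary (0). It is trained on labels produced by PSI(X, Y)
    and applied to X alone at inference time."""
    def __init__(self, vocab_size, emb_dim=128, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(hidden, 2)   # per-token binary logits

    def forward(self, x):                 # x: (batch, seq) token ids
        h, _ = self.lstm(self.embed(x))
        return self.out(h)                # (batch, seq, 2)
```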

Neural Paraphrase Generator
Overview Given the input sentence X, it is first partitioned into the primary and secondary content using the PSI algorithm. Then the identified content is independently processed by the primary encoder and the secondary encoder. Finally, an aggregation decoder integrates the outputs from both encoders and generates the result. In Figure 3, we provide an illustration of the proposed framework.
Encoder Stacks In this work, we use the transformer architecture (Vaswani et al., 2017) to construct the primary and secondary encoders. Formally, Multi-Head Attention is defined as MultiHead(Q, K, V), where Q, K, and V are the query, key, and value. Each encoder has N_E layers. Given the input X, the first layer operates as

    O^(1) = FFN(MultiHead(E(X), E(X), E(X))),

where E(X) is the input sequence embedding and FFN(·) is a feed-forward layer. The other layers operate as

    O^(n) = FFN(MultiHead(O^(n−1), O^(n−1), O^(n−1))),  n = 2, ..., N_E.

Given the primary content X_p and the secondary content X_s of the input sequence, their representations O_{X_p} ∈ R^{|X_p|×d} and O_{X_s} ∈ R^{|X_s|×d} are computed by the primary and secondary encoders respectively, where d is the model size.
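A minimal sketch of one encoder stack following the equations above; residual connections and layer normalization, standard in transformers, are omitted for brevity, and the sizes are illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Each layer applies self-attention followed by a feed-forward layer:
    O^(n) = FFN(MultiHead(O^(n-1), O^(n-1), O^(n-1)))."""
    def __init__(self, vocab_size, d=256, heads=8, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d, heads, batch_first=True)
            for _ in range(n_layers))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
            for _ in range(n_layers))

    def forward(self, x):            # x: (batch, seq) token ids
        o = self.embed(x)            # E(X)
        for attn, ffn in zip(self.attn, self.ffn):
            a, _ = attn(o, o, o)     # MultiHead(Q=O, K=O, V=O)
            o = ffn(a)               # FFN(...)
        return o                     # (batch, seq, d)
```

Two such stacks, one for X_p and one for X_s, would produce the representations O_{X_p} and O_{X_s}.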

Decoder Stacks
We design an aggregation decoder to integrate information coming from both encoders. Given the target sentence Y, it is first encoded via masked multi-head attention as

    A_Y = MaskedMultiHead(E(Y), E(Y), E(Y)).    (7)

Then, the primary-content attention module attends over the encoded primary content O_{X_p},

    A_p = MultiHead(A_Y, O_{X_p}, O_{X_p}),    (8)

and the secondary-content attention module attends over the encoded secondary content O_{X_s},

    A_s = MultiHead(A_Y, O_{X_s}, O_{X_s}).    (9)

The first-layer output O^(1)_Y is then acquired by aggregating the two attention results through a feed-forward layer. The final output O^(N_D)_Y ∈ R^{|Y|×d} is computed via a stack of N_D such layers, and the final probability of Y is produced by a linear softmax operation.
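One aggregation-decoder layer can be sketched as follows. The summation used to aggregate the two cross-attention outputs is our assumption, and for simplicity a single model size d is shared with the encoders (the paper uses a larger decoder).

```python
import torch
import torch.nn as nn

class AggregationDecoderLayer(nn.Module):
    """Sketch of one aggregation-decoder layer: masked self-attention
    over the target prefix (Eq. 7), separate cross-attention over the
    primary and secondary encoder outputs (Eqs. 8-9), then aggregation
    (here by summation, an assumption) followed by a feed-forward layer."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.primary_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.secondary_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, y, o_p, o_s):
        t = y.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        a_y, _ = self.self_attn(y, y, y, attn_mask=causal)  # masked self-attn
        a_p, _ = self.primary_attn(a_y, o_p, o_p)           # attend O_{X_p}
        a_s, _ = self.secondary_attn(a_y, o_s, o_s)         # attend O_{X_s}
        return self.ffn(a_p + a_s)                          # aggregate + FFN
```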
Learning Finally, given the input primary content X_p, secondary content X_s, and the target sequence Y, the learning objective is defined as the negative log-likelihood

    L = − Σ_{t=1}^{|Y|} log P(y_t | y_{<t}, X_p, X_s).
Datasets

The Quora dataset was developed for the task of duplicate question detection. Each data instance consists of one source sentence and one target sentence. In the experiments, we randomly select one sentence as the source and the other as the target.
The MSCOCO dataset was originally developed for the image captioning task. In this dataset, each image is associated with five human-written captions. Although there is no guarantee that these captions are paraphrases, as they could describe different objects in the image, most captions are generally close to each other. The overall quality of this dataset is therefore favorable, and it is widely used for the paraphrase generation task.
Following Li et al. (2019) and Fu et al. (2019), for the Quora dataset, we split the data into training, validation, and test sets of 100k, 4k, and 20k pairs. The MSCOCO dataset is split into 93k, 4k, and 20k pairs. The maximum sentence length for these two datasets is set to 16. The vocabulary sizes of the Quora and MSCOCO datasets are set to 8k and 11k.

We evaluate the following variants of our model:

IANet+X: Given the input sentence X, this model extracts the primary and secondary content using the approximated PSI(X, X) algorithm. Then, the paraphrase generator produces the paraphrased sentence using the identified content.
IANet+S: In this case, a neural sequence tagger S is first trained on the labels provided by PSI(X, Y). During inference, the model extracts the primary and secondary content of the input as X_p, X_s = S(X) and then performs generation.
IANet+ref: In contrast to the previous variants, this model obtains the primary and secondary content using the exact PSI(X, Y) algorithm against the reference Y. We include this model because, besides our proposed alternatives, other inference options exist that we may explore in future work; by evaluating IANet+ref we can establish an upper bound on how much could be gained from better identification.

Implementation Details
We implement our model with PyTorch (Paszke et al., 2017). For the primary and secondary encoders, we use 3-layer transformers with a model size of 256 and 8 attention heads. Since the decoder has to integrate information from both encoders, we build it with a larger capacity: the number of layers is set to 4, and the model size and number of attention heads are set to 512 and 8. For the sequence tagger S used in the IANet+S model, we use a 2-layer LSTM with a hidden size of 512.
In the experiments, we adopt pretrained 300-dimensional FastText embeddings (Bojanowski et al., 2017) to perform the PSI algorithm. During training, we use Adam (Kingma and Ba, 2015) to optimize our model with a learning rate of 1e-4. In all experiments, we set α_p in Algorithm 1 to 0.1 based on performance on the validation set.

Evaluation Metrics
Following previous studies (Prakash et al., 2016; Fu et al., 2019; Li et al., 2019), we report results on several automatic metrics, including BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). All lower n-gram metrics (1-4 grams in BLEU and 1-2 grams in ROUGE) are reported. In addition, we include iBLEU (i-B) (Sun and Zhou, 2012) as another evaluation metric, which penalizes repeating the source sentence in its paraphrase.

Table 2 lists the results on both datasets. We see that the transformer baseline already achieves fairly strong results, because the capacity of the transformer model is large enough to fit the datasets quite well. Nonetheless, on most of the evaluation metrics, our model outperforms previous studies by a notable margin, demonstrating the effectiveness of the proposed approach.
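The iBLEU metric combines two BLEU scores: similarity to the reference minus a penalty for similarity to the source. Below is a toy sentence-level sketch; real evaluations use corpus-level BLEU with 4-gram precisions and smoothing, and the weighting α is a common choice rather than a value taken from this paper.

```python
from collections import Counter
import math

def bleu(cand, ref, max_n=2):
    """Tiny sentence-level BLEU sketch: uniform n-gram precisions
    plus a brevity penalty. For illustration only."""
    precisions = []
    for n in range(1, max_n + 1):
        c = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((c & r).values())          # clipped n-gram matches
        total = max(sum(c.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def ibleu(cand, ref, src, alpha=0.9):
    """iBLEU (Sun and Zhou, 2012): reward similarity to the reference,
    penalize copying the source."""
    return alpha * bleu(cand, ref) - (1 - alpha) * bleu(cand, src)
```

Copying the source verbatim maximizes plain BLEU against itself but is penalized by the second term, so a faithful-yet-rephrased candidate scores higher.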

Main Results
By comparing different variants of our model, we see that IANet+ref achieves the best results on all metrics. This is expected, as it uses the reference sentence in determining the primary and secondary content. It is worth emphasising that the IANet+ref generator, like the other variants, does not receive the target sentence directly; it only gets the input X, whose primary and secondary content have been more accurately identified. This suggests that the decomposition in our approach is beneficial, and further work can focus on the identification step. On the other hand, without using the target sentence, both IANet+X and IANet+S must use an approximated approach at inference time, which inevitably introduces noise into the identified content. As a result, their performance is lower than IANet+ref. We provide more analysis in the analysis section.

Human Evaluation
We also conduct a human evaluation to assess our model, using graders proficient in English from an internal grading platform. We randomly select 150 examples from the Quora test set and compare our model with three representative baselines.[4] Three annotators are asked to rate the generated results from different models on a 3-point Likert scale (0, 1, or 2) with respect to the following features:[5]

• Fluency: Whether the generated paraphrase is grammatically correct and easily understood.
• Accuracy: Whether the content in the generated paraphrase is consistent with the content in the original sentence.
• Diversity: Whether the generated sentence structure differs from the reference sentence.

[4] Because the authors of DNPG (Li et al., 2019) did not release their code, we are unable to reproduce their results or include this model in the human evaluation.
[5] More details of the human evaluation guidelines can be found in the supplementary material.

Table 3: Human evaluation results (Fluency, Accuracy, Diversity).
To measure the agreement between the annotators, we use the Fleiss kappa coefficient (Fleiss, 1971). The agreement results are shown in the first row of Table 3, indicating moderate agreement between annotators on all metrics.
From Table 3, we see that our model achieves the best result on all metrics, which demonstrates the effectiveness of the proposed approach. In particular, on the diversity metric, our model significantly outperforms the other baselines (sign test, p-value < 0.05) and performs comparably with the reference sentences (p-value = 0.23). The improvement in the diversity metric mainly comes from the two-step nature of our generation framework. By first determining which parts of the sentence to keep (primary) or to change (secondary), our model can focus on maintaining the primary content while rewriting the secondary content, resulting in more accurate and diverse paraphrases.

Further Analysis
In this section, we present further discussions and empirical analysis of the proposed approach.

Inference Algorithms Comparison
As shown in Table 2, IANet+ref outperforms IANet+X and IANet+S on both datasets. Our analysis is that IANet+ref can more accurately identify the primary content from the source compared with the other variants. To provide more evidence, we separately use PSI(X, X), the sequence tagger S, and PSI(X, Y) to identify the words that appear in both the source and target sentences of the Quora dataset. The results are shown in Table 4. We see that all three methods perform comparably on the precision (prec.) metric. However, PSI(X, Y) significantly outperforms the other methods on the recall (rec.) metric, showing that it can accurately extract more primary content from the source. From Table 4, we also observe that better identification results lead to better generation performance. Therefore, to improve the model performance, future work could focus on the identification step.

Similarity Measurements Comparison

In this part, we analyze the differences between similarity measurements. As described in Eq. (1) and Algorithm 1, the pairwise similarity measurement F(X, Y) is the basis of the PSI algorithm. To see how different similarity measurements affect system performance, we compare the adopted Spearman's ρ with cosine similarity, which is commonly used for measuring text similarity. We use both metrics to measure the similarity of training pairs in the MSCOCO dataset; the results are shown in Figure 4. The distribution of cosine similarity is condensed into a much smaller interval than that of Spearman's ρ, showing that Spearman's ρ is more discriminative and can detect more subtle differences between sentence pairs. Therefore, it can better identify the primary content, leading to better model performance. For further analysis, we run experiments on the MSCOCO dataset using cosine similarity as the measurement approach.
The results of IANet+S under both metrics are shown in Table 5, which also demonstrates that a more discriminative measurement approach leads to better model performance.
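The effect can be illustrated with random vectors (a synthetic demonstration, not the data behind Figure 4): after max-pooling, vector components are mostly positive, so cosine scores of even unrelated vectors cluster in a narrow high band, while Spearman's ρ spreads over a wider interval.

```python
import numpy as np

def ranks(v):
    """Integer rank of each element within a vector."""
    return v.argsort().argsort()

rng = np.random.default_rng(1)
cos_scores, rho_scores = [], []
for _ in range(200):
    # Max-pooled context vectors of two unrelated random "sentences".
    x_hat = rng.normal(size=(5, 300)).max(axis=0)
    y_hat = rng.normal(size=(5, 300)).max(axis=0)
    cos = x_hat @ y_hat / (np.linalg.norm(x_hat) * np.linalg.norm(y_hat))
    rho = np.corrcoef(ranks(x_hat), ranks(y_hat))[0, 1]
    cos_scores.append(cos)
    rho_scores.append(rho)

print(f"cosine:   range [{min(cos_scores):.3f}, {max(cos_scores):.3f}]")
print(f"spearman: range [{min(rho_scores):.3f}, {max(rho_scores):.3f}]")
```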
Effect of α_p in PSI

As described in Algorithm 1, the proposed PSI algorithm relies on a predefined threshold α_p to perform the extraction of primary and secondary content. In this part, we examine the effect of different values of α_p on model performance. We vary the value of α_p and measure the results of the IANet+S model on the Quora dataset. The results on three metrics (B-1, R-1, and R-L) are depicted in Figure 5. We see that the optimal value of α_p is 0.1; further decreasing or increasing α_p causes the model performance to drop. Our analysis is that, when α_p is too small, words that cause only a small variation in the pairwise similarity (Eq. (1)) are misclassified as primary. Extra noise is therefore introduced into the model input, which in turn decreases model performance. On the other hand, when α_p is too large, some important words that should be classified as primary content are excluded by the PSI algorithm, which also decreases model performance.

Case Study
As discussed in the inference algorithms comparison above, the reason why IANet+ref outperforms IANet+X and IANet+S is that it can more accurately identify the primary content and thus generate a paraphrase that is similar to the reference sentence. In contrast, both IANet+X and IANet+S adopt an approximated algorithm, which inevitably introduces extra noise into the identified content. For a better illustration, we sample one test case from the Quora dataset and present the results generated by our different model variants in Table 6. Given the input sentence, all model variants can generate a sentence that is similar to the reference paraphrase. By further comparing the primary content (words in blue), we can see that only IANet+ref successfully identifies all the primary content that is also contained in the reference sentence. In contrast, IANet+S misses the word "best" and IANet+X ignores the words "young" and "adult". As a result, IANet+ref generates a paraphrase that is closer to the reference sentence, leading to higher performance on the evaluation metrics, as shown in Table 2.

Controllable Paraphrase Generation
Since the identification of the primary and secondary content of the input is separated from the neural generator, we have the flexibility to choose this content manually. In this way, we can more precisely control the generation process.
To examine the controllability of the proposed approach, we manually select the primary and secondary content of a sampled instance and use the IANet+S model to generate paraphrases accordingly. The results based on different selections are presented in Table 6. As the examples demonstrate, our model can generate different paraphrases given different combinations of primary and secondary content. We observe that the selected primary content is largely maintained in the generated paraphrases, while the secondary content is properly rephrased.
This controllable attribute could make our model useful for other tasks such as task-oriented dialogue generation. Suppose we want to generate more utterances with the same meaning as the user utterance "book a great restaurant in London tonight". The slot values can be fixed as the primary content, and our model could produce more utterances with the same intent, e.g., "make a reservation at the best London restaurant for this evening". This remains to be rigorously tested in future work.

Conclusion
In this work, we propose a novel IA framework to tackle the paraphrase generation task. Additionally, we design a new neural paraphrase generator which works coherently under the proposed framework. We conduct extensive experiments on two benchmark datasets. The results of quantitative experiments and human evaluation demonstrate that our approach improves upon previous studies, and the qualitative experiments show that the generation of the proposed model is interpretable and controllable. In the future, we would like to investigate better inference algorithms to further bridge the gap between the IANet+S and IANet+ref models.