A Quality-based Syntactic Template Retriever for Syntactically-controlled Paraphrase Generation

Existing syntactically-controlled paraphrase generation (SPG) models perform promisingly with human-annotated or well-chosen syntactic templates. However, the difficulty of obtaining such templates actually hinders the practical application of SPG models. For one thing, the prohibitive cost makes it infeasible to manually design decent templates for every source sentence. For another, the templates automatically retrieved by current heuristic methods are usually unreliable for SPG models to generate qualified paraphrases. To escape this dilemma, we propose a novel Quality-based Syntactic Template Retriever (QSTR) to retrieve templates based on the quality of the to-be-generated paraphrases. Furthermore, for situations requiring multiple paraphrases for each source sentence, we design a Diverse Templates Search (DTS) algorithm, which can enhance the diversity between paraphrases without sacrificing quality. Experiments demonstrate that QSTR can significantly surpass existing retrieval methods in generating high-quality paraphrases and even perform comparably with human-annotated templates in terms of reference-free metrics. Additionally, human evaluation and the performance on downstream tasks using our generated paraphrases for data augmentation showcase the potential of our QSTR and DTS algorithm in practical scenarios.


Introduction
Paraphrase generation (PG) (Madnani and Dorr, 2010) aims to rephrase a sentence into an alternative expression with the same semantics, and has been applied to many downstream tasks, such as question answering (Gan and Ng, 2019) and dialogue systems (Jolly et al., 2020; Gao et al., 2020; Panda et al., 2021; Liang et al., 2019, 2021, 2022). On this basis, to improve the syntactic diversity of paraphrases, syntactically-controlled paraphrase generation (SPG) has been proposed and has attracted extensive attention in the research community (Iyyer et al., 2018; Chen et al., 2019a; Kumar et al., 2020; Sun et al., 2021; Hosking and Lapata, 2021; Yang et al., 2021b, 2022a; Huang et al., 2022; Hosking et al., 2022). Different from traditional PG models, SPG models take syntactic templates as additional conditions and generate paraphrases that conform to the corresponding syntactic structures; such templates generally take the form of syntax parse trees or sentence exemplars. After years of research, SPG models can already generate syntax-conforming and high-quality paraphrases with human-annotated or well-selected syntactic templates (refer to the first case in Figure 1).

Figure 1: The generated paraphrases with different templates. "Semantics" and "Syntax" represent the semantic similarity with the source sentence and the syntactic distance against the template, respectively. Obviously, an unsuitable template may lead to a poor paraphrase.
However, such promising performance heavily relies on those satisfying templates, whose difficult acquisition in practice largely hinders the application of SPG models. Firstly, manually tailoring templates for every source sentence is practically infeasible since it is time-consuming and laborious. Alternatively, automatically retrieving decent templates is also difficult, and unsuitable templates can induce semantic deviation or syntactic errors in the generated paraphrases (refer to the second case in Figure 1). Regarding template retrieval, current solutions are mostly heuristic methods that assume suitable templates should satisfy certain conditions, e.g., high frequency in the corpus (Iyyer et al., 2018) or high syntactic similarity on the source side (Sun et al., 2021). The only exception is Yang et al. (2022a), which utilizes contrastive learning to train a retriever. Nonetheless, it also assumes that a suitable template should be well-aligned with the corresponding source sentence. Albeit plausible, the retrieval standards in these methods provide no guarantee of the quality of the generated paraphrases, which may lead to unstable performance of SPG models in practice.
To address this limitation, we propose a novel Quality-based Syntactic Template Retriever (QSTR) to retrieve templates that can directly improve the quality of the paraphrases generated by SPG models. Different from previous methods, given a source sentence and a candidate template, QSTR scores the template by estimating the quality of the to-be-generated paraphrase beforehand. To achieve this, we train QSTR by aligning its output score with the real quality score of the generated paraphrase based on a paraphrase-quality metric, i.e., ParaScore (Shen et al., 2022). With sufficient alignment, templates with higher QSTR scores are more likely to produce high-quality paraphrases. Moreover, when generating multiple paraphrases for each source sentence, we observe a common problem in QSTR and previous methods: the top-retrieved templates tend to be similar, which may result in similar or even repeated paraphrases. To address this, we further design a Diverse Templates Search (DTS) algorithm that enhances the diversity between multiple paraphrases by restricting the maximum syntactic similarity between candidate templates.
Experiments on two benchmarks demonstrate that QSTR retrieves better templates than previous methods, which help existing SPG models generate paraphrases of higher quality. Additionally, automatic and human evaluations show that our DTS algorithm can significantly improve the diversity between multiple paraphrases for current retrieval methods while maintaining their high quality. Finally, using templates from QSTR to generate paraphrases for data augmentation achieves better results than previous retrieval methods on two text classification tasks, which further indicates the potential of QSTR in downstream applications.
In summary, the major contributions of this paper are as follows: • We propose a novel Quality-based Syntactic Template Retriever, which can retrieve suitable syntactic templates to generate high-quality paraphrases.
• To reduce the repetition when retrieving multiple templates by current methods, we design a diverse templates search algorithm that can increase the mutual diversity between different paraphrases without quality loss.
• The automatic and human evaluation results demonstrate the superiority of our method, and the performance in data augmentation for downstream tasks further proves the application value of QSTR in practical scenarios.

Related Work
Since the syntactically-controlled paraphrase generation task was proposed (Iyyer et al., 2018), it has received increasing attention, and previous work mainly focuses on making the generated paraphrases conform better to the corresponding human-annotated templates. More specifically, most of these methods modify the model structure to better leverage the syntactic information of templates (Kumar et al., 2020; Yang et al., 2021a, 2022a). Furthermore, Sun et al. (2021) and Yang et al. (2022b) generate paraphrases based on pre-trained language models, e.g., BART (Lewis et al., 2020) and ProphetNet (Qi et al., 2020), and yield better performance.
Compared to the concentration on SPG models, only a few methods focus on how to obtain templates for these SPG models in practice. Among them, Iyyer et al. (2018) and Huang and Chang (2021) directly use the most frequent templates in the corpus. Sun et al. (2021) select the syntax parse trees of the target sentences in the corpus whose source sentences are syntactically similar to the input source sentence. Yang et al. (2022a) retrieve candidate templates based on distances in the embedding space. However, the retrieval standards in these heuristic methods cannot guarantee the quality of the generated paraphrases. In contrast, our QSTR directly predicts the quality of the paraphrases to be generated with each template, and the retrieved templates thus have greater potential to generate high-quality paraphrases.

Methodology
In this section, we first introduce our QSTR (§3.1), which includes the model architecture (§3.1.1) and the training objective (§3.1.2). Then we design a diverse templates search algorithm to improve the mutual diversity between multiple paraphrases (§3.2).
Model Architecture

To model the mapping relationships between the tokens in a sentence and the constituents of a related template, we calculate the pairwise dot product between the two kinds of embeddings and obtain the correlation matrix C:

C_ij = e^s_i · e^t_j, (3)

where C_ij indicates the degree of correlation between the token x_i and the syntactic constituent t_j. Then we take the maximum value of each row/column in C as the weight for weighted averaging the sentence/template embeddings and obtain their final representations v_s and v_t:

v_s = Σ_{i=1}^{n} (max_j C_ij) · e^s_i, (4)

v_t = Σ_{j=1}^{m} (max_i C_ij) · e^t_j. (5)

In the end, we concatenate v_s and v_t and transform the result into a scalar s through a linear layer and a Sigmoid function:

s = Sigmoid(W [v_s; v_t] + b), (6)

where s is a score that represents the matching degree between the source sentence and the syntactic template.
In a nutshell, given a sentence x and a template t, the goal of QSTR is to output a quality estimation s for the future paraphrase by modeling the interaction between x and t:

s = QSTR(x, t). (7)
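The matching computation described in this subsection can be sketched as follows. The embedding dimension, the unnormalized max-pooled weights, and the shape of the final linear layer are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def qstr_score(e_s, e_t, W, b):
    """Score a (sentence, template) pair from token/constituent embeddings.

    e_s: (n, h) sentence token embeddings; e_t: (m, h) template embeddings.
    W: (2h,) weights of the final linear layer; b: scalar bias.
    Illustrative sketch of the matching described in the text."""
    C = e_s @ e_t.T                      # correlation matrix, C[i, j] = e^s_i . e^t_j
    w_s = C.max(axis=1)                  # per-token weight: best-matching constituent
    w_t = C.max(axis=0)                  # per-constituent weight: best-matching token
    v_s = w_s @ e_s                      # weighted sum of sentence embeddings
    v_t = w_t @ e_t                      # weighted sum of template embeddings
    z = np.concatenate([v_s, v_t]) @ W + b
    return 1.0 / (1.0 + np.exp(-z))      # Sigmoid -> matching score in (0, 1)
```

In this sketch, a zero final layer yields a neutral score of 0.5, and larger dot products between sentence and template embeddings push the score toward 1.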

Training Objective
During training, we aim to align the estimated score s from QSTR with the real quality value of the paraphrase. To this end, for each x, we randomly sample templates from the whole template library T as the candidate template set T_k = {t_1, t_2, ..., t_k}, which also includes the templates of x and the reference y as more competitive candidates. Then, given T_k, we can obtain the prior estimations S_k = {s_1, s_2, ..., s_k} for these templates from QSTR by Eq. (7). At the same time, we can use the SPG model to generate the corresponding paraphrases P_k = {p_1, p_2, ..., p_k} and evaluate their real quality. To acquire the quality scores of these paraphrases, we select the reference-based metric ParaScore_ref (Shen et al., 2022), which has the highest correlation with human evaluation. Formally, given a source sentence x and its reference y, the quality value q_i of the paraphrase p_i can be calculated by:

q_i = ParaScore_ref(x, y, p_i), (8)

where we use q_i to construct the quality set Q_k = {q_1, q_2, ..., q_k}. Next, we use the Mean Square Error (MSE) loss to quickly align the prior predictions S_k with the posterior quality Q_k quantitatively:

L_mse = (1/k) Σ_{i=1}^{k} (s_i − q_i)^2. (9)

Moreover, to better learn the quality ranks among T_k, we also calculate a pairwise rank loss L_rank for S_k according to Q_k:

L_rank = Σ_{i=1}^{k} Σ_{j=1}^{k} 1_{δ^q_ij < 0} · max(0, δ^s_ij), (10)

where δ^s_ij = s_i − s_j, δ^q_ij = q_i − q_j, and 1_{δ^q_ij < 0} = 1 when δ^q_ij < 0 and 0 otherwise. In the end, the overall training objective L consists of the above two loss functions:

L = λ_1 · L_mse + λ_2 · L_rank. (11)
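Under the description above, the combined objective can be sketched as follows; the zero-margin hinge form of the rank term and the default loss weights are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def qstr_loss(scores, qualities, lam_mse=1.0, lam_rank=1.0):
    """Combined MSE + pairwise rank loss for QSTR training (illustrative)."""
    s = np.asarray(scores, dtype=float)
    q = np.asarray(qualities, dtype=float)
    # MSE term: align predicted scores with ParaScore-based quality values.
    l_mse = np.mean((s - q) ** 2)
    # Pairwise rank term: penalize s_i > s_j whenever q_i < q_j.
    ds = s[:, None] - s[None, :]          # delta_s_ij = s_i - s_j
    dq = q[:, None] - q[None, :]          # delta_q_ij = q_i - q_j
    l_rank = np.sum(np.maximum(0.0, ds) * (dq < 0))
    return lam_mse * l_mse + lam_rank * l_rank
```

For perfectly calibrated and correctly ranked scores, both terms vanish; a mis-ranked pair contributes the size of its score inversion.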

Diverse Templates Search Algorithm
In practice, the top templates retrieved by existing retrieval methods may have similar features, e.g., syntactic structures, which may lead to repetitions when generating multiple paraphrases for one source sentence. To improve the mutual diversity between multiple paraphrases while maintaining their high quality, we design a general diverse templates search (DTS) algorithm, described in Algorithm 1, which can be equipped with existing retrieval methods. Taking QSTR as an example, given an input sentence x, we traverse the whole template library T = {t_1, t_2, ..., t_|T|} and calculate a score s_i for each t_i (lines 2∼3). Then, we maintain a min heap T_d of size d to collect the satisfactory templates t_i from T, which have high scores and meanwhile diverse syntactic structures between each other (lines 4∼12). To find these templates, we calculate the Tree Edit Distance (TED) (Zhang and Shasha, 1989) between the template t_i and the templates in T_d and ensure that the minimum TED value is greater than a threshold β before appending t_i to T_d (line 6). After one traversal, the heap T_d will contain the final d qualified templates for diversely rephrasing the source sentence x.
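The traversal described above can be sketched as follows; the scoring and TED functions are passed in as callables, and the tie-breaking and eviction details are assumptions rather than a transcription of Algorithm 1.

```python
import heapq

def diverse_templates_search(x, library, score_fn, ted_fn, d, beta):
    """DTS sketch: keep up to d high-scoring templates that are mutually
    diverse (pairwise tree edit distance > beta). Illustrative only."""
    heap = []  # min-heap of (score, index, template); the root is the worst kept
    for idx, t in enumerate(library):
        s = score_fn(x, t)
        if len(heap) == d and s <= heap[0][0]:
            continue  # cannot beat the lowest-scoring kept template
        # enforce syntactic diversity against the already-kept templates
        if all(ted_fn(t, kept) > beta for _, _, kept in heap):
            if len(heap) == d:
                heapq.heappop(heap)       # evict the lowest-scoring template
            heapq.heappush(heap, (s, idx, t))
    return [t for _, _, t in sorted(heap, reverse=True)]
```

A single pass over the library thus yields d templates sorted by score, each at least β away from the others in tree edit distance.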

Datasets and Implementation Details
Datasets. Following previous work (Sun et al., 2021; Yang et al., 2022a), we conduct our experiments on ParaNMT-Small (Chen et al., 2019b) and QQP-Pos (Kumar et al., 2020). Specifically, ParaNMT-Small contains about 500K paraphrase pairs for training and 1,300 manually labeled (source sentence, exemplar sentence, reference sentence) triples, which are split into 800/500 for the test/dev sets. QQP-Pos contains about 140K/3K/3K pairs/triples/triples for the train/test/dev sets. In the test/dev sets of the two datasets, the exemplars are human-annotated sentences with syntactic structures similar to the reference sentences but different semantics. The function of exemplars is to provide their syntax parse trees as templates that guide SPG models to generate paraphrases syntactically close to the references.

SPG Models. In our experiments, we use two strong SPG models, i.e., AESOP (Sun et al., 2021) and SI-SCP (Yang et al., 2022a), to evaluate the effectiveness of our QSTR. To the best of our knowledge, AESOP has state-of-the-art performance among previous SPG models. Besides, SI-SCP includes a novel tree transformer to model parent-child and sibling relations in syntax parse trees and also achieves competitive performance.
Implementation Details. We use the Stanford CoreNLP toolkit (Manning et al., 2014) to obtain the syntax parse trees of the sentences, truncate all parse trees to height 4, and linearise them following Yang et al. (2022a). Then, we build the template library using the parse trees from the training set of each dataset. The two encoders in QSTR are initialized with RoBERTa-base (Liu et al., 2019) for both datasets. We use the scheduled AdamW optimizer with a learning rate of 3e-5 during training. The max lengths of sentences and templates are 64 and 192, respectively. The batch size is set to 32, and the size k of T_k is set to 10. We train our QSTR for 10 and 20 epochs on ParaNMT-Small and QQP-Pos, respectively. The coefficients λ_1 and λ_2 in Eq. (11) are both set to 1. The threshold β in the DTS algorithm is set to 0.2.
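As an illustration of the truncate-and-linearise preprocessing step, the sketch below cuts a constituency tree (represented as nested tuples) to a given height and flattens it into a bracketed string; the exact linearisation format of Yang et al. (2022a) is only loosely assumed here.

```python
def truncate(tree, height):
    """Keep only the top `height` levels of a constituency tree.

    A tree is (label, child, child, ...); a bare label string is a leaf."""
    if isinstance(tree, str) or height <= 1:
        return tree if isinstance(tree, str) else tree[0]
    return (tree[0],) + tuple(truncate(c, height - 1) for c in tree[1:])

def linearise(tree):
    """Flatten a (possibly truncated) tree into a bracketed template string."""
    if isinstance(tree, str):
        return tree
    parts = [tree[0]] + [linearise(c) for c in tree[1:]]
    return "( " + " ".join(parts) + " )"
```

For example, truncating "( S ( NP PRP ) ( VP VBD ( NP DT NN ) ) )" to height 2 keeps only "( S NP VP )", which is the kind of coarse syntactic template stored in the library.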

Contrast Methods
Here we introduce all the contrast methods for obtaining templates. Moreover, we also conduct paraphrase generation with Vicuna-13B (Chiang et al., 2023) as a strong baseline.
Ref-as-Template.The syntax parse tree of the reference sentence is used as the template, which can be regarded as the ideal template for the source sentence.
Exemplar-as-Template. The syntax parse tree of the exemplar sentence is used as the template, which can be seen as the human-annotated template.
Random Template. We randomly select one template from the template library for SPG.

Vicuna-13B. We use this large language model (LLM) (Chiang et al., 2023) for zero-shot paraphrase generation. Please refer to Appendix A for more details.

Evaluation Metrics
Semantic Metrics. We use BLEU-R (Papineni et al., 2002) to evaluate the literal similarity between generated paraphrases and references. To further measure semantic similarity, we use Sentence-Transformers to encode the sentences into embeddings and then calculate the cosine similarity between the paraphrase and the source/reference sentence as cos-S/cos-R.

Syntactic Metric. Following Bandel et al. (2022), we calculate the Tree Edit Distance (TED) (Zhang and Shasha, 1989) between the syntax trees of the paraphrase and the template to reflect how much the paraphrase syntactically conforms to the template.

Additionally, ParaScore (Shen et al., 2022) is the state-of-the-art metric for paraphrase quality evaluation, which comprehensively evaluates the semantic consistency and expression diversity of paraphrases. Therefore, we also use the reference-based and reference-free versions of ParaScore, i.e., ParaScore_ref and ParaScore_free, as more convincing metrics. Between them, ParaScore_free better reflects the quality of paraphrases when reference sentences are unavailable in practical scenarios.

Main Results
Table 1 shows the performance of paraphrases with different templates on two datasets, using the AESOP model for SPG. The results based on the SI-SCP model are presented in Appendix B. Overall, several conclusions can be drawn from the results: (1) Under the reference-based metrics, i.e., BLEU-R, cos-R, ParaScore_ref, and iBLEU, our QSTR significantly surpasses the other baselines, which demonstrates that the templates retrieved by QSTR are closer to the ideal templates.
(2) The results of the reference-free metrics (i.e., BLEU-S, cos-S, and ParaScore_free) also verify the superiority of our QSTR over other methods in practical scenarios, and QSTR performs fully comparably with the human-labeled exemplar sentences (Exemplar-as-Template).
(3) QSTR also achieves a much lower TED value than other retrieval methods, which indicates that the templates from QSTR are more suitable for the source sentence and the generated paraphrases conform better to the templates. Although some methods achieve lower BLEU-S scores than QSTR (e.g., "Random Template"), the corresponding cos-S scores are also significantly inferior, which means the paraphrases generated with those templates have poor semantic consistency with the source sentences.
Furthermore, despite the unfair comparison, we also report the results of Vicuna-13B, which conducts zero-shot paraphrasing without the need for templates. Despite using a much smaller SPG model, QSTR yields the closest performance to Vicuna-13B among template retrieval methods. A detailed analysis is given in Appendix A. In conclusion, these results sufficiently demonstrate that our QSTR can provide more suitable templates to generate high-quality paraphrases.

Mutual Diversity of Multiple Paraphrases
In this section, we evaluate the effectiveness of our DTS algorithm when generating multiple paraphrases for each source sentence. For the evaluation metrics, we first calculate the sentence-level repetition rate (Rep-Rate) between the paraphrases generated with the top-10 retrieved templates. We then use mutual-BLEU (M-BLEU) to measure the literal similarity between multiple paraphrases, which averages the corpus-level BLEU scores between different paraphrases. Additionally, we also report the average ParaScore_free scores of the 10 paraphrases for quality evaluation.
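The two diversity metrics can be sketched as follows; since the exact BLEU configuration is not specified here, a simple unigram-precision overlap is used as a stand-in for the pairwise corpus-level BLEU in M-BLEU.

```python
from itertools import combinations
from collections import Counter

def rep_rate(paraphrases):
    """Fraction of paraphrases that exactly repeat an earlier one."""
    seen, repeats = set(), 0
    for p in paraphrases:
        if p in seen:
            repeats += 1
        seen.add(p)
    return repeats / len(paraphrases)

def mutual_overlap(paraphrases):
    """Average pairwise unigram overlap (simplified stand-in for M-BLEU)."""
    def overlap(a, b):
        ca, cb = Counter(a.split()), Counter(b.split())
        return sum((ca & cb).values()) / max(sum(ca.values()), 1)
    pairs = list(combinations(paraphrases, 2))
    # average both directions, mirroring the symmetric averaging of M-BLEU
    return sum(overlap(a, b) + overlap(b, a) for a, b in pairs) / (2 * len(pairs))
```

Lower values of both quantities indicate more mutually diverse paraphrase sets, which is the behavior DTS is designed to encourage.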
Table 2 presents the results of our DTS algorithm on two retrieval methods when retrieving 10 templates. The results show that, equipped with DTS, the values of Rep-Rate and M-BLEU decrease significantly, which means the DTS algorithm can effectively improve the mutual diversity between multiple paraphrases. Moreover, the stable ParaScore_free scores of these paraphrases show that the DTS algorithm has little impact on paraphrase quality. To sum up, by combining our QSTR with the DTS algorithm, SPG models can generate multiple paraphrases with both high mutual diversity and high quality.

Analysis

Ablation Study
To verify the effectiveness of the two training objectives for QSTR, we conduct an ablation study. Specifically, we calculate the Pearson correlation coefficient (PCC) between the predicted scores from QSTR and the true quality scores from ParaScore_ref. Table 3 presents the results of QSTR when L_mse or L_rank is removed during training, which show that the correlations decline noticeably without L_rank or L_mse. Thus, both objectives benefit the training of QSTR, and they are complementary to each other.

Human Evaluation
We further conduct a human evaluation on the paraphrases from three retrieval methods (SISCP-R, QSTR, QSTR+DTS). Specifically, we randomly select 50 source sentences from the QQP-Pos test set and generate 5 paraphrases for each sentence using the AESOP model with templates from the three retrieval methods. Next, we let three annotators score each paraphrase on two aspects, i.e., the overall quality (1∼5) and the diversity against the source sentence (1∼5). The detailed guidelines are listed in Appendix C. Besides, we define that a paraphrase is accepted if its quality score ≥ 4, its diversity score ≥ 3, and it is unique among the 5 paraphrases. The final evaluation results in Table 4 show that the paraphrases generated with QSTR have better quality and diversity under human evaluation. Moreover, DTS further promotes the quality and diversity of the paraphrases and largely improves the acceptance rate.
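The acceptance rule above can be written down directly; the list-based uniqueness check over a source sentence's paraphrase group is just for illustration.

```python
def accepted(quality, diversity, paraphrase, group):
    """A paraphrase is accepted if quality >= 4, diversity >= 3, and it is
    unique among the paraphrases generated for the same source sentence."""
    unique = group.count(paraphrase) == 1
    return quality >= 4 and diversity >= 3 and unique
```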

Case Study
We list 10 generated paraphrases from different retrieval methods for the same source sentence in Table 5. Among them, random templates produce the most inferior paraphrases, which shows that current SPG models are very sensitive to different templates. With the templates retrieved by SISCP-R, the paraphrases may be similar to each other and also contain some syntactic or semantic errors. In contrast, our QSTR performs better on paraphrase quality, and the DTS algorithm further improves the mutual diversity of the paraphrases.

Applications on Downstream Tasks
To further test the performance of QSTR on downstream tasks, we apply it to data augmentation for few-shot learning in text classification. Specifically, we select the SST-2, MRPC, and QQP classification tasks from GLUE (Wang et al., 2018) as evaluation benchmarks. Then, we randomly sample 200 instances from the train set to fine-tune bert-base-uncased (Devlin et al., 2019) and obtain the few-shot baseline classifier.
In addition, we utilize the AESOP model with templates from different retrieval methods to generate paraphrases for the train set and the test set, respectively. Specifically, the augmented data for the train set are used to train classifiers together with the original instances. As for the test set, we obtain predictions on the augmented data as additional votes and take the majority vote as the final prediction. Moreover, we combine the aforementioned two strategies as a further attempt. Please refer to Appendix D for more training details.
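The test-time strategy ("Test Aug") can be sketched as follows; `classify` is a hypothetical single-example classifier standing in for the fine-tuned BERT model.

```python
from collections import Counter

def vote_predict(classify, test_sentence, paraphrases):
    """Predict a label for a test example by majority voting over the model's
    predictions on the original sentence and its generated paraphrases."""
    votes = [classify(test_sentence)] + [classify(p) for p in paraphrases]
    return Counter(votes).most_common(1)[0][0]
```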
The results in Table 6 show that our method brings the highest improvement over the baseline compared to other methods under all three strategies. Specifically, "Train Aug" leads to better performance with our QSTR but not stably with other methods. "Test Aug" contributes stable improvements with all methods. "Train & Test Aug" further improves the final performance. In conclusion, our QSTR shows the best performance under all strategies, which indicates that our method can effectively promote the application value of SPG models on downstream tasks.

Conclusion
In this work, we propose a quality-based syntactic template retriever (QSTR) to retrieve decent templates for high-quality SPG. Moreover, we develop a diverse templates search (DTS) algorithm to reduce repetition among multiple paraphrases. Experiments show that SPG models can generate better paraphrases with the templates retrieved by our QSTR than with other retrieval methods, and our DTS algorithm further increases the mutual diversity of multiple paraphrases without any loss of quality. Furthermore, the results of the human evaluation and the downstream tasks also demonstrate that our QSTR and DTS algorithm can retrieve better templates and help SPG models perform more stably in practice.

Limitations
Although ParaScore_ref (Shen et al., 2022) is the state-of-the-art metric for the quality evaluation of paraphrases, it is still far from perfect as the supervision signal for QSTR. We will explore better metrics for evaluating the quality of paraphrases to guide the training of QSTR in future work. Moreover, we only utilize Vicuna-13B for zero-shot paraphrase generation, which leads to an unfair comparison with other methods. In future work, we will try to fine-tune Vicuna-13B on the SPG task and verify the effectiveness of our method with this new backbone.

A The experiment based on Vicuna-13B

Vicuna-13B (Chiang et al., 2023) is a large-scale model trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. A preliminary evaluation using GPT-4 as a judge shows that Vicuna-13B outperforms other models like LLaMA and Stanford Alpaca in more than 90% of cases (Chiang et al., 2023). To explore its performance on paraphrase generation, we use the instruction "Please give ten paraphrases of the next sentence in English, the input sentence. These paraphrases should have the same meaning and diverse syntactic structures with the given sentence." to obtain the generated paraphrases from Vicuna-13B.
The evaluation results on automatic metrics (as shown in Table 1) show that Vicuna-13B can generate paraphrases competitive with our QSTR. Additionally, through our observation of specific cases, Vicuna-13B tends to add additional information or expand the original sentence length when generating paraphrases. Several cases generated by Vicuna-13B are shown in Table 7. The advantage of this behavior is that the generated paraphrases are more diverse against the source sentence, while the disadvantages are that the sentences become more redundant and the syntactic structures are uncontrollable.

B The Results of SI-SCP
Firstly, to show the applicability of QSTR, the QSTR model used for the SI-SCP model is the same one used for the AESOP model (QSTR-based-on-AESOP). It is trained using the paraphrases generated by the AESOP model, which may leave a gap from the paraphrases of the SI-SCP model. Thus, we also adopt a more effective approach, using the SI-SCP model to generate paraphrases during the training process (QSTR-based-on-SI-SCP). Table 8 shows the performance of paraphrases with different templates from QSTR and other baselines on both datasets based on the SI-SCP backbone. The results show that our QSTR exhibits significant superiority compared to other baselines, and QSTR-based-on-SI-SCP performs better on most metrics than QSTR-based-on-AESOP. However, the promising results with QSTR-based-on-AESOP indicate that our QSTR may be directly applied to other SPG models without retraining.

C Guidelines for human evaluation
The overall quality evaluates paraphrases from the perspectives of grammar correctness and semantic consistency with the source sentence, and the larger the score, the higher the quality.The detailed guidelines are as follows.
• "5" means the paraphrase is fully grammatically correct and completely semantically consistent with the source sentence.
• "4" means the paraphrase has a slight grammatical error, but still maintains the correct semantics, and can also be considered a valuable paraphrase.
• "3" means the paraphrase has a slight grammatical error and a minor semantic deviation.
• "2" means the paraphrase has a serious grammatical error and a major semantic deviation, but is still a complete sentence.
• "1" means the paraphrase is not a complete sentence or is totally irrelevant to the source sentence.
The diversity evaluates whether the paraphrase has diverse expressions compared to the input sentence, based on whether the words used are the same or whether the syntactic structure is the same. The larger the score, the higher the diversity. The detailed guidelines are as follows.
• "5" means the paraphrase has a very different syntax from the source sentence and adopts many new words and phrases.
• "4" means the paraphrase has a new syntax against the source sentence and revises some original words.
• "3" means the paraphrase has a new syntax against the source sentence but still adopts original words or phrases.
• "2" means the paraphrase has the same expression as the source sentence but adopts a few new words or phrases.
• "1" means the paraphrase is almost identical to the source sentence.

D Training details of Downstream Tasks
Since MRPC does not provide an official dev set, we randomly sample 1,200 instances from the official test set as the final testing set and use the remaining instances as the dev set. QQP adopts the same settings. As for SST-2, we directly use the

Figure 2: The model architecture and the training process of QSTR. QSTR models the relationship between the source sentence x and the template t and maps it into a score s, which denotes a quality estimation of the paraphrase to be generated. Then, the training objective is to align the score s with the true quality score q of the paraphrase p based on the ParaScore metric.
Freq-R. Following Iyyer et al. (2018), we choose the most frequent template in the template library for SPG.

AESOP-R. Sun et al. (2021) select the parse tree of the target sentence whose corresponding source sentence has the most similar syntactic structure to the input sentence.

SISCP-R. Yang et al. (2022a) encode the sentences and the templates into the same space and retrieve syntactic templates based on the similarities between their representations.
Given a source sentence x = (x_1, x_2, ..., x_n) and a template t = (t_1, t_2, ..., t_m), we first encode them with two separate encoders. To further extract the semantic and syntactic features, we add two feed-forward networks FFN-s and FFN-t after the two encoders. Then we can obtain the final sentence embeddings e^s = {e^s_1, e^s_2, ..., e^s_n} and the final template embeddings e^t = {e^t_1, e^t_2, ..., e^t_m}.

Algorithm 1 Diverse Templates Search Algorithm
Input: input sentence x, the whole template library T = {t_1, t_2, ..., t_|T|}, the number of retrieved templates d, the syntactic diversity threshold β
Output: diverse templates set T_d
1: Initialize T_d = ∅ as a min heap.

Table 1: Performance of paraphrases generated with different kinds of templates using the AESOP model for SPG (columns: BLEU-S↓, BLEU-R↑, iBLEU↑, cos-S↑, cos-R↑, ParaScore_free↑, ParaScore_ref↑, TED↓). Metrics with ↑ mean higher is better, while ↓ means lower is better. Results highlighted in bold and underline represent the best and second-best results, respectively. Results with mark * are statistically better than the most competitive method "SISCP-R" with p < 0.05. For all retrieval methods in "Available Templates in Practice", we use the top-1 retrieved template for each source sentence to generate the paraphrase.

In our experiments, we observe that lower TED values generally indicate that the templates are more suitable for the source sentence.

Table 2: The evaluation results of the paraphrases generated with the top-10 retrieved templates for each source sentence on the QQP-Pos dataset.

Table 3: Pearson correlation coefficients (PCC) between QSTR predictions and ParaScore_ref scores under different training objectives on the dev/test sets of the QQP-Pos dataset.

Table 4: The results of human evaluation.

Table 6: Test accuracies on downstream tasks (i.e., MRPC, QQP, and SST-2) after adding paraphrases generated with different templates to the original baseline for data augmentation. "Train Aug" means generating paraphrases for the training samples as additional training corpus. "Test Aug" means generating paraphrases for the test samples and conducting majority voting for the final predictions. "Train & Test Aug" combines the two strategies. Results with the same mark † or ‡ are from the same model.
Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245-1262.