Improving Large-scale Paraphrase Acquisition and Generation

This paper addresses the quality issues in existing Twitter-based paraphrase datasets, and discusses the necessity of using two separate definitions of paraphrase for identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MultiPIT) corpus that consists of a total of 130K sentence pairs with crowdsourced (MultiPIT_crowd) and expert (MultiPIT_expert) annotations using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MultiPIT_NMR) and a large automatically constructed training set (MultiPIT_Auto) for paraphrase generation. With improved data annotation quality and task-specific paraphrase definitions, the best pre-trained language model fine-tuned on our dataset achieves state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results also demonstrate that paraphrase generation models trained on MultiPIT_Auto generate more diverse and higher-quality paraphrases compared to their counterparts fine-tuned on other corpora such as Quora, MSCOCO, and ParaNMT.


Introduction
Paraphrases are alternative expressions that convey a similar meaning (Bhagat and Hovy, 2013). Studying paraphrase facilitates research in both natural language understanding and generation. For instance, identifying paraphrases on social media is important for tracking the spread of misinformation (Bakshy et al., 2011) and capturing emerging events (Vosoughi and Roy, 2016). On the other hand, paraphrase generation improves the linguistic diversity in conversational agents (Li et al., 2016) and machine translation (Thompson and Post, 2020). It has also been successfully applied in data augmentation to improve information extraction (Zhang et al., 2015; Ferguson et al., 2018) and question answering systems (Gan and Ng, 2019).
Many researchers have leveraged Twitter data to study paraphrase, given its lexical and stylistic diversity as well as its coverage of up-to-date events. However, existing Twitter-based paraphrase datasets, namely PIT-2015 (Xu et al., 2015) and Twitter-URL (Lan et al., 2017), suffer from quality issues such as topic imbalance and annotation noise, which limit the performance of the models trained on them. Moreover, past efforts on creating paraphrase corpora consider only one paraphrase criterion, without taking into account the fact that the desired "strictness" of semantic equivalence in paraphrases varies from task to task (Bhagat and Hovy, 2013; Liu and Soh, 2022). For example, for the purpose of tracking unfolding events, "A tsunami hit Haiti." and "303 people died because of the tsunami in Haiti" are sufficiently close to be considered paraphrases; whereas for paraphrase generation, the extra information "303 people died" in the latter sentence may lead models to learn to hallucinate and generate more unfaithful content.
In this paper, we present an effective data collection and annotation method to address these issues. We curate the Multi-Topic Paraphrase in Twitter (MULTIPIT) corpus, which includes MULTIPIT CROWD, a large crowdsourced set of 125K sentence pairs that is useful for tracking information on Twitter, and MULTIPIT EXPERT, an expert-annotated set of 5.5K sentence pairs labeled using a stricter definition that is more suitable for acquiring paraphrases for generation purposes. Compared to PIT-2015 and Twitter-URL, our corpus contains more than twice as much data, with a more balanced topic distribution and better annotation quality. Two sets of examples from MULTIPIT are shown in Figure 1.
We extensively evaluate several state-of-the-art neural language models on our datasets to demonstrate the importance of having a task-specific paraphrase definition. Our best model achieves 84.2 F1 for automatic paraphrase identification. In addition, we construct a continually growing paraphrase dataset, MULTIPIT AUTO, by applying the automatic identification model to unlabelled Twitter data. Empirical results and analysis show that generation models fine-tuned on MULTIPIT AUTO generate more diverse and higher-quality paraphrases compared to models trained on other corpora, such as MSCOCO (Lin et al., 2014), ParaNMT (Wieting and Gimpel, 2018), and Quora. We hope our MULTIPIT corpus will facilitate future innovation in paraphrase research.

Multi-Topic PIT Corpus
In this section, we present our data collection and annotation methodology for creating the MULTIPIT CROWD and MULTIPIT EXPERT datasets. The data statistics are detailed in Table 1.

Collection of Tweets
To gather paraphrases about a diverse set of topics, as illustrated in Figure 1, we first group tweets that contain the same trending topic (years 2014-2015) or the same URL (years 2017-2019), retrieved through Twitter public APIs over a long time period. Specifically, for the URL-based method, we extract the URLs embedded in tweets posted by 15 news agency accounts (e.g., NYTScience, CNNPolitics, and ForbesTech). To get cleaner paraphrases, we split the tweets into sentences, eliminating the extra noise caused by multi-sentence tweets. More details of the improvements we made to address the data preprocessing issues in prior work are described in Appendix B.

Topic Classification and Balancing
To avoid a single type of topic dominating the entire dataset as in prior work (Xu et al., 2015; Lan et al., 2017), we manually categorize the topics for each group of tweets and balance their distribution. For trending topics, we ask three in-house annotators to classify them into 4 different categories: sports, entertainment, event, and others. All three annotators are college students with varied linguistic annotation experience, and each received an hour-long training session. For URLs, most of them are linked to news articles and have already been categorized by the news agency (e.g., the URL https://www.nytimes.com/2019/08/09/science/komodo-dragon-genome.html belongs to the science topic). We include the tweets grouped by URLs that belong to the science/tech, health, politics, and finance categories.

Candidate Selection
The PIT-2015 (Xu et al., 2015) and Twitter-URL (Lan et al., 2017) corpora contain only 23% and 31% sentence pairs that are paraphrases, respectively. To increase the proportion of paraphrases and improve annotation efficiency, we introduce an additional step to filter out tweet groups that contain either too much noise or too few paraphrases, and adaptively select sentence pairs for annotation (§2.4). For each of the trend-based groups, we first select the top 2 sentences using a simple ranking algorithm (Xu et al., 2015) based on the averaged probability of words, as sketched in the snippet below. We pair each of these two sentences with 10 other sentences that are randomly sampled from the top 20 in each group. Among these 20 sentence pairs, if the annotators find n ∈ [4,6], [7,9], [10,12], or [13,20] pairs to be paraphrases, we further deploy 20, 30, 40, or 50 sentence pairs for annotation, respectively, pairing one of the top 5 ranked sentences with 10 sentences randomly selected from those ranked between top 6 and top 50. Since the URL-based groups generally contain fewer sentences, we select the top 11 sentences and ask annotators to choose one as the seed sentence that can be paired with the remaining 10 sentences to produce at least 3 paraphrase pairs. If such a seed sentence exists, we pair it with the remaining 10 sentences and deploy them for annotation. Otherwise, we skip the entire group.
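The following is a minimal sketch of the seed-sentence ranking step described above, i.e., scoring each sentence in a tweet group by the averaged probability of its words. It is not the authors' exact implementation: the unigram probability estimate from within the group, the tokenization, and the function names are illustrative assumptions.

```python
from collections import Counter

def rank_sentences(sentences, top_k=2):
    """Rank sentences in a tweet group by the average unigram probability
    of their words, estimated from the group itself (illustrative sketch)."""
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())

    def score(tokens):
        # Average word probability: sentences built from frequent in-group
        # words (i.e., the "central" phrasing of the topic) rank higher.
        return sum(counts[w] / total for w in tokens) / max(len(tokens), 1)

    ranked = sorted(sentences, key=lambda s: score(s.lower().split()), reverse=True)
    return ranked[:top_k]

# Example: pick the two seed sentences for a trending-topic group.
group = ["it's finally Friday and that's all that matters rn",
         "So so so so thankful it's finally Friday",
         "I have a half day today"]
print(rank_sentences(group, top_k=2))
```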

Crowd Annotation for Paraphrase Identification
We then annotate the selected sentence pairs using the crowdsourcing platform Figure Eight to construct MULTIPIT CROWD.
Annotation Process. We design a 1-vs-1 annotation schema, where we present one sentence pair to workers at a time and ask them to annotate whether it is a paraphrase pair or not. A screenshot of the annotation interface is provided in Appendix A.1. We collect 6 judgments for every sentence pair and pay $0.2 per annotation (>$7 per hour). For creating MULTIPIT CROWD, with the purpose of identifying similar sentences and tracking information spread on Twitter in mind, we consider two sentences as paraphrases even if one contains some new information that does not appear in the other sentence (see Figure 3 for examples). As a side note, because these sentences are grouped under the same trend or URL, the new information is always relevant and grounded in the context; otherwise, we consider them non-paraphrases.
Quality Control. In every five sentence pairs, we embed one hidden test sentence pair that is pre-labeled by one of the authors, and we constantly monitor the workers' performance. Whenever annotators make a mistake on a test pair, they are alerted and provided with an explanation. Workers can continue the task only if they achieve >85% accuracy on the test pairs and >0.2 Cohen's kappa (Cohen, 1960) when compared with the majority vote of other workers. All workers are located in the U.S.
Inter-Annotator Agreement. The agreement among crowdworkers is 0.69 for trend-sourced examples and 0.70 for all examples. We also sample 400 sentence pairs and hire two experienced in-house annotators to label them. Assuming the in-house annotation is gold, the F1 of the crowdworkers' majority vote is 89.1.
Assessing Topic Diversity. We manually examine 100 sentence pairs randomly sampled from MULTIPIT CROWD, PIT-2015 (Xu et al., 2015), and Twitter-URL (Lan et al., 2017). Figure 2 shows the results of the manual inspection. MULTIPIT CROWD has a much more balanced topic distribution, compared to prior work where 58% of sentences in PIT-2015 are about sports and 63% of sentences in Twitter-URL are politics-related. This improvement can be attributed to the long collection time period (§2.1) and the topic classification step (§2.2) in our data collection process. In contrast, PIT-2015 was collected within only 10 days (04/24/2013 - 05/03/2013) that were dominated by a popular sports event, the 2013 NFL draft (04/25 - 04/27), and Twitter-URL was collected during the 3 months of the 2016 US presidential election.

Expert Annotation for Paraphrase Generation
Text generation models are prone to memorizing training data and generating unfaithful hallucinations (Maynez et al., 2020; Carlini et al., 2021). Including paraphrase pairs that contain extra information beyond world or commonsense knowledge in the training data only worsens the problem, as shown in Table 15 in Appendix F. For the purpose of paraphrase generation, we further create MULTIPIT EXPERT with expert annotations, using a stricter paraphrase definition than the one used in MULTIPIT CROWD. The different paraphrase criteria used for creating these two datasets and their corresponding examples are illustrated in Figure 3.
Data Selection. To create a high-quality corpus that focuses on differentiating strict paraphrases from more loosely defined ones, we first use our best paraphrase identifier (§3) fine-tuned on MULTIPIT CROWD to filter the sentence pairs, and then have experienced in-house annotators further annotate them. Specifically, we gather sentence pairs that are identified as paraphrases by the automatic classifier from 9,762 trending topic groups (Oct-Dec 2021) and 181,254 URL groups (Jan 2020-Jun 2021). To improve the diversity of our dataset, instead of presenting these pairs directly to the experts for annotation, we cluster the sentences by treating the paraphrase relationship as transitive, i.e., if sentence pairs (s1, s2) and (s2, s3) are both identified as paraphrases, then (s1, s2, s3) forms a cluster (see the sketch after this paragraph). For each trend or URL, we show two seed sentences paired with up to 30 sentences in the largest cluster for the experts to annotate. In total, we have 5,570 sentence pairs annotated for MULTIPIT EXPERT, in which 100 trend-sourced sentences and 100 URL-sourced sentences have at least 8 corresponding paraphrases each. We use these 200 sets to form MULTIPIT NMR, the first multi-reference test set for paraphrase generation evaluation (§4).
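One natural way to realize the transitive clustering above is a union-find structure over the identified paraphrase pairs. The sketch below is an illustration under that assumption, not the paper's exact pipeline code.

```python
def cluster_paraphrases(pairs):
    """Group sentences into clusters by treating the paraphrase relation as
    transitive: if (s1, s2) and (s2, s3) are paraphrases, {s1, s2, s3} is one cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for s1, s2 in pairs:
        union(s1, s2)

    clusters = {}
    for s in parent:
        clusters.setdefault(find(s), set()).add(s)
    return list(clusters.values())

pairs = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
print(cluster_paraphrases(pairs))  # e.g. [{'s1', 's2', 's3'}, {'s4', 's5'}]
```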
Expert Annotation. We ask two experienced annotators with linguistic backgrounds and rich annotation experience to label each sentence pair as a paraphrase or not. Annotators thoroughly discuss cases of disagreement.

Results
Table 2 presents results for the models fine-tuned on each dataset. DeBERTaV3 large achieves the best results, with 92 F1 on MULTIPIT CROWD and 83.2 F1 on MULTIPIT EXPERT. Transformer-based models consistently outperform BiLSTM-based models, especially on MULTIPIT EXPERT.
Beyond Fine-tuning. As MULTIPIT CROWD is a large-scale dataset annotated with a loose paraphrase definition, we test whether leveraging these "noisy" data improves model performance on MULTIPIT EXPERT. To reduce the noise that comes from the difference in definitions, we first raise the labeling threshold for MULTIPIT CROWD from 3 to 4 positive crowd judgments (out of 6).
Then we consider two noisy-training techniques adopted in prior work (Xie et al., 2020; Zhang and Sabuncu, 2018), namely filtering and flipping. Specifically, we fine-tune a teacher model on MULTIPIT EXPERT and use it to go through MULTIPIT CROWD as follows: for each sentence pair p, if its label is i (0 for non-paraphrase, 1 for paraphrase) and P_teacher(y = i | p) ≤ λ, we either filter out p or flip its label to 1−i (i.e., 0 → 1); a sketch of this relabeling logic is given below. Next, we fine-tune a new model on the combination of MULTIPIT EXPERT and the re-labeled MULTIPIT CROWD. The experimental results are shown in Table 3. Compared to fine-tuning on MULTIPIT EXPERT alone, adding the original MULTIPIT CROWD to the training data results in a 9.8-point drop in F1 and a 19.5-point drop in precision, demonstrating the necessity of a task-specific paraphrase definition. Among all methods, the flipping approach achieves the best F1 of 84.2. We thus use it to create MULTIPIT AUTO (§4).
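The snippet below is a minimal sketch of the filtering/flipping step, assuming a teacher model wrapped as a function that returns P(y = label | sentence pair). The data layout and helper names are illustrative, not the authors' implementation.

```python
def relabel_with_teacher(pairs, teacher_prob, threshold, mode="flip"):
    """Filter or flip crowd labels that the expert-trained teacher disagrees with.

    pairs: list of (sentence1, sentence2, label) with label in {0, 1}
    teacher_prob: function returning P(y = label | sentence pair) from the teacher
    threshold: the lambda in the paper; low teacher confidence triggers filter/flip
    """
    cleaned = []
    for s1, s2, label in pairs:
        if teacher_prob(s1, s2, label) <= threshold:
            if mode == "filter":
                continue            # drop the pair the teacher disagrees with
            label = 1 - label       # flip the label, e.g. 0 -> 1
        cleaned.append((s1, s2, label))
    return cleaned
```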

Impact of Data Size
Figure 4 shows test set performance of DeBERTaV3 large fine-tuned on different amounts of data from MULTIPIT EXPERT. As there are 156 trend/URL groups in the training set, we truncate the data by group. With more training data, the model achieves better F1 and accuracy, although the gains are smaller than in the early stage. This finding suggests that annotating more data can further improve the model's performance.

Paraphrase Generation
Paraphrase generation is the task of rewriting an input sentence while preserving its semantic meaning. Since new data is generated on Twitter every day, we introduce MULTIPIT AUTO, an automatically constructed and continually growing dataset for paraphrase generation. We show that models fine-tuned on MULTIPIT AUTO generate more diverse and higher-quality paraphrases than models trained on other paraphrase datasets.

Comparison with Existing Datasets
MSCOCO (Lin et al., 2014), ParaNMT (Wieting and Gimpel, 2018), and Quora are three widely used datasets in paraphrase generation research (Zhou and Bhat, 2021). The Quora dataset contains over 400K question pairs, including 144K pairs labeled as duplicated (i.e., paraphrases), which are split into 134K/5K/5K train/dev/test sets, respectively. MSCOCO consists of over 120K images, each of which has five captions. Following Chen et al. (2020), for each image, we randomly pick a caption and pair it with each of the other four captions, resulting in about 490K paraphrase pairs. We split them into train/dev/test sets with 330K/80K/80K pairs, respectively. ParaNMT is a dataset with more than 50 million paraphrase pairs that are automatically generated through back-translation. Since back-translation may introduce noise, we use the manually labeled dev and test sets from Chen et al. (2019), which contain 499 and 871 instances, respectively.
MULTIPIT AUTO. We use the best-performing model from Section 3 to extract paraphrase pairs from recent Twitter data (trending topics in Oct-Dec 2021 and URLs in Jan 2020-Jun 2021). We call these automatically identified paraphrase pairs MULTIPIT AUTO, which contains 302,307 pairs. One of the authors manually annotates 215 paraphrase pairs and uses them as the dev set. We use the multi-reference MULTIPIT NMR test set (§2.5) for evaluation. As the test set and MULTIPIT AUTO come from the same time period, we filter out sentence pairs in MULTIPIT AUTO that share similar trends or URLs with the pairs from the test set. This leaves us with 290,395 pairs as the training set.
Following Chen et al. (2019), we remove paraphrase pairs with high BLEU scores from each training set to ensure there is enough variation between paraphrases (a sketch of this filtering step is given below), leaving about 137K pairs for MULTIPIT AUTO, 47K for Quora, 275K for MSCOCO, and 443K for ParaNMT. Table 14 in Appendix F shows that BLEU filtering improves model performance for all datasets. Detailed dataset statistics are provided in Appendix E.
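A minimal sketch of the BLEU-based filtering step follows, assuming sentence-level BLEU between the two sides of each pair is used as the overlap measure; the threshold of 14 is the dev-set value reported in Appendix F, while the exact tokenization settings here are an assumption.

```python
import sacrebleu

def filter_trivial_pairs(pairs, max_bleu=14.0):
    """Drop paraphrase pairs whose surface overlap is too high, so that the
    training data keeps enough lexical variation between the two sides."""
    kept = []
    for src, tgt in pairs:
        # Sentence-level BLEU (0-100 scale) of the target against the source.
        bleu = sacrebleu.sentence_bleu(tgt, [src]).score
        if bleu <= max_bleu:
            kept.append((src, tgt))
    return kept

pairs = [("it's finally Friday", "it's finally Friday!"),
         ("it's finally Friday", "so thankful the weekend starts today")]
print(filter_trivial_pairs(pairs))  # keeps only the lexically diverse pair
```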

Evaluation Metrics
We consider four automated metrics that are commonly used in previous work (Li et al., 2019; Niu et al., 2021) for paraphrase generation: BLEU (Papineni et al., 2002), Self-BLEU (Liu et al., 2021), BERT-Score (Zhang et al., 2020), and BERT-iBLEU (Niu et al., 2021). BLEU is computed between the output and the references. Self-BLEU is computed between the source sentence and the output, which measures surface-form diversity. BERT-Score is also calculated between the source sentence and the output, measuring semantic similarity. BERT-iBLEU is a harmonic mean of BERT-Score and 1−Self-BLEU, encouraging both semantic similarity and diversity. We use SacreBLEU (Post, 2018) to compute BLEU and the bert-score package to compute BERT-Score.
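To make the BERT-iBLEU definition concrete, the sketch below combines a precomputed BERT-Score and Self-BLEU via a harmonic mean. The weighting parameter beta (how much similarity counts relative to diversity) is an assumption here; Niu et al. (2021) describe the weighted variant, and beta = 1 reduces to the plain harmonic mean of BERT-Score and 1−Self-BLEU used as the definition in this section.

```python
def bert_ibleu(bert_score, self_bleu, beta=1.0):
    """Harmonic mean of semantic similarity (BERT-Score) and surface diversity
    (1 - Self-BLEU). Inputs are assumed to be scaled to [0, 1]."""
    diversity = 1.0 - self_bleu
    if bert_score <= 0 or diversity <= 0:
        return 0.0
    # Weighted harmonic mean; beta = 1.0 gives the unweighted version.
    return (beta + 1.0) / (beta / bert_score + 1.0 / diversity)

# Example with illustrative values: BERT-Score 0.86, Self-BLEU 0.33.
print(round(bert_ibleu(0.86, 0.33), 3))
```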

Generation Models
We consider two autoregressive language models, GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020), and two encoder-decoder language models, BART (Lewis et al., 2020) and T5 (Raffel et al., 2020). For GPT-3, we try both zero-shot and few-shot (4 examples) setups using in-context learning without any fine-tuning. For the other models, we fine-tune seven configurations of them on MULTIPIT AUTO. Table 4 shows the test set results of each model and the diversity of human references measured by Self-BLEU. Among all models, the few-shot setting of GPT-3 achieves the highest BERT-iBLEU score, and the zero-shot setting achieves the second-best number, only 1 point behind, which is not surprising given its size. Compared to GPT-3 generations, human references are much more diverse, with a decrease of 24.5 in Self-BLEU in the best case and 13.5 in the average case, indicating that there is still a big gap between large language models and humans. For supervised small-scale models, T5 large outperforms the others with the best Self-BLEU and BERT-iBLEU scores. Although BART large obtains the highest BLEU score, our experiments in Appendix F show that BERT-iBLEU has the best correlation with human evaluation. We thus use T5 large in all the remaining experiments. For all models except GPT-3, we use beam search with beam size 4. Please refer to Appendix C for details on the training setup and hyperparameter tuning. GPT-3 prompting and hyperparameter setup are provided in Appendix D. Generation examples are displayed in Table 16 in Appendix G.
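For reference, decoding with a fine-tuned T5 model and beam search of size 4 (the setting used for all models except GPT-3) can be sketched as follows. The checkpoint path is hypothetical, and feeding the raw sentence without a task prefix is an assumption rather than the paper's exact input format.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical path to a T5-large checkpoint fine-tuned on MULTIPIT AUTO.
model_name = "path/to/t5-large-multipit-auto"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def paraphrase(sentence, num_beams=4, max_length=64):
    """Generate a paraphrase with beam search (beam size 4, as in the paper)."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, num_beams=num_beams, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(paraphrase("it's finally Friday and that's all that matters rn"))
```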
Impact of Data Size. Figure 5 shows test set performance of T5 large fine-tuned on different amounts of data from MULTIPIT AUTO, ranging from 1K to 137K pairs. With more training data, the model generates more diverse and higher-quality paraphrases, as Self-BLEU decreases (improves) and BERT-iBLEU increases. This suggests that paraphrase generation models will benefit from the continually growing size of our MULTIPIT AUTO corpus.

Cross-Dataset Generalization
Building a paraphrase generation model that generalizes to new data is always an ambitious goal.
To better understand the generalizability of each dataset, we fine-tune T5 large on MULTIPIT AUTO, Quora, MSCOCO, and ParaNMT separately and evaluate their performance across datasets. For fair comparison, we use the same architecture, T5 large, in this experiment. Appendix G displays examples generated by these models on each dataset. Table 5 reports the automatic evaluation of models fine-tuned on the four datasets (BL: BLEU, S-B: Self-BLEU, B-S: BERT-Score, B-iB: BERT-iBLEU; bold: the best, underline: the second best). Since Quora and MSCOCO contain only questions or captions, models fine-tuned on them always generate question- or description-style sentences. For example, given "we should take shots.", the model fine-tuned on Quora generates "Why do we take shots?".
We conduct a human evaluation to further compare the MULTIPIT AUTO and ParaNMT datasets, by evaluating 200 randomly sampled generations from the model trained on each corpus (the input is 4 × 50 sentences from each test set). As shown in Table 6, MULTIPIT AUTO's generations receive the highest scores in all three dimensions: fluency, semantic similarity, and diversity. Each generation is rated by three annotators on a 5-point Likert scale per aspect, with 5 being the best. We also show the distribution of human evaluation results on each dimension in Figure 6 for a deeper comparison. Specifically, the MULTIPIT AUTO model generates fewer really poor paraphrases (semantic similarity < 3) and many more diverse paraphrases (diversity > 3). We include our evaluation template in Appendix H. We measure inter-annotator agreement using ordinal Krippendorff's alpha (Krippendorff, 2011), which yields 0.31 for fluency, 0.56 for semantic similarity, and 0.81 for diversity. All values are considered fair to good (Krippendorff, 2004); a minimal sketch of this agreement computation is given below.
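The agreement computation above can be reproduced with the krippendorff Python package, treating the 1-5 Likert ratings as ordinal data. The ratings matrix below is purely illustrative, not the paper's actual annotations.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Ratings matrix: rows = annotators, columns = rated generations,
# values = 1-5 Likert scores for one aspect (illustrative numbers only).
ratings = np.array([
    [5, 4, 3, 2, 4],
    [5, 4, 2, 2, 5],
    [4, 4, 3, 1, 4],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"ordinal Krippendorff's alpha = {alpha:.2f}")
```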
Additionally, we perform a manual inspection and observe that the model fine-tuned on MULTIPIT AUTO generates more diverse kinds of good paraphrases and far fewer poor paraphrases than the one trained on ParaNMT. We define five good paraphrase types and six poor paraphrase types; the definitions and results are shown in Table 7.

Other Related Work
Besides the several frequently used paraphrase datasets mentioned above, a few other paraphrase corpora exist. The MSR Paraphrase corpus (Dolan and Brockett, 2005) contains 5,801 sentence pairs from news articles, but it skews toward over-identification (Das and Smith, 2009) and has high lexical overlap (Rus et al., 2014). PPDB (Ganitkevitch et al., 2013) contains over 220 million phrasal and lexical paraphrases, but no sentential paraphrases. WikiAnswer (Fader et al., 2013) consists of 18 million word-aligned question pairs; however, like Quora, WikiAnswer is restricted to questions. In addition, the Semantic Textual Similarity (STS) shared task (Cer et al., 2017) measures the degree to which two sentences are semantically similar to each other; since it does not make a binary judgment about paraphrase relationships, it is not frequently used in paraphrase research. Recently, Dong et al. (2021) present ParaSci, a large paraphrase dataset in the scientific domain, and Kim et al. (2021) propose BiSECT, a large split-and-rephrase corpus constructed using machine translation. Our work focuses on creating a large paraphrase corpus that contains more diverse and natural human-authored texts and on investigating different paraphrase criteria.

Conclusion
In this paper, we present the Multi-Topic Paraphrase in Twitter (MULTIPIT) corpus. Our work surpasses prior Twitter-based paraphrase corpora in topic diversity as well as in the quality and quantity of annotation. Experimental results demonstrate the necessity of defining paraphrases based on the downstream task. Our paraphrase generation evaluation shows that models trained on our corpus have better generation quality and generalizability than models fine-tuned on existing widely used paraphrase datasets. We believe that MULTIPIT will facilitate further research in both paraphrase identification and paraphrase generation.

Limitations
While our study shows that MULTIPIT AUTO improves paraphrase generation quality and diversity, we observe that models sometimes generate Twitter-specific artifacts (e.g., "@JoeBiden"). Future work could investigate techniques to mine paraphrases from other social media platforms such as Reddit. Another limitation is that our dataset is only in English; future work could extend it to a multilingual setting, as Twitter is used by people from many countries who speak different languages.

A Annotation Interface
A.1 Crowdsourcing

B Data Pre-processing
Both the PIT-2015 (Xu et al., 2015) and Twitter-URL (Lan et al., 2017) datasets share similar preprocessing steps that introduced tokenization and sentence splitting errors. Moreover, PIT-2015 contains some spam patterns, such as "Follow Me PLEASE". We improve the quality of our dataset by fixing the pre-processing methods and removing spam patterns. More importantly, we split tweets into sentences to get cleaner paraphrases (see Table 8 for an example), without the added noise from extra sentences in the tweet. We also improve the sentence splitting script of Xu et al. (2015).
C Training Details

Paraphrase Identification. Hyperparameters for fine-tuning models in the paraphrase identification experiments are given in Table 9.
Paraphrase Generation. Hyperparameters for fine-tuning models in the paraphrase generation experiments are given in Table 10. We use perplexity on the dev set for model selection.
As ParaNMT contains only lowercase letters, we lowercase the input and references for generation and evaluation of the model fine-tuned on ParaNMT, and we lowercase the other models' generations when evaluating on ParaNMT.

D.2 Prompts
Zero-shot setting:
Your task is to generate a diverse paraphrase for a given sentence.
Sentence: {sentence}
Paraphrase:

Few-shot setting:
You will be presented with examples of some input sentences and their paraphrases. Your task is to generate a diverse paraphrase for a given sentence.
(This instruction is followed by four in-context example pairs and the final "Sentence: {sentence} Paraphrase:" query; the example pairs are listed with the other prompt materials later in the appendix.)

E Generation Dataset Statistics
Table 11 presents the detailed statistics of MULTIPIT AUTO, Quora, MSCOCO, and ParaNMT. We also compare model performance on all four datasets with and without BLEU filtering; results are presented in Table 14. Applying BLEU filtering improves model performance, with higher BERT-iBLEU on all datasets.

H Human Evaluation Details
We display our human evaluation instructions for each aspect (fluency, semantic similarity, diversity) in Figures 12, 13, and 14.

[Figure 1 example sentences]
News-article set (formal, similar):
7. In Tibet, climate change causes bigger, faster avalanches.
1. Bigger, faster collapsing glaciers, triggered by climate change
2. Bigger, Faster Avalanches, Triggered by Climate Change in Tibet
8. Bigger, faster avalanches in Tibet, triggered by climate change.
9. @KendraWrites on a study that showed climate change drove cataclysmic avalanches in Tibet

Trending-topic set (informal, diverse):
1. it's finally Friday and that's all that matters rn.
2. So so so so thankful it's finally Friday.
3. It's finally Friday I'm so happiiiiiii.
22. yayayayayyaya it's finally Friday and I have a half day today
20. I'm so happy it's finally Friday duck yeah
21. I've never been so happy that it's finally Friday

Figure 1 :
Figure 1: Two sets of paraphrases in MULTIPIT, discussing a trending topic or a news article, respectively.

Figure 2 :
Figure 2: Topic breakdown on 100 randomly sampled sentence pairs from MULTIPIT CROWD, PIT-2015, and Twitter-URL. Our MULTIPIT CROWD corpus has a more balanced topic distribution.

Figure 3 :
Figure 3: Two different paraphrase definitions used for creating MULTIPIT CROWD and MULTIPIT EXPERT, with examples. The difference between the two criteria is whether Sentence 2, which contains new information that requires fact-checking, is considered a paraphrase of Sentence 1.

Figure 4 :
Figure 4: Test set performance of the model fine-tuned on varying amounts of data in MULTIPIT EXPERT.

Figure 5 :
Figure 5: Test set performance of the model fine-tuned on varying amounts of data in MULTIPIT AUTO, in terms of Self-BLEU (lower is better) and BERT-iBLEU.

Figure 6 :
Figure 6: Human evaluation distributions on generations by models fine-tuned on MULTIPIT AUTO or ParaNMT.

Figure 9 and Figure 10 display screenshots of the instruction and an example question of our crowdsourcing annotation for MULTIPIT CROWD.

A.2 Expert

Figure 11 displays a screenshot of the instruction of our expert annotation for MULTIPIT EXPERT.
Few-shot in-context examples used in the GPT-3 prompt (§D.2):
Sentence: Mike Bloomberg is sending $18 million from his defunct presidential campaign to the DNC.
Paraphrase: Mike Bloomberg is transferring $18M from his campaign to DNC, stretching campaign finance law.
Sentence: Google Assistant on Android can read web pages to you
Paraphrase: Google Assist lets your Android devices read entire web pages aloud
Sentence: Charlie Patino scored a goal on his debut!
Paraphrase: Charlie Patino's debut and he capped it off with a goal.
Sentence: khem birch is the difference maker for the raptors this game
Paraphrase: Khem Birch may be the MVP tonight for the Raptors.
Sentence: {sentence}
Paraphrase:

Figure 7 :
Figure 7: Label distribution of 1200 ratings on 400 generations by models fine-tuned on MULTIPIT AUTO and ParaNMT.

Figure 8 :
Figure 8: MULTIPIT AUTO dev set performance on various BLEU filtering thresholds.

Figure 12 :
Figure 12: Instruction for rating fluency aspect in our human evaluation.

Figure 13 :
Figure 13: Instruction for rating semantic similarity aspect in our human evaluation.

Table 1 :
Statistics of the MULTIPIT CROWD and MULTIPIT EXPERT datasets. The sentence/tweet lengths are calculated based on the number of tokens per unique sentence/tweet. %Multi-Ref denotes the percentage of source sentences with more than one paraphrase. Compared with prior work, our MULTIPIT CROWD dataset has a significantly larger size, a higher portion of paraphrases, and a more balanced topic distribution.

Table 2 :
Results on the test sets of MULTIPIT CROWD and MULTIPIT EXPERT. Models are fine-tuned on the corresponding training set. DeBERTaV3 large performs the best on both datasets. LR: learning rate.

Table 6 :
Human evaluation results on generations by models fine-tuned on MULTIPIT AUTO or ParaNMT.

Table 7 :
Paraphrase types with examples and statistics observed in the generations by models fine-tuned on MULTIPIT AUTO (M AUTO) or ParaNMT. Statistics are based on manual inspection of the generations by each model on 200 sampled sentences. The shown generation example for each type is by the model with the higher value (bold). Example entries include: "Which is the best GRE coaching centre in Bangalore?" → Gen: "what is the best gre training centre . . ." and "Very sad though that the amazing AJ and Kai will be missing the final." → Gen: "AJ and Kai will not be in the final."
[Table 8 example raw tweets]
• Horrible Crash on the Aurora Bridge in Seattle.
• The crash on the Aurora Bridge in Seattle looks horrible. That was the bridge I took to work everyday. Yikes.

Table 8 :
An example pair of raw tweets from our corpus. Annotating at the tweet level would include mismatched content and ambiguity. Cleaner paraphrase annotations can be acquired after sentence splitting.

Table 11 :
Statistics of datasets for paraphrase generation. We calculate sentence length based on the number of tokens per unique sentence. As ParaNMT is too large, we sample 500K pairs for the calculation of sentence length and BLEU. W/o BF denotes without BLEU filtering.

Table 13 :
Spearman correlations with human evaluation on all 400 generations. Here, ***: p < 0.0001, **: p < 0.001, *: p < 0.01.

Correlation Analysis. With human evaluation, we calculate Spearman correlation to evaluate automatic metric quality. Since the four test sets have different numbers of references and MULTIPIT NMR has the most references, to evaluate BLEU we examine 100 generations on MULTIPIT NMR (50 by T5 large fine-tuned on MULTIPIT AUTO and 50 by T5 large fine-tuned on ParaNMT). Results are shown in Table 12. BLEU gets a weak correlation of around |0.2| with all aspects and ∼0.1 with the overall score. Table 13 presents Spearman correlations for Self-BLEU, BERT-Score, and BERT-iBLEU on all 400 generations. BERT-iBLEU outperforms the other two metrics. Because Self-BLEU measures diversity and BERT-Score measures semantic similarity, each metric gets the best correlation with human evaluation on its corresponding aspect but the worst correlation on the other. Notably, Self-BLEU gets the highest correlation with the overall measurement, but the reason is that there is more differentiation in diversity ratings than in semantic similarity ratings, as shown in Figure 7, which makes diversity play the biggest role in the overall score.

BLEU Filtering. We evaluate different BLEU thresholds on the dev set of MULTIPIT AUTO, as shown in Figure 8. The model achieves the best performance at a threshold of 14, which is used across our experiments.

Table 14 :
In-domain test set results of fine-tuning models on data with or without BLEU filtering. W/o BF denotes without BLEU filtering.

Impact of Definition. We investigate how different paraphrase definitions affect generation performance. As shown in Table 15, the model fine-tuned on MULTIPIT AUTO outperforms models fine-tuned on loosely defined data such as MULTIPIT CROWD.

Training data     #Pairs    BL      S-B↓    B-S     B-iB
MULTIPIT CROWD    26,091    36.15   32.09   85.53   74.19
M AUTO-CROWD      326,517   45.55   37.90   85.80   74.12
M AUTO            136,645   41.14   33.34   85.86   77.79

Table 15 :
Test set results of models fine-tuned on data constructed with different paraphrase definitions. MULTIPIT CROWD denotes its annotated paraphrase pairs; M AUTO-CROWD denotes the paraphrase pairs automatically identified by the identifier fine-tuned on MULTIPIT CROWD.

Generation Examples. Table 16 presents generation examples by GPT-3 and fine-tuned T5 large on MULTIPIT NMR. Table 17 presents generation examples by T5 large fine-tuned on MULTIPIT AUTO, Quora, MSCOCO, and ParaNMT.

Multi-Reference Examples. Table 18 displays three examples from the MULTIPIT NMR test set.