Improving Paraphrase Detection with the Adversarial Paraphrasing Task

If two sentences have the same meaning, it should follow that they are equivalent in their inferential properties, i.e., each sentence should textually entail the other. However, many paraphrase datasets currently in widespread use rely on a sense of paraphrase based on word overlap and syntax. Can we teach paraphrase detection models instead to identify paraphrases in a way that draws on the inferential properties of the sentences, without over-relying on the lexical and syntactic similarities of a sentence pair? We apply the adversarial paradigm to this question and introduce a new adversarial method of dataset creation for paraphrase identification: the Adversarial Paraphrasing Task (APT), which asks participants to generate semantically equivalent (in the sense of mutually implicative) but lexically and syntactically disparate paraphrases. These sentence pairs can then be used both to test paraphrase identification models (which perform barely better than random) and to improve their performance. To accelerate dataset generation, we explore automation of APT using T5, and show that the resulting dataset also improves accuracy. We discuss implications for paraphrase detection and release our dataset in the hope of making paraphrase detection models better able to detect sentence-level meaning equivalence.


Introduction
Although there are many definitions of 'paraphrase' in the NLP literature, most maintain that two sentences that are paraphrases have the same meaning or contain the same information. Pang et al. (2003) define paraphrasing as "expressing the same information in multiple ways" and Bannard and Callison-Burch (2005) call paraphrases "alternative ways of conveying the same information." Ganitkevitch et al. (2013) write that "paraphrases are differing textual realizations of the same meaning." A definition that seems to sufficiently encompass the others is given by Bhagat and Hovy (2013): "paraphrases are sentences or phrases that use different wording to convey the same meaning." However, even that definition is somewhat imprecise, as it lacks clarity on what it assumes 'meaning' means.
If paraphrasing is a property that can hold between sentence pairs, then it is reasonable to assume that sentences that are paraphrases must have equivalent meanings at the sentence level (rather than exclusively at the levels of individual word meanings or syntactic structures). A useful test here is the one recommended by inferential role semantics, or inferentialism (Boghossian, 1994; Peregrin, 2006), which holds that the meaning of a statement s is grounded in its inferential properties: what one can infer from s, and from what s can be inferred.
Building on this concept from inferentialism, we assert that if two sentences have the same inferential properties, then they should also be mutually implicative. Mutual Implication (MI) is a binary relationship between two sentences that holds when each sentence textually entails the other (i.e., bidirectional entailment). MI is an attractive way of operationalizing the notion of two sentences having "the same meaning," as it focuses on inferential relationships between sentences (properties of the sentences as wholes) rather than on syntactic or lexical similarities (properties of parts of the sentences). As such, we will assume in this paper that two sentences are paraphrases if and only if they are MI. In NLP, modeling inferential relationships between sentences is the goal of the textual entailment, or natural language inference (NLI), task (Bowman et al., 2015). We test MI using the RoBERTa-large model released by Nie et al. (2020), trained on a combination of SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018), FEVER-NLI (Nie et al., 2019), and ANLI (Nie et al., 2020).
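For illustration, the bidirectional entailment check behind MI can be implemented with an off-the-shelf NLI model. The sketch below is minimal and assumes a Hugging Face checkpoint comparable to the Nie et al. (2020) RoBERTa-large model; the model identifier is an assumption and any NLI checkpoint trained on SNLI/MNLI/FEVER-NLI/ANLI could be substituted.

```python
# Minimal sketch of a Mutual Implication (MI) check via bidirectional entailment.
# The checkpoint name below is an assumption, not a prescription.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def entails(premise: str, hypothesis: str) -> bool:
    """Return True if the model predicts 'entailment' for premise -> hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    pred = logits.argmax(dim=-1).item()
    # Label order varies across checkpoints, so read it from the config rather than hard-coding.
    return model.config.id2label[pred].lower().startswith("entail")

def mutually_implicative(s1: str, s2: str) -> bool:
    """MI holds when each sentence textually entails the other."""
    return entails(s1, s2) and entails(s2, s1)
```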
Owing to rapid progress in NLP research, model performance on benchmark datasets is 'plateauing': near-human performance is often achieved within a year or two of a benchmark's release, and newer versions taking a different approach constantly have to be created, as with GLUE (Wang et al., 2019) and SuperGLUE (Wang et al., 2020). The adversarial paradigm of dataset creation (Jia and Liang, 2017a,b; Bras et al., 2020; Nie et al., 2020) has been widely used to address this 'plateauing,' and the ideas presented in this paper draw inspiration from it. In the remainder of this paper, we apply the adversarial paradigm to the problem of paraphrase detection, and demonstrate the following novel contributions:

• We use the adversarial paradigm to create a new benchmark examining whether paraphrase detection models are assessing the meaning equivalence of sentences rather than being over-reliant on word-level measures. We do this by collecting paraphrases that are MI but as lexically and syntactically disparate as possible (as measured by low BLEURT scores). We call this the Adversarial Paraphrasing Task (APT).
• We show that a SOTA language model trained on paraphrase datasets performs poorly on our benchmark. However, when further trained on our adversarially generated datasets, its MCC score improves by up to 0.307.
• We train a paraphrase generation model to perform our adversarial task, creating another large dataset that further improves paraphrase detection models' performance.
• We propose a way to create a machine-generated adversarial dataset and discuss ways to ensure it does not suffer from the plateauing that other datasets suffer from.

Related Work
Paraphrase detection (given two sentences, predict whether they are paraphrases) (Zhang and Patrick, 2005; Fernando and Stevenson, 2008; Socher et al., 2011; Jia et al., 2020) is an important task in the field of NLP, finding downstream applications in machine translation (Callison-Burch et al., 2006; Apidianaki et al., 2018; Mayhew et al., 2020), text summarization, plagiarism detection (Hunt et al., 2019), question answering, and sentence simplification (Guo et al., 2018). Paraphrases have proven to be a crucial part of NLP and language education, with research showing that paraphrasing helps improve reading comprehension skills (Lee and Colln, 2003; Hagaman and Reid, 2008). Question paraphrasing is an important step in knowledge-based question answering systems for matching questions asked by users with knowledge-based assertions (Fader et al., 2014; Yin et al., 2015).
Paraphrase generation (given a sentence, generate its paraphrase) (Gupta et al., 2018) is an area of research that benefits paraphrase detection as well. Many paraphrasing datasets have been introduced for training and testing ML models for both paraphrase detection and generation. MSRP (Dolan and Brockett, 2005) contains 5,801 sentence pairs extracted with heuristic techniques and an SVM-based classifier, each labeled with a binary human judgment of paraphrase; annotators found 67% of the pairs to be semantically equivalent. The English portion of PPDB (Ganitkevitch et al., 2013) contains over 220M paraphrase pairs generated by meaning-preserving syntactic transformations. Paraphrase pairs in PPDB 2.0 (Pavlick et al., 2015) include fine-grained entailment relations, word embedding similarities, and style annotations. TwitterPPDB (Lan et al., 2017) consists of 51,524 sentence pairs captured from Twitter by linking tweets through shared URLs. This approach's merit is its simplicity, as it involves neither a classifier nor a human-in-the-loop to generate paraphrases; human annotators then assign each pair a similarity score ranging from 1 to 6.
ParaNMT (Wieting and Gimpel, 2018) was created by using neural machine translation to translate the English side of a Czech-English parallel corpus (CzEng 1.6 (Bojar et al., 2016)), generating more than 50M English-English paraphrases. However, ParaNMT's use of machine translation models that are a few years old harms its utility (Nighojkar and Licato, 2021), considering the rapid improvement in machine translation in the past few years. To rectify this, we use the google-translate library to translate the Czech side of roughly 300k CzEng 2.0 (Kocmi et al., 2020) sentence pairs ourselves. We call this dataset ParaParaNMT (PP-NMT for short, where the extra 'para-' prefix reflects its similarity to, and conceptual derivation from, ParaNMT). Some work has been done on improving the quality of paraphrase detectors by training them on datasets with more lexical and syntactic diversity. Thompson and Post (2020) propose a paraphrase generation algorithm that penalizes the production of n-grams present in the source sentence; our approach to encouraging such diversity is APT, though their technique is worth exploring further. Sokolov and Filimonov (2020) use a machine translation model to generate paraphrases, much like ParaNMT. An interesting application of paraphrasing is discussed by Mayhew et al. (2020), who, given a sentence in one language, generate a diverse set of correct translations (paraphrases) that humans are likely to produce. In comparison, our work is focused on generating adversarial paraphrases that are likely to deceive a paraphrase detector, and models trained on the adversarial datasets we produce can be applied to Mayhew et al.'s work as well.
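As an illustration of the PP-NMT construction step, the sketch below pairs each existing English sentence with a fresh machine translation of its Czech counterpart. It assumes the community googletrans package as the translation client; the exact tooling, batching, and rate-limit handling used for the full 300k pairs are not reproduced here.

```python
# Hypothetical sketch of turning CzEng 2.0 Czech-English pairs into English-English
# paraphrase pairs (PP-NMT style). The googletrans package and its synchronous API
# are assumptions; any Czech-to-English MT service could be substituted.
from googletrans import Translator

translator = Translator()

def make_paraphrase_pair(czech_sentence: str, english_reference: str) -> tuple:
    """Pair the existing English side with a fresh machine translation of the Czech side."""
    translation = translator.translate(czech_sentence, src="cs", dest="en").text
    return english_reference, translation
```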
ANLI (Nie et al., 2020), a dataset designed for Natural Language Inference (NLI) (Bowman et al., 2015), was collected via an adversarial human-and-model-in-the-loop procedure in which humans are tasked with duping the model into making a wrong prediction; the model then tries to learn how not to make the same mistakes. AFLite (Bras et al., 2020) adversarially filters dataset biases, making sure that models are not simply learning those biases; its authors show that model performance on SNLI (Bowman et al., 2015) drops from 92% to 62% when biases are filtered out. However, their approach filters the dataset, which reduces its size and makes model training more difficult. Our present work instead generates adversarial examples to increase dataset size. Other examples of adversarial datasets in NLP include the work of Jia and Liang (2017a) and Zellers et al. (2018, 2019). Perhaps the closest to our work is PAWS (Zhang et al., 2019), short for Paraphrase Adversaries from Word Scrambling. The idea behind PAWS is to create a dataset of sentence pairs with high lexical overlap that are not necessarily paraphrases. It contains 108k paraphrase and non-paraphrase pairs with high lexical overlap, generated by controlled word swapping and back-translation, with human raters judging whether or not each pair is a paraphrase. Including PAWS in the training data has been shown to raise state-of-the-art models' performance from 40% to 85% on PAWS's test split. In comparison to the present work, PAWS does not explicitly incorporate inferential properties, and we seek paraphrases that minimize lexical overlap.

Adversarial Paraphrasing Task (APT)
Semantic Textual Similarity (STS) measures the degree of semantic similarity between two sentences. Popular approaches to calculating STS include BLEU (Papineni et al., 2002), BERTScore (Zhang et al., 2020), and BLEURT (Sellam et al., 2020). BLEURT is a text generation metric that builds on BERT's (Devlin et al., 2019) contextual word representations: it is warmed up on synthetic sentence pairs and then fine-tuned on human ratings, allowing it to generalize better than BERTScore (Zhang et al., 2020). Given any two sentences, BLEURT assigns them a similarity score (usually between -2.2 and 1.1). However, high STS scores do not necessarily indicate that two sentences have equivalent meanings. Consider the sentence pairs in Table 3, which highlight cases where STS and paraphrase appear to misalign. The existence of such cases suggests a way to advance automated paraphrase detection: through an adversarial benchmark consisting of sentence pairs that have the same MI-based meaning but BLEURT scores that are as low as possible. This is the motivation behind what we call the Adversarial Paraphrasing Task (APT), which has two components (see the sketch after this list):

1. Similarity of meaning: checked through MI (Section 1). We assume that if two sentences are MI (mutually implicative), they are semantically equivalent and thus paraphrases. Note that MI is a binary relationship, so this component does not introduce any quantitative variation; it is a qualifying test for APT, and all APT-passing sentence pairs are MI.

2. Dissimilarity of structure: measured through BLEURT, which assigns each sentence pair a score quantifying how lexically and syntactically similar the two sentences are.
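Putting the two components together, a sentence pair passes APT when it is MI and its BLEURT score falls at or below a threshold (0.5 in our human study). A minimal sketch, assuming the mutually_implicative helper sketched earlier and the google-research bleurt package; the checkpoint path and scoring API details are assumptions.

```python
# Sketch of the APT pass/fail check: mutually implicative AND low BLEURT similarity.
# Assumes the google-research `bleurt` package; checkpoint path may differ.
from bleurt import score as bleurt_score

BLEURT_THRESHOLD = 0.5
scorer = bleurt_score.BleurtScorer("BLEURT-20")  # checkpoint path is an assumption

def passes_apt(source: str, paraphrase: str) -> bool:
    bleurt = scorer.score(references=[source], candidates=[paraphrase])[0]
    return mutually_implicative(source, paraphrase) and bleurt <= BLEURT_THRESHOLD
```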

Manually Solving APT
To test the effectiveness of APT in guiding the generation of mutually implicative but lexically and syntactically disparate paraphrases for a given sentence, we designed an Amazon Mechanical Turk (mTurk) study (Figure 1).

Figure 1: The mTurk study and the reward calculation. We automatically end the study when a subject earns a total of $20 to ensure variation amongst subjects.

Given a starting sentence, we instructed participants to "[w]rite a sentence that is the same in meaning as the given sentence but as structurally different as possible. Your sentence should be such that you can infer the given sentence from it AND vice-versa. It should be sufficiently different from the given sentence to get any reward for the submission. For example, a simple synonym substitution will most likely not work." The sentences given to the participants came from MSRP and PP-NMT (Section 2). Both of these datasets contain pairs of sentences in each row, and we presented only the first sentence of each pair to the participants; neither dataset has duplicate sentences by design. Every time a sentence was selected, a random choice was made between MSRP and PP-NMT, ensuring an even distribution of sentences from both datasets. Each attempt was evaluated separately using Equation 1, in which mi is 1 when the sentences are MI and 0 otherwise. This formula was designed to ensure that (1) the maximum reward per submission was $1, and (2) no reward was granted for sentence pairs that are non-MI or have BLEURT > 0.5. Participants were encouraged to frequently revise their sentences and click a 'Check' button, which showed them the reward amount they would earn if they submitted the current sentence. Once the 'Check' button was clicked, the participant's reward was evaluated (see Figure 1) and the sentence pair added to AP_H (regardless of whether it passed APT). If 'Submit' was clicked, the attempt was rewarded according to Equation 1. The resulting dataset of sentence pairs, which we call AP_H (Adversarial Paraphrase by Humans), consists of 5,007 human-generated sentence pairs, both MI and non-MI (see Table 2).

Table 2: Proportion of sentences generated by humans (AP_H) and T5-base (AP_T5). "Attempts" shows the number of attempts made, and "Uniques" shows the number of unique source sentences whose attempts fall in that category. For instance, 1,631 unique sentences were presented to humans, who made a total of 5,007 attempts to pass APT and succeeded in 2,659 attempts, corresponding to 1,231 unique source sentences that could be paraphrased to pass APT.

Humans were able to generate APT-passing paraphrases for 75.48% of the sentences presented to them, yet only 53.1% of attempts passed APT, showing that the task is difficult even for humans. Note that 'MI attempts' and 'MI uniques' are supersets of 'APT attempts' and 'APT uniques,' respectively.
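Since Equation 1 itself is not reproduced here, the sketch below encodes only the constraints stated above (the mi term gates the reward, no reward above a BLEURT of 0.5, and a $1 cap); the particular linear ramp is an illustrative assumption, not the exact form of Equation 1.

```python
# Illustrative reward computation for one APT attempt. Satisfies the stated constraints:
# no reward for non-MI pairs or BLEURT > 0.5, and a maximum reward of $1 per submission.
# The linear ramp between those endpoints is an assumption.
def apt_reward(mi: int, bleurt: float) -> float:
    """mi is 1 if the pair is mutually implicative, 0 otherwise."""
    if mi == 0 or bleurt > 0.5:
        return 0.0
    # $0 at BLEURT = 0.5, rising linearly and capped at $1 for very dissimilar pairs.
    return min(1.0, max(0.0, 1.0 - 2.0 * bleurt))
```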

Automatically Solving APT
Since human studies can be time-consuming and costly, we also trained a paraphrase generator to perform APT. We used T5-base (Raffel et al., 2020), as it achieves SOTA on paraphrase generation (Niu et al., 2020; Bird et al., 2020), and trained it on TwitterPPDB (Section 2). Our hypothesis was that if T5-base is trained to maximize the APT reward (Equation 1), its generated sentences will be more likely to pass APT. We generated paraphrases for sentences in MSRP and in TwitterPPDB itself, expecting that since T5-base is trained on TwitterPPDB, it would generate better paraphrases (MI with lower BLEURT) for sentences coming from that dataset. The proportion of sentences generated by T5-base is shown in Table 2. We call this dataset AP_T5, and its generation involved two phases:

Training: To adapt T5-base for APT, we implemented a custom loss function obtained by dividing the cross-entropy loss per batch by the total reward (again from Equation 1) earned by the model's paraphrase generations for that batch, provided the model reached a reward of at least 1; otherwise, the loss was simply the cross-entropy loss. We trained T5-base on TwitterPPDB for three epochs; each epoch took about 30 hours on one NVIDIA Tesla V100 GPU due to the CPU-bound BLEURT component. More epochs might yield better results, but our experiments showed that the loss plateaus after three epochs.

Generation: Sampling, or randomly picking the next word according to its conditional probability distribution, introduces non-determinism in language generation. Fan et al. (2018) introduce top-k sampling, which keeps the k most likely next words and redistributes the probability mass among only those k words. Nucleus sampling (or top-p sampling) (Holtzman et al., 2020) instead reduces the options to the smallest possible set of words whose cumulative probability exceeds p and redistributes the probability mass among that set; the set of candidate words thus changes dynamically according to the next word's probability distribution. We use a combination of top-k and top-p sampling with k = 120 and p = 0.95 in the interest of lexical and syntactic diversity in the paraphrases. For each sentence in the source dataset (MSRP for AP_MT5 and TwitterPPDB for AP_TwT5), we perform up to five iterations, in each of which we generate ten sentences. If at least one of these ten sentences passes APT, we record all attempts, classify them as MI or non-MI, and continue to the next source sentence. If no sentence in a maximum of 50 attempts passes APT, we nonetheless record all attempts and move on to the next source sentence. With each successive iteration for a particular source sentence, we increase k by 20 but reduce p by 0.05 to avoid vague guesses. Note that the distribution of MI and non-MI pairs in the source datasets does not matter, because we use only the first sentence from each sentence pair.
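The iterative sampling schedule can be summarized in a short sketch. It assumes a fine-tuned T5 checkpoint plus the passes_apt and mutually_implicative helpers sketched earlier; the task prefix, maximum length, and checkpoint name are assumptions.

```python
# Sketch of the iterative sampling schedule used to generate AP_T5.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")                 # fine-tuned weights assumed
gen = T5ForConditionalGeneration.from_pretrained("t5-base")  # placeholder checkpoint

def generate_attempts(source: str, iterations: int = 5, per_iter: int = 10):
    """Return (paraphrase, is_mi) attempts; stop early once one attempt passes APT."""
    k, p = 120, 0.95
    attempts = []
    for _ in range(iterations):
        inputs = tok("paraphrase: " + source, return_tensors="pt")  # task prefix is an assumption
        outputs = gen.generate(
            **inputs,
            do_sample=True, top_k=k, top_p=p,
            num_return_sequences=per_iter, max_length=64,
        )
        candidates = [tok.decode(o, skip_special_tokens=True) for o in outputs]
        attempts.extend((c, mutually_implicative(source, c)) for c in candidates)
        if any(passes_apt(source, c) for c in candidates):
            break
        k += 20   # widen the candidate pool each iteration
        p -= 0.05  # but tighten nucleus sampling to avoid vague guesses
    return attempts
```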

Dataset Properties
T5-base trained with our custom loss function generated APT-passing paraphrases for 56.19% of starting sentences. This is higher than we initially expected, considering how difficult APT proved to be for humans (Table 2). It is noteworthy, however, that only 6.09% of T5-base's attempts passed APT. This does not mean that the remaining 94% of attempts can be discarded, since they provide the negative examples in the dataset. Because we trained T5-base on TwitterPPDB itself, we expected it to generate better paraphrases, as measured by a higher chance of passing APT, on TwitterPPDB than on any other dataset we tested. This is supported by Table 2, which shows that T5-base was able to generate an APT-passing paraphrase for 84.8% of the sentences in TwitterPPDB. The composition of the three adversarial datasets can also be found in Table 2. These metrics are useful for understanding the capabilities of T5-base as a paraphrase generator and the "paraphrasability" of sentences in MSRP and TwitterPPDB. For instance, T5-base's attempts on TwitterPPDB tend to be MI much less frequently than its attempts on MSRP or humans' attempts on MSRP + PP-NMT. This might be because, in trying to generate syntactically dissimilar sentences, the T5-base paraphraser also ended up generating many semantically dissimilar ones.
To visualize the syntactic and lexical disparity of paraphrases in the three adversarial datasets, we present their BLEURT distributions in Figure 2. As might be expected, the likelihood of a sentence pair being MI increases as its BLEURT score increases (recall that APT-passing sentence pairs are simply MI pairs with BLEURT scores <= 0.5), but Figure 2 shows that the shape of this increase is not straightforward and differs among the three datasets.
As might be expected, humans are much more skilled at APT than T5-base, as shown by the fact that the paraphrases they generated have much lower mean BLEURT scores (Figure 2) and a much higher ratio of APT-passing to non-APT-passing sentences (Table 2). As we saw earlier, when T5-base wrote paraphrases with low BLEURT scores, they tended to be non-MI (e.g., line 12 in Table 3). However, T5-base did generate more APT-passing sentences with lower BLEURT on TwitterPPDB than on MSRP, which may be a result of overfitting T5-base on TwitterPPDB. Furthermore, all three adversarial datasets have a distribution of MI and non-MI sentence pairs balanced enough to train a model to identify paraphrases. Table 3 shows examples from AP_H and AP_T5 illustrating the merits and shortcomings of T5, BLEURT, and RoBERTa-large (the MI detector we used). Some observations from Table 3:

• Lines 1 and 3: BLEURT did not recognize the paraphrases, possibly due to the differences in words used. RoBERTa-large, however, gave the correct MI prediction (though it is worth noting that the sentences in line 1 are questions rather than truth-apt propositions).
• Line 4: RoBERTa-large, and to a large extent BLEURT (which gave the pair a score of 0.4), did not recognize that the idiomatic phrase 'break a leg' means 'good luck' and not 'fracture.'

• Lines 6 and 12: There is a loss of information going from the first sentence to the second, and both BLEURT and the MI detector seem to have understood the difference between summarization and paraphrasing.
• Line 7: T5 not only understood the scores but also managed to paraphrase the sentence in a way that was not syntactically or lexically similar, exactly as we wanted it to do when we fine-tuned it.
• Line 9: T5-base knows that Fort Lauderdale is in Florida, but RoBERTa-large does not.

Experiments
To quantify our datasets' contributions, we designed experiments in which we trained RoBERTa-base (Liu et al., 2019) for paraphrase detection on combinations of TwitterPPDB and our datasets as training data. RoBERTa was chosen for its generality: it is a commonly used model in current NLP work and benchmarking, and currently achieves SOTA or near-SOTA results on a majority of NLP benchmark tasks (Wang et al., 2019, 2020; Chen et al., 2021).
For each source sentence, multiple paraphrases may have been generated. Hence, to avoid data leakage, we created a train-test split on AP_H such that all paraphrases generated from a given source sentence are either in AP_H-train or in AP_H-test, but never in both. Note that AP_H is not balanced, as seen in Table 2. Table 4 shows the distribution of MI and non-MI pairs in AP_H-train and AP_H-test, and the 'MI attempts' and 'non-MI attempts' columns of Table 2 show the same for the other adversarial datasets. The test sets used were AP_H whenever AP_H-train was not part of the training data, and AP_H-test in every case.
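A leakage-free split of this kind can be produced by grouping on the source sentence, as in the sketch below; the record schema (field names) is an assumption for illustration.

```python
# Sketch of a leakage-free split: all paraphrases of the same source sentence end up
# on the same side of the AP_H train/test split. Field names are assumptions.
from sklearn.model_selection import GroupShuffleSplit

def split_by_source(pairs, test_size=0.2, seed=42):
    """pairs: list of dicts with 'source', 'paraphrase', and 'label' keys (assumed schema)."""
    groups = [p["source"] for p in pairs]
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(pairs, groups=groups))
    return [pairs[i] for i in train_idx], [pairs[i] for i in test_idx]
```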
Does RoBERTa-base do well on AP_H? RoBERTa-base was trained on each training dataset (90% training data, 10% validation data) for five epochs with a batch size of 32, with the training and validation data shuffled, and the trained model was tested on AP_H and AP_H-test. The results are shown in Table 6. Note that since the number of MI and non-MI sentences in all the datasets is imbalanced, the Matthews Correlation Coefficient (MCC) is a more appropriate performance measure than accuracy (Boughorbel et al., 2017).
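The evaluation metric itself is straightforward to compute; a minimal sketch, assuming lists of gold and predicted binary labels from the fine-tuned detector:

```python
# Sketch of scoring a paraphrase detector with MCC, which is robust to the
# MI / non-MI class imbalance, alongside plain accuracy for reference.
from sklearn.metrics import matthews_corrcoef, accuracy_score

def evaluate(gold_labels, predicted_labels):
    return {
        "mcc": matthews_corrcoef(gold_labels, predicted_labels),
        "accuracy": accuracy_score(gold_labels, predicted_labels),
    }
```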
Our motivation for creating an adversarial dataset was to improve the performance of paraphrase detectors by ensuring they recognize paraphrases with low lexical overlap. To demonstrate the extent of their current inability to do so, we first compare the performance of RoBERTa-base trained only on TwitterPPDB on specific datasets, as shown in Table 5. Although the model performs reasonably well on MSRP, it does barely better than random prediction on AP_H, showing that identifying adversarial paraphrases created using APT is nontrivial for paraphrase identifiers.
Do human-generated adversarial paraphrases improve paraphrase detection? We added AP_H-train to the training data along with TwitterPPDB. This improves the MCC by 0.222 even though AP_H-train constituted just 8.15% of the entire training dataset, the rest of which was TwitterPPDB (Table 6). This shows the effectiveness of human-generated paraphrases, which is especially impressive given the size of AP_H-train compared to TwitterPPDB.
Do machine-generated adversarial paraphrases improve paraphrase detection? We set out to test the improvement brought by AP_T5, of which we have two versions. Adding AP_MT5 to the training set was not as effective as adding AP_H-train, increasing MCC by 0.188 on AP_H and 0.151 on AP_H-test. This shows that T5-base, although able to pass APT, lacked the quality that human paraphrases possess. This might be explained by Figure 2: since AP_MT5 does not contain many sentences with low BLEURT scores, we cannot expect a vast improvement in RoBERTa-base's performance on sentences with BLEURT scores as low as those in AP_H.
Since we were not necessarily testing T5-base's performance, and we had trained T5-base on TwitterPPDB, we used the trained model to perform APT on TwitterPPDB itself. As expected, training RoBERTa-base (the paraphrase detector) with AP_TwT5 yielded higher MCCs. Note that no sentences are shared between AP_TwT5 and AP_H, since AP_H is built on MSRP and PP-NMT; that the model reached this performance when trained on AP_TwT5 is a testament to the quality and contribution of APT.
Combining these results, we conclude that although machine-generated datasets like AP_T5 can help improve paraphrase detectors, a smaller dataset of human-generated adversarial paraphrases improved performance more. Overall, however, the highest MCC (0.525 in Table 6) is obtained when TwitterPPDB is combined with all three adversarial datasets, suggesting that the two approaches complement each other nicely.

Discussions and Conclusions
This paper introduced the Adversarial Paraphrasing Task (APT), a task that uses the adversarial paradigm to generate paraphrases: sentence pairs with equivalent (sentence-level) meanings but low lexical (word-level) and syntactic similarity. We used APT to create a human-generated dataset and benchmark (AP_H) and two machine-generated datasets (AP_MT5 and AP_TwT5). Our goal was to effectively augment how paraphrase detectors are trained, in order to make them less reliant on word-level similarity. In this respect, the present work succeeded: we showed that RoBERTa-base trained on TwitterPPDB performed poorly on APT benchmarks, but this performance increased significantly when the model was further trained on either our human- or machine-generated datasets. The code used in this paper, along with the dataset, has been released in a publicly available repository. Paraphrase detection and generation have broad applicability, but most of their potential lies in areas to which they have not yet been substantially applied. These areas range from healthcare (improving the accessibility of medical communications or concepts by automatically generating simpler language), to writing (changing the style of an article to match phrasing a reader can better understand), to education (simplifying the language of a scientific paper or lesson to make it easier for students to understand). Thus, future research into improving their performance can be very valuable. But approaches that treat paraphrase as no more than a matter of detecting word-similarity overlap will not suffice for these applications. Rather, the meanings of sentences are properties of the sentences as wholes, and are inseparably tied to their inferential properties; our approaches to paraphrase detection and generation must follow suit.
The adversarial paradigm can be used to dive deeper into comparing how humans and SOTA language models understand sentence meaning, as we did with APT. Furthermore, automatic generation of adversarial datasets has much unrealized potential; e.g., different datasets, paraphrase generators, and training approaches can be used to create future versions of AP_T5 that produce APT-passing sentence pairs with even lower lexical and syntactic similarity (as measured not only by BLEURT but also by future state-of-the-art STS metrics). The idea of more efficient automated adversarial task performance is particularly exciting, as it points to a way language models can improve themselves while avoiding prohibitively expensive human participant fees.
Finally, the most significant contribution of this paper, APT, presents a dataset creation method for paraphrases that will not saturate: as models get better at identifying paraphrases, we will improve paraphrase generation, and as models get better at generating paraphrases, we can make APT harder (e.g., by lowering the BLEURT threshold below 0.5). One might think of this as analogous to students who come up with new ways of copying their assignments as plagiarism detectors improve. That brings us to one of the many applications of paraphrasing: plagiarism generation and detection, which is inherently an adversarial activity. Until plagiarism detectors are trained on adversarial datasets themselves, we cannot expect them to capture human levels of adversarial paraphrasing.