How Large Language Models are Transforming Machine-Paraphrase Plagiarism

The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive models in generating machine-paraphrased plagiarism and their detection remains underexplored in the literature. This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software and perform a human study with 105 participants regarding their detection performance and the quality of generated examples. Our results suggest that large language models can rewrite text that humans have difficulty identifying as machine-paraphrased (53% mean accuracy). Human experts rate the quality of paraphrases generated by GPT-3 as high as that of original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3) achieves a 66% F1-score in detecting paraphrases. We make our code, data, and findings publicly available to facilitate the development of detection solutions.


Introduction
Paraphrases are texts that convey the same meaning while using different words or sentence structures (Bhagat and Hovy, 2013). Paraphrasing plays an important role in related language understanding problems (e.g., question answering (McCann et al., 2018), summarization (Rush et al., 2015)), but it can also be misused for academic plagiarism. Academic plagiarism is serious misconduct as its perpetrators can unjustly advance their careers, obtain research funding that could be better spent, and make science less reliable if their misbehavior remains undetected (Meuschke, 2021).
Table 1: Example excerpt from a Wikipedia article and its paraphrased versions using GPT-3. Important keywords are highlighted in boldface and color. Autoregressive paraphrasing with GPT-3 keeps the same message while generating text with the original structure. The original example used is 3747-ORIG-44.txt.
Paraphrasing tools can be used to generate convincing plagiarized texts with minimal effort. Most of these tools (e.g., SpinBot, SpinnerChief) use relatively rudimentary heuristics, such as replacing words with synonyms, yet they already deceive plagiarism detection software (Wahle et al., 2022a). However, these tools only scratch the surface of what large neural language models can achieve in producing convincing, high-quality paraphrases (Zhou and Bhat, 2021). Notably, large autoregressive language models with billions of parameters, such as GPT-3 (Brown et al., 2020), make paraphrase plagiarism effortless yet exceedingly difficult to spot.
So far, large language models have found little application in plagiarism detection. As language models are already easily accessible for applications such as software development or accounting, using them for machine-paraphrasing will soon be as easy as a click of a button. Therefore, the number of machine-plagiarized texts will increase dramatically in the upcoming years. To counteract this problem, we need robust solutions before these models are widely misused.
In this study, we generate machine-paraphrased text with GPT-3 and T5 (Raffel et al., 2020) to compose a dataset for testing the detection of automatically generated paraphrases. We test different configurations of model size, training schemes, and selection criteria for generating paraphrases. To understand how humans perceive machine-paraphrased text, we also performed an extensive study with 105 participants, assessing their detection performance and quality-of-text judgments against existing automated detection methods. We show that while humans can spot paraphrasing by online tools and smaller autoencoding models, large autoregressive models prove to be a more complex challenge, as they can generate human-like text containing the same key ideas and messages as their original counterparts (see Table 1 for an example). Popular paid plagiarism detection software (e.g., PlagScan, Turnitin) is already deceived by rudimentary paraphrasing methods, and large language models make this task even more challenging. We also test the models used for generation, which show the highest performance in detecting machine-paraphrased plagiarism.
To summarize our contributions: • We present a dataset with machine-paraphrased text from T5 and GPT-3, based on original work from Wikipedia, arXiv, and student theses, to train and evaluate detectors of machine-paraphrased plagiarism.
• We explore the human ability to detect paraphrases through three experiments, focusing on (1) the detection difficulty of paraphrasing methods, (2) the quality of examples, and (3) the accuracy of humans in distinguishing between paraphrased and original texts.
• We empirically test plagiarism detection software (i.e., PlagScan) against machine learning methods and neural language models (autoencoding and autoregressive) in detecting machine-paraphrased plagiarism.
• We show that paraphrases from GPT-3 provide the most realistic plagiarism cases that both humans and automated detection solutions fail to spot, while the model itself is the best-tested candidate for detecting paraphrases.

Related Work
Plagiarism Detection: Plagiarism describes the use of ideas, concepts, words, or structures without proper source acknowledgment (Meuschke, 2021). Plagiarism datasets are limited by the small number of known real plagiarism cases. With the recent success of artificial intelligence in natural language processing (NLP) applications, paraphrase generation and plagiarism detection methods increasingly rely on dense text representations and machine learning classifiers (Foltýnek et al., 2019).
Machine learning methods often fail to detect substantial paraphrasing from neural language models (Wahle et al., 2021). In particular, large autoregressive language models (e.g., GPT-3) can generate paraphrased content almost indistinguishable from original work (Witteveen and Andrews, 2019). However, these models are still insufficiently explored in the domain of plagiarism detection, even though their impact on the field is already being discussed (Dehouche, 2021).

Machine-Paraphrase Detection:
Machine-paraphrasing can be described as the automatic generation of text that is semantically close to its source but written in other words (Bhagat and Hovy, 2013). Machine-paraphrasing is experiencing growing research interest in NLP for learning semantic representations and related applications (Rush et al., 2015; McCann et al., 2018). However, paraphrasing can also be misused for plagiarism to deceive humans and thus requires detection solutions (Foltýnek et al., 2019).
Lexical substitution is a common paraphrase mechanism used by plagiarists (Barrón-Cedeño et al., 2013). Many online paraphrasing tools also use synonym replacements and other lexical perturbations to paraphrase text automatically (Foltýnek et al., 2020a). Foltýnek et al. (2020b) showed that machine-learning classifiers (e.g., Support Vector Machines) can easily detect paraphrasing from popular online paraphrasing tools such as SpinBot. Wahle et al. (2021) proposed a benchmark with paraphrased examples from autoencoding models (e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019)), showing that neural language models can generate more challenging paraphrases than traditional online tools (e.g., SpinnerChief, SpinBot). In a follow-up study, Wahle et al. (2022a) evaluated neural language models (e.g., BERT) on paraphrased texts from SpinnerChief, another independent paid online paraphrasing tool. Their main finding was that neural language models outperform machine learning techniques and can obtain super-human performance in all test cases. These results (Foltýnek et al., 2020b; Wahle et al., 2021, 2022a) show that synonym replacements are simple to detect with state-of-the-art neural language models. However, none of these studies explore large autoregressive models in their experiments.
So far, only a few studies have analyzed the impact of plagiarism using autoregressive models. Seq2Seq models were first used by Prakash et al. (2016), who employed stacked residual LSTM networks to generate paraphrases. Witteveen and Andrews (2019) trained GPT-2 to generate paraphrased versions of a source text and selected paraphrase candidates with the highest similarity according to Universal Sentence Encoder (Cer et al., 2018) embeddings and low word overlap with their original counterparts. Biderman and Raff (2022) show that GPT-J (Wang and Komatsuzaki, 2021), a smaller version of GPT-3 with six billion parameters, can plagiarize student programming assignments without being detected by MOSS, a popular plagiarism detection tool. Scaling models allows for the generation of text indistinguishable from human writing (Brown et al., 2020). In addition, the models' increase in size and, consequently, their performance (Kaplan et al., 2020) have the potential to make the paraphrase detection task even more difficult.

Methodology
This study focuses on understanding how humans and machines perceive paraphrase examples generated by large autoregressive models. Therefore, we first generate machine-paraphrased text with different model sizes of GPT-3 and T5. We then generate a dataset composed of 200,000 examples from arXiv (20,966), Wikipedia (39,241), and student graduation theses (5,226) using the best configuration of both models.
We investigate how humans and existing detection solutions perceive this newly automated form of plagiarism.In our human experiments, we compare paraphrased texts generated in this study to existing data that use paid online paraphrasing tools and autoencoding language models to paraphrase their texts.Finally, we evaluate commercial plagiarism detection software, machine-learning classifiers, and neural language model-based approaches to the machine-paraphrase detection task.

Paraphrase Generation
Method: We generate candidate versions of paragraphs using prompts and human paraphrases as examples in a few-shot style prediction (Table 2).
We provide the model with the maximum number of human-paraphrased examples that fit its context window, with a maximum of 2048 tokens total. For both models, we use their default configuration.
The paraphrasing models' goal is to mimic human paraphrases. Instead of manually engineering suitable prompts for the task, we use AutoPrompt (Shin et al., 2020) to determine task instructions based on the model's gradients. As suggested by the authors, we place the predict-token at the end of our prompt. One example of a generated prompt was "Rephrase the following sentence." As humans tend to shorten text when paraphrasing, we limit the maximum number of generated tokens to 90% of the original version, which is the approximate length ratio of human plagiarism fragments in (Barrón-Cedeño et al., 2013). Table 2 provides an example of the model's input/output when generating paraphrases.
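The prompt-assembly and length-capping steps described above can be sketched as follows. The function names, the whitespace-based token count (a crude stand-in for the models' real tokenizers), and the exact prompt layout are our own illustrative assumptions, not the paper's implementation:

```python
def build_fewshot_prompt(instruction, example_pairs, target, budget=2048):
    """Pack as many (original, paraphrase) demonstration pairs as fit
    within an approximate token budget, then append the target text.
    Tokens are approximated by whitespace splitting, a stand-in for
    the models' real tokenizers."""
    n_tokens = lambda s: len(s.split())
    parts = [instruction]
    used = n_tokens(instruction) + n_tokens(target)
    for original, paraphrase in example_pairs:
        block = f"Original: {original}\nParaphrase: {paraphrase}"
        if used + n_tokens(block) > budget:
            break  # context window is full
        parts.append(block)
        used += n_tokens(block)
    # The predict-token ("Paraphrase:") goes at the end of the prompt.
    parts.append(f"Original: {target}\nParaphrase:")
    return "\n\n".join(parts)

def max_generated_tokens(original, ratio=0.9):
    """Cap the generation length at ~90% of the source length,
    mirroring the shortening ratio of human paraphrases."""
    return max(1, int(ratio * len(original.split())))
```

In use, `max_generated_tokens` would be passed to the generation API as the decoding length limit, while `build_fewshot_prompt` produces the conditioning text.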
Candidate Selection: Paraphrases that are similar to their source are of limited value as they have repetitive patterns, while those with high linguistic diversity often make models more robust (Qian et al., 2019). The quality of paraphrases is typically evaluated along three dimensions (i.e., clarity, coherence, and fluency), where high-quality paraphrases are those with high semantic similarity and high lexical and syntactic diversity (McCarthy et al., 2009; Zhou and Bhat, 2021). We aim to choose high-quality examples semantically close to the original content without reusing the exact words and structures (Witteveen and Andrews, 2019).
In this paper, we choose generated candidates that maximize their semantic similarity to their original counterparts while minimizing their count-based similarity. We select the Pareto-optimal candidate that minimizes ROUGE-L and BLEU (i.e., penalizing the exact reuse of words from the original version) and maximizes BERTScore (Zhang et al., 2019) and BARTScore (Yuan et al., 2021) (i.e., encouraging a meaning similar to the original version); we use the large model version for both metrics. Table 3 shows an example of the candidate selection. The P4P database (Barrón-Cedeño et al., 2013) is composed of realistic plagiarism cases annotated with the paraphrase phenomena they contain (e.g., morphology-based, syntax-based, lexicon-based), and the PPDB 2.0 database (Pavlick et al., 2015) is a large-scale paraphrase corpus extracted with bilingual pivoting, from which we extract the high-quality phrasal and lexical subsets.
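The Pareto-optimal selection over the four metrics can be sketched in plain Python. The candidates here are pre-computed score dictionaries, and the tie-breaking rule among Pareto-optimal candidates (semantic scores minus overlap scores) is our own assumption, since the paper does not specify one:

```python
def pareto_front(candidates):
    """Keep candidates not dominated by any other candidate.
    Each candidate is a dict of metric scores: 'rouge_l' and 'bleu'
    are minimized (less word overlap), while 'bertscore' and
    'bartscore' are maximized (closer meaning)."""
    low, high = ("rouge_l", "bleu"), ("bertscore", "bartscore")

    def dominates(a, b):
        no_worse = (all(a[k] <= b[k] for k in low)
                    and all(a[k] >= b[k] for k in high))
        better = (any(a[k] < b[k] for k in low)
                  or any(a[k] > b[k] for k in high))
        return no_worse and better

    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates)]

def select_candidate(candidates):
    """Pick one candidate from the Pareto front. The tie-breaker
    (semantic scores minus overlap scores) is a hypothetical rule."""
    front = pareto_front(candidates)
    score = lambda c: (c["bertscore"] + c["bartscore"]) - (c["rouge_l"] + c["bleu"])
    return max(front, key=score)
```

A candidate with lower overlap and higher semantic similarity than another removes the latter from consideration; among the remaining non-dominated candidates, one is chosen by the tie-breaker.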

Human Evaluation
Our human study aims to understand how participants perceive machine-paraphrased plagiarism compared to original work and human-paraphrased text. We used Amazon's Mechanical Turk (AMT) service to obtain human assessments for paraphrased-text classification. Additionally, we asked experts who have actively published in the plagiarism detection domain over the past five years. To have adequate statistical power in our analyses (Card et al., 2020), we included a total of 105 participants (see Appendix A.1 for demographic information about participants).
In the first part of the human study (Q2 in Section 4), 50 participants were provided with a mutually exclusive choice of whether a text was machine-paraphrased or original and a text field to justify their reasoning. In the second part (Q3 in Section 4), 50 participants from AMT and five experts from the research community were provided with a mutually exclusive choice of 5 points on a Likert scale for each of the three parameters clarity, fluency, and coherence. For the first experiment, each participant evaluated five texts for five models, resulting in 1,250 text evaluations. For the second experiment, each participant evaluated ten texts for three parameters, totaling 1,340 text evaluations.
Following common best practices on AMT (Berinsky et al., 2012), evaluators had to have over a 95% acceptance rate, be located in the United States, and have completed over 1,000 successful tasks. We excluded evaluators' assessments if their explanations were directly copied from the task (> 90% text match), did not match their classification, or were short, vague, or otherwise non-interpretable. Across experiments, 138 assessments (≈10%) were rejected and not included in the experiments.

Research Questions & Experiments
Q1: How does model size influence the quality of generated paraphrases?
A. We ask this question to underline the problem's urgency, as recently released models have a large number of parameters. Figure 1 shows the influence of model size on the similarity scores of generated candidates against their original counterparts on 500 random examples from the PPDB dataset.
With an increasing number of parameters, both models' semantic similarity scores (BERTScore, BARTScore) also rise. T5 shows the highest increase when extending the model from 3 billion parameters to 11 billion. GPT-3 (175B) reaches the overall highest semantic similarity, generating sentences with meanings similar to the source. The models' generated candidates also have higher count-based scores on average, as they often repeat text from the source. As described before, we sample candidates with low count-based scores to avoid word repetition.
We conclude that scaling models' size positively influences their performance at the task of paraphrasing, which agrees with previous research (Kaplan et al., 2020). While the limits and details of scaling models are still unknown, boosting their computing power will allow for more human-like texts to be produced.

Q2. Can humans identify whether a text is original or machine-paraphrased?
A. This question is inspired by the Turing (1950) Test to differentiate machines from humans. To answer this question, we asked participants to assess whether texts were machine-generated (see Appendix A.3 for more details). We compared original work to an online paraphrasing tool (SpinnerChief), autoencoding models (e.g., BERT), and large autoregressive models (T5, GPT-3). Table 4 shows the mean human accuracy (i.e., the ratio of correct assignments to non-neutral assignments per participant) in detecting machine-paraphrased text. The results show that humans can adequately detect the control model with 82% accuracy on average (where 50% is chance-level performance). In contrast, human accuracy at detecting paraphrases produced by autoencoding models was significantly lower, ranging from 61% to 71% over all participants. Plagiarism cases generated by large autoregressive models were usually detected barely above chance (53% for GPT-3 and 56% for T5).
For more information on annotator agreement, please see Appendix A.2. Human abilities to detect machine-paraphrased text appear to decrease with increasing model size and are particularly challenged by autoregressive models, as these can change sentence structure and word order instead of only replacing single words. Our findings on human detection of autoregressive models corroborate recent results (Clark et al., 2021), challenging the common choice of humans as the gold standard.
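The accuracy measure used above, i.e., correct assignments over non-neutral assignments, averaged per participant, can be reproduced with a small helper. The data layout (one list of label/truth pairs per participant) is an illustrative assumption:

```python
def mean_accuracy(per_participant_judgments):
    """Mean human accuracy: for each participant, the ratio of correct
    labels to non-neutral labels, then averaged over participants.
    Each judgment is a (label, truth) pair with labels in
    {'machine', 'original', 'neutral'}."""
    accuracies = []
    for judgments in per_participant_judgments:
        non_neutral = [(label, truth) for label, truth in judgments
                       if label != "neutral"]
        if not non_neutral:
            continue  # participant gave only neutral answers
        correct = sum(label == truth for label, truth in non_neutral)
        accuracies.append(correct / len(non_neutral))
    return sum(accuracies) / len(accuracies)
```

Neutral answers are excluded before computing each participant's ratio, so undecided judgments neither help nor hurt the reported accuracy.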
Q3. How similar are machine-generated paraphrases to human paraphrases? A. We sampled 500 example pairs (i.e., original, human-paraphrased) from the PPDB corpus and paraphrased half of the original versions with GPT-3 (175B) and the other half with T5 (11B). As a proxy for the similarity between original, human-paraphrased, and machine-paraphrased examples, we calculated their similarity using BERTScore. The average BERTScore between human paraphrases and originals (76%) is lower than between machine-generated paraphrases and originals (79%). The similarity between human paraphrases and machine-generated paraphrases is highest (81%). This result suggests that machine-generated paraphrases are typically closer to the human paraphrases than to the originals, which we assume is due to the models' objective of mimicking the human paraphrases provided as generation examples.
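The group-wise comparison above can be sketched as follows. Note that we substitute a bag-of-words cosine similarity for BERTScore, which requires a pretrained model, so this is only a structural illustration of averaging similarities over text pairs:

```python
from collections import Counter
import math

def cosine(a, b):
    """Bag-of-words cosine similarity: a crude, self-contained
    stand-in for BERTScore, which needs a pretrained model."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    na, nb = norm(ca), norm(cb)
    return dot / (na * nb) if na and nb else 0.0

def mean_similarity(pairs, sim=cosine):
    """Average similarity over a list of text pairs, mirroring the
    group-wise comparison (original vs. human vs. machine)."""
    return sum(sim(a, b) for a, b in pairs) / len(pairs)
```

Running `mean_similarity` on (human, original), (machine, original), and (human, machine) pair lists, with a real BERTScore implementation plugged in as `sim`, would reproduce the three averages reported above.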

Q4. How do humans assess the quality of machineparaphrased plagiarism?
A. We asked human annotators to score generated paraphrases according to their clarity, fluency, and coherence (Zhou and Bhat, 2021) (see Appendix A.3 for more details about the questions).
As quality assessments are challenging to evaluate, we increased the requirements for participants. We required the second group of 50 participants to have a higher-education degree (bachelor's, master's, or Ph.D.). We also asked five additional experts who have published at least two peer-reviewed papers on plagiarism detection in the last five years. Each participant annotated ten randomly drawn examples on a Likert scale from 1 to 5 regarding clarity, fluency, and coherence (Zhou and Bhat, 2021).
Table 5 shows the average rating for all 55 participants. While original content achieves the highest rating in all three dimensions, the largest version of GPT-3 achieves similar ratings. SpinnerChief's quality of paraphrases is significantly lower. BERT achieves convincing results as well, partly because its frequency of synonym word changes (15%) is lower than SpinnerChief's (50%), so it generates examples closer to the original text.
Fluency was rated highest for all models, while clarity and coherence were rated lowest. We assume that, as source sentences come from diverse scientific fields, they might already be difficult to understand; thus, paraphrasing can confuse readers when technical terms are used incorrectly. For more information on annotator agreement and the relation between expert judgments and educational degree, please see Appendix A.2.

Q5. How do existing detection methods identify paraphrased plagiarism?
A. To test the detection performance of automated plagiarism detection solutions, we evaluate five methods and compare them to random guesses and a human baseline. We presume automated detection solutions can identify paraphrases better than humans, as Ippolito et al. (2020) showed that large language models are optimized to fool humans at the expense of introducing statistical anomalies that automated solutions can spot. As a de facto solution for plagiarism detection, we test PlagScan, one of the best-performing systems in a comprehensive test conducted by the European Network for Academic Integrity (Foltýnek et al., 2020a). We test a combination of a naïve Bayes classifier and word2vec (Mikolov et al., 2013), and three autoencoding transformers: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and Longformer (Beltagy et al., 2020), which are the best-performing models in the machine-paraphrase detection of (Wahle et al., 2021, 2022a). Additionally, we evaluate the largest versions of T5 and GPT-3 using few-shot prediction.
As paraphrasing models, we choose SpinnerChief, the best-performing paid online paraphrasing tool tested in (Wahle et al., 2022a). SpinnerChief attempts to change every fourth word with a synonym. We use BERT as an autoencoding baseline and set the masking probability to 15% as in (Wahle et al., 2021). As a large autoregressive model, we use GPT-3 175B, the best model in terms of automated similarity metrics and deceiving humans.
Table 6 shows the average F1-macro, except for the human baseline, which shows accuracy. For PlagScan, we assume positive examples when the text match is greater than 50%. Looking at paraphrased plagiarism from SpinnerChief, humans reach between 79% and 85% accuracy on average. PlagScan achieves results up to 7% over the random baseline for Wikipedia articles but close-to-random performance for student theses.
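The evaluation protocol above, binarizing PlagScan's text-match score at 50% and scoring detectors with macro-averaged F1, can be sketched as follows; the function names are our own:

```python
def plagscan_label(text_match_percent, threshold=50.0):
    """Binarize a text-match score: above 50% counts as a positive
    (plagiarized) prediction, as assumed for PlagScan above."""
    return int(text_match_percent > threshold)

def f1_macro(y_true, y_pred):
    """Macro-averaged F1 over the two classes, computed from scratch:
    per-class precision and recall, then the unweighted class mean."""
    per_class = []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append(f1)
    return sum(per_class) / len(per_class)
```

Macro averaging weights both classes equally, so a detector that labels everything as original scores poorly even when originals dominate the test set.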
Results of detection models on BERT paraphrasing show patterns similar to SpinnerChief, as autoencoding models also replace masked words with synonyms. While detection results are generally lower for humans and PlagScan, autoencoding models improve by a significant margin. As pointed out in similar studies (Zellers et al., 2019; Wahle et al., 2021), the models generating the paraphrased content are typically the best at detecting it. The architectural similarity of the autoencoding models gives BERT, RoBERTa, and Longformer the largest performance increase over SpinnerChief. Still, large autoregressive models achieve the best results in detecting machine-paraphrasing from BERT overall, with over 80% F1-score for GPT-3.
While detection results on large autoregressive paraphrasing seem low, the models were not explicitly trained on the task and predict based on prior fine-tuning on other data (upper part of the table) or no fine-tuning (lower part). We assume GPT-3 is the best detection solution because it generated the paraphrased texts. Therefore, we see T5 as a baseline for when the autoregressive paraphrasing model is unknown.
In general, neural detection models reach their highest performance on Wikipedia articles, which we assume is due to their pre-training data containing Wikipedia examples. Student theses pose the most challenging scenario for both humans and neural approaches, as they contain challenging examples and are often written by non-native English speakers. Across experiments, PlagScan is not able to reliably identify machine-paraphrasing. Large autoregressive models make it challenging for PlagScan to find text matches, as phrasal and lexical substitutions can change both the words and their order. The automatic detection results on GPT-3 paraphrasing are alarming, as many of the most widely used models fail to detect its paraphrases. Even though the absolute results of GPT-3 and T5 are low, they perform better than humans at the detection task. Therefore, we assume that, similar to (Vahtola et al., 2021), there exist statistical abnormalities and patterns that automated solutions can leverage to increase their detection performance.

Epilogue
Conclusion: We generated machine-paraphrased plagiarism using large autoregressive models with up to 175 billion parameters, producing convincing paraphrased examples that deceived humans and plagiarism detection solutions. We tested the human ability to detect machine-generated paraphrases of large models and compared their assessments to well-established online tools. We evaluated one plagiarism detection software, one traditional machine-learning model, three autoencoding models, and two large autoregressive models for detecting machine-paraphrased examples. Despite some limitations, our results suggest that large language models may increase the number of automated plagiarism cases through convincing paraphrasing of original work.
Future Work: This study is an initial step toward understanding how large language models can foster illicit activities in the scientific domain. We plan to further examine the similarities and differences between human- and machine-generated paraphrases to understand whether humans have difficulties detecting paraphrases in general. Looking at participants' justifications for classifying machine-generated paraphrases, we plan to analyze common terms and highlights to find possible markers for classification decisions. Beyond English, our approach could be applied to other languages, and could even generate paraphrases from one language to another using multilingual models and data. Finally, as academic plagiarism mainly relies on scientific articles, we want to extend our study to large scientific corpora with high variation across domains and venues (Lo et al., 2020; Wahle et al., 2022b).

Limitations
Although our experiments explore how humans and automated solutions struggle to identify machine-paraphrased examples from large language models, we did not detail the similarities and differences between human- and machine-generated paraphrases.
Comparing human and machine paraphrases, both qualitatively and automatically, would allow for a better understanding of what makes paraphrasing so challenging. As the classifications from our language models currently do not provide references or sources for their results, these models can only be used as a support tool to identify sentences and paragraphs for more detailed deliberation. While our study has the above limitations, its focus was to underline the urgency of the problem of machine-generated plagiarism and to promote better detection solutions in the future.

Ethics Statement
Plagiarism is illegal, unethical, and morally unacceptable in all countries (Kumar and Tripathi, 2013). While the binary classification of machine-paraphrased examples in this study can indicate how automated detection solutions would flag potential plagiarism cases, a team of experts should make the final decision on such cases. False-positive cases of wrongly accused researchers could ruin their careers forever. Therefore, all cases should be carefully evaluated before any final verdict. As this study and related work show (Clark et al., 2021), humans are not reliable enough for paraphrase detection in the age of large neural language models. The difficulty of machine-paraphrase identification makes legal decisions on plagiarism cases particularly complex. We presume paraphrasing with language models will lead to more plagiarists going unnoticed when using large models to generate their paraphrases. One exciting approach to gain transparency would rely on reconstructing the model's potential inputs (Tu et al., 2017; Niu et al., 2019) given the paraphrased version and classifying original candidates using a hybrid approach considering text match and semantic features. We adopted a binary classification of gender for our human evaluation, which we plan to improve in future work to be more inclusive. Therefore, gender might not represent the natural diversity included in our dataset.

Figure 1: Paraphrasing similarity scores for a sample of the dataset with different model sizes of GPT-3 and T5.

Table 2: Example of generating paraphrased plagiarism with few-shot learning. As input, the model receives a prompt and human paraphrase example pairs. After inserting the to-be-paraphrased sentence, the model generates a paraphrased version as the output.

Table 3: Candidate selection of machine-generated paraphrases with an example from

Table 4: Human accuracy in identifying whether paragraphs of scientific papers from the arXiv subset are machine-paraphrased. Human performance ranges from 82% on the control model to 53% on GPT-3 175B. The table compares mean accuracy across five paraphrasing models and shows the results of a two-sample t-test between each model and the SpinnerChief control model (Wahle et al., 2022a). Lowest scores are in boldface.

Table 5: Average scores on a Likert scale from 1 to 5 for machine-generated plagiarism on the Wikipedia test set. Each example is judged by 50 participants with a bachelor's, master's, or Ph.D. degree and five experts from the plagiarism detection community. Standard deviation is shown in parentheses. Highest scores are in boldface.