Recovering Lexically and Semantically Reused Texts

Writers often repurpose material from existing texts when composing new documents. Because most documents have more than one source, we cannot trace these connections using only models of document-level similarity. Instead, this paper considers methods for local text reuse detection (LTRD), detecting localized regions of lexically or semantically similar text embedded in otherwise unrelated material. In extensive experiments, we study the relative performance of four classes of neural and bag-of-words models on three LTRD tasks: detecting plagiarism, modeling journalists' use of press releases, and identifying scientists' citation of earlier papers. We conduct evaluations on three existing datasets and a new, publicly available citation localization dataset. Our findings shed light on a number of previously unexplored questions in the study of LTRD, including the importance of incorporating document-level context for predictions, the applicability of off-the-shelf neural models pretrained on "general" semantic textual similarity tasks such as paraphrase detection, and the trade-offs between more efficient bag-of-words and feature-based neural models and slower pairwise neural models.


Introduction
When composing documents in many genres, from news reports to scientific papers to political speeches, authors obtain ideas and inspiration from source documents and present them in the form of direct copies, quotations, summaries, or paraphrases. In the simplest case, e.g. in congressional bills, writers include text from earlier versions of the same document along with new material (Wilkerson et al., 2015). In news media, journalists often paraphrase or quote speeches, press releases, and interviews (Niculae et al., 2015; Tan et al., 2016). In academia, citations of papers usually appear along with summaries of their contributions (Qazvinian and Radev, 2010). These are instances of lexical and semantic local text reuse, where both source and target documents contain lexically or semantically similar passages, surrounded by text that is unrelated or dissimilar. Often, reused text is presented without explicit links or citations, making it hard to track information flow.
While many state-of-the-art (SoTA) NLP architectures have been trained on the closely-related tasks of document- and sentence-pair similarity detection (Reimers and Gurevych, 2019) and ad-hoc retrieval (Dai and Callan, 2019), prior methods for local text reuse detection (LTRD) are mostly limited to lexical matching (Lee, 2007; Clough et al., 2002; Leskovec et al., 2009; Wilkerson et al., 2015; Smith et al., 2014) with some dictionary expansion (Moritz et al., 2016). To our knowledge, only Zhou et al. (2020) have applied neural models to this problem, proposing hierarchical neural models that use a cross-document attention mechanism to model local similarities between two candidate documents.
In this paper, we conduct a large-scale evaluation of several lexical overlap and SoTA neural models for LTRD. Among the neural models, we benchmark not only the hierarchical neural models proposed by Zhou et al. (2020), but also study the effectiveness of three classes of models not yet applied to LTRD: 1) BERT-based (Devlin et al., 2019) passage encoders trained on generic paraphrase, semantic textual similarity, and IR data (Reimers and Gurevych, 2019); 2) feature-based BERT models with direct sentence-level supervision; and 3) finetuned BERT-based models for sequence-pair tasks.
We conduct evaluations on four datasets, including 1) PAN and S2ORC (Zhou et al., 2020), benchmark LTRD datasets for plagiarism detection and citation localization; 2) Pr2News (MacLaughlin et al., 2020), a dataset of text reuse in news articles labeled with a mix of expert, non-expert, and heuristic annotation; and 3) ARC-Sim, a new, publicly available citation localization dataset (https://github.com/maclaughlin/ARC-Sim) created using citation links in the ACL ARC (Bird et al., 2008).
Our experiments address a number of previously unexplored questions in the study of LTRD, including 1) the impact of weakly-supervised training data on model accuracy; 2) the effectiveness of SoTA neural models trained on "general" semantic similarity data for LTRD tasks; 3) the importance of incorporating document-level context; 4) the effects of domain-adaptive pretraining (Gururangan et al., 2020) on the accuracy of fine-tuned BERT models; and 5) the trade-offs between more efficient lexical overlap and feature-based neural models and slower pairwise neural models.

Related Work
LTRD methods have been applied in many domains, including tracking short "memes" in news and social media (Leskovec et al., 2009), tracing specific policy language embedded in proposed legislation (Wilkerson et al., 2015; Funk and Mullen, 2018), studying reuse of scripture in historical and theological writings (Lee, 2007; Moritz et al., 2016), tracing information propagation in news and social media (Tan et al., 2016; Clough et al., 2002; MacLaughlin et al., 2020), and detecting plagiarism on the web (Potthast et al., 2013; Sánchez-Pérez et al., 2014; Vani and Gupta, 2017). Most applications, however, use only lexical overlap and alignment methods to detect reuse, sometimes with lemmatization and dictionary curation.
Our work builds on the recent efforts of Zhou et al. (2020), who demonstrate the efficacy of hierarchical neural models in detecting instances of non-literal reuse where authors paraphrase, summarize, and heavily edit source content. However, as discussed in §1, we conduct a much larger set of experiments beyond those of Zhou et al. (2020). In addition to the hierarchical neural models with document-level supervision proposed by Zhou et al. (2020), we evaluate four sets of models: lexical overlap models, SoTA neural models trained for general paraphrase detection, hierarchical neural models with sentence-level supervision, and fine-tuned sequence-pair BERT models. Further, in addition to evaluating models on the benchmark LTRD datasets introduced by Zhou et al. (2020), we conduct experiments on two more challenging datasets: ARC-Sim, a new citation localization dataset with hard negative examples, and Pr2News (MacLaughlin et al., 2020), a dataset of text reuse in science news articles with heuristically-labeled training data.
Also related to our work is research studying sentence-pair problems, e.g. paraphrase detection (PD) (Dolan and Brockett, 2005), semantic textual similarity (STS) (Cer et al., 2017), and textual entailment (Bowman et al., 2015), and document-ranking problems, e.g. ad-hoc retrieval (Croft et al., 2009). In fact, it is straightforward to adapt existing sentence-pair and document-ranking approaches to LTRD. As discussed in §3, we cast LTRD as sentence classification and ranking, identifying which sentences in a target text are lexically or semantically reused from some portion of the source. Thus, to adapt sentence-pair models to this task, we simply compute scores for all (source sentence, target sentence) pairs and use some function (e.g. max) to aggregate the scores for each target sentence. Similarly, one can adapt existing ad-hoc retrieval approaches by treating each target sentence as a query and computing a score with the corresponding source. These approaches, however, may suffer from a lack of contextualization and/or efficiency issues. Sentence-pair models that encode each source and target sentence separately, while efficient, might miss important contextualizing information in surrounding sentences. Similarly, neural IR models that process each target sentence as a separate query do not contextualize target sentences and also require a computationally-expensive forward pass for each query. We study the importance and impact of these limitations in our work, testing the effectiveness of multiple SoTA BERT-based architectures for sequence-pair similarity and ranking.
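For instance, any sentence-pair scorer can be adapted to this sentence-level setting with a max-aggregation wrapper. The sketch below is illustrative only: the `jaccard` function is a toy lexical stand-in, not one of the models we evaluate.

```python
def s2d_scores(source_sents, target_sents, pair_score):
    # Score each target sentence by its best match against any
    # single source sentence (max-aggregation over all pairs).
    return [max(pair_score(s, t) for s in source_sents)
            for t in target_sents]

def jaccard(a, b):
    # Toy lexical stand-in for a trained sentence-pair model.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0
```

Any trained PD or STS model slots in as `pair_score`; an ad-hoc retrieval model can be used the same way by treating each target sentence as a query against the source.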

Problem Definition
Following Zhou et al. (2020), we define LTRD as two tasks: document-to-document (D2D) alignment and sentence-to-document (S2D) alignment. In D2D, for a given pair of documents (source document S, target document T), we aim to predict whether T reuses content from S. Thus, each pair has a corresponding binary label of 1 if T reuses content, else 0. Note, this is different from evaluating the similarity of the two documents as a whole, since, in this setting, only a small portion of T is adapted from S, and most of it is possibly unrelated. In S2D, given an (S, T) pair, we aim to predict which specific sentences t_i ∈ T contain reused S content. Thus, each pair has n corresponding labels, one label for each sentence t_i ∈ T.

Models
We benchmark four classes of models on this task:
• Rouge (Lin, 2004): Since authors of derived documents often paraphrase and summarize source content, we evaluate Rouge, a popular summarization evaluation metric. We evaluate Rouge-{1, 2, L}, selecting the best configuration for each dataset using validation data.
We compute two versions of each metric: single-pair (sp) and all-pairs (ap). In sp, for a given document pair (S, T), we compute a score for each sentence t_i ∈ T by computing its similarity to the entire S. In ap, we compute a score for each sentence t_i ∈ T by computing its similarity to each sentence s_i ∈ S, then selecting the maximum score over all s_i. These scores are then thresholded to make binary predictions. For the D2D task, we predict T as positive if it contains at least one positively predicted sentence. For the S2D task, we evaluate the predicted score for each t_i ∈ T.
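A minimal sketch of this sp/ap scoring and thresholding pipeline follows. The similarity function is left abstract; the hypothetical `unigram_overlap` below is a crude Rouge-1-style stand-in for illustration, not our actual implementation.

```python
def unigram_overlap(a, b):
    # Crude Rouge-1-style stand-in: fraction of b's words found in a.
    wa, wb = set(a.lower().split()), b.lower().split()
    return sum(w in wa for w in wb) / len(wb) if wb else 0.0

def sp_scores(source_doc, target_sents, sim):
    # single-pair: each target sentence vs. the entire source document
    return [sim(source_doc, t) for t in target_sents]

def ap_scores(source_sents, target_sents, sim):
    # all-pairs: max over per-source-sentence similarities
    return [max(sim(s, t) for s in source_sents) for t in target_sents]

def d2d_predict(sent_scores, threshold):
    # T is predicted positive if any sentence clears the threshold
    return any(sc >= threshold for sc in sent_scores)
```

The threshold is tuned on validation data; the same thresholded sentence scores drive both the S2D and the derived D2D predictions.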

Pretrained Sentence-BERT Encoders
We evaluate Sentence-BERT (SBERT) (Reimers and Gurevych, 2019), a SoTA pretrained passage encoder for semantic-relatedness tasks. SBERT models are trained by 1) adding pooling (e.g. mean pooling) to the output of BERT; 2) training on pairs or triplets of passages to learn semantically meaningful passage representations; and 3) at test time, computing the similarity between two passages as the cosine similarity between their pooled representations. We evaluate three SBERTs trained for different tasks: SBERT-PD (paraphrase detection), SBERT-STS (semantic textual similarity), and SBERT-IR (information retrieval). Note, these pretrained SBERT models are not trained for LTRD. Instead, they are trained on large-scale datasets for other related tasks (PD, STS, IR). These experiments thus evaluate how well off-the-shelf tools generalize to a new task and domain. As with the lexical models, we evaluate in sp and ap settings. Following Reimers and Gurevych (2019), we embed each source document, source sentence, and target sentence separately, then compute cosine similarity for each pair.
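The test-time scoring recipe reduces to mean pooling over token vectors plus cosine similarity between the pooled vectors. A dependency-free sketch of those two steps (in practice one would call the sentence-transformers library; the vectors here are placeholders):

```python
import math

def mean_pool(token_embs):
    # Step 1: average token vectors into a single passage vector.
    dim = len(token_embs[0])
    return [sum(tok[d] for tok in token_embs) / len(token_embs)
            for d in range(dim)]

def cosine(u, v):
    # Step 3: test-time similarity between two pooled representations.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0
```

Because each passage is embedded independently, embeddings can be precomputed once and reused across all (S, T) pairs.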

Hierarchical Neural Models (HNM)
We also benchmark three HNMs. Similar to SBERT (§4.2), HNMs operate on frozen embeddings (Peters et al., 2019), which are computationally efficient since they only need to be calculated once (i.e. only one BERT forward pass for each source or target sentence). Unlike SBERT, however, HNMs also have task-specific model architectures that learn to contextualize and align sentences.
BERT-HAN (shallow) (Zhou et al., 2020): this model mean-pools frozen BERT embeddings to generate sentence representations, then uses a hierarchical attention network (HAN) (Yang et al., 2016) to add document-level context and a cross-document attention (CDA) mechanism to align passages across documents. See Zhou et al. (2020) for details.
At training time, BERT-HAN only calculates loss at the document-pair level, i.e. D2D classification. There is no sentence-level supervision (S2D). At inference, two sets of predictions are output: 1) the D2D prediction, as during training; 2) the intermediate hidden representations of the sentences t_i ∈ T are extracted, then ranked by their similarity to the final hidden representation of the entire source document S.
GRU-HAN (deep) (Zhou et al., 2020): this model mirrors BERT-HAN, except with GloVe (Pennington et al., 2014) embeddings and a HAN with CDA at both the word and sentence level. It follows the same training and testing regime.
BCL-CDA: We adapt the BCL model from MacLaughlin et al. (2020) (originally designed for the task of intrinsic source attribution on Pr2News) for LTRD by adding a final CDA layer (Zhou et al., 2020). After generating contextualized representations of each source and target sentence with BCL, a CDA layer computes an attention-weighted representation of each target sentence, weighted by its similarity to the source sentences. The CDA-weighted and original target sentence representations are then concatenated and fed into a final layer for prediction.
At training time, BCL-CDA is supervised with target sentence labels. At testing time, it makes target sentence-level predictions (S2D) just as in training. We make a D2D prediction for each (S, T) pair by taking the max over its sentence-level predictions. See Appendix C for full model details.
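A framework-free sketch of the CDA step as described above, assuming simple dot-product attention (the actual BCL-CDA operates on learned contextualized representations; see Appendix C):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def cda_features(tgt_vec, src_vecs):
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    # Attention weights: similarity of this target sentence
    # to each source sentence.
    weights = softmax([dot(tgt_vec, s) for s in src_vecs])
    # Attention-weighted mix of the source sentence representations.
    mixed = [sum(w * s[d] for w, s in zip(weights, src_vecs))
             for d in range(len(tgt_vec))]
    # Concatenate original and CDA-weighted vectors for the final layer.
    return tgt_vec + mixed
```

The concatenated vector doubles the dimensionality, so the final prediction layer sees both what the target sentence says and what it attends to in the source.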

Fine-tuned BERT-based Models
Finally, we evaluate fine-tuned BERT-based models for sequence-pair classification. Unlike the other three classes of models described above, features for these fine-tuned models cannot be precomputed. Instead, at test time, a separate forward pass is required for each (S, T) or (S, t_i) pair. Thus, though these models might achieve better performance than feature-based alternatives (Peters et al., 2019), it may be infeasible to test them on large collections where many pairwise computations would be required.
Sequence Pair Models: We fine-tune Roberta Base (Liu et al., 2019) using the standard setup for sequence-pair tasks such as PD, STS, and IR (Devlin et al., 2019; Akkalyoncu Yilmaz et al., 2019). We create an input example for each (source document S, target sentence t_i) pair: [CLS] <s_1, ..., s_n> [SEP] t_i [SEP], where <s_1, ..., s_n> contains the source document, split into sentences, with each sentence separated by a special [SSS] token ("source sentence start"), and t_i is a single target sentence. We feed the [CLS] representation into a final layer to make a prediction for t_i. Thus, making a prediction for an entire (S, T) document pair requires n forward passes, one for each t_i ∈ T.

Table 1: Dataset statistics: the total number of (source S, target T) example pairs, the average # of sentences and words in each S and T, and the average # of positively labeled T sentences in each positive (S, T) pair. For Pr2News, we report the average # of T sentences with label > 0 in the human-labeled val and test sets.

Domain-adapted Sequence Pair Models: As shown by Gururangan et al. (2020), further pretraining BERT-based models on in-domain text improves performance on a variety of tasks. We explore the effects of DAPT for LTRD, testing Roberta models domain-adapted on either biomedical publications, computer science publications, or news data. We fine-tune these models as above.
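The per-(S, t_i) input construction can be sketched as plain string assembly (tokenizer details omitted; [SSS] is the special marker token described above):

```python
def build_pair_input(source_sents, target_sent):
    # One example per (source document, target sentence) pair;
    # each source sentence is preceded by an [SSS] marker.
    src = " ".join("[SSS] " + s for s in source_sents)
    return "[CLS] " + src + " [SEP] " + target_sent + " [SEP]"
```

Scoring a full (S, T) pair then means calling this once per target sentence, which is what makes the per-sentence forward passes expensive at test time.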
Sequential Sequence Pair Models: Since the fine-tuned models discussed above operate on a single t_i at a time, they cannot leverage information from the surrounding target context. Following the success of BERT-based models for sequential sentence classification (Cohan et al., 2019), we construct new input examples containing the full source and target documents, split into sentences: [CLS] <s_1, ..., s_n> [SEP] <t_1, ..., t_n> [SEP]. Again, <s_1, ..., s_n> contains the source sentences. Similarly, <t_1, ..., t_n> contains the target sentences, with each separated by a special [TSS] token ("target sentence start"). We feed the final [TSS] representations into a multi-layer feedforward network to make a prediction for each target sentence.
Each pair is labeled with all corresponding target sentence labels. Since many pairs exceed Roberta's 512-Wordpiece length limit, we use Longformer Base, a Roberta-based model with an adapted attention pattern to handle up to 4,096 tokens. We put global attention on the [SSS] and [TSS] tokens to allow the model to capture cross-document sentence similarity.
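The sequential input and its global-attention mask can be sketched as follows (whitespace splitting stands in for the real Wordpiece tokenizer):

```python
def build_seq_input(source_sents, target_sents):
    tokens = ["[CLS]"]
    for s in source_sents:
        tokens += ["[SSS]"] + s.split()   # mark each source sentence start
    tokens.append("[SEP]")
    for t in target_sents:
        tokens += ["[TSS]"] + t.split()   # mark each target sentence start
    tokens.append("[SEP]")
    # Global attention on the sentence markers lets them attend across
    # the whole (source, target) sequence despite the windowed attention.
    global_mask = [1 if tok in ("[SSS]", "[TSS]") else 0 for tok in tokens]
    return tokens, global_mask
```

One forward pass over this sequence yields predictions for every target sentence simultaneously, via the [TSS] positions.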

Datasets
We benchmark the proposed models on four different datasets (Table 1). See Appendix A for further dataset statistics and preprocessing details.

PAN (Zhou et al., 2020)

PAN contains pairs of (S, T) web documents where T has potentially plagiarized S. Positive pairs contain synthetic plagiarism, generated by methods such as back-translation (Potthast et al., 2013). Negative examples are created by replacing S with another, unplagiarized source text, S̃, sampled from the corpus. D2D labels are binary: plagiarized or not. The S2D labels for t_i ∈ T are 1 if t_i plagiarizes S, else 0 (labels in negative pairs are 0).

Pr2News (MacLaughlin et al., 2020)
Pr2News contains pairs of (press release S, science news article T), where each T has reused content from S. There are three aspects of this dataset which are unlike the others we study: 1) All (S, T) pairs are positive and contain reuse. Thus, we only evaluate the S2D task. 2) While the val and test sets are human-annotated, the (S, T) pairs in the training set are labeled using a heuristic (TF-IDF cosine similarity). Though there has been some success training neural models on scores generated by word-overlap heuristics for the problems of document retrieval (Dehghani et al., 2017) and source attribution (MacLaughlin et al., 2020), applications of weakly-supervised models have not yet been studied on human-labeled LTRD test sets. 3) Target sentences t_i ∈ T in the val and test sets are labeled on a 0-3 ordinal scale, ranging from no reuse (0) to near or exact duplication (3).

S2ORC (Zhou et al., 2020)
S2ORC is a citation localization dataset, containing (abstract S, paper section T) pairs. Citation localization consists of identifying which t_i ∈ T, if any, cite the source. All citation marks are removed from the texts, so models can only make predictions by comparing the language of S and T, not simply by identifying citation marks. Positive examples are created by sampling scientific papers from the broader S2ORC corpus, finding sections in those papers that contain citation(s) to another paper in the corpus, and pairing together the (cited source abstract S, citing section T). Negative pairs are created by pairing T with S̃, the abstract of a paper it does not cite. The D2D labels are 0 for negative pairs, 1 for positive. The S2D labels for t_i ∈ T are 1 if t_i contains a citation of S, else 0. S2D labels for negative pairs are all 0.
The design of this dataset follows the assumption that the citing sentence(s) in T often paraphrase or summarize some portion of the cited paper, which is, in turn, summarized by its abstract S. This assumption, however, may be incorrect if the citing sentence is a poor summary of the cited paper (Abu-Jbara and Radev, 2012) or if it refers to content in the cited paper which is not included in the abstract. Nevertheless, this assumption allows for easy creation of large-scale, real-world LTRD datasets. This is in contrast to Pr2News, which is substantially smaller due to its reliance on human-annotated val and test labels, and PAN, which uses automatic methods to generate synthetic examples. We discuss the trade-offs of using citation marks to generate LTRD datasets in §5.4.

ARC-Sim
Motivated by the design of S2ORC, we propose a new citation localization dataset built on the ACL Anthology Reference Corpus (ARC) (Bird et al., 2008). As with S2ORC, we construct our dataset using citation links between papers. Thus, we first break up each ARC paper by section, then use ParsCit (Councill et al., 2008) to find all sections that cite another paper in ARC. Positive examples are pairs (abstract S, paper section T) where S is cited by at least one t_i ∈ T. Using this method, we generate 61,131 positive (S, T) pairs. Most (88%) T contain only one positive sentence.
To create negative examples, we pair each S from the positive samples with a new section, T̃, that does not cite it. Importantly, T̃ is sampled from the same target paper as the original T. This generates 44,250 negative pairs. We argue that these negative samples are both more difficult and more realistic than those in S2ORC. In S2ORC, negatives are generated by sampling a new source S̃ to pair with T. However, due to the large scale of the corpus, S̃ and T are often completely unrelated (e.g. Bio vs. CS). These examples, therefore, are trivial and can be easily classified using simple lexical overlap. In ARC-Sim, however, negatives are generated by sampling a new section T̃ from the same paper as T. We hypothesize that differentiating between these positive and negative examples will 1) be more difficult, since T̃ is likely still topically related to S and may contain some spurious lexical or semantic overlap; and 2) be more indicative of real-world performance, since real users may need to identify which specific sections in a full target paper reuse content from the source. Further preprocessing and dataset split information is detailed in Appendix A. We use the same labeling scheme as S2ORC.
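The hard-negative sampling step can be sketched as follows. The data structures here are hypothetical simplifications; the actual pipeline works over ParsCit output.

```python
import random

def sample_hard_negative(source_id, target_paper_sections, seed=0):
    # target_paper_sections: list of (section_text, cited_paper_ids)
    # tuples drawn from the SAME paper as the positive citing section.
    candidates = [text for text, cited in target_paper_sections
                  if source_id not in cited]
    # Deterministic seed for reproducible dataset construction.
    return random.Random(seed).choice(candidates) if candidates else None
```

Restricting candidates to the citing paper's own sections is what makes the negatives topically related to S, and hence hard.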
With dataset creation complete, we sample a set of 50 positive pairs from the val set to analyze in depth. Three expert annotators (authors of this paper) perform the LTRD task, predicting which t_i ∈ T reuse content from S. Five pairs are marked by all annotators (Fleiss' Kappa: 0.83). The remaining 45 are split into 15 per annotator. Overall, we find that annotators mark more sentences as reused (avg. 1.6 sents / target) than the true citation labels (1.3 / target). This is reasonable since T often only cites S once, even if it discusses S in multiple sentences (Qazvinian and Radev, 2010). These false negatives are one disadvantage of using citation marks as supervision. Further, we find that annotators and ground truth often, but not always, agree: annotators identify at least one true citing sentence in 72% of pairs. The disagreements are mostly due to 1) citing sentences that discuss source content not described in the source abstract; and 2) OCR errors that can make text hard to read. On the whole, we find that ARC-Sim is a useful LTRD dataset, but there are clear avenues for improvement, such as manually annotating reused sentences without citation marks and improving OCR.

Evaluation Settings & Metrics
D2D Metrics: We evaluate the D2D task as (S, T) pair classification using F1 score. A positive label indicates that T reuses content from S. A negative label indicates no text reuse. There is no D2D task for Pr2News since all examples are positive.
S2D Metrics: We evaluate S2D in two settings: corpus level (i.e. evaluating all target sentences from all pairs at once) and document level (i.e. evaluating the sentences in each target document w.r.t. each other, then averaging scores across documents). The metrics for each setting depend on the dataset. At the corpus level, we evaluate binary-label datasets (PAN, S2ORC, ARC-Sim) with sentence-level F1, and ordinal-label datasets (Pr2News: 0-3 scale) with Spearman's correlation (ρ) and NDCG@N (where N is the number of target sentences in the test set). At the document level, we evaluate binary-label datasets with mean average precision (MAP) and top-k accuracy (Acc@k), defined as the proportion of test examples where a positively-labeled sentence in T is ranked in the top k by the model. We evaluate ordinal-label datasets with NDCG@{1,3,5}. Note, in order for these document-level metrics to be meaningful, T must contain at least one positive sentence. Thus, our document-level evaluations are only calculated on the positive (S, T) pairs in each dataset. Since Pr2News only contains positive examples, we use the full test set for all evaluations.
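The document-level ranking metrics can be sketched as follows, where each per-document label list is assumed sorted by predicted score, highest first:

```python
def acc_at_k(ranked_label_lists, k):
    # Fraction of (S, T) pairs with a positive sentence ranked in the top k.
    return (sum(any(labels[:k]) for labels in ranked_label_lists)
            / len(ranked_label_lists))

def average_precision(ranked_labels):
    # AP for one document's ranked sentence labels; MAP is the mean
    # of this quantity over all positive (S, T) pairs.
    hits, ap = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            ap += hits / rank
    return ap / hits if hits else 0.0
```

Both metrics are undefined when a target has no positive sentence, which is why they are computed only on positive pairs.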
BERT-HAN & GRU-HAN: Since both HAN models are trained on document-level, not sentence-level, labels, we cannot train them on Pr2News, where all document-level labels are positive. Thus, we skip evaluating the HAN models on this dataset.
Domain-adapted RoBERTa Models: We evaluate three DAPT models: 1) Biomed-DAPT for S2ORC and Pr2News, since they contain biomedical texts; 2) News-DAPT for Pr2News, since the target documents are news articles; and 3) CS-DAPT for S2ORC and ARC-Sim, since they contain CS papers. We do not apply DAPT to PAN since no models are adapted to a similar domain.

Results & Discussion
As seen in Tables 2 & 3, BERT-based models fine-tuned on LTRD data perform the best in general, outperforming lexical overlap, SBERT, and HNM models. Overall, models achieve their best performances on PAN. We suspect that this is because many positive (S, T) pairs are easy, containing many plagiarized passages with high lexical overlap, and because many negative (S, T) pairs are topically unrelated and share little lexical or semantic overlap. At the other end of the spectrum is ARC-Sim, where models score relatively poorly. We hypothesize that this is because most T contain only one citing sentence and because, as discussed in §5.4, we focus on selecting hard negative target texts, T̃, sampled from the same document as the original T.

Impact of Weak Supervision
In general, the supervised BERT-based models outperform the unsupervised lexical overlap baselines. The exception to this finding is Pr2News, where the lexical overlap baselines Rouge-ap and Rouge-sp have the best corpus-level and document-level S2D scores, respectively. This result is perhaps not unexpected, since, unlike the other datasets, the labeling methods of Pr2News differ substantially between training (heuristic labels generated by TFIDF-ap scores), validation (non-expert-labeled), and test (expert-labeled). However, our results still contrast with those of Dehghani et al. (2017), who, working on a document ranking task, find that weakly-supervised neural models consistently outperform the unsupervised methods used to label their training data. We hypothesize that our negative finding might be due, in part, to the small scale of Pr2News and our reliance on only a single heuristic as the supervision signal. To address this, future work could explore applications on larger weakly-supervised LTRD datasets, e.g. closer in scale to the 50M-document collection of Dehghani et al. (2017), and improving the weak-supervision signal to better reflect human judgements, e.g. by combining multiple heuristics (Boecking et al., 2021).

Effectiveness of Off-the-shelf Tools
Next, we take a closer look at the performance of SBERT (Reimers and Gurevych, 2019). Note, these off-the-shelf models are trained on the related tasks of either PD, STS, or IR, not on our LTRD datasets. Though PD, STS, and IR receive substantially more attention in the NLP and IR literature, prior research has not yet explored the generalizability of models trained on these tasks to LTRD. We focus in particular on SBERT-PD, since Reimers and Gurevych (2019) recommend it for various applications and claim that it achieves strong results on various similarity and retrieval tasks. Examining our results, however, we find the opposite: SBERT generally performs worse than the lexical overlap baselines, and SBERT-PD performs no better than SBERT-IR (though both perform better than SBERT-STS). We suspect that the SBERT models would perform better if they were fine-tuned on in-domain LTRD data. However, since we aimed to evaluate the effectiveness of an off-the-shelf tool, we did not test this hypothesis.

Importance of Document-level Context
To examine the importance of incorporating document-level context for LTRD, we compare the results of Roberta and Longformer. (Longformer is initialized from Roberta-Base, but has additional parameters and is further pretrained on a long-document corpus. Thus, though we cannot disentangle these effects from the benefits of incorporating document-level context, we believe our experiments provide a relatively fair comparison between two SoTA models for short vs. long input sequences.) As noted in §4, input to both models follows the standard BERT sequence-pair setup (Devlin et al., 2019). However, Roberta operates on pairs of source documents and single target sentences (S, t_i), while Longformer operates on full document pairs (S, T), making predictions for all target sentences simultaneously. From Tables 2 & 3, we see that modeling target document context does not consistently improve performance. While Longformer outperforms Roberta on the D2D and corpus-level S2D tasks on most datasets, Roberta consistently scores higher on document-level S2D. To investigate the discrepancy between Longformer's strong corpus-level S2D performance and its relatively weaker document-level S2D scores, we examine S2ORC pairs where at least one t_i cites S. As discussed in §5, these errors are reasonable, since T often only cites S once, even if it discusses S in multiple sentences (Qazvinian and Radev, 2010). Roberta's more-frequent FP errors, however, do not affect its document-level scores as much. Since, at the document level, we evaluate how well models rank the t_i in each T w.r.t. each other, models perform well if they score positive sentences higher than negatives (no reuse). Indeed, though Roberta predicts high scores for many negatives, it does better than Longformer at scoring positives higher, leading to better ranking performance. Next, we perform error analysis on PAN, the only dataset where Roberta outperforms Longformer across all metrics.
We find that Roberta makes few D2D errors, of which most (80%) are FPs. Longformer, on the other hand, not only makes substantially more errors overall, but splits them roughly equally between FPs and FNs. These FNs are especially surprising since many positive examples in PAN have high lexical overlap. On the other hand, for the corpus-level S2D task, we find that both models have similar numbers of TPs and FNs, but that Longformer generates an order of magnitude more FPs, i.e. predicting that negative target sentences contain reuse.

Effects of Domain-adaptive Pretraining
We next examine the benefits of DAPT. Gururangan et al. (2020) find that further pretraining Roberta on text from a new domain improves downstream performance, provided that this new domain is similar to the downstream task. To examine whether this finding holds for LTRD, we conduct DAPT evaluations on three datasets: S2ORC, ARC-Sim, and Pr2News. Unlike Gururangan et al. (2020), however, we find mixed results. On ARC-Sim and Pr2News, standard Roberta models outperform the corresponding DAPT models on most metrics. The ARC-Sim findings are especially surprising, since its domain (NLP papers) is substantially different from Roberta's standard pretraining data (books, news, web documents) and since Gururangan et al. (2020) show strong performance gains from DAPT on a classification dataset also based on ACL-ARC. On S2ORC, our findings are reversed, with both DAPT models outperforming Roberta. However, as noted in §6, since the extra pretraining data for these DAPT models is sampled from the same corpus as S2ORC, we cannot be sure how much of this boost is due to the DAPT models pretraining on S2ORC's test data.

Trade-offs between Models
Finally, we discuss the trade-offs between models, focusing on differences in performance and relative computational efficiency. At one end of the efficiency spectrum are the lexical overlap metrics (TFIDF, Rouge-{1,2}), which are easily scaled to large document collections by simply keeping track of the ngrams in each source or target passage, then computing word-overlap scores for each (S, T) pair. As discussed in §4, we evaluate these metrics in two settings, sp and ap, depending on whether we compute similarity scores between target sentences and entire source documents or with each source sentence separately (then compute an aggregate score). Though no single metric or evaluation setting consistently achieves the best performance, these models provide a very strong baseline, especially on the D2D task.
In the middle of the efficiency spectrum are SBERT and the HNMs. Though these models require an expensive forward pass to generate an embedding for each source or target passage, these embeddings can then be saved and reused. Scores for each (S, T) pair can be computed relatively quickly by either computing cosine similarity scores (SBERT) or running the pair through a lighter-weight task-specific architecture (HNM). However, we find mixed and negative results regarding their effectiveness. Specifically, as discussed in §7.2, off-the-shelf SBERT models generally lag behind the computationally-cheaper lexical overlap baselines. Results are slightly more positive, though, for the HNMs. BCL-CDA, the best HNM, achieves the second-best performance on two datasets (S2ORC, ARC-Sim). However, it still lags behind the best model, fine-tuned BERT, by a significant margin. Further, it performs worse than the lexical overlap baselines on the other datasets, PAN and Pr2News. Turning next to the HAN models, we find that though they achieve competitive D2D performance on two of the three datasets, they have very weak S2D scores. We suspect that this is because they are only trained on the D2D task: at test time, they make sentence-level predictions by computing similarity scores between hidden source and target representations extracted from a pretrained D2D model. Due to this training formulation, the HAN models fail to learn sentence-level representations that are useful for prediction. See Appendix B for a discussion of our efforts to replicate the results from the HAN models on our datasets.
Lastly, the least efficient models are fine-tuned BERTs, which require a separate forward pass to compute a score for each (S, T) or (S, t_i) pair. As is the trend with other NLP tasks, though, these computationally intensive and parameter-rich models achieve the best average performance. This finding is clearest on S2ORC and ARC-Sim, where few t_i contain reuse and the reuse that does occur is non-literal (e.g. paraphrase). On these datasets, the best fine-tuned BERT outperforms the next-best model (BCL-CDA) by an average of 6.3% (D2D) and 15.5% (S2D). However, on datasets where target documents directly copy large spans of source content with minimal changes (PAN) or where large-scale supervised training data is unavailable (Pr2News), fine-tuned BERT provides much less or no improvement over the lexical overlap metrics.
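One common way to reconcile this per-pair cost with the cheap lexical baselines is a retrieve-then-rerank arrangement, sketched below. Here `lexical_score` stands in for the cheap overlap metric and `expensive_score` is a placeholder for a fine-tuned BERT forward pass; both are toy functions for illustration only.

```python
def lexical_score(pair):
    """Cheap Jaccard-style word overlap, a stand-in for TF-IDF/ROUGE."""
    src, tgt = pair
    a, b = set(src.split()), set(tgt.split())
    return len(a & b) / max(len(a | b), 1)

def expensive_score(pair):
    """Placeholder for a fine-tuned BERT forward pass on the (S, T) pair;
    here it just reuses the lexical score so the sketch is runnable."""
    return lexical_score(pair)

def rerank(pairs, k=2):
    # Stage 1: the cheap metric shortlists the top-k candidate pairs.
    shortlist = sorted(pairs, key=lexical_score, reverse=True)[:k]
    # Stage 2: the expensive model scores only the shortlist.
    return max(shortlist, key=expensive_score)

best = rerank([("a b c", "a b c"),
               ("a b c", "x y z"),
               ("a b c", "a x y")], k=2)
```

The expensive model thus runs on k pairs rather than on every candidate, which is what makes pairwise scoring tractable at collection scale.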

Conclusion
We study methods for local text reuse detection, identifying passages in a target document that lexically or semantically reuse content from a source. Through evaluations on four datasets, including a new citation localization dataset, we confirm the strong performance of BERT models fine-tuned on our task. However, we also find that lexical-overlap methods, e.g. TF-IDF, provide strong baselines, frequently outperforming off-the-shelf neural passage encoders and hierarchical neural models.
Based on these findings, we suggest practitioners take one of two approaches: 1) in settings with little labeled training data or where most reuse is exact (i.e. copying), use traditional lexical overlap models; 2) in settings with large-scale labeled training data and where much of the reuse is non-literal (e.g. summarization, paraphrasing), use a lexical overlap method to filter possible (S, T) pairs, then run a more expensive fine-tuned BERT on that subset. We suggest users opt for fine-tuned BERT models over pretrained passage encoders (SBERT) or HNMs for this second step, since they achieve substantially higher performance. Suggestion #2 follows current approaches to neural IR, where neural models only rerank smaller lists of documents retrieved by a cheaper lexical overlap method, e.g. TF-IDF. Performance may be further boosted by fine-tuning BERT-based models that incorporate document-level context (e.g. Longformer) or that are adapted to the target domain of interest (e.g. DAPT), but the standard RoBERTa-Base often achieves highly competitive results.

ARC-Sim: We create this dataset using papers from the ACL Anthology Reference Corpus (Bird et al., 2008). Since we use citation marks to identify instances of text reuse, we first use ParsCit (Councill et al., 2008) to identify all in-line citation marks. We then create examples by matching a section of a paper that contains a citation with the abstract of the cited paper (assuming the cited paper is also in the ACL ARC). Since citation marks have a distinctive lexical pattern, we remove them all after matching the pairs. We then split sections and abstracts into sentences using Stanford CoreNLP, keeping track of where the original citation was in order to generate S2D labels. We create negative examples by matching a cited abstract with another section from the same paper as the original citing section (the new section is selected so that it does not cite the paper).
Finally, for computational feasibility, we limit source documents to 20 sentences and target sections to 50, the 90th percentiles in the data. We remove pairs where the citation occurs after the 50th sentence in the target section. We split the dataset into train/val/test by cited abstract S, yielding the splits detailed in Table 4.
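For illustration, a minimal regex in the spirit of the citation-mark removal step might look as follows. The patterns below are hypothetical simplifications of our own; the actual pipeline relies on ParsCit's far more robust citation parsing.

```python
import re

# Hypothetical, simplified patterns for parenthetical author-year marks
# (e.g. "(Smith et al., 2019; Lee, 2020)") and numeric marks (e.g. "[3, 7]").
CITE = re.compile(
    r"\((?:[A-Z][A-Za-z'`-]+(?: et al\.)?,? \d{4}[a-z]?(?:; ?)?)+\)"
    r"|\[\d+(?:, ?\d+)*\]"
)

def strip_citations(text):
    """Remove citation marks and collapse the whitespace left behind."""
    return re.sub(r"\s{2,}", " ", CITE.sub("", text)).strip()

cleaned = strip_citations(
    "Prior work (Smith et al., 2019; Lee, 2020) shows [3] gains.")
```

Removing the marks only after matching pairs prevents the model from exploiting their distinctive lexical pattern as a shortcut.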

A Data Preprocessing
PAN: We download the public dataset. We filter out 1) malformed positive pairs that do not contain any positively-labeled sentences or contain positively-labeled sentences with no words; 2) extremely long pairs which cause GPU memory issues for our models, removing (source, target) pairs that contain more than 4,000 tokens total (80th percentile). Following Zhou et al. (2020), we split documents into sentences and tokenize them using NLTK (Bird and Loper, 2004).
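The two filtering conditions can be sketched as below, assuming each example is a (source sentences, target sentences, per-sentence labels) triple; this data layout is our own assumption for illustration.

```python
def filter_pairs(pairs, max_tokens=4000):
    """Keep only well-formed, GPU-friendly (source, target) pairs, mirroring
    the PAN filtering: each pair must contain at least one non-empty
    positively-labeled target sentence, and at most max_tokens tokens total."""
    kept = []
    for src_sents, tgt_sents, labels in pairs:
        has_positive = any(lab == 1 and len(sent) > 0
                           for sent, lab in zip(tgt_sents, labels))
        total = (sum(len(s) for s in src_sents)
                 + sum(len(t) for t in tgt_sents))
        if has_positive and total <= max_tokens:
            kept.append((src_sents, tgt_sents, labels))
    return kept
```

The 4,000-token cap corresponds to the 80th percentile of combined pair length in the PAN data.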
For the hierarchical neural models (BERT-HAN, GRU-HAN, BCL-CDA), we follow Zhou et al. (2020) and cap documents at a predefined number of sentences so that the models fit in GPU memory. We cap source documents at 50 sentences (90th percentile). We split examples with target documents containing more than 45 sentences (90th percentile) into multiple examples, i.e. (source document, first 45 sentences of target document), (source document, next 45 sentences of target document), and so on, merging predictions back at test time.

S2ORC: Following Zhou et al. (2020), we split documents into sentences and tokenize them using NLTK (Bird and Loper, 2004).
For the hierarchical neural models (BERT-HAN, GRU-HAN, BCL-CDA), we cap source documents at 20 sentences (99th percentile). We split examples with target documents containing more than 29 sentences (99th percentile) into multiple examples and merge back predictions at test time.
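The chunk-and-merge step can be sketched as follows; `model` stands in for any sentence-level scorer, and the chunking granularity is passed in as `max_len` (29 here, matching this setting).

```python
def chunk_targets(tgt_sents, max_len=29):
    """Split an over-long target document into consecutive chunks of sentences."""
    return [tgt_sents[i:i + max_len] for i in range(0, len(tgt_sents), max_len)]

def predict_long(model, src_sents, tgt_sents, max_len=29):
    """Run the model on each (source, chunk) example, then merge the
    per-sentence predictions back into a single sequence at test time."""
    preds = []
    for chunk in chunk_targets(tgt_sents, max_len):
        preds.extend(model(src_sents, chunk))
    return preds
```

Because the chunks are consecutive and non-overlapping, concatenating the per-chunk predictions recovers one prediction per original target sentence.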

Pr2News: We obtain the preprocessed and filtered Pr2News dataset from MacLaughlin et al. (2020, §4-5), who created it with data from Altmetric. We evaluate models on the provided test set of 50 expert-labeled (press release, news article) pairs. We use the set of 45 non-expert-labeled (press release, news article) pairs as our validation set (we filter out the 5 spurious validation-set pairs noted by MacLaughlin et al. (2020)). Finally, we use the remaining 64,684 pairs labeled with their TF-IDF cosine similarity heuristic as training data. For pairs with more than one matched press release, we select the press release with the highest TF-IDF cosine similarity to the news article.
For the hierarchical neural models (BERT-HAN, GRU-HAN, BCL-CDA), we cap source documents at 54 sentences (90th percentile). We split examples with target documents containing more than 57 sentences (90th percentile) into multiple examples and merge back predictions at test time.

BCL-CDA (see MacLaughlin et al. (2020) for details of the BCL model): Each source and target sentence is fed into frozen BERT-Base separately. We then use a CNN with 1-max pooling over time to aggregate the token representations from BERT's second-to-last layer into a single representation for each sentence. We search over CNN filter size ∈ {3, 5, 7} and number of filters ∈ {50, 100, 200}. The sentence representations in each source or target document are then contextualized with document-level BiLSTMs (two separate BiLSTMs for source and target documents). We search over hidden dimension size ∈ {64, 128} (same dimensionality for both BiLSTMs). After the BiLSTM layer, we are left with s_i ∈ S and t_i ∈ T, contextualized sentence representations for the sentences in the source and target documents. Next, we use a sentence-level CDA layer to compute t̃_i, an attention-weighted (Luong et al., 2015, §3.1: general attention) representation of t_i, weighted by its similarity to the sentences s_i ∈ S. Finally, we concatenate [t_i; t̃_i] and feed this to a final layer to make a prediction for each target sentence.
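The CDA layer's general attention can be sketched in plain Python as below: `cda` computes bilinear scores t_i^T W s_j, softmaxes them over the source sentences, and returns the attention-weighted context vector (the t̃_i that is concatenated with t_i for classification). The helper names and the use of plain lists are our own simplifications.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cda(t_i, S, W):
    """General attention (Luong et al., 2015): score each source sentence
    s_j by t_i^T W s_j, softmax over sources, and return the
    attention-weighted sum of source representations (t̃_i)."""
    scores = softmax([dot(t_i, matvec(W, s_j)) for s_j in S])
    dim = len(S[0])
    ctx = [sum(a * s_j[d] for a, s_j in zip(scores, S)) for d in range(dim)]
    return ctx, scores

# The classifier input is the concatenation [t_i; t̃_i].
W = [[1.0, 0.0], [0.0, 1.0]]
ctx, scores = cda([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], W)
```

With W set to the identity, the bilinear score reduces to a dot product, so the source sentence most similar to t_i receives the largest attention weight.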
We set dropout at 0.2, batch size at 32, and the maximum number of epochs at 10 (with early stopping). We optimize with Adam with learning rate ∈ {1e-4, 5e-4}. For the PAN, S2ORC, and ARC-Sim datasets, we use weighted cross-entropy loss.

Table 5: Best HP configurations for all models across all datasets. t is the classification threshold (only for PAN, S2ORC, and ARC-Sim). BERT-HAN and GRU-HAN have two thresholds, one for document classification and one for sentence classification; all other models have a single, sentence-level threshold. n-gram is the n-gram range for TF-IDF (unigrams or unigrams and bigrams). For the neural models: e is epochs, lr is learning rate, and w is the weight placed on positive examples in the weighted cross-entropy loss (the weight on negative examples is 1). For BCL-CDA, fs is the CNN filter size, nf is the number of CNN filters, and lhd is the BiLSTM hidden dimension. '-' indicates that there are no HPs to optimize. '×' indicates that the model is not trained on that dataset.