CiteBench: A Benchmark for Scientific Citation Text Generation

Science progresses by incrementally building upon the prior body of knowledge documented in scientific publications. The acceleration of research across many fields makes it hard to stay up-to-date with the recent developments and to summarize the ever-growing body of prior work. To target this issue, the task of citation text generation aims to produce accurate textual summaries given a set of papers-to-cite and the citing paper context. Existing studies in citation text generation are based upon widely diverging task definitions, which makes it hard to study this task systematically. To address this challenge, we propose CiteBench: a benchmark for citation text generation that unifies multiple diverse datasets and enables standardized evaluation of citation text generation models across task designs and domains. Using the new benchmark, we investigate the performance of multiple strong baselines, test their transferability between the datasets, and deliver new insights into the task definition and evaluation to guide future research in citation text generation. We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.


Introduction
In today's research climate, it is difficult to stay up to date with the state of the art. Automatic summarization of related papers can serve as a basis for literature review and reduce the paper writing workload for researchers (Hu and Wan, 2014). Related work generation is defined as generating a topic-based multi-document summary of a set of pre-defined cited papers (Hoang and Kan, 2010). The essential part of generating related work is to generate citation text based on the content of the cited paper(s) and the context from the citing paper. This constitutes the citation text generation task, which is the focus of our work.

* Work done during an internship at UKP Lab.
While each of the previous works proposes methods for citation text generation, the lack of (a) a common task formulation and (b) a unified evaluation setup makes it difficult to compare these studies. To address this gap, this work introduces CITEBENCH: a citation text generation benchmark that unifies four existing datasets for citation text generation within a standardized task definition (a). We couple our benchmark with a range of baselines, a standardized evaluation kit that enables the comparison of citation text generation across task variations (b), and additional tools for qualitative intent- and discourse-based analysis of citation texts. We use CITEBENCH to systematically investigate the task of citation text generation. Our results reveal the qualitative and quantitative differences between the existing datasets, and prompt further in-depth scrutiny of the citation text generation task, its objectives and its evaluation measures.
In summary, the main contributions of this work are as follows: (1) we develop a benchmark for the task of citation text generation under a general formulation that unifies different task definitions and the corresponding datasets from previous studies; (2) we complement the benchmark with a standardized evaluation toolkit and baseline implementations; (3) we report the performance of baseline models on this benchmark, including a qualitative analysis of model outputs and transfer performance; (4) we perform an in-depth analysis of the datasets in the benchmark and outline the conceptual challenges to be addressed by future work in citation text generation.
Related work

Benchmarking

Benchmarks are unified dataset collections coupled with evaluation metrics and baselines that can be used to systematically compare the performance of NLP systems in a standardized evaluation setup. Well-constructed benchmarks can boost progress in the corresponding research areas; examples include GLUE (Wang et al., 2018), KILT (Petroni et al., 2021), GEM (Gehrmann et al., 2021), and DynaBench (Kiela et al., 2021). The goal of our work is to provide such a benchmark for the citation text generation task.

Text generation for scientific domain
Scientific text differs from other domains such as news and books in that scientific papers usually use domain-specific language and have a hierarchical structure. Recent years have seen a rise in natural language generation applications to scientific text, including scientific text simplification (Luo et al., 2022), scientific paper summarization (Qazvinian and Radev, 2008; Erera et al., 2019; Cachola et al., 2020), and slide generation (Sun et al., 2021). A key challenge in natural language generation is evaluation: unlike other NLP tasks that model a ground truth objective, in text generation multiple correct outputs are possible, and human evaluation is often prohibitively expensive. In line with recent efforts that address the lack of systematic automated evaluation of text generation (Gehrmann et al., 2021), our paper contributes the first unified benchmark for citation text generation. Our analysis of ROUGE score (Lin, 2004) calculation by different packages reveals technical caveats in how citation text generation performance is evaluated in prior studies. Furthermore, our discourse-based analytical tools allow qualitative insights into the citation texts and present a middle-ground alternative to costly human evaluation.

Citation text generation
The task of automated related work summarization was introduced by Hoang and Kan (2010): a system is required to produce a topic-biased summary of related work specific to the citing paper. Since then, several task definitions and setups have been proposed. Lu et al. (2020) introduce the Multi-XScience dataset, where the task is to generate a related work section consisting of multiple paragraphs, given the abstract of the citing paper and the abstracts of the cited papers. AbuRa'ed et al. (2020) use data from the ScisummNet Corpus (Yasunaga et al., 2019), the Open Academic Graph (OAG) (Tang et al., 2008), the Microsoft Academic Graph (MAG) (Sinha et al., 2015) and Hoang and Kan (2010); they use the cited paper's title and abstract to generate a citation sentence. Xing et al. (2020) create a dataset from the ACL Anthology Network Corpus (Radev et al., 2013), using the abstracts of the cited papers and including the context before and after the citation sentence as input. Recently, Chen et al. (2021) created a dataset for related work generation based on the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al., 2020) and another dataset based on the Delve corpus (Akujuobi and Zhang, 2017); they use multiple cited abstracts as input and take the corresponding related work paragraph as the reference output. Closely related to the task of citation text generation, Luu et al. (2021) use the S2ORC corpus to study how scientific papers can relate to each other, and how these relations can be expressed in text.
A recent survey of works in automatic related work generation by Li et al. (2022) identified core limitations of the current state of the art, including over-focus on the computational linguistics domain, low factuality of the generated texts, lack of approaches to construct a full related work section, and lack of standardization in task definition and evaluation. Addressing the latter is the main focus of our work.

Task definition
We formalize the task of citation text generation as follows: given a set of n (cited) target documents {D_1^t, ..., D_n^t}, a (citing) source document D^s, and a set of m citing document contexts {C_1^s, ..., C_m^s} ⊆ D^s, generate a citation text T ∈ D^s. This general definition allows wide variation in how the task is implemented. The cited document D_i^t can be represented by the abstract, the concatenation of the title and the abstract, or even the full text of the paper. The context set C^s includes the sentences before and after the citation text in the related work section of the citing document D^s, as well as the abstract of D^s. This flexibility allows our unified task definition to accommodate diverse approaches to citation text generation.

Figure 1: Data examples extracted from the datasets. Yellow: abstract; orange: title; blue: context before the target citation text; red: context after the target citation text; green: generated citation text.
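To make the formalization concrete, the unified instance format can be sketched as a simple data structure. The field names and the separator token below are illustrative assumptions, not the actual CiteBench schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CitationInstance:
    """One citation text generation instance under the unified task definition."""
    # Representations of the n cited (target) documents D_1^t..D_n^t,
    # e.g. abstracts or title+abstract concatenations
    target_docs: List[str]
    # Citing-document contexts C_1^s..C_m^s, e.g. the citing abstract or
    # sentences surrounding the citation; may be empty for some datasets
    contexts: List[str] = field(default_factory=list)
    # Reference citation text T from the citing document D^s
    citation_text: str = ""

    def model_input(self, sep: str = " [SEP] ") -> str:
        # Flatten contexts and cited documents into one sequence-to-sequence input
        return sep.join(self.contexts + self.target_docs)
```

A single-document setup such as ABURAED would fill `target_docs` with one title+abstract string, while a multi-document setup such as LU would hold several abstracts.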

Datasets
We selected four diverse citation text generation datasets from prior work and converted them into the unified format according to our task definition: ABURAED (AbuRa'ed et al., 2020), CHEN (Chen et al., 2021), LU (Lu et al., 2020) and XING (Xing et al., 2020). Table 1 offers a qualitative comparison of the datasets included in our benchmark, showing both the high variation of previously proposed citation text generation setups and the flexibility of our general task definition. Figure 1 provides examples. The CHEN dataset is split into two parts, CHEN Delve and CHEN S2ORC; in certain cases we use CHEN to refer to the union of these sub-datasets. Table 2 shows the quantitative statistics of the datasets in CITEBENCH. Unlike the other datasets, XING does not contain a validation set; we created a validation split consisting of 10% of the data from the training split. Across all datasets, only a small portion of data instances contains extremely long input and output texts that exceed 4,096 and 1,024 tokens, respectively. We exploit this property to speed up our evaluation in Section 3.4.
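A held-out validation split like the one created for XING can be sketched as follows. The fixed seed and the exact sampling procedure are assumptions for illustration; the source only specifies the 10% ratio:

```python
import random

def split_validation(train, ratio=0.1, seed=42):
    """Hold out `ratio` of the training instances as a validation set."""
    idx = list(range(len(train)))
    # Shuffle with a fixed seed so the split is reproducible
    random.Random(seed).shuffle(idx)
    cut = max(1, int(len(idx) * ratio))
    validation = [train[i] for i in idx[:cut]]
    new_train = [train[i] for i in idx[cut:]]
    return new_train, validation
```

Sampling at the instance level keeps the validation distribution close to the training distribution, which matters when comparing models fine-tuned per dataset.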

Evaluation and Analysis Kit
Following previous work, we use ROUGE (Lin, 2004) as the automatic metric to evaluate the quality of the generated citation texts. Our initial investigation revealed that previous works calculate ROUGE scores in different ways, which makes it difficult to compare the reported performance across studies. In CITEBENCH, we use the Huggingface ROUGE calculation package. While ROUGE and other quantitative metrics allow comparing the performance of different models, they provide little qualitative insight. To supplement our analysis, we thus employ two recently proposed discourse segmentation schemata to study the citation texts both found in the datasets and generated by the baseline systems.
Citation intent classification builds upon the setup of Jurgens et al. (2018), who classify citation sentences according to six citation intents: Background sentences give relevant information for the domain of the citing paper; CompareOrContrast sentences explain similarities or differences between the cited papers and the citing paper; Extends sentences build on the cited papers; Future suggests cited papers as a potential basis for future work; Motivation means that a cited paper illustrates the need for certain research; and Uses indicates the use of data or methods from the cited papers. For the implementation, we use the publicly available code from Cohan et al.

Discourse tagging for citation text is a sentence classification task proposed by Li et al. (2022). The authors introduce six citation sentence types: Single_summ and Multi_summ refer to citation sentences that are detailed summaries of a single and of multiple cited papers, respectively; Narrative_cite sentences are high-level statements related to the cited papers; Reflection sentences relate the cited paper to the current work and focus mostly on the citing paper; Transition sentences are non-citation sentences that connect different parts of the related work section; and Other labels everything else. For the implementation, we use the publicly available code from Li et al. (2022), who report an F1 score of 90.8 for their best-performing model but do not perform an in-depth analysis of the tagging results. As the F1 score of the discourse tagging model is higher than that of the citation intent classifier, its results can be given correspondingly more weight.

Baselines
We complement our evaluation setup with widely used unsupervised baselines from previous work. LEAD selects the first three sentences from the input as the citation text. TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004) are graph-based unsupervised models for extractive text summarization. For TextRank, we use the default settings of the summa package (https://github.com/summanlp/textrank). For LexRank, we use the lexrank package (https://github.com/crabcamp/lexrank) with default settings, except for the summary size (i.e., the number of returned sentences), which is set to 2 instead of 1. In addition, we provide a range of new neural baselines based on the pre-trained Longformer Encoder-Decoder (LED) (Beltagy et al., 2020). Unlike its predecessors such as BART (Lewis et al., 2020), LED can handle inputs of up to 16,384 tokens; however, our supervised baselines truncate the inputs at 4,096 tokens, which affects a negligible proportion of data items (Table 2) while substantially improving the training speed. We use three versions of the LED model: led-base, led-large and led-large-arxiv (https://huggingface.co/allenai/led-base-16384, https://huggingface.co/allenai/led-large-16384, https://huggingface.co/allenai/led-large-16384-arxiv). The difference between led-large and led-large-arxiv is that the latter is fine-tuned on the arXiv dataset for long document summarization (Cohan et al., 2018).
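The LEAD baseline fits in a few lines. The naive period-based sentence splitting below is an illustrative stand-in for a proper sentence segmenter that handles abbreviations and inline citations:

```python
def lead_baseline(document: str, k: int = 3) -> str:
    """Return the first k sentences of the input as the citation text."""
    # Naive period-based splitting; a real implementation would use a
    # sentence segmenter robust to abbreviations like "et al."
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return ". ".join(sentences[:k]) + "."
```

Despite its simplicity, this kind of extractive heuristic is a strong reference point, as the results below show.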
We also include supervised baselines: the *led-base and *led-large-arxiv models are fine-tuned on the mixture of all training splits of all datasets in the benchmark. The *led-base model is trained for 13,800 steps with batch size 16 on 4 GPUs; the *led-large-arxiv model is trained for 10,200 steps with batch size 8 on 4 GPUs. Table 3 reports the baseline performance on the CITEBENCH datasets. All results stem from a single run of the models on the test data. We note that the LEAD baseline performs very well compared to the out-of-the-box neural models, outperforming led-base and led-large on all datasets across all ROUGE metrics. The led-large-arxiv model outperforms led-base and led-large, which we attribute to its fine-tuning on the arXiv dataset: it consists of scientific text, the same domain as the datasets in CITEBENCH. The *led-large-arxiv model outperforms all other baselines on all datasets except XING.

Transfer learning results
The unified task formulation allows us to explore transfer learning between different domains and citation text generation setups. We examine transfer performance using the *led-base-X models fine-tuned on the individual datasets, starting from the pre-trained led-base model. Table 4 presents the results. As expected, models perform best when evaluated in-domain, on the test portion of the same dataset; yet we note that out-of-distribution transfer in several cases outperforms the strong unsupervised baselines. While a detailed investigation of the reasons behind the differences in transfer falls outside the scope of our work, we note that the success of transfer depends on many factors, including the amount of information available to the model during training and the single- vs. multi-document citation text generation setting.

Qualitative analysis
We now use the discourse analysis tools introduced in Section 3.3 to qualitatively compare the citation texts found in the datasets and generated by the models. For this, we apply the citation intent and discourse tagging tools introduced above to the baseline test set outputs and macro-average the results across all datasets in CITEBENCH. We compare the resulting distributions to the distributions over the true citation text outputs found in the datasets, along with a macro-averaged total over the datasets. To quantify the discrepancies, we additionally calculate the KL divergence between the label distributions in model outputs and individual datasets.
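The divergence computation can be sketched as follows. The additive smoothing constant is an assumption introduced here so that labels absent from a system's predictions do not produce infinite divergences:

```python
import math
from collections import Counter

def label_distribution(labels, label_set, eps=1e-9):
    """Relative frequency of each label in label_set, smoothed by eps."""
    counts = Counter(labels)
    total = sum(counts.values()) or 1
    return [counts.get(label, 0) / total + eps for label in label_set]

def kl_divergence(p, q):
    # D_KL(p || q) in nats, e.g. p = reference labels, q = system labels
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

KL divergence is asymmetric, so the direction (reference vs. system, or the reverse) must be fixed and reported consistently when comparing models.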
The distribution of citation intents in Figure 2 suggests a discrepancy between generated and original citation texts: while the baseline models tend to under-generate Background and CompareOrContrast sentences, they produce more Future, Uses and Extends citation sentences than the gold reference.
The KL divergence of citation intent distributions between the models' outputs and the reference texts on individual datasets (Figure 2) shows that all models' outputs deviate more from the original citation intents on ABURAED than on the other datasets. Interestingly, the two fine-tuned models that perform well in terms of ROUGE, *led-base and *led-large-arxiv, also tend to achieve lower KL divergence scores on all datasets. This suggests that these two models learn a dataset's citation intent distribution from the training data during fine-tuning.
Turning to the discourse tagging results, Figure 3 suggests a high discrepancy between what the baseline models output and what the reference test splits of the datasets contain. Most of our baseline models under-generate the Narrative_cite and, surprisingly, the Single_summ class, while over-generating Reflection, compared to the true distributions. The only two exceptions are the fine-tuned *led-base and *led-large-arxiv models, which aligns with their high performance in terms of ROUGE. The high KL divergence values in Figure 3 confirm that the predicted discourse tags are more discriminative across the datasets than the citation intent labels. The lowest discourse tag distribution divergence is observed for the *led-base and *led-large-arxiv baselines, suggesting that the learned ability to capture the discourse tag distribution plays an important role in improving citation text generation performance.

Reproducibility
Some of the baselines included in this paper follow methods suggested in previous work. Yet, while carefully following these baselines, we occasionally observed a drop in ROUGE scores compared to the previously reported results. Upon investigation, we found that the ROUGE-1 and ROUGE-L scores increase by about 3 points when replacing the Huggingface ROUGE library with the files2rouge package used in prior work. The reason behind this discrepancy is stemming prior to the ROUGE score calculation: while files2rouge applies stemming by default, Huggingface does not. In addition, tokenization and stopword removal can also affect ROUGE scores. These evaluation details are not always reported in previous work, which has an important implication: unless the same evaluation package and procedure are used, measured improvements might be attributable to the particularities of evaluation. In this work, we use the Huggingface ROUGE library with the original citation text and the model outputs as the input for evaluation; no additional stemming, lemmatization or tokenization is applied before passing the text to the ROUGE evaluation library. We hope that our standardized evaluation toolkit helps the community avoid such issues in the future, and we stress the need to carefully report evaluation setup details in future studies.
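The effect of stemming on ROUGE can be illustrated with a minimal ROUGE-1 implementation. The crude suffix stripper below is only a stand-in for the Porter stemmer that packages like files2rouge apply by default; the sentences are invented examples:

```python
from collections import Counter

def rouge1_f1(reference, prediction, stem=None):
    """Unigram-overlap ROUGE-1 F1, optionally normalizing tokens with `stem`."""
    ref = [stem(t) if stem else t for t in reference.lower().split()]
    hyp = [stem(t) if stem else t for t in prediction.lower().split()]
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def crude_stem(token):
    # Toy suffix stripper standing in for a real Porter stemmer
    for suffix in ("ations", "ation", "es", "e", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

ref = "the models generate accurate citation texts"
hyp = "the model generates accurate citation text"
print(rouge1_f1(ref, hyp))                   # exact-match overlap only
print(rouge1_f1(ref, hyp, stem=crude_stem))  # stemming raises the score
```

Here the same model output scores markedly higher once inflectional variants are collapsed, which is exactly the kind of gap that makes scores from differently configured packages incomparable.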

Qualitative evaluation
Prior works use ROUGE and human evaluation on a small set of test instances (30-100) to evaluate the performance of different approaches. None of them evaluate the citation intent and rhetorical structure of the generated text with regard to the original reference text, which are essential properties of the citation text generation task. In this work, we use a state-of-the-art automatic citation intent classifier and a recent citation text discourse tagger to investigate to what extent the generated outputs are aligned with the original citation intents and discourse structures. Although the performance of these automatic tools is not perfect, the results indicate that the outputs from different models tend to focus on specific citation text categories, which limits the utility of these models in assistance applications.
From our analysis in Section 4.3, we find that although the two best baseline models in terms of ROUGE (*led-base and *led-large-arxiv) achieve lower KL divergence on citation intent and discourse tagging, in general these two metrics do not correlate perfectly with ROUGE. Moving forward, we advocate using them as a complementary automatic evaluation strategy for citation text generation, to better understand the purpose and structure of the generated outputs. Such content analysis will benefit from future improvements in citation intent classification and discourse tagging. Future work can also investigate the use of BERTScore (Zhang et al., 2020) and other recently proposed text generation metrics as alternatives to ROUGE in the context of citation text generation.

Task variation
The goal of our general task definition is to accommodate different structural versions of the citation text generation task. However, input and output structure and granularity are not the only parameters that influence task complexity, and future work in citation text generation might explore other, qualitative aspects of the task. A manual review of the datasets included in CITEBENCH revealed that in some cases citation text could not be generated based on the provided inputs due to missing information (see Appendix A.2 for examples). This raises the question of what information is required for NLP systems to produce accurate citation texts, and how this information should be made available to the models, including explicit context and knowledge encoded in the pre-trained LLMs.
In addition, we have observed that the data includes instances of self-reference, where the citation text in fact talks about the citing paper (Appendix A.2). To get a glimpse of the prevalence of self-reference in the datasets, we searched the citation texts throughout CITEBENCH for keywords that might indicate self-reference ("Our work", "in this paper", "we present", "we introduce"). Table 5 demonstrates that there is indeed variation in self-reference among the datasets, with the single-sentence citation text datasets (ABURAED and XING) containing fewer self-reference sentences than the multi-sentence datasets. This suggests that task structure can influence the qualitative composition of citation texts, and calls for further investigation of data selection procedures for citation text generation. Finally, we note that citation text might vary depending on the author's intent; and as our analysis in Figures 2 and 3 demonstrates, the datasets do differ in terms of the distribution of citation intents and discourse roles. While intent can be derived from the context by the citation text generation model, we argue that modeling discourse role and intent explicitly might lead to a more robust and realistic definition of citation text generation. We leave the exploration of this question to future studies.
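The keyword search for self-reference can be sketched as follows. The marker list mirrors the phrases above; the case-insensitive substring matching is an assumption, as the original matching procedure is not specified in detail:

```python
SELF_REF_MARKERS = ("our work", "in this paper", "we present", "we introduce")

def self_reference_rate(citation_texts):
    """Fraction of citation texts containing at least one self-reference marker."""
    if not citation_texts:
        return 0.0
    hits = sum(
        any(marker in text.lower() for marker in SELF_REF_MARKERS)
        for text in citation_texts
    )
    return hits / len(citation_texts)
```

Such a keyword heuristic gives only a lower bound on self-reference prevalence, since self-referential sentences can avoid all of these phrases.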

Conclusion
Citation text generation is a key task in scholarly document processing, yet prior work has been scattered across varying task definitions and evaluation setups. We introduced CITEBENCH: a new benchmark for citation text generation that unifies four diverse datasets under a general task definition, provides multiple baselines and a standardized evaluation framework, and enables the systematic study of citation text generation. Our quantitative and qualitative analysis delivered new insights about the baseline performance and about the composition of the datasets and model outputs in terms of citation intents and discourse roles. Our discussion indicated promising directions for future work. CITEBENCH is designed to be a living benchmark for citation text generation with a focus on a unified task definition and evaluation setup, and we invite the community to extend it by contributing new datasets, trained models, evaluation results and metrics.

Limitations
This work targets the scientific text domain, and most research is published in English. As a result, our baseline models and results are limited to English. The data in the source datasets is heavily skewed towards the Computational Linguistics and Computer Science domains. Adapting citation text generation models to other scientific domains, e.g. the humanities or life sciences, constitutes a promising target for future research.
While our general task definition allows incorporating any information from the target and source documents, we offer no standardized way to include structured information such as citation graph context and paper metadata. Our baseline implementations limit the input sequences to 4,096 tokens, which only affects a small portion of the data. This restriction can be lifted as long as the target language model can efficiently process long documents and experimental time is not a concern: even in a limited setting, performing a full run of all fine-tuned citation text generation models in the benchmark is computationally expensive (Appendix A.3). Finally, CITEBENCH inherits the structural limitations of the datasets it subsumes, e.g. not preserving the full document information and document structure, and filtering out multimodal content. We leave the investigation of these extensions to future work.
Appendix A.2: Data examples

Table 6: Samples of self-reference sentences in the reference citation text, i.e. sentences that refer to the authors' own work rather than the cited paper's work. Examples (1) and (2) are from Chen et al. (2021); examples (3) and (4) from [citation missing in source].

(1) In our context, we call upon computer vision techniques to study the cephaloocular behavior of drivers.

(2) The main objective of our work is the elaboration of a new computer vision system for evaluating and improving driving skills of older drivers (age between 65 and 80 years).

(3) The features used in this work are complex and difficult to interpret and it isn't clear that this complexity is required.

(4) Our approach enables easy pruning of the RNN decoder equipped with visual attention, whereby the best number of weights to prune in each layer is automatically determined.

Table 7: Examples from the benchmark of targets that contain information that is not present in the input.

Input: We present the syntax-based string-to-tree statistical machine translation systems built for the WMT 2013 shared translation task. Systems were developed for four language pairs. We report on adapting parameters, targeted reduction of the tuning set, and post-evaluation experiments on rule binarization and preventing dropping of verbs.

Target: It is worth noting that the German parse trees [0] tend to be broader and shallower than those for English.

Comment: From this input we cannot determine whether German parse trees are broader and shallower than English parse trees.

Input: Data selection is an effective approach to domain adaptation in statistical machine translation. The idea is to use language models trained on small in-domain text to select similar sentences from large general-domain corpora, which are then incorporated into the training data. Substantial gains have been demonstrated in previous works, which employ standard n-gram language models. Here, we explore the use of neural language models for data selection. We hypothesize that the continuous vector representation of words in neural language models makes them more effective than n-grams for modeling unknown word contexts, which are prevalent in general-domain text. In a comprehensive evaluation of 4 language pairs (English to German, French, Russian, Spanish), we found that neural language models are indeed viable tools for data selection: while the improvements are varied (i.e. 0.1 to 1.7 gains in BLEU), they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.

Target: Analyses have shown that this augmented data can lead to better statistical estimation or word coverage [0].

Comment: Here we do not know what "this" refers to in the target, and the input does not mention anything about "better statistical estimation" or "word coverage".