Capturing Relations between Scientific Papers: An Abstractive Model for Related Work Section Generation

Given a set of related publications, related work section generation aims to provide researchers with an overview of a specific research area by summarizing these works and introducing them in a logical order. Most existing related work generation models follow an inflexible extractive style, directly extracting sentences from multiple original papers to form the related work discussion. Hence, in this paper, we propose a Relation-aware Related work Generator (RRG), which generates an abstractive related work section from multiple given scientific papers in the same research area. Concretely, we propose a relation-aware multi-document encoder that relates one document to another according to their content dependency in a relation graph. The relation graph and the document representations interact and are refined iteratively, complementing each other during training. We also contribute two public datasets composed of related work sections and their corresponding papers. Extensive experiments on the two datasets show that the proposed model brings substantial improvements over several strong baselines. We hope that this work will promote advances in the related work generation task.


Introduction
The related work section generation task aims to automatically generate a summary of the most relevant works in a specific research area, which can help researchers familiarize themselves with the state of the art in the field. Several methods (Hoang and Kan, 2010; Hu and Wan, 2014; Chen and Zhuge, 2019) have been proposed to study how to obtain the related work section automatically by extracting important sentences from multiple original papers. However, extractive approaches lack the sophisticated abilities that are crucial to high-quality summarization, such as paraphrasing and generalization, and often lead to a related work section with poor coherence and readability (See et al., 2017; Hsu et al., 2018). For example, as shown in Table 1, the extracted sentences share the pattern "We find..." as the subject of the sentence, which, as a matter of fact, refers to different authors. On the contrary, the abstractive related work in Table 1 reveals that the works are conducted by different scholars. It also has conjunction words such as "Furthermore" and "However", which can explain the logical relationships between the cited works and thus form an elegant narration. Hence, in this paper, we target the abstractive related work generation task, which generates a related work section including novel words and phrases not copied from the source text.

Extractive Related Work: We find that CRISPR/Cas9 can robustly and specifically reduce the expression of these microRNAs up to 96% [1]. We find that miRNA knockdown phenotypes caused by CRISPR/Cas9 transient editing can be stably maintained in both in vitro and in vivo models for a long term (up to 30 days) [2]. Although genome editing using the CRISPR-Cas system is highly efficient in human cell lines, CRISPR-Cas genome editing in primary human cells is more challenging [3].
Abstractive Related Work: Recently, [1] showed that CRISPR-Cas9 targeted miRNA-17, miRNA-200c and miRNA-141, repressed their activity in human colon cancer cell lines HCT116 and HT-29. Furthermore, in vivo targeting was effective for at least a month [2]. However, off-target mutagenesis and effects of a single miRNA on various gene targets are the limitations to the use of this modern technology, specifically in brain disorders like prion diseases [3].
Table 1: Comparison of a related work paragraph generated by an extractive method and a human-written abstractive related work paragraph, given the same multiple original papers.
There are two main challenges in this task: (1) the related work should summarize the contribution of each paper, and (2) it should explain the relationships between different papers, such as parallel, transitional, and progressive relations, so as to introduce them in a logical order. While existing summarization models can address the first problem, they do not target comparing and explaining the relationships between these articles. Hence, to tackle the above challenges, we propose a Relation-aware Related work Generator (RRG), which generates an abstractive related work section given multiple scientific papers in the same research area. First, we encode the multiple input articles in a hierarchical manner, obtaining an overall representation for each document. Then, we propose a relation-aware multi-document encoder that relates the multiple input documents in a relation graph. In the training process, the relation graph and the document representations interact and are refined iteratively, complementing each other. Finally, in the decoder, we utilize the relation graph information to assist the decoding process, where the model learns to decide whether to pay attention to the input documents or the relationship between them.
To evaluate our model, we introduce two large-scale related work generation datasets, which are composed of related work sections and their corresponding papers. Extensive experimental results show that RRG outperforms several strong baselines in terms of ROUGE metrics and human evaluations on both datasets.
In summary, our contributions include: • We address an abstractive related work generation task, which aims to generate an abstractive related work with novel words and phrases.
• We propose a relation-aware multi-document encoder that relates one of the multiple input documents to another, and establishes a relation graph storing the dependency between documents.
• We contribute two public large-scale related work generation datasets that are beneficial for the community.

Related Work
We discuss the related work on related work generation and multi-document summarization.
Related Work Generation. Most of the previous related work section generation methods are extractive. For example, Hoang and Kan (2010) take in a set of keywords arranged in a hierarchical fashion to drive the creation of an extractive related work. Later, Hu and Wan (2014) first exploit a Probabilistic Latent Semantic Analysis (PLSA) model to split the sentence set of multiple reference papers into different topic-biased parts, then apply regression models to learn the importance of the sentences, and finally employ an optimization framework to generate the related work section. Chen and Zhuge (2019) propose to first construct a minimum Steiner tree of the keywords; the summary is then generated by extracting sentences that cover the Steiner tree from the papers that cite the reference papers of the paper being written.
However, abstractive approaches to related work generation have met with limited success. Apart from the lack of sufficient training data, neural models also face the challenge of identifying the logical relationships between multiple input documents.
Multi-document Summarization. The multi-document summarization task aims to cover the key shared relevant information among all the documents while avoiding redundancy (Goldstein et al., 2000). Existing multi-document summarization methods are mostly extractive (Christensen et al., 2013; Parveen and Strube, 2014; Ma et al., 2016; Chu and Liu, 2018); for example, a heterogeneous graph-based neural network has been presented that contains semantic nodes of different granularity levels apart from sentences. Recently, a vast majority of the literature has been dedicated to abstractive multi-document summarization, including a large-scale multi-document summarization dataset created from scientific articles. Jin et al. (2020) propose a multi-granularity interaction network covering both extractive and abstractive approaches. Li et al. (2020a) develop a neural abstractive multi-document summarization model which leverages explicit graph representations of documents to guide the summary generation process.
While the multi-document summarization task aims to extract information shared by multiple documents, related work generation aims to compare and introduce the cited works in a logical order.

Datasets
We build our datasets from two sources of scientific papers: the first is S2ORC, which consists of papers from multiple domains, and the second is Delve (Akujuobi and Zhang, 2017), which consists of computer science papers. All the papers in each of these two datasets form a large connected citation graph, allowing us to make full use of the citation relationships between papers.
Dataset Preprocessing. For each case, the generation target is a paragraph with more than two citations, as a comprehensive related work usually compares multiple works under the same topic. The abstract of each cited paper is regarded as input, considering that the main idea of a cited paper is described in its abstract. We then conduct a human evaluation to examine the dataset quality. Concretely, we sample 200 cases from both datasets and ask three annotators to state how well they agree with the following statement, on a scale of one to three (disagree, neutral, agree): the related work can be partly generated based on the given abstracts of the cited papers. The evaluation is conducted on Amazon Mechanical Turk, which has been employed in a variety of NLP tasks including summarization (Liu and Lapata, 2019a), question answering (Gan and Ng, 2019), and dialog systems (Li et al., 2020b). The result shows that 94.5% of the cases receive a score of 3, while only 3.5% receive a score of 1. This demonstrates the good quality of the datasets.
Statistics. Table 2 compares Delve and S2ORC to other public datasets including DUC data from 2003 and 2004, TAC 2011 data, and Multi-News, which are typically used in multi-document settings. We also list the statistics of a recent related work generation dataset, RWS, proposed by Chen and Zhuge (2019). The total numbers of collected samples for S2ORC and Delve are about 150,000 and 80,000, respectively. It can be seen that Multi-News is most similar to our datasets due to its large scale. However, the average number of documents per case in Multi-News is smaller than in ours.

Problem Formulation
Before presenting our approach to related work generation, we first introduce our problem formulation and the notation used.
To begin with, for a set of relevant papers D = (d_1, d_2, ..., d_N) in a specific area, where d_i denotes a paper, we assume there is a corresponding related work Y = (y_1, y_2, ..., y_T). N is the number of relevant papers, and T is the number of words in the related work. Given the multiple papers D, our model generates a related work Ŷ = (ŷ_1, ŷ_2, ..., ŷ_T). Finally, we use the difference between the generated related work Ŷ and the ground-truth related work Y as the training signal to optimize the model parameters.
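Using the notation of this section, the training signal can be written as a standard negative log-likelihood objective; the following is a sketch of the usual form, not necessarily the paper's exact loss:

```latex
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log P\left(y_t \mid y_{<t},\, D;\, \theta\right)
```

Minimizing this objective drives the generated sequence Ŷ toward the ground-truth related work Y word by word.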

Overview
In this section, we introduce the Relation-aware Related work Generator (RRG) in detail. An overview of RRG is shown in Figure 1, which has three main parts: • Hierarchical Encoder reads multiple input documents and learns the multi-level representations for words and documents.
• Relationship Modeling relates one paper to another and obtains their relationship graph.
• Related Work Generator produces the abstractive related work by attending to the hierarchical representations and the relation graph between documents.

Hierarchical Encoder
To begin with, each input word w_ij is converted into a vector representation ê_ij by the learned embeddings.

Figure 1: Overview of RRG, which consists of three parts: (1) Hierarchical Encoder encodes the multiple inputs at hierarchical levels; (2) Relationship Modeling relates one paper to another and stores their relation graph; and (3) Related Work Generator generates the related work by attending to the input documents and the relationships between them.
We then assign positional encodings (PE) to indicate the position of the word w_ij, where two positions need to be considered, namely the document index i and the word index j. We concatenate the position embeddings PE_i and PE_j to obtain the final position embedding p_ij. The definition of the positional encoding is consistent with the Transformer (Vaswani et al., 2017). The input word representation e_ij is obtained by adding the embedding ê_ij and the position embedding p_ij. We then perform multi-head self-attention across the word representations in the same document to obtain the contextual word representation h^w_ij:

h^w_ij = MHAM(e_ij, e_i*),   (1)

where MHAM denotes the Multi-head Attention Module (Vaswani et al., 2017), and * denotes the index j ∈ (1, N_i). Concretely, the first input is for the query and the second input is for the keys and values. Each output element h^w_ij is computed as the weighted sum of linearly transformed input values:

h^w_ij = Σ_{l=1}^{N_i} β_{ij,l} (W^V_w e_il).   (2)

Here, β_{ij,l} is computed using a compatibility function that compares two input elements:

β_{ij,l} = softmax_l(α_{ij,l}),   (3)
α_{ij,l} = (W^Q_w e_ij)(W^K_w e_il)^T / √d,   (4)

where d is the hidden dimension, and W^Q_w, W^K_w, W^V_w are parameter matrices. From the word-level representations we obtain the overall representation for each document, e.g., by pooling over its word states:

h^0_{d_i} = Pool(h^w_{i1}, ..., h^w_{iN_i}).   (5)
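As a concrete sketch of this word-level encoding step, the snippet below implements a single attention head with scaled dot-product weights plus a mean-pooled document vector. The single head, random parameters, and mean pooling are illustrative assumptions; the actual model uses multi-head attention with learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(E, Wq, Wk, Wv):
    """Single-head self-attention over the word embeddings E of one
    document (shape [n_words, d]); returns contextual word states."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    d = Q.shape[-1]
    beta = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # compatibility weights
    return beta @ V                                 # weighted sum of values

rng = np.random.default_rng(0)
d = 8
E = rng.normal(size=(5, d))          # 5 words, hidden dimension 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
H = self_attend(E, Wq, Wk, Wv)       # contextual word representations
doc_rep = H.mean(axis=0)             # one pooled document vector
```

Stacking such heads and concatenating their outputs recovers the standard multi-head formulation.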

Relationship Modeling
The document representation h^0_{d_i} does not contain cross-document information; thus, it cannot capture richer structural dependencies among textual units. In this subsection, we introduce a novel graph-based Relationship Modeling (RM) module, which not only allows sharing information across multiple documents but also models the logical dependencies between documents. Note that it is impossible to explicitly enumerate all the relationships between documents, because the relationships vary from document pair to document pair depending on the document content, and the content of documents is unlimited. Hence, we model the relationships as hidden vectors and let the model capture such diverse relationships through these vectors. The relationship graph is constructed based on the representation of each document, while a comprehensive document representation should in turn consider its relationships with other documents; these two processes complement each other. Hence, our RM module is an iterative module with a stack of L identical layers. In each layer, we update the relationship graph, and then fuse the information from the graph into the document representations, as shown in Figure 2. To start, each relation edge in our graph is initialized from the document representations:

h^0_{r_ij} = MLP([h^0_{d_i}; h^0_{d_j}]),   (6)

where MLP is a multi-layer perceptron, and [;] is the concatenation operation.
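The edge initialization can be sketched as follows. The two-layer MLP with a tanh hidden layer and the specific sizes are illustrative assumptions, not the paper's exact configuration; the point is that every ordered document pair (i, j) gets its own edge vector.

```python
import numpy as np

def init_relation_edges(doc_reps, W1, b1, W2, b2):
    """Initialize edge h_r[i, j] = MLP([d_i ; d_j]) for every ordered
    document pair, giving an N x N grid of edge vectors."""
    N, _ = doc_reps.shape
    edges = np.empty((N, N, W2.shape[1]))
    for i in range(N):
        for j in range(N):
            x = np.concatenate([doc_reps[i], doc_reps[j]])  # [d_i ; d_j]
            hidden = np.tanh(x @ W1 + b1)                   # hidden layer
            edges[i, j] = hidden @ W2 + b2                  # edge vector
    return edges

rng = np.random.default_rng(0)
N, d, d_edge = 3, 8, 8
docs = rng.normal(size=(N, d))                  # document representations
W1, b1 = rng.normal(size=(2 * d, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, d_edge)), np.zeros(d_edge)
edges = init_relation_edges(docs, W1, b1, W2, b2)
```

Because the pair is ordered, edges[i, j] and edges[j, i] can differ, which lets the graph encode asymmetric relations such as "X extends Y".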
In each iteration, we first propose a Relation Graph Updater (RGU) to renew the graph based on the document representations polished so far (shown in the right part of Figure 2):

s^{l-1}_{ij} = MHAM(h^{l-1}_{r_ij}, h^{l-1}_{d_*}).   (7)

Here, * denotes the index i ∈ (1, N), meaning that all document representations are involved in updating the relation graph. Concretely, RGU first aggregates the information from both the previous graph edge h^{l-1}_{r_ij} and the document states h^{l-1}_{d_*} from the last layer, using multi-head attention (the MHAM introduced above): the input for the query Q is h^{l-1}_{r_ij}, and the input for the keys K and values V is h^{l-1}_{d_*}. The output intermediate graph states s^{l-1}_{ij} are further encoded using a feed-forward layer and then merged with the intermediate hidden states h^{l-1}_{r_ij} using a residual connection and layer normalization.
We summarize the procedure below:

z^{l-1}_{ij} = σ(W_z [s^{l-1}_{ij}; h^{l-1}_{r_ij}]),
c^l_{ij} = z^{l-1}_{ij} ⊙ c^{l-1}_{ij} + (1 − z^{l-1}_{ij}) ⊙ s^{l-1}_{ij},
h^l_{r_ij} = tanh(c^l_{ij}),

where ⊙ denotes the Hadamard product, and c^{l-1}_{ij} is the internal cell state. z^{l-1}_{ij} is the update gate that controls which information to retain from the previous memory state. This update strategy is conceptually similar to long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). It differs in that multi-head attention is used, and thus multiple graph slots are supported instead of the single one in an LSTM, which gives it a higher capacity for modeling complex relations.
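One plausible instantiation of the gated update described above is sketched below; the exact gating equations of RRG may differ, so the weight shapes and the final tanh are assumptions. The gate z decides, element-wise, how much of the old cell state survives versus how much of the new attention summary s is written in.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_edge_update(s, h_prev, c_prev, Wz, Uz):
    """LSTM-style gated update of one relation edge: the update gate z
    mixes the previous cell state c_prev with the new attention summary
    s using Hadamard (element-wise) products."""
    z = sigmoid(s @ Wz + h_prev @ Uz)    # update gate in (0, 1)
    c = z * c_prev + (1.0 - z) * s       # new internal cell state
    h = np.tanh(c)                       # new edge representation
    return h, c

rng = np.random.default_rng(0)
d = 8
s, h_prev, c_prev = (rng.normal(size=d) for _ in range(3))
Wz, Uz = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h_new, c_new = gated_edge_update(s, h_prev, c_prev, Wz, Uz)
```

In the full model this update runs for every edge (i, j) at every layer l, with the attention summary s produced by the RGU.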
Next, the updated graph is fused in a Relation-aware Attention Module (RAM) to update the document representation:

h^l_{d_i} = RAM(h^{l-1}_{d_i}, h^{l-1}_{d_*}, h^{l-1}_{r_i*}).

RAM is similar to MHAM, where h^{l-1}_{d_i} serves as the query and h^{l-1}_{d_*} provides the keys and values. However, there are two changes to Equation 2 and Equation 4. Specifically, we modify Equation 2 to propagate edge information to the sub-layer output:

h^l_{d_i} = Σ_j β^{l-1,r}_{i,j} (W^V_d h^{l-1}_{d_j} + W^R h^{l-1}_{r_ij}).   (9)

In this way, the representation of each document is more comprehensive, incorporating its relational dependency information with other documents. What is more, when deciding the weight of each edge, i.e., β^{l-1,r}_{i,j}, we also incorporate relation edge information, since close relationships such as succession or transition can have a great impact on the edge weight. Concretely, Equation 4 is changed to:

α^{l-1,r}_{i,j} = (W^Q_d h^{l-1}_{d_i})(W^K_d h^{l-1}_{d_j} + W^R h^{l-1}_{r_ij})^T / √d.   (10)

We summarize the whole relationship modeling process as:

h^L_{d_*}, h^L_{r_**} = RM(h^0_{d_*}, h^0_{r_**}).

For brevity, we omit the layer index L in the following sections.
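A single-head sketch of this relation-aware attention follows, in the spirit of relation-aware self-attention: the projected edge vector enters both the compatibility score and the attended value. The single head and the shared edge projection Wr are simplifying assumptions for illustration.

```python
import numpy as np

def relation_aware_attention(docs, edges, Wq, Wk, Wv, Wr):
    """Relation-aware attention sketch: edge vectors contribute to both
    the compatibility scores and the attended values."""
    N, d = docs.shape
    Q, K, V = docs @ Wq, docs @ Wk, docs @ Wv
    R = edges @ Wr                            # projected edges [N, N, d]
    scores = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            scores[i, j] = Q[i] @ (K[j] + R[i, j]) / np.sqrt(d)
    beta = np.exp(scores - scores.max(axis=-1, keepdims=True))
    beta /= beta.sum(axis=-1, keepdims=True)  # row-wise softmax weights
    out = np.empty((N, d))
    for i in range(N):
        out[i] = sum(beta[i, j] * (V[j] + R[i, j]) for j in range(N))
    return out

rng = np.random.default_rng(0)
N, d = 3, 8
docs = rng.normal(size=(N, d))        # document representations
edges = rng.normal(size=(N, N, d))    # relation edge vectors
Ws = [rng.normal(size=(d, d)) for _ in range(4)]
new_docs = relation_aware_attention(docs, edges, *Ws)
```

Each updated document vector thus carries a summary of the other documents weighted by, and blended with, their pairwise relation edges.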

Related Work Generator
To generate a consistent and informative summary, we propose an RNN-based decoder that incorporates the outputs of the hierarchical encoder and the relationship graph, as illustrated in Figure 1. Our decoder is a single-layer unidirectional LSTM. At each step t, the decoder updates the hidden state from s_{t-1} to s_t:

s_t = LSTM(s_{t-1}, y_{t-1}).

Following previous works (Bahdanau et al., 2015), we employ an attention mechanism to compute the attention distribution over the source words in the sequence-to-sequence structure:

α^w_{t,ij} = softmax(v_w^T tanh(W_w h^w_ij + U_w s_t)),
c^w_t = Σ_{i,j} α^w_{t,ij} h^w_ij,

where c^w_t denotes the word context vector. Similarly, we extend the attention mechanism to the document level:

α^d_{t,i} = softmax(v_d^T tanh(W_d h_{d_i} + U_d s_t)),   (17)
c^d_t = Σ_i α^d_{t,i} h_{d_i}.

The encoded relationship information is also important for facilitating the transitional introductions in the related work, and the specific information in the graph that is needed at each step depends on which document is being introduced. Hence, we employ the document-level attention weights in Equation 17 to read the relationship graph:

c^r_t = Σ_i Σ_j α^d_{t,i} α^d_{t,j} h_{r_ij}.

Finally, an output projection layer is applied to obtain the final generation distribution P^v_t over the vocabulary, as shown in Equation 20:

P^v_t = softmax(W_o [s_t; c^w_t; c^d_t; c^r_t] + b_o).   (20)

Our objective function is the negative log-likelihood of the target word y_t:

L = − Σ_{t=1}^{T} log P^v_t(y_t).

To handle the out-of-vocabulary (OOV) problem, we equip our decoder with a pointer network (Gu et al., 2016; See et al., 2017). This process is the same as in the model described by See et al. (2017) and is thus omitted here due to limited space.
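The additive (Bahdanau-style) attention used at each decoding step can be sketched as below. This is a minimal single-step version with random parameters; in the model, the same mechanism is applied at the word level, the document level, and, via the document weights, to the relation graph.

```python
import numpy as np

def additive_attention(states, s_t, Wh, Ws, v):
    """Bahdanau-style additive attention: score each encoder state
    against the decoder state s_t, normalize with a softmax, and return
    the context vector together with the attention weights."""
    scores = np.tanh(states @ Wh + s_t @ Ws) @ v   # alignment scores [n]
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                            # attention distribution
    context = alpha @ states                       # weighted context vector
    return context, alpha

rng = np.random.default_rng(0)
n, d = 6, 8
states = rng.normal(size=(n, d))     # encoder word (or document) states
s_t = rng.normal(size=d)             # decoder hidden state at step t
Wh, Ws = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
c_t, alpha = additive_attention(states, s_t, Wh, Ws, v)
```

Concatenating the resulting word, document, and graph context vectors with s_t, and projecting to the vocabulary, yields the generation distribution at step t.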
Experimental Setup

Baselines
To evaluate the performance of our proposed model, we compare it with the following baselines. Extractive Methods: (1) LEAD selects the first sentence of each document to form the summary. (2) TextRank (Mihalcea and Tarau, 2004) is a multi-document graph-based ranking model. (3) BertSumEXT (Liu and Lapata, 2019b) is an extractive summarization model built on BERT. (4) MGSum-ext (Jin et al., 2020) is a multi-granularity interaction network for extractive multi-document summarization. Abstractive Methods: (1) PTGen+Cov combines the sequence-to-sequence framework with copy and coverage mechanisms for summarization (See et al., 2017).

Implementation Details
We implement our model in TensorFlow (Abadi et al., 2016) on an NVIDIA GTX 1080 Ti GPU. For all the neural models, we truncate the input articles to 500 tokens in total in the following way: for each example with S source input documents, we take the first 500/S tokens from each source document. The maximum number of documents is set to 5. The minimum decoding step is 50, and the maximum step is 100. The word embedding dimension is set to 128 and the number of hidden units is 256. We initialize all of the parameters randomly using a Gaussian distribution. The batch size is set to 16, and we limit the vocabulary size to 50K. We use the Adagrad optimizer (Duchi et al., 2010) as our optimizing algorithm. We also apply gradient clipping (Pascanu et al., 2013) with a range of [−2, 2] during training. For testing, we employ beam search with a beam size of 4 to generate more fluent summaries.
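The truncation scheme above can be sketched as a few lines; documents shorter than their share simply keep all their tokens. The helper name is ours, not from the paper.

```python
def truncate_inputs(docs, budget=500):
    """Split a total token budget evenly across the source documents:
    each of the S documents keeps its first budget // S tokens."""
    per_doc = budget // len(docs)
    return [doc[:per_doc] for doc in docs]

# Three documents of 300, 120, and 400 tokens; 500 // 3 = 166 per doc.
docs = [["tok"] * 300, ["tok"] * 120, ["tok"] * 400]
truncated = truncate_inputs(docs)
```

This keeps the total input near the 500-token limit while giving every source document an equal share of the budget.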
To obtain the extractive oracle, since it is computationally expensive to find a globally optimal subset of sentences that maximizes the ROUGE score, we employ a greedy approach: we add one sentence at a time to the summary, such that the ROUGE score of the current set of selected sentences is maximized with respect to the entire gold summary.
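The greedy oracle can be sketched as follows; here a simple unigram-F1 overlap stands in for the full ROUGE package, and the sentence budget and stopping rule are assumptions for illustration.

```python
from collections import Counter

def unigram_f1(pred_tokens, gold_tokens):
    """ROUGE-1 F1 style overlap between two token lists."""
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if not pred_tokens or not gold_tokens or not overlap:
        return 0.0
    p, r = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * p * r / (p + r)

def greedy_oracle(sentences, gold, max_sents=3):
    """Greedily add the sentence that most improves overlap with the
    gold summary; stop when no sentence helps or the budget is hit."""
    selected, best = [], 0.0
    while len(selected) < max_sents:
        gains = []
        for i, sent in enumerate(sentences):
            if i in selected:
                continue
            cand = [t for j in selected + [i] for t in sentences[j]]
            gains.append((unigram_f1(cand, gold), i))
        if not gains:
            break
        score, idx = max(gains)
        if score <= best:
            break
        best, selected = score, selected + [idx]
    return selected

sents = [["deep", "models", "summarize"], ["the", "cat", "sat"],
         ["graphs", "model", "relations"]]
gold = ["deep", "models", "summarize", "graphs", "model", "relations", "well"]
oracle = greedy_oracle(sents, gold)
```

The stopping condition (no further score gain) keeps the oracle from padding the summary with sentences that only dilute precision.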

Automatic Evaluation
Following Chen et al. (2018), we evaluate summarization quality using ROUGE F1 (Lin, 2004). We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) to assess informativeness, and the longest common subsequence (ROUGE-L) as a means of assessing fluency. Table 3 summarizes our results. The first block in the table includes extractive systems, and the second block includes abstractive baselines. As can be seen, abstractive models generally outperform extractive ones, especially in terms of ROUGE-L scores. We attribute this result to the observation that the gold related works in these datasets tend to use novel word combinations to summarize the original input documents, which demonstrates the necessity of solving the abstractive related work generation task. Among abstractive models, surprisingly, BertSumABS does not perform as well as other state-of-the-art baselines. This is probably because BERT does not fit well on scholarly data with many technical terms. Finally, our model RRG gains an improvement of 1.83 (1.08) points compared with BertSumABS and 1.54 (0.83) points compared with GS on ROUGE-1 on S2ORC (Delve), verifying the effectiveness of our RRG. Table 3 also summarizes ablation studies aiming to assess the contribution of individual components in our RRG model. The results confirm that encoding the paragraph position in addition to the token position within each paragraph is beneficial (see row w/o PP), as is relationship modeling (row w/o RM). Updating the relation graph also helps the summarization process, where removing the update mechanism causes ROUGE-L to drop by 0.86 (0.99) (row w/o Upd) on the S2ORC (Delve) dataset.

Human Evaluation
We also assessed the generated results by eliciting human judgments on 30 randomly selected test instances from the Delve dataset. Our first evaluation study quantified the degree to which summarization models retain key information, following a question-answering paradigm (Liu and Lapata, 2019a). We created a set of questions based on the gold related work and examined whether participants were able to answer these questions by reading the generated related works. The principle for writing a question is that the information to be answered is a factual description that is necessary for a related work section. Two Ph.D. students majoring in computer science (also authors) wrote five questions independently for each sampled ground-truth related work, since the Delve dataset consists of computer science papers. They then jointly selected the common questions that they both considered important as the final questions. We finally obtained 67 questions, where correct answers are marked with 1 and incorrect ones with 0. Examples of questions and their answers are given in Table 5. Our second evaluation study assessed the overall quality of the related works by asking participants to score them according to the following criteria: Informativeness (does the related work convey important facts about the topic in question?), Coherence (is the related work coherent and grammatical?), and Succinctness (does the related work avoid repetition?). The rating score ranges from 1 to 3, with 3 being the best. For both evaluation metrics, a model's score is the average of all scores.
Both evaluations were conducted on the Amazon Mechanical Turk platform with 3 responses per HIT. Participants evaluated related works produced by BertSumABS, MGSum-abs, GS, and our RRG. All evaluated models are those that achieved the best performance in the automatic evaluations. Table 4 lists the average scores of each model, showing that RRG outperforms the other baseline models on all metrics. We calculate the kappa statistics for informativeness, coherence, and succinctness, and the scores are 0.38, 0.29, and 0.34, respectively. To verify the significance of these results, we also conduct a paired Student's t-test between our model and GS (the row with shaded background). We obtain p-values of 6 × 10^−6, 5 × 10^−9, and 7 × 10^−7 for informativeness, coherence, and succinctness, respectively. Examples of system output are provided in Table 5. We can see that the related work generated by RRG correctly captures the relationship between papers [1, 2] and [3, 4], and successfully summarizes the contributions of the corresponding papers. Among the baselines, MGSum-ext fails to connect the cited papers logically. MGSum-abs and GS fail to capture the transitional relationship between the first two works and the last two works.

Analysis of Relation Graph
To fully investigate what is stored by the relation graph, we draw a heatmap of the graph for the case in Table 5. Since each edge in the relation graph is a vector carrying semantic meaning, which cannot be directly interpreted, we use the edge between papers [2] and [3] as a benchmark and compute the cosine similarity between this benchmark and the other relation edges. A dark color means that the relationship between the corresponding two papers is similar to that of edge [2]-[3], and vice versa. We already know that there is a transitional relationship between [2] and [3], so an edge with a high cosine similarity to the benchmark suggests a similar transitional relationship between its document pair.
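The similarity map behind such a heatmap is straightforward to compute; the sketch below uses random edge vectors in place of the model's learned ones, and 0-based indices (1, 2) for the 1-based paper pair [2]-[3].

```python
import numpy as np

def edge_similarity_map(edges, bench_i, bench_j):
    """Cosine similarity of every relation edge against one benchmark
    edge, yielding an N x N map that can be rendered as a heatmap."""
    bench = edges[bench_i, bench_j]
    norm = np.linalg.norm
    sims = np.empty(edges.shape[:2])
    for i in range(edges.shape[0]):
        for j in range(edges.shape[1]):
            e = edges[i, j]
            sims[i, j] = e @ bench / (norm(e) * norm(bench))
    return sims

rng = np.random.default_rng(0)
edges = rng.normal(size=(4, 4, 8))       # 4 papers, 8-dim edge vectors
sims = edge_similarity_map(edges, 1, 2)  # benchmark edge: papers [2]-[3]
```

By construction, the benchmark cell has similarity 1, and cells near 1 mark document pairs whose learned relation resembles the benchmark's transitional relation.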

Conclusion
In this paper, we conceptualized the abstractive related work generation task as a machine learning problem. We proposed a new model that encodes multiple input documents hierarchically and models the latent relations across them in a relation graph. We also contribute two public large-scale related work generation datasets. Experimental results show that our model produces related works that are both fluent and informative, outperforming competitive systems by a wide margin. In the future, we would like to apply our model to abstract generation and paper generation tasks.