Scientific Discourse Tagging for Evidence Extraction

Evidence plays a crucial role in any biomedical research narrative, providing justification for some claims and refutation for others. We seek to build models of scientific argument by applying information extraction methods to full-text papers. We present a capability for automatically extracting, from primary research papers, the text fragments that describe the evidence presented in the paper's figures, arguably the raw material of any scientific argument the paper makes. We apply richly contextualized deep representation learning, pre-trained on biomedical-domain corpora, to the analysis of scientific discourse structure and the extraction of "evidence fragments" (i.e., the text in the Results section describing the data presented in a specified subfigure) from a set of biomedical experimental research articles. We first demonstrate our state-of-the-art scientific discourse tagger on two scientific discourse tagging datasets and show its transferability to new datasets. We then show the benefit of leveraging scientific discourse tags for downstream tasks such as claim extraction and evidence fragment detection. Our work demonstrates the potential of evidence fragments derived from figure spans for improving the quality of scientific claims by cataloging, indexing, and reusing evidence fragments as independent documents.


Introduction
Primary experimental articles (i.e., papers that describe original experimental work) provide the crucial raw material for all subsequent scientific research. However, the rapidly growing volume of scientific literature makes it increasingly difficult for domain experts to use it efficiently. Automatic information extraction from biomedical literature is a crucial step toward helping researchers achieve this goal.*

Figure 1: An example paragraph tagged with scientific discourse tags on each clause in the SciDT dataset. The text is tokenized and converted to lower case.

lc3 , the mammalian atg8 homolog , undergoes a set of modifications resulting in conversion from lc3i to lc3ii during autophagy 42 . [fact]
to further test the function of rag in autophagy [goal]
we examined the lc3 modification in hek293 cells . [method]
expression of raga ql and ragc sn inhibited lc3 conversion in response to amino acid starvation ( fig. 7e ) . [result]
furthermore , expression of raga tn and ragc ql enhanced lc3 conversion even in the presence of amino acids . [result]
these results are consistent with the data observed in drosophila and further demonstrate a role of the rag gtpases in autophagy regulation in response to nutrient signals [implication]

* Work performed while the author was interning at the Information Sciences Institute, University of Southern California.
Extracting important information from biomedical literature to facilitate and accelerate scientific discovery has long been a goal of computational linguistics (Hobbs, 2002), with a focus on identifying relevant entities, relations, and events in text to populate a knowledge base. However, these methods do not account for the fact that scientific work attempts to provide explanations for evidence derived from experiments, and is therefore driven principally by authors attempting to convince expert readers that their claims are the "correct" explanations for the experimental evidence. Thus, an important step toward building machines capable of understanding scientific literature is recognizing the different rhetorical components of scientific discourse. With these components identified, we can distinguish the observations made in experiments from their implications, and distinguish claims supported by evidence from hypotheses put forward to prompt further research. It is this goal, of distinguishing between the different rhetorical components of scientific discourse so that we can build AI systems that support more accurate analysis and understanding of scientific literature, that motivates our work.

Scientific discourse tagging is the task of labeling the clauses or sentences of a scientific article with the rhetorical components of scientific discourse. Figure 1 shows an example of a tagged paragraph. In this work, we leverage a state-of-the-art contextualized word embedding and a novel word-to-sentence attention mechanism to develop a scientific discourse tagging model that achieves state-of-the-art performance on two benchmark datasets, SciDT and PubMed-20k-RCT (Jin and Szolovits, 2018), improving over prior work by 6.9% and 2.3% absolute F1 respectively (code: https://github.com/jacklxc/ScientificDiscourseTagging). More importantly, we show the strong transferability of our scientific discourse tagger to new datasets by beating the baseline of Huang et al. (2020) via zero-shot prediction on the CODA-19 dataset (Huang et al., 2020). Furthermore, we demonstrate the effectiveness of scientific discourse tagging on two downstream scientific literature understanding tasks, claim extraction and evidence fragment detection, where leveraging discourse tag information proves beneficial. In particular, we outperform the state-of-the-art claim extraction model (Achakulvisut et al., 2019) by 3.8% F1 and the figure span detection baseline (Burns et al., 2017) by 5% F1.

Background and Related Work
Problem Formulation. We define scientific discourse tagging as the task of labeling sentences in a scientific article according to their rhetorical role in the scientific discourse. Formally, a paragraph is represented as an ordered sequence of sentences S = [S_1, S_2, ..., S_n], and each element S_i is annotated with a discourse label L_i drawn from a fixed label set L = {l_1, l_2, ..., l_k}. Note that S_i may be defined differently in different datasets: sentences in the PubMed-RCT dataset (Dernoncourt and Lee, 2017), clauses in the SciDT dataset composed by Burns et al. (2016), and sentence fragments in the CODA-19 dataset (Huang et al., 2020). For conciseness, we refer to all these variations as sentences. The label sets also differ slightly. In PubMed-RCT, L = {objective, background, methods, results, conclusions}; in CODA-19 (Huang et al., 2020), L = {background, purpose, method, finding/contribution, other}; and in the SciDT dataset (Burns et al., 2016), L = {goal, fact, hypothesis, problem, method, result, implication, none}, as defined by De Waard and Maat (2012). Table 1 gives more details about the definitions of the tags.
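To make the shared formulation concrete, the sketch below represents a paragraph as an ordered list of sentences with one discourse label each, using the SciDT label set. This is our illustration only; the class and field names are not part of any dataset's official loader.

```python
from dataclasses import dataclass
from typing import List

# SciDT label set of De Waard and Maat (2012); other datasets swap in their own sets.
SCIDT_LABELS = ["goal", "fact", "hypothesis", "problem",
                "method", "result", "implication", "none"]

@dataclass
class TaggedParagraph:
    """A paragraph S = [S_1, ..., S_n] with one discourse label L_i per sentence."""
    sentences: List[str]
    labels: List[str]

    def __post_init__(self) -> None:
        assert len(self.sentences) == len(self.labels)
        assert all(label in SCIDT_LABELS for label in self.labels)

# Abbreviated example in the style of Figure 1.
example = TaggedParagraph(
    sentences=["to further test the function of rag in autophagy",
               "we examined the lc3 modification in hek293 cells ."],
    labels=["goal", "method"],
)
```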

Prior Work on Scientific Discourse Tagging
Feature-based Scientific Discourse Tagging. There has been a significant amount of work aimed at understanding types of scientific discourse. Teufel and Moens (1999) and Teufel and Moens (2002) described argumentative zoning, which groups sentences into a few rhetorical zones signaled by cue phrases such as "in this paper we develop a method for". Hirohata et al. (2008) used a conditional random field (CRF) (Lafferty et al., 2001) with handcrafted features to classify sentences in abstracts into four categories: objective, methods, results, and conclusions. Liakata (2010) defined "zones of conceptualization", which classify sentences in scientific papers into 11 categories, and Liakata et al. (2012) used CRF and LibSVM to identify these zones. Guo et al. (2010) used Naive Bayes and Support Vector Machines (SVM) (Cortes and Vapnik, 1995) to compare three schemata: section names, argumentative zones, and conceptual structure of documents. Burns et al. (2016) studied the problem of scientific discourse tagging, which identifies the discourse type of each clause in a biomedical experiment paragraph, and composed a dataset for it. They adopted the discourse type taxonomy for biomedical papers proposed by De Waard and Maat (2012), which contains eight types: goal, fact, result, hypothesis, method, problem, implication, and none, as Table 1 shows. Most recently, Cox et al. (2017) used the same schema (De Waard and Maat, 2012), exploring a variety of methods for balancing classes before applying classification algorithms.
Deep Learning for Scientific Discourse Tagging.
With the rise of deep learning, neural sequence labeling using a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) with a CRF output layer (BiLSTM-CRF) (Huang et al., 2015) has become the prevailing approach for classic word-level sequence tagging problems such as named entity recognition (NER), part-of-speech (POS) tagging, and word segmentation (Huang et al., 2015; Dredze, 2015, 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Peng and Dredze, 2017; Wang et al., 2017; Huang et al., 2019). Since scientific discourse tagging is a sentence-level sequence tagging problem, its input has one additional dimension compared with word-level sequence tagging, so an encoder is required to map word-level representations to clause- or sentence-level representations. One simple approach is to pre-compute sentence embeddings from word embeddings (Arora et al., 2016); more sophisticated methods compute sentence-level embeddings on the fly using a BiLSTM (Jin and Szolovits, 2018; Srivastava et al., 2019) or an attention mechanism, before feeding them into a clause- or sentence-level sequence tagger. Alternatively, as BERT (Devlin et al., 2018) prevails across natural language processing (NLP) tasks, a simple baseline is to use a BERT-like model's (e.g., SciBERT's (Beltagy et al., 2019)) prefix token ([CLS]) representation of each sentence as the sentence representation for classification (Huang et al., 2020). In this work, we combine these methods to build a state-of-the-art scientific discourse tagger.
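As a concrete illustration of the [CLS]-based baseline, the following sketch extracts one SciBERT vector per sentence with the Hugging Face transformers API. This is our own sketch of the general technique, not the code of Huang et al. (2020).

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
model.eval()

sentences = [
    "we examined the lc3 modification in hek293 cells .",
    "these results demonstrate a role of the rag gtpases in autophagy regulation .",
]

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state   # (batch, seq_len, 768)
    cls_embeddings = hidden[:, 0, :]            # [CLS] token = sentence representation

# cls_embeddings can now feed a sentence classifier or a sentence-level sequence tagger.
print(cls_embeddings.shape)  # torch.Size([2, 768])
```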

Downstream Applications
Claim Extraction.
Claim extraction has been extensively studied in various domains. In addition to scientific articles (Stab et al., 2014), previous work has analyzed social media (Dusmanu et al., 2017), news (Habernal et al., 2014; Sardianos et al., 2015), and Wikipedia (Thorne et al., 2018; Fréard et al., 2010) under the task of argumentation mining, which extracts claims and premises. However, the biomedical domain has received less attention and has fewer datasets available. Achakulvisut et al. (2019) introduced an expert-annotated claim-extraction dataset derived from MEDLINE paper abstracts, and proposed a neural network model that significantly outperformed the rule-based method of Sateli and Witte (2015). Figure 2 shows an example abstract with the last two sentences annotated as claims.
In this work, we formulate claim extraction (Achakulvisut et al., 2019) in the same way as scientific discourse tagging: S contains sentences and L_i ∈ {0, 1} indicates whether the corresponding sentence is a claim.

Evidence Fragment Detection. Burns et al. (2017) coined the term "evidence fragment" for the text in the narrative surrounding a figure reference that directly describes the experimental figure. They composed an evidence fragment detection dataset and proposed the evidence fragment detection task, which tags each clause with the subfigure codes it semantically refers to. They further proposed a rule-based method that uses these subfigure codes as anchors to link evidence fragments to the European Bioinformatics Institute's INTACT data records (Orchard et al., 2013). As a result, INTACT's preexisting, manually curated structured interaction data can serve as a gold standard for machine reading experiments.
Burns et al. (2017) formulated the problem as a clause-level tagging problem. Formally, each clause S_i in a paragraph S = [S_1, S_2, ..., S_n] is annotated with the set of subfigure codes F_i = {f_i^1, ..., f_i^m} that the clause semantically refers to, where the length m can be any non-negative integer. Figure 3 illustrates a paragraph with evidence fragment detection annotations: each clause is associated with a set of semantically relevant subfigures.

Scientific Discourse Tagger
Model Overview. We formulate scientific discourse tagging as a sentence-level sequence tagging problem. We develop a deep structured model, extending prior work on SciDT tagging, that consists of a contextualized word embedding layer, an attention layer that summarizes word embeddings into sentence embeddings, and a BiLSTM-CRF sequence tagger (Huang et al., 2015) on top of the sentence embeddings for discourse type tagging. Figure 4 gives an overview of the architecture. We detail each component in this section.

Embeddings. We explore pre-trained BioGloVe embeddings (Burns et al., 2019), BioBERT (Lee et al., 2019), and SciBERT (Beltagy et al., 2019), which are GloVe and BERT embeddings trained on biomedical-domain text.

Sentence Representations via Attention. We observe that only keywords are essential for determining discourse types, and attention is an appropriate mechanism for emphasizing certain inputs and ignoring others. Prior work also explored using an attention mechanism to summarize word representations into sentence representations; we propose a new variation of the attention mechanism using an LSTM. Specifically, we first encode the sentence with an LSTM to obtain a contextualized hidden vector h_i for each word, then use these vectors to compute attention weights over the words; the sentence representation is the attention-weighted sum (see the sketch below).

Sentence-level Sequence Tagging. We observe that the discourse labels follow a clear logical flow (e.g., result is usually followed by implication, and method by hypothesis). We therefore extend the LSTM sequence tagger used in prior work to a BiLSTM-CRF sequence tagger (Huang et al., 2015) to label the discourse type of each sentence in a paragraph.

Labels in BIO Scheme. We train all of our models with labels in the BIO scheme (Sang and Veenstra, 1999); the baseline models for the SciDT dataset do not use the BIO scheme. Specifically, the none label maps to O, and every other label becomes a B- label when the previous label type differs from the current one and an I- label when it is the same.
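To make the word-to-sentence attention concrete, here is a minimal PyTorch sketch of one plausible reading of the LSTM-attention encoder. The layer sizes, the use of a bidirectional LSTM, and the single linear scoring layer are our assumptions; the released implementation in our repository is authoritative.

```python
import torch
import torch.nn as nn

class LSTMAttentionEncoder(nn.Module):
    """Summarize word embeddings into a single sentence vector.

    A BiLSTM produces contextualized hidden states h_i for each word; a learned
    scoring layer turns them into attention weights, and the sentence vector is
    the attention-weighted sum of the hidden states.
    """
    def __init__(self, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.score = nn.Linear(2 * hidden_dim, 1)

    def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
        # word_embeddings: (batch, num_words, embed_dim)
        h, _ = self.lstm(word_embeddings)              # (batch, num_words, 2*hidden)
        weights = torch.softmax(self.score(h), dim=1)  # (batch, num_words, 1)
        return (weights * h).sum(dim=1)                # (batch, 2*hidden)

# Example: 3 sentences of 20 words with 768-dim (e.g. SciBERT) word embeddings.
encoder = LSTMAttentionEncoder(embed_dim=768, hidden_dim=128)
sentence_vectors = encoder(torch.randn(3, 20, 768))    # shape (3, 256)
```

The resulting sentence vectors for a paragraph are then passed as a sequence to the BiLSTM-CRF tagger described above.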

Claim Extractor
Because the claim extraction task (Achakulvisut et al., 2019) shares the same problem formulation, we directly employ the discourse tagging model for claim extraction.

Evidence Fragment Detector
Problem Reduction. As Figure 3 shows, each clause in the evidence fragment detection task may refer to more than one subfigure code, so we cannot solve it directly as a standard classification task. Instead, we reduce it to a clause-level sequence tagging problem under a block-based assumption. We treat each paragraph as a single input. During training, we encode the subfigure code reference sequence of the clauses in each paragraph into a single BIO sequence (Sang and Veenstra, 1999), where B indicates that the clause begins a block, I indicates that the clause belongs to the same block as the previous clause, and O indicates that no subfigure code is referred to, as shown at the end of each clause in Figure 3 and sketched below. For prediction, we decode the semantic subfigure code references of all clauses from the BIO sequence of each paragraph under the same block-based assumption.
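A minimal sketch of this encoding step follows; the function and variable names are our illustration.

```python
from typing import List, Set

def encode_blocks(clause_refs: List[Set[str]]) -> List[str]:
    """Collapse per-clause subfigure-code sets into one BIO tag per clause.

    B = the clause starts a new block, I = the clause shares the previous
    clause's reference set, O = the clause refers to no subfigure.
    """
    tags, prev = [], set()
    for refs in clause_refs:
        if not refs:
            tags.append("O")
        elif refs == prev:
            tags.append("I")
        else:
            tags.append("B")
        prev = refs
    return tags

print(encode_blocks([{"7e"}, {"7e"}, set(), {"7f", "7g"}]))  # ['B', 'I', 'O', 'B']
```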
Block-Based Assumption. Most subfigure code reference labels are block-based: we call a run of contiguous clauses that share the same subfigure code reference labels a block, delimited by red lines in Figure 3. We further observe that most blocks explicitly mention every semantically referred subfigure code at least once. Assuming this property holds for all blocks, we can reconstruct the sequence of semantic subfigure code references for all clauses in a paragraph from two ingredients: the subfigure codes explicitly mentioned within each block, and a BIO sequence indicating where each block starts and ends. Consequently, during encoding we convert the annotated semantically referenced subfigure code labels into the BIO scheme; during decoding we first localize the start and end of each block from the predicted BIO tags, then fill each block with all of its explicitly mentioned subfigure codes (sketched below).
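Decoding inverts the encoding above under the same assumption; here is our sketch.

```python
from typing import List, Set

def decode_blocks(bio_tags: List[str],
                  explicit_mentions: List[Set[str]]) -> List[Set[str]]:
    """Reconstruct per-clause subfigure references from predicted BIO tags.

    Under the block-based assumption, every clause in a block refers to the
    union of the subfigure codes explicitly mentioned anywhere in that block.
    """
    refs: List[Set[str]] = [set() for _ in bio_tags]
    block: List[int] = []

    def flush() -> None:
        if block:
            codes = set().union(*(explicit_mentions[j] for j in block))
            for j in block:
                refs[j] = codes
            block.clear()

    for i, tag in enumerate(bio_tags):
        if tag in ("B", "O"):   # a new block or a gap ends the current block
            flush()
        if tag in ("B", "I"):
            block.append(i)
    flush()                     # close the final block
    return refs

print(decode_blocks(["B", "I", "O", "B"], [{"7e"}, set(), set(), {"7f"}]))
# [{'7e'}, {'7e'}, set(), {'7f'}]
```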
Clause-level Sequence Tagger. The key step in our sequence tagging solution for evidence fragment detection is determining where each block starts and ends, for which we apply a clause-level sequence tagger. Because the evidence fragment detection dataset is small, we empirically find that feature-based CRF sequence taggers outperform neural sequence taggers, so we adopt a feature-based model. In addition to the scientific discourse tags, we use the explicitly mentioned subfigure codes as well as word unigrams, bigrams, and trigrams as features. For each clause, we use all of these features from the current clause together with the same feature sets from the immediately preceding and following clauses.
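The sketch below shows per-clause feature extraction in the style consumed by sklearn-crfsuite. The feature names and exact feature template are our illustration under the assumptions stated above, not the paper's released code.

```python
from typing import Dict, List

def clause_features(clauses: List[str], discourse_tags: List[str],
                    codes: List[List[str]], i: int) -> Dict[str, bool]:
    """Binary features for clause i, plus the same features from its
    immediate neighbors (offsets -1 and +1)."""
    feats: Dict[str, bool] = {}
    for offset in (-1, 0, 1):
        j = i + offset
        if not 0 <= j < len(clauses):
            continue
        p = f"{offset:+d}:"
        feats[p + "discourse=" + discourse_tags[j]] = True  # scientific discourse tag
        for code in codes[j]:                               # explicit subfigure codes
            feats[p + "code=" + code] = True
        words = clauses[j].split()
        for n in (1, 2, 3):                                 # uni-/bi-/trigrams
            for k in range(len(words) - n + 1):
                feats[p + f"{n}gram=" + " ".join(words[k:k + n])] = True
    return feats

# sklearn-crfsuite consumes one feature dict per clause:
#   X_paragraph = [clause_features(clauses, tags, codes, i) for i in range(len(clauses))]
#   crf = sklearn_crfsuite.CRF(); crf.fit([X_paragraph], [bio_tags])
```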

Experimental Setup
We evaluate our scientific discourse tagger on the PubMed-RCT dataset (Dernoncourt and Lee, 2017) and the SciDT dataset (Burns et al., 2016) (Section 5.1). We also examine the transferability of our scientific discourse tagger to new datasets using the CODA-19 dataset (Huang et al., 2020) (Section 5.2). We further study the effectiveness of scientific discourse tags on the claim extraction task via transfer learning, and on the evidence fragment detection task in a pipeline fashion (Section 6). Figure 5 shows the label distributions of the three datasets introduced below, as well as the label mappings used for zero-shot prediction in Section 5.2.

Datasets
PubMed-RCT Dataset. We use PubMed-RCT (Dernoncourt and Lee, 2017) as the standard dataset for evaluating our scientific discourse tagger against strong baselines. PubMed-RCT is derived from PubMed for sequential sentence classification. It has two versions: a smaller PubMed 20k RCT and a ten-times-larger PubMed 200k RCT. Due to limited computational resources, we only consider PubMed 20k RCT in this work. PubMed 20k RCT consists of 20k abstracts of randomized controlled trials (RCTs), with a vocabulary of 68k words across 240k sentences. Each sentence of an abstract is labeled with one of the following roles (section heads): background, objective, method, result, or conclusion.

SciDT Dataset. Compared with PubMed-RCT, SciDT is a smaller dataset with a more fine-grained taxonomy. We further expand the SciDT dataset by applying the same clause parsing and annotation pipeline used in its original construction. The dataset is derived from the Pathway Logic (Eker et al., 2002) and INTACT (Orchard et al., 2013) databases. Texts from all sections of each paper were pre-processed by parsing each sentence into a sequence of main and subordinate clauses using the Stanford Parser (Socher et al., 2013). Domain experts labeled each clause with the taxonomy proposed by De Waard and Maat (2012), whose label distribution is shown in Figure 5. We apply sequential methods to the sequences of clauses in individual paragraphs. Overall, the SciDT dataset contains 634 paragraphs and 6124 clauses. We randomly split off 570 paragraphs as the training and validation set and use the rest as the test set. Each paragraph contains up to 30 clauses, and the number of words per clause has a mean of 17.7 and a standard deviation of 12.5. The total vocabulary size is 8563, making this a small dataset for an NLP task; we note, however, the difficulty of obtaining such a dataset. We further assess the quality of the dataset by re-annotating the test set, obtaining a Cohen's kappa of κ = 0.823, which indicates high annotation quality.

CODA-19 Dataset. CODA-19 (Huang et al., 2020) is a human-annotated dataset covering a subset of the abstracts in CORD-19 (Wang et al., 2020), a corpus of scholarly articles about COVID-19. Each abstract is segmented into sentence fragments by commas (,), semicolons (;), and periods (.). Each sentence fragment is labeled with one of the research aspects background, purpose, method, finding/contribution, or other, a label set similar to that of PubMed-RCT (Dernoncourt and Lee, 2017). There are 10966 abstracts in total. We use this dataset to examine both our architecture's applicability to new datasets and the transferability of our trained scientific discourse tagger.

Baseline Models
PubMed-RCT Dataset. We compare our discourse tagger against two strong baselines on the PubMed 20k RCT dataset: (1) the hierarchical sequential labeling network (HSLN) proposed by Jin and Szolovits (2018) and (2) the state-of-the-art model of Srivastava et al. (2019).

Supervised Learning Results
Table 2 reports the test F1 scores of our scientific discourse tagger and its variations against the baseline models on the PubMed 20k RCT and SciDT datasets. Our best scientific discourse tagger outperforms the state-of-the-art model (Srivastava et al., 2019) on PubMed 20k RCT by more than 2% absolute F1. Given the large size of PubMed 20k RCT, this result robustly demonstrates the strength of our model. Our model also significantly outperforms the previous SciDT tagger by 5% absolute F1 (per McNemar's test, p < 0.01). Based on these results, we claim our scientific discourse tagger is state-of-the-art. Note that for scientific discourse tagging, micro F1 is equivalent to accuracy.
Ablation Studies. We perform ablation studies in Table 2 to compare the effects of different word embeddings and attention mechanisms on the performance of our scientific discourse tagger on the PubMed-RCT and SciDT datasets. All neural models discussed for scientific discourse tagging, including ours, consist of a word embedding, a word-to-sentence encoder, and a sentence-level sequence tagger. As introduced in Section 3.1, our best model uses SciBERT (Beltagy et al., 2019) as the contextualized word embedding, an LSTM-attention structure as the word-to-sentence encoder, and BiLSTM-CRF (Huang et al., 2015) as the sentence-level sequence tagger. Compared with the baseline models, we improve the design by adopting a state-of-the-art BERT-based (Devlin et al., 2018) language model as the contextualized embedding, and, instead of the bidirectional LSTM word-to-sentence encoders used by Jin and Szolovits (2018) and Srivastava et al. (2019), we improve on a previously proposed attention structure. As Table 2 indicates, our main improvement comes from SciBERT (Beltagy et al., 2019): whereas BioBERT (Lee et al., 2019) trains BERT (Devlin et al., 2018) on a biomedical-domain corpus, SciBERT additionally uses a domain-specific vocabulary. BERT-style contextualized embeddings also contribute to the improvement, as BioBERT globally outperforms BioGloVe (Burns et al., 2019), a static embedding trained on a biomedical-domain corpus, on the PubMed-RCT dataset. A further source of improvement is the attention structure: our LSTM-attention outperforms the previously used RNN-attention.
Error Analysis. Figure 6 compares the confusion matrices of the previous SciDT tagger and our best scientific discourse tagger on the SciDT test set. As the overall performance suggests, our model predicts the discourse tags more precisely across the board. In particular, the previous model failed to predict the problem tag, whereas our model achieves 0.63 accuracy on it. Figure 6 also reflects how the difficulty of predicting each discourse label varies with the imbalanced label distribution, as Table 5 shows.

Transfer Learning on CODA-19 Dataset
We further demonstrate the strong performance of our scientific discourse tagger by training it on the CODA-19 dataset (Huang et al., 2020). As Table 3 shows, our model outperforms the baseline of Huang et al. (2020) by 14.6% absolute F1 on the test set. More importantly, we use these results as baselines and CODA-19 as an example dataset to show the transferability of our model to new datasets. We first perform zero-shot prediction using our best scientific discourse taggers trained on the PubMed-RCT (Dernoncourt and Lee, 2017) or SciDT datasets. We map the labels from the original datasets to the target CODA-19 label set by applying a majority vote to the predicted labels on the CODA-19 training set, as the lines in Figure 5 show. We then run the trained taggers on the CODA-19 test set and convert the predicted labels from the original label sets to the target CODA-19 label set. As Table 3 shows, our zero-shot prediction results are even higher than the baseline of Huang et al. (2020), which was trained directly on the CODA-19 dataset. This result indicates that our trained scientific discourse tagger transfers well and is a useful tool for new datasets.
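The majority-vote mapping can be computed in a few lines. The sketch below (ours; function names are illustrative) pairs each source-label prediction on the CODA-19 training set with its gold CODA-19 label and keeps the most frequent target per source label:

```python
from collections import Counter, defaultdict
from typing import Dict, List

def majority_vote_mapping(predicted_src: List[str],
                          gold_tgt: List[str]) -> Dict[str, str]:
    """Map each source-dataset label to the CODA-19 label it most often
    co-occurs with on the CODA-19 training set."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for src, tgt in zip(predicted_src, gold_tgt):
        counts[src][tgt] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}

# e.g. a SciDT-trained tagger's predictions vs. CODA-19 gold labels:
mapping = majority_vote_mapping(
    ["result", "method", "result", "implication"],
    ["finding/contribution", "method", "finding/contribution", "finding/contribution"],
)
# {'result': 'finding/contribution', 'method': 'method',
#  'implication': 'finding/contribution'}
```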
Furthermore, we separately perform standard transfer learning by taking the scientific discourse tagger pre-trained on the PubMed-RCT dataset and fine-tuning it on the CODA-19 dataset, replacing the last CRF layer with a new one that matches the CODA-19 label set. This achieves 0.909 test F1, a further 2.4% absolute F1 improvement over our model trained directly on CODA-19, likely due to the similar label structures of PubMed-RCT and CODA-19.
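In PyTorch terms, replacing the output layer amounts to swapping the CRF head for one sized to the new label set before fine-tuning. The following is a minimal sketch using the pytorch-crf package; the attribute names (`tagger.crf`, `tagger.emission`) and the checkpoint filename are our assumptions about the model object, not the released code, and resizing the emission projection is implied rather than stated in the paper.

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf

# Hypothetical checkpoint: a tagger pre-trained on PubMed-RCT with a 5-label CRF head.
tagger = torch.load("discourse_tagger_pubmed_rct.pt")

CODA19_LABELS = ["background", "purpose", "method", "finding/contribution", "other"]

# Replace the CRF (and the emission projection feeding it) to match CODA-19 labels.
tagger.emission = torch.nn.Linear(tagger.emission.in_features, len(CODA19_LABELS))
tagger.crf = CRF(num_tags=len(CODA19_LABELS), batch_first=True)

# All other parameters keep their pre-trained weights and are fine-tuned end to end.
optimizer = torch.optim.Adam(tagger.parameters(), lr=1e-4)
```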
Downstream Applications

Claim Extraction

Dataset. Achakulvisut et al. (2019) introduced an expert-annotated dataset for extracting claim sentences from biomedical paper abstracts. Following the definitions of Sateli and Witte (2015), they annotate as a claim any statement that declares something is better, proposes something new, or describes a new finding or a causal relationship. Each sentence carries a binary label indicating whether it is a claim, and an abstract may contain multiple claims, as Figure 2 shows. The dataset contains 1500 abstracts sampled from the MEDLINE database.

Baseline Model. Along with the dataset, Achakulvisut et al. (2019) proposed a model that uses the sentence classification technique of Arora et al. (2016) for sentence encoding and a standard BiLSTM-CRF (Huang et al., 2015) as the sentence-level sequence tagger. Their best model was pre-trained on PubMed 200k RCT (Dernoncourt and Lee, 2017) for transfer learning and used GloVe (Pennington et al., 2014) word embeddings.

Model Performance. Table 4 compares the test performance of Achakulvisut et al. (2019) with ours, measured by binary F1 (treating label 0 as the negative class). Training our scientific discourse tagger directly on the claim-extraction dataset yields a test binary F1 of 0.791, already higher than Achakulvisut et al. (2019). Then, as Achakulvisut et al. (2019) suggested, we pre-train the scientific discourse tagger on PubMed 20k RCT (Dernoncourt and Lee, 2017) and fine-tune it on the claim-extraction dataset, replacing the last CRF layer with a new one that matches the binary label set. This yields a test binary F1 of 0.828, a further 3.7% absolute F1 improvement over our model without transfer learning, and demonstrates the benefit of transferring from scientific discourse tagging to its downstream tasks.

Evidence Fragment Detection
Dataset. Burns et al. (2017) introduced the evidence fragment detection dataset, which shares the same format and source of clause-based paragraphs with the SciDT dataset. As Figure 3 shows, each clause was annotated with the subfigure codes it semantically refers to; a clause may refer to no subfigure code or to several simultaneously. The explicit mentions of subfigure codes were also annotated. All paragraphs come from the Results sections of experimental papers, and most are drawn from a subset of the SciDT dataset (Burns et al., 2016).

Baseline Model. Burns et al. (2017) proposed a rule-based method that uses discourse tags, with hypothesis, problem, and fact as indicators of the beginning of an evidence fragment, and result and implication as indicators of its end. They also used other features, including section headings and whether the references to subfigures are entirely disjoint. Note that their document-level rule-based tagging operates across multiple paragraphs of the Results section.

Model Performances. We use a feature-based CRF with the block-based encoding-decoding method to solve this task as a sequence tagging problem, outperforming the rule-based baseline by 5% F1. The decoding method described in Section 3.2.2 achieves a 0.94 F1 score given ground-truth BIO sequences, so the improvement comes from the CRF sequence tagger itself. This result shows the strong benefit of scientific discourse tags as an upstream signal for evidence fragment detection.

Discussion
We use the claim extraction and evidence fragment detection tasks as two examples to demonstrate the benefit of leveraging pre-trained scientific discourse taggers and scientific discourse tags to improve downstream-task performance, via transfer learning or in a pipeline fashion. As Burns et al. (2017) proposed, given the output of an evidence fragment detection system, we can link subfigure codes with INTACT (Orchard et al., 2013) records to obtain evidence fragments for each experimental figure.
We further suggest that evidence fragment detection can help biocurators delineate evidence fragments as independent documents so they can be cataloged, indexed, and reused. Traditionally, scientists' arguments rest on relationships between claims and evidence within the same paper and possibly a limited number of cited papers. With evidence fragments, we can move beyond the convention of linking claims only to evidence from a single paper or through citations, which typically connect separate claims from different papers. As future work, we plan to surface evidence fragments combined with figures and captions across multiple papers. Clark et al. (2014) proposed the "Micropublications" semantic model, an abstract framework that integrates scientific argument and evidence from scientific documents; our scientific discourse tagger, claim extractor, and evidence fragment detector may serve as concrete implementations of modules in such a framework. Ultimately, we hope to dramatically increase the amount of primary evidence used to generate individual claims and thereby improve the quality of those claims.

Conclusions
We develop a state-of-the-art model for scientific discourse tagging and demonstrate its strong performance on the PubMed-RCT (Dernoncourt and Lee, 2017) and SciDT (Burns et al., 2016) datasets, as well as its strong transferability to new datasets such as CODA-19 (Huang et al., 2020). We then demonstrate the benefit of leveraging scientific discourse tags on downstream tasks, with claim extraction and evidence fragment detection as two case studies. We further propose, as a future direction, using scientific discourse tagging to help delineate evidence fragments as independent documents so they can be cataloged, indexed, and reused. As a result, we can dramatically increase the amount of primary evidence used to generate individual claims and thereby improve the quality of those claims.