MSˆ2: Multi-Document Summarization of Medical Studies

To assess the effectiveness of any medical intervention, researchers must conduct a time-intensive and manual literature review. NLP systems can help to automate or assist in parts of this expensive process. In support of this goal, we release MSˆ2 (Multi-Document Summarization of Medical Studies), a dataset of over 470k documents and 20K summaries derived from the scientific literature. This dataset facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and is the first large-scale, publicly available multi-document summarization dataset in the biomedical domain. We experiment with a summarization system based on BART, with promising early results, though significant work remains to achieve higher summarization quality. We formulate our summarization inputs and targets in both free text and structured forms and modify a recently proposed metric to assess the quality of our system’s generated summaries. Data and models are available at https://github.com/allenai/ms2.


Introduction
Multi-document summarization (MDS) is a challenging task, with relatively limited resources and modeling techniques. Existing datasets are either in the general domain, such as WikiSum and Multi-News (Fabbri et al., 2019), or very small, such as DUC or TAC 2011 (Owczarzak and Dang, 2011). In this work, we add to this burgeoning area by developing a dataset for summarizing biomedical findings. We derive documents and summaries from systematic literature reviews, a type of biomedical paper that synthesizes results across many other studies. Our aims in introducing MSˆ2 are to: (1) expand MDS to the biomedical domain, and (2) investigate fundamentally challenging issues in NLP over scientific text, such as summarization over contradictory information and assessment of evidence across studies.

Figure 1: Our primary formulation (texts-to-text) is a seq2seq MDS task. Given study abstracts and a BACKGROUND statement, generate the TARGET summary.

Systematic reviews synthesize knowledge across many studies (Khan et al., 2003), and are so called for the systematic (and expensive) process of creating a review; each takes 1-2 years to complete (Michelson and Reuter, 2019). As we note in Fig. 2, a delay of around 8 years is observed between reviews and the studies they cite! The time and cost of creating and updating reviews has inspired efforts at automation (Tsafnat et al., 2014; Beller et al., 2018; Marshall and Wallace, 2019), and the constant deluge of studies has only increased this need.
To move the needle on these challenges and support further work on literature review automation, we present MSˆ2, a multi-document summarization dataset in the biomedical domain. Our contributions in this paper are as follows:
• We introduce MSˆ2, a dataset of 20K reviews and 470k studies summarized by these reviews.
• We define a texts-to-text MDS task (Fig. 1) based on MSˆ2, by identifying target summaries in each review and using study abstracts as input documents. We develop a BART-based model for this task, which produces fluent summaries that agree with the evidence direction stated in gold summaries around 50% of the time.
• In order to expose more granular representations to users, we define a structured form of our data to support a table-to-table task (§4.2). We leverage existing biomedical information extraction systems (§3.3.1, §3.3.2) to evaluate agreement between target and generated summaries.

Background
Systematic reviews aim to synthesize results over all relevant studies on a topic, providing high quality evidence for biomedical and public health decisions. They are a fixture in the biomedical literature, with many established protocols around their registration, production, and publication (Chalmers et al., 2002; Starr et al., 2009; Booth et al., 2012). Each systematic review addresses one or several research questions, and results are extracted from relevant studies and summarized. For example, a review investigating the effectiveness of Vitamin B12 supplementation in older adults (Andrès et al., 2010) synthesizes results from 9 studies. The research questions in systematic reviews can be described using the PICO framework (Zakowski et al., 2004). PICO (which stands for Population: who is studied? Intervention: what intervention was studied? Comparator: what was the intervention compared against? Outcome: what was measured?) defines the main facets of biomedical research questions, and allows the person(s) conducting a review to identify relevant studies (studies included in a review generally have the same or similar PICO elements as the review). A medical systematic review is one which reports results for applying any kind of medical or social intervention to a group of people. Interventions are wide-ranging, including yoga, vaccination, team training, education, vitamins, mobile reminders, and more. Recent work on evidence inference goes beyond identifying PICO elements, and aims to group and identify overall findings in reviews. MSˆ2 is a natural extension of these lines of work: we create a dataset and build a system with natural summarization targets derived from input studies, while also incorporating the inherent structure studied in previous work.
In this work, we use the term review when describing literature review papers, which provide our summary targets. We use the term study to describe the documents that are cited and summarized by each review. There are various study designs which offer differing levels of evidence, e.g. clinical trials, cohort studies, observational studies, case studies, and more (Concato et al., 2000). Of these study types, randomized controlled trials (RCTs) offer the highest quality of evidence (Meldrum, 2000).

Dataset
We construct MSˆ2 from papers in the Semantic Scholar literature corpus. First, we create a corpus of reviews and studies based on the suitability criteria defined in §3.1. For each review, we classify individual sentences in the abstract to identify summarization targets ( §3.2). We augment all reviews and studies with PICO span labels and evidence inference classes as described in §3.3.1 and §3.3.2. As a final step in data preparation, we cluster reviews by topic and form train, development, and test sets from these clusters ( §3.4).

Identifying suitable reviews and studies
To identify suitable reviews, we apply (i) a high-recall heuristic keyword filter, (ii) a PubMed filter, (iii) a study-type filter, and (iv) a suitability classifier, in series. The keyword filter looks for the phrase "systematic review" in the titles and abstracts of all papers in Semantic Scholar, which yields 220K matches. The PubMed filter, yielding 170K matches, limits search results to papers that have been indexed in the PubMed database, which restricts reviews to those in the biomedical, clinical, psychological, and associated domains. We then use citations and Medical Subject Headings (MeSH) to identify input studies via their document types and further refine the remaining reviews; see App. A for details on the full filtering process.

Table 1: Example sentences with predicted labels.
BACKGROUND — "AREAS COVERED IN THIS REVIEW The objective of this review is to evaluate the efficacy of oral cobalamin treatment in elderly patients."
OTHER — "To reach this objective, PubMed data were systematically searched for English and French articles published from January 1990 to July 2008."
TARGET — "The efficacy was particularly highlighted when looking at the marked improvement in serum vitamin B12 levels and hematological parameters, for example hemoglobin level, mean erythrocyte cell volume and reticulocyte count."
OTHER — "The effect of oral cobalamin treatment in patients presenting with severe neurological manifestations has not yet been adequately documented."
As the final filtering step, we train a suitability classifier using SciBERT (Beltagy et al., 2019), a BERT-based (Devlin et al., 2019) language model trained on scientific text. Details on classifier training and performance are provided in Appendix C. Applying this classifier to the remaining reviews leaves us with 20K candidate reviews.

Background and target identification
For each review, we identify two sections: 1) the BACKGROUND statement, which describes the research question, and 2) the overall effect or findings statement, used as the TARGET of the MDS task (Fig. 1). We frame this as a sequential sentence classification task (Cohan et al., 2019): given the sentences in the review abstract, classify them as BACKGROUND, TARGET, or OTHER. All BACKGROUND sentences are aggregated and used as input in modeling. All TARGET sentences are aggregated and form the summary target for that review. Sentences classified as OTHER may describe the methods used to conduct the review, detailed findings such as the number of included studies or numerical results, or recommendations for practice. OTHER sentences are not suitable for modeling: methods sentences contain information specific to the review, results sentences contain too much detail, and practice recommendations are both outside the scope of our task definition and ill-advised to generate.
Five annotators with undergraduate or graduate level biomedical background labeled 3000 sentences from 220 review abstracts. During annotation, we asked annotators to assign sentences to 9 classes (which we collapse into the 3 above; see App. D for details on the other classes). Two annotators then reviewed all annotations and corrected mistakes. The corrections yield a Cohen's κ (Cohen, 1960) of 0.912. Though we retain only BACKGROUND and TARGET sentences for modeling, we provide labels for all 9 classes in our dataset.
Using SciBERT (Beltagy et al., 2019), we train a sequential sentence classifier. We prepend each sentence with a [SEP] token and use a linear layer followed by a softmax to classify each sentence. A detailed breakdown of the classifier scores is available in Tab. 9, App. D. While the classifier performs well (94.1 F1) at identifying BACKGROUND sentences, it only achieves 77.4 F1 for TARGET sentences. The most common error for TARGET sentences is confusing them for results from individual studies or detailed statistical analysis. Tab. 1 shows example sentences with predicted labels. Due to the size of the dataset, we cannot manually annotate sentence labels for all reviews, so we use the sentence classifier output as silver labels in the training set. To ensure the highest degree of accuracy for the summary targets in our test set, we manually review all 4519 TARGET sentences in the 2K reviews of the test set, correcting 1109 sentences. Any reviews without TARGET sentences are considered unsuitable and are removed from the final dataset.
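As a rough sketch of this setup (the exact token layout is our assumption, following the cited sequential-sentence-classification approach; the released code may differ), the classifier input can be linearized as:

```python
def linearize_abstract(sentences):
    """Prepend each abstract sentence with [SEP]; the model then predicts one
    label (BACKGROUND / TARGET / OTHER) per [SEP] position."""
    return " ".join("[SEP] " + s for s in sentences)
```

A linear layer over each [SEP] representation, followed by a softmax, then yields the per-sentence label.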

Structured form
As discussed in §2, the key findings of studies and reviews can be succinctly captured in a structured representation. The structure consists of PICO elements that define what is being studied, in addition to the effectiveness of the intervention as inferred through Evidence Inference (§3.3.2). In addition to the textual form of our task, we construct this structured form and release it with MSˆ2 to facilitate investigation of consistency between input studies and reviews, and to provide additional information for interpreting the findings reported in each document.

Adding PICO tags
The Populations, Interventions, and Outcomes of interest are a common way of representing clinical knowledge (Huang et al., 2006). Recent work has found that the Comparator is rarely mentioned explicitly, so we exclude it from our dataset. Previous summarization work has shown that tagging salient entities, especially PIO elements (Wallace et al., 2020), can improve summarization performance (Nallapati et al., 2016a,b), so we mark PIO elements with special tokens added to our model vocabulary: <pop>, </pop>, <int>, </int>, <out>, and </out>.
Using the EBM-NLP corpus, a crowd-sourced collection of PIO tags, we train a token classification model (Wolf et al., 2020) to identify these spans in our study and review documents. These span sets are denoted P = {P_1, P_2, ..., P_P̄}, I = {I_1, I_2, ..., I_Ī}, and O = {O_1, O_2, ..., O_Ō}. At the level of each review, we perform a simple aggregation over these elements: any P, I, or O span fully contained within another span of the same type is removed from these sets (though it remains tagged in the text). Removing these contained elements reduces the number of duplicates in our structured representation. Our dataset has an average of 3.0 P, 3.5 I, and 5.4 O spans per review.
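The containment-based aggregation can be sketched in a few lines (representing spans as character offsets is our assumption; the paper does not specify the span encoding):

```python
def dedupe_spans(spans):
    """Drop any span fully contained in another kept span of the same PICO
    type. Spans are (start, end) offsets; longer spans are considered first,
    so contained (and exactly duplicated) spans are removed."""
    kept = []
    for s, e in sorted(set(spans), key=lambda x: x[0] - x[1]):  # longest first
        if not any(ks <= s and e <= ke for ks, ke in kept):
            kept.append((s, e))
    return sorted(kept)
```

Applied per span type (P, I, or O) within a review, this leaves only maximal spans in the structured representation.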

Adding Evidence Inference
We predict the direction of evidence associated with every Intervention-Outcome (I/O) pair found in the review abstract. Taking the product of each I_i and O_j in the sets I and O yields all possible I/O pairs, and each I/O pair is associated with an evidence direction d_ij, which takes one of the values {increases, no_change, decreases}. For each I/O pair, we also derive a sentence s_ij from the document supporting the d_ij classification. Each review can therefore be represented as a set of tuples T of the form (I_i, O_j, s_ij, d_ij), with cardinality Ī × Ō. See Tab. 2 for examples. For modeling, as in PICO tagging, we surround supporting sentences with special tokens <evidence> and </evidence>, and append the direction class with a <sep> token.
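A minimal sketch of the tuple construction (the intervention and outcome strings are illustrative, not drawn from the dataset):

```python
from itertools import product

interventions = ["oral cobalamin", "intramuscular cobalamin"]  # I spans
outcomes = ["serum B12 level", "hemoglobin"]                   # O spans

# Every (I_i, O_j) pair receives an evidence direction d_ij and a supporting
# sentence s_ij, so a review yields |I| x |O| tuples -- 4 in this example.
io_pairs = list(product(interventions, outcomes))
```

As noted in §4.2, this full cross-product can introduce spurious pairs that never co-occur in the review text.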
We adapt the Evidence Inference (EI) dataset and models (DeYoung et al., 2020) for labeling. The EI dataset is a collection of RCTs with tagged PICO elements, evidence sentences, and overall evidence direction labels (increases, no_change, or decreases). The EI models are composed of 1) an evidence identification module, which identifies an evidence sentence, and 2) an evidence classification module, which classifies the direction of effectiveness. The former is a binary classifier on top of SciBERT, whereas the latter produces a softmax distribution over effectiveness directions. Using the same parameters as DeYoung et al. (2020), we modify these two modules to function solely over I and O spans. The resulting 354k EI classifications for our reviews are 13.4% decreases, 57.0% no_change, and 29.6% increases. Of the 907k classifications over input studies, 15.7% are decreases, 60.7% no_change, and 23.6% increases. Only 53.8% of study classifications match review classifications, highlighting the prevalence and challenge of contradictory data.

Clustering and train / test split
Reviews addressing overlapping research questions or providing updates to previous reviews may share input studies and results; e.g., a review studying the effect of Vitamin B12 supplementation on B12 levels in older adults and a review studying the effect of B12 supplementation on heart disease risk will cite similar studies. To avoid leakage of test data into training, we cluster reviews before splitting into train, validation, and test sets. We compute SPECTER paper embeddings (Cohan et al., 2020) using the title and abstract of each review, and perform agglomerative hierarchical clustering using the scikit-learn library (Buitinck et al., 2013). This results in 200 clusters, which we randomly partition into 80/10/10 train/development/test sets.
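The cluster-level split can be sketched as follows (the seed and exact partitioning logic are our assumptions; only the cluster-before-split idea comes from the paper):

```python
import random

def split_by_cluster(cluster_ids, seed=0):
    """Assign whole clusters (not individual reviews) to 80/10/10
    train/dev/test splits, so related reviews never cross a split boundary."""
    clusters = sorted(set(cluster_ids))
    random.Random(seed).shuffle(clusters)
    n = len(clusters)
    split_of = {}
    for i, c in enumerate(clusters):
        split_of[c] = "train" if i < 0.8 * n else "dev" if i < 0.9 * n else "test"
    return split_of
```

Each review then inherits the split of its cluster, so two reviews citing similar studies land in the same split.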

Dataset statistics
The final dataset consists of 20K reviews and 470k studies. Each review in the dataset summarizes an average of 23 studies, ranging from 1 to 401. See Tab. 3 for statistics, and Tab. 4 for a comparison to other datasets. The median review has 6.7K input tokens from its input studies, while the mean is 9.4K tokens (a few reviews cite a very large number of studies). We restrict the input size when modeling to 25 studies, which reduces the average input to 6.6K tokens without altering the median. Fig. 2 shows the temporal distribution of reviews and input studies in MSˆ2. Though reviews in our dataset have a median publication year of 2016, the studies cited by these reviews are largely from before 2010, with a median of 2007 and a peak in 2009. This citation delay has been observed in prior work (Shojania et al., 2007; Beller et al., 2013), and further illustrates the need for automated or assisted reviews.

Experiments
We experiment with a texts-to-text task formulation (Fig. 1). The model input consists of the BACKGROUND statement and study abstracts; the output is the TARGET statement. We also investigate the use of the structured form described in §3.3.2 for a supplementary table-to-table task, where, given I/O pairs from the review as input, the model predicts the evidence direction. We provide initial results for the table-to-table task, though we consider this an area in need of active research.

Texts-to-text task
Our approach leverages BART (Lewis et al., 2020b), a seq2seq autoencoder. Using BART, we encode the BACKGROUND and input studies as in Fig. 3, and pass these representations to a decoder. Training follows a standard auto-regressive paradigm used for building summarization models. In addition to PICO tags ( §3.3.1), we augment the inputs by surrounding the background and each input study with special tokens <background>, </background>, and <study>, </study>.
For representing multiple inputs, we experiment with two configurations: one leveraging BART with independent encodings of each input, and one using LongformerEncoderDecoder (LED) (Beltagy et al., 2020), which can encode long inputs of up to 16K tokens. For the BART configuration, each study abstract is appended to the BACKGROUND statement and encoded independently. These representations are concatenated together to form the input to the decoder layer; in this configuration, interactions between studies happen only in the decoder. For the LED configuration, the input sequence starts with the BACKGROUND statement followed by a concatenation of all input study abstracts. The BACKGROUND representation is shared among all input studies; global attention allows interactions between studies, and a sliding attention window of 512 tokens allows each token to attend to its neighbors.
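A sketch of the LED-style input linearization (the separator whitespace and truncation order are our assumptions; the special tokens and the 25-study cap come from the paper):

```python
def build_led_input(background, study_abstracts, max_studies=25):
    """Linearize one review's inputs: the BACKGROUND statement followed by
    up to max_studies study abstracts, each wrapped in the special tokens
    added to the model vocabulary."""
    parts = ["<background>" + background + "</background>"]
    for abstract in study_abstracts[:max_studies]:
        parts.append("<study>" + abstract + "</study>")
    return " ".join(parts)
```

In the BART configuration, each `<study>` segment would instead be paired with the background and encoded independently.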
We train a BART-base model, with hyperparameters described in App. F, and report experimental results in Tab. 5. In addition to ROUGE (Lin, 2004), we report two metrics derived from evidence inference: ∆EI and F1. We describe the intuition and computation of the ∆EI metric in §4.3; because it is a distance metric, lower ∆EI is better. For F1, we use the EI classification module to identify evidence directions for both the generated and target summaries. Using these classifications, we report a macro-averaged F1 over the class agreement between the generated and target summaries (Buitinck et al., 2013). For example generations, see Tab. 13 in App. G.
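The macro-averaged F1 over direction classes can be sketched in plain Python (the paper uses scikit-learn's implementation; this re-implementation is ours):

```python
def macro_f1(gold, pred, classes=("increases", "no_change", "decreases")):
    """Macro-averaged F1 over evidence-direction agreement: compute a
    per-class F1, then average the classes with equal weight."""
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro averaging matters here because no_change dominates the label distribution (§3.3.2), and a micro average would mask errors on the rarer directions.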

Table-to-table task
An end user of a review summarization system may be interested in specific results from input studies (including whether they agree or contradict) rather than the high-level conclusions available in TARGET statements. Therefore, we further experiment with structured input and output representations that attempt to capture results from individual studies. As described in §3.3.2, the structured representation of each review or study is a set of tuples of the form (I_i, O_j, s_ij, d_ij). It is important to note that we use the same set of Is and Os from the review to predict evidence direction for all input studies. Borrowing from Raffel et al. (2020), we formulate our classification task as a text generation task, and train the models described in §4.1 to generate one of the classes in {increases, no_change, decreases}. Using the EI classifications from §3.3.2, we compute an F1 score macro-averaged over the effect classes (Tab. 6). We retain all hyperparameter settings other than reducing the maximum generation length to 10.
We stress that this is a preliminary effort to demonstrate feasibility rather than completeness: our results in Tab. 6 are promising, but the underlying technologies for building the structured data (PICO tagging, co-reference resolution, and PICO relation extraction) are currently weak. Using the full cross-product of Interventions and Outcomes also produces duplicated I/O pairs as well as potentially spurious pairs that do not correspond to actual I/O pairs in the review.

∆EI metric
Recent work in summarization evaluation has highlighted the weaknesses of ROUGE for capturing factuality of generated summaries, and has focused on developing automated metrics more closely correlated with human-assessed factuality and quality (Zhang* et al., 2020;Wang et al., 2020a;Falke et al., 2019). In this vein, we modify a recently proposed metric based on EI classification distributions (Wallace et al., 2020), intending to capture the agreement of Is, Os, and EI directions between input studies and the generated summary.
For each I/O tuple (I_i, O_j), the predicted direction d_ij is actually a distribution of probabilities over the three direction classes, P_ij = (p_increases, p_decreases, p_no_change). If we consider this distribution for the gold summary (P_ij) and the generated summary (Q_ij), we can compute the Jensen-Shannon Distance (JSD) (Lin, 1991), a score bounded in [0, 1], between these distributions. For each review, we can then compute a summary JSD metric, which we call ∆EI, as an average over the JSD of each I/O tuple in that review:

∆EI = (1 / (Ī × Ō)) Σ_{i,j} JSD(P_ij, Q_ij)

Different from Wallace et al. (2020), ∆EI is an average over all outputs, attempting to capture an overall picture of system performance, and our metric retains the directionality of increases and decreases, as opposed to collapsing them together.
To facilitate interpretation of the ∆EI metric, we offer a degenerate example. Suppose all direction classifications are certain, so the probability distributions P_ij and Q_ij are each one of (1, 0, 0), (0, 1, 0), or (0, 0, 1). Then ∆EI takes on the following values at various levels of consistency between P_ij and Q_ij over the input studies:
• 100% consistent: ∆EI = 0.0
• 50% consistent: ∆EI = 0.42
• 0% consistent: ∆EI = 0.83
In other words, in both the standard BART and LED settings, the evidence directions predicted from the generated summary are slightly less than 50% consistent with the direction predictions produced from the gold summary.
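These values can be reproduced with a small sketch of ∆EI (we assume a natural-log Jensen-Shannon distance, which matches the reported 0.42 and 0.83; the released implementation may differ):

```python
import math

def jsd(p, q):
    """Jensen-Shannon distance (natural-log base) between two distributions:
    the square root of the Jensen-Shannon divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

def delta_ei(gold_dists, gen_dists):
    """Average JSD over the I/O tuples of one review."""
    return sum(jsd(p, q) for p, q in zip(gold_dists, gen_dists)) / len(gold_dists)

# 50% consistent: one tuple agrees fully, one disagrees fully.
gold = [(1, 0, 0), (1, 0, 0)]
gen = [(1, 0, 0), (0, 1, 0)]
```

With these inputs, delta_ei(gold, gen) comes out near 0.42, and a fully inconsistent pair gives about 0.83, matching the degenerate example.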

Human evaluation & error analysis
We randomly sample 150 reviews from the test set for manual evaluation. For each generated and gold summary, we annotate the primary effectiveness direction in the summary with one of the following classes: (i) increases: the intervention has a positive effect on the outcome; (ii) no_change: no effect, or no difference between the intervention and the comparator; (iii) decreases: the intervention has a negative effect on the outcome; (iv) insufficient: insufficient evidence is available; (v) skip: the summary is disfluent, off topic, or does not contain information on efficacy.
Here, increases, no_change, and decreases correspond to the EI classes, while we introduce insufficient to describe cases where insufficient evidence is available on efficacy, and skip to describe data or generation failures. Two annotators provide labels, and agreement is computed over 50 reviews (agreement: 86%, Cohen's κ: 0.76). Of these, 17 gold summaries lack an efficacy statement, and are excluded from analysis. Tab. 7 shows the confusion matrix for the sample. Around 50% (67/133) of generated summaries have the same evidence direction as the gold summary. Most confusions happen between increases, no_change, and insufficient.
Tab. 8 shows how individual studies can provide contradictory information, some supporting a positive effect for an intervention and some observing no or negative effects. EI may be able to capture some of the differences between these input studies: from observations on limited data, studies with a positive effect tend to have more EI predictions of increases or decreases, while those with no or negative effect tend to have predictions that are mostly no_change. However, more work is needed to better understand how to capture these directional relations and how to aggregate them into a coherent summary.

Table 8: Petrelli and Barni (2013), a review investigating the effectiveness of cisplatin-based (CAP) chemotherapy for non-small cell lung cancer (NSCLC). Input studies vary in their results, with some stating a positive effect for adjuvant chemotherapy, and some stating no survival benefit.

Related Work
NLP for scientific text has been gaining interest recently, with work spanning the whole NLP pipeline, including datasets such as S2ORC, CORD-19 (Wang et al., 2020b), and Wikipedia Current Events (Gholipour Ghalandari et al., 2020). Most similar to MSˆ2 is Multi-News, where multiple news articles about the same event are summarized into one short paragraph. Aside from being in a different textual domain (scientific vs. newswire), one unique characteristic of MSˆ2 compared to existing datasets is that its input documents contain contradicting evidence. Modeling in other domains has typically focused on straightforward applications of single-document summarization to the multi-document setting (Lebanoff et al., 2018; Zhang et al., 2018), although some methods explicitly model multi-document structure using semantic graph approaches (Baumel et al., 2018; Liu and Lapata, 2019). In the systematic review domain, work has typically focused on information retrieval (Boudin et al., 2010; Ho et al., 2016; Schoot et al., 2020), extracting findings, and quality assessment. Only recently, in Wallace et al. (2020) and this work, has the entire process been approached as a whole. We refer the reader to App. I for more context regarding the systematic review process.

Discussion
Though MDS has been explored in the general domain, biomedical text poses unique challenges such as the need for domain-specific vocabulary and background knowledge. To support development of biomedical MDS systems, we release the MSˆ2 dataset. MSˆ2 contains summaries and documents derived from biomedical literature, and can be used to study literature review automation, a pressing real-world application of MDS.
We define a seq2seq modeling task over this dataset, as well as a structured task that incorporates prior work on modeling biomedical text. We show that although generated summaries tend to be fluent and on-topic, they agree with the evidence direction in gold summaries only around half the time, leaving plenty of room for improvement. This observation holds both under our ∆EI metric and under human evaluation of a small sample of generated summaries. Given that only 54% of study evidence directions agree with the evidence directions of their review, modeling contradiction in source documents may be key to improving upon existing summarization methods.
Limitations Challenges in co-reference resolution and PICO extraction limit our ability to generate accurate PICO labels at the document level. Errors compound at each stage: PICO tagging, taking the product of Is and Os at the document level, and predicting EI direction. Pipeline improvements are needed to bolster overall system performance and increase our ability to automatically assess performance via automated metrics like ∆EI. Relatedly, automated metrics for summarization evaluation can be difficult to interpret, as the intuition for each metric must be built up through experience. Though we attempt to facilitate understanding of ∆EI by offering a degenerate example, more exploration is needed to understand how a practically useful system would perform on such a metric.
Future work Though we demonstrate that seq2seq approaches are capable of producing fluent and on-topic review summaries, there are significant opportunities for improvement. On the data side, these include improving the quality of summary targets and intermediate structured representations (PICO tags and EI directions). Another opportunity lies in linking to structured data in external sources such as clinical trial registries, rather than relying solely on PICO tagging. For modeling, we are interested in pursuing joint retrieval and summarization approaches (Lewis et al., 2020a). We also hope to explicitly model the types of contradictions observed in Tab. 8, such that generated summaries can capture nuanced claims made by individual studies.

Conclusion
Given increasing rates of publication, multi-document summarization, or the creation of literature reviews, has emerged as an important NLP task in science. The urgency for automation technologies has been magnified by the COVID-19 pandemic, which has led to both an accelerated speed of publication (Horbach, 2020) and a proliferation of non-peer-reviewed preprints which may be of lower quality (Lachapelle, 2020). By releasing MSˆ2, we provide an MDS dataset that can help to address these challenges. Though we demonstrate that our MDS models can produce fluent text, our results show that significant challenges remain unsolved, such as PICO tuple extraction, co-reference resolution, and evaluation of summary quality and faithfulness in the multi-document setting. We encourage others to use this dataset to better understand the challenges specific to MDS in the domain of biomedical text, and to push the boundaries on the real-world task of systematic review automation.

Ethical Concerns and Broader Impact
We believe that automation in systematic reviews has great potential value to the medical and scientific community; our aim in releasing our dataset and models is to facilitate research in this area. Given unresolved issues in evaluating the factuality of summarization systems, as well as a lack of strong guarantees about what the summary outputs contain, we do not believe that such a system is ready to be deployed in practice. Deploying such a system now would be premature: without these guarantees, it would be likely to generate plausible-looking but factually incorrect summaries, an unacceptable outcome in such a high-impact domain. We hope to foster development of useful systems with correctness guarantees and evaluations to support them.

Table 9: Precision, Recall, and F1-scores for all annotation classes, averaged over five folds of cross validation.

A MeSH Filtering
For each candidate review, we extract its cited papers and identify the study type of each cited paper using MeSH publication type, keeping only studies that are clinical trials, cohort studies, and/or observational studies (see Appendix A.1 for the full list of MeSH terms). We exclude case reports, which usually report findings on one or a small number of individuals. We observe that publication type MeSH terms tend to be under-tagged. Therefore, we also use ArrowSmith trial labels (Cohen et al., 2015; Shao et al., 2015) and a keyword heuristic (the span "randomized" occurring in the title or abstract) to identify additional RCT-like studies. Candidate reviews are culled to retain only those that cite at least one suitable study and no case reports.

B Suitability Annotation
The annotation guidelines for review suitability are given below. Each annotator was tasked with an initial round of annotation, followed by a round of review, then further annotation.

B.1 Suitability Guidelines
A systematic review is a document resulting from an in-depth search and analysis of all the literature relevant to a particular topic. We are interested in systematic reviews of medical literature, specifically those that assess varying treatments and the outcomes associated with them. There are many different types of reviews, and many types of documents that look like reviews. We need to identify only the "correct" types of reviews. Sometimes this can be done from the title alone, sometimes one has to read the review itself.
The reviews we are interested in:
• Must study a human population (no animal, veterinary, or environmental studies).
• Must review studies involving multiple participants. We are interested in reviews of trials or cohort studies. We are *not* interested in reviews of case studies, which describe one or a few specific people.
• Must study an explicit population or problem (P from PICO). Example populations: women > 55 years old with breast cancer, migrant workers, elementary school children in Spokane, WA, etc.
• Must compare one or more medical interventions. Example interventions: drugs, vaccines, yoga, therapy, surgery, education, annoying mobile device reminders, professional naggers, personal trainers, and more! Note: placebo / no intervention is a type of intervention. Comparing the effectiveness of an intervention against no intervention is okay. Combinations of interventions count as comparisons (e.g. yoga vs. yoga + therapy). Two different dosages also count (e.g. 500 ppm fluoride vs. 1000 ppm fluoride in toothpaste).
• Must have an explicit outcome measure. Example outcome measures: survival time, frequency of headaches, relief of depression, survey results, and many other possibilities.
• The outcome measure must measure the effectiveness of the intervention.

(Displaced rows from the sentence-label example table:
OTHER — "WHAT THE READER WILL GAIN Three prospective randomized studies, a systematic review by the Cochrane group and five prospective cohort studies were found and provide evidence that oral cobalamin treatment may adequately treat cobalamin deficiency."
TARGET — "Oral cobalamin treatment avoids the discomfort, inconvenience and cost of monthly injections."
TARGET — "TAKE HOME MESSAGE Our experience and the present analysis support the use of oral cobalamin therapy in clinical practice.")

C Suitability Classifier
Four annotators with biomedical backgrounds labeled 879 reviews sampled from the candidate pool (572 suitable, 307 not; Cohen's kappa: 0.55) according to the suitability criteria (guidelines in Appendix B). We aim to include reviews that aggregate existing results, such as reporting how a medical or social intervention affects a group of people, while excluding reviews that make new observations, such as identifying novel disease co-morbidities, or that synthesize case studies. For our suitability classifier, we finetune SciBERT (Beltagy et al., 2019) using standard parameters; using five-fold cross validation, we find that a threshold of 0.75 provides a precision greater than 80% while maintaining adequate recall (Figure 4).
Though this criterion admits a fairly large number of false positives, we note that these false positive documents are generally reviews; however, they may not investigate an intervention, or may lack suitable target statements. In the latter case, the target identification described in § 3.2 helps us further refine and remove these false positives from the final dataset.
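The threshold selection described above (choosing an operating point with precision above 80% while keeping as much recall as possible) can be sketched with scikit-learn's `precision_recall_curve`. The scores below are illustrative stand-ins for the SciBERT classifier's outputs, not real data:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, scores, min_precision=0.80):
    """Return the lowest threshold whose precision meets the floor,
    i.e. the operating point keeping the most recall at >= min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have len(thresholds) + 1 entries; drop the final
    # (precision=1, recall=0) point, which has no associated threshold.
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision:
            return t, p, r
    return None


# Illustrative labels and scores, not the real classifier outputs.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95])
thresh, prec, rec = pick_threshold(y_true, scores)
```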

D Sentence Annotation
Sentence annotation guidelines and detailed scores are below. Each annotator was tasked with annotating 50-100 sentences, followed by a round of review, before being asked to annotate more.

D.1 Sentence Annotation Guidelines
A systematic review is a document resulting from an in-depth search and analysis of all the literature relevant to a particular topic. We are interested in systematic reviews of medical literature, specifically those that assess varying treatments and the outcomes associated with them. Ignore any existing labels; these are automatically produced and error prone. If something clearly fits into more than one category, separate the labels by commas (annoying, we know, but it can be important). For sentences that are incorrectly broken in a way that makes them difficult to label, skip them (you can fix them, but they'll be programmatically ignored). For reviews that don't meet the suitability guidelines, also skip them. We want to identify sentences within these reviews as belonging to one of several categories:
• BACKGROUND: Any background information, not including goals.
• GOAL: A high-level goal sentence, describing the aims or purposes of the review.
• METHODS: Sentences describing what was studied or compared, or what outcomes are measured in those studies. This may also include whether or not a meta-analysis is performed.
• DETAILED_FINDINGS: Any sections reporting study results, often includes numbers, p-values, etc. These will frequently include statements about a subset of the trials or the populations.
• GENERAL FINDINGS: There are four types of general findings we would like you to label. These do not include things like number of patients, or a p-value (that's DETAILED_FINDINGS). Not all four subtypes will always be present in a paper's abstract. Some sentences will contain information about more than one subtype, and some sentences can contain information about some of these subtypes as well as DETAILED_FINDINGS.
 - EFFECT: Effect of the intervention; may include a statement about significance. These can cover a wide range of topics, including public health or policy changes.
 - EVIDENCE_QUALITY: Commentary about the strength or quality of evidence pertaining to the intervention.
 - FURTHER_STUDY: These statements might call for more research in a particular area, and can include hedging statements, e.g.:
  * "More rigorously designed longitudinal studies with standardized definitions of periodontal disease and vitamin D are necessary."
  * "More research with larger sample size and high quality in different nursing educational contexts are required."
  * "However, this finding largely relies on data from observational studies; high-quality RCTs are warranted because of the potential for subject selection bias."
 - RECOMMENDATION: Any kind of clinical or policy recommendation, or recommendations for use in practice. This must contain an explicit recommendation, not a passive statement saying that a treatment is good. "Should" or "recommend" are good indicators. These may not always be present in an abstract. E.g.:
  * "Public policy measures that can reduce inequity in health coverage, as well as improve economic and educational opportunities for the poor, will help in reducing the burden of malaria in SSA."
 - ETC: Anything that doesn't fit into the categories above.
All sentences appear in the context of their review. Some of the selected reviews might not actually be reviews; these were identified by accident and should be excluded from annotation: either make a comment on the side (preferred) or delete the rows belonging to the non-review. Please ask questions; these guidelines are likely not perfect and we'll have missed many edge cases. Examples follow.
BACKGROUND A sizeable number of individuals who participate in population-based colorectal cancer (CRC) screening programs and have a positive fecal occult blood test (FOBT) do not have an identifiable lesion found at colonoscopy to account for their positive FOBT screen.
GOAL To determine the effect of integrating informal caregivers into discharge planning on postdischarge cost and resource use in older adults.
METHODS MAIN OUTCOMES Clinical status (eg, spirometric measures); functional status (eg, days lost from school); and health services use (eg, hospital admissions).
METHODS Studies were included if they had measured serum vitamin D levels or vitamin D intake and any periodontal parameter.
DETAILED_FINDINGS Overall, 27 studies were included (13 cross-sectional studies, 6 case-control studies, 5 cohort studies, 2 randomized clinical trials and 1 case series study). Sixty-five percent of the cross-sectional studies reported significant associations between low vitamin D levels and poor periodontal parameters.
DETAILED_FINDINGS Analysis of group cognitive-behavioural therapy (CBT) v. usual care alone (14 studies) showed a significant effect in favour of group CBT immediately post-treatment (standardised mean difference (SMD) -0.55 (95% CI -0.78 to -0.32)).
EFFECT This review identified short-term benefits of technology-supported self-guided interven-tions on the physical activity level and fatigue and some benefit on dietary behaviour and HRQoL in people with cancer. However, current literature demonstrates a lack of evidence for long-term benefit.
EVIDENCE_QUALITY Interpretation of findings was influenced by inadequate reporting of intervention description and compliance.
No meta-analysis was performed due to high variability across studies. RECOMMENDATION

D.2 Detailed Sentence Breakdown Scores
Sentence classification scores for 9 classes are given in Table 9. The corresponding confusion matrix can be found in Table 10. Table 11 provides an example of sentence classification results over 3 classes.
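The evaluation protocol behind Table 9 (per-class precision, recall, and F1, averaged over five folds) can be sketched as follows. Logistic regression over synthetic features is a lightweight stand-in for the actual SciBERT sentence classifier, and the two toy classes stand in for the nine annotation classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(0)
# Toy two-class data; the real task classifies review sentences with SciBERT.
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(2, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
per_fold_f1 = []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    _, _, f1, _ = precision_recall_fscore_support(
        y[test_idx], clf.predict(X[test_idx]), labels=[0, 1], zero_division=0
    )
    per_fold_f1.append(f1)

# One averaged F1 per class, as reported in Table 9.
mean_f1 = np.mean(per_fold_f1, axis=0)
```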

E Dataset Contradiction Scores
The confusion matrix between review effect findings and input study effect findings is given in Table 12.

F Hyperparameters and Modeling Details
We implement our models using PyTorch (Paszke et al., 2019), the HuggingFace Transformers (Wolf et al., 2020) and PyTorch Lightning (Falcon, 2019) libraries, starting from the BART-base checkpoint (Lewis et al., 2020b). All models were trained using FP16 on NVidia RTX 8000 GPUs (GPUs with 40G or more of memory are required for most texts-to-text configurations). All models are trained for eight epochs, as validation scores diminished over time; early experiments ran out to approximately fifty epochs and showed little sensitivity to other hyperparameters. We use gradient accumulation to reach an effective batch size of 32. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-5, an epsilon of 1e-8, and a linear learning rate schedule with 1000 steps of warmup. We ran a hyperparameter sweep over decoding parameters on the validation set: 4, 6, and 8 beams; maximum lengths of 64, 128, and 256 wordpieces; and length penalties of 1, 2, and 4. We find little qualitative or quantitative variation between runs and select the setting with the highest ROUGE-1 scores: 6 beams, a length penalty of 2, and a maximum output length of 128 tokens. We use an attention dropout (Srivastava et al., 2014) of 0.1. Optimizer hyperparameters, as well as any hyperparameters not mentioned, use the defaults of their respective libraries. Training requires approximately one day on two GPUs. Due to memory constraints, we limit each review to 25 input documents, with a maximum of 1000 tokens per input document.

[Table 13 residue — example targets and generated summaries for reviews on acupuncture for low back pain, physical interventions for spinal cord injury, hepatic resection of neuroendocrine tumour hepatic metastases, and interventions for breastfeeding self-efficacy; see Appendix G.]
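The decoding sweep described above amounts to a grid search with selection by validation ROUGE-1. A minimal sketch, where `evaluate_rouge1` is a hypothetical callback standing in for running generation and ROUGE scoring on the validation set:

```python
from itertools import product

# Sweep grid from the text: beams, maximum output lengths (wordpieces),
# and length penalties.
BEAMS = [4, 6, 8]
MAX_LENGTHS = [64, 128, 256]
LENGTH_PENALTIES = [1, 2, 4]


def select_decoding_config(evaluate_rouge1):
    """Return the (beams, max_length, length_penalty) triple with the
    highest validation ROUGE-1, per the selection rule described above."""
    grid = product(BEAMS, MAX_LENGTHS, LENGTH_PENALTIES)
    return max(grid, key=lambda cfg: evaluate_rouge1(*cfg))


# Illustrative scorer that happens to peak at the paper's chosen setting;
# the real sweep runs the summarizer and scores ROUGE on the validation set.
def fake_scorer(beams, max_length, length_penalty):
    return -abs(beams - 6) - abs(max_length - 128) / 64 - abs(length_penalty - 2)


best = select_decoding_config(fake_scorer)
```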
We make use of NumPy (Harris et al., 2020) in our models and evaluation, as well as scikit-learn (Buitinck et al., 2013) and the general SciPy framework (Virtanen et al., 2020) for evaluation.

G Example generated summaries
See Table 13 for examples of inputs, targets, and generations.

H Validation Results
We provide results on the validation set in Tables 14 and 15.

I A Brief Review of Systematic Reviews
We provide a brief overview of the systematic review process for the reader. A systematic review is a thorough, evidence-based process to answer scientific questions. In the biomedical domain, a systematic review typically consists of five steps: defining the question, finding relevant studies, determining study quality, assessing the evidence (quantitative or qualitative analysis), and drawing final conclusions. For a detailed overview of the steps, see Khan et al. (2003). While there are other definitions and aspects of the review process (Aromataris and Munn, 2020; Higgins et al., 2019), the five-step process above is sufficient for describing reviews in the context of this work. We emphasize that this work, indeed the approaches used in this field, cannot replace the labor done in a systematic review, and may instead be useful for scoping or exploratory reviews. The National Toxicology Program, 13 part of the United States Department of Health and Human Services, conducts scoping reviews for epidemiological studies. The National Toxicology Program has actively solicited help from the natural language processing community via the Text Analysis Conference. 14 Other groups conducting biomedical systematic reviews include the Cochrane Collaboration, 15 the Joanna Briggs Institute, 16 Guidelines International Network, 17 SickKids, 18 the University of York, 19 and the public health agencies of various countries, 20 to name a few. Systematic review methodologies have also been applied in fields outside of medicine, by organizations such as the Campbell Collaboration, 21 which conducts reviews over a wide range of areas: business, justice, education, and more.

I.1 Automation in Systematic Reviews
Automation in systematic reviews has typically focused on assisting in portions of the process: search and extraction, quality assessment, and interpreting findings. For a detailed analysis of automated approaches in aiding the systematic review process, see Norman (2020); Marshall and Wallace (2019).
Search and Extraction. Search, screening, and extracting the results of studies into a structured representation are several components of the systematic review process that have been targets of automation. et al. (2021) extracts relations from nutritional literature, and uses content planning methods to generate summaries highlighting contradictions in the relevant literature.