Detect and Classify – Joint Span Detection and Classification for Health Outcomes

A health outcome is a measurement or an observation used to capture and assess the effect of a treatment. Automatic detection of health outcomes from text would undoubtedly speed up access to the evidence necessary in healthcare decision making. Prior work on outcome detection has modelled this task as either (a) a sequence labelling task, where the goal is to detect which text spans describe health outcomes, or (b) a classification task, where the goal is to classify a text into a predefined set of categories depending on an outcome that is mentioned somewhere in that text. However, this decoupling of span detection and classification is problematic from a modelling perspective and ignores global structural correspondences between sentence-level and word-level information present in a given text. To address this, we propose a method that uses both word-level and sentence-level information to simultaneously perform outcome span detection and outcome type classification. In addition to injecting contextual information into hidden vectors, we use label attention to appropriately weight both word- and sentence-level information. Experimental results on several benchmark datasets for health outcome detection show that our proposed method consistently outperforms decoupled methods, reporting competitive results.


Introduction
Access to the best available evidence in the context of a patient's individual conditions enables healthcare professionals to administer optimal patient care. Healthcare professionals identify outcomes as a fundamental part of the evidence they require to make decisions (van Aken et al., 2021). Williamson et al. (2017) define an outcome as a measurement or an observation used to capture and assess the effect of a treatment, such as assessment of side effects (risk) or effectiveness (benefits). With the rapid growth of literature that reports outcomes, researchers have acknowledged and addressed the need to automate the extraction of outcomes for systematic reviews (Jonnalagadda et al., 2015; Nye et al., 2018) and for answering clinical questions (Demner-Fushman and Lin, 2007). Jin and Szolovits (2018) mention that automated Health Outcomes Detection (HOD) could speed up the process of analysing and assessing the effectiveness of clinical interventions in Evidence Based Medicine (EBM; Sackett et al., 1996). HOD has been conducted in the past as either an Outcome Span Detection (OSD) task, where we must detect a continuous span of tokens indicating a health outcome (Nye et al., 2018; Brockmeier et al., 2019), or as an Outcome Classification (OC) task, where the goal is to classify the text spans into a pre-defined set of categories (Wallace et al., 2016; Jin and Szolovits, 2018; Kiritchenko et al., 2010). However, the two tasks are highly correlated, and local token-level information enables us to make accurate global sentence-level outcome predictions, and vice versa.

* Danushka Bollegala holds concurrent appointments as a Professor at the University of Liverpool and as an Amazon Scholar. This paper describes work performed at the University of Liverpool and is not associated with Amazon.

Table 1 (excerpt). Sentence: "There were no significant between-group differences in the incidence of wheezing or shortness of breath."
An outcome type predicted for a text span in a sentence must be consistent with the other outcome spans detected from the same sentence, while the outcome spans detected from a sentence must be compatible with their outcome types. These mutual compatibility constraints between outcome spans and their classes will be lost in a decoupled approach, resulting in poor performance for both OSD and OC tasks.
Two illustrative examples in Table 1 show the distinction between the OSD, OC and Joint OSD & OC tasks. Specifically, in the first sentence, OSD extracts all outcomes, i.e. wheezing and shortness of breath; OC classifies the text into an outcome type, Physiological; and Joint OSD & OC extracts an outcome span and classifies it concurrently, i.e. it extracts wheezing and also classifies it as a Physiological outcome. Motivated by the recent success in joint modelling of tasks such as aspect extraction (AE) and aspect sentiment classification (ASC), which together make up a customer sentiment analysis task called Aspect Based Sentiment Analysis (ABSA; Xu et al., 2019), we model HOD as a joint task involving both OSD and OC. HOD can be formally defined as follows: Health Outcome Detection (HOD): Given a sentence s = w_1, …, w_M extracted from a clinical trial abstract, the goal of HOD is to identify an outcome span o_d = w_i, …, w_N (i.e. OSD), and subsequently predict a plausible outcome type t(o_d) ∈ Y for o_d (i.e. OC), where 1 ≤ i ≤ N ≤ M, and Y is a predefined set of outcome types.
We propose the Label Context-aware Attention Model (LCAM), a sequence-to-sequence-to-set (SEQ2SEQ2SET) model, which uses a single encoder to represent an input sentence and two decoders: one for predicting the label of each word in OSD and another for predicting the outcome type in OC. LCAM is designed to jointly learn contextualised label attention-based distributions at word- and sentence-level in order to capture which label(s) a word or a sentence is most semantically related to. We call them contextualised because they are enriched by global contextual representations of the abstracts to which the sentences belong. Label attention incorporates label sparsity information and hence the semantic correlation between documents and labels.
A baseline BiLSTM or a clinically informed BERT-base (Devlin et al., 2019) model is used at the encoding stage of our model, and later for decoding with sigmoid prediction layers. We also use a multi-label prediction (MLP) layer for the two tasks (i.e. OSD and OC), with a relaxed constraint at token-level that ensures only the top (most relevant) prediction is retained, whereas all predicted (relevant) outcome types are retained at sentence-level during OC. We use an MLP layer because some annotated outcomes belong to multiple outcome types. For example, depression belongs to both "Physiological" and "Life-Impact" outcome types.
HOD remains a challenging task due to the lack of a consensus on how outcomes should be reported and classified (Kahan et al., 2017). Dodd et al. (2018) recently built a taxonomy to standardise outcome classifications in clinical records, which has been used to annotate the EBM-COMET (Abaho et al., 2020) dataset. Following these recent developments, we use EBM-COMET to align outcome annotations in the evaluation dataset we use in our experiments (Dodd et al., 2018). Our main contributions in this work are summarised as follows: 1. We propose the Label Context-aware Attention Model to simultaneously learn label-attention weighted representations at word- and sentence-level. These representations are then evaluated on a biomedical text mining task that extracts and classifies health outcomes (HOD).
2. We introduce a flexible, re-usable unsupervised text alignment approach that extracts parallel annotations from comparable datasets. We use this alignment for data augmentation in a low-resource setting.
3. We investigate the document-level contributions by a piece of text (e.g. an abstract) for predictions made at the token-level.

Related work
Joint training of pairs of tasks has previously been attempted, particularly for sequence labelling and sentence classification. Targeting Named Entity Recognition (NER) and Relation Extraction (RE), Chen et al. (2020) transfer BERT representations via a joint learning strategy to extract clinically relevant entities and their syntactic relationships. In their work, the joint learning models exhibit dramatic performance improvements over disjoint (standalone) models for the RE task. Our work differs from Chen et al. (2020) in that we use attention layers prior to the first and second classification layers. Ma et al. (2017) train a sparse attention-based LSTM to learn context features extracted from a convolutional neural network (CNN). The resulting hidden representations are used for label prediction at each time step for sequence labelling, and subsequently aggregated via average pooling to obtain a representation for sentence classification. The sparse constraint strategically biases weight assignment (i.e. important words are assigned larger weights than less important words). Karimi et al. (2020) perform ABSA (Xu et al., 2019) by feeding a BERT architecture with a sentence s = ([CLS], x_{1:j}, [SEP], x_{j+1:n}, [SEP]), where x_{1:j} is a sentence containing an aspect of a product, x_{j+1:n} is a customer review sentence directed at the aspect, and [CLS] is a token indicating not only the beginning of a sequence but also the sentiment polarity of the customer review about the aspect. They fine-tune a BERT model to conduct both aspect extraction and aspect sentiment classification. The above-mentioned works tend to generate attention-based sentence-level representations that encapsulate the contribution each word makes in predicting sentence categories. We instead generate label-inclined attention representations at word-level that can be used to effectively deduce word categories/labels.
To the best of our knowledge, we are the first to perform a joint learning task that achieves MLP at two classification stages, token-and sentence-levels, while using only the top predictions at token level.

Data
The absence of a standardised outcome classification system prompted Nye et al. (2018) to annotate outcomes with an arbitrary selection of outcome type labels aligned to the Medical Subject Headings (MeSH) vocabulary. Moreover, flaws have been identified in their outcome annotations in recent work (Abaho et al., 2019), such as statistical metrics and measurement tools annotated as part of clinical outcomes, e.g. "mean arterial blood pressure" instead of "arterial blood pressure", "Quality of life Questionnaire" instead of "Quality of life", and "Work-related stress scores" instead of "Work-related stress".
Motivated by the taxonomy proposed by Dodd et al. (2018) to standardise outcome classifications in electronic databases, and inspired by the annotation of the EBM-COMET dataset (Abaho et al., 2020), we attempt to align EBM-NLP's arbitrary outcome classifications to the standard outcome classifications proposed by Dodd et al. (2018). These standard classifications were found (after extensive analysis and testing) to provide sufficient granularity and scope for trial outcomes. We propose an unsupervised label alignment method to identify and align parallel annotations across the EBM-NLP and EBM-COMET datasets. Additionally, we use the discovered semantic similarity between the two datasets to merge them in order to create a larger dataset for evaluating our joint learning approach. The merged dataset contains labels that follow the taxonomy proposed by Dodd et al. (2018). All three datasets are used during evaluation, with each one randomly split such that 80% is retained for training and 20% for testing, as shown in Table 2. We hypothesise that the merged dataset improves the performance we obtain on the original independent datasets.

Table 3: Cosine distance between representations of EBM-NLP labels (first column) and EBM-COMET labels (top and second rows; the top row groups the domains under the five outcome types Physiological, Mortality, Life-Impact, Resource-use and Adverse-effects). EBM-COMET outcome type labels were drawn from the outcome domains defined in the Dodd et al. (2018) taxonomy. Due to space limitations, we denote these domains as PX (P0, P1, etc.). The taxonomy hierarchically categorises them into 5 outcome types, which are accordingly included in the top row. Outcome domain definitions: P0: Physiological/clinical; P1: Mortality/survival; P25: Physical functioning; P26: Social functioning; P27: Role functioning; P28: Emotional functioning/wellbeing; P29: Cognitive functioning; P30: Global quality of life; P31: Perceived health status; P32: Delivery of care; P33: Personal circumstances; P34: Economic; P35: Hospital; P36: Need for further intervention; P37: Societal/carer burden; P38: Adverse events/effects.

Label alignment (LA) for Comparable Datasets
Given two datasets S and T with comparable content, with S containing x labels L_s = {l_s^1, …, l_s^x} and T containing y labels L_t = {l_t^1, …, l_t^y}, we design LA to measure the similarity between each pair of labels (l_s, l_t).
For this purpose, we first create an embedding for each label l_s in a sentence s (∈ S) by applying mean pooling over the span of embeddings (extracted using pre-trained BioBERT (Lee et al., 2020)) for the tokens corresponding to an outcome annotated with l_s, as shown in (1):

O_{l_s} = (1/d) Σ_{j=i}^{i+(d−1)} e_{w_j}    (1)

where O_{l_s} is an outcome span annotated with outcome type label l_s, e_{w_j} is the BioBERT embedding of token w_j, d is the number of tokens in the span, and i and i+(d−1) are the locations of the first and last words of the outcome span. Next, we average the embeddings of all outcome spans that are annotated with l_s in all sentences in S to generate an outcome type label embedding l_s, as shown in (2):

l_s = (1/|l_s|) Σ O_{l_s}    (2)

where |l_s| is the number of outcome spans annotated with label l_s and l_s is the embedding of label l_s. Likewise, we create an outcome type label embedding l_t for each outcome type in the target dataset T. After generating label embeddings for all outcome types in both S and T, we compute the cosine similarity between each pair of embeddings l_s and l_t as the alignment score between labels l_s and l_t. Table 3 shows the similarity scores for label pairs (l_s, l_t) across S (EBM-COMET) and T (EBM-NLP) respectively. For each label (which is an outcome domain) in EBM-COMET, we identify the EBM-NLP label most similar to it by searching for the least cosine distance across the entire column. After identifying the most similar pairs, we automatically replace outcome type labels in EBM-NLP with EBM-COMET outcome type labels as informed by the similarity measure.
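The alignment procedure described above (mean pooling token embeddings per span, averaging span embeddings per label, then pairing labels by least cosine distance) can be sketched as follows. The vectors below are toy stand-ins for BioBERT token embeddings, and the label sets are reduced to one source label and two target labels purely for illustration.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors: eq.-style pooling over a span
    of token embeddings, or over a label's span embeddings."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def label_embedding(spans):
    """One embedding per label: mean-pool each span, then average the spans."""
    return mean_pool([mean_pool(span) for span in spans])

# Toy stand-ins for BioBERT token embeddings of annotated outcome spans.
# Each label maps to a list of spans; each span is a list of token vectors.
source_spans = {"Physiological": [[[1.0, 0.1], [0.8, 0.3]], [[0.9, 0.2]]]}
target_spans = {"Physical": [[[0.95, 0.15]]], "Mental": [[[0.1, 1.0]]]}

# Align each source label to the target label at the least cosine distance.
for ls, spans in source_spans.items():
    e_s = label_embedding(spans)
    best = min(target_spans,
               key=lambda lt: cosine_distance(e_s, label_embedding(target_spans[lt])))
    print(ls, "->", best)  # Physiological -> Physical
```

With these toy vectors the source label aligns to Physical, mirroring the direction of the mapping discussed below for the full datasets.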
Results show that Physiological outcomes (containing domain P0) are similar to Physical outcomes, and the latter are therefore relabelled Physiological; Life-Impact outcomes are similar to Mental outcomes, and the latter are therefore relabelled Life-Impact. Mortality and Adverse-effects outcomes both remain unchanged because both categories exist in the source and target datasets, and their respective outcomes are found to be similar. We evaluate the LCAM architecture on the resulting merged dataset and, additionally, evaluate the alignment approach by comparing performance before and after merging.

LCAM Model

Figure 1 illustrates the end-to-end SEQ2SEQ2SET architecture of the LCAM model. It depicts a two-phased process to achieve classification at token and sentence level. In phase 1, input tokens are encoded into representations which are sent to a decoder (i.e. a sigmoid layer) to predict a label for each word, hence OSD. Subsequently, in phase 2, the token-level representations are used to generate individual outcome span representations, which are sent to another decoder (sigmoid layer) that is used to predict the label(s) for each outcome span, hence OC. We use MLP for the OC task because some outcomes are annotated with multiple outcome types. The pseudo code for LCAM is shown in the Supplementary.

Figure 1: Illustration of the LCAM architecture. It encodes a sequence of tokens of a sentence within an abstract and generates contextualised representations by adding a global representation of the abstract at word- and sentence-level. Two attention layers aid the generation of label-aware representations used to decode labels at word-level for OSD and at sentence-level for OC.

Outcome Span Detection (OSD)
Given a set of sentences S = {s_i}_{i=1}^{|S|} within an abstract a, each s_i having N words, s_i = w_1, …, w_N, with each word tagged with a label l_w under the BIO tagging scheme (Sang and Veenstra, 1999), OSD aims to extract one or more outcome spans within s_i. For example, in Figure 1, OSD extracts the outcome span "incisional hernia" given the input sentence.
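Under the BIO scheme, outcome spans can be recovered deterministically from the tag sequence. The sentence below is a hypothetical example built around the "incisional hernia" span from Figure 1:

```python
# BIO tagging: the first token of an outcome span is tagged B (begin), any
# following tokens I (inside), and all other tokens O (outside).
tokens = ["The", "primary", "outcome", "was", "incisional", "hernia", "rate"]
tags   = ["O",   "O",       "O",       "O",   "B",          "I",      "O"]

def extract_spans(tokens, tags):
    """Recover outcome spans from a BIO tag sequence."""
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

print(extract_spans(tokens, tags))  # ['incisional hernia']
```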
Encoder: In our OSD task setting, we initially implement a baseline LCAM using a BiLSTM to encode input tokens (represented by d-dimensional word embeddings obtained using GloVe (Pennington et al., 2014)) into hidden representations for every word within an input sentence. We then consider generating each input word's hidden representation using a pre-trained clinically informed BERT-base model, BioBERT (Lee et al., 2020). The LCAM model learns a hidden state for each word as shown in (3), where w_n ∈ s_i, h_n ∈ R^{k×1} and k is the dimensionality of the hidden state:

h_n = BiLSTM(e_{w_n})
h_n = BioBERT(w_n)    (3)

The upper equation in (3) is used for the BiLSTM text encoder and the lower for the BioBERT one.

Abstract Hidden State Context
To make the hidden state representation context-aware, we add a compound representation of the abstract a to which the sentence containing w_n belongs:

h_n^c = h_n + f(AbsEncoder(a))    (4)

where f is a function computing the average pooled representation of the encoded abstract, AbsEncoder ∈ {BiLSTM, BioBERT}, AbsEncoder(a) ∈ R^{k×|a|}, |a| is the length of the abstract (measured by the number of tokens contained in it), and f(AbsEncoder(a)) ∈ R^{k×1}.
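A minimal sketch of this context injection, with toy k=3 hidden states standing in for the BiLSTM/BioBERT encoder outputs:

```python
def mean_pool(matrix):
    """f: average the k-dim encoder states over all |a| tokens of the abstract."""
    n = len(matrix)
    return [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]

def contextualise(h_n, abstract_states):
    """Add the pooled abstract vector to a token's hidden state
    (h^c_n = h_n + f(AbsEncoder(a)))."""
    g = mean_pool(abstract_states)
    return [h + c for h, c in zip(h_n, g)]

# Toy k=3 hidden states; a real run would take these from the encoder.
abstract_states = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]  # |a| = 2 tokens
h_n = [0.5, 0.5, 0.5]
print(contextualise(h_n, abstract_states))  # [2.5, 1.5, 1.5]
```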

Label-word attention
We compute two different attention scores. The first enables the model to pay appropriate attention to each word when generating the overall outcome span representation. The second allows the words to interact with the labels in order to capture the semantic relation between them, making the representations more label-aware. To obtain the first attention vector A^(1), we use a self-attention mechanism (Al-Sabahi et al., 2018; Lin et al., 2017) with two weight parameters and a hyperparameter b that can be set arbitrarily:

A^(1) = softmax(W tanh(V h_n^c))    (5)

where W ∈ R^{|l_w|×b}, V ∈ R^{b×k} and A^(1) ∈ R^{|l_w|×1}. |l_w| is the number of token-level labels. Furthermore, we obtain a label-word attention vector A^(2) using a trainable matrix U ∈ R^{|l_w|×k}. Similar to the interaction function of Du et al. (2019), this attention is computed in (6) as the dot product between h_n^c and U:

A^(2) = softmax(U h_n^c)    (6)

where A^(2) ∈ R^{|l_w|×1}.

Label-word representation: The overall representation used by the decoder to classify each token is obtained by merging the two attention distributions above, as shown in (7), where E_n^{t_l} ∈ R^{|l_w|×k} denotes the token-level (t_l) representation:

E_n^{t_l} = (A^(1) + A^(2)) (h_n^c)^T    (7)

Each token's label scores are then produced by a sigmoid prediction layer over E_n^{t_l}:

ŷ_n = σ(E_n^{t_l} w_o + b_o)    (8)

The training objective is to maximise the probability of the singular ground truth label by minimising a cross-entropy loss:

L_{t_l} = −(1/N) Σ_{n=1}^{N} Σ_{j=1}^{|l_w|} y_{n,j} log ŷ_{n,j}    (9)

where N is the number of tokens in a sentence and |l_w| is the number of labels.
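The self-attention and label-word attention scores and their merge can be sketched in plain Python as follows. The dimensions and weights are toy values, and the exact normalisation and merge (softmax over label scores, sum of the two distributions times the hidden state) are assumptions of this sketch rather than the paper's definitive formulation:

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def label_word_attention(h_c, W, V, U):
    # A(1): self-attention, softmax(W tanh(V h_c)) -> one score per token-level label.
    a1 = softmax(matvec(W, [math.tanh(x) for x in matvec(V, h_c)]))
    # A(2): label-word attention via the trainable matrix U.
    a2 = softmax(matvec(U, h_c))
    # Merge: (A(1) + A(2)) outer-product h_c -> a |l_w| x k label-aware representation.
    return [[(x1 + x2) * h for h in h_c] for x1, x2 in zip(a1, a2)]

# Toy dimensions: hidden size k = 2, attention size b = 2, |l_w| = 3 BIO labels.
h_c = [0.4, -0.2]
V = [[1.0, 0.0], [0.0, 1.0]]              # b x k
W = [[0.5, 0.1], [0.2, 0.3], [0.1, 0.6]]  # |l_w| x b
U = [[0.3, 0.2], [0.1, 0.4], [0.5, 0.0]]  # |l_w| x k
E = label_word_attention(h_c, W, V, U)
print(len(E), len(E[0]))  # 3 2: one k-dim row per BIO label
```

Because both attention vectors sum to one, each column of the merged representation sums to twice the corresponding hidden-state entry, which is a quick sanity check on the merge.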

Outcome Classification (OC)
OC predicts outcome types for the outcome spans extracted during OSD. As at token-level, we add an abstract representation (a mean pool of its tokens' representations) to add context to each token's representation. An outcome span is represented by concatenating the vectors of its constituent words into O_s ∈ R^{m×k}, where m is the number of tokens contained in outcome span O_s. We adopt the aforementioned self-attention and label-word attention methods at sentence-level to extract an attention-based sentence-level representation of an outcome, as follows:

E^{s_l} = (A^(1) + A^(2)) O_s    (10)

where A^(1), A^(2) ∈ R^{|l_s|×m} are the sentence-level counterparts of the two attention distributions and |l_s| is the number of outcome types. Given an outcome span representation E^{s_l}, the training objective at sentence-level (s_l) is to maximise the probability of the set of ground truth outcome types:

argmax_θ P(y = (l_s^1, l_s^2, …, l_s^{|l_s|}) ∈ l_s | E^{s_l}; θ)    (11)

where y_i ∈ {0, 1}, ŷ_i ∈ [0, 1], and l_s ∈ {Physiological, Mortality, Life-Impact, Resource-use, Adverse-effects}. The overall joint model loss is the sum of the token-level and sentence-level losses:

L = L_{t_l} + L_{s_l}    (12)
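A sketch of the joint training objective described above, under the stated assumptions: a cross-entropy over BIO labels at token level (single ground-truth label per token) and a multi-label binary cross-entropy over the five outcome types at sentence level, summed. All probabilities below are toy values, not model outputs:

```python
import math

def token_ce(probs, gold):
    """Token-level OSD loss: mean cross-entropy over N tokens, each with one
    ground-truth BIO label (probs[n] is a distribution over {B, I, O})."""
    return -sum(math.log(p[g]) for p, g in zip(probs, gold)) / len(gold)

def span_bce(probs, gold):
    """Sentence-level OC loss: multi-label binary cross-entropy over the five
    outcome types (probs are per-type sigmoid outputs)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, gold)) / len(gold)

# Toy predictions: 3 tokens over {B, I, O}, one span over 5 outcome types.
osd_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
osd_gold = [0, 1, 2]                     # B, I, O
oc_probs = [0.9, 0.05, 0.8, 0.1, 0.05]  # sigmoid output per outcome type
oc_gold = [1, 0, 1, 0, 0]               # e.g. Physiological + Life-Impact

# Overall joint loss: sum of the token-level and sentence-level losses.
joint_loss = token_ce(osd_probs, osd_gold) + span_bce(oc_probs, oc_gold)
print(round(joint_loss, 4))
```

The multi-hot `oc_gold` reflects the motivation for the MLP layer: a single span (e.g. depression) may carry more than one outcome type.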

Experiments
The joint learning LCAM framework is evaluated on the three datasets discussed in section 3: the expertly annotated EBM-COMET, EBM-NLP (Nye et al., 2018) and the merged dataset created by aligning parallel annotations between EBM-NLP and EBM-COMET (covered in subsection 3.1).

Implementation
For pre-processing the data, we first label each word in the sentences contained in an abstract with one of {B, I, O}. Subsequently, at the end of each sentence, we append a list of outcome types corresponding to the outcome spans in the sentence. It is important to note, however, that not all sentences within an abstract have outcome spans. For example, the annotated sentence below contains the outcome span "Incisional hernia", whose outcome label (Physiological) is placed at the end of the sentence.
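This pre-processing step can be sketched on a hypothetical annotated sentence; the sentence, span indices and output format below are illustrative, not taken from the datasets:

```python
def preprocess(tokens, spans):
    """Tag each token with B/I/O and collect the sentence-level outcome types
    to be appended at the end of the sentence (illustrative format)."""
    tags = ["O"] * len(tokens)
    types = []
    for start, end, outcome_type in spans:  # span covers tokens [start, end)
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
        types.append(outcome_type)
    return list(zip(tokens, tags)), types

tokens = ["Incisional", "hernia", "occurred", "in", "five", "patients"]
tagged, types = preprocess(tokens, [(0, 2, "Physiological")])
print(tagged)  # [('Incisional', 'B'), ('hernia', 'I'), ('occurred', 'O'), ...]
print(types)   # ['Physiological']
```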

Setup
The joint setup performs concurrent sequence labelling (OSD) and sequence classification (OC), whereas the standalone setup performs OSD and OC separately. The former is achieved using (a) a baseline model, LCAM-BiLSTM (using a BiLSTM encoder), and (b) LCAM-BioBERT (using a BioBERT encoder), whereas the latter is achieved by fine-tuning the original (c) BioBERT and (d) SciBERT (Beltagy et al., 2019) models. Our datasets are novel in the sense that the outcome type labels are drawn from the Dodd et al. (2018) taxonomy, which is not the basis of prior outcome annotations such as the EBM-NLP dataset. The models were evaluated on the tasks by reporting the macro-averaged F1. For the standalone models, we use the token-classification and text-classification fine-tuning scripts provided by Huggingface (Wolf et al., 2020) for OSD and OC respectively. In addition to the macro-F1, we visualise ranking metrics pertaining to MLP in order to compare our model to related work for MLP. The metrics of focus are precision at top n, P@n (the fraction of the top n predictions present in the ground truth), and Normalised Discounted Cumulative Gain at top n (nDCG@n).
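The two ranking metrics can be computed as follows; this is a standard binary-relevance formulation and may differ in minor details from the exact formulas of Xiao et al. (2019):

```python
import math

def precision_at_n(ranked, relevant, n):
    """P@n: fraction of the top-n predicted labels present in the ground truth."""
    return sum(1 for label in ranked[:n] if label in relevant) / n

def ndcg_at_n(ranked, relevant, n):
    """nDCG@n with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, lab in enumerate(ranked[:n]) if lab in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(n, len(relevant))))
    return dcg / ideal if ideal else 0.0

# Toy ranking of the five outcome types for one span.
ranked = ["Physiological", "Life-Impact", "Mortality", "Resource-use", "Adverse-effects"]
gold = {"Physiological", "Mortality"}
print(precision_at_n(ranked, gold, 1))          # 1.0
print(round(ndcg_at_n(ranked, gold, 3), 3))
```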

Results
The first set of results, reported in Table 4, are based on the independent test sets (Table 2) for each of the datasets. The joint LCAM-BioBERT and standalone BioBERT models are not only competitive but consistently outperform the baseline model for both the OSD and OC tasks. We observe that the LCAM-BioBERT model outperforms the other models in the OSD experiments for the last two datasets in Table 4. On the other hand, the standalone BioBERT model achieves higher F1 scores for the last two datasets in the OC task.

Impact of the abstract context injection and Label attention
As shown in Table 6, performance deteriorates (with respect to the results reported in Table 4) without the attention layers ("-Attention") by 10% on average for OSD and 11.3% for OC. Similarly, excluding the abstract representation ("-Abstract") leads to an average performance decline of 4.3% for OSD and 2.7% for OC. The decline resulting from "-Abstract" is less pronounced than that resulting from "-Attention" for both the OSD and OC tasks. These declines demonstrate the significant impact of (1) the semantic relational information between tokens and labels, as well as between outcome spans and labels, gathered by the attention mechanism, and (2) the information from the text surrounding a token or an outcome span embedded in the abstract representation. This justifies the inclusion of both components.
To evaluate the proposed label alignment method (subsection 3.1), we train a model using the aligned dataset (EBM-COMET+EBM-NLP) and evaluate it on the test sets of the original datasets in Table 5. We see significant improvements in F-scores for OSD on both EBM-COMET and EBM-NLP. Additionally, for OC, we see a significant improvement in F-score on the EBM-NLP dataset and a slight improvement on the EBM-COMET dataset. Overall, this result shows that the proposed label alignment method enables us to improve performance for both the OSD and OC tasks.
To further evaluate the LCAM-BioBERT model, we focus on the OC task results alone, where the classifier returns the outcome types given an outcome span, and compare MLP performance to the baseline and another related MLP model, the label-specific attention network (LSAN) (Xiao et al., 2019), which learns BiLSTM representations for multi-label classification of sentences. For comparison, we compute P@n and nDCG@n using formulas similar to those of Xiao et al. (2019). As illustrated in Figure 2, the LCAM model outperforms its counterparts on all datasets, most notably for P@1. Our joint BiLSTM baseline model performs comparably with LSAN, and indeed outperforms it on the EBM-COMET dataset for P@1, nDCG@1 and nDCG@3. We attribute LCAM's superior performance to (1) using a domain-specific (biomedical) language representation model (BioBERT) at its encoding layer, (2) applying label-specific attention prior to classifying a token as well as before classifying the mean pooled representation of an outcome span, and finally (3) injecting global contextual knowledge from the abstract into the token and document (outcome-span) representations.

Table 7 (excerpt): an example input sentence with its ground truth annotations and the labels predicted by LCAM (scored by P@1 and P@2); superscript numbers index the annotated outcome spans.

Ground truth

The primary outcomes were hospitalised death^1, severe disability^2 at 15 months of age, neonatal behavioural neurological^3 assessment (nbna) score at 28 days of age, and Bayley scales of infant development^4 (BSID) score (including mental development^5 index (mdi) score and psychomotor development^6 index (pdi) score) at 15 months of age at follow-up.

LCAM Output

The primary outcomes were hospitalised death^1, severe^2 disability^3 at 15 months of age, neonatal behavioural neurological assessment (nbna) score at 28 days of age, and Bayley scales of infant development (BSID) score (including mental development^4 index (mdi) score and psychomotor development^5 index (pdi) score) at 15 months of age at follow-up.

Error Analysis
We review a few sample instances that exhibit the mistakes the joint LCAM model makes in the OSD and OC tasks in Table 7.
OSD errors: We observe the model partially detecting outcome phrases, e.g. in Example 1 it detects death instead of hospitalised death, and development instead of mental development, while in Example 2 it does not detect "(DFS)" as a part of the outcome phrase. Additionally, it completely misses some outcomes, such as infant development in Example 1.
OC errors: Incorrect token-level predictions will most likely result in incorrect outcome classification. In Example 1, instead of severe disability, the model detects "severe" as one outcome and "disability" as a separate outcome, and classifies them as Physiological and Life-Impact respectively. Similarly, in Example 3, both outcomes are misclassified because at token level multiple outcomes are detected rather than one: hospital and stay rather than hospital stay, and postoperative and hospital stay rather than postoperative hospital stay.

Conclusion
We proposed a method to jointly detect outcome spans and types using a label attention approach. Moreover, we proposed a method to align multiple comparable datasets in order to train a reliable outcome classifier. Given real-world scenarios where it is often impractical or computationally demanding to build a model for every single task, our experimental results demonstrate the effectiveness of an approach that simultaneously (jointly) achieves two different tasks without compromising the performance of the individual tasks when decoupled.

Ethical Considerations
Joint learning can have multiple applications, where multiple tasks are simultaneously achieved whilst preserving (or even improving) standalone performance when tasks are separately conducted.
In this particular work, we are motivated by the need to jointly model a pair of tasks (outcome span detection and outcome classification) in order to enhance outcome information retrieval. Recent developments in the domain, such as the emergence of an outcome classification system aimed at standardising outcome reporting and classification, motivated us to reconstruct the datasets we use in order to align them with this classification. The datasets contain text from abstracts of clinical trials published on PubMed. We cannot ascertain that all these abstracts are unbiased assessments of the effects of interventions, especially given recurring articles citing several biases, including selection bias (trial clinicians favouring certain participating patients for personal reasons), reporting/publication bias (only reporting statistically significant results) and many more. Nevertheless, we provide more details and reference these datasets both within the article and the supplementary material.
These are then used to generate a label-word representation (line 16); all label-word representations forming a sentence (line 17) are used to compute an outcome extraction (OE) loss using eqn 9 (line 19). Once again we add context to the newly generated token-level representations (line 20). For every outcome, we repeat the steps in lines 10-14 to obtain label attention scores, i.e. depicting the contribution the particular outcome phrase makes to each label, and these are used to obtain a label-document representation for the outcome (line 30). This representation is then used to compute the outcome classification loss (line 32). The loss we minimise in joint learning is computed as shown in line 33.

B Hyperparameters and Run time
We perform a grid search through multiple combinations of the hyperparameters included in Table 8 below. Using 20% of the EBM-COMET+EBM-NLP dataset as a dev set, we obtain the best F1 values. Table 8 shows the range of values (including the lower and upper bound) over which LCAM-BioBERT is tuned to obtain optimal configurations. Using a shared TITAN RTX 24GB GPU, the baseline joint model, LCAM-BiLSTM, runs for approximately 45 minutes when evaluating on the EBM-COMET dataset, 190 minutes on the EBM-NLP dataset and at least 320 minutes on the merged EBM-COMET+EBM-NLP dataset. For the LCAM-BioBERT model, the experiments last at least 14 hours on EBM-COMET, 30 hours on EBM-NLP and 42 hours on the merged EBM-COMET+EBM-NLP. Table 9 includes the tuned ranges for the standalone models (BioBERT and SciBERT), which we fine-tune for the outcome extraction (OE) and outcome classification tasks. Similar to the joint model, the best values are chosen based on the EBM-COMET test set F1 values. Training and evaluation on EBM-COMET, EBM-NLP and EBM-COMET+EBM-NLP consume 7, 34, and 45 GPU hours respectively.
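The grid search over hyperparameter combinations can be sketched with itertools.product; the hyperparameter names, ranges and scoring function below are placeholders, since the actual ranges live in Table 8 and the real score comes from training and evaluating on the dev split:

```python
import itertools

# Hypothetical search space; the actual ranges are those listed in Table 8.
grid = {"lr": [1e-5, 3e-5], "batch_size": [16, 32], "dropout": [0.1, 0.3]}

def dev_f1(config):
    """Stand-in for training LCAM-BioBERT with `config` and scoring dev F1;
    a real run would train and evaluate on the 20% dev split."""
    return 0.5 - abs(config["lr"] - 3e-5) * 1e3 - abs(config["dropout"] - 0.1)

# Exhaustively evaluate every combination and keep the best configuration.
best = max((dict(zip(grid, values)) for values in itertools.product(*grid.values())),
           key=dev_f1)
print(best["lr"], best["dropout"])  # 3e-05 0.1
```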

C.2 EBM-COMET
A biomedical corpus containing 300 PubMed "Randomised Controlled Trial" abstracts manually annotated with outcome classifications drawn from the taxonomy proposed by Dodd et al. (2018). The abstracts were annotated by two experts with extensive experience in annotating outcomes in systematic reviews of clinical trials (Abaho et al., 2020). Dodd et al. (2018)'s taxonomy hierarchically categorised 38 outcome domains into 5 outcome core areas and applied this classification system to 299 published core outcome sets (COS) in the Core Outcome Measures in Effectiveness Trials (COMET) database.

C.3 EBM-COMET+EBM-NLP
We merge the two datasets above for two main purposes: (1) to align the annotations of EBM-NLP to a standard classification system for outcomes (Dodd et al., 2018) and (2) to create a larger dataset for evaluating our joint learning approach.

C.4 Pre-processing
We create one single vocabulary using the merged dataset and use it for all three datasets. While generating the vocabulary, we simultaneously split the abstracts into sentences using the Stanford tokeniser. This vocabulary is then used to create tensors representing sentences, where each tensor contains the IDs of the tokens/words in the sentence. The same procedure is followed to create tensors containing the IDs of the labels ("BIO") corresponding to the words in the sentences. Additionally, we create tensors with the IDs of the outcome classification labels, so for each sentence tensor there is a corresponding token-level label tensor and a sentence-level label (outcome label) tensor. For the baseline, where we use a BiLSTM to learn GloVe representations, we follow the instructions to extract GloVe-specific vectors for the words, token-level labels and sentence labels in the dataset. All the files with d-dimensional vectors are stored as .npy files. For the joint BERT-based models, we use flair (Akbik et al., 2019) to extract TransformerWordEmbeddings from pre-trained BioBERT for the tokens.