Self-Supervised Detection of Contextual Synonyms in a Multi-Class Setting: Phenotype Annotation Use Case

Contextualised word embeddings are a powerful tool for detecting contextual synonyms. However, most current state-of-the-art (SOTA) deep learning concept extraction methods remain supervised and underexploit the potential of the context. In this paper, we propose a self-supervised pre-training approach that is able to detect contextual synonyms of concepts when trained on data created by shallow matching. We apply our methodology in the sparse multi-class setting (over 15,000 concepts) to extract phenotype information from electronic health records. We further investigate data augmentation techniques to address the problem of class sparsity. Our approach achieves a new SOTA for unsupervised phenotype concept annotation on clinical text in F1 and Recall, outperforming the previous SOTA by up to 4.5 and 4.0 absolute points, respectively. After fine-tuning with as little as 20% of the labelled data, we also outperform BioBERT and ClinicalBERT. The extrinsic evaluation on three ICU benchmarks also shows the benefit of using the phenotypes annotated by our model as features.


Introduction
Supervised fine-tuning on top of BERT-based models has recently become the standard approach in Information Extraction, delivering state-of-the-art (SOTA) results across different tasks (Devlin et al., 2019). The dependence of these models on costly human annotations remains a serious obstacle to their large-scale deployment. This problem is especially acute in the clinical domain, where the availability of experts is limited.
In the self-supervised setting, automatic annotations are cheap to produce. Some rule-based automatic labelers, such as CheXpert (Irvin et al., 2019), which is built on NegBio (Peng et al., 2018), are often used to create training data for supervised BERT-based models (e.g., automatic annotators of radiology reports (Smit et al., 2020)). Those models usually generalise over a small set of classes (under 20).
Other automatic labelers exploit ontologies. For example, the UMLS (Unified Medical Language System) ontology (Bodenreider, 2004b) is predominantly used to match linguistic patterns in clinical text to medical concepts (e.g., using the MetaMap tool (Aronson, 2006)). Due to the complexity of the Information Extraction task in this challenging setting (sparse multi-class), the approaches that use data annotated in this way (e.g., (Arbabi et al., 2019; Kraljevic et al., 2019; Tiwari et al., 2020)) mostly rely on non-contextualised embeddings and focus on detection precision. However, relying on the context is essential, especially for clinical text, which is noisier and exhibits a variety of clinical expressions requiring disambiguation. We also argue that recall is very important, especially when automatic annotation results are used further in downstream tasks.
In this work, we propose a self-supervised approach for sparse multi-class classification that fully relies on the context to detect contextual synonyms of medical concepts in clinical text. To be more precise, our model is based on the Clinical-BERT (Alsentzer et al., 2019) model which was pre-trained on the biomedical and clinical corpora that are widely used, producing state-of-the-art results in a range of supervised biomedical tasks, e.g. named entity recognition, relation extraction and question answering (Peng et al., 2019;Hahn and Oleynik, 2020). We separate the detection of frequent and rare classes by introducing different training objectives. The special training objective for rare classes increases the proximity of the respective textual embeddings and the ontology embeddings of concepts. Our work also exploits data augmentation techniques, such as paraphrasing and guided text generation to aid sparse class detection and diversify the training data.
We apply our methodology to the phenotype detection task with more than 15,000 concepts from the Human Phenotype Ontology (HPO) (Köhler et al., 2017). Phenotyping is an important Clinical NLP task that can improve the understanding of disease diagnosis (Aerts et al., 2006; Deisseroth et al., 2019; Liu et al., 2019a; Son et al., 2018). It remains unexplored due to the complexity of classification into such a large number of classes. We test our approach on clinical data, namely on electronic health records (EHRs) and radiology reports.
Our main contributions: (1) Self-supervised methodology for contextual phenotype detection in clinical records. (2) Methodology for sparse class detection with the special training objective that increases proximity of contextual synonyms to ontology embeddings. (3) Data augmentation methodology to further improve the detection of sparse classes.
Our self-supervised models improve the current SOTA by up to 4.5 absolute points on F1 and by up to 4.0 absolute points on Recall for the phenotype detection task on clinical data, which demonstrates that relying on the context is essential for this type of data. Furthermore, after fine-tuning, our model outperforms the fine-tuned BERT-based models with as little as 20% of labelled data, which confirms the efficiency of our self-supervised training objectives. Finally, the extrinsic evaluation shows the benefits of using the phenotypes annotated by our model as features to predict ICU patient outcomes.
We present related work in Section 2, our phenotyping methods in Section 3, and our experimental setup in Section 4. Then, we present and discuss key results in Section 5. Finally, we conclude this work in Section 6.

Related Work
Most of the current methodologies for phenotype detection are supervised, BERT-based (e.g., BioBERT or Clinical-BERT (Alsentzer et al., 2019)) and dedicated to the detection of certain rather limited phenotypes or their groups (Liu et al., 2019b; Zhang et al., 2019; Yang et al., 2020; Franz et al., 2020; Li et al., 2020).
Unsupervised methods in the clinical NLP domain traditionally rely on the usage of ontologies and knowledge bases. Human Phenotype Ontology (Köhler et al., 2017) is the most widely used ontology of phenotypes. The use of HPO in annotating phenotypic information automatically remains unexplored, mainly due to the complexity of formalising the task with over 15,000 concepts.
More recently, unsupervised deep learning methods have been applied to the problem, which made it possible to perform semantic analysis and go beyond shallow matching (Arbabi et al., 2019; Kraljevic et al., 2019; Tiwari et al., 2020). These approaches use non-contextualised embeddings and focus on the precision of detection, with limited exploitation of the context. For example, Kraljevic et al. (2019) propose a procedure to learn word vectors enriched with their averaged context over the corpus in order to map them to the correct medical concepts. In contrast to all the related approaches, we use contextualised word representations and focus on recall.

Methodology
This section introduces the problem of phenotype detection along with our self-supervised method. It elaborates our data augmentation strategies, selective supervision in low-resource conditions, and finally explains our inference algorithm.
Problem Definition While annotating clinical text, clinicians usually relate HPOs to short spans of around 2-3 words, depending on the corpus. Following this rationale, we define phenotype annotation as a two-step process: (1) detect HPO-relevant text spans, and (2) assign the respective HPO concepts to those spans. More formally, given a textual document $X = \{t_1, \ldots, t_N\}$ represented by a sequence of tokens, and the full set of HPO concepts $H = \{H_1, \ldots, H_M\}$ under the root node Phenotypic Abnormality (HP:0000118) (exclusive) of the HPO ontology, our goal is to model: (1) $p(1_H \mid t_n)$, the conditional probability of the token $t_n$ being HPO-relevant; (2) $p(H_m \mid t_n)$, the conditional probability that the HPO concept $H_m$ should be assigned to the token $t_n$.
In the self-supervised setting, we consider only the training examples whose textual spans exactly match the HPO concepts as defined by the ontology. The main assumption here is that, by capturing the context of those spans, the model will be able to generalise and detect formally different HPO spans seen in contexts similar to those of the HPO concepts (a.k.a. contextual synonyms; for example, Fever (HP:0001945) will be matched to "feverish"). To support this challenging setting, we have designed a series of relevant training objectives described below.
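As an illustration of how exact matching could produce token-level training labels, the following is a minimal sketch. The two-entry lexicon, the tokenisation by `\w+`, and the function name `exact_match_labels` are assumptions made for this example; the real lexicon would contain HPO names, synonyms and abbreviations.

```python
import re

# Hypothetical mini-lexicon (assumption): surface form -> HPO id.
LEXICON = {
    "fever": "HP:0001945",
    "hypotension": "HP:0002615",
}

def exact_match_labels(text, lexicon):
    """Return (tokens, labels); labels[i] is an HPO id if token i lies in an
    exactly matched lexicon span, otherwise None."""
    tokens = re.findall(r"\w+", text.lower())
    labels = [None] * len(tokens)
    # Longer entries first, so multi-word names win over their sub-spans.
    entries = sorted(lexicon.items(), key=lambda kv: -len(kv[0].split()))
    for name, hpo_id in entries:
        words = name.lower().split()
        k = len(words)
        for i in range(len(tokens) - k + 1):
            span_free = all(label is None for label in labels[i:i + k])
            if tokens[i:i + k] == words and span_free:
                labels[i:i + k] = [hpo_id] * k
    return tokens, labels

tokens, labels = exact_match_labels(
    "Patient was feverish; fever persisted overnight.", LEXICON
)
for token, label in zip(tokens, labels):
    print(token, label)
```

In this toy input only "fever" receives a label, while "feverish" stays unlabelled. Spans of the latter kind are the contextual synonyms that the model is expected to recover from context.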
Training Objectives As shown in Figure 1 (left), the proposed model for phenotype annotation consists of a Transformer encoder which is identical to and initialised with ClinicalBERT (Alsentzer et al., 2019). Besides, there are two additional classifiers on the top of the Transformer encoder which predict if a token is HPO-relevant and assign HPO concepts to those HPO-relevant tokens.
We enrich the model with the following three training objectives. (1) A binary cross-entropy loss $L_1$ to predict $p(1_H \mid t_n)$, where $1_H$ is 1 if $t_n$ is HPO-relevant, otherwise 0. (2) A cross-entropy loss $L_2$ with softmax to predict $p(H_m \mid t_n)$, which is defined over the most frequent HPO concepts found in the training data. The intuition behind this objective is to increase the precision of the resulting predictions. (3) The Euclidean distance $L_3$ between the token embedding $v_{t_n}$ and the respective HPO concept embedding $v_{H_m}$, which is defined to increase the recall of the model and targets the detection of rare HPO concepts.
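The three objectives can be sketched for a single token as follows. This is a minimal illustration in plain Python, not the training code: the function names are hypothetical, and the unweighted sum in `combined_loss` is an assumption, since the text does not state how the objectives are weighted.

```python
import math

EPS = 1e-12  # numerical guard for log(0)

def binary_cross_entropy(p_relevant, is_relevant):
    """L1 for one token: p_relevant is the predicted p(1_H | t_n),
    is_relevant is the 0/1 target."""
    return -(is_relevant * math.log(p_relevant + EPS)
             + (1 - is_relevant) * math.log(1.0 - p_relevant + EPS))

def softmax_cross_entropy(logits, target_index):
    """L2 for one token: softmax over the frequent-concept logits,
    followed by the negative log-likelihood of the target concept."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

def euclidean_distance(v_token, v_concept):
    """L3 for one token: distance between the token embedding and
    the embedding of its HPO concept."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_token, v_concept)))

def combined_loss(p_relevant, is_relevant, logits, target_index,
                  v_token, v_concept):
    """Assumption: an unweighted sum of the three objectives."""
    return (binary_cross_entropy(p_relevant, is_relevant)
            + softmax_cross_entropy(logits, target_index)
            + euclidean_distance(v_token, v_concept))

print(combined_loss(0.9, 1, [2.0, 0.5, -1.0], 0, [0.1, 0.2], [0.0, 0.3]))
```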
Note that the objectives above can be used for pre-training and further fine-tuning of the models in a way similar to BERT.
HPO Embeddings Prior to the training of the phenotyping model, we build the knowledge graph (KG) embeddings for HPO concepts. Figure 1 (middle and right) shows that this KG model has a Transformer encoder which learns the embeddings of HPO concepts given their definitions. It is designed to encode both the hierarchical connections between HPO concepts and the semantics in definitions of HPO concepts, so that the similar HPO concepts have similar embeddings. Therefore, we consider two learning objectives.
The first learning objective, namely the relational loss $L_4$, encourages neighbouring HPO concepts to have similar embeddings and non-neighbouring HPO concepts to have different embeddings. The objective is implemented with a softmax over the embedding distances between neighbouring and non-neighbouring HPO concepts.
The second objective, the semantic loss $L_5$, encourages the HPO embeddings to encode the semantics of the input definitions; more specifically, we adopt skip-gram negative sampling (Mikolov et al., 2013).
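One way to read the relational loss is as a softmax over negative distances, where a neighbouring concept should receive most of the probability mass. The sketch below follows that reading. The exact formulation is not given in the text, so the use of negative Euclidean distances as scores, a single positive neighbour, and the function names are all assumptions.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relational_loss(anchor, neighbour, non_neighbours):
    """Negative log-probability of the neighbour under a softmax whose
    scores are negative distances to the anchor embedding (assumed form)."""
    scores = [-euclidean(anchor, neighbour)]
    scores += [-euclidean(anchor, other) for other in non_neighbours]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]

# Toy 2-d embeddings (illustrative values only).
close = relational_loss([0.0, 0.0], [0.1, 0.0], [[5.0, 5.0]])
far = relational_loss([0.0, 0.0], [5.0, 5.0], [[0.1, 0.0]])
print(close, far)
```

The loss is small when the neighbour is closer to the anchor than the non-neighbours, and large in the opposite case, which matches the stated goal of pulling neighbouring concepts together.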

Data Augmentation
There are two issues related to the creation of the training data by shallow matching: (1) this data can be too limited to help the model capture contextual phenotypes; (2) rare HPO concepts will not be found in the clinical text used for training and the model will not be able to detect them at the inference time. We are addressing those two problems by creating textual variants for existing HPO-relevant spans and generating context around rare HPO concepts.
HPO-relevant span variants with paraphrasing are used to replace the original spans in the training sentences. We create the variants by using the standard lexical pivoting paraphrasing technique where equivalent phrases in one language are found by "pivoting" over a shared translation into another language (Mallinson et al., 2017). We build the English-French-English pivot Seq2seq model.
Phenotypes are also often inferred from ranges of numerical values. For example, anemia (HP:0001903) can be inferred from "Hgb 5 g/dl". We take advantage of a series of reference laboratory values (from MIMIC (Johnson et al., 2016)) to create surrogates with numerical values for the original names. The named entities for which abnormal results are available are mapped to HPO concepts by an expert.
HPO context variants with synthetic text are created with a Seq2Seq model, which is trained to generate the textual context conditioned on HPO-relevant spans. For example, the sentence "patient was admitted with Angelman Syndrome to the ER" is generated given the input "Angelman Syndrome".

Decision Strategy for Inference
At the phenotype annotation inference stage, we assume that the HPO-relevant spans of frequent HPO concepts can be detected by $p(1_H \mid t_n)$ and $p(H_m \mid t_n)$ with high precision, while the Euclidean distance between the contextualised token embedding $v_{t_n}$ and the HPO embedding $v_{H_m}$ should be able to find those of rare HPO concepts with good recall.
More precisely, we formalise the decision strategy as Algorithm 1.
Algorithm 1: The decision strategy for inferring phenotype annotations. $X$ is the input sequence; $H$ stands for the full set of HPO concepts, and $H_{\mathrm{freq}} \subset H$ includes the most frequent HPO concepts. Initialise the thresholds $\tau_p$ and $\tau_d$ for $p(1_H \mid t_n)$ and the distance function $D(v, u)$, respectively, with pre-defined values; return $\{r_1, r_2, \ldots, r_N\}$.
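Since only the caption of Algorithm 1 is given here, the following sketch is one plausible reading of the decision strategy rather than the algorithm itself. The branch order, the use of `>=` and `<=` against the thresholds, the threshold values, and all function and variable names are assumptions.

```python
import math

def distance(v, u):
    """Euclidean distance D(v, u)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, u)))

def decide(p_relevant, p_freq, v_token, hpo_embeddings, tau_p, tau_d):
    """Return an HPO id (or None) for one token.

    p_relevant     -- p(1_H | t_n)
    p_freq         -- dict: frequent HPO id -> p(H_m | t_n)
    v_token        -- contextualised token embedding
    hpo_embeddings -- dict: HPO id -> concept embedding (full set H)
    """
    # Assumed branch 1: a confident classifier picks a frequent concept.
    if p_relevant >= tau_p:
        return max(p_freq, key=p_freq.get)
    # Assumed branch 2: otherwise fall back to the nearest concept embedding,
    # accepted only if it lies within the distance threshold.
    nearest = min(hpo_embeddings,
                  key=lambda h: distance(v_token, hpo_embeddings[h]))
    if distance(v_token, hpo_embeddings[nearest]) <= tau_d:
        return nearest
    return None

def annotate(token_features, hpo_embeddings, tau_p, tau_d):
    """Apply the per-token decision to a sequence, giving {r_1, ..., r_N}."""
    return [decide(p, pf, v, hpo_embeddings, tau_p, tau_d)
            for p, pf, v in token_features]

# Toy inputs (illustrative values only).
embeddings = {"HP:0002615": [1.0, 0.0], "HP:0001873": [0.0, 1.0]}
freq_probs = {"HP:0002615": 0.8, "HP:0001945": 0.2}
features = [
    (0.9, freq_probs, [0.0, 0.0]),  # confident: frequent branch
    (0.1, freq_probs, [0.1, 0.9]),  # not confident, close to a concept
    (0.1, freq_probs, [5.0, 5.0]),  # not confident, far from all concepts
]
results = annotate(features, embeddings, tau_p=0.5, tau_d=0.5)
print(results)
```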

Experimental Setup
This section introduces the datasets, implementation details, baselines and evaluation metrics. (Ive et al., 2016) and 1.5M generated notes by using data augmentation given PubMed abstracts.
Ontologies In the self-supervised setting, we consider HPO names, synonyms, abbreviations from the HPO as well as Unified Medical Language System (UMLS) (Bodenreider, 2004a) with exact match in clinical text as training samples.

Datasets
The following datasets are used as test data in the self-supervised setting, as well as training data in the supervised fine-tuning experiments.
Annotation Procedure To collect supervised datasets for evaluation and fine-tuning, we have annotated EHRs with HPO concepts with the help of three expert clinicians. The EHRs were pre-annotated with HPO concepts by keyword matching, and the annotations were then corrected by the three clinicians by consensus. The clinicians were specifically asked to identify contextual synonyms such as "drop in blood pressure" and "BP of 79/48" for Hypotension (HP:0002615).
MIMIC We have created our own sub-corpus of 242 discharge summaries from MIMIC-III with gold annotations. We used 146 EHRs for fine-tuning in the low-resource setting. A further 48 EHRs are reserved for validation and 48 for testing in both the self-supervised and supervised settings.
COVID We have collected and annotated two COVID datasets of short radiology reports: (1) COVID-I has 67 radiology reports from the Italian Society of Medical and Interventional Radiology and (2) COVID-II is the international dataset with 100 radiology reports presented by (Cohen et al., 2020). From COVID-II, we have selected the patients with a diagnosis of COVID-19 viral pneumonia. We took all the unique patients and extracted the longest records (in terms of token count) for those patients. Reports from both datasets often contain not only the findings but also a brief patient history. Both datasets are used as test sets for the self-supervised model. In the experiments with supervision, COVID-I was used for fine-tuning and COVID-II for testing.
PubMed To ensure comparison with previous work, we also present our results for the PubMed dataset provided by (Groza et al., 2015), which contains 228 abstracts annotated by the creators of HPO. The common HPOs in this dataset are neurodevelopmental and skeletal disorders (e.g. Angelman syndrome), a group of phenotypes quite different from those represented in the MIMIC and COVID data. An important difference between our annotation procedure described above and the human annotation of the PubMed data is that the latter instructed annotators to mark HPO-relevant spans only if they were presented in a canonical form close to HPO names: for example, "hypoplastic nails" and "nail hypoplasia" were included, but not "nails were hypoplastic". We re-use the random split of 40 abstracts for training and 188 for testing, following NCR's setting (Arbabi et al., 2019). Statistics for the datasets are given in Table 1.

Implementation Details
The Transformer encoders in Figure 1 are initialised with ClinicalBERT; the two classifiers are two dense layers, and the pooling layer concatenates max and average pooling. The maximum input length is 64 tokens. The proposed models are pre-trained for 100k steps and fine-tuned for 5k steps with batch size 64. The set of frequent HPO concepts, with $|H_{\mathrm{freq}}| = 400$, is decided by keyword matches. For data augmentation, we train a Seq2Seq Transformer model on a range of parallel English-French corpora in the biomedical field, namely the European Medicines Agency, Corpus of Parallel Patent Applications and PatTR corpora. The Seq2Seq model is based on OpenNMT (Klein et al., 2017). More details are given in Appendix C.

Setups
In the self-supervised setting, we train our models using either the EHR corpus for MIMIC and COVID (E) or the scientific literature corpus for PubMed (S). We experiment with two setups, with and without data augmentation. We also evaluate the efficiency of our training objectives for pre-training and fine-tune our models with all the available supervised data.
However, in the real-life clinical setting, human annotations are very costly, so particular attention should be paid to learning efficiency with a very small amount of data. We simulate this low-resource scenario and analyse the trade-off between annotation cost and performance benefit for our model. To be more precise, we run a set of experiments in which we pick a certain percentage of training examples each time according to one of the following strategies: (1) Random sampling: the samples are selected at random; (2) Uncertainty-based sampling: the entropy score based on $p(H_m \mid t_n)$, $m \in \{1, 2, \ldots, M\}$, is computed to measure the uncertainty of the self-supervised model for each sample, and the samples with the highest uncertainty scores are selected; (3) Oracle: we count the number of mismatched phenotypes between the keyword-based and gold annotations, and the samples with the most mismatches are selected.
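The uncertainty-based strategy can be sketched as follows. The text specifies an entropy score over $p(H_m \mid t_n)$ but not how token-level entropies are aggregated per sample, so averaging over tokens is an assumption here, as are the function names and the toy probabilities.

```python
import math

def entropy(probs):
    """Shannon entropy of one distribution p(H_m | t_n) over concepts."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def uncertainty_score(token_distributions):
    """Assumption: a sample's score is the mean entropy over its tokens."""
    total = sum(entropy(d) for d in token_distributions)
    return total / len(token_distributions)

def select_most_uncertain(samples, fraction):
    """samples: dict sample_id -> list of per-token distributions.
    Returns the ids of the top `fraction` most uncertain samples."""
    k = max(1, math.ceil(fraction * len(samples)))
    ranked = sorted(samples,
                    key=lambda s: uncertainty_score(samples[s]),
                    reverse=True)
    return ranked[:k]

# Toy pool of four samples with one token each (illustrative values only).
pool = {
    "a": [[0.5, 0.5]],
    "b": [[0.99, 0.01]],
    "c": [[0.7, 0.3]],
    "d": [[0.9, 0.1]],
}
selected = select_most_uncertain(pool, 0.5)
print(selected)
```

Samples whose predicted distributions are close to uniform are ranked first, so the annotation budget goes to the cases on which the self-supervised model is least decided.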

Baselines
As baselines in the self-supervised setting, we report (1) Keyword: a naive method that simply matches HPO names, synonyms and abbreviations to text spans, (2)  In the selective supervision setting, we use pretrained models and fine-tune them on the datasets. More specifically, we use (1)

Metrics
We report micro-averaged Precision, Recall and F1-score at the document level. Following best practices and to make our work comparable with others, we adopt the evaluation strategy of (Liu et al., 2019a). Thus, we compute the following scores: (1) Exact match: only the exact same HPO annotations are counted as correct. (2) Generalised match: both the predicted and target HPO annotations are first extended to include all ancestors in HPO up to Phenotypic Abnormality (HP:0000118) (exclusive). Then the HPO annotations are de-duplicated for each document and the scores are computed.
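The two matching schemes can be illustrated with a small sketch. The one-edge hierarchy below is a toy assumption built from the Respiratory Distress / Dyspnea pair discussed in this paper, and the function names are hypothetical; a real evaluation would load the full HPO hierarchy.

```python
ROOT = "HP:0000118"  # Phenotypic Abnormality, excluded from the expansion

def ancestors(hpo_id, parents, root=ROOT):
    """All ancestors of hpo_id, excluding the root."""
    seen = set()
    stack = [hpo_id]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent != root and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def generalise(ids, parents, root=ROOT):
    """Extend a document's annotation set with ancestors (sets de-duplicate)."""
    expanded = set(ids)
    for hpo_id in ids:
        expanded |= ancestors(hpo_id, parents, root)
    expanded.discard(root)
    return expanded

def micro_prf(pred_docs, gold_docs):
    """Micro-averaged precision, recall and F1 over per-document id sets."""
    tp = sum(len(p & g) for p, g in zip(pred_docs, gold_docs))
    n_pred = sum(len(p) for p in pred_docs)
    n_gold = sum(len(g) for g in gold_docs)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy hierarchy (assumption): Respiratory Distress is a child of Dyspnea.
parents = {"HP:0002098": ["HP:0002094"]}
pred = [{"HP:0002098"}]
gold = [{"HP:0002094"}]

exact = micro_prf(pred, gold)
generalised = micro_prf([generalise(p, parents) for p in pred],
                        [generalise(g, parents) for g in gold])
print(exact, generalised)
```

Under exact match the prediction counts as a miss, whereas under generalised match the shared ancestor gives partial credit to a prediction that is more specific than the gold label.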

Results and Discussion
This section discusses the results for the self-supervision and selective supervision settings.
Self-Supervised Setting We report results of the self-supervised model for the MIMIC, COVID, and PubMed datasets in Table 2, which compares the proposed model to the previous SOTA for the phenotyping task. Our principal observation is that our method outperforms all the baselines in terms of F1 and Recall on the MIMIC and COVID datasets for both the exact and generalised matches. For example, for the exact match, our best models obtain F1 gains of at least 0.02, 0.05, and 0.02 and Recall gains of at least 0.04, 0.02, and 0.01 for MIMIC, COVID-I and COVID-II, respectively. This confirms the efficiency of our methodology for the detection of contextual synonyms in clinical text.
We note that our method does not give better performance for the PubMed dataset. We hypothesise that this happens due to the difference of gold annotation standards, as well as the fact that this dataset is oriented towards the detection of rare phenotypes with less frequent context patterns that are hence difficult to learn for our model.

Low-Resource Setting
In this setting, we first study the efficiency of our self-supervised objectives for fine-tuning. Results are in Table 3 (more in Appendix B). Naturally, fine-tuning leads to better automatic annotation accuracy on specific datasets. Our pre-training procedure is efficient and outperforms BERT-based models with at least 0.09, 0.16, 0.35 absolute increase in F1 (exact match) for the three datasets. Our analysis of the trade-off between annotation cost and performance benefit demonstrated that, with only 20% of the training samples selected using the uncertainty criterion, our fine-tuned model is able to achieve better F1 than ClinicalBERT fine-tuned on the full training sets (see Figure 2).
The HPOs are sparse (fewer than 7 annotations on average) in the datasets, as shown in Table 1 [...] balled outputs of our MIMIC and COVID-I self-supervised model that achieves the best gain.
Our first observation is that our model is successful in capturing HPO-relevant contextual synonyms, which contributes to its higher recall. For example, "low pressure" and "hypotensive" are associated with Hypotension (HP:0002615) and "low platelets" with Thrombocytopenia (HP:0001873). 7 Errors in the prediction mainly concern subtle distinctions between closely related phenotypes: e.g., "shortness of breath" triggers the prediction Respiratory Distress (HP:0002098), whereas the gold label Dyspnea (HP:0002094) is the generalisation of Respiratory Distress.
Table 2: The proposed models in the self-supervised setting (without fine-tuning) achieved the best recall and F1 on the MIMIC and COVID clinical text datasets. On PubMed, which is scientific literature, our model clearly benefited from augmented data. Keyword, NCR and NCBO are reported as they achieve the top F1 among the self-supervised baselines (Section 4.5); full results are reported in Appendix A. The notations "Ours (E)" and "Ours (S)" refer to the models pre-trained on the EHR corpus and the scientific literature corpus, respectively.
Table 3: The proposed model with full fine-tuning achieved the best precision, recall and F1 scores on MIMIC, COVID-II and PubMed. COVID-I is not reported as it is used to fine-tune the corresponding model. Only fine-tuned ClinicalBERT is reported as a baseline because it achieves overall better F1 than fine-tuned BERT, BioBERT and SciBERT. Full results are available in Appendix B. The notations "Ours (E)" and "Ours (S)" refer to the models pre-trained on the EHR corpus and the scientific literature corpus, respectively.
Figure 2: In the low-resource setting with selective supervision, we pick subsets with 20%, 40%, 50%, 60%, 80% labelled data to fine-tune. The uncertainty sampling strategy is consistently better than the other two strategies on MIMIC and is then applied on COVID-II and PubMed. The proposed models outperform fine-tuned BERT-based models with as little as 20% of labelled data. Details in Table 9 in Appendix B. Best viewed in colour.

Extrinsic Evaluation
We evaluate the benefit of using phenotypes extracted by our models as features to enhance performance on downstream tasks. Following the setting by (Harutyunyan et al., 2019) with three public ICU benchmarks based on MIMIC-III, we train LSTMs with different input features: (1) 17 structured clinical features selected by (Harutyunyan et al., 2019), such as heart rate and temperature, or (2) structured clinical features plus phenotypes annotated by NCR, ClinicalBERT and our fine-tuned model, respectively. The patients with both structured clinical features and textual notes are collected, and as a result, there are 21,346 patients (25,106 admissions) for training (with 4-fold cross validation) and 3,824 patients (4,497 admissions) for testing. Table 4 shows that the LSTMs which are fed with structured clinical features and phenotypes annotated by our model are consistently better than others on all three benchmarks. This demonstrates that increasing recall in phenotyping is essential for downstream tasks.
7 All examples hereinafter are paraphrased.

Conclusion
In this paper, we have proposed a deep self-supervised phenotype annotation approach relying on contextualised word embeddings and data augmentation techniques. Our experimental results in a challenging sparse multi-class setting, with over 15,000 candidate HPO concepts, indicate that our methodology is particularly efficient at detecting contextual mentions of phenotype concepts in clinical text. We demonstrate that increasing phenotyping recall is essential for downstream tasks.

Ethics Considerations
The study has been carried out in accordance with relevant guidelines and regulations for the MIMIC-III data. Other data used in this study can be accessed without any preliminary requests. Clinical experts received consulting fees for their work. The purpose of the developed models is to extract phenotypic information from unstructured healthcare data. This information is only to assist human medical experts in their decisions. Before deployment in an actual clinical setting, our methodology is subject to systematic debugging, extensive simulation, testing and validation under the supervision of expert clinicians.
Table 9: Results on MIMIC, COVID-II and PubMed in the selective supervision setting. The F1 scores of exact match correspond to Figure 2.