Generalizing over Long Tail Concepts for Medical Term Normalization

Medical term normalization consists of mapping a piece of text to one of a large number of output classes. Given the small size of the annotated datasets and the extremely long tail distribution of the concepts, it is of utmost importance to develop models that are capable of generalizing to scarce or unseen concepts. An important attribute of most target ontologies is their hierarchical structure. In this paper we introduce a simple and effective learning strategy that leverages such information to enhance the generalizability of both discriminative and generative models. The evaluation shows that the proposed strategy produces state-of-the-art performance on seen concepts and consistent improvements on unseen ones, also allowing for efficient zero-shot knowledge transfer across text typologies and datasets.


Introduction
Term normalization is the task of mapping a variety of natural language expressions to specific concepts in a dictionary or an ontology. It is a key component of information processing systems, and it is extensively used in the medical domain. In this context, term normalization is often used to map reported adverse events (AEs) related to a drug to a medical ontology, such as MedDRA (Brown et al., 1999). This is a challenging task, due to the high variability of natural language input (i.e., from the informality of social media and conversational transcripts to the formality of medical and legal reports) and the high cardinality and long tail distribution of the output concepts. AEs are usually mappable to different levels of the same ontology: low-level concepts, which are closer to layman terms, and higher-level concepts, which encompass the meaning of multiple low-level concepts. In MedDRA,¹ these two sets of concepts are called Lowest Level Terms (LLT) and Preferred Terms (PT) respectively; both of them have a very high cardinality (48,713 for LLT and 24,571 for PT, in MedDRA version 23.1). Currently this problem is addressed with large pretrained language models (Gonzalez-Hernandez et al., 2020), finetuned on medical term normalization datasets, such as SMM4H (Gonzalez-Hernandez et al., 2020) or CADEC (Karimi et al., 2015). However, these datasets contain at most 5,000 samples, distributed over a few PT/LLT classes with a long tail distribution (see Figure 1). Due to the size and distribution of these datasets, the resulting models usually perform well on examples seen during training, but struggle to generalize to rare or unseen samples. To improve the generalization capabilities of the models on long tail concepts, in this paper we propose leveraging the hierarchical nature of the medical ontology to enrich large language models with domain knowledge before finetuning on a given training set. Extensive experimental evaluation on three different datasets shows that the proposed strategy can be
successfully applied to various model typologies, and that it consistently outperforms other mainstream learning strategies, showing generalization capabilities not only across the long tail distribution, but also across text typologies and datasets. The code and resources needed to replicate our experiments and test our learning strategy are publicly available.²

Related Work
Medical term normalization is generally regarded as either a classification or a ranking problem (Yuan et al., 2022). In the former case, a neural architecture encodes the term into a hidden representation and outputs a distribution over the classes (Limsopatham and Collier, 2016; Tutubalina et al., 2018; Niu et al., 2019), but this approach is difficult to scale to ontologies containing thousands of concepts, due to the absence of comprehensive datasets. In the ranking approach, on the other hand, the goal is to rank concepts by their similarity to the input term (Leaman et al., 2013; Li et al., 2017; Sung et al., 2020): a system is trained on binary classification, where terms and matching concepts are the positive samples, while terms and non-matching concepts are the negative ones. The raw output of the model is then used to rank the concepts.
Recent work successfully combined the two approaches. Ziletti et al. (2022) presented a system mixing a BERT-based classifier (Devlin et al., 2019) and a zero/few-shot learning method that incorporates label semantics in the input instances (Halder et al., 2020), showing improved performance both in single-model and in ensemble settings.
Finally, systems like CODER (Yuan et al., 2022) and SapBERT (Liu et al., 2021) introduced novel contrastive pretraining strategies that leverage UMLS to improve the medical embeddings of BERT-based models. While SapBERT leverages self-alignment methods, CODER maximizes the similarities between positive term-term pairs and term-relation-term triples, and it claimed state-of-the-art results on several tasks, including zero-shot term normalization. Another recent work by Zhang et al. (2021) introduced an even more extensive pretraining procedure, based on self-supervision and a combination of traditional masked language modelling with contrastive losses. The strategy proved to be extremely effective for medical entity linking, a kind of term normalization which makes use of the full original context (instead of using only the AE).
Proposed Learning Strategy: OP+FT

Let's consider a target ontology (e.g., MedDRA v23.1) containing two sets of concepts PT = {p_i} and LLT = {ℓ_i}. The ontology is structured so that every ℓ_i has exactly one parent p_j: parent(ℓ_i) = p_j, but each p_j can be the parent of many ℓ_i. Given a set of adverse events AE = {a_i}, every a_i can be univocally mapped to a p_j: norm(a_i) = p_j.
Our objective is to train a large language model M to encode norm: given a sample (a_i, p_j) such that norm(a_i) = p_j, we want M(a_i) = p_j.
We propose a learning strategy based on the hierarchical structure of the ontology, composed of two steps: Ontology Pretraining and Finetuning.
During the first step, we expose the language model M to all possible output classes p_j by leveraging the intrinsic hierarchical relation between LLT and PT. Specifically, we use the parent relation to create a new set of training samples from the ontology, pairing each ℓ_i with its parent concept p_j. In the case of MedDRA, the new set of samples contains 48,713 (ℓ_i, p_j) pairs, where each p_j appears multiple times, associated with different ℓ_i. For example, the PT "asthenia" will appear in the samples (weakness, asthenia) and (loss of energy, asthenia). As LLTs are more informal than PTs, the language model M can be pretrained on this new set of data to gain general knowledge about all the target classes. This pretraining set is highly similar to our target task (M(a_i) = p_j), increasing the model's transfer capability. We call this process "Ontology Pretraining" (OP).
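As a concrete illustration, the OP sample construction can be sketched as follows (a minimal sketch; the llt_parent mapping and the function name are illustrative, not taken from the released code):

```python
def build_op_pairs(llt_parent):
    """Turn the LLT -> PT hierarchy into (input, label) training samples.

    llt_parent: dict mapping each LLT string to its unique parent PT.
    Returns one (llt, pt) pair per LLT, so each PT appears once per child LLT.
    """
    return [(llt, pt) for llt, pt in llt_parent.items()]


# Toy fragment of the hierarchy around the PT "asthenia":
llt_parent = {
    "weakness": "asthenia",
    "loss of energy": "asthenia",
    "sleepiness": "somnolence",
}
pairs = build_op_pairs(llt_parent)
# pairs contains ("weakness", "asthenia"), ("loss of energy", "asthenia"), ...
```

The language model is then pretrained on these pairs exactly as it would be finetuned on (AE, PT) samples.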
The second step consists of finetuning (FT) an OP model on a specific term normalization dataset, which maps every AE a_i to the corresponding PT p_j. This step is crucial because the OP model lacks specific knowledge about the natural language style of real-world samples. Finetuning also exploits the dataset's sample distribution to boost the model's accuracy on the specific set of p_j in the training set. Note that the FT step can also be applied to a regular model without OP, resulting in regular finetuning.
We hypothesize that applying OP+FT to a discriminative or a generative language model M will improve its performance on concepts seen in the training set, while making it generalize better to long tail and unseen concepts.
Experimental Setting

Datasets
To investigate the performance of our learning strategy, we used three English datasets for MedDRA term normalization, with different writing styles. SMM4H (Gonzalez-Hernandez et al., 2020). Public dataset for the SMM4H 2020 challenge (Task 3, AE normalization). It contains 2,367 tweets, 1,212 of which report AEs in highly informal language, mapped to a PT/LLT. CADEC (Karimi et al., 2015). Public dataset containing 1,250 posts from the health forum "AskaPatient", containing user-reported AEs mapped to a PT/LLT. The language is informal, but still more medically precise than SMM4H. PROP. Proprietary dataset provided by Bayer Pharmaceuticals, containing 2,010 transcripts of phone calls with healthcare professionals reporting their patients' AEs, mapped to PTs. The language is more formal and medically accurate.

Data Preparation
All datasets were preprocessed to obtain samples containing only (a_i, p_j) pairs. The samples in CADEC and SMM4H which were labelled with an ℓ_i ∈ LLT were re-labelled with parent(ℓ_i) = p_j ∈ PT, obtaining a uniform output space containing only PT concepts for all datasets.
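This relabelling step can be sketched as follows, assuming the LLT-to-PT parent relation is available as a dict (function and variable names are illustrative):

```python
def relabel_to_pt(samples, parent):
    """Map every (ae, label) sample to an (ae, pt) pair.

    samples: list of (adverse_event, label), where label may be an LLT or a PT.
    parent:  dict mapping each LLT to its parent PT (PT labels are not keys).
    Labels already at PT level are kept unchanged.
    """
    return [(ae, parent.get(label, label)) for ae, label in samples]


parent = {"weakness": "asthenia"}
data = [("weak knees", "weakness"), ("no strength", "asthenia")]
# After relabelling, both samples carry the PT label "asthenia".
relabelled = relabel_to_pt(data, parent)
```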
Since the focus of this work is on the generalization capabilities of the models, it is important to test the models on different sets of unseen labels. For this reason, we created three random splits of train/test samples using a 60:40 proportion, instead of using the public fixed train/test split. Given a train and a test set, every test sample with label p_j falls into one of the following categories: IN, if p_j is present in the training set; OUT, if p_j is not present in the training set. The most important set of samples for measuring the generalization capabilities of the models is OUT.
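The IN/OUT partition of a test set can be sketched as follows (function name illustrative):

```python
def split_in_out(train, test):
    """Partition test samples into IN/OUT by PT presence in the train set.

    train, test: lists of (adverse_event, pt) pairs.
    """
    train_pts = {pt for _, pt in train}
    in_samples = [s for s in test if s[1] in train_pts]
    out_samples = [s for s in test if s[1] not in train_pts]
    return in_samples, out_samples


train = [("weak knees", "asthenia")]
test = [("no energy", "asthenia"), ("feel like crap", "malaise")]
in_s, out_s = split_in_out(train, test)
# "asthenia" appears in the train set (IN); "malaise" does not (OUT).
```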
Table 1 reports figures for the resulting datasets. CADEC and PROP contain the largest number of samples (5,866 and 4,453 respectively), while SMM4H is considerably smaller, with only 1,442 samples. The largest datasets also contain the largest number of PTs: 488 for CADEC and 634 for PROP, while SMM4H contains only 274. Most of the PTs are unique to one of the three datasets and do not appear in the other ones, making it impossible to gain a substantial advantage by combining them (see Appendix C). We observe that the percentage of OUT samples varies from 5% to 12%, with SMM4H being the most challenging dataset. The standard deviation is low, showing that the presence of 5-12% OUT samples is a characteristic of the specific dataset, resulting from its long tail PT distribution. Note also that the smaller the dataset, the higher the percentage of OUT samples in the test set.

Models
To test the proposed strategy and observe how it affects generalization, we selected different kinds of widely-adopted models. In particular, we compare PubMedBERT (Gu et al., 2020), Sci5 (Phan et al., 2021), GPT-2 (Radford et al., 2019), CODER (Yuan et al., 2022) and SapBERT (Liu et al., 2021). PubMedBERT (PMB). It was chosen as an example of a BERT-based classifier due to its medical pretraining (PubMed articles) and strong performance on other medical tasks (Gu et al., 2020; Portelli et al., 2021; Scaboro et al., 2021, 2022). GPT-2 and Sci5. GPT-2 was selected as an example of a general-purpose autoregressive language model for text generation, while Sci5 was chosen for its medical pretraining, performed on the same kind of texts as PMB. The models were trained to generate a PT, given an input prompt containing the adverse event. CODER and SapBERT (SapB). To the best of our knowledge, CODER and SapBERT are among the best dataset-agnostic models for medical term embeddings. They were both trained on the UMLS ontology (Bodenreider, 2004), which is a super-set of MedDRA, and tested on several term normalization datasets, showing promising results. Following both original papers, we use CODER and SapBERT to generate embeddings for a_i and for all p_j ∈ PT. We then select as prediction the p_j that maximizes the cosine similarity with a_i.
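The embedding-based prediction step can be sketched as follows (a simplified version that assumes precomputed embedding vectors; the actual models embed terms with BERT-style encoders):

```python
import numpy as np

def predict_pt(ae_vec, pt_vecs, pt_names):
    """Return the PT whose embedding is most cosine-similar to the AE embedding.

    ae_vec:   1-D embedding of the adverse event.
    pt_vecs:  2-D array with one row per candidate PT embedding.
    pt_names: list of PT labels aligned with the rows of pt_vecs.
    """
    # Normalize so that dot products equal cosine similarities.
    ae = ae_vec / np.linalg.norm(ae_vec)
    pts = pt_vecs / np.linalg.norm(pt_vecs, axis=1, keepdims=True)
    return pt_names[int(np.argmax(pts @ ae))]


# Toy 2-D embeddings: the AE vector points closest to "asthenia".
ae_vec = np.array([1.0, 0.1])
pt_vecs = np.array([[0.9, 0.2], [0.0, 1.0]])
pred = predict_pt(ae_vec, pt_vecs, ["asthenia", "malaise"])  # -> "asthenia"
```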
We also trained both models according to our proposed strategy. Both models were trained using the contrastive settings described in their papers and the respective codebases.³
See Appendix A for training details of all models, and Appendix B for more details on the contrastive training of CODER and SapBERT.
Performance is assessed with accuracy; we also report the F1 score in Appendix D, as it can give more insights when classes are unbalanced.

Experimental Results
In an ablation-study fashion, we compare the OP+FT learning strategy with its two components: OP and FT. Figure 2 contains the results for all the tested models and training strategies, and is organized as follows. We display a plot for each dataset, reporting the accuracy of the models on IN samples (•), OUT samples (⋄) and the whole test set (•). The first column shows the performance of basic CODER and SapBERT models without any additional training. We consider their accuracy on OUT (⋄) as our generalization goal, and plot it as solid lines across the chart. The following three columns display the performance of all the models, trained with one of the learning strategies (FT, OP and OP+FT respectively). For tabular results, see Appendix D.
CODER and SapBERT on their own proved to be strong baselines across the three datasets. Looking at the first column, they reach 40-50% accuracy on CADEC and SMM4H (overall, IN and OUT, see solid lines), and around 15-20% overall accuracy (•) on PROP.
All learning strategies seemed to be ineffective on CODER: its performance (gray markers) remains roughly the same across all strategies (FT, OP or OP+FT). A possible explanation for this behaviour is that CODER embeddings are already in an optimal state according to the training objectives, as they have been trained on a very similar task. In fact, CODER generates predictions using the similarity between the embeddings, and the stable performance indicates that there were no drastic changes in the structure of the embedding space.
A clearer effect of the training strategies can be seen on SapBERT (lilac markers), although it is still limited when compared with the other models. SapBERT embeddings are probably more subject to adjustments than CODER's, because the latter was trained for significantly more steps and with more objective functions, leading to less adaptable embeddings.
The following observations apply to the other three models: PMB, GPT-2 and Sci5. The FT strategy (second column), as expected, works really well for IN (•) samples: on CADEC the IN accuracy of all models is over 80%, while it is close to 50% for the other two datasets. However, the OUT accuracy (⋄) is lower than 20% in all cases (significantly lower than the solid line), and reaches 0% for PMB, showing that finetuning alone is not sufficient for classifiers to generalize to OUT samples in this setting.
The OP strategy (third column) brings the OUT accuracy of all models on par with the CODER/SapBERT baselines, while the IN/overall accuracy matches or surpasses them. Comparing OP with FT, we see that the overall accuracy (•) of the model is generally lower for OP. However, the performance on OUT (⋄) samples doubles for generative models, and jumps from 0% to 40% for PMB. This shows that the first step of our proposed learning strategy has the desired effect, as it improves the models' understanding of all the output classes.
Finally, looking at the models trained with the OP+FT strategy (fourth column), we see that they outperform the FT ones on overall and IN accuracy. The effect is particularly strong on the SMM4H dataset (cf. PMB FT, 44% and PMB OP+FT, 70%). At the same time, the performance on OUT (⋄) samples remains similar to the OP models and close to the CODER baseline (gray solid line). The only exception is CADEC, where the performance on OUT is in between the baseline and the accuracy with FT only. This shows that the proposed OP+FT learning strategy can successfully improve the generalization capabilities of various language models, while also improving their overall performance.

We further test the generalization of OP+FT models in zero-shot, cross-dataset term normalization, normalizing the terms of each dataset with models that have been trained on one of the other two. Figure 3 shows the accuracy of GPT-2, with a plot for each test dataset, different training datasets on the x-axis, and one column for each learning strategy (FT or OP+FT). The behaviour of the other models is similar (see Appendix D). In all columns, we observe a drop in overall accuracy (•) between the first data point and the following ones (e.g., cf. CADEC trained on CADEC and CADEC trained on SMM4H). However, this drop is larger for FT models than for OP+FT ones (e.g., 35 vs. 20 points on CADEC). In addition, the OUT accuracy of OP+FT models remains high regardless of the training set. This shows that OP+FT models generalize better than FT models across datasets. Note that generalization is still challenging when moving from a dataset with highly informal language to a formal one (see PROP trained on SMM4H).

Conclusions
In this paper, we shed some light on the importance of generalization for medical term normalization models. We showed that AE normalization models trained with traditional finetuning, despite showing high accuracy on leaderboards, have low generalization capabilities due to the long tail distribution of the target labels. Our proposed training strategy (OP+FT), which leverages the hierarchical structure of the ontology, outperforms traditional models, while also obtaining state-of-the-art results in generalization to OUT samples. This was also demonstrated in a zero-shot normalization setting. OP+FT showed improvements on discriminative and generative language models, while it seems to be less effective on models trained with contrastive losses. This promising technique could also be applied to other tasks with massive output spaces organized in a hierarchical manner.

Limitations
The proposed learning strategy was tested only on the task of medical term normalization (from adverse events to MedDRA concepts). However, it would be interesting to test its effectiveness on other term normalization tasks, beyond MedDRA mapping and outside of the medical domain.
Even restricting the problem to medical term normalization, and using datasets with different text styles, we only focused on English texts. Medical ontologies such as MedDRA and UMLS are released in multiple languages, and the research community is moving towards multi-lingual approaches. In the future, we plan to extend this strategy to other languages (such as Spanish and Chinese) and to test the models' capacity to perform cross-lingual transfer in zero-shot scenarios.

A Training Specifications
• PMB A classification head (24,571 output classes) was added to the base model.
• GPT-2 Given a sample (a_i, p_j), the input prompt for the model was "INPUT: a_i\nMEANING:". The model was trained to complete the sentence with p_j.
• Sci5 Given a sample (a_i, p_j), the input prompt for the model was "normalize: a_i". The model was trained to respond with a string containing p_j.
• CODER / SapBERT Following their original papers, we use CODER/SapBERT to normalize an AE a_i as follows: p̂ = argmax_{p_j ∈ PT} sim(C(a_i), C(p_j)), where sim is a similarity measure (cosine, in our case), and C(·) is the result of embedding a term with CODER/SapBERT. p̂ is the predicted PT, which is compared with the actual one to evaluate the model.

B Sample Creation for Contrastive Training
B.1 CODER

CODER leverages term-term pairs and term-relation-term triples for its contrastive training strategy. We create positive/negative samples for the term-term pairs using AEs having equal/different PTs, and term-relation-term triples connecting AEs whose PTs have the same parent.
For example, let's consider the following (a_i, p_j) samples, for which we also report parent(p_j):

a_i                   | p_j      | parent(p_j)
feel like crap        | malaise  | Asthenic conditions
weak knees            | asthenia | Asthenic conditions
zap me of all energy  | asthenia | Asthenic conditions

This will generate the following training samples for CODER:
• positive term-term: (weak knees, zap me of all energy), because they share the same p_j "asthenia";
• negative term-term: (weak knees, feel like crap) and (zap me of all energy, feel like crap), because they are labelled with different p_j ("asthenia" vs. "malaise");
• positive term-relation-term: (weak knees, RO, feel like crap) and (zap me of all energy, RO, feel like crap), because their p_j share the same parent "Asthenic conditions". RO stands for "Related Other", one of the standard term relations defined in the UMLS ontology, and we use it to encode the "same grandparent" relation.
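The sample generation described above can be sketched as follows (function name illustrative; "RO" is the UMLS relation mentioned in the text):

```python
from itertools import combinations

def coder_samples(samples, parent):
    """Build CODER-style contrastive samples from (ae, pt) pairs.

    samples: list of (adverse_event, pt).
    parent:  dict mapping each PT to its parent concept.
    Returns positive term-term pairs (same PT), negative term-term pairs
    (different PT), and positive (term, "RO", term) triples for AEs whose
    distinct PTs share the same parent.
    """
    pos, neg, triples = [], [], []
    for (a1, p1), (a2, p2) in combinations(samples, 2):
        if p1 == p2:
            pos.append((a1, a2))
        else:
            neg.append((a1, a2))
            if parent[p1] == parent[p2]:
                triples.append((a1, "RO", a2))
    return pos, neg, triples


samples = [("feel like crap", "malaise"),
           ("weak knees", "asthenia"),
           ("zap me of all energy", "asthenia")]
parent = {"malaise": "Asthenic conditions", "asthenia": "Asthenic conditions"}
pos, neg, triples = coder_samples(samples, parent)
```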
This sample generation procedure is repeated for all samples in the three datasets (SMM4H, CADEC and PROP), as well as for the additional samples generated from MedDRA for the OP strategy.

B.2 SapBERT
SapBERT leverages term-term synonym pairs, where the positive pairs belong to the same upper-level concept.
The finetuning script in the GitHub repository requires a list of term pairs belonging to the same concept. In the case of the three datasets (SMM4H, CADEC and PROP), we generate the term pairs as (ℓ_i, a_j), where parent(ℓ_i) = norm(a_j). For the OP strategy, the samples are all possible pairs (ℓ_i, ℓ_j) where parent(ℓ_i) = parent(ℓ_j).
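The two pairing schemes can be sketched as follows (function names illustrative; the actual finetuning script consumes such pairs in its own file format):

```python
from itertools import combinations

def sapbert_dataset_pairs(dataset, llt_parent):
    """Dataset pairs (llt, ae): the LLT's parent PT equals the AE's gold PT."""
    return [(llt, ae)
            for llt, pt in llt_parent.items()
            for ae, gold_pt in dataset
            if pt == gold_pt]

def sapbert_op_pairs(llt_parent):
    """OP pairs: all (llt_i, llt_j) combinations sharing the same parent PT."""
    by_pt = {}
    for llt, pt in llt_parent.items():
        by_pt.setdefault(pt, []).append(llt)
    return [pair for llts in by_pt.values() for pair in combinations(llts, 2)]


llt_parent = {"weakness": "asthenia", "loss of energy": "asthenia",
              "sleepiness": "somnolence"}
dataset = [("weak knees", "asthenia")]
ft_pairs = sapbert_dataset_pairs(dataset, llt_parent)
op_pairs = sapbert_op_pairs(llt_parent)
```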

C Dataset Comparison
Most of the PTs present in the three datasets are unique to a specific dataset, making it really challenging to perform transfer learning from one to the other without dealing with long-tail and unseen concepts. The Venn diagram in Figure 4 shows the number of PT concepts shared between the datasets: 706 PTs are unique to one of the three datasets, 276 are shared among at least two datasets, and only 98 appear in all three of them. Of all the PTs in PROP, 64% (410 out of 634) are unique to this dataset alone, making it the most challenging one for cross-dataset normalization. The next most challenging datasets are CADEC (41% unique PTs) and SMM4H (34% unique PTs).

D Complete Results
Tables 3, 4 and 5 include the full results of all tested models (both accuracy and F1 score). Tables 6, 7, 8, 9 and 10 report the accuracy of all the cross-dataset experiments (one table for each model).

Figure 1: Long tail distribution of PTs in the datasets used for this paper.

Figure 2: Accuracy of all models on the three datasets on IN (•), OUT (⋄) and all (•) samples.

Figure 3: Cross-dataset accuracy for GPT-2 (FT and OP+FT) on OUT (⋄) and all (•) samples. One plot for each test dataset; the x-axis reports the training dataset.

Figure 4: Venn diagram of the shared/unique PT concepts for the three datasets.

Table 1: Dimensions of the datasets, reporting the average figures over the three train/test splits (± std), as well as the number of unique PT terms contained in each dataset.

Table 2: Further details about training parameters.

Table 5: Full metrics (accuracy and F1 score) of all tested models on the PROP dataset.

Table 6: Cross-dataset accuracy for CODER FT and CODER OP+FT on the three datasets.

Table 7: Cross-dataset accuracy for SapBERT FT and SapBERT OP+FT on the three datasets.

Table 8: Cross-dataset accuracy for PMB FT and PMB OP+FT on the three datasets.

Table 9: Cross-dataset accuracy for GPT-2 FT and GPT-2 OP+FT on the three datasets.

Table 10: Cross-dataset accuracy for Sci5 FT and Sci5 OP+FT on the three datasets.