Conceptual Grounding Constraints for Truly Robust Biomedical Name Representations

Effective representation of biomedical names for downstream NLP tasks requires the encoding of both lexical and domain-specific semantic information. Ideally, the synonymy and semantic relatedness of names should be consistently reflected by their closeness in an embedding space. To achieve such robustness, prior research has considered multi-task objectives when training neural encoders. In this paper, we take a next step towards truly robust representations, which capture more domain-specific semantics while remaining universally applicable across different biomedical corpora and domains. To this end, we use conceptual grounding constraints which more effectively align encoded names to pretrained embeddings of their concept identifiers. These constraints are effective even when using a Deep Averaging Network, a simple feedforward encoding architecture that allows for scaling to large corpora while remaining sufficiently expressive. We empirically validate our approach using multiple tasks and benchmarks, which assess both literal synonymy and more general semantic relatedness.


Introduction
ICD-10   SNOMED-CT   Names
F60.1    C0564504    schizoid fantasy / schizoid fantasy - mental defense mechanism
         C0338969    introverted personality disorder / introverted personality
         C0036339    schizoid personality disorder / unspecified schizoid personality disorder

Table 1: Example of SNOMED-to-ICD-10 mappings. The synonym sets for the SNOMED-CT concepts C0564504, C0338969, and C0036339 are fused into one large set of semantically related names for the ICD-10 code F60.1.

Biomedical and clinical free text contains mentions of biomedical terms which can provide valuable information for text mining applications. Such textual mentions, as well as their corresponding reference names in biomedical ontologies, can often be expressed in various synonymous surface forms (e.g. pleuritic pain vs. pain breathing), which is challenging for downstream applications. Effective dense representation of these biomedical names
has been mainly investigated through the normalization task of disorder linking, which consists of matching disease mentions to reference terms of concept identifiers in ontologies (e.g. matching the mention myocardial depression to the reference term Myocardial Dysfunction) (Leaman et al., 2015). While past research has gradually shifted its focus from lexical representations (Leaman et al., 2013;D'Souza and Ng, 2015) to dense distributed representations (Limsopatham and Collier, 2016;Li et al., 2017;Phan et al., 2019;Sung et al., 2020), encoders are still typically optimized towards normalization tasks, which are focused on resolving word-level analogies between synonymous biomedical names.
Recent research has focused more explicitly on encoding domain-specific biomedical semantics by training biomedical name representations that are robust, i.e., reflecting the synonymy and semantic relatedness of names by their closeness in the embedding space, preferably in a consistent way that generalizes across different biomedical subdomains and corpora. To date, the most effective approaches have applied some form of conceptual grounding: minimizing the distance between representations of names on the one hand, and pretrained embeddings of their concept identifiers on the other. These concept embeddings are supposed to reflect domain-specific semantics, and are constructed using a variety of different techniques, including distributional similarity of graph relations and distributional similarity of textual occurrences in large-scale free text, as well as combinations thereof (Kartsaklis et al., 2018; Phan et al., 2019).
While knowledge graph embeddings of biomedical concepts can encode a variety of semantic relations, Kartsaklis et al. (2018) show that such graph embeddings need to incorporate textual features to make them effective targets for conceptual grounding. Such features help to translate textual representations of names to the topology of the concept embedding space, which otherwise reflects only ontological information. In other words, concept embeddings are mostly useful targets for grounding to the extent that name representations can be efficiently mapped to them by the encoder architecture. This raises the question whether we can increase the effectiveness of conceptual grounding by better aligning the topology of the created name embedding space and the pretrained concept embedding space. In this paper, we investigate how to maximally exploit low-cost concept embeddings, which can be constructed using only pretrained word embeddings and sets of biomedical synonyms or semantically related names.
To this end, we enrich a siamese neural network encoder for biomedical names with 2 novel constraints which are meant to effectively map encoded names to pretrained concept embeddings. The first constraint, which we call the linear constraint, applies canonical correlation analysis (CCA) to pretrained embeddings of names and their concepts to project them into a space which improves their linear mapping. These transformed embeddings are then used as input representations for the neural encoder. The second constraint adds a training objective which we call prototypical grounding: minimizing the distance between a pretrained concept embedding and the average of all the encoded names belonging to that concept. This average is an approximation of the prototypical representation of a concept in the name embedding space.
While the linear constraint involves a simple preprocessing step, the prototypical grounding constraint can be computationally expensive for large-scale corpora. Therefore, we use a simple Deep Averaging Network (DAN) (Iyyer et al., 2015) as encoder to prove the effectiveness and scalability of our approach, even for a neural architecture that, unlike LSTMs, has no access to word order and, unlike Transformers, cannot apply attention over specific word combinations. We train and evaluate our encoder on different categorizations of biomedical names. For instance, Table 1 shows how concepts from the SNOMED-CT ontology capture literal synonymy, while these concepts can also be grouped into the ICD-10 coding system, which reflects more general semantic relatedness. Our experimental results show that our approach is effective for both types of categorizations, as well as for various ontologies and benchmarks.

Related work
Biomedical name encoders A variety of neural architectures have been proposed for encoding biomedical names. Kartsaklis et al. (2018) use a multi-sense LSTM with attention over different word senses. This attention is conditioned on the context of the biomedical name. Phan et al. (2019) include a character-level bidirectional LSTM in a word-level bidirectional LSTM which extracts a fixed-size representation using max pooling over all dimensions, followed by a linear transformation. Sung et al. (2020) fine-tune pretrained context-sensitive BioBERT (Lee et al., 2019) representations and use them in tandem with lexical TF-IDF representations. While past research has explicitly investigated the role of various training objectives, even jointly in multi-task training regimes, the specific impact of encoder architectures has received little attention or systematic comparison.
Averaging networks Research on sentence embeddings and paraphrasing has consistently found that simple encoding procedures such as averaging of word embeddings can rival or even outperform complex neural architectures on tasks for which those are finetuned (Wieting et al., 2016;Shen et al., 2018;Wieting and Kiela, 2019). Moreover, research on Deep Averaging Networks (Iyyer et al., 2015) has found that feedforward neural networks that use averaged word embeddings as input can be tuned to textual classification tasks such as sentiment analysis if the network is sufficiently large and/or deep. This way, small differences in the input can be magnified by the network where relevant.
Prototypical networks While successful approaches to few-shot learning such as Matching Networks (Vinyals et al., 2016) optimize representation models on the level of single instances, follow-up work has shown the benefits of simultaneously learning class representations using those same models. For instance, prototypical networks (Snell et al., 2017) train a neural encoder with objectives that involve class prototypes, which are created by averaging the encodings of all instances that belong to a single class. In this paper, we include a training objective for our encoder which forces synonymous or semantically related biomedical names to form class prototypes that approximate the pretrained embedding of their concept identifier.

Encoding model

3.1 Encoder architecture
Our encoder is a Deep Averaging Network (DAN) (Iyyer et al., 2015) which extracts a fixed-size representation for an input name n:

f(n) = enc(u_n), \quad u_n = \frac{1}{|N_t|} \sum_{t \in N_t} u_t    (1)

where N_t is the bag of tokens of the name, u_t is the pretrained word embedding of token t, u_n is the name embedding created by averaging the pretrained word embeddings of all tokens, and enc is a feedforward neural network with the Rectified Linear Unit (ReLU) as non-linear activation function. As pretrained word embeddings we use 300-dimensional fastText (Bojanowski et al., 2017) representations which we train on 76M sentences of preprocessed MEDLINE articles released by Hakala et al. (2016). This fastText model also allows for constructing word embeddings for out-of-vocabulary tokens by composing character n-gram embeddings.
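The averaging step and the feedforward encoder above can be sketched as follows. This is an illustrative toy implementation, not the authors' code: the vocabulary and its vectors are randomly generated stand-ins for the 300-dimensional fastText embeddings, with dimensions shrunk for readability.

```python
import numpy as np

# Toy pretrained word embeddings (stand-ins for the 300-d fastText vectors;
# dimensionality reduced to 4 for illustration).
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in ["myocardial", "depression", "dysfunction"]}

def name_embedding(name):
    """u_n: average of the pretrained embeddings of all tokens in the name."""
    tokens = name.lower().split()
    return np.mean([vocab[t] for t in tokens], axis=0)

class DAN:
    """Deep Averaging Network: feedforward ReLU layers applied to u_n."""
    def __init__(self, dim, hidden, seed=0):
        r = np.random.default_rng(seed)
        self.W1 = r.normal(scale=0.1, size=(dim, hidden))
        self.W2 = r.normal(scale=0.1, size=(hidden, dim))

    def encode(self, name):
        h = np.maximum(0.0, name_embedding(name) @ self.W1)  # ReLU hidden layer
        return h @ self.W2  # f(n): fixed-size encoding, regardless of name length

dan = DAN(dim=4, hidden=8)
enc = dan.encode("myocardial depression")
print(enc.shape)  # (4,)
```

Note that, because the input is a bag-of-tokens average, the encoding is invariant to word order, which is exactly the limitation the paper accepts in exchange for scalability.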

Training objectives
Our training objectives optimize the mapping between an encoded name f(n) and the pretrained embedding u_p of its concept p. While in principle any type of pretrained concept embeddings could be used, our experiments use concept embeddings which are simply the average of all pretrained name embeddings belonging to the concept:

u_p = \frac{1}{|C_p|} \sum_{n \in C_p} u_n    (2)

where C_p is the set of names belonging to concept p. These concept embeddings can be constructed entirely from synonym sets only, and have been proven effective in experiments by Phan et al. (2019).
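Constructing such concept embeddings is a one-liner per concept, sketched below with hypothetical name vectors and a made-up concept identifier (the real u_n would themselves be averaged fastText embeddings):

```python
import numpy as np

# Hypothetical pretrained name embeddings u_n, keyed by name.
name_embs = {
    "myocardial depression":  np.array([1.0, 0.0]),
    "myocardial dysfunction": np.array([0.8, 0.2]),
}
# Hypothetical synonym set for a made-up concept identifier.
synonym_sets = {"C0000001": ["myocardial depression", "myocardial dysfunction"]}

def concept_embedding(cui):
    """u_p: average of the pretrained embeddings of all names of the concept."""
    return np.mean([name_embs[n] for n in synonym_sets[cui]], axis=0)

print(concept_embedding("C0000001"))  # [0.9 0.1]
```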
Linear constraint: CCA We apply canonical correlation analysis (CCA) to find linear combinations of the pretrained name embeddings and the pretrained embeddings of their concept identifiers that maximize their correlation. We can then project both the name embeddings and the concept embeddings to this new space for training objectives that use them as input. In order not to lose any information for further training, the projected embedding space has the same dimensionality as the original embedding space.
Siamese triplet loss To enforce embedding similarity between names that are synonyms or semantically related, we use a siamese triplet loss (Chechik et al., 2010). This loss forces the encoding of a biomedical name n to be closer to the encoding of a true synonym syn than to that of a negative sample name neg, within a specified (possibly tuned) margin m:

L_{syn} = \max\big(0,\; m + d(f(CCA(u_n)), f(CCA(u_{syn}))) - d(f(CCA(u_n)), f(CCA(u_{neg})))\big)    (3)

where CCA denotes that the pretrained name embedding used as input for the DAN has first been transformed by the CCA constraint. We take cosine distance as distance function d. To select negative names during training we apply distance-weighted negative sampling (Wu et al., 2017) over all training names.
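The hinge form of this loss can be sketched as follows, with hand-picked toy encodings rather than real model output (negative sampling is omitted; the example simply assumes a negative has already been drawn):

```python
import numpy as np

def cos_dist(a, b):
    """Cosine distance d used in the triplet loss."""
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_loss(anchor, synonym, negative, margin=0.1):
    """Siamese triplet hinge: pull the synonym closer to the anchor than the
    negative sample, up to the margin."""
    return max(0.0, margin + cos_dist(anchor, synonym) - cos_dist(anchor, negative))

# Toy encodings: the synonym is already much closer than the negative,
# so the hinge is inactive and the loss is zero.
anchor   = np.array([1.0, 0.0])
synonym  = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.5])
print(triplet_loss(anchor, synonym, negative))  # 0.0
```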
Prototypical grounding constraint To enforce prototypical grounding, we average the name encodings of all synonyms or semantically related terms belonging to a concept identifier, in order to approximate a prototypical representation of the concept in the name embedding space. We then minimize the cosine distance between this prototypical concept representation and the pretrained embedding of the concept:

L_{proto} = d\Big(\frac{1}{|C_n|} \sum_{n' \in C_n} f(n'),\; u_p\Big)    (4)

where C_n is the set of names sharing the concept p of name n. To avoid overfitting, we enforce this objective using a random dropout of synonyms from C_n, in order to stochastically approximate prototypical similarity to the concept embedding.
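The stochastic prototype with synonym dropout can be sketched as follows; the encodings and the concept embedding are toy values, and the keep-at-least-one rule is an assumption made so the average is always defined:

```python
import numpy as np

def prototypical_loss(encoded_synonyms, concept_emb, rng, dropout=0.5):
    """Cosine distance between a stochastic prototype (average of a random
    subset of the concept's encoded names) and the pretrained concept
    embedding u_p. Synonym dropout acts as a regularizer."""
    keep = rng.random(len(encoded_synonyms)) >= dropout
    if not keep.any():  # always keep at least one name (assumption for the sketch)
        keep[rng.integers(len(encoded_synonyms))] = True
    proto = np.mean(encoded_synonyms[keep], axis=0)
    cos = proto @ concept_emb / (np.linalg.norm(proto) * np.linalg.norm(concept_emb))
    return 1.0 - cos

rng = np.random.default_rng(0)
encoded = np.array([[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]])  # f(n) for 3 synonyms
u_p = encoded.mean(axis=0)                                # pretrained concept embedding
loss = prototypical_loss(encoded, u_p, rng)
print(loss < 0.05)  # prototype stays close to u_p, so the loss is small
```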
This constraint implies that the dimensionality of the encoder output should be the same as the dimensionality of the pretrained concept embeddings. However, if the dimensionality of the concept embeddings is smaller than the desired output dimensionality, this could be solved using e.g. random projections, which work well for increasing the dimensionality of neural encoder inputs (Wieting and Kiela, 2019).
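Such an up-projection is a single matrix multiply with a fixed random matrix. The sketch below uses illustrative dimensions (100-d concept embeddings, 300-d encoder output), not values from the paper:

```python
import numpy as np

# Up-project 100-d concept embeddings to a 300-d encoder output space with a
# fixed random Gaussian projection (dimensions are illustrative).
rng = np.random.default_rng(0)
d_in, d_out = 100, 300
R = rng.normal(scale=1.0 / np.sqrt(d_out), size=(d_in, d_out))

concepts = rng.normal(size=(5, d_in))   # 5 hypothetical concept embeddings
projected = concepts @ R
print(projected.shape)  # (5, 300)
```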
Multi-task setup Our multi-task setup simply sums the siamese triplet loss and the prototypical grounding loss:

L = L_{syn} + L_{proto}    (5)

where both losses use either the original pretrained name and concept embeddings, or their CCA projections. While the proportion of both losses could be tuned using coefficients, our experiments show this to be redundant, since both losses systematically converge to zero or near-zero values in all experiments.


Disorder names: SNOMED-CT

Following Phan et al. (2019), we use SNOMED-CT disorder names as biomedical synonym sets. However, since this data is of a diverse nature and quality, we try to select the most natural and coherent data by matching it with a large target domain of processed MEDLINE articles released by Hakala et al. (2016), containing 76M sentences with 120M unique noun phrases scraped from 4K articles. We match disorder names with our target domain in 4 consecutive steps. Firstly, we only retain disorder names of which all tokens appear in the vocabulary of our target domain. Secondly, many disorder names have duplicates with a small set of redundant metatags such as (disorder) and (finding) added to the name, which very rarely appear as natural language in our target domain (we list these metatags in Appendix A). Since they do not reflect relevant synonymy, we leave out such duplicates. Thirdly, we only retain disorder names of up to 6 tokens, since this is the maximum length of the 20K disorder names which directly match noun phrases from our target domain. This is also similar to the length distribution in disorder normalization benchmarks such as the NCBI Disease corpus (Dogan et al., 2014) and the ShARe/CLEF eHealth 2013 corpus (Pradhan et al., 2015). Lastly, we leave out all disorder names which belong to more than one concept identifier.

ICD-10
The SNOMED-to-ICD-10 mapping, which has been officially provided by the U.S. National Library of Medicine, groups multiple SNOMED-CT concepts together under more coarse-grained ICD-10 codes, using concept unique identifiers (CUIs) from the UMLS ontology which encompass those SNOMED-CT concepts. We fuse the synonym sets of SNOMED-CT concepts belonging to the same ICD-10 concept into a single set of semantically related terms. Table 1 gives some examples of the SNOMED-to-ICD-10 mappings. These examples show how ICD-10 concepts introduce a broader range of synonymy. While many of the SNOMED-CT synonyms can be resolved using word-level analogies (e.g. myocardial depression vs. myocardial dysfunction), the ICD-10 related terms that bridge different SNOMED-CT concepts require more domain-specific semantics to be linked (e.g. for matching myocardial dysfunction with muscular degeneration of heart).
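The fusion step can be sketched with the Table 1 example; the mapping and synonym-set fragments below mirror that table but are hard-coded here purely for illustration:

```python
from collections import defaultdict

# Fragments of the SNOMED-to-ICD-10 mapping and synonym sets from Table 1,
# hard-coded for illustration.
snomed_to_icd10 = {"C0564504": "F60.1", "C0338969": "F60.1", "C0036339": "F60.1"}
snomed_synonyms = {
    "C0564504": {"schizoid fantasy"},
    "C0338969": {"introverted personality disorder", "introverted personality"},
    "C0036339": {"schizoid personality disorder",
                 "unspecified schizoid personality disorder"},
}

# Fuse all SNOMED-CT synonym sets that share an ICD-10 code into one set
# of semantically related names.
icd10_sets = defaultdict(set)
for cui, code in snomed_to_icd10.items():
    icd10_sets[code] |= snomed_synonyms[cui]

print(len(icd10_sets["F60.1"]))  # 5 semantically related names under F60.1
```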

Heterogeneous names: MedMentions

Ranking tasks and data distributions
Ranking tasks We evaluate the usefulness of biomedical name representations for synonym retrieval and concept mapping by applying 3 different performance metrics to a single ranking task. Given a mention m of a biomedical name which belongs to the concept identifier c, we have to rank a set of biomedical names S which includes C_syn ⊂ S, the set of names which belong to the same concept identifier c as the mention m. To rank the biomedical names according to their similarity to the mention, we first encode both the mention m and every name n ∈ S, and then rank every name n using the cosine similarity between the encoded mention f(m) and the encoded name f(n). The aim of this task is to rank every correct synonym or semantically related name syn ∈ C_syn as high as possible. We measure the synonym retrieval and concept mapping performance for this task using different metrics. For synonym retrieval, we report mean average precision (mAP) over all synonyms. For concept mapping, we report accuracy (Acc), the proportion of instances where the highest ranked name n is a correct synonym syn ∈ C_syn, and mean reciprocal rank (MRR) of the highest ranked correct synonym.

Data distributions

Table 2 gives an overview of the data distributions after splitting. For MedMentions, we take our train, validation, test, and zero-shot data from the data splits provided by MedMentions ST21pv. For SNOMED-CT and ICD-10, we devise our own sampling method. Firstly, we randomly divide the synonym sets into training concepts and zero-shot test concepts. Secondly, to hold out test mentions from the training data, we randomly sample a single name from each concept which has at least two names (so as to avoid empty training concepts), and repeat this procedure to get more test data. We then carry out the same procedure to sample validation data, which we use to calculate the stopping criterion during training.

We calculate synonym retrieval and concept mapping performance for the test and validation mentions by ranking, for a test mention m, all names S present in the training data, including the synonyms C_syn which are present in the training data for the concept identifier c of the test mention. The performance of the encoders on the training data is calculated by treating a single training name at a time as test item.
The zero-shot test concepts are used to observe how well our encoders can extrapolate to previously unobserved concepts, for which the encoder has not specifically learned conceptual grounding. We frame the zero-shot setup as a way of testing transfer learning within the same domain, by not including any training names at all. This setup can show that our encodings are robust enough to be used out-of-the-box in entirely novel settings. For this setup, we treat a single zero-shot name at a time as test item, and rank all correct synonyms C syn present in the zero-shot data among all names S from the zero-shot data.
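The ranking evaluation and its three metrics can be sketched as follows, with toy vectors rather than actual model output. Here AP is the average precision for a single mention; mAP is its mean over all mentions:

```python
import numpy as np

def rank_names(mention_vec, name_vecs):
    """Rank candidate names by cosine similarity to the encoded mention."""
    sims = name_vecs @ mention_vec / (
        np.linalg.norm(name_vecs, axis=1) * np.linalg.norm(mention_vec))
    return np.argsort(-sims)

def metrics(ranking, gold):
    """Acc: top-1 correct; MRR: reciprocal rank of first correct name;
    AP: average precision over this mention's correct synonyms."""
    hits, precisions, first = 0, [], None
    for i, idx in enumerate(ranking, start=1):
        if idx in gold:
            hits += 1
            precisions.append(hits / i)
            first = first or i
    return {"Acc": float(ranking[0] in gold),
            "MRR": 1.0 / first,
            "AP": float(np.mean(precisions))}

# Toy example: candidates 0 and 1 are the correct synonyms of the mention.
mention = np.array([1.0, 0.0])
names = np.array([[0.9, 0.1], [0.0, 1.0], [1.0, 0.2], [-1.0, 0.0]])
ranking = rank_names(mention, names)            # ranks candidate 0 first
print(metrics(ranking, gold={0, 1}))
```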

Reference model and baselines
Reference model: BNE We compare our DAN model against the Biomedical Name Encoder (BNE) by Phan et al. (2019), which we train using the exact same data. To have a direct comparison with their model, we leave out the character embeddings from their encoder architecture and only use our fastText word embeddings as input embeddings. This results in a bidirectional LSTM (BiLSTM) (Graves and Schmidhuber, 2005) with max pooling followed by a linear transformation. We also include the publicly released BNE model with skipgram word embeddings, BNE + SG_w, which was trained on approximately 16K synonym sets of disease concepts in the UMLS, containing 156K disease names. We do not include this model for the disorder data, since it was trained on at least part of that data, and we want to avoid that data leakage affects the fairness of the model comparisons.
Baselines As baseline encoder we use the 300-dimensional fastText name embeddings which are used as input for the DAN (defined in Equation 1 in Section 3.1). This encoder is an example of a Simple Word-Embedding Model (SWEM) with average pooling, which has proven to be a strong baseline for various NLP tasks (Shen et al., 2018). We also include two other pretrained baselines in our comparison of encoders: 600-dimensional Sent2Vec (Pagliardini et al., 2018) embeddings with word unigram and bigram representations, trained on the same MEDLINE data as our fastText embeddings; and averaged 768-dimensional context-specific token activations extracted from the publicly released BioBERT model (Lee et al., 2019).

Training details
We fit the CCA for the linear constraint using all training names and their corresponding concept prototypes constructed from the same training names. The encoder architectures of our own DAN model and the BNE reference model are implemented in PyTorch (Paszke et al., 2019). Both the input and output dimensionality are 300 (which is the dimensionality of the input fastText embeddings described in Section 3.1). All encoder architectures for which we report results performed best with a single hidden layer. We tuned the hidden size of the DAN to 38,400 dimensions using a grid search over 300 × 2^n, with n starting at 1 and being increased until performance declined again. We tuned the BiLSTM for the BNE model to 4,800 dimensions using the same grid search, to make sure the architecture was compared fairly to our model. At that point, the DAN has ±23M trainable parameters, whereas the BiLSTM already has ±200M trainable parameters. This allows us to empirically confirm that our proposed DAN model is more computationally efficient than the BNE BiLSTM.

Table 5: Synonym retrieval and concept mapping scores for the MedMentions encoders. The highest score is denoted in bold, the second highest is underlined.

ICD-10 code: R07.1
Test mention: pain provoked by breathing
Target synonyms: anterior pleuritic pain / breathing painful / chest pain on breathing / pleural pain / pleuritic pain

Top 10 ranking per encoder:
CCA+DAN: chest pain on breathing / anterior pleuritic pain / pleuritic pain / breathing painful / pleural pain / chest pain / chronic chest pain / pain in heart / upper chest pain / parasternal pain
BNE: chest pain on breathing / breathing painful / back pain worse on sneezing / disorder characterized by back pain / disorder characterised by back pain / anterior pleuritic pain / pain in heart / pleuritic pain / precordial pain / chronic chest pain
fastText: chest pain on breathing / breathing painful / disorder characterized by back pain / disorder characterised by back pain / back pain worse on sneezing / distress from pain in labor / persistent pain following procedure / chronic mouth breathing / chronic chest pain / dermatitis caused by sweating and friction

Table 6: A comparison of the synonym retrieval by various encoders for the ICD-10 test mention pain provoked by breathing. While fastText is already good at matching a few semantically related terms at the top, it retrieves no further names in its top ranks. The BNE ranking picks up on more specific biomedical semantics, but still has a limited coverage. In contrast, the conceptually grounded CCA+DAN ranks all 5 target names at the top.
Adam optimization (Kingma and Ba, 2015) is performed on a batch size of 64, using a learning rate of 0.001 and a dropout rate of 0.5. Input strings are first tokenized using the Pattern tokenizer (Smedt and Daelemans, 2012) and then lowercased. We use a triplet margin of 0.1 for the siamese triplet loss L_syn defined in Equation 3. For the prototypical constraint L_proto defined in Equation 4, we use a synonym dropout rate of 0.5. As stopping criterion we use the mAP of synonym retrieval for held-out validation names: we stop training once this score for the current epoch is worse than for the previous epoch.
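The stopping criterion amounts to halting on the first decline in validation mAP. A minimal sketch, where train_until_decline, train_epoch, and evaluate_map are hypothetical placeholders and the mAP sequence is made up:

```python
def train_until_decline(train_epoch, evaluate_map):
    """Run epochs until validation mAP drops below the previous epoch's score;
    return the index of the epoch at which training stopped."""
    prev = -1.0
    epoch = 0
    while True:
        train_epoch()
        score = evaluate_map()
        if score < prev:
            return epoch  # this epoch declined; the previous one was the best
        prev, epoch = score, epoch + 1

scores = iter([0.41, 0.48, 0.52, 0.50])  # hypothetical validation mAP per epoch
stopped_at = train_until_decline(lambda: None, lambda: next(scores))
print(stopped_at)  # 3: training stops at the fourth epoch, whose mAP declined
```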

Results and discussion
We compare the 3 baselines and the BNE reference model against 3 variants of our model. The CCA fastText model only applies the learned CCA mapping to the pretrained fastText embeddings. The CCA+DAN model applies the linear CCA constraint before training, while the DAN model leaves out the linear constraint. Tables 3 and 4 show the concept mapping and synonym retrieval performance of the different encoders for the ICD-10 and SNOMED-CT data. We see that the fastText baseline consistently outperforms the other baselines. Applying the CCA transformation to the fastText baseline improves performance for every metric, including zero-shot cases. In other words, applying this linear constraint for conceptual grounding already leads to better extrapolation. The DAN model, which combines the siamese triplet loss with the prototypical grounding loss but without the linear constraint, is able to fit the training data almost perfectly without overfitting, since it generalizes well across both test and zero-shot data. Applying the CCA constraint before training increases performance even further. These observations support the hypothesis of this paper that increasing the effectiveness of conceptual grounding can improve trained encoders. The results also clearly confirm the robustness of our approach: synonym retrieval is dramatically improved for the test data, without any performance loss for concept mapping. In other words, the representations have encoded more domain-specific semantics while retaining the relevant lexical information. Table 6 gives an example of the impact of our conceptual grounding constraints for ICD-10 test data: the model is able to encode domain-specific semantics beyond word-level analogies for the semantically related names of the test mention pain provoked by breathing. Not only does the CCA+DAN model rank all semantically related names at the top: all the following top-ranked names, such as chest pain, also have clear semantic links to the mention.
In contrast, the BNE model ranks less related names such as back pain worse on sneezing and disorder characterized by back pain higher than correct synonyms such as pleuritic pain.

ICD-10 & SNOMED-CT
MedMentions Table 5 shows the performance of the different encoders for the MedMentions data. Table 7 gives an example of how, similar to the disorder data, our CCA+DAN encoder is able to encode specific semantics that the BNE model is lacking: the conceptual grounding constraints have allowed our encoder to represent the semantic similarity between cariogenesis, tooth decay and cavities, while the BNE model does not improve over the fastText baseline.
Despite showing similar trends to the disorder data, the relative improvements of our CCA+DAN encoder over the reference BNE model are less dramatic. Interestingly, the publicly released BNE + SG w model trained by Phan et al. (2019) performs worse out-of-the-box than our pretrained fastText embeddings. This highlights the difficulty of achieving true robustness of biomedical name encoding.

Semantic relatedness benchmarks
We also evaluate our name encoders on two biomedical benchmarks of semantic similarity, which allow us to compare cosine similarity between name embeddings with human judgments of relatedness. MayoSRS (Pakhomov et al., 2011) contains multi-word name pairs of related but different concepts, and can indicate how much generalized domain knowledge has been captured by our conceptual grounding constraints. UMNSRS (Pakhomov et al., 2016) contains only single-word pairs, which also stem from different concepts. This benchmark makes a distinction between similarity and relatedness.
The correlations in Table 8 confirm the robustness of our conceptually grounded biomedical name representations. While the correlations for the BNE models barely improve over those of the fastText embeddings, our CCA+DAN encoder improves substantially over all 3 benchmarks, regardless of the data source it was trained on. Remarkably, while the publicly released BNE model of Phan et al. (2019) was trained on 156K disease names, the CCA+DAN encoder already outperforms it on MayoSRS when trained on the ICD-10 and SNOMED-CT subsets, which contain only 30K disease names. This proves that Deep Averaging Networks can be effective even for large-scale encoding of biomedical names. Moreover, this finding suggests that future work on biomedical name encoders should not take complex neural architectures for granted. On the contrary, enforcing more relevant constraints such as our conceptual grounding constraints can boost even lightweight encoder architectures.

Table 8: Spearman's rank correlation coefficient between cosine similarity scores of name embeddings and human judgments, reported on semantic similarity (sim) and relatedness (rel) benchmarks. The highest score is denoted in bold, the second highest is underlined.

Conclusion and future work
In this paper, we have shown how two conceptual grounding constraints for biomedical name encoders can infuse name representations with more domain-specific semantics without losing robustness. These representations can help with retrieving literal synonyms as well as semantically related terms, and can be adequately produced by a Deep Averaging Network, a feedforward neural network that takes only averaged word embeddings as input.
We believe future work can include a comparison of neural encoding architectures with a wider range of complexity. Decreasing the complexity of neural architectures can allow for including more comprehensive training objectives which target more effective encoding of domain-specific semantics.