Biomedical Named Entity Recognition via Dictionary-based Synonym Generalization

Biomedical named entity recognition is one of the core tasks in biomedical natural language processing (BioNLP). To tackle this task, numerous supervised and distantly supervised approaches have been proposed. Despite their remarkable success, these approaches inescapably demand laborious human effort. To alleviate the need for human effort, dictionary-based approaches have been proposed to extract named entities based simply on a given dictionary. However, one downside of existing dictionary-based approaches is that they struggle to identify concept synonyms that are not listed in the given dictionary, which we refer to as the synonym generalization problem. In this study, we propose a novel Synonym Generalization (SynGen) framework that recognizes the biomedical concepts contained in the input text using span-based predictions. In particular, SynGen introduces two regularization terms, namely (1) a synonym distance regularizer and (2) a noise perturbation regularizer, to minimize the synonym generalization error. To demonstrate the effectiveness of our approach, we provide a theoretical analysis of the bound of the synonym generalization error. We extensively evaluate our approach on a wide range of benchmarks, and the results verify that SynGen outperforms previous dictionary-based models by notable margins. Lastly, we provide a detailed analysis to further reveal the merits and inner workings of our approach.


Introduction
Biomedical Named Entity Recognition (BioNER) (Settles, 2004; Habibi et al., 2017; Song et al., 2021; Sun et al., 2021) is one of the core tasks in biomedical natural language processing (BioNLP). It aims to identify phrases that refer to biomedical entities, thereby serving as a fundamental component for numerous downstream BioNLP tasks (Leaman and Gonzalez, 2008; Kocaman and Talby, 2021).
Existing BioNER approaches can be generally classified into three categories: (1) supervised methods, (2) distantly supervised methods, and (3) dictionary-based methods. Supervised methods (Wang et al., 2019b; Lee et al., 2020; Weber et al., 2021) train the BioNER model on large-scale human-annotated data. However, annotating large-scale BioNER data is expensive as it requires intensive domain-specific human labor. To alleviate this problem, distantly supervised methods (Fries et al., 2017; Zhang et al., 2021; Zhou et al., 2022) create weakly annotated training data based on an in-domain training corpus. Nonetheless, the creation of the weakly annotated data still demands a significant amount of human effort (Fries et al., 2017; Wang et al., 2019a; Shang et al., 2020). For instance, the preparation of the in-domain training corpus can be challenging, as the corpus is expected to contain the corresponding target entities. To this end, most existing methods (Wang et al., 2019a; Shang et al., 2020) simply use the original training set without the annotations as the in-domain corpus, which greatly limits their applicability to more general domains. In contrast to supervised and distantly supervised methods, dictionary-based methods are able to train the model without human-annotated data. Most existing dictionary-based frameworks (Aronson, 2001; Song et al., 2015; Soldaini and Goharian, 2016; Nayel et al., 2019; Basaldella et al., 2020) identify phrases by matching the spans of the given sentence with entities of a dictionary, thereby avoiding the need for extra human involvement or an in-domain corpus. As human or expert involvement in the biomedical domain is usually much more expensive than in the general domain, in this paper we focus our study on dictionary-based methods for the task of BioNER.
Although dictionary-based approaches do not require human intervention or an in-domain corpus, they suffer from the synonym generalization problem: the dictionary only contains a limited number of synonyms of the biomedical concepts that appear in the text. Therefore, if an entity synonym in the text is not explicitly mentioned in the dictionary, it cannot be recognized. This problem severely undermines the recall of dictionary-based methods as, potentially, a huge number of synonyms are not contained in the dictionary. To address the synonym generalization problem, we propose SynGen (Synonym Generalization), a novel framework that generalizes the synonyms contained in the given dictionary to a broader domain. Figure 1 presents an overview of our approach. (1) In the training stage, SynGen first samples synonyms from a given dictionary as positive samples. Meanwhile, negative samples are obtained by sampling spans from a general biomedical corpus. Then, it fine-tunes a pre-trained model to classify the positive and negative samples. In particular, SynGen introduces two novel regularizers, namely a synonym distance regularizer, which reduces the spatial distance between synonyms, and a noise perturbation regularizer, which reduces the gap between synonyms' predictions, to minimize the synonym generalization error. These regularizers make the dictionary concepts generalizable to the entire domain. (2) During the inference stage, the input text is split into several spans, which are sent to the fine-tuned model to predict which spans are biomedical named entities. To demonstrate the effectiveness of the proposed approach, we provide a theoretical analysis showing that both of our proposed regularizers lead to a reduction of the synonym generalization error.
We extensively test our approach on five well-established benchmarks and show that SynGen brings notable performance improvements over previous dictionary-based models on most evaluation metrics. Our results highlight the benefit of both proposed regularization methods through detailed ablation studies. Furthermore, we validate the effectiveness of SynGen under the few-shot setup: notably, with about 20% of the data, it achieves performance comparable to the results obtained with the full dictionary. In summary, our contributions are: • We propose SynGen, a novel dictionary-based method to solve the BioNER task via synonym generalization.
• We provide a theoretical analysis showing that the optimization of SynGen is equivalent to minimizing the synonym generalization error.
• We conduct extensive experiments and analyses to demonstrate the effectiveness of our proposed approach.

Methodology
In this section, we first define the dictionary-based biomedical NER task. Then, we introduce our Synonym Generalization (SynGen) framework, followed by the details of the synonym distance regularizer and the noise perturbation regularizer. Lastly, we provide a theoretical analysis of the synonym generalization problem to show the effectiveness of our proposed approach.

Task Definition
Given a biomedical domain D (e.g. the disease domain or the chemical domain), we denote the set of all possible biomedical entities in D as S = {s_1, ..., s_{|S|}}, where s_i denotes the i-th entity and |S| denotes the size of S. Then, given an input text x, the task of biomedical NER is to identify the sub-spans inside x that belong to S, namely finding

{(b_1, e_1), ..., (b_k, e_k)},

where k is the number of spans, and b_i, e_i ∈ [1, |x|] are the beginning and ending token indices in x of the i-th span, respectively. However, in real-life scenarios, it is impractical to enumerate all possible entities in S. Normally, we only have access to a dictionary Ŝ ⊂ S, Ŝ = {ŝ_1, ..., ŝ_{|Ŝ|}}, which contains a subset of the entities that belong to S. The goal of dictionary-based biomedical NER is then to maximally recognize the biomedical entities in the input text conditioned on the available dictionary.
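As a concrete illustration of the span notation above, candidate spans (b_i, e_i) up to a maximum length can be enumerated as follows (a minimal Python sketch; the function and parameter names are our own and not part of the paper's implementation):

```python
def enumerate_spans(tokens, max_len=10):
    """Enumerate all candidate spans (b, e) of at most max_len tokens.

    Indices are 1-based to match the paper's notation, where b and e
    are the beginning and ending token indices of a span.
    """
    spans = []
    for b in range(1, len(tokens) + 1):
        for e in range(b, min(b + max_len - 1, len(tokens)) + 1):
            spans.append((b, e))
    return spans

tokens = "T-cell prolymphocytic leukemia was observed".split()
spans = enumerate_spans(tokens, max_len=3)
```

Each candidate span would then be matched against the dictionary (or, in SynGen, scored by the model) to decide whether it is a biomedical entity.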

Synonym Generalization Framework
Figure 1 depicts an overview of the proposed SynGen framework. (1) In the training stage ( §2.2.1), it samples synonyms from a given dictionary (e.g. UMLS) as positive samples. Meanwhile, negative samples are obtained by sampling spans from a general biomedical corpus (e.g. PubMed). Then, SynGen learns to classify positive and negative samples through the cross-entropy objective. Moreover, we propose two regularization methods (i.e. synonym distance regularization and noise perturbation regularization), which provably mitigate the synonym generalization error (detailed in §2.3). (2) In the inference stage ( §2.2.2), SynGen splits the input text into different spans and scores them separately, following a greedy extraction strategy.

Training
During training, we first sample a biomedical entity ŝ_i (i.e. a positive sample) from the dictionary Ŝ and encode its representation as r̂_i = E(ŝ_i), where E(·) is an encoder model such as BERT (Kenton and Toutanova, 2019). The probability of ŝ_i being a biomedical entity is then modelled as

p(ŝ_i) = σ(MLP(r̂_i)),   (1)

where MLP(·) contains multiple linear layers and σ(·) is the sigmoid function. Meanwhile, we sample a negative text span ñ_i from a general biomedical corpus, i.e. PubMed, and compute its representation and probability as r̃_i = E(ñ_i) and p(ñ_i) = σ(MLP(r̃_i)), respectively. Lastly, the training objective of our model is defined as the cross-entropy loss

L = −(1/n) Σ_{i=1}^{n} [ log p(ŝ_i) + log(1 − p(ñ_i)) ],

where n is the number of sampled positive/negative pairs.

Negative Sampling Filtering (NSF). To obtain the negative samples ñ_i, we first sample spans of random length from the PubMed corpus. Then, to avoid erroneously sampling false negatives (i.e. spans that actually are biomedical entities), we encode all sampled spans into the embedding space and remove the samples that are close to the entities contained in the given dictionary. Specifically, we denote the set of sampled negative spans as Ñ.
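The classification objective above can be sketched numerically as follows (a minimal numpy illustration, not the paper's implementation: a random vector stands in for the encoder E(·), and a single linear layer stands in for MLP(·); all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(span):
    # Stand-in for E(.), e.g. a BERT [CLS] embedding of the span.
    return rng.standard_normal(16)

W = rng.standard_normal(16)  # stand-in for MLP(.): a single linear layer
bias = 0.0

def score(r):
    # p(s) = sigma(MLP(r)), the probability that the span is an entity
    return 1.0 / (1.0 + np.exp(-(W @ r + bias)))

pos = [encode(s) for s in ["nsclc", "lung cancer"]]  # dictionary synonyms
neg = [encode(s) for s in ["was observed in"]]       # sampled corpus spans

# Cross-entropy objective: push p toward 1 on positives, 0 on negatives.
loss = -(np.mean([np.log(score(r)) for r in pos])
         + np.mean([np.log(1.0 - score(r)) for r in neg]))
```

In the actual framework the encoder and MLP parameters would be updated by gradient descent on this loss, jointly with the regularizers introduced below.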
Then, ∀ ñ_i ∈ Ñ, it satisfies

min_{ŝ ∈ Ŝ} ∥F(ñ_i) − F(ŝ)∥ ≥ t_d,

where t_d is a hyper-parameter that specifies the minimal-distance threshold, and F(·) is an off-the-shelf encoder model. In our experiments, we use SAPBert (Liu et al., 2021a,b) as F(·), since it is originally designed to aggregate the synonyms of the same biomedical concept in adjacent regions of the embedding space.
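The filtering step can be sketched as follows (a minimal numpy sketch assuming 2-d embeddings for readability; `filter_negatives` and the toy vectors are our own illustration, with F(·) assumed to have already been applied):

```python
import numpy as np

def filter_negatives(neg_embs, dict_embs, t_d):
    """Keep only sampled spans whose nearest dictionary entity is at
    least t_d away in the embedding space (drop likely false negatives)."""
    kept = []
    for v in neg_embs:
        dists = np.linalg.norm(dict_embs - v, axis=1)
        if dists.min() >= t_d:
            kept.append(v)
    return kept

dict_embs = np.array([[0.0, 0.0], [1.0, 1.0]])   # encoded dictionary entities
neg_embs = [np.array([0.1, 0.0]),   # too close: probably an entity synonym
            np.array([5.0, 5.0])]   # safely far away: kept as a negative
kept = filter_negatives(neg_embs, dict_embs, t_d=1.0)
```

Only the far-away span survives the filter, so spans that are likely unlisted synonyms of dictionary concepts are not used as negatives.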
Synonym Distance Regularizer (SDR). Intuitively, if the distinct synonyms of a single concept are concentrated in a small region, it is easier for the model to correctly identify them. Thereby, to equip our model with the ability to extract distinct synonyms of the same biomedical concept, we propose a novel regularization term, the Synonym Distance Regularizer (SDR). During training, SDR first samples an anchor concept ŝ_a and an associated synonym ŝ_p of the same concept from the dictionary Ŝ. Then, a random negative sample ñ_n ∈ Ñ is sampled. Finally, SDR is computed by imposing a triplet margin loss (Chechik et al., 2010) between the encoded sampled synonyms and the sampled negative term:

R_s = max( ∥r̂_a − r̂_p∥ − ∥r̂_a − r̃_n∥ + γ_s, 0 ),

where γ_s is a pre-defined margin, and r̂_a = E(ŝ_a), r̂_p = E(ŝ_p), and r̃_n = E(ñ_n), respectively. In §2.3, we provide a rigorous theoretical analysis showing why reducing the distance between synonyms minimizes the synonym generalization error.
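A triplet margin loss of this form can be sketched as follows (a minimal numpy sketch with toy 2-d embeddings; in practice this would be computed on encoder outputs and backpropagated):

```python
import numpy as np

def sdr_loss(r_a, r_p, r_n, margin=1.0):
    """Triplet margin loss: pull the synonym pair (r_a, r_p) together and
    push the negative r_n at least `margin` further from the anchor."""
    return max(np.linalg.norm(r_a - r_p)
               - np.linalg.norm(r_a - r_n) + margin, 0.0)

r_a = np.array([0.0, 0.0])   # anchor concept name
r_p = np.array([0.1, 0.0])   # synonym of the same concept (close)
r_n = np.array([3.0, 0.0])   # negative span (far)
loss_easy = sdr_loss(r_a, r_p, r_n, margin=1.0)   # already satisfied -> 0

r_p_far = np.array([2.0, 0.0])   # synonym that drifted away
r_n_near = np.array([1.0, 0.0])  # negative that crept close
loss_hard = sdr_loss(r_a, r_p_far, r_n_near, margin=1.0)  # positive penalty
```

When the synonym is already much closer to the anchor than the negative, the loss is zero; otherwise the positive penalty pushes the synonyms of a concept into a tighter region.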
Noise Perturbation Regularizer (NPR). Another way to mitigate the scoring gap between biomedical synonyms is to reduce the sharpness of the scoring function's landscape (Foret et al., 2020; Andriushchenko and Flammarion, 2022). This is because the synonyms of one biomedical entity are expected to be distributed in a close-by region. Based on this motivation, we propose a new Noise Perturbation Regularizer (NPR), defined as

R_n = ∥f(E(ŝ_i) + v) − f(E(ŝ_i))∥,

where ŝ_i is a biomedical entity sampled from the dictionary Ŝ, v is a Gaussian noise vector, and f denotes the loss computed on an entity representation. Intuitively, NPR tries to flatten the landscape of the loss function by minimizing the loss gap between vectors within a close-by region. More discussion of increasing function flatness can be found in Foret et al. (2020) and Bahri et al. (2022). In §2.3, we theoretically show why NPR also leads to a reduction of the synonym generalization error.
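The perturbation gap can be sketched as follows (a minimal numpy sketch in which a toy logistic loss stands in for f; the names and the choice of loss are our own illustration, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(r, W):
    # Toy stand-in for the loss on a representation r: the negative
    # log-likelihood of classifying r as positive under weights W.
    return float(np.log1p(np.exp(-(W @ r))))

def npr(r, W, sigma=0.1):
    """Noise perturbation regularizer: the gap between the loss at r
    and at a Gaussian-perturbed copy of r, i.e. |f(r + v) - f(r)|."""
    v = sigma * rng.standard_normal(r.shape)
    return abs(f(r + v, W) - f(r, W))

W = rng.standard_normal(4)
r = rng.standard_normal(4)   # stand-in for an entity representation E(s)
gap = npr(r, W, sigma=0.1)
```

Minimizing this gap during training encourages the loss to change slowly around entity representations, i.e. it flattens the local landscape so that nearby (unseen) synonyms receive similar scores.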
Overall Loss Function. The overall learning objective of SynGen is defined as

L_overall = L + α R_s + β R_n,

where α and β are tunable hyper-parameters that regulate the importance of the two regularizers.

Inference
During inference, given the input text x, SynGen first splits x into spans of different lengths, namely

X = { x_[i:j] | 1 ≤ i ≤ j ≤ |x|, j − i + 1 ≤ m_s },

where m_s is the maximum span length. Then, we compute the score of every span x_[i:j] ∈ X as p(x_[i:j]) with Equation (1). We select the spans whose scores are higher than a pre-defined threshold t_p as candidates, which are then further filtered by the greedy extraction strategy introduced below.
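The candidate selection step can be sketched as follows (a minimal Python sketch in which a dictionary lookup stands in for the fine-tuned scoring model; `extract_candidates` and the toy scores are our own illustration):

```python
def extract_candidates(tokens, score_fn, t_p=0.5, m_s=3):
    """Score every span of up to m_s tokens and keep those whose
    probability exceeds the threshold t_p."""
    cands = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + m_s, len(tokens)) + 1):
            span = " ".join(tokens[i:j])
            p = score_fn(span)
            if p > t_p:
                cands.append((span, p))
    return cands

# Toy scorer standing in for p(.) from the fine-tuned model.
known = {"lung cancer": 0.9, "cancer": 0.7}
cands = extract_candidates("lung cancer was treated".split(),
                           lambda s: known.get(s, 0.1), t_p=0.5, m_s=2)
```

The surviving candidates would then be passed to the greedy extraction strategy below to resolve nested spans.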
Greedy Extraction (GE). It has been observed that many biomedical terms are nested (Finkel and Manning, 2009; Marinho et al., 2019). For example, the entity T-cell prolymphocytic leukemia contains the sub-entities T-cell and leukemia. However, if a sentence x contains the entity T-cell prolymphocytic leukemia, most existing BioNER approaches identify it as a single biomedical entity and ignore its sub-entities T-cell and leukemia. To address this issue, SynGen applies a greedy extraction (GE) strategy to post-process the extracted biomedical terms. In particular, our GE strategy first ranks the recognized terms by their length in descending order as t_1, ..., t_n with |t_1| ≥ ... ≥ |t_n|, and sets the initial validation sequence x^(1) = x.
Then, it checks the ranked terms from t_1 to t_n. If the term t_i is a sub-sequence of the validation sequence x^(i) (i.e. ∃p, q ≤ |x^(i)| such that x^(i)_[p:q] = t_i), it recognizes t_i as a biomedical entity and sets a new validation sequence x^(i+1) by removing all occurrences of t_i from x^(i). As a result, the sub-entities inside a recognized entity will never be recognized again, because they are no longer contained in the corresponding validation sequence.
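The GE strategy can be sketched as follows (a minimal Python sketch over raw strings; the paper operates on token spans, and `greedy_extract` and the example terms are our own illustration):

```python
def greedy_extract(text, terms):
    """Keep longest terms first; a shorter term survives only if it still
    occurs after all occurrences of longer kept terms are blanked out."""
    validation = text
    kept = []
    for t in sorted(terms, key=len, reverse=True):
        if t in validation:
            kept.append(t)
            validation = validation.replace(t, "\0")  # remove occurrences
    return kept

text = "t-cell prolymphocytic leukemia and leukemia were studied"
terms = ["t-cell prolymphocytic leukemia", "t-cell", "leukemia"]
kept = greedy_extract(text, terms)
```

Here "t-cell" is dropped because it only occurs inside the longer recognized entity, while "leukemia" is kept because it also occurs on its own elsewhere in the sentence.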

Theoretical Analysis
Most existing dictionary-based frameworks suffer from a common problem: terms outside of the given dictionary cannot easily be recognized, which we refer to as the synonym generalization problem. To understand why the SynGen framework can resolve this problem, we give a theoretical analysis focusing on the correctness of entities in S. Specifically, given a bounded negative log-likelihood loss function f(r) ∈ [0, b], f(E(s)) tends to 0 if an entity s ∈ S is correctly classified as positive; otherwise, f(E(s)) tends to b if the entity s is wrongly recognized as a non-biomedical phrase. Following the traditional generalization error (Shalev-Shwartz and Ben-David, 2014), we further define the average empirical error over the entities in the dictionary Ŝ as

R̂ = (1/|Ŝ|) Σ_{i=1}^{|Ŝ|} f(E(ŝ_i)).

To analyze the generalization error over all synonyms, we consider the most pessimistic gap between R̂ and the error of an arbitrary s ∈ S, namely f(E(s)). The synonym generalization error is then defined as follows.

Definition 1 (synonym generalization error). Given a loss function f(r) ∈ [0, b], the synonym generalization error is defined as

E_s = sup_{s ∈ S} ( f(E(s)) − R̂ ).

It can be observed from Definition 1 that a small E_s implies that the error f(E(s)) for an arbitrary s will not deviate too far from R̂. Therefore, training f with the dictionary terms Ŝ generalizes to the entities in the whole domain S. To analyze E_s theoretically, we further assume that Ŝ is an ϵ-net of S, namely, ∀s ∈ S, ∃ŝ ∈ Ŝ such that ∥ŝ − s∥ ≤ ϵ. Intuitively, given an entity s ∈ S, we can always find an entity ŝ in the dictionary Ŝ within distance ϵ. We further assume that f is κ-Lipschitz, namely ∥f(x) − f(y)∥ ≤ κ∥x − y∥. Then the following bound holds.

Theorem 1 (Synonym Generalization Error Bound). Suppose Ŝ is an ϵ-net of S and the loss function f ∈ [0, b] is κ-Lipschitz continuous. Then, with probability at least 1 − δ,

E_s ≤ κϵ + b √( ln(1/δ) / (2|Ŝ|) ).   (7)

The proof can be found in Appendix A.
It can be observed from Theorem 1 that reducing the synonym distance upper bound ϵ or the loss function f's Lipschitz constant κ reduces the generalization error gap E_s.
Theorem 1 explains why both SDR and NPR help improve NER performance. (1) SDR allows the model to learn to reduce the distance between synonyms, which is equivalent to reducing ϵ in Equation (7); it therefore reduces the synonym generalization error upper bound. (2) NPR helps reduce the Lipschitz constant κ because, for a given vector v, minimizing R_n is equivalent to minimizing

∥f(E(ŝ_i) + v) − f(E(ŝ_i))∥ / ∥v∥,

as v is fixed during the parameter optimization procedure. Therefore, optimizing R_n is a necessary condition for reducing f's Lipschitz constant κ.

Experimental Setup
We evaluate all models on six popular BioNER datasets: two in the disease domain (NCBI-disease (Dogan et al., 2014) and BC5CDR-D (Li et al., 2016)), two in the chemical domain (BC4CHEMD (Krallinger et al., 2015) and BC5CDR-C (Li et al., 2016)), and two in the species domain (Species-800 (Pafilis et al., 2013) and LINNAEUS (Gerner et al., 2010)). Note that BC5CDR-D and BC5CDR-C are splits of the BC5CDR dataset (Li et al., 2016) for evaluating the capability of recognizing entities in the disease and chemical domains respectively, following Lee et al. (2020). We evaluate the performance by reporting Precision (P), Recall (R), and F1 scores.
The entity name dictionary used in our model is extracted from the concepts' synonyms in the disease, chemical, and species partitions of UMLS (Bodenreider, 2004). The negative spans are randomly sampled from the PubMed corpus. We tune the hyper-parameters of SDR (i.e. α), NPR (i.e. β), the threshold constants t_d and t_p, and the maximal span length m_s by grid search on the development set, and report the results on the test set. The hyper-parameter search space as well as the best settings can be found in Appendix D. We use PubMedBert (Gu et al., 2021) as the backbone model, chosen by comparing development set results for PubMedBert, SAPBert (Liu et al., 2021a,b), BioBert (Lee et al., 2020), SciBert (Beltagy et al., 2019), and Bert (Kenton and Toutanova, 2019) (a detailed comparison can be found in Appendix C). Our experiments are conducted on a server with NVIDIA GeForce RTX 3090 GPUs. For all experiments, we report the average performance over 10 runs on the three metrics.

Comparison Models
We compare our model with baseline models that depend on different kinds of annotations or extra effort. The supervised models (BioBert and SBM) require gold annotations. The distantly supervised models (SBMCross, SWELLSHARK, and AutoNER) depend on several different kinds of annotation or extra effort, which are discussed with the corresponding models. The dictionary-based models mainly use the UMLS dictionary. The standard deviation analysis is in Figure 7.

BioBert (Lee et al., 2020) first pre-trains an encoder model on a biomedical corpus and then fine-tunes the model on annotated NER datasets.

SBM is a standard Span-Based Model (Lee et al., 2017; Luan et al., 2018, 2019; Wadden et al., 2019; Zhong and Chen, 2021) for the NER task. We use the implementation by Zhong and Chen (2021).

SBMCross utilizes the same model as SBM. We follow the setting of Langnickel and Fluck (2021) and train the model on one dataset and test it on the other dataset in the same domain, which we refer to as in-domain annotation. For example, in the NCBI task, we train the model on the BC5CDR-D dataset with SBM and report the results on the NCBI test set.

SWELLSHARK (Fries et al., 2017) proposes to first annotate a corpus with weak supervision and then use the weakly annotated dataset to train an NER model. It requires extra expert effort for designing effective regular expressions as well as hand-tuning for some particular special cases.

AutoNER (Wang et al., 2019a; Shang et al., 2020) proposes to first train an AutoPhrase (Shang et al., 2018) tool and then tailor a domain-specific dictionary based on the given in-domain corpus. The corpus is the original training set without human annotation. Afterwards, it trains the NER model on the distantly annotated data. We also report ablation experiments in which the dictionary tailoring is removed or the in-domain corpus is replaced with evenly sampled PubMed text.

EmbSim first uses a pre-trained model to encode the input spans and the entities in the UMLS dictionary into vector representations. Then, it calculates the minimal distance between a given span and any entity in the dictionary, and recognizes spans whose minimal distance is smaller than a threshold as named entities. We report the performance based on BioBert; for the results using other backbone models, please refer to Appendix C.

MetaMap (Aronson, 2001; Divita et al., 2014; Soldaini and Goharian, 2016) performs exact concept mapping of spans onto the entities in the UMLS dictionary. We report the results for both cased and uncased matching.
SPED (Rudniy et al., 2012;Song et al., 2015) calculates the Shortest Path Edit Distances between the query span and each entity in the dictionary.Then it recognizes a span as a biomedical entity if the minimal distance ratio is smaller than a specific threshold.
We remove the component utilizing the annotated data and only keep the tf-idf features to make it comparable to other models without extra effort.
QuickUMLS (Soldaini and Goharian, 2016) is a fast approximate UMLS dictionary matching system for medical entity extraction. It utilizes Simstring (Okazaki and Tsujii, 2010) as the matching model. We report the scores based on Jaccard distance; for the performance using other distances, please refer to Appendix C.

Experimental Results
Main Results. We first compare the overall performance of our proposed SynGen framework with the baseline models; the results are shown in Table 1. It can be observed that: (1) Our proposed SynGen model outperforms all other dictionary-based models in terms of F1 score. This is because it alleviates the synonym generalization problem and can extract entities outside of the dictionary. As a result, the recall scores are significantly improved.
(2) By comparing SBM and SBMCross, we find that the performance is very sensitive to the selection of the in-domain corpus. Even using the gold annotations of another training set in the same domain leads to a sharp performance decrease; preparing a good in-domain corpus is therefore quite challenging. (3) SynGen is already comparable to SBMCross, with average F1 scores of 67.4 and 68.5 respectively, showing that our dictionary-based model is comparable to a supervised model without in-domain data. (4) The precision (P) of QuickUMLS is very high. This is because it mainly matches exact terms in the dictionary and rarely makes mistakes. However, it cannot handle synonyms outside the UMLS dictionary, so it fails to retrieve adequate synonyms, leading to low recall. (5) The ablation experiments for AutoNER show that the labor-intensive dictionary tailoring and the in-domain corpus selection are very important for distantly supervised methods; without the dictionary tailoring or the in-domain corpus, the performance drops sharply.

Influence of Synonym Distance. To verify that the SDR component does help improve the overall performance, we plot how the synonym distance changes as the SDR weight increases in Figure 6. Specifically, we train the model with different values of the hyper-parameter α and measure the synonym distance over 10,000 synonym pairs sampled from UMLS, calculating the average distance between synonym name pairs for each α. As suggested in Figure 6, as α increases, the synonym distance decreases, which shows the effectiveness of the SDR component in controlling the synonym distance. On the other hand, we also plot how the evaluation scores change as the synonym distance increases in Figure 2. For these results, we further split the distance range into 8 intervals and report the average overall performance in each interval. The results indicate that as the synonym distance is regularized (i.e. decreases), the overall performance increases. This
observation shows the effectiveness of our proposed SDR component and justifies the analysis in Theorem 1.
Influence of Noise Perturbation. To show the usefulness of our proposed NPR component, we plot how the scores change as the NPR weight (i.e. β) changes. The results are shown in Figure 3. As β increases, all metrics, including precision, recall, and F1, increase. This observation shows that the NPR component does help improve the performance, which also justifies our theoretical analysis.
Few-Shot Analysis. To further show that our SynGen framework can be applied in few-shot scenarios, we run the model on subsets of the original dictionary, with the dictionary size ratio ranging over {0.1%, 0.2%, 0.5%, 1%, 2%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%}. The results are shown in Figure 4. To better show the capability of few-shot learning, we also draw the same figure for the MetaMap model, as shown in Figure 5. It can be concluded from the results that in SynGen, when the dictionary size is very small, the performance increases as the dictionary size increases, and using only about 20% of the dictionary entries achieves results comparable to using the full dictionary. In the MetaMap model, by contrast, the performance increases linearly with the dictionary size, showing that a word-match-based model cannot handle the few-shot cases. This observation shows the potential of our approach for few-shot learning. It should be noted that the performance stops increasing after a certain ratio. This observation can also be explained with Theorem 1: increasing the dictionary size can only mitigate the second term in Equation (7). After the second term decreases to a certain amount, the first term dominates the error, and continuing to increase the dictionary size cannot further reduce the upper bound. This further verifies the correctness of our theoretical analysis. To further improve the performance, we should consider reducing the first term in Equation (7), which is exactly what our proposed SDR and NPR do.

Standard Deviation Analysis. To further show the effectiveness and consistency of our proposed components, we conduct a standard deviation analysis, as shown in Figure 7. We run each model 10 times with different random seeds and draw the box plot of the F1 scores. We can see from Figure 7 that our proposed SynGen consistently outperforms the model variants without the proposed components, which further validates the consistent effectiveness of each component of SynGen and of the overall framework.
Case Study. We conduct a case study to demonstrate how our proposed NPR and SDR components enhance performance. As shown in Table 4, we select a range of terms from the NCBI corpus. Using SynGen, both with and without the NPR+SDR components, we predict whether each term is a biomedical term. A term predicted as a biomedical term is marked with a check mark (✓); otherwise, it is marked with a cross (✗). Our findings reveal that SynGen accurately identifies certain terms like "maternal UPD 15" as biomedical, even though they are not indexed in the UMLS. However, without the NPR+SDR components, the system cannot recognize such terms, underscoring the significance of these components. Moreover, SynGen avoids misclassifying common words (e.g., "man", "breast") and peculiar spans like "t (3; 15)" as biomedical entities, whereas the system without our proposed components might erroneously categorize them, further emphasizing the essential role of the NPR+SDR components in SynGen.

Related Works
Existing NER methods can be divided into three categories, namely, supervised methods, distantly supervised methods, and dictionary-based methods.
In supervised methods, Lee et al. (2020) propose to pre-train on a biomedical corpus and then fine-tune the NER model. In dictionary-based methods, Basaldella et al. (2020) propose to apply exact string matching to extract named entities; Rudniy et al. (2012) and Song et al. (2015) propose to extract entity names by calculating string similarity scores, while QuickUMLS (Soldaini and Goharian, 2016) uses additional string similarity measures (Okazaki and Tsujii, 2010) for fast approximate dictionary matching. To the best of our knowledge, no existing work gives a theoretical analysis of the synonym generalization problem and proposes a corresponding method to solve it.

Conclusion
This paper proposes a novel synonym generalization framework, SynGen, to solve the BioNER task with a dictionary. We propose two novel regularizers to make the dictionary terms generalizable to the full domain. We conduct a comprehensive theoretical analysis of the synonym generalization problem in the dictionary-based biomedical NER task to show the effectiveness of the proposed components. We extensively evaluate our approach on a wide range of benchmarks, and the results verify that SynGen outperforms previous dictionary-based models by notable margins.

Limitations
Although dictionary-based methods achieve considerable improvements, there is still an overall performance gap compared with supervised models. Therefore, for domains with well-annotated data, it is still recommended to apply a supervised model. Our SynGen framework is suggested for domains where no well-annotated data is available.

Broader Impact Statement
This paper focuses on biomedical named entity recognition, a standard NLP task. We do not create any new dataset, and all datasets used are properly cited. This work does not raise any safety, security, human rights, or environmental concerns.

C Full Results
Table 5 shows the full results of the experimental results.

D Hyper-Parameter Tuning
The hyper-parameter with the corresponding search space are listed in Table 6.

E Advantages and Disadvantages for NER methods
In Table 7, we compare the advantages and disadvantages of different NER paradigms.
Figure 2: Influence of synonym distance.

Figure 7: The box plot of each model's F1 score over 10 runs.
Wang et al. (2019b), Cho and Lee (2019), and Weber et al. (2019, 2021) develop several toolkits by jointly training NER models on multiple datasets. However, supervised annotation is quite expensive and human labor-intensive. Among distantly supervised methods, Lison et al. (2020, 2021), Ghiasvand and Kate (2018), Meng et al. (2021), and Liang et al. (2020) propose to first conduct a weak annotation and then train the BioNER model on it. Fries et al. (2017) and Basaldella et al. (2020) propose to utilize well-designed regexes and special-case rules to generate weakly supervised data, while Wang et al. (2019a) and Shang et al. (2020) train an in-domain phrase model and make a carefully tailored dictionary. However, these methods still need extra effort to prepare a high-quality training set. Among dictionary-based methods, Zhang and Elhadad (2013) propose a rule-based system to extract entities, as do Aronson (2001), Divita et al. (2014), Soldaini and Goharian (2016), Giannakopoulos et al. (2017), and He (2017).

Figure 8 shows the intuition of how SDR and NPR improve the performance. The left-hand side shows the synonyms and the corresponding loss function values without SDR and NPR; the right-hand side shows the synonym points and the regularized function values. When SDR is applied, the synonyms of the same concept concentrate more closely around each other. With NPR, the Lipschitz constant of the function decreases and the function's landscape becomes flatter. As a result, the function values of the synonyms are closer to each other.

Table 1: Main results. We repeat each experiment 10 times and report the averaged scores. For BioBert and SWELLSHARK, we report the scores from the original papers. We mark the extra effort involved with superscripts, where ♮ denotes gold annotations; ♢ in-domain annotations; ♭ regex design; △ special-case tuning; ♡ in-domain corpus; ♯ dictionary tailoring. Bold values indicate the best performance among the dictionary-based models.

Table 3: Ablation study.

Ablation Study. To show the effectiveness of each component of our model, we conduct several ablation experiments over different variants of our model; the results are shown in Table 2 and Table 3. We have the following observations from the two tables. (1) By comparing the vanilla SynGen model with its variants, such as SynGen w/o NPR, SynGen w/o SDR, and SynGen w/o NPR + SDR, we observe that removing any component results in a performance drop, which means that our proposed NPR and SDR components can effectively improve the performance.

Comparison of different backbone models.