Hardness-guided domain adaptation to recognise biomedical named entities under low-resource scenarios

Domain adaptation is an effective solution to data scarcity in low-resource scenarios. However, when applied to token-level tasks such as bioNER, domain adaptation methods often suffer from the challenging linguistic characteristics that clinical narratives possess, which leads to unsatisfactory performance. In this paper, we present a simple yet effective hardness-guided domain adaptation framework for bioNER tasks that can effectively leverage the domain hardness information to improve the adaptability of the learnt model in low-resource scenarios. Experimental results on biomedical datasets show that our model can achieve significant performance improvement over the recently published state-of-the-art (SOTA) MetaNER model.


Introduction
Named Entity Recognition (NER) is a fundamental NLP task which aims to locate named entity (NE) mentions and classify them into predefined categories such as location, organization, or person. NER usually serves as an important first sub-task for information retrieval (Banerjee et al., 2019), task-oriented dialogues (Peng et al., 2020) and other language applications. Consequently, NER has seen significant performance improvements with the recent advances of pre-trained language models (PLMs) (Akbik et al., 2019; Devlin et al., 2019). Unfortunately, a large amount of training data is often essential for these PLMs to excel, and except for a few high-resource domains, the majority of domains have only a limited amount of labeled data.
This data-scarcity problem is amplified in the context of biomedical NER (bioNER). Firstly, the annotation process for biomedical domains is time-consuming and can be extremely expensive. Thus, many biomedical corpora, especially those developed privately, are often scarcely labeled. Furthermore, each biomedical domain can have distinct linguistic characteristics which are non-overlapping with those found in other biomedical domains (Lee et al., 2019). This linguistic challenge often diminishes the robustness of PLMs transferred from high-resource biomedical domains to low-resource ones (Giorgi and Bader, 2019).
Given this premise, this paper focuses on adapting PLMs for bioNER tasks learned from high-resource biomedical domains to low-resource ones. A potential solution is to inject prior "experience" into this adaptation process. Few works have explored this area; among them are Li et al. (2020a) and Li et al. (2020b), where the former followed the optimization/meta-learning strategy of Finn et al. (2017) and the latter introduced a feature-critic module similar to the work of Li et al. (2019).
We show that simply incorporating the hardness information that each domain contributes to the learning of bioNER LMs can significantly boost the adaptation performance of existing learning paradigms under various low-resource settings. This is because the importance/hardness of biomedical domains can vary significantly, as shown in Table 1. While some domains might contain many NEs, many do not (e.g., the last row in Table 1), and hence contribute little to learning the bioNER LMs. Meanwhile, the domain difficulty ties both to the number of entities and to the length of those entities. Given the non-overlapping linguistic characteristics found in biomedical domains, this poses another challenge to effectively adapting the trained bioNER LMs to a new domain. Therefore, we argue that the current adaptation framework proposed by Li et al. (2020b) could be further enhanced using hardness information. We present two simple but effective ways of incorporating hardness information into our learning framework, named HGDA. We show that our hardness-guided domain adaptation approaches for bioNER tasks outperform the SOTA domain adaptation NER technique by Li et al. (2020b).

Related Works
Few works have addressed domain adaptation for NER. Both Li et al. (2020a) and Li et al. (2020b) seek a robust representation for the sequence labeling function BiLSTM-CRF using the meta-learning framework (Finn et al., 2017), with the latter further including an auxiliary network to promote adversarial learning. Hu et al. (2022) consider label dependencies via an auto-regressive framework built on top of a Bi-LSTM for cross-domain NER. However, different domains can have different levels of hardness, which these works have not yet addressed. Existing domain adaptation techniques using the meta-learning framework incorporate hardness information via 1) actively ranking the tasks in terms of difficulty level (Yao et al., 2021; Zhou et al., 2020; Liu et al., 2020; Achille et al., 2019); 2) designing an adaptive task scheduler (Yao et al., 2021); or 3) relying on generative approaches to quantify the uncertainties of tasks (Kaddour et al., 2020; Nguyen et al., 2021). To our knowledge, we are the first to perform hardness-guided domain adaptation for bioNER tasks.

Problem Setup
Given a set of biomedical corpora from multiple source domains D_source (e.g., Drug, Gene, Species, etc.), we aim to learn a sequence labelling function h : X → Y^1 from a set of tasks p(T) sampled from D_source so that h can be adapted to a new task T′ sampled from the target domain D_target (e.g., Disease). This function h should contain 1) a sentence encoder parameterized with θ (e.g., BiLSTM) that captures the contextual information about words, and 2) a tag decoder parameterized with ϕ (e.g., CRF) that assigns the entity tags to these words^2. Thus, the learning objective is to search for the optimal Θ* ≡ {θ, ϕ} from D_source. This optimal Θ* should minimise the risk of adapting h from D_source to T′ from D_target.
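As footnote 2 notes, the labels Y follow the BIO schema. A minimal sketch of how token-level entity spans map to BIO tags (the example sentence and entity types below are our own illustrations, not drawn from the paper's corpora):

```python
def spans_to_bio(tokens, spans):
    """Convert token-index entity spans into BIO tags.

    `spans` is a list of (start, end, type) tuples with `end` exclusive,
    e.g. (4, 6, "Disease") marks tokens 4 and 5 as one Disease mention.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # beginning of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # inside the mention
    return tags

tokens = ["Mutations", "in", "BRCA1", "increase", "breast", "cancer", "risk"]
spans = [(2, 3, "Gene"), (4, 6, "Disease")]
spans_to_bio(tokens, spans)
# → ["O", "O", "B-Gene", "O", "B-Disease", "I-Disease", "O"]
```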

Task Generation
To optimize for Θ* with stochastic optimization, one first needs to sample from p(T), i.e., task generation. Each bioNER task T_i in our setting is divided into a support set T^S_i and a query set T^Q_i. We further restrict both T^S_i and T^Q_i to contain only K sentences respectively sampled from a domain in D_source. This value of K depends on the amount of data we have during the adaptation phase for T′ and can be as small as 5 or 10. This mimics the same few-shot setting in the training phase, which has been shown to reduce the PAC-Bayesian error bound during the adaptation phase (Ding et al., 2021). To encode the hardness information into our task generation process, we further consider the imbalance issue caused by the NER tasks. As shown in Table 2, the majority of the sentences in the biomedical corpora do not contain any NEs. Thus, it is highly likely that the K randomly-sampled sentences contain no NEs, which can result in a biased sequence labeller that always predicts "O" in the adaptation phase. To avoid this issue, we propose our first HGDA approach by selecting the K sentences in T^S_i to be those containing at least one biomedical NE, which is shown to be highly effective during the adaptation phase.

1 Each task consists of X = {x^i_1, . . ., x^i_L}^N_{i=1} and Y = {y^i_1, . . ., y^i_L}^N_{i=1}, where X and Y denote the set of sentences and tags/labels respectively, N is the total number of sentences, and L is the number of word tokens in sentence i.
2 We consider the BIO tagging schema.
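The NE-aware support-set selection described above can be sketched as follows (a simplified illustration; the function name and the (tokens, tags) corpus representation are our own assumptions):

```python
import random

def sample_support_set(corpus, k, require_entity=True, seed=0):
    """Sample K sentences for the support set T^S_i.

    When `require_entity` is True (the HGDA-NEs strategy), only
    sentences with at least one non-"O" tag are eligible, avoiding a
    labeller biased toward always predicting "O".
    `corpus` is a list of (tokens, bio_tags) pairs.
    """
    rng = random.Random(seed)
    pool = corpus
    if require_entity:
        pool = [(toks, tags) for toks, tags in corpus
                if any(t != "O" for t in tags)]
    return rng.sample(pool, k)
```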

Bilevel Optimization
To regularize θ, HGDA includes a domain classifier as a separate head on top of the sentence encoder. This enforces the network to learn a domain conditional invariant sentence encoder (Blanchard et al., 2017; Li et al., 2018; Shao et al., 2019). This domain classifier, parameterized with ω, consists of a fully connected layer and is used to predict which domain the sentences in a task T_i belong to. The classification function f will henceforth represent the composition of the sentence encoder and the domain classifier. Consequently, the learning objective of HGDA is

L_i = L(h(θ, ϕ), T_i) + λ L(f(θ, ω), T_i),

where λ controls the trade-off between the labelling loss and the classification loss. As HGDA follows the bilevel optimization framework, we first generate a batch of tasks from p(T). For each T_i in this batch, we train the model on T^S_i then validate the performance on T^Q_i using our learning objective. Consequently, we gather the gradients from each T_i in the current batch of tasks and update the parameters, finishing one iteration of the training process. This runs until no further improvement can be made. The full algorithm is
Table 1: Examples of domain hardness scores (computed by our method) for tasks generated from three domains (gene, drug, and species, respectively) during the training procedure. The score lies on a scale from 0 to 1: the higher the score, the more challenging the domain. The NEs are put in brackets in red for each sentence.
summarized by Alg. 1 and Alg. 2 in the appendix.

Task Hardness
Although picking the K sentences with NEs for T_i is shown to improve the DA performance (see Table 3), it is not realistic in practice to have only sentences with NEs, and it is wasteful not to use the sentences without NEs, as these sentences would still provide the sentence encoder with important contextual information about the clinical narratives. Hence, HGDA incorporates another simple but effective way of computing the bioNER task hardness based on the losses. The gradients propagated by T_i are weighted by the hardness level of T_i. Specifically, we define the task difficulty Γ_i = {γ^θ_i, γ^ϕ_i, γ^ω_i} for task T_i from its corresponding objective values, where {γ^θ_i, γ^ϕ_i, γ^ω_i} represent the task hardness scores used to update {θ, ϕ, ω} respectively. By incorporating task hardness in the optimization process, HGDA, after collecting adequate contextual information for the sentence encoder, should gradually shift its focus to more challenging labelling tasks for the tag decoder rather than the ones that contribute little to no learning value, e.g., a task that contains short and simple sentences without bioNEs. This happens because multiplying the hardness score with the corresponding gradient value forces the gradient update to zero for sentences with no NEs. Table 1 shows how HGDA ranks the contribution of each task towards the gradient updates.
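A minimal sketch of hardness-weighted gradient aggregation. The paper's exact formulas for γ are not reproduced in this excerpt, so the loss normalisation below (each task's loss divided by the batch maximum, giving scores in [0, 1]) is an illustrative assumption; parameters and gradients are plain lists of floats:

```python
def hardness_weighted_update(params, task_grads, task_losses, lr=0.1):
    """Outer-loop update where each task's gradient is scaled by a
    hardness score derived from its loss.  Note: the paper defines
    separate scores for theta, phi and omega; a single shared score is
    used here for brevity.
    """
    max_loss = max(max(task_losses), 1e-12)
    scores = [l / max_loss for l in task_losses]   # hardness in [0, 1]
    dim = len(params)
    agg = [0.0] * dim
    for s, g in zip(scores, task_grads):
        for j in range(dim):
            # tasks with near-zero loss (e.g. no NEs) contribute
            # almost nothing to the aggregated gradient
            agg[j] += s * g[j]
    n = len(task_grads)
    return [p - lr * a / n for p, a in zip(params, agg)]
```

With two tasks where the second has zero loss, only the first task's gradient moves the parameters, mirroring how HGDA down-weights tasks that carry little learning value.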

Datasets
We use the pre-processed version of the benchmark corpora (see Tab. 2) (Habibi et al., 2017; Lee et al., 2019; Zhu et al., 2018), which were used by the SOTA bioNER model BioBERT (Lee et al., 2019) and are publicly available at BioBERT's github website3. These corpora are categorized into four non-overlapping biomedical domains, namely Disease, Drug, Gene and Species, each of which will serve as the target domain in our DA experiments.
When the sentence encoder is BiLSTM, HGDA uses BioWordVec embeddings pre-trained on both the PubMed database and clinical notes from MIMIC-III (Chen et al., 2018; Yijia et al., 2019).

Experimental Settings
To analyze the adaptability of HGDA under low-resource scenarios, we consider the following experimental settings:

• The size of T′: We use |T′| ∈ {5, 10, 20, 50} to replicate the data-scarcity issue in low-resource scenarios of privately labelled medical corpora.

Table 3: Average F1-performance of the sequence-encoder adaptation for bioNER tasks, with the best performance boldfaced. All results are averaged over 20 distinct samples; e.g., given a T′ size of 5, we adapt our HGDA variants and their baselines using T′ and validate their bioNER performance on the test data to record the F1-score. We then repeat this process with 20 different T′ of size 5, average the final results, and report them in this table. HGDA and HGDA-NEs use BiLSTM as the sentence encoder, while HGDA* and HGDA-NEs* use BERT as the sentence encoder. Unless otherwise specified, the HGDA variants outperform their baselines with a p-value < 0.05.
We implement two variants of HGDA and compare them with the SOTA MetaNER (Li et al., 2020b).
• MetaNER acts as our major baseline. It is the latest and most closely related work to HGDA, showing SOTA performance. We followed the parameter settings that the authors detailed in their paper and replicated the MetaNER model based on our understanding. We validated our implementation by comparing its performance to the baseline multi-tasking method used in MetaNER.
• BioBERT is used to demonstrate the difficulty that deep PLMs face in low-resource scenarios.
• HGDA-NEs follows the strategy in the task generation discussion, i.e., HGDA-NEs only trains with sentences that contain at least one bioNE.
As our HGDA and HGDA-NEs can use either BiLSTM or BERT as the sequence encoder, we clearly highlight this information in the presentation of results to avoid any confusion. Additionally, corpora from the target domain are unseen by the model during the training phase. For instance, if the "Disease" domain is treated as D_target for adaptation, we only perform learning for Θ* using the remaining D_source = {"Drug", "Gene", "Species"}. More detailed parameter settings to reproduce this work can be found in the appendix.

Results & Discussions
Table 3 presents the NER performance of MetaNER, BioBERT, HGDA and their variants under the previously defined adaptation settings. We have the following observations:

• MetaNER vs. HGDA: By simply incorporating the hardness information in the gradient update, HGDA achieves a significant performance improvement over MetaNER, with an average gain of 4-5% in terms of F1 score. In multiple cases (e.g., JNLPBA 5 shots, BC2GM 10 shots, etc.), the performance gain of HGDA goes up to 15% in terms of F1 score. This result demonstrates that using hardness to differentiate the importance of each task in the gradient update contributes to the NER performance.
• HGDA vs. HGDA-NEs: Both HGDA and HGDA-NEs work well in our experiments, outperforming the strong baseline by a large margin. HGDA re-weights the gradient update based on the task difficulty, while HGDA-NEs trains the learner exclusively on sentences containing NEs. It is not surprising to see both approaches perform similarly as T′ increases, since HGDA automatically down-weights tasks with sentences containing few or no NEs.
• Interestingly, both HGDA and HGDA-NEs can perform worse than MetaNER on the LINNAEUS corpus. Table 2 shows that 87% of LINNAEUS sentences contain no bioNEs. Since HGDA and HGDA-NEs discard those sentences implicitly and explicitly respectively during training, this could have contributed to the performance loss.
• The BioBERT performance shows the weakness of adapting deep PLMs in low-resource scenarios for bioNER tasks. Under our HGDA settings, both HGDA* and HGDA-NEs*, which use BERT as the sentence encoder, perform significantly better than the BioBERT baseline. This suggests that our techniques are architecture-invariant. Additionally, the significant performance gaps when |T′| ∈ {5, 10} further underscore the necessity of HGDA for deep sentence encoders.
Additionally, we provide the precision and recall results for all of our experiments; these results can be found in the appendix (Table 4 and Table 5).
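For reference, entity-level precision, recall, and F1 over BIO tag sequences can be computed as in the sketch below (our own helper functions for illustration, not the paper's evaluation script; a matched entity requires identical boundaries and type):

```python
def extract_spans(tags):
    """Collect (start, end, type) entity spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes last span
        if tag.startswith("B-") or tag == "O" or (
                tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.append((start, i, etype))  # close the open span
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]            # open a new span
    return spans

def entity_f1(gold, pred):
    """Entity-level precision, recall and F1 for one tagged sentence."""
    g, p = set(extract_spans(gold)), set(extract_spans(pred))
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```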

Conclusion
We have proposed simple yet effective methods that leverage domain hardness information to improve the effectiveness of the learnt model under low-resource NER settings. Experiments on biomedical corpora show that the sequence labelling function derived from our HGDA variants achieves substantial performance improvements over current SOTA baselines.

Limitations
HGDA and its variants are trained using English bioNER corpora which have limited morphology.
We have not applied HGDA to other languages to further verify its performance, so this is a potential area for future work. To make sure that the batches of tasks are constructed properly, we had to modify the dataloaders. This prevents the GPUs from being fully utilised during training and leads to long training times, e.g., up to 48 hours to train with one NVIDIA RTX3090. We use multiple RTX3090s to train our models; thus, for GPUs with lower memory, the batch size must be changed, which might affect the results. Since we validate the performance of each configuration 20 times, as discussed in the experimental results section, it takes a considerable amount of time to finish the validation of the adaptation performance. Finally, due to the page limit, we cannot show the detailed p-values that suggest the significance of our work.
Algorithm 1 HGDA
Require: p(T) from source domains
Require: α, β, λ hyper-parameters
Require: m task batch size
1: Initialize θ, ϕ, ω
2: while not converged do
3:   for i = 1, . . ., m do
4:     Adapt to task T_i and collect its gradients via Algorithm 2
5:   end for
6:   Update θ, ϕ, ω with the gathered gradients
7: end while

Algorithm 2 Bilevel optimization for T_i
Require: T^S_i ∩ T^Q_i = ∅
Require: θ, ϕ, ω current iteration parameters
Require: β, λ hyper-parameters
1: Initialize θ_i, ϕ_i, ω_i with θ, ϕ, ω
2: for i = 1, . . ., adaptation steps do
3:   L^lab_i = L(h(θ_i, ϕ_i), T^S_i)
4:   L^cls_i = L(f(θ_i, ω_i), T^S_i)
5:   ω_i ← ω_i − β∇_{ω_i} L^cls_i
6:   θ_i ← θ_i − β∇_{θ_i}(L^lab_i + λ L^cls_i)
7:   ϕ_i ← ϕ_i − β∇_{ϕ_i} L^lab_i
8: end for
9: Evaluate the adapted parameters on T^Q_i and return the gradients

When BERT (Devlin et al., 2019; Wolf et al., 2020) acts as the sequence encoder, we set the max sequence length for padding and truncation to 256, as biomedical texts tend to be longer than general texts. We use the cased vocabulary for slightly better performance and set the dimensionality of the encoder layers and the pooler layer to 768. For tokenization, BERT uses WordPiece tokenization (Wu et al., 2016) to deal with the out-of-vocabulary (OOV) issue, which is common for biomedical texts. The default learning rates α and β for K of 32 are set to 1e-5 and are subject to change as shown in Eq. 3. The output from the BERT sequence encoder is then fed into two separate fully connected layers: one predicts the tagging sequence; the other predicts the domain of the sentences in T_i.
All models are trained using the SGD optimizer (Kiefer and Wolfowitz, 1952) with a linear learning-rate scheduler (Wolf et al., 2020). We set the gradient clip to 5, momentum to 0.9, weight decay to 1e-6, and dropout rate to 0.2 for all our training (Li et al., 2020b). After cross-validating different values of λ, we use λ = 1 for all HGDA variants to control the trade-off between the sequence labelling loss and the domain classifying loss. Additionally, since HGDA involves the bilevel optimization framework, we approximate the gradients acquired from T^Q_i using first-order gradient approximations (Nichol et al., 2018) and implicit gradients (Rajeswaran et al., 2019) to avoid computing the Jacobian matrix. Please contact the corresponding author for the code to re-implement this work.
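A toy sketch of the first-order approximation idea (in the spirit of Nichol et al., 2018): adapt a copy of the parameters per task with plain SGD, then move the shared parameters toward the mean adapted solution, sidestepping the Jacobian that exact bilevel gradients require. The quadratic per-task losses used here are an illustrative assumption, not the paper's actual objective:

```python
def reptile_step(params, targets, inner_lr=0.05, inner_steps=5, outer_lr=0.5):
    """One first-order meta-update over a batch of toy tasks.

    Each task's loss is ||w - target||^2, so the inner SGD pulls a copy
    of `params` toward that task's target; the outer step then moves the
    shared parameters toward the average adapted solution.
    """
    dim = len(params)
    adapted = []
    for target in targets:
        w = list(params)
        for _ in range(inner_steps):
            # gradient of ||w - target||^2 is 2 * (w - target)
            w = [wj - inner_lr * 2 * (wj - tj) for wj, tj in zip(w, target)]
        adapted.append(w)
    mean = [sum(a[j] for a in adapted) / len(adapted) for j in range(dim)]
    return [p + outer_lr * (m - p) for p, m in zip(params, mean)]
```

Because only inner-loop SGD steps and a parameter difference are involved, no second-order derivatives are ever computed.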
Table 1 (continued): additional example sentences from the gene and drug domains, with domain hardness scores of 0.46 and 0.18 respectively.

Table 2: Biomedical corpora used in our experiments.