PDALN: Progressive Domain Adaptation over a Pre-trained Model for Low-Resource Cross-Domain Named Entity Recognition

Cross-domain Named Entity Recognition (NER) transfers NER knowledge from high-resource domains to a low-resource target domain. Due to limited labeled resources and domain shift, cross-domain NER is a challenging task. To address these challenges, we propose a progressive domain adaptation Knowledge Distillation (KD) approach, PDALN. It achieves superior domain adaptability through three components: (1) adaptive data augmentation, which alleviates the cross-domain gap and label sparsity simultaneously; (2) multi-level domain-invariant features, derived from a multi-grained MMD (Maximum Mean Discrepancy) approach, which enable knowledge transfer across domains; (3) an advanced KD scheme, which progressively enables powerful pre-trained language models to perform domain adaptation. Extensive experiments on four benchmarks show that PDALN can effectively adapt high-resource domains to low-resource target domains, even when they diverge in terminology and writing style. Comparison with other baselines indicates the state-of-the-art performance of PDALN.


Introduction
Named Entity Recognition (NER) is typically framed as a sequence labeling task that aims to locate and classify named entities in text into predefined semantic types, such as Person, Organization, and Location. NER is a fundamental task in information extraction (Karatay and Karagoz, 2015) and text understanding (Krasnashchok and Jouili, 2018). The effectiveness of most existing NER models depends on sufficient labeled data, which is time-consuming and labor-intensive to obtain. Recent research therefore proposes cross-domain NER, which enables NER on a low-resource target domain by transferring knowledge from high-resource source domains.
However, it is challenging to build a cross-domain NER component with high precision and recall, due to the domain shift problem (Ben-David et al., 2010). When casting cross-domain NER as a transfer learning problem, most solutions (He and Sun, 2017; Yang et al., 2017; Aguilar et al., 2017; Liu et al., 2020b) require high-quality cross-domain features for knowledge transfer, but limited labeled data prevent transfer learning from extracting informative features. Besides, it is hard to find a single training dataset covering all the required NER types. Even when words overlap across domains, their combinations and usage differ.
Domain adaptation (Sun et al., 2015) is widely studied to solve the domain shift issue. Existing approaches mainly introduce either word-level or discourse-level domain adaptation to enable cross-domain NER. To mitigate the word-level discrepancy, previous endeavors propose distributed word embeddings (Kulkarni et al., 2016), label-aware maximum mean discrepancy estimation (Wang et al., 2018), and projection learning (Lin and Lu, 2018). As to the discourse-level discrepancy, existing approaches introduce multi-level adaptation layers (Lin and Lu, 2018), tensor decomposition (Jia et al., 2019), and multi-task learning with external information (Liu et al., 2020b; Aguilar et al., 2017). However, these methods require sufficient labeled data, which hinders their performance under low-resource scenarios. To tackle both the label sparsity and domain shift problems, existing approaches (Liang et al., 2020; Simpson et al., 2020; Cao et al., 2020) exploit external resources to generate pseudo labels for the low-resource domain. Nevertheless, such low-confidence labels may degrade model robustness because of noise.
In this paper, we propose a progressive domain adaptation cross-domain NER model, PDALN. It introduces a novel domain adaptation component, which is enhanced by a progressive KD framework.
PDALN addresses both word- and discourse-level domain adaptation in two low-resource scenarios: unsupervised and semi-supervised cross-domain NER. We first augment mix-domain training data with cross-domain anchor pairs, which alleviates the sparsity of annotations in the target domain. Next, we enable knowledge transfer across domains through domain-invariant features learned from a multi-grained MMD adaptation metric. Additionally, we fuse contrastive learning (Hadsell et al., 2006) with a pre-trained model to extract robust features. Finally, instead of directly fine-tuning the model on the augmented adaptive data under the MMD-based metric, we integrate the cross-domain NER model into a sequential KD framework to learn a low-capacity student model.

Base Model
To obtain expressive sentence features, we adopt a pre-trained language model (e.g., BERT (Devlin et al., 2018)) to encode the sentence $\mathcal{X} = [x_{CLS}, x_1, \ldots, x_N, x_{SEP}]$ (after adding BERT's special and padding tokens) into the representation $\mathbf{h} = [h_{CLS}, h_1, \ldots, h_N, h_{SEP}]$. The task objective is the CRF loss $\mathcal{L}_{crf} = -\log p(\mathcal{Y}|\mathcal{X})$, with

$$p(\mathcal{Y}|\mathcal{X}) = \frac{1}{Z(\mathcal{X})} \prod_{i=1}^{N} \phi_n(y_i|h_i, V)\,\phi_e(y_{i-1}, y_i|A), \quad (2)$$

where $\phi_n(y_i = j|h_i, V) = \exp(V_j^\top h_i)$ is the emission potential, $h_i$ is the encoded contextualized word vector, $V$ is the emission weight matrix, $A$ parameterizes the transition potential $\phi_e$, and $Z(\cdot)$ is the normalization constant.
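For concreteness, a minimal sketch of this base model, assuming HuggingFace transformers and the pytorch-crf package (an illustration, not the authors' released code):

```python
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

class BertCrfTagger(nn.Module):
    def __init__(self, num_tags, model_name="bert-base-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.emission = nn.Linear(self.bert.config.hidden_size, num_tags)  # weight matrix V
        self.crf = CRF(num_tags, batch_first=True)  # holds the transition parameters A

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(h)  # V_j^T h_i for every token i and tag j
        mask = attention_mask.bool()
        if tags is not None:
            # pytorch-crf returns log p(Y|X); negate it to obtain L_crf
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # Viterbi decoding
```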

Maximum Mean Discrepancy (MMD) Measurement
MMD is defined over a particular function space $\mathcal{H}_k$ and measures the difference between the cross-domain distributions $(P_s, P_t)$, where $\mathcal{H}_k$ is the Reproducing Kernel Hilbert Space (RKHS) endowed with a characteristic kernel $k$. The squared formulation of MMD is defined as

$$d_k^2(P_s, P_t) = \big\| \mathbb{E}_{x^s \sim P_s}[\phi(x^s)] - \mathbb{E}_{x^t \sim P_t}[\phi(x^t)] \big\|_{\mathcal{H}_k}^2,$$

where $\phi : \mathcal{X} \to \mathcal{H}_k$ is the feature map. The most important property is that $P_s = P_t$ iff $d_k^2(P_s, P_t) = 0$. The characteristic kernel associated with the feature map $\phi$ is a Gaussian kernel, $k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$.
To calculate the MMD loss for cross-domain NER, we first estimate the squared MMD between the BERT representations of source and target samples:

$$\hat{d}_k^2(\mathcal{H}_s, \mathcal{H}_t) = \Big\| \frac{1}{N_s} \sum_{h^s \in \mathcal{H}_s} \phi(h^s) - \frac{1}{N_t} \sum_{h^t \in \mathcal{H}_t} \phi(h^t) \Big\|_{\mathcal{H}_k}^2,$$

where $\mathcal{H}_s$ and $\mathcal{H}_t$ are the sets of BERT embeddings $h^s$ and $h^t$, of sizes $N_s$ and $N_t$ respectively.
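In practice, the squared MMD is computed from pairwise kernel evaluations. A minimal sketch, assuming a single fixed-bandwidth Gaussian kernel and the biased estimator (multi-kernel variants are common but not required here):

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    """(Na, d) x (Nb, d) -> (Na, Nb) matrix of k(x, x') = exp(-||x - x'||^2 / 2 sigma^2)."""
    dist2 = torch.cdist(a, b).pow(2)
    return torch.exp(-dist2 / (2 * sigma ** 2))

def mmd2(h_s, h_t, sigma=1.0):
    """Biased empirical estimate of d_k^2 between two sets of BERT embeddings."""
    k_ss = gaussian_kernel(h_s, h_s, sigma).mean()  # mean of k(s, s')
    k_st = gaussian_kernel(h_s, h_t, sigma).mean()  # mean of k(s, t)
    k_tt = gaussian_kernel(h_t, h_t, sigma).mean()  # mean of k(t, t')
    return k_ss - 2 * k_st + k_tt
```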

The Proposed Model
In this section, we present the structure of the proposed model. We first introduce the domain adaptation components. On the one hand, we design adaptive data augmentation to tackle the label sparsity issue. On the other hand, we introduce a multi-grained MMD metric on the augmented adaptive data to extract domain-invariant features. Figure 1 gives an intuitive illustration of how our domain adaptation approach mitigates domain shift. Besides, we exploit the power of the pre-trained model to capture expressive data features. We integrate a sequential self-training strategy to progressively and effectively apply our domain adaptation components, as shown in Figure 2. We describe the details of cross-domain adaptation in Section 4.1 and progressive self-training for low-resource domain adaptation in Section 4.2.

Cross-domain Adaptation
When labels are insufficient in the target domain, most cross-domain NER models are vulnerable to over-fitting, thus yielding unsatisfactory performance. Therefore, we augment mix-domain data via Cross-Domain Anchor pairs. These augmented data are defined as adaptive data, which alleviate the data insufficiency problem. Our adaptive data are designed to simultaneously mitigate the domain gap at both the word level and the discourse level. The adaptive data form an adaptive space, as shown in Figure 1, which bridges the two domains for cross-domain knowledge transfer.

Adaptive Data Augmentation
We first give the definition of the Cross-Domain Anchor. An entity in the source domain is denoted by $e^s$ with labels $[y^s_{i_s}, \ldots, y^s_{j_s}]$; a target-domain entity $e^t$ has labels $[y^t_{i_t}, \ldots, y^t_{j_t}]$. A cross-domain anchor is a relationship between two entities from different domains: $y^s_{i_s} = y^t_{i_t}$ denotes that the two entities belong to the same entity type, i.e., their first labels coincide. Intuitively, anchor pairs address the cross-domain word discrepancy by sharing words of the same NER type across domains.
Then, we use the set of cross-domain anchor pairs $\mathcal{M}_{Anchor}$ to create the adaptive data $\mathcal{D}_{aug}$. Suppose we have an entity $e^p$, where $p \in \{s, t\}$ and $e^p \in \mathcal{X}^p = [x^p_1, \ldots, x^p_{i_p}, \ldots, x^p_{j_p}, \ldots, x^p_{|\mathcal{X}^p|}]$. Given an anchor pair $(e^p, e^q) \in \mathcal{M}_{Anchor}$, where $q \in \{s, t\}$ and $q \neq p$, we replace $e^p$ in $\mathcal{X}^p$ with $e^q$ to obtain an augmented adaptive sentence. Intuitively, the augmented adaptive sentences are mix-domain data that share sentence patterns across domains. Such semantically or syntactically similar sentences constitute the adaptive data, which can explore the unknown area of the target domain. The grey space in Figure 1 (b) denotes the adaptive space, which is comprised of adaptive sentences such as "The Australia firm's parent company." and "San Francisco will play three one-day internationals.". These two sentences are augmented via the cross-domain anchor pair ("Australia", "San Francisco"), both assigned the label "LOC". When the model is fine-tuned on the adaptive data, it benefits from the cross-domain features acquired from the adaptive space, improving generalizability on the low-resource target domain.
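The replacement step itself is straightforward. A hedged sketch of one possible implementation (the data structures below are our assumptions, not the paper's code):

```python
import random

def replace_entity(tokens, labels, span, anchor_tokens, anchor_labels):
    """Swap the entity occupying span = (i, j) for the anchor entity's tokens."""
    i, j = span
    return (tokens[:i] + anchor_tokens + tokens[j:],
            labels[:i] + anchor_labels + labels[j:])

def augment(tokens, labels, span, ent_type, anchors, k=4):
    """Build up to k adaptive sentences from same-type cross-domain anchors."""
    same_type = [a for a in anchors if a["type"] == ent_type]
    picked = random.sample(same_type, min(k, len(same_type)))
    return [replace_entity(tokens, labels, span, a["tokens"], a["labels"])
            for a in picked]

# e.g., replacing "Australia" (B-LOC) with the anchor "San Francisco"
# (B-LOC I-LOC) turns "The Australia firm's parent company." into
# "The San Francisco firm's parent company."
```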

Multi-grained MMD for Domain-invariant Features
As mentioned above, the adaptive space functions as a cross-domain bridge. In this part, we seek to strengthen its domain adaptability and further aggregate the cross-domain features. We adapt the domain-adaptation MMD of Long et al. (2015) to gather data points with similar word and sentence features, as shown in Figure 1 (c). Since MMD computes the norm of the difference between two domain means, an MMD-based NER objective can learn representations that are both discriminative and domain-invariant. We propose a multi-grained MMD method to simultaneously alleviate the word-level and discourse-level discrepancies.
To distinguish word-level from discourse-level adaptation, we propose a word MMD loss and a sentence MMD loss, denoted by $\mathcal{L}^w_{MMD}$ and $\mathcal{L}^d_{MMD}$ respectively. The sentence-level loss is computed over sentence representations:

$$\mathcal{L}^d_{MMD} = \hat{d}_k^2(\mathcal{H}^s_{CLS}, \mathcal{H}^t_{CLS}),$$

where $\mathcal{H}_{CLS}$ is the set of CLS token embeddings, the pooled sentence outputs of the pre-trained language model. The word-level MMD loss is computed per label $y \in \{B\text{-}X, I\text{-}X, O\}$:

$$\mathcal{L}^w_{MMD} = \sum_{y} \mu_y\, \hat{d}_k^2(\mathcal{H}^s_y, \mathcal{H}^t_y), \quad (6)$$

where $\mu_y$ is the corresponding coefficient and $\mathcal{H}_y$ is the set of token embeddings with label $y$.
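Combining the two granularities, a sketch that reuses mmd2 from the earlier MMD section (tensor shapes and the uniform coefficient are assumptions; the paper allows a per-label coefficient $\mu_y$):

```python
def multi_grained_mmd(tok_s, tok_t, y_s, y_t, cls_s, cls_t, mu=0.25):
    """Word-level and discourse-level MMD losses.

    tok_*: (N, d) token embeddings; y_*: (N,) integer labels;
    cls_*: (B, d) [CLS] sentence embeddings.
    """
    loss_d = mmd2(cls_s, cls_t)  # sentence-level (discourse) term
    loss_w = 0.0
    for y in set(y_s.tolist()) & set(y_t.tolist()):  # labels seen in both domains
        loss_w = loss_w + mu * mmd2(tok_s[y_s == y], tok_t[y_t == y])
    return loss_w, loss_d
```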
Finally, the representations of a sentence and its tokens serve as the domain-invariant features, which capture cross-domain knowledge under the guidance of $\mathcal{L}^d_{MMD}$ and $\mathcal{L}^w_{MMD}$. As shown in Figure 1 (c), the domain-invariant features gather samples around the adaptive space to assist adaptation in both the source and target domains.

Self-training for Low-Resource Domain Adaptation (DA)

Robust Feature Adaptation
Considering the limited vocabulary and noisy data samples in both the source and target domains, we adopt contrastive learning (Hadsell et al., 2006; Ye et al., 2020; Liu et al., 2020a; Wu et al., 2020) to extract robust features through text augmentations such as synonym replacement (Wu et al., 2020) and span deletion (Wei and Zou, 2019). We construct a distorted dataset and minimize the contrastive loss

$$\mathcal{L}_c = -\log \frac{\exp(\mathrm{sim}(z, \tilde{z})/\tau)}{\exp(\mathrm{sim}(z, \tilde{z})/\tau) + \sum_{z^- \in \mathcal{Z}_{neg}} \exp(\mathrm{sim}(z, z^-)/\tau)},$$

where $z = W h_{CLS}$ is the projected vector of a sentence $\mathcal{X}$, $W$ is a trainable parameter, and $\tilde{z} = W \tilde{h}_{CLS}$ is the projected vector of $\tilde{\mathcal{X}}$, obtained by applying synonym replacement or span deletion to $\mathcal{X}$. $\mathcal{Z}_{neg}$ is constructed from the other sentences in $\mathcal{D} \cup \mathcal{D}_c$, excluding $\mathcal{X}$ and $\tilde{\mathcal{X}}$, and $\tau$ is a temperature hyper-parameter.
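A sketch of this objective for a single sentence (cosine similarity and the exact negative set are standard choices we assume here):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, z_tilde, z_neg, tau=0.05):
    """InfoNCE-style loss: pull (z, z_tilde) together, push z away from z_neg.

    z, z_tilde: (d,) projections of X and its augmentation; z_neg: (M, d).
    """
    pos = F.cosine_similarity(z, z_tilde, dim=0) / tau             # scalar
    neg = F.cosine_similarity(z.unsqueeze(0), z_neg, dim=1) / tau  # (M,)
    logits = torch.cat([pos.view(1), neg])
    return -F.log_softmax(logits, dim=0)[0]  # -log p(positive pair)
```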

Low-Resource Objectives
To address the low-resource scenarios, we consider both zero-resource and minimal-resource cross-domain NER training settings. We first run the base model on both the source and target domains to build the cross-domain bridge through multi-grained MMD adaptation. The unsupervised cross-domain NER loss is

$$\mathcal{L}_{unsup} = \mathcal{L}_{crf}(\mathcal{D}_s) + \beta\, \mathcal{L}^d_{MMD}(\mathcal{D}_s, \mathcal{D}_{tu}), \quad (8)$$

which is free of any annotated target examples but still enables domain adaptation through $\mathcal{L}^d_{MMD}(\mathcal{D}_s, \mathcal{D}_{tu})$. The semi-supervised cross-domain NER objective is

$$\mathcal{L}_{semi} = \mathcal{L}_{crf}(\mathcal{D}_s \cup \mathcal{D}_{tl}) + \alpha\, \mathcal{L}^w_{MMD} + \beta\, \mathcal{L}^d_{MMD}, \quad (9)$$

where $\alpha$ and $\beta$ are hyperparameters that balance the multi-grained MMD loss terms.
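In pseudocode, the two training objectives combine the previously defined terms as follows (the exact composition, e.g. whether the contrastive term enters here, is our reading of the section):

```python
def unsupervised_objective(l_crf_src, l_d_mmd, beta):
    # zero-resource: source-domain CRF loss + discourse-level MMD
    # computed against the unlabeled target data
    return l_crf_src + beta * l_d_mmd

def semi_supervised_objective(l_crf, l_w_mmd, l_d_mmd, alpha, beta):
    # minimal-resource: CRF loss on source plus labeled target data,
    # with both multi-grained MMD terms
    return l_crf + alpha * l_w_mmd + beta * l_d_mmd
```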

Progressive Joint KD and DA
We propose progressive domain adaptation by integrating a sequential teacher-student framework, which prevents the model from over-fitting on the limited labeled data and the augmented adaptive data. The intuition is that the student tends to overlook "problematic" examples while learning patterns that generalize well; the KD framework therefore progressively improves the domain adaptation confidence over the data. The cross-domain NER loss over the adaptive data, $\mathcal{L}_{ner}(\mathcal{D}_{aug})$, takes the same form as the objective above, computed on $\mathcal{D}_{aug}$. In the progressive KD framework, we use $f_{\theta_{tea}}$ and $f_{\theta_{stu}}$ to denote the teacher and student models, respectively. Suppose $f_{\hat{\theta}}$ is the base model learned with the objective in Equation 9; we initialize the teacher and student models as $\theta^{(0)}_{tea} = \theta^{(0)}_{stu} = \hat{\theta}$. At the $t$-th iteration, the student loss is

$$\mathcal{L}_{distill} = \sum_{\mathcal{X} \in \mathcal{D}_{aug}} \sum_{n=1}^{N} \ell\big(f_{\theta^{(t)}_{tea},\,n}(\mathcal{X}),\; f_{\theta_{stu},\,n}(\mathcal{X})\big),$$

where $\mathcal{X} \in \mathcal{D}_{aug}$ contains $N$ entities, $f_{\cdot,\,n}(\mathcal{X})$ denotes the output for entity $n$, and $\ell(\cdot, \cdot)$ measures the distance between teacher and student outputs. The updated model is $\hat{\theta}^{(t)}_{stu} = \arg\min_{\theta_{stu}} \mathcal{L}_{distill}$. Finally, we update the teacher-student pair for the $(t+1)$-th iteration by $\theta^{(t+1)}_{tea} = \hat{\theta}^{(t)}_{stu}$ and $\theta^{(t+1)}_{stu} = \hat{\theta}^{(t)}_{stu}$.
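The loop is easiest to see in code. A sketch under stated assumptions: token_log_probs is a hypothetical helper returning per-token log-distributions, and KL divergence stands in for the distillation distance $\ell$, whose exact form the section leaves open:

```python
import copy
import torch
import torch.nn.functional as F

def progressive_kd(base_model, adaptive_loader, make_optimizer, num_rounds=3):
    """Sequential teacher-student self-training over the adaptive data."""
    teacher = copy.deepcopy(base_model)  # theta_tea^(0) = theta_hat
    student = copy.deepcopy(base_model)  # theta_stu^(0) = theta_hat
    for t in range(num_rounds):
        opt = make_optimizer(student.parameters())
        for batch in adaptive_loader:
            with torch.no_grad():
                tea_logp = teacher.token_log_probs(batch)  # assumed helper
            stu_logp = student.token_log_probs(batch)
            # distill the teacher's per-token distribution into the student
            loss = F.kl_div(stu_logp, tea_logp.exp(), reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
        teacher = copy.deepcopy(student)  # trained student seeds round t + 1
    return student
```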

Experiments
In this section, we evaluate PDALN and other baselines on four public benchmarks. We conduct two groups of comparison experiments for unsupervised and semi-supervised cross-domain NER separately. We also conduct further ablation studies and hyperparameter studies to validate the efficacy of the domain adaptation approaches.

Baselines
We compare PDALN with the following state-of-the-art cross-domain NER models:

BiLSTM+CRF (Lample et al., 2016) harnesses character-level BiLSTMs to capture morphological and orthographic features and word-level BiLSTMs to integrate sentence-level grammatical features. The model stacks a CRF layer on top to predict labels while modeling their dependencies.

BERT+CRF replaces the traditional BiLSTM component with the powerful pre-trained language model BERT to obtain more informative, contextually enhanced word representations.

La-DTL (Wang et al., 2018) proposes label-aware MMD metric learning to mitigate the word distribution discrepancy.

DATNet (Zhou et al., 2019) proposes a generalized resource-adversarial discriminator to capture the shared feature space across domains, which then guides target-domain NER prediction.

JIA2019 (Jia et al., 2019) combines language modeling and NER in a multi-task learning structure, then exploits tensor decomposition to learn task embeddings for cross-domain NER prediction.

Multi-Cell (Jia and Zhang, 2020) proposes a multi-cell compositional LSTM structure for cross-domain NER under the multi-task learning strategy.
In addition, we evaluate two variants of PDALN, denoted w/ MT and w/ VAT, which replace the sequential KD framework in the self-training stage with the Mean Teacher strategy (Tarvainen and Valpola, 2017) and Virtual Adversarial Training (Miyato et al., 2018), respectively.

Training and Implementation Details
We adopt the Adam optimization algorithm with a decreasing learning rate starting at 5e-5. We utilize the pre-trained BERT (BERT-base, cased), which has 12 transformer blocks, a hidden size of 768, and 12 self-attention heads. Each batch contains 32 examples, with a maximum encoding length of 128. The coefficient $\mu_y$ in Equation 6 is 0.25, and the temperature hyper-parameter $\tau$ is 0.05. We choose 100 labeled target examples and 500 labeled source examples to augment adaptive data of size 1400 (100*4 + 500*2): each target example undergoes 4 anchor-word replacements, yielding 4 augmented sentences, while each source example undergoes 2. For the Webpage dataset, we use 10/100/240 target/source/adaptation examples due to its insufficient target examples.
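For quick reference, the settings above can be collected into a single configuration (the dict layout is ours; the values are as reported):

```python
CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 5e-5,                  # with a decreasing schedule
    "pretrained_model": "bert-base-cased",  # 12 layers, hidden size 768, 12 heads
    "batch_size": 32,
    "max_seq_length": 128,
    "mu_y": 0.25,                           # word-MMD coefficient (Equation 6)
    "tau": 0.05,                            # contrastive temperature
    "labeled_target": 100,                  # 4 anchor replacements each
    "labeled_source": 500,                  # 2 anchor replacements each
    "adaptive_data_size": 1400,             # 100*4 + 500*2
}
```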

Results and Discussion
Domain Adaptation on Unsupervised NER. The unsupervised setting follows the zero-shot paradigm, withholding all labeled target data from training. Compared with the other unsupervised NER baselines, PDALN achieves the best F1 on all benchmarks, even where it does not lead on precision. As the unsupervised results in Table 1 show, PDALN and BERT+CRF both attain competitive recall scores, which benefit from the pre-trained language model fused with contrastive learning. On WNUT2016 and Wikigold, PDALN surpasses BERT+CRF, showing the benefit of sentence-level domain adaptation through $\mathcal{L}^d_{MMD}$ and robust feature extraction through $\mathcal{L}_c$.
Evaluation on Semi-supervised NER. As shown in Table 1, most baselines cannot achieve a decent performance gain from the limited annotated resources, whereas PDALN outperforms the best published baseline by 1.5% to 4.0% on all benchmarks. Most existing approaches adopt BiLSTM as the fundamental component for aggregating input information. Unfortunately, BiLSTM cannot capture expressive sentence features due to intrinsic shortcomings such as vanishing or exploding gradients. These approaches are therefore prone to more false-positive predictions and suffer from unsatisfactory recall scores. Even though pre-trained language models attain impressive recall scores, their precision dramatically falls behind the baselines, mainly because such powerful pre-trained models are prone to over-fitting on small annotated datasets. Compared with BERT+CRF, our precision gains alongside increased recall show that our model strikes a successful tradeoff between precision and recall. Besides, we compare the two variants of our model (w/ MT and w/ VAT) that use different self-training strategies, Mean Teacher and Virtual Adversarial Training. Their performance is close to ours on the high-quality labeled data of SciTech, but their results on the other domains show they are vulnerable to noise and easily overfit the limited annotated samples. PDALN overcomes this through progressive domain adaptation with moderate knowledge distillation from the teachers.

Ablation Study
We conduct ablation studies that quantify the contribution of each adaptation component in PDALN. As Table 2 shows, removing the augmented data causes dramatic performance decreases on all four benchmarks, indicating that adaptive data augmentation plays the most vital role in the low-resource cross-domain NER task. Our progressive KD framework shows its importance for precision, as w/o $\mathcal{L}_{distill}$ causes the worst precision drop. Our multi-grained MMD methods (both the sentence-level and word-level MMD) also make noteworthy contributions to cross-domain NER adaptation, as their removal likewise causes serious performance loss. The removal of $\mathcal{L}_c$ attests that robust feature extraction works well when the annotated data (e.g., Wikigold) are not very precise.
Evaluation on Entity Type We provide PDALN's performance on each entity type in

Related Work
Recently, tackling label sparsity has attracted great attention across many research frontiers (Liu et al., 2021; Zhang et al., 2020c; Xia et al., 2018, 2020, 2021; Zhang et al., 2020a,b). One widely adopted strategy is cross-domain transfer, which mainly deals with the domain shift problem. The causes of domain shift in NER are mainly twofold: discrepancies in word distributions and in sentence patterns between the source and target domains.
On the one hand, word distributions are not compatible across domain datasets, so existing works equip models with diverse domain adaptation components to alleviate domain shift. Kulkarni et al. (2016) propose distributed word embedding methods that leverage domain-specific knowledge to boost cross-domain NER performance. Wang et al. (2018) introduce a label-aware mechanism into maximum mean discrepancy (MMD) to explicitly reduce domain shift between the same labels across domains in medical data. Lin and Lu (2018) employ projection learning to obtain a transfer matrix that maps target-domain words into the word space of the source domain.
On the other hand, diverse sentence patterns are usually caused by various factors, such as writing styles, publication categories, and data quality. Solutions for mitigating the discourse-level discrepancy mainly include multi-level adaptation layers (Lin and Lu, 2018), tensor decomposition (Jia et al., 2019), and multi-task learning with external information (Liu et al., 2020b; Aguilar et al., 2017). As mentioned before, Lin and Lu (2018) construct a word adaptation component in their model; in addition, they build a sentence-adaptation layer that takes the adapted word embeddings and extracts an adapted sentence feature. Jia et al. (2019) use multi-task learning and tensor decomposition to extract latent factors, through which knowledge can be transferred between source and target domains. Liu et al. (2020b) employ NER label experts to guide model learning across domains; the label-aware guidance layer is key to enabling domain adaptation. Jia and Zhang (2020) propose a multi-cell compositional LSTM structure for cross-domain NER under the multi-task learning strategy. Besides, other works (Liang et al., 2020; Simpson et al., 2020; Cao et al., 2020) exploit external resources to generate pseudo labels for the low-resource domain with the assistance of a pre-trained language model. However, these methods either lack the capability to capture expressive text features for adaptation or require sufficient labeled target data, which impedes their performance under both zero-resource and minimal-resource scenarios. Moreover, the approaches assisted by pre-trained models mainly rely on external knowledge bases, which introduce considerable noise.

Conclusion
In this paper, we propose a progressive domain adaptation knowledge distillation framework that includes anchor-guided adaptive data to address data sparsity, multi-grained MMD to bridge the domain gap, and progressive KD to stably distill cross-domain knowledge. The results exhibit the model's superiority over state-of-the-art baselines.