Structure and Label Constrained Data Augmentation for Cross-domain Few-shot NER



Introduction
Named entity recognition (NER) is a fundamental Natural Language Processing (NLP) task that detects entity mentions and classifies them into predefined labels (Grishman and Sundheim, 1996). Benefiting from powerful feature representations, deep learning based NER approaches (Lample et al., 2016; Devlin et al., 2019; Li et al., 2020) have achieved promising performance. However, their success depends heavily on large-scale datasets with accurate annotations, which are labor-intensive and time-consuming to produce. Although some general domains (e.g., news) provide rich annotations, it is unaffordable to manually annotate NER labels in some new environments (e.g., bio-medicine). Therefore, few-shot NER (Fritzler et al., 2019; Hou et al., 2020) has attracted increasing attention, aiming to build an NER model with only a small number of supporting samples in specific domains. (* Yufeng Chen is the corresponding author.)
Mainstream research on cross-domain few-shot NER aims to transfer relevant knowledge from source domains. Most of it focuses on optimizing model architectures based on metric learning, such as the Prototypical Network based ProtoBERT (Snell et al., 2017; Fritzler et al., 2019; Hou et al., 2020), the nearest neighbor based network NNShot and its Viterbi-decoding variant StructShot (Yang and Katiyar, 2020), and the contrastive model CONTaiNER (Das et al., 2022). There are also studies based on data augmentation, such as MELM (Zhou et al., 2022), which uses cross-lingual pre-trained models for data augmentation.
Cross-domain few-shot NER is challenging because of domain gaps, yet existing work lacks research on this problem. As a structure prediction task, NER requires synthetic entities to closely match the label dependencies of the target domain. However, texts from various domains differ: the same entity mention may be labeled with different entity types or distinct span boundaries across domains. Existing studies have not adequately explored the influence of these domain gaps on NER, and consequently the gaps significantly degrade the performance of existing approaches.
In view of these challenges, we divide domain gaps into two categories, as shown in Figure 1(a). Category-I: The structure of entities differs across domains. For example, 'the Lincoln Memorial' is represented as a contiguous location entity in flat NER datasets, while it additionally contains a person entity 'Lincoln' in nested NER datasets. Entity boundaries may also vary from domain to domain; for example, different datasets disagree on whether to include 'the' in the phrase 'the Lincoln Memorial'. Category-II: The annotations of entities differ across domains. Different domains usually have different predefined entity types. For example, in the OntoNotes dataset, 'the Lincoln Memorial' and 'Washington D.C.' are annotated as 'LOC' and 'GPE' types, respectively, whereas in the WNUT dataset, both are classified as 'location'. Our proposed method aims to alleviate the negative impact of these structure and annotation differences on cross-domain few-shot NER and thereby enhance the performance of NER models in the target domain.
Based on the above analyses, we introduce two types of relations to model the two kinds of domain gaps in our cross-domain few-shot NER method. Word-to-word relation: 'the Washington Monument' is annotated as one entity in flat NER datasets but as two entities in nested ones. When the source domain is a flat-entity dataset and the target domain is a nested-entity dataset, augmentation is likely to generate non-nested entity data, preventing the NER model from learning the knowledge of the target domain. Word-to-tag relation: the entity types of 'Washington D.C.' are 'GPE' in OntoNotes and 'location' in WNUT, which may cause label conflicts if learned directly.
In this paper, we propose a novel method called Structure and Label Constrained Data Augmentation (SLC-DA) for cross-domain few-shot NER. SLC-DA introduces a novel label constrained pre-training task, which allows the model to learn the mapping relationships between entity labels across diverse domains. Furthermore, SLC-DA incorporates structure constrained optimization objectives into the data augmentation process to generate domain-aware augmented data that helps NER models transition smoothly from the source to the target domain.
Concretely, for structure-constrained data augmentation, as shown in Figure 1(b), we compute the word-to-word relation to model the entity structure between entity word tokens and other tokens, and generate structure-enhanced NER data in the target domain for training. For label-constrained data augmentation, as shown in Figure 1(c), we replace the same entity mentions with their corresponding categories from different domains for each instance and let the language model learn these word-to-tag relations across domains to avoid confusion.
To evaluate the effectiveness of our proposed approach, we compare it with previous works on flat NER and report results surpassing the current state of the art. Additionally, we report competitive results on nested NER. Our findings demonstrate that our proposed method is simple yet highly effective. Our main contributions are summarized as follows: • To bridge domain gaps for cross-domain few-shot NER, we analyze the issue from two new perspectives, i.e., entity annotations and entity structures, and define word-to-word and word-to-tag relations to model them, respectively.
• We propose a method called Structure and Label Constrained Data Augmentation (SLC-DA), introducing a label constrained pre-training task and structure constrained optimization objectives in the data augmentation process.
• We achieve state-of-the-art results on the cross-domain few-shot NER task, and we achieve competitive results when transferring from a flat entity dataset to a nested entity dataset for the first time.

Method
In this section, we present our proposed method, Structure and Label Constrained Data Augmentation (SLC-DA), for cross-domain few-shot NER. Figure 2 depicts an overview of our method, which includes two modules: structure constrained data augmentation and a label constrained pre-training task. We illustrate below how we learn the entity structure and the label relation.

Structure Constrained Data Augmentation
To enhance the quality of generated NER data, we propose to augment data with structure constrained optimization objective by learning and preserving entity structures.
In the structure constrained data augmentation module, we first combine the source domain data and the target domain support set to pre-train the data augmentation model. Then, we capture the entity structure by modeling the word-to-word relation. Finally, structure constrained data augmentation is used to generate entities that conform to the target domain entity structure, which replace the original entities to compose the augmented data.
Let D_source and D_target denote the source domain dataset and the target domain support set. Given a sentence of N tokens X = [x_1, x_2, ..., x_N] with corresponding NER labels L = [l_1, l_2, ..., l_N], we encode the sentence as H = [h_1, h_2, ..., h_N]. We then randomly mask entity tokens to obtain a corrupted sequence X' and a mask indicator M = [m_1, m_2, ..., m_N], where m_i = 1 if x_i is masked and m_i = 0 otherwise. For a masked entity token x'_i, we use a masked language model (MLM) to generate the entity token x''_i that is most similar to x_i, and the new sequence X'' is generated by minimizing the loss

L_mlm = -Σ_{i=1}^{N} m_i log p_θ(x_i | X'),

where θ represents the model parameters.
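The masking step and the MLM objective above can be sketched in plain Python. The helper names and the toy probabilities are illustrative assumptions, not the paper's implementation; in practice the per-token probabilities would come from a BERT-style MLM head:

```python
import math
import random

def mask_entities(tokens, labels, mask_token="[MASK]", p=1.0, rng=None):
    """Mask entity tokens (label != 'O') with probability p, returning the
    corrupted sequence X' and the indicator M with m_i = 1 iff x_i was masked."""
    rng = rng or random.Random(0)
    masked, indicator = [], []
    for tok, lab in zip(tokens, labels):
        if lab != "O" and rng.random() < p:
            masked.append(mask_token)
            indicator.append(1)
        else:
            masked.append(tok)
            indicator.append(0)
    return masked, indicator

def mlm_loss(token_probs, indicator):
    """L_mlm = -sum_i m_i * log p_theta(x_i | X'), summed over masked positions."""
    return -sum(m * math.log(p) for p, m in zip(token_probs, indicator))
```

With p=1.0 every entity token is masked; only masked positions contribute to the loss, matching the indicator M in the formula.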
Following Qian et al. (2021), we exploit Gaussian embeddings, which are innately more expressive than point embeddings. For a token embedding h_i, we compute its Gaussian embedding G_i ∼ N(μ_i, Σ_i) as follows:

μ_i = W_μ h_i + b_μ,    Σ_i = ELU(W_Σ h_i + b_Σ) + (1 + ε),

where μ_i denotes the semantics of x_i, the covariance matrix Σ_i represents the uncertainty, ELU denotes the exponential linear unit, and ε is set to e^{-14} for numerical stability. We use the KL-divergence to measure the similarity of entity structure:

KL(G_i || G_j) = 1/2 [ tr(Σ_j^{-1} Σ_i) + (μ_j - μ_i)^T Σ_j^{-1} (μ_j - μ_i) - k + log(det Σ_j / det Σ_i) ],

where k is the embedding dimension. Since the KL-divergence is asymmetric, we obtain the similarity by calculating it in both directions:

D(G_i, G_j) = 1/2 ( KL(G_i || G_j) + KL(G_j || G_i) ).

We define the optimization objective as minimizing this structure difference between newly generated entities and the original ones:

L_struct = Σ_i m_i D(G''_i, G_i).
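For diagonal covariance matrices the KL-divergence between two Gaussian embeddings has a closed form, and the symmetrised similarity follows directly. A minimal sketch, assuming diagonal Σ (a common simplification, not necessarily the paper's exact parameterization):

```python
import math

def kl_gaussian(mu0, var0, mu1, var1):
    """KL(N(mu0, diag(var0)) || N(mu1, diag(var1))) for diagonal Gaussians."""
    k = len(mu0)
    trace = sum(v0 / v1 for v0, v1 in zip(var0, var1))
    maha = sum((m1 - m0) ** 2 / v1 for m0, m1, v1 in zip(mu0, mu1, var1))
    logdet = sum(math.log(v1) - math.log(v0) for v0, v1 in zip(var0, var1))
    return 0.5 * (trace + maha - k + logdet)

def structure_similarity(mu0, var0, mu1, var1):
    """Symmetrised KL over both directions, since KL itself is asymmetric."""
    return 0.5 * (kl_gaussian(mu0, var0, mu1, var1)
                  + kl_gaussian(mu1, var1, mu0, var0))
```

Identical Gaussians give a divergence of zero, and the symmetrised score is invariant to swapping the two embeddings, which is the property the bidirectional calculation is after.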
To sum up, for a sequence X, the total loss is formulated as

L = L_mlm + λ L_struct,

where λ balances the two objectives. In this way, we generate entities that conform to the target domain entity structure, so we can combine the generated data with the source domain data to train the NER model.

Label Constrained Pre-train Task
In the label constrained pre-training module, to alleviate label inconsistency between different domains, we design several label constraint strategies to align predefined labels between the source and target domains. First, we extract all entities and their corresponding sentences from the support set, and find sentences containing these entities in the source domain dataset. Then, we form sentence pairs in which both sentences contain the same entity. The pre-trained language model (PLM) thus learns the different labels that the same entity receives in different domains, along with the relationship between these two labels. This PLM is subsequently trained on the source domain training data with the structure constrained objective to become an NER model. Finally, during inference, we utilize the pre-trained label-constrained model to compute the mapping relationships between labels in the source and target domains, which allows us to bridge the gap between the two domains. We also predefine a label mapping to bridge the source and target domains.
To alleviate label inconsistency among different domains, we propose a novel label constrained pre-training task that aligns inconsistent predefined labels between the source and target domains in both the training and prediction processes. Based on pre-trained contextual representations, we design a label mapping strategy that calculates the similarity of the various predefined labels in order to align them.
In the training process, we first extract from the target domain all the entities E, their labels L, and their sentences S, denoted as [e_1, e_2, ..., e_N], [l_1, l_2, ..., l_N], and [s_1, s_2, ..., s_N], respectively, where each s_i is a sequence of m tokens [x_1, x_2, ..., x_m]. We then select sentences containing these entities from the source domain data, denoted as [s'_1, s'_2, ..., s'_p], and match up all sentences that share the same entity as [e_1; s_1; s'_1, e_2; s_2; s'_2, ..., e_N; s_N; s'_p]. Then we swap the entities in these sentences with their corresponding labels, generate the representations of the two labels l_i and l'_i, and compute the KL-divergence between these representations in both directions, as in the structure constrained module. Minimizing this divergence achieves label alignment by learning the relationship between the labels of these entities in the source and target domains. Finally, we apply the saved parameters to initialize the NER model.
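The sentence-pair construction step can be sketched as an inverted index over source-domain entity mentions. The data structures here are illustrative assumptions (each example is a sentence plus a mention-to-label map), not the paper's actual format:

```python
def build_label_pairs(target_examples, source_examples):
    """Pair each target-domain entity mention with source-domain sentences
    containing the same mention.  Each example is (sentence, {mention: label});
    each pair is (mention, target_sent, target_label, source_sent, source_label)."""
    # Index source sentences by the entity mentions they contain.
    source_index = {}
    for sent, ents in source_examples:
        for mention, label in ents.items():
            source_index.setdefault(mention, []).append((sent, label))
    # For every target-domain mention, emit one pair per matching source sentence.
    pairs = []
    for sent, ents in target_examples:
        for mention, t_label in ents.items():
            for s_sent, s_label in source_index.get(mention, []):
                pairs.append((mention, sent, t_label, s_sent, s_label))
    return pairs
```

Each resulting tuple exposes the same entity under its two domain-specific labels, which is exactly the signal the PLM is trained on.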
In the prediction process, we propose a simple but efficient post-processing method to align labels from different domains. Since the NER model is pre-trained on the source domain, it is affected by the source domain's predefined labels when identifying entities, so some predicted labels are not part of the target domain's label set. Therefore, during inference, we post-process the results predicted by the model.
We employ the label constrained pre-training task to obtain contextual representations for the different entity labels and then compute the mapping relationships of entity categories between the source and target domains (including the 'other' category). Specifically, according to the official annotation guidelines for each dataset, we generate a descriptive statement for each entity category and calculate the KL-divergence between the representations of the description sentences of each source domain category and each target domain category. This process allows us to derive the mapping between entity categories in the source domain and those in the target domain.
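Given precomputed divergences between label-description representations, the mapping step reduces to a nearest-source-label lookup. A sketch under the assumption that lower divergence means a better match (the function name and score table are hypothetical):

```python
def map_labels(divergence, source_labels, target_labels):
    """Map each target-domain label to the source-domain label whose
    description representation has the smallest divergence to it.
    divergence[(s, t)] is a precomputed score (lower = more similar)."""
    return {t: min(source_labels, key=lambda s: divergence[(s, t)])
            for t in target_labels}
```

At inference time, a source-domain prediction such as 'GPE' can then be rewritten to its mapped target-domain label (e.g., 'location') as the post-processing step described above.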

Experiments
We validate our proposed method in the flat entity and nested entity settings; the details of the experiments are elaborated in this section. We use precision (P), recall (R), and F1 as evaluation metrics. All experimental results are the average score over five runs with random seeds.
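Entity-level precision, recall, and F1 over predicted and gold spans can be computed as follows. This is the standard exact-match formulation, shown for completeness rather than taken from the paper:

```python
def prf1(pred_spans, gold_spans):
    """Entity-level precision/recall/F1 over (start, end, type) span sets:
    a prediction counts as correct only on an exact span-and-type match."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Note that under exact matching, a boundary error such as including or excluding a leading 'the' counts as both a false positive and a false negative, which is why the boundary differences discussed earlier hurt cross-domain scores.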

Datasets
We validate our proposed method on datasets from various domains. We use WNUT (Social Media) (Derczynski et al., 2017), I2B2 (Medical) (Stubbs and Uzuner, 2015), and GUM (Mixed) (Zeldes, 2017) as target domains in our flat NER setting, and ACE2004 (Event) (Doddington et al., 2004) and ACE2005 (Event) (Walker and Consortium, 2005) as target domains in our nested NER setting. For the source domain, we use the OntoNotes 5.0 (General) (Pradhan et al., 2013) train/development/test splits released for the CoNLL 2012 shared task. For the target domains, we consider all datasets except OntoNotes, and extract the support set as mentioned in Section 2.2. The statistics of the datasets used in the experiments can be found in A.4.

Baselines
We compare the performance of SLC-DA on different datasets in the flat and nested entity settings with the following cross-domain few-shot NER models. 1) Direct-Transfer: trains the NER model on the source domain data and evaluates it on the target domain support set. 2) MELM (Zhou et al., 2022): our baseline method, which exploits the Masked Entity Language Model (MELM) to generate augmented NER data. 3) ProtoBERT (Snell et al., 2017; Fritzler et al., 2019; Hou et al., 2020)

Main Results
Table 1 presents the results of flat and nested NER.
Compared with these strong baselines, SLC-DA leads to significant improvements and achieves state-of-the-art performances in flat NER setting.
We also report competitive results in the newly proposed nested-entity cross-domain setting. In the flat NER setting, our method improves over the previous state of the art by 1.9% on average and over CONTaiNER by 12.1% on average. In the nested NER setting, our method improves over the previous state of the art by 6.45% and 11.2% on average, and over the baseline by 7.6% and 12.2%, on ACE04 and ACE05, respectively. These results demonstrate the effectiveness of our method. We did not report all baselines on the ACE04/ACE05 datasets because these datasets contain nested named entities, which pose a challenge for traditional baseline methods designed for flat entities. Such methods, for example CONTaiNER-style approaches, are not suitable for accurately handling nested entities, so including their results would not provide a fair comparison.
The experimental results demonstrate that both our method and the metric-based methods achieve good performance when the domain difference between source and target is small (where the target domain is the CoNLL dataset in the news domain). However, our method has significant advantages over other methods when the domain difference is large and target domain data is limited, as demonstrated by the experiments on the I2B2 (medical) and GUM (mixed) datasets. This indicates that our data augmentation method can help the NER model transfer smoothly from the source domain to the target domain. In addition, the methods use the target domain support set differently: our method uses only the generated augmented data and the source domain data as the training set, while the metric-based comparison methods use the support set itself as training data. As k-shot increases, our method performs slightly worse than ProML on the WNUT and I2B2 datasets but better than the other methods.
In the nested NER setting, MELM is even less effective than direct transfer. The main reason is that although five of the labels of ACE04 and ACE05 are identical to labels of OntoNotes, ACE04 and ACE05 contain nested entities, causing the data augmentation to fail. By contrast, our SLC-DA can learn the entity structure of the target domain and thus generate appropriate augmented data, helping the NER model learn the target domain knowledge effectively and further improving its ability to recognize entities in the target domain. Parameter settings can be found in A.1.
In our study, the transfer from OntoNotes to the CoNLL dataset cannot strictly be considered a cross-domain setup, since the OntoNotes dataset includes news data as one of its sources. However, to maintain consistency with existing literature and experimental settings, we conducted experiments under this setup as well.

Ablation Study
We conduct ablation studies to explore the effect of the structure-constrained and label-constrained modules on the overall performance. The results for flat NER and nested NER are reported in Table 2.
w/o structure-constrained: The structure constraint is not used in data augmentation, and entities predicted by the language model are directly used as newly generated data. When ablating the structure-constrained module, Table 2 shows that the performance of SLC-DA drops dramatically for both flat and nested NER in the 1-shot and 5-shot settings. In particular, for flat NER in both settings, removing the 'structure-constrained' module drops the F1-scores by over 20% and 10% on the CoNLL and WNUT datasets, respectively. For nested NER, the F1-scores drop by over 6% and 15% on the ACE04 and ACE05 datasets, respectively. Overall, these results prove that our structure constrained data augmentation module plays an important role in SLC-DA and that it is necessary to exploit entity structure information in data augmentation methods.
w/o label-constrained: The NER model is directly initialized from pre-trained bert-base-cased and no longer learns the relationship between entity labels from the source and target domains. When ablating the label-constrained module, Table 2 shows that the performance of SLC-DA drops slightly for both flat and nested NER in the 1-shot and 5-shot settings. Concretely, for flat NER in both settings, removing the 'label-constrained' module drops the F1-scores by approximately 1.5% and 1% on the CoNLL and WNUT datasets, respectively. For nested NER, the F1-scores drop by over 2% and 2.5% on the ACE04 and ACE05 datasets, respectively. These analyses demonstrate that the label-constrained module also has a consistent effect on the performance of SLC-DA and that modeling the relations among different labels of the same entities contributes to data augmentation.
In summary, both the 'structure-constrained' and 'label-constrained' modules have important effects on the performance of our proposed method. However, compared with removing the 'structure-constrained' module, removing the 'label-constrained' module from SLC-DA results in a more marginal decrease in performance on both flat and nested NER, illustrating that the 'structure-constrained' module is more influential in the SLC-DA method. We conjecture two possible reasons: I) the 'structure-constrained' module directly participates in the process of augmented data generation while the 'label-constrained' module does not, since the former is one of the optimization objectives of the data generation process while the latter is only used to initialize parameters; II) the scale of training data for the 'structure-constrained' module is larger than that for the 'label-constrained' module, leading to differences in the model's ability to capture entity structure and label-relation information.

Table 3 reports results on different entity labels in the 5-shot setting. Our method achieves the best F1-score (87.6%) on almost all labels and a markedly higher recall (86.6%) compared with MELM. These results further demonstrate the effectiveness of our method.

Results on Different Labels
From the per-label results, it can be seen that the MELM model only performed well on the 'ORG' category in the CoNLL dataset, because this category is included in the source domain OntoNotes. For the other categories, the MELM model showed high precision but low recall and F1 scores, indicating that it could not identify most entities belonging to the target domain labels. On the other hand, our SLC-DA model achieved better results in all categories except 'corporation' in the WNUT dataset. The increase in recall proves that our method helped the NER model learn to identify entities belonging to the target domain, demonstrating that our approach helps NER models transfer more smoothly and effectively from the source to the target domain.
As for the abnormal result on the label 'corporation', we conjecture that it is because both 'corporation' and 'group' overlap with the label 'ORG' in OntoNotes. Since there are more entities with the 'group' label than with the 'corporation' label, the model better learns the mapping between 'group' and 'ORG'. Consequently, some data that should be labeled 'corporation' is labeled 'group', which results in a precision decrease on the label 'group' and a recall drop on the label 'corporation'.

Case Study
In Table 4, we present some cases comparing words generated by SLC-DA and MELM to verify the effectiveness of our method. It can be seen that our method can generate appropriate entities according to the entity structure when encountering entities unseen in the source domain, and in addition, when the target domain contains more difficult

ProML (Chen et al., 2022) devises two prompting mechanisms for better training data generation, which is heavily influenced by the prompt strategies and requires heavy computation. However, these methods suffer from serious label inconsistency issues in cross-domain scenarios: the knowledge they learn in the source domain cannot be directly applied to the target domain. Moreover, due to the complexity of NER tasks, these works need to design complex learning strategies to be applied to few-shot NER. Unlike previous works, our work focuses on knowledge of the internal structure of entities. As far as we know, we are the first to introduce entity structure and label information into data augmentation to solve cross-domain few-shot NER, which may inspire further exploration of the internal structure knowledge of entities.

Data Augmentation
Data augmentation is a popular solution for few-shot learning tasks and has also been studied for cross-domain few-shot NER. However, noise is inevitably introduced in the process of producing augmented data, and as a token-level task, NER is vulnerable to this noise. Dai and Adel (2020) use label-wise token replacement, synonym replacement, and mention replacement to augment data, but do not increase the diversity of entities. Ding et al. (2020) and Zhou et al. (2022) respectively train a language model and a masked language model to fuse the alignment information of entities and labels, constraining the newly generated words to match the labels, but they inevitably introduce a lot of noise. While current methods aim to align the newly generated entities with the original labels, they face limitations in cross-domain few-shot NER: each domain may have distinct entity structures and labels, rendering the generated data incompatible with the target domain's entity structure. Consequently, conventional data augmentation methods are not directly applicable to cross-domain few-shot NER.

Conclusion
In this paper, we propose Structure and Label Constrained Data Augmentation (SLC-DA) for cross-domain few-shot NER, which introduces entity structure and label information from various domains into the data augmentation process to obtain high-quality synthetic data in the target domain. Experimental results on both flat and nested few-shot NER tasks show that our method significantly improves the quality of the generated data and helps the NER model find more target domain entities.

Limitations
Named entities are typically classified as flat, nested, or discontinuous entities, with significant structural differences between the three types.This makes it challenging for existing methods to effectively transfer from flat NER datasets to either nested or discontinuous NER datasets.While our experiments in this paper validate that our method can effectively transfer NER models from flat to nested datasets, we have yet to demonstrate its efficacy in transferring to discontinuous NER datasets.This is an avenue for future research, as our method's effectiveness on discontinuous datasets remains to be explored.

A Example Appendix
A.1 Parameter Settings

We elaborate the experimental settings of SLC-DA and the NER models. SLC-DA: We use bert-base-cased (Devlin et al., 2019) with a language model head for our structure constrained data augmentation model. The model is trained for 20 epochs on the source domain training data and the target domain support set, using the Adam optimizer with batch size 16 and learning rate 1e-5.
We calculate the loss for each word generated by the masked language model and select the best top-K. For a sentence containing n entities, this results in K^n new sentences (with K=5 in the presented experimental results). The additional time required by our method compared to MELM is not substantial because of the few-shot size of the target domain's support set. Take the WNUT 2017 dataset with a 5-shot setting as an example: the support set consists of 216 sentences and 334 entity words; MELM requires approximately 4 minutes for data augmentation, while our method takes around 13 minutes.
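The K^n expansion can be sketched as a Cartesian product over per-slot candidate lists. The slot representation (a map from token position to ranked MLM candidates) is an illustrative assumption:

```python
from itertools import product

def expand_sentences(tokens, entity_slots, top_k):
    """entity_slots maps a token position to that slot's ranked MLM candidates.
    Taking the top-K candidates per slot and their Cartesian product yields
    K**n augmented sentences for a sentence with n entity slots."""
    slots = sorted(entity_slots.items())
    candidate_sets = [cands[:top_k] for _, cands in slots]
    out = []
    for combo in product(*candidate_sets):
        new_tokens = list(tokens)
        for (pos, _), cand in zip(slots, combo):
            new_tokens[pos] = cand
        out.append(" ".join(new_tokens))
    return out
```

With K=5 and two entity slots this yields 25 candidate sentences per input, which explains why the augmentation cost grows with the number of entities per sentence rather than with the support set size alone.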
NER Model: For the flat NER setting, we use bert-base-cased (Devlin et al., 2019) with a CRF (Lample et al., 2016) head as the NER model. The model is trained for 5 epochs on the source domain training data and the target domain support set, using the Adam optimizer (Loshchilov and Hutter, 2019) with batch size 32 and learning rate 5e-5. For the nested NER setting, we use MRC-NER (Li et al., 2020).

We provide experimental results in the 10-shot, 50-shot, and full-set settings, as shown in Tables 5 and 6. The statistics of the support sets and augmented data used in our experiments are shown in Table 7.
In the 10-shot settings, our SLC-DA method still outperforms all other comparison methods on the CoNLL and GUM datasets. Although ProML performs better than ours on the WNUT and I2B2 datasets, our data-augmentation-based approach explores a different direction and can be combined with ProML to achieve further improvements.

A.3 Statistics of Predefined Labels
Table 8 displays all predefined labels from various datasets in both source and target domains.Concretely, 'OntoNotes' dataset from the source domain contains 18 predefined entity types.For the target domains, there are 18, 11, 7, 6, and 4 entity categories in 'I2B2', 'GUM', 'ACE04/ACE05', 'WNUT' and 'CoNLL' dataset, respectively.Note that we collect their corresponding label descriptions from the annotation guidelines.

A.4 Statistics of datasets used in experiments
The datasets utilized in our study are open-source and consist of various entity types. Further information can be found in Table 9, which lists the entity types present in each dataset along with the percentage of nested entities relative to the total number of entities. We report the size of the augmented data generated from the support set for each dataset in Table 10.

Figure 1 :
Figure 1: Examples of entities that have different structures and labels between the source and target domains. Different colors and combinations of squares represent different entity types and structures, respectively.

Figure 2 :
Figure 2: An overview of our proposed approach, which comprises structure constrained data augmentation (bottom) and the label constrained pre-training task (top). First, we introduce the word-to-tag relation for the label constrained pre-training task. Then, we compute the word-to-word relation for structure constrained data augmentation to generate augmented data for the target domain support set. Among the sentences generated for each sentence, we pick the top-K sentences that meet our satisfaction criteria, denoted as √. Finally, we merge the generated data with the source domain data to train the NER model.
: an implementation of Prototypical Network based on BERT. 4) NNShot and StructShot (Yang and Katiyar, 2020): a nearest neighbor based network and its Viterbi-decoding variant. 5) CONTaiNER (Das et al., 2022): a model based on contrastive learning to learn the relationship between

Table 1 :
Overall performances of SLC-DA and comparison methods for cross-domain few-shot NER. We use F1 as the evaluation metric. '†' represents results cited from the original paper, and '‡' represents results re-implemented by us. All experimental results are the average score over 5 runs with random seeds. entities of different categories. 6) ProML (Chen et al., 2022): multiple prompt schemas are designed to enhance label semantics.

Table 2 :
Ablation study on SLC-DA method for cross domain few-shot NER.

Table 3 :
Performance of SLC-DA and comparison methods on each entity type of all datasets in the 5-shot settings. 'Structure' denotes that only the structure constraint is used in data augmentation, and 'Label' denotes that only the label constraint is used in data augmentation.

Table 4 :
Case study of augmented data; blue represents all entities and red represents nested entities. For example, the comparison system produces [The United House administration]ORG, while SLC-DA produces [the [Reagan]PER administration]ORG, [the [Carter]PER administration]ORG, and [the [Clinton]PER administration]ORG.

Table 5 :
Overall performances of all systems on four datasets in the 10-shot setting for few-shot NER. We use F1 scores as evaluation metrics. '‡' represents results re-implemented by us. Each experimental result is the average over 5 runs with random seeds.

Table 6 :
Performances of systems in the 50-shot and full-set settings for few-shot NER. hours. For our label constrained data augmentation component, the training time per epoch is influenced by the difference in size between the source and target domains, ranging from 0.5 to 3 hours. For training the NER model, taking the OntoNotes 5.0 dataset of approximately 70k samples as an example, the training time is around 1 hour.

Table 7 :
Statistics of support and augmented data used in our experiments.

Table 8 :
Statistics of predefined labels in all datasets from both source and target domains.