Data Augmentation for Cross-Domain Named Entity Recognition

Current work in named entity recognition (NER) shows that data augmentation techniques can produce more robust models. However, most existing techniques focus on augmenting in-domain data in low-resource scenarios where annotated data is quite limited. In this work, we take the opposite direction and study cross-domain data augmentation for the NER task. We investigate the possibility of leveraging data from high-resource domains by projecting it into low-resource domains. Specifically, we propose a novel neural architecture to transform the data representation from a high-resource to a low-resource domain by learning the patterns (e.g., style, noise, abbreviations, etc.) in the text that differentiate them, along with a shared feature space where both domains are aligned. We experiment with diverse datasets and show that transforming the data to the low-resource domain representation achieves significant improvements over only using data from high-resource domains.


Introduction
Named entity recognition (NER) has seen significant performance improvements with the recent advances of pre-trained language models (Akbik et al., 2019; Devlin et al., 2019). However, the high performance of such models usually relies on the size and quality of training data. When used under low-resource or even zero-resource scenarios, those models struggle to generalize over diverse domains (Fu et al., 2020), and the performance drops dramatically due to the lack of annotated data. Unfortunately, annotating more data is often expensive and time-consuming, and it requires expert domain knowledge. Moreover, annotated data can quickly become outdated in domains where language changes rapidly (e.g., social media), leading to the temporal drift problem (Rijhwani and Preotiuc-Pietro, 2020).
A common approach to alleviate the limitations mentioned above is data augmentation, where automatically generated data increases the size and diversity of the training set, resulting in model performance gains. However, data augmentation in the context of NER is still understudied. Approaches that directly modify words in the training set (e.g., synonym replacement (Zhang et al., 2015) and word swap (Wei and Zou, 2019)) can inadvertently result in incorrectly labeled entities after modification. Recent work on NER for low-resource scenarios is promising (Dai and Adel, 2020; Ding et al., 2020), but it is limited to same-domain settings, and improvements decrease drastically with smaller training data sizes.
To facilitate research in this direction, we investigate leveraging data from high-resource domains by projecting it into low-resource domains. Based on our observations, the text in different domains usually presents unique patterns (e.g., style, noise, abbreviations, etc.). As shown in Figure 1, text in the newswire domain is long and formal, while text in the social media domain is short and noisy, often presenting many grammar errors, spelling mistakes, and language variations. In this work, we hypothesize that even though the textual patterns are different across domains, the semantics of the text are still transferable. Additionally, there are invariant patterns in the way named entities appear, and we assume that the model can learn from them. We introduce a cross-domain autoencoder model capable of extracting the textual patterns in different domains and learning a shared feature space where domains are aligned. We evaluate our data augmentation method by conducting experiments on two datasets, covering six different domains and ten domain pairs, showing that transforming the data from high-resource to low-resource domains is more powerful than simply using the data from high-resource domains. We also explore our data augmentation approach in the context of the NER task in low-resource scenarios for both in-domain and out-of-domain data.
To summarize, we make the following contributions: 1. We propose a novel neural architecture that can learn the textual patterns and effectively transform text from a high-resource to a low-resource domain. 2. We systematically evaluate our proposed method on two datasets, including six different domains and ten different domain pairs, and show the effectiveness of cross-domain data augmentation for the NER task. 3. We empirically explore our approach in low-resource scenarios and identify the cases where it can benefit the low-resource NER task.

Related work
Data augmentation aims to increase the size of training data by slightly modifying copies of already existing data or by adding newly generated synthetic data derived from existing data (Hou et al., 2018; Wei and Zou, 2019). It has become increasingly practical for NLP tasks in recent years, especially in low-resource scenarios where annotated data is limited (Fadaee et al., 2017). Without collecting new data, this technique reduces the cost of annotation and boosts model performance. Previous work has studied data augmentation for both token-level tasks (Şahin and Steedman, 2018; Gao et al., 2019) and sequence-level tasks (Wang and Yang, 2015; Min et al., 2020). Regarding data augmentation for NER, Dai and Adel (2020) conducted a study that primarily focuses on simple data augmentation methods such as synonym replacement (i.e., replacing a token with one of its synonyms) and mention replacement (i.e., randomly replacing a mention with another one of the same entity type). Other work studied sequence mixup (i.e., mixing eligible sequences in the feature space and the label space) to improve data diversity and enhance sequence labeling for active learning. Ding et al. (2020) presented a novel approach using adversarial learning to generate high-quality synthetic data, which is applicable to both supervised and semi-supervised settings.
In cross-domain settings, NER models struggle to generalize over diverse genres (Rijhwani and Preotiuc-Pietro, 2020; Fu et al., 2020). Most existing work mainly studies domain adaptation (Liu et al., 2020a; Jia et al., 2019; Wang et al., 2020; Liu et al., 2020b), which aims to adapt a neural model from a source domain to achieve better performance on data from the target domain. Liu et al. (2020a) proposed a zero-resource cross-domain framework to learn general representations of named entities. Jia et al. (2019) studied the knowledge of domain difference and presented a novel parameter generation network. Other efforts include different domain adaptation settings (Wang et al., 2020) and effective cross-domain evaluation (Liu et al., 2020b). In our work, we focus on cross-domain data augmentation. The proposed method aims to map data from a high-resource domain to a low-resource domain. By learning the textual patterns of the data from different domains, our proposed method transforms the data from one domain to another and boosts model performance with the generated data for NER in low-resource settings.

Proposed Method
In this work, we propose a novel neural architecture to augment the data by transforming text from a high-resource domain to a low-resource domain for the NER task. The overall neural architecture is shown in Figure 2.
We consider two non-parallel datasets: one from the source domain D_src and one from the target domain D_tgt. We first linearize all sentences by inserting every entity label before the corresponding word. At each iteration, we randomly pair a sentence from D_src and a sentence from D_tgt as the input to the model. The model starts with word-by-word denoising reconstruction and then performs detransforming reconstruction. In denoising reconstruction, we aim to train the model to learn a compressed representation of an input based on the domain it comes from in an unsupervised way. We inject noise into each input sentence by shuffling, dropping, or masking some words. The encoder is trained to capture the textual semantics and learn the pattern that makes each sentence different from sentences in other domains. Then we train the decoder by minimizing a training objective that measures its ability to reconstruct each sentence from its noisy version in its corresponding domain. In detransforming reconstruction, the goal is to transform sentences from one domain to another based on their textual semantics. We first transform each sentence from the source/target domain to the target/source domain using the model from the previous training step. The encoder then generates latent representations for the transformed sentences. After that, different from denoising reconstruction, the decoder here is trained to reconstruct each sentence from its transformed version in its corresponding domain. Besides denoising and detransforming reconstruction, we also train a discriminator to distinguish whether the latent vector generated by the encoder is from the source domain or the target domain. In this way, the encoder is pushed to generate a meaningful intermediate representation. Otherwise, the model would bypass the intermediate mapping step between domains and replace it by memorizing rather than generalizing.
In the following sections, we will introduce the details of our model architecture and the training algorithm.

Data Pre-processing
Following Ding et al. (2020), we perform sentence linearization so that the model can learn the distribution and the relationship of words and labels. In this work, we use the standard BIO schema (Tjong Kim Sang and Veenstra, 1999). Given a sequence of words w = {w_1, w_2, ..., w_n} and a sequence of labels l = {l_1, l_2, ..., l_n}, we first linearize the words with labels by putting every label l_i before the corresponding word w_i. Then we generate a new sentence x = {l_1, w_1, l_2, w_2, ..., l_n, w_n} and drop all O labels from it as the input. Special tokens <BOS> and <EOS> are inserted at the beginning and the end of each input sentence.
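The linearization above can be sketched in a few lines of Python (the function name and list-based token format are illustrative assumptions, not the authors' code):

```python
def linearize(words, labels):
    """Linearize a BIO-labeled sentence: place each non-O label
    immediately before its word, drop O labels, and wrap the
    result with <BOS>/<EOS> special tokens."""
    tokens = []
    for word, label in zip(words, labels):
        if label != "O":          # drop all O labels
            tokens.append(label)  # the label precedes its word
        tokens.append(word)
    return ["<BOS>"] + tokens + ["<EOS>"]

# "Jane visited Africa" with labels B-PER / O / B-LOC becomes:
print(linearize(["Jane", "visited", "Africa"],
                ["B-PER", "O", "B-LOC"]))
# → ['<BOS>', 'B-PER', 'Jane', 'visited', 'B-LOC', 'Africa', '<EOS>']
```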

Cross-domain Autoencoder
Word-level Robustness Our cross-domain autoencoder model involves an encoder Enc : x → z that maps input sequences from data space to latent space. Previous work (Shen et al., 2020) has demonstrated that input perturbations are particularly useful for discrete text modeling with powerful sequence networks, as they encourage the preservation of data structure in latent space representations. In this work, we perturb each input

Operation  Description
Shuffle    generate a new permutation of all words
Dropout    randomly drop a word from the sequence
Mask       randomly mask a word with the <MSK> token

Table 1: Word-level operations to inject noise into each input sequence. Each operation is randomly applied to input sequences with a certain probability p.
sentence by injecting noise with three different operations (see Table 1) to ensure that similar input sentences can have similar latent representations.
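As a rough sketch, the three operations of Table 1 could be applied as follows (how the operations compose is an assumption; the paper only specifies that each is applied with probability p):

```python
import random

def add_noise(tokens, p=0.1):
    """Inject word-level noise (Table 1), applying each operation
    with probability p: Shuffle (new permutation of all words),
    Dropout (drop a random word), Mask (replace a random word
    with <MSK>). How the operations compose is an assumption."""
    out = list(tokens)
    if random.random() < p:                          # Shuffle
        random.shuffle(out)
    if random.random() < p and len(out) > 1:         # Dropout
        del out[random.randrange(len(out))]
    if out and random.random() < p:                  # Mask
        out[random.randrange(len(out))] = "<MSK>"
    return out
```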
Denoising Reconstruction The neural architecture for denoising reconstruction is shown in Figure 2(a). Consider a pair of two non-parallel sentences: one sentence x_src from D_src in the source domain and another sentence x_tgt from D_tgt in the target domain. The model is trained to reconstruct each sentence by sharing the same encoder and decoder parameters while using different embedding lookup tables. The token embedders Emb_src and Emb_tgt each hold a lookup table for the corresponding domain. The encoder is a bi-directional LSTM that takes the noisy linearized sentences as input and returns hidden states as latent vectors. At each decoding step, the decoder takes the current word and the latent vector from the previous step as input.
Then it uses the vocabulary in the corresponding domain to project each vector from latent space to vocabulary space and predicts the next word with additive attention (Bahdanau et al., 2015). The training objective for denoising reconstruction is defined below. Its goal is to force the model to learn a shared space where both domains are aligned through the latent vectors and to generate a compressed version of the input sentence.
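A standard formulation of such a denoising objective, with C(·) denoting the noise function of Table 1 (the exact form used is an assumption), would be:

```latex
\mathcal{L}_{\mathrm{noise}}
  = \mathbb{E}_{x_{\mathrm{src}} \sim D_{\mathrm{src}}}
      \big[ -\log P_{\theta}\big(x_{\mathrm{src}} \mid C(x_{\mathrm{src}})\big) \big]
  + \mathbb{E}_{x_{\mathrm{tgt}} \sim D_{\mathrm{tgt}}}
      \big[ -\log P_{\theta}\big(x_{\mathrm{tgt}} \mid C(x_{\mathrm{tgt}})\big) \big]
```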
Detransforming Reconstruction In detransforming reconstruction, the first step is to transform each sentence from the source/target domain to the target/source domain. As shown in Figure 2(b), given a pair of sequences x_src and x_tgt from the source and target domains, we first map x_src to x̃_tgt in the target domain and x_tgt to x̃_src in the source domain by applying the model M_θ^(i−1), which includes the embedders, encoder, and decoder, from the previous training step. After that, we feed x̃_tgt and x̃_src to the encoder and generate compressed latent representations z_tgt and z_src. Then the decoder maps z_tgt back to x_src in the source domain and z_src back to x_tgt in the target domain. The goal is to learn the mapping between different domains and reconstruct a sequence from its transformed version in its corresponding domain.
The training objective for detransforming reconstruction is shown below.
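A plausible formulation of this back-translation-style objective (the notation is an assumption; x̃ denotes a sentence transformed by the previous-step model M_θ^(i−1)):

```latex
\mathcal{L}_{\mathrm{trans}}
  = \mathbb{E}_{x_{\mathrm{src}} \sim D_{\mathrm{src}}}
      \big[ -\log P_{\theta}\big(x_{\mathrm{src}} \mid \tilde{x}_{\mathrm{tgt}}\big) \big]
  + \mathbb{E}_{x_{\mathrm{tgt}} \sim D_{\mathrm{tgt}}}
      \big[ -\log P_{\theta}\big(x_{\mathrm{tgt}} \mid \tilde{x}_{\mathrm{src}}\big) \big],
\qquad
\tilde{x}_{\mathrm{tgt}} = M^{(i-1)}_{\theta}(x_{\mathrm{src}}),\;
\tilde{x}_{\mathrm{src}} = M^{(i-1)}_{\theta}(x_{\mathrm{tgt}})
```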
Domain Classification For domain classification, we apply adversarial training. We use the encoder to extract the textual patterns of sentences from different domains. The encoder generates the latent representations for the noised or transformed versions of the inputs, and the discriminator predicts whether a given latent vector is from the source or the target domain. The encoder, in turn, refines its representations to fool the discriminator, and in doing so ends up capturing the patterns needed to convert text from the source/target domain to the target/source domain. The discriminator is first trained during denoising reconstruction and then fine-tuned during detransforming reconstruction to distinguish source-domain sentences from target-domain sentences. As shown in Figure 2, the discriminator D_X takes inputs from both domains without knowing which domain the sequences come from, and predicts the corresponding domain of each input. The inputs are the latent vectors z, where both domains have been mapped to the same space. We formulate this task as a binary classification task. The training objective of adversarial training is described below.

Final Training Objective The final training objective is defined as the weighted sum of L_noise, L_trans, and L_adv, where λ_1, λ_2, and λ_3 are parameters that weight the importance of each loss.
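Assuming a standard binary cross-entropy loss for the discriminator (its exact form is an assumption), and with the final objective following directly from the description above:

```latex
\mathcal{L}_{\mathrm{adv}}
  = -\,\mathbb{E}_{(z,\,d)}\big[\, d \log D_X(z) + (1-d)\log\big(1 - D_X(z)\big) \,\big],
\qquad
\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{noise}}
            + \lambda_2 \mathcal{L}_{\mathrm{trans}}
            + \lambda_3 \mathcal{L}_{\mathrm{adv}}
```

where d ∈ {0, 1} is the true domain label of the latent vector z.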

Training Algorithm
Based on our observations, the model's ability to reconstruct sentences across domains relies heavily on the denoising reconstruction and domain classification components. Therefore, in this work, we train our model in two phases. In the first phase, we train the model with only denoising reconstruction and domain classification so that it can learn the textual patterns and generate compressed representations of the data from each domain. We use the perplexity of denoising reconstruction as the criterion to select the best model across iterations. In the second phase, we train the model with denoising reconstruction, detransforming reconstruction, and domain classification together. The goal is to align the compressed representations of the data from different domains so that the model can project the data from one domain to another. We use the sum of the perplexities of denoising and detransforming reconstruction as the model selection criterion.
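With the phase-dependent weights reported in Appendix A (λ = (1, 0, 10) in phase one, with λ_2 raised to 1 in phase two), the combined loss can be sketched as:

```python
def total_loss(l_noise, l_trans, l_adv, phase):
    """Weighted training objective per phase, using the lambda
    values from Appendix A. A sketch of the schedule described
    above, not the authors' implementation."""
    if phase == 1:
        # Phase 1: denoising reconstruction + domain classification only
        lam1, lam2, lam3 = 1.0, 0.0, 10.0
    else:
        # Phase 2: add detransforming reconstruction
        lam1, lam2, lam3 = 1.0, 1.0, 10.0
    return lam1 * l_noise + lam2 * l_trans + lam3 * l_adv
```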

Data Post-processing
We generate synthetic data using the cross-domain autoencoder model as described in Section 3.2. We convert the generated data from the linearized format to the same format as gold data. We use the following rules to post-process the generated data: 1) remove sequences that do not follow the standard BIO schema; 2) remove sequences that have <UNK> or <MSK> tokens; 3) remove sequences that do not have any entity tags.
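The three filtering rules can be sketched as a single validity check over the linearized tokens (the B-/I- tag prefixes and the helper's name are assumptions):

```python
def is_valid_bio(tokens, specials=("<UNK>", "<MSK>")):
    """Post-processing filter for generated linearized sequences,
    implementing the three rules above. A sketch: entity tags are
    assumed to use B-TYPE / I-TYPE prefixes, and each tag must be
    immediately followed by its word."""
    if any(t in specials for t in tokens):
        return False                        # rule 2: no <UNK>/<MSK>
    def is_tag(t):
        return t.startswith(("B-", "I-"))
    if not any(is_tag(t) for t in tokens):
        return False                        # rule 3: must contain entities
    prev_entity = None  # type of the entity the previous tag+word belonged to
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if is_tag(tok):
            # rule 1: a tag must precede a word, and I-X must continue X
            if i + 1 >= len(tokens) or is_tag(tokens[i + 1]):
                return False
            if tok.startswith("I-") and prev_entity != tok[2:]:
                return False
            prev_entity = tok[2:]
            i += 2                          # consume the tag and its word
        else:
            prev_entity = None              # a plain word ends the entity
            i += 1
    return True

synthetic = [["B-PER", "John", "I-PER", "Smith", "visited"],
             ["I-LOC", "Paris"],            # I- with no preceding B-
             ["B-ORG", "<MSK>"],            # contains a special token
             ["just", "plain", "words"]]    # no entity tags
print([is_valid_bio(s) for s in synthetic])  # → [True, False, False, False]
```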

Experiments
In this section, we introduce the cross-domain mapping experiment and the NER experiment. In the cross-domain mapping experiment, we analyze the reconstruction and generation capability of the proposed model. We then test our proposed method and evaluate the data generated by our model on the NER task. Details of the datasets, experimental setup, and results are described below.

Datasets
In our experiments, we use two datasets: the OntoNotes 5.0 dataset (Pradhan et al., 2013) and the Temporal Twitter dataset (Rijhwani and Preotiuc-Pietro, 2020). We select English data from six different domains: Broadcast Conversation (BC), Broadcast News (BN), Magazine (MZ), Newswire (NW), and Web (WB) from OntoNotes 5.0, and Social Media (SM) from the Temporal Twitter dataset.

Cross-domain Mapping
In this section, we describe the experimental settings of our proposed cross-domain autoencoder model and report the evaluation results.

Cross-domain Autoencoder
We use our proposed cross-domain autoencoder model (described in Section 3.2) to generate synthetic data. In our experiments, we build the vocabulary with the most common 10K words and 5 special tokens: <PAD>, <UNK>, <BOS>, <EOS>, and <MSK>. We use a bi-directional LSTM layer as the encoder and an LSTM layer as the decoder. For the discriminator, we use a linear layer. The hyper-parameters are described in Appendix A.
Results For the cross-domain mapping experiments, we consider two different source domains: NW and SM. The textual patterns in NW are similar to those in other domains, while the textual patterns in SM are quite different from those in other domains (see Appendix B on domain similarity). In Table 2, we report the results of cross-domain mapping experiments on ten different domain pairs. We use perplexity as the metric to evaluate reconstruction; lower perplexity indicates higher reconstruction quality. From the results of our experiments, we notice that the average perplexity with NW as the source domain is lower than with SM as the source domain, indicating that the model can more easily reconstruct both in-domain and out-of-domain sentences when the textual patterns are transferable.
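For reference, perplexity here is the exponential of the average per-token negative log-likelihood; a minimal computation:

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats):
    the exponential of the mean NLL. Lower values mean the model
    reconstructs the sequence with higher confidence."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns uniform probability over a 10-word vocabulary
# has per-token NLL = ln(10), i.e. a perplexity of 10.
print(perplexity([math.log(10)] * 4))  # ≈ 10.0
```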

Named Entity Recognition
Here we describe the experimental settings of the sequence labeling model for the NER experiment and report the evaluation results.

Sequence Labeling Model
We fine-tune a BERT (Devlin et al., 2019) model to evaluate our cross-domain mapping method on the NER task. BERT is pre-trained with masked language modeling and next sentence prediction objectives on general-domain text. We use BERT as the base model because it generates contextualized word representations and achieves high performance across many NLP tasks. We add a linear layer on top of the BERT encoder to classify each token into pre-defined entity types. The hyper-parameters are described in Appendix A.

Results
To evaluate the quality of the generated data, we conduct experiments on the NER task with ten different domain pairs. The results are shown in Table 3. We compare against several data augmentation baselines (Exp 2.1–2.7), among them: 4) Exp 2.4 Spelling Augmentation: substitute words using a dictionary of common spelling mistakes, 5) Exp 2.5 Synonym Replacement: substitute words with WordNet (Miller, 1995) synonyms, 6) Exp 2.6 Context Replacement: substitute words using BERT contextual word embeddings, and 7) Exp 2.7 DAGA from Ding et al. (2020).
In Table 4, we compare our approach with previous data augmentation methods for the NER task by reporting the F1 score. We consider two different experiments: NW → SM and SM → NW. We augment the data from the source domain as training data; the validation and test data are from the target domain. Based on the results, we observe that: 1) the improvement from traditional data augmentation (Exp 2.1~Exp 2.6) is quite marginal: only three of them outperform the baseline in both NW → SM and SM → NW, with performance gains of 0.34%~1% F1 score; 2) data augmentation techniques are effective when we only have a small number of training samples. For example, in SM → NW, when using only 1K training samples, all methods outperform the baseline; however, when using 4K training samples, only three of them do; 3) simply learning the textual patterns in each domain (Exp 2.7) may not always result in good performance: when the size of the training data is quite limited, the model struggles to learn the textual patterns and thus cannot achieve good performance; and 4) transforming text from the source domain to the target domain is much more powerful, as it outperforms the baseline by 8.7% and 10.1% on average in the two experiments, respectively.

Analysis
In this section, we take a step further towards exploring our approach in the context of the NER task in low-resource scenarios. Specifically, we investigate two crucial questions on data augmentation for both in-domain and out-of-domain datasets.
Q1: Does the model require a large amount of training data from the target domain? To answer this question, we randomly select 5%, 10%, 20%, 40%, and 80% of the samples from the target domain as the training data for our cross-domain autoencoder model. We consider NW as the source domain and SM as the target domain. After training the model, we generate samples from NW to SM and merge them into the NW training set as the new training set. Then we evaluate on the test set from SM. We establish a baseline that only uses the data from NW to train the model. Figure 3 shows the results of model performance on the NER task. From this figure, we can see that our method consistently achieves higher F1 scores than the baseline. Even with only 5% of the training samples from the target domain, our model still achieves significant improvements, averaging 4.81% over the baseline.

Figure 3: Model performance on the NER task with different amounts of target data and increasing amounts of augmented data for training. The test set is from SM. Each curve shows model performance on NER when using a different percentage of target data in the training set. The x-axis denotes the total number of training samples used; at 5%, for example, only 5% of all training instances come from the target data, while the rest come from the augmentation model. The y-axis denotes the F1 score for the NER task.
Q2: Can we generate enough data to reach competitive results in the target domain? We train our cross-domain autoencoder model with all the data from NW and only 5% of the data (500 samples in total) from SM. Then we generate synthetic data from NW to SM. After that, we evaluate on the NER task by combining the 5% of data from SM with different numbers of generated samples as training data. Figure 4 shows the F1 scores achieved by the sequence labeling model. With 5% of the data from SM, the model reaches only a 65.25% F1 score. As we add more generated samples, the model reaches up to a 77.59% F1 score, improving performance by 12.34%.

Discussion and Limitation
In this work, we explore how to leverage existing data from high-resource domains to augment the data in low-resource domains. We introduce a cross-domain autoencoder model that captures the textual patterns in each domain and transforms the data between different domains. However, there are several limitations: 1) the maximum lengths of the generated sequences: at each iteration, the maximum lengths of the generated sequences x̃_tgt and x̃_src are set to the same as the original input sequences x_src and x_tgt, respectively. This is not ideal because, intuitively, a short sentence in the source domain may correspond to a long sentence in the target domain. Fixing the maximum lengths of the generated sequences may hurt the model's ability to capture the semantics of the original input sequences and result in lower reconstruction quality. 2) non-parallel datasets: in our experiments, the datasets D_src and D_tgt are non-parallel, which means the sentences x_src and x_tgt do not correspond to each other. When we generate sequences from one domain to another, there is no guidance, and thus we cannot control the quality of the generated sequences x̃_tgt and x̃_src.

Figure 4: Model performance on the NER task with a fixed amount of target data and increasing amounts of augmented data for training. The training set starts with 500 instances from SM (5% of the training data in SM) plus the synthetic data generated from NW to SM. The test set is from SM. The x-axis denotes the number of generated samples combined with the data from SM for training the NER model. The y-axis denotes the F1 score for the NER task.
Although our proposed method cannot outperform the upper bound (i.e., training the model on the data from the target domain), it achieves a significant improvement over only using the data from the source domain, providing a strong lower-bound baseline for semi-supervised learning.

Conclusion
In this work, we present a novel neural architecture for data augmentation in which the model learns to transform data from a high-resource domain into data that resembles that of a low-resource domain. By training the model with reconstruction losses, it can extract the textual patterns in each domain and learn a feature space where both domains are aligned. We show the effectiveness of our proposed method by evaluating a model trained on the augmented data for NER, concluding that transforming text to low-resource domains is more powerful than only using the data from high-resource domains.
Our future work includes three directions: i) explore how to embed more supervision about forcing a better alignment in the latent space, ii) design a strategy to control the quality of the generated sequences, and iii) generalize our method to other NLP tasks such as text classification.

A Experimental Settings
This section describes the hyper-parameters for both cross-domain autoencoder model and sequence labeling model.

Cross-domain Autoencoder
For the cross-domain autoencoder, the embedding size is 512. The hidden state sizes of the LSTM and the linear layer are set to 1024 and 300, respectively. The probability of word dropout is set to 0.1 to inject noise into input sequences. We use Adam (Kingma and Ba, 2015) as the optimizer with an initial learning rate of 5e-4 for both the encoder and the decoder. For the discriminator, we use RMSprop as the optimizer with an initial learning rate of 5e-4. The batch size is 32 and the number of training epochs is set to 50. We apply gradient clipping (Pascanu et al., 2013) at 5 and the dropout rate is 0.5. In the first training phase, λ_1, λ_2, and λ_3 are set to 1, 0, and 10, respectively. In the second training phase, we change λ_2 to 1.

Sequence Labeling Model
For the sequence labeling model, the dropout rate is set to 0.1. We use AdamW (Loshchilov and Hutter, 2019) as the optimizer with an initial learning rate of 5e-5 and a weight decay of 0.01. The batch size is 32 and the number of training epochs is 20.

B Domain Similarity
This section empirically analyzes the performance gains obtained by training models with synthetic data generated by our method. To this end, we analyze the data from the source and target domains, as well as the generated data. For this analysis, we consider two sets: train and test. The training set is used to directly update model parameters, while the test set provides an unbiased evaluation of the final model. Entities that only appear in the training set are defined as non-overlapping entities, and those that appear in both the training set and the test set are defined as overlapping entities. The domain similarity is then defined as the percentage of overlapping entities among all entities. As shown in Table
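Under one plausible reading of this definition, the similarity can be computed as follows (comparing entity mentions by exact string match is an assumption):

```python
def domain_similarity(train_entities, test_entities):
    """Domain similarity as the percentage of training-set entities
    that also appear in the test set (overlapping entities). One
    plausible reading of the definition above; exact string match
    between entity mentions is an assumption."""
    train, test = set(train_entities), set(test_entities)
    if not train:
        return 0.0
    return 100.0 * len(train & test) / len(train)

# One of the three training entities also occurs in the test set:
print(domain_similarity({"New York", "Obama", "Twitter"},
                        {"Obama", "Paris"}))  # ≈ 33.3
```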