Investigation on Data Adaptation Techniques for Neural Named Entity Recognition

Data processing is an important step in various natural language processing tasks. As the commonly used datasets in named entity recognition contain only a limited number of samples, it is important to obtain additional labeled data in an efficient and reliable manner. A common practice is to utilize large monolingual unlabeled corpora. Another popular technique is to create synthetic data from the original labeled data (data augmentation). In this work, we investigate the impact of these two methods on the performance of three different named entity recognition tasks.


Introduction
Recently, deep neural network models have emerged in various fields of natural language processing (NLP) and have displaced conventional count-based methods as the mainstream approach (Lample et al., 2016; Vaswani et al., 2017; Serban et al., 2016). While they provide significant performance improvements, neural models often demand powerful hardware and a large amount of clean training data. However, there is usually only a limited amount of cleanly labeled data available, so techniques such as data augmentation and self-training are commonly used to generate additional synthetic data.
Significant progress has been made in recent years in designing data augmentations for computer vision (CV) (Krizhevsky et al., 2012), automatic speech recognition (ASR) (Park et al., 2019), natural language understanding (NLU) (Hou et al., 2018) and machine translation (MT) (Wang et al., 2018) in supervised settings. In addition, semi-supervised approaches using self-training techniques (Blum and Mitchell, 1998) have shown promising performance in conventional named entity recognition (NER) systems (Kozareva et al., 2005; Daumé III, 2008; Täckström, 2012). In this work, the effectiveness of self-training and data augmentation techniques on neural NER architectures is explored.
* Work completed while studying at RWTH Aachen University.
To cover different data situations, we select three different datasets. The English CoNLL 2003 dataset (Tjong Kim Sang and De Meulder, 2003) is the benchmark on which almost all NER systems report results; it is very clean, and the baseline models achieve an F1 score of around 92.6%. The English W-NUT 2017 dataset (Derczynski et al., 2017) consists of user-generated text and contains inconsistencies; baseline models reach an F1 score of around 52.7%. The GermEval 2014 dataset (Benikova et al., 2014) is a fairly clean German dataset with baseline scores of around 86.3%. We observe that the baseline scores on clean datasets such as CoNLL and GermEval can hardly be improved by data adaptation techniques, while the performance on the W-NUT dataset, which is relatively small and inconsistent, can be significantly improved.

Related Work

State-of-the-art Techniques in NER
Collobert et al. (2011) advance the use of neural networks (NN) for NER by proposing an architecture based on temporal convolutional neural networks (CNN) over the sequence of words. Since then, many articles have suggested improvements to this architecture. Huang et al. (2015) propose replacing the CNN encoder in Collobert et al. (2011) with a bidirectional long short-term memory (LSTM) encoder, while Lample et al. (2016) and Chiu and Nichols (2016) introduce a hierarchy into the architecture by replacing artificially designed features with additional bidirectional LSTM or CNN encoders. In other related work, Mesnil et al. (2013) have pioneered the use of recurrent neural networks (RNN) to decode tags.
Recently, various pre-trained word embedding techniques have offered further improvements over the strong baseline achieved by the neural architectures. Akbik et al. (2018) suggest using pre-trained character-level language models from which to extract hidden states at the start and end character positions of each word to embed any string in a sentence-level context. In addition, the embedding generated by unsupervised representation learning (Peters et al., 2018;Devlin et al., 2019;Taillé et al., 2020) has been used successfully for NER, as well as other NLP tasks. In this work, the strongest model for each task is used as the baseline model.

Data Adaptation in NLP
In NLP, generating synthetic data using forward or backward inference is a commonly used approach to increase the amount of training data. In strong MT systems, synthetic data that is generated by back-translation is often used as additional training data to improve translation quality (Sennrich et al., 2016). A similar approach using backward inference is also successfully used for end-to-end ASR (Hayashi et al., 2018). In addition, back-translation, as observed by Yu et al. (2018), can create various paraphrases while maintaining the semantics of the original sentences, resulting in significant performance improvements in question answering.
In this work, synthetic annotations, which are generated by forward inference of a model that is trained on annotated data, are added to the training data. The method of generating synthetic data by forward inference is also called self-training in semi-supervised approaches. Kozareva et al. (2005) use self-training and co-training to recognize and classify named entities in the news domain. Täckström (2012) uses self-training to adapt a multi-source direct transfer named entity recognizer to different target languages, "relexicalizing" the model with word cluster features.  propose cross-view training, a semisupervised learning algorithm that improves the representation of a bidirectional LSTM sentence encoder using a mixture of labeled and unlabeled data.
In addition to the promising pre-trained embeddings that are successfully used for various NLP tasks, masked language modeling (MLM) can also be used for data augmentation. Kobayashi (2018) and  propose to replace words with other words that are predicted using the language model at the corresponding position, which shows promising performance on text classification tasks.
Recently, Kumar et al. (2020) discussed the effectiveness of different pre-trained transformer-based models for data augmentation on text classification tasks. For neural MT, Gao et al. (2019) suggest replacing randomly selected words in a sentence with a mixture of several related words based on a distribution representation. In this work, we explore the use of MLM-based contextual augmentation approaches for various NER tasks.

Self-training
Although the amount of annotated training data is limited for many NLP tasks, additional unlabeled data is available in most situations. Semi-supervised learning approaches make use of this additional data. A common way to do this is self-training (Kozareva et al., 2005; Täckström, 2012). At a high level, it consists of the following steps:
1. An initial model is trained using the labeled data.
2. This model is used to annotate the additional unlabeled data.
3. A subset of this data is selected and used in addition to the labeled data to retrain the model.
For the performance of the method it is critical to find a heuristic to select a good subset of the automatically labeled data. The selected data should not introduce too many errors, but at the same time they should be informative, i.e. they should be useful to improve the decision boundary of the final model. One selection strategy (Drugman et al., 2016) is to calculate a confidence measure for all unlabeled sentences and to randomly sample sentences above a certain threshold.
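The selection step can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `select_for_self_training`, the tuple layout and the fixed seed are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

# (tokens, predicted tags, confidence) for one automatically labeled sentence.
# This layout is an assumption made for the sketch.
Annotated = Tuple[List[str], List[str], float]


def select_for_self_training(
    annotated: Sequence[Annotated],
    threshold: float,
    k: int,
    seed: int = 0,
) -> List[Annotated]:
    """Keep sentences whose confidence exceeds `threshold`, then sample k of them."""
    confident = [s for s in annotated if s[2] > threshold]
    rng = random.Random(seed)
    # If fewer than k sentences pass the filter, all of them are used.
    return rng.sample(confident, min(k, len(confident)))
```

The returned sentences would then be appended to the labeled training set before the model is retrained.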
We consider two different confidence measures in this work. The first, hereinafter referred to as c_1, is the posterior probability p(y | x) of the tag sequence y given the word sequence x. It is obtained by normalizing s(x, y), the unnormalized log score assigned by the model to the sequence, over all tag sequences; s(x, y) consists of an emission model q^E_i and a transition model q^T.
For the second confidence measure, we take into account the normalized tag scores at each position. To get a confidence score for the entire sequence, we take the minimum tag score over all positions. We refer to this measure as c_2.
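A possible formalization of the two measures is sketched below. It assumes a linear-chain CRF output layer over a sentence of length n, and it assumes that the normalized tag score at a position is a softmax over the emission scores at that position; the argument structure of q^E_i and q^T is likewise an assumption.

```latex
\begin{align}
c_1(y, x) &= p(y \mid x)
           = \frac{\exp\big(s(x, y)\big)}{\sum_{y'} \exp\big(s(x, y')\big)} \\
s(x, y)   &= \sum_{i=1}^{n} \Big( q^{E}_{i}(y_i \mid x) + q^{T}(y_i \mid y_{i-1}) \Big) \\
c_2(y, x) &= \min_{1 \le i \le n}
             \frac{\exp\big(q^{E}_{i}(y_i \mid x)\big)}
                  {\sum_{y'_i} \exp\big(q^{E}_{i}(y'_i \mid x)\big)}
\end{align}
```

Under this reading, c_1 scores the whole tag sequence at once, whereas c_2 is determined by the single least certain position in the sentence.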

MLM-based Data Augmentation
Instead of using additional unlabeled data, we apply MLM-based data augmentation specifically for NER by masking and replacing original text tokens while maintaining the labels. For each masked token x_i, the MLM scores every token w ∈ V of the model vocabulary given x̃, the original sentence with x_i masked, and the predicted token x̂_i is selected from these scores.
There are several configurations that can affect the performance of the data augmentation method: the technique for selecting the tokens to be replaced, the order of token replacement in case of multiple replacements, and the criterion for selecting the best token from the predicted ones. This section studies the effect of these configurations.

Sampling
Entity spans (entities of arbitrary length) make the training sentences used in NER tasks special. Since there is no guarantee that a predicted token belongs to the same entity type as the original token, it is important to ensure that the masked token is not in the middle of an entity span and that the existing label is not damaged. In this work, we propose three different types of token selection inside and outside of entity spans:
• Entity replacement: Collect entity spans of length one in the sentence and randomly select the entity span to be replaced. In this case, exactly one entity in the sentence is replaced. The sentences without entities or with longer entity spans are skipped.
• Context replacement: We consider tokens with the label "O" as context and alternate between two setups: (1) Select only context tokens before and after entities, and (2) select a random subset of context tokens among all context tokens.
• Mixed: Select uniformly at random the number of masked tokens between two and the sentence length among all tokens in the sentence.
The first approach allows only one entity to be generated and thus benefits from conditioning on the full sequence context. However, it does not guarantee the correct labeling for the generated token. The disadvantage of the second approach is that we do not generate new entity information, but only a new context for the existing entity spans. Even if a new entity is generated, it keeps the original "O" label, since the pipeline contains no NER classification step. The disadvantage of the third approach is that a token in the middle of an entity span may be selected, so that the label is no longer valid. The sampling approaches are depicted in Figure 1. In addition, the number of replaced tokens should be properly tuned to avoid inadequate generation. In this work, we do not set any upper bound on the number of replaced tokens and leave such investigation to future work.
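The three token selection strategies can be illustrated with a short sketch over BIO tags. All function names are hypothetical, and two readings are assumed here: entity replacement considers only spans of length one and skips a sentence when it has none, and "before and after entities" means the context tokens directly adjacent to an entity span.

```python
import random
from typing import List, Tuple


def entity_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs (end exclusive) of entity spans in BIO tags."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


def select_entity(tags: List[str], rng: random.Random) -> List[int]:
    """Entity replacement: one randomly chosen single-token entity, or []."""
    singles = [s for s, e in entity_spans(tags) if e - s == 1]
    return [rng.choice(singles)] if singles else []


def select_context(tags: List[str], rng: random.Random,
                   adjacent_only: bool = True) -> List[int]:
    """Context replacement: "O" tokens next to entities, or a random subset."""
    context = [i for i, t in enumerate(tags) if t == "O"]
    if adjacent_only:
        neighbours = set()
        for start, end in entity_spans(tags):
            neighbours.update((start - 1, end))
        return [i for i in context if i in neighbours]
    if not context:
        return []
    size = rng.randint(1, len(context))
    return sorted(rng.sample(context, size))


def select_mixed(tags: List[str], rng: random.Random) -> List[int]:
    """Mixed: between two and len(tags) positions among all tokens."""
    if len(tags) < 2:
        return []
    size = rng.randint(2, len(tags))
    return sorted(rng.sample(range(len(tags)), size))
```

Only the mixed strategy can return a position inside a multi-token entity span, which corresponds to the labeling risk described above.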

Order of Generation
In our method, we predict exactly one masked token at a time. Since our sampling approaches allow multiple tokens to be replaced, there are two possible options for the generation order:
• Independent: Each consecutive masking and prediction is made on top of the original sequence.
• Conditional: Each consecutive masking and prediction is made on top of the prediction of the previous step.
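The difference between the two generation orders can be made concrete with a small sketch. The `predict` callable is a hypothetical stand-in for the MLM: it receives a token list and the index to be masked and returns a replacement token.

```python
from typing import Callable, List

# Hypothetical predictor interface: (context tokens, index to mask) -> new token.
Predictor = Callable[[List[str], int], str]


def generate(tokens: List[str], positions: List[int], predict: Predictor,
             conditional: bool) -> List[str]:
    """Replace the tokens at `positions` one at a time."""
    result = list(tokens)
    for i in positions:
        # Conditional: the context already contains earlier replacements.
        # Independent: every prediction sees the untouched original sentence.
        context = result if conditional else tokens
        result[i] = predict(list(context), i)
    return result
```

With a single selected position, both settings produce the same sentence, since there is no earlier replacement to condition on.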

Criterion
The criterion is an important part of the generation process. On the one hand, we want our synthetic sequence to be reliable (highest token probability), on the other hand, it should differ as much as possible from the original sequence (high distance).
Figure 1: Sampling approaches example for the MLM data augmentation. Gray color refers to the tokens with the entity type "O" (context), green color refers to the PER entity type and purple color refers to the ORG entity type. The red square represents the subset of tokens which is used for replacement.
We propose two criteria for choosing the best token from the five-best predictions:
• Highest probability (top token): Choose the target token only based on the MLM probability for that token.
• Highest probability and distance (joint criterion): Choose the target token based on the product of the MLM probability for the token and Levenshtein distance (Levenshtein, 1966) between the original sentence and the sentence with the new token.
Regardless of the combination of the parameters, the sentences must be changed. As a result, we guarantee that there is no duplication in our synthetic data with the original dataset.
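A sketch of the two criteria follows. It assumes that the Levenshtein distance is computed at character level on the space-joined sentences and that candidates identical to the original token are discarded so that the sentence always changes; `choose_token` and its argument layout are hypothetical.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def choose_token(tokens: List[str], position: int,
                 candidates: List[Tuple[str, float]], joint: bool) -> str:
    """Pick a replacement for tokens[position] from (token, probability) pairs."""
    original = " ".join(tokens)
    # Identical predictions are excluded so that the sentence always changes.
    changed = [(w, p) for w, p in candidates if w != tokens[position]]
    if not changed:
        raise ValueError("no candidate differs from the original token")

    def score(item: Tuple[str, float]) -> float:
        word, prob = item
        if not joint:
            return prob  # top token criterion
        new = " ".join(tokens[:position] + [word] + tokens[position + 1:])
        return prob * levenshtein(original, new)  # joint criterion

    return max(changed, key=score)[0]
```

The joint criterion can prefer a less probable candidate when it differs more strongly from the original token.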

Discussion
The main disadvantage of using a language model (LM) for the augmentation of NER datasets is that the LM does not take the labeling of the sequence into account: the prediction of the masked token depends only on the surrounding tokens. As a result, we lose important information for decision-making. Incorporating label information as described in  into the MLM would be one way to tackle this problem.
Another way to reduce the noise in the generated dataset is to apply a filtering step to the generation pipeline. One way to incorporate filtering into the augmentation process is to set the threshold for the MLM token probabilities: If the probability of the predicted token is less than a threshold, we ignore such prediction. However, the problem of misaligning token labels is not resolved. Therefore, we adapt our proposed confidence measure from Section 3 for filtering.
In this work, we do not discuss the selection of the MLM itself as well as the effects of fine-tuning on the specific task.
All datasets are originally labeled with the BIO scheme, but following Lample et al. (2016) we convert it to the IOBES scheme for training and evaluation. For our baseline models, we do not use any additional data apart from the provided training data. Development data is only used for validation. For CoNLL we skip all document boundaries. The statistics for the datasets are shown in Table 1.
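The conversion from BIO to IOBES can be sketched as follows; `bio_to_iobes` is an illustrative helper and not necessarily the routine used in the experiments.

```python
from typing import List


def bio_to_iobes(tags: List[str]) -> List[str]:
    """Convert BIO tags to IOBES: single-token spans get S-, span ends get E-."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        # Split only once so that labels containing hyphens stay intact.
        prefix, label = tag.split("-", 1)
        following = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = following == "I-" + label
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "E-") + label)
    return converted
```

For example, the tag sequence B-PER I-PER O B-ORG becomes B-PER E-PER O S-ORG.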

Model Description
The Bidirectional LSTM-Conditional Random Field (BiLSTM-CRF) model (Lample et al., 2016) is a widely used architecture for NER tasks. Together with pre-trained word embeddings, it surpasses other neural architectures. We use the BiLSTM-CRF model implemented in the Flair framework version 0.5, which delivers state-of-the-art performance.
The BiLSTM-CRF model consists of 1 hidden layer with 256 hidden states. Following Reimers and Gurevych (2017), we set the initial learning rate to 0.1 and the mini-batch size to 32. For each task, we select the best performing embedding from all embedding types in Flair. For training models with CoNLL data, we use pre-trained GloVe (Pennington et al., 2014) word embedding (Grave et al., 2018) together with the Flair embedding (Akbik et al., 2018) as input into the model. For W-NUT experiments, we use the roberta-large embedding provided by the Transformers library (Wolf et al., 2019). The German dbmdz/bert-base-german-cased embedding is used for experiments with the GermEval dataset.

Unlabeled Data
Additional unlabeled data is required for self-training. To match the domain of the test data, we collect the data from the sources mentioned in the individual task descriptions.
W-NUT Like the test data, the data for W-NUT consists of user comments from Reddit, which were created in April 2017 (comments in the test data were created from January to March 2017), as well as titles, posts and comments from StackExchange, which were created from July to December 2017 (the content of the test data was created from January to May 2017). The documents are filtered according to length and community as described in the task description paper and tokenized with the TweetTokenizer from nltk.
CoNLL The data was sampled from news articles in the Reuters corpus from October and November 1996. The sentences are tokenized using spaCy and filtered (by removing common patterns like the date of the article, sentences that do not contain words and sentences with more than 512 characters as this is the length of the longest sentence in the CoNLL training data).
GermEval We randomly sampled additional data from sentences extracted from news and Wikipedia articles provided by the Leipzig Corpora Collection. Apart from tokenizing the sentences using spaCy, we do not apply any additional preprocessing or filtering.

Self-training
Before applying the approach described in Section 3, we need to find the thresholds t for the confidence measures c_1 and c_2 for each corpus. We evaluate both confidence measures on the development sets of the three corpora. One way to evaluate confidence measures is to calculate the confidence error rate (CER). It is defined as the number of misassigned labels (i.e. the confidence is above the threshold and the prediction of the model is incorrect, or the confidence is below the threshold and the prediction is correct) divided by the total number of samples. Figure 2 shows the CER of c_1 and c_2 on the development set of W-NUT for different threshold values t. For a threshold of 0.0 or 1.0, the CER reduces to the percentage of incorrect or correct predictions, respectively, as either all or no confidence values are above the threshold. For c_2 there is a clear optimum at t̂_2 = 0.42, and for larger and smaller thresholds the CER rises rapidly.
In contrast, the optimum for c_1 at t̂_1 = 0.57 is not as pronounced. This motivated us to consider not only the best value in terms of CER, but also a lower threshold t_1 = 0.42 with a slightly worse CER. In this way, we include more sentences where the model is less confident without introducing too many additional errors. The threshold values for CoNLL and GermEval are selected analogously. Table 2 provides an overview of all threshold values that are used in all subsequent experiments.
The unlabeled data is annotated using the baseline models described in Section 3 (we choose the best runs based on the score on the development set) and is filtered based on the different confidence thresholds. Then we sample a random subset of size k from the remaining sentences. For tasks where the data comes from different sources, e.g. news and Wikipedia for GermEval, we sample uniformly from the different sources to avoid overrepresenting a particular domain. The selected additional sentences are then appended to the original set of training sentences to create a new training set that is used to retrain the model from scratch.
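The confidence error rate used for choosing the thresholds can be computed as in the sketch below. The function names and the input format, a list of (confidence, is_correct) pairs, are assumptions, as is the convention that a confidence equal to the threshold counts as not above it.

```python
from typing import Iterable, Sequence, Tuple


def confidence_error_rate(samples: Sequence[Tuple[float, bool]],
                          threshold: float) -> float:
    """Share of samples that are confident but wrong, or unconfident but right."""
    errors = sum(
        1 for confidence, correct in samples
        if (confidence > threshold) != correct
    )
    return errors / len(samples)


def best_threshold(samples: Sequence[Tuple[float, bool]],
                   grid: Iterable[float]) -> float:
    """Return the grid value with the lowest confidence error rate."""
    return min(grid, key=lambda t: confidence_error_rate(samples, t))
```

A grid search of this kind yields the CER-optimal threshold; a deliberately lower threshold, as chosen for c_1 above, trades a slightly higher CER for more selected sentences.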
To validate our selection strategy, we test our pipeline with different confidence thresholds for both confidence measures. Figure 3 shows the results on the test set of W-NUT. For each threshold, 3394 sentences are sampled, i.e. the size of the training set is doubled. The results confirm our selection strategy: the lower threshold t_1 and the CER-optimal threshold t̂_2 give the best results of all tested threshold values. In particular, t_1 performs better than the CER-optimal t̂_1.
Table 3 shows the results of self-training on all three datasets. For each of them, we test the three selection strategies by sampling new sentences amounting to 0.5, 1 and 2 times the size of the original training data. For W-NUT we get absolute improvements of up to 2% in the F1 score over the baseline. On larger datasets like CoNLL and GermEval these effects disappear, and we only get improvements of up to 0.1% and in some cases even deterioration.

MLM-based Data Augmentation
We follow the approach explained in Section 4 and generate synthetic data using pre-trained models from the Transformers library. We concatenate original and synthetic data and train the NER model on the new dataset. We test all possible combinations of the augmentation parameters from Section 4 on the W-NUT dataset. Table 4 shows the result of the augmentation. When sampling with one entity, there is no difference between independent and conditional generation, since only one token in a sentence is masked. We therefore only carry out an independent generation for this type of sampling. We report an average result among 3 runs along with a standard deviation of the model with different random seeds.
The W-NUT and CoNLL datasets are augmented using a pre-trained English BERT model 10 and GermEval using a pre-trained German BERT model 11. We do not fine-tune these models.
Sampling from the context of the entity spans shows significant improvements on the W-NUT test set. First, it includes implicit filtering: only the sentences with entities are selected and replaced. Therefore, compared to other methods, we add fewer new sentences (except when replacing entities). Second, since replacing tokens with a language model should result in substitution with similar words, the label is less likely to be destroyed when context tokens are replaced.
10 https://huggingface.co/bert-large-cased-whole-word-masking
11 https://huggingface.co/bert-base-german-cased
On the other hand, the mixed sampling strategy performs the worst among all methods. We believe that this is the effect of the additional noise included in the dataset (by noise we mean all types of noise, e.g. incorrect labeling, grammatical errors, etc.). Allowing up to the entire sequence to be masked destroys the sentence in some cases, e.g. incorrect and multiple occurrences of the same word can occur. In Appendix B we present examples of augmented sentences for each augmentation approach and each dataset. Additionally, we report the average number of masked tokens.
To analyze the resulting models, we plot the average confidence scores of the test set as well as the number of errors per sentence for the best baseline model and best augmented model. We use the best baseline system with 54.6% F1 score and the best model corresponding to the setup of line 8 in Table 4 with 57.4% F1 score. We count the error every time the model predicts a correct label with low confidence or an incorrect label with high confidence. We set high and low confidence to be 0.6 and 0.4 respectively. Figure 4 shows that the augmented model makes a more reliable prediction than the best baseline system model.
We repeat the promising MLM generation pipeline on the CoNLL and GermEval datasets. These datasets contain more entities in the original data. In addition, even though entity replacement sampling did not work well on the W-NUT dataset, we repeat these experiments, since generating new entities is the most interesting scenario for using MLM augmentation.
Although the MLM-based data augmentation leads to improvements of up to 3.6% F1 score on the W-NUT dataset, Table 5 shows that this effect disappears when we apply our method to larger and cleaner datasets such as CoNLL and GermEval. We believe there are several reasons for that. First, our MLM-based data augmentation method does not guarantee the accuracy of the labeling after augmentation. So for larger datasets, there are many more possibilities to increase the noise of the corpus. Moreover, we do not study how well pre-trained models suit the specific task, which might be crucial for the data augmentation. Besides, for GermEval augmentation, we use a BERT model with three times fewer parameters than for W-NUT and CoNLL.
Table 4: Results of the MLM-based augmentation on the W-NUT dataset. entity refers to sampling tokens from entity spans of length one, mixed means sampling from the complete sequence, context indicates sampling from the entity span context, random context denotes sampling from random context labels. conditional refers to the conditional generation and independent refers to the independent generation type. The top token criterion selects the token based on the highest probability, and the joint criterion takes into account the token probability and the Levenshtein distance.

Filtering of Augmented Data
As discussed in Section 4, an additional data filtering step can be applied on top of the augmentation process. We report results on two different filtering methods: First, we set a threshold for the probability of the predicted token (in our experiments we use the probability 0.5); Second, we filter sentences by minimum confidence scores as discussed in Section 3. We set the minimum confidence score according to Table 2. We apply filtering to the worst and best-performing model according to the numbers in Table 4. The filtering results on W-NUT are shown in Table 6.
In the case of the worst model, filtering based on the token probability improves the performance of the model by 2.6% compared to the unfiltered one. Filtering by confidence score does not improve the performance, but significantly reduces the standard deviation of the score. The results are expected, since by using the token probability we increase the sentence reliability and completely change the synthetic data, while using the confidence score we filter on the same synthetic data. In the case of the better model, we see the opposite trend. Here filtering leads to performance degradation and an increase in the standard deviation.
We apply the same filtering techniques for CoNLL and GermEval. Table 7 shows the results for 3 different models. We choose the best, the worst and the model with the highest number of additional sentences for filtering. In the case of the worst model, the performance is improved by 1.1% F1 score with the minimum confidence filtering for CoNLL and 0.5% F1 score for GermEval compared to the unfiltered version. However, for the best model, the results remain at the same level and the baseline systems are not improved.
Although we do not achieve significant improvements compared to the baseline system, we see potential in MLM-based augmentation in combination with filtering.

Discussion and Future Work
In this work, we present results of data adaptation methods on various NER tasks. We show that MLM-based data augmentation and self-training approaches lead to improvements on the small and noisy W-NUT dataset.
We propose two different confidence measures for self-training and empirically estimate the best thresholds. Our results on the W-NUT dataset show the effectiveness of the selection strategies based on those confidence measures. For MLM-based data augmentation, we suggest multiple ways of generating synthetic NER data. Our results show that even without generating new entity spans we are able to achieve better results.
Table 7: F1 scores of using filtered augmented data on CoNLL and GermEval. The first line represents the augmentation method from Table 4.
For future work, we would like to incorporate label information into the augmentation pipeline by either conditioning the token predictions on labels or adding additional classification steps on top of the token prediction. Another important question is the choice of the MLM and the impact of task-specific fine-tuning. Further investigations into the filtering step should also be carried out.
For both self-training and MLM-based data augmentation we would like to improve the integration into the training process. The contribution of the original training data to the loss function could be increased, or additional data could be weighted by their confidence. Finally, we would like to test whether we can combine the two methods to achieve additional improvements.

B.1 Data statistics
The number of masked tokens depends solely on the augmentation strategy discussed in Section 4. Table 9 reports the average number of masked tokens per sentence on the W-NUT dataset for each augmentation strategy. Table 10 and Table 11 show the average number of masked tokens per sentence for the most promising augmentation strategies for the CoNLL and GermEval tasks.

B.2 Data Examples
We show data examples on the different datasets by varying one augmentation parameter while keeping the others unchanged. Table 12 shows the examples on the W-NUT dataset. In Table 13 and Table 14 we collect the examples for GermEval and CoNLL.

Parameter | Value | Example
Sampling | - | <PER>Christopher Reeve</PER> -- <PER>Reeve</PER> was best known for playing the comic book hero <PER>Superman</PER> in four movies but his greatest heroics came in real life.
Sampling | entity | <PER>Christopher Reeve</PER> -- <PER>Reeve</PER> was best known for playing the comic book hero <PER>Batman</PER> in four movies but his greatest heroics came in real life .
Sampling | context | <PER>Christopher Reeve</PER> The <PER>Reeve</PER> is best known for playing the comic book superhero <PER>Superman</PER> in four movies but his greatest heroics came in real life.
Sampling | random context | <PER>Christopher Reeve</PER> -- <PER>Reeve</PER> popular best known for popular popular popular book hero <PER>Superman</PER> in four movies but his popular heroics came in real popular popular
Sampling | mixed | <PER>Christopher Reeve</PER> The <PER>He</PER> is best known for playing the comic book superhero <PER>Superman</PER> in the films but his greatest heroics came in real life.
Order | - | Four weeks ago <ORG>Stagecoach </ORG> said it had agreed the deal in principle, and it expected to pay 110 million stg-plus for the firm, with <ORG>Swebus</ORG>' current owner, the state railway company.
Order | independent | Four days ago <ORG>it</ORG> said it had made the deal in principle, and it expected to raise 110 million euros to the operation contract including <ORG>Swebus</ORG> ' current employer being the state railway company.
Order | conditional | Two years ago <ORG>Stagecoach</ORG> said it had made the deal in principle, and was expected to pay 110 million marks for the operation, with <ORG>Swebus</ORG>'s owner, the Swedish railway company.
Criterion | - | <ORG>ZDF</ORG> said <LOC> Germany </LOC> imported 47,600 sheep from <LOC> Britain </LOC> last year, nearly half of total imports.
Criterion | top token | <ORG>He</ORG> said <LOC> they </LOC> imported more goods from <LOC> Germany </LOC> that year, nearly half of all number.
Criterion | joint | <ORG>ZDF</ORG> this <LOC> this </LOC> this 47,600 sheep this <LOC> this </LOC> this year this nearly half of this imports.