AdvPicker: Effectively Leveraging Unlabeled Data via Adversarial Discriminator for Cross-Lingual NER

Neural methods have been shown to achieve high performance in Named Entity Recognition (NER), but rely on costly high-quality labeled data for training, which is not always available across languages. While previous works have shown that unlabeled data in a target language can be used to improve cross-lingual model performance, we propose a novel adversarial approach (AdvPicker) to better leverage such data and further improve results. We design an adversarial learning framework in which an encoder learns entity domain knowledge from labeled source-language data and better shared features are captured via adversarial training, where a discriminator selects less language-dependent target-language data via similarity to the source language. Experimental results on standard benchmark datasets demonstrate that the proposed method benefits strongly from this data selection process and outperforms existing state-of-the-art methods, without requiring any additional external resources (e.g., gazetteers or machine translation).


Introduction
Named entity recognition (NER) is a fundamental information extraction task, which seeks to identify named entities in text and classify them into predefined entity types (such as person, organization, location, etc.). It is key in various downstream tasks, e.g., question answering (Mollá et al., 2006). Neural NER models are highly successful for languages with a large amount of quality annotated data. However, most languages do not have enough labeled data to train a fully supervised model. This motivates research on cross-lingual transfer, which leverages labeled data from a source language (e.g., English) to address the lack of training data in a target language. In this paper, following Wu and Dredze (2019) and Wu et al. (2020a), we focus on zero-shot cross-lingual NER, where labeled data is not available in the target language. Code is publicly available at https://aka.ms/AdvPicker.
The state-of-the-art methods for zero-shot cross-lingual NER are mainly divided into three categories: i) feature-based methods (Wu and Dredze, 2019; Wu et al., 2020b; Pfeiffer et al., 2020), which train a NER model to capture language-independent features of the labeled source-language data and then apply it to the target language; ii) translation-based methods (Mayhew et al., 2017; Xie et al., 2018), which build a pseudo target-language dataset by translating from labeled source-language data and mapping entity labels; and iii) pseudo-labeling methods, which generate pseudo-labeled data for training a target-language NER model via a source-language model (Wu et al., 2020a) or annotation projection (Ni et al., 2017).
However, each method has its own disadvantages. Feature-based methods only learn knowledge from the source language and cannot leverage any target-language information. Translation-based methods require high-quality translation resources, which are expensive to obtain. And pseudo-labeling methods assume that all pseudo-labeled data is beneficial for cross-lingual transfer learning, which is not always the case.
Therefore, here we propose a novel approach, AdvPicker, which combines feature-based and pseudo-labeling methods while not requiring any extra costly resources (e.g., translation models or parallel data). Furthermore, to address the described problems, we enhance the source-language NER model with unlabeled target-language data via adversarial training. Unlike other pseudo-labeling methods, we only leverage the language-independent pseudo-labeled data selected by an adversarial discriminator, to alleviate overfitting the model to language-specific features of the source language.
Specifically, we first train an encoder and a NER classifier on labeled source-language data to learn entity domain knowledge. Meanwhile, a language discriminator and the encoder are trained on a token-level adversarial task, which enhances the ability of the encoder to capture shared features. We then apply the encoder and the NER classifier to unlabeled target-language data to generate pseudo-labels and use an adversarial discriminator to select less language-specific data samples. Finally, we utilize knowledge distillation to train a target-language NER model on this selected dataset.
We evaluate our proposed AdvPicker over 3 target languages on standard benchmark datasets. Our experimental results show that the proposed method benefits strongly from this data selection process and outperforms existing SOTA methods, without requiring any additional external resources (e.g., gazetteers or machine translation).
Our major contributions are as follows:
• We propose a novel approach to combine feature-based and pseudo-labeling methods via language adversarial learning for cross-lingual NER;
• We adopt an adversarial discriminator to select which language-independent data to leverage in training a cross-lingual NER model, to improve performance. To the best of our knowledge, this is the first successful attempt at selecting data via an adversarial discriminator for cross-lingual NER;
• Experiments on standard multi-lingual datasets show that AdvPicker achieves new state-of-the-art results in cross-lingual NER.
Related Work

Cross-Lingual NER
Cross-lingual transfer for NER has been widely studied in recent years. Prior works are divided into three categories: feature-based, translation-based, and pseudo-labeling. Feature-based methods generally use language-independent features to train a NER model on the labeled source-language data; such features include word clusters (Täckström et al., 2012), Wikifier features (Tsai et al., 2016), gazetteers (Zirikly and Hagiwara, 2015), and aligned word representations (Ni et al., 2017; Wu and Dredze, 2019). Moreover, to obtain language-independent features, adversarial learning has been applied to word/char embedding layers (Huang et al., 2019; Bari et al., 2020) or encoders (Zhou et al., 2019; Keung et al., 2019). Translation-based methods generally use pseudo target-language data translated from labeled source-language data. Ni et al. (2017) proposed to project labels from the source language into the target language by using word alignment information. Most recent methods translate the annotated corpus in the source language to the target language word-by-word (Xie et al., 2018) or phrase-by-phrase (Mayhew et al., 2017) and then copy the labels of each word/phrase to their translations, while Jain et al. (2019) proposed to translate full sentences in the source language and project entity labels to the target-language sentences.
To leverage unlabeled target-language data, pseudo-labeling methods generate pseudo-labels by annotation projection on comparable corpora (Ni et al., 2017) or via models trained on source-language labeled data (Wu et al., 2020a).
In this paper, we propose AdvPicker, an approach that requires no translation and combines feature-based and pseudo-labeling methods. Moreover, we leverage pseudo-labeled data differently from other pseudo-labeling methods: through adversarial training, we select language-independent pseudo-labeled data for training a new target-language model.

Language Adversarial Learning
Language-adversarial training (Zhang et al., 2017) was proposed for the unsupervised bilingual lexicon induction task. It has since been applied to induce language-independent features for cross-lingual tasks in NER (Zhou et al., 2019; Xie et al., 2018), text classification (Chen et al., 2019b), and sentiment classification (Chen et al., 2018). Keung et al. (2019) proposed a multilingual BERT with sentence-level adversarial learning. However, this method does not improve cross-lingual NER performance significantly. To address this limitation, AdvPicker uses multilingual BERT with token-level adversarial training for cross-lingual NER, which induces more language-independent features for each token embedding.

Knowledge Distillation
Knowledge distillation was proposed to compress models (Buciluȃ et al., 2006) or ensembles of models (Rusu et al., 2016;Hinton et al., 2015;Sanh et al., 2019;Mukherjee and Hassan Awadallah, 2020) via transferring knowledge from one or more models (teacher models) to a smaller one (student model). Besides model compression, knowledge distillation has also been applied to various tasks, like cross-modal learning (Hu et al., 2020), machine translation (Weng et al., 2020), and automated machine learning (Kang et al., 2020).
In this paper, we adapt knowledge distillation to leverage unlabeled data in the cross-lingual NER task. This helps the student model learn richer information from easily obtainable data (with pseudo-labels).

AdvPicker
In this section, we introduce our approach (AdvPicker), which utilizes adversarial learning to select language-independent pseudo-labeled data for training an effective target-language NER model. Figure 1 illustrates the framework of the proposed AdvPicker. Specifically, as shown in Figure 1(a), we train an encoder and a NER classifier on the labeled source-language data. Meanwhile, a language discriminator and the encoder are trained on the token-level adversarial task. We then apply the encoder and classifier to unlabeled target-language data to generate pseudo-labels and use the adversarial discriminator to select the most language-independent pseudo-labeled data samples. Finally, we utilize knowledge distillation to train a target-language NER model on this selected dataset.
In the following sections, we describe the language-independent data selection process, including token-level adversarial training, data selection by the discriminator, and knowledge distillation on the selected language-independent data.

Token-level Adversarial Training for Cross-Lingual NER
To avoid the model overfitting to language-specific features of the source language, we propose a token-level adversarial learning (TLADV) framework, which is shown in Figure 1(a). Following Keung et al. (2019), we formulate adversarial cross-lingual NER as a multi-task problem: i) NER and ii) binary language classification (i.e., source vs. target language). For the NER task, we train the encoder and classification layer on NER-annotated text in the source language. The encoder learns to capture the NER features of the input sentences and then the classification layer tries to predict the entity labels for each word based on their feature vectors. For the language classification task, we train a language discriminator and the encoder on the labeled source-language dataset and unlabeled target-language data. The language discriminator is added to classify whether an embedding generated by the encoder is associated with the source or the target language. The encoder tries to produce language-independent embeddings that are difficult for the language discriminator to classify correctly. We define the encoder, the language discriminator, and their objectives as follows:

h = E(x)    (1)

where E is the feature encoder which generates language-independent feature vectors h for each sentence x. Following Keung et al. (2019), we use multilingual BERT as the feature encoder here and denote the encoder as mBERT-TLADV.

NER Classifier. We feed h into the NER classifier, which is a linear classification layer with the softmax activation function, to predict the entity label of token x:

P_θ(Y_NER) = softmax(W_NER h + b_NER)    (2)
where P_θ(Y_NER) ∈ R^{|C|} is the probability distribution over entity labels for token x and C is the entity label set. W_NER ∈ R^{d_e×|C|} and b_NER ∈ R^{|C|} denote the to-be-learned parameters, with d_e being the dimension of vector h.

Language Discriminator. The language discriminator comprises two linear transformations with a ReLU activation for classifying token embeddings. The sigmoid function is used to predict the probability that h belongs to the source language:

P_θ(Y_DIS) = σ(W_2 ReLU(W_1 h + b_1) + b_2),  W_1 ∈ R^{d_d×d_e},  W_2 ∈ R^{d×d_d}    (3)
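As a concrete illustration, the two heads described above can be sketched in PyTorch. This is a minimal sketch, not the released implementation: it assumes d_e = 768 (mBERT hidden size) and d_d = 500 (the discriminator hidden dimension stated in Section 4), and the module names are ours.

```python
import torch
import torch.nn as nn


class NERClassifier(nn.Module):
    """Linear + softmax head predicting entity labels from token features h (Eq. 2)."""

    def __init__(self, d_e: int, num_labels: int):
        super().__init__()
        self.linear = nn.Linear(d_e, num_labels)  # W_NER, b_NER

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_e) -> label distribution of shape (batch, seq_len, |C|)
        return torch.softmax(self.linear(h), dim=-1)


class LanguageDiscriminator(nn.Module):
    """Two linear layers with ReLU, sigmoid output: P(token comes from source language) (Eq. 3)."""

    def __init__(self, d_e: int, d_d: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_e, d_d),   # W_1, b_1
            nn.ReLU(),
            nn.Linear(d_d, 1),     # W_2, b_2
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # returns a per-token probability in (0, 1), shape (batch, seq_len)
        return torch.sigmoid(self.net(h)).squeeze(-1)
```

In AdvPicker both heads consume the same token features h produced by the shared mBERT encoder.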
with d_d being the hidden dimension of the discriminator and d the language classification task label size. σ is the sigmoid function used to obtain the language probability of each word. For language-adversarial training, we have three loss functions: the encoder loss L_E, the language discriminator loss L_DIS, and the NER task loss L_NER.

Figure 1: Framework of the proposed AdvPicker. a) Overview of the token-level adversarial training process. The lines illustrate the training flows and the arrows indicate forward or backward propagation. Blue lines show the flow for source-language samples and grey ones are for the target language. L_NER, L_E, and L_DIS are the losses of the NER classifier, encoder, and discriminator modules in AdvPicker, respectively (Section 3.1). The encoder and NER classifier are trained together on source-language samples (blue solid lines on the left side). The encoder and discriminator are trained for the adversarial task (on the right side). b) Language-independent data selection on pseudo-labeled data. c) Knowledge distillation on selected data.
Note that we do not add these three loss functions together for backward propagation. Parameters of different components in adversarial learning are alternately updated based on the corresponding loss function, similarly to Keung et al. (2019). Specifically, for the NER task, the parameters of the encoder and the NER classifier are updated based on L_NER. For the adversarial task, the parameters of the encoder are updated based on L_E, while the parameters of the discriminator are updated based on L_DIS. Algorithm 1 shows the pseudocode for the adversarial training process.
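The alternating update scheme above can be sketched as a single training step. This is a simplified, hypothetical sketch, not the paper's code: `enc`, `clf`, and `disc` stand for the encoder, NER classifier (returning raw logits here), and discriminator, each with its own optimizer over the parameter groups named in the text.

```python
import torch
import torch.nn.functional as F


def adversarial_step(enc, clf, disc, opt_ner, opt_enc, opt_dis,
                     src_x, src_y_ner, x_both, y_lang):
    """One alternating update: NER loss, then encoder adversarial loss,
    then discriminator loss, each with its own optimizer."""
    # 1) NER task: update encoder + classifier on labeled source data (L_NER)
    opt_ner.zero_grad()
    logits = clf(enc(src_x))                      # (batch, seq_len, |C|)
    loss_ner = F.cross_entropy(logits.flatten(0, 1), src_y_ner.flatten())
    loss_ner.backward()
    opt_ner.step()

    # 2) Encoder adversarial loss: fool the discriminator with flipped labels (L_E)
    opt_enc.zero_grad()
    p_lang = disc(enc(x_both))                    # per-token source-language probability
    loss_e = F.binary_cross_entropy(p_lang, 1.0 - y_lang)
    loss_e.backward()
    opt_enc.step()

    # 3) Discriminator loss: predict the true language label (L_DIS);
    #    the encoder output is detached so only the discriminator is updated
    opt_dis.zero_grad()
    p_lang = disc(enc(x_both).detach())
    loss_dis = F.binary_cross_entropy(p_lang, y_lang)
    loss_dis.backward()
    opt_dis.step()
    return loss_ner.item(), loss_e.item(), loss_dis.item()
```

Keeping three separate optimizers mirrors the text: the three losses are never summed, and each backward pass only drives its own parameter group.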
The three losses are standard (binary) cross-entropy objectives:

L_NER = CE(P_θ(Y_NER), y_NER)
L_DIS = BCE(P_θ(Y_DIS), y_DIS)
L_E = BCE(P_θ(Y_DIS), ȳ_DIS)

where x is the sentence, y_DIS ∈ {0, 1} is the ground-truth label for the language classification task, ȳ_DIS ∈ {0, 1} is the negative (flipped) label for the language classification task, and y_NER ∈ R^{N×|C|} is the ground truth of the named entity recognition task for the corresponding input x.

Language-Independent Data Selection
To obtain the pseudo-labels ŷ_T-NER for target-language examples, we apply the learned mBERT-TLADV model to the unlabeled target-language data x_T. However, the pseudo-labeled dataset D = {x_T, ŷ_T-NER} may then contain many language-specific samples. We therefore leverage the adversarial discriminator to select language-independent samples from the generated set. Through this confrontation, the language discriminator tries to make the encoder unable to distinguish the language of a token. In this way, the encoder should pay more attention to features that are less related to the source language when learning the NER task. After adversarial training, the language discriminator can still correctly classify certain embeddings with high probability. We define these as language-specific samples. Other samples are ambiguous regarding language (for example, sentences with probability close to 0.5), and they are defined as samples that are more language-independent.
In order to quantify the language independence of each sample, we use the language discriminator to calculate the probability P_θ(Y_DIS, x_T) that the sentence x_T is from the source language. Since the discriminator operates at the token level, the sentence-level probability is computed from the token probabilities of Eq. (3):

P_θ(Y_DIS, x_T) = (1/N) Σ_{i=1}^{N} P_θ(Y_DIS | h_i)

where h_i is the feature vector of the i-th token of the target-language sentence x_T and N is its length.
In order to select from the pseudo-labeled data, we design an index score to represent the language independence of a sentence x_T (the degree of model confusion between the languages). We assume it follows a uniform distribution and reaches its maximum when P_θ(Y_DIS, x_T) = 0.5. Conversely, the index is at its minimum when P_θ(Y_DIS, x_T) is equal to 0 or 1.
We select the target-language samples whose scores fall in the top ρ as language-independent data, where ρ is a hyper-parameter specifying the retained ratio of pseudo-labeled data. Finally, we obtain the selected target-language pseudo-labeled dataset D_subset = {x_T_subset, ŷ_T-NER_subset}, a subset of the target-language pseudo-labeled dataset.
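The score and top-ρ selection can be sketched as follows. The exact functional form of the index score is not given in this excerpt; 1 − |2p − 1|, which peaks at p = 0.5 and vanishes at p ∈ {0, 1}, is one simple choice consistent with the description, and the function names are ours.

```python
import torch


def language_independence_score(p_src: torch.Tensor) -> torch.Tensor:
    """Assumed score: maximal (1.0) when p = 0.5, minimal (0.0) when p is 0 or 1."""
    return 1.0 - (2.0 * p_src - 1.0).abs()


def pick_language_independent(sentences, p_src: torch.Tensor, rho: float = 0.8):
    """Keep the top-rho fraction of sentences ranked by language-independence score.

    p_src[i] is the sentence-level source-language probability of sentences[i].
    """
    scores = language_independence_score(p_src)
    k = int(rho * len(sentences))
    idx = torch.argsort(scores, descending=True)[:k]
    return [sentences[i] for i in idx.tolist()]
```

With ρ = 0.8 (the value used in the experiments), 80% of the pseudo-labeled sentences would be retained.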
There are two reasons for selecting language-independent samples with the language discriminator. First, these samples' feature vectors contain less language-specific information, which is helpful for cross-lingual transfer learning. Second, the NER classifier is trained on source-language labeled data. Therefore, it is more likely to generate high-quality predicted labels on selected target-language samples whose feature vectors are similar to those of source-language samples.

Knowledge Distillation on Language-Independent Data
To leverage such less language-dependent data, we train a target-language NER model on the selected pseudo-labeled data D_subset. Considering that soft targets can carry much helpful information that hard targets cannot (Hinton et al., 2015), we use the soft labels of the selected pseudo-data to train a student model h_T_stu via knowledge distillation. To construct the student model, we use the pre-trained cased multilingual BERT (mBERT) (Devlin et al., 2019) as the initialization, followed by a linear layer with a softmax function:

P_θ(Y_T-NER) = softmax(W_T-NER h_T_stu + b_T-NER)    (7)

where P_θ(Y_T-NER) is the entity label probability distribution output by the student model. W_T-NER ∈ R^{d_s×|C|} and b_T-NER ∈ R^{|C|} are learnable parameters of the student NER model.
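Structurally, the student is just an encoder plus a linear-softmax head. A minimal sketch (assuming a generic encoder module in place of mBERT; the class name is ours):

```python
import torch
import torch.nn as nn


class StudentNER(nn.Module):
    """Hypothetical student model: an encoder (e.g. mBERT) followed by a
    linear + softmax head producing per-token label distributions."""

    def __init__(self, encoder: nn.Module, d_s: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(d_s, num_labels)  # W_T-NER, b_T-NER

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x -> token features (batch, seq_len, d_s) -> distributions (batch, seq_len, |C|)
        return torch.softmax(self.head(self.encoder(x)), dim=-1)
```

In the paper the encoder is initialized from pre-trained cased mBERT; here any module mapping inputs to d_s-dimensional token features would slot in.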
Following Wu et al. (2020a), the loss function L_KD is defined as the mean squared error (MSE) between the prediction output P_θ(Y_T-NER) and the soft labels of the selected data:

L_KD = (1/N) Σ_{i=1}^{N} (ŷ_T-NER_subset,i − P_θ(Y_T-NER)_i)²

where ŷ_T-NER_subset ∈ D_subset are the selected soft labels with N tokens and P_θ(Y_T-NER) is the prediction probability for the selected sentence x_T. By minimizing the MSE loss, the student model is trained in a supervised fashion on the pseudo-labels of the selected target-language data.
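The distillation objective itself is a one-liner; the sketch below averages the squared error over all tokens and label dimensions (function name is ours):

```python
import torch


def kd_mse_loss(student_probs: torch.Tensor, soft_labels: torch.Tensor) -> torch.Tensor:
    """MSE between the student's label distributions and the teacher's soft
    labels, averaged over all tokens and classes (following Wu et al., 2020a).

    Both tensors have shape (batch, seq_len, |C|).
    """
    return ((student_probs - soft_labels) ** 2).mean()
```

Training then simply backpropagates `kd_mse_loss(model(x), soft_labels)` through the student.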
For inference in the target language, we simply apply the student model to test cases to predict the probability distribution of entity labels for each token, as in Eq. (7). To ensure the entity labels follow the NER tagging scheme, the prediction result is generated by Viterbi decoding (Chen et al., 2019a).
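Viterbi decoding here amounts to finding the highest-probability label path that respects the tagging scheme's hard transition constraints (in BIO: I-X may only follow B-X or I-X). A self-contained sketch under that assumption, not the cited implementation:

```python
import numpy as np


def bio_viterbi(prob: np.ndarray, labels: list[str]) -> list[str]:
    """Return the most probable label path under hard BIO constraints.

    prob: (seq_len, num_labels) per-token label probabilities.
    """
    n, m = prob.shape

    def allowed(prev: str, cur: str) -> bool:
        # I-X is only valid right after B-X or I-X; everything else is free.
        if not cur.startswith("I-"):
            return True
        return prev in ("B-" + cur[2:], "I-" + cur[2:])

    logp = np.log(prob + 1e-12)
    score = np.full((n, m), -np.inf)
    back = np.zeros((n, m), dtype=int)
    for j in range(m):
        if not labels[j].startswith("I-"):  # a sentence cannot open with I-X
            score[0, j] = logp[0, j]
    for t in range(1, n):
        for j in range(m):
            for i in range(m):
                cand = score[t - 1, i] + logp[t, j]
                if allowed(labels[i], labels[j]) and cand > score[t, j]:
                    score[t, j] = cand
                    back[t, j] = i
    # Backtrack from the best final label.
    path = [int(np.argmax(score[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [labels[j] for j in reversed(path)]
```

For example, a token sequence whose raw argmax would start with an illegal I-PER is decoded into a valid B-PER/I-PER span instead.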

Datasets

Table 1: Statistics of the benchmark datasets (Language, Type, Train, Dev, Test).

Following Wu and Dredze (2019), we use the BIO labeling scheme (Farber et al., 2008) and the official split of train/validation/test sets. As in previous works (Täckström et al., 2012; Jain et al., 2019; Wu et al., 2020b), for all experiments we always use English as the source language and the others as target languages. Our models are trained on the English training set and evaluated on the test sets of each target language.
Note that for each target language, we only use the text of its training set as unlabeled target-language data for training our model. In adversarial learning, we randomly sample data from all target languages and construct a target-language dataset of the same size as the English training dataset.

Implementation Details
We implement AdvPicker using PyTorch 1.6.0. For data pre-processing, we leverage WordPiece (Wu et al., 2016) to tokenize each sentence into a sequence of sub-words which are then fed into the model. For the encoder (i.e., E in Eq. (1)) and student model (i.e., h_T_stu in Eq. (7)), we employ the pre-trained cased multilingual BERT in HuggingFace's Transformers (Wolf et al., 2020) as the backbone model, which has 12 transformer blocks, 12 attention heads, and 768 hidden units. We empirically select the following hyperparameters. Specifically, referring to the settings of Wu et al. (2020b), we freeze the parameters of the embedding layer and the bottom three layers of the multilingual BERT used in the encoder and the NER student model. We train all models using a batch size of 32, a maximum sequence length of 128, a dropout rate of 0.1, and use AdamW (Loshchilov and Hutter, 2019) as the optimizer. For sequence prediction, we apply Viterbi decoding (Chen et al., 2019a) to all models in our experiments.
Following Keung et al. (2020), in all experiments the other hyper-parameters are tuned on each target-language dev set. We train all models for 10 epochs and choose the best model checkpoint on the target dev set. For adversarial learning, we set a learning rate of 6e-5 for the NER loss L_NER and 6e-7 for both the encoder loss L_E and the discriminator loss L_DIS. For knowledge distillation, we use a learning rate of 6e-5 for the student models. We set the hidden dimension of the discriminator to 500. For data selection, ρ is set to 0.8. Following Tjong Kim Sang (2002a), we use the entity-level F1-score as the evaluation metric. Moreover, experiments are repeated 5 times with different random seeds on each corpus.
Note that the data selected by the discriminator is generated by combining the outputs of mBERT-TLADV models trained with different random seeds, as we observe only a small number of high-scoring samples in the selected data generated by each individual model. Specifically, for each target-language sentence x_T, there are 5 corresponding soft label sequences generated by 5 different mBERT-TLADV models. From those, only the sequence with the highest sum of per-token predicted label confidence is kept.
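The selection among seed runs can be sketched as follows. We interpret "sum of each predicted label confidence" as the sum over tokens of the maximum class probability, which is an assumption; the function name is ours.

```python
import torch


def keep_most_confident(soft_label_runs: list[torch.Tensor]) -> torch.Tensor:
    """Given the soft label sequences produced for one sentence by models
    trained with different seeds (each of shape (seq_len, |C|)), keep the
    run whose summed per-token top confidence is highest."""
    sums = [run.max(dim=-1).values.sum().item() for run in soft_label_runs]
    best = max(range(len(sums)), key=sums.__getitem__)
    return soft_label_runs[best]
```

Applied per sentence over the 5 seed runs, this yields the single soft-label sequence used for distillation.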
Our models are trained on a Tesla P100 GPU (16GB). mBERT-TLADV has 178M parameters and trains in ≈130 min, while the student models h_T_stu have 177M parameters and take ≈21 min.

Table 2: Results of our approach and prior state-of-the-art methods for zero-shot cross-lingual NER. * denotes the version of the method without additional data.

Comparison with State-of-The-Art Results
Table 2 reports the results of the compared methods. These include AdvPicker, prior SOTA methods, and two re-implemented baseline methods, i.e., mBERT-TLADV (Section 3.1) and mBERT-ft (mBERT fine-tuned on labeled source-language data). Note that some existing methods use a translation model as an additional data transfer source, whereas our method does not. For a fair comparison, we compare against the version of UniTrans (Wu et al., 2020a) w/o translation (as reported in their paper). Our method outperforms the existing methods with F1-scores of 75.01, 79.90, and 82.90, using only source-language labeled data and target-language unlabeled data. In particular, compared with UniTrans* (previous SOTA), AdvPicker achieves an F1-score improvement ranging from 1.41 in German to 1.71 in Dutch. Furthermore, our result is comparable to the full UniTrans, which also uses translation (0.04 F1 difference on average). Besides, AdvPicker achieves an average F1-score improvement of 1.83 over mBERT-TLADV and 2.95 over mBERT-ft. These results demonstrate the effectiveness of the proposed approach, which is mainly attributed to more effectively leveraging unlabeled target-language data and selecting language-independent data for cross-lingual transfer.

Quality of Selected Data
Language-Independence. We use the selected dataset D_subset to train student models via knowledge distillation. P_θ(Y_T-NER) in D_subset is calculated over feature vectors generated by mBERT-TLADV. To validate the language-independence of these feature vectors, we apply three discriminators, defined as in Eq. (3), to classify the token feature vectors from three different encoders: mBERT, mBERT-ft, and mBERT-TLADV. Unlike in the adversarial learning setting, we fix the parameters of the three encoders and only train the discriminators. We use each language's training set to train the discriminators and evaluate on the corresponding target-language test set. Table 3 reports the discriminator accuracy for the 3 different encoders. We can see that the classification accuracy is reduced with adversarial training, which suggests that the similarity of feature vectors between the source language and the target language is improved; this further demonstrates that the feature vectors become more language-independent when adversarial training is applied.

Pseudo Labels. To evaluate the quality of the pseudo-labels, we calculate the F1 scores of the pseudo-labels and the number of sentences involved. We denote language-independent data and language-specific data as "selected data" and "other data", respectively. Table 4 reports the pseudo-label F1 scores of both language-independent data and language-specific data for each target language using mBERT-TLADV. On average, the F1 score of language-independent data is 12.87 points higher than that of language-specific data, which suggests that language-independent data has higher-quality pseudo-labels. Furthermore, the selected data contains less language-specific information. Table 5 reports the number of language-independent and language-specific examples for each language. From these results (Tables 4, 5), we observe that the selected data is still of high quality, even though we set a very loose threshold (80% of unlabeled data being selected).

Model Performance over Selected/Other Data Splits
In order to better analyse the behaviour of AdvPicker across data variations, we use the trained language discriminator to split the target-language test sets into Selected and Other partitions (similarly to how the training set is processed). Table 7 shows the different models' F1 scores on the partitioned data. From Table 7, we can draw these conclusions: 1) As expected, models perform better on Selected data than on Other data; 2) AdvPicker is only trained on Selected data, but nonetheless outperforms all baseline models in both data partitions; 3) AdvPicker effectively selects examples with better features and is not over-biased towards Selected data.

Ablation Study
To validate the contributions of the different processes in the proposed AdvPicker, we introduce the following variants of AdvPicker and baselines for an ablation study: 1) AdvPicker w/o KD, which directly combines the predictions on test data from mBERT-TLADV models with different seeds, without knowledge distillation on pseudo-labeled training data. 2) AdvPicker w All-Data, which trains a student model on all target-language pseudo-labeled data generated by mBERT-TLADV. 3) mBERT-ft, mBERT fine-tuned on source-language labeled data. 4) mBERT-TLADV (Section 3.1), i.e., mBERT trained on source-language labeled data with token-level adversarial learning. Table 6 reports the performance of each method and its performance drop compared to AdvPicker. We can draw the following in-depth observations: 1) Comparing AdvPicker with AdvPicker w/o KD and AdvPicker w All-Data, we can see that selecting the language-independent data is beneficial. This also validates the effectiveness of training the model on language-independent data via knowledge distillation.
2) mBERT-TLADV outperforms mBERT-ft. This result demonstrates that token-level adversarial learning helps train a language-independent feature encoder and brings performance improvements.
3) By comparing the F1 scores of AdvPicker w All-Data and AdvPicker on the target languages, we observe that training on selected data brings higher performance improvements on the larger datasets, e.g., German [de] and Dutch [nl], and lower improvements on the smaller Spanish [es] dataset. Although the selected data has high-quality pseudo-labels, the smaller size of the selected dataset may limit performance improvements.

Stability Analysis
BERT fine-tuning is known to be unstable in few-shot tasks, as discussed in Zhang et al. (2021); hence the performance of mBERT-based methods on the CoNLL NER datasets is likely also unstable. To evaluate the stability of AdvPicker, we compare the standard deviation of F1 scores for mBERT-ft, UniTrans, and AdvPicker. Table 2 includes the standard deviation of F1 scores over five runs for each model. AdvPicker has a lower average standard deviation across the three target languages than the other mBERT-based methods. These results demonstrate that the selected data can bring a degree of stability to the model, or limit instability, as the student model in AdvPicker is trained on selected data with soft labels from other trained models.

Conclusion
In this paper, we propose a novel approach to combine the feature-based method and pseudo-labeling via language adversarial learning for cross-lingual NER. AdvPicker is the first successful attempt at selecting language-independent data with an adversarial discriminator for cross-lingual NER. Our experimental results show that the proposed system benefits strongly from this new data selection process and outperforms existing state-of-the-art methods, without requiring any additional external resources.