TransAdv: A Translation-based Adversarial Learning Framework for Zero-Resource Cross-Lingual Named Entity Recognition



Introduction
Named Entity Recognition (NER) is a fundamental task that aims to locate named entities in a given sentence and assign them to predefined types, e.g., person, location, organization, etc. In recent years, neural NER models have achieved remarkable performance on this task with large amounts of labeled data. However, many low-resource languages do not have enough data for supervised learning. Therefore, transferring labeled data or trained models from high-resource to low-resource languages is gaining increasing attention.
In this paper, we concentrate on zero-resource cross-lingual NER, where no labeled data in the target language is available. Existing methods fall into three main categories: i) model transfer based methods (Wu and Dredze, 2019; Wu et al., 2020c), which train a source model on the labeled source language data to learn language-independent features and then directly apply it to the target language; ii) data transfer based methods (Mayhew et al., 2017; Xie et al., 2018), which translate the labeled source language data and map all entity labels to generate pseudo target language data; iii) knowledge transfer based methods (Wu et al., 2020a; Chen et al., 2021), which train a source model on the labeled source language data and then apply it over the unlabeled target language data to distill a student model.
Each kind of method has its drawbacks; (Wu et al., 2020b) is the first to unify the three kinds of methods with great success. However, the noise in the translation process significantly limits its performance. There are two common translation strategies for cross-lingual NER tasks: i) sentence translation followed by entity alignment, where the propagation of entity alignment errors is inevitable; ii) direct word-by-word translation (Wu et al., 2020b), where the generated sentence is noisy in terms of word order.
To better utilize the translated data, we propose a translation-based adversarial learning framework named TransAdv for zero-resource cross-lingual NER; the overall architecture is shown in Figure 1. The contributions of our work can be summarized as follows:
• We better unify data transfer and knowledge transfer for cross-lingual NER, mitigating lexical and syntactic errors of word-by-word translated data through multi-level adversarial learning and multi-model knowledge distillation.
• We conduct extensive experiments over 6 target languages with English as the source language, and the results validate the effectiveness and reasonableness of our model. Our model is trained without any labeled target language data and then evaluated on the labeled test data of the target language.

Data Creation
In this section, we construct multiple datasets based on the labeled source language data as shown in Figure 2.
Following (Wu et al., 2020b), we apply MUSE (Lample et al., 2018) to translate a source language sentence x_S into a target language sentence x_T word by word. The entity label of each source language word is then directly copied to its corresponding translated word. Since MUSE has inevitable translation errors and may not strictly translate every word into the target language, we also try the Google Translate API for more accurate word-by-word translation. After word-by-word translation and label copying, we construct a pseudo target language training dataset D_T from D_S.

(Zhang et al., 2021a) propose an aspect code-switching mechanism to augment the training data for cross-lingual aspect-based sentiment analysis. We apply a similar mechanism to switch named entities between the source and translated sentences, constructing two bilingual sentences: x^swi_S is derived from x_S with named entities from x_T, and x^swi_T is derived from x_T with named entities from x_S. Benefiting from word-by-word translation, the entity label of each word in x^swi_S and x^swi_T is the same as that of its corresponding word in x_S and x_T. Therefore, we can construct two bilingual datasets D^swi_S and D^swi_T.

Because the word orders of the source and target languages differ, we also design a word shuffling method for NER data. Since NER is a coarse-grained sequence labeling task, completely shuffling all words in a sentence would break the internal relations of the words within entities. Therefore, we separately shuffle the words in each entity and in each context span between entities, with all entity labels retained. For sentences x_S and x_T, the two shuffled sentences are denoted as x^shu_S and x^shu_T. Based on D_S and D_T, we build two shuffled datasets D^shu_S and D^shu_T.
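As a concrete illustration, the two augmentation operations above can be sketched as follows. This is a minimal sketch, not the authors' released code; the helper names (`segment`, `shuffle_within_segments`, `code_switch`) are our own, and we assume one BIO label per word with `"O"` marking context.

```python
import random

def segment(labels):
    """Split token positions into segments: each entity span (a B-X tag and
    its following I-X tags) is one segment, and each maximal run of "O"
    context tokens between entities is one segment."""
    segments, cur = [], []
    for i, lab in enumerate(labels):
        starts_entity = lab.startswith("B-")
        switches = cur and (lab == "O") != (labels[cur[-1]] == "O")
        if cur and (starts_entity or switches):
            segments.append(cur)
            cur = []
        cur.append(i)
    if cur:
        segments.append(cur)
    return segments

def shuffle_within_segments(words, labels, seed=0):
    """Build x^shu: shuffle words inside each entity and inside each context
    run, keeping the label sequence (and thus all entity spans) intact."""
    rng = random.Random(seed)
    out = list(words)
    for seg in segment(labels):
        perm = seg[:]
        rng.shuffle(perm)
        for dst, src in zip(seg, perm):
            out[dst] = words[src]
    return out

def code_switch(words, labels, translated_words):
    """Build x^swi: keep context words, but replace each entity word with its
    word-by-word translation at the same position (labels are unchanged)."""
    return [t if lab != "O" else w
            for w, lab, t in zip(words, labels, translated_words)]
```

Because the shuffle permutes positions only within a segment, every entity keeps exactly its original words and its original BIO labels.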

Multi-Level Adversarial Learning for Cross-Lingual NER
In cross-lingual tasks, the source and the target language usually differ in lexical and syntactic features. To keep the model from overfitting on the source language data and to make it better fine-tuned on the word-by-word translated target language data, we follow (Chen et al., 2021) and propose a multi-level adversarial network. It is formulated as a multi-task problem with NER, word-level language classification and sentence-level order classification. The modules in the network and their loss functions are defined as follows:
Generator We choose multilingual BERT (Devlin et al., 2019) as the generator and feed a given sentence x into it to obtain the contextual representation h of its words.

NER Classifier
We feed h into a fully-connected layer followed by a softmax layer to yield a probability distribution over the entity label set Y:

p^ner(x_i) = softmax(W_ner h_i + b_ner), (1)

where h_i ∈ R^{d_g} denotes the feature vector of the i-th word, with d_g being the dimension of h, and W_ner, b_ner are trainable parameters.

Language Discriminator We feed h into two fully-connected layers followed by a sigmoid layer to classify the language of each word:

h^l_i = ReLU(W_1 h_i + b_1), (2)
p^l(x_i) = sigmoid(w_2 · h^l_i + b_2), (3)

where W_1 ∈ R^{d_l × d_g}, with d_l being the hidden dimension of the language discriminator.

Order Discriminator We first feed h into a one-layer LSTM to encode the sequence features of the sentence; the hidden state of the last word is then fed into a fully-connected layer followed by a sigmoid layer to classify the order of the sentence:

p^o(x) = sigmoid(w_o · LSTM(h)_n + b_o), (4)

where LSTM(h)_n denotes the hidden state of the last word in a sentence of length n.

During training, the different datasets are first fed into mBERT separately, as shown in Figure 1, and the generated h is then sent to the corresponding module. We have a total of 4 loss functions: the NER task loss L_ner, the language discriminator loss L_l, the order discriminator loss L_o, and the generator loss L_g:

L_ner = -(1/n) Σ_i log p^ner(x_i)[y^ner_i],
L_l = -(1/n) Σ_i [y^l_i log p^l(x_i) + (1 - y^l_i) log(1 - p^l(x_i))],
L_o = -[y^o log p^o(x) + (1 - y^o) log(1 - p^o(x))],
L_g = -(L_l + L_o), (5)

where y^ner_i and y^l_i denote the ground-truth entity tag and language tag of the word x_i, and y^o denotes the ground-truth order tag of the sentence x.
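For concreteness, the four losses can be computed as in the following sketch. The exact adversarial formulation in the paper may differ; here we use the common flipped-label objective for the generator, and all function names are our own.

```python
import math

def nll(p):
    # negative log-likelihood of one probability
    return -math.log(p)

def ner_loss(probs, gold):
    """Token-level cross-entropy L_ner: probs[i] is the softmax distribution
    over entity labels for word i, gold[i] the index of its gold label."""
    return sum(nll(p[g]) for p, g in zip(probs, gold)) / len(gold)

def bce(p, y):
    # binary cross-entropy of prediction p against a 0/1 tag y
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def language_loss(p_lang, y_lang):
    """Word-level language discriminator loss L_l; p_lang[i] is the sigmoid
    probability that word i belongs to the target language."""
    return sum(bce(p, y) for p, y in zip(p_lang, y_lang)) / len(y_lang)

def order_loss(p_order, y_order):
    """Sentence-level order discriminator loss L_o."""
    return bce(p_order, y_order)

def generator_loss(p_lang, y_lang, p_order, y_order):
    """Generator objective L_g (one common adversarial formulation): fool
    both discriminators by training against flipped language/order tags."""
    return (language_loss(p_lang, [1 - y for y in y_lang])
            + order_loss(p_order, 1 - y_order))
```

A uniform binary prediction of 0.5 yields a loss of ln 2 per word, which is a useful sanity check for the discriminators.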
Similarly to (Chen et al., 2021), for the NER task, the parameters of the generator and the NER classifier are updated based on L_ner; for the adversarial task, the parameters of the two discriminators are updated based on L_l and L_o respectively, while the parameters of the generator are updated based on L_g. Finally, we denote the trained source model as Θ_src.

Multi-Model Knowledge Distillation on Unlabeled Data
Based on Θ_src, we further fine-tune it on different datasets to derive teacher models with different emphases. Three combinations of the datasets constructed in Section 2.2 are considered in our network: D_entity = D_T ∪ D^swi_T, D_context = D_T ∪ D^swi_S and D_order = D_T ∪ D^shu_T, from which an entity-enhanced teacher model Θ_entity, a context-enhanced teacher model Θ_context and an order-enhanced teacher model Θ_order are derived by fine-tuning Θ_src.

During fine-tuning, the language discriminator trained in Section 2.3 is also loaded to continue adversarial fine-tuning with Θ_entity and Θ_context, but with a more fine-grained adversarial strategy: for Θ_entity we only discriminate the languages of entity words, and for Θ_context we only discriminate the languages of context words. The new language discriminator losses are shown in Eq. 6. These two discriminators are adapted to the characteristics of D_entity and D_context, with the aim of enabling Θ_entity and Θ_context to better fuse the representations of entity or context words, respectively, across the source and target languages.
L^ent_l = -(1/|E|) Σ_{i ∈ E} [y^l_i log p^l(x_i) + (1 - y^l_i) log(1 - p^l(x_i))],
L^ctx_l = -(1/|C|) Σ_{i ∈ C} [y^l_i log p^l(x_i) + (1 - y^l_i) log(1 - p^l(x_i))], (6)

where E and C denote the sets of entity words and context words in a sentence, respectively.

We then implement a multi-model distillation on the unlabeled target language dataset D^u_T. Let x_i denote the i-th word in an unlabeled sentence x ∈ D^u_T and p^ner(x_i, Θ) denote the probability distribution predicted by model Θ. We combine the soft labels generated by Θ_src and the three enhanced teacher models to obtain the united soft label:

p^uni(x_i) = Σ_k w_k · p^ner(x_i, Θ_k), (7)

where w_k is the weight for each model and Θ_k ranges over Θ_src, Θ_entity, Θ_context and Θ_order.
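A sketch of the fine-grained discriminator losses in Eq. 6, which restrict the word-level language loss to entity or context words via the BIO labels (helper names are ours):

```python
import math

def bce(p, y):
    # binary cross-entropy of prediction p against a 0/1 language tag y
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def masked_language_loss(p_lang, y_lang, labels, on_entities):
    """Average the word-level language loss over entity words only
    (for Θ_entity) or over context words only (for Θ_context)."""
    idx = [i for i, lab in enumerate(labels) if (lab != "O") == on_entities]
    return sum(bce(p_lang[i], y_lang[i]) for i in idx) / len(idx)
```

The same discriminator head is reused; only the set of words contributing to the loss changes between the two teacher models.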
Finally, we distill a student model Θ_stu by minimizing the mean squared error (MSE) between p^uni(x_i) and the probability distribution predicted by Θ_stu:

L_kd = (1/n) Σ_i || p^uni(x_i) - p^ner(x_i, Θ_stu) ||^2. (8)

For inference on the labeled test data of the target language, we only employ the distilled student model Θ_stu.
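The united soft label (Eq. 7) and the distillation loss (Eq. 8) reduce to a few lines; below is a sketch for a single word, under the setting of four equally weighted models (the function names are ours):

```python
def united_soft_label(dists, weights):
    """Eq. 7: weighted sum of the models' soft labels for one word,
    p_uni(x_i) = sum_k w_k * p_ner(x_i, Θ_k)."""
    dim = len(dists[0])
    return [sum(w * d[j] for w, d in zip(weights, dists)) for j in range(dim)]

def mse_distillation_loss(p_uni, p_stu):
    """Eq. 8: mean squared error between the united soft label and the
    student's predicted distribution for one word."""
    return sum((a - b) ** 2 for a, b in zip(p_uni, p_stu)) / len(p_uni)

# Four models (Θ_src and the three enhanced teachers) weighted 1/4 each:
teachers = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
p_uni = united_soft_label(teachers, [0.25] * 4)   # -> [0.5, 0.5]
```

Since the weights sum to 1 and each teacher output is a distribution, p_uni is itself a valid probability distribution.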

Baselines
We compare our model with the following zero-resource cross-lingual NER models to evaluate the performance of TransAdv: mBERT-FT (Wu and Dredze, 2019) fine-tunes multilingual BERT. AdvCE (Keung et al., 2019) improves upon mBERT's performance via adversarial learning. TSL (Wu et al., 2020a) proposes a teacher-student learning method. Unitrans (Wu et al., 2020b) proposes an approach to unify both model and data transfer. RIKD (Liang et al., 2021) proposes a reinforced knowledge distillation framework. AdvPicker (Chen et al., 2021) attempts to select language-independent data by adversarial learning. TOF (Zhang et al., 2021b) designs a target-oriented fine-tuning framework to exploit various data.

We conducted experiments on the following NER benchmark datasets: CoNLL-2002 (Sang and Erik, 2002) for Spanish [es] and Dutch [nl], CoNLL-2003 (Sang and De Meulder, 2003) for English [en] and German [de], and WikiAnn (Pan et al., 2017) for English [en], Arabic [ar], Hindi [hi] and Chinese [zh]. Each dataset is split into train, dev and test sets; statistics of all datasets are shown in Table 1. All datasets are annotated with 4 entity types: LOC, MISC, ORG and PER, using the BIO entity labeling scheme.

Datasets and Metrics
Following previous work (Sang and Erik, 2002), we employ the entity-level F1 score as the evaluation metric. We run each experiment 5 times with different random seeds and report the average F1 score on the test set for reproducibility. Implementation details and model analysis are given in Appendices A and B.

The main results of the baselines and TransAdv on CoNLL and WikiAnn are shown in Table 2. According to the results, TransAdv outperforms all baselines, proving our model's effectiveness.
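Entity-level F1 counts an entity as correct only when both its span and its type match the gold annotation exactly. A self-contained sketch, micro-averaged as in the CoNLL evaluation (the helper names are our own):

```python
def spans(labels):
    """Extract (start, end, type) entity spans from one BIO label sequence;
    end is exclusive. An I- tag without a preceding B- opens a new span."""
    out, start, typ = [], None, None
    for i, lab in enumerate(list(labels) + ["O"]):  # pad to flush last span
        inside_same = lab.startswith("I-") and lab[2:] == typ
        if start is not None and not inside_same:
            out.append((start, i, typ))
            start, typ = None, None
        if lab.startswith("B-") or (lab.startswith("I-") and start is None):
            start, typ = i, lab[2:]
    return out

def entity_f1(gold_seqs, pred_seqs):
    """Micro-averaged entity-level F1 over a corpus of label sequences."""
    tp = fp = fn = 0
    for g, p in zip(gold_seqs, pred_seqs):
        gs, ps = set(spans(g)), set(spans(p))
        tp += len(gs & ps)
        fp += len(ps - gs)
        fn += len(gs - ps)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Note that a prediction with correct boundaries but the wrong type counts as both a false positive and a false negative, which is why entity-level F1 is stricter than token-level accuracy.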

Main Results
In general, due to the strong effect of knowledge distillation, knowledge transfer based methods such as TSL, RIKD, AdvPicker and our TransAdv significantly surpass model transfer based methods like mBERT-FT and AdvCE, which directly apply the model to the target language.
For the western languages in CoNLL, TransAdv achieves absolute F1 gains of 1.62%, 0.88% and 0.7% over Unitrans, which also employs word-by-word translation. Despite using the same translation resources, our model still improves significantly over it, which may be due to the adversarial network mitigating the lexical and syntactic errors of the translated data. Compared with TOF, the state-of-the-art model, TransAdv achieves absolute F1 increases of 0.58% and 0.99% on es and nl, and a decrease of 1.05% on de. However, TOF requires extra labeled Machine Reading Comprehension (MRC) data for both the source and target languages, which is costly and not strictly zero-resource. For many low-resource languages, word-by-word translation is much more available than labeled MRC data.
As for the non-western languages in WikiAnn, TransAdv also shows significant improvements over the baselines on hi and zh. We even achieve a 0.68% absolute F1 gain on zh over mBERT-FT, which re-tokenizes the Chinese dataset and obtains relatively high results.

Conclusion
In this paper, we propose a framework named TransAdv for zero-resource cross-lingual NER, which mitigates lexical and syntactic errors of word-by-word translated data and better utilizes it through multi-level adversarial learning and multi-model knowledge distillation. We evaluate TransAdv over 6 target languages with English as the source language. Experimental results show that TransAdv achieves competitive performance compared to state-of-the-art models.

Limitations
Although word-by-word translation data is easy to obtain in most cases, high-quality translation models are not available for some low-resource languages that are extremely short of parallel corpora. Moreover, when the difference in word order between the source and target languages is slight, adversarial training on word order may result in the loss of valid order information.

A Implementation Details

The learning rates in the three enhanced teacher models are the same as in the source model, and the student model is trained with a learning rate of 6e-5 for L_kd. The weights of the four models in Eq. 7 are all set to 1/4.

B Model Analysis

B.1 Ablation Study
To verify the validity of different modules in the proposed model, we introduce the following variants of TransAdv for an ablation study: 1) TransAdv w/o LDIS and TransAdv w/o ODIS, which remove the language discriminator or the order discriminator, respectively, during multi-level adversarial learning. When the language discriminator is removed, the entity language discriminator and the context language discriminator used during the adversarial fine-tuning of Θ_entity and Θ_context are also removed.

The performance of each variant compared to TransAdv is shown in Table 3. From the results, we can draw the following inferences: 1) Comparing TransAdv with TransAdv w/o LDIS and TransAdv w/o ODIS, we see that performance drops. This confirms the effectiveness of the two discriminators: they keep the model from overfitting on the source language and make it better fine-tuned on the word-by-word translated target language data.
2) We observe that TransAdv outperforms TransAdv w/o Θ_entity, TransAdv w/o Θ_context and TransAdv w/o Θ_order, showing that teacher models derived from different combinations of datasets have different emphases in improving the robustness of the entire model.
3) TransAdv w/o MLADV and TransAdv w/o MMKD both significantly decline in performance compared with TransAdv, which illustrates that the two main modules both play essential roles in TransAdv.

B.2 Analysis of Translation Strategies
To evaluate the impact of different translation strategies on TransAdv, we compare the following translation methods: 1) MUSE: the same word-by-word translation as (Wu et al., 2020b), based on fastText monolingual word embeddings. 2) Google Word: use the Google Translate API to translate the sentence word by word. 3) Google Phrase: split a sentence into phrases based on entity labels and then use the Google Translate API to translate the sentence phrase by phrase. 4) Google Word&Phrase: split a sentence into phrases based on entity labels and then use the Google Translate API to translate context phrases word by word and entity phrases phrase by phrase.
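The entity-label-based phrase splitting behind strategies 3) and 4) can be sketched as follows; `split_phrases` is a hypothetical helper of ours, and the actual calls to the Google Translate API are omitted.

```python
def split_phrases(words, labels):
    """Split a labeled sentence into alternating phrases: each entity
    (a B-X tag plus its I-X continuations) becomes one phrase, and each
    maximal run of "O" context words becomes one phrase. Returns
    (phrase_text, is_entity) pairs in sentence order."""
    phrases, cur, cur_is_entity = [], [], False
    for w, lab in zip(words, labels):
        is_entity = lab != "O"
        if cur and (lab.startswith("B-") or is_entity != cur_is_entity):
            phrases.append((" ".join(cur), cur_is_entity))
            cur = []
        cur.append(w)
        cur_is_entity = is_entity
    if cur:
        phrases.append((" ".join(cur), cur_is_entity))
    return phrases
```

Under the Google Word&Phrase strategy, each phrase flagged as an entity would be sent to the translator as a unit, while context phrases would be translated word by word.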
The comparison of different translation strategies for each language is shown in Figure 4. We observe that for the western languages in CoNLL, models with MUSE obtain the best F1 score on es and nl and the second-best F1 score on de; for the non-western languages in WikiAnn, models with Google Word obtain the best F1 score on ar and hi and the second-best F1 score on zh. This may be because, with English as the source language, western languages share many word anchors, so the noisier MUSE yields more diverse translation data without hurting performance; non-western languages share far fewer word anchors, so Google-based direct translation better introduces information about the target language.
On the other hand, Google Phrase and Google Word&Phrase are generally less effective than the other two strategies, which are based entirely on word-by-word translation. This may be because word-by-word translated data is more compatible with the sentence-level order adversarial training in TransAdv.

B.3 Analysis of Language Discriminators
To analyze the effect of language discriminators of different granularities, clusters of embeddings from models at different stages, trained on CoNLL with Dutch (nl) as the target language, are shown in Figure 3.
We find that in the source model Θ_src, the embeddings corresponding to entity labels in the source and target languages are already partially fused thanks to the original language discriminator, while the embeddings corresponding to context labels remain scattered. In the entity-enhanced teacher model Θ_entity, the embeddings of the two languages are further fused thanks to the word-by-word translated data and the entity language discriminator, while the context embeddings are still relatively scattered. In the context-enhanced teacher model Θ_context, due to the context language discriminator, the integration of context embeddings is largely complete, while the entity embeddings are not. Together, these results demonstrate the effectiveness of the different language discriminators.

Figure 1: The overall architecture of our proposed TransAdv.

Figure 2: The process of data creation.
D_entity contains the word-by-word translated dataset D_T, to involve knowledge of the target language, and the code-switched dataset D^swi_T, which shares the same contexts but has entities in a different language. D_context contains D_T and the code-switched dataset D^swi_S, which shares the same entities but has contexts in a different language. D_order contains D_T and the shuffled dataset D^shu_T, which shares the same sentences but with different word orders. An entity-enhanced teacher model Θ_entity, a context-enhanced teacher model Θ_context and an order-enhanced teacher model Θ_order are derived by fine-tuning Θ_src on D_entity, D_context and D_order with the same loss function L_ner as in Eq. 5.
2) TransAdv w/o Θ_entity, TransAdv w/o Θ_context and TransAdv w/o Θ_order, which remove the corresponding teacher model during multi-model knowledge distillation. 3) TransAdv w/o MLADV, which removes the multi-level adversarial learning module, with Θ_src directly trained on D_S. 4) TransAdv w/o MMKD, which removes the multi-model knowledge distillation module, so that the student model is directly distilled from Θ_src.

Figure 3: Clusters of embeddings of models at different stages (circles correspond to words in the source language, triangles to words in the target language).

Figure 4: Comparison of different translation strategies for each language.

Table 1: Statistics of the datasets.

Table 2: Results of TransAdv and baselines (F1 %). All results are from the original papers or the RIKD paper.