DualNER: A Dual-Teaching framework for Zero-shot Cross-lingual Named Entity Recognition

We present DualNER, a simple and effective framework that makes full use of both an annotated source language corpus and unlabeled target language text for zero-shot cross-lingual named entity recognition (NER). In particular, we combine two complementary learning paradigms of NER, i.e., sequence labeling and span prediction, into a unified multi-task framework. After obtaining an adequately trained NER model on the source data, we further train it on the target data in a dual-teaching manner, in which the pseudo labels for one task are constructed from the predictions of the other task. Moreover, based on the span prediction, an entity-aware regularization is proposed to enhance the intrinsic cross-lingual alignment between the same entities in different languages. Experiments and analysis demonstrate the effectiveness of our DualNER. Code is available at https://github.com/lemon0830/dualNER.


Introduction
Aiming at classifying entities in unstructured text into pre-defined categories, named entity recognition (NER) is an indispensable component of various downstream natural language processing applications such as information retrieval (Banerjee et al., 2019) and question answering (Fabbri et al., 2020). Current supervised methods have achieved great success with sufficient manually labeled data, but most annotated data are constructed for high-resource languages like English and Chinese, posing a big challenge to low-resource scenarios (Mayhew et al., 2017; Bari et al., 2021).
To address this issue, zero-shot cross-lingual NER is proposed to transfer NER knowledge from high-resource languages to low-resource languages. The knowledge can be acquired in either of two ways: 1) from aligned cross-lingual word representations or a multilingual pre-trained encoder fine-tuned on high-resource languages (Conneau et al., 2020; Bari et al., 2021); or 2) from translated target language data with label projection (Mayhew et al., 2017; Jain et al., 2019; Liu et al., 2021). These two kinds of methods can be unified into a knowledge distillation (KD) framework to further improve cross-lingual NER performance (Wu et al., 2020; Fu et al., 2022). Though widely used, the transfer process still suffers from poor translation quality, label projection errors, and over-fitting of large-scale multilingual language models.
In this paper, we present a simple and effective framework, named DualNER, alleviating the above problems from a different angle. We combine two popular and complementary learning paradigms of NER, sequence labeling and span prediction, into a single framework. Specifically, we first train a teacher NER model by jointly exploiting sequence labeling and span prediction on the annotated source language corpus. Unlike previous KD-based methods that produce pseudo labels for the corresponding paradigms, we propose a dual-teaching strategy to make the two paradigms complement each other. More concretely, the model predictions for sequence labeling are used to construct the pseudo labels for span prediction, and vice versa. Furthermore, we propose a multilingual entity-aware regularization forcing the same entities in different languages to have similar representations. By doing this, the trained model is able to leverage the intrinsic cross-lingual alignment across different languages to enhance its cross-lingual transfer ability.
Experiments and analysis conducted on XTREME for 40 target languages well validate the effectiveness of DualNER.

Model
Given a training example (X, Y_sla), where X = {x_1, x_2, ..., x_n} is the input sequence and Y_sla = {y_1, y_2, ..., y_n} is the corresponding label sequence (e.g., "B-ORG", "I-PER", "O"), we can extract the start and end index sequences, Y_start and Y_end, as references for span prediction, and convert the training instance into a quadruple (X, Y_sla, Y_start, Y_end).
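The conversion from a BIO-tagged sentence to the quadruple above can be sketched as follows; the helper name `to_quadruple` is our own, and non-boundary positions receive the null label "O", matching the (C + 1)-class span classifiers described below.

```python
def to_quadruple(tokens, bio_tags):
    """Derive start/end index label sequences from BIO tags.

    A "B-X" tag marks an entity start of type X; the entity ends at the
    last consecutive "I-X" token. All other positions stay "O".
    """
    n = len(bio_tags)
    y_start = ["O"] * n
    y_end = ["O"] * n
    for i, tag in enumerate(bio_tags):
        if tag.startswith("B-"):
            etype = tag[2:]
            y_start[i] = etype
            j = i
            while j + 1 < n and bio_tags[j + 1] == f"I-{etype}":
                j += 1
            y_end[j] = etype
    return tokens, bio_tags, y_start, y_end

tokens = ["John", "Smith", "visited", "New", "York"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
_, _, y_start, y_end = to_quadruple(tokens, tags)
# y_start == ["PER", "O", "O", "LOC", "O"]
# y_end   == ["O", "PER", "O", "O", "LOC"]
```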
Sequence Labeling Layer. We feed X into the multilingual pre-trained language model (PLM) to obtain the token representations H = {h_1, ..., h_n}. Formally, we stack a softmax classifier on H, and the objective of sequence labeling is

$\mathcal{L}_{sla} = -\sum_{i=1}^{n} \log P^{sla}(y_i \mid h_i; \theta, \theta_{sla}),$

where θ and θ_sla denote the parameters of the PLM and the classifier, respectively.
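A minimal numpy sketch of this objective, assuming a single linear + softmax head over the PLM representations; the shapes and variable names (H, W, labels) are illustrative placeholders, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_labels = 5, 8, 7            # tokens, hidden size, BIO label count
H = rng.normal(size=(n, d))           # token representations from the PLM
W = rng.normal(size=(d, num_labels))  # softmax classifier (theta_sla)
labels = np.array([1, 2, 0, 3, 4])    # gold BIO label ids (Y_sla)

# softmax over the label dimension, with the usual max-subtraction for stability
logits = H @ W
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# token-level cross-entropy: average negative log-likelihood of the gold labels
loss_sla = -np.log(probs[np.arange(n), labels]).mean()
```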
Span Prediction Layer. For span prediction, we adopt two (C + 1)-class classifiers, where C denotes the number of NER entity types (e.g., LOC, PER, and ORG, i.e., 3 types in the XTREME-40 dataset): one predicts whether each token is the start of an entity, and the other predicts whether each token is the end. Formally, given the representations H and the two label sequences Y_start and Y_end of length n, the losses for start and end index prediction are defined as:

$\mathcal{L}_{start} = -\sum_{i=1}^{n} \log P^{start}(y_i^{start} \mid h_i; \theta, \theta_{start}),$

$\mathcal{L}_{end} = -\sum_{i=1}^{n} \log P^{end}(y_i^{end} \mid h_i; \theta, \theta_{end}).$
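A companion sketch of the span-prediction head, again under illustrative shapes: two independent (C + 1)-class classifiers over the same representations H, where label 0 plays the null ("not a boundary") role.

```python
import numpy as np

def token_ce(H, W, labels):
    """Token-level cross-entropy under a linear + softmax classifier."""
    logits = H @ W
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

rng = np.random.default_rng(1)
n, d, C = 5, 8, 3                      # tokens, hidden size, entity types
H = rng.normal(size=(n, d))
W_start = rng.normal(size=(d, C + 1))  # "+1" is the null (non-boundary) class
W_end = rng.normal(size=(d, C + 1))
y_start = np.array([1, 0, 0, 2, 0])    # 0 = not a start; 1..C = entity type
y_end = np.array([0, 1, 0, 0, 2])

# the two boundary losses are computed independently and summed
loss_span = token_ce(H, W_start, y_start) + token_ce(H, W_end, y_end)
```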

Training
To achieve zero-shot cross-lingual NER, we adopt a two-stage training strategy.
Stage 1: Multi-task Learning. At the first stage, we fine-tune a multilingual pre-trained model on the labeled source language data in a multi-task manner:

$\mathcal{L}_{mlt} = \mathcal{L}_{sla} + \mathcal{L}_{start} + \mathcal{L}_{end}.$

Stage 2: Dual-teaching. At the second stage, we focus on generating pseudo labels for both the labeled and unlabeled data with the trained NER model θ_tea. In particular, the pseudo labels for the sequence labeling task are converted from the model predictions for the span prediction task, and vice versa. Specifically, based on the predictions P_sla, P_start, and P_end for an input sequence X_src (or X_trg), we construct the pseudo labels for sequence labeling and span prediction as follows:

$\hat{Y}^{sla} = \mathrm{Sequential}(P^{start}, P^{end}), \quad (\hat{Y}^{start}, \hat{Y}^{end}) = \mathrm{ExtractSpan}(P^{sla}),$

where Sequential and ExtractSpan denote the corresponding transformations between sequence labels and span labels.

Furthermore, in order to strengthen the correlation of the same entity types across languages, we present an entity-aware regularization term. We illustrate an example in Appendix A. More concretely, for the j-th entity, we extract the start token and the end token by applying argmax to the distributions P_start and P_end, and obtain its representation r_j by concatenating the representations of these two tokens. We use a mean squared error (MSE) loss to pull together the representations of the same entity type across different languages:

$\mathcal{J}_{mse} = \sum_{c=1}^{C} \sum_{r_i, r_j \in R_c} \mathrm{MSE}(r_i, r_j),$

where C is the number of NER entity types and R_c is the representation set of the c-th entity type in a mini-batch.
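The two label transforms can be sketched as below. The greedy pairing of each predicted start with the nearest end of the same type is our own simplification of the transform the paper names; the function names mirror the paper's Sequential and ExtractSpan.

```python
def sequential(start_pred, end_pred):
    """Span predictions (argmax of P_start / P_end) -> BIO pseudo labels."""
    n = len(start_pred)
    bio = ["O"] * n
    for i, etype in enumerate(start_pred):
        if etype == "O":
            continue
        # greedily pair this start with the nearest end of the same type
        for j in range(i, n):
            if end_pred[j] == etype:
                bio[i] = f"B-{etype}"
                for k in range(i + 1, j + 1):
                    bio[k] = f"I-{etype}"
                break
    return bio

def extract_span(bio_pred):
    """BIO pseudo labels -> start/end label sequences."""
    n = len(bio_pred)
    y_start, y_end = ["O"] * n, ["O"] * n
    for i, tag in enumerate(bio_pred):
        if tag.startswith("B-"):
            etype = tag[2:]
            y_start[i] = etype
            j = i
            while j + 1 < n and bio_pred[j + 1] == f"I-{etype}":
                j += 1
            y_end[j] = etype
    return y_start, y_end

bio = sequential(["PER", "O", "LOC", "O"], ["O", "PER", "O", "LOC"])
# bio == ["B-PER", "I-PER", "B-LOC", "I-LOC"]
```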
The overall training objective is defined as:

$\mathcal{L} = \mathcal{L}_{sla} + \mathcal{L}_{start} + \mathcal{L}_{end} + \alpha \mathcal{J}_{mse},$

where α is a hyper-parameter balancing the effect of the MSE loss. During training, we update the teacher NER model θ_tea with the student model θ_stu whenever the student performs better on the validation set. At inference time, we only use the predictions of the span prediction layer.
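The objective combination and the teacher-update schedule can be illustrated with a toy loop; the checkpoint names and F1 numbers are placeholders, and α = 0.5 follows the value reported in Setup.

```python
alpha = 0.5  # hyper-parameter weighting the MSE regularizer (see Setup)

def total_loss(l_sla, l_start, l_end, j_mse, alpha=alpha):
    """Overall objective: three task losses plus the weighted MSE term."""
    return l_sla + l_start + l_end + alpha * j_mse

# Teacher refresh: adopt the student as the new teacher only when it
# improves on the validation set (placeholder checkpoints and scores).
best_f1, teacher = 0.0, None
for student, dev_f1 in [("ckpt1", 0.61), ("ckpt2", 0.59), ("ckpt3", 0.64)]:
    if dev_f1 > best_f1:
        best_f1, teacher = dev_f1, student
# teacher == "ckpt3"
```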

Setup
The proposed method is evaluated on the cross-lingual NER dataset from the XTREME-40 benchmark (Hu et al., 2020). Named entities in Wikipedia are annotated with LOC, PER, and ORG tags in the IOB2 format. We try two types of unlabeled target language data: Natural Language Text, the target language text in the training set of XTREME-40; and Translation Text (Fang et al., 2021). We take XLM-R-base (Conneau et al., 2020) and InfoXLM-large (Chi et al., 2021) as our backbones, and set α to 0.5. Detailed experimental setups are given in Appendix B. We use the entity-level F1-score on all language development sets to choose the best checkpoint, and report the F1-score on the test set of each language.

Main Result
We compare DualNER to the following baselines: 1) FILTER (Fang et al., 2021), which feeds paired language input into the PLM and is trained with self-teaching; 2) CLA, which formulates NER as a sequence labeling problem; 3) SPAN, which formulates NER as a span prediction problem; and 4) MLT, the model trained after our Stage 1. Besides, we name DualNER trained on unlabeled target natural language text DualNER+TRG_Gold, and DualNER trained on target language translation text DualNER+TRG_Trans. Table 1 reports the zero-shot cross-lingual NER results. The conclusions are as follows: 1) CLA and SPAN have no obvious advantage over each other. 2) DualNER significantly outperforms the baselines on almost all languages, demonstrating the effectiveness of our proposed method. 3) Directly combining CLA and SPAN into a multi-task learning framework (i.e., MLT) fails to achieve consistent improvements. This observation shows that the gain of DualNER comes from the proposed dual-teaching training strategy rather than from multi-task learning alone. 4) As expected, using natural language text (i.e., DualNER+TRG_Gold) achieves better performance than translation text (i.e., DualNER+TRG_Trans), since translations may lose the idiomatic expressions of some entities.

Ablation Study
To analyze the impact of different components of DualNER, we investigate the following three variants: 1) DualNER w/o J_mse, which removes the entity-aware regularization; 2) DualNER w/ selfKL, where dual-teaching is replaced by self-teaching with a KL loss at Stage 2; and 3) DualNER w/o TRG, where we only use the source language data at Stage 2. We take XLM-R-base as the backbone. The results are listed in Table 2. Compared with DualNER w/ selfKL, DualNER obtains a significant improvement of 3.46 points, validating our motivation of exploiting the complementarity of different NER task paradigms. The degradation of DualNER w/o J_mse and DualNER w/o TRG confirms the benefit of the intrinsic cross-lingual alignment and the importance of task-related target language information.

Visualization
We choose English, Korean, and Arabic, which come from different language families, and visualize the entity representations r_j (used in the entity-aware regularization) with hypertools (Heusser et al., 2017). As shown in Figure 2, the representations of different entity types in the same language are clearly distributed in different regions, while the representations of the same entity type across different languages are concentrated.

Effect of Source Language Corpus Size
In this experiment, we study the impact of the annotated source language corpus size on DualNER by sampling different percentages of the annotated source corpus for Stage 1. Meanwhile, we remove the labels of the remaining source data and mix them with the unlabeled target language text for Stage 2. Figure 3 shows the comparison between DualNER and MLT. Surprisingly, DualNER trained with only 20% of the annotated source data achieves better performance than MLT trained on the complete data, demonstrating the data efficiency of our proposed method.

Conclusion
In this paper, we propose a simple and effective dual-teaching framework, coined DualNER, for zero-shot cross-lingual named entity recognition.
In particular, DualNER makes full use of the exchangeability between the labels of span prediction and sequence labeling, and generates abundant pseudo labels for both the labeled and unlabeled data. Experiments and analysis validate the effectiveness of our DualNER.

Limitations
The performance of DualNER relies on the cross-lingual transfer capability of multilingual pre-trained models. In practice, to ensure adequate quality of the pseudo labels generated in Stage 2, the NER model must have acquired a certain ability to conduct cross-lingual transfer in Stage 1.

Figure 2: Visualization of the entity representations in different languages, where the triangle-shaped, circle-shaped, and pentagon-shaped (blue) points denote location, organization, and person entities, respectively.

Figure 3: Effect of source language corpus size. We report F1-score on all the validation sets.

Table 1: Experimental results on the test sets of XTREME-40 NER. We highlight the better results between CLA and SPAN in gray and the best results among all methods in pink.

Table 2: Ablation study. We run 3 times with different random seeds and report the mean and standard deviation on all the validation sets.