CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation

Named entity recognition (NER) suffers from a scarcity of annotated training data, especially for low-resource languages without labeled data. Cross-lingual NER alleviates this issue by transferring knowledge from high-resource languages to low-resource languages via aligned cross-lingual representations or machine translation results. However, the performance of cross-lingual NER methods is severely affected by the unsatisfactory quality of translation or label projection. To address these problems, we propose a Cross-lingual Entity Projection framework (CROP) that enables zero-shot cross-lingual NER with the help of a multilingual labeled sequence translation model. Specifically, the target sequence is first translated into the source language and then tagged by a source NER model. We further adopt a labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence. Finally, the whole pipeline is integrated into an end-to-end model by way of self-training. Experimental results on two benchmarks demonstrate that our method outperforms the previous strong baseline by a large margin of +3~7 F1 and achieves state-of-the-art performance.


Introduction
Named entity recognition (NER) focuses on recognizing entities in raw text and classifying them into predefined types (Sang, 2002; Sang and Meulder, 2003; Yadav and Bethard, 2018; Fang et al., 2021; Lin et al., 2021; Wang and Henao, 2021), and is an essential component of downstream natural language processing (NLP) tasks, such as information retrieval (Banerjee et al., 2019) and question answering (Przybyla, 2016; Aliod et al., 2006).

Figure 1: Illustration of our method. It enables cross-lingual zero-shot transfer from the source (English) to the target (Chinese) language via labeled sequence translation and then entity projection.

However, most of the existing approaches are highly dependent on annotated training data and do not perform well in low-resource languages.
Zero-shot cross-lingual NER aims to address this challenging problem by transferring knowledge from a high-resource source language with large amounts of annotated corpora to languages without any labeled data (Xie et al., 2018). Some methods leverage cross-lingual representations (Ni et al., 2017), where the NER model is trained on the labeled corpus of the source language and then directly evaluated on target languages. Thanks to the success of multilingual pretrained language models (Devlin et al., 2019; Conneau et al., 2020), these model-based transfer methods have shown significant improvements in cross-lingual NER. Another line of research is data-based transfer (Wu et al., 2020a), which adopts word-to-word translation to project cross-lingual NER labels. For example, Liu et al. (2021) employ a multilingual translation model with placeholders for label projection. Nevertheless, these methods are still limited by weak entity projection and do not leverage the unlabeled corpora in target languages.
Following this line of using a multilingual model to encourage knowledge transfer among different languages, we propose a Cross-lingual Entity Projection (CROP) framework that leverages the unlabeled corpora of the target languages, supported by a strong multilingual labeled sequence translation model guided by multiple bilingual corpora and the corresponding phrase-level alignment information. In Figure 1, the unlabeled target sentence is forward-translated into the source language and tagged by the source NER model. Then, we use the labeled sequence translation model to back-translate the annotated sentence into the target language. Given the target annotated sentence, we project the entity labels of "Gothenburg" onto the target raw sentence through lexical matching. Finally, we use self-training to integrate the pipeline into an end-to-end NER model.
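The round-trip projection described above can be sketched as a short driver loop. This is a minimal sketch, not the released implementation: the callables `translate`, `tag_source`, `back_translate_labeled`, and `project_labels` are hypothetical stand-ins for the real translation and NER models.

```python
def crop_pipeline(raw_target_sents, translate, tag_source,
                  back_translate_labeled, project_labels):
    """One round of CROP pseudo-labeling over unlabeled target sentences."""
    pseudo_labeled = []
    for tgt_sent in raw_target_sents:
        src_sent = translate(tgt_sent)                # forward translation to source
        src_labels = tag_source(src_sent)             # tag with the source NER model
        tgt_labeled = back_translate_labeled(src_sent, src_labels)  # labeled back-translation
        labels = project_labels(tgt_labeled, tgt_sent)  # lexical matching onto raw sentence
        if labels is not None:                        # discard sentences with unmatched entities
            pseudo_labeled.append((tgt_sent, labels))
    return pseudo_labeled
```

The resulting `(sentence, labels)` pairs are what the self-training stage consumes together with the source annotated corpus.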
Specifically, we construct multilingual corpora to train the labeled sequence translation model, where the aligned spans of each sentence pair are both surrounded by boundary symbols. We conduct experiments on two benchmarks: XTREME-40 with 40 languages and CoNLL-5 with 5 languages. Experimental results show that our method achieves new state-of-the-art results. Furthermore, we evaluate the performance of the multilingual labeled sequence translation model and visualize multilingual sentence representations. The analyses demonstrate that our method can transfer knowledge even among distant languages.

Zero-shot Cross-lingual NER
Given the source NER model Θ_ner^src trained only on the source NER dataset and a target raw sentence x = (x_1, ..., x_m) with m words, zero-shot cross-lingual NER aims to classify each word of the target language into predefined types and obtain the labels t = (t_1, ..., t_m). The problem of zero-shot cross-lingual NER is defined as:

P(t|x) = ∏_{i=1}^{m} P(t_i | x; Θ_ner^src)    (1)

where the target raw sentence x and labels t have the same length m, and t_i is the i-th label. The source language has annotated labels, but the target corpora have no accessible handcrafted labels. P(t|x) represents the predicted distribution over labels. The source NER model Θ_ner^src, trained on the source annotated corpus, is expected to be evaluated on the target language without any labeled dataset. Previous work (Wu et al., 2020a) proposes to unify model-based transfer and data-based transfer with machine translation to transfer knowledge from the source language to the target language.

Framework
In Figure 2, given the source annotated corpus D_{x,t}^{L_src} = {(x^(i), t^(i))}_{i=1}^{N} of the source language L_src with N samples, where x^(i) is the input sentence and t^(i) contains its labels, the raw sentences in {D_x^{L_k}}_{k=1}^{K} are translated into the source language and tagged by the source NER model. Then, the source annotated corpora are back-translated into the target languages via a labeled sequence translation model. The labels of the target translated sentences are projected onto the target raw corpora to construct the annotated corpora {D_{x,f(x)}^{L_k}}_{k=1}^{K}, where f(x) is the projected label sequence of sentence x, obtained by simple lexical matching between translated entities and original words. The source corpus D_{x,t}^{L_src} and the target annotated corpora {D_{x,f(x)}^{L_k}}_{k=1}^{K} are further utilized for self-training.

Backbone Model for NER
Our backbone model for NER is comprised of an encoder and a linear classifier that assigns entities to predefined types. Given the input sentence x = (x_1, ..., x_m) with m words, we use the encoder Θ_e to extract top-layer features:

H = Θ_e(x)    (2)

where H = (h_1, ..., h_m) are the representations of the last encoder layer, h_i is the i-th word representation of the input sentence x, and Θ_e are the parameters of the feature extractor. The sequence of representations H = (h_1, ..., h_m) is then fed into a linear classifier with a softmax function to generate the probability distribution of each input word:

P(t|x) = softmax(H W_c + b_c)    (3)

where t = (t_1, ..., t_m) are the corresponding labels of the input sentence, and Θ_ner = {W_c, b_c} are the parameters of the NER backbone model. P(t|x) ∈ R^{m×T} holds the predicted probabilities and T is the number of predefined types. In this work, we set T = 7 on the XTREME benchmark and T = 9 on the CoNLL benchmark.
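The linear classifier with softmax described above can be illustrated with a small pure-Python sketch. It uses list-based matrices for clarity only; a real implementation would use a tensor library, and the encoder producing `H` is assumed.

```python
import math

def ner_head(H, W_c, b_c):
    """Linear classifier with softmax over T predefined types.
    H: m x d word representations; W_c: d x T weights; b_c: length-T bias.
    Returns the m x T matrix of label probabilities P(t|x)."""
    probs = []
    for h in H:
        # logits[j] = sum_i h[i] * W_c[i][j] + b_c[j]
        logits = [sum(hi * wij for hi, wij in zip(h, col)) + b
                  for col, b in zip(zip(*W_c), b_c)]
        m = max(logits)                       # subtract max for numerical stability
        exps = [math.exp(z - m) for z in logits]
        s = sum(exps)
        probs.append([e / s for e in exps])
    return probs
```

Each row of the output is a valid distribution over the T entity types for one input word.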

Labeled Sequence Translation
We adopt multilingual labeled sequence translation (LST) to transfer knowledge from high-resource to low-resource languages. Given the bilingual corpora D_b = {D^{L_k}}_{k=1}^{K}, where one side is the source language L_src and the other side is a language L_k ∈ L_tgt, the multilingual model is trained on the corpora D_b:

L_t = -∑_{(x,y)∈D_b} log P(y|x; Θ_mt)    (4)

where Θ_mt are the parameters of the translation model.
To support labeled sequence translation (LST), we use the sentence pairs to construct training samples, where the aligned spans in each sentence pair are surrounded by boundary symbols derived from phrase-level alignment pairs. In Figure 3, x and y are a sentence pair. The aligned fragments of the source sentence and the target sentence are both annotated with boundary symbols. These samples are used for the training of labeled sequence translation:

L_lst = -∑_{(x_p,y_p)} log P(y_p|x_p; Θ_mt)    (5)

where (x_p, y_p) is the labeled sentence pair constructed from the original sentence pair and the phrase-level alignment pairs.

[1] https://github.com/robertostling/eflomal
Our model is optimized by jointly minimizing the translation objective and the labeled sequence translation objective:

L = α L_t + (1 - α) L_lst    (6)

where L_t is the objective of multilingual translation and L_lst is the objective of multilingual labeled sequence translation. We alternate the two training objectives by setting α = 0.5. Our multilingual model supports (i) multilingual translation and (ii) labeled sequence translation. After alternately training on the two objectives, we obtain the final multilingual translation model Θ_mt. Once multilingual training is done, our model serves as an off-the-shelf multilingual labeled translation model and no longer requires alignments.
During the inference stage, the source sentence x with labels is converted into the labeled sequence x_p, where all entities are surrounded by indicators. The model then translates the source labeled sentence x_p into the target labeled sentence y_p. The boundary symbols indicate the entities in the translated sentence. For example, the translated phrase y_{v1:v2} has the same NER label as the source phrase x_{u1:u2}, where both phrases are surrounded by the boundary token b_i.
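The boundary-symbol encoding can be sketched with two small helpers, following the `__SLOT{i}__` token format used in the example study later in the paper. The helper names are ours, and spans are assumed to be half-open token intervals.

```python
def wrap_entities(tokens, spans):
    """Surround each entity span [u1, u2) with its boundary symbol __SLOT{i}__."""
    out, prev = [], 0
    for i, (u1, u2) in enumerate(spans):
        slot = f"__SLOT{i}__"
        out += tokens[prev:u1] + [slot] + tokens[u1:u2] + [slot]
        prev = u2
    return out + tokens[prev:]

def extract_entities(tokens):
    """Recover (entity_index, phrase_tokens) pairs from a labeled sequence."""
    entities, cur, cur_id = [], None, None
    for tok in tokens:
        if tok.startswith("__SLOT") and tok.endswith("__"):
            if cur is None:                    # opening boundary symbol
                cur, cur_id = [], int(tok[6:-2])
            else:                              # matching closing boundary symbol
                entities.append((cur_id, cur))
                cur = None
        elif cur is not None:
            cur.append(tok)
    return entities
```

`wrap_entities` prepares the labeled source sequence x_p before translation; `extract_entities` locates the translated entities between boundary symbols in y_p.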

Cross-lingual Entity Projection
Given the labeled corpus of the source language L_src and the unlabeled corpora {D_x^{L_k}}_{k=1}^{K} of K languages, the source NER model Θ_ner^src is used to tag the unlabeled training corpora of the target languages, aided by the labeled translation model Θ_mt.

Forward Translation The target raw corpora {D_x^{L_k}}_{k=1}^{K} of K languages are translated into the source language and tagged by the source NER model Θ_ner^src. We obtain the source labeled translated corpora {D_{x̃,f(x̃)}^{L_src}}_{k=1}^{K}, where f(·) is the predictor of the source NER model Θ_ner^src.

Labeled Sequence Back-translation The source annotated corpora are back-translated into the target languages with entity labels. In Figure 3, the entities of the source sentence are surrounded by boundary symbols (e.g., b_1). The boundary symbols b_1 and b_2 are used to locate the translated entities e_1 and e_2 in y_p. We obtain the back-translated data {D_ỹ^{L_k}}_{k=1}^{K} via the translation model Θ_mt.

Entity Matching Given the target translated entities with labels D_tgt, we search for matched entities in the unlabeled target sentence by lexical matching (string matching word by word). In Figure 1, "哥德堡" in the unlabeled sentence matches "哥德堡" in the translated sentence, so "哥德堡" is labeled with the same entity type LOC (Location). The labels of the translated entities are projected onto the raw sentence to construct the target labeled corpora {D_{x,f(x)}^{L_k}}_{k=1}^{K}. Finally, the target annotated corpora and the original corpus D_{x,t}^{L_src} are used for multilingual NER model training.

Self-training
Given a labeled corpus D_{x,t}^{L_src} of the source language and the unlabeled corpora {D_x^{L_k}}_{k=1}^{K} of the target languages, the training objective based on D_{x,t}^{L_src} is formulated as:

L_src = -∑_{(x,t)∈D_{x,t}^{L_src}} log P(t|x; Θ_ner^all)    (7)

where Θ_ner^all are the NER model parameters. Then, we leverage the source NER model Θ_ner^src trained on the labeled corpus to project the entity labels onto the target raw corpora as described in Section 3.4, yielding the labeled corpora {D_{x,f(x)}^{L_k}}_{k=1}^{K}. The multilingual corpora of the target languages with predicted labels are adopted to train a neural network with the loss function L_tgt:

L_tgt = -∑_{k=1}^{K} ∑_{x∈D_x^{L_k}} log P(f(x)|x; Θ_ner^all)    (8)

where x is the golden target data and f(x) is the pseudo label generated by cross-lingual entity projection.
The multilingual NER model is jointly trained on the original dataset and the target corpora with projected labels:

L = L_src + L_tgt    (9)

where L_src and L_tgt are the training objectives on the original and distilled datasets.

Dataset
CCAligned Our labeled multilingual model is continually tuned on the same training data, CCAligned (El-Kishky et al., 2020), as the previous work (Fan et al., 2020; Goyal et al., 2021). We use a collection of parallel data in different languages from the CCAligned dataset, where the parallel data pairs English with 39 other languages. The valid and test sets are from the FLORES-101 dataset (Goyal et al., 2021).

XTREME-40
The proposed method is further evaluated on the cross-lingual NER dataset from the XTREME benchmark (Hu et al., 2020). Named entities in Wikipedia are annotated with LOC, PER, and ORG tags in IOB2 format. Following the previous work (Hu et al., 2020), we use the same splits for the training, dev, and test sets.
Post-Processing The synthetic data is post-processed to train the multilingual NER model. (i) We use a language detection toolkit to filter out translated sentences in the wrong language. (ii) We delete sequences that exceed the maximum length (128 words) or contain only O (other) tags. (iii) The NER model trained on the multilingual corpora is directly employed to tag the unlabeled corpora, and the discarded sentences are re-labeled by this multilingual NER model. Finally, we combine the labels predicted by the source NER model Θ_ner^src trained on the original dataset and the multilingual NER model Θ_ner^all trained by self-training to improve the accuracy of label projection.
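The length and tag filters can be sketched as a simple predicate (the helper name is ours; we read the two conditions as independent filters, and the 128-word threshold comes from the text):

```python
def keep_sentence(tokens, labels, max_len=128):
    """Post-processing filter: drop sequences longer than max_len words,
    and drop sequences whose labels are all O (no entity)."""
    if len(tokens) > max_len:
        return False
    return any(lab != "O" for lab in labels)
```

Sentences rejected by this filter (or by the language-detection check) are the ones later re-labeled by the multilingual NER model.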

Baselines and Evaluation
Our method is compared with different baselines initialized from cross-lingual pretrained models, including mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), for model-based transfer. We also conduct experiments without any pretrained model on the Transformer (Vaswani et al., 2017) architecture. UniTrans (Wu et al., 2020a) unifies model transfer and data transfer for cross-lingual NER. Following this line of research, MulDA (Liu et al., 2021) proposes a sequence translation method that translates labeled training data of the source language into other languages and avoids the word-order changes caused by word-to-word or phrase-to-phrase translation. Besides, we also produce results for Translate-Train, where the labeled source corpus is translated into labeled corpora of multiple languages using our multilingual model. Following the previous work (Sang, 2002; Wu et al., 2020a), the metrics are entity-level precision, recall, and F1 scores. For simplicity, we report the F1 scores of the different methods in all tables.

Training Details
Multilingual Labeled Translation The pretrained multilingual model M2M-large is adopted as the translation model, which has 12 layers with an embedding size of 1024 and 16 attention heads. We continue fine-tuning the model with Adam (β1 = 0.9, β2 = 0.98) on the labeled corpora constructed from the multilingual corpora and the alignment pairs from CCAligned, where the parallel data pairs English with 39 other languages and the phrase-level alignment pairs are extracted by the alignment tool eflomal. The learning rate is set to 1e-4 with 4,000 warm-up steps. The batch size is set to 1,536 tokens on 32 A100 GPUs.
Cross-lingual NER For a fair comparison, we implement all methods using the same architecture and model size. We separately adopt the base architectures of Transformer, mBERT, and XLM-R as the backbone model, all of which have 12 layers with an embedding size of 768, a feed-forward network size of 3072, and 12 attention heads. We set the batch size to 24 for CoNLL-5 and 32 for XTREME-40. The NER model is trained on CoNLL-5 for 15 epochs and on XTREME-40 for 10 epochs, with warm-up over the first 10% of training steps. The synthetic data is post-processed as described in the Post-Processing paragraph, and the labels predicted by the source NER model Θ_ner^src trained on the original dataset and the multilingual NER model Θ_ner^all trained by self-training in Equation 9 are combined to improve the accuracy of label projection.

Main Results
CoNLL-5 Table 2 presents the results of our method and previous baselines on transferring knowledge from English to four other languages: es, nl, de, and no. We observe that XLM-R gains a strong improvement over previous baselines due to effective cross-lingual transfer. On top of the cross-lingual pretrained model, our method leverages cross-lingual entity projection to further encourage transfer from the NER model of the source language to the multilingual NER model of all languages. Our method significantly outperforms the previous strong baseline UniTrans on average, especially on German, by a large margin of +5.3 points. This can be attributed to our multilingual model, which has better translation quality on German and Norwegian than on Spanish and Dutch.

XTREME-40
Table 3 compares the performance of our method with previous relevant methods initialized from different cross-lingual pretrained language models, including mBERT and XLM-R. Given our translation model, the multilingual translated annotated corpora (Translate-Train) derived from the source-language data improve model performance compared to XLM-R. In particular, our proposed method gains a significant improvement over the baselines by a large margin (nearly +6 F1 points), due to the effectiveness of cross-lingual entity projection.
All experimental results demonstrate that our proposed framework strengthens transferability from the source language to nearly 39 target languages.
Ablation Study To verify the effectiveness of our method, we separately study the effects of model-based transfer via the cross-lingual pretrained model and data-based transfer via cross-lingual entity projection. Our method has two advantages: (1) the model is trained on the original multilingual corpora with pseudo labels, which avoids extra translation error; (2) our method uses the multilingual model trained on 41 languages to improve entity projection for low-resource languages. In Table 4, Transformer ③ without any transfer method performs worst (only 15.1 F1).
Our method ② without any pretrained model outperforms Transformer ③ by +43.0 F1 points, showing transferability similar to that of the cross-lingual pretrained language models. Combining the merits of the cross-lingual pretrained model and self-training over multiple languages, we obtain the best performance on the XTREME-40 benchmark.
Distribution of Multilingual Corpora An important difference between our method and previous baselines is that we provide an effective way to leverage the unlabeled corpora of the target languages. The raw data is first translated into the source language and annotated by the NER model trained on the original dataset. Then, the translated source sentences are back-translated into the target languages, and the entity labels are projected onto the target raw words. Our cross-lingual entity projection thus avoids extra translation errors, instead of directly using the translated labeled corpora. In Figure 4(a), we visualize the distribution of the encoder representations by randomly sampling 1K sentences per language from the target golden corpora. Figure 4(b) shows the distribution of the round-trip translated target corpora. We observe that the distribution of the translated corpora changes substantially, since incorrectly translated words are strongly affected by translation quality, especially for low-resource languages.

Performance of Multilingual Translation
To ensure the effectiveness of our method, we evaluate the translation performance on 40 languages of M2M (Goyal et al., 2021) and our labeled sequence multilingual translation model on the FLORES-101 benchmark. Compared to M2M, our model supports the additional language eu by extending the fine-tuning data. Therefore, we report SentencePiece-based BLEU using SacreBLEU on 39 translation directions, excluding eu.

Quality of Labeled Sequence Translation Section 3.3 introduces multilingual labeled sequence translation, where entities are surrounded by boundary symbols and then translated into the target language. The multilingual model is trained with the bilingual corpus and the corresponding phrase-level alignment pairs to ensure the quality of labeled sequence translation. We calculate the precision of the baseline model and our model by randomly sampling 250 sentence pairs per language from the whole training data.

The target annotated sentences bring a large improvement to zero-shot cross-lingual NER, benefiting from the knowledge transfer of multilingual self-training. When the size of the target annotated corpora exceeds 10K, our method achieves exceptional performance.

Quality of Entity Projection
Given the target annotated translated sentence and the raw sentence, our method searches for matched entities and projects the labels onto the raw sentence. After filtering, we use the sentences with pseudo labels for multilingual NER model training. Figure 6 reports the F1 scores of the projected labels of the target corpora against the ground-truth labels; each language achieves high F1 scores. The accurate cross-lingual label projection, averaging 82.1 F1 across 39 languages, ensures that our method contributes positively without introducing excessive noise.
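The projection quality reported in Figure 6 is an entity-level F1 between projected and gold label sequences. A minimal sketch of that metric (our own helper names; IOB2 labels assumed, exact span-and-type match):

```python
def entity_spans(labels):
    """Extract (start, end, type) entity spans from an IOB2 label sequence."""
    spans, start = [], None
    for i, lab in enumerate(labels + ["O"]):   # sentinel closes a trailing entity
        if start is not None and not lab.startswith("I-"):
            spans.append((start, i, labels[start][2:]))
            start = None
        if lab.startswith("B-"):
            start = i
    return set(spans)

def entity_f1(pred, gold):
    """Entity-level F1 between projected labels and ground-truth labels."""
    p, g = entity_spans(pred), entity_spans(gold)
    if not p or not g:
        return 0.0
    tp = len(p & g)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)
```

In practice one would use an established evaluator (e.g. the conlleval script or the seqeval package) rather than this sketch.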
Example Study Table 7 lists a concrete example comparing our multilingual model with the baseline. In practice, we use the special token __SLOT{i}__ as the boundary symbol b_i, so entities are surrounded by "__SLOT{i}__" for translation. For the baseline model, the positions of the boundary symbols "__SLOT0__" and "__SLOT1__" are misplaced during translation. In contrast, the multilingual model trained with alignment pairs accurately translates the sentence and keeps the boundary symbols in the correct positions, owing to the phrase-level alignment information.
Transfer for Distant Languages Compared to the transferability provided by cross-lingual pretrained models, our method bridges the gap between the source language and distant target languages. The average F1 scores for languages similar and distant to English are denoted by Avg_sim and Avg_dis. In Figure 7, Avg_sim gains +5.1 points, while Avg_dis outperforms XLM-R by a large margin of +19.2 points. The NER model trained on the English corpus and initialized from the pretrained model extends easily to similar languages but transfers poorly to distant languages (Leng et al., 2019). Through cross-lingual entity projection, our method effectively encourages knowledge transfer from the source language to distant languages compared with the baseline.

Explanation for Entity Matching
In Figure 8, we list two detailed examples of entity matching: (a) a mismatched entity and (b) a matched entity. In the first example, "哥的堡" in the back-translated target does not match "哥德堡" in the raw target under lexical matching, so the LOC label of "哥的堡" cannot be projected onto "哥德堡". In the second example, "哥德堡" in the back-translated target matches "哥德堡" in the raw sentence word by word, so we obtain the labeled entity "哥德堡" (LOC) in the target sentence. Target sentences with missing entities, where the labels cannot be projected onto an entity as in the first example, are discarded. Finally, we select only 10% of all raw target sentences for multilingual NER training to avoid extra noise, and achieve state-of-the-art performance compared to previous baselines.

Multilingual Translation
Inspired by the success of neural machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), multilingual translation has attracted considerable attention due to its capability to handle multiple languages within a single shared model (Pan et al., 2021; Xie et al., 2021; Zhou et al., 2021a; Zhang et al., 2021). Previous works explicitly leverage word-level or phrase-level alignment information to improve performance (Song et al., 2019; Yang et al., 2020, 2021). Recently, massively multilingual models (Fan et al., 2020; Goyal et al., 2021) have been proposed, all trained on large amounts of training data. Motivated by these works, we combine phrase-level alignment pairs with a many-to-many multilingual model covering 40 languages to construct a labeled sequence translation system for the cross-lingual NER task.

Conclusion
In this work, we propose a novel zero-shot cross-lingual NER framework with a multilingual labeled sequence translation model guided by multilingual corpora and phrase-level alignment pairs. The knowledge of the source NER model is effectively transferred to target languages via round-trip translation and label projection. In this way, the multilingual translation model serves as a bridge for transferring knowledge from the source language to low-resource target languages. Experimental results on the CoNLL-5 and XTREME-40 benchmarks demonstrate the effectiveness of our method compared to strong baselines.

Limitations
The total number of languages in our multilingual labeled sequence translation is limited by the data availability of cross-lingual NER. Once NER datasets for more languages are available, we can train a stronger multilingual translation model to further enhance overall performance. In future work, our method can be scaled up to hundreds of languages to meet the needs of practical industrial scenarios.
Figure 4: t-SNE (Maaten and Hinton, 2008) visualization of the sentence representations for the golden data (a) and the translated data generated by our multilingual model (b). Each color denotes one language.

Figure 6: F1 scores of cross-lingual entity projection based on the golden labels. The languages are ordered alphabetically.

Figure 7: Comparison between the multilingual model w/o the alignment information and the counterpart w/ the alignment information in training.

Figure 2: Overview of the cross-lingual entity projection pipeline. Target raw corpora (de, es, fr, ...) are forward-translated into the source language (en) and tagged by the source NER model Θ_src; the translated source annotated corpora are back-translated with labels (labeled sequence back-translation); and the entity labels are projected onto the target raw sentences (entity projection). Example shown for German: "Während dieser Zeit traf er Goze Delchev" (translated target) and "In dieser Zeit lernte er Goze Delchev kennen" (raw target), both labeled O O O O O B-PER I-PER.
x_p = (x_1, ..., b_i, x_{u1:u2}, b_i, ..., x_m) and y_p = (y_1, ..., b_i, y_{v1:v2}, b_i, ..., y_n), where y_{v1:v2} is the target translation of the source piece x_{u1:u2}. x_{u1:u2} denotes the source phrase from the u1-th token to the u2-th token, and y_{v1:v2} denotes the target phrase from the v1-th token to the v2-th token. b_i is the boundary symbol indicating the i-th entity. We use the alignment tool eflomal [1] to extract the aligned phrases of the sentence pair. x_p and y_p constitute a labeled training sample.

Table 3: Results of our proposed method CROP and other relevant baselines for cross-lingual NER. "Avg_all" represents the average F1 score over all 39 languages on the test set of the XTREME-40 benchmark.

Table 4: Ablation study of our proposed method. Avg_all denotes the average F1 score over 39 languages.

Table 7: Evaluation results for similar and distant languages of the source language with different sizes of pseudo data. Avg_sim and Avg_dis represent the average F1 scores of similar and distant languages.