MulDA: A Multilingual Data Augmentation Framework for Low-Resource Cross-Lingual NER

Named Entity Recognition (NER) for low-resource languages is both a practical and a challenging research problem. This paper addresses zero-shot transfer for cross-lingual NER, especially when the amount of source-language training data is also limited. The paper first proposes a simple but effective labeled sequence translation method to translate source-language training data to target languages, which avoids problems such as word order change and entity span determination. With the source-language data as well as the translated data, a generation-based multilingual data augmentation method is introduced to further increase diversity by generating synthetic labeled data in multiple languages. These augmented data enable the language model based NER models to generalize better, with both the language-specific features from the target-language synthetic data and the language-independent features from the multilingual synthetic data. An extensive set of experiments demonstrates encouraging cross-lingual transfer performance on a wide variety of target languages.


Introduction
* Equal contribution, order decided by coin flip. Linlin Liu and Bosheng Ding are under the Joint PhD Program between Alibaba and Nanyang Technological University. Our code is available at https://ntunlpsg.github.io/project/mulda/.

Named entity recognition (NER) aims to identify and classify entities in a text into predefined types, which is an essential tool for information extraction. It has also been proven useful in various downstream natural language processing (NLP) tasks, including information retrieval (Banerjee et al., 2019), question answering (Fabbri et al., 2020) and text summarization (Nallapati et al., 2016). However, except for some resource-rich languages
(e.g., English, German), training sets for most of the other languages are still very limited. Moreover, it is usually expensive and time-consuming to annotate such data, particularly for low-resource languages (Kruengkrai et al., 2020). Therefore, zero-shot cross-lingual NER has attracted growing interest recently, especially with the influx of deep learning methods (Mayhew et al., 2017; Joty et al., 2017; Jain et al., 2019). Existing approaches to cross-lingual NER can be roughly grouped into two main categories: instance-based transfer via machine translation (MT) and label projection (Mayhew et al., 2017; Jain et al., 2019), and model-based transfer with aligned cross-lingual word representations or pretrained multilingual language models (Joty et al., 2017; Baumann, 2019; Conneau et al., 2020). Recently, Wu et al. (2020) unify instance-based and model-based transfer via knowledge distillation.
These recent methods have demonstrated promising zero-shot cross-lingual NER performance. However, most of them assume the availability of a considerable amount of training data in the source language. When we reduce the size of the training data, we observe significant performance drops. For instance-based transfer, decreasing the training set size also amplifies the negative impact of the noise introduced by MT and label projection. For model-based transfer, although large-scale pretrained multilingual language models (LMs) (Conneau et al., 2020) have achieved state-of-the-art performance on many cross-lingual transfer tasks, simply fine-tuning them on a small training set is prone to over-fitting (Wu et al., 2018; Kou et al., 2020).
To address the above problems under the setting of low-resource cross-lingual NER, we propose a multilingual data augmentation (MulDA) framework to make better use of the cross-lingual generalization ability of pretrained multilingual LMs. Specifically, we consider a low-resource setting for cross-lingual NER, where there is very limited source-language training data and no target-language train/dev data. Such a setting is practical and useful in many real scenarios.
Our proposed framework seeks initial help from the instance-based transfer (i.e., translate train) paradigm (Fang et al., 2020). We first introduce a novel labeled sequence translation method to translate the training data to the target language as well as to other languages. This allows us to finetune the LM-based NER model on multilingual data rather than on the source-language data only, which helps prevent over-fitting on language-specific features. One commonly used tool for translation is the off-the-shelf Google Translate system, which supports more than 100 languages. Alternatively, there are also many pretrained MT models conveniently accessible; e.g., more than 1,000 MarianMT (Junczys-Dowmunt et al., 2018; Kim et al., 2019) models have been released on the Hugging Face model hub. Note that instance-based transfer methods add limited semantic variety to the training set, since they only translate entities and the corresponding contexts to a different language. In contrast, data augmentation has been proven to be a successful method for tackling the data scarcity problem. Inspired by a recent monolingual data augmentation method (Ding et al., 2020), we propose a generation-based multilingual data augmentation method to increase diversity, where LMs are trained on multilingual labeled data and then used to generate more synthetic training data.
We conduct extensive experiments and analysis to verify the effectiveness of our methods. Our main contributions can be summarized as follows:

• We propose a simple but effective labeled sequence translation method to translate the source training data to a desired language. Compared with existing methods, our labeled sentence translation approach leverages placeholders for label projection, which effectively avoids many issues faced during word alignment, such as word order change, entity span determination, noise-sensitive similarity metrics and so on.
• We propose a generation-based multilingual data augmentation method for NER, which leverages multilingual language models to add more diversity to the training data.

2 https://cloud.google.com/translate
3 https://huggingface.co/transformers/model_doc/marian.html
• Through empirical experiments, we observe that when fine-tuning pretrained multilingual LMs for low-resource cross-lingual NER, translations to more languages can also be used as an effective data augmentation method, which helps improve performance in both the source and the target languages.
MulDA: Our Multilingual Data Augmentation Framework

We propose a multilingual data augmentation framework that leverages the advantages of both instance-based and model-based transfer for cross-lingual NER. In our framework, a novel labeled sequence translation method is first introduced to translate the annotated training data from the source language S to a set of target languages T = {T_1, . . . , T_n}. Then language models are trained on {D_S, D_{T_1}, . . . , D_{T_n}} to generate multilingual synthetic data, where D_S is the source-language training data, and D_{T_i} is the translated data in language T_i. Finally, we post-process and filter the augmented data to train multilingual NER models for inference on target-language test sets.

Labeled Sequence Translation
We apply labeled sequence translation to the training data of the source language to generate multilingual NER training data, which can also be viewed as a method for data augmentation. Prior methods (Jain et al., 2019) usually perform translation and label projection in two separate steps: 1) translate source-language training sentences to the target language; 2) propagate labels from the source training data to the translated sentences via word-to-word/phrase-to-phrase mapping with alignment models or algorithms. However, these methods suffer from a few label projection problems, such as word order change, word span determination, and so on. An alternative that avoids the label projection problems is word-by-word translation (Xie et al., 2018), but often at the sacrifice of translation quality.
We address the problems identified above by first replacing named entities with contextual placeholders before sentence translation; after translation, we replace the placeholders in the translated sentences with the corresponding translated entities. An illustration of the method is shown in Figure 1.
Assume a sentence X_S = {x_1, . . . , x_M} ∈ D_S and the corresponding NER tags {y_1, . . . , y_M} are given, where the x_i's are the sentence tokens and M is the sentence length. Let {E_1, . . . , E_n} denote the predefined named entity types. Our method first replaces all entities in {x_1, . . . , x_M} with placeholders (src of step 1 in Figure 1). A placeholder is a constructed token with the corresponding entity type E as prefix and the index k of the entity as suffix. Assume {x_i, . . . , x_j} is the k-th entity in the source sentence and its type is E_z; then we replace the entity with the placeholder E_z k to get {. . . , x_{i−1}, E_z k, x_{j+1}, . . .}. We use X_S* to denote the sentence generated after replacing all entities with placeholders. X_S* is fed into an MT model to obtain the translation X_T* in the target language T. With this design, the placeholder prefix E_z provides the MT model with relevant contextual information about the entity, so that the model can translate the sentence with reasonably good quality. Besides, we observe that most placeholders are unchanged after translation, which can be used to help locate the positions of entities.
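The placeholder substitution in the first step can be sketched as follows. This is a minimal illustration assuming BIO-tagged input; the function name and the concrete placeholder format (e.g., "PER0" for the first PER entity) are our own simplification of the E_z k scheme described above:

```python
def mask_entities(tokens, tags):
    """Replace BIO-tagged entity spans with contextual placeholders.

    Returns the masked token list X_S* and the list of replaced entities
    as (placeholder, entity tokens, entity type) triples.
    """
    out, entities = [], []
    k = 0  # running entity index, used as the placeholder suffix
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1
            placeholder = f"{etype}{k}"  # entity type prefix + index suffix
            entities.append((placeholder, tokens[i:j], etype))
            out.append(placeholder)
            k += 1
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out, entities
```

For example, `mask_entities(["Tim", "Cook", "visited", "Berlin"], ["B-PER", "I-PER", "O", "B-LOC"])` yields the masked sentence `["PER0", "visited", "LOC1"]`, ready to be fed to the MT model.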
In the second step, we translate each entity with its corresponding context. More specifically, we use square brackets to mark the span of each entity and translate the entities to the target language successively, one at a time (src of step 2 in Figure 1). For example, to translate an entity, we feed the source sentence with that entity enclosed in square brackets into the MT model. Then we can get the entity translations by extracting the square-bracket-marked tokens from the translated sentences. We translate the entities directly if the square brackets are not found in the translation.
Finally, we replace the placeholders in X_T* (obtained from the first step) with the corresponding entity translations (obtained from the second step) and copy each placeholder prefix as the entity label to generate the synthetic training data in the target language (step 3 in Figure 1). We tested the proposed method with Google Translate and the MarianMT (Junczys-Dowmunt et al., 2018; Kim et al., 2019) models, and found that both produce high-quality synthetic data as expected.
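Step 3 can be sketched as follows. This hypothetical helper assumes the placeholders survived translation unchanged and that the entity translations were collected in step 2; BIO tags are emitted from the placeholder prefix:

```python
def unmask_entities(translated_tokens, entity_translations):
    """Replace placeholders in a translated sentence X_T* with translated
    entities, copying the placeholder prefix as BIO entity labels.

    entity_translations maps placeholder -> (entity type, translated tokens).
    """
    tokens, tags = [], []
    for tok in translated_tokens:
        if tok in entity_translations:
            etype, ent_toks = entity_translations[tok]
            tokens.extend(ent_toks)
            # first entity token gets B-, the rest get I-
            tags.extend(["B-" + etype] + ["I-" + etype] * (len(ent_toks) - 1))
        else:
            tokens.append(tok)
            tags.append("O")
    return tokens, tags
```

For example, replacing the placeholders in the German translation `["PER0", "besuchte", "LOC1"]` with the translated entities recovers a fully labeled target-language sentence.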

Synthetic Data Generation with Language Models
Although labeled sequence translation generates high-quality multilingual NER training data, it adds limited variety since translation does not introduce new entities or contexts. Inspired by DAGA (Ding et al., 2020), we propose a generation-based multilingual data augmentation method to add more diversity to the training data. DAGA is a monolingual data augmentation method designed for sequence labeling tasks, which has been shown to add significant diversity to the training data. As in the example shown in Figure 2, it first linearizes labeled sequences by adding each entity type before the corresponding sentence token. Then an LSTM-based LM (LSTM-LM) is trained on the linearized sequences in an autoregressive way, after which the begin-of-sentence token [BOS] is fed into the LSTM-LM to generate synthetic training data autoregressively. The monolingual LSTM-LM of DAGA is trained in a similar way as the example shown in Figure 3, except that there is no language tag [en].
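The DAGA-style linearization can be sketched as follows. This is a minimal illustration with a function name of our own; the optional language tag corresponds to the multilingual variant with special language tokens, and omitting it gives the monolingual DAGA format:

```python
def linearize(tokens, tags, lang=None):
    """Linearize a labeled sequence for LM training: each non-O tag is
    placed immediately before its token, and an optional language tag
    (e.g. "[en]") is prepended for the multilingual variant."""
    seq = [f"[{lang}]"] if lang else []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            seq.append(tag)
        seq.append(tok)
    return seq
```

For example, `linearize(["Jose", "lives", "in", "Lima"], ["B-PER", "O", "O", "B-LOC"], "en")` produces `["[en]", "B-PER", "Jose", "lives", "in", "B-LOC", "Lima"]`, a single token stream on which the LM is trained autoregressively.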
To extend this method for multilingual data augmentation, we add special tokens at the beginning of each sentence to indicate the language it belongs to. The source-language data and the multilingual data obtained via translation are concatenated to train/finetune multilingual LMs with a shared vocabulary (as shown in Figure 5). Given a labeled sequence {x_1, . . . , x_M} from the multilingual training data, the LMs are trained to maximize the probability p(x_1, . . . , x_M) in Eq. 1:

p(x_1, . . . , x_M) = ∏_{t=1}^{M} p_θ(x_t | x_{<t}),    (1)

where θ is the parameter to optimize, and p_θ(x_t | x_{<t}) is the probability of the next token given the previous tokens in the sequence, which is usually computed with the softmax function. At generation time, we feed the begin-of-sentence token and a language token to the model to generate synthetic training data for the specified language. Besides, to leverage the cross-lingual generalization ability of large-scale pretrained multilingual LMs, we also finetune a recent state-of-the-art seq2seq model, mBART, which is pretrained with multilingual denoising tasks. Sentence permutation and word-span masking are the two noise injection methods used to corrupt the original sentence X = {x_1, . . . , x_M} into g(X), where g(·) denotes the noise injection function. After encoding g(X) with the Transformer encoder, the Transformer decoder is trained to generate the original sequence X autoregressively by maximizing Eq. 1.
Denoising word-span masked sequences is the most relevant to our data augmentation method, since only small modifications are required to make our finetuning task as consistent with the pretraining task as possible. More specifically, we design our finetuning task with the following changes: 1) use the linearized labeled sequences (as shown in Figure 5) as input X; 2) modify g(·) to mask random trailing sub-sequences such that g(X) = {x_1, . . . , x_z, [mask]}, where 1 ≤ z ≤ |X| is a random integer. After finetuning with this task, we can conveniently feed a randomly masked sequence {x_1, . . . , x_z, [mask]} into mBART to generate synthetic data. Figure 4 shows a more concrete example to illustrate how mBART is finetuned with the linearized sequences in our work.
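The modified noise function g(·) is simple enough to sketch directly. The function name is ours and `[mask]` stands in for mBART's actual mask token:

```python
import random

def mask_trailing_span(x, rng=random):
    """g(X): keep a random non-empty prefix x_1..x_z (1 <= z <= |X|) and
    replace the trailing sub-sequence with a single [mask] token."""
    z = rng.randint(1, len(x))  # inclusive on both ends
    return x[:z] + ["[mask]"]
```

Feeding such a masked prefix to the finetuned model and letting the decoder complete the sequence yields a new linearized labeled sentence.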

Semi-supervised Method
Unlabeled multilingual sentences are usually easy to obtain, for example, data from Wikimedia. To make better use of these unlabeled multilingual data, we propose a semi-supervised method to prepare more pseudo labeled data for finetuning multilingual LMs. Inspired by self-training (Zoph et al., 2020), we use the NER model trained on the multilingual translated data to annotate the unlabeled sentences. After that, we use two additional NER models trained with different random seeds to filter the annotated data, removing sentences for which the models produce different tag predictions.
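The agreement-based filtering step can be sketched as follows. This is a hypothetical helper; each model is a stand-in for an NER tagger that returns one tag per token:

```python
def agreement_filter(sentences, models):
    """Keep a pseudo-labeled sentence only if all NER models (trained with
    different random seeds) predict identical tag sequences for it.

    `models` are callables mapping a token list to a tag list.
    """
    kept = []
    for tokens in sentences:
        preds = [m(tokens) for m in models]
        if all(p == preds[0] for p in preds[1:]):
            kept.append((tokens, preds[0]))  # agreed tags become pseudo labels
    return kept
```

Sentences on which the independently trained models disagree are discarded, which trades some recall of pseudo labels for higher precision.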

Post-Processing
We also design several straightforward methods to post-process and filter the augmented data generated by the LMs:

• Delete sequences that contain only O (other) tags.
• Convert the generated labeled sequences to the same format as gold data by separating sentence tokens and NER tags.
• Use the NER model trained on the multilingual translated data to label the generated sequences (after tag removal). Then compare the tags generated by the LM and NER model predictions, and remove the sentences with inconsistencies.
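The post-processing steps above can be sketched as follows. These are hypothetical helpers: `delinearize` reverses the linearization of Figure 2 (covering the second bullet), and `keep_sequence` applies the all-O filter and the NER-consistency filter (the first and third bullets):

```python
def delinearize(seq):
    """Split a generated linearized sequence back into (tokens, tags):
    a token immediately following a tag token inherits it; others get O."""
    tokens, tags = [], []
    pending = None
    for tok in seq:
        if tok.startswith(("B-", "I-")):
            pending = tok
        else:
            tokens.append(tok)
            tags.append(pending or "O")
            pending = None
    return tokens, tags

def keep_sequence(tokens, tags, ner_model):
    """Drop all-O sequences, and sequences where the NER model trained on
    the multilingual translated data disagrees with the generated tags."""
    if all(t == "O" for t in tags):
        return False
    return ner_model(tokens) == tags
```

Only generated sequences that pass both checks are added to the NER training set.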

Experiments
We conduct experiments to evaluate the effectiveness of the proposed multilingual data augmentation framework. Firstly, we compare our labeled sequence translation method with the previous instance-based transfer (i.e., translate train) methods. Following that, we show the benefit of adding multilingual translations. Then we continue to evaluate the generation-based multilingual data augmentation method by comparing the cross-lingual NER performance of models trained on monolingual, bilingual, and multilingual augmented data, respectively. Finally, we further evaluate our methods on a wider range of distant languages. We use a typical Transformer-based NER model in our experiments, which is implemented by adding a randomly initialized feed-forward layer to the final Transformer layer for label classification. Specifically, to demonstrate that our framework can achieve additional performance gains even on top of state-of-the-art multilingual LMs, the checkpoint of the pretrained XLM-R large (Conneau et al., 2020) model is used to initialize our NER models.

Labeled Sequence Translation
We finetune the NER model on the translated target-language data to compare our labeled sequence translation method (§2.1) with the existing instance-based transfer methods.
Experimental settings The CoNLL02/03 NER dataset (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) is used for evaluation, which contains data in four different languages: English, German, Dutch and Spanish. All of the data are annotated with the same set of NER tags. We follow the steps described in §2.1 to translate the English train data to the other three languages. Following Jain et al. (2019), the Google Translate system is used in the experiments. Since our NER model is more powerful than those used in the baseline methods, we reproduce their results with XLM-R large for a fair comparison. All of the NER models are finetuned on the translated target-language sentences only for 10 epochs, with the best model selected using the English dev data, and then evaluated on the original target-language test data.

Results We present the results in Table 1. As we can see, our method outperforms the best baseline method by 2.90 and 2.97 on German and Dutch respectively, and by 2.23 on average. Since our models are only finetuned with the data generated by the labeled sequence translation method, the results directly demonstrate the effectiveness of our method. Moreover, compared with the recent baseline methods (Jain et al., 2019), our method does not rely on complex label projection algorithms and is much easier to implement.

Multilingual Translation as Data Augmentation
After showing that our labeled sequence translation method can generate high-quality labeled data in the target language, in this section we run experiments to verify the hypothesis that multilingual translation can help improve the cross-lingual transfer performance of multilingual LMs in low-resource scenarios.
Experimental settings We use the same NER dataset as above. In order to simulate low-resource scenarios, we randomly sample 500, 1k and 2k sentences from the gold English train set. Our labeled sequence translation method is used to translate the sampled data into pseudo labeled data in the three target languages: German, Spanish and Dutch. To better demonstrate how the training data affects cross-lingual NER performance, we train the NER model under four different conditions: 1) En: train the models on English data only; 2) Tgt-Tran: train the models on the pseudo labeled data in a certain target language only; 3) En + Tgt-Tran: train the models on the combination of English data and pseudo labeled target-language data; 4) En + Multi-Tran: train one single model on the combination of English data and pseudo labeled data in all three target languages. We find that filtering the translated sentences can further improve cross-lingual transfer performance, so we use an NER model trained on the sampled English data to label the translated sentences, count the number of entities in each sentence that differ from the NER model predictions, and then remove the top 20% of sentences with the most inconsistent entities. This is similar to the third step described in §2.4, except that there we remove all inconsistent sentences from the augmented data, since the LMs can be used to generate a large number of candidate sentences. We set the maximum number of epochs to 10 and use 500 sentences randomly sampled from the English dev data to select the best model for each setting. Then the best models are evaluated on the original target-language test sets.
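The 20% filtering heuristic can be sketched as follows. This is a simplified version that, for brevity, scores disagreement at the tag level rather than the entity level described above; all names are ours:

```python
def filter_translations(examples, ner_model, drop_frac=0.2):
    """Score each translated sentence by how much its projected labels
    disagree with an NER model trained on the sampled English data,
    then drop the most inconsistent fraction (20% by default).

    examples: list of (tokens, tags); ner_model: tokens -> predicted tags.
    """
    def n_diff(tokens, tags):
        pred = ner_model(tokens)
        # count gold-labeled tokens whose prediction disagrees
        return sum(1 for a, b in zip(tags, pred) if a != b and a != "O")

    scored = sorted(examples, key=lambda ex: n_diff(*ex))  # cleanest first
    keep = len(examples) - int(drop_frac * len(examples))
    return scored[:keep]
```

Unlike the LM post-processing filter, this keeps most of the translated data and only discards the noisiest tail, since translated sentences are expensive relative to LM-generated candidates.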
Results Table 2 compares the cross-lingual NER performance of the models trained on the different training sets. Although the performance of En and Tgt-Tran is relatively poor in most cases, combining them always boosts the performance significantly, especially when the dataset size is small. Adding multilingual translated data further improves cross-lingual performance by more than 1% on average when the English data size is 1k or less. Therefore, multilingual translation can be used as an effective data augmentation approach in low-resource scenarios of cross-lingual NER. Besides, we also observe that multilingual translated data can even help improve NER performance in the source language. Table 3 summarizes the English test results for the above settings. Tgt-Tran (avg) is the average English result of the models trained on the three different Tgt-Tran sets of German, Spanish and Dutch respectively. En + Tgt-Tran (avg) is the average for combining En with each of the three different Tgt-Tran sets. As we can see, adding additional translated data consistently improves English NER performance. In particular, En + Multi-Tran achieves the best performance. Therefore, we can also use multilingual translated data to improve low-resource monolingual NER performance.

Generation-based Multilingual Data Augmentation
In this section, we run experiments to verify whether applying generation-based data augmentation methods to the multilingual translated data can further improve cross-lingual performance in the low resource scenarios.
Experimental settings We follow the steps described in §2.2 to implement the proposed data augmentation framework on top of LSTM-LM (Kruengkrai, 2019) and mBART separately, and then use them to augment the data processed in §3.2. We concatenate the English gold data and the filtered multilingual translated data to train/finetune the modified LMs, where LSTM-LM is trained from scratch and mBART is initialized with the mBART CC25 checkpoint 8 for finetuning. mBART CC25 is a model with 12 encoder and decoder layers trained on 25 languages. We follow the steps described in §2.4 to post-process the augmented data, and concatenate them with the corresponding English gold and translated multilingual data to train the NER models. The size of the augmented data used in each setting is the same as the size of the corresponding English gold data. MulDA-LSTM and MulDA-mBART denote the methods that use LSTM-LM and mBART augmented data, respectively. In addition, we also report a bilingual version of our method, denoted BiDA-LSTM, which performs data augmentation on the English and translated target-language data only. We follow the same settings as above to evaluate the cross-lingual performance of the NER models trained on the different data.

Results
Average results of 5 runs are reported in Table 4. Note that MulDA-LSTM and MulDA-mBART train a single model for all the target languages in each setting, while BiDA-LSTM trains one model for each target language in each setting. Therefore, we compare BiDA-LSTM with En + Tgt-Tran only. As we can see, the proposed multilingual data augmentation methods further improve cross-lingual NER performance consistently. For the 1k and 2k settings, MulDA-LSTM achieves comparable average performance to BiDA-LSTM.

8 https://github.com/pytorch/fairseq/blob/master/examples/mbart/README.md

Evaluation on More Distant Languages
We evaluate the proposed method on a wider range of target languages in this section.
Experimental settings The Wikiann NER data (Pan et al., 2017) processed by Hu et al. (2020) is used in these experiments. 1k English sentences (D_S^1k) are sampled from the gold train data to simulate the low-resource scenarios. We also assume MT models are not available for all of the target languages, so we only translate the sampled English sentences to 6 target languages: ar, fr, it, ja, tr and zh. D_T^trans is used to denote the target-language sentences translated by following the steps described in §2.1. The low-quality translated sentences are filtered out in the same way as in §3.2. To evaluate our method in the semi-supervised setting, we also sample 5,000 sentences from the training data of the 6 target languages and then remove the NER tags to create unlabeled data D_T^unlabeled. We follow the steps described in §2.3 to annotate D_T^unlabeled with one NER model trained on {D_S^1k, D_T^trans}, and then filter the pseudo labeled data with two other NER models trained on the same data but with different random seeds. We use D_T^semi to denote the data generated with this procedure.

Results We summarize the results in Table 6. Tran-Train is the average performance of the 6 languages that have corresponding training data translated from English. Zero Shot is the average performance of the other target languages. MulDA-LSTM demonstrates promising performance improvements on both the Tran-Train and Zero Shot languages. The performance of MulDA-mBART is slightly lower; one possible reason is the noise introduced by sentences labeled at the character level. We follow the gold data format to label translated zh and ja sequences at the character level, which is inconsistent with how mBART is pretrained. Please refer to Table 5 for the detailed cross-lingual NER results of each language.

Effectiveness in Label Projection
The label projection step of the previous methods needs to locate the entities and determine their boundaries, which is vulnerable to many problems, such as word order change, long entities, etc. Our method effectively avoids these problems with placeholders. In the two examples shown in Figure 6, the method of Jain et al. (2019) either labels only part of the whole entity or incorrectly splits the entity into two, while our method correctly maps the labels.

Multilingual Data Augmentation
We look into the data generated by our multilingual data augmentation method. During LM training, the NER tags can be viewed as a shared vocabulary between different languages. As a result, we find that some generated sentences contain tokens from multiple languages, which helps improve cross-lingual transfer (Tan and Joty, 2021). Two examples are shown in Figure 7.

Related Work
Cross-lingual NER There has been growing interest in cross-lingual NER. Prior approaches can be grouped into two main categories, instance-based transfer and model-based transfer. Instance-based transfer translates source-language training data to the target language, and then applies label projection to annotate the translated data (Tiedemann et al., 2014; Jain et al., 2019). Instead of MT, some earlier approaches also use parallel corpora to construct pseudo training data in the target language (Yarowsky et al., 2001; Fu et al., 2014). To minimize resource requirements, Mayhew et al. (2017) and Xie et al. (2018) design frameworks that only rely on word-to-word/phrase-to-phrase translation with bilingual dictionaries. Besides, there are also many studies on improving label projection quality with additional features or better mapping methods (Tsai et al., 2016). Different from these methods, our labeled sentence translation approach leverages placeholders to determine the position of entities after translation, which effectively avoids many issues during label projection, such as word order change, entity span determination, noise-sensitive similarity metrics and so on. Model-based transfer directly applies the model trained on the source language to the target-language test data (Täckström et al., 2012; Ni et al., 2017; Joty et al., 2017; Chaudhary et al., 2018), which heavily relies on the quality of cross-lingual representations. Recent methods have achieved significant performance improvements by fine-tuning large-scale pretrained multilingual LMs (Devlin et al., 2019; Keung et al., 2019; Conneau et al., 2020). Besides, there are also some approaches that combine instance-based and model-based transfer (Wu et al., 2020). Compared with these methods, our approach leverages MT models and LMs to add more diversity to the training data, and prevents over-fitting on language-specific features by fine-tuning NER models on multilingual data.
Data augmentation Data augmentation (Simard et al., 1998) adds more diversity to training data to help improve model generalization, and has been widely used in many fields, such as computer vision (Zhang et al., 2018), speech (Cui et al., 2015; Park et al., 2019) and NLP (Wang and Eisner, 2016). For NLP, back translation (Sennrich et al., 2016) is one of the most successful data augmentation approaches, which translates target-language monolingual data to the source language to generate more parallel data for MT model training. Other popular approaches include synonym replacement (Kobayashi, 2018), random deletion/swap/insertion (Kumar et al., 2020), generation (Ding et al., 2020), etc. Data augmentation has also been proven useful in cross-lingual settings (Singh et al., 2020; Riabi et al., 2020; Qin et al., 2020), but most of the existing methods overlook the better utilization of multilingual training data when such resources are available.

Conclusions
We have proposed a multilingual data augmentation framework for low resource cross-lingual NER. Our labeled sequence translation method effectively avoids many label projection related problems by leveraging placeholders during MT. Our generation-based multilingual data augmentation method generates high quality synthetic training data to add more diversity. The proposed framework has demonstrated encouraging performance improvement in various low-resource settings and across a wide range of target languages.

A.3 Visualization of Entity Representations
We visualize the last-layer Transformer outputs of the finetuned NER model with t-SNE. We finetune two XLM-R-initialized NER models, one on the English data and one on the MulDA-LSTM augmented data, and generate last-layer representations with the Chinese test data. Only the token representations corresponding to the B and I tags are saved. The two-dimensional t-SNE visualizations are shown in Figures 9 and 10. As we can see, the representation clusters corresponding to different NER entity types in Figure 10 (MulDA-LSTM) are better separated than those in Figure 9 (English).

A.4 Parameters
The parameters used for NER model fine-tuning are shown in Table 8.