Constrained Labeled Data Generation for Low-Resource Named Entity Recognition

Named Entity Recognition (NER) in low-resource languages has been a long-standing challenge in NLP. Recent work has shown great progress in two directions: developing cross-lingual features and models to transfer knowledge to low-resource languages, and translating source-language training data into low-resource target-language training data by projecting annotations with cheap resources. We focus on the second direction in this study. Existing methods suffer from the low quality of the resulting annotated data in the target language; for example, they cannot handle word order and lexical ambiguity well. To address these limitations, we propose a novel approach that uses the projected annotations to generate pseudo-supervised data with a transformer language model and a constrained beam search. This allows us to generate more diverse, higher-quality annotated data in the target language, and in larger quantities. Experiments demonstrate that, when combined with available cross-lingual features, our method achieves state-of-the-art or competitive performance on NER in a low-resource setting, especially for languages that are distant from our source language, English.


Introduction
Named entity recognition (NER), the task of finding and classifying named entities in text, is a well-studied task in natural language processing (NLP). However, its success is highly dependent on the amount and quality of annotated data. For most of the world's languages, supervised resources are scarce, and building a good NER system with little to no annotated data remains a challenging problem.

Figure 1: (a) The pipeline of our data generation system; (b) an English-to-German example. An NER-annotated English sentence at the top, as the input, produces (multiple) NER-annotated German sentence(s) at the bottom. Red words are labeled named entities. Our generation method is denoted CLDG (see §3.3 for details).
To address this challenge in low-resource NER, recent work studies the benefits of weakly- or partially-annotated data (Dehghani et al., 2018; Mayhew et al., 2019), and of transferring knowledge from high-resource languages to low-resource languages. Common corpora for developing cross-linguality include parallel text (Wang and Manning, 2014; Ni and Florian, 2016), Wikipedia (Nothman et al., 2013; Pan et al., 2017), and multilingual dictionaries or gazetteers (Tsai et al., 2016). However, the effectiveness of these approaches depends on the quality and quantity of such data: parallel text is unavailable for some low-resource languages, dictionaries are usually small, and although Wikipedia covers 295 languages, most editions are too sparse to be useful. Mayhew et al. (2017) and Xie et al. (2018) employed phrase-level and word-level translation, respectively, to produce target-language training data by projecting annotations. Xie et al. (2018) also tried to alleviate word-order divergence across languages by adding self-attention layers; however, this only makes the NER classifier insensitive to word order, and the benefits of order information are still ignored.
In this study, we propose Constrained Labeled Data Generation (CLDG), a method that generates pseudo-labeled data in low-resource languages using only cheap resources: a dictionary and unannotated text in the target language. Fig. 1 illustrates the pipeline of our labeled data generation system. We first translate high-resource labeled sentences into the target language word by word with a dictionary. Next, we construct target-language text from the source-language named entities with a pretrained language model. We introduce a decoding strategy with declarative constraints (i.e., hard constraints) to ensure the presence of the entities in the generated text.
By constructing data artificially in this way, we obtain sentences containing the projected annotated entities with more natural, contextually correct word order. Moreover, multiple annotated target-language sentences can be generated with our method from a given annotated sentence in English. To the best of our knowledge, this work is the first to artificially generate labeled data via constrained text generation. Our method improves the current state-of-the-art results on NER across several low-resource languages. Since our approach generates pseudo data from the labeled source-language tokens, it can potentially generalize to other cross-lingual NLP tasks.
Related Work

Cross-lingual NLP
There are two main approaches to cross-lingual learning: parallel projection and developing language-independent features. The first obtains pseudo-labeled target-language data by projecting annotations from the source to the target language through a parallel corpus; a model is then trained in the target language. It has been applied to many tasks, such as part-of-speech tagging (Fang and Cohn, 2016; Das and Petrov, 2011), NER (Wang and Manning, 2014; Mayhew et al., 2017), and parsing (McDonald et al., 2011). The second approach attempts to learn language-independent features with which a model trained on the source can transfer directly to the target language. For example, Tsai et al. (2016) developed cross-lingual features from inter-language links in Wikipedia. Multilingual BERT (Devlin et al., 2019a; Pires et al., 2019) is trained on 104 languages and provides powerful cross-lingual contextual representations for many tasks.

Transformers for Text Generation
Self-supervised learning has achieved remarkable success in a wide range of NLP tasks (Vaswani et al., 2017; Peters et al., 2018; Devlin et al., 2019b). Pourdamghani et al. (2019) apply transformers to unsupervised machine translation, but it is hard to align named entities in the translated sentences. For text generation, transformer-based models like GPT (Radford et al., 2019; Brown et al., 2020) have shown great potential; these models are pretrained on large unlabeled corpora crawled from the web. BART (Lewis et al., 2020) learns a model by reconstructing input corrupted by an arbitrary noising operation (e.g., token masking, token deletion, text infilling) and is particularly effective for text generation. T5 (Raffel et al., 2020) improves transfer learning by reformulating all tasks into a unified "text-to-text" format and achieves state-of-the-art results on benchmarks such as summarization.
To overcome the challenge of generating coherent long text, ProGeT (Tan et al., 2020) first produces a sequence of informative words and then progressively adds tokens until completing a full passage, evaluating word importance with a TF-IDF metric. In our experiments, we use this method to select the input to the language model. Unlike ProGeT, which generates sequences in multiple stages, we generate the full text in a single pass.

Constrained Text Generation
Constrained text generation aims to decode sentences with desired attributes such as topic (Feng et al., 2018) or style (Luo et al., 2019). In this work, we focus on hard constraints. MaskGAN (Fedus et al., 2018) fills in missing text conditioned on context; it can be used for hard-constrained generation by masking non-constraint words, but the constraints then have fixed positions in the text. Insertion Transformer (Stern et al., 2019) addresses this issue by inserting tokens between lexical constraints iteratively, but to cover all valid hard-constrained generations it must permute the ordering of the constraints. Grid beam search (GBS) (Hokamp and Liu, 2017) offers another solution by extending beam search with hard constraints that allow word insertion and permutation. Fig. 3 shows a visualization of GBS: the vertical axis represents completed constraints, and the horizontal axis indicates the output sequence, including constrained and unconstrained tokens. At each step, each hypothesis produces candidates in two directions: generating a word from the model distribution, or completing a constraint. GBS then selects the top k candidates as the next hypotheses. Dynamic Beam Allocation was proposed to speed up constrained decoding (Post and Vilar, 2018). In this paper, we extend GBS to allow source-language constraints for text generation in the target language.
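To make the GBS mechanics concrete, the following is a minimal sketch of grid beam search with single-token constraints, using a toy uniform scorer in place of a real language model. Phrase constraints, EOS handling, and length normalization are omitted; this illustrates the grid structure, not the cited implementation.

```python
# A minimal sketch of grid beam search (Hokamp and Liu, 2017) with
# single-token constraints and a toy stand-in for the language model.
import math
from dataclasses import dataclass, field

@dataclass
class Hyp:
    tokens: list = field(default_factory=list)
    logp: float = 0.0
    met: frozenset = frozenset()  # indices of constraints already covered

def score_next(tokens, vocab):
    # toy stand-in for a language model: uniform next-token log-probs
    return {w: math.log(1.0 / len(vocab)) for w in vocab}

def grid_beam_search(vocab, constraints, max_len=6, k=2):
    # grid[c] is the beam of hypotheses covering exactly c constraints
    grid = {c: [] for c in range(len(constraints) + 1)}
    grid[0] = [Hyp()]
    for _ in range(max_len):
        new_grid = {c: [] for c in grid}
        for c, beam in grid.items():
            for hyp in beam:
                dist = score_next(hyp.tokens, vocab)
                # (1) open generation: stay at the same constraint level
                for w, lp in sorted(dist.items(), key=lambda x: -x[1])[:k]:
                    new_grid[c].append(
                        Hyp(hyp.tokens + [w], hyp.logp + lp, hyp.met))
                # (2) emit an unmet constraint: move up one level
                for i, w in enumerate(constraints):
                    if i not in hyp.met:
                        new_grid[c + 1].append(
                            Hyp(hyp.tokens + [w],
                                hyp.logp + dist.get(w, -1e9),
                                hyp.met | {i}))
        # keep only the top-k hypotheses in every grid cell
        grid = {c: sorted(b, key=lambda h: -h.logp)[:k]
                for c, b in new_grid.items()}
    # finished outputs live in the top row: all constraints covered
    return grid[len(constraints)]

for h in grid_beam_search(["the", "a", "ran", "dog"], ["dog", "ran"]):
    print(" ".join(h.tokens), round(h.logp, 2))
```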
Algorithm
Problem Setting. Our objective is to generate hard-constrained annotated data of higher quality and in larger quantity in the target language, starting from a source language (e.g., English), in an unsupervised way. In this work, we limit ourselves to a setting where only the following resources are available:
• Monolingual corpora in the target language.
• A dictionary from the source language to the target language.
• NER training data in the source language.
Our data generation pipeline consists of the following steps:
1. Word-by-word translation of the NER training data from the source language into the target language (§3.1).
2. Taking the important translated words as input, a pretrained transformer model generates the target-language NER training data. The model is pretrained from scratch on data extracted from Wikipedia (§3.2).
3. Hard constraints are applied during generation to include the named entities with their labels (§3.3).

Word-level Translation
We adopt Cheap Translation (Mayhew et al., 2017) or Bilingual Word Embedding Translation (Xie et al., 2018) to translate the training data from the source language into the target language word by word with a dictionary.
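As an illustration, here is a minimal sketch of word-by-word dictionary translation. The lexicon format (each source word mapped to a list of candidate translations, most frequent first) and the copy-through handling of out-of-vocabulary words are our assumptions, not a specification of either cited method.

```python
# A minimal sketch of word-by-word dictionary translation; OOV words
# (e.g., many named entities) are copied through unchanged.
def translate_word_by_word(tokens, lexicon):
    translated = []
    for tok in tokens:
        candidates = lexicon.get(tok.lower(), [tok])  # copy OOV words
        translated.append(candidates[0])  # take the most frequent entry
    return translated

en_de = {"the": ["das", "der"], "farm": ["Landwirtschaft"],
         "ministry": ["Ministerium"]}
print(translate_word_by_word(["The", "farm", "ministry"], en_de))
# -> ['das', 'Landwirtschaft', 'Ministerium']
```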

Pretraining Language Models
To reduce the noise introduced by wrong word-level translations, we only take the important words as input to the generation model. The vocabulary is sorted by TF-IDF score, and only a small proportion of words with the highest scores (e.g., 25%; this proportion is the TF-IDF threshold in Table 1) are kept as the input. We extract text in the target language from Wikipedia as the training data and train the model with the objective of reconstructing the full text from important words and phrases. The selection of important words is likewise based on the TF-IDF scores.
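A minimal sketch of this selection step is shown below, assuming plain TF-IDF over tokenized sentences; the exact scoring and thresholding used in the paper (and in ProGeT) may differ in detail.

```python
# A minimal sketch of TF-IDF-based input selection: keep only the
# top-scoring fraction of words in each sentence (order preserved) as
# input to the generation model.
import math
from collections import Counter

def tfidf_keep(sentences, keep_ratio=0.25):
    df = Counter(w for s in sentences for w in set(s))  # document frequency
    n_docs = len(sentences)
    kept = []
    for s in sentences:
        tf = Counter(s)
        scores = {w: (tf[w] / len(s)) * math.log(n_docs / df[w]) for w in tf}
        n_keep = max(1, int(len(s) * keep_ratio))
        top = set(sorted(scores, key=scores.get, reverse=True)[:n_keep])
        kept.append([w for w in s if w in top])  # preserve word order
    return kept

corpus = [["the", "farm", "ministry", "said", "the", "plan", "works"],
          ["the", "ministry", "met", "the", "press"]]
print(tfidf_keep(corpus))
```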
In this work, we experiment with BART and T5 as provided by HuggingFace (Wolf et al., 2020) for target-language model pretraining, though our method can use other off-the-shelf generative language models. Since BART and T5 are encoder-decoder transformers, text is generated conditioned on the bidirectional context.
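For reference, a minimal sketch of generating text from a keyword prompt with a HuggingFace BART checkpoint. The public English facebook/bart-base model is used here purely to illustrate the interface; in the paper the model is pretrained from scratch on target-language Wikipedia, and the reconstruct-from-keywords behavior comes from that pretraining, not from this off-the-shelf checkpoint.

```python
# A minimal sketch of seq2seq generation with a HuggingFace BART
# checkpoint; the English facebook/bart-base model only illustrates
# the API, not the paper's target-language pretraining.
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

prompt = "farm ministry spending plan"  # high-TF-IDF words kept as input
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_length=30, num_beams=4)
print(tok.decode(out[0], skip_special_tokens=True))
```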
During training, the model predicts the next token conditioned on the previous words sampled from the ground-truth data distribution. During generation, however, the model conditions on its own imperfect predictions. Since the model has never seen such noisy input, its performance degrades, and this training-generation discrepancy accumulates along the generated sequence; this problem is known as "exposure bias" (Ranzato et al., 2016). To alleviate this issue and increase the robustness of the language model, we add noise to the gold data during training by randomly replacing 10% of the input words with other words in the sentence. We train the model on 100k English sentences and evaluate it on 5k sentences to select the best model as well as the TF-IDF threshold. The experimental results in Table 1 show BART to be the most suitable model, and that 25% gives the best performance and covers most of the important words in a sentence.

Figure 2: A visualization of CLDG with the example from Fig. 1. Yellow rectangles represent hypotheses, blue arrows translate blue words from the source language into the target language, green blocks represent candidates, and green arrows show the selected candidates (beam size = 4) for the next hypotheses. "<farm ministry>" is an example of a phrase-level constraint: "Landwirtschaft" is translated from "farm"; since it is selected as a candidate, it closes the hypothesis, and its next token must finish the current constraint, i.e., the next candidate must be the translation of "ministry". "<British>" shows an example of multiple translations.

Figure 3: The vertical axis c indicates coverage of hard constraints. Each rectangle represents a beam containing k hypotheses. Dashed arrows start or continue a constraint depending on whether the current constraint is finished, while solid arrows generate new words. Beams on the top layer contain candidates covering all the constraints.
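Returning to the training-time noise injection described above, here is a minimal sketch. The exact replacement scheme (e.g., whether a word may be replaced by itself) is our reading of the description, not a specification.

```python
# A minimal sketch of the anti-exposure-bias noise: randomly replace
# 10% of the input words with other words drawn from the same sentence.
# Whether a word may be replaced by itself is not specified; this
# sketch allows it for simplicity.
import random

def add_noise(tokens, noise_ratio=0.1, seed=None):
    rng = random.Random(seed)
    noisy = list(tokens)
    n_swap = max(1, int(len(tokens) * noise_ratio))
    for i in rng.sample(range(len(tokens)), n_swap):
        noisy[i] = rng.choice(tokens)  # draw a replacement from the sentence
    return noisy

print(add_noise("the farm ministry cut its spending plan".split(), seed=0))
```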

Hard-constrained Generation
Transformer language models can produce any word, which may or may not be in the input. To ensure the presence of the source entities in the generated text, we extend GBS to CLDG (Constrained Labeled Data Generation). Fig. 2 illustrates our constrained decoding process; see the Appendix for its pseudo-code. The constraints are the named entities in the source language: they are first translated into the target language (with their labels), and CLDG constrains the output sequence to include them. We use the coordinate system of Fig. 3. In each grid cell (t, c), candidates for new tokens are produced by generating all possible tokens from the hypotheses in cell (t − 1, c), and by choosing one token per constraint for each hypothesis in cell (t − 1, c − 1). Once we start a phrase-level constraint, we close the hypothesis and only choose the next token of the current incomplete constraint. We then select the candidates with the top-k scores (k is the beam size) as hypotheses for the current cell. Since constraints are named entities in the source language, we use a dictionary to translate them into the target language during decoding.
Unlike the original GBS, whose constraints are given in the target language, CLDG produces hypotheses from multiple translations with different token lengths for each source-language constraint. That is, when decoding one sample, the number of beam nodes along the vertical direction in Fig. 3 varies across hypothesis paths, due to the multiple translation choices.
With open-ended GBS generation using the top-k candidate selection strategy, the model tends to generate similar text whenever the constraints and input are the same; it also suffers from issues such as repetitive generation. Sampling (Holtzman et al., 2019) can be used to address these problems. However, when candidates from constraints and non-constraints are pooled for sampling, tokens from the constraints have little chance of being selected, which tends to produce sentences with most entities appearing at the end. We therefore modify GBS to select hypotheses from the constraint candidates and the newly generated candidates separately and evenly: for the constraints we use top-k beam search to pick candidates, while for generated tokens we sample the next hypotheses from among the top-k beam-search candidates. This better balances the two and also yields more diverse data when decoding multiple times.
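A minimal sketch of this modified selection follows. Candidates are (token, log-probability) pairs; the even half/half split and the sampling weights are our illustrative reading, not an exact specification of the decoder.

```python
# A minimal sketch of the modified selection: beam slots are split
# between constraint candidates (chosen greedily by score) and open
# generation candidates (sampled among the top-k without replacement).
import math
import random

def select_hypotheses(constraint_cands, generated_cands, k, rng):
    # constraints: deterministic top-k beam search
    chosen = sorted(constraint_cands, key=lambda h: -h[1])[:k // 2]
    # open generation: sample without replacement among the top-k, so
    # repeated decoding runs yield diverse text
    pool = sorted(generated_cands, key=lambda h: -h[1])[:k]
    while pool and len(chosen) < k:
        weights = [math.exp(lp) for _, lp in pool]
        pick = rng.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(pick))
    return chosen

constraint_cands = [("Ministerium", -0.4), ("Landwirtschaft", -0.9)]
generated_cands = [("sagte", -0.3), ("hat", -0.8), ("wird", -1.2), ("in", -1.5)]
print(select_hypotheses(constraint_cands, generated_cands, k=4,
                        rng=random.Random(0)))
```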
Another potential problem in our method is the unintended introduction of new named entities during generation, since only the translated named entities carry labels. This would degrade the quality of the pseudo-labeled data and lead to low NER recall. To cope with this issue, we adopt the following methods: (1) We restrict the number of new unconstrained tokens to be less than a parameter max unconstrained; once the number of unconstrained tokens hits this bound, only constraints are considered in subsequent decoding. (2) We use a naive NER predictor trained on previously produced data to detect and relabel the added entities. Experiments show this effectively improves model performance.
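A minimal sketch of step (2) is given below. The predictor interface (one tag per token) and the toy capitalization heuristic are assumptions for illustration; any model trained on earlier generated data could fill this role.

```python
# A minimal sketch of relabeling: re-tag tokens that received no
# projected label with a naive NER predictor, so entities introduced
# by free generation do not stay unlabeled.
def relabel(tokens, projected_tags, predictor):
    predicted = predictor(tokens)
    return [proj if proj != "O" else pred
            for proj, pred in zip(projected_tags, predicted)]

# toy predictor tagging capitalized tokens as PER, for illustration only
toy_predictor = lambda toks: ["B-PER" if t[:1].isupper() else "O" for t in toks]

print(relabel(["Merkel", "traf", "das", "Ministerium"],
              ["O", "O", "O", "B-ORG"], toy_predictor))
# -> ['B-PER', 'O', 'O', 'B-ORG']
```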

Implementation Details of CLDG
Lexical Ambiguity. We tackle the problem of multiple translations by allowing multiple candidate tokens, one per translation. The language model then chooses the better candidates among all of them to continue its generation. In many cases, one entry has too many (>35) translations, which would lead to poor generation quality if we considered all of them; to handle this, we consider only a subset of frequently occurring translations (Mayhew et al., 2017). Word Order. We address the problem of word order at two levels during decoding: (1) the global phrase order in a sentence; (2) the local word order within phrases. When there is no phrase-level translation in the dictionary, we first translate word by word, then reorder and select the most appropriate ordering based on the language model. For example, when translating the organization "University
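A minimal sketch of the local reordering step: enumerate permutations of the word-by-word translations of a phrase and keep the one the language model scores highest. The lm_logprob scorer is a hypothetical stand-in (e.g., summed token log-probabilities under the pretrained model), and the toy scorer and phrase are purely illustrative; enumeration is only feasible for short phrases, since the number of permutations grows factorially.

```python
# A minimal sketch of LM-guided local reordering within a phrase.
from itertools import permutations

def best_local_order(translated_words, lm_logprob):
    return max(permutations(translated_words),
               key=lambda order: lm_logprob(" ".join(order)))

# toy scorer that prefers phrases beginning with "Universität"
def toy_lm(text):
    return 1.0 if text.startswith("Universität") else 0.0

print(best_local_order(["von", "Chicago", "Universität"], toy_lm))
# -> ('Universität', 'von', 'Chicago')
```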

Experiments
We generate target-language annotated data via the pipeline in §3. Then we train an NER model on the generated data. We use the standard BiLSTM-CRF architecture (Ma and Hovy, 2016) with an AllenNLP implementation (Gardner et al., 2018).

Datasets
We evaluate our method on the benchmark CoNLL 2002 and 2003 NER datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), which cover four languages: English, Spanish, German, and Dutch. Previous work shows that English is closely related to these European languages in terms of word order (Mayhew et al., 2017). Hence, to demonstrate the advantages of our method, we add several languages that are dissimilar to English: Akan, Arabic, Turkish, Uzbek, Wolof, and Yoruba, evaluated on the LORELEI project's data (Strassel and Tracey, 2016). Among the nine languages we evaluate, Wolof and Yoruba are truly low-resource; for the other languages, we limit the resources used in order to mimic a truly low-resource scenario.
In all experiments, we use the English CoNLL training set as the source and generate training data in the target language. CoNLL has four named entity labels (PER, LOC, ORG, MISC), while LORELEI contains PER, LOC, ORG, and GPE. To address this mismatch, we manually changed some MISC and LOC labels in CoNLL to GPE.
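A minimal sketch of the mechanical part of this relabeling, assuming a small hand-curated gazetteer of geopolitical names; the paper describes a manual change, so the gazetteer and its entries here are illustrative assumptions.

```python
# A minimal sketch of LOC/MISC -> GPE relabeling against a hypothetical
# hand-curated gazetteer; entries are illustrative, not from the paper.
GPE_GAZETTEER = {"Germany", "Berlin", "Turkey"}

def remap(token, tag):
    etype = tag.split("-")[-1]
    if etype in {"LOC", "MISC"} and token in GPE_GAZETTEER:
        return tag.replace(etype, "GPE")
    return tag

print([remap(t, g) for t, g in
       [("Germany", "B-LOC"), ("river", "B-LOC"), ("Berlin", "B-LOC")]])
# -> ['B-GPE', 'B-LOC', 'B-GPE']
```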

Compared Methods
We experiment with the different methods described below. Resources used by each approach are reported in Table 4. All methods are evaluated with the same NER model, using multilingual BERT (Devlin et al., 2019a), hereafter mBERT, as word embeddings. For each experiment, we run five times with different seeds and report the mean and standard deviation (Reimers and Gurevych, 2017) in Table 3.

Bilingual Word Embeddings Translation (BWET)
This approach (Xie et al., 2018) translates annotated source-language data into the target language by inducing a cross-lingual word-level mapping with fastText embeddings trained on Wikipedia and the MASTERLEXES dictionaries.

Our Method (CLDG)
We follow the procedure described in §3 to produce training data. Table 2 presents statistics of the monolingual corpora used for language model pretraining. See §4.3 for a detailed description.

Google Translate
Google Translate is used to translate the English CoNLL training set into the target language sentence by sentence. We project labels across translations using fast_align (Dyer et al., 2013). For languages supported by Google Translate, this serves as an upper bound for translation quality.

Supervised Learning
We train on the target-language gold data and consider this an upper bound for cross-lingual learning.

Experimental Setup for CLDG
One advantage of our method is that, given one labeled English sentence, we can generate multiple sentences in the target language with the specified named entities and labels. Moreover, we can adjust the extent and range of reordering during generation according to the characteristics of each language, yielding a more coherent ordering in the target context. Aside from named entities, we can also adjust how many additional source phrases are regarded as constraints; in one extreme setting, we include only the source-language entities and generate open-ended text.
To demonstrate the generality of our method, we first apply one general setting to the generation for all European languages and one to all non-European languages. Then we fine-tune the generation setting for each language, and generate more data with different settings, to obtain better results. Results are presented and analyzed in §4.4.
In the general setting for the LORELEI languages, we concatenate two sets of data. One is generated without reordering, with translation based on the most frequent source-word pairing in the dictionary. The other is generated with reordering, both global and local, and with all translations of the constraints included as candidates during generation.
In the general setting for the CoNLL languages, we do not reorder during generation, owing to their similarity to English in terms of word order. Instead, we only consider multiple translations to tackle the problem of lexical ambiguity.

Results
We compare all methods across languages in Table 5 and Table 6. As the tables show, our method outperforms previous state-of-the-art methods on the languages that are distant from English and performs competitively on the European languages that are close to English.

Table 8: NER on different dictionaries (one seed). "g-CLDG-XX" indicates producing training data with the general setting described in §4.3 using the XX dictionary.

Languages Similar to & Distant from English
Interestingly, in the European-language experiments, no method showed an obvious edge over zero-shot learning except on German. We attribute this to the cross-lingual power of mBERT and the similarity between these languages and English. Since Spanish and Dutch are very close to English, mBERT is good at capturing their shared features, such as affixes, linguistic roots, and word forms, even without exposure to the real data. These features might already be good enough for NER. Without knowledge of the ground-truth data, naive translation and reordering have a better chance of corrupting the important NER features. We verify this by conducting experiments with BERT instead (Table 7). BERT is trained only on English and transfers limited features across languages; here we observe an average improvement of 22 F1 points over zero-shot learning. This echoes our idea that our method can provide data in the target language with useful features for NER, which is crucial when features learnt from cross-lingual resources are not reliable. However, when the resource is effective enough for zero-shot cross-lingual transfer, the cross-lingual features have higher quality than those learnt from generated data.

Figure 4: Learning curve of data generation (panel (c): Yoruba). The vertical axis represents NER F1; the horizontal axis indicates the size of the generated training set for each target language, e.g., x = 2 means producing the training set twice from the same source document and combining the results for training.
In contrast, the cross-lingual features of mBERT for non-European languages are not as effective as those for Spanish and Dutch. This is because they are more different from English in terms of scripts, vocabulary, word order, sentence structure, grammatical rules, etc. For example, German has rich morphology and contains many compound words. Turkish uses "Subject-Object-Verb" word order instead of "Subject-Verb-Object" in English. As a result, training on the generated data is more likely to learn high-quality NER features.
Despite its surprising performance on Spanish, our method does not improve on Dutch and German. This matches our expectation, because these languages are closer to English and translating words in order does not introduce much noise. Wu et al. (2020) perform better than CLDG on German and Dutch because they use unsupervised text in the target languages as additional data and relabel it with an acceptable NER predictor; this is not an option for the LORELEI languages, where such a predictor is unavailable. In contrast, CLDG works much better in low-resource languages. Good cross-lingual NER performance matters more for low-resource languages, which, unlike the CoNLL languages, lack labeled data; this is why we focus on them.
For the non-European languages, previous methods are able to improve NER performance with limited resources. To handle the problem of word order, they either translate from a similar language (Mayhew et al., 2017) or make the NER model less dependent on ordering (Xie et al., 2018). We offer another perspective: we directly fix the word-order problem via reordering, and improve the quality of translation based on the context using a transformer.

Generation Settings Ablation
In the general setting for all LORELEI languages, we combine two sets of data produced with the two settings described in §4.3, in order to avoid overfitting the generated training data. Observing an average improvement of 2.48 F1 points over CT using the MASTERLEXES dictionary and an average improvement of 3.35 points over BWET using the word-embedding-induced dictionary, we conclude that our method improves performance by selecting better lexical mappings and by reordering.
In addition to the general generation settings, our method can be fine-tuned for each language. Take Yoruba as an example: Yoruba is a West African language spoken by around 50 million people, and it is very under-resourced; even the Yoruba Wikipedia contains only about 66K sentences. By training on data generated with a fine-tuned setting (we produce data with open-ended generation three times and then combine the results), we obtain an average improvement of 12.86 points over zero-shot learning and of 6.03 points over CT (see Table 5). We report the details of the generation settings used in Tables 5, 6, and 8 in the Appendix.

Generation Size Ablation
To study how NER performs as a function of the amount of generated data, we record scores while gradually generating more data. Fig. 4 shows that, in general, the more data we produce, the better NER becomes. One possible explanation is that, despite the noise in the labels, CLDG provides increasingly useful information for NER. However, once the amount of data reaches an upper bound (usually 3 or 4 copies according to our experiments), the noise may overtake the beneficial signal and corrupt performance.

Dictionary Ablation
Comparing results across dictionaries (Table 8), we observe that the performance of our method depends on dictionary quality. For example, on Akan, BWET performs much worse than CT. Although our method beats BWET by a margin of 8.82 points when using the same dictionary, the score is still much lower than that obtained with the CT dictionary.
Surprisingly, Google Translate shows no advantage over the other methods on the CoNLL languages and some LORELEI languages, though it performs better on Arabic and Uzbek. There are several reasons. First, despite high-quality translation for many languages, Google Translate is not very good at some under-resourced languages (e.g., Yoruba); moreover, it supports only 109 languages, so for some low-resource languages such as Akan and Wolof it is not available at all, whereas the other methods only need a dictionary and plain text in the target language. Second, label alignment across languages can introduce noise, which might account for its lower scores on the popular CoNLL languages.

Conclusion and Discussion
In this study, we propose a novel low-resource method that generates pseudo-labeled training data in low-resource languages from English data via constrained text generation. By combining generated data of higher quality and in larger quantity, we achieve state-of-the-art performance on the LORELEI (low-resource) languages and perform competitively on the CoNLL (high-resource) languages. Moreover, our method is competitive within the category of data-transfer methods in cross-lingual learning. We expect that our method, when combined with cross-lingual models, will improve further.

A Dataset Statistics
We report the dataset statistics of our supervised learning experiments (see Tables 5 and 6) below.

B Pseudo-code of CLDG
The pseudo-code of the CLDG algorithm is given in the algorithm table below.

C Generation Settings
In this section, we report different generation settings in Table 11. Notations of parameters are described in Table 10.