MANER: Mask Augmented Named Entity Recognition for Extreme Low-Resource Languages

This paper investigates the problem of Named Entity Recognition (NER) for extreme low-resource languages with only a few hundred tagged data samples. NER is a fundamental task in Natural Language Processing (NLP). A critical driver accelerating NER systems' progress is the existence of large-scale language corpora that enable NER systems to achieve outstanding performance in languages such as English and French with abundant training data. However, NER for low-resource languages remains relatively unexplored. In this paper, we introduce Mask Augmented Named Entity Recognition (MANER), a new methodology that leverages the distributional hypothesis of pre-trained masked language models (MLMs) for NER. The [mask] token in pre-trained MLMs encodes valuable semantic contextual information. MANER re-purposes the [mask] token for NER prediction. Specifically, we prepend the [mask] token to every word in a sentence for which we would like to predict the named entity tag. During training, we jointly fine-tune the MLM and a new NER prediction head attached to each [mask] token. We demonstrate that MANER is well-suited for NER in low-resource languages; our experiments show that for 100 languages with as few as 100 training examples, it improves on state-of-the-art methods by up to 48% and by 12% on average on F1 score. We also perform detailed analyses and ablation studies to understand the scenarios that are best-suited to MANER.


Introduction
Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP) (Nadeau and Sekine, 2007). Given an unstructured text, NER aims to label the named entity of each word, be it a person, a location, an organization, and so on. NER is widely employed as an important first step in many downstream NLP applications, such as scientific information retrieval (Krallinger and Valencia, 2005; Krallinger et al., 2017), question answering (Mollá et al., 2006), document classification (Guo et al., 2009), and recommender systems (Jannach et al., 2022).
Recent advances in NER have mainly been driven by deep learning-based approaches, whose training relies heavily on large-scale datasets (Rosenfeld, 2021). As a result, the most significant progress in NER is for resource-rich languages such as English (Wang et al., 2021), French (Tedeschi et al., 2021), German (Schweter and Akbik, 2020), and Chinese (Zhu and Li, 2022). This reliance on large training datasets makes it challenging to apply deep learning-based NER approaches to low-resource languages where training data is scarce. To illustrate the ubiquity of low-resource languages, WikiANN (Rahimi et al., 2019), one of the largest NER datasets, has NER-labeled data for 176 languages, but 100 of these languages have only 100 training examples.
Providing NER for low-resource languages is critical to ensure the equitable, fair, and democratized utilization of NLP technologies that are required to achieve the goal of making such technologies universally available for all (Magueresse et al., 2020; King, 2015). Several research efforts are pushing the frontiers of NER for low-resource languages in two orthogonal and complementary directions. The first direction aims to obtain larger NER datasets to solve the data scarcity problem, via either data collection or augmentation (Malmasi et al., 2022; Al-Rfou et al., 2014; Meng et al., 2021a). The second direction aims to develop new model architectures and training algorithms capable of handling scarce data. For example, ideas from meta-learning (de Lichy et al., 2021), distant supervision (Meng et al., 2021b), and transfer learning (Lee et al., 2017) leverage the few-shot generalizability of language models for NER in data-scarce settings.
Our Contributions. In this work, we propose Mask Augmented Named Entity Recognition (MANER), a new NER approach for low-resource languages that does not rely on additional data and does not require modifications to existing, off-the-shelf pre-trained models. The key intuition of MANER is to exploit the semantic information encoded in a pre-trained masked language model (MLM), in particular, in the [mask] token. Specifically, we reformat the input to the MLM by prepending a [mask] token to every token in the text to be annotated with NER tags. This reformatted input is then used to fine-tune the MLM with a randomly initialized NER prediction head on top of the prepended [mask] tokens. Extensive experiments on 100 extremely low-resource languages (each with only 100 training examples) demonstrate that MANER improves over state-of-the-art approaches by up to 48% and by 12% on average on F1 score. Detailed ablations and analyses of MANER demonstrate the importance of using the encoded semantic information and suggest the scenarios in which MANER is most applicable.

[Figure 1: How MANER (b) differs from a standard NER model (a). MANER 1) modifies the input to add a [mask] token before each word and 2) predicts the NER tag for a word from its preceding [mask] token.]

Methodology
We now introduce MANER in detail and describe how it functions differently from a standard NER model (henceforth referred to as SNER).
SNER takes a sentence as input, passes the sentence through a transformer encoder model to obtain contextualized word embeddings, and applies an NER classifier layer on top of each word embedding to get the word's NER class.
MANER, in contrast, repurposes the [mask] token for the NER task. MANER implements two key differences compared to SNER: 1) instead of giving the model the input sentence as is, MANER modifies the input sentence to prepend a [mask] token to each word and passes this modified sentence through the transformer encoder; and 2) instead of predicting the NER tag directly from the word embedding itself, MANER predicts the NER tag from the embedding of the [mask] token prepended to each word in the modified input sentence. These differences are illustrated in Figure 1. We hypothesize that in such a setting, MANER can better use the [mask] token to weigh the relative relevance of the neighboring word versus the rest of the context when determining the label to assign to the neighboring word. Below, we expand on these differences and introduce the two key components of MANER.
Modified input sentence. Let the set of NER labels be denoted by $N$. Let the sequence of NER labels for a sentence $S = \{w_0, w_1, \ldots, w_{n-1}\}$ consisting of $n$ words be $L = \{c_0, c_1, \ldots, c_{n-1}\}$, where $c_i \in N$, $0 \le i < n$. To obtain the input that MANER requires, we prepend a [mask] token to each word in sentence $S$. The new sentence is $S' = \{m, w_0, m, w_1, \ldots, m, w_{n-1}\}$, where $m$ is the [mask] token. The modified labels are $L' = \{c_0, \emptyset, c_1, \emptyset, \ldots, c_{n-1}, \emptyset\}$. The original NER label of each word is assigned to the [mask] token immediately to the left of the word.
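For illustration, here is a minimal sketch of this input modification (ours, not the paper's released implementation). The label value -100 is an assumption standing in for $\emptyset$, chosen because PyTorch's cross-entropy loss ignores that index by default:

```python
# Sketch of MANER's input modification (illustrative; not the authors' code).
# A [mask] token is prepended to every word; the word's NER label moves to
# that [mask] token, and the word itself gets the ignored label (the ∅ in L').

MASK = "[mask]"   # stand-in for the backbone tokenizer's mask token string
IGNORE = -100     # stand-in for ∅; ignored by PyTorch's cross-entropy loss

def build_maner_input(words, labels):
    """words: [w_0, ..., w_{n-1}]; labels: [c_0, ..., c_{n-1}] as label ids."""
    new_words, new_labels = [], []
    for word, label in zip(words, labels):
        new_words.extend([MASK, word])      # S' = {m, w_0, m, w_1, ...}
        new_labels.extend([label, IGNORE])  # L' = {c_0, ∅, c_1, ∅, ...}
    return new_words, new_labels

words = ["John", "lives", "in", "Paris"]
labels = [1, 0, 0, 5]  # hypothetical ids, e.g., B-PER, O, O, B-LOC
print(build_maner_input(words, labels))
# (['[mask]', 'John', '[mask]', 'lives', '[mask]', 'in', '[mask]', 'Paris'],
#  [1, -100, 0, -100, 0, -100, 5, -100])
```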
MANER's classifier design. MANER uses a pre-trained masked language model as the backbone with an NER classifier head on top. The transformer model takes a sentence as input and outputs an embedding for each token in the sentence. The NER classifier uses the token embeddings to output the most probable NER class for each token. Denote the MANER model by $\mathcal{M}$. The transformer model is given by $T$, with $T(S') = T(\{m, w_0, m, w_1, \ldots, m, w_{n-1}\}) = \{e_0, e_1, \ldots, e_{2n-1}\}$, where $e_i \in \mathbb{R}^D$, $0 \le i < 2n$, is the token embedding and $D$ is its dimension. The NER classifier is modeled using a weight matrix $M \in \mathbb{R}^{D \times |N|}$ that takes the computed token embeddings as input. Using these token embeddings, the classifier outputs scores for all NER labels for each token in the sentence. Passing these scores through a softmax nonlinearity provides probabilities $p_i \in \mathbb{R}^{|N|}$ over all NER classes in $N$ for a given token $i$ in $S'$:

$$p_i = \mathrm{softmax}(M^\top e_i).$$

MANER training and inference. During training, the weights of $M$ and $T$ are learned/fine-tuned by minimizing the cross-entropy loss. Note that the loss is not calculated for labels marked $\emptyset$ in the modified label set $L'$. The NER label of a word is given by the NER label of the [mask] token preceding it. During inference, each word in the sentence is prepended with the [mask] token, and the NER class of each word is the most probable NER class of its prepended [mask] token.

[Figure 2: F1 scores of SNER vs. MANER on 50 randomly sampled low-resource languages (ace, als, am, ang, ba, bo, cbk-zam, cdo, ce, co, crh, dv, fo, frr, fur, gan, gu, hsb, ia, jv, km, ksh, lmo, ln, min, mn, mt, mwl, mzn, nap, nds, ne, nov, pms, pnb, ps, qu, rw, sa, scn, sco, szl, tk, ug, vep, vls, war, wuu, zea).]
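To make the head and loss concrete, here is a minimal PyTorch sketch (illustrative; the sizes and names are our assumptions, with $D = 1024$ matching XLM-RoBERTa-large and 7 labels matching WikiANN's IOB2 tag set):

```python
import torch
import torch.nn as nn

# Minimal sketch of MANER's prediction head and loss (illustrative).
# `embeddings` stands in for T(S'): one D-dimensional vector per token.
D, num_labels, seq_len = 1024, 7, 8       # assumed sizes for illustration
embeddings = torch.randn(1, seq_len, D)   # stand-in for transformer output
classifier = nn.Linear(D, num_labels)     # plays the role of the matrix M

logits = classifier(embeddings)           # scores for all NER labels
probs = logits.softmax(dim=-1)            # p_i = softmax(M^T e_i)

# Labels follow L': real tags at [mask] positions, -100 (= ∅, ignored) at words.
labels = torch.tensor([[1, -100, 0, -100, 0, -100, 5, -100]])
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(logits.view(-1, num_labels), labels.view(-1))
loss.backward()  # in full training, this fine-tunes both M and the backbone T
```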

Experiments
We perform various empirical studies on MANER to 1) demonstrate its superior performance on low-resource-language NER tasks and 2) provide insights into its performance and the scenarios in which it works well. In Table 1, we report the average F1 score for the 100 languages in WikiANN that we consider. MANER provides a significant 12% average improvement in our low-resource-language settings. The MLM-inspired NER model MLM-NER, in contrast, performs only on par with SNER. We also plot, in Figure 2, the F1 scores of 50 randomly sampled low-resource languages, comparing SNER against MANER (the plot for the remaining languages is in Appendix D). MANER offers up to 48% performance improvement over SNER, and there are only a few languages (12 out of 100) in which SNER outperforms MANER.
We believe the reason MANER outperforms MLM-NER is that MANER uses the [mask] token for NER prediction during both training and inference, whereas MLM-NER does not. MANER thus learns, via the [mask] token, to give more importance to the context in the case of out-of-distribution test labels during inference. We revisit and empirically support this reasoning in Section 3.2. Additionally, MLM-NER training masks out certain words with the [mask] token, which introduces noise and makes training, and the NER task itself, more difficult.

Analysis: Importance of the [mask] token
We now conduct two analyses to demonstrate the importance of using the [mask] token, which encodes the context of the word that needs to be tagged (per the distributional hypothesis (Harris, 1954), which states that the meaning of a word can be inferred from its context).

[Table 2: Average F1 score over the 100 low-resource languages. SNER: 0.649; MANER: 0.715 (12%); MANER with the [rand] token: 0.679 (6%). Percentages are improvements over SNER.]
In the first analysis, we replace the [mask] token in MANER with a control token, namely a random token [rand]. Note that the [rand] token is not learned during XLM-RoBERTa pre-training; thus, it does not encode any contextual information. As we see in Table 2, if we replace the [mask] token with [rand], MANER achieves only a 6% improvement in F1 performance over the SNER baseline. This result illustrates the power of the context: even though the [rand] token acquired no contextual information during pre-training, MANER can still use the [rand] token to decide how much weight to assign to the context versus the immediately adjacent word, depending on the test sample.
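One plausible way to realize the [rand] control token (an assumption on our part; the paper does not give implementation details) is to register a brand-new token whose embedding is freshly initialized:

```python
# Sketch of the [rand] control-token ablation (our assumed implementation).
# A token newly added to the vocabulary receives a randomly initialized
# embedding, so, unlike [mask], it encodes no knowledge from pre-training.
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=7  # 7 = WikiANN's IOB2 tag set
)

tokenizer.add_special_tokens({"additional_special_tokens": ["[rand]"]})
model.resize_token_embeddings(len(tokenizer))  # new row is randomly initialized

# Training then proceeds exactly as in MANER, but with "[rand]" rather than
# the tokenizer's mask token prepended to every word.
```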
In the second analysis, we report in Table 3 the average F1 score over only those languages on which the XLM-RoBERTa model was pre-trained with at least 0.5 GB of training data per language. The rationale behind this experimental design is that the [mask] token encodes the context semantics of a language only if the language was seen during the pre-training stage of the XLM-RoBERTa model. As we see in Table 3, MANER provides a whopping 18% improvement in F1 score in this case (compared to the 12% gain in Table 1). This experiment again highlights the importance of the [mask] token in MANER.

Analysis: Effect of training set size
We measure the effectiveness of MANER in situations where more training data is available. For this purpose, we select 4 languages from the WikiANN dataset that have 1000 training samples each. From Figure 3, we see that MANER boosts F1 performance over the SNER baseline until about 400 samples, after which both methods perform similarly. This result demonstrates that MANER is best suited for extreme low-resource languages and rapid prototyping, because it is easy and cost-effective to obtain the very few human annotations needed to achieve large performance improvements (e.g., just 100 annotations are required).

Conclusions
In this paper, we have proposed Mask Augmented Named Entity Recognition (MANER) for NER in extreme low-resource language settings. MANER exploits the information encoded in pre-trained masked language models (specifically, inside the [mask] token) and outperforms existing approaches for extreme low-resource languages, with as few as 100 training examples, by up to 48% and by 12% on average on F1 score. Analyses and ablation studies show that the semantics encoded in the [mask] token are integral to MANER. Future work will exploit MANER's effectiveness in highly resource-constrained and human-in-the-loop settings, such as rapid prototyping in an active learning setup and few-shot learning with human annotators.

Limitations
Our proposed method MANER for improving NER is best suited for low-resource settings. As discussed in Section 3.3, we measured the effectiveness of MANER in situations where more training data is available and found that MANER boosts F1 performance over the SNER baseline until about 400 training examples, after which both methods perform similarly. This result demonstrates that MANER is best suited for extreme low-resource languages and rapid prototyping, because it is easy and cost-effective to obtain the very few human annotations needed to achieve significant performance improvements.
We base the experiments in this paper on a widely adopted model, XLM-RoBERTa, pretrained on multiple languages. It is possible that the empirical conclusions we draw from the observations do not generalize to other pre-trained models.

Ethics Statement
We believe providing NER for low-resource languages is critical to ensure the equitable, fair, and democratized utilization of NLP technologies that are required to achieve the goal of making such technologies universally available for all. Our work contributes to this direction by proposing MANER, which boosts performance for 100 languages with only 100 training samples each.
A.1 SNER

Training: SNER passes the input sentence $S = \{w_0, \ldots, w_{n-1}\}$ through the transformer $T$ to obtain one embedding per word and applies the NER classifier to each embedding:

$$\mathcal{M}_{\text{base}}(S) = \mathrm{softmax}(M^\top T(S)),$$

where $\mathcal{M}_{\text{base}}$ is an NER model built on a transformer model $T$ using the classifier weight matrix $M$. This baseline method remains, to the best of our knowledge, the de-facto method for training NER models for most languages (especially low-resource languages), though specialized models have been built for popular languages such as English.
Inference: As in training, the NER class of each word in the sentence is the most probable NER tag predicted from that word's embedding.

A.2 MLM-NER
Our MANER methodology in Section 2 is one way to change the input phrase using the [mask] token. In this baseline, we introduce yet another way to repurpose the [mask] token for NER, inspired by the masked language modeling (MLM) framework used for pre-training transformer models, which we refer to as MLM-NER. In MLM, a word is predicted using the words surrounding it in the sentence. Since the NER category of a word is also a semantic property of the word, we use the philosophy of MLM for NER fine-tuning.
In MLM pre-training, the dataset is prepared by masking random words in a sentence with a [mask] token with a fixed probability $p_{\text{mlm}}$. Then, the masked words are predicted using the context information.
Analogous to MLM pre-training, for NER fine-tuning, we randomly replace words in sentence $S$ with the [mask] token with a fixed probability $p_{\text{ner}}$. However, instead of predicting the missing words, as in MLM, we predict the NER labels $L$ for each word $w$ in $S$, irrespective of whether the word was replaced by a [mask] token or not. If a word was replaced with the [mask] token, the transformer outputs the [mask] token embedding for that word.

Thus, the modified input to the transformer is

$$S' = \{w'_0, w'_1, \ldots, w'_{n-1}\}, \qquad w'_i = \begin{cases} m & \text{if } p_i < p_{\text{ner}}, \\ w_i & \text{otherwise,} \end{cases}$$

with $p_i$ a random number between 0 and 1 generated for $w_i$. Then, we use the first baseline NER model design $\mathcal{M}_{\text{base}}$ for training, but now it is fine-tuned on $S'$ and $L$ (note that we predict the labels of the [mask] tokens as well). Inference with this model remains the same as for the first baseline model.
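A short sketch of this corruption step as we read it (illustrative; the value of $p_{\text{ner}}$ is not specified here, so 0.15 below is an assumption):

```python
import random

# Sketch of MLM-NER's input corruption (illustrative). Each word is
# independently replaced by the [mask] token with probability p_ner; the
# label sequence L is left unchanged, so the model must predict NER tags
# for masked words purely from the surrounding context.
def corrupt_for_mlm_ner(words, p_ner=0.15, mask_token="[mask]"):
    return [mask_token if random.random() < p_ner else word  # p_i < p_ner
            for word in words]

random.seed(0)
print(corrupt_for_mlm_ner(["John", "lives", "in", "Paris"]))
```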

A.3 MANER Classifier Input Embedding
The NER prediction for each word in MANER is based on the embedding of the first token of the word. This is a common practice in NER with Transformer-based models where a word may be tokenized into multiple tokens.
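With Hugging Face fast tokenizers (one possible implementation; the paper does not specify its tooling), this first-token convention can be realized via the word_ids mapping:

```python
# Sketch of first-sub-token label alignment (a common recipe, shown with an
# assumed Hugging Face fast-tokenizer implementation).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
words = [tokenizer.mask_token, "Ouagadougou"]  # a [mask] and the word after it
labels = [5, -100]                             # hypothetical tag id; word is ∅

encoding = tokenizer(words, is_split_into_words=True)
aligned, previous = [], None
for word_id in encoding.word_ids():
    if word_id is None:            # special tokens such as <s> and </s>
        aligned.append(-100)
    elif word_id != previous:      # first sub-token of a word keeps the label
        aligned.append(labels[word_id])
    else:                          # later sub-tokens are ignored by the loss
        aligned.append(-100)
    previous = word_id
print(aligned)
```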

A.4 Training procedures
For each language in our experiments, we train MANER and the baselines for 30 epochs with a learning rate of $5 \times 10^{-6}$, optimizing the loss with Adam (Loshchilov and Hutter, 2019). Training takes 3 minutes per language on a single 11 GB GeForce GTX 1080 Ti GPU. We run MANER with the following five random seeds for each language: 12345, 23451, 34512, 45123, and 51234. Averaged over the 100 languages and the 5 runs, the F1 score is 0.649 ± 0.005 for SNER and 0.715 ± 0.007 for MANER.
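For concreteness, this recipe maps onto a standard fine-tuning loop roughly as follows (only the numbers come from the paper; the surrounding plumbing is an assumed sketch, and we use AdamW since the cited reference, Loshchilov and Hutter (2019), describes that variant):

```python
import torch
from transformers import AutoModelForTokenClassification, set_seed

# Assumed sketch of the reported training recipe; the hyperparameter values
# are from the paper, everything else is illustrative.
SEEDS = [12345, 23451, 34512, 45123, 51234]

for seed in SEEDS:
    set_seed(seed)
    model = AutoModelForTokenClassification.from_pretrained(
        "xlm-roberta-large", num_labels=7
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
    for epoch in range(30):
        pass  # one pass over the language's ~100 MANER-formatted sentences
```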

B WikiANN dataset details
The NER labels in WikiANN are in the IOB2 (inside-outside-beginning) format (Ramshaw and Marcus, 1995). In addition, the language names corresponding to the abbreviations used in Figure 2 can be found in the appendix of Conneau et al. (2020).

C Comments on catastrophic forgetting in MANER
The catastrophic forgetting phenomenon (Kirkpatrick et al., 2017) that masked language models undergo during any kind of fine-tuning is one of the reasons we think MANER does not provide gains when more training data is available (of course, more training data also implies less reliance on specialized techniques like ours). Catastrophic forgetting causes the loss of the useful context semantics encoded in the [mask] token, on which MANER relies heavily, during the fine-tuning stage. Adding an auxiliary masked language modeling loss to the NER loss during fine-tuning may help circumvent catastrophic forgetting; we leave this investigation as a valuable avenue for future work.
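Such a joint objective could weight the two terms as sketched below (purely illustrative, since the paper leaves this to future work; mlm_weight is a hypothetical hyperparameter):

```python
import torch

# Illustrative sketch of the suggested mitigation (future work in the paper):
# add an auxiliary MLM loss to the NER loss so that fine-tuning does not
# erase the [mask] token's pre-trained context semantics.
def joint_loss(ner_loss: torch.Tensor, mlm_loss: torch.Tensor,
               mlm_weight: float = 0.1) -> torch.Tensor:
    # mlm_weight is a hypothetical knob trading off the two objectives
    return ner_loss + mlm_weight * mlm_loss

print(joint_loss(torch.tensor(0.8), torch.tensor(2.3)))  # tensor(1.0300)
```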

D Additional experiment results
Figure 4 shows the performance of MANER compared to SNER on the remaining 50 low-resource languages in the WikiANN dataset. The results align with those in the main text: MANER provides performance improvements, sometimes significant, over SNER.

[Figure 4: F1 scores of SNER vs. MANER on the remaining 50 low-resource languages (arc, arz, as, ay, bar, bat-smg, bh, ceb, csb, cv, diq, eml, ext, fiu-vro, gd, gn, hak, ig, ilo, io, jbo, kn, ku, ky, li, lij, map-bms, mg, mhr, mi, my, oc, or, os, pa, pdc, rm, sah, sd, si, so, su, tg, vec, vo, wa, xmf, yi, yo, zh-).]

Figure 5 shows the performance of MANER compared to SNER on the subset of languages on which the backbone of both models, XLM-RoBERTa-large, was pre-trained. The results corroborate those in the main text: MANER improves upon SNER for each of these languages, with F1 score improvements of up to 22% and 18% on average.