Language Clustering for Multilingual Named Entity Recognition

Recent work in multilingual natural language processing has shown progress in various tasks such as natural language inference and joint multilingual translation. Despite success in learning across many languages, challenges arise where multilingual training regimes often boost performance on some languages at the expense of others. For multilingual named entity recognition (NER) we propose a simple technique that groups similar languages together by using embeddings from a pre-trained masked language model, and automatically discovering language clusters in this embedding space. Specifically, we fine-tune an XLM-Roberta model on a language identification task, and use embeddings from this model for clustering. We conduct experiments on 15 diverse languages in the WikiAnn dataset and show our technique largely outperforms three baselines: (1) training a multilingual model jointly on all available languages, (2) training one monolingual model per language, and (3) grouping languages by linguistic family. We also conduct analyses showing meaningful multilingual transfer for low-resource languages (Swahili and Yoruba), despite their being automatically grouped with other seemingly disparate languages.


Introduction
Large transformer language models (Vaswani et al., 2017; Devlin et al., 2019) have shown impressive progress on tasks across different languages, including joint multilingual learning. Many works have focused on cross-lingual transfer from high- to low-resource languages in a zero- or few-shot setting (Hu et al., 2020). However, recent work has also highlighted that small amounts of data may be available for some low-resource languages, and that even very few examples for fine-tuning on a target language can be effective (Lauscher et al., 2020). Given these insights and the scarcity of studies that present a middle ground between monolingual and multilingual learning, we investigate methods for clustering languages to boost multilingual performance on named entity recognition (NER).
One transformer model that has shown particularly strong performance on multilingual tasks is XLM-Roberta (Conneau et al., 2020), a variant of the Roberta model that adapts the multilingual training regime of XLM (Lample and Conneau, 2019) to a CommonCrawl corpus containing 100 languages. This model can be adapted to tasks in multiple languages, and we take it as the base model for NER fine-tuning. Additionally, inspired by work in multilingual neural machine translation (NMT) (Tan et al., 2019), we investigate a method for grouping similar languages using an automated clustering method. We provide a focused evaluation of this method on 15 languages from the WikiAnn corpus (Pan et al., 2017), following the train-test splits from Rahimi et al. (2019), and show that NER models trained on language clusters largely outperform (a) individual monolingual models trained for each language, (b) multilingual models trained on languages grouped by linguistic family, and (c) a single multilingual model trained on all available languages. Closest to our setting, Mueller et al. (2020) fine-tune multilingual NER models monolingually on individual target languages, showing this technique to be effective in boosting F1 scores for all languages considered in their study. In a similar vein, Lauscher et al. (2020) test the effectiveness of few-shot adaptation of multilingual models to new languages, finding that including as few as 10 samples from the target language increases performance over zero-shot transfer.

Related Work
Similar to our work, Chung et al. (2020) explore grouping languages by similarity, but focus on optimally constructing multilingual sub-word vocabularies, and show that these inputs perform better on tasks such as XNLI and WikiAnn NER. In a more focused work, Arkhipov et al. (2019) investigate NER performance on four related Slavic languages, and demonstrate the advantages of pretraining multilingual BERT on the unsupervised language modeling task. Finally, while not focusing on NER, Tan et al. (2019) show performance gains in multilingual NMT using clustering based on language tag embeddings. We take most direct inspiration from this work, though our embedding technique differs.

Clustering Languages for Multilingual NER
While many of the works above provide insight into multilingual NER performance in both broad and narrow contexts, most focus on zero- or few-shot transfer, or on linguistically similar language groups. Our work seeks to fill a gap by studying multilingual NER performance for several diverse languages where data is available (though not evenly distributed), in order to understand how best to group languages for multilingual NER training. Here we present our proposed automatic clustering approach to address this problem. To obtain input representations for a clustering algorithm, we use a pre-trained XLM-R model. For each sentence in our corpus, we obtain a single vector as the output from XLM-R. We then input these vectors to a clustering algorithm to obtain a cluster label for each sentence. To obtain the final cluster label for an entire language, we simply take a majority vote over the cluster labels of all sentences in that language.
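The per-language assignment step can be sketched as a simple majority vote over sentence-level cluster labels. The function below is an illustrative sketch; the name `language_cluster_labels` and the toy inputs are our own, not from the paper:

```python
from collections import Counter

def language_cluster_labels(sentence_langs, sentence_clusters):
    """Assign each language the majority cluster among its sentences.

    sentence_langs:    list of language codes, one per sentence
    sentence_clusters: list of cluster ids from the clustering algorithm,
                       aligned with sentence_langs
    """
    votes = {}
    for lang, cluster in zip(sentence_langs, sentence_clusters):
        votes.setdefault(lang, Counter())[cluster] += 1
    # majority vote: the most common cluster id per language
    return {lang: counts.most_common(1)[0][0] for lang, counts in votes.items()}

# toy example: two languages, three sentences each
langs = ["sw", "sw", "sw", "de", "de", "de"]
clusters = [2, 2, 3, 2, 2, 2]
print(language_cluster_labels(langs, clusters))  # {'sw': 2, 'de': 2}
```

Note that the vote discards per-sentence disagreement (the lone Swahili sentence in cluster 3 above), which is the intended behavior: each language receives exactly one cluster.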
While the base XLM-R model provides a good starting point for downstream tasks, we found that when clustering in this model's embedding space, most languages were assigned to the same cluster regardless of the number of desired clusters. Thus, we fine-tune XLM-R on a language identification task, where the model is trained to classify sentences into one of the 15 languages in the dataset. We then use the [CLS] token embedding that is fed to the classification layer during fine-tuning as the input for clustering. This language identification model is fine-tuned for 3 epochs on the WikiAnn training set with a batch size of 20, and achieves an overall accuracy of 90% across all languages. Figure 1 shows qualitative evidence of meaningful grouping, such as overlap between Chinese and Japanese embeddings, which is reflected in the assigned clusters in Section 4 below.
To automatically group languages, we follow Tan et al. (2019) in choosing bottom-up agglomerative clustering, which begins with each data point in its own cluster and iteratively merges the pair of clusters whose union minimizes the sum of squared distances between points within clusters. Like k-means, agglomerative clustering takes a hyperparameter k for the number of clusters; after experimenting with k ∈ {3, 4, 5, 6} and observing sub-optimal groupings for many values of k, we set this parameter to 4.
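As a sketch of this step, the following uses scikit-learn's `AgglomerativeClustering` with Ward linkage (which minimizes the within-cluster sum of squares, matching the criterion described above) on synthetic stand-ins for the sentence embeddings. The data, dimensionality, and random seed here are illustrative assumptions, not the paper's actual [CLS] vectors:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# stand-in for [CLS] embeddings: four well-separated synthetic groups in 8-d space
centers = rng.normal(size=(4, 8)) * 5.0
embeddings = np.vstack([c + rng.normal(scale=0.1, size=(25, 8)) for c in centers])

# bottom-up (agglomerative) clustering; Ward linkage merges the pair of
# clusters whose union minimizes the within-cluster sum of squared distances
clustering = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = clustering.fit_predict(embeddings)
print(sorted(np.bincount(labels).tolist()))  # [25, 25, 25, 25]
```

With real sentence embeddings, `labels` would then feed the per-language majority vote described above.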

Experimental Setup
For training NER models with this method, we pool all sentences from languages assigned to the same cluster, and train and evaluate one model per cluster on those languages from the WikiAnn dataset. We compare these models against monolingual models for each language, a single multilingual model trained on all languages, and another set of grouped models using linguistic family as the assigned group. Language groupings for the automated clustering method and the linguistically-informed method are shown in Tables 1 and 2, respectively.
We note several observations from these groupings. First, several languages appear in their own individual clusters when grouped by linguistic family (ja, ko, zh) or by our clustering method (ar). In these cases, results for grouped models are identical to those for monolingual models. Second, we note differences between the automated and linguistic grouping methods, most notably the inclusion of Yoruba and Swahili in an otherwise Indo-European cluster. This may be the result of the few examples for these two languages in this dataset; however, we show in Section 5 that this grouping benefits these languages in our experiments despite being counter-intuitive from a linguistic perspective. Finally, we note the grouping of Chinese and Japanese under the automatic clustering method, consistent with the qualitative evidence of overlap in the semantic space of the fine-tuned language classifier discussed above.

Table 1: Language groupings from automatic clustering.
Cluster 1: ar
Cluster 2: da, de, en, es, fr, hi, it, sw, yo
Cluster 3: he, ko, ru
Cluster 4: ja, zh

We initialize all NER models from the pre-trained XLM-R checkpoint available from the Huggingface Transformers library (Wolf et al., 2020) and train all models for 3 epochs, with a batch size of 20 and a maximum input sequence length of 300 sub-tokens. We evaluate with span-based F1 score as in the CoNLL-2003 evaluation script (Sang and Meulder, 2003), and report this metric for the three classes available in the dataset: location, organization, and person.

Results
Table 3 presents an overview of results from our experiments. For each language grouping we train five models, each newly initialized from the XLM-R weights except for the token classification head, whose weights are randomly initialized. Table 3 reports mean scores over these five training runs, with standard deviations in parentheses. We first note fairly strong performance across all methods and languages, except for Swahili and Yoruba in the monolingual and language-family settings. This is unsurprising given that these languages have significantly less data in the WikiAnn dataset. For most classes and languages, the best performance is observed with the proposed language clustering technique. We note slightly better performance using multilingual training for some languages; however, these differences are typically less than one F1 point compared to the clustering-based models. Most notably, for Arabic we see the best performance across all classes under the fully multilingual grouping, suggesting room for improvement in our clustering method, which assigns Arabic to its own cluster. Overall, these results show evidence that grouping languages together for multilingual NER provides a strong alternative to training a monolingual model for each language or a single multilingual model for all languages.
Additional information about these results is plotted in Figure 2 below. Here we use box plots to show the distribution of the class-averaged F1 score for each language, with each box representing a different language grouping. This visualization highlights interesting differences in the spread of scores, including comparatively large spread for monolingual training of languages such as Italian, French, and Spanish. Conversely, we see relatively little spread in scores for the clustered language grouping within each language. This may be evidence of increased training stability when grouping similar languages together, although further work is needed to better understand these trends.
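For concreteness, the span-based F1 reported throughout counts a predicted entity as correct only when both its type and its exact boundaries match a gold span. A minimal self-contained sketch of this metric (our own implementation for illustration, not the official CoNLL evaluation script):

```python
def extract_spans(tags):
    """Extract (entity_type, start, end) spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the final span
        inside = tag.startswith("I-") and tag[2:] == etype
        if not inside:                             # the current span (if any) ends here
            if etype is not None:
                spans.append((etype, start, i))
            if tag.startswith(("B-", "I-")):       # a new span begins
                start, etype = i, tag[2:]
            else:
                start, etype = None, None
    return spans

def span_f1(gold, pred):
    """Micro-averaged span F1 over sentences; exact type + boundary match."""
    g, p = set(), set()
    for idx, (gt, pt) in enumerate(zip(gold, pred)):
        g |= {(idx,) + s for s in extract_spans(gt)}
        p |= {(idx,) + s for s in extract_spans(pt)}
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(round(span_f1(gold, pred), 3))  # 0.667: one of two gold spans recovered
```

Note the strictness of the metric: predicting only part of a multi-token entity, or the right tokens with the wrong class, earns no credit.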
We also note drastic performance improvements for Swahili and Yoruba when trained in a single multilingual model compared to monolingual training, consistent with previous findings for low-resource languages in multilingual settings (Rahimi et al., 2019; Hu et al., 2020; Mueller et al., 2020; Conneau et al., 2020). However, we observe the best performance for these two languages when they are grouped using our proposed clustering method, which is somewhat surprising given the counter-intuitive grouping with mostly European languages, though this grouping is also observed in previous work (Chung et al., 2020).
This raises the question of whether this improvement is due to effective learning of shared multilingual representations, or primarily due to the availability of more data of any kind. To test this, we evaluate NER models in a zero-shot framework: we train a multilingual model on all languages in Cluster 2 with Swahili and Yoruba removed, and evaluate this model on the two held-out languages. These results are presented in Table 4 below. While this transfer beats the performance of monolingual models for some classes in these languages, F1 scores for all classes are well below both the cluster models and the single multilingual model. This suggests that some of the increased performance on these languages in the clustering setting is due to advantageous multilingual transfer.

We first note poor performance from the model trained solely on WikiAnn data, which is unsurprising given the domain mismatch and idiosyncrasies of each dataset. Performance improves substantially in all cases where CoNLL training data is used, with the best performance in the "Cluster Combined" model, which slightly outperforms using all available training data from both datasets. This suggests that even in a new domain the multilingual representations of closely related languages may be helpful, and that utilizing related languages is more useful than simply combining all available multilingual training data as in the "All" setting. We finally note that, despite not being extensively tuned on this dataset, we achieve results within 3.5 F1 points of previously reported state-of-the-art results on this test set (Yamada et al., 2020).
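The held-out construction above amounts to filtering the training pool by language code. A small sketch with a hypothetical corpus of (language, sentence) pairs; the helper name and toy data are illustrative, not from the paper:

```python
# held-out zero-shot setup: train on Cluster 2 minus the target languages,
# evaluate only on the removed languages
CLUSTER_2 = ["da", "de", "en", "es", "fr", "hi", "it", "sw", "yo"]
HELD_OUT = {"sw", "yo"}

def zero_shot_split(examples):
    """examples: list of (lang_code, sentence) pairs from the corpus."""
    train_langs = set(CLUSTER_2) - HELD_OUT
    train_set = [ex for ex in examples if ex[0] in train_langs]
    eval_set = [ex for ex in examples if ex[0] in HELD_OUT]
    return train_set, eval_set

corpus = [("de", "ein Satz"), ("sw", "sentensi moja"),
          ("fr", "une phrase"), ("yo", "gbolohun kan")]
train_set, eval_set = zero_shot_split(corpus)
print([l for l, _ in train_set], [l for l, _ in eval_set])  # ['de', 'fr'] ['sw', 'yo']
```

The key point is that no Swahili or Yoruba sentence reaches training, so any test-time gain over monolingual baselines must come from cross-lingual transfer.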

Conclusion
We have presented a simple data-driven clustering technique for improving performance on multilingual NER, and showed that this technique largely outperforms a naive combination of all studied languages within a single model, as well as monolingual models and models for languages grouped by linguistic family. We further tested whether the improved performance for low-resource languages in the Niger-Congo family was solely the result of more available data, and showed evidence of multilingual transfer via a focused zero-shot experiment. We believe this straightforward method can be easily applied to other multilingual settings, as has been shown in previous work in NMT.