Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence Labeling

A sufficient amount of annotated data is usually required to fine-tune pre-trained language models for downstream tasks. Unfortunately, obtaining labeled data can be costly, especially for multiple language varieties and dialects. We propose to self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce varieties using only resources from data-rich ones. We demonstrate the utility of our approach in the context of Arabic sequence labeling by using a language model fine-tuned on Modern Standard Arabic (MSA) only to predict named entities (NE) and part-of-speech (POS) tags on several dialectal Arabic (DA) varieties. We show that self-training is indeed powerful, improving zero-shot MSA-to-DA transfer by as much as ~10% F1 (NER) and 2% accuracy (POS tagging). We achieve even better performance in few-shot scenarios with limited amounts of labeled data. We conduct an ablation study and show that the performance boost observed directly results from the training data augmentation made possible by DA examples via self-training. This opens up opportunities for developing DA models exploiting only MSA resources. Our approach can also be extended to other languages and tasks.


Introduction
Neural language models (Xu and Rudnicky, 2000; Bengio et al., 2003) with vectorized word representations (Mikolov et al., 2013) are currently core to a very wide variety of NLP tasks. Specifically, using representations from transformer-based (Vaswani et al., 2017) language models (Devlin et al., 2018), pre-trained on large amounts of unlabeled data and then fine-tuned on labeled task-specific data, has become a popular approach for improving downstream task performance. This pre-training then fine-tuning scheme has been successfully applied to several tasks, including question answering (Yang et al., 2019), social meaning detection (Abdul-Mageed et al., 2020d), text classification, named entity recognition (NER), and part-of-speech (POS) tagging (Tsai et al., 2019). The same setup also works well for cross-lingual learning (Lample and Conneau, 2019). Given that it is very expensive to glean labeled resources for all language varieties and dialects, a question arises: "How can we leverage resource-rich dialects to develop models nuanced to downstream tasks for resource-scarce ones?". In this work, we aim to answer this question by applying self-training to unlabeled target dialect data. We empirically show that self-training is indeed an effective strategy in zero-shot settings (where no gold dialectal data are included in the training set; Section 4.2) and few-shot settings (where a given number of gold dialectal data points is included in the training split; Section 4.4).
Our few-shot experiments reveal that self-training is always a useful strategy that consistently improves over mere fine-tuning, even when all dialect-specific gold data are used for fine-tuning. In order to understand why this is the case (i.e., why combining self-training with fine-tuning yields better results than mere fine-tuning), we perform an extensive error analysis based on our NER data. We discover that self-training helps the model most (59.7% of corrected errors) by reducing false positives. These include DA tokens whose MSA orthographic counterparts (Shaalan, 2014) are either named entities or trigger words that frequently co-occur with named entities in MSA. Interestingly, such out-of-MSA tokens occur in highly dialectal contexts (e.g., interjections and idiomatic expressions employed in interpersonal social media communication) or ones where the social media context in which the language (DA) is employed affords more freedom of speech (Alshehri et al., 2020) and a platform for political satire. We present our error analysis in Section 5.
Context: Language use in social media tends to diverge from 'standard', offline norms (Danet and Herring, 2007; Herring et al., 2015). For example, users employ slang, emojis, abbreviations, letter repetitions, and other types of playful practices. This poses a challenge for processing social media data in general. However, there are other challenges specific to Arabic that motivate our work. More specifically, we choose Arabic to apply our approach since it affords a rich context of linguistic variation: In addition to the standard variety, MSA, Arabic also has several spoken dialects (Bouamor et al., 2019; Abdul-Mageed et al., 2020b,c), which differ significantly from written MSA (Zaidan and Callison-Burch, 2014), thus offering an excellent context for studying our problem. Arabic dialects differ among themselves and from MSA at various linguistic levels: lexical, phonological, morphological, and syntactic. This makes our case much more challenging than that of standard vs. social media English, for example. For good zero-shot performance in our case, a model is required to accommodate not only the lexical distance between MSA and DA, but also differences in word formation and syntax (related to POS tags, for example) and lexical ambiguity (as the meaning of the same token can vary cross-dialectally). This makes the zero-shot setting even harder, where performance drops by 20% F1 points (see Section 4.2).
From a geopolitical perspective, Arabic also has strategic significance. This is a function of Arabic being the native tongue of 400 million speakers in 22 countries, spanning two continents, Africa and Asia (https://www.internetworldstats.com/stats19.htm). In addition, the three dialects of our choice, namely Egyptian (EGY), Gulf (GLF), and Levantine (LEV), are popular dialects that are widely used online. This makes our resulting models highly useful in practical situations at scale. Pragmatically, the ability to develop NLP systems for dialectal tasks with no-to-small labeled dialect data immediately eases a serious bottleneck. Arabic dialects differ among themselves and from MSA at all linguistic levels, posing challenges to traditional NLP approaches. We also note that our method is language-independent, and we hypothesize it can be directly applied to other varieties of Arabic or in other linguistic contexts for other languages and varieties.
Tasks: We apply our methods on two sequence labeling tasks, where we have access to both MSA and DA gold data. In particular, as mentioned above, we perform experiments on POS tagging and NER. Each of these tasks has become an integral part of various other NLP applications, including question answering, aspect-based sentiment analysis, machine translation, and summarization, and hence our developed models should have wide practical use. Again, we note that our approach itself is task-independent. The same approach can thus be applied to other tasks involving DA. We leave testing our approach on other languages, varieties, and tasks for future research.
Contributions: Our work offers the following contributions: 1. We study the problem of MSA-to-DA transfer in the context of sequence labeling and show that when training on MSA data only, a wide performance gap exists between testing on MSA and DA. That is, models fine-tuned on MSA generalize poorly to DA in zero-shot settings.
2. We propose self-training to improve zero- and few-shot MSA-to-DA transfer. Our approach requires little-to-no labeled DA data. We evaluate extensively on 3 different dialects, and show that our method indeed narrows the performance gap between MSA and DA by a margin as wide as 10% F1 points.
3. We develop state-of-the-art models for the two sequence labeling tasks (NER and POS).
We now introduce our method.

Method
While the majority of labeled Arabic datasets are in MSA, most daily communication in the Arab world is carried out in DA. In this work, we show that models trained on MSA for NER and POS tagging generalize poorly to dialect inputs when used in zero-shot settings (i.e., no dialect data used during training). Across the two tasks, we test how self-training fares as an approach to leverage unlabeled DA data to improve performance on DA.

Figure 1: MSA-to-DA self-training transfer.

Self-training involves training a model using its own predictions on a set of unlabeled data distinct from its original training split. Our proposed self-training procedure is given two sets of examples: a labeled set L and an unlabeled set U. To perform zero-shot MSA-to-DA transfer, MSA examples are used as the labeled set, while unlabeled DA examples form the unlabeled set. As shown in Figure 1, each iteration of the self-training algorithm consists of three main steps. First, a pre-trained language model is fine-tuned on the labeled MSA examples L. Second, for every unlabeled DA example u_i, we use the model to tag each of its tokens, obtaining a set of predictions and confidence scores (y_i^j, c_i^j), where y_i^j and c_i^j are the label and confidence score (softmax probability) for the j-th token of u_i. Third, we employ a selection mechanism to identify examples from U that are to be added to L for the next iteration.
For the selection mechanism, we experiment with both a thresholding approach and a fixed-size (Dong and de Melo, 2019) approach. In the thresholding method, a threshold τ is applied to the minimum confidence per example. That is, we only add an example u_i to L if min_j c_i^j ≥ τ (see Algorithm 1). The fixed-size approach involves, at each iteration, selecting the top S examples with respect to the minimum confidence score min_j c_i^j, where S is a hyper-parameter. We experiment with both approaches and report results in Section 4.

Algorithm 1: repeat: fine-tune model M on L; obtain a prediction p_{u_i} on each unlabeled DA example u_i using M; if u_i is selected, remove it from U and add it to L; until the stopping criterion is satisfied.

For our language model, we use XLM-RoBERTa, XLM-R for short. XLM-R is a cross-lingual model, and we choose it since it is reported to perform better than the multilingual mBERT (Devlin et al., 2018). XLM-R also uses Common Crawl for training, which is more likely to contain dialectal data than the Arabic Wikipedia (used in mBERT), making it better suited to our work. We now introduce our experiments.
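The three-step loop (fine-tune, pseudo-label, select) and both selection mechanisms can be sketched as follows. This is a minimal illustration, not the actual XLM-R fine-tuning code: `train_fn`, `threshold_select`, `fixed_size_select`, and the toy tagger are all hypothetical stand-ins.

```python
def min_conf(token_preds):
    """Minimum per-token confidence of one pseudo-labeled example."""
    return min(conf for _, conf in token_preds)

def threshold_select(scored, tau=0.90):
    """Thresholding: keep examples whose minimum confidence >= tau."""
    return [item for item in scored if min_conf(item[1]) >= tau]

def fixed_size_select(scored, s=100):
    """Fixed-size: keep the top-S examples by minimum confidence."""
    return sorted(scored, key=lambda item: min_conf(item[1]), reverse=True)[:s]

def self_train(train_fn, labeled, unlabeled, select, max_iters=5):
    """Iterate: (1) fine-tune on L, (2) pseudo-label U, (3) move the
    selected examples from U into L with their predicted tags."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_iters):
        model = train_fn(labeled)                    # step 1: fine-tune on L
        scored = [(u, model(u)) for u in unlabeled]  # step 2: pseudo-label U
        keep = select(scored)                        # step 3: select examples
        if not keep:
            break                                    # stopping criterion
        for u, token_preds in keep:
            labeled.append((u, [tag for tag, _ in token_preds]))
            unlabeled.remove(u)
    return labeled

# Toy stand-in for fine-tuning: "remembers" tags seen in L with full
# confidence and backs off to "O" with low confidence otherwise.
def toy_train(labeled):
    seen = {tok: tag for sent, tags in labeled for tok, tag in zip(sent, tags)}
    return lambda sent: [(seen.get(t, "O"), 1.0 if t in seen else 0.5)
                         for t in sent]
```

With the toy tagger, the confidently tagged unlabeled example is absorbed into the labeled set on the first iteration, while the uncertain one stays in the pool until the loop stops.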

Experiments
We begin our experiments by evaluating the standard fine-tuning performance of XLM-R models on both NER and POS tagging against strong baselines. We then use our best models from this first round to investigate MSA-to-DA zero-shot transfer, showing a significant performance drop even when using pre-trained XLM-R. Consequently, we employ self-training for both NER and POS tagging in zero- and few-shot settings, showing substantial performance improvements in both cases. We now introduce our datasets.

Datasets
NER: For our work on NER, we use 4 datasets, including ANERCorp (Benajiba et al., 2007).

POS Tagging: There are a number of Arabic POS tagging datasets, mostly on MSA (Maamouri et al., 2004) but also on dialects such as EGY (Maamouri et al., 2014). To show that the proposed approach works across multiple dialects, we ideally needed data from more than one dialect. Hence, we use the multi-dialectal dataset from (Darwish et al., 2018), comprising 350 tweets from each of the 4 varieties MSA, EGY, GLF, and LEV. This dataset has 21 POS tags, some of which are suited to social media (since it is derived from Twitter). We show the POS tag set from (Darwish et al., 2018) in Table 10 in Appendix A. We now introduce our baselines.

Baselines
For the NER task, we use the following baselines: • NERA (Abdallah et al., 2012): A hybrid system of rule-based features and a decision tree classifier.
• WC-CNN (Khalifa and Shaalan, 2019): A character-and a word-level CNN with a CRF layer.
• mBERT (Devlin et al., 2018): A fine-tuned multilingual BERT-Base-Cased model (110M parameters), pre-trained with a masked language modeling objective on the Wikipedia corpora of 104 languages (including Arabic). For fine-tuning, we find (based on experiments on our development set) that a learning rate of 6 × 10^-5 works best with a dropout of 0.1.
In addition, we compare to the published results of (Shaalan and Oudah, 2014), AraBERT (Antoun et al., 2020), and CAMeL (Obeid et al., 2020) on the ANERCorp dataset. We also compare to the published results of (Khalifa and Shaalan, 2019) on the 4 datasets. For the POS tagging task, we compare to our own implementation of WC-BiLSTM (since, as far as we know, there is no published research that uses this method on the task) and run mBERT on our data. We also compare to the CRF results published by (Darwish et al., 2018). In addition, for the Gulf dialect, we compare to the published results of the BiLSTM with compositional character-to-word and word representations (CC2W+W) in (Alharbi et al., 2018).

Experimental Setup
Our main models are the XLM-RoBERTa base architecture XLM-R_B (L = 12, H = 768, A = 12, 270M params) and the XLM-RoBERTa large architecture XLM-R_L (L = 24, H = 1024, A = 16, 550M params), where L is the number of layers, H is the hidden size, and A is the number of self-attention heads. For the XLM-R experiments, we use the Adam optimizer with a learning rate of 1e-5 and a batch size of 16. We typically fine-tune for 20 epochs, keeping the best model on the development set for testing. We report results on the test split of each dataset, across the two tasks. For all BiLSTM experiments, we use the same hyper-parameters as (Khalifa and Shaalan, 2019).
For the standard fine-tuning experiments, we use the same train/development/test split as (Khalifa and Shaalan, 2019) for NER, and the same split provided by (Darwish et al., 2018) for POS tagging. For all the self-training experiments, we use the dialect subset of the Arabic online news commentary (AOC) dataset (Zaidan and Callison-Burch, 2011), comprising the EGY, GLF, and LEV varieties, limited to equal sizes of 9K examples per dialect (27K total). We use the existing split of AOC, removing the dialect labels and using only the comments themselves for our self-training. Each iteration involves fine-tuning the model for K = 5 epochs. As a stopping criterion, we use early stopping with a patience of 10 epochs. Other hyper-parameters are set as listed before.

Fine-tuning XLM-R
Here, we show the results of standard fine-tuning of XLM-R for the two tasks in question. We start by showing the result of fine-tuning XLM-R on the named entity task, on each of the 4 Arabic NER (ANER) datasets listed in Section 3.1. Table 1 shows the test set macro-F1 score on each of the 4 ANER datasets. Clearly, the fine-tuned XLM-R models outperform the other baselines on all datasets, except on NW-2003, where WC-CNN (Khalifa and Shaalan, 2019) performs slightly better than XLM-R_L.
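For reference, macro-F1 is the unweighted mean of per-class F1 scores. The sketch below is a simplified token-level illustration (NER F1 is usually computed over entity spans, e.g., with a library such as seqeval, so treat this as a toy version):

```python
def macro_f1(gold, pred, classes):
    """Unweighted mean of per-class F1 over the given entity classes."""
    f1_scores = []
    for c in classes:
        # Per-class counts of true positives, false positives, false negatives.
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```

Because every class contributes equally, a model that misses a rare class entirely is penalized heavily, which is why macro-F1 is a common choice for imbalanced NER label sets.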
For POS tagging, Table 2 shows test set word accuracy of the XLM-R models compared to the baselines. Again, XLM-R models (both base and large) outperform all other models. A question arises as to why XLM-R models outperform both mBERT and AraBERT. As noted before, for XLM-R vs. mBERT, XLM-R was trained on much larger data: CommonCrawl for XLM-R vs. Wikipedia for mBERT. Hence, the larger dataset gives XLM-R an advantage over mBERT. As for AraBERT, although the pre-training data for XLM-R and AraBERT may be comparable, even the smaller XLM-R model (XLM-R_B) has more than twice the number of parameters of the BERT-Base architecture on which AraBERT and mBERT are built (270M vs. 110M). Hence, XLM-R's model capacity gives it another advantage. We now report our experiments with zero-shot transfer from MSA to DA.

MSA-DA Zero-Shot Transfer
We start with the NER experiments. Since there is no publicly available purely dialectal NER dataset on which we can study MSA-to-DA transfer, we needed to find DA data to evaluate on. We observed that the dataset from (Darwish, 2013) contains both MSA and DA examples (tweets). Hence, we train a binary classifier to distinguish DA data from MSA. We then extract examples that are labeled as either DA or MSA with probability p > 0.90. We obtain 2,027 MSA examples (henceforth, Twitter-MSA) and 1,695 DA examples (henceforth, Twitter-DA), respectively. We split these into development and test sets with 30% and 70% ratios. As for POS tagging, we already have the three previously used DA datasets, namely EGY, GLF, and LEV. We use those for the zero-shot setting by omitting their training sets and using only the development and test sets.
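The construction of the two evaluation sets can be sketched as follows; `split_by_dialect` and the classifier interface are illustrative names, while the p > 0.90 filter and the 30%/70% development/test split follow the description above:

```python
def split_by_dialect(tweets, classifier, p=0.90, dev_ratio=0.30):
    """Keep only tweets the MSA/DA classifier labels with probability > p,
    then split each bucket into development (30%) and test (70%) sets."""
    msa, da = [], []
    for tweet in tweets:
        label, prob = classifier(tweet)  # classifier returns (label, probability)
        if prob > p:
            (msa if label == "MSA" else da).append(tweet)

    def dev_test(bucket):
        cut = int(len(bucket) * dev_ratio)
        return bucket[:cut], bucket[cut:]

    return dev_test(msa), dev_test(da)
```

Filtering on classifier confidence trades coverage for label purity: low-confidence tweets are simply discarded rather than risked as mislabeled evaluation examples.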
We first study how well models trained for NER and POS tagging on MSA data only generalize to DA inputs at test time. We evaluate this zero-shot performance with both the XLM-R_B and XLM-R_L models. For NER, we train on ANERCorp (which is pure MSA) and evaluate on both Twitter-MSA and Twitter-DA. (The dialect classifier used to build these evaluation sets is XLM-R_B fine-tuned on the AOC split; it achieves development and test accuracies of 90.3% and 89.4%, respectively, outperforming the best previously reported results.) For POS tagging, we train on the MSA subset of (Darwish et al., 2018) and evaluate on the corresponding test set for each dialect. As shown in Table 3, for NER, a significant generalization gap of around 20% F1 points exists between evaluation on MSA and DA with both models. For POS tagging, the gap is as large as 18.13% accuracy for the LEV dialect with XLM-R_B. The smallest generalization gap is on the GLF variety, which is perhaps due to the high overlap between GLF and MSA (Alharbi et al., 2018). In the next section, we evaluate the ability of self-training to close this MSA-DA performance gap.

Zero-shot Self-Training
Here, for NER, similar to Section 4.2, we train on ANERCorp (pure MSA) and evaluate on Twitter-MSA and Twitter-DA. Table 4 shows self-training NER results employing the selection mechanisms listed in Section 2, with different values for S and τ. The best improvement is achieved with the thresholding selection mechanism at τ = 0.90, where we obtain an F1 gain of 10.03 points. More generally, self-training improves zero-shot performance in all cases, albeit with different F1 gains. It is noteworthy, however, that the much higher-capacity large model deteriorates on MSA if self-trained (dropping from 68.32% to 67.21%). This shows the ability of the large model to learn representations very specific to DA when self-trained. It is also interesting that the best self-trained base model achieves 50.10% F1, outperforming the large model before the latter is self-trained (47.35% in the zero-shot setting). As such, we conclude that a self-trained base model, with less computational capacity, can (and in our case does) improve over a large, not-self-trained model that needs significant computation. The fact that, when self-trained, the large model improves 15.35% points over the base model in the zero-shot setting (55.42 vs. 40.07) is remarkable.
As for POS tagging, we similarly observe consistent improvements in zero-shot transfer with self-training (Table 5). The best model achieves accuracy gains of 2.41% (EGY), 1.41% (GLF), and 1.74% (LEV). Again, this demonstrates the utility of self-training pre-trained language models on the POS tagging task even in the absence of labeled dialectal POS data (zero-shot).

Few-Shot Self-Training
We also investigate whether self-training is helpful in scenarios where we have access to some gold-labeled DA data (as is the case with POS tagging). Here, we evaluate the few-shot performance of self-training as increasing amounts of predicted DA data are added to the gold training set. This round of experiments focuses exclusively on POS tagging, using a fixed size S = 100 of predicted cases for self-training and the XLM-R base model. Figure 2 shows how POS tagging test accuracy improves as the percentage of gold DA examples added to the MSA training data increases from 0% to 100% on the three dialects (EGY, GLF, and LEV). Comparing these results to those acquired via the standard fine-tuning setting without self-training, we find that self-training does consistently improve over fine-tuning. This improvement margin is largest with only 20% of the gold examples.
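Assembling a few-shot training set at a given gold-data percentage can be sketched as follows. Function and variable names are illustrative; pseudo-labeled examples are assumed to be scored by their minimum token confidence, consistent with the fixed-size (S = 100) selection described above:

```python
def few_shot_train_set(msa_gold, da_gold, gold_fraction, pseudo, s=100):
    """All MSA gold data + a fraction of DA gold data + the top-S
    pseudo-labeled DA examples (fixed-size selection)."""
    n_gold = int(len(da_gold) * gold_fraction)
    # Rank pseudo-labeled (example, min_confidence) pairs, most confident first.
    ranked = sorted(pseudo, key=lambda ex: ex[1], reverse=True)
    return (list(msa_gold) + list(da_gold[:n_gold])
            + [ex for ex, _ in ranked[:s]])
```

Sweeping `gold_fraction` from 0.0 to 1.0 reproduces the x-axis of a plot like Figure 2, with the pseudo-labeled examples held fixed per self-training iteration.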

Ablation Study
Here, we conduct an ablation study with the NER task as our playground in order to verify our hypothesis that the performance boost primarily comes from using unlabeled DA data for self-training. By using an MSA dataset of the same size as our unlabeled DA one (a set of MSA tweets from the AOC dataset mentioned before), we can compare the performance of the self-trained model in both settings: MSA and DA unlabeled data. We run 3 different self-training experiments using 3 different values of τ for each type of unlabeled data. Results are shown in Table 6. While we find a slight performance boost due to self-training even with MSA unlabeled data, the average F1 score with unlabeled DA is better by 2.67 points, showing that using unlabeled DA data for self-training helps the model adapt to DA data at test time.

Error analysis
To understand why self-training the pre-trained language model, when combined with fine-tuning, improves over mere fine-tuning, we perform an error analysis. For the error analysis, we focus on the NER task, where we observe a large self-training gain. We use the development set of Twitter-DA (see Section 4.3) for the error analysis. We compare the predictions of the standard fine-tuned XLM-R_B model (FT) and the best-performing self-training (τ = 0.9) model (ST) on the data, and provide the confusion matrices of both models with gold labels in Table 11 (in Appendix B). The error analysis leads to an interesting discovery: The greatest benefit of the ST model comes mostly from reducing false positives (see Table 7). In other words, self-training helps regularize the model predictions such that tokens misclassified by the original FT model as named entities are now correctly tagged with the non-entity label "O".
Table 3: Zero-shot transfer results on DA for NER (macro F1) and POS tagging (accuracy). Models are trained on MSA only and evaluated on DA. Datasets used are Twitter-MSA and Twitter-DA (Darwish, 2013) for NER, and the multi-dialectal dataset (Darwish et al., 2018) for POS tagging.

Table 4: Zero-shot self-training (ST) NER results. Models are trained on ANERCorp (pure MSA) and evaluated on Twitter-MSA and Twitter-DA. Self-training boosts performance on DA data by 10% macro-F1 points with XLM-R_B and τ = 0.90.

To understand why the ST model improves the false positive rate, we manually inspect the cases it correctly identifies that were misclassified by the FT model. We show examples of these cases in Table 8. As the table shows, the ST model is able to identify dialectal tokens whose equivalent MSA forms can act as trigger words (usually followed by a PER named entity). We refer to this category as false trigger words. An example is the word "prophet" (row 1 in Table 8). A similar example that falls within this category is in row (2), where the model is confused by a token meaning "who" in EGY but "to" in MSA, hence the wrong LOC prediction. A second category of errors is caused by non-standard social media language, such as the use of letter repetition in interjections (e.g., row (3) in Table 8). In these cases, the FT model also assigns the class PER, but the ST model correctly identifies the tag as "O". A third class of errors arises as a result of out-of-MSA vocabulary. For example, the words in rows (4-6) are all out-of-MSA; the FT model, not knowing these, assigns the most frequent named entity label in training (PER). A fourth category of errors occurs when a token that is usually part of a named entity in MSA instead functions as part of an idiomatic expression in DA. Row (7) in Table 8 illustrates this case. Table 12 in Appendix B provides more examples.
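The FT-vs-ST false-positive comparison can be expressed as a small token-level sketch, with "O" as the non-entity tag; the function names are illustrative, not from the paper's codebase:

```python
def false_positives(gold, pred):
    """Count tokens whose gold tag is "O" but which the model tagged as
    an entity -- the error type self-training reduces most."""
    return sum(g == "O" and p != "O" for g, p in zip(gold, pred))

def fp_reduction(gold, ft_pred, st_pred):
    """Positive value => the self-trained (ST) model fixed some false
    positives relative to the fine-tuned (FT) model."""
    return false_positives(gold, ft_pred) - false_positives(gold, st_pred)
```

Aggregating `fp_reduction` over a development set, and breaking the fixed tokens down by category (false trigger words, interjections, out-of-MSA vocabulary, idiomatic expressions), yields the kind of analysis summarized in Tables 7 and 8.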
We also investigate errors shared by both the FT and ST models (errors which the ST model also could not fix). Some of these errors result from the fact that both MSA and DA often use the same word for person and location names. Row (1) in Table 13 (in Appendix B) is an example where the word "Mubarak", the name of the former Egyptian president, is used as a LOC. Other errors involve out-of-MSA tokens mistaken for named entities. An example is in row (3) in Table 13, where a token meaning "proof" or "basis" in EGY is confused with a token meaning "emirate" (a location). False trigger words, mentioned before, also play a role here. An example is in row (7), where a token is confused for PER due to a trigger word meaning "Hey!", which is usually followed by a person name. Spelling mistakes are a third source of errors, as in row (4). We also note that even with self-training, detecting ORG entities is more challenging than PER or LOC. The problem becomes harder when such organizations are not seen in training, as in rows (8), (9), and (10), none of which occur in the training set (ANERCorp).
False negatives. The "regularizing" effect of self-training discussed thus far can sometimes produce false negatives, as shown in Table 9. We see a number of named entities that were misclassified by the self-trained model as non-entities. As an example, we take a last name that was classified both correctly and incorrectly in different contexts by the self-trained model. First, we note that it is not a common name (zero occurrences in the MSA training set). Second, we observe that in the correct case, the word was preceded by a first name that was correctly classified as PER, making it easier for the model to assign PER to the following word as a surname.
Related Work

Pre-trained Language Models. Language models based on Transformers (Vaswani et al., 2017) and pre-trained with the masked language modeling (MLM) objective have seen wide use in various NLP tasks. Examples include BERT (Devlin et al., 2018), RoBERTa, MASS (Song et al., 2019), and ELECTRA (Clark et al., 2020). While they have been applied to several tasks, including text classification, question answering, named entity recognition, and POS tagging (Tsai et al., 2019), a sufficiently large amount of labeled data is required for good performance. Concurrent with our work, Abdul-Mageed et al. (2020a) released MARBERT, a language model trained on a large amount of dialectal Arabic data. However, the extent to which dialect-specific models such as MARBERT can alleviate the lack of labeled data remains untested.
Cross-lingual Learning. Cross-lingual learning is of particular importance due to the scarcity of labeled resources in many of the world's languages. The goal is to leverage existing labeled resources in high-resource languages (such as English) to optimize learning for low-resource ones. In our case, we leverage MSA resources for building DA models. Closest to our work, Kim et al. (2017) performed cross-lingual POS tagging using English resources only, with two BiLSTM networks learning common and language-specific features. Xie et al. (2018) made use of bilingual word embeddings with self-attention to learn cross-lingual NER for low-resource languages. Multilingual extensions of LMs have emerged through joint pre-training on multiple languages. Examples include mBERT (Devlin et al., 2018), XLM (Lample and Conneau, 2019), and XLM-RoBERTa. Such multilingual models have become useful for few-shot and zero-shot cross-lingual settings, where there is little or no access to labeled data in the target language. For instance, XLM-R, a cross-lingual version of RoBERTa, has been evaluated on cross-lingual learning across different tasks such as question answering, text classification, and named entity recognition.
Self-Training. Self-training is a semi-supervised technique for improving learning using unlabeled data. Self-training has been successfully applied to NER (Kozareva et al., 2005), POS tagging (Wang et al., 2007), parsing (Sagae, 2010), and text classification (Van Asch and Daelemans, 2016). Self-training has also been applied in cross-lingual settings where gold labels are rare in the target language. Hajmohammadi et al. (2015) proposed a combination of active learning and self-training for cross-lingual sentiment classification. Pan et al. (2017) made use of self-training for named entity tagging and linking across 282 different languages. Lastly, Dong and de Melo (2019) employed self-training to improve zero-shot cross-lingual classification with mBERT (Devlin et al., 2018).

Conclusion
Even though pre-trained language models have improved many NLP tasks, they still need labeled data for fine-tuning. We show how self-training can boost the performance of pre-trained language models in zero- and few-shot settings on various Arabic varieties. We apply our approach to two sequence labeling tasks (NER and POS tagging), establishing new state-of-the-art results on both. Through in-depth error analysis and an ablation study, we uncover why our models work and where they can fail. Our method is language- and task-agnostic, and we believe it can be applied to other tasks and language settings. We intend to test this claim in future research. Our work also has bearing on ongoing research on language models and self-training, and interactions between these two areas can be the basis of future work. All our models and code are publicly available.