ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic

Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations, including high inference cost and the limited size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large (∼3.4× larger). Our models are publicly available at https://github.com/UBC-NLP/marbert and ARLUE will be released through the same repository.

Since LMs are costly to pre-train, it is important to keep in mind the end goals they will serve once developed. For example, (i) in addition to their utility on 'standard' data, it is useful to endow them with the ability to excel in wider real-world settings such as social media. Some existing LMs do not meet this need since they were trained on datasets that do not sufficiently capture the nuances of social media language (e.g., frequent use of abbreviations, emoticons, and hashtags; playful character repetitions; neologisms and informal language). It is also desirable to build models able to (ii) serve diverse communities (e.g., speakers of dialects of a given language), rather than focusing only on mainstream varieties. In addition, once created, models should be (iii) usable in energy-efficient scenarios. This means that, for example, medium-to-large models with competitive performance should be preferred to large-to-mega models.
A related issue is (iv) how LMs are evaluated. Progress in NLP hinges on our ability to carry out meaningful comparisons across tasks, on carefully designed benchmarks. Although several benchmarks have been introduced to evaluate LMs, the majority of these are either exclusively in English (e.g., DecaNLP (McCann et al., 2018), GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019)) or use machine translation in their training splits (e.g., XTREME (Hu et al., 2020)). Useful as these benchmarks are, this limits our ability to measure progress in real-world settings (e.g., training and evaluation on native rather than translated data), both for cross-lingual NLP and in monolingual, non-English environments.
Context. Our objective is to showcase a scenario where we build LMs that meet all four needs listed above. That is, we describe novel LMs that (i) excel across domains, including social media, (ii) can serve diverse communities, and (iii) perform well compared to larger (more energy hungry) models (iv) on a novel, standardized benchmark. We choose Arabic as the context for our work since it is a widely spoken language (∼400M native speakers), with a large number of diverse dialects differing among themselves and from the standard variety, Modern Standard Arabic (MSA). Arabic is also covered by the popular mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), which provides us with a setup for meaningful comparisons. That is, not only are we able to empirically measure monolingual vs. multilingual performance under robust conditions using our new benchmark, ARLUE, but we can also demonstrate how our base-sized models outperform (or at least are on par with) larger models (i.e., XLM-R Large, which is ∼3.4× larger than our models). In the context of our work, we also show how the currently best-performing model dedicated to Arabic, AraBERT (Antoun et al., 2020), suffers from a number of issues. These include (a) not making use of easily accessible data across domains and, more seriously, (b) limited ability to handle Arabic dialects and (c) narrow evaluation. We rectify all these limitations.
Our contributions. With our stated goals in mind, we introduce ARBERT and MARBERT, two Arabic-focused LMs exploiting large-to-massive diverse datasets. For evaluation, we also introduce a novel ARabic natural Language Understanding Evaluation benchmark (ARLUE). ARLUE is composed of 42 different datasets, making it by far the largest and most diverse Arabic NLP benchmark we know of. We arrange ARLUE into six coherent task clusters and methodically evaluate on each individual dataset as well as each task cluster, ultimately reporting a single ARLUE score. Our models establish new state-of-the-art (SOTA) results on the majority of tasks, across all task clusters. Our goal is for ARLUE to serve the critical need for measuring progress on Arabic, and to facilitate evaluation of multilingual and Arabic LMs. To summarize, we offer the following contributions:
1. We develop ARBERT and MARBERT, two novel Arabic-specific Transformer LMs pre-trained on very large and diverse datasets to facilitate transfer learning on MSA as well as Arabic dialects.
2. We introduce ARLUE, a new benchmark developed by collecting and standardizing splits on 42 datasets across six different Arabic language understanding cluster tasks, thereby facilitating measurement of progress on Arabic and multilingual NLP.
3. We fine-tune our new powerful models on ARLUE and provide an extensive set of comparisons to available models. Our models achieve new SOTA across all task clusters on 37 out of 48 individual classification tasks, as well as a SOTA ARLUE score.
The rest of the paper is organized as follows: In Section 2, we provide an overview of Arabic LMs. Section 3 describes our Arabic pre-trained models. We evaluate our models on downstream tasks in Section 4, and present our benchmark ARLUE and evaluation on it in Section 5. Section 6 is an overview of related work. We conclude in Section 7. We now introduce existing Arabic LMs.

Arabic LMs
The term Arabic refers to a collection of languages, language varieties, and dialects. The standard variety of Arabic is MSA, and there exists a large number of dialects that are usually defined at the level of the region or country (Abdul-Mageed et al., 2020a, 2021a). A number of Arabic LMs have been developed. The most notable among these is AraBERT (Antoun et al., 2020), which is trained with the same architecture as BERT (Devlin et al., 2019) and uses the BERT Base configuration. AraBERT is trained on 23GB of Arabic text (∼70M sentences, 3B words) from Arabic Wikipedia, the Open Source International Arabic News corpus (OSIAN) (Zeroual et al., 2019) (3.5M news articles from 24 Arab countries), and the 1.5B-word corpus from El-Khair (2016) (5M articles extracted from 10 news sources). Antoun et al. (2020) evaluate AraBERT on three Arabic downstream tasks: (1) sentiment analysis, on six different datasets: HARD (Elnagar et al., 2018), ASTD (Nabil et al., 2015), ArSenTD-Lev (Baly et al., 2019), LABR (Aly and Atiya, 2013), AJGT, and ArSAS (Elmadany et al., 2018); (2) NER, with ANERcorp (Benajiba and Rosso, 2007); and (3) Arabic QA, on the Arabic-SQuAD and ARCD (Mozannar et al., 2019) datasets. Another Arabic LM is ArabicBERT (Safaya et al., 2020), which is similarly based on the BERT architecture. ArabicBERT was pre-trained on two datasets only, Arabic Wikipedia and Arabic OSCAR (Suárez et al., 2019). Since both of these datasets are already included in AraBERT, and Arabic OSCAR has significant duplicates, we compare to AraBERT only. GigaBERT (Lan et al., 2020), an Arabic and English LM designed with code-switching data in mind, was also introduced.

Our Models

ARBERT

Training Data
We train ARBERT on 61GB of MSA text (6.5B tokens) from the following sources:
• Books (Hindawi). We collect and preprocess 1,800 Arabic books from the public Arabic bookstore Hindawi.
• El-Khair. This is a dataset of 5M news articles from 10 major news sources covering eight Arab countries (El-Khair, 2016).
• Gigaword. We use the Arabic Gigaword 5th Edition from the Linguistic Data Consortium (LDC). The dataset is a comprehensive archive of newswire text from multiple Arabic news sources.
• OSCAR. This is the MSA and Egyptian Arabic portion of the Open Super-large Crawled ALMAnaCH coRpus (Suárez et al., 2019), a huge multilingual corpus obtained from Common Crawl using language identification and filtering.
• OSIAN. The Open Source International Arabic News Corpus (OSIAN) (Zeroual et al., 2019) consists of 3.5 million articles from 31 news sources in 24 Arab countries.
• Wikipedia Arabic. We download and use the December 2019 dump of Arabic Wikipedia. We use WikiExtractor to extract articles and remove markup from the dump.
We provide relevant size and token count statistics about the datasets in Table 1.

Training Procedure
Pre-processing. To prepare the raw data for pre-training, we perform light pre-processing, which helps retain a faithful representation of the naturally occurring text. We only remove diacritics and replace URLs, user mentions, and hashtags that may exist in any of the collections with the generic string tokens URL, USER, and HASHTAG, respectively. We do not perform any further pre-processing of the data before splitting the text into WordPieces (Schuster and Nakajima, 2012). Multilingual models such as mBERT and XLM-R have 5K (out of 110K) and 14K (out of 250K) Arabic WordPieces, respectively, in their vocabularies. AraBERT employs a vocabulary of 60K (out of 64K) WordPieces, while ARBERT uses a vocabulary of 100K WordPieces.
Pre-training. We use the original implementation of BERT in the TensorFlow framework. As mentioned, we use the same network architecture as BERT Base: 12 layers, 768 hidden units, 12 heads, for a total of ∼163M parameters. We use a batch size of 256 sequences and a maximum sequence length of 128 tokens (256 sequences × 128 tokens = 32,768 tokens/batch) for 8M steps, which is approximately 42 epochs over the 6.5B tokens. For all our models, we use a learning rate of 1e−4.
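To make the light pre-processing above concrete, here is a minimal Python sketch; the exact regular expressions (including the Unicode range used for diacritics) are our assumptions for illustration, not the released code.

```python
import re

# Arabic diacritic marks (fathatan through sukun); an assumption about
# which marks "diacritics" covers here.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
URL = re.compile(r"(https?://\S+|www\.\S+)")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\S+")

def light_preprocess(text: str) -> str:
    """Replace URLs, user mentions, and hashtags with generic tokens,
    then strip diacritics, as described above (a sketch)."""
    text = URL.sub("URL", text)
    text = MENTION.sub("USER", text)
    text = HASHTAG.sub("HASHTAG", text)
    return DIACRITICS.sub("", text)
```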

We pre-train the model on one Google Cloud TPU with eight cores (v2.8) from TensorFlow Research Cloud (TFRC). Training took ∼16 days, for 42 epochs over all the tokens. Table 2 shows a comparison of ARBERT with mBERT, XLM-R, AraBERT, and MARBERT (see Section 3.2) in terms of data sources and size, vocabulary size, and model parameters.

MARBERT
As we pointed out in Sections 1 and 2, Arabic has a large number of diverse dialects. Most of these dialects are under-studied due to the scarcity of resources. Multilingual models such as mBERT and XLM-R are trained on mostly MSA data, which is also the case for AraBERT and ARBERT. As such, these models are not best suited for downstream tasks involving dialectal Arabic. To address this issue, we use a large Twitter dataset to pre-train a new model, MARBERT, from scratch, as we describe next.

Training Data
To pre-train MARBERT, we randomly sample 1B Arabic tweets from a large in-house dataset of about 6B tweets. We only include tweets with at least three Arabic words, based on character string matching, regardless of whether the tweet contains non-Arabic strings. That is, we do not remove non-Arabic content so long as the tweet meets the three-Arabic-word criterion. The dataset makes up 128GB of text (15.6B tokens).
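A minimal sketch of this filter follows; the Arabic character range and whitespace tokenization are our assumptions about the character string matching used.

```python
import re

# Any character in the basic Arabic Unicode block (an assumption about
# what counts as an Arabic character here).
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")

def keep_tweet(tweet: str, min_arabic_words: int = 3) -> bool:
    """Keep the tweet if at least three whitespace-delimited tokens
    contain Arabic characters; non-Arabic content is not removed."""
    arabic_words = [t for t in tweet.split() if ARABIC_CHAR.search(t)]
    return len(arabic_words) >= min_arabic_words
```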

Training Procedure
Pre-processing. We employ the same pre-processing as ARBERT.
Pre-training. We use the same network architecture as BERT Base, but without the next sentence prediction (NSP) objective since tweets are short. We use the same vocabulary size (100K WordPieces) as ARBERT, and MARBERT also has ∼160M parameters. We train MARBERT for 17M steps (∼36 epochs) with a batch size of 256 and a maximum sequence length of 128. Training took ∼40 days on one Google Cloud TPU (eight cores). We now present a comparison between our models and popular multilingual models as well as AraBERT.

Model Comparison
Our models compare to mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020) (base and large), and AraBERT (Antoun et al., 2020) in terms of training data size, vocabulary size, and overall model capacity, as summarized in Table 2. In terms of the actual Arabic variety involved, Devlin et al. (2019) train mBERT with Wikipedia Arabic data, which is MSA. XLM-R (Conneau et al., 2020) is trained on Common Crawl data, which likely involves a small amount of dialectal Arabic. AraBERT is trained on MSA data only. ARBERT is trained on a large collection of MSA datasets. Unlike all other models, our MARBERT model is trained on Twitter data, which involves both MSA and diverse dialects. We now describe our fine-tuning setup.

Model Fine-Tuning
We evaluate our models by fine-tuning them on a wide range of tasks, which we thematically organize into six clusters: (1) sentiment analysis (SA), (2) social meaning (SM) (i.e., age and gender, dangerous and hateful speech, emotion, irony, and sarcasm), (3) topic classification (TC), (4) dialect identification (DI), (5) named entity recognition (NER), and (6) question answering (QA). For all classification tasks reported in this paper, we compare our models to four other models: mBERT, XLM-R Base, XLM-R Large, and AraBERT. We note that XLM-R Large is ∼3.4× larger than any of our own models (∼550M parameters vs. ∼160M). We offer two main types of evaluation: (i) on individual tasks, which allows us to compare to other works on each individual dataset (48 classification tasks on 42 datasets), and (ii) on ARLUE clusters (six task clusters).
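To ground the setup, below is a minimal sketch of loading one of our released models for classification fine-tuning with the Hugging Face transformers library; the checkpoint name and the 3-label head are illustrative assumptions (the released checkpoints are linked from github.com/UBC-NLP/marbert), not the exact training code we used.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint name; see github.com/UBC-NLP/marbert for the
# actual released checkpoints.
CKPT = "UBC-NLP/MARBERT"

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(
    CKPT, num_labels=3)  # e.g., a 3-way sentiment head

# Tokenize with a maximum sequence length of 128, matching our setup.
batch = tokenizer(["مثال"], padding=True, truncation=True,
                  max_length=128, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
pred = logits.argmax(dim=-1)
```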
For all reported experiments, we follow the same light pre-processing we use for pre-training. For all individual tasks and ARLUE task clusters, we fine-tune on the respective training splits for 25 epochs, identifying the best epoch on development data, and report results on both development and test data. (NER and QA are exceptions, where we use sequence lengths of 128 and 384, respectively, a batch size of 16 for both, and learning rates of 2e−6 and 3e−5, respectively.) We typically use the exact data splits provided by the original authors of each dataset; whenever no clear splits are available, we create our own.

Sentiment Analysis
Datasets. We fine-tune the language models on all publicly available SA datasets we could find (including one from www.kaggle.com/mksaad/arabic-sentiment-twitter), in addition to those we acquired directly from authors. In total, we have 17 SA classification tasks, reported in Tables 3 and 4.
Baselines. We compare to the SOTA listed in the captions of Tables 3 and 4. For all datasets with no baseline in Table 3, we consider AraBERT our baseline. Details about SA baselines are in Section A.2.
Results. To facilitate comparison to previous works, we report results using the appropriate evaluation metrics: accuracy in Table 3 and F1 in Table 4. We typically bold the best result on each dataset. Our models achieve the best results in 13 out of the 17 classification tasks reported in the two tables combined, while XLM-R (a much larger model) outperforms our models on the 4 remaining tasks. We also note that XLM-R acquires better results than AraBERT in the majority of tasks, a trend that continues for the rest of the tasks. Results also clearly show that MARBERT is more powerful than ARBERT. This is due to MARBERT's larger and more diverse pre-training data, especially as many of the SA datasets involve dialects and come from social media.

Social Meaning Tasks
We collectively refer to a host of tasks as social meaning tasks. These are age and gender detection; dangerous, hateful, and offensive speech detection; emotion detection; irony detection; and sarcasm detection. We now describe the datasets we use for each of these tasks. Datasets. For both age and gender, we use Arap-Tweet (Zaghouani and Charfi, 2018). We use AraDang (Alshehri et al., 2020) for dangerous speech. For offensive language and hate speech, we use the dataset released in the shared task (sub-tasks A and B) on offensive speech by Mubarak et al. (2020). We also use AraNET Emo (Abdul-Mageed et al., 2020b), IDAT@FIRE2019 (Ghanem et al., 2019), and ArSarcasm (Farha and Magdy, 2020) for emotion, irony, and sarcasm, respectively. More information about these datasets and their splits is in Appendix B.1. Baselines. Baselines for social meaning tasks are the SOTA listed in the Table 5 caption. Details about each baseline are in Appendix B.2.
Results. As Table 5 shows, our models acquire the best results on all eight tasks. Of these, MARBERT achieves the best performance on seven tasks, while ARBERT is marginally better than MARBERT on one task (irony@FIRE2019). The sizeable gains MARBERT achieves reflect its superiority on social media tasks. On average, our models are 9.83 F1 points better than all previous SOTA.

Topic Classification
Classifying documents by topic is a classical task that still has practical utility. We use three TC datasets, as follows. Datasets. We fine-tune on Arabic News Text (ANT) (Chouigui et al., 2017) under three settings (title only, text only, and title+text), Khaleej (Abbas et al., 2011), and OSAC (Saad and Ashour, 2010). Details about these datasets and the classes therein are in Appendix C.1. Baselines. Since, to the best of our knowledge, there are no published results exploiting deep learning on TC, we consider AraBERT a strong baseline.
Results. As Table 6 shows, ARBERT acquires the best results on both OSAC and Khaleej, and on the title-only setting of ANT. AraBERT slightly outperforms our models on the text-only and title+text settings of ANT.

Dialect Identification
Arabic dialect identification can be performed at different levels of granularity, including binary (i.e., MSA-DA), regional (e.g., Gulf, Levantine), country level (e.g., Algeria, Morocco), and, more recently, province level (e.g., the Egyptian province of Cairo).
Results. As Table 7 shows, our models outperform all SOTA as well as the baseline AraBERT across all classification levels with sizeable margins. These results reflect the powerful and diverse dialectal representation of MARBERT, enabling it to serve wider communities. Although ARBERT is developed mainly for MSA, it also outperforms all other models.

Named Entity Recognition
We fine-tune the models on five NER datasets. Datasets. We use ACE03NW and ACE03BN (Mitchell et al., 2004), ACE04NW (Mitchell et al., 2004), ANERcorp (Benajiba and Rosso, 2007), and TW-NER (Darwish, 2013). Results. As Table 8 shows, our models outperform the SOTA on two out of the five NER datasets. We note that the SOTA system of Khalifa and Shaalan (2019) employs a complex combination of CNNs and character-level LSTMs, which may explain its better results on two datasets; still, MARBERT achieves the highest performance on the social media dataset (TW-NER).

Question Answering
Datasets. We use ARCD (Mozannar et al., 2019) and the three human-translated Arabic test sections of the XTREME benchmark (Hu et al., 2020): MLQA (Lewis et al., 2020), XQuAD (Artetxe et al., 2020), and TyDi QA (Clark et al., 2020). Details about these datasets are in Table F.1. Baselines. We compare to Antoun et al. (2020) and consider their system a baseline on ARCD. We follow the same splits they used, fine-tuning on Arabic-SQuAD (Mozannar et al., 2019) and 50% of ARCD and testing on the remaining 50% of ARCD (ARCD-test). For all other experiments, we fine-tune on the machine-translated Arabic SQuAD (AR-XTREME) from the XTREME multilingual benchmark (Hu et al., 2020) and test on the human-translated test sets listed above. Our baseline in these experiments is Hu et al. (2020)'s mBERT Base model on gold (human) data.
Results. As is standard, we report QA results in terms of both Exact Match (EM) and F1. We find that results with ARBERT and MARBERT on QA are not competitive, a clear discrepancy from what we have observed thus far on other tasks. We hypothesize this is because the two models are pre-trained with a sequence length of only 128 tokens, which does not allow them to sufficiently capture both a question and its likely answer within the same sequence window during pre-training. To rectify this, we further pre-train the stronger model, MARBERT, on the same MSA data as ARBERT in addition to the AraNews dataset (Nagoudi et al., 2020) (8.6GB), but with a bigger sequence length of 512 tokens, for 40 epochs. We call this further pre-trained model MARBERT-v2, noting it is trained on a total of 29B tokens. As Table 9 shows, MARBERT-v2 acquires the best performance on all but one test set, where XLM-R Large marginally outperforms it (only in F1).
Table 9: QA results. Results on the ARCD test set are with models using the same training data as Antoun et al. (2020), while the remaining rows report models trained with AR-XTREME (Hu et al., 2020). † Antoun et al. (2020); ‡ Hu et al. (2020).
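For reference, a simplified sketch of the two QA metrics is given below; it follows standard SQuAD-style scoring, with language-specific normalization details (which differ for Arabic) omitted.

```python
import collections
import string

def normalize(s: str) -> str:
    """Lower-case, strip punctuation, and collapse whitespace
    (a simplified, SQuAD-style normalization)."""
    s = "".join(ch for ch in s.lower() if ch not in set(string.punctuation))
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between prediction and gold answer."""
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    common = collections.Counter(pred_toks) & collections.Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```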

ARLUE Categories
We concatenate the corresponding splits of the individual datasets to form ARLUE, which is a conglomerate of task clusters. That is, we concatenate all training data from each group of tasks into a single TRAIN, all development data into a single DEV, and all test data into a single TEST. One exception is the social meaning tasks, whose data we keep independent (see ARLUE SM below). Table 10 shows a summary of the ARLUE datasets. We now briefly describe how we merge individual datasets into ARLUE.
ARLUE Senti. To construct ARLUE Senti, we collapse the labels very negative into negative and very positive into positive, map objective to neutral, and remove the mixed class. This gives us three classes (negative, positive, and neutral) for ARLUE Senti. Details are in the Appendix.
ARLUE Dia. We construct three ARLUE Dia categories. Namely, we concatenate the AOC and ArSarcasm Dia MSA-DA classes to form ARLUE Dia-B (binary), and the region-level classes from the same two datasets to acquire ARLUE Dia-R (4 classes, region). We then merge the country-level datasets to form ARLUE Dia-C (country).
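As an illustration of the label unification used to build ARLUE Senti above, consider the following sketch; the exact label strings in the source datasets vary, so the names below are hypothetical.

```python
# Hypothetical label strings; the source datasets use varying tag sets.
LABEL_MAP = {
    "very negative": "negative",
    "negative": "negative",
    "very positive": "positive",
    "positive": "positive",
    "objective": "neutral",
    "neutral": "neutral",
}

def unify_label(label: str):
    """Collapse fine-grained sentiment labels into three classes;
    return None for the removed 'mixed' class."""
    if label == "mixed":
        return None
    return LABEL_MAP[label]

examples = [("great product", "very positive"), ("hard to say", "mixed")]
unified = []
for text, y in examples:
    mapped = unify_label(y)
    if mapped is not None:  # drop 'mixed' examples entirely
        unified.append((text, mapped))
```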

Evaluation on ARLUE
We present results on each task cluster independently, using the relevant metric, for both the development split (Table 11) and the test split (Table 12). Inspired by McCann et al. (2018) and Wang et al. (2018), who score NLP systems based on their performance on multiple datasets, we introduce an ARLUE score. The ARLUE score is simply the macro-average of the different scores across all task clusters, weighting each task equally. Following Wang et al. (2018), for tasks with multiple metrics (e.g., accuracy and F1), we use an unweighted average of the metrics as the score for the task when computing the overall macro-average. As Table 12 shows, our MARBERT-v2 model achieves the highest ARLUE score (77.40), followed by XLM-R Large (76.55) and ARBERT (76.07). We also note that, in spite of its superiority on social data, MARBERT ranks fourth. This is due to MARBERT suffering on the QA tasks (because of its short input sequence length) and, to a lesser extent, on NER and TC.
(Table caption: † ARLUE SM results are the average score across the social meaning tasks described in Table 5. ‡ The metric for ARLUE QA is Exact Match (EM)/F1.)

Related Work
More information about BERT-inspired LMs can be found in Rogers et al. (2020).
Non-English LMs. Several models dedicated to individual languages other than English have been developed. These include AraBERT (Antoun et al., 2020) and ArabicBERT (Safaya et al., 2020) for Arabic, BERTje for Dutch (de Vries et al., 2019), CamemBERT (Martin et al., 2020) and FlauBERT (Le et al., 2020) for French, PhoBERT for Vietnamese (Nguyen and Tuan Nguyen, 2020), and the models presented by Virtanen et al. (2019) for Finnish, Dadas et al. (2020) for Polish, and Malmsten et al. (2020) for Swedish. Pyysalo et al. (2020) also create monolingual LMs for 42 languages exploiting Wikipedia data. Our models contribute to this growing body of dedicated LMs, and have the advantage of covering a wide range of dialects. Our MARBERT and MARBERT-v2 models are also trained on a massive-scale social media dataset, endowing them with a remarkable ability on real-world downstream tasks.
Benchmarks. Hu et al. (2020) provide the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark for the evaluation of cross-lingual transfer learning, covering nine tasks for 40 languages (12 language families). ARLUE complements these benchmarking efforts, with a focus on Arabic and its dialects. ARLUE is also diverse (involving 42 datasets) and challenging (our best ARLUE score is 77.40).
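For concreteness, the ARLUE score described above can be computed as in the following sketch; the cluster scores shown are hypothetical numbers for illustration only.

```python
def task_score(metrics):
    """Unweighted average over a task's metrics (e.g., accuracy and F1),
    following Wang et al. (2018)."""
    return sum(metrics.values()) / len(metrics)

def arlue_score(cluster_metrics):
    """Macro-average of per-cluster scores, weighting each cluster equally."""
    scores = [task_score(m) for m in cluster_metrics.values()]
    return sum(scores) / len(scores)

# Hypothetical numbers for illustration only:
clusters = {
    "SA":  {"acc": 90.0, "f1": 88.0},  # multiple metrics -> averaged first
    "SM":  {"f1": 75.0},
    "TC":  {"f1": 92.0},
    "DI":  {"f1": 60.0},
    "NER": {"f1": 80.0},
    "QA":  {"em": 65.0, "f1": 78.0},
}
print(round(arlue_score(clusters), 2))  # 77.92 with these numbers
```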

Conclusion
We presented our efforts to develop two powerful Transformer-based language models for Arabic. Our models are trained on large-to-massive datasets covering different domains and text genres, including social media. By pre-training MARBERT and MARBERT-v2 on dialectal Arabic, we aim to enable downstream NLP technologies that serve wider and more diverse communities. Our best models perform better than (or on par with) XLM-R Large (∼3.4× larger than our models), and hence are more energy efficient at inference time. Our models are also significantly better than AraBERT, the previously best-performing Arabic pre-trained LM. We also introduced ARLUE, a large and diverse benchmark for Arabic NLU composed of 42 datasets thematically organized into six main task clusters. ARLUE fills a critical gap in Arabic and multilingual NLP, and promises to help propel innovation and facilitate meaningful comparisons in the field. Our models are publicly available, and we also plan to publicly release our ARLUE benchmark.
In the future, we plan to explore self-training our language models as a way to improve performance following Khalifa et al. (2021). We also plan to investigate developing more energy efficient models.

Ethical Considerations
Although our language models are pre-trained using datasets that were public at the time of collection, parts of these datasets might become private or get removed (e.g., tweets deleted by users). For this reason, we will not release or redistribute any of the pre-training datasets. Data coverage is another important consideration: our datasets have wide coverage, and one of our contributions is offering models that can serve diverse communities better than existing models. However, our models may still carry biases that we have not tested for, and hence we recommend they be used with caution. Finally, our models deliver better performance than larger-sized models and as such are more energy conserving. However, smaller models that achieve simply 'good enough' results should also be desirable. This is part of our own future research, and the community at large is invited to develop novel methods that are more environment friendly.

A.1 SA Datasets
• ArSarcasm Sent. This sarcasm dataset is labeled with sentiment tags by Farha and Magdy (2020), who extract it from ASTD (Nabil et al., 2015) (10,547 tweets) and SemEval-2017 Task 4 (Rosenthal et al., 2017) (8,075 tweets).
• AWATIF. This is an MSA dataset from newswire, Wikipedia, and web fora introduced by Abdul-Mageed and Diab (2012).

A.2 SA Baselines
For SA, we compare to the following SOTA:
• Antoun et al. (2020). We compare to the best results reported by the authors on five SA datasets: HARD, balanced ASTD (which we refer to as ASTD-B), ArSenTD-Lev, AJGT, and the unbalanced positive and negative classes of LABR. They split each dataset into 80/20 for Train/Test, respectively, and report accuracy using the best epoch identified on test data. For a valid comparison, we follow their data splits and evaluation setup.
• Obeid et al. (2020). They fine-tune mBERT and AraBERT on the merged CAMeL Sent datasets and report F1 PN, the macro-F1 score over the positive and negative classes only (neglecting the neutral class).

A.3 SA Evaluation on DEV
B.1 SM Datasets
• Dangerous Speech. We use the AraDang dangerous speech dataset from Alshehri et al. (2020), which is composed of tweets manually labeled with dangerous and safe tags.
• Offensive Language and Hate Speech. We use manually labeled data from the shared task on offensive speech (Mubarak et al., 2020). The shared task is divided into two sub-tasks: sub-task A, detecting whether a tweet is offensive or not-offensive; and sub-task B, detecting whether a tweet is hate-speech or not-hate-speech.
More details about these datasets are in Table B.1.

B.2 SM Baselines
• Age and Gender. We compare to the AraNET (Abdul-Mageed et al., 2020b) age and gender models, trained by fine-tuning mBERT. The authors report 51.42 and 65.30 F1 on age and gender, respectively.
• Dangerous Speech. We compare to Alshehri et al. (2020), who report a best F1 of 59.60 on test with an mBERT model fine-tuned on emotion data.
• Hate Speech. The best results on the offensive and hate speech shared task (Mubarak et al., 2020) are at 95 F1 and are reported by Husain (2020), who employs heavy feature engineering with SVMs. Since our focus is on methods exploiting language models, we compare to Djandji et al. (2020), who ranked second in the shared task with a fine-tuned AraBERT (83.41 F1 on test).
• Irony. We compare to Zhang and Abdul-Mageed (2019a), who fine-tune mBERT on the irony task with an auxiliary author profiling task and report 82.4 F1 on test.
• Offensive Language. We compare to the best results on the offensive sub-task (Mubarak et al., 2020), reported by Hassan et al. (2020). They propose an ensemble of SVMs, CNN-BiLSTM, and mBERT with majority voting and acquire 90.51 F1.
• Sarcasm. We compare to Farha and Magdy (2020), who train a BiLSTM model using the ArSarcasm dataset, reporting a 46.00 F1 score.

B.3 SM Evaluation on DEV

D Dialect Identification
We introduce each dataset briefly here and provide a summary description of all datasets in Table D.1.
• Arabic Online Commentary (AOC). This is a repository of 3M Arabic comments on online news (Zaidan and Callison-Burch, 2014). It is labeled with MSA and three regional dialects (Egyptian, Gulf, and Levantine). The best previously reported results on AOC are based on BiLSTMs (e.g., an accuracy of 82.45 on Levantine).
• Zhang and Abdul-Mageed (2019b) developed the top-ranked system in MADAR subtask 2, with 48.76 accuracy and 34.87 F1 at the tweet level.
• El Mekki et al. (2020) developed the winning system for NADI subtask 2 (province level), using a combination of word and character n-grams to fine-tune AraBERT (6.08 F1).
• AraBERT. For ArSarcasm Dia, where no dialect ID system was previously developed, we consider a fine-tuned AraBERT a baseline.

F Question Answering Datasets
• ARCD. Mozannar et al. (2019) use crowdsourcing to develop the Arabic Reading Comprehension Dataset. We use the same ARCD data splits used by Antoun et al. (2020).
• MLQA. This MultiLingual Question Answering benchmark is proposed by Lewis et al. (2020). It consists of over 5K extractive question-answer instances in SQuAD format in seven languages, including Arabic.
• TyDi QA. The TyDi QA dataset (Clark et al., 2020) is manually curated and covers 11 languages (including Arabic). We focus on the "Gold" passage task only.