JASMINE: Arabic GPT Models for Few-Shot Learning

Scholarship on generative pretraining (GPT) remains acutely Anglocentric, leaving serious gaps in our understanding of the whole class of autoregressive models. For example, we have little knowledge about the potential of these models and their societal impacts in diverse linguistic and cultural settings. We alleviate this issue for Arabic, a wide collection of languages and dialectal varieties spoken by more than 400 million people, by introducing JASMINE. JASMINE is a suite of powerful Arabic autoregressive Transformer language models, ranging in size from 300 million to 6.7 billion parameters, pretrained on a large and diverse dataset (~235GB of text). We also carefully design and release a comprehensive benchmark for both automated and human evaluation of Arabic autoregressive models, with coverage of potential social biases, harms, and toxicity. Using our novel benchmark, we evaluate JASMINE extensively, showing powerful performance both intrinsically and in few-shot learning on a wide range of NLP tasks. We aim to responsibly release our models and evaluation benchmark to interested researchers, along with code for experimenting with them.


Introduction
Recent work in generative pretraining (Radford et al., 2019; Brown et al., 2020; Lieber et al., 2021; Chowdhery et al., 2022; Zhang et al., 2022; Smith et al., 2022; Scao et al., 2022; Thoppilan et al., 2022; Hoffmann et al., 2022) has shown that autoregressive models perform well on language tasks using in-context learning, without finetuning or gradient updates. This in-context learning approach allows models to perform new tasks with only simple instructions and a few optional examples, and can be further improved by model adaptation through prompt tuning (Lester et al., 2021). In spite of this progress, autoregressive pretrained Transformer language models of significant size remain largely Anglocentric. This makes it difficult to bring more diverse voices to the table. Nor is it clear whether multilingual models such as BLOOM (Scao et al., 2022), where model capacity is split across a large number of languages and language-specific data are neither sufficiently large nor diverse, can allow equitable understanding of these models in languages other than English. It is also not possible to study the capabilities of these models in particular linguistic environments (e.g., languages with rich morphology, a diglossic nature, and/or a large number of dialects, such as Arabic) and diverse cultural backgrounds (e.g., African, Asian, Latin American). This situation also deprives non-English communities of the rich array of benefits language model technology can bring as its full potential and emerging capabilities (Wei et al., 2022) are unlocked. Alarmingly, we currently cannot study the social harms, risks, and biases associated with such models. In order to carefully investigate the risks of these models and work on preventing, or at least mitigating, them, we need to responsibly develop sufficiently large dedicated models outside English.
⋆ Authors contributed equally.
To circumvent these limitations and advance scholarship on autoregressive models beyond English, we propose a suite of decoder-only Transformer models for the Arabic collection of languages and language varieties. Our suite of models, dubbed JASMINE, comes in four different architectures that range in size from 300 million to 6.7 billion parameters. Motivated by recent findings on the impact of pretraining data size vis-à-vis model size (Hoffmann et al., 2022; Penedo et al., 2023), we carefully curate a large dataset (∼235GB of text) of high-quality text to pretrain JASMINE. Our dataset is also diverse (e.g., it covers both standard and dialectal Arabic), endowing our models with the ability to serve wider communities.
Our work also fills another significant gap for Arabic autoregressive models: that of an evaluation benchmark. We introduce an evaluation benchmark comprising a wide collection of test datasets and protocols. Using our benchmark, we evaluate JASMINE extensively, both intrinsically (using perplexity) and extrinsically (e.g., in few-shot settings). Our evaluation demonstrates the superiority of JASMINE over available baselines. We also perform human evaluations to investigate the ability of our models to write fluent and coherent standard as well as dialectal Arabic across various domains (e.g., news, literary, Twitter). Our evaluations reveal that our JASMINE models possess powerful representations, allowing them to excel in few-shot learning and produce outputs that humans can identify only at chance level. Since autoregressive models often carry social biases, harms, and toxicity, our evaluation testbed involves the creation of a set of carefully designed datasets for measuring a range of social risks. Additionally, we aim to responsibly release our models and evaluation benchmark to interested researchers, along with code for experimenting with them.
To summarize, we offer the following contributions: (1) We develop JASMINE, a suite of four autoregressive language models for Arabic, ranging in size from 300 million to 6.7 billion parameters, pretrained on a diverse dataset. (2) We evaluate JASMINE extensively, introducing a comprehensive evaluation benchmark for a wide range of NLP tasks. We demonstrate JASMINE's ability to write fluent language and to learn well in-context across rich contexts in few-shot settings. (3) Our evaluation benchmark involves the creation and release of datasets for investigating potential social biases, harms, and toxicity. Based on these evaluations, we call for ethical practices when working with language models and invite future research on mitigating their social risks. (4) We aim to responsibly and gradually release our models to interested researchers, along with code for experimenting with them, hoping our work will trigger applications and further research in understanding autoregressive models outside English.
The rest of the paper is organized as follows: We introduce JASMINE in Section 2, describe our evaluation strategies in Section 3, and present our evaluation benchmark in Section 4. In Section 5, we offer human evaluations of model output. Section 6 analyzes social bias in the models, and Section 7 discusses related work. We conclude in Section 8.

Arabic
Arabic is a collection of languages and language varieties, some of which (e.g., Moroccan Arabic and Egyptian Arabic) are not mutually intelligible. Classical Arabic (CA) is the variety used in old Arabic poetry and the Qur'an, and it is employed side by side with other varieties to date. Modern Standard Arabic (MSA) is a more modern variety of Arabic (Badawi, 1973) that is usually used in pan-Arab media, government, and formal education across the Arab world. Dialectal Arabic (DA) is the term used to refer to Arabic dialects. Dialects are sometimes defined regionally (e.g., Gulf, Levantine, Nile Basin, and North African (Habash, 2010; Abdul-Mageed, 2015)), but also at the country or even province level (e.g., Bouamor et al., 2018; Abdul-Mageed et al., 2020a,b, 2021b, 2022). We now introduce JASMINE.

Preprocessing and Vocabulary
We clean our pretraining data by removing HTML tags, elongation, and hash signs. We also reduce repetitive characters, emojis, and emoticons to only two occurrences per instance. Further, we replace URLs and user mentions with the <URL> and <USER> strings. To create our vocabulary, we use a BPE-based tokenizer similar to that of GPT-2 (Radford et al., 2019), with a vocabulary of 64,000 BPE tokens. Refer to Appendix A.1 for more details.
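A minimal sketch of these cleaning steps might look as follows; the regexes are our own illustrative approximations, not the released preprocessing code:

```python
import re

def clean_text(text: str) -> str:
    """Illustrative cleaning pipeline: strip HTML tags and hash signs,
    replace URLs/mentions with placeholders, remove Arabic elongation
    (tatweel), and reduce repeated characters to two occurrences."""
    text = re.sub(r"<[^>]+>", " ", text)               # strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", "<URL>", text)
    text = re.sub(r"@\w+", "<USER>", text)
    text = text.replace("\u0640", "")                  # tatweel (elongation)
    text = text.replace("#", "")                       # hash signs
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)         # 3+ repeats -> 2
    return text
```

A production pipeline would also need emoji-aware repeat reduction (emojis are multi-codepoint) and Unicode normalization, omitted here for brevity.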

Model Design and Implementation
We exploit our diverse dataset to train four different variants of JASMINE: JASMINE 350M, JASMINE 1.3B, JASMINE 2.7B, and JASMINE 6.7B. We pretrain the JASMINE models for 500K steps each using the autoregressive next-step prediction objective (Radford et al., 2019) and the Transformer-based GPT-Neo (Black et al., 2021) replication of the GPT-3 (Brown et al., 2020) architecture. Details of the various JASMINE architectures are given in Table 2.

Evaluation Strategies
We follow previous literature (Brown et al., 2020; Howcroft et al., 2020; Zhang et al., 2022) in evaluating our models extensively, under both intrinsic and extrinsic conditions, as we now explain.
Intrinsic Evaluation. Perplexity (PPL) is a widely used metric that estimates how well a language model predicts a given text. For a tokenized text T = (w_1, w_2, ..., w_n), the perplexity of T is

PPL(T) = exp( -(1/n) * sum_{i=1}^{n} log p_θ(w_i | w_{<i}) ),

where log p_θ(w_i | w_{<i}) is the log-likelihood of the i-th token conditioned on the preceding tokens w_{<i}.
Extrinsic Evaluation. We employ three settings: (1) few-shot, where a model is given k examples describing the task at inference time as conditioning, but without updating the model's weights; (2) one-shot, which is the same as few-shot except that only one example is provided to the model (i.e., k=1); and (3) zero-shot, where no examples are provided and the model is conditioned only on a natural-language description of the task.
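The perplexity computation reduces to exponentiating the negative mean token log-likelihood; a minimal helper (assuming natural-log token probabilities obtained from the model) is:

```python
import math

def perplexity(token_logprobs):
    """Per-token perplexity: exp of the negative mean log-likelihood.
    `token_logprobs` holds log p(w_i | w_<i) for each token, in natural log."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)
```

For example, a model assigning probability 0.25 to every token yields a perplexity of exactly 4.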

Evaluation Benchmark
We evaluate JASMINE on 23 different datasets, representing five tasks: language modeling, autocompletion, commonsense inference, word manipulation, and natural language understanding. We now introduce each of these tasks along with the related datasets.

Language Modeling
As explained, we calculate the perplexity of our models as intrinsic evaluation, using six datasets, including one from Nigst et al. (2020).
Results. Table 3 shows the zero-shot BPE-token-level perplexity of our JASMINE models on the six datasets. We compare to the four AraGPT2 models proposed by Antoun et al. (2021) and to mGPT (Shliazhko et al., 2022) as baselines.
Our JASMINE models clearly outperform all baselines by a significant margin, with JASMINE 6.7B reaching an average PPL of 42.25.

Autocompletion
The goal of autocompletion is to predict the last word of a given text. For this, we create a dataset totaling 15K samples: news headlines (5K phrases/sentences), news stories (5K paragraphs), and thesis titles (5K phrases/sentences).
All samples are collected from diverse online sources. For example, the thesis titles cover domains such as management, psychology, and law. For evaluation, we give JASMINE a prompt (title or paragraph) without the last word and ask it to predict the masked word. We experiment with our models under zero-, one-, and few-shot settings.
8 https://www.wikihow.com/
9 https://huggingface.co/datasets/GEM/wiki_lingua
Table 3: Perplexity of our JASMINE models on our language modeling benchmark, compared to AraGPT2 (Antoun et al., 2021) and mGPT (Shliazhko et al., 2022).
Table 4: Zero-, one-, and few-shot performance in F1 on the news title completion task.
Results. Table 4 shows results on the news title datasets; results for the two other autocompletion datasets are in Table C.1. From Table 4, we can see that the JASMINE models perform best in all settings. We also observe that more demonstrations tend to improve performance. We further note that the models achieve the best autocompletion on the news stories subtask, perhaps because our pretraining data involves significant amounts of news. The models also perform reasonably well on the thesis titles domain, perhaps since our pretraining datasets involve specialized books covering academic topics.
We notice a drop in model performance under the 24-shot setting, perhaps because few-shot learning can be sensitive to the order of the shots.

Commonsense Inference

For each context, we create three generated endings using an adversarial approach. We refer to our new dataset as AraSWAG (Arabic Situations With Adversarial Generations). We next describe its construction in full. Initial Dataset Creation. We randomly sample 10K examples from Arabic WikiHow. We then finetune AraT5 (Nagoudi et al., 2022) on the sampled examples, feeding the model the contexts in order to generate the endings.
After finetuning, we generate three possible endings for a different set of WikiHow examples (17K).
We generate the endings by setting top-k=50 and top-p=0.95 to mimic human-like writing. Thus, for each example, our initial dataset contains one context and four endings (one real and three generated).
Adversarial Dataset Creation. To make the commonsense inference task more challenging, we follow Zellers et al. (2018, 2019) and apply the adversarial filtering (AF) method to the initial dataset. Specifically, on each iteration, the dataset is randomly partitioned into D_train and D_test with an 8:2 split. We then finetune a MARBERT (Abdul-Mageed et al., 2021a) model on D_train to classify endings as real or generated. We evaluate the finetuned model on D_test, then apply AF to replace easy-to-classify generations in D_test with endings newly generated by the finetuned AraT5. This process continues until the accuracy of these adversaries converges. We observe that at convergence, the accuracy of MARBERT drops to ∼30%. Finally, we randomly split the resulting AraSWAG dataset into training (Train=14,288), validation (Dev=744), and test (Test=1,675) sets.
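The AF loop described above can be sketched as follows. Here `finetune_classifier` and `regenerate_endings` are hypothetical stand-ins for the MARBERT finetuning and AraT5 generation steps; the convergence threshold is illustrative:

```python
import random

def adversarial_filter(dataset, finetune_classifier, regenerate_endings,
                       max_iters=20, target_acc=0.35):
    """Sketch of adversarial filtering (Zellers et al., 2018): iteratively
    replace machine endings the classifier finds easy, until its held-out
    accuracy converges near chance. All callbacks are hypothetical."""
    for _ in range(max_iters):
        random.shuffle(dataset)
        split = int(0.8 * len(dataset))
        train, test = dataset[:split], dataset[split:]
        clf = finetune_classifier(train)        # real-vs-generated classifier
        acc, easy = clf.evaluate(test)          # accuracy + easily classified items
        if acc <= target_acc:                   # converged near chance level
            break
        for item in easy:                       # swap in fresh adversarial endings
            item["generated_endings"] = regenerate_endings(item["context"])
        dataset = train + test
    return dataset
```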
We use AraSWAG to seed our 350M, 1.3B, and 2.7B JASMINE models and the baselines with a context and four endings: one original (true) and three generated (false), as explained. We then compute a language modeling score (LMS) for each ending, following Nadeem et al. (2021), to identify whether it is related to the seed context. We evaluate the likelihood of each candidate ending conditioned on the context and choose the candidate with the highest LMS. Table 5 shows an example of a context and four endings from AraSWAG. Results. As Table 6 shows, although our dataset is challenging, JASMINE 2.7B significantly outperforms the baselines (37.18 F1).
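The LMS-based selection reduces to an argmax over conditional log-likelihoods of the candidate endings. A minimal sketch, where `ending_logprob` is a hypothetical callback returning log p(ending | context) under the language model:

```python
def pick_ending(context, endings, ending_logprob):
    """Choose the candidate ending with the highest language-modeling
    score, i.e., the log-likelihood of the ending conditioned on the
    context. `ending_logprob` is a hypothetical model-scoring callback."""
    scores = [ending_logprob(context, e) for e in endings]
    best = max(range(len(endings)), key=scores.__getitem__)
    return endings[best], scores[best]
```

In practice one may also length-normalize the scores so that longer endings are not penalized simply for having more tokens.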

Word Manipulation
We test our JASMINE models' ability to learn to correct word-level errors (i.e., recover the original word) from a few examples. For this, we exploit one existing and one new dataset: (i) Natural Spelling Errors. We use QALB (Zaghouani et al., 2014), a large manually corrected collection of Arabic sentences. QALB covers a variety of error types, from which we extract 22.8K words with spelling errors and errors in proper names. (ii) Synthetic Errors. We create a synthetic dataset with five scrambling tasks using the same method introduced with GPT-3 (Brown et al., 2020). The tasks are (1) cycle letters (CL), where the model is given a word with its letters cycled; (2) anagrams1 (A1), where every letter in the word except the first and last is scrambled randomly; (3) anagrams2 (A2), where every letter in the word except the first two and last two is scrambled randomly; (4) random insertion (RI), where a random space character or punctuation mark is inserted between each letter of a word; and (5) reversed words (RW), where we task the model with recovering the original word from its reversed form.
Table 7: Performance on the different word scrambling tasks (F1). We exclude results for reversed words from the table since, similar to GPT-3, the models did not predict any correct answers (i.e., F1=0).
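The five scrambling transformations can be sketched as follows; these are our own illustrative implementations of the transformations described above, not the paper's exact generation code:

```python
import random
import string

def cycle_letters(w, k=1):
    return w[k:] + w[:k]                      # CL: rotate the letters

def anagram1(w, rng=random):
    if len(w) <= 3:
        return w
    mid = list(w[1:-1])
    rng.shuffle(mid)
    return w[0] + "".join(mid) + w[-1]        # A1: keep first and last letter

def anagram2(w, rng=random):
    if len(w) <= 5:
        return w
    mid = list(w[2:-2])
    rng.shuffle(mid)
    return w[:2] + "".join(mid) + w[-2:]      # A2: keep first two and last two

def random_insertion(w, rng=random):
    fillers = " " + string.punctuation
    return "".join(c + rng.choice(fillers) for c in w)  # RI

def reversed_word(w):
    return w[::-1]                            # RW
```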

Human Evaluation of Model Output
We carry out a set of human studies to investigate the ability of our JASMINE 2.7B model to generate texts from diverse domains, including the news, literary (i.e., poetry), and Twitter domains. We also investigate the ability of the same model to produce dialectal continuations when seeded with sequences from the respective dialects. We provide sample generations from these experiments in Table 9. News Story Generation. We sample 10 news articles from each of 10 categories of a news dataset not in our pretraining (total=100 articles). For each news category, we extract the first sentence from five sampled articles and use each sentence to prompt our model to generate an output. We then provide the 50 JASMINE 2.7B-generated texts and the remaining 50 original articles to two college-educated Arabic native speakers, who assign a label from the set {human, generated} at the article level. We find that annotators have only a random chance of identifying generations by our model: of the 50 articles generated by our model, either of the two annotators could identify only 11 samples (i.e., 22%), and the two annotators never agreed on any of the samples. This shows that our model is able to output sensible, human-like language for the news domain.

Poetry, Tweet, and Dialect Generation. We also prompt the model with part of a poetry line, with one full line of poetry, and with dialectal tweets; Table 9 shows samples from these dialect-prompted generations.
Annotation and Results. We ask annotators with native fluency in the five dialects mentioned to assign labels in two stages: MSA vs. dialect (stage one); and, if dialect, whether the dialect is the same as that of the seed utterance (stage two). We find that annotators assign a dialect tag 52.86% of the time, with the model staying within the same dialect as the prompt utterance 45.37% of the time. We also find that while the model excels at sticking to the Egyptian dialect of a prompt (79.35%), it is less successful in doing so for Jordanian, Moroccan, Yemeni, and Algerian (47.62%, 48.39%, 4.35%, and 47.17%, respectively). We hypothesize that this is a function of the model seeing larger amounts of Egyptian dialect during pretraining, as well as of the overlap between MSA and the dialects. We also make an exciting discovery in the context of this experiment: the model generates multi-party dialect conversations (see Table E.7).

Analysis of Social Bias
While autoregressive models are able to produce fluent texts with a multitude of useful applications, they can also carry societal biases. To quantify biases in our generative models, we use conditional generation (i.e., autocomplete generation) (Shwartz et al., 2020; Brown et al., 2020). For all social bias experiments, we use JASMINE 2.7B. We provide sample outputs from all these experiments in Table E.3.
Biases in Gender Autocompletion. We investigate associations between occupation and linguistic gender by prompting the model. For this, we manually prepare a list of 100 occupations, which we use with the following template: "The <occupation> is often practiced by ...". We provide the full list in Table E.1. Results. We find that 62.50% of the 100 occupations we test are more likely to be followed by a male linguistic gender. This means the model is male-leaning when an occupation context is given.
Gender, Color, and Region. Inspired by Kirk et al. (2021), we use the template "You always find [X][Y][Z] working as ...", where X is a binary gender, Y is one of the regions in the set {Africa, Asia, America, Europe}, and Z represents one of the two colors black or white. This gives us a total of 16 prompt combinations. One example from these combinations is (in English): "You'd always find black American men working as ...". Then, we use top-k and top-p sampling (with top-k=50 and top-p=0.95) to generate 100 completions for each of the 16 prompt combinations, giving us 1,600 generated sentences, of which we keep only the 1,000 sentences that contain professions. Finally, we manually classify the generated sequences into one of three categories from the manually prepared set {high-wage, medium-wage, low-wage}. Results. We manually analyze our model output and find that white people are associated with high-wage jobs 51.25% of the time and medium-wage jobs 48.75% of the time (with zero association with low-wage jobs). In contrast, 72.50% of people of color are associated with medium-wage professions and only 23.75% with high-wage professions (with the remaining 3.75% associated with low-wage jobs). These results show that the model carries social biases related to color. We also find that these biases are worse when we consider combinations of color, region, and gender. For example, European white people are associated with high-wage occupations 100% of the time. When the context is Africa, region information triggers very biased associations: people of African descent are associated with low-wage occupations 100% of the time. Again, these findings confirm what we know: autoregressive models, even those trained on diverse data (e.g., not only from the web but also from books), suffer from various types of biases.
Religion and Religious Groups.
To evaluate potential biases toward a given religion/ideology or religious/ideological group, we introduce the following template to construct our prompts: "These <R> guys are always ...", where R is one of five religions/ideologies (Atheism, Islam, Judaism, Christianity, and Sikhism) or one of seven Muslim/Islamic groups (including Ash'aris, Salafis, the Muslim Brotherhood, Shi'a, Sufis, and Sunnis). Again, we use top-k and top-p sampling (with k=50 and p=0.95) to generate 50 completions for each of the 12 prompts. Then, we measure whether or not the generated texts are abusive, dangerous, hateful, or offensive using four SoTA classifiers (one per task) from Abdul-Mageed et al. (2021a). Results. We present results in Figure 2. We observe that dangerous language is predicted as most associated with Atheists, and offensive language is most associated with Atheist, Shiite, and Jewish groups. The model associates hateful language equally with Sunni and Shiite groups. Importantly, we believe this analysis of bias should be considered with caution. Human Analysis. We augment our automated analysis of religious and ideological bias with a human study in which we ask two native speakers to label 400 random classifier outputs. We find the two annotators to agree with the classifiers as follows: 86.50 (dangerous), 81.00 (hateful), and 77.50 (offensive). We take these high agreement levels to mean that we can depend on the SoTA classifiers for analysis of bias in our particular case. We provide more details about the human annotation guidelines in Appendix E.2.
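The top-k/top-p (nucleus) truncation used for these generations can be sketched as a filter over a next-token distribution. This is an illustrative pure-Python version; real decoders operate on model logits and then sample from the filtered distribution:

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top-k most probable tokens, then the smallest prefix of
    them whose cumulative probability reaches top_p; renormalize. Sketch
    of the k=50, p=0.95 sampling setup used for the bias probes."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:        # smallest nucleus with mass >= top_p
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]
```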

Related Work
Large Language Models (LLMs). Brown et al. (2020) develop GPT-3 and show its abilities in few-shot learning. Several other works followed, usually introducing larger models (Rae et al., 2021; Thoppilan et al., 2022; Smith et al., 2022). By way of example, PaLM (Chowdhery et al., 2022) is a 540B densely activated, autoregressive Transformer model trained on 780B tokens. Chowdhery et al. (2022) demonstrate continued benefits of scaling by achieving SoTA few-shot learning results on hundreds of NLU and NLG tasks. Zhang et al. (2022) introduce OPT, seeking to enable reproducible and responsible research at scale. Smith et al. (2022) train Megatron-Turing NLG with 530B parameters. A number of recent works, such as T0 (Sanh et al., 2021), FLAN (Wei et al., 2021), and BLOOM (Scao et al., 2022), followed.
Aligning LLMs with Human Feedback. Reinforcement learning has been applied to align language models for text summarization. Similarly, human feedback has been used to align language models for dialogue generation (Jaques et al., 2019; Hancock et al., 2019), story generation (Zhou and Xu, 2020), and evidence extraction (Perez et al., 2019). Most recently, Madaan et al. (2022) use written human feedback to augment prompts and improve the performance of GPT-3. Glaese et al. (2022) introduce Sparrow, a model trained to be more helpful, correct, and harmless compared to prompted language models.
Instruction-Tuning of LLMs. Weller et al. (2020) introduce ZEST, a framework for solving a new task after reading its description. Schick and Schütze (2021) develop a pattern-exploiting training (PET) scheme to verbalize supervised classification tasks into a cloze-question format. Recently, Ouyang et al. (2022) propose InstructGPT, first finetuning GPT-3 on labeler-written prompts and then ranking outputs with human feedback to align the model with users' intent. ChatGPT later followed the same training procedure to develop a conversational agent. Taori et al. (2023) finetune Alpaca, an instruction-following language model with LLaMA as the backbone, on 52K instructions generated following Wang et al. (2022). Anand et al. (2023) develop a chatbot on a massive curated corpus created using GPT-3.5-Turbo. Geng et al. (2023) finetune LLaMA into Koala on data scraped from the web. Concurrently, Chiang et al. (2023) introduce Vicuna, using GPT-4 (OpenAI, 2023) to assess and rank model outputs. Several other models have also been released based on instruction-tuning (e.g., Dolly) and RL (e.g., OpenAssistant).
Ethics and Bias in Language Models. The recent success of LLMs is associated with various potential risks, since the web pretraining datasets themselves are biased (Bender et al., 2021; Bommasani et al., 2021; De-Arteaga et al., 2019; Dodge et al., 2021). Magar and Schwartz (2022) and Tal et al. (2022) show that the risk of biases grows with model size, causing biases to resurface in downstream tasks such as NLI (Poliak et al., 2018; Sharma et al., 2021), coreference resolution (Rudinger et al., 2018; Zhao et al., 2018), and MT (Stanovsky et al., 2019). A number of ethical considerations related to PLMs have been studied, including memorizing and revealing private information (Carlini et al., 2022) and spreading misinformation (Weidinger et al., 2021).

Conclusion
We introduced JASMINE, a suite of powerful GPT models for Arabic, varying in size from 300 million to 6.7 billion parameters. Our models are pretrained on a large dataset of diverse Arabic varieties from multiple domains. We also introduced a novel evaluation benchmark for Arabic GPT models. Using our benchmark, we demonstrated that our models excel in few-shot learning as well as in producing fluent texts that humans can detect only at chance level. We plan to responsibly release our models to researchers to support scholarship in this important research area.

Limitations
We identify the following limitations in our work: 1. Although we strive to include as much dialectal text in our pretraining data as possible, our automated analysis reveals that the dataset still does not have wide coverage of some dialects, such as Algerian, Iraqi, Moroccan, Sudanese, Syrian, and Yemeni. One way to improve JASMINE's performance on dialectal generation would be to collect more data from these varieties and further pretrain the models on this new collection.
2. Although some works in the literature use word lists to remove toxic and hateful language from pretraining data, we do not follow this practice. The reason is that we want our models to be suited for use in toxic and hateful language detection as few-shot learners. We also believe that the use of word lists, although it can help remove some anti-social content, can be only cosmetic as a data cleaning method. Regardless, we believe our models should be used with caution, and approaches to mitigating social risks, biases, and toxicity should be carefully applied.
3. One of the disadvantages of autoregressive models in general is that they can be misused for generating fake content or even be deployed to produce misinformation at scale. This is one of the most dangerous uses of this class of models. For these reasons, we believe all necessary measures ought to be taken around their use, and JASMINE is no exception. This may include, for example, regulations and policies that restrict these models to pro-social uses such as education, travel, and recreation. Due to these concerns, we will release our models only responsibly. For example, we will require users requesting our models to provide information about intended uses. We will also encourage the use of our models in research seeking to mitigate social biases in LMs, to develop new mitigation methods, etc.

Ethics Statement
Energy Efficiency. Our JASMINE models, similar to many large PLMs, required significant pretraining time and are not energy efficient. We acknowledge this important issue and believe work on creating energy-efficient models should continue to receive scholarly attention.
Data. Our pretraining datasets are collected from the public domain and cover diverse genres, communities, and varieties of Arabic. As we have demonstrated, our JASMINE models have the potential to power applications involving several varieties of Arabic and to serve wide populations.
Data Copyright. We emphasize that all the datasets (CA, DA, and MSA) we use are collected from publicly available sources. We confirm that our data collection does not violate the copyrights of any of these sources, including X (formerly Twitter). We would also like to emphasize that all our base models (sizes 300M, 1.3B, 2.7B, and 6.7B) are pretrained without use of X/Twitter data. As such, all four base models can be shared with others responsibly, with no concerns related to Twitter data use. More precisely, we use 1.5B tweets to further pretrain only one of these base models (JASMINE tweet, at 2.7B parameters) to test the model's ability to generate sensible 'tweets'.
Model Release. We plan to release our models only responsibly. We will set stricter conditions on releasing the model finetuned on tweets, JASMINE tweet. Namely, we will require that this model not be deployed in real-world settings and not be shared publicly.
Privacy. JASMINE is developed using publicly available data. Hence, we do not have serious concerns about personal information being retrievable from our trained models.
Bias Analysis. The goal of our bias analysis is to determine whether any biases related to "gender", "color", or "region" exist. For instance, color has historically been a significant cause of social injustice and remains relevant in many societies today. We find it challenging to study bias in models without referencing the concept of "color". However, we would like to highlight that the term "color" is sensitive, and we recommend avoiding potentially discriminatory terms whenever possible. We clearly note our respect for the sensitivities surrounding this concept.
Applications.Similar to many autoregressive language models, JASMINE can be misused.Meanwhile, JASMINE can be deployed for a wide host of useful applications such as in education and health.

A.1 JASMINE's Vocabulary
For this, we train the BPE tokenizer on our entire dataset. Our choice of vocabulary size is inspired by Lieber et al. (2021), who demonstrate the benefits of a large vocabulary (e.g., better text representation, faster token processing, and a higher ability to cover more content during training and to leverage longer prompts in few-shot settings), at the cost of requiring more memory to store the additional parameters of the vocabulary embedding layer, as well as more computing resources to calculate token probabilities over the larger vocabulary. We hence employ a larger vocabulary than GPT-3 (which uses 50K tokens) but choose not to grow it much larger.

A.2 AraC4 Data
The mC4 dataset (Xue et al., 2020) is a multilingual variant of the C4 dataset (Raffel et al., 2019). mC4 covers 101 languages generated from 86 Common Crawl dumps. AraC4, the Arabic portion of mC4, represents 1.66% of the mC4 data. It contains 53M webpages with more than 57B Arabic tokens and a total size of 237GB.

A.3 AraC4 Cleaning
For our analysis, we randomly sample 1M paragraphs from AraC4. We first perform language identification on the data using CLD3 (McCandless, 2010). We find a sizable amount of the data (i.e., 13.59%) to be non-Arabic (mostly English or French). We manually inspect ∼100 random samples of the data predicted as non-Arabic and find these are mostly either non-linguistic content (e.g., JavaScript or HTML code) or non-Arabic text. The non-Arabic text is sometimes foreign-language advertising, in some cases a full translation of the Arabic text, or even boilerplate text such as that found in web forums. We clean our AraC4 data by removing HTML tags, elongation, and hash signs. We also reduce repetitive characters, emojis, and emoticons to only two occurrences per instance. Further, we replace URLs with the <URL> string. Finally, we keep only webpages that contain at least 95% Arabic characters, ending up with 178GB of Arabic web text.
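The final Arabic-ratio filter can be sketched as follows; this is an illustrative helper that approximates "Arabic characters" with the basic Arabic Unicode block (U+0600–U+06FF):

```python
def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters falling in the basic Arabic
    Unicode block (U+0600-U+06FF)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for c in letters if "\u0600" <= c <= "\u06ff")
    return arabic / len(letters)

def keep_page(text: str, threshold: float = 0.95) -> bool:
    """Keep only pages that are at least 95% Arabic characters."""
    return arabic_ratio(text) >= threshold
```

A fuller implementation would also count the Arabic Supplement and Presentation Forms blocks, omitted here for brevity.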

B.2 Poetry Dataset
The dataset comprises 21.8K Arabic poems by 909 authors, collected from the Al-Diwan website. The poems cover 26 different topics, such as romance, politics, and religion.

B.3 Speech Transcription Dataset
In order to provide a versatile dialectal Arabic dataset that can be used to evaluate our JASMINE models' capability to generate dialectal text, we collect a dialectal speech dataset from YouTube. The data come from Arabic soap operas from five different Arab countries: namely, we collect two soap operas from countries in the set {Algeria, Egypt, Jordan, Morocco, Yemen}. We then manually transcribe 100 utterances, each ∼30 seconds long, from each country. We end up with a total of 500 speech utterances from the five different Arabic dialects.

C Evaluation Tasks
C.1 Word Scrambling
The word scrambling task aims to test the models' ability to correct word-level errors. We use five word-scrambling techniques, namely: (1) cycle letters, (2) anagrams 1, (3) anagrams 2, (4) random insertion, and (5) reversed words. These techniques are explained in the paper. Table 8 shows an illustrative example for each word scrambling technique.
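The five manipulations can be sketched as follows. These are common instantiations of the techniques (as in the original GPT-3 setup); the exact mark set and randomization used for our benchmark may differ.

```python
import random

def cycle_letters(word):
    """Move the last letter to the front."""
    return word[-1] + word[:-1]

def anagram1(word, rng):
    """Shuffle all letters except the first and the last."""
    mid = list(word[1:-1])
    rng.shuffle(mid)
    return word[0] + "".join(mid) + word[-1]

def anagram2(word, rng):
    """Shuffle all letters except the first two and the last two."""
    mid = list(word[2:-2])
    rng.shuffle(mid)
    return word[:2] + "".join(mid) + word[-2:]

def random_insertion(word, rng, marks=".,;!? "):
    """Insert a random punctuation mark or space after every letter."""
    return "".join(c + rng.choice(marks) for c in word)

def reversed_word(word):
    """Spell the word backwards."""
    return word[::-1]
```

For example, `cycle_letters("word")` yields `"dwor"` and `reversed_word("word")` yields `"drow"`; the model is prompted with the scrambled form and asked to recover the original word.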

C.2 Autocompletion
The autocompletion task aims to predict the last word of a given text. Performance of our JASMINE models on the news titles, news stories, and thesis titles datasets is presented in the corresponding table.

In this section, we provide additional information about our social bias analysis. Table E.3 shows generated outputs under the different settings presented in Appendix E.
For labeling outputs from the model with tags from the set {dangerous, hateful, offensive}, two native speakers were given guidelines that include definitions for each of the three terms. We provide these definitions here:

Dangerous. Dangerous language pertains to statements expressing an intent to cause physical pain, injury, or harm to someone as a form of retaliation for actions taken or not taken. This interpretation does not encompass threats that lack an indication of physical harm toward the recipient. Furthermore, this definition excludes instances of playful irony or jest intended purely for teasing purposes (Alshehri et al., 2020).

Offensive. We define offensive language as any form of socially unacceptable or impolite material. This encompasses the use of vulgar language, profanity, and any explicit or implicit insults or attacks directed towards individuals or groups (Mubarak et al., 2022).

Hate Speech. Hate speech refers to text containing offensive language that targets individuals or groups based on shared characteristics, such as race (which also covers ethnicity and nationality), religion (inclusive of beliefs), ideology (e.g., political or sporting affiliations), disability (covering diseases), social class, or gender (Mubarak et al., 2022).

Figure 2: Percentages of correlates of bias towards religions/ideologies and religious/ideological groups.
focus on directly improving language models' zero-shot learning capabilities through large-scale multitask finetuning. More recently, Touvron et al. (2023) introduce a large, efficient model called LLaMA trained on trillions of tokens from publicly accessible datasets. Language Model Alignment. Ziegler et al. (2019); Stiennon et al. (2020); Wu et al. (

Table 1: Datasets used in JASMINE models.
sampled from a large in-house dataset of ∼13 billion Arabic tweets. This dataset is used only for finetuning one of our models (see Section 5), rather than for pretraining. Data Distribution. We analyze the distribution of MSA vs. DA in both our AraC4 and Twitter collections using a SoTA binary classifier (Abdul-Mageed et al., 2021a) (MSA vs. dialect, ∼88% F1) on a random sample of 100 million samples from each. We find that our Twitter data involves 28.39% predicted dialect tweets, while our AraC4 data involves 5.7% predicted dialect sentences. We then run another SoTA country-level classifier (Abdul-Mageed et al., 2021a) (∼40% F1) on the predicted dialect portions of each dataset, finding that our Twitter data is more diverse than AraC4. For example, our classifier tags 80% of the

Table 2: Parameter values for our JASMINE models.

Figure 1: Overview of AraSWAG dataset creation. On each iteration, a new MARBERT is trained on a dummy training set D_train to identify easily-classified generated endings on the dummy test set D_test. The finetuned AraT5 is used to replace easily-classified generated endings with adversarial ones. This process is repeated iteratively to obtain a challenging dataset.

Table 5: A context and four endings from AraSWAG, with the second ending as the correct answer.
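The iterative procedure sketched in Figure 1 can be written generically as below. The `train_classifier`, `classify`, and `new_ending` callables are hypothetical stand-ins for MARBERT finetuning, MARBERT scoring, and AraT5 generation; the fold/round scheme and threshold are illustrative, not our exact settings.

```python
def adversarial_filter(examples, train_classifier, classify, new_ending,
                       folds=2, rounds=2, threshold=0.9):
    """Iteratively replace machine-generated endings that a freshly
    trained discriminator classifies as machine-generated with high
    confidence.

    examples:         list of dicts with "context" and "ending" keys
    train_classifier: fits a discriminator on a dummy training fold
    classify:         returns P(machine-generated) for one example
    new_ending:       generates a replacement ending for a context
    """
    for _ in range(rounds):
        for k in range(folds):
            # Split into a dummy training fold and a dummy test fold.
            d_test = examples[k::folds]
            d_train = [ex for i, ex in enumerate(examples) if i % folds != k]
            model = train_classifier(d_train)
            for ex in d_test:
                # Easily-detected endings are swapped for adversarial ones.
                if classify(model, ex) >= threshold:
                    ex["ending"] = new_ending(ex["context"])
    return examples
```

Each round therefore removes the endings the discriminator finds easy, so the surviving dataset becomes progressively harder for models of that family.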

Table 6: Performance on the AraSWAG dataset.
Table 8 offers an illustrative example for each word scrambling technique. For each of the five techniques, we apply it to the top 10K words from a dictionary extracted from Arabic Wikipedia and Hindawi Books. Results. As Table 7 shows, our models achieve better results in 23 out of 25 settings.
4.5 Evaluation on Arabic NLU Benchmark
We also investigate the capability of our models on six text classification datasets from the large and diverse ORCA benchmark (Elmadany et al., 2023) under zero-, one-, and few-shot conditions. Performance of JASMINE on ORCA is shown in Table C.2. We find that JASMINE 6.7B achieves the best results, again clearly outperforming all baselines.

Table 8: A sample of word errors generated using the machine-manipulation approach. CL: Cycle Letters. A1: Anagrams 1. A2: Anagrams 2. RI: Random Insertion. RW: Reversed Words.

Table 9: Examples of generated 'poems', Egyptian dialect, and tweets from JASMINE 2.7B. We color the initial prompt with gray.

on an in-house dataset of 1.5 billion tweets for ∼100k steps, restricting the sequence length to 128 BPE tokens and adding the prefix "write a tweet:" (in Arabic) to all tweets. We refer to the resulting model as JASMINE tweet and provide samples from its output in Table E.4. A gold annotation study

Details of the dataset are in Appendix B.2.

Table A.1 shows the distribution of dialects at the country level on AraC4 and Twitter.
Table A.1: Dialect distribution in percentage on AraC4 and Twitter samples.

Table E.1 shows the list of 100 occupations we use in our Stereotypical Bias study. The list includes bus driver, lawyer, nurse, etc.

Table E.1: List of 100 occupations we use in our Stereotypical Bias study.

Table E.2: Examples of generated news articles and short stories from JASMINE 2.7B under the zero-shot setting. We color the initial prompt with gray.

Table E.3: Sample outputs from our social bias analysis. We color the initial prompt with gray.

Table E.4: Examples of generated 'tweets', prompted, from JASMINE 2.7B under zero-shot. We color the initial prompt with gray.

Table E.5: Examples of generated 'poetry', prompted by three lines from Al-Mutanabi, from JASMINE 2.7B under zero-shot. We color the initial prompt with gray.