ALDi: Quantifying the Arabic Level of Dialectness of Text

Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications. To handle this variation, previous work in Arabic NLP has focused on Dialect Identification (DI) on the sentence or the token level. However, DI treats the task as binary, whereas we argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi), a continuous linguistic variable. We introduce the AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences (17% from news articles and 83% from user comments on those articles) which are manually labeled with their level of dialectness. We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora (including dialects and genres not included in AOC-ALDi), providing a more nuanced picture than traditional DI systems. Through case studies, we illustrate how ALDi can reveal Arabic speakers' stylistic choices in different situations, a useful property for sociolinguistic analyses.


Introduction
Arabic is spoken by more than 420 million people all over the world (Bergman and Diab, 2022), and exists in a state of diglossia, in which two variants of the language co-exist in Arabic-speaking communities (Ferguson, 1959). Modern Standard Arabic (MSA) is the standardized variant, which is taught in schools and used in formal communications and as a common language across all Arab countries. However, many local variants of Dialectal Arabic (DA) are used for daily communication, mainly in speech and speech-like text such as social media. They can diverge from MSA and each other in phonology, morphology, syntax, and semantics (Habash, 2010), sometimes even being mutually unintelligible (Abu Farha and Magdy, 2022), and they do not have a standard orthography.

Table 1: The dialectal sentences use their preferred SVO word order, contrasted by VSO order for MSA. The low dialectness example also shows a lexical dialectal feature for the word the man (MSA ): the Egyptian word ( ) differs from MSA in a single character, while the equivalent Levantine word ( ) has a different origin. Both dialects allow different variants for the verb: one variant ( ), used in both dialects, shares a root with the MSA variant, while the more dialectal variants ( in Egyptian and in Levantine) do not.
These differences between MSA and DA, and the fact that speakers commonly code-switch between the two, are a major challenge for Arabic NLP systems. As a result, many systems have been designed to perform Dialect Identification (DI), often on the sentence level (Zaidan and Callison-Burch, 2011; Elfardy and Diab, 2013; Salameh et al., 2018), but also on the token level as a way of detecting code-switching points (Solorio et al., 2014; Molina et al., 2016). Both formulations take a binary view of the problem (a sentence or token is either MSA or DA), and assume all the features of DA have the same impact on the perceived "dialectness" of a sentence. We argue, however, that the level of dialectness of a sentence is a spectrum, as illustrated in Table 1. Earlier initiatives recognized the presence of such a spectrum (Habash et al., 2008; Zaidan and Callison-Burch, 2011); however, the datasets that were developed are either skewed toward more standardized documents with limited code-switching or lack information about the distribution and the quality of the level of dialectness labels. Consequently, the Level of Dialectness has not yet been adopted as a linguistic variable that is formally recognized in analyzing Arabic text, despite being potentially useful for NLP applications.
We argue that the level of dialectness is an important but overlooked aspect of Arabic text which is complementary to, and more nuanced than, dialect identification. To support this claim and promote further research in the area, we:
1. Define the Arabic Level of Dialectness (ALDi) as a continuous linguistic variable that quantifies the dialectness of a sentence (or sentence-like unit) and can enrich the analysis of Arabic text.
2. Release AOC-ALDi, a dataset of 127,835 Arabic comments with their ALDi labels, which is derived from the Arabic Online Commentary dataset (Zaidan and Callison-Burch, 2011). We provide the first detailed analysis of the level of dialectness labels and form canonical splits for the AOC-ALDi dataset.
3. Propose an effective method for estimating the ALDi of sentences that can generalize to corpora of other genres and dialects.
4. Demonstrate via case studies that ALDi estimation of transcribed political speeches can highlight interesting insights that existing DI systems fail to detect.
We hope that our work on the Level of Dialectness variable can motivate research in this direction applied to other languages, such as Swiss German, where a standard variant co-exists with non-standardized ones.

Background and Related Work
MSA and Dialectal Arabic: Unlike English, where there is no single standard variant used in all English-dominant countries, Arabic speakers agree to a great extent on having a single standardized form of the language that they call Fus-ha. Arabs perceive both MSA and Classical Arabic (CA), the variant of Arabic that dates back to the 7th century, as Fus-ha (Parkinson, 1991).
While Arabs can understand and read this standard language, spontaneously speaking in the standard language is not a natural task for most of them. Variants of DA are generally used in everyday communications, especially in spontaneous situations, and are widely used on social media platforms.
DA variants can be grouped on the level of regions (5 major variants: Nile Basin, Gulf, Levant, Maghreb, and Gulf of Aden), countries (more than 20 variants), or cities (100+ variants) (Baimukan et al., 2022). In text, MSA differs from DA in terms of morphemes, syntax, and orthography. These differences form cues of dialectness in code-switched text. In the orthography, distinctive DA terms are written in ways that match the pronunciation. Regional differences in the pronunciation of MSA terms are typically lost in writing due to the standardized orthography, but in some cases, individuals use non-standard orthography that matches their regional pronunciations (e.g., Man written as instead of the standardized form as in Table 1).
Much of the work on DI has been done at the sentence or document level, but there has also been work on token-level DI for code-switching, for example on Egyptian Arabic-MSA tweets (Solorio et al., 2014; Molina et al., 2016) and on Algerian Arabic (Adouane and Dobnik, 2017).

Levels of Dialectness
Both sentence-level and token-level DI methods fail to distinguish between sentences having the same number of dialectal cues, yet different levels of dialectness. As per Table 1, each of the sentences and has two lexical cues of dialectness, yet the latter sentence is perceived as being more dialectal than the former. Only a very few works have considered this distinction. One is Zaidan and Callison-Burch (2011), who collected sentence-level dialectness annotations in the Arabic Online Commentary dataset. Although the dataset has been released, there has been no published description or analysis of these annotations that we know of, and (perhaps for this reason) no follow-up work using them. Our work aims to remedy this.
An earlier project that annotated dialectness was Habash et al. (2008), who proposed a word-level annotation scheme consisting of four levels: (1) Pure MSA, (2) MSA with non-standard orthography, (3) MSA with dialect morphology, and (4) Dialectal lexeme. Annotators also labeled full sentences according to their level of dialectness. Although the inter-annotator agreement was relatively good (less so for the sentence level), only a small corpus was annotated (19k words). Moreover, the corpus has sentences that are mostly in MSA with limited code-switching. A later work piloted a simplified version of the scheme on another corpus of 30k words (Elfardy and Diab, 2012). Neither corpus is publicly released.
Level of Dialectness and Formality: Formality is a concept that has been studied, yet it does not have a generally agreed-upon definition (Heylighen and Dewaele, 1999; Lahiri, 2016; Pavlick and Tetreault, 2016; Rao and Tetreault, 2018). Heylighen and Dewaele (1999) define formality as the avoidance of ambiguity by minimizing the context-dependence and the fuzziness of the expressions used. Later operationalizations recognize the effect that factors such as slang words and grammatical inaccuracies have on people's perception of formality (Mosquera and Moreda, 2012; Peterson et al., 2011), as cited in (Pavlick and Tetreault, 2016).
Arabic speakers tend to use MSA in formal situations, and their regional dialects in informal ones. However, an Arabic speaker can still use MSA and speak informally, or use their dialect and speak formally. The case studies described in §5 show how Arab presidents use sentences of different levels of dialectness in their political speeches. While these speeches would all be considered formal, a speaker might use different levels of dialectness to sound authoritative (using MSA) or to seek sympathy (using a regional dialect). Therefore, we believe the level of dialectness and formality are related yet not interchangeable.

The Arabic Level of Dialectness (ALDi)
We define the Level of Dialectness of a sentence as the extent to which the sentence diverges from the standard language, which can be based on any of the cues described above. This definition is consistent with the crowd-sourced annotation of the Arabic Online Commentary (AOC) dataset (Zaidan and Callison-Burch, 2011), where annotators labeled user comments on Arabic newspaper articles by their dialect and their level of dialectness. However, the original and subsequent work only used the dialect labels, and the dialectness annotations have not previously been analyzed in detail.

Analyzing the AOC Dataset
The AOC dataset was created by scraping user comments on articles from three different newspapers, which are published in Egypt (Youm7), Jordan (AlGhad), and Saudi Arabia (AlRiyadh); the majority of comments from each source are thus expected to be in the Egyptian (EGY), Levantine (LEV), and Gulf (GLF) dialects respectively. Each comment is labeled for its level of dialectness (MSA, little, mixed, mostly dialectal, not Arabic). For comments labeled as Non-MSA, the annotators also chose the dialect in which the text is written: EGY, LEV, GLF, Maghrebi (MAG), Iraqi (IRQ), General (GEN: used when the text is DA, but could belong to multiple dialects), Unfamiliar, and Other.
Each row of the released AOC dataset consists of 12 different sentences representing a Human Intelligence Task (HIT) on Amazon mTurk, with annotations provided by the same human judge. A HIT has 10 comments in addition to 2 control sentences sampled from the articles' bodies, which are expected to be mostly written in MSA. As part of each HIT, annotators provided some personal information such as their place of residence, whether they are native Arabic speakers, and the Arabic dialect they understand the most. Table 2 shows the number of annotations collected for sentences from each source.
Table 3 shows the distribution of Level of Dialectness annotations in AOC. As expected, the control sentences are nearly all (94%) annotated as MSA. MSA is also the most common label for the scraped comments (57% of their annotations), followed by the mostly dialectal label (23%), little dialectal (11%), and mixed (6.5%).
Figure 1 shows the distribution of dialectness labels split out by dialect (sentences labeled as MSA are not shown). We see that the proportions of different levels of dialectness for the LEV, GLF, and EGY dialects are similar, even though the total number of annotations per source (Table 2) is more skewed. This is likely due to the fact (noted by Zaidan and Callison-Burch 2014) that AlGhad contains the highest proportion of MSA annotations, followed by AlRiyadh and then Youm7. Figure 1 also shows that the distribution of dialectness levels is similar for the LEV, GLF, and EGY dialects, whereas the GEN dialect label has a higher proportion of little dialectness. This makes sense, since for sentences with few cues of dialectness, the level of dialectness would be low, and it would be hard to assign these sentences to a specific dialect.

From AOC to AOC-ALDi
In order to transform the AOC level of dialectness annotations into numeric ALDi scores, we applied the following steps:

Step #1 - HIT to annotation rows: We split each row (HIT) of the AOC dataset into 12 annotation rows, one for each sentence of the HIT, with the annotator's information shared across them.
Step #2 - Grouping identical comments: Comments on the same article can sometimes be identical. We decided to group identical comments on the same article together. Out of 129,873 grouped comments, only 1,377 comments have more than three annotations. We discarded 2,038 grouped comments for which at least 2/3 of the dialectness level annotations are either Missing or Not Arabic, leaving a total of 127,835 comments with at least three annotations each. The average length of these comments is 20 words.
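Steps #1 and #2 can be sketched as follows. This is a minimal illustration and not the released preprocessing code: the record layout (the `annotator`, `sentences`, `article`, `text`, and `level` keys) and the function names are hypothetical, chosen only to mirror the description above.

```python
from collections import defaultdict

# Labels that count against a comment in the discard rule of Step #2.
INVALID = {"Missing", "Not Arabic"}


def hit_to_rows(hit):
    """Step #1: split one HIT record into one annotation row per sentence.

    The annotator's information is shared across all rows of the HIT.
    """
    shared = hit["annotator"]
    return [
        {
            "article": s["article"],
            "text": s["text"],
            "level": s["level"],
            "annotator": shared,
        }
        for s in hit["sentences"]
    ]


def group_identical(rows):
    """Step #2: collect the labels of identical comments on the same article."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["article"], row["text"])].append(row["level"])
    return dict(grouped)


def keep_comment(levels):
    """Keep a grouped comment unless at least 2/3 of its labels are invalid."""
    invalid = sum(level in INVALID for level in levels)
    return 3 * invalid < 2 * len(levels)
```

The discard rule is written with integer arithmetic so that a comment with exactly two invalid labels out of three sits on the "discard" side of the 2/3 threshold.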
Table 5: Sample comments to the same article with their level of dialectness labels (3 annotations for each comment with their mean as the ALDi score). The labels are MSA (0), Little (1/3), Mixed (2/3), Most (1). DA segments are underlined. Loanwords are double-underlined.
English translations (ours) of the three comments:
1. Bravo to the wonderful Minister, who proved that he is responsible, feeling the importance of his responsibility for the first time in a long time in the history of Egyptian education.
2. We start with the right task of developing schools and providing observers over them
3. Honestly, a serious minister .... I hope he stays like this all the time

We measured inter-annotator agreement on the level of dialectness annotations for the 124,257 comments which have 3 annotations that are neither Not Arabic nor Missing. The Fleiss' Kappa (κ) is 0.44 (Fleiss, 1971), while Krippendorff's Alpha (interval method) (α) is 0.63 (Krippendorff, 2004). κ is corrected for chance agreement and α for chance disagreement. κ considers the labels as categorical, while α penalizes disagreements according to the differences between their values. Although these agreement levels are considered only moderate, our experiments demonstrate that the corpus can nevertheless be useful.
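The difference between the two agreement metrics can be illustrated with a small sketch. The code below uses the textbook definitions of Fleiss' κ and of Krippendorff's α with the interval (squared-difference) distance; it assumes every item has the same number of ratings with no missing values, and it is not the script used to obtain the figures reported above. Function names are ours.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that all have the same number of ratings."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    mean_agreement = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        mean_agreement += agreeing / (n_raters * (n_raters - 1))
    mean_agreement /= n_items
    # Chance agreement from the overall category proportions.
    expected = sum(
        (totals[c] / (n_items * n_raters)) ** 2 for c in categories
    )
    return (mean_agreement - expected) / (1 - expected)


def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha with squared differences as the distance."""
    values = [v for item in ratings for v in item]
    n = len(values)
    # Observed disagreement: pairs of ratings within the same item.
    observed = sum(
        sum((a - b) ** 2 for a in item for b in item) / (len(item) - 1)
        for item in ratings
    ) / n
    # Expected disagreement: pairs of ratings drawn from all items.
    expected = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))
    return 1 - observed / expected
```

With labels coded as 0 to 3, a pair of annotators choosing adjacent levels counts as a full disagreement for κ but only as a small one for α, which is why α can be noticeably higher than κ on ordinal labels such as these.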
Step #3 - Label aggregation: Multiple human annotations for the level of dialectness were aggregated into a single label. We transformed the ordinal labels (MSA, Little, Mixed, Mostly) into the numeric values (0, 1/3, 2/3, 1), then took the arithmetic mean of these as the gold-standard label, which has the range [0, 1] (see footnote 5). The distribution of the aggregated scores across four intervals is shown in Figure 2.
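The aggregation in Step #3 amounts to a lookup followed by a mean. The sketch below is illustrative only; the function name is ours, and skipping labels outside the four levels (such as Missing or Not Arabic) is an assumption rather than a documented detail of the dataset.

```python
from fractions import Fraction

# Numeric values of the four ordinal levels, as given in Step #3.
LEVEL_TO_SCORE = {
    "MSA": Fraction(0),
    "Little": Fraction(1, 3),
    "Mixed": Fraction(2, 3),
    "Mostly": Fraction(1),
}


def aldi_score(levels):
    """Mean of the numeric values of one comment's annotations.

    Assumption: labels outside the four levels are ignored.
    """
    scores = [LEVEL_TO_SCORE[l] for l in levels if l in LEVEL_TO_SCORE]
    return float(sum(scores) / len(scores))
```

For example, two MSA labels and one Little label average to 1/9, while three Mostly labels give a score of 1.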
Step #4 - Splits creation: To build reliable splits of AOC, we made sure comments to the same document are in the same split. For each source, we group sentences belonging to the same article together, shuffle these groups, and then assign the first 80% of the comments to the training split, the following 10% to the development split, and the remaining 10% to the test split. This way, the dev and test sets evaluate whether a model generalizes to comments from articles not seen in training. The distribution of the sources across AOC-ALDi's splits is in Table 4.

Qualitative Analysis: Table 5 shows three example sentences from the AOC-ALDi dataset with their corresponding annotations where all annotators labeled the dialect as either MSA, EGY, or GEN. The first sentence begins with an English loanword. The rest of the sentence has MSA terms that will not sound natural if pronounced according to the phonetic rules of a variant of DA. Unsurprisingly, two annotators considered the sentence to be in MSA, while the third might have perceived the presence of the loanword as a sign of dialectness, thus marking the sentence as little dialectal. The second example shows code-switching between MSA and Egyptian DA, but an Egyptian can still naturally pronounce the MSA portion abiding by the phonetic rules of Egyptian Arabic. This might be the reason why one of the annotators labeled the sentence as mostly dialectal (see Parkinson (1991), who observed the same relation between pronunciation and perceived levels of dialectness). For the third example, all the tokens except for the first one show dialectal features, which made it easy for the three annotators to classify it as mostly dialectal.

Footnote 5: AOC-ALDi also includes the original separate labels.
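The article-level splitting of Step #4 can be sketched for a single source as below. How a group that straddles the 80% or 90% boundary is handled is not specified in the text; here the whole group goes to the split in which its first comment would fall, which is our assumption. The function name, the `(article_id, text)` input layout, and the seed are likewise hypothetical.

```python
import random


def split_by_article(comments, seed=0):
    """Split (article_id, text) pairs 80/10/10 without splitting an article."""
    groups = {}
    for article_id, text in comments:
        groups.setdefault(article_id, []).append((article_id, text))

    article_groups = list(groups.values())
    random.Random(seed).shuffle(article_groups)

    total = len(comments)
    splits = {"train": [], "dev": [], "test": []}
    seen = 0  # number of comments already assigned
    for group in article_groups:
        # Integer comparisons avoid floating-point edge cases at the cut-offs.
        if 10 * seen < 8 * total:
            name = "train"
        elif 10 * seen < 9 * total:
            name = "dev"
        else:
            name = "test"
        splits[name].extend(group)
        seen += len(group)
    return splits
```

Because whole groups are assigned at once, no article contributes comments to more than one split, which is the property the dev and test sets rely on.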

Models
The main model we use to predict ALDi is a BERT-based regression model. Using the training split of AOC-ALDi, we fine-tune a regression head on top of MarBERT, an Arabic BERT model (Abdul-Mageed et al., 2021a), and clip the output to the range [0, 1]. To measure the consistency of the model's performance, we repeat the fine-tuning process three times using 30, 42, and 50 as the random seeds, and report averaged evaluation scores for the model (similarly for Baseline #3). We compare this model to three baselines, which use existing Arabic resources and are not trained on AOC-ALDi.

Baseline #1 - Proportion of tokens not found in an MSA lexicon: The presence of dialectal lexical terms is one of the main signals that humans use to determine dialectal text. Sajjad et al. (2020) built an MSA lexicon from multiple MSA corpora. They then computed the percentage of tokens within a sentence not found in the MSA lexicon as a proxy for sentence-level dialectness. We replicate this method using the tokens occurring more than once in the Arabic version of the United Nations Proceedings corpus (Ziemski et al., 2016) as the source for the MSA lexicon.

Baseline #2 - Sentence-Level DI: We use an off-the-shelf DI model implemented by Obeid et al. (2020) based on Salameh et al. (2018). The model is based on Naive Bayes, trained on the MADAR corpus (Bouamor et al., 2018), and uses character and word n-grams to classify a sentence into 6 variants of DA in addition to MSA. A sentence is assigned an ALDi score of 0 if it is classified as MSA and a score of 1 otherwise.

Baseline #3 - Token-level DI: Molina et al. (2016) created a token-level DI dataset (MSA-EGY token DI), in which tokens of tweets were manually tagged as MSA, EGY, Named-Entity, ambiguous, mixed, or other. We use this dataset to fine-tune a layer on top of MarBERT to tag tokens of a sentence. The tag of the first subword for each token is adopted as the tag for the whole token, as done in (Devlin et al., 2019). We use token-level tags to compute the Code-Mixing Index (CMI; Das and Gambäck 2014) as a proxy for ALDi: $\mathrm{CMI} = \frac{N_{\mathrm{EGY}}}{N_{\mathrm{EGY}} + N_{\mathrm{MSA}}}$, where $N_{\mathrm{EGY}}$ and $N_{\mathrm{MSA}}$ are the numbers of tokens tagged as EGY and MSA respectively (CMI is set to 0 if none of the tokens are tagged as MSA or EGY).
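The way each baseline turns its output into a score in [0, 1] can be sketched as below, together with the clipping applied to the regression output. The lexicon, the predicted label, and the token tags are assumed to come from the resources described above; the function names and the choice of returning 0 for an empty sentence are ours.

```python
def msa_lexicon_score(tokens, msa_lexicon):
    """Baseline #1: proportion of tokens not found in the MSA lexicon."""
    if not tokens:
        return 0.0  # assumption: an empty sentence gets a score of 0
    return sum(token not in msa_lexicon for token in tokens) / len(tokens)


def sentence_di_score(predicted_label):
    """Baseline #2: 0 if the DI model predicts MSA, 1 for any DA variant."""
    return 0.0 if predicted_label == "MSA" else 1.0


def cmi_score(tags):
    """Baseline #3: CMI = N_EGY / (N_EGY + N_MSA), or 0 if both are zero."""
    n_egy = tags.count("EGY")
    n_msa = tags.count("MSA")
    if n_egy + n_msa == 0:
        return 0.0
    return n_egy / (n_egy + n_msa)


def clip_unit(value):
    """Clip a regression output to the range [0, 1]."""
    return min(1.0, max(0.0, value))
```

Tokens tagged with anything other than EGY or MSA (for example named entities) do not enter the CMI computation, so a sentence made up only of such tokens falls back to a score of 0.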

Evaluation
Intrinsic AOC-ALDi evaluation: Treating the aggregated human-assigned scores of AOC-ALDi's [...]

Disentangling Parallel MSA/DA Sentences: For a model estimating ALDi, a minimal requirement is to assign a higher score to a DA sentence than that assigned to its corresponding MSA translation. We utilize two parallel corpora of different genres and dialects to test this requirement. First, we use a parallel corpus of 8,219 verses (sentences) from the Bible, provided by Sajjad et al. (2020), which includes versions in MSA, Tunisian, and Moroccan Arabic. We also use DIAL2MSA, which is a dataset of dialectal Arabic tweets with parallel MSA translations (Mubarak, 2018). Five MSA translations were crowd-sourced for 12,000 tweets having distinctive lexical features of Egyptian and Maghrebi Arabic. Each translation was then validated by 3 judges. For our analysis, we discard samples having a non-perfect validation confidence score, and ones that still have a distinctive dialectal lexical term in their MSA translations.
The distribution of the ALDi scores in Figure 3 reveals that MSA Lexicon does not discriminate strongly between MSA and DA, while Token DI mostly assigns scores of 0 or 1 (acting like Sentence DI), despite the possibility to do otherwise. The Sentence ALDi model provides more nuanced scores while also showing strong discrimination between MSA and DA, even for DA variants that are barely present in AOC-ALDi (TUN, MOR, MGR; note that Token DI also has not seen these). It also yields slightly lower scores for the DA versions of the Bible than for the DA tweets, indicating that the informal genre of tweets may be an indicator of stronger dialectness levels.
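The minimal requirement stated above (a DA sentence should score higher than its MSA translation) can be checked directly on parallel pairs. The helper below is a simple illustration of that check; it is not the discrimination measure reported in Appendix C, and its name and input layout are hypothetical.

```python
def fraction_correctly_ordered(score_pairs):
    """Share of (DA score, MSA score) pairs where the DA score is higher."""
    correct = sum(da > msa for da, msa in score_pairs)
    return correct / len(score_pairs)
```

A binary scorer can still pass this check on most pairs, so it complements rather than replaces a look at the full score distributions.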

Analysis -Minimal Contrastive Pairs
Inspired by Demszky et al. (2021)'s corpus of minimal contrastive pairs for 18 distinctive features of Indian English, we build contrastive pairs of MSA and Egyptian Arabic variants of a single sentence.We investigate 5 features of Egyptian Arabic that were previously recognized by Darwish et al. (2014).For each sentence, we generate versions with different gender markings (masculine and feminine) and word orders (SVO and VSO).
Although MSA allows for both word orders, it favors VSO (El-Yasin, 1985), whereas Egyptian Arabic favors SVO (Gamal-Eldin, 1968 as cited in Holes, 2013; Zaidan and Callison-Burch, 2014). In Table 7, we display the ALDi scores assigned by the different models to the contrastive pairs. The MSA Lexicon model considers all dialectal features to have the same impact in assigning a non-zero ALDi score (i.e., 1/3 ≈ 0.33 or 1/2 = 0.5) to the DA sentences. As implied by our previous experiment, the Token DI model acts as a sentence-level DI model, tagging all the tokens as dialectal if only one token shows a distinctive dialectal feature. This behavior might be an artifact of the model's fine-tuning dataset, where annotators were asked to use the surrounding context to determine an ambiguous token's language (EGY or MSA).
Conversely, the Sentence ALDi model provides a more nuanced distinction between the different features. The negation form (F4, F5) used in Egyptian Arabic seems to cause the model to categorically consider the sentence as highly dialectal. Less salient features such as the present progressive proclitic (F1) increase the ALDi level of the sentence, but to a lesser extent than the negation feature. We also see that the model assigns higher ALDi scores to SVO sentences than VSO, suggesting that the model may have learned the common word order in Egyptian Arabic. Finally, feminine-marked sentences tend to get higher scores compared to their masculine-marked counterparts, which may be indicative of a gender bias in the training data and resulting model: if feminine marking is less common, it may also be seen as less standard language and interpreted as non-MSA.

Case Studies (ALDi in Practice)
The same speaker can adopt different styles according to various social and linguistic factors (Kiesling, 2011). The ALDi of speech is one example of intraspeaker variation in Arabic. In this section, we provide two case studies analyzing the transcribed speeches of three different Arab presidents. We highlight how quantitatively estimating the ALDi can help in revealing different speaking styles.

Presidential Speeches in the Arab Spring
Lahlali (2011) qualitatively analyzed the usage of MSA and DA (Tunisian Arabic and Egyptian Arabic) in the last three speeches of Ben-Ali and Mubarak, the former Tunisian and Egyptian presidents during the period of the Tunisian and Egyptian revolutions. Mubarak consistently used MSA for his speeches to showcase authority and power. Ben-Ali used MSA for his first two speeches. For his last speech, he explicitly said, "I talk to you in the language of all the Tunisians", apparently using his choice of dialect as a way to identify himself with a particular group (cf. Shoemark et al. 2017; McNeil 2022).
We quantitatively replicate the analysis by visualizing the ALDi scores of the transcribed speeches. We scraped the speeches from online websites (see footnote 7) and used the HTML line breaks <br> to segment them into sentences. For each sentence, we predict the ALDi score with our model and also use the Sentence DI model to classify it as DA or MSA.
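Segmenting a transcript on its `<br>` line breaks can be sketched with the Python standard library as below. This is an illustration under the assumption that the transcript is available as an HTML fragment; the function name and the handling of other inline tags are ours, and the actual scraping code may differ.

```python
import html
import re

# Matches <br>, <br/>, <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
# Matches any remaining tag so that only the text is kept.
ANY_TAG = re.compile(r"<[^>]+>")


def segment_on_br(fragment):
    """Split an HTML fragment on <br> tags and return non-empty text units."""
    sentences = []
    for part in BR_TAG.split(fragment):
        text = html.unescape(ANY_TAG.sub("", part)).strip()
        if text:
            sentences.append(text)
    return sentences
```

Each returned unit is then scored independently, so consecutive `<br>` tags or whitespace-only lines do not produce empty inputs for the model.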
Figure 4a shows that our model correctly finds nearly 0 ALDi scores throughout Ben-Ali's speech on the 10th of January, while the DI model makes a couple of errors (and similarly for Mubarak's speeches, shown in Figures 4c, 4d). Both models identify the shift to DA in the second speech (Figure 4b), with more sentences identified as DA by the DI model, and many with moderate ALDi scores. Given the nature of the speech, Ben-Ali still used formal terms while speaking in Tunisian Arabic, which is likely the reason for the intermediate ALDi scores.

El-Sisi's Speeches
Next, we studied the ALDi scores for 659 speeches of the current Egyptian president El-Sisi, scraped from almanassa.com. The transcripts are not limited to the edited presidential speech, but also include greetings, introductory comments, interventions by the audience, and signs of disfluency or hesitation. The site's editors segmented each speech into coherent sentences, embedded in <p> HTML tags, that we adopt as units of analysis.

Footnote 7: www.babnet.net and egypt-blew.blogspot.com
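Collecting the text of each `<p>` element as a unit of analysis can be sketched with the standard-library HTML parser. The class and function names are hypothetical, and the sketch assumes paragraphs are not nested; it makes no claim about the actual markup of the site beyond what is described above.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element in a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = None  # None while outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the pieces and normalize runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def extract_paragraphs(page):
    """Return the non-empty paragraph texts of an HTML page, in order."""
    parser = ParagraphExtractor()
    parser.feed(page)
    parser.close()
    return parser.paragraphs
```

Text outside `<p>` elements is ignored, so navigation or footer content on the page does not end up among the scored sentences.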
While most of these speeches are conducted in MSA, multiple cases of code-switching between MSA and Egyptian Arabic occur.For example, in Figures 4e and 4f, El-Sisi used MSA when reading the edited speech, and Egyptian Arabic with high ALDi scores when spontaneously addressing the audience before or after the edited speech.
Interestingly, Figure 4g shows three different ALDi levels as part of the same speech. El-Sisi used MSA for reading the edited speech directed to the press, discussing issues such as Egyptian-German diplomatic relations, climate change, and economic hardships. He then reacted spontaneously to two questions from the press. He attempted to answer the first question, related to gas prices, in MSA, but the sentences show code-switching between MSA and Egyptian Arabic, indicated by intermediate ALDi scores (though the DI system does not identify these). For the second question about human rights in Egypt, El-Sisi uses sentences that are more dialectal and less formal, inviting the journalist to visit Egypt in order to make a fair assessment of the situation. This is indicated by even higher ALDi scores. Samples from each segment are listed in Appendix D.
This speech is a clear example of how an Arabic speaker can adopt different levels of dialectness in their speech and indicates the ability of ALDi to reveal such differences.

Conclusion
We presented ALDi, a linguistic variable that quantifies the level of dialectness of an Arabic sentence. We release AOC-ALDi, a dataset of Arabic comments annotated with their ALDi scores. A BERT-based regression model fine-tuned on AOC-ALDi showed superior performance compared to existing baselines that are based on lexicons and DI models. Our analysis shows that the model generalizes to various Arabic dialects. In addition, the model provides a nuanced distinction of dialectal features, which token and sentence DI models cannot perform. Lastly, we presented multiple case studies demonstrating the effectiveness of ALDi in revealing new insights in Arabic text. For future work, we aim to explore the possible applications of ALDi for text analysis, especially for sociolinguistics and computational social science studies. Moreover, we aim to apply the level of dialectness work to other languages that exhibit the same phenomenon as Arabic, such as Swiss German.

Limitations
Our AOC-ALDi dataset is based on the AOC dataset, which comes mainly from news comments and might therefore represent a specific genre. Although our experiments show robustness across multiple genres of text, it will be interesting to prepare a dataset (even just for intrinsic testing) that comes from other sources, such as social media. Reannotating existing DI datasets with ALDi might be a first option to pursue.
Also, the gold-standard ALDi scores in our AOC-ALDi dataset are based on normalizing the level of dialectness annotations of the AOC dataset, which might be sub-optimal. Labeling a dataset directly with continuous ALDi scores might provide more accurate labels (though it might be more challenging for annotators).
While our experiments cover diverse dialects of Arabic, the generalizability of ALDi to more dialects of Arabic needs to be tested.
Finally, we found preliminary evidence of possible gender bias in our dataset/model.While we did not explore this issue in depth here, it will be important to consider its impact and possible mitigation strategies in future work.

A AOC Annotation Interface
Zaidan and Callison-Burch (2011) used Amazon Mechanical Turk to annotate Arabic comments they scraped from three different newspapers. They provided the annotators with minimal guidelines for determining the dialect and level of dialectness of the comments. A screenshot of their annotation interface is shown in Figure A1. While the guidelines are minimal, we think that the Arabic and English translations of the labels might have impacted the annotators' understanding of the labeling process. For instance, the annotation interface has the Not Arabic label translated to ( ) in Arabic, which actually means (Another language, or symbols). We believe that Another language or symbols is not equivalent to Not Arabic, which might make annotators interpret the guidelines differently.

B Discarded Samples from AOC-ALDi
As mentioned in §3.2, we discarded 2,038 comments that have the majority of their ALDi annotations either set to Not Arabic or are missing.Five different categories of such comments were identified as per Table A1.These categories include sentences that have only punctuation marks, are written in English or Arabizi (Romanized Arabic (Yaghan, 2008)), are just links to sites or emails, or have HTML encoded characters or formatting tags.

C Discrimination scores
For the experiments in §4.2, we computed D′, a measure of discrimination, for all models on each pair of parallel corpora. Results are shown in Table C2. On the DIAL2MSA corpora, which are likely more similar in style to AOC-ALDi, our model performs about as well as Token DI, the other BERT-based model (which, like ours, has not seen MGR in training), while also providing a wider range of scores (as shown in §4.2). Token DI does somewhat better than our model on the Bible corpora, but again by making nearly binary judgments for each sentence.

D Edited and Spontaneous Speech
As depicted in Figure 4g, El-Sisi's speech on the 22 nd of July 2022 can be split into three segments: the edited presidential speech, and two spontaneous responses from the president to questions from the succeeding press conference.We sampled a sentence from each segment as shown in Table D3 to demonstrate the three different levels of dialectness that categorize each segment.And we agreed the current circumstances endeavors all actors to bear their responsiblities by finding practical solutions and mechanisms to mitigate the impact of the crisis on the most affected countries.Q&A -Gas 0.41 . I spoke about this matter and that coordination and cooperation are required between all countries of the world regarding this topic during my talk or speech at the Jeddah conference, specifically on the issue of energy.
And we are not interested in it because you ask about it..It is very important that you know this..We are interested in it because we respect our people, and we love them, and these are not just words, we respect our people just as you respect your people.. and therefore we are not interested because you ask us about it... No..This is our moral, historical and humanitarian responsibility toward our people.This is one point.
Table D3: Three sentences with different estimated ALDi scores, sampled from the three segments of El-Sisi's speech on the 22nd of July 2022 shown in Figure 4g.

Figure 1: The distribution of the annotations for the dialect and the level of dialectness in AOC. Note that each comment has three different annotations. The 251,476 MSA annotations are not shown in the figure. The General dialect label is used when a sentence is natural in multiple variants of DA. The REST bar groups the Maghrebi, Iraqi, Unfamiliar, and Other labels.

Figure 3: The distribution of the ALDi scores assigned by the four models to sentences of the Bible and DIAL2MSA corpora. Each column (across the four plots) represents the same set of sentences as scored by the four different models, and the columns are grouped by corpus to compare the different dialectal versions of that corpus. For each plot, the orange line shows the median score, the box represents the interquartile range (IQR) [Q1, Q3] of the scores, the whiskers extend 1.5 × ∆(IQR) beyond Q1 and Q3, and the dots represent outliers beyond this. Note 1: ∆(IQR) = Q3 − Q1. Note 2: The boxplots for the Token DI and Sentence ALDi models are not significantly different across the multiple fine-tuning runs with different random seeds.
Lahlali (2011) qualitatively analyzed the usage of MSA and DA (Tunisian Arabic and Egyptian Arabic) in the last three speeches of Ben-Ali and

Do not say the truth

Table 7: The ALDi scores assigned to contrastive MSA and Egyptian Arabic sentences. Only the feminine-marked version of the sentence is shown, and tokens with dialectal features are underlined. A single score is reported if a model assigns the same score to the masculine and feminine versions of a sentence; otherwise the scores for masculine/feminine are shown. We tested VSO (favored in MSA) and SVO (favored in EGY) word orders. Note: Scores ∈ [0, 0.11] are encoded in green, while ones ∈ (0.11, 1] have a shade of purple. *: The scores for these models are averaged across three fine-tuned models with different random seeds.

Figure 4: The ALDi scores assigned to sentences of transcribed political speeches. Subfigures a) and b) represent two speeches of the former Tunisian president Ben-Ali during the Tunisian Revolution. Subfigures c) and d) represent two speeches of the former Egyptian president Mubarak during the Egyptian Revolution. Subfigures e), f), and g) are speeches of the current Egyptian president El-Sisi. The MSA/DA labels were generated by the Sentence DI model.

Figure A1: A screenshot of the annotation interface of the AOC dataset (Zaidan and Callison-Burch, 2011).

| Reason for Discarding | Sentence | Source | Level of Dialectness |
| Symbols | ******** | Cmnt (Y7) | ¬ Arabic (x13), Missing (x2) |
| English | gloves to protect the baby from infection ! | Cmnt (Ri) | ¬ Arabic (x3) |
| | I agree with you that racism exists in the United States; I also know it exists in Arab countries as well. Just remember that America elected a black president with 360 electoral college votes. In terms of numbers, that means a sweeping majority. Lets learn to be better than the Americans by developing our own democratic systems for a change...ccc | Cmnt (Ri) | ¬ Arabic (x2), MSA (x1) |

Table 1: Example sentence meaning "the man cheered us", written with different levels of dialectness in two Arabic dialects. Words with DA features are underlined.

Table 2: Statistics of the AOC dataset, showing the number of annotations of each type from each newspaper source. Each sentence has 3 independent annotations.

Table 3: The distribution of AOC's Level of Dialectness annotations. Each sentence has 3 independent annotations. Control sentences are extracted from the article body, and are most likely MSA; they are used to check the quality of the annotations.

Table 4: The number of grouped comments in AOC-ALDi's splits. The 127,835 comments, averaging 20 words each, are distributed across all splits.

Table A1: Examples of discarded AOC comments whose majority label is Not Arabic or missing.