The interplay between language similarity and script on a novel multi-layer Algerian dialect corpus

Recent years have seen rising interest in cross-lingual transfer between typologically similar languages, and between languages written in different scripts. However, the interplay between language similarity and script in cross-lingual transfer remains understudied. We explore this interplay for two supervised tasks, namely part-of-speech tagging and sentiment analysis. We introduce a newly annotated corpus of Algerian user-generated comments, comprising parallel annotations of Algerian written in Latin, Arabic, and code-switched scripts, as well as annotations for sentiment and topic categories. We establish baselines by fine-tuning multilingual language models. We further explore the effect of script vs. language similarity in cross-lingual transfer by fine-tuning multilingual models on languages that are (a) typologically distinct but use the same script, (b) typologically similar but use a distinct script, or (c) typologically similar and use the same script. We find a delicate relationship between script and typology for part-of-speech tagging, while sentiment analysis is less sensitive.


Introduction
Cross-lingual transfer has shown promising results for several tasks; however, the effect of typological relatedness and of shared script, and the interplay between the two, has received less attention. This is especially true for under-resourced vernacular languages and dialects. In this paper, we focus on Algerian, a non-standardized vernacular Arabic variety characterized by heavy use of both code-switching and borrowings. The code-switching can involve anything from local Algerian dialects (e.g., region-based Algerian or Berber) to French, English, Spanish, Modern Standard Arabic (MSA), or other Arabic dialects. The borrowings depend on the speakers' background, but are usually heavily French-based.
Algerian is a spoken language with no standardized writing, and with the rise of social media, it has become a language extensively used to communicate online. Algerian can be written in both Arabic and Latin scripts, and code-switching can therefore occur in a mixture of scripts, or within a single script. Arabic varieties written in Latin script are referred to as Arabizi, with North African varieties referred to as North African Arabizi, or NArabizi for short. For the remainder of the paper, we refer to Algerian written in Latin script as NArabizi (NA) and Algerian written in Arabic script as Algerian Arabic (DZ).
The broad usage of Algerian results in large amounts of data, but there are no resources or tools to automatically process it. To address this issue, and to investigate whether script or typological differences influence results the most, we use a corpus of user comments that reflects the nature of the Algerian vernacular dialect, with its heavy use of non-standardized spellings and code-switching.
Our main contributions are (i) a new layer of annotations (transliteration, sentiment analysis, topic classification) that builds on the Algerian NArabizi treebank corpus; (ii) an investigation of the interplay of script and typology in cross-lingual transfer for two tasks, part-of-speech (POS) tagging and sentiment analysis (SA); and (iii) a baseline model for topic categorization for Algerian. All of the data, annotations, and models are made freely available.
To the best of our knowledge, the corpus we present in this work is the first dataset of parallel Algerian texts written in NArabizi and DZ, annotated on the morphological and syntactic levels, and for which the interplay between typology and script can be investigated. We also believe that it can help develop approaches to tackle the heavily code-switched nature of the language.
In what follows, Section 2 gives a brief overview of related work. Section 3 describes our dataset and annotations, the annotation process, and detailed statistics of the data. We present benchmark experiments in Section 4, and in Section 5 our experiments for POS tagging, SA, and topic classification. In Section 6, we summarize and discuss our results, and we conclude in Section 7 with our main findings and future plans.

Related work
The vernacular Algerian language is under-resourced, and few freely available corpora and tools exist. Despite work on this language in recent years (Adouane et al., 2020; Moudjari et al., 2020; Adouane et al., 2018; Adouane and Dobnik, 2017; Cotterell et al., 2014), there is only one corpus manually annotated for morphological and syntactic analysis.
As pointed out by , Algerian is a non-codified spoken Semitic language. It is a morphologically rich language (Tsarfaty et al., 2010), although less so than MSA (Saadane and Habash, 2015). Similarly to other North African languages, it makes heavy use of code-switching and borrowings, which can either be lexicalized borrowings that receive Arabic-like morphology, or borrowings that remain invariant or take the morphology of their original language (e.g., French). Furthermore, Algerian exhibits high variance at the morphological and phonological levels, as well as in lexicon and conventions. As shown in Table 1, the Arabic name of the country "Algeria" can be written in various ways in both NArabizi and DZ scripts.
As in other North African languages written in Latin script, phonemes that do not exist in the Latin alphabet are represented by visually similar digits. For example, Table 2 shows how the digits 3 and 9 are used to represent the Arabic letters "ayin" and "qāf", respectively. The nature of the language therefore makes it an interesting avenue for exploring the interplay between language similarity and differences in script in cross-lingual transfer.
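Digit conventions of this kind can be handled with a simple character-mapping pass before any Arabic-script processing. The sketch below is a minimal illustration, not the annotation procedure used in this work: only the mappings 3 → ayin and 9 → qāf come from Table 2; the extra entry and the function name are our own assumptions.

```python
# Minimal sketch of digit-to-Arabic-letter normalization in Arabizi text.
# Only 3 -> ayin and 9 -> qaf are attested in Table 2; the mapping
# 7 -> ha' is a common Arabizi convention added here for illustration.
DIGIT_TO_ARABIC = {
    "3": "\u0639",  # ayin
    "9": "\u0642",  # qaf
    "7": "\u062D",  # ha' (assumption, not from Table 2)
}

def normalize_digits(token: str) -> str:
    """Replace Arabizi digits with their Arabic-letter counterparts."""
    return "".join(DIGIT_TO_ARABIC.get(ch, ch) for ch in token)

print(normalize_digits("las9at"))  # the digit 9 becomes the letter qaf
```

Note that a real transliteration pipeline would also need context-sensitive rules, since the same Latin character can map to several Arabic letters depending on the word.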
The script of NArabizi differs from that of the higher-resourced MSA and French, which can be seen as its culturally closest languages. However, prior work shows that transfer learning approaches can be used on NArabizi, both for POS tagging and dependency parsing: multilingual BERT (Xu et al., 2019) trained on Maltese, French, and English can successfully transfer to NArabizi, despite NArabizi not being included in pretraining. This shows the potential for multilingual language models to transfer to unseen dialects across scripts. The effect of language similarity on NLP tasks is well known (Ponti et al., 2019), with several dedicated workshop series (Nicolai et al., 2020; Zampieri et al., 2018). More recently, attention has turned to larger-scale analyses of the effects of morphological typology on language modeling (Gerz et al., 2018; Cotterell et al., 2018; Mielke et al., 2019). Cross-lingual transfer between languages with related typology is more successful than between languages that do not share similar scripts (Murikinati et al., 2020; Anastasopoulos and Neubig, 2019), especially for morphological inflection. Finally, regarding differences in script, Murikinati et al. (2020) find that using high-quality transliteration as preprocessing can improve the accuracy of such models.
However, in contrast to these previous works, we are interested in the interplay between similar typology and difference in script in cross-lingual transfer for two supervised tasks, namely POS tagging and sentiment analysis. More precisely, we are interested in investigating whether there are differences in performance based on the various Algerian scripts.

NArabizi: ycombati la misere li las9at fina welat kiste
Arabic transliteration:
Code-switched transliteration: kyste la misère
English translation: he fights the misery that sticks to us and which has become a cyst
Table 3: Example of transliteration annotations into Arabic and code-switched scripts. The NArabizi sentence is from the treebank. The translation to English is added for readers' comprehension.

Data and Annotations
The underlying dataset we use is the NArabizi treebank. This dataset comprises approximately 1,500 sentences: 1,300 NArabizi sentences extracted from an Algerian newspaper's web forum (Cotterell et al., 2014), and 200 sentences from song lyrics collected manually from the web. Each NArabizi sentence has five annotation layers: tokenization, morphology, identification of code-switching, syntax, and translation to French. The corpus is in CoNLL-U format and is freely available.
To investigate the interplay between script and typology in cross-lingual transfer for POS tagging and SA, we extend the annotations of the NArabizi treebank with two levels of annotation.

Token level: for each token of the NArabizi sentences, we
1. transliterate the token into Arabic script (i.e., DZ);
2. transliterate the token into code-switched scripts (Arabic or Latin) based on the origin of the token (and the code-switch annotation label of the treebank).

Sentence level: we annotate each sentence of the NArabizi corpus for
1. sentiment: each sentence is annotated as POS (positive), NEG (negative), NEU (neutral), or MIX (a mix of two or more of the three previous classes);
2. topic: each sentence is assigned one of the topic categories described in Section 3.4.
All the annotations were carried out by native speakers of Algerian, Arabic, and French. Two annotators worked on the token-level annotations, and three on the sentence-level annotations. Before starting the annotations, we performed a common annotation round to agree on the guidelines and discuss possible issues. During this round, we identified a set of errors in the NArabizi treebank; we therefore started by preprocessing the data and correcting some of the recurring errors. More details about our preprocessing of the dataset are given in Section 3.1, the transliteration annotations are described in Section 3.2, and the sentiment and topic annotations are described in Sections 3.3 and 3.4, respectively.

Annotation Preprocessing
The NArabizi treebank dataset contains duplicates both in document IDs and in sentences (strings), both across and within splits. Duplicate IDs refer to the same sentences, so duplicate IDs imply duplicate sentences. Duplicate sentences, however, are identical strings with different IDs. There are far more sentence duplicates than ID duplicates.
All duplicates were removed; however, as the corpus is already quite small, we avoided removing duplicates from the dev and test splits where possible. If a sentence is duplicated between the train and dev splits, we keep it in dev and remove it from train; the same is done for the test split. Across splits, we identified 9 duplicated IDs and 46 (12 unique) duplicated sentences. Within-split duplicates were only present in the train split, with 9 duplicated IDs and 28 (8 unique) duplicated sentences, of which we kept one occurrence each. Most of these duplicates come from the choruses of the song lyrics and from short common utterances such as "viva Algeria".
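A split-aware deduplication of this kind can be sketched as follows. This is an illustration in our own terms, assuming each split is a list of (sentence ID, text) pairs; it is not the exact script used for the corpus.

```python
# Sketch of split-aware deduplication: prefer keeping duplicates in
# dev/test, drop them from train, and keep one occurrence within a split.
def deduplicate(splits: dict) -> dict:
    """splits maps a split name to a list of (sent_id, text) pairs."""
    # Strings that must survive in dev/test take priority over train.
    protected = {text for name in ("dev", "test")
                 for _, text in splits.get(name, [])}
    cleaned = {}
    for name, sentences in splits.items():
        seen, kept = set(), []
        for sent_id, text in sentences:
            if text in seen:                       # within-split duplicate
                continue
            if name == "train" and text in protected:
                continue                           # cross-split: keep dev/test copy
            seen.add(text)
            kept.append((sent_id, text))
        cleaned[name] = kept
    return cleaned
```

For example, a chorus line appearing in both train and dev would be removed from train only, while its dev occurrence survives.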

Transliteration to Arabic and code-switched scripts
Two annotators expanded the annotations of the NArabizi treebank by adding, for each token of each sentence, a transliteration into Arabic script, and a code-switched version that includes both Latin and Arabic scripts. The Latin script is used for tokens that originate from Latin-scripted languages.
For example, Table 3 shows how the NArabizi sentence is transliterated into the corresponding DZ and code-switched scripts. The first word, " ", is actually a borrowing from French.
However, borrowings that are integrated into the Algerian lexicon, and that take Arabic verbal inflections, were not written in Latin script in the code-switched annotations. The two annotators were given a shared subset of 300 sentences to transliterate, i.e., these were doubly annotated. Due to the lack of codification, we do not compute inter-annotator agreement for the transliterations. The subset of 300 sentences was mainly used to set the annotation guidelines, and was extensively discussed by the annotators.
We decided to normalize some of the Latin characters that do not have equivalent pronunciations in Arabic; these were transliterated into what the native annotators deemed to be the corresponding Arabic characters. Table 4 shows the Latin letters and the Arabic forms they were transliterated to. Notably, we decided to transliterate the last letter (the phoneme gu) into a non-native Arabic letter. This letter is widely used across Algerian dialects, represents the dialectal pronunciation of qāf, and also appears in names of places and persons.
We are aware of the various efforts to develop guidelines for a conventional orthography of Algerian and other Arabic dialects (Saadane and Habash, 2015; Habash et al., 2018; Adouane et al., 2019), but we decided to keep the transliterations as close as possible to the original NArabizi pronunciations and spellings, to reflect the distinctiveness of the language and its everyday use in social media. During the transliteration annotations, several issues were identified in the original NArabizi treebank. However, since our annotators were not trained to alter the dependency treebank, only a small selection of the identified errors were corrected.
The first problem encountered is a lack of consistency in the tokenization. For example, the definite article " " ("el") can be found both as a stand-alone token and attached to a word. The same applies to the adposition "in/on" ("fi", " "), which can be found both as a stand-alone token and attached to the next word: it was kept attached in "f'doute" ("in doubt"), while it was tokenized as "f +almarikhe" for the word "falmarikhe" ("on Mars"). These tokenization errors were not corrected, as doing so would alter the dependency trees, and, as previously mentioned, our annotators were not trained for this task.
Secondly, there were also errors in the translations from NArabizi to French, likely because non-native Algerian speakers translated parts of the NArabizi treebank. We only corrected translations that did not alter the tree, i.e., where the POS did not change. Examples of these types of errors can be found in Table 10 in Appendix A.
Finally, we also found errors in the code-switching marker (the lang label in the data). Some Algerian tokens were marked as French, and vice versa. This also happened with other languages present in the data (such as Spanish, English, and MSA). One typical error concerned acronyms of football clubs, which were all labeled as Algerian. These were corrected to French, since the acronyms come from the clubs' French names. For example, the football club "MCA" stands for "Mouloudia Club d'Alger", while the Arabic name is " " ("Nadi mouloudiat aljazair").

Sentiment annotations
The sentences were classified by polarity into four classes: POS (positive), NEG (negative), NEU (neutral), and MIX (mixed). The annotation guidelines were simple: annotators were asked to use POS and NEG in clearly positive and negative cases, respectively. If a sentence does not express any polarity, NEU was assigned. When a sentence expresses a combination of two or more of the POS, NEG, or NEU polarities, annotators were asked to assign the MIX label. The inter-annotator agreement, measured with Cohen's kappa coefficient κ, is 0.71 on the doubly annotated subset of 300 sentences. Table 5 shows the distribution of the four labels across the training, development, and test sets. The distribution is unbalanced, and the large number of sentences categorized as MIX can be problematic, as this class can contain all other polarities. However, the difference between the POS and NEG classes is relatively small, which we believe makes the data suitable for binary sentiment classification.
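Cohen's κ, as used above, can be computed directly from the two annotators' label sequences. A minimal stdlib-only sketch (the example labels are invented, not drawn from the corpus):

```python
from collections import Counter

def cohens_kappa(ann1: list, ann2: list) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    # Expected agreement if each annotator labeled independently,
    # following their own label distribution.
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (observed - expected) / (1 - expected)

print(cohens_kappa(["POS", "POS", "NEG", "NEG"],
                   ["POS", "NEG", "NEG", "NEG"]))  # 0.5
```

κ corrects raw agreement for chance, which matters here because the label distribution is unbalanced.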

Topic annotations
After a first round of common analysis in collaboration with the annotators, we identified five topics. However, some sentences were difficult to classify, and we therefore decided to include the category NONE. The final dataset is annotated for the following six categories: (1) Politics: all sentences referring to or discussing political events or issues; (2) Prayer: all sentences representing prayers; (3) Religion: sentences discussing religious issues or issues related to religion in general; (4) Societal: societally related discussions, covering everything from schools and teaching to terrorism and extremism; (5) Sport: mainly covering football events, but spanning all types of sports and related events; (6) NONE: sentences that were impossible to categorize, mainly due to lack of context, as some sentences were comments responding to either articles or other comments. The final κ score on the triply annotated 300 sentences was 0.70. Table 5 also shows the distribution of topics across the three splits. Most sentences were classified as Societal or Sport. A large number of sentences could not be categorized, and few sentences were related to Religion and Prayer. Given the size of the latter two, one could argue that they should be collapsed into a single topic, as done in our benchmark experiments (see Section 4). However, we keep them separate in the annotations to facilitate further annotation in the future.

Benchmarking experiments
We perform benchmark experiments for SA and topic classification. Specifically, we use the setup from Barnes et al. (2017), who perform experiments with a logistic regression classifier with bag-of-words features (BOW) and averaged embedding features (AVE), as well as a CNN and a BiLSTM. We use their default hyperparameters (C=1, hidden dimension = 100, dropout = 0.3), train for 20 epochs, and finally test the best model on the dev set. As the label distribution for both tasks is highly skewed, we evaluate with macro F1. Given the size of the categories Prayer and Religion, we collapse them into a single topic, turning topic classification into a 5-class problem.

For the NArabizi and code-switched experiments, we create 100-dimensional fastText embeddings (Bojanowski et al., 2017) on the unlabeled NArabizi data released with the treebank. For the DZ experiments, we use available 300-dimensional MSA fastText embeddings trained on Wikipedia articles.

Table 6 shows the results. On the sentiment task, the BiLSTM performs best on NArabizi, the CNN best on DZ, and BOW best on the code-switched data. For topic classification the picture is similar, except that the BiLSTM is also best on DZ. The fact that BOW performs best on code-switched data is largely due to the large number of out-of-vocabulary words for all other methods, which require embeddings. These baseline experiments show that the dataset is challenging, and the variation means that no single model is always best. The code-switched setting is particularly challenging.
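The BOW baseline can be illustrated with a few lines of scikit-learn. This is a sketch, not the exact Barnes et al. (2017) pipeline: the toy sentences and labels are invented, and only C=1 and macro F1 come from the setup described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy stand-ins for the real train/dev sentences and sentiment labels.
train_texts = ["great match today", "terrible decision",
               "nice goal", "awful game"]
train_labels = ["POS", "NEG", "POS", "NEG"]
dev_texts = ["great goal", "awful decision"]
dev_labels = ["POS", "NEG"]

# Bag-of-words features: one count per vocabulary item.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_dev = vectorizer.transform(dev_texts)

clf = LogisticRegression(C=1.0, max_iter=1000)  # C=1 as in the setup above
clf.fit(X_train, train_labels)

macro_f1 = f1_score(dev_labels, clf.predict(X_dev), average="macro")
print(f"macro F1: {macro_f1:.2f}")
```

Macro F1 averages the per-class F1 scores, so a classifier that ignores minority classes is penalized, which is why it is preferred here over accuracy.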

The interplay between language similarity and script
The transliteration and further sentiment and topic annotations allow us to explore the interplay between typology and script in cross-lingual transfer. Previous work on zero-shot cross-lingual transfer for POS tagging on NArabizi found that the best transfer language is Maltese, a Semitic language written in Latin script, rather than MSA, which performs poorly. This begs the question: is it mainly similar typology or a shared script that leads to this result? The transliterated dataset, along with the further sentiment annotations, allows us to investigate this question in more depth, as we are able to control for script. We choose Persian and Urdu, languages written in Arabic script but morphologically distinct from DZ (we refer to this group as Script); Hebrew and Maltese, two Semitic languages written in other scripts (Typology); and MSA, which is both morphologically similar and written in Arabic script (Both). These languages are all available in UD (Zeman et al., 2020) and also have available sentiment analysis datasets (Hebrew (Amram et al., 2018), Maltese (Dingli and Sant, 2016), MSA (Nabil et al., 2015; Abdulla et al., 2013), Urdu (Khan and Nizami, 2020), Persian (Hosseini et al., 2018)). As not all sentiment datasets have the same labels as the NArabizi dataset, we remove all neutral and mixed labels to create binary sentiment data for all languages. Table 7 gives an overview of the statistics of the POS and SA datasets. The NArabizi POS data is the smallest (1,276 sentences), followed by Maltese (2,074), Urdu (5,130), Persian (5,997), Hebrew (6,216), and finally MSA (7,664). Average sentence lengths range from 16.1 tokens for NArabizi to 42.3 for MSA. The sentiment datasets vary more in size, ranging from 719 sentences for Maltese to 51,051 for MSA. The distribution of polarity is also skewed to a different degree in each dataset.
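The harmonization to binary labels amounts to a simple filter over each dataset; label names beyond POS/NEG/NEU/MIX are dataset-specific and would first need mapping onto this scheme. A minimal sketch:

```python
# Keep only clearly positive/negative examples, dropping NEU and MIX,
# to obtain binary sentiment data comparable across languages.
def to_binary(examples: list) -> list:
    """examples: list of (text, label) pairs, labels in {POS, NEG, NEU, MIX}."""
    return [(text, label) for text, label in examples
            if label in {"POS", "NEG"}]

data = [("good", "POS"), ("meh", "NEU"), ("bad", "NEG"), ("both", "MIX")]
print(to_binary(data))  # [('good', 'POS'), ('bad', 'NEG')]
```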

Modeling
We model universal POS (UPOS) tagging as a sequence labeling task and SA as a classification task using multilingual BERT (Xu et al., 2019). We fine-tune each model on the available training data in each language, using a shared set of hyperparameters selected from recommended values according to the characteristics of our data. We set the learning rate to 2e-5, the maximum sequence length to 256, and the batch size to 8 or 16, and perform early stopping once the validation score has not improved over the last epochs, saving the model that performs best on the dev set. We then test each model on its own dev and test data, on the NArabizi test set, and finally on the transliterated data. We use accuracy as our metric for POS and macro F1 for sentiment, as the latter often contains unbalanced classes, and define a baseline as the result of predicting the majority class.
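The early-stopping logic can be sketched independently of the model. The patience value below is an illustrative assumption, since the exact number of epochs without improvement is not specified above.

```python
# Patience-based early stopping on a validation score (higher is better).
# The patience of 3 epochs is an illustrative assumption.
def train_with_early_stopping(epoch_scores, patience=3):
    """Return (best_score, best_epoch) from a stream of validation scores."""
    best_score, best_epoch, epochs_since_best = float("-inf"), -1, 0
    for epoch, score in enumerate(epoch_scores):
        if score > best_score:
            best_score, best_epoch, epochs_since_best = score, epoch, 0
            # In a real run, save a checkpoint of the current model here.
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break  # stop: no improvement for `patience` epochs
    return best_score, best_epoch

print(train_with_early_stopping([0.60, 0.72, 0.71, 0.70, 0.69, 0.95]))
```

Note how the run stops before reaching the late 0.95 score: early stopping trades a small risk of missing late improvements for substantially less compute, which matters when fine-tuning on many language pairs.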

Results and Discussion
In order to quantify the zero-shot loss, we define a measure of transfer loss in Equation 1:

TL_{x→y} = S_{x→x} − S_{x→y}    (1)

where TL_{x→y} is the transfer loss experienced by a model fine-tuned on language x when transferring to language y, and S_{x→y} is the score achieved when testing a model fine-tuned on language x on language y. It is thus a measure of the performance lost in the transfer process. We also define its averaged variant:

TL_{A→B} = 1/(N_A · N_B) Σ_{x∈A} Σ_{y∈B} TL_{x→y}    (2)

where TL_{A→B} refers to the average transfer loss from languages in group A to languages in group B (group-to-group transfer loss), and N_A is the number of languages included in the experiment that belong to group A (in our case, either languages that have similar typology, or that share the same script); N_B is defined analogously.

Most cross-lingual models perform better on the transliterated data than on the original NArabizi, although training on Urdu performs below the majority baseline. This suggests that even though mBERT was not pretrained on NArabizi or DZ, there is a preference for DZ, likely because at least some of the words have been seen in pretraining, i.e., through MSA. An analysis of the tokenization shows that mBERT splits NArabizi words at a much higher rate than DZ (see Figure 1), breaking them into smaller pieces, which may account for some of the differences between the two. The fact that training on Maltese achieves the best score on both NArabizi (37.8) and DZ (38.4), however, suggests that there is still an effect of typology.
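Under the definitions above, transfer loss and its group average reduce to a few lines of code. The scores below are hypothetical, with S a dict keyed by (fine-tune language, test language) pairs; the language codes are placeholders.

```python
# Transfer loss TL_{x->y} = S_{x->x} - S_{x->y}, and its average over
# a source group A transferring to a target group B.
def transfer_loss(S, x, y):
    return S[(x, x)] - S[(x, y)]

def avg_transfer_loss(S, group_a, group_b):
    losses = [transfer_loss(S, x, y) for x in group_a for y in group_b]
    return sum(losses) / len(losses)

# Hypothetical scores, not the paper's actual results.
S = {("fa", "fa"): 95.0, ("fa", "dz"): 40.0,
     ("ur", "ur"): 92.0, ("ur", "dz"): 30.0}
print(avg_transfer_loss(S, ["fa", "ur"], ["dz"]))  # (55.0 + 62.0) / 2 = 58.5
```

A high in-language score with a high transfer loss thus indicates a model that learned language-specific rather than transferable representations.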

POS
The monolingual model trained and tested on DZ performs better (82.5 acc.) than the one trained and tested on NArabizi (76.3). When each model is tested on the other variety, the transfer losses are significant (32.7 for NArabizi → DZ, and 42.6 for DZ → NArabizi). Here too, transfer into the DZ script seems easier.
On POS, the effect of language typology is stronger than that of script, with the best results achieved by training on Maltese and Hebrew. The average transfer loss from Persian and Urdu to DZ is 70.4, while from Hebrew and Maltese to NArabizi it is 59.6, showing less transfer loss for Typology. MSA has a higher transfer loss on NArabizi (76.7) than on DZ (66.2). The differences between average transfer loss on NArabizi and DZ are also slightly larger for Script (3.4) than for Typology (3.1) or MSA (3.1).
All of this points to a complicated relationship between script and typology for POS. First, it is clear that mBERT prefers the Arabic script seen in pretraining. At the same time, typological similarity also plays a strong role in cross-lingual transfer for POS, although even here, the best scores are found on DZ.


Sentiment
Table 9 shows the results for sentiment analysis. Training in-language again produces the best results (72.1 and 80.3 on NArabizi and DZ, respectively). The transfer loss from NArabizi to DZ is relatively low (9.0), while in the opposite direction it is immense (52.6). As for POS, most models perform better on DZ, but here the best zero-shot results do not come from training on Typology. Quite the opposite: these models achieve the worst scores and the highest average transfer loss (34.8/27.9). The best models are MSA for NArabizi (62.4) and Urdu for DZ (63.9), the latter curiously performing better than NArabizi → DZ. MSA has transfer losses of 12.8/25.1, while Script has the lowest average transfer loss (9.2/4.2). This suggests that cross-lingual transfer for a more semantic task such as sentiment analysis is less reliant on both typological and script similarity.

Analysis of results
As domain differences between datasets could also lead to transfer loss, we control for this variable by first translating all data to English (used as a pivot language) and calculating domain difference using the proxy A-distance. The proxy A-distance (Glorot et al., 2011) measures the generalization error of a linear SVM trained to discriminate between two domains. We translate 1,000 sentences from each dataset to English using Google Translate and then compute the proxy A-distance. Figure 2 shows heat maps of the domain distances. For POS tagging, there is a small but insignificant negative effect of proxy A-distance on results (a Pearson coefficient of -0.264, p > 0.05). On the sentiment task, there is likewise no significant domain effect (0.264, p > 0.05). This suggests that most of the transfer loss is not due to domain mismatch.
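A common formulation of the proxy A-distance is 2(1 − 2ε), where ε is the held-out error of the domain classifier; we note this as an assumption, since the exact formula is not restated above. A minimal sketch of the final step:

```python
# Proxy A-distance from the domain classifier's generalization error eps:
# d_A = 2 * (1 - 2 * eps). At eps = 0.5 (chance level) the domains are
# indistinguishable (d_A = 0); at eps = 0 they are maximally distinct (d_A = 2).
def proxy_a_distance(domain_clf_error: float) -> float:
    return 2.0 * (1.0 - 2.0 * domain_clf_error)

# In practice, eps would come from a linear SVM trained to tell the two
# translated datasets apart; here we only illustrate the formula.
print(proxy_a_distance(0.5))   # 0.0  -> domains look the same
print(proxy_a_distance(0.05))  # 1.8  -> domains are easy to separate
```

Correlating this distance with transfer loss, as done above, separates domain mismatch from linguistic (typology/script) effects.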

Conclusion and Future work
In this paper we have described the process of annotating an available Algerian corpus with sentiment and topics, as well as the transliteration to Arabic and code-switched scripts, and finally some aspects of corpus cleanup. We performed benchmark experiments on the three script varieties and showed that they form a challenging testbed for future experiments.
We used this new resource to explore a valuable research question in cross-lingual transfer: namely, what is the interplay between language similarity and script when choosing a source language? We found that there is a delicate interplay between typology and script for transfer in part-of-speech tagging, where typology is more important, but having seen the script in pretraining also influences results. Sentiment analysis, on the other hand, is less sensitive to typological differences, while still preferring the script seen in pretraining. This suggests that the choice of transfer language is task-specific and that surprising differences can appear from one task to another.
In the future, we would like to address data-related issues and correct the tokenization and translation problems discussed in Section 3.1. Moreover, we plan to focus more concretely on the code-switching aspect of our dataset. The challenges that code-switched data poses to NLP techniques are numerous, and we would like to focus on the syntactic analysis of our code-switched data, and to explore in more detail language modeling approaches to processing it.