Building a Corpus for Palestinian Arabic: a Preliminary Study

This paper presents preliminary results in building an annotated corpus of the Palestinian Arabic dialect. The corpus consists of about 43K words drawn from diverse resources. The paper discusses some linguistic facts about the Palestinian dialect, compared with Modern Standard Arabic, especially in terms of morphological, orthographic, and lexical variations, and suggests some directions for resolving the challenges these differences pose to the annotation goal. Furthermore, we present two pilot studies that investigate whether existing tools for processing Modern Standard Arabic and Egyptian Arabic can be used to speed up the annotation of our Palestinian Arabic corpus.


Introduction and Motivation
This paper presents preliminary results towards building a high-coverage, well-annotated corpus of the Palestinian Arabic dialect (henceforth PAL), as part of an ongoing project called Curras. Building such a PAL corpus is an important first step towards developing natural language processing (NLP) applications for PAL text, such as search, retrieval, machine translation, and spell checking. The importance of processing and understanding such text is increasing with the exponential growth of socially generated dialectal content on social media and other Web 2.0 platforms.
Most Arabic NLP tools and resources were developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Using such tools to understand and process Arabic dialects (DAs) is a challenging task because of the phonological and morphological differences between DAs and MSA. In addition, there is no standard orthography for DAs. Moreover, DAs have limited standardized written resources, since most of the written dialectal content is the result of ad hoc and unstructured social conversations or commentary, in comparison to MSA's vast body of literary works.
The rest of this paper is structured as follows: We present important linguistic background in Section 2, followed by a survey of related work in Section 3. We then present the process of collecting the Curras Corpus (Section 4) and the challenges of annotating it (Section 5).

Linguistic Background
In this section we summarize some important linguistic facts about PAL that influence the decisions we made in this project. For more information on PAL and Levantine Arabic in general, see (Rice and Sa'id, 1960; Cowell, 1964; Bateson, 1967; Brustad, 2000; Halloun, 2000; Holes, 2004; Elihai, 2004). For a discussion of differences between Levantine and Egyptian Arabic (EGY), see Omar (1976).

Arabic and its dialects
The Arabic language is a collection of variants among which a standard variety (MSA) has a special status, while the rest are considered colloquial dialects (Bateson, 1967; Holes, 2004; Habash, 2010). MSA is the official written language of government, media and education in the Arab World, but it is not anyone's native language; the spoken dialects vary widely across the Arab World and are the true native varieties of Arabic, yet they have no standard orthography and are not taught in schools (Zribi et al., 2014).
PAL is the dialect spoken by Arabic speakers who live in or originate from the area of Historical Palestine. PAL is part of the South Levantine Arabic dialect subgroup (of which Jordanian Arabic is another dialect). PAL is historically the result of interaction between Syriac and Arabic and has been influenced by many other regional languages such as Turkish, Persian, English and, most recently, Hebrew. The Palestinian refugee problem has led to additional mixing among different PAL sub-dialects as well as borrowing from other Arabic dialects. We discuss next some of the important distinguishing features of PAL in comparison to MSA as well as other Arabic dialects. We consider the following dimensions: phonology, morphology, and lexicon. Like other Arabic dialects, PAL has no standard orthography.

Phonology
PAL consists of several sub-dialects that generally vary in terms of phonology and lexical preferences. Commonly identified sub-dialects include urban (which itself varies, mostly phonologically, among the major cities such as Jerusalem, Jaffa, Gaza, Nazareth, Nablus and Hebron), rural, and Bedouin. The Druze community also has some distinctive phonological features that set it apart. The variations are a miniature version of the variations in Levantine Arabic in general. Perhaps the most salient variation is the pronunciation of the /q/ phoneme (corresponding to MSA ق q; see footnote 1), which is realized as /'/ in most urban dialects, /k/ in rural dialects, and /g/ in Bedouin dialects. The Druze dialect retains the /q/ pronunciation. Another example is the /k/ phoneme (corresponding to MSA ك k), which is realized as /tš/ in rural dialects. These differences cause the word قلب qlb 'heart' to be pronounced as /qalb/, /'alb/, /kalb/ and /galb/, and to be ambiguous out of context with the word كلب klb 'dog', pronounced /kalb/ and /tšalb/.
Similarly to EGY (but unlike Tunisian Arabic), the MSA phoneme /θ/ (ث θ) becomes /s/ or /t/, and the MSA phoneme /ð/ (ذ ð) becomes /z/ or /d/ in different lexical contexts, e.g., MSA كذب kðb /kaðib/ 'lying' is pronounced /kizib/ in PAL and /kidb/ in EGY.
Footnote 1: Arabic orthographic transliterations are provided in the Habash-Soudi-Buckwalter (HSB) scheme (Habash et al., 2007), except where indicated. HSB extends Buckwalter's transliteration scheme (Buckwalter, 2004) to increase its readability while maintaining the 1-to-1 correspondence with Arabic orthography as represented in standard encodings of Arabic, i.e., Unicode, etc. The following are the only differences from Buckwalter's scheme (indicated in parentheses): . Orthographic transliterations are presented in italics. For phonological transcriptions, we follow the common practice of using '/.../' to represent phonological sequences, and we use HSB choices with some extensions instead of the International Phonetic Alphabet (IPA) to minimize the number of representations used, as was done by Habash (2010).
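The sub-dialect realizations of /q/ and /k/ described in this section can be summarized as a small lookup table. The following Python sketch is only an illustration: the table, the `pronounce` helper, and the coarse four-way split are simplifications introduced here (the text notes, for instance, that /'/ holds for most, not all, urban dialects), and only the correspondences mentioned in this section are encoded.

```python
# Illustrative only: per-sub-dialect realizations of the /q/ and /k/ phonemes,
# as described in the text. Names and structure are assumptions of this sketch.
REALIZATIONS = {
    "urban":   {"q": "'"},              # /q/ -> /'/ in most urban dialects
    "rural":   {"q": "k", "k": "tš"},   # /q/ -> /k/ and /k/ -> /tš/
    "bedouin": {"q": "g"},              # /q/ -> /g/
    "druze":   {},                      # /q/ is retained
}


def pronounce(phonemic: str, subdialect: str) -> str:
    """Map each phoneme symbol once (no chaining: /q/ -> /k/ does not become /tš/)."""
    table = REALIZATIONS[subdialect]
    return "".join(table.get(symbol, symbol) for symbol in phonemic)


for name in REALIZATIONS:
    # 'heart' (/qalb/) and 'dog' (/kalb/) under each sub-dialect
    print(name, pronounce("qalb", name), pronounce("kalb", name))
```

In the rural row, 'heart' surfaces as /kalb/ while 'dog' surfaces as /tšalb/; in the other rows 'dog' stays /kalb/. This is the out-of-context ambiguity noted above.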

Morphology
PAL, like MSA and its dialects and other Semitic languages, makes extensive use of templatic morphology in addition to a large set of affixations and clitics. There are, however, some important differences between MSA and PAL in terms of morphology. First, like many other dialects, PAL has lost nominal case and verbal mood, which remain in MSA. Additionally, PAL in most of its sub-dialects collapses the feminine and masculine plurals and duals in verbs and most nouns. Some specific inflections are ambiguous in PAL but not MSA, e.g., حبيت Hbyt /Habbēt/ 'I (or you [m.s.]) loved'.
Second, some specific morphemes are slightly or quite different in PAL from their MSA forms, e.g., the future marker is /sa/ in MSA but /Ha/ or /raH/ in PAL. Another prominent example is the feminine singular suffix morpheme (Ta Marbuta), which in MSA is pronounced as /at/ except in utterance-final position (where it is /a/). In some PAL urban sub-dialects, it has multiple allomorphs that are phonologically and syntactically conditioned: /a/ (after non-front and emphatic consonants), /e/ (after front non-emphatic consonants), /it/ (nouns in construct state, such as before possessive pronouns) and /ā/ (in deverbals before direct objects): e.g., بطة bTħ /baTT+a/ 'duck', حبة Hbħ /Habb+e/ 'pill', بطتنا bTtnA /baTT+it+na/ 'our duck' and /mdars+ā+hum/ 'she taught them'.

Lexicon
The PAL lexicon is primarily Arabic with numerous borrowings from many different languages. MSA cognates generally appear with some minor phonological changes as discussed above; a few cases include more complex changes, e.g., /biddi/ 'I want' is from MSA /bi+widd+i/ 'in my desire', and /illi/ 'relative pronoun which/who/that' corresponds to a set of MSA forms that inflect for gender and number (الذي Alðy, التي Alty, etc.). Some common PAL words are portmanteaus of MSA words, e.g., /lēš/ 'why?' corresponds to MSA /li+'ayy+i šay'/ 'for what thing?'. Examples of common words that are borrowed from other languages include the following:

Related Work

Corpus Collection and Annotation
There have been many contributions aiming to develop annotated Arabic language corpora, with the main objective of facilitating Arabic NLP applications. Notable contributions targeting MSA include the work of Maamouri and Cieri (2002), Maamouri et al. (2004), Smrž and Hajič (2006), and Habash and Roth (2009). These efforts developed annotation guidelines for written MSA content, producing large-scale Arabic Treebanks.
Contributions that are specific to DA include the development of a pilot Levantine Arabic Treebank (LATB) of Jordanian Arabic, which contained morphological and syntactic annotations of about 26,000 words (Maamouri et al., 2006). To speed up the process of creating the LATB, Maamouri et al. (2006) adapted MSA Treebank guidelines to DA and experimented with extensions to the Buckwalter Arabic Morphological Analyzers (Buckwalter, 2004). The LATB was used in the Johns Hopkins workshop on Parsing Arabic Dialect (Rambow et al., 2005; Chiang et al., 2006), which supplemented the LATB effort with an experimental Levantine-MSA dictionary. The LATB effort differs from the work presented here in two respects. First, the LATB corpus consists of conversational telephone speech transcripts, which eliminated the orthographic variation issues that we face in this paper. Second, when the LATB was created, there were no robust tools for morphological analysis of any dialects; this is no longer the case. We plan to exploit existing tools for EGY to help the annotation effort.
Other DA contributions include the Egyptian Colloquial Arabic Lexicon (ECAL) (Kilany et al., 2002), which was developed as part of the CALLHOME Egyptian Arabic (CHE) corpus (Gadalla et al., 1997), and YADAC (Al-Sabbagh and Girju, 2012), which was based on dialectal content identification and web harvesting of blogs, microblogs, and forums of EGY content. Similarly, the COLABA project (Diab et al., 2010) developed annotated dialectal content resources for Egyptian, Iraqi, Levantine, and Moroccan dialects from online weblogs.

Dialectal Orthography
Due to the lack of standardized orthography guidelines for DA, the phonological differences from MSA, and the variation within the dialects themselves, written DA content exhibits many orthographic variations. Writers of DA, regardless of context, are often inconsistent with one another and even with themselves: they sometimes use MSA-driven orthography and sometimes spell words phonologically. These orthographic variations make it difficult for computational models to properly identify and reason about the words of a given dialect (Habash et al., 2012a); hence, a conventional orthography is important. Within this scope, we can view this problem for Levantine dialects as an extension of the work of Habash et al. (2012a), who proposed the so-called CODA (Conventional Orthography for Dialectal Arabic). CODA is designed for the purpose of developing conventional computational models of Arabic dialects in general. Habash et al. (2012a) provide a detailed description of CODA guidelines as applied to EGY. Eskander et al. (2013) identify five goals for CODA: (i) CODA is an internally consistent and coherent convention for writing DA; (ii) CODA is created for computational purposes; (iii) CODA uses the Arabic script; (iv) CODA is intended as a unified framework for writing all DAs; and (v) CODA aims to strike an optimal balance between maintaining a level of dialectal uniqueness and establishing conventions based on MSA-DA similarities. CODA guidelines will be extended to cover PAL in this paper, as discussed in Section 5.3.

Dialectal Morphological Annotation
Most of the work that explored morphology in Arabic focused on MSA (Al-Sughaiyer and Al-Kharashi, 2004; Buckwalter, 2004; Habash and Rambow, 2005; Graff et al., 2009; Habash, 2010). The contributions for DA morphological analysis, however, are relatively scarce and are usually based on either extending available MSA tools to tackle DA specificities, as in the work of Abo Bakr et al. (2008) and Salloum and Habash (2011), or modeling DAs directly, without relying on existing MSA contributions. Due to the variations between MSA and DAs, available MSA tools and resources cannot be easily extended or transferred to work properly for DA (Maamouri et al., 2006; Habash et al., 2012b). Therefore, it is important to develop annotated and morpheme-segmented resources, along with morphological analysis tools, that are specific and tailored for DAs. One of the notable recent contributions for EGY morphological analysis was CALIMA (Habash et al., 2012b). The CALIMA analyzer for EGY and the commonly used SAMA analyzer for MSA (Graff et al., 2009) are central to the functioning of the EGY morphological tagger MADA-ARZ, and its successor MADAMIRA (Pasha et al., 2014), which supports both MSA and EGY.
The work we present in this paper builds on these previous efforts, from the development of guidelines for orthography and morphology (in MSA and EGY) to the use of existing tools (specifically MADAMIRA MSA and EGY) to speed up the annotation process.

Corpus Collection
Dialects in general tend to have scarce written resources; written materials usually involve informal conversations or traditional folk literature (stories, songs, etc.). It is therefore often difficult to find resources for written dialectal content. In addition, resources of dialectal content are prone to significant noise and inconsistency because they tend to lack standard orthographies and rely on ad hoc transcriptions and orthographic borrowing from the standard variety. In the case of Arabic, MSA dominates formal and written outlets such as the press, scientific articles, books, and historical narration, whereas DAs are more naturally used in traditional and informal contexts, such as conversations in TV series, movies, or on social media platforms, providing socially powered commentary on different domains and topics. Given the lack of standard orthography, there is common mixing of phonetic spelling and MSA-cognate-based spelling, in addition to the so-called Arabizi spelling, i.e., writing DAs in Roman script rather than Arabic script (Darwish, 2014). Such noise imposes many challenges on the collection of high-coverage, high-accuracy DA corpora. It is therefore important to remark that although bigger is better when it comes to corpus size, in this first iteration of our PAL corpus we focus more on precision and variety than on mere size. That is, we tried not only to manually select and review the content of the corpus, but also to ensure that we covered a variety of topics and contexts, localities and sub-dialects, including the social class and gender of the speakers and writers. Such aspects help us discover new language phenomena in the dialect, as will be discussed in the next section. Table 1 presents the resources that we manually collected to build the PAL Curras corpus.
There are 133 social media threads (about 16K words) from blogs (e.g., مدونة عبد الحميد العاطي Abdelhameed Alaaty's blog), forums (e.g., شبكة الحوار الفلسطيني The Palestinian dialogue network), Twitter, and Facebook. The collection was done by reading many discussion threads and selecting the relevant ones to ensure diversity and representative PAL content. Content that is heavily written in a mix of languages or a mix of other dialects was excluded. In the same way, we also manually collected some PAL stories, and a list of PAL terms and their meanings, which reflect additional diversity of topics, contexts, and social classes. About half of our corpus comes from 41 episode scripts from the Palestinian TV show وطن ع وتر "Watan Aa Watar". Each episode offers satirical critiques of daily-life topics relevant to Palestinian viewers. The show's importance stems from the fact that the actors use a variety of local Palestinian dialects, hence enriching the coverage of the corpus.

Corpus Annotation Challenges
This section presents our approach to annotating the Curras corpus. We start with a specification of our annotation goals, followed by a discussion of our general approach. We then discuss in more detail two important challenges that need to be addressed when annotating a new dialectal corpus: orthography and morphology.

Annotation Specification
The words are annotated in context. As such, the same word may receive different annotations in different contexts. We define the annotation of a word as a tuple <w, w_B, c, c_B, l, p_B, g, i>, described as follows (examples of such annotations are illustrated in Table 5):
• w: Raw (Unicode). The raw input word, defined as a string of letters delimited by white space and punctuation. The word is represented in Arabic script (Unicode).
• w_B: Raw (Buckwalter). The same raw input word in the commonly used Buckwalter transliteration (Buckwalter, 2004).
• c: CODA (Unicode). The conventional orthography (CODA) form of the word, in Arabic script.
• c_B: CODA (Buckwalter). The CODA form of the word in Buckwalter transliteration.
• l: Lemma. The lemma is the citation form or dictionary entry that abstracts over all inflectional morphology (but not derivational morphology). The lemma is fully diacritized. We follow the definition of lemma used in BAMA (Buckwalter, 2004) and CALIMA-ARZ (Habash et al., 2012b).
• p_B: Buckwalter POS. The Buckwalter full POS tag, which identifies all clitics and affixes and the stem and assigns each a subtag. This representation treats clitics as separate tokens and abstracts the orthographic rewrites they undergo when cliticized. See the handling of the l/PREP+Al/DET in word #6 in Table 5. This representation is used by the LDC in the Penn Arabic Treebank (PATB) (Maamouri et al., 2004) and tools such as MADAMIRA (Pasha et al., 2014). It is a high-granularity representation that allows researchers to easily go to coarser-granularity POS (Diab, 2007; Habash, 2010; Alkuhlani et al., 2013). The Buckwalter POS tag can be fully diacritized or undiacritized. Given the added complexity of producing diacritized text manually by annotators, we opted at this stage to only use undiacritized forms.
• g: Gloss. The English gloss, an informal semantic denotation of the lemma. In Tables 3-5, we only use one English word for space limitations.
• i: Analysis. A specification of the source of the annotation, e.g., ANNO is a human annotator, and MADA is the MADAMIRA system with some minor or no automatic post-processing. In Tables 3 and 4, which are produced automatically, the Analysis field is replaced with a status indicating how usable the automatic annotation is.
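The annotation tuple lends itself to a simple record type. The sketch below is one possible Python representation, not the project's actual data format: the class and field names are hypothetical, c and c_B are taken to be the CODA form and its Buckwalter transliteration, and the lemma and POS values in the example are made up for illustration rather than copied from Table 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordAnnotation:
    """One in-context word annotation; field names are hypothetical."""
    raw: str       # w:   raw word in Arabic script (Unicode)
    raw_bw: str    # w_B: raw word in Buckwalter transliteration
    coda: str      # c:   CODA form in Arabic script (assumed reading of 'c')
    coda_bw: str   # c_B: CODA form in Buckwalter transliteration (assumed)
    lemma: str     # l:   fully diacritized lemma
    pos_bw: str    # p_B: undiacritized Buckwalter POS tag
    gloss: str     # g:   one-word English gloss
    source: str    # i:   origin of the analysis, e.g. "ANNO" or "MADA"


# Illustrative values only: the lemma and POS strings are guesses made for
# this sketch and are not taken from the paper's tables.
example = WordAnnotation(
    raw="منحب", raw_bw="mnHb",
    coda="بنحب", coda_bw="bnHb",
    lemma="Hab~", pos_bw="b/PROG_PART+n/IV1P+Hb/IV",
    gloss="love", source="ANNO",
)
print(example.coda_bw, example.pos_bw, example.source)
```

Making the record immutable (`frozen=True`) reflects that an annotation is tied to one word in one context; a corrected analysis would be a new record with a different `source`.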

General Approach
To speed up the process of annotating our corpus, we made the following decisions. First, and quite obviously from the previous section, we made a conscious decision to follow in the footsteps of previous efforts for MSA and EGY annotation done at the Linguistic Data Consortium and Columbia's Arabic Modeling group in terms of guidelines for orthography conventionalization and morphological annotation. This allows us to exploit existing guidelines with only essential modifications to accommodate PAL and to produce annotations that are comparable to those done for MSA and EGY. This, we hope, will encourage research in dialectal adaptation techniques and will make our annotations more familiar to, and thus more usable by, the community.
Second, and closely related to the first point, we exploit existing tools to speed up the annotation process. In this paper, we specifically use the MADAMIRA tool (Pasha et al., 2014) for morphological analysis and disambiguation of MSA and EGY. Our choice of this tool is motivated by the assumption that EGY/MSA and PAL share many orthographic and morphological features. This assumption was validated by pilot experiments, presented below, which show that most of the PAL annotations can be generated automatically. However, a manual step is then needed to verify every annotation, to correct errors and fill in gaps. The manual annotation had not been completed at the time of writing.
Finally, we made one major simplification to the annotations to minimize the load on the human annotator: we do not produce diacritized morphological analyses in the Buckwalter POS tag. The reasons for this decision are the following: (i) full diacritization is a complex task that most Arabic speakers do not do, and thus it requires a lot of training and careful attention to detail; (ii) MSA and EGY produce many morphemes and lexical items that are quite similar to PAL except in terms of the short vowels (compare the lemmas for word #5 in Tables 3, 4 and 5); (iii) PAL has many cases of multiple valid diacritizations, as mentioned above. While we think a convention should be defined to explain the variation and model it, it is perhaps the topic of a future effort that is more focused on PAL phonology. We make an exception for the lemmas and diacritize them, since lemmas are important in indicating the core meaning of the word. In case of different pronunciations of the lemma, we choose the shortest.

A Conventional Orthography for PAL
As explained in Section 2, PAL, like other Arabic dialects, does not have a standard orthography. Furthermore, there are numerous phonological, morphological and lexical differences between PAL and MSA that make using MSA spelling as-is undesirable. PAL speakers who write in the dialect produce spontaneous, inconsistent spellings that sometimes reflect the phonology of PAL, and other times the word's cognate relationship with MSA. For example, the word for 'heart' (MSA قلب qalb) has four spellings that correspond to four sub-dialectal pronunciations: قلب qlb /qalb/, ألب Âlb /'alb/, كلب klb /kalb/, and جلب jlb /galb/. Similarly, the common shortening of some long vowels (from MSA to PAL) leads to different orthographies, as in قانون qAnwn 'law' (MSA /qānūn/), which can also be written with a shortened first vowel, قنون qnwn /'anūn/, reflecting the PAL pronunciation. PAL also has some clitics that do not exist in MSA, which leads to different spellings, e.g., the PAL future particle ح H /Ha/ can be written attached to or separate from the verb that follows it. Even when a morpheme exists in both MSA and PAL, it may have additional forms or pronunciations. One example is the definite article morpheme ال Al /il/, which has a non-MSA/non-EGY allomorph /li/ when attached to nominals with initial consonant clusters. As a result, a word like /li+blād/ 'the homeland/countries' can be spelled to reflect the morphology, as البلاد AlblAd, or the phonology, as لبلاد lblAd, with the latter being ambiguous with 'for countries' (in PAL /la+blād/). Finally, there are words in PAL that have no cognate in MSA and as such have no obvious spelling, e.g., the word /barDo/ 'additionally' is spontaneously written as برضو brDw, برضه brDh and برضة brDħ.
This, of course, is not a problem unique to PAL. Researchers working on NLP for EGY and Tunisian dialects developed CODA guidelines for them (Habash et al., 2012a; Zribi et al., 2014). These guidelines were by design intended to apply (or be easily extended) to all Arabic dialects, but were only demonstrated for two. Our challenge was to take these guidelines (specifically the EGY version) and extend them. There were three types of extensions. First, in terms of phonology-orthography, we added the letter ك k to the list of root letters to be spelled as in the MSA cognate, to cover the PAL rural sub-dialects that pronounce it as /tš/. Second, in terms of morphology, we added the non-EGY demonstrative proclitic ه h+ and the conjunction proclitic ت t+ 'so as to' to the list of clitics, e.g., بهالبيت bhAlbyt 'in this house' and تيشوف tyšwf 'so that he can see'. Finally, we extended the list of exceptional words to cover problematic PAL words. All of the basic CODA rules for EGY (and Tunisian) are kept the same.

Pilot Study (I):
We conducted a small pilot study in annotating the CODA for PAL words. We considered 1,000 words from 77 tweets in Curras. The CODA version of each word was created in context. 15.9% of all words had a CODA form different from the raw input word form. 42% of these changes involve consonants (two-fifths of the cases), vowels (one-fifth of the cases) and the hamzated/bare forms of the letter Alif ا A. Examples of consonant change can be seen in Table 5 (words #4 and #10). An additional 29% of word changes involve the spelling of specific morphemes. The most common change (over half of the time) was for the first person imperfect verbal prefix ا A when following the progressive particle ب b: بكتب bktb as opposed to باكتب bAktb. About 18% of the changed words undergo a split or a merge (with splits happening five times more often than merges). An example of a CODA split is seen in Table 5 (word #9). Finally, only about 8% of the changed words were PAL-specific terms, and less than 7% involved a typo or speech-effect elongation. These results are quite encouraging, as they suggest that the differences between CODA and spontaneously written PAL are not extensive. Further analysis is still needed, of course.
In Tables 3 and 4 (column CODA), we show the results of using the MADAMIRA-MSA and MADAMIRA-EGY systems on a set of ten words, while Table 5 shows the manually selected or corrected CODA. MADAMIRA generates a CODA version (contextually) by default. We expected the EGY version to be more successful than the MSA version in producing the CODA for PAL, given the shared presence of many morphemes in EGY and PAL. However, when we ran the same set of words through MADAMIRA-EGY, we encountered many errors in words, morphemes and spelling choices in PAL that are different from EGY, e.g., the raw word منحب mnHb 'we love' (CODA بنحب bnHb) is analyzed as the EGY ما نحب mA nHb 'we do not love'!
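To make the reported proportions concrete, the short Python calculation below converts them into approximate word counts. It is a reader's back-of-the-envelope sketch: the variable names are invented here, and "less than 7%" is entered as 7, so that figure is an upper bound.

```python
# Figures reported for Pilot Study (I).
TOTAL_WORDS = 1000
CHANGED_PER_THOUSAND = 159  # 15.9% of words differ from their CODA form

# Share of the *changed* words per change type, in percent.
# "typo_or_elongation" is reported as "less than 7%"; 7 is an upper bound.
SHARES_PCT = {
    "consonant_vowel_alif": 42,
    "morpheme_spelling": 29,
    "split_or_merge": 18,
    "pal_specific_term": 8,
    "typo_or_elongation": 7,
}

# Number of changed words, using integer arithmetic to avoid float noise.
changed = TOTAL_WORDS * CHANGED_PER_THOUSAND // 1000

# Approximate number of changed words in each category.
approx_counts = {kind: round(changed * pct / 100) for kind, pct in SHARES_PCT.items()}

print(changed)
print(approx_counts)
print(sum(SHARES_PCT.values()))
```

With the upper bound of 7 for the last category, the shares add up to 104%. This suggests either rounding or that a word can fall into more than one category; the text does not say which.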
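The comparison between system-produced and manually corrected CODA forms amounts to a word-by-word check. The Python sketch below illustrates that check in transliteration. The function and variable names are hypothetical, the first row is the example from the text, and the second row is a made-up placeholder added only so that the output contains both an agreeing and a disagreeing case.

```python
def coda_mismatches(rows):
    """Return the rows whose system CODA differs from the manual (gold) CODA.

    Each row is a (raw, predicted_coda, gold_coda) triple in transliteration.
    """
    return [(raw, predicted, gold) for raw, predicted, gold in rows if predicted != gold]


ROWS = [
    ("mnHb", "mA nHb", "bnHb"),  # from the text: EGY analysis 'we do not love'
    ("qlb", "qlb", "qlb"),       # placeholder row, invented for this sketch
]

for raw, predicted, gold in coda_mismatches(ROWS):
    print(f"{raw}: system gave {predicted!r}, annotator chose {gold!r}")
```

A plain string comparison like this only flags disagreements; deciding which form is correct still requires the in-context judgment of an annotator.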

Morphological Annotation Process and Challenges
To study the value of using an existing morphological analyzer for MSA or EGY in creating PAL annotations, we conducted the following pilot study.

Pilot Study (II):
We ran the words from a randomly selected episode of the PAL TV show "Watan Aa Watar" (460 words) through both MADAMIRA-MSA and MADAMIRA-EGY. We analyzed the output from both systems to determine its usability for PAL annotations. We consider as correct all analyses that are correct for PAL annotation or usable via simple post-processing, such as removing CASE endings on MSA words (as in word #2 in Tables 3-5). Words that receive incorrect analyses or no analyses require manual modifications.
The results of this experiment are summarized in Table 2. Tables 3 and 4 illustrate sample results for ten words, and Table 5 includes the manually created results. The wrongly analyzed words are words that were assigned an incorrect POS tag in context. For example, word #3 in Tables 3 and 4 is the result of mis-analyzing the proclitic l- as the preposition 'for/to' as opposed to the non-CODA spelling of the definite article in PAL. The analysis provided by MADAMIRA-EGY is correct for contexts other than the one illustrated here. Another example is word #8, which is a Levantine-specific term hardly used in EGY and not used at all in MSA. MADAMIRA-MSA has a higher proportion of wrongly analyzed words than MADAMIRA-EGY.
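The "simple post-processing" mentioned above, removing CASE endings from MSA analyses, can be sketched as a small string operation on a Buckwalter-style POS tag. The Python function below is a hypothetical illustration, not part of MADAMIRA. It assumes the tag is laid out as `form/TAG` morphemes joined by `+` and that case morphemes carry subtags beginning with `CASE_`; the example string is invented for the sketch.

```python
def strip_case_endings(bw_pos: str) -> str:
    """Drop morphemes tagged CASE_* from a Buckwalter-style POS string.

    Assumes a 'form/TAG+form/TAG' layout; a sketch, not MADAMIRA's API.
    """
    kept = [
        morpheme
        for morpheme in bw_pos.split("+")
        # The subtag is whatever follows the last '/' in the morpheme.
        if not morpheme.rsplit("/", 1)[-1].startswith("CASE_")
    ]
    return "+".join(kept)


# Hypothetical MSA-style analysis with a nominative case ending.
msa_tag = "Al/DET+bayot/NOUN+u/CASE_DEF_NOM"
print(strip_case_endings(msa_tag))
```

Tags without a case morpheme pass through unchanged, so the function can be applied to every MSA analysis before an annotator reviews it.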
Overall MADAMIRA-EGY produced analyses that were either correct and ready to use for PAL or requiring some minor modifications such as adjusting the vowels on the lemmas (e.g., word #5) in one of every five words.

Conclusion and Future Work
We presented our preliminary results towards building an annotated corpus of the Palestinian Arabic dialect. The challenges and linguistic variations of the Palestinian dialect, compared with Modern Standard Arabic, were discussed especially in terms of morphology, orthography, and lexicon. We also discussed and showed the potential, and limitations, of using existing resources, especially MADAMIRA-EGY, to semi-automate and speed up the annotation process.
The paper has also pointed out several issues that need to be considered and researched further, especially the development of Palestinian-specific morphological annotation and CODA guidelines, a Palestinian lexicon, and the extension of MADAMIRA to analyze Palestinian text. Our corpus will be further extended to include more text, and all lexical annotations (i.e., lemmas) will be linked with existing Arabic ontology resources such as the Arabic WordNet (Black et al., 2006). The corpus will be publicly available for research purposes.