Interactive Annotation for Event Modality in Modern Standard and Egyptian Arabic Tweets

We present an interactive procedure to annotate a large-scale corpus of Modern Standard and Egyptian Arabic tweets for event modality, which comprises obligation, permission, commitment, ability, and volition.


Introduction
Event modality, according to Palmer (2001), describes events that are not actualized but are merely potential. It comprises obligation, permission, commitment, ability, and volition. Both obligation and permission emanate from an external authority such as the law, whereas commitments are obligations that speakers place on themselves, as in promises. Ability is the (in)capacity to do something. Volition is broadly defined as intentions, desires, wishes, and preferences. Event modality is used for several NLP tasks, including sales and marketing analysis (Ramanand et al. 2010, Carlos and Yalamanchi 2012), sentiment analysis (Chardon et al. 2013), the automatic detection of request emails (Lampert et al. 2010), and the classification of animacy and writers' emotions (Liao and Liao 2009, Bowman and Chopra 2012).
To date, there are no large-scale Arabic corpora annotated for event modality, unlike English (Baker et al. 2010, 2012; Rubinstein et al. 2013), Japanese (Matsuyoshi et al. 2010), Portuguese (Hendrickx et al. 2012), and Chinese (Cui and Chi 2013). One obstacle to the creation of modality-annotated corpora is the lack of consensus definitions of modality and its attributes that can be rendered into annotation tasks and guidelines. Furthermore, most modality annotation schemes use sophisticated theoretical guidelines that require annotators with a linguistic background; hence, annotation typically takes place in in-lab settings at small scales.
In this paper, we present an interactive annotation procedure to annotate event modality and its attributes of sense, polarity, intensification, tense, holders, and scopes in Modern Standard and Egyptian Arabic tweets. The procedure rests on the following ideas: first, it defines each annotation task as a series of questions that are displayed or hidden based on prior answers; second, it avoids lengthy, theoretically sophisticated definitions and instead uses the questions as simplified, self-explanatory annotation prompts; and third, based on the elicited answers, it automatically determines nested triggers and their attributes. Because our procedure does not require a special linguistic background and consists of easy-to-administer questions, it is suitable for large-scale crowdsourced annotation.
Our corpus comprises 9949 unique tweets, annotated for 12134 tokens that map to 315 unique types of event modality triggers and their attributes of sense, polarity, intensification, tense, holders, and scopes. We work on tweets because our corpus is part of a larger project that combines linguistic features, such as modality, with network-based features to automatically identify the key players of political discourse on Twitter in countries with fast-changing politics such as Egypt. Because our corpus is harvested from the Arabic Egyptian Twitter, it is diglossic: it contains both Modern Standard Arabic (MSA), the formal Arabic variety, and Egyptian Arabic (EA), the native Arabic dialect of Egypt. We evaluate the annotation results with Krippendorff's alpha (Krippendorff 2011). Results show high inter-annotator reliability rates, indicating that our annotation scheme and procedure are effective. The contribution of this paper, therefore, is twofold: first, we create a novel annotated resource for Arabic NLP that is larger than existing corpora even for languages other than Arabic; and second, we present an efficient and easy-to-administer annotation procedure with potential for interactive crowdsourcing.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
The rest of this paper is organized as follows: Section 2 outlines the annotation scheme, guidelines and the interactive procedure; Section 3 gives examples for the final output representations; Section 4 describes corpus harvesting and sampling; Section 5 provides the annotation results and disagreement analysis; and Section 6 compares and contrasts our work with related work.

Annotation Scheme: Tasks and Guidelines
Our annotation scheme comprises six tasks to label sense, polarity, intensification, tense, holders, and scopes for each event modality. Before the interactive procedure begins, we highlight all event modalities in each tweet using a string-match algorithm and the lexicons from Al-Sabbagh et al. (2013, 2014a). The algorithm finds all potential event modality triggers (i.e. words/phrases that convey event modality) within each tweet in our corpus and marks them as annotation units. A total of 12134 candidate triggers are highlighted in 9949 tweets.
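The pre-highlighting step can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the four-entry lexicon, the function name find_candidate_triggers, and the sample tweet are hypothetical, and text is given in Buckwalter transliteration for readability. Matching is whitespace-delimited and prefers longer entries, so a multiword trigger is not split into its parts.

```python
import re
from typing import List, Tuple

# Hypothetical mini-lexicon (Buckwalter transliteration). The actual lexicons
# are those of Al-Sabbagh et al. and are not reproduced here.
LEXICON = ["lAzm", "Drwry", "EAyz", "mn AlDrwry"]


def find_candidate_triggers(tweet: str, lexicon: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, surface) spans for every lexicon entry found in the tweet."""
    # Longer entries come first so multiword triggers win over their sub-strings.
    entries = sorted(lexicon, key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in entries)
    # (?<!\S) and (?!\S) require whitespace or a string boundary on both sides.
    pattern = re.compile(r"(?<!\S)(?:" + alternation + r")(?!\S)")
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(tweet)]


# Hypothetical tweet, used only to exercise the matcher.
tweet = "AnA EAyz >ErfwA mn AlDrwry >n"
spans = find_candidate_triggers(tweet, LEXICON)
for start, end, surface in spans:
    print(start, end, surface)
```

A whitespace-delimited matcher of this kind would miss triggers carrying clitics (for instance, a conjunction prefix), so a real system would need morphological handling that this sketch leaves out.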

Task 1: Sense
Sense annotation is to decide for each candidate trigger whether it actually conveys event modality given the tweet's context. The same present participle ‫حابب‬ HAbb in example 1 is a volition trigger meaning I want/desire, whereas in example 2 it is a non-modal present participle meaning like/prefer/respect. We define sense annotation as a synonymy judgment task, following Al-Sabbagh et al. (2013, 2014b). Each event modality sense is represented by an exemplar set manually selected so that: (1) each exemplar is an unambiguous event modality trigger; (2) exemplars are in both MSA and EA; (3) exemplars comprise both simple words and multiword expressions; (4) exemplars are both affirmative and negative; and (5) exemplars are of different intensities. Presented with a pre-highlighted candidate trigger in context and the exemplar sets, annotators are to decide whether the candidate trigger is synonymous with any of the exemplar sets. If not, the trigger is assumed to be non-modal.
If an annotator decides that a given candidate trigger is a non-modal, no further questions about polarity, intensification, tense, holders, or scopes are displayed. In order to guarantee that annotators do not select the non-modal option as an easy escape, they are not allowed to move forward without giving at least one synonym of their own to the candidate trigger.
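The skip logic described for Task 1 can be sketched as follows. This is a hedged illustration only: the class SenseAnswer, the function next_questions, and the task identifiers are hypothetical names, and the actual survey was built with a survey service rather than with code like this.

```python
from dataclasses import dataclass
from typing import List, Optional

# The five event modality senses named in the paper.
SENSES = {"obligation", "permission", "commitment", "ability", "volition"}
# Follow-up tasks shown only for valid triggers (identifiers are hypothetical).
FOLLOW_UP_TASKS = ["polarity", "intensification", "tense", "holders", "scopes"]


@dataclass
class SenseAnswer:
    sense: Optional[str]               # one of SENSES, or None for "non-modal"
    own_synonym: Optional[str] = None  # required when the answer is non-modal


def next_questions(answer: SenseAnswer) -> List[str]:
    """Decide which follow-up tasks are displayed after the sense question."""
    if answer.sense is None:
        # Non-modal is accepted only with an annotator-supplied synonym,
        # so it cannot serve as an easy escape.
        if not answer.own_synonym or not answer.own_synonym.strip():
            raise ValueError("a non-modal answer requires at least one synonym")
        return []  # all remaining questions stay hidden
    if answer.sense not in SENSES:
        raise ValueError(f"unknown sense: {answer.sense}")
    return list(FOLLOW_UP_TASKS)


print(next_questions(SenseAnswer("volition")))
print(next_questions(SenseAnswer(None, own_synonym="like")))
```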

Task 2: Polarity
Task 2 takes as input the candidates labeled as valid event modality triggers in Task 1 and labels each as either affirmative (AFF) or negative (NEG). To decide, annotators are instructed to consider the absence/presence of:
• Negation particles such as مش m$ (not), لا lA (not), and غير gyr (not), among others.
• Negation affixes, especially in EA, like the circumfix m...$ in مقدرش mqdr$ (I cannot).
• Negative auxiliaries where negation is placed on the past tense auxiliary as in مكنتش عايز mknt$ EAyz (I did not want).
• Embedding under negated epistemic modality triggers as in لا أعتقد أنه يجب lA >Etqd >nh yjb (I do not think it is necessary), which entails that the speaker is not actually setting an obligation.
Annotators are instructed that using multiple negation markers results in an affirmative sense. Thus, لم يعجز lm yEjz (he was not unable to) means that he was actually able to. Annotators are required to give the reason for negation if they decide that a given trigger is negative.
Throughout the examples, modality triggers are marked in boldface, and scopes are enclosed in brackets.
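The double-negation guideline in Task 2 can be read as a parity rule, sketched below. This reading is an assumption: the guidelines only say that multiple negation markers yield an affirmative sense, and the sketch extends that to "an even number of markers is affirmative, an odd number is negative". The function name polarity_label and the marker strings are hypothetical, and in the actual procedure the decision is made by human annotators, not by a rule.

```python
from typing import List, Tuple


def polarity_label(negation_markers: List[str]) -> Tuple[str, List[str]]:
    """Map the negation markers found around a trigger to (label, reasons).

    Assumption: markers cancel pairwise, so only an odd count yields NEG.
    """
    label = "NEG" if len(negation_markers) % 2 == 1 else "AFF"
    # A reason is recorded only for negative triggers, mirroring the guideline
    # that annotators justify a NEG decision.
    reasons = list(negation_markers) if label == "NEG" else []
    return label, reasons


print(polarity_label([]))                      # no marker
print(polarity_label(["circumfix m...$"]))     # single marker, as in mqdr$
print(polarity_label(["particle lm", "inherently negative yEjz"]))  # lm yEjz
```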

Task 3: Intensification
Event modality triggers have different lexical intensities (i.e. intensities encoded in the lexical meaning of the word/phrase regardless of the context). In obligation triggers, for instance, even without a context, Arabic speakers know that ضروري Drwry (necessary) expresses a higher necessity than المفروض AlmfrwD (should). When used in context, the trigger's lexical intensity can be maintained as is, or amplified/mitigated by such linguistic means as:
• Modification: adverbs like تماما tmAmA (absolutely) amplify lexical intensity, whereas adverbs like غالبا gAlbA (most probably) mitigate it.
• Coordination: two or more coordinated triggers typically result in intensity amplification as in لازم وضروري lAzm wDrwry (must and necessary).
• Embedding: epistemic modality triggers can affect the lexical intensities of the event modality triggers embedded under them. In أعتقد من الضروري أن >Etqd mn AlDrwry >n (I think it is necessary to), the strong obligation associated with الضروري AlDrwry (necessary) is mitigated by the moderate-intensity epistemic أعتقد >Etqd (I think) under which it is embedded.
The annotators' task for intensification annotation is to decide for each candidate labeled as a valid event modality trigger in Task 1 whether its lexical intensity is amplified (AMP), mitigated (MTG) or maintained (AS IS). During interactive annotation, annotators are asked to provide the reason for their selection; that is, whether the lexical intensity is affected by modification, coordination, negation, embedding or any other reason whether listed above or not.
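How an intensification label combines with a trigger's lexical intensity can be sketched as a shift along an ordered scale. The output examples in this paper describe a moderate (MOD) trigger becoming strong (STRG) when amplified and staying moderate when labeled AS IS; the sketch generalizes that to a one-step shift. The three-level scale, the WEAK level, the clamping at the ends of the scale, and the function name contextual_intensity are assumptions made for illustration.

```python
# Ordered intensity scale. MOD and STRG appear in the paper's examples;
# WEAK is an assumed lowest level.
SCALE = ["WEAK", "MOD", "STRG"]
# Assumed one-step shift per intensification label.
SHIFT = {"AMP": 1, "MTG": -1, "AS IS": 0}


def contextual_intensity(lexical: str, label: str) -> str:
    """Shift the lexical intensity by the annotated label, clamped to the scale."""
    position = SCALE.index(lexical) + SHIFT[label]
    position = max(0, min(position, len(SCALE) - 1))
    return SCALE[position]


print(contextual_intensity("MOD", "AMP"))    # amplified moderate trigger
print(contextual_intensity("MOD", "AS IS"))  # maintained moderate trigger
```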

Task 4: Tense
In this version of our event modality corpus, we work on the present and past tenses only. Thus, Task 4 is to decide for each valid event modality trigger from Task 1 whether it is present (PRS) or past (PST). Annotators are required to give their reasons for selecting either PRS or PST.

Task 5: Holders
Holder annotation identifies the source of the obligation, permission, commitment, ability, or volition. In example 3, the source that sets the obligation that Egyptians have to learn the meaning of democracy is the Twitter user. The holder is not always the Twitter user, however. In example 4, the Twitter user quotes Kamal Alganzoury, a former Egyptian Prime Minister, stating that he does not want to continue as the Prime Minister. Therefore, the holder of the negated volition trigger ليس لدي رغبة lys ldy rgbp (not have a will) is Alganzoury, not the Twitter user. This is an example of the nested holder notion first proposed by Wiebe et al. (2005) and Saurí and Pustejovsky (2009).
Another example of nested holders is example 5. We know that the regime is incapable of maintaining security and protecting the people only because the Twitter user says so. Put differently, the best way to understand this tweet is that, according to what the Twitter user holds as a true proposition, the regime is unable to maintain security and protect the people.

We can have two or more nested holders. In example 4, the two holders are Alganzoury, who expresses his unwillingness to continue as Prime Minister, and the Twitter user, who is quoting Alganzoury. In example 5, the two holders are the regime that is incapable of maintaining security and protecting its people and the Twitter user who holds this proposition as true. In example 6, we have three nested holders: the Iranians who are unwilling to confront the outside world, Obama who holds that as a true proposition about Iranians, and the Twitter user who is quoting Obama stating his proposition. During the interactive procedure, annotators are first asked whether the holder is the same as the Twitter user. If not, more questions are displayed to determine (1) who the real holder is; and (2) whether the tweet is a direct or indirect quote or conveys the Twitter user's assumptions.
When the holder is not the Twitter user, annotators are asked to mark the boundaries of the linguistic unit that corresponds to the holder in the tweet's text. Annotators are instructed to use the maximal length principle from Szarvas et al. (2008) so that they mark the largest possible meaningful linguistic unit. Thus, in example 4 the holder is الدكتور كمال الجنزوري Aldktwr kmAl Aljnzwry (Dr. Kamal Alganzoury), not only Kamal Alganzoury.

Task 6: Scopes
Scopes are the events modified by the trigger, syntactically realized as clauses, verb phrases, deverbal nouns or to-infinitives, according to Al-Sabbagh et al. (2013). We use the same maximal length principle from Task 5 so that the marked scope segment corresponds to the largest meaningful linguistic unit that describes the event. Typically, scope segments are delimited by: (1) punctuation markers and (2) subordinate conjunctions.
Annotators are instructed that: (1) a single trigger may have one or more scopes; (2) two or more triggers, especially when conjoined by coordinating particles, can share the same scope; and (3) scopes are not necessarily adjacent to their triggers. Examples 7, 8 and 9 illustrate each of these guidelines, respectively.

Final Output Representation
All elicited answers during annotation are organized into the representations illustrated in the following examples. The representation of example 10 reads as: the Twitter USER strongly did not want Shafiq to win the presidential elections. The trigger اتمنيت Atmnyt (wished) is tagged as synonymous with the volition exemplar set; therefore, it denotes a DESIRE. It is then labeled as a past tense (PST), negative (NEG) trigger. Furthermore, its lexical intensity is labeled as amplified (AMP) because of the categorical negation عمري ما Emry mA (never ever). Originally, اتمنيت Atmnyt (wished) is of moderate lexical intensity, being less intense than اشتهيت A$thyt (longed for) but more intense than أردت >rdt (wanted). Given the categorical negation, the lexical intensity of اتمنيت Atmnyt (wished) goes up the scale from moderate to strong (STRG).
Example 11 reads as: the Twitter USER reports Hegazy stating that he has the ability to become the Muslims' caliph. The trigger أصلح >SlH (can) is labeled as synonymous with the ability exemplar set. It is also labeled as a present (PRS), affirmative (AFF) trigger whose lexical intensity is maintained (AS IS) in the context. Therefore, its intensity remains at its original level, which is moderate (MOD).

Example 12 shows a Twitter user who holds as true that the only thing Egypt needed was a wise politician to avoid the bloodshed. The trigger تحتاج tHtAj (needs) is labeled as an obligation trigger synonymous with تتطلب ttTlb (requires). It is also labeled as past tense (PST) given the preceding past tense auxiliary تكن tkn (was). The assigned strong (STRG) lexical intensity label is attributed to the fact that the original moderate intensity of تحتاج tHtAj (needs) is amplified by the categorical negation structure لم ... إلا lm ... <lA (nothing but).
Example 14 shows how two conjoined triggers (i.e. لازم lAzm (must) and ضروري Drwry (necessary)) that share the same holder and scope are merged into one representation, and how the conjunction amplifies the intensity of the obligation set by them both.

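One way to picture the output representations described in this section is as a small record with an ordered list of nested holders, plus a merge step for conjoined triggers. The sketch below is illustrative only: the class ModalityRecord, the function merge_conjoined, the field names, the scope placeholder, and the intensity values assigned to the two conjoined triggers are all hypothetical, and the paper does not specify how the merged intensity is computed. Here it is assumed to be one step above the stronger trigger, capped at STRG.

```python
from dataclasses import dataclass, replace
from typing import Tuple

# MOD and STRG appear in the paper's examples; WEAK is an assumed lowest level.
SCALE = ("WEAK", "MOD", "STRG")


@dataclass(frozen=True)
class ModalityRecord:
    triggers: Tuple[str, ...]  # transliterated trigger(s)
    sense: str                 # e.g. obligation, ability, volition
    polarity: str              # AFF or NEG
    tense: str                 # PRS or PST
    intensity: str             # contextual intensity on SCALE
    holders: Tuple[str, ...]   # nested holders, outermost first
    scope: str                 # text of the modified event


def merge_conjoined(a: ModalityRecord, b: ModalityRecord) -> ModalityRecord:
    """Merge two conjoined triggers that share holder, scope and labels."""
    key_a = (a.sense, a.polarity, a.tense, a.holders, a.scope)
    key_b = (b.sense, b.polarity, b.tense, b.holders, b.scope)
    if key_a != key_b:
        raise ValueError("only triggers with identical holder, scope and labels are merged")
    # Assumption: coordination amplifies one step above the stronger trigger.
    strongest = max(SCALE.index(a.intensity), SCALE.index(b.intensity))
    amplified = SCALE[min(strongest + 1, len(SCALE) - 1)]
    return replace(a, triggers=a.triggers + b.triggers, intensity=amplified)


# Example 11 as described in the text; the scope string is a placeholder.
example_11 = ModalityRecord((">SlH",), "ability", "AFF", "PRS", "MOD",
                            ("USER", "Hegazy"), "<scope>")

# Conjoined obligation triggers; the intensity values are illustrative.
first = ModalityRecord(("lAzm",), "obligation", "AFF", "PRS", "MOD", ("USER",), "<scope>")
second = ModalityRecord(("Drwry",), "obligation", "AFF", "PRS", "STRG", ("USER",), "<scope>")
merged = merge_conjoined(first, second)
print(merged.triggers, merged.intensity)
```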
Corpus Harvesting
Tweets are harvested from the Arabic Egyptian Twitter provided that (1) each tweet has at least one trending political English or Arabic hashtag; and (2) each tweet has at least one candidate event modality trigger from the Arabic modality lexicons (Al-Sabbagh et al. 2013, 2014a). We harvest tweets from a variety of users such as newspapers, TV stations, political and humanitarian campaigns, politicians, celebrities, and ordinary people. Thus, our corpus comprises both MSA, the formal Arabic variety, and EA, the native Arabic dialect of Egypt. The harvested corpus comprises 9949 unique tweets, with 12134 tokens of event modality triggers that map to 315 unique types.

Evaluation Methodology and Metrics
Our annotation tasks are of two types: (1) Tasks 1-4 are label-based, where annotators choose from a pre-defined set of labels; and (2) Tasks 5-6 are segmentation-based, where the output of the annotation is a text segment. For the segmentation-based tasks, we use an all-or-nothing method to measure inter-annotator reliability: for two segments to count as agreement, they must share both the beginning and end boundaries. We use Krippendorff's alpha α (Krippendorff 2011) as our inter-annotator reliability measure, following the most recent work on modality annotation for other languages including English (Rubinstein et al. 2013) and Chinese (Cui and Chi 2013). For more details on Krippendorff's alpha, we refer the reader to Artstein and Poesio (2008).
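For readers who want to reproduce this kind of evaluation, the sketch below computes Krippendorff's alpha for nominal labels from a coincidence matrix. It is a generic textbook-style implementation, not the authors' evaluation script, and it assumes no special treatment of missing values beyond skipping units with fewer than two labels. The paper does not state how segments enter the computation; one possible reading of the all-or-nothing criterion, shown at the end, is to treat each (start, end) pair as a nominal label so that only identical boundaries count as agreement.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, List, Sequence


def krippendorff_alpha_nominal(units: List[Sequence[Hashable]]) -> float:
    """Krippendorff's alpha for nominal data.

    `units` holds one sequence per annotated item, containing the labels
    given by the annotators who labeled that item.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # items with a single label carry no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if n <= 1 or expected == 0:
        raise ValueError("alpha is undefined when fewer than two distinct labels occur")
    # alpha = 1 - D_o / D_e, with D_o = observed / n and D_e = expected / (n (n - 1))
    return 1.0 - observed * (n - 1) / expected


# Label-based task: hypothetical polarity labels from two annotators on four items.
labels = [["AFF", "AFF"], ["AFF", "NEG"], ["NEG", "NEG"], ["NEG", "NEG"]]
print(round(krippendorff_alpha_nominal(labels), 4))

# Segmentation-based task, all-or-nothing: hypothetical (start, end) spans as labels.
segments = [[(3, 9), (3, 9), (3, 9)], [(0, 5), (0, 5), (0, 4)], [(7, 12), (7, 12), (7, 12)]]
print(round(krippendorff_alpha_nominal(segments), 4))
```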

Results
We use the SurveyGizmo survey service to implement our interactive annotation procedure because its surveys support conditional branching and skip logic. We distribute the survey on Twitter, and three annotators participate. According to the short qualifying quiz given at the beginning of the survey, all three participants are native Egyptian Arabic (EA) speakers who have at least two years of experience with Twitter. They are also university graduates who, therefore, master MSA. None of the participants has a linguistics background. Table 1 shows alpha rates for each annotation task.

Discussion and Disagreement Analysis
Among the factors that lead to high inter-annotator reliability are that: (1) the vast majority of negation is explicitly marked by negation particles that are easy to detect by human annotators; (2) the vast majority of triggers are used without any amplification or mitigation markers; and (3) punctuation markers are surprisingly informative for marking scope boundaries and direct quotations; and hence, holders.
Typically, event modality triggers are adjunct constituents that add an extra layer of meaning and can be removed without disturbing the syntactic structure. Yet, in example 15, ‫واجب‬ wAjb (a must) and ‫أوجب‬ >wjb (a more important must) have main grammatical functions as the predicates of the phrases they modify. Most of the exemplars from Section 2.1 are adjuncts; thus, none can substitute for ‫واجب‬ wAjb (a must) or ‫أوجب‬ >wjb (a more important must) in such a context.

15. [Being cautious about manipulating the revolution] is a must but [getting united for one project] is a more important must.
Highly polysemous triggers invoke disagreement because in many cases even the context is ambiguous. In example 16, أقسم >qsm (I swear) has two eligible interpretations: an epistemic trigger interpretation, I assure (you) that, and a commitment trigger interpretation, I promise (you) that. Even the context is not enough to disambiguate the two interpretations, and annotators go by what they consider the most common sense of the trigger.
Non-human or −RATIONAL holders invoke disagreement, especially for obligation versus volition triggers. The most common sense of such triggers as عايزة EAyzp (want) is volition. Yet, when the holder is −RATIONAL like الانتخابات AlAntxAbAt (the elections) in example 17, annotators disagree as to whether عايزة EAyzp means want (i.e. a volition trigger) or need (i.e. an obligation trigger).

Intensity-related disagreement is attributed mostly to the progressive verb aspect. Some annotators consider the progressive aspect, as indicated by the EA prefix b, a marker of lexical intensity amplification. Thus, they tag the volition trigger بتمنى btmnY (I wish) in example 18 as amplified, especially since it is modified by كل يوم kl ywm (every day).

Polarity-related disagreement is mainly caused by (1) negated holders and (2) contextual negation. In مفيش حد يقدر mfy$ Hd yqdr (no one can), annotators disagree as to whether يقدر yqdr (can) should be labeled as affirmative or negative. By contextual negation we mean examples like من الصعب أن نتمنى أن mn AlSEb >n ntmnY >n (it is hard to wish to), which entails negation due to the adjective الصعب AlSEb (hard).
Holder-related disagreement is attributed mainly to generic nouns and impersonal pronouns like ‫الشعب‬ Al$Eb (the people) and ‫الواحد‬ AlwAHd (one), respectively. They are interpreted by some annotators as referring implicitly to the Twitter USER. Therefore, the annotators select the USER as the only holder with zero nesting. Other annotators interpret them as referring to people in general not necessarily the Twitter USER and thus they consider these as instances of nested holders.
Tense yields almost perfect inter-annotator reliability rates. Annotation disagreement does not show any particular pattern. Therefore, we attribute minor disagreement to random errors, resulting from fatigue.

Majority Statistics
Based on majority annotations, Table 2 gives the statistics for our corpus in terms of sense, polarity, intensification, and tense. As for holder annotations, approximately 60.5% of the triggers have zero-nested holders (i.e. the tweet's writer is the same as the holder).

Table 2: Token statistics for each annotation task per event modality sense, where MD is modal, NMD is non-modal, AFF is affirmative, NEG is negative, AMP is amplified, MTG is mitigated, ASIS is as is, PRS is present, and PST is past.
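A majority annotation over three annotators can be derived as in the sketch below. The paper does not describe how cases without a majority are handled, so returning None for them is an assumption, as are the function name majority_label and the sample labels.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_label(labels: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the label chosen by more than half of the annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


print(majority_label(["AFF", "AFF", "NEG"]))    # two of three agree
print(majority_label(["AMP", "MTG", "AS IS"]))  # no majority among three
```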

Related Work
Event modality is the focus of many annotation projects. Matsuyoshi et al. (2010) annotate a corpus of English and Japanese blog posts for a number of modality senses including volition, wishes, and permission. They annotate sense, tense, polarity, holders as well as other attributes that we have not covered in our scheme such as grammatical mood. They report macro kappa inter-annotator agreement rates of 0.69, 0.70, 0.66 and 0.72 for holders, tense, sense, and polarity, respectively. Baker et al. (2010, 2012) simultaneously annotate modality and modality-based negation for Urdu-English machine translation systems. Among the modality senses they work on are requirement, permission, success, intention, ability, and desires. They report macro kappa inter-annotator agreement rates of 0.82 for sense annotation and 0.76 for scopes. They, however, do not annotate holders and do not consider nested modalities. Hendrickx et al. (2012) annotate eleven modality senses in Portuguese, including necessity, capacity, permission, obligation, and volition, among others. They report a macro kappa inter-annotator rate of 0.85 for sense annotation. Rubinstein et al. (2013) propose a linguistically-motivated annotation scheme for modalities in the MPQA English corpus. They annotate sense, polarity, holders, and scopes, among other annotation units. They work on obligation, ability, and volition among other modality senses. They attain macro alpha inter-annotator reliability rates of 0.89 and 0.65 for sense and scope, respectively. Cui and Chi (2013) apply the same scheme of Rubinstein et al. (2013) to the Chinese Penn Treebank and get alpha inter-annotator reliability rates of 0.81 and 0.39 for sense and scope annotation, respectively.
Finally, Al-Sabbagh et al. (2013) annotate event modality in MSA and EA tweets. We attain kappa inter-annotator agreement rates of 0.90 and 0.93 for sense and scope annotation, respectively, for only 772 tokens of event modality triggers.
Our annotation results, therefore, are comparable to the results in the literature. Furthermore, our annotation scheme and its tasks are orthogonal to most of the aforementioned schemes. However, the key differences between our work and related work are:
• We use a standardized taxonomy of event modality, namely Palmer's (2001), that has been shown to be valid for a variety of languages, including Arabic, according to Mitchell and Al-Hassan (1994), Brustad (2000), and Moshref (2012).
• We annotate nested holders, unlike some of the aforementioned studies (e.g. Baker et al. 2010, 2012), and use a wider range of negation and intensification markers.
• We use crowdsourcing with simplified guidelines implemented interactively to annotate a larger-scale corpus of 12134 tokens for event modality and its attributes.

Conclusion and Outlook
We presented a large-scale corpus annotated for event modality in MSA and EA tweets. We use a simplified annotation procedure that defines each annotation task as a series of questions, implemented interactively. Our scheme covers a wide range of the most common annotation units mentioned in the literature, including modality sense, polarity, intensification, tense, holders, and scopes. We deal with nested holders, which are crucial in a highly interactive genre such as tweets where users frequently quote others and make assumptions about them. We also automatically merge triggers with shared holders and scopes based on the annotators' elicited answers. The annotation procedure yields reliable results and creates a novel resource for Arabic NLP. The current version of our corpus does not, however, cover a number of issues including the future tense, grammatical moods other than the declarative, and modality entailment. By modality entailment, we mean, for example, that when a Twitter user criticizes an obligation set by another quoted person, this entails that the user does not consider the event to be required. We plan to cover these points in a future version of the corpus. Furthermore, we will use the corpus to train and test a machine learning system for the automatic processing of Arabic event modality.