Automatically building a Tunisian Lexicon for Deverbal Nouns

The sociolinguistic situation in Arabic countries is characterized by diglossia (Ferguson, 1959): whereas one variant, Modern Standard Arabic (MSA), is highly codified and mainly used for written communication, other variants (dialects) coexist with it in everyday situations. Similarly, while a number of resources and tools exist for MSA (lexica, annotated corpora, taggers, parsers, ...), very few are available for the development of dialectal Natural Language Processing tools. Taking advantage of the closeness of MSA and its dialects, one way to address the lack of resources for dialects is to adapt available MSA resources and NLP tools to process dialects. This paper adopts this general framework: we propose a method to build a lexicon of deverbal nouns for Tunisian (TUN) using MSA tools and resources as starting material.


Introduction
The Arabic language presents both a standard written form and a number of spoken variants (dialects). While dialects differ from one country to another, and sometimes even within the same country, the written variety (Modern Standard Arabic, MSA) is the same for all Arabic countries. MSA is highly codified and is used mainly for written communication and formal spoken situations (news, political debates). Spoken varieties are used in informal daily discussions and in informal written communication on the web (social networks, blogs and forums). Such unstandardized varieties differ from MSA with respect to phonology, morphology, syntax and the lexicon. Linguistic resources (lexica, corpora) and natural language processing (NLP) tools (parsers) for such dialects are very rare.
Different approaches to processing Arabic dialects are discussed in the literature. A general solution is to build specific resources and tools. For example, Maamouri et al. (2004) created a Levantine annotated corpus (oral transcriptions) for speech recognition research. Habash et al. (2005) and Habash and Rambow (2006) proposed a system including a morphological analyzer and a generator for Arabic dialects (MAGEAD), used for MSA and Levantine Arabic. Habash et al. (2012) also built a morphological analyzer for Egyptian Arabic that extends an existing resource, the Egyptian Colloquial Arabic Lexicon. Other approaches take advantage of the special relation (closeness) that exists between MSA and dialects in order to adapt MSA resources and tools to dialects. To name a few, Chiang et al. (2006) used MSA treebanks to parse Levantine Arabic. Sawaf (2010) presented a translation system for handling dialectal Arabic, using an algorithm to normalize spontaneous and dialectal Arabic into MSA. Salloum and Habash (2013) developed a translation system pivoting through MSA from some Arabic dialects (Levantine, Egyptian, Iraqi, and Gulf Arabic) to English. Hamdi et al. (2013) proposed a translation system between Tunisian (TUN) and MSA verbs using an analyzer and a generator for both variants.
Although the first kind of approach is linguistically more accurate because it takes into account the specificities of each dialect, building resources from scratch is costly and extremely time-consuming. In this paper we therefore adopt the second approach: we present a method to automatically build a lexicon of Tunisian deverbal nouns by exploiting available MSA resources as well as an existing MSA-TUN lexicon for verbs (Boujelbane et al., 2013). We use a root lexicon to generate possible deverbal nouns, which are later filtered through a large MSA lexicon.
This work is part of a larger project that aims at 'translating' TUN to an approximate form of MSA in order to use MSA NLP tools on the output of this translation process. The final lexicon of TUN deverbal nouns will be integrated into a morphological and syntactic parser for TUN.
The paper is organized as follows: in Section 2 we describe and compare some morphological aspects of MSA and TUN, focusing on derivation. We then discuss in Section 3 our approach to building a TUN lexicon of deverbal nouns from an existing MSA-TUN resource for verbs. Section 4 presents an evaluation of the results obtained, and Section 5 proposes some solutions to increase the coverage of the lexicon.

Arabic Morphology
Arabic words are built using two kinds of morphological operations: templatic and affixational. Functionally, both operations are used inflectionally or derivationally (Habash, 2010). In templatic morphology, a root and a pattern combine to form a word stem. A root is a sequence of three, four or five letters that defines an abstract notion, while a pattern is a vocalized template that marks where the root radicals are inserted. For example, combining the root f t H¹ with the verbal patterns 1a2a3 and ta1a22a3 generates two verbs: (1) fataH 'to open' and (2) tafattaH 'to bloom'. Derivation consists in replacing each digit of the pattern by the corresponding letter of the root.
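The digit-substitution step described above can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `apply_pattern` and the convention of writing a root as space-separated radicals are assumptions made for the illustration, not part of the original system.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Replace each digit of the pattern by the corresponding root radical.

    The root is given as space-separated radicals in HSB transliteration
    (a representation assumed here for illustration).
    """
    radicals = root.split()
    # Roots have at most five radicals, so only the digits 1-5 are slots.
    return "".join(
        radicals[int(ch) - 1] if ch in "12345" else ch
        for ch in pattern
    )


print(apply_pattern("f t H", "1a2a3"))     # fataH
print(apply_pattern("f t H", "ta1a22a3"))  # tafattaH
```

A repeated digit, as in ta1a22a3, simply copies the same radical twice, which is how the doubled second radical of tafattaH is obtained.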
Arabic verbs have ten basic triliteral patterns, conventionally denoted by the Roman numerals I, ..., X, and two basic quadriliteral patterns (XI, XII) (Habash, 2010). A verb is the combination of a root and a pattern.
Many deverbal nouns can be derived from verbs. Nine kinds of deverbal nouns (1, 2, 3, ..., 9) are defined in Arabic (Al-Ghulayaini, 2010), each of which corresponds to a semantic relationship between the verb and the deverbal noun (see Table 1). These deverbal nouns represent the active and the passive participles of these verbs. They are derived from the same root as the verb, using deverbal patterns which depend on the verbal pattern. Table 2 shows the TUN and MSA patterns of the active and the passive participles for the first three verbal patterns.
Table 2 is just a sample of a larger table of deverbal nouns (henceforth called TUN-MSA deverbal table) that defines for every verbal pattern all deverbals which are derived from it in MSA and TUN.
1 Arabic orthographic transliteration is presented in the Habash-Soudi-Buckwalter HSB scheme (Habash et al., 2007): (in alphabetical order) A b t θ j H x d ð r z s š S D T Ď ς γ f q k l m n h w y and the additional letters: ' , Â , Ǎ , Ā , ŵ , ŷ , h , ý .
Verbal pattern Deverbal noun MSA patterns

Overview of the Method
Our method consists in generating TUN and MSA pairs of deverbal nouns simultaneously: in a first step, we use the TUN-MSA deverbal table and an existing MSA-TUN dictionary of verbs in order to generate candidate pairs of deverbal nouns (NOUN_MSA, NOUN_TUN). These candidates are then filtered on the MSA side using an available MSA resource.

Generating pairs of deverbal nouns
As shown in the TUN-MSA deverbal table (Table 2), every verbal pattern in MSA produces several patterns of deverbal nouns (e.g., pattern IX2 yields the infinitive form Ai12i3A3). The same applies to TUN (e.g., pattern IX yields the infinitive form 12uw3iyy). A total of 54 MSA and 52 TUN nominal patterns were defined. To generate the deverbal lexicon, we used an existing TUN-MSA lexicon (Boujelbane et al., 2013) of 1,500 verbs composed of pairs of the form (P_MSA, P_TUN), where P_MSA and P_TUN are themselves pairs made of a root and a verbal pattern. The TUN side contains 920 distinct pairs and the MSA side 1,478 distinct pairs. This difference suggests that MSA is lexically richer than TUN. For every pair (a root and a pattern), we combined the root with all the nominal patterns corresponding to the verbal pattern on both sides (MSA and TUN), as shown in Figure 1. At this point, about twenty manually predefined morphological and orthographic rules are applied to the generated form in order to produce a lemma. For instance, the second root radicals /y/ and /w/ change to /ŷ/ in the MSA active participle, while the second root radical /w/ changes to /y/ on the TUN side. Another
• The verbal lexicon can associate many target verbs with one input verb; for example, the TUN verb mšý matches two different MSA verbs, mšý 'to walk' and ðhb 'to go'. Ambiguity is higher in the TUN → MSA direction: on average, a TUN pair corresponds to 1.78 MSA pairs, against 1.11 in the opposite direction. The maximum ambiguity is four in the MSA → TUN direction and sixteen in the opposite direction.
• The TUN-MSA deverbal table may define several patterns for a deverbal noun, as shown in Table 2.
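The generation step and the two sources of ambiguity listed above can be sketched as follows. Everything in this block is an illustrative assumption: the data structures, the function names, the toy lexicon entry and the TUN nominal patterns are hypothetical and are not taken from the actual TUN-MSA deverbal table. Only the weak-radical rule mirrors the example rule given in the text, and pairing nouns of the same kind on both sides is also an assumption.

```python
from itertools import product


def apply_pattern(root, pattern):
    """Replace each digit of the pattern by the corresponding root radical."""
    radicals = root.split()
    return "".join(radicals[int(c) - 1] if c in "12345" else c for c in pattern)


# Toy deverbal table (hypothetical content):
# (variety, verbal pattern, noun kind) -> list of nominal patterns.
DEVERBAL_TABLE = {
    ("MSA", "I", "active_participle"): ["1A2i3"],
    ("MSA", "I", "passive_participle"): ["ma12uw3"],
    ("TUN", "I", "active_participle"): ["1A2i3"],
    # Two TUN patterns for one participle: a source of ambiguity.
    ("TUN", "I", "passive_participle"): ["ma12uw3", "mi12uw3"],
}

# Toy verb lexicon (hypothetical entry): pairs (P_MSA, P_TUN),
# each P being a (root, verbal pattern) pair.
VERB_LEXICON = [(("f t H", "I"), ("f t H", "I"))]


def adjust_root(root, variety, kind):
    """One illustrative rule: weak second radical in active participles."""
    radicals = root.split()
    if kind == "active_participle" and len(radicals) == 3:
        if variety == "MSA" and radicals[1] in ("w", "y"):
            radicals[1] = "ŷ"
        elif variety == "TUN" and radicals[1] == "w":
            radicals[1] = "y"
    return " ".join(radicals)


def generate_pairs(verb_lexicon, table):
    """Generate candidate (NOUN_MSA, NOUN_TUN) pairs for each verb pair."""
    kinds = {kind for (_, _, kind) in table}
    pairs = set()
    for (root_msa, pat_msa), (root_tun, pat_tun) in verb_lexicon:
        for kind in kinds:
            msa_patterns = table.get(("MSA", pat_msa, kind), [])
            tun_patterns = table.get(("TUN", pat_tun, kind), [])
            for p_msa, p_tun in product(msa_patterns, tun_patterns):
                noun_msa = apply_pattern(adjust_root(root_msa, "MSA", kind), p_msa)
                noun_tun = apply_pattern(adjust_root(root_tun, "TUN", kind), p_tun)
                pairs.add((noun_msa, noun_tun))
    return pairs


candidates = generate_pairs(VERB_LEXICON, DEVERBAL_TABLE)
```

With this toy input, the single verb pair yields three candidate pairs, because the hypothetical TUN table lists two passive participle patterns. In a real setting the rule set would contain the roughly twenty rules mentioned in the text rather than the single one shown here.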
The evaluation of the deverbal lexicon on the test set is displayed in Table 3. The table shows that, without filtering, the lexicon coverage is 67.23%. Ambiguity (in the TUN→MSA direction) is 12.58, which means that, on average, 12.58 MSA deverbals are produced for a TUN deverbal. After filtering with the AFP corpus, coverage drops to 60.04% and ambiguity to 6.99. Filtering with the SAMA lexicon yields a coverage of 62.66% and an ambiguity of 7.24. Finally, filtering with AFP ∪ SAMA yields a coverage of 65.67% with an ambiguity of 7.35. As in the verbal lexicon, going from TUN to MSA is more ambiguous than the inverse direction. These ambiguity rates are consistent with MSA being lexically richer than TUN. The filtering step significantly decreases ambiguity, but it also decreases coverage. The union AFP ∪ SAMA gives the best trade-off.
We carried out an error analysis on the automatically generated lexical entries. Three major causes can explain a missing target deverbal:
1. Absence of the corresponding verb in the verbal lexicon: nouns deriving from a verb that is absent from the verb lexicon are not produced in the deverbal lexicon.
2. Missing entries in the TUN-MSA deverbal table.
3. Missing morphological and orthographic rules.
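The trade-off between coverage and ambiguity discussed above can be made concrete with a small sketch. The definition of ambiguity follows the text (average number of MSA deverbals produced per TUN deverbal). The definition of coverage used here, the share of reference pairs whose MSA target is produced, is an assumption, as are the function names and the placeholder data (`t1`, `m1`, ... stand for arbitrary word forms and `vocab_a`, `vocab_b` for two MSA resources).

```python
def coverage(lexicon, reference_pairs):
    """Share of reference (TUN, MSA) pairs whose MSA target is in the lexicon."""
    hits = sum(1 for tun, msa in reference_pairs if msa in lexicon.get(tun, set()))
    return hits / len(reference_pairs)


def ambiguity(lexicon):
    """Average number of MSA deverbals proposed per TUN deverbal."""
    return sum(len(targets) for targets in lexicon.values()) / len(lexicon)


def filter_lexicon(lexicon, msa_vocabulary):
    """Keep only MSA candidates attested in the given MSA vocabulary."""
    filtered = {tun: targets & msa_vocabulary for tun, targets in lexicon.items()}
    return {tun: targets for tun, targets in filtered.items() if targets}


# Placeholder data, purely illustrative.
lexicon = {"t1": {"m1", "m2", "m3", "m4"}, "t2": {"m5", "m6"}}
reference = [("t1", "m1"), ("t2", "m6"), ("t3", "m7")]
vocab_a = {"m1", "m2"}   # stands for one MSA resource
vocab_b = {"m2", "m5"}   # stands for another MSA resource

filtered = filter_lexicon(lexicon, vocab_a | vocab_b)

print(coverage(lexicon, reference), ambiguity(lexicon))
print(coverage(filtered, reference), ambiguity(filtered))
```

In this toy case filtering halves ambiguity but also loses one correct target (`m6`), which mirrors the behaviour reported in Table 3. Filtering with the union of the two vocabularies keeps at least as many candidates as filtering with either one alone.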
In order to estimate the share of missing deverbals that is due to the lack of coverage of the verbal lexicon, we added the verbs from which the missing deverbals of the development corpus derive. 92 verbal entries were added. Table 5 shows the resulting coverage and ambiguity on the development set. This result, although artificial, allows us to compute an upper bound that could be attained with a more complete verbal lexicon.
As shown in Table 6, enriching the verbal lexicon significantly improves the coverage of the deverbal lexicon on the test set: it rises from 67% to 73% before filtering and from 65% to 71% after filtering using AFP ∪ SAMA, whereas ambiguity remains stable.

Root lexicon and pattern correspondence table
The previous section shows that a large portion of the errors comes from the lack of coverage of the verbal lexicon. Adding 92 verbal entries raises coverage by about 6 percentage points. Among these 92 entries, 28 involved roots that were absent from the verbal lexicon; for the remaining 64, the root was already present and we merely added new patterns to it (as the root-pattern pair did not exist).
Subsequently, we divided the verbal lexicon into two independent resources: a root lexicon and a verbal pattern correspondence table.
The root lexicon is made of pairs of the form (r_MSA, r_TUN), where r_MSA is an MSA root and r_TUN is a TUN root. The root lexicon contains 1,357 entries. The MSA side contains 1,068 distinct roots and the TUN side 665. 523 entries have the same root on both sides. As in the verbal lexicon, ambiguity is higher in the TUN → MSA direction: on average, a TUN root is paired with 2.07 MSA roots, while an MSA root is paired with 1.27 TUN roots.
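The per-direction averages quoted above can be computed from a list of root pairs as sketched below. The function name and the three toy entries are assumptions for illustration only. The first two entries are loosely modelled on the mšý / ðhb example given earlier for verbs, written here as hypothetical roots.

```python
from collections import defaultdict


def average_fanout(pairs, source_index):
    """Average number of distinct target roots per distinct source root.

    pairs: list of (r_MSA, r_TUN) tuples.
    source_index: 0 for the MSA -> TUN direction, 1 for TUN -> MSA.
    """
    targets = defaultdict(set)
    for pair in pairs:
        targets[pair[source_index]].add(pair[1 - source_index])
    return sum(len(t) for t in targets.values()) / len(targets)


# Toy root lexicon (hypothetical entries), as (r_MSA, r_TUN) pairs.
root_lexicon = [
    ("m š y", "m š y"),
    ("ð h b", "m š y"),
    ("f t H", "f t H"),
]

print(average_fanout(root_lexicon, source_index=1))  # TUN -> MSA: 1.5
print(average_fanout(root_lexicon, source_index=0))  # MSA -> TUN: 1.0
```

As in the real lexicon, the toy data is more ambiguous in the TUN → MSA direction because one TUN root is paired with two MSA roots.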
The verbal pattern correspondence table indicates, for a pattern in MSA or TUN, the most frequent corresponding pattern on the other side.
In this approach, the target pattern is selected by a lookup in the verbal pattern correspondence table, while the target roots are selected by a lookup in the root lexicon. Each source root is combined with all the nominal patterns corresponding to each verbal pattern. The target deverbal is made of the target root given by the root lexicon and a target nominal pattern that depends on the target verbal pattern indicated in the verbal pattern correspondence table, as shown in Figure 2.
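The root-based variant can be sketched in the same style, here for the TUN → MSA direction only. As before, the data structures, names and table contents are hypothetical and only meant to show how the two lookups interact. In particular, the TUN pattern `m1a22i3` and the identity correspondences between verbal patterns are assumptions.

```python
def apply_pattern(root, pattern):
    """Replace each digit of the pattern by the corresponding root radical."""
    radicals = root.split()
    return "".join(radicals[int(c) - 1] if c in "12345" else c for c in pattern)


# Toy root lexicon: (r_MSA, r_TUN) pairs (hypothetical entry).
ROOT_LEXICON = [("f t H", "f t H")]

# Toy verbal pattern correspondence table:
# TUN verbal pattern -> most frequent MSA verbal pattern (hypothetical).
PATTERN_CORRESPONDENCE = {"I": "I", "II": "II"}

# Toy deverbal table: (variety, verbal pattern, noun kind) -> nominal patterns.
DEVERBAL_TABLE = {
    ("MSA", "I", "active_participle"): ["1A2i3"],
    ("TUN", "I", "active_participle"): ["1A2i3"],
    ("MSA", "II", "active_participle"): ["mu1a22i3"],
    ("TUN", "II", "active_participle"): ["m1a22i3"],
}


def generate_from_roots(root_lexicon, correspondence, table):
    """Generate (NOUN_MSA, NOUN_TUN) pairs from roots, TUN -> MSA direction."""
    pairs = set()
    for r_msa, r_tun in root_lexicon:
        # Every TUN root is combined with every TUN verbal pattern,
        # whether or not the corresponding verb is attested.
        for (variety, tun_verbal, kind), tun_patterns in table.items():
            if variety != "TUN":
                continue
            msa_verbal = correspondence.get(tun_verbal)
            msa_patterns = table.get(("MSA", msa_verbal, kind), [])
            for p_tun in tun_patterns:
                for p_msa in msa_patterns:
                    pairs.add((apply_pattern(r_msa, p_msa),
                               apply_pattern(r_tun, p_tun)))
    return pairs


root_based = generate_from_roots(ROOT_LEXICON, PATTERN_CORRESPONDENCE, DEVERBAL_TABLE)
```

Because roots are no longer tied to the verbal patterns with which they are attested, this procedure produces forms for pattern II even if no such verb was listed, which is why it can raise coverage while also generating more entries.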
Results of this experiment on the test corpus show that this method greatly increases coverage, although it also raises the number of generated entries and, consequently, ambiguity.

Conclusion
In this paper, we have presented a bilingual lexicon of deverbal nouns between MSA and TUN. Our method extends an existing TUN verbal lexicon, using a table of deverbal patterns to automatically generate pairs of TUN and MSA deverbal nouns. Several MSA resources were used to filter out wrongly generated pairs. The lexicon was evaluated using two metrics: coverage and ambiguity. The coverage of our lexicon is about 71%. Ambiguity is rather high in the TUN→MSA direction, where it reaches 8.15. A contextual disambiguation process is therefore necessary for such a process to be of practical use.
In future work, we plan to integrate this lexicon into a system that translates TUN into an approximate form of MSA, which will then be parsed using an MSA parser.

Figure
Figure 2: Generating TUN-MSA pairs of deverbal nouns using roots

Table 2: TUN-MSA Deverbal Table
This table has been created by a Tunisian native speaker. Unlike MSA, which defines a unique pattern for each participle with all verbal patterns, Table 2 shows that TUN often has more than one pattern for participles. However, in some other cases, such as the infinitive forms and nouns of instrument, MSA defines several nominal patterns. The choice of the nominal pattern depends on the verbal pattern. The Arabic nominal derivation system is not systematic and depends on the meaning of the verbs. In fact, for semantic reasons, most Arabic verbs cannot derive all deverbal nouns. The verb fataH 'open', for example, cannot produce the noun of place and time. However, fataH derives the active and the passive participles fAtiH 'opener' and maftuwH 'opened', the noun of instrument miftAH 'key' and an exaggeration form fattAH 'conqueror'...

Table 3: Results on test set

Table 4: Results in the development set
Table 4 summarizes the coverage and the ambiguity rate of the deverbal lexicon in the development and the test sets respectively.

Table 5: Results in the development set after enriching the verbal lexicon
In Table 5, coverage jumps from 66.12% to 87.33% before filtering and from 64.59% to 84.16% after filtering using AFP ∪ SAMA. The ambiguity rate increases slightly. Table 6 gives the results obtained on the test set after enriching the verbal lexicon using the development set.

Table 6: Results in the test set after enriching the verbal lexicon