Part-of-speech Tagset and Corpus Development for Igbo, an African Language

This project aims to develop linguistic resources to support computational NLP research on the Igbo language. The starting point for this project is the development of a new part-of-speech tagging scheme based on the EAGLES tagset guidelines, adapted to incorporate additional language-internal features. The tags are currently being used in a part-of-speech annotation task for the development of a POS-tagged Igbo corpus. The proposed tagset has 59 tags.


Introduction
Supervised machine learning methods in NLP require an adequate amount of training data. The first crucial step for a part-of-speech (POS) tagging system for a language is a well designed, consistent, and complete tagset (Bamba Dione et al., 2010) which must be preceded by a detailed study and analysis of the language. Our tagset was developed from scratch through the study of linguistics and electronic texts in Igbo, using the EAGLES recommendations.
This initial manual annotation is important. Firstly, information dealing with challenging phenomena in a language is expressed in the tagging guideline; secondly, computational POS taggers require annotated text as training data. Even in unsupervised methods, some annotated texts are still required as a benchmark in evaluation. With this in mind, our tagset design follows three main goals: to determine the tagset size, since a coarser granularity provides higher accuracy and less ambiguity (De Pauw et al., 2012); to use a scheme large enough to capture the grammatical distinctions at the word level that are needed for further grammatical analysis, such as parsing; and to deliver good accuracy for automatic tagging, using the manually tagged data. We discuss the development of the tagset and corpus for Igbo. This work is, to the best of our knowledge, the first published work attempting to develop statistical NLP resources for Igbo.
2 Some Grammatical Features of the Igbo Language

Language family and speakers
The Igbo language has been classified as a Benue-Congo language of the Kwa sub-group of the Niger-Congo family and is one of the three major languages in Nigeria, spoken in the eastern part of Nigeria, with about 36 million speakers. Nigeria is a multilingual country having around 510 living languages, but English serves as the official language.

Phonology
Standard Igbo has eight vowels and thirty consonants. The eight vowels are divided into two harmony groups that are distinguished on the basis of the Advanced Tongue Root (ATR) phenomenon. They are the -ATR vowels ị, ụ, a, ọ and the +ATR vowels i, u, e, o (Uchechukwu, 2008). Many Igbo words select their vowels from the same harmony group. Igbo is also a tonal language. Three distinct tones are recognized in the language, viz. High, Low, and Downstep (Emenanjo, 1978; Ikekeonwu, 1999). Tone marks are placed above the tone-bearing units (TBU) of the language.
There are two tone marking systems: either all high tones are left unmarked and all low tones and downsteps are marked (Green and Igwe, 1963; Emenanjo, 1978), or only contrastive tones are marked (Welmers and Welmers, 1968; Nwachukwu, 1995). We use the first system to illustrate the importance of tone in the language's lexical and grammatical structure. For example, at the lexical level the word akwa without a tone mark can be given the equivalent of 'bed/bridge', 'cry', 'cloth', or 'egg'. These equivalents can be properly distinguished when tone marked, as follows: akwa "cry", akwà "cloth", àkwà "bed or bridge", àkwa "egg". At the grammatical level, an interrogative sentence can be distinguished from a declarative sentence through a change in the tone of the personal pronoun from a high tone (e.g. Ọ nà-àbịa "He is coming") to a low tone (e.g. Ọ̀ nà-àbịa "Is he coming?"). There are also syllabic nasal consonants, which are tone-bearing units in the language. The syllabic nasals always occur before a consonant. For example: ǹdo 'Sorry', or explicitly tone marked as ǹdó.
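The lexical ambiguity described above can be illustrated with a minimal Python sketch that removes tone diacritics while keeping the dotted vowels. The function name `strip_tones` and the choice of combining marks are our assumptions for illustration (in particular, treating the macron as a downstep mark is an assumption, not something the text specifies); the word forms and glosses are the ones given above.

```python
import unicodedata

# Combining marks treated as tone marks in this sketch: grave (low),
# acute (high) and macron (ASSUMED here to mark downstep).
# The dot below (U+0323) belongs to the letter itself and is kept.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}

def strip_tones(word: str) -> str:
    """Remove tone diacritics but keep dotted letters such as ị, ọ, ụ."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

# Tone-marked forms and glosses taken from the paragraph above.
readings = {
    "akwa": "cry",
    "akw\u00e0": "cloth",
    "\u00e0kw\u00e0": "bed or bridge",
    "\u00e0kwa": "egg",
}

# All four forms collapse to the same unmarked string.
collapsed = {strip_tones(form) for form in readings}
```

Applied to the four forms, the sketch yields a single unmarked string, which is the situation a tagger faces with text that is not tone marked.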

Writing System
The Igbo orthography is based on the Standard Igbo by the Ọnwụ Committee (Ọnwụ Committee, 1961). There are 28 consonants: b gb ch d f g gh gw h j k kw kp l m n nw ny ṅ p r s sh t v w y z, and 8 vowels (see phonology section). Nine of the consonants are digraphs: ch, gb, gh, gw, kp, kw, nw, ny, sh.
Igbo is an agglutinative language whose lexical categories, especially the verbs, undergo affixation to form a lexical unit. For example, the word form ericharịrị is a verbal structure with four morphemes: verbal vowel prefix e-, verb root -ri-, extensional suffix -cha-, and a second extensional suffix -rịrị. Its occurrence in the sentence "Obi must eat up that food" is Obi ga-ericharịrị nri ahụ, that is, Obi aux-eat.completely.must food DET. Igbo word order is Subject-Verb-Object (SVO), with a complement to the right of the head.
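Because nine consonants are written as digraphs, a character count does not equal a grapheme count. The following minimal sketch segments a word into orthographic graphemes by greedy left-to-right matching. The function name `graphemes` is hypothetical, and greedy matching is a simplifying assumption on our part: it does not consult syllable structure, so it is an illustration rather than a description of any tokeniser used in this work.

```python
# The nine digraphs listed in the orthography above.
DIGRAPHS = {"ch", "gb", "gh", "gw", "kp", "kw", "nw", "ny", "sh"}

def graphemes(word: str) -> list[str]:
    """Split a word into graphemes, preferring a digraph where one matches.

    Assumes the input is in Unicode NFC, so that dotted vowels and the
    dotted n are single code points.
    """
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair.lower() in DIGRAPHS:
            units.append(pair)      # two letters, one grapheme
            i += 2
        else:
            units.append(word[i])   # single-letter grapheme
            i += 1
    return units
```

For instance, `graphemes("akwa")` treats kw as one unit, giving three graphemes for a four-letter word.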

Grammatical Classes
Generally, Emenanjo (1978) identified the following broad word classes for Igbo: verbal, nominal, nominal modifier, conjunction, preposition, suffixes, and enclitics. The verbal is made up of verbs, auxiliaries and participles, while the nominal is made up of nouns, numerals, pronouns and interrogatives. Nouns are further classified into five lexical classes, viz. proper, common, qualificative, adverbial and ideophones. However, we identified five additional classes in the tagset design phase (see the appendix). Nominal modifiers occur in a noun phrase. Their four classes are adjectives, demonstratives, quantifiers and pronominal modifiers. Conjunctions link words or sentences together, while prepositions are found preceding nominals and verbals and cannot be found in isolation. Suffixes and enclitics are the only bound elements in the language. Suffixes are primarily affixed to verbals only, while enclitics are used with both verbals and other word classes. Suffixes are found in verb phrase slots and enclitics can be found in both verb phrase and noun phrase slots. The language does not have a grammatical gender system.

Language Resources
The development of NLP resources for any language is based on the linguistic resources available for the language. This includes appropriate fonts and text processing software as well as the available electronic texts for the work. The font and software problems of the language have been addressed through the Unicode development (Uchechukwu, 2005; Uchechukwu, 2006). The next issue is the availability of Igbo texts.
Any effort towards Igbo corpus development is a non-trivial task. There are basic issues connected with the nature of the language. The first major surprise is that Igbo texts written by native speakers for native speakers vary in form due to dialectal differences and are usually not tone marked. Indeed, tone marking of the kind used in the sections above is usually found only in academic articles. It would be strange to find an Igbo text (literary work) that is fully tone marked, and no effort has been made to tone mark existing Igbo texts. Such an effort looks impossible as more Igbo texts are written and published. Such is the situation that confronts any effort to develop an Igbo corpus. Hence, developing NLP resources for the language has to start with the available resources; otherwise, such an endeavour would first have to take a backward step of tone marking all the texts to be added to its corpus and normalizing the dialectal differences. This is no mean task.
It is for this reason that we chose the New World Translation (NWT) Bible version for the Igbo corpus, together with its English parallel text. The NWT Bible does not adopt a particular tone marking system, nor is there a consistent use of tone marks for all the sentences in the Bible. Instead, there is narrow use of tone marks in specific and restricted circumstances throughout the book. An example is when there is a need to disambiguate a particular word. For instance, ihe without a tone mark could mean 'thing' or 'light'. These two are always tone marked in the Bible to avoid confusion; hence ìhè 'light' and íhé 'thing'. The same applies to many other lexical items. Another instance is the placement of a low tone on the personal pronouns to indicate the onset of an interrogative sentence, which otherwise would be read as a declarative sentence. This particular example has already been cited as one of the uses of tone marks in the language. Apart from such instances, the sentences in the Bible are not tone marked. As such, one cannot rely on such restricted use of tone marks for any major conclusions on the grammar of the language. With regard to corpus work in general, the Bible has been described as consistent in its orthography, most easily accessible, carefully translated (most translators believe it is the word of God), and well structured (books, chapters, verses) (Kanungo and Resnik, 1999; Chew et al., 2006). The NWT Bible is generally written in standard Igbo.
In place of sentence splitting, we use verses, since all 66 books of the Bible are organised at verse level. Our major aim is to use this Igbo corpus to implement our new tagset, which will capture all the inflected and non-inflected tokens in the corpus. For lack of space, issues of tokenization with respect to morphemes, the manual annotation implementation and the platform used will not be discussed in this paper.

Tagset Design
We adopt Leech's (1997) definition of a POS tagset as a set of word categories to be applied to the tokens of a text. We designed our tagset following the standard EAGLES guidelines, diverging where necessary (e.g. EAGLES, which favours European languages, specifies articles at the obligatory level, but this category does not apply to Igbo). A crucial question in tagset design is the extent of fine-grained distinctions to encode within the tagset. A tagset that is too coarse-grained may fail to capture distinctions that would be valuable for subsequent analysis, e.g. syntactic parsing; one that is too fine-grained may make automatic (and manual) POS tagging difficult, resulting in errors that cause problems for later processing.
In what follows, we introduce a tagset of moderate granularity with the intention of providing a basis for practical POS tagging. The tagset is intended to strike an appropriate balance for practical purposes regarding granularity, capturing what we believe will be the key lexico-grammatical distinctions of value for subsequent processing, such as parsing. Further subcategorization of the grammatical classes, as described in section 2.4, results in 59 tags which apply to whole tokens (produced by the tokenisation stage). An important challenge comes from the complex morphological behaviour of Igbo. Thus, a verb such as bịa, which we assign the tag VSI (a verb in its simple or base form), can combine with extensional suffixes, such as ghị and kwa, to produce variants such as bịaghị, bịakwa and bịaghịkwa, which exhibit similar grammatical behaviour to the base form. As such, we might have assigned these variants the VSI tag also, but have instead chosen to assign VSI_XS, which serves to indicate both the core grammatical behaviour and the presence of extensional suffixes. In abịakwa, we find the same base form bịa, plus a verbal vowel prefix a, resulting in the verb being a participle, which we assign the tag VPP_XS. For the benefit of cross-lingual training and other NLP tasks, a smaller tagset that captures only the grammatical distinctions between major classes is required. The present 59 tags can easily be simplified to a coarse-grained tagset of 15 tags, which will principally preserve just the core distinctions between word classes, such as noun, verb, adjective, etc.
Although Emenanjo (1978) classified ideophones as a form of noun, we have assigned them a separate tag IDEO, as these items can be found performing many grammatical functions. For instance, the ideophone kọị, "to say that someone walks kọị kọị", has no nominal meaning; rather, its function here is adverbial. A full enumeration of this scheme is given in the appendix.
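The relation between a composite tag such as VSI_XS and its core tag can be made concrete with a short Python sketch. It splits a tag into its core part and a flag for the extensional-suffix marker, and then maps the core to a coarse class. The names `split_tag`, `coarse` and `COARSE_CLASS` are hypothetical, and the coarse mapping shown covers only the tags mentioned in this section; it is an illustrative assumption, not the 15-tag scheme itself.

```python
from typing import NamedTuple

class TagParts(NamedTuple):
    core: str             # core grammatical tag, e.g. "VSI"
    has_ext_suffix: bool   # True when the tag carries the "_XS" marker

def split_tag(tag: str) -> TagParts:
    """Separate the core tag from the extensional-suffix marker."""
    core, _, marker = tag.partition("_")
    return TagParts(core, marker == "XS")

# ASSUMED, illustrative mapping from core tags to coarse classes.
# Only tags named in this section are listed.
COARSE_CLASS = {"VSI": "V", "VPP": "V", "IDEO": "IDEO"}

def coarse(tag: str) -> str:
    """Map a fine-grained tag to a coarse class; unknown cores are kept."""
    core = split_tag(tag).core
    return COARSE_CLASS.get(core, core)
```

Under these assumptions, `split_tag("VSI_XS")` recovers the core VSI together with the suffix flag, while `coarse("VPP_XS")` and `coarse("VSI")` fall into the same verbal class.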

The Development of a POS-Tagged Igbo Corpus
Here we describe the ongoing manual POS tagging process based on the tagset scheme. The Bible books were allocated randomly to six groups, producing six corpus portions of approximately 45,000 tokens each. Our plan was for each human annotator to tag at least 1000 tokens per day, resulting in complete POS tagging in 45 days. The overall corpus size allocated is 264,795 tokens of the New Testament. There are six human annotators, who are students of the Department of Linguistics at Nnamdi Azikiwe University, Awka, supervised by a senior lecturer in the same department, giving an effective total of seven human annotators. Additionally, a common portion of the corpus (38,093 tokens) was given to all the annotators, as a basis for calculating inter-annotator agreement.
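The text does not state which agreement measure is to be computed on the common portion. As one possibility, the sketch below computes Cohen's kappa for a pair of annotators; with six annotators it would be applied to each pair, and a multi-rater measure such as Fleiss' kappa would be an alternative. The function name and the toy tag sequences are ours and are not project data. The last lines reproduce the scheduling arithmetic from the paragraph above.

```python
from collections import Counter

def cohen_kappa(tags_a: list[str], tags_b: list[str]) -> float:
    """Cohen's kappa for two annotators tagging the same token sequence."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(tags_a)
    # Observed agreement: proportion of tokens given the same tag.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement from each annotator's own tag distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical tag throughout
    return (observed - expected) / (1.0 - expected)

# Toy example (NOT project data): the annotators disagree on one token.
annotator_1 = ["VSI", "VSI", "DEM", "ENC"]
annotator_2 = ["VSI", "VSI_XS", "DEM", "ENC"]
kappa = cohen_kappa(annotator_1, annotator_2)

# Planned schedule: about 45,000 tokens per portion at 1,000 tokens a day.
planned_days = 45_000 // 1_000
```

Kappa corrects raw agreement for agreement expected by chance, which matters when a few frequent tags dominate the corpus.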

Conclusions
We have outlined our current progress in the development of a POS tagging scheme for Igbo from scratch. Our project aims to build computational linguistic resources to support research in natural language processing (NLP) for Igbo. It is important to note that these tags are applicable to unmarked, partially marked, and fully tone-marked Igbo texts, since the fully tone-marked tokens play the same grammatical roles as in the texts without tone marks that are written by native speakers for fellow native speakers.
Our method of tagset design could be used for other African or under-resourced languages. African languages are morphologically rich, and of around 2000 languages on the continent, only a small number have featured in NLP research.

DEM
Demonstrative. This is made up of only two deictics, which are always used after their nominals. E.g. a, ahụ.

DIGR
Digraph. All combined graphemes that represent a character in Igbo, which occur in the text. E.g. gb, gw, kp, nw, ...

ENC
Enclitic. Any type of enclitic. Collective: cha, sịnụ, kọ (means all; a totality forming a whole or aggregate). Negative interrogative: dị, rị, dụ (indicates scorn or disrespect; mainly used in rhetorical interrogatives).