Automatic Creation of Arabic Named Entity Annotated Corpus Using Wikipedia

In this paper we propose a new methodology that exploits Wikipedia's features and structure to automatically develop an Arabic NE annotated corpus. Each Wikipedia link is transformed into an NE annotation according to the NE type of the target article. Other Wikipedia features, namely redirects, anchor texts, and inter-language links, are used to tag additional NEs that appear without links in Wikipedia texts. Furthermore, we have developed a filtering algorithm to eliminate ambiguity when tagging candidate NEs. We also introduce a mechanism based on the high coverage of Wikipedia to address two challenges particular to tagging NEs in Arabic text: rich morphology and the absence of capitalisation. The corpus created with our new method (WDC) has been used to train an NE tagger which has been tested on different domains. Judging by the results, an NE tagger trained on WDC can compete with those trained on manually annotated corpora.


Introduction
Supervised learning techniques are well known for their effectiveness in developing Named Entity Recognition (NER) taggers (Bikel et al., 1997; Sekine and others, 1998; McCallum and Li, 2003; Benajiba et al., 2008). The main disadvantage of supervised learning is that it requires a large annotated corpus. Although a substantial amount of annotated data is available for some languages, for other languages, including Arabic, more work is needed to enrich their linguistic resources. In fact, changing the domain or just expanding the set of classes always requires domain-specific experts and new annotated data, both of which cost time and effort. Therefore, current research focuses on approaches that require minimal human intervention to facilitate the process of moving NE classifiers to new domains and expanding NE classes.
Semi-supervised and unsupervised learning approaches, along with the automatic creation of tagged corpora, are alternatives that avoid manually annotated data (Richman and Schone, 2008; Althobaiti et al., 2013). The high coverage and rich informational structure of online encyclopedias can be exploited for the automatic creation of datasets. For example, many researchers have investigated the use of Wikipedia's structure to classify Wikipedia articles and to transform links into NE annotations according to the link target type (Nothman et al., 2008; Ringland et al., 2009).
In this paper we present our approach to automatically derive a large NE annotated corpus from Arabic Wikipedia. The key to our method lies in the exploitation of Wikipedia's concepts, specifically anchor texts and redirects, to handle the rich morphology of Arabic, thereby eliminating the need to perform any deep morphological analysis. In addition, a capitalisation probability measure has been introduced and incorporated into the approach in order to replace the capitalisation feature, which does not exist in the Arabic script. This capitalisation measure is used to filter ambiguous Arabic NE phrases during the annotation process.
The remainder of this paper is structured as follows: Section 2 illustrates structural information about Wikipedia. Section 3 includes background information on NER, including recent work. Section 4 summarises the proposed methodology. Sections 5, 6, and 7 describe the proposed algorithm in detail. The experimental setup and the evaluation results are reported and discussed in Section 8. Finally, the conclusion offers comments regarding our future work.

The Structure of Wikipedia
Wikipedia is a free online encyclopedia project written collaboratively by thousands of volunteers, using MediaWiki. Each article in Wikipedia is uniquely identified by its title. The title is usually the most common name for the entity explained in the article.

Content Pages
Content pages (aka Wikipedia articles) contain the majority of Wikipedia's informative content. Each content page describes a single topic and has a unique title. In addition to the text describing the topic of the article, content pages may contain tables, images, links and templates.

Redirect Pages
A redirect page is used if there are two or more alternative names that can refer to one entity in Wikipedia. Thus, each alternative name is changed into a title whose article contains a redirect link to the actual article for that entity. For example, 'UK' is an alternative name for the 'United Kingdom', and consequently, the article with the title 'UK' is just a pointer to the article with the title 'United Kingdom'.

List of Pages
Wikipedia offers several ways to group articles. One method is to group articles by lists. The items on these lists include links to articles in a particular subject area, and may include additional information about the listed items. For example, 'list of scientists' contains links to articles of scientists and also links to more specific lists of scientists.

Infobox
An infobox is a fixed-format table added to the top right-hand or left-hand corner of articles to provide a summary of some unifying parameters shared by the articles. For instance, every scientist has a name, date of birth, birthplace, nationality, and field of study.

Links
A link is the method Wikipedia uses to connect pages within the wiki environment. Links are enclosed in doubled square brackets. A vertical bar, the 'pipe' symbol, is used to create a link while labelling it with a different name on the current page. Consider the following two examples:
1. [[a]] is labelled 'a' on the current page and links to target page 'a'.
2. [[a|b]] is labelled 'b' on the current page, but links to target page 'a'.
In the second example, the anchor text (aka link label) is 'b', while 'a', the link target, refers to the title of the target article. In the first example, the anchor text shown on the page and the title of the target article are the same.
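Link syntax like the above can be parsed with a short routine. The following sketch (the regex and function name are our own, not from the paper) extracts (target article, anchor text) pairs from raw wikitext:

```python
import re

# Matches [[target]] and [[target|label]] wiki links.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_links(wikitext):
    """Return (target_article, anchor_text) pairs for every link."""
    links = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # For [[a]] the anchor text defaults to the target title itself.
        label = (match.group(2) or target).strip()
        links.append((target, label))
    return links

print(extract_links("born in [[United Kingdom|UK]] near [[London]]"))
# [('United Kingdom', 'UK'), ('London', 'London')]
```

Each extracted target can then be looked up to find the NE type of the article it points to.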

Related Work
Current NE research seeks alternatives to traditional techniques that require minimal human intervention and address their deficiencies.
Specific deficiencies include the limited number of NE classes resulting from the high cost of setting up corpora, and the difficulty of adapting the system to new domains.
One of these trends is distant learning, which relies on external knowledge either to improve the performance of the classifier or to automatically create new resources for the learning stage. Kazama and Torisawa (2007) exploited Wikipedia-based features to improve their machine learning NE recogniser's F-score by three percent. Their method retrieved the corresponding Wikipedia entry for each candidate word sequence in the CoNLL 2003 dataset and extracted a category label from the first sentence of the entry.
The automatic creation of training data has also been investigated using external knowledge. An et al. (2003) extracted sentences containing listed entities from the web and produced a 1.8 million word Korean dataset. Their corpus performed as well as manually annotated training data. Nothman et al. (2008) exploited Wikipedia to create a massive corpus of named entity annotated text. They transformed Wikipedia's links into named entity annotations by classifying the target articles into standard entity types. Compared to the MUC, CoNLL, and BBN corpora, their Wikipedia-derived corpora tend to perform better than other cross-corpus train/test pairs. Nothman et al. (2013) automatically created massive, multilingual training annotations for named entity recognition by exploiting the text and internal structure of Wikipedia. They first categorised each Wikipedia article into named entity types, training and evaluating on 7,200 manually-labelled Wikipedia articles across nine languages: English, German, French, Italian, Polish, Spanish, Dutch, Portuguese, and Russian. Their cross-lingual approach achieved up to 95% accuracy. As in the earlier work, they transformed Wikipedia's links into named entity annotations by classifying the target articles into standard entity types. This technique produced reasonable annotations, but was not immediately able to compete with existing gold-standard data. They better aligned their automatic annotations to the gold standard by deducing additional links and heuristically tweaking the Wikipedia corpora. Following this approach, millions of words in nine languages were annotated. Wikipedia-trained models were evaluated against CoNLL shared task data and other gold-standard corpora. Their method outperformed Richman and Schone (2008) and Mika et al. (2008), and achieved scores 10% higher than models trained on newswire when tested on manually annotated Wikipedia text.
Alotaibi and Lee (2013) automatically developed two NE-annotated sets from Arabic Wikipedia. The corpora were built using the mechanism that transforms links into NE annotations by classifying the target articles into named entity types. They used POS-tagging, morphological analysis, and linked NE phrases to detect other mentions of NEs that appear without links in the text. By contrast, our method requires neither POS-tagging nor morphological analysis; it identifies unlinked NEs by matching phrases from an automatically constructed and filtered list of alternative names against identical terms in the articles' texts (see Section 6). The first dataset created by Alotaibi and Lee (2013) is called WikiFANE(whole) and contains all sentences retrieved from the articles. The second set, called WikiFANE(selective), is constructed by selecting only the sentences that have at least one named entity phrase.

Summary of the Approach
All of our experiments were conducted on the 26 March 2013 Arabic version of the Wikipedia dump. A parser was created to handle the MediaWiki markup and to extract structural information from the Wikipedia dump: a list of redirect pages along with their target articles, a list of pairs containing link labels and their target articles in the form 'anchor text, target article', and essential information for each article (e.g., title, body text, categories, and templates).
Many of Wikipedia's concepts, such as links, anchor texts, redirects, and inter-language links, have been exploited to transform Wikipedia into an NE annotated corpus. More details can be found in the next sections. Generally, the following steps are necessary to develop the dataset:
1. Classify Wikipedia articles into a specific set of NE types.
2. Identify matching text in the title and the first sentence of each article and label the matching phrases according to the article type.
3. Label linked phrases in the text according to the NE type of the target article.
4. Compile a list of alternative titles for articles and filter out ambiguous ones.
5. Identify matching phrases in the list and the Wikipedia text.
6. Filter sentences to prevent noisy sentences being included in the corpus.
We explain each step in turn in the following sections.

Classifying Wikipedia Articles into NE Categories
Categorising Wikipedia articles is the initial step in producing NE training data. Therefore, all Wikipedia articles need to be classified into a specific set of named entity types.

The Dataset and Annotation
In order to develop a Wikipedia document classifier, we used a set of 4,000 manually classified Wikipedia articles that are available free online 5 . The set was manually classified using the ACE (2008) taxonomy plus a new class (Product), giving eight coarse-grained categories in total: Facility, Geo-Political, Location, Organisation, Person, Vehicle, Weapon, and Product. As our work adheres to the CoNLL definition, we mapped these classified Wikipedia articles to the CoNLL NE types, namely person, location, organisation, miscellaneous, or other, based on the CoNLL 2003 annotation guidelines (Chinchor et al., 1999).

The Classification of Wikipedia Articles
Many researchers have already addressed the task of classifying Wikipedia articles into named entity types (Dakka and Cucerzan, 2008; Tardif et al., 2009). Alotaibi and Lee (2012) is the only study that has experimented with classifying the Arabic version of Wikipedia into NE classes. They explored the use of Naive Bayes, Multinomial Naive Bayes, and SVM for classifying Wikipedia articles, and achieved F-scores ranging from 78% to 90% using different language-dependent and language-independent features. We conducted three experiments that used simple bag-of-words features extracted from different portions of the Wikipedia document and its metadata. We summarise the portions of the document included in each experiment below:

Exp1: Experiment 1 involved tokens from the article title and the entire article body.
Exp2: Rich metadata in Wikipedia proved effective for the classification of articles (Tardif et al., 2009; Alotaibi and Lee, 2012). Therefore, in Experiment 2 we included tokens from categories and templates, specifically 'Infobox', as well as tokens from the article title and first sentence of the document.
Exp3: Experiment 3 involved the same set of tokens as Experiment 2, except that category and infobox features were marked with suffixes to differentiate them from tokens extracted from the article body text. This step of distinguishing tokens based on their location in the document improved the accuracy of document classification (Tardif et al., 2009; Alotaibi and Lee, 2012).

5 www.cs.bham.ac.uk/∼fsa081/

In order to optimise features, we implemented a filtered version of the bag-of-words article representation (e.g., removing punctuation marks and symbols) to classify the Arabic Wikipedia documents instead of using a raw dataset (Alotaibi and Lee, 2012). In addition, the same study shows the high impact of applying tokenisation, as opposed to the neutral effect of using stemming. We used the filtered features proposed by Alotaibi and Lee (2012), which included removing punctuation marks and symbols, filtering stop words, and normalising digits. We extended the features, however, by utilising a tokenisation scheme that separates conjunctions, prepositions, and pronouns from each word.
The feature set was represented using Term Frequency-Inverse Document Frequency (TF-IDF). This representation is a numerical statistic that reflects how important a token is to a document in a collection.
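As an illustration, TF-IDF can be computed directly from token counts. The sketch below uses the standard tf × log(N/df) weighting; it is a minimal didactic version, not necessarily the exact variant used in our experiments:

```python
import math
from collections import Counter

def tfidf(documents):
    """TF-IDF weights for a list of tokenised documents."""
    n_docs = len(documents)
    df = Counter()                      # document frequency per token
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({tok: (count / len(doc)) * math.log(n_docs / df[tok])
                        for tok, count in tf.items()})
    return weights

docs = [["title", "infobox", "city"], ["title", "person", "born"]]
w = tfidf(docs)
# 'title' appears in every document, so its IDF (and weight) is 0;
# class-discriminative tokens such as 'city' receive positive weight.
```

Tokens shared by all articles thus contribute nothing, which is exactly why TF-IDF favours class-discriminative metadata tokens.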

The Results of Classifying the Wikipedia Articles
As for the learning process, our Wikipedia document classifier was trained using Liblinear. 80% of the 4,000 hand-classified Wikipedia articles were used for training, while 20% were held out to test the classifier. Table 1 compares the precision, recall, and F-measure of the classifiers resulting from the three experiments. The Exp3 classifier performed better than the others and was therefore selected to classify all of the Wikipedia articles. At the end of this stage, we obtained a list of pairs containing each Wikipedia article and its NE type. We stored this list in a database in preparation for the next stage: developing the NE-tagged training corpus.

The Annotation Process

Utilising the Titles of Articles and Link Targets
Identifying corresponding words in the article title and the entire body of text, and then tagging the matching phrases with the NE type, can be a risky process, especially for terms with more than one meaning. For example, the title of the article describing the city ( , 'Cannes') 8 can also, in Arabic, refer to the past-tense verb ( , 'was'). The portion of the Wikipedia article least likely to produce errors during the matching process is the first sentence, which usually contains the definition of the term the article is written about (Zesch et al., 2007).

When identifying matching terms in the article title and the first sentence, we found that article titles often contain abbreviated names, while the first sentence spells out the full name. This pattern makes it difficult to identify matching terms in the title and first sentence, and frequently appears in biographical Wikipedia articles. For example, one article is entitled ( , 'Abu Bakr Al-Razi'), but the first sentence states the full name of the person: ( , 'Abu Bakr Mohammad Bin Yahia Bin Zakaria Al-Razi'). Therefore, we decided to address the problem with partial matching. In this case, the system first identifies all corresponding words in the title and the first sentence, and then annotates them along with all words that fall between, provided that:

• the sequence of the words in the article title and the text is the same, in order to avoid tagging errors. For example, if the title of the article is ( , 'The River Thames'), but the first sentence reads ( , 'The Thames is a river flowing through southern England....'), then the text will not be tagged.
• the number of tokens located between matched tokens is less than or equal to five 9.

Figure 1 shows one example of partial matching.
8 Throughout the entire paper, Arabic words are represented as follows: (Arabic word, 'English translation').
9 An informal experiment showed that the longest proper Arabic names are 5 to 7 tokens in length.
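The partial-matching rules above (same token order, at most five intervening tokens) can be sketched as follows. The function name and the tokenised English stand-ins for the Arabic examples are illustrative only:

```python
def partial_match_span(title_tokens, sentence_tokens, max_gap=5):
    """Find a span in the first sentence covering all title tokens in
    order, allowing up to `max_gap` extra tokens between matches.
    Returns (start, end) token indices, or None if no valid match."""
    positions = []
    search_from = 0
    for tok in title_tokens:
        try:
            idx = sentence_tokens.index(tok, search_from)
        except ValueError:
            return None                 # a title token is missing or out of order
        if positions and idx - positions[-1] - 1 > max_gap:
            return None                 # too many intervening tokens
        positions.append(idx)
        search_from = idx + 1
    return (positions[0], positions[-1] + 1)

# Title 'Abu Bakr Al-Razi' vs. the full name in the first sentence:
title = ["Abu", "Bakr", "Al-Razi"]
sent = ["Abu", "Bakr", "Mohammad", "Bin", "Yahia", "Bin", "Zakaria", "Al-Razi"]
print(partial_match_span(title, sent))  # (0, 8): tag the whole span
```

The returned span is then annotated with the NE type assigned to the article in the classification stage.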

Dictionaries of Alternative Names
Depending only on NE anchor texts to derive and annotate data from Wikipedia results in a low-quality dataset, as Wikipedia contains a fair number of NEs mentioned without links. This can be attributed to the fact that each term on Wikipedia is typically linked only on its first appearance in an article (Nothman et al., 2008). These unlinked NE phrases can be found simply by identifying the matching terms in the list of linked NE phrases 10 and the text. The process is not as straightforward as it seems, however, because identifying corresponding terms may prove ineffective, especially in a morphologically rich language in which unlinked NE phrases are sometimes found agglutinated to prefixes and conjunctions. In order to detect unlinked and inflected forms of NEs in Wikipedia text, we extended the list of article titles used in the previous step by including NE anchor texts. Adding NE anchor texts to the list assists in finding possible morphologically inflected NEs in the text while eliminating the need for any morphological analysis. Table 2 shows examples from the dictionary of NE anchor texts. Redirect pages likewise supply alternative names for NE articles. Therefore, we compiled a list of the titles of redirect pages that send the reader to articles describing NEs; we refer to these titles in this paper as NE redirects. We consider the combined lists of NE redirects and anchor texts a list of alternative names, since they can be used as alternative names for article titles.

10 The list of anchor texts that refer to NE articles.
The list of alternative names is used to find unlinked NEs in the text by matching phrases from the list with identical terms in the articles' texts. This list is essential for managing spelling and morphological variations of unlinked NEs, as well as misspellings. Consequently, the process increases the coverage of NE tags added to the plain texts of Wikipedia articles.
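A minimal sketch of this matching step is given below. Since the alternative-name list already contains inflected and variant surface forms, a greedy longest-match lookup suffices, with no morphological analysis. The function name and toy dictionary are our own:

```python
def tag_unlinked(tokens, alternative_names, max_len=5):
    """Greedy longest-match tagging of unlinked NEs. `alternative_names`
    maps a tuple of tokens (a surface form collected from anchor texts
    and redirects) to its NE type."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shrink.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + length])
            if phrase in alternative_names:
                ne_type = alternative_names[phrase]
                tags[i] = "B-" + ne_type
                for j in range(i + 1, i + length):
                    tags[j] = "I-" + ne_type
                i += length
                break
        else:
            i += 1                      # no match starting here
    return tags

names = {("United", "Kingdom"): "LOC", ("London",): "LOC"}
print(tag_unlinked(["He", "visited", "United", "Kingdom"], names))
# ['O', 'O', 'B-LOC', 'I-LOC']
```

Because every inflected form is its own dictionary key, an agglutinated variant matches directly, with no stemming at annotation time.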

Filtering the Dictionaries of Alternative Names
One-word alternative names: Identifying matching phrases in the list of alternative names and the text inevitably results in a lower quality corpus due to noisy names. Noisy alternative names usually arise when a named entity also carries an ordinary meaning. For example, the article on the person ( , 'Abu Abdullah Alamyn') has an alternative name consisting only of his last name ( , 'Alamyn'), which means 'custodian'. Therefore, annotating every occurrence of 'Alamyn' as PER would lead to incorrect tagging and ambiguity. The same applies to the city named ( , 'Aljadydah'), whose name literally means 'new'. Thus, the list of alternative names should be filtered to omit one-word NE phrases that usually have a meaning and are ambiguous when taken out of context.
In order to solve this problem, we introduced a capitalisation probability measure for Arabic words, which are never capitalised. This involved finding the English gloss for each one-word alternative name and then computing its probability of being capitalised using the English Wikipedia. To find the English gloss for Arabic words, we exploited Wikipedia's Arabic-to-English cross-lingual links, which provided us with a reasonable number of Arabic terms and their corresponding English terms. If the English gloss for an Arabic word could not be found using inter-language links, we resorted to an online translator. Before translating the Arabic word, a light stemmer was used to remove prefixes and conjunctions in order to obtain the translation of the word itself without its associated affixes. Otherwise, the Arabic word ( ) would be translated as 'in the country'. The capitalisation probability was computed as follows:

CapProb(EN) = f(EN)_isCapitalised / (f(EN)_isCapitalised + f(EN)_notCapitalised)

where: EN is the English gloss of the alternative name; f(EN)_isCapitalised is the number of times the English gloss EN is capitalised in the English Wikipedia; and f(EN)_notCapitalised is the number of times the English gloss EN is not capitalised in the English Wikipedia.
This way, we managed to build a list of Arabic words and their probabilities of being capitalised. Meaningful one-word NEs usually receive a low probability. By imposing a capitalisation threshold constraint, we prevented such words from being included in the list of alternative names. After a set of experiments, we decided to use a capitalisation threshold of 0.75.
Multi-word alternative names: Multi-word alternative names (e.g., ( , 'MusTafae Mahmud') and ( , 'Ahmad Adel')) rarely cause errors in the automatic annotation process. Wikipedians, however, at times append personal and job titles to the person's name contained in the anchor text that refers to the article about that person. Examples of such anchor texts are ( , 'Ruler of Dubai Muhammad bin Rashid') and ( , 'President of the Council of Ministers Muhammad bin Rashid'). As a result, the system would mistakenly annotate words like Dubai, Council, and Ministers as PER. Our solution to this problem is to omit a multi-word alternative name if any of its words belong to the list of apposition words that usually appear adjacent to NEs, such as ( , 'President'), ( , 'Minister'), and ( , 'Ruler'). The filtering algorithm excluded 22.95% of the alternative names from the original list. Algorithm 1 shows pseudo code of the filtering algorithm. The dictionaries derived from Wikipedia by exploiting its structure and adopting the filtering algorithm are shown in Table 3.
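The two filtering rules can be sketched as follows. Function names, toy counts, and the gloss dictionary are illustrative; the real system derives glosses from cross-lingual links or an online translator, and the counts from the English Wikipedia:

```python
CAP_THRESHOLD = 0.75                    # threshold chosen experimentally

def cap_probability(en_gloss, cap_counts, lower_counts):
    """P(capitalised) for an English gloss, from English Wikipedia counts."""
    cap = cap_counts.get(en_gloss, 0)
    low = lower_counts.get(en_gloss, 0)
    return cap / (cap + low) if cap + low else 0.0

def filter_alternative_names(names, gloss, cap_counts, lower_counts, appositions):
    """Drop ambiguous one-word names (low capitalisation probability) and
    multi-word names containing apposition words such as 'Ruler'."""
    kept = []
    for name in names:                  # each name is a list of tokens
        if len(name) == 1:
            p = cap_probability(gloss.get(name[0], name[0]),
                                cap_counts, lower_counts)
            if p >= CAP_THRESHOLD:
                kept.append(name)
        elif not any(tok in appositions for tok in name):
            kept.append(name)
    return kept

names = [["Alamyn"], ["Mohammad", "Alamyn"], ["Ruler", "of", "Dubai"]]
kept = filter_alternative_names(
    names,
    gloss={"Alamyn": "custodian"},      # from cross-lingual links / translation
    cap_counts={"custodian": 1},        # capitalised occurrences in en-wiki
    lower_counts={"custodian": 99},     # lowercase occurrences in en-wiki
    appositions={"Ruler"},
)
# Only the unambiguous multi-word personal name survives the filter.
```

Here the common-word last name and the title-bearing anchor text are both discarded, which mirrors the behaviour described above.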

Post-processing
The goal of post-processing was to address some issues that arose during the annotation process as a result of different domains, genres, and conventions for entity types. For example, nationalities and other adjectival forms of nations, religions, and ethnic groups are considered MISC in the CoNLL NER task for the English corpus, while the Spanish corpus considers them NOT named entities (Nothman et al., 2013). As far as we know, almost all Arabic NER datasets that followed the CoNLL style and guidelines consider nationalities NOT named entities. On Wikipedia, all nationalities are linked to articles about the corresponding countries, which makes the annotation tool tag them as LOC. We decided to consider them NOT named entities in accordance with the CoNLL-style Arabic datasets. Therefore, in order to resolve this issue, we compiled a list of nationalities and other adjectival forms of religions and ethnic groups, so that any anchor text matching an entry in the list was retagged as NOT a named entity. The lists of nationalities and of apposition words used in section 6.2.1 were compiled by exploiting the 'List of' articles in Wikipedia, such as the list of people by nationality, list of ethnic groups, list of adjectival forms of place names, and list of titles. Some of these 'List of' pages were translated from their English versions, either because the English lists are more comprehensive than the Arabic ones, or because no corresponding Arabic page exists.
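The retagging step amounts to a simple lookup over the automatically compiled nationality list; the sketch below uses illustrative names:

```python
def retag_nationalities(tokens, tags, nationalities):
    """Retag nationality / adjectival-form tokens from LOC to O,
    following the CoNLL-style convention used by Arabic NER datasets."""
    return ["O" if tok in nationalities and tag.endswith("LOC") else tag
            for tok, tag in zip(tokens, tags)]

tokens = ["the", "French", "president"]
tags = ["O", "B-LOC", "O"]      # 'French' was linked to the article on France
print(retag_nationalities(tokens, tags, {"French"}))  # ['O', 'O', 'O']
```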

Building the Corpus
After the annotation process, the last step was to incorporate sentences into the corpus. This resulted in an annotated dataset of around ten million tokens. However, in order to obtain a corpus with a large number of tags without affecting its quality, we created a dataset called the Wikipedia-derived corpus (WDC), which includes only sentences with at least three annotated named entity tokens. The WDC dataset contains 165,119 sentences consisting of around 6 million tokens. The annotation style of the WDC dataset follows the CoNLL format, where each token and its tag are placed together in the same file in the form <token>\s<tag>. The NE boundary is specified using the BIO representation scheme, where B- indicates the beginning of the NE, I- indicates the continuation (inside) of the NE, and O indicates that the word is not part of an NE. The WDC dataset is available online to the community of researchers.
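Serialising the annotated sentences into the CoNLL format described above might look like this (a sketch; the actual corpus writer is not shown in the paper):

```python
def to_conll(sentences):
    """Write sentences as CoNLL-style lines: one '<token> <tag>' pair per
    line, with a blank line separating sentences."""
    lines = []
    for sentence in sentences:          # each sentence: list of (token, tag)
        for token, tag in sentence:
            lines.append(f"{token} {tag}")
        lines.append("")                # sentence boundary
    return "\n".join(lines)

sample = [[("Sheffield", "B-LOC"), ("is", "O"), ("a", "O"), ("city", "O")]]
print(to_conll(sample))
```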

Experimental Evaluation
To evaluate the quality of the methodology, we used WDC as training data to build an NER model. Then we tested the resulting classifier on datasets from different domains.

Datasets
For the evaluation purposes, we used three datasets: ANERcorp, NEWS, and TWEETS.
ANERcorp is a news-wire domain dataset built and tagged specifically for the NER task by Benajiba et al. (2007). It contains around 150k tokens and is freely available. We tested our methodology on the ANERcorp test corpus because it is widely used in the literature for comparison with existing systems. The NEWS dataset is also a news-wire domain dataset, collected by Darwish (2013) from the RSS feed of the Arabic version of news.google.com in October 2012. The RSS feed consists of the headline and the first 50 to 100 words of each news article. This set contains approximately 15k tokens. The third test set was extracted randomly from Twitter and contains 1,423 tweets authored in November 2011, totalling approximately 26k tokens (Darwish, 2013).

Our Supervised Classifier
All experiments to train and build a probabilistic classifier were conducted using Conditional Random Fields (CRF). As for the features used in all our experiments, we selected the most successful features from previous Arabic NER work (Benajiba et al., 2008; Abdul-Hamid and Darwish, 2010; Darwish, 2013). These features include:

• The words immediately before and after the current word, in their raw and stemmed forms.
• The appearance of the word in the gazetteer.
• The stemmed form of the word.
The gazetteer used contains around 5,000 entries and was developed by Benajiba et al. (2008). A light stemmer was used to determine the stem of each word, using simple rules to remove conjunctions, prepositions, and definite articles (Larkey et al., 2002).
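A sketch of the feature extraction is shown below. The transliterated prefix list and function names are hypothetical stand-ins; the real light stemmer operates on Arabic script (Larkey et al., 2002):

```python
def light_stem(token, prefixes=("wa", "fa", "bi", "li", "al")):
    """Toy light stemmer: strip one leading conjunction, preposition, or
    definite article (transliterated prefixes used here for illustration)."""
    for p in prefixes:
        if token.startswith(p) and len(token) > len(p) + 1:
            return token[len(p):]
    return token

def features(tokens, i, gazetteer):
    """Feature dictionary for token i, in the style used by CRF toolkits."""
    tok = tokens[i]
    return {
        "word": tok,
        "stem": light_stem(tok),
        "prev_word": tokens[i - 1] if i > 0 else "<s>",
        "prev_stem": light_stem(tokens[i - 1]) if i > 0 else "<s>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "</s>",
        "next_stem": light_stem(tokens[i + 1]) if i < len(tokens) - 1 else "</s>",
        "in_gazetteer": tok in gazetteer,
    }

print(features(["fi", "almadina"], 1, {"almadina"}))
```

One such dictionary per token is then fed to the CRF learner alongside the BIO tag sequence.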

Results
We compared a system trained on WDC with the systems trained by Alotaibi and Lee (2013) on two datasets, WikiFANE(whole) and WikiFANE(selective), which are also automatically collected from Arabic Wikipedia. The evaluation was conducted by testing them on the ANERcorp test set. The results shown in Table 5 prove that the methodology we propose in this paper produces a dataset that outperforms the two other datasets in terms of recall and F-measure.

First, Table 6 shows that the WDC classifier outperforms the news-based classifier by around 48% in F-score. The obvious difference in the performance of the two classifiers can be attributed to the difference in annotation conventions across domains. For example, many key words in Arabic Wikipedia that appear in the text along with NEs (e.g., /university, /city, /company) are usually considered part of NE names. So the phrase 'Shizuoka Prefecture', which is mentioned in some Arabic Wikipedia articles, is considered an entity and linked to an article about Shizuoka, making the system annotate all words in the phrase as NEs (i.e., Shizuoka/B-LOC Prefecture/I-LOC). In the ANERcorp corpus, on the other hand, only the word after the keyword ( , 'Prefecture') is considered an NE. In addition, although sport facilities (e.g., stadiums) are categorised in Wikipedia as locations, some of them are not even considered entities in the ANERcorp test corpus.
Secondly, the ANERcorp-Model and the WDC-Model were tested on the ANERcorp test data. The point of this comparison is to show how well the WDC dataset works on a news-wire domain, which is more specific than Wikipedia's open domain. The table shows that the ANERcorp-model outperforms the F-score of the WDC-Model by around 13 points. However, in addition to the fact that training and test datasets for the ANERcorp-Model are drawn from the same domain, 69% of NEs in the test data were seen in the training set (Darwish, 2013).
Thirdly, the ANERcorp-Model and the WDC-Model were tested on NEWS corpus, which is also a news-wire based dataset. The results from Table 6 reveal the quality of the WDC dataset on the NEWS corpus. The WDC-Model achieves relatively similar results to the ANERcorp-Model, although the latter has the advantage of being trained on a manually annotated corpus extracted from the similar domain of the NEWS test set.
Finally, testing the ANERcorp-Model and the WDC-Model on data extracted from social networks like Twitter proves that models trained on open-domain datasets like Wikipedia perform better on social network text than classifiers trained on domain-specific datasets, as shown in Table 6.
In order to show the effect of combining our corpus (WDC) with a manually annotated dataset from a different domain, we merged WDC with the ANERcorp dataset. Table 7 shows the results of a system trained on the combined corpus when testing it on three test sets. The system trained on the combined corpus achieves results that fall between the results of the systems trained on each corpus separately when testing them on the ANERcorp test set and NEWS test set. On the other hand, the results of the system trained on the combined corpus when tested on the third test set (TWEETS) show no significant improvement.

Conclusion and Future Work
We have presented a methodology that requires minimal time and human intervention to generate an NE-annotated corpus from Wikipedia. The evaluation results showed the high quality of the developed corpus WDC, which contains around 6 million tokens representing different genres, as Wikipedia is considered an open domain. Furthermore, WDC outperforms other NE corpora generated automatically from Arabic Wikipedia by 8 to 12 points in terms of F-measure. Our methodology can easily be adapted to extend to new classes. Therefore, in the future we intend to experiment with finer-grained NE hierarchies. In addition, we plan to carry out domain adaptation experiments to handle the difference in annotation conventions across domains.