Arabic Fine-Grained Entity Recognition

Traditional NER systems are typically trained to recognize coarse-grained categories of entities, and less attention is given to classifying entities into a hierarchy of fine-grained lower-level sub-types. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source nested Arabic named entity corpus) with sub-types. In particular, four main entity types in Wojood (geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC)) are extended with 31 sub-types of entities. To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC, ORG, and FAC (~44K) in Wojood were manually annotated with the LDC's ACE subtypes. This extended version of Wojood is called WojoodFine. To evaluate our annotations, we measured the inter-annotator agreement (IAA) using both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively. To compute baselines for WojoodFine, we fine-tuned three pre-trained Arabic BERT encoders in three settings (flat NER, nested NER, and nested NER with sub-types) and achieved F1 scores of 0.920, 0.866, and 0.885, respectively. Our corpus and models are open source and available at https://sina.birzeit.edu/wojood/.

Traditional NER systems are typically trained to recognize coarse and high-level categories of entities, such as person (PERS), location (LOC), geopolitical entity (GPE), or organization (ORG). However, less attention is given to classifying entities into a hierarchy of fine-grained lower-level subtypes (Zhu et al., 2020; Desmet and Hoste, 2013). For example, locations (LOC) like Asia and Red Sea could be further classified into Continent and Water-Body, respectively. Similarly, organizations like Amazon, Cairo University, and Sphinx Cure can be classified into commercial, educational, and health entities, respectively. Belgium, Beirut, and Brooklyn can be classified into Country, Town, and Neighborhood instead of classifying them all as GPE. The importance of classifying named entities into subtypes is increasing in many application areas, especially in question answering, relation extraction, and ontology learning (Lee et al., 2006).
As will be discussed in the following sub-section, the number of NER datasets that support subtypes is limited, particularly for the Arabic language. The only available Arabic NER corpus with subtypes is the LDC's ACE2005 (Walker et al., 2005). However, this corpus is expensive. In addition, ACE2005 was collected two decades ago and hence may not be representative of the current state of Arabic language use. This is especially the case since language models are known to be sensitive to temporal and domain shifts (see Section 5).
To avoid starting from scratch, we chose to extend upon a previously published and open-source Arabic NER corpus known as 'Wojood' (Jarrar et al., 2022). Wojood consists of 550K tokens manually annotated with 21 entity types. In particular, we manually classify four main entity types in Wojood (GPE, LOC, ORG, and FAC) with 31 new fine-grained subtypes. This extension is not straightforward, as we had to change the original annotation of these four entity types (5,614 changes) to align with the LDC guidelines before extending them with subtypes. The total number of tokens that are annotated with the 31 subtypes is 47.6K.
Our extended version of Wojood is hereafter called WojoodFine. We measure inter-annotator agreement (IAA) using both Cohen's Kappa and F1, resulting in 0.9861 and 0.9889, respectively.
To compute baselines for WojoodFine, we fine-tune three pre-trained Arabic BERT encoders across three settings: (i) flat, (ii) nested without subtypes, and (iii) nested with subtypes, using multi-task learning. Our models achieve 0.920, 0.866, and 0.885 in F1, respectively.
The remainder of the paper is organized as follows: Section 2 overviews related work, and Section 3 presents the WojoodFine corpus, the annotation process, and the inter-annotator agreement measures. In Section 4, we present the experiments and the fine-tuned NER models. In Section 5, we present error analysis and out-of-domain performance, and we conclude in Section 6.

Related Work
Most NER research is focused on coarse-grained named entities and typically targets a limited number of categories. For example, Chinchor and Robinson (1997) proposed three classes: person, location, and organization. The Miscellaneous class was added in CoNLL-2003 (Sang and De Meulder, 2003). Four additional classes (geopolitical entities, weapons, vehicles, and facilities) were also introduced in the ACE project (Walker et al., 2005). The OntoNotes corpus is more expressive, as it covers 18 types of entities (Weischedel et al., 2013).
Coarse-grained NER is a good starting point for named entity recognition, but it is not sufficient for tasks that require a more detailed understanding of named entities (Ling and Weld, 2012; Hamdi et al., 2021).
Substantial research has been undertaken to identify historical entities. For instance, the HIPE shared task (Ehrmann et al., 2020a) focused on extracting named entities from historical newspapers written in French, German, and English. One of its subtasks was the recognition and classification of mentions according to finer-grained entity types. The corpus used in the shared task consists of tokens annotated with five main entity types and 12 subtypes, following the IMPRESSO guidelines (Ehrmann et al., 2020b). A similar corpus, called NewsEye, was collected from historical newspapers in four languages: French, German, Finnish, and Swedish (Hamdi et al., 2021). The corpus is annotated with four main types: PER, LOC, ORG, and PROD. The LOC entities were further classified into five subtypes, and the ORG entities into two subtypes. Desmet and Hoste (2013) proposed a one-million-word fine-grained NER corpus for Dutch, annotated using six main entity types and 27 subtypes (10 subtypes for PERS, three for ORG, nine for LOC, three for PROD, and two for events). Zhu et al. (2020) noted that NER models cannot effectively process fine-grained labels with more than 100 types. Thus, instead of having many fine-grained entities at the top level, they propose a tagging strategy with 15 main entity types and 131 subtypes. Additionally, Ling and Weld (2012) proposed a fine-grained set of 112 tags and formulated the tagging problem as multi-class multi-label classification.
A recent shared task was organized by Fetahu et al. (2023) at SemEval-2023 Task 2, called MultiCoNER 2 (Fine-grained Multilingual Named Entity Recognition). A multilingual corpus (MULTICONER V2) was extracted from localized versions of Wikipedia covering 12 languages; Arabic is not included. The corpus was annotated with a NER taxonomy consisting of six coarse-grained types and 33 fine-grained subtypes (seven subtypes for Person, seven for Group, five for PROD, five for Creative Work, and five for Medical). Most participating systems outperformed the baselines by about 35% F1.
There are a few Arabic NER corpora (Darwish et al., 2021), but all of them are coarse-grained. The ANERCorp corpus covers four entity types (Benajiba et al., 2007), CANERCorpus covers 14 religion-specific types (Salah and Zakaria, 2018), and OntoNotes covers 18 entities (Weischedel et al., 2013). The multilingual ACE2005 corpus (Walker et al., 2005), which includes Arabic, covers five coarse-grained entities and 35 fine-grained subtypes (three subtypes for PERS, 11 for GPE, seven for LOC, nine for ORG, and five for FAC). Nevertheless, the ACE2005 corpus is costly and covers only one domain (media articles) that was collected 20 years ago. The most recent Arabic NER corpus is Wojood (Jarrar et al., 2022), which covers 21 nested entity types across multiple domains. However, Wojood is a coarse-grained corpus and does not support entity subtypes.
To build on previous research on Arabic NER, we chose to extend the Wojood corpus with finer-grained subtypes. To ensure that our Wojood extension is compatible with other corpora, we chose to follow the ACE annotation guidelines.

WojoodFine Corpus
WojoodFine expands the annotation of the Wojood corpus (Jarrar et al., 2022) by adding fine-grained annotations for named-entity subtypes. Wojood is a NER corpus with 550K tokens annotated manually using 21 entity types. About 80% of Wojood was collected from MSA articles, while 12% was collected from social media in Palestinian and Lebanese dialects (the Curras and Baladi corpora (Haff et al., 2022; Jarrar et al., 2017, 2014)). One novelty of Wojood is its nested named entities, but some entity types can be ambiguous, which affects downstream tasks such as information retrieval. For instance, the entity type "Organization" may refer to a government, an educational institution, or a hospital, to name a few. That is why WojoodFine adds subtypes to four entity types: geopolitical entity (GPE), organization (ORG), location (LOC), and facility (FAC). Table 1 shows the overall counts of the main four entity types in Wojood and WojoodFine. Note that creating WojoodFine was not a straightforward process, as it required revision of the Wojood annotation guidelines, which we discuss later in this section. As discussed in (Jarrar et al., 2022), Wojood is available as a RESTful web service; the data and the source code are also publicly available (Jarrar and Amayreh, 2019; Ghanem et al., 2023; Jarrar et al., 2019; Alhafi et al., 2019; Helou et al., 2016).
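To make the nested-with-subtypes structure concrete, the following is a small, hypothetical token-level sketch. The sentence, its labels, and the dict layout are our own illustration and are not taken from the corpus; the B/I/O prefixes follow the IOB2 scheme the paper adopts for modeling, and the tag names follow the corpus scheme:

```python
# Hypothetical sentence: "Cairo University announced ...".
# "Cairo University" is an ORG with subtype EDU; the nested mention
# "Cairo" inside it is a GPE with subtype GPE_TOWN.
tokens = ["Cairo", "University", "announced"]
labels = [
    {"ORG": "B-ORG", "EDU": "B-EDU", "GPE": "B-GPE", "GPE_TOWN": "B-GPE_TOWN"},
    {"ORG": "I-ORG", "EDU": "I-EDU"},
    {},  # a token outside any entity carries the O tag on every layer
]
```

Each token thus carries one IOB2 label per entity type or subtype layer, which is what makes the annotation both nested and fine-grained.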

WojoodFine Annotation Guidelines
We followed the ACE annotation guidelines to annotate the subtypes in WojoodFine. However, since WojoodFine is based on Wojood, we found discrepancies between the Wojood and ACE guidelines.
To address this issue in WojoodFine, we reviewed the annotations related to GPE, ORG, LOC and FAC to ensure compatibility with the ACE guidelines. In this section, we highlight a number of the challenging annotation decisions we made in WojoodFine.
Country's governing body: in Wojood, country mentions were annotated as GPE, and if the intended meaning of the country was its governing body, then the mention was annotated as ORG. In WojoodFine, however, all ORG mentions that refer to a country's governing body are annotated as GPE with the subtype GPE_ORG.
In addition to the changes mentioned in this section, the ACE guidelines consider any unit smaller than a village, such as neighborhoods or camps, as LOC, while it is considered GPE in the Wojood guidelines. Continents are labeled as LOC in Wojood, while they are GPE in ACE. Both of these cases were corrected in WojoodFine.

Annotation Process
The annotation process was done by one annotator, managed by a NER expert, and was conducted over three phases. Phase I: manually revise all annotations of GPE, ORG, LOC, and FAC in Wojood according to the ACE guidelines, as discussed in Section 3.2. Table 1 shows the counts of each of the four entity types in Wojood and WojoodFine.
Phase II: manually annotate the GPE, ORG, LOC, and FAC mentions with subtypes. The annotator meticulously read each token in every sentence and classified the tokens into their respective subtypes. All critical and problematic tokens were reviewed by the NER expert.
Phase III: the NER expert reviewed all annotations made in Phases I and II in order to validate the annotated entities.
Table 2 presents the counts of each entity subtype in the corpus, which shows 47,621 annotated entities in total.

Inter-Annotator Agreement
It has been shown that inter-annotator consistency significantly affects the quality of training data and, consequently, a NER system's ability to learn (Zhang, 2013). To measure the quality and consistency of the subtype annotations, we recruited a second annotator to re-annotate 25,490 tokens (5.0% of the corpus) that were previously annotated by the first annotator. The sentences were selected randomly from the corpus while diversifying their sources and domains. We then assessed the data quality and annotation consistency using the inter-annotator agreement (IAA), measured with Cohen's Kappa (κ) and F1. The overall IAA was measured at κ = 0.9861 and F1 = 0.9889; refer to Table 3 for the IAA for each subtype.
One can clearly observe that κ is high, and that is for multiple reasons. First, we revised the annotations of the main four entity types (GPE, ORG, LOC and FAC) to better match the ACE guidelines. Second, once we verified the top-level entity types, we started annotating the subtypes. Since the types and subtypes are hierarchically organized, this constrains the number of possible subtypes per token, leading to high IAA. Third, the NER expert gave continuous feedback to the annotator, and challenging entity mentions were discussed with the greater team.
As mentioned above, we calculated the IAA using both Cohen's Kappa and F1 for the subtypes of the GPE, ORG, LOC and FAC tags. In what follows, we explain Cohen's Kappa and F1. Note that F1 is not normally used for IAA, but it provides an additional validation of the annotation quality.

Cohen's Kappa
To calculate Kappa for a given tag, we count the number of agreements and disagreements between annotators for a given subtype (such as GPE_COUNTRY). At the token level, agreements are counted as pairwise matches; thus, a disagreement occurs when a token is labeled with one subtype by one annotator (e.g., GPE_COUNTRY) and with a different subtype (e.g., GPE_STATE-OR-PROVINCE) by the other. Kappa is then calculated by equation 1 (Eugenio and Glass, 2004):

κ = (P_o − P_e) / (1 − P_e)    (1)

where P_o represents the observed agreement between annotators and P_e represents the expected (chance) agreement, which is given by equation 2:

P_e = (1 / N²) · Σ_T n_T^1 · n_T^2    (2)

where n_T^i is the number of tokens labeled with tag T by the i-th annotator and N is the total number of annotated tokens.
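The two agreement measures can be sketched at the token level as follows. This is a minimal illustration with made-up tag sequences, not the code used for the reported scores; the pairwise F1 variant is the one described in the next subsection:

```python
from collections import Counter

def iaa_kappa(ann1, ann2):
    """Token-level Cohen's Kappa between two annotators' tag sequences."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    p_o = sum(a == b for a, b in zip(ann1, ann2)) / n  # observed agreement
    c1, c2 = Counter(ann1), Counter(ann2)
    # expected (chance) agreement from each annotator's tag distribution
    p_e = sum(c1[t] * c2[t] for t in set(ann1) | set(ann2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def iaa_f1(ann1, ann2, tag):
    """Pairwise F1 for one tag, over tokens labeled with it by either annotator."""
    tp = sum(a == tag and b == tag for a, b in zip(ann1, ann2))  # both agree on tag
    fn = sum(a == tag and b != tag for a, b in zip(ann1, ann2))  # only annotator 1
    fp = sum(a != tag and b == tag for a, b in zip(ann1, ann2))  # only annotator 2
    return 2 * tp / (2 * tp + fp + fn)

a = ["GPE_COUNTRY", "O", "GPE_TOWN", "GPE_TOWN", "O"]
b = ["GPE_COUNTRY", "O", "GPE_TOWN", "GPE_STATE-OR-PROVINCE", "O"]
print(iaa_kappa(a, b))
print(iaa_f1(a, b, "GPE_TOWN"))
```

Here the single GPE_TOWN vs. GPE_STATE-OR-PROVINCE mismatch counts as one disagreement for Kappa and as one FN/FP pair for the tag-level F1.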

F-Measure
For a given tag T, F1 is calculated according to equation 3; we only counted the tokens that at least one of the annotators had labeled with T:

F1 = 2·TP / (2·TP + FP + FN)    (3)

We then conducted a pairwise comparison. TP represents the true positives, i.e., the number of agreements between annotators (e.g., the number of tokens labeled GPE_TOWN by both annotators). If the first annotator disagrees with the second, the token is counted as a false negative (FN), and if the second disagrees with the first, it is counted as a false positive (FP), the total disagreement being FN + FP.

Fine-Grained NER Modeling

Approach
For modeling, we have three tasks, all performed on WojoodFine: (1) flat NER, where for each token we predict a single label from a set of 21 labels; (2) nested NER, where we predict multiple labels from the 21 tags (i.e., multi-label classification) for each token; and (3) nested NER with subtypes, also a multi-label task, where the model predicts the main entity types and subtypes for each token from 52 labels in total. We frame this as a multi-task approach, since we learn both the nested labels and their subtypes jointly. In the multi-task setup, each entity type/subtype has its own classification layer; for nested NER and nested NER with subtypes, the model thus consists of 21 and 52 classification layers, respectively. Since we use the IOB2 tagging scheme (Sang and Veenstra, 1999), each linear layer is a multi-class classifier that outputs a probability distribution through a softmax activation over three classes, C ∈ {I, O, B} (Jarrar et al., 2022). The model is trained with a cross-entropy loss computed for each linear layer separately; these losses are summed to obtain the final loss. All models are flat in the sense that we do not use any hierarchical architectures; future work could employ a hierarchical architecture where nested tokens are learned first and then their subtypes. For all tasks, we fine-tune three encoder-based models for Arabic language understanding. Namely, we use ARBERTv2 and MARBERTv2 (Elmadany et al., 2023), which are improved versions of ARBERT and MARBERT (Abdul-Mageed et al., 2021), respectively, trained on bigger datasets. The third model is ARABERTv2, an improved version of ARABERT (Antoun et al., 2021), also trained on a bigger dataset with improved preprocessing. Figure 4 offers a simple visualization of our models' architecture.
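A minimal sketch of the per-tag classification heads described above, assuming a hidden size of 768 and replacing the actual encoder (ARBERTv2, MARBERTv2, or ARABERTv2) with random hidden states; the class and variable names are ours, not the authors':

```python
import torch
import torch.nn as nn

class MultiTaskNERHead(nn.Module):
    """One IOB2 classifier (I, O, B) per entity type/subtype; losses are summed."""

    def __init__(self, hidden_size: int, num_tags: int):
        super().__init__()
        # one linear layer per tag, each a 3-way classifier over {I, O, B}
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, 3) for _ in range(num_tags)]
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, hidden_states, labels=None):
        # hidden_states: (batch, seq_len, hidden_size) from a BERT encoder
        logits = [head(hidden_states) for head in self.heads]
        if labels is None:
            return logits
        # labels: (batch, seq_len, num_tags) with class indices in {0, 1, 2}
        loss = sum(
            self.loss_fn(l.reshape(-1, 3), labels[..., i].reshape(-1))
            for i, l in enumerate(logits)
        )
        return logits, loss

# toy usage: 52 tags = 21 main types + 31 subtypes, random "encoder" output
head = MultiTaskNERHead(hidden_size=768, num_tags=52)
h = torch.randn(2, 16, 768)
y = torch.randint(0, 3, (2, 16, 52))
logits, loss = head(h, y)
```

Each of the 52 heads emits an independent IOB2 distribution per token, and the per-head cross-entropy losses are summed, mirroring the multi-task objective described above.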

Training Configuration
We split our dataset into three distinct parts: training (Train) 70%, validation (Dev) 10%, and blind testing (Test) 20%. We fine-tune all three models for 50 epochs each, with an early-stopping patience of 5 as identified on Dev. We use the AdamW optimizer (Loshchilov and Hutter, 2019), an exponential learning rate scheduler, and a dropout of 0.1. The maximum sequence length is 512, the batch size B = 8, and the learning rate η = 1e-5. For each model, we report an average of three runs (each time with a different seed).
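The training configuration above can be outlined as follows. This is a schematic, not the authors' code: the model is a stand-in for the fine-tuned encoder, the Dev evaluation is a placeholder, and the exponential decay factor (gamma) is our assumption, since the paper does not state it:

```python
import torch
from torch import nn

model = nn.Linear(768, 3)  # stand-in for a BERT encoder plus classification heads
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# exponential learning-rate schedule; gamma=0.9 is an assumption, not from the paper
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

MAX_EPOCHS, PATIENCE = 50, 5
best_f1, bad_epochs = 0.0, 0
for epoch in range(MAX_EPOCHS):
    optimizer.step()  # stands in for one pass over Train (batch size 8, max length 512)
    scheduler.step()  # decay the learning rate once per epoch
    dev_f1 = 0.0      # placeholder: compute F1 on Dev here
    if dev_f1 > best_f1:
        best_f1, bad_epochs = dev_f1, 0
    else:
        bad_epochs += 1
        if bad_epochs >= PATIENCE:
            break  # early stopping on Dev
```

In the real setup, dropout of 0.1 is applied inside the encoder, and each configuration is run three times with different seeds to report averaged scores.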
We report F1 along with the standard deviation across the three runs, on both Dev and Test, for each model. All models are implemented using PyTorch, Huggingface Transformers, and a custom version of the Wojood open-source code.

Results
We show the results of our three fine-tuned models across each of the three tasks in Table 4. We briefly highlight these results in the following. Flat NER: the three fine-tuned models achieve comparable results on the flat NER task, with ARBERTv2 scoring slightly better on both the Dev and Test sets. ARBERTv2 achieves an F1 of 92% on the Test set, while MARBERTv2 and ARABERTv2 achieve 91.3% and 90.3%, respectively. Nested NER: ARABERTv2 slightly outperforms the other pre-trained models, with a small margin, on Dev and Test. On Test, it scores 86.6%.

Analysis
For all tasks, all models almost always converge in the first 10 epochs. For all models, there is a positive correlation between performance and the number of training samples. For example, for classes represented well in the training set (e.g., COUNTRY, TOWN and GOV), models perform at 0.90 F1 or above.
The inverse is also true, with poor performance on classes such as SPORT, BOUNDARY and CELESTIAL. There are also some nuances. For example, we can see that the best model struggles with the COM subtype even though it scores good results on classes with fewer samples, such as CLUSTER. The main reason for this is that types such as CLUSTER are a closed set of classes (e.g., "European Union", "African Union") that the model can easily memorize, while COM refers to an open-ended group of commercial entities that cannot be enumerated. Figure 5 plots the number of samples in the training data (X-axis) vs. performance (Y-axis) and clearly shows the general pattern of performance positively correlating with the number of training samples.

Out-of-Domain Performance
To assess the generalization capability of our models, we conducted an evaluation on three unseen domains and different time periods. Three corpora were collected, each covering a distinct domain: finance, science, and politics. These corpora were compiled from Aljazeera news articles published in 2023. Manual annotation of the three corpora was performed in accordance with the same annotation guidelines established for WojoodFine. We apply the three versions of each of our three models trained on the original WojoodFine training data (described in Section 4.2) to the new domains, for each of the three NER tasks. We present the results of this out-of-domain set of experiments in Table 5. We observe that performance drops drastically on all three new domains, for all models on all tasks. This is not surprising, as challenges related to domain generalization are well known in the literature. Our results here, however, allow us to quantify the extent to which model performance degrades on each of these three new domains. In particular, models do much better on the politics domain than on finance or science. This is because our training data are collected from online articles involving news, with much less content from financial or scientific sources. Figure 6 shows some examples of new mentions from those domains that have not been seen in WojoodFine.

Error Analysis
In order to understand the errors made by the model, we conduct a human error analysis on the errors generated by ARABERTv2 (i.e., the best model on this task) on the first 2K tokens of the Dev set for the nested NER with subtypes task. We find that the model's errors can be categorized into six major error classes: (1) wrong tag, where the model predicts a different tag; (2) no prediction, where the model does not produce any tag (i.e., predicts O); (3) missing subtype, where the model succeeds in predicting the parent tag but fails to predict the subtype; (4) missing parent tag, where the model succeeds in predicting the subtype tag but fails to predict the parent tag; (5) MSA vs. DIA confusion, where the model makes a wrong prediction due to confusion between MSA and dialect; and (6) ordinal vs. cardinal, where the model assigns cardinal to an ordinal class.
Figure 7 shows the distribution of the different errors present in the Dev set, with wrong tag being the major source of errors, followed by no prediction. A further breakdown of the wrong tag error class shows that 14.3% of these errors are due to the usage of dialectal words, and a similar proportion are due to nested entities. Table 6 shows an example of each error class.

Conclusion and Future Work
We presented WojoodFine, an extension of the Wojood NER corpus with subtypes for GPE, LOC, ORG, and FAC. The WojoodFine corpus is the first fine-grained corpus for MSA and dialectal Arabic with nested and subtyped NER. The GPE, ORG, FAC and LOC tags form more than 44K tokens of the corpus, which were manually annotated with subtype entities. Our inter-annotator agreement (IAA) evaluation of the WojoodFine annotations achieved high levels of agreement among the annotators: 0.9861 Kappa and 0.9889 F1. We also fine-tuned three pre-trained models, ARBERTv2, MARBERTv2 and ARABERTv2, and tested their performance in different settings of WojoodFine. We find that ARABERTv2 achieved the best performance on the nested and nested with subtypes tasks. In the future, we plan to test pre-trained models on nested subtypes with a hierarchical architecture. We also plan to link named entities with concepts in the Arabic Ontology (Jarrar, 2021, 2011) to enable a richer semantic understanding of text. Additionally, we will extend the WojoodFine corpus to include more dialects, especially the Syrian Nabra dialects (Nayouf et al., 2023) as well as the four dialects in the Lisan corpus (Jarrar et al., 2023b).

Acknowledgment
We would like to thank Sana Ghanem for her invaluable assistance in reviewing and improving the annotations, as well as for her support in the IAA calculations.The authors would also like to thank Tymaa Hammouda for her technical support and expertise in the data engineering of the corpus.

Limitations
A number of considerations related to limitations and ethics are relevant to our work, as follows: • Intended Use. Our models perform named entity recognition at a fine-grained level and can be used for a wide range of information extraction tasks. As we have shown, however, even though the models are trained with data acquired from several domains, their performance drops on data with a distribution different from our training data, such as the finance or science domains. We suggest this be taken into account in any application of the models.
• Annotation Guidelines and Process. Some of the entities are difficult to tag. Even though the annotators have done their best and we report high inter-annotator reliability, our guidelines may need to be adapted before application to new domains.

REGION-GENERAL: Taggable locations that do not cross national borders.
REGION-INTERNATIONAL: Taggable locations that cross national borders.

Figure 3: (a) The direction (/ north east Gaza city) is not annotated in Wojood, while in (b) it is annotated as LOC with Region-General as subtype in WojoodFine.
Figure 4: BERT refers to one of the three pre-trained models we use. For the flat task, each softmax produces one class per token; for the other tasks, the output layer is a set of softmaxes that produce multiple labels per token.

Figure 5: Number of samples vs. F1 for each subtype class on the subtype classification task.

Table 5: Results of the fine-tuned models on the three new domains: Finance, Science, and Politics. M1: MARBERTv2, M2: ARBERTv2, and M3: ARABERTv2. The results are reported as F1 averaged over three runs.

Figure 6: Some mentions from the three new domains that have not previously appeared in WojoodFine: (a) ( ) in the Politics domain, (b) ( ) in the Finance domain, (c) ( OpenAI) in the Science domain.

Figure 7: Distribution of error classes for the nested with subtypes task on our Dev set.

Table 1: Frequency of the four entity types in Wojood and WojoodFine. All GPE, ORG, LOC and FAC tagged tokens in the WojoodFine corpus were annotated with the appropriate subtype based on the context, adding 31 additional entity subtypes to WojoodFine. Table 2 lists the frequency of each subtype in WojoodFine. Tables 7 and 8 in Appendix A present a brief explanation and examples of each subtype.
Throughout our annotation process, the LDC's ACE 2008 annotation guidelines for Arabic Entities V7.4.2 served as the basis for defining our annotation guidelines. Nevertheless, we added new tags (NEIGHBORHOOD, CAMP, SPORT, and ORG_FAC) to cover additional cases.

Table 2: Counts of each subtype entity in the corpus.
Figure 1 illustrates two examples of the difference between the Wojood and WojoodFine guidelines. According to Wojood, /Nigeria is tagged once as GPE and once as ORG, while in WojoodFine both are GPE at the first level, and at the second level one is tagged as Country and the other as GPE_ORG. Facility vs. organization: Wojood annotates buildings as FAC, but if the intended meaning in the context is an organization, then the mention is annotated as ORG. In WojoodFine, all mentions that refer to the facility's organization or social entity are annotated as ORG with the subtype ORG_FAC. Figure 2 illustrates an example of this case. Instead of annotating (/Al-Shifa Hospital) once as FAC and once as ORG, WojoodFine tags it as ORG at the first level and ORG_FAC at the second level.
Figure 2: Two examples illustrating the difference between the Wojood (in blue) and WojoodFine (in red) guidelines for annotating FAC vs. ORG.

Table 3: Overall Kappa and F1-score for each subtype.

Table 6: Examples of error categories made by our best model (ARABERTv2) on our Dev set. We provide the English translation of each sample.

Table 7: Parent type and description of each subtype in WojoodFine.
COM: Commercial organizations that are focused primarily upon providing ideas, products, or services for profit.
EDU: An educational organization that is focused primarily upon the furthering or promulgation of learning/education.
ENT: Entertainment organizations whose primary activity is entertainment.
NONGOV: Non-governmental organizations that are not a part of a government or commercial organization and whose main role is advocacy, charity or politics (in a broad sense).
MED: Media organizations whose primary interest is the distribution of news or publications.
REL: Religious organizations that are primarily devoted to issues of religious worship.
SCI: Medical-science organizations whose primary activity is the application of medical care or the pursuit of scientific research.

Table 8: Parent type and description of each subtype in WojoodFine.

Table 9: Overall IAA for each subtype, reported using Kappa and F1.