Developing a Named Entity Recognition Dataset for Tagalog

We present the development of a Named Entity Recognition (NER) dataset for Tagalog. This corpus helps fill the resource gap for Philippine languages, where NER resources remain scarce. The texts were obtained from a pretraining corpus containing news reports, and were labeled by native speakers in an iterative fashion. The resulting dataset contains ~7.8k documents across three entity types: Person, Organization, and Location. The inter-annotator agreement, as measured by Cohen's $\kappa$, is 0.81. We also conducted an extensive empirical evaluation of state-of-the-art methods across supervised and transfer learning settings. Finally, we released the data and processing code publicly to inspire future work on Tagalog NLP.


Introduction
Tagalog (tl) is one of the major languages in the Philippines, with over 28 million speakers in the country (Lewis, 2009). It constitutes the bulk of Filipino, the country's official language, sharing its lexical items and grammatical structure. Despite this fact, there are little to no resources for Tagalog (Cruz and Cheng, 2022), hampering the development of reliable language technologies.
In this paper, we present TLUNIFIED-NER, a Tagalog dataset for Named Entity Recognition (NER). The texts were obtained from TLUnified (Cruz and Cheng, 2022), a pretraining corpus containing news reports and other types of text. We focused on NER because of its foundational role in several NLP tasks (Tjong Kim Sang and De Meulder, 2003; Lample et al., 2016), especially in problems that require the extraction of structured information. TLUNIFIED-NER consists of ∼7.8k documents across three entity types (Person, Organization, Location), modeled closely on the CoNLL Shared Tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003). Three native speakers conducted the annotation process, resulting in an inter-annotator agreement (IAA) score of 0.81.
We hope that TLUNIFIED-NER will allow researchers to build better NER classifiers for Tagalog, and thereby inspire future research on Tagalog NLP, through the following contributions:
1. We curated and annotated texts from a large pretraining corpus to represent the modern usage of Tagalog in the news domain.
2. We provided performance baselines across a variety of supervised and transfer learning settings.

Related Work
Tagalog language Tagalog is an agglutinative language within the Austronesian family (Kroeger, 1992). It uses the Latin script for its writing system, with 28 letters in its alphabet: twenty-six are the same as in English, with the addition of Ñ/ñ and Ng/ng. Tagalog typically follows the VSO word order, but VOS and SVO are also accepted (Schachter and Otanes, 1973). Although Filipino is the country's official language, it has little to no linguistic difference from Tagalog.
Tagalog NER datasets Unfortunately, resources for Tagalog NER are meager. One major resource is WikiANN (Pan et al., 2017), a silver-standard dataset derived automatically from Wikipedia. There are also language packs for Tagalog (Strassel and Tracey, 2016), but they are not publicly accessible.
TLUNIFIED-NER aims to fill this resource gap by providing a publicly accessible, gold-standard resource for Tagalog NER.

Dataset Collection
The texts were obtained from TLUnified, the pretraining corpus of Cruz and Cheng (2022). It combines news reports (Cruz et al., 2020), a preprocessed version of CommonCrawl (Suarez et al., 2019), and several other datasets. We manually filtered this dataset down to news reports so as to resemble the CoNLL Shared Tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003).
The texts are diverse: the corpus contains articles from different online news sites that ran a published print medium or news channel in Metro Manila from 2009 to early 2020. Topics range from politics and weather to popular science.

Annotation Setup
We used Prodigy (https://prodigy.ai) as our annotation tool. We set up a web server on the Google Cloud Platform and routed the examples through Prodigy's built-in task router. Figure 1 shows the labeling interface as seen by the annotator. We used the ner.manual recipe to highlight spans during the annotation process. We used three entity labels for TLUNIFIED-NER, as shown in Table 1. Unlike CoNLL, we decided to exclude the Miscellaneous (MISC) tag to reduce confusion.
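For reference, the same interface can be launched programmatically via Prodigy's documented serve helper; in the sketch below, the dataset name and input file are placeholders:

```python
import prodigy

# Launch the ner.manual recipe: annotators highlight spans by hand,
# so a blank Tagalog tokenizer suffices. "tlunified_ner" and the
# JSONL path are placeholder names.
prodigy.serve(
    "ner.manual tlunified_ner blank:tl ./news_articles.jsonl "
    "--label PER,ORG,LOC"
)
```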

Annotation Process
The annotation process was done iteratively with three annotators (including the author), all native Tagalog speakers. Given a set annotation budget, we paid the annotators above the country's minimum daily wage. Each annotation round spanned two to three weeks, for a total of six rounds (18 weeks). The annotators labeled the same batch of examples to ensure high overlap.
After each round, the annotators held a retrospective meeting and discussed examples they found confusing, inconsistent with the annotation guidelines, or otherwise noteworthy. This process continued until we reached ∼10k examples or exhausted our annotation budget. In addition, we tracked the training curve to gauge the quality of the collected annotations: if the F1-score improves within the last 25% of the training data, it is a good sign that obtaining more labels will result in better accuracy. The Automatic Content Extraction (ACE 2004/05) annotation document (Doddington et al., 2004) heavily inspired our initial draft. We co-developed the guidelines after each annotation round to improve clarity and reduce disagreements. These guidelines are accessible on GitHub: https://github.com/ljvmiranda921/calamanCy/tree/master/datasets/tl_calamancy_gold_corpus/guidelines
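Prodigy ships a train-curve recipe for this kind of check; a minimal standalone version of the stopping criterion, with hypothetical F1 values, looks like:

```python
# F1-scores after training on 25%, 50%, 75%, and 100% of the
# collected annotations (hypothetical values for illustration).
f1_per_fraction = [0.71, 0.76, 0.79, 0.82]

# If the score still improves on the last 25% of the data, more
# annotations are likely to keep helping.
prev, last = f1_per_fraction[-2], f1_per_fraction[-1]
if last > prev:
    print(f"F1 improved by {last - prev:.2f} on the last 25%; keep annotating.")
else:
    print("F1 plateaued; more labels may not help.")
```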

Corpus Statistics and Evaluation
Table 2 shows the final dataset statistics for TLUNIFIED-NER. We also included span distinctiveness (SD) and boundary distinctiveness (BD) metrics (Papay et al., 2020). They measure the KL-divergence between the unigram word distribution of the spans (or their boundaries) and that of the rest of the corpus. These metrics can be used to gauge the difficulty of the span labeling task (e.g., more distinct spans are "easier" to detect in the text).
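A minimal sketch of span distinctiveness under this definition, using toy token lists and add-one smoothing (the exact formulation is in Papay et al., 2020):

```python
import math
from collections import Counter

def kl_divergence(p_counts: Counter, q_counts: Counter, vocab: set) -> float:
    """KL(P || Q) over a shared vocabulary with add-one smoothing."""
    p_total = sum(p_counts.values()) + len(vocab)
    q_total = sum(q_counts.values()) + len(vocab)
    return sum(
        ((p_counts[w] + 1) / p_total)
        * math.log(((p_counts[w] + 1) / p_total) / ((q_counts[w] + 1) / q_total))
        for w in vocab
    )

# Toy example: tokens inside annotated spans vs. the rest of the corpus.
span_tokens = ["Miriam", "Santiago", "EDSA"]
corpus_tokens = ["ang", "ng", "sa", "Miriam", "Santiago", "EDSA", "billboard"]

vocab = set(span_tokens) | set(corpus_tokens)
sd = kl_divergence(Counter(span_tokens), Counter(corpus_tokens), vocab)
print(f"Span distinctiveness: {sd:.3f}")
```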

Inter-annotator Agreement (IAA)
Similar to Brandsen et al. (2020), we measured two types of Cohen's κ. The first metric calculates κ only for tokens where at least one annotator has made an annotation. The second computes κ for all tokens while ignoring the 'O' label. In addition, we had a third measure: the F1-score using one set of annotations as the reference (Deleger et al., 2012). We performed these computations for each annotator pair and averaged the results, as shown in Table 3. Finally, Figure 2 shows the growth of IAA for each annotation round. Because the annotators labeled the same batch of documents every round, we were able to track the agreement per round.
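A sketch of the first pairwise computation, assuming aligned token-level label sequences per annotator (annotator names and labels below are toy data):

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Token-level labels from each annotator on the same batch (toy data).
annotations = {
    "annotator_1": ["B-PER", "I-PER", "O", "B-LOC", "O"],
    "annotator_2": ["B-PER", "I-PER", "O", "B-ORG", "O"],
    "annotator_3": ["B-PER", "O", "O", "B-LOC", "O"],
}

scores = []
for (name_a, seq_a), (name_b, seq_b) in combinations(annotations.items(), 2):
    # First variant: keep tokens where at least one annotator
    # assigned a label (i.e., not both 'O').
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if (x, y) != ("O", "O")]
    xs, ys = zip(*pairs)
    kappa = cohen_kappa_score(xs, ys)
    scores.append(kappa)
    print(f"{name_a} vs. {name_b}: kappa = {kappa:.2f}")

print(f"Average kappa: {sum(scores) / len(scores):.2f}")
```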

Benchmark results
We trained several NER models using spaCy's transition-based parser (Honnibal et al., 2020). The state transitions are based on the BILUO sequence encoding scheme, and the actions are decided by a convolutional neural network with a maxout activation function (Goodfellow et al., 2013).
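To illustrate the encoding, spaCy can convert character-offset annotations into BILUO tags; the sentence and offsets below are a made-up example:

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("tl")
doc = nlp("Si Miriam Defensor Santiago ay mula sa Iloilo")

# Character-offset entities: (start, end, label).
entities = [(3, 27, "PER"), (39, 45, "LOC")]
print(offsets_to_biluo_tags(doc, entities))
# ['O', 'B-PER', 'I-PER', 'L-PER', 'O', 'O', 'O', 'U-LOC']
```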
While keeping the NER classifier constant, we experimented with various word embeddings, which led to the following configurations:
• Baseline: we trained the transition-based parser "from scratch," without additional information from static or context-sensitive vectors.
• Static vectors: we used Tagalog fastText vectors (Bojanowski et al., 2017) as features for the NER classifier.
• Pretrained vectors (tok2vec): we pretrained the token-to-vector weights of the model. The pretraining objective asks the model to predict some number of leading and trailing UTF-8 bytes for the words, a variant of the cloze task.
• Transformer-based vectors (monolingual): we used RoBERTa Tagalog (Cruz and Cheng, 2022), the only pretrained language model for Tagalog, and fine-tuned it with our annotations.
This experimental setup allows us to see the expected performance when training Tagalog NER classifiers using standard techniques. Table 4 reports the F1-scores on the test set across three trials.
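As a rough sketch of the Baseline configuration (training itself is driven by spaCy's config system; this only illustrates the pipeline assembly):

```python
import spacy

# A blank Tagalog pipeline with spaCy's transition-based NER component.
nlp = spacy.blank("tl")
ner = nlp.add_pipe("ner")
for label in ("PER", "ORG", "LOC"):
    ner.add_label(label)

print(nlp.pipe_names)  # ['ner']
```

The static-vector and transformer variants swap in fastText vectors (via spaCy's init vectors command) and a transformer component, respectively, in the training config.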

Error analysis
From our benchmark results, we noticed that most models have trouble predicting the Location and Organization tags. Figure 3 shows the confusion matrix of the Baseline model on the development set in the IOB format. Most of the mistakes came from incorrectly tagging a token with the outside 'O' label. However, we also noticed instances where the model confuses the lexical and semantic tags of an entity. For example, in the span ". . . panukala ng Ombudsman . . ." (". . . proposed by the Ombudsman . . ."), the token Ombudsman might be a Person or an Organization depending on the context. We hypothesize that including context-sensitive training, which the baseline model lacks, can help mitigate this issue.
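The per-token confusion matrix behind Figure 3 can be reproduced along these lines (the gold and predicted sequences here are toy data):

```python
from sklearn.metrics import confusion_matrix

# Token-level gold vs. predicted tags in the IOB format (toy data).
gold = ["B-ORG", "I-ORG", "O", "B-LOC", "O", "B-PER"]
pred = ["B-PER", "I-PER", "O", "O", "O", "B-PER"]

labels = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"]
print(confusion_matrix(gold, pred, labels=labels))
```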
To test this hypothesis, we experimented with two training configurations. First, we trained a POS tagger jointly with our transition-based NER model, with shared weights; this may provide extra information to the transition-based parser so it can disambiguate between entities. Second, we fine-tuned context-sensitive vectors from RoBERTa Tagalog (Cruz and Cheng, 2022) for NER. Table 5 shows the relative error reduction for the LOC and ORG entities. Given these results, we encourage researchers to utilize context-sensitive vectors such as RoBERTa Tagalog (or other BERT variants) when training models on this corpus.
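Here, relative error reduction is taken in the standard sense: with error $E = 1 - F_1$, the reduction of a model over the Baseline is $(E_{\text{base}} - E_{\text{model}}) / E_{\text{base}}$.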

Comparison to WikiANN
The WikiANN dataset (Pan et al., 2017) is another resource for Tagalog NER.However, we found many annotation errors in the dataset, from misclassifications to fragmented sentences.We investigated how TLUNIFIED-NER fares against WikiANN's silver-standard annotations.
We fine-tuned several models, similar to Section 5.2, on the Tagalog portion of WikiANN's training set and tested them on TLUNIFIED-NER's test set (and vice versa). To properly evaluate on WikiANN, we re-annotated its test set using the same annotation guidelines described in Section 4.
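For reference, WikiANN's Tagalog portion can be loaded from the Hugging Face hub; the sketch below assumes the standard wikiann dataset with its tl configuration:

```python
from datasets import load_dataset

# Load the Tagalog ("tl") configuration of WikiANN.
wikiann_tl = load_dataset("wikiann", "tl")
print(wikiann_tl)  # DatasetDict with train/validation/test splits

# Each example pairs tokens with integer NER tags over PER/ORG/LOC.
example = wikiann_tl["train"][0]
print(example["tokens"], example["ner_tags"])
```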
Our results in Table 6 suggest that models trained on TLUNIFIED-NER are more performant than those trained on WikiANN. Additionally, the gap between WikiANN's silver-standard annotations and our corrections is large, as shown in Table 7. We posit that the gold-standard nature of TLUNIFIED-NER led to better performance than WikiANN, which predominantly consists of text fragments and low-quality annotations.

Comparison to zero-shot prompting
Our results suggest that supervised learning reliably outperforms zero-shot prompting for TLUNIFIED-NER given our prompt (see Appendix A.1). However, we acknowledge that these results are not a definitive comparison between the two methods, as prompt engineering is unstable and exhibits high variance (Webson and Pavlick, 2022; Zhao et al., 2021). In the future, we plan to explore different prompting techniques, such as PromptNER (Ashok and Lipton, 2023) and chain-of-thought prompting (Wei et al., 2023), to uncover the language models' full capabilities.

Conclusion
In this paper, we introduced TLUNIFIED-NER, a Named Entity Recognition dataset for Tagalog. Unlike other Tagalog NER datasets, TLUNIFIED-NER is publicly accessible and gold-standard. Our iterative annotation process, together with our inter-annotator agreement, shows that the corpus is of high quality. In addition, our benchmarking results suggest that the task is learnable even with a simple baseline method. We hope that TLUNIFIED-NER fills the resource gap present in Tagalog NLP today. In the future, we plan to create a more fine-grained (and perhaps overlapping) NER tag set similar to the ACE project and expand to other major Philippine languages. Finally, the dataset is available online (https://huggingface.co/datasets/ljvmiranda921/tlunified-ner), and we encourage researchers to improve upon our benchmark results.
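A minimal loading sketch for the released dataset; the token-classification field names below are assumptions to verify against the dataset card:

```python
from datasets import load_dataset

# Load TLUNIFIED-NER from the Hugging Face hub (URL given above).
tlunified_ner = load_dataset("ljvmiranda921/tlunified-ner")
print(tlunified_ner)

# Inspect one example; a tokens/ner_tags layout is assumed here.
print(tlunified_ner["train"][0])
```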

Limitations
The TLUNIFIED-NER corpus consists mostly of news reports. Although the texts demonstrate the standard usage of Tagalog, their domain is limited. In addition, we only trained a transition-based parser model for our NER classifier. In the future, we plan to extend these benchmarks to include CRFs and other tools such as Stanford Stanza.

Figure 1: Prodigy's annotation interface for a given text. (Translation: MANILA - The owner of the illegal billboards that fell on EDSA this Monday, injuring five people and damaging property, should be caught and imprisoned, according to Senator Miriam Defensor Santiago.)

Figure 2: Growth of IAA for each annotation round.

Table 4: Benchmark results on TLUNIFIED-NER across different word embeddings using spaCy's transition-based parser (Honnibal et al., 2020). Reported results are F1-scores on the test set across three trials.

Figure 3: Development set confusion matrix of the Baseline model predictions in the IOB format.

Table 1: Entity types used for annotating TLUNIFIED-NER (derived from the TLUnified pretraining corpus of Cruz and Cheng, 2022).

Table 2: Dataset statistics for TLUNIFIED-NER, showing the number of examples, the number of tokens, and span-level statistics. SD stands for span distinctiveness, whereas BD is boundary distinctiveness (Papay et al., 2020).

Table 5: Relative error reduction (with respect to the Baseline) for classifying ORG and LOC entities. Reported results are F1-scores on the development set.

Table 6: Cross-dataset comparison between WikiANN (Pan et al., 2017) and TLUNIFIED-NER. We trained a model on WikiANN and then applied it to TLUNIFIED-NER (and vice versa). Reported results are F1-scores on the test set across three trials.