calamanCy: A Tagalog Natural Language Processing Toolkit

We introduce calamanCy, an open-source toolkit for constructing natural language processing (NLP) pipelines for Tagalog. It is built on top of spaCy, enabling easy experimentation and integration with other frameworks. calamanCy addresses the development gap by providing a consistent API for building NLP applications and offering general-purpose multitask models with out-of-the-box support for dependency parsing, parts-of-speech (POS) tagging, and named entity recognition (NER). calamanCy aims to accelerate the progress of Tagalog NLP by consolidating disjointed resources in a unified framework. The calamanCy toolkit is available on GitHub: https://github.com/ljvmiranda921/calamanCy.


Introduction
Tagalog is a low-resource language from the Austronesian family, with over 28 million speakers in the Philippines (Lewis, 2009). Despite its speaker population, few resources exist for the language (Cruz and Cheng, 2022). For example, Universal Dependencies (UD) treebanks for Tagalog are tiny (≪ 20k words) (Samson, 2018; Aquino and de Leon, 2020), while domain-specific corpora are sparse (Cabasag et al., 2019; Livelo and Cheng, 2018). In addition, Tagalog language models (LMs) (Cruz and Cheng, 2022; Jiang et al., 2021) are few, while most multilingual LMs (Conneau et al., 2020; Devlin et al., 2019) underrepresent the language (Lauscher et al., 2020). Thus, consolidating these disjointed resources in a coherent framework is still an open problem. The lack of such a framework hampers model development, experimental workflows, and the overall advancement of Tagalog NLP.
To address this problem, we introduce calamanCy, an open-source toolkit for Tagalog NLP. (The name "calamanCy" derives from kalamansi, a citrus fruit native to the Philippines.) It is built on top of spaCy (Honnibal et al., 2020) and offers end-to-end pipelines for NLP tasks such as dependency parsing, parts-of-speech (POS) tagging, and named entity recognition (NER). calamanCy also provides general-purpose pipelines in three different sizes to fit any performance or accuracy requirements. This work has two main contributions: (1) an open-source toolkit with out-of-the-box support for common NLP tasks, and (2) comprehensive evaluations on several Tagalog benchmarks.

Related Work
Open-source toolkits for NLP There has been a growing body of work in the development of NLP toolkits in recent years. For example, DaCy (Enevoldsen et al., 2021) and HuSpaCy (Orosz et al., 2022) serve the language-specific needs of Danish and Hungarian, respectively. In addition, scispaCy (Neumann et al., 2019) and medspaCy (Eyre et al., 2021) were built to focus on scientific text. These tools employ spaCy (Honnibal et al., 2020), an industrial-strength open-source software for natural language processing. Using spaCy as a foundation is optimal, given its popularity and integration with other frameworks such as Hugging Face transformers (Wolf et al., 2020). However, no tool has existed for Tagalog until now. We aim to fill this development gap and serve the needs of the Tagalog language community through calamanCy.
Evaluations on Tagalog NLP Tasks Structured evaluations for core NLP tasks, such as dependency parsing, POS tagging, and NER, are meager. However, we have access to a reasonable amount of data to conduct comprehensive benchmarks. For example, TLUnified (Cruz and Cheng, 2022) is a pretraining corpus that combines news reports (Cruz et al., 2020), a preprocessed version of CommonCrawl (Suarez et al., 2019), and several other datasets. However, it was evaluated on domain-specific corpora that may not easily transfer to more general tasks. In addition, Tagalog has two Universal Dependencies (UD) treebanks, Tagalog Reference Grammar (TRG) (Samson, 2018) and Ugnayan (Aquino and de Leon, 2020), both with POS tags and relational structures for parsing grammar. This paper will fill the evaluation gap by providing structured benchmarks on these core tasks.

Implementation
The best way to use calamanCy is through its trained pipelines. After installing the library, users can access the models in a few lines of code:

    import calamancy as cl

    nlp = cl.load("tl_calamancy_md-0.1.0")
    doc = nlp("Ako si Juan de la Cruz.")

Here, the variable nlp is a spaCy processing pipeline [2] that contains trained components for POS tagging, dependency parsing, and NER. Applying this pipeline to a text will produce a Doc object with various linguistic features. calamanCy offers three pipelines of varying capacity: two static word vector-based models (md, lg), and one transformer-based model (trf). We will discuss how we developed these pipelines in the following section.
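As a minimal sketch of what the resulting Doc object exposes, the snippet below reads the standard spaCy token and entity attributes (token.pos_, token.dep_, token.head, doc.ents). The helper names summarize and load_and_summarize are illustrative and not part of calamanCy, and the model name is the one from the example above, assumed to be installed.

```python
def summarize(doc):
    """Collect per-token annotations and entity spans from a spaCy Doc."""
    tokens = [
        # (surface form, coarse POS tag, dependency label, head's surface form)
        (tok.text, tok.pos_, tok.dep_, tok.head.text)
        for tok in doc
    ]
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return {"tokens": tokens, "entities": entities}


def load_and_summarize(text, model="tl_calamancy_md-0.1.0"):
    """Load a calamanCy pipeline (assumed to be installed) and annotate text."""
    import calamancy as cl  # imported lazily so summarize() works on any Doc

    nlp = cl.load(model)
    return summarize(nlp(text))
```

Calling load_and_summarize("Ako si Juan de la Cruz.") would return one tuple per token plus the recognized entity spans; the exact tags depend on the pipeline that is loaded.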

Pipeline development
Data annotation for NER There is no gold-standard corpus for Tagalog NER, so we built one. To construct the NER corpus, we curated a portion of TLUnified (Cruz and Cheng, 2022) to contain Tagalog news articles. In addition to the author, we recruited two annotators who hold at least a bachelor's degree and are native Tagalog speakers. The three annotators labeled the data over four months using the three entity types shown in Table 1. We chose the entity types to resemble CoNLL (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), a standard NER benchmark. We excluded the MISC label to reduce uncertainty and confusion when labeling. Then, we measured inter-annotator agreement (IAA) by taking the pairwise Cohen's κ on all tokens and averaging over all three annotator pairs. This process resulted in a Cohen's κ score of 0.81.

[2] https://spacy.io/usage/processing-pipelines
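The agreement computation described above can be sketched as follows. This is a from-scratch illustration of pairwise Cohen's κ averaged over annotator pairs, not the authors' script; it assumes each annotator's labels are given as one flat, token-aligned list, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of token labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over every pair of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

With three annotators, mean_pairwise_kappa averages the three pairwise scores, which matches the procedure the text describes.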
To avoid confusion with the original TLUnified pretraining corpus, we will refer to this annotated NER dataset as TLUnified-NER. The final dataset statistics can be found in Table 2. For the dependency parser and POS tagger, we merged the TRG (Samson, 2018) and Ugnayan (Aquino and de Leon, 2020) treebanks to leverage their small yet relevant examples.

Model training
We considered three design dimensions when training the calamanCy pipelines: (1) the presence of pretraining, (2) the word representation, and (3) its size or dimension. Model pretraining involves learning vectors from raw text to inform model initialization. Here, the pretraining objective asks the model to predict some number of leading and trailing UTF-8 bytes for the words, a variant of the cloze task (Devlin et al., 2019). A model's word representation may involve training static word embeddings using floret, an efficient version of fastText (Bojanowski et al., 2017), or using context-sensitive vectors from a transformer (Vaswani et al., 2017). Finally, a model's dimension is our way to tune the tradeoff between performance and accuracy.
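To make the pretraining objective concrete, the sketch below shows how the prediction targets for such an objective could be formed: the first and last few UTF-8 bytes of each word. The value n_bytes=4 and the function names are assumptions for illustration; the text only says "some number" of bytes, and the model that predicts these targets from context is not shown.

```python
def byte_targets(word, n_bytes=4):
    """Return the leading and trailing n_bytes UTF-8 bytes of a word."""
    encoded = word.encode("utf-8")
    return encoded[:n_bytes], encoded[-n_bytes:]


def pretraining_examples(words, n_bytes=4):
    """Pair each word with the byte targets a model would be asked to predict."""
    return [(word, *byte_targets(word, n_bytes)) for word in words]
```

Because the targets are bytes rather than characters, a non-ASCII letter such as "ñ" occupies two positions, and words shorter than n_bytes simply yield their full encoding for both targets.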
The general process involves pretraining on a filtered version of TLUnified, constructing static word embeddings if necessary, and training the downstream components. We used TLUnified-NER to train the NER component, and then trained the dependency parser and POS tagger using the combined treebanks. Ultimately, we devised three language pipelines as seen in Table 3.

Evaluation
Architectures We used spaCy's built-in architectures for each component in the calamanCy pipeline. The token-to-vector layer uses the multi-hash embedding trick (Miranda et al., 2022) to reduce the representation size. For the parser and named entity recognizer, we used a transition-based parser that maps text representations into a series of state transitions. As for the text categorizer, we utilized an ensemble of a bag-of-words model and a feed-forward network.
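The multi-hash embedding idea can be illustrated with a toy, standard-library sketch: each key is hashed several times into a small table and the selected rows are summed, so a table with far fewer rows than the vocabulary can still give most words a distinct vector. This is not spaCy's implementation; the class name, table size, width, and hashing scheme are all assumptions chosen for readability.

```python
import hashlib
import random


class ToyMultiHashEmbed:
    """Toy multi-hash embedding: each key maps to several rows of a small table."""

    def __init__(self, n_rows=1000, width=8, n_hashes=4, seed=0):
        rng = random.Random(seed)
        self.n_rows, self.n_hashes = n_rows, n_hashes
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(n_rows)
        ]

    def rows_for(self, key):
        """Return one row index per hash; distinct salts act as distinct hashes."""
        data = key.encode("utf-8")
        indices = []
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([i])).digest()
            indices.append(int.from_bytes(digest, "little") % self.n_rows)
        return indices

    def __call__(self, key):
        """Embed a key as the element-wise sum of its selected rows."""
        rows = [self.table[r] for r in self.rows_for(key)]
        return [sum(column) for column in zip(*rows)]
```

Two words collide only if all of their row indices coincide, which is what lets the table stay small.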
Experimental set-up We assessed the calamanCy pipelines on various Tagalog benchmarks as detailed in Table 4. We also tested on text categorization, an unseen task, for robustness. For NER evaluation, we used a held-out test split from TLUnified-NER. We measured their performance across five trials and then reported the average and standard deviation. For treebank-related benchmarks (POS tagging and dependency parsing), we followed UD's data split guidelines (Nivre et al., 2022) and performed 10-fold cross-validation to compensate for the size of the corpora (≪ 20k tokens).
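The evaluation protocol above can be sketched with the standard library alone. The snippet assumes contiguous, unshuffled folds and a sample (n - 1) standard deviation; the text does not specify either choice, and the function names are hypothetical.

```python
import statistics


def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def summarize_trials(scores):
    """Return the mean and sample standard deviation of per-trial scores."""
    return statistics.mean(scores), statistics.stdev(scores)
```

Each sentence lands in exactly one test fold, so every example in a small treebank contributes to the reported score once.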
We also tested a cross-lingual transfer learning approach, i.e., finetuning a model from a source language closely related to Tagalog. According to Aquino and de Leon (2020), the closest languages to Tagalog are Indonesian (id), Ukrainian (uk), Vietnamese (vi), Romanian (ro), and Catalan (ca). They obtained these results via a distance metric (Agić, 2017) based on the World Atlas of Language Structures (Haspelmath et al., 2005). However, only uk, ro, and ca have equivalent spaCy pipelines, so we only compared against those three. Finally, we also compared against multilingual language models by finetuning on XLM RoBERTa (Conneau et al., 2020) and an uncased version of multilingual BERT (Devlin et al., 2019). These LMs contain Tagalog in their training pool and are common alternatives for building Tagalog NLP applications.

Table 5: Benchmark evaluation scores for monolingual, cross-lingual, and multilingual pipelines across a variety of tasks and datasets. We evaluated the text categorization and NER tasks across five trials, and then conducted 10-fold cross-validation for dependency parsing. F1-scores are reported on the text categorization and NER tasks.

Discussion
Table 5 shows the F1-scores for the text categorization and NER tasks, the unlabeled (UAS) and labeled attachment scores (LAS) for the dependency parsing task, and the tag accuracy for POS tagging.
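For readers less familiar with these metrics, the sketch below computes UAS and LAS from per-token (head, label) pairs and an exact-match F1 over labeled entity spans. It is an illustration only: the input formats and function names are assumptions, and whether the reported F1-scores use exact span matching is not stated in the text.

```python
def attachment_scores(gold, predicted):
    """Compute (UAS, LAS) from per-token (head_index, dependency_label) pairs."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and equal in length")
    n = len(gold)
    # UAS counts tokens with the correct head; LAS also requires the correct label.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    return correct_heads / n, correct_both / n


def span_f1(gold_spans, predicted_spans):
    """Exact-match F1 over (start, end, label) entity spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

LAS can never exceed UAS, since every token counted for LAS also has the correct head.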
The calamanCy pipelines are competitive across all core NLP tasks while maintaining a smaller compute footprint. As shown in the text categorization and NER results, users with low compute budgets can attain similar performance to multilingual LMs by using medium- or large-sized calamanCy models. The transformer-based calamanCy pipeline is the best option for users who prioritize accuracy. However, we were surprised that most alternative approaches perform better in dependency parsing. We attribute this performance to the added strength of multilingual and cross-lingual information, which is unavailable when training solely on a smaller treebank. We plan to improve dependency parsing performance by building a larger treebank within the Universal Dependencies framework. For practical applications, we recommend that users start with a medium- or large-sized calamanCy model before trying out GPU-intensive pipelines. They can then switch to a transformer-based pipeline for accuracy gains.

Conclusion
In this paper, we introduced calamanCy, a natural language processing toolkit for Tagalog. Our work has two main contributions: (1) an open-source toolkit containing general-purpose multitask pipelines with out-of-the-box support for common NLP tasks, and (2) comprehensive benchmarks that compare against alternative approaches, such as cross-lingual or multilingual finetuning. We hope that calamanCy is a step forward in improving the state of Tagalog NLP. For a low-resource language, consolidating resources into a unified framework is crucial to advance research and improve collaboration. In the future, we plan to create a more fine-grained NER benchmark corpus and extend calamanCy to natural language understanding (NLU) tasks. Finally, the project is hosted on GitHub (https://github.com/ljvmiranda921/calamanCy) and we are happy to receive community feedback and contributions.

Limitations
The TLUnified-NER corpus used to train the NER component of calamanCy comprises news articles from the early 2000s to the present. In addition, the Universal Dependencies (UD) corpora for the POS tagger and dependency parser components are relatively modest in size, containing fewer than 10k tokens. Hence, test-time performance on these tasks could be constrained by these factors.
Finally, reproducing the transformer pipelines may require a T4 or V100 GPU. The biggest bottleneck for reproduction is pretraining on the whole TLUnified corpus. On a 64-vCPU machine with 256GB of RAM, the pretraining process can take three full days for 20 epochs.

We obtained the inter-annotator agreement (IAA) values by computing the pairwise comparisons between all annotator pairs and averaging the results. Table 6 shows the IAA measurements while Figure 1 shows their growth after each annotation round.

Figure 1 :
Inter-annotator agreement measurement after each annotation round. Each mark represents the end of a round. For each round, the annotators discuss disagreements, update the annotation guidelines, and evaluate the current set of annotations.

Table 1 :
Entity types used for annotating TLUnified-NER (derived from the TLUnified pretraining corpus of Cruz and Cheng, 2022).

Table 3 :
Language pipelines available in calamanCy (v0.1.0). The pretraining method for the word-vector models is a variant of the cloze task. All pipelines have a tagger, parser, morphologizer, and ner spaCy component.