Cross-lingual Transfer Learning for Javanese Dependency Parsing

While structure learning achieves remarkable performance in high-resource languages, the situation differs for under-represented languages due to the scarcity of annotated data. This study focuses on assessing the efficacy of transfer learning in enhancing dependency parsing for Javanese, a language spoken by 80 million individuals but characterized by limited representation in natural language processing. We utilized the Universal Dependencies dataset consisting of dependency treebanks from more than 100 languages, including Javanese. We propose two learning strategies to train the model: transfer learning (TL) and hierarchical transfer learning (HTL). While TL only uses a source language to pre-train the model, the HTL method uses a source language and an intermediate language in the learning process. The results show that our best model uses the HTL method, which improves performance with an increase of 10% for both UAS and LAS evaluations compared to the baseline model.


Introduction
Despite having over 80 million native speakers (Simons et al., 2023), Javanese is underrepresented in NLP due to a scarcity of annotated resources. The limited work on Javanese has focused on stemming (Soyusiawaty et al., 2020), POS tagging (Askhabi et al., 2020), sentiment analysis (Tho et al., 2021), and machine translation (Lesatari et al., 2021). However, few studies have explored language structure prediction, such as dependency parsing. Dependency parsing builds a structural representation of a sentence (Kübler et al., 2009) in the form of a dependency tree: a graph whose edges link the words of the sentence.
Recently, Alfina et al. (2023) created a public gold-standard dataset for Javanese with 1000 sentences, published as part of the Universal Dependencies dataset (Zeman et al., 2023). This dataset covers annotation for tokenization, POS tagging, morphological feature tagging, and dependency parsing. The most recent parser performance on this dataset (Alfina et al., 2023) is not satisfactory, with only 77.08% Unlabeled Attachment Score (UAS) and 71.21% Labeled Attachment Score (LAS). The lack of training data is a typical low-resource problem, considered one of the biggest open problems in NLP research (Ruder, 2023).
Transfer learning (TL) leverages a model's knowledge from a high-resource source domain to improve performance on NLP tasks in low-resource domains (Weiss et al., 2016) by transferring learned information to the target task. Inspired by Maulana et al. (2022), who used cross-lingual transfer learning to develop an Indonesian dependency parser, we aim to replicate that outcome for Javanese with its limited dataset. We also implement hierarchical transfer learning (HTL), which uses two stages of transfer learning and offers more flexibility than TL by enabling knowledge transfer between languages with a significant gap (Luo et al., 2019), as demonstrated in diverse applications including Javanese text-to-speech (Azizah et al., 2020) and biomedical named entity recognition (Chai et al., 2022).
We build the dependency parser model for Javanese by adopting the model of Ahmad et al. (2019), which uses a self-attention encoder and a graph-based decoder. We utilize the Universal Dependencies dataset v2.12 (Zeman et al., 2023), which provides dependency treebanks for more than 100 languages, including Javanese. Both TL and HTL use a selection of source languages determined by LangRank (Lin et al., 2020). Specifically, HTL employs Indonesian as an intermediary language, building on our reference study (Maulana et al., 2022). The empirical results show that transfer learning improves accuracy by a margin of 10% over the baseline. We also report a word embedding comparison in which fastText performs better than Javanese BERT, Javanese RoBERTa, and multilingual BERT.
In summary, the main contributions of this paper are as follows:
1. We provide the first study of Javanese dependency parsing using the TL and HTL strategies. We report that the HTL method can significantly improve performance compared to training from scratch.
2. We report an investigation of which source language and word embedding perform best for the TL and HTL strategies.
Related Works

Dependency Parser
A dependency parser can be developed using two methods: transition-based and graph-based (Das and Sarkar, 2020). The transition-based method processes the words of a given sentence one by one, in order (Martin., 2020). The graph-based method assigns a score to each edge of the word relations (Martin., 2020), then searches for the best tree formed from the highest-scoring edges.
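To make the transition-based idea concrete, the following is a minimal sketch that assumes the arc-standard transition system, which is one common choice and is not named in the text above. The function name `apply_transitions`, the English toy sentence, and the hand-written action sequence are all hypothetical; a real parser would predict the actions with a classifier.

```python
def apply_transitions(words, transitions):
    """Apply arc-standard actions and return the head of each word (0 = ROOT)."""
    stack = [0]                                  # index 0 is an artificial ROOT
    buffer = list(range(1, len(words) + 1))      # word positions, left to right
    heads = {}
    for action in transitions:
        if action == "SHIFT":                    # move the next word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":               # second-from-top depends on top
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        elif action == "RIGHT-ARC":              # top depends on second-from-top
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"unknown action: {action}")
    return [heads[i] for i in range(1, len(words) + 1)]


# Toy example: "reads" is the root, "She" and "books" both attach to it.
toy_words = ["She", "reads", "books"]
toy_actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
toy_heads = apply_transitions(toy_words, toy_actions)
```

Each word is consumed exactly once from the buffer, which is what processing the words one by one amounts to in this formulation.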
Apart from these two methods, a parser can also be built using an encoder-decoder architecture. This approach was first developed with a BiLSTM encoder and a deep biaffine decoder (Dozat and Manning, 2017). Encoder variations then emerged using Transformers, or self-attention encoders (Vaswani et al., 2017), and subsequent studies modified them with relative positional embeddings (Shaw et al., 2018). The first Javanese dependency parser (Alfina et al., 2023) uses UDPipe (Straka, 2018), which also utilizes the biaffine attention mentioned above.
In the context of transfer learning, the best combination was found to be a self-attention encoder with a graph-based decoder (Ahmad et al., 2019), which is used in this research. This combination outperformed other encoder-decoder combinations in cross-lingual transfer learning.

Transfer Learning
Transfer learning leverages a pre-trained model's knowledge to enhance the performance of other models (Sarkar and Bali, 2022), addressing resource limitations in low-resource domains. Hierarchical transfer learning is a variant in which a new layer is added before the model is transferred to the low-resource language (Luo et al., 2019). Recent work has shown that transferring multiple times can reduce the dissimilarity between the high-resource and low-resource languages (Azizah et al., 2020).
Transfer learning also offers a direct transfer capability, in which a model is trained on a source task and then applied without any labeled data from the target task. For parsing specifically, Kurniawan et al. (2021) and Ahmad et al. (2019) developed unsupervised parsing models for several languages using only English as the source language. That approach can be improved by fine-tuning on the small dataset available for the low-resource language. Recent work (Maulana et al., 2022) shows that fine-tuning outperforms the zero-shot approach when building a parsing model for another low-resource language, Indonesian.

Method
This section describes the model architecture and the transfer learning methods added to it, the dataset and word embeddings used to train the model, and the evaluation method.

Model Architecture
This work uses the encoder-decoder architecture of Ahmad et al. (2019). No parameters were modified, in order to preserve the success of the previous work. Because training and fine-tuning the model involve resources from several different languages, only language-independent labels are used, without label subtypes.

Encoder
We convert the words and POS tags of the sentence into embeddings. The self-attention encoder (Vaswani et al., 2017) receives an embedding matrix that concatenates the word and POS embedding matrices. The encoder produces two matrices, M and N. Matrix M represents the probability that the word in column j has the word in row i as its head. Matrix N represents the probability that the word in column j has the label in row i.
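The concatenation step described above can be sketched as follows. This is only an illustration under stated assumptions: the embedding tables are filled with random toy values instead of fastText or BERT vectors, the dimensions (4 and 2) are arbitrary, and the names `make_embedding_table` and `embed_sentence` as well as the example sentence are hypothetical.

```python
import random


def make_embedding_table(vocab, dim, seed):
    """Build a toy lookup table with random vectors (stand-in for real embeddings)."""
    rng = random.Random(seed)
    return {item: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for item in vocab}


def embed_sentence(words, tags, word_table, pos_table):
    """Return one row per token: the word vector followed by the POS vector."""
    return [word_table[w] + pos_table[t] for w, t in zip(words, tags)]


# Hypothetical three-token sentence with UPOS tags.
words = ["aku", "mangan", "sega"]
tags = ["PRON", "VERB", "NOUN"]
word_table = make_embedding_table(words, dim=4, seed=0)
pos_table = make_embedding_table(["PRON", "VERB", "NOUN"], dim=2, seed=1)

encoder_input = embed_sentence(words, tags, word_table, pos_table)
```

Each row of `encoder_input` has length 4 + 2 = 6, so the encoder sees both lexical and POS information for every token.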

Decoder
The decoder receives the two matrices and processes them in two steps. First, M is processed with the maximum spanning tree algorithm as follows. Let G = (V, E) be the directed weighted graph constructed from M, in which a vertex is a word representation and an edge carries the dependency score of the two words. Let w : E → R be a function that assigns a weight to each edge in E. The maximum spanning tree problem then seeks a spanning tree T = (V, E_T) of G such that the total edge weight is maximized:

$$\max_{T} \sum_{e \in E_T} w(e)$$

subject to the constraint that T is a tree. Then, a list of heads H is generated from all the destination nodes in E_T. It can be denoted as:

$$H = [\, v \mid (u, v) \in E_T \,]$$

Meanwhile, N is processed to generate L, the list containing the highest-probability label for each word. Finally, the H and L arrays are used to build the final tree produced by the model.
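The two decoding steps can be illustrated with a small sketch. It is not the authors' implementation: for clarity it finds the maximum spanning tree by exhaustive search, which is only feasible for very short sentences, whereas practical parsers use an efficient algorithm such as Chu-Liu/Edmonds. The extra ROOT row and column at index 0 of M, the function names, and the toy scores are assumptions made for the example.

```python
from itertools import product


def reaches_root(heads):
    """True if every word reaches ROOT (index 0) by following its heads."""
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:          # revisiting a word means there is a cycle
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def decode_heads(M):
    """Brute-force search: M[i][j] scores word j taking word i as its head."""
    n = len(M) - 1
    best_heads, best_score = None, float("-inf")
    for heads in product(range(n + 1), repeat=n):
        if not reaches_root(heads):
            continue
        score = sum(M[h][j] for j, h in enumerate(heads, start=1))
        if score > best_score:
            best_heads, best_score = list(heads), score
    return best_heads


def decode_labels(N, label_names):
    """For each word (column of N), pick the label (row) with the highest score."""
    n_words = len(N[0])
    return [label_names[max(range(len(N)), key=lambda i: N[i][j])]
            for j in range(n_words)]


# Toy scores for a three-word sentence; row/column 0 of M is ROOT.
M = [
    [0.0, 0.1, 0.9, 0.1],   # candidate head: ROOT
    [0.0, 0.0, 0.2, 0.3],   # candidate head: word 1
    [0.0, 0.8, 0.0, 0.7],   # candidate head: word 2
    [0.0, 0.1, 0.3, 0.0],   # candidate head: word 3
]
label_names = ["nsubj", "root", "obj"]
N = [
    [0.7, 0.1, 0.2],        # scores for "nsubj"
    [0.1, 0.8, 0.1],        # scores for "root"
    [0.2, 0.1, 0.7],        # scores for "obj"
]

H = decode_heads(M)
L = decode_labels(N, label_names)
```

Here `H` and `L` play the role of the head and label arrays from which the final tree is assembled.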

Word Embedding
This research used two types of word embeddings: a static type, fastText, and a contextual type, BERT. The two types were selected to determine which is more suitable for the Javanese parser model.
We chose fastText for consistency with the previous research (Maulana et al., 2022). We used BERT in two scenarios: a different word embedding for each language (BERT and RoBERTa), and a single word embedding for all languages (multilingual BERT). The BERT and RoBERTa scenario uses all the languages involved except Croatian, for which the resources were unavailable.

Training Method
We perform two training methods: transfer learning and hierarchical transfer learning. Each method generates several models based on the number of source languages used. All models are fine-tuned with the Javanese treebank.
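Standard transfer learning uses only one transfer stage, from a high-resource to a low-resource language, whereas hierarchical transfer learning uses two transfer stages, passing through an intermediate language.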

Choosing Source Languages
Source languages are selected with the help of LangRank (Lin et al., 2020) and references from previous studies. This tool combines two main feature groups for each language pair: corpus statistics and typological information.

The Javanese dataset
For the Javanese dataset, we use the only Javanese treebank available in the UD dataset v2.12, UD_Javanese-CSUI (Alfina et al., 2023). Table 1 shows the statistics of this dataset. UD_Javanese-CSUI is distributed only as a test set because the data size is still relatively small. We therefore split the data ourselves into train, dev, and test sets of 80%, 10%, and 10%.
(Bosco et al., 2022) 14167 298343
UD_Korean-GSD (Chun et al., 2019) 6339 80322
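The 80%/10%/10% split of the Javanese treebank can be sketched as below. The text does not say whether the sentences were shuffled or which seed was used, so the shuffling, the seed value, and the function name `split_treebank` are assumptions for illustration; the sentence count of 1000 comes from the dataset description.

```python
import random


def split_treebank(sentences, seed=0):
    """Shuffle and split into train/dev/test with an 80/10/10 ratio."""
    items = list(sentences)
    random.Random(seed).shuffle(items)       # assumed: random split, fixed seed
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]            # remainder goes to the test set
    return train, dev, test


# Sentence ids stand in for the 1000 annotated sentences.
train, dev, test = split_treebank(range(1000))
```

With 1000 sentences this yields 800 training, 100 development, and 100 test sentences, which also shows how small the evaluation sets are.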

The source language dataset
LangRank recommends the top three languages in the following order: Indonesian, Croatian, and Korean. We also use English, one of the most important languages in NLP research. These four languages are used in the standard transfer learning scenario.
For the hierarchical transfer learning scenario, which uses Indonesian as the intermediary language, we choose English, French, and Italian as the source languages, as suggested by Maulana et al. (2022). In total, we use six source languages.
For each source language, we use only one treebank. If a language has more than one treebank in the UD dataset v2.12, we choose the largest one, as shown in Table 2.

Scenarios
As explained in Section 3.2, we conducted three main scenarios:
1. Training from scratch (FS), or the baseline scenario, in which the models are trained using only the target language, Javanese.
2. Standard transfer learning (TL). We construct four distinct models, one from the treebank of each source language. Each model is then fine-tuned using the Javanese treebank.
3. Hierarchical transfer learning (HTL). First, we train three different models, one from the treebank of each source language. The models are then fine-tuned with the Indonesian treebank before being fine-tuned again with the Javanese treebank.
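The three scenarios differ only in the sequence of treebanks a model sees, which the following sketch enumerates. It does not train anything; the stage names, the scenario keys such as `TL-Indonesian`, and the function names are hypothetical labels chosen for the example, while the language lists follow the source-language selection described earlier.

```python
def training_stages(source, target="Javanese", intermediate=None):
    """Return the ordered (stage, language) steps for one transferred model."""
    stages = [("pre-train", source)]
    if intermediate is not None:              # HTL adds one extra transfer stage
        stages.append(("fine-tune", intermediate))
    stages.append(("fine-tune", target))
    return stages


def all_scenarios():
    """Enumerate the FS, TL, and HTL training schedules."""
    scenarios = {"FS": [("train", "Javanese")]}
    for source in ["Indonesian", "Croatian", "Korean", "English"]:
        scenarios[f"TL-{source}"] = training_stages(source)
    for source in ["English", "French", "Italian"]:
        scenarios[f"HTL-{source}"] = training_stages(source, intermediate="Indonesian")
    return scenarios


scenarios = all_scenarios()
```

This gives one baseline, four TL schedules, and three HTL schedules per word embedding type.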

Environment
Implementation is done in Python environments.
The training process is supported by the NVIDIA-DGX server with GPU NVIDIA A100 10GB, RAM of 64GB, and storage of 1 TB.

Evaluation
All models are evaluated using the unlabeled attachment score (UAS) and labeled attachment score (LAS) metrics, which are the most frequently used for evaluating the dependency parsing model (Nivre and Fang, 2017).The margin of error (MOE) with a 95% confidence level is also used to estimate the range of values within which the true population value is likely to fall.
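As a minimal sketch of these metrics: UAS counts tokens whose predicted head is correct, and LAS counts tokens whose head and label are both correct. The text does not state how the margin of error was computed, so the normal-approximation formula for a proportion used here (z = 1.96 for 95% confidence) is an assumption, as are the function names and the toy data.

```python
import math


def attachment_scores(gold, pred):
    """gold and pred are lists of (head, label) pairs, one per token."""
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))   # head only
    las_hits = sum(g == p for g, p in zip(gold, pred))         # head and label
    return uas_hits / total, las_hits / total


def margin_of_error(score, n, z=1.96):
    """Assumed formula: normal approximation for a proportion over n tokens."""
    return z * math.sqrt(score * (1.0 - score) / n)


# Toy example with four tokens.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
```

In this toy case the third token has the right head but the wrong label, so it counts toward UAS only, and the fourth token counts toward neither.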

Result and Analysis
The evaluation results for all scenarios are shown in Table 3. Scores in bold mark the best model for each word embedding type and metric.

Models Comparison: From Scratch (FS) Model, Transfer Learning (TL) Model, and Hierarchical Transfer Learning (HTL) Model
Table 3 shows that the transfer learning models perform better than the baseline model for all word embeddings. The performance increase is substantial, up to 13% on UAS and 14% on LAS. This confirms previous studies that describe the advantages of transfer learning (Sarkar and Bali, 2022). The lack of resources in Javanese also makes transfer learning a suitable choice. Figure 3 shows that the hierarchical transfer learning method consistently outperforms the transfer learning method, although not by a large margin. Specifically, the comparison focuses on the TL-ID and HTL models, since all models in the HTL scenario use the TL-ID model as their second base for transfer. The difference between these two scenarios shows that adding a suitable high-resource language as the initial source model can yield better performance.

Source Languages Comparison
Table 3 shows that two of the top three recommendations from LangRank give good results. We conclude that LangRank can help predict suitable source languages for the Javanese dependency parser. However, this does not rule out the possibility that other languages also give good results. For TL, no single source language can be identified as the best, since the result depends on the word embedding used by the model. For HTL with Indonesian as the intermediate language, Italian performs best as the source language, followed by English.

Word Embeddings Comparison
Figure 4 shows that the highest UAS was obtained with fastText word embeddings, followed by multilingual BERT, Javanese BERT, and Javanese RoBERTa. For LAS, the order is fastText, multilingual BERT, Javanese RoBERTa, and Javanese BERT. Although fastText is slightly superior, the differences are insignificant when the models' margins of error are considered.

Error Analysis
Table 4 gives more detail about the performance differences. The ten label pairs shown are those with the most errors in the from-scratch model. For some pairs the error is significantly reduced, but for others there is no significant change, or even more errors, in the transfer learning scenarios.
One noteworthy finding is a significant increase in errors where words with the "obj" label are predicted as "obl". This seems contradictory, given that overall model accuracy increases with transfer learning. It turns out that the source and target languages differ slightly in how these two labels are annotated, so the model could not predict the label correctly.
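A ranking of (gold label, predicted label) error pairs of the kind discussed above can be produced with a simple counter. This is an illustrative sketch only: the function name `top_label_errors` and the toy label sequences are hypothetical and do not reproduce the reported results.

```python
from collections import Counter


def top_label_errors(gold_labels, pred_labels, k=10):
    """Count mismatched (gold, predicted) label pairs and return the k most frequent."""
    errors = Counter(
        (g, p) for g, p in zip(gold_labels, pred_labels) if g != p
    )
    return errors.most_common(k)


# Toy data: two "obj" tokens predicted as "obl", one "obl" predicted as "obj".
gold_labels = ["obj", "obj", "nsubj", "obl", "obj"]
pred_labels = ["obl", "obl", "nsubj", "obj", "obj"]
error_pairs = top_label_errors(gold_labels, pred_labels)
```

Running the same count on the from-scratch and transfer-learning predictions, and comparing the totals per pair, gives the kind of side-by-side view needed to spot pairs whose error grows after transfer.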

Conclusions and Future Work
This section presents the conclusions and the improvements that can be developed from this work.

Conclusions
This work investigates whether cross-lingual transfer learning works for dependency parsing of a low-resource language, Javanese. The results show that the cross-lingual transfer learning model is significantly better than the baseline model. Transfer learning improves performance on the UAS and LAS metrics by up to 10%.
The best model was obtained from the hierarchical transfer learning method using Italian and English as the source languages and Indonesian as the intermediary language. Meanwhile, the standard transfer learning method achieved its best accuracy using Indonesian as the source language. However, the differences between standard transfer learning and hierarchical transfer learning are insignificant, considering the margin of error of each scenario.

Future Work
We focused on the model's learning scheme rather than on developing the model with the highest score. We use the architecture from Dozat and Manning (2017) rather than the one built by Mrini et al. (2020), the state of the art for dependency parsing. A better architecture could therefore be used in the future to produce a model with higher evaluation scores.
Our future work also includes further error analysis, especially of the languages chosen by LangRank. It could investigate languages whose demography and characteristics differ from Javanese (Croatian and Korean).

Limitations
The following are the limitations of this research:
1. There is no hyper-parameter tuning treatment in the model creation process.
2. Cross-validation is not performed in the data distribution process.
3. Only one language is used as an intermediary language in hierarchical transfer learning.

Figure 1: Illustration of standard transfer learning method

Figure 3: Comparison of the best model evaluation for each scenario

Table 1: The statistics of the Javanese treebank

Table 3: Evaluation results of all scenarios

Table 4: Top 10 errors of the from-scratch model and its comparison with the transfer-learning model