KinyaBERT: a Morphology-aware Kinyarwanda Language Model

Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks. However, the unsupervised sub-word tokenization methods commonly used in these models (e.g., byte-pair encoding - BPE) are sub-optimal at handling morphologically rich languages. Even given a morphological analyzer, naive sequencing of morphemes into a standard BERT architecture is inefficient at capturing morphological compositionality and expressing word-relative syntactic regularities. We address these challenges by proposing a simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality. Despite the success of BERT, most of its evaluations have been conducted on high-resource languages, obscuring its applicability to low-resource languages. We evaluate our proposed method on the low-resource, morphologically rich Kinyarwanda language, naming the proposed model architecture KinyaBERT. A robust set of experimental results reveals that KinyaBERT outperforms solid baselines by 2% in F1 score on a named entity recognition task and by 4.3% in the average score of a machine-translated GLUE benchmark. KinyaBERT fine-tuning converges faster and achieves more robust results on multiple tasks even in the presence of translation noise.


Introduction
Recent advances in natural language processing (NLP) through deep learning have been largely enabled by vector representations (or embeddings) learned through language model pre-training (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017; Peters et al., 2018; Devlin et al., 2019). Language models such as BERT (Devlin et al., 2019) are pre-trained on large text corpora and then fine-tuned on downstream tasks, resulting in better performance on many NLP tasks. Despite attempts to build multilingual BERT models (Conneau et al., 2020), research has shown that models pre-trained on high-quality monolingual corpora outperform multilingual models pre-trained on large Internet data (Scheible et al., 2020; Virtanen et al., 2019). This has motivated many researchers to pre-train BERT models on individual languages rather than adopting the "language-agnostic" multilingual models. This work is partly motivated by the same findings, but also proposes an adaptation of the BERT architecture to address representational challenges that are specific to morphologically rich languages such as Kinyarwanda.
In order to handle rare words and reduce the vocabulary size, BERT-like models use statistical sub-word tokenization algorithms such as byte pair encoding (BPE) (Sennrich et al., 2016). While these techniques have been widely used in language modeling and machine translation, they are not optimal for morphologically rich languages (Klein and Tsarfaty, 2020). In fact, sub-word tokenization methods that are solely based on surface forms, including BPE and character-based models, cannot capture all morphological details. This is due to morphological alternations (Muhirwe, 2007) and non-concatenative morphology (McCarthy, 1981) that are often exhibited by morphologically rich languages. For example, as shown in Table 1, a BPE model trained on 390 million tokens of Kinyarwanda text cannot extract the true sub-word lexical units (i.e. morphemes) for the given words. This work addresses the above problem by proposing a language model architecture that explicitly represents most of the input words with morphological parses produced by a morphological analyzer. In this architecture, BPE is only used to handle words which cannot be directly decomposed by the morphological analyzer, such as misspellings, proper names and foreign words. Given the output of a morphological analyzer, a second challenge is how to incorporate the produced morphemes into the model. One naive approach is to feed the produced morphemes to a standard transformer encoder as a single monolithic sequence, as done by Mohseni and Tebbifakhr (2019). One problem with this method is that mixing sub-word information and sentence-level tokens in a single sequence does not encourage the model to learn the actual morphological compositionality and express word-relative syntactic regularities. We address these issues by proposing a simple yet effective two-tier transformer encoder architecture.
The first tier encodes morphological information, which is then transferred to the second tier to encode sentence level information. We call this new model architecture KinyaBERT because it uses BERT's masked language model objective for pre-training and is evaluated on the morphologically rich Kinyarwanda language.
This work also represents progress in low-resource NLP. Advances in human language technology are most often evaluated on the main languages spoken by major economic powers, such as English, Chinese and European languages. This has exacerbated the language technology divide between highly resourced languages and underrepresented languages. It also hinders progress in NLP research because new techniques are mostly evaluated on mainstream languages, so some NLP advances become less informed by the diversity of linguistic phenomena (Bender, 2019). Specifically, this work provides the following research contributions:
• A simple yet effective two-tier BERT architecture for representing morphologically rich languages.
• New evaluation datasets for Kinyarwanda language including a machine-translated subset of the GLUE benchmark (Wang et al., 2019) and a news categorization dataset.
• Experimental results which set a benchmark for future studies on Kinyarwanda language understanding, and on using machine-translated versions of the GLUE benchmark.
• Code and datasets are made publicly available for reproducibility 1 .

Morphology-aware Language Model
Our modeling objective is to express morphological compositionality in a Transformer-based (Vaswani et al., 2017) language model. For morphologically rich languages such as Kinyarwanda, a set of morphemes (typically a stem and a set of functional affixes) combine to produce a word with a given surface form. This requires an alternative to the ubiquitous BPE tokenization, one that uses exact sub-word lexical units (i.e. morphemes). For this purpose, we use a morphological analyzer which takes a sentence as input and, for every word, produces a stem and zero or more affixes, and assigns a part of speech (POS) tag to the word. This section describes how this morphological information is obtained and then integrated in a two-tier transformer architecture (Figure 1) to learn morphology-aware input representations.
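To make the contrast with BPE concrete, the following is a minimal Python sketch of the kind of per-word output such a morphological analyzer provides. The class name, fields, and the Kinyarwanda segmentation shown are illustrative assumptions only, not the actual analyzer's interface.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical container for one analyzed token.
@dataclass
class MorphoParse:
    surface: str          # the word as written
    pos_tag: str          # part-of-speech tag assigned by the analyzer
    stem: str             # lexical stem
    affixes: List[str] = field(default_factory=list)  # zero or more affixes

def analyze_stub(word: str) -> MorphoParse:
    """Toy stand-in for the analyzer: one hard-coded verbal form decomposed
    into subject/tense/aspect affixes plus a stem (segmentation illustrative).
    Unknown words fall back to a whole-word stem, as BPE would handle them."""
    toy_lexicon = {
        "twarabonye": MorphoParse("twarabonye", "V", "bon", ["tu", "ara", "ye"]),
    }
    return toy_lexicon.get(word, MorphoParse(word, "UNK", word))

parse = analyze_stub("twarabonye")
```

A surface-level BPE tokenizer would instead emit frequency-driven pieces of "twarabonye" that need not align with these morpheme boundaries.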

Morphological Analysis and Part-of-Speech Tagging
Kinyarwanda, the national language of Rwanda, is one of the major Bantu languages (Nurse and Philippson, 2006) spoken in central and eastern Africa. Kinyarwanda has 16 noun classes. Modifiers (demonstratives, possessives, adjectives, numerals) carry a class-marking morpheme that agrees with the class of the main noun. The verbal morphology (Nzeyimana, 2020) also includes subject and object markers that agree with the class of the subject or object. This agreement enables users of the language to approximately disambiguate referred entities based on their classes. We leverage this syntactic agreement property in designing our unsupervised POS tagger.

Figure 1: The morphological analyzer produces morphemes for each word and assigns a POS tag to it. The two-tier transformer model then generates contextualized embeddings (blue vectors at the top). The red embeddings correspond to the POS tags, yellow to the stem embeddings, green to the variable-length affixes, and purple to the affix set.
Our morphological analyzer for Kinyarwanda was built following finite-state two-level morphology principles (Koskenniemi, 1983; Beesley and Karttunen, 2000, 2003). For every inflectable word type, we maintain a morphotactics model using a directed acyclic graph (DAG) that represents the regular sequencing of morphemes. We effectively model all inflectable word types in Kinyarwanda, which include verbals, nouns, adjectives, possessive and demonstrative pronouns, numerals and quantifiers. The morphological analyzer also includes many hand-crafted rules for handling morphographemics and other linguistic regularities of the Kinyarwanda language. The morphological analyzer was independently developed and calibrated by native speakers as a closed-source solution before the current work on language modeling. Similar to Nzeyimana (2020), we use a classifier trained on a stemming dataset to disambiguate between competing outputs of the morphological analyzer. Furthermore, we improve the disambiguation quality by leveraging a POS tagger at the phrase level so that the syntactic context can be taken into consideration.
We devise an unsupervised part-of-speech tagging algorithm which we explain here. Let x = (x_1, x_2, x_3, ..., x_n) be a sequence of tokens (e.g. words) to be tagged with a corresponding sequence of tags y = (y_1, y_2, y_3, ..., y_n). A sample of actual POS tags used for Kinyarwanda is given in Table 12 in the Appendix. Using Bayes' rule, the optimal tag sequence y* is given by:

y* = argmax_y P(y|x) = argmax_y P(x|y) P(y)   (1)

A standard hidden Markov model (HMM) can decompose the result of Equation 1, using first-order Markov and independence assumptions, into P(x|y) = ∏_{t=1}^{n} P(x_t|y_t) and P(y) = ∏_{t=1}^{n} P(y_t|y_{t-1}). The tag sequence y* can then be efficiently decoded using the Viterbi algorithm (Forney, 1973). A better decoding strategy is presented below.
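For reference, the HMM factorization above can be decoded with the standard Viterbi recursion; the following is a generic textbook sketch in Python (the table shapes and log-space scoring are our own choices, not the paper's implementation):

```python
def viterbi(tokens, tags, log_emis, log_trans, log_init):
    """First-order Viterbi decoding of the HMM factorization
    P(x|y) = prod_t P(x_t|y_t), P(y) = prod_t P(y_t|y_{t-1}),
    done in log space for numerical stability."""
    # delta[t][y]: best log-score of any tag prefix ending in tag y at position t
    delta = [{y: log_init[y] + log_emis(tokens[0], y) for y in tags}]
    back = [{}]
    for t in range(1, len(tokens)):
        delta.append({})
        back.append({})
        for y in tags:
            best_prev = max(tags, key=lambda yp: delta[t - 1][yp] + log_trans[(yp, y)])
            delta[t][y] = (delta[t - 1][best_prev] + log_trans[(best_prev, y)]
                           + log_emis(tokens[t], y))
            back[t][y] = best_prev
    # Backtrack from the best final tag.
    y_last = max(tags, key=lambda y: delta[-1][y])
    path = [y_last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

The bidirectional greedy decoder described next replaces this left-to-right dynamic program with confidence-ordered local decisions.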
Inspired by Tsuruoka and Tsujii (2005), we devise a greedy heuristic for decoding y * using the same first order Markov assumptions but with bidirectional decoding.
First, we estimate the local emission probabilities P(x_t|y_t) using a factored model:

P̃(x_t|y_t) = σ(P̃_m(x_t|y_t)) · σ(P̃_p(x_t|y_t)) · σ(P̃_a(x_t|y_t))   (2)

In Equation 2, P̃_m(x_t|y_t) corresponds to the probability/score returned by a morphological disambiguation classifier, representing the uncertainty of the morphology of x_t. P̃_p(x_t|y_t) corresponds to a local precedence weight between competing POS tags. These precedence weights are manually crafted through qualitative evaluation (see Table 12 in the Appendix for examples). P̃_a(x_t|y_t) quantifies the local neighborhood syntactic agreement between Bantu class markers. When there are two or more agreeing class markers in neighboring words, the tagger should be more confident of the agreeing parts of speech. A basic agreement score can be the number of agreeing class markers within a window of seven words around a given candidate x_t. We manually designed a more elaborate set of agreement rules and their weights for different contexts; the actual agreement score P̃_a(x_t|y_t) is therefore a weighted sum of the matched agreement rules. Each of the unnormalized measures P̃ in Equation 2 is mapped to the [0, 1] range using a sigmoid function σ(z|z_A, z_B) given in Equation 3, where z is the score of the measure and [z_A, z_B] is its estimated active range.
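The exact parameterization of σ(z|z_A, z_B) is not reproduced here, so the following Python sketch shows one plausible reading under stated assumptions: each raw measure is squashed by a logistic centered on the midpoint of its estimated active range, and the three squashed measures are multiplied as in the factored model. The steepness constant k and the example ranges are our own assumptions.

```python
import math

def range_sigmoid(z, z_a, z_b, k=8.0):
    """Map a raw score z with estimated active range [z_a, z_b] into (0, 1).
    Assumed form: logistic centered on the range midpoint, scaled so the
    range spans roughly the steep part of the curve."""
    mid = 0.5 * (z_a + z_b)
    scale = (z_b - z_a) / k
    return 1.0 / (1.0 + math.exp(-(z - mid) / scale))

def emission_score(p_m, p_p, p_a, ranges):
    """Factored local emission score: product of the three squashed measures
    (morphological disambiguation score, precedence weight, agreement score)."""
    return (range_sigmoid(p_m, *ranges["m"])
            * range_sigmoid(p_p, *ranges["p"])
            * range_sigmoid(p_a, *ranges["a"]))
```

With this form, a candidate whose neighborhood shows more class-marker agreement (larger p_a) always receives a strictly larger emission score, as the prose requires.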
After estimating the local emission model, we greedily decode y*_t = argmax_{y_t} P̃(y_t|x) in decreasing order of P̃(x_t|y_t), using a first-order bidirectional inference of P̃(y_t|x):

P̃(y_t|x) =
  P(x_t|y_t) P̃(y_t|y*_{t-1}, y*_{t+1}) P̃(y*_{t-1}|x) P̃(y*_{t+1}|x)   if both y*_{t-1} and y*_{t+1} have been decoded;
  P(x_t|y_t) P̃(y_t|y*_{t-1}) P̃(y*_{t-1}|x)   if only y*_{t-1} has been decoded;
  P(x_t|y_t) P̃(y_t|y*_{t+1}) P̃(y*_{t+1}|x)   if only y*_{t+1} has been decoded;
  P(x_t|y_t)   otherwise.   (4)

The first-order transition measures P̃(y_t|y_{t-1}), P̃(y_t|y_{t+1}) and P̃(y_t|y_{t-1}, y_{t+1}) are estimated using count tables computed over the entire corpus by aggregating local emission marginals P̃(y_t) = Σ_{x_t} P̃(x_t, y_t) obtained through morphological analysis and disambiguation.
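The greedy bidirectional decoder can be sketched as follows. The dict-based transition tables and the confidence ordering are our own simplifications of the procedure described above, not the paper's code.

```python
def greedy_bidirectional_decode(tokens, tags, emis, p_fwd, p_bwd, p_both):
    """Greedy tagging in decreasing order of local emission confidence.
    emis(x, y)              ~ P(x_t | y_t)
    p_fwd[(y_prev, y)]      ~ P~(y_t | y_{t-1})
    p_bwd[(y_next, y)]      ~ P~(y_t | y_{t+1})
    p_both[(y_prev, y, y_next)] ~ P~(y_t | y_{t-1}, y_{t+1})
    (all plain dicts of probabilities in this sketch)."""
    n = len(tokens)
    decoded = [None] * n
    score = [0.0] * n  # P~(y*_t | x) of each decoded tag, reused by neighbors
    # Visit positions from most to least confident local emission.
    order = sorted(range(n), key=lambda t: -max(emis(tokens[t], y) for y in tags))
    for t in order:
        def cond(y):
            left = decoded[t - 1] if t > 0 else None
            right = decoded[t + 1] if t + 1 < n else None
            p = emis(tokens[t], y)
            if left is not None and right is not None:
                p *= p_both[(left, y, right)] * score[t - 1] * score[t + 1]
            elif left is not None:
                p *= p_fwd[(left, y)] * score[t - 1]
            elif right is not None:
                p *= p_bwd[(right, y)] * score[t + 1]
            return p
        decoded[t] = max(tags, key=cond)
        score[t] = cond(decoded[t])
    return decoded
```

Confident positions are fixed first and then serve as anchors, so an ambiguous token can condition on already-decoded tags on either side rather than only on its left context.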

Morphology Encoding
The overall architecture of our model is depicted in Figure 1. This is a two-tier transformer encoder architecture made of a token-level morphology encoder that feeds into a sentence/document-level encoder. The morphology encoder is made of a small transformer encoder that is applied to each analyzed token separately in order to extract its morphological features. The extracted morphological features are then concatenated with the token's stem embedding to form the input vector fed to the sentence/document encoder. The sentence/document encoder is made of a standard transformer encoder as used in other BERT models. The sentence/document encoder uses untied position encoding with relative bias as proposed in Ke et al. (2020).
The input to the morphology encoder is a set of embedding vectors, three vectors relating to the part of speech, one for the stem and one for each affix when available. The transformer encoder operation is applied to these embedding vectors without any positional information. This is because positional information at the morphology level is inherent since no morpheme repeats and each morpheme always occupies a known (i.e. fixed) slot in the morphotactics model. The extracted morphological features are four encoder output vectors corresponding to the three POS embeddings and one stem embedding. Vectors corresponding to the affixes are left out since they are of variable length and the role of the affixes in this case is to be attended to by the stem and the POS tag so that morphological information can be captured. The four morphological output feature vectors are further concatenated with another stem embedding at the sentence level to form the input vector for the main sentence/document encoder.
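A small shape-bookkeeping sketch may help clarify how the per-token input to the sentence/document encoder is assembled. The specific dimensions below are hypothetical, chosen only so that the concatenation lands on a BERT "BASE"-style hidden size of 768; the paper's actual sizes may differ.

```python
# Hypothetical dimensions, not the paper's actual configuration.
MORPHO_DIM = 128        # small morphology-encoder hidden size
SENT_STEM_DIM = 256     # sentence-level stem embedding size

def token_input_dim(num_pos_outputs=3, num_stem_outputs=1,
                    morpho_dim=MORPHO_DIM, sent_stem_dim=SENT_STEM_DIM):
    """Per-token input width for the sentence/document encoder:
    the (3 POS + 1 stem) morphology-encoder output vectors are concatenated
    with a separate sentence-level stem embedding. Affix output vectors are
    NOT concatenated; affixes are only attended to inside the morphology
    encoder."""
    morpho_features = (num_pos_outputs + num_stem_outputs) * morpho_dim
    return morpho_features + sent_stem_dim

token_input_dim()  # 4 * 128 + 256 = 768
```

Keeping the affix vectors out of the concatenation is what makes the per-token input width fixed even though words carry variable numbers of affixes.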
The choice of this transformer-based architecture for morphology encoding is motivated by two factors. First, Zaheer et al. (2020) have demonstrated the importance of having "global tokens", such as the [CLS] token in BERT models. These are tokens that attend to all other tokens in the modeled sequence, and they effectively encapsulate some "meaning" of the encoded sequence. Second, the POS tag and stem represent the high-level information content of a word. Therefore, having the POS tag and stem embeddings be transformed into morphological features is a viable option. The POS tag and stem embeddings thus serve as the "global tokens" at the morphology encoder level, since they attend to all other morphemes that can be associated with them.
In order to capture subtle morphological information, we make one of the three POS embeddings span an affix set vocabulary that is a subset of the power set of all affixes. We form an affix set vocabulary V_a made of the N most frequent affix combinations in the corpus. In fact, the morphological model of the language enforces constraints on which affixes can go together for any given part of speech, resulting in an affix set vocabulary that is much smaller than the power set of all affixes. Even after limiting the affix set vocabulary V_a to a fixed size, we can still map any affix combination into V_a by dropping zero or very few affixes from the combination. Note that the affix set embedding still has to attend to all morphemes at the morphology encoder level, making it adapt to the whole morphological context. The affix set embedding is depicted by the purple units in Figure 1 and a sample of V_a is given in Table 13 in the Appendix.
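The vocabulary construction and the drop-few-affixes mapping can be sketched as below. The greedy largest-subset search and the frozenset representation are implementation assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations

def build_affix_set_vocab(corpus_affix_sets, max_size):
    """Keep the max_size most frequent affix combinations seen in the corpus
    (stored as frozensets), plus the empty combination."""
    counts = Counter(frozenset(s) for s in corpus_affix_sets)
    return {frozenset()} | {s for s, _ in counts.most_common(max_size)}

def map_to_vocab(affixes, vocab):
    """Map an arbitrary affix combination into the vocabulary by dropping
    as few affixes as possible: try the full set, then all subsets in
    decreasing size order."""
    target = frozenset(affixes)
    if target in vocab:
        return target
    for k in range(len(target) - 1, -1, -1):
        for subset in combinations(sorted(target), k):
            if frozenset(subset) in vocab:
                return frozenset(subset)
    return frozenset()  # unreachable in practice: the empty set is in vocab
```

Because valid affix co-occurrence is heavily constrained by the morphotactics, the frequent-combination vocabulary covers most words exactly and the subset fallback fires only rarely.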

Pre-training Objective
Similar to other BERT models, we use a masked language model objective. Specifically, 15% of all tokens in the training set are considered for prediction, of which 80% are replaced with [MASK] tokens, 10% are replaced with random tokens and 10% are left unchanged. When prediction tokens are replaced with [MASK] or random tokens, the corresponding affixes are randomly omitted 70% of the time or left in place for 30% of the time, while the units corresponding to POS tags and affix sets are also masked. The pre-training objective is then to predict stems and the associated affixes for all tokens considered for prediction using a two-layer feed-forward module on top of the encoder output.
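The masking policy can be sketched as a per-token decision procedure; this is a hypothetical helper (the actual pre-training code operates on batched tensors), but the branching probabilities follow the text above.

```python
import random

def choose_masking(tokens, rng, mask_rate=0.15):
    """Sketch of the masking policy: 15% of tokens become prediction targets;
    of those, 80% -> [MASK], 10% -> random token, 10% left unchanged.
    For a masked/randomized target, its affixes are dropped 70% of the time
    (POS-tag and affix-set units would also be masked for such targets)."""
    plan = []
    for tok in tokens:
        if rng.random() >= mask_rate:
            plan.append((tok, "keep", False))
            continue
        r = rng.random()
        action = "mask" if r < 0.8 else ("random" if r < 0.9 else "unchanged")
        drop_affixes = action in ("mask", "random") and rng.random() < 0.7
        plan.append((tok, action, drop_affixes))
    return plan
```

The training objective then asks the model to predict the stem and affixes of every non-"keep" token.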
For the affix prediction task, we face a multi-label classification problem: for each prediction token, we predict a variable number of affixes. In our experiments, we tried two methods. In the first, we use the Kullback-Leibler (KL) divergence 2 loss function to solve a regression task of predicting the N-length affix distribution vector. For this case, we use a target affix probability vector a_t ∈ R^N in which each target affix index is assigned probability 1/m and non-target affixes are assigned probability 0. Here m is the number of affixes in the word to be predicted and N is the total number of all affixes. We call this method "Affix Distribution Regression" (ADR) and the model variant KinyaBERT-ADR. Alternatively, we use a cross-entropy loss and predict only the affix set associated with the prediction word; we call this method "Affix Set Classification" (ASC) and the model variant KinyaBERT-ASC.
2 https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
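The ADR target construction and its KL loss can be illustrated in a few lines of Python. This is a scalar sketch of the idea, not the actual training code, which operates on batched tensors.

```python
import math

def adr_target(affix_indices, num_affixes):
    """Target distribution for Affix Distribution Regression: probability
    1/m on each of the word's m affix indices, 0 elsewhere."""
    m = len(affix_indices)
    t = [0.0] * num_affixes
    for i in affix_indices:
        t[i] = 1.0 / m
    return t

def kl_divergence(target, pred, eps=1e-9):
    """KL(target || pred) for probability vectors; eps guards log(0)."""
    return sum(t * math.log(t / (p + eps))
               for t, p in zip(target, pred) if t > 0)
```

ASC avoids this regression entirely by treating each frequent affix combination as a single class and applying an ordinary cross-entropy loss over the affix set vocabulary.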

Experiments
In order to evaluate the proposed architecture, we pre-train KinyaBERT (101M parameters for KinyaBERT-ADR and 105M for KinyaBERT-ASC) on 2.4 GB of Kinyarwanda text, along with 3 baseline BERT models. The first baseline is a BERT model pre-trained on the same Kinyarwanda corpus with the same position encoding (Ke et al., 2020), batch size and number of pre-training steps, but using standard BPE tokenization. We call this first baseline model BERT-BPE (120M parameters). The second baseline is a similar BERT model pre-trained on the same Kinyarwanda corpus but tokenized by the morphological analyzer. For this model, the input is just a sequence of morphemes, in a similar fashion to Mohseni and Tebbifakhr (2019). We call this second baseline model BERT-MORPHO (127M parameters). For BERT-MORPHO, we found that predicting 30% of the tokens achieves better results than 15% because of the many affixes generated. The third baseline is XLM-R (Conneau et al., 2020) (270M parameters), which is pre-trained on 2.5 TB of multilingual text. We evaluate the above models by comparing their performance on downstream NLP tasks.

Pre-training details
The KinyaBERT model was implemented using PyTorch version 1.9. The morphological analyzer and POS tagger were implemented in a shared library using POSIX C. Morphological parsing of the corpus was performed as a pre-processing step, taking 20 hours to segment the 390M-token corpus on a 12-core desktop machine.

Evaluation tasks
Machine-translated GLUE benchmark - The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) has been widely used to evaluate pre-trained language models. In order to assess KinyaBERT performance on such high-level language tasks, we used the Google Translate API to translate a subset of the GLUE benchmark (MRPC, QNLI, RTE, SST-2, STS-B and WNLI tasks) into Kinyarwanda. The CoLA task was left out because it is English-specific. The MNLI and QQP tasks were also not translated because they were too expensive to translate with Google's commercial API. While machine translation adds noise to the data, evaluating on this dataset is still relevant because all compared models have to cope with the same noise. To understand this translation noise, we also ran user evaluation experiments, whereby four volunteers proficient in both English and Kinyarwanda evaluated a random sample of 6000 translated GLUE examples and assigned a score to each example on a scale from 1 to 4 (see Table 11 in the Appendix). These scores help us characterize the noise in the data and contextualize our results with regard to other GLUE evaluations. Results on these GLUE tasks are shown in Table 3.
Named entity recognition (NER) - We use the Kinyarwanda subset of the MasakhaNER dataset (Adelani et al., 2021) for the NER task. This is a high-quality NER dataset annotated by native speakers for major African languages including Kinyarwanda. The task requires predicting four entity types: Persons (PER), Locations (LOC), Organizations (ORG), and date and time (DATE). Results on this NER task are presented in Table 4.
News Categorization Task (NEWS) - For a document classification experiment, we collected a set of categorized news articles from seven major news websites that regularly publish in Kinyarwanda. The articles were already categorized, so no further manual labeling was needed. This dataset is similar to that of Niyongabo et al. (2020), but in our case we limited the number of collected articles per category to 3000 in order to have a more balanced label distribution (see Table 10 in the Appendix). The final dataset contains a total of 25.7K articles spanning 12 categories and has been split into training, validation and test sets in the ratios of 70%, 5% and 25% respectively. Results on this NEWS task are presented in Table 5.
For each evaluation task, we use a two-layer feed-forward network on top of the sentence encoder, as is typically done in other BERT models. The fine-tuning hyper-parameters are presented in Table 14 in the Appendix.

Main results
The main results are presented in Table 3, Table 4, and Table 5. Each result is the average of 10 independent fine-tuning runs. Each average result is shown with the standard deviation of the 10 runs. Except for XLM-R, all other models are pre-trained on the same corpus (See Table 2) for 32K steps using the same hyper-parameters.
On the GLUE tasks, KinyaBERT-ASC achieves a 4.3% better average score than the strongest baseline. KinyaBERT-ASC also leads to more robust results on multiple tasks. The results also show that having just a morphological analyzer is not enough: BERT-MORPHO still under-performs even though it uses morphological tokenization. Multilingual XLM-R achieves the lowest performance in most cases, possibly because it was not pre-trained on Kinyarwanda text and uses inadequate tokenization.
On the NER task, KinyaBERT-ADR achieves the best performance, with an average F1 score about 3.2% better than the strongest baseline. One of the architectural differences between KinyaBERT-ADR and KinyaBERT-ASC is that KinyaBERT-ADR uses three POS tag embeddings while KinyaBERT-ASC uses two. Assuming that POS tagging facilitates named entity recognition, this empirical result suggests that increasing the amount of POS tag information in the model, possibly through diversification (i.e. multiple POS tag embedding vectors per word), can lead to better NER performance. The NEWS categorization task resulted in differing performance between the validation and test sets. This may indicate that solving this task does not require high-level language modeling but rather depends on spotting a few keywords. Previous research on a similar task (Niyongabo et al., 2020) has shown that simple classifiers based on TF-IDF features suffice to achieve the best performance.
The morphological analyzer and POS tagger inherently introduce some level of noise because they do not always perform with perfect accuracy. While we did not have a simple way of assessing the impact of this noise in this work, we can reasonably expect that the lower the noise, the better the results. Improving the morphological analyzer and POS tagger and quantitatively evaluating their accuracy is part of future work. Even though our POS tagger uses heuristic methods and was evaluated mainly through qualitative exploration, we can still see its positive impact on the pre-trained language model. We did not use previous work on Kinyarwanda POS tagging because it differs largely from this work in terms of scale, tag dictionary, and dataset size and availability.
We plot the learning curves during the fine-tuning of KinyaBERT and the baselines. The results in Figure 2 indicate that KinyaBERT fine-tuning has better convergence across all tasks; KinyaBERT-ASC achieves the best convergence in most cases, indicating the effectiveness of its model architecture and pre-training objective. Additional results also show that the positional attention (Ke et al., 2020) learned by KinyaBERT has a more uniform and smoother relative bias, while BERT-BPE and BERT-MORPHO have noisier relative positional bias (see Figure 3 in the Appendix). This is possibly an indication that KinyaBERT allows learning better word-relative syntactic regularities. However, this aspect needs to be investigated more systematically in future research.
While the main sentence/document encoder of KinyaBERT is equivalent to a standard BERT "BASE" configuration on top of a small morphology encoder, overall, the model actually decreases the number of parameters by more than 12% through embedding layer savings. This is because using morphological representation reduces the vocabulary size. Using smaller embedding vectors at the morphology encoder level also significantly reduces the overall number of parameters. Table 8 in Appendix shows the vocabulary sizes and parameter count of KinyaBERT in comparison to the baselines. While the sizing of the embeddings was done essentially to match BERT "BASE" configuration, future studies can shed more light on how different model sizes affect performance.

Ablation study
We conducted an ablation study to clarify some of the design choices made for KinyaBERT architecture. We make variations along two axes: (i) morphology input and (ii) pre-training task, which gave us four variants that we pre-trained for 32K steps and evaluated on the same downstream tasks.
• AFS→STEM+ASC: Morphological features are captured by two POS tag vectors and one affix set vector. We predict both the stem and the affix set. This corresponds to the KinyaBERT-ASC presented in the main results.
• POS→STEM+ADR: Morphological features are carried by three POS tag vectors, and we predict the stem and the affix probability vector. This corresponds to KinyaBERT-ADR.
• AVG→STEM+ADR: Morphological features are captured by two POS tag vectors and the pointwise average of affix hidden vectors from the morphology encoder. We predict the stem and affix probability vector.
• STEM→STEM: We omit the morphology encoder and train a model with only the stem parts without affixes and only predict the stem.
Ablation results presented in Table 6 indicate that using affix sets for both morphology encoding and prediction gives better results for many GLUE tasks. The under-performance of "STEM→STEM" on high resource tasks (QNLI and SST-2) is an indication that morphological information from affixes is important. However, the utility of this information depends on the task as we see mixed results on other tasks.
Due to the large design space for a morphology-aware language model, there are still a number of other design choices that can be explored in future studies. One may vary the number of POS tag embeddings used, the size of the affix set vocabulary, or the dimension of the morphology encoder embeddings. One may also investigate the potential of other architectures for the morphology encoder, such as convolutional networks. Our early attempt at using recurrent neural networks (RNNs) for the morphology encoder was abandoned because it was too slow to train.

Table 6: Ablation results: each result is an average of 10 independent fine-tuning runs. Metrics, dataset sizes and noise statistics are the same as for the main results in Table 3, Table 4 and Table 5.

Related Work
BERT-variant pre-trained language models (PLMs) were initially pre-trained on monolingual high-resource languages. Multilingual PLMs that include both high-resource and low-resource languages have also been introduced (Devlin et al., 2019; Conneau et al., 2020; Xue et al., 2021; Chung et al., 2020). However, it has been found that these multilingual models are biased towards high-resource languages and rely on small amounts of low-quality, uncleaned data for the low-resource languages (Kreutzer et al., 2022). The included low-resource languages are also very limited because they are mainly sourced from Wikipedia articles, where languages with few articles like Kinyarwanda are often left behind (Joshi et al., 2020; Nekoto et al., 2020). Joshi et al. (2020) classify the state of NLP for Kinyarwanda as "Scraping-By", meaning it has been mostly excluded from previous NLP research and requires the creation of dedicated resources and models. Kinyarwanda has been studied mostly in descriptive linguistics (Kimenyi, 1976, 1978a,b, 1988; Jerro, 2016). The few recent NLP works on Kinyarwanda include morphological analysis (Muhirwe, 2009; Nzeyimana, 2020), text classification (Niyongabo et al., 2020), named entity recognition (Adelani et al., 2021; Sälevä and Lignos, 2021), POS tagging (Garrette and Baldridge, 2013; Duong et al., 2014; Fang and Cohn, 2016; Cardenas et al., 2019), and parsing (Sun et al., 2014; Mielens et al., 2015). There is no prior study on pre-trained language modeling for Kinyarwanda.
There are very few works on monolingual PLMs for African languages. To the best of our knowledge, there is currently only AfriBERT (Ralethe, 2020), which was pre-trained on Afrikaans, a language spoken in South Africa. In this paper, we aim to increase the inclusion of African languages in the NLP community by introducing a PLM for Kinyarwanda. Unlike previous works (see Table 15 in the Appendix), which solely pre-trained unmodified BERT models, we propose an improved BERT architecture for morphologically rich languages.
Recently, there has been a research push to improve sub-word tokenization by adopting character-based models (Ma et al., 2020; Clark et al., 2022). While these methods are promising for the "language-agnostic" case, they are still solely based on the surface form of words, and thus have the same limitations as BPE when processing morphologically rich languages. We leave it to future research to empirically explore how these character-based methods compare to morphology-aware models.

Conclusion
This work demonstrates the effectiveness of explicitly incorporating morphological information in language model pre-training. The proposed two-tier Transformer architecture allows the model to represent morphological compositionality. Experiments conducted on Kinyarwanda, a low-resource morphologically rich language, reveal significant performance improvements on several downstream NLP tasks when using the proposed architecture. These findings should motivate more research into morphology-aware language models.

Table 11: Translation quality scores.
Score  Translation quality
1      Invalid or meaningless translation
2      Invalid but not totally wrong
3      Almost valid, but not totally correct
4      Valid and correct translation