How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages

Despite the recent advancements of attention-based deep learning architectures across a majority of Natural Language Processing tasks, their application remains limited in a low-resource setting because of a lack of pre-trained models for such languages. In this study, we make the first attempt to investigate the challenges of adapting these techniques to an extremely low-resource language – Sumerian cuneiform – one of the world’s oldest written languages attested from at least the beginning of the 3rd millennium BC. Specifically, we introduce the first cross-lingual information extraction pipeline for Sumerian, which includes part-of-speech tagging, named entity recognition, and machine translation. We introduce InterpretLR, an interpretability toolkit for low-resource NLP and use it alongside human evaluations to gauge the trained models. Notably, all our techniques and most components of our pipeline can be generalised to any low-resource language. We publicly release all our implementations including a novel data set with domain-specific pre-processing to promote further research in this domain.


Introduction
Sumerian is one of the oldest written languages, attested in the cuneiform texts from around 2900 BC and possibly the language of even older protocuneiform texts from the second half of the 4th millennium BC (Englund, 2009). Specialists in Assyriology have recently worked to digitize Sumerian scripts, annotate, and translate a part of them to modern-day languages like English and German.
In this work, we attempt to create the first information extraction and translation pipeline for Sumerian. Specifically, we focus on machine translation from Sumerian to English, and on the sequence labeling tasks of Named Entity Recognition (NER) and Part of Speech (POS) tagging. Figure 1 shows a sample of our raw data, where the Sumerian text has been derived from the tablet-inscribed cuneiform script along with its human-interpreted English translations. Creating an annotated corpus for such a language is a tedious task. We obtain our data from openly available sources and corpora, painstakingly annotated and translated by human experts. Yet, for languages like Sumerian, which are not fully understood by humans themselves, transferring knowledge and patterns from this limited data to learning algorithms becomes extremely difficult. The consequent challenge posed for NER and POS tagging is evident: the lack of annotated data and the fuzzy character-level text make it hard for a model to generalise, irrespective of its size.
Data sets and training subroutines are available at linktr.ee/rachitbansal
† Work was done prior to joining Amazon at Goethe University Frankfurt
In the case of machine translation, the labeled data is composed of incomplete and short phrase-like sentences, especially on the target side. This makes the context largely ambiguous. Moreover, we find that for a majority of medieval and ancient languages the target-side translated text is highly incoherent with modern-day English text, making it impossible to use the latter in semi-supervised and unsupervised settings.
Throughout this study, we elaborate on such challenges faced when working with low-resource languages, and discuss what makes some of these languages, like Sumerian, 'extremely' low-resource. Through extensive experimentation, evaluation, and analysis, we further introduce specific algorithms and modifications to work around them.
In all, our contribution is three-fold:
1. Building and analyzing a variety of algorithms on the unexplored human-annotated Sumerian dataset for the sequence labeling tasks of POS tagging and NER (§3).
2. Introducing the problem of Target-side Incoherence for low-resource settings and its effect on semi-supervised and unsupervised machine translation (§4.2), and investigating specific modifications and methodologies to cope with these constraints (§4).
3. Introducing InterpretLR, a generalisable toolkit to interpret low-resource NLP. We apply it to further study, compare, and evaluate all of our proposed techniques for machine translation and sequence labeling (§7).
Throughout this work, we have conducted human studies and evaluation for our models, in addition to automated metrics. For gauging our models with InterpretLR, we have made use of human annotations.

Data
Sumerian is an ancient language from Iraq that was written using the cuneiform script. Although it shares some typological features with Basque and Turkish (split ergativity, agglutination), it is a language isolate (Englund, 2009). Artifacts inscribed with Sumerian texts date from around 2900 BC until the first century AD. Most of the Sumerian texts found to this day are administrative in nature because, during the Ur III period (the Third Dynasty of Ur), the state administration swelled to an unprecedented level of activity that was not seen again in the history of Mesopotamian culture. Throughout this study, our evaluation sets are composed of Ur III administrative (Admin) text only, which acts as our in-domain data.
Part of the datasets we used were assembled from the Cuneiform Digital Library Initiative (CDLI), the Machine Translation and Automated Analysis of Cuneiform Languages (MTAAC) project (Pagé-Perron et al., 2017), and the Electronic Text Corpus of Sumerian Literature (ETCSL). The CDLI and MTAAC datasets contain the Ur III Administrative (Admin) texts, which are preserved by the CDLI. The MTAAC and ETCSL corpora were both manually annotated for morphology by cuneiform linguists. We divided the data into training and testing sets and then, to reduce data sparsity, performed text augmentation on each set separately using a set of labeled named entities. This increased our combined number of phrases from 25,000 to 48,000, which forms our final dataset for sequence labeling. Figures 2 and 3 provide the distribution of word tokens in our final pre-annotated dataset. The corpus consists of phrases with lengths ranging from 1 to 19 words. These phrases are short because they are translated line by line from the scripts. Around 2,500 phrases were used for testing, while the remaining 45,500 were used for training. For machine translation, the final dataset comprises (i) 10,520 parallel phrases from the Ur III administrative corpus; (ii) 88,460 parallel phrases, all genres combined; and (iii) all monolingual Sumerian data (1.43 million phrases). In all cases, phrases are short, generally ranging from 1 to 5 word tokens.
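The named-entity augmentation mentioned above is not spelled out in the text. One plausible reading, sketched here purely as an assumption, is that each labeled named entity is swapped for another entity carrying the same tag. The function name `augment_phrase`, the toy lexicon, and the tag assignments are illustrative and are not taken from the released code.

```python
import random

def augment_phrase(tokens, tags, lexicon, rng):
    """Return a copy of the phrase in which every named entity is swapped
    for another entity with the same tag (hypothetical scheme)."""
    new_tokens = []
    for token, tag in zip(tokens, tags):
        candidates = [e for e in lexicon.get(tag, []) if e != token]
        # Tokens whose tag has no alternative in the lexicon are kept as-is.
        new_tokens.append(rng.choice(candidates) if candidates else token)
    return new_tokens, list(tags)

# Toy lexicon keyed by entity tag; entries and tags are placeholders.
lexicon = {"PN": ["ur-{d}asznan", "lugal-tesz2-mu"], "GN": ["ur-bi2-lum{ki}"]}
tokens = ["kiszib3", "lugal-tesz2-mu"]
tags = ["N", "PN"]
aug_tokens, aug_tags = augment_phrase(tokens, tags, lexicon, random.Random(0))
```

Because the label sequence is unchanged, each augmented phrase can be added to the training or test split it came from without re-annotation, which matches the statement that the two sets were augmented separately.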

Related Work
Past work on Sumerian-English machine translation (Pagé-Perron et al., 2017; Punia et al., 2020a) has applied a variety of general statistical and neural supervised techniques to the minimal bitext. However, it does not handle the text-level peculiarities any differently than one would for a high-resource language, and thus often fails to capture context, resulting in poor and inconsistent translations. Such a scenario motivates techniques, learning algorithms, and architectures that make optimal use of the vast monolingual data and the parallel sentences while keeping the several linguistic limitations in mind. Thus, we experiment with semi-supervised and unsupervised techniques across the three categories of data augmentation (Sennrich et al., 2016; He et al., 2016), knowledge transfer (Zoph et al., 2016), and pre-training (Conneau and Lample, 2019; Song et al., 2019).
In the past, Pagé-Perron et al. (2017) applied statistical models for morphological analysis and information extraction for Sumerian. However, due to the unavailability of annotated data, these models could not generalise well. Other studies used an unsupervised approach for NER with the help of domain experts, using contextual and spelling rules to build the model. They also post-processed their outputs automatically, which enhanced their results. In this work, we thoroughly investigate a wide range of algorithms for these sequence labeling tasks and consequently take a first step towards effective information extraction for Sumerian.
Here, "NE" stands for named entities, "O" stands for unstructured words. Other tags are in accordance with ORACC.

Part of Speech Tagging and Named Entity Recognition
In this section, we discuss the various algorithms that we investigated for the sequence labeling tasks of POS tagging and NER for Sumerian. The corresponding experimental results are described and discussed in Section 6.
Conditional Random Fields. A CRF (Lafferty et al., 2001) is a discriminative probabilistic classifier that optimises its weights to maximise the conditional probability distribution P(y | x). It takes a set of input features (language- or domain-specific) into account, using the learned weights associated with these features and the previous labels to predict the current label. Since CRFs use feature sets (rules) that are language-specific, the model is more robust, especially for very low-resource languages. In our case, we developed domain-specific rules with the help of previous studies and language experts. A set of these rules is given in the Appendix.
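To make the idea of domain-specific CRF features concrete, the following is a minimal sketch of a per-token feature function. The specific rules (checking for the determinatives `{d}` and `{ki}`, counting hyphen-separated signs, and so on) are illustrative guesses and not the rules from the Appendix; the function names are hypothetical.

```python
def token_features(tokens, i):
    """Hand-crafted features for token i; every rule here is an assumption."""
    tok = tokens[i]
    feats = {
        "bias": 1.0,
        "token": tok,
        "has_divine_det": "{d}" in tok,    # divine determinative
        "has_place_det": "{ki}" in tok,    # place determinative
        "n_signs": len(tok.split("-")),    # hyphen-separated signs
        "has_digit": any(c.isdigit() for c in tok),
    }
    # Neighbouring tokens give the CRF short-range context.
    if i > 0:
        feats["prev_token"] = tokens[i - 1]
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_token"] = tokens[i + 1]
    else:
        feats["EOS"] = True
    return feats

def phrase_features(tokens):
    """One feature dictionary per token in the phrase."""
    return [token_features(tokens, i) for i in range(len(tokens))]

feats = phrase_features(["ur-bi2-lum{ki}", "ba-hul"])
```

A list of feature dictionaries per phrase is the kind of input that common CRF toolkits accept; the CRF then learns one weight per feature and label pair, so an informative rule such as the place determinative can be picked up from very few examples.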
Bi-directional LSTM. We also experiment with Recurrent Neural Networks (RNNs) to handle the sequential text input. We employ the Bi-LSTM (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) in particular. As in Huang et al. (2015), an additional CRF layer is used to efficiently combine sentence-level tag information with the past input features captured by the LSTM cells.

FLAIR. Akbik et al. (2018) introduced a Contextual String Embedding for Sequence Labeling, FLAIR, which has shown great promise for NER across various languages (Akbik et al., 2019b). We make use of the two distinct properties of its embeddings: (i) training without any explicit notion of words and fundamentally modeling the words as a sequence of characters, and (ii) deriving and using the context from surrounding tokens. We train the bi-directional character language model using the Sumerian monolingual phrases and retrieve the contextual embedding for each word, which we then pass into the vanilla Bi-LSTM CRF model.

RoBERTa. We also investigate the transformer-based language model RoBERTa. The encoder is first pre-trained on our Sumerian monolingual data and then fine-tuned on our downstream sequence labeling tasks using the labeled data.

Machine Translation
In this section, we present our experiments for machine translation, primarily focusing on specific data and algorithmic modeling techniques which may be generalised to any extremely low-resource language that may or may not suffer from Target-side Incoherence, a phenomenon which we also introduce herein. All results are summarised in Table 1.

Supervised NMT
In order to create a benchmark for the semi-supervised and unsupervised approaches, we perform supervised machine translation using the limited bitext available (∼10,000 phrases). We perform experiments on the following data configurations:
1. UrIIISeg: Follows the format of the original texts provided by Assyriologists and used in past attempts at Sumerian-English machine translation (Pagé-Perron et al., 2017; Punia et al., 2020b). It contains only in-domain Ur III Admin text with line-by-line translated segments, each of 1-5 words, and amounts to a total of 10528 segments.
2. UrIIIComp: Also contains the in-domain data only, but multiple segments are concatenated to form complete sentences. The 'completeness' of a sentence is ensured through punctuation marks. Since multiple segments are combined, it amounts to only 4792 sentences.
3. AllSeg: Contains out-of-domain Sumerian text segments in addition to the in-domain Ur III Admin text. The additional text spans a wide range of genres such as literary, lexical, ritual, and legal, resulting in a corpus size of 88466 segments.
4. AllComp: Combines the additional features of 2. and 3., thus comprising a total of 32694 complete sentences from all out-of-domain as well as in-domain genres.
We make use of the vanilla transformer encoder-decoder architecture (Vaswani et al., 2017) for all our supervised machine translation experiments over these four bitext configurations. Based on the supervised MT results in Table 1, the AllComp text configuration is used for all further experiments. The computational configurations are given in Section 5.
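The construction of the "Comp" configurations from line-level segments can be sketched as follows. The text only says that completeness is ensured through punctuation marks; the rule used here (merge consecutive segments until the English side ends with a sentence-final mark) and the set of marks are assumptions, and the segment strings are placeholders.

```python
def build_complete_sentences(segments, terminators=(".", "!", "?", ";")):
    """Merge consecutive (source, target) segments until the target side
    ends with a sentence-final punctuation mark (assumed rule)."""
    sentences, src_buf, tgt_buf = [], [], []
    for src, tgt in segments:
        src_buf.append(src)
        tgt_buf.append(tgt)
        if tgt.rstrip().endswith(terminators):
            sentences.append((" ".join(src_buf), " ".join(tgt_buf)))
            src_buf, tgt_buf = [], []
    if src_buf:
        # Keep a trailing, unterminated run as its own sentence.
        sentences.append((" ".join(src_buf), " ".join(tgt_buf)))
    return sentences

segments = [
    ("src-a", "barley rations"),
    ("src-b", "of the weavers;"),
    ("src-c", "under seal of X."),
]
complete = build_complete_sentences(segments)
```

Merging trades the number of training pairs for longer context per pair, which is consistent with the drop from 10528 segments to 4792 sentences reported for the in-domain data.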

Semi-Supervised and Unsupervised NMT
We observed that one of the primary reasons for the lack of success of semi-supervised and unsupervised algorithms in low-resource settings, especially for ancient languages, is the lack of coherence between the monolingual target-language text in modern-day corpora and the target-side text in the available bitext. We refer to this as the Target-side Incoherence (TSIC) problem for such languages. Specifically, as can be seen from Figure 1, the English text in our parallel corpora is vastly different from general modern-day English text. For Sumerian, this is because the text has been human-translated to English at the level of words and small segments, owing to insufficient knowledge of the language. This results in contextually distorted English text compared with what we see in general corpora, which leads to multiple pitfalls. Most significantly, the colossal monolingual data available for a data-rich target-side language (i.e., English in this case) can no longer be used. Target-side Incoherence holds for most ancient-language texts like Sumerian, which makes them 'extremely' low-resource.
In this section, we elaborate on the problems caused by TSIC and present our findings on adapting various semi-supervised and unsupervised NMT techniques to deal with them.
Forward Translation. Back-translation (BT) (Sennrich et al., 2016) has been widely used and analysed for NMT across a large set of language pairs. When the task is to translate Sumerian → English, BT uses a reverse model (Sumerian ← English) trained on the existing parallel corpora and applies it to the target-side monolingual corpus. The synthetic samples thus generated are added to the source side of the corpus, and a new forward model is trained on the augmented dataset. BT has been shown to outperform its forward counterpart, Forward Translation (FT) (Zhang and Zong, 2016; Burlot and Yvon, 2018), which instead uses a forward (Sumerian → English) model to augment the target side of the bitext. However, due to TSIC, the target-side monolingual data falls into a completely different distribution from the one a Sumerian ← English model is trained on. Using back-translation in such a scenario results in poor source-side augmentation, doing more harm than good. Keeping this in mind, we rely on FT, thus using the Sumerian monolingual text.
We divide the Sumerian monolingual data into 8 shards, each containing ∼100,000 monolingual AllComp sentences. FT is carried out shard by shard, and the Transformer model is retrained after each shard is forward-translated.
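The shard-wise procedure above can be sketched as a simple loop. This is a schematic under stated assumptions: `train` and `translate` are stand-in callables rather than the actual Transformer training and beam-search decoding, and the text does not say whether the model is retrained from scratch or continued after each shard (this sketch simply calls `train` again on the grown dataset).

```python
def make_shards(items, n_shards):
    """Split items into n_shards contiguous, near-equal chunks."""
    size, rem = divmod(len(items), n_shards)
    shards, start = [], 0
    for k in range(n_shards):
        end = start + size + (1 if k < rem else 0)
        shards.append(items[start:end])
        start = end
    return shards

def forward_translation(bitext, monolingual, n_shards, train, translate):
    """Grow the bitext with synthetic targets, retraining after each shard."""
    data = list(bitext)
    model = train(data)
    for shard in make_shards(monolingual, n_shards):
        # Translate source-side monolingual text with the current model.
        synthetic = [(src, translate(model, src)) for src in shard]
        data.extend(synthetic)
        model = train(data)
    return model, data

# Stand-ins: "training" records the data size, "translation" upper-cases.
def toy_train(data):
    return {"n_pairs": len(data)}

def toy_translate(model, src):
    return src.upper()

model, data = forward_translation(
    [("a", "A")], ["b", "c", "d", "e", "f"], 2, toy_train, toy_translate
)
```

Because each shard is translated by a model that has already seen the previous shards' synthetic pairs, later shards benefit from earlier ones, at the cost of one extra training run per shard.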
Large-scale studies (Edunov et al., 2018) have shown the heavy dependency of BT and FT on aspects like sampling methods and the amount of parallel data. The performance of non-MAP (maximum a posteriori) estimation methods like nucleus sampling (Holtzman et al., 2018) and beam search with noise improves almost linearly with the amount of bitext; thus, for low-resource settings (∼80,000 sentence pairs), MAP methods have been shown to give better results. This was also observed in our experiments, and the reported results are obtained using beam search (§5).

Cross-Lingual Language Model Pre-training
We further make use of XLM (Conneau and Lample, 2019) to carry out a wide range of experiments with both unsupervised and semi-supervised fine-tuning techniques. Considering the lack of original target-side monolingual text due to TSIC, the following target data configurations were used for pre-training the XLM:
In the pre-training phase, we perform various experiments over different combinations of the MLM and TLM objectives. The XLM is then fine-tuned on a denoising auto-encoding objective for unsupervised training, and on a cross-reference machine translation objective over the parallel data for semi-supervised training. BT steps are also performed in both cases.
Data Augmentation. In order to further reduce the effect of TSIC on model performance and to allow the model to attend to a larger and more diverse volume of target text during pre-training, we make use of the following data augmentation techniques:
1. BERT: Replacing words by the spatially closest words measured using cosine similarity over BERT (Devlin et al., 2019)

Experimental Setup
All our experiments have been implemented in PyTorch, except for the Bi-LSTM and CRF, which were implemented in TensorFlow. In addition, we used FairSeq, FLAIR (Akbik et al., 2019a), HuggingFace Transformers (Wolf et al., 2019), and OpenNMT (Klein et al., 2017). Some experiments were run on a GeForce RTX 2070 GPU, while the pre-training and fine-tuning of FLAIR, RoBERTa, and XLM on various data configurations were performed on two 16 GB Nvidia V100 GPUs. We used development sets to tune the hyper-parameters for all our models, especially those for POS and NER. For RoBERTa and the vanilla transformer, N = 6 encoder layers with h = 16 attention heads were used, while N = 4 and h = 12 were used for XLM. A beam size of 5 was used for our FT experiments. The Adam optimiser (Kingma and Ba, 2015) was used with a learning rate of 0.001, β1 = 0.90, β2 = 0.98, and a decay factor of 0.5. Additional regularisation was done via Dropout and Attention Dropout layers (wherever applicable) with p_drop = 0.1. We used a batch size of 32 or 64 and an early stopping criterion based on the validation loss.

Results and Analysis
Sequence Labeling. Tables 2 and 3 report the metric scores of our different models for the POS and NER tasks, respectively. CRF with domain-specific rules gives the best F1-score for the POS tagging task, better even than the complex RoBERTa and FLAIR language models, which are the current state-of-the-art techniques for most languages. The prevalence of distorted words and short phrases in the corpora makes context learning difficult, although the domain-specific rules help capture short-term dependencies through the learned feature weights.
RoBERTa performs well on both tasks and is the best model for NER (95.37 F1 score). To make the most of the limited vocabulary and noisy text, we used Byte-Level BPE (Radford et al., 2019) to train the language model and further fine-tuned it on our POS and NER datasets with a batch size of 128. We also tried the FLAIR language model with various word embeddings (character, Word2vec, FastText, GloVe) along with an additional CRF layer for both tasks. Although high precision is observed with this approach, the F1 score is significantly lower because of low recall. In addition to the F1 metric, we also conducted a human evaluation by a language expert for the best-performing models: out of 76 randomly selected phrases (496 words), only 8 and 6 words were misclassified by the NER and POS models, giving an error of 1.20 and 1.61%, respectively.
Table 1 summarises our results for all supervised, semi-supervised, and unsupervised techniques. Forward translation on the vanilla transformer outperforms all other techniques by at least 2 BLEU. The variation of its performance with more monolingual source text is shown in Figure 5. The superior performance of AllComp over the other configurations with the vanilla transformer signifies the value of both context and out-of-domain data together. The XLM-based models show lower performance, which could be attributed to the smaller number of encoder layers and attention heads used for them. What is interesting to note, though, is the variation of their performance across training strategies. We experiment with MLM and TLM (+ MLM) initialization for XLM, where the latter comfortably outperforms the former. We do not test with random initialization and CLM, following the conclusions drawn for NMT in Conneau and Lample (2019). Pre-training the XLM on augmented target-side text works surprisingly well.
We note that pre-training on BERT and WordNet augmentations results in better unsupervised performance, while introducing CharSwap improves the semi-supervised models. The human evaluation presented in the table was carried out by three Assyriologists, who rated 100 output examples for each model on a scale of 3. A pairwise inter-annotator agreement of 0.673 (Cohen's Kappa) was observed. 6
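The agreement figure above is reported as pairwise Cohen's Kappa. Assuming it was obtained by averaging kappa over the three annotator pairs (the text does not say how the pairs were combined), the computation can be sketched as follows. The ratings are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    return (p_o - p_e) / (1 - p_e)

def mean_pairwise_kappa(ratings):
    """Average kappa over all annotator pairs (assumed aggregation)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical annotators rating six outputs on a 3-point scale.
r1 = [1, 2, 3, 3, 2, 1]
r2 = [1, 2, 3, 3, 2, 2]
r3 = [1, 2, 3, 2, 2, 1]
kappa = mean_pairwise_kappa([r1, r2, r3])
```

Kappa corrects raw agreement for the agreement expected by chance, so it is undefined when both annotators use a single identical category for every item (the denominator becomes zero); the sketch does not guard against that case.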

Interpretability Analysis
Oftentimes, in the case of deep learning architectures, metric scores like accuracy, F1, and BLEU are unable to portray the true behavior of the models. For languages like Sumerian, human understanding itself is scarce. Visualizing the representations and correlations made by the model could provide insights into which elements of the context can give additional information to support semantic analysis of the terms. Thus, we herein introduce a generalisable interpretability toolkit, InterpretLR, to interpret algorithms for low-resource NLP and further apply it to the aforementioned tasks and models.
6 Elaborate evaluation criteria mentioned in the Appendix.
InterpretLR is primarily aimed at producing attribution saliency maps, i.e., tracing back the model output so as to assign an importance score to each input token based on its 'influence' on that output. We do this using two kinds of interpretability techniques: gradient-based (Sundararajan et al., 2017; Simonyan et al., 2014; Shrikumar et al., 2017) and perturbation-based (Zeiler and Fergus, 2014; Castro et al., 2009).
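As a small illustration of the perturbation-based family, the sketch below computes occlusion-style attributions in the spirit of Zeiler and Fergus (2014): each token is masked in turn and the drop in the model score is taken as its importance. The scorer is a toy stand-in for a real model's output score, and the names `occlusion_attributions`, `toy_score`, and the weights are hypothetical; this is not the InterpretLR implementation.

```python
def occlusion_attributions(tokens, score_fn, mask="<mask>"):
    """Attribution of token i = score(full input) - score(input with i masked)."""
    base = score_fn(tokens)
    attributions = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [mask] + tokens[i + 1:]
        attributions.append(base - score_fn(perturbed))
    return attributions

# Toy scorer standing in for a model's output score (weights are made up).
WEIGHTS = {"sze-ba": 0.5, "geme2": 0.25}

def toy_score(tokens):
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)

attr = occlusion_attributions(["sze-ba", "geme2", "x"], toy_score)
```

Each attribution needs one extra forward pass, so the cost grows linearly with the input length; this is the main reason perturbation-based methods are slower than a single gradient computation.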
Due to the inherently discrete nature of natural language text, the starting point for all our approaches is the embedding of the input sentence across the model to interpret. Most of our analysis is done for the encoder of the network architecture, thus analyzing the effect of different pre-training and fine-tuning techniques on how the model eventually represents the language attributes. We use the word 'Attribution' as a better-defined substitute for the 'Influence' measure of an input span of text on the output. A part of our visual analysis is shown and elaborated here, while a complete analysis with all our models and layer-wise heat-maps is presented in the Appendix.
In Table 4a, we apply InterpretLR to 3 different configurations of XLM for a randomly chosen sentence from the NMT evaluation set. A human expert was asked to annotate the source sentence in accordance with the expected reference for each output token in the actual English translation, as shown in the first column. The highlighted visualizations for each of the 3 models were obtained using Integrated Gradients (Sundararajan et al., 2017) across the three input embeddings: token, position, and language. Several interesting observations can be made from these attributions. Firstly, the named entity in the sentence, ur-{d}asznan (UrAnan), has been wrongly translated by all three models. Although this behavior is expected (learning the context of a named entity is extremely difficult without extensive supervision around it, which is largely absent from our training text), the models even largely fail to attend to the right words in the input. Secondly, words like rations, weavers, and seal, which appear frequently in the parallel Ur III Admin corpora and have a contextual meaning attached to them, are translated perfectly by the models; this property is observed among these models in general. Even the unsupervised models, which do not have access to the one-to-one mapping of the translation during training, manage to infer these words from the appropriate context. It can be assumed that they learn the right representations of such tokens. At the same time, there are instances like sze-ba (barley), which the two unsupervised models rightly attend to but do not translate correctly, a direct result of the absence of supervision. Lastly, English words like under, of, and from do not have any direct translations in Sumerian and are mostly inferred from the context, even by the human annotators. In such places, again, supervision might play a critical role, as in the 4th row of Table 4a.
There are also instances like the 6th row where the supervised model fails to attend to the right words, and the correct output word could very well be a result of memorisation.
Tables 4b and 4c show visualizations for two randomly selected phrases from our sequence labeling tasks, indicating the attribution of each sub-word for tagging the corresponding target word with its predicted label. It can be observed from Table 4b that the word gin (unit) and the sub-word ku contribute positively to the attribution score, depicting positive model attribution for tagging ku3-babbar as a Noun (N), whereas in Table 4c the sub-words ur, hul, and ki contribute to ur-bi2-lum{ki} being tagged with the label GN (Geographical Name). As observed from the corresponding human annotation, ur and ki are the most strongly associated with geographical names, and GNs are mostly followed by a verb part, which is hul (destroy) in this case. It can thus be inferred that RoBERTa identifies this correspondence well and makes its decision accordingly.
7 The left-out tokens were rightly predicted by all the three models, with almost the same attributions.

Conclusion
In this work, we introduced the first information extraction and translation pipeline for Sumerian cuneiform. We first undertook the tasks of POS tagging and NER, where we observed that deeper is not necessarily better: a simple CRF model with well-defined rules outperformed the large language model RoBERTa for POS tagging. Further, for machine translation we overcame unprecedented challenges pertaining to the lack of in-domain text, sparse sentence formation, and incoherence. We found that using out-of-domain text along with specific data augmentation can have a huge impact in a low-resource setting. All components of this work, including InterpretLR, are generalisable to other low-resource languages, and we open the way to future research in this direction.
Forward Translation with the vanilla Transformer gave the best results for Sumerian-English Neural Machine Translation. Figure 5 shows the variation of the BLEU score with the amount of source monolingual data used. Here, the X-axis represents the number of shards used, with each shard consisting of 80K sentences. It can be observed that the translation accuracy does not vary linearly with the amount of text used.
Figure 6 shows the variation of several performance metrics during the unsupervised fine-tuning of various XLM configurations. The comparison is made between XLM pre-training without any data augmentation (MLM TLM), with one augmentation (Aug), and with all three augmentations (Aug 12x). It can be seen from Figure 6a that an XLM pre-trained on the Aug 12x configuration converges the fastest in terms of the main denoising auto-encoding loss. It can also be observed that the curve corresponding to this configuration is much smoother than the others, which shows a positive regularizing effect of a better weight initialisation (through appropriate pre-training). A similar pattern is observed for the validation accuracy across the epochs, as shown in Figure 6c, although the trend of the back-translation loss remains mostly inseparable for the three configurations.
Table 5 reports the net percentage error found by a human expert on the POS and NER results for the entire evaluation set with the best-performing model. Tables 6 and 7 present the detailed results of the POS and NER models. It can be observed from the tables that, although the CRF and RoBERTa models gave the best results, the FLAIR language model with character embeddings also gave high precision for both tasks.

B Extended Interpretations
Here we present the interpretability analysis across a larger set of models and visualisations. We use and compare the different algorithms across layer-level, gradient-based, and perturbation-based techniques to obtain the attributions. Figure 7 visualises the Multi-head Self Attention (MHSA) using Layer Conductance (Dhamdhere, Sundararajan, and Yan 2018) across the 4 encoder layers we employ in XLMs. The first two output tokens, barley and female, are known to map one-to-one to the input words sze-ba and geme2, respectively, while the third output token, barley, is not a direct translation and needs to be inferred from context. Figure 9a shows the attribution heat-map when gradient-normalisation saliency (Simonyan, Vedaldi, and Zisserman 2013) is used. Being one of the most conventional techniques for finding attributions, it is more prone to inconsistent interpretations. The attribution heat-map in Figure 9b, in contrast, shows the Integrated Gradients (IG) (Sundararajan, Taly, and Yan 2017) approach. Being a path-based technique, which measures the gradient attribution along a straight-line path from a baseline (usually all-zeros) to the given input, it is much more robust and stable.
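For reference, the standard definition of IG from Sundararajan, Taly, and Yan (2017) and its usual Riemann-sum approximation can be written as below. The symbols are notation introduced here for illustration: F is the model output being explained, x the input embedding, x' the baseline, and m the number of steps along the straight-line path.

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1
  \frac{\partial F\bigl(x' + \alpha (x - x')\bigr)}{\partial x_i}\, d\alpha
\;\approx\; \frac{x_i - x'_i}{m} \sum_{k=1}^{m}
  \frac{\partial F\bigl(x' + \tfrac{k}{m}(x - x')\bigr)}{\partial x_i}
```

The integral form satisfies completeness, i.e. the attributions sum to F(x) - F(x'); the finite sum only does so approximately, which is why the choice of m affects the result.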
Even though the gradient-based methods are much faster than perturbation-based methods, we find the heavy dependency of IG on hyperparameters, such as the number of steps n_steps taken when going from the baseline to the actual input, to be a major setback. The final attribution is generally obtained by integrating (or summing) over the attributions of these sub-steps. We found that the attributions do not change beyond n_steps = 250; thus, we experiment by varying it between 10 and 250. We observe that there is no ideal value of n_steps: IG's faithfulness to the model varies largely over this range. For some inputs the best value is n_steps = 50, while for others n_steps = 250 is best. We judge this by considering how much attribution is given to the sos and eos tokens for each output token, and thus on the basis of both plausibility and faithfulness. We use n_steps = 50 for obtaining the heat-maps in Figure 9b.
Figure 10 shows the visualization for our sequence labeling tasks. It indicates two major things: 1) the effect of words and sub-words (which depend on tokenization) on tagging the target word, and 2) the effect of the 6 transformer encoder layers. We created a hook on the embeddings of RoBERTa with layer IG and obtained the visualizations of how each sub-word contributes to tagging the target word. Similarly, to obtain the heat-map we created a hook on the RoBERTa embeddings and used Layer Conductance.
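The sensitivity of IG to the number of steps described above can be probed on a toy differentiable function. The sketch below uses a right-endpoint Riemann sum and measures the completeness gap, i.e. how far the summed attributions are from F(x) - F(baseline). This gap is a standard convergence diagnostic and is not the sos/eos criterion used in the text; the toy model, its weights, and the function names are assumptions made for illustration.

```python
import math

W = [1.5, -2.0, 0.5]  # toy weights, purely illustrative

def model(x):
    """Toy scalar model: tanh of a weighted sum."""
    return math.tanh(sum(w * v for w, v in zip(W, x)))

def grad(x):
    """Analytic gradient of the toy model with respect to x."""
    s = 1.0 - model(x) ** 2
    return [s * w for w in W]

def integrated_gradients(x, baseline, n_steps):
    """Right-endpoint Riemann approximation of IG along a straight path."""
    total = [0.0] * len(x)
    for k in range(1, n_steps + 1):
        alpha = k / n_steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(grad(point)):
            total[i] += g
    return [(v - b) * t / n_steps for v, b, t in zip(x, baseline, total)]

def completeness_gap(x, baseline, n_steps):
    """|sum of attributions - (F(x) - F(baseline))|; zero for exact IG."""
    attr = integrated_gradients(x, baseline, n_steps)
    return abs(sum(attr) - (model(x) - model(baseline)))

x, baseline = [1.0, 0.2, 2.0], [0.0, 0.0, 0.0]
gaps = {n: completeness_gap(x, baseline, n) for n in (10, 50, 250)}
```

On this smooth toy function the gap shrinks steadily as the step count grows; on a deep network the path gradients are far less regular, which is consistent with the observation that no single value of n_steps is ideal for every input.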
From Figure 10a it can be observed that ku and du contribute the most to the attribution scores for tagging ku3-babbar and ba-du3 as a Noun (N) and Verb (V), respectively. From the heat-maps it is also noted that ku shows an effect on all 6 layers, whereas in the second example the effects are mainly due to the initial transformer layers. Similarly, in Figure 10b, ur and lugal are the most effective sub-words for tagging ur-bi2-lum{ki} and lugal-tesz2-mu as GN (Geographical Name) and PN (Personal Name), respectively. It is also interesting to note that both