Towards Multilingual Interlinear Morphological Glossing



Introduction
Interlinear Morphological Gloss (IMG) (Lehmann, 2004; Bickel et al., 2008) is an annotation layer aimed at making explicit the meaning and function of each morpheme in some ('object') language L1 under documentation, using a (meta-)language L2. In computational language documentation scenarios, L1 is typically a low-resource language under study, and L2 is a well-resourced language such as English. Figure 1 displays an example IMG: the source sentence t in L1 is overtly segmented into a sequence of morphemes (x), each of which is in one-to-one correspondence with the corresponding gloss sequence y. Each unit in the gloss tier is either a grammatical description (OBL for the oblique marker in Figure 1) or a semantic tag (son in Figure 1), expressed by a lexeme in L2. An idiomatic free translation z in L2 is usually also provided. y and z help linguists unfamiliar with L1 understand the morphological analysis in x.

t: Nesis łˤono uži zown
x: nesi-s łˤono uži zow-n
y: he.OBL-GEN1 three son be.NPRS-PST.UNW
z: He had three sons.
Figure 1: A sample entry in Tsez: L1 sentence (t), its morpheme-segmented version (x), its gloss (y), and an L2 translation (z). Grammatical glosses are in small capitals, lexical glosses in straight orthography.
In this paper, we study the task of automatically computing the gloss tier, assuming that the morphological analysis x and the free L2 translation z are available. As each morpheme has exactly one associated gloss,1 an obvious formalisation of the task, which we mostly adopt, views glossing as a sequence labelling task performed at the morpheme level. Yet, while grammatical glosses effectively constitute a finite set of labels, the diversity of lexical glosses is unbounded, meaning that our tagging model must accommodate an open vocabulary of labels. This issue proves to be the main challenge of this task, especially in small training data regimes.
To handle such cases, we assume that lexical glosses can be directly inferred from the translation tier, an assumption we share with McMillan-Major (2020) and Zhao et al. (2020). In our model, we thus consider that the set of possible morpheme labels in any given sentence is the union of (i) all grammatical glosses, (ii) lemmas occurring in the target translation, and (iii) frequently-associated labels from the training data. This makes our model a hybrid between sequence tagging (because of (i) and (iii)) and unsupervised sequence alignment (because of (ii)), as illustrated in Figure 2. Our implementation relies on a variant of Conditional Random Fields (CRFs) (Lafferty et al., 2001; Sutton and McCallum, 2007), which handles latent variables and offers the ability to locally restrict the set of possible labels. The choice of a CRF-based approach is motivated by its notable data efficiency, whereas methods based on neural networks have difficulties handling very low-resource settings; this is again confirmed by the results in §5.1. In this work, we generalise previous attempts to tackle this task with CRF-based sequence tagging systems (Moeller and Hulden, 2018; McMillan-Major, 2020; Barriga Martínez et al., 2021) and make the following contributions: (a) we introduce (§2) a principled and effective end-to-end solution to the open vocabulary problem; (b) we design, implement and evaluate several variants of this solution (§3), which obtain results that match those of the best-performing systems in the 2023 Shared Task on automatic glossing (§5.1); (c) in experiments with several low-resource languages (§4), we evaluate the benefits of an additional multilingual pre-training step, leveraging features that are useful cross-linguistically (§5.5). Owing to the transparency of CRF features, we also provide an analysis of the most useful features (§5.6) and discuss prospects for improving these techniques.

2 A hybrid tagging / alignment model

The tagging component
The core of our approach is a CRF model, the main properties of which are defined below. Assuming for now that the set of possible glosses is a closed set Y, our approach defines the conditional probability of a sequence y of T labels in Y given a sequence x of T morphemes as:

p_θ(y | x) = (1 / Z_θ(x)) exp( Σ_{k=1}^{K} θ_k G_k(x, y) ),

where {G_k, k = 1…K} are feature functions with associated weights θ = [θ_1 … θ_K]^T ∈ R^K, and Z_θ(x) is the partition function summing over all label sequences. Using CRFs for sequence labelling tasks has long been the best option in the pre-neural era, owing to (i) fast and data-efficient training procedures, even for medium-size label sets (e.g. hundreds of labels (Schmidt et al., 2013)) and higher-order label dependencies (Vieira et al., 2016), and (ii) the ability to handle extremely large sets of interdependent features (Lavergne et al., 2010). CRFs can also be used in combination with dense features computed by deep neural networks (Lample et al., 2016).
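To make the formula concrete, here is a toy, brute-force computation of p_θ(y | x) over a tiny two-label set. This is only an illustration of the equation above, not the paper's implementation: the single hand-made feature and all names are our own, and a real system would use dynamic programming instead of enumeration.

```python
# Brute-force p(y|x) for a tiny linear-chain CRF: enumerate all label
# sequences to compute the partition function Z (real implementations
# use forward-backward; this only illustrates the formula).
import itertools
import math

def g0(y, x):
    # One invented feature: counts long morphemes tagged as lexical.
    return sum(1 for t in range(len(x)) if len(x[t]) > 2 and y[t] == "LEX")

def crf_prob(y, x, labels, theta, features):
    def score(seq):
        return sum(th * g(seq, x) for th, g in zip(theta, features))
    z_part = sum(math.exp(score(seq))
                 for seq in itertools.product(labels, repeat=len(x)))
    return math.exp(score(y)) / z_part

x = ("zow", "n")
p = crf_prob(("LEX", "GRAM"), x, ("LEX", "GRAM"), [1.0], [g0])
# Only the first morpheme is long enough to fire the feature, so
# Z = 2e + 2 and p = e / (2e + 2).
assert abs(p - math.e / (2 * math.e + 2)) < 1e-12
```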

Augmenting labels with translations
One of the challenges of automatic glossing is the need to introduce new lexical glosses in the course of the annotation process. This requires extending the basic CRF approach to incorporate a growing repertoire of lexical labels. We make the assumption [H] that these new labels can be extracted from the L2 translation (z in Figure 1).
Informally, this means that the grammatical label set Y_G now needs to be augmented with the L2 words in z or, equivalently, with indices in [1…|z|]. This raises two related questions: (a) how exactly to specify the set of labels Y in inference and training; and (b) depending on the answer to (a), how to learn the model parameters.
In our model, we additionally consider an extra source of possible lexical labels, Y_L(x), which contains likely glosses for morphemes in x. There are several options to design Y_L(x): for instance, including all the lexical glosses seen in training, or restricting to one or several glosses for each morpheme x_t. In our experiments (§5), we select for each morpheme in x the most frequently associated gloss in the training corpus. Y thus decomposes into a global part Y_G and a sentence-dependent part Y_L(x) ∪ [1…|z|]. Performing inference with this model yields values y_t that either directly correspond to the desired gloss, or correspond to an integer, in which case the (lexical) gloss at position t is z_{y_t}. Formally, the gloss labels are thus obtained as ỹ_t = ϕ(y_t), with ϕ() the deterministic decoding function defined as ϕ(y_t) = z_{y_t} if y_t ∈ [1…|z|], and ϕ(y_t) = y_t otherwise.

Training this hybrid model is more difficult than for regular CRFs, for lack of directly observing y_t. We observe instead ỹ_t = ϕ(y_t): while the correspondence is unambiguous for grammatical glosses, there is an ambiguity when ỹ_t is present in both Y_L(x) and z, or when it occurs multiple times in z. We thus introduce a new, partially observed, variable o_t which indicates the origin of gloss ỹ_t: o_t = 0 when ỹ_t ∈ Y_G ∪ Y_L(x), and o_t > 0 when ỹ_t comes from the L2 word z_{o_t}. The full model defines a joint distribution p_θ(ỹ, o | x) over glosses and their origins. By making the origin of the lexical label(s) explicit, we distinguish in §3.3 between feature functions associated with word occurrences in z and those for word types in Y_L(x) (Täckström et al., 2013). Learning θ with partially observed variables is possible in CRFs and yields a non-convex optimisation problem (see e.g. Blunsom et al., 2008). In this case, the gradient of the objective is a difference of two expectations (Dyer et al., 2011, eq. (1)) and can be computed with forward-backward recursions. We, however, pursued another approach, which relies on an automatic word alignment a between lexical glosses and the translation (§3.1) to provide proxy information for o. Assuming a_t = 0 for grammatical glosses and unaligned lexical glosses, and a_t > 0 otherwise, we can readily derive the supervision information o_t needed in training, according to the heuristics detailed in Table 1, which depend on the values of y_t and a_t.
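As an illustrative sketch of the hybrid label space Y = Y_G ∪ Y_L(x) ∪ [1…|z|] and of the decoding function ϕ described above (a simplification with 0-based indices; the function names are our own, not the paper's implementation):

```python
# Hedged sketch: the per-sentence label space is the union of the
# grammatical glosses, dictionary glosses for the morphemes of x, and
# indices into the L2 translation z; phi maps integer labels to the
# corresponding word of z, and keeps other labels unchanged.

def build_label_space(grammatical, dictionary, morphemes, translation):
    y_g = set(grammatical)                                  # global part
    y_l = {dictionary[m] for m in morphemes if m in dictionary}
    z_idx = set(range(len(translation)))                    # word indices
    return y_g | y_l | z_idx

def phi(y_t, translation):
    return translation[y_t] if isinstance(y_t, int) else y_t

z = ["he", "have", "three", "son"]
labels = build_label_space({"GEN1", "PST.UNW"}, {"zow": "be.NPRS"},
                           ["nesi", "s", "zow", "n"], z)
assert "be.NPRS" in labels and 3 in labels
assert phi("GEN1", z) == "GEN1"        # grammatical gloss, unchanged
assert phi(3, z) == "son"              # lexical gloss taken from z
```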
Three heuristic supervision schemes are given in Table 1; they vary in how ambiguous label sources are handled: (S1) only considers dictionary entries, which makes the processing of unknown words impossible; (S2) only considers translations, possibly disregarding correct supervision from the dictionary; (S3) assumes that the label originates from the translation only if an exact match is found.

2 I.e., the correct label is always part of the search space.

Table 1: Three supervision schemes for the hybrid CRF model. (*) means that the correct label occurs neither in the dictionary nor in the translation; to preserve reference reachability,2 we augment Y_L(x) with the correct label.
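The (S3) scheme can be sketched as follows. This is a hedged simplification of Table 1, with our own function names: the gold gloss is treated as originating from the translation only when it exactly matches the aligned L2 word.

```python
# Hedged sketch of supervision scheme (S3): the gold gloss counts as
# coming from the translation only on an exact match with the aligned
# L2 word; otherwise o_t = 0 (dictionary / grammatical origin).
def origin_s3(gold_gloss, a_t, z):
    """a_t: 0 for grammatical/unaligned glosses, else 1-based index into z."""
    if a_t > 0 and z[a_t - 1] == gold_gloss:
        return a_t                      # o_t > 0: supervised from z
    return 0                            # o_t = 0: supervised from Y_G / Y_L(x)

z = ["he", "have", "three", "son"]
assert origin_s3("son", 4, z) == 4      # exact match: origin is the L2 word
assert origin_s3("khan", 4, z) == 0     # mismatch: fall back to dictionary
assert origin_s3("GEN1", 0, z) == 0     # grammatical gloss
```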
3 Implementation choices

Aligning lexical glosses with target words
To align the lexical glosses with the L2 translation, we use SimAlign (Jalili Sabet et al., 2020), an unsupervised, multilingual word aligner which primarily computes source/target alignment links based on the similarity of the corresponding embeddings in some multilingual space. Note that our task is much simpler than word alignment in bitexts, as the lexical glosses and the translation are often in the same language (e.g. English or Spanish), meaning that similarities can be computed in a monolingual embedding space. We extract alignments from the similarity matrix with the Match heuristic, as it gave the best results in preliminary experiments. Match views alignment as a maximal matching problem in the weighted bipartite graph containing all possible links between lexical glosses and L2 words. This ensures that all lexical morphemes are aligned with exactly one L2 word.3

glosses: one home two khan place become be in
z: one home there is no place for two kings

Figure 3 displays an alignment computed with the Match method. Most alignments are trivial and associate identical units (e.g. one/'one') or morphologically related words (e.g. son/'sons'). Non-trivial (correct) links comprise (khan/'kings'), which is the best option as 'khan' does not occur in the translation. A less positive case is the (erroneous) alignment of be with 'for', which only exists because of the constraint of aligning every lexical gloss. Nevertheless, frequent lemmas such as 'be' will occur in multiple sentences, and their correct labels are often observed in other training sentences. We analyse these alignments in §5.4.
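The Match heuristic amounts to maximum-weight bipartite matching over the similarity matrix. The snippet below sketches this idea with SciPy's assignment solver as a stand-in; SimAlign's own implementation differs in its details.

```python
# Sketch of the Match heuristic: one-to-one alignment of lexical glosses
# to L2 words via maximum-weight bipartite matching (illustrative only).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_align(sim):
    """sim[i, j] = similarity between lexical gloss i and L2 word j.
    Returns one aligned word index per gloss (assumes #glosses <= #words)."""
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return dict(zip(rows.tolist(), cols.tolist()))

sim = np.array([[0.9, 0.1, 0.2],
                [0.8, 0.7, 0.1]])
# A greedy scheme would send both glosses to word 0; the matching
# enforces one-to-one links: gloss 0 -> word 0, gloss 1 -> word 1.
assert match_align(sim) == {0: 0, 1: 1}
```

This one-to-one constraint is exactly what produces the occasional forced link discussed below (e.g. be/'for'), since every lexical gloss must be matched to some word.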

Implementing the hybrid CRF model
Following e.g. Dyer et al. (2011) and Lavergne et al. (2011), our implementation of the CRF model4 heavily relies on weighted finite-state models and operations (Allauzen et al., 2007), which we use to represent the spaces of all possible and reference labellings on a per-sentence basis, to efficiently compute the expectations involved in the gradient, and to search for the optimal labellings and compute alignment and label posteriors.
Training is performed by optimising the penalised conditional log-likelihood with a variant of gradient descent (Rprop; Riedmiller and Braun, 1993), with ℓ1 regularisation to perform feature selection; the regularisation strength of 0.5 was set during preliminary experiments and kept fixed for producing all the results below.
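For readers unfamiliar with Rprop, here is a minimal sketch of one update step: sign-based updates with per-parameter adaptive step sizes. The hyper-parameter values are the usual textbook defaults, not the paper's settings.

```python
# Minimal Rprop step (illustrative): grow a parameter's step size while
# its gradient keeps the same sign, shrink it on a sign flip, and move
# against the sign of the gradient.
def rprop_step(theta, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    for k in range(len(theta)):
        s = grad[k] * prev_grad[k]
        if s > 0:                       # same sign: grow the step
            step[k] = min(step[k] * eta_plus, step_max)
        elif s < 0:                     # sign flip: shrink the step
            step[k] = max(step[k] * eta_minus, step_min)
        if grad[k] > 0:
            theta[k] -= step[k]
        elif grad[k] < 0:
            theta[k] += step[k]
    return theta, step

# Minimise f(theta) = theta^2 (gradient 2*theta) from theta = 3.0:
theta, step, prev = [3.0], [0.1], [0.0]
for _ in range(100):
    grad = [2 * theta[0]]
    theta, step = rprop_step(theta, grad, prev, step)
    prev = grad
assert abs(theta[0]) < 0.5
```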
Given the size of our training data (§4.1) and typical sentence lengths, training and decoding are very fast, even with hundreds of labels and millions of features. A full experiment for Lezgi takes about 20 minutes on a desktop machine; processing the larger Tsez dataset takes about 10 hours.

Observation and label features
Our implementation can handle a very large set of sparse feature functions g_k(), testing arbitrary properties of the input L1 sequence in conjunction with either isolated labels (unigram features) or pairs of labels (bigram features). Regarding L1, we distinguish between orthographic features, which test various properties of the morpheme string (its content, prefix, suffix, CV structure and length), and positional features, which give information about the position of the morpheme in a word; all these features can also test the properties of the surrounding morphemes (within the same word or in its neighbours). Note that a number of these features abstract away from the orthographic content, a property we exploit in our multilingual model.

4 Experimental settings

Datasets

We consider five (out of seven) languages from the SIGMORPHON 2023 Shared Task on Interlinear Glossing (Ginn et al., 2023): Tsez (ddo), Gitksan (git), Lezgi (lez), Natugu (ntu; surprise language), and Uspanteko (usp; target translation in Spanish).7 Table 2 gives general statistics about the associated datasets; a brief presentation of these languages is in Appendix B and in (Ginn et al., 2023).

SimAlign settings
Since the glosses and the translation are in the same language, we use the embeddings from English BERT (bert-base-uncased) (Devlin et al., 2019) when the L2 language is English, and mBERT (bert-base-multilingual-uncased) for Spanish (for Uspanteko). We stress here that our model is compatible with several target languages, SimAlign being an off-the-shelf multilingual (neural) aligner.
Our preliminary experiments showed that the embeddings from the 0th layer yielded the best alignments, especially compared to the 8th layer, which seems to work best in most alignment tasks. A plausible explanation is that contextualised embeddings are unnecessary here because lexical glosses do not constitute a standard English sentence (for instance, they do not contain stop words, and their word order reflects the L1 word order).
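The similarity matrix feeding the aligner is typically built from cosine similarities between embeddings. A hedged sketch, using toy vectors in place of the (layer-0, non-contextual) BERT embeddings:

```python
# Sketch: cosine similarity matrix between gloss and word embeddings.
# The vectors here are toy stand-ins for BERT embeddings.
import numpy as np

def cosine_sim_matrix(gloss_vecs, word_vecs):
    g = gloss_vecs / np.linalg.norm(gloss_vecs, axis=1, keepdims=True)
    w = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    return g @ w.T                     # sim[i, j] in [-1, 1]

g = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([[1.0, 0.0], [1.0, 1.0]])
sim = cosine_sim_matrix(g, w)
assert abs(sim[0, 0] - 1.0) < 1e-9     # identical direction
assert abs(sim[1, 1] - 2 ** -0.5) < 1e-9
```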

Evaluation metrics
We use the official evaluation metrics from the Shared Task: morpheme accuracy, word accuracy, BLEU, and precision, recall, and F1-score computed separately for grammatical (Gram) and lexical (Lex) glosses. We report the results of our best system with all metrics in the Appendix.
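The two accuracy metrics can be sketched as follows. This is a hedged simplification (the official Shared Task scorer handles formatting details we omit): morpheme-level accuracy counts correct glosses over all morphemes, while word-level accuracy requires every morpheme of a word to be glossed correctly.

```python
# Hedged sketch of morpheme- and word-level accuracy.
def accuracies(reference, prediction):
    """reference/prediction: lists of words, each word a list of glosses."""
    morph_ok = morph_total = word_ok = 0
    for ref_w, hyp_w in zip(reference, prediction):
        morph_ok += sum(r == h for r, h in zip(ref_w, hyp_w))
        morph_total += len(ref_w)
        word_ok += int(ref_w == hyp_w)
    return morph_ok / morph_total, word_ok / len(reference)

ref = [["he.OBL", "GEN1"], ["three"], ["son"]]
hyp = [["he.OBL", "DAT"], ["three"], ["son"]]
m_acc, w_acc = accuracies(ref, hyp)
assert m_acc == 0.75 and w_acc == 2 / 3   # 3/4 morphemes, 2/3 words
```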

Baselines
Below, we consider three baseline approaches which handle glossing as a sequence-labelling task:

• maj: a dictionary-based approach, which assigns the majority label (grammatical or lexical) seen in the training dataset to a source morpheme, and fails for out-of-vocabulary morphemes;
• CRF+maj: a hybrid model relying on a CRF to predict grammatical glosses and a unified lexical label (as in Moeller and Hulden (2018) and Barriga Martínez et al. (2021)); known lexical morphemes are then assigned a lexical label according to the maj model;
• BASE_ST: the Transformer-based baseline developed by the SIGMORPHON Shared Task organisers and detailed in Ginn (2023).9

5 Experiments

Results
Table 3 reports the scores of the baselines from §4.6, as well as the best results in the Shared Task on Automatic Glossing (BEST_ST)10 and the results of variants of our system on the official test sets. We only report below the word- and morpheme-level (overall) accuracies, which are the two official metrics of the Shared Task. A first observation is that system (S3), which effectively combines the information available in a dictionary with that obtained via alignment in an integrated fashion (see §2.2), greatly outperforms (S1) (dictionary only) and (S2) (alignment only), obtaining the best performance among our three variants. (S3) is also consistently better than all the baselines, with larger gaps when few training sentences are available (e.g. Gitksan or Lezgi). In comparison, the BERT-based baseline suffers a large performance drop in very low-data settings, as also reported in (Ginn, 2023). Our CRF model also achieves competitive results compared to the best system submitted to the Shared Task, especially for the word-level scores. These scores confirm that decent to good accuracy numbers can be obtained based on some hundreds of training sentences. Note, however, that annotated datasets of that size are not so easily found: in the IMTVault (Nordhoff and Krämer, 2022), only 16 languages have more than 700 sentences, which is about the size of the Lezgi and Natugu corpora.

Handling unknown morphemes
Leveraging the translation in glossing opens the way to better handling of morphemes that were unseen in training. Table 4 displays some statistics about unknown morphemes in the test datasets. For most languages, they are quite rare, representing only around or below 10% of all lexical glosses in the test set, with Gitksan a notable outlier (Ginn et al., 2023). Among those unseen morphemes, for a significant proportion (from a third in Tsez up to 70% in Gitksan) the reference lexical gloss is not even present in the translation13 (cf. 'not in L2' in Table 4). Taking this into account nuances the seemingly low accuracy for unknown lexical morphemes: in Uspanteko, for instance, the system reaches an accuracy of 29.3 when the best achievable score is about 35. To obtain a more optimistic view of our predictions, we 'approximate' lexical glosses with lemmas from the translation, using automatic alignments (e.g., king instead of khan in Figure 3). By evaluating the unknown morphemes against their reachable labels14 (cf. the 'align. accuracy' line), we get higher scores, such as 40.4 in Natugu.
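The 'best achievable' bound mentioned above can be computed as follows. This is a hedged sketch: an unknown morpheme can only be glossed correctly if its reference gloss is reachable, i.e. present in the L2 translation (we omit the lemmatisation step a real evaluation would need).

```python
# Sketch of the reachability upper bound on unknown-morpheme accuracy.
def achievable_upper_bound(examples):
    """examples: (reference_gloss, translation_lemmas) pairs for
    unknown lexical morphemes; returns the accuracy ceiling in %."""
    reachable = sum(gloss in lemmas for gloss, lemmas in examples)
    return 100.0 * reachable / len(examples)

examples = [("khan", ["one", "home", "king", "place"]),   # not reachable
            ("son", ["he", "have", "three", "son"]),      # reachable
            ("home", ["one", "home", "two", "king"])]     # reachable
assert round(achievable_upper_bound(examples), 1) == 66.7
```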

Data efficiency
An important property of our approach is its data efficiency. To document this property, we report in Table 5 the morpheme-level accuracy obtained with increasingly large training sets (50, 200, 700, 1,000, 2,000, and the full data) in Tsez. With only 200 examples, our model already does much better than the simple maj baseline and delivers usable outputs. (S3) also improves faster than the baseline for small training datasets (almost +7 points between 200 and 700 sentences, while maj increases by only +4 points). The returns from increasing the dataset size above 1,000 are, in comparison, much smaller. While the exact numbers are likely to vary depending on the language and the linguistic variety of the material collected in the field, they suggest that ML techniques could be used from the onset of the annotation process.

Analysis of automatic alignments
Our approach relies on automatic alignments computed with SimAlign to supervise the learning process. When using the Match method, (almost) all lexical glosses are aligned with a word in the translation (cf. footnote 3). We cannot evaluate the overall alignment quality, as reference alignments are unobserved. However, we measure in Table 6 the proportion of exact matches between the reference gloss and the lemma of the aligned word. Overall, around half of our alignments are trivial and hence sure, which is more or less in line with the proportions found by Georgi (2016), albeit lower due to the marked linguistic difference between L1 and L2 in our case. These seemingly low scores must be qualified by two facts. First, they are only a lower bound on alignment quality, since synonyms (such as khan/king) are counted as wrong alignments; in some cases, inflected forms are also used as a gloss (e.g., dijo/decir in Uspanteko). Second, the alignment-induced glosses are not used as-is in our experiments: they supplement the output label with their PoS tag and their position in the L2 sentence. This means that even non-exact matches can yield useful features.
Removing L2 stop words. We carried out a complementary experiment in which we filtered stop words from the L2 translation in Gitksan, to remove a potential source of error in the alignments. The number of unaligned lexical glosses (§3.1) thus increases, which generally means reduced noise in the alignments for the Match method. Yet, using these better alignments and reduced label sets in training and inference yields mixed results: +1 point in word accuracy, −2 points in morpheme accuracy.

Multilingual pre-training
Cross-lingual transfer techniques via multilingual pre-training (Conneau et al., 2020) are the cornerstone of multilingual NLP and have been used in multiple contexts and tasks, including morphological analysis for low-resource languages (Anastasopoulos and Neubig, 2019). In this section, we apply the same idea to evaluate how well such techniques can help in our context. We train the model to predict the nature of the gloss (grammatical or lexical) with multilingual features (see Appendix A): for a given morpheme, its position in the word, its length in characters, its CV skeleton,15 and the number of morphemes in the word. Using IMTVault (§4.3) for this task, the model reaches around 80% accuracy. We use these pre-trained weights to initialise the multilingual features in our (S3) system. To help feature selection, we notably reduce the ℓ1 strength to 0.4. Pre-training results (+ IMT) are in Table 7. We note that pre-training has a negligible effect, except on some metrics, such as the word-level accuracy in Lezgi. We observed that the most important weights in the pre-trained model correspond to delexicalised pattern features that are relevant in IMTVault but not present at all in our datasets.

15 We identify consonants and vowels based on orthography.
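A hedged sketch of these delexicalised, cross-lingual features (the vowel inventory and function names are our own assumptions; the paper identifies consonants and vowels from orthography):

```python
# Sketch: delexicalised features used for multilingual pre-training --
# morpheme position, length, CV skeleton, and morpheme count per word.
VOWELS = set("aeiouy")

def cv_skeleton(morpheme):
    return "".join("V" if ch.lower() in VOWELS else "C"
                   for ch in morpheme if ch.isalpha())

def delex_features(word_morphemes, t):
    m = word_morphemes[t]
    return {"position": t, "length": len(m),
            "cv": cv_skeleton(m), "n_morphemes": len(word_morphemes)}

feats = delex_features(["zow", "n"], 0)
assert feats == {"position": 0, "length": 3, "cv": "CVC", "n_morphemes": 2}
```

Because none of these features mention the morpheme string itself, their weights can in principle transfer across languages, which is exactly what the pre-training experiments above probe.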
Cross-lingual study. Since two of our studied languages, Tsez and Lezgi, belong to the same language family, we repeat this methodology with the Tsez dataset as a pre-training source for predicting Lezgi glosses. This kinship is also explored through successful cross-lingual transfer in (Zhao et al., 2020). We obtain 83.3 and 87.1 for word and morpheme accuracy, which is very close to the performance with (and without) IMTVault, despite the pre-training source containing fewer sentences.
Very low resource scenarios. A final experiment focuses on a very low training data scenario. Apart from the Gitksan corpus, the test languages already come with hundreds of annotated sentences. This contrasts with actual data conditions: in IMTVault, for instance, only 16 languages have at least as many sentences as Lezgi (the second-smallest training set in our study; cf. §5.1). We thus focus here on (simulated) very low-resource settings by considering only 50 sentences16 selected from the training data. Table 8 reports the results obtained with this configuration. We observe that multilingual pre-training helps for all languages, to a more noticeable extent than before. These three experiments confirm the potential of multilingual transfer for this task, which can help improve performance in very low-resource scenarios. Conversely, when hundreds of sentences are available, pre-training delexicalised features proves ineffective, sometimes even detrimental.

Feature analysis
One benefit of CRF-based models is direct access to the features and their corresponding learnt weights. We report in Appendix E the ten features with the largest weights in Natugu.
Among the top 1% of active features in Natugu in terms of weight, we find two features testing the gloss type in conjunction with the morpheme length: (LEX, 6+) and (LEX, 5). These indicate that longer morphemes are likely to have a lexical label. Such an analysis can also be relevant for weights learnt through multilingual pre-training. For instance, (LEX, 5) is among the top 10% of features on the IMTVault dataset, suggesting a cross-lingual tendency for longer morphemes to be lexical.

Related work
Language documentation. With the ever-pressing need to collect and annotate linguistic resources for endangered languages, the field of computational language documentation is quickly developing, as acknowledged by the ComputEL workshop.17 Regarding annotation tools, recent research has covered all the steps of language documentation, from speech segmentation and transcription, to automatic word and morpheme splitting, to automatic interlinear glossing. Xia and Lewis (2007) and Georgi et al. (2012, 2013) use, as we do, interlinear glosses to align morphs and translations, then to project syntactic parses from L2 back to L1, a technique pioneered by Hwa et al. (2005), or to extract grammatical patterns (Bender et al., 2014). Through these studies, a large multilingual database of IMGs was collected from linguistic papers, curated, and enriched with additional annotation layers (Lewis and Xia, 2010; Xia et al., 2014) for more than 1,400 languages. Georgi (2016) notably discusses alignment strategies in ODIN and trains a multilingual alignment model between the gloss and translation layers; in our work, we extend multilingual training to the full annotation process. Another massive multilingual source of glosses is IMTVault, described in (Nordhoff and Krämer, 2022) and studied in §5.5.
Automatic glossing. Automatic glossing was first studied in (Palmer et al., 2009; Baldridge and Palmer, 2009), where active learning was used to incrementally update an underlying tagging system focusing mainly on grammatical morphemes (lexical items are tagged with their PoS). Samardžić et al. (2015) added an extra layer aimed at annotating the missing lexical tags, yielding a system that resembles our CRF+maj baseline.
Both Moeller and Hulden (2018) and McMillan-Major (2020) rely on CRFs, the latter study being closer to our approach as it combines post hoc the outputs of two CRF models operating respectively on the L1 and L2 tiers, whereas our system introduces an integrated end-to-end architecture. Zhao et al. (2020) develop an architecture inspired by multi-source neural translation models, where one source is the L1 sequence and the other the L2 translation. They experiment with Arapaho, Lezgi, and Tsez, while also applying a form of cross-lingual transfer learning. The recent SIGMORPHON exercise (Ginn et al., 2023) is a first attempt to standardise benchmarks and task settings, and shows that morpheme-level accuracies in the high 80s can be obtained for most languages considered.
Latent-variable CRF models. Extended CRF models have been proposed and used in many studies, including latent variables to represent, e.g., hidden segmentations as in (Peng et al., 2004) or hidden syntactic labels (Petrov and Klein, 2007). Closer to our work, Blunsom et al. (2008) and Lavergne et al. (2011) use latent structures to train discriminative statistical machine translation systems. Other relevant work on unsupervised discriminative alignment is found in (Berg-Kirkpatrick et al., 2010; Dyer et al., 2011), while Niehues and Vogel (2008) use a supervised symmetric bi-dimensional word alignment model.

Conclusion
This paper presented a hybrid CRF model for the automatic interlinear glossing task, specifically designed to work well in low-data conditions and to effectively address issues due to out-of-vocabulary morphemes. We presented our main approach, which relies on analysing the translation tier, and discussed our main implementation choices. In experiments with five low-resource languages, we obtained accuracy scores that match or even outperform those of very strong baselines, confirming that accuracy values in the 80s or above can be obtained with a few hundred training examples. Using a large multilingual gloss database, we finally studied the possibility of performing cross-lingual transfer for this task.
There are various ways to continue this work and improve these results, such as removing the noise introduced by erroneous alignment links, either by marginalising over the 'origin' variable, by filtering unlikely alignments based on link posterior values, or by also trying to generate alignment links for function words (Georgi, 2016; McMillan-Major, 2020). Our initial experiments along these lines suggest that this may not be the most promising direction. We may also introduce powerful neural representations for L1 languages; while these were long available only for a restricted number of languages, recent works have shown that even low-resource languages can benefit from these techniques (Wang et al., 2022; Adebara et al., 2023).

Limitations
The main limitation comes from the small set of languages (and corresponding language families) studied in this work. In general, texts annotated with Interlinear Morphological Glosses are scarce, owing to the time and expertise needed to annotate sentences with glosses. However, corpora such as IMTVault (Nordhoff and Krämer, 2022) or ODIN (Lewis and Xia, 2010), or languages such as Arapaho (39,501 training sentences in the Shared Task), pave the way for further experiments.
Moreover, another shortcoming of our work stems from the fact that we do not use neural models, while, for instance, the best submission to the Shared Task relies on such architectures. In this sense, we have yet to compare the scalability of our approach on larger data. Still, one of our main focuses was to tackle glossing in very low-resource situations, for the early steps of language documentation, and to study how to handle previously unseen morphemes at inference.

A Feature details

[…] handle glosses that literally appear in the L2 sentence (namely, proper nouns), and a distortion feature which tests the difference in relative positions between the source morpheme and the (possible) lexical label, whenever a_t > 0.

B Brief language presentation
We describe below the studied languages.
• Tsez (ddo) is a Nakh-Daghestanian language spoken in the Republic of Dagestan in Russia.
• Gitksan (git) is a Tsimshian language spoken on the western coast of Canada (British Columbia).
• Lezgi (lez) is a Nakh-Daghestanian language spoken in the Republic of Dagestan in Russia and in Azerbaijan.
• Natugu (ntu) is an Austronesian language spoken in the Solomon Islands.
• Uspanteko (usp) is a Mayan language spoken in Guatemala.
According to Ethnologue (Eberhard et al., 2023), the two Nakh-Daghestanian languages have between 10K and 1M speakers, while the other three have fewer than 10K users.

D Full results
Table 11 displays the results of the S3 system with all metrics presented in §4.5. Two accuracy scores are computed at both the morpheme and word levels: an overall (Ovr) value and a sentence-averaged (Avg) value.

E Example of learnt features
Table 12 displays the 10 features with the largest weights in the S3 system in Natugu. Here, we ignore trivial features for numbers or punctuation signs, which also have high weight values. The ID of each feature refers to Table 13.

Figure 2 :
Figure 2: Tagging morphs for the L1 sentence in Fig. 1. Y_G represents the set of all grammatical glosses in the training data, z the words occurring in the translation, Y_L(x) the set of lexical labels from the training dictionary, and y the reference lexical labels seen in training. During training, automatic alignments between x and z are used. Dashed lines symbolise the ambiguous origin of a label, possibly in both z and Y_L(x).

Figure 3 :
Figure 3: Example of SimAlign alignment between lexical glosses and an English translation (Tsez sentence).

Figure 4 :
Figure 4: Example of input, outputs, and associated features to Lost for the Tsez reference sentence of Figure 1.
Here, {G_k, k = 1…K} are the feature functions with associated weights θ = [θ_1 … θ_K]^T ∈ R^K, and Z_θ(x) is the partition function summing over all label sequences. For tractability, in linear-chain CRFs, each feature function G_k only tests local properties, decomposing as G_k(x, y) = Σ_t g_k(y_t, y_{t−1}, t, x) with g_k() a local feature. Training is performed by maximising the regularised conditional log-likelihood on a set of fully labelled instances, where the regulariser is proportional to the ℓ1 (|θ|) or ℓ2 (‖θ‖²) norm of the parameter vector. Exact decoding of the most likely label sequence is achieved with dynamic programming (DP); furthermore, an adaptation of the forward-backward algorithm computes the posterior distribution of any y_t conditioned on x.

Table 2 :
Number of sentences for each language

Table 3 :
Accuracy (overall) at the word (top) and morpheme (bottom) levels for baseline systems and three variants of the hybrid CRF model on the test dataset. Best scores for each language and metric are in bold.12

Table 4 :
Statistics about unknown lexical morphemes in the test sets. We report the morpheme-level accuracy.

Table 5 :
Morpheme-level accuracy with increasing training data size in Tsez for two systems.

Table 6 :
Proportion of exact matches between the reference gloss and the lemma of the aligned word with SimAlign, on the training datasets.

Table 7 :
Accuracy (overall) at the word (top) and morpheme (bottom) levels for the model without and with multilingual pre-training on the test dataset.

Table 9 :
Example of output features extracted from each label set, using the example of Figure 3. The reference (y_t, o_t) is the training supervision.
Table 10 presents the number of active features (in thousands) selected among all features (in millions) for S3. We note that, thanks to the ℓ1 regularisation term, most feature weights are set to 0 and less than 1% of the features are actually retained.

Table 10 :
Number of selected features among all computed features for the S3 system in each language.

Table 11 :
Full results with our S3 system on the IGT Shared Task test dataset.

Table 12 :
10 features with the largest weight in Natugu.