The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation

This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting. We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations from LMVR and MORSEL. We evaluate translation tasks between English and each of Nepali, Sinhala, and Kazakh, predicting that morphologically-based segmentation methods would lead to better performance in this setting. However, compared to BPE, we find that no consistent and reliable differences emerge between the segmentation methods. While morphologically-based methods outperform BPE in a few cases, the best-performing method varies across tasks, and the performance of the segmentation methods is often statistically indistinguishable.


Introduction
Despite the advances of neural machine translation (NMT), building effective translation systems for lower-resourced and morphologically rich languages remains challenging. The lack of large training data sets leads to vocabulary sparsity, a problem exacerbated by the combinatorial explosion of permissible surface forms commonly encountered when working with morphologically rich languages.
Current NMT systems typically operate at the level of subwords. Most commonly, these systems achieve vocabulary reduction by decomposing tokens into character sequences constructed by maximizing an information-theoretic compression criterion. The most widely used subword segmentation method is byte pair encoding (BPE), originally invented in the data compression literature by Gage (1994) and introduced to the MT community by Sennrich et al. (2016). Another approach to open-vocabulary NMT has been to compose characters or character n-grams to form word representations (Ataman and Federico, 2018a; Ling et al., 2015).
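The merge loop at the heart of BPE can be sketched in a few lines. This is a minimal illustration of the algorithm, not the Subword-NMT implementation (which differs in tie-breaking, vocabulary handling, and I/O):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: count} frequency table.

    Each word starts as a sequence of characters plus an end-of-word
    marker; at every step the most frequent adjacent symbol pair is
    fused into a new symbol.
    """
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

freqs = {"low": 5, "lower": 2, "lowest": 2}
merges = learn_bpe(freqs, 3)
# The first merges fuse "l"+"o", then "lo"+"w", then "low"+"</w>".
```

Applying the learned merges in order to a new word reproduces the same subword units at segmentation time.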
As BPE has become mainstream, the question of whether segmenting words in a linguistically-informed fashion provides a benefit remains open. Intuitively, the translation task may be easier when using subwords that contain maximal linguistic signal, as opposed to heuristically derived units based on data compression. The greatest benefit may come in low-resource settings, where the training data is small and biases toward morphological structure may lead to more reusable units.
We seek to address this question by exploring the usefulness of linguistically-motivated subword segmentation methods in NMT, as measured against a BPE baseline. Specifically, we investigate the effectiveness of the morphology-based segmentation algorithms of Ataman et al. (2017) and Lignos (2010) as alternatives to BPE at the word or sentence level and find that they do not lead to reliable improvements under our experimental conditions. We perform our evaluation using both BLEU (Papineni et al., 2002) and CHRF3 (Popović, 2015). In our low-resource NMT setting, all of these methods provide comparable results.
The contribution of this work is that it provides insights into the performance of these segmentation methods using a thorough experimental paradigm in a highly replicable environment. We evaluate without the many possible confounds related to back-translation and other processes used in state-of-the-art NMT systems, focusing on the performance of a straightforward Transformer-based system. To analyze the performance differences between the various segmentation strategies, we use a Bayesian linear model as well as nonparametric hypothesis tests.

Related work
Attempts to create unsupervised, morphologically-aware segmentations have often been derived from the Morfessor family of morphological segmentation tools (Virpioja et al., 2013). In addition to extensions of Morfessor such as Cognate Morfessor (Grönroos et al., 2018), Ataman et al. (2017) and Ataman and Federico (2018b) introduced the LMVR model, derived from Morfessor FlatCat (Grönroos et al., 2014), and applied it to NMT tasks on Arabic, Czech, German, Italian, Turkish, and English, noting that LMVR outperforms a BPE baseline in CHRF3 and BLEU. Contrary to those results, however, Toral et al. (2019) found that LMVR yielded mixed results: on a Kazakh-English translation task the authors observed marginal BLEU improvements over BPE, whereas for English-Kazakh they reported LMVR to perform marginally worse than BPE in terms of CHRF3.
There have also been efforts to combine BPE with linguistically motivated approaches. For instance, Huck et al. (2017) propose to combine BPE with various linguistic heuristics such as prefix, suffix, and compound splitting. The authors work with English-German and German-English tasks, and observe performance improvements of approximately 0.5 BLEU compared to a BPE-only baseline. As another example, Weller-Di Marco and Fraser (2020) combine BPE with a full morphological analysis on the source and target sides of an English-German translation task, and report performance improvements exceeding 1 BLEU point over a BPE-only baseline.
Finally, even though Sennrich et al. (2016) originally used only the NMT training set to train their segmentation model, others have recently found benefit in adding monolingual data to the process. In particular, Scherrer et al. (2020) used both SentencePiece and Morfessor as segmentation models on an Upper Sorbian-German translation task and found a monotonic increase in BLEU when the segmentation model was trained with additional data, while at the same time keeping the NMT training data constant.

Experiments
To investigate the effect of subword segmentation algorithms on NMT performance, we train translation models using the Transformer architecture of Vaswani et al. (2017). We base our work on two recent datasets: FLoRes (Guzmán et al., 2019) and selected languages from the WMT 2019 Shared Task on News Translation (Barrault et al., 2019). Corpus statistics for all corpora can be found in Table 1.
The FLoRes dataset consists of two language pairs, English-Nepali and English-Sinhala. To add another lower-resourced language, we use the Kazakh-English translation data from WMT19. In terms of morphological typology, both Nepali and Sinhala are agglutinative languages (Prasain, 2011; Priyanga et al., 2017), as is Kazakh (Kessikbayeva and Cicekli, 2014).
We conduct two sets of experiments on Kazakh to investigate how the amount of training data influences our results: first, we train only on the WikiTitles and News Commentary corpora (train120k), followed by another set of experiments (train220k) where we include the web crawl corpus prepared by Bagdat Myrzakhmetov of Nazarbayev University. We also conducted experiments with Gujarati data from WMT19, but BLEU scores were too low to allow for meaningful analysis. For our models, we generally follow the architecture and hyperparameter choices of the FLoRes Transformer baseline, except for setting clip norm to 0.1 and enabling FP16 training.

Segmentation    Sentence
Original        The nation slowly started being centralized and during
SentencePiece   the n ation sl ow ly start ed being cent ral ized and d ur ing
Subword-NMT     the n@@ ation s@@ low@@ ly star@@ ted being cen@@ tr@@ ali@@ z@@ ed and d@@ ur@@ ing
LMVR            the nation s +low +ly st +ar +ted be +ing c +ent +ral +ized and d +ur +ing
MORSEL          the nation s@@ low +ly start +ed being cen@@ tr@@ ali@@ z +ed and du@@ r +ing
Despite the widespread use of auxiliary techniques such as back-translation, we deliberately refrain from employing such techniques in this work. This is done to best isolate the effect of varying the subword segmentation algorithm, and to avoid the complexity of disentangling it from the effect of other factors. It should be noted, however, that such techniques were highly prevalent among systems submitted to the KK↔EN WMT19 News Translation Shared Task: 64% used back-translation, 61% used ensembling, and 57% employed extensive corpus filtering (Barrault et al., 2019).

Subword segmentation algorithms
Below we describe our hyperparameter settings for the various subword segmentation algorithms. Sinhala and Nepali are tokenized using the Indic NLP tokenizer (Kunchukuttan, 2020), whereas for English and Kazakh we use the Moses tokenizer (Koehn et al., 2007). Example segmentations from actual data can be seen in Table 2.
The segmentation methods we evaluate learn their subword vocabularies from frequency distributions of tokenized text. The exception to this is SentencePiece, whose subword units are learned from sentences, including whitespace. In the case of English and Kazakh, these sentences are untokenized, whereas for Nepali and Sinhala, preprocessing with the Indic NLP tokenizer is applied, following the approach of Guzmán et al. (2019).

Subword-NMT and SentencePiece
As our baseline subword segmentation algorithm, we use the BPE implementation from Subword-NMT. Throughout our experiments we use a joint vocabulary of the source and target and set the number of requested symbols to 5,000. For SentencePiece, we use the default BPE implementation with a joint vocabulary size of 5,000. These choices are motivated by the general observation by Sennrich and Zhang (2019) that lowering the BPE size improves translation quality in ultra-low resource conditions; the specific value of 5,000 was previously used by Guzmán et al. (2019). The same small vocabulary size has been used elsewhere in the low-resource NMT literature, for instance by Roest et al. (2020) while training NMT systems for Inuktitut. We also conducted a hyperparameter sweep over 2,500, 5,000, 7,500, and 10,000 merge operations, but observed no improvement over the choice of 5,000 motivated by prior work.

LMVR
For LMVR (Ataman et al., 2017), we use slightly modified versions of the sample scripts from the authors' GitHub repository. Our main modification is tuning the corpusweight hyperparameter of the Morfessor Baseline (Virpioja et al., 2013) model used to seed the LMVR model. Tuning is performed by maximizing the F1 score for segmenting the English side of the training data, using the English word lists from the Morpho Challenge 2010 shared task (Kurimo et al., 2010) as gold-standard segmentations. After tuning the Morfessor Baseline model, we train a separate LMVR model for each language in a language pair using a vocabulary size parameter of 2,500 per language.
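The tuning criterion can be illustrated with a boundary-level F1 computation over segmented words. This is a hedged sketch of the general idea; the exact word lists, scoring script, and weighting used in the Morpho Challenge evaluation differ, and the data structures here are illustrative:

```python
def boundaries(segments):
    """Positions of internal morph boundaries in a segmented word.

    E.g. ["dog", "s"] has one boundary, after character 3.
    """
    positions, idx = set(), 0
    for seg in segments[:-1]:
        idx += len(seg)
        positions.add(idx)
    return positions

def segmentation_f1(predicted, gold):
    """Boundary-level F1 between predicted and gold segmentations.

    Both arguments map a word to its list of morphs. Words missing
    from `predicted` are treated as unsegmented.
    """
    tp = fp = fn = 0
    for word, gold_segs in gold.items():
        p = boundaries(predicted.get(word, [word]))
        g = boundaries(gold_segs)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {"dogs": ["dog", "s"], "walked": ["walk", "ed"]}
pred = {"dogs": ["dog", "s"], "walked": ["walked"]}
score = segmentation_f1(pred, gold)  # one boundary found, one missed
```

A hyperparameter sweep then simply keeps the corpusweight value whose segmentations maximize this score against the gold standard.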

MORSEL
MORSEL (Lignos, 2010) provides linguistically-motivated unsupervised morphological analysis that has been shown to work effectively on small datasets (Chan and Lignos, 2010). While it provides derivations of morphologically complex forms via a combination of stems and affix rules, we modified it to provide a segmentation and then postprocessed its output to apply BPE to the stems, yielding a limited-size vocabulary.
For example, on the English side of the NE-EN training data, MORSEL analyzes the word algebraic as resulting from the stem algebra being combined with the suffix rule +ic. A BPE model is trained on all of the stems in MORSEL's analyses, and when it is applied to this stem, the stem is segmented as al@@ ge@@ br@@ a. The stem and suffix are combined using a special plus character to denote suffixation, so the final segmentation is al@@ ge@@ br@@ a +ic. Tuning is performed as with LMVR, using the English word lists from the Morpho Challenge 2010 shared task (Kurimo et al., 2010) as a reference. We adjust the number of BPE units learned from the stems to keep the total per-language vocabulary below 2,500.
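The postprocessing step above can be sketched as follows. The function names and the toy BPE segmenter are hypothetical, illustrating only the stem/suffix recombination described in the text, not MORSEL's actual interface:

```python
def toy_bpe(stem):
    """Stand-in BPE segmenter: returns hard-coded pieces for the
    paper's example; a real system would apply learned merges.
    "@@" marks a subword that continues into the next unit."""
    units = ["al", "ge", "br", "a"] if stem == "algebra" else [stem]
    return [u + "@@" for u in units[:-1]] + [units[-1]]

def segment_word(word, analyses, bpe_segment):
    """Combine a MORSEL-style (stem, suffix) analysis with BPE over
    the stem; "+" marks suffixation. Words with no analysis fall
    back to plain BPE segmentation."""
    if word not in analyses:
        return bpe_segment(word)
    stem, suffix = analyses[word]
    return bpe_segment(stem) + [suffix]

analyses = {"algebraic": ("algebra", "+ic")}
pieces = segment_word("algebraic", analyses, toy_bpe)
print(" ".join(pieces))  # al@@ ge@@ br@@ a +ic
```

At detokenization time, the "@@ " continuation markers are joined and the "+" prefix signals that the following unit attaches to the preceding stem.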

Results and analysis
Our experimental results can be seen in Table 3. All BLEU scores were computed using sacrebleu, and all CHRF3 scores using nltk. Each row consists of the mean and standard deviation computed across 5 random seeds for each configuration. We also plot the raw results in Figure 1. Table 4 gives counts for the number of times each segmentation approach was the top-performing one or statistically indistinguishable from it. Table 7 in the appendix gives p-values for all comparisons performed.
Overall, based on Tables 3 and 4, no segmentation method emerges as the clear winner across translation tasks, although BPE applied at the token (Subword-NMT) or sentence (SentencePiece) level performs well consistently. Subword-NMT or SentencePiece perform best in 12 out of 16 cases (counting BLEU and CHRF3 for each translation task), while morphology-based methods rank best in 4 out of 16 cases. In particular, we note that morphology-based methods achieve or tie the best BLEU performance for translation tasks involving SI, and the best CHRF3 performance for KK-EN with smaller training data (train120k) as well as for EN-SI. However, when using LMVR, we fail to find the significant gains in BLEU over BPE reported by Ataman et al. (2017).
Comparing our results to Guzmán et al. (2019), we note that the scores are similar, although not directly comparable, as we report lowercased BLEU scores. They report EN-NE/NE-EN baseline BLEU scores of 4.3 and 7.6 using a single random seed, which are in line with our results in Table 3.

Table 4: Number of times each segmentation method was or tied with the best-performing method under each metric, counted across all tasks.
For EN-SI/SI-EN, the authors report 1.2 and 7.2 BLEU, which likewise matches our findings. Even though our scores are low overall, they are as low as is to be expected given this approach, the size of the data, and the languages involved. When comparing our results to WMT19 participant systems, only comparisons to baseline systems are meaningful due to the widespread use of auxiliary training techniques such as back-translation. For instance, Casas et al. (2019) report baseline NMT scores of 2.32 on KK-EN and 1.42 on EN-KK, which are in line with our MORSEL and SentencePiece results on KK-EN, and our Subword-NMT results on EN-KK in the train120k condition.

Modeling BLEU and CHRF3
Based on Figure 1 and Tables 3 and 4, the BLEU and CHRF3 scores vary with both the translation task and the segmentation method. Intuitively, the scores seem to cluster around a certain range for each translation task and are perturbed slightly depending on the choice of segmentation method. To better disentangle the influence of these factors, we fit a Bayesian linear model to the experimental data, treating the final BLEU/CHRF3 score as the sum of a "translation task effect" η, a "segmentation method effect" τ, and a translation task-specific noise term ε. The η and ε terms are estimated for each of the eight translation tasks (e.g., SI-EN and EN-SI are estimated separately), and τ is estimated for each of the four segmentation methods using results from all translation tasks. To explicitly compare SentencePiece, LMVR, and MORSEL to the Subword-NMT baseline, we also model the pairwise differences between each method's τ term and that of Subword-NMT. The posterior inferences for these quantities can be seen in Table 5 and are plotted in the appendix. For BLEU, the differences for LMVR are several standard deviations below 0, suggesting that it performs worse than the Subword-NMT baseline when accounting for all translation tasks. Similarly, MORSEL is almost 2 standard deviations away from 0, though its posterior interval does cover 0. In both cases, the effect size is small, with means of -0.12 and -0.26 BLEU points for MORSEL and LMVR, respectively. The reliability of this difference disappears for LMVR under the CHRF3 model, where no segmentation method's posterior mean is several standard deviations away from 0.
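The additive structure described above can be written out as follows. This is a sketch under stated assumptions: the Gaussian form of the noise and the exact parameterization are our illustration, since the section does not fully specify the likelihood or priors:

```latex
% Score of translation task t, segmentation method m, random seed i:
y_{t,m,i} = \eta_t + \tau_m + \epsilon_{t,i},
\qquad \epsilon_{t,i} \sim \mathcal{N}(0, \sigma_t^2)

% Pairwise comparison of each method against the Subword-NMT baseline:
\Delta_m = \tau_m - \tau_{\text{Subword-NMT}}
```

Here η and σ are task-specific, τ is shared across tasks for each method, and the Δ terms are the quantities reported against the baseline.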
We hypothesize that this greater discrimination among methods when using BLEU may originate from differences in how BLEU and CHRF3 operate. Since CHRF3 is a character-level metric, it is less prone than BLEU to penalizing a given translation for subword outputs that are almost correct. For instance, consider the output do@@ gs → dogs with dog as the reference; while CHRF3 awards credit for this as a partial match, BLEU treats it as entirely incorrect. This further underscores our observation that segmentation methods perform inconsistently across experimental conditions.
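The partial-credit behavior can be made concrete with a simplified character n-gram F-score. This sketch computes F_beta over a single n-gram order; the official CHRF3 additionally averages over orders 1-6 and handles whitespace and multiple segments, so this is an illustration, not the nltk or sacrebleu implementation:

```python
from collections import Counter

def char_ngrams(text, n):
    """All character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def fbeta(hyp, ref, n=3, beta=3.0):
    """Character n-gram F_beta for a single order n, in the spirit
    of CHRF3 (beta = 3 weights recall three times as much as
    precision)."""
    hyp_counts = Counter(char_ngrams(hyp, n))
    ref_counts = Counter(char_ngrams(ref, n))
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# "dogs" vs. reference "dog": no exact word match (word-level credit
# is zero), but the shared trigram "dog" earns character-level credit.
score = fbeta("dogs", "dog")
print(round(score, 3))  # → 0.909
```

The hypothesis "dogs" shares one of its two trigrams with the reference (precision 0.5, recall 1.0), so it still scores highly under the recall-weighted F-score while contributing nothing at the word level.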

Conclusion and future work
Contrary to our hypothesis about the usefulness of morphology-aware segmentation, we see no consistent advantage, and possibly a small disadvantage, to using LMVR or MORSEL in this resource-constrained setting. By and large, our experiments and modeling show that no segmentation approach consistently achieves the best BLEU/CHRF3 across all translation tasks. BPE remains a good default segmentation strategy, but it is possible that LMVR, MORSEL, or similar systems may show larger performance advantages for languages with specific morphological structures.
Consequently, we believe further work is needed to better understand when morphology-aware methods are most effective and to develop methods that provide a consistent advantage over BPE. One avenue of future work would be to broaden our analysis to more languages, including languages that are higher-resourced but morphologically rich as well as ones that are lower-resourced but morphologically poor. Ortega et al. (2021), which we encountered during preparation of the final version of this paper, began to address these questions by comparing Morfessor with BPE and their own BPE variant on Finnish, Quechua, and Spanish.
An alternative approach, which we intend to pursue in future work, is experimenting with supervised morphological segmenters or analyzers that can be efficiently developed even in lower-resourced settings. Incorporating such "gold standard" segmentations may make it clearer whether the unsupervised morphological segmenters are capturing linguistically relevant structure. Finally, there is the question of whether BPE can approximate a general representation for a language instead of converging on a corpus-specific set of subwords. To test this, one can add monolingual data and train the BPE segmentation on that larger data set. Ideally, the new "enriched" segmentations would depend less on the specific vocabulary of the training corpus. As noted above, Scherrer et al. (2020) observed this approach to be helpful in terms of BLEU. However, it remains unknown why the subwords derived from a larger corpus perform better, and whether better identification of morphological structure could be responsible.
We hope that this work and these ideas will catalyze further research, and that efficient methods for translating to and from lower-resourced languages can be developed as a result.

Figure 1: CHRF3 vs. BLEU with different translation tasks indicated by color and segmentation by marker shape.

Table 1: Number of sentences in raw corpora. The 120k and 220k training conditions for KK correspond to training KK↔EN models with/without an additional crawled corpus. The test sets for KK→EN and EN→KK are different from each other and mirror the released WMT19 data.

Table 2: Examples of segmentation strategies and tokenization.