When is Char Better Than Subword: A Systematic Study of Segmentation Algorithms for Neural Machine Translation

Subword segmentation algorithms have been a de facto choice when building neural machine translation systems. However, most of them need to learn a segmentation model based on heuristics, which may produce sub-optimal segmentation. This can be problematic in scenarios where the target language is morphologically rich or there is not enough data for learning compact composition rules. Translating at the fully character level has the potential to alleviate these issues, but the empirical performance of character-based models has not been fully explored. In this paper, we present an in-depth comparison between character-based and subword-based NMT systems under three settings: translating to typologically diverse languages, training with low resource, and adapting to unseen domains. Experimental results show the strong competitiveness of character-based models. Further analyses show that, compared to subword-based models, character-based models are better at handling morphological phenomena, better at generating rare and unknown words, and more suitable for transfer to unseen domains.


Introduction
Neural machine translation (NMT) has achieved great success in recent years. Modern NMT systems typically operate at the subword level, using segmentation algorithms such as byte pair encoding (BPE) (Sennrich et al., 2016) or Morfessor (Creutz and Lagus, 2002). Compared to word-level models, subword segmentation helps overcome the out-of-vocabulary (OOV) problem and makes better use of the morphological information in the surface form.
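The merge-learning loop at the core of BPE can be sketched in a few lines (a toy Python re-implementation in the spirit of the reference algorithm of Sennrich et al., 2016; the corpus and number of merges are purely illustrative):

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs over the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: words as space-separated characters with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges = []
for _ in range(3):  # learn 3 merge operations
    stats = get_pair_stats(vocab)
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)
```

On this toy corpus the first merges build up the frequent suffix `est</w>`, illustrating how BPE composes subwords from frequent character sequences.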
Despite their empirical effectiveness, subword algorithms may produce improper segmentation due to their data-dependent nature. NMT models are typically robust to such errors when trained on large corpora or when the target language has regular morphology, like French or German. However, problems arise when these conditions are not met, i.e. when there is not enough data for learning compact composition rules or the target language is morphologically rich and complex. An alternative is to use fully character-level (CHAR) models (Lee et al., 2017; Cherry et al., 2018; Gupta et al., 2019; Gao et al., 2020; Banar et al., 2020), which have the potential to alleviate the above issues. CHAR does not need to learn any segmentation rules and keeps all available information in the surface form, avoiding the risk of information loss due to improper segmentation. Moreover, the main drawback of CHAR, its long training time, is less pronounced in these settings, since there is not as much data as in the rich-resource setting. However, there has not been a comprehensive study of CHAR under these conditions.
In this paper, we conduct a systematic comparison between CHAR and subword algorithms such as BPE and Morfessor. Experiments show the strong competitiveness of CHAR under three settings: translating to typologically diverse languages (Section 2), training with low resource (Section 3), and adapting to distant domains (Section 4). Further analyses show that, compared to subword algorithms, the benefits of CHAR mainly come from better capturing of morphological phenomena, better generation of rare and unknown words, and better translation of domain-specific words.

Translation Across Typologically Diverse Languages
Human languages are known to exhibit diverse morphological phenomena, which can serve as a principle for classifying languages into different morphological categories, such as fusional, agglutinative, introflexive and isolating. While previous works focus only on the performance of character-level models when translating to fusional and agglutinative languages (Gupta et al., 2019; Libovický and Fraser, 2020), we conduct a comprehensive study covering all four morphological categories.

Experiment Setup
Dataset We consider translation from English to eight target languages representing four morphological categories: French (Fr) and Romanian (Ro) for fusional, Finnish (Fi) and Turkish (Tr) for agglutinative, Hebrew (He) and Arabic (Ar) for introflexive, and Vietnamese (Vi) and Malaysian (Ml) for isolating. We use the OPUS-100 corpus (Tiedemann, 2012), which consists of 1M parallel sentences for each language pair.

Model and Hyperparameters
We use the Transformer architecture (Vaswani et al., 2017) throughout all experiments. To ensure the reliability of our results, we run an exhaustive search over hyperparameters including batch size and learning rate. Detailed hyperparameters can be found in Appendix A.

Results
The results are listed in Table 1. CHAR outperforms the other algorithms on 7 out of 8 languages in terms of BLEU (Papineni et al., 2002) and chrF3 (Popović, 2015), demonstrating CHAR's strong competitiveness across languages. The only exception is the En-Fr language pair: English and French are known to be quite similar, which is beneficial for BPE when learning a joint segmentation model.
It is intuitive that BPE and Morfessor cannot outperform CHAR on introflexive languages (He, Ar), which follow non-concatenative morphology (McCarthy, 1981) that is difficult for concatenative segmentation algorithms to capture. Surprisingly, even for highly agglutinative languages such as Finnish and Turkish, whose morphological changes are very regular and realized by adding affixes, CHAR still achieves better performance.

Analysis on MorphEval
To understand where the advantages of the CHAR model come from, we take Finnish as an example and evaluate the morphological competence of different models using the MorphEval test suites (Burlot et al., 2018). MorphEval generates pairs of source sentences that differ in exactly one morphological phenomenon, and assesses an MT system by computing the percentage of generated target sentence pairs that preserve the source-side contrast. Higher accuracy means the model is more sensitive to the corresponding morphological phenomenon.
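The accuracy computation described above can be sketched as follows (a minimal illustration; the function name `contrast_accuracy` and the toy check are ours, whereas the actual MorphEval suites rely on language-specific lexicon-based checks):

```python
def contrast_accuracy(base_outputs, variant_outputs, contrast_check):
    """Fraction of translation pairs that preserve the source-side morphological
    contrast, as judged by contrast_check (a stand-in for MorphEval's
    lexicon-based checks)."""
    hits = sum(contrast_check(b, v) for b, v in zip(base_outputs, variant_outputs))
    return hits / len(base_outputs)

# Toy check: the variant translation should introduce at least one new word
# form relative to the base translation (e.g. a comparative adjective).
base = ["the house is big", "the dog runs"]
variant = ["the house is bigger", "the dog runs"]
acc = contrast_accuracy(base, variant,
                        lambda b, v: bool(set(v.split()) - set(b.split())))
```

Here the second pair fails the check because the model's two translations are identical, so the contrast was lost.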
As shown in Table 2, CHAR performs best in 10 out of 14 tests. Among these, for comparative adjectives, possessive determiners, local postposition case, preposition case and plural nouns, CHAR surpasses the other algorithms notably, by at least 5% accuracy. This indicates CHAR's strong ability to capture fine-grained morphological phenomena, which is crucial for MT models when translating into morphologically rich languages.
Interestingly, three of the four morphological phenomena on which CHAR falls behind are so-called stability features (Burlot et al., 2018), which are expressed differently in the source language but should be expressed identically in the target language. The disadvantage of CHAR on this kind of phenomenon suggests that CHAR-based models may be less robust to source-side lexical variation; the reason requires further research.

Translation with Low Resource
Subword algorithms help alleviate the OOV problem. However, most of them are based on heuristics and may produce wrong segmentations. While this problem is not so evident when there is enough data to learn robust composition rules, in the low-resource setting it could be a different story, and their effectiveness should be examined. CHAR, in contrast, provides the model with pure character sequences, which directly contain all the information needed for learning composition rules. The choice of segmentation therefore deserves careful study in this setting.

Experiment Setup
We perform evaluation on the WMT14 En-De and WMT17 En-Fi datasets. Training sets of size 50k, 100k, 200k, 500k and 1000k sentence pairs are subsampled from the original training data to simulate different resource conditions. For validation and test, we use the original development and test splits.

Figure 2: Recall rates of unknown and rare words generated by systems based on different tokenizers. Words appearing no more than 5 times in the training set are considered rare words.
Previous works (Sennrich and Zhang, 2019; Nguyen and Chiang, 2017) show that in low-resource settings evaluation results can be sensitive to model size (e.g. hidden dimension, number of layers) and the number of BPE merges k, so we run an additional search over hidden dimension, number of layers and k, and report the best results in this section. See Appendix A for details.

Results
We evaluate models with BLEU and chrF3. The results are shown in Figure 1. In general, the performance of CHAR and BPE is on par, and both are better than Word and Morfessor, though the results vary across data conditions.
Medium-resource When resources are relatively plentiful, e.g. 500k and 1000k, CHAR and BPE are comparable, with the winner depending on the language pair. For En-Fi, CHAR is better than BPE, because morphological changes in Finnish are quite complex and more fine-grained segmentation like CHAR is needed to learn the corresponding rules. Conversely, German's morphological changes are regular enough that BPE can learn most of the merging rules, making it perform better.
Low-resource When the corpus size is 50k to 200k, CHAR performs the best among the four segmentation methods. BPE and Morfessor usually treat frequently occurring words as single tokens, many of which contain rich morphological information. This, together with the improper segmentation problem, prevents NMT models from learning correct composition rules, damaging the model's generalization to rare and unknown words. In the low-resource setting this problem is more severe, since there are many more rare and unknown words but not enough data for learning compact composition rules. Compared with subword-based models, character-based models learn composition directly from character sequences. Not being limited to the fixed character-sequence patterns of subwords, they can generate more words with different morphological changes. Therefore, CHAR can learn more correct composition rules than subword-based models, leading to better translation of rare and unknown words.

Analysis on Rare and Unknown Words
To further support the above analysis, we evaluate the translation quality of rare and unknown words by calculating their recall rates. The results are shown in Figure 2. CHAR achieves the highest recall rates for both rare and unknown words. Although the gap between CHAR and BPE gradually shrinks as resources increase, the results still show that CHAR captures more morpheme-level information and thus performs better at generating rare and unknown words.
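The recall rates used here can be computed with a simple clipped-count scheme (a sketch; the function names are ours, and real evaluation would operate on tokenized reference and hypothesis files):

```python
from collections import Counter

def rare_words(train_counts, max_freq=5):
    """Words appearing no more than max_freq times in the training data."""
    return {w for w, c in train_counts.items() if c <= max_freq}

def word_recall(references, hypotheses, target_words):
    """Clipped recall of target_words: the fraction of their reference
    occurrences that also appear in the corresponding hypothesis."""
    total = matched = 0
    for ref, hyp in zip(references, hypotheses):
        ref_counts, hyp_counts = Counter(ref.split()), Counter(hyp.split())
        for w in target_words:
            total += ref_counts[w]
            matched += min(ref_counts[w], hyp_counts[w])
    return matched / total if total else 0.0
```

Unknown (OOV) words can be scored the same way by passing the set of test-set words absent from the training vocabulary as `target_words`.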

Translation Across Distant Domains
Domain robustness (Müller et al., 2020), which refers to a model's ability to generalize to unseen domains, is important for NMT applications. However, subword algorithms need to learn segmentation rules from a given corpus, which may be domain-specific. When applied to a new domain, they may improperly segment target-domain-specific words, hurting domain robustness. In contrast, CHAR does not suffer from this issue. In this section, we investigate how different segmentation algorithms affect NMT models' domain robustness.

Experiment Setup
We use the same corpora as (Koehn and Knowles, 2017), which is a De-En dataset covering subsets of four domains: Law, Medical, IT and Koran.
Following Koehn and Knowles (2017), each time we train a source-domain model on one of the four subsets and report results on the test sets of the other three domains. We experiment in two settings: No Adapt and Finetune. The former involves no target-domain data, while the latter uses 100k randomly sampled sentence pairs from the target domain to finetune the source-domain model.

Results
We report the average out-of-domain (OOD) BLEU scores of NMT systems based on different segmentation algorithms in Figures 3a and 3b. CHAR surpasses the other algorithms in almost all settings, except when finetuning from Medical to the other domains. This illustrates the suitability of CHAR for domain robustness, especially when there is not enough data for adaptation.

Analysis on Different Types of Words
To understand the advantages of CHAR, we take the setting of finetuning from IT to Medical as an example and analyze performance on different types of words. Specifically, we divide the words in the test set into three types: (1) Domain-specific words occur only in the target-domain training data; (2) Common words occur in both the source- and target-domain training data; (3) OOV words occur in neither training set.
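This three-way split amounts to a vocabulary lookup, sketched below (function name and toy vocabularies are ours; words occurring only in the source-domain data fall outside the paper's three types and are labeled separately here):

```python
def classify_word(word, src_vocab, tgt_vocab):
    """Classify a test-set word by its occurrence in the source- and
    target-domain training vocabularies."""
    in_src, in_tgt = word in src_vocab, word in tgt_vocab
    if in_tgt and not in_src:
        return "domain-specific"  # only in target-domain training data
    if in_src and in_tgt:
        return "common"           # in both training sets
    if not in_src and not in_tgt:
        return "oov"              # in neither training set
    return "source-only"          # only in source-domain training data

# Toy vocabularies for an IT (source) -> Medical (target) setting.
src_vocab = {"software", "install", "patient"}
tgt_vocab = {"patient", "dosage"}
```

For example, "dosage" would be domain-specific, "patient" common, and an unseen word like "xyzzy" OOV.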
The results can be seen in Figure 3c. CHAR achieves better performance on OOV words, which is consistent with the findings in Section 3. While CHAR and the subword-based algorithms perform on par on common words, CHAR outperforms the others by a large margin on domain-specific words. This suggests that the advantage of CHAR mainly comes from the correct translation of domain-specific and OOV words, which may be segmented improperly by subword algorithms.

Comparison with Advanced Segmentation Algorithms
Although we focus on deterministic segmentation algorithms in this paper, there are more advanced ones such as BPE-dropout (Provilkov et al., 2020) and subword regularization (Kudo, 2018), which produce multiple segmentation candidates during training and show improved performance. Therefore, we also compare CHAR with BPE-dropout in terms of domain adaptation performance. We take the setting of adapting from Law to the other domains and report results in Table 3. Although BPE-dropout surpasses BPE by a large margin, CHAR still achieves the best performance, which again shows the superiority of CHAR.
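The idea behind BPE-dropout can be sketched as follows: at segmentation time, each applicable merge is skipped with some probability, so the same word can receive different segmentations across training epochs (a simplified re-implementation; the merge table below is illustrative):

```python
import random

def bpe_segment(word, merge_ranks, dropout=0.0, rng=random):
    """Greedily apply the highest-priority learned merge at each step;
    with BPE-dropout, each candidate merge is dropped with probability
    `dropout`, producing stochastic segmentations."""
    symbols = list(word)
    while True:
        candidates = [
            (merge_ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in merge_ranks and rng.random() >= dropout
        ]
        if not candidates:
            return symbols
        _, i = min(candidates)
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]

# Illustrative merge table mapping pairs to their learned priority.
merge_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
```

With `dropout=0.0` this reduces to standard deterministic BPE; with `dropout=1.0` every merge is dropped and the output degenerates to the character sequence, i.e. CHAR segmentation.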

Related Work
Character-level neural machine translation has received growing attention in recent years. Lee et al. (2017) first propose a fully character-level NMT model based on a recurrent encoder-decoder architecture and convolutional layers, which shows promising results. Gao et al. (2020) propose to incorporate convolutional layers into the more advanced Transformer architecture and show that their model can learn more robust character-level alignments. However, translating at the character level may incur significant computational overhead. Therefore, later works on character-level NMT (Cherry et al., 2018; Banar et al., 2020) mainly focus on reducing its computation cost. Cherry et al. (2018) show that by employing source sequence compression techniques, the quality and efficiency of character-based models can be properly balanced. Banar et al. (2020) share the same idea as Cherry et al. (2018) but build their models on the Transformer architecture. Our work differs from theirs in that we aim to analyze the performance of existing models instead of exploring novel architectures.
There are also several studies comparing CHAR with other subword algorithms (Durrani et al., 2019; Gupta et al., 2019). Durrani et al. (2019) compare character-based and subword-based models in terms of representation quality, and find that the representations learned by the former are more suitable for modeling morphology and more robust to noisy input. Gupta et al. (2019) investigate the performance of different segmentation algorithms with the Transformer architecture, and find that character-based models can achieve better performance when translating noisy text or text from a different domain. Our findings are consistent with theirs, yet we conduct a larger-scale and more in-depth analysis by covering language pairs from more language families and explaining where the advantage of character-based models comes from.

Conclusion
We conduct a comprehensive study and show the advantages of CHAR over subword algorithms in three settings: translating to typologically diverse languages, translating with low resource, and adapting to distant domains. Although we have tried to take as many language pairs as possible into consideration, many languages certainly remain uncovered in this paper. Nevertheless, we believe our experimental results serve as evidence of character-based NMT models' strong competitiveness. We hope more attention will be drawn to such models, including exploring more of their benefits and reducing their possibly higher computation cost in practice.