On Compositional Generalization of Neural Machine Translation

Modern neural machine translation (NMT) models have achieved competitive performance in standard benchmarks such as WMT. However, there still exist significant issues such as robustness, domain generalization, etc. In this paper, we study NMT models from the perspective of compositional generalization by building a benchmark dataset, CoGnition, consisting of 216k clean and consistent sentence pairs. We quantitatively analyze effects of various factors using compound translation error rate, then demonstrate that the NMT model fails badly on compositional generalization, although it performs remarkably well under traditional metrics.


Introduction
Neural machine translation (NMT) has shown competitive performance on benchmark datasets such as IWSLT and WMT (Vaswani et al., 2017; Edunov et al., 2018a; Liu et al., 2020a), and even achieves parity with professional human translation under certain evaluation settings (Hassan et al., 2018). However, the performance can be relatively low in out-of-domain and low-resource conditions. In addition, NMT systems show poor robustness and vulnerability to input perturbations (Belinkov and Bisk, 2018a; Cheng et al., 2019). One example is shown in Table 1, where simple substitution of a word yields a translation with completely different semantics. Many of these issues originate from the fact that NMT models are trained end-to-end over large parallel data, where new test sentences can be sparse.
Disregarding out-of-vocabulary words, a main cause of sparsity is semantic composition: given a limited vocabulary, the number of possible compositions grows exponentially with respect to the composite length. The ability to understand and produce a potentially infinite number of novel combinations of known components, namely compositional generalization (Chomsky; Montague; Lake and Keysers et al., 2020), has been demonstrated deficient in many machine learning (ML) methods (Johnson et al., 2017a; Bastings et al., 2018; Loula et al., 2018; Russin et al., 2019a).

Table 1
Input | Translation (English gloss)
Taylor breaks his promise | Taylor keeps his promise
James breaks his promise | James breaks his promise
In this paper, we study compositional generalization in the context of machine translation. For example, if "red cars" and "blue balls" are seen in training, a competent algorithm is expected to translate "red balls" correctly, even if the phrase has not been seen in training data. Intuitively, the challenge increases as the composite length grows. Recently, several studies have taken steps towards this specific problem. They either use a few dedicated samples (i.e., 8 test sentences) for evaluation (Lake and Li et al., 2019b; Chen et al., 2020), or make simple modifications in sampled source sentences such as removing or adding adverbs, and concatenating two sentences (Raunak et al., 2019; Fadaee and Monz, 2020a). Such experimental data is limited in size, scope and specificity, and the forms of composition are coarse-grained and non-systematic. As a result, no qualitative conclusions have been drawn on the prevalence and characteristics of this problem in modern NMT.
We make a first large-scale general domain investigation, constructing the CoGnition dataset (Compositional Generalization Machine Translation Dataset), a clean and consistent parallel dataset in English-Chinese, along with a synthetic test set to quantify and analyze the compositional generalization of NMT models. In particular, we define frequent syntactic constituents as compounds, and basic semantic components in constituents as atoms. In addition to the standard training, validation and test sets, the CoGnition dataset contains a compositional generalization test set, which contains novel compounds in each sentence, so that the generalization error rate can be evaluated and its influence on BLEU (Papineni et al., 2002) can be quantified. Our compositional generalization test set consists of 2,160 novel compounds, with up to 5 atoms and 7 words. In this way, generalization ability can be evaluated based on compound translation error rate. Empirical results show that the dominant Transformer (Vaswani et al., 2017) NMT model faces challenges in translating novel compounds, despite its competitive performance under traditional evaluation metrics such as BLEU. In addition, we observe that various factors exert salient effects on the model's ability of compositional generalization, such as compound frequency, compound length, atom co-occurrence, linguistic patterns, and context complexity. The CoGnition dataset along with the automatic evaluation tool are released at https://github.com/yafuly/CoGnition.

Related Work
Analysis of NMT. Our work is related to research analyzing NMT from various perspectives. There has been much linguistic analysis of NMT representations (Shi et al., 2016; Belinkov et al., 2017; Bisazza and Tump, 2018), interpretability (Ding et al., 2017; He et al., 2019; Voita et al., 2019a), and attention weights (Voita et al., 2019b; Michel et al., 2019). Robustness is also an important research direction. Work has shown that NMT models are prone to be negatively affected by both synthetic and natural noise (Belinkov and Bisk, 2018b; Cheng et al., 2018; Ebrahimi et al., 2018). For better exploration of robust NMT, Michel and Neubig (2018) propose an MTNT dataset containing several types of noise. Other work provides in-depth analyses of inference miscalibration of NMT resulting from the discrepancy between training and inference. Our work is in line with this research, but we discuss robustness from the perspective of compositional generalization.
In this respect, Lake and Baroni (2018) propose a simple experiment to analyze compositionality in MT, followed by Chen et al. (2020) and Li et al. (2019b). Specifically, they introduce a novel word "dax", and their training data contains a single pattern of sentence pairs (e.g. "I am daxy", "je suis daxiste") while the test set contains different patterns. However, their work is limited in that there are only 8 sentences in the test set. Raunak et al. (2019) observe a performance drop on a dataset of concatenated source sentences. Fadaee and Monz (2020b) modify source sentences by removing adverbs, substituting numbers, inserting words that tend to keep syntax correct (e.g. "very"), and changing the gender, and find unexpected changes in the translation. In contrast to these studies, we quantitatively measure compositionality of NMT under compound translation error rate.
Translation involves various challenges such as low-frequency words, polysemy and compositional complexity. In this work, we focus on how the NMT model generalizes to complex compositions in a controllable setting and minimize the effects of the other factors.
Compositional Generalization. Neural networks have been shown to be sample-inefficient, requiring large-scale training data, which suggests that they may lack compositionality (Lake and ...). Lake and Baroni (2018) introduce the SCAN dataset to help study compositional generalization of neural networks, which has received increasing interest (Russin et al., 2019b; Dessì and Baroni, 2019; Li et al., 2019c; Lake, 2019; Andreas, 2020; Gordon et al., 2020). Various benchmarks have been proposed, including in the areas of visual reasoning (Johnson et al., 2017b; Hudson and Manning, 2019), mathematics (Saxton et al., 2019), and semantic parsing (CFQ) (Keysers et al., 2020). However, no benchmark has been dedicated to machine translation in practice. We fill this gap by introducing a dataset with 216,000 instances and an average sentence length of 9.7 tokens.

Problem Definition
Following Keysers et al. (2020), compositional generalization is defined as the capacity to systematically generalize to novel combinations of components which are learned sufficiently during training. Key elements to measure compositional generalization include atoms and compounds. Specifically, atoms are primitive elements in the train set whereas compounds are obtained by composing these atoms. The research question is whether neural models perform well on unseen compounds. Take Table 2 for example: in the SCAN dataset, the atoms are simple commands such as "jump" and the composite command "jump twice" is a compound. In CFQ, the compounds are questions such as "Who directed Elysium", and the atoms correspond to the primitive elements in the questions such as the predicate "directed", the question pattern "Who [predicate] [entity]" and the entity "Elysium".
In theory, compounds in MT can be defined as phrases, sentences or even documents. In practice, however, we want to control the number of atoms in a novel compound for quantitative evaluation. In addition, it can be highly difficult to construct a large-scale dataset where novel compounds are sentences of practical sizes (the number of synthesized sentences increases exponentially with their length) while ensuring their grammatical correctness. Therefore, we constrain compounds to syntactic constituents, and define atoms as basic semantic components in constituents according to syntactic and semantic rules for forming constituents (Partee, 1995). As a result, we randomly assign multiple sentential contexts for investigating each novel compound. Table 2 shows a contrast between our dataset and existing datasets for compositional generalization in semantics.
Mistakes caused by weakness in compositional generalization can be easily found in state-of-the-art NMT models. In particular, we train a Transformer-based model (Vaswani et al., 2017) on the WMT17 En-Zh dataset. One sentence in the standard test set, "but the problem is , with the arrival of durant , thompson 's appearance rate will surely decline , which is bound to affect his play", is translated into a Chinese sentence meaning "but the problem is , with the arrival of durant , thompson 's will surely look worse , which is bound to affect his play". The novel compound "appearance rate" is composed of two atoms (i.e., "appearance" and "rate"), both with a high frequency of more than 27,000 times in the training set. However, the sentence semantics is completely distorted due to the failure of semantic composition, which is possibly influenced by the context word "play". More importantly, as the overall translation highly overlaps with the reference, the model achieves a high score in similarity-based metrics such as BLEU, demonstrating that fatal translation errors can be overlooked under traditional evaluation metrics.

Figure 1 gives an overview of our data construction process. We first source monolingual data (Section 4.1), and then build parallel data by translation (Section 4.2). Then we synthesize a test set of novel compounds (Section 4.3), and offer an automatic evaluation method (Section 4.4).

Monolingual Data Source
Our goal is to focus on compositional generalization and minimize the influence of additional factors such as polysemy (Berard et al., 2019), misalignment (Munteanu and Marcu, 2005), and stylistic problems (Hovy et al., 2020). The dataset should ideally have the following characteristics. First, the vocabulary size should be small and contain only high-frequency words in order to avoid problems caused by rare words. In other words, variety of composition should come from combining different frequent words instead of word diversity, as suggested by Keysers et al. (2020). Metaphorical words, which can increase the translation difficulty, should be excluded. Second, source sentences should not be too long or have complex syntactic structures. As a result, a sentence can be

[Figure 1: Overview of the data construction process. The "Synthesizing Source Sentences" panel shows example sentences such as "The red dog is running.", "The red dog was sick." and "The red dog had fun with a toy."]

Widely adopted corpora such as the parallel data released for WMT and IWSLT (footnote 2) have large vocabularies and also contain noisy sentences and rich morphology (Li et al., 2019a), which do not fully meet our goal. We choose the Story Cloze Test and ROCStories Corpora (Mostafazadeh et al., 2016, 2017) as our data source. The dataset is created for commonsense story understanding and generation, and consists of 101,903 5-sentence stories. These stories are rather simple in terms of vocabulary and syntax, but still contain rich phrases. In addition, the topic is constrained to daily life.

Since the vocabulary size of 42,458 is large, we select the 2,000 most frequent words as our vocabulary and extract sentences whose words come exclusively from this restricted vocabulary. Moreover, sentences that are longer than 20 words are removed. In this way, we finally obtain 216,246 sentences for parallel data construction. More detailed statistics including comparison to WMT and IWSLT data are shown in Appendix B.

Footnote 2: https://wit3.fbk.eu/
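The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `filter_sentences`, the lowercasing, and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def filter_sentences(sentences, vocab_size=2000, max_len=20):
    """Keep sentences made only of the `vocab_size` most frequent words
    and containing at most `max_len` tokens.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; the paper does not specify the tokenization used here.
    """
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(tok for toks in tokenized for tok in toks)
    vocab = {word for word, _ in counts.most_common(vocab_size)}
    kept = [
        " ".join(toks)
        for toks in tokenized
        if toks and len(toks) <= max_len and all(t in vocab for t in toks)
    ]
    return kept, vocab


if __name__ == "__main__":
    corpus = ["the dog ran", "the dog ran", "the cat ran", "the cat sat"]
    kept, vocab = filter_sentences(corpus, vocab_size=4, max_len=20)
    print(sorted(vocab))
    print(kept)
```

With the defaults (`vocab_size=2000`, `max_len=20`) the call mirrors the thresholds stated in the text.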

Parallel Data Construction
We take an MT post-editing method to construct parallel data, first using a public translation engine to obtain model-generated translations, and then requesting expert translators to post-edit them. The following aspects are highlighted:
• Ensure the fluency of translations.
• Ensure word-level matching between translated sentences and source sentences. Typically, every word should be correctly translated, without omission for legibility.
Finally, we obtain a parallel dataset of 216,246 sentences in CoGnition, and randomly split it into three subsets: 196,246 sentence pairs for training, 10,000 sentence pairs for validation, and 10,000 sentence pairs as the random test set. In addition to the above split, we make a compositional generalization test set, which is described in the next section.
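As a sanity check on the split sizes, a random split of this shape can be sketched as below. The function name `split_dataset` and the fixed seed are assumptions for illustration; the paper only states that the split is random.

```python
import random


def split_dataset(pairs, n_valid=10_000, n_test=10_000, seed=0):
    """Shuffle sentence pairs and cut them into train/valid/test.

    Assumption: a seeded shuffle followed by slicing; the actual
    procedure used for CoGnition is not described in detail.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test


if __name__ == "__main__":
    # Integers stand in for the 216,246 sentence pairs.
    train, valid, test = split_dataset(range(216_246))
    print(len(train), len(valid), len(test))
```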

Compositional Generalization Test Set
We manually construct a special test set dedicated for evaluation of compositional generalization, by synthesizing new source sentences based on novel compounds and known contexts.
Designing Compound Patterns. We use the Berkeley Parser to obtain constituent trees (Kitaev and Klein, 2018). In CoGnition, noun phrases (NP), verb phrases (VP) and prepositional phrases (PP) are the three most frequent constituents, accounting for 85.1% of all constituents, and thus we construct compounds based on them. According to syntactic and semantic rules (Partee, 1995), we choose basic semantic components as our atoms, including determiners (DET), nouns (N), verbs (V), prepositions (P), adjectives (ADJ), and postpositive modifiers (MOD). Specifically, postpositive modifiers include prepositional phrases and relative clauses, and can contain multiple words. We consider them as a single atom due to their semantic inseparability. In this way, we generate 4 compound patterns each for NP, VP, and PP, which are listed in Table 3 with corresponding examples.
Making Novel Compounds. We use Stanza (Qi et al., 2020) to obtain POS tags for each word in the training sentences. We construct novel compounds by first selecting atom candidates with relatively consistent translations in the training set. The frequency of candidate atoms covers a wide range from 34 to 73,518. We list the full set of atom candidates in Table 4. For constructing compounds, we enumerate all possible combinations of atoms according to the patterns in Table 3, and then remove those that are ungrammatical or likely to cause ethical issues, finally obtaining 2,160 compounds. We do not deliberately make all compounds unseen, yet only 0.93% of them appear in the training data.
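The enumeration step can be sketched with a Cartesian product over atom slots. The atom inventory below is a tiny illustrative stand-in assembled from examples quoted in this paper, not the candidate set of Table 4, and the pattern tuples are assumptions about how the patterns of Table 3 might be encoded. The manual removal of ungrammatical or ethically problematic compounds is not modeled.

```python
from itertools import product

# Illustrative atoms only (assumption); the real candidates are in Table 4.
ATOMS = {
    "DET": ["the", "every"],
    "ADJ": ["smart", "special"],
    "N": ["lawyer", "chair"],
    "MOD": ["he liked", "on the floor"],
    "V": ["found", "taking"],
    "P": ["on"],
}


def enumerate_compounds(pattern, atoms=ATOMS):
    """Return every compound obtained by filling each slot of `pattern`
    (a tuple of atom types) with every candidate atom of that type."""
    slots = (atoms[slot] for slot in pattern)
    return [" ".join(choice) for choice in product(*slots)]


if __name__ == "__main__":
    for compound in enumerate_compounds(("DET", "ADJ", "N")):
        print(compound)
```

The number of compounds per pattern is the product of the slot sizes, which is why the candidate list is pruned manually afterwards.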
Synthesizing Source Sentences. We embed the compounds in specific contexts to form complete source sentences. Concretely, we first apply the Berkeley Parser to the training sentences to obtain sentence templates, where certain constituents are replaced by placeholders according to their constituent types, e.g., "NP-placeholder spent a lot of time to set up a wedding .". Then we select 5 sentence templates for each constructed compound accordingly, so that every compound can be evaluated under 5 different contexts. To distinguish NP compounds from VP and PP compounds, we put NP compounds only in sentences with the placeholder outside VP and PP.
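A minimal sketch of the template-filling step is given below. The helper name `fill_templates`, the seeded random choice of templates, and the sample templates other than the one quoted above are assumptions for illustration; capitalization and the NP-outside-VP/PP constraint are not handled.

```python
import random


def fill_templates(compound, ctype, templates, k=5, seed=0):
    """Insert `compound` into `k` templates that contain a placeholder of
    constituent type `ctype` (e.g. "NP"), yielding `k` source sentences.

    Assumption: templates are chosen uniformly at random among matches.
    """
    placeholder = f"{ctype}-placeholder"
    candidates = [t for t in templates if placeholder in t]
    if len(candidates) < k:
        raise ValueError(f"need at least {k} templates for {ctype}")
    rng = random.Random(seed)
    chosen = rng.sample(candidates, k)
    return [t.replace(placeholder, compound, 1) for t in chosen]


if __name__ == "__main__":
    templates = [
        "NP-placeholder spent a lot of time to set up a wedding .",
        "NP-placeholder was sick .",
        "NP-placeholder had fun with a toy .",
        "NP-placeholder is running .",
        "NP-placeholder went home .",
        "NP-placeholder made dinner .",
        "he VP-placeholder .",
    ]
    for sentence in fill_templates("the smart lawyer", "NP", templates):
        print(sentence)
```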
Making Reference. To maintain statistical consistency, target translations of synthetic sentences are also obtained using the same MT post-editing approach. In addition to the annotation principles listed in Section 4.2, we set several additional rules:
• Filter sentences with ethical issues and replace them with other synthetic ones.
• Ensure the accuracy of compound translation.
Finally, we obtain a compositional generalization test set (CG test set) of 10,800 parallel sentences. The final dataset statistics are shown in Table 5.

Automatic Evaluation
We mainly adopt human evaluation for the experiments of this paper (Section 5) for ensuring reliability of findings. Despite its accuracy, human evaluation can be expensive. To facilitate fast evaluation in future research, we introduce an automatic evaluation approach to quantify a model's generalization ability on our CG test set.
In particular, we manually construct a dictionary for all the atoms based on the training set (See Appendix C). The prerequisite of correctly translating one compound is that all of the atom translations should be contained. Besides, in most cases the translation of nouns should be placed after that of other atoms. Based on this, we design a heuristic algorithm to determine whether compounds are translated correctly. With the human annotation as ground truth, our automatic evaluation tool achieves a precision of 94.80% and a recall of 87.05%, demonstrating it can serve as an approximate alternative to human evaluation.
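The two conditions stated above (every atom translation is present, and the noun translation follows the other atoms) can be sketched as a simple string check. This is not the released evaluation tool: the function name, the lexicon format, and the use of first-occurrence character positions are assumptions for illustration, and the tiny lexicon is a hypothetical example rather than an excerpt of Appendix C.

```python
def compound_is_correct(hypothesis, atoms, lexicon, noun):
    """Heuristically judge a compound translation.

    hypothesis: translated sentence (string, no segmentation needed).
    atoms:      source atoms of the compound, e.g. ["every", "smart", "lawyer"].
    lexicon:    dict mapping each atom to a list of accepted translations.
    noun:       the atom in `atoms` that is the head noun.
    """
    positions = {}
    for atom in atoms:
        found = [hypothesis.find(t) for t in lexicon[atom] if t in hypothesis]
        if not found:
            return False  # an atom translation is missing
        positions[atom] = min(found)
    # The noun translation should come after every other atom translation.
    return all(positions[noun] > positions[a] for a in atoms if a != noun)


if __name__ == "__main__":
    lexicon = {"every": ["每个"], "smart": ["聪明"], "lawyer": ["律师"]}
    atoms = ["every", "smart", "lawyer"]
    print(compound_is_correct("每个聪明的律师都来了", atoms, lexicon, "lawyer"))
```

A check of this kind trades accuracy for speed, which is consistent with the precision and recall figures reported above being below 100%.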

Experiments
We conduct experiments on the CoGnition dataset and perform human evaluation on the model results.

Settings
We tokenize the English side using the Moses tokenizer and do not apply byte pair encoding (BPE) (Sennrich et al., 2016) due to the small vocabulary (i.e., 2,000 words). The Chinese sentences are segmented by the jieba segmenter (footnote 3). On the Chinese side, we employ BPE with 3,000 merge operations, generating a vocabulary of 5,500 subwords. We focus on the Transformer (Vaswani et al., 2017) because of its state-of-the-art performance on machine translation (Edunov et al., 2018b; Takase and Kiyono, 2021; Raffel et al., 2020; Zhu et al., 2020; Liu et al., 2020b) and better performance on existing compositional generalization datasets (Daniel et al., 2019). We implement our model using the BASE configuration provided by Fairseq (Ott et al., 2019). The model consists of a 6-layer encoder and a 6-layer decoder with a hidden size of 512. We tie input and output embeddings on the target side. The model parameters are optimized by Adam (Kingma and Ba, 2015), with $\beta_1 = 0.1$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. The model is trained for 100,000 steps and we choose the best checkpoint on the validation set for evaluation.
We report character-level BLEU scores using SacreBLEU (Post, 2018) to measure the overall translation performance. In addition, we request expert translators to annotate the correctness of compound translation. Translators are asked to focus only on examining whether the compound itself is translated correctly or not, disregarding errors in context. Specifically, a compound is correct only if its translation contains the semantic meaning of all atoms and is fluent in human language. Since each of the 2,160 compounds is provided with 5 contexts, we can compute the translation error rate for each compound.

Footnote 3: https://github.com/fxsjy/jieba

Table 6 shows the results. Besides the CG test set, we also list results on three of its subsets, which only contain NP, VP or PP compounds respectively. The model achieves a 69.58 BLEU score on the random test set, which partly indicates distributional consistency and quality of the dataset. In comparison, the performance on the CG test set drops dramatically by more than 20 BLEU points. Given that the only difference between synthetic sentences and training sentences is the unseen compounds (i.e., contexts are seen in training), the decrease of 20 BLEU points indicates that unseen compounds pose a significant challenge, which is however easily overlooked by traditional evaluation metrics. For example, the model mistranslates "alas , he became sick from eating all of the peanut butter on the ball" into a Chinese sentence meaning "alas , he became sick from eating all of the peanut butter on the field". With a minor mistake on the compound "on the ball", the model achieves a sentence-level BLEU of 61.4, even though the full sentence meaning is largely affected. In other words, the BLEU score of 69.58 can be misleading since novel compounds can be rare in the random test set. Such mistakes in generalizing to new compounds can severely hinder the overall performance of translation engines in practice, as shown earlier in Table 1.
Also, we calculate BLEU for the original training sentences that provide contexts for the CG test set (row 3). The model achieves 99.74 BLEU, further demonstrating that the performance degradation is mainly caused by the unseen compounds.

Main Results
Instance-wise, 27.31% of compounds are translated incorrectly. However, when aggregating all 5 contexts, 61.62% of compounds suffer at least one incorrect translation. This suggests that a well-trained NMT model is not robust in translating compounds, though all atoms within them are highly frequent in the training set. We also observe that the error rate of PP compounds, 37.72%, is much higher than the other two, 21.94% and 22.25%, which we will discuss in detail in the following section.
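The difference between the instance-level and aggregate-level error rates reported above can be made concrete with a short sketch. The function name and the input format (one list of per-context judgements per compound) are assumptions for illustration; the toy judgements are invented and unrelated to the actual annotations.

```python
def error_rates(judgements):
    """Compute (instance-level, aggregate-level) compound error rates.

    judgements: dict mapping each compound to a list of booleans, one per
    context, where True means the compound was translated correctly.
    Instance level counts every (compound, context) pair; aggregate level
    counts a compound as wrong if it fails in at least one context.
    """
    total = sum(len(v) for v in judgements.values())
    wrong = sum(v.count(False) for v in judgements.values())
    failed_compounds = sum(1 for v in judgements.values() if not all(v))
    return wrong / total, failed_compounds / len(judgements)


if __name__ == "__main__":
    toy = {
        "the smart lawyer": [True, True, True, True, True],
        "the dog he liked": [True, False, True, True, True],
        "on the ball": [False, False, True, True, True],
        "found the chair": [True, True, True, True, True],
    }
    instance_rate, aggregate_rate = error_rates(toy)
    print(f"instance: {instance_rate:.2%}, aggregate: {aggregate_rate:.2%}")
```

Because a single failure in any of the 5 contexts marks the compound as failed, the aggregate rate is always at least as high as the instance rate.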

Analysis
We conduct experiments to explore in what situations the model is error-prone by considering compound frequency, compound length, compound structure, atom frequency, atom co-occurrence, and the complexity of external context.

Compound Frequency
Intuitively, compounds with higher frequencies in the training set are easier to infer. We classify compounds according to their frequency levels, including many-shot (frequency higher than 10), few-shot (frequency from 1 to 10) and zero-shot, and show the error rate for each bucket in Figure 2. The model translates all the many-shot compounds correctly. For few-shot compounds, the translation error rate increases to 5.00%, but is still much lower than that of zero-shot compounds, which have an error rate of 27.53%.
The result suggests the model is good at memorizing correspondence between sentence segments. However, the model deteriorates severely when test samples are unseen in the training set, which further confirms the model's weakness in compositional generalization (Lake and Baroni, 2018).

Figure 4: Effect of atom frequency on compound translation error rate.

Compound Length
As shown in Figure 3, the error rate grows with the increase of compound length (i.e., the number of atoms in a compound). Only 4.50% of the shortest compounds are translated incorrectly, each of which consists of a determiner and a noun. The error rate increases to 13.72% when the compound length grows to 3 atoms (e.g., "the smart lawyer").
The longest compounds each contain a determiner, a noun, an adjective, a modifier and a preposition or verb, e.g., "taking every special chair he liked". The error rate increases to 36.63%, demonstrating that it is more difficult to generalize to longer compounds, which contain richer semantic information. We conjecture that if the range of compounds is further expanded, the error rate will be much higher.

Atom Frequency
We empirically divide compounds into multiple groups according to the minimum frequency of their atoms, where each group consists of similar numbers of compounds. The intuition is that an atom with low frequency might be difficult to translate and therefore hinder the translation of the whole compound. We fix the compound length to 3 in order to reduce the effects of compound length. As shown in Figure 4, the error rate has no strong correlation with the atom frequency. This can be because all atoms in our corpus are simple and relatively frequent, and thus it is easy for the NMT model to memorize the semantics of most atoms. Therefore, simply increasing atom frequency does not enhance the model's ability to generalize to novel compounds. We observe similar patterns for compounds of other lengths (Appendix A).

Atom Co-occurrence
Although the NMT model may never see a compound, there can exist many local segments where atoms co-occur. For example, in the unseen compound "the smart lawyer", "smart" and "lawyer" may occur within some training sentences. Intuitively, compounds whose atoms co-occur more frequently may be translated better. We calculate pointwise mutual information (PMI) and compare error rates of compounds with positive or negative mean PMI scores (MPMI):

$$\mathrm{MPMI}(C) = \frac{1}{M} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \mathrm{PMI}(a_i, a_j),$$

where $a_i$ is the i-th atom in the compound $C$, $N$ is the compound length, $M$ is the number of possible combinations of two atoms, and the PMI score is computed as:

$$\mathrm{PMI}(a_i, a_j) = \log \frac{p(a_i, a_j)}{p(a_i)\, p(a_j)},$$

where the probabilities $p(a_i)$ and $p(a_i, a_j)$ are obtained by dividing the number of n-grams in which one word or both words occur by the total number of n-grams (we use 5-grams here). We divide compounds into 4 groups by their length and compare error rates within each group. As shown in Figure 5, across all groups, the error rates with positive mean PMI scores are lower than those with negative ones, verifying our hypothesis.

Figure 6: Compound translation error rates of different patterns.

Figure 6 shows the error rates of all compound patterns in Table 3. The MOD atom exerts salient influence on translation error rate. The error rate of compounds with MOD is 19.78% higher than those without on average. In contrast, adding ADJ into compounds only increases the error rate by 2.66%. The major difficulty caused by MOD is word reordering. One can translate "the small dog" monotonically without adjusting word order. However, compounds like "the dog he liked" require the model to recognize "he liked" as MOD and put its translation before that of "the dog" in Chinese. We find many cases where the model translates such compounds without reordering or breaks the connection between nouns and modifiers. Across these groups, we can see that the error rate of NP (Pattern 1.*) is generally lower than that of VP (Pattern 2.*) and PP (Pattern 3.*). This phenomenon is more obvious for the patterns without MOD. The reason is that compounds in Pattern 1.* are generally shorter and contain less semantic and syntactic information. However, the error rates of Pattern 2.3 and 2.4 are lower than other patterns with MOD (i.e., Pattern 1.3, 1.4, 3.3 and 3.4), indicating the model performs better in "V+DET(+ADJ)+NN+MOD". This can be because under certain situations the MOD can be useful for correctly translating verbs, which are more commonly seen in the training set, e.g., "found the chair on the floor".
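Returning to the atom co-occurrence measure, the mean PMI score of a compound can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: atoms are treated as single whitespace tokens, sentences shorter than the window size contribute one window, the natural logarithm is used, and a pair that never co-occurs is assigned a PMI of negative infinity. The function names are hypothetical.

```python
import math
from itertools import combinations


def ngram_windows(sentences, n=5):
    """Collect the token set of every n-gram window in the corpus.
    Assumption: a sentence shorter than n contributes a single window."""
    windows = []
    for sent in sentences:
        toks = sent.split()
        if len(toks) < n:
            windows.append(set(toks))
        else:
            windows.extend(set(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return windows


def mean_pmi(atoms, windows):
    """Average PMI over all unordered pairs of atoms in a compound."""
    total = len(windows)

    def prob(*words):
        # Fraction of windows in which all given words occur.
        return sum(1 for w in windows if all(x in w for x in words)) / total

    scores = []
    for a, b in combinations(atoms, 2):
        joint = prob(a, b)
        if joint == 0:
            scores.append(float("-inf"))  # assumption: no co-occurrence
        else:
            scores.append(math.log(joint / (prob(a) * prob(b))))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    corpus = ["smart lawyer", "smart dog", "red car", "red car"]
    windows = ngram_windows(corpus, n=2)
    print(mean_pmi(["smart", "lawyer"], windows))
    print(mean_pmi(["smart", "car"], windows))
```

Only the sign of the mean score matters for the positive/negative grouping used above, so the choice of logarithm base does not affect the grouping.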

Linguistic Factors
We also observe that compounds of PP (Pattern 3.*) are more difficult to translate compared with VP (Pattern 2.*), although both types of compounds share the same compound length. In the training set, verbs typically have consistent translations, whereas the meanings of prepositions vary with contexts. Therefore, prepositional compounds are more difficult to translate, as more context information is required to ground their meanings.

Effect of External Context
Due to the nature of NMT, the semantic representation of each compound is context-aware. Intuitively, translation of compounds is also influenced by external context, which is sentential in our case but can also be document-level in practice. We investigate the effects of context length and sentence comprehension difficulty. In particular, the context length is calculated by subtracting the number of words in the compound from the sentence length. Comprehension difficulty of the training sentences which provide contexts is quantified by the mean dependency distance (Liu, 2008):

$$\mathrm{MDD} = \frac{1}{N-1} \sum_{i=1}^{N-1} |D_i|,$$

where $N$ is the number of words in the sentence and $D_i$ is the dependency distance of the i-th syntactic link of the sentence.
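Both context measures can be sketched in a few lines. The input format is an assumption for illustration: `heads[i]` holds the 1-based index of the head of word i+1, with 0 marking the root, as in CoNLL-style dependency output. The example parse of "The red dog is running" is written by hand rather than produced by a parser.

```python
def context_length(sentence_tokens, compound_tokens):
    """Number of sentence words that lie outside the compound."""
    return len(sentence_tokens) - len(compound_tokens)


def mean_dependency_distance(heads):
    """Average absolute distance between each word and its head.

    heads: list where heads[i] is the 1-based position of the head of
    word i+1, and 0 marks the root (which contributes no link).
    For a single-rooted tree over N words there are N-1 links.
    """
    links = [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]
    if not links:
        return 0.0
    return sum(links) / len(links)


if __name__ == "__main__":
    # "The red dog is running": The->dog, red->dog, dog->running,
    # is->running, running->ROOT (hand-written parse, for illustration).
    heads = [3, 3, 5, 5, 0]
    print(mean_dependency_distance(heads))
    print(context_length("The red dog is running".split(), "The red dog".split()))
```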
The results are shown in Figure 7. The translation error rate increases steadily with the context length as well as the dependency distance. These observations demonstrate that the generalization for novel compounds correlates strongly with context complexity. Sentences with higher dependency distances are harder for the model to comprehend during training. Given that our test sentences are restricted to 20 words, compositional generalization can be more challenging in practice, where average sentence lengths can be much longer.

Conclusion
We proposed a dedicated parallel dataset for measuring compositional generalization of NMT and quantitatively analyzed a Transformer-based NMT model manually. Results show that the model exhibits poor performance on novel compound translation, which demonstrates that the NMT model suffers from fragile compositionality, and that this can be easily overlooked under traditional metrics. To the best of our knowledge, we are the first to propose a practical benchmark for compositionality of NMT, which can be a testbed for models tailored for this specific problem.

Ethics Consideration
As mentioned, we collected our data from the Story Cloze Test and ROCStories Corpora, which are publicly available for academic use and contain no sensitive information (Mostafazadeh et al., 2016, 2017). The legal advisor of our institute confirms that the sources of our data are freely accessible online without copyright constraints on academic use. Our data construction involves manual annotation. Annotators were asked to post-edit machine translations and filter out samples that may cause ethical issues; these tasks do not involve any sensitive personal information. We hired 4 annotators who have degrees in English Linguistics or Applied Linguistics. Before formal annotation, annotators were asked to annotate 100 samples randomly extracted from the dataset, and based on the average annotation time we set a fair salary (i.e., 32 dollars per hour) for them. During their training annotation process, they were paid as well.

A Atom Frequency
For compounds of other lengths, we also compute their error rates with respect to minimum atom frequency. As shown in Figures 8, 9 and 10, the error rate does not correlate with atom frequency across all compound lengths.

[Figures 8-10: error rate (y-axis) against minimum atom frequency (x-axis) for compounds of other lengths.]

Table 7 and Table 8 list statistics of several monolingual data sources, compared with the data source (ROC-Filter) used in constructing the CoGnition dataset. We can see that our dataset has both shorter sentences and a vocabulary made up of more frequent words.

C Lexicon
Part of the lexicon for automatic evaluation is shown in Table 9.