Rule-based Morphological Inflection Improves Neural Terminology Translation

Current approaches to incorporating terminology constraints in machine translation (MT) typically assume that the constraint terms are provided in their correct morphological forms. This limits their application to real-world scenarios where constraint terms are provided as lemmas. In this paper, we introduce a modular framework for incorporating lemma constraints in neural MT (NMT) in which linguistic knowledge and diverse types of NMT models can be flexibly applied. It is based on a novel cross-lingual inflection module that inflects the target lemma constraints based on the source context. We explore linguistically motivated rule-based and data-driven neural-based inflection modules and design English-German health and English-Lithuanian news test suites to evaluate them in domain adaptation and low-resource MT settings. Results show that our rule-based inflection module helps NMT models incorporate lemma constraints more accurately than a neural module and outperforms the existing end-to-end approach with lower training costs.


Introduction
Incorporating terminology constraints in machine translation (MT) has proven useful for adapting lexical choice to new domains (Hokamp and Liu, 2017) and for improving translation consistency within a document (Ture et al., 2012). In neural MT (NMT), most prior work focuses on incorporating terms in the output exactly as given, using soft (Song et al., 2019; Dinu et al., 2019; Xu and Carpuat, 2021) or hard constraints (Hokamp and Liu, 2017; Post and Vilar, 2018). These approaches are problematic when translating into morphologically rich languages, where terminology should be adequately inflected in the output, whereas it is more natural and flexible to provide constraints as lemmas, as in a dictionary.
To the best of our knowledge, only one paper has directly addressed this problem for neural MT: Bergmanis and Pinnis (2021) design an NMT model trained to copy and inflect the terminology constraints using target lemma annotations (TLA), which are synthetic training samples where the source sentence is tagged with automatically generated lemma constraints. While this approach improves translation quality, the end-to-end training set-up prevents fast adaptation to lemmas and inflected forms that are rare or unseen at training time. Its impact is also limited to a specific neural architecture, and it is unclear whether its benefits port to more generic sequence-to-sequence models.
In this paper, we introduce a modular framework for inflecting terminology constraints in NMT. It relies on a cross-lingual inflection module that predicts the inflected form of each lemma constraint based on the source context only. The inflected lemmas can then be incorporated into NMT using any of the aforementioned constrained NMT techniques. Compared with TLA, this framework is more flexible, as it can be applied to diverse types of NMT architectures and inflection modules, and facilitates fast adaptation to new terminologies without retraining the base NMT model from scratch. This flexibility is enabled by the cross-lingual nature of the inflection module, which differs from traditional inflection models that predict the inflected forms based on pre-specified morphological tags or monolingual target context.
Based on this framework, this paper makes the following contributions:
• We construct and release test suites to evaluate models' ability to inflect terminology constraints for domain adaptation (English-German Health) and low-resource MT (English-Lithuanian News).
• We show that integrating linguistic knowledge through a simple rule-based inflection module improves over its neural counterpart in intrinsic and end-to-end MT evaluations.
• Our framework improves autoregressive and non-autoregressive translation, and outperforms the existing TLA approach for inflecting terminology translation.
We open-source the code to facilitate replication and extensions.

Background
Autoregressive NMT with Constraints. Terminology constraints can be incorporated in autoregressive NMT models via 1) constrained decoding, where constraint terms are incorporated in the beam search algorithm (Hokamp and Liu, 2017; Post and Vilar, 2018), or 2) constrained training, where NMT models are trained to incorporate constraints using synthetic parallel data augmented with constraint terms on the source side (Song et al., 2019; Dinu et al., 2019). These approaches all assume that the constraints are provided in the correct inflected forms and can be directly copied to the target sentence. Bergmanis and Pinnis (2021) extended the constrained training approach of Dinu et al. (2019) to incorporate lemma-form constraints in an end-to-end way: the inflected forms of the lemma constraints are predicted jointly during translation. This approach requires a dedicated NMT model architecture to integrate constraints as additional inputs to the encoder, and learns inflection solely from the parallel data. By contrast, our approach can be applied to multiple NMT architectures and uses linguistically motivated rules that generalize better to rare and unseen terms.
Non-Autoregressive NMT with Constraints. Instead of generating the output sequence incrementally from left to right, non-autoregressive NMT generates tokens in parallel (Gu et al., 2018; van den Oord et al., 2018; Ma et al., 2019) or by iteratively editing an initial sequence (Lee et al., 2018; Ghazvininejad et al., 2019). Architectures differ in the nature of their edit operations: the Levenshtein Transformer (Gu et al., 2019) relies on insertion and deletion, while EDITOR (Xu and Carpuat, 2021) uses insertion and reposition (where each input token can be repositioned or deleted). Edit-based non-autoregressive generation provides a natural way to incorporate constraints in NMT: the constraints can be put into the initial sequence and edited to produce the final translation (Susanto et al., 2020; Xu and Carpuat, 2021; Wan et al., 2020). Our approach can augment this family of techniques by inflecting constraints before they are used for further editing.
Morphological Inflection. Morphological inflection is the process of altering the morphological form of a lexeme to add morpho-syntactic information about the word in a sentence (e.g. tense, case, number). Traditionally, morphological inflection as a computational task is framed as predicting the inflected form of a word given its lemma and a set of morphological tags (e.g. N;ACC;PL represents a plural noun used in the accusative case) (Cotterell et al., 2017). The task was traditionally tackled using hand-engineered finite-state transducers that rely on linguistic knowledge (Koskenniemi, 1984; Kaplan and Kay, 1994), while recent work has shown impressive results by modeling it using neural sequence-to-sequence models (Faruqui et al., 2016). More recently, a context-based inflection task has been proposed where the inflected form of a lemma is predicted given the rest of the sentence as context (Cotterell et al., 2018). The state-of-the-art models for the task are neural models trained on supervised data (Cotterell et al., 2018; Kementchedjhieva et al., 2018). The inflection module in our framework differs from those for the context-based inflection task in that it requires cross-lingual context-based inflection: it predicts the inflected form of a target lemma based only on the source language context.

Morphologically-Aware Translation
In phrase-based MT, modeling morphological compounds on the source (Koehn and Knight, 2003) and target sides (Cap et al., 2014) improves translation quality. In NMT, morphologically-aware segmentation is also useful when translating from or into morphologically complex languages (Huck et al., 2017; Ataman and Federico, 2018; Banerjee and Bhattacharyya, 2018). Tamchyna et al. (2017) propose to overcome data sparsity caused by inflection by training NMT models to predict the lemma form and morphological tag of each target word. Different from prior work, we incorporate grammatical and morphological knowledge in an inflection module for terminology constraints in NMT.

Inflecting Target Lemmas Given the Source Context
We introduce a modular framework for inflecting terminology constraints for NMT: we first build an inflection module that predicts the inflected form of each target lemma term based on the source sentence, and then incorporate the inflected constraints in NMT using any of the aforementioned techniques. By framing the problem this way, we assume that the inflected forms can be inferred based only on the source context and integrated in a fluent translation by NMT models. In cases where there are multiple possible inflected forms corresponding to different ways of translating the source, the inflection module can predict one of the possible forms, and the NMT model can generate a translation conditioned on the predicted forms of the constraints. Compared with Bergmanis and Pinnis (2021), our framework is more flexible: it can be combined with any NMT model that enables translation with constraints and can leverage diverse types of morphological inflection modules in which linguistic knowledge can be easily incorporated. Formally, given a source sequence $x$ and $k$ target lemma words $\bar{z} = (\bar{z}_1, \bar{z}_2, \ldots, \bar{z}_k)$ that need to be inflected, the inflection module $\Theta$ predicts the inflected form of each target lemma $z = (z_1, z_2, \ldots, z_k)$ independently:
$$P(z \mid x, \bar{z}; \Theta) = \prod_{i=1}^{k} P(z_i \mid x, \bar{z}_i; \Theta)$$
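The two-stage framing above can be illustrated with a minimal sketch. It assumes two hypothetical interfaces that are not named in the paper: an `Inflector` that maps a source sentence and a target lemma to an inflected form, and a `Translator` that stands for any NMT system accepting constraints. The toy stand-ins only make the sketch runnable; they are not real models.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces, named here only for illustration.
Inflector = Callable[[str, str], str]             # (source sentence, target lemma) -> inflected form
Translator = Callable[[str, Sequence[str]], str]  # (source sentence, constraints) -> translation


def inflect_constraints(source: str, lemmas: Sequence[str], inflector: Inflector) -> List[str]:
    """Inflect each target lemma independently, using only the source sentence as context."""
    return [inflector(source, lemma) for lemma in lemmas]


def translate_with_lemma_constraints(
    source: str, lemmas: Sequence[str], inflector: Inflector, translator: Translator
) -> str:
    """Stage 1: inflect the lemma constraints. Stage 2: pass them to a constrained NMT system."""
    return translator(source, inflect_constraints(source, lemmas, inflector))


# Toy stand-ins so the sketch runs end to end; neither is a real model.
def toy_inflector(source: str, lemma: str) -> str:
    lookup = {"karilionas": "karilionu"}  # instrumental form, as in the carillon example
    return lookup.get(lemma, lemma)


def toy_translator(source: str, constraints: Sequence[str]) -> str:
    return "<translation containing: " + ", ".join(constraints) + ">"


if __name__ == "__main__":
    src = "The expert who played the carillon in July called it something else."
    print(translate_with_lemma_constraints(src, ["karilionas"], toy_inflector, toy_translator))
```

Because the inflector sees only the source sentence and one lemma at a time, it can be swapped (rule-based or neural) without touching the translator, which is the modularity the framework relies on.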

Rule-Based Inflection Module
One can predict the inflected form of a target word given its lemma and the source context in two steps: first predict the morphological tag of the target word based on the source context, and then predict the inflected form based on the lemma and morphological tag. The second step can be modeled using traditional inflection models (Cotterell et al., 2017), while the first step can be performed using rule-based inference based on linguistic knowledge.
McCarthy et al. (2020) present a universal morphological (UniMorph) paradigm with universal morphological tags for hundreds of world languages. In UniMorph, the morphological tag of a verb includes information about the tense (past, present, or future), mood (indicative, conditional, imperative, or subjunctive), and the number (singular or plural) and person (first, second, or third person) of the subject. The tag of a noun or adjective includes information about gender (masculine or feminine), number, and grammatical case. Some of these can be inferred from the target lemma (e.g. the gender of a noun) or the source term (e.g. the number of a noun), while others need to be inferred based on the grammatical function of the source term in the sentence (e.g. grammatical case) or the sentence-level semantics (e.g. mood). Many of the inference rules are shared across a wide range of languages, except for the tense and mood of verbs, as well as the gender and some grammatical cases of nouns and adjectives.
In our rule-based inflection module, we extract the morphological features, part-of-speech tags, and dependency parse tree of the source sentence using pre-trained Stanza models and infer the aforementioned classes based on grammar rules and validation examples. The tense and mood of a verb are inferred from the morphological form of the corresponding source term, while the number and person of its subject are inferred from the morphological form of the source subject.
For nouns and adjectives, the number can be inferred from the morphological form of the source term or modified noun, while the gender can be determined based on the target lemma.
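As a minimal sketch of the first step (tag inference), the following code maps source-side morphological features to UniMorph-style tags. It assumes the features arrive as Universal Dependencies feature strings (the format produced by taggers such as Stanza); the function names, the label subset, and the default values are illustrative assumptions, not the rules used in our module. Gender, which is read off the target lemma, is left out for brevity.

```python
from typing import Dict, Optional

# Universal Dependencies feature values -> UniMorph-style labels (illustrative subset).
NUMBER = {"Sing": "SG", "Plur": "PL"}
TENSE = {"Past": "PST", "Pres": "PRS", "Fut": "FUT"}
MOOD = {"Ind": "IND", "Imp": "IMP", "Sub": "SBJV", "Cnd": "COND"}


def parse_feats(feats: Optional[str]) -> Dict[str, str]:
    """Parse a feature string such as 'Number=Plur|Tense=Past' into a dict."""
    if not feats:
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def verb_tag(verb_feats: Optional[str], subject_feats: Optional[str]) -> str:
    """Tense and mood come from the source verb; number and person from its subject."""
    verb, subj = parse_feats(verb_feats), parse_feats(subject_feats)
    mood = MOOD.get(verb.get("Mood", ""), "IND")     # default is an assumption
    tense = TENSE.get(verb.get("Tense", ""), "PRS")  # default is an assumption
    person = subj.get("Person", "3")                 # noun subjects carry no Person feature
    number = NUMBER.get(subj.get("Number", ""), "SG")
    return ";".join(["V", mood, tense, person, number])


def noun_tag(source_feats: Optional[str], case: str) -> str:
    """Number comes from the source term; case is supplied by a separate rule."""
    number = NUMBER.get(parse_feats(source_feats).get("Number", ""), "SG")
    return ";".join(["N", case, number])
```

For instance, a plural source noun whose case has been inferred as accusative yields `N;ACC;PL`, the tag format used as an example in the background section.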
To infer the grammatical case of a noun or adjective, one needs to infer the grammatical role of the source term in the sentence. For example, Lithuanian has seven main cases: nominative, genitive, dative, accusative, instrumental, locative, and vocative. Figure 1 shows examples of how the case of a Lithuanian noun can be inferred from the dependency parse tree of the source sentence. Some of the cases can be easily distinguished from the others, while some are more difficult to infer. In this example, the nominative case is comparatively easy to infer: the noun should be in the nominative case when the corresponding source term is the root or subject of the sentence. However, to distinguish between the dative, accusative, instrumental, and locative cases, one needs to consider both the grammatical and the semantic role of the source term. In our rule-based module, we only take into account the most common scenarios.
Figure 1: Examples showing how the grammatical case of a target lemma is inferred from the dependency parse tree of the source sentence. In each example, the reference usage of the target constraint is underlined, and its corresponding source term is boldfaced and highlighted in the yellow, outlined box in each dependency tree. Figure (a) shows an example where the constraint term "smuikas" is used in the nominative case in the reference, since it is the root in the dependency tree. In Figure (b), the same constraint term is used in the accusative case in the reference, since it is the object of the root verb "bought". However, not all objects should be used in the accusative case. As shown in Figure (c), "smuikas" is used in the instrumental case, since it serves as the instrument with which the subject performs the action.
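The case rules illustrated in Figure 1 can be sketched as a small decision function over the dependency relation of the source term. This is a hedged illustration only: the `SourceToken` structure, the verb list, the preposition list, and the fallback are hypothetical placeholders, and the actual module covers more scenarios than the three shown in the figure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceToken:
    text: str
    deprel: str                        # dependency relation to the head, e.g. "root", "nsubj", "obj"
    head_lemma: Optional[str] = None   # lemma of the governing word (None for the root)
    case_marker: Optional[str] = None  # preposition attached to the token, if any


# Hypothetical placeholders: source verbs whose object is rendered in the
# instrumental case, and prepositions that signal an instrument.
INSTRUMENTAL_OBJECT_VERBS = {"play"}
INSTRUMENTAL_PREPOSITIONS = {"with"}


def infer_case(token: SourceToken) -> str:
    """Map the grammatical role of the source term to a target case label."""
    if token.deprel in {"root", "nsubj"}:
        return "NOM"  # root or subject of the sentence
    if token.deprel == "obj":
        if token.head_lemma in INSTRUMENTAL_OBJECT_VERBS:
            return "INS"  # object that serves as an instrument
        return "ACC"      # ordinary direct object
    if token.deprel == "obl" and token.case_marker in INSTRUMENTAL_PREPOSITIONS:
        return "INS"
    return "NOM"  # arbitrary fallback for roles this sketch does not cover
```

A rule set of this shape is easy to extend one scenario at a time, which is why only the most common scenarios need to be covered at first.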
Finally, given a lemma and its morphological tag, one can look up its inflected form in a morphological dictionary. We use DEMorphy (Altinok, 2018) for German and Wiktionary for Lithuanian. Since most Lithuanian nouns follow a set of declension rules, we inflect Lithuanian nouns based on these rules when the lemma is not found in the dictionary.
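The lookup-with-fallback step can be sketched as follows. The sketch assumes a dictionary keyed by (lemma, case) pairs and covers only the singular paradigm of Lithuanian masculine nouns whose lemma ends in "-as"; both the dictionary format and the restriction to one paradigm are simplifying assumptions, and the function name is hypothetical.

```python
from typing import Dict, Tuple

# Singular endings for Lithuanian masculine nouns whose lemma ends in "-as".
# Other declension classes and the plural are omitted in this sketch.
AS_SINGULAR_ENDINGS = {
    "NOM": "as",
    "GEN": "o",
    "DAT": "ui",
    "ACC": "ą",
    "INS": "u",
    "LOC": "e",
}


def inflect(lemma: str, case: str, dictionary: Dict[Tuple[str, str], str]) -> str:
    """Look the form up in the dictionary first; fall back to declension rules."""
    form = dictionary.get((lemma, case))
    if form is not None:
        return form
    if lemma.endswith("as") and case in AS_SINGULAR_ENDINGS:
        return lemma[:-2] + AS_SINGULAR_ENDINGS[case]
    return lemma  # no applicable rule: leave the lemma unchanged
```

With an empty dictionary, the fallback alone reproduces the forms seen in the test suite examples: "karilionas" in the instrumental becomes "karilionu", and "Johnsonas" in the genitive becomes "Johnsono".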

Neural Inflection Module
As prior work shows that BERT-style architectures (Devlin et al., 2019) can encode morphological information in their hidden representations and disambiguate morphologically ambiguous forms via contextualized encoding (Edmiston, 2020), we build the neural-based inflection module as a substitution model and base it on the encoder-decoder Transformer architecture, which embeds the source sentence through the encoder and the target lemmas through the decoder. Next, the decoder predicts the inflected form of each target word in parallel. The inflection module resembles the architecture of the conditional masked language model (CMLM) (Ghazvininejad et al., 2019) but differs in decoder input and output: CMLM takes the target sentence with some tokens masked out as input and is trained to predict only the masked tokens conditioned on unmasked ones, while our inflection module takes target tokens in their lemma forms as input and predicts their inflected forms.
CMLM only allows for one-to-one substitution of subwords. However, in the case of inflection, the number of subwords that constitute a lemma
Table 1 (example):
Source: The expert who played the carillon in July called it something else: "A cultural treasure" and "an irreplaceable historical instrument".
Constraint: carillon → karilionas
Reference: Liepos mėnesį karilionu grojęs ekspertas pavadino jį kitaip: "kultūros lobiu" ir "nepakeičiamu istoriniu instrumentu".

Health Test Suite
We construct the health test suite to test the models' ability to integrate terminology translations for fast domain adaptation. The test set contains English health information text annotated with domain-specific terminology translations and the human-translated sentences in German. We extract English→German test examples from the Himl Test Set, 10 which consists of English health information texts manually translated into German. We extract keyphrases from each source sentence using Yet Another Keyword Extractor (YAKE) (Campos et al., 2020) 11 and filter out phrases with high or medium frequency in the training corpora, since they are mostly common and domain-generic phrases. 12 We extract terminology translations from WikiTitles and an online English-German dictionary, and annotate the keyphrases whose dictionary translations match the reference translation. As shown in Table 1, each source sentence in the test set is annotated with health-related terminology translations in their lemma forms, some of which can be directly copied to the final translation while others need to be inflected based on the context.
10 http://www.himl.eu/test-sets
11 YAKE extracts n-grams as keyphrases based on word casing, frequency, position, and their sentence context.
12 We filter out keyphrases with frequency > 100 in the WMT news training data.
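The filtering and matching steps used to build the health test suite can be sketched as below. The sketch assumes that keyphrases have already been extracted, that training-corpus counts are available as a `Counter`, and that the reference has been lemmatised into a set of tokens; the function name and the toy data are hypothetical, and multi-word translations are not handled.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

MAX_TRAIN_FREQUENCY = 100  # keyphrases seen more often than this are dropped


def select_constraints(
    keyphrases: Iterable[str],
    train_counts: Counter,
    dictionary: Dict[str, str],
    reference_lemmas: Set[str],
) -> List[Tuple[str, str]]:
    """Keep rare keyphrases whose dictionary translation appears in the reference."""
    selected = []
    for phrase in keyphrases:
        if train_counts[phrase] > MAX_TRAIN_FREQUENCY:
            continue  # common, domain-generic phrase
        translation = dictionary.get(phrase)
        if translation is not None and translation in reference_lemmas:
            selected.append((phrase, translation))
    return selected


if __name__ == "__main__":
    # Toy data for illustration only.
    counts = Counter({"patient": 5000, "thrombosis": 12})
    bilingual = {"patient": "Patient", "thrombosis": "Thrombose", "aneurysm": "Aneurysma"}
    reference = {"der", "Thrombose", "behandeln"}
    print(select_constraints(["patient", "thrombosis", "aneurysm"], counts, bilingual, reference))
```

Matching against the reference guarantees that every retained constraint is compatible with the human translation, so term usage can later be scored against it.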

News Test Suite
The news test suite simulates the scenario where a user looks up keyphrases of a document in a bilingual dictionary and picks the top translation for each keyphrase as a constraint to help low-resource MT. We choose English→Lithuanian as an example of low-resource translation. The test suite is constructed from English→Lithuanian test examples from the WMT 2019 news test sets. We first extract keyphrases from each source document using YAKE. Then, we find the top translation of each keyphrase in an online dictionary (for many terms, only one translation is available). We filter out the keyphrases whose translations do not match the reference. Table 1 shows two examples from the same document in the test suite. All occurrences of a keyphrase in one document are annotated with its target translation to encourage consistent translation of keyphrases within a document. 16 Table 2 shows the number of sentences and constraints in each test suite.

Experimental Settings
Training Data. For English→German (En-De), we use the training corpora from WMT14 (Bojar et al., 2014) and newstest2013 for validation. For English→Lithuanian (En-Lt), we use the training data from WMT19 (Barrault et al., 2019) and newsdev2019 as the validation set. For preprocessing, we apply normalization, tokenization, true-casing, and BPE (Sennrich et al., 2016). 17
Baselines. We compare our model with the following baselines:
• Auto-Regressive (AR) baseline without integrating terminology constraints.
• AR with Constrained Decoding (CD) to incorporate hard constraints (Post, 2018).
• AR with Target Lemma Annotation (TLA) that integrates lemma constraints as an additional input stream on the source side (Bergmanis and Pinnis, 2021).
• Non-AutoRegressive (NAR) baseline based on the EDITOR model (Xu and Carpuat, 2021).
• NAR with constraints (NAR+C) that integrates constraints as the initial sequence in EDITOR without explicit inflection.
MT Models. All models are based on the base Transformer (Vaswani et al., 2017). 18 All models are trained with the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.0005 and effective batch sizes of 32k tokens for AR models and 64k tokens for NAR models for a maximum of 300k steps. 19 We select the best checkpoint based on validation perplexity. NAR models are trained via sequence-level knowledge distillation (Kim and Rush, 2016). For decoding, we use beam search with a beam size of 4 for AR and AR with TLA, while for AR with CD we use a beam size of 20, as suggested in prior work (Post and Vilar, 2018). To enhance constraint usage in NAR models, we adopt the techniques of Susanto et al. (2020): we prohibit deletions on constraint tokens or insertions within the constraint segments.
16 Interestingly, in Lithuanian, masculine foreign names are usually translated by appending a suffix to the name to reflect their inflected forms. In this example, the foreign name "Johnson" is translated into "Johnsonas" in the nominative form in the dictionary, while in the reference it becomes "Johnsono" in the genitive form.
17 See preprocessing details and data statistics in the Appendix.
18 See more details in the Appendix.
19 As shown in prior work (Zhou et al., 2020), the batch sizes for training NAR models are typically larger than those for AR models.
Neural Inflection Model. The synthetic training data for the neural inflection module is derived from the MT parallel data. We first lemmatise and part-of-speech tag the target sentences using Stanza. We then randomly select adjectives, verbs, nouns, and proper nouns from each target sentence and train the inflection module to predict their inflected forms based on their lemma forms and the source sentence. Following Bergmanis and Pinnis (2021), we draw the proportion of words selected in each target sentence randomly from the uniform distribution over (0, 0.4]. For training, we initialize its encoder parameters using the NAR baseline encoder and train it using the Adam optimizer with a batch size of 32k tokens for a maximum of 200k steps.
Evaluation. We evaluate translation quality using sacreBLEU (Post, 2018). To evaluate how well the translation preferences are incorporated in the translation outputs, we measure lemma usage rate by first lemmatising the translation output and then computing the percentage of lemma terms that appear in the lemmatised output. To evaluate whether the terms are inflected correctly, we measure term usage accuracy by matching each lemma constraint with its inflected form in the reference and computing the percentage of reference inflected terms that appear in the translation output.
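The two constraint-usage metrics can be sketched with a single helper, since both count how many terms occur in a tokenised output. The sketch assumes whitespace-tokenised outputs and exact, contiguous token matching; the function names are hypothetical, and lemmatisation is assumed to have been done beforehand by an external tool.

```python
from typing import List, Sequence


def contains(tokens: Sequence[str], term: Sequence[str]) -> bool:
    """True if `term` occurs in `tokens` as a contiguous token sequence."""
    n = len(term)
    if n == 0:
        return False
    return any(list(tokens[i:i + n]) == list(term) for i in range(len(tokens) - n + 1))


def usage_rate(outputs: List[List[str]], terms: List[List[str]]) -> float:
    """Percentage of terms found in the corresponding output sentence.

    Lemma usage rate: pass lemmatised outputs and the lemma constraints.
    Term usage accuracy: pass surface outputs and the reference inflected forms.
    """
    total = hits = 0
    for tokens, sentence_terms in zip(outputs, terms):
        for term in sentence_terms:
            total += 1
            hits += int(contains(tokens, term.split()))
    return 100.0 * hits / total if total else 0.0
```

An output that copies the lemma "karilionas" unchanged would score on lemma usage but not on term usage accuracy when the reference uses "karilionu", which is exactly the distinction the two metrics are meant to capture.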

Results and Discussion
Intrinsic Inflection Accuracy. To evaluate the quality of the inflection modules, we first compare the inflection accuracy of the neural-based and rule-based inflection modules against the term usage accuracy of the TLA model. The rule-based inflection module achieves higher inflection accuracy than the neural-based module on both test suites: the neural-based module obtains 81.2% accuracy on the En-De health set and 15.4% accuracy on the En-Lt news set, while the rule-based module achieves 87.6% accuracy on En-De and 77.4% accuracy on En-Lt. The rule-based module achieves accuracy close to that of TLA on En-De (89.2% term usage accuracy) and higher accuracy on En-Lt (67.9% term usage accuracy).
To investigate why the neural-based inflection module underperforms the rule-based one, we examine how the training and validation perplexity change over the number of training epochs (see Appendix). For both language pairs, the validation perplexity stops decreasing after a few training epochs (10 epochs for En-De and 20 epochs for En-Lt) while the training perplexity decreases very slowly. The final training perplexity remains at around 5.1 on En-De and 5.7 on En-Lt, which is high considering the number of possible inflection forms given a German or Lithuanian lemma. This indicates that the neural-based module does not learn generalizable inflection rules from the data effectively.
Table 3: BLEU, lemma, and term usage rates on the En-De health and En-Lt news test suites. For lemma and term usage, we report scores on all constraints (All), constraints that require no inflection (No Inf), and constraints that require inflection (Inf). We boldface the highest scores and their ties based on the paired bootstrap significance test (Clark et al., 2011) with p < 0.05.
Table 3 shows the impact of rule-based and neural-based inflection modules on top of a range of AR and NAR baselines. NAR baselines without constraints achieve BLEU competitive with the AR baseline on En-Lt and slightly lower BLEU on En-De, as in Xu and Carpuat (2021). Given lemma constraints, AR with CD without inflection obtains lower term usage accuracy and lower BLEU than AR with TLA, as in Bergmanis and Pinnis (2021). Similar to AR with CD, NAR+C without inflection obtains lower term usage and close or lower BLEU than AR with TLA. Adding rule-based inflection helps all models leverage lemma constraints more accurately. On En-De, it significantly improves the term usage accuracy of AR with CD by +4.7% and of NAR+C by +5.1%. On En-Lt, it significantly improves both the lemma usage rate and term usage accuracy of AR with CD (+3.2% on lemma usage and +10.7% on term usage) and NAR+C (+3.5% on lemma usage and +11.0% on term usage). Remarkably, it also improves the term usage accuracy of En-Lt AR with TLA, which is already trained to inflect the target lemma constraints. When evaluating only on constraints that require inflection, the rule-based module yields improvements of 4.4-8.3% on TLA, 38.6-46.5% on CD, and 38.7-48.1% on NAR+C. As expected based on the inflection accuracy results, rule-based modules outperform neural-based ones across the board.
These improvements in term usage preserve or slightly improve BLEU, as can be expected since the constraints only constitute a small portion of the tokens in the translation outputs. 21 Overall, these results indicate that our proposed framework is model-agnostic and support our hypothesis that the lemma constraints can be effectively inflected based on the source context alone.

End-to-End MT Evaluation
We now compare our framework against TLA. Rule-based inflection combined with NAR+C achieves lemma and term usage rates close to those of TLA on En-De (∆ ≤ 2%), and +11.8% higher lemma usage and +7.8% higher term usage accuracy on En-Lt (both improvements are significant). On En-Lt, the largest improvements are on constraints that require inflection: +20.5% on lemma usage and +10.6% on term usage. Incorporating the constraints preserves translation quality, with no significant difference in BLEU. Overall, these results show the benefits of integrating linguistic knowledge via rule-based inflection over purely data-driven approaches. Our approach is also more adaptive, as NAR+C with rule-based inflection does not require re-training the whole NMT model to incorporate new lemma terms. Instead, new terms can be incorporated by updating the morphological dictionary used in the inflection module.
Cost Trade-offs. Implementing the rule-based inflection module for the first target language (Lithuanian) took around 6 hours (including the time for learning the grammar from Wikipedia) for a computer scientist with neither prior knowledge of the target language nor formal linguistics training. The implementation for the second language (German) took only 3 hours, since some rules are shared across languages. By contrast, the neural-based module was implemented in about 3 hours but took around 38 hours to train a single model for one language pair on 2 GeForce GTX 1080 Ti GPUs. While these numbers do not provide a controlled comparison, they highlight that the rule-based module is relatively simple to build, as it can be done for both languages in 7-15% of the time required to train the neural model.
21 The improvements in BLEU are statistically significant for NAR+C on En-De, but not for other models.
Term Frequency. We analyze where rule-based inflection helps the most by computing the term usage accuracy on terms in different frequency buckets. As shown in Figure 2, the trends differ between En-De and En-Lt. On En-De, CD + rule slightly improves over TLA on terms with frequency in [5, 100) rather than on the rare terms. One reason is that the German morphological dictionary that we use to determine the gender of a word and its inflected forms only covers around 70% of the constraint terms in the health test suite. In addition, NAR+C + rule underperforms CD + rule on some constraint terms with frequency in [30, 100). This might be a side effect of knowledge distillation, which yields frequent errors for words that are rare in the training data (Ding et al., 2021). In the En-Lt test set, 68% of the constraint terms are used in inflected forms that are unseen in the training data. As shown in the figure, both CD + rule and NAR+C + rule bring substantial improvements over TLA on terms that are unseen in the training data. This is because most Lithuanian nouns and adjectives are inflected based on a fixed set of rules; thus, even when the target lemma is unseen in the training data or morphological dictionary, it can still be inflected correctly. As a result, the rule-based inflection module can effectively incorporate linguistic knowledge in translation models and thus generalizes better to rare and unseen terms.
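The frequency analysis above amounts to grouping constraint terms by their training-data frequency and computing term usage accuracy per group. A minimal sketch follows; the bucket boundaries are hypothetical (only the ranges [5, 100) and [30, 100) are mentioned in the text), as are the function names and the record format of (training frequency, correctly used) pairs.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical bucket boundaries; frequency 0 means unseen in the training data.
BUCKET_EDGES: List[int] = [1, 5, 30, 100]


def bucket_label(train_frequency: int) -> str:
    """Return the half-open frequency interval that contains `train_frequency`."""
    idx = bisect_right(BUCKET_EDGES, train_frequency)
    low = BUCKET_EDGES[idx - 1] if idx > 0 else 0
    if idx == len(BUCKET_EDGES):
        return f"[{low}, inf)"
    return f"[{low}, {BUCKET_EDGES[idx]})"


def accuracy_by_bucket(records: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """Term usage accuracy (in percent) for each frequency bucket."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for frequency, correct in records:
        label = bucket_label(frequency)
        totals[label] += 1
        hits[label] += int(correct)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}
```

Keeping unseen terms in their own bucket makes the contrast between the two language pairs visible, since unseen inflected forms are where the rule-based fallback matters most.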

Qualitative Analysis
We examine a few randomly selected translation examples from TLA, NAR+C, and their counterparts with rule-based inflection. As shown in Table 4, TLA tends to copy constraint terms that are infrequent in the training data, and adding the rule-based inflection module helps TLA inflect the term correctly instead. In NAR+C models, the inflection module also improves the translation of the context around constraint terms, while the vanilla NAR+C model is prone to compounding errors caused by the uninflected constraints.

Conclusion
We introduced a modular framework for leveraging terminology constraints provided in lemma forms in neural machine translation. The framework is based on a novel cross-lingual inflection module that inflects the target lemma constraints given source context and an NMT model that integrates the inflected constraints in the output. We showed that our framework can be flexibly applied to different types of inflection modules, including rule-based and neural-based ones, and different NMT models, including autoregressive and non-autoregressive ones, with minimal training costs. Results on the English-German health and English-Lithuanian test suites showed that the linguistically motivated rule-based inflection module helps NMT models incorporate terminology constraints more accurately than both neural-based inflection and the existing end-to-end approach to incorporating lemma constraints. This work opens future avenues for further improving the inflection module by combining linguistic knowledge with data-driven approaches. Future work is needed to explore the strengths and weaknesses of this framework for languages with a broader range of morphological properties.

A Data Preprocessing
For preprocessing, we apply normalization, tokenization, true-casing, and BPE (Sennrich et al., 2016) with 37,000 and 24,500 merge operations for En-De and En-Lt, respectively. Table 5 shows the provenance and statistics of the preprocessed data.

B Model and Training Details
All models are based on the base Transformer (Vaswani et al., 2017) with $d_{\text{model}} = 512$, $d_{\text{hidden}} = 2048$, $n_{\text{heads}} = 8$, $n_{\text{layers}} = 6$, and $p_{\text{dropout}} = 0.3$. We tie the source and target embeddings with the output layer weights (Press and Wolf, 2017; Nguyen and Chiang, 2018). We add dropout to embeddings (0.1) and label smoothing (0.1). All models are trained with the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.0005 and effective batch sizes of 32k tokens for AR models and 64k tokens for NAR models for a maximum of 300,000 steps. 22 We select the best checkpoint based on validation perplexity. Following Xu and Carpuat (2021), we train NAR models using sequence-level knowledge distillation: we replace the reference sentences in the training data with translation outputs from the AR models. To train the neural-based inflection module, we initialize its encoder parameters using the NAR baseline encoder and train it using the Adam optimizer with a batch size of 32k tokens for a maximum of 200,000 steps. Models are trained on 2 GeForce GTX 1080 Ti GPUs. Table 6 shows the number of parameters in each model.

C Evaluation Metric
We evaluate translation quality using sacreBLEU (Post, 2018).
22 As shown in prior work, the batch sizes for training non-autoregressive models are typically larger than the AR model (Zhou et al., 2020