Input Augmentation Improves Constrained Beam Search for Neural Machine Translation: NTT at WAT 2021

This paper describes our systems submitted to the restricted translation task at WAT 2021. In this task, systems are required to output translated sentences that contain all given word constraints. Our system combined input augmentation and constrained beam search algorithms. Through experiments, we found that this combination significantly improves translation accuracy and saves inference time while ensuring that all constraints appear in the output. For both En→Ja and Ja→En, our systems obtained the best results in both automatic and human evaluation.


Introduction
This year, we participated in the restricted translation task at WAT 2021 (Nakazawa et al., 2021), in which we were asked to control a model so that the translation output would contain specified terms. Although recent neural machine translation (NMT) models achieve excellent performance, controlling their output is still challenging. Figure 1 shows an overview of the task. Each sentence includes target words (constraints) that must be contained in the output. We believe this task reflects a critical function, especially in practical applications. For example, users may want to control the translation of technical terms or proper nouns.
Several works have tried to control NMT outputs, and these works can be divided into two categories: hard and soft methods. The hard lexically constrained method guarantees that all the target words are in the output. Current works achieve this by modifying the beam search algorithm to find a hypothesis that contains all of the target words (Hokamp and Liu, 2017; Post and Vilar, 2018). The hard method guarantees all constraints are satisfied, but its translation performance is sometimes lower than that of conventional NMT. This is because all given target words must be produced during decoding, which may disrupt the model's inference.

Constraints: geometric-optical theory, standing wave, ray coincidence
MT Output: A geometrical optics theory of stationary waves based on ray matching is developed.
Constrained MT Output: A geometric-optical theory of standing wave based on ray coincidence is developed.
Figure 1: Overview of the restricted translation task
The soft lexically constrained method, on the other hand, does not guarantee that all target words are contained in the output. These methods usually modify or augment the input of the NMT model and try to output the given target words without changing the decoding algorithm (Song et al., 2019; Chen et al., 2020). Their decoding is usually faster than that of the hard method, but some of the constraints may not be satisfied.
Our submission aims to contain all of the specified target words with high translation accuracy. To achieve this goal, we applied both input augmentation and constrained beam search algorithms. To the best of our knowledge, this is the first work that combines these two methods. Through experiments, we found that this combination achieves quite high translation performance while containing all target words in the output and saving inference time. We submitted the systems to the English-to-Japanese (En→Ja) and Japanese-to-English (Ja→En) tasks, and we were ranked first in both language pairs in terms of BLEU scores and human evaluations.

Task Definition
Suppose we have a source sentence $X = (x_1, x_2, \ldots, x_S)$ with $S$ tokens and a target sentence $Y = (y_1, y_2, \ldots, y_T)$ with $T$ tokens. In a conventional machine translation approach, the problem of translating $X$ into $Y$ can be solved by finding the target sentence that maximizes the conditional probability:

$$\hat{Y} = \operatorname*{argmax}_{Y} P(Y \mid X). \qquad (1)$$

In the restricted translation task, lists of target words are provided to represent word restrictions, and systems are required to output translations that contain all of the target words in each list. The problem of translation with word constraints can then be defined as

$$\hat{Y} = \operatorname*{argmax}_{Y :\, C \subseteq Y} P(Y \mid X, C), \qquad (2)$$

where $C = (C_1, C_2, \ldots, C_N)$ is the provided set of word constraints with $N$ phrases, given in random order, and $C \subseteq Y$ denotes that $Y$ contains every phrase in $C$.
The performance of systems in this task is evaluated with two metrics:

• Translation accuracy: BLEU (Papineni et al., 2002) is used for evaluation in this task.

• Consistency score: the percentage of sentences that correctly contain the given constraints over the entire test set.
For the final ranking, the combined score of the above metrics is calculated as follows:

1. If a translation does not contain all of the constraints (based on exact matching), replace it with an empty string.
2. Calculate BLEU scores over the modified translations.
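The scoring procedure above can be sketched in a few lines. This is an illustrative implementation assuming exact substring matching; the BLEU computation itself (e.g., via sacrebleu) is left out, and only the masking and consistency steps are shown:

```python
def contains_all(hypothesis: str, constraints: list[str]) -> bool:
    """Exact-match check: every constraint string must appear verbatim."""
    return all(c in hypothesis for c in constraints)

def consistency_score(hypotheses, constraint_lists):
    """Percentage of sentences that contain all of their constraints."""
    ok = sum(contains_all(h, c) for h, c in zip(hypotheses, constraint_lists))
    return 100.0 * ok / len(hypotheses)

def mask_for_combined_bleu(hypotheses, constraint_lists):
    """Step 1 of the official ranking: replace any translation that misses
    a constraint with an empty string; BLEU is then computed on the
    masked outputs."""
    return [h if contains_all(h, c) else ""
            for h, c in zip(hypotheses, constraint_lists)]
```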

Provided Data
In this task, we were asked to translate English/Japanese scientific papers. As the in-domain training data, the organizers provided ASPEC (Nakazawa et al., 2016), which contains three million parallel sentences. Since this corpus is ordered by sentence-alignment quality, the sentences at the end might be noisy. Following a previous work (Morishita et al., 2017), we used only the first two million sentences as parallel data.

Table 1: List of hyperparameters (checkpoints are saved every 100 steps and the last 8 checkpoints are averaged)
We treated the final one million sentences as monolingual data and created a synthetic corpus (Sennrich et al., 2016). Based on a previous analysis (Morishita et al., 2019), we forward-translated it for the Japanese-English task and back-translated it for the English-Japanese task.

Other Resources
We also trained the model with additional resources. As an additional parallel corpus, we used JParaCrawl (Morishita et al., 2020), which contains 10 million sentence pairs.
We also used the CommonCrawl data provided by the WMT 2020 news shared task (Barrault et al., 2020) as additional monolingual data. From the CommonCrawl data, we chose the ten million English and Japanese sentences most similar to the scientific domain based on language models trained on ASPEC (Moore and Lewis, 2010). We then filtered out the following noisy sentences: (1) non-English/Japanese sentences, detected with CLD2 [1]; (2) excessively long sentences (more than 250 subwords); and (3) sentences that contain out-of-vocabulary characters. After cleaning, we kept 7.9 million English and 9.2 million Japanese sentences. We then back-translated these sentences with the NMT model trained on ASPEC to make a synthetic corpus.

Table 2: Comparison of translation accuracy and consistency score for each setting on Ja→En.
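The three filtering rules can be sketched as follows. This is a simplified illustration: a crude codepoint check stands in for CLD2, and any tokenizer/vocabulary pair (here supplied by the caller) stands in for the subword model:

```python
def looks_japanese(text: str) -> bool:
    # Crude stand-in for CLD2: any kana or common CJK codepoint
    # counts as Japanese.
    return any(
        0x3040 <= ord(ch) <= 0x30FF or 0x4E00 <= ord(ch) <= 0x9FFF
        for ch in text
    )

def keep_sentence(sent, subword_tokenize, vocab, lang="ja", max_subwords=250):
    """Apply the three cleaning rules from the text to one sentence."""
    toks = subword_tokenize(sent)
    if len(toks) > max_subwords:            # rule (2): excessively long
        return False
    if any(t not in vocab for t in toks):   # rule (3): OOV pieces
        return False
    is_ja = looks_japanese(sent)            # rule (1): language check
    return is_ja if lang == "ja" else not is_ja
```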

Base Model and Hyperparameters
As a baseline system, we employed the Transformer model with the big setting (Vaswani et al., 2017). Table 1 shows the detailed settings and hyperparameters. As an NMT implementation, we used fairseq (Ott et al., 2019), and modified it in the following experiments.

Lexically Constrained Decoding
We used the lexically constrained decoding (LCD) technique (Hokamp and Liu, 2017; Post and Vilar, 2018) to incorporate constraints at decoding time, since in this task translations that do not satisfy the constraints lead to a substantial decrease in the final score. This technique is a hard lexically constrained method that uses a grid beam search algorithm and guarantees that all word constraints appear in the target sentence.
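As a rough illustration of how hard constrained decoding modifies beam search, the following toy sketch decodes against a fixed bigram score table instead of an NMT model and supports only single-token constraints; real implementations (GBS, DBA) are considerably more involved:

```python
import math

# Toy "model": log-probabilities of the next token given the last one.
# A real system would query the NMT decoder here.
LOGP = {
    "<s>": {"a": math.log(0.7), "c": math.log(0.3)},
    "a":   {"b": math.log(0.8), "c": math.log(0.2)},
    "b":   {"</s>": math.log(0.9), "c": math.log(0.1)},
    "c":   {"b": math.log(0.6), "</s>": math.log(0.4)},
}

def constrained_beam_search(constraints, beam_size=4, max_len=6):
    """Besides the model's greedy continuation, also expand each hypothesis
    by forcing the next token of any still-unmet constraint; finished
    hypotheses that miss a constraint are discarded."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_len):
        nxt = []
        for toks, score in beams:
            dist = LOGP[toks[-1]]
            cand = {max(dist, key=dist.get)}   # model's ordinary pick
            cand |= {c for c in constraints if c not in toks and c in dist}
            for t in cand:
                new = (toks + [t], score + dist[t])
                if t == "</s>":
                    if all(c in new[0] for c in constraints):
                        finished.append(new)   # complete and satisfied
                else:
                    nxt.append(new)
        beams = sorted(nxt, key=lambda h: -h[1])[:beam_size]
    best = max(finished, key=lambda h: h[1])   # assumes one hyp finishes
    return best[0][1:-1]                       # strip <s> and </s>
```

With this toy table, unconstrained decoding follows the greedy path "a b", while requiring the token "c" steers the search to the best hypothesis that contains it.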
To evaluate the effectiveness of this technique, we compared the baseline model (BASE) and the baseline with LCD (BASE+LCD). Here, we used two metrics for the consistency score: term% is the percentage of constraints that are correctly generated in the translations, and sent% is the percentage of sentences that contain all given constraints. Table 2 shows that the BASE+LCD significantly improves both term% and sent% on Ja→En. The reason why the two consistency scores of BASE+LCD are not 100% is due to the normalization on the tokenization, and this can be addressed by postprocessing ( §4.7).
However, BASE+LCD decreased the translation accuracy of the model. In preliminary experiments with the baseline models, we also found that the beam size needs to be larger than 60 to successfully generate all the constraints in this task; with smaller beams, the translations contain much repetition and the model never finishes generation before reaching the maximum output length.

[1] https://github.com/CLD2Owners/cld2

Lexical-Constraint-Aware NMT
To ease this problem in LCD, we used the Lexical-Constraint-Aware NMT (LeCA) model (Chen et al., 2020), whose input is augmented by concatenating the constraints and the source sentence. This method informs the model of the given constraints before decoding, so the model can properly decide where to output each constraint. LeCA is one of the soft lexically constrained methods, which do not guarantee that all constraints appear in the output. In combination with LCD, however, we can guarantee that the model always satisfies the constraints while keeping or improving the translation performance.
The input is constructed by concatenating the source sentence $X$ and each phrase $C_i$ in the constraints $C$ with a separator symbol $\langle\mathrm{sep}\rangle$, as follows:

$$x_1, \ldots, x_S, \langle\mathrm{sep}\rangle, C_1, \langle\mathrm{sep}\rangle, C_2, \ldots, \langle\mathrm{sep}\rangle, C_N, \langle\mathrm{eos}\rangle,$$

where $\langle\mathrm{eos}\rangle$ is the symbol indicating the end of the sentence.
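A minimal sketch of this input construction, with token strings standing in for subword ids and <sep>/<eos> as the special symbols from the text:

```python
def build_leca_input(source_tokens, constraints):
    """Concatenate the source and each constraint phrase with <sep>,
    terminated by <eos>, following the LeCA input augmentation."""
    seq = list(source_tokens)
    for phrase in constraints:    # constraints arrive in random order
        seq.append("<sep>")
        seq.extend(phrase)
    seq.append("<eos>")
    return seq
```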
To construct the input at training time, Chen et al. (2020) proposed a method that dynamically samples constraints from the reference sentence. They first sample the number of constrained words k, and then randomly sample k target words (not subwords) as constraints from the reference. Here, we sampled the number of constrained words k from 0 to 14, with p = 0.4 for k = 0 and p = 0.6/14 (≈ 0.043) for each of the other values. The high probability of sampling no constraint maintains translation performance in unconstrained settings.
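The sampling scheme can be sketched as follows. Drawing k distinct words via rng.sample is a simplification: a full implementation would sample word positions from the reference rather than distinct word types:

```python
import random

def sample_constraints(reference_words, rng=random):
    """Sample training-time constraints from the reference:
    k = 0 with p = 0.4; otherwise k is uniform over 1..14
    (p = 0.6/14 each), capped by the reference length."""
    weights = [0.4] + [0.6 / 14] * 14
    k = rng.choices(range(15), weights=weights)[0]
    k = min(k, len(reference_words))   # cannot take more words than exist
    return rng.sample(reference_words, k)
```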
To handle such a source sequence, this method modifies the input representation of the encoder to distinguish the source sentence from each constraint. The representation is composed of three types of learned embeddings: token embeddings, positional embeddings, and segment embeddings, as shown in Fig. 2. The position of each constraint starts from the maximum length of the source sentences to avoid overlapping with the source positions. We assigned different segment-embedding values to the source sentence and to each constraint and fed them to the model. This method also introduces a pointer network (Vinyals et al., 2015; See et al., 2017) that helps generate constraints by copying from the source sequence. Finally, we updated the models for 10,000 steps for Ja→En and 12,000 steps for En→Ja and set the beam size to 30 for LCD.
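The position and segment ids described above can be sketched as follows. The handling of <eos> and the exact offset are assumptions, with max_src_len standing in for the maximum source-sentence length:

```python
def leca_positions_and_segments(src_len, constraint_lens, max_src_len=256):
    """Position and segment ids for the augmented input: the source gets
    positions 0..src_len-1 and segment 0; each constraint (with its
    leading <sep>) restarts positions at max_src_len, so they never
    overlap the source positions, and gets segment 1, 2, ..."""
    positions = list(range(src_len))
    segments = [0] * src_len
    for seg_id, clen in enumerate(constraint_lens, start=1):
        span = clen + 1                  # +1 for the leading <sep> token
        positions.extend(range(max_src_len, max_src_len + span))
        segments.extend([seg_id] * span)
    return positions, segments
```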
We evaluated the effectiveness of LeCA and of LeCA with LCD (LeCA+LCD). Table 2 shows that LeCA achieved high translation accuracy and consistency scores. The inputs of LeCA and BASE+LCD are the same, but the translation accuracy of LeCA is significantly better than that of BASE+LCD. Moreover, LeCA+LCD with a small beam size improves the translation accuracy and satisfies all of the constraints. This implies that feeding both the source sentence and the constraints as the source sequence is very effective for improving performance in this task.

Preprocessing
Since the constraints sampled from the reference are given as words rather than subwords, we need to segment sentences into words. To do this, we first tokenized both the input and output sentences. For English, we applied the tokenizer scripts available in the Moses toolkit (Koehn et al., 2007) and used the Moses truecaser when the target language is English. For Japanese, we used the MeCab tokenizer (Kudo, 2006) with the mecab-ipadic-NEologd (Sato, 2015) dictionary. This dictionary contains many neologisms, which helps in handling named entities and technical terms that are included in ASPEC but cannot be tokenized correctly with the default system dictionary. We compared the LeCA performance of mecab-ipadic-NEologd with that of the default system dictionary on the En→Ja task. Table 3 shows that mecab-ipadic-NEologd significantly improved translation accuracy and consistency scores. We confirmed that mecab-ipadic-NEologd is the best option for LeCA on this task.
Then, we trained subword models using the sentencepiece implementation (Kudo and Richardson, 2018). According to an earlier work (Morishita et al., 2019), a smaller vocabulary size (e.g., 4,000) is empirically superior to the commonly used ones (e.g., 32,000). On the other hand, a larger vocabulary size is preferred for LCD to keep the number of constraint tokens small, since a large number of constraint tokens requires a large beam size for LCD and increases the inference time. In a preliminary experiment, we found that a vocabulary size of 32,000 achieved the best results, so we used a joint subword vocabulary with 32,000 tokens. For the training data, we applied the Moses clean-corpus-n script to remove sentence pairs that are either too long or too different in their lengths [2].
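A minimal sketch of this cleaning step, analogous to Moses clean-corpus-n; the thresholds below are illustrative, not the ones used for the submission:

```python
def clean_corpus(pairs, min_len=1, max_len=250, max_ratio=9.0):
    """Drop sentence pairs whose token counts are out of range or whose
    source/target length ratio is too skewed."""
    kept = []
    for src, tgt in pairs:
        s, t = len(src.split()), len(tgt.split())
        if not (min_len <= s <= max_len and min_len <= t <= max_len):
            continue                       # too long or too short
        if max(s, t) / min(s, t) > max_ratio:
            continue                       # lengths too different
        kept.append((src, tgt))
    return kept
```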

Fine-Tuning and Data Selection
The synthetic corpora (e.g., the ASPEC last 1M and CommonCrawl) contain noisy sentence pairs, and the domain of JParaCrawl differs from that of ASPEC, a scientific paper domain. We used these corpora to make the translations more fluent. The model was initially pre-trained with these corpora and the first 2M sentence pairs of ASPEC for 12,000 updates. We then fine-tuned the pre-trained model on only the first 2M sentence pairs of ASPEC for 2,000 steps. For the pre-training, we oversampled ASPEC three times to keep roughly the same number of sentences as the synthetic corpora.
We searched for an effective setting to use the training data. Table 4 shows the results. The model using only ASPEC 2M for En→Ja and the model using ASPEC 2M and forward-translated ASPEC last 1M for Ja→En achieved the highest translation accuracies. For both En→Ja and Ja→En, the models trained on ASPEC 2M after pre-training achieved comparable results to the best ones. Since these models are trained on large amounts of parallel sentence pairs, they might be expected to produce more natural output than the best ones and thus be preferred by humans. Therefore, we decided to submit these four models for human evaluation.

Ensemble
We applied a model ensemble technique to improve the translation accuracy. First, eight models were trained with different random seeds. We then computed the average scores of these models and generated hypotheses based on these scores using beam search decoding. Table 5 shows the effectiveness of ensembling models. Ensembling the eight models shows a significant improvement over the single model.
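One common way to combine the models' scores, averaging their per-step probability distributions before taking the log, can be sketched as follows (a toy single-step illustration, not the actual fairseq internals):

```python
import math

def ensemble_logprobs(model_dists):
    """Average the next-token probabilities of several models for one
    decoding step; beam search then runs over the combined log-probs."""
    vocab = model_dists[0].keys()
    return {
        tok: math.log(sum(d[tok] for d in model_dists) / len(model_dists))
        for tok in vocab
    }
```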

Post-processing
For the submission, we need to match the tokenization of the output to the reference constraints. To achieve this, we fixed the terms that did not match the constraints due to tokenization issues: for each unmatched constraint, we removed spaces in both the output and the constraint, and then replaced the constraint in the output with the reference-spaced constraint. In some cases, constraints contain out-of-vocabulary (OOV) characters, resulting in translation failure [4]. The model outputs special OOV tokens for these sentences, so we replaced them with the correct characters from the reference constraint.

[4] We found that two percent of the lines in the test set include OOV characters.

Results

Table 6 shows the automatically evaluated performance of our systems on the test set. These scores were measured on the evaluation server. The best systems improved the BLEU score by +11.93 points for En→Ja and +15.04 points for Ja→En over BASE. Our systems achieved the best BLEU score for both the En→Ja and Ja→En subtasks. Table 7 shows the official results of our systems [6]. For both En→Ja and Ja→En, our systems achieved the best scores in the final ranking. Our submissions did not drop from their unmasked BLEU scores, while those of the other participants did. This means that only our team succeeded in implementing systems whose translation outputs contained all the specified terms. Our systems also achieved the best performance in terms of human evaluation for both En→Ja and Ja→En. Notably, our scores are better than those of the reference even for Ja→En. This implies that constrained translation can yield human-parity performance when the system receives appropriate terms in the target language.

[6] Human evaluation was conducted with direct assessment (Cettolo et al., 2017; Federmann, 2018) and source-based contrastive assessment (CA) (Sakaguchi and Van Durme, 2018; Federmann, 2018).

Analysis

Figure 3 shows example translations from the baseline and from LeCA with lexically constrained decoding. Underlines in Figure 3 mark the terms that match the constraints. Evidently, the baseline model generated the same term repeatedly and failed to translate the sentence properly, even though all of the constraints were satisfied. In particular, it appears to struggle with generating the constraint "superconductivity single phase auto-transformer." One likely reason for this is that the baseline model generated a phrase quite similar to the constraint in the early phase (marked with a wavy line in Figure 3), and thus considered the constraint already translated.
In contrast, LeCA+LCD successfully translated the sentence with the constraints. We believe this is because the LeCA model correctly assigns higher scores to the constraint phrases than the baseline does, which helps it generate a sentence containing the constraints.

Source: 分路巻線のみに補助巻線を持つ超電導単相単巻変圧器を試作した。
Reference: Superconductivity single phase auto-transformer with auxiliary winding only at the shunt winding was produced experimentally.
Constraints: shunt winding, auxiliary winding, superconductivity single phase auto-transformer
LeCA+LCD: A superconductivity single phase auto-transformer with auxiliary winding only in the shunt winding was produced experimentally.
Figure 3: Example translation with the given constraints

Figure 4 shows the BLEU scores of En→Ja translation when decoding with various beam sizes. As mentioned in §4.2, the beam size of BASE+LCD needs to be larger than 60 to successfully generate all of the constraints. In contrast, LeCA+LCD can generate all of the constraints and improve the translation accuracy even when the beam size is quite small. This result indicates that the output of LeCA helps LCD score the candidates and that LeCA can save inference time.

Related Work

Hokamp and Liu (2017) proposed Grid Beam Search (GBS), an extended beam search algorithm that forces the NMT model to output pre-specified lexical constraints of words or phrases. At each decoding step, a beam is allocated for each number of satisfied constraints, and the top-k candidates that contain n constraints are selected for the n-th beam. Translations that satisfy all the constraints appear in the beam corresponding to the total number of constraints. The beam size changes depending on the number of constraints for each sentence, which makes batch decoding difficult. Post and Vilar (2018) proposed Dynamic Beam Allocation (DBA), which dynamically allocates a fixed-size beam and decodes more efficiently. However, the number of constraint tokens in the experiments of these papers was much smaller than in this task, and we found these methods did not perform well here.

Song et al. (2020) and Chen et al. (2021) proposed lexically constrained decoding given explicit alignment guidance between the constraints and the source text. Alignments were induced from an additional alignment head or from attention weights (Garg et al., 2019), but these methods assume that gold alignments are given with the constraints. To apply these methods to this task, we would have to use an automatic alignment method (e.g., GIZA++ or fast_align) to obtain the alignments, and the translation accuracy might suffer from alignment errors.

Susanto et al. (2020) proposed non-autoregressive NMT for lexically constrained translation. They used the Levenshtein Transformer (Gu et al., 2019), which inserts and deletes tokens at each time step, starting from the given constraints as the initial state. However, they assumed that the order of the given constraints is the same as their order in the reference, whereas the constraints in this task are given in random order. Furthermore, this approach has not achieved translation accuracy comparable to auto-regressive approaches.
Some works augment the input sequence with constraints. Song et al. (2019) augmented the source sentence by replacing source phrases with their corresponding constraints, or by appending constraints, leveraging an SMT phrase table. Chen et al. (2020) proposed a simple yet effective augmentation method that appends the constraints after the source sentence. Although decoding is fast, the method of Song et al. (2019) relies on the quality of the SMT phrase table. Furthermore, neither work can guarantee that the translation contains all the constraints.

Conclusion
This paper described our systems submitted to the WAT 2021 restricted translation task. We submitted systems for both En→Ja and Ja→En, and both achieved the best translation accuracy as assessed by BLEU, the consistency score, and human evaluation. We also confirmed that the input augmentation method makes lexically constrained decoding more effective and, furthermore, that combining input augmentation and constrained decoding significantly improves translation accuracy.