NHK’s Lexically-Constrained Neural Machine Translation at WAT 2021

This paper describes the system of our team (NHK) for the WAT 2021 Japanese-English restricted machine translation task. The aim of this task is to improve translation quality while maintaining consistent terminology in scientific paper translation. Its unique feature is that some words of the target sentence are given in addition to the source sentence. We use lexically-constrained neural machine translation (NMT), which concatenates the source sentence and the constrained words with a special token and inputs them into the NMT encoder. The key to successful lexically-constrained NMT is how constraints are extracted from the target sentences of the training data. We propose two extraction methods: the proper-noun constraint and the mistranslated-word constraint. These two methods consider the importance of words and the fallibility of NMT, respectively. The evaluation results demonstrate the effectiveness of our lexical-constraint method.


Introduction
Our team (NHK) participated in the restricted machine translation task¹ using the Japanese-English dataset of the Asian scientific paper excerpt corpus (ASPEC-JE) (Nakazawa et al., 2016) at WAT 2021 (Nakazawa et al., 2021). The aim of this task is to improve translation quality while preserving consistent terminology when translating scientific papers that include technical terms and proper nouns. For each source sentence, a list of target words that must appear in the target sentence is given. Figure 1 shows an overview of this task. There are two evaluation criteria: translation accuracy, measured by the bilingual evaluation understudy (BLEU) score (Papineni et al., 2002), and the consistency score, which is the ratio of sentences that contain an exact match of all the given constraints. The final ranking is determined by a score that combines both: BLEU calculated with only the exact-match sentences.
¹ https://sites.google.com/view/restricted-translation-task/
Figure 1: Overview of the restricted translation task (Japanese→English). The output sentence is required to contain all the target words in each target-vocabulary list. Example sentence shown in the figure: "This is a feedback circuit shifting resonance frequency by change of input signal phase, which can detect change of magnetic features of an object present at a center of two coils on high sensitivity and resolution."

Among the related methods (Chen et al., 2020a; Song et al., 2019; Post and Vilar, 2018; Hokamp and Liu, 2017), we use the lexical-constraint method of Chen et al. (2020a) because it does not incur the higher computational cost of the methods that use the grid beam search (GBS) decoding algorithm (Hokamp and Liu, 2017; Post and Vilar, 2018). This method concatenates a source sentence and the constrained words with a special token and inputs them into the encoder of the neural machine translation (NMT) model. In addition to a lower computational cost than GBS decoding, this method has two other merits: it requires neither modifying the architecture of the NMT system nor preparing any word alignment data. When applying this method to this task, one of the main problems is how to extract constraints from the training data, since constrained word lists are provided to participants only for the dev, devtest, and test sets.
In this paper, we propose extracting constraints from target sentences on the basis of proper-noun and mistranslated-word constraints, which consider the importance of words and the fallibility of NMT, respectively. The former constraint is a list of proper nouns extracted with named-entity recognition. The latter constraint is a list of words that vanilla NMT mistranslates or under-translates relative to the target sentence. We conducted experiments to evaluate the NMT using the proposed method and found that it outperformed a baseline lexical-constraint method.

Official Dataset
The main dataset of the restricted translation task consists of the Japanese-English paper abstract corpus (ASPEC-JE) and the target vocabulary lists used as constraints. In addition to the main dataset, participants can use any other resources as long as they report the details. The ASPEC-JE dataset consists of training, dev, devtest, and test data. The training data contains 3.0 million bilingual pairs provided with similarity scores automatically calculated by DP matching (Utiyama and Isahara, 2007). The target vocabulary lists for restricted translation are attached to the dev, devtest, and test data dedicated to this task. Participants are not told in detail how the constraints were selected. Table 1 shows the statistics of each dataset.

Official Evaluation
In this task, four distinct metrics are calculated: BLEU, RIBES (Isozaki et al., 2010), AMFM (Banchs et al., 2015), and consistency scores. The BLEU, RIBES, and AMFM scores are calculated in accordance with the WAT convention. The consistency score is the ratio of the number of sentences satisfying the exact match of given constrained words over the whole test corpus. The final score is calculated from both the BLEU and consistency scores by the WAT 2021 organizers as follows:
1. Check whether the translation satisfies the given constraints or not.
2. If the translation does not satisfy the constraint, replace the translation with an empty string.
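The two steps above can be sketched in a few lines. This is a minimal illustration rather than the organizers' script: the function names are hypothetical, and "exact match" is assumed to mean a plain substring test on the detokenized output. Blanking a sentence means it cannot contribute to BLEU, which is how the combined score is described in the Introduction.

```python
from typing import List, Tuple


def satisfies_constraints(translation: str, constraints: List[str]) -> bool:
    """True if every constrained term occurs verbatim in the translation.

    Assumption: "exact match" is approximated by a plain substring test.
    """
    return all(term in translation for term in constraints)


def apply_constraint_filter(
    translations: List[str], constraint_lists: List[List[str]]
) -> Tuple[List[str], float]:
    """Blank out translations that miss a constraint (steps 1 and 2).

    Returns the filtered translations and the sentence-level consistency
    score, i.e., the fraction of sentences satisfying all their constraints.
    """
    if len(translations) != len(constraint_lists):
        raise ValueError("one constraint list is needed per translation")
    flags = [
        satisfies_constraints(t, c)
        for t, c in zip(translations, constraint_lists)
    ]
    # Step 2: a translation that violates a constraint becomes an empty string.
    filtered = [t if ok else "" for t, ok in zip(translations, flags)]
    consistency = sum(flags) / len(flags) if flags else 0.0
    return filtered, consistency
```

The filtered list would then be passed to whatever BLEU implementation is in use; that step is left out here because it depends on the evaluation toolchain.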

NMT with Lexical Constraint
Borrowing the idea of Chen et al. (2020a), we implemented a lexically-constrained NMT with encoder and decoder modules. We concatenated a source sentence and the constrained words with a special token and input them into the encoder, as illustrated in Figure 2. The key to successful lexically-constrained NMT is how constraints are extracted from a target sentence. Though the constraints are given for the dev, devtest, and test data, they are not given for the training data. In this paper, we therefore focus on how to extract constraints from the target sentences of the training data for the training phase. The simplest method of extracting a lexical constraint is randomly sampling words from the target sentence, as Chen et al. (2020a) did. Beyond the random sampling method, we propose two other directions, focusing on proper nouns and mistranslated words, to extract the constrained words automatically from the target sentence.
Figure 2: Overview of NMT using the lexical-constraint method. $x = (x_1, x_2, \ldots, x_K)$, $c = (c_1, c_2, \ldots, c_N)$, and $t = (y_1, y_2, \ldots, y_J)$ denote the source, constraint, and predicted sequences, respectively. $K$ and $J$ are the lengths of the source and target sentences, and $N$ is the number of constrained words. "|" is a special delimiter token. During the training phase, constraints are extracted from the training data by the proposed methods. During the translation phase, constraints are given by the WAT 2021 organizers.
• Proper-Noun Constraint. Though participants were not told the detailed way to select constraints, we found that the vocabulary list in the dev data includes many technical terms and proper nouns. Supposing that important words such as technical terms and proper nouns tend to be selected as constraints, we extract proper nouns on the basis of named-entity recognition.
• Mistranslated-Word Constraint. The proper-noun constraint alone is not sufficient to cover all the constraints in this task: only 21% of the given constrained words for the Japanese dev data were covered by the proper-noun constraints. To increase the number of appropriate constrained words, we extract the words that NMT mistranslates or drops as constraints. First, we trained an NMT model on the parallel training data and translated the source sentences of the training data with this model. We then picked out the words that appear in the target sentence but not in the translated sentence. Together, the proper-noun and mistranslated-word constraints covered 38% of the constraints for the dev data. The remaining 62% of the constrained words could be translated correctly without adding them as constraints.
• Both the Proper-Noun and Mistranslated-Word Constraints. This constraint is made by concatenating the proper-noun and mistranslated-word constraints and removing duplicates.
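The extraction strategies and the encoder input described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not our actual implementation: all function names are hypothetical, sentences are assumed to be already tokenized, proper nouns are assumed to come from a separate named-entity recognizer, and the exact layout of the delimiter "|" (one delimiter before each constrained word) is an assumption based on Figure 2.

```python
import random
from typing import List, Optional

DELIMITER = "|"  # special delimiter token, as in Figure 2


def random_word_constraints(
    target_tokens: List[str], n: int = 5, rng: Optional[random.Random] = None
) -> List[str]:
    """Random sampling of up to n distinct target words (simplest method)."""
    rng = rng or random.Random()
    unique = list(dict.fromkeys(target_tokens))
    return rng.sample(unique, min(n, len(unique)))


def mistranslated_word_constraints(
    target_tokens: List[str], translated_tokens: List[str]
) -> List[str]:
    """Words in the target sentence that the NMT output failed to produce."""
    produced = set(translated_tokens)
    # dict.fromkeys keeps the first occurrence and preserves target order.
    return list(dict.fromkeys(t for t in target_tokens if t not in produced))


def merge_constraints(proper_nouns: List[str], mistranslated: List[str]) -> List[str]:
    """Concatenate both constraint lists and remove duplicates."""
    return list(dict.fromkeys(list(proper_nouns) + list(mistranslated)))


def build_encoder_input(source_tokens: List[str], constraints: List[str]) -> str:
    """Append each constrained word to the source, separated by the delimiter.

    Assumption: the layout "source | c1 | c2 ..." is one plausible reading
    of Figure 2, not a confirmed specification.
    """
    parts = list(source_tokens)
    for word in constraints:
        parts += [DELIMITER, word]
    return " ".join(parts)
```

For example, `build_encoder_input(["a", "b"], ["x", "y"])` yields the string `a b | x | y`, and a sentence with no constraints is passed through unchanged.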

Data
In this paper, we used only the first 2.0 million bilingual pairs³ in the official dataset, i.e., ASPEC-JE, which have high similarity scores, for training the models. We did not use any other resources.
³ The remaining 1.0 million bilingual pairs were often noisy, as described in Neubig (2014). We found that the performance degraded when using all the data in this work.

System Setup
We used KyTea (Neubig et al., 2011) to tokenize Japanese sentences and the Moses toolkit⁴ to clean and tokenize English sentences. We then used a vocabulary of 48K tokens on the basis of joint byte-pair encoding (BPE) (Sennrich et al., 2016) for the source and target.
We used the encoder and decoder of the transformer model (Vaswani et al., 2017), which is a state-of-the-art NMT model. The encoder converts a source sentence into a sequence of continuous representations, and the decoder generates a target sentence. We implemented this system with the Sockeye 2 toolkit (Hieber et al., 2020). All models were trained within at most three days on four NVIDIA Tesla V100 GPUs with 16-GB memory in parallel. In training the model, we applied stochastic gradient descent with Adam (Kingma and Ba, 2015) as the optimizer, using a learning rate of 0.0002, multiplied by 0.7 after every 8 checkpoints. We set the batch size to 5000 tokens and the maximum sentence length to 150 BPE tokens. We applied early stopping with a patience of 32. Dropout was set to 0.1 for the encoder, decoder, attention layer, and feed-forward layer after testing 0.1, 0.3, and 0.5 on the development data. For the other hyperparameters of the models, we used the default Sockeye 2 parameters⁵.
Translation was carried out through a beam search with a beam size of 30, and we used an ensemble of 5 models with different seeds.
⁴ https://github.com/moses-smt/mosesdecoder
⁵ Sockeye 2 uses a transformer model with 6 encoder and decoder layers, 8 parallel attention heads, a model dimensionality of 512, and a feed-forward layer size of 2048 as default.
We used three types of constraints for the proposed method: the proper-noun constraint, the mistranslated-word constraint, and both, called "Proper-noun," "Mistranslated-word," and "Prop. & Mistrans.," respectively. For extracting the proper nouns from the target sentence, we used GiNZA 4.0⁶ for Japanese and the en_core_web_sm model of spaCy 2.3⁷ for English. We used at most five words from candidates sorted on the basis of term-frequency inverse document frequency (TF-IDF) scores (Chen et al., 2020b) in each constraint.
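The selection of at most five candidates by TF-IDF score can be illustrated with the sketch below. The exact TF-IDF formulation is not specified in this paper, so the smoothed variant used here (term frequency within the sentence times a smoothed inverse document frequency over the corpus) is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def document_frequencies(corpus: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Count in how many sentences each word occurs."""
    df: Counter = Counter()
    for sentence in corpus:
        df.update(set(sentence))
    return df


def top_k_by_tfidf(
    candidates: Sequence[str],
    sentence: Sequence[str],
    df: Dict[str, int],
    n_docs: int,
    k: int = 5,
) -> List[str]:
    """Keep the k candidate words with the highest TF-IDF score."""
    if not sentence:
        return []
    tf = Counter(sentence)

    def score(word: str) -> float:
        # Assumption: smoothed IDF, so unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df.get(word, 0))) + 1.0
        return (tf[word] / len(sentence)) * idf

    unique = list(dict.fromkeys(candidates))
    # sorted() is stable, so ties keep their original candidate order.
    return sorted(unique, key=score, reverse=True)[:k]
```

With this scoring, frequent function words are pushed to the end of the candidate list, so rarer content words are preferred when a sentence yields more than five candidates.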
To evaluate translation quality separately from the official evaluation, we calculated case-insensitive BLEU (Papineni et al., 2002) scores by using multi-bleu.perl⁸ and a consistency rate of words, which is the ratio of the given constrained words that appear in the output.
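The word-level consistency rate can be computed as in the following sketch. It assumes a corpus-level ratio (matched constrained words over all given constrained words) and a plain substring test for "appearing in the output"; both are assumptions for illustration, and the function name is hypothetical.

```python
from typing import List


def word_consistency_rate(
    outputs: List[str], constraint_lists: List[List[str]]
) -> float:
    """Fraction of the given constrained words that appear in the outputs."""
    total = 0
    matched = 0
    for output, constraints in zip(outputs, constraint_lists):
        total += len(constraints)
        # Assumption: a constrained word counts if it occurs as a substring.
        matched += sum(1 for term in constraints if term in output)
    return matched / total if total else 0.0
```

Unlike the official sentence-level consistency score, this rate gives partial credit to a sentence that contains some but not all of its constrained words.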

Baselines
We trained two types of baselines using the transformer model.
1. Baseline: The model trained on the parallel data (2.0 million bilingual pairs) without any constraint.
2. Random-word: The model trained on the parallel data with constraints of five words randomly extracted from the target sentence. We extracted different constraints randomly for each epoch.
⁶ https://megagonlabs.github.io/ginza/
⁷ https://spacy.io/usage/v2-3
⁸ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu-detok.perl

Experimental Results
Table 2 shows the experimental results for the Japanese↔English tasks. Compared with the Baseline method, our proposed methods improved both the consistency rates of words and the BLEU scores for the Japanese↔English tasks.
Though the models using the Random-word method improved the consistency rate compared with the Baseline, there was little or no improvement in BLEU scores. For the Japanese→English task, though the consistency rates of the Random-word and Proper-noun methods are almost the same, the Proper-noun method achieved better BLEU scores than the Random-word method, even though the average number of constrained words of the Random-word method is higher than that of the Proper-noun method. This result indicates that translation quality depends more on the way constraints are extracted than on the number of constraints.
Comparing the versions of our proposed method that use the three types of constraints, the model using the Prop. & Mistrans. method performed the best for both Japanese↔English tasks.
Comparing the proper-noun and mistranslated-word constraints, the "Mistranslated-word" method performed better for Japanese→English, whereas the "Proper-noun" method performed better for English→Japanese. In addition, there is no significant difference in the consistency rate of the mistranslated-word constraint between English→Japanese and Japanese→English. The proper-noun constraint for English→Japanese appears to be more similar to the constraints of the test data than that for Japanese→English. As for the average number of constrained words, though the Random-word method has the most constrained words, it did not perform the best for either the consistency rate or the BLEU score. The results indicate that the quality of a model using constraints relies on whether the constraints are suitable for the task.
As a whole, we found that using both the proper-noun and mistranslated-word constraints is effective for the restricted machine translation task.

Official Results
Table 3 lists the official results. For the "Prop. & Mistrans. + rule" method, we inserted each unsatisfied constrained word, i.e., a constrained word that does not appear in the output, with the following procedure:
1. Extract the unsatisfied words, which do not appear in the output, from the constrained words.
2. Calculate the Levenshtein distance between each unsatisfied word and each word in the output.
3. Replace the output word that has the smallest distance to the unsatisfied word with that unsatisfied word.
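The three steps above can be sketched as follows. This is a minimal illustration, not the exact rule used for the submission: it assumes single-token constraints and a tokenized output, and it adds two safeguards that are not stated in the procedure (positions already holding a constrained word are never overwritten, and a word is appended if no other position is available). The function names are hypothetical.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def enforce_constraints(output_tokens: List[str], constraints: List[str]) -> List[str]:
    """Force every constrained word into the output by nearest-word replacement."""
    tokens = list(output_tokens)
    required = set(constraints)
    # Step 1: constrained words that do not appear in the output.
    unsatisfied = list(dict.fromkeys(c for c in constraints if c not in tokens))
    for word in unsatisfied:
        # Assumption: never overwrite a position that already holds a constraint.
        free = [i for i, tok in enumerate(tokens) if tok not in required]
        if not free:
            tokens.append(word)  # assumption: fall back to appending
            continue
        # Steps 2 and 3: replace the output word closest to the unsatisfied word.
        best = min(free, key=lambda i: levenshtein(word, tokens[i]))
        tokens[best] = word
    return tokens
```

For instance, with the output tokens `the coil detect magnetic feature` and the constraints `detects` and `features`, the nearest words `detect` and `feature` are replaced, which typically repairs inflection or spelling variants of a term the model almost produced.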
The outputs of the "Prop. & Mistrans. + rule" method satisfy all the given constraints. The official results indicate that the proposed constraints are also effective in terms of human evaluation, since the rankings by "BLEU," "HUMAN DA," "HUMAN CA," and "Final score" are the same among the participants of this task at WAT 2021.

Conclusion
We described our proposed method using lexical constraints for the Japanese↔English restricted machine translation task with the Asian scientific paper excerpt corpus (ASPEC). We proposed a method to extract appropriate constraints for lexically-constrained neural machine translation (NMT) in this task. Our proposed method using the proper-noun and mistranslated-word constraints improved translation performance compared with the random-word constraint.
For future work, we plan to apply the proposed constraints to NMT with a grid beam search decoding algorithm (Hokamp and Liu, 2017; Post and Vilar, 2018) to compare the performance.

Kehai Chen, Rui Wang, Masao Utiyama, and Eiichiro Sumita. 2020b