Machine Translation with Pre-specified Target-side Words Using a Semi-autoregressive Model

We introduce our TMU Japanese-to-English system, which employs a semi-autoregressive model, to tackle the WAT 2021 restricted translation task. In this task, we translate an input sentence with the constraint that some words, called restricted target vocabularies (RTVs), must be contained in the output sentence. To satisfy this constraint, we use a semi-autoregressive model, namely, RecoverSAT, due to its ability (known as "forced translation") to insert specified words into the output sentence. When using "forced translation," the order of inserting RTVs is a critical problem. In our system, we use GIZA++ to obtain word alignments between a source sentence and the corresponding RTVs and then sort the RTVs in the order of their corresponding words or phrases in the source sentence. Using the model with sorted-order RTVs, we succeeded in inserting all the RTVs into output sentences in more than 96% of the test sentences. Moreover, we confirmed that sorting RTVs improved the BLEU score compared with random-order RTVs.


Introduction
In this study, we tackle a machine translation task called "restricted translation." This task requires the output sentence to contain all the pre-specified restricted target vocabularies (RTVs). In other words, we are given a source sentence and a set of RTVs, and we are supposed to generate an output sentence that contains all the RTVs in the set.
Since the emergence of neural machine translation (NMT) models (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017), several studies have explored NMT systems capable of decoding translations under terminological constraints (Hasler et al., 2018; Dinu et al., 2019; Chen et al., 2020; Song et al., 2020). However, these previous studies were conducted under the condition that a bilingual dictionary is given. Moreover, these efforts are limited to autoregressive NMT systems, and scant research has been conducted on non-autoregressive or semi-autoregressive NMT systems, which have received more attention recently.
To accomplish restricted translation, where only target terminologies are given, we used a semi-autoregressive model called RecoverSAT (Ran et al., 2020), which generates a sentence as a sequence of segments. In this model, the segments are generated simultaneously, and each segment is predicted token-by-token. Ran et al. (2020) also attempted to force the model to generate a certain token at the beginning of a segment and showed that the model could generate valid sentences under the constraint. We therefore examined whether this model could be applied to generate sentences containing RTVs.
When tackling this task using this model, the insertion order of RTVs is a critical issue. To address this issue, we used GIZA++ (Och and Ney, 2003) to obtain word alignments and then identified the source positions corresponding to the RTVs. Subsequently, we inserted the RTVs in the order in which their corresponding source tokens appear. We confirmed that sorting RTVs with GIZA++ improved the BLEU (Papineni et al., 2002) score. Finally, by using this model, we succeeded in outputting all the RTVs in more than 96% of the test sentences.

System Overview

Corpus Refinement
Morishita et al. (2019) reported that the synthetic data generated by back-translation (Sennrich et al., 2016) degraded the performance in the Japanese-to-English translation setting. The reason for this phenomenon was that the ASPEC (Nakazawa et al., 2016) training sentences are ordered by sentence alignment scores, and so the sentences with lower scores are considered relatively noisy data. Therefore, Morishita et al. (2019) attempted to generate synthetic data using forward-translation instead of standard back-translation and confirmed that forward-translation improved the performance of the Japanese-to-English translation setting.
Following Morishita et al. (2019), we used forward-translation to refine the latter half of the ASPEC training data. In the same manner as their method, we first trained a Japanese-to-English translation model on the first 1.5M sentences of the ASPEC training data. Subsequently, we used the trained model to translate the latter 1.5M Japanese sentences of the ASPEC training data and obtained refined English sentences. Finally, we combined the first 1.5M training data and the refined 1.5M training data and trained a Japanese-to-English translation model.
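The refinement procedure above can be sketched roughly as follows. This is a minimal illustration, not the actual training pipeline: `refine_with_forward_translation` and the `train` callback are hypothetical names, and `train` stands in for whatever routine trains a Japanese-to-English model and returns a translation function.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (Japanese source, English target)


def refine_with_forward_translation(
    pairs: List[Pair],
    split: int,
    train: Callable[[List[Pair]], Callable[[str], str]],
) -> List[Pair]:
    """Replace the targets of pairs[split:] with forward translations.

    `train` is a hypothetical stand-in: it takes parallel data and
    returns a function that translates one source sentence.
    """
    clean, noisy = pairs[:split], pairs[split:]
    # Train only on the first (higher alignment score) part of the corpus.
    translate = train(clean)
    # Keep the noisy sources but regenerate their English side.
    refined = [(src, translate(src)) for src, _ in noisy]
    return clean + refined
```

In the setting described above, `split` would correspond to the first 1.5M sentence pairs, and the returned list would be used to train the final model.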

RecoverSAT
RecoverSAT (Ran et al., 2020) is a semi-autoregressive model that performs generation autoregressively at the local level (within a segment) and non-autoregressively at the global level (across segments). At each decoding step, the model generates one token in each segment, attending not only to all the previous tokens in that segment but also to those in all the other segments. The model continues decoding in each segment until either a special token, EOS or DEL, is generated or the number of generated tokens reaches the maximum segment length. The final translation is a concatenation of all the segments except those that end with DEL.
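The final concatenation step can be illustrated with a small sketch. The token strings `<eos>` and `<del>` and the function name `assemble` are assumptions made for this example; the actual special-token names in RecoverSAT may differ.

```python
from typing import List

# Hypothetical special-token strings used only in this sketch.
EOS, DEL = "<eos>", "<del>"


def assemble(segments: List[List[str]]) -> List[str]:
    """Concatenate decoded segments, dropping those that end with DEL."""
    output: List[str] = []
    for seg in segments:
        if seg and seg[-1] == DEL:
            continue  # the model chose to delete this whole segment
        if seg and seg[-1] == EOS:
            seg = seg[:-1]  # strip the end-of-segment marker
        # Segments that hit the maximum length carry no marker and are kept.
        output.extend(seg)
    return output
```

A segment that ends with DEL is discarded entirely, including any tokens generated before the DEL token.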
RecoverSAT is also known for its capability to generate a translation under a word constraint (Ran et al., 2020), which is called the "forced translation" approach. In this approach, the model generates the constraint word (or phrase) at the beginning of an arbitrary segment. Once the constraint word (or phrase) has been generated, the model predicts the remainder of the segment in a semi-autoregressive manner.
In contrast to the original "forced translation," which only takes one constrained word (or phrase), we are required to place multiple RTVs in a translation. To bridge this gap, we place the i-th RTV at the P_i-th segment, where P_i is determined by N_S, the number of segments, and N_V, the number of RTVs. When there are more RTV phrases than segments during inference, we cut off phrases in the RTVs from the tail to fit the available segments.
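A minimal sketch of assigning multiple RTVs to segments is given below. The spacing rule used here, floor(i * N_S / N_V) with a 0-based index i, is an assumption chosen for illustration and may differ from the rule the system actually uses; only the truncation from the tail follows the description above. The function name `place_rtvs` is hypothetical.

```python
from typing import Dict, List


def place_rtvs(rtvs: List[str], num_segments: int) -> Dict[int, str]:
    """Map segment index -> RTV forced at the start of that segment.

    ASSUMPTION: RTVs are spread evenly with floor(i * N_S / N_V);
    this spacing rule is illustrative, not the system's confirmed formula.
    """
    # Cut off RTVs from the tail when there are more RTVs than segments.
    kept = rtvs[:num_segments]
    n_v = len(kept)
    if n_v == 0:
        return {}
    # With n_v <= num_segments, the indices below are strictly increasing,
    # so no two RTVs are assigned to the same segment.
    return {(i * num_segments) // n_v: rtv for i, rtv in enumerate(kept)}
```

Under this assumed rule, two RTVs with 10 segments would be forced at segments 0 and 5, leaving the other segments free.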

Sorting RTVs Using Source Alignment
RecoverSAT outputs RTVs in the order in which they are inserted, so the insertion order is important for accurate translation. We determined the order of the RTVs under the assumption that it correlates with the order of the aligned words in the input sentence. We used GIZA++ to align each RTV with a word in the input sentence and sorted the RTVs in the order of their corresponding input words. When the RTV was a phrase, we first obtained the source word that was most strongly aligned with each word in the RTV and then selected the source word with the highest alignment score as the aligned word for the entire RTV. If there was a tie, the first aligned word in the input sentence was selected as the corresponding word.
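The sorting rule described above can be sketched as follows. Parsing GIZA++ output is not shown; the sketch assumes that, for every word of every RTV, the best-aligned source position and its score are already available. The names `anchor_position` and `sort_rtvs` and the `(position, score)` representation are hypothetical.

```python
from typing import List, Sequence, Tuple

# (source position, alignment score) for one word of an RTV.
# ASSUMPTION: these pairs were extracted from the aligner beforehand.
Alignment = Tuple[int, float]


def anchor_position(word_alignments: Sequence[Alignment]) -> int:
    """Pick the source position that represents a whole RTV.

    The highest-scoring aligned source word wins; on a tie, the one
    that appears first in the source sentence is chosen.
    """
    best_score = max(score for _, score in word_alignments)
    return min(pos for pos, score in word_alignments if score == best_score)


def sort_rtvs(rtvs: List[str], alignments: List[Sequence[Alignment]]) -> List[str]:
    """Order RTVs by the source position of their aligned word."""
    anchors = [anchor_position(a) for a in alignments]
    # sorted() is stable, so RTVs sharing an anchor keep their given order.
    order = sorted(range(len(rtvs)), key=lambda i: anchors[i])
    return [rtvs[i] for i in order]
```

For example, an RTV anchored at source position 1 is inserted before an RTV anchored at source position 5, regardless of the order in which the RTVs were given.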

Dataset
We used the ASPEC (Nakazawa et al., 2016) dataset for Japanese-to-English translation. This dataset contains 3M sentences as training data, 1,790 sentences as validation data, and 1,812 sentences as test data. As explained in Section 2.1, we refined the latter half of the training data using forward-translation.
We used SentencePiece (Kudo and Richardson, 2018) to tokenize the training data for both the source and target sentences, where the vocabulary size was set to 4K. Note that we used SentencePiece models obtained from the first 1.5M training data through all the experiments. When determining the insertion order of RTVs using GIZA++, we used MeCab with IPADIC to tokenize Japanese sentences before computing the alignment.

Evaluation
We evaluated system outputs using the following two distinct metrics.
BLEU score. The BLEU score is a metric based on the n-gram matching rate between a hypothesis and its reference. We calculated it using multi-bleu.perl in the Moses toolkit (Koehn et al., 2007).
Consistency score. The consistency score is the ratio of translations that satisfy the exact match of all the given constraints over the entire test corpus. The exact match is determined as follows. We simply lowercased hypotheses and constraints and then judged character-level sequence matching (including whitespaces) for each constraint.
For the final score, we calculated the BLEU score using only the translations that exactly matched their RTVs. In other words, first, we calculated the exact match, and then, we replaced the translations that did not satisfy the constraint with an empty string. Subsequently, we calculated the BLEU score with the modified translations.
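The consistency score and the filtering step that precedes the final BLEU computation can be sketched as below. This is an illustration of the description above rather than the official scoring script; the function names are hypothetical, and the BLEU computation itself (multi-bleu.perl) is not reproduced.

```python
from typing import List, Tuple


def satisfies(hypothesis: str, constraints: List[str]) -> bool:
    """Exact match: every lowercased constraint must occur in the
    lowercased hypothesis as a character sequence (whitespace included)."""
    hyp = hypothesis.lower()
    return all(c.lower() in hyp for c in constraints)


def consistency_and_filtered(
    hypotheses: List[str], constraint_sets: List[List[str]]
) -> Tuple[float, List[str]]:
    """Return the consistency score and the hypotheses to score with BLEU.

    Hypotheses that miss any constraint are replaced with an empty string.
    """
    flags = [satisfies(h, cs) for h, cs in zip(hypotheses, constraint_sets)]
    score = sum(flags) / len(flags) if flags else 0.0
    filtered = [h if ok else "" for h, ok in zip(hypotheses, flags)]
    return score, filtered
```

The `filtered` list would then be passed to the BLEU script in place of the raw system output to obtain the final score.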

Model
Transformer. We used "Transformer (base)" (Vaswani et al., 2017) both for forward-translation and as a baseline model. The hyperparameter settings were the same as described in Vaswani et al. (2017).
In the baseline model, we appended the RTVs to the tail of the output sentence without sorting.
RecoverSAT. We used the encoder of the Transformer to initialize the encoder of RecoverSAT and shared the parameters of the embedding layers and the pre-softmax linear layer in the same way as Ran et al. (2020). We adopted the same model and hyperparameters that were used in the previous study (Ran et al., 2020), where d_model = 512, d_hidden = 512, n_layer = 6, and n_head = 8. However, we did not share the source and target vocabularies.
Moreover, we changed the number of segments from the original paper (i.e., 10) because some examples had more than 10 (up to 14) RTVs in the test data. We also expanded the length of a segment so that all the tokens of an RTV could be inserted when the RTV had more tokens than allowed by default. We examined four RecoverSAT models with different numbers of segments: 10 is the default value in Ran et al. (2020), and 14 is the maximum number of RTVs among the development data. The models with 21 and 29 segments have more free segments than the other two models; these free segments are intended to serve as lubricating segments that improve the overall output.

Table 1: Results of the official score using RecoverSAT with 14 segments and forced translation with sorted order.
Model        BLEU    RIBES      AMFM
RecoverSAT   25.29   0.653597   0.612290

Figure 1: Results of our experiments using RecoverSAT. The solid line represents the BLEU score, and the dotted line represents the consistency score. The dot marker represents RecoverSAT without RTVs. The triangle marker represents forced translation without sorting RTVs. The square marker represents forced translation with sorted order. The cross marker represents forced translation with oracle order.

Table 1 presents the official BLEU, RIBES (Isozaki et al., 2010), and AMFM (Banchs et al., 2015) scores, calculated on the evaluation server, for the model in which the number of segments is 14. As shown in Table 1, the BLEU, RIBES, and AMFM scores were 25.29, 0.653597, and 0.612290 points, respectively. Table 2 presents the scores obtained in our evaluation. Moreover, Figure 1 shows the BLEU and consistency scores for different numbers of segments {10, 14, 21, 29}.

Our Evaluation
BLEU score. Figure 1 shows that the translation accuracy decreases as the number of segments increases, similar to the previous study (Ran et al., 2020). This may be because, as the number of segments increases, each segment becomes shorter, the model predicts the target tokens more independently, and the model thus becomes closer to a non-autoregressive model. Table 2 shows that sorting the RTVs using GIZA++ improves the BLEU score. However, there is still a significant gap in the scores compared with those obtained using the oracle order. This is likely because word order differs between Japanese and English, so the source-side order of the aligned words does not always match the order of the RTVs in the reference.

Table 2: Results of the experiments in our evaluation. The number of segments of RecoverSAT is 10. The consistency score is the ratio of sentences satisfying the exact match of the given constraints. The final score is the constraint-aware BLEU score. "random order": we insert RTVs without sorting. "sorted order": we insert RTVs in the order of the corresponding source words. "oracle order": we insert RTVs in the same order as that in the reference.
Consistency score. Figure 1 shows that RecoverSAT with forced translation reliably outputs RTVs in almost all the cases. When the number of segments was 10, we could not insert all the RTVs in some test sentences with more than 10 RTVs (as mentioned in Section 3.3, the maximum number of RTVs in the test set was 14). On the other hand, when the number of segments was 14 or more, it was expected that all the RTVs could be inserted into all the test sentences. However, some output sentences did not contain all the RTVs, even if the number of segments was 14 or more. This result indicates that the model generates a special token, DEL, to delete segments beginning with the RTVs.
The final BLEU score of the model with 10 segments, which occasionally gives up generating some RTVs, was the highest. This is because it is rare to have more than 10 RTVs for a single sentence: only 14 out of 1,812 (0.8%) sentences in the test data were given more than 10 RTVs. Additionally, we confirmed that the insertion of RTVs was effective in improving not only the consistency score but also the BLEU score.

Related Work
Several NMT approaches with terminology constraints have been studied previously (Hasler et al., 2018; Alkhouli et al., 2018; Dinu et al., 2019; Chen et al., 2020; Song et al., 2020). For example, Song et al. (2020) proposed a dedicated head in a multi-head Transformer architecture to learn explicit word alignment and use it to guide the constrained decoding process. When the source-aligned word matches a dictionary entry, the model outputs the corresponding target word. However, these models are not applicable to the "restricted translation" task because we can only access the target-side vocabularies.
In this study, we used the semi-autoregressive model RecoverSAT (Ran et al., 2020). Originally, this model was not intended to forcibly output more than one constrained word. A non-autoregressive model can decode target tokens simultaneously, resulting in faster decoding. However, because it does not use the dependency between the output words, its output suffers from the multi-modality problem, which causes repeated or missing tokens (Gu et al., 2018; Ran et al., 2020). Thus, Ran et al. (2020) proposed RecoverSAT to alleviate this problem. Their model could maintain the accuracy of the autoregressive model while achieving a faster processing speed. They also mentioned that the model becomes closer to a non-autoregressive model as the number of segments increases. In other words, when the number of segments increases, the decoding process is faster, but the accuracy is lower. Moreover, they attempted to force the model to generate a pre-specified token at the beginning of a segment and showed that the model could avoid repetitive output and translate properly.

Conclusions
We introduced a semi-autoregressive approach to tackle the restricted translation task. In our experiments, we showed that RecoverSAT could output almost all the RTVs. Additionally, we used source sentence alignment to determine the insertion order and observed that it improved the BLEU score. Moreover, the importance of the order of the RTVs was confirmed by the fact that the score was considerably improved by inserting RTVs in the order in which they appear in the reference translations. However, there is still room for improvement in determining the insertion order. In future work, it will be necessary to investigate how to determine the best order in which to insert RTVs.