NAIST English-to-Japanese Simultaneous Translation System for IWSLT 2021 Simultaneous Text-to-text Task

This paper describes NAIST’s system for the English-to-Japanese Simultaneous Text-to-text Translation Task in IWSLT 2021 Evaluation Campaign. Our primary submission is based on wait-k neural machine translation with sequence-level knowledge distillation to encourage literal translation.


Introduction
Automatic simultaneous translation is an attractive research field that aims to translate an input before observing its end, enabling real-time translation similar to human simultaneous interpretation. Starting from early attempts using rule-based machine translation (Matsubara and Inagaki, 1997; Ryu et al., 2006) and statistical approaches using statistical machine translation (Bangalore et al., 2012; Fujita et al., 2013), recent studies have successfully applied neural machine translation (NMT) to this task (Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019). The simultaneous translation shared task in the IWSLT evaluation campaign started in 2020 with English-to-German speech-to-text and text-to-text tasks (Ansari et al., 2020), and a new language pair, English-to-Japanese, was added in 2021 in the text-to-text task only. English-to-Japanese is much more challenging than English-to-German due to the large difference between the languages in addition to data scarcity.
We developed an automatic text-to-text simultaneous translation system for this shared task. We applied some extensions to a standard wait-k NMT in the training time: sequence-level knowledge distillation and target-side chunk shuffling. However, these techniques showed mixed results in different latency regimes on the IWSLT21 development set, so we configured the system differently for each latency regime. This paper describes the details of the system and the results on the development sets.
We also describe another attempt of ours, incremental constituent label prediction, which was not included in the primary system.

Simultaneous Neural Machine Translation with wait-k

Let X = x_1, x_2, ..., x_{|X|} be an input sequence in a source language and Y = y_1, y_2, ..., y_{|Y|} be an output sequence in a target language. Here, the input can be speech or text, but we assume text input because this paper discusses the text-to-text task. The task of simultaneous translation is to translate X into Y incrementally; in other words, each output prediction for Y is made upon partial observations of X. Suppose an output prefix Y_1^j = y_1, y_2, ..., y_j has already been predicted from a prefix X_1^i = x_1, x_2, ..., x_i of the input. When we predict the next output subsequence Y_{j+1}^{j'} = y_{j+1}, ..., y_{j'} after further partial observations X_{i+1}^{i'} = x_{i+1}, ..., x_{i'}, the prediction is made based on the following formula:

  Ŷ_{j+1}^{j'} = argmax_Ŷ P(Ŷ | X_1^{i'}, Y_1^j),

where Ŷ is a possible prediction of the subsequence. In conventional consecutive machine translation, the whole input sequence X is available at any time during the prediction of Y. This limitation on available input information is a key challenge of simultaneous translation.
Wait-k (Ma et al., 2019) delays the decoding process by k input tokens. The wait-k model translates a source token sequence X into a target token sequence Y as follows:

  y_t = argmax_y P(y | X_1^{g(t)}, Y_1^{t-1}),  where g(t) = min{k + t - 1, |X|}.

The decoder predicts each output token based on the attention over the observed portion of the input tokens. k is a hyperparameter for the fixed delay in this model; a larger k causes longer delays, while a smaller k results in worse output predictions due to the poorer context information.
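As an illustration, the wait-k read/write schedule above can be sketched as follows. This is a minimal sketch, not our fairseq-based implementation; `predict_next` is a hypothetical stand-in for the incremental decoder.

```python
def wait_k_policy(source_tokens, k, predict_next):
    """Simulate a wait-k schedule: read k source tokens first, then
    alternate WRITE/READ until the source is exhausted, and finish
    writing with the full source context.

    `predict_next(src_prefix, tgt_prefix)` is a stand-in for the
    incremental decoder; it returns the next target token, or None
    at the end of the sentence.
    """
    target = []
    read = min(k, len(source_tokens))  # initial wait of k tokens
    while True:
        token = predict_next(source_tokens[:read], target)
        if token is None:
            break
        target.append(token)
        if read < len(source_tokens):
            read += 1  # READ one more source token after each WRITE
    return target
```

With k = 2 and a four-token source, the decoder sees source prefixes of length 2, 3, 4, 4 for the four output steps, matching g(t) = min{k + t - 1, |X|}.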

Knowledge Distillation
Knowledge Distillation (KD) (Hinton et al., 2015) is a method that transfers the knowledge learned by a stronger teacher model into a weaker student model. When the teacher distribution is q(y|x; θ_T), we minimize the cross-entropy against the teacher's probability distribution instead of the reference data, as follows:

  L_KD = - Σ_{y ∈ V} q(y | x; θ_T) log p(y | x; θ),

where θ_T parameterizes the teacher distribution and V is the target vocabulary. Sequence-level Knowledge Distillation (SKD), which gives the student model the outputs of the teacher model as knowledge, propagates a wider range of knowledge to the student model and trains it to mimic the teacher (Kim and Rush, 2016),
with the loss objective as follows:

  L_SKD = - Σ_{Y ∈ T} q(Y | X; θ_T) log p(Y | X; θ),

where p(Y|X) is the sequence-level distribution and T is the space of possible target sequences. In practice, SKD can be implemented simply by training the student model on pairs (X, Ŷ), where Ŷ is the teacher model's output for the source-language portion of the training corpus.
We use SKD to reduce colloquial expressions in the spoken language corpus. Such colloquial expressions are highly language-dependent and difficult to translate by NMT, which usually generates literal translations. Here, we first train a teacher, a Transformer-based offline NMT model, on the training corpus and use it to obtain pseudo-reference translations in the target language. Then, we train a student, a Transformer-based simultaneous NMT model, on the pseudo-parallel corpus consisting of the original source-language sentences and the corresponding translations by the teacher model. The pseudo-references should consist of more literal and NMT-friendly translations, so training the student model becomes easier than with the original parallel corpus. Since a simultaneous translation model has to be trained with less context information than an offline translation model, SKD should be helpful. This is motivated by the recent success of non-autoregressive NMT using SKD (Gu et al., 2018).
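The distillation step above amounts to swapping the references for teacher outputs. A schematic sketch (the `teacher_translate` callable is a hypothetical placeholder, not the fairseq API):

```python
def build_skd_corpus(teacher_translate, source_sentences):
    """Sequence-level knowledge distillation: pair each source
    sentence with the teacher's own (e.g., beam-search) output
    instead of the original reference, so the student learns the
    teacher's more literal, NMT-friendly translation style."""
    return [(src, teacher_translate(src)) for src in source_sentences]
```

The student wait-k model is then trained on the resulting pseudo-parallel corpus with the usual cross-entropy loss.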

Target-side chunk shuffling
Chunk shuffling is a kind of data augmentation that reorders Japanese chunks (called bunsetsu). Our motivation is to encourage monotonic translation by exploiting a characteristic of Japanese as an agglutinative language, in which the order of bunsetsu chunks is not strict. Given a target-language sequence T = t_1, ..., t_{|T|} in the training set, we apply greedy left-to-right chunking to it; T is divided into a chunk sequence T̃ = C_1, ..., C_Q, in which each chunk consists of k (i.e., the delay hyperparameter in wait-k) tokens, C_q = t_{q,1}, ..., t_{q,k}. Note that the last chunk C_Q may be shorter than k depending on the length of T. Then, we choose to shuffle or keep the chunks in T̃ with a probability p_r, defined as a hyperparameter. We tried only random shuffling with the fixed chunk size of k this time; more linguistically motivated chunk reordering would be worth trying as future work.
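The augmentation above can be sketched in a few lines (a minimal sketch of the procedure as described; the actual preprocessing script may differ):

```python
import random

def chunk_shuffle(tokens, k, p_r, rng=random):
    """Target-side chunk shuffling: split the target token sequence
    into consecutive chunks of k tokens (k = the wait-k delay), then
    shuffle the chunk order with probability p_r; otherwise keep the
    original order. The last chunk may be shorter than k."""
    chunks = [tokens[i:i + k] for i in range(0, len(tokens), k)]
    if rng.random() < p_r:
        rng.shuffle(chunks)
    return [token for chunk in chunks for token in chunk]
```

Tokens within a chunk stay in order; only whole chunks move, which keeps each k-token span intact for the wait-k decoder.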
Primary system

Implementation

Our system implementation was based on the official baseline using fairseq (Ott et al., 2019) and SimulEval (Ma et al., 2020).

Setup
Data All of the models were based on the Transformer, trained on 17.9 million English-Japanese parallel sentences from the WMT20 news task and fine-tuned on 223 thousand parallel sentences from IWSLT 2017. During fine-tuning, we examined the effectiveness of knowledge distillation and chunk shuffling with several hyperparameter settings and report the results of the models that achieved the highest BLEU on the IWSLT 2021 development set. The text was preprocessed by Byte Pair Encoding (BPE) (Sennrich et al., 2016) for subword segmentation. The vocabulary was shared between English and Japanese, and its size was 16,000.
Model The hyperparameters of the model mostly followed the Transformer Base settings (Vaswani et al., 2017). The encoder and decoder were each composed of 6 layers. We set the word embedding, hidden state, and feed-forward dimensions to 512, 512, and 2,048, respectively. We applied dropout to each sub-layer with a probability of 0.1. The number of attention heads was eight for both the encoder and the decoder. The model was optimized using Adam with an initial learning rate of 0.0007, β_1 = 0.9, and β_2 = 0.98, following Vaswani et al. (2017).
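For reference, these Transformer settings are typically paired with an inverse-square-root learning-rate schedule with linear warmup. A sketch of that schedule follows; the warmup length of 4,000 updates is an assumption taken from Vaswani et al. (2017), not stated in this paper:

```python
import math

def inverse_sqrt_lr(step, peak_lr=7e-4, warmup=4000):
    """Inverse-square-root schedule with linear warmup, as commonly
    used for Transformer Base training: ramp linearly to peak_lr over
    `warmup` updates, then decay proportionally to 1/sqrt(step)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * math.sqrt(warmup / step)
```

At step 4,000 the rate peaks at 0.0007 and then decays, halving by step 16,000.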
Evaluation To evaluate the performance, we calculated BLEU and Average Lagging (AL) (Ma et al., 2019) with SimulEval on the IWSLT 2021 development set.

Table 1 shows an excerpt of the system results for the full-sentence topline (offline), the wait-k baselines (wait-k), and our extensions: SKD (+ SKD) and chunk shuffling (+ CShuf). We tried different latency hyperparameter values k = {10, 12, 14, ..., 32} for comparison. Figure 1 plots the BLEU-AL results for wait-k and wait-k + SKD. It shows that the use of SKD gave some improvements in low-latency settings with k = {10, 12, 14}, but the results with larger k were mixed. These results support our assumption about the difficulty of translating into colloquial expressions discussed in Section 3.

We also tried chunk shuffling with different hyperparameter values p_r = {0, 0.01, 0.02, 0.03}; higher values of p_r resulted in much worse results and are not included in this paper. Table 2 shows the results using the target-side chunk shuffling. Here, the chunk shuffling results are only shown for wait-10; larger latency hyperparameters k did not show remarkable differences from the baseline. Chunk shuffling with p_r = 0.02 resulted in the best BLEU and outperformed the baseline, but the other values, 0.01 and 0.03, did not work. These differences may be due to the output length, shown in the len_hyp column of Table 2; the output length became much shorter than the baseline with chunk shuffling at p_r = 0.02, while p_r = 0.01 and p_r = 0.03 increased the output length. Table 3 shows translation examples by the baseline and chunk shuffling (p_r = 0.02). Here, the baseline translation results lack sentence-final expressions such as です (desu), ます (masu), and ですよね (desuyone). The effect of chunk shuffling was not straightforward, but a certain value, p_r = 0.02, worked in our experiment.
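The AL metric used above can be computed directly from the read/write trace. The sketch below follows the definition in Ma et al. (2019) and is not SimulEval's implementation:

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (AL; Ma et al., 2019).

    g[t-1] is the number of source tokens read before writing target
    token t. AL averages how far the system lags behind an ideal
    wait-0 translator, up to the first target position whose
    prediction used the full source:
        AL = (1/tau) * sum_{t=1..tau} (g(t) - (t-1)/gamma),
    where gamma = |Y|/|X| and tau = min{t : g(t) = |X|}."""
    gamma = tgt_len / src_len
    # tau: first target index (1-based) where the full source was read
    tau = next(t for t, g_t in enumerate(g, start=1) if g_t >= src_len)
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau
```

For a wait-k trace with equal source and target lengths (gamma = 1), AL reduces to exactly k, which matches the intuition of a fixed k-token delay.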

Results on the development set
The results above suggest that the target-side chunk shuffling may work as a perturbation, and we need further investigation.

[Table 3 example source sentences: "I see other companies that say, 'I'll win the next innovation cycle, whatever it takes.'" / "She's a musical instrument maker, and she does a lot of wood carving for a living." / "Humans are very good at considering what might go wrong if we try something new, say, ask for a raise."]

Table 4 shows the BLEU and AL results on the test set. The system in the medium latency regime (wait-20 + SKD) worked relatively well; it achieved a BLEU result comparable to wait-30. However, our results were worse than those of the other teams by around two points in BLEU in all latency regimes.

Another attempt: Incremental Next Constituent Label Prediction
We tried another technique, described below, in the shared task, but it was not included in our primary submission because it did not outperform the baseline. We describe it here for further investigation in the future. For simultaneous machine translation, deciding how long to wait for input before translating is important, and predicting what kind of phrase comes next is useful information for determining that timing. In this study, we tried incremental Next Constituent Label Prediction (NCLP).
In SMT-based simultaneous translation, Oda et al. (2015) proposed a method to predict unseen syntactic constituents to determine when to start translating a partially observed input, using a multi-label classifier based on linear SVMs (Fan et al., 2008). Motivated by this study, we used a neural network-based classifier using BERT (Devlin et al., 2019) for NCLP. The problem of NCLP is defined as predicting the label of the syntactic constituent coming next after a given word subsequence in the pre-order tree traversal. In this work, we used 1-lookahead prediction, so the problem was relaxed into predicting the label of a syntactic constituent given its preceding words and the first word composing it. A predicted constituent label was inserted at the corresponding position in the input word sequence, immediately after its preceding word, which doubled the length of the input sequences. For subword-based NMT, we applied BPE only to the words in the input sequences and put dummy labels after all subwords other than end-of-word ones, so that the input keeps alternating between tokens and labels.

Table 5: Number of NCLP instances extracted from the datasets.
  train: 2,762,408   dev: 27,903   eval: 21,941
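The label-augmented input described above can be sketched as follows. This is a simplified illustration under the assumption that each position carries at most one predicted label; the dummy label `<pad>` and the function name are hypothetical:

```python
def interleave_labels(tokens, labels, dummy="<pad>"):
    """Augment a source token sequence with next-constituent labels,
    alternating token / label. labels[i] is the label predicted right
    after tokens[i]; positions without a prediction (e.g., subwords
    that do not end a word) get a dummy label. The output is exactly
    twice as long as the input."""
    assert len(tokens) == len(labels)
    out = []
    for token, label in zip(tokens, labels):
        out.append(token)
        out.append(label if label is not None else dummy)
    return out
```

Because the sequence length doubles, a wait-k decoder reading k positions of this input effectively sees only about k/2 source words, which is the problem discussed below.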
We used Huggingface Transformers (Wolf et al., 2020) for our implementation of NCLP with bert-base-uncased. We used the Penn Treebank (Marcus et al., 1993) for the NCLP training and development sets, and the NAIST-NTT TED Talk Treebank (Neubig et al., 2014) for the NCLP evaluation set. Table 5 shows the number of training, development, and evaluation instances extracted from the datasets; note that many instances can be extracted from a single parse tree. Table 6 shows the results for the 5 most frequent labels in the NCLP training data. NP and VP are important clues for the sentence structure, and their F1 scores were over 90% on the NCLP evaluation data. However, wait-k using the NCLP results as its input did not outperform the baseline wait-k, as shown in Figure 2. We can observe that NCLP-based wait-k gave smaller ALs with the same latency hyperparameter k. One possible problem of the current NCLP-based wait-k is that the input length is doubled by the additional constituent labels. Since we ran wait-k-based simultaneous NMT on such an augmented input sequence, the decoder using the NCLP-augmented input has roughly half of the information available to the decoder using the original input for the same k. This forces the decoder to perform very aggressive anticipation with limited information from the input prefix. Table 7 shows input and output examples of the baseline and NCLP; the input sentences include constituent labels. The first example shows that NCLP could translate "publication" before the verb "work", following the Japanese sentence order. The second example shows that the NCLP output is grammatically natural, while the baseline has repetitive and unnatural phrases. We observed that NCLP sentences tend to be shorter and more natural than the baseline, as in these examples; however, many sentences are less informative and miss details compared to the baseline. We will investigate more effective ways to use NCLP in future work.

Conclusion
In this paper, we described our English-to-Japanese text-to-text simultaneous translation system. We extended the baseline wait-k with sequence-level knowledge distillation to encourage literal translation and with target-side chunk shuffling to relax the output order in Japanese. These extensions achieved some improvements on the IWSLT 2021 development set in certain latency regimes.