Improving Simultaneous Translation by Incorporating Pseudo-References with Fewer Reorderings

Simultaneous translation is vastly different from full-sentence translation, in the sense that it starts translating before the source sentence ends, with only a few words' delay. However, due to the lack of large-scale, high-quality simultaneous translation datasets, most such systems are still trained on conventional full-sentence bitexts. This is far from ideal for the simultaneous scenario due to the abundance of unnecessary long-distance reorderings in those bitexts. We propose a novel method that rewrites the target side of existing full-sentence corpora into simultaneous-style translations. Experiments on Zh→En and Ja→En simultaneous translation show substantial improvements (up to +2.7 BLEU) with the addition of these generated pseudo-references.


Introduction
Simultaneous translation, which starts translating before the source sentence ends, is substantially more challenging than full-sentence translation due to partial observation of the (incrementally revealed) source sentence. Recently, it has witnessed great progress thanks to fixed-latency policies (such as wait-k) and adaptive policies (Gu et al., 2017; Arivazhagan et al., 2019).
However, all state-of-the-art simultaneous translation models are trained on conventional parallel text, which involves many unnecessary long-distance reorderings (Birch et al., 2009; Braune et al., 2012); see Fig. 1 for an example. Simultaneous translation models trained on such parallel sentences learn either to make bold hallucinations (for fixed-latency policies) or to introduce long delays (for adaptive ones). Alternatively, one may use transcribed corpora from professional simultaneous interpretation (Matsubara et al., 2002; Bendazzoli et al., 2005; Neubig et al., 2018). These data are more monotonic in word order, but they are all very small in size due to the high cost of data collection (e.g., the NAIST corpus (Neubig et al., 2018) has only 387k target words). More importantly, simultaneous interpreters tend to summarize and inevitably make many mistakes (Shimizu et al., 2014; Zheng et al., 2020) due to the high cognitive load and intense time pressure during interpretation (Camayd-Freixas, 2011).

* Equal contribution. † Currently at Columbia University.

Figure 1: Example of unnecessary reorderings in the bitext, which can force the model to anticipate aggressively, along with the ideal pseudo-references under different wait-k policies: (wait-1) "china 's west has many big mtns"; (wait-2) "the chinese west has many big mtns"; (wait-3) "western china has many big mtns"; (wait-4) "there are many big ...". Larger k improves fluency but sacrifices latency (pseudo-refs with k ≥ 4 are identical to the original reference). (mtns: mountains)
How can we combine the merits of both types of data and obtain a large-scale, more monotonic parallel corpus for simultaneous translation? We propose a simple and effective technique to generate pseudo-references with fewer reorderings; see the "Pseudo-Refs" in Fig. 1. While previous work (He et al., 2015) addresses this problem via language-specific hand-written rules, our technique can be easily adapted to any language pair without extra data or expert linguistic knowledge. Training with these generated pseudo-references reduces anticipation during training, resulting in fewer hallucinations in decoding and lower latency. We make the following contributions:
• We propose a method to generate pseudo-references which are non-anticipatory and semantics-preserving.
• We propose two metrics to quantify the anticipation rate in the pseudo-references and the hallucination rate in the hypotheses.
• Our pseudo-references lead to substantial improvements (up to +2.7 BLEU) on Zh→En and Ja→En simultaneous translation.

Preliminaries
We briefly review full-sentence neural translation and the wait-k policy in simultaneous translation.
Full-Sentence NMT uses a Seq2Seq framework (Fig. 2): an encoder processes the source sentence x = (x_1, x_2, ..., x_m) into a sequence of hidden states, and a decoder sequentially generates the target sentence y = (y_1, y_2, ..., y_n) conditioned on those hidden states and its previous predictions:

p(y \mid x; \theta_\text{full}) = \prod_{t=1}^{n} p(y_t \mid x, y_{<t}; \theta_\text{full})   (1)

The model is trained to maximize the log-likelihood of the parallel training data D:

\theta_\text{full} = \arg\max_\theta \sum_{(x, y) \in D} \log p(y \mid x; \theta)

Simultaneous Translation translates concurrently with the (growing) source sentence. The wait-k policy (Fig. 2) follows a simple, fixed schedule that commits one target word upon receiving each new source word, after an initial wait of k source words. Formally, the prediction ŷ_t of a trained wait-k model is

\hat{y}_t = \arg\max_{y_t} p(y_t \mid x_{<t+k}, \hat{y}_{<t}; \theta_\text{wait-k})   (2)

where the wait-k model is trained on full sentences with the source prefix truncated accordingly:

\theta_\text{wait-k} = \arg\max_\theta \sum_{(x, y) \in D} \sum_{t=1}^{n} \log p(y_t \mid x_{<t+k}, y_{<t}; \theta)

This way, the model learns to anticipate implicitly at test time, though not always correctly (e.g., in Fig. 2, after seeing x_1 x_2 = "中国 的" (China 's), it outputs y_1 = "there"). The decoder thus generates the target sentence ŷ while lagging k words behind the source sentence x.
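Concretely, under wait-k the t-th target word is committed after reading min(t+k-1, |x|) source words. The schedule can be sketched in a few lines (a minimal illustration with our own helper names; real systems operate on subword units):

```python
def waitk_observed(t: int, k: int, src_len: int) -> int:
    """Number of source words visible when predicting target word y_t
    under a wait-k policy (1-indexed): y_t conditions on x_{<t+k}."""
    return min(t + k - 1, src_len)

def waitk_schedule(k: int, src_len: int, tgt_len: int) -> list:
    """Interleaved READ/WRITE action sequence induced by wait-k."""
    actions, num_read = [], 0
    for t in range(1, tgt_len + 1):
        while num_read < waitk_observed(t, k, src_len):
            actions.append("READ")
            num_read += 1
        actions.append("WRITE")
    return actions
```

For example, with k = 2 and a 3-word source and target, the schedule is READ, READ, WRITE, READ, WRITE, WRITE: the final target words are written after the source is exhausted.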

Pseudo-Reference Generation
Since wait-k models are trained on conventional full-sentence bitexts, their performance is hurt by unnecessary long-distance reorderings between the source and target sentences. For example, from the training sentence pair in Fig. 2, a wait-2 model learns to output y_1 = "there" after observing x_1 x_2 = "中国 的" (china 's), which seems to induce a good anticipation ("中国 的 ..." ↔ "There ..."), but it could be a wrong hallucination in many other contexts (e.g., a source beginning with "中国 的" whose correct translation is "Chinese streets are crowded", not "There ..."). Even for adaptive policies (Gu et al., 2017; Arivazhagan et al., 2019; Zheng et al., 2019a), the model only learns a higher-latency policy (wait until x_4) by training on the example in Fig. 2. As a result, wait-k models trained this way tend to produce wild hallucinations.
To solve this problem, we propose to generate pseudo-references that are non-anticipatory under a specific simultaneous translation policy, using the method introduced in Section 3.1. In addition, we propose to use sentence-level BLEU to filter the generated pseudo-references, ensuring that they are semantics-preserving (Section 3.2).

Generating Pseudo-References with Test-time Wait-k
To generate non-anticipatory pseudo-references under a wait-k policy, we propose to use the full-sentence NMT model θ_full (Eq. 1), which is not trained to anticipate, but to decode with a wait-k policy. This combination is called test-time wait-k, which is unlikely to hallucinate since the full source content is always available during training. Although the full-sentence model θ_full only has access to the partially available source words x_{<t+k}, it can still enforce fluency because ŷ_t relies on the decoded target-side prefix ŷ_{<t} (Eq. 2). Formally, the generation of pseudo-references is:

y^*_t = \arg\max_{y_t} p(y_t \mid x_{<t+k}, y^*_{<t}; \theta_\text{full})   (3)

Figure 3: Sentence-level BLEU distributions of Pseudo-Refs using wait-k policies for Zh→En and Ja→En, respectively. The parts to the right of the vertical lines indicate the top 40% of references by BLEU in each distribution.

Fig. 1 shows the pseudo-references under different wait-k policies (k = 1..4). Note that k = 1 or 2 results in non-idiomatic translations, while larger k leads to more fluent pseudo-references, which converge to the original reference for k ≥ 4. The reason is that under each wait-k policy, each target word y_t relies only on the observed source words x_{<t+k}.

To further improve the quality of the pseudo-references generated by test-time wait-k, we select better pseudo-references using beam search. Beam search usually improves translation quality, but its application to simultaneous translation is non-trivial because output words are committed on the fly (Zheng et al., 2019b). For pseudo-reference generation, however, unlike simultaneous decoding, we can simply adopt the conventional offline beam search algorithm since the source sentence is completely known. A larger beam size generally gives better results, but makes anticipations more likely to be retained if they are correct and reasonable. To trade off quality against monotonicity, we choose beam size b = 5 in this work.
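Because the source is fully known at generation time, this step is ordinary offline beam search. A compact sketch (with `step_scores`, a one-step scorer returning next-word log-probabilities, as our own stand-in for the NMT model):

```python
def beam_search(step_scores, beam_size=5, max_len=20, eos="</s>"):
    """Plain offline beam search: the whole source is known, so we can
    expand and prune hypotheses freely before committing any word.
    step_scores(prefix) -> {next_word: log_prob}."""
    beams = [([], 0.0)]          # (target prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, logp in step_scores(prefix).items():
                if word == eos:
                    finished.append((prefix, score + logp))
                else:
                    candidates.append((prefix + [word], score + logp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    finished.extend(beams)       # fall back to unfinished beams if any remain
    return max(finished, key=lambda c: c[1])[0]
```

This is the standard full-sentence algorithm; no commitment-on-the-fly machinery is needed here, in contrast to beam search for simultaneous decoding.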

Translation Quality of Pseudo-References
We can use sentence-level BLEU to filter out low-quality pseudo-references. Fig. 3 shows the sentence-level BLEU distributions of the pseudo-references generated with different wait-k policies. As k increases, translation quality improves, since more of the source prefix can be observed during decoding. The obvious peak at BLEU = 100 for Zh→En denotes pseudo-references that are identical to the original ones; those original references are probably already non-hallucinatory or correspond to very short source sentences (e.g., shorter than k). The figure shows that even for the wait-1 policy, around 40% of pseudo-references achieve a BLEU score above 60.
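The filtering step can be sketched as follows. We re-implement a smoothed sentence-level BLEU from scratch purely for self-containedness (add-1 smoothing on n-gram precisions; real experiments would use a standard toolkit), then keep the top-scoring fraction of pseudo-references:

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU on a 0-100 scale with add-1 smoothed n-gram
    precisions and the usual brevity penalty. Illustrative only."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matched = sum((h & r).values())      # clipped n-gram matches
        total = sum(h.values())
        log_prec += math.log((matched + 1) / (total + 1))
    brevity = 1.0 if len(hyp) >= len(ref) else \
        math.exp(1.0 - len(ref) / max(len(hyp), 1))
    return 100.0 * brevity * math.exp(log_prec / max_n)

def keep_top_pseudo_refs(pairs, keep=0.4):
    """Keep the top `keep` fraction of (pseudo_ref, gold_ref) pairs,
    ranked by sentence BLEU of the pseudo-ref against the gold ref."""
    ranked = sorted(pairs, key=lambda p: sentence_bleu(p[0], p[1]),
                    reverse=True)
    return [pseudo for pseudo, _ in ranked[: max(1, int(len(ranked) * keep))]]
```

A pseudo-reference identical to its gold reference scores 100, which reproduces the peak visible in the Zh→En distribution of Fig. 3.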

Anticipation Rate of (Pseudo-)References
During the training of a simultaneous translation model, an anticipation happens when a target word is generated before the corresponding source word is encoded. To identify anticipations, we need the word alignment between the parallel sentences. A word alignment a between a source sentence x and a target sentence y is a set of source-target word-index pairs (s, t), where the s-th source word x_s aligns with the t-th target word y_t. In the example in Fig. 4, the word alignment is: a = {(1, 8), (3, 7), (4, 1), (4, 2), (5, 3), (6, 4), (7, 5)}.
Based on the word alignment a, we propose a new metric called "k-anticipation" to detect anticipations under a wait-k policy. Formally, a target word y_t is k-anticipated (A_k(t, a) = 1) if it aligns to at least one source word x_s with s ≥ t + k:

A_k(t, a) = 1 \text{ if } \exists (s, t) \in a \text{ with } s \ge t + k, \text{ and } 0 \text{ otherwise.}

We further define the k-anticipation rate (AR_k) of an (x, y, a) triple under a wait-k policy to be:

\mathit{AR}_k(x, y, a) = \frac{1}{|y|} \sum_{t=1}^{|y|} A_k(t, a)
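Given an alignment as a set of 1-indexed (s, t) pairs, AR_k is direct to compute (a minimal sketch following the definitions above):

```python
def anticipation_rate(alignment, tgt_len, k):
    """AR_k: fraction of target words y_t (1-indexed) aligned to at
    least one source word x_s with s >= t + k, i.e. words a wait-k
    model must emit before reading their source counterpart.
    `alignment` is a set of 1-indexed (s, t) pairs."""
    anticipated = {t for (s, t) in alignment if s >= t + k}
    return len(anticipated) / tgt_len
```

On the Fig. 4 alignment (assuming a target length of 8), only y_1 aligns to a source word with s ≥ t + 3, so AR_3 = 1/8, while AR_1 = 5/8.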

Hallucination Rate of Hypotheses
The goal of reducing the anticipation rate during the training of a simultaneous translation model is to avoid hallucinations at test time. Analogous to the anticipation metric introduced in the previous section, we define another metric to quantify the number of hallucinations in decoding. A target word ŷ_t is a hallucination if it cannot be aligned to any source word. Formally, based on the word alignment a, whether target word ŷ_t is a hallucination is

H(t, a) = 1 \text{ if no } s \text{ exists with } (s, t) \in a, \text{ and } 0 \text{ otherwise,}

and the hallucination rate of a hypothesis ŷ is the fraction of hallucinated words, \mathit{HR}(x, \hat{y}, a) = \frac{1}{|\hat{y}|} \sum_{t=1}^{|\hat{y}|} H(t, a).
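The hallucination-rate computation is symmetric to the anticipation rate (again a sketch over 1-indexed (s, t) alignment pairs):

```python
def hallucination_rate(alignment, hyp_len):
    """HR: fraction of hypothesis words (1-indexed positions
    1..hyp_len) that align to no source word at all.
    `alignment` is a set of 1-indexed (s, t) pairs."""
    aligned = {t for (_, t) in alignment}
    return sum(1 for t in range(1, hyp_len + 1) if t not in aligned) / hyp_len
```

For instance, under the Fig. 4 alignment with an 8-word hypothesis, only position 6 is unaligned, giving HR = 1/8.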

Experiments
Dataset and Model We conduct experiments on two language pairs, Zh→En and Ja→En. For Zh→En, we use the NIST corpus (2M pairs) as the training set, and NIST 2006 and NIST 2008 as the dev and test sets, which contain 616 and 691 sentences with 4 English references each. We also collected a set of references annotated by human interpreters with sight-interpreting 1 for the test set. For Ja→En translation, we use the ASPEC corpus (3M pairs). Following Morishita et al. (2019), we only use the first 1.5M parallel sentences and discard the rest as noisy. We use the dev and test sets in ASPEC, with 1,790 and 1,812 pairs, respectively. We preprocess the data with MeCab (Kudo et al., 2004) for word segmentation and UniDic (Yasuharu et al., 2007) as its dictionary. Consecutive Japanese tokens that contain only Hiragana characters are merged to reduce redundancy.
The full-sentence model is trained on the original training set. We use fast_align (Dyer et al., 2013) as the word aligner (Model 2 for anticipation and Model 1 for hallucination) and train it on the training set. All datasets are tokenized with BPE (Sennrich et al., 2016). We implement wait-k policies on the base Transformer (Vaswani et al., 2017) for all experiments.

Results
We compare the performance of wait-k models trained in three settings: (i) original training references only; (ii) original training references with all Pseudo-Refs; (iii) original training references with the top 40% of Pseudo-Refs by sentence-level BLEU. Table 1 reports translation quality and hallucination rate. The filtered 40% Pseudo-Refs achieve the best results except at k = 9. Fig. 7 shows that the generated Pseudo-Refs significantly reduce the k-anticipation rate compared with the original training references, especially for smaller k. As shown in Table 2, when taking the human sight-interpreting result as a single reference, the improvement is more salient than when evaluated on the standard 4 references (+7.5% vs. +6.5%), which confirms that our method tends to translate in a "syntactic linearity" fashion like human sight and simultaneous interpreters (Ma, 2019). Fig. 5 shows an example of how the wait-k model is improved by the generated Pseudo-Refs. In this example, the original training reference delays the translation of an adverbial clause (of time), which makes the model learn to anticipate the subject before it appears; this pattern is common in the original training set. Fig. 6 shows two more examples of generated pseudo-references, on Ja→En and Zh→En, respectively; the generated pseudo-references are clearly more suitable than the original references. We also show several examples of solving other avoidable anticipations in Figs. A1-A4 in the Appendix. Table 3 shows the results of the Ja→En translation task. Japanese-to-English simultaneous translation is more difficult due to long-distance reorderings (SOV-to-SVO); many Japanese sentences are difficult to translate into English monotonically. Moreover, the test set has only a single reference and does not cover many possible expressions. Results show that filtered Pseudo-Refs still improve translation quality (Tab. 3), and reduce anticipation (Fig. 7) and hallucination (Tab. 3).

Related Work
In the pre-neural statistical MT era, several efforts used source-side reordering as a preprocessing step for full-sentence translation (Collins et al., 2005; Galley and Manning, 2008; Xu et al., 2009). Unlike our work, they rewrite the source sentences; but in the simultaneous translation scenario, the source input is incrementally revealed and unpredictable. Zheng et al. (2018) propose to improve full-sentence translation by generating pseudo-references from multiple gold references, while our work does not require multiple gold references and is designed for simultaneous translation. This work is most closely related to He et al. (2015), which addresses the same problem but only for the special case of Ja→En translation, using hand-written, language-specific syntactic transformation rules to rewrite the original reference into a more monotonic one. By comparison, our work is much more general: (a) it is not restricted to particular language pairs; (b) it does not require language-specific grammar rules or syntactic processing tools; and (c) it can generate pseudo-references under a specific policy according to the latency requirement.

Conclusions
We have proposed a simple but effective method to generate more monotonic pseudo-references for simultaneous translation. These pseudo-references cause fewer anticipations and can substantially improve simultaneous translation quality.

Figure A1: The training reference uses passive voice while the source sentence uses active voice. This kind of problem often appears in sentences with "there be" (e.g., Fig. A2).

Figure A2: A similar example, in which the wait-3 pseudo-reference ("the economic and trade cooperation between the two countries has great potential .") avoids the anticipation brought by the "there be" phrase in the gold reference.