Learning Coupled Policies for Simultaneous Machine Translation using Imitation Learning

We present a novel approach to efficiently learn a simultaneous translation model with coupled programmer-interpreter policies. First, we present an algorithmic oracle to produce oracle READ/WRITE actions for training bilingual sentence pairs using the notion of word alignments. These oracle actions are designed to capture enough information from the partial input before writing the output. Next, we perform coupled scheduled sampling to effectively mitigate the exposure bias when learning both policies jointly with imitation learning. Experiments on six language pairs show our method outperforms strong baselines in terms of translation quality while keeping the delay low.


Introduction
Simultaneous machine translation (SIMT) is a setting where the translator needs to incrementally generate the translation while the source utterance is being received. This is a challenging translation scenario, as the SIMT model needs to trade off the delay of the translation output against the quality of the generated translation.
Recent research on SIMT relies on a strategy to decide when to read a word from the input or write a word to the output (Satija and Pineau, 2016; Gu et al., 2017). This is based on a sequential decision making formulation of SIMT, where the decision about the next READ/WRITE action is made by an agent interacting with the neural machine translation (NMT) environment. Current approaches are sub-optimal, as they either fix the agent's policy and focus on learning the NMT model (Dalvi et al., 2018) or learn adaptive agent policies while the NMT model is fixed (Gu et al., 2017; Alinejad et al., 2018). We argue that the interpreter should also learn to generate correct translations from incomplete input information. This is challenging, as we need to optimize both the programmer's and the interpreter's policies to balance the trade-off between quality and delay in the reward.
Previous research has considered the use of imitation learning (IL) to train the agent's policy (Zheng et al., 2019a,b), which is generally superior to reinforcement learning (RL) in terms of stability and sample complexity. However, the bottleneck of IL in SIMT is the unavailability of an oracle sequence of actions. Designing algorithmic oracles to compute sequences of READ/WRITE actions with low translation latency and high translation quality is under-explored.
We present an IL approach to efficiently learn effective coupled programmer-interpreter policies in SIMT, based on the following contributions. First, we present a simple, fast, and effective algorithmic oracle to produce oracle actions from the training bilingual sentence pairs based on statistical word alignments (Brown et al., 1993). Next, we design a framework that uses scheduled sampling on both the programmer and the interpreter. This differs from typical IL scenarios, where there is only one policy to learn. As the two policies collaborate, their learning needs to be robust not only to their own incorrect predictions, but also to the incorrect predictions of the other policy, in order to mitigate this coupled exposure bias.
Experiments on six language pairs (translating to English from Arabic, Czech, German, Romanian, Hungarian, and Bulgarian) show that the policies trained using our approach compare favorably with strong policies from previous work. We attribute the effectiveness of the learned coupled policies to (i) the scheduled sampling, which handles the coupled exposure bias, resulting in up to 5-8 BLEU score improvements, and (ii) the quality of the oracle actions generated by our algorithmic oracle, which balance translation quality and delay.
Algorithm 1 Generation in NPI-SIMT
1: i, j ← 0
2: while a stopping condition is not met do
3:   sample the next action a_{t+1} from the programmer
4:   if a_{t+1} = READ then reveal the next source word and increment i
5:   else generate the next target word with the interpreter and increment j
6: end while

NPI Approach to SIMT

We describe generation in our neural programmer-interpreter (NPI) approach to simultaneous machine translation (SIMT) in Algorithm 1. At each time step t, the programmer decides whether to READ the next source word or to WRITE the next target word of the translation. The interpreter then immediately executes the action generated by the programmer. Both the programmer and the interpreter are modeled as Markov decision processes (MDPs), where the prediction at a particular time step depends on the history of previous predictions. The indices i_t and j_t denote the number of READ and WRITE actions in the program up to time step t.
The Programmer needs to sequentially decide on the next action, given the previous actions a_{<t}, the prefix of the source utterance read so far x_{≤i_t}, and the prefix of the target translation generated so far y_{≤j_t}. That is, our programmer is modeled as P_prog(a | a_{<t}, x_{≤i_t}, y_{≤j_t}).
The Interpreter needs to execute the action generated by the programmer. At time step t, if the generated action a_t is READ, we reveal the next input token. Otherwise, if it is WRITE, we generate the next target word according to P_intp(y | a_{≤t}, x_{≤i_t}, y_{≤j_t}). (The counters i and j are also incremented according to the respective action at time t.)

The Probabilistic Model. The probability of simultaneously generating the translation y and the sequence of actions a for a source utterance x is

P(y, a | x) = ∏_{t=1}^{|x|+|y|} P_prog(a_t | a_{<t}, x_{≤i_t}, y_{≤j_t}) × ∏_{t: a_t=WRITE} P_intp(y_{j_t} | x_{≤i_t}, y_{≤j_t}).

Training the Model. In SIMT, we are interested not only in producing a high-quality translation, but also in reducing the delay between the time a source word is received and the time its translation is generated. Training the model based on this hybrid training objective can be done by reinforcement learning (RL) or imitation learning (IL). The RL approach has been attempted by Satija and Pineau (2016), Gu et al. (2017), and Alinejad et al. (2018) for training the programmer; however, it is unstable due to the sparsity of the reward function, and these works also assume a fixed interpreter. We thus take the IL approach for sample-efficient, effective, and stable learning of the policies in NPI-SIMT.
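For concreteness, the generation loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the paper's implementation: the `programmer` and `interpreter` callables, the string action labels, and the fixed step budget are all illustrative assumptions.

```python
READ, WRITE = "R", "W"

def generate(source_tokens, programmer, interpreter, max_steps=200):
    """Interleave READ/WRITE actions to translate `source_tokens` online.

    `programmer(actions, src_prefix, output)` returns the next action;
    `interpreter(actions, src_prefix, output)` returns the next target token.
    """
    i, j = 0, 0                    # counts of READ and WRITE actions so far
    actions, output = [], []
    for _ in range(max_steps):
        a = programmer(actions, source_tokens[:i], output)
        actions.append(a)
        if a == READ:
            if i < len(source_tokens):
                i += 1             # reveal the next source token
        else:                      # WRITE: emit the next target token
            y = interpreter(actions, source_tokens[:i], output)
            output.append(y)
            j += 1
            if y == "</s>":        # stop once end-of-sentence is written
                break
    return output, actions
```

With a hand-coded wait-1-style programmer and an echoing interpreter, the loop alternates READ and WRITE until the source is exhausted, then finishes with WRITE actions.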

Deep Coupled Imitation Learning
Our goal is to learn a pair of policies for the programmer and interpreter using IL. §3.1 describes the method for learning both policies, where their learning inter-dependency needs to be taken into account. §3.2 describes our novel oracle program actions for each sentence pair in the training set, i.e., the program a deemed responsible for generating the translation y from a source utterance x with as low a delay as possible. Our overall training algorithm is depicted in Algorithm 2. X̂ = [x̂_1, ..., x̂_{|x|}] is the encoding of the input sequence and Ŷ = [ŷ_1, ..., ŷ_{|y|}] is the list of interpreter hidden states for all predictions. During training, these values are computed before calculating the loss of the programmer.

Learning Robust Coupled Policies
Assuming we have the oracle actions, we can learn the policies for both the programmer and the interpreter using behavioural cloning in IL (Torabi et al., 2019). That is, the model parameters are learned by maximising the likelihood of the oracle actions for both the programmer and the interpreter:

θ*_prog, θ*_intp := argmax_{θ_prog, θ_intp} Σ_{(x,y,a)} [ Σ_{t=1}^{|x|+|y|} log P_prog(a_t | a_{<t}, x_{≤i_t}, y_{≤j_t}; θ_prog) + Σ_{t: a_t=WRITE} log P_intp(y_{j_t} | x_{≤i_t}, y_{≤j_t}; θ_intp) ].
This is akin to taking the expectation, in the original training objective of NPI, under a point-mass distribution over the oracle actions. IL with behavioural cloning does not lead to policies that are robust to unseen examples at test time, due to exposure bias (Bengio et al., 2015). That is, the agent is only exposed to situations resulting from correct actions at training time, leaving it unable to recover from the propagation of errors caused by incorrect actions at test time. Scheduled sampling (Bengio et al., 2015; Ross et al., 2011) addresses this issue by exposing the agent to incorrect decisions at training time through perturbation of the oracle decisions, which we extend to learning policy pairs. Crucially, the programmer-interpreter policies need to be robust to incorrect decisions encountered not only in their own trajectories, but also in one another's trajectories.
Learning the Programmer. To train our programmer on a training example (x, y, a) with scheduled sampling, we first create perturbations (a'', y') of the ground-truth program and interpreter decisions. The perturbed program a'' and translation y' are only used as input to the recurrent architectures of the programmer and the interpreter's decoder. They are created by replacing some of the ground-truth elements with actions randomly sampled from the predictive distribution of each model. We then maximise the following training objective:

θ*_prog := argmax_{θ_prog} Σ_{(x, y', a, a', a'')} Σ_{t=1}^{|x|+|y'|} log P_prog(a_t | a''_{<t}, x_{≤i_t}, y'_{≤j_t}; θ_prog). (1)

Based on the generative process described in Algorithm 1, the programmer conditions the generation of actions at each time step on the current states of the NMT's encoder and decoder. Hence, while training the programmer, valid READ/WRITE actions need to be communicated to the interpreter and executed, in order to provide the NMT encoder/decoder states for the programmer to condition upon. Crucially, the communicated program needs to be valid.
Valid Program. A ground-truth program a is a valid sequence of READ/WRITE actions if |READ ∈ a| = |x| and |WRITE ∈ a| = |y|. A valid program ensures that the NPI model safely consumes a pair of parallel sentences. We generate a valid perturbation a' by only permuting the READ/WRITE actions of the program a (Figure 1).
Figure 1: A valid perturbation is created from the oracle program (R W R W W R R W, indices 1-8) by selecting positions with Bernoulli draws and permuting the READ/WRITE actions at the selected positions (e.g., the permuted index order 1 5 2 4 6 3 7 8).

Algorithm 2 Training NPI-SIMT
Require: D: sentence pairs with oracle actions; β_1, β_2, β_3: scheduled sampling probabilities for y', a', a''.
1: while a stopping condition is not met do

We further extend the definition of a valid program with respect to domain knowledge of translation, so that there is: (i) no WRITE at the beginning, and (ii) no READ at the end of the program.
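A valid perturbation that preserves the READ/WRITE counts can be sketched as below. This is a minimal sketch under our reading of Figure 1: actions are assumed to be "R"/"W" strings, the Bernoulli rate `beta` plays the role of β_2, and pinning the first and last actions is how we realise the no-WRITE-first/no-READ-last constraint here.

```python
import random

def valid_perturbation(actions, beta, rng=random):
    """Perturb an oracle program by permuting its READ/WRITE actions.

    Positions are selected with probability `beta` (Bernoulli draws), and
    the actions at the selected positions are shuffled among themselves, so
    the counts of READ and WRITE are preserved and the program stays valid.
    The first and last actions are never selected, keeping the domain
    constraints (no WRITE at the start, no READ at the end) intact.
    """
    perturbed = list(actions)
    # never touch the boundary actions
    positions = [t for t in range(1, len(actions) - 1) if rng.random() < beta]
    selected = [perturbed[t] for t in positions]
    rng.shuffle(selected)
    for t, a in zip(positions, selected):
        perturbed[t] = a
    return perturbed
```

Because the perturbation is a permutation, any downstream consumer still reads exactly |x| source words and writes exactly |y| target words.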
Learning the Interpreter. The interpreter needs to be robust to incorrect actions among the previously generated words of the translation, as well as to the READ/WRITE actions generated by the programmer. This is done by communicating a' to the interpreter during training. Thus, the training objective for the interpreter is:

θ*_intp := argmax_{θ_intp} Σ_{(x, y, y', a')} Σ_{t: a'_t=WRITE} log P_intp(y_{j_t} | x_{≤i_t}, y'_{≤j_t}; θ_intp).

Oracle Program Actions
Our proposed oracle should measure the appropriate amount of input needed for translating a particular target word y_{j_t}. This is done by determining the key word or phrase δ_j which contains the important information for y_{j_t}, and thereby guiding the programmer to read up to δ_j before writing y_{j_t}. Algorithm 3 outlines our oracle generation.

Algorithm 3 Oracle Generation
Require: a: symmetrized alignment of x and y in the form of a_{i,j}, meaning that x_i is aligned to y_j; indexing starts at 0.
1: δ_READ := −1
2: for j ∈ range(0, |y|) do
3:   δ_j := furthest source index aligned to y_j (if y_j is unaligned, δ_j := δ_READ)
4:   emit READ for each unread source word up to δ_j; δ_READ := max(δ_READ, δ_j)
5:   emit WRITE
6: end for

δ_j can be heuristically determined using word alignments (Brown et al., 1993; Koehn et al., 2003), which capture strong relationships between tokens. In the case of a many-to-one alignment from source to target, we choose the furthest source word. In the case of no alignment, nothing is done, as the target word can be induced from the decoder alone without reading additional input. This oracle generally generates a valid program. Caution is needed to ensure there is no WRITE at the beginning and no READ at the end of the generated oracle. This can be done by aligning the first words of the parallel sentence; similarly, we also need to align the last words of the parallel sentence. 2
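Under our reading of Algorithm 3, oracle generation from a symmetrized alignment can be sketched as follows. The function name, the set-of-pairs alignment representation, and the folding of the first-word/last-word fixes into the loop are illustrative assumptions, not the paper's exact code.

```python
def oracle_actions(alignment, src_len, tgt_len):
    """Generate oracle READ/WRITE actions from a symmetrized word alignment.

    `alignment` is a set of (i, j) pairs meaning source word i is aligned to
    target word j (0-indexed). For each target word we READ up to the
    furthest aligned source word before WRITE-ing it.
    """
    actions = []
    read_upto = -1                          # furthest source index read so far
    for j in range(tgt_len):
        aligned = [i for (i, jj) in alignment if jj == j]
        # many-to-one: take the furthest aligned source word;
        # unaligned target words trigger no extra READ
        delta = max(aligned) if aligned else read_upto
        if j == 0:
            delta = max(delta, 0)           # ensure no WRITE at the beginning
        if j == tgt_len - 1:
            delta = src_len - 1             # ensure no READ at the end
        while read_upto < delta:
            actions.append("R")
            read_upto += 1
        actions.append("W")
    return actions
```

A monotone alignment yields an alternating program, while a crossing alignment forces extra READs before the first WRITE, exactly the behaviour discussed in §4.3.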

Experiments
Our experiments aim to measure the effectiveness of our proposed method versus a strong wait-k baseline, over a range of languages of varying difficulty, syntactic complexity, and lexical complexity.

Settings
Datasets. Our main experiments are performed on higher-quality corpora that are designed for spoken dialogue and carefully edited. Additionally, we perform a single large-scale experiment using a crawled corpus (WMT) to show that our method also scales to large datasets.
We evaluate our proposed method on 6 language pairs, in all cases translating into English, with the source languages chosen to cover a wide range of language families and syntax. We use German (DE), Czech (CS) and Arabic (AR) from the IWSLT 2016 translation dataset (Cettolo et al., 2012). We use the provided training and development sets as-is, and concatenate all provided test sets to create our test set. We also evaluate Hungarian (HR), Bulgarian (BG), and Romanian (RO) from the SETIMES corpus (Tyers and Alperen, 2010). As this corpus is not partitioned, we use the majority of the data for training, holding out 2000 random sentence pairs for development and another 2000 sentence pairs for testing. Together, these languages are representative of the Germanic, West Slavic, Semitic, Uralic, South Slavic, and Italic language families, respectively.
We use sentencepiece (Kudo and Richardson, 2018) to build and tokenize our training data with a 16k vocabulary size. We then generate our oracle program actions based on the segmented tokens. We use fast_align (Dyer et al., 2013) to generate symmetrized alignments between tokens. Unless otherwise specified, we use the default settings of these toolkits.
Evaluation. We evaluate the SIMT systems based on their translation quality and delay. Translation quality is measured by case-sensitive BLEU (Papineni et al., 2002). 3 We adopt three delay measurements from previous studies. First, the average proportion (AP) (Gu et al., 2017) is the fraction of source words read per emitted target word. Second, average lagging (AL) is the average number of source words by which the output lags, measured until all inputs are read. Finally, differentiable-AL (DAL) (Arivazhagan et al., 2019) is a refinement of AL which also accumulates the cost of writing output tokens after the input is fully read.
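The two simpler delay measures can be computed from the function g, where g[j] is the number of source words that have been read when target word j is emitted. The sketch below follows the standard definitions from the literature as we read them; it is illustrative, not the evaluation code used for the experiments.

```python
def average_proportion(g, src_len, tgt_len):
    """AP (Gu et al., 2017): fraction of source read per emitted target word.

    g[j] is the number of source words read when target word j is written.
    """
    return sum(g) / (src_len * tgt_len)

def average_lagging(g, src_len, tgt_len):
    """AL (as commonly defined): average lag in source words, measured up to
    the first target position at which the full source has been read."""
    r = tgt_len / src_len                   # target-to-source length ratio
    # cut-off: first target index where the whole source has been read
    tau = next((j for j, gj in enumerate(g) if gj >= src_len), len(g) - 1)
    return sum(g[j] - j / r for j in range(tau + 1)) / (tau + 1)
```

As a sanity check, a fully offline system (read everything, then write) has AP = 1 and AL = |x|, while a wait-1 system on equal-length sentences has AL = 1.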
Baseline. We compare against the wait-k baseline, where the programmer's policy begins with k READ actions and then alternates WRITE and READ until the source sentence is exhausted or the end-of-sentence (EOS) symbol is written. Once the source sentence is exhausted, the programmer emits only WRITE actions. This baseline was shown to be superior to the reinforcement learning approach (Zheng et al., 2019a), and k can be tuned for the desired delay. Arivazhagan et al. (2019)'s approach is superior to the wait-k baseline; however, there is currently no open-source code available, and their end-to-end approach does not use an oracle policy. As our goal is not to beat the state of the art, we leave this comparison to future work.

3 Calculated using sacrebleu (Post, 2018).

Table 1: Full results on IWSLT and SETIMES datasets. Boldface indicates better translation quality versus wait-k at about the same delay (relevant systems indicated using underline within the same column). Oracle-at-test is the system where the correct program is given during testing, serving as an upper bound on translation quality at a given delay.
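The wait-k action sequence described above can be sketched as a small helper. The function name and the "R"/"W" action encoding are illustrative assumptions; the alternation logic follows the baseline description in the text.

```python
def wait_k_program(k, src_len, tgt_len):
    """Generate the wait-k action sequence: k READs, then alternate
    WRITE/READ; once the source is exhausted, emit only WRITEs."""
    actions = []
    reads = min(k, src_len)      # cannot read past the end of the source
    actions += ["R"] * reads
    writes = 0
    while writes < tgt_len:
        actions.append("W")
        writes += 1
        if reads < src_len and writes < tgt_len:
            actions.append("R")
            reads += 1
    return actions
```

Setting k at least as large as the source length recovers offline translation (wait-∞): read everything, then write everything.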
NPI-SIMT. Both the programmer and interpreter are modelled using a unidirectional recurrent neural network (RNN) with long short-term memory (LSTM) cells. In particular, we follow the architecture of Luong et al. (2015) with the multilayer perceptron attention of Bahdanau et al. (2015). Both the programmer and interpreter apply 20% dropout to the network output and 10% dropout to the embedding vector, and use a single-layer LSTM with 512 hidden units. For the large-scale experiment, we use the transformer architecture (described in §4.5).
Training. We use the Adam optimizer (Kingma and Ba, 2015) to train this framework. We track the learning rates of the programmer and the interpreter separately. We start with a learning rate of 0.001, and halve it whenever perplexity increases on the development set. We use fixed perturbation probabilities of 5%, 15%, and 15% for y', a', and a'' respectively. Early stopping is triggered at the fourth learning rate decay.
Testing. We use beam search with a beam size of 5 and length normalization that divides the hypothesis score by its length during search (Murray and Chiang, 2018).

Empirical Results
Scheduled Sampling. Our first experiment tests the effect of scheduled sampling (SS) in learning coupled policies in our NPI-SIMT method. For this purpose, we train four versions of our models, applying SS to both the programmer and the interpreter, only the programmer, only the interpreter, or neither. Table 1 shows the results, comparing against policies trained using the baseline wait-k method where k ∈ {1, 2, . . . , 7, ∞}. The NPI-SIMT system trained using our proposed oracle is able to learn from the low-delay oracle, as its natural delays (DAL, AL, AP) are generally as low as the delay of the oracle during training. However, it is clearly difficult to perfectly predict the oracle at test time, and these programmer prediction errors result in mistakes in interpreter decisions.
Next, we consider the effect of perturbation and scheduled sampling on the proposed method. Table 1 shows that applying the valid perturbation (a') is more important than normal scheduled sampling (a''). This perturbation is directly tied to the training of the interpreter: the noisy program makes the interpreter resilient to the exposure bias. Applying both forms of scheduled sampling further increases the accuracy of our proposed method. Scheduled sampling on the programmer (+a'', a') during training ameliorates this, increasing BLEU by up to 10 points for Romanian and Hungarian, 8 points for Bulgarian, and 5 points for German, Czech, and Arabic. Additionally, applying scheduled sampling on the interpreter (+y') further improves translation accuracy while slightly decreasing the delay, both effects being consistent across all language pairs.

Oracle Policy vs Wait-k Policy. Figure 2 compares the policies trained by our algorithmic oracle with those trained using the wait-k policy, starting from k = 1. In each of the six plots, the policy trained using the oracle actions corresponds to the leftmost triangle point. Observe that the policy trained using the oracle actions compares favorably with those trained using the wait-k method in terms of translation quality (higher is better) and translation delay (lower is better). Next, we investigate the effect of increasing the delay of the oracle policy in a controlled manner on the translation quality of the trained systems. We increase the delay of the oracle policy by moving the last READ action in the oracle program to the beginning of the program, thus artificially increasing the delay; for additional delay, we repeat this process. We expect the delayed oracle programs to lead to trained policies with better translation quality at increased delay.
The triangles in Figure 2 correspond to policies trained using versions of the oracle program with added delays {0-5}. Observe that policies trained with the delayed versions of the oracle program consistently outperform the wait-k policies across all languages.
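The delayed-oracle construction described above (moving the last READ action of the program to the front, once per extra unit of delay) can be sketched as follows; the function name is illustrative.

```python
def delay_oracle(actions, extra_delay):
    """Artificially increase oracle delay by repeatedly moving the last
    READ action in the program to the beginning of the program."""
    actions = list(actions)
    for _ in range(extra_delay):
        # find the last READ and move it to the front
        for t in range(len(actions) - 1, -1, -1):
            if actions[t] == "R":
                actions.insert(0, actions.pop(t))
                break
        else:
            break  # no READ left to move
    return actions
```

Each application pushes one READ earlier, so the resulting program still contains the same numbers of READs and WRITEs and remains valid.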
The quality of the oracle is shown as the green triangle in Figure 2. This system is given the oracle program at test time, unlike the other systems, which allow errors to propagate from the interpreter into subsequent decisions of the programmer. Note both the low delay of the oracle and the fact that its BLEU score outperforms offline translation (wait-∞). This seemingly surprising finding can be explained by the oracle providing key information to the interpreter in the form of word and phrase segmentation of the input.
Source: _Aber _wenn _wir _die _Zusammensetzung _des _Erd boden s _nicht _ändern , _werden _wir _das _nie _tun . </s>
Reference: _But _if _we _don ' t _change _the _composition _of _the _soil , _we _will _never _do _this .
Table 2: Comparison of our wait-k, our coupled-SS (Co-SS), and the oracle trajectories. A column shows a sequence of consecutive READ and WRITE actions. Here our proposed method is able to imitate the oracle well, patiently waiting for sufficient input to produce a good translation. Red text indicates translation errors.
Finetuning Wait-k. Next we consider warm-starting training with the wait-k model, and finetuning with the oracle program and the SS training method. This method of training is cheaper when a wait-k system is available, as training converges in a few iterations. In this setting, we allow retraining of the interpreter, as fixing the interpreter yielded poor BLEU scores. The × points in Figure 2 show the results of this experiment, where each point was warm-started with a different wait-k system. As can be seen, all of these runs achieve similar results, but with inferior delay and quality compared to our proposed method (leftmost triangle). One explanation for this result is that the interpreter has already converged to the wait-k policy, and retraining it results in an inferior model compared to training it jointly from scratch. 4

Qualitative Analysis
Gu et al. (2017) address the difficulty of translating sentences with subject-object-verb order when translating from German to English. We show a typical example in Table 2, where the wait-k systems are forced to make difficult decisions with insufficient evidence, in this case predicting the negation of a verb which appears later in the input. We compare systems with AL ≈ 2 and DAL ≈ 4, which is achieved by wait-2, our coupled-SS system, and the oracle system. Consider first the wait-2 system. From the state shown at the bottom-right corner of the smaller red box, the system next generates a poor choice of verb ("look"), made without access to the verb in the German input. Instead, the rightmost context word was "Zusammensetzung" ("composition"), which gives little information about the verb. One way around this problem is to use a specialized classifier which predicts the final verb (Grissom II et al., 2014).
However, this is often onerous or impossible. In this example, the model must also predict the negation "nicht", which appears immediately before the final verb. In a real interpretation scenario, the only way to ensure a correct translation is to wait for the matrix verb and the negation token.
In this example, the oracle trajectory breaks the input sentence into coherent chunks, which leads to excellent translation of each segment with low delay. This is because the word-alignment oracle includes crossing alignments inside a phrase, thus producing sequences of READ and WRITE actions that do not break phrase translations. We posit that this oracle provides the minimal context needed for SIMT to translate on the fly, with sufficient context to generate each output token.
Here our proposed system closely imitates the oracle trajectory. Our proposed method is more conservative in waiting for input, waiting until the final verb to make a precise prediction. This can be explained by the programmer's uncertainty over where to break the phrases; it thus incurs additional delay for the sake of better translation quality. Such behaviour is also observed in the output of human interpreters, who will often wait when they are unsure what the speaker is talking about.

Oracle Behaviour towards Alignment
Section 4.3 has partly shown our oracle's behavior when translating from a verb-final language into English. Here we discuss the oracle's actions when translating into a verb-final language (English-Dutch). In English, the past participle is usually found right after the auxiliary verb. In this case, our oracle actions are conservative in waiting for input on the target side.
The example is shown in Figure 3, in which "have worked" does not produce a crossing alignment. First, the generated oracle will READ "have" and WRITE "heb". Then it will examine the next word, "jaren", and determine whether it needs to READ more, up until the word it is aligned to. By the time it decides to WRITE "gewerkt", the word "worked" will already have been read, so it can WRITE without an additional READ.

Figure 3: Translation example from English to Dutch ("I have worked in that company for years" → "Ik heb jaren in dat bedrijf gewerkt"). In this case, "have worked" produces a non-crossing alignment with "heb ... gewerkt" in Dutch.
It is also inevitable that our produced alignments are noisy and do not align all words correctly. It is currently not clear how much these alignment errors hurt the performance of the systems in terms of quality and delay. In the worst case, our oracle will misguide the interpreter into guessing target words without appropriate context (lowering quality and delay) or into waiting for too many words (increasing quality and delay). Neither scenario resulting from noisy alignment is catastrophic, as the outcome still depends on the interpreter's ability to guess translation outputs without appropriate input context.

Transformer and Large Scale Experiments
To show that the proposed method also extends to the transformer and to larger parallel data, we conduct two experiments. The first replaces the LSTM architecture with the transformer architecture. The second uses 4.5 million DE-EN parallel sentences from WMT 2015. The interpreter is a standard 6-layer encoder-decoder NMT transformer, similar to Vaswani et al. (2017). The programmer consists of a single 6-layer transformer encoder with a binary classifier. For both networks, we allow attention to attend only to previous timesteps. We then use the program to mask out unseen inputs at each timestep in the interpreter. Unless otherwise specified, we use the default transformer training settings of Vaswani et al. (2017). We employ 30% dropout in all transformers and 10% on the embeddings, a 16k vocabulary size, an average batch size of 4k, 8k learning rate warmup steps, and a maximum of 50 tokens per sentence during training, for a total of 200k steps. We use a single pass of parallel scheduled sampling (Duckworth et al., 2019) for the transformer to generate a'' and y', and set the y', a', a'' perturbation rates to 10%, 15%, and 25%. Training completes within 20 hours on a single V100 GPU. 5

Table 3 presents the transformer results on IWSLT and WMT. First, we see that our transformer results are competitive with or better than the LSTM on the IWSLT dataset (compare with Table 1). These results are similar to those of Ma et al. (2019) when comparing LSTM- and transformer-based architectures in SIMT settings. Second, our coupled scheduled sampling approach increases BLEU by up to 1.5 points compared to the vanilla NPI-SIMT approach, while also keeping the delay low (3.27 AL). The higher AL of the SIMT model on WMT compared to IWSLT is likely due to the higher AL of the oracle (2.75 vs. 1.79), which we attribute to the nature of the dataset.
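The program-based masking of unseen inputs described above can be sketched as follows. This is a simplified boolean mask (real implementations would build a float tensor and add it to the attention logits); the function name and the "R"/"W" action encoding are illustrative assumptions.

```python
def source_mask_from_program(actions, src_len, tgt_len):
    """Build a (tgt_len x src_len) boolean mask from a READ/WRITE program.

    Target position j may attend only to the source words that had been
    read before the j-th WRITE was issued; all other positions are masked.
    """
    mask = [[False] * src_len for _ in range(tgt_len)]
    reads, writes = 0, 0
    for a in actions:
        if a == "R":
            reads += 1                      # one more source word visible
        else:                               # WRITE: freeze visibility for y_j
            for i in range(min(reads, src_len)):
                mask[writes][i] = True
            writes += 1
    return mask
```

Applying this mask inside the interpreter's cross-attention lets a single transformer forward pass simulate the incremental exposure of the source, matching the program used during training.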
Arguably, this crawled corpus is less suitable for SIMT in general, because it contains considerably longer parallel sentences; moreover, the text is less reflective of a real simultaneous interpretation setting, as it was built by matching offline, post-edited texts. We see similar improvements from coupled scheduled sampling over the vanilla NPI-SIMT approach, showing the scalability of our approach.

Related Work

Satija and Pineau (2016), Gu et al. (2017), and Alinejad et al. (2018) formulate simultaneous NMT as a sequential decision making problem, where an agent interacts with the environment (i.e., the underlying NMT model) through READ/WRITE actions. They pre-train the NMT system, while the agent's policy is trained using deep RL. Arivazhagan et al. (2020) highlight the poor performance of a finetuned offline translation model when translating prefixes of the input, which is the case in SIMT. Their approach uses a retranslation strategy: whenever a READ is performed, a new translation is generated from scratch, allowing the translation to be revised on the fly and mitigating error propagation in the decoder attributed to insufficient evidence when generating past output words. Their approach uses a stability metric that counts the number of suffix revisions made to produce the latest translation. It also involves wait-k inference, which limits the number of words the interpreter can emit during one writing phase, thus limiting the number of suffix revisions at the next writing phase. This wait-k inference is a heuristic that can be replaced by learning from the oracle. Zheng et al. (2019b) produce oracle READ/WRITE actions using a pretrained NMT model, which are then used to train an adaptive agent via supervised learning, i.e., behavioural cloning in imitation learning. Compared to our oracle, which is produced merely from word alignments, their method requires a full decoding of the training corpus, which is computationally expensive.
These works differ from ours in that: (i) they do not use word alignments to produce the oracle actions, and (ii) they do not use scheduled sampling.

Conclusion
This paper proposes a simple and effective way to train a simultaneous translation system to produce low-delay translations. Our central contribution is to determine a sufficient, if not minimal, amount of input needed to translate each target token. This is achieved using word alignments to create an oracle, which is then used as part of an imitation learning training algorithm to learn coupled policies: a "programmer", which decides when to wait for more input versus producing translation tokens, and an "interpreter", which generates the translation. We show the importance of scheduled sampling during learning, which is crucial to combat exposure bias. Overall, we show improvements in BLEU score over naively trained systems, at modest translation delays.
Future work is needed to better understand the effect of various alignment models and symmetrization methods on the generated oracle. Beyond this, other opportunities include applying the model to real speech input and applying more sophisticated imitation learning techniques that involve the generated trajectories of both the "interpreter" and the "programmer".