Exploiting Multimodal Reinforcement Learning for Simultaneous Machine Translation

This paper addresses the problem of simultaneous machine translation (SiMT) by exploring two main concepts: (a) adaptive policies to learn a good trade-off between high translation quality and low latency; and (b) visual information to support this process by providing additional (visual) contextual information which may be available before the textual input is produced. To this end, we propose a multimodal approach to simultaneous machine translation using reinforcement learning, with strategies to integrate visual and textual information in both the agent and the environment. We explore how different types of visual information and integration strategies affect the quality and latency of simultaneous translation models, and demonstrate that visual cues lead to higher quality while keeping latency low.


Introduction
Research into automating real-time interpretation has explored deterministic and adaptive approaches to build policies that address the issue of translation delay (Ryu et al., 2006; Cho and Esipova, 2016). In another recent development, the availability of multimodal data (such as visual information) has driven the community towards multimodal approaches for machine translation (MMT) (Barrault et al., 2018). Although deterministic policies have recently been explored for simultaneous MMT (e.g. Imankulova et al., 2020), there are no studies regarding how multimodal information can be exploited to build flexible and adaptive policies for simultaneous machine translation (SiMT).
Applications of reinforcement learning (RL) to unimodal SiMT have highlighted the challenges the agent faces in maintaining good translation quality while learning an optimal translation path (i.e. a sequence of READ/WRITE decisions at each time step) (Grissom II et al., 2016; Alinejad et al., 2018).
Incomplete source information has a detrimental effect, especially in cases where significant restructuring is needed when translating from one language to another.
In addition, the lack of information generally leads to high variance during training in the RL setup. We posit that multimodality in adaptive SiMT can help the agent by providing extra signals, which in turn improves training stability and thus the quality of the estimator and the translation decoder.
In this paper, we present the first exploration of multimodal RL approaches to the task of SiMT.
As visual signals, we explore both image classification features and visual concepts, which provide global image information and explicit object representations, respectively. For RL, we employ the Policy Gradient method with a pre-trained neural machine translation model acting as the environment.
As the SiMT model is optimised for both translation quality and latency, we apply a combined reward function that consists of a decomposed smoothed BLEU score and a latency score. To integrate visual and textual information, we propose different strategies that operate both on the agent (as prior information or at each step) and the environment side.
In experiments on standard datasets for MMT, our models achieve the highest BLEU scores on most settings without significant loss on average latency, as compared to strong SiMT baselines. A qualitative analysis shows that the agent benefits from the multimodal information by grounding language signals on the images.
Our main contributions are as follows: (1) we propose the first multimodal approach to simultaneous machine translation based on adaptive policies with RL, introducing different strategies to integrate visual and textual information (Sections 3 and 4); (2) we show how different types of visual information and integration strategies affect the quality and latency of the models (Section 5); (3) we demonstrate that providing visual cues to both agent and environment is beneficial: models achieve high quality while keeping the latency low (Section 5).

Related Work
In this section, we first present background and related work on SiMT, and then discuss recent work in MMT and multimodal RL.

Simultaneous Machine Translation
In the context of neural machine translation (NMT), Cho and Esipova (2016) introduce a greedy decoding framework where simple heuristic waiting criteria are used to decide whether the model should read more source words or instead write a target word. Gu et al. (2017) utilise a pre-trained NMT model in conjunction with an RL agent whose goal is to learn a READ/WRITE policy by maximising quality and minimising latency. Alinejad et al. (2018) further extend the latter approach by adding a PREDICT action, with the aim of capturing the anticipation of the next source word. Ma et al. (2019) propose an end-to-end, fixed-latency framework called 'wait-k', which allows prefix-to-prefix training using a deterministic policy: the agent starts by reading a specified number of source tokens (k), followed by alternating WRITE and READ actions. Other approaches to SiMT include re-translation of previous outputs depending on new outputs (Arivazhagan et al., 2020; Niehues et al., 2018) or learning adaptive policies guided by heuristic or alignment-based approaches (Arthur et al., 2020). A general theme in these approaches is their reliance on consecutive NMT models pre-trained on full sentences. However, Dalvi et al. (2018) discuss potential mismatches between the training and decoding regimens of these approaches and propose to fine-tune the models using chunked data or prefix pairs.

Multimodal Machine Translation
MMT aims at improving the quality of automatic translation using additional sources of information (Sulubacak et al., 2020). Different methods for fusing textual and visual information have been proposed. These include initialising the textual encoder or decoder with the visual information (Elliott and Kádár, 2017; Caglayan et al., 2017), combining the visual information through spatial feature maps using soft attention (Caglayan et al., 2016; Libovický and Helcl, 2017; Huang et al., 2016), and projecting a summary of the visual representations to a common context space via a trained projection matrix (Caglayan et al., 2017; Elliott and Kádár, 2017; Grönroos et al., 2018). Further, recent work has also explored multimodal pivots (Hitschler et al., 2016) and latent variable models (Calixto et al., 2019) in the context of multimodal machine translation. In this paper, we explore all these strategies, as well as the use of visual concepts, similar to the approach of Ive et al. (2019).

Multimodal Reinforcement Learning
Previous work has explored RL with language inputs (Andreas et al., 2017; Bahdanau et al., 2018; Goyal et al., 2019), making use of language to improve the policy or reward function: for example, the task of navigating a grid-world environment using language instructions (Andreas et al., 2016).
Alternatively, RL with language output can be framed as sequential decision making for language generation while conditioning on other modalities. This includes image captioning (Ren et al., 2017), video captioning (Wang et al., 2018), question answering (Das et al., 2018), and text-based games (Côté et al., 2018). Our study sits between these two lines of work: we take both the source language and the corresponding images as input, and produce the target language as output. Our agent focuses only on learning the READ and WRITE actions, while the translation model is kept fixed for simplicity.
The central aim of the agent is learning to capture the relevant structures and relations of the modalities that can lead to a better SiMT system.

Methods
We first present the architectures for consecutive MT and the fixed-policy simultaneous MT baselines (Section 3.1). We then introduce our RL approaches, covering both the RL baseline and the proposed multimodal extensions (Section 3.2), as well as the visual features used by all multimodal approaches (Section 3.3).

Baselines
Unimodal MT. We implement a standard attentive encoder-decoder baseline (Bahdanau et al., 2015), which incorporates a two-layer encoder and a two-layer decoder with GRU (Cho et al., 2014) units. Given a source sequence of embeddings X = {x_1, …, x_S} and a target sequence of embeddings Y = {y_1, …, y_T}, the encoder first computes the sequence of hidden states H = {h_1, …, h_S} unidirectionally.
The attention layer receives H as key-values, whereas the hidden states of the first decoder GRU provide the queries. The context vector c^T_t produced by the attention layer is given as input to the second GRU. Finally, the output token probabilities p(y_t) are obtained by applying a softmax layer on top of the concatenation of the previous word embedding, the context vector, and the second GRU's hidden state.
For consecutive NMT, all source tokens are observed before the decoder begins the process of generation.
Multimodal MT. We extend the unimodal MT model with multimodal attention (Calixto et al., 2016; Caglayan et al., 2016) in the decoder, in order to incorporate visual information into the baseline NMT. Let us denote the visual counterpart of the textual hidden states H by V. Multimodal attention simply applies another attention layer on top of V, which yields a visual context vector c^V_t at each decoding time step t. The final multimodal context vector given as input to the second GRU is simply the sum of the two context vectors.
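As a hedged sketch of this fusion (function names and toy dimensions are ours, not from the paper's code), the multimodal context can be computed by running the same dot-product attention over the textual states and over the visual features, then summing the two context vectors:

```python
import math

def attend(query, keys):
    """Dot-product attention: softmax over query-key scores,
    then a weighted sum of the key vectors (used here as values)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(keys[0])
    return [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]

def multimodal_context(query, H, V):
    """Final context = textual context c^T_t + visual context c^V_t."""
    c_T = attend(query, H)   # attention over textual hidden states H
    c_V = attend(query, V)   # attention over visual features V
    return [a + b for a, b in zip(c_T, c_V)]
```

In the real model the query comes from the decoder's first GRU and the attention layers have learned parameters; the sketch only shows the summed-fusion structure.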
Unimodal wait-k NMT. We explore the deterministic wait-k approach (Ma et al., 2019) as a unimodal baseline for simultaneous NMT. The wait-k model starts by reading k source tokens and then writes the first target token. The model then reads and writes one token at a time to complete the translation process. This implies that the attention layer now attends to a partial textual representation corresponding to k words. We use the decoding-only variant, which does not require re-training an NMT model, i.e. it re-uses the already-trained consecutive NMT baselines.
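The resulting deterministic action schedule can be sketched as follows (the function name and the 0=READ / 1=WRITE encoding are ours):

```python
def wait_k_actions(k, src_len, tgt_len):
    """Deterministic wait-k schedule: read k source tokens, then
    alternate WRITE and READ until the target is complete.
    Reads stop early once the source is exhausted."""
    reads = min(k, src_len)
    actions = [0] * reads          # initial READs
    writes = 0
    while writes < tgt_len:
        actions.append(1)          # WRITE one target token
        writes += 1
        if reads < src_len and writes < tgt_len:
            actions.append(0)      # READ one more source token
            reads += 1
    return actions
```

For example, with k=2 and a 5-token source and target, the schedule begins with two READs and then alternates until the source runs out, after which it only writes.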

Policy Learning Framework
RL baseline. We closely follow Gu et al. (2017) and cast SiMT as the task of producing a sequence of READ or WRITE actions. We then devise an RL model that connects the MT system and these actions. The model is based on a reward function that takes into account both quality and latency. Following standard RL, the framework is composed of an environment and an agent. The agent decides to either read one more input token or write a token to the output; hence two actions are possible: READ and WRITE. The environment is a pre-trained NMT system which is frozen during RL training.
The agent is a GRU that parameterises a stochastic policy which decides on the action a_t after receiving the observation o_t. In our setup, o_t is defined as [c^T_t; y_t; a_{t−1}], i.e. the concatenation of the context vector and candidate target embedding coming from the environment with the previously produced action. At each time step, the agent receives a reward r_t = r^Q_t + r^D_t, where r^Q_t is the quality reward (the difference of smoothed BLEU scores for partial hypotheses produced from one step to another) and r^D_t is the latency reward, formulated following Gu et al. (2017) as:

r^D_t = α · [sgn(C_t − C*) + 1] + β · max(D_t − D*, 0)

where C_t denotes the consecutive wait (CW) metric, which is added to avoid long consecutive waits: CW measures how many source tokens are consecutively read between committing two translations. D_t refers to the average proportion (AVP) (Cho and Esipova, 2016), which is the average proportion of source tokens observed when translating the words. D* and C* are hyper-parameters that determine the expected/target values, and α and β weight the two penalty terms. The optimal quality-latency trade-off is achieved by balancing the two reward terms. In our reward implementation we again closely follow Gu et al. (2017).
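The latency part of the reward can be sketched as follows. This is a simplified, per-step bookkeeping under our own sign convention (α and β both negative, so that both terms act as penalties; the paper's exact implementation follows Gu et al. (2017) and may differ in how D_t is accumulated):

```python
def latency_rewards(actions, src_len, C_star=2, D_star=0.3,
                    alpha=-0.025, beta=-1.0):
    """Per-step latency reward r^D_t from a READ(0)/WRITE(1) sequence.
    C_t: consecutive READs since the last WRITE (the CW metric).
    D_t: running average proportion of source read per written token (AVP)."""
    sgn = lambda x: (x > 0) - (x < 0)
    rewards, reads, writes, consec, waited = [], 0, 0, 0, 0
    for a in actions:
        if a == 0:                     # READ
            reads += 1
            consec += 1
        else:                          # WRITE
            writes += 1
            waited += reads            # source tokens seen at this write
            consec = 0
        d_t = waited / (src_len * max(writes, 1))
        r = alpha * (sgn(consec - C_star) + 1) + beta * max(d_t - D_star, 0.0)
        rewards.append(r)
    return rewards
```

With these signs, the first term starts penalising once the consecutive wait reaches the target C*, and the second penalises any excess of the average proportion over D*.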

Multimodal extension. Here we focus on integrating the visual information with the agent (see Figure 1). The basic premise is that the addition of multimodal information, especially in the context of MMT, can result in the agent learning better and more flexible policies. We explore several ways to integrate visual information into this framework:

• Multimodal initialisation (RL-init): the agent network is initialised with the image vector V as d_0. We expect this vector to give the agent some context w.r.t. the source sentence, so it can potentially read fewer words before producing outputs.

• Multimodal attention (RL-att, Figure 1): applies another attention layer on top of V, which yields a visual context vector c^V_t at each agent time step t. This visual context vector is computed by dot-product attention, c^V_t = Attention(V, query ← y_t), which measures the similarity between V and the embedding of the target word produced by the decoder at time step t. In this setting, we expect the agent to attend to the information in V that helps decide whether y_t is good enough to be written to the output (potentially relating it to some part of the image information) or whether more source words need to be read to produce a better y_t. We concatenate c^V_t to the agent's observation o_t.

• As a control, we also study a multimodal environment (RL-env, Figure 1), where we use the MMT baseline as the environment. Here, we expect the initial translation quality of the SiMT RL models to be closer to that of the respective consecutive multimodal baseline, as the image information is expected to compensate for the partial source information. When combined with the RL-init and RL-att settings, we expect the agent to exploit different kinds of image information than the environment.
Learning. To learn the multimodal agent, we introduce an additional neural network with the same structure as the agent GRU network to provide control variates (baselines) that improve the Monte-Carlo policy gradient (REINFORCE; Williams, 1992). Note that here we depart from previous work, where Gu et al. (2017) use a simple multilayer perceptron as the baseline. With the reward r_t at each time step, we obtain an estimate of the gradients by subtracting the baseline b(o_t):

∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | o_t) · (R_t − b(o_t)),

where R_t is the return (cumulative reward) from time step t onwards. To further reduce the variance of the gradient estimator, we also introduce a temperature τ for controlling the interpolation between discrete action samples and continuous categorical densities, which yields a Gumbel-Softmax reparameterisation (Jang et al., 2017) that smooths learning.
More precisely, we use the Gumbel-Softmax distribution instead of the argmax while sampling, so the probability of the WRITE action, rather than the index of the sampled action, is given to the agent network.
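A minimal sketch of that sampling step (pure Python, two actions; parameter names are ours — the agent consumes the resulting soft WRITE probability rather than a hard action index):

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed one-hot sample (Jang et al., 2017): add Gumbel
    noise -log(-log(u)) to each logit, then apply a temperature-scaled
    softmax. As tau -> 0 the sample approaches a hard one-hot vector."""
    noisy = [(l - math.log(-math.log(rng.random()))) / tau for l in logits]
    m = max(noisy)
    exps = [math.exp(n - m) for n in noisy]
    z = sum(exps)
    return [e / z for e in exps]

# READ/WRITE logits from the policy; index 1 is the soft WRITE probability
sample = gumbel_softmax([0.2, 1.3], tau=0.5)
write_prob = sample[1]   # fed back to the agent instead of a hard argmax
```

The temperature τ trades off bias (high τ, smooth samples) against variance (low τ, near-discrete samples).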

Visual Features
In order to represent the visual information, we explore two settings that differ in the organisation of the spatial structure. Regardless of the setting, the image features are linearly projected into the hidden space of the decoder to yield the tensor V .
Image classification features (OC) are global image information represented by convolutional feature maps, which are believed to capture spatial cues. These features are extracted from the final convolution layer of a ResNet-50 convolutional neural network (CNN) (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) for object classification. The final feature tensor has size 8×8×2048, so visual attention is applied over a grid of 64 equally-sized regions.
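Flattening the 8×8 grid into 64 region vectors and linearly projecting each into the decoder's hidden space (yielding the tensor V, as described above) amounts to a per-region matrix product. A sketch with toy dimensions and our own names (the real tensors are 64×2048 features and a learned 2048×d projection):

```python
def project_regions(grid, W):
    """grid: flattened convolutional map, one feature vector per region;
    W: projection matrix of shape (feat_dim, hidden_dim).
    Returns V: one hidden-size vector per image region."""
    out_dim = len(W[0])
    return [[sum(f * W[i][j] for i, f in enumerate(feat)) for j in range(out_dim)]
            for feat in grid]

# toy example: 4 regions, 3-d features projected to a 2-d hidden space
grid = [[1.0, 0.0, 0.0]] * 4
W = [[1.0, 0.5], [0.0, 0.0], [0.0, 0.0]]
V = project_regions(grid, W)   # 4 region vectors of size 2
```

The same projection is applied to the visual-concept features described next, only with a 100-dimensional input instead of 2048.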
Visual Concepts (VC) are explicit object representations, where local regions are detected as objects and subsequently encoded with 100-dimensional word representations. For a given image, the detector provides 36 object and 36 attribute region proposals, which are abstract concepts associated with the image. We represent each detected region with its corresponding GloVe (Pennington et al., 2014) word vector. An image is thus represented by a feature tensor of size 72×100, and visual attention is now applied over these visual concepts rather than the uniform grid of the first approach above. We hypothesise that this type of information can result in better referential grounding, since it uses conceptually meaningful units rather than global features. The detector used here is a Faster R-CNN/ResNet-101 object detector (with 1600 object labels) (Anderson et al., 2018) pre-trained on the Visual Genome dataset (Krishna et al., 2017).


Experimental Setup

Dataset
We perform experiments on the Multi30k dataset (Elliott et al., 2016), which extends the Flickr30k image captioning dataset (Young et al., 2014) with caption translations into German and French. Multi30k is a standard MMT dataset containing parallel sentences in two languages that describe the images. The training set for each language direction comprises 29,000 image-source-target triplets, whereas the development and test sets have around 1,000 samples each. We use the corresponding test sets from 2016, 2017 and 2018 for evaluation.
Pre-processing. We use Moses scripts (Koehn et al., 2007) to lowercase, normalise and tokenise the sentences. We then create word vocabularies on the training subset of the dataset. We did not use subword segmentation to avoid its potential side effects on fixed policy SiMT and to be able to better analyse the grounding capability of the models. The resulting English, French and German vocabularies contain 9.8K, 11K and 18K tokens, respectively.

Evaluation
We use BLEU (Papineni et al., 2002) for quality, and perform significance testing via bootstrap resampling using the Multeval tool (Clark et al., 2011). For latency, we measure Average Proportion (AVP) (Cho and Esipova, 2016), the average proportion of source tokens read when committing each translation. This metric is sensitive to the difference in length between source and target.
Hence, as our main latency metric we measure Average Lagging (AVL) (Ma et al., 2019), which estimates the number of tokens by which the "writer" lags behind the "reader", as a function of the number of input tokens read.
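Under the standard definitions, both latency metrics can be sketched as follows (variable names are ours; g[t] denotes the number of source tokens read before writing target token t):

```python
def average_proportion(g, src_len, tgt_len):
    """AVP (Cho and Esipova, 2016): mean fraction of the source consumed
    per target token, normalised by source and target length."""
    return sum(g) / (src_len * tgt_len)

def average_lagging(g, src_len, tgt_len):
    """AVL (Ma et al., 2019): how many tokens the writer lags behind an
    ideal, perfectly synchronised policy, averaged up to the first target
    position at which the whole source has been read."""
    rate = tgt_len / src_len
    tau = next((t for t, gt in enumerate(g, start=1) if gt >= src_len), tgt_len)
    return sum(g[t - 1] - (t - 1) / rate for t in range(1, tau + 1)) / tau
```

For instance, a policy that reads one token before each write on equal-length sentences lags by exactly one token, while a fully consecutive policy lags by the whole source length.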

Training
Hyperparameters. We set the embedding dimensionality and GRU hidden state size to 200 and 320, respectively. We use the ADAM (Kingma and Ba, 2014) optimiser with a learning rate of 0.0004 and a batch size of 64. We use the pysimt toolkit with PyTorch (Paszke et al., 2019) v1.4 for our experiments. We early-stop w.r.t. the validation BLEU with a patience of 10 epochs. On a single NVIDIA RTX2080-Ti GPU, training takes around 35 minutes for the unimodal model and around 1 hour for the multimodal model. The number of learnable parameters is between 6.9M and 9.3M, depending on the language pair and the type of multimodality. For the RL systems, we follow Gu et al. (2017). The agent is implemented as a 320-dimensional GRU followed by a softmax layer, and the baseline network is similar to the agent except for a scalar output layer. We use ADAM as the optimiser and set the learning rate and mini-batch size to 0.0004 and 6, respectively. For each sentence pair in a batch, 5 trajectories are sampled. Following best practice in RL, the baseline network is trained to reduce the MSE loss between the predictions and the rewards using a second optimiser.
For inference, greedy sampling is used to pick action sequences. We set the hyperparameters C*=2, D*=0.3, α=0.025 and β=−1. To encourage exploration, the negative entropy policy term is weighted empirically with 0.001. Following Gu et al. (2017), we choose the model that maximises the quality-to-latency ratio (BLEU/AVP) on the validation set with a patience of 5 epochs (we also attempted to choose the model maximising BLEU or BLEU/AVL, but those stopping criteria resulted in unstable convergence). On a single NVIDIA RTX2080-Ti GPU, training takes around 2 hours. The number of learnable parameters is around 6M.
Model configurations. We experiment with seven different configurations (below). We consider visual concepts (VC) as the main source of multimodal information. Visual concepts are more abstract forms of multimodal information: unlike spatial image representations or region-of-interest-based object representations, where the representation of the same concept can vary significantly across images, visual concepts remain constant. For example, the visual concept "dog" is the same regardless of the breed, colour, size or position of the concept in different images. Image classification (OC) features are used as a contrastive setting.
• Unimodal RL baseline (RL-base): This baseline follows Gu et al. (2017), where the environment is a text-only NMT model.
• Multimodal agent with VC initialisation (RL-init VC): We initialise the agent GRU using a projection of the flattened 72×100 matrix of visual concepts.
• Multimodal agent with attention over VC (RL-att VC): The agent attends over the set of visual concepts at each step.
• Multimodal agent with attention over OC (RL-att OC): The agent attends over the set of image classification-based spatial feature maps at each step.
• Visually initialised multimodal agent with attention over VC (RL-init-att VC): Similar to RL-att VC but the agent is also initialised with VC.
• Multimodal environment with unimodal RL agent (RL-env VC): The environment is an MMT model; the agent, however, is a standard RL agent akin to the baseline.
• Multimodal agent with multimodal environment (RL-env-init-att VC): This merges all the variants: the environment is multimodal, and the multimodal agent attends to visual concepts and is also initialised with visual information.

Results
In this section, we first provide the results from our experiments (Section 5.1) and then analyse the behaviour of the (multimodal) agents (Section 5.2).

Quantitative Results
SiMT vs. Consecutive. We present the main results in Table 1. The top block for each language pair shows the textual Consecutive model and its multimodal counterpart (Consecutive+VC). These are our upper bounds, since they have access to the entire source before translating. As expected, they achieve better BLEU but much larger AVL.
RL SiMT vs. Deterministic policy. The second block in Table 1 shows the deterministic-policy Wait-2 and Wait-3 approaches. RL-base performs on par with Wait-2 (English-French) and Wait-3 (English-German). We however emphasise the flexibility of the stochastic policies learned by the RL models: these are particularly beneficial in the multimodal scenario and allow the image information to be exploited more efficiently, especially towards reducing the average lag. We expand on this in Section 5.2.
Unimodal RL vs. Multimodal RL. The third block in Table 1 compares all multimodal RL variants against the text-only SiMT RL (RL-base). In general, the multimodal RL models produce translations that are significantly better than RL-base.
Across Multimodal RL Setups. With regard to the different configurations, we observe (1) an increase in quality for the RL-att models compared to RL-base, which is consistent for both types of visual input (OC and VC), and (2) a decrease in lag for the RL-init models at a small decrease in quality (for VC RL-init in comparison to RL-base). This suggests that the RL model with the agent explicitly attending over image information leads to an increase in quality, as the multimodal agent is more selective in its word choices. The RL-init configuration with prior image context, on the other hand, reduces the lag and seems to use WRITE actions more often than READ actions. Interestingly, OC and VC features result in translations of similar quality; however, the average lag is lower with VC. We hypothesise that this could be because the VC representations remain constant across images (see Section 4.3).
The RL-init-att configuration represents a middle ground: we see a quality improvement similar to RL-att across setups (a gain of 2 BLEU points on average), but with slightly lower latency. We observe, however, that RL-env-init-att performs slightly worse, with a more pronounced latency, when compared to the RL-env model. We investigate this aspect in the next sections.
Investigating Average Lag. To further study the impact of our configurations on sentence-level lag, in Figure 2 we present binned histograms of sentence lags over the English→German test 2016 set. Generally, the models initialised with image information have more mass towards the smaller delay bins. For the RL-init and RL-env-init-att setups, we also observe the presence of two modes: around the lag value of 3, as well as around two negative values (approximately −0.25 and −1.25, respectively). These negative lag values are due to the difference in length between source and target sentences, which is typical for English→German. This also shows that the agent initialised with image information tends to prefer WRITE actions, with fewer READ actions. Further, on manual inspection of some samples, we observed that in the cases with negative lag the model begins with a WRITE action straight after reading the first token (see Table 2). As the agent is a GRU model, this behaviour resembles that of an image captioning model. We observe similar trends for English→French, with RL-init models predominantly having more mass towards smaller delay bins (see Figure 3).

Agent Attention over Visual Inputs
In Figure 4 we visualise the agent's attention at each time step. On average, the agent's actions correlate with the objects it attends to when producing the translation.
[Table 2 example:
SRC: the red car is ahead of the two cars in the background .
REF: das rote auto fährt vor den beiden autos im hintergrund . ('the red car goes before the both cars in the background')
RL-init: die person ist im begriff , die rote mannschaft auf dem roten auto versammelt . ('the person is in concept, that red manhood on the red car gathered')
Actions: 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1]

We now examine the general pattern of agent attention over the visual concepts across four configurations using the attention norm: a) RL-att-VC; b) RL-att-OC; c) RL-init-att; and d) RL-env-init-att. The attention norm is the average ℓ2 norm of the difference between attention distributions at two consecutive time steps; it measures the average change in visual attention per time step for a given sentence. We then compare the attention-norm distributions over all sentences in the English→German test 2016 set for the four agent attention configurations. We present the results in Figure 5. Overall, the RL-init and RL-att models are significantly more peaked than RL-env-init-att. This suggests that the RL-env-init-att model spreads its attention across the 72 visual concepts more uniformly than the other two models, which is perhaps one of the causes of its slightly inferior performance. We hypothesise that further regularisation of the attention distribution could ameliorate this behaviour, and leave this as future work.
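Concretely, the attention norm as described can be sketched as follows (our implementation of the stated definition):

```python
import math

def attention_norm(att):
    """Average l2 norm of the change between attention distributions at
    consecutive agent time steps; `att` is a list of distributions (one
    per time step) over the visual concepts."""
    diffs = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(att[t], att[t + 1])))
        for t in range(len(att) - 1)
    ]
    return sum(diffs) / len(diffs)
```

A constant attention pattern yields a norm of 0, while attention that jumps between concepts from one step to the next yields larger values.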

Conclusion
In this paper we presented the first thorough exposition of multimodal reinforcement learning strategies for simultaneous machine translation. We demonstrated the efficacy of visual information and showed that it leads to adaptive policies which substantially improve over deterministic and unimodal RL baselines. Our empirical results indicate that both agent-side and environment-side visual information can be exploited to achieve higher-quality translations with lower latency.
Throughout the experimental journey, we observed that optimising simultaneous machine translation for dynamic policies is non-trivial, due to the two competing objectives: translation quality versus latency. For unimodal simultaneous machine translation, RL approaches tend to achieve translation quality on par with that of deterministic policies at the same average lag. We believe the fundamental issue is the high variance of the estimator for sequence prediction, which increases sample complexity and impedes effective learning. Approaches with deterministic policies, on the other hand, are simple and effective, as they are positively biased for language pairs that are close to each other, but they suffer from poor generalisation.
In the multimodal simultaneous machine translation setting, however, the variance of the estimator for the RL models can be substantially reduced thanks to the presence of additional (visual) information.