NICT’s Neural Machine Translation Systems for the WAT21 Restricted Translation Task

This paper describes our system (Team ID: nictrb) submitted to the WAT'21 restricted machine translation task. For our submission, we designed a new training approach for restricted machine translation. By sampling constraints from the translation target, we address the problem that ordinary training data does not come with a restricted vocabulary. With the further help of constrained decoding in the inference phase, we achieved better results than the baseline, confirming the effectiveness of our solution. In addition, we tried both the vanilla Transformer and a sparse Transformer as the backbone network of the model, as well as model ensembling, which further improved the final translation performance.


Introduction
The performance of machine translation has greatly improved since the field entered the era of Neural Machine Translation (NMT) (Bahdanau et al., 2015; Sutskever et al., 2014; Wu et al., 2016). Different from traditional statistical machine translation (SMT) (Koehn et al., 2003), NMT models are trained end-to-end with contextualized representations to alleviate the locality problem and dense representations to mitigate the sparsity issue. The incorporation of novel structures such as CNNs (Gehring et al., 2017) and the Transformer (Vaswani et al., 2017) into NMT has brought performance one step closer to practical translation.
Though NMT can exploit large parallel corpora more effectively, its performance is still insufficient for some special translation scenarios. End-to-end NMT models discard many of the mechanisms that the SMT paradigm offered for manually guiding the translation process. One attractive property of SMT is that it provides explicit control over the translation output, which is effective in a variety of translation settings, including interactive machine translation (Peris et al., 2017) and domain adaptation (Chu and Wang, 2018), and is also crucial for the practical application of NMT.
Since there is still a need for manual intervention in the new NMT paradigm, much effort has been devoted to studying how to incorporate this explicit control into end-to-end neural translation (Arthur et al., 2016). Among these efforts, Constrained Decoding (CD), a modification of the beam search commonly adopted in ordinary NMT models, has gained a lot of attention. Hokamp and Liu (2017) proposed grid beam search, which expands beam search to include pre-specified lexical constraints. Anderson et al. (2017) used constrained beam search to force the inclusion of restricted words in the output, and employed fixed pre-trained word embeddings to facilitate vocabulary expansion to words unseen in training.
While these works accomplish the goal of explicit translation control, the time complexity of their decoding algorithms and the resultant decoding speed fall short of expectations. The complexity of grid beam search and constrained beam search is linear and exponential in the number of constraints, respectively. These algorithms are thus too inefficient for large-scale use. To alleviate these shortcomings, Post and Vilar (2018) proposed a new constrained decoding algorithm, dynamic beam allocation, which allocates the slots of a fixed-size beam among constraints and has a claimed complexity of O(1) in the number of constraints. However, their approach still processes sentence constraints sequentially rather than in batches, limiting the GPU's parallel processing capabilities. Based on Post and Vilar (2018), Hu et al. (2019) proposed a vectorized dynamic beam allocation approach, which vectorizes the dynamic beam allocation for batching, resulting in increased throughput through parallelization.
Constrained decoding is a very general method for incorporating additional translation knowledge into the output without modifying the model parameters or training data. However, the model's prediction distribution can be skewed during the decoding process with hard constraints, resulting in poor translation results. When the model is exposed to the restricted translation paradigm during training, the gap between training and inference can be reduced, potentially improving performance. Therefore, in this paper, we propose a training method of Sampled Constraints as Concentration (SCC). In this method, training data is the same as the ordinary NMT; only minor modifications on the loss calculation are required to adapt the model to restricted translation.
In our submission to the WAT'21 (Nakazawa et al., 2021) restricted translation task, we chose the Transformer (Vaswani et al., 2017) as our baseline because of its high performance and scalability. Although many variants exist, our previous experiments have shown that few approaches are both concise and effective. At the same time, though multi-head self-attention in the Transformer can model extremely long dependencies, attention in deep layers tends to over-concentrate on a single token, resulting in inadequate use of local information and difficulty representing long sequences. To address this disadvantage, we employ the PRIME Transformer with a multi-scale sparse attention mechanism as a second baseline. The models of the two architectures are ensembled to improve the overall results. Our final system combines the SCC training method with the constrained decoding of Hu et al. (2019), which allows our system to leverage soft constraints (inside the model) while benefiting from hard restrictions (external decoding).

Our System
In this section, we describe the methods used in our system in detail. Our system is made up of four components: the Transformer model, the Sparse Transformer model, the SCC training approach, and the constrained decoding algorithm. In translation, given the source input sequence X = {w_1, w_2, ..., w_m}, its target translation Y = {y_1, y_2, ..., y_n}, and the parameters θ of the NMT model, the probabilistic form of the translation process can be written as:

P(Y|X; θ) = ∏_{i=1}^{n} P(y_i | X, y_{<i}; θ),

where y_{<i} denotes the tokens generated before time step i.
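As a toy illustration of this factorization (the per-step distributions below are made-up values, not model outputs), the sentence-level log-probability is simply the sum of the per-step conditional log-probabilities:

```python
import math

def sequence_log_prob(step_dists, target_ids):
    """log P(Y|X; theta) = sum_i log P(y_i | X, y_<i; theta).

    `step_dists[i]` stands in for the model's distribution
    P(. | X, y_<i; theta) at decoding step i."""
    return sum(math.log(dist[y]) for dist, y in zip(step_dists, target_ids))

# Toy per-step distributions over a 3-word vocabulary.
dists = [
    {0: 0.7, 1: 0.2, 2: 0.1},  # P(y_1 | X)
    {0: 0.1, 1: 0.8, 2: 0.1},  # P(y_2 | X, y_1)
]
lp = sequence_log_prob(dists, [0, 1])  # log(0.7) + log(0.8)
```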

Transformer Model
The Transformer model (Vaswani et al., 2017) is an encoder-decoder architecture built entirely on multi-head self-attention, which is responsible for learning representations of global context. Given an input representation H, a multi-head self-attention (MHA) layer first projects H into three representations, key K, query Q, and value V:

K = H W^K, Q = H W^Q, V = H W^V,

where W^K, W^Q, and W^V are projection parameters. It then uses a self-attention mechanism to obtain the output representation. The self-attention operation σ is the scaled dot-product between query, key, and value:

σ(Q, K, V) = Softmax(Q K^T / √d_k) V,

where d_k = d_model/K is the dimension of each head. The encoder of the Transformer model consists of a stack of multiple layers with the MHA structure (Self-MHA_enc), where residual connections and layer normalization connect two adjacent layers. Similar to the encoder, each decoder layer is composed of two MHA structures, Self-MHA_dec and Cross-MHA, since it not only encodes the input sequence but also incorporates the source representation. The processing flow of the model can then be written as:

H_dec = Self-MHA_dec(IncMask([BOS, y_1, ..., y_{n−1}])),
P(Y|X) = Softmax(Linear(Cross-MHA(H_dec, H_enc))),

where IncMask(·) represents the incremental masking strategy.
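A minimal NumPy sketch of the scaled dot-product attention described above; the shapes and random inputs are illustrative only:

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    """sigma(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
H = rng.random((5, 8))                               # 5 tokens, model dim 8
Wq, Wk, Wv = (rng.random((8, 8)) for _ in range(3))  # projection parameters
out = scaled_dot_attention(H @ Wq, H @ Wk, H @ Wv)
```

In the full multi-head layer, this operation runs once per head on sliced projections and the head outputs are concatenated; the sketch shows a single head.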

Sparse Transformer Model
Recent evaluations (Tang et al., 2018) have shown that the vanilla Transformer has surprising shortcomings in long-sequence encoding, even though the Transformer is designed to model long dependencies. The vanilla Transformer works well for short-sequence translation, but its performance drops as the source sentence length increases, because self-attention concentrates on only a small number of tokens, making translation difficult. Since replacing the dense self-attention mechanism with a sparse attention mechanism can alleviate the difficulties of long-sentence translation, we chose the PRIME Transformer as our other base model. Compared to the vanilla Transformer, the PRIME Transformer generates the output representation of layer i in a fusion way:

H_i = Self-MHA(H_{i−1}) + Conv(H_{i−1}) + Pointwise(H_{i−1}),

where H_{i−1} is the output of layer i − 1. Conv(·) refers to dynamic convolution with multiple kernel sizes, built on the depth convolution structure DepthConv(·) proposed in Wu et al. (2019), and is employed to capture local context. Pointwise(·) refers to a position-wise feed-forward network:

Pointwise(H) = max(0, H W_1 + b_1) W_2 + b_2,

where W_1, b_1, W_2, and b_2 are learnable parameters.
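For concreteness, the position-wise feed-forward component can be sketched in NumPy as follows (the dimensions are illustrative, not the paper's settings):

```python
import numpy as np

def pointwise_ffn(H, W1, b1, W2, b2):
    """Pointwise(H) = max(0, H W1 + b1) W2 + b2, applied independently
    at every position of the sequence."""
    return np.maximum(0.0, H @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
H = rng.random((5, 8))                         # 5 positions, model dim 8
W1, b1 = rng.random((8, 32)), rng.random(32)   # inner dim 32 (illustrative)
W2, b2 = rng.random((32, 8)), rng.random(8)
out = pointwise_ffn(H, W1, b1, W2, b2)
```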

Sampled Constraints as Concentration Training
The predicted probability in ordinary NMT is y_i ∼ P(y_i|X, θ). Because of the inclusion of the constrained word sequence C in restricted translation, the probability distribution becomes y_i ∼ P(y_i|X, C, θ). To adapt the NMT model to restricted translation, rather than only influencing the search process, we expose the constrained word sequence C to the model as additional context, like the source input.
Since the parallel training data only contains the source and target language sequences, we obtain the constrained word sequence for training via random dynamic sampling from the reference target translation. This not only alleviates the burden of constrained word annotation but also has the potential to minimize overfitting.
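This random dynamic sampling can be sketched as follows; the span-count and span-length limits here are hypothetical illustrations, since the exact sampling scheme is not specified above:

```python
import random

def sample_constraints(target_tokens, max_spans=2, max_len=3, seed=None):
    """Sample a few short spans from the reference translation to act as
    training-time constraints. Span counts/lengths are illustrative."""
    rng = random.Random(seed)
    spans = []
    for _ in range(rng.randint(1, max_spans)):
        length = rng.randint(1, min(max_len, len(target_tokens)))
        start = rng.randrange(len(target_tokens) - length + 1)
        spans.append(target_tokens[start:start + length])
    return spans

ref = ["the", "model", "is", "trained", "end-to-end"]
constraints = sample_constraints(ref, seed=0)
```

Resampling the spans every epoch means the model sees different constraint sets for the same sentence pair, which is what gives the regularization effect mentioned above.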
Specifically, in the model, we use Self-MHA_dec to encode the input constrained sequence and obtain its representation:

H_cst = Self-MHA_dec(C).

It is worth noting that we remove the positional encoding of the constrained sequence, since the order of the restricted word sequence is usually inconsistent with the target translation; additionally, we remove the incremental mask, because the whole sequence is exposed to the decoder as additional context at the same time. The probabilistic form of restricted translation accordingly changes to:

P(Y|X, C) = Softmax(Linear(Cross-MHA(H_dec, H_enc) + Cross-MHA(H_dec, H_cst))).
Because the sampled constrained words are exposed to the decoder, to enforce the inclusion of these words in the translation, we place additional penalties on the loss at these sampled positions, achieving restricted translation with soft constraints on the model:

L = − Σ_{i=1}^{n} (1 + γ · 1(y_i ∈ C)) log P(y_i|X, C, y_{<i}; θ),

where 1(·) is the indicator function and γ is the penalty factor.
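The penalized loss can be sketched as follows, with toy log-probabilities standing in for the model's predictions:

```python
import math

def scc_loss(step_log_probs, target_ids, constrained, gamma=1.0):
    """Negative log-likelihood in which positions whose gold token is in
    the sampled constraint set C are up-weighted by (1 + gamma)."""
    loss = 0.0
    for log_dist, y in zip(step_log_probs, target_ids):
        weight = 1.0 + gamma if y in constrained else 1.0
        loss -= weight * log_dist[y]
    return loss

# Two-step toy example: token 1 is a sampled constraint.
log_dists = [
    {0: math.log(0.7), 1: math.log(0.3)},
    {0: math.log(0.2), 1: math.log(0.8)},
]
loss = scc_loss(log_dists, [0, 1], constrained={1}, gamma=1.0)
```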

Lexically Constrained Decoding
Beam search (Koehn, 2010) is a common approximate search algorithm for sequence generation tasks. Lexically constrained decoding is a modification of the beam search algorithm, proposed to enforce hard constraints that force a given constrained sequence to appear in the generated sequence. Specifically, beam search maintains a beam B_t at time step t, which contains only the b most likely partial sequences, where b is known as the beam size. The beam B_t is updated by retaining the b most likely sequences in the candidate set E_t, generated by considering all possible next-word predictions:

E_t = {Ŷ_{t−1} ∘ v | Ŷ_{t−1} ∈ B_{t−1}, v ∈ V},

where Ŷ_{t−1} is a sequence generated up to time step t − 1 and V is the target vocabulary.
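One update of this beam search can be sketched as follows; the scoring function is a made-up stand-in for the model's log-probabilities:

```python
def beam_step(beam, vocab, step_score, beam_size):
    """Expand every partial hypothesis in the beam with every vocabulary
    word (candidate set E_t), then keep the beam_size best (new beam B_t)."""
    candidates = [
        (prefix + [v], score + step_score(prefix, v))
        for prefix, score in beam
        for v in vocab
    ]
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]

# Toy scorer: prefers word "b" regardless of the prefix.
toy_score = lambda prefix, v: {"a": -1.0, "b": -0.1}[v]
beam = beam_step([([], 0.0)], vocab=["a", "b"],
                 step_score=toy_score, beam_size=2)
```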
In lexically constrained decoding, a finite-state machine (FSM) is used to impose the constraints. For each state s ∈ S of the FSM, a corresponding search beam B^s is maintained similarly to beam search:

B^s_t = top-b {Ŷ_{t−1} ∘ v | Ŷ_{t−1} ∈ B^{s'}_{t−1}, v ∈ V, δ(s', v) = s},

where δ : S × V → S is the FSM state-transition function that maps states and predicted words to states.
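A sketch of this per-state bookkeeping for a single constraint sequence, where the FSM state is simply the number of constraint tokens matched so far (a simplification of the full algorithm, which also handles multiple constraints and partial-match resets):

```python
def delta(state, v, constraint):
    """FSM transition: advance when v is the next unmatched constraint token."""
    if state < len(constraint) and v == constraint[state]:
        return state + 1
    return state

def constrained_step(beams, vocab, step_score, beam_size, constraint):
    """One decoding step keeping a separate beam B^s per FSM state s."""
    new_beams = {s: [] for s in range(len(constraint) + 1)}
    for s, beam in beams.items():
        for prefix, score in beam:
            for v in vocab:
                ns = delta(s, v, constraint)
                new_beams[ns].append((prefix + [v],
                                      score + step_score(prefix, v)))
    for s in new_beams:
        new_beams[s].sort(key=lambda c: c[1], reverse=True)
        new_beams[s] = new_beams[s][:beam_size]
    return new_beams

constraint = ["restricted"]
beams = {0: [([], 0.0)], 1: []}
toy_score = lambda prefix, v: 0.0  # uniform stand-in scorer
beams = constrained_step(beams, ["the", "restricted"],
                         toy_score, 2, constraint)
```

Hypotheses that include the constraint word migrate to state 1, so finished translations can be required to come from the final (fully matched) state.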

System Details
Our implementation of the Transformer models and the lexically constrained decoding algorithm is based on the Fairseq toolkit. We follow the settings and pre-processing methods of our previous models and systems (He et al., 2019; Li et al., 2020b,c,d). We use Transformer-big as our basic model, which has 6 layers in both the encoder and the decoder. Each layer consists of a multi-head attention sublayer with 16 heads and a feed-forward sublayer with an inner dimension of 4096. The word embedding dimensions and the hidden state dimensions are set to 1024 for both the encoder and decoder.
In the training phase, the dropout rate is set to 0.1. Our model training consists of two phases: in the first NMT pre-training phase, we use the ParaCrawl-v5.1 corpus (Esplà et al., 2019); in the second domain fine-tuning phase, we use the ASPEC training data. Table 2 shows the data statistics for each dataset. In both phases, cross-entropy with label smoothing of 0.1 and D2GPo (Li et al., 2020a) are employed as the training loss criteria. We use Adam (Kingma and Ba, 2015) as our optimizer, with parameter settings β_1 = 0.9, β_2 = 0.98, and ε = 10^−8. The initial learning rate is set to 10^−4 for NMT pre-training and 10^−5 for domain fine-tuning. The models are trained on 8 GPUs for about 500,000 steps. Following standard practice, we learn a subword encoding (Sennrich et al., 2016) with 40K joint merge operations.

Table 1 shows the official results evaluated on the ASPEC En→Ja test set. Comparing the vanilla Transformer-big model with Transformer-big+SCC+CD, restricted translation under +SCC+CD brings a very large performance improvement, which illustrates the effectiveness of our restricted translation approach. As in ordinary NMT, the sparse Transformer achieves better results than Transformer-big in restricted translation, which demonstrates that the sparse Transformer is a general model structure. A further increase in performance is achieved after ensembling these two models, which benefits from their distinct architectures; in general, the improvement from ensembling models of the same architecture is smaller. We show the results on the ASPEC Ja→En test set in Table 3; the conclusions are essentially consistent with those of Table 1.
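As a side note on the training criterion, label-smoothed cross-entropy for a single position can be sketched as follows (a generic sketch of label smoothing only, not the combined D2GPo criterion):

```python
import math

def label_smoothed_nll(log_probs, gold, eps=0.1):
    """(1 - eps) weight on the gold token's NLL, with the remaining eps
    spread uniformly over the whole vocabulary."""
    smooth = sum(log_probs) / len(log_probs)
    return -((1.0 - eps) * log_probs[gold] + eps * smooth)

# With a uniform prediction the smoothed loss equals the plain NLL.
uniform = [math.log(0.25)] * 4
loss = label_smoothed_nll(uniform, gold=2)  # = -log(0.25)
```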

Conclusion
In this paper, we present our NMT systems for the WAT21 restricted translation shared tasks in English ↔ Japanese. By integrating the following techniques: the Sparse Transformer, Sampled Constraints as Concentration, and Lexically Constrained Decoding, our final system achieves substantial improvements over the baseline systems, which shows the effectiveness of our approaches.