Transformer-Based Direct Hidden Markov Model for Machine Translation

The neural hidden Markov model has been proposed as an alternative to the attention mechanism in machine translation with recurrent neural networks. However, since the introduction of transformer models, its performance has been surpassed. This work introduces the concept of the hidden Markov model to the transformer architecture, and the resulting model outperforms the transformer baseline. Interestingly, we find that the zero-order model already provides promising performance, giving it an edge over a model with first-order dependency, which performs similarly but is significantly slower in training and decoding.


Introduction
Recently, significant improvements have been made to neural machine translation (NMT). Regardless of whether a recurrent neural network with long short-term memory (Hochreiter and Schmidhuber, 1997) (LSTM-RNN) (Bahdanau et al., 2015), a convolutional neural network (CNN) (Gehring et al., 2017), or a self-attentive transformer network (Vaswani et al., 2017) is used, the attention mechanism is always one of the key components of state-of-the-art NMT systems.
Several attempts have been made to explore alternative architectures that do not use an attention mechanism (Wang et al., 2017, 2018; Bahar et al., 2018; Press and Smith, 2018). However, either the performance of those systems is significantly worse than that of the LSTM-RNN-based approaches, or the time and memory complexity is much higher. Since the transformer architecture has upgraded the state-of-the-art to an even higher standard, fewer studies are being carried out in this direction.
Despite the promising translation performance of the transformer architecture, recent studies have found that the quality of the word alignments produced by the multi-head cross-attention weights is quite poor, and various techniques have been proposed to address this problem (Alkhouli et al., 2018; Garg et al., 2019; Zenkel et al., 2020). While these works focus on extracting promising alignment information from the transformer architecture, we aim to improve the translation performance of the baseline model by introducing alignment components while keeping the system monolithic. To this end, we study how the transformer architecture can be applied to the direct hidden Markov model (HMM), which is not as straightforward as in the case of the LSTM-RNN because cross-attention is used in all decoder layers. Experimental results show that the zero-order direct HMM already outperforms the baseline transformer model in terms of TER scores (Snover et al., 2006), while the first-order dependency with higher computational complexity offers no further improvements.

Related Work
The attention component is introduced by Bahdanau et al. (2015) in NMT to simulate the alignment between the source and target sentence, which leads to significant improvements compared to the pure sequence-to-sequence model (Sutskever et al., 2014). Wang et al. (2018) present an LSTM-RNN-based HMM that does not employ an attention mechanism. This work aims to build a similar model with the transformer architecture. While their model performs comparably to the LSTM-RNN-based attention baseline but is much slower, our model outperforms the transformer baseline in terms of TER scores.
The derivation of neural models for translation on the basis of the HMM framework is also studied in Yu et al. (2017) and Alkhouli et al. (2018). In Yu et al. (2017), alignment-based neural models are used to model alignment and translation from the target to the source side (inverse direction), and a language model is included in addition. Alkhouli et al. (2018) rely on alignments generated by statistical systems that serve as supervision for the training of the neural systems. By contrast, the model proposed in this work does not require any additional language model or alignment information and thus keeps the entire system monolithic.
Several works have been carried out to change attention models to capture more complex dependencies. Cohn et al. (2016) introduce structural biases from word-based alignment concepts such as fertility and Markov conditioning. Arthur et al. (2016) incorporate lexical probabilities to influence attention. These changes are based on the LSTM-RNN-based attention model. Garg et al. (2019) and Zenkel et al. (2020) try to generate translation and high-quality alignment jointly using an end-to-end neural training pipeline. By contrast, our work focuses more on improving the translation quality using the alignment information generated by the self-contained model.

Direct HMM
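The goal of machine translation is to find the target language sentence $e_1^I = e_1, e_2, \cdots, e_I$ that is the translation of a particular source language sentence $f_1^J = f_1, f_2, \cdots, f_J$ with the maximum likelihood, i.e. $\arg\max_{I, e_1^I} \Pr(e_1^I \mid f_1^J)$. In the direct HMM, an alignment from target to source ($i \rightarrow j = b_i$) is introduced into the translation probability:
\begin{align}
\Pr(e_1^I \mid f_1^J) &= \sum_{b_1^I} \Pr(e_1^I, b_1^I \mid f_1^J) \tag{1} \\
&= \sum_{b_1^I} \prod_{i=1}^{I} \Pr(b_i, e_i \mid b_0^{i-1}, e_0^{i-1}, f_1^J) \tag{2} \\
&= \sum_{b_1^I} \prod_{i=1}^{I} \Pr(e_i \mid b_0^{i}, e_0^{i-1}, f_1^J) \cdot \Pr(b_i \mid b_0^{i-1}, e_0^{i-1}, f_1^J) \tag{3}
\end{align}
The term "direct" refers to the modeling of $p(e \mid f)$ instead of $p(f \mid e)$ as in the conventional HMM (Vogel et al., 1996). In Wang et al. (2018), two LSTM-RNN-based neural networks are used to model the lexicon and the alignment probability separately. In this work they are modeled with a single transformer-based network.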
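To make the sum over alignment paths concrete, the following is a minimal Python sketch that evaluates a factorized sentence probability by brute-force enumeration on a toy example. It assumes a first-order alignment model and a lexicon model that depends only on the current alignment position. The function names and the arrays `lexicon`, `init`, and `transition` are hypothetical stand-ins for network outputs and are not taken from the original implementation.

```python
import itertools

import numpy as np


def brute_force_sentence_prob(lexicon, init, transition):
    """Sum prod_i p(e_i | b_i) * p(b_i | b_{i-1}) over all alignment paths.

    lexicon[i, j]       : p(e_i | b_i = j), probability of the reference word
    init[j]             : p(b_1 = j), alignment distribution of the first word
    transition[i, k, j] : p(b_i = j | b_{i-1} = k), used for i >= 1 (0-based)
    All three arrays are illustrative placeholders (assumption).
    """
    num_target, num_source = lexicon.shape
    total = 0.0
    # Enumerate all J^I alignment paths: exponential in the target length.
    for path in itertools.product(range(num_source), repeat=num_target):
        prob = init[path[0]] * lexicon[0, path[0]]
        for i in range(1, num_target):
            prob *= transition[i, path[i - 1], path[i]] * lexicon[i, path[i]]
        total += prob
    return total


def random_toy_model(num_target, num_source, seed=0):
    """Draw random, properly normalized toy distributions (illustration only)."""
    rng = np.random.default_rng(seed)
    lexicon = rng.uniform(0.05, 0.95, size=(num_target, num_source))
    init = rng.dirichlet(np.ones(num_source))
    # Shape (I, J, J); each row transition[i, k, :] sums to one.
    transition = rng.dirichlet(np.ones(num_source), size=(num_target, num_source))
    return lexicon, init, transition


if __name__ == "__main__":
    toy_lexicon, toy_init, toy_transition = random_toy_model(num_target=4, num_source=3)
    print(brute_force_sentence_prob(toy_lexicon, toy_init, toy_transition))
```

The loop visits $J^I$ paths, so this form is only usable for very short sentences; it mainly serves as a reference against which faster computations can be checked.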

Direct HMM in Transformer
This section describes in detail how we modify the transformer model so that both the alignment and the lexicon probability can be generated. While the lexicon model in the direct HMM has a zero-order dependency on the current alignment position $b_i$:
$$\Pr(e_i \mid b_0^{i}, e_0^{i-1}, f_1^J) := p(e_i \mid b_i, e_0^{i-1}, f_1^J) \tag{4}$$
we implement zero- and first-order dependencies for the alignment model.

Zero-order Architecture
In the zero-order architecture, the alignment model does not depend on the previous alignment position:
$$\Pr(b_i \mid b_0^{i-1}, e_0^{i-1}, f_1^J) := p(b_i \mid e_0^{i-1}, f_1^J)$$
To obtain the alignment probability, we change the order of the weighted sum and the activation function at each decoder layer in the transformer. The following notation is used:
$l$: index of the decoder layer, $l \in \{1, 2, \cdots, L\}$
$c_i$: context vector, input to the next layer
$h_j$: source hidden state (key and value)
$s_i$: target hidden state (query)
$W_n$: weight matrices
$\alpha(j|i)$: $\mathrm{softmax}(A[s_i, h_j])$, cross-attention weights
The weighted sum with the cross-attention weights is moved outside of the ReLU activation function. Before the ReLU function is applied, the target hidden state $s_{i-1}$ is projected and added to the projected source hidden state $h_j$ in order to include information from the target side in the context vector, which can also be considered a substitution for the residual connection in the standard transformer architecture. As the outputs of the last decoder layer (and the entire network) we have a lexicon probability $p(e_i \mid j, e_0^{i-1}, f_1^J)$ and an alignment probability $p(j \mid e_0^{i-1}, f_1^J)$. The output probability for the current word is the sum over all source positions:
$$p(e_i \mid e_0^{i-1}, f_1^J) = \sum_{j=1}^{J} p(j \mid e_0^{i-1}, f_1^J) \cdot p(e_i \mid j, e_0^{i-1}, f_1^J)$$
And the sentence probability is then:
$$p(e_1^I \mid f_1^J) = \prod_{i=1}^{I} p(e_i \mid e_0^{i-1}, f_1^J)$$
Due to the redefinition of the context vector, layer normalization, residual connection and linear projection are also modified accordingly. Detailed changes to the architecture are shown in Figure 1. Note that all modifications are made to decoder layers while encoder layers remain unchanged.
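The zero-order computation described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the fairseq implementation: the function names, the projection matrices `w_h` and `w_s`, and the tensor layouts are hypothetical, and layer normalization and multi-head details are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def context_sum_outside_relu(alpha, h, s_prev, w_h, w_s):
    """Context vector with the attention-weighted sum applied after the ReLU.

    alpha  : (J,)   cross-attention weights for one target position
    h      : (J, D) source hidden states
    s_prev : (D,)   previous target hidden state
    w_h, w_s : (D, D) projection matrices (hypothetical names)
    """
    # Project source and target states, add them, then apply the ReLU
    # separately for every source position j.
    per_source = np.maximum(0.0, h @ w_h + s_prev @ w_s)  # (J, D)
    # Only now take the weighted sum over source positions.
    return alpha @ per_source  # (D,)


def zero_order_word_probs(align_logits, lexicon_logits):
    """Marginalize the lexicon probability over source positions.

    align_logits   : (I, J)    unnormalized alignment scores
    lexicon_logits : (I, J, V) unnormalized word scores per source position
    returns        : (I, V)    p(e_i | e_0^{i-1}, f_1^J) for every target word
    """
    alpha = softmax(align_logits, axis=-1)      # p(j | e_0^{i-1}, f_1^J)
    lexicon = softmax(lexicon_logits, axis=-1)  # p(e | j, e_0^{i-1}, f_1^J)
    return np.einsum("ij,ijv->iv", alpha, lexicon)


def sentence_log_prob(word_probs, target_ids):
    """Log of the product of word probabilities for the reference words."""
    positions = np.arange(len(target_ids))
    return float(np.log(word_probs[positions, target_ids]).sum())
```

Because the alignment model has no dependency on the previous alignment, the sum over alignment paths factorizes per target position, which is why a single `einsum` over the source axis is enough here.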

First-order Architecture
In the first-order architecture, the alignment model additionally depends on the previous alignment position:
$$\Pr(b_i \mid b_0^{i-1}, e_0^{i-1}, f_1^J) := p(b_i \mid b_{i-1}, e_0^{i-1}, f_1^J)$$
The lexicon probability remains the same as in the zero-order model (Equation 4). To consider the dependency on the previous source position ($j' = b_{i-1}$), we change the cross-attention weights so that they are computed from $[h_j, h_{j'}]$, which denotes the concatenation of the source hidden states at positions $j$ and $j'$.
Changing the architecture from the zero-order model to the first-order model is straightforward, but the main challenge is in the training process. Due to the first-order dependency, the complexity of the brute-force search (forward path) becomes exponential (cf. Equation 3). To address this problem, we apply a dynamic programming algorithm to find the probability of the entire sentence:
$$p(e_1^I \mid f_1^J) = \sum_{j=1}^{J} Q(I, j)$$
$$Q(i, j) = p(e_i \mid j, e_0^{i-1}, f_1^J) \cdot \sum_{j'=1}^{J} p(j \mid j', e_0^{i-1}, f_1^J) \cdot Q(i-1, j') \tag{13}$$
where $Q$ denotes the recursive function. For given sentence pairs $(F_r, E_r)$, the training criterion is then the maximization of the log-likelihood function $\arg\max_{\theta} \sum_{r} \log p(E_r \mid F_r, \theta)$.
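The dynamic programming step is the standard HMM forward recursion, and a minimal NumPy sketch of it is given below. The array names and layouts (`lexicon`, `init`, `transition`) are assumptions made for illustration; in the actual model these quantities would come from the decoder.

```python
import numpy as np


def forward_sentence_prob(lexicon, init, transition):
    """Sentence probability via the forward recursion, in O(I * J^2).

    lexicon[i, j]       : p(e_i | b_i = j), probability of the reference word
    init[j]             : p(b_1 = j)
    transition[i, k, j] : p(b_i = j | b_{i-1} = k), used for i >= 1 (0-based)
    """
    q = init * lexicon[0]  # Q(1, j)
    for i in range(1, lexicon.shape[0]):
        # Q(i, j) = p(e_i | j) * sum_{j'} p(j | j') * Q(i-1, j')
        q = (q @ transition[i]) * lexicon[i]
    return float(q.sum())


def _logsumexp(x, axis):
    """Stable log(sum(exp(x))) along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))


def forward_sentence_log_prob(lexicon, init, transition):
    """Same recursion in log space; assumes all probabilities are > 0."""
    log_q = np.log(init) + np.log(lexicon[0])
    for i in range(1, lexicon.shape[0]):
        # Rows index the predecessor j', columns index the current position j.
        scores = log_q[:, None] + np.log(transition[i])
        log_q = _logsumexp(scores, axis=0) + np.log(lexicon[i])
    return float(_logsumexp(log_q, axis=0))
```

The log-space variant is usually the one to train with, since products of many probabilities underflow quickly for long sentences. The explicit loop over target positions is what makes the first-order model slower than the zero-order one.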
In previous work on the neural HMM, the forward-backward algorithm is implemented to calculate the posterior probability, which serves as the ground truth to guide the training of the lexicon and the alignment models (referred to as "manual differentiation"). However, this is not necessary. As long as the forward path is implemented according to a recursive function of dynamic programming, as shown in Equation 13, the frameworks can handle the backward path automatically (referred to as "automatic differentiation"). Intuitively, the recursive equation is nothing more than a sum of products, which an automatic differentiation toolkit should handle easily. Theoretically, the mathematical proof for this is presented in Eisner (2016). Practically, our experimental results with automatic differentiation and manual differentiation are the same as long as label smoothing (Szegedy et al., 2016) is not applied.
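The equivalence between the two approaches can be checked numerically on a toy model. The sketch below is an assumption-laden illustration: it uses central finite differences as a stand-in for an automatic differentiation toolkit so that it depends only on NumPy, and all function and array names are hypothetical. It compares the gradient of the sentence log-probability with respect to the log lexicon scores against the alignment posteriors computed by an explicit forward-backward pass.

```python
import numpy as np


def forward_log_prob(log_lexicon, init, transition):
    """Sentence log-probability from the forward recursion only."""
    q = init * np.exp(log_lexicon[0])
    for i in range(1, log_lexicon.shape[0]):
        q = (q @ transition[i]) * np.exp(log_lexicon[i])
    return np.log(q.sum())


def forward_backward_posteriors(log_lexicon, init, transition):
    """Posterior p(b_i = j | e_1^I, f_1^J) via explicit forward-backward."""
    lexicon = np.exp(log_lexicon)
    num_target, num_source = lexicon.shape
    fwd = np.zeros((num_target, num_source))
    bwd = np.ones((num_target, num_source))
    fwd[0] = init * lexicon[0]
    for i in range(1, num_target):
        fwd[i] = (fwd[i - 1] @ transition[i]) * lexicon[i]
    for i in range(num_target - 2, -1, -1):
        # bwd[i, k] = sum_j p(j | k) * p(e_{i+1} | j) * bwd[i+1, j]
        bwd[i] = transition[i + 1] @ (lexicon[i + 1] * bwd[i + 1])
    return fwd * bwd / fwd[-1].sum()


def numerical_gradient(func, x, eps=1e-6):
    """Central finite differences; stands in for automatic differentiation."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (func(x + step) - func(x - step)) / (2.0 * eps)
    return grad
```

For a toy model, `numerical_gradient(lambda x: forward_log_prob(x, init, transition), log_lexicon)` should agree with `forward_backward_posteriors(log_lexicon, init, transition)` up to finite-difference error, which is the relationship that allows the backward pass to be left to the framework.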
Without an explicitly implemented forward-backward algorithm, applying label smoothing is not straightforward, as it should be applied to the words while the automatic differentiation is performed after the forward path has been completed for the entire sentence. To solve this problem, we apply label smoothing to the lexicon probability $p(e_i \mid j, e_0^{i-1}, f_1^J)$ at each step of the forward path. Although in this case the type of label smoothing is different for the automatic and manual differentiation, the experimental results are quite similar (< 0.1% difference). The automatic differentiation has an advantage in terms of memory and time complexity and is therefore used for all subsequent experiments.

Translation Performance
In order to test the performance of the direct HMM, we carry out experiments on the WMT 2019 German→English (de-en), WMT 2019 Chinese→English (zh-en) and WMT 2018 English→Turkish (en-tr) tasks. These three tasks represent different amounts of training data, from hundreds of thousands to tens of millions. Detailed data statistics are shown in Appendix A.
The proposed approaches are implemented entirely in fairseq (Ott et al., 2019). The standard transformer base model (Vaswani et al., 2017) implemented in the fairseq framework is used as our baseline and we follow the standard setup for hyperparameters. Translation performance is measured by case-insensitive BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) scores with the SacreBLEU toolkit (Post, 2018). The results are shown in Table 1.
The results show that the direct HMMs achieve comparable performance to the transformer baselines in terms of BLEU scores and outperform the baseline systems in terms of TER scores. The TER metric is known to favor shorter hypotheses, but from the length ratio results we can conclude that the improvements are not due to this effect. In addition, it can be seen that the first-order dependency could not provide further improvements over the zero-order model. To find possible reasons for this, we extract alignment heat maps with regard to the dependencies between the current position $j$ and the predecessor position $j'$. As shown in Figure 2, the current position $j$ with the maximum probability is often the same for different predecessor positions $j'$, which indicates that the training of the model tends to "forget" the explicit first-order dependency. We inspected many heat maps and this happens quite often; for short sentences it happens almost always. This essentially explains why the first-order model fails to make improvements. To benefit from the first-order dependency, constraints or other techniques might be used during training.
The results of the RNN-based direct HMM are not included as one of the baselines here, as the performance of the RNN-based approaches is significantly surpassed by the transformer-based approaches. We believe this work will outperform the system proposed in Wang et al. (2018), but that is mainly due to the transformer architecture rather than the refinements we made.
Compared to the baseline transformer model, the direct HMM only has about 2% more free parameters. While the first-order model has a clear disadvantage in terms of training and decoding speed compared to the baseline system due to the inevitable loop over the target position i, the decoding speed of the zero-order model is only slightly slower than that of the transformer baseline. Details of time usage are given in Appendix B.

Alignment Quality
In addition to improvements in the TER scores, we believe that the direct HMM also provides better alignment quality than the standard cross-attention. To verify this assumption, we compute the alignment error rate (AER) (Och and Ney, 2000) on the RWTH German-English Golden Alignments corpus (Vilar et al., 2006), which provides 505 manually word-aligned sentence pairs extracted from the Europarl corpus. We take the argmax of the alignment probability output of our model as an estimated alignment. In addition, as with the conventional HMM, the argmax of the posterior probability can also be used as an estimated alignment, which explicitly includes the lexicon information and should lead to better quality. As baselines, we take the argmax of the average of the attention heads in the fifth and sixth decoder layers, since Garg et al. (2019) claim that the cross-attention weights in the fifth layer produce more accurate alignment information than those in the last layer. All models are trained in both directions to obtain bidirectional alignments. These bidirectional alignments are then merged using the grow-diagonal heuristic (Koehn et al., 2005).
From the results shown in Table 2, we can observe that the alignment generated by the direct HMM has a significantly better quality than that extracted directly from the transformer attention weights. The posterior probability that contains the lexicon information indeed provides better alignments, which can be seen as a further advantage of the direct HMM, since it cannot be calculated in the standard transformer architecture without an explicit alignment probability. In terms of AER performance, our model lags behind GIZA++ (Och and Ney, 2003) as well as the approaches proposed in Garg et al. (2019) and Zenkel et al. (2020).
Note, however, that our zero-order model does not include the future target word information in estimating alignments, and we do not use additional loss for alignment training, since the original goal of this work is to improve translation quality by applying HMM factorization.
In addition to the AER results, Appendix C shows heat maps extracted for the alignment probability from direct HMM compared to those extracted for cross-attention weights from the standard transformer model.
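For reference, the AER evaluation described in this section can be sketched as follows. The code uses the standard AER definition with sure and possible links, $\mathrm{AER} = 1 - (|A \cap S| + |A \cap P|) / (|A| + |S|)$; the function names and the `(target, source)` link layout are assumptions for illustration, and the merging of the two translation directions is not covered.

```python
import numpy as np


def argmax_alignment(align_probs):
    """Pick the most probable source position for every target position.

    align_probs[i, j] is the alignment (or posterior) probability of source
    position j for target position i. Returns a set of (i, j) links.
    """
    best = np.argmax(align_probs, axis=1)
    return {(i, int(j)) for i, j in enumerate(best)}


def alignment_error_rate(hypothesis, sure, possible):
    """AER for sets of links; sure links always count as possible links."""
    possible = possible | sure
    if not hypothesis and not sure:
        return 0.0
    hits = len(hypothesis & sure) + len(hypothesis & possible)
    return 1.0 - hits / (len(hypothesis) + len(sure))


if __name__ == "__main__":
    probs = np.array([[0.1, 0.9], [0.8, 0.2]])
    links = argmax_alignment(probs)
    print(alignment_error_rate(links, sure={(0, 1)}, possible={(1, 0)}))
```

In a full evaluation, the link sets would be accumulated over all sentence pairs of the test corpus before the ratio is computed, rather than averaged per sentence.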

Conclusion
This work demonstrates the use of the transformer architecture in a direct HMM for machine translation, which significantly improves TER scores. In addition, we show that the proposed system tends to "refuse" to learn the first-order dependency during training. The zero-order model achieves a good compromise between performance and decoding speed and is much faster than previous work on the direct HMM. In order to benefit from the predecessor alignment information, further techniques need to be explored. Another direction for future work would be to combine the attention mechanism with the alignment information to further improve performance.

A Data Statistics
For the German→English task, joint byte pair encoding (BPE) (Sennrich et al., 2016) with 32k merge operations is used. The newstest2015 dataset is used as the validation set and newstest2019 as the test set.
The Chinese data are segmented using the pkuseg toolkit (Luo et al., 2019). The vocabulary size and number of running words are calculated after segmentation. Separate BPE with 32k merge operations is used for Chinese and English data. The newsdev2017 dataset is used as the validation set and newstest2019 as the test set.
For the English→Turkish task, separate BPE with 8k merge operations is used.
The newstest2017 dataset is used as the validation set and newstest2018 as the test set.

B Training and Decoding Speed
Training and decoding are performed on one NVIDIA GeForce GTX 1080 Ti with 11 GB of GPU memory. Table 3 shows the training and decoding speed on the WMT 2019 German→English dataset. Compared to the baseline system, the disadvantages of the zero-order HMM in training speed are mainly due to the limited GPU memory. Since the largest tensor of the proposed model has a dimension of batch size × length of the source sentence × length of the target sentence × vocabulary size (in the standard transformer the dimension "length of the source sentence" is not required), the batch size must be reduced to fit in the GPU memory. Although gradient accumulation can be used to guarantee performance, the reduced batch size still slows down training linearly. The influence on the decoding speed is rather small. By introducing the first-order dependency, however, a for loop over every target position is inevitable, so that the training and decoding speeds are greatly reduced. This was also reported in previous work.
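The memory argument can be made concrete with a short back-of-the-envelope calculation. The numbers used in the usage note (sentence lengths of 30, a vocabulary of 32,000 entries, 4-byte floats) are illustrative assumptions, not measurements from the experiments, and the function names are hypothetical.

```python
def lexicon_tensor_bytes(batch_size, source_len, target_len, vocab_size,
                         bytes_per_value=4):
    """Size of a batch x source x target x vocabulary tensor in bytes."""
    return batch_size * source_len * target_len * vocab_size * bytes_per_value


def max_batch_size(memory_budget_bytes, source_len, target_len, vocab_size,
                   bytes_per_value=4):
    """Largest batch whose lexicon tensor alone fits into the given budget.

    This is an upper bound: parameters, activations of other layers and
    gradients also need memory, so the usable batch size is smaller.
    """
    per_sentence = lexicon_tensor_bytes(
        1, source_len, target_len, vocab_size, bytes_per_value
    )
    return memory_budget_bytes // per_sentence


if __name__ == "__main__":
    print(lexicon_tensor_bytes(1, 30, 30, 32000))
    print(max_batch_size(11 * 1024**3, 30, 30, 32000))
```

Under these assumptions a single sentence pair already needs roughly 115 MB for this tensor, which is a factor of `source_len` more than the corresponding batch × target × vocabulary output tensor of a standard transformer.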

C Heat Maps of Attention Weights and Alignments
Figure 3 demonstrates the heat maps of some sentence pairs that are randomly selected from the German→English training data after the training has almost converged. Note that here the x and y axes indicate the source and target positions ($j$ and $i$), which differs from Figure 2, where they indicate the current and previous source positions ($j$ and $j'$). We can observe that the alignment paths are much more focused than the attention weights. Since our main goal is to propose an alternative technique to improve translation performance rather than alignment quality, alignment error rates are not calculated in this work.
Figure 3: Heat maps of attention weights and alignments. The source sentence goes from left to right and the target sentence goes from top to bottom. The first column shows the attention weight heat maps (average of the multi-head cross-attention) for the 4th decoder layer. The second column shows the attention weight heat maps (average of the multi-head cross-attention) for the 6th (last) decoder layer. The third column shows the alignment heat maps taken from the proposed direct HMM.