A Bidirectional Transformer Based Alignment Model for Unsupervised Word Alignment

Word alignment and machine translation are two closely related tasks. Neural translation models, such as RNN-based and Transformer models, employ a target-to-source attention mechanism that can provide rough word alignments, but with rather low accuracy. High-quality word alignment can help neural machine translation in many ways, such as missing word detection, annotation transfer and lexicon injection. Existing methods for learning word alignment include statistical word aligners (e.g. GIZA++) and, more recently, neural word alignment models. This paper presents a bidirectional Transformer based alignment (BTBA) model for unsupervised learning of the word alignment task. Our BTBA model predicts the current target word by attending to the source context and to both left-side and right-side target context to produce accurate target-to-source attention (alignment). We further fine-tune the target-to-source attention in the BTBA model to obtain better alignments using a full context based optimization method and self-supervised training. We test our method on three word alignment tasks and show that it outperforms both previous neural word alignment approaches and the popular statistical word aligner GIZA++.


Introduction
Neural machine translation (NMT) (Bahdanau et al., 2014; Vaswani et al., 2017) achieves state-of-the-art results for various translation tasks (Barrault et al., 2019, 2020). Neural translation models, such as RNN-based (Bahdanau et al., 2014) and Transformer (Vaswani et al., 2017) models, generally have an encoder-decoder structure with a target-to-source attention mechanism. The target-to-source attention in NMT can provide rough word alignments, but with rather low accuracy (Koehn and Knowles, 2017). High-quality word alignment can be used to help NMT in many different ways, such as detecting source words that are missing in the translation (Lei et al., 2019), integrating an external lexicon into NMT to improve translation of domain-specific terminology or low-frequency words (Chatterjee et al., 2017; Chen et al., 2020), and transferring word-level annotations (e.g. underlines and hyperlinks) from source to target for document/webpage translation (Müller, 2017).
A number of approaches have been proposed to learn the word alignment task, including both statistical models (Brown et al., 1993) and, more recently, neural models (Zenkel et al., 2019, 2020; Garg et al., 2019; Chen et al., 2020; Stengel-Eskin et al., 2019; Nagata et al., 2020). The popular word alignment tool GIZA++ (Och and Ney, 2003) is based on the statistical IBM models (Brown et al., 1993), which learn the word alignment task through unsupervised learning and do not require gold alignments from humans as training data. As deep neural networks have been successfully applied to many natural language processing (NLP) tasks, neural word alignment approaches have developed rapidly and now outperform statistical word aligners (Zenkel et al., 2020; Garg et al., 2019). Neural word alignment approaches include both supervised and unsupervised methods. Supervised approaches (Stengel-Eskin et al., 2019; Nagata et al., 2020) use gold alignments from human annotators as training data and train neural models to learn word alignment through supervised learning. Unsupervised approaches do not use gold human alignments for model training and mainly focus on improving the target-to-source attention in NMT models to produce better word alignment, for example by performing attention optimization during inference (Zenkel et al., 2019), encouraging contiguous alignment connections (Zenkel et al., 2020) or using alignments from GIZA++ to supervise/guide the attention in NMT models (Garg et al., 2019).
We propose a bidirectional Transformer based alignment (BTBA) model for unsupervised learning of the word alignment task. Our BTBA model predicts the current target word by paying attention to the source context and to both left-side and right-side target context to produce accurate target-to-source attention (alignment). Compared to the original Transformer translation model (Vaswani et al., 2017), which computes target-to-source attention based on only the left-side target context due to left-to-right autoregressive decoding, our BTBA model can exploit both left-side and right-side target context to compute more accurate target-to-source attention (alignment). We further fine-tune the BTBA model to produce better alignments using a full context based optimization method and self-supervised training. We test our method on three word alignment tasks and show that our method outperforms previous neural word alignment approaches and also beats the popular statistical word aligner GIZA++.

Word Alignment Task
The goal of the word alignment task (Och and Ney, 2003) is to find word-level alignments for parallel source and target sentences. Given a source sentence s_0^{I-1} = s_0, ..., s_i, ..., s_{I-1} and its parallel target sentence t_0^{J-1} = t_0, ..., t_j, ..., t_{J-1}, the word alignment G is defined as a set of links that connect corresponding source and target words:

G = {(i, j) | s_i and t_j are aligned}    (1)
The word alignment G allows one-to-one, one-to-many, many-to-one and many-to-many alignments as well as unaligned words (Och and Ney, 2003). Due to the lack of labelled training data (gold alignments annotated by humans) for the word alignment task, most word alignment methods learn the task through unsupervised learning (Brown et al., 1993; Zenkel et al., 2020; Chen et al., 2020).
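To make the representation in Equation 1 concrete, the sketch below encodes an alignment as a set of (source index, target index) links. The sentence pair and links are illustrative examples, not data from the paper.

```python
# The word alignment G as a set of (i, j) index links between parallel
# sentences; repeated source or target indices encode many-to-one and
# one-to-many links, and indices absent from G are unaligned words.

def alignment_links(pairs):
    """Build the alignment set G from a list of (i, j) index pairs."""
    return set(pairs)

def unaligned_target_words(links, target_len):
    """Target positions that appear in no link (unaligned words are allowed)."""
    aligned = {j for _, j in links}
    return [j for j in range(target_len) if j not in aligned]

src = ["der", "kuchen", "ist", "sehr", "lecker"]
tgt = ["the", "cake", "is", "very", "delicious"]

# One-to-one links for this toy pair.
G = alignment_links([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)])

print(sorted(G))
print(unaligned_target_words(G, len(tgt)))
```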

Neural Machine Translation
Neural translation models (Bahdanau et al., 2014; Vaswani et al., 2017) generally have an encoder-decoder structure with a target-to-source attention mechanism: the encoder encodes the source sentence; the decoder generates the target sentence by attending to the source context and performing left-to-right autoregressive decoding. The target-to-source attention learned in NMT models can provide rough word alignments between source and target words. Among various translation models, the Transformer translation model (Vaswani et al., 2017) achieves state-of-the-art results on various translation tasks and is based solely on attention: source-to-source attention in the encoder; target-to-target and target-to-source attention in the decoder. The attention networks used in the Transformer model are called multi-head attention, which performs attention using multiple heads:

MultiHead(Q, K, V) = Concat(head_1, ..., head_N) W^O
head_n = Attention(Q W_n^Q, K W_n^K, V W_n^V)                    (2)
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where Q, K and V are the queries, keys and values for the attention function; W^O, W_n^Q, W_n^K and W_n^V are model parameters; d_k is the dimension of the keys. Based on parallelizable attention networks, the Transformer can be trained much faster than RNN-based translation models (Bahdanau et al., 2014).
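A from-scratch sketch of the multi-head attention in Equation 2, using plain Python lists so the shapes stay visible. The dimensions and identity weight matrices below are toy values for illustration; a real Transformer learns W_n^Q, W_n^K, W_n^V and W^O and uses much larger dimensions (the final W^O projection is omitted here).

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning both outputs and weights."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])   # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

def multi_head(Q, K, V, heads):
    """Concat(head_1, ..., head_N); heads is a list of (W_Q, W_K, W_V) triples."""
    outputs = []
    for (Wq, Wk, Wv) in heads:
        out, _ = attention(matmul(Q, Wq), matmul(K, Wk), matmul(V, Wv))
        outputs.append(out)
    # concatenate head outputs along the feature dimension
    return [sum((o[i] for o in outputs), []) for i in range(len(Q))]

# Toy usage: 2 tokens, dimension 2, two heads with identity projections.
I = [[1.0, 0.0], [0.0, 1.0]]
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
out, w = attention(Q, K, V)
mh = multi_head(Q, K, V, [(I, I, I), (I, I, I)])
print(w)
print(mh)
```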

Statistical Alignment Models
Word alignment is a key component in traditional statistical machine translation (SMT), such as phrase-based SMT (Koehn et al., 2003), which extracts phrase-based translation rules based on word alignments. The popular statistical word alignment tool GIZA++ (Och and Ney, 2003) implements the statistical IBM models (Brown et al., 1993). The statistical IBM models are mainly based on lexical translation probabilities: words that co-occur frequently in parallel sentences generally have higher lexical translation probabilities and are more likely to be aligned. The statistical IBM models are trained using parallel sentence pairs with no word-level alignment annotations and therefore learn the word alignment task through unsupervised learning. Based on a reparameterization of IBM Model 2, Dyer et al. (2013) presented another popular statistical word alignment tool, fast align, which can be trained faster than GIZA++; however, GIZA++ generally produces better word alignments than fast align.

Neural Alignment Models
With neural networks being successfully applied to many NLP tasks, neural word alignment approaches have received much attention. The first neural word alignment models were based on feed-forward neural networks (Yang et al., 2013) and recurrent neural networks (Tamura et al., 2014), which can be trained in an unsupervised manner by noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010) or in a supervised manner using alignments from human annotators or existing word aligners as labelled training data. As NMT (Bahdanau et al., 2014; Vaswani et al., 2017) achieved great success, the target-to-source attention in NMT models came to be used to infer rough word alignments, but with rather low accuracy. A number of recent works focus on improving the target-to-source attention in NMT to produce better word alignments (Garg et al., 2019; Zenkel et al., 2019, 2020; Chen et al., 2020). Garg et al. (2019) trained the Transformer translation model to jointly learn translation and word alignment through multi-task learning, using word alignments from existing word aligners such as GIZA++ as labelled training data. Chen et al. (2020) proposed a method to infer more accurate word alignments from the Transformer translation model by choosing the appropriate decoding step and layer for word alignment inference. Zenkel et al. (2019) proposed an alignment layer for the Transformer translation model and used only the output of the alignment layer for target word prediction, which forces the alignment layer to produce better alignment (attention). Zenkel et al. (2019) also proposed an attention optimization method which directly optimizes the attention for the test set to produce better alignment. Zenkel et al. (2020) proposed to improve the attention in NMT by using a contiguity loss to encourage contiguous alignment connections and by performing direct attention optimization to maximize the translation probability for both the source-to-target and target-to-source translation models. Compared to these methods, which infer word alignments based on NMT target-to-source attention computed by considering only the left-side target context, our BTBA model can exploit both left-side and right-side target context to compute better target-to-source attention (alignment).
There are also a number of supervised neural approaches that require gold alignments from humans for learning the word alignment task (Stengel-Eskin et al., 2019; Nagata et al., 2020). Because gold alignments from humans are scarce, the models of Stengel-Eskin et al. (2019) and Nagata et al. (2020) only have a small amount of task-specific training data and exploit representations from pre-trained NMT and BERT models. Compared to these supervised methods, our method does not require gold human alignments for model training.

Our Approach
We present a bidirectional Transformer based alignment (BTBA) model for unsupervised learning of the word alignment task. Motivated by BERT, which learns a masked language model (Devlin et al., 2019), we randomly mask 10% of the words in the target sentence and then train our BTBA model to predict the masked target words by paying attention to the source context and to both left-side and right-side target context. Therefore, our BTBA model can exploit both left-side and right-side target context to compute more accurate target-to-source attention (alignment), compared to the original Transformer translation model (Vaswani et al., 2017), which computes the target-to-source attention based on only the left-side target context due to left-to-right autoregressive decoding. We further fine-tune the target-to-source attention in the BTBA model to produce better alignments using a full context based optimization method and self-supervised training.
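The BERT-style target masking described above can be sketched as follows; the mask symbol <x> follows the paper's figures, while the sample sentence and the at-least-one-token rule are illustrative assumptions.

```python
import random

MASK = "<x>"  # mask symbol, as in Figure 1 of the paper

def mask_target(tokens, ratio=0.1, rng=random):
    """Replace a random `ratio` of target tokens with the mask symbol.
    We assume at least one token is always masked (an illustrative choice)."""
    n_mask = max(1, round(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, sorted(positions)

rng = random.Random(0)  # fixed seed for reproducibility
tokens = "the cake is very delicious".split()
masked, positions = mask_target(tokens, rng=rng)
print(masked, positions)
```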

Bidirectional Transformer Based Alignment (BTBA)

Figure 1 shows the architecture of the proposed BTBA model. The encoder is used to encode the source sentence 1 and has the same structure as the original Transformer encoder (Vaswani et al., 2017). The input of the decoder is the masked target sentence, in which 10% of the words are randomly masked 2 . As shown in Figure 1, the target sentence contains a masked word <x>. The decoder contains 6 layers. Each of the first 5 layers of the decoder has 3 sub-layers: a multi-head self-attention sub-layer, a target-to-source multi-head attention sub-layer and a feed-forward sub-layer, like a standard Transformer decoder layer, except that the self-attention sub-layer in the standard Transformer decoder can only attend to left-side target context, while the self-attention sub-layer in our BTBA decoder can attend to all target words and make use of both left-side and right-side target context to compute better target-to-source attention (alignment). The last layer of the BTBA decoder contains a self-attention sub-layer and a target-to-source attention sub-layer like the first 5 layers, but without the feed-forward sub-layer. We use the output of the last target-to-source attention sub-layer for predicting the masked target words, and we use the attention of the last target-to-source attention sub-layer for inferring word alignments between source and target words. Our design of using only the last target-to-source attention sub-layer output for predicting the masked target words is motivated by the alignment layer of Zenkel et al. (2019): it forces the last target-to-source attention sub-layer to pay attention to the most important source words for predicting the target word and therefore produces better word alignments.

Table 1: Masking each target word of "the cake is very delicious" in turn:
Original: the cake is very delicious
Masked:   <x> cake is very delicious
          the <x> is very delicious
          the cake <x> very delicious
          the cake is <x> delicious
          the cake is very <x>
In Figure 1, A_{ijn} is the attention value of the j-th target word paying attention to the i-th source word using the n-th head in the last target-to-source multi-head attention sub-layer. V_0, V_1, V_2, V_3, V_4 are the outputs of the decoder for the 5 target words, and V_1 is used to predict the masked target word "cake". Because V_1 is used to predict "cake", the attention value A_{21n} should be learned to be high in order to make V_1 contain the most useful source information ("kuchen"). Therefore, A_{ijn} can be used to infer the word alignment for the target word "cake" effectively. However, A_{ijn} cannot provide good word alignments for unmasked target words such as "delicious" in Figure 1, because V_4 is not used to predict any target word and A_{54n} is not necessarily learned to be high.
Because A_{ijn} can only be used to infer accurate word alignments for masked target words, but we want to obtain alignments for all target words in the test set, we mask a target sentence t_0^{J-1} in the test set J times, each time masking one target word, as shown in Table 1. Each masked target sentence is fed into the BTBA model together with the source sentence, and then we collect the attention A_{ijn} for the masked target words. Suppose the j-th target word is masked; then we compute the source position a_j that it should be aligned to as

a_j = argmax_i (1/N) Σ_n A_{ijn}    (3)

where N is the number of attention heads.
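The mask-each-position inference procedure above can be sketched as follows: mask each target position in turn, query the model for the attention A_{ijn} of the masked position, and align it to the source position with the highest head-averaged attention, as in Equation 3. `fake_model_attention` is a hypothetical stand-in for a trained BTBA model.

```python
def align_masked_position(attn_heads):
    """attn_heads[n][i]: attention of the masked target word to source word i
    in head n. Returns argmax_i of the head-averaged attention (Equation 3)."""
    n_heads = len(attn_heads)
    n_src = len(attn_heads[0])
    avg = [sum(head[i] for head in attn_heads) / n_heads for i in range(n_src)]
    return max(range(n_src), key=lambda i: avg[i])

def infer_alignments(src, tgt, model_attention):
    """Mask each target position once and collect one alignment link per word."""
    links = set()
    for j in range(len(tgt)):
        masked = tgt[:j] + ["<x>"] + tgt[j + 1:]
        attn = model_attention(src, masked, j)  # A_{ijn} for the masked j
        links.add((align_masked_position(attn), j))
    return links

src = ["der", "kuchen", "ist"]
tgt = ["the", "cake", "is"]

def fake_model_attention(src, masked, j):
    # Stand-in for the BTBA model: each of 4 heads attends mostly to source
    # position j, mimicking a diagonal alignment for this toy pair.
    return [[0.8 if i == j else 0.1 for i in range(len(src))] for _ in range(4)]

print(sorted(infer_alignments(src, tgt, fake_model_attention)))
```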

Full Context Based Optimization
In Equation 3, the attention A_{ijn} for the j-th target word is computed by considering both left-side and right-side target context, but information about the current target word itself is not used, since the j-th target word is masked. For example, in Figure 1, the BTBA model does not know that the second target word is "cake" because it is masked; therefore, the BTBA model computes the attention (alignment) for "cake" using only the left-side and right-side context of "cake", without knowing that the word to be aligned is "cake". We propose a novel full context based optimization method that uses the full target context, including the current target word, to improve the target-to-source attention in the BTBA model and produce better alignments. That is, for the last 50 training steps of the BTBA model, we no longer mask the target sentence and we only optimize the parameters W_n^Q and W_n^K in the last target-to-source multi-head attention sub-layer. As shown in Equation 2, W_n^Q and W_n^K are the parameters used to compute the attention values in multi-head attention. Optimizing W_n^Q and W_n^K based on the full target context helps the BTBA model produce better attention (alignment), while freezing the other parameters lets the BTBA model keep the knowledge learned from masked target word prediction. After full context based optimization, we no longer need to mask target sentences in the test set as shown in Table 1. We can directly feed the original source and target test sentences into the BTBA model and compute attention (alignment) for all target words in the sentence. The full context based optimization method can be seen as a fine-tuning of the original BTBA model, i.e. we fine-tune the two parameter matrices W_n^Q and W_n^K in the last target-to-source attention layer based on the full target context to compute more accurate word alignments.
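A minimal sketch of the parameter freezing used in full context based optimization: only W_n^Q and W_n^K of the last target-to-source attention sub-layer receive gradient updates, while all other parameters stay frozen. The parameter names, scalar values and plain SGD step are illustrative assumptions; a real implementation would instead toggle the trainable/requires-grad flags of the corresponding weight tensors in the training framework.

```python
# Hypothetical parameter names; only the whitelisted ones are updated.
TRAINABLE = {"last_layer.W_Q", "last_layer.W_K"}

def fcbo_step(params, grads, lr=0.1):
    """One SGD step that updates only the whitelisted parameters and
    leaves every other parameter frozen at its current value."""
    return {
        name: value - lr * grads[name] if name in TRAINABLE else value
        for name, value in params.items()
    }

params = {"last_layer.W_Q": 1.0, "last_layer.W_K": 2.0, "encoder.W": 3.0}
grads = {name: 0.5 for name in params}
updated = fcbo_step(params, grads)
print(updated)  # encoder.W stays frozen
```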

Self-Supervised Training
The BTBA model learns word alignment through unsupervised learning and does not require labelled data for the word alignment task. We train two unsupervised BTBA models, one for the forward direction (source-to-target) and one for the backward direction (target-to-source), and then symmetrize the alignments using heuristics such as grow-diagonal-final-and (Och and Ney, 2003), as the symmetrized alignments have better quality than the alignments from a single forward or backward model. After unsupervised learning, we use the symmetrized word alignments G^a inferred from our unsupervised BTBA models as labelled data to further fine-tune each BTBA model for the word alignment task through supervised training, using the alignment loss

L_a = -(1/J) Σ_j Σ_i G^a_{ij} log A_{ij}    (4)

following Garg et al. (2019)'s work. 3 During supervised training, the BTBA model is trained to learn the alignment task instead of masked target word prediction; therefore the target sentence does not need to be masked.
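A hedged sketch of an alignment loss in the style of Equation 4 and Garg et al. (2019): a cross entropy between alignment labels G^a (one row per target word, with probability mass over source positions) and the model's target-to-source attention A. The matrix shapes and values below are illustrative toy data.

```python
import math

def alignment_loss(labels, attention):
    """-(1/J) * sum_j sum_i G_{ij} * log A_{ij}
    labels[j][i]: label weight for aligning target j to source i.
    attention[j][i]: model attention of target j to source i."""
    J = len(labels)
    loss = 0.0
    for g_row, a_row in zip(labels, attention):
        loss -= sum(g * math.log(a) for g, a in zip(g_row, a_row) if g > 0)
    return loss / J

# Toy 2-word sentence pair: a sharp attention matches the labels better
# than a uniform one, so it incurs a lower loss.
labels = [[1.0, 0.0], [0.0, 1.0]]
sharp = alignment_loss(labels, [[0.9, 0.1], [0.2, 0.8]])
uniform = alignment_loss(labels, [[0.5, 0.5], [0.5, 0.5]])
print(sharp, uniform)
```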
Note that we apply byte pair encoding (BPE) (Sennrich et al., 2016) to both source and target sentences before we feed them into the BTBA model. Therefore, the alignments inferred from the BTBA model are on the BPE level. We convert 4 BPE-level alignments to word-level alignments before we perform alignment symmetrization. After alignment symmetrization, we want to use the symmetrized alignments to further fine-tune each BTBA model through supervised learning, and therefore we convert 5 the word-level alignments back to BPE level for supervised training of the BTBA models.
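The BPE-to-word conversion described above can be sketched as follows: each subword token is mapped to the index of the word it belongs to, and a word pair is aligned if any of their subword pieces are aligned. The "@@" continuation marker follows the Sennrich et al. (2016) BPE convention; the special-token handling from the footnotes (the source <bos> token and the final full stop) is omitted in this sketch.

```python
def bpe_to_word_map(bpe_tokens):
    """Map each BPE token position to the index of the word it belongs to.
    A token ending in '@@' is continued by the next token (same word)."""
    mapping, word = [], 0
    for tok in bpe_tokens:
        mapping.append(word)
        if not tok.endswith("@@"):
            word += 1
    return mapping

def bpe_links_to_word_links(links, src_map, tgt_map):
    """A word pair is aligned if any of its BPE pieces are aligned."""
    return {(src_map[i], tgt_map[j]) for i, j in links}

src_bpe = ["der", "ku@@", "chen"]   # "der kuchen" split into 3 BPE tokens
tgt_bpe = ["the", "cake"]
src_map = bpe_to_word_map(src_bpe)
tgt_map = bpe_to_word_map(tgt_bpe)

# Two BPE-level links onto "cake" collapse into one word-level link.
word_links = bpe_links_to_word_links({(1, 1), (2, 1)}, src_map, tgt_map)
print(src_map, tgt_map, sorted(word_links))
```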

Settings
In order to compare with previous work, we used the same datasets 6 as Zenkel et al. (2020)'s work and conducted word alignment experiments for three language pairs: German ↔ English (DeEn), English ↔ French (EnFr) and Romanian ↔ English (RoEn). Each language pair contains a test set and a training set: the test set contains parallel sentences with gold word alignments annotated by humans; the training set contains only parallel sentences with no word alignments. Table 2 gives the numbers of sentence pairs contained in the training and test sets. Parallel sentences from both the training set and the test set can be used to train our unsupervised BTBA models.

3 We optimize all model parameters during supervised fine-tuning.
4 To convert BPE-level alignments to word-level alignments, we add an alignment between a source word and a target word if any parts of these two words are aligned. Alignments between the source <bos> token and any target word are deleted; alignments between the last source word "." (full stop) and a target word which is not the last target word are also deleted.
5 To convert word-level alignments to BPE-level alignments, we add an alignment between a source BPE token and a target BPE token if the source word and the target word that contain these two BPE tokens are aligned; we add an alignment between the source <bos> token and a target BPE token if the target word that contains this target BPE token is not aligned with any source words.
6 https://github.com/lilt/alignment-scripts

For each language pair, we trained two BTBA models, one for the forward direction and one for the backward direction, and then symmetrized the alignments. We tested different heuristics for alignment symmetrization, including the standard Moses heuristics grow-diagonal, grow-diagonal-final and grow-diagonal-final-and.
We also tested another heuristic, grow-diagonal-and, which is slightly different from grow-diagonal: the grow-diagonal-and heuristic only adds a new alignment (i, j) when both s_i and t_j are unaligned, while grow-diagonal adds a new alignment (i, j) when either of the two words (s_i or t_j) is unaligned. We found that the Moses heuristic grow-diagonal-final-and generally achieved the best results for symmetrizing the BTBA alignments, but grow-diagonal-and worked particularly well for the EnFr task.
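The grow-diagonal-and heuristic discussed above can be sketched as follows, following the description in the text: start from the intersection of the forward and backward alignments, then repeatedly add a union link (i, j) that neighbours an existing link (including diagonal neighbours), but, unlike grow-diagonal, only when both s_i and t_j are still unaligned. The "final" steps of grow-diagonal-final-and are omitted in this sketch.

```python
# Horizontal, vertical and diagonal neighbour offsets.
NEIGHBOURS = [(-1, 0), (1, 0), (0, -1), (0, 1),
              (-1, -1), (-1, 1), (1, -1), (1, 1)]

def grow_diagonal_and(forward, backward):
    """Symmetrize forward/backward alignment sets of (i, j) links."""
    alignment = forward & backward      # start from the intersection
    union = forward | backward
    added = True
    while added:
        added = False
        aligned_src = {i for i, _ in alignment}
        aligned_tgt = {j for _, j in alignment}
        for (i, j) in sorted(alignment):
            for di, dj in NEIGHBOURS:
                cand = (i + di, j + dj)
                # "and": both the source and the target word must be unaligned.
                if (cand in union and cand not in alignment
                        and cand[0] not in aligned_src
                        and cand[1] not in aligned_tgt):
                    alignment.add(cand)
                    aligned_src.add(cand[0])
                    aligned_tgt.add(cand[1])
                    added = True
    return alignment

forward = {(0, 0), (1, 1)}
backward = {(0, 0), (1, 1), (2, 2)}
print(sorted(grow_diagonal_and(forward, backward)))
```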
Finally, we used the symmetrized alignments inferred from our unsupervised BTBA models as labelled data to further fine-tune each BTBA model to learn the alignment task through supervised training. We fine-tuned each BTBA model for 50 training steps using the alignment loss in Equation 4.
In addition, we also tested using alignments from GIZA++, instead of alignments inferred from our unsupervised BTBA models, as labelled data for supervised fine-tuning of the BTBA models. 7

7 The training time (time of one training epoch × number of training epochs) of one BTBA model is roughly the same for the different tasks (DeEn, EnFr and RoEn): 30 hours using 4 GPUs.

Table 3: AER results (excerpt; remaining entries truncated in extraction):
Method                 DeEn    EnFr    RoEn
Zenkel et al. (2019)   21.2%   10.0%   27.6%
Garg et al. (2019)     16.0%    4.6%   23.1%
Zenkel et al. (2020)   16.3%    5.0%   23.4%
Chen et al. (2020)     15

Table 3 gives alignment error rate (AER) (Och and Ney, 2000) results of our BTBA model and a comparison with previous work. Table 3 also gives results for BTBA-left and BTBA-right: BTBA-left means that the BTBA decoder only attends to left-side target context; BTBA-right means that the BTBA decoder only attends to right-side target context. As shown in Table 3, the BTBA model, which uses both left-side and right-side target context, significantly outperformed BTBA-left and BTBA-right. The results also show that the performance of our BTBA model can be further improved by full context based optimization (FCBO) and by supervised training, including both self-supervised training and GIZA++ supervised training. For the DeEn and RoEn tasks, the self-supervised BTBA (S-BTBA) model achieved the best results, outperforming previous neural and statistical methods. For the EnFr task, as the statistical aligner GIZA++ performed well and achieved better results than our unsupervised BTBA model, the GIZA++ supervised BTBA (G-BTBA) model achieved better results than the S-BTBA model and also outperformed the original GIZA++ and previous neural models. Tables 4, 5 and 6 give results of using different heuristics for symmetrizing the alignments produced by BTBA, GIZA++ and G-BTBA, respectively. For our unsupervised and self-supervised BTBA models, grow-diagonal-final-and achieved the best results on the DeEn and RoEn tasks while grow-diagonal-and achieved the best results on the EnFr task.
For GIZA++ and G-BTBA, the best heuristics for the different language pairs are quite different, though grow-diagonal-final-and generally obtained good (best or close to best) results on the DeEn and RoEn tasks while grow-diagonal-and generally obtained good (close to best) results on the EnFr task.
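The alignment error rate (AER) metric of Och and Ney (2000) used in these tables can be sketched as follows: given predicted links A, sure gold links S and possible gold links P (with S a subset of P), AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The toy link sets below are illustrative.

```python
def aer(predicted, sure, possible):
    """Alignment error rate (lower is better)."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # sure links count as possible links too
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# A perfect prediction over the sure/possible links gives AER 0.0.
print(aer({(0, 0), (1, 1)}, sure={(0, 0)}, possible={(1, 1)}))
# One wrong link out of two raises the error.
print(aer({(0, 0), (1, 2)}, sure={(0, 0)}, possible={(1, 1)}))
```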

Results
FCBO with/without Parameter Freezing

As we explained in Section 4.2, during full context based optimization (FCBO) we only optimize W_n^Q and W_n^K in the last target-to-source attention sub-layer and freeze all other parameters so that the BTBA model can keep the knowledge learned from masked target word prediction. We also tested optimizing all parameters of the BTBA model, without parameter freezing, during FCBO. Figure 2 shows how the AER results on the DeEn test set changed during FCBO with and without parameter freezing. Without freezing any parameters during FCBO, the AER result (the red curve) first increased a little, then decreased sharply, and soon increased again. In contrast, when we froze most of the parameters, the AER result (the blue curve) decreased stably and eventually reached better results (16.3%) than without parameter freezing (16.7%). Note that the results in Figure 2 are computed based on the full target context, i.e., the target sentence is not masked.

As we explained in Section 4.3, we use the symmetrized word alignments inferred from our unsupervised BTBA models as labelled data to further fine-tune each unidirectional BTBA model for the alignment task through supervised training. We also tested using unidirectional BTBA alignments, instead of symmetrized BTBA alignments, as labelled data for supervised training. Figure 4 (the blue curve) shows how the performance of the forward BTBA model on the DeEn task changes during supervised training when using unidirectional alignments inferred from itself (the forward BTBA model) as labelled training data; this demonstrates that the forward BTBA model can be significantly improved through supervised training even when the training data is inferred from itself and is not improved by alignment symmetrization. Figure 4 also shows that using symmetrized alignments for supervised training (the red curve) did achieve better results than using unidirectional alignments.
Supervised training with GIZA++ alignments also improved the BTBA model (compare Table 4 and Table 6), even though GIZA++ produced worse alignments (24.2% in Table 3) than the forward BTBA model.

Alignment Error Analysis
We analyze the alignment errors produced by our system and find that most of them are caused by function words. As shown in the alignment example in Figure 3, corresponding source and target content words (e.g. "definiert" and "defines") are all correctly aligned by our model, but function words such as "the", "im" and "wird" are not correctly aligned. To give a more detailed analysis, we compute AER results of our model for 3 different types of alignments: FF (alignments between two function words), CC (alignments between two content words) and FC (alignments between a function word and a content word). 8 Table 7 shows that our models achieved significantly better results for CC alignments than for FF and FC alignments. Function words are more difficult to align than content words, most likely because content words in a parallel sentence pair usually have very clear corresponding relations (e.g. "defines" clearly corresponds to "definiert" in Figure 3), whereas function words (such as "the", "es" and "im") are used more flexibly and frequently have no clear corresponding words in parallel sentences, which increases the alignment difficulty significantly.

Dictionary-Guided NMT via Word Alignment
For downstream tasks, word alignment can be used to improve dictionary-guided NMT (Song et al., 2020;Chen et al., 2020). Specifically, at each decoding step in NMT, Chen et al. (2020) used a SHIFT-AET method to compute word alignment for the newly generated target word and then revised the newly generated target word by encouraging the pre-specified translation from the dictionary. The SHIFT-AET alignment method adds a separate alignment module to the original Transformer translation model (Vaswani et al., 2017) and trains the separate alignment module using alignments induced from the attention weights of the original Transformer. To test the effectiveness of our alignment method for improving dictionary-guided NMT, we used the alignments inferred from our BTBA models as labelled data for supervising the SHIFT-AET alignment module and performed dictionary-guided translation for the German↔English language pair following Chen et al. (2020)'s work. Table 8 gives the translation results of dictionary-guided NMT and shows that our alignment method led to higher translation quality compared to the original SHIFT-AET method.

Conclusion
This paper presents a novel BTBA model for unsupervised learning of the word alignment task. Our BTBA model predicts the current target word by paying attention to the source context and to both left-side and right-side target context to produce accurate target-to-source attention (alignment). We further fine-tune the target-to-source attention in the BTBA model to obtain better alignments using a full context based optimization method and self-supervised training. We test our method on three word alignment tasks and show that our method outperforms both previous neural alignment methods and the popular statistical word aligner GIZA++.