Mask-Align: Self-Supervised Neural Word Alignment

Word alignment, which aims to align translationally equivalent words between source and target sentences, plays an important role in many natural language processing tasks. Current unsupervised neural alignment methods focus on inducing alignments from neural machine translation models, which do not leverage the full context in the target sequence. In this paper, we propose Mask-Align, a self-supervised word alignment model that takes advantage of the full context on the target side. Our model masks out each target token and predicts it conditioned on both the source and the remaining target tokens. This masking-and-predicting process is based on the assumption that the source token contributing most to recovering the masked target token should be aligned to it. We also introduce an attention variant called leaky attention, which alleviates the problem of unexpectedly high cross-attention weights on special tokens such as periods. Experiments on four language pairs show that our model outperforms previous unsupervised neural aligners and obtains new state-of-the-art results.


Introduction
Word alignment is the task of finding corresponding words in a sentence pair (Brown et al., 1993) and used to be a key component of statistical machine translation (SMT; Koehn et al., 2003). Although word alignment is no longer explicitly modeled in neural machine translation (NMT; Bahdanau et al., 2015), it is often leveraged to interpret and analyze NMT models (Ding et al., 2017; Tu et al., 2016). Word alignment is also used in many other scenarios, such as imposing lexical constraints on the decoding process (Arthur et al., 2016; Hasler et al., 2018), improving automatic post-editing (Pal et al., 2017), and providing guidance for translators in computer-aided translation (Dagan et al., 1993).
Recently, unsupervised neural alignment methods have been studied and have outperformed GIZA++ (Och and Ney, 2003) on many alignment datasets (Garg et al., 2019; Zenkel et al., 2020; Chen et al., 2020). However, these methods are trained with a translation objective, which computes the probability of each target token conditioned on the source tokens and the previous target tokens. This introduces noisy alignments when the prediction is ambiguous (Figure 1(a)). To alleviate this problem, previous studies modify the Transformer (Vaswani et al., 2017) by adding alignment modules that re-predict the target token (Zenkel et al., 2019, 2020) or by computing an additional alignment loss on the full target sequence (Garg et al., 2019). Moreover, Chen et al. (2020) propose an extraction method that induces alignments when the to-be-aligned target token is the decoder input.
Although these methods have demonstrated their effectiveness, they have two drawbacks. First, they retain the translation objective, which is not tailored for word alignment. Consider the example in Figure 1(a). When predicting the target token "Tokyo", the translation model may wrongly generate "1968" as it only considers the previous context, which results in an incorrect alignment link ("1968", "Tokyo"). Better modeling is needed to obtain more accurate alignments. Second, they need an additional guided alignment loss to outperform GIZA++, which requires inducing alignments for the entire training corpus.
In this paper, we propose a self-supervised model specifically designed for the word alignment task, namely Mask-Align. Our model masks each target token and recovers it with the source and the rest of the target tokens. For example, as shown in Figure 1(b), the target token "Tokyo" is masked and re-predicted. During this process, our model can identify that only the source token "Tokio" has not been translated yet, so the to-be-predicted target token "Tokyo" is aligned to "Tokio". Compared with translation modeling, this masked modeling approach is much more closely related to word alignment, and it allows our model to generate more accurate predictions and alignments.
To summarize, the main contributions of our work are as follows: • We propose a novel model for the word alignment task that masks and recovers each target token in parallel. Compared with NMT-based alignment models, our model can leverage more context and find more accurate alignments for the to-be-predicted token.
• We introduce a variant of attention called leaky attention that is more suitable for alignment extraction. Leaky attention reduces the unexpectedly high attention weights on special tokens.
• By encouraging agreement between two directional models, we consistently outperform the state-of-the-art on four language pairs without any guided alignment loss.

NMT-based Alignment Models
An NMT model can be utilized to measure the correspondence between source token x_j and target token y_i, providing an alignment score matrix S ∈ R^{I×J}, where each element S_ij represents the relevance between y_i and x_j. We can then extract the alignment matrix A accordingly, for example by linking each target token to its highest-scoring source token:

A_ij = 1 if j = argmax_{j'} S_{ij'}, and A_ij = 0 otherwise,

where A_ij = 1 indicates that y_i is aligned to x_j.
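As an illustrative sketch of this extraction step (the row-wise argmax rule here is just one common choice; actual methods may threshold or symmetrize instead):

```python
import numpy as np

def extract_alignment(S: np.ndarray) -> np.ndarray:
    """Turn an I x J score matrix S into a hard alignment matrix A by
    linking each target token y_i to its highest-scoring source token."""
    I, _ = S.shape
    A = np.zeros_like(S, dtype=int)
    A[np.arange(I), S.argmax(axis=1)] = 1
    return A

# toy score matrix for 3 target and 3 source tokens
S = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
A = extract_alignment(S)
```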
There are two types of methods for obtaining S. The first evaluates the importance of x_j to the prediction of y_i through feature importance measures such as prediction difference (Li et al., 2019), gradient-based saliency (Ding et al., 2019), or norm-based measurement (Kobayashi et al., 2020). While these methods provide principled ways to extract alignments from an NMT model without any parameter update or architectural modification, their results do not surpass statistical methods such as GIZA++.
The second type of method refers to the cross-attention weights W_ij between the encoder and decoder. There are two different ways to extract alignments from attention weights. The first treats W_ij as the relevance between the source token x_j and the output target token (i.e., S_ij = W_ij) and extracts alignments from top decoder layers near the network output (Garg et al., 2019; Zenkel et al., 2019). The second takes W_ij as the alignment probability between the source token and the input target token, i.e., S_ij = W_{i+1,j} (Chen et al., 2020; Kobayashi et al., 2020).

Disentangled Context Masked Language Model
A conditional masked language model (CMLM) predicts a set of target tokens y_mask given a source text x and part of the target text y_obs (Ghazvininejad et al., 2019). The original CMLM randomly selects y_mask among the target tokens and only predicts this subset in a forward pass. Kasai et al. (2020) extended it with an alternative objective called Disentangled Context (DisCo), which predicts every target token y_i given an arbitrary subset of the other tokens (denoted y_obs^i). As directly computing P(y_i | x, y_obs^i) with a vanilla Transformer requires sequential time, they modify the Transformer decoder to predict all target tokens in parallel. For target token y_i, they separate the query input q_i from the key input k_i and value input v_i, and only update q_i across decoder layers:

q_i^l = h_i^{l-1},   k_i = v_i = w_i + p_i,

where q_i^l represents the query input in the l-th layer, and w_i and p_i denote the word and position embeddings of y_i, respectively. The attention output h_i^l in the l-th layer is computed from the contexts corresponding to the observed tokens y_obs^i:

h_i^l = Attention(q_i^l, {(k_j, v_j) : y_j ∈ y_obs^i}),

where h_i^0 = Project(p_i). The DisCo Transformer efficiently models target tokens conditioned on both past and future context, and has succeeded in non-autoregressive machine translation.

Method
We introduce Mask-Align, a self-supervised neural alignment model (see Figure 2). Different from NMT-based alignment models, our model masks each target token in parallel and recovers it given the context of the source and the remaining target tokens. In this process, the source tokens that are most helpful for recovery are identified as the alignments of the to-be-predicted target token.

Modeling
We propose to model the target token y_i conditioned on the rest of the target tokens y\y_i and the source sentence x, giving the probability of the target sentence y given x as:

p_θ(y|x) = ∏_{i=1}^{I} p_θ(y_i | y\y_i, x).

We argue that this kind of modeling is more effective for word alignment extraction. For one thing, our method has higher prediction accuracy because future context is considered, which has proved helpful for alignment extraction (Zenkel et al., 2020; Chen et al., 2020). For another, we hypothesize that in this way our model can better identify the aligned source tokens, since only their information is missing from the target.
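To make the objective concrete, here is a minimal sketch of the naive way to score it, with one forward pass per masked position; `score_fn` is a hypothetical stand-in for a seq2seq model returning per-position log-probabilities:

```python
import torch
import torch.nn.functional as F

def masked_nll(score_fn, x, y, mask_id):
    """-sum_i log p(y_i | y \\ y_i, x), computed with I separate passes.
    score_fn(x, y_masked) -> (I, vocab) log-probabilities (hypothetical)."""
    total = torch.tensor(0.0)
    for i in range(len(y)):
        y_masked = y.clone()
        y_masked[i] = mask_id      # hide only position i
        log_probs = score_fn(x, y_masked)
        total = total - log_probs[i, y[i]]
    return total

# toy stand-in model: uniform prediction over a 10-word vocabulary
uniform = lambda x, y: F.log_softmax(torch.zeros(len(y), 10), dim=-1)
x, y = torch.tensor([1, 2]), torch.tensor([3, 4, 5])
nll = masked_nll(uniform, x, y, mask_id=0)   # 3 tokens * log(10)
```

This O(I)-passes scheme is exactly the cost that the parallel decoder described next avoids.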
Directly computing p_θ(y_i | y\y_i, x) with a vanilla Transformer requires I separate forward passes, which is unacceptably expensive. Inspired by Kasai et al. (2020), we modify the Transformer decoder to perform the forward passes concurrently. This is done by separating the query inputs from the key and value inputs in the decoder self-attention layers. In each layer, we update the query input at each position by attending to the keys and values at the other positions. To prevent the model from simply copying the representations from the inputs, we set the query inputs of the first decoder layer to the position embeddings, and keep the key and value inputs fixed to the sum of position and token embeddings. We name this variant of attention Static-KV Attention.
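A minimal single-head sketch of this static-KV scheme (projection matrices and multi-head machinery omitted; the function and variable names are ours, not the paper's code):

```python
import torch
import torch.nn.functional as F

def static_kv_self_attention(q, kv):
    """One static-KV self-attention step (single head, no projections).

    q:  (I, d) query states, updated from layer to layer
    kv: (I, d) key/value states, fixed to word + position embeddings
    Position i is masked from its own key/value so the model cannot
    simply copy the embedding of the token it must predict."""
    d = q.size(-1)
    scores = q @ kv.transpose(0, 1) / d ** 0.5                 # (I, I)
    self_mask = torch.eye(q.size(0), dtype=torch.bool)
    scores = scores.masked_fill(self_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ kv

torch.manual_seed(0)
w, p = torch.randn(4, 8), torch.randn(4, 8)  # word/position embeddings
kv = w + p          # keys and values stay fixed across layers
h = p               # first-layer queries: position embeddings only
for _ in range(2):  # each decoder layer updates only the queries
    h = static_kv_self_attention(h, kv)
```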
Another modification is that we remove the cross-attention in all but the last decoder layer, so that information flows from source to target only in the last layer. Our experiments demonstrate that this modification reduces the number of model parameters and improves alignment results.
Formally, given the word embedding w_i and position embedding p_i of target token y_i, the output h_i^l of the l-th decoder self-attention layer for y_i is computed as:

h_i^l = Attention(h_i^{l-1}, {(k_j, v_j) : j ≠ i}),   with k_j = v_j = w_j + p_j,

where h_i^0 = p_i is the input to the first decoder layer. The output of the last self-attention layer, h_i^L, is used to compute cross-attention with the encoder outputs. We use the cross-attention weights W to induce alignments.

Leaky Attention
We found that extracting alignments from vanilla cross-attention suffers from unexpectedly high attention weights on specific source tokens such as periods, [EOS], or other high-frequency tokens. Hereinafter, we refer to these tokens as attractor tokens. As a result, if we compute alignments from the cross-attention weights, many target tokens are wrongly aligned to attractor tokens.
This phenomenon has been studied in previous work (Clark et al., 2019; Kobayashi et al., 2020). Kobayashi et al. (2020) also show that the norms of the transformed value vectors of attractor tokens are usually small, so their influence on the attention output is actually limited. We believe this happens because vanilla attention does not account for untranslatable tokens, which are often aligned to a special NULL token in statistical alignment models (Brown et al., 1993). As a result, attractor tokens are implicitly treated as the NULL token.
We propose to explicitly model the NULL token with a modified attention, namely Leaky Attention. Leaky attention provides an extra position, in addition to the attention memory, for the target tokens to attend to. Specifically, we parameterize the key and value vectors of the leaky position in the cross-attention as k_NULL and v_NULL, and concatenate them with the transformed vectors of the encoder outputs. The attention output z_i is then computed as follows:

z_i = softmax( q_i [k_NULL; K_enc]^T / √d_model ) [v_NULL; V_enc],   with q_i = h_i^L W^Q,

where W^Q is the query projection matrix, and K_enc and V_enc are the projected keys and values of the encoder outputs. We initialize k_NULL and v_NULL from a normal distribution with mean 0 and standard deviation 1/√d_model. When extracting alignments, we only consider the attention matrix without the leaky position.
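A single-head sketch of leaky attention under the same simplifications (no projection matrices; shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

d_model = 8
# parameterized NULL key/value, initialized from N(0, 1/sqrt(d_model))
k_null = torch.randn(1, d_model) / d_model ** 0.5
v_null = torch.randn(1, d_model) / d_model ** 0.5

def leaky_attention(q, K_enc, V_enc):
    """Cross-attention with one extra leaky position for NULL alignments.
    Returns the attention output z and the weights W over the real
    source positions only; rows of W no longer need to sum to one."""
    K = torch.cat([k_null, K_enc], dim=0)
    V = torch.cat([v_null, V_enc], dim=0)
    scores = q @ K.transpose(0, 1) / q.size(-1) ** 0.5
    probs = F.softmax(scores, dim=-1)
    z = probs @ V
    W = probs[:, 1:]   # drop the leaky column before alignment extraction
    return z, W

q = torch.randn(3, d_model)                  # 3 target positions
K_enc, V_enc = torch.randn(5, d_model), torch.randn(5, d_model)
z, W = leaky_attention(q, K_enc, V_enc)
```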
Note that leaky attention is different from adding a special token to the source sequence, which would only act as another attractor token and share the high weights with the existing one instead of removing it (Vig and Belinkov, 2019). Our parameterized method is also more flexible than Leaky-Softmax (Sabour et al., 2017), which adds an extra dimension with a value of zero to the routing logits.
With leaky attention, our model can capture more accurate alignment scores between source and target. In Section 3.3, we will show that this kind of attention is also helpful for agreement training.

Agreement
To better utilize the attention weights from the models in the two directions, we apply an agreement loss during training to improve the symmetry of our model, which has proved effective in statistical alignment models (Liang et al., 2006). Given a parallel sentence pair (x, y), we can obtain attention weights from the two directions, denoted W_{x→y} and W_{y→x}. As alignment is bijective, W_{x→y} is supposed to equal the transpose of W_{y→x}. We encourage this symmetry through an agreement loss:

L_a = MSE(W_{x→y}, W_{y→x}^T),

where MSE denotes the mean squared error.
For vanilla attention, this loss can hardly approach zero because of the normalization constraint. Suppose we have x = "Falsch", y = "Not true", and gold alignment A = [1, 1]; the optimal attention weights are W_{x→y} = [1.0, 1.0] and W_{y→x} = [0.5, 0.5] because of the column normalization, resulting in a minimum L_a of 0.25. Leaky attention is able to achieve a lower agreement loss because the column sums are not strictly equal to one. We assume this kind of flexibility is helpful for agreement training.
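The "Falsch" / "Not true" example can be checked numerically; the sketch below reproduces the 0.25 floor that row-normalized attention cannot get under:

```python
import torch

def agreement_loss(W_xy, W_yx):
    """L_a = MSE(W_xy, W_yx^T), averaged over all entries."""
    return torch.mean((W_xy - W_yx.transpose(0, 1)) ** 2)

# x = "Falsch", y = "Not true": both target words align to the one source word.
# With softmax-normalized rows, the y->x model can only split its mass:
W_xy = torch.tensor([[1.0], [1.0]])   # shape (2 target, 1 source)
W_yx = torch.tensor([[0.5, 0.5]])     # shape (1 target, 2 source)
floor = agreement_loss(W_xy, W_yx)    # 0.25, the minimum under normalization
```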
However, since we relax the constraints in the cross-attention, our model may converge to a degenerate case of zero agreement loss in which the attention weights are all zero except at the leaky position. We avoid this by introducing an entropy loss on the attention weights:

L_{e,x→y} = − ∑_{i=1}^{I} ∑_{j=1}^{J} W̃_{ij}^{x→y} log W̃_{ij}^{x→y},

where W̃_{ij}^{x→y} is the renormalized attention weight and λ is a hyperparameter. Similarly, we have L_{e,y→x} for the inverse direction.
We jointly train two directional models with the standard NLL losses L_{x→y} and L_{y→x}, the agreement loss L_a, and the entropy losses L_{e,x→y} and L_{e,y→x}. The overall loss L is:

L = L_{x→y} + L_{y→x} + α L_a + β (L_{e,x→y} + L_{e,y→x}),

where α and β are hyperparameters.
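A sketch of the combined objective; the exact renormalization in the entropy term (and where the paper's λ enters it) is our assumption, and all names are illustrative:

```python
import torch

def entropy_loss(W, eps=1e-9):
    """Entropy of renormalized attention rows (leaky column removed).
    Penalizing high entropy blocks the degenerate solution where all
    mass leaks to the NULL position and real weights go to zero."""
    W_tilde = W / (W.sum(dim=-1, keepdim=True) + eps)
    return -(W_tilde * torch.log(W_tilde + eps)).sum(dim=-1).mean()

def total_loss(nll_xy, nll_yx, W_xy, W_yx, alpha=5.0, beta=1.0):
    """L = L_xy + L_yx + alpha * L_a + beta * (L_e,xy + L_e,yx)."""
    agreement = torch.mean((W_xy - W_yx.transpose(0, 1)) ** 2)
    entropy = entropy_loss(W_xy) + entropy_loss(W_yx)
    return nll_xy + nll_yx + alpha * agreement + beta * entropy

W_xy = torch.tensor([[0.8, 0.1], [0.1, 0.7]])  # leaky column already dropped
W_yx = torch.tensor([[0.9, 0.1], [0.2, 0.6]])
L = total_loss(torch.tensor(2.0), torch.tensor(2.1), W_xy, W_yx)
```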
When extracting alignments, we first compute the alignment score S_ij for y_i and x_j from the attention weights W_{ij}^{x→y} and W_{ji}^{y→x} of the two directional models:

S_ij = W_{ij}^{x→y} · W_{ji}^{y→x}.

We then extract the alignment links whose S_ij exceeds a threshold θ.
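A sketch of this symmetrized extraction; the multiplicative combination of the two directional matrices is our assumption for illustration, and θ = 0.2 follows the settings reported later:

```python
import torch

def extract_bidirectional(W_xy, W_yx, threshold=0.2):
    """Combine directional attention matrices into S (here elementwise:
    S_ij = W_xy[i, j] * W_yx[j, i]) and keep links above the threshold."""
    S = W_xy * W_yx.transpose(0, 1)
    return (S > threshold).int()

W_xy = torch.tensor([[0.9, 0.1], [0.2, 0.8]])  # target x source
W_yx = torch.tensor([[0.8, 0.1], [0.2, 0.9]])  # source x target
A = extract_bidirectional(W_xy, W_yx)
```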

Settings
We implement our model based on the Transformer architecture (Vaswani et al., 2017). The encoder consists of 6 standard Transformer encoder layers, and the decoder is composed of 6 layers. All decoder layers contain static-KV self-attention, while only the last layer computes leaky attention. We use an embedding size of 512, a hidden size of 1024, and 4 attention heads, and share the input and output embeddings of the decoder.
We train the models with a batch size of 36K tokens and perform early stopping based on the prediction accuracy on the validation data. All models are trained in two directions without any alignment supervision. We tune the hyperparameters via grid search on the Chinese-English validation set, as it contains gold word alignments. In all of our experiments, we set λ = 0.05 (Eq. 16), α = 5, β = 1 (Eq. 17), and θ = 0.2. We evaluate alignment quality with the Alignment Error Rate (AER; Och and Ney, 2000).
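For reference, AER can be computed from a hypothesis link set A and the gold sure (S) and possible (P) link sets as 1 − (|A∩S| + |A∩P|)/(|A| + |S|); a minimal sketch:

```python
def aer(hyp, sure, possible):
    """Alignment Error Rate (Och and Ney, 2000).
    hyp, sure, possible: sets of (target, source) index pairs;
    `possible` is assumed to include the sure links."""
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))

hyp = {(0, 0), (1, 1), (2, 1)}
sure = {(0, 0), (1, 1)}
possible = sure | {(2, 2)}
score = aer(hyp, sure, possible)   # 1 - (2 + 2) / (3 + 2) = 0.2
```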

Baselines
As described in Section 2.1, neural alignment models induce alignments either from attention weights or through feature importance measures. We compare our method with the attention-based methods because (1) we also extract alignments from attention weights and (2) these methods achieve the best alignment results. We thus introduce the following neural baselines in addition to the two statistical baselines fast-align and GIZA++: • Attention (all): induces alignments from the attention weights of the best (usually the penultimate) decoder layer of a vanilla Transformer.
• Attention (last): same as Attention (all) except that only the last layer performs cross-attention.
• AddSGD (Zenkel et al., 2019): adds an extra alignment layer to re-predict the to-be-aligned target token.
• Mtl-Fullc (Garg et al., 2019): the method that supervises a single attention head conditioned on full target context with symmetrized Attention (all) alignments in a multi-task manner.
• BAO (Zenkel et al., 2020): an improved version of AddSGD that first extracts alignments with bidirectional attention optimization and then uses them to retrain the alignment layer with a guided alignment loss.

Table 2: Norms of the transformed value vectors for the example in Figure 3. We mark the minimum norm for each variant of attention with boldface.
• Shift-Att (Chen et al., 2020): induces alignments when the to-be-aligned target token is the decoder input instead of the output.
For convenience, we will use Masked to denote the method that only uses the masked modeling described in Section 3.1, and Mask-Align to denote the one that additionally uses leaky attention and agreement training.

Table 1 shows the comparison of Mask-Align and all baselines on four datasets. Our approach significantly outperforms all baselines on all datasets. Specifically, it improves over GIZA++ by 1.7-8.0 AER points across different language pairs without using any guided alignment loss, making it a good substitute for this commonly used statistical alignment tool. Compared to Attention (all), we achieve a gain of 13.4-17.5 AER points with fewer parameters (as we removed some cross-attention sublayers) and no additional modules, showing the effectiveness of our method. When compared with the state-of-the-art neural baselines, Mask-Align consistently outperforms BAO, the best method that extracts alignments for output target tokens, by 2.2-4.4 AER points, demonstrating that our modeling method is more suitable for word alignment than translation modeling.

(Zenkel et al., 2020)  16.3  5.0  23.4
Shift-AET (Chen et al., 2020)  15.4  4.7  21.2

Comparison with Guided Training Results
Mask-Align  14.5  4.4  19.5

Table 3: Comparison of Mask-Align with other methods using guided alignment loss.
Shift-AET trains an additional alignment module with supervision from symmetrized Shift-Att alignments. Table 3 shows the performance of Mask-Align and baselines that use guided training. As we can see, Mask-Align performs better than all baselines. Note that our method is simpler and faster than these methods with guided training. To compute the guided alignment loss, they have to induce alignments for the entire training set first, which is computationally expensive. In contrast, our method is much more efficient because we only need one training pass. We also tried guided training on our approach and observed no further improvements.
We attribute this to the use of the agreement loss and leave further investigation to future work.

Figure 3 shows the attention weights from vanilla and leaky attention, and Table 2 presents the norms of the transformed value vectors of each source token for the two types of attention. For vanilla attention, we see large weights on the high-frequency token "der" and a small norm for its transformed value vector. As a result, the target token "in" is wrongly aligned to "der". For leaky attention, we observe a similar phenomenon at the leaky position "[NULL]", and "in" is not aligned to any source token since the weights on all source tokens are small. This confirms our hypothesis that attractor tokens arise because of untranslatable tokens. For the target token "in", there is no corresponding translation in the source. However, vanilla attention cannot make all attention weights for "in" small given the normalization constraint. Instead, the attractor tokens receive high weights because they have little impact on the attention output due to their small norms. In contrast, our leaky attention handles untranslatable tokens better as it explicitly models the NULL token, making it more suitable for word alignment.

Table 4: Ablation study on the German-English dataset. The second column lists the number of decoder layers that perform cross-attention.

Table 4 shows the ablation results on the German-English dataset. All results are symmetrized. We first compare our masked modeling (Masked) with the vanilla Transformer in two settings in which all or only the last decoder layer contains a cross-attention sublayer. The results show that limiting the encoder-decoder interaction to a single layer improves alignment quality, and our masked modeling outperforms vanilla translation modeling by 2.0-3.0 AER points.
Leaky attention brings an additional gain of 7.9 AER points, and agreement training on top of it further improves the result by 3.1 AER points, achieving the best result and showing the effectiveness of these two techniques.

Ablation Studies

Analysis
Prediction & Alignment We analyze the relation between the correctness of word-level prediction and that of alignment. Specifically, we consider four cases: correct prediction and correct alignment (cPcA), correct prediction and wrong alignment (cPwA), and analogously wPcA and wPwA. We regard a word as correctly predicted if any of its subwords is correct. For words with more than one possible alignment, we consider a word correctly aligned if one possible alignment is matched. The results are shown in Figure 4. Our masked method has higher prediction accuracy and significantly reduces the alignment errors caused by wrong predictions.

Removing End Punctuation To further investigate the performance of leaky attention, we test an extraction method that excludes some of the influence of attractor tokens. Specifically, we remove the attention weights on the end punctuation of a source sentence. In our preliminary experiments, when the source sentence contains end punctuation, it is treated as the attractor token in most cases. Therefore, removing it alleviates the impact of attractor tokens to a certain extent. Table 5 shows the results. With leaky attention, removing the end punctuation brings no improvement in alignment quality. However, without leaky attention, removing the end punctuation yields a gain of 7.7 AER points. This suggests that leaky attention can effectively avoid the problem of attractor tokens.

Case Study

Figure 5 shows the attention weights from four different models for the example in Figure 1. As discussed in Section 1, the NMT-based methods may be confused when predicting the target token "1968" in this example. From the attention weights, we can see that (b) and (c) indeed put high weights wrongly on "tokio" in the source sentence. Another observation is that in the attention map of (c), "[EOS]" behaves the same as the period, which supports our claim in Section 3.2 that simply adding a special token to the source sequence as the NULL token does not work. For our methods (d) and (e), the attention weights are highly consistent with the gold alignment: our methods provide sparse and accurate attention weights. Comparing (d) and (e), we notice that (e) eliminates some small noise present in (d). We attribute this to the bidirectional agreement training.

Figure 5: Attention weights from different models for the example in Figure 1. Gold alignment is shown in (a). For target token "1968", the NMT-based methods (b) and (c) assign high weights to the wrongly aligned source token "tokio", while our masked methods only focus on the correct source token "1968".

Related Work
Neural Alignment Model Some neural alignment models use gold-standard alignment data. Stengel-Eskin et al. (2019) introduce a discriminative model that uses a dot-product distance measure between source and target representations to predict alignment labels. Nagata et al. (2020) first transform the task of word alignment into a question answering task and then use multilingual BERT to solve it. This line of research suffers from the lack of human-annotated alignment data. Therefore, many studies focus on alignment extraction without gold data (Tamura et al., 2014; Legrand et al., 2016). Alkhouli et al. (2016) present neural translation and alignment models trained using silver-standard alignments obtained from GIZA++. Peter et al. (2017) propose a target foresight approach and use silver-standard alignments to perform guided alignment training. These methods are not satisfactory in terms of alignment quality.
Recently, many studies have induced alignments from an NMT model. Garg et al. (2019) apply the guided alignment loss to a single attention head with silver-standard alignments from GIZA++. Zenkel et al. (2019, 2020) introduce an additional alignment module on top of the NMT model and also use guided training. Chen et al. (2020) propose an extraction method that induces alignments when the to-be-aligned target token is the decoder input. However, all previous methods adopt a translation objective during training. Moreover, they outperform GIZA++ only with guided training, which requires inducing alignments for the entire training set. Our method is fully self-supervised with a masked modeling objective and outperforms all of these unsupervised methods.
Masked Language Model Pre-trained masked language models (MLMs; Devlin et al., 2019) have been successfully applied to many NLP tasks such as natural language understanding (Wang et al., 2018) and text generation (Lewis et al., 2020). The idea has also been adopted in many advanced NLP models. Ghazvininejad et al. (2019) introduce a conditional masked language model (CMLM) to perform parallel decoding for non-autoregressive machine translation. The CMLM can leverage both previous and future context on the target side for sequence-to-sequence tasks through the masking mechanism. Kasai et al. (2020) extend it with a disentangled context Transformer that predicts every target token, rather than a subset, conditioned on an arbitrary context. Our masked modeling method is inspired by CMLMs, as such a masking-and-predicting process is highly related to word alignment. To the best of our knowledge, this is the first work that incorporates a CMLM objective into alignment models.

Conclusion
In this paper, we propose a self-supervised neural alignment model, Mask-Align. Different from NMT-based methods, our model adopts a novel masked modeling objective that is more suitable for word alignment. Moreover, Mask-Align alleviates the problem of high attention weights on special tokens by introducing leaky attention. Experiments show that Mask-Align achieves new state-of-the-art results without any guided alignment loss. We leave extending our model to a semi-supervised setting for future work.