Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment

Word alignment, which aims to extract lexicon translation equivalents between source and target sentences, serves as a fundamental tool for natural language processing. Recent studies in this area have yielded substantial improvements by generating alignments from the contextualized embeddings of pre-trained multilingual language models. However, we find that existing approaches capture few interactions between the input sentence pairs, which severely degrades word alignment quality, especially for words that are ambiguous in the monolingual context. To remedy this problem, we propose Cross-Align to model deep interactions between the input sentence pairs: the source and target sentences are encoded separately with shared self-attention modules in the shallow layers, while cross-lingual interactions are explicitly constructed by cross-attention modules in the upper layers. In addition, to train our model effectively, we propose a two-stage training framework, where the model is trained with a simple Translation Language Modeling (TLM) objective in the first stage and then finetuned with a self-supervised alignment objective in the second stage. Experiments show that the proposed Cross-Align achieves state-of-the-art (SOTA) performance on four out of five language pairs.


Introduction
Word alignment, which aims to extract the lexicon translation equivalents between the input source-target sentence pairs (Brown et al., 1993; Zenkel et al., 2019; Garg et al., 2019; Jalili Sabet et al., 2020), has been widely used in machine translation (Och and Ney, 2000; Arthur et al., 2016; Yang et al., 2020, 2021), transferring text annotations (Fang and Cohn, 2016; Huck et al., 2019), typological analysis (Lewis and Xia, 2008), generating adversarial examples (Lai et al., 2022), etc. Statistical word aligners based on the IBM translation models (Brown et al., 1993), such as GIZA++ (Och and Ney, 2003) and FastAlign (Dyer et al., 2013), have remained popular over the past thirty years for their good performance. Recently, with the advancement of deep neural models, neural aligners have developed rapidly and surpassed the statistical aligners on many language pairs. Typically, these neural approaches can be divided into two branches: Neural Machine Translation (NMT) based aligners and Language Model (LM) based aligners.

* Work done when Siyu was interning at Pattern Recognition Center, WeChat AI, Tencent Inc, China.
† Yufeng Chen is the corresponding author.
1 The code is publicly available at: https://github.com/lisasiyu/Cross-Align
NMT based aligners (Garg et al., 2019; Zenkel et al., 2020; Chen et al., 2020, 2021; Zhang and van Genabith, 2021) take alignments as a by-product of NMT systems by using attention weights to extract alignments. As NMT models are unidirectional, two NMT models (source-to-target and target-to-source) are required to obtain the final alignments, which makes NMT based aligners less efficient. In contrast, LM based aligners generate alignments from the contextualized embeddings of directionless multilingual language models. They extract contextualized embeddings from LMs and induce alignments based on the matrix of embedding similarities (Jalili Sabet et al., 2020; Dou and Neubig, 2021). Although these aligners improve alignment quality and efficiency, we find that existing LM based aligners capture few interactions between the input source-target sentence pairs. Specifically, SimAlign (Jalili Sabet et al., 2020) encodes the source and target sentences separately without attending to the context in the other language. Dou and Neubig (2021) further propose Awesome-Align, which considers the cross-lingual context by taking the concatenation of the sentence pairs as input during training, but still encodes them separately during inference.
However, the lack of interaction between the input source-target sentence pairs severely degrades alignment quality, especially for words that are ambiguous in the monolingual context. Figure 1 presents an example of our reproduced results from Awesome-Align. The ambiguous Chinese word "以" has two different meanings: 1) a preposition ("to", "as", "for" in English), and 2) the abbreviation of the word "以色列" ("Israel" in English). In this example, the word "以" is misaligned to "to" and "for" because the model does not fully consider the word "Israel" in the target sentence. Intuitively, the cross-lingual context is very helpful for resolving such meaning confusion in word alignment.
Based on the above observation, we propose Cross-Align, which fully considers the cross-lingual context by modeling deep interactions between the input sentence pairs. Specifically, Cross-Align encodes the monolingual information of the source and target sentences independently with shared self-attention modules in the shallow layers, and then explicitly models deep cross-lingual interactions with cross-attention modules in the upper layers. In addition, to train Cross-Align effectively, we propose a two-stage training framework, where the model is trained with the simple TLM objective (Conneau and Lample, 2019) to learn cross-lingual representations in the first stage, and then finetuned with a self-supervised alignment objective to bridge the gap between training and inference in the second stage. We conduct extensive experiments on five different language pairs, and the results show that our approach achieves SOTA performance on four out of five language pairs. Compared to existing approaches that apply many complex training objectives, our approach is simple yet effective.
Our main contributions are summarized as follows:

[...] and van Genabith (2021) propose self-supervised models that take advantage of the full context on the target side and achieve SOTA results. Although NMT based aligners achieve promising results, they still have some disadvantages: 1) the inherent discrepancy between the translation task and word alignment is not eliminated, so the reliability of the attention mechanism remains in doubt (Li et al., 2019); 2) since NMT models are unidirectional, NMT models in both directions are required to obtain the final alignment, which is inefficient.

LM based Aligner
Recent pre-trained multilingual language models like mBERT (Devlin et al., 2019) and XLM-R (Conneau and Lample, 2019) achieve promising results on many cross-lingual transfer tasks (Liang et al., 2020; Hu et al., 2020; Wang et al., 2022a,b). Jalili Sabet et al. (2020) prove that multilingual LMs are also helpful in word alignment task and propose SimAlign to extract alignments

[Figure 2: architecture diagrams built from Self-Attention, Cross-Attention, and Feed Forward blocks; panel (c) shows Cross-Align.]

[...] (2021) show that the self-attention module tends to focus on each sentence's own context while ignoring the paired context, leading to few attention patterns across languages in the self-attention module. 2) During inference, they still encode the language pairs individually, which makes the cross-lingual context unavailable when generating alignments. Therefore, Awesome-Align models few interactions between cross-lingual pairs. Based on the above observation, we propose Cross-Align, which aims to model deep interactions of cross-lingual pairs to solve these problems.

Method
In this section, we first introduce the model architecture and then illustrate how we extract alignments from Cross-Align. Finally, we describe the proposed two-stage training framework in detail.

Model Architecture
As shown in Figure 2(c), Cross-Align is composed of a stack of $m$ self-attention modules and $n$ cross-attention modules (Vaswani et al., 2017). Given a sentence $x = \{x_1, x_2, \dots, x_I\}$ in the source language and its corresponding parallel sentence $y = \{y_1, y_2, \dots, y_J\}$ in the target language, Cross-Align first encodes them separately with the shared self-attention modules to extract the monolingual representations, and then generates the cross-lingual representations by fusing the source and target monolingual representations with the cross-attention modules. We elaborate on the self-attention module and the cross-attention module as follows.
Self-Attention Module. Each self-attention module contains a self-attention sub-layer and a fully connected feed-forward network (FFN). The attention function maps a query ($Q$) and a set of key-value ($K$-$V$) pairs to an output. For self-attention, all queries, keys, and values come from the same language. Formally, the output of a self-attention module in the $l$-th layer ($1 \le l \le m$) is calculated as:

$$\hat{H}^{l} = \mathrm{LN}\big(\mathrm{Att}(H^{l-1} W_s^Q,\; H^{l-1} W_s^K,\; H^{l-1} W_s^V) + H^{l-1}\big), \qquad H^{l} = \mathrm{LN}\big(\mathrm{FFN}(\hat{H}^{l}) + \hat{H}^{l}\big),$$

where $\mathrm{Att}(\cdot)$ is the attention function, $W_s^Q$, $W_s^K$, $W_s^V$ are parameter matrices of the self-attention module, $H^{l-1}$ is the output of the previous layer, and $\mathrm{LN}(\cdot)$ refers to the layer normalization operation. With the above stacked $m$ self-attention modules, we get the monolingual representations $M_x$ and $M_y$ when $H^0$ is set to the embeddings of $x$ and $y$, respectively.
Cross-Attention Module. Although the self-attention modules can effectively encode monolingual information, the interactive information between $x$ and $y$ is not explored. Recently, cross-attention modules have been successfully used to learn cross-modal interactions in multi-modal tasks (Wei et al., 2020; Li et al., 2021), which motivates us to leverage cross-attention modules for exploring cross-lingual interactions in word alignment.
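Specifically, each cross-attention module contains a cross-attention sub-layer and an FFN. Different from self-attention, the queries of cross-attention come from one language, while the keys and values come from the other language. Formally, the output of a cross-attention module in the $l$-th layer ($m < l \le m + n$) is computed as:

$$\hat{H}_x^{l} = \mathrm{LN}\big(\mathrm{Att}(H_x^{l-1} W_c^Q,\; H_y^{l-1} W_c^K,\; H_y^{l-1} W_c^V) + H_x^{l-1}\big), \qquad H_x^{l} = \mathrm{LN}\big(\mathrm{FFN}(\hat{H}_x^{l}) + \hat{H}_x^{l}\big),$$

$$\hat{H}_y^{l} = \mathrm{LN}\big(\mathrm{Att}(H_y^{l-1} W_c^Q,\; H_x^{l-1} W_c^K,\; H_x^{l-1} W_c^V) + H_y^{l-1}\big), \qquad H_y^{l} = \mathrm{LN}\big(\mathrm{FFN}(\hat{H}_y^{l}) + \hat{H}_y^{l}\big),$$

where $W_c^Q$, $W_c^K$, $W_c^V$ are parameter matrices of the cross-attention module, $H_x^{l-1}$ is the output of the previous layer corresponding to $x$, and $H_y^{l-1}$ is the output of the previous layer corresponding to $y$. With the above stacked $n$ cross-attention modules, we get the cross-lingual representations $C_x$ and $C_y$ by setting $H_x^m$ and $H_y^m$ to $M_x$ and $M_y$, respectively.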
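To make the difference between the two module types concrete, the following minimal sketch implements single-head scaled dot-product attention in plain Python. It is an illustration only, not the released implementation: the helper names (`attention`, `matmul`, `transpose`, `softmax`), the single head, the identity projection matrices, and the toy inputs are all assumptions, and the residual connection, layer normalization, and FFN are omitted. Passing the same matrix for both inputs gives self-attention; passing the representations of two different sentences gives cross-attention.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention(queries_from, keys_values_from, w_q, w_k, w_v):
    """Single-head attention: queries come from one sequence, keys/values from another.

    Returns the attended output and the attention weights.
    """
    q = matmul(queries_from, w_q)
    k = matmul(keys_values_from, w_k)
    v = matmul(keys_values_from, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


# Toy monolingual representations (hypothetical values): 2 source and 3 target tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
m_x = [[1.0, 0.0], [0.0, 1.0]]
m_y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Cross-attention in both directions: each side queries the other side.
c_x, w_xy = attention(m_x, m_y, identity, identity, identity)
c_y, w_yx = attention(m_y, m_x, identity, identity, identity)
```

In this sketch each source token receives a distribution over the three target tokens (`w_xy`), and each target token receives a distribution over the two source tokens (`w_yx`), which is the interaction that separate encoding cannot provide.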

Alignments Extraction
The proposed Cross-Align aims to extract alignments from the input sentence pair x and y, and we illustrate the extraction method as follows.
Firstly, we extract the hidden states $s = \{s_1, s_2, \dots, s_I\}$ and $t = \{t_1, t_2, \dots, t_J\}$ for $x$ and $y$, respectively. Secondly, we compute a similarity matrix $S_{I \times J}$ from the dot products between $s$ and $t$ and apply softmax normalization to convert $S_{I \times J}$ into the source-to-target probability matrix $P^f_{I \times J}$ and the target-to-source probability matrix $P^b_{I \times J}$. After that, we obtain the final alignment matrix $G_{I \times J}$ by taking the intersection of the two matrices following Dou and Neubig (2021):

$$G_{I \times J} = (P^f_{I \times J} > \tau) * (P^b_{I \times J} > \tau),$$

where $\tau$ is a threshold and $*$ denotes element-wise multiplication. $G_{ij} = 1$ means that $x_i$ and $y_j$ are aligned. Note that the alignments generated by Cross-Align at this point are at the BPE level. We follow previous work (Dou and Neubig, 2021; Zhang and van Genabith, 2021) to convert BPE-level alignments to word-level alignments by adding an alignment between a source word and a target word if any parts of these two words are aligned.
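The extraction procedure can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released code: the function names, the toy hidden states, and the choice of normalizing the source-to-target matrix over target positions (and the target-to-source matrix over source positions) are assumptions made for the example. The threshold value 0.15 is the second-stage setting reported in the implementation details.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def extract_bpe_alignments(s, t, tau):
    """Return the set of (i, j) BPE links kept by both directions.

    s: I x d source hidden states, t: J x d target hidden states.
    """
    sim = [[sum(a * b for a, b in zip(s_i, t_j)) for t_j in t] for s_i in s]
    # Source-to-target: normalize each row over the target positions.
    p_f = [softmax(row) for row in sim]
    # Target-to-source: normalize each column over the source positions,
    # then transpose back so that p_b is also I x J.
    columns = [softmax(list(col)) for col in zip(*sim)]
    p_b = [list(row) for row in zip(*columns)]
    return {
        (i, j)
        for i in range(len(s))
        for j in range(len(t))
        if p_f[i][j] > tau and p_b[i][j] > tau
    }


def to_word_level(bpe_links, src_word_of, tgt_word_of):
    """Two words are aligned if any of their BPE pieces are aligned."""
    return {(src_word_of[i], tgt_word_of[j]) for i, j in bpe_links}


# Hypothetical example: source BPE tokens 1 and 2 belong to the same word.
s = [[5.0, 0.0], [0.0, 5.0], [0.0, 5.0]]
t = [[5.0, 0.0], [0.0, 5.0]]
bpe_links = extract_bpe_alignments(s, t, tau=0.15)
word_links = to_word_level(bpe_links, src_word_of=[0, 1, 1], tgt_word_of=[0, 1])
```

Here the two source pieces of the second word both link to the second target token, and the word-level conversion merges them into a single link.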

Two-stage Training Framework
Following previous work (Devlin et al., 2019; Conneau and Lample, 2019), we randomly choose 15% of the token positions in both $x$ and $y$. Each chosen token is replaced with the [MASK] token 80% of the time, replaced with a random token 10% of the time, and left unchanged 10% of the time. The model is trained to predict the original tokens at the chosen positions based on the bilingual context. Thus, the training objective can be formulated as:

$$\mathcal{L}_{TLM} = -\log P(x \mid \hat{x}, \hat{y}; \theta_s, \theta_c) - \log P(y \mid \hat{x}, \hat{y}; \theta_s, \theta_c),$$

where $\hat{x}$ and $\hat{y}$ are the masked sentences for $x$ and $y$, respectively, $\theta_s$ denotes all the parameters of the $m$ self-attention modules, and $\theta_c$ represents the parameters of the $n$ cross-attention modules.
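The masking scheme can be illustrated with the short sketch below. It is a hedged example rather than the authors' code: the function name `mask_tokens`, the toy vocabulary, the fixed random seed, and the rule of selecting exactly `round(0.15 * length)` positions (at least one) are assumptions for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_ratio=0.15):
    """Return (masked_tokens, targets) following the 80/10/10 replacement rule.

    targets maps each chosen position to the original token to be predicted.
    """
    num_chosen = max(1, round(len(tokens) * select_ratio))
    chosen = rng.sample(range(len(tokens)), num_chosen)
    masked = list(tokens)
    targets = {}
    for pos in chosen:
        targets[pos] = tokens[pos]
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, targets


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
x = ["w%d" % k for k in range(20)]  # hypothetical source sentence, 20 tokens
y = ["v%d" % k for k in range(7)]   # hypothetical target sentence, 7 tokens

# Both sides of the pair are masked, so predictions rely on bilingual context.
x_hat, x_targets = mask_tokens(x, vocab, rng)
y_hat, y_targets = mask_tokens(y, vocab, rng)
```

The masked pair (`x_hat`, `y_hat`) is what the model sees during the first stage, while `x_targets` and `y_targets` hold the tokens it must recover.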
Stage2: Self-Supervised Alignment. In the first training stage, the model is trained with TLM by feeding the masked sentence pairs as input. However, the model is required to extract alignments from the original sentence pairs during inference. Therefore, there is a gap between training and inference, which may limit the alignment quality. To solve this problem, we propose a self-supervised alignment (SSA) objective in the second stage. SSA takes the alignments generated by the model trained in the first stage as gold labels and trains the model directly on the alignment task in this stage.
As previous studies (Jalili Sabet et al., 2020; Dou and Neubig, 2021) have shown that a middle layer of an LM tends to give better alignment performance than the last layer, we take the $c$-th layer of Cross-Align as the alignment layer on which to train the alignment objective, where $c$ ($1 \le c \le m + n$) is a hyper-parameter chosen from the experimental results.⁴ From the alignment layer of the first-stage model, we extract the 0-1 alignment labels $G_{I \times J}$ according to the extraction method described in Section 3.2.⁵ $P^f_{I \times J}$ and $P^b_{I \times J}$ denote the source-to-target and target-to-source probability matrices extracted from the alignment layer of the current model, respectively. Following Garg et al. (2019), we optimize the alignment objective by minimizing the cross-entropy loss between the labels $G_{I \times J}$ and these probability matrices. To alleviate catastrophic forgetting of the knowledge learned with TLM, we finetune only the alignment layer and freeze the other layers of the model. With the SSA objective, Cross-Align directly learns the word alignment task in the alignment layer instead of masked language modeling, making training consistent with the inference process. During inference, we extract hidden states from the alignment layer and obtain the final alignments.

⁴ The analysis of the alignment layer is conducted in Section 5.2.
⁵ The alignment labels are now at the word level, while the SSA objective is at the BPE level, so we convert the labels back to the BPE level as follows: a source BPE token is aligned to a target BPE token if their corresponding source word and target word are aligned. In addition, a target BPE token is aligned with the [CLS] token if its corresponding target word is not aligned with any source word.
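The label conversion and the loss can be sketched as follows. The conversion follows the rule stated for the BPE-level labels, including the [CLS] fallback. The loss is only one plausible form of a cross-entropy over the two probability matrices: averaging over the positive links and over both directions is an assumption, since the exact normalization is not recoverable from the text. All function names and toy values are hypothetical.

```python
import math


def word_to_bpe_labels(word_links, src_word_of, tgt_word_of, cls_index=0):
    """Build the 0-1 BPE-level label matrix from word-level links.

    src_word_of[i] is the word index of source BPE token i; position
    cls_index holds the [CLS] token and has no word (None).
    """
    aligned_tgt_words = {tgt_word for _, tgt_word in word_links}
    labels = []
    for i, src_word in enumerate(src_word_of):
        row = []
        for tgt_word in tgt_word_of:
            if i == cls_index:
                # Unaligned target words are attached to [CLS].
                row.append(0 if tgt_word in aligned_tgt_words else 1)
            else:
                row.append(1 if (src_word, tgt_word) in word_links else 0)
        labels.append(row)
    return labels


def ssa_loss(labels, p_f, p_b, eps=1e-12):
    """Assumed form: mean negative log-probability of the labeled links,
    averaged over the source-to-target and target-to-source matrices."""
    total, count = 0.0, 0
    for i, row in enumerate(labels):
        for j, is_link in enumerate(row):
            if is_link:
                total -= 0.5 * (math.log(p_f[i][j] + eps) + math.log(p_b[i][j] + eps))
                count += 1
    return total / max(count, 1)


# Hypothetical pair: source = [CLS] + 3 BPE tokens (2 words), target = 3 words.
labels = word_to_bpe_labels(
    word_links={(0, 0), (1, 1)},
    src_word_of=[None, 0, 1, 1],
    tgt_word_of=[0, 1, 2],
)

toy_labels = [[1, 0], [0, 1]]
confident = [[0.9, 0.1], [0.1, 0.9]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss_confident = ssa_loss(toy_labels, confident, confident)
loss_uniform = ssa_loss(toy_labels, uniform, uniform)
```

Under this assumed form, probability matrices that agree with the first-stage labels receive a lower loss than uninformative ones, which is the behavior the second stage relies on.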

Experimental Settings
In this section, we first describe the details of datasets and implementation, then present the baselines, and finally introduce evaluation measures.

Datasets
We conduct our experiments on five publicly available datasets, including German-English (De-En), English-French (En-Fr), Romanian-English (Ro-En), Chinese-English (Zh-En), and Japanese-English (Ja-En). The training sets contain only parallel sentences without word alignment labels, while the development and test sets contain parallel sentences with gold word alignment labels annotated by experts. Table 1 gives the detailed data statistics.
Considering that De-En, En-Fr, and Ro-En do not have development sets, we use the Zh-En development set to tune the hyper-parameters for them.

Implementation Details
Our implementation is based on the code base released by Awesome-Align. We randomly initialize the parameters of the cross-attention modules and use the pre-trained mBERT-base (Devlin et al., 2019) to initialize the remaining parameters of Cross-Align. AdamW (Loshchilov and Hutter, 2019) is used as the optimizer, and the learning rate is set to 5e-4 and 1e-5 for the two training stages, respectively. The batch size per GPU is set to 12 and the number of gradient accumulation steps is set to 4. All models are trained on 8 NVIDIA Tesla V100 (32GB) GPUs. We train for 2 epochs on each language pair in the first stage and then finetune for 1 epoch in the second stage. The number of self-attention layers $m$ and the number of cross-attention layers $n$ are set to 10 and 2, respectively. The alignment layer $c$ is set to 11. In the first stage, the extraction threshold $\tau$ is set to 0.001. In the second stage, $\tau$ is set to 0.15.

Table 2: AER on the test sets with different alignment methods. The lower the AER, the better the performance. We highlight the best results for each language pair in bold. To make a fair comparison, we only report the results of all baselines under bilingual settings and without guidance from external word alignment tools. 'Awesome-Align (concatenation)' means the source and target sentences are concatenated as inputs during inference. '†' denotes results re-implemented with the released code where the original paper does not report them.
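For reference, the training settings reported in this subsection can be collected in a small configuration sketch. The class and variable names are hypothetical, and the effective batch size is derived under the assumption of standard data-parallel training, in which every GPU processes its own batch at each accumulation step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    learning_rate: float
    epochs: int
    tau: float  # extraction threshold reported for this stage


STAGE1_TLM = StageConfig(learning_rate=5e-4, epochs=2, tau=0.001)
STAGE2_SSA = StageConfig(learning_rate=1e-5, epochs=1, tau=0.15)

BATCH_PER_GPU = 12
GRAD_ACCUM_STEPS = 4
NUM_GPUS = 8

NUM_SELF_ATTN_LAYERS = 10   # m
NUM_CROSS_ATTN_LAYERS = 2   # n
ALIGNMENT_LAYER = 11        # c

# Assumption: data-parallel training, so sentence pairs per optimizer update
# is the product of the three quantities below.
effective_batch_size = BATCH_PER_GPU * GRAD_ACCUM_STEPS * NUM_GPUS

# With m = 10, layer 11 is the first of the two cross-attention layers.
alignment_layer_is_cross_attention = (
    NUM_SELF_ATTN_LAYERS < ALIGNMENT_LAYER <= NUM_SELF_ATTN_LAYERS + NUM_CROSS_ATTN_LAYERS
)
```

Under that assumption each parameter update sees 12 × 4 × 8 = 384 sentence pairs, and the total depth m + n = 12 matches mBERT-base.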

Baselines
To test the effectiveness of Cross-Align, we take the current three types of aligners as baselines.
Statistical methods:
• FastAlign (Dyer et al., 2013) and GIZA++ (Och and Ney, 2003) are two popular statistical aligners that implement the IBM models.

NMT based methods:
• Zenkel et al. (2019) and Zenkel et al. (2020) propose to add an extra attention layer on top of an NMT model, which produces translations and alignments simultaneously.
• Garg et al. (2019) propose a multi-task framework to align and translate with transformer models jointly.
• SHIFT-AET: Chen et al. (2020) induce alignments when the to-be-aligned target token is the decoder input instead of the output.
• Zhang and van Genabith (2021) predict the target word based on the source and both left-side and right-side target context to produce attention.
• MASK-ALIGN: Chen et al. (2021) propose a self-supervised word alignment model that takes advantage of the full context on the target side.

LM based methods:
• SimAlign: Jalili Sabet et al. (2020) extract alignments from multilingual pre-trained language models without using parallel training data.
• Awesome-Align: Dou and Neubig (2021) further finetune multilingual pre-trained language models on parallel corpora to get better alignments.

Evaluation Measures
Alignment Error Rate (AER) is the standard evaluation measure for word alignment (Och and Ney, 2003). The quality of an alignment $A$ is computed by:

$$\mathrm{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|},$$

where $S$ (sure) are unambiguous gold alignments and $P$ (possible) are ambiguous gold alignments.
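As a worked illustration of the measure, the sketch below computes AER over sets of (source index, target index) links. The function name and the toy links are hypothetical. It assumes the usual convention that every sure link also counts as a possible link, and it returns 0 when both the prediction and the sure set are empty.

```python
def aer(predicted, sure, possible):
    """Alignment Error Rate over sets of (source_index, target_index) links."""
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s  # assumed convention: sure links are also possible
    if not a and not s:
        return 0.0
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Hypothetical gold annotation: two sure links and one possible link.
sure_links = {(0, 0), (1, 1)}
possible_links = {(2, 1)}

# All predicted links are either sure or possible, and all sure links are found.
aer_perfect = aer({(0, 0), (1, 1), (2, 1)}, sure_links, possible_links)

# One of two sure links is missed and one predicted link is wrong.
aer_half = aer({(0, 0), (1, 2)}, sure_links, set())
```

In the second case one of the two predicted links is correct and one of the two sure links is recovered, so the numerator is 1 + 1 and the denominator is 2 + 2.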

Results and Analysis

Main Results
Table 2 compares the performance of Cross-Align against statistical aligners, NMT based aligners, and LM based aligners. We can see that Cross-Align significantly outperforms the statistical method GIZA++ by 2.1~12.6 AER points across different language pairs. Compared to other LM based aligners, Cross-Align also achieves substantial improvements on all datasets. For example, on the Ja-En dataset, Cross-Align achieves a 3.0 AER point improvement over Awesome-Align, demonstrating that modeling cross-lingual interactions based on the bilingual context is crucial for improving alignment quality. Compared to the strong NMT baselines with more parameters, Cross-Align still achieves the best results on all language pairs except Ro-En. We suppose the reason is that the parameters of the cross-attention modules are initialized randomly, and the Ro-En data is too small to train these parameters sufficiently, resulting in unsatisfactory results compared to NMT based methods. We tried initializing the cross-attention modules with the self-attention parameters of mBERT, but the results were not as good as with random initialization. We will investigate word alignment on low-resource language pairs in future work.

Analysis
Ablation Study. To understand the importance of the two-stage training objective, we conduct an ablation study by training multiple versions of the alignment model with some training stages removed. For all models, we extract the alignments from the alignment layer. The experimental results are shown in Table 3. The naive Cross-Align without training on the parallel corpus performs very poorly (see line "None" in Table 3), mainly because the cross-attention modules are initialized randomly. The TLM objective plays a critical role in training Cross-Align, since it greatly improves alignment quality across all language pairs (see line "+TLM" in Table 3). In the second stage, the SSA objective further improves the performance by 0.9~2.4 AER points (see line "++SSA"). This shows that bridging the gap between training and inference is helpful to the final alignment performance.
Number of Cross-Attention Layers. Since the self-attention and cross-attention modules play different roles in the final alignments, we are curious about how the number of cross-attention layers affects the final alignment performance. We investigate this question by studying the alignment performance with different $n$, where $n$ ranges from 0 to 12 with an interval of 2. Meanwhile, we keep $m + n = 12$ to ensure that Cross-Align has the same number of layers as mBERT. Figure 3 shows the AER results on the dev sets with different $n$.
For a more comprehensive analysis, we also show the results on the test sets for language pairs without dev sets. As shown in Figure 3, Cross-Align degenerates into the separate encoding framework when $n = 0$ and achieves poor alignment performance. This shows that modeling the cross-lingual interactions is very helpful for enhancing alignment performance. Additionally, when $n = 12$, the performance drops sharply, which shows that the monolingual representations built by the self-attention modules are necessary for the following cross-attention modules to generate reliable cross-lingual representations. Almost all of the language pairs achieve the best performance when $n$ is set around 2, and there is a trade-off between the numbers of self-attention and cross-attention layers.
Alignment Layer. After training with TLM, we need to decide the alignment layer $c$ used to generate alignments. Figure 5 shows the AER results with $c$ varying from 0 to 12. We observe that Cross-Align obtains the best performance when $c$ is set around 11. This observation is consistent with previous studies (Jalili Sabet et al., 2020; Conneau et al., 2020). For Cross-Align, the contextual representations in the lower self-attention layers are too language-specific to achieve high-quality alignments. In the upper cross-attention layers, the contextual representations are too specialized for masked language modeling. The contextualized representations in the middle of the cross-attention layers contain rich cross-lingual knowledge that helps generate high-quality alignments.

Case Study
In Figure 4, we present two examples produced by different alignment methods on the Zh-En test set. In the first example, Cross-Align correctly aligns the ambiguous Chinese word "以" to "Israel" and "促使" to "work for" based on the bilingual context, but Awesome-Align does not. In the second example, the word "chinese" occurs twice in the target sentence with different meanings. Due to the lack of cross-lingual context, Awesome-Align cannot distinguish the two occurrences and wrongly aligns "中国" to both of them, whereas Cross-Align gives correct alignments. This demonstrates that learning interaction knowledge between the source-target sentence pairs is beneficial to word alignment.

Conclusion
This paper presents a novel LM based aligner named Cross-Align, which models deep interactions between the input sentence pairs. Cross-Align first encodes the source and target sentences separately with the shared self-attention modules in the shallow layers, then explicitly constructs cross-lingual interactions with the cross-attention modules in the upper layers. Additionally, we propose a simple yet effective two-stage training framework, where the model is first trained with a simple TLM objective and then finetuned with a self-supervised alignment objective. Experimental results show that Cross-Align achieves new SOTA results on four out of five language pairs. In future work, we plan to improve the alignment quality on more low-resource language pairs.

Limitations
Although the proposed Cross-Align has achieved promising results, we find it still has two main limitations. Firstly, Cross-Align has limited performance on low-resource language pairs like Ro-En and Ja-En, as shown in Table 2. We hypothesize the reason is that the cross-attention modules of Cross-Align are randomly initialized, so they need a large amount of data to train. We tried initializing them with the self-attention parameters of mBERT, but the results were not as good as with random initialization. Secondly, we find that current LM based aligners, including Cross-Align, perform poorly on phrase alignments. As shown in the second example in Figure 4, "academy of science" is a phrase that should be aligned to the Chinese word "科学院", but Cross-Align aligns only part of it. This is because Cross-Align generates subword-level alignments without considering word-level and phrase-level information. In future work, we will investigate these two limitations and further improve the quality of alignments.

Figure 1: An example from Dou and Neubig (2021). There is a misalignment between "以" and "to" and "for". Red boxes denote the gold alignments.
This sub-section describes the proposed two-stage training framework. In the first stage, the model is trained with TLM to learn the cross-lingual representations. After the first training stage, the model is finetuned with a self-supervised alignment objective to bridge the gap between training and inference.
Stage1: Translation Language Modeling. TLM is a simple training objective first proposed by Conneau and Lample (2019) for learning cross-lingual representations of LMs. Since Cross-Align aims to learn interactions between the input sentence pairs, TLM is a suitable objective for training Cross-Align effectively. Unlike Conneau and Lample (2019), who train the TLM objective with self-attention modules, Cross-Align applies the cross-attention modules to force the model to infer the masked tokens based on the cross-lingual representations $C_x$ and $C_y$, encouraging deep interactions between the input sentence pair.

Figure 3: Word alignment performance with different numbers of cross-attention layers n.

Figure 4: Two examples from the Zh-En alignment test set. (a) Gold alignments. (b) Results of Awesome-Align. (c) Results of Cross-Align.

Figure 5: Word alignment performance across different alignment layers of Cross-Align in the first stage.

Table 1: The number of sentences in each dataset.

Table 3: Ablation studies on the two-stage training objective (columns: Objective, De-En, En-Fr, Ro-En, Zh-En, Ja-En). 'None' means the naive Cross-Align without further training on the parallel corpus. '+TLM' means training Cross-Align with the TLM objective. '++SSA' denotes further finetuning with the SSA objective after TLM.