Third-Party Aligner for Neural Word Alignments

Word alignment is the task of finding translationally equivalent words between source and target sentences. Previous work has demonstrated that self-training can achieve competitive word alignment results. In this paper, we propose to use word alignments generated by a third-party word aligner to supervise neural word alignment training. Specifically, the source word and target word of each word pair aligned by the third-party aligner are trained to be close neighbors in the contextualized embedding space when fine-tuning a pre-trained cross-lingual language model. Experiments on benchmarks of various language pairs show that our approach can, surprisingly, self-correct the third-party supervision by finding more accurate word alignments and deleting wrong ones, leading to better performance than various third-party word aligners, including the current best one. When we integrate the supervisions from all third-party aligners, we achieve state-of-the-art word alignment performance, with alignment error rates on average more than two points lower than those of the best third-party aligner. We release our code at https://github.com/sdongchuanqi/Third-Party-Supervised-Aligner.

Word alignment is usually inferred by GIZA++ (Och and Ney, 2003) or FastAlign (Dyer et al., 2013), which are based on the statistical IBM word alignment models (Brown et al., 1993). Recently, neural methods have been applied to infer word alignments. They use NMT-based frameworks to induce alignments from attention weights or feature importance measures, and surpass statistical word aligners such as GIZA++ on a variety of language pairs (Li et al., 2019; Garg et al., 2019; Zenkel et al., 2019, 2020; Chen et al., 2020; Song et al., 2020a,b; Chen et al., 2021).
Inspired by the success of large-scale cross-lingual language model (CLM) pre-training (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020), pre-trained contextualized word embeddings have also been explored for the word alignment task, either by extracting alignments directly from the pre-trained contextualized embeddings (Sabet et al., 2020) or by fine-tuning the pre-trained CLMs with self-training to obtain new contextualized embeddings appropriate for extracting word alignments (Dou and Neubig, 2021). Based on a careful design of self-training objectives, the fine-tuning approach achieves competitive word alignment results (Dou and Neubig, 2021).
In this paper, we use simple supervision instead of self-training to fine-tune the pre-trained CLMs. The supervision is derived from a third-party word aligner. Given a parallel corpus, the third-party word aligner predicts the word alignments over the corpus, which are used as the supervision signal for the fine-tuning. In particular, for each aligned word pair in a parallel sentence pair, the contextualized embeddings of the source and target words are trained to have high cosine similarity in the embedding space.
As illustrated by Figure 1, the cosine similarities between the source and target words of the correct word alignments are not particularly high before the fine-tuning. The third-party word aligner provides some correct word alignments (e.g., the alignments associated with "that", "must", and "be") along with wrong ones (e.g., the alignments associated with "primary" and "objective") as the supervision. Although the supervision is imperfect, it is still helpful for driving the contextualized embeddings of the source and target words of a correct word alignment closer in the embedding space during fine-tuning. Surprisingly, with imperfect third-party supervision in fine-tuning, the heat map of the cosine similarities exhibits a clearer split between correct and wrong word alignments than without fine-tuning. Wrong alignments of the third-party aligner are rectified after fine-tuning (e.g., the "primary" and "objective" associated alignments), and the alignment that was incorrect before fine-tuning (e.g., the "be" associated alignment) is also rectified.
We perform experiments on word alignment benchmarks of five different language pairs. The results show that the proposed third-party supervision approach outperforms all third-party word aligners. When we integrate the supervisions from all third-party word aligners, we achieve state-of-the-art performance across all benchmarks, with an average alignment error rate two points lower than that of the best third-party word aligner.

Approach
Formally, the word alignment task can be defined as finding a set of word pairs for a sentence pair ⟨s, t⟩, where s denotes the source sentence s_1, ..., s_n, and t denotes the corresponding target sentence t_1, ..., t_m parallel to s. The set of word pairs is

A = {⟨s_i, t_j⟩ | s_i ∈ s, t_j ∈ t},

where in each word pair ⟨s_i, t_j⟩, s_i and t_j are translationally equivalent to each other within the context of the sentence pair.
In the following, we describe how we obtain word alignments by fine-tuning pre-trained CLMs. Different from previous work that fine-tunes by self-training (Dou and Neubig, 2021), we supervise the fine-tuning process with third-party word alignments.

Third-Party Supervision
Large-scale CLM pre-training has achieved impressive performance across various NLP tasks (Libovický et al., 2019; Hu et al., 2020). As an outcome of the pre-trained CLMs, contextualized word embeddings can represent words in semantic context across different languages. By further fine-tuning the CLMs, the contextualized embeddings of the source and target words of a word alignment can become closer in the embedding space, which makes it easier to identify word alignments according to the simple geometry of the embedding space for each pair of parallel sentences.
We propose to fine-tune the pre-trained CLMs with supervision from a third-party word aligner. Figure 2 shows the overall fine-tuning framework. For a source sentence s_1, s_2, s_3, s_4 and its corresponding target sentence t_1, t_2, t_3, we stack the CLM over them to obtain the contextualized word embeddings hs = hs_1, hs_2, hs_3, hs_4 and ht = ht_1, ht_2, ht_3 for the source and target sides, respectively. Since the CLM models sentences of different languages in the same contextualized embedding space, it is easy to construct a similarity matrix by directly computing the cosine similarities between hs and ht:

M_{i,j} = cos(hs_i, ht_j).

In the matrix, word pairs with higher similarities are deemed word alignments. Let A′ denote the word alignments generated by the third-party word aligner. The CLM is fine-tuned with the supervision of A′ so that M is consistent with A′. Although the third-party supervision A′ is not perfect, we observe in the experiments that the fine-tuning can proceed with self-correction of the imperfect A′.
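The similarity matrix can be sketched in a few lines of pure Python (function names are ours for illustration; real contextualized embeddings would come from the CLM's hidden states):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(hs, ht):
    """M[i][j] = cos(hs_i, ht_j) over source/target contextualized embeddings."""
    return [[cosine(h_src, h_tgt) for h_tgt in ht] for h_src in hs]

# Toy 2-dimensional "embeddings" standing in for CLM hidden states
hs = [[1.0, 0.0], [0.0, 1.0]]  # source words
ht = [[1.0, 0.0], [1.0, 1.0]]  # target words
M = similarity_matrix(hs, ht)
```

In practice hs and ht would be high-dimensional hidden-state vectors, but the geometry of the matrix is the same.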
The supervision is bidirectional:

P_s2t(i, j) = e^{M_{i,j}} / Σ_{j'=1}^{m} e^{M_{i,j'}},   P_t2s(i, j) = e^{M_{i,j}} / Σ_{i'=1}^{n} e^{M_{i',j}},

L = Σ_{⟨s_i, t_j⟩ ∈ A′} ( P_s2t(i, j) + P_t2s(i, j) ),   (1)

where P_s2t denotes the probability of the source-to-target alignment between s_i and t_j, computed by softmax over the ith row of M, and P_t2s denotes the probability of the target-to-source alignment between t_j and s_i, computed by softmax over the jth column of M. n and m denote the lengths of the source and target sentences, respectively. We aim to maximize L, which sums the bidirectional probabilities of the word pairs in the A′ supervision.
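A minimal pure-Python sketch of the bidirectional probabilities and the objective above (in practice this would be computed over CLM hidden states with a deep learning framework; names are ours):

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def bidirectional_probs(M):
    """P_s2t: softmax over each row of M; P_t2s: softmax over each column."""
    p_s2t = [softmax(row) for row in M]
    cols = [softmax([row[j] for row in M]) for j in range(len(M[0]))]
    p_t2s = [[cols[j][i] for j in range(len(M[0]))] for i in range(len(M))]
    return p_s2t, p_t2s

def objective(M, supervision):
    """L = sum over supervised pairs (i, j) of P_s2t(i, j) + P_t2s(i, j)."""
    p_s2t, p_t2s = bidirectional_probs(M)
    return sum(p_s2t[i][j] + p_t2s[i][j] for i, j in supervision)
```

Fine-tuning raises M_{i,j} for supervised pairs, which is exactly what maximizing this objective rewards.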
Through the above training objective, CLM is fine-tuned to generate the contextualized embeddings suitable for building the similarity matrices to extract word alignments.

Word Alignment Prediction
Given a new pair of parallel sentences in the test set, we can predict its word alignments based on the CLM fine-tuned on the parallel training corpus. In particular, for the sentence pair, the source-to-target probability matrix M_s2t, which consists of the probabilities P_s2t, and the target-to-source probability matrix M_t2s, which consists of the probabilities P_t2s, are first computed using the fine-tuned CLM. Then the set of word alignments A is deduced from the intersection of the two matrices:

A = {⟨s_i, t_j⟩ | M_s2t(i, j) > c and M_t2s(i, j) > c},

where c is a threshold. Only word pairs whose source-to-target and target-to-source alignment probabilities both exceed c are deemed predicted word alignments.
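The bidirectional intersection can be sketched as follows (a minimal illustration; `p_s2t` and `p_t2s` stand for the probability matrices computed by the fine-tuned CLM):

```python
def extract_alignments(p_s2t, p_t2s, c=0.1):
    """Keep word pair (i, j) only if both directional alignment
    probabilities exceed the threshold c."""
    n_src, n_tgt = len(p_s2t), len(p_s2t[0])
    return {(i, j)
            for i in range(n_src)
            for j in range(n_tgt)
            if p_s2t[i][j] > c and p_t2s[i][j] > c}

# Toy 2x2 probability matrices: the diagonal pairs survive the threshold
aligns = extract_alignments([[0.9, 0.1], [0.05, 0.95]],
                            [[0.8, 0.2], [0.1, 0.9]], c=0.5)
```

Requiring both directions to agree trades a little recall for precision, mirroring the classic intersection heuristic of bidirectional aligners.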

Integrating various Third-Party Supervisions
Different third-party word aligners exhibit different behaviors in their word alignment results. We integrate the word alignments produced by the various aligners into one set of supervisions for the fine-tuning process, to test whether they can be combined to further improve performance. We first group all third-party aligners' output alignments into one union. Then we utilize the union in two categories of methods: filtering and weighting.
The filtering method abandons word alignments in the union that have low consistency between the aligners, and only keeps the alignments that a majority of the aligners agree on. The remaining word alignments are used to supervise the fine-tuning process. Since different aligners achieve different performances, we assign a credit to each aligner based on its performance on the development set (i.e., the negative alignment error rate on the development set), and then normalize the credits of all aligners by softmax. Consequently, each word alignment ⟨s_i, t_j⟩ in the union has a total credit

Credit_total(i, j) = Σ_{k=1}^{K} Credit_k · 1[⟨s_i, t_j⟩ ∈ A′_k],

where A′_k denotes the set of word alignments of the kth third-party word aligner, K denotes the number of third-party word aligners, 1[·] is the indicator function, and Credit_k is the credit of the kth aligner after softmax. Credit_total represents the degree of agreement between the aligners. Only word alignments whose Credit_total is greater than a threshold f are kept for the subsequent fine-tuning.
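A minimal sketch of the credit-based filtering (function and variable names are ours; the dev-set AER of each aligner stands in for its performance):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def filter_union(aligner_alignments, dev_aers, f=0.45):
    """Credit of each aligner = softmax of its negative dev-set AER.
    A union pair's total credit sums the credits of the aligners that
    propose it; pairs whose total credit does not exceed f are abandoned."""
    credits = softmax([-aer for aer in dev_aers])
    union = set().union(*aligner_alignments)
    def total_credit(pair):
        return sum(c for a, c in zip(aligner_alignments, credits) if pair in a)
    return {pair for pair in union if total_credit(pair) > f}
```

With the paper's f = 0.45, a pair proposed by a single aligner is typically dropped, while a pair on which two or more comparably strong aligners agree is kept.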
Different from the filtering method, the weighting method considers all word alignments in the union, but puts weights on them during the fine-tuning. The Credit_total used in the filtering method is also adopted in the weighting method:

w_{i,j} = σ( (Credit_total(i, j) − f) / λ ),

where w_{i,j} is the weight of the word pair ⟨s_i, t_j⟩ and σ is the sigmoid function. When Credit_total exceeds the threshold f, the weight tends to 1; otherwise it tends to 0. λ is the hyper-parameter that controls the effect of the supervision integration. w_{i,j} is inserted into the fine-tuning objective L in equation (1) by simply replacing P_s2t(i, j) with w_{i,j} P_s2t(i, j), and replacing P_t2s(i, j) with w_{i,j} P_t2s(i, j).
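Assuming a sigmoid gate over Credit_total (the exact functional form is our reading of the description, so treat it as an assumption), the weighting can be sketched as:

```python
import math

def alignment_weight(credit_total, f=0.45, lam=0.5):
    """Soft gate on Credit_total: tends to 1 when credit_total exceeds
    the threshold f and to 0 otherwise; lam controls the sharpness.
    The sigmoid form is an assumption, not necessarily the paper's
    exact formula."""
    return 1.0 / (1.0 + math.exp(-(credit_total - f) / lam))
```

Unlike hard filtering, every union pair contributes to the objective, but low-agreement pairs contribute almost nothing.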

Handling Subwords
Subwords (Sennrich et al., 2016; Wu et al., 2016) are widely used in pre-training CLMs, and the fine-tuning process is conducted on the contextualized embeddings of the subwords. We therefore run all third-party word aligners at the subword level to get subword alignments, which are used to supervise the fine-tuning. During testing, we first obtain the subword alignments for the test-set sentence pairs, then convert them to word alignments following previous work (Sabet et al., 2020; Zenkel et al., 2020), which considers two words to be aligned if any of their subwords are aligned.
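The subword-to-word conversion can be sketched as follows (a minimal illustration; the mapping arrays from subword positions to word positions are assumed to come from the tokenizer):

```python
def subwords_to_words(subword_aligns, src_map, tgt_map):
    """Two words are aligned if any of their subwords are aligned.
    src_map[k] / tgt_map[k] give the word index of subword k."""
    return {(src_map[i], tgt_map[j]) for i, j in subword_aligns}

# Source word 1 splits into two subwords, so src_map = [0, 1, 1];
# the target stays whole, so tgt_map = [0, 1].
src_map = [0, 1, 1]
tgt_map = [0, 1]
word_aligns = subwords_to_words({(0, 0), (1, 1), (2, 1)}, src_map, tgt_map)
```

Duplicate word pairs produced by several subword links collapse naturally because the result is a set.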

Experiments
We test the proposed third-party supervised fine-tuning approach on word alignment tasks of five language pairs: Chinese-English (Zh-En), German-English (De-En), English-French (En-Fr), Romanian-English (Ro-En), and Japanese-English (Ja-En).

Datasets
We use the benchmark datasets of the five language pairs. They are utilized in two ways. For the third-party aligners, the whole training corpus for each language pair is used by each aligner. For our approach, only a fraction of the whole training corpus for each language pair is used in the fine-tuning phase.
Regarding the datasets for the third-party aligners, the configuration is the same as in previous work. The Zh-En training set is from the LDC corpus, which consists of 1.2M sentence pairs; the test and development sets are obtained from the TsinghuaAligner website (Liu et al., 2005). For the De-En, En-Fr, and Ro-En datasets, we follow the experimental setup of previous work (Zenkel et al., 2019, 2020) and use their pre-processing scripts (Zenkel et al., 2019) to get the training and test sets. The Ja-En dataset is obtained from the Kyoto Free Translation Task (KFTT) word alignment data (Neubig et al., 2011), in which sentences with fewer than 1 or more than 40 words are removed. The Japanese sentences are tokenized by the KyTea tokenizer (Neubig et al., 2011).
Regarding the datasets for our fine-tuning approach, we only use the first 80,000 sentence pairs of the whole training corpus for each language pair. Basically, the third-party supervision for these sentence pairs is extracted from the word alignments of the whole training corpus induced by the third-party aligner. We also test training the third-party aligner on just the 80,000 sentence pairs to provide the third-party supervision; the results are presented in Section 3.8. Besides, we also vary the data size for the fine-tuning, as shown in Section 3.5.
Table 1 presents the statistics of these datasets. Since De-En, En-Fr, and Ro-En have no manually aligned development sets, we take the last 1,000 sentences of the training data as the development sets (Ding et al., 2019).

Settings
Pre-trained Cross-lingual Language Models.
For fine-tuning, we investigate two types of pre-trained CLMs, namely mBERT and XLM (Conneau and Lample, 2019). mBERT is pre-trained over Wikipedia texts of 104 languages, with the same settings as Dou and Neubig (2021). For XLM, we have tried its two released models: 1) XLM-15 (MLM+TLM), which is pre-trained with the MLM and TLM objectives and supports 15 languages; 2) XLM-100 (MLM), which is trained with MLM and supports 100 languages. Specifically, for Zh-En, De-En, and En-Fr, which are among the 15 languages, we use XLM-15 (MLM+TLM), the same as Dou and Neubig (2021). For Ro-En and Ja-En, which are not covered by XLM-15 (MLM+TLM), we choose XLM-100 (MLM) instead, with the modification that XLM-100 (MLM) is further trained on the parallel training corpora of Ro-En and Ja-En with the TLM objective, to be consistent with XLM-15 (MLM+TLM). In the following, unless otherwise specified, XLM stands for XLM-15 or XLM-100 as appropriate.
The contextualized word embeddings are extracted from the hidden states of the ith layer of the pre-trained CLMs, where i is an empirically chosen hyper-parameter based on development set performance. For XLM-15, we use the 5th layer to extract the contextualized embeddings (Hewitt and Manning, 2019; Tenney et al., 2019), while for XLM-100 we use the 9th layer, and for mBERT the 8th layer. We directly use the subwords of the pre-trained CLMs, i.e., BPE subwords in XLM and WordPiece subwords in mBERT.
Training Setup and Hyper-parameters. We fine-tune the XLM and mBERT models for 10 epochs over the parallel fine-tuning corpus for each language pair, with a batch size of 8. We use AdamW (Loshchilov and Hutter, 2017) with a learning rate of 1e-5. The dropout rate is set to 0.1. The training process typically takes 2 to 3 hours. The hyper-parameters are tuned based on development set performance. The threshold c in the word alignment prediction is set to 1e-6 for Ro-En and 0.1 for the others. For the hyper-parameters in integrating the various third-party supervisions, f is set to 0.45 and λ is set to 0.5 for all language pairs.

Third-Party Word Aligners
We explore various third-party word aligners, ranging from statistical to neural approaches, to supervise the fine-tuning process. The aligners include:
• FastAlign (Dyer et al., 2013): a popular statistical word aligner which is an effective reparameterization of IBM Model 2.
• GIZA++ (Och and Ney, 2003): another popular statistical word aligner implementing the IBM models. We use the traditional settings of 5 iterations each for Model 1, the HMM model, Model 3, and Model 4.
• Eflomal (Östling and Tiedemann, 2016): an efficient statistical word aligner using a Bayesian model with Markov chain Monte Carlo inference.
• SimAlign (Sabet et al., 2020): a word aligner that directly uses static and contextualized embeddings of BERT to extract word alignments. We use its Argmax model with default settings.
• AwesomeAlign (Dou and Neubig, 2021): a neural word aligner that fine-tunes CLMs by self-training to produce contextualized embeddings suitable for word alignment.
• MaskAlign (Chen et al., 2021): a neural word aligner based on self-supervision, which masks each target token in parallel and predicts it conditioned on the remaining tokens on both sides to better model the alignment.
For language pairs not reported in the papers of the above third-party aligners, we run their released tools on the benchmark datasets to get the corresponding results. Specifically, for Zh-En, we run FastAlign, Eflomal, and SimAlign. For Ja-En, we run FastAlign, GIZA++, Eflomal, SimAlign, and MaskAlign. Because the evaluation in AwesomeAlign for Zh-En ignores manually labeled possible alignments, which is inconsistent with other works, we re-run AwesomeAlign for Zh-En and re-evaluate it taking the manually labeled possible alignments into account.

Main Results
The alignment error rate (AER) (Och and Ney, 2003) is used to evaluate the performances. The main results are summarized in Table 2. Compared to all third-party word aligners, which also serve as the baselines, our proposed approach achieves state-of-the-art performance across the five language pairs, with an average AER more than two points lower than that of the best third-party word aligner.
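For reference, AER compares the predicted alignment set A against the sure (S) and possible (P) gold alignments, with S ⊆ P: AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|). A minimal sketch:

```python
def aer(predicted, sure, possible):
    """Alignment Error Rate (Och and Ney, 2003); lower is better.
    `sure` should be a subset of `possible`."""
    a, s, p = set(predicted), set(sure), set(possible)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

A perfect prediction that recovers all sure links and stays inside the possible links yields an AER of 0.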
Table 2 presents the results of fine-tuning XLM; the results of fine-tuning mBERT are reported in Table 3. Both fine-tuning approaches perform better than the third-party word aligners. Since fine-tuning CLMs is conducted at the subword level, we need to adapt the third-party aligners for subwords. Given the parallel corpus of each language pair, we directly use the dictionary of the CLM to obtain the subwords of the corpus, then run each third-party aligner on the subword-segmented corpus. The adapted results are reported in both tables with the subscript "adapted" attached to each third-party aligner. For neural aligners such as MaskAlign, which already use subwords, the adaptation is still needed since the subwords of the pre-trained CLM are different.
Regarding the plain contextualized embeddings in XLM and mBERT, they can be directly aligned between the source and target languages by mining the closest neighbors in the universal embedding space, as shown in the "w/o Fine-tuning" rows in Tables 2 and 3 (Dou and Neubig, 2021). When we further fine-tune these embeddings supervised by the subword alignments produced by each adapted individual third-party aligner, we obtain significant improvements over each individual third-party aligner. Comparing fine-tuning to not fine-tuning (the "w/o Fine-tuning" rows), we find that fine-tuning generally performs better, except when fine-tuning with the supervision of FastAlign_adapted. Since FastAlign_adapted performs remarkably worse than not fine-tuning, it is hard for FastAlign_adapted to provide effective supervision for the fine-tuning. Since AwesomeAlign_adapted already fine-tunes the CLMs by self-training, continuing to fine-tune the CLMs with the supervision of AwesomeAlign_adapted does not gain improvements. Finally, when we integrate the supervisions from all third-party aligners, we achieve state-of-the-art AER. Details of integrating all supervisions are presented in Section 3.7. (We also tried more complicated adaptation approaches, such as decomposing word alignments into subword alignments or adding pooling layers that deal with word-level alignments, but they are not as effective as the simple adaptation above.)

The Effect of The Fine-tuning Corpus Size
Figure 3 presents how performance varies with the size of the parallel corpus used for the fine-tuning. As the fine-tuning corpus becomes larger, AER becomes lower across all five language pairs. The full corpus is identical to that used in training the third-party aligners. The curve for En-Fr is presented in the appendix due to space limitations. Usually 80k sentence pairs provide good supervision for the fine-tuning, with a limited margin to the performance of using the full corpus. Note that the performance of using 2k sentence pairs for fine-tuning is less than two points worse than that of using the full corpus, and even just 0.4 points worse for En-Fr.

Self-Correction Effect
Although the supervision from the third-party aligner is not perfect, we observe a self-correction effect: as the fine-tuning proceeds, more accurate word alignments beyond the third-party alignments are identified as they become closer in the embedding space, and some wrong word alignments of the third-party aligner are moved farther apart in the space, so that, we deem, they do not influence the fine-tuning process.
Figure 4 presents the self-correction effect. In this subsection, we include the test set in the fine-tuning set for a new fine-tuning run, to check the predicted alignments against the gold alignments. MaskAlign and XLM are used in this study. We first extract the MaskAlign results on the test set as part of the supervision for the fine-tuning. As the fine-tuning proceeds, we compute, on the test set, the precision of newly predicted alignments not included in the third-party alignments, denoted "New", and the rate of deleted alignments (third-party alignments not included in the predicted alignments) that are truly wrong among all deleted alignments, denoted "Del". Besides, we compute the precision of the alignments remaining from the third-party alignments, denoted "Remain". Figure 4 shows that "New" and "Del" increase as the fine-tuning proceeds, supporting the AER decrease in the experiment. "Remain" stays almost flat, indicating the stability of the fine-tuning process. The effect on En-Fr is shown in the appendix.
Table 4: AER of different integration methods.
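The three quantities above can be computed directly from the alignment sets; a minimal sketch (function and variable names are ours):

```python
def self_correction_metrics(third_party, predicted, gold):
    """New    = precision of predicted alignments absent from the third-party set.
    Del    = fraction of deleted third-party alignments that are truly wrong.
    Remain = precision of third-party alignments kept in the prediction."""
    tp, pred, g = set(third_party), set(predicted), set(gold)
    new, deleted, remain = pred - tp, tp - pred, tp & pred
    new_p = len(new & g) / len(new) if new else 0.0
    del_p = len(deleted - g) / len(deleted) if deleted else 0.0
    remain_p = len(remain & g) / len(remain) if remain else 0.0
    return new_p, del_p, remain_p
```

Rising "New" and "Del" values mean the model is both finding alignments the third-party aligner missed and discarding its genuinely wrong ones.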

Results of Integrating Various Third-Party Supervisions
Table 4 presents the comparison between the performances of different integration methods. Fine-tuning mBERT is applied in this study for its computational efficiency. First, we intersect the word alignments from all third-party aligners as supervision. Since the aligners perform differently, AER is impacted by the worst aligner, which results in a small number of word alignments in the intersection. In contrast, when we take the union of all third-party alignments, its performance is much better, but it still contains noise that hampers the AER results. When we use the filtering and weighting methods to deal with the noise, the integration achieves the best performances, surpassing all third-party aligners.
Ablation studies are shown in Table 5. Removing one aligner from the integration causes different performance variations. Removing MaskAlign impacts the integration performance most, since it is the best aligner on most language pairs.

Training Third-Party Aligners on The Same Parallel Corpus for The Fine-tuning
Although the fine-tuning approach only needs a small fraction of the whole parallel corpus for each language pair, e.g., 80k sentence pairs for the fine-tuning, its supervision is extracted from the alignments of a third-party aligner trained on the whole parallel corpus. In this subsection, we check whether training the third-party aligner on only the small corpus used in the fine-tuning seriously impacts the word alignment performance. Table 6 shows the result. Training MaskAlign on the small corpus seriously drags down AER performance compared to training on the full corpus, on average over 7 points worse than "MaskAlign_adapted" in Tables 2 and 3. Surprisingly, fine-tuning with such worse supervision can still achieve remarkably better performances, even surpassing or performing comparably to the strongest baseline system MaskAlign in Table 2. The reason for this phenomenon is that MaskAlign_adapted generates fewer but more accurate alignments, which are effective enough for the supervision. We also use 40k sentence pairs for this study; please refer to Appendix C.

Conclusion
We propose an approach of using third-party aligners for neural word alignment. Different from previous work based on a careful design of self-training objectives, we simply use the word alignments generated by the third-party aligners to supervise the training. Although the third-party word alignments are imperfect as supervision, we observe that the training process can self-correct the third-party word alignments by detecting more accurate word alignments and deleting wrong ones based on geometric similarity in the contextualized embedding space, leading to better performance than the third-party aligners. Integrating the various third-party supervisions improves the performance further, achieving state-of-the-art word alignment performance on benchmarks of multiple language pairs.

Limitations
The proposed third-party supervised fine-tuning approach is not applicable to using the best word alignments, which are generated by the integrated supervision in this paper, as a new supervision signal for continued fine-tuning. Such continual fine-tuning does not obtain significant improvement, which indicates the ineffectiveness of continual fine-tuning with the supervision of self-predicted alignments.
A The Corpus Size Effect and The Self-Correction Effect on En-Fr
The corpus size effect is presented in Figure 5. It shows the same trend as Figure 3, though the trend is not as significant for En-Fr. The self-correction effect is presented in Figure 6. The effect is the same as in the other four language pairs.

B Precision and Recall of Predicted Word Alignments
Besides AER, we also evaluate the word alignment predictions by computing precision and recall against the gold alignments in the test sets. MaskAlign is used in this study due to its best performance among the third-party aligners. Its word alignments are used to supervise the fine-tuning of mBERT.
The precision and recall are reported in Table 7.
It shows that precision is always significantly improved after the fine-tuning, while the recall improvement is less significant. On En-Fr and Ro-En, recall is slightly worse after the fine-tuning.

C 40k Sentence Pairs for Both Training The Third-Party Aligner and Fine-tuning
We use a smaller parallel corpus, consisting of 40k sentence pairs, for both training the third-party aligner and fine-tuning. Table 8 shows the result. The AER of MaskAlign_adapted deteriorates sharply compared to training it on 80k sentence pairs as shown in Table 6, but fine-tuning with such worse alignments as supervision still achieves better AER than not fine-tuning. We investigate the precision and recall of MaskAlign_adapted, listed in Table 9, and find that it always obtains high precision; these fewer but accurate alignments are useful supervision for the fine-tuning.

Figure 2: The framework of the fine-tuning with the third-party supervision.

Figure 3: The effect of the different sizes of parallel corpora for the fine-tuning.

Figure 5: The effect of the different sizes of the parallel corpus for En-Fr fine-tuning.

Figure 6: The self-correction effect in En-Fr fine-tuning process.

Table 1: Number of sentence pairs in the benchmark datasets. For De-En, En-Fr, and Ro-En, the development sets are taken from the training data, on which the aligner is self-tuned using the alignments it predicted in the last iteration. The other development sets and all test sets are manually aligned. The training sets do not contain manually labeled word alignments.

Table 2: AER results of the baseline systems and the systems fine-tuning XLM with the third-party supervisions. Lower AER is better. AVG denotes the average AER over the five language pairs.

Table 6: AER of fine-tuning XLM and mBERT with the third-party supervision generated by MaskAlign_adapted trained on the same small parallel corpus as used in the fine-tuning.

Table 7: Precision and recall of MaskAlign predictions and of fine-tuning mBERT supervised by MaskAlign.

Table 8: AER of fine-tuning mBERT with the third-party supervision generated by MaskAlign_adapted trained on 40k sentence pairs.

Table 9: Precision and recall of MaskAlign_adapted predictions with different sizes of parallel training corpus.