Multilingual Sentence Transformer as A Multilingual Word Aligner

Multilingual pretrained language models (mPLMs) have shown their effectiveness in multilingual word alignment induction. However, these methods usually start from mBERT or XLM-R. In this paper, we investigate whether the multilingual sentence Transformer LaBSE is a strong multilingual word aligner. This idea is non-trivial, as LaBSE is trained to learn language-agnostic sentence-level embeddings, while the alignment extraction task requires the more fine-grained word-level embeddings to be language-agnostic. We demonstrate that the vanilla LaBSE outperforms other mPLMs currently used in the alignment task, and then propose to finetune LaBSE on parallel corpora for further improvement. Experimental results on seven language pairs show that our best aligner outperforms previous state-of-the-art models of all varieties. In addition, our aligner supports different language pairs in a single model, and even achieves new state-of-the-art results on zero-shot language pairs that do not appear in the finetuning process.


Introduction
Word alignment aims to find the correspondence between words in parallel texts (Brown et al., 1993). It is useful in a variety of natural language processing (NLP) applications such as noisy parallel corpus filtering (Kurfalı and Östling, 2019), bilingual lexicon induction (Shi et al., 2021), code-switching corpus building (Lee et al., 2019; Lin et al., 2020) and incorporating lexical constraints into neural machine translation (NMT) models (Hasler et al., 2018; Chen et al., 2021b).
Recently, neural word alignment approaches have developed rapidly and outperformed statistical word aligners like GIZA++ (Och and Ney, 2003) and fast-align (Dyer et al., 2013). Some works (Garg et al., 2019; Li et al., 2019; Zenkel et al., 2019, 2020; Chen et al., 2020b; Zhang and van Genabith, 2021; Chen et al., 2021a) induce alignments from an NMT model or its variants. However, these bilingual models only support the language pair involved in the training process. They also treat the source and target sides differently, thus two models are required for bidirectional alignment extraction. Another line of work (Jalili Sabet et al., 2020; Dou and Neubig, 2021) builds multilingual word aligners with contextualized embeddings from multilingual pretrained language models (mPLMs; Wu and Dredze, 2019; Conneau et al., 2020). Thanks to the language-agnostic representations learned with the multilingual masked language modeling task, these methods are capable of inducing word alignments even for language pairs without any parallel corpus.

Figure 1:
Cosine similarities between subword representations in a parallel sentence pair, from the 8th layer of mBERT (left) and the 6th layer of LaBSE (right). Red boxes denote the gold alignments.
Different from previous methods, in this paper we present AccAlign, a more accurate multilingual word aligner built on the multilingual sentence Transformer LaBSE (Feng et al., 2022; see Figure 1). LaBSE is trained on a large-scale parallel corpus of various language pairs to learn language-agnostic sentence embeddings with contrastive learning. However, it is unclear whether LaBSE has learned language-agnostic word-level embeddings, which is the key to the success of word alignment extraction. Specifically, we first directly induce word alignments from LaBSE and demonstrate that LaBSE outperforms other mPLMs currently used in the alignment task. This indicates that LaBSE has implicitly learned language-agnostic word-level embeddings at some intermediate layer. Then we propose a simple and effective finetuning method to further improve performance. Empirical results on seven language pairs show that our best aligner outperforms previous SOTA models of all varieties. In addition, our aligner supports different language pairs in a single model, and even achieves new SOTA on zero-shot language pairs that do not appear in the finetuning process.

AccAlign

Background: LaBSE

LaBSE (Feng et al., 2022) is the state-of-the-art model for the cross-lingual sentence retrieval task. Given an input sentence, the model can retrieve the most similar sentence from candidates in a different language. LaBSE is first pretrained on a combination of masked language modeling (Devlin et al., 2019) and translation language modeling (Conneau and Lample, 2019) tasks. After that, it is effectively finetuned with a contrastive loss on 6B parallel sentences across 109 languages. We leave the training details of LaBSE to the appendix. However, as LaBSE does not include any word-level training loss when finetuning with the contrastive loss, it is unclear whether the model has learned high-quality language-agnostic word-level embeddings, which is the key for a multilingual word aligner.

Alignment Induction from LaBSE
To investigate whether LaBSE is a strong multilingual word aligner, we first induce word alignments from vanilla LaBSE without any modification or finetuning. This is done by utilizing the contextual embeddings from LaBSE. Specifically, consider a bilingual sentence pair x = ⟨x_1, x_2, ..., x_n⟩ and y = ⟨y_1, y_2, ..., y_m⟩. We denote the contextual embeddings from LaBSE as h_x = ⟨h_{x_1}, ..., h_{x_n}⟩ and h_y = ⟨h_{y_1}, ..., h_{y_m}⟩, respectively. Following previous work (Dou and Neubig, 2021; Jalili Sabet et al., 2020), we get the similarity matrix from the contextual embeddings:

$$S_{ij} = h_{x_i}^{\top} h_{y_j} \quad (1)$$

The similarity matrix is normalized for each row to get S_xy, which is treated as a probability matrix, as its i-th row represents the probabilities of aligning x_i to all tokens in y. The reverse probability matrix S_yx is computed similarly by normalizing each column of S. Taking the intersection of the two probability matrices yields the final alignment matrix:

$$A = (S_{xy} > c) \odot (S_{yx} > c) \quad (2)$$

where c is a threshold, ⊙ denotes element-wise multiplication, and A_{ij} = 1 indicates that x_i and y_j are aligned. The above method induces alignments on the subword level, which are converted into word-level alignments by aligning two words if any of their subwords are aligned, following Zenkel et al. (2020) and Jalili Sabet et al. (2020).
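To make the induction procedure concrete, the following is a minimal sketch using the LaBSE checkpoint from HuggingFace. The softmax normalization, the threshold value, and all function and variable names are illustrative assumptions rather than details taken from the paper, and the subword-to-word conversion step is omitted for brevity.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
model = AutoModel.from_pretrained("sentence-transformers/LaBSE",
                                  output_hidden_states=True)
model.eval()

def extract_alignments(src: str, tgt: str, layer: int = 6, c: float = 0.5):
    # Encode each sentence separately and take subword states from `layer`
    # (index 0 of hidden_states is the embedding layer).
    enc_x = tokenizer(src, return_tensors="pt")
    enc_y = tokenizer(tgt, return_tensors="pt")
    with torch.no_grad():
        h_x = model(**enc_x).hidden_states[layer][0, 1:-1]  # drop [CLS]/[SEP]
        h_y = model(**enc_y).hidden_states[layer][0, 1:-1]
    S = h_x @ h_y.T              # similarity matrix, n x m (Equation 1)
    S_xy = S.softmax(dim=1)      # row-normalized: x_i -> tokens of y
    S_yx = S.softmax(dim=0)      # column-normalized: y_j -> tokens of x
    A = (S_xy > c) & (S_yx > c)  # thresholded intersection (Equation 2)
    return A.nonzero().tolist()  # aligned (i, j) subword index pairs
```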

Finetuning LaBSE for Better Alignments
Inspired by Dou and Neubig (2021), we propose a finetuning method to further improve performance, given a parallel corpus with alignment labels.
Table 1:
AER comparison between AccAlign and the baselines on the test sets of 7 language pairs. self-sup and sup mean finetuning the model with parallel corpora of self-supervised and human-annotated alignment labels, respectively. All multilingual methods are tested on zero-shot language pairs.

Adapter-based Finetuning Adapter-based finetuning (Houlsby et al., 2019; Bapna and Firat, 2019; He et al., 2021) is not only parameter-efficient, but also benefits model performance, especially for low-resource and cross-lingual tasks (He et al., 2021). Figure 2 illustrates our overall framework, where the adapters are adopted from Houlsby et al. (2019). For each layer of LaBSE, we introduce an adapter for each sublayer, which maps the input vector of dimension d to dimension m where m < d, and then re-maps it back to dimension d. Let h and h′ denote the input and output vectors, respectively. The output vector h′ is calculated as:

$$h' = h + W_{\text{up}} f(W_{\text{down}} h)$$

where $W_{\text{down}} \in \mathbb{R}^{m \times d}$ and $W_{\text{up}} \in \mathbb{R}^{d \times m}$ are the projection matrices and f is a nonlinear activation. Note that the skip-connection is employed to approximate an identity function if the parameters of the projection matrices are near zero. During finetuning, only the parameters of the adapters are updated.
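Below is a minimal PyTorch sketch of one such adapter, following the Houlsby et al. (2019) design described above. The choice of activation function, the bottleneck size m, and the initialization scheme are assumptions for illustration; the paper only specifies the down- and up-projections and the skip-connection.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project d -> m -> d with a skip-connection."""

    def __init__(self, d: int = 768, m: int = 128):
        super().__init__()
        self.down = nn.Linear(d, m)  # W_down: d -> m
        self.up = nn.Linear(m, d)    # W_up:   m -> d
        # Near-zero initialization keeps the adapter close to an
        # identity function at the start of finetuning.
        for lin in (self.down, self.up):
            nn.init.normal_(lin.weight, std=1e-3)
            nn.init.zeros_(lin.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))  # skip-connection

# During finetuning, the backbone stays frozen and only adapter
# parameters are updated (assumes adapters are named accordingly):
# for name, p in model.named_parameters():
#     p.requires_grad = "adapter" in name
```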
Training Objective Let Â denote the alignment labels for the given sentence pair x and y. We define the learning objective as:

$$\mathcal{L} = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \hat{A}_{ij} \left( \frac{S_{xy,ij}}{n} + \frac{S_{yx,ij}}{m} \right) \quad (3)$$

where S_xy and S_yx are the alignment probability matrices, and n and m are the lengths of sentences x and y, respectively. Intuitively, this objective encourages the gold-aligned words to have closer contextualized representations. In addition, as both S_xy and S_yx are encouraged to be close to Â, it implicitly encourages the two alignment probability matrices to be symmetrical to each other as well.
Our framework can be easily extended to cases where alignment labels are unavailable, by replacing Â with pseudo labels A (Equation 2) and training in a self-supervised manner.
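A hedged sketch of this objective follows, assuming the length-normalized reading of Equation 3 above; this is my reading of the paper's description, not a verified reference implementation. For the self-supervised variant, the pseudo labels A from Equation 2 are passed in place of Â.

```python
import torch

def alignment_loss(S_xy: torch.Tensor,
                   S_yx: torch.Tensor,
                   A_hat: torch.Tensor) -> torch.Tensor:
    """S_xy, S_yx: n x m alignment probability matrices (Equations 1-2);
    A_hat: n x m binary matrix of gold or pseudo alignment labels."""
    n, m = A_hat.shape
    # Reward the probability mass that both directions place on labeled
    # alignment links; this also pushes S_xy and S_yx toward each other.
    return -0.5 * ((A_hat * S_xy).sum() / n + (A_hat * S_yx).sum() / m)
```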

Setup
As we aim at building an accurate multilingual word aligner, we evaluate AccAlign on a diverse alignment test set of seven language pairs: de/sv/ro/fr/ja/zh/fa-en. For finetuning LaBSE, we use nl/cs/hi/tr/es/pt-en as the training set and cs-en as the validation set. To reduce the alignment annotation effort and the finetuning cost, our training set only contains 3,362 annotated sentence pairs. To simulate the most difficult use cases, where the test language pair may not be included in training, we keep the test language pairs disjoint from those used in training and validation. Namely, LaBSE is tested in a zero-shot manner. We denote this dataset as ALIGN6.
We induce alignments from the 6th layer of LaBSE, which is selected on the validation set. We use Alignment Error Rate (AER) as the evaluation metric. Our model is not directly comparable to the bilingual baselines, as they build a model for each test language pair using a large-scale parallel corpus of that language pair. In contrast, our method is more efficient, as it supports all language pairs in a single model and our finetuning only requires 3,362 sentence pairs. Appendix B gives more details on the datasets, models, baselines, and other setup.
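For reference, AER can be computed from the predicted links and the sure/possible gold links as in the short sketch below; this is the standard definition of Och and Ney (2003), not code from the paper.

```python
def aer(A: set, S: set, P: set) -> float:
    """Alignment Error Rate (Och and Ney, 2003).
    A: predicted links, S: sure gold links, P: possible gold links
    (S is a subset of P); each link is an (i, j) word-index pair."""
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

# A prediction that exactly matches the sure links has AER = 0:
gold_sure = {(0, 0), (1, 1)}
assert aer(gold_sure, gold_sure, gold_sure) == 0.0
```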

Main Results
Table 1 shows the comparison of our methods against the baselines. AccAlign-supft achieves new SOTA on word alignment induction, outperforming all baselines in 6 out of 7 language pairs. AccAlign is also simpler than AwesomeAlign, the best existing multilingual word aligner: AwesomeAlign finetunes with a combination of five objectives, while AccAlign only has one objective. The vanilla LaBSE is a strong multilingual word aligner (see AccAlign-noft). It performs better than SimAlign-noft and AwesomeAlign-noft, and is comparable with AwesomeAlign-supft, indicating that LaBSE has learned high-quality language-agnostic word embeddings. Our finetuning method is effective as well, improving AccAlign-noft by 1.6 and 2.7 AER with self-supervised and supervised alignment labels, respectively. Our model improves over the multilingual baselines even more significantly on non-English language pairs. See Table 2 of the appendix for detailed results.

Analysis
Performance on non-English Language Pairs We conduct experiments to evaluate AccAlign against the multilingual baselines on non-English test language pairs. The fi-el (Finnish-Greek) and fi-he (Finnish-Hebrew) test sets contain 791 and 2,230 annotated sentence pairs, respectively. Both test sets are from ImaniGooghari et al. (2021). The results are shown in Table 2. As can be seen, AccAlign in all three settings significantly improves over all multilingual baselines. The improvements are much larger compared with the zero-shot English language pairs, demonstrating the effectiveness of AccAlign on non-English language pairs. We also observe that finetuning improves AccAlign more than AwesomeAlign. This verifies the strong cross-lingual transfer ability of LaBSE, even between English-centric and non-English language pairs.
Adapter-based vs. Full Finetuning We compare full and adapter-based finetuning in Table 3. Compared with full finetuning, adapter-based finetuning updates far fewer parameters and obtains better performance under both supervised and self-supervised settings, demonstrating its efficiency and effectiveness for zero-shot word alignment.

Bilingual Finetuning To better understand our method, we compare with AwesomeAlign under a bilingual finetuning setup, where the model is finetuned and tested on the same single language pair. We follow the setup in Dou and Neubig (2021); the results are reported in Table 4.

We further analyze the representations of AccAlign under different settings by computing the cosine similarity for aligned word pairs and for word pairs randomly sampled from the same sentence, denoted as s_bi and s_mono, respectively (see appendix for more experiment details). Intuitively, a bigger s_bi and a smaller s_mono are preferred, as we expect the features of aligned words to be similar while those of two different words to be different. The results on the de-en test set are presented in Figure 3. For vanilla LaBSE (green curves), we find that features from the 6th layer, namely the best layer for inducing alignments, successfully trade off these two properties, as that layer obtains the biggest s_bi − s_mono among all layers. In addition, adapter-based finetuning improves performance mainly by making features more word-identifiable, as it significantly decreases s_mono while almost maintaining s_bi.
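A minimal sketch of how s_bi and s_mono could be computed from the contextual embeddings; the sampling scheme and the averaging granularity are assumptions, as the paper defers the experimental details to the appendix.

```python
import random
import torch
import torch.nn.functional as F

def s_bi(h_x, h_y, gold_links):
    """Mean cosine similarity over gold-aligned word pairs (x_i, y_j)."""
    sims = [F.cosine_similarity(h_x[i], h_y[j], dim=0)
            for i, j in gold_links]
    return torch.stack(sims).mean().item()

def s_mono(h_x, n_samples: int = 100):
    """Mean cosine similarity over random distinct word pairs within x."""
    sims = []
    for _ in range(n_samples):
        i, j = random.sample(range(h_x.size(0)), 2)  # two distinct words
        sims.append(F.cosine_similarity(h_x[i], h_x[j], dim=0))
    return torch.stack(sims).mean().item()
```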

Conclusion
In this paper, we introduce AccAlign, a novel multilingual word aligner based on the multilingual sentence Transformer LaBSE. The best proposed approach finetunes LaBSE on a few thousand annotated parallel sentences and achieves state-of-the-art performance even for zero-shot language pairs. We believe AccAlign is a valuable alignment tool that can be used out-of-the-box for other NLP tasks.

Limitations
AccAlign has been shown to extract high-quality word alignments when the input texts are two well-paired bilingual sentences. However, this condition is not always met. In lexically constrained decoding for NMT (Hasler et al., 2018; Song et al., 2020; Chen et al., 2021b), the aligner takes a full source-language sentence and a partial target-language translation as the input at each step to determine the right position to incorporate constraints. In creating translated training corpora in zero-resource languages for sequence tagging or parsing (Ni et al., 2017; Jain et al., 2019; Fei et al., 2020), the aligner extracts alignments from the labelled sentence and its translation to conduct label projection. Both cases deviate from our current setting, as the input sentence may contain translation errors or even be incomplete. We leave exploring the robustness of AccAlign as future work. At the same time, our proposed method only supports languages included in LaBSE. This hinders applying AccAlign to more low-resource languages. Future explorations are needed to rapidly adapt AccAlign to new languages (Neubig and Hu, 2018; Garcia et al., 2021).
fa-en: 400 test sentence pairs from Tavakoli and Faili (2014) (http://eceold.ut.ac.ir/en/node/940).

BAO-GUIDE (Zenkel et al., 2020). This model adds an extra alignment layer to re-predict the to-be-aligned target token, and further improves performance with Bidirectional Attention Optimization (BAO).

SHIFT-AET (Chen et al., 2020b). This model trains a separate alignment module in a self-supervised manner, and induces alignments when the to-be-aligned target token is the decoder input.

MASK-ALIGN (Chen et al., 2021a). This model is a self-supervised word aligner which makes use of the full context on the target side.

BTBA-FCBO-SST (Zhang and van Genabith, 2021). This model has a similar idea to Chen et al. (2021a), but with a different model architecture and training objectives.
SimAlign (Jalili Sabet et al., 2020). This model is a multilingual word aligner which induces alignments with contextual word embeddings from mBERT and XLM-R.
AwesomeAlign (Dou and Neubig, 2021). This model improves over SimAlign by designing a new alignment induction method and proposing to further finetune the mPLM on parallel corpora.
Among them, SimAlign and AwesomeAlign are multilingual aligners which support multiple language pairs in a single model, while the others are bilingual word aligners which require training from scratch with a bilingual corpus for each test language pair. We re-implement SimAlign and AwesomeAlign, while quoting the results from Dou and Neubig (2021) for the three statistical baselines and from the corresponding papers for the other baselines.

B.5 Sentence Transformer
We compare LaBSE with four other multilingual sentence Transformers from HuggingFace. The detailed information of these models is as follows:

distiluse-base-multilingual-cased-v2. This model is a multilingual knowledge-distilled version of m-USE (Yang et al., 2020), which has 135M parameters and supports more than 50 languages.

paraphrase-xlm-r-multilingual-v1. This model is a multilingual version of paraphrase-distilroberta-base-v1 (Reimers and Gurevych, 2019)

Figure 2:
The framework of adapter-based finetuning. The blue blocks are kept frozen, while the red adapter blocks are updated during finetuning.

Table 2:
AER comparison between AccAlign and the multilingual baselines on non-English zero-shot language pairs. The best AER for each column is bold and underlined.

Table 4:
AER results with bilingual finetuning.

Table 6:
Training, validation and test datasets of ALIGN6. Note that this is a zero-shot setting, as the test language pairs do not appear in training and validation.

Table 7:
AER comparison of full finetuning and adapter-based finetuning. The best AER for each column is bold and underlined.

Table 8:
AER results with bilingual finetuning. The results where the model is trained and tested on the same language pair are bold and underlined.

Table 9:
AER comparison of LaBSE and other multilingual pretrained models. All are without finetuning. We determine the best layer for alignment induction for each model using the validation set. The best AER for each column is bold and underlined.

Table 10:
AER comparison of vanilla LaBSE across layers. Layer 0 is the embedding layer. The best AER for each column is bold and underlined.