Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment

Bilingual lexicons map words in one language to their translations in another, and are typically induced by learning linear projections to align monolingual word embedding spaces. In this paper, we show it is possible to produce much higher quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment. Directly applying a pipeline that uses recent algorithms for both subproblems significantly improves induced lexicon quality, and further gains are possible by learning to filter the resulting lexical entries, with both unsupervised and semi-supervised schemes. Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 F1 points averaged over 12 language pairs, while also providing a more interpretable approach that allows for rich reasoning about word meaning in context. Further analysis of our output and the standard reference lexicons suggests they are of comparable quality, and that new benchmarks may be needed to measure further progress on this task.


Introduction
Bilingual lexicons map words in one language to their translations in another, and can be automatically induced by learning linear projections to align monolingual word embedding spaces (Artetxe et al., 2016; Smith et al., 2017; Lample et al., 2018, inter alia). Although very successful in practice, the linear nature of these methods encodes unrealistic simplifying assumptions (e.g., that all translations of a word have similar embeddings). In this paper, we show it is possible to produce much higher quality lexicons without these restrictions by introducing new methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment.
We show that simply pipelining recent algorithms for unsupervised bitext mining (Tran et al., 2020) and unsupervised word alignment (Sabet et al., 2020) significantly improves bilingual lexicon induction (BLI) quality, and that further gains are possible by learning to filter the resulting lexical entries. Improving on a recent method for doing BLI via unsupervised machine translation (Artetxe et al., 2019), we show that unsupervised mining produces better bitext for lexicon induction than translation, especially for less frequent words.
These core contributions are established by systematic experiments in the class of bitext construction and alignment methods (Figure 1). Our full induction algorithm filters the lexicon found via the initial unsupervised pipeline. The filtering can be either fully unsupervised or weakly-supervised: for the former, we filter using simple heuristics and global statistics; for the latter, we train a multi-layer perceptron (MLP) to predict the probability of a word pair being in the lexicon, where the features are global statistics of word alignments.
In addition to BLI, our method can also be directly adapted to improve word alignment, reaching alignment accuracy competitive with or better than the state of the art on all investigated language pairs. We find that the improved sentence representations of CRISS (Tran et al., 2020) lead to better contextual word alignments when combined with local similarity-based alignment (Sabet et al., 2020).
Our final BLI approach outperforms the previous state of the art on the BUCC 2020 shared task (Rapp et al., 2020) by 14 F1 points averaged over 12 language pairs. Manual analysis shows that most of our false positives are due to the incompleteness of the reference, and that our lexicon is comparable to the reference lexicon and the output of a supervised system. Because both of our key building blocks make use of the pretrained contextual representations from mBART (Liu et al., 2020) and CRISS (Tran et al., 2020), we can also interpret these results as clear evidence that lexicon induction benefits from contextualized reasoning at the token level, in strong contrast to nearly all existing methods that learn linear projections on word types.

Figure 1: Overview of the proposed retrieval-based supervised BLI framework, in which alignment statistics for a word pair (e.g., cooccurrence(good, guten) = 2, one-to-one align(good, guten) = 2, many-to-one align(good, guten) = 0, cosine_similarity(good, guten) = 0.8, inner_product(good, guten) = 1.8, count(good) = 2, count(guten) = 2) are fed to a multi-layer perceptron that outputs a lexicon probability (e.g., 0.95 for good, guten). Best viewed in color.

Related Work
Bilingual lexicon induction (BLI). The task of BLI aims to induce a bilingual lexicon (i.e., word translations) from comparable monolingual corpora (e.g., Wikipedia in different languages). Following Mikolov et al. (2013), most methods train a linear projection to align two monolingual embedding spaces. For supervised BLI, a seed lexicon is used to learn the projection matrix (Artetxe et al., 2016; Smith et al., 2017; Joulin et al., 2018). For unsupervised BLI, the projection matrix is typically found by an iterative procedure such as adversarial learning (Lample et al., 2018; Zhang et al., 2017), or iterative refinement initialized by statistical heuristics (Hoshen and Wolf, 2018; Artetxe et al., 2018). Artetxe et al. (2019) show strong gains over previous work by word-aligning bitext generated with unsupervised machine translation. We show that retrieval-based bitext mining and contextual word alignment achieve even better performance.
Word alignment. Word alignment is a fundamental problem in statistical machine translation, whose goal is to align words that are translations of each other within parallel sentences (Brown et al., 1993). Most methods require parallel sentences as training data (Och and Ney, 2003; Dyer et al., 2013; Peter et al., 2017, inter alia). In contrast, Sabet et al. (2020) propose SimAlign, which does not train on parallel sentences but instead aligns words that have the most similar pretrained multilingual representations (Devlin et al., 2019; Conneau et al., 2019). SimAlign achieves performance competitive with or superior to conventional alignment methods despite not using parallel sentences, and provides one of the baseline components for our work. We also present a simple yet effective method to improve performance over SimAlign (Section 5).

Baseline Components
We build on unsupervised methods for word alignment and bitext construction, as reviewed below.

Unsupervised Word Alignment
SimAlign (Sabet et al., 2020) is an unsupervised word aligner based on the similarity of contextualized token embeddings. Given a pair of parallel sentences, SimAlign computes embeddings using pretrained multilingual language models such as mBERT and XLM-R, and forms a matrix whose entries are the cosine similarities between every source token vector and every target token vector.
Based on the similarity matrix, the argmax algorithm aligns the positions that are the simultaneous column-wise and row-wise maxima. To increase recall, Sabet et al. (2020) also propose itermax, which applies argmax iteratively while excluding previously aligned positions.
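As a concrete illustration, the argmax and itermax procedures can be sketched in a few lines of numpy. The toy similarity matrix in the usage example is made up; in SimAlign the matrix is built from pretrained contextual embeddings.

```python
# Sketch of SimAlign-style inference over a similarity matrix.
import numpy as np

def argmax_align(sim):
    """Return (i, j) pairs that are simultaneous row-wise and column-wise maxima."""
    row_max = sim.argmax(axis=1)   # best target position for each source token
    col_max = sim.argmax(axis=0)   # best source position for each target token
    return sorted(
        (i, j)
        for i, j in enumerate(row_max)
        if col_max[j] == i         # mutual argmax
    )

def itermax_align(sim, n_iter=2):
    """Apply argmax repeatedly, excluding previously aligned positions."""
    sim = sim.astype(float).copy()
    aligned = []
    for _ in range(n_iter):
        pairs = [p for p in argmax_align(sim) if p not in aligned]
        if not pairs:
            break
        aligned.extend(pairs)
        for i, j in pairs:
            sim[i, :] = -np.inf    # mask aligned source position
            sim[:, j] = -np.inf    # mask aligned target position
    return sorted(aligned)
```

For example, on a 3×3 matrix with a dominant diagonal, both procedures recover the diagonal alignment; itermax differs from argmax only when mutual maxima are blocked by earlier, stronger alignments.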
Unsupervised Bitext Construction

Generation. Artetxe et al. (2019) train an unsupervised machine translation model on monolingual corpora, generate bitext with the obtained model, and further use the generated bitext to induce bilingual lexicons. We replace their statistical unsupervised translation model with CRISS, a recent unsupervised machine translation model that is expected to produce much higher quality bitext (i.e., translations). For each sentence in the two monolingual corpora, we generate a translation into the other language using beam search or nucleus sampling (Holtzman et al., 2020).
Retrieval. Tran et al. (2020) show that the CRISS encoder serves as a high-quality sentence encoder for cross-lingual retrieval: they take the average of the contextualized token embeddings as the sentence representation, perform nearest neighbor search with FAISS (Johnson et al., 2019), and mine bitext using the margin-based max-score method (Artetxe and Schwenk, 2019a). The score between sentence representations s and t is defined by

score(s, t) = cos(s, t) / ( Σ_{t′ ∈ NN_k(s)} cos(s, t′)/(2k) + Σ_{s′ ∈ NN_k(t)} cos(t, s′)/(2k) )    (1)

where NN_k(·) denotes the set of k nearest neighbors of a vector in the corresponding space. In this work, we keep the top 20% of the sentence pairs with scores larger than 1 as the constructed bitext.
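A minimal sketch of the margin-based score: the cosine similarity of a candidate pair is normalized by the average similarity of each vector to its k nearest neighbors in the other language. Brute-force search over small toy vector lists stands in for FAISS here; the vectors in the usage example are hypothetical.

```python
# Margin-based max-score criterion (Artetxe and Schwenk, 2019), brute force.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def margin_score(s, t, src_vecs, tgt_vecs, k=2):
    """cos(s, t) normalized by average neighborhood similarities of s and t."""
    nn_s = sorted((cos(s, v) for v in tgt_vecs), reverse=True)[:k]  # NN_k(s) in target space
    nn_t = sorted((cos(t, v) for v in src_vecs), reverse=True)[:k]  # NN_k(t) in source space
    denom = sum(nn_s) / (2 * k) + sum(nn_t) / (2 * k)
    return cos(s, t) / denom
```

A correctly matched pair scores above 1 (its similarity exceeds the neighborhood average), while a mismatched pair scores well below 1, which motivates the score > 1 mining threshold used in the text.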

Lexicon Induction

Our framework takes the most probable bilingual word pairs as the induced lexicon. It consists of two parts: (i) an unsupervised bitext construction module, which generates or retrieves bitext from separate monolingual corpora without explicit supervision (Section 3.2), and (ii) a lexicon induction module, which induces a bilingual lexicon from the constructed bitext based on the statistics of cross-lingual word alignment. For the lexicon induction module, we compare two approaches: fully unsupervised induction (Section 4.1), which does not use any extra supervision, and weakly supervised induction (Section 4.2), which uses a seed lexicon as input.

Fully Unsupervised Induction
We align the constructed bitext with CRISS-based SimAlign, and propose to use the smoothed matched ratio for a pair of bilingual word types ⟨s, t⟩,

ρ(s, t) = mat(s, t) / (coc(s, t) + λ),

as the metric to induce the lexicon, where mat(s, t) and coc(s, t) denote the one-to-one matching count (e.g., guten-good; Figure 1) and the co-occurrence count of ⟨s, t⟩ appearing in a sentence pair respectively, and λ is a non-negative smoothing term. During inference, we predict the target word t with the highest ρ(s, t) for each source word s. Like most previous work (Artetxe et al., 2016; Smith et al., 2017; Lample et al., 2018, inter alia), this method translates each source word to exactly one target word.
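The smoothed matched ratio can be sketched directly from alignment statistics. The toy bitext and alignments in the usage example are made up; in our setting they would come from the mined bitext and SimAlign.

```python
# Fully unsupervised induction: rho(s, t) = mat(s, t) / (coc(s, t) + lam).
from collections import Counter
from itertools import product

def induce_lexicon(bitext_alignments, lam=1.0):
    """bitext_alignments: list of (src_tokens, tgt_tokens, one_to_one_pairs)."""
    mat, coc = Counter(), Counter()
    for src, tgt, pairs in bitext_alignments:
        for s, t in product(set(src), set(tgt)):
            coc[s, t] += 1                  # co-occurrence within a sentence pair
        for i, j in pairs:
            mat[src[i], tgt[j]] += 1        # one-to-one matching count
    lexicon = {}
    for s in {s for s, _ in mat}:
        # translate each source word to the target word with the highest ratio
        best = max((mat[s, t] / (coc[s, t] + lam), t) for (s2, t) in coc if s2 == s)
        lexicon[s] = best[1]
    return lexicon
```

With two sentence pairs sharing the word pair good-guten, the smoothing term λ keeps one-off co-occurrences (e.g., good-morgen) from outranking consistently aligned pairs.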

Weakly Supervised Induction
We also propose a weakly supervised method, which assumes access to a seed lexicon. This lexicon is used to train a classifier to further filter the potential lexical entries.
For a pair of word types ⟨s, t⟩, our classifier uses the following global features:
• Count of alignment: we consider both the one-to-one alignment (Section 4.1) and the many-to-one alignment (e.g., danke-you and danke-thank; Figure 1) of s and t separately as two features, since the task of lexicon induction is arguably biased toward one-to-one alignment.
• Count of co-occurrence, as used in Section 4.1.
• Count of s in the source language and count of t in the target language.
• Non-contextualized word similarity: we feed the word type itself into CRISS, use the average pooling of the output subword embeddings as the word embedding, and consider both the cosine similarity and the dot-product similarity as features.
For a counting feature c, we take log(c + θ_c), where the θ_c are learnable parameters. There are 7 features in total, denoted by x_{s,t} ∈ R^7.
We compute the probability of a pair of words ⟨s, t⟩ being in the induced lexicon, P_Θ(s, t), with a ReLU-activated multi-layer perceptron (MLP):

h = ReLU(W_1 x_{s,t} + b_1),
P_Θ(s, t) = σ(w_2 · h + b_2),

where σ(·) denotes the sigmoid function, and Θ = {W_1, b_1, w_2, b_2} denotes the learnable parameters of the model.
Recall that we have access to a seed lexicon, which consists of pairs of word translations. At training time, we seek to maximize the log likelihood

Σ_{⟨s,t⟩ ∈ D+} log P_Θ(s, t) + Σ_{⟨s,t⟩ ∈ D−} log (1 − P_Θ(s, t)),

where D+ and D− denote the positive training set (i.e., the seed lexicon) and the negative training set respectively. We construct the negative training set by extracting all bilingual word pairs that co-occurred but are not in the seed lexicon. We tune two hyperparameters, δ and n, to maximize the F1 score on the seed lexicon and use them for inference, where δ denotes the prediction threshold and n denotes the maximum number of translations for each source word, following Laville et al. (2020), who estimate these hyperparameters based on heuristics. The inference algorithm is summarized in Algorithm 1.
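The filter above can be sketched as a small numpy forward pass plus the training objective. Shapes follow the text (7 features, hidden size 8 as in the experiments); the randomly initialized weights and the feature vectors in the usage example are placeholders, not trained values.

```python
# Weakly supervised filter: MLP probability and seed-lexicon log likelihood.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 7)), np.zeros(8)   # hidden layer (size 8)
w2, b2 = rng.normal(size=8), 0.0                # scalar output layer

def prob(x):
    """P_Theta(s, t) = sigmoid(w2 . ReLU(W1 x + b1) + b2)."""
    h = np.maximum(W1 @ x + b1, 0.0)
    return 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))

def log_likelihood(pos, neg):
    """Sum of log P over seed pairs (D+) plus log(1 - P) over negatives (D-)."""
    return (sum(np.log(prob(x)) for x in pos)
            + sum(np.log(1.0 - prob(x)) for x in neg))
```

In practice the parameters would be trained with Adam to maximize this objective; the sketch only shows the probability model and the loss being optimized.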

Extension to Word Alignment
The idea of using an MLP to induce lexicon with weak supervision (Section 4.2) can be directly extended to word alignment.
(Footnotes: SimAlign sometimes mistakenly aligns rare words to punctuation, and the count features of Section 4.2 can help exclude such pairs. The probability P_Θ(s, t) is not to be confused with a joint probability.)
Algorithm 1: Inference algorithm for weakly-supervised lexicon induction.
Input: thresholds δ and n, model parameters Θ, source words S
Output: induced lexicon L
  L ← ∅
  for s ∈ S do
      (⟨s, t_1⟩, …, ⟨s, t_k⟩) ← bilingual word pairs sorted in descending order of P_Θ(s, t_i)
      for i = 1, …, min(n, k) do
          if P_Θ(s, t_i) ≥ δ then L ← L ∪ {⟨s, t_i⟩}
  return L

Let {⟨S_i, T_i⟩}_{i=1}^N denote the constructed bitext (Section 3.2), where N denotes the number of sentence pairs, and S_i and T_i denote a pair of sentences in the source and target language respectively. For a sentence pair ⟨S, T⟩, S = ⟨s_1, …, s_{|S|}⟩ and T = ⟨t_1, …, t_{|T|}⟩ denote sentences consisting of word tokens s_i and t_j.
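The inference procedure of Algorithm 1 amounts to a few lines of Python: for each source word, keep at most n translations whose filter probability clears δ. The `prob` argument is any scorer mapping a word pair to P_Θ(s, t); the scores in the usage example are hypothetical.

```python
# Algorithm 1 sketch: threshold-and-top-n inference over candidate pairs.

def induce(source_words, candidates, prob, delta=0.5, n=2):
    """candidates[s] -> iterable of target words co-occurring with s."""
    lexicon = set()
    for s in source_words:
        ranked = sorted(candidates.get(s, ()), key=lambda t: prob(s, t), reverse=True)
        for t in ranked[:n]:                 # at most n translations per source word
            if prob(s, t) >= delta:          # prediction threshold delta
                lexicon.add((s, t))
    return lexicon
```

Unlike the fully unsupervised metric, this allows a source word to receive several translations (up to n), which matters for the recall-sensitive BUCC evaluation.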
For a pair of bitext sentences, SimAlign with a specified inference algorithm produces a word alignment A = {⟨a_i, b_i⟩}_i, denoting that the word tokens s_{a_i} and t_{b_i} are aligned. Sabet et al. (2020) propose different algorithms to induce alignments from the same similarity matrix, and the best method varies across language pairs. In this work, we consider the relatively conservative (i.e., higher-precision) argmax and the higher-recall itermax algorithms (Sabet et al., 2020), and denote the resulting alignments by A_argmax and A_itermax respectively.
We substitute the non-contextualized word similarity feature (Section 4.2) with contextualized word similarity, where the corresponding word embeddings are computed by averaging the final-layer contextualized subword embeddings from CRISS. The cosine similarities and dot products of these embeddings are included as features.
Instead of the binary classification of Section 4.2, we perform ternary classification for word alignment. For a pair of word tokens ⟨s_i, t_j⟩, the gold label y_{s_i,t_j} is defined as 2 if ⟨i, j⟩ ∈ A_argmax ∩ A_itermax, 0 if ⟨i, j⟩ is in neither alignment, and 1 otherwise. Intuitively, the labels 2 and 0 represent confident alignment or non-alignment by both methods, while the label 1 models potential alignment.
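The ternary labeling scheme is straightforward to express in code. This sketch follows my reading of the definition above (2 = aligned by both argmax and itermax, 0 = aligned by neither, 1 = aligned by exactly one); the alignment sets in the usage example are made up.

```python
# Ternary gold labels for word-alignment classification.

def gold_label(pair, a_argmax, a_itermax):
    """pair: (i, j) token positions; a_argmax / a_itermax: sets of aligned pairs."""
    in_both = pair in a_argmax and pair in a_itermax
    in_either = pair in a_argmax or pair in a_itermax
    return 2 if in_both else (1 if in_either else 0)
```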
The MLP takes the features x_{s_i,t_j} ∈ R^7 of the word token pair and computes the probability of each label y:

ĥ = ReLU(W_1 x_{s_i,t_j} + b_1),
P_Θ(y | s_i, t_j) = softmax(W_2 ĥ + b_2)_y.

At training time, we maximize the log-likelihood of the ground-truth labels, Σ log P_Θ(y_{s_i,t_j} | s_i, t_j). At inference time, we keep all word token pairs ⟨s_i, t_j⟩ whose predicted label ŷ = arg max_y P_Θ(y | s_i, t_j) is at least 1 (i.e., potential or confident alignment) as the prediction.

Experimental Setup and Baselines
Throughout our experiments, we use a two-layer perceptron with a hidden size of 8 for both lexicon induction and word alignment. We optimize all of our models using Adam (Kingma and Ba, 2015) with an initial learning rate of 5 × 10^−4. For our bitext construction methods, we retrieve the best matching sentence or translate the sentences in the source-language Wikipedia; for baseline models, we use their default settings. For evaluation, we use the BUCC 2020 BLI shared task dataset (Rapp et al., 2020) and metric (F1). Like most recent work, this evaluation is based on MUSE (Lample et al., 2018). We primarily report the BUCC evaluation because it considers recall in addition to precision. However, because most recent work only evaluates precision, we include those evaluations in Appendix D.
We compare against the following baselines: the best-performing BUCC 2020 shared task systems (BUCC), vector rotation with VECMAP, and bitext mining with WikiMatrix (see Table 1).

Main Results
We evaluate bidirectional translation with beam search (GEN; Section 3.2), bidirectional translation with nucleus sampling (GEN-N; Holtzman et al., 2020), and retrieval (RTV; Section 3.2). In addition, it is natural to concatenate the global statistical features (Section 4.2) from both GEN and RTV; we refer to this approach as GEN-RTV. Our main results are presented in Table 1. All of our models (GEN, GEN-N, RTV, GEN-RTV) outperform the previous state of the art (BUCC) by a significant margin on all language pairs. Surprisingly, RTV and GEN-RTV even outperform WikiMatrix in average F1 score, indicating that we do not need bitext supervision to obtain high-quality lexicons.

Automatic Analysis
Bitext quality. Since RTV achieves surprisingly high performance, we are interested in how much the quality of the bitext affects lexicon induction performance. We divide all retrieved bitext with score (Eq. 1) larger than 1 equally into five sections with respect to the score, and compare the lexicon induction performance on each (Table 2). In the table, RTV-1 refers to the bitext of the highest quality and RTV-5 to the bitext of the lowest quality, in terms of the margin score (Eq. 1). We also add a random pseudo-bitext baseline (Random), where all bitext is randomly sampled from each language pair, as well as a setting that uses all retrieved sentence pairs with scores larger than 1 (RTV-ALL). In general, the lexicon induction performance of RTV correlates well with the quality of the bitext. Even with the bitext of the lowest quality (RTV-5), the method still induces reasonably good bilingual lexicons, outperforming the best numbers reported by BUCC 2020 participants (Table 1), since the alignment statistics can still pick up co-occurrences of correct word pairs even when they appear in unrelated sentences.

Table 2: F1 scores (×100) on the test set of the BUCC 2020 shared task (Rapp et al., 2020). We use the weakly supervised algorithm (Section 4.2). The best number in each row is bolded. RTV-1 is the same as RTV in Table 1.

(Footnotes: 9: https://github.com/artetxem/VecMap. 10: https://github.com/facebookresearch/fastText. 11: https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md; that is, our VECMAP baselines have the same data availability as our main results. 12: https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix. 13: We sample from the smallest word set whose cumulative probability mass exceeds 0.5 for next words.)
Word alignment quality. We compare lexicon induction performance using the same set of constructed bitext (RTV) and different word aligners (Table 3). According to Sabet et al. (2020), SimAlign outperforms fast_align in terms of word alignment quality. We observe that this trend carries over to lexicon induction: a significantly better word aligner usually leads to a better induced lexicon.
Bitext quantity. We investigate how BLI performance changes with the quantity of bitext (Figure 2). We use CRISS with nucleus sampling (GEN-N) to create different amounts of bitext of the same quality. We find that with only 1% of the bitext (160K sentence pairs on average) used by GEN-N, our weakly-supervised framework outperforms the previous state of the art (BUCC; Table 1). The model reaches its best performance using 20% of the bitext (3.2M sentence pairs on average) and then drops slightly with even more bitext, likely because more bitext introduces more candidate word pairs.

Figure 2: F1 scores (×100) on the BUCC 2020 test set, produced by our weakly-supervised framework using different amounts of bitext generated by CRISS with nucleus sampling. 100% is the same as GEN-N in Table 1. For less than 100%, we uniformly sample the corresponding amount of bitext; for more, we generate multiple translations for each source sentence.
Dependence on word frequency of GEN vs. RTV. We observe that retrieval-based bitext construction (RTV) works significantly better than generation-based construction (GEN and GEN-N) in terms of lexicon induction performance (Table 1). To further investigate the source of this difference, we compare the performance of RTV and GEN as a function of source or target word frequency, where word frequencies are computed from the lower-cased Wikipedia corpus. In Figure 3b, RTV outperforms GEN in 11 of 12 language pairs, with de-fr the exception. In 6 of 12 language pairs, GEN does better than RTV for high-frequency source words; as more lower-frequency words are included, GEN eventually does worse than RTV. This helps explain why the combined model GEN-RTV is even better, since GEN can have an edge over RTV on high-frequency words. The trend that F1(RTV) − F1(GEN) increases as more lower-frequency words are included appears to hold for all language pairs (Appendix A). On average and for the majority of language pairs, both methods do better on low-frequency source words than on high-frequency ones (Figure 3a), which is consistent with the findings of BUCC 2020 participants (Rapp et al., 2020).

Figure 3: Average F1 scores (×100) with our weakly-supervised framework across the 12 language pairs (Table 1) on the filtered BUCC 2020 test set. Results on entries with (a) the k% most frequent source words, and (b) the k% most frequent target words.
VECMAP. While BLI through bitext construction and word alignment clearly achieves performance superior to BLI through vector rotation (Table 1), we further show that the gap is larger on low-frequency words (Figure 3).

Ground-truth Analysis
Following the advice of Kementchedjhieva et al. (2019) that some care is needed due to the incompleteness and biases of the evaluation, we perform manual analysis of selected results. For Chinese-English translation, we uniformly sample 20 wrong lexicon entries (according to the evaluation) for both GEN-RTV and weakly-supervised VECMAP. Our judgments of these samples are shown in Table 4. For GEN-RTV, 18/20 of the sampled errors are actually acceptable translations, whereas for VECMAP only 11/20 are acceptable. This indicates that the measured quality is partly limited by the incompleteness of the reference lexicon, and that the ground-truth performance of our method might be even better. The same analysis for English-Chinese is in Appendix B. Furthermore, we randomly sample 200 source words from the MUSE zh-en test set, and compare the quality of the MUSE translations and those predicted by GEN-RTV. This comparison is MUSE-favored, since only MUSE source words are included. Concretely, we take the union of word pairs, construct a new ground truth by manual judgment (i.e., removing unacceptable pairs), and evaluate the F1 score against the constructed ground truth (Table 5). The overall gap of 3 F1 means that a higher quality benchmark is necessary to resolve further improvements over GEN-RTV. The word pairs and judgments are included in the supplementary material (Section F).

Word Alignment

We evaluate word alignment following Sabet et al. (2020), investigating four language pairs: German-English (de-en), English-French (en-fr), English-Hindi (en-hi), and Romanian-English (ro-en). We find that CRISS-based SimAlign already achieves performance competitive with the state-of-the-art method (Garg et al., 2019), which requires real bitext for training. By ensembling the argmax and itermax CRISS-based SimAlign results (Section 5), we set a new state of the art for word alignment without using any bitext supervision. However, substituting the CRISS-based SimAlign in the BLI pipeline with our aligner yields an average F1 score of 73.0 for GEN-RTV, which does not improve over the 73.3 achieved by CRISS-based SimAlign (Table 1), indicating that further effort is required to take advantage of the improved word aligner.

Discussion
We present a direct and effective framework for BLI with unsupervised bitext mining and word alignment, which sets a new state of the art on the task. From the perspective of pretrained multilingual models (Conneau et al., 2019; Liu et al., 2020; Tran et al., 2020, inter alia), our work shows that they have successfully captured information about word translation that can be extracted using similarity-based alignment and refinement. Although BLI is only about word types, it strongly benefits from contextualized reasoning at the token level.

Appendices

A Language-Specific Analysis
While Figure 3 shows the average trend of F1 scores with respect to the portion of source or target words kept, we present such plots for each language pair in Figures 4 and 5. The trend of each individual method varies across language pairs, which is consistent with the findings of BUCC 2020 participants (Rapp et al., 2020). However, the conclusion that RTV gains more from low-frequency words still holds for most language pairs.

B Acceptability Judgments for en → zh

We present error analysis for the induced lexicon for English-to-Chinese translation (Table 7), using the same method as Table 4. In this direction, many of the unacceptable cases copy English words as their Chinese translations, which is also observed by Rapp et al. (2020). This is due to an idiosyncrasy of the evaluation data, where many English words are considered acceptable Chinese translations of themselves.

C Examples for Bitext in Different Sections
We show examples of mined bitext of different quality (Table 8), where the mined bitext is divided into 5 sections with respect to the similarity-based margin score (Eq. 1). The Chinese sentences are automatically converted to traditional Chinese characters using chinese converter, to keep consistent with the MUSE dataset.
Based on our knowledge of these languages, we observe that RTV-1 mostly consists of correct translations. While the other sections of bitext are of lower quality, sentences within a pair are still highly related, or can even be partially aligned; therefore, our bitext mining and alignment framework can still extract high-quality lexicon entries from such imperfect bitext.
D Results: P@1 on the MUSE Dataset

Precision@1 (P@1) is a widely applied metric for evaluating bilingual lexicon induction (Smith et al., 2017; Lample et al., 2018; Artetxe et al., 2019, inter alia); therefore, we also compare our models with existing approaches in terms of P@1 (Table 9). Our fully unsupervised method with retrieval-based bitext outperforms the previous state of the art (Artetxe et al., 2019) by 4.1 average P@1, and achieves competitive or superior performance on all investigated language pairs.

E Error analysis
To understand the remaining errors, we randomly sampled 400 word pairs from the induced lexicon and compared them to the ground truth and to Google Translate (via =googletranslate(A1, "zh", "en")). All error cases are included in Table 10. In overall precision, our induced lexicon is comparable to the output of the Google Translate API: there are 17 errors for GEN-RTV, 14 errors for Google Translate, and 4 errors in common.

Table 9: P@1 of our lexicon inducer and previous methods on the standard MUSE test set (Lample et al., 2018), where the best number in each column is bolded. The first section consists of vector rotation-based methods, while Artetxe et al. (2019) conduct unsupervised machine translation and word alignment to induce bilingual lexicons. All methods are tested in the fully unsupervised setting. †: numbers copied from Artetxe et al. (2019).