WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects and low-resource languages. We do not limit the extraction process to alignments with English, but systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus is freely available. To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only, for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting for training MT systems between distant languages without the need to pivot through English.


Introduction
Most current approaches in Natural Language Processing are data-driven. The size of the resources used for training is often the primary concern, but quality and a large variety of topics may be equally important. Monolingual texts are usually available in huge amounts for many topics and languages. However, multilingual resources, i.e. sentences which are mutual translations, are more limited, in particular when the two languages do not involve English. An important source of parallel texts are international organizations like the European Parliament (Koehn, 2005) or the United Nations (Ziemski et al., 2016). Several projects rely on volunteers to provide translations for public texts, e.g. news commentary (Tiedemann, 2012), OpenSubtitles (Lison and Tiedemann, 2016) or the TED corpus (Qi et al., 2018).

Wikipedia is probably the largest free multilingual resource on the Internet. Its content is very diverse and covers many topics, and articles exist in more than 300 languages. Some content on Wikipedia was translated by humans from an existing article into another language, not necessarily from or into English, and the translated articles may later have been independently edited so that they are no longer parallel. Wikipedia strongly discourages the use of unedited machine translation, but the existence of such articles cannot be totally excluded. Many articles have been written independently, but may nevertheless contain sentences which are mutual translations. This makes Wikipedia a very appropriate resource to mine for parallel texts for a large number of language pairs. To the best of our knowledge, this is the first work to process the entire Wikipedia and systematically mine for parallel sentences in all language pairs.
In this work, we build on a recent approach to mine parallel texts based on a distance measure in a joint multilingual sentence embedding space (Schwenk, 2018; Artetxe and Schwenk, 2018a), and a freely available encoder for 93 languages. We address the computational challenge of mining among almost six hundred million sentences by using fast indexing and similarity search algorithms.
The paper is organized as follows. In the next section, we first discuss related work. We then summarize the underlying mining approach. Section 4 describes in detail how we applied this approach to extract parallel sentences from Wikipedia in 1620 language pairs. In section 5, we assess the quality of the extracted bitexts by training NMT systems for a subset of language pairs and evaluate them on the TED corpus (Qi et al., 2018) for 45 languages. The paper concludes with a discussion of future research directions.

Related work
There is a large body of research on mining parallel sentences in monolingual text collections, usually named "comparable corpora". Initial approaches to bitext mining relied on heavily engineered systems, often based on metadata information, e.g. (Resnik, 1999; Resnik and Smith, 2003). More recent methods explore the textual content of the comparable documents. For instance, it was proposed to rely on cross-lingual document retrieval, e.g. (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005), or machine translation, e.g. (Abdul-Rauf and Schwenk, 2009; Bouamor and Sajjad, 2018), typically to obtain an initial alignment that is then further filtered. In the shared task for bilingual document alignment (Buck and Koehn, 2016), many participants used techniques based on n-gram or neural language models, neural translation models and bag-of-words lexical translation probabilities for scoring candidate document pairs. The STACC method uses seed lexical translations induced from IBM alignments, which are combined with set expansion operations to score translation candidates through the Jaccard similarity coefficient (Etchegoyhen and Azpeitia, 2016; Azpeitia et al., 2017, 2018). Using multilingual noisy web crawls such as ParaCrawl (http://www.paracrawl.eu/) for filtering good quality sentence pairs has been explored in the shared tasks for high-resource (Koehn et al., 2018) and low-resource languages.
In this work, we rely on massively multilingual sentence embeddings and margin-based mining in the joint embedding space, as described in (Schwenk, 2018; Artetxe and Schwenk, 2018a,b). This approach has also proven to perform best in a low-resource scenario. Closest to this approach is the research described in España-Bonet et al. (2017); Guo et al. (2018); Yang et al. (2019). However, in all these works, only bilingual sentence representations have been trained. Such an approach does not scale to many languages, in particular when considering all possible language pairs in Wikipedia. Finally, related ideas have also been proposed in Bouamor and Sajjad (2018) or Grégoire and Langlais (2017). However, in those works, mining is not solely based on multilingual sentence embeddings; they are only one part of a larger system. To the best of our knowledge, this work is the first to apply the same mining approach to all combinations of many different languages, written in more than twenty different scripts. In follow-up work, the same underlying mining approach was applied to a huge collection of Common Crawl texts. Hierarchical mining in Common Crawl texts was performed by El-Kishky et al. (2020).

Wikipedia is arguably the largest comparable corpus. One of the first attempts to exploit this resource was performed by Adafre and de Rijke (2006): an MT system was used to translate Dutch sentences into English and to compare them with the English texts, yielding several hundred Dutch/English bitexts. Later, a similar technique was applied to Persian/English (Mohammadi and GhasemAghaee, 2010). Structural information in Wikipedia, such as the topic categories of documents, was used in the alignment of multilingual corpora (Otero and López, 2010). In another work, the mining approach of Munteanu and Marcu (2005) was applied to extract large corpora from Wikipedia in sixteen languages (Smith et al., 2010). Otero et al. (2011) measured the comparability of Wikipedia corpora by translation equivalents in three languages: Portuguese, Spanish, and English. Patry and Langlais (2011) came up with a set of features, such as Wikipedia entities, to recognize parallel documents, but their approach was limited to a bilingual setting. Tufis et al. (2013) proposed an approach to mine bitexts from Wikipedia textual content, but they only considered high-resource languages, namely German, Spanish and Romanian paired with English. Tsai and Roth (2016) grounded multilingual mentions to English Wikipedia by training cross-lingual embeddings on twelve languages. Gottschalk and Demidova (2017) searched for parallel text passages in Wikipedia by comparing their named entities and time expressions. Finally, Aghaebrahimian (2018) proposed an approach based on bilingual BiLSTM sentence encoders to mine German, French and Persian parallel texts with English. Parallel data consisting of aligned Wikipedia titles has been extracted for twenty-three languages. We are not aware of other attempts to systematically mine for parallel sentences in the textual content of Wikipedia for a large number of languages.

Distance-based mining approach
The underlying idea of the mining approach used in this work is to first learn a multilingual sentence embedding. The distance in that space can be used as an indicator of whether two sentences are mutual translations or not. Using a simple absolute threshold on the cosine distance was shown to achieve competitive results (Schwenk, 2018). However, it has been observed that an absolute threshold on the cosine distance is globally not consistent, e.g. (Guo et al., 2018). This is particularly true when mining bitexts for many different language pairs.

Margin criterion
The alignment quality can be substantially improved by using a margin criterion (Artetxe and Schwenk, 2018a). The margin between two candidate sentences x and y is defined as the ratio between the cosine similarity of the two sentence embeddings and the average cosine similarity of their nearest neighbors in both directions:

margin(x, y) = cos(x, y) / ( Σ_{z ∈ NN_k(x)} cos(x, z) / 2k + Σ_{z ∈ NN_k(y)} cos(y, z) / 2k )    (1)

where NN_k(x) denotes the k unique nearest neighbors of x in the other language, and analogously for NN_k(y). We used k = 4 in all experiments.
We follow the "max" strategy of Artetxe and Schwenk (2018a): the margin is first calculated in both directions for all sentences in language L 1 and L 2 . We then create the union of these forward and backward candidates. Candidates are sorted and pairs with source or target sentences that were already used are omitted. We then apply a threshold on the margin score to decide whether two sentences are mutual translations or not.
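The margin computation of Equation 1 can be sketched in a few lines of NumPy. This is a brute-force illustration for small corpora only (the actual pipeline replaces the full similarity matrix with approximate nearest-neighbor queries, as described in Section 3.3); `margin_scores` is an illustrative helper name, not part of any released tooling:

```python
import numpy as np

def margin_scores(x_emb, y_emb, k=4):
    """Ratio-margin scores between two sets of L2-normalized sentence
    embeddings, x_emb of shape (n, d) and y_emb of shape (m, d).

    For normalized vectors the dot product equals the cosine similarity;
    the margin divides each cosine by the average similarity of the k
    nearest neighbors in both directions."""
    sim = x_emb @ y_emb.T                               # (n, m) cosine matrix
    fwd = np.sort(sim, axis=1)[:, -k:].mean(axis=1)     # avg k-NN sim of each x
    bwd = np.sort(sim, axis=0)[-k:, :].mean(axis=0)     # avg k-NN sim of each y
    return sim / ((fwd[:, None] + bwd[None, :]) / 2)    # ratio margin, Eq. (1)
```

Candidate pairs would then be sorted by margin in both directions and accepted above a threshold, following the "max" strategy described above.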
The complexity of a distance-based mining approach is O(N × M), where N and M are the number of sentences in each monolingual corpus. This makes a brute-force approach with exhaustive distance calculations intractable for large corpora. The languages with the largest Wikipedias are English and German, with 134M and 51M sentences respectively, which would require 6.8 × 10^15 distance calculations. We show in Section 3.3 how to tackle this computational challenge.

Multilingual sentence embeddings
Distance-based bitext mining requires a joint sentence embedding for all the considered languages. One may be tempted to train a bilingual embedding for each language pair, e.g. (España-Bonet et al., 2017; Guo et al., 2018; Yang et al., 2019), but this is difficult to scale to the thousands of language pairs present in Wikipedia. Instead, we chose to use one single massively multilingual sentence embedding for all languages, namely the one provided by the open-source LASER toolkit (Artetxe and Schwenk, 2018b). Training one joint multilingual embedding on many languages at once also has the advantage that low-resource languages can benefit from their similarity to other languages in the same language family. For example, we were able to mine parallel data for several Romance minority languages like Aragonese, Lombard, Mirandese or Sicilian, although data in those languages was not used to train the multilingual LASER embeddings. The reader is referred to Artetxe and Schwenk (2018b) for a detailed description of how LASER was trained.

Fast similarity search
In this work, we use the open-source FAISS library (https://github.com/facebookresearch/faiss), which implements highly efficient algorithms to perform similarity search on billions of vectors (Johnson et al., 2017). Our sentence representations being 1024-dimensional, all English sentences require 134·10^6 × 1024 × 4 bytes = 536 GB of memory. Therefore, dimensionality reduction and data compression are needed for efficient search. We chose a rather aggressive compression based on a 64-bit product quantizer (Jégou et al., 2011), partitioning the search space into 32k cells. We build one FAISS index for each language.
The compressed FAISS index for English requires only 9.2 GB, i.e. more than fifty times less than the original sentence embeddings. This makes it possible to load the whole index on a standard GPU and to run the search very efficiently on multiple GPUs in parallel, without the need to shard the index. The overall mining process for German/English requires less than 3.5 hours on 8 GPUs, including the nearest-neighbor search in both directions and scoring all candidates.

Bitext mining in Wikipedia
For each Wikipedia article, it is possible to get the link to the corresponding article in other languages. This could be used to mine sentences limited to the respective articles. On the one hand, this local mining has several advantages: 1) mining is very fast since each article usually has only a few hundred sentences; 2) it seems reasonable to assume that a translation of a sentence is more likely to be found in the same article than anywhere else in the whole Wikipedia. On the other hand, we hypothesize that the margin criterion will be less efficient since one article usually has few sentences which are similar. This may lead to many sentences in the overall mined corpus of the type "NAME was born on DATE in CITY", "BUILDING is a monument in CITY built on DATE", etc. Although those alignments may be correct, we hypothesize that they are of limited use for training an NMT system.
The other option is to consider the whole Wikipedia for each language: for each sentence in the source language, we mine among all target sentences. This global mining has the advantage that we can try to align two languages even though they have only a few articles in common. A drawback is a potentially increased risk of misalignment. In this work, we chose the global mining option.

Corpus preparation
Extracting the textual content of Wikipedia articles in all languages is a rather challenging task: all tables, citations, footnotes and formatting markup have to be removed. There are several ways to download Wikipedia content. In this study, we use the so-called CirrusSearch dumps (https://dumps.wikimedia.org/other/cirrussearch/) since they directly provide the textual content without any meta information. We downloaded this dump in March 2019. A total of about 300 languages are available, but the size obviously varies a lot between languages. We applied the following processing: 1) extract the textual content; 2) split the paragraphs into sentences; 3) remove duplicate sentences; and 4) perform language identification and remove sentences which are not in the expected language.
It should be pointed out that sentence segmentation is not a trivial task: some languages, such as Thai, do not use specific symbols to mark the end of a sentence. We are not aware of a freely available sentence segmenter for Thai and had to exclude it. We used a freely available Python tool (https://pypi.org/project/sentence-splitter/) to detect sentence boundaries. Regular expressions were used for most of the Asian languages, falling back to the English rules for the remaining languages. This gives us 879 million sentences in 300 languages.

The margin criterion to mine for parallel data requires that the texts do not contain duplicates. Deduplication removes about 25% of the sentences (the Cebuano and Waray Wikipedia were largely created by a bot and contain more than 65% duplicates).

LASER's sentence embeddings are totally language agnostic. This has the side effect that sentences in other languages (e.g. citations or quotes) may be considered closer in the embedding space than a potential translation in the target language. Table 1 illustrates this problem:

Table 1: A sentence with valid translations in several languages.
  L1 (French):  Ceci est une très grande maison
  L2 (German):  Das ist ein sehr großes Haus
  (English):    This is a very big house
  (Hungarian):  Ez egy nagyon nagy ház
  (Indonesian): Ini rumah yang sangat besar

The algorithm would not select the German sentence although it is a perfect translation: the sentences in the other languages are also valid translations, which would yield a very small margin. To avoid this problem, we perform language identification (LID) on all sentences and remove those which are not in the expected language. LID is performed with fasttext (Joulin et al., 2016; https://fasttext.cc/docs/en/language-identification.html). Fasttext does not support all 300 languages present in Wikipedia, and we disregarded the missing ones (which typically have only a few sentences anyway). After deduplication and LID, we are left with 595M sentences in 182 languages. English accounts for 134M sentences, and German, with 51M sentences, is the second largest language. The sizes for all languages are given in Tables 3 and 5 (in the appendix).
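The LID filtering step can be sketched as follows. `filter_by_language` and `fasttext_predictor` are illustrative helper names, not part of any released tooling; the fasttext wrapper assumes the public lid.176.bin model, whose labels have the form `__label__de`:

```python
def filter_by_language(sentences, lang, predict):
    """Keep only sentences whose predicted language code matches the
    Wikipedia edition they were extracted from. `predict` maps a
    sentence to an ISO language code."""
    return [s for s in sentences if predict(s) == lang]

def fasttext_predictor(model):
    """Wrap a loaded fasttext LID model into a sentence -> language-code
    function (fasttext cannot handle embedded newlines, so strip them)."""
    prefix = "__label__"
    return lambda s: model.predict(s.replace("\n", " "))[0][0][len(prefix):]
```

With fasttext installed, `model = fasttext.load_model("lid.176.bin")` followed by `filter_by_language(sents, "de", fasttext_predictor(model))` would drop the non-German candidates of Table 1 before mining.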

Threshold optimization
Artetxe and Schwenk (2018a) optimized their mining approach for each language pair on a provided corpus of gold alignments. This is not possible when mining Wikipedia, in particular when considering many language pairs. In this work, we use an evaluation protocol inspired by the WMT shared task on parallel corpus filtering for low-resource conditions: an NMT system is trained on the bitexts extracted at different thresholds, and the resulting BLEU scores are compared. We chose newstest2014 of the WMT evaluations since it provides an N-way parallel test set for English, French, German and Czech. We favoured translation between two morphologically rich languages from different families and considered the following language pairs: German/English, German/French, Czech/German and Czech/French. The size of the mined bitexts ranges from 100k to more than 2M sentence pairs (see Table 2 and Figure 1). We did not try to optimize the architecture of the NMT system to the size of the bitexts and used the same architecture for all systems: the encoder and decoder are 5-layer transformer models as implemented in fairseq (Ott et al., 2019). The goal of this study is not to develop the best-performing NMT system for the considered language pairs, but to compare different mining parameters.
The evolution of the BLEU score as a function of the margin threshold is given in Figure 1. Decreasing the threshold naturally leads to more mined data; we observe an exponential increase in data size. The performance of the NMT systems trained on the mined data changes as expected: the BLEU score first improves with increasing amounts of training data, reaches a maximum, and then decreases as the additional data becomes more and more noisy, i.e. contains wrong translations. It is also not surprising that a careful choice of the margin threshold is more important in a low-resource setting, where every additional parallel sentence counts. According to Figure 1, the optimal value of the margin threshold seems to be 1.05 when many sentences can be extracted, in our case German/English and German/French. When less parallel data is available, i.e. Czech/German and Czech/French, a value in the range of 1.03-1.04 seems to be a better choice. Aiming at one threshold for all language pairs, we chose a value of 1.04, which seems to be a good compromise for most language pairs. For the open release of this corpus, however, we provide all mined sentences with a margin of 1.02 or better. This enables end users to choose an optimal threshold for their particular applications, although we do not expect many sentence pairs with a margin as low as 1.02 to be good translations.
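Since the release ships all pairs down to margin 1.02, end users re-apply their own cutoff. A sketch of such filtering, assuming one pair per line in the format margin<TAB>source<TAB>target (an assumption about the column layout; check the release's documentation for the exact format):

```python
import csv

def filter_bitext(path, threshold=1.04):
    """Yield (source, target) pairs whose margin score meets the
    threshold. Assumed line format: score<TAB>src<TAB>tgt."""
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if float(row[0]) >= threshold:
                yield row[1], row[2]
```

A higher threshold trades recall for precision, mirroring the BLEU-vs-threshold trade-off discussed above.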
For comparison, we also trained NMT systems on the Europarl corpus V7 (Koehn, 2005), i.e. professional human translations, first on all available data, and then on the same number of sentences as the mined ones (see Table 2). With the exception of Czech/French, we were able to achieve better BLEU scores with the bitexts mined from Wikipedia than with Europarl data of the same size. Adding the mined bitexts to the full Europarl corpus leads to further improvements of 1.1 to 3.1 BLEU.

Result analysis
We ran the alignment process for all possible combinations of languages in Wikipedia. This yielded 1620 language pairs for which we were able to mine at least ten thousand sentences. Remember that mining L1 → L2 is identical to L2 → L1 and is counted only once. We propose to analyze and evaluate the extracted bitexts in two ways. First, we discuss the amount of extracted sentences (Section 5.1). We then turn to a qualitative assessment by training NMT systems for all language pairs with more than twenty-five thousand mined sentences (Section 5.2).

Quantitative analysis
Due to space limits, Table 3 summarizes the number of extracted parallel sentences only for languages which have a total of at least five hundred thousand parallel sentences (with all other languages at a margin threshold of 1.04). Additional results are given in Table 5 in the Appendix.
There are many reasons which can influence the number of mined sentences. Obviously, the larger the monolingual texts, the more likely it is to mine many parallel sentences. Not surprisingly, we observe that more sentences could be mined when English is one of the two languages. Let us point out some languages for which it is usually not obvious to find parallel data with English, namely Indonesian (1M), Hebrew (545k), Farsi (303k) or Marathi (124k sentences). The largest mined texts not involving English are Russian/Ukrainian (2.5M), Catalan/Spanish (1.6M), or between the Romance languages French, Spanish, Italian and Portuguese (480k-923k), and German/French (626k).
It is striking that we were able to mine more sentences when Galician and Catalan are paired with Spanish than with English. On the one hand, this could be explained by the fact that LASER's multilingual sentence embeddings may be better for these pairs since the involved languages are linguistically very similar. On the other hand, it could be that the Wikipedia articles in both languages share a lot of content, or were obtained by mutual translation.
Services from the European Commission provide human translations of (legal) texts in all 24 official languages of the European Union. This N-way parallel corpus enables the training of MT systems that translate directly between these languages, without the need to pivot through English. This is usually not the case when translating between other major languages, for example in Asia. Some interesting language pairs for which we were able to mine more than 100k sentences include Korean/Japanese (222k), Russian/Japanese (196k), Indonesian/Vietnamese (146k), and Hebrew/Romance languages (120k-150k sentences).
Overall, we were able to extract at least ten thousand parallel sentences for 96 different languages. For several low-resource languages, we were able to extract more parallel sentences paired with languages other than English. These include, among others, Aragonese with Spanish, Lombard with Italian, Breton with several Romance languages, Western Frisian with Dutch, Luxembourgish with German, and Egyptian Arabic and Wu Chinese with the respective major language.
Finally, Cebuano (ceb) clearly falls apart: it has a rather huge Wikipedia (17.9M filtered sentences), but most of it was generated by a bot, as for the Waray language. This certainly explains why only a very small number of parallel sentences could be extracted. Although the same bot was also used to generate articles in the Swedish Wikipedia, our alignments seem to be better for that language.

Qualitative evaluation
Aiming to perform a large-scale assessment of the quality of the extracted parallel sentences, we trained NMT systems on the bitexts. We identified a publicly available dataset which provides test sets for many language pairs: translations of TED talks as proposed in the context of a study on pretrained word embeddings for NMT (Qi et al., 2018). We would like to emphasize that we did not use the training data provided by TED; we only trained on the mined sentences from Wikipedia. The goal of this study is not to build state-of-the-art NMT systems for the TED task, but to get an estimate of the quality of our extracted data for many language pairs. In particular, there may be a mismatch in topic and language style between Wikipedia texts and the transcribed and translated TED talks.
All NMT systems were trained with the parameter settings shown in Figure 2 in the appendix. Since the TED development and test sets were already tokenized, we first detokenize them using Moses. We trained NMT systems for all possible language pairs with more than 25k mined sentences. This gives us in total 1886 language pairs in 45 languages. We train L1 → L2 and L2 → L1 with the same mined bitexts L1/L2. Scores on the test sets were computed with SacreBLEU (Post, 2018); see Table 4. Some additional results are reported in Table 6 in the annex. 23 NMT systems achieve BLEU scores over 30, the best one being 37.3 for Brazilian Portuguese to English. Several results are worth mentioning, like Farsi/English: 16.7, Hebrew/English: 25.7, Indonesian/English: 24.9 or English/Hindi: 25.7. We also achieve interesting results for translation between various non-English language pairs for which it is usually not easy to find parallel data, e.g. Norwegian ↔ Danish ≈33, Norwegian ↔ Swedish ≈25, Indonesian ↔ Vietnamese ≈16 or Japanese ↔ Korean ≈17.
Our results on the TED set give an indication of the quality of the mined parallel sentences. These BLEU scores should of course be considered in the context of the sizes of the mined corpora given in Table 3. Finally, we would like to point out that we ran our approach on all available languages in Wikipedia, independently of the quality of LASER's sentence embeddings for each one.

Conclusion
We have presented an approach to systematically mine for parallel sentences in the textual content of Wikipedia, for all possible language pairs. We use a mining approach based on massively multilingual sentence embeddings (Artetxe and Schwenk, 2018b) and a margin criterion (Artetxe and Schwenk, 2018a). The same approach is used for all language pairs without the need for language-specific optimization. In total, we make available 135M parallel sentences in 96 languages, out of which only 34M sentences are aligned with English. We were able to mine more than ten thousand sentences for 1620 different language pairs. This corpus of parallel sentences is freely available. We also performed a large-scale evaluation of the quality of the mined sentences by training 1886 NMT systems and evaluating them on the 45 languages of the TED corpus (Qi et al., 2018). This approach was recently extended to mine in Common Crawl texts.

Table 5: WikiMatrix (part 2): number of extracted sentences (in thousands) for languages with a rather small Wikipedia. Alignments with other languages yield less than 5k sentences and are omitted for clarity.

Table 2 gives the detailed configuration which was used to train NMT models on the mined data in Section 5. A 5000-subword vocabulary was learnt using SentencePiece (Kudo and Richardson, 2018). Decoding was done with beam size 5 and length normalization 1.2.