CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web

We show that margin-based bitext mining in a multilingual sentence space can be applied to monolingual corpora of billions of sentences. We use ten snapshots of a curated Common Crawl corpus (Wenzek et al., 2019) totalling 32.7 billion unique sentences. Using one unified approach for 38 languages, we were able to mine 4.5 billion parallel sentences, of which 661 million are aligned with English. 20 language pairs have more than 30 million parallel sentences, 112 more than 10 million, and most more than one million, including direct alignments between many European or Asian languages. To evaluate the quality of the mined bitexts, we train NMT systems for most of the language pairs and evaluate them on TED, WMT and WAT test sets. Using our mined bitexts only and no human-translated parallel data, we achieve a new state-of-the-art for a single system on the WMT'19 test set for translation between English and German, Russian and Chinese, as well as German/French. In particular, our English/German system outperforms the best single one by close to 4 BLEU points and is almost on par with the best WMT'19 evaluation system, which uses system combination and back-translation. We also achieve excellent results for distant language pairs like Russian/Japanese, outperforming the best submission at the 2019 Workshop on Asian Translation (WAT).


Introduction
Most current approaches in Natural Language Processing (NLP) are data-driven. The size of the resources used for training is often the primary concern, but their quality and the variety of topics they cover may be equally important. Monolingual texts are usually available in huge amounts for many topics and languages. However, multilingual resources, typically sentences in two languages which are mutual translations, are more limited, in particular when neither of the two languages is English. An important source of parallel texts are international organizations like the European Parliament (Koehn, 2005) or the United Nations (Ziemski et al., 2016). These are professional human translations, but they use rather formal language and tend to be limited to political topics. There are also several projects relying on volunteers to provide translations of public texts, e.g. news commentary (Tiedemann, 2012), OpenSubtitles (Lison and Tiedemann, 2016) or the TED corpus (Qi et al., 2018).
A first system to systematically mine parallel sentences for many language pairs in Wikipedia, including bitexts without English as one of the languages, was presented in prior work. There, parallel sentence mining was based on a distance measure in a joint multilingual sentence embedding space (Schwenk, 2018; Artetxe and Schwenk, 2018a), using the freely available LASER toolkit, which provides a language-agnostic sentence encoder trained on 93 languages (Artetxe and Schwenk, 2018b).
In this paper, we use the same underlying mining approach based on LASER and scale it to a much larger corpus: ten crawls of the curated Common Crawl data set (Wenzek et al., 2019) instead of Wikipedia (32.7 billion against 550 million unique sentences). On the one hand, we had to redesign the processing pipeline in order to tackle the substantial computational challenge: billions of sentence embeddings have to be compared. On the other hand, it is an interesting research question whether global mining scales to billions of sentences, i.e. systematically comparing each sentence in a source language with all sentences in the target language. To the best of our knowledge, all existing large-scale bitext mining techniques apply a hierarchical approach: first, a subset of all the texts is selected, e.g. documents which are supposed to contain parallel sentences; then, sentences limited to those previously aligned documents are compared and the parallel ones are identified. This type of local mining has the advantage of being very fast, since only a few thousand sentences need to be compared for each document. However, sentences which appear in documents that were not preselected cannot be aligned.
In this work, we make no assumptions about the structure of the monolingual text corpora: we simply compare all sentences against each other. Our experimental results indicate that such an approach works surprisingly well: we are able to mine billions of parallel sentences which appear to be of high quality, in the sense that NMT systems trained only on our mined data outperform the currently best single NMT systems in WMT'19 and WAT'19.
The paper is organized as follows. In the next section, we first discuss related work. We then present the corpus used in this work and summarize the underlying mining approach. Section 4.3 describes in detail how we applied this approach to extract parallel sentences. To assess the quality of the extracted bitexts, we train NMT systems for a subset of language pairs and evaluate them on the TED corpus (Qi et al., 2018) and on test sets of WMT (Barrault et al., 2019) and of the Workshop on Asian Translation (WAT) (Nakazawa et al., 2019). These results are presented in Section 6. The paper concludes with a discussion of future research directions.

Related work
There is a large body of research on mining parallel sentences in collections of monolingual texts, usually named "comparable corpora". Initial approaches to bitext mining relied on heavily engineered systems, often based on metadata information, e.g. (Resnik, 1999; Resnik and Smith, 2003). More recent methods explore the textual content of the comparable documents. For instance, it was proposed to rely on cross-lingual document retrieval, e.g. (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005), or machine translation, e.g. (Abdul-Rauf and Schwenk, 2009; Bouamor and Sajjad, 2018), typically to obtain an initial alignment that is then further filtered. In the shared task for bilingual document alignment (Buck and Koehn, 2016), many participants used techniques based on n-gram or neural language models, neural translation models and bag-of-words lexical translation probabilities for scoring candidate document pairs. The STACC method uses seed lexical translations induced from IBM alignments, which are combined with set expansion operations to score translation candidates through the Jaccard similarity coefficient (Etchegoyhen and Azpeitia, 2016; Azpeitia et al., 2017, 2018). Using multilingual noisy web crawls such as ParaCrawl for filtering good-quality sentence pairs has been explored in the shared tasks for high-resource (Koehn et al., 2018) and low-resource languages.
In this work, we rely on massively multilingual sentence embeddings and margin-based mining in the joint embedding space, as described in (Schwenk, 2018; Artetxe and Schwenk, 2018a,b). This approach has also proven to perform best in a low-resource scenario. Closest to this approach is the research described in España-Bonet et al. (2017), Guo et al. (2018) and Yang et al. (2019). However, in all these works, only bilingual sentence representations were trained; such an approach does not scale to many languages. Finally, related ideas have also been proposed in Bouamor and Sajjad (2018) or Grégoire and Langlais (2017). However, in those works, mining is not solely based on multilingual sentence embeddings; the embeddings are part of a larger system.
Wikipedia is arguably the largest comparable corpus with high-quality, human-verified texts. One of the first attempts to exploit this resource was performed by Adafre and de Rijke (2006). An MT system was used to translate Dutch sentences into English and to compare them with the English texts. This method yielded several hundred Dutch/English parallel sentences. Later, a similar technique was applied to the Persian/English pair (Mohammadi and GhasemAghaee, 2010). Structural information in Wikipedia, such as the topic categories of documents, was used in the alignment of multilingual corpora (Otero and López, 2010). In another work, the mining approach of Munteanu and Marcu (2005) was applied to extract large corpora from Wikipedia in sixteen languages (Smith et al., 2010). Otero et al. (2011) measured the comparability of Wikipedia corpora by translation equivalents in three languages: Portuguese, Spanish, and English. Patry and Langlais (2011) came up with a set of features, such as Wikipedia entities, to recognize parallel documents, but their approach was limited to a bilingual setting. Tufis et al. (2013) proposed an approach to mine parallel sentences from Wikipedia textual content, but they only considered high-resource languages, namely German, Spanish and Romanian paired with English. Tsai and Roth (2016) grounded multilingual mentions to English Wikipedia by training cross-lingual embeddings on twelve languages. Gottschalk and Demidova (2017) searched for parallel text passages in Wikipedia by comparing their named entities and time expressions. Finally, Aghaebrahimian (2018) proposes an approach based on bilingual BiLSTM sentence encoders to mine German, French and Persian parallel texts with English. Parallel data consisting of aligned Wikipedia titles has been extracted for twenty-three languages.
Since Wikipedia titles are rarely entire sentences with a subject, verb and object, it seems that only modest improvements were observed when adding this resource to the training material of NMT systems.
We are aware of two large-scale mining approaches applied to several language pairs and large collections of texts. The European project ParaCrawl focuses on mining parallel data for all European languages, mainly aligned with English. The underlying alignment engine, called Bitextor, uses a two-stage approach: first, parallel documents are identified, and then pairs of documents are processed to identify parallel segments. Sentence alignment uses either a seed MT system or bilingual lexicons (Esplà-Gomis and Forcada, 2010). In another work, parallel sentences were mined in Wikipedia for many language pairs using a margin criterion in a multilingual sentence embedding space.

The curated Common Crawl corpus
In this work, we propose to mine parallel sentences from the Web, using the data released by the Common Crawl project. Each snapshot of the Web, containing terabytes of web pages in various languages, is obtained by randomly exploring URLs. We start by applying some preprocessing steps to the raw text data, following the pipeline introduced by Wenzek et al. (2019) and leading to the CCNet dataset. The first step is to deduplicate the data at the paragraph level, as the original crawls contain up to 70% duplicated data. This preprocessing removes low-quality content, such as boilerplate, navigation menus or cookie warnings. The second step of the pipeline is to identify the language of each document, using fastText (Grave et al., 2018). This language identifier uses a linear classifier with character n-gram features and can recognize up to 176 languages. Finally, the last step of the preprocessing is to filter low-quality content by training a language model on Wikipedia and only keeping documents with a low perplexity score. We refer the reader to Wenzek et al. (2019) for more details about this preprocessing pipeline. In Figure 1, we report the number of unique sentences obtained after preprocessing ten snapshots from Common Crawl.
We currently process 38 languages. The English Web content is abundant and we used only one snapshot.

Distance-based mining approach
The underlying idea of the mining approach used in this work is to first learn a multilingual sentence embedding, i.e. an embedding space in which semantically similar sentences are close, independently of the language they are written in. This means that the distance in that space can be used as an indicator of whether two sentences are mutual translations or not. Using a simple absolute threshold on the cosine distance was shown to achieve competitive results (Schwenk, 2018). However, it has been observed that an absolute threshold on the cosine distance is globally not consistent, e.g. (Guo et al., 2018).

Margin criterion
Artetxe and Schwenk (2018a) showed that the alignment quality can be substantially improved by using a margin criterion instead of an absolute threshold. The margin between two candidate sentences x and y is defined as the ratio between their cosine similarity and the average cosine similarity of their nearest neighbors in both directions:

margin(x, y) = cos(x, y) / ( Σ_{z ∈ NN_k(x)} cos(x, z)/2k + Σ_{z ∈ NN_k(y)} cos(y, z)/2k )   (1)

where NN_k(x) denotes the k unique nearest neighbors of x in the other language, and analogously for NN_k(y). Artetxe and Schwenk (2018a) describe the "max-strategy" as one of the best performing ones: the margin is first calculated in both directions for all sentences in languages L1 and L2. Then, the union of these forward and backward candidates is built, candidates are sorted, and pairs whose source or target sentences were already used are omitted. Finally, a threshold is applied on the margin score to decide whether two sentences are mutual translations or not. The reader is referred to Artetxe and Schwenk (2018a) for a detailed discussion of related work. The same max-strategy was previously used to mine parallel sentences in Wikipedia.
This strategy was initially motivated by an evaluation on the BUCC corpus (Zweigenbaum et al., 2018), for which the reference alignments were known to be strictly 1:1. With increasing corpus size, namely billions of sentences in CCNet, the probability of finding several perfect translations increases. This calls into question the restriction that each source sentence is aligned to exactly one target sentence, and vice versa. The value of k in Equation 1 should also be carefully selected: if all k nearest sentences are valid translations, they have similar distances, which leads to a small margin and would cause many valid translations to be excluded. Therefore, we increased the neighborhood size k in Equation 1 from 4, as used in the earlier Wikipedia work, to 16.
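As an illustration, the ratio margin of Equation 1 can be sketched in plain NumPy. This is our own brute-force simplification over toy data, not the actual CCMatrix pipeline, which obtains nearest neighbors from compressed FAISS indexes:

```python
import numpy as np

def margin_scores(x_emb, y_emb, k=16):
    """Ratio margin of Equation 1 over two sets of L2-normalized sentence
    embeddings, computed brute force for illustration only.
    x_emb: (n_x, d) source embeddings; y_emb: (n_y, d) target embeddings.
    """
    sim = x_emb @ y_emb.T                      # cosine similarities (unit vectors)
    k_x = min(k, sim.shape[1])                 # neighborhood sizes, capped
    k_y = min(k, sim.shape[0])
    # Average similarity to the k nearest neighbors, in both directions.
    avg_x = np.sort(sim, axis=1)[:, -k_x:].mean(axis=1)   # NN_k(x) in L2
    avg_y = np.sort(sim, axis=0)[-k_y:, :].mean(axis=0)   # NN_k(y) in L1
    # margin(x, y) = cos(x, y) / (avg_x/2 + avg_y/2), cf. Equation 1.
    return sim / (0.5 * (avg_x[:, None] + avg_y[None, :]))

# Toy example: three unit-norm "source" and "target" embeddings in R^4.
x = np.array([[1., 0, 0, 0], [0, 1., 0, 0], [0, 0, 1., 0]])
y = np.array([[0.8, 0.6, 0, 0], [0, 1., 0, 0], [0, 0, 0.6, 0.8]])
scores = margin_scores(x, y, k=2)
pairs = [(i, int(scores[i].argmax())) for i in range(3)]  # best target per source
```

Note how the neighborhood terms normalize the raw cosine similarity: a pair only scores high if the two sentences are much closer to each other than to their respective other neighbors.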

Multilingual sentence embeddings
Distance-based bitext mining requires a joint sentence embedding for all the considered languages. One may be tempted to train a bilingual embedding for each language pair, e.g. (España-Bonet et al., 2017; Guo et al., 2018; Yang et al., 2019), but this is difficult to scale to the thousands of language pairs present in CCNet. Instead, we use one single massively multilingual sentence embedding for all languages, namely the one provided by the open-source LASER toolkit (Artetxe and Schwenk, 2018b).
The underlying idea of LASER is to train a sequence-to-sequence system on many language pairs at once using a shared BPE vocabulary and a shared encoder for all languages. The sentence representation is obtained by max-pooling over all encoder output states. Figure 1 illustrates this approach. The reader is referred to Artetxe and Schwenk (2018b) for a detailed description.
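The pooling step can be sketched as follows; the encoder states below are toy values, not actual LASER outputs:

```python
import numpy as np

def pool_sentence_embedding(encoder_states):
    """Max-pool a sequence of encoder output states into one fixed-size
    sentence embedding, as done in LASER: take the dimension-wise maximum
    over all time steps.
    encoder_states: (seq_len, dim) array of encoder outputs.
    Returns a (dim,) sentence embedding.
    """
    return encoder_states.max(axis=0)

# Toy example: a 5-token sentence with 4-dimensional encoder states.
states = np.array([
    [0.1, -0.2, 0.3, 0.0],
    [0.5,  0.1, -0.1, 0.2],
    [-0.3, 0.4, 0.2, 0.1],
    [0.0,  0.0, 0.6, -0.5],
    [0.2, -0.1, 0.0, 0.3],
])
emb = pool_sentence_embedding(states)  # -> [0.5, 0.4, 0.6, 0.3]
```

Because the pooling is independent of sequence length, sentences of any length in any of the 93 training languages map into the same fixed-dimensional space.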

Scaling to billions of sentences
We use the same underlying mining procedure as in the earlier Wikipedia work, which extracted 135 million parallel sentences in 1620 different language pairs. However, our CCNet corpus is more than fifty times larger than Wikipedia: 32.7 billion against 595 million unique sentences. Our largest corpora are English and Russian, with 8.7 and 3 billion unique sentences, respectively. For ten languages, CCNet has more than one billion unique sentences (see Figure 1). This required us to significantly modify the mining pipeline in order to tackle the substantially increased computational complexity. The overall processing pipeline can be structured into three tasks:
1. text extraction and processing, including sentence splitting and language identification;
2. creation of a compressed index for each language;
3. mining parallel data for each language pair using the sentence embeddings and indexes.
For each step, we aimed to parallelize the processing as much as possible by splitting the data into several blocks of about fifty million sentences each. This size was chosen so that the different operations can be performed in a couple of hours. As an example, all the English texts are split into 160 blocks.
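The blockwise organization amounts to chunking a sentence stream into fixed-size blocks that can then be processed independently. A minimal sketch (the function name is our own; the ~50M block size follows the text, reduced here so the example runs):

```python
from itertools import islice

def iter_blocks(sentences, block_size=50_000_000):
    """Split a (possibly huge) iterable of sentences into fixed-size blocks
    so that embedding, deduplication and indexing can run per block in
    parallel. Yields lists of at most block_size sentences.
    """
    it = iter(sentences)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block

# Toy example: 10 sentences in blocks of 4 -> block sizes 4, 4, 2.
blocks = list(iter_blocks((f"sent {i}" for i in range(10)), block_size=4))
sizes = [len(b) for b in blocks]  # -> [4, 4, 2]
```

Because the input is consumed lazily, the full corpus never needs to be held in memory at once.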

Text extraction
The first task, text extraction and processing, consists of the following four steps: • Extract the texts from the JSON data of CCNet (see Wenzek et al. (2019) for details).
• Split the paragraphs into sentences.
• Perform LID and exclude sentences which are not in the expected language.
• Mark all sentences which are duplicates within each block.
Each of these four steps processes all blocks in parallel. As a final step, we merge all the blockwise-deduplicated sentences and create one set of globally unique sentences for each language. We use a freely available Python tool to detect sentence boundaries. If specific rules for a language are not available, we fall back to a linguistically similar language, e.g. we use Spanish rules for Galician, and default to English otherwise. Most of the Asian languages are handled by regular expressions. We exclude sentences with more than 500 characters. LID is performed at the sentence level with fastText (Joulin et al., 2016). Once the text preparation task is finished, we have a corpus of N_i unique sentences for each language L_i. These texts are the basis for the index creation and mining tasks. The amount of data for each language is given in Table 3, third column.
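The per-block filtering and deduplication logic can be sketched as follows. The function name is our own, and the language identifier is stubbed out with a toy callable; in the real pipeline it would be a fastText LID model (e.g. loading lid.176.bin and calling its predict method):

```python
import hashlib

def process_block(sentences, expected_lang, predict_lang, max_chars=500):
    """Blockwise filtering as in the text-extraction task: drop sentences
    longer than 500 characters or not in the expected language, then drop
    within-block duplicates using a hash set.

    predict_lang is any callable mapping a sentence to a language code;
    the real pipeline uses a fastText language identifier here.
    """
    seen = set()
    kept = []
    for s in sentences:
        if len(s) > max_chars:
            continue                      # over-long sentence
        if predict_lang(s) != expected_lang:
            continue                      # wrong language
        h = hashlib.sha1(s.encode("utf-8")).hexdigest()
        if h in seen:
            continue                      # duplicate within this block
        seen.add(h)
        kept.append(s)
    return kept

# Toy LID stub: classify by presence of a German character.
toy_lid = lambda s: "de" if "ü" in s else "en"
block = ["Hello world.", "Hello world.", "Grüße aus Berlin.", "x" * 600]
out = process_block(block, "en", toy_lid)  # -> ["Hello world."]
```

Hashing sentences instead of storing them keeps the per-block memory footprint small; the global deduplication across blocks is then a merge of the per-block outputs.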

Index creation
We follow the same approach as in the earlier Wikipedia work and use the highly optimized FAISS toolkit (Johnson et al., 2017) to create compact indexes of the sentence embeddings. LASER's sentence representations are 1024-dimensional, so storing the embeddings of all sentences would require 32.7 · 10^9 × 1024 × 4 bytes ≈ 130 TB. We therefore use an aggressive vector compression based on a 64-bit product quantizer (Jégou et al., 2011). To account for the huge number of sentences, we increase the number of cells which partition the search space from 32k to 64k. This corresponds to the index type OPQ64,IVF65536,PQ64 in FAISS terms.
Exhaustive search in huge indexes is only tractable if performed on GPU. FAISS supports sharding a single index over multiple GPUs; this is most efficient if the GPUs are in the same machine and communicate very quickly. For our index type, and eight GPUs with 32GB of memory each, this allows us to create an index of about three billion sentences. This covers all languages with the exception of English, with its 8.7 billion sentences; therefore, we created three English indexes of about 2.7 billion sentences each.
The processing pipeline to train and create the indexes is summarized in Figure 2. First, we train an index on 40 million sentences sampled from the whole corpus, when available. Once the index is trained, the data in each block is independently added to the common trained index; this can also be processed in parallel. These individual indexes are then merged into one index per language. The Russian and Japanese indexes with three billion sentences have a file size of about 200GB; all 28 indexes total about 2TB.

Mining
Once indexes for all languages are calculated, we can start the mining process for each language pair. In the earlier Wikipedia work, the sentence embeddings were pre-calculated for all languages before the pairwise mining process was started. Less than 3.5 hours on 8 GPUs were needed for the whole "max-mining" process between English and German, i.e. 134M and 51M sentences respectively. This corresponds to about 1.34 · 10^8 × 5.1 · 10^7 ≈ 6.8 · 10^15 distance calculations.
Let us consider mining Japanese/Russian bitext in CCNet with 3.0 and 2.9 billion sentences respectively, i.e. 3 · 10^9 × 2.9 · 10^9 ≈ 8.7 · 10^18 distance calculations. This means that we have to perform about 1300 times more distance calculations, which would translate to more than 6 months on a single machine with 8 GPUs. We tackle this computational challenge by decoupling the forward and backward distance calculations from the margin calculation (see Equation 1), and processing all these steps in parallel. This processing pipeline is illustrated in Figure 3.
In addition, we had to use a special procedure to mine for parallel sentences with English, due to the large amount of English data. For the sake of explanation, let us assume that we want to extract German/English bitexts. It is computationally too expensive to perform k-nn search in the German FAISS index for all 8.7 billion English sentences (backward distances). Therefore, we are constrained to use only the forward distances de → en. Remember that we had to partition all the English sentences into three indexes of about 2.7 billion sentences each. Consequently, for each German sentence, we search in the three different English indexes and calculate the margin with respect to the k = 16 nearest neighbors. We then combine the alignments and keep those with a margin above a threshold of 1.06. It can happen that the algorithm finds a valid translation in each of the three indexes. We decided to keep these alternative translations.
For all other language pairs L1-L2, we used the max-margin strategy as described in Section 4 and Equation 1, i.e. calculating both the forward L1 → L2 and backward L2 → L1 distances.
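The English-side shard combination can be sketched as follows. This is a toy, brute-force illustration with a simplified one-sided margin (cosine of the candidate over the mean cosine of its forward neighborhood); the real pipeline searches the three compressed FAISS indexes and uses the full Equation 1 neighborhood terms:

```python
import numpy as np

def search_shards(query, shards, k=16, threshold=1.06):
    """Forward-only mining for one source embedding: search every English
    shard, compute a simplified one-sided margin for each of the k nearest
    neighbors, and keep all candidates above the threshold (so valid
    translations found in several shards are all retained).
    """
    candidates = []
    for shard_id, shard in enumerate(shards):
        sim = shard @ query                    # cosine sims to this shard
        kk = min(k, len(sim))
        nn = np.argsort(sim)[-kk:][::-1]       # k nearest neighbors, best first
        avg = sim[nn].mean()                   # forward neighborhood term
        for j in nn:
            margin = sim[j] / avg              # simplified one-sided margin
            if margin >= threshold:
                candidates.append((shard_id, int(j), float(margin)))
    return candidates

# Toy data: one German query embedding searched in two small English shards
# (the paper uses three shards of ~2.7 billion sentences each).
query = np.array([1.0, 0.0, 0.0, 0.0])
shards = [
    np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]]),
    np.array([[0.6, 0.8, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]]),
]
candidates = search_shards(query, shards, k=2)
```

Keeping every above-threshold candidate from every shard is what allows a source sentence to retain several alternative English translations, as described above.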

Quantitative result analysis
Mining for parallel sentences in more than 30 billion sentences is computationally very expensive. In the current version of the CCMatrix corpus, we have limited the alignment process to 38 languages, chosen to cover several language families and scripts. In the following, we first discuss the amount of extracted sentences. We then turn to a qualitative assessment by training NMT systems for many language pairs (Section 6).

Choosing the margin threshold
The margin threshold used to mine parallel sentences impacts the quality of the produced bitexts. A higher threshold leads to better aligned sentences, and thus higher-quality bitexts, but also to smaller datasets. There is thus a trade-off between the size of the extracted bitexts and their quality. Exploratory experiments showed that a threshold around 1.06 gives good results. To confirm this, we trained and evaluated machine translation systems on the Hu-Da pair for different values of the threshold. We report results in Fig. 4, showing that 1.06 leads to the best performance.

Analysis
We were able to mine in total 4.5 billion parallel sentences when using a threshold of 1.06 on the margin, out of which 661 million are aligned with English (see Table 2).
Most current MT systems focus on translation from or into English. Other language pairs are usually handled by pivoting through English, since direct parallel texts are much smaller. This can be suboptimal when translating between two morphologically rich languages, e.g. French/German, or very different ones, e.g. Russian/Japanese. We also provide parallel data for many language pairs not involving English. Due to the high computational complexity, we only considered 28 languages (see Table 3). This yielded several billion parallel sentence pairs. To the best of our knowledge, this makes CCMatrix the largest collection of high-quality mined parallel texts.
The general tendency is of course that mining in larger monolingual corpora leads to larger extracted bitexts. This is however not systematically true. Let us consider, for example, Polish and Dutch, which both have about 500 million unique sentences. When aligned with Czech, a Slavic language, there are slightly more bitexts with Polish than with Dutch (13.2M in comparison to 11.6M). When aligned with German, a Germanic language like Dutch, there are substantially more bitexts for Dutch than for Polish, 33.2M and 20.4M respectively. Finally, both Polish and Dutch have much smaller bitexts with Indonesian, although there are more than 360M sentences for that language.
On one hand, a possible explanation could be that LASER alignments are more reliable for languages which are very similar, i.e. in the same language family. On the other hand, it may also be that people who live in nearby countries have similar interests, which increases the chance of finding translations on the Web.

Qualitative result evaluation
Aiming to perform a large-scale assessment of the quality of the extracted parallel sentences, we trained NMT systems on the mined bitexts and evaluated them on several public test sets. We identified a publicly available data set which provides test sets for many language pairs: translations of TED talks, as proposed in the context of a study on pretrained word embeddings for NMT (Qi et al., 2018). The Workshop on Machine Translation (WMT) has a long history of organising evaluations of machine translation, and many comparative results are published for these tasks (Barrault et al., 2019). We provide very competitive BLEU scores for several WMT'19 evaluation tasks in Section 6.2. Finally, we consider the task of translating between Russian and Japanese, as proposed by the 2019 edition of the Workshop on Asian Translation (see Section 6.3).

TED corpus
In this set of experiments, we are interested in the performance of NMT systems trained on our bitexts only. Following (Gottschalk and Demidova, 2017), we evaluate on the test sets of the TED dataset (Qi et al., 2018). This dataset contains parallel TED talk transcripts in 50 languages. The TED datasets are tokenized, so we first detokenize them using Moses, with the exception of pairs involving Korean, because detokenization creates artifacts there. As we do not include the training set provided with the TED dataset, our bitexts are not guaranteed to cover the same domains.
We train a separate NMT system for each language pair. We tokenize the dataset with Moses, with the exception of Chinese, where we use Jieba, and Japanese, where we use Mecab, and compute a BPE vocabulary of size 60k on the resulting tokenized training bitext. Then, for all pairs, we train the same architecture: a Transformer with 6 layers for both the encoder and decoder, an embedding dimension of 512 and a feed-forward dimension of 4096. We train each model for 100 epochs with an initial learning rate of 0.001 and keep the model with the best BLEU score on the TED validation set.
In Table 4, we report BLEU on the test set using the Moses tokenization. The average BLEU is 16.3 over all pairs and 26.9 for pairs with English. In comparison with WikiMatrix, 46 of our 702 pairs reach a BLEU above 30, against only 10 of their 1620 language pairs. Their best pair reached 37.3 BLEU (Brazilian Portuguese to English), while we have 11 pairs that surpass 37.3, with our best pair reaching 45.2 BLEU (Norwegian to English). These results give an indication of the quality of the mined parallel sentences and suggest that LASER is robust to the noise and domain differences that exist in a large corpus like Common Crawl.

WMT'19 evaluation
We also evaluate translation on the WMT'19 news translation task. We only consider high-resource directions for this comparison, as they constitute the biggest challenge: the existing baseline systems perform strongly, and achieving superior performance with mined data is hard. We follow the setup described in (Ng et al., 2019) to train systems on En-De, En-Ru, En-Zh and De-Fr. We used the Transformer Big architecture with an increased FFN size (8192) and trained these models for 500k updates on 8 GPUs with a batch size of 3500 tokens. Given the large amounts of mined bitexts for the considered language pairs (see Table 3), we limit the sentence pairs to those with a score higher than or equal to 1.07, except for En-Zh, where we apply a threshold of 1.06. This gives us 40.6M En-De, 39.5M En-Ru, 32.6M De-Fr and 17.6M En-Zh sentence pairs. For each direction, we learned a joint source-target BPE encoding (Sennrich et al., 2016) and used shared input/output embeddings. For the En-De and En-Ru models, we increased the model size further to 9 encoder and decoder layers with layer dropout (Fan et al., 2019) and an embedding dimension of 2048. We tuned training parameters on newstest2014-2016 when available, and on the WMT'19 dev set for De-Fr.
We compare the performance of a single model for each direction with the performance of published single models trained on bitext data only. We find that systems trained on CCMatrix outperform systems trained on bitext data (see Table 5). To investigate how this mined data combines with real human-translated data, we trained a system on a combination of CCMatrix and the bitexts provided by WMT'19, taking En-De as an example. This system outperforms the system trained on CCMatrix data only by 0.8 BLEU points on average, achieving a BLEU score of 50.9 on newstest2018 and 45.1 on newstest2019.

WAT'19 evaluation
Finally, we evaluated the translation between Russian and Japanese as proposed in the 2019 Workshop on Asian Translation (WAT) (Nakazawa et al., 2019). According to the organizers, this language pair represents "an extremely low resource situation for distant language pairs". The organizers provide only a tiny amount of parallel data from the Global Voices domain for training (12,356 sentences) and a small development set (486 sentences). Participants could also use other Russian/English and Japanese/English bitexts and train multilingual NMT systems.
We trained an NMT system on CCMatrix Russian/Japanese bitexts only, without using other resources or texts aligned with English. We applied a threshold of 1.06 on the margin, which yielded 9.3 million parallel sentences. We use the same NMT architecture as in Section 6.2, but without layer dropout. We report tokenized BLEU scores using multi-bleu.perl, with Moses tokenization for Russian and Mecab for Japanese. We were able to outperform the best performing system at the WAT'19 evaluation (see Table 6), in particular when translating into Japanese. The participants in the WAT translation task were constrained to only use the provided resources, which included alignments with English. Therefore, our results are not directly comparable, but we argue that they are still a good indicator of the alignment quality of our mined bitexts.

Conclusion
We have shown that margin-based mining in a joint multilingual sentence embedding space can be scaled to monolingual texts of more than 32 billion unique sentences in 38 languages. Our approach is generic and simply compares all sentences against each other, without requiring any document alignment. We tackled the computational complexity by parallelizing all processing steps. This procedure yielded 661 million sentences aligned with English, and 4.5 billion sentences in total over pairwise alignments of 28 languages. To the best of our knowledge, this is by far the largest collection of high-quality parallel sentences.
We have performed an extensive evaluation of the quality of the mined bitexts by training NMT systems for many language pairs. The mined bitexts appear to be of high quality: training only on our mined data, we are able to outperform the best reported single NMT systems at the WMT'19 evaluations for the translation between English and German, Russian and Chinese, as well as between German and French. We also achieve state-of-the-art BLEU scores for the translation between Russian and Japanese on the WAT'19 test set. We will provide a script to reproduce our results on the LASER github. In the next version of the CCMatrix corpus, we will increase the number of Common Crawl snapshots and focus on low-resource languages. The mined data can also be used to train improved multilingual sentence embeddings. The large amount of parallel data also raises the interesting question of how to best use it, for instance, how to efficiently train NMT systems on more than fifty million high-quality bitexts.

Acknowledgments
We would like to thank Matthijs Douze for support with the use of FAISS and Vishrav Chaudhary for helpful comments on this work.