Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages



Introduction
There is increasing interest in multilingual sentence representations since they promise an appealing approach to extend NLP tasks to a large number of languages, without the need to separately train a language-specific model. Most current work on multilingual sentence representations has focused on training one model which handles all languages of interest, e.g. (Artetxe and Schwenk, 2019b; Feng et al., 2020; Reimers and Gurevych, 2020; Ramesh et al., 2022). The main motivation is that languages with limited resources will benefit from the fact that the same model has learned other (similar) languages. Zero-shot performance is of particular interest: the model generalizes well to a new language although it has never seen training data in that language. Training massively multilingual models faces several problems as the number of languages increases: how to make sure that all languages are learned, how to account for the large imbalance of available training data, and the high computational complexity. Reimers and Gurevych (2020) proposed a teacher-student approach to extend an existing (monolingual) sentence embedding space to new languages. We build on this generic idea and propose multiple improvements which significantly improve performance, namely different teacher and student architectures, several supervised and unsupervised training criteria, and language-specific encoders. We also investigate challenges when handling low-resource languages, showcased by training models for 50 African languages. To the best of our knowledge, many of these languages are not handled by any other sentence encoder or pretrained model. We have test data for 44 out of 50 languages, mine bitexts against 21.5 billion English sentences, and train NMT models to translate into English.
Multilingual sentence embeddings have many applications, which is reflected by the several metrics used to evaluate them, e.g. the XTREME benchmark (Hu et al., 2020a; Ruder et al., 2021). In this work, we focus on the use of multilingual sentence embeddings for similarity-based bitext mining, as proposed by Artetxe and Schwenk (2019a), and on using these mined bitexts to improve NMT. Consequently, our primary metric is NMT performance. However, mining and NMT training are computationally expensive, and it is intractable to systematically perform this evaluation for many different sentence encoder variants. As an evaluation proxy, we use the multilingual similarity search error rate. In contrast to previous work which used the Tatoeba test set, e.g. (Artetxe and Schwenk, 2019b; Hu et al., 2020b; Reimers and Gurevych, 2020), we switch to the FLORES evaluation benchmark, which contains high-quality human-translated texts from Wikipedia (Goyal et al., 2021) and covers many low-resource languages.
The contributions of this work can be summarized as follows: 1) we move away from the popular one-for-all approach and train multiple, mutually compatible language (family) specific encoders; 2) we explore several variants and improvements of teacher-student training for multilingual sentence representations (section 3), and propose a new approach which combines supervised teacher-student training with self-supervised MLM training to better handle very low-resource languages (subsection 5.3); 3) the new model substantially improves on 12 languages which were handled poorly by the original LASER encoder (subsection 5.1); and 4) we train sentence encoders for 50 African languages, mine bitexts, and train NMT systems (section 6). To the best of our knowledge, many of these languages are not handled by any other NMT system.

Related work
Multilingual sentence representations Examples of approaches to learn multilingual representations are multilingual BERT (m-BERT), which covers 104 languages (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R, which was trained on 100 languages using crawled web data (Conneau et al., 2020). However, as these approaches do not take into account a sentence-level objective during training, they can result in poor performance when applied to tasks which use sentence representations, such as bitext retrieval (Hu et al., 2020b). In order to address this, methods such as SentenceBERT (SBERT) make use of a Siamese network to better model sentence representations (Reimers and Gurevych, 2019). LaBSE (Feng et al., 2020) uses a dual-encoder approach with a transformer-based architecture and an additive margin softmax loss (Yang et al., 2019). It covers 109 languages and is pre-trained using masked language modelling (MLM) and translation language modelling (TLM) objectives (Conneau et al., 2020). LaBSE was used to mine bitexts in eleven Indian languages (Ramesh et al., 2022). Another popular multilingual sentence embedding model is LASER (Artetxe and Schwenk, 2019b). It is based on an LSTM encoder/decoder architecture with a fixed-size embedding layer and no pre-training. LASER covers 93 languages.
When learning a multilingual embedding space, a limitation of many existing approaches is that they require training a new model from scratch each time a language is to be added. However, various methods have been proposed to address this. Wang et al. (2020) provide one such technique, which extends m-BERT to low-resource languages by increasing the size of the existing vocabulary and then continuing self-supervised training using monolingual data for a low-resource language. Another example by Reimers and Gurevych (2020) uses multilingual distillation. In this supervised teacher-student approach, the teacher is a monolingual model pre-trained on English (SBERT), and the student is a pre-trained multilingual model (XLM-R). Using bitexts, the student then extends the embedding space to the desired language(s) by applying a regression loss between the English sentence representation of the teacher and the target-language sentence representation of the student.
Scaling multilinguality Several recent works have addressed the challenges faced when scaling multilingual models to a hundred languages and beyond, namely massively multilingual NMT systems (Fan et al., 2020; Arivazhagan et al., 2019; NLLB Team et al., 2022). Recent studies explored the extension to more than a thousand languages (Siddhant et al., 2022; Bapna et al., 2022). Training NMT models for a large number of languages faces many challenges, and a large variety of architectures has been explored (Ma et al., 2021; Wang et al., 2022; Escolano et al., 2021; NLLB Team et al., 2022). To the best of our knowledge, similar modelling techniques have not yet been considered for training (massively) multilingual sentence encoders.
Resources for African languages Collecting resources, training NMT systems, and performing evaluations for African languages is the focus of several works (Dabre and Sukhoo, 2022; Emezue and Dossou, 2020; Siminyu et al., 2021; Abbott and Martinus, 2019; Azunre et al., 2021; Hacheme, 2021; Nekoto et al., 2020). The Masakhane project aims at providing resources to both strengthen and spur NLP research in African languages. A workshop focused on the evaluation of African languages will be held at EMNLP'22. In the framework of its data track, several parallel corpora were made available. In general, the number of languages covered is well below the 44 languages we evaluate in this work.

Architecture
The overall architecture of our approach is summarized in Figure 1. The teacher is an improved LASER encoder. Compared to the original training procedure described in Artetxe and Schwenk (2019b), we use SPM instead of BPE preprocessing, up-sampling of low-resource languages, and a new implementation in fairseq. All other parameters are unchanged: a 5-layer BiLSTM encoder, 1024-dimensional sentence embeddings obtained by max-pooling over the last layer, and training on 93 languages with public resources obtained from OPUS. The reader is referred to Artetxe and Schwenk (2019b) for details on the original LASER training procedure. We use this new multilingual sentence encoder as the teacher in all our experiments and refer to it as LASER2, and to the student models as LASER3. The code to train the teacher or student models is freely available in the fairseq github repository. Training of the students follows the general idea of a teacher-student approach as initially proposed by Reimers and Gurevych (2020), but with several important differences. We want to scale encoder training and bitext mining well beyond the roughly 100 languages handled by current multilingual encoders. This may include languages which are not covered by existing pretrained models, and retraining them would be computationally very expensive. Also, some languages may be written in a new script which is not covered. Therefore, we made the following design choices: (1) we do not initialize the student with a pretrained model, e.g. XLM-R, but use a random initialization; (2) the student is trained on 2M sentences of English monolingual data, and we also add 2M sentences of English-Spanish bitexts from CCMatrix (Schwenk et al., 2021) to better align with the teacher's multilingual embedding space; (3) instead of one massively multilingual model, we train multiple students, each for a small subset of (similar) languages, or even a single language; (4) we use separate SPM vocabularies for teacher and student, better accommodating scripts and tokens in the student languages which were unseen by the teacher (cf. subsection 5.2); (5) we optimize the cosine loss between the teacher and student embeddings, since this is the relevant metric for bitext mining (cf. Figure 1); (6) we jointly train distillation alongside an MLM criterion to additionally learn from monolingual data in the new language (cf. Figure 1 and subsection 5.3); (7) we add curriculum learning in the form of progressive distillation: instead of sending the entire sentence pairs all at once, we send incremental versions of the respective sentence pairs to both teacher and student, which we found to be helpful for some particularly challenging low-resource languages.
Our motivation for using a total of 4M English sentences is to "anchor" the student encoder to the teacher's embedding space, and hopefully to be able to learn new languages with a limited amount of parallel text. In initial experiments, we used a 6-layer BiLSTM encoder architecture as in Artetxe and Schwenk (2019b), but we saw consistent improvements by switching to a 12-layer transformer. We keep the same student architecture for all languages (L=12, H=1024, A=4, 250M parameters). Teacher-student training was performed on 16 GPUs with the Adam optimizer, a learning rate of 0.0005 and a batch size of 2,000. When minimizing the cosine distance only, max-pooling of the transformer outputs to obtain the fixed-size sentence representations worked best, compared to using a special token like [CLS]. For curriculum learning using progressive distillation, we incrementally send a percentage of subwords from each sentence pair (e.g. 10%, 20%, ..., 100%). We experimented with sending various incremental percentages of the sentence pairs to both teacher and student (e.g. 20%, 40%), but found 10% increments to perform best.
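To make the training objective concrete, the sketch below shows one distillation step under these choices: the LASER2 teacher is frozen, the student is a transformer whose outputs are max-pooled into a fixed-size embedding, the loss is the cosine distance between the two embeddings, and an optional truncation fraction implements progressive distillation. The teacher/student interfaces and tensor layout are assumptions for illustration, not the actual fairseq implementation.

```python
import torch
import torch.nn.functional as F

def max_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Max-pool encoder outputs (batch, seq, dim) over valid positions
    to obtain a fixed-size sentence embedding, as described above."""
    hidden = hidden.masked_fill(~mask.unsqueeze(-1), float("-inf"))
    return hidden.max(dim=1).values

def distillation_step(teacher, student, src, src_mask, tgt, tgt_mask, frac=1.0):
    """One teacher-student step on a batch of bitexts: the teacher encodes the
    side it already knows (e.g. English), the student encodes the new language.
    frac < 1 truncates both sides to the first fraction of subwords, i.e. the
    progressive-distillation curriculum (10%, 20%, ..., 100%)."""
    if frac < 1.0:
        s_len = max(1, int(src.size(1) * frac))
        t_len = max(1, int(tgt.size(1) * frac))
        src, src_mask = src[:, :s_len], src_mask[:, :s_len]
        tgt, tgt_mask = tgt[:, :t_len], tgt_mask[:, :t_len]

    with torch.no_grad():                         # the LASER2 teacher is kept frozen
        teacher_emb = max_pool(teacher(src), src_mask)
    student_emb = max_pool(student(tgt), tgt_mask)

    # minimize the cosine distance between teacher and student embeddings
    return (1.0 - F.cosine_similarity(teacher_emb, student_emb, dim=-1)).mean()
```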

Training and evaluation resources
Evaluation data Creating high-quality evaluation data for low-resource languages is challenging. In this work we evaluate our approach on two publicly available corpora: Tatoeba and FLORES. The Tatoeba corpus is a test set covering 112 languages (Artetxe and Schwenk, 2019b), and contains up to 1000 sentences for each language pair. FLORES is a freely available N-way parallel test set with 1012 sentences in the devtest set (Goyal et al., 2021). It initially covered 101 languages, and was recently extended to 200 languages (NLLB Team et al., 2022), including the 44 African languages on which we report results in this paper.

Monolingual data comes mostly from Common Crawl and other public sources like ParaCrawl, plus some additional targeted crawling for several low-resource languages. We have extended and improved fastText LID (Grave et al., 2018) to include the additional languages considered in this work. We trained this new LID model on publicly available monolingual data and evaluated it on FLORES. Preprocessing includes sentence splitting, filtering of sentences in the wrong script or with more than 20% numbers or punctuation, LID and deduplication, as well as LM filtering on English.
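As an illustration of these filters, the sketch below applies the number/punctuation threshold, fastText LID filtering, and exact deduplication; the model filename and confidence threshold are assumptions, and sentence splitting and English LM filtering are omitted.

```python
import fasttext  # fastText LID model extended to the languages considered here

lid = fasttext.load_model("lid_model.bin")  # assumed filename

def keep(sent: str, expected_lang: str, min_conf: float = 0.5) -> bool:
    """Keep a sentence if it is not dominated by digits/punctuation (>20%)
    and the LID prediction matches the expected language (threshold assumed)."""
    sent = sent.strip()
    if not sent:
        return False
    noisy = sum(ch.isdigit() or (not ch.isalnum() and not ch.isspace()) for ch in sent)
    if noisy > 0.2 * len(sent):
        return False
    labels, probs = lid.predict(sent.replace("\n", " "))
    return labels[0] == f"__label__{expected_lang}" and probs[0] >= min_conf

def dedup(sentences):
    """Exact deduplication while preserving order."""
    seen = set()
    for s in sentences:
        if s not in seen:
            seen.add(s)
            yield s
```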
Bitexts are obtained from OPUS (Tiedemann, 2012) and used to train the sentence encoders and NMT systems. The amount of available resources is summarized in Table 4.

Multilingual similarity search
As the end goal for our mined bitexts is to improve the quality of a translation system, our main evaluation is NMT quality. However, given the expense involved in both mining and training NMT systems, it is not tractable to perform such an evaluation each time a new encoder is trained. Therefore, as a proxy metric for our encoders we use a mining-based multilingual similarity search error rate, referred to in this work as xsim. Unlike cosine accuracy, which aligns source and target embeddings with the highest cosine similarity, xsim aligns based on the highest margin-based score, which has been shown to be helpful for mining (Artetxe and Schwenk, 2019a). In this work, we score using the ratio margin R, defined as:

$$R(x, y) = \frac{\cos(x, y)}{\sum_{z \in \mathrm{NN}_k(x)} \frac{\cos(x, z)}{2k} + \sum_{z \in \mathrm{NN}_k(y)} \frac{\cos(y, z)}{2k}}$$

where x and y are the source and target sentences, and NN_k(x) denotes the k nearest neighbors of x. We used k = 4. The xsim score is then defined as the error rate of wrongly aligned sentences in our test set, searching into English (i.e. xx → eng). The xsim evaluation tool is freely available.
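The xsim computation itself only requires the sentence embeddings of the N-way parallel test set; the following sketch implements the ratio margin above with exhaustive cosine similarities (no approximate search). It follows the same definition as the released evaluation tool, but is not that tool.

```python
import numpy as np

def xsim_error_rate(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> float:
    """Margin-based similarity search error rate: src_emb[i] should retrieve
    tgt_emb[i] (e.g. xx -> eng on FLORES). Rows are sentence embeddings."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    cos = src @ tgt.T                                     # pairwise cosine similarities

    # average similarity to the k nearest neighbors in each direction
    knn_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # over NN_k(x) among targets
    knn_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # over NN_k(y) among sources

    margin = cos / (0.5 * (knn_src[:, None] + knn_tgt[None, :]))
    predicted = margin.argmax(axis=1)                     # best target for each source
    return float((predicted != np.arange(len(src))).mean())
```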

Experimental evaluation: multilingual similarity search
In this section, we evaluate our proposed multilingual distillation approach using multilingual similarity search.

Improving LASER
LASER has already been shown to perform well on many languages. Rather than focusing on marginal improvements for these languages, we instead selected languages for which the original LASER encoder had an average accuracy of less than 90% on the Tatoeba test set. However, as the Tatoeba test set is translated by volunteers, contains a majority of easily confusable short sentences, and for some languages has far fewer than 1000 sentences, we propose in this work to instead primarily rely on the FLORES dataset as the ground truth. This dataset is of higher quality as a result of professional human annotation, and contains the same number of sentences across languages. Also, FLORES is N-way parallel, so the results are comparable among languages. To illustrate the difference between the two test sets, we provide results in Table 1 for the same languages across both.
In all instances on FLORES, we observe significant improvements over the original LASER encoders using our proposed teacher-student approach, and we also achieve results competitive with the much larger one-for-all model LaBSE (average difference of 0.4% xsim error rate). We also notice that there is a considerable difference between the two test sets. For example, on FLORES we report an xsim of 0.1% for Swahili, but an xsim of 16.4% on Tatoeba. To see if this phenomenon occurs with other representations, we also included LaBSE, for which we observed a similar effect. This stark difference further suggests that Tatoeba is a less reliable benchmark for evaluating sentence encoders. In particular, Tatoeba mainly contains very short sentences, which can create a strong bias towards a particular model or training corpus. Given this observation, in the rest of this work we move away from Tatoeba and instead evaluate on FLORES. We hope that other existing approaches and future work will also adopt evaluation on FLORES using a margin criterion.
Although we also hoped to show a comparison to the similar distillation method by Reimers and Gurevych (2020), their existing results were reported on Tatoeba (which, as shown above, is not very reliable to compare against), and their results were not reported using the margin score (cf. subsection 4.1). We attempted to evaluate their reported models on FLORES using xsim, but their model is not available. We also attempted to reproduce the authors' results by training new models using the provided code, but as we were not able to obtain the original training data, we could not reach a reasonably close outcome in order to make a fair comparison.

Language-specific encoders
In our first experiments, we used the same preprocessing and SPM vocabulary for each student as for the LASER2 teacher: a 50k SPM vocabulary which was trained on all LASER2 languages. On one hand, using a massively multilingual SPM vocabulary is expected to improve generalization among languages, and it is the only possible solution when training a massively multilingual model which handles all languages. On the other hand, low-resource languages may be badly modeled in a joint SPM vocabulary, i.e. mostly by very short SPM tokens, despite the use of up-sampling strategies. Our approach of training multiple sentence encoders, each one specific to a small number of languages, opens the possibility to train and use specific SPM vocabularies for each subset of languages. Table 2 summarizes the results of these different training strategies for two example languages: Amharic (amh) and Tigrinya (tir). Both are part of the family of Semitic languages and use the Ge'ez script; other major languages from this family, such as Aramaic, Arabic, Hebrew, Maltese and Tigre, all use their own specific script. Amharic was part of the 93 languages LASER2 was trained on, but its xsim error rate is rather high, and LASER2 generalizes badly to Tigrinya. We first trained a specific encoder for three Semitic languages: Amharic, Tigrinya and Maltese. We only added Maltese, which uses a Latin script, in order to avoid a multitude of different scripts having to be learnt by one encoder. This yields significant xsim improvements, to 0.2% and 1.19% respectively, highlighting the usefulness of teacher-student training and specific encoders for a small set of similar languages. We then trained an encoder for Amharic and Tigrinya only, paired with English as in all our experiments, and with a specific 8k SPM vocabulary to better support the Ge'ez script. This brought xsim down to 0.1% and 0.89%, respectively, although we use less training data. Our best model is on par with LaBSE, which of these two languages was trained on Amharic only, and significantly outperforms it for Tigrinya.
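Training such a small, language-specific SPM vocabulary is straightforward with the sentencepiece library; the sketch below mirrors the 8k vocabulary mentioned above, but the input file name, character coverage and model type are assumptions.

```python
import sentencepiece as spm

# Train a dedicated 8k-piece vocabulary on combined Amharic/Tigrinya text
# so that Ge'ez-script tokens are not fragmented into very short pieces.
spm.SentencePieceTrainer.train(
    input="amh_tir.txt",          # assumed: concatenated training text
    model_prefix="laser3_amh_tir",
    vocab_size=8000,
    character_coverage=0.9995,    # assumed value
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="laser3_amh_tir.model")
print(sp.encode("ሰላም ለዓለም", out_type=str))  # tokenize an Amharic sentence into pieces
```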

Joint training
In order to highlight the effect of jointly training our students with masked language modelling and curriculum learning, we chose a very low-resource language with few bitexts available for distillation alone: Wolof. As with previous students, we trained Wolof alongside closely related Senegambian languages: Fulah, Bassari, and Wamey, but the joint training strategies are only applied to Wolof. In total we used 21k bitexts, and an additional 94k sentences of monolingual data for Wolof.
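The joint objective can be sketched as follows: on bitext batches we apply the cosine distillation loss shown earlier, and on monolingual Wolof batches we add a standard masked-LM loss computed with an LM head on top of the student encoder. The masking rate, loss weighting and module interfaces are assumptions rather than the exact training recipe.

```python
import torch
import torch.nn.functional as F

def mlm_loss(student, lm_head, tokens, mask_id, mask_prob=0.15):
    """Masked-LM loss on monolingual data: randomly mask a fraction of subwords
    and predict them from the student encoder's outputs (mask_prob is assumed)."""
    labels = tokens.clone()
    masked = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    labels[~masked] = -100                                # ignore unmasked positions
    corrupted = tokens.masked_fill(masked, mask_id)
    logits = lm_head(student(corrupted))                  # (batch, seq, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)

def joint_loss(distill_loss, masked_lm_loss, mlm_weight=1.0):
    """Combine the cosine distillation term (on bitexts) with the MLM term
    (on monolingual text); the relative weight is a hyperparameter."""
    return distill_loss + mlm_weight * masked_lm_loss
```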
We observe a large reduction in xsim when using joint training (see Table 3). For example, we see a 40% relative reduction when adding the MLM criterion (21.05 → 12.65), and a further decrease from 12.65 to 5.93 when also adding curriculum learning. As we also observed a similar effect for other languages, these results suggest that jointly training distillation alongside masked language modelling and curriculum learning is beneficial for very low-resource languages.

Experimental evaluation: African languages
In this section, we investigate the challenges of training sentence encoders for 50 African languages, perform bitext mining, and train NMT models to translate these African languages into English.

Choice of African languages
We tried to cover as many African languages as possible. The main limitation was the availability of high-quality test sets to evaluate our models. FLORES-200 covers 44 African languages. We added 6 languages for which we have no FLORES test sets, namely Acholi, Luba, Luvale, Tiv, Venda and Zande, but sufficient resources to train sentence encoders and NMT systems. Statistics for the 44 languages with test data are given in Table 4. The largest family of African languages is by far the Bantu languages; other language families include Chadic, Cushitic, Kwa, Mande, Nilotic, Semitic and Senegambian.

Encoder training and evaluation
We have explored several techniques to train sentence encoders on multiple languages, grouped into "families" in different ways. We first trained one encoder on all African languages and then tried to improve on it by using smaller, language-family-specific models. Unfortunately, several language families have a very small total amount of bitext training data, in particular the Mande languages (83k). For these families, we were not able to train language-specific encoders which performed better than training them together with all other African languages. The following languages were trained separately: 1) Semitic: amh and tir; 2) Kwa languages: aka, ewe, fon and twi; and 3) Wolof (using MLM training).
Table 4 provides the xsim scores for all languages for which we have FLORES devtest data. We always use the LASER2 teacher model for English, and not the individual student models (which were also trained on English). This ensures that all students are mutually compatible and simplifies mining. For comparison, we also evaluated the publicly available LaBSE model on our test data. LaBSE was trained on a total of 109 languages, which includes 14 African languages (in bold in Table 4). LaBSE performs very well on all of them, except Wolof, for which it has an xsim of 26.2%; our encoder for Wolof achieves a 5.9% xsim error rate. LaBSE also generalizes well to 4 other languages (nso, run, ssw and tsn), but its xsim scores for the remaining languages are rather high. Our LASER3 sentence encoders have xsim error rates below 5% for 34 languages. The most difficult languages are cjk, dik, dyu, fon, fuv, kam, kau, kmb and umb. For most of them, we have a very limited amount of bitexts (less than 50k). In the appendix, we provide the xsim error rates among all African languages as well as against French. This demonstrates that the student encoders are mutually compatible with each other, and with the other languages of the LASER2 teacher.

Bitext evaluation
We now turn to using these encoders for bitext mining. We follow exactly the same margin-based mining procedure as described in Schwenk et al. (2021). Our main source of monolingual data was Common Crawl, complemented by targeted crawling (see section 4 for details on preprocessing). The amount of unique sentences is given in Table 4 in the column "Mono [k]". We mine against 21.5 billion unique English sentences.
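A simplified sketch of this margin-based mining is given below: candidate pairs are retrieved with an exact FAISS inner-product index and kept when their ratio margin exceeds a threshold (we use 1.06, see below). The real pipeline mines in both directions over sharded, compressed indexes of 21.5 billion English sentences; the single-direction, in-memory version here is only illustrative.

```python
import faiss
import numpy as np

def mine_pairs(src_emb, tgt_emb, k=4, threshold=1.06):
    """Return (src_idx, tgt_idx, margin) candidates above the margin threshold.
    Embeddings must be L2-normalized so inner product equals cosine similarity."""
    dim = tgt_emb.shape[1]
    fwd = faiss.IndexFlatIP(dim)                 # exact search; real mining shards this
    fwd.add(tgt_emb.astype(np.float32))
    sims, nbrs = fwd.search(src_emb.astype(np.float32), k)

    bwd = faiss.IndexFlatIP(dim)
    bwd.add(src_emb.astype(np.float32))
    rev_sims, _ = bwd.search(tgt_emb.astype(np.float32), k)

    avg_src = sims.mean(axis=1)                  # mean similarity to NN_k(x)
    avg_tgt = rev_sims.mean(axis=1)              # mean similarity to NN_k(y)

    pairs = []
    for i in range(src_emb.shape[0]):
        for cos_ij, j in zip(sims[i], nbrs[i]):
            margin = cos_ij / (0.5 * (avg_src[i] + avg_tgt[j]))
            if margin >= threshold:
                pairs.append((i, int(j), float(margin)))
    return pairs
```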
NMT training To evaluate the quality of the mined bitexts, we add them to the available public bitexts and train NMT systems translating from the foreign language into English, and compare the BLEU scores with baseline models trained on freely available bitexts only. We train NMT systems to translate separately from each language into English. We hope that this gives us a signal on the quality of the mined bitexts. For simplicity, we use the same architecture for all languages: a 6-layer transformer for both the encoder and decoder, 8 attention heads, ffn=4096 and 512-dimensional embeddings. Models were trained for 100 epochs on 32 GPUs. The results are summarized in the last three columns of Table 4.
Analysis. We observe significant gains in BLEU for several languages: e.g. fuv, sot, ssw, swh, tir and xho improve by more than 5 BLEU points, and amh, hau and som by more than 10. The most impressive result is obtained for Somali: training an NMT system on the available 179k bitexts yields 5.1 BLEU, which improves to 21.1 BLEU when adding 4.9M mined bitexts. We also obtain a nice result for Fulfulde: publicly available bitexts are extremely limited (26k) and we are able to reach 6.5 BLEU using mined bitexts, despite a sentence encoder with a high xsim error rate of 28.46%. There are 13 languages with BLEU scores below 5: aka, bam, cjk, dik, dyu, fon, kam, kau Arab, kau Latn, kmb, nus, umb and wol. The sentence encoders for most of these languages need to be improved further, but the limiting factor is often the amount of available monolingual data: we simply do not have enough data to mine in. A typical example is Akan (aka): we have a very good sentence encoder, since it was trained jointly with the other Kwa languages, but only 163k sentences of monolingual data. There is not much mining can do here. On average over all 44 languages, BLEU improved from 11.0 to 14.8.
These results should not be considered the best possible MT performance which can be obtained with the available resources. We made no attempt to optimize the precision/recall trade-off of the mining individually for each language pair, and used the same margin threshold of 1.06 throughout, nor did we adapt the NMT architecture and parameters to the amount of bitexts. Significant improvements in BLEU can be obtained by training a massively multilingual NMT system, as demonstrated in NLLB Team et al. (2022). In that work, the same underlying teacher-student approach was used to train students for more than 148 languages and to mine more than 880 million sentences of bitexts. Ablation studies have shown that the mined data brought an improvement of 12.5 chrF++ when translating into English, averaged over all 200 languages.

Acknowledgements
For their helpful contributions to this work, we would like to thank: Bapi Akula, Pierre Andrews, Angela Fan, Cynthia Gao, Kenneth Heafield, Philipp Koehn, Janice Lam, Alex Mourachko, Christophe Rogers and Guillaume Wenzek.

Conclusion
Multilingual sentence representations are key to extending NLP approaches to more languages, and they are the underlying engine for distance-based bitext mining, which has turned out to be crucial to scale NMT to more languages. In this work, we attack the challenge of scaling the LASER encoder to cover 50 African languages. To the best of our knowledge, only 14 African languages are handled and evaluated by current multilingual encoders.
We achieve this by moving away from a one-for-all approach to an improved teacher-student training of several encoders, each one trained on a small subset of languages. This enables us to better adapt the encoders to language specificities, e.g. a particular writing script, while maintaining mutual compatibility. Our new models significantly outperform the original LASER model on the FLORES test set, and we are on par with or better than other publicly available multilingual sentence encoders, namely LaBSE. We were also able to integrate monolingual data by jointly minimizing a cosine and an MLM loss. We showcase the potential of this technique for the Wolof language, reducing the xsim error rate from 21.1% down to 5.9%. We performed bitext mining for 44 African languages. Adding this data yielded an average improvement of 3.8 BLEU points for translation into English. The training code, the models and the mined bitexts are freely available.

Limitations
Our new teacher-student approach to independently train multiple, but mutually compatible, sentence encoders enabled us to attack many low-resource African languages. However, not all have a sufficiently low xsim error rate to mine high-quality bitexts. It is difficult to say whether these error rates are the result of missing resources to train the encoders, or inherent to the characteristics of each language. Several low-resource languages also lack a well-defined and standard writing system: they may be written following different standards or scripts in different regions. This obviously complicates training sentence encoders.
In addition, bitext mining by itself reaches its limits when not enough monolingual data is available, independent of the mining approach which is applied. On one hand, it could of course be that we were simply unable to locate and crawl this monolingual data. On the other hand, by handling more and more (very) low-resource languages, we might be faced with languages which are mainly spoken.

Figure 1: Architecture of our teacher-student approach.

Table 2: xsim error rates on FLORES devtest for Amharic and Tigrinya and different training strategies (see text for details). The amount of training data excludes 4M sentences of English for our models.

Table 4: List of African languages, available resources and result summary (on FLORES devtest). Languages in bold are handled at WMT'22. LaBSE's xsim error rates in bold correspond to languages it was trained on.