CLLE: A Benchmark for Continual Language Learning Evaluation in Multilingual Machine Translation

Continual Language Learning (CLL) in multilingual translation is inevitable when new languages need to be translated. Due to the lack of unified and generalized benchmarks, the evaluation of existing methods is heavily influenced by experimental design, which usually has a big gap from industrial demands. In this work, we propose CLLE, the first Continual Language Learning Evaluation benchmark for multilingual translation. CLLE consists of a Chinese-centric corpus, CN-25, and two CLL tasks, the close-distance language continual learning task and the language family continual learning task, designed for real and disparate demands. Different from existing translation benchmarks, CLLE considers several restrictions for CLL, including domain distribution alignment, content overlap, language diversity, and corpus balance. Furthermore, we propose COMETA, a novel framework based on Constrained Optimization and META-learning, which alleviates catastrophic forgetting and the dependency on historical training data by using a meta-model to retain the important parameters of old languages. Our experiments show that CLLE is a challenging CLL benchmark and that our proposed method is effective compared with other strong baselines. Because the corpus construction, task design, and evaluation method are independent of the central language, we also construct and release an English-centric corpus, EN-25, to facilitate academic research.


Introduction
Training a multilingual Neural Machine Translation (NMT) model jointly on all directions requires collecting the parallel corpora of all the languages in advance, which is impractical because translation requirements for new languages occur continuously. Adding new languages to a well-trained multilingual NMT model saves resources compared with training from scratch. However, directly finetuning on new languages results in catastrophic forgetting of historical languages. Continual Language Learning (CLL) methods (Berard, 2021; Garcia et al., 2021; Lyu et al., 2020; Escolano et al., 2019, 2020, 2021) focus on gradually extending the language capacity of multilingual NMT models without forgetting old languages, which is the major challenge of CLL tasks.
Existing multilingual NMT evaluation benchmarks (Akhbardeh et al., 2021; Qi et al., 2018; Schwenk et al., 2021; Zhang et al., 2020) focus more on multi-task NMT or continual domain learning (Thompson et al., 2019) and put little emphasis on CLL restrictions. Hence, most existing CLL methods (Berard, 2021; Garcia et al., 2021) are evaluated on traditional multilingual NMT benchmarks under simple experiments (e.g., training a multilingual NMT model and adding one specific new language), which are far from realistic CLL. In industrial settings (Lyu et al., 2020), more new languages and language families usually need to be continually learned across more continual learning stages. Due to the lack of a CLL benchmark, there are no rigorous evaluations of CLL methods with respect to the number of languages, the language family distribution, and the learning order. The experiment settings of existing methods (Berard, 2021; Garcia et al., 2021; Lyu et al., 2020; Escolano et al., 2019, 2020, 2021), such as the selection of new and old languages, the availability of historical training data, and the growth of model parameters, are not unified. Moreover, the observed catastrophic forgetting is more sensitive to the experimental design than to any inherent modeling limitation (Hussain et al., 2021). Hence, it is urgent and necessary to design a benchmark that unifies the configurations of CLL tasks.
In this work, we propose CLLE, the first CLL benchmark for the multilingual NMT scenario. CLLE consists of a Chinese-centric, domain-distribution-consistent multilingual parallel corpus, CN-25, which is collected by extracting and refining the CC-Matrix corpus (Schwenk et al., 2021). Specifically, CN-25 includes 25 languages aligned with Chinese, 23 of which have more than 650k sentence pairs. The corpus is refined with text-based filter rules and the LaBSE (Feng et al., 2022) multilingual model. The content domain distributions of the languages are made similar by adjusting the number of samples in each topic clustered by K-means.
We design the close-distance language continual learning (CLCL) task and the language family continual learning (LFCL) task for the CLLE benchmark to verify CLL methods under disparate experiment settings. Specifically, in the CLCL task, the new languages come from already learned language families, and new languages are added at only one learning stage. In the more challenging LFCL task, the new languages come from unseen language families, and more learning stages are introduced.
For the CLL tasks, we propose the COMETA framework based on Constrained Optimization (Thompson et al., 2019; Aljundi et al., 2018) and META-learning (de Masson d'Autume et al., 2019; Wang et al., 2020; Liang et al., 2022). We train a CNN-based meta-model to predict the performance of the NMT model on old languages according to its parameters. Then we use the meta-model to calculate importance weights that retain the language-specific embeddings (Qi et al., 2018; Mathur et al., 2019; Liang et al., 2021) of old languages. Compared with standard constrained optimization methods such as EWC (Thompson et al., 2019) and MAS (Aljundi et al., 2018), COMETA retains the knowledge of old languages without accessing the historical training data, and the importance weights can be dynamically updated, which is more flexible for the CLL process.
The main contributions of this work include:
• We introduce the first CLL benchmark, CLLE, which includes the CN-25 corpus and two CLL tasks; the construction method of the CN-25 corpus can be applied to any central language.
• We design two CLL tasks, derived from the requirements of real scenarios, to verify CLL methods under disparate experiment settings.
• We propose the COMETA method based on constrained optimization and meta-learning, which outperforms existing constrained optimization methods without using old training data.

Related work
Benchmarks in multilingual neural machine translation. In this section, we focus on multilingual NMT benchmarks that include Chinese. The WMT series (Bojar et al., 2016, 2017; Neves et al., 2018; Barrault et al., 2019; Specia et al., 2020; Akhbardeh et al., 2021) includes commonly used high-quality languages, most of which are aligned with English. The corpus is updated each year, and multilingual low-resource translation corpora, such as the Indo-European track (Akhbardeh et al., 2021), have been added. The WAT series (Nakazawa et al., 2014-2021) provides a multilingual multi-domain parallel corpus between Asian languages and English, and its patent task includes a Chinese-Japanese parallel corpus. FLORES-101 (Goyal et al., 2022) provides many-to-many evaluation sets of professionally translated sentences covering 101 languages. The above benchmarks are designed for multi-task multilingual NMT training or evaluation. Due to limitations on language and content distributions, they cannot meet the CLL requirements. In Section 3.4, we compare CN-25 with the above benchmarks in detail.
Continual language learning. Since the introduction of the Transformer to multilingual NMT, adding new languages to translation models has been increasingly studied. Thompson et al. (2019) use the diagonal Fisher matrix as importance weights to retain the parameters important to old tasks; although this classical constrained optimization method was designed for continual domain learning, the idea is still suitable for the CLL scenario. Lyu et al. (2020) introduce a modularized method that adds new language-specific encoder/decoder modules to the multilingual NMT model, which can satisfy industrial requirements. Similar architecture-based approaches (Escolano et al., 2019, 2020, 2021) have proved valid in the CLL scenario, but their drawback is that the model size grows as the task count increases. Garcia et al. (2021) propose a "vocabulary substitution" approach to add untranslated languages to a multilingual NMT model. The core idea is to reuse the embedding parameters of tokens shared between the new and old multilingual vocabularies, and the authors attribute the success of the approach to the large vocabulary of the original multilingual NMT model. Berard (2021) proposes to add language-specific adapter modules and freeze the main network when learning a new language, and experimentally shows no degradation on the existing language pairs; however, learning a new long-distance language may be limited by the frozen main network. Because these models are evaluated in different continual learning settings with inconsistent methods, their performance is difficult to compare.

Figure 1: The whole process of Chinese-centric parallel corpus processing (Word2vec feature extraction on the Chinese side, K-means clustering, alignment, and model-based evaluation).

The CN-25 corpus
In this section, we introduce the domain-aligned corpus CN-25 and compare it with existing multilingual corpora. The processing pipeline of CN-25 is shown in Figure 1. In the first step, we use the multilingual encoding model LaBSE and text-based filter rules to refine the CC-Matrix corpus. Then we use a model-based method to evaluate the refined corpus, and the refinement process is adjusted according to the evaluation results. To align the topic distributions of the different languages, we use the central language as the agent and cluster the whole corpus into 100 topics with K-means. In each topic, the number of sentences is regulated according to the median value and the LaBSE score.

CC-Matrix data source
CC-Matrix (Schwenk et al., 2021) is a parallel corpus covering a wide range of languages, obtained from a large number of web snapshots through parallel-corpus mining. Monolingual data are encoded and retrieved with the highly optimized FAISS (Johnson et al., 2021) vector retrieval library and a language-agnostic BiLSTM encoder (Artetxe and Schwenk, 2019), which preliminarily finds sentence pairs with a high probability of being translations. The quality of the sentence pairs is further judged by the LASER margin score with a threshold of around 1.06.
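The margin criterion can be sketched as follows. This is an illustrative toy version: the function names and neighbor sets are placeholders, and the exact neighborhood extraction and normalization follow Artetxe and Schwenk (2019).

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def margin_score(x, y, x_neighbors, y_neighbors, k=4):
    """Ratio-margin score: cos(x, y) divided by the average cosine
    similarity of x and y to their k nearest neighbors."""
    num = cosine(x, y)
    denom = (sum(cosine(x, z) for z in x_neighbors[:k]) / (2 * k)
             + sum(cosine(y, z) for z in y_neighbors[:k]) / (2 * k))
    return num / denom

def keep_pair(score, threshold=1.06):
    # A pair is kept when its margin score exceeds the threshold (~1.06).
    return score >= threshold
```

A pair whose translation similarity stands out from its nearest-neighbor background scores above 1 and is retained; generic near-duplicates score close to 1 and are dropped.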

Corpus refinement
LASER supports encoding more than 100 languages, but its language-agnostic BiLSTM is not trained for translation ranking (Guo et al., 2018). LaBSE (Feng et al., 2022), by contrast, is trained on a translation ranking task. We use it to re-score the CC-Matrix dataset and refine the parallel corpus with a threshold of 0.8, selected manually by inspecting the refined samples. Through manual analysis, we find that parts of the corpus are adulterated with other languages, so we discard sentences containing too much foreign-language text according to the proportion of other-language characters.
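The character-proportion filter described above can be sketched as follows. The `is_cjk` check and the 0.5 threshold are illustrative assumptions, not the paper's exact rule.

```python
def foreign_char_ratio(sentence, is_native):
    """Fraction of alphabetic characters NOT in the expected script."""
    letters = [ch for ch in sentence if ch.isalpha()]
    if not letters:
        return 0.0
    foreign = [ch for ch in letters if not is_native(ch)]
    return len(foreign) / len(letters)

def is_cjk(ch):
    # Basic CJK Unified Ideographs block; a rough check for Chinese text.
    return "\u4e00" <= ch <= "\u9fff"

def keep_sentence(sentence, is_native, max_ratio=0.5):
    # max_ratio is an illustrative threshold, not the paper's value.
    return foreign_char_ratio(sentence, is_native) <= max_ratio
```

For example, a nominally Chinese sentence that is mostly Latin characters would be discarded, while a clean Chinese sentence passes.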

Corpus domain alignment
The CLL task focuses on language expansion rather than on domain shift, which is the main challenge of continual domain learning. Hence, domain distribution alignment is an important restriction that distinguishes a CLL benchmark from a continual-domain-learning benchmark. We use the central language as the agent to obtain the domain distribution of the refined corpus. A word2vec (Mikolov et al., 2013) model is trained on the monolingual sentences of the central language to encode sentences (by averaging the word embeddings in a sentence). After that, the entire corpus is clustered into 100 topics (according to the agent language) by applying K-means to these encodings. For each language, we rank the sentences in each topic by LaBSE score and retain only the sentences whose scores are above the all-language median.
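The alignment pipeline can be sketched as below. This is a simplified stand-in: `word_vecs` is a plain dict standing in for the trained word2vec model, the K-means is a minimal Lloyd's-algorithm loop, and the real pipeline uses 100 topics over roughly a billion sentences.

```python
import numpy as np

def embed_sentence(tokens, word_vecs, dim):
    """Average the word vectors of a sentence (unknown words skipped)."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: assign to nearest center, recompute."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def filter_topic_by_median(labse_scores):
    """Keep only sentences whose LaBSE score is above the topic median."""
    med = np.median(labse_scores)
    return [i for i, s in enumerate(labse_scores) if s > med]
```

Each language's sentences inherit the topic label of their aligned Chinese side, so per-topic filtering regulates every language against the same distribution.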
The correlation matrices of the topic distribution before and after adjustment are shown in Figure 2. The validation and test sets are sampled from the top 100 and the 100th-150th sentences of each topic, respectively.

Comparison with existing multilingual corpora
We compare CN-25 with existing multilingual benchmarks from multiple aspects, including language diversity, corpus size, domain alignment, and content overlap across languages. Table 1 compares CN-25 with the WMT-2022, WAT-2022, CWMT-2022, TED talks, CC-Matrix, OPUS-100, and FLORES-101 benchmarks.
Language diversity and sentence count. CN-25 includes 25 languages from 17 language families; in each family, typical and commonly used languages are selected. In addition, the number of sentences per language is kept balanced: most languages (except Tamil and Swahili) have more than 650k sentences aligned with Chinese. For multilingual parallel corpora including Chinese, as shown in Table 1, the corpus size of CN-25 is greater than that of the corpora widely used in CLL research (Berard, 2021; Garcia et al., 2021). When a model is trained on a content-overlapped corpus, the central-language sentences are learned repeatedly in every translation direction, which increases the risk of overfitting the central language.
None of the existing benchmarks considers both constraints simultaneously. For instance, the TED and FLORES-101 corpora translate each sentence from one language into multiple languages: although the domain distributions of the different languages then coincide, the content overlap problem appears. We analyze this experimentally in Section 6.2.

The continual language learning tasks
We then introduce the two challenging CLL tasks in our benchmark, which differ in language families and CLL stage settings. Both the language families and the number of learning stages affect catastrophic forgetting: the former is empirically analyzed in Appendix A, and the latter is shown by Hsu et al. (2018). CLL models are evaluated on the newly and historically learned languages by the average BLEU score after each learning stage.
CLCL task: close-distance language continual learning. In this scenario, the whole CLL task consists of two stages: training a multilingual NMT model at the first stage and then continually learning new languages at the second stage. The learning sequence is relatively short, and no new language families are added at the subsequent stage: the languages learned at the second stage belong to the language families of the first stage. For instance, participants are required to train a multilingual NMT model on three language families, Germanic (Dutch, German, Swedish), Romance (Portuguese, Spanish), and Slavic (Russian, Czech), at the first stage, and then continually train on English (Germanic), French (Romance), and Polish (Slavic).
LFCL Task: language family continual learning.
In this more challenging scenario, we set more learning stages and introduce new language families to add difficulty for CLL approaches. For example, the Germanic family (Dutch, German, English, and Swedish), the Romance family (French, Portuguese, and Spanish), and the Slavic family (Polish, Russian, and Czech) are learned in sequence.
Over a long learning sequence, catastrophic forgetting can be serious. Existing CLL models that retain knowledge from previous stages to alleviate catastrophic forgetting may consume more time and memory. Take EWC, a classical continual learning approach, for example: as the number of learning stages increases, the time spent traversing the old training sets to compute the Fisher matrix for the parameters grows significantly.
Performance metric. We use the average BLEU from and to Chinese (the central language) to evaluate the performance of methods on the CLL tasks, which is the commonly used metric in CLL (Berard, 2021). Let L_i represent the set of languages learned up to stage i, j ∈ L_i a learned language, and →b_{i,j} and ←b_{i,j} the test BLEU of zh→j and j→zh after the model finishes learning stage i. The average BLEUs from and to Chinese are respectively defined as

→B_i = (1/|L_i|) Σ_{j∈L_i} →b_{i,j},   ←B_i = (1/|L_i|) Σ_{j∈L_i} ←b_{i,j}.

The average BLEU B_{-1} at the last stage represents the overall performance on the CLL task.
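A minimal sketch of this metric, assuming the per-language test BLEUs have already been computed:

```python
def average_bleu(bleu_forward, bleu_backward):
    """Average BLEU over all languages learned so far, separately for
    the zh->j (forward) and j->zh (backward) directions.

    bleu_forward / bleu_backward map language code -> test BLEU."""
    langs = sorted(bleu_forward)
    fwd = sum(bleu_forward[j] for j in langs) / len(langs)
    bwd = sum(bleu_backward[j] for j in langs) / len(langs)
    return fwd, bwd
```

Calling this after each stage with the accumulated language set yields the →B_i and ←B_i curves; the last-stage values give B_{-1}.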

COMETA framework for continual language learning tasks
We propose the COMETA framework, which is based on constrained optimization and meta-learning: a meta-model is trained to predict the NMT model's translation loss from its parameters. The meta-model then calculates importance weights to constrain the changes to the embeddings of old languages when learning new languages. The framework of COMETA at stage-i is shown in Figure 3: we employ two independent computation graphs to train the multilingual NMT model and the meta-model, respectively.

Meta model and importance weight
As shown in Figure 4, the meta-model, a CNN-based network, predicts the translation loss from the language-specific embeddings, which correspond to the language-specific frequent tokens. LayerNorm and max-pooling operators are used to rescale and resize the embeddings, and multiple CNN kernels of different sizes extract features. The features are concatenated and fed to a fully connected layer. To predict the loss, we use the SoftPlus (Dugas et al., 2000) activation function, which returns only non-negative values. The input of the meta-model is the language-specific embedding, and the training objective is to fit the translation loss.
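A toy forward pass of such a meta-model might look like the following. This is a simplification of the actual architecture: the whole-matrix LayerNorm, single filter per kernel size, and tiny feature dimensions are all illustrative choices.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified: normalize over the whole matrix rather than per row.
    return (x - x.mean()) / (x.std() + eps)

def softplus(x):
    # SoftPlus keeps the predicted loss non-negative.
    return np.log1p(np.exp(x))

def meta_model_forward(emb, kernels, fc_w, fc_b):
    """Predict a non-negative translation loss from a language-specific
    embedding matrix `emb` of shape (tokens, dim).

    kernels: list of arrays, each (width, dim): one conv filter per size."""
    x = layer_norm(emb)
    feats = []
    for k in kernels:
        w = k.shape[0]
        # 1-D convolution over the token axis, then max-pooling.
        conv = np.array([(x[i:i + w] * k).sum()
                         for i in range(x.shape[0] - w + 1)])
        feats.append(conv.max())
    h = np.array(feats)
    # Fully connected layer followed by SoftPlus.
    return softplus(float(fc_w @ h + fc_b))
```

The multi-width convolutions mirror a TextCNN-style feature extractor over the token dimension of the embedding table.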
We use the meta-model to predict a loss value and compute the gradient of the language-specific embeddings with respect to the predicted loss. We then use the absolute value of this gradient as the importance weights to constrain the changes to the embeddings when learning new languages. We penalize only the changes to the embeddings because the embeddings are more language-specific than the parameters of the encoder/decoder layers, and a bias of the embeddings (Cao et al., 2021) is observed when learning new languages.
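The importance-weight computation can be illustrated as below. In COMETA the gradient would be obtained by backpropagation through the meta-model; the central-difference estimate here is only for demonstration, and `meta_model` is any callable mapping an embedding matrix to a scalar predicted loss.

```python
import numpy as np

def importance_weights(meta_model, emb, eps=1e-4):
    """Absolute gradient of the meta-model's predicted loss with respect
    to each embedding parameter, estimated by central differences."""
    w = np.zeros_like(emb)
    for idx in np.ndindex(emb.shape):
        orig = emb[idx]
        emb[idx] = orig + eps
        up = meta_model(emb)
        emb[idx] = orig - eps
        down = meta_model(emb)
        emb[idx] = orig  # restore the parameter
        w[idx] = abs(up - down) / (2 * eps)
    return w
```

Parameters to which the predicted old-language loss is sensitive receive large weights and are therefore protected during new-language training.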

Training process
As shown in Figure 3, the parameter ϕ of the meta-model is updated by the Adam (Kingma and Ba, 2015) optimizer:

ϕ ← Adam_1(∂L^i_meta/∂ϕ, ϕ).

Multilingual NMT model training. At learning stage i, the total loss L^i of the NMT model combines the translation loss L^i_translate and the knowledge-retention loss L^i_retain:

L^i = L^i_translate + L^i_retain.

Given a batch of source sentences src^i, corresponding target sentences tgt^i, and the cross-entropy loss function CE(·), the translation loss of the multilingual NMT model f_θ(·) is

L^i_translate = CE(f_θ(src^i), tgt^i).

The knowledge-retention loss is

L^i_retain = Σ W^i ⊙ (θ_emb − θ^{i−1}_emb)²,

where the importance weights W^i are computed by the meta-model g^{i−1}_ϕ(·) trained at stage i−1. The parameters θ are updated by another Adam optimizer:

θ ← Adam_2(∂L^i/∂θ, θ).
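A sketch of the combined objective, assuming an EWC-style quadratic retention penalty over the embeddings (the paper's exact functional form may differ):

```python
import numpy as np

def total_loss(translate_loss, emb, emb_old, importance):
    """L^i = L_translate + L_retain: the retention term penalizes the
    importance-weighted squared drift of the embeddings from their
    values at the previous stage."""
    retain = float((importance * (emb - emb_old) ** 2).sum())
    return translate_loss + retain
```

With large importance weights on old-language embedding rows, the optimizer is pushed to leave those rows near their previous-stage values while the translation term fits the new language.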

Training tricks
When continually learning new languages, the parameters of the encoder and decoder layers are frozen during the early steps, and only the embeddings of the NMT model are updated, which lets the new-language embeddings adapt to the well-trained encoder and decoder layers. Otherwise, the encoder and decoder parameters are updated sharply, which leads to quick forgetting. For the language-specific embeddings, we design a dynamic freezing strategy that gradually unfreezes the embeddings whose corresponding tokens are not commonly used. As observed in Appendix A (Figures 5-6), at the early stage of learning a new language, the gradient of the embeddings is large and the optimizer updates are drastic. Hence, freezing part of the word embeddings in the early steps avoids drastic parameter updates that would erase the learned knowledge stored in the embeddings.

Experiments
In this section, we evaluate the CN-25 corpus and the COMETA method. All experiments are evaluated with the SacreBLEU (Post, 2018) metric.

Corpus analysis
We list the language information and parallel-sentence counts in Table 2. To obtain a specific domain distribution, we train a FastText model to classify the Chinese sentences aligned with each language into several categories. Consistent with the K-means clustering results (Figure 2, right part), the domain distribution of the Tamil corpus differs markedly from the other languages: because the original Tamil corpus in CC-Matrix is too scarce, it is hard to adjust the sample counts in each cluster to match the shared distribution. The category distribution produced by the FastText model is shown in Appendix B, Table 9.

Corpus quality evaluation
We assess the quality of CN-25 through manual and model-based methods. For an intuitive comparison, we present the top 30 Chinese-English sentence pairs with the highest LASER scores from CC-Matrix and from CN-25.
To quantitatively verify the quality of the refined data, we finetune the M2M-418M (Fan et al., 2021) model on the sentence pairs in CN-25 and on the filtered-out (regarded as low-quality) sentence pairs, respectively, and then evaluate under the WMT2020 and TED benchmarks. We select seven commonly used languages (200k sentences per language) and finetune the M2M-418M model for 90k steps (128 sentences per batch). As shown in Table 3, in each translation direction, the CN-25 corpus brings a larger performance improvement, and the filtered-out corpus even damages the original M2M performance in several directions, which shows that the refining process filters out low-quality parallel sentence pairs. Analogously, we evaluate under the WMT-2020 benchmark (English corpus aligned with Chinese only) with both finetuning and training from scratch; as shown in Table 4, the corpus with higher LaBSE scores keeps its advantage.
To verify the performance of models trained on CN-25, we train two Chinese-centric multilingual NMT models on TED and on CN-25, respectively. Both models are trained on 10 languages for 450k steps (128 sentences per batch) and evaluated on the TED test set. As shown in Table 5, the two models have competitive performance in the "Chinese->xx" directions, which indicates that CN-25 has high data quality, whereas the model trained on TED severely overfits Chinese due to the content overlap problem.

Baselines evaluation
We reproduce the EWC (Thompson et al., 2019) and MAS (Aljundi et al., 2018) baselines and compare them with COMETA under six experiment settings (2 model sizes x 3 replay settings). Tables 6 and 7 present the average BLEU at each learning stage on the CLCL and LFCL tasks. Compared with EWC and MAS, COMETA does not use the source and target sentences of the historical training corpus, yet its average performance still has an advantage, which shows that the meta-model can identify the parameters important for old languages. However, in every replay setting, catastrophic forgetting remains considerable, especially in the zero-replay scenario, which shows that the CLCL and LFCL tasks are challenging and that CLL methods still have much room for improvement.

Conclusion
We propose CLLE, the first CLL benchmark, with the CN-25 corpus and two CLL tasks, CLCL and LFCL. Compared with existing multilingual benchmarks, CLLE considers several restrictions for CLL, including domain distribution alignment, content overlap, language diversity, and corpus balance. For the CLL tasks, we introduce a novel method, COMETA, based on constrained optimization and meta-learning, which uses a meta-model to retain the important parameters of old languages. The experiments show that CN-25 is a high-quality corpus, that the CLL tasks are challenging, and that our proposed method outperforms other strong baselines.

Limitations
We discuss the limitations of the CN-25 corpus and the COMETA method. For CN-25, the quality of the refined corpus relies on the LaBSE model, which performs better on high-resource languages; hence we cannot guarantee that every language in CN-25 has equally high-quality data. Furthermore, topic alignment is resource-consuming: we need to cluster nearly 1 billion sentences into 100 topics, and if a new corpus arrives, the clustering process has to be executed again. For COMETA, the meta-model size grows with the translation model size, and the meta-model has difficulty processing parameters with complex structure, such as self-attention layers.

Ethics Statement
Training a multilingual NMT model from scratch usually costs expensive computing resources; researching CLL can effectively reduce resource consumption and carbon emissions. Moreover, using a meta-model to alleviate catastrophic forgetting provides a new perspective for studying continual learning.

A Analysis of the continual language learning task
In this section, we empirically analyze the challenge of CLL. We train a multilingual NMT model based on the multilingual vocabulary released by M2M, then add new languages from different families. Specifically, at the first stage, we train a multilingual NMT model on seven languages from three families: Germanic (Dutch, German, Swedish), Romance (Portuguese, Spanish), and Slavic (Russian, Czech). At the second stage, we directly finetune the model on a new language while replaying 2000 samples of each old direction. We run this experiment with three new languages: English (Germanic), French (Romance), and Polish (Slavic). As shown in Table 8, the old languages from the same family as the new language suffer the least forgetting; however, the forgetting of all families is still considerable.

A.1 Gradient visualization of the embedding
To investigate the reason for the considerable forgetting, we visualize the gradients of the language-specific embeddings. We select 1000 frequent tokens of each language by traversing the entire corpus, record the gradient of each embedding during finetuning, and average the gradient along the feature dimension. The visualization is shown in Figure 5: the vertical axis shows the frequent tokens of the different languages, and the horizontal axis represents the training step on the new language.
Comparing the gradient magnitudes visualized in Figure 5, we observe several phenomena:
• If the new and old languages use different scripts, the gradient on the old languages' embeddings is minimal. For example, Russian (ru, Slavic) uses Cyrillic script while the other languages use Latin script, and the gradient magnitude of the Russian-specific embeddings is very small when learning any new language.
• If the new and old languages use the same script, learning a new language creates a larger gradient on the languages from the same family. For example, when learning English (en, Germanic), the gradients of the Germanic languages (German, Dutch, Swedish) are large.
• At the beginning of learning a new language, the gradient magnitude is larger (darker color) than in the subsequent process.

A.2 The semantic shift phenomenon
We conjecture that the above phenomena are due to the BPE-based (Sennrich et al., 2016) multilingual vocabulary. In the multi-task scenario, a subword vocabulary shared across languages promotes semantic knowledge learning: multilingual semantic knowledge can be learned from the shared token embeddings. In the CLL scenario, however, shared token embeddings are trained across multiple stages, and the new knowledge may wash out the old. Note that the semantics of embeddings do not change in the continual domain learning scenario, whereas in the CLL scenario a shared token may represent different semantics in different languages. The shared embeddings are updated during the new-language learning stages, which poses an extra challenge to CLL methods. We call this phenomenon semantic shift.
To verify the existence of the semantic shift phenomenon, Figure 6 presents the L2 norm and the L2 distance of the embeddings while continually learning new languages. In each sub-figure, 1000 frequent tokens of the new and old languages and 100 shared tokens (or fewer if 100 do not exist) are selected to plot the curves. From the L2-norm and L2-distance curves, we observe that the shared token embeddings vary faster than the old languages' embeddings, except for Russian (which does not use Latin script); this supports the existence of semantic shift. Note that semantic shift arises among languages that use the same script, due to the BPE-based multilingual vocabulary. Inspired by this phenomenon, catastrophic forgetting can be reduced by 1) controlling the optimizer's updates of the shared embeddings (the strategy of COMETA) or 2) generating a multilingual vocabulary with fewer shared tokens.
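The curves in Figure 6 can be reproduced from embedding snapshots with a computation like the following (the function and argument names are illustrative):

```python
import numpy as np

def embedding_drift(snapshots, token_rows):
    """Track the mean L2 norm and the mean L2 distance from step 0 for a
    set of token embedding rows across training snapshots.

    snapshots: list of (vocab, dim) embedding matrices, one per step.
    token_rows: indices of the token rows to monitor."""
    base = snapshots[0][token_rows]
    norms, dists = [], []
    for snap in snapshots:
        rows = snap[token_rows]
        norms.append(float(np.linalg.norm(rows, axis=1).mean()))
        dists.append(float(np.linalg.norm(rows - base, axis=1).mean()))
    return norms, dists
```

Running this separately for the new-language, old-language, and shared token rows gives the three curves per sub-figure; a faster-growing distance curve for shared tokens is the semantic-shift signature.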

Figure 2: Correlation matrix of the K-means topic distribution. Dark colors indicate low relevance. Left: original correlation matrix. Right: adjusted correlation matrix.
At each stage i, we train the multilingual NMT model f_θ(·) and the meta-model g^i_ϕ(·) separately in two computation graphs. The embeddings and the translation loss (θ^i_emb, L^i_translate) from the left computation graph are used to train the meta-model in the right computation graph. Meta-model training. The embeddings θ^i_emb are fed to the meta-model g^i_ϕ(·) to predict a translation-loss value, and the meta loss (the loss for training the meta-model) is computed with the mean-squared-error loss function MSE(·): L^i_meta = MSE(g^i_ϕ(θ^i_emb), L^i_translate).

Figure 5: Gradients of the language-specific embeddings. The vertical axis shows the tokens of the old languages; the horizontal axis represents the finetuning step on the new languages. Color intensity indicates the gradient magnitude: red indicates positive values and blue indicates negative values.

Figure 6: The L2 norm and L2 distance when finetuning on new languages. The horizontal axis represents the finetuning step. The L2 distance is computed between step 0 and the current step.
The CWMT corpus consists of the languages of China's ethnic minorities, such as Mongolian, Tibetan, and Uyghur, aligned with simplified Chinese; however, the number of languages is not enough to support Chinese-centric CLL research. The TED talks corpus (Qi et al., 2018) has 3540 (60×59) language pairs and relies on volunteers to provide translations of public texts; its content domain includes but is not limited to Technology, Entertainment, and Design.

Table 1 :
Comparison with existing datasets from four aspects; among them (e.g., TED and OPUS-100), CN-25 is the largest corpus satisfying the domain-alignment restriction. Sentence count means the average count over all translation directions. * We select the corpus of the general machine translation task in WMT-2022 and the corpus of the document-level translation task in WAT-2022. † TED has on average 90k sentence pairs of Chinese to/from other languages, while the average for English to/from 20 commonly used languages is about 253k. ‡ OPUS-100 consists of 55M English-centric sentence pairs covering 100 languages and has 1000k English-Chinese pairs.
In the CLL scenario, the domain distribution should be aligned to eliminate the influence of domain differences (Thompson et al., 2019). Content overlap across languages refers to the repetition of central-language sentences across the sentence pairs of different languages.

Table 2 :
The statistics of the 25 languages aligned with Chinese. Groups are divided according to M2M-100. "Original": the amount of CC-Matrix corpus. "Refined": the amount after rule filtering and LaBSE refinement. "Aligned": the amount after domain alignment.

Table 3 :
M2M-418M is finetuned on CN-25 and on the filtered-out sentences, then evaluated on TED. Red font indicates a performance decline.

Table 4 :
M2M-418M is finetuned on CN-25 and on the filtered-out sentences, then evaluated on WMT20. Red font indicates a performance decline.

Table 5 :
The performance of Transformer-base models trained on the TED and CN-25 corpora, respectively.

Table 6 :
The performance of baselines on the CLCL task. RS denotes the number of replay samples in each old direction.

Table 7 :
The performance of baselines on the LFCL task with different replay settings.

Table 8 :
The influence of adding a new language on the old languages. Languages in the same color come from the same family.
B The detailed results and hyper-parameters of experiments
B.1 Text classification results by FastText
(Table 9 columns: ISO, Politics, Education, Fashion, Sports, Entertainment, Technology, Health, Fiction, Game, Social, Asset, Stocks, Lottery, Economics, Other)

Table 9 :
Text classification by a FastText model trained on the THUCNews dataset. Percentages are computed by classifying the aligned Chinese sentences. The distribution of Tamil differs from the other languages, consistent with the K-means topic distribution (Figure 2, right part).
B.2 The hyper-parameters of experiments for CLL tasks

Table 10 :
The hyper-parameters of experiments