ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.


Introduction
Recent trends in generative pre-training of programming languages (Feng et al., 2020; Chen et al., 2021; Li et al., 2022) have led to a proliferation of improvements in code intelligence scenarios, including program understanding and generation (Wang et al., 2021; Ahmad et al., 2021). In this context, a transformer-based large language model (LLM) is pre-trained on a large corpus of open-source code (e.g., from GitHub) and then finetuned or evaluated zero-shot on downstream tasks, such as program synthesis (Austin et al., 2021; Fried et al., 2022; Nijkamp et al., 2022), code search (Husain et al., 2019; Li et al., 2021), clone detection (Lu et al., 2021b), and text-to-code generation (Clement et al., 2020).
Although there has been a surge of interest in learning general-purpose multilingual LLMs for source code (Feng et al., 2020; Ahmad et al., 2021; Wang et al., 2021; Fried et al., 2022; Xu et al., 2022), research in this area has essentially connected English texts (e.g., comments or docstrings) and multiple computer programs (e.g., Python, C++, and Java), as shown in Figure 1 (a), and has primarily focused on English-centric corpora and benchmarks. This English-centricity dramatically limits their use and practice, given that 95% of the world population does not have English as their native language (Guo, 2018).
As such, it is crucial to mitigate barriers and draw connections between non-English natural languages (NLs) and multiple programming languages (PLs). One engineering solution is to use English translations of non-English texts by placing neural machine translation (NMT) systems before/after the code LLM as a pipeline. Unfortunately, most general-purpose NMT systems (Wu et al., 2016; Johnson et al., 2017) are not designed for code-specific texts and can be prone to accumulative errors due to cascaded prediction stages.
A more general way is to learn a multilingual LLM that encodes a mixture of multiple NLs and PLs into a shared cross-mode representation space.
The success in learning universal representations of many languages (Conneau and Lample, 2019; Xue et al., 2021; Ahmad et al., 2021; Wang et al., 2021; Xu et al., 2022), focusing on either PLs or NLs, suggests that it is possible to build a universal multilingual model that jointly represents multiple PLs and NLs.
In this work, we present ERNIE-Code, a unified cross-lingual pre-trained LLM for multiple NLs and PLs, in hopes of mitigating the English-centric bias of program pre-training, as illustrated in Figure 1. Our model builds on the T5 (Raffel et al., 2020) encoder-decoder architecture, which has been demonstrated to be effective in understanding and generation tasks for multilingual NL (Xue et al., 2021) and PL (Wang et al., 2021). For monolingual pre-training on mono-mode data (i.e., unpaired multilingual code or text), we follow the same T5 recipe and employ the "span-corruption" denoising objective in the text-to-text format.
Good-quality parallel corpora between low-resource NLs and multilingual PLs are usually unavailable. Instead, most popular PLs, accompanying API documentation, code examples, and discussion forums are primarily written in English, which poses a bottleneck to drawing connections between low-resource NLs and PLs. Inspired by pivot-based machine translation (Gispert and Mariño, 2006; Utiyama and Isahara, 2007), which uses a pivot language and decomposes source↔target translation into source↔pivot and pivot↔target bilingual translation, we introduce pivot-based translation language modeling (PTLM) with prompting, which disassembles multi-NL↔multi-PL into multi-NL↔English and English↔multi-PL, pivoting through English. Specifically, we leverage PTLM training in dual directions for parallel corpora in different modes: (1) English↔multi-PL. For multi-PL↔English parallel data, i.e., code snippets and their accompanying comments, the model learns to generate English comments from code fragments and vice versa. (2) English↔multi-NL. The model learns to translate between English and other NLs. The model thus encodes PL↔English and English↔NL at the same time, with English as the pivot language.

We conduct extensive experiments on different downstream tasks: (1) multilingual text-to-code generation; (2) multilingual code summarization (code-to-text); (3) documentation translation (text-to-text); (4) code repair (code-to-code). Empirical results show that our model outperforms strong multilingual LLMs for PL or NL and verify its universal multilingual capacity. We also provide examples to show its decent zero-shot capability on code summarization and text translation via zero-shot prompting.
To summarize, this paper makes the following contributions: (1) We propose the first unified cross-lingual pre-trained LLM for both multilingual NLs and multilingual PLs, enlarging the capacity of LLMs toward jointly learning universal multilingualism. (2) We employ pivot-based translation language modeling with prompting to build connections between multiple NLs and multiple PLs (with English pivots) and to mitigate the absence of a multilingual-NL↔multilingual-PL parallel corpus. (3) We obtain superior performance compared with previous multilingual LLMs across a wide range of code intelligence tasks, including text-to-code, code-to-text, code repair, and code documentation translation. (4) To some extent, our model shows zero-shot prompting ability on multilingual code-to-text, text-to-code, and text-to-text generation. Moreover, ERNIE-Code performs well at naming a function and completing its arguments given multilingual NL instructions.

Related work
As text-based formal languages with strict syntax and semantics, PL differs from NL: NL is used only for human communication, while PL mediates interaction between humans and computers. This work targets bridging the gap between human languages and computer programs in a cross-lingual manner for unified multilingual pre-training, which is closely related to LLMs for either multilingual PL or NL.
Multilingual PL pre-training The success of large-scale pre-training has led to impressive advances in computer programs. This line of research involves pre-training on multilingual PLs using bidirectional transformer encoders (Feng et al., 2020; Li et al., 2021), causal transformer decoders (Chen et al., 2021; Austin et al., 2021; Fried et al., 2022; Nijkamp et al., 2022; Xu et al., 2022), and transformer encoder-decoder architectures (Wang et al., 2021; Ahmad et al., 2021; Li et al., 2022). Those with bidirectional encoders focus on program understanding tasks, such as code search (Husain et al., 2019), while the encoder-decoder ones target building unified LLMs for both program understanding and generation. We observe that a large body of pre-trained models for PL tend to scale up their parameters under the framework of causal language modeling, mainly focusing on program synthesis (Chen et al., 2021; Austin et al., 2021; Fried et al., 2022; Nijkamp et al., 2022; Xu et al., 2022). Nevertheless, almost all of these works are English-centric, posing significant challenges to coping with PL end tasks in non-English scenarios.
Multilingual NL pre-training This work is also related to the continual trend of multilingual LLMs. One line of work focuses on encoding multiple NLs into a shared representation space (Conneau and Lample, 2019; Conneau et al., 2020), while others make efforts to extend efficient monolingual pre-training methods to multilingual settings (Xue et al., 2021; Liu et al., 2020).
Inheriting the recent success of LLMs in multilingualism, this work lies at the intersection of multilingual NL and PL pre-training. In contrast to previous work that attends to either multilingual NL or multilingual PL, we seek to explicitly learn multiple NLs and PLs in a shared representation space in hopes of breaking the language barriers between these two modes.

Pre-training tasks
We pre-train on two tasks using both PL and NL data: one (§3.1.1) uses monolingual PL/NL data (unsupervised), while the other (§3.1.2) requires parallel NL-PL and NL-NL pairs (supervised). The former learns intra-modal patterns from PL or NL alone, while the latter endows the model with cross-lingual/cross-modal alignment and zero-shot capabilities.

Task#1: Span-corruption language modeling (SCLM)
Denoising sequence-to-sequence pre-training has been highly effective across a broad set of tasks, including natural language processing (Liu et al., 2020; Raffel et al., 2020; Xue et al., 2021) and programming language processing (Wang et al., 2021; Ahmad et al., 2021). The denoising pre-training objective first corrupts input sequences by masking or adding noise, and then recovers the original inputs by forcing the model to predict corrupted spans, sentences, or documents. Raffel et al. (2020) find that span-corruption denoising pre-training produces strong performance while being more computationally efficient on account of shorter target sequence lengths.
In a similar vein, we extend span-corruption denoising pre-training to both PL and NL. We refer to this task as span-corruption language modeling (SCLM), as illustrated in Figure 2. Specifically, it corrupts 15% of the original NL/PL input tokens with a mean span length of 3, replacing contiguous, randomly-spaced spans of tokens with a single mask placeholder and then predicting the corrupted spans on the target side. Suppose we have $M$ monolingual NL and PL corpora $\{\mathcal{C}_m\}_{m=1,\cdots,M}$. We apply the SCLM pre-training objective on both NL and PL data in a multi-tasking fashion:

$$\mathcal{L}_{\mathrm{SCLM}}(\theta) = \sum_{m=1}^{M} \mathbb{E}_{x \sim \mathcal{C}_m} \Big[ -\sum_{t=1}^{T} \log P_\theta\big( x^{\mathrm{mask}}_{(m),t} \,\big|\, x^{\setminus\mathrm{mask}}_{(m)},\, x^{\mathrm{mask}}_{(m),<t} \big) \Big]$$

where $\theta$ denotes trainable parameters, $x^{\setminus\mathrm{mask}}_{(m)}$ and $x^{\mathrm{mask}}_{(m)}$ are span-corrupted inputs and corresponding target spans from monolingual corpus $\mathcal{C}_m$, respectively. $x^{\mathrm{mask}}_{(m),<t}$ indicates the generated tokens up to the $t$-th time step out of the target (corrupted) sequence length $T$.
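The span-corruption procedure can be sketched in Python. The sketch below is a simplified illustration under our own assumptions: the function name and the span-sampling strategy are ours, while the 15% corruption rate, mean span length of 3, and T5-style `<extra_id_i>` sentinel tokens follow the description above.

```python
import random

# T5-style sentinel tokens, one per corrupted span
SENTINELS = [f"<extra_id_{i}>" for i in range(100)]

def span_corrupt(tokens, corrupt_rate=0.15, mean_span_len=3, seed=0):
    """Corrupt ~corrupt_rate of the tokens as contiguous spans of
    roughly mean_span_len, replacing each span in the input with a
    single sentinel; the target lists each sentinel followed by the
    tokens it hides. Returns (inputs, targets)."""
    rng = random.Random(seed)
    n_corrupt = max(1, round(len(tokens) * corrupt_rate))
    n_spans = max(1, round(n_corrupt / mean_span_len))
    # pick span start positions at random (simplified: fixed span length,
    # overlapping picks are skipped)
    starts = sorted(rng.sample(range(len(tokens)), n_spans))
    inputs, targets, i, sid = [], [], 0, 0
    for s in starts:
        if s < i:  # overlaps the previous span; skip
            continue
        inputs.extend(tokens[i:s])
        span = tokens[s:s + mean_span_len]
        inputs.append(SENTINELS[sid])   # one placeholder per span
        targets.append(SENTINELS[sid])  # target: sentinel + hidden tokens
        targets.extend(span)
        sid += 1
        i = s + len(span)
    inputs.extend(tokens[i:])
    return inputs, targets
```

Every original token ends up exactly once on either the corrupted input side or the target side, which is what keeps the target sequences short and the objective computationally efficient.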
Task#2: Pivot-based translation language modeling (PTLM)

Good-quality parallel corpora between multilingual NL and multilingual PL are unavailable. This lack of parallel data stems from the fact that most popular PLs, accompanying documentation, and discussion websites are primarily written in English.
Early investigations of statistical machine translation proposed a pivot-based approach (Gispert and Mariño, 2006; Utiyama and Isahara, 2007) that introduces a third language, named the pivot language, for which good-quality source-pivot and pivot-target bilingual corpora exist. Johnson et al. (2017) adopt a single NMT model to simultaneously learn many translation directions (including source↔pivot and pivot↔target), implicitly enabling zero-shot translation between NLs.
In our context, a good-quality multi-PL to multi-NL bilingual corpus is unavailable, yet there exist multi-NL to English and English to multi-PL parallel corpora, pivoting through English. Motivated by pivot-based NMT (Johnson et al., 2017) and the translation language modeling (TLM; Conneau and Lample, 2019) approach, we apply a unified pivot-based training objective to the course of multilingual NL-PL pre-training, namely pivot-based translation language modeling (PTLM). With bilingual PL-NL and NL-NL corpora, we jointly learn the parallelism with pivoting in dual directions: for instance, Python↔English and English↔Russian. This allows for implicit bridging between PL-NL pairs that are never seen explicitly in training data (Johnson et al., 2017). More precisely, we concatenate parallel source-target sentences and learn to predict the corrupted target language, as shown in Figure 3. Instead of masking random tokens (Conneau and Lample, 2019), we corrupt the whole sentence in either direction of the bilingual data and predict it on the target side. The model must attend to complete representations of the source sentence to recover the target sentence, learning the alignment between source-target pairs. Suppose we have $N$ bilingual NL-NL and NL-PL parallel corpora $\{\mathcal{D}_n\}_{n=1,\cdots,N}$. We can formulate the PTLM training as:

$$\mathcal{L}_{\mathrm{PTLM}}(\theta) = \sum_{n=1}^{N} \mathbb{E}_{(x^{\mathrm{source}}, x^{\mathrm{target}}) \sim \mathcal{D}_n} \Big[ -\sum_{t=1}^{T} \log P_\theta\big( x^{\mathrm{target}}_{(n),t} \,\big|\, x^{\mathrm{source}}_{(n)},\, x^{\mathrm{target}}_{(n),<t} \big) \Big]$$

where $x^{\mathrm{source}}_{(n)}$ and $x^{\mathrm{target}}_{(n)}$ denote source and target sentences from bilingual corpus $\mathcal{D}_n$, and $x^{\mathrm{target}}_{(n),<t}$ indicates the generated tokens up to the $t$-th time step out of the target sequence length $T$. This training format is the same as an NMT task. To enable the pivot-based approach and specify the target language, we reformat PTLM by prompting with a task prefix (see Figure 3): we prepend a task instruction "translate A to B: \n" to the left of input sentences, where A and B denote the source and target language, respectively. This prompt instruction indicates the target language the model should translate to, resulting in decent zero-shot abilities (§5.3).
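Concretely, a PTLM training pair can be assembled as follows. This is a minimal sketch: the helper name and example sentences are illustrative, while the prefix string follows the "translate A to B: \n" template described above.

```python
def ptlm_pair(src_lang, tgt_lang, src_text, tgt_text):
    """Build one PTLM example: the task prefix names the target language,
    the source side is fully visible, and the whole target sentence is
    the prediction target (no random token masking)."""
    return f"translate {src_lang} to {tgt_lang}: \n{src_text}", tgt_text

# Dual-direction supervision pivoting through English:
# PL<->English from code/comment pairs, English<->NL from translation data.
examples = [
    ptlm_pair("Python", "English",
              "def add(a, b):\n    return a + b", "Add two numbers."),
    ptlm_pair("English", "Russian",
              "Add two numbers.", "Сложить два числа."),
]
```

Because English appears on one side of both pair types, the model can implicitly bridge, e.g., Python↔Russian without ever seeing that pair directly.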
Shared NL/PL encoding We base our tokenizer on the SentencePiece tokenizer of Xue et al. (2021). However, the original SentencePiece tokenizer, designed for encoding NLs, does not effectively represent PL data. We thus add a set of tokens representing whitespace indentation of different lengths in PL. See tokenization details in §A.1.
To alleviate the bias towards high-resource languages, we follow Conneau and Lample (2019) to rebalance the data distribution on both corpora and up/down-sample sentences from each language (or language pair) $i$ with a rescaled multinomial distribution $q_i$:

$$q_i = \frac{p_i^{\alpha}}{\sum_{j} p_j^{\alpha}}$$

where $p_i$ is the data percentage of each monolingual or parallel corpus. Following Conneau and Lample (2019), we set $\alpha = 0.3$ for both monolingual and parallel corpora.
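The rescaled multinomial resampling can be computed directly from corpus sizes; a small sketch, with made-up corpus sizes for illustration:

```python
def rescaled_sampling_probs(sizes, alpha=0.3):
    """Compute q_i proportional to p_i**alpha, where p_i is each
    corpus's share of the total data; alpha < 1 flattens the
    distribution, up-sampling low-resource languages."""
    total = sum(sizes)
    p = [s / total for s in sizes]
    weights = [pi ** alpha for pi in p]
    z = sum(weights)
    return [w / z for w in weights]

# hypothetical high/mid/low-resource corpora (in, say, millions of pages)
q = rescaled_sampling_probs([900, 90, 10])
```

With alpha = 0.3, the low-resource corpus's sampling probability rises well above its raw 1% share, while the high-resource corpus is sampled less often than its 90% share.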

Comparison to related models
To contextualize our new model, we briefly compare it with existing multilingual LLMs for NLs/PLs. Considering that ERNIE-Code is the first LLM targeting multilingual NL and PL explicitly, for brevity, we focus on models that support either many NLs or many PLs. Table 1 reports the overall statistics of comparison models. mBART (Liu et al., 2020) is a multilingual-NL variant of BART (Lewis et al., 2020).

Evaluation datasets and metrics
Table 9 displays the statistics of the evaluation datasets. We use the same public datasets and train-test splits for all downstream tasks. We refer to §A.3.3 for experimental settings of finetuning.
Multilingual code summarization is a code-to-text task that aims to generate multilingual texts given a code snippet. We use mCoNaLa (Wang et al., 2022) to evaluate the performance of generating multilingual NL from PL. It consists of 341/210/345 manually curated parallel samples with NL in Spanish/Japanese/Russian and PL in Python. As mCoNaLa does not provide training and validation sets, we use CoNaLa (Yin et al., 2018), an English-Python parallel dataset (consisting of 2,379 samples), as the train/dev set (with a 10:1 data split) after translation. For "translate-train" settings, we use machine-translated CoNaLa as the training and dev sets, while using mCoNaLa as the test set. Particularly, we translate CoNaLa's training set into the three target languages using FLORES-101 (Goyal et al., 2022) and adopt them as train/dev sets. We utilize ROUGE-L (R-L; Lin, 2004), BLEU-4 (B-4; Post, 2018), and chrF (Popović, 2015) for comprehensive comparison.

Multilingual text-to-code generation refers to the code generation task that generates code fragments from multilingual NL instructions. We use the same train/dev/test sets as for code summarization above. Specifically, under "translate-train" settings, we use translated CoNaLa data as the training and dev sets and mCoNaLa as the test set to generate Python code from NL instructions in three different NLs (i.e., Spanish, Japanese, and Russian). We use ROUGE-L, BLEU-4, and CodeBLEU (C-B; Ren et al., 2020) for evaluating code predictions.
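For reference, ROUGE-L scores the longest common subsequence (LCS) between hypothesis and reference. A minimal token-level sketch, assuming the conventional F-measure with β = 1.2 (this is our own implementation for illustration, not the exact evaluation script used for the reported numbers):

```python
def rouge_l(hyp_tokens, ref_tokens, beta=1.2):
    """ROUGE-L: F-measure over the length of the longest common
    subsequence between hypothesis and reference token lists."""
    m, n = len(hyp_tokens), len(ref_tokens)
    # dp[i][j] = LCS length of hyp[:i] and ref[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if hyp_tokens[i] == ref_tokens[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return ((1 + beta ** 2) * precision * recall
            / (recall + beta ** 2 * precision))
```

Because the LCS need not be contiguous, ROUGE-L rewards predictions that preserve the reference's token order even when phrasing differs, which is why it is reported alongside the n-gram-based BLEU-4.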

Documentation translation is a text-to-text task that translates code documentation from one NL to another. We use Microsoft Docs from the CodeXGLUE dataset (Lu et al., 2021a) to verify multilingual NL translation between English ↔ Danish, Latvian, Norwegian, and Chinese. We report BLEU-4 and exact match (EM) in our results.
Code repair is a code-to-code task that automatically fixes bugs given a piece of buggy code. We evaluate on the Bugs2Fix (Tufano et al., 2019) dataset with two subsets: (i) "small", with fewer than 50 tokens; (ii) "medium", with lengths between 50 and 100 tokens. We report BLEU-4 and EM for evaluation.

Multilingual code summarization
Table 2 shows the multilingual code-to-text results of generated NL summaries in Spanish, Japanese, and Russian. We use translated English CoNaLa as training sets in the three target languages, denoted as the "translate-train" evaluation. As shown in Table 2, our model outperforms all baseline LLMs for either NL (mBART, mT5) or PL (PLBART, CodeT5). In particular, ERNIE-Code with a length of 1024 exceeds its 512-length counterpart (1.12 vs. 0.88 on BLEU-4), as it allows for learning more extended contexts from training NL/PL segments. PLBART performs worst among all baselines on average, while CodeT5, mT5, and mBART behave similarly. We conjecture that this is because PLBART only learns from Java/Python functions and English StackOverflow posts, so its training data lacks multilingual diversity.

Multilingual text-to-code generation
Table 3 shows the "translate-train" results of multilingual text-to-code generation on mCoNaLa. ERNIE-Code outperforms all baselines on BLEU-4, ROUGE-L, and CodeBLEU scores, showing that our multilingual PL-NL pre-training can capture code syntax and semantics. Among all code generation tasks, multilingual models for NL behave worse than their PL counterparts. PLBART beats the other baselines on surface-form n-gram match (BLEU-4/ROUGE-L) and structured code-related match (CodeBLEU), even achieving results on par with our model on CodeBLEU. In contrast, mT5 underperforms all the other models on all three subtasks, suggesting that the mT5 tokenizer is ineffective in encoding PLs, as mentioned in §3.2. Comparing mT5 and our models, the improvements suggest our approach's effectiveness in encoding whitespace characters for tokenization.
Our model with more extended contexts (1024-length) outperforms its 512-length counterpart on all three text-to-code subtasks.

Documentation translation (text-to-text)
We further investigate multilingual text-to-text translation between English (en) and Danish (da)/Latvian (lv)/Norwegian (no)/Chinese (zh). Table 4 shows the documentation translation results of comparison models, including the multilingual transformer (Johnson et al., 2017), XLM-R (Conneau et al., 2020), and mT5. Specifically, we finetune our model in a multilingual manner where all bilingual language pairs are learned simultaneously.
Our model surpasses mT5 and XLM-R in all eight translation directions, demonstrating that our model can perform code-related text-to-text translation. As the experiment is designed only to verify the NL translation ability of our model, we did not conduct comprehensive comparisons with state-of-the-art (SOTA) NMT methods.

Program repair (code-to-code)
We further validate that our model can perform code-to-code generation. Table 5 shows the comparison model results on the Bugs2Fix benchmark. Baseline models include RoBERTa (code), a PL variant of RoBERTa (Liu et al., 2019), CodeBERT (Feng et al., 2020), PLBART, and CodeT5.
On the "small" and "medium" tasks, our model achieves 80.10 and 91.20 BLEU scores, outperforming or achieving competitive results compared with previous SOTA performance. The results of the 1024-length and 512-length models differ only slightly, possibly because both "small" and "medium" Java data are no more than 100 tokens long, far shorter than our model's length limit.

Syntactic & semantic probing
Code fragments with highly-overlapping surface forms but different semantic and syntactic logic can be given high scores by NL evaluation metrics such as BLEU and ROUGE. To evaluate the semantic and syntactic aspects of text-to-code generation, we follow Ren et al. (2020) and adopt dataflow and abstract syntax tree (AST) match to compute the accuracy of dataflow graphs and AST subtrees between hypothesis and reference. We refer to Ren et al. (2020) for further details.
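A simplified version of the AST-match idea can be sketched with Python's ast module. This is our own approximation in the spirit of CodeBLEU's syntactic component, not the official implementation: it identifies each AST node by its type and the types of its children, then counts how many of the reference's subtree signatures also appear in the hypothesis.

```python
import ast

def ast_subtree_match(hyp_code, ref_code):
    """Fraction of the reference's AST subtree signatures (node type
    plus child node types, ignoring identifiers and literals) that
    are also present in the hypothesis."""
    def signatures(code):
        sigs = []
        for node in ast.walk(ast.parse(code)):
            children = tuple(type(c).__name__
                             for c in ast.iter_child_nodes(node))
            sigs.append((type(node).__name__, children))
        return sigs

    ref_sigs, hyp_sigs = signatures(ref_code), signatures(hyp_code)
    available = {}
    for s in hyp_sigs:  # multiset of hypothesis signatures
        available[s] = available.get(s, 0) + 1
    matched = 0
    for s in ref_sigs:  # clipped matching against the hypothesis
        if available.get(s, 0) > 0:
            available[s] -= 1
            matched += 1
    return matched / len(ref_sigs)
```

Because signatures ignore identifiers, two snippets that differ only in variable names score 1.0, which is exactly the structural invariance that surface-form metrics like BLEU miss.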
Figure 4 illustrates the dataflow and AST match results of comparison models. PL baselines tend to generate code with better AST structures than NL models. In particular, mT5 fails to produce code with proper AST syntax but can match or surpass others on dataflow evaluation except on Russian tasks. Our model (L512/1024) exceeds or matches baselines in terms of both semantic dataflow and syntactic AST match.

Ablation study
Quantitative results We carry out ablation experiments by ablating either the SCLM or PTLM task and report the average results in Figure 5.

Analyzing PL semantics & syntax
We further analyze the semantic and syntactic structure of multilingual text-to-code generation for ablation comparison. Figure 7 shows dataflow and AST match performance on text-to-code generation given multilingual NL inputs. We find that removing SCLM does not overly impact the semantic dataflow and syntactic structures of generated PL. Meanwhile, ablating PTLM generally causes more considerable fluctuation in the semantics and syntax of generated PL, suggesting that PTLM allows the model to capture bilingual alignment and translation across languages.

Zero-shot prompting
To verify the zero-shot ability of ERNIE-Code, we carry out code-to-text, text-to-code, and text-to-text experiments with zero-shot prompting. Precisely, we prepend a prompt prefix "translate S to T: \n" to the left of inputs, where S and T denote the source and target language, respectively. We then use beam search with five beams to obtain zero-shot predictions. On zero-shot code-to-text generation, our model demonstrates excellent capability on Japanese and Russian summary generation, even outperforming the "translate-train" settings by 0.43 / 9.05 on BLEU / ROUGE-L in general. This is because the "translate-train" training data is automatically translated rather than human-annotated, lowering its quality.

Limitations
Releasing multilingual NL-PL benchmarks While our model has been shown to capture multilingual languages between humans and computer programs, we could not systematically evaluate its performance on a wide range of multilingual NLs due to the lack of corresponding benchmarks. Instead, we undertake NL-to-PL and PL-to-NL experiments on mCoNaLa, which involves only three NLs, and present demonstration examples via zero-shot prompting to reveal its cross-lingual capacity. We encourage researchers in the community to release more multilingual NL-PL benchmarks to accelerate the development of this intersecting area.
Scaling up the model size and data In this work, we only use the PL data from CodeSearchNet for a fair comparison to baselines, preventing the model from learning from more PL genres and billions of open-source repositories. Increasing the amount of data for bilingual NL-PL pairs is also a promising direction, for example via data augmentation. Moreover, the scaling law for large pre-training has been well studied and has shown significant performance gains in the literature (Chen et al., 2021; Li et al., 2022). A targeted effort at expanding the pre-training data size and scaling up models could give rise to more considerable improvements toward universal multilingual NL-PL pre-training.

Curse of multilinguality
We argue that the curse of multilinguality (Conneau et al., 2020) also exists in unified multilingual NL-PL pre-training, in which per-language capacity decreases as the number of languages increases given a fixed model size.
It is an interesting direction to investigate the issue of curse of multilinguality upon this work.

A.1 Input representation
We base our shared text/code lexer on the mT5 tokenizer, SentencePiece (Kudo and Richardson, 2018), specifically its unigram language model (Kudo, 2018). Since the word distribution in PL essentially differs from that of NL, it is not feasible to directly apply SentencePiece tokenization to PL. SentencePiece is ineffective in encoding whitespace characters (such as blank space, tab \t, and newline \n), which are crucial for representing structure and indentation in source code. We thus add a set of additional tokens for encoding whitespace of different lengths in PL. Considering that developers with different programming habits may type indentations of various lengths and characters (tab or space), we add spaces of length 1/2/4 (denoted as <space * 1>, <space * 2>, and <space * 4>, respectively) and tab \t to represent various indentations. Moreover, we use the newline symbol \n to encode line breaks.
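This whitespace handling can be illustrated with a small pre-tokenization pass. The sketch is ours: the literal token strings and the greedy longest-first matching are assumptions based on the <space * N> notation above, not the exact preprocessing code.

```python
# hypothetical token spellings following the paper's <space * N> scheme
SPACE_TOKENS = [("    ", "<space*4>"), ("  ", "<space*2>"), (" ", "<space*1>")]

def encode_indentation(line):
    """Replace leading spaces with explicit indentation tokens, longest
    first, so structure-bearing whitespace survives tokenization
    (tabs are kept as-is, since \t is itself added as a token)."""
    i = 0
    out = []
    while i < len(line) and line[i] == " ":
        for ws, tok in SPACE_TOKENS:
            if line.startswith(ws, i):
                out.append(tok)
                i += len(ws)
                break
    return "".join(out) + line[i:]
```

Without such tokens, a plain SentencePiece model would normalize the leading spaces away and the model could not distinguish, e.g., one indentation level from two in Python.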
Our tokenizer eventually consists of 250,105 SentencePiece vocabulary entries. In the NL-PL parallel data, the paired NL is often given as a part of the PL inputs (e.g., as docstrings or comments), thereby hurting the code-to-text test performance. Accordingly, for all code-to-text generation and a random 50% of text-to-code generation in PTLM training, we replace all NL sentences with an NL placeholder "<|removed|>" if they exist in the corresponding PL fragments.
We additionally observe that the parallel data in CodeSearchNet contains a few non-English NLs. Directly regarding all NLs in CodeSearchNet as English would confuse the model when distinguishing various NLs. To better leverage this parallel supervision signal, we utilize the FastText (Joulin et al., 2016) tools to identify different NLs. Specifically, we only consider NL sentences whose language is predicted by FastText with confidence higher than 80%. In PTLM training, we use the predicted language genre with 50% probability at random; otherwise, we label the sample as generic "text" rather than "English". The model could therefore implicitly tell different language genres apart without being exposed to erroneous supervision.
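The confidence-based filtering step can be sketched as follows. The 80% threshold comes from the text above; the predictor interface is our abstraction (in practice it would wrap a fastText language-identification model), so the helper and its signature are illustrative.

```python
def filter_by_language(samples, predict, min_conf=0.8):
    """Keep only NL sentences whose predicted language meets the
    confidence threshold; `predict` maps text -> (lang_code, confidence),
    e.g. a thin wrapper around a language-ID model."""
    kept = []
    for text in samples:
        lang, conf = predict(text)
        if conf >= min_conf:
            kept.append((text, lang))
    return kept
```

Keeping the predictor as a parameter also makes the filtering logic easy to test with a stub, independent of any model download.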

A.2.2 NL data
Monolingual NL corpus CC-100 was constructed by processing CommonCrawl snapshots (Wenzek et al., 2019). The original CC-100 dataset comprises documents separated by double newlines. We maintain the document-level corpus by concatenating paragraphs within the same document page. Table 7 summarizes the statistics of our processed data, totaling 1.5 billion training document pages in 116 monolingual NLs. We rescale the data distribution according to page counts, as mentioned in Eq. (3), with α = 0.3.

Table 10: Task prompts we use for finetuning (e.g., "fix bugs: \n" for code repair). For documentation translation, "src_lang" and "tgt_lang" represent the source and target language (e.g., English, Danish, Latvian, Norwegian, and Chinese), respectively.
The human translators must meet the following requirements: 1. being a native Chinese speaker; 2. holding at least a master's degree in Spanish, Japanese, or Russian translation, literature, or related subjects; 3. holding professional translation certificates in the corresponding language.
After human translation, we also employ professional engineers who are Chinese native speakers with at least five years of experience in Python to perform further translation refinement.We will release this dataset to speed up the research on multilingual code summarization.
Examples of Chinese code summarization (zero-shot prompting) We show Chinese code summarization examples of our model under zero-shot prompting evaluation in Figure 9. We prepend the instruction prompt "translate Python to Chinese: \n" for training and evaluation. The examples demonstrate that our model is equipped with zero-shot ability on Chinese code summarization, affirming the positive effect of our cross-lingual pre-training. Moreover, as shown in Figure 9, our model focuses on the high-level meaning of the input code fragments, neglecting implementation details. We suspect this is because we use a code search corpus as the NL-PL bilingual training data, where NL instructions comprising high-level descriptions are usually extracted from code comments. This causes a discrepancy between the training and evaluation settings.

A.6 Qualitative examples (zero-shot prompting)
Zero-shot multilingual PL-to-NL generation Figures 9 and 10 illustrate code summarization examples with zero-shot prompting. As illustrated, our model focuses on the global overview of code semantics rather than verbalizing the implementation process. Moreover, when explaining a short snippet of code, different people may interpret it with various meanings, which we refer to as "program ambiguity"; this creates difficulties in annotating and evaluating multilingual code summarization, especially because the test-set references of mCoNaLa are human-rewritten while the training NL is not. We also find that the model tends to copy small code snippets for code summarization. For instance, given the input "# -*-utf-8 -*-", our model tends to copy the original string rather than describe its usage in NL.
Zero-shot NL-to-PL generation Figures 11 and 12 demonstrate examples of zero-shot text-to-code generation. We observe that ERNIE-Code performs well in generating function names, arguments, and docstrings. It tends to generate function-level snippets and call user-defined functions following object-oriented logic, while lacking knowledge of built-in functions or user-defined contexts given multilingual NL inputs. The given Japanese instruction requires the model to memorize the API usage of the selenium library, which our model may never have seen in the training data. We argue that training on additional PL data collected from GitHub and StackOverflow, rather than the limited data of CodeSearchNet, would confer benefits in memorizing and comprehending API usage and instruction contexts, and could lead to large improvements. Note that the generated "<|removed|>" docstring in Figure 11 is consistent with our preprocessing in §A.2.1.

Zero-shot multilingual NL-to-NL translation
To further validate the zero-shot translation capability between multilingual NLs, we select several language pairs from different language families and translate technical terminologies with zero-shot prompting. Figure 13 exhibits examples of multilingual NL translation in eight randomly selected directions, such as Spanish to French and Russian to Arabic. This suggests that our cross-lingual pre-training can capture semantic alignment without seeing direct supervision from bilingual phrase or short-term pairs.

Figure 2 :
Figure 2: Schematic of the SCLM objective for PL (left) and NL (right) examples.
mBART is trained with a full-text denoising objective on a subset of 25 languages from CommonCrawl; it learns to reconstruct full NL texts from corrupted ones with an arbitrary noising function. mT5 (Xue et al., 2021) is a multilingual-NL encoder-decoder model adapted from T5. It is trained on 101 NLs using filtered CommonCrawl data (mC4) with the same SCLM objective as our model. PLBART (Ahmad et al., 2021) is a multilingual-PL version of BART with a denoising objective using three noising formats. It is trained on 210M Java functions and 470M Python functions from GitHub, and 47M English posts from StackOverflow. CodeT5 (Wang et al., 2021) is a PL version of T5 that is pre-trained on six-PL monolingual/parallel data from CodeSearchNet and extra C/C# data collected from GitHub. It additionally learns token-type information from identifiers and applies dual generation between English and PLs.

Figure 4 :
Figure 4: Semantic and syntactic comparison on multilingual text-to-code generation. All comparison models are evaluated under "translate-train" settings by default, unless otherwise specified (i.e., "zero-shot").

Figure 5 :
Figure 5: Ablation test performance (log-scale). The reported results are averaged over all subtasks.
Figure 8 exhibits a tokenization example of Python snippets. SentencePiece tends to normalize whitespaces and skip extra empty characters, while our modified tokenizer allows the model to cope with whitespace characters such as indentation in PL.

Figure 9 :
Figure 9: Examples of Chinese code summarization with zero-shot prompting.

Table 1 :
Comparison of our model to existing massively multilingual pre-trained models for NLs and PLs.

Table 5 :
Results of program repair task.

Table 2 (last row) shows the performance of zero-shot code-to-text generation.

Table 3 (last row) shows the results of zero-shot text-to-code generation, demonstrating the effectiveness of our approach on zero-shot prompting. To extend the evaluation to other NLs, we further release a Python-Chinese test set by translating mCoNaLa into its Chinese variant via crowd-sourcing. Our model shows decent ability on zero-shot PL-to-Chinese generation. We give zero-shot demonstrations and provide data curation details in §A.5. We argue that our model captures many NL genres via cross-lingual pre-training. We encourage the community to release more multilingual code-to-text benchmarks for further evaluation.

Table 6 :
Statistics of CodeSearchNet in six PLs, totaling 6.5 million monolingual PL instances and 1.5 million parallel NL-PL samples.

Table 7 :
Statistics of the CC-100 corpus, totaling 1.5 billion training document pages from 116 different NLs. Reported training pages and percentages are calculated according to the document distribution of the original data. Note that our 116 NLs include 5 Romanized variants of existing languages, denoted by "Romanized".

Table 8 :
Statistics of the OPUS corpus, totaling 7.8 billion bilingual NL pairs from 105 different NL pairs. The reported count of bilingual pairs ("#Sent.") and percentage ("#Percent.") are calculated according to the original data.