CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Since the rise of neural natural-language-to-code models (NL→Code) that can generate long expressions and statements rather than a single next token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of encoding only the generated tokens, as in BERTScore, CodeBERTScore also encodes the natural language input preceding the generated code, thus also modeling the consistency between the generated code and its given natural language context. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score from CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. We release five language-specific pretrained models to use with our publicly available code. Our language-specific models have been downloaded more than 1,000,000 times from the Huggingface Hub. Our code and data are available at https://github.com/neulab/code-bert-score


Introduction
Natural-language-to-code generation (NL→Code) has seen sharply growing popularity recently due to the emergence of large language models (LLMs) trained on vast amounts of natural language and code (Chen et al., 2021; Fried et al., 2022; Zhou et al., 2023; Austin et al., 2021; Allal et al., 2023). LLMs have reached such a high NL→Code accuracy that they are now useful for the broad programming audience and actually save developers' time when implemented in tools such as GitHub's Copilot. This sharp rise in LLMs' usability was achieved thanks to their ability to accurately generate long completions, which span multiple tokens and even lines, rather than only a single next token as in early models (Allamanis and Sutton, 2013; Movshovitz-Attias and Cohen, 2013). Nevertheless, evaluating and comparing different models has remained a challenging problem (Xu et al., 2022) that requires an accurate and reliable evaluation metric for the quality of the models' generated outputs, and existing metrics are sub-optimal.
Existing evaluation approaches The most common evaluation metrics are token-matching methods such as BLEU (Papineni et al., 2002), adopted from natural language processing. These metrics are based on counting overlapping n-grams in the generated code and the reference code. CrystalBLEU (Eghbali and Pradel, 2022) extends BLEU by ignoring the 500 most frequently occurring n-grams, arguing that they are trivially shared between the prediction and the reference. Nonetheless, both BLEU and CrystalBLEU rely on the lexical exact match of tokens, which does not account for diversity in implementation, variable names, and code conventions. Figure 1 shows an example: given the reference code in Figure 1(a), both BLEU and CrystalBLEU prefer (rank higher) the non-equivalent code in Figure 1(b) over the functionally equivalent code in Figure 1(c).
CodeBLEU (Ren et al., 2020) attempts to lower the requirement for a lexical exact match by also relying on data-flow and Abstract Syntax Tree (AST) matching; nevertheless, valid generations may have different ASTs and data flow from the reference code, which may lead to a low CodeBLEU score even when the prediction is correct. Further, partial predictions may be useful for a programmer, but accepting them may lead to partial code that does not parse, and thus cannot be fully evaluated by CodeBLEU (for example, predicting the first line of a for loop, without the loop's body).
Execution-based evaluation attempts to address these problems by running tests on the generated code to verify its functional correctness (Chen et al., 2021; Athiwaratkun et al., 2022; Li et al., 2022; Wang et al., 2022; Lai et al., 2022). This provides a direct measure of the functionality of the generated code while being agnostic to diversity in implementation and style. However, execution-based evaluation requires datasets that are provided with hand-written test cases for each example, which is costly and labor-intensive to create; thus, only a few such datasets exist. Additionally, executing model-generated code is susceptible to security threats, and thus should be run in an isolated sandbox, which makes it technically cumbersome to work with iteratively.
Our approach In this work, we introduce CodeBERTScore, an evaluation metric for code generation, leveraging self-supervised pretrained models of code such as CodeBERT (Feng et al., 2020) and adopting best practices from BERTScore (Zhang et al., 2020). First, CodeBERTScore encodes the generated code and the reference code independently with pretrained models, including the natural language instructions or comments. Then, we compute the cosine similarity between the encoded representations of each token in the generated code and each token in the reference code. Finally, the best-matching token vector pairs are used to compute precision and recall. CodeBERTScore allows comparing code pairs that are lexically different while taking into account (1) the programmatic or natural language context, if provided; (2) the contextual information of each token; and (3) implementation diversity. Our approach is illustrated in Figure 2.
Contributions In summary, our main contributions are: (a) CodeBERTScore: a self-supervised metric for NL→Code evaluation, based on BERTScore, which leverages the benefits of pretrained models while not requiring labeling or manually annotated data. (b) An extensive empirical evaluation across four programming languages, showing that CodeBERTScore is more correlated with human preference and with execution correctness than all previous approaches, including BLEU, CodeBLEU, and CrystalBLEU. (c) We pretrain and release five language-specific CodeBERT models to use with our publicly available code, for Java, Python, C, C++, and JavaScript. As of the time of this submission, our models have been downloaded from the Huggingface Hub more than 1,000,000 times.

Problem Formulation
Given a context x ∈ X (e.g., a natural language instruction or comment), a code generation model M : X → Y produces a code snippet ŷ ∈ Y by conditioning on the intent specified by x. The quality of the generation is evaluated by comparing ŷ ∈ Y with the reference implementation y* ∈ Y, using a metric function f : Y × Y → R, essentially computing f(ŷ, y*).
A larger value of f(ŷ, y*) indicates that the generated code is more accurate with respect to the reference code, and the way f ranks different candidates is more important than the absolute value of f(ŷ, y*). That is, ideally, if a prediction ŷ1 is closer to being functionally equivalent to y* and more preferable by human programmers than a prediction ŷ2, we wish a good metric to rank ŷ1 higher than ŷ2. That is, we seek a function f such that f(ŷ1, y*) > f(ŷ2, y*).

Background: BERTScore
BERTScore (Zhang et al., 2020) was proposed as a method for evaluating mainly machine translation outputs. The idea in BERTScore is to encode the candidate sentence (the prediction) and the reference sentence (the ground truth) separately, using a BERT-based model, which encodes each sequence of tokens as a sequence of vectors. Then, BERTScore computes the cosine similarity between every vector from the candidate sequence and every vector from the reference sequence.
Given these similarity scores, BERTScore computes sentence-level precision by taking the maximum similarity score for every candidate vector and averaging, and computes recall by averaging the maximum similarity scores for every reference vector. Intuitively, a high BERTScore-recall is obtained, for example, if every vector from the reference sentence has at least one vector from the candidate sentence that is highly cosine-similar to it; a high BERTScore-precision is obtained if every vector from the candidate sentence is highly cosine-similar to at least one vector from the reference sentence. Ultimately, the final score is the F1 score, computed as the harmonic mean of precision and recall.
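The aggregation step described above can be sketched as follows, assuming a precomputed matrix sim where sim[i][j] holds the cosine similarity between candidate token i and reference token j (an illustrative re-implementation, not the official BERTScore code):

```python
def bertscore_aggregate(sim):
    # sim[i][j]: cosine similarity between candidate token i and reference token j.
    num_cand = len(sim)
    num_ref = len(sim[0])
    # Precision: average, over candidate tokens, of the best-matching reference token.
    precision = sum(max(row) for row in sim) / num_cand
    # Recall: average, over reference tokens, of the best-matching candidate token.
    recall = sum(max(sim[i][j] for i in range(num_cand)) for j in range(num_ref)) / num_ref
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With a diagonal-dominant similarity matrix, both precision and recall reduce to the average of the diagonal, which matches the intuition that well-aligned tokens yield a high score.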

CodeBERTScore
Our approach generally follows BERTScore, with the following main differences: (1) We encode the context (the natural language instruction or comment) along with each of the generated and reference code snippets, but without using the encoded context in the final similarity computation, essentially computing f(ŷ, y*, x) rather than f(ŷ, y*). (2) Given the precision and recall, instead of computing only the F1 score, we also compute F3 to weigh recall higher than precision, following METEOR (Banerjee and Lavie, 2005). (3) As our underlying BERT-like model, we use programming-language-specific models that we pretrain and release, rather than models that were intended for natural language only. We use a BERT-like pretrained model B to encode the reference and the candidate. In our experiments, B is a CodeBERT model that we further pretrained using the masked language modeling objective (Devlin et al., 2019) on language-specific corpora, but B can be any transformer-based model whose internal hidden states we can access.
Token Representation We concatenate the context x with each of the reference and the candidate, resulting in x · y* and x · ŷ. We use the tokenizer T_B provided with the model B to get a sequence of tokens for each:

T_B(x · y*) = ⟨x_1, ..., x_k, y*_1, ..., y*_m⟩        T_B(x · ŷ) = ⟨x_1, ..., x_k, ŷ_1, ..., ŷ_n⟩    (1)

We run a standard "forward pass" with the model B for each tokenized sequence, resulting in a sequence of contextual vectors, one per token:

B(T_B(x · y*)) = ⟨x_1, ..., x_k, y*_1, ..., y*_m⟩        B(T_B(x · ŷ)) = ⟨x_1, ..., x_k, ŷ_1, ..., ŷ_n⟩    (2)

where each element on the right-hand side of Equation (2) denotes the encoded vector of the corresponding token. Finally, we mask out the encoded context tokens x_1, ..., x_k, as well as all non-alphanumeric tokens (parentheses, brackets, dots, commas, whitespaces, etc.) except for arithmetic operators, from each of the encoded reference and the encoded candidate. This results in encoded reference tokens y* = ⟨y*_1, ..., y*_m⟩, encoded candidate tokens ŷ = ⟨ŷ_1, ..., ŷ_n⟩, and their corresponding masks m* and m. We denote by y[m] the remaining encoded tokens in y after selecting only alphanumeric token vectors according to the mask m.

Similarity Computation
We compute the cosine similarity between every encoded reference token and every encoded candidate token, following Zhang et al. (2020):

sim(y*_i, ŷ_j) = (y*_i · ŷ_j) / (∥y*_i∥ ∥ŷ_j∥)    (3)

Although this compares the individual tokens y*_i and ŷ_j, their vector representations contain information about their context, and thus about their semantic role in the code.
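For concreteness, the pairwise similarity computation can be sketched in pure Python (an illustrative version; the actual implementation uses batched tensor operations):

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors: dot product over the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_matrix(ref_vecs, cand_vecs):
    # sim[i][j]: cosine similarity between reference token vector i
    # and candidate token vector j.
    return [[cosine(r, c) for c in cand_vecs] for r in ref_vecs]
```

The resulting matrix is exactly the input consumed by the precision/recall aggregation described earlier.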

Experimental Setup
We evaluate CodeBERTScore across multiple datasets and programming languages. We first show that CodeBERTScore is more correlated with human preference than previous metrics, using human-rated solutions for the CoNaLa dataset (Yin et al., 2018a; Evtikhiev et al., 2022). We then show that CodeBERTScore is more correlated with functional correctness, using the HumanEval dataset (Chen et al., 2021). We also show that CodeBERTScore achieves higher distinguishability, a newly proposed meta-metric, than other metrics (Appendix F). Finally, we analyze some of the design decisions and their implications.

Training Language-specific CodeBERT models
Training We used CodeBERT (Feng et al., 2020) as our base model (B) and continued its self-supervised pretraining (Gururangan et al., 2020) with the masked language modeling (MLM) objective (Devlin et al., 2019) on Python, Java, C++, C, and JavaScript corpora. We trained a separate model for each programming language, for 1,000,000 steps per language, using a batch size of 32 and an initial learning rate of 5e-5, decayed linearly to 3e-5. Our implementation is based on the widely used HuggingFace Transformers library (Wolf et al., 2019) and on BERTScore, and it supports any transformer-based model available on the HuggingFace hub.
Dataset We trained each model on the language-specific subset of the CodeParrot (Tunstall et al., 2022) dataset, which consists of 115M code files from GitHub overall, further filtered by keeping only files with an average line length lower than 100 characters, more than 25% alphanumeric characters, and no auto-generated content. Even after 1,000,000 training steps, none of the models completed even a single epoch, meaning that every training example was seen at most once.
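The filtering criteria above can be sketched as a simple predicate. This is a hypothetical re-implementation for illustration; the exact CodeParrot filters may differ in detail, for example in how auto-generated files are detected:

```python
def keep_file(source: str) -> bool:
    # Apply the three filters described in the text:
    # average line length < 100, > 25% alphanumeric characters,
    # and not auto-generated (detected here via a naive marker search).
    lines = source.splitlines()
    if not lines:
        return False
    avg_line_len = sum(len(line) for line in lines) / len(lines)
    alnum_frac = sum(ch.isalnum() for ch in source) / max(len(source), 1)
    looks_generated = "auto-generated" in source.lower()
    return avg_line_len < 100 and alnum_frac > 0.25 and not looks_generated
```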

Comparing Different Metrics
We compare CodeBERTScore with existing metrics that are commonly used for code generation evaluation. We use human-annotated preference and execution-based results as the ground truth and measure each metric's correlation with them.
Correlation metrics We used three major correlation metrics. Following best practices in natural language evaluation, we used Kendall-Tau (τ), Pearson (r_p), and Spearman (r_s) to measure the correlation between each metric's scores and the reference measurements. The detailed equations can be found in Appendix C.
Human preference experiments We evaluate different metrics on CoNaLa (Yin et al., 2018b), a natural-language-to-Python code generation benchmark collected from StackOverflow. We use the human annotations of Evtikhiev et al. (2022) to measure the correlation between each metric and human preference. More details are provided in Appendix B.1.

Functional correctness experiments
We evaluate functional correctness using the HumanEval (Chen et al., 2021) benchmark. Each example in HumanEval contains a natural language goal, hand-written input-output test cases, and a human-written reference solution. While the original HumanEval is in Python, Cassano et al. (2022) translated HumanEval into 18 programming languages and provided the predictions of the Codex model (Chen et al., 2021) (code-davinci-002) along with their corresponding functional correctness. We used Java, C++, Python, and JavaScript for these experiments, which are among the most popular programming languages in open-source projects. More details are provided in Appendix B.2.
Hyperparameters We tuned only the following hyperparameters for CodeBERTScore: whether to use F1 or F3, and which layer of the underlying model to extract the encoded tokens from, which we examine in Section 5. We used F1 in the human preference experiments and F3 in the functional correctness experiments. We perform 3-fold cross-validation and report average results across the three folds. As for the layer to extract the token vectors from, we used layer 7 for CoNaLa; in HumanEval, we used layer 7 for Java, 10 for C++, 11 for JavaScript, and 9 for Python.

Analysis
We conducted a series of additional experiments to understand the importance of different design decisions, and to gain insights on applying CodeBERTScore to new datasets and scenarios.
Can we use CodeBERTScore in a new language without a language-specific CodeBERT?
In all experiments in Section 4, we used the language-specific model that we continued to pretrain on each language. But what if we wish to use CodeBERTScore in a language for which we don't have a language-specific model? We compare the language-specific models to CodeBERT-base in Figure 4. Generally, CodeBERT-base achieves performance close to that of a language-specific model. However, in most HumanEval experiments and correlation metrics, using the language-specific model is beneficial. These results show that language-specific models are often preferable if available, but CodeBERT-base can still provide close performance even without language-specific pretraining.

Which transformer layer should we use? We further investigate the impact of using hidden states from different layers of the model, that is, the layer from which the vectors in Equation (2) are taken, in the computation of CodeBERTScore. The results are shown in Figure 5: generally, the deeper the layer, the higher the average correlation between CodeBERTScore and functional correctness, across all programming languages. However, in almost all languages, performance reaches its maximum before the last layer and decreases in the following layers. This suggests that higher layers encode the semantic information of each token more accurately, but the final layers may be more task-specific. These observations are consistent with Tenney et al. (2019), who found that lower layers in BERT tend to process shallow information, while higher layers encode deeper semantic meaning in natural language.
Does encoding natural language context help? One major difference between CodeBERTScore and BERTScore is that CodeBERTScore leverages the context of the generated code, such as the natural language instruction or intent that was given as input for generation. We find that using context increases correlation; for example, it improves the Kendall-Tau of CodeBERTScore from 0.50 to 0.52. While this paper mainly focuses on natural language instructions, we believe that CodeBERTScore can thus benefit other programming scenarios as well, for example when generating code given human-written comments, or when generating code given the preceding code context.
CodeBERTScore allows soft matching of tokens The heatmaps in Figure 6 show the similarity scores between tokens in CodeBERTScore. For example, both shutil.rmtree and os.rmdir in Figure 6(a) delete a folder; CodeBERTScore aligns each token to a respective token in the other expression, even though the two spans do not share many identical tokens.
In Figure 6(b), both code snippets calculate a square root, where one uses math.sqrt(x) and the other uses x ** 0.5. An exact surface-form matching metric such as chrF would assign a low similarity score to this code pair, as the two only share the token x. However, CodeBERTScore assigns non-zero scores to each token with meaningful alignments, such as matching [sq, rt] with [_0, 5], since a square root is the 0.5-th power.
Additionally, we study the robustness of CodeBERTScore to adversarial perturbations. We found that token-based metrics such as chrF can be misled by such perturbations, preferring snippets with higher character overlap over semantically closer ones, while CodeBERTScore is more robust; an example is shown in Figure 7 in the Appendix.

Figure 6: Heatmaps of the similarity scores between two pieces of code that achieve the same goal. Figure 6(a) shows the similarity scores between os.rmdir(folder) and shutil.rmtree(folder).

Related Work
Token-based metrics Metrics such as BLEU (Papineni et al., 2002) evaluate code generation by counting matching n-grams between the generated and reference code. CrystalBLEU (Eghbali and Pradel, 2022) refines this approach by disregarding trivially shared n-grams, while ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005) emphasize recall and the balance of precision and recall, respectively. However, these metrics, relying on exact lexical matches, often fail to capture semantically equivalent but lexically different code snippets. Unlike these, CodeBERTScore captures the wide, two-sided context of each token, which n-grams cannot capture.

Static analysis-based metrics CodeBLEU (Ren et al., 2020) incorporates data-flow and Abstract Syntax Tree (AST) matching, in addition to token matching. However, valid code may not always align in ASTs and data-flows. Additionally, partial code, although potentially useful, may not parse, and thus cannot be fully evaluated by CodeBLEU. Further, as highlighted by subsequent studies (Wang et al., 2022), CodeBLEU does not correlate well with execution accuracy.
Execution-based Metrics To alleviate the previous issues, execution-based evaluation counts a generated code snippet as correct if it produces the required outputs when run with given inputs (Chen et al., 2021; Athiwaratkun et al., 2022; Li et al., 2022; Wang et al., 2022; Lai et al., 2022; Huang et al., 2022). However, execution-based evaluation requires datasets that are provided with manually crafted test cases for each example, which is costly and labor-intensive to create; thus, only a few such datasets exist. In contrast, CodeBERTScore is completely unsupervised and does not depend on any specific dataset. Further, executing model-generated code is susceptible to security threats, and thus should be run in an isolated sandbox, which makes it technically cumbersome to work with iteratively.

Conclusion
In this paper, we present CodeBERTScore, a simple evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020), using pretrained language models of code and leveraging the natural language context of the generated code. We perform an extensive evaluation across four programming languages which shows that CodeBERTScore is more correlated with human preference than all prior metrics. Further, we show that generated code that receives a higher score from CodeBERTScore is more likely to function correctly when executed. Finally, we release five programming-language-specific pretrained models to use with our publicly available code. These models were downloaded more than 1,000,000 times from the HuggingFace Hub. Our code and data are available at https://github.com/neulab/code-bert-score.

Acknowledgement
We thank Misha Evtikhiev, Egor Bogomolov, and Timofey Bryksin for the discussions, and for the data from their paper (Evtikhiev et al., 2022). We thank the anonymous reviewers for their valuable feedback. We are grateful to Yiwei Qin for the discussions regarding the T5Score paper (Qin et al., 2022); the idea to use functional correctness as a meta-metric was born thanks to the discussion with her. We are also grateful to Aryaz Eghbali and Michael Pradel for the discussions about CrystalBLEU (Eghbali and Pradel, 2022). This material is partly based on research sponsored in part by the Air Force Research Laboratory under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government. This project was also partially supported by a gift from AWS AI.

Limitations
CodeBERTScore requires a GPU for computing the metric, while traditional metrics such as BLEU require only a CPU. This adds a hardware requirement to the evaluation of models of code, while most previous approaches are computationally cheaper (e.g., by counting n-grams). However, since training and testing neural models require a GPU anyway, we can safely assume that a GPU is available. Further, BERT-base models are encoder-only and non-autoregressive; this means that they require only a single "forward pass", compared to encoder-decoder models (e.g., T5) and decoder-only models (e.g., GPT-3) that need to autoregressively generate token after token, using a forward pass for each output token. Thus, the additional time consumed by encoder-only models (e.g., BERT) is negligible, especially when evaluating encoder-decoder or decoder-only models as the NL→Code generators.
Another point to consider is that CodeBERTScore relies on a strong underlying BERT-based model, while methods such as BLEU do not have many "moving parts" or hyperparameters to tune. However, this is mostly an advantage, since CodeBERTScore can be further improved in the future using stronger base models.

A Additional Details
F_β The well-known F1 score is computed as:

F1 = 2 · precision · recall / (precision + recall)

A more general F score, F_β, uses a positive factor β, where recall is considered β times as important as precision:

F_β = (1 + β²) · precision · recall / (β² · precision + recall)

As found in METEOR (Banerjee and Lavie, 2005), using F_β with β = 3, thus preferring recall over precision, results in a higher correlation with human preference in machine translation. In our experiments, we found that this applies to NL→Code as well.
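In code, both scores can be computed with a single helper (a minimal sketch; F1 is the special case β = 1):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    # General F-beta score: recall is weighted beta times as much as precision.
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, with precision 0.2 and recall 0.8, F3 is noticeably higher than F1, reflecting its preference for recall.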
Token Weighting Following Zhang et al. (2020), we compute the inverse document frequency (idf) on a language-specific test set and weigh each token according to its negative log frequency.
Scaling Following Zhang et al. (2020), the cosine similarity scores of hidden states tend to lie in a limited range. Thus, we linearly rescale the resulting scores using an empirical baseline scalar b:

CodeBERTScore_scaled = (CodeBERTScore - b) / (1 - b)

This typically spreads the CodeBERTScore F1 scores across the [0, 1] range, and is merely a cosmetic change: this scaling does not change the way CodeBERTScore ranks different predictions, but it can be slightly more intuitive and easier to interpret. We computed b empirically by sampling random unrelated code pairs and measuring their average similarity score. For Java, the empirical value b_Java was 0.78, and for C++, b_C++ was 0.76.
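The rescaling is a one-line linear map (a sketch assuming the standard BERTScore rescaling form; since it is monotonically increasing, rankings are preserved):

```python
def rescale(score: float, b: float) -> float:
    # Map the raw similarity score so that the empirical baseline b
    # (the average score of random unrelated pairs) lands at 0,
    # and a perfect score of 1 stays at 1.
    return (score - b) / (1 - b)
```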

B.1 Human Preference
For each example, Evtikhiev et al. (2022) asked experienced software developers to grade the generated code snippets from five different models. The grade ranges from zero to four, with zero denoting that the generated code is irrelevant and unhelpful, and four meaning that the generated code solves the problem accurately. Overall, there are 2,860 annotated code snippets (5 generations × 472 examples), where each snippet is graded by 4.5 annotators on average.

B.2 Functional Correctness
We evaluate functional correctness using the HumanEval (Chen et al., 2021) benchmark. Each example in HumanEval contains a natural language goal, hand-written input-output test cases, and a human-written reference solution. On average, each example has 7.7 test cases, and there are 164 examples in total. While the original HumanEval is in Python, Cassano et al. (2022) translated HumanEval into 18 programming languages and provided the predictions of the Codex model (Chen et al., 2021) (code-davinci-002) along with their corresponding functional correctness. We used Java, C++, Python, and JavaScript for these experiments, which are among the most popular programming languages in open-source projects. Notably, Cassano et al. (2022) did not translate the reference solutions to the other languages, so we collected these from HumanEval-X (Zeng et al., 2022). The reference score of every example is either 1 ("correct", if it passes all test cases) or 0 ("incorrect", otherwise).

C Correlation Metrics
Kendall-Tau (τ) τ measures the ordinal (rank) association between a metric such as CodeBERTScore and the reference measurement. It is calculated as:

τ = (|concordant| - |discordant|) / (|concordant| + |discordant|)

where |concordant| represents the number of pairs for which the two measurements agree on their relative rank. That is, if f(ŷ1, y*_1) > f(ŷ2, y*_2), the reference measurement also yields f*(ŷ1, y*_1) > f*(ŷ2, y*_2). Similarly, |discordant| represents the number of pairs for which the two measurements yield opposite ranks. Notably, in our experiments, we restrict the comparisons of ranks to generations of the same question.
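This definition translates directly into a short pure-Python sketch (illustrative; a library implementation would additionally handle ties and large inputs more carefully):

```python
from itertools import combinations

def kendall_tau(metric_scores, reference_scores):
    # Count pairs where the metric and the reference agree (concordant)
    # or disagree (discordant) on the relative order of two candidates.
    concordant = discordant = 0
    for (m1, r1), (m2, r2) in combinations(zip(metric_scores, reference_scores), 2):
        agreement = (m1 - m2) * (r1 - r2)
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```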
Pearson (r_p) r_p measures the linear correlation between a metric and the reference measurement. It is defined as:

r_p = Σ_i (f_i - f̄)(f*_i - f̄*) / sqrt( Σ_i (f_i - f̄)² · Σ_i (f*_i - f̄*)² )

where the sums run over the N generations in the dataset, f̄ is the mean CodeBERTScore over the dataset, and f̄* is the mean similarity score calculated by the reference measurement.

Spearman (r_s) r_s measures the Pearson correlation coefficient between the ranks produced by a metric and those produced by the reference measurement:

r_s = cov(R(f), R(f*)) / (σ(R(f)) · σ(R(f*)))

where R returns the ranks of code snippets in a collection of code snippets Y, cov(·, ·) is the covariance of two variables, and σ(·) is the standard deviation.
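Both coefficients can be sketched in pure Python from these definitions (for simplicity, the rank computation ignores ties):

```python
import math

def pearson(xs, ys):
    # Linear correlation: covariance over the product of standard deviations.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    # Spearman is the Pearson correlation applied to the ranks of the values.
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(xs), ranks(ys))
```

Note that Spearman yields a perfect 1.0 for any monotonically increasing relationship, even a nonlinear one, while Pearson only does so for linear relationships.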

D Standard Deviation
Table 3 shows the same results as Table 1, but with standard deviations.

Figure 7: The similarity rankings of three code snippets given the reference code shutil.rmtree(folder).
We conducted a qualitative evaluation of CodeBERTScore under various perturbations. An example is shown in Figure 7, which shows the CodeBERTScore and chrF rankings of three code snippets based on their similarity to the reference shutil.rmtree(folder). CodeBERTScore ranks the code snippet that employs the appropriate API (os.rmdir) higher than the trivial (folder), which has the same variable name but no function call. Contrarily, chrF assigns a higher ranking to (folder), which has a longer common sequence of characters with the reference, although it is semantically inequivalent.

F Distinguishing Code with Different Semantics
We study how well CodeBERTScore can perform as a generic similarity function that measures the similarity between two arbitrary code snippets y_i and y_j.

F.1 Distinguishability Metric
We evaluate CodeBERTScore using the distinguishability metric d proposed by Eghbali and Pradel (2022), which is calculated as the ratio between the average similarity of intra-class pairs and the average similarity of inter-class pairs:

d = mean over (y_i, y_j) ∈ Pair_intra of f(y_i, y_j)  /  mean over (y_i, y_j) ∈ Pair_inter of f(y_i, y_j)    (7)

where Pair_intra is a set of code pairs from the same semantically equivalent cluster, and Pair_inter is a set of code pairs from two clusters of different functionality. Formally:

Pair_intra = {(y_i, y_j) | ∃k such that y_i, y_j ∈ C_k}
Pair_inter = {(y_i, y_j) | ∃k ≠ l such that y_i ∈ C_k and y_j ∈ C_l}

where C_k is the k-th cluster of semantically equivalent code snippets. Intuitively, a similarity function f that can distinguish between similar and dissimilar code will produce d larger than 1, meaning that a pair of code snippets from the same semantic cluster has, on average, a higher similarity score than a pair of snippets from different clusters. Since the number of intra-class and inter-class pairs grows quadratically with the number of code snippets, in our experiments we followed Eghbali and Pradel (2022) and sampled N inter- and N intra-class pairs instead.
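Given a similarity function and sampled pairs, d can be sketched as follows (an illustrative re-implementation of the ratio-of-means reading of Equation 7 above; the pair-sampling logic is omitted):

```python
def distinguishability(sim, intra_pairs, inter_pairs):
    # d = mean intra-cluster similarity / mean inter-cluster similarity.
    # sim(a, b) is any code-similarity function, e.g., CodeBERTScore.
    mean_intra = sum(sim(a, b) for a, b in intra_pairs) / len(intra_pairs)
    mean_inter = sum(sim(a, b) for a, b in inter_pairs) / len(inter_pairs)
    return mean_intra / mean_inter
```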

F.2 Dataset with Semantically equivalent clusters
We follow Eghbali and Pradel (2022) and use a dataset of programming contest submissions, in which code snippets that solve the same problem are grouped into semantically equivalent clusters by the platform. The dataset consists of 6,958 code snippets covering 278 problems in Java and C++. We use CodeBERTScore to calculate the similarity score for code pairs that share the same semantic class and for code pairs that do not. We then measure the distinguishability of CodeBERTScore according to Equation 7. The results are shown in Table 5.
Table 5 shows that CodeBERTScore achieves a higher distinguishability than CrystalBLEU, which proposed this meta-metric, in both Java and C++. CodeBERTScore achieves a distinguishability score of 9.56 in Java, while CrystalBLEU achieves 5.96; in C++, CodeBERTScore achieves 9.13, while CrystalBLEU achieves only 6.94. This result confirms that CodeBERTScore assigns higher similarity scores to semantically similar code pairs, compared to randomly paired snippets that belong to different semantic classes.
Can We Hack the Distinguishability Metric? Despite the encouraging results in Table 5, we also found that distinguishability can be easily manipulated, since it compares absolute scores across different metrics. For example, while CrystalBLEU achieves a distinguishability score of 5.96, we can craft a variant of CodeBERTScore that achieves a distinguishability score of 120,000 by simple exponentiation of CodeBERTScore's output score.
To illustrate this, we conducted a distinguishability evaluation with the same configuration as before, but with a variant of CodeBERTScore that we call CodeBERTScore^k, defined as the composition of CodeBERTScore with the function f(x) = x^k, that is: CodeBERTScore^k(y1, y2) = (CodeBERTScore(y1, y2))^k.
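A toy numeric illustration of this manipulation, using hypothetical mean similarities of 0.9 for intra-class and 0.5 for inter-class pairs (for simplicity we exponentiate the means directly, whereas the real manipulation exponentiates each pair's score):

```python
def distinguishability(mean_intra: float, mean_inter: float) -> float:
    # d is the ratio of mean intra-class to mean inter-class similarity.
    return mean_intra / mean_inter

k = 10
base = distinguishability(0.9, 0.5)                 # d = 1.8
hacked = distinguishability(0.9 ** k, 0.5 ** k)     # d inflates to roughly 357
# The ranking of any two candidates is unchanged, because x -> x**k
# is monotonically increasing on [0, 1].
```

This shows why comparing absolute distinguishability values across metrics is fragile: a monotonic transformation can inflate d arbitrarily without changing the metric's ranking behavior.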
As a result, CodeBERTScore^k achieves a dramatically higher distinguishability, even though exponentiation is monotonically increasing, and thus the way the underlying CodeBERTScore metric ranks predictions has not changed.
We thus argue that distinguishability is not a reliable meta-metric and is no substitute for execution-based evaluation or human rating. We further suspect that any meta-metric that compares exact, absolute scores across different metrics is susceptible to such manipulations, and that the reliable way to compare metrics is according to the way they rank different examples, rather than the exact scores.
The distinguishability results of CodeBERTScore k with different values of k are shown in Figure 8.As Figure 8 shows, the distinguishability increases almost exponentially with the increasing value of k.We thus argue that distinguishability is not a reliable metametric and is no substitute for execution-basedor human-rating.We further suspect that any meta-metric that compares exact, absolute, scores across different metrics is susceptible to such manipulations, and the reliable way to compare metrics is according to the way they rank different examples, rather than the exact scores.
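The ranking-based comparison argued for here can be made concrete: any strictly increasing transform of a metric's scores leaves its ranking of examples, and therefore any rank correlation computed from it, unchanged. A minimal sketch with toy, hypothetical scores:

```python
def rank(scores):
    """Indices of examples sorted from lowest to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])

scores = [0.31, 0.92, 0.54, 0.77, 0.12]   # toy metric outputs
hacked = [s ** 10 for s in scores]        # exponentiated variant

# The exponentiated metric orders every pair of examples identically,
# so any rank correlation with an external signal is identical too.
assert rank(scores) == rank(hacked)
```

This is precisely why rank correlations such as Kendall-Tau are robust to the exponentiation trick that inflates distinguishability.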

G Additional Examples
In this section, we provide additional examples in which CodeBERTScore prefers the functionally correct prediction, while the best baseline metric in each language ranks higher a functionally incorrect prediction that is not equivalent to the reference. Figure 9 shows an example in Java, and Figure 10 shows an example in C++.

Figure 1: An intuitive example of the usefulness of CodeBERTScore in measuring generated code. Figure 1(a) shows a reference code snippet in Java; Figure 1(b) and Figure 1(c) show two generated predictions. Given the reference, both BLEU and CrystalBLEU prefer (score higher) the snippet in Figure 1(b), which is not functionally equivalent to the reference, while our proposed CodeBERTScore prefers the code in Figure 1(c), which is functionally equivalent to the code in Figure 1(a).

Figure 2: A diagram illustrating CodeBERTScore: we use a language-specific CodeBERT model to encode each of ⟨natural_language, reference_code⟩ and ⟨natural_language, generated_code⟩. We then compute the pairwise cosine similarity between every encoded token in the reference and every encoded token in the generated code, ignoring the encoded natural-language context tokens and encoded punctuation tokens; finally, we take the max across the rows of the resulting matrix to compute Precision and across the columns to compute Recall.
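The matching step in this diagram can be sketched with plain NumPy, following the standard BERTScore formulation. The embeddings below are random placeholders standing in for CodeBERT token encodings, and the masking of natural-language and punctuation tokens described in the caption is omitted for brevity:

```python
import numpy as np

def bertscore_like(ref_emb, cand_emb):
    """Greedy token matching as in BERTScore: cosine similarity between
    every reference token and every candidate token, then per-token
    maxima averaged into Recall and Precision."""
    # L2-normalize so dot products equal cosine similarities.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T                  # shape: (n_ref_tokens, n_cand_tokens)
    recall = sim.max(axis=1).mean()     # best match for each reference token
    precision = sim.max(axis=0).mean()  # best match for each candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

rng = np.random.default_rng(0)
ref = rng.normal(size=(7, 64))   # 7 reference tokens, 64-dim placeholder vectors
cand = rng.normal(size=(5, 64))  # 5 generated tokens
p, r, f = bertscore_like(ref, cand)
```

When candidate and reference embeddings are identical, every token finds itself as its best match and Precision, Recall, and F1 all equal 1.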

Figure 4: The Kendall-Tau and Spearman correlations on the development set of different datasets with the language-specific pretrained model (Lang-specific) and with the base CodeBERT (Base model).
Figure 5:
Figure 6: Heatmaps of the similarity scores between two pieces of code that achieve the same goal. Figure 6(a) shows the similarity scores between os.rmdir(folder) and shutil.rmtree(folder). Figure 6(b) shows the similarity scores between math.sqrt(x) and x ** 0.5.
(a) The ground truth reference: find the index of target in this.elements. (b) Preferred by BLEU & CrystalBLEU: find whether or not target is in this.elements. (c) Preferred by CodeBERTScore: find the index of target in this.elements.

Table 1: Kendall-Tau (τ) and Spearman (r_s) correlations of each metric with functional correctness on HumanEval in multiple languages. The correlation coefficients are reported as the average across three runs. Standard deviations are provided in Table 3.

Table 2: The Kendall-Tau (τ), Pearson (r_p) and Spearman (r_s) correlations with human preference. The best performance is in bold. The correlation coefficients are reported as the average across three runs. Standard deviations are provided in Table 4.
Correlation with functional correctness Table 1 shows the correlation between each metric and functional correctness: CodeBERTScore achieves the highest or comparable Kendall-Tau and Spearman correlations with functional correctness across all four languages. METEOR achieves a correlation comparable to CodeBERTScore's in Java and JavaScript, and its correlation is surprisingly higher than that of the other baseline metrics. In C++ and Python, however, CodeBERTScore is strictly better. Overall, averaged across languages, CodeBERTScore is more correlated with functional correctness than all baselines.
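As a concrete sketch of this evaluation protocol (with hypothetical metric scores and pass/fail outcomes, not the paper's data), Kendall's tau-a between a metric's scores and binary functional-correctness labels can be computed by counting concordant and discordant pairs:

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical metric scores and pass/fail outcomes for five generations.
scores = [0.91, 0.85, 0.78, 0.70, 0.62]
passed = [1, 1, 0, 1, 0]
tau = kendall_tau(scores, passed)
```

In practice one would use a library implementation such as scipy.stats.kendalltau, which also handles tie corrections (tau-b); the hand-rolled tau-a above is only meant to show what is being counted.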
Table 4 shows the results from Table 2, with standard deviations.

Table 3: Kendall-Tau (τ) and Spearman (r_s) correlations of each metric with functional correctness on HumanEval in multiple languages. The correlation coefficients are reported as the average across three runs, along with the standard deviation.

Table 4: The Kendall-Tau (τ), Pearson (r_p) and Spearman (r_s) correlations with human preference. The best performance is in bold. The correlation coefficients are reported as the average across three runs. Numbers inside parentheses indicate standard deviations.

Table 5: Distinguishability with different metrics as the similarity function. CodeBERTScore achieves higher distinguishability than CrystalBLEU, which proposed this meta-metric, on the same datasets.