Soft-Labeled Contrastive Pre-training for Function-level Code Representation

Code contrastive pre-training has recently achieved significant progress on code-related tasks. In this paper, we present \textbf{SCodeR}, a \textbf{S}oft-labeled contrastive pre-training framework with two positive sample construction methods to learn function-level \textbf{Code} \textbf{R}epresentation. Considering the relevance between codes in a large-scale code corpus, soft-labeled contrastive pre-training obtains fine-grained soft-labels in an iterative adversarial manner and uses them to learn better code representation. Positive sample construction is another key component of contrastive pre-training. Previous works use transformation-based methods such as variable renaming to generate semantically equivalent positive codes. However, the generated code usually has a highly similar surface form to the original, which misleads the model to focus on superficial code structure instead of code semantics. To encourage SCodeR to capture semantic information from the code, we utilize code comments and abstract syntax sub-trees of the code to build positive samples. We conduct experiments on four code-related tasks over seven datasets. Extensive experimental results show that SCodeR achieves new state-of-the-art performance on all of them, which illustrates the effectiveness of the proposed pre-training method.


Introduction
Function-level code representation learning aims to learn continuous distributed vectors that represent the semantics of code snippets (Alon et al., 2019), which has led to dramatic empirical improvements on a variety of code-related tasks such as code search, clone detection, code summarization, etc.
To learn function-level code representation on an unlabeled code corpus with self-supervised objectives, recent works (Jain et al., 2021; Bui et al., 2021; Ding et al., 2022; Wang et al., 2022a) propose contrastive pre-training methods for programming languages. In their contrastive pre-training, they usually pull positive code pairs together in the representation space and treat different codes as negative pairs by pushing their representations apart. However, they ignore the potential relevance between codes, since different programs in a large code corpus may share some similarities. For example, an ascending sort program and a descending sort program are somewhat similar since they both sort their input in a certain order. More seriously, there are many duplications in code corpora (Lopes et al., 2017; Allamanis, 2019), which can cause the "false negative" problem and deteriorate the model (Huynh et al., 2022; Chen et al., 2021b).

Figure 1: An example of applying variable renaming.
The other problem of current code contrastive pre-training methods is their positive sample construction. ContraCode (Jain et al., 2021) and Corder (Bui et al., 2021) design code transformation algorithms like variable renaming and dead code insertion to generate semantically equivalent programs as positive samples, while Code-MVP (Wang et al., 2022a) leverages code structures like abstract syntax trees (AST) and control flow graphs (CFG) to transform a program into different variants. However, as shown in Figure 1, these methods usually result in generated positive samples with highly similar structures (e.g., double loop statements with a conditional statement) to the original program. To pull such positive pairs closer in the representation space, the model will tend to learn function-level code representation from superficial code structure rather than substantial code semantics. To address these limitations, we present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods to learn function-level Code Representation.
The soft-labeled contrastive pre-training framework obtains relevance scores between samples and the original program as soft-labels in an iterative adversarial manner to improve code representation. Specifically, we first leverage hard-negative samples from contrastive pre-training to fool discriminators that can explore finer-grained token-level interactions, while the discriminators learn to distinguish them and predict relevance scores among samples as soft-labels for contrastive pre-training. Through this adversarial iteration, the discriminators can provide progressive feedback to improve code contrastive pre-training through soft-labels.
As for positive sample construction, we propose to utilize the code comment and abstract syntax sub-tree of the source code to construct positive samples for SCodeR pre-training. Generally, user-written code comments closely describe the function of a source code, like "sort the input array" in Figure 1, which provides crucial semantic information for the model to capture code semantics. Besides the comment, the code itself also contains rich information. To further explore the intra-code correlation and contextual knowledge for code contrastive pre-training, we randomly select a piece of code via the AST, like the conditional statement of Figure 1, and its context as a positive pair. These positive pairs require the model to understand code semantics and learn to infer the selected code based on its context, which helps the model learn representation from code semantics.
We evaluate SCodeR on four code-related downstream tasks over seven datasets, including code search, clone detection, zero-shot code-to-code search, and markdown ordering in Python notebooks. Results show that SCodeR achieves state-of-the-art performance, and ablation studies demonstrate the effectiveness of the positive sample construction and soft-labeled contrastive pre-training. We release the codes and resources at https://github.com/microsoft/AR2/tree/main/SCodeR.

Related Works
Pre-trained Models for Programming Language. With the great success of pre-trained models in the natural language processing field (Devlin et al., 2018; Lewis et al., 2019; Raffel et al., 2019; Brown et al., 2020), recent works attempt to apply pre-training techniques to programming languages to facilitate the development of code intelligence. Kanade et al. (2019) pre-train CuBERT on a large-scale Python corpus using masked language modeling (MLM) and next sentence prediction objectives. Feng et al. (2020) pre-train CodeBERT on code-text pairs in six programming languages via MLM and replaced token detection objectives to support text-code related tasks such as code search. GraphCodeBERT (Guo et al., 2020) leverages data flow as additional semantic information to enhance code representation. To support code completion, Svyatkovskiy et al. (2020) and Lu et al. (2021) respectively propose GPT-C and CodeGPT. Both of them are decoder-only models pre-trained by unidirectional language modeling. Some recent works (Ahmad et al., 2021; Wang et al., 2021; Guo et al., 2022) explore unified pre-trained models to support both understanding and generation tasks. PLBART (Ahmad et al., 2021) and CodeT5 (Wang et al., 2021) are based on the encoder-decoder framework. PLBART uses a denoising objective to pre-train the model, and CodeT5 considers the crucial token type information from identifiers. However, these pre-trained models usually result in poor function-level code representation (Guo et al., 2022) due to the anisotropic representation issue (Li et al., 2020). In this work, we mainly investigate how to learn function-level code semantic representations.
Contrastive Pre-training for Code Representation. To learn function-level code semantic representation, several attempts have been made to leverage contrastive pre-training on programming languages. ContraCode (Jain et al., 2021) and Corder (Bui et al., 2021) design transformation algorithms like variable renaming and dead code insertion to generate semantically equivalent programs as positive instances, while Ding et al. (2022) design structure-guided code transformation algorithms that inject real-world security bugs to build hard negative pairs for contrastive pre-training. Instead of using semantic-preserving program transformations, SynCoBERT (Wang et al., 2022b) and Code-MVP (Wang et al., 2022a) construct positive pairs through the compilation process of programs, like ASTs and CFGs. However, these works usually generate positive samples with highly similar structures to the original program. To distinguish these positive samples from candidates, the model might learn code representation from code surface forms according to hand-written patterns, instead of code semantics. In this paper, we propose to utilize the comment and abstract syntax sub-tree of the code to construct positive samples and present a method to obtain relevance scores among samples as soft-labels for contrastive pre-training.

Positive Sample Construction
In this section, we describe how to construct positive pairs for SCodeR. Different from previous works that design transformation algorithms to generate semantically equivalent but highly similar programs, we propose to leverage the comment and abstract syntax sub-tree of the code for positive sample construction to encourage the model to capture semantic information.

Code Comment
User-written code comments usually summarize the functionality of the code and provide crucial semantic information about the source code. Taking the code in Figure 2 as an example, the comment "sort the input array" closely describes the goal of the code and can help the model understand code semantics from the natural language. Therefore, we take a source code c with its corresponding comment t as a positive pair (t, c). Such positive pairs not only enable the model to understand the code semantics but also align the representations of different programming languages with a unified natural language description as a pivot.

Abstract Syntax Sub-Tree
Besides the comment, the code itself also contains rich information. To further explore the intra-code correlation and contextual knowledge for contrastive pre-training, we propose a method, called Abstract Syntax Sub-Tree Extraction (ASST), that leverages the abstract syntax sub-tree of the source code to construct positive code pairs. We give an example of a Python code with its AST in Figure 2. We first randomly select a sub-tree of the AST, like the "if statement", and then take the corresponding code of the sub-tree and the remaining code as a positive code pair. The procedure of extraction is illustrated in Algorithm 1. Specifically, we first pre-define a set N of node types whose sub-trees can be used to construct positive pairs. The set mainly consists of statement-level types like "for_statement" that usually contain a complete and functional code snippet. We then start from a randomly selected leaf node (lines 1-2) and find an eligible node in the pre-defined set N along the direction of the root node (lines 3-10). Finally, we take the corresponding code s (i.e., leaf children) of the eligible node and the remaining code context s̄ as a positive code pair (s, s̄) for contrastive pre-training. To avoid extracting code spans that are too short or meaningless, we set a minimum length l_min for the extracted code span s.
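To make the extraction concrete, the sketch below implements the leaf-to-root search of Algorithm 1 over a toy AST. The `Node` class, the selectable node types, and the hand-built tree in the usage example are illustrative stand-ins (a real implementation would obtain the AST from a parser such as tree-sitter); only the search procedure itself follows the algorithm.

```python
import random
from dataclasses import dataclass, field

# Illustrative sketch of Abstract Syntax Sub-Tree Extraction (ASST).
# The toy Node class below is a hypothetical stand-in for a parser's AST.

SELECTABLE = {"if_statement", "for_statement", "while_statement"}  # the set N
MIN_LEN = 10  # minimum extracted span length l_min

@dataclass
class Node:
    type: str
    start: int                       # span offsets into the source code
    end: int
    parent: "Node | None" = None
    children: list = field(default_factory=list)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

def asst_positive_pair(root, code, rng):
    # Lines 1-2: collect leaf children of selectable nodes, sample one.
    leaves = []
    def walk(node, under):
        under = under or node.type in SELECTABLE
        if not node.children and under:
            leaves.append(node)
        for c in node.children:
            walk(c, under)
    walk(root, False)
    a = rng.choice(leaves)
    # Lines 3-10: climb toward the root until an eligible node is found.
    while a is not None:
        if a.type in SELECTABLE and a.end - a.start >= MIN_LEN:
            span = code[a.start:a.end]                # selected code s
            context = code[:a.start] + code[a.end:]   # remaining context
            return span, context
        a = a.parent
    return None
```

Pulling `span` and `context` together as a positive pair then gives the model the "infer the missing statement from its context" signal described above.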
While transformation-based methods generate programs with similar structures, the structures of the positive code pairs generated by ASST are different since they belong to different parts of a program. Meanwhile, they are logically relevant because together they compose a program of full functionality. To estimate which code is complementary to a given code context in contrastive pre-training, the model needs to learn to reason based on the context, which encourages it to understand code semantics.
There are similar mechanisms for learning text representation, such as the Inverse Cloze Task (ICT) (Lee et al., 2019), which takes a random span of natural language tokens and its context as a positive pair. However, ICT cannot be directly applied to code because code has an explicit structure. If we randomly selected code spans at the token level, the selected code spans might be ungrammatical, such as "for i", which would mislead the model to focus on structural matching rather than semantic matching.

Soft-Labeled Contrastive Pre-training
Previous code contrastive pre-training methods usually take different programs in a code corpus as negative pairs and push them apart in the representation space. However, different programs in an unlabeled code corpus may share some similarities. Taking a program that sorts the input in ascending order as an example, even though another "descendingly sort" program is not semantically equal to it, they both sort their input in a certain order and thus are somewhat similar. Another problem is the "false negative" issue (Huynh et al., 2022; Chen et al., 2021b) due to the duplication in the code corpus (Lopes et al., 2017; Allamanis, 2019). To alleviate these problems, we propose a soft-labeled contrastive pre-training framework that uses relevance scores between different samples as soft-labels to learn function-level code representation.

Overview
The soft-labeled contrastive pre-training framework involves three components: (1) a dual-encoder G_θ that aims to learn function-level code representation; (2) two discriminators D_ϕ and D_ψ that calculate relevance scores between two inputs for text-code and code-code pairs, respectively. These components compute the similarity between two samples (x, y) as follows:

G_θ(x, y) = E_θ(x)^T E_θ(y),  D_ϕ(x, y) = w_ϕ(E_ϕ([x; y])),  D_ψ(x, y) = w_ψ(E_ψ([x; y]))  (1)

where E_θ, E_ϕ and E_ψ are multi-layer Transformer (Vaswani et al., 2017) encoders with mean-pooling, [x; y] indicates the concatenation of the two samples, and w_ϕ (w_ψ) is a linear layer that outputs the similarity score. If the input (x, y) is a text-code pair, we use D_ϕ to calculate the similarity; otherwise we use D_ψ. While the dual-encoder encodes samples separately, the discriminators take the concatenation of two samples as the input and fully explore finer-grained token-level interactions through the self-attention mechanism, which can predict more accurate relevance scores between two samples. Therefore, we propose to utilize relevance scores from the discriminators as soft-labels to help the encoder E_θ learn better code representation.
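The architectural difference in Eq. (1), separate encoding versus joint encoding, can be sketched as follows. The hash-based `embed_token` and the fixed linear head `W` are toy stand-ins for the Transformer encoders and w_ϕ; only the information flow mirrors the two scoring functions.

```python
import hashlib

# Toy sketch of the two scoring architectures: a dual-encoder that
# never lets x and y interact, and a cross-encoder that encodes them
# jointly. The hash-based "encoder" stands in for a Transformer.

DIM = 8

def embed_token(tok):
    h = hashlib.md5(tok.encode()).digest()
    return [b / 255.0 for b in h[:DIM]]

def encode(tokens):
    # mean pooling over token embeddings, like E_theta / E_phi
    vecs = [embed_token(t) for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def dual_encoder_score(x, y):
    # G_theta(x, y) = E_theta(x)^T E_theta(y): x and y are encoded separately
    ex, ey = encode(x), encode(y)
    return sum(a * b for a, b in zip(ex, ey))

W = [0.5] * DIM  # stand-in for the linear head w_phi

def cross_encoder_score(x, y):
    # D_phi(x, y) = w_phi(E_phi([x; y])): joint encoding lets the real
    # model capture token-level interactions via self-attention
    joint = encode(x + ["[SEP]"] + y)
    return sum(w * v for w, v in zip(W, joint))
```

The dual-encoder's separate encoding is what makes it cheap for retrieval (candidate vectors can be pre-computed), while the cross-encoder's joint encoding is what makes its relevance scores more accurate.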
We show the detailed illustration of our proposed soft-labeled contrastive pre-training in Algorithm 2. Specifically, we first initialize all encoders with a pre-trained model like UniXcoder (Guo et al., 2022) and follow Li et al. (2022) to train a warm-up dual-encoder using a simple strategy where negative samples come from the other positive pairs in the same batch X_b (lines 1-2 of Algorithm 2). The loss is calculated as follows:

L_θ^warm = − Σ_{(x, x⁺) ∈ X_b} log [ exp(G_θ(x, x⁺)) / Σ_{(x′, x′⁺) ∈ X_b} exp(G_θ(x, x′⁺)) ]

where (x, x⁺) ∈ X is a positive pair as described in Section 3. We then iteratively alternate two training procedures: (1) the dual-encoder is used to obtain hard-negative codes to train the discriminators (lines 5-8);
(2) the optimized discriminators predict relevance scores among samples as soft-labels to improve the dual-encoder (lines 9-12). Through this iterative training, the dual-encoder gradually produces harder negative samples to train better discriminators, whereas the discriminators provide better progressive feedback to improve the dual-encoder. The training procedures for the discriminators and the dual-encoder are described next.
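The warm-up objective with in-batch negatives (line 2 of Algorithm 2) can be sketched as follows, with `sim[i][j]` standing for G_θ(x_i, x_j⁺) over a batch of positive pairs, so the diagonal holds the positive scores:

```python
import math

# In-batch contrastive warm-up loss: for each positive pair
# (x_i, x_i+), the positives of the other pairs in the batch
# act as negatives. sim[i][j] = G_theta(x_i, x_j+).

def in_batch_contrastive_loss(sim):
    n = len(sim)
    loss = 0.0
    for i in range(n):
        denom = sum(math.exp(sim[i][j]) for j in range(n))
        loss += -math.log(math.exp(sim[i][i]) / denom)
    return loss / n
```

When the diagonal dominates (each query scores its own positive far above the others), the loss approaches zero; with uniform scores it equals log n.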

Discriminators Training
Given a text x from a positive text-code pair (x, x⁺), the discriminator D_ϕ is optimized by maximizing the log-likelihood of selecting the positive code x⁺ from the candidates X:

p_ϕ(x⁺ | x, X) = exp(D_ϕ(x, x⁺)) / Σ_{x′ ∈ X} exp(D_ϕ(x, x′)),   L_ϕ = − log p_ϕ(x⁺ | x, X)   (6)

where X is the set of negative codes X⁻ together with the positive code x⁺. If x is a code from a positive code-code pair, the calculations of p_ψ and L_ψ are analogous to those of p_ϕ and L_ϕ, respectively.
To better train the discriminators, we take hard-negative examples, which are not positive samples but are close to the original example x in the vector space, as the negative candidates X⁻. In practice, we first retrieve the top-K code samples that are closest to x using G_θ as the distance function and randomly sample examples from them to obtain the subset X⁻.
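A minimal sketch of the discriminator objective in Eq. (6): a softmax over the positive code and the mined hard negatives, trained with negative log-likelihood (shown here for a single query; D_ψ is analogous):

```python
import math

# Discriminator objective (Eq. 6): softmax over the positive code
# and the sampled hard negatives X-, trained with the negative
# log-likelihood of the positive.

def discriminator_nll(score_pos, scores_neg):
    # score_pos = D_phi(x, x+); scores_neg = [D_phi(x, x-) for x- in X-]
    denom = math.exp(score_pos) + sum(math.exp(s) for s in scores_neg)
    p_pos = math.exp(score_pos) / denom
    return -math.log(p_pos)
```

The loss shrinks as the discriminator scores the positive above the hard negatives, which is exactly the behavior the dual-encoder later distills.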

Dual-Encoder Training
After training the discriminators, we utilize the relevance scores predicted by the discriminators as soft-labels and follow Zhang et al. (2021) in using adversarial and distillation losses to optimize the dual-encoder.

Adversarial loss:
We apply the same approach to obtain the hard-negative candidates X⁻ as described in Section 4.2. When optimizing G_θ, w in Equation 8 is a constant that adjusts the weight of each negative example. When −log p_ϕ(ψ)(x⁺ | x, {x⁺, x⁻}) is large, i.e., the discriminators predict that x and x⁻ are semantically relevant, w will be high and force G_θ to draw the representations of x and x⁻ closer among X⁻. Since we optimize the dual-encoder on negative codes under different weights w, the representations of negative codes with high relevance scores will be drawn closer to x, and those with low relevance scores will be pushed away.
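Since Equation 8 itself is not shown in this excerpt, the sketch below uses one plausible AR2-style instantiation: each negative's weight is w = −log p_ϕ(x⁺ | x, {x⁺, x⁻}) (treated as a constant with respect to the dual-encoder), and the dual-encoder maximizes the expected discriminator loss under its own softmax over the negatives. The exact form in the paper may differ.

```python
import math

# Hedged sketch of the adversarial loss: negatives that confuse the
# discriminator (large -log p_phi(x+)) receive a large weight w, so
# the dual-encoder is pushed to rank them closer to x.

def neg_weight(d_pos, d_neg):
    # w = -log softmax over {x+, x-} evaluated at x+ (a constant here)
    p_pos = math.exp(d_pos) / (math.exp(d_pos) + math.exp(d_neg))
    return -math.log(p_pos)

def adversarial_loss(g_scores, d_pos, d_negs):
    # g_scores[i] = G_theta(x, x_i^-); softmax gives p_theta(x_i^- | x, X-)
    m = max(g_scores)
    exps = [math.exp(g - m) for g in g_scores]
    z = sum(exps)
    p_theta = [e / z for e in exps]
    # maximize the expected discriminator loss -> minimize its negation
    return -sum(p * neg_weight(d_pos, d) for p, d in zip(p_theta, d_negs))
```

Minimizing this loss raises p_θ on exactly those negatives the discriminator finds hard to reject, matching the weighting behavior described above.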

Distillation loss:
We also use a distillation loss (Hinton et al., 2015) to encourage the dual-encoder to fit the probability distribution of the discriminators over {x⁺} ∪ X⁻ using the KL-divergence loss H. Through L_θ^distill, we inject the discriminators' knowledge into the dual-encoder via the soft-labels p_ϕ(ψ).
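The distillation term can be sketched as a KL divergence between the discriminator's softmax over {x⁺} ∪ X⁻ (the soft-labels) and the dual-encoder's softmax over the same candidates:

```python
import math

# Distillation loss: KL divergence pushing the dual-encoder's
# distribution over {x+} U X- toward the discriminator's soft-labels.

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(teacher_scores, student_scores):
    p = softmax(teacher_scores)   # discriminator soft-labels p_phi(psi)
    q = softmax(student_scores)   # dual-encoder distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the two distributions agree, so a false negative that the discriminator scores highly also receives high probability mass from the dual-encoder.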
Training Objective of Dual-Encoder:

The overall loss function of the dual-encoder is the integration of the adversarial loss and the distillation loss:

L_θ = L_θ^adv + λ · L_θ^distill

where λ is a pre-defined hyper-parameter.
Through L_θ, we provide the discriminators' progressive feedback to the dual-encoder through soft-labels. After the adversarial iteration, we use E_θ to serve downstream tasks.

Model Comparison
We compare SCodeR with various state-of-the-art pre-trained models. RoBERTa (Liu et al., 2019) is pre-trained on a text corpus with masked language modeling (MLM). CodeBERT (Feng et al., 2020) is pre-trained on a large-scale code corpus with MLM and replaced token detection. GraphCodeBERT (Guo et al., 2020) is based on CodeBERT and integrates data flow information to enhance code representation. PLBART (Ahmad et al., 2021) is adapted from the BART (Lewis et al., 2019) architecture and pre-trained using a denoising objective on Java, Python and Stack Overflow corpora. CodeT5 (Wang et al., 2021) is based on the T5 (Raffel et al., 2020) architecture, considering identifier token information and applying multi-task learning. UniXcoder (Guo et al., 2022) is adapted from the UniLM (Dong et al., 2019) architecture and pre-trained with different tasks (understanding and generation) on unified cross-modal data (code, AST and text). We also compare SCodeR with code pre-trained models that utilize contrastive pre-training. SynCoBERT (Wang et al., 2022b) and Code-MVP (Wang et al., 2022a) construct positive pairs through multiple views of code, like ASTs and CFGs. Corder (Bui et al., 2021) and DISCO (Ding et al., 2022) construct positive code pairs from semantic-preserving transformations, and the latter additionally uses bug-injected codes as hard negatives. CodeRetriever (Li et al., 2022) builds code-code pairs automatically from corresponding documents and function names. For a fair comparison, we use the same model architecture, pre-training corpus, and downstream hyperparameters as previous works (Li et al., 2022; Guo et al., 2022). To accelerate the training process, we initialize the dual-encoder and discriminators with the released parameters of UniXcoder (Guo et al., 2022). More details about pre-training and fine-tuning can be found in Appendices A and B.

Natural Language Code Search
Given a natural language query as the input, code search aims to retrieve the most semantically relevant code from a collection of code candidates. We conduct experiments on CSN (Guo et al., 2020), AdvTest (Lu et al., 2021) and CosQA (Huang et al., 2021) to evaluate SCodeR. CSN contains six programming languages: Ruby, JavaScript, Python, Java, PHP and Go. The dataset is constructed from the CodeSearchNet dataset (Husain et al., 2019), with low-quality noisy queries filtered out. AdvTest normalizes the function names and variable names of Python code and thus is more challenging. The queries of CosQA come from the Microsoft Bing search engine, which makes it closer to the real-world code search scenario. Following previous works (Feng et al., 2020; Guo et al., 2020, 2022), we adopt Mean Reciprocal Rank (MRR) (Hull, 1999) as the evaluation metric.
The results are shown in Table 1. We can see that SCodeR outperforms previous code pre-trained models and achieves new state-of-the-art performance on all datasets. Specifically, SCodeR outperforms UniXcoder by 2.3 points on the CSN dataset, and improves over state-of-the-art models by about 2.5 points on the AdvTest and CosQA datasets, which demonstrates the effectiveness of SCodeR.

Table 2: Performance on code clone detection. The results of compared models are from their original papers.

Code Clone Detection
Code clone detection aims to identify the semantic similarity between two codes. We consider POJ-104 (Mou et al., 2016) and BigCloneBench (Svajlenko et al., 2014a) to evaluate SCodeR. The POJ-104 dataset (C/C++) consists of codes from an online judge (OJ) system. It aims to find the semantically similar codes given a code as the query and is evaluated by Mean Average Precision (MAP). The BigCloneBench dataset (Java) is to judge whether two codes are similar and is evaluated by Precision, Recall, and F1-score. We show the results in Table 2.
Compared with previous pre-trained models, SCodeR achieves the overall best performance on both datasets. On the POJ-104 dataset, SCodeR surpasses all other methods; specifically, it outperforms UniXcoder by 1.93 points. Although the pre-training corpus does not cover the C/C++ programming languages, the superior performance reflects that SCodeR learns better general code knowledge. On the BigCloneBench dataset, SCodeR also achieves comparable performance. These results show that SCodeR learns better function-level code representation for code clone detection.

Table 3: The comparison on zero-shot code-to-code search. Baselines' results are reported by Guo et al. (2022).

Zero-Shot Code-to-Code Search
We also evaluate SCodeR on zero-shot code-to-code search. Given a code snippet as the query, the task aims to find semantically similar codes from a collection of code candidates in the zero-shot setting. Since the annotation for code-to-code search is labor-intensive and costly (Svajlenko et al., 2014b; Li et al., 2022), the zero-shot performance indicates the model's utility in real-world scenarios, where many programming languages do not have an annotated dataset for code-to-code search. We follow Guo et al. (2022) to conduct the experiment on CodeNet (Puri et al., 2021) and evaluate models using the MAP score. The results are listed in Table 3. The first and second rows correspond to the query and target programming languages. We can see that SCodeR outperforms all other compared models and improves over the state-of-the-art model, i.e., UniXcoder, by 3.12 average absolute points. Meanwhile, SCodeR shows consistent improvement in the cross-PL setting, which can help users translate programs from one PL to another by retrieving semantically relevant codes.

Markdown Ordering in Python Notebooks
This task is to reconstruct the order of markdown cells in a given notebook according to the ordered code cells. We conduct experiments on the dataset provided by Kaggle (https://www.kaggle.com/competitions/AI4Code/overview) and use the official evaluation metric, Kendall's tau (τ). It is computed as 1 − 2N / C(n, 2), where N is the number of pairs in the predicted sequence with incorrect relative order and n is the sequence length.
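For a single notebook, the metric above can be computed with a direct inversion count (a quadratic sketch; the competition aggregates the pair counts over all notebooks before normalizing):

```python
# Kendall's tau for one notebook: tau = 1 - 2N / C(n, 2), where N is
# the number of pairs appearing in the wrong relative order.

def kendall_tau(predicted_ranks):
    # predicted_ranks: ground-truth ranks listed in the predicted order;
    # a perfectly ordered notebook yields [0, 1, 2, ...]
    n = len(predicted_ranks)
    inversions = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if predicted_ranks[i] > predicted_ranks[j]
    )
    return 1.0 - 2.0 * inversions / (n * (n - 1) / 2)
```

A perfect ordering scores 1, a fully reversed ordering scores −1, and a random ordering scores near 0.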
We take the normalized position of each markdown cell in a given notebook (0∼1) as its label and solve this task as a regression task. To test the performance of function-level code representation, we use pre-trained models to encode each cell into a function-level representation as features. We then use a randomly initialized Transformer that takes the extracted features of the cells in the Python notebook to predict the position of each cell. Note that the parameters of the pre-trained models are fixed during the fine-tuning procedure, and thus the performance on this task depends on the function-level features extracted from the pre-trained models.
We show the results in Table 4. SCodeR outperforms other pre-trained models and achieves 0.5 points higher than UniXcoder. This indicates that SCodeR learns better representations for both code and natural language comments, and can help better understand the fine-grained relationship between codes and comments in Python notebooks.

Analysis
Ablation Study. To evaluate the effect of our positive sample construction methods and the soft-labeled contrastive pre-training framework, we conduct an ablation study on the CSN dataset and take the pre-trained model with no enhancement (i.e., UniXcoder) as the baseline. First, we individually compare the proposed ASST with the transformation-based positive sample construction method (Jain et al., 2020; Bui et al., 2021). Notice that previous works do not apply their transformation-based methods to all six programming languages covered by our pre-training corpus. For a fair comparison and to keep the pre-training corpus consistent, we follow Lu et al. (2022) to implement the widely used transformations, including variable renaming and dead code insertion, on the six programming languages ourselves. Then, we add the remaining modules of SCodeR to evaluate their performance.

Figure 3: Case study on the discriminator. The soft-label is the relevance score between the comment and the codes from the discriminator, p_ϕ(•|x, X).
Case Study. We give a case study in Figure 3 to show the importance of soft-labels for contrastive pre-training. The figure includes one paired code and two other codes with soft-labels provided by the discriminators. We can see that the soft-label of the negative Code−1 is close to 0, since the code is unrelated to the comment and the discriminators can predict the correct relevance score between them for contrastive pre-training. Code−2 is a false negative that has the same functionality as Code+ and should be assigned a similar weight in contrastive pre-training. As we can see in the figure, the discriminator understands the code semantics and provides similar soft-labels (i.e., 0.5284 vs. 0.4713) for Code+ and Code−2, which alleviates the false-negative issue and helps learn better code representation through soft-labels.

Conclusion
In this paper, we present SCodeR to learn function-level code representation with soft-labeled contrastive pre-training. To alleviate the "false negative" issue in code corpora, we propose a soft-labeled contrastive pre-training framework that takes relevance scores among samples as soft-labels for contrastive pre-training in an iterative adversarial manner. Besides, we propose to utilize the code comment and abstract syntax sub-tree of the source code to build positive samples that facilitate the model to capture semantic information from the source code. Experimental results show that SCodeR achieves state-of-the-art performance on four code-related tasks over seven datasets. Further ablation studies show the effectiveness of our soft-labeled contrastive pre-training framework and positive sample construction methods.

Limitations
There are two limitations of this work: 1) In the adversarial iteration, we introduce discriminators to provide soft-labels for the training of the dual-encoder, which increases GPU memory occupation. To solve this, we could obtain the soft-labels offline, which may complicate the data processing pipeline. 2) We only use UniXcoder as the backbone model in the experiments due to limited computation resources. We leave pre-training based on other code pre-trained models like CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2020) and Codex (Chen et al., 2021a) as future work.

Figure 2: An ASST example of bubble sort.

Algorithm 1: Abstract Syntax Sub-Tree Extraction. Require: the AST T of a code c and the pre-defined selectable node types N. 1: Collect the leaf children C of nodes whose types are in N. 2: Randomly sample a node a from C. 3-10: While no eligible node has been found, set a to the parent node of a.

Algorithm 2: Soft-labeled contrastive pre-training. Require: a dual-encoder G_θ, two discriminators D_ϕ(ψ), and a set X of positive pairs with an unlabeled code corpus C. 1: Initialize the dual-encoder and discriminators. 2: Train the warm-up dual-encoder. 3: Get the top-K negative codes with the dual-encoder; then iteratively train the discriminators on these hard negatives and update the dual-encoder with the discriminators' soft-labels.

Table 1: The comparison on code search task. The results of compared models are from their original papers.