Contrastive Code Representation Learning

Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like code clone detection, these representations should ideally capture program functionality. However, we show that the popular reconstruction-based RoBERTa model is sensitive to source code edits, even when the edits preserve semantics. We propose ContraCode: a contrastive pre-training task that learns code functionality, not form. ContraCode pre-trains a neural network to identify functionally similar variants of a program among many non-equivalent distractors. We scalably generate these variants using an automated source-to-source compiler as a form of data augmentation. Contrastive pre-training outperforms RoBERTa on an adversarial code clone detection benchmark by 39% AUROC. Surprisingly, improved adversarial robustness translates to better accuracy over natural code; ContraCode improves summarization and TypeScript type inference accuracy by 2 to 13 percentage points over competitive baselines. All source is available at https://github.com/parasj/contracode.

ContraCode learns such representations with contrastive learning: the network is trained to find equivalent programs among many distractors, thereby distilling compiler invariants into the representation. Prior reconstruction-based approaches do not explicitly address the underlying program functionality. The resulting models attend to particularities of each program implementation, such as variable names.
We hypothesize that programs with the same functionality should have the same underlying representation for downstream code understanding tasks, a principle illustrated in Figure 1. While it is time intensive to identify equivalent programs in a large corpus, it is cheap to leverage static compiler transformations to automatically generate many equivalent versions of a particular source program.
In this work, we develop ContraCode, a self-supervised representation learning algorithm that uses source-to-source compiler transformation techniques (e.g., dead code elimination, obfuscation and constant folding) to generate syntactically diverse but functionally equivalent programs. ContraCode uses these equivalent programs to construct a challenging discriminative pretext task that requires the model to identify equivalent programs out of a large dataset of distractors. In doing so, it has to embed the functionality, not the form, of the code. In essence, the domain knowledge encoded in our code transformations imparts knowledge of program structure to the learned representations. The contributions of our work include: (1) the concept of program representation learning based on functional equivalence, independent of the underlying encoder model architecture, (2) an instantiation of this concept based on a novel application of compiler transformations to generate equivalent, textually divergent batches of programs, and (3) an evaluation of several model architectures demonstrating that Contrastive Code Representation Learning results in a 10.2% relative accuracy improvement over supervised baselines for the code summarization task and up to 4.5% improvement for the type inference task.

Related Work
Self-supervised learning Self-supervised learning is a general representation learning strategy where some dimensions or attributes of a datapoint are predicted from the remaining parts. These methods are unsupervised in the sense that they do not rely on labels. However, self-supervised tasks are often solved using losses and architectures designed for supervised learning. Self-supervised pre-training has yielded large improvements in both natural-language processing [18,29,60,61] and computer vision [46,15,14,26,28,53] by improving generalization [19]. Early work in self-supervised learning for computer vision found that weak features, such as the orientation [23], color [74] and context [56] of an image, are meaningful signals for representation learning [46].
Contrastive learning Recently, contrastive learning has emerged as a simple framework unifying many past approaches to self-supervised learning based on comparing pairs or collections of similar and dissimilar items [25]. Rather than training the network to predict labels or generate data, contrastive methods directly minimize a distance between the representations of similar data (positives) and maximize the distance between dissimilar data (negatives). Approaches that use few negatives include Siamese networks [13] and triplet losses [62]. Contrastive predictive coding [54,28] learns to encode pieces of sequential data such as audio so that the representations are predictive of representations of future pieces of the data using the InfoNCE loss [53], a variational lower bound on mutual information between views of the data [72] inspired by noise-contrastive estimation [24].
Figure 2: ContraCode extends the MoCo training framework [26] to learn an encoder of programs using a database of unlabeled programs and a suite of semantics-preserving transformations.
By framing the problem as classification, the model need not generate all the fine details of the data, but must extract its identity. In instance discrimination tasks, rather than comparing pieces of data such as timeslices, variants (data augmentations) of an entire image are compared to different images. Momentum Contrast [26] is a memory-efficient method for contrastive learning that caches representations of negative samples and only computes gradients for the positive query encoder. SimCLR [14] evaluated an InfoNCE-like loss over exceptionally large batch sizes, thereby providing a dense loss signal between positive and negative pairs. SimCLR demonstrated state-of-the-art results without the momentum queue in MoCo. However, it requires considerable computational resources to train over such large batch sizes. A recent extension of MoCo [15] added the projection head and additional augmentations from SimCLR. In computer vision, simple augmentations such as rotating, cropping, blurring or jittering an image can generate diverse variants of a base image. However, text-based domains like code lack such simple transformations. The InfoNCE pre-training objective has been applied to natural language [54,16], but is outperformed by baselines or requires auxiliary supervised models for data augmentation [20].
Code representation learning There has been substantial work on architectures and tasks for machine learning on code [3]. The tree or graph structure of code can be exploited to encode invariances into the representation learning method. Inst2vec [9] locally embeds individual statements in LLVM IR by processing a contextual flow graph with a context prediction objective [50]. Tree-Based CNN embeds the Abstract Syntax Tree (AST) nodes of high-level source code. Code2seq [6] embeds paths in the AST with a specialized attention-based encoder and an LSTM decoder for supervised sequence-to-sequence tasks. These architectures are orthogonal to the training objective. Transformer [67] architectures have been pre-trained on unlabeled code [35,21] using the masked language modeling objective [18,44], an instance of the cloze task [66] where hidden tokens are predicted by the model. Recurrent networks have also been pre-trained for code tasks [32], e.g. through language modeling [57,36]. These models do not use contrastive learning, though semi-automated program transformations have been used to assess the stability of the predictions of program classifiers under refactoring and optimization [68,69]. Our framework differs from previous work in that it learns contextual embeddings of program tokens in an unsupervised fashion via contrastive learning and transfers knowledge to downstream tasks. We adopt the extreme code summarization (method naming) task from [6] to verify the performance and the semantic meaning of the representation learned by the encoder, and the variable type inference task of DeepTyper [27]. Other authors have explored summarization [52,5,33] and type inference [11,55,70,4,58] with different languages and datasets.

Method: Contrastive Code Representation Learning
Learned code representations should be similar for functionally equivalent programs and dissimilar for non-equivalent programs (Figure 1). Program structure encodes the global information about programs necessary for code understanding tasks such as code summarization [5] and classification [68]. The principle of contrastive learning offers a simple objective for learning such representations if data can be organized into pairs of positives and negatives. The objective uses each pair to shape representation space, drawing positives together and pushing negatives apart. However, a major question remains: given an unlabeled corpus of programs, how do we identify equivalent programs? We address this question in Sec. 3.1, then introduce the learning framework in Sec. 3.2.
Figure 3: A JavaScript method from the unlabeled training set with two automatically generated semantically-equivalent programs. The original method is from the StackEdit Markdown editor.

Equivalence by construction
Modern programming languages afford great flexibility to software developers, allowing them to implement the same desired functionality through different implementation choices. Crowdsourced datasets mined from developers, such as GitHub repositories, have many near-duplicates in terms of textual similarity [2], and are bound to contain even more functional equivalences for common tasks. Satisfiability solvers can identify these equivalent programs [34,8], though they are slow and require formal documentation of semantics. Programs can instead be compared using test cases [47], but this is also costly and requires dependencies and execution environments to run the code.
Instead of searching for equivalences, we propose equivalent-by-construction data generation. Our insight is to apply source-to-source compiler transformations to unlabeled code to generate many variants with the same functionality. For example, dead-code elimination (DCE) is a common compiler optimization that removes operations that leave the output of a function unchanged. While the functionality of the program is the same after DCE, [68] finds that up to 12.7% of the predictions of current code understanding models change after the transformation. A particular source code sequence, e.g. "W*x + b", can be parsed unambiguously into a tree-structured representation "(+ (* W x) b)". This structure opens the possibility of performing automated code transformations. A rich body of prior programming language work explores transformations of Abstract Syntax Trees that parse and optimize a program prior to machine code generation. We leverage compiler infrastructure tools for JavaScript [48,17] and perform the following source-to-source transformations on JavaScript method bodies:
Variable renaming, identifier mangling: Arguments can be renamed with random word sequences and identifiers can be replaced with short tokens to learn naming invariance. Program behavior is preserved despite obfuscation.
Reformatting, beautification, compression: Personal coding conventions do not affect the semantics of code; autoformatting normalizes style conventions.
Dead-code insertion: Commonly used no-ops such as comments and log statements that do not affect program behavior are inserted.
Dead-code elimination: In this pass, all unused code with no side effects is removed. Various statements can be inlined or removed as stale or unneeded functionality.
Constant folding: During constant folding, all expressions that can be pre-computed at compilation time are replaced by their values. As an example, the expression (2 + 3) * 4 is replaced with 20.
Type upconversion: In JavaScript, some types are polymorphic and can be converted between each other. As an example, booleans can be represented as true or as 1.
Subword regularization [41]: With an appropriate vocabulary, text can be tokenized in several different ways, for example as a single word (_function) or as a sequence of several subtokens (_func tion).
Line subsampling: We randomly select a subset (p = 0.9) of lines from a method body. This transformation is not semantics-preserving. However, it serves as a valuable regularization.
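To make the idea of equivalence by construction concrete, the sketch below applies two of the listed passes, constant folding and variable renaming, to a small program. It is only an analogy: it operates on Python source through the standard-library `ast` module, whereas the transformations described here are applied to JavaScript. The class names `ConstantFolder` and `Renamer`, the helper functions, and the example program are all hypothetical.

```python
import ast
import operator

# Only a few arithmetic operators are folded in this illustration.
_FOLDABLE = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class ConstantFolder(ast.NodeTransformer):
    """Replace binary operations on two numeric literals with their value."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold sub-expressions first (bottom-up)
        fold = _FOLDABLE.get(type(node.op))
        left, right = node.left, node.right
        if (fold is not None
                and isinstance(left, ast.Constant) and isinstance(right, ast.Constant)
                and isinstance(left.value, (int, float))
                and isinstance(right.value, (int, float))):
            folded = ast.Constant(value=fold(left.value, right.value))
            return ast.copy_location(folded, node)
        return node


class Renamer(ast.NodeTransformer):
    """Rename variables and arguments according to a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def transform(source, passes):
    """Parse the source, run each pass in order, and print it back as text."""
    tree = ast.parse(source)
    for compiler_pass in passes:
        tree = compiler_pass.visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


def run(source, function_name, *args):
    """Execute the source and call one of the functions it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace[function_name](*args)


SOURCE = "def area(width):\n    return (2 + 3) * 4 * width\n"
variant = transform(SOURCE, [ConstantFolder(), Renamer({"width": "a"})])
print(variant)  # the folded expression now reads 20 * a
print(run(SOURCE, "area", 3) == run(variant, "area", 3))  # True
```

The two texts differ in both identifiers and literals, yet they compute the same function, which is the property a positive pair needs.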

Contrastive pre-training
Augmentations discussed in Section 3.1 can be used to adapt several contrastive learning frameworks to code representation learning. We adapt the Momentum Contrast [26] method that was designed for vision model pre-training. Our training procedure is depicted in Figure 2. In each iteration of training, a batch of programs is sampled from a large database of programs. Each program in the batch is transformed in two different ways using a random subset of transformations to derive functionally equivalent query programs and key programs. The query programs are embedded with an encoder trained via SGD, while key programs are embedded with an architecturally identical momentum encoder trained slowly via an exponential moving average (EMA) of the query encoder parameters.
While equivalent query and key programs define positive pairs, we exploit past programs to generate a very large set of negatives. Past embeddings of the key programs are stored in a queue. As there are few identical programs in a varied dataset, these programs will largely be functionally different from the positives and define the negatives. This approach allows us to use a very large set of negatives, over 100K, with minimal additional computational cost.
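The two mechanisms described above, the exponential moving average that updates the key encoder and the queue of past key embeddings, can be sketched in a few lines. This is a minimal illustration in plain Python, assuming parameters are flat lists of floats; `momentum_update` and `NegativeQueue` are hypothetical names, and the momentum value 0.999 is taken from the queue experiment reported later in the paper.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """EMA step: each key parameter moves a fraction (1 - m) toward the query parameter."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class NegativeQueue:
    """Fixed-size FIFO of past key embeddings that serve as negatives."""

    def __init__(self, max_size):
        self.keys = deque(maxlen=max_size)  # the oldest entries are evicted automatically

    def enqueue(self, key_batch):
        for key in key_batch:
            self.keys.append(key)

    def negatives(self):
        return list(self.keys)


# Toy usage: a queue of 4 keys refreshed with 2 keys per training step.
queue = NegativeQueue(max_size=4)
for step in range(3):
    queue.enqueue([[float(step)], [float(step)]])
print(queue.negatives())  # keys from step 0 have been evicted

key_encoder = momentum_update([0.0, 0.0], [1.0, 1.0])
print(key_encoder)  # each parameter moved 0.1% of the way to the query encoder
```

Because the queue only stores embeddings and no gradients flow through it, its size can grow well beyond the batch size at little extra cost.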
Our method is independent of the choice of underlying encoder architecture. We evaluate contrastive pre-training over a Transformer [67] and a BiLSTM [30] architecture, with specific details in Sec. 4.1.
Pre-training objective To pre-train, we minimize the InfoNCE loss, a contrastive objective that measures similarity between programs by the inner product of their embeddings. Equation (1) shows the InfoNCE loss for instance discrimination, a function whose value is low when $q$ is similar to its positive key $k^+$ and dissimilar to all other keys (considered negative keys $k^-$ for $q$):
$$\mathcal{L}_{q,k^+,k^-} = -\log \frac{\exp(q \cdot k^+ / t)}{\exp(q \cdot k^+ / t) + \sum_{k^-} \exp(q \cdot k^- / t)} \qquad (1)$$
Here $t$ is a temperature hyperparameter. In general, the query representation is $q = f_q(x_q)$, where $f_q$ is an encoder network and $x_q$ is a query sample (likewise, $k = f_k(x_k)$ using the EMA key encoder). The views $x_q$, $x_k$ depend on the specific domain and pretext task. In our case, the views are tokenized representations of the program with appropriate data augmentation via code transformation. This loss can be seen as pre-training $f_q$ to classify the positive $x_{k^+}$ among all $x_k$, using the normalizing denominator to define possible labels.
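As a rough numerical illustration of this objective, the sketch below evaluates an InfoNCE loss for one query in plain Python. It assumes the MoCo-style form of the loss with a temperature-scaled inner product; the temperature value 0.07 and the names `dot` and `info_nce` are assumptions made for the example, not values stated in this section.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(q, k_pos, k_negs, t=0.07):
    """Cross-entropy of picking the positive key among the positive and all negatives.

    q, k_pos: embedding vectors (lists of floats); k_negs: list of negative key vectors.
    t: temperature (assumed value, not specified in the text).
    """
    logits = [dot(q, k_pos) / t] + [dot(q, k) / t for k in k_negs]
    peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


query = [1.0, 0.0]
aligned = info_nce(query, [1.0, 0.0], [[0.0, 1.0], [0.0, -1.0]])
misaligned = info_nce(query, [0.0, 1.0], [[1.0, 0.0], [0.0, -1.0]])
print(aligned < misaligned)  # True: the loss is low when q matches its positive key
```

When every key is identical to the query the model cannot tell the positive apart, and the loss reduces to the log of the number of candidates, which is a convenient sanity check.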
Transfer After pre-training converges, the encoder $f_q$ is transferred to downstream tasks. As the output space of the task can differ from the encoder, we add a task-specific MLP, LSTM or Transformer decoder after $f_q$, then train the resulting network end-to-end on task data.

Experiments
In this section, we explore the following experimental questions: (1) Can neural network text encoders learn program representations that are predictive of equivalent programs? and (2) Does contrastive pre-training improve downstream task performance? We also perform ablations to understand partial transfer of the model. To answer these questions, we compare models from the extreme code summarization and type inference literature to versions pre-trained with ContraCode.

Dataset and tasks
For pre-training, we use the CodeSearchNet dataset [31], a large corpus of methods extracted from popular GitHub repositories across 6 programming languages. We train models on the JavaScript programming language. CodeSearchNet includes 1,843,099 JavaScript training programs, 81,487 of which have an extracted documentation string and method name. The asymmetry in labeled and unlabeled dataset sizes stems from JavaScript coding practices: anonymous functions with no name and often no documentation are widespread. These labeled programs are used for a downstream extreme code summarization task, method name prediction [5,7,6]. In addition, we use the GitHub dataset from DeepTyper [27] for a type inference task. Some repositories used by DeepTyper have been deleted or made private since publication, so we regenerate a subset of the dataset using the same procedure. Dataset statistics are included in the supplement. We precompute up to 21 equivalent forms of each training method by applying 20 random subsets of the transformations from Section 3.1, keeping the original method, and removing exact duplicates. The statistics are shown in Figure 4. 11% of the methods have no alternatives after our compiler transforms, such as one-line functions that are already obfuscated. However, we apply subword regularization [41] during pre-training to derive random, different tokenizations for each batch, so pairs will still differ. After pre-training, we fine-tune the network's encoder for downstream tasks.
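The precomputation step can be pictured with the following sketch: draw 20 random subsets of the available transformations, apply each subset to the method, keep the original, and drop exact duplicates. The three string-level transforms are toy placeholders invented for the example (the real passes are compiler transformations and these toys are not semantics-safe in general), and the 0.5 inclusion probability per transform is an assumption.

```python
import random


def strip_trailing_whitespace(src):
    return "\n".join(line.rstrip() for line in src.split("\n"))


def drop_comment_lines(src):
    return "\n".join(line for line in src.split("\n")
                     if not line.strip().startswith("//"))


def append_noop_comment(src):
    return src + "\n// no-op"


def precompute_variants(source, transforms, n_subsets=20, seed=0):
    """Return the original plus every distinct result of applying a random subset."""
    rng = random.Random(seed)
    variants = [source]  # the original method is always kept
    seen = {source}
    for _ in range(n_subsets):
        # Assumed sampling scheme: include each transform independently.
        subset = [t for t in transforms if rng.random() < 0.5]
        candidate = source
        for transform in subset:
            candidate = transform(candidate)
        if candidate not in seen:  # remove exact duplicates
            seen.add(candidate)
            variants.append(candidate)
    return variants


METHOD = "function add(a, b) {  \n  // sum two numbers\n  return a + b;\n}"
TRANSFORMS = [strip_trailing_whitespace, drop_comment_lines, append_noop_comment]
variants = precompute_variants(METHOD, TRANSFORMS)
print(len(variants))  # at most 21: the original plus up to 20 distinct variants
```

A method on which no transform has any effect yields only the original text, which corresponds to the methods with no alternatives mentioned above.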
Extreme code summarization by method name prediction The CodeSearchNet dataset used for pre-training includes method name labels where available. We extract 81,487 methods that have a documentation string and method name. The asymmetry in labeled and unlabeled dataset sizes stems from JavaScript coding practices; anonymous functions with neither a name nor a docstring are widespread. These labeled programs are used for a downstream extreme code summarization task, where the method name is masked in the input function and predicted by the neural network [5,7,6]. Method names are generally informative and summarize the method when tokenized, such as reverseString.
Our downstream method name prediction task is a sequence-to-sequence generation problem, so we implement a Transformer model [67] with 6 encoder layers and 4 decoder layers. The encoder and decoder share the same token embedding, which is weight-tied with the output projection in the decoder. We use subword tokenization based on fitting a unigram language model, with a top-down EM procedure to iteratively reduce the size of the vocabulary to 8,000 tokens following [41]. Unlike bottom-up byte-pair encoding [64], the unigram LM allows multiple decodings to be sampled. While the dataset has a long-tail of rare symbols, the vocabulary covers 99.95% of characters in the text.
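The reason a unigram language model permits sampling is that a word usually has several valid segmentations, each with its own probability. The toy sketch below enumerates them and samples one in proportion to its smoothed probability. The vocabulary, its probabilities, the smoothing exponent `alpha`, and the function names are all made up for illustration; this is not the sentencepiece implementation, and the word-boundary marker is omitted.

```python
import math
import random

# Hypothetical unigram vocabulary: piece -> probability.
VOCAB = {"function": 0.05, "func": 0.03, "tion": 0.04, "fun": 0.02, "c": 0.01}


def segmentations(word, vocab):
    """Enumerate every way to split `word` into pieces found in `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


def sample_segmentation(word, vocab, alpha=0.1, rng=random):
    """Sample a segmentation with weight P(segmentation) ** alpha."""
    candidates = segmentations(word, vocab)
    # Under a unigram model, a segmentation's probability is the product of its pieces.
    weights = [math.prod(vocab[p] for p in seg) ** alpha for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


print(segmentations("function", VOCAB))
# [['fun', 'c', 'tion'], ['func', 'tion'], ['function']]
print(sample_segmentation("function", VOCAB, rng=random.Random(0)))
```

A small `alpha` flattens the distribution so that less likely segmentations are sampled more often, which is what makes the tokenization act as a regularizer.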
Type inference Type inference or hinting tools can generate type annotations for untyped JavaScript programs, which can help programmers find bugs and serve as documentation. We regenerate the DeepTyper dataset from the subset of repositories still available on GitHub, following the original procedure. The training set consists of 15,570 TypeScript files from 187 projects with 6,902,642 total tokens, and the validation and test sets are from held-out repositories. For training, additional types are inferred by static analysis to augment user-defined types; all types are removed from the input to the model. We generate contextual embeddings of each token using a 2-layer Bidirectional LSTM, as used by DeepTyper, and a 6-layer Transformer, modified from RoBERTa to be parameter equivalent to the LSTM. A 2-layer MLP head is used to predict types from the predicted embedding of each token.

Can ContraCode program representations match equivalent programs?
In Figure 7, we compare two strategies of refreshing the MoCo queue of key embeddings (i.e., the set of negative, non-equivalent programs). In the first strategy, we add 8 items out of the batch to the queue (1×), while in the second we add 96 items (12×). In addition, the second strategy uses a larger queue (125k versus 65k keys) and a slightly larger batch size (96 versus 64). We observe that for the baseline queue fill rate, the accuracy decreases for the first 8125 iterations as the queue fills. This decrease in accuracy is expected as the task becomes more difficult due to the increasing number of negatives during queue warmup. However, it is surprising that accuracy grows so slowly once the queue is filled. We suspect this is because the key encoder changes significantly over thousands of iterations: with a momentum term $m = 0.999$, the original key encoder parameters are decayed by a factor of $2.9 \times 10^{-4}$ by the moving average over those 8125 iterations. If the queue is rapidly refreshed, queue embeddings are predicted by recent key encoders, not old parameters. This also indicates that a large diversity of negative, non-equivalent programs is helpful for rapid convergence of the ContraCode pre-training task.
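The two figures quoted in this paragraph follow from short arithmetic, reproduced below as a check. The sketch assumes that "65k" and "125k" mean exactly 65,000 and 125,000 keys; the variable names are illustrative.

```python
import math

# Baseline strategy: a 65,000-key queue refreshed with 8 keys per iteration.
queue_size = 65_000
keys_per_iteration = 8
fill_iterations = queue_size // keys_per_iteration
print(fill_iterations)  # 8125

# After that many EMA steps with momentum m, the initial key-encoder weights
# contribute only m ** fill_iterations of their original value.
m = 0.999
residual = m ** fill_iterations
print(f"{residual:.2e}")  # about 2.9e-04, the decay factor quoted in the text

# Faster strategy: a 125,000-key queue refreshed with 96 keys per iteration.
fast_fill_iterations = math.ceil(125_000 / 96)
print(fast_fill_iterations)  # 1303
```

Under these assumptions the faster strategy replaces the whole queue roughly six times more often, so the stored negatives come from a much more recent key encoder.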
We qualitatively inspect the quality of learned representations by visualizing ContraCode representations using t-SNE [45]. We annotate each method with a tag derived from the method name. While there is some overlap, each method class is clustered with other similarly tagged methods. We found that the representations learned by BERT showed more overlap between different algorithm tags; contrastive features may therefore learn better global representations of programs.

Does contrastive pre-training improve downstream task performance?
After contrastive pre-training, we fine-tune the model on the downstream task of code summarization (method name prediction). In Table 1, we tested four settings: (1) supervised training with the 81k labeled programs using baseline AST-based architectures (code2vec, code2seq), (2) pre-training on all 1.84M programs using the masked language model objective followed by fine-tuning on the labeled programs (RoBERTa [44]), (3) supervised training with a Transformer architecture using ContraCode augmentations and (4) contrastive pre-training with all 1.84M programs followed by fine-tuning (ContraCode). We find that all models overfit during supervised training, so we use early stopping according to the validation loss.
Contrastive pre-training with fine-tuning outperforms the prior code2seq model, a competitive supervised baseline, by 8.2% in test precision and 7.9% in test F1 score. ContraCode outperforms a model fine-tuned after RoBERTa pre-training by 4.8% F1. Representations learned via masked language modeling appear to learn poor structural representations of code. Surprisingly, ContraCode augmentations improve supervised learning performance; a simple Transformer model with our augmentations (described in Sec. 3.1) obtains a higher F1 score than RoBERTa pre-training.
Table 2 shows the improvements that ContraCode offers on the type inference task in terms of accuracy averaged over all typed variables and averaged over all typed variables with a type other than the catch-all any type. We use early stopping based on validation set top-1 accuracy across all types. All learning-based approaches outperform static-analysis baselines. In addition, learned models rank multiple type annotations that can be displayed to users, allowing us to compute top-5 accuracy. ContraCode significantly improves accuracy across three representative baseline models: Transformer, pre-trained RoBERTa with a masked language modeling objective, and the BiLSTM used by DeepTyper. In particular, we outperform the supervised Transformer by up to 1.77% in top-5 accuracy and the supervised DeepTyper model by up to 2.75% in top-1 accuracy (4.5% relative increase), simply by pre-training for global representations with ContraCode. The model pretrained with masked language modeling performs poorly due to the superficial local reconstruction objective. We enrich the MLM objective with ContraCode via an auxiliary loss. This hybrid local-global representation achieves between 2.95% and 6.31% increases in top-1 accuracy over RoBERTa.
Figure 9 shows a qualitative example of predictions for the code summarization task. The JavaScript method is not seen during training. A Transformer pretrained with ContraCode predicts the correct method name as the most likely decoding through beam search. The next four predictions are reasonable, capturing that the method processes an image. The 2nd and 3rd most likely decodings, getImageItem and createImage, use get and create as synonyms for load, though the final two unlikely decodings include terms not mentioned in the method body.

Qualitative results
We can also visualize outputs of the type inference model. Figure 8 shows two TypeScript programs from the held-out test set. User-provided type annotations are removed from the programs, and the model is provided with a tokenized form without access to dependencies. We visualize predictions from a variant of DeepTyper (a bidirectional LSTM) pretrained with ContraCode, the best-performing model in Table 2. In the first program, our model consistently predicts the correct return and parameter type. While a tool based on static analysis could infer the void return types, the type of the message argument is ambiguous without access to the imported write method signature. Still, the model correctly predicts with high confidence that the variable message is a string.
In the second program, ContraCode correctly predicts 4 of 8 types including the ViewContainerRef and ChangeDetectorRef types, each imported from the Angular library. As this sample is held-out from the training set, these predictions show generalization from other repositories using Angular.

Ablations
Should we pre-train global or local representations? We compare pre-training DeepTyper with two variants of ContraCode. We either use the mean of token hidden states across the program (averaging local features), or the terminal hidden states as input to the MLP used to extract the contrastive representation $q = f_q(x)$ (global features). Using the global features for pre-training yields significantly improved performance (bottom two rows, Table 2).
Do pre-trained encoders help more with shallow decoders? In Table 3, we ablate the size of the decoder [19] to understand whether large untrained decoders limit the improvements from contrastive encoder pre-training. We tested 1-layer and 4-layer Transformer decoders. The 4-layer decoder achieves higher performance on all three metrics. With 45k pre-training steps, the 4-layer decoder achieves 0.50% higher precision, 0.64% higher recall and 0.77% higher F1 score than the 1-layer model. The 1-layer decoder models benefit significantly from longer pre-training, with a 6.3% increase in F1 from 10k to 45k iterations. We hope to see similarly large gains in downstream tasks with small decoders like code classification, the setting considered by self-supervised vision models.
Which part of the model should be transferred? SimCLR [14] proposed using a small MLP head to reduce the dimensionality of the representation used in the InfoNCE loss during pre-training, and did not transfer the MLP to the downstream image-classification task. In contrast, we find it beneficial to transfer part of the contrastive MLP head to type inference, showing a 2% improvement in top-5 accuracy over transferring the encoder only (Table 4). We believe the improvement stems from fine-tuning both the encoder and MLP, while SimCLR did not fine-tune. We only transferred the MLP when pre-training with the mean of token embeddings, not the terminal hidden states, as the dimensionality of the MLP head differs.

Conclusions
A key challenge when applying machine learning to machine-aided programming tools is how to leverage large-scale unannotated repositories of code like GitHub. We propose ContraCode, a pretraining task that learns global representations of the functionality of code based on the hypothesis that good representations of functionally equivalent programs should be similar. We leverage contrastive learning to induce this invariance by automatically applying equivalence-preserving transformations to the source code. We find that ContraCode pre-training significantly improves accuracy on two downstream tasks. Our approach is complementary to model architecture and consistently improves performance when combined with baseline approaches.

Broader Impact
Complex software systems are deployed in safety-critical applications. Software bugs impact end-user safety; tragic examples include unintended acceleration [40] and radiation therapy overdoses [43]. Machine-aided programming tools such as type checkers have improved end-user safety by preventing bugs before code deployment.
We believe these tools also improve the equity of programming by making it more accessible to novice programmers. Developer tools have significantly improved developer productivity by providing insights on large code bases during development. Machine learning methods like ContraCode have the potential to further these benefits through summarization and semantic code understanding.
Still, machine-aided programming tools have the potential to introduce software bugs via false positives or miss bugs via false negatives that a user may not notice. Such errors should be characterized prior to deploying a tool pre-trained by ContraCode, and may be mitigated by surfacing several possible predictions to the user with associated confidence scores. It is important that learned machine-aided programming tools do not provide a false sense of security.
decay. For the Transformer, the learning rate is linearly increased for the first 5,000 steps from 0 to a maximum of $10^{-4}$. For the bidirectional LSTM, the learning rate is increased for 10,000 steps to a maximum of $10^{-3}$. The Transformer has 6 encoder layers (23M parameters) in all experiments, and 4 decoder layers for method name prediction in Table 1. We leverage the default positional embedding function (sin, cos) as used in the original Transformer architecture. The BiLSTM originally proposed in DeepTyper [27] had 11M parameters with a 200 dimensional hidden state. We increase the hidden state size to 512 to increase model capacity, so our BiLSTM for type prediction has 17.5M parameters.
ContraCode pretraining To pretrain a Transformer using the ContraCode objective, we first embed each token in the program using the Transformer. However, the InfoNCE objective is defined in terms of a single embedding for the full program. Our model averages the 512-dimensional token embeddings across the sequence, then applies a two-layer MLP with 512 hidden units and a ReLU activation to extract a 128-dimensional program embedding for the loss. The ContraCode Transformer is pretrained with a batch size of 96. The DeepTyper bidirectional LSTM architecture offers two choices for extracting a global program representation. We aggregate a 1024-dimensional global representation of the program by concatenating its four terminal hidden states (from two sequence processing directions and two stacked LSTM layers), then apply the same MLP architecture as before to extract a 128-dimensional program representation. Alternatively, we can average the hidden state concatenated from each direction across the tokens in the sequence before applying the MLP head. We refer to the hidden-state configuration as (hidden) and the sequence averaging configuration as (mean) in Table 2. We pretrain the BiLSTM with batch size 512 and weight decay.
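The Transformer projection described above (mean over 512-dimensional token embeddings, then a two-layer MLP with 512 hidden units and a ReLU, giving a 128-dimensional program embedding) can be sketched in dependency-free Python. The weights here are random placeholders and all function names are hypothetical; a real implementation would use a tensor library.

```python
import random


def linear(x, weight, bias):
    """Compute weight @ x + bias, with the weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mean_pool(token_embeddings):
    """Average the token embeddings across the sequence dimension."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]


def program_embedding(token_embeddings, w1, b1, w2, b2):
    pooled = mean_pool(token_embeddings)                      # one vector per program
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]    # first layer + ReLU
    return linear(hidden, w2, b2)                             # projection used in the loss


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_MODEL, D_HIDDEN, D_OUT = 512, 512, 128   # sizes taken from the text
w1, b1 = random_matrix(D_HIDDEN, D_MODEL, rng), [0.0] * D_HIDDEN
w2, b2 = random_matrix(D_OUT, D_HIDDEN, rng), [0.0] * D_OUT
tokens = random_matrix(10, D_MODEL, rng)   # a toy program with 10 tokens
embedding = program_embedding(tokens, w1, b1, w2, b2)
print(len(embedding))  # 128
```

Averaging makes the output independent of sequence length, which is what allows a single embedding per program to be compared in the loss.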
Type prediction Following DeepTyper [27], our regenerated dataset for type prediction has 187 training projects with 15,570 TypeScript files, totaling 6,902,642 tokens. We tune hyperparameters on a validation set of 23 distinct projects with 1,803 files and 490,335 tokens, and evaluate on a held-out test set of 24 projects with 2,206 files and 958,821 tokens. The training set is smaller than originally used in DeepTyper as several projects were made private or deleted from GitHub before May 2020, but we used the same commit hashes for available projects so our splits are a subset of the original. To select the number of training epochs before evaluation, we perform early stopping. We train each model for 100 epochs and select the checkpoint with the maximum accuracy@1 metric (all types, including any) on the validation set.
Extreme code summarization As described in Section 4.1, we train method prediction models using the labeled subset of CodeSearchNet. Method names and docstrings are not provided as input to the model: the docstring is deleted, and the method name is replaced with the token 'x'. Thus, the task is to predict the method name using the method body and comments alone. To decode method names from all models except the code2vec and code2seq baselines which implement their own decoding procedures, we use a beam search with a beam of size 5 and a maximum target sequence length of 20 subword tokens. We detail the distribution of programs in 5.
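The decoding procedure mentioned above, beam search with a beam of 5 and at most 20 subword tokens, can be sketched as follows. The scoring model is a tiny hand-written table standing in for a trained decoder, and the names `beam_search`, `toy_model`, and the end-of-sequence token `</s>` are hypothetical.

```python
import math

EOS = "</s>"

# Toy next-token distributions, keyed by the prefix decoded so far.
TOY_TABLE = {
    (): {"load": math.log(0.6), "get": math.log(0.4)},
    ("load",): {"Image": math.log(0.9), EOS: math.log(0.1)},
    ("get",): {"Image": math.log(0.5), "Item": math.log(0.5)},
}


def toy_model(prefix):
    """Return log-probabilities of the next token; unknown prefixes end the name."""
    return TOY_TABLE.get(prefix, {EOS: 0.0})


def beam_search(next_log_probs, beam_size=5, max_len=20):
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for sequence, score in candidates[:beam_size]:
            if sequence[-1] == EOS:
                finished.append((sequence[:-1], score))  # drop the end marker
            else:
                beams.append((sequence, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]


results = beam_search(toy_model)
print("".join(results[0][0]))  # loadImage
```

This simple variant ranks hypotheses by raw log-probability; length normalization and other refinements are deliberately left out.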

A.2 Baselines
Baselines for code summarization and type prediction are trained on an inconsistent set of programming languages and datasets. In order to normalize the effect of datasets during experiments, we selected several diverse state-of-the-art baselines and reimplemented them on the JavaScript dataset.
AST-based models The authors of code2vec [7] and code2seq [6], AST-based code understanding models, made both data and code available, but train their model on the Java programming language. In order to extend the results in their paper to JavaScript for comparison with our approach, we generated an AST path dataset for the CodeSearchNet dataset. The sensitivity of path-mining embeddings to different datasets is well-documented in prior work; F1 scores for code2vec [7] vary between 19 [6] and 43 [7] depending on the dataset used. Therefore, we reimplement the same dataset generation procedure for fair comparison. We first parse the source functions using the Babel compiler infrastructure. Up to 300 token-to-token (leaf-to-leaf) paths are extracted from each function's AST as a precomputed dataset. Then, we generate a token and AST node vocabulary using the same author-provided code, and train the models for 20 epochs, using early stopping for code2seq. We observed that code2vec overfits after 20 epochs, and longer training was not beneficial.

A.3 Code transformations
We use the Babel compiler infrastructure and the terser JavaScript library [48] to generate equivalent programs, and the sentencepiece Python library for tokenization. We perform variable renaming, dead code insertion (variable declaration insertion) and line subsampling using custom Babel transforms, subword regularization with sentencepiece and other transformations with terser. Terser has two high-level transformation modes, mangling and compression, each with finer-grained controls such as formatting, comment and log removal, and dead code elimination. We show an example merge sort and its corresponding equivalent version after mangling and compression in Fig. 10.