ContraCLM: Contrastive Learning For Causal Language Model

Despite exciting progress in causal language models, the expressiveness of their representations is largely limited due to poor discrimination ability. To remedy this issue, we present CONTRACLM, a novel contrastive learning framework at both the token-level and the sequence-level. We assess CONTRACLM on a variety of downstream tasks. We show that CONTRACLM enhances the discrimination of representations and bridges the gap with encoder-only models, which makes causal language models better suited for tasks beyond language generation. Specifically, we attain 44% relative improvement on the Semantic Textual Similarity tasks and 34% on Code-to-Code Search tasks. Furthermore, by improving the expressiveness of representations, CONTRACLM also boosts the source code generation capability with 9% relative improvement on execution accuracy on the HumanEval benchmark.


Introduction
Causal Language Models (CLM) have seen remarkable success in language generation, both in natural language (Radford et al., 2018, 2019; Brown et al., 2020) and programming language (Chen et al., 2021; Nijkamp et al., 2022). However, one limitation at their core is the poor discrimination ability of their representations, which often causes a large performance gap with encoder-only or encoder-decoder models on discriminative tasks (see Appendix D.1), and hence limits the wide usage of CLMs beyond language generation.
Prior studies posit that the anisotropy issue, i.e., representations being squeezed into a tiny cone in the vector space (Ethayarajh, 2019), can be the main cause of the poor discrimination ability of language models across different architectures and objectives. Many efforts have focused on resolving the anisotropy issue for encoder-only or encoder-decoder models, either through post-processing (Su et al., 2021; Li et al., 2020) or by integrating different regularization terms into the training objective (Gao et al., 2019; Wang et al., 2020). A recent work (Su and Collier, 2022) shows that decoder-only CLMs do not suffer from the anisotropy problem as long as the model is beyond a certain size. However, we find that such conclusions can vary across domains. As shown in Figure 1a, CLMs pretrained on text, i.e., GPT-2 (Radford et al., 2019), do yield representations with good isotropy and discrimination as long as the model is not smaller than 774M parameters (GPT2-Large), whilst CodeGen (Nijkamp et al., 2022), pretrained on programming language data, consistently suffers from anisotropy and poor discrimination across different model sizes. Therefore, an effective training strategy is still essential for CLMs to improve representation quality with better isotropy and discrimination (Figure 1b). We conjecture that this is essential not only for models suffering from inferior representations, e.g., CodeGen and GPT-2 (124M), but also for those with a good starting point, e.g., GPT2-Large (774M).
We argue that an ideal CLM should yield isotropic representations to better leverage the representation space, as well as discriminative representations such that tokens or sequences from the same context are mapped to comparatively closer locations in the vector space than those from randomly sampled contexts. To this end, we develop CONTRACLM, a novel contrastive learning framework at both the token level and the sequence level.
CONTRACLM is able to promote more uniformly distributed, and hence isotropic, representations by separating the instances at different semantic levels, e.g., tokens or sequences, apart from each other. CONTRACLM improves the discrimination of representations due to the implicit grouping effect on semantically similar instances, yielded by pulling together the variations that preserve semantics, or positive pairs, of the same instance (Wang and Isola, 2020; Wang and Liu, 2021; Zhang et al., 2021).

Figure 1: Evaluating the representation quality with respect to both isotropy := 1 − intra-similarity and discrimination := 1 − inter-similarity/intra-similarity, where intra-similarity refers to the average cosine similarity between tokens from the same sequence and inter-similarity is defined with respect to tokens from two randomly sampled sequences. We use WIT (Srinivasan et al., 2021) and the code search dataset (Guo et al., 2022) to evaluate the GPT-2 and CodeGen models, respectively.
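The two representation-quality metrics reported in Figure 1 can be computed directly from token embeddings. The sketch below is a minimal illustration of one plausible reading (the function names and toy-embedding setup are ours, not the authors' code): intra-similarity averages cosine similarities among tokens of the same sequence, inter-similarity averages them across two different sequences.

```python
import numpy as np

def avg_cosine_similarity(A, B):
    """Mean pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A @ B.T).mean())

def representation_quality(seq_a, seq_b):
    """Isotropy and discrimination for two sequences' token embeddings.

    isotropy       = 1 - intra-similarity
    discrimination = 1 - inter-similarity / intra-similarity
    """
    intra = 0.5 * (avg_cosine_similarity(seq_a, seq_a)
                   + avg_cosine_similarity(seq_b, seq_b))
    inter = avg_cosine_similarity(seq_a, seq_b)
    return 1.0 - intra, 1.0 - inter / intra
```

With collapsed (anisotropic) embeddings both averages approach one, so both scores approach zero; well-separated sequences score high on both.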
A natural question arises as to how the improved representations would affect the generation ability of CLMs. Towards addressing this, we assess CONTRACLM on language generation tasks in different domains, where we achieve better MAUVE (Pillutla et al., 2021) on text generation and 9% relative improvement in pass@1 accuracy on HumanEval (Chen et al., 2021). The improvement in code completion is indeed significant, as it reflects that more model-generated programs pass a suite of test cases. On the discriminative tasks, CONTRACLM attains 44% relative improvement on Semantic Textual Similarity tasks and 34% on Code-to-Code Search tasks, which largely bridges the gap with encoder-only or encoder-decoder models (see Section 4.4 and Appendix D.1). Such improvements allow us to boost the performance of decoder-only models on a wide range of discriminative tasks where encoder-only models are currently the workhorse.

Related Work
Anisotropic Representation of Language Models Despite the remarkable success achieved by language models (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019; Raffel et al., 2020; Lewis et al., 2020), they suffer from the anisotropy issue, where the representations are distributed into a tiny cone in the vector space (Gao et al., 2019; Ethayarajh, 2019; Li et al., 2020; Wang et al., 2020). In particular, Ethayarajh (2019) shows that the degeneration is more severe for CLMs, where the average cosine similarity between two words sampled from randomly selected sequences is almost at one when evaluating the outputs of the last hidden layer of GPT-2 (Radford et al., 2019). However, Su and Collier (2022) show that CLMs (Radford et al., 2019) are indeed coherent as long as the model is larger than a certain size. We find such conclusions can vary across domains, e.g., when pretraining on code, CodeGen (Nijkamp et al., 2022) consistently suffers from the anisotropy issue over a wide range of model sizes. On the bright side, Figure 1b shows that CONTRACLM can effectively improve the representation quality when we continue to train existing CLMs with our proposed objectives, regardless of whether the CLMs suffer from inferior representations initially.
Contrastive Learning Contrastive learning (Chen et al., 2020; He et al., 2020) has seen remarkable successes in Natural Language Processing (NLP). A large amount of research has focused on sentence representation learning for encoder-only models, with the main differences lying in how the augmentations are generated (Fang and Xie, 2020; Giorgi et al., 2021; Wu et al., 2020; Meng et al., 2021; Yan et al., 2021; Kim et al., 2021; Gao et al., 2021; Zhang et al., 2022). Recently, there has been emerging interest in developing effective contrastive learning approaches for text generation models. However, most existing work mainly focuses on the encoder-decoder structure (Dong et al., 2019; Raffel et al., 2020; Lewis et al., 2020) by contrasting suboptimal model generations, obtained via diverse sampling (An et al., 2022) or by adding perturbations in the embedding space (Lee et al., 2021), against the ground truth. On the other hand, it is not intuitive to develop an effective contrastive learning strategy for decoder-only models. A recent work (Su et al., 2022) proposes SimCTG, a token-level contrastive learning approach that aims to separate each token apart from others within the same sequence by a predefined distance. As shown in Section 4, our temperature-based token-level contrastive learning approach, CONTRACLM-TOK, consistently outperforms SimCTG across different tasks. We conjecture that the fixed margin-based objective allows less flexibility for token-level representation separation, especially considering how the semantic relevance among tokens can vary across contexts (sequences).

Code Generation and Beyond Language modeling for source code is a fast-growing area of research. Various model architectures have been explored recently, including encoder-only (Feng et al., 2020; Guo et al., 2021), encoder-decoder (Ahmad et al., 2021; Wang et al., 2021; Li et al., 2022), and decoder-only models (Chen et al., 2021; Nijkamp et al., 2022; Chowdhery et al., 2022). Among them, the decoder-only
models have been found to be effective for code generation. However, as shown in Section 4.3.2 and Appendix D.1, they suffer from unsatisfactory performance on various discriminative tasks (Lu et al., 2021; Huang et al., 2021; Guo et al., 2022). This motivates us to improve decoder-only models on discriminative tasks to extend their main usage beyond language generation. Furthermore, code is fundamentally different from natural language in that it is more structured, which helps validate the generalization of our approach beyond plain text.

Causal Language Modeling
Let $x = [x_1, x_2, \ldots, x_{|x|}]$ denote a sequence with variable length $|x|$, e.g., a piece of text or a code snippet. Causal Language Modeling (CLM) is usually formulated as sequence distribution estimation over a set of sequences $x^1, x^2, \ldots, x^N$. For tractable estimation, common practice is to factorize the joint distribution of each sequence into the product of conditional token prediction probabilities. The model is then trained via maximum likelihood estimation as follows,

$$\mathcal{L}_{\text{CLM}} = -\frac{1}{N} \sum_{j=1}^{N} \frac{1}{|x^j|} \sum_{i=1}^{|x^j|} \log p_\theta\big(x_i^j \mid x_{<i}^j\big).$$

Here $x_{<i}^j = [x_1^j, \ldots, x_{i-1}^j]$ denotes the subsequence before $x_i^j$ and $|x^j|$ is the sequence length.
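As a concrete reading of the objective above, the sketch below computes the average per-token negative log-likelihood from precomputed conditional log-probabilities (a toy stand-in for a model's outputs; the function name is ours):

```python
import numpy as np

def clm_loss(token_log_probs):
    """L_CLM: average over sequences of the mean negative log-probability
    of each token given its prefix.

    token_log_probs[j][i] holds log p(x^j_i | x^j_{<i}) for the j-th sequence.
    """
    per_sequence_nll = [-np.mean(lp) for lp in token_log_probs]  # (1/|x^j|) * sum_i -log p
    return float(np.mean(per_sequence_nll))                      # (1/N) * sum_j
```

Sequences of different lengths are handled naturally: each sequence is first averaged over its own tokens, then the per-sequence losses are averaged.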

Contrastive Learning for CLM
Let $h^{(i)}, h^{(i^+)}$ denote two representation variations of the same instance that preserve semantics, or a positive pair for contrastive learning. Then denote $I = \{1, 2, \ldots, N\} \cup \{1^+, 2^+, \ldots, N^+\}$ as the set of representation indices associated with $N$ instances. Further, let $\tau$ denote the temperature hyper-parameter and $\diamond$ denote cosine similarity. We then minimize the following,

$$\mathcal{L} = -\frac{1}{2N} \sum_{i \in I} \log \frac{\exp\big(h^{(i)} \diamond h^{(i^+)} / \tau\big)}{\sum_{k \in I \setminus \{i\}} \exp\big(h^{(i)} \diamond h^{(k)} / \tau\big)}. \tag{1}$$

Note that in our setting, an instance can refer to either a token or a sequence. When $h^{(j)}, h^{(j^+)}$ denote a pair of representation variations of the $j$-th token within a sequence, $N$ is the sequence length, which can vary across sequences; in this case the objective is $\mathcal{L}_{\text{Tok}}$. For the sequence-level contrastive loss, $h^{(j)}, h^{(j^+)}$ refer to the pair of representations of the $j$-th sequence within a batch, and $N$ denotes the batch size; in this case the objective is $\mathcal{L}_{\text{Seq}}$. Therefore, when applied at both the token level and the sequence level, the contrastive learning objective defined above tries to separate tokens at each distinct location apart from every other token within the same sequence, and sequences within the same randomly sampled batch apart from each other. Intuitively, such separation can improve the uniformity (isotropy) of the representations. Further, better discriminative representations are achieved due to the implicit grouping effect of contrastive learning on semantically similar instances. Such grouping effect has been studied in recent work (Wang and Liu, 2021; Zhang et al., 2021; Wang and Isola, 2020) as well.
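The objective described above is a standard temperature-scaled (InfoNCE-style) contrastive loss over a pool of 2N views. A minimal NumPy sketch (our own illustrative implementation, not the authors' code):

```python
import numpy as np

def _logsumexp(a, axis):
    """Numerically stable log-sum-exp; tolerates -inf entries."""
    m = np.max(a, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))).squeeze(axis)

def contrastive_loss(H, Hpos, tau=0.05):
    """Temperature-scaled contrastive loss over N instances.

    H, Hpos: (N, d) arrays, the two representation views of each instance;
    row i of H and row i of Hpos form a positive pair, and every other view
    in the 2N-element pool acts as a negative.
    """
    N = H.shape[0]
    Z = np.concatenate([H, Hpos], axis=0)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # unit-normalize -> dot = cosine
    sim = (Z @ Z.T) / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity terms
    pos = np.concatenate([np.arange(N, 2 * N), np.arange(N)])  # each row's positive index
    log_prob = sim[np.arange(2 * N), pos] - _logsumexp(sim, axis=1)
    return float(-np.mean(log_prob))
```

The same routine serves both levels: pass token representations of one sequence and their augmented views for the token-level loss, or pooled sequence representations of a batch for the sequence-level loss.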

CONTRACLM
In addition to the causal language modeling loss, CONTRACLM optimizes the contrastive learning objective defined in Equation (1) at both the token level ($\mathcal{L}_{\text{Tok}}$) and the sequence level ($\mathcal{L}_{\text{Seq}}$) as follows,

$$\mathcal{L}_{\text{CONTRACLM}} = \mathcal{L}_{\text{CLM}} + \mathcal{L}_{\text{Tok}} + \mathcal{L}_{\text{Seq}}.$$

Furthermore, to understand how the token-level and sequence-level contrastive learning contribute to the overall performance, we assess the performance of $\mathcal{L}_{\text{CONTRACLM-TOK}} = \mathcal{L}_{\text{CLM}} + \mathcal{L}_{\text{Tok}}$ and $\mathcal{L}_{\text{CONTRACLM-SEQ}} = \mathcal{L}_{\text{CLM}} + \mathcal{L}_{\text{Seq}}$ in Section 4. Unless otherwise specified, we weigh each loss equally and set the temperature $\tau = 0.05$. Although better performance could be achieved via hyperparameter optimization, we mainly investigate how CONTRACLM improves the representation quality and the zero-shot transfer learning performance. We hence leave hyperparameter optimization in a supervised setting as future work.
Positive pair of representations For GPT-2 (Radford et al., 2019), we consider the simple yet effective dropout-based augmentation (Gao et al., 2021), where the positive pair of representations is obtained by performing a forward pass over the same sequence twice. On the other hand, for CodeGen (Nijkamp et al., 2022), we simply duplicate the representation of each instance as the positive pair for an apples-to-apples comparison, since dropout is disabled during its initial pretraining stage. Unlike existing findings that dropout-based augmentation can boost contrastive learning performance when (continually) training a language model, we find that the trends can vary between discrimination tasks and generation tasks. A detailed ablation study can be found in Section 4.4 and Appendix D.2.
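The dropout-based augmentation needs no extra data: running the same input through the network twice with dropout active yields two distinct views of every token. A schematic sketch (our own, using plain inverted dropout on a stand-in hidden-state matrix rather than a real transformer):

```python
import numpy as np

def forward_with_dropout(h, p=0.1, rng=None):
    """One stochastic 'forward pass': inverted dropout applied to hidden states.

    Calling this twice on the same input yields two different views of each
    token; those two views form the positive pair for contrastive learning.
    """
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) >= p          # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)              # rescale so the expectation is unchanged

hidden = np.ones((8, 16))                    # stand-in for an 8-token sequence's hidden states
view_a = forward_with_dropout(hidden, rng=np.random.default_rng(1))
view_b = forward_with_dropout(hidden, rng=np.random.default_rng(2))
```

Duplicating the representation instead (as done for CodeGen above) corresponds to `p = 0`, where both views are identical.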

Experiments
To demonstrate the effectiveness of our proposed framework in different application domains, we evaluate our models and baselines on natural language and programming language tasks. We design our experiments to address: (1) Does contrastive learning improve the discrimination ability of representations? (2) Do the representations learned by contrastive learning lead to better performance on language generation tasks? (3) Is joint contrastive learning at both the token and sequence level necessary, and how do the two benefit from each other? (4) How does the impact of contrastive learning vary across language domains?

Data and Models
For text, we continue training GPT-2 (124M) (Radford et al., 2019) on WikiText-103, a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia (Merity et al., 2017). For code, we continue training CodeGen 350M monolingual (Nijkamp et al., 2022) on permissively licensed Python code collected from GitHub. Please refer to Appendix B for the training details. We consider the following objectives for the continual training of both GPT-2 and CodeGen:
• CLM. The standard left-to-right autoregressive objective for training causal language models, which is also the objective used for pretraining both GPT-2 and CodeGen.
• SimCTG (Su et al., 2022). A predefined-margin-based token-level contrastive learning framework that aims to separate tokens at each distinct location within a sequence apart from each other.
• CONTRACLM-TOK and CONTRACLM-SEQ. As defined in Section 3.3, these two are obtained by combining the CLM objective with our proposed token-level or sequence-level contrastive loss, respectively. This investigation allows us to better understand how our token-level and sequence-level contrastive losses contribute to the overall performance of CONTRACLM.

Evaluation on Natural Language
We first evaluate our model on discrimination and generation tasks in natural language.

Semantic Textual Similarity
We assess CONTRACLM on semantic textual similarity (STS), the most commonly used benchmark for evaluating the semantic discrimination capability of representations. STS consists of seven tasks, namely STS 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), the STS Benchmark (Cer et al., 2017), and SICK-Relatedness (Marelli et al., 2014). In this benchmark, human annotators provide a fine-grained similarity score from 0 to 5 for each sequence pair. Following Reimers and Gurevych (2019), for the sequence pairs in each dataset, we report the overall Spearman's correlation between the cosine similarities of representations and the human-provided similarity scores in Table 1.
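The evaluation protocol is straightforward to reproduce: embed both sentences of each pair, score the pair by cosine similarity, and correlate those scores with the human judgments using Spearman's rho. A dependency-free sketch of the correlation itself (our own helper implementation; libraries such as SciPy provide an equivalent):

```python
import numpy as np

def rankdata(a):
    """Average ranks (1-based); ties receive the mean of their positions."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="stable")
    ranks = np.empty(len(a))
    sorted_a = a[order]
    i = 0
    while i < len(a):
        j = i
        while j + 1 < len(a) and sorted_a[j + 1] == sorted_a[i]:
            j += 1                              # extend the tie group
        ranks[order[i:j + 1]] = (i + j) / 2 + 1  # average rank for the group
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Because rho is rank-based, any monotone rescaling of either the model scores or the human scores leaves the reported correlation unchanged.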
Effectively Enhancing Discrimination Table 1 shows that both GPT-2 and the model continually trained with CLM perform poorly on STS, which is a consequence of poor discrimination: the cosine similarities between semantically similar and dissimilar pairs are both almost at one (Figure 4 in Appendix C.1). Also note that continuing to train GPT-2 with CLM on WikiText-103 worsens performance, which can occur since the domains of WikiText-103 and the STS datasets are different.4 In contrast, both CONTRACLM and SimCTG largely outperform GPT-2; still, CONTRACLM attains 25% relative improvement over SimCTG. Moreover, CONTRACLM-TOK outperforms SimCTG on almost all STS benchmarks, and the trend remains the same even without the dropout-based augmentation (Appendix D.3). Therefore, we posit that our temperature-based contrastive learning objective allows more flexibility towards separating representations based on token semantics, whereas requiring a predefined separation margin between tokens (as SimCTG does) is not ideal.
4 STS datasets include text from image captions, news headlines and user forums.As a result, adapting GPT-2 to WikiText-103 reduces its transfer ability.

CONTRACLM-TOK vs. CONTRACLM-SEQ
Table 1 also indicates that CONTRACLM-TOK and CONTRACLM-SEQ complement each other, as CONTRACLM consistently performs better than both of them on STS. Note that CONTRACLM-SEQ performs worse than CONTRACLM-TOK. This is surprising, especially since STS mainly assesses sequence-level representation quality. We investigate this by dividing the sequence pairs into two groups: semantically similar pairs with human-annotated similarity scores no less than 0.7, and dissimilar pairs with human scores no larger than 0.3. We plot the rank of the model-inferred similarity scores against the human similarity scores in Figure 2 (left). As we can see, CONTRACLM-SEQ struggles, often ranking semantically dissimilar sequence pairs higher and similar pairs lower. This suggests that the token-level contrastive loss is essential for making the sequence-level representations robust to spurious patterns of tokens or phrases, e.g., ranking semantically similar sequences with different synonyms low and dissimilar sequences high even in the presence of the same phrase (Figure 2 (right)).

Text Generation
Next, we assess open-ended language generation capability, where each model is required to generate text continuations given prefixes from the WikiText-103 test set. Following Su et al. (2022), we set the lengths of the prefix and continuation to 32 and 128, respectively. We use nucleus sampling (Holtzman et al., 2020) with top-p = 0.95. In addition to Perplexity (PPL; evaluated on the ground truth only) and MAUVE, we also evaluate the discrimination of representations of the generated text under different settings in Table 2.
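Nucleus (top-p) sampling keeps only the smallest set of highest-probability tokens whose cumulative mass reaches p, renormalizes within that set, and samples from it. A minimal sketch over a toy next-token distribution (our own implementation, not tied to any particular library):

```python
import numpy as np

def nucleus_sample(probs, top_p=0.95, rng=None):
    """Sample a token id from the smallest set of tokens whose cumulative
    probability covers top_p (nucleus / top-p sampling)."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]              # tokens by descending probability
    cdf = np.cumsum(probs[order])
    cutoff = np.searchsorted(cdf, top_p) + 1     # smallest nucleus covering top_p mass
    nucleus = order[:cutoff]
    p = probs[nucleus] / probs[nucleus].sum()    # renormalize within the nucleus
    return int(rng.choice(nucleus, p=p))
```

With top_p close to 1 (as in the 0.95 setting above), only the long tail of very unlikely tokens is pruned, preserving diversity while avoiding degenerate samples.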
CONTRACLM Leads to More Semantically Coherent Generations It is desired that contextual token representations within the same or semantically similar sequences have relatively higher similarities to each other than to tokens sampled from random contexts. Therefore, given a prompt, lower discrimination scores are desired between the ground truth and the generation, while higher discrimination values are desired between generations for randomly sampled prompts.
As reported in Table 2, compared to CLM, CONTRACLM attains much better discrimination on the generations under dissimilar context (prompt) pairs, as indicated by the high value of Disc(D). Further, CONTRACLM and CONTRACLM-TOK achieve better, or at least comparable, semantic coherence between the generation and the ground truth, as indicated by the MAUVE scores. We argue that the zero-valued discrimination score between generation and ground truth, i.e., Disc(S), attained by GPT-2 and CLM does not imply better semantic coherence; rather, it is a consequence of their inferior representations, as evidenced by the zero discrimination score between semantically irrelevant sequences.
Finally, a slight increase in PPL is to be expected, considering that PPL is better aligned with the standard CLM objective. Thus, contrastive learning can be interpreted as a regularizer that trades off between PPL and the desired representation properties.

Evaluation on Programming Language
In this section, we study the effectiveness of our proposed contrastive learning framework on programming language applications: code search, code completion, and code re-ranking. Since CodeGen models are pretrained without dropout activations, we follow the same setting for our models in this subsection, which helps us study the effectiveness of CONTRACLM without dropout augmentations. We also investigate how dropout would affect decoder-only models when evaluated on downstream tasks in Section 4.4 and Appendix D.2.

Code Search
Code search is the task of retrieving relevant code fragments given a code fragment as a query. We perform in-language (query and relevant code are in the same language) and cross-language (query and relevant code are in different languages) code search. We provide an example in Figure 5 and report the results for the code search tasks in Table 3. We observe that the CONTRACLM-TOK and CONTRACLM frameworks improve upon CodeGen trained with CLM by 33.5% (absolute 2.12) and 32.6% (absolute 2.06) on average, respectively. We also point out that the performance gap between CONTRACLM-TOK and SimCTG is an apples-to-apples comparison, since the dropout-based augmentation is not used in either model. As aforementioned, the consistently better performance of CONTRACLM-TOK suggests the superiority of our temperature-based contrastive learning objective. On the other hand, CONTRACLM-SEQ improves over the CLM baseline by only 10.4%. The code search results indicate that CONTRACLM-SEQ performs poorly compared to CONTRACLM-TOK, a performance gap larger than what we observed in the natural language evaluation. We conjecture that CONTRACLM-TOK generates more discriminative representations for code sequences, since a finer-grained understanding of code tokens is crucial to understanding a code sequence's functionality (semantics). To verify this, we check whether non-semantic factors impact model performance in the following section.
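Operationally, code search with a CLM reduces to nearest-neighbor retrieval over sequence embeddings: embed the query and every candidate, then rank candidates by cosine similarity. A minimal sketch with toy vectors (our own, standing in for model-produced sequence representations):

```python
import numpy as np

def code_search(query_vec, corpus_vecs, top_k=3):
    """Rank corpus entries by cosine similarity to the query; return top_k indices."""
    q = query_vec / np.linalg.norm(query_vec)
    C = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = C @ q                               # cosine similarity to each candidate
    return list(np.argsort(scores)[::-1][:top_k])
```

Since the ranking depends only on relative similarities, poorly discriminative embeddings (all scores near one) collapse the ranking, which is exactly what the discrimination metric above measures.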
Token-level Contrastive Learning is Effective for Code Understanding We break down the code search performance based on edit similarities and length differences between query code and relevant code fragments. While edit similarity indicates how much queries and their relevant code overlap, the length difference indicates whether models effectively capture relevance between two code fragments when they are similar in length or differ significantly. We present the results for Python in Figure 3 and conclude that sequence overlap and length are not the reasons for the improvements of CONTRACLM-TOK. Presumably, a finer-grained understanding of code tokens makes CONTRACLM-TOK more effective for code representations.

Code Completion and Re-Ranking
Given a sequence of tokens composed of natural language, a function signature, and input-output examples (as a whole, we call them the prompt), the goal of the code completion task is to complete the function. To evaluate the functional correctness of completed code, we use existing benchmarks that include unit tests. If the generated code successfully passes the unit tests, we refer to this as a successful execution. We compute pass@k for k ≤ n following Chen et al. (2021). In addition, we compare the models on the code re-ranking task: given n sampled completions from a code completion model, the goal is to order the generated samples, for which we use the mean log probability of each sampled completion (Chen et al., 2021). For code re-ranking evaluation, we report ranked pass@k (Inala et al., 2022). Figure 6 (Appendix C.2.2) illustrates both the code completion and re-ranking tasks. We detail the evaluation metrics in Appendix C.2.1.
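The pass@k metric of Chen et al. (2021) is an unbiased estimator: with n samples per problem of which c pass, pass@k = 1 − C(n−c, k)/C(n, k), i.e., one minus the probability that a random size-k subset contains no passing sample. It is typically computed as a numerically stable running product:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples per problem, c: number that pass the unit tests,
    k: evaluation budget. The product form avoids huge binomial coefficients.
    """
    if n - c < k:
        return 1.0     # every size-k subset must contain a passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

The per-problem estimates are then averaged over the benchmark; ranked pass@k applies the same formula to the top-k samples selected by the re-ranking criterion rather than to random subsets.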
Contrastive Learning Improves Source Code Generation Chen et al. (2021) introduced HumanEval, a collection of 164 handwritten programming problems and their respective unit tests. Each problem in this dataset is presented as a prompt for a function, and the task is to complete the function such that it passes all unit tests. In all our experiments, we use nucleus sampling (Holtzman et al., 2020) with top-p = 0.95. We sample n = 10 completions per problem with sampling temperature 0.2. Table 4 presents the evaluation results on the HumanEval benchmark. While CONTRACLM-TOK and CONTRACLM-SEQ perform comparably to CLM and SimCTG, CONTRACLM outperforms them significantly, i.e., by 9% and 10.3% in terms of pass@1 accuracy, respectively, and by 11% and 12% in terms of ranked pass@1 accuracy, respectively. While CONTRACLM-SEQ underperforms in code completion, it boosts code re-ranking significantly. We hypothesize the improvement is due to the alignment between contrastive learning and the mean log probability-based re-ranking criterion.
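The re-ranking criterion itself is simple: score each sampled program by the mean log probability of its tokens and sort best-first. A sketch (the data layout is our own assumption; the per-token log probabilities would come from the model):

```python
def rerank_by_mean_logprob(samples):
    """Order generated programs by mean token log probability, best first.

    samples: list of (code_string, token_log_probs) pairs.
    """
    def mean_logprob(item):
        _, logprobs = item
        return sum(logprobs) / len(logprobs)   # length-normalized sequence score
    return [code for code, _ in sorted(samples, key=mean_logprob, reverse=True)]
```

Length normalization matters here: without it, shorter completions would be systematically favored because raw sequence log probabilities decrease with length.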

Discussion
Impact of Dropout Dropout-based augmentation (Gao et al., 2021) for contrastive learning on language models has been shown to yield significant improvements on discriminative tasks. We observe the same trend for both GPT-2 and CodeGen (see Table 8 in Appendix D.2). However, we observe the opposite for language generation, whether training with CLM only or with contrastive learning (see Table 9 in Appendix D.2). Dropout has been one of the key ingredients for training large models; further investigation of proper ways to use and evaluate it is indeed required. Nevertheless, even without dropout, Section 4.3 shows that CONTRACLM still yields significant improvements.
Bridging the Gap In comparison with the causal (left-to-right) attention mechanism of decoder-only models, the bidirectional attention mechanism better leverages the context of sequences, yielding better representations for discriminative tasks. Take the encoder-only models as an example: as Table 7a in the Appendix shows, both BERT-Base (Devlin et al., 2019) and RoBERTa-Base (Liu et al., 2019) outperform GPT-2 by at least 60% relative performance on STS. Although the performance gap between CodeGen and the encoder-only or encoder-decoder models decreases in Table 7b, it is still significant considering that both the model and pretraining data sizes used by CodeGen are much larger. Such a large performance gap severely limits the usage of decoder-only models in many discriminative tasks. On the bright side, contrastive learning shows promise in bridging the gap, e.g., reducing the relative performance gap between GPT-2 and the encoder-only models by at least 50% when evaluating on STS (see Table 7a). Please refer to Appendix D.1 for more detailed discussion.
Conclusion

In this paper, we present CONTRACLM, an effective contrastive learning framework that resolves the representation degeneration issue of CLMs trained with the autoregressive objective. We assess the effectiveness of CONTRACLM on various downstream tasks in both the natural language and code domains, where we attain significant improvements on both discrimination and generation tasks. While we explored only decoder-only CLMs, our proposed contrastive learning framework can serve as a drop-in term for encoder-decoder, encoder-only, or prefix-LM models as well. We leave these explorations as future work.

Limitations
While our work displays many strengths, we highlight some limitations. First, we focus on Python for programming language evaluation, which is one of the most widely used programming languages. However, we believe that our proposed approach, CONTRACLM, would benefit Code LMs trained on any programming language. Second, the empirical findings presented in this work are mainly based on the smaller versions of GPT-2 and CodeGen, with 124M and 350M parameters, respectively. However, as shown in Figure 1b, by continuing to train the pretrained models with our proposed objective, CONTRACLM is able not only to address the anisotropy and poor discrimination issues that both GPT2-Small and CodeGen suffer from, but also to improve the representation quality of GPT2-Large, which has a good starting point for both isotropy and discrimination. Therefore, we believe the effectiveness of CONTRACLM should carry over to larger versions of these LMs, regardless of whether they suffer from the anisotropy issue (e.g., large CodeGen models) or not (large GPT-2 models). We leave the exploration of larger models as future work.

Ethics Statement
Training data We use WikiText-103 and Python source code from permissively licensed GitHub repositories to train GPT-2 and CodeGen, respectively. We do not perform any preprocessing to remove personally identifiable information or offensive content. However, the use of code LMs comes with certain risks, e.g., generating biased, toxic, or insecure code. We refer readers to Chen et al. (2021) (Section 7) for a detailed discussion of the broader impact of code LMs.
Compute We use an in-house cluster of 128 A100s for all jobs in this paper. Each run takes from a couple of hours to one day to finish, depending on the configuration and the model size. We performed one round of training for each setting, as it is very expensive to repeat them multiple times. However, we perform the code completion and re-ranking evaluation with three seeds. STS and code search evaluation do not need multiple inference runs (as the predictions are deterministic).

Author Contributions
Dejiao and Wasi proposed the initial framework for CONTRACLM and completed the paper writing. Nihal and Dejiao set up the pretraining code. Nihal processed the pretraining data for the programming language experiments. Dejiao designed and completed all natural language related training and evaluations. Nihal and Wasi completed the corresponding work for programming language data. Zijian was in charge of the pretraining data collection and multi-node distributed training of CONTRACLM models on the programming language data. Feng and Xiaopeng helped with our preliminary explorations of natural language data evaluation. All the other co-authors provided thought-provoking discussions and suggestions for this project, and helped shape and proofread the paper draft.

Supplementary Material: Appendices

A Contrastive Learning for CLM
We detail our proposed token-level and sequence-level contrastive losses. Before that, we first call out the following notations, which will be used throughout this section. Let $x = [x_1, \ldots, x_{|x|}]$ denote a sequence with variable length $|x|$, e.g., a text document or a code snippet, and let $h = [h_1, h_2, \ldots, h_{|x|}]$ be its representation output by the last hidden layer of the decoder. For a randomly sampled batch $B = \{x^j\}_{j=1}^{N}$ with $N$ sequences, we use $x_i^j$ and $h_i^j$ to denote the $i$-th token and its representation in the $j$-th sequence, respectively. Let $h^j, h^{j^+}$ denote the representation pair of sequence $x^j$, and let $h_i^j, h_i^{j^+}$ correspond to the representations of its $i$-th token. Such representation pairs are referred to as positive pairs in contrastive learning, and are often obtained via data augmentation.

A.1 Token-Level Contrastive Learning
As aforementioned, $h_i^j, h_i^{j^+}$ are a pair of representations for $x_i^j$, the $i$-th token in the $j$-th sequence. Let $I^j = \{1, 2, \ldots, |x^j|\}$ denote the indices of tokens in $x^j$. Further, let $\tau$ denote the temperature hyper-parameter and $\diamond$ denote cosine similarity, i.e., $a \diamond b = a^\top b / (\|a\|_2 \|b\|_2)$. Then we minimize $\mathcal{L}_{\text{Tok}}$ as defined in Table 5.

B Training Details
Training Data For text, we use WikiText-103, a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia (Merity et al., 2017). For code, we collect permissively licensed Python code from GitHub. Following (Chen et al., 2021; Nijkamp et al., 2022), we perform filtering and deduplication and further remove data that contains significant use of non-English languages or is not parsable, resulting in a dataset of 101GB of code.

Model We use GPT-2 (Radford et al., 2019) and CodeGen 350M monolingual (Nijkamp et al., 2022) for all experiments on natural language (text) and programming language (code), respectively. We set the batch size to 512 and continue to train GPT-2 on WikiText-103 for 12 epochs and CodeGen on the GitHub data for 2 epochs. We train the models with a max sequence length of 512 tokens for WikiText-103 and 1024 tokens for the code data. We set the learning rate to 2e-5, warm-up steps to 500 with linear annealing after the peak learning rate, weight decay to 0.1, temperature to 0.05 (when using contrastive losses), and gradient clipping to 1.0. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with β1 = 0.9, β2 = 0.999, and ϵ = 1e-8, following (Nijkamp et al., 2022). Our training pipeline is based on PyTorch Lightning, and we use DeepSpeed (Rasley et al., 2020) for training optimization.

Processing Code Training Data Our preprocessing strategy for the code training data is designed to maximize data utilization while retaining the syntactic structure of programming language sequences. We also eliminate duplicate sequences, since this benefits training large language models (Lee et al., 2022). Specifically, we break long sequences into chunked sequences of smaller lengths to retain most parts of the original program. Further, we maintain syntactic structure in the chunks by ensuring that each chunk ends with a '\n' character. Each chunk obtained this way contains at most
max_chars_per_seq characters where max_chars_per_seq = max_tokens_per_seq * chars_per_tok.In our experiments, we fix chars_per_tok = 3.2 and max_tokens_per_seq = 1024.We also perform deduplication using character-based exact matches between chunked sequences over the entire dataset.This step helps eliminate exact duplicates that might be present after the chunking stage.
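The chunking-and-deduplication procedure above can be sketched as follows. This is a minimal illustration, not the released pipeline; the function names are ours, and lines longer than the character budget are kept whole here for simplicity:

```python
def chunk_program(program: str, max_chars_per_seq: int) -> list[str]:
    """Split a program into chunks of at most max_chars_per_seq characters,
    breaking only at line boundaries so each chunk ends with '\n'."""
    chunks, current = [], ""
    for line in program.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars_per_seq:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

def deduplicate(chunks: list[str]) -> list[str]:
    """Character-based exact-match deduplication, preserving order."""
    seen, unique = set(), []
    for c in chunks:
        if c not in seen:
            seen.add(c)
            unique.append(c)
    return unique

# With the paper's settings chars_per_tok = 3.2 and max_tokens_per_seq = 1024:
MAX_CHARS_PER_SEQ = int(1024 * 3.2)  # 3276 characters per chunk
```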

C More on Evaluation
C.1 Representation Quality Evaluated on STS
For each sequence pair in STS, a fine-grained similarity score ranging from 0 to 5 is provided, with a high score indicating a semantically similar pair and a low score a semantically dissimilar or irrelevant pair. For better illustration, we scale the human-annotated similarity scores to [0, 1] to align with the model-predicted cosine similarity scores. This does not affect the evaluation, as the Spearman correlation reported in Section 4.2 is a rank-based metric.

Table 5: Formulation of our token-level and sequence-level contrastive losses, denoted L_Tok and L_Seq, respectively.

Figure 4: CLM versus contrastive learning in similarity prediction and ranking. We report results on two STS benchmarks: (1) STS14, where CLM performs the worst relative to its own performance on the other STS tasks; and (2) STS15, where CLM attains its best performance relative to the other STS tasks. For illustration, we scale the human-annotated similarity scores from [0, 5] to [0, 1]. A good CLM is expected to predict discriminative similarity scores such that the resulting rankings are as close as possible to those provided by humans.
CLM yields poorly discriminative representations We report the model-predicted similarity scores of sequence pairs in the left column of Figure 4. A good model is expected to yield representations that attain higher similarity scores for similar sequence pairs and lower scores for dissimilar ones; hence, a large gap between the predicted similarity scores of similar and dissimilar pairs is desired. However, as seen in Figure 4 (left), the similarity scores attained by the model trained with the standard CLM objective alone are almost at one for both similar and dissimilar sequence pairs. This suggests that the representations yielded by CLM can be squeezed into a tiny cone in the representation space, rather than being scattered apart to better leverage the capacity of the vector space. Although the resulting similarity ranks are not entirely flat, as shown in the right column of Figure 4, CLM struggles to rank similar sequences lower and dissimilar sequences higher as a consequence of its poorly discriminative representations.

In contrast, Figure 4 (left) further validates that contrastive learning yields more discriminative representations, with a comparatively larger similarity gap between similar and dissimilar pairs. As a result, the similarity rankings of the sequence pairs are better aligned with those obtained from the human-provided similarity scores, as shown in Figure 4 (right).
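The evaluation protocol of this subsection — scaling the gold scores to [0, 1] and comparing rankings via Spearman correlation — can be sketched as follows. The Spearman computation is written out with the standard library rather than SciPy; since the scaling is monotonic, it leaves the rank correlation unchanged:

```python
def rankdata(xs):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(rankdata(a), rankdata(b))

gold = [0.0, 2.5, 5.0, 4.0]        # human scores in [0, 5]
scaled = [g / 5.0 for g in gold]   # scaled to [0, 1]
pred = [0.10, 0.40, 0.95, 0.80]    # model-predicted cosine similarities
# Scaling is monotonic, so the rank correlation is unchanged:
assert spearman(gold, pred) == spearman(scaled, pred)
```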

C.2.1 Evaluation Metrics
Mean Average Precision (MAP) For a set of queries, MAP is the mean of the per-query average precision scores:

$$\mathrm{MAP} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AP}(q),$$

where Q is the number of queries and AP(q) is the average precision of query q.
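For binary relevance labels, the formula above amounts to the following sketch (assuming every relevant item appears somewhere in the ranked list):

```python
def average_precision(ranked_relevance):
    """AP for one query, given its ranked list of 0/1 relevance labels:
    the mean of precision@rank over the ranks holding relevant items."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_query_rankings):
    """MAP = mean of AP over all Q queries."""
    aps = [average_precision(r) for r in per_query_rankings]
    return sum(aps) / len(aps)

# Two queries: relevant items at ranks (1, 3) and at rank 2.
mean_average_precision([[1, 0, 1], [0, 1, 0]])  # (5/6 + 1/2) / 2 ≈ 0.667
```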
Pass@k Given a problem (code prompt, as shown in Figure 6), pass@k measures the functional correctness of model-generated code samples; a problem is considered solved if any sample passes the unit tests. Following Chen et al. (2021), we generate n ≥ k samples per problem (in this paper, n = 10 and k ∈ {1, 5}), count the number of samples c ≤ n that pass the unit tests, and calculate the unbiased estimator

$$\text{pass@}k = \mathop{\mathbb{E}}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right].$$
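The per-problem term of this estimator can be computed directly (Chen et al.'s reference implementation uses a numerically stable product form; for small n, `math.comb` is equivalent):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem pass@k: 1 - C(n-c, k) / C(n, k), i.e. the
    probability that a size-k draw (without replacement) from the n
    samples contains at least one of the c passing samples."""
    if n - c < k:
        return 1.0  # every size-k draw must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 10 samples of which 3 pass the unit tests:
pass_at_k(10, 3, 1)  # ≈ 0.3
pass_at_k(10, 3, 5)  # = 11/12 ≈ 0.9167
```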
Ranked Pass@k Unlike pass@k, where we randomly choose k out of n samples, in ranked pass@k we choose the top-k samples based on model-provided scores and then compute pass@k.
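For a single problem, this reduces to checking whether any of the k highest-scored samples passes. The `(score, passed)` pair representation below is a hypothetical format chosen for illustration:

```python
def ranked_pass_at_k(samples, k: int) -> float:
    """samples: list of (model_score, passed_unit_tests) pairs.
    Select the top-k samples by score; the problem counts as solved
    if any of them passes the unit tests."""
    top_k = sorted(samples, key=lambda s: s[0], reverse=True)[:k]
    return 1.0 if any(passed for _, passed in top_k) else 0.0

samples = [(0.9, False), (0.7, True), (0.2, True)]
ranked_pass_at_k(samples, 1)  # 0.0: the highest-scored sample fails
ranked_pass_at_k(samples, 2)  # 1.0: the second-ranked sample passes
```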

C.2.2 Examples and Statistics
In Figure 5, we present an example of a query code fragment in Python together with relevant code fragments in Python and Java, respectively. In-language code-to-code search refers to retrieving relevant code fragments in the same language, while cross-language code-to-code search refers to retrieving code fragments in a different language.

We present the statistics of the code search dataset in Table 6. To illustrate the code completion task, we show an example in Figure 6.

C.2.3 Detailed Code Search Results
We provide a comparison between encoder-only (Feng et al., 2020; Guo et al., 2021), encoder-decoder (Ahmad et al., 2021; Wang et al., 2021), and decoder-only models (the main focus of this work) on the zero-shot code-to-code search task in Table 7b. We see that CONTRACLM-TOK and CONTRACLM outperform the encoder-only model CodeBERT and both encoder-decoder models. It is important to note that the comparison across these models is not apples-to-apples, as the models differ in size, scale of pretraining, and language settings; its purpose is to show the promise of decoder-only models for discriminative tasks like code search. We further break down code search performance based on edit similarities and length differences between query code fragments and their relevant code fragments, and present the results in Figures 7 and 8. We observe a similar performance trend in all three languages, although cross-lingual search performance still lags behind. Nonetheless, the objective of this analysis is to show that sequence overlap and length are not the reasons for the improvements of CONTRACLM-TOK; rather, a finer-grained understanding of code tokens, due to token-level contrastive learning, makes CONTRACLM-TOK more effective.
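The appendix does not spell out the exact edit-similarity metric used for this breakdown; `difflib.SequenceMatcher.ratio()` is one common character-level choice, shown here purely for illustration of how query/candidate pairs could be binned:

```python
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()

query = "def add(a, b):\n    return a + b\n"
near  = "def add(x, y):\n    return x + y\n"   # same logic, renamed vars
far   = "import os\nprint(os.getcwd())\n"      # unrelated snippet

edit_similarity(query, query)  # 1.0
edit_similarity(query, near)   # high overlap
edit_similarity(query, far)    # low overlap
```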

D.1 Bridge the Gap on Discriminative Tasks
Compared to the causal (left-to-right) attention mechanism of decoder-only models, the bidirectional attention mechanism in encoder-only and encoder-decoder models allows them to leverage the full context of a sequence, and hence leads to better representations.
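The contrast can be made concrete with boolean attention masks (a minimal sketch, independent of any specific model):

```python
def causal_mask(seq_len: int):
    """mask[i][j] is True iff position i may attend to position j;
    causal attention only allows j <= i (the left context)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def bidirectional_mask(seq_len: int):
    """Every position attends to the full sequence."""
    return [[True] * seq_len for _ in range(seq_len)]

m = causal_mask(4)
m[0]  # [True, False, False, False]: the first token sees only itself
m[3]  # [True, True, True, True]: the last token sees the whole prefix
```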
Taking the encoder-only models in Table 7a for illustration, on average, BERT-Base (Devlin et al., 2019) and RoBERTa-Base (Liu et al., 2019) outperform GPT-2 on the STS tasks by a large margin. While the performance gap between CodeGen and the BERT models trained on programming languages, i.e., CodeBERT (Feng et al., 2020) and GraphCodeBERT (Guo et al., 2021), decreases or even diminishes on the code search tasks, it is still significant given that both the model size and the pretraining data of CodeGen are much larger than those of the encoder-only models in Table 7b. Similar trends hold for the performance gap between decoder-only and encoder-decoder models on both natural language (Lewis et al., 2020; Raffel et al., 2020) and programming language (Ahmad et al., 2021; Wang et al., 2021). This large performance gap severely limits the use of decoder-only models in many discriminative tasks. To this end, contrastive learning shows promise in largely bridging the gap. As seen in Table 7a, on STS, CONTRACLM reduces the relative performance gap from 67.24% (absolute 21.12%) to 16.17% (absolute 7.33%) with respect to BERT-Base, and from 84.62% (absolute 26.64%) to 28.24% (absolute 12.8%) with respect to RoBERTa-Base. Similarly, Table 7b shows that CONTRACLM outperforms the encoder-decoder models and performs comparably to the encoder-only model GraphCodeBERT.

D.2 Dropout-based Augmentation
Gao et al. (2021) showed that dropout-based augmentation is an effective strategy for unsupervised contrastive learning, and follow-up works (Chuang et al., 2022; Wu et al., 2022) endorse its effectiveness. This motivates us to study dropout-based augmentation in our proposed contrastive learning framework. We present the results on discriminative and generation tasks in Tables 8 and 9, respectively. From the results, it is evident that adopting dropout-based augmentation improves performance on the discriminative tasks, corroborating the findings of Gao et al. (2021). In contrast, dropout-based augmentation hurts performance on the generation tasks. For code completion, we had anticipated that dropout-based augmentation would hurt performance, since the CodeGen model (Nijkamp et al., 2022) does not use dropout during its initial pretraining stage. However, in Table 9 we observe a drop in perplexity when disabling dropout for both CLM and CONTRACLM, which contradicts our anticipation, especially considering that, unlike CodeGen, GPT-2 is pretrained with dropout enabled. We leave a deeper investigation of this finding to future work.

D.3 CONTRACLM outperforms SimCTG
To better understand the performance gap between CONTRACLM and SimCTG (Su et al., 2022), we run the following ablations on GPT-2 and report evaluations on STS. In Table 10, we report the results of (1) running CONTRACLM without dropout-based data augmentation and comparing it with the original SimCTG model, and (2) augmenting SimCTG with both the sequence-level contrastive loss and dropout-based augmentation and comparing it with our proposed CONTRACLM model. As we can see, CONTRACLM consistently outperforms SimCTG in both settings. Figure 10, together with our results reported in Section 4.3, where we disabled the dropout-based augmentation for CONTRACLM and its variants but still observed consistently better performance than SimCTG on both discrimination and generation tasks, leads us to conclude that CONTRACLM is better than SimCTG across domains and settings.
The isotropy and discrimination of the representations yielded by a model do not always improve with increasing model size. CONTRACLM can effectively enhance both isotropy and discrimination, regardless of whether the original model suffers from degenerate representations.

Figure 2: CONTRACLM-TOK is essential for making the sequence-level representations robust to spurious patterns of words or phrases (results reported on STS-B). We scale the ground-truth similarity scores from [0, 5] to [0, 1].
Figure 3: Code search performance broken down by (a) edit similarities and (b) length differences (x-axis) between the query code fragments (in Python) and their relevant code fragments (in Python). We observe that in both cases, CONTRACLM-TOK outperforms CLM, SimCTG, and CONTRACLM-SEQ.
STS14: (left) predicted cosine similarity vs. human-annotated ground truth; (right) similarity ranking according to the model-predicted similarity scores vs. the human-similarity-based ranking. STS15: (left) predicted cosine similarity vs. human-annotated ground truth; (right) similarity ranking according to the model-predicted similarity scores vs. the human-similarity-based ranking.

Table 1: Spearman rank correlation between the cosine similarity of sentence-representation pairs and the ground-truth similarity scores.

Table 2: Evaluation on the WikiText-103 test set. Disc(D) is the discrimination score computed between the generated continuations of two randomly sampled prompts. Disc(S) is computed between the ground-truth text and the generated continuation of the same prompt. A lower Disc(S) indicates better coherence with the ground truth, while a higher Disc(D) indicates better discrimination among the representations of generations under different contexts. All metrics are evaluated over the entire test set.

Contrastive Learning Yields More Discriminative Code Representations For the code-to-code search task, Guo et al. (2022) used problem solutions in Ruby, Python, and Java from CodeNet (Puri et al., 2021). They propose to use each program as a query and retrieve all programs that solve the same problem. We present detailed statistics of the dataset in Table 6 (Appendix C.2.2). We experiment in the zero-shot setting: we use the models described in Section 4.1 to generate dense representations of code and perform a nearest-neighbor search to retrieve relevant code fragments, using the publicly available implementation of Guo et al. (2022). We set the maximum sequence length to 512 and use the cosine similarity between the mean vectors of the last hidden states of two programs as the relevance score. We then sort the candidates by their scores to calculate the Mean Average Precision (MAP).

Table 3: MAP score (%) on the zero-shot code search task. The language names in the top two rows indicate the languages in which queries and candidates are written.
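The retrieval step — mean-pooling the last hidden states and ranking candidates by cosine similarity — can be sketched as follows. The toy vectors stand in for real model hidden states, which would come from the language models described in Section 4.1:

```python
from math import sqrt

def mean_pool(hidden_states):
    """Average a T x d list of token vectors into one sequence vector."""
    dim = len(hidden_states[0])
    return [sum(tok[j] for tok in hidden_states) / len(hidden_states)
            for j in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def retrieve(query_vec, candidate_vecs):
    """Return candidate indices ranked by cosine relevance, descending."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy "last hidden states" (2 tokens, 3 dims each) for a query and candidates:
query = mean_pool([[1.0, 0.0, 0.0], [1.0, 0.2, 0.0]])
cands = [mean_pool([[0.0, 1.0, 0.0], [0.0, 1.0, 0.2]]),
         mean_pool([[1.0, 0.1, 0.0], [1.0, 0.1, 0.0]])]
retrieve(query, cands)  # [1, 0]: the second candidate is most relevant
```

The ranked candidate lists produced this way are what the MAP metric of Appendix C.2.1 is computed over.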

Table 6: Statistics of the code-to-code search task dataset created from CodeNet (Puri et al., 2021). We truncate code whose length exceeds the maximum sequence length, which is set to 512.