A Simple, Yet Effective Approach to Finding Biases in Code Generation

Recently, high-performing code generation systems based on large language models have surfaced. They are trained on massive corpora containing much more natural text than actual executable computer code. This work shows that current code generation systems exhibit undesired biases inherited from their large language model backbones, which can reduce the quality of the generated code under specific circumstances. To investigate the effect, we propose the "Block of Influence" concept, which enables a modular decomposition and analysis of coding challenges. We introduce an automated intervention mechanism reminiscent of adversarial testing that exposes undesired biases through the failure modes of the models under test. Finally, we demonstrate how our framework can be used as a data transformation technique during fine-tuning, acting as a mitigation strategy for these biases.


Introduction
Large language models (LLMs) have recently demonstrated their ability to generate code (Li et al., 2022; Brown et al., 2020; Wang et al., 2021) and to solve challenging programming and math tasks on par with human coders (Li et al., 2022; Lewkowycz et al., 2022b; Chowdhery et al., 2022a); these models are trained with the data-driven paradigm. On the other hand, an increasing body of work questions whether the data-driven approach leads to acquiring reasoning skills (Piekos et al., 2021; Zhang et al., 2022; Mouselinos et al., 2022), showing that, left alone, it might not be sufficient for achieving truly human-level performance on tasks such as logical or visual reasoning. In many studied cases, models still rely on various hints in their reasoning process. This work extends these results, i.e., the lack of reasoning capabilities, to the code generation domain. More specifically, we devise a framework that automatically identifies subtle cues a code generation model might exploit. Changing or removing those cues serves as a test of the reasoning and generative capabilities of the model at hand.
We presume that the reasoning process of code generation models should remain invariant under changes that still provide enough context or pose little, if any, additional challenge to a human coder. To this end, we propose an automatic and model-agnostic framework that modifies the following: (1) function names, (2) keywords in a problem specification, and (3) examples provided in the problem prompt. We refer to these parts as Blocks of Influence; see Figure 1. Each block contributes part of the context needed for correct completion. We show that minor modifications of these blocks are sufficient to "fool" LLM-based code generation methods.
Our results reveal biases such as keyword preference and memorization effects, which can be identified across multiple models. During our experiments, we ensure that any modification maintains the global semantics of the coding challenge. This is achieved through a context-aware filtering mechanism that guarantees any information altered or removed still exists in, or can be deduced from, the remaining unaltered part.
Contributions. The main contributions of our work can be summarized in three points. First, we propose a novel automated framework that identifies possible biases in code generation models. Our framework removes subtle hints, introducing minimal changes such as keyword replacement or partial code-block omission, ultimately acting as an adversarial test. Since the framework operates on the data level, it is agnostic to the model's structure and internal workings, and can be easily adjusted to any input format or programming language. Second, we introduce the "Blocks of Influence" concept. We suggest that every instance of a typical coding challenge can be decomposed into three parts (blocks). Each part is correlated with a different method of hinting and is used as a target of our transformations. A model's reasoning process is informed by all three blocks, making them useful tools for analyzing cases of failing code generation. Third, we explore new ways of mitigating biases during code generation. In Section 6, we study the effects of adversarial training against our proposed perturbations and the benefits of including examples with longer descriptions during fine-tuning. Our results show that combining these techniques leads to more accurate code completions.

Related Work
Our approach is inspired by works from various research directions, which we briefly describe here.
Solving coding and math challenges. The emergent abilities of large language models to generate, summarize, and translate textual information have recently sparked interest in their aptitude for math, logic, and programming challenges. Tasks such as code completion (Chen et al., 2021; Shin et al., 2019; Hendrycks et al., 2021a; Li et al., 2022), code summarization, and code translation (Lu et al., 2021) have been proposed, with models constantly progressing towards near-human performance. Similarly, (Hendrycks et al., 2021b; Saxton et al., 2019; Ling et al., 2017; Amini et al., 2019) have proposed tests measuring a model's ability to perform math and logic, ranging from school problems to competition-grade challenges. Impressive results in multiple programming languages have also been achieved by decoder-only works (Brown et al., 2020; Chen et al., 2021). Fried et al. (2022) created the first generative model to perform infilling using a novel masking objective. Finally, massive-scale models such as (Chowdhery et al., 2022b; Lewkowycz et al., 2022a) demonstrated breakthrough capabilities in language, reasoning, and code tasks, achieving state-of-the-art performance in multiple domains simultaneously.
Social biases in large language models. Trained on ever-increasing amounts of publicly available data, large language models have been studied for adopting social biases commonly found among humans. Wallace et al. (2019) show that generative models can be conditioned to produce toxic content with the use of nonsense, adversarial prefixes. Similarly, Liang et al. (2021) suggest that models might adopt biases and social stereotypes found in their training data and provide ways to apply fairness during generation. Countermeasures have been proposed by (Zhao et al., 2021; Liu et al., 2022), claiming that sanitized zero-shot examples contribute to mitigating biases during generation.

Probing reasoning through cognitive biases.
There have been notable attempts to systematize intelligence and reasoning as concepts (Legg, 2008; Chollet, 2019), yet a few recent works try to approach reasoning through the analysis of failure modes caused by biases in deep learning models. Glockner et al. (2018) suggest that natural language inference systems can be easily fooled with a single hypernym/hyponym swap, exhibiting a bias towards specific word choices. Similarly, Lin et al. (2020) show that numerical commonsense reasoning in LLMs is heavily biased by adjectives describing the object of interest. Concerns about current data-driven methods have been expressed by Razeghi et al. (2022), pointing out that LLMs are more accurate on mathematical challenges involving terms that appear significantly more frequently in their pre-training dataset. Piekos et al. (2021) claim that LLMs can answer math and logic questions without understanding the rationale behind them, relying blindly on the existence of specific keywords. We place our work in this line of research, provoking and studying the failures of LLMs on reasoning-heavy coding tasks. Our main goal is to identify sources of cognitive bias, i.e., words, structures, or co-occurrence patterns, that exist in current LLMs and lead to systematic failures of rationale.
Adversarial methods and Language Processing. The NLP community has developed excellent methods for preparing adversarial tasks, including the TextAttack framework (Morris et al., 2020) and sophisticated techniques to elicit adversarial examples from humans, as in Talmor et al. (2022), though our work appears to be the first focused on the disciplined construction of adversarial examples for code.

Benchmarks
In this section, we describe the datasets used in our experiments. We employ the widely used coding challenges HumanEval (HE) and MBPP, and a more complex dataset with lengthy problem descriptions (DMCC). More information about the datasets can be found in Appendix 10.2.
HumanEval (HE). This is a human-curated problem-solving dataset described in Chen et al. (2021). It consists of 164 original programming challenges assessing language comprehension, algorithms, and simple mathematics.
MBPP. Introduced in Austin et al. (2021), it contains 974 short Python functions designed to be solved by entry-level programmers. Contrary to HumanEval, each task is given through a text description rather than a docstring. Since there are no input-output examples in the prompt, we generate 3 valid pairs using the code solutions provided. MBPP challenges models to perform tasks of imperative control flow, requiring loops and conditionals.
DeepMind Code Contests (DMCC). This is the highly challenging dataset proposed by Li et al. (2022). The dataset includes problems from the Codeforces platform (Mirzayanov, 2020), Description2Code (Caballero, 2016), and CodeNet (Puri et al., 2021). We used challenges written in Python3 from the training split for our experiments. DMCC contains long descriptions of the problems and input-output examples of the functions to be completed.
In this work, DMCC is used for its long-context properties in the augmented fine-tuning experiments (Table 5). The models presented in our work achieve zero or near-zero scores on it; hence it is excluded from our perturbation analysis, with HumanEval and MBPP being more suitable targets.
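For quick experimentation, both main benchmarks are available through the Hugging Face datasets library. The snippet below is a convenience sketch; the hub identifiers and field names are assumptions based on the public copies of the datasets rather than details given in this paper.

from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")   # 164 docstring-style challenges
mbpp = load_dataset("mbpp", split="train")                    # text descriptions plus reference code

print(humaneval[0]["prompt"])   # Name, Description and Example blocks in one docstring prompt
print(mbpp[0]["text"])          # plain-text task description, no in-prompt examples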

Evaluation
Models. In our experimental setup, we test five models representing different approaches to code generation. CodeParrot (Tunstall et al., 2022a) comes with an open-source dataset and can be easily used for fine-tuning experiments due to its size. Its smaller variant (110M) achieves results competitive with other open-source LLMs of larger parameter budgets. By exploring its dataset, we tested our hypothesis that function names act as biases during code generation: models can be heavily inspired by similarly named snippets in their training set and resort to copying whole solutions, or parts of them, instead of performing reasoning (see Appendix A.9). We also test the InCoder (Fried et al., 2022) model, which is trained under a novel bi-directional causal objective and is able to handle context more efficiently than its causal counterparts. Contrary to our initial hypothesis, our methods cause significant performance drops despite the model's enhanced context-understanding capabilities (Table 3). The Bloom model (Mitchell et al., 2022) exhibits emergent abilities in multiple domains by training on massive multilingual and multi-purpose content. Despite not being a code generation model, it performs on par with code-specific models of the same parameter budget. Theoretically, bias effects can be reduced when a model is exposed to diverse training examples. Our experiments reveal that this is still not the case under our setup, and post-training solutions are explored. CodeGen (Nijkamp et al., 2022) is a high-performing model trained on natural language understanding and code. We test its Mono variant, further fine-tuned on the Python language. Finally, we have the powerful Codex model, which can tackle most of the proposed coding challenges in the HumanEval and MBPP datasets. A list of the tested models, as well as KeyBert (Grootendorst, 2020), which is used in our framework, can be found in Table 1.

Table 1: Tested models and their sizes; KeyBert is used inside our framework.
Model Name                              Sizes Used
KeyBert (Grootendorst, 2020)            2M
Codeparrot (Tunstall et al., 2022a)     110M / 350M* / 1.5B
InCoder (Fried et al., 2022)            1.6B / 6B
CodeGen (Nijkamp et al., 2022)          350M / 6B
Bloom (Mitchell et al., 2022)           560M* / 1.7B / 176B†
Codex (v1 / v2) (Chen et al., 2021)

Performance metrics. We evaluate the functional correctness of the generated programs with the pass@k metric, introduced in Kulal et al. (2019). This metric serves as an estimator of a model's generative capabilities under a specific budget. In Chen et al. (2021), the authors propose an updated unbiased version, which we adopt throughout the rest of this work. To avoid any confusion, we calculate pass@k at exactly k attempts. The average of ten runs with different seeds is presented for all experiments in Table 3. We use sampling temperatures of 0.2 / 0.8 for pass@1 / pass@100, which are the optimal values across the tested models.
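For completeness, a minimal sketch of the unbiased pass@k estimator of Chen et al. (2021), written in its numerically stable product form; the variable names and the example numbers are ours:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator for one problem: 1 - C(n - c, k) / C(n, k),
    # where n samples were drawn and c of them pass all unit tests.
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 100 samples per problem, 7 of which are correct.
print(pass_at_k(n=100, c=7, k=1))    # 0.07
print(pass_at_k(n=100, c=7, k=100))  # 1.0, since at least one of the 100 samples is correct

Note that, as stated above, the paper evaluates pass@k with exactly n = k samples.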

Blocks of Influence
Our method treats each coding challenge as a combination of three distinct but complementary blocks rather than a single, homogeneous input. We refer to them as Blocks of Influence and correlate each with a different source of bias during code generation. Taking Figure 1 as an example, we challenge the model to complete a function that reverses a list and then returns its second item.
Name Block. The first block of influence, marked in red, informs the model about the function name and the names and types of the input arguments. Let us assume that initially, a model generates correct solutions to a problem. However, the model fails when we rename the function to something unrelated to the task, e.g., "fun". This failure mode indicates that neither was the problem description understood, nor could the model extract a reasoning pattern from the given usage examples. We associate such cases with memorization effects, where the model relies heavily on the function name, replicating snippets from its training dataset with the same or similar names.
Description Block. The problem description stands as the second block, marked in green. Here the model is expected to form a solution by utilizing its natural language understanding capabilities. We observe that removing specific keywords from the problem description can lead to catastrophic drops in model performance. It is vital that removing these keywords does not degrade the description semantics, and any information lost should be recoverable from the rest of the context. For example, in Figure 1, the removal of the word pair "the list" creates a description that is still perfectly understandable to a human coder. We challenge the model to deduce the missing context from the word "list" in the function name and the list type of the input in the given example. The inability to recover the missing context is associated with an inherent preference bias, where the model relies on superficial lexical clues or frequently co-occurring terms seen during training, rather than the given context, to "mentally" fill any gaps.
Example Block. As the final block, we consider the examples after the problem description. They act as demonstrations, guiding the model to specific reasoning patterns. Let us consider a scenario where a model cannot generate correct code when examples are absent. Arguably, the task description and the given inputs alone were not enough for the model to form a proper understanding of the problem. In this failure mode, the provided examples act as a "reasoning tie-breaker" between candidate solutions the model can generate. Generated solutions are not entirely irrelevant but a relatively poor interpretation of the problem. For example, in Figure 2, when stripped of its examples, the model still exhibits signs of task understanding (i.e., comparing element differences to a threshold, iterating over elements). However, combining these logic parts in a meaningful manner is complex enough that the model requires additional examples to filter out faulty strategies. We associate such effects with poor reasoning abilities.
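To make the decomposition concrete, a small prompt in the spirit of Figure 1 (our own illustrative wording, not the exact challenge from the figure) splits as follows:

Name Block:
    def reverse_second(lst):
Description Block:
        """Reverse the list and return its second item.
Example Block:
        >>> reverse_second([1, 2, 3])
        2
        """

Anonymization would replace reverse_second with func in both the signature and the doctest, while dropping the keyword pair "the list" would leave "Reverse and return its second item.", which remains solvable given the list-typed argument in the example.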

Framework
The first step involves splitting a coding challenge into the three Blocks of Influence. For this purpose, we utilize a regular expression module that searches for common patterns marking each block's start or end (e.g., Name Block: "def (...):"; Description Block: the opening docstring quotes (" or """); Example Block: "Examples:" or > / ≫ followed by a usage of the function name).
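A minimal sketch of such a splitter is given below. The exact regular expressions of the framework are not listed in the paper, so the patterns and helper names here are illustrative assumptions tuned to HumanEval-style prompts:

import re

def split_blocks(challenge: str):
    # Name Block: everything up to and including the last "def <name>(...):" signature.
    signatures = list(re.finditer(r"def\s+(\w+)\s*\([^)]*\)[^:]*:", challenge))
    name_match = signatures[-1]
    name_block = challenge[: name_match.end()]
    func_name = name_match.group(1)

    rest = challenge[name_match.end():]
    # Example Block: starts at "Example(s):" or at the first doctest-style call of the function.
    example_match = re.search(rf"(Examples?:|>>>\s*{re.escape(func_name)}\()", rest)
    if example_match:
        description_block = rest[: example_match.start()]
        example_block = rest[example_match.start():]
    else:
        description_block, example_block = rest, ""
    return name_block, description_block, example_block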
As the next step, the Description Block is further analyzed to identify possible hinting keywords. Ideally, we are interested in unigrams or bigrams that provide excess information towards completing the coding task. For keyword identification, we use KeyBert (Grootendorst, 2020), an LLM-based tool for keyword extraction and word similarity. We fine-tune KeyBert on the open-source CodeParrot dataset (Tunstall et al., 2022a) so that more code-specific suggestions are provided. For each candidate keyword, we calculate its embedding similarity with the set of words [Python, Programming, Code, Variable], again through KeyBert. Words with cosine similarity scores under 0.7 for all items of the set are considered unrelated to coding and are thus filtered out. However, carelessly removing keywords can lead to uninteresting drops in performance caused by removing crucial information rather than by hinting effects. Thus, an additional context-aware filtering stage is employed to validate that any information lost can be retrieved from the remaining coding challenge.
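A sketch of this keyword stage using the off-the-shelf keybert package is shown below; the embedding backbone is an assumption, and we do not reproduce the paper's fine-tuning of KeyBert on the CodeParrot corpus. The 0.7 threshold and the anchor word set follow the text:

from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

ANCHORS = ["Python", "Programming", "Code", "Variable"]
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding backbone
kw_model = KeyBERT(model=encoder)

def candidate_keywords(description: str, threshold: float = 0.7):
    # Extract unigram/bigram keywords, then keep only the coding-related ones.
    candidates = [kw for kw, _ in kw_model.extract_keywords(
        description, keyphrase_ngram_range=(1, 2), stop_words=None)]
    anchor_emb = encoder.encode(ANCHORS)
    kept = []
    for kw in candidates:
        sims = cosine_similarity(encoder.encode([kw]), anchor_emb)[0]
        if sims.max() >= threshold:   # related to coding, so a candidate for removal
            kept.append(kw)
    return kept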
During this stage, we compute each candidate keyword's embedding similarity with every token that is not itself a candidate keyword. The keyword is marked as valid for removal if at least one "close" word is identified. Again, we consider words "close" if their similarity score is larger than 0.7. If a keyword exists in multiple locations, the first instance is not marked as valid for removal, while the rest are. When a keyword happens to be an argument type (i.e., list, integer, tuple), we additionally look for instances of that type in the Example or Name Block. In case of a match, the keyword is safe for removal, as equivalent information already exists in the context. As the final step, we choose between the following transformations:
Drop One. Removes one of the provided keywords from the Description Block. The transformation is repeated N times, where N is the number of identified keywords.
Drop All. Removes all the provided keywords simultaneously from the Description Block.
Drop Examples. Removes all the provided examples from the Example Block.
Anonymize. Replaces the function name with an arbitrary token. We use "func" in our experiments. Note that the function name is also replaced in the provided examples, so no information leak occurs. We also tested whether the choice of "func" may bear some intrinsic adversarial effect associated with the training data: we experimented with other replacement choices ("action", "do stuff", "XYZ") and obtained the same results. Furthermore, we identified instances where the function name, although closely correlated to the task at hand, could be misleading if taken as the sole source of information, signifying the need for proper context understanding by the tested models (see Appendix 10.8).
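As a rough illustration, the transformations can be expressed on top of the split blocks as follows; the helper names are ours, and the real framework additionally applies the context-aware filter described above before any keyword is dropped:

import re

def anonymize(name_block, description_block, example_block, placeholder="func"):
    # Replace the function name everywhere, including in the usage examples.
    func_name = re.findall(r"def\s+(\w+)", name_block)[-1]
    swap = lambda s: re.sub(rf"\b{re.escape(func_name)}\b", placeholder, s)
    return swap(name_block), swap(description_block), swap(example_block)

def drop_keywords(description_block, keywords):
    # Drop All removes every validated keyword; Drop One is the same call with a single keyword.
    out = description_block
    for kw in keywords:
        out = re.sub(rf"\b{re.escape(kw)}\b", "", out, flags=re.IGNORECASE)
    return re.sub(r"[ \t]{2,}", " ", out)  # collapse leftover double spaces

def drop_examples(example_block):
    # Remove the Example Block entirely.
    return ""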
For example, let us apply our framework to the challenge presented in Figure 1. In the first stage, KeyBert would identify the following keywords: [Reverse, list, return, second]. Among these, the word "second" does not reach a 0.7 similarity score against our set and thus does not pass the first filtering stage. In the second stage, each remaining word is compared against all the existing tokens. "Reverse" and "return" are not associated with other tokens. "List" is identified in the function name and the input argument type. Also, since list is a Python type, it is matched against the list type of the input given in the examples. This leaves "list" as the only keyword available for removal. If the keyword drop were combined with anonymization, the drop would still be valid, since the information would remain available in the examples and the input type.
These transformations test the hypotheses we associate with each block, as presented in Section 5.1. Removing possible hints leads to performance drops between the original and modified challenges, revealing underlying biases in the models' logic. Arguably, any of our suggested transformations can destroy local semantics. However, we take significant measures to ensure that global semantics is preserved and that enough information remains for a solution. This is also why we refrain from performing simultaneous transformations in the Example Block and Description Block, or in all of the Blocks of Influence together; a model stripped of all necessary information cannot generate a proper solution. To quantify the degree of ambiguity our transformations may introduce, we employ the LM critic test, inspired by the work of (Yasunaga et al., 2021; Yasunaga and Liang, 2021): we collect a random sample of 200 coding challenges from HumanEval and MBPP. Each challenge is then transformed according to the methods presented in Table 2. Afterwards, for both the original and every modified version of a challenge, we calculate a log probability score using a large language model. The core idea is that the model acts as a soft critic, ranking inputs by their overall plausibility. Modified inputs that seem "off" to the critic and are only partially understood will be assigned a log probability score far lower than the unmodified ones. Since this criterion is based on local neighborhood optimality, only moderate changes are allowed between the challenges under comparison. For example, two completely different but syntactically and semantically correct text snippets can have similar log probability scores; comparing them, however, would violate the locality assumption, and no conclusions could be drawn about their contents. As our critic, we employ the Codex-v2 model (Chen et al., 2021) and calculate the log probability similarity between each modified challenge and its original. Table 2 shows that our transformations do not introduce drastic changes to the coding challenge. Even for the most aggressive transformation, Anonymization + Drop All, the critic assigns over 94% similarity between affected code challenges and their original form. For comparison, removing the context-aware filtering stage leads to only 78% similarity in the case of the Anonymization + Drop All transformation. We believe this is a fair indicator that the tested models observe inputs of similar quality and comprehensibility during our experiments. Note that we omit results for the Drop Examples method: in this case, the log probabilities change significantly because many tokens are removed, which violates the method's locality prerequisite.
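The paper does not spell out the exact similarity formula, so the sketch below is only one plausible instantiation: it scores a prompt by its average token log probability under a causal LM (an open model stands in for Codex-v2, which is API-only) and reports the ratio of the two scores, ordered so that the value cannot exceed 1.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                 # assumed stand-in critic
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def mean_log_prob(text: str) -> float:
    # Average per-token log probability of the text under the critic.
    ids = tok(text, return_tensors="pt").input_ids
    out = lm(ids, labels=ids)
    return -out.loss.item()   # the loss is the mean negative log likelihood

def critic_similarity(original: str, modified: str) -> float:
    lp_o, lp_m = mean_log_prob(original), mean_log_prob(modified)
    hi, lo = max(lp_o, lp_m), min(lp_o, lp_m)   # both values are negative
    return hi / lo                              # 1.0 when the critic scores both identically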

Results on Block Transformations
The main results of our experiments are presented in Table 3. Despite their simplicity, our transformations cause consistent drops in performance across different model sizes on both datasets. Mere anonymization causes drops of 19% on average in both pass@1 and pass@100 metrics, validating our claims of memorization effects. Single (Drop One) and full keyword removal (Drop All) reduce models' performance by 15% and 22% on average, suggesting their inability to deduce the missing context from the Name Block and Example Block. Instead, models rely on generating arbitrary, commonly used snippets that vaguely fit the task. Especially interesting are the cases of Drop Examples and Anonymize + Drop Examples, with 15% and 25% average drops. Both transformations remove the information provided by the docstring examples, with the latter having the additional restriction of an anonymized function. With the Description Block unmodified in both cases, these transformations target the models' abilities to create solutions based on their natural language understanding. The combination of anonymization with the drop of all keywords (Anonymize + Drop All) is the most challenging transformation overall, with drops of approximately 40%. Its primary purpose is to assess the model's capability of deducing the missing context of the Description Block by only observing patterns in the examples.
These observations suggest a clear model preference over its sources of information, with the task description being the primary one. Thus, when a model exhausts its ability to understand the task, it exploits similarities of the function name with previously seen code solutions. Simultaneously, the model's reasoning relies on the example demonstrations, which, as seen from (Anonymize + Drop All), are not always able to provide clear directives.

Towards Bias Mitigation
Inspired by the field of adversarial training, we investigate the effects of using our framework's transformations as training augmentations. To this end, we apply our framework to examples of the MBPP challenge and use them as a fine-tuning dataset for three different Codeparrot models. We use HumanEval as our test dataset, which has no overlap with MBPP; in this way, our models have not seen examples of the test set during their training or fine-tuning steps. In Table 4, we compare the results of our models before and after fine-tuning. Models benefit from the introduction of augmented examples and partially recover from failure modes caused by the need to rely on hints. The larger the model, the more its abilities benefit. We believe this effect is closely related to the scaling of large language models' reasoning capabilities with their parameter size. The need to rely on hints can be attributed to low data quality or a lack of task-specific inductive biases. However, the capacity to properly understand coding tasks is undoubtedly there. To improve the code generation abilities of models, we thus suggest exposing them to challenges that push their deductive and reasoning abilities. We repeated the experiments without including any of our data augmentation techniques during fine-tuning and observed that, under this setup, models do not exhibit any significant improvement against our method's perturbations. This suggests that our data augmentations, which push the reasoning limits of the models, are the key factor behind the observed gains.
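As an illustration, the augmentation step could be wired together as follows, reusing the illustrative helpers sketched in the Framework section (split_blocks, candidate_keywords, anonymize, drop_keywords, drop_examples); the sampling strategy and fine-tuning hyperparameters are our assumptions, not the paper's exact recipe:

import random

TRANSFORMS = ["anonymize", "drop_one", "drop_all", "drop_examples"]

def augment_challenge(challenge: str) -> str:
    # Produce one perturbed variant of an MBPP challenge for fine-tuning.
    nb, db, eb = split_blocks(challenge)
    keywords = candidate_keywords(db)
    choice = random.choice(TRANSFORMS)
    if choice == "anonymize":
        nb, db, eb = anonymize(nb, db, eb)
    elif choice == "drop_one" and keywords:
        db = drop_keywords(db, [random.choice(keywords)])
    elif choice == "drop_all" and keywords:
        db = drop_keywords(db, keywords)
    else:
        eb = drop_examples(eb)
    return nb + db + eb

# Fine-tuning corpus = original MBPP prompts plus their perturbed variants.
# augmented = [augment_challenge(c) for c in mbpp_prompts]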

Effects of Longer Context
When causally trained on coding datasets, models condition on multiple functions and declarations in the same file. The input is a conglomerate of rapidly changing contexts, with each function or class being a self-contained entity. Consequently, a model trained on such data becomes accustomed to localizing its focus. As an extension of our previous experiment, we measure the effects of using a long-description dataset, DMCC, as a fine-tuning target. By training on long natural language descriptions, we promote the context-deducing skills of the model under test. A model able to widen its focus can avoid distractions caused by missing keywords, and efficient context understanding can replace heavy reliance on internal biases. We choose Bloom as the model under test since it was not explicitly tuned for code generation but rather for general language understanding. In Table 5, we present results of fine-tuning on MBPP, modified by our framework, and observe performance improvements similar to those in Table 4. We attribute this to the restricted focus of the training data (exclusively Python3 code) and to architectural differences between the models. We believe that merging the benefits of our two proposed setups can serve as an interesting direction towards model resilience in code generation scenarios.

Conclusions
We present a simple approach to isolating cues and benchmarking the reasoning of code generation models through input-level transformations. Our method treats code examples as a combination of three blocks, each providing different cues to the model. We show that minor transformations can lead models to failure, signifying the existence of biases. Our framework can automatically identify and remove keywords responsible for indirect hinting. We show that popular models with solid results on challenging coding benchmarks are susceptible to our tests, with their performance degrading noticeably. Moreover, we studied the effects of utilizing our proposed transformations during the fine-tuning of a model. Models can benefit from our proposed changes, with the effect proportional to their parameter size. We believe that, despite their success, code generation systems with LLMs as backbones inherit some of their biases and modes of failure. Training on structured and well-documented code, combined with our proposed techniques, is a promising direction towards reliable code generation. Although an ideal fit for competition-style challenges, our method can be extended to support less formatted, high-quality codebases (e.g., GitHub repositories). For a short analysis, see Section 10.1 of the Appendix.

Figure caption: Using only the problem description, the model creates partially informed subparts (any derives from "if there are", sum(x) == 0 from "sum to zero", and for x in l from "elements in the list") that are not combined correctly to solve the task (bottom), signifying that hints from the function name / examples were used in the correct solution (top).

Limitations
Some limitations and possible research directions exist in our work. Our study focuses on the Python3 programming language, while many coding challenges exist in other popular languages (e.g., C, C++, Java, Scala). Although the Blocks of Influence identification mechanism could easily be adapted to each case, an off-the-shelf application of our framework to another language would lead to errors.
Similarly, the framework assumes that each coding challenge is in a "competition-style" format, meaning that a proper problem description, in-docstring examples, and explicit input types are present for each example. In Appendix Section 10.1, we discuss how an adaptation to less formatted codebases would be possible, but for now, we leave it as a future investigation.
Finally, there is no guarantee that the improved performance against the suggested perturbations reflects an equivalent performance increase in real-world code assistant applications. Real-time coding suggestions and completions that are more user-aligned are out of the scope of this work.

Risks and Ethical Considerations
Our research aims to discover and remove biases in code-generation scenarios through adversarial intervention. However, we acknowledge that insecure or malicious code can still be generated after fine-tuning with our suggested augmentations. Furthermore, our work focuses only on cognitive biases that affect the reasoning and logic behind the coding process of large language models. Social biases and stereotypes can still appear when general-purpose LLMs such as Codex or Bloom are used in typical text generation scenarios. Signs of robustness against our methods should not be mistaken for an indication that other forms of bias are absent. For all of our perturbation experiments, we utilize the above-mentioned models, and we comply with their respective licenses and intended use (generating code completions in Python3). This also holds for Codeparrot and Bloom, for which we create fine-tuned versions. Furthermore, we do not plan to repack or redistribute any of the used datasets. We plan to release the codebase of this work as an open-source project.

Information on Experimental Setup
Our experimental setup consisted of 4x NVIDIA V100 GPUs. Regarding the results of Table 3, the computing time of each table entry was influenced by the model size, the k value of the pass@k metric (number of generations), the perturbation method, and the dataset tested. Specifically, for the Drop One / Anonymize + Drop One methods, the experiment was repeated N times, where N corresponds to the number of keywords identified. This results in approximately four times slower experiments for those perturbations, since in both HumanEval and MBPP, four keywords on average were identified per problem (see Table 8). API calls to the Codex and Bloom models were subject to throttling limits, and waiting loops were used to handle rate-limited requests.
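A minimal example of such a waiting loop, using exponential backoff around an arbitrary API call; the request function and the caught exception are placeholders, not a specific vendor API:

import time

def call_with_backoff(request_fn, max_retries=8, base_delay=2.0):
    # Retry a rate-limited API call, doubling the wait after each failure.
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception:   # in practice, catch the provider's rate-limit error type
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("API request kept failing after retries")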
Our goal is to detect whether augmentations cause visible changes in the attention patterns over the Blocks of Influence. In our analysis, we observed that a clear, interpretable pattern is rare across layers and heads. This result is in accordance with the visualizations provided in (Li et al., 2022), where a far stronger model exhibits patterns that are not always intuitive. In Figures 15, 16, 17, and 18, we observe only minor differences between the non-finetuned and finetuned versions. The underlying changes in the reasoning processes of our coding models are not directly visible in attention maps. Reasoning should be viewed as an effect emergent from multiple interactions across layers and heads, and thus cannot always be located in a specific part of them.

Figure 1: Left: The three Blocks of Influence: Name Block in red, Description Block in green, and Example Block in blue. Right: We demonstrate three possible transformations, one for each block: swap the function name with "func", remove keywords, and remove examples. Transformations can be applied alone or in combinations of two, as described in Section 5.2.

Figure 2: Example removal reveals poor reasoning (Example Drop / Codex-v1). The model initially exhibits signs of task comprehension (top), generating a correct solution. Removing the examples, however, reveals a lack of proper reasoning; although the model still understands that it has to compare numbers, it resorts to a naive sequential check instead of comparing each available pair (bottom).
Figure 3: Keyword hinting (Drop All / Bloom 175B). After the removal of keywords, the context remains intact: the "two strings" keyword can be inferred by observing the function arguments, and the "binary"/"string" keywords from the examples and the return type signature of the function. Nevertheless, the model fails to generate a correct solution (bottom).

Figure 7: Instance of dropping the prompt examples on CodeParrot-1.7B.

• DB: All tokens belonging to the Description Block
• EB: All tokens belonging to the Example Block
• GE: The tokens generated so far by the model (the solution)
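A hedged sketch of how the attention mass falling on each token group can be aggregated with a Hugging Face causal LM; the checkpoint and the index-grouping convention are our assumptions:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small")      # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small",
                                             output_attentions=True).eval()

@torch.no_grad()
def attention_per_group(prompt: str, group_ids: dict) -> dict:
    # Attention mass that the final token places on each group of prompt positions.
    # group_ids maps a group name (e.g. "DB", "EB") to the token indices it covers.
    ids = tok(prompt, return_tensors="pt").input_ids
    attn = model(ids).attentions                   # one (1, heads, seq, seq) tensor per layer
    avg = torch.stack(attn).mean(dim=(0, 2))       # average over layers and heads -> (1, seq, seq)
    last_row = avg[0, -1]                          # attention of the last token over all positions
    return {name: last_row[idx].sum().item() for name, idx in group_ids.items()}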

Table 2: Similarity scores for the different methods of our framework.


Table 3: Model results on HumanEval and MBPP.

Table 4: HumanEval results of fine-tuning Codeparrot on the MBPP dataset with (A) or with no (NA) augmentations. Regular fine-tuning does not contribute to bias removal, achieving similar results against the perturbations. However, our suggested augmentations lead to higher model performance, especially in the pass@100 metric. The average of 15 runs is presented. Bold marks statistically significant improvements under the T-test (Before versus After-A) with α = 0.95.

Table 7: URLs and licenses of the used datasets.

Table 8: Datasets used in experiments. We present the number of problems, the number of tests per problem, the average length of the challenge description, and the average number of distinct keywords identified by our framework.
10.5 Quantitative Results

We present our full results table, including the CodeParrot (110M) and Codex (v1) results. Note that experiments involving the large version of the Bloom model were run only once for the pass@100 metric due to restrictions on API request limits.

Table 9: First part of results on HumanEval and MBPP datasets, for four tested models.

Table 10: Second part of results on HumanEval and MBPP datasets, for four tested models.

Algorithm 1 Block of Influence Splitting
1: cc : Code Challenge Instance
   # Locate the function name, which is the next token after the last matched "def", and keep its start and end index.
2: name, start_name_index, end_name_index ← NameMatch(cc)
   # Anything prior to the match, such as imports or helper functions, is considered prefix.
3: prefix ← cc[: start_name_index]
   # Look for tokens such as (Example, example, >, ≫). If no matches were found, look for uses of the function name in the challenge.
4: if ExampleMatch(cc[end_name_index :]) ≠ None then
5:   examples, start_example_index ← ExampleMatch(cc[end_name_index :])
6: else
7:   examples, start_example_index ← FunctionMatch(cc[end_name_index :])
8: end if
   # The description should fall between the function name and the examples.
9: description ← cc[end_name_index : start_example_index]
   # Form the blocks and return.
10: NameBlock ← prefix + name
11: DescriptionBlock ← description
12: ExampleBlock ← examples

Algorithm 2 Keyword Identification
1: KB : The KeyBert model
2: nb : Name Block
3: db : Description Block
4: eb : Example Block
5: kw ← ∅ (Keywords)
6: fkw ← ∅ (Filtered Keywords)
   # Use the model to extract some initial unigram and bigram keywords.
7: kw ← KB(db)
   # Filter out keywords not related to coding.
8: for i in kw do
9:   if cossim(i, [Python, Programming, Code, Variable]) > 0.7 then
10:    if stem(i) ∈ [nb, eb] or equiv(i) ∈ [nb, eb] then

In this section, we present illustrations of attention patterns. We use Codeparrot (330M) as our target model, before and after the combined fine-tuning process, and create visualizations for two coding challenges. The first challenge is: