RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

The task of repository-level code completion is to continue writing unfinished code based on the broader context of a repository. Yet it is difficult for automated code completion tools to utilize the useful information scattered across different files. We propose RepoCoder, a simple, generic, and effective framework to address this challenge. It streamlines the repository-level code completion process by combining a similarity-based retriever with a pre-trained code language model, which allows effective utilization of repository-level information for code completion and grants the ability to generate code at various levels of granularity. Furthermore, RepoCoder employs a novel iterative retrieval-generation paradigm that bridges the gap between the retrieval context and the intended completion target. We also propose RepoEval, a new benchmark consisting of the latest high-quality real-world repositories and covering line, API invocation, and function body completion scenarios. We test the performance of RepoCoder using various combinations of code retrievers and generators. Experimental results indicate that RepoCoder significantly improves the zero-shot code completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research.


Introduction
In real-world software production, it is crucial for developers to be aware of other files within the repository during programming. This challenge gives rise to the task of repository-level code completion, where automated tools are expected to utilize the broader context of a repository, rather than relying solely on in-file information, to complete unfinished code. Code files within a repository often exhibit interrelated dependencies, including shared utilities, configurations, and cross-API invocations resulting from modularization (Tu et al., 2014). Additionally, each repository typically follows customized naming conventions and coding styles (Zou et al., 2019), which contribute to enhanced readability and maintainability. However, developing effective repository-level code completion tools remains an open problem. Although approaches relying on static code analysis and heuristic rules (Raychev et al., 2014; Svyatkovskiy et al., 2019, 2021) can reliably parse specific repository context, they are limited in the completion scenario: they cannot produce completions of varying lengths at arbitrary positions in a file. Meanwhile, studies (Hellendoorn and Devanbu, 2017; Svyatkovskiy et al., 2020; Ding et al., 2022) that tune language models on labeled data excel in their respective evaluation scenarios but face challenges generalizing to unseen repositories without retraining.

Figure 1: Illustration of the In-File code completion method, the repository-level Retrieval-Augmented Generation (RAG) method, and the iterative retrieval-generation RepoCoder method.
In this paper, we propose an approach that leverages off-the-shelf retrievers to locate valuable information within a repository and enhance the context for language models. We introduce a novel framework called RepoCoder that aims to improve both code retrieval and completion performance. As depicted in Figure 1, we enhance the conventional In-File code completion method with the Retrieval-Augmented Generation (RAG) technique, which searches the repository for relevant code snippets to assist in generating the completion. On top of this, RepoCoder employs an iterative pipeline that uses the generated code completion to improve the retrieval process, thus bridging the gap between the retrieval context and the intended completion target. Figure 2 provides an example that illustrates the rationale behind our design. Relying solely on the unfinished code is insufficient to retrieve useful information from the repository: in the example, the model improvises a statement calling the COLMAP API in the first iteration, and the predicted parameters are reasonable yet incorrect. This is because the incomplete code preceding the completion point does not serve as an adequate retrieval query for the intended target. However, by performing a subsequent retrieval from the repository using the model's generated completion, we can successfully retrieve the target API signature and complete the code effectively.
Furthermore, we introduce the RepoEval benchmark for evaluating the repository-level code completion task, constructed from the latest high-quality repositories sourced from GitHub (https://github.com). RepoEval addresses the lack of established benchmarks in the repository-level scenario. Notably, it is the first benchmark that encompasses three levels of code completion granularity: line, API invocation, and function body. We also leverage unit tests present in the repositories to improve evaluation accuracy, overcoming the limitations of similarity-based metrics. To rigorously validate the effectiveness of RepoCoder, we conduct extensive experiments using language models of varying sizes, including GPT-3.5-Turbo (https://platform.openai.com/docs/models/gpt-3-5) and CODEGEN (Nijkamp et al., 2022). Experimental results demonstrate that RepoCoder achieves significant improvements over In-File completion, surpassing the baseline by over 10% across different experimental settings. Moreover, our iterative framework consistently enhances the performance of vanilla retrieval-augmented generation. We also provide a comprehensive analysis of the effectiveness and limitations of RepoCoder, offering insights for future research. Our contributions can be summarized as follows:

• We propose RepoCoder, a novel iterative retrieval-generation framework for the repository-level code completion task.
• We introduce the RepoEval benchmark, enabling the evaluation of repository-level code completion with varying levels of granularity and improved evaluation accuracy through the utilization of unit tests.
• Through rigorous experimentation, we demonstrate that RepoCoder significantly outperforms the In-File code completion paradigm and enhances the performance of vanilla retrieval-augmented generation.

Overall Framework
The task of code completion using a language model M can be generally described as Ŷ = M(X), where Ŷ represents the predicted tokens and X corresponds to the in-file unfinished code.
By introducing an additional code retrieval model R, we can transform the code completion pipeline into a Retrieval-Augmented Generation (RAG) approach. Initially, we establish a retrieval database by partitioning the code files from the repository into a collection of code snippets C_repo. Subsequently, we utilize the retrieval model R to extract the most relevant code snippets from C_repo, employing the unfinished code X as the retrieval query. This yields a set of retrieved code snippets C_ret = R(C_repo, X). Following this, we leverage the language model M to perform code completion, resulting in the prediction Ŷ = M(C_ret, X). Consequently, we are able to incorporate repository-level contextual information during code completion. However, using the unfinished code X as the sole retrieval query introduces a gap between the retrieval context and the intended completion target, as exemplified in Figure 2. To address this limitation, we propose RepoCoder, an iterative retrieval-generation pipeline designed to further enhance the vanilla RAG method. Specifically, for the i-th retrieval-generation iteration (i > 1), RepoCoder utilizes the previous model prediction Ŷ_{i−1} to construct a new query for the retrieval process. This produces another set of relevant code snippets C^i_ret = R(C_repo, X, Ŷ_{i−1}) and a new prediction Ŷ_i = M(C^i_ret, X). The newly generated completion can either serve as the output of RepoCoder or be used in the subsequent retrieval-generation iteration.
Importantly, the parameters of M and R remain unchanged throughout the entire process, and no static code analysis tools or heuristic rules are required to construct the retrieval database. In the following subsections, we provide a detailed explanation of the code retrieval process (Section 2.2) and the code generation process (Section 2.3).
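The iterative pipeline described above can be sketched in a few lines. Here `retrieve` and `generate` are hypothetical stand-ins for the retrieval model R and the language model M; the function names and the string-based query grounding are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the RepoCoder retrieve-then-generate loop.
# `retrieve` and `generate` are hypothetical stand-ins for R and M.

def repocoder_complete(x, repo_snippets, retrieve, generate, iterations=2):
    """Run `iterations` rounds of retrieve-then-generate and return Y_hat."""
    query = x                                   # iteration 1: query is X itself
    prediction = ""
    for _ in range(iterations):
        c_ret = retrieve(repo_snippets, query)  # C_ret = R(C_repo, query)
        prediction = generate(c_ret, x)         # Y_hat = M(C_ret, X)
        query = x + "\n" + prediction           # ground the next query on Y_hat
    return prediction
```

With `iterations=1` this reduces to vanilla RAG; `iterations=2` is the default RepoCoder setting evaluated in the paper.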

Code Retrieval
The retriever utilized within the RepoCoder framework can be any model capable of searching for relevant documents given a query. To construct the retrieval database, we employ a sliding window approach: the window traverses the files in the repository and extracts contiguous lines of code that fit within the window size, denoted S_w. At each step, the window advances a fixed number of lines, referred to as the sliding size, denoted S_s.
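A minimal sketch of this sliding-window database construction, assuming the repository is given as a mapping from file paths to source text (the dictionary layout of each snippet is an illustrative choice):

```python
def build_retrieval_database(files, window_size=20, slide_size=10):
    """Split repository files into overlapping snippets with a sliding
    window of `window_size` lines (S_w) that advances `slide_size`
    lines (S_s) per step. `files` maps file paths to source text."""
    snippets = []
    for path, source in files.items():
        lines = source.splitlines()
        last_start = max(len(lines) - window_size, 0)
        for start in range(0, last_start + 1, slide_size):
            snippets.append({
                "path": path,
                "start": start,
                "code": "\n".join(lines[start:start + window_size]),
            })
    return snippets
```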
During the initial retrieval process, when no model prediction is available, the query is formulated using the last S_w lines of the unfinished code X. The most similar code snippets are then retrieved, yielding C^1_ret = R(C_repo, X). However, a gap exists between the retrieval context, which is based on X, and the intended completion target, which is to continue writing X. A possible remedy is to adjust C^1_ret by shifting each retrieved snippet down by a few lines to include the subsequent code. Although this shifting approach has shown effectiveness in previous work (Lu et al., 2022), indiscriminately shifting all retrieved snippets without considering their content may not always be appropriate.
To address this issue, RepoCoder augments the retrieval query during the i-th iteration (i > 1) with the previously generated code Ŷ_{i−1}. Despite lacking customized information about new repositories, pre-trained code language models have demonstrated impressive general-domain understanding and generation capabilities, so the generated code Ŷ_{i−1} can provide valuable supplementary information for retrieval even though its correctness is not guaranteed. Therefore, for the i-th retrieval iteration (i > 1), the query is constructed by concatenating the last (S_w − S_s) lines of X with the first S_s lines of Ŷ_{i−1}. This yields the grounded retrieval results C^i_ret = R(C_repo, X, Ŷ_{i−1}).
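The query construction for later iterations follows directly from the description above. The default values mirror the S_w = 20, S_s = 10 setting reported in the hyper-parameter section; everything else here is a straightforward sketch:

```python
def build_iterative_query(x, prev_prediction, window_size=20, slide_size=10):
    """Query for iteration i > 1: the last (S_w - S_s) lines of the
    unfinished code X followed by the first S_s lines of the previous
    model prediction Y_hat_{i-1}."""
    tail_of_x = x.splitlines()[-(window_size - slide_size):]
    head_of_pred = prev_prediction.splitlines()[:slide_size]
    return "\n".join(tail_of_x + head_of_pred)
```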

Code Generation
The generator employed within the RepoCoder framework can be any pre-trained language model capable of predicting subsequent tokens given a prompt. As mentioned earlier, it is crucial to incorporate both the repository context C_repo and the context within the target file for effective code completion. This enables the model to leverage grounding information and enhances its generalization to unseen repositories.
In the RepoCoder framework, we retrieve the most relevant code snippets, denoted C_ret, from the repository and concatenate them with the unfinished code X. To ensure readability and comprehension, we create a prompt template that seamlessly integrates X and C_ret, as illustrated in Figure 3. The retrieved code snippets are arranged in ascending order of their similarity scores to the query. Each snippet is accompanied by its original file path, and the maximum number of snippets included in the prompt, denoted K, depends on the available prompt length. Ultimately, the prompt contains as much relevant information as possible to facilitate code completion.
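The prompt assembly can be sketched as below. The header wording, the character budget, and the rule of keeping the highest-similarity snippets when the budget is tight are all illustrative assumptions; the paper's exact template appears in its Figure 3.

```python
def build_prompt(x, retrieved, max_snippets=10, max_chars=6000):
    """Assemble the completion prompt: up to `max_snippets` retrieved
    snippets in ascending similarity order (most similar placed last,
    adjacent to X), each labelled with its file path, followed by the
    unfinished code X."""
    kept, used = [], 0
    for snip in retrieved[:max_snippets]:        # assumed sorted: most similar first
        block = f"# Snippet from {snip['path']}:\n{snip['code']}\n"
        if used + len(block) > max_chars:        # drop the least useful first
            break
        kept.append(block)
        used += len(block)
    # Reverse so snippets appear in ascending similarity, ending with X.
    return "\n".join(list(reversed(kept)) + [x])
```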

Benchmark Construction
To facilitate the evaluation of code completion tools in the repository-level scenario, we propose the RepoEval benchmark. It is carefully constructed from the latest high-quality repositories on GitHub and encompasses three levels of code completion granularity: line, API invocation, and function body. To assess the correctness of completed functions, we utilize unit tests present in the repositories instead of relying solely on similarity-based metrics. Each sample in RepoEval is annotated with the corresponding source repository, file path, line numbers, and ground truth completion. For analysis and unit test execution, complete copies of the repositories are archived as of January 2023.
To construct RepoEval, we first meticulously curate a collection of Python repositories from GitHub that satisfy the following criteria: an open-source license, created after January 1, 2022, non-fork original repositories, over 100 stars, over 80% of files written in Python, and explicit unit tests. Furthermore, to mitigate potential biases, we employ a random selection process for the repositories and create three distinct datasets for line completion, API invocation completion, and function body completion. Additional details regarding the selected repositories can be found in Table 1.
Line Completion: In adherence to the conventions of code completion benchmarks (Lu et al., 2021, 2022), we implement the line completion scenario. First, according to the above-mentioned criteria, we select 8 repositories that vary in size and cover different domains. Then we randomly select 200 lines to complete from each repository, ensuring the lines are non-repetitive, are not code comments, and each contain at least 5 tokens. Eventually, a total of 1,600 test samples are generated for the line completion dataset.
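The per-line sampling criteria can be expressed as a simple filter. The whitespace split below is a stand-in for the benchmark's actual tokenizer, and the non-repetition check is handled separately during sampling; both are assumptions for illustration.

```python
def eligible_target_line(line, min_tokens=5):
    """Apply the sampling criteria above: skip blank lines and code
    comments, and require at least `min_tokens` tokens (whitespace
    split used as a stand-in tokenizer)."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return False
    return len(stripped.split()) >= min_tokens
```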

API Invocation Completion:
We also test the API completion scenario, focusing especially on in-repository defined APIs. This is a harder problem than completing built-in or third-party APIs due to the lack of customized training data (Hellendoorn et al., 2019). We utilize the same group of repositories as in the line dataset and parse the target repositories to locate invocations of in-repository APIs. From these candidates, we randomly select 200 non-repetitive API invocations from each repository, resulting in a total of 1,600 test samples for the API invocation completion dataset.
Function Body Completion: Alongside the line and API completion evaluations, we also assess the ability to complete function bodies, which requires executing unit tests present in the repository. However, running tests can be time-consuming and computationally expensive. To address this, we randomly select a separate set of smaller-scale repositories that are easy to deploy. Within these repositories, we locate functions covered by unit tests and select function bodies containing 3 to 30 lines of code to complete. This yields a total of 373 test samples for the function body completion dataset.

Methods for Comparison
In-File Completion: Previous studies (Chen et al., 2021; Nijkamp et al., 2022; Chen et al., 2022) have demonstrated the effectiveness of utilizing large pre-trained language models for code generation in a zero-shot manner, conditioned on the provided context. Furthermore, it has been established that incorporating in-file context is beneficial for code completion (Clement et al., 2021). Hence, as a baseline, we implement an In-File completion method by populating the prompt with the unfinished code and directly using the pre-trained code generation model to predict the completion.
Oracle Method: A key contribution of RepoCoder is the integration of model predictions into retrieval, bridging the gap between the retrieval context and the intended completion target. To showcase the effectiveness of this approach, we devise an oracle retrieval-augmented generation method for comparison. This method performs a single retrieval to obtain relevant code snippets C^gt_ret, using the last (S_w − S_s) lines of X and the first S_s lines of the ground truth code Y. The completion Ŷ is then generated as M(C^gt_ret, X). This gives the upper bound of performance for RepoCoder, conditioned on the retrieval model R and the generation model M.

Implementation Details
Retrieval Model: For our main experiments, we employ a sparse bag-of-words model as the retrieval model, which has demonstrated effectiveness in retrieving similar code snippets (Lu et al., 2022). This model transforms the query and candidate code snippets into sets of tokens and calculates their similarity using the Jaccard index (Jaccard, 1912), computed as Jaccard(S_q, S_c) = |S_q ∩ S_c| / |S_q ∪ S_c|, where S_q and S_c represent the token sets of the query and candidate code snippets, respectively. We also experiment with a dense retriever based on UniXcoder (Guo et al., 2022), detailed in Appendix B.
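The Jaccard scoring reduces to a few lines of code. The whitespace tokenizer here is a stand-in for whatever tokenization the retriever actually uses:

```python
def jaccard(query_code, candidate_code):
    """Jaccard index |S_q ∩ S_c| / |S_q ∪ S_c| over token sets."""
    s_q, s_c = set(query_code.split()), set(candidate_code.split())
    union = s_q | s_c
    return len(s_q & s_c) / len(union) if union else 0.0
```

Ranking the database then amounts to sorting candidate snippets by this score against the query.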

Generation Model:
We evaluate RepoCoder using four pre-trained language models with varying code generation capabilities. The first, GPT-3.5-Turbo, is a state-of-the-art commercial code generation model with billions of parameters, pre-trained on an extensive code corpus; we access it through the API provided by OpenAI. The second, CODEGEN, is an open-source code generation model with multiple published versions of varying model sizes and training data. In our experiments, we utilize three versions of the CODEGEN model, with 6B, 2B, and 350M parameters.
Hyper-parameters: We found that RepoCoder's performance was not highly sensitive to changes in hyper-parameters, so for our experiments on RepoEval we assign hyper-parameters based on experience. Specifically, the maximum number of tokens for the combined input prompt and output prediction is set to 4,096 for GPT-3.5-Turbo and 2,048 for CODEGEN. The length of retrieved code snippets is set to half the prompt length. For line and API completion, the maximum number of tokens in the generated completion (Ŷ), the line length of the sliding window (S_w), and the sliding size (S_s) are set to 100, 20, and 10, respectively. For function body completion, these values are adjusted to 500, 50, and 10. The maximum number of retrieved snippets (K) is set to 10. The same hyper-parameters were used for the single-iteration RAG, iterative RepoCoder, and Oracle baselines, ensuring a fair comparison between methods. Notably, given that these parameters are intricately linked to the programming language and contextual scenario, practitioners should adjust them to ensure optimal real-world performance.

Evaluation Metrics
Similarity-based Evaluation: Following established practices in code completion research (Lu et al., 2021, 2022), we evaluate our line and API completion datasets using two metrics: Exact Match (EM) and Edit Similarity (ES). The EM score is a binary metric that takes the value 1 if the predicted code exactly matches the ground truth code and 0 otherwise. The ES metric provides a more fine-grained evaluation and is calculated as ES = 1 − Lev(Ŷ, Y) / max(|Ŷ|, |Y|), where Lev represents the Levenshtein distance (Levenshtein et al., 1966).
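Both metrics are straightforward to reproduce. This sketch computes EM and ES exactly as defined above, with a textbook dynamic-programming Levenshtein distance:

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def exact_match(pred, target):
    """EM: 1 if the prediction matches the ground truth exactly, else 0."""
    return int(pred == target)

def edit_similarity(pred, target):
    """ES = 1 - Lev(pred, target) / max(|pred|, |target|)."""
    if not pred and not target:
        return 1.0
    return 1 - levenshtein(pred, target) / max(len(pred), len(target))
```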
Execution-based Evaluation: For the function body completion dataset, we utilize the unit tests present in the repository to evaluate functional correctness. This approach is more reliable than similarity-based metrics in assessing the behavior of completed functions. While collecting unit tests can be time-consuming, we focus on a realistic scenario and use the unit tests available in GitHub repositories to validate the generated code.
We execute the completed code and report the Pass Rate (PR), where PR is 1 if the code passes all the corresponding test cases, and 0 otherwise.

Line and API Completion Datasets
We compare the performance of RepoCoder with the In-File completion method and the Oracle method on the line and API invocation completion datasets, using four pre-trained language models and different numbers of retrieval-generation iterations. The results show consistent gains for RepoCoder over In-File completion. We also replace the sparse retriever with a dense retriever based on UniXcoder (Guo et al., 2022) (detailed in Appendix B) and find that the simple sparse retriever achieves equivalent performance, highlighting the robustness of RepoCoder across different code retrieval and generation models.

Function Completion Dataset
We proceed to assess the performance of RepoCoder on the function body completion dataset.
To tackle the greater difficulty of function body completion, we employ the GPT-3.5-Turbo model, owing to its superior code understanding and generation capabilities and its larger prompt length, which suits longer function code snippets. The evaluation results align with our findings on the line and API invocation completion datasets. Across most repositories, RepoCoder exhibits significant improvement over the In-File completion method and competitive performance compared to the Oracle method. Moreover, with additional retrieval-generation iterations, RepoCoder consistently outperforms the vanilla Retrieval-Augmented Generation (RAG) approach. These results reaffirm the effectiveness of our approach.

Analysis
In this section, we conduct further analyses on the retrieved code snippets to gain a deeper understanding of RepoCoder and provide valuable insights for future research.

Quality of Retrieved Code
We observe that the quality of the retrieved code has a significant impact on code completion performance, and the most helpful snippets typically contain code statements similar to the target completion or demonstrate example usages of the target API invocation. To validate the correlation between retrieval quality and completion performance, we design an analysis experiment on the API invocation completion dataset. In this experiment, we leverage a static code analysis tool to locate code snippets in other files that include invocations of the ground truth API. We then rank these snippets by their similarity to the unfinished code and select the most similar ones to include in the completion prompt. We refer to this method as GT-Code and compare its performance against the In-File and RepoCoder methods. Additionally, we report the recall of RepoCoder by counting the number of retrieved snippets containing invocation examples of the ground truth API.

Table 5: Locations of retrieved code snippets when the Oracle/RepoCoder method outperforms the In-File completion method using GPT-3.5-Turbo on the line and API completion datasets.
Since not every API in the test dataset has invocation examples in other files, and we exclude invocation examples already present in the input prompt, we extract 1,046 and 1,083 eligible test samples from the API invocation dataset for the GPT-3.5-Turbo and CODEGEN models, respectively. From the results in Table 4, we observe that the GT-Code method, which utilizes ground truth API invocation examples, generally achieves the best performance among all methods. Furthermore, RepoCoder with two iterations exhibits higher recall of ground truth API invocations than a single iteration, which likely contributes to its superior code completion performance. Notably, as the language model grows more powerful, the recall value using RepoCoder Iter-2 also increases, indicating that the model predictions indeed assist the retrieval process and emphasizing the effectiveness of RepoCoder.

Locations of Retrieved Code
The retrieval of code snippets provides valuable contextual information from other files to enhance the context for language models. We conduct a separate experiment to study the locations from which effective retrieval occurs. Specifically, we select test samples that are successfully predicted by the Oracle/RepoCoder method but not by In-File completion using GPT-3.5-Turbo. This yields a number of eligible test samples and retrieved code snippets for the line and API invocation completion datasets. To determine the original source of these snippets, we adopt a classification scheme inspired by Shrivastava et al. (2022), consisting of five distinct file locations:

1. Imported: code from a file imported by the target file.
2. Current File: code from the excluded content of the target file.
3. Current Directory: code from a file in the same directory as the target file.
4. Similar Import: code from a file sharing at least one API import with the target file.
5. Similar Name: code from a file whose name shares at least one token with the target file (assuming snake-case file names).
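The five categories can be checked mechanically. This is a hypothetical classifier sketch: the function name, the exact matching rules, and the input conventions (`imported_files` as the set of file paths imported by the target file, `*_imports` as the sets of API names each file imports) are all illustrative assumptions rather than the paper's implementation.

```python
import os

def classify_locations(snippet_path, target_path, imported_files,
                       target_imports, snippet_imports):
    """Return the set of location labels that apply to a retrieved snippet."""
    labels = set()
    if snippet_path in imported_files:
        labels.add("Imported")
    if snippet_path == target_path:
        labels.add("Current File")
    if os.path.dirname(snippet_path) == os.path.dirname(target_path):
        labels.add("Current Directory")
    if target_imports & snippet_imports:
        labels.add("Similar Import")

    def name_tokens(path):
        # Split a snake-case file name like "data_utils.py" into tokens.
        return set(os.path.splitext(os.path.basename(path))[0].split("_"))

    if name_tokens(snippet_path) & name_tokens(target_path):
        labels.add("Similar Name")
    return labels
```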
The results are outlined in Table 5. Our findings indicate a similar distribution of retrieved code snippets between the Oracle method and RepoCoder. The majority of snippets fall within our defined categories, and a significant portion originates from files in the "Similar Import", "Similar Name", or "Current Directory" locations, underscoring the importance of such contextual information in code completion tasks. Furthermore, we conduct an ablation study in which the retrieval process is restricted to only the aforementioned file locations. The resulting degradation in performance highlights the efficacy and simplicity of RepoCoder's unrestricted retrieval process.

Related Work
Repository Context in Code Completion: Incorporating repository-level context into code completion tools has been a long-standing challenge. Traditional code completion techniques typically analyze code to identify potential suggestions and then re-rank them (Raychev et al., 2014; Svyatkovskiy et al., 2019, 2021). While this approach offers efficient performance, it lacks the flexibility to generate code at arbitrary granularity. Another line of research treats code completion as a language modeling task, where the next tokens are generated based on the given context. Several methods have been proposed to incorporate repository context into the training of language models, including n-grams (Tu et al., 2014), LSTMs (Hellendoorn and Devanbu, 2017), and Transformers (Svyatkovskiy et al., 2020; Liu et al., 2022; Ding et al., 2022). However, collecting labeled data and fine-tuning models for different applications remains resource-intensive. In recent years, Large Language Models (LLMs) have received significant attention for code completion. A study by Shrivastava et al. (2022) also explores the scenario of repository-level code completion. Despite its innovative approach, the study relies on inflexible heuristics and classifier training for prompt construction, highlighting the ongoing challenges in effectively leveraging LLMs for code completion and the need for further research.
Joint Modeling Retrieval and Generation: Despite the impressive capabilities of LLMs (Brown et al., 2020; Thoppilan et al., 2022; Chowdhery et al., 2022), their offline training paradigm often limits access to customized and up-to-date information. Recent studies have started exploring the joint modeling of retrieval and generation in knowledge-intensive tasks such as question answering (Guu et al., 2020; Lewis et al., 2020; Izacard et al., 2022) and dialogue generation (Zhang et al., 2022). This approach has also been extended to code generation by incorporating retrieved documents or code examples into the generation process (Rizwan Parvez et al., 2021; Zhou et al., 2022; Lu et al., 2022; Zan et al., 2022). As language models have become increasingly sophisticated, there is a growing trend toward in-context joint retrieval and generation, treating the LLM as a fixed black box (Levine et al., 2022; Ram et al., 2023; Shi et al., 2023). Moreover, some studies have investigated utilizing the model's predictions as supplementary context to inform the retrieval process (Mao et al., 2020; Li et al., 2022; Wang et al., 2023; Zemlyanskiy et al., 2022). In this work, we demonstrate that an iterative paradigm combining code retrieval and generation serves as an effective method for repository-level code completion.

Conclusion and Future Work
In conclusion, we introduce RepoCoder, a straightforward and effective framework for the repository-level code completion task.
By leveraging a retriever and a language model, RepoCoder effectively utilizes repository-level information. Through an iterative process of retrieval and generation, it bridges the gap between the retrieval context and the target code, resulting in improved code completion performance. Our extensive experiments on the RepoEval benchmark demonstrate that RepoCoder consistently and significantly enhances In-File completion performance, surpassing the vanilla Retrieval-Augmented Generation (RAG) approach. Furthermore, our analysis provides valuable insights into the rationale and limitations of RepoCoder. With its simplicity, versatility, and effectiveness, RepoCoder has the potential to become an indispensable tool in real-world software development, and we aim to further enhance its usability and robustness.

Limitations
Limited Effectiveness for Repositories with Low Code Duplication: Although we have demonstrated the effectiveness of RepoCoder through extensive experiments and analysis, it may not bring significant performance improvements when a repository has few instances of code duplication. In such scenarios, the code retrieval process struggles to find sufficient relevant information in the repository to facilitate code completion. This issue is further highlighted in the study presented in Appendix C.
Difficulty in Identifying the Optimal Number of Iterations: While RepoCoder with two iterations outperforms the RAG method, determining the optimal number of iterations remains a challenge, as subsequent iterations may exhibit unstable performance compared to previous ones. Appendix D demonstrates this issue. To mitigate it, we have explored different approaches to automatically terminate the iteration process when necessary. However, finding an optimal stopping criterion that does not significantly impact RepoCoder's performance remains an ongoing challenge, and further research is required to develop techniques that identify the iteration at which RepoCoder achieves the best performance.

Time Efficiency for Real-Time Deployment:
While RepoCoder demonstrates promising gains in code completion accuracy through iterative retrieval and generation, the latency of additional retrieval-generation steps may raise concerns. For real-time deployment scenarios with strict latency requirements, RepoCoder can be further improved through model optimizations such as quantization, distillation, and hardware acceleration to expedite inference. Techniques like caching frequent code snippets and pre-processing repositories can also boost speed, and the number of iterations can be adapted dynamically based on latency goals and contextual needs to balance accuracy and efficiency. Nevertheless, improving time efficiency is an important topic that falls outside the scope of the current paper.
Limited Exploration of Different Experimental Settings: First, while we have validated the effectiveness of RepoCoder, we have not yet explored the potential improvements achievable through different prompt templates. We believe that more careful prompt engineering could further enhance the performance of our approach. Second, our focus in this study has primarily been on similarity-based retrieval models. The reason for this limited scope is the complexity of code retrieval, which involves numerous intricate details not directly relevant to the RepoCoder framework. Considering alternative retrieval models or expanding the exploration to other code retrieval techniques could provide further insights and comparative evaluations. Third, we have observed significant advancements in code generation models, such as GPT-4 (OpenAI, 2023), StarCoder (Li et al., 2023), and WizardCoder (Luo et al., 2023). While our experiments demonstrate the efficacy of RepoCoder across different language models (GPT-3.5-Turbo and CODEGEN), it would be valuable to investigate how our approach performs with these advanced models; incorporating them into our experimental setup would provide a broader evaluation across a wider range of language models. Fourth, our experiments primarily use the In-File and Oracle methods as baselines. This decision stems from the fact that repository-level code completion using language models is a relatively new task, lacking well-established and reproducible baselines. To provide further insights, we include comparisons to other commercial code completion products. Nonetheless, it is impractical to systematically benchmark against complex, confidential commercial products; instead, we conduct a study in Appendix E showcasing the repository-level completion ability of RepoCoder and three major commercial products, illustrating their qualitative differences. In summary, future work should aim to explore different prompt designs, consider alternative retrieval or generation models, and incorporate additional baselines.

A Repository Details
As mentioned in Section 3, we meticulously selected repositories for our RepoEval benchmark based on criteria such as open-source license, creation date, code quantity, and quality. Detailed information about these repositories is provided in Table 7.

B Using the Dense Retriever
In our main experiments (as described in Section 4.2), we utilize a sparse retrieval model for RepoCoder due to its acceptable performance and computational efficiency. However, RepoCoder is a versatile framework that can be applied with other code retrieval models as well. To further validate the effectiveness of RepoCoder, we conduct additional experiments using a dense code retriever.
Specifically, we employ UniXcoder (Guo et al., 2022), a state-of-the-art code embedding model, to transform code snippets into hidden vectors. We then calculate the similarity between code snippets using cosine similarity. The experimental results on the line and API invocation completion datasets using the dense retriever are presented in Table 6a and Table 6b. Notably, the performance of RepoCoder using the dense retriever is comparable to that using the sparse retriever. Furthermore, the findings remain consistent across both retrievers, highlighting the robustness and generalizability of RepoCoder.
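For intuition, the dense retrieval step amounts to embedding code snippets as vectors and ranking candidates by cosine similarity to the query embedding. The sketch below is a hypothetical stand-in: a simple bag-of-tokens vector replaces the actual UniXcoder encoder, and the function names are our own.

```python
import numpy as np

def embed(code: str, vocab: list[str]) -> np.ndarray:
    """Toy bag-of-tokens embedding standing in for a dense encoder such as
    UniXcoder; a real setup would encode with the pretrained model instead."""
    toks = code.split()
    return np.array([float(toks.count(t)) for t in vocab])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank(query: str, candidates: list[str]) -> list[str]:
    """Rank candidate snippets by cosine similarity to the query embedding."""
    vocab = sorted({t for c in candidates + [query] for t in c.split()})
    q = embed(query, vocab)
    return sorted(candidates, key=lambda c: cosine(q, embed(c, vocab)), reverse=True)

snippets = ["x = load_model(cfg)",
            "for i in range(10): pass",
            "model = load_model(config)"]
top = rank("model = load_model(config)", snippets)[0]
```

In practice the candidate embeddings would be precomputed once per repository so that each query only needs a single encoder forward pass plus a vector search.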

C Code Duplication in Repositories
We explore the relationship between the performance of RepoCoder and the code duplication ratio of the repositories. Intuitively, since RepoCoder utilizes similarity-based retrieval to find code exemplars, one might expect a positive correlation between its performance and the code duplication ratio. To assess this relationship, we calculate the code duplication ratio of the repositories by determining the ratio of duplicated code lines to the total code lines. Figure 4 presents the results, demonstrating the correlation between RepoCoder's performance, as measured by the Exact Match (EM) metric, and the code duplication ratio on the line and API completion datasets using GPT-3.5-Turbo. Notably, the repository "diffusers" exhibits the highest duplication ratio, which corresponds to a significant performance improvement for RepoCoder on both datasets. Conversely, "rl" and "vizier" have low duplication ratios, resulting in comparatively lower performance for RepoCoder. However, the correlation between RepoCoder's performance and the code duplication ratio is not absolute. For example, "FedScope" and "evaluate" have similar duplication ratios but show different performance gains for RepoCoder.
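The duplication ratio described above (duplicated code lines over total code lines) can be computed roughly as follows. The exact counting rules used for RepoEval may differ (e.g., tokenization or minimum line length), so this is only an illustrative sketch.

```python
from collections import Counter

def duplication_ratio(lines: list[str]) -> float:
    """Fraction of non-empty code lines that appear more than once.
    Mirrors the 'duplicated lines / total lines' ratio described above;
    the paper's exact normalization may differ."""
    code = [ln.strip() for ln in lines if ln.strip()]
    if not code:
        return 0.0
    counts = Counter(code)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(code)

lines = ["import os", "x = 1", "import os", "y = 2", "import os", ""]
r = duplication_ratio(lines)  # 3 of 5 non-empty lines are duplicated -> 0.6
```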

D Failed Cases between Iterations
In Section 5, the evaluation results demonstrate that increasing the number of RepoCoder iterations does not necessarily guarantee performance improvements. To further investigate this issue, we analyze the changes in the number of correct code completions achieved by different methods on the API invocation completion dataset. A prediction is considered correct when its EM score is 1. Table 8 presents the counts of correct code completions for each iteration of RepoCoder. We observe that each iteration of RepoCoder both passes cases that the previous iteration failed and fails cases that the previous iteration passed.
Upon manually examining the failed cases, we have the following observations: Firstly, a majority of failures are caused by misleading retrieved code, which leads to incorrect predictions. For instance, the same API may have different sets of parameters across different files, and the retrieved API usage example can be misleading in such cases. Secondly, the model's predictions are not always suitable for retrieval. This is because the query is constructed using a fixed length of the predicted code, which may include noisy code beyond the initial lines of helpful code completion. Furthermore, our investigation reveals that many cases in the line and API datasets are actually correct despite being evaluated as incorrect by the EM score. This highlights the importance of considering the actual functionality of the code, rather than solely relying on exact matching, and suggests incorporating unit tests to assess code correctness.
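The fixed-length query construction discussed above can be sketched as follows. The window sizes and function name are illustrative assumptions rather than the paper's exact configuration; the point is that a too-generous prediction window can drag noisy lines into the retrieval query.

```python
def build_query(context: str, prediction: str,
                ctx_lines: int = 4, pred_lines: int = 2) -> str:
    """Form a retrieval query from the tail of the in-file context plus a
    fixed-length prefix of the model's prediction. If pred_lines is too
    large, noisy lines beyond the helpful completion leak into the query,
    which is the failure mode discussed above."""
    tail = context.splitlines()[-ctx_lines:]
    head = prediction.splitlines()[:pred_lines]
    return "\n".join(tail + head)

context = ("import torch\n"
           "model = Net()\n"
           "opt = make_opt(model)\n"
           "for batch in data:")
prediction = "    loss = model(batch)\n    loss.backward()\n    print('dbg')"
q = build_query(context, prediction)  # the noisy print('dbg') line is excluded
```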

E Case Study of Commercial Products
We conduct a study to showcase the repository-level code completion ability of RepoCoder and three other major commercial code completion tools: GitHub Copilot, Tabnine, and Amazon CodeWhisperer. The experiment was conducted using the Visual Studio Code IDE, with each product providing completions as a plugin. These products are based on large language models pre-trained on code data and can perform line-level and block-level completion, similar to our study scenario. We selected a simple API invocation example from the RepoEval dataset. The task was to complete the function body for initializing a StableDiffusionKDiffusionPipeline, where the prefix in-file context provided little information. As shown in Figure 5, none of the commercial products generated the correct completion. Because the implementation details of these commercial products are confidential, it is difficult to perform a systematic comparison. However, in our case, RepoCoder successfully predicted the correct completion by retrieving a relevant code snippet from the repository context. This demonstrates the need for state-of-the-art tools to effectively leverage repository-level context.

Figure 2 :
Figure 2: A motivating example showcasing the utilization of model predictions to enhance the performance of code retrieval.

Figure 3 :
Figure 3: A visual example demonstrating the format of the RepoCoder prompt, which combines the retrieved code snippets from the repository with the unfinished code present in the target file.

Figure 4 :
Figure 4: Correlation between the absolute performance improvements achieved by RepoCoder Iter-2 over the In-File method and the repository duplication ratios.
Table 8 :
The changes in the number of correct code completions achieved using different methods on the API invocation completion dataset.
(a) Incorrect code completion of GitHub Copilot. (b) Incorrect code completion of Tabnine. (c) Incorrect code completion of Amazon CodeWhisperer. (d) Correct code completion of RepoCoder.

Figure 5 :
Figure 5: Code completion examples of RepoCoder and three major commercial products.

Table 2 :
Performance comparison on the line and API invocation completion datasets. Results present the average performance of each method evaluated using Exact Match (EM) and Edit Similarity (ES) scores. Numbers are shown in percentage (%), with the best performance highlighted in bold.

Table 3 :
Performance comparison on the function body completion dataset using GPT-3.5-Turbo. Results display the Pass Rate (PR) of each method as evaluated using test cases. Numbers are presented in percentage (%), with the best performance highlighted in bold. ID represents the repository IDs, and N. indicates the number of test samples in each repository.

The results in Table 3 showcase similar trends to our findings on the line and API invocation completion datasets.

Table 6 :
Performance comparison on the line and API invocation completion datasets using the dense retriever.

Table 7 :
Detailed information of the GitHub repositories used for RepoEval. ID represents the repository IDs. F. denotes the total number of Python source files, while L. indicates the total number of non-empty Python code lines. Statistics are accurate as of January 2023.