Yesai Wu


2024

RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation
Qinyu Luo | Yining Ye | Shihao Liang | Zhong Zhang | Yujia Qin | Yaxi Lu | Yesai Wu | Xin Cong | Yankai Lin | Yingli Zhang | Xiaoyin Che | Zhiyuan Liu | Maosong Sun
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Generative models have demonstrated considerable potential in software engineering, particularly in tasks such as code generation and debugging. However, their utilization in the domain of code documentation generation remains underexplored. To this end, we introduce RepoAgent, an open-source framework powered by large language models and aimed at proactively generating, maintaining, and updating code documentation. Through both qualitative and quantitative evaluations, we have validated the effectiveness of our approach, showing that RepoAgent excels in generating high-quality repository-level documentation. The code and results are publicly accessible at https://github.com/OpenBMB/RepoAgent.

DebugBench: Evaluating Debugging Capability of Large Language Models
Runchu Tian | Yining Ye | Yujia Qin | Xin Cong | Yankai Lin | Yinxu Pan | Yesai Wu | Hui Haotian | Liu Weichuan | Zhiyuan Liu | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2024

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs’ debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce ‘DebugBench’, an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and apply rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models achieve relatively lower pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance, but it is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.