RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models

Jingjing Liu; Zeming Liu; Zihao Cheng; Mengliang He; Xiaoming Shi; Yuhang Guo; Xiangrong Zhu; Yuanfang Guo; Yunhong Wang; Haifeng Wang

doi:10.18653/v1/2025.findings-emnlp.1294

Abstract

Large Language Models (LLMs) have shown significant proficiency in code debugging, especially in automatic program repair, which may substantially reduce the time developers spend debugging and improve their efficiency. Debugging datasets have advanced considerably to support this line of work. However, these datasets primarily assess LLMs’ function-level code repair capabilities and neglect the more complex and realistic repository-level scenarios, which leaves an incomplete understanding of the challenges LLMs face in repository-level debugging. Although several repository-level datasets have been proposed, they often offer limited diversity of tasks, languages, and error types. To address these limitations, this paper introduces RepoDebug, a multi-task and multi-language repository-level code debugging dataset that covers 22 error subtypes, 8 commonly used programming languages, and 3 debugging tasks. Furthermore, we conduct evaluation experiments on 10 LLMs, where even Claude 3.5 Sonnet, the best-performing model, still does not perform well in repository-level debugging.

Anthology ID: 2025.findings-emnlp.1294
Original: 2025.findings-emnlp.1294v1
Version 2: 2025.findings-emnlp.1294v2
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 23784–23813
URL: https://aclanthology.org/2025.findings-emnlp.1294/
DOI: 10.18653/v1/2025.findings-emnlp.1294
Cite (ACL): Jingjing Liu, Zeming Liu, Zihao Cheng, Mengliang He, Xiaoming Shi, Yuhang Guo, Xiangrong Zhu, Yuanfang Guo, Yunhong Wang, and Haifeng Wang. 2025. RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23784–23813, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models (Liu et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.1294.pdf
Checklist: 2025.findings-emnlp.1294.checklist.pdf
