DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
Linghao Zhang | Junhao Wang | Shilin He | Chaoyun Zhang | Yu Kang | Bowen Li | Jiaheng Wen | Chengxing Xie | Maoquan Wang | Yufan Huang | Elsie Nallipogu | Qingwei Lin | Yingnong Dang | Saravan Rajmohan | Dongmei Zhang | Qi Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models have advanced automated software development; however, correctly inferring dependencies, namely, identifying the internal components and external packages required for a repository to run successfully, remains a challenge. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs’ capability in dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 48% execution pass rate on Python, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
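To make the task concrete: given a repository's source files, a model must predict the external packages (e.g., the entries of a requirements.txt) needed for the repository's tests to run. Below is a minimal Python sketch of a set-level textual metric in the spirit of DI-BENCH's textual evaluation; the function name, fields, and example values are illustrative assumptions, not the benchmark's actual API or results.

    # Illustrative sketch of a set-level F1 metric for dependency inference.
    # Not DI-BENCH's actual code; names and values are hypothetical.

    def dependency_f1(predicted: set[str], reference: set[str]) -> float:
        """Set-level F1 between predicted and ground-truth package names."""
        if not predicted and not reference:
            return 1.0  # vacuously perfect: repository needs no dependencies
        tp = len(predicted & reference)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(reference) if reference else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Example: model predicts dependencies for a Python repository.
    predicted = {"numpy", "requests", "pandas"}
    reference = {"numpy", "requests", "scipy"}
    print(f"F1 = {dependency_f1(predicted, reference):.2f}")  # F1 = 0.67

A textual metric like this only checks the declared package names; the execution-based metric the abstract reports (pass rate) additionally requires the repository's test suite to run under the inferred dependencies, which is the stricter signal.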
Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation
Xing Zhang | Jiaheng Wen | Fangkai Yang | Yu Kang | Pu Zhao | Junhao Wang | Maoquan Wang | Yufan Huang | Shengyu Fu | Elsie Nallipogu | Qingwei Lin | Yingnong Dang | Saravan Rajmohan | Dongmei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Code translation benchmarks are essential for evaluating the accuracy and efficiency of LLM-based systems. Existing benchmarks mainly target individual functions, overlooking repository-level challenges like inter-module coherence and dependency management. Recent repository-level efforts exist but suffer from poor maintainability and coarse evaluation granularity. We introduce Skeleton-Guided-Translation, a framework for benchmarking Java-to-C# translation at the repository level, featuring fine-grained quality evaluation. It follows a two-step process: first translating repository “skeletons”, then refining the entire repository guided by these skeletons. Based on this, we present TRANSREPO-BENCH, the first test-driven benchmark of high-quality Java repositories paired with C# skeletons, unit tests, and build configurations. Our adaptive unit tests support multiple and incremental translations without manual tuning, enhancing automation and scalability. We also propose fine-grained metrics that evaluate translation quality per test case, overcoming the limitations of binary metrics in distinguishing build failures. Evaluations using TRANSREPO-BENCH reveal issues like broken cross-file references, showing that our structured approach reduces dependency errors and preserves interface consistency.
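As an illustration of per-test-case scoring: rather than a binary build-pass signal, a translated repository can be scored by the fraction of unit tests it passes, so partial progress is distinguishable from total failure. The Python sketch below shows this idea; the class, function, and test names are hypothetical, not TRANSREPO-BENCH's actual code.

    # Illustrative sketch of fine-grained, per-test-case translation scoring.
    # Not TRANSREPO-BENCH's actual code; names and values are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class TestResult:
        name: str
        passed: bool

    def per_test_score(results: list[TestResult]) -> float:
        """Fraction of unit tests the translated repository passes."""
        if not results:
            return 0.0
        return sum(r.passed for r in results) / len(results)

    # Example: two of three C# unit tests pass after Java-to-C# translation.
    results = [
        TestResult("CalculatorTests.Add", True),
        TestResult("CalculatorTests.Divide", True),
        TestResult("ParserTests.CrossFile", False),  # e.g., a broken cross-file reference
    ]
    print(f"score = {per_test_score(results):.2f}")  # score = 0.67

Under a binary metric, the failing test above would collapse the whole repository to a score of 0; the per-test-case view instead credits the translated components that already work.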