Yingnong Dang


2025

DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
Linghao Zhang | Junhao Wang | Shilin He | Chaoyun Zhang | Yu Kang | Bowen Li | Jiaheng Wen | Chengxing Xie | Maoquan Wang | Yufan Huang | Elsie Nallipogu | Qingwei Lin | Yingnong Dang | Saravan Rajmohan | Dongmei Zhang | Qi Zhang
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models have advanced automated software development; however, correctly inferring dependencies remains a challenge, namely, identifying the internal components and external packages required for a repository to run successfully. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs’ capability in dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 48% execution pass rate on Python, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
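
To make the dependency-inference task concrete, here is a minimal sketch of a naive static baseline for Python, not DI-BENCH's evaluation harness: it scans a repository's import statements, filters out standard-library modules and internal components, and guesses the external packages to declare. The IMPORT_TO_PACKAGE mapping and the repository path are hypothetical examples.

```python
import ast
import sys
from pathlib import Path

# Standard-library modules need no external dependency entry (Python 3.10+).
STDLIB = set(sys.stdlib_module_names)

# Hypothetical mapping for modules whose PyPI name differs from the import name.
IMPORT_TO_PACKAGE = {"cv2": "opencv-python", "yaml": "PyYAML", "PIL": "Pillow"}

def infer_dependencies(repo_root: str) -> set[str]:
    """Guess the external packages a repo needs by scanning its imports."""
    repo = Path(repo_root)
    # Internal components: modules defined inside the repository itself.
    local_modules = {p.stem for p in repo.rglob("*.py")}
    deps: set[str] = set()
    for path in repo.rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name.split(".")[0] for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                names = [node.module.split(".")[0]]
            else:
                continue
            for name in names:
                if name not in STDLIB and name not in local_modules:
                    deps.add(IMPORT_TO_PACKAGE.get(name, name))
    return deps

# Example (hypothetical path): infer_dependencies("path/to/repo")
```

A purely static scan like this misses dynamically imported or version-pinned packages, which is one reason execution-based metrics (does the repository actually run under the inferred manifest?) are harder to pass than textual ones.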

Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation
Xing Zhang | Jiaheng Wen | Fangkai Yang | Yu Kang | Pu Zhao | Junhao Wang | Maoquan Wang | Yufan Huang | Shengyu Fu | Elsie Nallipogu | Qingwei Lin | Yingnong Dang | Saravan Rajmohan | Dongmei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025

Code translation benchmarks are essential for evaluating the accuracy and efficiency of LLM-based systems. Existing benchmarks mainly target individual functions, overlooking repository-level challenges such as inter-module coherence and dependency management. Recent repository-level efforts exist but suffer from poor maintainability and coarse evaluation granularity. We introduce Skeleton-Guided-Translation, a framework for benchmarking Java-to-C# translation at the repository level with fine-grained quality evaluation. It follows a two-step process: first translating repository “skeletons”, then refining the entire repository guided by these skeletons. Based on this framework, we present TRANSREPO-BENCH, the first test-driven benchmark of high-quality Java repositories paired with C# skeletons, unit tests, and build configurations. Our adaptive unit tests support multiple and incremental translations without manual tuning, enhancing automation and scalability. We also propose fine-grained metrics that evaluate translation quality per test case, overcoming the limitations of binary metrics in distinguishing build failures. Evaluations using TRANSREPO-BENCH reveal issues such as broken cross-file references, and show that our structured approach reduces dependency errors and preserves interface consistency.
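
To illustrate why per-test-case metrics matter, here is a minimal sketch, assuming a repository's score is simply the fraction of its unit tests that pass; TRANSREPO-BENCH's exact formula may differ. A binary metric collapses any failure, including a build error, to zero, whereas a per-test-case score still distinguishes partially correct translations.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    passed: bool  # False covers assertion failures and build/link errors alike

def binary_score(results: list[TestResult]) -> int:
    """Coarse repository-level metric: 1 only if every test passes."""
    return int(all(r.passed for r in results))

def fine_grained_score(results: list[TestResult]) -> float:
    """Per-test-case metric: fraction of unit tests the translation passes."""
    return sum(r.passed for r in results) / len(results) if results else 0.0

# Hypothetical outcomes for one translated repository.
results = [
    TestResult("testParseConfig", True),
    TestResult("testCrossFileReference", False),  # e.g. a broken cross-file reference
    TestResult("testSerialization", True),
]
print(binary_score(results))        # 0 -- the binary metric hides partial progress
print(fine_grained_score(results))  # 0.666... -- partial credit per test case
```

Under the binary metric, a translation that passes two of three tests and one that fails to build are indistinguishable; the per-test-case score separates them.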