LMR-BENCH: Evaluating LLM Agent’s Ability on Reproducing Language Modeling Research

Shuo Yan; Ruochen Li; Ziming Luo; Zimu Wang; Daoyang Li; Liqiang Jing; Kaiyu He; Peilin Wu; Juntong Ni; George Michalopoulos; Yue Zhang; Ziyang Zhang; Mian Zhang; Zhiyu Chen; Xinya Du

doi:10.18653/v1/2025.emnlp-main.314

Abstract

Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR-BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top-tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and instructions for implementing these functions. We conduct extensive experiments in standard prompting and LLM agent settings with state-of-the-art LLMs, evaluating the accuracy of unit tests and performing LLM-based evaluation of code correctness. Experimental results reveal that even the most advanced models still exhibit persistent limitations in scientific reasoning and code synthesis, highlighting critical gaps in LLM agents’ ability to autonomously reproduce scientific research.

Anthology ID: 2025.emnlp-main.314
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 6164–6186
URL: https://aclanthology.org/2025.emnlp-main.314/
DOI: 10.18653/v1/2025.emnlp-main.314
Cite (ACL): Shuo Yan, Ruochen Li, Ziming Luo, Zimu Wang, Daoyang Li, Liqiang Jing, Kaiyu He, Peilin Wu, Juntong Ni, George Michalopoulos, Yue Zhang, Ziyang Zhang, Mian Zhang, Zhiyu Chen, and Xinya Du. 2025. LMR-BENCH: Evaluating LLM Agent’s Ability on Reproducing Language Modeling Research. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6164–6186, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): LMR-BENCH: Evaluating LLM Agent’s Ability on Reproducing Language Modeling Research (Yan et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.314.pdf
Checklist: 2025.emnlp-main.314.checklist.pdf
