LLM-Evolve: Evaluation for LLM’s Evolving Capability on Benchmarks

Jiaxuan You, Mingjie Liu, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro


Abstract
The advancement of large language models (LLMs) has extended their use to dynamic and interactive real-world applications, where models engage continuously with their environment and can potentially enhance their performance over time. Most existing LLM benchmarks evaluate LLMs on i.i.d. tasks, overlooking their ability to learn iteratively from past experiences. Our paper bridges this evaluation gap by proposing a novel framework, LLM-Evolve, which extends established benchmarks to sequential problem-solving settings. LLM-Evolve evaluates LLMs over multiple rounds, providing feedback after each round to build a demonstration memory that the models can query in future tasks. We applied LLM-Evolve to the MMLU, GSM8K, and AgentBench benchmarks, testing eight state-of-the-art open-source and closed-source models. Results show that LLMs can achieve performance improvements of up to 17% by learning from past interactions, with the quality of the retrieval algorithm and of the feedback significantly influencing this capability. These insights call for a deeper understanding of, and dedicated benchmarks for, LLM performance in evolving, interactive scenarios.
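To make the evaluation protocol described above concrete, here is a minimal sketch of a multi-round loop with a retrieval-backed demonstration memory. All names (query_llm, is_correct, retrieve_similar) are hypothetical placeholders for illustration, not the paper's released code or API.

```python
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (task, successful answer) pair stored in memory


def llm_evolve_eval(
    tasks: List[str],
    query_llm: Callable[[str, List[Demo]], str],          # (task, demonstrations) -> answer
    is_correct: Callable[[str, str], bool],               # (task, answer) -> feedback
    retrieve_similar: Callable[[str, List[Demo], int], List[Demo]],
    num_rounds: int = 3,
    k: int = 4,
) -> List[float]:
    """Sequentially evaluate tasks over several rounds, growing a demonstration
    memory from past successes that later queries can retrieve from.
    A sketch of the idea in the abstract, not the paper's implementation."""
    memory: List[Demo] = []
    accuracies: List[float] = []
    for _ in range(num_rounds):
        correct = 0
        for task in tasks:
            demos = retrieve_similar(task, memory, k)   # query memory for relevant past demos
            answer = query_llm(task, demos)             # answer conditioned on retrieved demos
            if is_correct(task, answer):                # per-task feedback after each attempt
                memory.append((task, answer))           # store the successful interaction
                correct += 1
        accuracies.append(correct / len(tasks))         # track accuracy per round
    return accuracies
```

Reporting per-round accuracy makes the "evolving" trend visible: improvement across rounds reflects how much the model benefits from retrieved demonstrations, which is where the quality of the retriever and of the feedback signal enters.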
Anthology ID: 2024.emnlp-main.940
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 16937–16942
URL: https://aclanthology.org/2024.emnlp-main.940
Cite (ACL): Jiaxuan You, Mingjie Liu, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. 2024. LLM-Evolve: Evaluation for LLM’s Evolving Capability on Benchmarks. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16937–16942, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): LLM-Evolve: Evaluation for LLM’s Evolving Capability on Benchmarks (You et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.940.pdf