Xujie Si
2026
Is This LLM Library Learning? Evaluation Must Account For Compute and Behaviour
Ian Berlot-Attwell | Tobias Sesterhenn | Frank Rudzicz | Xujie Si
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
The in-context learning (ICL) coding, reasoning, and tool-using ability of LLMs has spurred interest in library learning (i.e., the creation and exploitation of reusable and composable functions, tools, or lemmas). Such systems often promise improved task performance and computational efficiency by caching reasoning (i.e., storing generated tools) - all without finetuning. However, we find strong reasons to be skeptical. Specifically, we identify a serious evaluation flaw present in a large number of ICL library learning works: these works do not correct for the difference in computational cost between baseline and library learning systems. Studying three separately published ICL library learning systems, we find that all of them fail to consistently outperform the simple baseline of prompting the model - improvements in task accuracy often vanish or reverse once computational cost is accounted for. Furthermore, we perform an in-depth examination of one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning. We find no evidence of the direct reuse of learned lemmas, and find evidence against the soft reuse of learned lemmas (i.e., reuse by modifying relevant examples). Our findings suggest that a serious re-examination of the effectiveness of ICL LLM-based library learning is required, as are much stronger standards for evaluation. An equal computational budget must be used for baselines, alongside behavioural analysis.
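The abstract's central methodological point - that a baseline must receive the same compute budget as the library-learning system - can be illustrated with a small sketch. The numbers below are synthetic (not from the paper): if the library system spends roughly four LLM calls per task, the fair baseline comparison is pass@4, not pass@1, using the standard unbiased pass@k estimator.

```python
# Hedged sketch with synthetic numbers (not results from the paper):
# comparing against a compute-matched baseline. If a library-learning
# system makes ~c LLM calls per task, a fair baseline gets pass@c,
# not pass@1.

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples (drawn without replacement from n attempts, c of them
    correct) succeeds, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    prob_all_fail = 1.0
    for i in range(k):
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail

# Synthetic illustration: the baseline solves 30 of 100 sampled attempts.
baseline_pass1 = pass_at_k(n=100, c=30, k=1)  # single-attempt baseline
baseline_pass4 = pass_at_k(n=100, c=30, k=4)  # compute-matched baseline

print(f"baseline pass@1: {baseline_pass1:.3f}")                    # 0.300
print(f"baseline pass@4 (compute-matched): {baseline_pass4:.3f}")  # 0.766
```

The gap between 0.300 and 0.766 shows how much headroom a naive pass@1 baseline leaves: a library-learning system could land anywhere in between and still be worse than simply resampling the plain prompt with the same budget.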
2025
APPL: A Prompt Programming Language for Harmonious Integration of Programs and Large Language Model Prompts
Honghua Dong | Qidong Su | Yubo Gao | Zhaoyu Li | Yangjun Ruan | Gennady Pekhimenko | Chris J. Maddison | Xujie Si
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have become increasingly capable of handling diverse tasks with the aid of well-crafted prompts and integration of external tools, but as task complexity rises, workflows involving LLMs can become complicated and thus challenging to implement and maintain. To address this challenge, we propose APPL, A Prompt Programming Language that acts as a bridge between computer programs and LLMs, allowing seamless embedding of prompts into Python functions, and vice versa. APPL provides an intuitive and Python-native syntax, an efficient parallelized runtime with asynchronous semantics, and a tracing module that supports effective failure diagnosis and replay at no extra cost. We demonstrate that APPL programs are intuitive, concise, and efficient through representative scenarios including Chain-of-Thought with self-consistency (CoT-SC) and a ReAct tool-use agent. We further use LLMs to judge the language design of APPL against previous work; the results indicate that code written in APPL is more readable and intuitive. Our code, tutorial and documentation are available at https://github.com/appl-team/appl.
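The CoT-SC scenario mentioned above can be sketched in plain Python. Note this is a hypothetical illustration of the prompt-as-function pattern, not APPL's actual API (see the linked repository for that); `llm` here is a stand-in for any text-completion backend.

```python
# Hypothetical sketch of Chain-of-Thought with self-consistency (CoT-SC):
# sample several reasoning paths and majority-vote the final answers.
# This is NOT APPL's actual API; `llm` is any callable prompt -> completion.
from collections import Counter
from typing import Callable

def make_cot_sc(llm: Callable[[str], str], n_samples: int = 5):
    """Wrap an LLM backend into a CoT-SC solver."""
    def solve(question: str) -> str:
        prompt = (
            f"Q: {question}\n"
            "Think step by step, then give the final answer after 'Answer:'."
        )
        answers = []
        for _ in range(n_samples):
            completion = llm(prompt)
            # Keep only the text after the last 'Answer:' marker.
            answers.append(completion.rsplit("Answer:", 1)[-1].strip())
        # Majority vote across the sampled reasoning paths.
        return Counter(answers).most_common(1)[0][0]
    return solve

# Usage with a stubbed backend (real use would call an LLM with sampling):
responses = iter(["... Answer: 4", "... Answer: 4", "... Answer: 5"])
solver = make_cot_sc(lambda prompt: next(responses), n_samples=3)
print(solver("What is 2 + 2?"))  # "4" wins the vote 2-1
```

A runtime like the one the abstract describes would additionally issue the `n_samples` generations concurrently rather than in the sequential loop shown here.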