STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs

Sungeun An, Swanand Ravindra Kadhe, Shailja Thakur, Chad DeLuca, Hima Patel


Abstract
Benchmarks are often used as a standard for understanding LLM capabilities across domains. However, aggregate benchmark scores provide limited insight into the compositional skill gaps of LLMs and how to address them. To make these weaknesses visible, we propose the Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which introduces structured, incremental support step by step. Rather than inspecting failures individually, this approach enables systematic and scalable probing of model behavior by identifying the specific compositions of reasoning skills that a model lacks. Treating the LLM as a black box, our experiments on six models of varying sizes reveal multiple failure points across three reasoning benchmarks and highlight each model’s distinct skill gaps.
Anthology ID:
2026.findings-acl.1977
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
39675–39705
URL:
https://aclanthology.org/2026.findings-acl.1977/
Cite (ACL):
Sungeun An, Swanand Ravindra Kadhe, Shailja Thakur, Chad DeLuca, and Hima Patel. 2026. STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 39675–39705, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs (An et al., Findings 2026)
PDF:
https://aclanthology.org/2026.findings-acl.1977.pdf
Checklist:
 2026.findings-acl.1977.checklist.pdf