Shailja Thakur
2026
STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs
Sungeun An | Swanand Ravindra Kadhe | Shailja Thakur | Chad DeLuca | Hima Patel
Findings of the Association for Computational Linguistics: ACL 2026
Benchmarks are often used as a standard for understanding LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into the compositional skill gaps of LLMs and how to close them. To make these weaknesses visible, we propose the Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which introduces structured, incremental support step by step. Rather than inspecting failures individually, this approach enables systematic and scalable probing of model behavior by identifying the specific reasoning skill compositions a model lacks. Treating the LLM as a black box, our experiments on six models of varying sizes reveal multiple failure points in three reasoning benchmarks and highlight each model’s distinct skill gaps.
Think Like You Execute: Verifiable Chain of Thought from Program Traces
Shailja Thakur | Vaibhav Saxena | Rohan Kulkarni | Shivdeep Singh | Parameswaran Selvam | Hiroshi Kanayama | Hima Patel
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Teaching language models to reason about code execution is still an open problem. Current synthetic Chain-of-Thought (CoT) training data often consists of plausible-sounding explanations generated by teacher models, not verifiable accounts of actual program behavior. This causes models to learn logically flawed reasoning patterns despite syntactic correctness. We address this by grounding CoT generation directly in program execution traces. Our pipeline instruments code to capture dynamic behavior, narrates execution traces into natural language, and actively verifies each rationale against the trace. We systematically create 54,000 execution-verified, bi-directional rationales that teach models to reason both forward (input→output) and backward (output→input). Models fine-tuned on our verified data achieve substantial improvements, with a performance boost of +24.2 on LiveCodeBench-Exec, +22.3 on CruxEval-Output, and +21.1 on CruxEval-Input, demonstrating that verification quality directly determines both reasoning and code generation capabilities.