Shima Imani
2026
TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
Shima Imani | Seungwhan Moon | Lambert Mathias | Lu Zhang | Babak Damavandi
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Reliable mathematical and scientific reasoning remains an open challenge for large vision–language models (VLMs). Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE (Transparent Reasoning And Consistency Evaluation), a framework for analyzing, diagnosing, and improving reasoning in VLMs. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sets of sub-question–answer pairs that decompose a complex problem, to evaluate intermediate steps through consistency-based metrics and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS is linked to final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement.
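The consistency idea behind ARS can be sketched in a few lines of Python. The class, function names, and exact-match scoring below are illustrative assumptions for exposition, not TRACE's actual implementation; the paper's consistency-based metrics may differ.

# Illustrative sketch (not the paper's code): scoring agreement between a
# model's answers to an Auxiliary Reasoning Set (ARS) and reference sub-answers.
from dataclasses import dataclass

@dataclass
class SubQA:
    question: str   # decomposed sub-question for one reasoning step
    reference: str  # expected intermediate answer

def ars_consistency(model_answers: list[str], ars: list[SubQA]) -> float:
    """Fraction of intermediate steps where the model matches the reference."""
    matches = sum(
        pred.strip().lower() == step.reference.strip().lower()
        for pred, step in zip(model_answers, ars)
    )
    return matches / len(ars) if ars else 0.0

# Example: a two-step geometry problem decomposed into an ARS.
ars = [
    SubQA("What is the radius of the circle in the figure?", "5"),
    SubQA("What is the area of the circle?", "25*pi"),
]
print(ars_consistency(["5", "10*pi"], ars))  # 0.5 -> flags the second step

A low score localizes the failing step rather than only marking the final answer wrong, which is the kind of signal the abstract describes.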
SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code
Shima Imani | Seungwhan Moon | Adel Ahmadyan | Lu Zhang | Ahmed Kirmani | Babak Damavandi
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
We introduce SymPyBench, a large-scale synthetic benchmark of 15K university-level physics problems (90%/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. In addition to standard accuracy, we introduce three new metrics, Consistency Score, Failure Rate, and Confusion Rate, which quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.
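To illustrate the setup, a SymPyBench-style item might pair a templated question with executable code that yields the ground truth for any sampled parameters. This is a minimal sketch under assumed field and function names (projectile_range, sample_instance); the benchmark's actual schema, solvers, and metric definitions may differ.

# Illustrative sketch (not the benchmark's code): a parameterized problem whose
# ground truth is produced by an executable solver for any parameter set.
import math
import random

def projectile_range(v0: float, theta_deg: float, g: float = 9.81) -> float:
    """Ground-truth solver: horizontal range of a projectile on flat ground."""
    theta = math.radians(theta_deg)
    return v0 ** 2 * math.sin(2 * theta) / g

def sample_instance(seed: int) -> dict:
    """Instantiate one numerical variant of the parameterized problem."""
    rng = random.Random(seed)
    v0 = rng.uniform(5.0, 50.0)
    theta = rng.uniform(10.0, 80.0)
    return {
        "question": f"A projectile is launched at {v0:.1f} m/s at {theta:.1f} degrees. "
                    "What is its horizontal range in meters?",
        "answer": projectile_range(v0, theta),
    }

# Metrics such as the Consistency Score would compare a model's behavior across
# many such variants of the same underlying problem; here we just print two.
for seed in (0, 1):
    inst = sample_instance(seed)
    print(inst["question"], "->", round(inst["answer"], 2))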
2023
MathPrompter: Mathematical Reasoning using Large Language Models
Shima Imani | Liang Du | Harsh Shrivastava
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
Large Language Models (LLMs) have limited performance when solving arithmetic reasoning tasks and often provide incorrect answers. Unlike natural language understanding, math problems typically have a single correct answer, making the task of generating accurate solutions more challenging for LLMs. To the best of our knowledge, no LLMs indicate their level of confidence in their responses, which fuels a trust deficit in these models and impedes their adoption. To address this deficiency, we propose ‘MathPrompter’, a technique that improves the performance of LLMs on arithmetic problems along with increased reliability of the predictions. MathPrompter uses the zero-shot chain-of-thought prompting technique to generate multiple algebraic expressions or Python functions that solve the same math problem in different ways, thereby raising the confidence level in the output results. This is in contrast to other prompt-based CoT methods, where there is no check on the validity of the intermediate steps followed. Our technique improves over the state of the art on the ‘MultiArith’ dataset (from 78.7% to 92.5%), evaluated using a 175B-parameter GPT-based LLM.
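The consensus check can be sketched as follows. The two candidate solutions below stand in for LLM outputs, and all names and the agreement criterion are illustrative assumptions; MathPrompter's actual prompting, parsing, and verification details are described in the paper.

# Illustrative sketch (not the paper's code): accept an answer only if an
# algebraic expression and a Python function, both nominally produced by the
# LLM for the same templated problem, agree on random variable assignments.
import random

# Templated question: "John has A apples and buys B more. How many does he have?"
algebraic_expression = "A + B"                  # candidate 1 (algebraic form)

def python_solution(A: int, B: int) -> int:     # candidate 2 (code form)
    return A + B

def solutions_agree(trials: int = 5) -> bool:
    """Evaluate both candidates on random inputs and require exact agreement."""
    for _ in range(trials):
        A, B = random.randint(1, 100), random.randint(1, 100)
        if eval(algebraic_expression, {"A": A, "B": B}) != python_solution(A, B):
            return False
    return True

if solutions_agree():
    # Agreement across random evaluations raises confidence in the answer.
    print("Consensus answer for A=5, B=3:", python_solution(5, 3))
else:
    print("Candidates disagree; answer flagged as low confidence.")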