Ruilong Wang


2026

Annual reports communicate corporate performance to stakeholders through dense tables and explanatory text; although they carry rich grounding signals, the interplay between these modalities makes automated reasoning challenging. Existing QA benchmarks focus on retrieval or single-modality reasoning and rarely require answers to be justified with both textual and tabular evidence. We introduce ARQA (Annual Report QA), a benchmark of ~2.5K QA pairs spanning ten fiscal years of automotive enterprise annual reports and three reasoning families: Lookup, Arithmetic, and Insight. Data are produced via a planner–generator pipeline, deterministically verified and recomputed, and fully reviewed by domain experts. We evaluate state-of-the-art instruction-tuned language models on ARQA, showing strong factual retrieval but persistent weaknesses in grounded arithmetic and causal reasoning. We release ARQA and its evaluation toolkit to facilitate research on auditable, evidence-first reasoning over enterprise documents (https://github.com/RuilongWang/ARQA-Benchmark/).