Benchmarking Multilingual Temporal Reasoning in LLMs: The Temporal Reasoning Dataset

Vittorio Mazzia, Sandro Pollastrini, Davide Bernardi, Chiara Rubagotti, Daniele Amberti


Abstract
Temporal reasoning is a make-or-break capability for Large Language Models (LLMs) that aspire to act as reliable personal and enterprise assistants. This work introduces the Temporal Reasoning Dataset (TRD), a programmatically generated multilingual benchmark that evaluates the operational temporal reasoning capabilities of LLMs across ten languages, with particular focus on the basic operations that conversational agents need when handling time-sensitive tasks. TRD uses human-curated carrier phrases to generate a dataset that is resistant to overfitting, with diverse samples spanning five core task categories, each at five controlled difficulty levels. Extensive experimentation shows consistent patterns in model performance across languages: in reasoning-based tasks, accuracy declines strongly and linearly as task difficulty rises, while in memorization-based tasks it remains stable. Conversely, reasoning tasks remain robust under temporal shifts, whereas memorization tasks show performance degradation. Finally, contextual modifications to prompts affect model performance in ways that differ from human cognitive patterns.
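As an illustration of what programmatic generation from carrier phrases with a controlled difficulty level can look like, here is a minimal sketch. It is not the authors' generator: the carrier phrases, the `max_offset` difficulty scale, the single "add N days" task, and all function names are assumptions made for this example only.

```python
import random
from datetime import date, timedelta

# Hypothetical carrier phrases; the actual TRD phrases are not given in the abstract.
CARRIER_PHRASES = {
    "en": "Today is {start}. What is the date {n} days from now?",
    "it": "Oggi è il {start}. Che data sarà tra {n} giorni?",
}


def max_offset(difficulty: int) -> int:
    """Assumed difficulty scale: level d allows offsets of up to 10**d days."""
    if not 1 <= difficulty <= 5:
        raise ValueError("difficulty must be between 1 and 5")
    return 10 ** difficulty


def make_sample(lang: str, difficulty: int, rng: random.Random) -> dict:
    """Fill one carrier phrase with random values and compute the gold answer."""
    # Draw a random start date within 2024 (an arbitrary choice for this sketch).
    start = date(2024, 1, 1) + timedelta(days=rng.randrange(366))
    # Higher difficulty permits a larger day offset.
    n = rng.randint(1, max_offset(difficulty))
    answer = start + timedelta(days=n)
    return {
        "lang": lang,
        "difficulty": difficulty,
        "prompt": CARRIER_PHRASES[lang].format(start=start.isoformat(), n=n),
        "answer": answer.isoformat(),
    }
```

Calling `make_sample("en", 3, random.Random(0))` yields one prompt and its gold answer. Because every sample is drawn from a seeded random generator, a fresh test set can be produced at any time, which is one plausible reading of how such a benchmark resists overfitting.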
Anthology ID:
2026.iwsds-1.19
Volume:
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
Month:
February
Year:
2026
Address:
Trento, Italy
Editors:
Giuseppe Riccardi, Seyed Mahed Mousavi, Maria Ines Torres, Koichiro Yoshino, Zoraida Callejas, Shammur Absar Chowdhury, Yun-Nung Chen, Frederic Bechet, Joakim Gustafson, Géraldine Damnati, Alex Papangelis, Luis Fernando D’Haro, John Mendonça, Raffaella Bernardi, Dilek Hakkani-Tur, Giuseppe "Pino" Di Fabbrizio, Tatsuya Kawahara, Firoj Alam, Gokhan Tur, Michael Johnston
Venue:
IWSDS
Publisher:
Association for Computational Linguistics
Pages:
168–181
URL:
https://aclanthology.org/2026.iwsds-1.19/
Cite (ACL):
Vittorio Mazzia, Sandro Pollastrini, Davide Bernardi, Chiara Rubagotti, and Daniele Amberti. 2026. Benchmarking Multilingual Temporal Reasoning in LLMs: The Temporal Reasoning Dataset. In Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, pages 168–181, Trento, Italy. Association for Computational Linguistics.
Cite (Informal):
Benchmarking Multilingual Temporal Reasoning in LLMs: The Temporal Reasoning Dataset (Mazzia et al., IWSDS 2026)
PDF:
https://aclanthology.org/2026.iwsds-1.19.pdf