EventHopNLI: A Functional Dataset for Systematically Diagnosing Logical Failures in LLM Temporal Reasoning

Ved Mathai; Janet Pierrehumbert

Abstract

This paper presents EventHopNLI, a simplified functional diagnostic dataset for event temporal ordering, and uses it to make the performance of attention-based language models on this task more interpretable. Existing datasets built from natural data contain multiple overlapping linguistic features; simplifying and isolating these features improves interpretability. EventHopNLI is a programmatically created NLI dataset that systematically varies complexity factors such as the number of events and the number of logical hops. Even though EventHopNLI is highly simplified, it still proves challenging for language models. Because the dataset is functional, it is dynamic, which reduces the risk that the data was available to language models during training. We ablate over the complexity parameters and illustrate different shortcomings of attention-based models on this task. We discuss the performance of RoBERTa-large, Llama-405B and GPT-4o.
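
To make the idea of a functional, parameterised NLI dataset concrete, the following is a minimal sketch of how one instance might be generated from two complexity parameters. It is an illustration only and not the authors' generator: the function name `make_example`, the event names `E0`, `E1`, ..., the sentence template, and the two-way label set are all assumptions introduced here.

```python
import random


def make_example(num_events=4, num_hops=2, seed=0):
    """Build one toy premise/hypothesis pair.

    Hypothetical format for illustration; not the format used in the paper.
    """
    if not 1 <= num_hops < num_events:
        raise ValueError("need 1 <= num_hops < num_events")
    rng = random.Random(seed)

    # Events are named E0..E{n-1}; index order is the ground-truth timeline.
    events = [f"E{i}" for i in range(num_events)]

    # Premise: one "before" fact per adjacent pair, shuffled so that the
    # order of sentences in the text does not reveal the timeline.
    facts = [f"{a} happened before {b}." for a, b in zip(events, events[1:])]
    rng.shuffle(facts)

    # Hypothesis: relate two events that are num_hops links apart, so the
    # answer requires chaining num_hops premise facts.
    start = rng.randrange(num_events - num_hops)
    first, second = events[start], events[start + num_hops]

    # Reverse the hypothesis about half the time to obtain a contradiction.
    if rng.random() < 0.5:
        hypothesis, label = f"{first} happened before {second}.", "entailment"
    else:
        hypothesis, label = f"{second} happened before {first}.", "contradiction"

    return {
        "premise": " ".join(facts),
        "hypothesis": hypothesis,
        "label": label,
        "hops": num_hops,
    }


if __name__ == "__main__":
    print(make_example(num_events=5, num_hops=3, seed=1))
```

Under these assumptions, changing `seed` yields fresh instances on demand, which is one sense in which a functional dataset is dynamic, while `num_events` and `num_hops` can be varied independently for ablation.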

Anthology ID: 2025.clasp-main.2
Volume: Proceedings of the 2025 CLASP Conference on Language models And RePresentations (LARP)
Month: September
Year: 2025
Address: Gothenburg, Sweden
Editors: Nikolai Ilinykh, Mattias Appelgren, Erik Lagerstedt
Venues: CLASP | WS
Publisher: Association for Computational Linguistics
Pages: 11–27
URL: https://aclanthology.org/2025.clasp-main.2/
Cite (ACL): Ved Mathai and Janet B. Pierrehumbert. 2025. EventHopNLI: A Functional Dataset for Systematically Diagnosing Logical Failures in LLM Temporal Reasoning. In Proceedings of the 2025 CLASP Conference on Language models And RePresentations (LARP), pages 11–27, Gothenburg, Sweden. Association for Computational Linguistics.
Cite (Informal): EventHopNLI: A Functional Dataset for Systematically Diagnosing Logical Failures in LLM Temporal Reasoning (Mathai & Pierrehumbert, CLASP 2025)
PDF: https://aclanthology.org/2025.clasp-main.2.pdf
