When LLMs Annotate: Reliability Challenges in Low-Resource NLI

Solmaz Panahi, John Kelleher, Vasudevan Nedumpozhimana
Abstract
This paper systematically evaluates LLM reliability on the complex semantic task of Natural Language Inference (NLI) in Farsi, assessing six prominent models across eight prompt variations through a multi-dimensional framework that measures accuracy, prompt sensitivity, and intra-class consistency. Our results demonstrate that prompt design, particularly the order of premise and hypothesis, significantly impacts prediction stability. Proprietary models (Claude-Opus-4, GPT-4o) exhibit superior stability and accuracy compared to open-weight alternatives. Across all models, the 'Neutral' class emerges as the most challenging and least stable category. Crucially, we reframe model instability as a diagnostic tool for benchmark quality, demonstrating that observed disagreement often reflects valid challenges to ambiguous or erroneous gold-standard labels.
Anthology ID:
2026.loreslm-1.17
Volume:
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Hansi Hettiarachchi, Tharindu Ranasinghe, Alistair Plum, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
Venue:
LoResLM
Publisher:
Association for Computational Linguistics
Pages:
178–188
URL:
https://aclanthology.org/2026.loreslm-1.17/
Cite (ACL):
Solmaz Panahi, John Kelleher, and Vasudevan Nedumpozhimana. 2026. When LLMs Annotate: Reliability Challenges in Low-Resource NLI. In Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026), pages 178–188, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
When LLMs Annotate: Reliability Challenges in Low-Resource NLI (Panahi et al., LoResLM 2026)
PDF:
https://aclanthology.org/2026.loreslm-1.17.pdf