Alexandra Mihaela Danila
2025
GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs
Marius Dumitran | Angela Dumitran | Alexandra Mihaela Danila
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Large language models (LLMs) have revolutionised NLP, yet their pedagogical value for low‐resource languages remains unclear. We present GRILE, the first open benchmark of 1,151 multiple‐choice questions harvested from Romanian high‐stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state‐of‐the‐art multilingual and Romanian‐specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically faithful explanations. While Gemini 2.5 Pro reaches 83% accuracy, most open‐weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM 3 orthographic norms. All data, code and a public web demo are released to catalyse future research. Our findings expose open challenges for trustworthy educational NLP in low‐resource settings and establish GRILE as a new test‐bed for controllable explanation generation and evaluation.