Christophe Friezas Gonçalves


2026

Reading comprehension resources for low-resource languages remain limited, particularly datasets designed for educational assessment and diagnostic analysis in contrast to binary correctness.We present a diagnostically rich reading comprehension corpus forLuxembourgish, annotated using a two-layer framework that separateslinguistic sources of textual difficulty from cognitive and diagnosticproperties of comprehension questions. The linguistic layer captures span-level lexical, syntactic, morphological, and discourse-related features, while the cognitive layerannotates multiple-choice questions according to the PIRLS cognitiveprocesses and diagnostically meaningful distractor types following theSTARC framework.This design enables fine-grained analysis of reading comprehensionerrors by linking response patterns to underlying linguistic phenomena. The resulting corpus consists of 640 multiple-choice questions based on 16 annotated Luxembourgish texts. We describe the annotation methodology agreement measures, and will releasethe dataset as a publicly available resource for educational andlow-resource NLP research.