SI-NLI: A Slovene Natural Language Inference Dataset and Its Evaluation

Matej Klemen, Aleš Žagar, Jaka Čibej, Marko Robnik-Šikonja


Abstract
Natural language inference (NLI) is an important language understanding benchmark. It has two deficiencies: i) most NLI datasets are available only for English and a few other well-resourced languages, and ii) most NLI datasets are created with a narrow set of annotator instructions, so prediction models can pick up on linguistic clues and the benchmark fails to measure true reasoning capability. We address both issues and introduce SI-NLI, the first dataset for Slovene natural language inference. The dataset is constructed from scratch by knowledgeable annotators following carefully crafted guidelines that aim to avoid problems commonly encountered in existing NLI datasets. We also manually translate SI-NLI into English to enable cross-lingual model training and evaluation. Using the newly created dataset and its translation, we train and evaluate a variety of large transformer language models in monolingual and cross-lingual settings. The results indicate that larger models generally achieve better performance. The qualitative analysis shows that the SI-NLI dataset is diverse and that plenty of room for improvement remains even for the largest models.
Anthology ID:
2024.lrec-main.1294
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
14859–14870
URL:
https://aclanthology.org/2024.lrec-main.1294
Cite (ACL):
Matej Klemen, Aleš Žagar, Jaka Čibej, and Marko Robnik-Šikonja. 2024. SI-NLI: A Slovene Natural Language Inference Dataset and Its Evaluation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 14859–14870, Torino, Italia. ELRA and ICCL.
Cite (Informal):
SI-NLI: A Slovene Natural Language Inference Dataset and Its Evaluation (Klemen et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1294.pdf