C-XNLI: Croatian Extension of XNLI Dataset

Leo Obadić, Andrej Jertec, Marko Rajnović, Branimir Dropuljić


Abstract
Comprehensive multilingual evaluations have been encouraged by emerging cross-lingual benchmarks and constrained by existing parallel datasets. To partially mitigate this limitation, we extended the Cross-lingual Natural Language Inference (XNLI) corpus with Croatian. The development and test sets were translated by a professional translator, and we show that Croatian is consistent with other XNLI dubs. The train set was translated using Facebook’s 1.2B-parameter m2m_100 model. We thoroughly analyze the Croatian train set and compare its quality with the existing machine-translated German set. The comparison is based on 2000 manually scored sentences per language using a variant of the Direct Assessment (DA) score commonly used at the Conference on Machine Translation (WMT). Our findings reveal that a less-resourced language like Croatian still lags behind German in the translation quality of longer sentences. However, both sets contain a substantial number of poor-quality translations, which should be considered in translation-based training or evaluation setups.
Anthology ID:
2023.findings-acl.142
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2258–2267
URL:
https://aclanthology.org/2023.findings-acl.142
DOI:
10.18653/v1/2023.findings-acl.142
Cite (ACL):
Leo Obadić, Andrej Jertec, Marko Rajnović, and Branimir Dropuljić. 2023. C-XNLI: Croatian Extension of XNLI Dataset. In Findings of the Association for Computational Linguistics: ACL 2023, pages 2258–2267, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
C-XNLI: Croatian Extension of XNLI Dataset (Obadić et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.142.pdf
Video:
https://aclanthology.org/2023.findings-acl.142.mp4