C-XNLI: Croatian Extension of XNLI Dataset

Leo Obadić; Andrej Jertec; Marko Rajnović; Branimir Dropuljić

doi:10.18653/v1/2023.findings-acl.142

Abstract

Comprehensive multilingual evaluations have been encouraged by emerging cross-lingual benchmarks and constrained by existing parallel datasets. To partially mitigate this limitation, we extended the Cross-lingual Natural Language Inference (XNLI) corpus with Croatian. The development and test sets were translated by a professional translator, and we show that Croatian is consistent with other XNLI dubs. The train set was translated using Facebook’s 1.2B parameter m2m_100 model. We thoroughly analyze the Croatian train set and compare its quality with the existing machine-translated German set. The comparison is based on 2000 manually scored sentences per language using a variant of the Direct Assessment (DA) score commonly used at the Conference on Machine Translation (WMT). Our findings reveal that a less-resourced language like Croatian still lags behind German in the translation quality of longer sentences. However, both sets contain a substantial number of poor-quality translations, which should be considered in translation-based training or evaluation setups.

Anthology ID: 2023.findings-acl.142
Volume: Findings of the Association for Computational Linguistics: ACL 2023
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 2258–2267
URL: https://aclanthology.org/2023.findings-acl.142/
DOI: 10.18653/v1/2023.findings-acl.142
Cite (ACL): Leo Obadić, Andrej Jertec, Marko Rajnović, and Branimir Dropuljić. 2023. C-XNLI: Croatian Extension of XNLI Dataset. In Findings of the Association for Computational Linguistics: ACL 2023, pages 2258–2267, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): C-XNLI: Croatian Extension of XNLI Dataset (Obadić et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-acl.142.pdf
Video: https://aclanthology.org/2023.findings-acl.142.mp4
