AfroCS-xs: Creating a Compact, High-Quality, Human-Validated Code-Switched Dataset for African Languages

Kayode Olaleye; Arturo Oncevay; Mathieu Sibue; Nombuyiselo Zondi; Michelle Terblanche; Sibongile Mapikitla; Richard Lastrucci; Charese Smiley; Vukosi Marivate

doi:10.18653/v1/2025.acl-long.1601

Abstract

Code-switching is prevalent in multilingual communities but lacks adequate high-quality data for model development, especially for African languages. To address this, we present AfroCS-xs, a small human-validated synthetic code-switched dataset for four African languages (Afrikaans, Sesotho, Yoruba, isiZulu) and English within a specific domain—agriculture. Using large language models (LLMs), we generate code-switched sentences, including English translations, that are rigorously validated and corrected by native speakers. As a downstream evaluation task, we use this dataset to fine-tune different instruction-tuned LLMs for code-switched translation and compare their performance against machine translation (MT) models. Our results demonstrate that LLMs consistently improve in translation accuracy when fine-tuned on the high-quality AfroCS-xs dataset, highlighting that substantial gains can still be made with a low volume of data. We also observe improvements on natural code-switched and out-of-domain (personal finance) test sets. Overall, regardless of data size and prior exposure to a language, LLMs benefit from higher quality training data when translating code-switched texts in under-represented languages.

Anthology ID: 2025.acl-long.1601
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 33391–33410
URL: https://aclanthology.org/2025.acl-long.1601/
DOI: 10.18653/v1/2025.acl-long.1601
Cite (ACL): Kayode Olaleye, Arturo Oncevay, Mathieu Sibue, Nombuyiselo Zondi, Michelle Terblanche, Sibongile Mapikitla, Richard Lastrucci, Charese Smiley, and Vukosi Marivate. 2025. AfroCS-xs: Creating a Compact, High-Quality, Human-Validated Code-Switched Dataset for African Languages. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33391–33410, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): AfroCS-xs: Creating a Compact, High-Quality, Human-Validated Code-Switched Dataset for African Languages (Olaleye et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1601.pdf
