Synthetic Data Generation Pipeline for Low-Resource Swahili Sentiment Analysis: Multi-LLM Judging with Human Validation

Samuel Gyamfi; Alfred Malengo Kondoro; Yankı Öztürk; Richard Hans Schreiber; Vadim Borisov

Abstract

Despite serving over 100 million speakers as a vital African lingua franca, Swahili remains critically under-resourced for Natural Language Processing, hindering technological progress across East Africa. We present a scalable solution: a controllable synthetic data generation pipeline that produces culturally grounded Swahili text for sentiment analysis, validated through automated LLM judges. To ensure reliability, we conduct targeted human evaluation with a native Swahili speaker on a stratified sample, achieving 80.95% agreement between generated sentiment labels and human ground truth, with strong agreement on judge quality assessments. This demonstrates that LLM-based generation and quality assessment can transfer effectively to low-resource languages. We release the resulting Swahili sentiment dataset and the full reproducible generation pipeline publicly at https://huggingface.co/datasets/tabularisai/swahili-sentiment-dataset and https://github.com/tabularis-ai/Synthetic-Data-Generation-Pipeline-for-Low-Resource-Swahili-Sentiment-Analysis, providing working material for NLP researchers in low-resource contexts.
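The agreement figure reported in the abstract is a simple percent agreement between generated labels and human labels. A minimal sketch of that calculation follows. The function name `percent_agreement` and the five example labels are hypothetical and purely illustrative; they are not taken from the released dataset or pipeline, and the authors' own evaluation code may differ.

```python
def percent_agreement(generated, human):
    """Return the percentage of items where the two label lists match."""
    if len(generated) != len(human):
        raise ValueError("label lists must have the same length")
    if not generated:
        raise ValueError("label lists must not be empty")
    # Count positions where the generated label equals the human label.
    matches = sum(g == h for g, h in zip(generated, human))
    return 100.0 * matches / len(generated)


# Hypothetical labels for illustration only (not data from the paper).
generated = ["positive", "negative", "neutral", "positive", "negative"]
human = ["positive", "negative", "positive", "positive", "negative"]

print(f"{percent_agreement(generated, human):.2f}%")
```

Raw percent agreement does not correct for chance agreement; a chance-corrected statistic could be reported alongside it if the sample permits.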

Anthology ID: 2026.africanlp-main.12
Volume: Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Everlyn Asiko Chimoto, Constantine Lignos, Shamsuddeen Muhammad, Idris Abdulmumin, Clemencia Siro, David Ifeoluwa Adelani
Venues: AfricaNLP | WS
Publisher: Association for Computational Linguistics
Pages: 116–141
URL: https://aclanthology.org/2026.africanlp-main.12/
Cite (ACL): Samuel Gyamfi, Alfred Malengo Kondoro, Yankı Öztürk, Richard Hans Schreiber, and Vadim Borisov. 2026. Synthetic Data Generation Pipeline for Low-Resource Swahili Sentiment Analysis: Multi-LLM Judging with Human Validation. In Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026), pages 116–141, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Synthetic Data Generation Pipeline for Low-Resource Swahili Sentiment Analysis: Multi-LLM Judging with Human Validation (Gyamfi et al., AfricaNLP 2026)
PDF: https://aclanthology.org/2026.africanlp-main.12.pdf
