Linear Semantic Segmentation for Low-Resource Spoken Dialects

Kirill Chirkunov; Younes Samih; Abed Alhakim Freihat; Hanan Aldarmaki

Abstract

Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource conversational varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard semantic segmentation approaches for text. In this paper, we introduce a new multi-genre benchmark (more than 1,000 samples) for semantic segmentation in Arabic, focusing on dialectal discourse. The benchmark covers casual telephone conversations, code-switched podcasts, expressive dialogue, and broadcast news, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models that perform well on Modern Standard Arabic (MSA) news genres degrade on dialectal conversational texts. We further propose a segmentation model that targets local semantic coherence and robustness to discourse discontinuities, consistently outperforming strong baselines on dialectal non-news genres. The benchmark and approach generalize to other low-resource spoken languages.

Anthology ID: 2026.findings-acl.1740
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 34844–34861
URL: https://aclanthology.org/2026.findings-acl.1740/
Cite (ACL): Kirill Chirkunov, Younes Samih, Abed Alhakim Freihat, and Hanan Aldarmaki. 2026. Linear Semantic Segmentation for Low-Resource Spoken Dialects. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34844–34861, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Linear Semantic Segmentation for Low-Resource Spoken Dialects (Chirkunov et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1740.pdf
Checklist: 2026.findings-acl.1740.checklist.pdf
