Rethinking Polarity Detection: When BPE Fails Across Scripts

Manodyna K H; Luc De Nardi

Abstract

Multilingual evaluation often relies on language coverage or translated benchmarks, implicitly assuming that subword tokenization behaves comparably across scripts. In mixed-script settings, this assumption breaks down. We examine this effect using polarity detection as a case study, comparing Orthographic Syllable Pair Encoding (OSPE) and Byte Pair Encoding (BPE) under identical architectures, data, and training conditions on SemEval Task 9, which spans Devanagari, Perso-Arabic, and Latin scripts. OSPE is applied to Hindi, Nepali, Urdu, and Arabic, while BPE is retained for English. We find that BPE systematically underestimates performance in abugida and abjad scripts, producing fragmented representations, unstable optimization, and drops of up to 27 macro-F1 points for Nepali, while English remains largely unaffected. Script-aware segmentation preserves orthographic structure, stabilizes training, and improves cross-language comparability without additional data or model scaling, highlighting tokenization as a latent but consequential evaluation decision in multilingual benchmarks. While the analysis spans multiple scripts, we place particular emphasis on Arabic and Perso-Arabic languages, where frequency-driven tokenization most severely disrupts orthographic and morphological structure.

Anthology ID: 2026.abjadnlp-1.2
Volume: Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Month: March
Year: 2026
Address: Rabat, Morocco
Venues: AbjadNLP | WS
Publisher: Association for Computational Linguistics
Pages: 6–14
URL: https://aclanthology.org/2026.abjadnlp-1.2/
Cite (ACL): Manodyna K H and Luc De Nardi. 2026. Rethinking Polarity Detection: When BPE Fails Across Scripts. In Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script, pages 6–14, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Rethinking Polarity Detection: When BPE Fails Across Scripts (K H & De Nardi, AbjadNLP 2026)
PDF: https://aclanthology.org/2026.abjadnlp-1.2.pdf
