Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew

Giuseppe Samo, Paola Merlo


Abstract
In this paper, we investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, focusing on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish, with its transparent morphological markers, both monolingual and multilingual models succeed either when tokenization is highly atomic or when it breaks words into small subword units. For Hebrew, however, a multilingual model using character-level tokenization fails to capture its non-concatenative morphology, while a monolingual model with unified morpheme-aware segmentation excels. For all models, performance improves on more synthetic datasets.
Anthology ID:
2026.sigturk-1.8
Volume:
Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Kemal Oflazer, Abdullatif Köksal, Onur Varol
Venues:
SIGTURK | WS
Publisher:
Association for Computational Linguistics
Pages:
82–94
URL:
https://aclanthology.org/2026.sigturk-1.8/
Cite (ACL):
Giuseppe Samo and Paola Merlo. 2026. Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew. In Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026), pages 82–94, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew (Samo & Merlo, SIGTURK 2026)
PDF:
https://aclanthology.org/2026.sigturk-1.8.pdf