SindBERT, the Sailor: Charting the Seas of Turkish NLP

Raphael Scheible; Stefan Schweter

Abstract

Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models: the large variant achieves the best scores in two of four tasks but shows no consistent scaling advantage overall. This flat scaling trend, also observed for XLM-R and EuroBERT, suggests that current Turkish benchmarks may already be saturated. At the same time, comparisons with smaller but more curated models such as BERTurk highlight that corpus quality and diversity can outweigh sheer data volume. Taken together, SindBERT contributes both an openly released resource for Turkish NLP and an empirical case study on the limits of scaling and the central role of corpus composition in morphologically rich languages. The SindBERT models are released under the MIT license and made available in both fairseq and Hugging Face formats.

Anthology ID: 2026.sigturk-1.1
Volume: Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Kemal Oflazer, Abdullatif Köksal, Onur Varol
Venues: SIGTURK | WS
Publisher: Association for Computational Linguistics
Pages: 1–13
URL: https://aclanthology.org/2026.sigturk-1.1/
Cite (ACL): Raphael Scheible and Stefan Schweter. 2026. SindBERT, the Sailor: Charting the Seas of Turkish NLP. In Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026), pages 1–13, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): SindBERT, the Sailor: Charting the Seas of Turkish NLP (Scheible & Schweter, SIGTURK 2026)
PDF: https://aclanthology.org/2026.sigturk-1.1.pdf