Asım Ersoy
2025
In-Depth Analysis of Arabic-Origin Words in the Turkish Morpholex
Mounes Zaval | Abdullah İhsanoğlu | Asım Ersoy | Olcay Taner Yıldız
Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script
MorphoLex is a study that analyzes the roots, prefixes, and suffixes of words. The Turkish Morpholex, for example, analyzes 48,472 Turkish words. However, it lacks an in-depth analysis of Arabic-origin words and does not include their correct roots. This study analyzes the Arabic-origin words in the Turkish Morpholex, annotating their roots, morphological patterns, and semantic categories. The methodology developed for this work can be adapted to other languages influenced by Arabic, such as Urdu and Persian, with broader implications for studying loanword integration across linguistic contexts.
2023
In What Languages are Generative Language Models the Most Formal? Analyzing Formality Distribution across Languages
Asım Ersoy | Gerson Vizcarra | Tahsin Mayeesha | Benjamin Muller
Findings of the Association for Computational Linguistics: EMNLP 2023
Multilingual generative language models (LMs) are increasingly fluent in a large variety of languages. Trained on the concatenation of corpora in multiple languages, they enable powerful transfer from high-resource languages to low-resource ones. However, it is still unknown what cultural biases are induced in the predictions of these models. In this work, we focus on one language property highly influenced by culture: formality. We analyze the formality distributions of the predictions of XGLM and BLOOM, two popular generative multilingual language models, in 5 languages. We classify 1,200 generations per language as formal, informal, or incohesive and measure the impact of the prompt formality on the predictions. Overall, we observe a diversity of behaviors across the models and languages. For instance, XGLM generates informal text in Arabic and Bengali when conditioned with informal prompts, much more than BLOOM. In addition, even though both models are highly biased toward the formal style when prompted neutrally, we find that the models generate a significant number of informal predictions even when prompted with formal text. We release with this work 6,000 annotated samples, paving the way for future work on the formality of generative multilingual LMs.
Co-authors
- Tahsin Mayeesha 1
- Benjamin Muller 1
- Gerson Vizcarra 1
- Olcay Taner Yıldız 1
- Mounes Zaval 1