Alper Karamanlioglu

Also published as: Alper Karamanlıoğlu


2026

Large Language Models (LLMs) achieve strong performance on many tasks, but they still struggle with morphologically rich, low-resource languages such as Turkish. The difficulty arises because Turkish is agglutinative and underrepresented in multilingual training data, so current models often fail to capture its morphology, flexible word order, and formal registers. In this paper, we introduce MODA (Model Adapted for Domain Applications), a Turkish-specialized LLM built via a modular pipeline that combines continual pre-training, parameter-efficient fine-tuning, and model merging. Starting from Qwen2.5-7B as the base model, we first perform large-scale continual pre-training on a Turkish web corpus to improve grammatical and morphological representations. We then apply parameter-efficient supervised fine-tuning on task-oriented instruction data, and finally merge specialized variants into a single unified model. We evaluate MODA on TurkishMMLU, the Turkish subset of EXAMS, and TRCLAIM-19, where it consistently outperforms both the base and instruction-tuned Qwen2.5-7B models. Our results support a training strategy that explicitly separates linguistic acquisition from task alignment when adapting LLMs to morphologically rich, underrepresented languages under realistic hardware constraints.

2025

Stance detection in NLP involves determining whether an author supports, opposes, or is neutral towards a particular target. This task is particularly challenging for Turkish due to the limited availability of data, which hinders progress in the field. To address this issue, we introduce a novel dataset focused on stance detection in Turkish, specifically within the political domain. This dataset was collected from X (formerly Twitter) and annotated by three human annotators who followed predefined guidelines to ensure consistent labeling and generalizability. After compiling the dataset, we trained various transformer-based models with different architectures, showing that the dataset is effective for stance classification. These models achieved a Macro F1 score of up to 82%, highlighting their effectiveness in stance detection.
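As an illustration of the Macro F1 metric reported above, the following is a minimal sketch that averages per-class F1 scores with equal weight per class. The function name `macro_f1` and the label strings are hypothetical and chosen only for this example; they are not taken from the paper's dataset or code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical stance labels, for illustration only
gold = ["favor", "against", "neutral", "favor"]
pred = ["favor", "against", "favor", "favor"]
score = macro_f1(gold, pred)
```

Because every class contributes equally, a class that is never predicted correctly (here, "neutral") pulls the score down regardless of how rare it is, which is why Macro F1 is a common choice for imbalanced stance data.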
This paper describes a Retrieval-Augmented Generation (RAG) pipeline that optimizes regulatory compliance using a combination of embedding models (i.e., bge-m3, jina-embeddings-v3, e5-large-v2) with a reranker (i.e., bge-reranker-v2-m3). To efficiently process long context passages, we introduce a context-aware chunking method. By using the RePASS metric, we ensure comprehensive coverage of obligations and minimize contradictions, thereby setting a new benchmark for RAG-based regulatory compliance systems. The experimental results show that our best configuration achieves a score of 0.79 in Recall@10 and 0.66 in MAP@10 with the LLaMA-3.1-8B model for answer generation.
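For readers unfamiliar with the retrieval metrics quoted above, the sketch below shows one common way to compute Recall@k and MAP@k from ranked passage IDs. It is not the paper's evaluation code: the function names and passage IDs are hypothetical, ranked lists are assumed to contain no duplicate IDs, and normalizing average precision by min(number of relevant passages, k) is one convention among several.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant passages that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision_at_k(ranked, relevant, k=10):
    """Mean of precision values at each rank where a relevant passage occurs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, passage_id in enumerate(ranked[:k], start=1):
        if passage_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Assumed convention: normalize by the best achievable number of hits
    return total / min(len(relevant), k)


def map_at_k(runs, k=10):
    """Mean of average precision over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical passage IDs for a single query
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2", "p9"}
r10 = recall_at_k(ranked, relevant)
ap10 = average_precision_at_k(ranked, relevant)
```

Recall@k ignores ordering inside the top k, while MAP@k rewards placing relevant passages earlier, so the two numbers together describe both coverage and ranking quality.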
This study presents the development of a Retrieval-Augmented Generation (RAG) framework tailored for analyzing regulatory documents from the Abu Dhabi Global Market (ADGM). The methodology encompasses comprehensive data preprocessing, including extraction, cleaning, and compression of documents, as well as the organization of the ObliQA dataset. An embedding model generates the embeddings used in the retrieval phase, with the txtai library managing embeddings and streamlining testing. The training process incorporated strategies such as duplicate recognition, dropout implementation, pooling adjustments, and label modifications to enhance retrieval performance. Hyperparameter tuning further refined the retrieval component, with improvements validated using the recall@10 metric, which measures the proportion of relevant passages that are retrieved within the top-10 results. The refined retrieval component effectively identifies pertinent passages within regulatory documents, expediting information access and supporting compliance efforts.