Other Workshops and Events (2026)
Volumes
- Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script (71 papers)
- Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026) (31 papers)
- Proceedings of the Ninth Fact Extraction and VERification Workshop (FEVER) (12 papers)
- Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics (8 papers)
- Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026) (29 papers)
- Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026 (33 papers)
- The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26) (14 papers)
- Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026) (25 papers)
- Proceedings of the First Workshop on Multilingual Multicultural Evaluation (16 papers)
- Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026) (35 papers)
- Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026) (10 papers)
- Proceedings of the Second Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2026) (21 papers)
- Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (7 papers)
- The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family (15 papers)
- Proceedings of the Seventh Workshop on Teaching Natural Language Processing (TeachNLP 2026) (18 papers)
- Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects (33 papers)
- The Proceedings for the 15th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA 2026) (24 papers)
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
We present ArabicDialectHub, a cross-dialectal Arabic learning resource comprising 552 phrases across six varieties (Moroccan Darija, Lebanese, Syrian, Emirati, Saudi, and MSA) and an interactive web platform. Phrases were generated using LLMs and validated by five native speakers, stratified by difficulty, and organized thematically. The open-source platform provides translation exploration, adaptive quizzing with algorithmic distractor generation, cloud-synchronized progress tracking, and cultural context. Both the dataset and complete platform source code are released under MIT license. Platform: https://arabic-dialect-hub.netlify.app.
Multilingual evaluation often relies on language coverage or translated benchmarks, implicitly assuming that subword tokenization behaves comparably across scripts. In mixed-script settings, this assumption breaks down. We examine this effect using polarity detection as a case study, comparing Orthographic Syllable Pair Encoding (OSPE) and Byte Pair Encoding (BPE) under identical architectures, data, and training conditions on SemEval Task 9, which spans Devanagari, Perso-Arabic, and Latin scripts. OSPE is applied to Hindi, Nepali, Urdu, and Arabic, while BPE is retained for English. We find that BPE systematically underestimates performance in abugida and abjad scripts, producing fragmented representations, unstable optimization, and drops of up to 27 macro-F1 points for Nepali, while English remains largely unaffected. Script-aware segmentation preserves orthographic structure, stabilizes training, and improves cross-language comparability without additional data or model scaling, highlighting tokenization as a latent but consequential evaluation decision in multilingual benchmarks. While the analysis spans multiple scripts, we place particular emphasis on Arabic and Perso-Arabic languages, where frequency-driven tokenization most severely disrupts orthographic and morphological structure.
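The script disparity behind this effect is visible even at the encoding level: a byte-level BPE vocabulary must spend more merges per character on scripts that need more UTF-8 bytes. A minimal illustration in plain Python (not the paper's OSPE implementation; the example strings are our own):

```python
# Illustrative sketch: average UTF-8 bytes per character shows why
# byte-level BPE fragments abugida/abjad scripts more than Latin text.
def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of `text`."""
    return len(text.encode("utf-8")) / len(text)

samples = {
    "Latin": "namaste",       # 1 byte per character
    "Devanagari": "नमस्ते",    # 3 bytes per character
    "Arabic": "سلام",         # 2 bytes per character
}
for script, s in samples.items():
    print(f"{script}: {bytes_per_char(s):.1f} bytes/char")
```

Under a fixed vocabulary budget, the higher byte load translates into longer, more fragmented token sequences for Devanagari and Perso-Arabic text, which is one mechanism behind the instability the abstract describes.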
Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction
Rabab Alkhalifa
Framing detection in Arabic social media is difficult due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM-based weak supervision methods typically rely on label aggregation, which is brittle when annotations are few and socially dependent. We propose a reliability-aware weak supervision framework that shifts the focus from label fusion to data curation. A small multi-agent LLM pipeline—two framers, a critic, and a discriminator—treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates. These estimates guide a QUBO-based subset selection procedure that enforces frame balance while reducing redundancy. Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.
Optimizer Choice and Calibration for QARiB on Arabic-Script Social Media Offensive Language Detection
Auda Elshokry | Mohammed Alhanjouri
Optimizer choice is a central hyperparameter in fine-tuning transformer models, yet its impact remains under-studied for Arabic-script social media classification under class imbalance. We compare Adam, AdamW, and SGD for fine-tuning QARiB on two Arabic offensive-language benchmarks, OffensEval20 and MPOLD, using a controlled grid over learning rate, weight decay, and warmup, and report test-set performance as mean (std) over three random seeds. Minority-class discrimination is evaluated using macro-F1 and AUC-PR, while calibration is assessed via expected calibration error (ECE), reliability diagrams, and proper scoring rules (Brier score and negative log-likelihood, NLL). Across both datasets, AdamW and Adam are consistently strong and closely matched when properly tuned, whereas SGD substantially underperforms under the same tuning budget and exhibits higher seed sensitivity. We observe non-trivial miscalibration across optimizers; post-hoc temperature scaling offers a low-cost adjustment, yielding modest, dataset-dependent changes in calibration while preserving ranking-based discrimination. We further evaluate a practical decision-rule step by optimizing the classification threshold on the validation set and applying it to test predictions, and provide qualitative examples illustrating typical optimizer-dependent confidence behaviors. In practice, for Arabic offensive-language detection under imbalance, we recommend starting from a tuned AdamW or Adam baseline; when calibrated probabilities are required for thresholding or triage, temperature scaling can be applied. We will release a reproducible pipeline to support further evaluation of optimizer–calibration trade-offs in Arabic-script safety tasks.
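The two calibration tools named in the abstract, temperature scaling and ECE, are simple to state. A hedged plain-Python sketch (the bin count, logits, and example values are ours, not the paper's setup):

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; T > 1 flattens (softens) the distribution,
    which is how post-hoc temperature scaling reduces overconfidence."""
    z = [x / T for x in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: bin predictions by confidence, then
    average |accuracy - mean confidence| weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total, err = len(confidences), 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        err += (len(b) / total) * abs(acc - avg_conf)
    return err

logits = [2.0, 0.5, -1.0]
print(max(softmax(logits, T=1.0)), max(softmax(logits, T=2.0)))
```

Because dividing all logits by the same T does not change their ranking, temperature scaling leaves ranking-based metrics such as AUC untouched, exactly the property the abstract relies on.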
We introduce the Tarab Corpus, a large-scale cultural and linguistic resource that brings together Arabic song lyrics and poetry within a unified analytical framework. The corpus comprises 2.56 million verses and more than 13.5 million tokens, making it, to our knowledge, the largest open Arabic corpus of creative text spanning both classical and contemporary production. Tarab is broadly balanced between songs and poems and covers Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties: Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi Arabic. The artists and poets represented in the corpus are associated with 28 modern nation states and multiple historical eras, covering over fourteen centuries of Arabic creative expression from the Pre-Islamic period to the twenty-first century. Each verse is accompanied by structured metadata describing linguistic variety, geographic origin, and historical or cultural context, enabling comparative linguistic, stylistic, and diachronic analysis across genres and time. We describe the data collection, normalisation, and validation pipeline and present baseline analyses for variety identification and genre differentiation. The dataset is publicly available on HuggingFace at https://huggingface.co/datasets/drelhaj/Tarab.
LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models
Ahmed Khamis | Hesham Ali Ahmed
Despite the advances in neural text-to-speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic, the most widely understood Arabic dialect, severely under-resourced. We address this gap by introducing NileTTS: 38 hours of transcribed speech from two speakers across diverse domains including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLMs) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate against the baseline model trained on other Arabic dialects. Our contributions include: (1) the first publicly available Egyptian Arabic TTS dataset, (2) a reproducible synthetic data generation pipeline for dialectal TTS, and (3) an open-source fine-tuned model. All resources are released to advance Egyptian Arabic speech synthesis research.
HCMUS_PrompterXPrompter at AbjadMed: When Classification Meets Retrieval: Taming the Long Tail in Arabic Medical Text Classification
Duy Minh Dao Sy | Trung Kiet Huynh | Nguyen Dinh Ha Duong | Nguyen Chi Tran | Phu Quy Nguyen Lam | Hoa Pham Phu
Medical text classification is high-stakes work, yet models often falter precisely where they are needed most: on rare, critical conditions buried in the long tail of the data distribution. In the EACL 2026 ABJAD-NLP Shared Task, we confronted this challenge with a dataset of Arabic medical questions heavily skewed towards a few common topics, leaving dozens of categories with fewer than ten examples. We present HybridMed, a system that effectively tames this long tail by marrying the semantic generalization of a fine-tuned Arabic BERT model with the precise, instance-based memory of k-nearest neighbor retrieval. This complementary union allowed our system to achieve a macro-F1 score of 0.4902, demonstrating that for diverse and imbalanced medical data, the whole is indeed greater than the sum of its parts.
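A common way to "marry" a fine-tuned classifier with instance-based kNN memory is to interpolate their class distributions; the sketch below shows that general scheme (the exact fusion used by HybridMed may differ, and the labels and weights here are our own illustration):

```python
def interpolate(p_model, p_knn, lam=0.7):
    """Blend classifier probabilities with kNN-retrieval probabilities:
    p(c) = lam * p_model(c) + (1 - lam) * p_knn(c).
    Retrieval can surface long-tail classes the classifier under-predicts."""
    classes = set(p_model) | set(p_knn)
    return {c: lam * p_model.get(c, 0.0) + (1 - lam) * p_knn.get(c, 0.0)
            for c in classes}

p_model = {"cardiology": 0.6, "dermatology": 0.4}       # smooth but head-biased
p_knn = {"cardiology": 0.2, "rare_condition": 0.8}      # retrieval hits a tail class
print(interpolate(p_model, p_knn))
```

The blend keeps the encoder's semantic generalization while letting a single memorized nearest neighbor rescue a rare category, which is the complementarity the abstract credits for its macro-F1 gain.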
KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR
Henry Gagnier | Sophie Gagnier | Ashwin Kirubakaran
Kazakh is a Turkic language using the Arabic, Cyrillic, and Latin scripts, making it unique in terms of optical character recognition (OCR). Work on OCR for low-resource Kazakh scripts is very scarce, and no OCR benchmarks or images exist for the Arabic and Latin scripts. We construct a synthetic OCR dataset of 7,219 images for all three scripts with font, color, and noise variations to imitate real OCR tasks. We evaluated three multimodal large language models (MLLMs) on a subset of the benchmark for OCR and language identification: Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct. All models are unsuccessful with Latin and Arabic script OCR, and fail to recognize the Arabic script as Kazakh text, misclassifying it as Arabic, Farsi, and Kurdish. We further compare MLLMs with a classical OCR baseline and find that while traditional OCR has lower character error rates, MLLMs fail to match this performance. These findings show significant gaps in current MLLM capabilities to process low-resource Abjad-based scripts and demonstrate the need for inclusive models and benchmarks supporting low-resource scripts and languages.
Seeing Words Differently: Visual Embeddings for Robust English-Arabic Machine Translation
Mahdi Alshaikh Saleh | Irfan Ahmad
Context: Natural Language Processing (NLP) has become an essential field with widespread applications across domains such as Large Language Models (LLMs). One of the core applications of NLP is machine translation (MT). A major challenge in MT is handling out-of-vocabulary (OOV) words and spelling mistakes, which can lead to poor translation quality. Objective: This study compares traditional text-based embeddings with visual embeddings for English-to-Arabic translation. It investigates the effectiveness of each approach, especially in handling noisy inputs or OOV terms. Method: Using the IWSLT 2017 English-Arabic dataset, we trained a baseline transformer encoder-decoder model using standard text embeddings and compared it with models using several visual embedding strategies, including vowel-removal preprocessing and trigram-based image rendering. The translated outputs were evaluated using BLEU scores. Results: Although traditional BPE-based models achieve higher BLEU on clean data, visual embedding models are substantially more robust to spelling noise, retaining up to 2.4× higher BLEU scores at 50% character corruption.
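Robustness results like "BLEU at 50% character corruption" presuppose a noise-injection procedure. A minimal sketch of one plausible scheme, random character substitution at a fixed rate (the paper's exact corruption model is not specified here; the alphabet and rate are our assumptions):

```python
import random

def corrupt(text: str, rate: float, seed: int = 0) -> str:
    """Replace a fraction `rate` of characters with random letters,
    simulating spelling noise for robustness evaluation."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    chars = list(text)
    n_noise = int(len(chars) * rate)
    for i in rng.sample(range(len(chars)), n_noise):
        chars[i] = rng.choice(alphabet)
    return "".join(chars)

clean = "the cat sat on the mat"
print(corrupt(clean, 0.5))
```

Scoring the model's output on progressively corrupted inputs (rate 0.0 to 0.5) produces the degradation curves such comparisons are based on.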
Character-Level Transformer for Tajik–Persian Transliteration with a Parallel Lexical Corpus
Arabov Mullosharaf Kurbonovich
This study addresses automatic transliteration from Tajik (Cyrillic script) to Persian (Perso-Arabic script). We present a curated, lexicographically verified parallel corpus of 52,152 Tajik–Persian words and short phrases, compiled from printed dictionaries, encyclopedic sources, and manually verified online resources. To the best of our knowledge, this is one of the largest publicly available word-level corpora for Tajik–Persian transliteration. Using this corpus, we train a character-level sequence-to-sequence Transformer model and evaluate it using Character Error Rate (CER) and exact-match accuracy. The best Transformer configuration with beam search (k=3) achieves a CER of 0.3182 and an exact-match accuracy of 0.3215, achieving lower error rates than dictionary-based rule-based and recurrent neural baselines. We describe the data collection and preprocessing pipeline, model architecture, and experimental protocol, and report a part-of-speech analysis showing performance differences across lexical categories. All resources (dataset, preprocessing scripts, splits, and training configurations) will be released publicly to ensure reproducibility and facilitate future work on Tajik–Persian transliteration, cross-script NLP, and lexicographic applications.
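The Character Error Rate metric used above is edit distance normalized by reference length. A self-contained sketch (standard definition, not the authors' evaluation code):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic
    programming, keeping only one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(hyp, ref) / len(ref)

print(cer("kitten", "sitting"))  # 3 edits over 7 reference characters
```

For transliteration, `hyp` would be the model's Perso-Arabic output and `ref` the dictionary-verified form; exact-match accuracy is simply the fraction of items with CER zero.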
Arabic Dialect Translation with Small LLMs: Enhancing through Reasoning-Oriented Reinforcement Learning
Sohaila Abdulsattar | Keith Ross
Arabic dialect↔English machine translation remains difficult due to extreme dialect variation, inconsistent orthography, and limited parallel data. Moreover, dialect translation is often needed in remote regions or by economically-disadvantaged communities, which often operate in compute-constrained or offline settings. Motivated by these concerns, in this paper we explore optimizing Arabic dialect↔English translators that run over small LLMs, which could be implemented on small offline devices. We show that reasoning-oriented reinforcement learning can substantially improve small multilingual LLMs for Arabic dialect translation. Using the MADAR corpus, small Qwen-2.5 models trained with a think-then-translate template and optimized with Group-Relative Policy Optimization using a SacreBLEU reward outperform a much larger 7B baseline trained with supervised fine-tuning. The dialect-to-English BLEU score more than doubles from 17.4 to 34.9, while the English-to-dialect COMET score improves from 0.57 to 0.73.
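The "Group-Relative" part of GRPO refers to normalizing each sampled translation's reward against the other samples for the same source sentence. A hedged sketch of that advantage computation (the reward values are invented; the full GRPO objective also involves clipped policy ratios, omitted here):

```python
import statistics

def group_relative_advantages(rewards):
    """For one source sentence, normalize each sample's reward within
    its group: advantage_i = (r_i - mean(group)) / std(group).
    Samples beating their group mean get positive advantage."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# e.g. SacreBLEU rewards for 4 sampled translations of one sentence
rewards = [17.0, 34.0, 22.0, 27.0]
print(group_relative_advantages(rewards))
```

Because the baseline is the group mean, no separate value network is needed, which is part of what makes the method practical for small models.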
MedArabs at AbjadMed: Arabic Medical Text Classification via Data- and Algorithm-Level Fusion
Amrita Singh
In this work, we address the challenges of Arabic medical text classification, focusing on class imbalance and the complexity of the language’s morphology. We propose a multiclass classification pipeline based on Data- and Algorithm-Level fusion, which integrates the optimal Back Translation technique for data augmentation with the Class Balanced (CB) loss function to enhance performance. The domain-specific AraBERT model is fine-tuned using this approach, achieving competitive results. On the official test set of the AbjadMed task, our pipeline achieves a Macro-F1 score of 0.4219, and it achieves 0.4068 on the development set.
GATech at AbjadMed: Bidirectional Encoders vs. Causal Decoders: Insights from 82-Class Arabic Medical Classification
Ahmed Khamis
This paper presents a system description for Arabic medical text classification across 82 distinct categories. Our primary architecture utilizes a fine-tuned AraBERTv2 encoder enhanced with a hybrid pooling strategy, combining attention and mean representations, and multi-sample dropout for robust regularization. We systematically benchmark this approach against a suite of multilingual and Arabic-specific encoders, as well as several large-scale causal decoders, including zero-shot re-ranking via Llama 3.3 70B and feature extraction from Qwen 3B hidden states. Our findings demonstrate that specialized bidirectional encoders significantly outperform causal decoders in capturing the precise semantic boundaries required for fine-grained medical text classification. We show that causal decoders, optimized for next-token prediction, produce sequence-biased embeddings that are less effective for categorization compared to the global context captured by bidirectional attention. Despite significant class imbalance and label noise identified within the training data, our results highlight the superior semantic compression of fine-tuned encoders for specialized Arabic NLP tasks. Final performance metrics on the test set, including Accuracy and Macro-F1, are reported and discussed.
Named Entity Recognition (NER) models trained on clean text often fail on real-world data containing orthographic noise. Work on NER for Persian is emerging, but it has not yet explored the orthographic robustness of models to perturbations often exhibited in user-generated content. We evaluate ParsBERT, ParsBERT v2.0, BertNER, and two XLM-r-based models on a subset of Persian-NER-Dataset-500k after applying eleven different perturbations, including simulated typos, code-switching, and segmentation errors. All models were competitive with each other, but XLM-r-large consistently displayed the best robustness to perturbations. Code-switching, typos, similar character swaps, segmentation errors, and noisy text all decreased F1 scores, while Latinized numbers increased F1 scores in ParsBERT. Removing diacritics, zero-width non-joiners, and normalizing Yeh/Kaf all did not have an effect on F1. These findings suggest that Persian NER models require improvement for performance on noisy text, and that the Perso-Arabic script introduces unique factors into NER not present in many high-resource languages, such as code-switching and Eastern Arabic numerals. This work creates a foundation for the development of robust Persian NER models and highlights the necessity of evaluating low-resource NER models under challenging and realistic conditions.
ArabicMedicalBERT-QA-82 at AbjadMed: Fighting Class Imbalance in Arabic Medical Text Classification
Gleb Shanshin
We present a supervised system for Arabic medical question-answer classification developed for the AbjadMed shared task. The task involves assigning one of 82 highly imbalanced medical categories and is evaluated using macro-averaged F1. Our approach builds on an AraBERT model further pretrained on a related Arabic medical classification dataset. Under a unified fine-tuning setup, this domain-adapted model consistently outperforms general-purpose Arabic backbones, with the best results obtained using a low backbone learning rate, indicating that only limited adaptation is required. The final system achieves a macro F1 score of 0.51 on the private test split. For comparison, we evaluate several cost-efficient large language models under constrained prompting and observe substantially lower performance.
KvochurHegel at AbjadMed: Combining LDAM Loss and Adversarial Training for Arabic Medical Question-Answer Classification
Minh-Hoang Le
This paper describes our team’s submission to AbjadMed at AbjadNLP 2026. The task involves classifying Arabic medical question-answer pairs into 82 categories, characterized by a long-tail distribution and significant semantic overlap. While domain-specific Arabic models exist, they are primarily optimized for Named Entity Recognition or span-extraction tasks rather than high-cardinality sequence classification. Consequently, our system adopts a robust optimization approach using a general-purpose encoder. We utilize ARBERTv2 as the backbone, employing Label-Distribution-Aware Margin (LDAM) loss to mitigate class imbalance and Fast Gradient Method (FGM) adversarial training to enhance generalization boundaries. Our approach achieves a Macro-F1 score of 0.4028 on the private test set, demonstrating that advanced optimization techniques can yield competitive performance on specialized taxonomies without requiring domain-specific pre-training.
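LDAM assigns each class a margin inversely proportional to the fourth root of its training count, so tail classes are pushed further from decision boundaries. A small sketch of the margin schedule (the constant C and counts are illustrative, not the team's tuned values):

```python
def ldam_margins(class_counts, C=0.5):
    """LDAM margin per class: m_j = C / n_j^(1/4).
    Rarer classes receive larger margins; during training, m_j is
    subtracted from the true-class logit before the softmax loss."""
    return [C / (n ** 0.25) for n in class_counts]

counts = [1000, 100, 10]  # head, mid, and tail classes of a long-tail task
print(ldam_margins(counts))
```

The n^(1/4) scaling comes from the generalization-bound argument in the original LDAM derivation; it grows margins for tail classes far more gently than a 1/n weighting would.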
baellouf at AbjadMed: Efficient Fine-tuning with All-Linear LoRA for Arabic Medical QA Classification
Abdallah Khallouf
We describe our system for the AbjadMed shared task on Arabic medical text classification at AbjadNLP 2026. Our approach combines efficient fine-tuning of Qwen3-8B using QLoRA with a Dice+CrossEntropy hybrid loss designed for Macro F1 optimization. Taking inspiration from recent research on optimal LoRA configurations, we apply low-rank adapters to all linear layers of the model rather than attention layers only, which we validate improves performance by 4.0 points. We also explore data augmentation through machine translation of external medical QA data, though this did not improve generalization. Our best submission achieves a Macro F1 score of 0.4441 on the test set.
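A Dice term is often added to cross-entropy because, unlike CE, it tracks an overlap measure closer to F1. The sketch below shows one common shape of such a hybrid on the true-class probability; it is our guess at the general form, not the team's exact formulation or coefficients:

```python
import math

def dice_ce(p_true: float, alpha: float = 0.5) -> float:
    """Hybrid loss on true-class probability p_true:
    alpha * CE + (1 - alpha) * soft-Dice, where the soft-Dice term
    1 - 2p/(p+1) directly rewards raising p (F1-like behavior)."""
    ce = -math.log(p_true)
    dice = 1.0 - (2.0 * p_true) / (p_true + 1.0)
    return alpha * ce + (1.0 - alpha) * dice

print(dice_ce(0.9), dice_ce(0.1))  # confident-correct vs. hard example
```

Optimizing an overlap-style term alongside CE is one standard route to nudging training toward the Macro-F1 objective the task is scored on.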
Supachoke at AbjadMed: Enhancing Arabic Medical Text Classification Using Fine-Tuned AraBERT
Thanh Phu Nguyen | Tuan Thai Huy Nguyen Cu | Son Thai Pham | Tri Duy Ho Nguyen
Medical text classification is an important task in healthcare NLP, yet Arabic medical texts remain underexplored due to linguistic complexity and limited annotated data. In this paper, we study the effectiveness of AraBERT, a pre-trained Arabic transformer model, for Arabic medical text classification. We fine-tune AraBERT on a labeled medical dataset and evaluate its performance using standard classification metrics. Experimental results show that our fine-tuned AraBERT model achieves a private leaderboard score of 0.4076 and ranks 13th among participating teams, outperforming classical machine learning baselines and other transformer variants. These findings highlight the potential of transformer-based approaches for Arabic medical NLP and motivate further research.
REIGNITE at AbjadMed: Imbalance-Aware Fine-Tuning of Pretrained Arabic Transformers for Arabic Medical Text Classification Task
Nahid Montasir Rifat | Foyez Ahmed Dewan
This paper presents our system developed for the AbjadNLP Shared Task 4 on Medical Text Classification in Arabic, which aims to assign Arabic medical question-answer pairs to a predefined set of medical categories. The task poses significant challenges due to severe class imbalance across 82 categories and the linguistic complexity of domain-specific Arabic medical text. To address these challenges, we propose an imbalance-aware training framework that combines targeted data augmentation for minority classes with class-weighted focal loss during fine-tuning. We evaluate multiple Arabic pretrained transformer models under a unified training configuration and further improve robustness through a majority-voting ensemble of the best-performing models. Our approach achieves competitive performance, ranking 15th on the private leaderboard with a macro F1 score of 0.4052, demonstrating the effectiveness of combining different data augmentation techniques, imbalance-aware training objectives, and ensemble learning for large-scale, highly imbalanced Arabic medical text classification. The code is available on GitHub.
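The class-weighted focal loss mentioned above down-weights easy examples so gradient signal concentrates on hard, typically minority-class instances. A minimal sketch (gamma and the example probabilities are illustrative defaults, not the team's tuned values):

```python
import math

def focal_loss(p_true: float, weight: float = 1.0, gamma: float = 2.0) -> float:
    """Class-weighted focal loss on the true-class probability:
    FL = -w * (1 - p)^gamma * log(p).
    With gamma > 0, confident correct predictions (p near 1) are
    down-weighted relative to plain cross-entropy."""
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)

# An easy example contributes far less than a hard one
print(focal_loss(0.95), focal_loss(0.3))
```

Setting `weight` per class (e.g. inverse-frequency) combines the focal modulation with the class weighting the abstract describes.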
Tashkees-AI at AbjadMed 2026: Flat vs. Hierarchical Classification for Fine-Grained Arabic Medical QA
Fatimah Mohamed Emad Eldin
This paper describes Tashkees-AI, a system developed for the AbjadMed 2026 Shared Task on Arabic Medical Question Classification. A comprehensive empirical study was conducted across 82 fine-grained categories, investigating three paradigms: fine-tuned encoder models, hierarchical classification, and ensemble methods. Leveraging a dataset of 27k Arabic medical question-answer pairs, extensive ablation studies were conducted, comparing MARBERTv2, CAMeLBERT, two-stage hierarchical classifiers, and RAG-based approaches. The findings reveal that fine-tuned MARBERTv2 with data cleaning yields the best performance, achieving a macro F1-score of 0.3659 on the blind test set. In contrast, hierarchical methods surprisingly underperformed (0.332 F1) due to error propagation. The system ranked 26th on the official leaderboard.
MetaSwarm at AbjadMed: Forensic Optimization and Class-Balanced Discovery for Medical Diglossia in Abjad Scripts
Rahul Jaisy
The classification of diglossic medical text presents a high-dimensional challenge defined by extreme class imbalance (N = 82) and the orthographic ambiguity of unvocalized Abjad scripts. While standard supervised learning often collapses into majority-class prediction due to the "Long Tail" distribution, we introduce a Human-in-the-Loop Forensic Optimization framework. Unlike static end-to-end pipelines, our approach decouples strategic hyperparameter tuning from high-throughput tactical execution (Elastic Compute). We leverage a rigorous Class-Balanced Focal Loss (CBFL) derived from the "Effective Number of Samples" theory (E_n) to stabilize the decision manifold against stochastic class dominance. Using a CAMELBERT-DA backbone optimized via a custom weighted trainer on Dual H200 GPUs, our system achieved a robust Public Leaderboard score of 0.3588. We further perform a "Linguistic Error Topology" analysis, utilizing UMAP projections and attention saliency, to demonstrate that generalization gaps are driven by dialectal "Constraint Drift" rather than stochastic model failure.
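The "Effective Number of Samples" weighting behind class-balanced losses of this kind is compact enough to sketch directly (beta and the counts below are illustrative; the team's hyperparameters are not given here):

```python
def cb_weights(class_counts, beta=0.999):
    """Class-balanced weights from the 'Effective Number of Samples':
    E_n = (1 - beta^n) / (1 - beta), weight_c proportional to 1 / E_n,
    normalized so the weights sum to the number of classes.
    As n grows, E_n saturates, so weights do not explode for tiny classes
    the way raw 1/n weighting does."""
    eff = [(1.0 - beta ** n) / (1.0 - beta) for n in class_counts]
    raw = [1.0 / e for e in eff]
    scale = len(class_counts) / sum(raw)
    return [w * scale for w in raw]

counts = [5000, 500, 5]  # a long-tail distribution: head, mid, rare class
print(cb_weights(counts))
```

Multiplying these weights into a focal loss term yields the CBFL family the abstract refers to.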
QurSci-Onto: A Hierarchical Ontology and Dataset for Scientific Exegesis in the Quran
Ibad-ur-Rehman Rashid | Junaid Hussain | Sadam Al-Azani
This paper introduces resources for the computational study of scientific exegesis (Tafsir Ilmi): a structured ontology, a curated dataset of 194 scientifically relevant Quranic verses linked to 260 exegetical records from two authoritative Tafsir books, and an annotation framework that organizes scientific references by topic and sequential context. Existing Quranic resources treat verses as unstructured text, losing the logical order and causal relationships of scientific concepts documented in exegesis. To address this, we present QurSci-Onto, a three-layer ontology that categorizes verses by scientific domain, links them to authoritative Tafsir, and provides a framework for representing sequential processes through stage-based annotations. Our dataset includes page-level citations and covers 8 major scientific topics across 73 nodes. While the full corpus is tagged with broad categories and scientific topics, a specialized subset features granular node-level mappings to capture complex scientific narratives. We release QurSci-Onto as a foundational resource for Arabic semantic NLP and demonstrate that it enables significant improvements in semantic retrieval and enables multi-hop sequential reasoning capabilities over unstructured baselines.
AjamiMorph: Zero-Annotation Morphological Discovery for Hausa Ajami via Multi-Method Consensus
Soumedhik Bharati | Shibam Mandal | Prithwish Ghosh | Swarup Kr Ghosh | Sayani Mondal
Hausa Ajami (Hausa written in Arabic script) remains severely under-resourced for computational morphology. We present AjamiMorph, a zero-annotation framework that discovers morphemes through consensus among three unsupervised methods, namely Byte Pair Encoding (BPE), transition-based boundary detection using Pointwise Mutual Information (PMI), and linguistically informed Distributional Affix Mining (DAM). Using a Hausa Ajami Bible corpus consisting of 637,414 tokens, AjamiMorph identifies 1,611 high-confidence morphemes, achieving 99.9% coverage. The inventory exhibits a linguistically realistic distribution (66.0% stems, 22.6% suffixes, 11.4% prefixes) and recovers 77.8% of known Hausa affixes. A permutation test that shuffles method assignments (preserving per-method selection sizes) confirms that the observed agreement is above chance; chi-square remains a secondary check. A lightweight 5-gram LM comparison (characters vs. consensus morphemes) provides an extrinsic signal. We also report negative results for script-driven Arabic assumptions and LLM-first annotation. This work provides the first unsupervised morpheme inventory for Hausa Ajami and demonstrates consensus as a robust strategy for zero-resource morphology.
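The PMI-based component rests on a simple idea: adjacent characters that co-occur rarely relative to their individual frequencies mark likely morpheme boundaries. A toy sketch on English words (not the authors' implementation; the threshold and corpus are ours, and the real system works on Ajami text):

```python
import math
from collections import Counter

def pmi_segmenter(words, threshold=0.0):
    """Build a character-transition PMI model from a word list and return
    a function that cuts a word wherever adjacent-character PMI drops
    below `threshold` (weak association suggests a boundary)."""
    unigrams, bigrams = Counter(), Counter()
    for w in words:
        unigrams.update(w)
        bigrams.update(zip(w, w[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

    def pmi(a, b):
        p_ab = bigrams[(a, b)] / n_bi
        return math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

    def segment(word):
        pieces, prev = [], 0
        for i in range(1, len(word)):
            if pmi(word[i - 1], word[i]) < threshold:
                pieces.append(word[prev:i])
                prev = i
        pieces.append(word[prev:])
        return pieces

    return segment

seg = pmi_segmenter(["walked", "talked", "walking", "talking", "walks", "talks"])
print(seg("walking"))
```

In a consensus framework, boundaries proposed here would only survive if BPE merges and distributional affix mining independently agree.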
Morphological Feature Extraction for Fine-Grained Sorani Kurdish Dialect Identification: A Hybrid Transformer-Linguistic Approach
Soumedhik Bharati | Shibam Mandal | Subham Majumdar | Swarup Kr Ghosh | Sayani Mondal
As reported, approximately 6 million people in Iraq and Iran speak Sorani Kurdish, which exhibits substantial regional variation but lacks computational resources for dialect identification. We present the first fine-grained sub-dialect classification system for six Sorani varieties: Sulaymaniyah, Erbil, Iranian Sorani, Ardalani, Babani, and Mukriani. This investigation combines cross-lingual contextual embeddings (XLM-RoBERTa) with morphological features derived from explicit linguistic rules, including 24 patterns capturing verb prefixes, pronominal clitics, and definite markers. The proposed morphology-augmented XLM-R model was trained on a unified dataset of 16,409 sentences without manual annotation, and achieves 91.91% accuracy, outperforming pure transformers (91.79%) and traditional machine learning baselines (SVM 86.41%). Key ablation studies reveal that morphological features serve as effective regularizers for geographically proximate dialects.
Olga Snissarenko at AbjadMed: Arabic Clinical Text Classification with AraBERT: Results from the AbjadMed Shared Task
Olga Snissarenko
We present a solution for the Arabic medical text classification task, formulated as a multi-class classification problem with 82 medical categories. The task is challenging due to severe class imbalance, long and heterogeneous input texts, and the presence of domain-specific medical terminology in Modern Standard Arabic. Our approach is based on fine-tuning pretrained AraBERT models with a focus on loss-level imbalance handling rather than architectural complexity. Through a systematic comparison of multiple AraBERT-based configurations, we show that class-weighted loss combined with simple mean pooling yields the strongest performance. Our best model achieves a macro-F1 score of 0.387 on the public evaluation set and 0.411 on the private test set.
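The "simple mean pooling" favored above means averaging token vectors while excluding padding positions. A minimal sketch in plain Python (real systems do this with tensors and an attention mask; the toy vectors are ours):

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors over positions where mask is 1,
    ignoring padding (mask 0), to get one sentence vector."""
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden, mask):
        if m:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

hidden = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last position is padding
print(masked_mean_pool(hidden, [1, 1, 0]))
```

Combined with a class-weighted loss, this gives the deliberately simple head the abstract argues outperforms heavier architectural additions.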
From Classical to Contemporary: Evolutionary Analysis & Classification of Urdu Poetry
Noor Fatima | Hasan Faraz Khan | Irfan Ahmad
Automatic classification of literary text by historical era can support literary analysis and reveal stylistic evolution. We study this problem for Urdu poetry across three eras: classical, modern, and contemporary. We introduce a new dataset of 10,026 four-line Urdu poetry segments collected from online archives (Rekhta and UrduPoint) and labeled by era. To handle Urdu’s script and orthographic variability, we apply standard preprocessing, including Unicode normalization and removal of diacritics and non-Urdu characters. We benchmark a range of approaches, from traditional machine learning classifiers to deep learning models, including fine-tuned Urdu BERT-style transformers. To assess generalization, we evaluate under two regimes: (i) a standard stratified random split and (ii) a stricter author-disjoint split that ensures poets do not overlap between training and test sets. On the random split, the best traditional models achieve about 70-73% accuracy, suggesting era-related stylistic cues are learnable. However, performance drops to roughly 58-60% under the author-disjoint split, highlighting the difficulty of generalizing across unseen poets and the risk of overestimating performance via author-specific leakage. Notably, fine-tuned transformers do not surpass simpler TF-IDF-based baselines, indicating that era cues may be subtle and that data limitations constrain more complex models.
Alkhalil Corpus: An Open-Source Thematic and Lemmatized Corpus for Modern Standard Arabic
Samir Belayachi | Azzeddine Mazroui
The availability of large annotated corpora remains a major challenge for the development of natural language processing systems for under-resourced languages such as Arabic. In this paper, we present two annotated corpora dedicated to Modern Standard Arabic. These corpora are open-source and freely available on the Hugging Face platform. The first corpus, annotated by theme and designed to provide a balanced representation of contemporary Arabic usage, comprises approximately 76 million words collected from diverse sources covering multiple domains and geographical regions. The second corpus, containing approximately one million words, is a sub-corpus extracted from the first. It was annotated with lemma tags using a semi-automatic approach that combines automatic annotation with the Alkhalil lemmatizer and MADAMIRA, followed by manual validation.
Enhancing Urdu Sentiment Classification through Instruction-Tuned LLMs and Cross-Lingual Transfer
Hasan Faraz Khan | Noor Fatima | Irfan Ahmad
Sentiment analysis in low-resource languages such as Urdu poses unique challenges due to limited annotated data, morphological complexity, and significant class imbalance in most publicly available datasets. This study addresses these issues through two experimental strategies. First, we explore class imbalance mitigation by using instruction-tuned large language models (LLMs) to generate synthetic negative sentiment samples in Urdu. This augmentation strategy results in a more balanced dataset, which significantly improves the recall and F1-score for minority class predictions when fine-tuned using a multilingual BERT model. Second, we investigate the effectiveness of translating Urdu text into English and applying sentiment classification through a pre-trained English language model. Comparative evaluation reveals that the translation-based pipeline, using a RoBERTa model fine-tuned for English sentiment classification, achieves superior performance across major metrics. Our results suggest that LLM-based augmentation and cross-lingual transfer via translation both serve as viable approaches to overcome data scarcity and performance limitations in sentiment analysis for low-resource languages. The findings highlight the potential applicability of these approaches to other under-resourced linguistic domains.
Back-of-the-book indexes (BoBIs) are crucial for book readability, but their manual creation is laborious and error-prone. In this paper, we introduce ArBoBIM to automate BoBI extraction and review for Arabic books. Given a book with a corresponding BoBI, ArBoBIM extracts BoBI terms, identifies their occurrences, and aligns them across several versions of the book. ArBoBIM first defines a pool of candidates for each term by leveraging noun phrases and named entities, then uses several metrics, including exact matches, morpho-lexical similarity, and semantic similarity, to determine the best candidates. We empirically fine-tuned thresholds for ArBoBIM and achieve an F1-score of 0.94 (precision = 0.97, recall = 0.91). These results are significantly better than baseline results and top LLM-based results, with lower computational cost and no publishing-house IP risks. Additionally, over 500 books have been processed with ArBoBIM, resulting in the ArBoBIMap dataset, which contains books, their terms, occurrences, and related metadata, and will be made publicly available. This dataset is used to train a model that decides whether a term, given its features, should be added to the back-of-the-book index of a specific book. The model achieves an F1-score of 0.91 (precision = 0.97, recall = 0.85).
Improving on State-of-the-Art Models for Sentiment Analysis on Saudi-English Code-Switching Text
Samaher Alghamdi | Paul Rayson | Reem Alotibi
Inserting English words, phrases, or sentences while writing or speaking in the Saudi Arabic dialect has become a widespread phenomenon in Saudi society. This phenomenon is linguistically called code-switching. It remains unclear how current sentiment analysis methods perform on Saudi-English code-switching text. In this paper, we address this gap by conducting the first sentiment analysis study on Saudi-English code-switching text. We present the first Saudi-English Sentiment Analysis Code Switching Dataset (SESA-CSD) and establish baseline results on this dataset. By evaluating multiple state-of-the-art small language models, we achieve improvements over the baseline of 3% to 11% in both accuracy and macro-F1. Among all small language models, XLM-RoBERTa achieved the highest performance, with an accuracy of 95.50% and a macro-F1 of 95.53%. Our findings indicate that multilingual and Arabic small language models, such as XLM-RoBERTa, GigaBERT, and SaudiBERT, consistently outperform bilingual Arabic-English large language models, such as Fanar and ALLaM, across zero-shot and multiple few-shot settings.
OMAN-SPEECH: A Multi-Layer Annotated Speech Corpus for Omani Arabic Dialects
Rayyan S. Al Khadhuri | Firas Al Mahrouqi | Salim Al Mandhari | Amir Azad Al-Kathiri | Omar Said Alshahri | Ghassab Mansoor Alsaqr | Badri Abdulhakim Mudhsh | Tarek Fatnassi
Automatic Speech Recognition (ASR) has achieved strong performance in high-resource languages; however, Dialectal Arabic remains significantly under-resourced. This gap is particularly evident in Oman, where Arabic exhibits substantial sociolinguistic variation shaped by settlement patterns between sedentary (Hadari) and nomadic (Badu) communities, which are often overlooked by urban-centric or generalized Gulf Arabic datasets. We introduce OMAN-SPEECH, a sociolinguistically stratified spoken corpus for Omani Arabic comprising approximately 40 hours of spontaneous and semi-spontaneous speech from 32 speakers across 11 Wilayats (provinces). The corpus is balanced to capture regional and lifestyle variation and is annotated at the sentence level with Arabic transcription, English translation, and phonetic transcription using the International Phonetic Alphabet (IPA) through a human-in-the-loop annotation pipeline. OMAN-SPEECH provides a foundational resource for evaluating ASR and related speech technologies on Omani and Gulf Arabic varieties and supports more granular modeling of regional dialectal variation.
Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
Hasan Abed Al Kader Hammoud | Mohamad Bilal Zbib | Bernard Ghanem
We present HALA, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR↔EN teacher to FP8 (yielding ~2× higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model, LFM2-1.2B, is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train HALA models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, HALA achieves state-of-the-art results within both the "nano" (≤2B) and "small" (7–9B) categories, outperforming their base models. We commit to releasing models, data, evaluation code, and recipes to accelerate research in Arabic NLP.
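Slerp (spherical linear interpolation) merging, which the abstract applies to model weights, interpolates along the arc between two parameter vectors rather than the straight line between them. A minimal, library-free sketch of the general technique (an illustration, not the HALA merging code):

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b at fraction t.
    Falls back to linear interpolation when the vectors are near-parallel."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    cos_theta = max(-1.0, min(1.0, dot / (na * nb)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:  # near-parallel: slerp degenerates to lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(theta)
    wa = math.sin((1 - t) * theta) / s
    wb = math.sin(t * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

# Halfway between two orthogonal unit vectors stays on the unit circle.
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

At t = 0 and t = 1 the endpoints are recovered exactly; for intermediate t, slerp preserves vector norm better than linear averaging, which is one common motivation for using it when merging model checkpoints.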
Arabic Citation Parsing using Part of Speech and Named Entity Recognition
Youssef Karout | Hadi Hammoud | Fadi Zaraket
This paper introduces an industry-level citation element extractor for Arabic text. Citation element extraction enables editorial task automation for publishing houses, creation of citation networks, and automatic citation analytics for impact analysis firms. Citation library tools help users manage their citations. However, for Arabic, these tools lack basic support to identify and extract citation elements. Consequently, researchers, editors, and reviewers manage Arabic citation tasks manually. We present a novel Arabic citation element dataset, use it to train a citation element extraction model, and apply named entity recognition, morphological analysis, and keyword detection to improve the results for practical use. The paper reports industry-ready performance, with F1 scores ranging between 0.80 and 0.95 for the citation elements of interest.
DeformAR: A Visual Analytics Framework for Evaluation of Arabic Named Entity Recognition
Ahmed Mustafa Younes
Arabic Named Entity Recognition (ANER) presents challenges due to its linguistic characteristics (Qu et al., 2023). While Transformer models have advanced ANER, evaluation still relies heavily on aggregate metrics like F1 score that obscure the interplay between data characteristics, model behaviour, and error patterns. We present DeformAR, a diagnostic visual analytics framework for evaluating and diagnosing Arabic NER systems through structured, component-level analysis and interpretability. DeformAR integrates quantitative metrics with interactive visualizations to support systematic error analysis, dataset and model debugging. In a case study on ANERCorp, DeformAR identifies annotation mistakes, model calibration issues, and subcomponent interaction effects. To our knowledge, this is the first open-source framework for component-level diagnostic evaluation and interpretability in Arabic NER, available at https://github.com/ay94/DeformAR.
Spoken Arabic exhibits substantial dialectal variation across the Arabic-speaking world. This paper presents a corpus-based analysis of Arabic dialectal variation using the SADA corpus, examining lexical, morphosyntactic, and discourse-pragmatic patterns across dialects. We combine quantitative frequency-based measures with qualitative linguistic analysis, including keyword comparison, distributional profiling, collocational and trigram analyses, and similarity-based clustering. Our results show that Arabic dialects share a substantial common core, while differing systematically in frequent discourse markers, evaluative expressions, and recurrent phraseological patterns. These findings provide empirical evidence for regional clustering among contemporary dialects and for variation relative to the standard register. The study contributes linguistic insights that support both Arabic dialectology and the development of dialect-aware NLP systems.
HACS-TL: Cross-Script Transfer Learning for Hausa Ajami Hate Speech Detection Using Transformer-Based Architecture
Abdulkadir Shehu Bichi | Muqaddar Ali | Prashant Sharma | Ismail Dauda Abubakar
Languages written in Arabic-derived scripts face hate speech detection challenges that are worsened by the scarcity of resources and by highly complex linguistic properties. We propose HACS-TL (Hausa Ajami Cross-Script Transfer Learning), a new transformer-based architecture for detecting hate speech in the Ajami script. Hausa is a Chadic language with over 77 million speakers in West Africa; it is written in two scripts, the Latin-based Boko and the Arabic-derived Ajami, which creates new computational difficulties. Our method combines script conversion, cross-script multi-head attention, and dialect feature extraction to model the morphophonological depth of Hausa. Under stratified cross-validation with systematically augmented data, HACS-TL obtains a macro F1 score of 76.09%, a significant improvement over strong multilingual baselines: mBERT (69.17%), XLM-RoBERTa (73.20%), and AraBERT (58.63%). These results demonstrate the effectiveness of cross-script attention and of transferring knowledge from resource-rich scripts to languages with limited script-specific resources, and contribute to advancing NLP resources for Ajami-script Hausa and African languages more broadly.
Code-Switching as a Safety Failure Mode in Large Language Models: An Empirical Study of Roman Urdu across English, Mixed, and Transliteration-Only Inputs
Waleed Jamil | Saima Rafi
Large Language Models exhibit robust safety alignment when harmful intent is expressed in English, yet their resilience to code-switching and transliteration remains underexplored. This paper presents the first targeted investigation of code-switching as a safety failure mode, focusing on Roman Urdu—a widely used transliterated form common in informal and emotionally expressive communication. We introduce the Roman Urdu Adversarial Benchmark (RUAB), a semantically controlled evaluation benchmark designed to isolate linguistic variation from intent across four safety-critical categories: passive suicidal ideation, psychological distress, threat or intimidation, and coercion or emotional manipulation. Evaluating seven state-of-the-art models, we find that safety detection degrades consistently in code-switched and transliterated inputs, with the most pronounced failures occurring for passive suicidal ideation. Instruction-tuned and reasoning-capable models demonstrate greater robustness, suggesting these failures reflect alignment gaps rather than inherent model limitations. Our findings highlight transliteration and code-switching as under-recognized safety risks and motivate the development of linguistically inclusive, transliteration-aware safety methods.
QAMAR: A New Fully Verified and Accurate Quranic Arabic Morphological Analysis Resource
Sara Faqihi | Karim Bouzoubaa | Rachida Tajmout | Driss Namly
Several Quranic morphological corpora have been developed to support Arabic linguistic analysis and NLP applications, yet they often lack full coverage, consistency, or manual verification. We present QAMAR, a morphologically oriented, multi-task corpus derived from the Qur’an. This comprehensive, manually verified resource provides a detailed linguistic layer for every Quranic word, including the Modern Standard Arabic (MSA) equivalent, the stem, the lemma, the root, and the part of speech (POS). QAMAR supports multiple NLP tasks, such as normalization, lemmatization, root extraction, and POS tagging, and serves as a gold-standard reference for Quranic and Arabic NLP research, including corpus-to-corpus evaluation and morphological analyzer benchmarking. The paper details QAMAR’s annotation framework, verification process, and resource structure, and reports comparative analyses with existing Quranic morphological resources and outputs produced by current large language models (LLMs).
AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic
Omar Elshehy | Omer Nacar | Abdelbasset Djamai | Muhammed Ragab | Khloud AL Jallad | Mona Abdelazim
Encoder-only transformer models remain widely used for discriminative NLP tasks, yet recent architectural advances have largely focused on English. In this work, we present AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic, and study the impact of transtokenized embedding initialization and native long-context modeling up to 8,192 tokens. We show that transtokenization is essential for Arabic language modeling, yielding dramatic improvements in masked language modeling performance compared to non-transtokenized initialization. We further demonstrate that AraModernBERT supports stable and effective long-context modeling, achieving improved intrinsic language modeling performance at extended sequence lengths. Downstream evaluations on Arabic natural language understanding tasks, including inference, offensive language detection, question-question similarity, and named entity recognition, confirm strong transfer to discriminative and sequence labeling settings. Our results highlight practical considerations for adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts.
Parameter-Efficient Adaptation of Self-Supervised Models for Arabic Speech Recognition
Wafa Mohammed Alshehri | Wasfi G. Al-khatib | Mohammad Ismail Amro
Arabic speech recognition systems face distinct challenges due to the language’s complex morphology and dialectal variations. Self-supervised learning (SSL) models like XLS-R have shown promising results, but their size, with over 300 million parameters, makes fine-tuning computationally expensive. In this work, we present the first comparative study of parameter-efficient fine-tuning (PEFT), specifically LoRA and DoRA, applied to XLS-R for Arabic ASR. We evaluate on the newly released Common Voice Arabic V24.0 dataset, establishing new benchmarks. Our full fine-tuning achieves state-of-the-art results among XLS-R-based models with a 23.03% Word Error Rate (WER). In our experiments, LoRA achieved a 36.10% WER while training just 2% of the model’s parameters. DoRA reached 45.20% WER in initial experiments. We analyze the trade-offs between accuracy and efficiency, offering practical guidance for developing Arabic ASR systems when computational resources are limited. The models and code are publicly available.
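LoRA, one of the PEFT methods compared above, freezes the pretrained weight W and learns a low-rank update ΔW = (α/r)·B·A. A toy, dependency-free sketch of that update and of the parameter-count arithmetic behind claims like "training just 2% of the parameters" (the dimensions below are hypothetical, not XLS-R's):

```python
def lora_delta(B, A, alpha):
    """LoRA update ΔW = (alpha/r) * B @ A, where B is d_out x r and A is r x d_in."""
    r = len(A)               # rank = number of rows of A
    scale = alpha / r
    d_in = len(A[0])
    return [[scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d_in)] for i in range(len(B))]

def lora_param_fraction(d_out, d_in, r):
    """Trainable fraction: r*(d_in + d_out) adapter params vs d_out*d_in frozen params."""
    return r * (d_in + d_out) / (d_out * d_in)

# Rank-1 toy update: only the rows that B touches change.
dW = lora_delta(B=[[1.0], [0.0]], A=[[2.0, 3.0]], alpha=1.0)

# Hypothetical 1024x1024 projection with rank 8: well under 2% trainable.
frac = lora_param_fraction(1024, 1024, 8)
```

The fraction shrinks as layer dimensions grow, which is why low ranks suffice for very large models; the exact trainable percentage in the paper also depends on which modules receive adapters.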
Current state of LLMs for Arabic dialectal machine translation
Josef Jon | Rawan Bondok | Ondřej Bojar
This work presents an evaluation of large language models (LLMs) for English to dialectal Arabic machine translation on the MADAR dataset. We evaluate both translation directions (English to Arabic and vice-versa) on 16 Arabic dialects. Our experiments cover a diverse set of models, including specialized Arabic models (Jais, Nile), multilingual models (Gemma, Command-R, Mistral, Aya), and commercial APIs (GPT-4.1). We employ multiple evaluation metrics: BLEU, CHRF, COMET (both reference-based and reference-less variants) and GEMBA (LLM-as-a-judge), as well as a small-scale manual evaluation, to assess translation quality. We discuss the challenges of automatic MT evaluation, especially in the context of Arabic dialects. We also evaluate the ability of LLMs to classify the dialect used in a text. The study offers insights into the capabilities and limitations of current LLMs for dialectal Arabic machine translation, particularly highlighting the difficulty of handling dialectal diversity, although the results may be influenced by possible training data contamination, which is always a concern with LLMs.
A Hybrid Confidence-Aware Framework for Arabic Toxicity Detection in Social Media
Fawzia Zaal Alanazi | Asma Mohammed Alamri | Arwa Bin Saleh | Abdullah I. Alharbi
Automatic detection of toxic and offensive content in Arabic social media is a challenging task due to rich morphology, dialectal variation, and noisy writing styles. While transformer-based language models have achieved strong performance, they often produce uncertain predictions in borderline cases. This paper presents a hybrid framework for Arabic toxicity detection that combines a pretrained Arabic-specific transformer model with a confidence-aware rule-based mechanism. The proposed approach activates automatically induced lexical rules only when the model prediction falls within a predefined gray zone of uncertainty, preserving neural dominance while improving robustness and interpretability. Experiments conducted on a manually annotated dataset of 35,000 Arabic posts demonstrate that the hybrid approach achieves consistent improvements over the baseline model, particularly in reducing false negatives for toxic content. The results indicate that selective rule activation is an effective strategy for enhancing reliability in real-world Arabic social media moderation systems.
Arabic-Adapted One-Step Speech-to-Diacritized ASR: Evaluation and Error Analysis
Osamah A. I. Abduljalil | Dalal Ali | Razan A. Bajaman | Abdullah I. Alharbi
Arabic diacritics encode phonetic information essential for pronunciation, disambiguation, and downstream applications, yet most Arabic ASR systems generate undiacritized output. In this work, we study direct speech-to-diacritized-text recognition using a single-stage ASR pipeline that predicts diacritics jointly with Arabic letters, without text-based post-processing. We evaluate two Arabic-adapted ASR architectures—wav2vec 2.0 XLSR-53 and Whisper-base—under a unified experimental setup on the ClArTTS Classical Arabic dataset. Performance is assessed using surface and lexical WER/CER alongside diacritic error rate (DER) to disentangle base transcription accuracy from diacritic realization. Our results show that Arabic-adapted wav2vec 2.0 achieves substantially lower diacritic error rates than Whisper, indicating stronger exploitation of acoustic cues relevant to vowelization. We further analyze the effect of decoding strategy and provide a detailed breakdown of diacritic errors, highlighting challenges associated with short vowels and morphosyntactic markers. These findings underscore the importance of model architecture and Arabic-specific adaptation for accurate diacritized Arabic ASR.
GATech at AbjadGenEval Shared Task: Multilingual Embeddings for Arabic Machine-Generated Text Classification
Ahmed Khamis
We present our approach to the AbjadGenEval shared task on detecting AI-generated Arabic text. We fine-tuned the multilingual E5-large encoder for binary classification, and we explored several pooling strategies to pool token representations, including weighted layer pooling, multi-head attention pooling, and gated fusion. Interestingly, none of these outperformed simple mean pooling, which achieved an F1 of 0.75 on the test set. We believe this is because complex pooling methods introduce additional parameters that need more data to train properly, whereas mean pooling offers a stable baseline that generalizes well even with limited examples. We also observe a clear pattern in the data: human-written texts tend to be significantly longer than machine-generated ones.
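The mean pooling that won out in this system simply averages the token embeddings that the attention mask marks as real (non-padding). A minimal pure-Python sketch of masked mean pooling (illustrative only; the actual E5 setup operates on framework tensors):

```python
def mean_pool(token_embs, mask):
    """Average the embeddings of tokens whose mask value is 1 (non-padding)."""
    dim = len(token_embs[0])
    total = [0.0] * dim
    n = 0
    for emb, m in zip(token_embs, mask):
        if m:
            n += 1
            for i, v in enumerate(emb):
                total[i] += v
    return [t / n for t in total]

# Two real tokens and one padding token: only the first two are averaged.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1, 1, 0])
```

Because it introduces no extra parameters, this pooling tends to be a stable choice when labeled data is limited, which is consistent with the paper's finding that heavier pooling heads did not help.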
AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Mohamad Bilal Zbib | Hasan Abed Al Kader Hammoud | Ammar Mohanna | Nadine Rizk | Fatima Karnib | Sina Moukaled | Bernard Ghanem
We present AraLingBench, a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The benchmark and evaluation code are available on Hugging Face and GitHub.
REGLAT at AbjadMed: Handling Imbalanced Arabic Medical Text Classification via Hierarchical KNN-MLP Architecture
Ahmed Megahed Fetouh | Mohammed Rahmath | Omer Dawood | Mariam Labib | Nsrin Ashraf | Hamada Nayel
In this paper, we describe the system submitted to the shared task of medical text classification in Arabic. We proposed a single-model approach based on fine-tuned LLM-based embeddings combined with hierarchical classical classifiers, achieving a competitive macro F1-score of 0.46 on the blind test set. We explored various modeling strategies, including tree-based ensembles, LLMs, and hierarchical correction for rare classes, highlighting the effectiveness of domain-specific fine-tuning in low-resource settings. The results demonstrate that a single fine-tuned Arabic BERT variant can serve as a strong baseline in extreme imbalance scenarios, outperforming more complex ensembles in simplicity and reproducibility.
Murabaa: A Comprehensive Resource Platform for Arabic Morphology
Karim Bouzoubaa | Driss Namly | Hamid Jihad | Rachida Tajmout | Jamal Ezzouaine | Hakima Khamar
The Arabic language faces technical and cultural challenges, including a lack of high-quality resources and the prevalence of regional dialects, which hinder the development of effective language processing systems. The "Murabaa" platform was therefore developed to transform Arabic linguistic knowledge into integrated digital resources. The platform aims to provide accurate digital content and promote the use of Arabic in various fields, bridging the gap between tradition and modernity by offering integrated linguistic resources for developing advanced research tools. The platform provides eight accurate dictionaries in the form of a website and a web application, contributing to the digitization of knowledge and its representation within the framework of standard lexical markup. In this study, we also conduct a quantitative comparison against similar resources to assess the quality of the linguistic knowledge they provide.
Sujith Kanakkassery at AbjadMed: Imbalance-Aware Transformer Fine-tuning for Arabic Medical Text Classification
Sujith Kanakkassery
This paper describes our system submitted to the AbjadMed 2026 shared task at AbjadNLP. The task focuses on the multi-class classification of Arabic medical texts under severe class imbalance. Our approach fine-tunes a pre-trained Arabic Transformer model and incorporates several imbalance-aware strategies, including data cleaning, class-weighted loss, and label smoothing. Through ablation experiments, we observe consistent improvements over a baseline system, demonstrating the effectiveness of these techniques in improving performance on underrepresented medical categories. Finally, our error analysis highlights persistent challenges related to label sparsity and semantic overlap among medical classes.
A Knowledge Graph Based Diagnostic Framework for Analyzing Hallucinations in Arabic Machine Reading Comprehension
Najwa Abdullah AlGhamdi | Sadam Al-Azani | Kwabena Nuamah | Alan Bundy
Large Language Models (LLMs) frequently generate answers that are fluent but not fully grounded in the provided context, a phenomenon commonly referred to as hallucination. While recent work has explored hallucination detection primarily in English and open domain settings, comparatively little attention has been given to Arabic machine reading comprehension (MRC), particularly in culturally sensitive domains such as Qur’anic texts. In this paper, we present a knowledge graph based diagnostic framework for analyzing hallucinations and question misalignment in Arabic MRC. Rather than proposing a new detection model or metric, the framework provides an interpretable, triple level analysis of model generated answers by comparing subject-relation-object representations derived from the passage, the question, and the answer. The approach incorporates question-aware filtering and operates under weak supervision, combining automatic analysis with targeted human adjudication to handle annotation gaps and semantic ambiguity. We apply the framework to the Qur’anic Reading Comprehension Dataset (QRCD) and demonstrate how it exposes systematic hallucination patterns that are difficult to capture using surface level similarity metrics alone, particularly for questions requiring justification or abstract interpretation. The results highlight the value of structured, transparent diagnostic evaluation for understanding LLM behavior in low resource and high stakes Arabic NLP settings.
From Posts to Pressure: An Arabic Dataset about Stress and Mental-Health Monitoring
Wajdi Zaghouani | Eman Sedqy Shlkamy | Mabrouka Bessghaier
How do Arabic-speaking communities express and engage with psychological stress on social media? We introduce AraStress, the first large-scale Arabic corpus dedicated to psychological stress research, comprising 175,862 public social media posts from 2020 to 2024, covering pandemic and post-pandemic periods. Unlike prior work focusing primarily on Twitter and on depression or suicidality, AraStress fills a significant gap in stress-focused Arabic mental-health NLP resources, enabling large-scale analysis of stress-related expressions. Our lexicon-based analysis reveals that stress-related posts elicit predominantly affective engagement and exhibit a hybrid lexical framing that integrates religious and therapeutic language. AraStress provides a foundational resource for culturally grounded computational models of stress detection and digital wellbeing in Arabic-speaking communities.
HCMUS_TheFangs at AbjadGenEval Shared Task: Weighted Layer Pooling with Attention Fusion for Arabic AI-Generated Text Detection
Duy Minh Dao Sy | Nguyen Chi Tran | Trung Kiet Huynh | Nguyen Lam Phu Quy | Pham Phu Hoa | Nguyen Dinh Ha Duong
The rapid advancement of large language models poses significant challenges for content authenticity, particularly in under-resourced languages where detection tools remain scarce. We present our winning system for the AbjadGenEval shared task on Arabic AI-generated text detection. Our key insight is that AI-generated text exhibits distinctive patterns across multiple linguistic levels, from local syntax to global semantics, that can be captured by learning to fuse representations from different transformer layers. We introduce a Weighted Layer Pooling mechanism that learns optimal layer combinations, combined with Attention Pooling for sequence-level context aggregation. Through systematic experimentation with more than 15 approaches, we make a surprising discovery: model architecture selection dominates over sophisticated training techniques, with DeBERTa-v3 providing a +27% relative improvement over AraBERT regardless of training strategy. Our system achieves a 0.93 F1-score, securing 1st place among all participants and outperforming the runner-up by 3 absolute points.
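As a rough illustration of the two mechanisms named in the abstract, the sketch below (NumPy, with toy shapes; the fixed layer logits and query vector stand in for learned parameters) fuses per-layer transformer outputs with softmax-normalised weights, then aggregates tokens by attention:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_layer_pool(hidden_states, layer_logits):
    """Combine per-layer token embeddings with softmax-normalised weights.

    hidden_states: (num_layers, seq_len, dim) transformer outputs.
    layer_logits:  (num_layers,) learnable scores (fixed here for illustration).
    """
    w = softmax(layer_logits)                      # (num_layers,)
    return np.tensordot(w, hidden_states, axes=1)  # (seq_len, dim)

def attention_pool(token_states, query):
    """Aggregate a token sequence into one vector via dot-product attention."""
    scores = softmax(token_states @ query)         # (seq_len,)
    return scores @ token_states                   # (dim,)

rng = np.random.default_rng(0)
layers = rng.normal(size=(12, 8, 16))   # 12 layers, 8 tokens, dim 16
fused = weighted_layer_pool(layers, np.zeros(12))
pooled = attention_pool(fused, rng.normal(size=16))
print(fused.shape, pooled.shape)
```

With the logits fixed at zero the fusion reduces to a plain mean over layers; training would learn a non-uniform weighting.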
HCMUS_The Fangs at AbjadStyleTransfer Shared Task: Learning to Query Style, Contrastive Representations for Zero-Shot Arabic Authorship Style Transfer
Duy Minh Dao Sy | Trung Kiet Huynh | Nguyen Chi Tran | Nguyen Lam Phu Quy | Pham Phu Hoa | Nguyen Dinh Ha Duong
This paper describes the system developed by team HCMUS_The Fangs for the AbjadStyleTransfer shared task (ArabicNLP 2026), where we achieved 1st place. We present a contrastive style learning approach for zero-shot Arabic authorship style transfer. Our key discovery is that the 21 test authors (including Nobel laureate Naguib Mahfouz and literary pioneer Taha Hussein) have zero overlap with the 32,784 training authors, transforming this into a pure zero-shot challenge. This insight led us to develop a dual-encoder architecture that learns transferable style representations through contrastive objectives, rather than memorizing author-specific patterns. Our system achieves 19.77 BLEU and 55.74 chrF, outperforming retrieval-augmented generation (+18%) and multi-task learning (+31%). Counter-intuitively, we find that sophisticated architectural modifications like style injection consistently degrade performance, while simpler approaches that preserve pre-trained knowledge excel. Our analysis reveals that for famous authors, pre-trained Arabic language models already encode substantial stylistic knowledge; the key is surfacing it, not learning from scratch.
Large Language Models (LLMs) have rapidly proliferated, presenting challenges in distinguishing human-written text from AI-generated content, especially in low-resource languages like Urdu. This paper introduces U-RoCX, a novel hybrid architecture for the AbjadGenEval Shared Task on AI-Generated Urdu Text Detection. U-RoCX combines the multilingual semantic capabilities of a frozen XLM-RoBERTa backbone with local feature extraction from Convolutional Neural Networks (CNNs) and the advanced sequential modeling of the recently proposed Extended LSTM (xLSTM). By utilizing xLSTM’s matrix memory and covariance update rules, the model addresses traditional Recurrent Neural Network bottlenecks. Experimental results demonstrate the robustness of U-RoCX, achieving a balanced accuracy and F1-score of 88% on the test set.
HCMUS_PrisonDilemma at AbjadAuthorID Shared Task: Less is More with Base Models
Trung Kiet Huynh | Duy Minh Dao Sy | Nguyen Chi Tran | Pham Phu Hoa | Nguyen Lam Phu Quy | Truong Bao Tran
We present our approach to the AbjadNLP 2026 Arabic Authorship Identification shared task, achieving 4th place. Our key finding is that AraBERT-base (110M) outperforms AraBERT-large (340M) on the test set with macro F1 of 0.8449 versus 0.8096, despite lower validation scores. We handle long passages via sliding window chunking with mean pooling, and use a two-stage classification head with dual dropout for regularization. Per-class analysis reveals that translated works achieve perfect F1 while classical poets remain challenging due to shared formal structures. Our results challenge the "scale is all you need" assumption for stylometric tasks.
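The long-passage handling described above can be sketched as follows; the window and stride sizes are hypothetical, and random vectors stand in for per-chunk AraBERT embeddings:

```python
import numpy as np

def sliding_windows(tokens, window=512, stride=256):
    """Split a long token sequence into overlapping chunks
    (hypothetical window/stride; consecutive chunks overlap by window - stride)."""
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return chunks

tokens = list(range(1000))                  # a document longer than one window
chunks = sliding_windows(tokens)
chunk_embs = np.random.default_rng(0).normal(size=(len(chunks), 768))
doc_emb = chunk_embs.mean(axis=0)           # mean-pool chunk embeddings into one vector
print(len(chunks), doc_emb.shape)
```

Each chunk would be encoded independently, and the mean of the chunk vectors feeds the classification head.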
U-MIRAGE: Benchmarking Chain-of-Thought Reasoning for Urdu Medical QA
Ali Faheem | Faizad Ullah | Muhammad Hammad | Ahmed Hassan | Muhammad Sohaib Ayub | Asim Karim
Medical AI systems increasingly rely on large language models (LLMs), yet their deployment in linguistically diverse regions remains underexplored. We address this gap by introducing U-MIRAGE, the first medical question-answering benchmark for Urdu and Roman Urdu. Urdu is the 11th most spoken language worldwide, with over 246 million speakers. Our systematic evaluation of six state-of-the-art LLMs reveals three main findings. (1) Performance drops by 6% to 10% when moving from English to Urdu variants, even though medical knowledge should theoretically transfer across languages. (2) Chain-of-Thought (CoT) prompting improves small models by 8% to 20%, while, surprisingly, larger models’ performance degrades by up to 3%. (3) Quantized small models fail catastrophically in low-resource languages, achieving near-random accuracy regardless of prompting strategy. These findings challenge core assumptions about multilingual medical AI systems. Roman Urdu consistently outperforms standard Urdu script, suggesting that orthographic alignment with pre-training data matters more than linguistic proximity. CoT prompting effectiveness depends critically on model architecture rather than task complexity alone. Our contributions are threefold: (1) U-MIRAGE, (2) systematic benchmarking of LLMs for Urdu and Roman Urdu medical reasoning, and (3) empirical analysis of CoT prompting in low-resource contexts. Our code and datasets are publicly available.
XLMR-Urdu at AbjadGenEval Shared Task: A Data-Centric Transformer-Based Approach for AI-Generated Urdu Text Detection
Mohannad Mohammad Hendi
The rapid advancement of large language models (LLMs) has led to a substantial increase in automatically generated textual content, raising concerns regarding misinformation, plagiarism, and authorship verification. These challenges are particularly pronounced for low-resource languages such as Urdu, where limited annotated data and complex linguistic properties hinder robust detection. In this paper, we present a transformer-based approach for binary classification of human-written versus AI-generated Urdu text, developed for the AbjadGenEval Task 2 shared task. Beyond model fine-tuning, we adopt a data-centric perspective, emphasizing dataset diagnostics, document-level inference, and calibration strategies. Our system achieves strong performance on the official test set, with an F1-score of 88.68% and balanced accuracy of 88.71%. Through empirical analysis, we demonstrate that dataset characteristics and generator-specific artifacts play a dominant role in model generalization, highlighting critical directions for future research in low-resource AI-generated text detection.
This paper describes our system submitted to the AbjadGenEval Shared Task at ArabicNLP 2026, which focuses on binary classification of human-written versus machine-generated text in low-resource languages. We participated in two independent subtasks targeting Arabic and Urdu news and literary texts. Our approach relies exclusively on fine-tuning XLM-RoBERTa, a multilingual Transformer-based model, under carefully controlled training and preprocessing settings. While the same model architecture was used for both subtasks, language-specific data handling strategies were applied based on empirical observations. The proposed system achieved first place in the Urdu subtask and third place in the Arabic subtask according to the official evaluation. These results demonstrate that multilingual pretrained models can serve as strong and reliable systems for AI-generated text detection across diverse languages.
The proliferation of Large Language Models (LLMs) has introduced significant challenges regarding algorithmic bias, privacy, and the authenticity of digital content. While detection mechanisms for English are maturing, low-resource languages like Urdu—spoken by over 100 million people—require dedicated research. In this paper, we present a technical framework for Urdu AI-generated text detection developed for the *ACL shared task. We propose a hybrid pipeline that combines TF-IDF Character N-grams with a custom stylometric feature extractor designed to capture unique Urdu linguistic markers, including repeated word ratios, punctuation density, and formal function markers. Using a Linear Support Vector Machine (SVM) optimized via Stochastic Gradient Descent (SGD), our system achieves a balanced accuracy and F1-score of 87.80% on a dataset of 6,800 records. Our results demonstrate that a computationally efficient, classical machine learning approach—prioritizing stylistic signals over heavy preprocessing—remains highly effective for distinguishing between human-written and AI-generated Urdu text.
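A minimal sketch of this style of pipeline: character n-gram counts plus two of the stylometric markers listed above, with a simple perceptron standing in for the SGD-optimised linear SVM (same linear model family, simpler update rule). The texts, feature names, and thresholds are toy stand-ins, not the task data:

```python
from collections import Counter

def char_ngrams(text, n_lo=2, n_hi=4):
    """Character n-gram counts (2-4 grams), the backbone feature set."""
    feats = Counter()
    for n in range(n_lo, n_hi + 1):
        for i in range(len(text) - n + 1):
            feats[text[i:i + n]] += 1
    return feats

def stylometric(text):
    """Two hypothetical markers from the paper's list: repeated-word ratio
    and punctuation density (Urdu punctuation marks)."""
    words = text.split()
    repeated = 1 - len(set(words)) / len(words) if words else 0.0
    punct = sum(c in "۔،؟!" for c in text) / max(len(text), 1)
    return {"__repeat__": repeated, "__punct__": punct}

def features(text):
    f = char_ngrams(text)
    f.update(stylometric(text))   # Counter.update adds the stylometric values
    return f

def train_perceptron(data, epochs=20):
    """Online linear classifier over sparse features; y in {-1, +1}."""
    w = Counter()
    for _ in range(epochs):
        for text, y in data:
            score = sum(w[k] * v for k, v in features(text).items())
            if y * score <= 0:                       # mistake-driven update
                for k, v in features(text).items():
                    w[k] += y * v
    return w

def predict(w, text):
    return 1 if sum(w[k] * v for k, v in features(text).items()) > 0 else -1

data = [("یہ انسان کا لکھا ہوا جملہ ہے۔", -1),       # "human" toy sample
        ("یہ مشین کا تیار کردہ یکساں یکساں متن ہے۔", 1)]  # "machine" toy sample
w = train_perceptron(data)
print([predict(w, t) for t, _ in data])
```

The appeal of this classical setup is that it trains in seconds on CPU, which matches the abstract's point about computational efficiency.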
QalamID at AbjadAuthorID Shared Task: Morphology Matters, A Hybrid Ensemble for Arabic Authorship Attribution
Youssef Zaghloul
Arabic authorship attribution presents unique challenges due to the language’s rich derivational morphology, which often fragments word-level frequencies. In this paper, we describe our winning submission to the AbjadAuthorID Shared Task. We propose a hybrid ensemble system that fuses the morphological precision of character n-gram LinearSVCs with the semantic understanding of fine-tuned Transformers (AraBERT and XLM-RoBERTa). Contrary to current trends in NLP, we demonstrate that traditional character n-grams (0.92 F1) significantly outperform deep learning baselines (AraBERT 0.87 F1) for this task, suggesting that authorial signature in Arabic is encoded more densely in morphological patterns than in semantic content. Our final system employs a novel Precision Scalpel post-hoc calibration technique and selective pseudo-labeling to address class imbalance and genre confounds. The system achieved the 1st place ranking with a macro F1-score of 0.932 and accuracy of 0.963 on the test set.
Kashif-AI at AbjadGenEval Shared Task: A Transformer-based Approach for Arabic AI-Generated Text Detection
Fatimah Mohamed Emad Eldin
As Large Language Models (LLMs) become increasingly proficient at generating human-like text, distinguishing between human-written and machine-generated content has become a critical challenge for information integrity. This paper presents Kashif-AI, a system developed for the AbjadGenEval Task 1: AI-Generated Arabic Text Detection. The approach leverages fine-tuned Arabic Pre-trained Language Models (PLMs), specifically MARBERT and CAMeLBERT, to classify news articles. A rigorous ablation study was conducted to evaluate the impact of data augmentation, comparing models trained on the official shared task data against those trained on a combined corpus of over 47,000 samples. While near-perfect performance was observed during validation, the blind test set evaluation revealed a significant generalization gap. Contrary to expectations, data augmentation resulted in performance degradation due to domain shifts. The best-performing configuration, which utilized CAMeLBERT-Mix trained on the original dataset, achieved an F1-score of 66.29% and an Accuracy of 70.5% on the blind test set.
NileUn at AbjadGenEval Shared Task: Contrastive Learning with Stacking Ensemble for Efficient Arabic AI-Generated Text Detection
Mohamed Hussein Mohamed | Shrouk Shalaby | Nesreen Mohamed
We present a computationally efficient approach for detecting AI-generated Arabic text as part of the AbjadGenEval shared task. Our method combines Supervised Contrastive Learning with a Stacking Ensemble of AraBERT and XLM-RoBERTa models. Our training pipeline progresses through three stages: (1) standard fine-tuning without contrastive loss, (2) adding supervised contrastive loss for better embeddings, and (3) further fine-tuning on diverse generation styles. On our held-out test split, the stacking ensemble achieves F1=0.983 before fine-tuning. On the official workshop test data, our system achieved 4th place with F1=0.782, demonstrating strong generalization using only encoder-based transformers without requiring large language models. Our implementation is publicly available.
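To make the contrastive stage concrete, here is a small NumPy sketch of a supervised contrastive loss over L2-normalised embeddings; it illustrates the general objective, not the team's exact implementation:

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss: for each anchor, positives are all other
    samples with the same label; similarities are scaled by temperature tau
    and log-softmax-normalised over all non-anchor samples."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / tau
    n = len(z)
    not_self = ~np.eye(n, dtype=bool)
    logits = sim - sim.max(axis=1, keepdims=True)          # numerical stability
    denom = (np.exp(logits) * not_self).sum(axis=1, keepdims=True)
    log_prob = logits - np.log(denom)
    pos = (labels[:, None] == labels[None, :]) & not_self
    return float(-(log_prob * pos).sum() / pos.sum())

labels = np.array([0, 0, 1, 1])
clustered = np.array([[1., 0.], [1., .1], [0., 1.], [.1, 1.]])  # same-label pairs close
mixed = np.array([[1., 0.], [0., 1.], [1., .1], [.1, 1.]])      # same-label pairs far
print(supcon_loss(clustered, labels), supcon_loss(mixed, labels))
```

Same-label pairs with similar embeddings drive the loss toward zero, while same-label pairs with dissimilar embeddings are heavily penalised, which is what pulls human-written and machine-generated texts into separate regions of embedding space.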
REGLAT at AbjadGenEval: Multi-Model Ensemble Approach for Arabic AI-Generated Text Detection
Mariam Labib Francies | Nsrin Ashraf | Ahmed Megahed Fetouh | Hamada Nayel
The rapid advancement of large language models necessitates robust methods for detecting AI-generated Arabic text. This paper presents our system for distinguishing human-written from machine-generated Arabic content. We propose a weighted ensemble combining AraBERTv2 and BERT-base-arabic, trained via 5-fold stratified cross-validation with class-balanced loss functions. Our methodology incorporates Arabic text normalization, strategic data augmentation using 16,678 samples from external scientific abstracts, and threshold optimization prioritizing recall. On the official test set, our system achieved an F1-score of 0.763, an accuracy of 0.695, a precision of 0.624, and a recall of 0.980, demonstrating strong detection of machine-generated texts with minimal false negatives at the cost of elevated false positives. Analysis reveals critical insights into precision-recall trade-offs and challenges in cross-domain generalization for Arabic AI text detection.
AyahVerse at AbjadGenEval Shared Task: Monolingual Precision and Cross-Lingual Analysis in Perso-Arabic AI Detection
Fizza Nawaz | Ibad-ur-Rehman Rashid | Uswa Abid | Junaid Hussain
This paper presents our submission to the AbjadGenEval shared task on AI-generated text detection in Arabic and Urdu. To address the challenges of morphologically rich and low-resource environments, we developed a composite framework leveraging monolingual specialists (AraBERTv2, CAMeLBERT-DA) and multilingual transformers. Our system achieved robust in-domain performance with Test F1-scores of 0.75 for Arabic and 0.86 for Urdu. Methodologically, we tested both raw and normalized text to distinguish whether models detect based on semantic content or on surface artifacts such as punctuation and formatting patterns. Furthermore, our cross-lingual investigations reveal directional performance differences, where Urdu-trained models achieve 0.75 F1 on Arabic, while Arabic-trained models achieve only 0.61 F1 on Urdu. Despite this difference, both directions maintained notably high recall for the machine class, indicating that the model learns cross-lingual machine detection patterns across the Perso-Arabic script. Finally, transfer performance collapsed when internal layers were frozen, demonstrating that full fine-tuning is essential for cross-lingual detection. However, the observed performance differences may partly reflect data imbalance rather than purely linguistic factors.
AbjadMed: Arabic Medical Text Classification at AbjadNLP 2026
Pranav Gupta | Niranjan Kumar M | Balaji Nagarajan | Imed Zitouni | Mo El-Haj
We present AbjadMed, a shared task on Arabic medical text classification organised as part of the 2nd AbjadNLP workshop at EACL 2026. The task targets supervised multi-class classification under realistic conditions of severe class imbalance, fine-grained category structure, and naturally occurring label noise. Participants assign each Arabic medical question–answer instance to one of 82 predefined categories derived from real healthcare consultations. The dataset is based on the Arabic Healthcare Dataset (AHD) and is released as curated training and test splits containing 27,951 and 18,634 instances respectively, while preserving the original label distribution. Systems are evaluated using macro-averaged F1 to emphasise performance on minority medical topics. Results show that Arabic medical text classification remains challenging even with modern pretrained models, particularly for low-frequency and semantically overlapping categories. AbjadMed provides a reproducible benchmark for studying robustness and generalisation in Arabic healthcare NLP.
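Macro-averaged F1, the task's metric, weights each of the 82 categories equally regardless of frequency; a pure-Python sketch on a toy 2-class example shows why it punishes systems that ignore minority classes:

```python
def macro_f1(y_true, y_pred):
    """Average per-class F1 over all classes appearing in the gold labels,
    so rare classes count as much as frequent ones."""
    classes = sorted(set(y_true))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# A classifier that nails the majority class but misses the rare one:
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
print(macro_f1(y_true, y_pred))  # accuracy is 0.8, but macro F1 is only ~0.44
```

Under plain accuracy, predicting the majority class everywhere looks strong; under macro F1 the missed minority class halves the score, which is exactly the behaviour the organisers want to measure.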
Uslub at AbjadAuthorID Shared Task: A Comparative Analysis of Traditional Machine Learning and Transformer-Based Models for Authorship Attribution in Arabic and Urdu
Shahad Alsuhaibani | Mohamed Alkaoud
Authorship attribution is a critical task in natural language processing with applications ranging from forensic linguistics to plagiarism detection. While well-studied in high-resource languages, it remains challenging for low-resource languages like Arabic and Urdu. In this paper, we present our participation in the AbjadNLP shared task, where we systematically evaluate three distinct approaches: traditional machine learning using SVM with TF-IDF features, fine-tuned transformer-based models (AraBERT), and LLMs. We demonstrate that while fine-tuned AraBERT excels in Arabic, traditional lexical models (SVM) prove more robust for Urdu, outperforming both BERT-based and LLM approaches. We also show that few-shot prompting with LLMs, when operated as a reranker over top candidates, significantly outperforms zero-shot baselines. Our final systems achieved competitive performance, ranking 6th and 1st in the Arabic and Urdu tasks respectively.
Arabic Author Attribution Using Transformer-Based Models: Insights from the AbjadAuthorID Shared Task
Ghader Kurdi
This paper describes the author’s participation in the Arabic track of the AbjadAuthorID shared task, which focuses on multiclass authorship attribution using transformer-based models. The task involves identifying the author of a given text excerpt drawn from diverse genres and historical periods, posing significant challenges due to stylistic variation and linguistic richness. Experimental results demonstrate strong performance, with an ensemble of MARBERTv2 and ARBERTv2 achieving an accuracy of 92% and a macro-averaged F1 score of 89%, ranking second on the leaderboard and highlighting the effectiveness of the proposed approach for Arabic authorship identification.
R-R at AbjadAuthorID Shared Task: A Fine-Tuned Approach for Kurdish Authorship Identification
Rania Azad M. San Ahmed | Rebwar M. Nabi
Authorship identification is a fundamental task in natural language processing and computational stylistics. Despite significant advancements in high-resource languages, low-resource languages, particularly those utilizing non-Latin scripts, remain largely underexplored, leaving a critical gap in resources and benchmarks. Addressing this oversight, this paper presents Task 3 of AbjadNLP 2026, the first shared task dedicated to authorship identification for Kurdish. The task introduces a newly constructed dataset designed to capture the unique phonological and orthographic features of Sorani Kurdish and formulates the problem as closed-set multiclass classification. To establish a robust baseline, we fine-tune the pretrained XLM-RoBERTa model to capture authorial stylistic patterns. Experimental results on the test set demonstrate the efficacy of transformer-based representations for this domain, achieving an accuracy of approximately 75%.
AbjadGenEval: Abjad AI Generated Text Detection Shared Task for Languages Using Arabic Script at AbjadNLP 2026
Saad Ezzini | Irfan Ahmad | Salmane Chafik | Shadi Abudalfa | Mo El-Haj | Ahmed Abdelali | Mustafa Jarrar | Nadir Durrani | Hassan Sajjad | Farah Adeeba
We present the findings of the AbjadGenEval shared task, organized as part of the AbjadNLP workshop at EACL 2026, which benchmarks AI-generated text detection for Arabic-script languages. Extending beyond Arabic to include Urdu, the task serves as a binary classification platform distinguishing human-written from AI-generated news articles produced by varied LLMs (e.g., GPT, Gemini). Twenty teams participated, with top systems achieving F1 scores of 0.93 for Arabic and 0.89 for Urdu. The results highlight the dominance of multilingual transformers, specifically XLM-RoBERTa and DeBERTa-v3, and reveal significant challenges in cross-domain generalization, where naive data augmentation often yielded diminishing returns. This shared task establishes a robust baseline for authenticating content in the Abjad ecosystem.
AbjadAuthorID: Authorship Identification for Arabic-Script Languages at AbjadNLP 2026
Shadi Abudalfa | Saad Ezzini | Ahmed Abdelali | Mustafa Jarrar | Mo El-Haj | Nadir Durrani | Hassan Sajjad | Farah Adeeba | Sina Ahmadi
Authorship identification is a core problem in Natural Language Processing and computational linguistics, with applications spanning digital humanities, literary analysis, and forensic linguistics. While substantial progress has been made for English and other high-resource languages, authorship attribution for languages written in the Arabic (Abjad) script remains underexplored. In this paper, we present an overview of AbjadAuthorID, a shared task organised as part of the AbjadNLP workshop at EACL 2026, which focuses on multiclass authorship identification across Arabic-script languages. The shared task covers Modern Standard Arabic, Urdu, and Kurdish, and is formulated as a closed-set multiclass classification problem over literary text spanning multiple authors and historical periods. We describe the task motivation, dataset construction, evaluation protocol, and participation statistics, and report official results for the Arabic track. The findings highlight both the effectiveness of current approaches in controlled settings and the challenges posed by lower participation and resource availability in some language tracks. AbjadAuthorID establishes a new benchmark for multilingual authorship attribution in morphologically rich, underrepresented languages.
AbjadStyleTransfer: Authorship Style Transfer for Arabic-Script Languages at AbjadNLP 2026
Shadi Abudalfa | Saad Ezzini | Ahmed Abdelali | Mustafa Jarrar | Mo El-Haj | Nadir Durrani | Hassan Sajjad | Farah Adeeba
Authorship style transfer aims to rewrite a given text so that it reflects the distinctive style of a target author while preserving the original meaning. Despite growing interest in text style transfer, most existing work has focused on English and other high-resource languages, with limited attention to languages written in the Arabic script. In this paper, we present an overview of AbjadStyleTransfer, a shared task organised as part of the AbjadNLP workshop at EACL 2026, which targets authorship style transfer for Arabic-script languages with a strong focus on literary text. The shared task covers Modern Standard Arabic and Urdu, and is designed to encourage research on controllable text generation in morphologically rich and stylistically diverse languages. Participants are required to generate text that conforms to the writing style of a specified author, given a semantically equivalent formal input. We describe the task motivation, dataset construction, evaluation protocol, and participation statistics, and provide an initial discussion of the challenges associated with authorship style transfer in Arabic-script languages. AbjadStyleTransfer establishes a new benchmark for literary style transfer beyond Latin-script settings and supports future research on culturally grounded and linguistically informed text generation.
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
Everlyn Asiko Chimoto | Constantine Lignos | Shamsuddeen Muhammad | Idris Abdulmumin | Clemencia Siro | David Ifeoluwa Adelani
Dealing with the Hard Facts of Low-Resource African NLP
Michael Leventhal | Yacouba Diarra | Nouhoum Coulibaly | Panga Azazia Kamaté | Aymane Dembélé | Madani Amadou Tall | Emmanuel Elise Kone
Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.
M-MiniGPT4: Multilingual VLLM Alignment via Translated Data
Seung Hun Eddie Han | Youssef Mohamed | Mohamed Elhoseiny
This paper presents a Multilingual Vision Large Language Model, named M-MiniGPT4. Our model exhibits strong vision-language understanding (VLU) capabilities across 11 languages. We utilize a mixture of native multilingual and translated data to push the multilingual VLU performance of the MiniGPT4 architecture. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.
InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
Mamadou K. Keita | Sebastien Diarra | Christopher M Homan | Seydou Diallo
Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.
Leveraging CoHere Multilingual Embeddings and Inverted Softmax Retrieval for Automatic Parallel Sentence Alignment in Low-Resource Languages
Abubakar Auwal Khalid | Salisu Musa Borodo | Amina Abubakar Imam
We present an improved method for automatic parallel sentence alignment in low-resource languages. We used CoHere multilingual embeddings and inverted softmax retrieval. Our technique achieved a higher F1-score of 78.30% on the MAFAND-MT test set, compared to the existing technique’s 54.75%. Precision and recall showed similar performance. We assessed the quality of the extracted data by demonstrating that it outperforms the existing technique in terms of low-resource translation performance.
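The inverted softmax retrieval step can be sketched as follows (NumPy; random vectors stand in for the CoHere embeddings, and the temperature beta is a hypothetical value): each source sentence is matched to the target whose cosine similarity, renormalised over all source sentences, is highest.

```python
import numpy as np

def inverted_softmax_align(src_emb, tgt_emb, beta=10.0):
    """For each source sentence, retrieve the target whose inverted-softmax
    score is highest: similarities are renormalised over the *source* side,
    which penalises 'hub' targets that are similar to everything."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = s @ t.T                              # cosine similarities (n_src, n_tgt)
    e = np.exp(beta * sim)
    p = e / e.sum(axis=0, keepdims=True)       # normalise over source sentences
    return p.argmax(axis=1)

rng = np.random.default_rng(0)
tgt = rng.normal(size=(5, 8))                  # stand-in target-side embeddings
perm = [2, 0, 4, 1, 3]
src = tgt[perm] + 0.01 * rng.normal(size=(5, 8))  # noisy "translations" of the targets
print(inverted_softmax_align(src, tgt))
```

On this toy example the retrieval recovers the permutation used to build the source side; on real data a score threshold would additionally filter out sentence pairs with no true translation.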
AfriCaption: Establishing a New Paradigm for Image Captioning in African Languages
Mardiyyah Oduwole | Prince Mireku | Fatimo Adebanjo | Oluwatosin Olajide | Mahi Aminu Aliyu | Jekaterina Novikova
Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B-parameter vision-to-text architecture that integrates SigLIP and NLLB-200 for caption generation across underrepresented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for underrepresented African languages, laying the groundwork for truly inclusive multimodal AI.
Developing an English–Efik Corpus and Machine Translation System for Digitization Inclusion
Offiong Bassey Edet | Mbuotidem Awak | Emmanuel Ubene Oyo-Ita | Benjamin Okon Nyong | Ita Etim Bassey
Low-resource languages serve as invaluable repositories of human history, preserving cultural and intellectual diversity. Despite their significance, they remain largely absent from modern natural language processing systems. While progress has been made for widely spoken African languages such as Swahili, Yoruba, and Amharic, smaller indigenous languages like Efik continue to be underrepresented in machine translation research. This study evaluates the effectiveness of state-of-the-art multilingual neural machine translation models for English–Efik translation, leveraging a small-scale, community-curated parallel corpus of N = 13,865 sentence pairs. We fine-tuned both the mT5 multilingual model and the NLLB-200 model on this dataset. NLLB-200 outperformed mT5, achieving BLEU scores of 26.64 for English–Efik and 31.21 for Efik–English, with corresponding chrF scores of 51.04 and 47.92, indicating improved fluency and semantic fidelity. Our findings demonstrate the feasibility of developing practical machine translation tools for low-resource languages and highlight the importance of inclusive data practices and culturally grounded evaluation in advancing equitable NLP.
Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts
Millicent Ochieng | Anja Thieme | Ignatius Ezeani | Risa Ueno | Samuel Chege Maina | Keshet Ronen | Javier Gonzalez | Jacki O'Neill
Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social science measurement lens, we operationalize LLM outputs as an instrument for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality, with top-tier LLMs demonstrating greater interpretive stability, while smaller open-weight models in our study show reduced stability under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.
ÒWE-Voice: An Evaluation of Monolingual and Multilingual ASR Model Using Yoruba Proverb Speech Dataset
Daud Abolade
Given the advancement of various Artificial Intelligence (AI) technologies in the 21st century, Automatic Speech Recognition (ASR) plays a vital role in human–machine interaction and serves as an interface for a wide range of applications. The development of these high-performing, robust and useful technologies continues to concentrate on high-resource languages due to the high availability of language data, market profitability, and access to funding and research initiatives compared to marginalised low-resource languages. Despite efforts to develop ASR systems for African languages, numerous challenges remain due to limited speech datasets, tonal complexity and dialectal variation. In this study, we curated a domain-specific speech dataset for one form of Yoruba oral literature, proverbs, which are deeply culturally embedded. We used the Yoruba recording app developed for the Iroyin-speech project to record 6 hours of Yoruba proverb sentences. The NCAIR1/Yoruba-ASR model, fine-tuned on OpenAI Whisper Small, and Massively Multilingual Speech, a multilingual speech model featuring low-resource languages including Yoruba, were evaluated on the recorded Yoruba proverbs. Evaluation was conducted based on Word Error Rate (WER) and Tone Error Rate (TER). Our results show that current ASR systems that support Yoruba do not capture cultural nuances. These findings highlight an urgent need to curate more robust, culturally embedded speech datasets for low-resource languages, and for Yoruba in particular, in order to build technological tools that preserve African culture, language and identity.
Language choice in multilingual societies is rarely arbitrary. In Nigeria, English, Nigerian Pidgin (NP) and indigenous languages are strategically deployed in online discourse, yet little is known about how they function in hostile contexts. Here we conduct the first systematic analysis of NP in online hate speech on two platforms, Twitter and Instagram. Using a linguistically enriched annotation scheme, we label each post for class, targeted group, language variety, and hate type. Our results show that NP is disproportionately used in offensive and hateful discourse, particularly against Hausa, women, and LGBTQ+ groups, and that insults are the dominant hate strategy. Cross-domain evaluation further reveals that classifiers trained on Twitter systematically over-predict hate on Instagram, highlighting challenges of domain transfer. These findings underscore NP’s role as a linguistic resource for hostility and its sociolinguistic salience in amplifying stereotypes and affect. For NLP, the work demonstrates the need for NP-specific resources, sensitivity to figurative strategies, and domain adaptation across platforms. By bridging sociolinguistics and computational modeling, this study contributes new evidence on how language choice shapes online hate speech in a multilingual African context.
The Token Tax: Systematic Bias in Multilingual Tokenization
Jessica M. Lundin | Ada Zhang | Nihal Karim | Hamza Louzan | Guohao Wei | David Ifeoluwa Adelani | Cody Carroll
Tokenization inefficiency places morphologically complex, low-resource languages at a structural disadvantage, inflating compute costs and reducing accuracy. We evaluate 10 Large Language Models (LLMs) on AfriMMLU (5 subjects; 16 African languages) and show that token fertility reliably predicts accuracy. Higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (e.g., DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. In terms of economics, a doubling in tokens results in quadrupled training cost and time, underscoring the “token tax” faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
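Token fertility, the predictor discussed in the abstract above, is conventionally measured as the average number of subword tokens a tokenizer produces per whitespace-delimited word. The sketch below illustrates the computation; the `toy_tokenize` function is an invented stand-in, and an actual study would use each model's own tokenizer.

```python
def token_fertility(text: str, tokenize) -> float:
    """Average subword tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)

def toy_tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: splits each word into chunks of at most
    3 characters, standing in for a real subword vocabulary."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

# "ede" -> 1 token, "Yoruba" -> 2 tokens: 3 tokens / 2 words = 1.5
print(token_fertility("ede Yoruba", toy_tokenize))
```

A language whose words fragment into many subwords pays more tokens, and therefore more compute, for the same content, which is the "token tax" the paper quantifies.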
EduNaija AI Tutor: A Multi-Agent Retrieval-Augmented Generation System for Nigerian Curriculum Education
Israel Olanrewaju Odeajo | Edifon Emmanuel Jimmy
Equitable access to quality education remains a critical challenge in Nigeria, where millions of students prepare annually for standardized examinations (WAEC, NECO, JAMB) with limited access to personalized tutoring (Badei et al., 2024). This research presents EduNaija AI Tutor, a multi-agent Retrieval-Augmented Generation (RAG) system designed to democratize educational support through AI-powered tutoring aligned with Nigerian curricula. The system integrates conversational AI with document-based question answering, automated assessment generation, and multilingual support for English, Yoruba, Hausa, and Igbo. Using LangChain for agent orchestration, OpenAI GPT models for natural language processing, and FAISS for vector retrieval, the system enables students to interact with educational content through natural language queries while maintaining cultural relevance through Nigerian-contextualized examples and conventions (Chukwuma et al., 2024). The multi-agent architecture comprises five specialized components: a main orchestrator, explanation agent, quiz generation agent, web search agent, and RAG agent for processing uploaded educational materials. Preliminary evaluation demonstrates the system’s capability to provide curriculum-aligned explanations, generate practice assessments, and answer questions from uploaded textbooks and study materials. This work contributes a culturally aware educational AI framework addressing linguistic diversity and curriculum alignment challenges in African educational contexts, while leveraging open-source tools for reproducibility and accessibility (Shoukat et al., 2025).
Synthetic Data Generation Pipeline for Low-Resource Swahili Sentiment Analysis: Multi-LLM Judging with Human Validation
Samuel Gyamfi | Alfred Malengo Kondoro | Yankı Öztürk | Richard Hans Schreiber | Vadim Borisov
Despite serving over 100 million speakers as a vital African lingua franca, Swahili remains critically under-resourced for Natural Language Processing, hindering technological progress across East Africa. We present a scalable solution: a controllable synthetic data generation pipeline that produces culturally grounded Swahili text for sentiment analysis, validated through automated LLM judges. To ensure reliability, we conduct targeted human evaluation with a native Swahili speaker on a stratified sample, achieving 80.95% agreement between generated sentiment labels and human ground truth, with strong agreement on judge quality assessments. This demonstrates that LLM-based generation and quality assessment can transfer effectively to low-resource languages. We release the resulting Swahili sentiment dataset and the full reproducible generation pipeline publicly at https://huggingface.co/datasets/tabularisai/swahili-sentiment-dataset and https://github.com/tabularis-ai/Synthetic-Data-Generation-Pipeline-for-Low-Resource-Swahili-Sentiment-Analysis, providing working material for NLP researchers in low-resource contexts.
In this paper, we present some of our recent efforts to provide base NLP pipelines for African languages. These include an infrastructure called UDMorph, which makes UD-compatible training data available for resources that lack dependency relations, and a Python package called flexiPipe, which makes it easy to run an NLP pipeline in various NLP tools through a uniform front-end, including the models provided by UDMorph. flexiPipe also provides Unicode normalization, an often overlooked feature that has a significant impact on African NLP. flexiPipe currently provides an NLP pipeline for 33 African languages, a significant increase from the handful of models that are currently easily accessible, and UDMorph is designed to make it easy to provide training data for more languages.
Linguistically Informed Evaluation of Multilingual ASR for African Languages
Fei-Yueh Chen | Lateef Adeleke | C. M. Downey
Word Error Rate (WER) mischaracterizes ASR models’ performance for African languages by collapsing phonological, tonal, and other linguistic errors into a single lexical error. By contrast, Feature Error Rate (FER) has recently attracted attention as a viable metric that reveals linguistically meaningful errors in models’ performance. In this paper, we evaluate three speech encoders on two African languages by complementing WER with CER and FER, and adding a tone-aware extension (TER). We show that by computing errors on phonological features, FER and TER reveal linguistically salient error patterns even when word-level accuracy remains low. Our results reveal that models perform better on segmental features, while tones (especially mid and downstep) remain the most challenging features. Results on Yoruba show a striking differential across metrics, with WER=0.788, CER=0.305, and FER=0.151. Similarly, for Uneme (an endangered language absent from pretraining data), a model with near-total WER and 0.461 CER achieves the relatively low FER of 0.267. This indicates that model error is often attributable to individual phonetic feature errors, which is obscured by all-or-nothing metrics like WER.
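The gap between word-level and finer-grained metrics that this abstract describes can be illustrated with a minimal sketch. The metric definitions below are the standard ones (Levenshtein edit distance normalized by reference length); the Yoruba-style example strings are invented for illustration: a single missing tone mark counts as a full word substitution under WER but only one character under CER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: edit distance over words / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance over characters / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

# One dropped tone mark: WER charges a whole word, CER one character.
print(wer("mo ti de", "mo ti dé"))   # 1/3: one word wrong out of three
print(cer("mo ti de", "mo ti dé"))   # 1/8: one character wrong out of eight
```

FER pushes this decomposition one level further, scoring individual phonological features rather than whole characters, which is why it can stay low even when WER is near-total.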
Evaluating Native-Speaker Preferences on Machine Translation and Post-Edits for Five African Languages
Hiba El Oirghi | Tajuddeen Gwadabe | Marine Carpuat
Wikipedia editors undertake the task of editing machine translation (MT) outputs in various languages to disseminate multilingual knowledge from English. But are editors doing more than just translating or fixing MT output? To answer this broad question, we constructed a dataset of 4,335 fine-grained annotated parallel pairs of MT translations and human post-edit (HE) translations for five low-resource African languages: Hausa, Igbo, Swahili, Yoruba, and Zulu. We report on our data selection and annotation methodologies as well as findings from the annotated dataset, the most surprising of which is that annotators mostly preferred the MT translations over their HE counterparts for three out of five languages. We analyze the nature of these "fluency breaking" edits and provide recommendations for the MT post-editing workflows in the Wikipedia domain and beyond.
Building a Conversational AI Assistant for African Travel Services with LLMs and RAG
Grace Kevine Ngoufo | Shamsuddeen Hassan Muhammad | Kevin Jeff Fogang Fokoa
Travel agencies in many African countries face increasing pressure to handle large volumes of customer inquiries with limited staff and either non-existent or outdated rule-based chatbots. To address this challenge, we develop a conversational virtual assistant powered by a Large Language Model (LLM) and enhanced with a Retrieval-Augmented Generation (RAG) pipeline. The system combines LLM reasoning, company-specific knowledge retrieval, and real-time API (Application Programming Interface) integration to deliver accurate, context-aware responses through WhatsApp, the region’s most widely used communication platform. A dedicated web interface enables staff to upload and update internal documents, ensuring that the assistant remains aligned with changing service information. Demonstrations show that the proposed solution improves response speed, enhances user experience, and reduces operational burden.
Morphologically-informed Somali Lemmatization Corpus built with a Web-based Crowdsourcing Platform
Abdifatah Ahmed Gedi | Shafie Abdi Mohamed | Yusuf A. Yusuf | Muhidin A. Mohamed | Fuad Mire Hassan | Houssein A Assowe
Lemmatization, which reduces words to their root forms, plays a key role in tasks such as information retrieval, text indexing, and machine-learning-based language models. However, a key research challenge for low-resourced languages such as Somali is the lack of human-annotated lemmatization datasets and reliable ground truth to underpin accurate morphological analysis and the training of relevant NLP models. To address this problem, we developed the first large-scale, purpose-built Somali lemmatization lexicon, coupled with a crowdsourcing platform for ongoing expansion. The system leverages Somali’s agglutinative and derivational morphology, encompassing over 5,584 root words and 78,629 derivative forms, each annotated with part-of-speech tags. For data validation purposes, we devised a pilot lexicon-based lemmatizer integrated with rule-based logic to handle out-of-vocabulary terms. Evaluation on a 294-document corpus covering news articles, social media posts, and short messages shows lemmatization accuracies of 51.27% for full articles, 44.14% for excerpts, and 59.51% for short texts such as tweets. These results demonstrate that combining lexical resources, POS tagging, and rule-based strategies provides a robust and scalable framework for addressing morphological complexity in Somali and other low-resource languages.
Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara
Michael Leventhal | Yacouba Diarra | Nouhoum Coulibaly | Panga Azazia Kamaté
We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes the code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We fine-tuned Parakeet-based models on a 33.47-hour human-reviewed subset and applied pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, fine-tuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
Full Fine-Tuning vs. Parameter-Efficient Adaptation for Low-Resource African ASR: A Controlled Study with Whisper-Small
Sukairaj Hafiz Imam | Muhammad Yahuza Bello | Hadiza Ali Umar | Tadesse Destaw Belay | Idris Abdulmumin | Seid Muhie Yimam | Shamsuddeen Hassan Muhammad
Automatic speech recognition (ASR) for African low-resource languages (LRLs) is often limited by scarce labelled data and the high cost of adapting large foundation models. This study evaluates whether parameter-efficient fine-tuning (PEFT) can serve as a practical alternative to full fine-tuning (FFT) for adapting Whisper-Small with limited labelled speech and constrained compute. We used a 10-hour subset of NaijaVoices covering Hausa, Yorùbá, and Igbo, and we compared FFT with several PEFT strategies under a fixed evaluation protocol. DoRA attains a 22.0% macro-average WER, closely aligning with the 22.1% achieved by FFT while updating only 4M parameters rather than 240M, and this difference remains within run-to-run variation across random seeds. Yorùbá consistently yields the lowest word error rates, whereas Igbo remains the most challenging, indicating that PEFT can deliver near FFT accuracy with substantially lower training and storage requirements for low-resource African ASR.
Real-Time Spoken Instruction Following and Translation in Ugandan Languages
Benjamin Akera | Tim Wenjie Hu | Patrick Walukagga | Evelyn Nafula Ouma | Yiga Gilbert | Ernest Tonny Mwebaze | John Quinn
Many languages are predominantly spoken rather than written, and to bring the benefits of LLMs to speakers of these languages, it is essential that models cater to the voice modality. The typical approach is to cascade ASR, LLM and TTS models together, though this results in systems with high latency, making them unsuitable for natural, real-time interaction. We describe results on taking the encoder part of a Whisper-based model trained to recognise ten languages common in Uganda, and using the Ultravox architecture to project its output directly to the input embedding space of a text model based on Qwen 3 32B, also trained to have comprehension of those languages. The result is a speech LLM with high accuracy and very low latency. For most spoken prompts, we can begin streaming a text response in as little as 50 ms, and a speech audio response within around one second, making real-time spoken interaction with an LLM possible for the first time in these languages. The model is available open source on Hugging Face.
SALT-31: A Machine Translation Benchmark Dataset for 31 Ugandan Languages
Solomon Nsumba | Benjamin Akera | Evelyn Nafula Ouma | Medadi E. Ssentanda | Deo Kawalya | Engineer Bainomugisha | Ernest Tonny Mwebaze | John Quinn
We present the SALT-31 benchmark dataset for evaluation of machine translation models covering 31 Ugandan languages. Unlike sentence-level evaluation sets, SALT-31 is constructed from short, scenario-driven mini-dialogues designed to preserve discourse context, pragmatics, and culturally grounded communication patterns common in everyday Ugandan settings. The dataset contains 100 English sentences organized into 20 typical communication scenarios, each represented as a five-sentence mini-sequence. It can therefore be used to evaluate both sentence-level and paragraph-level machine translation, and includes nearly every language spoken in a country with high linguistic diversity. It is available at https://huggingface.co/datasets/Sunbird/salt-31
Sample-Size Scaling of the African Languages NLI Evaluation
Anuj Tiwari | Oluwapelumi Ogunremu | Terry Oko-odion | Jesujuwon Egbewale | Hannah Sopuruchi Nwokocha
African languages have very little labelled data, and it is unclear whether augmenting the quantity of annotation data reliably enhances downstream performance. This study is a systematic sample-size scaling study of natural language inference (NLI) on 16 African languages based on the AfriXNLI benchmark. Under controlled conditions, two multilingual transformer models with roughly 0.6B parameters, XLM-R Large fine-tuned on XNLI and AfroXLM-R Large, are tested on sample sizes of between 50 and 500 labeled examples, with results averaged across random subsampling runs. Contrary to the common assumption of monotonic improvement with more data, we find strongly language-sensitive and often non-monotonic scaling behavior. Some languages show early saturation or decreases in performance with sample size, as well as high variance in low-resource regimes. These results indicate that the volume of data alone is not enough to guarantee stable gains in African NLI, underscoring the need for language-sensitive dataset creation and stronger multilingual modelling strategies.
Evaluating Yoruba Text-to-Speech Systems for Accessible Computer-Based Testing in Visually Impaired Learners
Kausar Yetunde Moshood | Victor Tolulope Olufemi | Oreoluwa Boluwatife Babatunde | Emmanuel Bolarinwa | Williams Oluwademilade
Text-to-Speech (TTS) technology offers potential to improve exam accessibility for visually impaired learners, but existing systems often underperform in underrepresented languages like Yoruba. This study evaluates current Yoruba TTS models in delivering standardized exam content to five visually impaired students through a web-based interface. Before testing, four Yoruba TTS systems were compared; only Facebook’s mms-tts-yor and YarnGPT produced intelligible Yoruba speech. Students experienced exam questions delivered by human voice, Braille, and TTS. All preferred Braille for clarity and independence, some valued human narration, while TTS was least favored due to robotic and unclear output. These results reveal a significant gap between TTS capabilities and the needs of users in low-resource languages. The paper highlights the urgency of developing tone-aware, user-centered TTS solutions to ensure equitable access to digital education for visually impaired speakers of underrepresented languages.
Power Asymmetries, Bias, and AI, a Reflection of Society on Low-Resourced Languages - African Languages as Case Study
Simbiat Ajao
In recent times, artificial intelligence (AI) systems have become the primary intermediary to information access, services, and opportunities. Currently, there are growing concerns as to how existing social inequalities are reproduced and amplified through AI. This is significantly evident in language technologies, where a small number of dominant languages, or what we’ll refer to as big languages, and cultural contexts shape the training, design, and evaluation of models. This paper examines the intersections of power asymmetries, linguistic bias, and cultural representation in AI, with a major focus on African languages and communities. We argue that current Natural Language Processing (NLP) systems reflect deep global imbalances in the availability of data, infrastructure, and decision-making power, often marginalizing low-resourced languages and cultural peculiarities. How these data are structured largely determines what their outcomes will be. With reference to examples from speech recognition, machine translation, and large language models, we highlight the social and cultural consequences of linguistic exclusion, including reduced accessibility, misinterpretation, and digital invisibility. Finally, we identify and discuss pathways toward more equitable language technologies, emphasizing community-led data practices, interdisciplinary collaboration, and context-aware evaluation frameworks. By foregrounding language as both a technical and political concern, this work advocates for African-centered approaches to NLP that promote fairness, accountability, and linguistic justice in AI development.
Sudanese-Flores: Extending FLORES+ to Sudanese Arabic Dialect
Hadia Mohmmedosman Ahmed Samil | David Ifeoluwa Adelani
In this work, we introduce Sudanese-Flores, an extension of the popular Flores+ machine translation (MT) benchmark to the Sudanese Arabic dialect. We translate both the DEV and DEVTEST splits of the Modern Standard Arabic dataset into the corresponding Sudanese dialect, resulting in a total of 2,009 sentences. While the dialect was recently introduced in Google Translate, there is no available benchmark for this dialect despite it being spoken by over 40 million people. Our evaluation of two leading LLMs, GPT-4.1 and Gemini 2.5 Flash, showed that while their performance on English to Arabic is impressive (more than 23 BLEU), they struggle on the Sudanese dialect (less than 11 BLEU) in zero-shot settings. In the few-shot scenario, we achieved only a slight boost in performance.
Where Are We at with Automatic Speech Recognition for the Bambara Language?
Seydou Diallo | Yacouba Diarra | Panga Azazia Kamaté | Aboubacar Ouattara | Mamadou K. Keita | Adam Bouno Kampo
This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, utilizing one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards; the top-performing system in terms of Word Error Rate (WER) achieved 46.76% and the best Character Error Rate (CER) of 13.00% was set by another model, while several prominent multilingual models exceeded 100% WER due to severe hallucinations. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario of the most simplified and formal form of spoken Bambara, these figures likely establish an upper bound for performance in practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
Enhancing Automatic Speech Recognition Models for Maternal and Reproductive Health: Fine-Tuning and Real-World Evaluation in Wolof
Ertony Basilwango | Yann Le Beux | Oche David Ankeli | Pierre Herve Berdys
Automatic Speech Recognition (ASR) systems perform well for high-resource languages, but most African languages, including Wolof, remain underrepresented, particularly in maternal and reproductive healthcare. This work proposes a domain-specific approach to improving Wolof ASR under low-resource conditions, addressing limited annotated data, orthographic variability, and code-switching. We curated a dataset of 750 validated Wolof utterances covering 250 maternal health keywords and applied data augmentation to increase acoustic diversity. Pretrained models, including wav2vec 2.0 and Whisper, were benchmarked to select candidates for fine-tuning. Using parameter-efficient Low-Rank Adaptation (LoRA), a Whisper model was adapted to the maternal health domain. Evaluation using Word Error Rate (WER), Character Error Rate (CER), and Keyword Error Rate (KER), which measures medically critical term transcription accuracy, shows substantial gains, reducing WER from 46.5% to 23.2% and KER from 17% to 11%. Community-based evaluation on 1,340 real-world utterances reveals a moderate degradation, with WER increasing by 35%. These results demonstrate that lightweight domain adaptation with small, high-quality data can significantly improve ASR for low-resource healthcare applications. This work introduces one of the first Wolof ASR datasets for healthcare and presents a practical framework for developing reliable speech recognition tools in underrepresented languages, improving access to healthcare information and services.
Eyaa-Tom 26, Yodi-Mantissa and Lom Bench: A Community Benchmark for TTS in Local Languages
Bakoubolo Essowe Justin | Catherine Nana Nyaah Essuman | Messan Agbobli | Ahoefa Kansiwer | Eli Jean Doumeyan | Julie Pato | Notou Your Timibe | Emile KOGBEDJI Agossou | Guedela Bakouya
We present an extension of our previous work on multilingual NLP for Togolese languages by introducing new datasets, improved models, and a community-driven evaluation benchmark for Text-To-Speech (TTS). We expand the Eyaa-Tom multilingual corpus with additional speech data of about 26.9k recordings (30.9 hours) across 10 local languages, and incorporate 64.6k clips (46.6 hours) of Mozilla Common Voice contributions for Adja, Nawdm, Mina, and Tem to strengthen Automatic Speech Recognition (ASR) and speech synthesis. We detail how community contributors – including collaboration with a national TV journalist – helped collect and validate the Kabyè and French text, with an ethical compensation model in place. We fine-tune state-of-the-art models: OpenAI Whisper and faster-whisper, and Meta’s NLLB-200 model for machine translation across 11 languages (achieving a 19.4 BLEU score for French→Ewe and a 26.1 BLEU score for Kabyè→French). We also introduce the Lom Bench, a community-based benchmark where native speakers rate TTS output, indicating promising preliminary results in Mina and Togolese lingua franca French, although further data is needed. We provide a comparative analysis of our results with recent multilingual systems, including Simba, Meta’s Omnilingual ASR, and UBC Toucan. Our work emphasizes practical pathways and how FAIR data sourcing and community participation can drive sustainable NLP development for underserved languages.
Using Subword-Embeddings for Bilingual Lexicon Induction in Bantu Languages
Adrian Breiding | Alan Akbik
Bilingual Lexicon Induction (BLI) is a valuable tool in machine translation and cross-lingual transfer learning, but it remains challenging for agglutinative and low-resource languages. In this work, we investigate the use of weighted sub-word embeddings in BLI for agglutinative languages. We further evaluate a graph-matching and Procrustes-based BLI approach on two Bantu languages, assessing its effectiveness in a previously underexplored language family. Our results for Swahili with an average P@1 score of 51.84% for a 3000 word dictionary demonstrate the success of the approach for Bantu languages. Weighted sub-word embeddings perform competitively on Swahili and outperform word embeddings in our experiments with Zulu.
AfriNLLB: Efficient Translation Models for African Languages
Yasmin Moslem | Aman Kassahun Wassie | Amanuel Gizachew Abebe
In this work, we present AfriNLLB, a series of lightweight models for efficient translation from and into African languages. AfriNLLB supports 15 language pairs (30 translation directions), including Swahili, Hausa, Yoruba, Amharic, Somali, Zulu, Lingala, Afrikaans, Wolof, and Egyptian Arabic, as well as other African Union official languages such as Arabic (MSA), French, Portuguese, and Spanish. Our training data covers bidirectional translation between English and 13 languages, and between French and two languages (Lingala and Wolof). AfriNLLB models are based on NLLB-200 600M, which we compress using iterative layer pruning and quantization. We fine-tune the pruned models on parallel corpora we curated for African languages, employing knowledge distillation from a larger teacher model. Our work aims at enabling efficient deployment of translation models for African languages in resource-constrained settings. Our evaluation results demonstrate that AfriNLLB models achieve performance comparable to the baseline while being significantly faster. We release two versions of the AfriNLLB models, a Transformers version that allows further fine-tuning and a CTranslate2 version for efficient inference. Moreover, we release all the training data that we used for fine-tuning the baseline and pruned models to facilitate further research.
Proceedings of the Ninth Fact Extraction and VERification Workshop (FEVER)
Mubashara Akhtar | Rami Aly | Rui Cao | Christos Christodoulopoulos | Oana Cocarascu | Zhijiang Guo | Arpit Mittal | Michael Schlichtkrull | James Thorne | Andreas Vlachos
Weakly-supervised Argument Mining with Boundary Refinement and Relation Denoising
Wei Sun | Mingxiao Li | Jesse Davis | Elena Cabrio | Serena Villata | Marie-Francine Moens
Argument mining (AM) involves extracting argument components and predicting relations between them to create argumentative graphs, which are essential for applications requiring argumentative comprehension. To automatically provide high-quality graphs, previous works require a large amount of human-annotated training samples to train AM models. Instead, we leverage a large language model (LLM) to assign pseudo-labels to training samples, reducing reliance on human-annotated training data. However, the training data weakly labeled by the LLM are too noisy to develop an AM model with reliable performance. In this paper, to improve the model performance, we propose a center-based component detector that refines the boundaries of the detected components and a relation denoiser to deal with noise present in the pseudo-labels when classifying relations between detected components. Experimentally, our AM model improves the boundary detection obtained from the LLM by up to 16% in terms of IoU75 and improves the relation classification obtained from the LLM by up to 12% in terms of macro-F1 score. Our AM model achieves new state-of-the-art performance in weakly-supervised AM, showing up to a 6% improvement over the state-of-the-art component detector and up to a 7% improvement over the state-of-the-art relation classifier. Additionally, our model uses less than 20% of human-annotated data to match the performance of state-of-the-art fully-supervised AM models.
POaaS: Minimal-Edit Prompt Optimization as a Service to Lift Accuracy and Cut Hallucinations on On-Device sLLMs
Jungwoo Shim | Dae Won Kim | Sunwook Kim | Sooyoung Kim | Myungcheol Lee | Jaegeun Cha | Hyunhwa Choi
Small language models (sLLMs) are increasingly deployed on-device, where imperfect user prompts (typos, unclear intent, or missing context) can trigger factual errors and hallucinations. Existing automatic prompt optimization (APO) methods were designed for large cloud LLMs and rely on search that often produces long, structured instructions; when executed under an on-device constraint where the same small model must act as optimizer and solver, these pipelines can waste context and even hurt accuracy. We propose POaaS, a minimal-edit prompt optimization layer that routes each query to lightweight specialists (Cleaner, Paraphraser, Fact-Adder) and merges their outputs under strict drift and length constraints, with a conservative skip policy for well-formed prompts. Under a strict fixed-model setting with Llama-3.2-3B and Llama-3.1-8B, POaaS improves both task accuracy and factuality while representative APO baselines degrade them, and POaaS recovers up to +7.4% under token deletion and mixup. Overall, per-query conservative optimization is a practical alternative to search-heavy APO for on-device sLLMs.
Evidence Grounding vs. Memorization: Why Neural Semantics Matter for Knowledge Graph Fact Verification
Ankit Kumar Upadhyay | John S. Erickson | Deborah L. McGuinness
Knowledge graphs like DBpedia enable structured fact verification, but the relative contributions of symbolic structure, neural semantics, and evidence grounding remain unclear. We present a systematic study on FACTKG (108,675 claims) comparing symbolic, neural, and LLM-based approaches. Our symbolic baseline using 29 hand-crafted features covering graph structure, entity coverage, and semantic relation type achieves 66.54% accuracy, while BERT over linearized subgraphs reaches 92.68% and graph neural networks plateau at 70%, demonstrating that token-level semantics outperform both symbolic features and message passing. Using GPT-4.1-mini to filter training data, budget-matched controls show that token-budget control recovers most of the gap over truncation-dominated inputs, while LLM semantic selection adds +1.31 points beyond lexical heuristics (78.85% filtered vs. 77.54% heuristic vs. 52.70% unfiltered), showing that semantic relevance, not just evidence quantity, governs learnability. Finally, comparing 300 test claims under memorization (claim-only) versus KG-grounded reasoning with chain-of-thought, we find KG grounding improves GPT-4o-mini and GPT-4.1-mini accuracy by 12.67 and 9.33 points respectively, with models citing specific triples for interpretability. These results demonstrate that neural semantic representations and explicit KG evidence grounding are highly effective for robust, interpretable fact verification.
The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods
Arpit Singh Gautam | Kailash Talreja | Saurabh Jha
Large Language Models (LLMs) frequently "hallucinate" plausible but incorrect assertions, a vulnerability often missed by uncertainty metrics when models are "confidently wrong." We propose DiffuTruth, an unsupervised framework that re-conceptualizes fact verification via non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the "Generative Stress Test": claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric measuring the semantic divergence between the original claim and its reconstruction using an NLI critic. Unlike vector-space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration fusing this stability signal with discriminative confidence. Extensive experiments on FEVER demonstrate DiffuTruth achieves a state-of-the-art unsupervised AUROC of 0.725, outperforming baselines by +1.5% through the correction of overconfident predictions. Furthermore, we show superior zero-shot generalization on the multi-hop HOVER dataset, outperforming baselines by over 4%, confirming the robustness of thermodynamic truth properties to distribution shifts.
BiCon-Gate: Consistency-Gated De-colloquialisation for Dialogue Fact-Checking
Hyunkyung Park | Arkaitz Zubiaga
Automated fact-checking in dialogue involves multi-turn conversations where colloquial language is frequent yet understudied. To address this gap, we propose a conservative rewrite candidate for each response claim via staged de-colloquialisation, combining lightweight surface normalisation with scoped in-claim coreference resolution. We then introduce BiCon-Gate, a semantics-aware consistency gate that selects the rewrite candidate only when it is semantically supported by the dialogue context, otherwise falling back to the original claim. On the DialFact benchmark, this gated selection stabilises downstream fact-checking, yields gains in both evidence retrieval and fact verification (with particularly strong gains on SUPPORTS), and outperforms competitive baselines, including a decoder-based one-shot LLM rewrite that attempts to perform all de-colloquialisation steps in a single pass.
The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task
Rui Cao | Yulong Chen | Zhenyun Deng | Michael Schlichtkrull | Andreas Vlachos
The Automatic Verification of Image-Text Claims (AVerImaTeC) shared task aims to advance system development for retrieving evidence and verifying real-world image-text claims. Participants were allowed to either employ external knowledge sources, such as web search engines, or leverage the curated knowledge store provided by the organizers. System performance was evaluated using the AVerImaTeC score, defined as a conditional verdict accuracy in which a verdict is considered correct only when the associated evidence score exceeds a predefined threshold. The shared task attracted 14 submissions during the development phase and 6 submissions during the testing phase. All participating systems in the testing phase outperformed the baseline provided. The winning team, HUMAN, achieved an AVerImaTeC score of 0.5455. This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.
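The conditional-accuracy idea behind the AVerImaTeC score can be sketched as follows. This is an illustrative simplification, not the official scorer: the threshold value, label names, and record layout below are assumptions made for the example:

```python
def averimatec_style_score(examples, threshold=0.25):
    """Conditional verdict accuracy: a predicted verdict counts as
    correct only when the claim's evidence score clears the threshold.

    examples: list of dicts with keys 'pred', 'gold', 'evidence_score'
    (field names and threshold are hypothetical).
    """
    correct = sum(
        1 for ex in examples
        if ex["evidence_score"] >= threshold and ex["pred"] == ex["gold"]
    )
    return correct / len(examples)

examples = [
    {"pred": "Supported", "gold": "Supported", "evidence_score": 0.4},
    {"pred": "Refuted",   "gold": "Refuted",   "evidence_score": 0.1},  # evidence too weak to count
    {"pred": "Refuted",   "gold": "Supported", "evidence_score": 0.9},  # wrong verdict
    {"pred": "Supported", "gold": "Supported", "evidence_score": 0.3},
]
print(averimatec_style_score(examples))  # only 2 of 4 verdicts count
```

Note how the second example is a correct verdict that still scores zero because its evidence is below threshold; this is what separates the metric from plain verdict accuracy.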
Take It All: Ensemble Retrieval for Multimodal Evidence Aggregation
Max Upravitelev | Veronika Solopova | Premtim Sahitaj | Ariana Sahitaj | Charlott Jakob | Sebastian Möller | Vera Schmitt
Multimodal fact checking has become increasingly important due to the predominance of visual content on social media platforms, where images are frequently used to enhance the credibility and spread of misleading claims, while generated images become more prevalent and realistic as generative models advance. Incorporating visual information, however, substantially increases computational costs, raising critical efficiency concerns for practical deployment. In this study, we propose and evaluate the ADA-AGGR (ensemble retrievAl for multimoDAl evidence AGGRegation) pipeline, which achieved second place on both the dev and test leaderboards of the FEVER 9/AVerImaTeC shared task. However, long runtimes per claim highlight efficiency challenges in designing multimodal claim verification pipelines. We therefore run extensive ablation studies and configuration analyses to identify possible performance–runtime improvements. Our experiments show that substantial efficiency gains are possible without significant loss in verification quality. For instance, we reduced the average runtime by up to 6.28× while maintaining comparable performance across evaluation metrics by aggressively downsampling input images processed by visual language models. Overall, our results highlight that careful design choices are crucial for building scalable and resource-efficient multimodal fact-checking systems suitable for real-world deployment.
REVEAL: Retrieval-Enhanced Verification for Multimodal Fact-Checking
Amina Tariq | Yova Kementchedjhieva
Multimodal misinformation combines images and text to amplify false narratives, yet most fact-checking research addresses only textual claims. The AVerImaTeC shared task introduces real-world image-text claims requiring sophisticated evidence retrieval. We present REVEAL (Retrieval-Enhanced Verification with Evidence Accumulation Loop), a system designed to overcome the “semantic gap,” defined as the disconnect between the neutral phrasing of claims and the adversarial vocabulary of debunking evidence. Unlike static baselines, REVEAL breaks down the verification task into an iterative context loop, integrating sparse and dense retrieval signals to aggressively target refuting evidence. We achieve a Verdict Accuracy of 23.6% and an Evidence Recall of 27.7% on the test set. Our results outperform the official baseline across all metrics, validating our hybrid retrieval strategy for complex multimodal verification.
VILLAIN at AVerImaTeC: Verifying Image–Text Claims via Multi-Agent Collaboration
Jaeyoon Jung | Yejun Yoon | Seunghyun Yoon | Kunwoo Park
This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu-humane/VILLAIN.
Selective Multimodal Retrieval for Automated Verification of Image–Text Claims
Yoana Tsoneva | Paul-Conrad Feig | Jiaao Li | Veronika Solopova | Neda Foroutan | Arthur Hilbert | Vera Schmitt
This paper presents an efficiency-aware pipeline for automated fact-checking of real-world image–text claims that treats multimodality as a controllable design variable rather than a property that must be uniformly propagated through every stage of the system. The approach decomposes claims into verification questions, assigns each to text- or image-related types, and applies modality-aware retrieval strategies, while ultimately relying on text-only evidence for verdict prediction and justification generation. Evaluated on the AVerImaTeC dataset within the FEVER-9 shared task, the system achieves competitive question, evidence, verdict, and justification scores and ranks fourth overall, outperforming the official baseline on evidence recall, verdict accuracy, and justification quality despite not using visual evidence during retrieval. These results demonstrate that strong performance on multimodal fact-checking can be achieved by selectively controlling where visual information influences retrieval and reasoning, rather than performing full multimodal fusion at every stage of the pipeline.
In this paper, we present our 3rd-place system in the AVerImaTeC shared task, which combines our last year’s retrieval-augmented generation (RAG) pipeline with a reverse image search (RIS) module. Despite its simplicity, our system delivers competitive performance with a single multimodal LLM call per fact-check at just 0.013 on average using GPT5.1 via the OpenAI Batch API. Our system is also easy to reproduce and tweak, consisting of only three decoupled modules — a textual retrieval module based on similarity search, an image retrieval module based on API-accessed RIS, and a generation module using GPT5.1 — which is why we suggest it as an accessible starting point for further experimentation. We publish its code and prompts, as well as our vector stores and insights into the scheme's running costs and directions for further improvement.
Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics
Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist
Kellen Parker van Dam | Abishek Stephen
Lexical data collection in language documentation often contains transcription errors and borrowings that can mislead linguistic analysis. We present unsupervised methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using phoneme-level and syllable-level n-gram language models, our approach identifies potential transcription errors and borrowings. We evaluate our methods against a hand-annotated gold standard and rank the phonotactic outliers using the precision and recall at K metric. The ranking approach provides field linguists with a method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
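The core idea of ranking phonotactic outliers with an n-gram language model can be sketched as follows. This is a generic illustration (character bigrams with add-one smoothing over a toy wordlist), not the authors' implementation, and the example words are invented:

```python
import math
from collections import Counter

def bigrams(word):
    w = f"#{word}#"  # add word-boundary markers
    return [w[i:i + 2] for i in range(len(w) - 1)]

def train(words):
    counts = Counter(bg for w in words for bg in bigrams(w))
    return counts, sum(counts.values())

def avg_surprisal(word, counts, total):
    # add-one smoothing gives unseen bigrams a finite, high surprisal
    bgs = bigrams(word)
    return sum(
        -math.log((counts[bg] + 1) / (total + len(counts) + 1))
        for bg in bgs
    ) / len(bgs)

# Toy wordlist: one entry is phonotactically unlike the rest
wordlist = ["buma", "bumui", "mabu", "muba", "xzq"]
counts, total = train(wordlist)
ranked = sorted(wordlist,
                key=lambda w: avg_surprisal(w, counts, total),
                reverse=True)
print(ranked[0])  # the most unusual entry is flagged first
```

Entries at the top of such a ranking are the candidates a field linguist would inspect first; precision and recall at K then measure how many of the top-K flagged entries are genuine errors or borrowings.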
Field linguistics increasingly relies on computational tools to organize, analyze, and preserve linguistic data, yet the classificatory assumptions embedded in these tools are rarely examined. A pervasive assumption is that languages can be treated as discrete, genealogically defined units, with relatedness modeled as tree-structured descent. We argue that this assumption misrepresents linguistic evidence in contact-heavy regions and risks distorting the computational mediation of field linguistic data. Focusing on South Asia, we show that widely assumed boundaries—such as the Indo-Aryan–Dravidian divide—collapse in long-standing contact zones characterized by convergence, dialect continua, and institutional multilingualism. Through historically grounded case studies including Kannada–Telugu and Tamil–Malayalam, we demonstrate how convergence, script-mediated distance, and post-hoc standardization reshape how field data is segmented, compared, and interpreted when organized through genealogical labels. We argue that contact-aware, relational models of linguistic relatedness are necessary if NLP tools are to support, rather than distort, the documentation and analysis of linguistic diversity.
Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan
Siyu Liang | Talant Mawkanuli | Gina-Anne Levow
Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
Linguistically Informed Tokenization Improves ASR for Underresourced Languages
Massimo Marie Daul | Alessio Tosolini | Claire Bowern
Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems rely on data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec 2.0 ASR model on Yanyhangu, an Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR’s viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves word error rate (WER) and character error rate (CER) compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can provide significant assistance for underresourced language documentation.
Short-form verbal arts as a speech data resource in the field
Matthew Faytak | Tianle Yang | Pius Wuchu Akumbu | Ivo Forghema Njuasi | Éric Le Ferrand
We propose a method for efficient field data collection of speech resource data which leverages short-form verbal arts, namely riddles and proverbs, which permit a predictable transcript to be assigned to naturalistic but conventionalized utterances. As a proof of concept, we describe a 5.25 hour corpus of proverbs and riddles collected for Kom, a low-resource language of Cameroon, and conduct ASR modeling experiments on the corpus. Results suggest that the method yields high quality speech data, albeit with relatively low lexical diversity. We highlight the alignment of the collected data with community priorities for cultural education and preservation in the Cameroonian context.
Quantitative Lect Description: A Case Study of Lemko from the Field Data of 1920s-1930s
Ilia Afanasev
While qualitative descriptions (in the form of reference grammars) and benchmarks for low-resource languages are becoming increasingly widespread, computational linguists do not often use quantitative methods to describe a new lect rather than a new model. This paper intends to close this lacuna. The case study is a Lemko text transcribed at the beginning of the twentieth century. Using morphosyntactic tagging and topic modelling, the study demonstrates areal influences and archaic features of the lect. Fine-grained evaluation significantly assists in identifying subtle patterns that are not readily apparent through traditional metrics such as accuracy score. The results highlight the necessity of a more detailed analysis of model performance, which may yield more linguistically significant results than a purely manual check. This information is present in the resulting dataset, which can be used for further investigation into the structural features of the Lemko lect.
We conduct a preliminary study of the order of subject (S), object (O), and verb (V) in Tatyshly Udmurt (Finno-Ugric) on the basis of approximately 900 clauses from oral folklore and non-folklore narratives (including contemporary texts and texts recorded earlier) using a gradient approach. We show that the most frequent word orders are SOV, SV, and OV. In full clauses (with both S and O), in folklore texts SOV order (≈ 70%) is followed by OSV order (≈ 15%). In contemporary non-folklore texts, however, SOV order competes with SVO order (50% vs 30%), which may be explained by the influence of Russian. We note that full clauses may differ from clauses with only S or with only O: in contemporary folklore texts VS order is much more frequent in S-only clauses (≈ 23%) than in full ones (≈ 4%), and in contemporary non-folklore texts VO order is more frequent in full clauses (≈ 35%) than in O-only ones (≈ 12%). Moreover, we show that word order can depend on the type of clause. For example, in existential clauses the order is almost always SV, while clauses with verbs of speech often have VS order.
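The word-order tabulation described above can be sketched as a simple frequency count over clause-level constituent orderings. The clause annotations below are invented toy data, not the Tatyshly Udmurt corpus:

```python
from collections import Counter

# Each clause is the linear order of whichever of S, O, V are present
# (toy annotations for illustration only)
clauses = [
    ["S", "O", "V"], ["S", "V"], ["O", "V"], ["S", "O", "V"],
    ["S", "V", "O"], ["V", "S"], ["S", "O", "V"],
]
orders = Counter("".join(c) for c in clauses)
total = sum(orders.values())
for order, n in orders.most_common():
    print(f"{order}: {n / total:.0%}")
```

A gradient analysis like the one in the paper reports these relative frequencies (optionally split by text type or by whether both S and O are overt) rather than assigning the language a single basic order.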
Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026)
Vera Danilova | Murathan Kurfalı | Ylva Söderfeldt | Julia Reed | Andrew Burchell
Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?
Grace Chang Yuan | Xiaoman Zhang | Sung Eun Kim | Pranav Rajpurkar
Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.
The Doctor Will Agree With You Now: Sycophancy of Large Language Models in Multi-Turn Medical Conversations
Taeil Matthew Kim | Luyang Luo | Sung Eun Kim | Arjun Kumar Manrai | Eric Topol | Pranav Rajpurkar
Large language models (LLMs) increasingly exhibit sycophancy—the tendency to conform to user beliefs rather than provide factually accurate information—posing significant risks in healthcare applications where reliability is paramount. We evaluate sycophantic behavior in ten LLMs from OpenAI, Google, and Anthropic across multi-turn medical conversations using an escalatory pushback framework. To enable fine-grained analysis, we introduce Resistance, a metric that measures nonconformity to user stances at each conversational turn, providing insights beyond existing flip-based metrics. Evaluating on MedCaseReasoning (open-ended diagnostic questions) and PubMedQA (clear-answer biomedical questions), we find that Gemini models exhibit the highest Resistance, followed by OpenAI and Claude models. We further observe that response patterns ("Yes, but..." vs. "Yes, and...") may be more predictive of sycophancy than specific phrases. Notably, all models are more easily persuaded to change their answers on clear multiple-choice questions than on ambiguous diagnostic cases. Our findings highlight critical vulnerabilities in deploying LLMs for clinical decision support and suggest that training toward contradiction-maintaining response patterns may serve as a potential mitigation strategy.
Discourses of Prevention: A Multimodal Study of HPV Vaccination Campaigns in Italy
Claudia Roberta Combei | Antonio Bianco | Elena Giribaldi | Adalberto Lovotti | Valentina Ghirotto | Marianna France Pasquali | Sara Gemelli | Chiara Cassani | Chiara Zanchi
This study assesses the communicative effectiveness of Italian HPV vaccination campaign materials using a mixed-methods design that combines expert annotation and a public perception experiment. A corpus of 49 official documents was annotated by six experts (three Linguistics Ph.D. students and three Gynecology residents) across 56 variables capturing the appropriateness and efficiency of verbal and visual elements. The perception experiment, administered to a convenience sample of the Italian general public, examined attitudes toward HPV vaccination and evaluations of communication effectiveness. Overall, both expert and public assessments converged in judging the HPV vaccination campaign materials as relatively weak, citing reduced informativeness in overly concise texts, inappropriate choice of colors, and recurring issues regarding gender representation, inclusivity, and diversity.
Extracting medical decisions from clinical notes is a key step for clinical decision support and patient-facing care summaries. We study how the linguistic characteristics of clinical decisions vary across decision categories and whether these differences explain extraction failures. Using MedDec discharge summaries annotated with decision categories from the Decision Identification and Classification Taxonomy for Use in Medicine (DICTUM), we compute seven linguistic indices for each decision span and analyze span-level extraction recall of a standard transformer model. We find clear category-specific signatures: drug-related and problem-defining decisions are entity-dense and telegraphic, whereas advice and precaution decisions contain more narrative, with higher stopword and pronoun proportions and more frequent hedging and negation cues. On the validation split, exact-match recall is 48%, with large gaps across linguistic strata: recall drops from 58% to 24% from the lowest to highest stopword-proportion bins, and spans containing hedging or negation cues are less likely to be recovered. Under a relaxed overlap-based match criterion, recall increases to 71%, indicating that many errors are span boundary disagreements rather than complete misses. Overall, narrative-style spans (common in advice and precaution decisions) are a consistent blind spot under exact matching, suggesting that downstream systems should incorporate boundary-tolerant evaluation and extraction strategies for clinical decisions.
Semantic Echo Pathways (SEP): Tracing How Medical Language Propagates and Transforms
Charu Karakkaparambil James | Marcio Monteiro | Sophie Fellenz
We introduce Semantic Echo Pathways (SEP), a new approach for modeling the cross-domain evolution of medical language. Using continual neural topic models (CoNTM) trained separately on scientific literature, clinical notes, and public health-related data, we track linguistic drift and identify points where concepts change meaning. We propose three novel metrics: Cross-Domain Drift Score, Temporal Echo Lag, and Semantic Mutation Patterns to quantify how medical language travels between the scientific, clinical, and public domains. Applications to evolving concepts such as "long COVID" and diagnostic category changes reveal previously undocumented patterns of medical-semantic evolution. Our results bridge computational modeling with the human-centered perspectives of medical humanities, offering clear, domain-aware maps of how medical language shifts across time and domains, and combining quantitative analysis with linguistic and clinical insight.
A Graph-Augmented Liquid Neural Network for Extracting Food Hazards and Disease Outbreaks
Tirthankar Dasgupta | Manjira Sinha | Sudeshna Jana | Diya Saha | Ishan Verma | Vaishali Aggarwal
The increasing frequency of foodborne illnesses, safety hazards, and disease outbreaks in the food supply chain demands urgent attention to protect public health. These incidents, ranging from contamination to intentional adulteration of food and feed, pose serious risks to consumers, leading to poisoning and disease outbreaks that result in product recalls. Identifying and tracking the sources and pathways of contamination is essential for timely intervention and prevention. This paper explores the use of social media and regulatory news reports to detect food safety issues and disease outbreaks. We present an automated approach leveraging a multi-task sequence labeling and sequence classification model that uses a liquid time-constant neural network augmented with a graph convolution network to extract and analyze relevant information from social media posts and official reports. Our methodology includes the creation of annotated datasets of social media content and regulatory documents, enabling the model to identify foodborne infections and safety hazards in real-time. Preliminary results demonstrate that our model outperforms baseline models, including advanced large language models like LLAMA-3 and Mistral-7B, in terms of accuracy and efficiency. The integration of liquid neural networks significantly reduces computational and memory requirements, achieving superior performance with just 1.2 × 10⁶ bytes of memory, compared to the 20.3 GB of GPU memory needed by traditional transformer-based models. This approach offers a promising solution for leveraging social media data in monitoring and mitigating food safety risks and public health threats.
Multimodal Artificial Intelligence (AI) promises to transform biomedicine by integrating imaging, genomics, and clinical data for superior decision-making. Yet, we contend that the current pursuit of large-scale generalist models is fundamentally misaligned with the high-risk nature of biomedical applications. This position paper argues that biomedical NLP demands specialization, not generalization, challenging the assumption that greater model scale and generality inherently ensure robustness in healthcare. We propose a theoretical framework built on three biomedical axioms: error cost asymmetry, multimodal data fragility, and interpretability–utility coupling, alongside a formal proof of criticality in biomedical NLP, showing that generalist models are intrinsically unsuited for medical tasks. As a secondary contribution, we advance a task-first design paradigm centered on modular, specialized, and ethically grounded AI architectures for biomedical use. Through analysis and illustrative cases, we contrast this approach with scale-centric strategies, exposing risks such as bias amplification, reduced interpretability, and exclusion of rare or underrepresented populations. We call for a realignment of research, funding, and regulation toward specialization as the sustainable path for meaningful and equitable biomedical AI, aiming to spark critical discourse on what constitutes genuine progress in machine learning for health.
An Enhanced Training-Free Pipeline for Entity Recognition and Linking: A Low-Resource Case Study – 20th-Century Historical Medical Texts
Phu-Vinh Nguyen | Vera Danilova
Entity linking in biomedicine typically relies on large annotated corpora and supervised methods, which often fail in out-of-distribution settings. Historical medical texts are rich in biomedical terms but pose unique challenges: terminology has changed, some concepts are obsolete, and stylistic differences from modern journals prevent off-the-shelf models fine-tuned on contemporary datasets from aligning historical terms with current ontologies. Training-free methods based on LLMs offer a solution by linking historical terms to modern concepts and inferring their meaning from context. In this paper, we evaluate a state-of-the-art training-free entity linking method on historical medical texts and propose an improved pipeline—end-to-end entity extraction and linking with confidence estimation. We also assess performance on modern benchmarks to check whether the gains generalize to other domains, and show that our pipeline remains superior in most cases. We report an analysis of the findings. The code and curated dataset for historical medical entity linking are available on GitHub.
Graph-Enhanced LLM Analysis of Multimodal Health Communities: A Computational Framework for Patient Discourse Understanding on TikTok
Tawakalit Agboola | Oluwaseun Ajao
Social media platforms have become critical sources of patient-generated health data, yet existing computational approaches fail to capture the interconnected nature of online health discourse. We present a novel framework that integrates graph-based community detection with large language model analysis to understand patient narratives in multimodal social media content. Applied to 10,253 TikTok posts about JAK inhibitors (January 2020–September 2024), our approach constructs heterogeneous graphs representing user-content-medical entity relationships and applies community detection algorithms enhanced with context-aware LLM interpretation. Our analysis reveals five distinct patient communities characterized by different discourse patterns: treatment success narratives (873 nodes), medication guidance (642 nodes), side effect discussions (589 nodes), comparative treatment analysis (412 nodes), and dosage optimization (347 nodes). The Louvain algorithm significantly outperformed Girvan-Newman in modularity (0.9931 vs. 0.9928), conductance (0.0002 vs. 0.0006), and computational efficiency (0.14s vs. 54.24s). Temporal analysis demonstrates increasing community cohesion and evolving discourse patterns from cautious inquiry (2020-2021) to experience sharing and specialized sub-communities (2023-2024). This work contributes: (1) a scalable computational framework for multimodal health content analysis, (2) methodological innovations in graph-LLM integration, and (3) insights into platform-specific health communication patterns. The framework has applications in pharmacovigilance, computational social science, and AI-assisted health monitoring systems.
Almost Clinical: Linguistic properties of synthetic electronic health records
Serge Sharoff | John Baker | Dr David Francis Hunt | Alan Simpson
This study evaluates the linguistic and clinical suitability of synthetic electronic health records in mental health. First, we describe the rationale and the methodology for creating the synthetic corpus. Second, we examine expressions of agency, modality, and information flow across four clinical genres (Assessments, Correspondence, Referrals and Care plans) with the aim to understand how LLMs grammatically construct medical authority and patient agency through linguistic choices. While LLMs produce coherent, terminology-appropriate texts that approximate clinical practice, systematic divergences remain, including registerial shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures. The results show both the potential and limitations of synthetic corpora for enabling large-scale linguistic research otherwise impossible with genuine patient records.
Mind Your Steps in Biomedical Named Entity Recognition: First Extract, Tag Afterwards
Darya Shlyk | Stefano Montanelli | Marco Mesiti | Lawrence Hunter
Few-shot prompting with Large Language Models (LLMs) has emerged as a promising paradigm for advancing information extraction, particularly in data-scarce domains like biomedicine, where high annotation costs constrain the availability of training data. However, challenges persist in biomedical Named Entity Recognition (NER), where LLMs fail to achieve necessary accuracy and lag behind supervised fine-tuned models. In this study, we introduce FETA (First Extract, Tag Afterwards), a two-stage approach for entity recognition that combines instruction-guided prompting and a novel self-verification strategy to improve accuracy and reliability of LLM predictions in domain-specific NER tasks. FETA achieves state-of-the-art results on multiple established biomedical datasets. Our experiments demonstrate that carefully designed prompts, using self-verification and instruction guidance, can steer general-purpose LLMs to outperform fine-tuned models in knowledge-intensive NER tasks, unlocking their potential for more reliable and accurate information extraction in resource-constrained settings.
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
Ikram Belmadani | Oumaima El Khettari | Pacôme Constant dit Beaufils | Richard Dufour | Benoit Favre
Automatic evaluation of open-ended question answering in specialized domains remains challenging mainly because it relies on manual annotations from domain experts. In this work, we assess the ability of several large language models (LLMs), including closed-access (GPT-5.1, Gemini-2.5-Pro), open-source general-purpose (Qwen-80B), and biomedical domain-adapted models (MedGemma-27B, Phi-3.5-mini variants), to act as automatic evaluators of semantic equivalence in French medical open-ended QA. Our analysis reveals that LLM-based judgments are sensitive to the source of answer generation: judgment correlation varies substantially across different generator models. Among the judges, MedGemma-27B and Qwen-80B achieve the highest agreement with expert annotations in terms of F1 score and Pearson correlation. We further explore lightweight adaptation strategies on Phi-3.5-mini using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). Even with 184 training instances, these adaptations significantly improve Phi-3.5’s results and reduce variability across answer generators, achieving performance comparable to larger domain-adapted models. Our results highlight the importance of generator-aware evaluation, the limitations of general-purpose LLMs in domain-specific settings, and the effectiveness of lightweight adaptation for compact models in low-resource scenarios.
Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks
Chaimae Abouzahir | Congbo Ma | Nizar Habash | Farah E. Shamout
In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis shows that model-reported confidence and explanations are poor indicators of correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.
Modulating Multi-Label Tendency in Zero-Shot LLM Coding: The Effect of Output Structure on CDSS Feedback Analysis
Hyunwoo Choo | Sungsoo Hong
Large language models (LLMs) often default to single-label classification in zero-shot multi-label tasks—a tendency we term "conservative default". While few-shot prompting mitigates this, it introduces "example bias". We evaluate zero-shot strategies to modulate this tendency using 1,441 healthcare feedback records and two LLMs. We compare instruction-based methods with structural constraints that modify the token generation sequence, specifically an Enum-First format requiring domain enumeration before selection. Results show that structural constraints substantially reduce single-label rates (Magistral: 96% → 19%; Qwen3: 54% → 0.0%), though the latter suggests potential over-correction compared to human baselines (16.7–41.3%). These findings indicate that while output structure is a potent modulator of classification behavior by shifting the decision point upstream, its effect magnitude is model-dependent, necessitating empirical calibration to prevent spurious associations.
Normalizing Health Concepts with Biomedical Embedding and LLMs
Iram Azam | Keyuan Jiang | Gordon Bernard
Accurate normalization of health-related expressions to standardized biomedical concepts is crucial for both healthcare and biomedical research. However, traditional string-based matching methods are limited by lexical variations. In this study, we propose a neural embedding-based normalization framework that utilizes an embedding model trained on biomedical terminology, generating over 3.59 million embeddings corresponding to UMLS terms and Concept Unique Identifiers (CUIs). For clinical data, CUIs were retrieved via semantic matching, while Twitter phrases were first processed using a large language model (LLM) to generate preferred terms prior to embedding-based CUI retrieval. Our approach substantially outperforms exact string matching and MetaMap Lite. For clinical data (3,144 phrases), normalization accuracy improved from 0.679 (string match) and 0.574 (MetaMap Lite) to 0.858. For Twitter data (102 phrases), accuracy increased from 0.235 (string match) and 0.118 (MetaMap Lite) to a range of 0.882 (Gemini 2.5 Flash) to 0.980 (GPT-4o mini). These findings highlight both the effectiveness of embedding-based semantic retrieval and the ability of LLMs to generate preferred terms, enhancing robustness in health concept normalization across diverse text sources.
From Pain to Praise: Aspect-Based Sentiment Analysis for Norwegian Patient Feedback
Lilja Charlotte Storset | Elma Jelin | Rebecka Maria Norman | Oyvind Bjertnaes | Lilja Øvrelid | Erik Velldal
This paper describes a new dataset for aspect-based sentiment analysis (ABSA) for analyzing patient feedback about healthcare services. In an interdisciplinary collaboration spanning the fields of natural language processing and healthcare research, we manually annotate a dataset of 2382 free-text comments collected from national patient experience surveys in Norway, covering two sub-fields of services – special mental healthcare and general practitioners. Annotations are provided on both the sentence- and comment-level, covering a fine-grained set of 25 unique healthcare-related aspects and their polarities. We also report results for fine-tuning both encoder and decoder models on the resulting dataset, comparing different modeling strategies, such as joint and sequential prediction of aspects and polarity. The resources developed in this work can assist healthcare researchers in the analysis of patient feedback, bringing a much more efficient approach compared to today’s manual analysis, potentially leading to improved patient satisfaction and clinical outcomes.
LLM Plug-ins Are Not a Free Lunch for Clinical Time-Series Prediction
Juhwan Choi | Kwanhyung Lee | Sangchul Hahn | Eunho Yang
Inspired by recent plug-in frameworks that repurpose frozen layers from large language models (LLMs) as inductive priors, we explore whether such mechanisms can be extended to clinical time-series prediction without textual inputs or LLM fine-tuning. We introduce a lightweight plug-in architecture that inserts a single frozen LLM Transformer layer between an aggregated time-series representation and the prediction head. Unlike prior work focused on vision or language tasks, our study targets clinical time-series data, where LLMs typically underperform when applied directly. Experiments on two ICU prediction tasks from MIMIC-III show that the proposed plug-in exhibits heterogeneous effects across different backbones and tasks, with occasional performance improvements and minimal computational overhead. We further compare general-purpose and medical-domain LLM layers under an identical plug-in setting, analyzing how domain specialization interacts with clinical time-series models. Overall, our results highlight important limitations of frozen LLM plug-ins and motivate future work on understanding the conditions under which such layers may be beneficial.
Tracking Autism Stigma in Italian Newspapers: A Longitudinal Analysis of Media Discourse (2016–2025)
Ginevra Martinelli | Chiara Barattieri di San Pietro | Daniela Ovadia | Marta Bosia | Valentina Bambini
Public awareness of Autism Spectrum Disorder (ASD) has grown in recent years, yet stigma surrounding this condition persists. Building on prior research showing increasingly positive portrayals of ASD, this study examines recent longitudinal trends in stigma and ASD, with a focus on Italian newspapers, and how these were affected by a key event such as the COVID-19 pandemic. We analyzed nearly 3,000 articles published between 2016 and 2025 using an innovative multi-layered Natural Language Processing (NLP) framework to capture multiple dimensions of stigma, including discriminatory language, emotional framings indicative of prejudices, stereotypes, and the thematic contexts in which ASD-related stigma appears. Overall, results indicate low levels of overt stigma and a gradual shift toward more positive portrayals, with only temporary disruptions during the pandemic. Some stereotypes remain, highlighting the need for ongoing attention to ASD representation in the media.
Why Are We Lonely? Leveraging LLMs to Measure and Understand Loneliness in Caregivers and Non-caregivers
Michelle Damin Kim | Ellie S. Paek | Yufen Lin | Emily Mroz | Jane Chung | Jinho D. Choi
This paper presents an LLM-driven approach for constructing diverse social media datasets to measure and compare loneliness in the caregiver and non-caregiver populations. We introduce an expert-developed loneliness evaluation framework and an expert-informed typology for categorizing causes of loneliness for analyzing social media text. Using a human-validated data processing pipeline, we apply GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit corpus and analyze loneliness across both populations. The loneliness evaluation framework achieved average accuracies of 76.09% and 79.78% for caregivers and non-caregivers, respectively. The cause categorization framework achieved micro-aggregate F1 scores of 0.825 and 0.80 for caregivers and non-caregivers, respectively. Across populations, we observe substantial differences in the distribution of types of causes of loneliness. Caregivers’ loneliness was predominantly linked to caregiving roles, identity recognition, and feelings of abandonment, indicating distinct loneliness experiences between the two groups. Demographic extraction further demonstrates the viability of Reddit for building a diverse caregiver loneliness dataset. Overall, this work establishes an LLM-based pipeline for creating high-quality social media datasets for studying loneliness and demonstrates its effectiveness in analyzing population-level differences in the manifestation of loneliness.
Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Craig Myles | Patrick Schrempf | David Harris-Birtill
Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the opportunity to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open-source language models. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error-detection accuracy over the baseline from 0.669 to 0.785 with GPT-5 and from 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art performance on the MEDEC benchmark dataset. Code available on GitHub: https://github.com/CraigMyles/clinical-note-error-detection
Linguistic Features Competitive with BERT! Leveraging Speech for Detection of Mental Health in Paediatric Lupus
Jida Jaffan | Barend Beekhuizen | Andrea Knight
Neuropsychiatric lupus (NPSLE) is characterized by inflammation in the brain with common symptoms of depression and anxiety. Early detection is crucial as it may change the treatment regimen; however, current approaches are costly and resource intensive. Therefore, we propose that leveraging current work using linguistics in NLP detection of mental health symptoms can be advantageous in early detection of NPSLE. This study is a proof of concept using interviews from 20 adolescents (N=20, 10-17 years) diagnosed with lupus. Our results suggest that linguistic feature-based models supported by Word2Vec embeddings offer more interpretable output than BERT models, while remaining competitive for depression detection and improving over BERT for anxiety detection. This work may transform early screening methods in paediatric contexts and can be adapted to other clinical populations.
A Multimodal Framework for Aphasia Severity Classification in Russian
Kolmogorova Anastasia | Ekaterina Yavshitz | Anastasia Margolina | Anna Sugian
Automatic classification of aphasia severity presents persistent challenges, particularly for languages with limited clinical speech resources such as Russian. This paper explores a multimodal approach to severity estimation that combines acoustic and semantic representations of pathological speech. Acoustic features are extracted using pretrained Wav2Vec 2.0 models, while semantic information is obtained from the encoder of the Whisper model. The two representations are integrated via early feature fusion and evaluated using gradient boosting classifiers in a speaker-independent cross-validation setting. Experiments are conducted on a newly collected dataset of Russian speech recordings from patients with aphasia and neurotypical speakers (RuAphasiaBank). The results suggest that the combined use of acoustic and semantic embeddings can provide more stable severity estimates than unimodal baselines. This study contributes empirical evidence on the applicability of multimodal representation learning for aphasia severity classification under data-scarce conditions.
Data Augmentation Based on Selective Masking of Language Models for One Health Context
Youssef Mahdoubi | Najlae Idrissi | Mathieu Roche | Sarah Valentin
This study focuses on improving the performance of language models for two critical applications within the One Health context, specifically in epidemiological monitoring using textual data: (i) thematic classification across syndromic surveillance, biomedical and plant health domains, and (ii) detection of epidemic misinformation. A key challenge in these tasks is the limited availability of labeled textual data, which constrains the effectiveness of supervised learning methods. To overcome this limitation, we introduce two families of selective masking–based data augmentation strategies: lexical and non-lexical. Each family is implemented in a standard variant (Aug-SM-Lex and Aug-SM-NonLex), and a TF-IDF-weighted variant (Aug-SM-Lex-TFIDF and Aug-SM-NonLex-TFIDF). We perform two complementary experiments: the first determines the optimal masking rate, while the second evaluates the proposed strategies against LLM-based text reformulation. Experimental results indicate that selective masking-based augmentation outperformed both LLM-based reformulation (Mistral-7B and GPT-Neo-1.3B) and baseline models trained on original data alone across three of the five evaluated datasets, with the best performance achieved at a masking rate of 20%. This suggests that selective masking is a promising approach, potentially more effective than computationally expensive LLM-based reformulation.
Towards Inclusive Communication in Cancer Prevention and Treatment: A Case Study on Italian Informational Materials
Chiara Cassani | Luca Brigada Villa | Marco Forlano | Serena Coschignano | Amelia Barcellini | Silvia Luraghi | Alberto Giovanni Leone | Chiara Zanchi | Adalberto Lovotti
This paper presents an annotation scheme developed to analyze linguistic accessibility and inclusivity in Italian cancer-related informational materials. The scheme combines metadata annotation, qualitative analysis of textual and visual features, and automatically extracted measures of linguistic complexity capturing structural, lexical, and probabilistic properties of the texts. A brief case study demonstrates how the proposed framework can be applied to compare documents and identify different sources of linguistic difficulty. The approach provides a replicable methodological basis for large-scale analyses of health communication materials.
Empathy as interactional accomplishment in clinical interactions with a conversational agent
Spencer Hazel | Adam Brandt | Yajie Vera He | Ernest Lim | Jared Joselowitz | Zachary Ellis
As healthcare services deploy AI to automate patient-facing communication, concerns persist about the interactional work through which empathy is made relevant. We examine empathy not as an internal state but as an interactional accomplishment, asking how patients display orientations to an LLM-powered voice assistant’s turns as (non-)empathic in real clinical telephone calls. Using Conversation Analysis (CA) to analyse post–cataract surgery follow-up calls conducted by AI-powered voice assistant Dora (Ufonia), we compare patient responses across earlier and later system versions. Earlier calls show minimal, delayed, prosodically closed responses to wellbeing enquiries, consistent with treating Dora as a transactional information-gathering device. Later calls more often feature socially rich formats, for example colloquial upgrades, gratitude tokens, occasional return enquiries, and increased turn-final rising intonation, suggesting patients hear Dora’s talk as socially implicative and thus opening space for affiliative/empathetic uptake. We discuss implications for CA-informed conversation design and for evaluating “empathy” via participant orientations in situ rather than post-hoc self-report.
Delayed Wh-Question Development in Children with Hearing Loss: Evidence for Morphosyntactic Vulnerability from Corpus-Based NLP and LLM Analyses
Tong Wu
This study provides corpus-based evidence that English-speaking children with hearing loss (CHL) show both quantitative and qualitative delays in wh-question development compared to typically developing (TD) peers. Using Natural Language Processing (NLP)/Large Language Model (LLM) based methods and two clinical subcorpora from CHILDES, we analyzed child utterances across several syntactic dimensions: frequency, lexical diversity, structural completeness, clausal embedding, wh-fronting, and utterance length. CHL produced significantly fewer wh-questions, used a narrower range of wh-types, showed lower rates of embedding, and more structural incompleteness. These differences were most evident in syntactically complex forms, such as embedded and canonical fronted wh-questions. The results support input-sensitive and usage-based accounts of syntactic development and highlight the need for enriched linguistic input in supporting CHL’s grammatical growth. Importantly, these group differences persisted when controlling for overall language development as indexed by mean length of utterance (MLU) in words, indicating that CHL’s difficulties with wh-questions are not reducible to general grammatical delay. Methodologically, the study combines dependency-parsing-based analyses with exploratory LLM evaluation to assess the feasibility and limits of automated approaches to spontaneous child language. NLP-based analyses were more stable for formally defined syntactic features, while GPT-based analysis showed mixed performance, performing better on global structural judgments than on fine-grained syntactic diagnostics.
StressRoBERTa: Cross-Condition Transfer Learning from Depression, Anxiety, and PTSD to Stress Detection
Amal Abdullah Alqahtani | Efsun Kayi | Mona T. Diab
The prevalence of chronic stress represents a major public health concern, yet automated detection of vulnerable individuals remains limited. Social media platforms like X (formerly Twitter) serve as important venues for people to share their experiences openly. This paper introduces StressRoBERTa, a cross-condition transfer learning approach for the automatic detection of self-reported chronic stress in English tweets. We investigate whether continual pretraining on clinically related conditions, such as depression, anxiety, and PTSD, which have a high comorbidity with chronic stress, improves stress detection compared to general language models. We continually pretrained RoBERTa on the Stress-SMHD corpus, a subset of Self-reported Mental Health Diagnoses focused on stress-related conditions, consisting of 108 million words from users with self-reported diagnoses of depression, anxiety, and PTSD. Then, we fine-tuned on the SMM4H 2022 Shared Task 8. StressRoBERTa achieves 82% F1, which outperforms the best shared task system (79% F1) by 3 percentage points. Our results demonstrate that focused cross-condition transfer learning from stress-related disorders provides stronger representations than general mental health training. To validate cross-condition generalization, we also fine-tuned the model on the Dreaddit dataset. Our result of 81% F1 further demonstrates the transfer from clinical mental health contexts to situational stress discussions.
DementiaBank-Emotion: A Multi-Rater Emotion Annotation Corpus for Alzheimer’s Disease Speech (Version 1.0)
Cheonkam Jeong | Jessica Liao | Audrey Lu | Yutong Song | Christopher Rashidian | Donna Krogh | Erik Krogh | Mahkameh Rasouli | Jung-Ah Lee | Nikil Dutt | Lisa M Gibbs | David Sultzer | Julie Rousseau | Jocelyn Ludlow | Margaret Galvez | Alexander Nuth | Chet Khay | Sabine Brunswicker | Adeline Nyamathi
We present DementiaBank-Emotion, the first multi-rater emotion annotation corpus for Alzheimer’s disease (AD) speech. Annotating 1,492 utterances from 108 speakers for Ekman’s six basic emotions and neutral, we find that AD patients express significantly more non-neutral emotions (16.9%) than healthy controls (5.7%; p < .001). Exploratory acoustic analysis suggests a possible dissociation: control speakers showed substantial F0 modulation for sadness (Delta = -3.45 semitones from baseline), whereas AD speakers showed minimal change (Delta = +0.11 semitones; interaction p = .023), though this finding is based on limited samples (sadness: n=5 control, n=15 AD) and requires replication. Within AD speech, loudness differentiates emotion categories, indicating partially preserved emotion-prosody mappings. We release the corpus, annotation guidelines, and calibration workshop materials to support research on emotion recognition in clinical populations.
Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026
Diego Alves | Yuri Bizzoni | Stefania Degaetano-Ortlieb | Anna Kazantseva | Janis Pagel | Stan Szpakowicz
From Corpus to Concept Scheme: Developing a SKOS Vocabulary for Armenian Epigraphic Heritage
Hamest Tamrazyan | Kamal Nour | Emanuela Boros
Armenian epigraphy, one of the world’s oldest and most diverse inscriptional traditions, remains largely absent from digital research infrastructures due to a lack of basic linguistic and conceptual resources. No machine-readable corpus, standardized terminology, or controlled vocabulary exists for describing Armenian inscription types, preventing indexing and interoperability. This paper addresses this gap by constructing the first dataset of Armenian inscription-type terminology and by developing a computational pipeline for analyzing it at scale. We digitize and preprocess a broad corpus of authoritative printed publications; curate a culturally grounded terminology list; and train transformer-based NER models to identify both attested inscription types and potential terminological variants across unseen texts. The resulting resources form the first empirical foundation for modelling Armenian epigraphic concepts needed for further developing a SKOS vocabulary aligned with, yet culturally distinct from, existing international epigraphic ontologies.
Armenian AutoEpiDoc: Automated Extraction and Encoding of Armenian Inscriptions into EpiDoc TEI/XML
Hamest Tamrazyan | Emile Cornamusaz | Emanuela Boros
Armenian epigraphy is extensively documented in printed scholarly corpora, yet lacks machine-readable editions that support interoperability or computational analysis. In this paper, we present Armenian AutoEpiDoc, a system that automatically converts expert-verified Armenian inscription records into EpiDoc-compliant TEI/XML files. Operating on curated and domain-validated data, AutoEpiDoc maps Armenian-specific metadata to EpiDoc structures through rule-based templates and schema-aware validation. The workflow significantly reduces manual encoding effort and provides a scalable path toward producing digital editions and integrating Armenian inscriptions into international epigraphic infrastructures.
Studying Expert-ese: Profiling and Classification of Domain-Specific Language Variation in Architecture with Traditional Machine Learning and LLMs
Carmen Schacht | Renate Delucchi Danhier
This study investigates how domain expertise shapes spontaneous oral language production, with a focus on architecture. Building on the ExpLay Corpus, which contains image descriptions by speakers with and without architectural training, we analyze linguistic variation by combining Profiling-UD and the DECAF framework. We extract a broad range of syntactic and morpho-syntactic features to build linguistic profiles for both groups and train classifiers to distinguish expert from non-expert productions. Two traditional machine learning models (logistic regression and SVM) are compared with a lightweight BiLSTM and two large language models (GliClass and LLaMA 2). While the expert and non-expert corpora diverge only subtly (pairwise Jensen–Shannon divergence, JSD = 0.25), the BiLSTM using fastText embeddings achieves the highest F1-score (0.88), outperforming both traditional models and LLMs. This indicates that semantic representations are more predictive of domain variation than purely structural features and that smaller neural architectures generalize better on limited data. Overall, the findings provide empirical evidence that architectural expertise leaves measurable linguistic traces in spontaneous speech, supporting the Grammar of Space hypothesis.
We introduce CroCoSyn, a controlled, cross-lingual and cross-model corpus of 25,920 LLM-generated film synopses in English and French. Each synopsis is generated under systematically varied conditions, including model type, temperature, genre, protagonist gender, and narrative constraints, and enriched with structured metadata capturing characters and their relationships. Comparing Mistral and Llama across different temperature settings, CroCoSyn enables fine-grained analysis of narrative content, style, and character representation across models and languages. The corpus supports research on gender and cultural biases and on story generation evaluation, and provides a foundation for comparative studies between LLM-generated and human-written narratives.
Identity Without Action: Rethinking Collective Action Models in Disinformation Research
Lorella Viola
Despite the rapid growth of disinformation research, the fundamental reasons behind user engagement with such content remain poorly understood. Recently, several scholars have suggested that researchers should study engagement with disinformation as a form of collective action (CA). Drawing on Social Identity Theory (SIT) and the Social Identity Model of Collective Action (SIMCA), this study empirically verifies this assumption by testing it across two distinct linguistic communities, English and Spanish. Specifically, it investigates whether mobilizing CA language functions as a uniform predictor of engagement, or whether engagement is primarily driven by community-specific identity dynamics. The experiment analysed a bilingual corpus of 4,035 X (formerly Twitter) posts associated with conspiracy theory and disinformation-related hashtags (e.g., #Agenda2030, #TheGreatReset). Using a mixed-methods approach combining BERTopic for narrative discovery, non-parametric statistical testing, and a Random Forest Regressor, we disentangled the effects of language presence from community behaviour. The results reveal that the Spanish community exhibits a higher baseline engagement than the English community, indicating that engagement is primarily driven by macro-level community norms (i.e., identity) rather than micro-level linguistic triggers. We argue that rather than treating mobilizing language as a uniform predictor of engagement, future applications of SIMCA in disinformation research should account for these identity-based baseline differences.
Weakly Supervised Named Entity Recognition for Historical Texts
Marco Sorbi | Laurent Moccozet | Stephane Marchand-Maillet
Named Entity Recognition has emerged as a critical task in natural language processing, particularly for extracting meaningful information from unstructured text. Although traditional approaches rely heavily on large annotated datasets, recent advances have explored weak supervision techniques to address the limitations of resource-intensive annotation processes. Historical texts pose unique challenges for this task because of their linguistic peculiarities; supervised approaches exist for this domain, but they involve lengthy manual annotation of the documents of interest by domain experts. To address this issue, this paper explores how recent weakly supervised NER techniques can be adapted to historical texts, analyzing their suitability for this domain. The experiments show that domain-specific architectures can be effectively trained on low-resource corpora with weak supervision over a small set of entity labels. Using only 10% of the annotations, the performance of these architectures remains above 80% of the supervised quality in terms of F1-score.
Invisible Speakers? Gender Disparity in German AI Discourse and Its Reflection in Language Models
Milena Belosevic
This paper investigates how language models (LMs) reproduce the existing gender disparity found in German media discourse about artificial intelligence (AI). Building on a human-annotated corpus of quotations from German media discourse on AI, we first quantify the frequency with which male and female speakers are directly cited across domains and speaker roles. We then train LLäMmlein (Pfister et al., 2025), a state-of-the-art German-only language model, GBERT, and a logistic regression model using only the quoted text as input and without providing any gender cues to classify the quotation as originating from a male or female speaker. By comparing model predictions with corpus-based gold labels, we find that male voices dominate both the corpus and the model predictions. Balancing the data mitigates but does not fully eliminate this disparity, indicating that the strong male-default tendency of transformer models cannot be explained by corpus skew alone, but also by their priors from pretraining. The study contributes to the interpretability of language models’ output for DH-related tasks, adaptation of NLP tools to domain-specific humanities corpora, and knowledge modelling in the humanities.
GlobLingDiv: A global dataset linking linguistic diversity and digital support to reveal landscapes with under-resourced languages for NLP
Katharina Zeh | Hannes Essfors | Juliane Benson | Lale Tüver | Andreas Baumann | Hannes A. Fellner
Linguistic diversity is increasingly under pressure globally and is becoming ever more relevant in digital contexts, where many languages remain structurally under-resourced, limiting access to language technologies and inhibiting equitable NLP development. To support linguistic diversity, publicly available data are needed that capture both the number of languages spoken and the distribution of speakers across them. We introduce GlobLingDiv, a database that uses country-level speaker distributions to derive language richness and entropy-based diversity measures, alongside a population-weighted digital language support measure. Applying these metrics globally, we examine the association between linguistic diversity and digital support conditions. The results reveal a substantial imbalance: highly diverse linguistic landscapes show comparatively low digital support, underscoring the need for more inclusive NLP environments.
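The richness and entropy measures described in this abstract can be sketched in a few lines; the speaker counts below are toy values, not actual GlobLingDiv figures, and the dataset's exact population weighting is not reproduced here:

```python
import math

def diversity_metrics(speaker_counts):
    """Language richness and Shannon entropy (in bits) for one
    country's language-speaker distribution."""
    total = sum(speaker_counts.values())
    probs = [n / total for n in speaker_counts.values() if n > 0]
    richness = len(probs)                            # number of languages
    entropy = -sum(p * math.log2(p) for p in probs)  # bits
    return richness, entropy

# Illustrative toy distribution (hypothetical language labels and counts):
counts = {"lang_a": 5_000_000, "lang_b": 3_000_000, "lang_c": 2_000_000}
richness, entropy = diversity_metrics(counts)
```

Entropy is maximal when speakers are spread evenly across languages, which is why it complements raw richness as a diversity measure.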
LLMs Got Rhyme? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation
Stergios Chatzikyriakidis | Anastasia Natsina
Large Language Models (LLMs), despite exhibiting strong capabilities on many NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid neural-symbolic system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification and generation. We implement a comprehensive taxonomy of Greek rhyme types and employ an agentic generation pipeline with phonological verification. We use multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs, including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0, and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant reasoning gap: while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails significantly (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. Along with the system presented, we further release a corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
Style as Signature: Profile-Based Authorship Verification of Mihai Eminescu’s Journalistic Corpus
Ioana-Roxana Boriceanu | Liviu Dinu
Authorship verification aims to assess whether a questioned text is stylistically compatible with an author’s known writings, a task that is particularly challenging in historical corpora with partial ground truth. We address this problem in the context of Mihai Eminescu’s journalistic corpus, a historically grounded collection comprising published articles, manuscripts, and texts of uncertain authorship. Using a profile-based framework with character n-grams and function words, we examine how stylistic compatibility behaves across different profile construction settings and temporal splits. The results show that character trigram profiles consistently accept verified texts while producing a small and stable set of rejections among disputed items, whereas function word profiles show near complete acceptance across the corpus. A qualitative analysis shows that rejected texts exhibit meaningful differences in discourse structure and communicative purpose. These findings illustrate how authorship verification can support literary scholarship through stable signals for close reading.
Measuring Social Integration Through Participation: Categorizing Organizations and Leisure Activities in the Displaced Karelians Interview Archive using LLMs
Joonatan Laato | Veera Schroderus | Jenna Kanerva | Jenni Kauppi | Virpi Lummaa | Filip Ginter
We study how to better use digitized historical archives to answer sociological and historical questions that require more context than raw text mentions provide. Using Finnish World War II Karelian evacuee family interviews, we build on prior extraction of 350K mentions of leisure activities and organizational memberships (71K unique names) that are too diverse and unstructured to analyze directly. We introduce a categorization framework capturing key dimensions of participation: type of activity/organization, typical sociality, regularity, and the level of physical demand. After creating a gold-standard annotated set, we evaluate whether large language models can apply the schema at scale and find that an open-weight LLM, combined with simple multi-run voting, closely matches expert judgments. We then label all 350K entities to produce a structured resource for downstream analyses of social integration and related outcomes.
Catalogues as Data: Interpretable NLP Pipelines for Ottoman-Turkish Bibliographies
Mark Hill | Ayse Bulus | Paul Spence
Bibliographies are both humanities infrastructure and historical record. Computationally analysing them, however, requires implementing complex digitisation and standardisation decisions. This paper takes Seyfettin Özege’s Eski Harflerle Basılmış Türkçe Eserler Kataloğu as an example: a scanned set of volumes marked by complex page layouts, degraded typography, irregular entry structures, and historically contingent inconsistencies. We present a pipeline that constructs a structured, machine-readable, and analysable dataset from its 27,000 entries using computer vision, OCR, large and visual language models, sequence-based validation, and custom review tools. This process captures 97.8% of records, with the remaining cases addressable by targeted review. It demonstrates that combining LLMs with interpretable, review-centric pipelines offers an appropriate approach for historically complex bibliographic sources.
Large language models (LLMs) are post-trained on human feedback collected from annotator communities, yet the linguistic influence of these annotator communities on language models remains poorly understood. We investigated the stylistic transfer from Nigerian annotators to the LLaMA family of models through a natural experiment with LLaMA 2 and LLaMA 3.1, as their release dates are separated by the shutdown of a major data annotation service provider in Nigeria. We generated corpora from both model families and measured linguistic style by computing the difference-in-difference of the Jensen-Shannon distance on the bigram distribution between model outputs and corpora of Nigerian English and US English. We found that, although both pre-trained model variants exhibit similar proximity to both English variants, the LLaMA 2 post-trained model moved toward Nigerian English, while the LLaMA 3.1 post-trained model moved away from Nigerian English. Qualitatively, we found that post-trained LLaMA 2 models used significantly fewer contractions, in line with Nigerian English speakers opting to use a formal register due to its role as an index of knowledgeability. Our findings suggest that annotator communities can imprint linguistic style on large language models, with potential implications such as a disproportionately higher false positive rate in AI plagiarism detection for users who share a linguistic style with annotator communities.
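The measurement underlying this study, the Jensen-Shannon distance between bigram distributions, can be sketched as follows; the whitespace tokenization and base-2 formulation here are illustrative assumptions, not the authors' exact setup:

```python
import math
from collections import Counter

def bigram_dist(text):
    """Relative-frequency distribution over word bigrams."""
    toks = text.lower().split()
    bigrams = Counter(zip(toks, toks[1:]))
    total = sum(bigrams.values())
    return {bg: c / total for bg, c in bigrams.items()}

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS
    divergence between two distributions (ranges from 0 to 1)."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)
    return math.sqrt(0.5 * kl(p) + 0.5 * kl(q))

# Toy comparison between two near-identical texts:
p = bigram_dist("the cat sat on the mat")
q = bigram_dist("the cat sat on the rug")
```

The paper's difference-in-difference then compares how this distance to a Nigerian English corpus versus a US English corpus shifts between pre-trained and post-trained model outputs.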
Modeling Changing Scientific Concepts with Complex Networks: A Case Study on the Chemical Revolution
Sofia Aguilar Valdez | Stefania Degaetano-Ortlieb
While context embeddings produced by LLMs can be used to estimate conceptual change, these representations are often not interpretable nor time-aware. Moreover, bias augmentation in historical data poses a non-trivial risk to researchers in the Digital Humanities. Hence, to model reliable concept trajectories in evolving scholarship, in this work we develop a framework that represents prototypical concepts through complex networks based on topics. Utilizing the Royal Society Corpus, we analyzed two competing theories from the Chemical Revolution (phlogiston vs. oxygen) as a case study to show that onomasiological change is linked to higher entropy and topological density, indicating increased diversity of ideas and connectivity effort.
Speaking on Their Behalf: Detecting Indirect Speech in Historical Danish and Norwegian Texts
Ali Al-Laith | Alexander Conroy | Kirstine Degn | Jens Bjerring-Hansen | Daniel Hershcovich
Indirect speech is a fundamental yet understudied form of reported speech that plays a crucial role in literary texts and communication. While direct speech detection has received significant attention in computational linguistics, the automatic identification of indirect speech remains a challenge due to its nuanced linguistic structure and contextual dependencies. This paper focuses on the detection of indirect speech in late 19th-century Scandinavian literature, where its presence has been linked to shifting aesthetic ideals. We present an annotated dataset of 150 segments, each randomly selected from 150 different novels, designed to capture indirect speech in Danish and Norwegian literature. We evaluate four pre-trained language models for classifying indirect speech, with results showing that a Danish Foundation Model (DFM Large), trained on extensive Danish data, has the highest performance. Finally, we conduct a classifier-assisted quantitative corpus analysis and find that the prevalence of indirect speech exhibits fluctuations over time.
Harder than Finding the Lost Sheep? Towards Automatically Suggesting Deliberate Metaphor Annotations in German Sermons
Ronja Laarmann-Quante | Stefanie Dipper
Automatic metaphor detection so far has largely focused on English data annotated for all kinds of metaphors including ubiquitous conventionalized ones. In this paper, we focus on deliberate metaphors in German sermons, i.e., metaphors that are used with a specific communicative goal. This task is harder because there is less training data available, and deliberate metaphors are very rare. Our goal is to support human annotators with automatically generated suggestions, so we strive above all for high recall. Using multilingual transfer learning based on various metaphor datasets and different transformer models, the highest recall we achieve is .70 (precision .10). Our results suggest that larger context windows beyond the sentence level are not helpful and that adding in-domain data even when annotated with different guidelines and in a different language is beneficial.
Semantic Factor Analysis: Validating Personality Structure Recovery from empirically-mediated Word Embeddings
Oliver Müller
The present study introduces Semantic Factor Analysis (SFA), a novel computational approach recovering Big Five personality trait structures from pre-trained adjective word embeddings weighted by empirical participant data. Using Word2Vec embeddings trained on the Google-News-300 corpus, semantic relationships of IPIP-50 Big Five inventory adjectives (Goldberg, 1992) were extracted and factor structures computed through weighted vector averaging and K-means clustering. To validate the methodology, SFA was compared against a baseline using unweighted Word2Vec embeddings. In a controlled experiment with n=55 participants completing standard IPIP-50 assessments, HSP-R scale (Pluess et al., 2024) and multimedia impact surveys, empirically-weighted SFA successfully recovered all five personality dimensions with 62.5% average factor purity, substantially outperforming the unweighted baseline (52.0%, 10% relative improvement), while traditional Confirmatory Factor Analysis showed factor collapse and poor model fit. The approach was validated through Latent Class Analysis deriving empirically-based classification thresholds for Big Five dimensions and supporting a trichotomous Environmental Sensitivity model (Lionetti et al., 2018). Results demonstrate that integrating semantic representations with empirical data improves Big Five structure recovery beyond pure semantic similarity alone, particularly for small sample studies where traditional methods such as CFA will fail due to limited empirical data points.
While machine translation systems have been applied to many tasks with remarkable success, machine poetry translation has remained a challenge. This study investigates the capabilities of generative Large Language Models (LLMs) in the translation of poetry (taking Shakespeare’s 154 sonnets as an example) from English to German. For this purpose, I define metrics that assess the reproduction of the rhyme scheme and the metre of the original in a quantitative way. The results indicate that LLMs still lag behind professional human translators (especially with regard to the reproduction of the rhyme scheme), but that their performance is significantly influenced by the chosen prompt strategy. In particular, iteratively refining the result emerges as a successful strategy in terms of the reproduction of the form, but this comes at the expense of other aspects such as grammaticality and the reproduction of the meaning.
WikiLingDiv: a dataset for quantifying digital linguistic diversity using Wikipedia page views
Hannes Essfors | Andreas Baumann
With the conflation of digital and non-digital spaces, and NLP technologies being integrated into an increasing number of aspects of daily life, linguistic diversity cannot be fully understood without considering how language is used online. While existing models of linguistic diversity typically have relied on speaker numbers or language production, the dimension of diversity in language consumption remains comparatively understudied. To facilitate such research, we introduce WikiLingDiv, an openly accessible dataset for quantifying linguistic diversity in online knowledge retrieval using Wikipedia page views. Our dataset is based on yearly page views of 340 language editions of Wikipedia, aggregated across 239 countries and territories over 10 years (2015-2024). Using the dataset, we illustrate spatial and temporal patterns of digital linguistic diversity, suggesting that diversity has both increased and decreased across countries and regions, while highlighting country-specific dynamics in language usage. We release the dataset as an openly available and easily integrable data resource for researchers in computational linguistics, digital humanities, and the broader social sciences, enabling further work on linguistic variation, digital inequality, and the interaction between language use and digital technology.
Modeling Linguistic Imprints of War Propaganda in a Russian Wikipedia Fork: A Comparative Analysis with the Original Wikipedia
Anastasiia Vestel | Stefania Degaetano-Ortlieb
Although Wikipedia aspires to provide neutral information, alternative versions can be used for political manipulation. This paper analyzes how narratives about the Russo-Ukrainian War are linguistically reframed in a Russian Wikipedia Fork compared to the original Russian Wikipedia. Using Kullback-Leibler Divergence on a corpus of war-related edits in more than 13,000 articles, we identify key differences between the two versions. While the original Wikipedia features Ukrainian references and administrative details, direct war terminology, and Ukraine’s territorial designation, governance, and statehood, RWFork replaces or removes these elements, emphasizing reassignment of Ukrainian territories to Russia, favoring euphemistic war language, renaming locations, and recognizing Russia-backed DPR and LPR. These patterns closely align RWFork with demobilizational strategies observed in pro-Kremlin media.
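A minimal sketch of using Kullback-Leibler divergence to surface words characteristic of one corpus relative to another; the additive smoothing and the toy word counts are illustrative assumptions, not the paper's configuration:

```python
import math
from collections import Counter

def kld_ranking(corpus_a, corpus_b, smoothing=0.01):
    """Rank vocabulary by per-word contribution to KL(A || B),
    with additive smoothing so unseen words get nonzero probability.
    High-ranked words are characteristic of corpus A."""
    ca, cb = Counter(corpus_a), Counter(corpus_b)
    vocab = set(ca) | set(cb)
    na = sum(ca.values()) + smoothing * len(vocab)
    nb = sum(cb.values()) + smoothing * len(vocab)
    contrib = {w: ((ca[w] + smoothing) / na)
                  * math.log2(((ca[w] + smoothing) / na)
                              / ((cb[w] + smoothing) / nb))
               for w in vocab}
    return sorted(vocab, key=contrib.get, reverse=True)

# Toy corpora with one distinctive word each:
ranked = kld_ranking("alpha alpha shared shared".split(),
                     "beta beta shared shared".split())
```

Applied to matched article versions, this kind of ranking highlights the terminology one edition uses that the other avoids.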
Stylometric Approach to AI-generated Texts. An Analysis of Contemporary French-Language Literature
Adam Pawłowski | Tomasz Walkowiak
The article presents a stylometric analysis of authentic literary texts and thematically related texts generated by large language models. The texts under study represent a fairly broad cross-section of twentieth-century French literature. Five models were used to generate the texts (ChatGPT 4-o, GPT 4-o mini, DeepSeek v.3, c4ai-command-r-plus, and c4ai-command-a). The original human-written stories of approximately 20,000 characters were summarized, and new narratives were then generated on the basis of these abstracts; in terms of plot and style, they were intended to resemble the originals. The research, carried out using TF-IDF over the most frequent words, showed that texts generated by specific LLMs and texts written by humans cluster relatively well as distinct groups. The experiments also showed that the "authorial" specificity of machine-generated texts partly matches the original clustering of human-written source texts.
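The TF-IDF representation behind such clustering can be illustrated with a toy sketch; real stylometric work would use the most frequent words of a large corpus and a proper clustering algorithm, so the tiny documents and raw cosine comparison below are illustrative assumptions only:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF over whitespace tokens (toy version of the feature
    space used for stylometric clustering)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(w for toks in tokenized for w in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({w: (c / len(toks)) * math.log((1 + n) / (1 + df[w]))
                     for w, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two stylistically similar toy documents and one dissimilar one:
docs = ["le chat noir dort", "le chat noir mange", "bonjour tout le monde"]
vecs = tfidf_vectors(docs)
```

Pairwise cosine similarities over such vectors are what a clustering step would group into "authorial" clusters.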
Degree Zero of Translation: Using Interlinear Baselines to Quantify Translator Intervention
Maciej Rapacz | Aleksander Smywiński-Pohl
Literary translation is rarely a neutral act of linguistic transfer, but rather a continuous series of conscious interventions - restructuring, semantic shifts, and stylistic adaptations. While Translation Studies analyzes these shifts qualitatively, current computational methods focus primarily on quality evaluation (e.g., BLEU, COMET) or authorship attribution (e.g., stylometry), lacking a scalable metric to quantify the extent and character of the translator’s intervention. We propose a novel method to measure the translator’s signal by using Interlinear Translation - a strict word-for-word gloss - as a computational baseline representing translational "Degree Zero," i.e., a neutral form of the source text devoid of any stylistic adaptation. We define the Intervention Vector as the semantic difference between a literary translation and its interlinear counterpart in a high-dimensional vector space. We validate this approach on a multilingual corpus of Greek New Testament translations comprising 5 interlinear baselines and 74 literary translations across 5 languages: English (16), French (14), Italian (12), Polish (16), and Spanish (16). Our results demonstrate that the magnitude of the Intervention Vector effectively ranks texts along a spectrum from literal to paraphrase, aligning with established theoretical categories. We find that this magnitude consistently distinguishes between translation strategies, yielding significantly longer vectors for dynamic and paraphrase strategies compared to literal and formal ones. This framework provides a quantitative method for analyzing translator agency without the need for a comprehensive corpus of reference translations.
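The Intervention Vector idea reduces to simple vector arithmetic once embeddings are available; the three-dimensional vectors below are stand-ins for real sentence embeddings, and the strategy labels are hypothetical:

```python
import math

def intervention_magnitude(literary_vec, interlinear_vec):
    """Euclidean norm of the difference between a literary
    translation's embedding and its interlinear ("Degree Zero")
    baseline embedding."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(literary_vec, interlinear_vec)))

# Toy embeddings: a formal translation stays close to the interlinear
# baseline, while a paraphrase drifts further away.
baseline   = [0.10, 0.40, 0.30]
formal     = [0.12, 0.38, 0.31]
paraphrase = [0.50, 0.10, 0.05]
```

Ranking translations by this magnitude is what places them on the literal-to-paraphrase spectrum described above.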
How to Efficiently Explore Noisy Historical Data? Leveraging Corpus Pre-Targeting to Enhance Graph-based RAG
Donghan Bian | Marie Puren | Florian Cafiero
Graph-based Retrieval-Augmented Generation (RAG) is increasingly used to explore long, heterogeneous, and weakly structured corpora, including historical archives. However, in such settings, naive full-corpus indexing is often computationally costly and sensitive to OCR noise, document redundancy, and topical dispersion. In this paper, we investigate corpus pre-targeting strategies as an intermediate layer to improve the efficiency and effectiveness of graph-based RAG for historical research. We evaluate a set of pre-targeting heuristics tailored to single-hop and multi-hop historical questions on HistoriQA-ThirdRepublic, a French question-answering dataset derived from parliamentary debates and contemporary newspapers. Our results show that appropriate pre-targeting strategies can improve retrieval recall by 3–5% while reducing token consumption by 32–37% compared to full-corpus indexing, without degrading coverage of relevant documents. Beyond performance gains, this work highlights the importance of corpus-level optimization for applying RAG to large-scale historical collections, and provides practical insights for adapting graph-based RAG pipelines to the specific constraints of digitized archives.
Detecting reported speech as a token classification task: an application to Classical Latin?
Agustin Dei
This paper presents the first application of an automatic token-classification approach for detecting reported speech spans in Classical Latin using transformer-based neural architectures. Focusing on Seneca the Elder’s Declamatory Anthology, the study addresses the text’s highly polyphonic nature, resulting from the use of reported speech. Instead of relying exclusively on sentence-level syntactic information, the proposed approach treats reported speech detection as a token-level sequence labeling problem. This enables the identification of reported speech spans extending across multiple sentences. We fine-tune three Latin neural language models —LatinBERT, LaBERTa, and PhilBERTa— for binary token-level classification and conduct experiments both with and without punctuation. The results show that RoBERTa-based models effectively identify reported speech, with LaBERTa achieving the best performance (F1 scores above 0.90).
Narrative in Short German Prose: A Multi-Phenomenon Dataset for Computational Literary Analysis
Hans Ole Hatzel | Haimo Stiemer | Evelyn Gius | Chris Biemann
We present the novel dataset GermAnProse, an annotated corpus consisting of four German short prose texts accompanied by an extensive set of narrative-focused annotations. As part of this dataset, we contribute an annotation scheme for mentions, speech, and character agency: Characters in Action (ChiA). GermAnProse also contains information on narrative phenomena: narrativity, semantic verb classes, and plot keyness. Moreover, we include reader reception data in the form of timing information for audiobook performances, indicating pauses between sentences and the time taken to read a specific sentence in a performance. We release the dataset, which contains more than 18,000 manually created standoff annotations in JSON format, enabling researchers to utilize this resource for further exploratory applications.
Sense-Based Annotation of Geographical Nouns in Ancient Greek and Latin: A Diachronic Study with LLMs
Andrea Farina | Michele Ciletti | Barbara Mcgillivray | Andrea Ballatore
This paper investigates the lexicalisation of geographical nouns in Latin and Ancient Greek using a diachronic, multi-genre corpus (8th cent. BCE – 2nd cent. CE) and Large Language Models for Word Sense Disambiguation. We focus on two main aspects: the onomasiological question of which words encode core geographical concepts, and the semasiological distribution of senses across lemmas. Across both languages, city-related concepts are the most frequently expressed, but Greek shows a stronger focus on maritime terms, whereas Latin favours concepts related to land. Semasiologically, Latin shows clearer evidence of semantic change over time (e.g., ’citizenship’ → ’city’, aequor ’flat surface’ → ’sea’), while Greek displays more gradual or distributed shifts. These results show that computational annotation enables cross-linguistic and diachronic analysis of spatial semantics, allowing us to compare the frequency of concepts across languages, genres, and periods, and to track when semantic change occurs and how core concepts evolve over time.
Evaluating Humanities Theory Alignment in Large Language Models: Incremental Prompting and Statistical Assessment
Axel Pichler | Janis Pagel
We propose a method to evaluate the extent to which an LLM’s observable input–output behavior aligns with established theories in the humanities and cultural studies. We instantiate the framework on three humanities theories—Davidson’s truth-conditional semantics, Lewis’s truth in fiction, and Iser’s concept of textual gaps—using a top-down, theory-driven black-box framework. Core assumptions of these theories are reconstructed into testable behavioral rules and assessed via controlled classification tasks with systematic prompt comparisons and significance testing. Our experiments show that theory-uninformed classification prompts generally outperform theory-enriched prompts in Lewis and Iser settings, while theory-informed prompts help in the Davidson task. Gemini Flash consistently achieves the highest scores across tasks and corpora, while the Iser gap detection task remains substantially harder than binary truth-conditional judgments. Statistical tests confirm robust prompt effects and the failure of basic prompts. However, model behavior under incremental theory exposure is unstable and architecture-dependent.
Too Long, Didn’t Model: Decomposing LLM Long Context Understanding With Novels
Sil Hamilton | Rebecca Hicke | Mia Ferrante | Matthew Wilkens | David Mimno
Although the context length of large language models (LLMs) has increased to millions of tokens, evaluating their effectiveness beyond needle-in-a-haystack approaches has proven difficult. We argue that novels provide a case study of subtle, complicated structure and long-range semantic dependencies often over 128k tokens in length. Existing novel-based long-context benchmarks are limited in scale due to the cost of manually annotating long texts. Inspired by work on computational novel analysis, we release the Too Long, Didn’t Model (TLDM) benchmark, which tests a model’s ability to reliably report plot summary, storyworld configuration, and elapsed narrative time. We find that none of seven tested frontier LLMs retain stable understanding beyond 64k tokens. Our results suggest language model developers must look beyond “lost in the middle” benchmarks when evaluating model performance in complex long context scenarios. To aid in further development we release the TLDM benchmark together with reference code and data.
We present an AI assistant designed to help researchers interact with language corpora using natural language instead of formal query languages. Built as a custom GPT with access to multilingual corpora via the Czech National Corpus platform API, the system translates research questions into CQL queries, retrieves corpus data, and guides users through linguistic analysis. After more than a year of deployment, the system has processed over 1000 interactions with human users. We discuss the hybrid approach combining rule-based translation with LLM intelligence, challenges of building on a constantly evolving platform, and lessons learned from production usage. Notably, this system represents the first voice-enabled corpus interface in history, significantly lowering barriers to corpus-based research for non-technical users and users outside linguistic fields.
Generative Information Extraction from Biographical Sources
Robin Winkle | Manfred Stede | Jörn Kreutel
Biographical sources, such as literature encyclopedias, encode knowledge about historical figures in textual form. In this paper, we address the task of consolidating structured biographical information about authors from the former German Democratic Republic into a unified database. To this end, we present a generalizable Information Extraction (IE) system based on LLM prompting. Specifically, we compare two midsized open-source models, Qwen-2.5-32B and Llama-3-70B-Instruct, investigate a range of Prompt Engineering (PE) strategies, and propose a semantic similarity-based evaluation metric for open-ended IE. Our experiments on an unpublished annotated subset of biographical texts deliver moderate precision and variable recall, highlighting both the potential and current limitations of generative IE in the Digital Humanities.
WikiFirst: A Genre-Fixed, Content-controlled Corpus for Evaluating Content Effects in Authorship Analysis
Dung Nguyen | G. Çağatay Sat | Evgeny Pyshkin | John Blake
This paper presents the design and construction of WikiFirst, a corpus for investigating the impact of content variation on authorship similarity under a fixed genre. Prior work has investigated individual authorial style and the impact of genre. However, the role of content has remained underexplored due to the lack of suitable data. We address this gap by constructing a Wikipedia-based corpus consisting exclusively of first revisions authored by non-anonymous editors, thereby ensuring high authorship certainty while maintaining a stable encyclopaedic genre.
Measuring the Symbolic Power of Languages with LLM-based Multilingual Persuasion Simulation
Yin Jou Huang | Fei Cheng
Prior studies on the symbolic power of languages have largely relied on surveys or localized experiments, limiting systematic comparison across cultures and domains. In this work, we propose an LLM-based multilingual persuasion simulation framework to quantify the symbolic power of languages through persuasion outcomes. We also introduce a Symbolic Power Index (SPI) that measures how language choice affects persuasion success and efficiency across domains. Experiments show that the LLM-based simulations largely reproduce established sociolinguistic prestige hierarchies tied to institutional authority and global power, especially in domains such as business, finance, education, and technology. These results suggest that LLM-based persuasion simulations offer a scalable, decision-making-driven approach to studying symbolic power in language.
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)
Nina Tahmasebi | Pierluigi Cassotti | Syrielle Montariol | Andrey Kutuzov | Netta Huebscher | Elena Spaziani | Naomi Baes
The SlangTrack Dataset: Supporting the Detection of Words Used in Slang Senses
Afnan Mohammed Aloraini | Riza Batista-Navarro | Goran Nenadic | Viktor Schlegel
Slang is widespread in informal communication, yet its fluidity poses challenges for natural language processing (NLP), especially when words alternate between slang and non-slang senses. While prior work has examined slang through dictionaries, sentiment analysis, and lexicon building, little attention has been given to detecting slang usage in context. We address this gap by reframing slang detection as distinguishing slang from non-slang senses of the same lexical item. To support this task, we introduce SlangTrack (ST), a diachronically structured dataset of dual-meaning words annotated at the sentence level with high inter-annotator agreement. We benchmark (1) deep learning models with static and contextual embeddings, (2) transformer-based models, and (3) large language models evaluated in zero-shot, few-shot, and fine-tuned settings. Fine-tuned transformers, especially BERT-large enriched with sentiment and emotion features, achieve the strongest performance, reaching an F1-score of 72% for slang and 92% for non-slang usage. Our findings highlight both the difficulty of contextual slang detection and the value of affective cues for improving model robustness.
Statistical Semantic Change Detection via Usage Similarities
Taichi Aida | Daichi Mochihashi | Hiroya Takamura | Toshinobu Ogiso | Mamoru Komachi
Semantic change detection comprises two subtasks: classification, which predicts whether a target word has undergone a semantic shift, and ranking, which orders words according to the degree of their semantic change. While most prior studies concentrated on the ranking subtask, the classification subtask plays an equally important role, since many practical scenarios require a yes/no decision on semantic change rather than a global ranking. In this work, we propose a novel statistical method that predicts the presence or absence of semantic change. While most existing approaches infer semantic change by comparing word embeddings across time periods or domains, our method directly models the diachronic/synchronic consistency of usage-level similarity scores. Our experiments on SemEval-2020 Task 1 and WUGS datasets demonstrate that the proposed formulation outperforms existing state-of-the-art embedding-based methods, and robustly detects semantic change across languages in both diachronic and synchronic settings.
Tonogenesis—the historical process by which segmental contrasts evolve into lexical tone—has traditionally been studied through comparative reconstruction and acoustic phonetics. We introduce a computational approach that quantifies the functional role of pitch at different stages of this sound change by measuring how pitch manipulation affects automatic speech recognition (ASR) performance. Through an analysis of sensitivity to pitch-flattening across a set of closely related Tibetan languages, we find evidence of a tonogenesis continuum: atonal Amdo dialects tolerate pitch removal the most, while fully tonal Ü-Tsang varieties show severe degradation, and intermediate Kham dialects fall measurably between these extremes. These gradient effects demonstrate how ASR models implicitly learn the shifting functional load of pitch as languages transition from consonant-based to tone-based lexical contrasts. Our findings show that computational methods can capture fine-grained stages of sound change and suggest that traditional functional load metrics, based solely on minimal pairs, may overestimate pitch dependence in transitional systems where segmental and suprasegmental cues remain phonetically intertwined.
Cross-lingual Lexical Semantic Change in Romance Languages
Ana Sabina Uban | Liviu P Dinu | Anca Daniela Dinu | Simona Georgescu
We present a comprehensive quantitative analysis of lexical semantic change in the five main Romance languages (Romanian, Italian, Spanish, French and Portuguese), based on the most exhaustive database of related words in these languages. We include both cognate words and borrowings (for the first time, to our knowledge), and compute semantic shift measures using different static and contextual embedding models, as well as three different corpora. We publish the obtained lists of semantic divergences across all related word pairs, compute global trends in language-level semantic divergence, and provide insights on particular study cases of highly stable and highly divergent words for different language pairs.
Threshold-Calibrated Word Sense Disambiguation: Semantic Broadening Without Sense Redistribution in Schizophrenia
Naomi Baes | Nick Haslam
Polysemous words pose a challenge for computational approaches to language change. We extend a recent hypothesis-driven, prototype-based framework to estimate word sense prevalence in diachronic text corpora and apply it to 109,940 usages of schizophrenia drawn from U.S. news media (1985–2025). Our extensions include a contextual dispersion measure (Breadth), robust prototype construction, and human-calibrated prototype-similarity thresholds for conservative sense assignment at scale. Across four decades, distributional semantic change indices commonly used in lexical semantic change detection (LSCD) show significant increases in Breadth and baseline-relative semantic drift (APD), while changes in the central usage prototype (PRT) are influenced by term frequency. In contrast, threshold-calibrated sense assignments reveal stable sense proportions: the psychiatric sense remains dominant, with split-personality and metaphorical senses consistently marginal. Together, these results demonstrate that dispersion- and drift-based LSCD metrics can increase even under stable sense prevalence, indicating that such increases can occur without sense redistribution and primarily reflect broad shifts in usage distributions rather than evidence of polysemization or sense loss. We introduce a threshold-calibrated, prototype-based sense-tracking pipeline that enables conservative sense prevalence estimation at scale and clarifies whether rising distributional LSCD metrics reflect sense redistribution or increasing contextual diversity when historical sense annotation is limited.
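The conservative, threshold-calibrated sense assignment described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the toy prototype vectors, sense labels, and the `assign_sense` helper are our own illustrative names, and we assume cosine similarity to sense prototypes with a human-calibrated cutoff.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assign_sense(usage_vec, prototypes, threshold):
    """Assign a usage to its nearest sense prototype, but only if the
    similarity clears the calibrated threshold; otherwise leave the
    usage unassigned (the conservative step the abstract describes)."""
    best_sense, best_sim = None, -1.0
    for sense, proto in prototypes.items():
        sim = cosine(usage_vec, proto)
        if sim > best_sim:
            best_sense, best_sim = sense, sim
    return best_sense if best_sim >= threshold else "unassigned"

# Toy 2-d "embeddings" for illustration only.
prototypes = {
    "psychiatric": np.array([1.0, 0.0]),
    "metaphorical": np.array([0.0, 1.0]),
}
print(assign_sense(np.array([0.9, 0.1]), prototypes, threshold=0.8))  # psychiatric
print(assign_sense(np.array([0.7, 0.7]), prototypes, threshold=0.8))  # unassigned
```

Tracking the proportion of usages assigned to each sense per decade, with ambiguous usages withheld, is what lets the authors separate sense redistribution from mere contextual dispersion.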
Using Correspondence Patterns to Identify Irregular Words in Cognate Sets Through Leave-One-Out Validation
Frederic Blum | Johann-Mattis List
Regular sound correspondences constitute the principal evidence in historical language comparison. Despite the heuristic focus on regularity, it is often more an intuitive judgement than a quantified evaluation, and irregularity is more common than expected from the Neogrammarian model. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, using simulated and real data. In the experiments, we employ leave-one-out validation to measure the regularity of cognate sets in which one word form has been replaced by an irregular one, checking how well our method identifies the forms causing the irregularity. Our method achieves an overall accuracy of 85% with the datasets based on real data. We also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences our results. Reflecting on the broader potential of our new regularity measure and the irregular cognate identification method based on it, we conclude that they could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.
DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling
Mariia Fedorova | Andrey Kutuzov | Khonzoda Umarova
In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as the approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while at the same time leaving it open for other researchers to come up with their own target words using the same datasets. DHPLT aims to fill the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field.
Transparent Semantic Change Detection with Dependency-Based Profiles
Bach Phan Tat | Kris Heylen | Dirk Geeraerts | Stefano De Pascale | Dirk Speelman
Most modern computational approaches to lexical semantic change detection (LSC) rely on embedding-based distributional word representations with neural networks. Despite the strong performance on LSC benchmarks, they are often opaque. We investigate an alternative method which relies purely on dependency co-occurrence patterns of words. We demonstrate that it is effective for semantic change detection and even outperforms a number of distributional semantic models. We provide an in-depth quantitative and qualitative analysis of the predictions, showing that they are plausible and interpretable.
Semantic Change Characterization with LLMs using Rhetorics
Jáder Martins Camboim de Sá | Jooyoung Lee | Marcos Da Silveira | Cedric Pruski
Languages continually evolve in response to societal events, resulting in new terms and shifts in meanings. These changes have significant implications for computer applications, including automatic translation and chatbots, making it essential to characterize them accurately. The recent development of LLMs has notably advanced natural language understanding, particularly in sense inference and reasoning. In this paper, we investigate the potential of LLMs in characterizing three types of semantic change: dimension, relation, and orientation. We achieve this by combining LLMs’ Chain-of-Thought with rhetorical devices and conducting an experimental assessment of our approach using newly created datasets. Our results highlight the effectiveness of LLMs in capturing and analyzing semantic changes, providing valuable insights to improve computational linguistic applications.
This paper presents a semi-supervised approach to investigating lexical semantic change in English prepositions using contextualized word embeddings from BERT. Due to their hybrid lexico-grammatical nature and high degree of polysemy, prepositions have received limited attention in computational studies of semantic change. We address this gap by first applying BERT-based embeddings in combination with a k-nearest neighbors classifier to the task of preposition sense disambiguation, achieving competitive performance without relying on external lexical resources. The trained model is then applied to diachronic data from the Corpus of Historical American English to analyze semantic change over time. By measuring classifier confidence and correlating it with usage year, we detect systematic differences between simple and compound prepositions. Our results confirm linguistic hypotheses that simple prepositions remain largely semantically stable, while compound prepositions exhibit measurable semantic change. The study demonstrates that BERT embeddings provide an effective tool for exploring diachronic semantic phenomena in functionally complex word classes and can be extended to other languages and datasets.
A Computational Analysis of the Emergence of Therapy-speak in Social Media
Alina Iacob | Ana Sabina Uban
The present article investigates semantic change in psychology-related concepts, in scientific and social media texts comparatively. We assess patterns of change over 15 years (2010-2025) and compare word usage in a corpus of Psychology journal abstracts and Reddit comments, testing whether specialized communities on social media align with psychology experts. We analyze the evolution of semantic breadth, semantic displacement, and nearest-neighbour similarity, and in addition include contextual embeddings alongside static Word2Vec embeddings in our experiments. Our results reveal diverse patterns of semantic change across the examined concepts and confirm that many terms are used differently on social media compared to specialized literature. Furthermore, Reddit communities focused on psychology discussions occupy an intermediate position, adopting a more objective stance than general-domain threads while remaining distinct from specialized literature.
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and cosine distance over word prototypes (PRT). We introduce Average Minimum Distance (AMD) and Symmetric Average Minimum Distance (SAMD), new measures that quantify semantic change via local correspondence between word usages across time periods. Across multiple languages, encoder models, and representation spaces, we show that AMD often provides more robust performance, particularly under dimensionality reduction and with non-specialised encoders, while SAMD excels with specialised encoders. We suggest that LSCD may benefit from considering alternative semantic change metrics beyond APD and PRT, with AMD offering a robust option for contextualised embedding-based analysis.
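Read literally from the definitions above, APD averages cosine distances over all cross-period usage pairs, AMD matches each usage in one period to its closest usage in the other, and SAMD symmetrizes AMD over both directions. The sketch below follows that reading; the paper's exact formulation may differ.

```python
import numpy as np

def cos_dist(u, v):
    """Cosine distance between two contextualised usage vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def apd(X, Y):
    """Average Pairwise Distance: mean distance over all cross-period pairs."""
    return float(np.mean([cos_dist(x, y) for x in X for y in Y]))

def amd(X, Y):
    """Average Minimum Distance: each usage in period 1 is matched to its
    closest usage in period 2, and those minimum distances are averaged."""
    return float(np.mean([min(cos_dist(x, y) for y in Y) for x in X]))

def samd(X, Y):
    """Symmetric AMD: average of AMD computed in both directions."""
    return 0.5 * (amd(X, Y) + amd(Y, X))
```

By construction AMD never exceeds APD on the same usage sets, which is one way to see why it can be more robust to within-period dispersion: a word whose usages are internally varied but stable across time scores low on AMD even when APD is inflated.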
From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media
Maria Ryskina | Matthew R. Gormley | Kyle Mahowald | David R. Mortensen | Taylor Berg-Kirkpatrick | Vivek Kulkarni
Living languages are shaped by a host of conflicting internal and external evolutionary pressures. While some of these pressures are universal across languages and cultures, others differ depending on the social and conversational context: language use in newspapers is subject to very different constraints than language use on social media. Prior distributional semantic work on English word emergence *(neology)* identified two factors correlated with creation of new words by analyzing a corpus consisting primarily of historical published texts [(Ryskina et al., 2020)](https://aclanthology.org/2020.scil-1.43/). Extending this methodology to contextual embeddings in addition to static ones and applying it to a new corpus of Twitter posts, we show that the same findings hold for both domains, though the topic popularity growth factor may contribute less to neology on Twitter than in published writing. We hypothesize that this difference can be explained by the two domains favouring different word formation mechanisms.
Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)
Atul Kr. Ojha | Chao-hong Liu | Ekaterina Vylomova | Flammie Pirinen | Jonathan Washington | Nathaniel Oco | Xiaobing Zhao
Are Small Language Models the Silver Bullet to Low-Resource Languages Machine Translation?
Yewei Song | Lujun Li | Cedric Lothritz | Saad Ezzini | Lama Sleem | Niccolo' Gentile | Radu State | Tegawendé F. Bissyandé | Jacques Klein
Small language models (SLMs) offer computationally efficient alternatives to large language models, yet their translation quality for low-resource languages (LRLs) remains severely limited. This work presents the first large-scale evaluation of SLMs across 200 languages, revealing systematic underperformance in LRLs and identifying key sources of linguistic disparity. We show that knowledge distillation from strong teacher models using predominantly monolingual LRL data substantially boosts SLM translation quality—often enabling 2B–3B models to match or surpass systems up to 70B parameters. Our study highlights three core findings: (1) a comprehensive benchmark exposing the limitations of SLMs on 200 languages; (2) evidence that LRL-focused distillation improves translation without inducing catastrophic forgetting, with full-parameter fine-tuning and decoder-only teachers outperforming LoRA and encoder–decoder approaches; and (3) consistent cross-lingual gains demonstrating the scalability and robustness of the method. These results establish an effective, low-cost pathway for improving LRL translation and provide practical guidance for deploying SLMs in truly low-resource settings.
Tao–Filipino Neural Machine Translation: Strategies for Ultra–Low-Resource Settings
Adrian Denzel Macayan | Luis Andrew Sunga Madridijo | Ellexandrei Esponilla | Zachary Mitchell Francisco
Neural Machine Translation (NMT) performance degrades significantly in ultra-low resource settings, particularly for endangered languages like Tao (Yami) which lack extensive parallel corpora. This study investigates strategies to bootstrap a Tao-Tagalog translation system using the NLLB-200 (600 million parameter) model under extremely limited supervision. We propose a multi-faceted approach combining domain-specific fine-tuning, synthetic data augmentation, and cross-lingual transfer learning. Specifically, we leverage the phylogenetic proximity of Ivatan, a related Batanic language, to pre-train the model, and utilize dictionary-based generation to construct synthetic conversational data. Our results demonstrate that transfer learning from Ivatan improves translation quality on in-domain religious texts, achieving a BLEU score of 34.85. Conversely, incorporating synthetic data enhances the model’s ability to generalize to conversational contexts, mitigating the domain bias often inherent in religious corpora. These findings highlight the effectiveness of exploiting linguistic typology and structured lexical resources to develop functional NMT systems for under-represented Austronesian languages.
Text Filter Based on Automatically Acquired Vocabularies for Multilingual Machine Translation
Kenji Imamura | Masao Utiyama
In this paper, we propose a text filter designed to support multiple languages. The method simply aggregates vocabulary from a monolingual corpus and compares it against the input. Despite its simplicity, the approach proves highly effective in removing code-mixed text. When combined with existing language identification techniques, our method can enhance the purity of the corpus in the target language. Consequently, applying it to parallel corpora for machine translation has the potential to improve translation quality. Additionally, the proposed method supports the incremental addition of new languages without the need to retrain those already learned. This feature makes it easy to apply our method to low-resource languages.
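The core idea, aggregating a vocabulary from a monolingual corpus and comparing each input sentence against it, can be sketched as below. The `min_count` and `threshold` values are illustrative assumptions, not the authors' settings, and real systems would tokenize more carefully.

```python
from collections import Counter

def build_vocabulary(monolingual_corpus, min_count=2):
    """Aggregate a vocabulary from a monolingual corpus, keeping only
    tokens seen at least min_count times to suppress noise."""
    counts = Counter(tok for line in monolingual_corpus for tok in line.split())
    return {tok for tok, c in counts.items() if c >= min_count}

def in_vocab_ratio(sentence, vocab):
    """Fraction of the sentence's tokens found in the target-language vocabulary."""
    toks = sentence.split()
    if not toks:
        return 0.0
    return sum(t in vocab for t in toks) / len(toks)

def filter_corpus(sentences, vocab, threshold=0.8):
    """Keep only sentences whose in-vocabulary ratio clears the threshold;
    code-mixed lines score low against a single language's vocabulary."""
    return [s for s in sentences if in_vocab_ratio(s, vocab) >= threshold]
```

Adding a new language under this scheme only means building one more vocabulary set from that language's monolingual data, which is consistent with the incremental-addition property the abstract highlights.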
Comparing LLM-Based Translation Approaches for Extremely Low-Resource Languages
Jared Coleman | Ruben Rosales | Kira Toal | Diego Cuadros | Nicholas Leeds | Bhaskar Krishnamachari | Khalil Iskarous
We present a comprehensive evaluation and extension of the LLM-Assisted Rule-Based Machine Translation (LLM-RBMT) paradigm, an approach that combines the strengths of rule-based methods and Large Language Models (LLMs) to support translation in no-resource settings. We present a robust new implementation (the Pipeline Translator) that generalizes the LLM-RBMT approach and enables flexible adaptation to novel constructions. We benchmark it against four alternatives (Builder, Instructions, RAG, and Fine-tuned translators) on a curated dataset of 150 English sentences, and compare them across translation quality and runtime. The Pipeline Translator consistently achieves the best overall performance. The LLM-RBMT methods (Pipeline and Builder) also offer an important advantage: they naturally align with evaluation strategies that prioritize grammaticality and semantic fidelity over surface-form overlap, which is critical for endangered languages where mistranslation carries high risk.
We evaluate the capabilities of several small large language models (LLMs) to translate between Italian and six low-resource language varieties from Italy (Friulan, Ligurian, Lombard, Sicilian, Sardinian, and Venetian). Using recent benchmark datasets, such as FLORES+ and OLDI-Seed, we compare prompting and fine-tuning approaches for downstream translation, evaluated with CHRF scores. Our findings confirm that these LLMs struggle to translate into and from these low-resource language varieties. Pretraining and fine-tuning a small LLM did not yield improvements over a zero-shot baseline. These results underscore the need for further NLP research on Italy’s low-resource language varieties. As the digital divide continues to threaten the conservation of this diverse linguistic landscape, greater engagement with speaker communities to create better and more representative datasets is essential to boost the translation performance of current LLMs.
Balancing Fluency and Adherence: Hybrid Fallback Term Injection in Low-Resource Terminology Translation
Kurt Abela | Marc Tanti | Claudia Borg
Integrating domain-specific terminology into Machine Translation systems is a persistent challenge, particularly in low-resource and morphologically-rich scenarios where models lack the robustness to handle imposed constraints. This paper investigates the trade-off between static dictionary-based data augmentation and dynamic inference constraints (Constrained Beam Search). We evaluate these methods on two high-to-low resource language pairs: English-Maltese (Semitic) and English-Slovak (Slavic). Our experiments reveal a dichotomy: while dynamic constraints achieve near-perfect Terminology Insertion Rates (TIR), they drastically degrade translation quality (BLEU) in low-resource settings, breaking the fragile fluency of the model. Conversely, static augmentation improves terminology adherence on unseen terms in Maltese (4% → 19%), but fails in the context of a highly inflected language like Slovak. To resolve this conflict, we propose Hybrid Fallback Term Injections, a strategy that prioritizes the fluency of static models while using dynamic constraints as a safety net. This approach recovers up to 90% of missing terms while mitigating the quality degradation of pure constraint approaches, providing a viable solution for high-fidelity translation in data-scarce environments.
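As we read the abstract, the fallback strategy first checks whether the fluent static model's draft already contains the required target terms, and only re-decodes with constraints when something is missing. A hypothetical sketch of that term check (the term pairs, function names, and matching-by-substring simplification are our own, not the authors' implementation):

```python
def missing_terms(translation, term_map):
    """Return source terms whose required target rendering is absent from
    the static model's draft translation (simple substring match; a real
    system for a morphologically rich language would match inflected forms)."""
    out = translation.lower()
    return [src for src, tgt in term_map.items() if tgt.lower() not in out]

def needs_fallback(translation, term_map):
    """Trigger the constrained-decoding fallback only when the fluent
    static model has dropped at least one required term."""
    return len(missing_terms(translation, term_map)) > 0

# Hypothetical terminology entry for illustration.
term_map = {"solar panel": "pannello solare"}
```

Gating the constrained decoder this way is what lets the hybrid recover missing terms while sparing most sentences the fluency cost of hard constraints.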
Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG
David Samuel Setiawan | Raphael Merx | Jey Han Lau
Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.
Building and Evaluating a High Quality Parallel Corpus for English Urdu Low Resource Machine Translation
Munief Hassan Tahir | Hunain Azam | Sana Shams | Sarmad Hussain
Low-resource languages like Urdu suffer from limited high-quality parallel data for machine translation. We introduce a curated English–Urdu corpus of 80,749 high-fidelity sentence pairs across 18 diverse domains, built via ethical collection, manual alignment, deduplication, and strict length-based filtering (AWCD ≤ 5). The corpus is converted into a bidirectional SFT dataset with bilingual (English/Urdu) instructions to enhance prompt-language robustness. Fine-tuning Llama-3.1-8B-Instruct (Llama-FT) and UrduLlama 1.1 (UrduLlama-FT) yields major gains over the baseline. sacreBLEU scores reach 24.65–25.24 (En→Ur) and 76.14–77.97 (Ur→En) for Llama-FT, with minimal sensitivity to prompt language. Blind human evaluation on 90 sentences per direction confirms substantial perceptual improvements. Results demonstrate the value of clean parallel data and bilingual instruction tuning, revealing complementary benefits of general SFT versus Urdu-specific pretraining. This work provides a reproducible resource and pipeline to advance Urdu machine translation and similar low-resource languages.
This paper presents a set of linguistic resources that describes Quechua verbs. We first present a dictionary of 1,444 fundamental Quechua verbs, associated with morpho-syntactic grammars to formalize their inflection and their derivations, that can be used to produce over 2,777,000 conjugated Quechua derived verbal forms. We aligned this list of Quechua verbal forms with the corresponding Spanish dictionary that contains 618,000 conjugated verbal forms, thus producing both a Spanish to Quechua and a Quechua to Spanish dictionary.
Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing
Aashish Dhawan | Christopher Driggers-Ellis | Christan Grant | Daisy Zhe Wang
Machine translation for Indigenous and other low-resource languages is constrained by limited parallel data, orthographic variation, and evaluation instability for morphologically rich languages. In this work, we study Spanish–Aymara, Spanish–Guarani, and Spanish–Quechua translation in the context of the AmericasNLP benchmarks, focusing on data-centric improvements rather than architectural changes. We augment curated parallel corpora with forward-translated synthetic sentence pairs generated using a high-capacity multilingual translation model, while applying conservative, language-specific preprocessing tailored to each language. Training data is filtered using length-ratio constraints and deduplication, whereas official development sets are left unfiltered to ensure fair evaluation. We fine-tune a multilingual mBART model under curated-only and curated+synthetic settings and evaluate performance primarily using chrF++, which is better suited for agglutinative languages than BLEU. Across all three languages, synthetic data augmentation consistently improves chrF++, with the largest gains observed for Aymara and Guarani, while Quechua benefits primarily from deterministic orthographic normalization. Our analysis highlights both the effectiveness and the limitations of generic preprocessing for highly agglutinative languages, suggesting that data-centric augmentation and language-aware normalization are strong, reproducible baselines for low-resource Indigenous language machine translation.
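The abstract above prefers chrF++ over BLEU for agglutinative languages because character n-grams still partially match inflected word forms where word-level matching fails outright. A toy sketch can illustrate the idea; assumptions: n ≤ 2 and plain F1, whereas the real chrF metric uses n up to 6, an F-score with β = 2, and chrF++ additionally mixes in word n-grams (use sacreBLEU for the official implementation).

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string, whitespace removed (toy simplification)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def toy_chrf(hypothesis, reference, max_n=2):
    """Average character n-gram F1 for n = 1..max_n.

    A toy stand-in for chrF++: the official metric uses n = 1..6,
    beta = 2, and adds word n-grams for the '++' variant.
    """
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        p = overlap / sum(hyp.values())
        r = overlap / sum(ref.values())
        scores.append(0.0 if p + r == 0 else 2 * p * r / (p + r))
    return sum(scores) / len(scores) if scores else 0.0
```

On morphologically related forms (e.g. a stem plus a case suffix), partial character overlap yields partial credit where exact word matching would score zero, which is why character-level metrics behave better for agglutinative targets.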
Adapting Multilingual NMT to Language Isolates: The Role of Proxy Language Selection and Dialect Handling for Nivkh
Eleonora Izmailova | Alexey Sorokin | Pavel Grashchenkov
Neural machine translation has achieved remarkable results for high-resource languages, yet language isolates – those with no demonstrated genetic relatives – remain severely underserved, as they cannot benefit from cross-lingual transfer with related languages. We present the first NMT system for Nivkh, a critically endangered language isolate spoken by fewer than 100 fluent speakers in the Russian Far East. Working with approximately 9.5k parallel sentences – expanded through fine-tuned LaBSE sentence alignment – we adapt NLLB-200 to Nivkh-Russian translation. Since Nivkh is absent from NLLB’s language inventory, we investigate proxy language token selection, comparing six typologically diverse languages: Bashkir, Kazakh, Halh Mongolian, Turkish, Tajik, and French. We find that using any proxy substantially outperforms random token initialization (BLEU 18-19.02 vs. 15.44 for rus→niv), confirming the value of proxy-based transfer. However, the specific choice of proxy has minimal impact, with all six achieving comparable results despite spanning four language families and two scripts. This suggests that for language isolates, practitioners can select any typologically reasonable proxy without significant performance penalty. We additionally present preliminary experiments on dialect-specific models for Amur and Sakhalin Nivkh. Our findings establish baseline results for future Nivkh NLP research and provide practical guidance for adapting multilingual models to other language isolates.
Machine translation (MT) evaluation is central in guiding researchers on how to improve a model’s performance. Current automatic evaluation practices fail to provide reliable insights into the specific translation errors that occur, especially for low-resource languages. This paper introduces the Lux-MT-Test-Suite, enabling a linguistically motivated and fine-grained analysis of Luxembourgish–English (LB-EN) MT based on 896 test items covering 12 linguistic categories and 36 linguistic phenomena. We compare a baseline local LLM (Gemma 3), its fine-tuned counterpart (LuxMT), and a proprietary state-of-the-art LLM (GPT-5) to analyse what local LLMs learn through fine-tuning in a low-resource setting and to assess performance differences between local and proprietary systems. The findings identify specific performance gains through fine-tuning, minor degradations, a difference in translation strategies, performance gaps between local and proprietary models, and remaining challenges.
Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation
Kaustubh Shivshankar Shejole | Sourabh Deoghare | Pushpak Bhattacharyya
Neural Machine Translation (NMT) systems rely heavily on explicit punctuation cues to resolve semantic ambiguities in a source sentence. Translating user-generated sentences, which are likely to contain missing or incorrect punctuation, results in fluent but semantically disastrous translations. This work attempts to highlight and address the problem of punctuation robustness of NMT systems through English-to-Marathi translation. First, we introduce Virām, a human-curated diagnostic benchmark of 54 punctuation-ambiguous English-Marathi sentence pairs to stress-test existing NMT systems. Second, we evaluate two simple remediation strategies: cascade-based restore-then-translate and direct fine-tuning. Our experimental results and analysis demonstrate that both strategies yield substantial NMT performance improvements. Furthermore, we find that current Large Language Models (LLMs) exhibit relatively poorer robustness in translating such sentences than these task-specific strategies, thus necessitating further research in this area. The code and dataset are available at https://github.com/KaustubhShejole/Viram_Marathi.
Can Linguistically Related Languages Guide LLM Translation in Low-Resource Settings?
Aishwarya Ramasethu | Rohin Garg | Niyathi Allu | Harshwardhan Fartale | Dun Li Chan
Large Language Models (LLMs) have achieved strong performance across many downstream tasks, yet their effectiveness in extremely low-resource machine translation remains limited. Standard adaptation techniques typically rely on large-scale parallel data or extensive fine-tuning, which are infeasible for the long tail of underrepresented languages. In this work, we investigate a more constrained question: in data-scarce settings, to what extent can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We study a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that while pivot-based prompting can yield improvements in certain configurations, particularly in settings where the target language is less well represented in the model’s vocabulary, the gains are often modest and sensitive to few shot example construction. For closely related or better represented varieties, we observe diminishing or inconsistent gains. Broadly, our findings provide empirical guidance on how and when inference-time prompting and pivot-based examples can be used as a lightweight alternative to fine-tuning in low-resource translation settings.
CTC Regularization for Low-Resource Speech-to-Text Translation
Zachary William Hopton | Rico Sennrich
The challenges of building speech-to-text translation (ST) systems (e.g., a relative lack of parallel speech–text data and robustness to noise in audio) are exacerbated for low-resource language pairs. In this work, we seek to improve low-resource ST by building on previous studies that regularize ST training with the connectionist temporal classification (CTC) loss. By systematically evaluating a diverse range of linguistic annotations as CTC labels across multiple auxiliary loss configurations, we improve speech translation systems for both low- and high-resource settings. These improvements over both a standard end-to-end ST system and a speech LLM indicate a need for continued research on regularizing speech representations in ST.
Navigating Data Scarcity in Low-Resource English-Tatar Translation using LLM Fine-Tuning
Ahmed Khaled Khamis
The scarcity of high-quality parallel corpora remains the primary bottleneck for English-Tatar machine translation. While the OPUS project provides various datasets, our tests reveal that datasets like WikiMatrix, GNOME, and NLLB suffer from significant noise and incorrect labeling, making them unsuitable for training robust encoder-decoder translation models, which typically require larger amounts of high-quality data. Furthermore, we demonstrate that small-scale multilingual Large Language Models (LLMs), such as Qwen3 (4B-30B), Gemma3 (4B-12B) and others, show severe "Turkish interference": they frequently hallucinate Turkish vocabulary when prompted for Tatar. In this paper, we navigate this data scarcity by leveraging Llama 3.3 70B Instruct, the only model in our zero-shot benchmarks capable of maintaining distinct linguistic boundaries for Tatar. To address the lack of gold-standard data, we curated a synthetic dataset of 7,995 high-quality translation pairs using a frontier model as a teacher. We then performed 4-bit LoRA fine-tuning to train Llama for English-Tatar translation. Our results show a performance leap: fine-tuning on the limited Tatoeba dataset (1,193 samples) yielded a CHRF++ score of 24.38, whereas fine-tuning on our synthetic dataset achieved 32.02 on the LoResMT 2026 shared task test set. We release our curated dataset and fine-tuned models to support further research in low-resource Turkic machine translation.
No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data
Dmitry Karpov
We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the obtained weights.
DevLake at LoResMT 2026: The Impact of Pre-training and Model Scale on Russian-Bashkir Low-Resource Translation
Vyacheslav Tyurin
This paper describes the submission of Team DevLake for the LoResMT 2026 Shared Task on Russian-Bashkir machine translation. We conducted a comprehensive comparative study of three distinct neural architectures: NLLB-200 (1.3B), M2M-100 (418M), and MarianMT (77M). To overcome hardware constraints, we employed parameter-efficient fine-tuning techniques (QLoRA) and extensive data filtering using a domain-specific BERT-based classifier. Our experiments demonstrate that the presence of the target language (Bashkir) in the model’s pre-training data is the decisive factor for performance. Our best system, a fine-tuned NLLB-200-1.3B model augmented with exact match retrieval, achieved a CHRF++ score of 52.67. We also report on negative results with custom tokenization for smaller models, providing insights into the limitations of vocabulary adaptation without extensive pre-training.
We describe an evaluation of several open-source models under identical inference conditions without task-specific training. Despite covering a wide range of available models, including both multilingual systems and models specifically designed for Russian-Kazakh translation, the results indicate that the highest performance is achieved by the language-specific approach.
Script Correction and Synthetic Pivoting: Adapting Tencent HY-MT for Low-Resource Turkic Translation
Bolgov Maxim
This paper describes a submission to the LoResMT 2026 Shared Task for the Russian-Kazakh, Russian-Bashkir, and English-Chuvash tracks. The primary approach involves parameter-efficient fine-tuning (LoRA) of the Tencent HY-MT1.5-7B multilingual model. For the Russian-Kazakh and Russian-Bashkir pairs, LoRA adaptation was employed to correct the model’s default Arabic script output to Cyrillic. For the extremely low-resource English-Chuvash pair, two strategies were compared: mixed training on authentic English-Chuvash and Russian-Chuvash data versus training exclusively on a synthetic English-Chuvash corpus created via pivoting through Russian. Baseline systems included NLLB 1.3B (distilled) for Russian-Kazakh and Russian-Bashkir, and Gemma 2 3B for English-Chuvash. Results demonstrate that adapting a strong multilingual backbone with LoRA yields significant improvements over baselines while successfully addressing script mismatch challenges. Code for training and inference is released at: https://github.com/defdet/low-resource-langs-mt-adapt
This paper outlines our winning submission to the English-to-Tatar translation task. We evaluated three strategies: few-shot prompting with Gemini 3 Pro Preview, specialized trans-tokenized Tweeties models, and the RL-distilled TranslateGemma family. Results demonstrate that large commercial models significantly outperform smaller specialized ones in this low-resource setting. Gemini secured first place with a chrF++ score of 56.71, surpassing the open-source baseline of 25.23.
Data-Centric Approach at the LoResMT 2026 Turkic Translation Challenge: Russian-Kyrgyz
Dmitry Novokshanov
We describe our submission to the Turkic languages translation challenge at LoResMT 2026, which focuses on translation from Russian into Kyrgyz. Our approach leverages parallel data, synthetic translations, a comprehensive filtering pipeline and a four-stage curriculum learning strategy. We compare our system with contemporary baselines and present the model that achieves a chrF++ score of 49.1 and takes first place in the competition.
We describe our submission to the LoResMT 2026 shared task, which involved translating from English or Russian into the low-resource Turkic languages Bashkir, Chuvash, Kazakh, Kyrgyz, and Tatar. We submitted runs for the English-Chuvash language pair using neural machine translation (NMT). Our approach focused on systematic experimentation with diverse model architectures and an emphasis on optimizing inference-time parameters. The key findings indicate that a large-scale, specialized multilingual translation model, combined with targeted data preprocessing and careful generation tuning, yielded the best performance, achieving a chrF++ score of 29.67 on the public test set.
Ensemble Methods for Low-Resource Russian-Kyrgyz Machine Translation: When Diverse Models Beat Better Models
Adilet Metinov
We present our submission to the LoResMT 2026 Shared Task on Russian-Kyrgyz machine translation. Our approach demonstrates that ensembling diverse translation models with simple consensus-based voting can significantly outperform individual models, achieving a +1.37 CHRF++ improvement over our best single model. Notably, we find that including "weaker" models in the ensemble improves overall performance, challenging the conventional assumption that ensembles should only combine top-performing systems. Our system achieved 49.31 CHRF++ on the public leaderboard and 48.55 CHRF++ on the final private test set, placing 3rd in the Russian-Kyrgyz track using only open-weight models without any fine-tuning on parallel Kyrgyz data. We report several counter-intuitive findings: (1) simple voting outperforms quality-weighted selection, (2) more diverse models help even when individually weaker, and (3) post-processing "corrections" can hurt performance when reference translations contain similar artifacts.
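The consensus-based voting described above can be sketched in a few lines: each candidate is scored by its average similarity to the other systems' outputs, so an outlier hypothesis from any one model is outvoted by the rest. The helper names (`consensus_pick`, `char_overlap`) are hypothetical, and a real system would use a stronger similarity than raw character overlap (e.g. chrF); this is a sketch of the general technique, not the authors' exact method.

```python
from collections import Counter

def char_overlap(a, b):
    """Crude similarity: shared character counts over the longer string."""
    ca, cb = Counter(a), Counter(b)
    return sum((ca & cb).values()) / max(len(a), len(b), 1)

def consensus_pick(candidates, similarity):
    """Pick the candidate translation most similar, on average, to all others.

    Simple consensus voting: the hypothesis closest to the pool wins,
    which tends to suppress any single model's idiosyncratic errors.
    """
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = sum(similarity(cand, other)
                    for other in candidates if other is not cand)
        if score > best_score:
            best, best_score = cand, score
    return best
```

With outputs from three systems where one is a clear outlier, the two mutually similar hypotheses reinforce each other and the outlier is never selected, mirroring the paper's observation that even "weaker" members can stabilize the ensemble.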
Proceedings of the First Workshop on Multilingual Multicultural Evaluation
Pinzhen Chen | Vilém Zouhar | Hanxu Hu | Simran Khanuja | Wenhao Zhu | Barry Haddow | Alexandra Birch | Alham Fikri Aji | Rico Sennrich | Sara Hooker
LLMs as Span Annotators: A Comparative Study of LLMs and Humans
Zdeněk Kasner | Vilém Zouhar | Patrícia Schmidtová | Ivan Kartáč | Kristýna Onderková | Ondrej Platek | Dimitra Gkatzia | Saad Mahamood | Ondrej Dusek | Simone Balloccu
Span annotation (annotating specific text features at the span level) can be used to evaluate texts where single-score metrics fail to provide actionable feedback. Until recently, span annotation was done by human annotators or fine-tuned models. In this paper, we study whether large language models (LLMs) can serve as an alternative to human annotators. We compare the abilities of LLMs to skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We show that overall, LLMs have only moderate inter-annotator agreement (IAA) with human annotators. However, we demonstrate that LLMs make errors at a similar rate as skilled crowdworkers. LLMs also produce annotations at a fraction of the cost per output annotation. We release the dataset of over 40k model and human span annotations for further research.
Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Value Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance. This metric measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. This shows that even a high average agreement with human data when considering LLM responses independently does not guarantee structural alignment in responses. Additionally, we reveal a weak correlation between two common evaluation metrics, mean-squared distance and KL divergence, which consider all survey answers independent of each other. For future research, we recommend CoT prompting, sampling-based decoding with dozens of samples, and robust analysis using multiple metrics, including self-correlation distance.
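One plausible reading of the self-correlation distance above (the exact formulation is the paper's; this is an illustrative reconstruction) is to compare the question-to-question correlation structure of model samples against that of human respondents, rather than comparing the answers themselves:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy) if vx and vy else 0.0

def self_correlation_distance(human, model):
    """Mean absolute gap between question-question correlations.

    `human` and `model` are lists of respondents/samples, each a list of
    numeric answers to the same ordered survey questions. A score of 0
    means the model preserves the human correlation structure exactly.
    """
    q = len(human[0])
    gaps = []
    for i in range(q):
        for j in range(i + 1, q):
            h = pearson([r[i] for r in human], [r[j] for r in human])
            m = pearson([r[i] for r in model], [r[j] for r in model])
            gaps.append(abs(h - m))
    return sum(gaps) / len(gaps)
```

This captures the abstract's point: a model can match human marginal answer distributions (low mean-squared distance) while still inverting the relationships between questions, which this metric exposes.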
Code-switching is a common feature of multilingual communication, and identifying where the language switches reliably is essential for downstream tasks such as generating code-switched machine translations. This paper introduces CSDI, a Code-Switching Detection (CSD) system for Indic text, which jointly learns CSD, Named Entity Recognition, and Part-of-Speech tagging through a shared encoder. Leveraging multitask learning, CSDI captures linguistic cues that signal switching boundaries and achieves a new state-of-the-art macro-F1 score with near-zero ΔCMI across six Indic languages. The model also demonstrates strong cross-lingual transfer, effectively leveraging high-resource languages to improve low-resource performance. Despite challenges such as intra-word code-mixing and limited token-level context, CSDI establishes a new baseline for scalable, low-resource NLP research in code-mixed environments.
Vinclat: Evaluating Reasoning, Cognition and Culture in One Game
Marc Pàmies | Javier Aula-Blasco | Aitor Gonzalez-Agirre | Marta Villegas
This paper introduces Vinclat, a novel evaluation dataset for Catalan carefully designed to assess the reasoning capabilities and cultural knowledge of LLMs. It comprises 1,000 high-quality instances, meticulously crafted and reviewed by human annotators. Each instance presents a complex riddle that requires a two-step reasoning process involving inferential and abductive reasoning, along with other cognitive skills such as lexical retrieval, paraphrasing, flexibility in interpretation, pattern recognition, and associative thinking. Given four independent clues, models should infer intermediate concepts which, despite being seemingly unrelated, can be creatively connected to reach a final solution. The task targets a unique blend of capabilities, distinguishing it from existing NLP benchmarks. Our evaluation of state-of-the-art models reveals that these still fall significantly short of human-level reasoning, although scaling trends suggest that the performance gap may narrow over time. This indicates that Vinclat provides a robust and long-term challenge, resisting the rapid saturation that is commonly observed in many existing evaluation datasets.
Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality
Takumi Ohashi | Hitoshi Iyatomi
Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at https://github.com/IyatomiLab/CCI.
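The CCI definition above reduces to a one-liner once per-culture generality estimates are available (in the paper these come from an LLM; the culture labels and scores below are purely illustrative):

```python
def conceptual_cultural_index(generality, target):
    """CCI for one sentence, following the abstract's definition:
    the generality estimate within the target culture minus the mean
    generality estimate across the comparison cultures.

    `generality` maps culture name -> a generality estimate in [0, 1].
    Positive CCI: the sentence is specific to the target culture;
    near zero: the sentence is equally general everywhere.
    """
    others = [g for culture, g in generality.items() if culture != target]
    return generality[target] - sum(others) / len(others)
```

Note how the comparison set operationally controls the scope of "culture": swapping in a different set of comparison cultures changes what counts as culturally specific, which is the controllability the abstract highlights.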
The Anthropology of Food: How NLP can Help us Unravel the Food cultures of the World
Arij Riabi | Sougata Saha | Monojit Choudhury
Food carries cultural meaning beyond nutrition. It shapes identity, memory, and social norms, which makes it a central concern in anthropology. Given the diversity of food practices across cultures, analyzing them at scale while preserving their depth (“thick” descriptions) remains difficult for ethnographic methods, where Natural Language Processing (NLP) methods can help. Earlier NLP tools often captured only surface-level “thin” descriptions. Recent methods, especially Large Language Models (LLMs), create openings to recover cultural nuance. In this position paper, we outline research questions at the intersection of food anthropology and NLP, and discuss how LLMs can enable a scalable and culturally grounded anthropology of food. We present a case study examining what LLMs represent about global eating habits, which are often shaped by colonial histories and globalization. Our findings suggest that LLMs’ internal representations recognize cultural clusters, such as shared food habits among formerly colonized regions, but fail to grasp the pragmatic and experiential aspects of food, like the worldwide spread of dishes like pizza or biryani. We conclude by highlighting some of the potential risks and gaps of using NLP for cultural analysis.
LLM-as-a-qualitative-judge: automating error analysis in natural language generation
Nadezhda Chirkova | Tunde Oluwaseyi Ajayi | Seth Aycock | Zain Muhammad Mujahid | Vladana Perlić | Ekaterina Borisova | Markarit Vartampetian
Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be made to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 of cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling the reports composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve NLG system performance.
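The "intuitive cumulative algorithm" for clustering discovered issues might look like the following greedy one-pass sketch. This is a guess at the procedure, not the paper's actual implementation; `same_type` stands in for an LLM judgment of whether two issue descriptions name the same error type.

```python
def cumulative_cluster(issues, same_type):
    """Greedy one-pass clustering of discovered issues.

    Each issue joins the first existing cluster whose representative
    it matches (per `same_type`); otherwise it starts a new cluster.
    The resulting cluster sizes form the error-type report.
    """
    clusters = []  # each cluster is a list; clusters[i][0] is its representative
    for issue in issues:
        for cluster in clusters:
            if same_type(issue, cluster[0]):
                cluster.append(issue)
                break
        else:
            clusters.append([issue])
    return clusters
```

The cumulative structure means clusters grow as per-instance analyses stream in, so the report can be produced without a second pass over all pairwise comparisons.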
Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages
Isaac Chung | Linda Freienthal
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
Cross-lingual and cross-country approaches to argument component detection: a comparative study.
Cecilia Graiff | Chloé Clavel | Benoît Sagot
Argument mining in multilingual settings has rarely been investigated, due to the lack of annotated resources and to the inherent difficulty of the task. We benchmark the performance of models on cross-lingual and cross-country argument component detection, focusing on political data from the US and France. To do so, we introduce FrenchPolArg, a corpus of argumentative political discourse in French, and we automatically translate already existing US-English resources. We benchmark three different cross-lingual and cross-country pipelines, and compare their results to find the best-performing one. We obtain promising results to be integrated in semi-automatic annotation workflows to reduce the time and cost of annotations.
UNSC-Bench: Evaluating LLM Diplomatic Role-Playing Through UN Security Council Vote Prediction
Ayush Nangia | Aman Gokrani | Ruggero Marino Lazzaroni
This paper introduces UNSC-Bench, a benchmark for evaluating Large Language Models (LLMs) in simulating diplomatic decision-making through United Nations Security Council (UNSC) vote prediction. The dataset includes 469 UNSC resolutions from 1947 to 2025, with voting records for the five permanent members (P5) (United States, China, France, Russia, United Kingdom) and translations in four languages. We analyze 26 LLMs, along with thinking variants, across multiple P5 roles and find that (1) without explicit role assignment, models are diplomatically unaligned, defaulting to high yes rates and failing to match any P5 voting pattern, indicating they lack inherent diplomatic identity; (2) model capability (as measured by MMLU-Pro) is strongly correlated with role-playing accuracy; (3) regional models do not outperform others in predicting their home country’s votes; and (4) multilingual evaluation reveals that prompt language impacts model predictions, particularly for minority vote outcomes.
Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America
Yannis Karmim | Renato Pino | Hernan Contreras | Hernan Lira | Sebastian Cifuentes | Simon Escoffier | Luis Martí | Djamé Seddah | Valentin Barriere
Large Language Models (LLMs) exhibit inequalities across cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources for detecting biases in non-English languages, especially those of Latin America (Latam), a region containing many distinct cultures that nonetheless share a common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from the social sciences to create a dataset of question/answer (Q/A) pairs based on the popular and social cultures of various Latin American countries. We create a database of around 23k questions and associated answers extracted from 23k Wikipedia articles, transformed into multiple-choice questions (MCQs) in Spanish and Portuguese and in turn translated to English. We use these MCQs to quantify the degree of knowledge of various LLMs and find (i) a discrepancy in performance across Latam countries, some being easier than others for the majority of models, (ii) that models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam cultures. Our code, the results needed to reproduce our findings, and all datasets by region will be made available.
Whom to Trust? Analyzing the Divergence Between User Satisfaction and LLM-as-a-Judge in E-Commerce RAG Systems
Arif Türkmen | Kaan Efe Keleş
We study retrieval-augmented generation (RAG) evaluation in the Trendyol QA Assistant using 150k real e-commerce interactions. Our framework combines user satisfaction labels, LLM-as-a-judge scoring, and factor-based diagnostics to separate retrieval from generation errors. We find that judge models broadly reflect user satisfaction trends, though important nuances of dissatisfaction are often missed. Factor-level analysis highlights systematic error patterns across query types and context quality, demonstrating that hybrid evaluation, combining multiple LLM judges with direct user feedback, offers the most reliable assessment strategy for production RAG systems.
Query-Following vs Context-Anchoring: How LLMs Handle Cross-Turn Language Switching
Kyuhee Kim | Chengheng Li Chen | Anna Sotnikova
When multilingual users switch languages mid-conversation, how should LLMs respond? We extend MultiChallenge to evaluate cross-turn language switching, translating 182 multi-turn conversations into German, Chinese, Spanish, and Arabic. Across five frontier models, we observe asymmetric behavior: switching into a foreign language (EN→X) yields high query-language fidelity (89–99%), but switching back to English (X→EN) reveals divergent policies. GPT-5 follows the query language (>95%), while Claude Opus 4.5 and Command R+ maintain the established conversation language (<8%). Task accuracy remains stable across conditions regardless of language selection differences. A simple explicit system prompt shows limited effectiveness in modifying these defaults.
Generating Difficult-to-Translate Texts
Vilém Zouhar | Wenda Xu | Parker Riley | Juraj Juraska | Mara Finkelstein | Markus Freitag | Daniel Deutsch
Machine translation benchmarks sourced from the real world quickly become obsolete, as most examples are easy for state-of-the-art translation models. This limits a benchmark's ability to distinguish which model is better or to reveal models' weaknesses. Current methods for creating difficult test cases, such as subsampling or from-scratch synthesis, either fall short of identifying difficult examples or suffer from a lack of diversity and naturalness. Inspired by the iterative process of human experts probing for model failures, we propose MT-breaker, a method in which a large language model iteratively refines a source text to increase its translation difficulty. The LLM iteratively queries a target machine translation model to guide its generation of difficult examples. Our approach generates examples that are more challenging for the target MT model while preserving the diversity of natural texts. Although the examples are tailored to a particular machine translation model during generation, the difficulty also transfers to other models and languages.
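The iterative probe-for-failures loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`rewrite_fn`, `translate_fn`, `score_fn`) are hypothetical stand-ins for the LLM rewriter, the target MT model, and a difficulty metric.

```python
def harden_source(source, rewrite_fn, translate_fn, score_fn, rounds=3):
    """Iteratively rewrite `source` to raise its translation difficulty.

    A rewrite is kept only if the difficulty score of the target model's
    translation (e.g. 1 - estimated quality) increases, so the loop greedily
    accumulates genuine model failures rather than arbitrary edits.
    """
    best, best_score = source, score_fn(translate_fn(source))
    for _ in range(rounds):
        candidate = rewrite_fn(best)                     # LLM proposes a harder variant
        cand_score = score_fn(translate_fn(candidate))   # probe the target MT model
        if cand_score > best_score:                      # accept only real regressions
            best, best_score = candidate, cand_score
    return best, best_score
```

In a toy run one can stand in for the three components with simple callables, e.g. a rewriter that appends a clause and a "difficulty" score that is just output length; the real method would plug in an instructed LLM and a learned quality estimator.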
’A Woman is More Culturally Knowledgeable than A Man?’: The Effect of Personas on Cultural Norm Interpretation in LLMs
Mahammed Kamruzzaman | Hieu Minh Nguyen | Nazmul Hassan | Gene Louis Kim
As the deployment of large language models (LLMs) expands, there is an increasing demand for personalized LLMs. One method to personalize and guide the outputs of these models is by assigning a persona—a role that describes the expected behavior of the LLM (e.g., a man, a woman, an engineer). This study examines whether an LLM’s interpretation of social norms varies based on assigned personas and whether these variations stem from embedded biases within the models. In our research, we tested 34 distinct personas from 12 categories (e.g., age, gender, beauty) across four different LLMs. We find that LLMs’ cultural norm interpretation varies based on the persona used and that the variations within a persona category (e.g., a fat person and a thin person in the physical appearance category) follow a trend where an LLM with the more socially desirable persona (e.g., a thin person) interprets social norms more accurately than with the less socially desirable persona (e.g., a fat person). While persona-based conditioning can enhance model adaptability, it also risks reinforcing stereotypes rather than providing an unbiased representation of cultural norms. We also discuss how different types of social biases due to stereotypical assumptions of LLMs may contribute to the results that we observe.
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
Atul Kr. Ojha | Verginica Barbu Mititelu | Mathieu Constant | Ivelina Stoyanova | A. Seza Doğruöz | Alexandre Rademaker
Large Language Models Put to the Test on Chinese Noun Compounds: Experiments on Natural Language Inference and Compound Semantics
Le Qiu | Emmanuele Chersoni | He Zhou | Yu-Yin Hsu
Noun compounds are generally considered an open challenge for NLP systems, given the difficulty of interpreting the implicit semantic relation between modifier and head, although the advent of Large Language Models (LLMs) has recently led to remarkable performance leaps. However, most evaluations have been carried out on English benchmarks. In our work, we test LLMs on compound semantics understanding in Chinese, adopting two different evaluation scenarios: an extrinsic evaluation in a Natural Language Inference task, and an intrinsic evaluation in which models are directly asked to predict the semantic relation linking the two constituents. Our results show that the bigger and more recent LLMs are able to surpass supervised baselines in the inference task, especially when tested in the few-shot setting. In the more challenging task of selecting the correct interpretation of a compound from a fine-grained typology of semantic relations between head and modifier, the best Chinese LLM (Qwen-plus) manages to select the correct option in about one third of the cases.
SinFoS: A Parallel Dataset for Translating Sinhala Figures of Speech
Johan Nevin Sofalas | Dilushri Pavithra | Nevidu Jayatilleke | Ruvan Weerasinghe
Figures of Speech (FOS) consist of multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well with the figurative expressions of high-resource languages, it often faces challenges when dealing with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we have developed a binary classifier to differentiate between two types of FOS in the dataset, achieving an accuracy rate of approximately 92%. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in the current capabilities of LLMs, as these models often struggle to accurately convey idiomatic meanings. By making this dataset publicly available, we offer a crucial benchmark for future research in low-resource NLP and culturally aware machine translation.
Swedish Multiword Expression Corpora in PARSEME
Sara Stymne | Astrid Berntsson Ingelstam | Eva Pettersson
We present the annotation of Swedish multiword expressions under the PARSEME annotation scheme, including a new release and a historical overview of previous releases. We provide an overview of the evolution of the Swedish datasets and of inter-annotator agreement. We discuss general guidelines and the development of Swedish-specific guidelines for particle verbs and multiword tokens, as well as additional challenges for the Swedish annotation. We also conduct an initial comparison of Swedish and other Germanic languages, identifying aspects where the PARSEME guidelines require revision to ensure better consistency across languages.
Ukrainian Multiword Expressions Corpus: Creation, Annotation, and Linguistic Analysis
Hanna Sytar | Maria Shvedova | Olha Kanishcheva
This paper presents the development of a corpus of annotated multiword expressions (MWEs) for Ukrainian. The resource covers four major categories of MWEs: verbal, nominal, adjectival/adverbial, and functional. We describe the methodology used for data selection, the annotation scheme, and the procedures employed during annotation. In addition, the paper discusses some specific types of MWE constructions, illustrating their usage with numerous examples and addressing complex and borderline cases. The resulting corpus is an important resource for linguistic studies and NLP tasks involving MWEs and is publicly accessible at https://gitlab.com/parseme/sharedtask-data/-/tree/master/2.0?ref_type=heads.
Cognitive Signatures of Multi-Word Expressions: Reading-Time and Surprisal
Diego Alves | Sergei Bagdasarov | Elke Teich
This study investigates whether eye-tracking measures predict if a word is the final token of a multi-word expression (MWE), focusing on two understudied MWE types: fixed expressions (e.g., due to) and phrasal verbs (e.g., turn out). Using mixed-effects logistic regression, we compared tokens in MWE contexts with the same tokens in non-MWE contexts. Results reveal a clear difference in processing. For fixed expressions, reading-time measures significantly predict MWEhood. In contrast, phrasal verbs show no consistent predictive effects. Additionally, we compared the reading-time models to models that included GPT-2 surprisal as a predictor. While surprisal does predict MWEhood, it fails to capture the distinction between types. These findings highlight the need to consider MWE typology in models of formulaic language processing.
Cheese it up: CamemBERT Outperforms Large Language Models for Identification of French Multi-word Expressions
Sergei Bagdasarov | Diego Alves | Elke Teich
In recent years, language models, both encoder-only and generative, have been applied to a variety of downstream NLP tasks, including sequence labeling tasks like automatic multi-word expression identification (MWEI). Multiple studies show that, in general, fine-tuned encoder-only models like BERT tend to outperform pretrained generative LLMs on downstream tasks (Arzideh et al., 2025; Ochoa et al., 2025; Bucher and Martini, 2024; Sebok et al., 2025). However, such comparisons are sparse for MWEI, in particular for French, in part due to the lack of comprehensive gold-standard datasets. In this study, we address this research gap by comparing CamemBERT with gpt-oss and Qwen3 for MWEI, using the French subcorpus of the newly released PARSEME dataset. CamemBERT outperforms both LLMs by large margins in precision, recall, and F1. We complement this numerical evaluation with a qualitative analysis of prediction errors.
Extracting Multi-Word Expressions Representing Technical Terms and Proper Nouns in Log Messages
Kilian Dangendorf | Sven-Ove Hänsel | Jannik Rosendahl | Felix Heine | Carsten Kleiner | Christian Wartena
IT systems generate log messages containing important information about the system’s health. To gather information about system entities, we extract technical terms and proper nouns as multi-word expressions (MWEs) from a wide range of log messages from 16 different real systems. We apply Gries’ information-theoretic approach, which iteratively selects the best MWE candidates using an eight-dimensional ranking method. These candidates are evaluated in an annotation study, achieving a precision of 66%. This value is significantly higher than in evaluations on general-purpose texts, demonstrating the higher incidence of compound technical terms and proper nouns in log messages. The MWEs found can be used to reduce the number of nodes in a system behavior graph while increasing the information density of the nodes.
Two Birds with One Stone: Annotating Romanian Multiword Expressions with an Eye to the PARSEME 2.0 Guidelines Applicability
Verginica Mititelu | Mihaela Cristescu | Elena Irimia | Carmen Mîrzea Vasile
This paper presents an enhanced version of the Romanian corpus previously annotated only for verbal multiword expressions. The new release extends the annotation to multiword expressions of other parts of speech, following version 2.0 of the PARSEME guidelines. The corpus has been expanded; its new part was automatically morpho-syntactically annotated within the Universal Dependencies framework, followed by extensive semi-automatic annotation of multiword expressions across all morphological categories. The paper also reports quantitative data on the updated corpus and discusses the distribution and characteristics of Romanian multiword expressions. We also highlight language-specific annotation challenges and issues arising from the PARSEME 2.0 guidelines.
Incorporating Multiword Expressions in Galician Neural Machine Translation: Compositionality, Efficiency, and Performance
Daniel Solla | Paula Pinto-Ferro | Laura Castro | Pablo Gamallo | Marcos Garcia
This paper explores the behavior of neural machine translation models on two newly introduced datasets containing noun-adjective MWEs with different degrees of semantic ambiguity and compositionality. We compare general-domain machine translation systems with fine-tuned models exposed to small subsets of the target MWEs. By assessing the effects of the learning steps and corpus size, we found that carefully designed fine-tuning may improve MWE handling while mitigating catastrophic forgetting. However, our error analysis reveals that models still struggle in several scenarios, particularly when translating MWEs with idiomatic meanings. Both the datasets and the experiments focus on translation involving Galician, English, and Spanish.
Beyond Single Words: MWE Identification in Bioinformatics Research Articles and Dispersion Profiling Across IMRaD
Jurgi Giraud | Andrew Gargett
Multiword Expressions (MWEs) are pervasive in scientific writing, and in specialized domains they include both multiword terminology (e.g., noun compounds) and recurrent academic phrasing. This study profiles MWEs in a large corpus of bioinformatics research articles segmented by IMRaD sections. Building on recent multi-method approaches to scientific MWE identification, we extract MWEs using complementary automated strategies (semantic matching, dependency parsing, controlled vocabularies, and academic formula lists) and compare the resulting inventories by size, form, and IMRaD section distribution. We further quantify cross-document dispersion using document frequency and Gries’ DP to distinguish widely reused expressions from items concentrated in a small subset of articles. Results show that bioinformatics MWEs are predominantly short and nominal, but that extraction methods differ in the extent to which they recover discourse and reporting phraseology. Dispersion is strongly long-tailed across sections with most MWEs being document-specific, while a smaller recurrent core aligns with section function and is enriched for conventional templates and standardized multiword terms. Overall, the findings argue for combining complementary identification methods with dispersion profiling to characterize domain "multiwordness" in a principled and section-sensitive way.
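Gries' DP (Deviation of Proportions), the dispersion measure used in the abstract above to separate widely reused expressions from document-specific ones, has a compact definition that can be sketched directly; the corpus parts and frequencies below are invented for illustration.

```python
def gries_dp(freqs_per_part, part_sizes):
    """Gries' Deviation of Proportions.

    Compares the observed proportion of an item's occurrences in each corpus
    part against that part's share of the whole corpus. DP = 0 means the item
    is dispersed exactly in proportion to part sizes; values approaching 1
    mean it is concentrated in very few parts.
    """
    total_freq = sum(freqs_per_part)
    total_size = sum(part_sizes)
    return 0.5 * sum(
        abs(f / total_freq - s / total_size)
        for f, s in zip(freqs_per_part, part_sizes)
    )
```

For example, an MWE occurring evenly across four equal-sized sections gets DP = 0.0, while one confined to a single section of four gets DP = 0.75, matching the intuition of a long-tailed, document-specific distribution.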
Multiword expressions are an important area of study in linguistics and natural language processing, as they represent combinations of words that function as a single unit and display properties that cannot be predicted fully from their individual components. This paper describes annotated corpora of about 3,000 multiword expressions across syntactic categories in Marathi. This is the first exhaustive resource for Marathi that includes both verbal and non-verbal multiwords. To develop the annotation guidelines, we drew on the existing literature on the identification and classification of these expressions. Following the PARSEME 2.0 guidelines, we discuss the categories of multiwords and their behaviour in the corpus. Throughout the annotation process, we encountered variability in compositionality and syntactic realization, and we discuss our design decisions during annotation. This dataset will further our understanding of how grammatical structure can be integrated with lexically stored multiword units in Marathi.
Despite recent significant advances, idioms, like other forms of figurative language, present a challenge to natural language processing (NLP). Benchmark corpora are essential for improving current models’ understanding of idioms. However, such corpora are only available for a limited set of languages. In this paper, we introduce our ongoing work on a benchmark corpus of Turkish idioms. Our corpus is structured for testing both idiom recognition and idiom understanding. It currently consists of 200 instances comprising sentences with idiomatic use, their literal paraphrases, similar sentences with no entailment, and, when possible, non-idiomatic uses of the idiomatic expressions. We describe the methodology used to create the corpus, as well as initial experiments with a selection of LLMs.
Diversity patterns run deep: Impact of diversity intake on multiword expression identification
Mathilde Deletombe | Manon Scholivet | Louis Estève | Thomas Lavergne | Agata Savary
Multiword expressions (MWEs) are a good example of a phenomenon where identification systems struggle with generalisation: MWEs present in the test set but absent from the training set are rarely identified. This raises the question of the diversity of the test set relative to that of the training set, and how this impacts performance. We set out to measure how much the diversity of a training corpus increases when adding individual MWEs from the test corpus, and how this increase impacts MWE identification performance. We measure diversity across a three-dimensional framework and find mostly consistent negative correlations with performance across 14 languages and 8 systems.
A Curious Class of Adpositional Multiword Expressions in Korean
Junghyun Min | Na-Rae Han | Jena D. Hwang | Nathan Schneider
Multiword expressions (MWEs) have been widely studied in cross-lingual annotation frameworks such as PARSEME. However, Korean MWEs remain underrepresented in these efforts. In particular, Korean multiword adpositions lack systematic analysis, annotated resources, and integration into existing frameworks. In this paper, we present a study of Korean functional multiword expressions: postpositional verb-based constructions (PVCs). Using data from Korean Wikipedia, we survey and analyze several PVC expressions and contrast them with non-MWEs of similar structure. Building on this analysis, we propose annotation guidelines designed to support future work on Korean multiword adpositions and facilitate alignment with cross-lingual frameworks.
PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
Nina Hosseini-Kivanani
Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. We introduce PolyFrame, our system for the MWE-2026 AdMIRe 2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision–language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a logistic regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification increased performance to 60.0% Top-1 on English, and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, our systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B across 15 languages. Ablation results highlight idiom-aware rewriting as the main contributor to performance, while sentence-type prediction and multimodal fusion enhance robustness. These findings suggest that effective idiom disambiguation is feasible without fine-tuning large multimodal encoders.
IdiomRanker-X at MWE-2026 AdMIRe 2: Multilingual Idiom-Image Alignment via Low-Rank Adaptation of Cross-Encoders
Mehmet Utku Colak
This paper describes the system submitted for the MWE 2026 Shared Task (AdMIRe 2.0 Subtask A). The submission focused on a text-centric approach, reframing the idiom-image alignment task as a sentence-pair classification problem using mBERT (Multilingual BERT). The submitted system relied on full fine-tuning using only the English training data, achieving a Top-1 Accuracy of approximately 0.30 on the blind test set. Following the evaluation phase, significant limitations were identified in the cross-lingual generalization of the base model. In a post-evaluation study, the backbone was upgraded to XLM-RoBERTa-Large-XNLI, incorporating Low-Rank Adaptation (LoRA) and utilizing the full multilingual dataset with hard negative mining. These improvements boosted the accuracy to 0.41, demonstrating the necessity of NLI-specific pre-training and parameter-efficient tuning for MWE-aware multimodal tasks.
alexandru412 at MWE-2026 AdMIRe 2.0: Advancing Multimodal Idiomaticity Representation
Cristea Alexandru-Marian
This paper presents the system developed by team alexandru412 for the AdMIRe 2.0 Shared Task. We participated in the Text-Only track, ranking images based on idiomatic usage without accessing pixel data. Our approach combines a strict list-wise ranking strategy with systematic test-time augmentation. We fine-tuned a Large Language Model (LLM) on English and Portuguese data and relied on zero-shot transfer for other languages. Our system achieved 3rd place in the Text-Only track.
BeeParser at MWE-2026 PARSEME 2.0 Subtask 1: Can Cross-Lingual Interactions Improve MWE Identification?
Ahmet Erdem | Oguzhan Karaarslan
This paper describes a multilingual system for automatic multiword expression identification for PARSEME 2.0 Subtask 1. We formulate MWE identification as a token-level sequence labeling problem using a BIO tagging scheme and fine-tune XLM-RoBERTa-base on PARSEME 2.0. We mainly investigate cross-lingual interactions between language pairs, testing whether training on a given language pair improves MWE detection performance on one or both of the languages. We then apply the selected successful language pairs to the PARSEME 2.0 MWE identification task. Experiments are conducted independently for a subset of the PARSEME 2.0 languages, 8 languages in total. Our approach achieves strong token-based and span-based F1 scores across diverse languages, and we observe that training even with distant language pairs may improve performance on at least one of the languages. We publish our code at https://github.com/ahmeterdem1/parseme-blg505
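The BIO formulation used above — casting MWE identification as token-level sequence labeling — can be illustrated with a short sketch. The span indices and the category label are invented for illustration; real PARSEME data is distributed in the cupt format rather than as plain span lists.

```python
def spans_to_bio(n_tokens, mwe_spans):
    """Convert MWE spans to per-token BIO labels.

    `mwe_spans` is a list of (token_indices, category) pairs; the indices may
    be non-contiguous, which is how discontinuous MWEs (e.g. particle verbs
    with intervening material) are represented.
    """
    labels = ["O"] * n_tokens
    for indices, cat in mwe_spans:
        first, *rest = sorted(indices)
        labels[first] = f"B-{cat}"          # first token of the expression
        for i in rest:
            labels[i] = f"I-{cat}"          # remaining (possibly gapped) tokens
    return labels

tokens = ["She", "gave", "the", "plan", "up"]
# "gave ... up" as a discontinuous verbal MWE, hypothetical category "VID":
labels = spans_to_bio(len(tokens), [((1, 4), "VID")])
# labels == ['O', 'B-VID', 'O', 'O', 'I-VID']
```

A token classifier over XLM-RoBERTa embeddings would then be trained to predict one such label per token, with span-based F1 computed by decoding the predicted labels back into spans.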
VisAffect at MWE-2026 AdMIRe 2: IMMCAN Idiom Multimodal Cross-Attention Network
Barış Bilen | Ali Azmoudeh | Hazım Kemal Ekenel | Hatice Kose
We address AdMIRe 2.0, a static image ranking task in which a sentence containing a potentially idiomatic expression is paired with five image–caption candidates, and the goal is to rank the candidates by semantic compatibility with the intended idiomatic or literal meaning. We propose IMMCAN, which keeps XLM-R and Jina-CLIP-v2 frozen and learns a lightweight two-stage cross-attention fusion, caption–image grounding followed by idiom-to-multimodal conditioning, to predict a compatibility score per candidate. We also evaluate caption-only augmentation via back-translation and synonym substitution, and compare regression and rank-class formulations. On AdMIRe 1.0, text-only modeling achieves higher test top-image accuracy than VLM-grounded modeling. In contrast, on zero-shot AdMIRe 2.0, adding visual patch grounding improves both accuracy and NDCG, indicating better cross-lingual ranking transfer.
Sahara Tokenizers at PARSEME 2.0 Subtask 1: Combining Contextual Embeddings with Structural Decoding for Multi-Word Expression Detection
Yunus Karatepe | Mert Sülük | Zeynep Tuğçe Kırımlı | Begüm Özbay
Multi-Word Expressions (MWEs) pose a significant challenge for natural language processing systems due to their idiosyncratic semantic and syntactic properties. This paper describes our system for the PARSEME 2.0 Shared Task on automatic identification of verbal MWEs across 17 typologically diverse languages. Our approach combines multilingual BERT with explicit Part-of-Speech (POS) feature injection through a dual-head architecture that jointly performs BIO-based identification and category classification. We further investigate extensions, including Conditional Random Field (CRF) decoding for structured prediction, focal loss for addressing class imbalance, and model ensembling for improving discontinuous MWE detection. Our official submission achieves a global MWE-based F1 score of 48.39%, securing second place in the shared task. Ablation studies reveal a strong synergy between POS features and CRF decoding, with the combined approach yielding the best single-model performance. Furthermore, ensembling models trained with different objectives improves both overall F1 score and discontinuous MWE scores, demonstrating the importance of training diversity for capturing non-adjacent syntactic patterns.
3K2T at MWE-2026 AdMIRe 2: CARIM– Category-Aware Reasoning for Idiomatic Multimodality
Kubilay Kağan Kömürcü | Tugce Temel
Idiomatic expressions pose a fundamental challenge for multimodal understanding due to their non-compositional semantics, while pretrained vision–language models tend to over-rely on literal visual alignments. We address this issue in the context of the AdMIRe 2.0 multimodal idiomatic image ranking task by introducing CARIM (Category-Aware Reasoning for Idiomatic Multimodality), an inference-time framework that injects structured semantic reasoning without end-to-end retraining. Experiments on the official Codabench leaderboard demonstrate that CARIM achieves competitive Top-1 Accuracy and nDCG across multiple languages. Additional post-competition evaluation on the released test annotations further shows that CARIM maintains robust multilingual performance, highlighting the effectiveness of inference-time category-aware reasoning for multimodal idiomatic grounding.
PMI MWE Scorer at PARSEME 2.0 Subtask 1: identifying multi-word expressions using pointwise mutual information and universal dependencies
Anna Bogdanova | Ileana Bucur
Multi-word expressions (MWEs) remain a challenge for NLP systems due to their syntactic variability and non-compositional semantics, which is why this problem was proposed as a shared task within the UniDive organization. With the increasing popularity of large language models (LLMs), it is important to continue researching alternative solutions. A classical approach to identifying MWEs is computing pointwise mutual information (PMI), but this is a purely statistical method that cannot unveil the links between words in natural text. To address this issue, we propose a simple syntax-aware PMI method that leverages Universal Dependencies (UD) trees (Nivre et al., 2016) to model co-occurrence between syntactically related words. By computing PMI over dependency-linked word pairs and aggregating these scores, we aim to improve on surface-based methods. Contrary to expectations, our experiments show that the classical statistical approach achieves better results at partially identifying MWEs. Still, our approach aims to strike a balance between lightweight computation, as opposed to LLMs, and precision of results.
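A syntax-aware PMI of the kind described can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the toy pairs stand in for (head, dependent) pairs that would be extracted from UD parse trees.

```python
import math
from collections import Counter

def pmi_scores(dep_pairs):
    """PMI over dependency-linked word pairs.

    `dep_pairs` is a list of (head, dependent) tuples harvested from
    dependency trees, so co-occurrence is counted between syntactically
    related words rather than within a linear context window.
    """
    pair_counts = Counter(dep_pairs)
    word_counts = Counter(w for pair in dep_pairs for w in pair)
    n_pairs = len(dep_pairs)
    n_words = sum(word_counts.values())
    scores = {}
    for (h, d), c in pair_counts.items():
        p_pair = c / n_pairs                 # joint probability of the linked pair
        p_h = word_counts[h] / n_words       # marginal probabilities
        p_d = word_counts[d] / n_words
        scores[(h, d)] = math.log2(p_pair / (p_h * p_d))
    return scores
```

High-scoring pairs (words that co-occur under a dependency link far more often than chance predicts) are then taken as MWE candidates; the abstract's aggregation step would combine these pairwise scores over larger expressions.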
tiberiucarp at MWE-2026 AdMIRe 2: GLIMMER-Gloss-based Image Multiword Meaning Expression Ranker
Andrei Tiberiu Carp
Multiword expressions (MWEs), particularly idioms, pose persistent challenges for vision-language systems due to their non-compositional semantics and culturally grounded meanings. This paper presents GLIMMER, a three-stage hybrid ranking system that evaluates how well images express the intended meaning of MWEs across 15 languages. Our approach uses LLM-generated semantic glosses as multilingual meaning anchors, combined with dual-path embedding scoring (textual captions and visual features), and LLM-based semantic verification. Evaluated on the AdMIRe shared task benchmark, GLIMMER achieves competitive performance across diverse languages without relying on parallel training data or language-specific resources. The results show that using glosses to anchor meaning helps match idioms with images across languages and modalities, and that combining retrieval with reasoning is more robust than using embeddings alone.
IPN at MWE-2026 PARSEME 2.0 Subtask 1: MWE Identification via Related Languages and Harnessing Thinking Mode
Anna Hülsing | Noah-Manuel Michael | Daniel Mora Melanchthon | Andrea Horbach
We present IPN, our system for Subtask 1 of the PARSEME 2.0 Shared Task, which targets the identification of MWEs in 17 languages. Overall, IPN outperformed a much larger-parameter baseline model, yet a performance gap to the top-performing systems remains. To better understand these results, we investigate Qwen3-32B’s suitability for mono-, cross- and multilingual MWE identification. We also explore whether this model benefits from prepending automatically generated thinking data to the gold label during instruction-tuning. We find that target language data is vital for instruction-tuning. Prepending generated thinking data to a subset of the training data slightly improves performance for two out of three languages, but more detailed evaluation is required.
Semantic Stars at MWE-2026 PARSEME 2.0 Subtask 2: Alternative Approaches for MWE Paraphrasing
Elif Bayraktar | Vedat Doğancan | Muhammed Abdullah Gümüş | Nusret Ali Kızılaslan
This paper describes the system submitted by the Semantic Stars Team for Subtask 2 of the PARSEME 2.0 shared task (Paraphrasing Multiword Expressions). Our approach addresses the challenge of paraphrasing sentences containing MWEs such that the MWE is removed while the original meaning and grammatical structure are preserved. The paper describes multiple distinct approaches powered by open-weight Large Language Models (LLMs), each employing a combination of different techniques such as prompting, multi-agent pipelines and classical NLP methods. Four distinct methods are tested on the French test data, along with a fifth that combines the results of the first four. We tested with several different open-weight LLMs, including Llama3.1:8b, Qwen3:8b and gpt-oss-120b, and were able to achieve significant improvements over the baseline, securing first place on the shared task leaderboard.
MorphoFiltered-Gemini at MWE-2026 PARSEME 2.0 Subtask 1: Tackling LLM Overgeneration via Universal POS-based Constraints
Irina Moise | Sergiu Nisioi
This paper describes MorphoFiltered-Gemini, a multilingual system submitted to the PARSEME 2.0 shared task on multiword expression (MWE) identification. The system relies on Google Gemini 2.0 Flash-Lite to generate MWE predictions using zero-shot and selectively applied few-shot prompting, without fine-tuning or language-specific resources. To reduce the tendency of large language models to over-generate MWEs, we introduce a lightweight morphological post-filter that removes unlikely constructions while preserving high-precision patterns. Rather than optimizing peak performance for individual languages, our approach prioritizes precision and cross-lingual robustness. As a result, the system exhibits stable behavior across 17 typologically diverse languages and achieves the highest Shannon evenness score among all submitted systems. The experimental results highlight a clear trade-off between recall-oriented LLM prompting strategies and precision-oriented filtering, and show that simple linguistic constraints can effectively improve the stability of LLM-based multilingual MWE identification systems.
LST at MWE-2026 AdMIRe 2: Advancing Multimodal Idiomaticity Representation
Le Qiu | Yu-Yin Hsu | Emmanuele Chersoni
This paper presents our methods for the AdMIRe 2.0 shared task, which addresses multilingual and multimodal idiom understanding. Our submission focuses on the text-only track. Specifically, we employ an ensemble of three large language models (LLMs) to directly perform the presented image ranking task. Each model independently produces a ranking of the candidate images, and we aggregate their outputs using a hard voting strategy to determine the final prediction. This ensemble learning framework leverages the complementary strengths of different LLMs, improving robustness and reducing the variance of individual model predictions.
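The abstract does not spell out the exact aggregation rule beyond "hard voting", but a Borda-style positional vote over per-model rankings can be sketched as follows (candidate labels are illustrative):

```python
from collections import Counter

def hard_vote_rankings(rankings):
    """Combine rankings from several models into one final ranking.

    Each model awards a candidate `n - position` points (Borda-style),
    so candidates ranked highly by more models accumulate more points.
    Ties are broken by candidate id to keep the output deterministic.
    """
    n = len(rankings[0])
    points = Counter()
    for ranking in rankings:
        for pos, cand in enumerate(ranking):
            points[cand] += n - pos
    return sorted(points, key=lambda c: (-points[c], c))

# Three hypothetical LLMs rank three candidate images a, b, c.
final = hard_vote_rankings([["a", "b", "c"],
                            ["a", "c", "b"],
                            ["b", "a", "c"]])
```

Here two of three models prefer "a", so it wins the vote even though one model ranks "b" first; this averaging effect is what reduces the variance of individual model predictions.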
UniBO at MWE-2026 PARSEME 2.0 Subtask 2: A Cross-lingual Approach to Multiword Expression Paraphrasing
Debora Ciminari | Alberto Barrón-Cedeño
This paper describes MISP (Multilingual Idiomatic Sentence Paraphrasing), a system submitted to the PARSEME 2.0 Multilingual Shared Task on Identification and Paraphrasing of Multiword Expressions (MWEs). We participated in Subtask 2 on MWE paraphrasing and developed our system based on Qwen3-4B-Instruct fine-tuned on synthetic Portuguese MWE paraphrases. We applied MISP not only to Portuguese, but also to French and Romanian, aiming to leverage cross-lingual transfer within related languages, with ours being the only submission for Portuguese. Our results indicate that MISP struggles to generate paraphrases that both rephrase and preserve the original meaning of the MWE. Additionally, instruction fine-tuning does not appear to improve performance. Overall, our findings highlight the challenges of paraphrasing MWEs, particularly in a cross-lingual setting.
DCSN-NLP at MWE-2026 AdMIRe 2: Bridging Literal and Figurative Meaning Through Hierarchical Multimodal Reasoning
David Cotigă | Sergiu Nisioi
This paper presents our system for the MWE-2026 AdMIRe 2.0 shared task, which aimed to advance multimodal idiomatic understanding across 15 languages. We address the task of selecting, from a set of five images, the one that best represents either the literal or idiomatic meaning of a given compound in context. Our approach follows a multi-step pipeline: a large language model (LLM) first determines whether the compound is used literally or idiomatically and generates auxiliary text, consisting of an idiomatic meaning explanation and a visual description of the literal meaning. An ensemble of three CLIP models then identifies the two images most semantically similar to the appropriate generated text via a voting mechanism. Finally, the LLM selects the best image from these two candidates.
ITUNLP at MWE-2026 AdMIRe 2: A Zero-Shot LLM Pipeline for Multimodal Idiom Understanding and Ranking
Atakan Site | Oğuz Ali Arslan | Gülşen Eryiğit
This paper presents our system for AdMIRe 2 (Advancing Multimodal Idiomaticity Representation), a shared task on multilingual multimodal idiom understanding. The task focuses on ranking images according to how well they depict the literal or idiomatic usage of potentially idiomatic expressions (PIEs) in context, across 15 languages and two tracks: a text-only track, and a multimodal track that uses both images and captions. To tackle both tracks, we propose a hybrid zero-shot pipeline built on large vision–language models (LVLMs). Our system employs a chain-of-thought prompting scheme that first classifies each PIE usage as literal or idiomatic and then ranks candidate images by their alignment with the inferred meaning. A primary–fallback routing mechanism increases robustness to safety-filter refusals, while lightweight post-processing recovers consistent rankings from imperfect model outputs. Without any task-specific fine-tuning, our approach achieves 55.9% Top-1 Accuracy in the text-only track and 60.1% in the multimodal (text+image) track, ranking first overall on the official leaderboard. These results suggest that carefully designed zero-shot LVLM pipelines can provide strong baselines for multilingual multimodal idiomaticity benchmarks.
Archaeology at MWE-2026 PARSEME 2.0 Subtask 1 and 2: Parsing is for Encoders, Paraphrasing is for LLMs
Rares-Alexandru Roscan | Sergiu Nisioi
This paper presents our approach to the PARSEME 2.0 Shared Task on Romanian, covering both Identification (Subtask 1) and Paraphrasing (Subtask 2). While Large Language Models (LLMs) excel at semantic generation, we hypothesize that they lack the structural precision required for MWE identification, leading to "boundary hallucinations" that compromise downstream simplification. Our Rank 1 results on Romanian confirm this: a specialized encoder (RoBERT) using standard sequence labeling outperforms both few-shot LLMs and complex structural parsers (MTLB-STRUCT). This justifies our proposed pipeline: using encoders as precise “pointers” to guide the generative power of LLMs.
ITUNLP2 at MWE-2026 AdMIRe 2: Modular Zero-Shot Pipelines for Multimodal Idiom Grounding and Ranking
Özge Umut | Bora Şenceylan
We describe a zero-shot system for AdMIRe 2.0, a shared task on multimodal understanding of potentially idiomatic expressions (PIEs). Given a context sentence with a PIE and five candidate images, the system predicts whether the usage is literal or idiomatic and ranks images by how well they match the intended meaning. We use closed-source large multimodal models and compare prompting pipelines from direct one-step ranking to modular multi-step pipelines that separate sense prediction, PIE-focused image semantics, and final ranking. All steps produce constrained JSON outputs to enable deterministic parsing and composition. In the official AdMIRe 2.0 evaluation on CodaBench, our best pipeline achieves an average Top-1 accuracy of 0.52 and an average nDCG score of 0.70 across the 12 languages we submitted. We obtain the best score among submitted systems in 10 of these languages.
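For reference, the nDCG scores reported here and elsewhere in the AdMIRe results can be computed from a ranked list of graded relevance judgments; this is a minimal sketch of the standard log2-discounted formulation, not the official task scorer:

```python
import math

def ndcg(relevances):
    """nDCG of a ranked list: DCG with log2 positional discount,
    normalized by the DCG of the ideal (descending) ordering."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))
    ideal = sum(rel / math.log2(i + 2)
                for i, rel in enumerate(sorted(relevances, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1, 0, 0])   # ideal ordering -> 1.0
swapped = ndcg([0, 3, 2, 1, 0])   # best image demoted -> below 1.0
```

A perfectly ordered list scores 1.0, and demoting the most relevant image lowers the score smoothly, which is why nDCG complements the stricter Top-1 accuracy metric.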
Edition 2.0 of the PARSEME shared task on multilingual identification and paraphrasing of multiword expressions
Manon Scholivet | Agata Savary | Carlos Ramisch | Eric Bilinski | Takuya Nakamura | Maria Mitrofan | Vasile Pais
Multiword expressions (MWEs) have been a major challenge in NLP for decades, and research on MWEs has been driven notably by shared tasks, including those organized by the PARSEME community. We report the organisation and results of edition 2.0 of the PARSEME shared task. For the first time, all syntactic categories are covered: verbal, nominal, adjectival, adverbial and functional. We rely on edition 2.0 of the PARSEME corpus, annotated for all these categories in 17 languages. We create a new dataset with paraphrases of sentences containing idioms in 14 languages, and define a new subtask dedicated to MWE paraphrasing. We extend our evaluation protocol by measuring both the performance and the diversity of systems, and by including manual evaluation for paraphrasing. Ten systems, including the baseline, participated in the MWE identification subtask and five in the paraphrasing subtask. Results are promising, but known MWE identification challenges remain unsolved. Performance correlates positively with diversity in MWE identification, and negatively in MWE paraphrasing.
MWE-2026 Shared Task: AdMIRe 2 Advancing Multimodal Idiomaticity Representation
Doğukan Arslan | Rodrigo Wilkens | Wei He | Dilara Torunoglu Selamet | Thomas Pickard | Aline Villavicencio | Adriana Silvina Pagano | Gülşen Eryiğit
Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in large language models, idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and task results for MWE-2026 Shared Task 2: Advancing Multimodal Idiomaticity Representation 2 (AdMIRe 2), which challenges the community to assess and improve models’ ability to interpret idiomatic expressions in multimodal contexts across multiple languages. Participants competed in an image ranking task in which, for each item, systems receive a context sentence containing a potentially idiomatic expression (PIE) and five candidate images. Participating systems are required to predict the sentence type (i.e., idiomatic vs. literal) for the given context and rank the images by how well they depict the intended meaning in that context. Among the participating systems, the most effective methods include pipelines utilizing closed-source commercial models such as Gemini 2.5 and GPT-5, and employing chain-of-thought reasoning strategies. Methods to mitigate language models’ bias towards literal interpretations and ensembles to smooth out variance were common.
Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026)
Elena V. Epure | Sergio Oramas | SeungHeon Doh | Pedro Ramoneda | Anna Kruspe | Mohamed Sordo
From Novice to Expert: Generating Audience-Dependent Concert Moderations with RAG-LLMs
Kerstin Denecke
In this paper, we study the capabilities of large language models (LLMs) to adapt a concert moderation to diverse expertise levels of listeners. Our proof-of-concept concert moderator is based on retrieval-augmented generation (RAG) and uses few-shot audience modelling to infer the listener’s expertise. We study the system’s capability to adapt to three different listener expertise levels. Two open LLMs are compared: gpt-oss:20b and llama3. The recognised differences among the models suggest that they vary in how directly they reproduce versus paraphrase retrieved information while maintaining semantic alignment.
LabelBuddy: An Open Source Music and Audio Language Annotation Tagging Tool Using AI Assistance
Ioannis Prokopiou | Ioannis Sina | Agisilaos Kounelis | Pantelis Vikatos | Themos Stafylakis
The advancement of Machine learning (ML), Large Audio Language Models (LALMs), and autonomous AI agents in Music Information Retrieval (MIR) necessitates a shift from static tagging to rich, human-aligned representation learning. However, the scarcity of open-source infrastructure capable of capturing the subjective nuances of audio annotation remains a critical bottleneck. This paper introduces LabelBuddy, an open-source collaborative auto-tagging audio annotation tool designed to bridge the gap between human intent and machine understanding. Unlike static tools, it decouples the interface from inference via containerized backends, allowing users to plug in custom models for AI-assisted pre-annotation. We describe the system architecture, which supports multi-user consensus, containerized model isolation, and a roadmap for extending agents and LALMs. Code available at https://github.com/GiannisProkopiou/gsoc2022-Label-buddy.
Stochastic Parrots or True Virtuosos? Digging Deeper Into the Audio-Video Understanding of AVQA Models
Sara Pernille Jensen | Hallvard Innset Hurum | Anna-Maria Christodoulou
Audio-video question answering (AVQA) systems for music show signs of multimodal "understanding", but it is unclear which inputs they rely on or whether their behavior reflects genuine audio-video reasoning. Existing evaluations focus on overall accuracy and rarely examine modality dependence. We address this gap by suggesting a method of using counterfactual evaluations to analyse the audio-video understanding of the models, illustrated with a case study on the audio-video spatial-temporal (AVST) architecture. This includes interventions that zero out or swap audio, video, or both, where results are benchmarked against a baseline based on linguistic patterns alone. Results show stronger reliance on audio than video, yet performance persists when either modality is removed, indicating learned cross-modal representations. The AVQA system studied thus exhibits non-trivial multimodal integration, though its "understanding" remains uneven.
Beyond Musical Descriptors: Extracting Preference-Bearing Intent in Music Queries
Marion Baranes | Romain Hennequin | Elena V. Epure
Although annotated music descriptor datasets for user queries are increasingly common, few consider the user’s intent behind these descriptors, which is essential for effectively meeting their needs. We introduce MusicRecoIntent, a manually annotated corpus of 2,291 Reddit music requests, labeling musical descriptors across seven categories with positive, negative, or referential preference-bearing roles. We then investigate how reliably large language models (LLMs) can extract these music descriptors, finding that they do capture explicit descriptors but struggle with context-dependent ones. This work can further serve as a benchmark for fine-grained modeling of user intent and for gaining insights into improving LLM-based music understanding systems.
How Far Can Pretrained LLMs Go in Symbolic Music? Controlled Comparisons of Supervised and Preference-based Adaptation
Deepak Kumar | Emmanouil Karystinaios | Gerhard Widmer | Markus Schedl
Music often shares notable parallels with language, motivating the use of pretrained large language models (LLMs) for symbolic music understanding and generation. Despite growing interest, the practical effectiveness of adapting instruction-tuned LLMs to symbolic music remains insufficiently characterized. We present a controlled comparative study of finetuning strategies for ABC-based generation and understanding, comparing an off-the-shelf instruction-tuned backbone to domain-adapted variants and a music-specialized LLM baseline. Across multiple symbolic music corpora and evaluation signals, we provide insights into adaptation choices for symbolic music applications. We highlight the trade-off between domain adaptation and preserving prior knowledge, as well as the distinct behaviour of the metrics used to measure domain adaptation for symbolic music.
Text-only training is a promising new method for training multimodal machine learning models without data from every modality. However, few studies have explored its use as an approximation of missing data for supervised learning in data-scarce environments. In this work, we examine techniques to acquire text-based training data, address the modality gap, and present a case study on classifying subjective audio timbre descriptions based on three kinds of text-only training data and six augmentation methods on eight audio-timbre datasets. We find that text-only training successfully produces supervised audio classifiers, trained without any audio, that are competitive with both a zero-shot baseline and training on real audio.
Read Between the Tracks: Exploring LLM-driven Intent-based Music Recommendations
Anna Hausberger | Petra Jósár | Markus Schedl
This paper evaluates the effectiveness of large language models (LLMs) on the task of context-aware music recommendation, specifically focusing on the alignment of music tracks with a listening intent, in addition to user preferences. We present a preliminary investigation in which five LLMs (variants of Llama, Qwen, and Mistral) are tasked with ranking a candidate set of tracks containing both ground-truth items (associated with specific user-intent pairs) and distractor items (user-relevant, intent-relevant, or neither user- nor intent-relevant). Our results show that LLMs rank intent-user-relevant items higher than the distractor items, with "Llama-3.1-8B-Instruct" having the best performance (NDCG of 0.32±0.20 vs. 0.20±0.15). We further investigate whether performance differs when the listening intent is mentioned explicitly in the prompt vs. implied solely by music preferences. Surprisingly, the LLMs achieved the best performance with an implicit indication of intent rather than explicitly adding it to the prompt, with "Mistral-7B-Instruct-v0.3" performing best (NDCG of 0.37±0.22 vs. 0.29±0.18).
Learning When to Personalize: LLM Based Playlist Generation via Query Taxonomy and Classification
Fedor Buzaev | Ivan Sukharev | Rinat Mullahmetov | Roman Bogachev | Ilya Sedunov | Oleg Pavlovich | Daria Pugacheva
Playlist generation based on textual queries using large language models (LLMs) is becoming an important interaction paradigm for music streaming platforms. User queries span a wide spectrum from highly personalized intent to essentially catalog-style requests. Existing systems typically rely on non-personalized retrieval/ranking or apply a fixed level of preference conditioning to every query, which can overfit catalog queries to a single user or under-personalize explicitly listener-dependent requests. We present an industrial-scale LLM-based playlist generation system with dynamic personalization that adapts the personalization strength to the query type. We define a query taxonomy, train a query-type classifier on 5,000 manually labeled queries, and use its predicted probability to modulate the mixture of LLM-based semantic scoring and personalized evaluation. In a blind user study with pairwise comparisons and ELO aggregation, this approach consistently outperforms both non-personalized and fixed-personalization baselines.
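As an illustration of the ELO aggregation step in the user study (system names and vote counts below are invented), pairwise wins can be folded into per-system ratings with standard Elo updates:

```python
def elo_update(r_winner, r_loser, k=32.0):
    """One standard Elo update after a single pairwise win."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

def elo_ratings(comparisons, start=1000.0):
    """Aggregate (winner, loser) judgments into per-system ratings."""
    ratings = {}
    for winner, loser in comparisons:
        rw, rl = ratings.get(winner, start), ratings.get(loser, start)
        ratings[winner], ratings[loser] = elo_update(rw, rl)
    return ratings

# Hypothetical study: dynamic personalization ("dyn") wins most
# pairwise comparisons against fixed and non-personalized baselines.
votes = [("dyn", "fixed")] * 3 + [("dyn", "none")] * 3 + [("fixed", "none")]
ratings = elo_ratings(votes)
```

The resulting rating order reflects the accumulated pairwise preferences, which is how blind pairwise comparisons can be condensed into a single leaderboard.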
HumMusQA: A Human-written Music Understanding QA Benchmark Dataset
Benno Weck | Pablo Puentes | Andrea Poltronieri | Satyajeet Prabhu | Dmitry Bogdanov
The evaluation of music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that truly tests whether models can perceive and interpret music, a standard that current data methodologies frequently fail to meet. This paper introduces a meticulously structured approach to music evaluation, proposing a new dataset of 320 hand-written questions curated and validated by experts with musical training, arguing that such focused, manual curation is superior for probing complex audio comprehension. To demonstrate the use of the dataset, we benchmark six state-of-the-art LALMs and additionally test their robustness to uni-modal shortcuts.
Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026)
Kemal Oflazer | Abdullatif Köksal | Onur Varol
Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations, representing the first large-scale encoder-only language model available for Turkish. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models, with the large variant achieving the best scores in two of four tasks but showing no consistent scaling advantage overall. This flat scaling trend, also observed for XLM-R and EuroBERT, suggests that current Turkish benchmarks may already be saturated. At the same time, comparisons with smaller but more curated models such as BERTurk highlight that corpus quality and diversity can outweigh sheer data volume. Taken together, SindBERT contributes both as an openly released resource for Turkish NLP and as an empirical case study on the limits of scaling and the central role of corpus composition in morphologically rich languages. The SindBERT models are released under the MIT license and made available in both fairseq and Huggingface formats.
Authorial style transfer is particularly challenging in low-resource scenarios, such as those presented by languages with a distinct socio-digital trajectory like Turkish, where contemporary digital text coexists with under-resourced literary and historical styles. This work addresses this gap through the Dual-Stage Stylometric Imprinting (DSSI) framework, introducing a Rule+Example paradigm for effective style profiling. Evaluated on a corpus of Turkish texts, the approach enables smaller models to achieve up to 90% of large model performance by combining explicit stylistic guidelines with contextual demonstrations. The findings demonstrate altered scaling laws for stylistic tasks and facilitate the practical deployment of personalized style transfer for preserving distinctive writing characteristics.
TUNE: A Task For Turkish Machine Unlearning For Data Privacy
Doruk Benli | Ada Canoğlu | Nehir İlkim Gönençer | Dilara Keküllüoğlu
Most large language models (LLMs) are trained on massive datasets that include private information, which may be disclosed to third-party users in output generation. Developers put defences in place to prevent the generation of harmful and private information, but jailbreaking methods can be used to bypass them. Machine unlearning aims to remove information that may be private or harmful from the model’s generation without retraining the model from scratch. While machine unlearning has gained some popularity for the removal of private information, especially in English, little to no attention has been given to Turkish unlearning paradigms or existing benchmarks. In this study, we introduce TUNE (Turkish Unlearning Evaluation), the first benchmark dataset for the Turkish unlearning task for personal information. TUNE consists of 9,842 input-target text pairs about 50 fictitious personalities with two training task types: (1) Q&A and (2) Information Request. We fine-tuned the mT5-base model to evaluate various unlearning methods, including our proposed approach. We find that while current methods can help unlearn unwanted private information in Turkish, they also unlearn other information we want to retain in the model.
A Unified Turkic Idiom Understanding Benchmark: Idiom Detection and Semantic Retrieval Across Five Turkic Languages
Gözde Aslantaş | Tunga Gungor
Idiomatic expressions are culturally grounded, semantically opaque, and difficult to interpret for multilingual natural language processing systems. Despite the large speaker population of Turkic languages, resources that focus on monolingual and cross-lingual idioms and their meanings are limited. We introduce the first unified benchmark for idiom understanding across Turkish, Azerbaijani, Turkmen, Gagauz, and Uzbek languages. The datasets compiled include token-level idiom span annotations. We develop models for idiom identification and semantic retrieval tasks. We evaluate seven models for idiom identification and nine embedding models for semantic retrieval tasks under several fine-tuning schemes using standard dense retrieval metrics. This benchmark provides a basis for studying idiomatic phenomena in Turkic languages and clarifies how idiomatic meanings are shared, altered, or diverge across languages.
TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization
Figen Eğin | Aytuğ Onan
This study presents a framework for generating the gold-standard summary fully automatically and reproducibly based on multiple human summaries of Turkish educational videos. Within the scope of the study, a new dataset called TR-EduVSum was created, encompassing 82 Turkish course videos in the field of "Data Structures and Algorithms" and containing a total of 3281 independent human summaries. Inspired by existing pyramid-based evaluation approaches, the AutoMUP (Automatic Meaning Unit Pyramid) method is proposed, which extracts consensus-based content from multiple human summaries. AutoMUP clusters the meaning units extracted from human summaries using embedding, statistically models inter-participant agreement, and generates graded summaries based on consensus weight. In this framework, the gold summary corresponds to the highest-consensus AutoMUP configuration, constructed from the most frequently supported meaning units across human summaries. Experimental results show that AutoMUP summaries exhibit high semantic overlap with robust LLM summaries such as Flash 2.5 and GPT-5.1. Furthermore, ablation studies clearly demonstrate the decisive role of consensus weight and clustering in determining summary quality. The proposed approach can be generalized to other Turkic languages at low cost.
SarcasTürk: Turkish Context-Aware Sarcasm Detection Dataset
Niyazi Ahmet Metin | Sevde Yılmaz | Osman Enes Erdoğdu | Elif Sude Meydan | Oğul Sümer | Dilara Keküllüoğlu
Sarcasm is a colloquial form of language that is used to convey messages in a non-literal way, which affects the performance of many NLP tasks. Sarcasm detection is not trivial, and existing work focuses mainly on English. We present SarcasTürk, a context-aware Turkish sarcasm detection dataset built from entries on Ekşi Sözlük, a large-scale Turkish online discussion platform where people frequently use sarcasm. SarcasTürk contains 1,515 entries from 98 titles with binary sarcasm labels and a title-level context field created to support comparisons between entry-only and context-aware models. We generate these contexts by selecting representative sentences from all entries under a title using summarization techniques. We report baseline results for a fine-tuned BERTurk classifier and zero-shot LLMs under both no-context and context-aware conditions. We find that the BERTurk model with title-level context has the best performance, with 0.76 accuracy and balanced class-wise F1 scores (0.77 for sarcasm, 0.75 for no sarcasm). SarcasTürk can be shared upon contacting the authors, since the dataset contains potentially sensitive and offensive language.
Language Matters: Target-Language Supervision for Political Bias Detection in Turkish News
Umut Ozbagriacik | Haim Dubossarsky
We present, to our knowledge, the first systematic transformer-based outlet-ideology classification study for Turkish news. Using a topic-balanced corpus of Turkish political articles drawn from six outlets commonly perceived as left-, centre-, or right-leaning, we formulate a three-way outlet-ideology classification task. On this dataset, we evaluate a monolingual encoder (BERTurk), two multilingual encoders (mBERT, XLM-R), and a LoRA-adapted decoder model (Mistral). BERTurk achieves the best performance among individual models (70% accuracy, 71% macro-F1), reaching levels comparable to English-language studies despite operating in a lower-resource setting. Error analyses show that all encoders reliably distinguish centrist from partisan articles, but frequently confuse left- and right-leaning articles with each other. Moreover, BERTurk is relatively stronger on right-leaning content, whereas the multilingual models favour left-leaning content, suggesting an “ideological fingerprint” of their pre-training data. Crucially, models fine-tuned on an English political-bias task fail to transfer to Turkish, collapsing to near-chance performance. Taken together, these results demonstrate that effective political bias detection requires target-language supervision and cannot be achieved through naïve cross-lingual transfer. Our work establishes a first baseline for Turkish political bias detection and underscores the need for open, carefully designed Turkish (and broader Turkic) bias benchmarks to support robust and fair media analysis.
Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew
Giuseppe Samo | Paola Merlo
In this paper, we investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, focusing on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish, with its transparent morphological markers, both monolingual and multilingual models succeed both when tokenization is highly atomic and when words are broken into small subword units. For Hebrew, however, a multilingual model using character-level tokenization fails to capture its non-concatenative morphology, while a monolingual model with unified morpheme-aware segmentation excels. Performance improves for all models on more synthetic datasets.
A Morphology-Aware Evaluation of Turkish Syntax in Large Language Models
Ezgi Başar | Arianna Bisazza
Minimal pair benchmarks have become a common approach for evaluating the syntactic knowledge of language models (LMs). However, the creation of such benchmarks often overlooks language-specific confounders that may affect model performance, particularly in the case of morphologically rich languages. In this paper, we investigate how surface-level factors such as morpheme count, subword count, and sentence length influence the performance of LMs on a Turkish benchmark of linguistic minimal pairs. We further analyze whether a tokenizer’s degree of alignment with morphological boundaries can serve as a proxy for model performance. Finally, we test whether the distribution of morphemes in a minimal pair benchmark can skew model performance. Our results show that while surface factors have limited predictive power, they might still serve as a systematic source of bias. Moreover, we find that morphological alignment can roughly correspond to model performance, and morpheme-level imbalances in the benchmark may have a significant influence on results.
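The tokenizer-morphology alignment probed in this abstract can be scored by checking which gold morpheme boundaries a subword segmentation reproduces. A minimal sketch with invented segmentations (the scoring function here is generic, not the paper's exact metric):

```python
# Fraction of gold morpheme boundaries that a subword segmentation also
# places a boundary at. Segmentations are illustrative, not from the paper.
def boundary_positions(segments):
    """Character offsets of internal boundaries in a segmentation."""
    pos, offsets = 0, set()
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets

def alignment_score(morphemes, subwords):
    gold = boundary_positions(morphemes)
    pred = boundary_positions(subwords)
    if not gold:        # monomorphemic word: trivially aligned
        return 1.0
    return len(gold & pred) / len(gold)

# "evlerimizde" = ev + ler + imiz + de ("in our houses")
morphemes = ["ev", "ler", "imiz", "de"]
subwords  = ["evler", "imiz", "de"]      # hypothetical BPE output
print(alignment_score(morphemes, subwords))  # 2 of 3 gold boundaries matched
```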
Benchmarking Hate Speech Detection in Azerbaijani with Turkish Cross-Lingual Transfer and Transformer Models
Tural Alizada | Haim Dubossarsky
In this paper, we investigate the task of hate-speech classification in the closely related Turkic language pair, Turkish-Azerbaijani. Transformer models can achieve strong hate-speech classification in Turkish, but their performance does not reliably transfer to closely related low-resource languages without careful evaluation. We study Turkish–Azerbaijani hate speech detection and introduce the first manually annotated Azerbaijani benchmark, comprising 1,112 YouTube comments from major news channels with severe class imbalance. We compare XLM-RoBERTa and a compact BERT-Tiny model against a TF–IDF + logistic regression baseline under monolingual training, zero-shot Turkish→Azerbaijani transfer, low-resource balanced subsampling, bilingual mixed fine-tuning, and translation-based augmentation using machine-translated Turkish data. XLM-R attains high macro-F1 in Turkish and achieves moderate zero-shot transfer to Azerbaijani, but native Azerbaijani training is fragile for the hate class. Mixed bilingual training improves robustness for both languages, whereas TF–IDF generalizes poorly to Azerbaijani.
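A rough sketch of the TF–IDF + logistic regression baseline family this abstract compares against, with invented placeholder texts and labels (the character n-gram setting is an assumption, not the paper's configuration):

```python
# Sketch of a TF-IDF + logistic regression baseline of the kind compared
# in the abstract. Texts, labels, and n-gram settings are invented
# placeholders, not the paper's data or configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "harika bir haber, tebrikler",   # benign placeholder
    "cok guzel bir gelisme",         # benign placeholder
    "defol git buradan",             # hostile placeholder
    "senden nefret ediyorum",        # hostile placeholder
]
train_labels = [0, 0, 1, 1]          # 0 = not hate, 1 = hate

clf = make_pipeline(
    # character n-grams are robust to rich morphology and noisy spelling
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["nefret ediyorum senden"]))
```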
When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English
Hasan Can Biyik | Libby Barak | Jing Peng | Anna Feldman
Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, we investigate how cross-lingual equivalence influences transfer in multilingual euphemism detection. We categorize Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on their functional, pragmatic, and semantic alignment. Our findings reveal a transfer asymmetry: semantic overlap is insufficient to guarantee positive transfer, particularly in the low-resource Turkish-to-English direction, where performance can degrade even for overlapping euphemisms and, in some cases, improve under NOPET-based training. Differences in label distribution help explain these counterintuitive results. Category-level analysis suggests that transfer may be influenced by domain-specific alignment, though evidence is limited by sparsity.
TurkBench: A Benchmark for Evaluating Turkish Large Language Models
Cagri Toraman | Ahmet Kaan Sever | Ayşe Aysu Cengiz | Elif Ecem Arslan | Görkem Sevinç | Sarp Kantar | Mete Mert Birdal | Yusuf Faruk Güldemir | Ali Buğra Kanburoğlu | Sezen Felekoğlu | Birsen Şahin Kütük | Büşra Tufan | Elif Genç | Serkan Coşkun | Gupse Ekin Demir | Muhammed Emin Arayıcı | Olgun Dursun | Onur Gungor | Susan Üsküdarlı | Abdullah Topraksoy | Esra Darıcı
With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English-language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench comprises 8,151 data samples across 21 distinct subtasks. These are organized under six main categories of evaluation: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data provide researchers and developers with a valuable tool for evaluating their models and identifying areas for improvement. We further publish our benchmark for online submissions at https://huggingface.co/turkbench
BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish
Burak Aktaş | Mehmet Can Baytekin | Süha Kağan Köse | Ömer İlbilgi | Elif Özge Yılmaz | Cagri Toraman | Bilge Kaan Görür
Text-to-SQL systems have achieved strong performance on English benchmarks, yet their behavior in morphologically rich, low-resource languages remains largely unexplored. We introduce BIRDTurk, the first Turkish adaptation of the BIRD benchmark, constructed through a controlled translation pipeline that adapts schema identifiers to Turkish while strictly preserving the logical structure and execution semantics of SQL queries and databases. Translation quality is validated on a sample size determined by the Central Limit Theorem to ensure 95% confidence, achieving 98.15% accuracy on human-evaluated samples. Using BIRDTurk, we evaluate inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning. Our results reveal that Turkish introduces consistent performance degradation, driven by both structural linguistic divergence and underrepresentation in LLM pretraining, while agentic reasoning demonstrates stronger cross-lingual robustness. Supervised fine-tuning remains challenging for standard multilingual baselines but scales effectively with modern instruction-tuned models. BIRDTurk provides a controlled testbed for cross-lingual Text-to-SQL evaluation under realistic database conditions. We release the training and development splits to support future research.
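The CLT-based validation sample size mentioned above follows the standard worst-case proportion formula; a small sketch with an illustrative ±5% margin of error (the paper's exact margin is not stated):

```python
# Worst-case sample size for estimating a proportion at 95% confidence:
# n = z^2 * p * (1 - p) / E^2, maximized at p = 0.5. The +/-5% margin
# below is illustrative; the paper does not state its exact margin.
import math

def sample_size(z=1.96, p=0.5, margin=0.05):
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size())  # 385 samples for a +/-5% margin at 95% confidence
```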
Tokenisation of Turkic Copula Constructions in Universal Dependencies
Cagri Coltekin | Furkan Akkurt | Bermet Chontaeva | Soudabeh Eslami | Sardana Ivanova | Gulnura Dzhumalieva | Aida Kasieva | Nikolett Mus | Jonathan Washington
Identifying units, ‘syntactic words’, for morphosyntactic analysis is important yet challenging for morphologically rich languages. In this paper we propose a set of guiding principles for determining units of morphosyntactic analysis, and apply them to the case of copular constructions in Turkic languages within the Universal Dependencies (UD) framework. We also provide a survey of the practice in the Turkic UD treebanks published to date, and discuss the advantages and disadvantages of the proposed tokenisation for a selection of Turkic languages.
RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish
Süha Kağan Köse | Mehmet Can Baytekin | Burak Aktaş | Bilge Kaan Görür | Evren Ayberk Munis | Deniz Yılmaz | Muhammed Yusuf Kartal | Cagri Toraman
Retrieval-Augmented Generation (RAG) enhances LLM factuality, yet design guidance remains English-centric, limiting insights for morphologically rich languages like Turkish. We address this by constructing a comprehensive Turkish RAG dataset derived from Turkish Wikipedia and CulturaX, comprising question-answer pairs and relevant passage chunks. We benchmark seven stages of the RAG pipeline—from query transformation and reranking to answer refinement—without task-specific fine-tuning. Our results show that complex methods like HyDE maximize accuracy (85%), considerably higher than the baseline (78.70%), while a Pareto-optimal configuration using Cross-encoder Reranking and Context Augmentation achieves comparable performance (84.60%) at much lower cost. We further demonstrate that over-stacking generative modules can degrade performance by distorting morphological cues, whereas simple query clarification with robust reranking offers an effective solution.
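A toy retrieve-then-rerank sketch of the two-stage setup the abstract favors, where TF–IDF stands in for the retriever and a word-overlap score stands in for the cross-encoder reranker (passages and query are invented placeholders):

```python
# Toy two-stage pipeline: TF-IDF retrieval followed by reranking.
# TF-IDF stands in for the retriever; a word-overlap score stands in for
# the cross-encoder reranker. Passages and query are invented.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

passages = [
    "Ankara is the capital of Turkey.",
    "Istanbul is the largest city in Turkey.",
    "The capital of France is Paris.",
]
query = "what is the capital of Turkey"

# Stage 1: TF-IDF retrieval, keep the top-2 candidate passages
vec = TfidfVectorizer().fit(passages + [query])
scores = (vec.transform(passages) @ vec.transform([query]).T).toarray().ravel()
candidates = np.argsort(-scores)[:2]

# Stage 2: rerank the candidates with a (stand-in) pairwise scorer
def pair_score(q, p):
    qs = set(q.lower().split())
    ps = set(p.lower().rstrip(".").split())
    return len(qs & ps) / len(qs)

best = max(candidates, key=lambda i: pair_score(query, passages[i]))
print(passages[best])
```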
OCRTurk: A Comprehensive OCR Benchmark for Turkish
Deniz Yılmaz | Evren Ayberk Munis | Cagri Toraman | Süha Kağan Köse | Burak Aktaş | Mehmet Can Baytekin | Bilge Kaan Görür
Document parsing is now widely used in applications such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage for low-resource settings such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading most element-wise metrics except figures and attaining the best Normalized Edit Distance scores in the easy, medium, and hard subsets. We also observe performance variation by document type: models perform well on non-academic documents, while slide decks prove the most challenging.
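Normalized Edit Distance, the headline metric above, is the Levenshtein distance scaled by the longer string's length; a minimal reference implementation (the example strings are invented):

```python
# Normalized Edit Distance for OCR evaluation: Levenshtein distance
# divided by the length of the longer string. Example strings invented.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(reference, hypothesis):
    if not reference and not hypothesis:
        return 0.0
    return levenshtein(reference, hypothesis) / max(len(reference), len(hypothesis))

# Two character substitutions over 12 characters
print(normalized_edit_distance("Türkçe belge", "Turkce belge"))
```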
Building a Turkish Large Language Model via Continual Pre-Training and Parameter-Efficient Adaptation
Alperen Enes Bayar | Mert Ege | Gökhan Yurtalan | Alper Karamanlioglu | Berkan Demirel | Ramazan Gokberk Cinbis
Large Language Models (LLMs) achieve strong performance on many tasks, but they still struggle with morphologically rich, low-resource languages such as Turkish. This difficulty stems from Turkish being an agglutinative language and underrepresented in multilingual training data, which causes current models to often fail at capturing its morphology, flexible word order, and formal registers. In this paper, we introduce MODA (Model Adapted for Domain Applications), a Turkish-specialized LLM built via a modular pipeline that combines continual pre-training, parameter-efficient fine-tuning, and model merging. Starting from Qwen2.5-7B as the base model, we first perform large-scale continual pre-training on a Turkish web corpus to improve grammatical and morphological representations. We then apply parameter-efficient supervised fine-tuning on task-oriented instruction data, and finally merge specialized variants into a single unified model. We evaluate MODA on TurkishMMLU, the Turkish subset of EXAMS, and TRCLAIM-19, where it consistently outperforms both the base and instruction-tuned Qwen2.5-7B models. Our results support a training strategy that explicitly separates linguistic acquisition from task alignment when adapting LLMs to morphologically rich, underrepresented languages under realistic hardware constraints.
From Lemmas to Dependencies: What Signals Drive Light Verbs Classification?
Sercan Karakas | Yusuf Şimşek
Light verb constructions (LVCs) are a challenging class of verbal multiword expressions, especially in Turkish, where rich morphology and productive complex predicates create minimal contrasts between idiomatic predicate meanings and literal verb–argument uses. This paper asks what signals drive LVC classification by systematically restricting model inputs. Using UD-derived supervision, we compare lemma-driven baselines (lemma TF–IDF + Logistic Regression; BERTurk trained on lemma sequences), a grammar-only Logistic Regression over UD morphosyntax (UPOS/DEPREL/MORPH), and a full-input BERTurk baseline. We evaluate on a controlled diagnostic set with Random negatives, lexical controls (NLVC), and LVC positives, reporting split-wise performance to expose decision-boundary behavior. Results show that coarse morphosyntax alone is insufficient for robust LVC detection under controlled contrasts, while lexical identity supports LVC judgments but is sensitive to calibration and normalization choices. Overall, our findings motivate targeted evaluation for Turkish MWEs and highlight that “lemma-only” is not a single representation but depends critically on how normalization is instantiated.
Beyond the Token: Correcting the Tokenization Bias in XAI via Morphologically-Aligned Projection
Muhammet Anil Yagiz | Fahrettin Horasan
Current interpretability methods for Large Language Models (LLMs) operate on a fundamental yet flawed assumption: that subword tokens represent independent semantic units. We prove that this assumption creates a fidelity bottleneck in Morphologically Rich Languages (MRLs), where semantic meaning is densely encoded in sub-token morphemes. We term this phenomenon the Tokenization-Morphology Misalignment (TMM). To resolve TMM, we introduce MAFEX (Morpheme-Aligned Faithful Explanations), a theoretically grounded framework that redefines feature attribution as a linear projection from the computational (token) basis to the linguistic (morpheme) basis. We evaluate our method on a diverse suite of Turkish LLMs, including BERTurk, BERTurk-Sentiment, Cosmos-BERT, and Kumru-2B. On our embedded benchmark (N=20), MAFEX achieves an average F1@1 of 91.25% compared to 13.75% for standard token-level baselines (IG, SHAP, DeepLIFT), representing a +77.5% absolute improvement, establishing it as the new standard for faithful multilingual interpretability.
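The linear projection MAFEX is built on can be sketched generically: token attributions are mapped to morphemes through a matrix whose entries weight each token by its character overlap with each morpheme. The spans, scores, and overlap weighting below are illustrative assumptions, not the paper's exact construction:

```python
# Project token-level attributions onto morphemes with a linear map M,
# where M[m, t] is the fraction of token t's characters that fall inside
# morpheme m. Spans, weights, and scores are illustrative assumptions.
import numpy as np

def overlap(a, b):
    """Length of the character overlap of two (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def projection_matrix(token_spans, morpheme_spans):
    M = np.zeros((len(morpheme_spans), len(token_spans)))
    for t, ts in enumerate(token_spans):
        width = ts[1] - ts[0]
        for m, ms in enumerate(morpheme_spans):
            M[m, t] = overlap(ts, ms) / width
    return M

# "evlerde" = ev + ler + de; hypothetical subwords "evle" + "rde"
token_spans    = [(0, 4), (4, 7)]
morpheme_spans = [(0, 2), (2, 5), (5, 7)]
token_attr     = np.array([0.8, 0.2])   # e.g. token scores from IG or SHAP

M = projection_matrix(token_spans, morpheme_spans)
print(M @ token_attr)   # morpheme-level attributions; total mass preserved
```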
Overview of the SIGTURK 2026 Shared Task: Terminology-Aware Machine Translation for English–Turkish Scientific Texts
Ali Gebeşçe | Abdulfattah Safa | Ege Uğur Amasya | Gözde Gül Şahin
This paper presents an overview of the SIGTURK 2026 Shared Task on Terminology-Aware Machine Translation for English-Turkish Scientific Texts. We address the critical challenge of terminological accuracy in low-resource settings by constructing the first terminology-rich English-Turkish parallel corpus, comprising 3,300 sentence pairs from STEM domains with 10,157 expert-validated term pairs. The shared task consists of three subtasks: term detection, expert-guided correction, and end-to-end post-editing. We evaluate state-of-the-art baselines (including GPT-5.2 and Claude Sonnet 4.5) alongside participant systems employing diverse strategies from fine-tuning to Retrieval-Augmented Generation (RAG). Our results highlight that while massive generalist models dominate zero-shot detection, smaller, domain-adapted models using Supervised Fine-Tuning and Reinforcement Learning can significantly outperform them in end-to-end post-editing. Furthermore, we find that rigid retrieval pipelines often disrupt fluency, whereas Chain-of-Thought prompting allows models to integrate terminology more naturally. Despite these advances, a significant gap remains between automated systems and human expert performance in strict terminology correction.
Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Ekaterina Vylomova | Andrei Shcherbakov | Priya Rani
Automatic Grammatical Case Prediction for Template Filling in Case-Marking Languages: Implementation and Evaluation for Finnish
Johannes Laurmaa
Automatically generating grammatically correct sentences in case-marking languages is hard because nominal case inflection depends on context. In template-based generation, placeholders must be inflected to the right case before insertion, otherwise the result is ungrammatical. We formalise this case selection problem for template slots and present a practical, data-driven solution designed for morphologically rich, case-marking languages, and apply it to Finnish. We automatically derive training instances from raw text via morphological analysis, and fine-tune transformer encoders to predict a distribution over 14 grammatical cases, with and without lemma conditioning. The predicted case is then realized by a morphological generator at deployment. On a held-out test set in the lemma-conditioned setting, our model attains 89.1% precision, 81.1% recall, and 84.2% F1, with recall@3 of 93.3% (macro averages). The probability outputs support abstention and top-k suggestion user interfaces, enabling robust, lightweight template filling for production use in multiple domains, such as customer messaging. The pipeline assumes only access to raw text plus a morphological analyzer and generator, and can be applied to other languages with productive case systems.
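The abstention and top-k suggestion behavior described above can be sketched directly from the predicted case distribution; the case inventory, threshold, and logits below are illustrative assumptions, not the paper's implementation:

```python
# Turn a predicted distribution over 14 grammatical cases into top-k
# suggestions, abstaining when the best case is not confident enough.
# Case inventory, threshold, and logits are illustrative assumptions.
import numpy as np

CASES = ["nominative", "genitive", "partitive", "inessive", "elative",
         "illative", "adessive", "ablative", "allative", "essive",
         "translative", "instructive", "abessive", "comitative"]

def suggest(probs, k=3, threshold=0.5):
    """Return top-k (case, probability) pairs, or None to abstain."""
    probs = np.asarray(probs)
    order = np.argsort(-probs)[:k]
    if probs[order[0]] < threshold:
        return None            # defer to the user instead of guessing
    return [(CASES[i], float(probs[i])) for i in order]

logits = np.zeros(14)
logits[3], logits[4] = 4.0, 2.0              # model favors the inessive
probs = np.exp(logits) / np.exp(logits).sum()
print(suggest(probs))
```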
The paper presents a prototype of a web app designed to automatically generate verb valency lexica from the Universal Dependencies (UD) treebanks. It offers an overview of the structure of the app, its core functionality, and functional extensions designed to handle treebank-specific features. In addition, the paper highlights the limitations of the prototype and the potential for its further development.
Evaluating the Interplay of Information Status and Information Content in a Multilingual Parallel Corpus
Julius Steuer | Toshiki Nakai | Andrew Thomas Dyer | Luigi Talamo | Annemarie Verkerk
The uniform information density (UID) hypothesis postulates that linguistic units are distributed in a text in such a way that the variance around an average information density is minimized. The relationship between information density and information status (IS) remains underexplored. In this ongoing work, we project IS annotations from the English section of the CIEP+ corpus (Verkerk & Talamo, 2024) onto parallel sections in other languages. We then use the projected annotations to evaluate the relationship between IS and information content in a typologically diverse sample of languages. Our preliminary findings indicate that there is an effect of information status on information density, with the directionality of the effect depending on language and part of speech.
It is common in cognitive computational linguistics to use language model surprisal as a measure of the information content of units in language production. From here, it is tempting to apply this to information structure and status, considering surprising mentions to be new and unsurprising ones to be given, providing a ready-made continuous metric of information givenness/newness. To test whether this conflation is appropriate, we perform regression experiments to see if language model surprisal is actually well predicted by information status as manually annotated, and if so, whether this effect is separable from more trivial linguistic information such as part of speech and word frequency. We find that information status alone is at best a very weak predictor of surprisal, and that surprisal is much better predicted by part of speech, which is highly correlated with both information status and surprisal, and by word frequency. We conclude that surprisal should not be used as a continuous representation of information status by itself.
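The regression setup above can be sketched in miniature: surprisal is the negative log2 probability of a token, regressed on annotated predictors such as information status and word frequency. All data below is invented for illustration:

```python
# Surprisal of a token is the negative log2 of its model probability.
# We regress surprisal on two invented predictors: information status
# (1 = new mention, 0 = given) and log word frequency.
import numpy as np
from sklearn.linear_model import LinearRegression

def surprisal(prob):
    return -np.log2(prob)

# (is_new_information, log_frequency) per token -- invented data
X = np.array([[1, -8.0], [1, -6.5], [0, -3.0],
              [0, -2.5], [1, -5.0], [0, -4.0]])
probs = np.array([0.004, 0.02, 0.30, 0.40, 0.05, 0.15])
y = surprisal(probs)

reg = LinearRegression().fit(X, y)
print(reg.score(X, y))  # training R^2 of status + frequency predictors
```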
Beyond Multilinguality: Typological Limitations in Multilingual Models for Meitei Language
Badal Nyalang
We present MeiteiRoBERTa, the first publicly available monolingual RoBERTa-based language model for Meitei (Manipuri), a low-resource language spoken by over 1.8 million people in Northeast India. Trained from scratch on 76 million words of Meitei text in Bengali script, our model achieves a perplexity of 65.89, a 5.2× improvement over the multilingual baselines mBERT (341.56) and MuRIL (355.65). Through comprehensive evaluation of perplexity, tokenization efficiency, and semantic representation quality, we demonstrate that domain-specific pre-training significantly outperforms general-purpose multilingual models for low-resource languages. Our model exhibits superior semantic understanding, with a similarity separation of 0.769 compared to 0.035 for mBERT and near zero for MuRIL, despite MuRIL’s better tokenization efficiency (fertility: 3.29 vs. 4.65). We publicly release the model, training code, and datasets to accelerate NLP research for Meitei and other underrepresented Northeast Indian languages.
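Tokenizer fertility, cited above, is the average number of subword tokens per word; a minimal sketch with a stand-in chunking tokenizer (values are illustrative only):

```python
# Tokenizer fertility: average number of subword tokens per whitespace
# word. The chunking "tokenizer" is a stand-in; values are illustrative.
def fertility(texts, tokenize):
    words = [w for t in texts for w in t.split()]
    tokens = sum(len(tokenize(w)) for w in words)
    return tokens / len(words)

def chunk_tokenize(word, size=3):
    """Stand-in tokenizer: fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]

texts = ["information retrieval", "tokenization"]
print(fertility(texts, chunk_tokenize))  # 11 tokens / 3 words
```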
Linguistic reference material is a trove of information that can be utilized for the analysis of languages. The material, in the form of grammar books and sketches, has been used for machine translation, but it can also be used for language analysis. Retrieval Augmented Generation (RAG) has been demonstrated to improve large language model (LLM) capabilities by incorporating external reference material into the generation process. In this paper, we investigate the use of grammar books and RAG techniques to identify language features. We use Grambank for feature definition and ground truth values, and we evaluate on five typologically diverse low-resource languages. We demonstrate that this approach can effectively make use of reference material.
The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family
Rayyan Merchant | Karine Megerdoomian
Unmasking the Factual-Conceptual Gap in Persian Language Models
Alireza Sakhaeirad | Ali Ma'manpoosh | Arshia Hemmat
While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DIVANBENCH, a diagnostic benchmark focused on superstitions and customs: arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model’s ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.
Benchmarking Offensive Language Detection in Persian and Pashto
Zahra Bokaei | Bonnie Webber | Walid Magdy
Offensive language detection and target identification are essential for maintaining respectful online environments. While these tasks have been widely studied for English, comparatively less attention has been given to other languages, including Persian and Pashto, and the effectiveness of recent large language models for these languages remains underexplored. To address this gap, we created a comprehensive benchmark of diverse modeling approaches in Persian and Pashto. Our evaluation covers zero-shot, fine-tuned, and cross-lingual transfer settings, analyzing when detection succeeds or fails across different model approaches. This study provides one of the first systematic analyses of offensive language detection and cross-lingual transfer between these languages.
Large language models (LLMs) are increasingly used for communication in many languages; therefore, understanding their limitations with respect to culture-specific pragmatics is important. While LLMs perform well on statistically frequent structures, their shortcomings are most evident in rare pragmatic phenomena. This study investigates whether LLMs can generate a (rare) complex honorific mismatch in Farsi. The pattern arises at two levels: (i) a plural pronoun disagrees with a singular referent for the sake of honorification, and (ii) the related components violate the Polite Plural Generalization due to an intimacy implication. This double mismatch pattern is attested in everyday speech, though it is statistically sparse. We tested GPT-4 across multiple scenarios. The results reveal that the model successfully employs the first mismatch to indicate honorification, but fails to adopt the second mismatch that simultaneously conveys intimacy. The model thus deviates from human-like behavior at the syntax–pragmatics interface. These findings suggest that, while machine models demonstrate partial success in generating honorifics, they rely primarily on statistical patterns and lack the deeper pragmatic understanding necessary for contextual competence.
TajPersLexon: A Tajik–Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP
Mullosharaf Kurbonovich Arabov
This work introduces TajPersLexon, a curated Tajik–Persian parallel lexical resource of 40,112 word and short-phrase pairs for cross-script lexical retrieval, transliteration, and alignment in low-resource settings. We conduct a comprehensive CPU-only benchmark comparing three methodological families: (i) a lightweight hybrid pipeline, (ii) neural sequence-to-sequence models, and (iii) retrieval methods. Our evaluation establishes that the task is essentially solvable, with neural and retrieval baselines achieving 98-99% top-1 accuracy. Crucially, we demonstrate that while large multilingual sentence transformers fail on this exact lexical matching, our interpretable hybrid model offers a favorable accuracy-efficiency trade-off for practical applications, achieving 96.4% accuracy in an OCR post-correction task. All experiments use fixed random seeds for full reproducibility. The dataset, code, and models will be publicly released.
A Computational Approach to Language Contact – A Case Study of Persian
Ali Basirat | Danial Namazifard | Navid Baradaran Hemmati
We investigate structural traces of language contact in the intermediate representations of a monolingual language model. Focusing on Persian (Farsi) as a historically contact-rich language, we probe the representations of a Persian-trained model when exposed to languages with varying degrees and types of contact with Persian. Our methodology quantifies the amount of linguistic information encoded in intermediate representations and assesses how this information is distributed across model components for different morphosyntactic features. The results show that universal syntactic information is largely insensitive to historical contact, whereas morphological features such as CASE and GENDER are strongly shaped by language-specific structure, suggesting that contact effects in monolingual language models are selective and structurally constrained.
Polarization detection in low-resource and mid-resource languages remains a significant challenge for social understanding. This paper presents the first comprehensive benchmark for evaluating transformer-based models on detecting polarized language in Persian (also called Farsi) social media. The aim is to evaluate (1) whether and how fine-tuning the pre-trained models has a substantial impact; (2) how Persian-specific monolingual models compare to multilingual ones for this task; (3) whether transfer learning from models trained on other languages, such as culturally-distant English and culturally-close[er] Turkish and Arabic, can benefit this task; and (4) how competitive Large Language Models (LLMs) are in a zero-shot setting. Our evaluation of ten transformer-based models and two LLMs on a publicly available Farsi polarization dataset shows promising findings, highlighting both the strengths and limitations of each approach.
ParsCORE: The Persian Corpus of Online Registers
Alireza Razzaghi | Erik Henriksson | Veronika Laipalla
Despite recent advances in automatic web register (genre) labeling and its applications to web-scale datasets and LLM development, the effectiveness of these tools for digitally low-resource languages remains unclear. This study introduces ParsCORE, the first large-scale collection of Persian web registers (genres), and evaluates deep learning models for register classification and keyword analysis across major registers. Using 2,000 human-annotated documents, the models achieved a micro F1-score of 0.76. The findings provide a foundation for future research on the linguistic and cultural specificities of Persian registers.
PMWP: A Benchmark for Math Word Problem Solving in Persian
Marzieh Abdolmaleki | Mehrnoush Shamsfard | Veronique Hoste | Els Lefever
Mathematical reasoning captures fundamental aspects of human cognitive ability. Although recent advances in LLMs have led to substantial improvements in automated mathematical problem solving, most existing benchmarks remain focused on English. As a result, robust mathematical reasoning remains a challenging and insufficiently explored capability for underrepresented languages including Persian. To address this gap, we introduce PMWP, the first dataset of 15K elementary-level Persian math word problems that supports both supervised training and evaluation of reasoning models. By expanding mathematical reasoning resources beyond English, PMWP contributes to the development of multilingual AI systems with stronger reasoning capabilities. In this work, we conduct a systematic evaluation of the Persian math word problem solving capabilities of different state-of-the-art LLMs. Our results indicate that DeepSeek-V3 exhibits reduced language bias when problem texts are translated into English, while Gemini-2.5-Flash achieves the highest equation value accuracy (72.02%) in Persian. In addition, we investigate parameter-efficient adaptation for equation generation by applying LoRA-based fine-tuning to LLaMA-3-8B and Qwen-2.5-7B. Our results show that, following fine-tuning, these open-weight models achieve 91.65% and 92.53% exact equation match accuracy, respectively. Overall, our findings provide insights into the comparative strengths and limitations of proprietary and open-weight models for mathematical reasoning in Persian.
APARSIN: A Multi-Variety Sentiment and Translation Benchmark for Iranic Languages
Sadegh Jafari | Tara Azin | Farhad Roodi | Zahra Dehghani Tafti | Mehrdad Ghadrdan | Elham Vatankhahan Esfahani | Aylin Naebzadeh | Mohammadhadi Shahhosseini | Ghafoor Khan | Kazem Forghani | Danial Namazi | Seyed Mohammad Hossein Hashemi | Farhan Farsi | Mohammad Osoolian | Maede Mohammadi | Mohammad Erfan Zare | Muhammad Hasnain Khan | Muhammad Hussain | Nooreen Zaki | Joma Mohammadi | Shayan Bali | Mohammad Javad Ranjbar | Els Lefever | Veronique Hoste
The Iranic language family includes many underrepresented languages and dialects that remain largely unexplored in modern NLP research. We introduce APARSIN, a multi-variety benchmark covering 14 Iranic languages, dialects, and accents, designed for sentiment analysis and machine translation. The dataset includes both high and low-resource varieties, several of which are endangered, capturing linguistic variation across them. We evaluate a set of instruction-tuned Large Language Models (LLMs) on these tasks and analyze their performance across the varieties. Our results highlight substantial performance gaps between standard Persian and other Iranic languages and dialects, demonstrating the need for more inclusive multilingual and dialectally diverse NLP benchmarks.
One Language, Three of Its Voices: Evaluating Multilingual LLMs Across Persian, Dari, and Tajiki on Translation and Understanding Tasks
Noor Mairukh Khan Arnob | Abu Bakar Siddique Mahi
The Iranian linguistic family is pluricentric, encompassing Iranian Persian, Dari (Afghanistan), and Tajiki (Tajikistan). While Multilingual Large Language Models (MLLMs) claim broad coverage, their robustness across these regional variants and script differences (Perso-Arabic vs. Cyrillic) remains under-explored, particularly in the open-weight landscape. We evaluate five open-weight models from the Qwen, Bloomz, and Gemma families across four downstream tasks: Sentiment Analysis, Machine Translation (MT), NLI, and QA. Utilizing a dataset of over 240,000 processed samples, we observe severe performance disparities. While the fine-tuned gemma-3-4b-persian achieves promising results on Iranian Persian (77.3% accuracy in Sentiment), almost all tested models appear to suffer catastrophic degradation on Tajiki script (dropping to 1.0 BLEU). These findings highlight a critical “script barrier” in current open-weight MLLM development for Central Asian languages. Code and data available here.
PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Mohammad Javad Ranjbar Kalahroodi | Heshaam Faili | Azadeh Shakery
Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset and model publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
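Framing punctuation restoration as token-level sequence labeling, as above, means converting punctuated text into (word, label) pairs where each word carries the punctuation that follows it. A minimal sketch; the label set and Persian examples are illustrative, not PersianPunc's actual scheme:

```python
# Illustrative label subset: Persian comma, question mark, full stop.
PUNCT_LABELS = {"،": "COMMA", "؟": "QUESTION", ".": "PERIOD"}

def text_to_labels(tokens):
    """Turn a punctuated token stream into (word, label) pairs.
    A word is labeled with the punctuation mark that follows it,
    or 'O' when no punctuation follows."""
    pairs = []
    for tok in tokens:
        if tok in PUNCT_LABELS:
            if pairs:  # attach the mark to the preceding word
                word, _ = pairs[-1]
                pairs[-1] = (word, PUNCT_LABELS[tok])
        else:
            pairs.append((tok, "O"))
    return pairs
```

A tagger trained on such pairs can then restore punctuation by inserting the predicted mark after each word, which keeps the underlying ASR text untouched — precisely the over-correction guarantee the abstract contrasts with LLM-based restoration.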
Shughni Machine Translation Enhanced by Donor Languages
Dmitry Novokshanov | Innokentiy S. Humonen | Ilya Makarov
This paper presents the first machine translation system for Shughni, an extremely low-resource Eastern Iranian language spoken in Tajikistan and Afghanistan. We fine-tune NLLB-200 models and explore auxiliary language selection through typological similarity and "super-donor" experiments. Our final Shughni–Russian model achieves a chrF++ score of 36.3 (45.7 on BivalTyp data), establishing the first computational translation resource for this language. Beyond reporting system performance, this work demonstrates a practical path toward supporting languages with virtually no prior MT resources. Our demo system with Shughni–Russian–English translation (Russian serves as a pivot language for the Shughni–English pair) is available on Hugging Face (https://huggingface.co/spaces/Novokshanov/Shughni-Translator).
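The chrF++ scores reported above combine character n-gram and word n-gram F-scores. A character-only sketch of the idea (omitting the word n-grams and smoothing of the official sacreBLEU implementation, which should be used for real evaluation):

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring spaces."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_simplified(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average character n-gram precision and recall
    over orders 1..max_n, combined into an F-beta score (beta=2 weights
    recall twice as heavily, as in the original metric)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue
        overlap = sum((hyp & ref).values())  # multiset intersection
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Character-level matching is what makes chrF++ comparatively forgiving of morphological variation, a useful property for a morphologically rich language like Shughni.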
Segmentation Strategy Matters: Benchmarking Whisper on Persian YouTube Content
Reihaneh Iranmanesh | Rojin Ziaei | Joe Garman
Automatic Speech Recognition (ASR) transcription accuracy remains highly sensitive to audio segmentation strategies, yet most benchmarks assume oracle timestamps unavailable in deployment. We systematically evaluate how audio segmentation affects Whisper’s performance on 10 hours of Persian YouTube content, comparing transcript-aligned (oracle) versus silence-based (realistic) approaches across contrasting acoustic conditions. Results reveal striking content-type dependency: podcast content benefits from timestamp segmentation (33% lower mean WER), while entertainment content favors silence-based segmentation (8% lower mean WER). This finding demonstrates that optimal segmentation must be content-aware, with silence detection better capturing natural boundaries in acoustically heterogeneous media while avoiding mid-utterance splits. We publicly release our evaluation framework, 10 hours of audio with gold transcripts, and segmentation results here: https://github.com/ri164-bolleit/persian-youtube-whisper-benchmark
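Silence-based segmentation of the kind compared above can be sketched over a frame-level energy envelope. The threshold and minimum-silence values below are placeholders, not the paper's settings:

```python
def silence_segments(energy, threshold=0.1, min_silence=3):
    """Split a frame-level energy envelope into speech segments,
    cutting at runs of at least `min_silence` low-energy frames.
    Returns (start, end) frame indices with `end` exclusive."""
    segments, start, silent_run = [], None, 0
    for i, e in enumerate(energy):
        if e >= threshold:
            if start is None:
                start = i       # speech (re)starts here
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence:
                # close the segment at the first silent frame
                segments.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:       # flush a trailing open segment
        segments.append((start, len(energy) - silent_run))
    return segments
```

Cutting only at sufficiently long silences is what avoids the mid-utterance splits the abstract identifies as harmful for acoustically heterogeneous content.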
Multi-modal Neural Machine Translation for Low-Resource Classical Persian Poetry: A Culture-Aware Evaluation
Soheila Ansari | Mounir Boukadoum | Fatiha Sadat
Persian poetry, particularly Rumi’s Masnaviye-Ma’navi, is known for its complex form, mystical narrative style, rich cultural information, and linguistic nuances, and is considered a low-resource domain. Translating Persian poetry is a challenging task for neural machine translation (NMT) systems. To address this challenge, we present a novel multimodal NMT system for Rumi’s Masnavi in four stages. First, we built a new multi-modal parallel Persian-English corpus of 26,571 aligned verses from all six books of the Masnavi, each paired with aligned audio recitations. Second, a strong text-only baseline is developed by applying domain-adaptive fine-tuning to mBART-50, pre-trained on a large monolingual Persian poetry corpus, followed by training on the parallel Masnavi corpus (train set). Third, we extend this model to a multi-modal scenario by adding aligned audio representations using a cross-attention fusion mechanism. Fourth, we conduct a culture-aware evaluation. We propose a culture-specific item (CSI) evaluation approach by developing a CSI classification system and a Persian-English CSI dictionary alongside the standard MT metrics. Our findings demonstrate that integrating audio recitations increased the BLEU score from 9.85 to 17.95, and raised CSI-recall from 61.60% to 82.04%, suggesting greater consistency in producing culturally meaningful terms.
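A CSI-recall metric like the one reported above can be sketched as follows. The one-rendering-per-item dictionary and substring matching are simplifying assumptions; the paper's actual resource and matching rules may differ:

```python
def csi_recall(translations, csi_dictionary):
    """Fraction of expected culture-specific items (CSIs) whose
    English rendering appears in the corresponding translation.
    `translations` is a list of (source_csis, translated_text) pairs;
    `csi_dictionary` maps each source CSI to one acceptable rendering
    (a simplification: a real dictionary may allow several)."""
    expected = hits = 0
    for source_csis, translation in translations:
        for csi in source_csis:
            expected += 1
            rendering = csi_dictionary.get(csi, "")
            if rendering and rendering.lower() in translation.lower():
                hits += 1
    return hits / expected if expected else 0.0
```

Unlike BLEU, which averages over all n-grams, this kind of targeted recall isolates exactly the culturally loaded terms the evaluation cares about.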
Proceedings of the Seventh Workshop on Teaching Natural Language Processing (TeachNLP 2026)
Matthias Aßenmacher | Laura Biester | Claudia Borg | György Kovács | Margot Mieskes | Sofia Serrano
Large language models (LLMs) are becoming central to natural language processing education, yet materials showing their mechanics are sparse. We present AnimatedLLM, an interactive web application that provides step-by-step visualizations of a Transformer language model. AnimatedLLM runs entirely in the browser, using pre-computed traces of open LLMs applied on manually curated inputs. The application is available at https://animatedllm.github.io, both as a teaching aid and for self-educational purposes.
Pedagogic Applications of Argument Maps to Enhance Critical Thinking: Thought Seeds, Argument Mapping, Collaborative Mapping
Sruti Narra
Argument maps are used extensively in Natural Language Processing (NLP), for training Large Language Models (LLMs) to analyze and generate arguments coherently. This paper discusses the pedagogic applications of the concept of argument mapping to enhance critical thinking in learning within educational contexts. The approach was found to be useful for shaping the thinking process during thesis writing and project courses and can be applied in higher education. In the age of rapid Gen AI advancement, it is important to embed critical thinking into education and such approaches can address challenges like AI overuse and potential loss of key skills and competences in learners. Argument mapping necessitates learners to visualize their thinking and while doing so, they not only achieve clarity of thought, but also make distinct connections between concepts in the form of arguments. Such clarity is at a much higher level compared to that achieved through concept or mind mapping as learners need to think in terms of well-formed claims and connections between them. In addition, collaborative argument mapping tasks could give learners opportunities for peer learning, and to concretize the abstract ideas through visualization and discussion.
The rapid advancement of Large Language Models (LLMs) presents both challenges and opportunities for Natural Language Processing (NLP) education. This paper introduces “Vibe Coding,” a pedagogical approach that leverages LLMs as coding assistants while maintaining focus on conceptual understanding and critical thinking. We describe the implementation of this approach in a senior-level undergraduate NLP course, where students completed seven labs using LLMs for code generation while being assessed primarily on conceptual understanding through critical reflection questions. Analysis of end-of-course feedback from 19 students reveals high satisfaction (mean scores 4.4-4.6/5.0) across engagement, conceptual learning, and assessment fairness. Students particularly valued the reduced cognitive load from debugging, enabling deeper focus on NLP concepts. However, challenges emerged around time constraints, LLM output verification, and the need for clearer task specifications. Our findings suggest that when properly structured with mandatory prompt logging and reflection-based assessment, LLM-assisted learning can shift focus from syntactic fluency to conceptual mastery, preparing students for an AI-augmented professional landscape.
LLM-based methods supersede many approaches in NLP at high velocity, making it necessary to adapt curricula. We argue that this effort also presents a chance to integrate LLM chatbots as learning support. We demonstrate (a) how we re-conceptualized an existing class segment on digital assistance systems to discuss LLM-based chatbots, (b) how we created a specialized instructional chatbot as a demonstrator that students could directly use for learning and revision and (c) how students’ initial perception of LLM-based AI changed due to instruction.
Language Technology Initiative: Framework for Teaching NLP and Computational Linguistics at the Universities in Latvia
Inguna Skadina | Jana Kuzmina | Marina Platonova | Tatjana Smirnova | Sergei Kruk
This short paper provides an overview of language technology related modules and courses developed at three leading universities of Latvia: the University of Latvia (UL), Riga Technical University (RTU), and Riga Stradiņš University (RSU).
Teaching NLP in the AI Era: Experiences from the University of Latvia
Inguna Skadina | Guntis Barzdins | Uldis Bojārs | Normunds Gruzitis | Pēteris Paikens
From being a niche technology with practical applications in translation and speech recognition, NLP now underpins the AI era through LLMs, promising universal economic impact in the future. Although the transition to the AI era is hyped by BigTech companies, practical adoption of LLM capabilities for economically impactful tasks and processes depends on educating specialists capable of applying them properly. Human-in-the-loop workflows, accuracy measurement, fine-tuning, and on-premises processing of sensitive data have become essential skills for applying NLP. This short paper introduces two language technology modules developed and piloted at the Faculty of Science and Technology of the University of Latvia.
With the advent of Large Language Models (LLMs), researchers outside the Natural Language Processing (NLP) field are interested in learning how to process textual data for their own domain research goals. They are particularly motivated to start experimenting directly with LLMs, implicitly neglecting the large amount of accumulated knowledge that NLP has to offer them. In this text, we briefly share our new lesson materials, which aim to show aspiring practitioners the strong connection between NLP fundamentals and LLMs, in the form of a two-day workshop. Our training material, mainly aimed at graduate students outside the NLP sphere who have basic technical knowledge and wish to start working with text, is fully open source and available online.
From Standard Transformers to Modern LLMs: Bringing Dialogue Models, RAG, and Agents to the Classroom
Maria Tikhonova | Viktoriia A. Chekalina | Artem Chervyakov | Alexey Zaytsev | Alexander Panchenko
Modern LLM education is increasingly centered on system building: grounding generation with retrieval, enabling tool use, and deploying models under latency and cost constraints. We present an updated release of our open course on Transformer-based LLMs and multimodal models (Nikishina et al., 2024). The update introduces topics that have become important since the first edition, namely a session on Retrieval-Augmented Generation (RAG), a hands-on session on tool-using agents, an API-based track for applied work with LLMs, and practical local inference with vLLM. We also add a dedicated session on multimodal dialog models with a focus on dialog grounding. We enriched the course with a discussion of long-context transformers, focusing on KV-cache efficiency along with the related models and benchmarks. All materials are released online.
Which course? Discourse! Teaching Discourse and Generation in the Era of LLMs
Junyi Jessy Li | Yang Janet Liu | Kanishka Misra | Valentina Pyatkin | William Sheffield
The field of NLP has undergone vast, continuous transformations over the past few years, sparking debates going beyond discipline boundaries. This raises important questions for education: how do we design courses that bridge sub-disciplines in this shifting landscape? This paper explores this question from the angle of discourse processing, an area with rich linguistic insights and computational models for the intentional, attentional, and coherence structure of language. Discourse is highly relevant for open-ended or long-form text generation, yet this connection is under-explored in existing undergraduate curricula. We present a new course, "Computational Discourse and Natural Language Generation". The course is collaboratively designed by a team with complementary expertise and was offered for the first time in Fall 2025 as an upper-level undergraduate course, cross-listed between Linguistics and Computer Science. Our philosophy is to deeply integrate the theoretical and empirical aspects, and create an exploratory mindset inside the classroom and in the assignments. This paper describes the course in detail and concludes with takeaways from an independent survey as well as our vision for future directions.
Student demand for NLP training now spans linguistics, computer science, data science, and applied fields, producing cohorts with uneven preparation. We report on a four-course curriculum used in an M.S. Computational Linguistics program: an undergraduate on-ramp, a two-course graduate core (classical methods and neural/LLM methods), and a rotating special-topics seminar. We describe the role of each course, the bridging strategy that keeps the core sequence focused, and assessment patterns that emphasize error analysis, experimental reasoning, and reproducible practice. The goal is a set of reusable curricular design patterns for mixed-background programs facing rapid topic turnover in NLP.
NLP researchers regularly invoke abstract concepts like "interpretability," "bias," "reasoning," and "stereotypes," without defining them. Each subfield has a shared understanding or conceptualization of what these terms mean and how we should treat them, and this shared understanding is the basis on which operational decisions are made: datasets are built to evaluate these concepts, metrics are proposed to quantify them, and claims are made about systems. But what do they mean, what _should_ they mean, and how should we measure them? I outline a seminar I created for students to explore these questions of conceptualization and operationalization, with an interdisciplinary reading list and an emphasis on discussion and critique.
Bridging Applied Experience and Research Contexts in Ukrainian NLP Education
Yurii Paniv | Viktoriia Makovska
We present an open, bachelor-level Natural Language Processing (NLP) course developed at Ukrainian Catholic University and delivered in Ukrainian. The course addresses several challenges in NLP education: adapting predominantly English-centric materials to a different linguistic and cultural context, supporting students with heterogeneous technical backgrounds, and balancing foundational theory with industry-relevant skills. All course materials, including lecture slides, notebooks, video recordings, and assignments, are publicly available. We describe our pedagogical design choices, focusing on culturally adapted tasks, integrated ethics, project-based assessment, and continuous student feedback. Our experience demonstrates that it is feasible to build a comprehensive and modern NLP curriculum from scratch in a non-English context, even when instructors come primarily from industry backgrounds.
Teaching Modern NLP and LLMs at Kyiv School of Economics: A Practice-Oriented Course with Ukrainian Language Focus
Roman Kyslyi | Anton Bazdyrev
This paper describes a Natural Language Processing (NLP) course taught at Kyiv School of Economics. The course consists of 16 lectures and 5 practical assignments, and focuses on modern large language models (LLMs) while preserving an introduction to classical NLP. Practical assignments are organized using Kaggle, where GPU support plays an important role in enabling students to work with complex models. A key feature of the course is the focus on Ukrainian in the practical assignments, contributing to the development of Ukrainian NLP expertise and community. The course is taught primarily in person, but due to the ongoing war in Ukraine, also includes a full online participation option and additional weekly Q&A sessions.
Practising responsibility: Ethics in NLP as a hands-on course
Malvina Nissim | Viviana Patti | Beatrice Savoldi
As Natural Language Processing (NLP) systems become more pervasive, integrating ethical considerations into NLP education has become essential. However, this presents inherent challenges in curriculum development: the field’s rapid evolution from both academia and industry, and the need to foster critical thinking beyond traditional technical training. We introduce our course on Ethical Aspects in NLP and our pedagogical approach, grounded in active learning through interactive sessions, hands-on activities, and “learning by teaching” methods. Over four years, the course has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds; it has also yielded many reusable products, both in the form of teaching materials and in the form of actual educational products aimed at diverse audiences, made by the students themselves. By sharing our approach and experience, we hope to provide inspiration for educators seeking to incorporate social impact considerations into their curricula.
Beyond Passive Viewing: A Pilot Study of a Hybrid Learning Platform Augmenting Video Lectures with Conversational AI.
Mohammed Abraar | Raj Dandekar | Rajat Dandekar | Sreedath Panat
The exponential growth of AI education has brought millions of learners to online platforms, yet this massive scale has simultaneously exposed critical pedagogical shortcomings. Traditional video-based instruction, while cost-effective and scalable, demonstrates systematic failures in both sustaining learner engagement and facilitating the deep conceptual mastery essential for AI literacy. We present a pilot study evaluating a novel hybrid learning platform that integrates real-time conversational AI tutors with traditional video lectures. Our controlled experiment (N = 58, mean age M = 21.4, SD = 2.8) compared traditional video-based instruction with our AI-augmented video platform. This study employed a sequential within-subjects design where all participants first completed the traditional video condition followed by the AI-augmented condition, providing direct comparisons of learning outcomes. We measured learning effectiveness through immediate post-tests and delayed retention assessments (2-week delay). Results suggest improvements in learning performance: immediate post-test performance showed a large effect size (d = 1.505), with participants scoring 8.3 points higher after AI-augmented instruction (91.8 vs. 83.5 out of 100, p < .001). Behavioral analytics revealed increased engagement duration (71.1% improvement with AI tutoring) in the experimental group. This pilot study provides preliminary evidence that conversational AI tutors may enhance traditional educational delivery, suggesting a potential avenue for developing scalable, adaptive learning systems.
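The effect size above is a Cohen's d; the standard pooled-standard-deviation formula can be sketched as follows (the study may use a within-subjects variant, so this is illustrative only):

```python
import math

def cohens_d(group1, group2):
    """Cohen's d between two score lists, using the pooled standard
    deviation with sample (n-1) variances."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd
```

By convention, |d| around 0.2 is a small effect, 0.5 medium, and 0.8 large, which is why d = 1.505 is described as a large effect.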
From Sentiment to Interpretation: Teaching NLP for Literary Understanding Across Educational Contexts
Karl-Emil Kjær Bilstrup | Kirstine Nielsen Degn | Morten Schultz | Alexander Conroy | Jens Bjerring-Hansen | Daniel Hershcovich
We developed Litteraturmaskinen, a graphical annotation and exploration interface that enables students to collaborate on labeling sentiment in literary passages, comparing their decisions with model predictions, and justifying their interpretations. We deployed the system in two educational settings: A university module on computational literary studies and regular teaching by two first-language high school teachers. Based on observations, collected teaching plans, and interviews, we find that tensions between epistemic and academic traditions are both a barrier for integration and a productive entry point for literary reflection and argumentation. We conclude with recommendations for integrating NLP into literature and first-language curricula.
The ubiquitous adoption of large language models by students prompts teachers to redesign courses and evaluation methods, especially in computer science and natural language processing (NLP), where the impact is more tangible. Our contribution is two-fold. First, we attempt to define invariants for the role of education itself given the over-abundance of information that appears to be more accessible than ever before. Then, we present our approach and materials used for an introductory course in NLP for undergraduate students, drawing inspiration from software engineering best practices. Our vision regarding large language models is to rely on local models to cultivate a sense of ownership and sovereignty in an age where every bit of independence and privacy gets eroded.
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
AMIYA Shared Task: Arabic Modeling In Your Accent at VarDial 2026
Nathaniel R. Robinson | Shahd Abdelmoneim | Anjali Kantharuban | Otba Alsboul | Salima Lamsiyah | Kelly Marchisio | Kenton Murray
Arabic, often considered a single language, actually describes a wide variety of sometimes mutually unintelligible language varieties. While large language models (LLMs) have revolutionized natural language processing (NLP) with rapid advances, these models still best serve speakers of high-resource and standard language varieties. One particular deficiency of theirs is in dialectal Arabic. We present the first ever shared task for dialectal Arabic language modeling: Arabic Modeling In Your Accent, or AMIYA. The goal of the shared task was to develop LLMs that could (1) respond in the correct dialectal variety when explicitly or implicitly prompted to, (2) translate between dialectal Arabic and standard Arabic or English, (3) adhere to LLM instructions in dialectal Arabic, and (4) produce fluent Arabic outputs. We called for submissions in the dialectal varieties of five countries: Morocco, Egypt, Palestine, Syria, and Saudi Arabia. We received 45 submitted systems from six participating teams. We saw positive results from supervised fine-tuning on a translation objective, and reinforcement learning to improve dialectness. Manual evaluation also showed that some systems had learned to output dialectal words or phrases, but at the expense of actual fluency or coherence. Overall, the most effective system involved continual pre-training and supervised fine-tuning of 12 candidate LLMs, followed by selection of the best performing models.
Far Out: Evaluating Language Models on Slang in Australian and Indian English
Deniz Kaya Dilsiz | Dipankar Srirag | Aditya Joshi
Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: WEB, containing 377 web-sourced usage examples from Urban Dictionary, and GEN, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP*), and target word selection (TWS). Our results reveal three key findings: (1) higher average model performance on TWS than on TWP and TWP*, with average accuracy increasing from 0.03 to 0.49; (2) stronger average model performance on WEB than on GEN, with average similarity scores increasing by 0.03 and 0.05 on the TWP and TWP* tasks, respectively; (3) en-IN tasks outperform en-AU when averaged across all models and datasets, with TWS showing the largest disparity, increasing average accuracy from 0.44 to 0.54. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly for slang expressions, even in a technologically rich language such as English.
Effects of Speaker Bias in Dialect Identification and Automatic Transcription with Self-Supervised Speech Models
Olli Kuparinen
A major issue in audio modeling is speaker bias, in which the models learn language external traits, such as a speaker’s timbre or pitch, and use this information as a shortcut to a language task. This is especially problematic for dialectology, as it is typical in dialect corpora that only a few speakers represent a complete dialect area. In this paper, we explore the effects of speaker bias in two dialectal tasks: dialect identification and automatic dialectal transcription. We build two different data partitions of dialect interviews in Finnish and Norwegian: 1) a speaker dependent partition in which all of the speakers appear in training, development, and test sets, and 2) a speaker independent partition where each speaker only appears in exactly one set. We further experiment with modifications of the training data by augmenting the original audio with pitch shifts and noise, as well as changing the original speakers’ voices with voice conversion models. We show that the dialect identification models are highly affected by speaker bias, whereas automatic dialectal transcription models are not. The audio modifications do not offer major performance gains for either of the languages or tasks.
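The speaker-independent partition described above requires splitting by speaker before splitting utterances. A minimal sketch, with illustrative ratios rather than the paper's actual splits:

```python
import random

def speaker_independent_split(utterances, ratios=(0.8, 0.1, 0.1), seed=0):
    """Partition (speaker_id, item) records into train/dev/test so
    that every speaker appears in exactly one split, preventing
    models from exploiting speaker identity as a shortcut."""
    speakers = sorted({spk for spk, _ in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n = len(speakers)
    n_train = int(ratios[0] * n)
    n_dev = int(ratios[1] * n)
    groups = (set(speakers[:n_train]),
              set(speakers[n_train:n_train + n_dev]),
              set(speakers[n_train + n_dev:]))
    splits = ([], [], [])
    for spk, item in utterances:
        for split, group in zip(splits, groups):
            if spk in group:
                split.append((spk, item))
    return splits  # (train, dev, test)
```

A speaker-dependent partition, by contrast, would shuffle utterances directly, letting each speaker's voice appear in all three sets — the condition under which the paper finds inflated dialect identification scores.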
OcWikiDialects: A Wikipedia Dataset With Rich Metadata for Occitan Dialect Identification
Oriane Nédey | Rachel Bawden | Thibault Clérice | Benoît Sagot
Occitan is a Romance language spoken mostly in the South of France and characterised by rich dialectal variation, which can pose problems for certain NLP tools. This shortfall is largely attributable to the scarcity of dialect-annotated corpora, in a context where linguistic classification within the Occitan dialect continuum is still debated and major nomenclatures, such as ISO 639, fail to provide granular codes for varieties below the generic "Occitan" label. In this paper, we introduce OcWikiDialects, a new dataset comprising articles from the Occitan Wikipedia. The corpus features rich metadata, including dialect labels, and is segmented at both paragraph and sentence levels. Combined with previously released datasets, we explore approaches for Occitan dialect identification by training three types of model on up to 8 labels: linear SVM classifiers based on word and character n-grams, FastText classifiers based on pretrained vectors, and BERT-based neural classifiers adapted through fine-tuning. Evaluations across in- and out-of-domain test sets demonstrate the substantial impact of our new dataset for the task. However, a peak macro-averaged F1 score of 58.15 underscores persistent challenges for underrepresented Occitan varieties, supported by our per-dialect analysis. Code, dataset and models are available: https://github.com/DEFI-COLaF/OcWikiDialects.
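Character n-gram features of the kind used by the SVM classifiers above can be illustrated with a much simpler nearest-profile baseline (not one of the models evaluated in the paper): pool the training text per dialect into an n-gram profile and assign the most similar one by cosine over counts.

```python
from collections import Counter
import math

def char_ngram_profile(text, n=3):
    """Counter of character trigrams, with word-boundary padding."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify_dialect(text, training_texts):
    """Assign `text` the label whose pooled character-n-gram profile
    is most similar. `training_texts` maps label -> list of documents."""
    profiles = {label: char_ngram_profile(" ".join(docs))
                for label, docs in training_texts.items()}
    query = char_ngram_profile(text)
    return max(profiles, key=lambda label: cosine(query, profiles[label]))
```

Character n-grams capture orthographic signals (diacritics, frequent endings, function words) that distinguish closely related varieties even with little training data, which is why they remain a strong baseline for dialect identification.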
Language Mixture to Develop Accurate Galician Dependency Parsers: An Exploration of Its Effects
Xabier Irastortza-Urbieta | José M. García-Miguel | Marcos Garcia
The development of accurate syntactic parsers remains a challenge for low-resource languages. To overcome it, the literature has proposed leveraging syntactic annotations from typologically related languages. This work investigates the viability and adequacy of this approach for Galician, evaluating the use of annotations from major Romance languages as source data. Our methodology extends beyond standard automatic evaluation to incorporate a detailed error analysis, which precisely quantifies the effects of multilingual training and assesses the practical scalability of the method. The results establish the necessity of embedding models for effective cross-lingual transfer and demonstrate that even languages not particularly close can yield adequate parsers. This work confirms the benefits of cross-lingual data augmentation while delineating its scalability limits. Furthermore, the error analysis identifies specific, typologically conditioned grammatical dependencies that remain persistent challenges for accurate dependency parsing.
We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian–Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
German-English Code-Switching in Large Language Models
Firat Cem Aksüt | Stefan Hillmann | Pia Knoeferle | Sebastian Möller
Code-Switching (CS) is common in multilingual communication, yet it is unclear how well current Large Language Models (LLMs) reproduce naturally occurring switching patterns. This paper studies German–English CS ("Denglisch") generated by GPT-4o and LLaMA-3.3, using Reddit data from the Denglisch Corpus as a reference. Model outputs are compared to authentic posts using established CS metrics (M-Index, I-Index, CESAR), an analysis of Shared Lexical Items (SLIs) as switch triggers, and a human evaluation of perceived naturalness and fluency. Both models approximate global CS characteristics but differ from real data in diversity and complexity. LLaMA-3.3 more closely matches corpus-level metrics, whereas GPT-4o produces more conservative switching that is rated as significantly more natural and fluent. In addition, GPT-4o reproduces SLI-triggered switching patterns similar to those found in authentic data, while this effect is weaker for LLaMA-3.3.
Perplexity as a Metric for Dialectal Distance: A Computational Study of Greek Varieties
Stergios Chatzikyriakidis | Erofili Psaltaki | Dimitrios Papadakis | Erik Henriksson | Veronika Laippala
In this paper, we use LLM perplexity as a measure to assess Greek dialectal distance. We test seven models on Standard Modern Greek (SMG) and eight dialects, namely Heptanesian, Cypriot, Maniot, Pontic, Northern, Cretan, Tsakonian, and Griko. Using samples of 5k, 15k, and 25k tokens from the GRDD+ corpus for each variety, we find a consistent dialect ranking across models, with Heptanesian closest to SMG, and Griko most distant (perplexity ratio 3.6–14.5× depending on model). These results are largely in agreement with theoretical dialectological knowledge. For example, Tsakonian consistently appears distant in all measures, reflecting its status as the sole Doric descendant, while Heptanesian appears closer by all metrics, pointing to its status as one of the dialects used to shape the official variety. Perplexity correlates strongly with Bits Per Character (mean r = 0.94) and Normalized Compression Distance (mean r = 0.87, range 0.76–0.93), providing support for its use as a dialectometric tool. However, a number of important confounds are also found. First, tokenization effects compress Llama 2’s perplexity range. Second, genre artifacts seem to inflate the results for Cretan. Third, potential training data contamination likely reduces perplexity for Cypriot and Pontic. Lastly, we find that Greek-specific models like Meltemi and Krikri do not consistently outperform general models.
A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments
Anne-Marie Lutgen | Alistair Plum | Christoph Purschke
This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in “noisy” or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
Onomasiological Sense Alignment Across Dialect Dictionaries. A Taxonomy-Constrained LLM Classification
Nathalie Mederake | Nico Urbach | Hanna Fischer | Alfred Lameli
We propose a taxonomy-guided approach to semantic alignment that assigns lexicographic senses to an onomasiological taxonomy derived from the Hallig–Wartburg/Post system. Using an LLM under strict taxonomic constraints, short and heterogeneous meaning descriptions are assigned to a common conceptual space. Evaluation against expert annotation shows that run-to-run model agreement (kappa = 0.73) closely matches human agreement (kappa = 0.74), with robustness at coarse taxonomic levels and predictable degradation at finer granularity. A qualitative network analysis demonstrates the resulting potential for cross-dictionary exploration of dialectal variation in semantics.
On the Intelligibility of Romance Language Varieties: Spanish and Portuguese in Europe and America
Liviu P. Dinu | Ana Sabina Uban | Teodor-George Marchitan | Ioan-Bogdan Iordache | Simona Georgescu
Mutual intelligibility within language families presents a significant challenge for multilingual NLP, particularly due to the prevalence of dialectal variation and asymmetric comprehension. In this paper, we present a corpus-based computational analysis to quantify linguistic proximity across Romance language variants, with a focus on major Spanish (Argentine, Chilean and European) and Portuguese (Brazilian and European) varieties and the other main Romance languages (Italian, French, Romanian). We apply a computational metric of lexical intelligibility based on surface and semantic similarity of related words to measure mutual intelligibility for the five main Romance languages in relation to the Spanish and Portuguese varieties studied.
Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties
Akriti Dhasmana | Aarohi Srivastava | David Chiang
We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance is generally associated with phylogenetic distance across languages, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically-related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into challenges faced by ASR systems on dialectal and non-standardized speech.
Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation
Abdullah Alabdullah | Lifeng Han | Chenghua Lin
Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation is a challenging task in Machine Translation (MT) due to significant lexical, syntactic, and semantic divergences between Arabic dialects and MSA. Existing automatic evaluation metrics and general-purpose human evaluation frameworks struggle to capture dialect-specific MT errors, hindering progress in translation assessment. This paper introduces Ara-HOPE, a human-centric post-editing evaluation framework designed to systematically address these challenges. The framework includes a five-category error taxonomy and a decision-tree annotation protocol. Through comparative evaluation of three MT systems (Arabic-centric Jais, general-purpose GPT-3.5, and baseline NLLB-200), Ara-HOPE effectively highlights systematic performance differences between these systems. Our results show that dialect-specific terminology and semantic preservation remain the most persistent challenges in DA-MSA translation. Ara-HOPE establishes a new framework for evaluating Dialectal Arabic MT quality and provides actionable guidance for improving dialect-aware MT systems. For reproducibility, we make the annotation files and related materials publicly available at https://github.com/abdullahalabdullah/Ara-HOPE.
Indic-TunedLens: Interpreting Multilingual Models in Indian Languages
Mihir Panchal | Deeksha Varshney | Mamta . | Asif Ekbal
Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low-resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/MihirRajeshPanchal/IndicTunedLens. Our code is available at https://github.com/MihirRajeshPanchal/IndicTunedLens.
Building ASR Resources for the Hutsul Dialect of Ukrainian
Roman Kyslyi | Artem Orlovskyi | Pavlo Khomenko | Bohdan Onyshchenko | Zakhar Guzii
Dialectal speech remains largely underexplored in Automatic Speech Recognition (ASR) research, particularly for Slavic languages. While Ukrainian ASR systems have rapidly improved in recent years with the adoption of Whisper, XLS-R, and Wav2Vec-based models, performance on dialectal variants remains unknown and often significantly degraded. In this work, we present the first dedicated effort to build ASR resources for the Hutsul dialect of Ukrainian. We develop a data preparation and segmentation pipeline, evaluate multiple forced alignment strategies, and benchmark state-of-the-art ASR models under zero-shot and fine-tuned conditions. We evaluate results using WER and CER, demonstrating that large multilingual ASR models struggle with dialectal speech, while lightweight fine-tuning produces substantial improvements. All scripts, alignment tools, and training recipes are made publicly available to support future research on Ukrainian dialect speech.
From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models
Abdulmuizz Khalak | Abderrahmane Issam | Gerasimos Spanakis
Arabic Language Models (LMs) are pretrained predominantly on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. While MSA as the standard written variety is commonly used in formal settings, people speak and write online in various dialects that are spread across the Arab region. This poses limitations for Arabic LMs, since the dialects vary in their similarity to MSA. In this work we study cross-lingual transfer of Arabic models using probing on three Natural Language Processing (NLP) tasks, and representational similarity. Our results indicate that transfer is possible but disproportionate across dialects, which we find to be partially explained by their geographic proximity. Furthermore, we find evidence for negative interference in models trained to support all Arabic dialects. This calls into question their degree of similarity, and raises concerns for cross-lingual transfer in Arabic models.
Extending ASR Evaluation Resources for Modern Greek Dialects
Chara Tsoukala | Stavros Bompolas | Antigoni Margariti | Konstantina Panagiotou | Maria Elisavet Plaiti | Nefeli Tzanakaki | Petros Karatsareas | Angela Ralli | Antonios Anastasopoulos | Stella Markantonatou
Recent progress in Automatic Speech Recognition (ASR) has primarily benefited high-resource standard languages, while dialectal speech remains challenging and underexplored. We present an expanded benchmark for low-resource Modern Greek dialects, covering Aperathiot, Cretan, Lesbian, and Cappadocian, spanning southern, northern, and contact-influenced varieties with varying degrees of divergence from Standard Modern Greek. The benchmark provides dialectal transcriptions in the Greek alphabet, following SMG-based orthographic conventions, while preserving dialectal lexical and morphophonological forms. Using this benchmark, we evaluate state-of-the-art multilingual ASR models in a zero-shot setting and by further fine-tuning per dialect. Zero-shot results reveal a clear performance gradient with dialectal distance from Standard Modern Greek, with best WERs ranging from about 60-70% for southern dialects to over 80% for Lesbian and nearly 97% for Cappadocian. Fine-tuning substantially reduces error rates (up to 47% relative WER improvement), with Cappadocian remaining the most challenging variety (best WER 68.17%). Overall, our results highlight persistent limitations of current pretrained ASR models under dialectal variation and the need for dedicated benchmarks and adaptation strategies.
How Should We Model the Probability of a Language?
Rasul Dent | Pedro Ortiz Suarez | Thibault Clérice | Benoît Sagot
Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
Bridging Dialectal Variation: A Phonetic Transcription Tool for Tamil
Ahrane Mahaganapathy | Sumirtha Karunakaran | Kavitha Navakulan | Kengatharaiyer Sarveswaran
Phonetic transcription is vital for speech processing and linguistic documentation, particularly in languages like Tamil with complex phonology and dialectal variation. Challenges such as consonant gemination, retroflexion, vowel length, and one-to-many grapheme-phoneme mappings are compounded by limited data on Sri Lankan Tamil dialects. We present a dialect-aware, rule-based transcription tool for Tamil that supports Indian and Jaffna Tamil, with extensions underway for other dialects. Using a two-stage pipeline (Tamil script to Latin, then to IPA with context-sensitive rules), the tool handles dialect shifts. A real-time interface enables dialect selection. Evaluated on a 7,830-word corpus, it achieves 94.54% accuracy for Jaffna Tamil, outperforming tools such as eSpeak NG, advancing linguistic preservation and accessible speech technology for Tamil communities.
Regional Variation in the Performance of ASR Models on Croatian and Serbian
Tanja Samardžić | Peter Rupnik | Nikola Ljubešić
Regional variation was a limiting factor for automatic speech recognition (ASR) before large language models. With the new technology, speech processing becomes more general, which opens the question of how to use data in similar languages such as Croatian and Serbian. In this paper, we analyse model performance in the currently available train-test scenarios with the goal of better understanding the mutual interference of these two languages. Our findings suggest that better performing models are not very sensitive to the regional variation. Training from scratch in one of the languages can give good results on both of them, while fine-tuning large pre-trained multilingual models on smaller data sets does not give the expected results.
Syllable Structures Across Arabic Varieties
Abdelrahim Qaddoumi | Jordan Kodner | Salam Khalifa | Ellen Broselow | Owen Rambow
This study compares the syllable structures of nine Arabic varieties from Wiktionary, using a computational syllabifier. It further investigates methods for learning syllable boundaries in unsyllabified words transcribed in the International Phonetic Alphabet (IPA). The syllabification algorithm is evaluated under three conditions: (i) Default, employing fixed rules; (ii) Joint, learning onsets and codas across all varieties collectively; and (iii) Per-variety, learning onsets and codas specific to each variety. Results indicate that the default configuration yields the highest accuracy, ranging from 97.05% to 100%. The per-variety approach achieves 90.64% to 100% accuracy, while the joint approach ranges from 84.63% to 94.74%. A cross-variety analysis using Jensen-Shannon divergence reveals three principal groupings: Egyptian, Hejazi, and Modern Standard Arabic are closely related; Levantine and Gulf varieties constitute a second cluster; and Juba Arabic, Maltese, and Moroccan emerge as outliers. A cleaned dataset encompassing all nine varieties is also provided.
Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models
Ali Mekky | Mohamed El Zeftawy | Lara Hassan | Amr Keleg | Preslav Nakov
Although Arabic Dialect Identification (ADI) was modeled as a single-label classification task for a long time, recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LahjatBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system.
OpenLID-v3: Improving the Precision of Closely Related Language Identification – An Experience Report
Mariia Fedorova | Nikolay Arefyev | Maja Buljan | Jindřich Helcl | Stephan Oepen | Egil Rønningstad | Yves Scherrer
Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During the development we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages.
Improving Dialect Robustness in Large Language Models via LoRA and Mixture-of-Experts
Sanjh Maheshwari | Aniket Singh Rajpoot | Oana Cocarascu | Mamta .
Despite the success of large language models (LLMs) in a wide range of applications, it has been shown that their performance varies across English dialects. Differences among English dialects are reflected in vocabulary, syntax, and writing style, and can adversely affect model performance. Several studies evaluate the dialect robustness of LLMs, yet research on enhancing their robustness to dialectal variation remains limited. In this paper, we propose two parameter-efficient frameworks for improving dialectal robustness in LLMs: DialectFusion, where we train separate LoRA layers for each dialect and apply different LoRA merging methods, and DialectMoE, which is built on top of Mixture of Experts LoRA and introduces multiple LoRA-based experts to the feed-forward layer to internally model the dialectal dependencies. Our comprehensive analysis on five open-source LLMs for sentiment and sarcasm tasks in zero- and few-shot settings shows that our proposed approaches enhance the dialect robustness of LLMs and outperform instruct and LoRA fine-tuning based approaches.
Evaluation Framework for Transfer Learning between Closely Related Lects: A Case Study of Lemko
Ilia Afanasev
The creation of a robust evaluation methodology is one of the pivotal issues for transfer learning between closely related lects. The current study proposes to resolve this issue by concisely implementing a group of evaluation methods that enable a more systematic qualitative analysis of errata (for instance, string similarity measures to assess lemmatisation more effectively). The paper introduces a robustness score, a metric that aims to assess the stability of model performance across different datasets. The case study is a morphosyntactic tagging of a small historical (beginning of the twentieth century) corpus of Lemko (Slavic clade, Transcarpathian area). It presents a diversity of cross-dependent tasks, made rather complex by the rich Lemko morphology, highly influenced by areal convergence processes. The tagger is a pre-trained Stanza. The study uses modern standard Ukrainian as the source language, as it is the closest to Lemko among the high-resource lects. The analysis reveals that linguistically-aware metrics improve the speed and accuracy of analysis of the errata, especially those caused by the differences between source and target lects. The key data contribution is the open-source dataset of Lemko, obtained during the tagging tasks. Future research directions include a larger-scale test that applies more models to a more extensive material.
Do Large Language Models Adapt to Language Variation across Socioeconomic Status?
Elisa Bassignana | Mike Zhang | Dirk Hovy | Amanda Cercas Curry
Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.
Aladdin-FTI @ AMIYA Three Wishes for Arabic NLP: Fidelity, Diglossia, and Multidialectal Generation
Jonathan Mutal | Perla Al Almaoui | Simon Hengchen | Pierrette Bouillon
Arabic dialects have long been under-represented in Natural Language Processing (NLP) research due to their non-standardization and high variability, which pose challenges for computational modeling. Recent advances in the field, such as Large Language Models (LLMs), offer promising avenues to address this gap by enabling Arabic to be modeled as a pluricentric language rather than a monolithic system. This paper presents Aladdin-FTI, our submission to the AMIYA shared task. The proposed system is designed to both generate and translate dialectal Arabic (DA). Specifically, the model supports text generation in Moroccan, Egyptian, Palestinian, Syrian, and Saudi dialects, as well as bidirectional translation between these dialects, Modern Standard Arabic (MSA), and English. The code and trained model will be released upon paper acceptance.
Maastricht University at AMIYA: Adapting LLMs for Dialectal Arabic using Fine-tuning and MBR Decoding
Abdulhai Alali | Abderrahmane Issam
Large Language Models (LLMs) are becoming increasingly multilingual, supporting hundreds of languages, especially high-resource ones. Unfortunately, dialectal varieties are still underrepresented due to limited data and linguistic variation. In this work, we adapt a pre-trained LLM to improve dialectal performance. Specifically, we use Low-Rank Adaptation (LoRA) fine-tuning on monolingual and English–Dialect parallel data, adapter merging, and dialect-aware MBR decoding to improve dialectal fidelity in generation and translation. Experiments on Syrian, Moroccan, and Saudi Arabic show that merging and MBR improve dialectal fidelity while preserving semantic accuracy. This combination provides a compact and effective framework for robust dialectal Arabic generation.
Dialectal Arabic continues to represent a persistent challenge for contemporary large language models, which are predominantly trained and optimized for Modern Standard Arabic (MSA) and therefore exhibit limited capability when processing colloquial varieties. In this study, a dedicated system developed for participation in the AMIYA shared task focusing on Syrian Arabic is presented. The proposed solution is based on the integration of parameter-efficient fine-tuning through Low-Rank Adaptation (LoRA) with prompt-guided inference, aiming to enhance dialectal adequacy and linguistic naturalness. Rather than emphasizing strict factual precision, the system is deliberately designed to prioritize fluent and authentic Syrian Arabic generation, in accordance with the evaluation principles adopted by the AL-QASIDA benchmark. This design choice reflects a focus on human-perceived language quality and dialectal fidelity, which are central to effective dialect-aware language modeling.
NUS-IDS at AMIYA/VarDial 2026: Improving Arabic Dialectness in LLMs with Reinforcement Learning
Sujatha Das Gollapalli | Mouad Hakam | Mingzhe Du | See-Kiong Ng
In this paper, we describe models developed by our team, NUS-IDS, for the Closed data track at the Arabic Modeling In Your Accent (AMIYA) shared task at VarDial 2026. The core idea behind our solution involves data augmentation enabled by a dialect classifier trained on AMIYA data. We effectively combine various translation, summarization, and question answering prompts with AMIYA training data to form dialectal prompts for use with state-of-the-art LLMs. Next, dialect predictions from our classifier on outputs from these LLMs are used to compile preference data for Reinforcement Learning (RL). We report model performance on dialectal Arabic from Egypt, Morocco, Palestine, Saudi Arabia, and Syria using FLORES+, a multilingual machine translation dataset. Our experiments illustrate that though our RL models show significant performance gains on dialectness scores, they underperform on translation metrics such as chrF++ compared to base LLMs.
MBZUAI at AMIYA Shared Task 2026: Adapting Open-Source LLMs for Dialectal Arabic
Rana Gaber | Yara Allam | Serag Amin | Ranwa Aly | Bashar Alhafni
This paper presents our contribution to the closed data track of the AMIYA Shared Task on Dialectal Arabic text generation. In this track, we train fully open-source Large Language Models (LLMs) on five Arabic dialects: Egyptian, Moroccan, Palestinian, Saudi, and Syrian, using the provided training datasets. We experiment with different base and instruct models using several pretraining and instruction tuning approaches. In total, five models were submitted, with three variants per dialect. Our best-performing models for the five dialects are ALLaM for Egyptian, LLaMA for Moroccan and Palestinian, and Aya for Saudi and Syrian.
A Closed-Track System for Palestinian Arabic in the AMIYA Shared Task
Khaleel Hamad | Ahmad Al-Najjar
We describe a closed-track system for modeling Palestinian Arabic developed for the AMIYA shared task using a parameter-efficient fine-tuning strategy. A 1.5B instruction-tuned language model was adapted with LoRA (Hu et al., 2021), updating only 0.28% of the model parameters, and trained on an aggregated set of conversations between Palestinians and resources covering both translation and generation. Model selection was guided by a comparative benchmark that prioritized performance efficiency and its tradeoffs. At the same time, the paper focuses on targeted error analysis as well as structured instruction following. These findings illustrate the viability of efficient adaptation methods for low-resource Arabic dialects and shed light on their current limitations.
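As a minimal sketch of the LoRA idea referenced above (Hu et al., 2021) — not the authors' implementation — the frozen weight matrix is augmented with a trainable low-rank update, so only a small fraction of parameters is trained. Dimensions and the `alpha` scaling below are illustrative assumptions.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA-adapted linear layer: y = x W^T + (alpha / r) * x A^T B^T.
    W stays frozen; only the low-rank factors A (r x d_in) and B (d_out x r)
    are trained."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def trainable_fraction(d_in, d_out, r):
    """Share of a layer's weights updated when only the LoRA factors train."""
    return r * (d_in + d_out) / (d_in * d_out)
```

For example, a 2048x2048 projection with rank-8 adapters trains under 1% of that layer's weights; the paper's 0.28% figure reflects the specific model and adapter placement, which this sketch does not reproduce.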
The Proceedings for the 15th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis (WASSA 2026)
Jeremy Barnes | Valentin Barriere | Orphée De Clercq | Roman Klinger | Célia Nouri | Debora Nozza | Pranaydeep Singh
Council of LLMs: Evaluating Capability of Large Language Models to Annotate Propaganda
Vivek Sharma | Shweta Jain | Mohammad Shokri | Sarah Ita Levitan | Elena Filatova
Data annotation is essential for supervised natural language processing tasks but remains labor-intensive and expensive. Large language models (LLMs) have emerged as promising alternatives, capable of generating high-quality annotations either autonomously or in collaboration with human annotators. However, their use in autonomous annotation is often questioned for their ethical take on subjective matters. This study investigates the effectiveness of LLMs in autonomous and hybrid annotation setups for propaganda detection. We evaluate GPT and open-source models on two datasets from different domains, namely the Propaganda Techniques Corpus (PTC) for news articles and the Journalist Media Bias on X (JMBX) corpus for social media. Our results show that LLMs generally exhibit high recall but lower precision in detecting propaganda, often over-predicting persuasive content. Multi-annotator setups did not outperform the best models in a single-annotator setting, although they helped reasoning models boost their performance. Hybrid annotation, combining LLM and human input, achieved higher overall accuracy than LLM-only settings. We further analyze misclassifications and find that LLMs are more sensitive to certain propaganda techniques such as loaded language, name calling, and doubt. Finally, using error typology analysis, we explore the reasoning the LLM provides for its misclassifications. Our results show that although some studies report LLMs outperforming manual annotation, and LLMs can prove useful in hybrid annotation, their incorporation into the human annotation pipeline must be implemented with caution.
Emoji Reactions on Telegram: Unreliable Indicators of Emotional Resonance
Serena Tardelli | Lorenzo Alvisi | Lorenzo Cima | Stefano Cresci | Maurizio Tesconi
Emoji reactions are a frequently used feature of messaging platforms, yet their communicative role remains understudied. Prior work on emojis has focused predominantly on in-text usage, showing that emojis embedded in messages tend to amplify and mirror the author’s affective tone. This evidence has often been extended to emoji reactions, treating them as indicators of emotional resonance or user sentiment. However, they may reflect broader social dynamics. Here, we investigate the communicative function of emoji reactions on Telegram. We analyze over 650k crypto-related messages that received at least one reaction, annotating each with sentiment, emotion, persuasion strategy, and speech act labels, and inferring the sentiment and emotion of emoji reactions using both lexicons and LLMs. We uncover a systematic mismatch between message and reaction sentiment, with positive reactions dominating even for neutral or negative content. This pattern persists across rhetorical strategies and emotional tones, indicating that emojis used as reactions do not reliably function as indicators of emotional mirroring or resonance of the content, in contrast to findings reported for in-text emojis. Finally, we identify the features that most predict emoji engagement. Overall, our findings caution against treating emoji reactions as sentiment labels, highlighting the need for more nuanced approaches in sentiment and engagement analysis.
This paper presents a domain-specific transformer pipeline for quantifying social atmosphere in hostel reviews, an experiential dimension that travelers consistently prioritize but that existing NLP methods and booking platforms fail to capture. We train a cross-encoder on 4,994 manually annotated reviews and use it to pseudo-label 162,840 additional reviews; these labels are then distilled into a sentence-transformer bi-encoder, producing embeddings where proximity reflects social interaction level rather than generic sentiment. On held-out human-labeled data, the domain-adapted embeddings achieve F1 = 0.826, outperforming generic sentence embeddings (0.671) and zero-shot GPT-4o (0.774), with a 40-fold improvement in intra-class versus inter-class similarity. Aggregating predictions to the property level reveals that hostel socialness follows an approximate exponential distribution, confirming that highly social hostels are rare. This work formalizes socialness as a measurable semantic construct and provides a general template for extracting implicit experiential attributes from text at scale.
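The intra-class versus inter-class similarity comparison mentioned above (the "40-fold improvement") can be illustrated with a small sketch. This is an assumed formulation of that metric, not the paper's code; the function name and cosine-similarity choice are illustrative.

```python
import numpy as np

def class_similarity_ratio(embeddings, labels):
    """Mean cosine similarity within classes divided by mean similarity across
    classes; higher values mean the embedding space separates the classes
    (here, socialness levels) more cleanly."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = X @ X.T                                  # pairwise cosine similarities
    same = np.equal.outer(labels, labels)        # same-class pair mask
    off_diag = ~np.eye(len(labels), dtype=bool)  # drop self-similarities
    intra = S[same & off_diag].mean()
    inter = S[~same].mean()
    return intra / inter
```

On toy embeddings with two tight clusters, the ratio is far above 1, reflecting the same separation property the paper reports for its domain-adapted embeddings.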
Predicting Convincingness in Political Speech: How Emotional Tone Shapes Persuasive Strength
Bhuvanesh Verma | Mounika Marreddy | Alexander Mehler
Emotional tone plays a central role in persuasion, yet its impact on computational assessments of political argument quality in real-world election campaign speeches remains understudied. In this work, we investigate whether positive emotional framing correlates with higher perceived convincingness in political arguments. We fine-tune language models on argument quality datasets and test their ability to transfer convincingness predictions to real-world campaign speeches. Using a corpus of U.S. presidential campaign speeches, we analyze emotional polarity in relation to predicted persuasive strength to test whether positively framed arguments are judged more convincing than neutral or negative ones. Our empirical analysis shows that political parties rely heavily on argumentation during their election campaigns. We also find evidence that politicians strategically employ emotional cues within their arguments during these campaign speeches, with positive emotions being more strongly associated with persuasive strength, for example in topics such as USMCA’s Effect on American Jobs and Agriculture, Border Control Policies, and Progressive Tax Reforms. At the same time, we find that negative emotions have a weaker yet still non-negligible influence on voter persuasion in topics such as City Crime and Civil Unrest and White Supremacist Violence (Charlottesville Incident).
Large language models (LLMs) are now widely used in applications that depend on closed-ended decisions, including automated surveys, policy screening, and decision-support tools. In such contexts, these models are typically expected to produce consistent binary or ternary responses (for example, Yes, No, or Neither) when presented with questions that are semantically equivalent. However, recent studies show that LLM outputs can be influenced by relatively minor changes in prompt wording, raising concerns about the reliability of their decisions under paraphrasing. In this paper, we conduct a systematic analysis of paraphrase robustness across five widely used LLMs. To support this evaluation, we develop a controlled dataset consisting of 200 opinion-based questions drawn from multiple domains, each accompanied by five human-validated paraphrases. All models are evaluated under deterministic inference settings and constrained to a fixed Yes/No/Neither response format. We assess model behavior using a set of complementary metrics that capture the stability of each evaluated model. DeepSeek Reasoner and Gemini 2.0 Flash show the highest stability when responding to paraphrased inputs, whereas Claude 3.7 Sonnet exhibits strong internal consistency but produces judgments that differ more frequently from those of other models. By contrast, GPT-3.5 Turbo and LLaMA 3 70B display greater sensitivity to surface-level variations in prompt phrasing. Overall, these findings suggest that robustness to paraphrasing is driven more by alignment strategies and reasoning design choices than by model size alone.
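One simple way to quantify the paraphrase stability described above is modal-agreement rate: for each question, the fraction of paraphrases whose answer matches the most common answer, averaged over questions. This is an illustrative metric of the general kind the paper describes, not necessarily its exact formulation.

```python
from collections import Counter

def paraphrase_stability(answers_per_question):
    """Fraction of paraphrases agreeing with the modal answer, averaged over
    questions. 1.0 means the model gives the same Yes/No/Neither answer to
    every paraphrase; lower values indicate sensitivity to surface wording."""
    scores = []
    for answers in answers_per_question:
        modal_count = Counter(answers).most_common(1)[0][1]
        scores.append(modal_count / len(answers))
    return sum(scores) / len(scores)
```

For instance, one fully stable question plus one where three of five paraphrases agree yields a score of (1.0 + 0.6) / 2 = 0.8.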
The Impact of Highlighting Subjective Language on Perceived News Trustworthiness
Mohammad Shokri | Vivek Sharma | Emily Klapper | Shweta Jain | Elena Filatova | Sarah Ita Levitan
The rise of misinformation and opinionated articles has made understanding how misleading or biased content influences readers an increasingly important problem. While most prior work focuses on detecting misinformation or deceptive language in real time, far less attention has been paid to how such content is perceived by readers, which is an essential component of misinformation’s effectiveness. In this study, we examine whether highlighting subjective sentences in news articles affects perceived trustworthiness. Using a controlled user experiment and 1,334 article–reader evaluations, we find that highlighting subjective content produces a modest yet statistically significant decrease in trust, with substantial variation across articles and participants. To explain this variation, we model trust change after highlighting subjective language as a function of article-level linguistic features and reader-level attitudes. Our findings suggest that readers’ reactions to highlighted subjective language are driven primarily by characteristics of the text itself, and that highlighting subjective language may help readers better assess the reliability of potentially misleading news articles.
Appraisal Trajectories in Narratives Reveal Distinct Patterns of Emotion Evocation
Johannes Schäfer | Janne Wagner | Roman Klinger
Understanding emotion responses relies on reconstructing how individuals appraise events. While prior work has studied emotion trajectories and inherent correlations with appraisals, it has considered appraisals only in a snapshot analysis. However, because appraisal is a complex, sequential process, we argue that it should be analyzed based on how it unfolds throughout a narrative. In this study, we investigate whether trajectories of appraisals are distinctive for different emotions in five-event stories – narratives where each of five sentences describes an event. We employ zero-shot prompting with a large language model to predict appraisals on sub-sequences of a narrative. We find that this approach is effective in identifying relevant appraisals in narratives, without prior knowledge of the evoked emotion, enabling a comprehensive analysis of appraisal trajectories. Furthermore, we are the first to quantitatively identify typical patterns of appraisal trajectories that distinguish emotions. For example, a rising trajectory for self-responsibility indicates trust, while a falling trajectory suggests anger.
Exploring Subjective Tasks in Farsi: A Survey Analysis and Evaluation of Language Models
Donya Rooein | Flor Miriam Plaza-del-Arco | Debora Nozza | Dirk Hovy
Given Farsi’s speaker base of over 127 million people and the growing availability of digital text, including more than 1.3 million articles on Wikipedia, it is considered a middle-resource language. However, this label quickly crumbles when the situation is examined more closely. We focus on three subjective tasks (Sentiment Analysis, Emotion Analysis, and Toxicity Detection) and identify significant challenges in data availability and quality, despite overall increases in data availability. We review 110 publications on subjective tasks in Farsi and observe a lack of publicly available datasets. Furthermore, existing datasets often lack essential demographic factors, such as age and gender, that are crucial for accurately modeling subjectivity in language. When evaluating prediction models using the few available datasets, the results are highly unstable across both datasets and models. Our findings show that the volume of data alone is insufficient to improve a language’s standing in NLP.
Emotional Lexicons: How Large Language Models Predict Emotional Ratings of Russian Words
Polina V. Iaroshenko | Natalia V Loukachevitch
This study examines the capability of LLMs to predict emotional ratings of Russian words by comparing their assessments with both native speakers’ ratings and expert evaluations. The research utilises two datasets: the ENRuN database containing associative emotional ratings of Russian nouns by native speakers, and RusEmoLex, an expert-compiled lexicon. Various open-source LLMs were evaluated, including international models (Llama-3, Qwen 2.5), Russian-developed models, and Russian-adapted variants, representing three parameter scales. The findings reveal distinct patterns in model performance: Russian-adapted models demonstrated superior alignment with native speakers’ ratings, whilst model size was not a decisive factor. Conversely, larger models showed better performance in matching expert assessments, with language adaptation having minimal impact. Emotional or sensitive lexis with strong connotations produces a more substantial human-model gap.
Emotion-aware text simplification of user generated content using LLMs
Anastasiia Bezobrazova | Daria Sokova | Constantin Orasan
Digital inclusion increasingly supports adults with intellectual disabilities (ID) to participate online, yet social media posts can be difficult to understand, particularly when they contain strong emotions, slang, or non-standard writing. This paper investigates whether large language models (LLMs) can simplify social media texts to improve cognitive accessibility and preserve emotional meaning. Using an accessibility-oriented prompt based on existing guidance, posts are simplified and emotion preservation is assessed. The results suggest that many simplified posts retain the same emotions, though changes occur, especially when emotions are weakly expressed or ambiguous. Qualitative analysis shows that simplification improves fluency and structure but can also shift perceived emotion through changes to tone, formatting, and other affective cues common in social media text. The research has also revealed that different LLMs produce very different outputs.
Crowd-Based Evaluation of Emotion Intensity Preservation in Spanish–Basque Tweet Machine Translation
Nora Aranberri
Machine translation (MT) systems perform well on standard benchmarks, yet their ability to preserve emotional meaning in informal user-generated content—particularly for low-resource languages—remains underexplored. We investigate the preservation of emotion intensity in Spanish–Basque tweet translation, focusing on Basque, an under-represented language in MT research. We compile a small, controlled corpus of Spanish reaction tweets and evaluate Basque translations from three publicly available systems through a crowd-based study. While all systems achieve comparable and above mid-range accuracy and fluency, emotion intensity is systematically attenuated in the translations, with greater loss for more emotionally intense inputs. A follow-up on highly emotional tweets shows that LLM prompting reduces emotion loss, yet substantial attenuation remains, highlighting emotion preservation as a persistent challenge in Spanish–Basque MT.
A Position Paper on Toxic Reasoning: Grounding Categories of Toxic Language in Implications and Attitudes
Stefan F. Schouten | Ilia Markov | Piek Vossen
Automatic detection of toxic language has the potential to considerably improve engagement with online spaces. Previous work has characterized toxic language detection as a classification problem, often using fine-grained classes for increased explainability. In this position paper, we argue for a particular way of operationalizing categories of toxic language. Our approach focuses on what is expressed or implied, and breaks down implications based on two traits: (i) the core content of what was expressed, and (ii) relevant stakeholders’ attitudes towards that content. We argue for an approach, which we call toxic reasoning, where such distinctions are made explicit. We point out the benefits for such an approach, and develop a toxic reasoning schema, which can explain categories of toxic language from diverse sources. We demonstrate this by mapping the classes of existing toxic language datasets to the schema. Toxic reasoning promises to provide improved understanding of implicit toxicity while increasing explainability.
Is Sentiment Banana-Shaped? Exploring the Geometry and Portability of Sentiment Concept Vectors
Laurits Lyngbaek | Pascale Feldkamp | Yuri Bizzoni | Kristoffer Nielbo | Kenneth Enevoldsen
Use cases of sentiment analysis in the humanities often require contextualized, continuous scores. Concept Vector Projections (CVP) offer a recent solution: by modeling sentiment as a direction in embedding space, they produce continuous, multilingual scores that align closely with human judgments. Yet the method’s portability across domains and underlying assumptions remain underexplored. We evaluate CVP across genres, historical periods, languages, and affective dimensions, finding that concept vectors trained on one corpus transfer well to others with minimal performance loss. To understand the patterns of generalization, we further examine the linearity assumption underlying CVP. Our findings suggest that while CVP is a portable approach that effectively captures generalizable patterns, its linearity assumption is approximate, pointing to potential for further development. Code available at: github.com/lauritswl/representation-transfer
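The core CVP idea described above — sentiment as a direction in embedding space — can be sketched in a few lines. This is a generic reconstruction under the linearity assumption the paper examines, not the released code; the difference-of-means construction of the concept vector is an illustrative choice.

```python
import numpy as np

def concept_vector(pos_embs, neg_embs):
    """Sentiment direction: unit vector from the mean negative-seed embedding
    toward the mean positive-seed embedding."""
    v = pos_embs.mean(axis=0) - neg_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def project(embedding, v):
    """Continuous sentiment score: scalar projection onto the concept
    direction (positive = toward the positive pole)."""
    return float(embedding @ v)
```

Projecting a new text's embedding onto the vector then yields the continuous score; texts near the positive seeds score positive, texts near the negative seeds score negative.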
Disentangling Emotion Understanding and Generation in Large Language Models
Sadegh Jafari | Els Lefever | Veronique Hoste
Large language models (LLMs) have demonstrated strong performance on emotion understanding tasks, yet their ability to faithfully generate emotionally aligned text remains less well understood. We propose a semantic evaluation framework that jointly assesses emotion understanding, emotion generation, and internal consistency, using a VAE-based emotion cost matrix that captures graded semantic similarity between emotion categories. Our framework introduces four complementary metrics that disentangle baseline understanding, human-perceived emotion in generated text, generation quality, and model consistency. Experimental results show that while understanding and consistency scores are highly correlated, emotion generation exhibits substantially weaker correlations with these metrics. These findings motivate the development of specialized evaluation protocols that independently measure emotional understanding and generation, enabling more reliable assessments of LLM emotional intelligence.
News Credibility Assessment by LLMs and Humans: Implications for Political Bias
Pia Wenzel Neves | Charlott Jakob | Vera Schmitt
In an era of rapid misinformation spread, LLMs have emerged as tools for assessing news credibility at scale. However, the assessments are influenced by social and cultural biases. Studies investigating political bias compare model credibility ratings with expert credibility ratings; comparing LLMs to the perceptions of political camps extends this approach to detecting similarities in their biases. We compare LLM-generated credibility and bias ratings of news outlets with expert assessments and stratified political opinions collected through surveys. We analyse three models (Llama 3.3 70B, Mixtral 8x7B, and GPT-OSS 120B) across 47 news outlets from two countries (U.S. and Germany). We find that the models demonstrate consistently high alignment with expert ratings, while showing weaker and more variable alignment with public opinions. For U.S. news outlets, all models showed stronger alignment with center-left perceptions, while for German news outlets the alignment is more diverse.
Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
Nils Schwager | Simon Münker | Alistair Plum | Achim Rettinger
The transition of Large Language Models (LLMs) from exploratory tools to active "silicon subjects" in social science lacks extensive validation of operational validity. This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus by comparing generated outputs with authentic digital traces. This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior. We evaluated open-weight 8B models (Llama-3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios. By systematically comparing prompting strategies (explicit vs. implicit) and the impact of Supervised Fine-Tuning (SFT), we identify a critical form vs. content decoupling in low-resource settings: while SFT aligns the surface structure of the text output (length and syntax), it degrades semantic grounding. Furthermore, we demonstrate that explicit conditioning (generated biographies) becomes redundant under fine-tuning, as models successfully perform latent inference directly from behavioral histories. Our findings challenge current "naive prompting" paradigms and offer operational guidelines prioritizing authentic behavioral traces over descriptive personas for high-fidelity simulation.
Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents
Mohammad Hossein Akbari Monfared | Lucie Flek | Akbar Karimi
We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high-quality synthetic training examples. To isolate the effect of agentic structure, we also develop a closely matched prompting-based baseline using the same model and instructions. Both methods are evaluated across three ABSA subtasks—Aspect Term Extraction (ATE), Aspect Sentiment Classification (ATSC), and Aspect Sentiment Pair Extraction (ASPE)—four SemEval datasets, and two encoder–decoder models: T5-Base and Tk-Instruct. Our results show that the agentic augmentation outperforms raw prompting in label preservation of the augmented data, especially when the tasks require aspect term generation. In addition, when combined with real data, agentic augmentation provides higher gains, consistently outperforming prompting-based generation. These benefits are most pronounced for T5-Base, while the more heavily pretrained Tk-Instruct exhibits smaller improvements. As a result, augmented data helps T5-Base achieve comparable performance with its counterpart.
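The iterative generation-and-verification loop described above can be sketched abstractly. This is a schematic of the agentic pattern, not the authors' pipeline; the callback signatures, feedback mechanism, and round limit are illustrative assumptions.

```python
def agentic_augment(generate, verify, label, max_rounds=3):
    """Iteratively generate a synthetic example for a target label and keep it
    only once a verifier agrees the label is preserved; otherwise pass the
    verifier's feedback back into the next generation round."""
    feedback = None
    for _ in range(max_rounds):
        example = generate(label, feedback)     # LLM call in a real pipeline
        ok, feedback = verify(example, label)   # second LLM call as the critic
        if ok:
            return example
    return None  # give up after max_rounds; caller may discard this sample
```

The verification step is what distinguishes this from plain prompting: a candidate that drifts from the target label is rejected and regenerated, which is consistent with the paper's finding that agentic augmentation improves label preservation.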
Antisocial behavior (ASB) on social media encompasses online behaviors that harm individuals, groups, or platform ecosystems, including hate speech, harassment, cyberbullying, trolling, and coordinated abuse. While most prior work has focused on detecting harm after it occurs, a growing body of research on ASB prediction seeks to forecast future harmful outcomes before they materialize, including—but not limited to—hate-speech diffusion, conversational derailment, and user recidivism. However, this emerging field remains fragmented, with limited conceptual grounding and few integrative frameworks. This paper establishes a foundation for ASB prediction by introducing a structured taxonomy spanning temporal, structural, and behavioral dimensions. Drawing on 49 machine learning studies identified through a literature review, we map predictive goals to datasets, modeling choices, and evaluation practices, and identify key challenges, including the lack of standardized benchmarks, the dominance of text-centric representations, and trade-offs between accuracy and interpretability. We conclude by outlining actionable directions toward more robust, generalizable, and responsible ASB prediction systems.
Real-Time Mitigation of Negative Emotion in Customer Care Calls
Surupendu Gangopadhyay | Mahnoosh Mehrabani
Speech emotion recognition (SER) is a compelling yet challenging research area with substantial practical relevance, particularly in enhancing human–machine interaction. Despite considerable progress in the field, the scarcity of realistic datasets that reflect real-world conditions makes it difficult to analyze system behavior in practice and can lead to degraded performance in industrial applications. In this study, we propose a system that detects negative emotions at each turn in a conversation by leveraging both linguistic and acoustic features. The approach is evaluated on real-world data, with a particular focus on identifying and responding to negative emotion in customer support scenarios. Designed for real-time application, the system is suitable for live deployment in call center environments. Furthermore, we propose an effective prompting strategy for using large language models (LLMs) as annotators, generating labeled data used to fine-tune small language models that achieve performance on par with the LLM used for annotation, while remaining suitable for real-time deployment.
Says Who? Argument Convincingness and Reader Stance Are Correlated with Perceived Author Personality
Sabine Weber | Lynn Greschner | Roman Klinger
Alongside its literal meaning, text also carries implicit social signals: information that is used by the reader to assign the author of the text a specific identity or make assumptions about the author’s character. The reader creates a mental image of the author which influences the interpretation of the presented information. This is especially relevant for argumentative text, where the credibility of the information might depend on who provides it. We therefore focus on the question: How do readers of an argument imagine its author? Using the ContArgA corpus, we study arguments annotated for convincingness and perceived author properties (level of education and Big Five personality traits). We find that annotators perceive an author to be similar to themselves when they agree with the stance of the argument. We also find that the envisioned personality traits and education level of the author are statistically significantly correlated with the argument’s convincingness. We conduct experiments with four generative LLMs and a RoBERTa-based regression model, showing that LLMs do not replicate the annotators’ judgments. Argument convincingness can, however, provide a useful signal for modeling perceived author personality when it is explicitly used during training.
A Transformer and Prototype-based Interpretable Model for Contextual Sarcasm Detection
Ximing Wen | Rezvaneh Rezapour
Sarcasm detection, with its figurative nature, poses unique challenges for affective systems designed to perform sentiment analysis. While these systems typically perform well at identifying direct expressions of emotion, they struggle with sarcasm’s inherent contradiction between literal and intended sentiment. Since transformer-based language models (LMs) are known for their efficient ability to capture contextual meanings, we propose a method that leverages LMs and prototype-based networks, enhanced by sentiment embeddings, to conduct interpretable sarcasm detection. Our approach is intrinsically interpretable without extra post-hoc interpretability techniques. We test our model on three public benchmark datasets and show that our model outperforms the current state-of-the-art. At the same time, the prototypical layer enhances the model’s inherent interpretability by generating explanations through similar examples at inference time. Furthermore, we demonstrate the effectiveness of the incongruity loss in an ablation study, which we construct using sentiment prototypes.
Multimodal Claim Extraction for Fact-Checking
Joycelyn Teo | Rui Cao | Zhenyun Deng | Zifeng Ding | Michael Sejr Schlichtkrull | Andreas Vlachos
Automated Fact-Checking (AFC) relies on claim extraction as a first step, yet existing methods largely overlook the multimodal nature of today’s misinformation. Social media posts often combine short, informal text with images such as memes, screenshots, and photos, creating challenges that differ from both text-only claim extraction and well-studied multimodal tasks like image captioning or visual question answering. In this work, we present the first benchmark for multimodal claim extraction from social media, consisting of posts containing text and one or more images, annotated with gold-standard claims derived from real-world fact-checkers. We evaluate state-of-the-art multimodal LLMs (MLLMs) under a three-part evaluation framework (semantic alignment, faithfulness, and decontextualization) and find that baseline MLLMs struggle to model rhetorical intent and contextual cues. To address this, we introduce MICE, an intent-aware framework which shows improvements in intent-critical cases.
A Multi-Aspect Evaluation Framework for Synthetic Data: Case Study on Irony and Sarcasm
Laura Majer | Ana Barić | Florijan Sandalj | Ivan Unković | Bojan Puvača | Jan Šnajder
Data augmentation (DA) using large language models (LLMs) is a cost-effective method for generating synthetic data, particularly for tasks with scarce datasets. However, its potential remains largely underexplored, both in terms of augmentation configuration and evaluation of synthetic data. This paper investigates LLM-based synthetic data generation for irony and sarcasm, two subjective and context-dependent forms of figurative language. We propose a multi-aspect evaluation framework assessing synthetic data’s utility-plausibility and extrinsic-intrinsic dimensions through four aspects: predictive performance, sample diversity, linguistic properties, and human judgment. Our findings indicate that other aspects of evaluation, like diversity and linguistic features, do not necessarily correlate with an increase in predictive performance, underscoring the importance of multi-faceted evaluation. This work highlights the potential of LLM-based DA for irony and sarcasm detection, offering insights into the linguistic competence of LLMs. As synthetic data becomes increasingly prevalent, our framework offers a broadly applicable and crucial evaluation method, particularly for linguistically complex tasks.