The 15th International Conference on Recent Advances in Natural Language Processing

Varna, Bulgaria
September 8-10, 2025

Volumes

Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era 167 papers
Proceedings of the 9th Student Research Workshop associated with the International Conference Recent Advances in Natural Language Processing 10 papers
Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects 14 papers
Proceedings of the Shared Task on Multi-Domain Detection of AI-Generated Text 6 papers
Proceedings of the 8th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Texts 20 papers
Proceedings of the First Workshop on Ethical Concerns in Training, Evaluating and Deploying Large Language Models 4 papers
Proceedings of the First International Workshop on Gaze Data and Natural Language Processing 10 papers
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models 23 papers
Proceedings of the First on Natural Language Processing and Language Models for Digital Humanities 17 papers
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages 19 papers
Proceedings of Interdisciplinary Workshop on Observations of Misunderstood, Misguided and Malicious Use of Language Models 12 papers
Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models 18 papers

pdf (full)
bib (full) Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

pdf bib
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Galia Angelova | Maria Kunilovskaya | Marie Escribe | Ruslan Mitkov

pdf bib abs
Harnessing Open-Source LLMs for Tender Named Entity Recognition
Asim Abbas | Venelin Kovatchev | Mark Lee | Niloofer Shanavas | Mubashir Ali

In the public procurement domain, extracting accurate tender entities from unstructured text remains a critical, less explored challenge, because tender data is highly sensitive and confidential, and not available openly. Previously, state-of-the-art NLP models were developed for this task; however developing an NER model from scratch required huge amounts of data and resources. Similarly, performing fine-tuning of a transformer-based model like BERT requires training data, as a result posing challenges in training data cost, model generalization, and data privacy. To address these challenges, an emerging LLM such as GPT-4 in a Few-shot learning environment achieves SOTA performance comparable to fine-tuned models. However, being dependent on the closed-source commercial LLMs involves high cost and privacy concerns. In this study, we have investigated open-source LLMs like Mistral and LLAMA-3, focusing on the tender domain for the NER tasks on local consumer-grade CPUs in three different environments: Zero-shot, One-shot, and Few-shot learning. The motivation is to efficiently lessen costs compared to a cloud solution while preserving accuracy and data privacy. Similarly, we have utilized two datasets open-source from Singapore and closed-source commercially sensitive data provided by Siemens. As a result, all the open-source LLMs achieve above 85% F1-score on an open-source dataset and above 90% F1-score on a closed-source dataset.

pdf bib abs
On the Limitations of Large Language Models (LLMs): False Attribution
Tosin Adewumi | Nudrat Habib | Lama Alkhaled | Elisa Barney

In this work, we introduce a new hallucination metric - SHI and provide insight into one important limitation of the parametric knowledge of large language models LLMs, i.e. false attribution. The task of automatic author attribution for relatively small chunks of text is an important NLP task but can be challenging. We empirically evaluate the power of 3 open SotA LLMs in zero-shot setting (Gemma-7B, Mixtral 8x7B, and LLaMA-2-13B). We acquired the top 10 most popular books of a month, according to Project Gutenberg, divided each one into equal chunks of 400 words, and prompted each LLM to predict the author. We then randomly sampled 162 chunks per book for human evaluation, based on the error margin of 7% and a confidence level of 95%. The average results show that Mixtral 8x7B has the highest prediction accuracy, the lowest SHI, and a Pearson’s correlation r of 0.724, 0.263, and -0.9996, respectively, followed by LLaMA-2-13B and Gemma-7B. However, Mixtral 8x7B suffers from high hallucinations for 3 books, rising as high as a SHI of 0.87 (in the range 0-1, where 1 is the worst). The strong negative correlation of accuracy and SHI, given by r, demonstrates the fidelity of the new hallucination metric, which may generalize to other tasks. We also show that prediction accuracies correlate positively with the frequencies of Wikipedia instances of the book titles instead of the downloads and we perform error analyses of predictions. We publicly release the annotated chunks of data and our codes to aid the reproducibility and evaluation of other models.

pdf bib abs
Candidate Profile Summarization: A RAG Approach with Synthetic Data Generation for Tech Jobs
Anum Afzal | Ishwor Subedi | Florian Matthes

As Large Language Models (LLMs) become increasingly applied to resume evaluation and candidate selection, this study investigates the effectiveness of using in-context example resumes to generate synthetic data. We compare a Retrieval-Augmented Generation (RAG) system to a Named Entity Recognition (NER)-based baseline for job-resume matching, generating diverse synthetic resumes with models like Mixtral-8x22B-Instruct-v0.1. Our results show that combining BERT, ROUGE, and Jaccard similarity metrics effectively assesses synthetic resume quality, ensuring the least lexical overlap along with high similarity and diversity. Our experiments show that RAG notably outperforms NER for retrieval tasks—though generation-based summarization remains challenged by role differentiation. Human evaluation further highlights issues of factual accuracy and completeness, emphasizing the importance of in-context examples, prompt engineering, and improvements in summary generation for robust, automated candidate selection.

pdf bib abs
PersianSciQA: A New Dataset for Bridging the Language Gap in Scientific Question Answering
Safoura Aghadavoud Jolfaei | Azadeh Mohebi | Zahra Hemmat

The shortage of specialized datasets hinders the development of Natural Language Processing (NLP) for scientific texts in low-resource languages such as Persian. To address this, we introduce PersianSciQA , a large-scale resource of 39,809 question- answer snippet pairs, each containing a question and a scientific answer snippet from a scientific engineering abstract source from IranDoc’s ‘Ganj’ repository, linked by an LLM-assigned relevance score (0-3) that measures how relevant the question is to the content of the accompanying answer snippet. The dataset was generated using a two stage prompting methodology and refined through a rigorous cleaning pipe-line, including text normalization and semantic deduplication. Human validation of 1,000 instances by two NLP researchers confirmed the dataset’s quality and a substantial LLM-human agreement (Cohen’s kappa coefficient κ=0.6642). To demonstrate its value, we establish baseline benchmarks and show that fine-tuning on PersianSciQA dramatically improves a state-of-the-art model, achieving a Spearman correlation of 0.895 on a blind test set. PersianSciQA provides a crucial new resource to facilitate research in information retrieval and question answering within the Persian scientific domain.

pdf bib abs
Multilingual Pre-training Meets Supervised Neural Machine Translation: A Reproducible Evaluation on English–French and Finnish Translation
Benyamin Ahmadnia | Yeswanth Soma | Hossein Sarrafzadeh

This paper presents a comparative evaluation of Transformer-based Neural Machine Translation (NMT) models and pre-trained multilingual sequence-to-sequence models in the context of moderately-resourced MT. Using English-French (high-resource) and English-Finnish (moderate-resource) as case studies, we assess the effectiveness of fine-tuning the mBART model versus training standard NMT systems from scratch. Our experiments incorporate data-augmentation techniques such as back-translation and evaluate translation quality using BLEU, TER, METEOR, and COMET metrics. We also provide a detailed error analysis that covers lexical choice, named entity handling, and word order. While mBART demonstrates consistent improvements over classical NMT, particularly in handling complex linguistic structures and sparse training data, we acknowledge the challenges of deploying large models in resource-constrained settings. Our findings highlight practical trade-offs between model complexity, resource availability, and translation quality in multilingual scenarios.

pdf bib abs
Advancing Clinical Translation in Nepali through Fine-Tuned Multilingual Models
Benyamin Ahmadnia | Sumaiya Shaikh | Bibek Poudel | Shazan Mohammed | Sahar Hooshmand

Low-resource Neural Machine Translation (NMT) remains a major challenge, particularly in high-stakes domains such as healthcare. This paper presents a domain-adapted pipeline for English-Nepali medical translation leveraging two state-of-the-art multilingual Large Language Models (LLMs): mBART and NLLB-200. A high-quality, domain-specific parallel corpus is curated, and both models are fine-tuned using PyTorch frameworks. Translation fidelity is assessed through a multi-metric evaluation strategy that combines BLEU, CHRF++, METEOR, BERTScore, COMET, and perplexity. Our experimental results show that NLLB-200 consistently outperforms mBART across surface-level and semantic metrics, achieving higher accuracy and lower hallucination rates in clinical settings. In addition, error profiling and ethical assessments are conducted to highlight challenges such as term omissions and cultural bias. This work underscores the viability of large-scale multilingual models in enhancing medical translation for low-resource languages and proposes actionable paths toward safer and more equitable MT deployment in healthcare.

Active learning (AL) reduces annotation costs by selecting the most informative samples for labeling. However, traditional AL methods rely on a single heuristic, limiting data exploration and annotation efficiency. This paper introduces two ensemble-based AL methods: Ensemble Union, which combines multiple heuristics to improve dataset exploration, and Ensemble Intersection, which applies majority voting for robust sample selection. We evaluate these approaches on the United Nations Parallel Corpus (UNPC) in both English and Spanish using domain-specific models such as ConfliBERT. Our results show that ensemble-based AL strategies outperform individual heuristics, achieving classification performance comparable to full dataset training while using significantly fewer labeled examples. Although focused on political texts, the proposed methods are applicable to broader NLP annotation tasks where labeling costs are high.

pdf bib abs
Evaluating Large Language Models on Sentiment Analysis in Arabic Dialects
Maram I. Alharbi | Saad Ezzini | Hansi Hettiarachchi | Tharindu Ranasinghe | Ruslan Mitkov

Despite recent progress in large language models (LLMs), their performance on Arabic dialects remains underexplored, particularly in the context of sentiment analysis. This study presents a comparative evaluation of three LLMs, DeepSeek-R1, Qwen2.5, and LLaMA-3, on sentiment classification across Modern Standard Arabic (MSA), Saudi dialect and Darija. We construct a balanced sentiment dataset by translating and validating MSA hotel reviews into Saudi dialect and Darija. Using parameter-efficient fine-tuning (LoRA) and dialect-specific prompts, we assess each model under matched and mismatched prompting conditions. Evaluation results show that Qwen2.5 achieves the highest macro F1 score of 79% on Darija input using MSA prompts, while DeepSeek performs best when prompted in the input dialect, reaching 71% on Saudi dialect. LLaMA-3 exhibits stable performance across prompt variations, with 75% macro F1 on Darija input under MSA prompting. Dialect-aware prompting consistently improves classification accuracy, particularly for neutral and negative sentiment classes.

pdf bib abs
From Posts to Predictions: A User-Aware Framework for Faithful and Transparent Detection of Mental Health Risks on Social Media
Hessam Amini | Leila Kosseim

We propose a user-aware attention-based framework for early detection of mental health risks from social media posts. Our model combines DisorBERT, a mental health–adapted transformer encoder, with a user-level attention mechanism that produces transparent post-level explanations. To assess whether these explanations are faithful, i.e., aligned with the model’s true decision process, we apply adversarial training and quantify attention faithfulness using the AtteFa metric. Experiments on four eRisk tasks (depression, anorexia, self-harm, and pathological gambling) show that our model achieves competitive latency-weighted F1 scores while relying on a sparse subset of posts per user. We also evaluate attention robustness and conduct ablations, confirming the model’s reliance on high-weighted posts. Our work extends prior explainability studies by integrating faithfulness assessment in a real-world high-stakes application. We argue that systems combining predictive accuracy with faithful and transparent explanations offer a promising path toward safe and trustworthy AI for mental health support.

pdf bib abs
Beyond Methods and Datasets Entities: Introducing SH-NER for Hardware and Software Entity Recognition in Scientific Text
Aftab Anjum | Nimra Maqbool | Ralf Krestel

Scientific Information Extraction (SciIE) has become essential for organizing and understanding scientific literature, powering tasks such as knowledge graph construction, method recommendation, and automated literature reviews. Although prior SciIE work commonly annotates entities such as tasks, methods, and datasets, it systematically neglects infrastructure-related entities like hardware and software specifications mentioned in publications. This gap limits key applications: knowledge graphs remain incomplete, and recommendation systems cannot effectively filter methods based on hardware compatibility. To address this gap, we introduce SH-NER, the first large-scale, manually annotated dataset focused on infrastructure-related entities in NLP research. SH-NER comprises 1,128 full-text papers from the ACL Anthology and annotates five entity types: Software, Cloud-Platform, Hardware-Device, Device-Count, and Device-Memory. Our dataset comprises over 9k sample sentences with around 6k annotated entity mentions. To assess the effectiveness of SH-NER, we conducted comprehensive experiments employing state-of-the-art supervised models alongside large language models (LLMs) as baselines. The results show that SH-NER improves scientific information extraction by better capturing infrastructure mentions. You can find the manually annotated dataset at https://github.com/coderhub84/SH-NER.

pdf bib abs
Toponym Resolution: Will Prompt Engineering Change Expectations?
Isuri Anuradha | Deshan Koshala Sumanathilaka | Ruslan Mitkov | Paul Rayson

Large Language Models(LLMs) have revolutionised the field of artificial intelligence and have been successfully employed in many disciplines, capturing widespread attention and enthusiasm. Many previous studies have established that Domain-specific Deep Learning models to competitively perform with the general-purpose LLMs (Maatouk et al., 2024;Lu et al., 2024). However, a suitable prompt which provides direct instructions and background information is expected to yield im- proved results (Kamruzzaman and Kim, 2024). The present study focuses on utilising LLMs for the Toponym Resolution task by incorporating Retrieval-Augmented Generation(RAG) and prompting techniques to surpass the results of the traditional Deep Learning models. Moreover, this study demonstrates that promising results can be achieved without relying on large amounts of labelled, domain-specific data. After a descriptive comparison between open-source and proprietary LLMs through different prompt engineering techniques, the GPT-4o model performs best compared to the other LLMs for the Toponym Resolution task.

pdf bib abs
HoloBERT: Pre-Trained Transformer Model for Historical Narratives
Isuri Anuradha | Le An Ha | Ruslan Mitkov

Oral texts often contain spontaneous, unstructured language with features like disfluencies, colloquialisms, and non-standard syntax. In this paper, we investigate how further pretraining language models with specialised learning objectives for oral and transcribed texts to enhance Named Entity Recognition (NER) performance in Holocaust-related discourse. To evaluate our models, we compare the extracted named entities (NE) against those from other pretrained models on historical texts and generative AI models such as GPT. Furthermore, we demonstrate practical applications of the recognised NEs by linking them to a knowledge base as structured metadata and representing them in a graph format. With these contributions, our work illustrates how the further-pretrain-and-fine-tune paradigm in Natural Language Processing advances research in Digital Humanities.

Large language models (LLMs) have been applied to a wide range of tasks, including text summarization, web navigation, and chat- bots. They have benefitted from supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) following an un- supervised pretraining. These datasets can be difficult to collect, limited in scope, and vary in sample quality. Additionally, datasets can vary extensively in supervision format, from numer- ical to binary as well as multi-dimensional with many different values. We present a framework for fine-tuning LLMs using heterogeneous feed- back, which has two main components. First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF. Next, given this unified feedback dataset, we extract a high- quality and diverse subset to obtain perfor- mance increases potentially exceeding the full dataset. We conduct extensive experiments to understand the effectiveness of these tech- niques for incorporating heterogeneous feed- back, and demonstrate improvements from us- ing a high-quality and diverse subset of the data. We find that our framework is able to improve models in multiple areas simultaneously, such as in instruction following and bias reduction.

In this research, we developed and evaluated “chakoshi” an LLM guardrail model designed to address Japanese-specific nuances. chakoshi is a lightweight LLM that has been fine-tuned using multiple open datasets and proprietary learning datasets. Based on gemma-2-9b, the chakoshi model achieved an average F1 score of 0.92 or higher across multiple test datasets, demonstrating superior performance compared to existing models. Additionally, we implemented a feature that allows customization of categories to be filtered using natural language, and confirmed its effectiveness through practical examples.

pdf bib abs
KoWit-24: A Richly Annotated Dataset of Wordplay in News Headlines
Alexander Baranov | Anna Palatkina | Yulia Makovka | Pavel Braslavski

We present KoWit-24, a dataset with fine-grained annotation of wordplay in 2,700 Russian news headlines. KoWit-24 annotations include the presence of wordplay, its type, wordplay anchors, and words/phrases the wordplay refers to. Unlike the majority of existing humor collections of canned jokes, KoWit-24 provides wordplay contexts – each headline is accompanied by the news lead and summary. The most common type of wordplay in the dataset is the transformation of collocations, idioms, and named entities – the mechanism that has been underrepresented in previous humor datasets. Our experiments with five LLMs show that there is ample room for improvement in wordplay detection and interpretation tasks. The dataset and evaluation scripts are available at https://github.com/Humor-Research/KoWit-24

pdf bib abs
Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets
Eduard Barbu | Meeri-Ly Muru | Sten Marcus Malva

This paper presents a method for text simplification based on two neural architectures: a neural machine translation (NMT) model and a fine-tuned large language model (LLaMA). Given the scarcity of existing resources for Estonian, a new dataset was created by combining manually translated corpora with GPT-4.0-generated simplifications. OpenNMT was selected as a representative NMT-based system, while LLaMA was fine-tuned on the constructed dataset. Evaluation shows LLaMA outperforms OpenNMT in grammaticality, readability, and meaning preservation. These results underscore the effectiveness of large language models for text simplification in low-resource language settings. The complete dataset, fine-tuning scripts, and evaluation pipeline are provided in a publicly accessible supplementary package to support reproducibility and adaptation to other languages.

pdf bib abs
Mitigating Bias in Text Classification via Prompt-Based Text Transformation
Charmaine Barker | Dimitar Kazakov

The presence of specific linguistic signals particular to a certain sub-group can become highly salient to language models during training. In automated decision-making settings, this may lead to biased outcomes when models rely on cues that correlate with protected characteristics. We investigate whether prompting ChatGPT to rewrite text using simplification, neutralisation, localisation, and formalisation can reduce demographic signals while preserving meaning. Experimental results show a statistically significant drop in location classification accuracy across multiple models after transformation, suggesting reduced reliance on group-specific language. At the same time, sentiment analysis and rating prediction tasks confirm that the core meaning of the reviews remains greatly intact. These results suggest that prompt-based rewriting offers a practical and generalisable approach for mitigating bias in text classification.

pdf bib abs
Towards CEFR-targeted Text Simplification for Question Adaptation
Luca Benedetto | Paula Buttery

Text Simplification (TS) can adapt educational content to learners’ proficiency levels. In reading comprehension questions, passage complexity directly affects the question difficulty; thus, TS could enable automatic question adaptation by generating multiple versions of a reading passage. However, despite the potential of TS and its applications in other domains, the feasibility, reliability, and robustness of TS for question adaptation remains unexplored. In this paper, we conduct the first evaluation of LLMs for CEFR targeted text simplification aimed at question adaptation. Specifically, we investigate whether LLMs can perform CEFR-targeted text simplification and how this affects question answerability. Evaluating four LLMs on two English learning datasets, we show that they can mostly perform targeted simplification with readability values correlating with reference CEFR levels, but alignment is imperfect. Crucially, the simplified texts generally preserve the information needed to for question answering, and questions associated with texts simplified at lower levels show reduced difficulty in virtual pretesting. These preliminary findings show the potential of LLMs for educational content adaptation, but practical deployment will need improved CEFR alignment.

pdf bib abs
Evaluation of Pretrained and Instruction-Based Pretrained Models for Emotion Detection in Arabic Social Media Text
Md. Rafiul Biswas | Shimaa Ibrahim | Mabrouka Bessghaier | Wajdi Zaghouani

This study evaluates three approaches—instruction prompting of large language models (LLMs), instruction fine-tuning of LLMs, and transformer-based pretrained models on emotion detection in Arabic social media text. We compare pretrained transformer models like AraBERT, CaMelBERT, and XLM-RoBERTa with instruction prompting with advanced LLMs like GPT-4o, Gemini, Deepseek, and Fanar, and instruction fine-tuning approaches with LLMs like Llama 3.1, Mistral, and Phi. With a highly preprocessed dataset of 10,000 labeled Arabic tweets with overlapping emotional labels, our findings reveal that transformer-based pretrained models outperform instruction prompting and instruction fine-tuning approaches. Instruction prompts leverage general linguistic skills with maximum efficiency but fall short in detecting subtle emotional contexts. Instruction fine-tuning is more specific but trails behind pretrained transformer models. Our findings establish the need for optimized instruction-based approaches and underscore the important role played by domain-specific transformer architectures in accurate Arabic emotion detection.

pdf bib abs
Can LLMs Disambiguate Grounded Language? The Case of PP Attachment
John Blackmore | Matthew Stone

We explore the potential of large language models in resolving ambiguity in prepositional phrase attachments in grounded language. We find that when prompted in such a way that we can compute a probability of the respective attachment, models yield promising results. However, additional inputs from a measure of information structure may help improve prediction accuracy. We also investigate where we need more sophisticated tools, commonsense reasoning, world knowledge, and additional context to resolve ambiguity.

pdf bib abs
MLDataForge: Accelerating Large-Scale Dataset Preprocessing and Access for Multimodal Foundation Model Training
Andrea Blasi Núñez | Lukas Paul Achatius Galke | Peter Schneider-Kamp

Preprocessing large and possibly multimodal datasets remains a key bottleneck in many machine learning workflows, particularly when random access to samples is needed for global shuffling and sorting. Existing approaches, including widely used formats like JSONL and frameworks such as Huggingface Datasets and MosaicML Streaming, typically incur substantial computational, memory, and storage overhead in such settings. Here, we introduce MLDataForge, a Python-based open-source framework designed for scalable dataset pre-processing and access. Our key contributions are: (1) optimized readers for Mosaic Data Shards (MDS) that substantially improve throughput, reduce peak storage usage, and support sample-level compression; (2) JINX (JSON Indexed ’N’ eXtended), a novel, index-augmented JSONL-compatible format supporting structured footers and binary sidecar files; and (3) a lazy-loading mechanism that defers data loading, decompression, and decoding JINX files until sample fields are accessed. We empirically evaluate MLDataForge and our contributions on a representative 200 GB supervised fine-tuning dataset for vision language models. Our best configuration – zstd-compressed JINX with binary sidecar and lazy loading – yields at least a decimal order-of-magnitude throughput increase compared to the best baselines for iteration, global shuffling, and sorting. These advances enable substantial gains in data preprocessing performance, facilitating more scalable and resource-efficient model training pipelines.

pdf bib abs
The Impact of Named Entity Recognition on Transformer-Based Multi-Label Dietary Recipe Classification
Kemalcan Bora | Horacio Saggion

This research explores the impact of Named Entity Recognition (NER) on transformer-based models for multi-label recipe classification by dietary preference. To support this task, we introduce the NutriCuisine Index: a collection of 23,932 recipes annotated across six dietary categories (Healthy, Vegan, Gluten-Free, Low-Carb, High-Protein, Low-Sugar). Using BERT-base-uncased, RoBERTa-base, and DistilBERT-base-uncased, we evaluate how NER-based preprocessing affects the performance (F1-score, Precision, Recall, and Hamming Loss) of Transformer-based multi-label classification models. RoBERTa-base shows significant improvements with NER in F1-score (∆F1 = +0.0147, p < 0.001), Precision, and Recall, while BERT and DistilBERT show no such gains. NER also leads to a slight but statistically significant increase in Hamming Loss across all models. These findings highlight the model dependent impact of NER on classification performance.

pdf bib abs
Balancing the Scales: Addressing Gender Bias in Social Media Toxicity Detection
Beatriz Botella-Gil | Juan Pablo Consuegra-Ayala | Alba Bonet-Jover | Paloma Moreda-Pozo

The detection of toxic content in social media has become a critical task in Natural Language Processing (NLP), particularly given its intersection with complex issues like subjectivity, implicit language, and cultural context. Among these challenges, bias in training data remains a central concern—especially as language models risk reproducing and amplifying societal inequalities. This paper investigates the interplay between toxicity and gender bias on Twitter/X by introducing a novel dataset of violent and non-violent tweets, annotated not only for violence but also for gender. We conduct an exploratory analysis of how biased data can distort toxicity classification and present algorithms to mitigate these effects through dataset balancing and debiasing. Our contributions include four new dataset splits—two balanced and two debiased—that aim to support the development of fairer and more inclusive NLP models. By foregrounding the importance of equity in data curation, this work lays the groundwork for more ethical approaches to automated violence detection and gender annotation.

pdf bib abs
“Simple-Tool”: A Tool for the Automatic Transformation of Spanish Texts into Easy-to-Read
Beatriz Botella-Gil | Isabel Espinosa-Zaragoza | Paloma Moreda Pozo | Manuel Palomar

Automatic Text Simplification (ATS) has emerged as a key area of research within the field of Natural Language Processing, aiming to improve access to information by reducing the linguistic complexity of texts. Simplification can be applied at various levels—lexical, syntactic, semantic, and stylistic—and must be tailored to meet the needs of different target audiences, such as individuals with cognitive disabilities, low-literacy readers, or non-native speakers. This work introduces a tool that automatically adapts Spanish texts into Easy-to-Read format, enhancing comprehension for people with cognitive or reading difficulties. The proposal is grounded in a critical review of existing Spanish-language resources and addresses the need for accessible, well-documented solutions aligned with official guidelines, reinforcing the potential of text simplification as a strategy for inclusion.

pdf bib abs
QuARK: LLM-Based Domain-Specific Question Answering Using Retrieval Augmented Generation and Knowledge Graphs
Edward Burgin | Sourav Dutta | Mingxue Wang

Retrieval Augmented Generation (RAG) has been pivotal in the utilization of Large Language Models (LLM) to improve the factuality of long-form question answering systems in industrial settings. Knowledge graphs (KG) represent a linking of disparate information sources that potentially yield useful information for mitigating the issues of insufficient knowledge and hallucination within the LLM-RAG pipeline. However, the creation of domain-specific KG is costly and usually requires a domain expert. To alleviate the above challenges, this work proposes QuARK, a novel domain-specific question answering framework to enhance the knowledge capabilities of LLM by integrating structured KG, thereby significantly reducing the reliance on the “generic” latent knowledge of LLMs. Here, we showcase how LLMs can be deployed to not only act in dynamic information retrieval and in answer generating frameworks, but also as flexible agents to automatically extract relevant entities and relations for the automated construction of domain-specific KGs. Crucially we propose how the pairing of question decomposition and semantic triplet retrieval within RAG can enable optimal subgraph retrieval. Experimental evaluations of our framework on financial domain public dataset, demonstrate that it enables a robust pipeline incorporating schema-free KG within a RAG framework to improve the overall accuracy by nearly 13%.

pdf bib abs
Classifying Emotions in Tweets from the Financial Market: A BERT-based Approach
Wesley Pompeu Carvalho | Norton Trevisan Roman

Behavioural finance emphasizes the relevance of investor sentiment and emotions in the pricing of financial assets. However, little research has examined how discrete emotions can be detected in text related to this domain, with extant work focusing mostly in sentiment instead. This study approaches this problem by describing a framework for emotion classification in tweets related to the stock market, written in Brazilian Portuguese. Emotion classifiers were then developed, based on Plutchik’s psychoevolutionary theory, by fine-tuning BERTimbau, a pre-trained BERT-based language model for Brazilian Portuguese, and applying it to an existing corpus of tweets, from the stock market domain, previously annotated with emotions. Each of Plutchik’s four emotional axes was modelled as a ternary classification problem. For each axis, 30 independent training iterations were executed using a repeated holdout strategy with different train/test splits in each iteration. In every iteration, hyperparameter tuning was performed via 10-fold stratified cross-validation on the training set to identify the best configuration. A final model was then retrained using the selected hyperparameters and evaluated on a hold-out test set, generating a distribution of macro-F1 scores in out-of-sample data. The results demonstrated statistically significant improvements over a stratified random baseline (Welch’s t-test, << 0.001 across all axes), with macro-F1 scores ranging from 0.50 to 0.61. These findings point to the feasibility of using transformer-based models to capture emotional nuance in financial texts written in Portuguese and provide a reproducible framework for future research.

pdf bib abs
Detecting Changes in Mental Health Status via Reddit Posts in Response to Global Negative Events
Zenan Chen | Judita Preiss | Peter A. Bath

Detecting population-level mental health responses to global negative events through social media language remains understudied, despite its potential for public health surveillance. While pretrained language models (PLMs) have shown promise in mental health detection, their effectiveness in capturing event-driven collective psychological shifts – especially across diverse crisis contexts – is unclear. We present a prototype evaluation of three PLMs for identifying population mental health dynamics triggered by real-world negative events. We introduce two novel datasets specifically designed for this task. Our findings suggest that DistilBERT is better suited to the noisier global negative events data, while MentalRoBERTa shows the validity of the method on the Covid-19 tidier data. SHAP interpretability analysis of 500 randomly sampled posts revealed that mental-health related vocabulary (anxiety, depression, worthless) emerged as the most influential linguistic markers for mental health classification.

pdf bib abs
APIO: Automatic Prompt Induction and Optimization for Grammatical Error Correction and Text Simplification
Artem Chernodub | Aman Saini | Yejin Huh | Vivek Kulkarni | Vipul Raheja

Recent advancements in large language models (LLMs) have enabled a wide range of natural language processing (NLP) tasks through simple prompt-based interactions. Consequently, several approaches have been proposed to engineer prompts that most effectively enable LLMs to perform a given task (e.g., chain-of-thought prompting). In settings with a well-defined metric to optimize model performance, Automatic Prompt Optimization (APO) methods have been developed to refine a seed prompt. Subsequently, we propose APIO, a simple but effective prompt induction and optimization approach for the tasks of Grammatical Error Correction (GEC) and Text Simplification, without relying on manually specified seed prompts. APIO achieves a new state-of-the-art performance for purely LLM-based prompting methods on these tasks. We make our data, code, prompts, and outputs publicly available.

Access to age-appropriate texts is critical for young readers’ literacy acquisition. For limited-resourced languages, such as Romanian, this area remains under-researched. As such, we present ongoing work on improving readability for old Romanian texts by applying Large Language Models (LLMs). First, we compiled and cleaned a comprehensive list of archaic and regional terms from lexicographic sources, including DEX online and printed dictionaries. The cleaning process involved duplicate removal, orthographic normalization, context-based filtering, and manual review. Key challenges included distinguishing archaic forms from rare or poetic ones, resolving polysemous entries, and managing inconsistent labeling across sources. Second, LLMs were utilized to validate the archaic and regional nature of identified terms and replace them with modern equivalents, while also determining the appropriate reading level for both original and modified versions. Results show that through the replacement of archaic and regional terms, the appropriate age for the modified texts decreases by approximately 0.5 years for texts extracted from textbooks and canonical writings.

pdf bib abs
ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities
Aleksis Ioannis Datseris | Sylvia Vassileva | Ivan K. Koychev | Svetla Boytcheva

This paper introduces a novel approach to position embeddings in transformer models, named “Exact Positional Embeddings” (ExPE). An absolute positional embedding method that can extrapolate to sequences of lengths longer than the ones it was trained on. Traditional transformer models rely on absolute or relative position embeddings to incorporate positional information into token embeddings, which often struggle with extrapolation to sequences longer than those seen during training. Our proposed method utilizes a novel embedding strategy that encodes exact positional information by overriding specific dimensions of the embedding vectors, thereby enabling a more precise representation of token positions. The proposed approach not only maintains the integrity of the original embeddings but also enhances the model’s ability to generalize to longer sequences. In causal language modeling, our ExPE embeddings significantly reduce perplexity compared to rotary and sinusoidal embeddings, when tested on sequences longer than those used in training. The code and supplementary materials can be found in

pdf bib abs
End-to-End Deep Learning for Named Entity Recognition and Relation Extraction in Gut-Brain Axis PubMed Abstracts
Aleksis Ioannis Datseris | Mario Kuzmanov | Ivelina Nikolova-Koleva | Dimitar Taskov | Svetla Boytcheva

This is a comparative study tackling named entity recognition and relation extraction from PubMed abstracts with focus on the gut-brain interplay. The proposed systems for named entity recognition cover a range of models and techniques from traditional gazetteer-based approaches, transformer-based approaches, transformer domain adaptation, large models pre-training as well as LLM prompting. The best performing model among these achieves 82.53% F1-score. The relation extraction task is addressed with ATLOP and LLMs and their best results reach F1 up to 63.80% on binary relation extraction, 89.40% on ternary tag-based relation extraction and 40.32% on ternary mention-based relation extraction.

pdf bib abs
Enabling On-Premises Large Language Models for Space Traffic Management
Enrique De Alba

Natural language processing systems leveraging on-premises large language models (LLMs) can translate natural language into structured JSON commands for Space Traffic Management (STM) systems. While cloud-based LLMs excel at this task, security constraints necessitate local deployment, requiring evaluation of smaller on-premises models. We demonstrate that resource-efficient 7B-parameter models can achieve high accuracy for STM command generation through a two-stage pipeline. Our pipeline first classifies objectives, then generates schemas. Empirically, we observe that initial classification accuracy strongly influences overall performance, with failures cascading to the generation stage. We demonstrate that quantization disproportionately increases structural errors compared to semantic errors across 405 objectives. The best quantized model (Falcon3-7B-GPTQ) shows a 3.45% accuracy drop, primarily from structural errors. Our findings highlight limitations in how model compression affects applications that require syntactic validity. More broadly, we explore the feasibility of LLM deployment in air-gapped environments while uncovering how quantization asymmetrically impacts structured output generation.

pdf bib abs
Top Ten from Lakhs: A Transformer-based Retrieval System for Identifying Previously Fact-Checked Claims across Multiple Languages
Srijani Debnath | Pritam Pal | Dipankar Das

The efficient identification of previously fact-checked claims across multiple languages is a challenging task. It can be time-consuming for professional fact-checkers even within a single language. It becomes much more difficult to perform manually when the claim and the fact-check may be in different languages. This paper presents a systematic approach for the retrieval of top-k relevant fact-checks for a given post in a monolingual and cross-lingual setup using two transformer-based fact-checked claim retrieval frameworks that share a common preprocessing pipeline but differ in their underlying encoder implementations: TIDE, a TensorFlow-based custom dual encoder applied to english-translated data, and PTEX, a PyTorch-based encoder operating on both english-translated and original-language inputs, and introduces a lightweight post-processing technique based on a textual feature: Keyword Overlap Count applied via reranking on top of the transformer-based frameworks. Training and evaluation on a large multilingual corpus show that the fine-tuned E5-Large-v2 model in the PTEX framework yields the best monolingual track performance, achieving an average Success@10 score of 0.8846 and the same framework model with post-processing technique achieves an average Success@10 score of 0.7393 which is the best performance in crosslingual track.

pdf bib abs
Evaluating Bilingual Lexicon Induction without Lexical Data
Michaela Denisová | Pavel Rychly

Bilingual Lexicon Induction (BLI) is a fundamental task in cross-lingual word embedding (CWE) evaluation, aimed at retrieving word translations from monolingual corpora in two languages. Despite the task’s central role, existing evaluation datasets based on lexical data often contain biases such as a lack of morphological diversity, frequency skew, semantic leakage, and overrepresentation of proper names, which undermine the validity of reported performance. In this paper, we propose a novel, language-agnostic evaluation methodology that entirely eliminates the dependency on lexical data. By training two sets of monolingual word embeddings (MWEs) using identical data and algorithms but with different weight initialisations, we enable the assessment on the BLI task without being affected by the quality of the evaluation dataset. We evaluate three baseline CWE models and analyse the impact of key hyperparameters. Our results provide a more reliable and bias-free perspective on CWE models’ performance.

pdf bib abs
Utilizing Large Language Models for Focused Conversational Assistants
Shruti Dhavalikar | Karthika Vijayan

A focused conversational assistant (FCA) realizes human-computer interaction bounded in a predefined scope of operation. With the advent of large language models (LLMs), it has become imperative to integrate them in conversational assistants (CAs). However, an LLM can become largely inaccurate in an FCA with multiple responsibilities, like information extraction, scope adherence and response generation. In this paper, we attempt to use an LLM for an FCA while constricting the scope of operation and maintaining a guided flow of conversation. We present a strategical combination of discriminative AI methods and generative AI models. Our methodology includes (i) a component of natural language understanding (NLU) operating discriminatively, (ii) a conditional intent-based routing of user messages to appropriate response generators, and (iii) response generators which are either custom ones or open sourced LLMs. The collation of these three strategies realizes a hybrid AI system, assisting FCA with adhering to the defined scope, maintaining context and dialogue flow.

pdf bib abs
AntiSemRO: Studying the Romanian Expression of Antisemitism
Anca Dinu | Andreea C. Moldovan | Adina Marincea

This study introduces an annotated dataset for the study of antisemitic hate speech and attitudes towards Jewish people in Romanian, collected from social media. We performed two types of annotation: with three simple tags (‘Neutral’, ‘Positive’, ‘Negative’), and with five more refined tags (Neutral’, ‘Ambiguous’, ‘Jewish Community’, Solidarity’, ‘Zionism’, ‘Antisemitism’). We perform several experiments on this dataset: clusterization, automatic classification, using classical machine learning models and transformer-based models, and sentiment analysis. The three classes clusterization produced well grouped clusters, while, as expected, the five classes clusterization produced moderately overlapping groups, except for ‘Antisemitism’, which is well away from the other four groups. We obtained a good F1-Score of 0.78 in the three classes classification task with Romanian BERT model and a moderate F1-score of 0.62 for the five classes classification task with a SVM model. The lowest negative sentiment was contained in the ‘Neuter’ class, while the highest was in ‘Zionism’, and not in ‘Antisemitism’, as expected. Also, the same ‘Zionism’ category displays the highest level of positive sentiment.

We propose a map of cognates and borrowings usage in Romance languages, having as a starting point the pairs of cognates and borrowings between any two of these idioms from RoBoCoP, the largest database built upon electronic dictionaries containing etymological information for Portuguese, Spanish, French, Italian and Romanian. Having in mind that words are used and evolve in language communities over time, on the basis of the pairs extracted from RoBoCoP, we determine how many of them occur and with what frequency in the context of the languages in use, based on three online parallel corpora that contain all five Romance languages: Wikipedia, Europarl – focusing on proceedings of the European Parliament and RomCro2.0 – containing literary texts in different languages, translated in Romance languages and Croatian.

pdf bib abs
Decoding Emotion in Ancient Poetry: Leveraging Generative Models for Classical Chinese Sentiment Analysis
Quanqi Du | Loic De Langhe | Els Lefever | Veronique Hoste

This study explores the use of generative language models for sentiment analysis of classical Chinese poetry, aiming to better understand emotional expression in literary texts. Using the FSPC dataset, we evaluate two models, Qwen-2.5 and LLaMA-3.1, under various prompting strategies. Initial experiments show that base models struggle with task-specific instructions. By applying different instruction tuning strategies with Low-Rank Adaptation (LoRA), we significantly enhance the models’ ability to follow task instructions and capture poetic sentiment, with LLaMA-3.1 achieving the best results (67.10% accuracy, 65.42% macro F1), demonstrate competitive performance against data-intensive, domain-adapted baselines. We further examine the effects of prompt language and multi-task learning, finding that English prompts outperform Chinese ones. These results highlight the promise of instruction-tuned generative models in sentiment analysis of classical Chinese poetry, and underscore the importance of prompt formulation in literary understanding tasks.

pdf bib abs
GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs
Marius Dumitran | Angela Dumitran | Alexandra Mihaela Danila

Large language models (LLMs) have revolutionised NLP, yet their pedagogical value for low‐resource languages remains unclear. We present GRILE, the first open benchmark of 1 151 multiple‐choice questions harvested from Romanian high‐stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state‐of‐the‐art multilingual and Romanian‐specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically faithful explanations. While Gemini 2·5 Pro reaches 83% accuracy, most open‐weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM 3 orthographic norms. All data, code and a public web demo are released to catalyse future research. Our findings expose open challenges for trustworthy educational NLP in low‐resource settings and establish GRILE as a new test‐bed for controllable explanation generation and evaluation.

pdf bib abs
PerSpaCor: Correcting Space and ZWNJ Errors in Persian Text with Transformer Models
Matin Ebrahimkhani | Ebrahim Ansari

Precision and clarity are essential qualities of written texts; however, Persian script, rooted in Arabic script, presents unique challenges that can compromise readability and correctness. In particular, the use of space and half-space—specifically the Zero Width Non-Joiner (ZWNJ)—is essential for proper character separation in Persian typography. This research introduces four models for correcting spacing and ZWNJ errors at the character level, thereby improving both readability and textual accuracy. By fine-tuning BERT-based transformer models on Bijankhan and Peykare corpora—comprising over 12.7 million preprocessed and annotated words—and formulating the task as sequence labeling, the best model achieves a macro-average F1-score of 97.26%. An interactive corrector that incorporates user input further improves performance to a macro-average F1-score of 98.38%. These results demonstrate the effectiveness of advanced language models in enhancing Persian text quality and highlight their applicability to real-world natural language processing tasks.

pdf bib abs
Reddit-V: A Virality Prediction Dataset and Zero-Shot Evaluation with Large Language Models
Samir El-amrany | Matthias R. Brust | Salima Lamsiyah | Pascal Bouvry

We present Reddit-V, a new dataset designed to advance research on social media virality prediction in natural language processing. The dataset consists of over 27,000 Reddit posts, each enriched with images, textual content, and pre-engagement metadata such as post titles, categories, sentiment scores, and posting times. As an initial benchmark, we evaluate several instruction-tuned large language models (LLMs) in a zero-shot setting, prompting them with post titles and metadata to predict post virality. We then fine-tune two multimodal models, CLIP and IDEFICS, to assess whether incorporating visual context enhances predictive performance. Our results show that zero-shot LLMs perform poorly, whereas the fine-tuned multimodal models achieve better performance. Specifically, CLIP outperforms the best-performing zero-shot LLM (CodeLLaMA) by 3%, while IDEFICS achieves an 7% improvement over the same baseline, highlighting the importance of visual features in virality prediction. We release the Reddit-V dataset and our evaluation results to facilitate further research on multimodal and text-based virality prediction. Our dataset and code will be made publicly available on Github

pdf bib abs
Simplifications Are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions
Lukas Ellinger | Miriam Anschütz | Georg Groh

Large Language Models (LLMs) can provide accurate word definitions and explanations for any context. However, the scope of the definition changes for different target groups, like children or language learners. This is especially relevant for homonyms—words with multiple meanings—where oversimplification might risk information loss by omitting key senses, potentially misleading users who trust LLM outputs. We investigate how simplification impacts homonym definition quality across three target groups: Normal, Simple, and ELI5. Using two novel evaluation datasets spanning multiple languages, we test DeepSeek v3, Llama 4 Maverick, Qwen3-30B A3B, GPT-4o mini, and Llama 3.1 8B via LLM-as-Judge and human annotations. Our results show that simplification drastically degrades definition completeness by neglecting polysemy, increasing the risk of misunderstanding. Fine-tuning Llama 3.1 8B with Direct Preference Optimization substantially improves homonym response quality across all prompt types. These findings highlight the need to balance simplicity and completeness in educational NLP to ensure reliable, context-aware definitions for all learners.

In this work, we propose a Multi-LLM summarization framework, and investigate two different multi-LLM strategies including centralized and decentralized. Our multi-LLM summarization framework has two fundamentally important steps at each round of conversation: generation and evaluation. These steps are different depending on whether our multi-LLM decentralized summarization is used or centralized. In both our multi-LLM decentralized and centralized strategies, we have k different LLMs that generate diverse summaries of the text. However, during evaluation, our multi-LLM centralized summarization approach leverages a single LLM to evaluate the summaries and select the best one whereas k LLMs are used for decentralized multi-LLM summarization. Overall, we find that our multi-LLM summarization approaches significantly outperform the baselines that leverage only a single LLM by up to 3x. These results indicate the effectiveness of multi-LLM approaches for summarization.

pdf bib abs
EDAudio: Easy Data Augmentation for Dialectal Audio
Lea Fischbach | Akbar Karimi | Alfred Lameli | Lucie Flek

We investigate lightweight and easily applicable data augmentation techniques for dialectal audio classification. We evaluate four main methods, namely shifting pitch, interval removal, background noise insertion and interval swap as well as several subvariants on recordings from 20 German dialects. Each main method is tested across multiple hyperparameter combinations, inlcuding augmentation length, coverage ratio and number of augmentations per original sample. Our results show that frequency-based techniques, particularly frequency masking, consistently yield performance improvements, while others such as time masking or speaker-based insertion can negatively affect the results. Our comparative analysis identifies which augmentations are most effective under realistic conditions, offering simple and efficient strategies to improve dialectal speech classification.

pdf bib abs
Authorship Verification Using Cloze Test with Large Language Models
Tomáš Foltýnek | Tomáš Kancko | Pavel Rychly

Assignment outsourcing, also known as contract cheating, occurs when a student outsources an assessment task or a part of it to a third party. It has been one of the most pressing ethical issues in university education and was further exacerbated by the wide availability of chatbots based on large language models. We propose a method that has the potential to verify the authorship of a document in question by filling in a cloze test. A close test with 10 items selected by our method can be used as a classifier with an accuracy of 0.988 and a F₁ score of 0.937. We also describe a general method for building a cloze-test-based classifier when the probability of authors and non-authors correctly filling in cloze items is known.

pdf bib abs
A Culturally-Rich Romanian NLP Dataset from “Who Wants to Be a Millionaire?” Videos
Alexandru Ganea | Antonia-Adelina Popovici | Marius Dumitran

Large Language Models (LLMs) demonstrate varying performance across languages and cultural contexts. This study introduces a novel, culturally-rich, multilingual dataset derived from video recordings of the Romanian game show “Who Wants to Be a Millionaire?” (Vrei să fii Milionar?). We employed an innovative process combining optical character recognition (OCR), automated text extraction, and manual verification to collect question-answer pairs, enriching them with metadata including question domain (e.g., biology, history), cultural relevance (Romanian-specific vs. international), and difficulty. Benchmarking state-of-the-art LLMs, including Romanian-adapted models, on this dataset revealed significant performance disparities: models consistently achieve higher accuracy (80-95%) on international questions compared to Romanian-specific cultural questions (50-75%). We further investigate these differences through experiments involving machine translation of Romanian questions into English and cross-lingual tests using a comparable dataset in French. Our findings underscore the impact of cultural context and data source on LLM performance and offer practical insights for building robust, culturally-aware multilingual NLP systems, especially in educational domains. The dataset is publicly available.

pdf bib abs
Graph-based RAG for Low-Resource Aromanian–Romanian Translation
Laurentiu G. Ghetoiu | Sergiu Nisioi

Aromanian, a linguistically and culturally significant yet low-resource Romance language, poses substantial challenges in computational linguistic research due to its limited NLP resources and non-standardized orthography. In this paper, we present an experimental study aimed at translating Aromanian texts into Romanian using a variety of modern NLP methodologies. We leverage two key resources: a parallel corpus consisting of approximately 3,000 sentence-aligned short stories and a dictionary of over 28,000 Aromanian-Romanian word pairs. Our approaches include Retrieval-Augmented Generation (RAG) supported by a graph-based alignment database, fine-tuning multilingual transformer models (specifically Meta’s NLLB), and parameter-efficient fine-tuning techniques such as LoRA applied to LLaMA-derived models. Evaluations using standard metrics (BLEU, chrF) demonstrate varied effectiveness across these methodologies, highlighting the strong performance of NLLB for general translation tasks, while RAG excels in translating familiar content. Our findings underline the complexities inherent in low-resource language translation and provide valuable insights into effective digital preservation and NLP adaptation strategies for underrepresented languages.

pdf bib abs
Differential Robustness in Transformer Language Models: Empirical Evaluation under Adversarial Text Attacks
Taniya Gidatkar | Oluwaseun Ajao | Matthew Shardlow

This study evaluates the resilience of large language models (LLMs) against adversarial attacks, specifically focusing on Flan-T5, BERT, and RoBERTa-Base. Using systematically designed adversarial tests through TextFooler and BERTAttack, we found significant variations in model robustness. RoBERTa-Base and Flan-T5 demonstrated remarkable resilience, maintaining accuracy even when subjected to sophisticated attacks, with attack success rates of 0%. In contrast, BERT-Base showed considerable vulnerability, with TextFooler achieving a 93.75% success rate in reducing model accuracy from 48% to just 3%. Our research reveals that while certain LLMs have developed effective defensive mechanisms, these safeguards often require substantial computational resources. This study contributes to the understanding of LLM security by identifying existing strengths and weaknesses in current safeguarding approaches and proposes practical recommendations for developing more efficient and effective defensive strategies

pdf bib abs
An Annotation Scheme for Factuality and Its Application to Parliamentary Proceedings
Gili Goldin | Shira Wigderson | Ella Rabinovich | Shuly Wintner

Factuality assesses the extent to which a language utterance relates to real-world information; it determines whether utterances correspond to facts, possibilities, or imaginary situations, and as such, it is instrumental for fact checking. Factuality is a complex notion that relies on multiple linguistic signals, and has been studied in various disciplines. We present a complex, multi-faceted annotation scheme of factuality that combines concepts from a variety of previous works. We developed the scheme for Hebrew, but we trust that it can be adapted to other languages. We also present a set of almost 5,000 sentences in the domain of parliamentary discourse that we manually annotated according to this scheme. We report on inter-annotator agreement, and experiment with various approaches to automatically predict (some features of) the scheme, in order to extend the annotation to a large corpus.

pdf bib abs
Can We Predict Innovation? Narrow Experts versus Competent Generalists
Amir Hazem | Motohashi Kazuyuki

In this paper, we investigate the role of large language models in predicting innovation. We contrast two main paradigms: i) narrow experts: which consists of supervised and semi-supervised models trained or fine-tuned on a specific task and ii) competent generalists: which consists of large language models with zero-shot and few-shots learning. We define the task of innovation modeling and present the first attempt to understand the transformation from research to innovation. We focus on product innovation which can be defined as the process of transforming technology to a product or service and bring it to the market. Our extensive empirical evaluation shows that most existing pretrained models are not suited and perform poorly on the innovation modeling task. We also show that injecting research information helps improving the alignment from technology to the market. Finally, we propose a new methodology and fine-tuning strategies that achieve significant performance boosts over the baselines.

pdf bib abs
Arabic to Romanian Machine Translation: A Case Study on Distant Language Pairs
Ioan Alexandru Hirica | Stefana Arina Tabusca | Sergiu Nisioi

This paper investigates machine translation between two linguistically distant languages, Arabic and Romanian, with a focus on translating from Arabic to Romanian. Dataset cleaning techniques are addressed, offering insights on the impact of translation for a language pair with limited resources. Using publicly available corpora (e.g., OPUS) and manually translated diplomatic texts, filtering methods are applied, such as duplicate removal, embedding similarity analysis (LEALLA), and Large Language Model (LLM)-based validation (Gemini-flash-002). Transformer models are trained and evaluated with diverse preprocessing pipelines that incorporate subword tokenization. Additionally, the performance of a fine-tuned LLM is assessed for this task and is compared to their pre-trained counterparts. Despite computational limitations, the results emphasize the importance of targeted preprocessing and model adaptation in improving Arabic-Romanian translation quality.

Named entity recognition from financial text is challenging because of word ambiguity, huge quantity of unknown corporation names, and word abbreviation compared to nonfinancial text. However, models often treat named entities in a linear sequence fashion, which might obscure the model’s ability to capture complex hierarchical relationships among the entities. In this paper, we proposed a novel named entity recognition model BiGCAT, which integrates large language model (LLM) embeddings with graph-based representation where the contextual information captured by the language model and graph representation learning can complement each other. The method builds a spanning graph with nodes representing word spans and edges weighted by LLM embeddings, optimized using a combination of graph neural networks, specifically a graph-convolutional network (GCN) and a graph-attention network (GAT). This approach effectively captures the hierarchical dependencies among the spans. Our proposed model outperformed the state-of-the-art by 10% and 18% on the two publicly available datasets FiNER-ORD and FIN, respectively, in terms of weighted F1 score. The code is available at: https://github.com/Akram1871/BiGCAT-RANLP-2025.

pdf bib abs
Measuring How (Not Just Whether) VLMs Build Common Ground
Saki Imai | Mert Inan | Anthony B. Sicilia | Malihe Alikhani

Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.

pdf bib abs
SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation
Saki Imai | Mert Inan | Anthony B. Sicilia | Malihe Alikhani

Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language—such as facial expressions, spatial grammar, and prosody—but also makes it hard to pinpoint whether evaluation errors come from sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.

pdf bib abs
Alignment of Historical Manuscript Transcriptions and Translations
Maarten Janssen | Piroska Lendvai | Anna Jouravel

Using an XML-based framework, we compiled a gold standard for alignments in five primary as well as derived texts, related to De Lepra ad Sistelium by Methodius Olympius. These comprise diplomatic transcripts, editions, and translations of this work, involving both historical and modern languages. Using the TEITOK corpus platform, we created sentence-level gold standard alignments for our parallel resp. comparable texts, and applied both neural and classical alignment methods (SentenceBERT, Hunalign, Awesome-Align). We evaluated the methods in terms of Alignment Error Rate. We show that for alignment of our historical texts, Hunalign performs better than deep learning based methods.

pdf bib abs
Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Nevidu Jayatilleke | Nisansa de Silva

Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.

pdf bib abs
Detecting Gender Stereotypical Language Using Model-agnostic and Model-specific Explanations
Manuela Nayantara Jeyaraj | Sarah Jane Delany

AI models learn gender-stereotypical language from human data. So, understanding how well different explanation techniques capture diverse language features that suggest gender stereotypes in text can be useful in identifying stereotypes that could potentially lead to gender bias. The influential words identified by four explanation techniques (LIME, SHAP, Integrated Gradients (IG) and Attention) in a gender stereotype detection task were compared with words annotated by human evaluators. All techniques emphasized adjectives and verbs related to characteristic traits and gender roles as the most influential words. LIME was best at detecting explicitly gendered words, while SHAP, IG and Attention showed stronger overall alignment and considerable overlap. A combination of these techniques, combining the strengths of model-agnostic and model-specific explanations, performs better at capturing gender-stereotypical language. Extending to hate speech and sentiment prediction tasks, annotator agreement suggests these tasks to be more subjective while explanation techniques can better capture explicit markers in hate speech than the more nuanced gender stereotypes. This research highlights the strengths of different explanation techniques in capturing subjective gender stereotype language in text.

pdf bib abs
Reversing Causal Assumptions: Explainability in Online Sports Dialogues
Asteria Kaeberlein | Malihe Alikhani

Prior XAI research often assumes inputs must be “causes” and outputs must be “effects”, severely limiting applicability to analyzing behaviors that emerge as reactions or consequences. Many linguistic tasks, such as dialogues and conversations, involve such behaviors. To address this, we propose that the assumed causality from inputs to outputs can be reversed and still remain valid by using outputs that cause changes in features. We show how this enables analysis of complex feature sets through simpler metrics, propose a framework that is generalizable to most linguistic tasks, and highlight best practices for applying our framework. By training a predictive model from complex effects to simple causes, we apply feature attributions to estimate how the inputs change with the outputs. We demonstrate an application of this by studying sports fans’ comments made during a game and compare those comments to a simpler metric, win probability. We also expand on a prior study of intergroup bias, demonstrating how our framework can uncover behaviors that other XAI methods may overlook. We discuss the implications of these findings for advancing interpretability in computational linguistics and improving data-driven-decision-making in social contexts.

pdf bib abs
How LLMs Influence Perceived Bias in Journalism
Asteria Kaeberlein | Malihe Alikhani

As the use of generative AI tools in journalistic writing becomes more common, reporters have expressed growing concerns about how it may introduce bias to their works. This paper investigates how the integration of large language models (LLMs) into journalistic writing, both as editors and independent ‘authors’, can alter user perception of bias in media. We show novel insights into how human perception of media bias differs from automatic evaluations. Through human evaluations comparing original human-authored articles, AI-edited articles, and AI-generated articles, we show that while LLMs rarely introduce new bias and often trend towards neutrality, this supposedly ‘safe’ behavior can have harmful impacts. This is most observable in sensitive human rights contexts, where the AI’s neutral and measured tone can reduce the representation of relevant voices and present misinformation in a more convincing manner. Furthermore, we demonstrate the existence of previously unidentified patterns that existing automated bias detection methods fail to accurately capture. We underscore the critical need for human-centered evaluation frameworks in AI-assisted journalism by introducing human evaluations and contrasting against a state-of-the-art automated bias detection system.

pdf bib abs
Prompting Techniques for Reducing Social Bias in LLMs through System 1 and System 2 Cognitive Processes
Mahammed Kamruzzaman | Gene Louis Kim

Dual process theory posits that human cognition arises via two systems. System 1, which is a quick, emotional, and intuitive process, which is subject to cognitive biases, and System 2, is a slow, onerous, and deliberate process. Prior research in LLMs found that using chain-ofthought (CoT) prompting in LLMs, which has been often compared to System 2 reasoning, can lead to reduced gender bias. Along these lines, we investigate the relationship between bias, CoT prompting, a direct debiasing, and dual process theory modeling in LLMs. We compare zero-shot CoT, debiasing, and dual process theory-based prompting strategies on two bias datasets spanning nine different social bias categories. We incorporate human and machine personas to determine whether LLM modeling of the effects of dual process theory exist independent of explicit persona models or are tied to the LLM’s modeling of human-like generation. We find that a human persona, debiasing, System 2, and CoT prompting all tend to reduce social biases in LLMs, though the best combination of features depends on the exact model and bias category—resulting in up to a 33 percent drop in stereotypical judgments by an LLM.

pdf bib abs
Performance Gaps in Acted and Naturalistic Speech: Insights from Speech Emotion Recognition Strategies on Customer Service Calls
Lily Kawaoto | Hita Gupta | Ning Yu | Daniel Dakota

Current research in speech emotion recognition (SER) often uses speech data produced by actors which does not always best represent naturalistic speech. This can lead to challenges when applying models trained on such data sources to real-world data. We investigate the application of SER models developed on acted data and more naturalistic podcasts to service call data, with a particular focus on anger detection. Our results indicate that while there is noticeable performance degradation of models trained on acted data to the naturalistic data, weighted multimodal models developed on existing SER datasets–both acted and natural–show promise, but are limited in ability to recognize emotions that do not discernibly cluster.

Cyberbullying (CB) presents a pressing threat, especially to children, underscoring the urgent need for robust detection systems to ensure online safety. While large-scale datasets on online abuse exist, there remains a significant gap in labeled data that specifically reflects the language and communication styles used by children. The acquisition of such data from vulnerable populations, such as children, is challenging due to ethical, legal and technical barriers. Moreover, annotating these datasets relies heavily on human effort, which not only strains resources but also raises significant concerns due to annotators’ exposure to harmful content. In this paper, we address these challenges by leveraging Large Language Models (LLMs) to generate synthetic data and labels. Our experiments demonstrate that synthetic data enables BERT-based CB classifiers to achieve performance close to that of those trained on fully authentic datasets (75.8% vs. 81.5% accuracy). Additionally, LLMs can effectively label authentic yet unlabeled data, allowing BERT classifiers to attain a comparable performance level (79.1% vs. 81.5% accuracy). These results highlight the potential of LLMs as a scalable, ethical, and cost-effective solution for generating data for CB detection.

We introduce FreeTxt, a free and open-source web-based tool designed to support the analysis and visualisation of multilingual qualitative survey data, with a focus on low-resource languages. Developed in collaboration with stakeholders, FreeTxt integrates established techniques from corpus linguistics with modern natural language processing methods in an intuitive interface accessible to non-specialists. The tool currently supports bilingual processing and visualisation of English and Welsh responses, with ongoing extensions to other languages such as Vietnamese. Key functionalities include semantic tagging via PyMUSAS, multilingual sentiment analysis, keyword and collocation visualisation, and extractive summarisation. User evaluations with cultural heritage institutions demonstrate the system’s utility and potential for broader impact.

pdf bib abs
GPT-Based Lexical Simplification for Multi-Word Expressions Using Prompt Engineering
Sardar Khan Khayamkhani | Matthew Shardlow

Multiword Lexical Simplification (MWLS) is the task of replacing a complex phrase in a sentence with a simpler alternative. Whereas previous approaches to MWLS made use of the BERT language model, we make use of the Generative Pre-trained Transformer architecture. Our approach employs Large Language Models in an auto-regressive format, making use of prompt engineering and few-shot learning to develop new strategies for the MWLS task. We experiment with several GPT-based models and differing experimental settings including varying the number of requested examples, changing the base model type, adapting the prompt and zero-shot, one-shot and k-shot in-context learning. We show that a GPT-4o model with k-shot in-context learning (k=6) demonstrates state-of-the-art performance for the MWLS1 dataset with NDCG=0.3143, PREC@5=0.1048, beating the previous Bert-based approach by a wide margin on several metrics and consistently across subsets. Our findings indicate that GPT-based models are superior to BERT-based models for the MWLS task.

pdf bib abs
Instruction-Tuning LLaMA for Synthetic Medical Note Generation in Swedish and English
Lotta Kiefer | Jesujoba Alabi | Thomas Vakili | Hercules Dalianis | Dietrich Klakow

The increasing capabilities of large language models (LLMs) have unlocked transformative potential for medical applications, but privacy constraints limit access to high-quality training data from electronic health records (EHRs). In response, we propose a framework to generate synthetic EHRs by instruction-tuning an LLM using descriptions of diagnosis codes. We show that this framework overcomes problems of prior approaches, such as diversity reduction and medical incoherence, while maintaining strong privacy protections. Utility was measured by training models to predict diagnosis codes for EHRs. Real data still has higher utility, but synthetic data approaches real data results with increasing dataset size. The differences in utility were most likely due to noise in the synthetic data. A user study involving medical professionals confirmed no significant loss in readability or medical coherence compared to the real HRs, even though inter-annotator agreement is low. These findings establish synthetic EHRs as a viable alternative for privacypreserving and scalable clinical NLP applications. We release our code on GitHub.

pdf bib abs
Output Trend Analysis in Semantic Classification of Katakana Words Using a Large Language Model
Kazuki Kodaki | Minoru Sasaki

In semantic classification of katakana words using a large language model (LLM), semantic divergences from the meanings of original English words such as Wasei-Eigo(Japanese-made English) may affect the accuracy of the model. In order to accurately capture the meaning of foreign words, we fine-tuned the LLM using data extracted from the BCCWJ(Balanced Corpus of Contemporary Written Japanese), analyzed the current accuracy and output trend of semantic classification for katakana words, and explored ways to improve the accuracy. The results of several experiments showed that fine-tuning was not effective for zero-shot learning, but in contrast, fine-tuning improved accuracy by about 10% for few-shot learning. Further analysis of the visualized data suggests trends related to words and meanings that the model struggles to classify correctly.

pdf bib abs
Domain Knowledge Distillation for Multilingual Sentence Encoders in Cross-lingual Sentence Similarity Estimation
Risa Kondo | Hiroki Yamauchi | Tomoyuki Kajiwara | Marie Katsurai | Takashi Ninomiya

We propose a domain adaptation method for multilingual sentence encoders. In domains requiring a high level of expertise, such as medical and academic, domain-specific pre-trained models have been released in each language. However, there is no its multilingual version, which prevents application to cross-lingual information retrieval. Obviously, multilingual pre-training with developing in-domain corpora in each language is costly. Therefore, we efficiently develop domain-specific cross-lingual sentence encoders from existing multilingual sentence encoders and domain-specific monolingual sentence encoders in each language. Experimental results on translation ranking in three language pairs with different domains reveal the effectiveness of the proposed method compared to baselines without domain adaptation and existing domain adaptation methods.

Large language models (LLMs) have advanced natural language processing (NLP) skills such as through next-token prediction and self-attention, but their ability to integrate broad context also makes them prone to incorporating irrelevant information. Prior work has focused on semantic leakage—bias introduced by semantically irrelevant context.In this paper, we introduce expression leakage, a novel phenomenon where LLMs systematically generate sentimentally charged expressions that are semantically unrelated to the input context. To analyse the expression leakage, we collect a benchmark dataset along with a scheme to automatically generate a dataset from free-form text from common-crawl. In addition, we propose an automatic evaluation pipeline that correlates well with human judgment, which accelerates the benchmarking by decoupling from the need of annotation for each analysed model. Our experiments show that, as the model scales in the parameter space, the expression leakage reduces within the same LLM family. On the other hand, we demonstrate that expression leakage mitigation requires specific care during the model building process, and cannot be mitigated by prompting. In addition, our experiments indicate that, when negative sentiment is injected in the prompt, it disrupts the generation process more than the positive sentiment, causing a higher expression leakage rate.

pdf bib abs
Fusion of Object-Centric and Linguistic Features for Domain-Adapted Multimodal Learning
Jordan Konstantinov Kralev

Modern multimodal systems often struggle to link domain-specific visual content with textual descriptions, especially when object recognition is limited to general categories (e.g. COCO classes) and lacks customised adaptation to language models. In this paper, we present a novel framework that integrates a domain-specific adapted Detectron2 model into predefined models via a trainable projection layer, enabling precise crossmodal adaptation for specialised domains. Our approach extends Detectron2’s recognition capabilities to new categories by fine-tuning on multi-domain datasets, while a lightweight linear projection layer maps region-based visual features to the model’s embedding space without completely retraining the model. We evaluated the framework for domain-specific image captioning. The presented approach provides a scalable design for combining domain-specific visual recognition with language inference, with applications in domains that require fine-grained multimodal understanding.

pdf bib abs
Multi-Agent Reinforcement Learning for Interactive Code Debugging with Human Feedback and Memory
Anjana Krishnamoorthy | Kartik Ivatury | Benyamin Ahmadnia

This paper introduces an interactive Python debugging framework that combines multi-agent reinforcement learning, Natural Language Processing (NLP), and long-term memory. Two Proximal Policy Optimization (PPO) agents specialize in syntax and logic errors, generating candidate fixes that developers can accept, reject, or refine. A BERT-based module encodes natural language feedback into dense embeddings and quality scores, which shape reward signals for Reinforcement Learning from Human Feedback (RLHF). To support personalization, the system uses dual FAISS indices to retrieve past fixes based on code-error pairs and developer explanations. Evaluated on a synthetic dataset of 200 Python programs, our approach achieves an 88% syntax-fix rate and 45% logic-fix rate within five suggestions—outperforming one-shot Large Language Model (LLM) baselines. In addition, the system improves the quality of the explanation, as measured by BLEU, ROUGE, and CodeBLEU. By integrating multi-agent specialization, linguistic feedback, and memory-driven retrieval, our framework delivers a more efficient, adaptive, and developer-aligned debugging experience.

pdf bib abs
Integrating Large Language Models for Comprehensive Study and Sentiment Analysis of Student Feedback
Jana Kuzmanova | Katerina Zdravkova | Ivan Chorbev

n academic year 2023/24, our university collected over 200,000 student feedback responses evaluating teaching staff and course experiences. The survey included demographic data, 10 Likert scale questions on teaching quality, a question on student attendance, and three open-ended questions about student experiences. This paper explores the integration of Large Language Models (LLM) Gemini for sentiment analysis to evaluate students’ feedback quantitatively and qualitatively. We statistically analyze the Likert scale responses. To address the linguistic diversity of open-ended responses, written in both Cyrillic and Latin scripts with standard and slang expressions in several languages, we employed a preprocessing step using Gemini to standardize the input for further analyses. Sentiment analysis aims to identify various sentiment nuances, including direct answers, contradiction, multipolarity, mixed sentiment, sarcasm, irony, negation, ambiguity, understatement, and over-exaggeration. By comparing these insights with quantitative feedback, we aim to uncover deeper patterns between student perceptions and teaching performance. While the focus is on sentiment analysis, we also discuss the evaluation of the results provided by LLM. For the sentiments with less answers, the evaluation of GenAI was done manually. For the sentiments with more than 1000 entries, we suggest a semi-automated approach for sentiment categorization, to be explored in future work. This study enhances our understanding of student feedback through advanced computational methods, providing a more nuanced perspective on teaching quality and student satisfaction.

pdf bib abs
Task-Oriented Dialogue Systems through Function Calling
Tiziano Labruna | Giovanni Bonetta | Bernardo Magnini

Large Language Models (LLMs) have demonstrated remarkable capabilities in generating dialogues and handling a broad range of user queries. However, their effectiveness as end-to-end Task-Oriented Dialogue (TOD) systems remains limited due to their reliance on static parametric memory, which fails to accommodate evolving knowledge bases (KBs). This paper investigates a scalable function-calling approach that enables LLMs to retrieve only the necessary KB entries via schema-guided queries, rather than embedding the entire KB into each prompt. This selective retrieval strategy reduces prompt size and inference time while improving factual accuracy in system responses. We evaluate our method on the MultiWOZ 2.3 dataset and compare it against a full-KB baseline that injects the entire KB into every prompt. Experimental results show that our approach consistently outperforms the full-KB method in accuracy, while requiring significantly fewer input tokens and considerably less computation time, especially when the KB size increases.

pdf bib abs
When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively
Tiziano Labruna | Jon Ander Campos | Gorka Azkune

In this paper, we demonstrate how Large Language Models (LLMs) can effectively learn to use an off-the-shelf information retrieval (IR) system specifically when additional context is required to answer a given question. Given the performance of IR systems, the optimal strategy for question answering does not always entail external information retrieval; rather, it often involves leveraging the parametric memory of the LLM itself. Prior research has identified this phenomenon in the PopQA dataset, wherein the most popular questions are effectively addressed using the LLM’s parametric memory, while less popular ones require IR system usage. Following this, we propose a tailored training approach for LLMs, leveraging existing open-domain question answering datasets. Here, LLMs are trained to generate a special token, <RET$>, when they do not know the answer to a question. Our evaluation of the Adaptive Retrieval LLM (Adapt-LLM) on the PopQA dataset showcases improvements over the same LLM under three configurations: (i) retrieving information for all the questions, (ii) using always the parametric memory of the LLM, and (iii) using a popularity threshold to decide when to use a retriever. Through our analysis, we demonstrate that Adapt-LLM is able to generate the <RET> token when it determines that it does not know how to answer a question, indicating the need for IR, while it achieves notably high accuracy levels when it chooses to rely only on its parametric memory.

pdf bib abs
Trust but Verify: A Comprehensive Survey of Faithfulness Evaluation Methods in Abstractive Text Summarization
Salima Lamsiyah | Aria Nourbakhsh | Christoph Schommer

Abstractive text summarization systems have advanced significantly with the rise of neural language models. However, they frequently suffer from issues of unfaithfulness or factual inconsistency, generating content that is not verifiably supported by the source text. This survey provides a comprehensive review of over 40 studies published between 2020 and 2025 on methods for evaluating faithfulness in abstractive summarization. We present a unified taxonomy that covers human evaluation techniques and a variety of automatic metrics, including question answering (QA)-based methods, natural language inference (NLI)-based methods, graph-based approaches, and large language model (LLM)-based evaluation. We also discuss meta-evaluation protocols that assess the quality of these metrics. In addition, we analyze a wide range of benchmark datasets, highlighting their design, scope, and relevance to emerging challenges such as long-document and domain-specific summarization. In addition, we identify critical limitations in current evaluation practices, including poor alignment with human judgment, limited robustness, and inefficiencies in handling complex summaries. We conclude by outlining future directions to support the development of more reliable, interpretable, and scalable evaluation methods. This work aims to support researchers in navigating the rapidly evolving landscape of faithfulness evaluation in summarization.

pdf bib abs
Evaluating Large Language Models on Multiword Expressions in Multilingual and Code-Switched Contexts
Frances Adriana Laureano De Leon | Asim Abbas | Harish Tayyar Madabushi | Mark Lee

Multiword expressions, characterised by non-compositional meanings and syntactic irregularities, are an example of nuanced language. These expressions can be used literally or idiomatically, leading to significant changes in meaning. Although large language models perform well on many tasks, their ability to handle subtle linguistic phenomena remains unclear. This study examines how state-of-the-art models process the ambiguity of potentially idiomatic multiword expressions, particularly in less frequent contexts where memorisation is less likely to help. By evaluating models in Portuguese, Galician, and English, and introducing a new code-switched dataset and task, we show that large language models, despite their strengths, have difficulty handling nuanced language. In particular, we find that the latest models, including GPT-4, fail to outperform the xlm-roBERTa-base baselines in both detection and semantic tasks, with especially poor performance on the novel tasks we introduce, despite its similarity to existing tasks. Overall, our results demonstrate that multiword expressions, especially those that are ambiguous, continue to be a challenge to models. We provide open access to our datasets, prompts and model responses.

pdf bib abs
Instruction Finetuning to Attribute Language Stage, Dialect, and Provenance Region to Historical Church Slavic Texts
Piroska Lendvai | Uwe Reichel | Anna Jouravel | Achim Rabus | Elena Renje

Our study addresses domain-specific text provenance classification for the historical Church Slavic language. The downstream task is to attribute the language stage and its dialectal and regional varieties to texts compiled from newly curated sources, including digitally unpublished manuscripts, in addition to established Church Slavic resources from the Universal Dependencies Treebank. We aim to harmonize previously used tag sets pertaining to textual provenance, and construct a new, hierarchical, multi-layer provenance labeling scheme. For the classification task, we finetune Vikhr (Nikolich et al., 2004), a generative LLM with knowledge of modern Russian, with the instruction to generate labels to classify the provenance of sentence-level text units. Besides gold standard manuscript transcriptions, we test the finetuned model on character-corrupted data that emulate the quality of noisy, handwritten text recognition material. The experiments show that the Vikhr base model has low provenance attribution knowledge of Church Slavic, whereas our finetuned model achieves above .9 F-scores on Language stage labeling and Dialect labeling, and above .8 F-score on generating the label that jointly classifies all three provenance layers. The task of classifying the fine-grained geographical region from which a manuscript originates proves harder (but still performs above .8), and is negatively impacted by character level noise injection.

pdf bib abs
MariATE: Automatic Term Extraction Using Large Language Models in the Maritime Domain
Shijie Liu | Els Lefever | Veronique Hoste

This study presents a comprehensive evaluation of Large Language Models (LLMs) for automatic term extraction in the maritime safety domain. The research examines the zero-shot performance of seven state-of-the-art LLMs, including both open-source and closed-source models, and investigates terminology annotation strategies for optimal coverage. Nested annotation captures both complete technical expressions and their constituent components, while full-term annotation focuses exclusively on maximal-length terms. Experimental results demonstrate Claude-3.5-Sonnet’s superior performance (F1-score of 0.80) in maritime safety terminology extraction, particularly in boundary detection capabilities. Error analysis reveals three primary challenges: distinguishing contextual descriptions from legitimate terminology, handling complex multi-word expressions, and identifying maritime safety operational and navigational terms. Analysis of annotation strategies reveals that the full-term annotation approach achieves 95.24% coverage of unique terms compared to the nested annotation approach. The additional 4.76% of terms identified through nested annotation represents subcomponents of larger technical expressions. These findings advance the understanding of LLMs’ capabilities in specialized terminology extraction and provide empirical evidence supporting the sufficiency of full-term annotation for comprehensive terminology coverage in domain-specific applications.

pdf bib abs
Exploring the Usage of Knowledge Graphs in Identifying Human and LLM-Generated Fake Reviews
Ming Liu | Massimo Poesio

The emergence of large language models has led to an explosion of machine-generated fake reviews. Although distinguishing between human and LLM-generated fake reviews is an area of active research, progress is still needed. One aspect which makes current LLM-generated fake reviews easier to recognize is that LLMs–in particular the smaller ones–lack domain-related knowledge. The objective of this work is to investigate whether large language models can produce more realistic artificial reviews when supplemented with knowledge graph information, thus resulting in a more challenging training dataset for human and LLM-generated fake reviews detectors. We propose a method for generating fake reviews by providing knowledge graph information to a llama model, and used it to generate a large number of fake reviews which used to fine tune a state-of-the-art human and LLM-generated fake reviews detection system. Our results show that when knowledge graph information is provided as part of the input, the accuracy of the model is improved by 0.24%. When the knowledge graph is used as an embedding layer and combined with the existing input embedding layer, the accuracy of the detection model is improved by 1.279%.

pdf bib abs
The Evaluation of Medical Terms Complexity Using Lexical Features and Large Language Models
Liliya Makhmutova | Giancarlo Dondoni Salton | Fernando Perez-Tellez | Robert J. Ross

Understanding medical terminology is critical for effective patient-doctor communication, yet many patients struggle with complex jargon. This study compares Machine Learning (ML) models and Large Language Models (LLMs) in predicting medical term complexity as a means of improving doctor-patient communication. Using survey data from 252 participants rating 1,000 words along with various lexical features, we measured the accuracy of both model types. The results show that LLMs outperform traditional lexical-feature-based models, suggesting their potential to identify complex medical terms and lay the groundwork for personalised patient-doctor communication.

pdf bib abs
Where and How as Key Factors for Knowledge-Enhanced Constrained Commonsense Generation
Ivan Martinez-Murillo | Paloma Moreda Pozo | Elena Lloret

This paper addresses a key limitation in Natural Language Generation (NLG) systems: their struggle with commonsense reasoning, which is essential for generating contextually appropriate and plausible text. The study proposes an approach to enhance the commonsense reasoning abilities of NLG systems by integrating external knowledge framed in a constrained commonsense generation task. The paper investigates strategies for extracting and injecting external knowledge into pre-trained models, specifically BART and T5, in both base and large configurations. Experimental results show that incorporating external knowledge extracted with a simple strategy leads to significant improvements in performance, with the models achieving 88% accuracy in generating plausible and correct sentences. When refined methods for knowledge extraction are applied, the accuracy further increases to 92%. These findings underscore the crucial role of high-quality external knowledge in enhancing the commonsense reasoning capabilities of NLG systems, suggesting that such integration is vital for advancing their performance in real-world applications.

Social media platforms like Reddit, YouTube, and Instagram amplify rapid dissemination of negative sentiment, potentially causing harm and fostering extremist discourse. This paper addresses the NLP challenge of predicting sudden spikes in negative sentiment by fine-tuning multilingual transformer models. We present a structured pipeline emphasizing linguistic feature extraction and temporal modeling. Our experimental results, obtained from extensive Reddit, YouTube, and Instagram data, demonstrate improved forecasting accuracy over baseline methods. Ethical considerations and implications for deployment in social media moderation are thoroughly discussed. The system includes user-centric interactive features such as real-time filtering dashboards, customizable negativity thresholds, and forecasting analytics, providing actionable insights for preventative content moderation. Given its real-time deployment potential and cross-platform applicability, our system offers actionable insights for proactive content moderation.

pdf bib abs
C-SHAP: Collocation-Aware Explanations for Financial NLP
Martina Menzio | Elisabetta Fersini | Davide Paris

Understanding the internal decision-making process of NLP models in high-stakes domains such as the financial sector is particularly challenging due to the complexity of domain-specific terminology and the need for transparency and accountability. Although SHAP is a widely used model-agnostic method for attributing model predictions to input features, its standard formulation treats input tokens as independent units, failing to capture the influence of collocations that often carry non-compositional meaning, instead modeled by the current language models. We introduce C-SHAP, an extension of SHAP that incorporates collocational dependencies into the explanation process to account for word combinations in the financial sector. C-SHAP dynamically groups tokens into significant collocations using a financial glossary and computes Shapley values over these structured units. The proposed approach has been evaluated to explain sentiment classification of Federal Reserve Minutes, demonstrating improved alignment with human rationales and better association to model behaviour compared to the standard token-level approach.

pdf bib abs
Investigating Polarization in YouTube Comments via Aspect-Based Sentiment Analysis
Daniel Miehling | Daniel Dakota | Sandra Kübler

We investigate the use of Aspect-Based Sentiment Analysis (ABSA) to analyze polarization in online discourse. For the analysis, we use a corpus of over 3 million user comments and replies from four state-funded media channels from YouTube Shorts in the context of the 2023 Israel–Hamas war. We first annotate a subsample of approx. 5 000 comments for positive, negative, and neutral sentiment towards a list of topic related aspects. After training an ABSA model (Yang et al., 2023) on the corpus, we evaluate its performance on this task intrinsically, before evaluating the usability of the automatic analysis of the whole corpus for analyzing polarization. Our results show that the ABSA model achieves an F1 score of 77.9. The longitudinal and outlet analyses corroborate known trends and offer subject experts more fine-grained information about the use of domain-specific language in user-generated content.

pdf bib abs
From the Tractatus Logico-Philosophicus to Later Wittgenstein: An NLP-Based Comparative Analysis
Andreiana Mihail | Silviu-Florin Gheorghe | Andrei Fotea | Liviu P. Dinu

This study investigates the application of Natural Language Processing (NLP) methods to uncover linguistic and stylistic variations within the corpus of Ludwig Wittgenstein, a philosopher renowned for his complex and notional contributions. By analyzing works such as Tractatus Logico-Philosophicus alongside his later notes, manuscripts, and student-dictated lectures in Cambridge, we aim to identify significant distinctions in language use and conceptual framing. The corpus poses unique difficulties because of its diverse origins, encompassing published works, personal notes, and collaboratively edited transcripts. Utilizing zero-shot NLP techniques, this exploratory/preliminary research aims to reveal patterns reflective of Wittgenstein’s philosophical evolution and differences in text production manners. The results highlight the potential of computational approaches to enhance our understanding of complex, context-dependent philosophical writings, providing a possible path for further interdisciplinary investigations into linguistic and conceptual dynamics in this challenging body of work.

pdf bib abs
Towards Intention-aligned Reviews Summarization: Enhancing LLM Outputs with Pragmatic Cues
Maria Miro Maestre | Robiert Sepulveda-Torres | Ernesto Luis Estevanell-Valladares | Armando Suarez Cueto | Elena Lloret

Recent advancements in Natural Language Processing (NLP) have allowed systems to address complex tasks involving cultural knowledge, multi-step reasoning, and inference. While significant progress has been made in text summarization guided by specific instructions or stylistic cues, the integration of pragmatic aspects like communicative intentions remains underexplored, particularly in non-English languages. This study emphasizes communicative intentions as central to summary generation, classifying Spanish product reviews by intent and using prompt engineering to produce intention-aligned summaries. Results indicate challenges for large language models (LLMs) in processing extensive document clusters, with summarization accuracy heavily dependent on prior model exposure to similar intentions. Common intentions such as complimenting and criticizing are reliably handled, whereas less frequent ones like promising or questioning pose greater difficulties. These findings suggest that integrating communicative intentions into summarization tasks can significantly enhance summary relevance and clarity, thereby improving user experience in product review analysis.

pdf bib abs
Subtle Shifts, Significant Threats: Leveraging XAI Methods and LLMs to Undermine Language Models Robustness
Adrián Moreno Muñoz | L. Alfonso Ureñ-López | Eugenio Martínez Cámara

Language models exhibit inherent security vulnerabilities, which may be related to several factors, among them the malicious alteration of the input data. Such weaknesses compromise the robustness of language models, which is more critical when adversarial attacks are stealthy and do not require high computational resources. In this work, we study how vulnerable English language models are to adversarial attacks based on subtle modifications of the input of pretrained English language models. We claim that the attack may be more effective if it is targeted to the most salient words for the discriminative task of the language models. Accordingly, we propose a new attack built upon a two-step approach: first, we use a posteriori explainability methods to identify the most influential words for the classification task, and second, we replace them with contextual synonyms retrieved by a small language model. Since the attack has to be as stealthy as possible, we also propose a new evaluation measure that combines the effectiveness of the attack with the number of modifications performed. The results show that pretrained English language models are vulnerable to minimal semantic changes, which makes the design of countermeasure methods imperative.

pdf bib abs
Fast Thinking with Structured Prompts: Enabling LLM Reasoning without Chain-of-Thought Generation
Kirill Morozov | Liubov Chubarova | Irina Piontkovskaya

The emergence of complex reasoning abilities in large language models (LLMs) has sparked great interest, and a variety of prompting techniques was proposed to coax them into emulating human thought processes. In this work, we introduce Think Node-by-Node, a graph-based reasoning framework inspired by mind maps, flowcharts, and other visual aids that help humans tackle complex problems. Rather than generating images directly, our approach leverages standard graph-building and rendering libraries, and requires no fine-tuning, only the model’s native coding capabilities. We further explore a “Fast Thinking” regime, in which a graph-reasoning example provided in the prompt, but the model generates the answers directly, without the full thought process reconstruction. Surprisingly, this approach leads to significant improvement upon baseline in general-knowledge tasks. Remarkably, Think Node-by-Node maintains strong performance even under a strict 25-token budget for answer generation. Across two instruction-tuned LLMs (0.5B and 7B parameters), our FastTNbN strategy outperforms baseline prompting techniques, improving accuracy by up to 10%, and exceeds the capabilities of other structured prompting methods under equivalent generation constraints.

pdf bib abs
T2Know: Analysis and Trend Platform Using the Knowledge Extracted from Scientific Texts
Rafael Muñoz Guillena | Manuel Palomar | Yoan Gutiérrez | Mar Bonora

The T2Know project explores the application of natural language processing technologies to build a semantic platform for scientific documents using knowledge graphs. These graphs will interconnect meaningful sections from different documents, enabling both trend analysis and the generation of informed recommendations. The project’s objectives include the development of entity recognition systems, the definition of user and document profiles, and the linking of documents through transformer-based technologies. Consequently, the extracted relevant content will go beyond standard metadata such as titles and author affiliations, extending also to other key sections of scientific articles, including references, which are treated as integral components of the knowledge representation.

pdf bib abs
Investigating Large Language Models’ (LLMs) Capabilities for Sexism Detection on a Low-Resource Language
Lutfiye Seda Mut Altin | Horacio Saggion

Automatic detection of sexist language on social media is gaining attention due to its harmful societal impact and technical challenges it presents. The limited availability of data resources in some languages restricts the development of effective tools to fight the spread of such content. In this work, we investigated various methods to improve the efficiency of automatic detection of sexism and its subtypes in a low-resource language, Turkish. We first experimented with various LLM prompting strategies for classification and then investigated the impact of different data augmentation strategies, including both synthetic data generation with LLMs (GPT, DeepSeek) and translationbased augmentation using English and Spanish data. Finally, we examined whether these augmentation methods would improve model performance of a trained neural network (BERT). Our benchmarking results show that fine-tuned LLM (GPT-4o-mini) achieved the best performance compared to zero-shot, few-shot, Chain-of-Thought prompt classification and training a neural network (BERT) including the data augmented in different ways (synthetic generation, translation). Our results also indicated that, for the classification of more granular classes, in other words, more specific tasks, training a neural network generally performed better than prompt-based classification using an LLM.

pdf bib abs
PolyHope-M at RANLP2025 Subtask-1 Binary Hope Speech Detection: Spanish Language Classification Approach with Comprehensive Learning Using Transformer, and Traditional ML, and DL
Md. Julkar Naeen | Sourav Kumar Das | Sharun Akter Khushbu | Shahriar Sultan Ramit | Alaya Parven Alo

This paper presents our system for the RANLP 2025 shared task on multilingual binary sentiment classification for Task-2 Spanish datasets for domains including social media and customer reviews. We experimented with various models from traditional machine learning approaches—Naive Bayes and LightGBM—to deep learning architectures like LSTM. Among them, the transformer-based XLM-RoBERTa model performed best with an F1 of 0.85, demonstrating its promise for multilingual sentiment work. Basic text preprocessing techniques were used for data quality assurance and improving model performance. Our comparison reflects the superiority of transformer-based models over the traditional methods in binary sentiment classification for multilingual and low-resource environments. This study enables the development of cross-lingual sentiment classification by establishing strong baselines and paying close attention to model performance in joint task settings.

pdf bib abs
F-LoRA-QA: Finetuning LLaMA Models with Low-Rank Adaptation for French Botanical Question Generation and Answering
Ayoub Nainia | Régine Vignes-Lebbe | Hajar Mousannif | Jihad Zahir

Despite recent advances in large language models (LLMs), most question-answering (QA) systems remain English-centric and poorly suited to domain-specific scientific texts. This linguistic and domain bias poses a major challenge in botany, where a substantial portion of knowledge is documented in French. We introduce F-LoRA-QA, a fine-tuned LLaMA-based pipeline for French botanical QA, leveraging Low-Rank Adaptation (LoRA) for efficient domain adaptation. We construct a specialized dataset of 16,962 question-answer pairs extracted from scientific flora descriptions and fine-tune LLaMA models to retrieve structured knowledge from unstructured botanical texts. Expert-based evaluation confirms the linguistic quality and domain relevance of generated answers. Compared to baseline LLaMA models, F-LoRA-QA achieves a 300% BLEU score increase, 70% ROUGE-1 F1 gain, +16.8% BERTScore F1, and Exact Match improvement from 2.01% to 23.57%. These results demonstrate the effectiveness of adapting LLMs to low-resource scientific domains and highlight the potential of our approach for automated trait extraction and biodiversity data structuring.

pdf bib abs
Reverse Prompting: A Novel Computational Paradigm in Schizophrenia Based on Large Language Models
Ivan Nenchev | Christiane Montag | Sandra Anna Just

Large language models (LLMs) are increasingly being used to interpret and generate human language, yet their ability to process clinical language remains underexplored. This study examined whether three open-source LLMs can infer interviewer questions from participant responses in a semi-structured psychiatric interview (NET) conducted with individuals diagnosed with schizophrenia (n = 107) and neurotypical controls (n = 66). Using cosine similarity between LLM-generated questions and original prompts as a proxy for the precision of the inference, we found that responses from individuals with schizophrenia produced significantly lower similarity scores (beta = –0.165, p < .001). Cosine similarity decreased across the nested structure of the interview, with smaller reductions observed in the schizophrenia group. Although all emotions decreased similarity with fear, only sadness showed a significant interaction with diagnosis, suggesting differential processing of emotional discourse. Model type and generation temperature also influenced outcomes, highlighting variability in model performance. Our findings demonstrate that LLMs systematically struggle to reconstruct interviewer intent from responses by individuals with schizophrenia, reflecting known discourse-level disturbances in the disorder.

Small Language Models (SLMs) have become increasingly important due to their efficiency and performance to perform various language tasks with minimal computational resources, making them ideal for various settings including on-device, mobile, edge devices, among many others. In this article, we present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques. We propose a novel taxonomy for categorizing the methods used to optimize SLMs, including model compression, pruning, and quantization techniques. We summarize the benchmark datasets that are useful for benchmarking SLMs along with the evaluation metrics commonly used. Additionally, we highlight key open challenges that remain to be addressed. Our survey aims to serve as a valuable resource for researchers and practitioners interested in developing and deploying small yet efficient language models.

pdf bib abs
Quantifying the Overlap: Attribution Maps and Linguistic Heuristics in Encoder-Decoder Machine Translation Models
Aria Nourbakhsh | Salima Lamsiyah | Christoph Schommer

Explainable AI (XAI) attribution methods seek to illuminate the decision-making process of generative models by quantifying the contribution of each input token to the generated output. Different attribution algorithms, often rooted in distinct methodological frameworks, can produce varied interpretations of feature importance. In this study, we utilize attribution mappings derived from three distinct methods as weighting signals during the training of encoder-decoder models. Our findings demonstrate that Attention and Value Zeroing attribution weights consistently lead to improved model performance. To better understand the linguistic information these mappings capture, we extract part-of-speech (POS), dependency, and named entity recognition (NER) tags from the input-output pairs and compare them with the XAI attribution maps. Although the Saliency method shows greater alignment with POS and dependency annotations than Value Zeroing, it exhibits more divergence in places where its attributions do not conform to these linguistic tags, compared to the other two methods, and it contributes less to the models’ performance.

pdf bib abs
The Illusion of a Perfect Metric: Why Evaluating AI ́S Words Is Harder than It Looks
Maria Paz Oliva | Adriana D. Correia | Ivan Vankov | Viktor Botev

Evaluating Natural Language Generation (NLG) is crucial for the practical adoption of AI, but has been a longstanding research challenge. While human evaluation is considered the de-facto standard, it is expensive and lacks scalability. Practical applications have driven the development of various automatic evaluation metrics (AEM), designed to compare the model output with human-written references, generating a score which approximates human judgment. Over time, AEMs have evolved from simple lexical comparisons, to semantic similarity models and, more recently, to LLM-based evaluators. However, it seems that no single metric has emerged as a definitive solution, resulting in studies using different ones without fully considering the implications. This paper aims to show this by conducting a thorough examination of the methodologies of existing metrics, their documented strengths and limitations, validation methods, and correlations with human judgment. We identify several key challenges: metrics often capture only specific aspects of text quality, their effectiveness varies by task and dataset, validation practices remain unstructured, and correlations with human judgment are inconsistent. Importantly, we find that these challenges persist in the most recent type of metric, LLM-as-a-Judge, as well as in the evaluation of Retrieval Augmented Generation (RAG), an increasingly relevant task in academia and industry. Our findings challenge the quest for the ‘perfect metric’. We propose selecting metrics based on task-specific needs and leveraging complementary evaluations and advocate that new metrics should focus on enhanced validation methodologies.

Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet, they have demonstrated biases that perpetuate societal inequalities. Despite significant advancements in bias mitigation techniques using data augmentation, zero-shot prompting, and model fine-tuning, biases continuously persist, including subtle biases that may elude human detection. Recent research has shown a growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning and factuality in LLMs. Building on this approach, we propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs. Our work is the first to introduce and evaluate two distinct approaches within this framework for debiasing LLMs: a centralized method, where the conversation is facilitated by a single central LLM, and a decentralized method, where all models communicate directly. Our findings reveal that our multi-LLM framework significantly reduces bias in LLMs, outperforming the baseline method across several social groups.

pdf bib abs
Toward Quantum-Enhanced Natural Language Understanding: Sarcasm and Claim Detection with QLSTM
Pritam Pal | Dipankar Das

Traditional machine learning (ML) and deep learning (DL) models have shown effectiveness in natural language processing (NLP) tasks, such as sentiment analysis. However, they often struggle with complex linguistic structures, such as sarcasm and implicit claims. This paper introduces a Quantum Long Short-Term Memory (QLSTM) framework for detecting sarcasm and identifying claims in text, aiming to enhance the analysis of complex sentences. We evaluate four approaches: (1) classical LSTM, (2) quantum framework using QLSTM, (3) voting ensemble combining classical and quantum LSTMs, and (4) hybrid framework integrating both types. The experimental results indicate that the QLSTM approach excels in sarcasm detection, while the voting framework performs best in claim identification.

pdf bib abs
Legal Terminology Extraction in Spanish: Gold-standard Generation and LLM Evaluation
Lucia Palacios Palacios | Beatriz Guerrero García | Patricia Martín Chozas | Elena Montiel Ponsoda

This study aims to develop a gold-standard for terminological extraction in Castilian Spanish within the domain of labour law. To achieve this, a methodology was developed based on established linguistic theories and reviewed by a team of expert terminologists. Departing from previous extraction studies and reference theoretical frameworks, candidate terms were identified by their morphosyntactic patterns, enriched by assessing their degree of specialisation in reference resources. The candidate terms were then subjected to manual validation. To evaluate its applicability, we assessed the performance of the LLaMA3-8B and Mistral-7B language models in extracting labour law terms from the latest version of the Real Decreto Legislativo 2/2015 Ley del Estatuto de los Trabajadores. YAKE was also included as a statistical baseline for comparison between traditional methods and generative approaches. All models were evaluated against the validated gold-standard.

pdf bib abs
Benchmarking Item Difficulty Classification in German Vocational Education and Training
Alonso Palomino | Benjamin Paassen

Predicting the difficulty of exam questions or items is essential to effectively assembling and calibrating exams. While item response theory (IRT) models can estimate item difficulty, they require student responses that are costly and rarely available at scale. Natural language processing methods offer a text-only alternative; however, due to the scarcity of real-world labeled data, prior work often relies on synthetic or domain-specific corpora, limiting generalizability and overlooking the nuanced challenges of real-world text-based item difficulty estimation. Addressing this gap, we benchmark 122 classifiers on 935 German Vocational Education and Training (VET) items labeled via previous IRT analysis to assess feasibility under real-world conditions. In our setup, a stacked ensemble that combines linguistic features, pre-trained embeddings, and external semantic resources outperforms both transformer-based models and few-shot large language models, achieving moderate performance. We report findings and discuss limitations in the context of German VET.

pdf bib abs
Isolating LLM Performance Gains in Pre-training versus Instruction-tuning for Mid-resource Languages: The Ukrainian Benchmark Study
Yurii Paniv

This paper evaluates language model performance on Ukrainian language tasks across multiple downstream benchmarks, including summarization, closed and open question answering, and translation at both sentence and paragraph levels. We also introduce LongFlores, an extension of the FLORES benchmark designed specifically to assess paragraph-level translation capabilities. In our experiments, we compare the performance of base models against their instruction-tuned counterparts to isolate and quantify the source of performance improvements for Ukrainian language tasks. Our findings reveal that for popular open source models, base models are stronger in the few-shot setting for the task than their instruction-tuned counterparts in the zero-shot setting. This suggests lower attention paid to Ukrainian during the instruction-tuning phase, providing valuable insights for future model development and optimization for Ukrainian and potentially other lower-resourced languages.

pdf bib abs
Evaluating LLMs on Deceptive Text across Cultures
Katerina Papantoniou | Panagiotis Papadakos | Dimitris Plexousakis

Deception is a pervasive feature of human communication, yet identifying linguistic cues of deception remains a challenging task due to strong context dependency across domains, cultures, and types of deception. While prior work has relied on human analysis across disciplines like social psychology, philosophy, and political science, large language models (LLMs) offer a new avenue for exploring deception due to their strong performance in Natural Language Processing (NLP) tasks. In this study, we investigate whether open-weight LLMs possess and can apply knowledge about linguistic markers of deception across multiple languages, domains, and cultural contexts, with language and country of origin used as a proxy for culture. We focus on two domains, opinionated reviews and personal descriptions about sensitive topics, spanning five languages and six cultural settings. Using various configurations (zero-shot, one-shot, and fine-tuning), we evaluate the performance of LLMs in detecting and generating deceptive text. In detection tasks, our results reveal cross-model and cross-context performance differences. In generation tasks, linguistic analyses show partial alignment with known deception cues in human text, though this knowledge appears largely uniform and context-agnostic.

pdf bib abs
Annotating Hate Speech towards Identity Groups
Donnie Parent | Nina Georgiades | Charvi Mishra | Khaled Mohammed | Sandra Kübler

Detecting hate speech, especially implicit hate speech, is a difficult task. We focus on annotating implicit hate targeting identity groups. We describe our dataset, which is a subset of AbuseEval (Caselli et al., 2020) and our annotation process for implicit identity hate. We annotate the type of abuse, the type of identity abuse, and the target identity group. We then discuss cases that annotators disagreed on and provide dataset statistics. Finally, we calculate our inter-annotator agreement.

pdf bib abs
On the Interaction of Identity Hate Classification and Data Bias
Donnie Parent | Nina Georgiades | Charvi Mishra | Khaled Mohammed | Sandra Kübler

Hate speech detection is a task where machine learning models tend to be limited by the biases introduced by the dataset. We use two existing datasets of hate speech towards identity groups, the one by Wiegand et al. (2022) and a reannotated subset of the data in AbuseEval (Caselli et al. 2020). Since the data by Wiegand et al. (2022) were collected using one syntactic pattern, there exists a possible syntactic bias in this dataset. We test whether there exists such a bias by using a more syntactically general dataset for testing. Our findings show that classifiers trained on the dataset with the syntactic bias and tested on a less constrained dataset suffer from a loss in performance in the order of 20 points. Further experiments show that this drop can only be partly attributed to a shift in identity groups between datasets.

pdf bib abs
Financial News as a Proxy of European Central Bank Interest Rate Adjustments
Davide Paris | Martina Menzio | Elisabetta Fersini

This paper examines the relationship between news coverage and the European Central Bank’s (ECB) interest rate decisions. In particular, the hypothesis of a linear relationship between financial news and ECB indications regarding interest rate variations is investigated by leveraging state-of-the-art large language models combined with domain experts and automatically selected keywords. The analysis revealed two key findings related to how news contents can signal the ECB’s decisions to raise or lower interest rates: (1) Sentence Transformer models, when combined with domain-specific keywords, exhibit a higher correlation with ECB decisions than state-of-the-art financial BERT architectures; (2) employing a grid search strategy to select subsets of informative keywords strengthened the relationships between news contents and ECB’s decisions, highlighting how media narratives can anticipate or reflect central bank policy actions.

pdf bib abs
Generating and Analyzing Disfluency in a Code-Mixed Setting
Aryan Paul | Tapabrata Mondal | Dipankar Das | Sivaji Bandyopadhyay

This work explores the intersection of code-mixing and disfluency in bilingual speech and text, with a focus on understanding how large language models (LLMs) handle code-mixed disfluent utterances. One of the primary objectives is to explore LLMs’ ability to generate code-mixed disfluent sentences and to address the lack of high-quality code-mixed disfluent corpora, particularly for Indic languages. We aim to compare the performance of LLM-based approaches with traditional disfluency detection methods and to develop novel metrics for quantitatively assessing disfluency phenomena. Additionally, we investigate the relationship between code-mixing and disfluency, exploring how factors such as switching frequency and direction influence the occurrence of disfluencies. By analyzing these intriguing dynamics, we seek to gain a deeper understanding of the mutual influence between code-mixing and disfluency in multilingual speech.

pdf bib abs
A Low-Resource Speech-Driven NLP Pipeline for Sinhala Dyslexia Assistance
Peshala Sandali Perera | Deshan Koshala Sumanathilaka

Dyslexia in adults remains an under-researched and under-served area, particularly in non-English-speaking contexts, despite its significant impact on personal and professional lives. This work addresses that gap by focusing on Sinhala, a low-resource language with limited tools for linguistic accessibility. We present an assistive system designed specifically for Sinhala-speaking adults with dyslexia. The system integrates Whisper for speech-to-text conversion, SinBERT a open sourced fine-tuned BERT model trained for Sinhala to identify common dyslexic errors, and a combined mT5 and Mistral-based model to generate corrected text. Finally, the output is converted back to speech using gTTS, creating a complete multi modal feedback loop. Despite the challenges posed by limited Sinhala-language datasets, the system achieves 66% transcription accuracy and 70% correction accuracy with 65% overall system accuracy. These results demonstrate both the feasibility and effectiveness of the approach. Ultimately, this work highlights the importance of inclusive NLP technologies in underrepresented languages and showcases a practical step toward improving accessibility for adult dyslexic users.

pdf bib abs
Evaluating Transliteration Ambiguity in Adhoc Romanized Sinhala: A Dataset for Transliteration Disambiguation
Sandun Sameera Perera | Deshan Koshala Sumanathilaka

This paper introduces the first Transliteration disambiguation (TD) dataset for Romanized Sinhala, informally known as Singlish, developed to address the challenge of transliteration ambiguity in backwards transliteration tasks. The dataset covers 22 ambiguous Romanized Sinhala words, each mapping to two distinct Sinhala meanings, and provides 30 Romanized sentences per word: ten for each meaning individually and ten containing both meanings in context. Sentences were initially collected through web scraping and later post-processed using the Claude language model, which offers strong support for Sinhala, alongside a rule-based Romanization process to ensure linguistic quality and consistency. To demonstrate its applicability, the dataset was used to evaluate four existing back-transliteration systems, highlighting their performance in resolving context-sensitive ambiguities. Baseline evaluations confirm the dataset’s effectiveness in assessing transliteration systems’ ability to handle transliteration ambiguity, offering a valuable resource for advancing TD and transliteration research for Sinhala.

pdf bib abs
Detecting Deception in Disinformation across Languages: The Role of Linguistic Markers
Alba Perez-Montero | Silvia Gargova | Elena Lloret | Paloma Moreda Pozo

The unstoppable proliferation of news driven by the rise of digital media has intensified the challenge of news verification. Natural Language Processing (NLP) offers solutions, primarily through content and context analysis. Recognizing the vital role of linguistic analysis, this paper presents a multilingual study of linguistic markers for automated deceptive fake news detection across English, Spanish, and Bulgarian. We compiled datasets in these languages to extract and analyze both general and specific linguistic markers. We then performed feature selection using the SelectKBest algorithm, applying it to various classification models with different combinations of general and specific linguistic markers. The results show that Logistic Regression and Support Vector Machine classification models achieved F1-scores above 0.8 for English and Spanish. For Bulgarian, Random Forest yielded the best results with an F1-score of 0.73. While these markers demonstrate potential for transferability to other languages, results may vary due to inherent linguistic characteristics. This necessitates further experimentation, especially in low-resource languages like Bulgarian. These findings highlight the significant potential of our dataset and linguistic markers for multilingual deceptive news detection.

pdf bib abs
Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision
Dimitar Peshevski | Kiril Blazhevski | Martin Popovski | Gjorgji Madjarov

Effective document reranking is essential for improving search relevance across diverse applications. While Large Language Models (LLMs) excel at reranking due to their deep semantic understanding and reasoning, their high computational cost makes them impractical for many real-world deployments. Fine-tuning smaller, task-specific models is a more efficient alternative but typically depends on scarce, manually labeled data. To overcome this, we propose a novel pipeline that eliminates the need for human-labeled query-document pairs. Our method uses LLMs to generate synthetic queries from domain-specific corpora and employs an LLM-based classifier to label positive and hard-negative pairs. This synthetic dataset is then used to fine-tune a smaller transformer model with contrastive learning using Localized Contrastive Estimation (LCE) loss. Experiments on the MedQuAD dataset show that our approach significantly boosts in-domain performance and generalizes well to out-of-domain tasks. By using LLMs for data generation and supervision rather than inference, we reduce computational costs while maintaining strong reranking capabilities.

We introduce Q&A-LF, a French, question-answering benchmark designed to assess the extent to which large language models capture fine-grained lexical knowledge. We investigate the ability of ChatGPT-4o mini, Qwen2.5-14B, Llama3.0-8B, and Llama3.1-8B to answer questions based on lexical functions from Meaning-Text Theory. Using various prompting setups with different levels of examples and context, we find that Qwen and ChatGPT generally outperform Llama models, achieving up to 70% accuracy, while Llama models reach just above 60%. We identify LFs that are particularly easy or especially challenging for the models. We further investigate whether providing sentence-level context and one-shot prompting improve performance, especially on semantically complex functions.

pdf bib abs
Analysis of Vocabulary and Subword Tokenization Settings for Optimal Fine-tuning of MT: A Case Study of In-domain Translation
Javad Pourmostafa Roshan Sharami | Dimitar Shterionov | Pieter Spronck

The choice of vocabulary and subword (SW) tokenization has a significant impact on both training and fine-tuning of language and translation models. Fine-tuning is a common practice in optimizing a model with respect to new data. However, new data potentially introduces new words (or tokens), which, if not considered, may lead to suboptimal performance. In addition, the distribution of tokens in the new data can differ from the distribution of the original data. As such, the original SW tokenization model could be less suitable for the new data. With this work, we aim to gain better insights on the impact of SW tokenization and vocabulary generation on the performance of neural machine translation (NMT) models fine-tuned to a specific domain. To do so, we compare several strategies for SW tokenization and vocabulary generation and investigate the performance of the resulting models. Our findings show that the best way to fine-tune for domain adaptation is to consistently use both BPE and vocabulary from the in-domain data, which helps the model pick up on important domain-specific terms. At the same time, it is crucial not to lose sight of the vocabulary of the base (pre-trained) model—maintaining coverage of this vocabulary ensures the model keeps its general language abilities. The most successful configurations are those that introduce plenty of frequent domain terms while still retaining a substantial portion of the base model vocabulary, leading to noticeably better translation quality and adaptation, as seen in higher BLEU scores. These benefits, however, often come with greater computational costs, such as longer training times, since the model must learn more new tokens. Conversely, approaches that skip important domain terms or combine mismatched tokenization and vocabulary do not perform as well, making it clear that both domain-specific adaptation and broad vocabulary coverage matter—and that these gains are realized when the vocabulary preserves a good portion of the base (pre-trained) model. While using in-domain BPE and vocabulary yields the best domain adaptation, it substantially reduces out-of-domain translation quality. Hybrid configurations that combine base and domain vocabularies help balance this trade-off, maintaining broader translation capabilities alongside improved domain performance.

pdf bib abs
LLM-based Embedders for Prior Case Retrieval
Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov

In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. The state-of-the-art deep learning IR methods have not been successful in PCR due to two key challenges: i. Lengthy legal text limitation; when using the powerful BERT-based transformer models, there is a limit of input text lengths, which inevitably requires to shorten the input via truncation or division with a loss of legal context information. ii. Lack of legal training data; due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively. In this research, we address these challenges by leveraging LLM-based text embedders in PCR. LLM-based embedders support longer input lengths, and since we use them in an unsupervised manner, they do not require training data, addressing both challenges simultaneously. In this paper, we evaluate state-of-the-art LLM-based text embedders in four PCR benchmark datasets and show that they outperform BM25 and supervised transformer-based models.

pdf bib abs
Exploiting Primacy Effect to Improve Large Language Models
Bianca Raimondi | Maurizio Gabbrielli

Large Language Models (LLMs) have become essential in many Natural Language Processing (NLP) tasks, leveraging extensive pre-training and fine-tuning to achieve high accuracy. However, like humans, LLMs exhibit biases, particularly positional biases such as primacy and recency effects, which can influence the accuracy of the answers. The primacy effect—where items presented first are more likely to be remembered or selected—plays a key role in Multiple Choice Question Answering (MCQA), where the order of answer options can affect prediction outcomes. This study focuses on primacy bias in fine-tuned LLMs: We first show that fine-tuning amplifies this bias, probably due to exposure to human-like patterns. Hence, we strategically leverage this effect, by reordering response options on the basis of semantic similarity to the query - without requiring knowledge of the correct answer. Our experimental results show that this approach significantly improves performance in MCQA. More generally, our findings underscore the dual nature of biases as both challenges and opportunities, offering insights for bias-aware model design and NLP applications.

pdf bib abs
Alankaar: A Dataset for Figurativeness Understanding in Bangla
Geetanjali Rakshit | Jeffrey Flanigan

Bangla has a rich written literature, automatically making it replete with examples of creative usage of language. There have been limited efforts to computationally analyze creative text in the Bangla language due to a lack of resources. We present Alankaar, a dataset of 2500 manually annotated examples of text fragments in Bangla containing metaphors. We also provide automatic and manual English translations of these examples. Additionally, we provide 2500 examples of non-metaphorical text in Bangla. We use this dataset to build a metaphor identification system in Bangla. We also use it as a test bed for cross-lingual metaphor translation, finding that not all metaphors translate literally across languages and there are several cultural factors at play in the translation of metaphors. We hope this will advance the field in metaphor translation research and in grounding cultural nuances at work in the process of machine translation.

pdf bib abs
ASQ: Automatically Generating Question-Answer Pairs Using AMRs
Geetanjali Rakshit | Jeffrey Flanigan

We introduce ASQ, a tool to automatically mine questions and answers from a sentence using the Abstract Meaning Representation (AMR). Previous work has used question-answer pairs to specify the predicate-argument structure of a sentence using natural language, which does not require linguistic expertise or training, and created datasets such as QA-SRL and QAMR, for which the question-answer pair annotations were crowdsourced. Our goal is to build a tool (ASQ) that maps from the traditional meaning representation AMR to a question-answer meaning representation (QMR). This enables construction of QMR datasets automatically in various domains using existing high-quality AMR parsers, and provides an automatic mapping AMR to QMR for ease of understanding by non-experts. A qualitative evaluation of the output generated by ASQ from the AMR 2.0 data shows that the question-answer pairs are natural and valid, and demonstrate good coverage of the content. We run ASQ on the sentences from the QAMR dataset, to observe that the semantic roles in QAMR are also captured by ASQ. We intend to make this tool and the results publicly available for others to use and build upon.

pdf bib abs
Multi-LLM Verification for Question Answering under Conflicting Contexts
Geetanjali Rakshit | Jeffrey Flanigan

Open-domain question answering (ODQA) often requires models to resolve conflicting evidence retrieved from diverse sources—a task that remains challenging even for state-of-the-art large language models (LLMs). While single-agent techniques such as self-verification and self-consistency have shown promise across natural language understanding and generation tasks, and multi-agent approaches involving collaborative or competitive strategies have recently emerged, their effectiveness for ODQA in the presence of conflicting contexts remains underexplored. In this work, we investigate these techniques using the QACC dataset as a case study. We find that incorporating a multi-agent verification step—where the best answer is selected from among outputs generated by different LLMs—leads to improved performance. Interestingly, we also observe that requiring explanations during the verification step does not always improve answer quality. Our experiments evaluate three strong LLMs (GPT-4o, Claude 4, and DeepSeek-R1) across a range of prompting and verification baselines.

pdf bib abs
Comparative Analysis of Human and Large Language Model Performance in Pharmacology Multiple-Choice Questions
Ricardo Rodriguez | StÃ©phane Huet | Benoit Favre | Mickael Rouvier

In this article, we study the answers generated by a selection of Large Language Models to a set of Multiple Choice Questions in Pharmacology, and compare them to the answers provided by students, to understand which questions in this clinical domain are difficult for the models when compared to humans and why. We extract the internal logits to infer probability distributions and analyse the main features that determine the difficulty of questions using statistical methods. We also provide an extension to the FrenchMedMCQA dataset, with pairs of question-answers in pharmacology, enriched with student response rate, answer scoring, clinical topics, and annotations on question structure and semantics.

pdf bib abs
Enhancing Textual Understanding: Automated Claim Span Identification in English, Hindi, Bengali, and CodeMix
Rudra Roy | Pritam Pal | Dipankar Das | Saptarshi Ghosh | Biswajit Paul

Claim span identification, a crucial task in Natural Language Processing (NLP), aims to extract specific claims from texts. Such claim spans can be further utilized in various critical NLP applications, such as claim verification, fact-checking, and opinion mining, among others. The present work proposes a multilingual claim span identification framework for handling social media data in English, Hindi, Bengali, and CodeMixed texts, leveraging the strengths and knowledge of transformer-based pre-trained models. Our proposed framework efficiently identifies the contextual relationships between words and precisely detects claim spans across all languages, achieving a high F1 score and Jaccard score. The source code and datasets are available at: https://github.com/pritampal98/claim-span-multilingual

pdf bib abs
Detecting Fake News in the Era of Language Models
Muhammad Irfan Fikri Sabri | Hansi Hettiarachchi | Tharindu Ranasinghe

The proliferation of fake news has been amplified by the advent of large language models (LLMs), which can generate highly realistic and scalable misinformation. While prior studies have focused primarily on detecting human-generated fake news, the efficacy of current models against LLM-generated content remains underexplored. We address this gap by compiling a novel dataset combining public and LLM-generated fake news, redefining detection as a ternary classification task (real, human-generated fake, LLM-generated fake), and evaluating eight diverse classification models, including traditional machine learning, fine-tuned transformers, and few-shot prompted LLMs. Our findings highlight the strengths and limitations of these models in detecting evolving LLM-generated fake news, offering insights for future detection strategies.

pdf bib abs
Cyberbullying Detection via Aggression-Enhanced Prompting
Aisha Saeid | Anu Sabu | Girish Koushik | Ferrante Neri | Diptesh Kanojia

Detecting cyberbullying on social media remains a critical challenge due to its subtle and varied expressions. This study investigates whether integrating aggression detection as an auxiliary task within a unified training framework can enhance the generalisation and performance of large language models (LLMs) in cyberbullying detection. Experiments are conducted on five aggression datasets and one cyberbullying dataset using instruction-tuned LLMs. We evaluated multiple strategies: zero-shot, few-shot, independent LoRA fine-tuning, and multi-task learning (MTL). Given the inconsistent results of MTL, we propose an enriched prompt pipeline approach in which aggression predictions are embedded into cyberbullying detection prompts to provide contextual augmentation. Preliminary results show that the enriched prompt pipeline consistently outperforms standard LoRA fine-tuning, indicating that aggression-informed context significantly boosts cyberbullying detection. This study highlights the potential of auxiliary tasks, such as aggression detection, to improve the generalisation of LLMs for safety-critical applications on social networks.

pdf bib abs
Lingdex.org:Leveraging LLMs to Structure and Explore Linguistic Olympiad Puzzles for Learning and Teaching Linguistics
Jonathan Sakunkoo | Annabella Sakunkoo

Linguistics Olympiad puzzles provide a valuable but underutilized resource for teaching linguistic reasoning, typology, and cross-cultural understanding. Many of these puzzles feature endangered and low-resource languages and thus offer a rare opportunity to integrate linguistic diversity into education at a time when over 40% of the world’s languages face extinction. This paper presents Lingdex, a novel web-based platform that leverages large language models (LLMs) to classify, organize, and enliven Linguistics Olympiad problems across various linguistic categories such as syntax, morphology, semantics, phonology, and language families. By applying NLP techniques to the multilingual and multicultural corpora of linguistics puzzles drawn from international and national Olympiads, Lingdex supports language and linguistics education, problem-based learning, and curriculum development. The visual, interactive platform also includes problems based on endangered and rare languages to raise awareness and interest in linguistic diversity. We present results from a user study that shows increased learner interest and appreciation for global linguistic richness.

pdf bib abs
When Does Language Transfer Help? Sequential Fine-Tuning for Cross-Lingual Euphemism Detection
Julia Sammartino | Libby Barak | Jing Peng | Anna Feldman

Euphemisms are culturally variable and often ambiguous, posing challenges for language models, especially in low-resource settings. This paper investigates how cross-lingual transfer via sequential fine-tuning affects euphemism detection across five languages: English, Spanish, Chinese, Turkish, and Yorùbá. We compare sequential fine-tuning with monolingual and simultaneous fine-tuning using XLM-R and mBERT, analyzing how performance is shaped by language pairings, typological features, and pretraining coverage. Results show that sequential fine-tuning with a high-resource L1 improves L2 performance, especially for low-resource languages like Yorùbá and Turkish. XLM-R achieves larger gains but is more sensitive to pretraining gaps and catastrophic forgetting, while mBERT yields more stable, though lower, results. These findings highlight sequential fine-tuning as a simple yet effective strategy for improving euphemism detection in multilingual models, particularly when low-resource languages are involved.

pdf bib abs
Modelling the Relative Contributions of Stylistic Features in Forensic Authorship Attribution
G. Çağatay Sat | John Blake | Evgeny Pyshkin

This paper explores the extent to which stylistic features contribute to the task of authorship attribution in forensic contexts. Drawing on a filtered subset of the Enron email corpus, the study operationalizes stylistic indicators across four groups: lexical, syntactic, orthographic, and discoursal. Using R Programming Language for feature engineering and logistic regression modelling, we systematically assessed both the individual and interactive effects of these features on attribution accuracy. Results show that n-gram similarity consistently outperformed all other features, with the combined model of n-gram similarity and its interaction with other features achieving accuracy, precision and F1 scores of 91.6%, 93.3% and 91.7% respectively. The model was subsequently evaluated on a subset of the TEL corpus to assess its applicability in a forensic setting. The findings highlight the dominant role of lexical similarity and suggest that integrating interaction effects can yield further performance gains in forensic authorship analysis.

pdf bib abs
The Hidden Cost of Structure: How Constrained Decoding Affects Language Model Performance
Maximilian Schall | Gerard de Melo

Large Language Models excel at generating fluent text, but real-world applications increasingly demand structured outputs like JSON that can be programmatically processed. While prior work examines either task performance or format compliance in isolation, we investigate their interaction through comprehensive experiments across 11 models and multiple benchmarks. We uncover a fundamental divergence between base and instruction-tuned models under structural constraints. Base models often benefit from constrained decoding, producing more precise outputs, while instruction-tuned models frequently suffer performance degradation on generation tasks despite maintaining stability on classification tasks. Our log probability analysis reveals the underlying mechanism: constrained decoding forces models away from their preferred natural language patterns into lower-confidence structured alternatives. We demonstrate that successful constrained generation requires both adapted prompts and sufficient few-shot examples, with constrained models showing steeper performance gains from additional demonstrations compared to unconstrained generation. Notably, we find that base model performance under constraints can serve as an early indicator of post-training structured output capabilities, offering a practical evaluation tool for model development. These findings suggest that current instruction-tuning practices may inadvertently reduce models’ structured output capabilities and highlight the need for training-time integration of structural constraints in future model development.

pdf bib abs
A Question-Answering Based Framework/Metric for Evaluation of Newspaper Article Summarization
Vasanth Seemakurthy | Shashank Sundar | Siddharth Arvind | Siddhant Jagdish | Ashwini M. Joshi

Condensed summaries of newspaper articles cater to the modern need for easily digestible content amid shrinking attention spans. However, current summarization systems often produce extracts failing to capture the essence of original articles. Traditional evaluation metrics like ROUGE also provide limited insights into whether key information is preserved in the summaries. To address this, we propose a pipeline to generate high-quality summaries tailored for newspaper articles and evaluate them using a question-answering based metric. Our system segments input newspaper images, extracts text, and generates summaries. We also generate relevant questions from the original articles and use a question-answering model to assess how well the summaries can answer these queries to evaluate summary quality beyond just lexical overlap. Experiments on real-world data show the potential effectiveness of our approach in contrast to conventional metrics. Our framework holds promise for enabling reliable news summary generation and evaluation systems.

pdf bib abs
Efficient Financial Fraud Detection on Mobile Devices Using Lightweight Large Language Models
Lakpriya Senevirathna | Deshan Koshala Sumanathilaka

The growth of mobile financial transactions presents new challenges for fraud detection, where traditional and ML methods often miss emerging patterns. While Large Language Models (LLMs) offer advanced language understanding, they are typically too resource-intensive for mobile deployment and raise privacy concerns due to cloud reliance. This paper proposes a lightweight, privacy-preserving approach by fine-tuning and quantizing compact LLMs for on-device fraud detection from textual data. Models were optimized using Open Neural Network Exchange (ONNX) conversion and quantization to ensure efficiency. The fine-tuned quantized Llama-160M-Chat-v1 (bnb4) achieved 99.47% accuracy with a 168MB footprint, while fine-tuned quantized Qwen1.5-0.5B-Chat (bnb4) reached 99.50% accuracy at 797MB. These results demonstrate that optimized LLMs can deliver accurate, real-time fraud detection on mobile devices without compromising user privacy.

pdf bib abs
Contextual Cues in Machine Translation: Investigating the Potential of Multi-Source Input Strategies in LLMs and NMT Systems
Lia Shahnazaryan | Patrick Simianer | Joern Wuebker

We explore the impact of multi-source input strategies on machine translation (MT) quality, comparing GPT-4o, a large language model (LLM), with a traditional multilingual neural machine translation (NMT) system. Using intermediate language translations as contextual cues, we evaluate their effectiveness in enhancing English and Chinese translations into Portuguese. Results suggest that contextual information significantly improves translation quality for domain-specific datasets and potentially for linguistically distant language pairs, with diminishing returns observed in benchmarks with high linguistic variability. Additionally, we demonstrate that shallow fusion, a multi-source approach we apply within the NMT system, shows improved results when using high-resource languages as context for other translation pairs, highlighting the importance of strategic context language selection.

pdf bib abs
Exposing Pink Slime Journalism: Linguistic Signatures and Robust Detection against LLM-Generated Threats
Sadat Shahriar | Navid Ayoobi | Arjun Mukherjee | Mostafa Musharrat | Sai Vishnu Vamsi Senagasetty

The local news landscape, a vital source of reliable information for 28 million Americans, faces a growing threat from Pink Slime Journalism, a low-quality, auto-generated articles that mimic legitimate local reporting. Detecting these deceptive articles requires a fine-grained analysis of their linguistic, stylistic, and lexical characteristics. In this work, we conduct a comprehensive study to uncover the distinguishing patterns of Pink Slime content and propose detection strategies based on these insights. Beyond traditional generation methods, we highlight a new adversarial vector: modifications through large language models (LLMs). Our findings reveal that even consumer-accessible LLMs can significantly undermine existing detection systems, reducing their performance by up to 40% in F1-score. To counter this threat, we introduce a robust learning framework specifically designed to resist LLM-based adversarial attacks and adapt to the evolving landscape of automated pink slime journalism, and showed and improvement by up to 27%.

pdf bib abs
The Erosion of LLM Signatures: Can We Still Distinguish Human and LLM-Generated Scientific Ideas after Iterative Paraphrasing?
Sadat Shahriar | Navid Ayoobi | Arjun Mukherjee

With the increasing reliance on LLMs as research agents, distinguishing between LLM and human-generated ideas has become crucial for understanding the cognitive nuances of LLMs’ research capabilities. While detecting LLM-generated text has been extensively studied, distinguishing human vs LLM-generated *scientific ideas* remains an unexplored area. In this work, we systematically evaluate the ability of state-of-the-art (SOTA) machine learning models to differentiate between human and LLM-generated ideas, particularly after successive paraphrasing stages. Our findings highlight the challenges SOTA models face in source attribution, with detection performance declining by an average of 25.4% after five consecutive paraphrasing stages. Additionally, we demonstrate that incorporating the research problem as contextual information improves detection performance by up to 2.97%. Notably, our analysis reveals that detection algorithms struggle significantly when ideas are paraphrased into a simplified, non-expert style, contributing the most to the erosion of distinguishable LLM signatures.

pdf bib abs
Deep Language Geometry: Constructing a Metric Space from LLM Weights
Maksym Shamrai | Vladyslav Hamolia

We introduce a novel framework that utilizes the internal weight activations of modern Large Language Models (LLMs) to construct a metric space of languages. Unlike traditional approaches based on hand-crafted linguistic features, our method automatically derives high-dimensional vector representations by computing weight importance scores via an adapted pruning algorithm. Our approach captures intrinsic language characteristics that reflect linguistic phenomena. We validate our approach across diverse datasets and multilingual LLMs, covering 106 languages. The results align well with established linguistic families while also revealing unexpected inter-language connections that may indicate historical contact or language evolution. The source code, computed language latent vectors, and visualization tool are made publicly available at https://github.com/mshamrai/deep-language-geometry.

pdf bib abs
Cross-Lingual Fact Verification: Analyzing LLM Performance Patterns across Languages
Hanna Shcharbakova | Tatiana Anikina | Natalia Skachkova | Josef van Genabith

Fact verification has emerged as a critical task in combating misinformation, yet most research remains focused on English-language applications. This paper presents a comprehensive analysis of multilingual fact verification capabilities across three state-of-the-art large language models: Llama 3.1, Qwen 2.5, and Mistral Nemo. We evaluate these models on the X-Fact dataset that includes 25 typologically diverse languages, examining both seen and unseen languages through various evaluation scenarios. Our analysis employs few-shot prompting and LoRA fine-tuning approaches, revealing significant performance disparities based on script systems, with Latin script languages consistently outperforming others. We identify systematic cross-lingual instruction following failures, particularly affecting languages with non-Latin scripts. Surprisingly, some officially supported languages, such as Indonesian and Polish, which are not high-resourced languages, achieve better performance than high-resource languages like German and Spanish, challenging conventional assumptions about resource availability and model performance. The results highlight critical limitations in current multilingual LLMs for the fact verification task and provide insights for developing more inclusive multilingual systems.

pdf bib abs
ESAQueryRank: Ranking Query Interpretations for Document Retrieval Using Explicit Semantic Analysis
Avijeet Shil | Wei Jin

Representing query translation into relevant entities is a critical component of an infor- mation retrieval system. This paper proposes an unsupervised framework, ESAQueryRank, designed to process natural language queries by mapping n-gram phrases to Wikipedia ti- tles and ranking potential entity and phrase combinations using Explicit Semantic Analy- sis. Unlike previous approaches, this frame- work does not rely on query expansion, syn- tactic parsing, or manual annotation. Instead, it leverages Wikipedia metadata—such as ti- tles, redirects, disambiguation pages to dis- ambiguate entities and identify the most rel- evant ones based on cosine similarity in the ESA space. ESAQueryRank is evaluated using a random set of TREC questions and compared against a keyword-based approach and a context-based question translation model (CBQT). In all comparisons of full category types, ESAQueryRank consistently shows bet- ter results against both methods. Notably, the framework excels with more complex queries, achieving improvements in Mean Reciprocal Rank (MRR) of up to 480% for intricate queries like those beginning with “Why,” even without explicitly incorporating the question type. These results demonstrate that ESA- QueryRank is an effective, transparent, and domain-independent framework for building natural language interfaces.

pdf bib abs
Personalized Author Obfuscation with Large Language Models
Mohammad Shokri | Sarah Ita Levitan | Rivka Levitan

In this paper, we investigate the efficacy of large language models (LLMs) in obfuscating authorship by paraphrasing and altering writing styles. Rather than adopting a holistic approach that evaluates performance across the entire dataset, we focus on user-wise performance to analyze how obfuscation effectiveness varies across individual authors. While LLMs are generally effective, we observe a bimodal distribution of efficacy, with performance varying significantly across users. To address this, we propose a personalized prompting method that outperforms standard prompting techniques and partially mitigates the bimodality issue.

pdf bib abs
Bulgarian Event Extraction with LLMs
Kiril Simov | Nikolay Paev | Petya Osenova | Stefan Marinov

The paper presents the results from the experiments with two large language models (LLMs) - T5 and Llama – for extracting events from a Bulgarian event corpus. The two models were pretrained by us on 35 Billion Token Bulgarian Corpus. The extraction was performed within the context of one sentence. Our approach aims at balancing the ACE-oriented approach that uses triggers in event detection, and the MUC-oriented one that uses more general event types. The evaluation relies on the IoU (Intersection over Union) of token spans and is twofold. The first one refers to the predicted event token span. Here if the span is correct, the semantic roles within the event are further checked. The second one refers to the triple of an event type, its semantic roles and participants. The results are promising. A qualitative evaluation is provided as well.

Captions are crucial for understanding scientific visualizations and documents. Existing captioning methods for scientific figures rely on figure-caption pairs extracted from documents for training, many of which fall short with respect to metrics like helpfulness, explainability, and visual-descriptiveness, leading to generated captions being misaligned with reader preferences. To address this issue, we introduce FigCaps-HF, a new framework for figure-caption generation that can incorporate domain expert feedback in generating captions optimized for reader preferences. Our framework comprises of 1) an automatic method for evaluating the quality of figure-caption pairs, and 2) a novel reinforcement learning with human feedback (RLHF) method to optimize a generative figure-to-caption model for reader preferences. We demonstrate the effectiveness of our simple learning framework by improving performance over standard fine-tuning across different types of models. In particular, when using BLIP as the base model, our RLHF framework achieves a mean gain of 35.7%, 16.9%, 9%, and 11.4% in ROUGE, BLEU, Meteor, and CIDEr scores, respectively. Finally, we release a large-scale benchmark dataset with human feedback on figure-caption pairs to enable further evaluation and development of RLHF techniques for this problem.

Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on MS MARCO (Information Retrieval), BoolQ (Boolean Question Answering), and GSM8K (Mathematical Reasoning) datasets, assessing both accuracy and efficiency accross various tasks. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics namely: accuracy, inference latency, and throughput, providing insights into the suitability of low-bit quantization for real-world deployment and highlight the tradeoffs between memory, computing and latency in such settings, helping a user make suitable decisions

pdf bib abs
Pushing the (Generative) Envelope: Measuring the Effect of Prompt Technique and Temperature on the Generation of Model-based Systems Engineering Artifacts
Erin Smith Crabb | Cedric Bernard | Matthew Jones | Daniel Dakota

System engineers use Model-based systems engineering (MBSE) approaches to help design and model system requirements. This manually intensive process requires expertise in both the domain of artifact creation (e.g., the requirements for a vacuum), and how to encode that information in a machine readable form (e.g., SysML). We investigated leveraging local LLMs to generate initial draft artifacts using a variety of prompt techniques and temperatures. Our experiments showed promise for generating certain types of artifacts, suggesting that even smaller, local models possesses enough MBSE knowledge to support system engineers. We observed however that while scores for artifacts remain stable across different temperature settings, this is potentially misleading as significantly different, though semantically equivalent, generations can be produced.

pdf bib abs
Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch
Elza Strazda | Gerasimos Spanakis

Language models are prone to exhibiting biases, further amplifying unfair and harmful stereotypes. Given the fast-growing popularity and wide application of these models, it is necessary to ensure safe and fair language models. As of recent considerable attention has been paid to measuring bias in language models, yet the majority of studies have focused only on English language. A Dutch version of the US-specific CrowS-Pairs dataset for measuring bias in Dutch language models is introduced. The resulting dataset consists of 1463 sentence pairs that cover bias in 9 categories, such as Sexual orientation, Gender and Disability. The sentence pairs are composed of contrasting sentences, where one of the sentences concerns disadvantaged groups and the other advantaged groups. Using the Dutch CrowS-Pairs dataset, we show that various language models, BERTje, RobBERT, multilingual BERT, GEITje and Mistral-7B exhibit substantial bias across the various bias categories. Using the English and French versions of the CrowS-Pairs dataset, bias was evaluated in English (BERT and RoBERTa) and French (FlauBERT and CamemBERT) language models, and it was shown that English models exhibit the most bias, whereas Dutch models the least amount of bias. Additionally, results also indicate that assigning a persona to a language model changes the level of bias it exhibits. These findings highlight the variability of bias across languages and contexts, suggesting that cultural and linguistic factors play a significant role in shaping model biases.

pdf bib abs
The Challenge of Performing Ontology-driven Entity Extraction in Real-world Unstructured Textual Data from the Domain of Dementia
Sumaiya Suravee | Carsten Oliver Schmidt | Kristina Yordanova

Named entity recognition allows the automated extraction of structured domain-related information from unstructured textual data. Our study explores the task of ontology-driven entity recognition, a sequence labelling process for custom named entity recognition for the domain of dementia, specifically from unstructured forum texts where unprofessional caregivers of people with dementia discuss the challenges they face related to agitation. The targeted corpus is loosely structured, contains ambiguous sentences and vocabulary that does not match the agitation-related medical vocabulary. To address the above challenges, we propose a pipeline that involves the following steps: 1) development of an annotation codebook; 2) annotation of a textual corpus collected from dementia forums, consisting of 45,216 sentences (775 questions and 5571 answers); 3) data augmentation to reduce the imbalance in the corpus; 4) training of a bidirectional LSTM model and a transformer model; 5) comparison of the results with those from few shot- and zero-shot based prompt engineering techniques using a pretrained large language model (LLaMa 3). The results showed that LLaMa 3 was more robust than traditional neural networks and transformer models in detecting underrepresented entities. Furthermore, the study demonstrates that data augmentation improves the entity recognition task when fine-tuning deep learning models. The paper illustrates the challenges of ontology-driven entity recognition in real-world datasets and proposes a roadmap to addressing them that is potentially transferable to other real-world domains.

pdf bib abs
Recognizing the Structure and Content of Hungarian Civil Registers
Kata Ágnes Szűcs | Noémi Vadász | Zsolt Béla Záros

The study evaluates key steps in a system for processing data from digitized Hungarian state register records (1895-1980) into an SQL database. It examines how template selection and post-processing impact data accessibility and integration. The research details the compiled datasets, annotation processes, and evaluation functions used to measure processing quality, emphasizing template selection and post-processing to improve the overall workflow and the accuracy of the published data. An evaluation method for publishing structured data provides a model for similar projects.

pdf bib abs
Optimism, Pessimism, and the Language between: Model Interpretability and Psycholinguistic Profiling
Stefana Arina Tabusca | Liviu P. Dinu

This study explores how optimism and pessimism are expressed in social media by combining psycholinguistic profiling with model interpretability. Using the OPT dataset, we fine-tune a RoBERTa-based classifier and apply LIME to examine both the most confident and the most ambiguous predictions. We analyze the influential tokens driving these decisions and identify lexical patterns linked to affective intensity, certainty, and social orientation. A complementary LIWC-based analysis of ground truth labels reveals systematic differences in emotional tone and cognitive style. PCA projections further show that optimism and pessimism occupy overlapping yet distinguishable regions in psycholinguistic space. Our findings demonstrate the value of linguistic interpretability in understanding dispositional sentiment.

pdf bib abs
Demographic Features for Annotation-Aware Classification
Narjes Tahaei | Sabine Bergler

This paper revisits the use of annotator demographics as interpretable meta-information for modeling such variation. We adapt a lightweight attention mechanism, Annotation-Wise Attention Network (AWAN), to condition predictions on demographic features, enabling per-annotator modeling. Experiments on the EXIST sexism dataset show that AWAN improves classification performance over standard baselines, especially in cases of high annotator disagreement.

pdf bib abs
Exploring the Performance of Large Language Models for Event Detection and Extraction in the Health Domain
Hristo Tanev | Nicolas Stefanovitch | Tomáš Harmatha | Diana F. Sousa

Large Language Models (LLM) have entered the world of NLP with a fast pace. LLM has been used for summarization, translation, named entity recognition, and sentiment analysis Recently, different research groups have experimented with event detection and extraction, using LLM at various levels of the processing stage: The LLM have proven to be a very relevant technology from data preparation to event argument extraction. In particular Open Source LLM like Mistral are very important since they can be shared and modified by the research community. Still, little effort was made to study the performance of these models in NLP tasks like event extraction. In this paper we describe an experiment in evaluating several state-of-the-art open large language models (LLM) for the task of event extraction and event detection in the domain of health. The models were prompted to perform detection of health-related events - mostly disease outbreaks, but also natural and man-made disasters, which directly or indirectly have impact on the health of the people. The models were also asked to extract the place, time, number of human and animal cases, and the number of the human fatalities. The performance of the LLM turned out to be better than the one of a state-of-the-art knowledge based system, using as test data a set of 800 news abstracts, containing the title and the lead sentences of health-related news articles. We compared the performance of the event detection and event argument extraction from the open Large Language Models and two knowledge based event extraction systems, NEXUS and Medical NEXUS. Our evaluation shows that all the open LLM show a superior performance w.r.t. the knowledge-based systems with the best improvement of the F1 score of number of human fatalities detection of 0.2 (0.84 vs. 0.64), where the best performing LLM was LLama 3.3 70B instruct.

pdf bib abs
Leveraging LLaMa for Abstractive Text Summarisation in Malayalam: An Experimental Study
Hristo Tanev | Anitha S. Pillai | Revathy V. R

Recent years witnessed tremendous advancements in natural language processing (NLP) because of the development of complex language models that have automated several NLP applications, including text summarisation. Despite this progress, Malayalam text summarisation still faces challenges because of the peculiarities of the language. This research paper explores the potential of using a large language model, specifically the LLaMA (Large Language Model Meta AI) framework, for text summarisation of Malayalam language. In order to assess the performance of LLaMA for text summarization, for the low-resource language Malayalam, a dataset was curated with reference text and summaries. The evaluation showed that the LLaMA model could effectively summarize lengthy articles while maintaining important information and coherence. The generated summaries were compared with the reference summaries generated by human writers to observe how well aligned the model was with a human level of summarisation. The results proved that LLM can deal with the Malayalam text summarisation task, but more research is needed to understand the most relevant training strategy.

pdf bib abs
Building a Clean Bartangi Language Corpus and Training Word Embeddings for Low-Resource Language Modeling
Warda Tariq | Victor Popov | Vasilii Gromov

In this paper, we showcase a comprehensive end-to-end pipeline for creating a superior Bartangi language corpus and using it for training word embeddings. The critically low-resource Pamiri lan- guage of Bartangi, which is spoken in Tajikistan, has difficulties such as morphological complexity, orthographic variety, and a lack of data. In order to overcome these obstacles, we gathered a raw corpus of roughly 6,550 phrases, used the Uniparser-Morph-Bartangi morphological analyzer for linguistically accurate lemmatization, and implemented a thorough cleaning procedure to eliminate noise and ensure proper tokenization. The lemmatized corpus that results greatly lowers word spar- sity and raises the standard of linguistic analysis.The processed corpus was then used to train two different Word2Vec models, Skip-gram and CBOW, with a vector size of 100, a context window of 5, and a minimum frequency threshold of 1. The resultant word embeddings were displayed using dimensionality reduction techniques like PCA and t-SNE, and assessed using intrinsic methods like nearest-neighbor similarity tests. Our tests show that even from tiny datasets, meaningful semantic representations can be obtained by combining informed morphological analysis with clean prepro- cessing. One of the earliest computational datasets for Bartangi, this resource serves as a vital basis for upcoming NLP tasks, such as language modeling, semantic analysis, and low-resource machine translation. To promote more research in Pamiri and other under-represented languages, we make the corpus, lemmatizer pipeline, and trained embeddings publicly available.

pdf bib abs
A Deep Dive into Multi-Head Attention and Multi-Aspect Embedding
Maryam Teimouri | Jenna Kanerva | Filip Ginter

Multi-vector embedding models play an increasingly important role in retrieval-augmented generation, yet their internal behaviour lacks comprehensive analysis. We conduct a systematic, head-level study of the 32-head Semantic Feature Representation (SFR) encoder with the FineWeb corpus containing 10 billion tokens. For a set of 4,000 web documents, we pair head-specific embeddings with GPT-4o topic annotations and analyse the results using t-SNE visualisations, heat maps, and a 32-way logistic probe. The analysis shows that (i) clear semantic separation between heads emerges only at an intermediate layer, (ii) some heads align with specific topics while others capture broader corpus features, and (iii) naive pooling of head outputs can blur these distinctions, leading to frequent topic mismatches. The study offers practical guidance on where to extract embeddings, which heads may be pruned, and how to aggregate them to support more transparent and controllable retrieval pipelines.

pdf bib abs
A Linguistically-informed Comparison between Multilingual BERT and Language-specific BERT Models: The Case of Differential Object Marking in Romanian
Maria Tepei | Jelke Bloem

Current linguistic challenge datasets for language models focus on phenomena that exist in English. This may lead to a lack of attention for typological features beyond English. This is particularly an issue for multilingual models, which may be biased towards English by their training data and this bias may be amplified if benchmarks are also English-centered. We present the syntactically and semantically complex language phenomenon of Differential Object Marking (DOM) in Romanian as a challenging Masked Language Modelling task and compare the performance of monolingual and multilingual models. Results indicate that Romanian-specific BERT models perform better than equivalent multilingual one in representing this phenomenon.

pdf bib abs
PoliStance-TR: A Dataset for Turkish Stance Detection in Political Domain
Muhammed Cihat Unal | Yasemin Sarkın | Alper Karamanlioglu | Berkan Demirel

Stance detection in NLP involves determining whether an author is supportive, against, or neutral towards a particular target. This task is particularly challenging for Turkish due to the limited availability of data, which hinders progress in the field. To address this issue, we introduce a novel dataset focused on stance detection in Turkish, specifically within the political domain. This dataset was collected from X (formerly Twitter) and annotated by three human annotators who followed predefined guidelines to ensure consistent labeling and generalizability. After compiling the dataset, we trained various transformer-based models with different architectures, showing that the dataset is effective for stance classification. These models achieved an impressive Macro F1 score of up to 82%, highlighting their effectiveness in stance detection.

pdf bib abs
Towards Safer Hebrew Communication: A Dataset for Offensive Language Detoxification
Natalia Vanetik | Lior Liberov | Marina Litvak | Chaya Liebeskind

Text detoxification is the task of transforming offensive or toxic content into a non-offensive form while preserving the original meaning. Despite increasing research interest in detoxification across various languages, no resources or benchmarks exist for Hebrew, a Semitic language with unique morphological, syntactic, and cultural characteristics. This paper introduces HeDetox, the first annotated dataset for text detoxification in Hebrew. HeDetox contains 600 sentence pairs, each consisting of an offensive source text and a non-offensive text rewritten with LLM and human intervention. We present a detailed dataset analysis and evaluation showing that the dataset benefits offensive language detection. HeDetox offers a foundational resource for Hebrew natural language processing, advancing research in offensive language mitigation and controllable text generation.

pdf bib abs
AIDEN: Automatic Speaker Notes Creation and Navigation for Enhancing Online Learning Experience
Stalin Varanasi | Umer Butt | Guenter Neumann | Josef van Genabith

Effective learning in digital environments depends on quick access to educational resources and timely support. We present AIDEN, an advanced, AI-driven virtual teaching assistant integrated into lectures, to provide meaningful support for students. AIDEN’s capabilities include reading lecture materials aloud, locating specific slides, automatic speaker notes generation, search through a video stream. Powered by state-of-the-art retrieval and text generation, AIDEN can be adapted to new lecture content with minimal manual adjustments, requiring only minor customization of data handling processes and model configurations. Through automated testing, we evaluated AIDEN’s performance across key metrics slide retrieval recall for questions, and alignment of generated speaker notes with ground-truth data. The evaluation underscores AIDEN’s potential to significantly enhance learning experiences by offering real-world application and rapid configurability to diverse learning materials.

pdf bib abs
Using LLMs for Multilingual Clinical Entity Linking to ICD-10
Sylvia Vassileva | Ivan K. Koychev | Svetla Boytcheva

The linking of clinical entities is a crucial part of extracting structured information from clinical texts. It is the process of assigning a code from a medical ontology or classification to a phrase in the text. The International Classification of Diseases - 10th revision (ICD-10) is an international standard for classifying diseases for statistical and insurance purposes. Automatically assigning the correct ICD-10 code to terms in discharge summaries will simplify the work of healthcare professionals and ensure consistent coding in hospitals. Our paper proposes an approach for linking clinical terms to ICD-10 codes in different languages using Large Language Models (LLMs). The approach consists of a multistage pipeline that uses clinical dictionaries to match unambiguous terms in the text and then applies in-context learning with GPT-4.1 to predict the ICD-10 code for the terms that do not match the dictionary. Our system shows promising results in predicting ICD-10 codes on different benchmark datasets in Spanish - 0.89 F1 for categories and 0.78 F1 on subcategories on CodiEsp, and Greek - 0.85 F1 on ElCardioCC.

pdf bib abs
Aspect–Sentiment Quad Prediction with Distilled Large Language Models
Filippos Karolos Ventirozos | Peter Appleby | Matthew Shardlow

Aspect-based sentiment analysis offers detailed insights by pinpointing specific product aspects in a text that are associated with sentiments. This study explores it through the prediction of quadruples, comprising aspect, category, opinion, and polarity. We evaluated in-context learning strategies using recently released distilled large language models, ranging from zero to full-dataset demonstrations. Our findings reveal that the performance of these models now positions them between the current state-of-the-art and significantly higher than their earlier generations. Additionally, we experimented with various chain-of-thought prompts, examining sequences such as aspect to category to sentiment in different orders. Our results indicate that the optimal sequence differs from previous assumptions. Additionally, we found that for quadruple prediction, few-shot demonstrations alone yield better performance than chain-of-thought prompting.

pdf bib abs
SENTimental - a Simple Multilingual Sentiment Annotation Tool
John Vidler | Paul Rayson | Dawn Knight

Here we present SENTimental, a simple and fast web-based, mobile-friendly tool for capturing sentiment annotations from participants and citizen scientist volunteers to create training and testing data for low-resource languages. In contrast to existing tools, we focus on assigning broad values to segments of text over specific tags for tokens or spans to build datasets for training and testing LLMs. The SENTimental interface minimises barriers to entry with a goal of maximising the time a user spends in a flow state whereby they are able to quickly and accurately rate each text fragment without being distracted by the complexity of the interface. Designed from the outset to handle multilingual representations, SENTimental allows for parallel corpus data to be presented to the user and switched between instantly for immediate comparison. As such this allows for users in any loaded languages to contribute to the data gathered, building up comparable rankings in a simple structured dataset for later processing.

pdf bib abs
Anonymise: A Tool for Multilingual Document Pseudonymisation
Rinalds Vīksna | Inguna Skadina

According to the EU legislation, documents containing personal information need to be anonymized before public sharing. However, manual anonymisation is a time-consuming and costly process. Thus, there is a need for a robust text de-identification technique that accurately identifies and replaces personally identifiable information. This paper introduces the Anonymise tool, a system for document de-identification. The tool accepts text documents of various types (e.g., MS Word, plain-text), de-identifies personal information, and saves the de-identified document in its original format. The tool employs a modular architecture, integrating list-based matching, regular expressions and deep-learning-based named entity recognition to detect spans for redaction. Our evaluation results demonstrate high recall rates, making Anonymise a reliable solution for ensuring no sensitive information is left exposed. The tool can be accessed through a userfriendly web-based interface or API, offering flexibility for both individual and large-scale document processing needs. By automating document de-identification with high accuracy and efficiency, Anonymise presents a reliable solution for ensuring compliance with EU privacy regulations while reducing the time and cost associated with manual anonymisation.

pdf bib abs
Revealing Gender Bias in Language Models through Fashion Image Captioning
Maria Villalba-Oses | Victoria Muñoz-Garcia | Juan Pablo Consuegra-Ayala

Image captioning bridges computer vision and natural language processing but remains vulnerable to social biases. This study evaluates gender bias in ChatGPT, Copilot, and Grok by analyzing their descriptions of fashion-related images prompted without gender cues. We introduce a methodology combining gender annotation, stereotype classification, and a manually curated dataset. Results show that GPT-4o and Grok frequently assign gender and reinforce stereotypes, while Copilot more often generates neutral captions. Grok shows the lowest error rate but consistently assigns gender, even when cues are ambiguous. These findings highlight the need for bias-aware captioning approaches in multimodal systems.

pdf bib abs
Benchmarking Korean Idiom Understanding: A Comparative Analysis of Local and Global Models
Xiaonan Wang | Seoyoon Park | Hansaem Kim

Although an increasing number of multilingual LLMs (large language models) have begun to support Korean, there remains a notable lack of benchmark datasets specifically designed to evaluate their proficiency in Korean cultural and linguistic understanding. A major reason for this gap is that many available benchmarks in Korean are adapted from English originals via translation, which often fails to reflect the unique cultural context embedded in the Korean language. Even the few benchmark datasets based on native Korean data that involve cultural content typically focus on tasks such as bias or hate speech detection, where cultural knowledge serves merely as topical background rather than being integrated as a core component of semantic understanding. To address this gap, we introduce the Korean Idiom Matching Benchmark (KIM Bench), which consists of 1,175 instances. Idioms are culture-specific and often untranslatable, making them ideal for testing models’ cross-cultural semantic understanding. Using KIM Bench, We evaluate global and Korean native models. Our analysis show that larger and locally trained models better capture idiom semantics and cultural nuances, while chain-of-thought prompting may reduce accuracy. Models still struggle with deep semantic and contextual understanding. KIM Bench offers a compact tool for cross-cultural evaluation and insights into improving performance on culturally grounded tasks.

pdf bib abs
TinyMentalLLMs Enable Depression Detection in Chinese Social Media Texts
Jinyuan Xu | Tian Lan | Mathieu Valette | Pierre Magistry | Lei Li

Depression remains a major global mental health concern, bringing a higher risk of suicide and growing social costs tied to mental disorders. Leveraging social media as a valuable source of emotional signals, we identify two limitations in current NLP-based depression detection frameworks: (1) prediction systems often lack clear, user-friendly explanations for predictions in Depression Detection, and (2) the computational and confidentiality demands of LLMs are misaligned with the need for dependable, privacy-focused small-scale deployments. To address these challenges, we introduce TinyMentalLLMs (TMLs), a compact framework that offers two key contributions: (a) the construction of a small yet representative dataset through psychology-based textometry, and (b) an efficient fine-tuning strategy centered on multiple aspects of depression. This design improves both accuracy and F1 scores in generative models with 0.5B and 1.5B parameters, consistently yielding over 20% performance gains across datasets. TMLs achieve results on par with, and deliver better text quality than, much larger state-of-the-art models.

pdf bib abs
Prompt Engineering for Nepali NER: Leveraging Hindi-Capable LLMs for Low-Resource Languages
Dipendra Yadav | Sumaiya Suravee | Stefan Kemnitz | Tobias Strauss | Kristina Yordanova

This study provides a systematic evaluation of prompt engineering strategies for Named Entity Recognition in Nepali, a low-resource language with high similarity to Hindi, by leveraging Hindi-capable Meta’s LLaMA 3.3:70B model. Four prompting techniques—Baseline, Chain-of-Thought, Self-Refine, and Least-toMost—are assessed in both zero-shot and fewshot settings. As a novel contribution, we propose an entity-aware sentence selection strategy that prioritizes example diversity and entity coverage for few-shot prompting. Experimental results show that, without Nepali examples, zero-shot and one-shot prompts frequently yield unstructured or hallucinated outputs, underscoring the limitations of cross-lingual capabilities without in-context supervision. However, including even a small number of carefully selected Nepali examples—sometimes as few as ten—substantially enhances model performance, with the Least-to-Most approach achieving the highest F1 scores. These findings highlight the potential of prompt-based adaptation and principled example curation for extending LLM capabilities to related, low-resource languages, offering a practical alternative to full model fine-tuning.

pdf bib abs
Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media
Shakib Yazdani | Yasser Hamidullah | Cristina España-Bonet | Josef van Genabith

Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setup. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language for the purpose of additional evaluation. Our VLM-based pipeline includes a face visibility detection, a sign activity recognition, a text extraction from video content, and a judgment step to validate alignment between video and text, implementing generic filtering, annotation and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.

Large-scale Vision-Language Models (LVLMs) integrate linguistic and visual information, demonstrating advanced task-solving capabilities. These models are originally derived from Large Language Models, leading to strong capabilities for language tasks. However, the impact of additional visual information on model responses remains insufficiently understood. In this study, we focus on the priming effect, a psychological phenomenon, to investigate how visual information influences language task processing. We present additional intentionally designed images alongside two types of language tasks with different characteristics and analyze changes in the model’s responses. Our experimental results show that model responses shift in the direction intended by the image, suggesting that LVLMs do not simply ignore visual information but actively incorporate it into language processing. Furthermore, the similarity between this behavior and priming effects observed in human cognition suggests that LVLMs may share certain aspects of human cognitive mechanisms.

pdf bib abs
From Courtroom to Corpora: Building a Name Entity Corpus for Urdu Legal Texts
Adeel Zafar | Sohail Ashraf | Slawomir Nowaczyk

This study explores the effectiveness of transformer-based models for Named Entity Recognition (NER) in Urdu legal documents, a critical task in low-resource language processing. Given the legal texts’ specialized terminology and complex syntax, accurate entity recognition in Urdu remains challenging. We developed a legal Urdu dataset that contains 117,500 documents, generated synthetically from 47 different types of legal documents, and evaluated three BERT-based models. XLMRoBERTa, mBERT, and DistilBERT by analyzing their performance on an annotated Urdu legal dataset. mBERT demonstrated superior accuracy (0.999), and its F1 score (0.975) outperforms XLMRoBERTa and DistilBERT, highlighting its robustness in recognizing entities within low-resource languages. To ensure the privacy of the personal identifiers, all documents are anonymized. The dataset for this study is publicly hosted on Hugging Face and will be made public after the publication.

pdf bib abs
EmoHopeSpeech: An Annotated Dataset of Emotions and Hope Speech in English and Arabic
Wajdi Zaghouani | Md. Rafiul Biswas

This research introduces a bilingual dataset comprising 27,456 entries for Arabic and 10,036 entries for English, annotated for emotions and hope speech, addressing the scarcity of multi-emotion (Emotion and hope) datasets. The dataset provides comprehensive annotations capturing emotion intensity, complexity, and causes, alongside detailed classifications and subcategories for hope speech. To ensure annotation reliability, Fleiss’ Kappa was employed, revealing 0.75-0.85 agreement among annotators both for Arabic and English language. The evaluation metrics (micro-F1-Score=0.67) obtained from the baseline model (i.e., transformer-based AraBERT model) validate that the data annotations are worthy.

pdf bib abs
An Annotated Corpus of Arabic Tweets for Hate Speech Analysis
Wajdi Zaghouani | Md. Rafiul Biswas

Identifying hate speech content in the Arabic language is challenging due to the rich quality of dialectal variations. This study introduces a multilabel hate speech dataset in the Arabic language. We have collected 10,000 Arabic tweets and annotated each tweet, whether it contains offensive content or not. If a text contains offensive content, we further classify it into different hate speech targets such as religion, gender, politics, ethnicity, origin, and others. A text can contain either single or multiple targets. Multiple annotators are involved in the data annotation task. We calculated the inter-annotator agreement, which was reported to be 0.86 for offensive content and 0.71 for multiple hate speech targets. Finally, we evaluated the data annotation task by employing a different transformers-based model in which AraBERTv2 outperformed with a micro-F1 score of 0.7865 and an accuracy of 0.786.

pdf bib abs
Strategies for Efficient Retrieval-augmented Generation in Clinical Domains with RAPTOR: A Benchmarking Study
Xumou Zhang | Qixuan Hu | Jinman Kim | Adam G. Dunn

The Recursive Abstractive Processing for Tree-Organized Retrieval (RAPTOR) framework deploys a hierarchical tree-structured datastore to integrate local and global context, enabling efficient handling of long documents for language models. This design is especially useful when cloud-based language models are unavailable or undesirable. For instance, with offline confidential patient records or stringent data-privacy requirements. We benchmarked RAPTOR on the QuALITY dataset and a novel Clinical Trial question-answering dataset (CTQA) drawn from over 500 000 registry entries. Experiments varied question complexity (simple vs. complex), four language models, four embedding models, and three chunking strategies. Also incorporated GPT-4o as a cloud-based baseline. Results show that, with optimal settings, RAPTOR combined with smaller local models outperforms GPT-4o on complex CTQA questions, although this gain does not extend to QuALITY. These outcomes highlight RAPTOR’s promise as a practical, locally implementable solution for long-context understanding.

pdf bib abs
LLM-Based Product Recommendation with Prospect Theoretic Self Alignment Strategy
Manying Zhang | Zehua Cheng | Damien Nouvel

Accurate and personalized product recommendation is central to user satisfaction in e-commerce. However, a persistent language gap often exists between user queries and product titles or descriptions. While traditional user behavior-based recommenders and LLM-based Retrieval-Augmented Generation systems typically optimize for maximum likelihood objectives, they may struggle to bridge this gap or capture users’ true intent. In this paper, we propose a strategy based on Prospect Theoretic Self-Alignment, that reframes LLM-based recommendations as a utility-driven process. Given a user query and a set of candidate products, our model acts as a seller who anticipates latent user needs and generates product descriptions tailored to the user’s perspective. Simultaneously, it simulates user decision-making utility to assess whether the generated content would lead to a purchase. This self-alignment is achieved through a training strategy grounded in Kahneman & Tversky’s prospect theory, ensuring that recommendations are optimized for perceived user value rather than likelihood alone. Experiments on real-world product data demonstrate substantial improvements in intent alignment and recommendation quality, validating the effectiveness of our approach in producing personalized and decision-aware recommendations.

pdf bib abs
Branching Out: Exploration of Chinese Dependency Parsing with Fine-tuned Large Language Models
He Zhou | Emmanuele Chersoni | Yu-Yin Hsu

In this paper, we investigate the effectiveness of large language models (LLMs) for Chinese dependency parsing through fine-tuning. We explore how different dependency representations impact parsing performance when fine-tuning the Chinese Llama-3 model. Our results demonstrate that while the Stanford typed dependency tuple representation yields the highest number of valid dependency trees, converting dependency structure into a lexical centered tree produces parses of significantly higher quality despite generating fewer valid structures. The results further show that fine-tuning enhances LLMs’ capability to handle longer dependencies to some extent, though challenges remain. Additionally, we evaluate the effectiveness of DeepSeek in correcting LLM-generated dependency structures, finding that it is effective for fixing index errors and cyclicity issues but still suffers from tokenization mismatches. Our analysis across dependency distances and relations reveals that fine-tuned LLMs outperform traditional parsers in specific syntactic structures while struggling with others. These findings contribute to the research on leveraging LLMs for syntactic analysis tasks.

pdf (full)
bib (full) Proceedings of the 9th Student Research Workshop associated with the International Conference Recent Advances in Natural Language Processing

pdf bib
Proceedings of the 9th Student Research Workshop associated with the International Conference Recent Advances in Natural Language Processing
Boris Velichkov | Ivelina Nikolova-Koleva | Milena Slavcheva

pdf bib abs
A Multi-Baseline Framework for Ranking Global Event Significance Using Google Trends and Large Language Models
Zenan Chen

Determining global event significance lacks standardized metrics for quantifying worldwide impact. While Google Trends has demonstrated utility in domain-specific studies, its application to global event ranking remains limited. This paper presents a framework combining Google Trends data with large language models for automated global event ranking. This study leverages Command R+ and Llama 3.3-70B-Instruct to generate contextually relevant event keywords and establishes significance through comparative search volume analysis against baseline reference terms, incorporating temporal weighting mechanisms to address chronological biases. The proposed methodology identified globally significant events across technology, health, sports, and natural disasters from a dataset of 1,094 events (2020-2024) extracted from Wikipedia.

pdf bib abs
Investigating Hierarchical Structure in Multi-Label Document Classification
Artemis Dampa

Effectively organizing the vast and ever-growing body of research in scientific literature is crucial to advancing the field and supporting scholarly discovery. In this paper, we study the task of fine-grained hierarchical multi-label classification of scholarly articles, using a structured taxonomy. Specifically, we investigate whether incorporating hierarchical information in a classification method can improve performance compared to conventional flat classification approaches. To this end, we suggest and evaluate different strategies for the classification, on three different axes: selection of positive and negative samples; soft-to-hard label mapping; hierarchical post-processing policies that utilize taxonomy-related requirements to update the final labeling. Experiments demonstrate that flat baselines constitute powerful baselines, but the infusion of hierarchical knowledge leads to better recall-focused performance based on use-case requirements.

pdf bib abs
Large Language Models for Lexical Resource Enhancement: Multiple Hypernymy Resolution in WordNet
Dimitar Hristov

Large language models (LLMs) have materially changed natural language processing (NLP). While LLMs have shifted focus from traditional semantic-based resources, structured linguistic databases such as WordNet remain essential for precise knowledge retrieval, decision making and aiding LLM development. WordNet organizes concepts through synonym sets (synsets) and semantic links but suffers from inconsistencies, including redundant or erroneous relations. This paper investigates an approach using LLMs to aid the refinement of structured language resources, specifically WordNet, by an automation for multiple hypernymy resolution, leveraging the LLMs semantic knowledge to produce tools for aiding and evaluating manual resource improvement.

pdf bib abs
Automated classification of causal relations. Evaluating different LLM performances.
Giacomo Magnifico

The search for formal causal relations in natural language faces inherent limitations due to the lack of mathematically and logically informed datasets. Thus, the exploration of causal relations in natural language leads to the analysis of formal-logic-adjacent language patterns. Thanks to the recent advancements of generative LLMs, this research niche is expanding within the field of natural language processing and evaluation. In this work, we conduct an evaluation of 9 models produced by different AI developing companies in order to answer the question “Are LLMs capable of discerning between different types of causal relations?”. The SciExpl dataset is chosen as a natural language corpus, and we develop three different prompt types aligned with zero-shot, few-shot, and chain-of-thought standards to evaluate the performance of the LLMs. Claude 3.7 Sonnet and Gemini 2.5 Flash Preview emerge as the best models for the task, with the respective highest F1 scores of 0.842 (few-shot prompting) and 0.846 (chain-of-thought prompting).

pdf bib abs
Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream
Martin Polacek

In this study, we employ various ELECTRA-Small models that are pre-trained and fine-tuned on specific sets of languages for automatic punctuation restoration (APR) in automatically transcribed TV and radio shows, which contain conversations in two closely related languages. Our evaluation data specifically concerns bilingual interviews in Czech and Slovak and data containing speeches in Swedish and Norwegian. We train and evaluate three types of models: the multilingual (mELECTRA) model, which is pre-trained for 13 European languages; two bilingual models, each pre-trained for one language pair; and four monolingual models, each pre-trained for a single language. Our experimental results show that a) fine-tuning, which must be performed using data belonging to both target languages, is the key step in developing a bilingual APR system and b) the mELECTRA model yields competitive results, making it a viable option for bilingual APR and other multilingual applications. Thus, we publicly release our pre-trained bilingual and, in particular, multilingual ELECTRA-small models on HuggingFace, fostering further research in various multilingual tasks.

pdf bib abs
NoCs: A Non-Compound-Stable Splitter for German Compounds
Carmen Schacht

Compounding—the creation of highly complex lexical items through the combination of existing lexemes—can be considered one of the most efficient communication phenomenons, though the automatic processing of compound structures—especially of multi-constituent compounds—poses significant challenges for natural language processing. Existing tools like compound-split (Tuggener, 2016) perform well on compound head detection but are limited in handling long compounds and distinguishing compounds from non-compounds. This paper introduces NoCs (non-compound-stable splitter), a novel Python-based tool that extends the functionality of compound-split by incorporating recursive splitting, non-compound detection, and integration with state-of-the-art linguistic resources. NoCs employs a custom stack-and-buffer mechanism to traverse and decompose compounds robustly, even in cases involving multiple constituents. A large-scale evaluation using adapted GermaNet data shows that NoCs substantially outperforms compound-split in both non-compound identification and the recursive splitting of three- to five-constituent compounds, demonstrating its utility as a reliable resource for compound analysis in German.

pdf bib abs
A Proposal for Evaluating the Linguistic Quality of Synthetic Spanish Corpora
Lucia Sevilla-Requena

Large language models (LLMs) rely heavily on high-quality training data, yet human-generated corpora face increasing scarcity due to legal and practical constraints. Synthetic data generated by LLMs is emerging as a scalable alternative; however, concerns remain about its linguistic quality and diversity. While previous research has identified potential degradation in English synthetic corpora, the effects in Spanish, a language with distinct grammatical characteristics, remain underexplored. This research proposal aims to conduct a systematic linguistic evaluation of synthetic Spanish corpora generated by state-of-the-art LLMs, comparing them with human-written texts. The study will analyse three key dimensions: lexical, syntactic, and semantic diversity, using established corpus linguistics metrics. Through this comparative framework, the proposal intends to identify potential linguistic simplifications and degradation patterns in synthetic Spanish data. Ultimately, the proposed outcome is expected to contribute valuable insights to support the creation of robust and reliable Natural Language Processing (NLP) models for Spanish.

pdf bib abs
Personalizing chatbot communication with associative memory
Kirill Soloshenko | Alexandra Shatalina | Marina Sevostyanova | Elizaveta Kornilova | Konstantin Zaitsev

In our research paper we present the approach that is aimed at effectively expanding the context through integrating a database of associative memory into the pipeline. In order to improve long-term memory and personalization we have utilized methods close to Retrieval-Augmented Generation (RAG). Our method uses a multi-agent pipeline with a cold-start agent for initial interactions, a fact extraction agent to process user inputs, an associative memory agent for storing and retrieving context, and a generation agent for replying to user’s queries.Evaluation results show promising results: a 41% accuracy improvement over the base Gemma3 model (from 16% to 57%). Hence, with our approach, we demonstrate that personalized chatbots can bypass LLM memory limitations while increasing information reliability under the conditions of limited context and memory.

pdf bib abs
Visualization of LLM Annotated Documents
Teodor Todorov Valtchev | Nikolay Paev

The paper presents an automatic annotation and visualization system for documents in the field of Social Sciences and Humanities. The annotation is on two levels, named Entities and Events. The system combines automatically generated annotations from language models with a powerful text editor that is extended to accommodate manual annotation. The goal is to support the extraction of information from historical documents by scientists in the SS&H field. At the time of writing of the paper, the system is still in development.

pdf (full)
bib (full) Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects

The hospitality industry in the Arab world increasingly relies on customer feedback to shape services, driving the need for advanced Arabic sentiment analysis tools. To address this challenge, the Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. This task leverages a multi-dialect, manually curated dataset derived from hotel reviews originally written in Modern Standard Arabic (MSA) and translated into Saudi and Moroccan (Darija) dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. Translations were validated by native speakers to ensure dialectal accuracy and sentiment preservation. This resource supports the development of dialect-aware NLP systems for real-world applications in customer experience analysis. More than 40 teams have registered for the shared task, with 12 submitting systems during the evaluation phase. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.

pdf bib abs
iWAN-NLP at AHaSIS 2025: A Stacked Ensemble of Arabic Transformers for Sentiment Analysis on Arabic Dialects in the Hospitality Domain
Hend Al-Khalifa

This paper details the iWAN-NLP system developed for participation in the AHaSIS 2025 shared task, “Sentiment Analysis on Arabic Dialects in the Hospitality Domain: A Multi-Dialect Benchmark.” Our approach leverages a multi-model ensemble strategy, combining the strengths of MARBERTv2, Saudibert, and DarijaBERT. These pre-trained Arabic language models were fine-tuned for sentiment classification using a 5-fold stratified cross-validation methodology. The final predictions on the test set were derived by averaging the logits produced by each model across all folds and then averaging these combined logits across the three models. This system achieved a macro F1-score of 81.0% on the official evaluation dataset and a cross-validated macro F1-score of 0.8513 (accuracy 0.8628) on the training set. Our findings highlight the effectiveness of ensembling regionally adapted models and robust cross-validation for Arabic sentiment analysis in the hospitality domain, ultimately securing first place in the AHaSIS 2025 shared task.

pdf bib abs
Fine-tuning AraBert model for arabic sentiment detection
Mustapha Jaballah | Dhaou Ghoul | Ammar Mars

Arabic exhibits a rich and intricate linguistic landscape, with Modern Standard Arabic (MSA) serving as the formal written and spoken medium, alongside a wide variety of regional dialects used in everyday communication. These dialects vary considerably in syntax, vocabulary, phonology, and meaning, presenting significant challenges for natural language processing (NLP). The complexity is particularly pronounced in sentiment analysis, where emotional expressions and idiomatic phrases differ markedly across regions, hindering consistent and accurate sentiment detection. This paper describes our submission to the Ahasis Shared Task: A Benchmark for Arabic Sentiment Analysis in the hospitality domain. This shared task focuses on advancing sentiment analysis techniques for Arabic dialects in the hotel domain. Our proposed approach achieved an F1 score of 0.88 % on the internal test set (split from the original training data), and 79.16% on the official hidden test set of the shared task. This performance secured our team second place in the Ahasis Shared Task.

pdf bib abs
Enhancing Arabic Dialectal Sentiment Analysis through Advanced Data Augmentation Techniques
Md. Rafiul Biswas | Wajdi Zaghouani

This work addresses the challenge of Arabic sentiment analysis in the hospitality domain in all dialects by using data augmentation techniques. We created a pipeline with three simple techniques: context-based paraphrasing, pattern-based sentence generation, and domain-specific word replacement. Our method preserves the original dialect features, meanings, and key classification details while adding diversity to the training data. It also includes automatic fallback between methods to handle challenges effectively. We used the Fanar API for dialectal data augmentation in the hospitality domain. The AraBERT-Large-v02 model was fine-tuned on original and augmented data, showing improved performance. This study helps solve the problem of limited dialect data in Arabic NLP and offers an effective framework that is useful for other Arabic text analysis tasks.

pdf bib abs
Ahasis Shared Task: Hybrid Lexicon-Augmented AraBERT Model for Sentiment Detection in Arabic Dialects
Shimaa Amer Ibrahim | Mabrouka Bessghaier | Wajdi Zaghouani

This work was conducted as part of the Ahasis@RANLP–2025 shared task, which focuses on sentiment detection in Arabic dialects within the hotel review domain. The primary objective is to advance sentiment analysis methodologies tailored to dialectal Arabic. Our work combines data augmentation with a hybrid model that integrates AraBERT and our created sentiment lexicon. Notably, our hybrid model significantly improved performance, reaching an F1-score of 0.74, compared to 0.56 when using only AraBERT. These results highlight the effectiveness of lexicon integration and augmentation strategies in enhancing both the accuracy and robustness of sentiment classification in dialectal Arabic.

pdf bib abs
Lab17 @ Ahasis Shared Task 2025: Fine-Tuning and Prompting techniques for Sentiment Analysis of Saudi and Darija Dialects
Al Mukhtar Al Hadhrami | Firas Al Mahrouqi | Mohammed Al Shaaili | Hala Mulki

In this paper, we describe our contribution in Ahasis shared task: Sentiment analysis on Arabic Dialects in the Hospitality Domain. Through the presented framework, we explored using two learning strategies tailored to a Large Language Model (LLM) and Transformer-based model variants. While few-shot prompting was used with GPT-4o, fine-tuning was adopted once to refine the essential MARBERT model on the Ahasis dataset and then to utilize a MARBERT variant model, SODA-BERT, that was pretrained on an Omani sentiment dataset and later evaluated with the shared task data.

pdf bib abs
Dialect-Aware Sentiment Analysis for Ahasis Challenge
Hasna Chouikhi | Manel Aloui

This paper presents our approach to Arabic sentiment analysis with a specific focus on dialect-awareness for Saudi and Moroccan (Darija) dialectal variants. We develop a system that achieves a macro F1 score of 77% on the test set, demonstrating effective generalization across these dialect variations. Our approach leverages a pre-trained Arabic language model (Qarib) with custom dialect-specific embeddings and preprocessing techniques tailored to each dialect. The results show a significant improvement over baseline models that do not incorporate dialect information, with an absolute gain of 5% in F1 score over the equivalent non-dialect-aware model. Our analysis further reveals distinct sentiment expression patterns between Saudi and Darija dialects, highlighting the importance of dialect-aware approaches for Arabic sentiment analysis.

pdf bib abs
MAPROC at AHaSIS Shared Task: Few-Shot and Sentence Transformer for Sentiment Analysis of Arabic Hotel Reviews
Randa Zarnoufi

Sentiment analysis of Arabic dialects presents significant challenges due to linguistic diversity and the scarcity of annotated data. This paper describes our approach to the AHaSIS shared task, which focuses on sentiment analysis on Arabic dialects in the hospitality domain. The dataset comprises hotel reviews written in Moroccan and Saudi dialects, and the objective is to classify the reviewers’ sentiment as positive, negative, or neutral. We employed the SetFit (Sentence Transformer Fine-tuning) framework, a data-efficient few-shot learning technique. On the official evaluation set, our system achieved an F1 of 73%, ranking 12th among 26 participants. This work highlights the potential of few-shot learning to address data scarcity in processing nuanced dialectal Arabic text within specialized domains like hotel reviews.

pdf bib abs
mucAI at Ahasis Shared Task: Sentiment Analysis with Adaptive Few Shot Prompting
Ahmed Mohamed Abdelaal Abdou

Sentiment Analysis is a crucial task in Natural Language Processing (NLP) focused on identifying and categorizing emotional tones or opinions within text. For Arabic customer reviews, sentiment analysis is particularly challenging. The language’s rich diversity, with numerous regional dialects differing significantly from Modern Standard Arabic (MSA) and each other in lexicon, syntax, and sentiment expression, complicates consistent performance across dialects. In this paper, we present our approach, submitted to the AHASIS Shared Task 2025, focusing on sentiment analysis for Arabic dialects in the hotel domain. Our method leverages the capabilities of GPT-4o through adaptive few-shot prompting technique, where similar contextual examples are dynamically selected for each review using a k-Nearest Neighbors (kNN) search over train embeddings from a fine-tuned encoder model. This approach tailors the prompt to each specific instance, enhancing classification performance over minority class. Our submission achieved an F1-score of 76.0% on the official test set, showing stronger performance for the Saudi dialect compared to Darija.

pdf bib abs
A Hybrid Transformer-Based Model for Sentiment Analysis of Arabic Dialect Hotel Reviews
Rawand Alfugaha | Mohammad AL-Smadi

This paper describes the AraNLP system developed for the “Ahasis” shared task on sentiment detection in Arabic dialects for hotel reviews. The task involved classifying the overall sentiment of hotel reviews (Positive, Negative, or Neutral) written in Arabic dialects, specifically Saudi and Darija. Our proposed model, AraNLP, is a hybrid deep learning classifier that leverages the strengths of a transformer-based Arabic model (AraELECTRA)augmented with classical bag-of-words style features (TF-IDF). Our system achieved an F1-score of 76%, securing the 5th rank in the shared task, significantly outperforming the baseline system’s F1-score of 56%.

pdf bib abs
Arabic-Centric Large Language Models for Dialectal Arabic Sentiment Analysis Task
Salwa Saad Alahmari | Eric Atwell | Hadeel Saadany | Mohammad Alsalka

This paper presents a study on sentiment anal- ysis of Dialectal Arabic (DA), with a particu- lar focus on Saudi and Moroccan (Darija) di- alects within the hospitality domain. We in- troduce a novel dataset comprising 698 Saudi Arabian proverbs annotated with sentiment polarity labels—Positive, Negative, and Neu- tral—collected from five major Saudi dialect regions: Najdi, Hijazi, Shamali, Janoubi, and Sharqawi. In addition to this, we used customer reviews for fine-tuning the CAMeLBERT-DA- SA model, which achieved a 75% F1 score in sentiment classification. To further evaluate the robustness of Arabic-centric models, we assessed the performance of three open-source large language models—Allam, ACeGPT, and Jais—in a zero-shot setting using the Ahasis shared task test set. Our results highlight the effectiveness of domain-specific fine-tuning in improving sentiment analysis performance and demonstrate the potential of Arabic-centric LLMs in zero-shot scenarios. This work con- tributes new linguistic resources and empirical insights to support ongoing research in senti- ment analysis for Arabic dialect

pdf bib abs
A Gemini-Based Model for Arabic Sentiment Analysis of Multi-Dialect Hotel Reviews: Ahasis Shared Task Submission
Mohammed A. H. Lubbad

This paper presents a sentiment analysis model tailored for Arabic dialects in the hospitality domain, developed for the Ahasis Shared Task. Leveraging the Gemini Pro 1.5 language model, we address the challenges posed by the diversity of Arabic dialects, specifically Saudi and Moroccan Darija. Our method used the official Ahasis dataset of 3,000 hotel reviews. Through iterative benchmarking, dialect labeling, sarcasm detection, and fine-tuning, we adapted Gemini Pro 1.5 for the task. The final model achieved an F1-score of 0.7361 and ranked 10th on the competition leaderboard. This work shows that prompt engineering and domain adaptation of LLMs can mitigate challenges of dialectal variation, sarcasm, and resource scarcity in Arabic sentiment classification. Our contribution lies in the integration of dialect-specific prompt tuning with real-time batch inference, avoiding retraining. This approach, validated across 3,000 competition samples and 700 internal benchmarks, establishes a novel template for Arabic-domain sentiment pipelines.

pdf bib abs
Sentiment Analysis on Arabic Dialects: A Multi-Dialect Benchmark
Abdusalam F. Ahmad Nwesri | Nabila Almabrouk S. Shinbir | Amani Bahlul Sharif

This paper presents our contribution to the AHASIS Shared Task at RANLP 2025, which focuses on sentiment analysis for Arabic dialects. While sentiment analysis has seen considerable progress in Modern Standard Arabic (MSA), the diversity and complexity of Arabic dialects pose unique challenges that remain underexplored. We address this by fine-tuning six pre-trained language models, including AraBERT, MARBERTv2, QARiB, and DarijaBERT, on a sentiment-labeled dataset comprising hotel reviews written in Saudi and Moroccan (Darija) dialects. Our experiments evaluate the models’ performance on both combined and individual dialect datasets. MARBERTv2 achieved the highest performance with an F1-score of 79% on the test set, securing third place among 14 participants. We further analyze the effectiveness of each model across dialects, demonstrating the importance of dialect-aware pretraining for Arabic sentiment analysis. Our findings highlight the value of leveraging large pre-trained models tailored to dialectal Arabic for improved sentiment classification.

pdf (full)
bib (full) Proceedings of the Shared Task on Multi-Domain Detection of AI-Generated Text

The generation of highly fluent text by Large Language Models (LLMs) poses a significant challenge to information integrity and academic research. In this paper, we introduce the Multi-Domain Detection of AI-Generated Text (M-DAIGT) shared task, which focuses on detecting AI-generated text across multiple domains, particularly in news articles and academic writing. M-DAIGT comprises two binary classification subtasks: News Article Detection (NAD) (Subtask 1) and Academic Writing Detection (AWD) (Subtask 2). To support this task, we developed and released a new large-scale benchmark dataset of 30,000 samples, balanced between human-written and AI-generated texts. The AI-generated content was produced using a variety of modern LLMs (e.g., GPT-4, Claude) and diverse prompting strategies. A total of 46 unique teams registered for the shared task, of which four teams submitted final results. All four teams participated in both Subtask 1 and Subtask 2. We describe the methods employed by these participating teams and briefly discuss future directions for M-DAIGT.

pdf bib abs
AI-Generated Text Detection Using DeBERTa with Auxiliary Stylometric Features
Annepaka Yadagiri | L. D. M. S Sai Teja | Partha Pakray | Chukhu Chunka

The global proliferation of Generative Artificial Intelligence (GenAI) has led to the increasing presence of AI-generated text across a wide spectrum of topics, ranging from everyday content to critical and specialized domains. Often, individuals are unaware that the text they interact with was produced by AI systems rather than human authors, leading to instances where AI-generated content is unintentionally combined with human-written material. In response to this growing concern, we propose a novel approach as part of the Multi-Domain AI-Generated Text Detection (M-DAIGT) shared task, which aims to accurately identify AI-generated content across multiple domains, particularly in news reporting and academic writing. Given the rapid evolution of large language models (LLMs), distinguishing between human-authored and AI-generated text has become increasingly challenging. To address this, our method employs fine-tuning strategies using transformer-based language models for binary text classification. We focus on two specific domains, news and scholarly writing, and demonstrate that our approach, based on the DeBERTa transformer model, achieves superior performance in identifying AI-generated text. Our team, CNLP-NITS-PP, achieved 5th position in Subtask 1 and 3rd position in Subtask 2.

pdf bib abs
Shared Task on Multi-Domain Detection of AI-Generated Text (M-DAIGT)
Sareem Farooqui | Ali Zain | Dr Muhammad Rafi

We participated in two subtasks: Subtask 1, focusing on news articles, and Subtask 2, focusing on academic abstracts. Our submission is based on three distinct architectural approaches: (1) Fine-tuning a RoBERTa-base model, (2) A TF-IDF based system with a Linear Support Vector Machine (SVM) classifier, and (3) An experimental system named Candace, which leverages probabilistic features extracted from multiple Llama-3.2 models (1B and 3B variants) fed into a Transformer Encoder-based classifier. Our RoBERTa-based system demonstrated strong performance on the development and test sets for both subtasks and was chosen as our primary submission to both the shared subtasks.

pdf bib abs
A Multimodal Transformer-based Approach for Cross-Domain Detection of Machine-Generated Text
Mohammad AL-Smadi

The rapid advancement of large language models (LLMs) has made it increasingly challenging to distinguish between human-written and machine-generated content. This paper presents IntegrityAI, a multimodal ELECTRA-based model for the detection of AI-generated text across multiple domains. Our approach combines textual features processed through a pre-trained ELECTRA model with handcrafted stylometric features to create a robust classifier. We evaluate our system on the Multi-Domain Detection of AI-Generated Text (M-DAIGT) shared task, which focuses on identifying AI-generated content in news articles and academic writing. IntegrityAI achieves exceptional performance and ranked 1st in both subtasks, with F1-scores of 99.6% and 99.9% on the news article detection and academic writing detection subtasks, respectively. Our results demonstrate the effectiveness of combining transformer-based models with stylometric analysis for detecting AI-generated content across diverse domains and writing styles.

pdf bib abs
Inside the Box: A Streamlined Model for AI-Generated News Article Detection
Nsrin Ashraf | Mariam Labib | Hamada Nayel

The rapid proliferation of AI-generated text has raised concerns. With the increasing prevalence of AI-generated content, concerns have grown regarding authenticity, authorship, and the spread of misinformation. Detecting such content accurately and efficiently has become a pressing challenge. In this study, we propose a simple yet effective system for classifying AI-generated versus human-written text. Rather than relying on complex or resource-intensive deep learning architectures, our approach leverages classical machine learning algorithms combined with the TF-IDF text representation technique. Evaluated on the M-DAIGT shared task dataset, our Support Vector Machine (SVM) based system achieved strong results, ranking second on the official leaderboard and demonstrating competitive performance across all evaluation metrics. These findings highlight the potential of traditional lightweight models to address modern challenges in text authenticity detection, particularly in low-resource or real-time applications where interpretability and efficiency are essential.

pdf (full)
bib (full) Proceedings of the 8th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Texts

pdf bib
Proceedings of the 8th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Texts
Ali Hürriyetoğlu | Hristo Tanev | Surendrabikram Thapa | Surabhi Adhikari

pdf bib abs
Findings and Insights from the 8th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text
Ali Hurriyetoglu | Surendrabikram Thapa | Hristo Tanev | Surabhi Adhikari

This paper presents an overview of the 8th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE), held in conjunction with RANLP 2025. The workshop featured a range of contributions, including regular research papers, system descriptions from shared task participants, and an overview paper on shared task outcomes. Continuing its tradition, CASE brings together researchers from computational and social sciences to explore the evolving landscape of event extraction. With the rapid advancement of large language models (LLMs), this year’s edition placed particular emphasis on their application to socio-political event extraction. Alongside text-based approaches, the workshop also highlighted the growing interest in multimodal event extraction, addressing complex real-world scenarios across diverse modalities.

pdf bib abs
Challenges and Applications of Automated Extraction of Socio-political Events at the age of Large Language Models
Surendrabikram Thapa | Surabhi Adhikari | Hristo Tanev | Ali Hurriyetoglu

Socio-political event extraction (SPE) enables automated identification of critical events such as protests, conflicts, and policy shifts from unstructured text. As a foundational tool for journalism, social science research, and crisis response, SPE plays a key role in understanding complex global dynamics. The emergence of large language models (LLMs) like GPT-4 and LLaMA offers new opportunities for flexible, multilingual, and zero-shot SPE. However, applying LLMs to this domain introduces significant risks, including hallucinated outputs, lack of transparency, geopolitical bias, and potential misuse in surveillance or censorship. This position paper critically examines the promises and pitfalls of LLM-driven SPE, drawing on recent datasets and benchmarks. We argue that SPE is a high-stakes application requiring rigorous ethical scrutiny, interdisciplinary collaboration, and transparent design practices. We propose a research agenda focused on reproducibility, participatory development, and building systems that align with democratic values and the rights of affected communities.

This paper presents the Shared Task on Multimodal Detection of Hate Speech, Humor, and Stance in Marginalized Socio-Political Movement Discourse, hosted at CASE 2025. The task is built on the PrideMM dataset, a curated collection of 5,063 text-embedded images related to the LGBTQ+ pride movement, annotated for four interrelated subtasks: (A) Hate Speech Detection, (B) Hate Target Classification, (C) Topical Stance Classification, and (D) Intended Humor Detection. Eighty-nine teams registered, with competitive submissions across all subtasks. The results show that multimodal approaches consistently outperform unimodal baselines, particularly for hate speech detection, while fine-grained tasks such as target identification and stance classification remain challenging due to label imbalance, multimodal ambiguity, and implicit or culturally specific content. CLIP-based models and parameter-efficient fusion architectures achieved strong performance, showing promising directions for low-resource and efficient multimodal systems.

pdf bib abs
Natural Language Processing vs Large Language Models: this is the end of the world as we know it, and I feel fine
Bertrand De Longueville

As practitioners in the field of Natural Language Processing (NLP), we have had the unique vantage point of witnessing the evolutionary strides leading to the emergence of Large Language Models (LLMs) over the past decades. This perspective allows us to contextualise the current enthusiasm surrounding LLMs, especially following the introduction of “General Purpose” Language Models and the widespread adoption of conversational chatbots built on their frameworks. At the same time, we have observed the remarkable capabilities of zeroshot systems powered by LLMs in extracting structured information from text, outperforming previous iterations of language models. In this paper, we contend that that the hype around “conversational AI” is both a revolution and an epiphenomenon for NLP, particularly in the domain of information extraction from text. By adopting a measured approach to the recent technological advancements in Artificial Intelligence that are reshaping NLP, and by utilising Automated Socio-Political Event Extraction from text as a case study, this commentary seeks to offer insights into the ongoing trends and future directions in the field.

pdf bib abs
Machine Translation in the AI Era: Comparing previous methods of machine translation with large language models
William Jock Boyd | Ruslan Mitkov

The aim of this paper is to compare the efficacy of multiple different methods of machine translation in the French-English language pair. There is a particular focus on Large Language Models given they are an emerging technology that could have a profound effect on the field of machine translation. This study used the European Parliament’s parallel French-English corpus, testing each method on the same section of data, with multiple different Neural Translation, Large Language Model and Rule-Based solutions being used. The translations were then evaluated using BLEU and METEOR scores to gain an accurate understanding of both precision and semantic accuracy of translation. Statistical analysis was then performed to ensure the results validity and statistical significance. This study found that Neural Translation was the best translation technology overall, with Large Language Models coming second and Rule-Based translation coming last by a significant margin. It was also discovered that within Large Language Model implementations that specifically trained translation capabilities outperformed emergent translation capabilities.

pdf bib abs
Steering Towards Fairness: Mitigating Political Stance Bias in LLMs
Afrozah Nadeem | Mark Dras | Usman Naseem

Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases along political and economic dimensions. In this paper, we employ a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), this method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation. This work provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing beyond surface-level output interventions.

pdf bib abs
wangkongqiang@CASE 2025: Detection and Classifying Language and Targets of Hate Speech using Auxiliary Text Supervised Learning
Wang Kongqiang | Zhang Peng

Our team was interested in content classification and labeling from multimodal detection of Hate speech, Humor, and Stance in marginalized socio-political movement discourse. We joined the task: Subtask A-Detection of Hate Speech and Subtask B-Classifying the Targets of Hate Speech. In this two task, our goal is to assign a content classification label to multimodal Hate Speech. Detection of Hate Speech: The aim is to detect the presence of hate speech in the images. The dataset for this task will have binary labels: No Hate and Hate. Classifying the Targets of Hate Speech: Given that an image is hateful, the goal here is to identify the targets of hate speech. The dataset here will have four labels: Undirected, Individual, Community, and Organization. Our group used a supervised learning method and a text prediction model. The best result on the test set for Subtask-A and Subtask-B were F1 score of 0.6209 and 0.3453, ranking twentieth and thirteenth among all teams.

pdf bib abs
Luminaries@CASE 2025: Multimodal Hate Speech, Target, Stance and Humor Detection using ALBERT and Classical Models
Akshay Esackimuthu

In recent years, the detection of harmful and socially impactful content in multimodal online data has emerged as a critical area of research, driven by the increasing prevalence of text-embedded images and memes on social media platforms. These multimodal artifacts serve as powerful vehicles for expressing solidarity, resistance, humor, and sometimes hate, especially within the context of marginalized socio-political movements. To address these challenges, this shared task introduces a comprehensive, fine-grained classification framework consisting of four subtasks: (A) detection of hate speech, (B) identification of hate speech targets, (C) classification of topical stance toward marginalized movements, and (D) detection of intended humor. By focusing on the nuanced interplay between text and image modalities, this task aims to push the boundaries of automated socio-political event understanding and moderation. Using state-of-the-art deep learning and multimodal modeling approaches, this work seeks to enable a more effective detection of complex online phenomena, thus contributing to safer and more inclusive digital environments

pdf bib abs
Overfitters@CASE2025: Multimodal Hate Speech Analysis Using BERT and RESNET
Bidhan Chandra Bhattarai | Dipshan Pokhrel | Ishan Maharjan | Rabin Thapa

Marginalized socio-political movements have become focal points of online discourse, polarizing public opinion and attracting attention through controversial or humorous content. Memes, play a powerful role in shaping this discourse both as tools of empowerment, and as vessels for ridicule or hate. The ambiguous and highly contextual nature of these memes presents a unique challenge for computational systems. In this work we try to identify these trends. Our approach leverages the BERT+ResNet(BERTRES) model to classify the multimodal content into different categories based on different tasks for the Shared Task on Multimodal Detection of Hate Speech, Humor, and Stance in Marginalized SocioPolitical Movement Discourse at CASE 2025. The task is divided into four sub-tasks: subtask A focuses on detection of hate speech, subtask B focuses on classifying the targets of hate speech, subtask C focuses on classification of topical stance and subtask D focuses on detection of intended humor. Our approach obtained a 0.73 F1 score in subtask A, 0.56 F1 score in subtask B, 0.6 F1 score in subtask C, 0.65 F1 score in subtask D.

pdf bib abs
Silver@CASE2025: Detection of Hate Speech, Targets, Humor, and Stance in Marginalized Movement
Rohan Mainali | Neha Aryal | Sweta Poudel | Anupraj Acharya | Rabin Thapa

Memes, a multimodal form of communication, have emerged as a popular mode of expression in online discourse, particularly among marginalized groups. With multiple meanings, memes often combine satire, irony, and nuanced language, presenting particular challenges to machines in detecting hate speech, humor, stance, and the target of hostility. This paper presents a comparison of unimodal and multimodal solutions to address all four subtasks of the CASE 2025 Shared Task on Multimodal Hate, Humor, and Stance Detection. We compare transformer-based text models (BERT, RoBERTa) with CNN-based vision models (DenseNet, EfficientNet), and multimodal fusion methods, such as CLIP. We find that multimodal systems consistently outperform the unimodal baseline, with CLIP performing the best on all subtasks with a macro F1 score of 78% in sub-task A, 56% in sub-task B, 59% in sub-task C, and 72% in sub-task D.

In recent years, memes have developed as popular forms of online satire and critique, artfully merging entertainment, social critique, and political discourse. On the other side, memes have also become a medium for the spread of hate speech, misinformation, and bigotry, especially towards marginalized communities, including the LGBTQ+ population. Solving this problem calls for the development of advanced multimodal systems that analyze the complex interplay between text and visuals in memes. This paper describes our work in the CASE@RANLP 2025 shared task. As a part of that task, we developed systems for hate speech detection, target identification, stance classification, and humor recognition within the text of memes. We investigate two multimodal transformer-based systems, ResNet-18 with BERT and SigLIP2, for these sub-tasks. Our results show that SigLIP-2 consistently outperforms the baseline, achieving an F1 score of 79.27 in hate speech detection, 72.88 in humor classification, and competitive performance in stance 60.59 and target detection 54.86. Through this study, we aim to contribute to the development of ethically grounded, inclusive NLP systems capable of interpreting complex sociolinguistic narratives in multi-modal content.

pdf bib abs
Multimodal Deep Learning for Detection of Hate, Humor, and Stance in Social Discourse on Marginalized Communities
Durgesh Verma | Abhinav Kumar

Internet memes serve as powerful vehicles of expression across platforms like Instagram, Twitter, and WhatsApp. However, they often carry implicit messages such as humor, sarcasm, or offense especially in the context of marginalized communities. Understanding such intent is crucial for effective moderation and content filtering. This paper introduces a deep learning-based multimodal framework developed for the CASE 2025 Shared Task on detecting hate, humor, and stance in memes related to marginalized movements. The study explores three architectures combining textual models (BERT, XLM-RoBERTa) with visual encoders (ViT, CLIP), enhanced through cross-modal attention and Transformer-based fusion. Evaluated on four subtasks, the models effectively classify meme content—such as satire and offense—demonstrating the value of attention-driven multimodal integration in interpreting nuanced social media expressions

pdf bib abs
Multimodal Kathmandu@CASE 2025: Task-Specific Adaptation of Multimodal Transformers for Hate, Stance, and Humor Detection
Sujal Maharjan | Astha Shrestha | Shuvam Thakur | Rabin Thapa

The multimodal ambiguity of text-embedded images (memes), particularly those pertaining to marginalized communities, presents a significant challenge for natural language and vision processing. The subtle interaction between text, image, and cultural context makes it challenging to develop robust moderation tools. This paper tackles this challenge across four key tasks: (A) Hate Speech Detection, (B) Hate Target Classification, (C) Topical Stance Classification, and (D) Intended Humor Detection. We demonstrate that the nuances of these tasks demand a departure from a ‘onesize-fits-all’ approach. Our central contribution is a task-specific methodology, where we align model architecture with the specific challenges of each task, all built upon a common CLIP-ViT backbone. Our results illustrate the strong performance of this task-specific approach, with multiple architectures excelling at each task. For Hate Speech Detection (Task A), the Co-Attention Ensemble model achieved a top F1-score of 0.7929; for Hate Target Classification (Task B), our Hierarchical CrossAttention Transformer achieved an F1-score of 0.5777; and for Stance (Task C) and Humor Detection (Task D), our Two-Stage Multiplicative Fusion Framework yielded leading F1-scores of 0.6070 and 0.7529, respectively. Beyond raw results, we also provide detailed error analyses, including confusion matrices, to reveal weaknesses driven by multimodal ambiguity and class imbalance. Ultimately, this work provides a blueprint for the community, establishing that optimal performance in multimodal analysis is achieved not by a single superior model, but through the customized design of specialized solutions, supported by empirical validation of key methodological choices.

pdf bib abs
MMFusion@CASE 2025: Attention-Based Multimodal Learning for Text-Image Content Analysis
Prerana Rane

Text-embedded images, such as memes, are now increasingly common in social media discourse. These images combine visual and textual elements to convey complex attitudes and emotions. Deciphering the intent of these images is challenging due to their multimodal and context-dependent nature. This paper presents our approach to the Shared Task on Multimodal Hate, Humor, and Stance Detection in Marginalized Movement at CASE 2025. The shared task focuses on four key aspects of multimodal content analysis for text-embedded images: hate speech detection, target identification, stance classification, and humor recognition. We propose a multimodal learning framework that uses both textual and visual representations, along with cross-modal attention mechanisms, to classify content across all tasks effectively.

pdf bib abs
TSR@CASE 2025: Low Dimensional Multimodal Fusion Using Multiplicative Fine Tuning Modules
Sushant Kr. Ray | Rafiq Ali | Abdullah Mohammad | Ebad Shabbir | Samar Wazir

This study describes our submission to the CASE 2025 shared task on multimodal hate event detection, which focuses on hate detection, hate target identification, stance determination, and humour detection on text embedded images as classification challenges. Our submission contains entries in all of the subtasks. We propose FIMIF, a lightweight and efficient classification model that leverages frozen CLIP encoders. We utilise a feature interaction module that allows the model to exploit multiplicative interactions between features without any manual engineering. Our results demonstrate that the model achieves comparable or superior performance to larger models, despite having a significantly smaller parameter count

pdf bib abs
PhantomTroupe@CASE 2025: Multimodal Hate Speech Detection in Text-Embedded Memes using Instruction-Tuned LLMs
Farhan Amin | Muhammad Abu Horaira | Md. Tanvir Ahammed Shawon | Md. Ayon Mia | Muhammad Ibrahim Khan

Memes and other text-embedded images are powerful tools for expressing opinions and identities, especially within marginalized socio-political movements. Detecting hate speech in this type of multimodal content is challenging because of the subtle ways text and visuals interact. In this paper, we describe our approach for Subtask A of the Shared Task on Multimodal Hate Detection in Marginalized Movement@CASE 2025, which focuses on classifying memes as either Hate or No Hate. We tested both unimodal and multimodal setups, using models like DistilBERT, HateBERT, Vision Transformer, and Swin Transformer. Our best system is the large multimodal model Qwen2.5-VL-7B-Instruct-bnb-4bit, fine-tuned with 4-bit quantization and instruction prompts. While we also tried late fusion with multiple transformers, Qwen performed better at capturing text-image interactions in memes. This LLM-based approach reached the highest F1-score of 0.8086 on the test set, ranking our team 5th overall in the task. These results show the value of late fusion and instruction-tuned LLMs for tackling complex hate speech in socio-political memes.

pdf bib abs
ID4Fusion@CASE 2025: A Multimodal Approach to Hate Speech Detection in Text-Embedded Memes Using ensemble Transformer based approach
Tabassum Basher Rashfi | Md. Tanvir Ahammed Shawon | Md. Ayon Mia | Muhammad Ibrahim Khan

Identification of hate speech in images with text is a complicated task in the scope of online content moderation, especially when such talk penetrates into the spheres of humor and critical societal topics. This paper deals with Subtask A of the Shared Task on Multimodal Hate, Humor, and Stance Detection in Marginalized Movement@CASE2025. This task is binary classification over whether or not hate speech exists in image contents, and it advances as Hate versus No Hate. To meet this goal, we present a new multimodal architecture that blends the textual and visual features to reach effective classification. In the textual aspect, we have fine-tuned two state-of-the-art transformer models, which are RoBERTa and HateBERT, to extract linguistic clues of hate speech. The image encoder contains both the EfficientNetB7 and a Vision Transformer (ViT) model, which were found to work well in retrieving image-related details. The predictions made by each modality are then merged through an ensemble mechanism, with the last estimate being a weighted average of the text- and image-based scores. The resulting model produces a desirable F1- score metric of 0.7868, which is ranked 10 among the total number of systems, thus becoming a clear indicator of the success of multimodal combination in addressing the complex issue of self-identifying the hate speech in text-embedded images.

pdf bib abs
Team MemeMasters@CASE 2025: Adapting Vision-Language Models for Understanding Hate Speech in Multimodal Content
Shruti Gurung | Shubham Shakya

Social media memes have become a powerful form of digital communication, combining images and text to convey humor, social commentary, and sometimes harmful content. This paper presents a multimodal approach using a fine-tuned CLIP model to analyze textembedded images in the CASE 2025 Shared Task. We address four subtasks: Hate Speech Detection, Target Classification, Stance Detection, and Humor Detection. Our method effectively captures visual and textual signals, achieving strong performance with precision of 80% for the detection of hate speech and 76% for the detection of humor, while stance and target classification achieved a precision of 60% and 54%, respectively. Detailed evaluations with classification reports and confusion matrices highlight the ability of the model to handle complex multimodal signals in social media content, demonstrating the potential of vision-language models for computational social science applications.

Memes and text-embedded images have rapidly become compelling cultural artifacts that both facilitate expressive communication and serve as conduits for spreading hate speech against marginalized communities. Detecting hate speech within such multimodal content poses significant challenges due to the complex and subtle interplay between textual and visual elements. This paper presents our approach for Subtask A of the Shared Task on Multimodal Hate Detection in Marginalized Movement@CASE 2025, focusing on the binary classification of memes into Hate or No Hate categories. We propose a novel multimodal architecture that integrates DistilBERT for textual encoding with Vision Transformer (ViT) for image representation, combined through an advanced late fusion mechanism leveraging multi-head attention. Our method utilizes attention-based feature alignment to capture nuanced cross-modal interactions within memes. The proposed system achieved an F1-score of 0.7416 on the test set, securing the 13th position in the competition. These results underscore the value of sophisticated fusion strategies and attention mechanisms in comprehending and detecting complex socio-political content embedded in memes.

pdf (full)
bib (full) Proceedings of the First Workshop on Ethical Concerns in Training, Evaluating and Deploying Large Language Models

pdf bib
Proceedings of the First Workshop on Ethical Concerns in Training, Evaluating and Deploying Large Language Models
Damith Premasiri | Tharindu Ranasinghe | Hansi Hettiarachchi

pdf bib abs
TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks
Arjun Damerla | Jimin Lim | Yanxi Jiang | Nam Nguyen Hoai Le | Nikil Selladurai

Large language models (LLMs) have shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty only using natural language remains under-explored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, “you earned a token”, without access to numerical cues or explicit probabilities, resulting in the model to infer latent reward structures purely off linguistic cues and to adapt accordingly. We evaluated the performance of four open-source LLMs and compare their performance to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed compared to the baselines, Qwen3-4B, achieved the best-arm selection rate of 89.2% , which significantly outperformed both the larger LLMs and traditional methods. Our findings suggest that probabilistic reasoning is able to emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.

Modern LLMs often generate fluent text yet fabricate, misquote, or misattribute evidence. To quantify this flaw, we built a balanced Citation‐Alignment Dataset of 500 genuine, expert‐verified claim–quote pairs and 500 minimally perturbed false variants from news, legal, scientific, and literary sources. We then propose CoVeGAT, which converts claims and citations into SVO triplets (with trigram fallback), scores each pair via an LLM‐driven chain of verification, and embeds them in a weighted semantic graph. A Graph Attention Network over BERT embeddings issues strict pass/fail judgments on alignment. Zero‐shot evaluation of seven top LLMs (e.g., GPT‐4o, Gemini 1.5, Mistral 7B) reveals a trade‐off: decisive models reach 82.5 % accuracy but err confidently, while cautious ones fall below 50 %. A MiniLM + RBF kernel baseline, by contrast, achieves 96.4 % accuracy, underscoring the power of simple, interpretable methods.

Many enterprises are increasingly adopting Artificial Intelligence (AI) to make internal processes more competitive and efficient. In response to public concern and new regulations for the ethical and responsible use of AI, implementing AI governance frameworks could help to integrate AI within organisations and mitigate associated risks. However, the rapid technological advances and lack of shared ethical AI infrastructures creates barriers to their practical adoption in businesses. This paper presents a real-world AI application at TVS Supply Chain Solutions, reporting on the experience developing an AI assistant underpinned by large language models and the ethical, regulatory, and sociotechnical challenges in deployment for enterprise use.

pdf (full)
bib (full) Proceedings of the First International Workshop on Gaze Data and Natural Language Processing

pdf bib
Proceedings of the First International Workshop on Gaze Data and Natural Language Processing
Cengiz Acarturk | Jamal Nasir | Burcu Can | Cagrı Coltekin

pdf bib abs
What Determines Where Readers Fixate Next? Leveraging NLP to Investigate Human Cognition
Adrielli Tina Lopes Rego | Joshua Snell | Martijn Meeter

During reading, readers perform rapid forward and backward eye movements through text, called saccades. How these saccades are targeted in the text is not yet fully known, particularly regarding the role of higher-order linguistic processes in guiding eye-movement behaviour in naturalistic reading. Current models of eye movement simulation in reading either limit the role of high-order linguistic information or lack explainability and cognitive plausibility. In this study, we investigate the influence of linguistic information on saccade targeting, i.e. determining where to move our eyes next, by predicting which word is fixated next based on a limited processing window that resembles the amount of information humans readers can presumably process in parallel within the visual field at each fixation. Our preliminary results suggest that, while word length and frequency are important factors for determining the target of forward saccades, the contextualized meaning of the previous sequence, as well as whether the context word had been fixated before and the distance of the previous saccade, are important factors for predicting backward saccades.

pdf bib abs
Benchmarking Language Model Surprisal for Eye-Tracking Predictions in Brazilian Portuguese
Diego Alves

This study evaluates the effectiveness of surprisal estimates from six publicly available large language models (LLMs) in predicting reading times in Brazilian Portuguese (BP), using eye-tracking data from the RastrOS corpus. We analyze three key reading time measures: first fixation duration, gaze duration, and total fixation time. Our results demonstrate that surprisal significantly predicts all three measures, with a consistently linear effect observed across all models and the strongest effect for total fixation duration. We also find that larger model size does not necessarily provide better surprisal estimates. Additionally, entropy reduction derived from Cloze norms adds minimal predictive value beyond surprisal, and only for first fixation duration. These findings replicate known surprisal effects in BP and provide novel insights into how different models and linguistic predictors influence reading time predictions.

pdf bib abs
EgoDrive: Egocentric Multimodal Driver Behavior Recognition Using Project Aria
Michael Rice | Lorenz Krause | Waqar Shahid Qureshi

Egocentric sensing using wearable devices of- fers a unique first-person perspective for driver behavior analysis and monitoring, with the po- tential to accurately capture rich multimodal cues such as eye gaze, head motion, and hand activity directly from the driver’s view- point. In this paper, we introduce a multimodal driver behavior recognition framework utilizing Meta’s Project Aria smart glasses, along with a novel, synchronized egocentric driving dataset comprising high-resolution RGB video, gaze- tracking data, inertial IMU signals, hand pose landmarks, and YOLO-based semantic object detections. All sensor data streams are tempo- rally aligned and segmented into fixed-length clips, each manually annotated with one of six distinct driver behavior classes: Driving, Left Mirror Check, Right Wing Mirror Check, Rear- view Mirror Check, Mobile Phone Usage, and Idle. We design a Transformer-based recog- nition framework in which each modality is processed by a specialized encoder and then fused via Temporal Transformer layers to cap- ture cross-modal temporal dependencies. To in- vestigate the trade-off between accuracy and ef- ficiency for real-time deployment, we introduce two model variants: EgoDriveMax, optimized for maximum accuracy, and EgoDriveRT, de- signed for real-time performance. These mod- els achieve classification accuracies of 98.6% and 97.4% respectively. Notably, EgoDriveRT delivers strong performance despite operating with only 104K parameters and requiring just 2.65 ms per inference without the use of a spe- cialized GPU—highlighting its potential for efficient, real-time in-cabin driver monitoring.

pdf bib abs
Comparing Eye-gaze and Transformer Attention Mechanisms in Reading Tasks
Maria Mouratidi | Massimo Poesio

As transformers become increasingly prevalent in NLP research, evaluating their cognitive alignment with human language processing has become essential for validating them as models of human language. This study compares eye-gaze patterns in human reading with transformer attention using different attention representations (raw attention, attention flow, gradient-based saliency). We employ both statistical correlation analysis and predictive modeling using PCA-reduced representations of eye-tracking features across two reading tasks. The findings reveal lower correlations and predictive capacity for the decoder model compared to the encoder model, with implications for the gap between behavioral performance and cognitive plausibility of different transformer designs.

pdf bib abs
A French Eye-Tracking Corpus of Original and Simplified Medical, Clinical, and General Texts - FETA
Oksana Ivchenko | Natalia Grabar

Eye tracking offers an objective window on real-time cognitive processing of information being read: longer fixations, more regressions, and wider pupil dilation reliably index linguistic difficulty. Yet, there is a paucity of the available corpora annotated with eye-tracking features. We introduce in this paper the FETA corpus – a French Eye-TrAcking corpus. It combines three types of texts (general, medical and clinical) in two versions (original and manually simplified). These texts are read by 46 participants, from which we collect eye-tracking data through dozens of eye-tracking features.

pdf bib abs
Exploring Mouse Tracking for Reading on Romanian Data
Cristina Maria Popescu | Sergiu Nisioi

In this paper, we investigate the use of the Mouse Tracking for Reading (MoTR) method for a sample of Romanian texts. MoTR is a novel measurement tool that is meant to collect word-by-word reading times. In a typical MoTR trial, the text is blurred, except for a small area around the mouse pointer and the participants must move the mouse to reveal and read the text. In the current experiment, participants read such texts and afterwords answered comprehension questions, aiming to evaluate reading behavior and cognitive engagement. Mouse movement is recorded and analyzed to evaluate attention distribution across a sentence, providing insights into incremental language processing. Based on all the information gathered, the study confirms the feasibility of this method in a controlled setting and emphasizes MoTR’s potential as an accessible and naturalistic approach for studying text comprehension.

pdf bib abs
Where Patients Slow Down: Surprisal, Uncertainty, and Simplification in French Clinical Reading
Oksana Ivchenko | Alamgir Munir Qazi | Jamal Abdul Nasir

This eye-tracking study links language-model surprisal and contextual entropy to how 23 non-expert adults read French health texts. Participants read seven texts (clinical case, medical, general), each available in an Original and Simplified version. Surprisal and entropy were computed with eight autoregressive models (82M–8B parameters), and four complementary eye-tracking measures were analyzed. Surprisal correlates positively with early reading measures, peaking in the smallest GPT-2 models (r ≈ 0.26) and weakening with model size. Entropy shows the opposite pattern, with negative correlations strongest in the 7B-8B models (r ≈ −0.13), consistent with a skim-when-uncertain strategy. Surprisal effects are largest in Clinical Original passages and drop by ∼20% after simplification, whereas entropy effects are stable across domain and version. These findings expose a scaling paradox – where different model sizes are optimal for different cognitive signals – and suggest that French plain-language editing should focus on rewriting high-surprisal passages to reduce processing difficulty, and on avoiding high-entropy contexts for critical information.

pdf bib abs
AlEYEgnment: Leveraging Eye‐Tracking‐While‐Reading to Align Language Models with Human Preferences
Anna Bondar | David Robert Reich | Lena Ann Jäger

Direct Preference Optimisation (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its reliance on binary feedback restricts its ability to capture nuanced human judgements. To address this limitation, we introduce a gaze-informed extension that incorporates implicit, fine-grained signals from eye-tracking-while-reading into the DPO framework. Eye movements, reflecting real-time human cognitive processing, provide fine-grained signals about the linguistic characteristics of the text that is being read. We leverage these signals and modify DPO by introducing a gaze-based additional loss term, that quantifies the differences between the model’s internal sentence representations and cognitive (i.e., gaze-based) representations derived from the readers’ gaze patterns. We explore the use of both human and synthetic gaze signals, employing a generative model of eye movements in reading to generate supplementary training data, ensuring the scalability of our approach. We apply the proposed approach to modelling linguistic acceptability. Experiments conducted on the CoLA dataset demonstrate performance gains in grammatical acceptability classification tasks when the models are trained in the gaze-augmented setting. These results demonstrate the utility of leveraging gaze data to align language models with human preferences. All code and data are available from Github.

pdf bib abs
Predicting Total Reading Time Using Romanian Eye-Tracking Data
Anamaria Hodivoianu | Oleksandra Kuvshynova | Filip Popovici | Adrian Luca | Sergiu Nisioi

This work introduces the first Romanian eye-tracking dataset for reading and investigates methods for predicting word-level total reading times. We develop and compare a range of models, from traditional machine learning using handcrafted linguistic features to fine-tuned Romanian BERT architectures, demonstrating strong correlations between predicted and observed reading times. Additionally, we propose a lexical simplification pipeline that leverages these TRT predictions to identify and substitute complex words, enhancing text readability. Our approach is integrated into an interactive web tool, illustrating the practical benefits of combining cognitive signals with NLP techniques for Romanian — a language with limited resources in this area.

pdf (full)
bib (full) Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models

pdf bib abs
Towards the Creation of a Collao Quechua–Spanish Parallel Corpus Using Optical Character Recognition
Gian Carlo Orcotoma Mormontoy | Lida Leon Nuñez | Hugo Espetia Huamanga

The Quechua language stands as a fundamental element of Peru’s social and cultural identity, carries linguistic and cultural significance. However, it faces substantial challenges in terms of digital representation. One major limitation is the scarcity of resources such as a parallel corpus, which limits the development of technological resources for its analysis and practical application. This study addresses this gap through a methodology for building a parallel corpus using Optical Character Recognition (OCR). We digitized a collection of texts from a common origin to create a corpus that enables reliable access. The resulting corpus serves as a valuable asset for linguistic and Natural Language Processing (NLP) research, as well as for Quechua speakers. The source material derives from works produced by graduate students from the Academia Mayor de la Lengua Quechua, validated by academic staff, ensuring grammatical, syntactic and semantic integrity.

pdf bib abs
Prompt Balance Matters: Understanding How Imbalanced Few-Shot Learning Affects Multilingual Sense Disambiguation in LLMs
Deshan Koshala Sumanathilaka | Nicholas Micallef | Julian Hough

Recent advances in Large Language Models (LLMs) have significantly reshaped the landscape of Natural Language Processing (NLP). Among the various prompting techniques, few-shot prompting has gained considerable attention for its practicality and effectiveness. This study investigates how few-shot prompting strategies impact the Word Sense Disambiguation (WSD) task, particularly focusing on the biases introduced by imbalanced sample distributions. We use the GLOSSGPT prompting method, an advanced approach for English WSD, to test its effectiveness across five languages: English, German, Spanish, French, and Italian. Our results show that imbalanced few-shot examples can cause incorrect sense predictions in multilingual languages, but this issue does not appear in English. To assess model behavior, we evaluate both the GPT-4o and LLaMA-3.1-70B models and the results highlight the sensitivity of multilingual WSD to sample distribution in few-shot settings, emphasizing the need for balanced and representative prompting strategies.

pdf bib abs
Non-Contextual BERT or FastText? A Comparative Analysis
Abhay Shanbhag | Suramya Jadhav | Amogh Thakurdesai | Ridhima Bhaskar Sinare | Raviraj Joshi

Natural Language Processing (NLP) for low-resource languages, which lack large annotated datasets, faces significant challenges due to limited high-quality data and linguistic resources. The selection of embeddings plays a critical role in achieving strong performance in NLP tasks. While contextual BERT embeddings require a full forward pass, non-contextual BERT embeddings rely only on table lookup. Existing research has primarily focused on contextual BERT embeddings, leaving non-contextual embeddings largely unexplored. In this study, we analyze the effectiveness of non-contextual embeddings from BERT models (MuRIL and MahaBERT) and FastText models (IndicFT and MahaFT) for tasks such as news classification, sentiment analysis, and hate speech detection in one such low-resource language—Marathi. We compare these embeddings with their contextual and compressed variants. Our findings indicate that non-contextual BERT embeddings extracted from the model’s first embedding layer outperform FastText embeddings, presenting a promising alternative for low-resource NLP.

pdf bib abs
Kantika: A Knowledge-Radiant Framework for Dermatology QA using IR-CoT and RAPTOR-Augmented Retrieval
Deep Das | Vikram Mehrolia | Rahul Dixit | Rohit Kumar

This paper presents an improved Retrieval-Augmented Generation (RAG) approach for domain-specific question-answering in dermatology and cosmetic science. The proposed system integrates RAPTOR-style hierarchical indexing with Iterative Retrieval Chain-of-Thought (IR-CoT) reasoning and CRAG-style interleaved retrieval-generation to better handle complex, clinically grounded queries. It leverages multi-source dermatology data, including peer-reviewed research, product formulations, user reviews, and ingredient safety databases. By decomposing queries into rationale-driven substeps and applying subgoal-specific retrieval, the system improves answer depth, accuracy, and relevance—particularly for ingredient interactions and personalized dermatological guidance. Empirical results show notable gains over standard RAG baselines in both precision and clinical coherence, establishing the effectiveness of this approach in specialized medical QA tasks. With 100% user satisfaction and 99.07% overall accuracy across all document categories, the system sets a strong benchmark for domain-specific medical QA in dermatology.

pdf bib abs
GeistBERT: Breathing Life into German NLP
Raphael Scheible-Schmitt | Johann Frei

Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. We pre-trained GeistBERT using fairseq, following the RoBERTa base configuration with Whole Word Masking (WWM), and initialized from GottBERT weights. The model was trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens. For evaluation, we fine-tuned the model on standard downstream tasks, including NER (CoNLL 2003, GermEval 2014), text classification (GermEval 2018 coarse/fine, 10kGNAD), and NLI (German XNLI), using F₁ score and accuracy as evaluation metrics. GeistBERT achieved strong results across all tasks, leading among base models and setting a new state-of-the-art (SOTA) in GermEval 2018 fine text classification. It also outperformed several larger models, particularly in classification benchmarks. To support research in German NLP, we release GeistBERT under the MIT license.

pdf bib abs
Identifying Contextual Triggers in Hate Speech Texts Using Explainable Large Language Models
Dheeraj Kodati | Bhuvana Sree Lakkireddy

The pervasive spread of hate speech on online platforms poses a significant threat to social harmony, necessitating not only high-performing classifiers but also models capable of transparent, fine-grained interpretability. Existing methods often neglect the identification of influential contextual words that drive hate speech classification, limiting their reliability in high-stakes applications. To address this, we propose LLM-BiMACNet (Large Language Model-based Bidirectional Multi-Channel Attention Classification Network), an explainability-focused architecture that leverages pretrained language models and supervised attention to highlight key lexical indicators of hateful and offensive intent. Trained and evaluated on the HateXplain benchmark—comprising class labels, target community annotations, and human-labeled rationales—LLM-BiMACNet is optimized to simultaneously enhance both predictive performance and rationale alignment. Experimental results demonstrate that our model outperforms existing state-of-the-art approaches, achieving an accuracy of 87.3 %, AUROC of 0.881, token-level F1 of 0.553, IOU-F1 of 0.261, AUPRC of 0.874, and comprehensiveness of 0.524, thereby offering highly interpretable and accurate hate speech detection.

pdf bib abs
PortBERT: Navigating the Depths of Portuguese Language Models
Raphael Scheible-Schmitt | Henry He | Armando B. Mendes

Transformer models dominate modern NLP, but efficient, language-specific models remain scarce. In Portuguese, most focus on scale or accuracy, often neglecting training and deployment efficiency. In the present work, we introduce PortBERT, a family of RoBERTa-based language models for Portuguese, designed to balance performance and efficiency. Trained from scratch on over 450 GB of deduplicated and filtered mC4 and OSCAR23 from CulturaX using fairseq, PortBERT leverages byte-level BPE tokenization and stable pre-training routines across both GPU and TPU processors. We release two variants, PortBERT base and PortBERT large, and evaluate them on ExtraGLUE, a suite of translated GLUE and SuperGLUE tasks. Both models perform competitively, matching or surpassing existing monolingual and multilingual models. Beyond accuracy, we report training and inference times as well as fine-tuning throughput, providing practical insights into model efficiency. PortBERT thus complements prior work by addressing the underexplored dimension of compute-performance tradeoffs in Portuguese NLP. We release all models on Huggingface and provide fairseq checkpoints to support further research and applications.

pdf bib abs
Quality Matters Measuring the Effect of Human-Annotated Translation Quality on English-Slovak Machine Translation
Matúš Kleštinec | Daša Munková

This study investigates the influence of human-annotated translation quality on the performance of machine translation (MT) models for a low-resource language pair—English to Slovak. We collected and categorized 287 student translations from a national competition, annotated by expert translators into three quality levels. Using the mT5-large model, we trained six neural MT models: three on the full dataset without validation splitting, and three using training/validation splits. The models were evaluated using a suite of automatic metrics (BLEU, METEOR, chrF, COMET, BLEURT, and TER), with TER serving as the validity criterion. Statistical analyses revealed that data quality had no significant effect when training without validation, but did have a significant impact under fine-tuning conditions (p < 0.05). Our results suggest that fine-tuning with combination with validation splitting increases the model’s sensitivity to the quality of training data. While the overall effect size is modest, the findings underscore the importance of high-quality, annotated corpora and modern training strategies for improving MT in low-resource languages.

pdf bib abs
Spatio-Temporal Mechanism in Multilingual Sentiment Analysis
Adarsh Singh Jadon | Vivek Tiwari | Chittaranjan Swain | Deepak Kumar Dewangan

This study investigated the effectiveness of various models in deep learning in performing sentiment analysis on code-mixed Hinglish text, a hybrid language is widely used in digital telecommunication. Hinglish presents unique challenges due to its informal nature, frequent code-switching, and complex linguistic structure. This research leverages datasets from the HinGE, SemEval-2020 Task 9 & E-Commerce Reviews, datasets, competition, and employ models such as RNN (LSTM), BERT-LSTM, CNN, and a proposed BiLSTM model with Data Augmentation.

pdf bib abs
Automatic Animacy Classification for Latvian Nouns
Ralfs Brutāns | Jelke Bloem

We introduce the first automatic animacy classifier for the Latvian language. Animacy, a linguistic feature indicating whether a noun refers to a living entity, plays an important role in Latvian grammatical structures and syntactic agreement, but remains unexplored in Latvian NLP. We adapt and extend existing methods to develop type-based animacy classifiers that distinguish between human and non-human nouns. Due to the limited utility of Latvian WordNet, the classifier’s training data was derived from the WordNets of Lithuanian, English, and Japanese. These lists were intersected and mapped to Latvian nouns from the Tēzaurs dictionary through automatic translation. The resulting dataset was used to train classifiers with fastText and LVBERT embeddings. Results show good performance from a MLP classifier using the last four layers of LVBERT, with Lithuanian data contributing more than English. This demonstrates a viable method for animacy classification in languages lacking robust lexical resources and shows potential for broader application in morphologically rich, under-resourced languages.

pdf bib abs
Bootstrapping a Sentence-Level Corpus Quality Classifier for Web Text using Active Learning
Maximilian Bley | Thomas Eckart | Christopher Schröder

The quality of training data is an essential factor for training large language models (LLMs) as it directly impacts their performance. While high-quality data is crucial for training competitive LLMs, existing preprocessing pipelines still partly rely on rules, which are computationally cheap but also inherently limited to simpler patterns. Model-based filtering on the other hand, is more flexible and can detect finer-grained patterns and semantics, but often requires substantial amounts of labeled data. While there are existing models for common problems (such as toxicity classification), this is often only the case for resource-rich languages and well-studied problems—leaving gaps in coverage for other languages, problems, or combinations thereof. In this work, we investigate the feasibility of model-based preprocessing despite the absence of labeled data. We use active learning to bootstrap a sentence-level multi-label classifier that detects textual problems of traditional text cleaning approaches. With only 498 examples, the final classifier reaches macro- and micro-F1 scores of 0.80 and 0.84, making it suitable for practical use. Moreover, we find that it captured subtle errors compared to a rule-based baseline. We publish the training code, a labeled corpus quality classification dataset, and the resulting classifier.

pdf bib abs
Fine-Grained Arabic Offensive Language Classification with Taxonomy, Sentiment, and Emotions
Natalia Vanetik | Marina Litvak | Chaya Liebeskind

Offensive language detection in Arabic is a challenging task because of the unique linguistic and cultural characteristics of the Arabic language. This study introduces a high-quality annotated dataset for classifying offensive language in Arabic, based on a structured taxonomy, categorizing offensive content across seven levels, capturing both explicit and implicit expressions. Utilizing this taxonomy, we re-annotate the FARAD-500 dataset, creating reFarad-500, which provides fine-grained labels for offensive texts in Arabic. A thorough dataset analysis reveals key patterns in offensive language distribution, emphasizing the importance of target type, offense severity, and linguistic structures. Additionally, we assess text classification techniques to evaluate the dataset’s effectiveness, exploring the impact of sentiment analysis and emotion detection on classification performance. Our findings highlight the complexity of Arabic offensive language and underscore the necessity of extensive annotation frameworks for accurate detection. This paper advances Arabic natural language processing (NLP) in resource-constrained settings by enhancing the recognition of hate speech and fostering a deeper understanding of the linguistic and emotional dimensions of offensive language.

pdf bib abs
Measuring Prosodic Richness in LLM-Generated Responses for Conversational Recommendation
Darshna Parmar | Pramit Mazumdar

This paper presents a novel framework for stylistic evaluation in conversational recommendation systems (CRS), focusing on the prosodic and expressive qualities of generated responses. While prior work has predominantly emphasized semantic relevance and recommendation accuracy, the stylistic fidelity of model outputs remains underexplored. We introduce the prosodic richness score (PRS), a composite metric that quantifies expressive variation through structural pauses, emphatic lexical usage, and rhythmic variability. Using PRS, we conduct both sentence-level and turn-level analyses across six contemporary large language models (LLMs) on two benchmark CRS datasets: ReDial, representing goal-oriented dialogue, and INSPIRED, which incorporates stylized social interaction. Empirical results reveal statistically significant differences (p < 0.01) in PRS between human and model-generated responses, highlighting the limitations of current LLMs in reproducing natural prosodic variation. Our findings advocate for broader evaluation of stylistic attributes in dialogue generation, offering a scalable approach to enhance expressive language modeling in CRS.

pdf bib abs
Assessing the Accuracy of AI-Generated Idiom Translations
Marijana Gasparovic | Marija Brala Vukanovic | Marija Brkic Bakaric

Idioms pose unique challenges for machine translation due to their metaphorical nature and cultural nuances. Consequently, they often present a translation problem even for humans. This longitudinal study evaluates the performance of ChatGPT in translating idiomatic expressions between English and Croatian, comparing results across two time points. The test set comprises 72 idioms in each translation direction, divided into three categories based on equivalence: complete, partial, and zero, with each category representing one-third of the set. The evaluation considers three layers: translation of the isolated idiom, translation of an online excerpt containing the idiom, and translation of a self-constructed example sentence. As expected, accuracy generally declined with decreasing equivalence. However, a follow-up study conducted six months later highlighted the need for continuous monitoring of machine translation tools.

pdf bib abs
From Pixels to Prompts: Evaluating ChatGPT-4o in Face Recognition, Age Estimation, and Gender Classification
Jashn Jain | Praveen Kumar Chandaliya | Dhruti P. Sharma

This study investigates the biometric capabilities of ChatGPT-4o, evaluating its performance on age estimation, gender classification, and identity verification across two challenging datasets: the ITWCC (images of children aged 6–17) and a pediatric surgery dataset. By leveraging tailored prompts that bypass safety filters, ChatGPT-4o outperformed conventional CNN-based models such as DeepFace, achieving higher accuracy and offering interpretable, rationale-rich outputs. Specifically, it delivered a mean absolute error of 1.8 years in age estimation, 96–100% gender classification accuracy, and over 85% identity continuity recognition, even across surgical transformations. The findings demonstrate the potential of multimodal LLMs to complement or exceed traditional approaches in face analysis tasks, though the study notes the importance of expanding demographic diversity, refining prompt strategies, and ensuring fairness and robustness in real-world settings.

DRISHTI is a novel RFID-vision integrated assistive medication-verification system that combines RFID contactless scanning, quantized AI-based vision processing, and adaptive audio feedback to provide comprehensive medication-safety assurance. The architecture integrates an MFRC522 RFID reader for rapid drug-container identification, a Raspberry Pi–mounted camera running a quantized Gemma3-4B vision model for prescription-document analysis, and a hierarchical validation engine employing confidence-weighted scoring across five critical safety dimensions. Operating entirely offline, the system processes compressed medication data through multi-criteria classification while preserving user privacy and eliminating cloud dependencies. In evaluations across 149 test scenarios, DRISHTI achieved 86.57% overall accuracy and 100% detection of safety-critical cases, including expired medications, dosage mismatches, and drug interactions. The system delivers sub-millisecond response times with real-time, urgency-differentiated audio feedback, offering a practical solution for enhancing independence and reducing healthcare risks for visually impaired individuals.

pdf bib abs
What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations
Katharina A. T. T. Trinley | Toshiki Nakai | Tatiana Anikina | Tanja Baeumel

Large language models (LLMs) excel at multilingual tasks, yet their internal language processing remains poorly understood. We analyze how Aya-23-8B, a decoder-only LLM trained on balanced multilingual data, handles code-mixed, cloze, and translation tasks compared to predominantly monolingual models like Llama 3 and Chinese-LLaMA-2. Using logit lens and neuron specialization analyses, we find: (1) Aya-23 activates typologically related language representations during translation, unlike English-centric models that rely on a single pivot language; (2) code-mixed neuron activation patterns vary with mixing rates and are shaped more by the base language than the mixed-in one; and (3) Aya-23’s language-specific neurons for code-mixed inputs concentrate in final layers, diverging from prior findings on decoder-only models. Neuron overlap analysis further shows that script similarity and typological relations impact processing across model types. These findings reveal how multilingual training shapes LLM internals and inform future cross-lingual transfer research. The code and dataset are publicly available.

Clinical federated learning faces critical challenges from statistical heterogeneity across healthcare institutions and privacy requirements for sensitive medical data. This work implements the foundational components of FedCliMask and proposes a comprehensive framework for privacy-preserving federated learning in clinical settings that combines ontology-guided semantic masking with context-aware federated aggregation. Our framework addresses the dual challenges of privacy preservation and statistical heterogeneity through two key innovations: (1) ontology-guided semantic masking using UMLS hierarchies to provide graduated privacy protection while preserving clinical semantics, and (2) context-aware federated aggregation that considers hospital-specific features including medical specialties, data complexity, privacy levels, and data volume. The semantic masking component is implemented and evaluated on synthetic clinical data, demonstrating effective privacy-utility tradeoffs across four masking levels. The context-aware analysis component is also implemented successfully profiling 12,996 synthetic clinical notes across 6 diverse hospitals to demonstrate meaningful hospital differentiation. The complete framework is designed to enable privacy-preserving clinical trial recruitment through federated learning while adapting to institutional heterogeneity.

pdf bib abs
A study on the language independent stemmer in the Indian language IR
Siba Sankar Sahu | Sukomal Pal

We explore and evaluate the effect of different language-independent stemmers in the information retrieval (IR) tasks with Indian languages such as Hindi, Gujarati, and English. The issue was examined from two points of view. Does a language-independent stemmer improve retrieval effectiveness in Indian languages IR? Which language-independent stemmer is the most suitable for different Indian languages? It is observed that stemming enhances retrieval efficiency in different Indian languages compared to the no stemming approaches. Among the different stemmers experimented with, the co-occurrence-based stemmer (SNS) performs the best and improves a mean average precision (MAP) score by 2.98% in Hindi, and 20.78% in Gujarati languages, respectively, whereas the graph-based stemmer (GRAS) performs the best and improves a MAP score by 5.83% in English.

pdf bib abs
Checklist Engineering Empowers Multilingual LLM Judges
Mohammad Ghiasvand Mohammadkhani | Hamid Beigy

Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators—a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, and efficiency. In this paper, we propose Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free framework that uses checklist intuition for multilingual evaluation with an open-source model. Experiments across multiple languages and three benchmark datasets, under both pointwise and pairwise settings, show that our method generally surpasses the baselines and performs on par with the GPT-4o model.

pdf bib abs
C A N C E R: Corpus for Accurate Non-English Cancer-related Educational Resources
Anika Harju | Asma Shakeel | Tiantian He | Tianqi Xu | Aaro Harju

Improving the quality of cancer terminology through Machine Translation (MT) in non-English languages remains an under-researched area despite its critical role in supporting self-management and advancing multilingual patient education. Existing computational tools encounter significant limitations in accurately translating cancer terminologies, particularly for low-resource languages, primarily due to data scarcity and morphological complexity. To address the gap, we introduce a dedicated terminology resource — Corpus for Accurate Non-English Cancer-related Educational Resources (C A N C E R), a manually annotated dataset in Finnish (FI), Chinese (ZH), and Urdu (UR), curated from publicly available existing English (EN) data. We also examine the impact of data quality versus quantity and compare the performance of the Opus-mt-en-fi, Opus-mt-en-zh, and Opus-mt-en-ur models with the SMaLL-100 multilingual MT model. We assess translation quality using automatic and human evaluation. Results demonstrated that high-quality parallel data, though sparse, combined with fine-tuning, substantially improved the translation of cancer terminology across both high and low-resource language pairs, positioning the C A N C E R corpus as a foundational resource for improving multilingual patient education.

pdf (full)
bib (full) Proceedings of the First on Natural Language Processing and Language Models for Digital Humanities

pdf bib
Proceedings of the First on Natural Language Processing and Language Models for Digital Humanities
Isuri Nanomi Arachchige | Francesca Frontini | Ruslan Mitkov | Paul Rayson

pdf bib abs
HamRaz: A Culture-Based Persian Conversation Dataset for Person-Centered Therapy Using LLM Agents
Mohammad Amin Abbasi | Farnaz Sadat Mirnezami | Ali Neshati | Hassan Naderi

We present HamRaz, a culturally adapted Persian-language dataset for AI-assisted mental health support, grounded in Person-Centered Therapy (PCT). To reflect real-world therapeutic challenges, we combine script-based dialogue with adaptive large language models (LLM) role-playing, capturing the ambiguity and emotional nuance of Persian-speaking clients. We introduce HamRazEval, a dual-framework for assessing conversational and therapeutic quality using General Metrics and specialized psychological relationship measures. Human evaluations show HamRaz outperforms existing baselines in empathy, coherence, and real-ism. This resource contributes to the Digital Humanities by bridging language, culture, and mental health in underrepresented communities.

pdf bib abs
Simulating Complex Immediate Textual Variation with Large Language Models
Fernando Aguilar-Canto | Alberto Espinosa-Juarez | Hiram Calvo

Immediate Textual Variation (ITV) is defined as the process of introducing changes during text transmission from one node to another. One-step variation can be useful for testing specific philological hypotheses. In this paper, we propose using Large Language Models (LLMs) as text-modifying agents. We analyze three scenarios: (1) simple variations (omissions), (2) paraphrasing, and (3) paraphrasing with bias injection (polarity). We generate simulated news items using a predefined scheme. We hypothesize that central tendency measures—such as the mean and median vectors in the feature space of sentence transformers—can effectively approximate the original text representation. Our findings indicate that the median vector is a more accurate estimator of the original vector than most alternatives. However, in cases involving substantial rephrasing, the agent that produces the least semantic drift provides the best estimation, aligning with the principles of Bédierian textual criticism.

pdf bib abs
Versus: an automatic text comparison tool for the digital humanities
Motasem Alrahabi | Tom Wainstain

Digital humanities (DH) have been exploring large-scale textual reuse for several decades: quotation, allusion, paraphrase, translation, rephrasing. Automatic comparison, made possible by the increasing digitization of corpora, opens new perspectives in philology and intertextual studies. This article presents a state of the art of existing methods (formal, vector-based, statistical, graph-based) and introduces an open-source tool, Versus, which combines multigranular vector alignment, interactive visualization, and critical traceability. This framework aims to provide a reproducible and accessible solution for DH researchers, with support for text comparison in multiple languages.

pdf bib abs
Like a Human? A Linguistic Analysis of Human-written and Machine-generated Scientific Texts
Sergei Bagdasarov | Diego Alves

The purpose of this study is to analyze lexical and syntactic features in human-written texts and machine-generated texts produced by three state-of-the-art large language models: GPT-4o, Llama 3.1 and Qwen 2.5. We use Kullback-Leibler divergence to quantify the dissimilarity between humans and LLMs as well as to identify relevant features for comparison. We test the predictive power of our features using binary and multi-label random forest classifiers. The classifiers achieve robust performance of above 80% for multi-label classification and above 90% for binary classification. Our results point to substantial differences between human- and machine-generated texts. Human writers show higher variability in the use of syntactic resources, while LLMs score higher in lexical variability.

pdf bib abs
A State-of-the-Art Morphosyntactic Parser and Lemmatizer for Ancient Greek
Giuseppe G. A. Celano

This paper presents an experiment comparing six models to identify state-of-the-art models for Ancient Greek: a morphosyntactic parser and a lemmatizer that are capable of annotating in accordance with the Ancient Greek Dependency Treebank annotation scheme. A normalized version of the major collections of annotated texts was used to (i) train the baseline model Dithrax with randomly initialized character embeddings and (ii) fine-tune Trankit and four recent models pretrained on Ancient Greek texts, namely GreBERTa and PhilBERTa for morphosyntactic annotation and GreTA and PhilTa for lemmatization. A Bayesian analysis shows that Dithrax and Trankit are practically equivalent in morphological annotation, while syntax is best annotated by Trankit and lemmata by GreTa. The results of the experiment suggest that token embeddings are not sufficient to achieve high UAS and LAS scores unless they are coupled with a modeling strategy specifically designed to capture syntactic relationships. The dataset and best-performing models are made available online for reuse

pdf bib abs
It takes a village to grammaticalize
Joseph E. Larson | Patricia Amaral

This paper investigates the grammaticalization of the noun caleta ‘cove, village’ to an inten- sifier, as part of the system of degree words in Chilean Spanish. We use word embeddings trained on a corpus of tweets to show the on- going syntactic and semantic change of caleta, while also revealing how high degree is ex- pressed in colloquial Chilean Spanish.

pdf bib abs
Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities
Maria A. Levchenko

Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR, addressing contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs, finding that Gemini and Qwen models outperform traditional OCR while exhibiting “over-historicization”—inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.

pdf bib abs
Finding the Plea: Evaluating the Ability of LLMs to Identify Rhetorical Structure in Swedish and English Historical Petitions
Ellinor Lindqvist | Eva Pettersson | Joakim Nivre

Large language models (LLMs) have shown impressive capabilities across many NLP tasks, but their effectiveness on fine-grained content annotation, especially for historical texts, remains underexplored. This study investigates how well GPT-4, Gemini, Mixtral, Mistral, and LLaMA can identify rhetorical sections (Salutatio, Petitio, and Conclusio) in 100 English and 100 Swedish petitions using few-shot prompting with varying levels of detail. Most models perform very well, achieving F1 scores in the high 90s for Salutatio, though Petitio and Conclusio prove more challenging, particularly for smaller models and Swedish data. Cross-lingual prompting yields mixed results, and models generally underestimate document difficulty. These findings demonstrate the strong potential of LLMs for assisting with nuanced historical annotation while highlighting areas for further investigation.

pdf bib abs
Leveraging RAG for a Low-Resource Audio-Aware Diachronic Analysis of Gendered Toy Marketing
Luca Marinelli | Iacopo Ghinassi | Charalampos Saitis

We performed a diachronic analysis of sound and language in toy commercials, leveraging retrieval-augmented generation (RAG) and open-weight language models in low-resource settings. A pool of 2508 UK toy advertisements spanning 14 years was semi-automatically annotated, integrating thematic coding of transcripts with audio annotation. With our RAG pipeline, we thematically coded and classified commercials by gender-target audience (feminine, masculine, or mixed) achieving substantial inter-coder reliability. In parallel, a music-focused multitask model was applied to annotate affective and mid-level musical perceptual attributes, enabling multimodal discourse analysis. Our findings reveal significant diachronic shifts and enduring patterns. Soundtracks classified as energizing registered an overall increase across distinct themes and audiences, but such increase was steeper for masculine-adjacent commercials. Moreover, themes stereotypically associated with masculinity paired more frequently with louder, distorted, and aggressive music, while stereotypically feminine themes with softer, calmer, and more harmonious soundtracks. Code and data to reproduce the results are available on github.com/marinelliluca/low-resource-RAG.

pdf bib abs
Quantifying Societal Stress: Forecasting Historical London Mortality using Hardship Sentiment and Crime Data with Natural Language Processing and Time-Series
Sebastian Olsen | Jelke Bloem

We study links between societal stress - quantified from 18th–19th century Old Bailey trial records - and weekly mortality in historical London. Using MacBERTh-based hardship sentiment and time-series analyses (CCF, VAR/IRF, and a Temporal Fusion Transformer, TFT), we find robust lead–lag associations. Hardship sentiment shows its strongest predictive contribution at a 5–6 week lead for mortality in the TFT, while mortality increases precede higher conviction rates in the courts. Results align with Epidemic Psychology and suggest that text-derived stress markers can improve forecasting of public-health relevant mortality fluctuations.

pdf bib abs
Exploring Language in Different Daily Time Segments Through Text Prediction and Language Modeling
Kennedy Roland | Milton King

Temporal-aware language models have proved to be effective over longer time periods as language and its use changes, but little research has looked at how language use can change at different times of the day. We hypothesize that a person’s usage of language varies at different times of day. We explore this concept by evaluating if models for language modeling and next word prediction improve their performance when considering the time of day. Specifically, we explore personalized temporal-aware models for next-word prediction and language modeling and compare them against baseline models, including non-temporal-aware personalized models. Specifically, our proposed model considers which of the 8, 3-hr daily time segments that a text snippet was written during for a given author. We found that our temporal-aware models tend to outperform temporal-agnostic models with respect to accuracy and perplexity.

pdf bib abs
Identifying Severity of Depression in Forum Posts using Zero-Shot Classifier and DistilBERT Model
Zafar Sarif | Sannidhya Das | Dr. Abhishek Das | Md Fahin Parvej | Dipankar Das

This paper presents our approach to the RANLP 2025 Shared Task on “Identification of the Severity of Depression in Forum Posts.” The objective of the task is to classify user-generated posts into one of four severity levels of depression: subthreshold, mild, moderate, or severe. A key challenge in the task was the absence of annotated training data. To address this, we employed a two-stage pipeline: first, we used zero-shot classification with facebook/bart-large-mnli to generate pseudo-labels for the unlabeled training set. Next, we fine-tuned a DistilBERT model on the pseudo-labeled data for multi-class classification. Our system achieved an internal accuracy of 0.92 on the pseudo-labeled test set and an accuracy of 0.289 on the official blind evaluation set. These results demonstrate the feasibility of leveraging zero-shot learning and weak supervision for mental health classification tasks, even in the absence of gold-standard annotations.

pdf bib abs
Recall Them All: Long List Generation from Long Novels
Sneha Singhania | Simon Razniewski | Gerhard Weikum

Language models can generate lists of salient literary characters for specific relations but struggle with long, complete lists spanning entire novels. This paper studies the non-standard setting of extracting complete entity lists from full-length books, such as identifying all 50+ friends of Harry Potter across the 7-volume book series. We construct a benchmark dataset with meticulously compiled ground-truth, posing it as a challenge for the research community. We present a first-cut method to tackle this task, based on RAG with LLMs. Our method introduces the novel contribution of harnessing IR-style pseudo-relevance feedback for effective passage retrieval from literary texts. Experimental results show that our approach clearly outperforms both LLM-only and standard RAG baselines, achieving higher recall while maintaining acceptable precision.

pdf bib abs
Exploring the Limits of Prompting LLMs with Speaker-Specific Rhetorical Fingerprints
Wassiliki Siskou | Annette Hautli-Janisz

The capabilities of Large Language Models (LLMs) to mimic written content are being tested on a wide range of tasks and settings, from persuasive essays to programming code. However, the question to what extent they are capable of mimicking human conversational monologue is less well-researched. In this study, we explore the limits of popular LLMs in impersonating content in a high-stakes legal setting, namely for the generation of the decision statement in parole suitability hearings: We distill a linguistically well-motivated rhetorical fingerprint from individual presiding commissioners, based on patterns observed in verbatim transcripts and then enhance the model prompts with those characteristics. When comparing this enhanced prompt with an underspecified prompt we show that LLMs can approximate certain rhetorical features when prompted accordingly, but are not able to fully replicate the linguistic profile of the original speakers as their own fingerprint dominates.

pdf bib abs
Annotating Personal Information in Swedish Texts with SPARV
Maria Irena Szawerna | David Alfter | Elena Volodina

Digital Humanities (DH) research, among many others, relies on data, a subset of which comes in the form of language data that contains personal information (PI). Working with and sharing such data has ethical and legal implications. The process of removing (anonymization) or replacing (pseudonymization) of personal information in texts may be used to address these issues, and often begins with a PI detection and labeling stage. We present a new tool for personal information detection and labeling for Swedish, SBX-PI-DETECTION (henceforth SBX-PI), alongside a visualization interface, (IM)PERSONAL DATA, which allows for the comparison of outputs from different tools. A valuable feature of SBX-PI is that it enables the users to run the annotation locally. It is also integrated into the text annotation pipeline SPARV, allowing for other types of annotation to be performed simultaneously and contributing to the privacy by design requirement set by the GDPR. A novel feature of (IM)PERSONAL DATA is that it allows researchers to assess the extent of detected PI in a text and how much of it will be manipulated once anonymization or pseudonymization are applied. The tools are primarily aimed at researchers within Digital Humanities and Natural Language Processing and are linked to CLARIN’s Virtual Language Observatory.

pdf bib abs
Can LLMs Help Sun Wukong in his Journey to the West? A Case Study of Language Models in Video Game Localization
Xiaojing Zhao | Han Xu | Huacheng Song | Emmanuele Chersoni | Chu-Ren Huang

Large language models (LLMs) have demonstrated increasing proficiency in general-purpose translation, yet their effectiveness in creative domains such as game localization remains underexplored. This study focuses on the role of LLMs in game localization from both linguistic quality and sociocultural adequacy through a case study of the video game Black Myth: Wukong. Results indicate that LLMs demonstrate adequate competence in accuracy and fluency, achieving performance comparable to human translators. However, limitations remain in the literal translation of culture-specific terms and offensive language. Human oversight is required to ensure nuanced cultural authenticity and sensitivity. Insights from human evaluations also suggest that current automatic metrics and the Multidimensional Quality Metrics framework may be inadequate for evaluating creative translation. Finally, varying human preferences in localization pose a learning ambiguity for LLMs to perform optimal translation strategies. The findings highlight the potential and shortcomings of LLMs to serve as collaborative tools in game localization workflows. Data are available at https://github.com/zcocozz/wukong-localization.

pdf (full)
bib (full) Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages

pdf bib abs
Bridging the Gap: Leveraging Cherokee to Improve Language Identification for Endangered Iroquoian Languages
Liam Enzo Eggleston | Michael P. Cacioli | Jatin Sarabu | Ivory Yang | Kevin Zhu

Language identification is a foundational task in natural language processing (NLP), yet many Indigenous languages remain entirely unsupported by commercial language identification systems. In this study, we assess the performance of Google LangID on a 5k Cherokee dataset and find that every sentence is classified as “undetermined”, indicating a complete failure to even misidentify Cherokee as another language. To further explore this issue, we manually constructed the first digitalized Northern Iroquoian dataset, consisting of 120 sentences across five related languages: Onondaga, Cayuga, Mohawk, Seneca, and Oneida. Running these sentences through Google LangID, we examine patterns in its incorrect predictions. To address these limitations, we train a random forest classifier to successfully distinguish between these languages, demonstrating its effectiveness in language identification. Our findings underscore the inadequacies of existing commercial language identification models for Indigenous languages and highlight concrete steps toward improving automated recognition of low-resource languages.

pdf bib abs
Building a Lightweight Classifier to Distinguish Closely Related Language Varieties with Limited Supervision: The Case of Catalan vs Valencian
Raúl García-Cerdá | María Miró Maestre | Miquel Canal

Dialectal variation among closely related languages poses a major challenge in low-resource NLP, as their linguistic similarity increases confusability for automatic systems. We introduce the first supervised classifier to distinguish standard Catalan from its regional variety Valencian. Our lightweight approach fine-tunes a RoBERTa-base model on a manually curated corpus of 20 000 sentences—without any Valencian-specific tools—and achieves 98 % accuracy on unseen test data. In a human evaluation of 90 mixed-variety items per reviewer, acceptance rates reached 96.7 % for Valencian and 97.7 % for Catalan (97.2 % overall). We discuss limitations with out-of-distribution inputs and outline future work on confidence calibration and dialect-aware tokenization. Our findings demonstrate that high-impact dialect classification is feasible with minimal resources.

pdf bib abs
A thresholding method for Improving translation Quality for Indic MT task
Sudhansu Bala Das | Leo Raphael Rodrigues | Tapas Kumar Mishra | Bidyut Ku Patra

The conversion of content from one language to another using a computer system is known as Machine Translation (MT). Various techniques have been used to ensure effective translations that retain the contextual and lexical interpretation of the source and target languages. One of these methods is end-to-end Neural Machine Translation (NMT), which is frequently utilized in real-world machine translation systems. NMT requires large parallel datasets for effective translation. These datasets are essential for an MT system to acquire during the training phase to learn the linguistic patterns and structures of both languages. One such dataset is Samanantar, the largest publicly accessible parallel dataset for Indian languages (ILs). Since these datasets have been gathered from various sources, they contain many incorrect or dissimilar translations. Hence, the MT systems built using this dataset cannot perform to their usual potential. This paper proposes an algorithm to remove dissimilar translations from the training dataset and evaluate the model’s efficiency. Two Indic languages (ILs), Hindi (HIN) and Odia (ODI), were chosen for the experiment. A baseline NMT system is built for these languages, and the effect of different dataset sizes is investigated. The quality of the translations in the experiment is evaluated using standard metrics. The results have shown that removing the dissimilar translations from the training dataset improves the quality of the language. It is also noticed that, despite the fact that the ILs-English and English-ILs systems are trained using the same dataset, ILs-English works more effectively across all the evaluation metrics.

pdf bib abs
A Multi-Task Learning Approach to Dialectal Arabic Identification and Translation to Modern Standard Arabic
Abdullah Khered | Youcef Benkhedda | Riza Batista-Navarro

Translating Dialectal Arabic (DA) into Modern Standard Arabic (MSA) is a complex task due to the linguistic diversity and informal nature of dialects, particularly in social media texts. To improve translation quality, we propose a Multi-Task Learning (MTL) framework that combines DA-MSA translation as the primary task and dialect identification as an auxiliary task. Additionally, we introduce LahjaTube, a new corpus containing DA transcripts and corresponding MSA and English translations, covering four major Arabic dialects: Egyptian (EGY), Gulf (GLF), Levantine (LEV), and Maghrebi (MGR), collected from YouTube. We evaluate AraT5 and AraBART on the Dial2MSA-Verified dataset under Single-Task Learning (STL) and MTL setups. Our results show that adopting the MTL framework and incorporating LahjaTube into the training data improve the translation performance, leading to a BLEU score improvement of 2.65 points over baseline models.

pdf bib abs
Low-Resource Machine Translation for Moroccan Arabic
Alexei Rosca | Abderrahmane Issam | Gerasimos Spanakis

Neural Machine Translation (NMT) has achieved significant progress especially for languages with large amounts of data (referred to as high resource languages). However, most of the world languages lack sufficient data and are thus considered as low resource or endangered. Previous research explored various techniques for improving NMT performance on low resource languages, with no guarantees that they will perform similarly on other languages. In this work, we explore various low resource NMT techniques for improving performance on Moroccan Arabic (Darija), a dialect of Arabic that is considered a low resource language. We experiment with three techniques that are prominent in low resource Natural Language Processing (NLP), namely: back-translation, paraphrasing and transfer learning. Our results indicate that transfer learning, especially in combination with back-translation is effective at improving translation performance on Moroccan Arabic, achieving a BLEU score of 26.79 on Darija to English and 9.98 on English to Darija.

pdf bib abs
Efficient Architectures For Low-Resource Machine Translation
Edoardo Signoroni | Pavel Rychly | Ruggero Signoroni

Low-resource Neural Machine Translation is highly sensitive to hyperparameters and needs careful tuning to achieve the best results with small amounts of training data. We focus on exploring the impact of changes in the Transformer architecture on downstream translation quality, and propose a metric to score the computational efficiency of such changes. By experimenting on English-Akkadian, German-Lower Sorbian, English-Italian, and English-Manipuri, we confirm previous finding in low-resource machine translation optimization, and show that smaller and more parameter-efficient models can achieve the same translation quality of larger and unwieldy ones at a fraction of the computational cost. Optimized models have around 95% less parameters, while dropping only up to 14.8% ChrF. We compile a list of optimal ranges for each hyperparameter.

pdf bib abs
IfGPT: A Dataset in Bulgarian for Large Language Models
Svetla Peneva Koeva | Ivelina Stoyanova | Jordan Konstantinov Kralev

The paper presents the large dataset IfGPT, which contains available corpora and datasets for Bulgarian, and describes methods to continuously expand it with unduplicated and unbiased Bulgarian data. The samples in the dataset are annotated with metadata that enable effective extraction of domain- and application-oriented datasets for fine-tuning or Retrieval Augmented Generation (RAG) of large language models (LLMs). The paper focuses on the description of the extended metadata of the IfGPT dataset and its management in a graph database.

We present a modular training approach for deep text classification in Guarani, where networks are split into sectors trained independently and later combined. This sector-wise backpropagation improves stability, reduces training time, and adapts to standard architectures like CNNs, LSTMs, and Transformers. Evaluated on three Guarani datasets—emotion, humor, and offensive language—our method outperforms traditional Bayesian-optimized training in both accuracy and efficiency.

pdf bib abs
Roman Urdu as a Low-Resource Language: Building the First IR Dataset and Baseline
Muhammad Umer Tariq Butt | Stalin Varanasi | Guenter Neumann

The field of Information Retrieval (IR) increasingly recognizes the importance of inclusivity, yet addressing the needs of low-resource languages, especially those with informal variants, remains a significant challenge. This paper addresses a critical gap in effective IR systems for Roman Urdu, a romanized version of Urdu i.e a language with millions of speakers, widely used in digital communication yet severely underrepresented in research and tooling. Roman Urdu presents unique complexities due to its informality, lack of standardized spelling conventions, and frequent code-switching with English. Crucially, prior to this work, there was a complete absence of any Roman Urdu IR dataset or dedicated retrieval work. To address this critical gap, we present the first-ever large-scale IR MS-marco translated dataset specifically for Roman Urdu, created through a multi-hop pipeline involving English-to-Urdu translation followed by Urdu-to-Roman Urdu transliteration. Using this novel dataset, we train and evaluate a multilingual retrieval model, achieving substantial improvements over traditional lexical retrieval baselines (MRR@10: 0.19 vs. 0.08; Recall@10: 0.332 vs. 0.169). This work lays foundational benchmarks and methodologies for Roman Urdu IR especially using the transformer based models, significantly contributing to inclusive information access and setting the stage for future research in informal, Romanized, and low-resource languages.

pdf bib abs
The Brittle Compass: Navigating LLM Prompt Sensitivity in Slovak Migration Media Discourse
Jaroslav Kopčan | Samuel Harvan | Marek Suppa

In this work, we present a case study that explores various tasks centered around the topic of migration in Slovak, a low-resource language, such as topic relevance and geographical relevance classification, and migration source/destination location term extraction. Our results demonstrate that native (Slovak)prompts yield a modest, task-dependent gain, while large models show significant robustness to prompt variations compared to their smaller counterparts. Analysis reveals that instructions(system or task) emerge as the most critical prompt component, more so than the examples sections, with task-specific performance benefits being more pronounced than overall language effects.

pdf bib abs
Explicit Edge Length Coding to Improve Long Sentence Parsing Performance
Khensa Daoudi | Mathieu Dehouck | Rayan Ziane | Natasha Romanova

Performance of syntactic parsers is reduced for longer sentences. While some of this reduction can be explained by the tendency of longer sentences to be more syntactically complex as well as the increase of candidate governor number, some of it is due to longer sentences being more challenging to encode. This is especially relevant for low-resource scenarios such as parsing of written sources in historical languages (e.g. medieval and early-modern European languages), in particular legal texts, where sentences can be very long whereas the amount of training material remains limited. In this paper, we present a new method for explicitly using the arc length information in order to bias the scores produced by a graph-based parser. With a series of experiments on Norman and Gascon data, in which we divide the test data according to sentence length, we show that indeed explicit length coding is beneficial to retain parsing performance for longer sentences.

pdf bib abs
Evaluating LLM Capabilities in Low-Resource Contexts: A Case Study of Persian Linguistic and Cultural Tasks
Jasmin Heierli | Rebecca Bahar Ganjineh | Elena Gavagnin

We evaluate four representative large language models, namely GPT-4o, Gemini, Llama, and DeepSeek on on a suite of linguistic and cultural tasks in Persian, covering grammar, paraphrasing, inference, translation, factual recall, analogical reasoning, and a Hofstede-based cultural probe under direct and role-based prompts. Our findings reveal consistent performance declines, alongside systematic misalignment with Iranian cultural norms. Role-based prompting yields modest improvements but does not fully restore cultural fidelity. We conclude that advancing truly multilingual models demands richer Persian resources, targeted adaptation, and evaluation frameworks that jointly assess fluency and cultural alignment.

pdf bib abs
A Benchmark for Evaluating Logical Reasoning in Georgian For Large Language Models
Irakli Koberidze | Archil Elizbarashvili | Magda Tsintsadze

Advancements in LLMs have largely overlooked low-resource languages (LRLs), creating a gap in evaluation benchmarks. To address this for Georgian, a Kartvelian language, we introduce GeoLogicQA. This novel, manually-curated benchmark assesses LLMs’ logical and inferential reasoning through 100 questions. Questions cover syllogistic deduction, inferential reading comprehension, common-sense reasoning, and arithmetic, adapted from challenging sources (Kangaroo Mathematics Competition) and validated by native Georgian speakers for linguistic nuances. Initial evaluations of state-of-the-art LLMs (Gemini 2.5 Flash, DeepSeek-V3, Grok-3, GPT-4o) show an average accuracy of 64% to 83%, significantly exceeding the human baseline of 47%. While demonstrating strong reasoning potential, error analysis reveals persistent challenges in multi-step combinatorial and highly constrained inferential tasks. GeoLogicQA is a public resource for tracking progress and diagnosing weaknesses in Georgian LLMs. We plan to expand the benchmark and establish a public leader-board to foster continuous improvement.

pdf bib abs
Slur and Emoji Aware Models for Hate and Sentiment Detection in Roman Urdu Transgender Discourse
Muhammad Owais Raza | Aqsa Umar | Mehrub Awan

The rise of social media has amplified both the visibility and vulnerability of marginalized communities, particularly the transgender population in South Asia. While hate speech detection has seen considerable progress in high resource languages like English, under-resourced and code mixed languages such as Roman Urdu remains significantly understudied. This paper presents a novel Roman Urdu dataset derived from Instagram comments on transgender related content, capturing the intricacies of multilingual, code-mixed, and emoji-laden social discourse. We introduce a transphobic slur lexicon specific to Roman Urdu and a semantic emoji taxonomy grounded in contextual usage. These resources are utilized to perform fine-grained classification of sentiment and hate speech using both traditional machine learning models and transformer-based architectures. The findings show that our custom-trained BERT-based models, Senti-RU-Bert and Hate-RU-Bert, best performance, with F1 scores of 80.39% for sentiment classification and 77.34% for hate speech classification. Ablation studies reveal consistent performance gains when slur and emoji features are included.

False information poses a significant global challenge, and manually verifying claims is a time-consuming and resource-intensive process. In this research paper, we experiment with different approaches to investigate the effectiveness of large language models (LLMs) in classifying factual claims by their veracity and generating justifications in English and Telugu. The key contributions of this work include the creation of a bilingual English-Telugu dataset and the benchmarking of different veracity classification approaches based on LLMs.

pdf bib abs
Synthetic Voice Data for Automatic Speech Recognition in African Languages
Brian DeRenzi | Anna Dixon | Mohamed Aymane Farhi | Christian Resch

Speech technology remains out of reach for most of the 2,300+ languages in Africa. We present the first systematic assessment of large-scale synthetic voice corpora for African ASR. We apply a three-step process: LLM-driven text creation, TTS voice synthesis, and ASR fine-tuning. Eight out of ten languages for which we create synthetic text achieved readability scores above 5 out of 7. We evaluated ASR improvement for three (Hausa, Dholuo, Chichewa) and created more than 2,500 hours of synthetic voice data at below 1% of the cost of real data. W2v-BERT 2.0 speech encoder fine-tuned on 250h real and 250h synthetic data in Hausa matched a 500h real-data-only baseline, while 579h real and 450h to 993h synthetic data created the best performance. We also present gender-disaggregated ASR performance evaluation. For very low-resource languages, gains varied: Chichewa WER improved by ~6.5% with a 1:2 real-to-synthetic ratio; a 1:1 ratio for Dholuo showed similar improvements on some evaluation data, but not on others. Inves- tigating intercoder reliability, ASR errors and evaluation datasets revealed the need for more robust reviewer protocols and more accurate evaluation data. All data and models are publicly released to invite further work to improve synthetic data for African languages.

pdf bib abs
ADOR: Dataset for Arabic Dialects in Hotel Reviews: A Human Benchmark for Sentiment Analysis
Maram I. Alharbi | Saad Ezzini | Hansi Hettiarachchi | Tharindu Ranasinghe | Ruslan Mitkov

Arabic machine translation remains a fundamentally challenging task, primarily due to the lack of comprehensive annotated resources. This study evaluates the performance of Meta’s NLLB-200 model in translating Modern Standard Arabic (MSA) into three regional dialects: Saudi, Maghribi, and Egyptian Arabic using a manually curated dataset of hotel reviews. We applied a multi-criteria human annotation framework to assess translation correctness, dialect accuracy, and sentiment and aspect preservation. Our analysis reveals significant variation in translation quality across dialects. While sentiment and aspect preservation were generally high, dialect accuracy and overall translation fidelity were inconsistent. For Saudi Arabic, over 95% of translations required human correction, highlighting systemic issues. Maghribi outputs demonstrated better dialectal authenticity, while Egyptian translations achieved the highest reliability with the lowest correction rate and fewest multi-criteria failures. These results underscore the limitations of current multilingual models in handling informal Arabic varieties and highlight the importance of dialect-sensitive evaluation.

pdf bib abs
Towards Creating a Bulgarian Readability Index
Dimitar Kazakov | Stefan Minkov | Ruslana Margova | Irina Temnikova | Ivo Emauilov

Readability assessment plays a crucial role in education and text accessibility. While numerous indices exist for English and have been extended to Romance and Slavic languages, Bulgarian remains under- served in this regard. This paper reviews established readability metrics across these language families, examining their underlying features and modelling methods. We then report the first attempt to develop a readability index for Bulgarian, using end-of-school-year assessment questions and literary works targeted at children of various ages. Key linguistic attributes, namely, word length, sentence length, syllable count, and information content (based on word frequency), were extracted, and their first two statistical moments, mean and variance, were modelled against grade levels using linear and polynomial regression. Results suggest that polynomial models outperform linear ones by capturing non-linear relationships between textual features and perceived difficulty, but may be harder to interpret. This work provides an initial framework for building a reliable readability measure for Bulgarian, with applications in educational text design, adaptive learning, and corpus annotation.

pdf (full)
bib (full) Proceedings of Interdisciplinary Workshop on Observations of Misunderstood, Misguided and Malicious Use of Language Models

pdf bib
Proceedings of Interdisciplinary Workshop on Observations of Misunderstood, Misguided and Malicious Use of Language Models
Piotr Przybyła | Matthew Shardlow | Clara Colombatto | Nanna Inie

pdf bib abs
Bias in, Bias out: Annotation Bias in Multilingual Large Language Models
Xia Cui | Ziyi Huang | Naeemeh Adel

Annotation bias in NLP datasets remains a major challenge for developing multilingual Large Language Models (LLMs), particularly in culturally diverse settings. Bias from task framing, annotator subjectivity, and cultural mismatches can distort model outputs and exacerbate social harms. We propose a comprehensive framework for understanding annotation bias, distinguishing among instruction bias, annotator bias, and contextual and cultural bias. We review detection methods (including inter-annotator agreement, model disagreement, and metadata analysis) and highlight emerging techniques such as multilingual model divergence and cultural inference. We further outline proactive and reactive mitigation strategies, including diverse annotator recruitment, iterative guideline refinement, and post-hoc model adjustments. Our contributions include: (1) a typology of annotation bias; (2) a synthesis of detection metrics; (3) an ensemble-based bias mitigation approach adapted for multilingual settings, and (4) an ethical analysis of annotation processes. Together, these insights aim to inform more equitable and culturally grounded annotation pipelines for LLMs.

Vision-Language Models (VLMs) achieve impressive multimodal performance but often inherit gender biases from their training data. This bias might be coming from both the vision and text modalities. In this work, we dissect the contributions of vision and text backbones to these biases by applying targeted debiasing—Counterfactual Data Augmentation (CDA) and Task Vector methods. Inspired by data-efficient approaches in hate speech classification, we introduce a novel metric, Degree of Stereotypicality (DoS), and a corresponding debiasing method, Data Augmentation Using DoS (DAUDoS), to reduce bias with minimal computational cost. We curate a gender-annotated dataset and evaluate all methods on the VisoGender benchmark to quantify improvements and identify the dominant source of bias. Our results show that CDA reduces the gender gap by 6% and DAUDoS by 3% but using only one‐third the data. Both methods also improve the model’s ability to correctly identify gender in images by 3%, with DAUDoS achieving this improvement using only almost one-third of training data. From our experiments, we observed that CLIP’s vision encoder is more biased whereas PaliGemma2’s text encoder is more biased. By identifying whether the bias stems more from the vision or text encoders, our work enables more targeted and effective bias mitigation strategies in future multi-modal systems.

pdf bib abs
AnthroSet: a Challenge Dataset for Anthropomorphic Language Detection
Dorielle Lonke | Jelke Bloem | Pia Sommerauer

This paper addresses the challenge of detecting anthropomorphic language in AI research. We introduce AnthroSet, a novel dataset of 600 manually annotated utterances covering various linguistic structures. Through the evaluation of two current approaches for anthropomorphism and atypical animacy detection, we highlight the limitations of a masked language model approach, arising from masking constraints as well as increasingly anthropomorphizing AI-related terminology. Our findings underscore the need for more targeted methods and a robust definition of anthropomorphism.

pdf bib abs
FLARE: An Error Analysis Framework for Diagnosing LLM Classification Failures
Keerthana Madhavan | Luiza Antonie | Stacey Scott

When Large Language Models return “Inconclusive” in classification tasks, practitioners are left without insight into what went wrong. This diagnostic gap can delay medical decisions, undermine content moderation, and mislead downstream systems. We present FLARE (Failure Location and Reasoning Evaluation), a framework that transforms opaque failures into seven actionable categories. Applied to 5,400 election-misinformation classifications, FLARE reveals a surprising result: Few-Shot prompting—widely considered a best practice—produced 38× more failures than Zero-Shot, with 70.8% due to simple parsing issues. By exposing hidden failure modes, FLARE addresses critical misunderstandings in LLM deployment with implications across domains.

pdf bib abs
BuST: A Siamese Transformer Model for AI Text Detection in Bulgarian
Andrii Maslo | Silvia Gargova

We introduce BuST (Bulgarian Siamese Transformer), a novel method for detecting machine-generated Bulgarian text using paraphrase-based semantic similarity. Inspired by the RAIDAR approach, BuST employs a Siamese Transformer architecture to compare input texts with their LLM-generated paraphrases, identifying subtle linguistic patterns that indicate synthetic origin. In pilot experiments, BuST achieved 88.79% accuracy and an F1-score of 88.0%, performing competitively with strong baselines. While BERT reached higher raw scores, BuST offers a model-agnostic and adaptable framework for low-resource settings, demonstrating the promise of paraphrase-driven detection strategies.

pdf bib abs
F*ck Around and Find Out: Quasi-Malicious Interactions with LLMs as a Site of Situated Learning
Sarah ONeill

This work-in-progress paper proposes a cross-disciplinary perspective on “malicious” interactions with large language models (LLMs), reframing it from only a threat to be mitigated, we ask whether certain adversarial interactions can also serve as productive learning encounters that demystify the opaque workings of AI systems to novice users. We ground this inquiry in an anecdotal observation of a student who deliberately sabotaged a machine-learning robot’s training process in order to understand its underlying logic. We outline this observation with a conceptual framework for learning with, through, and from the material quirks of LLMs grounded in Papert’s constructionism and Hasse’s ultra-social learning theory. Finally, we present the preliminary design of a research-through-workshop where non-experts will jailbreak various LLM chatbots, investigating this encounter as a situated learning process. We share this early-stage research as an invitation for feedback on reimagining inappropriate and harmful interactions with LLMs not merely as problems, but as opportunities for engagement and education.

pdf bib abs
<think> So let’s replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs
Sergey Pletenev | Alexander Panchenko | Daniil Moskovskiy

Modern Large Language Models (LLMs) are excellent at generating synthetic data. However, their performance in sensitive domains such as text detoxification has not received proper attention from the scientific community. This paper explores the possibility of using LLM-generated synthetic toxic data as an alternative to human-generated data for training models for detoxification. Using Llama 3 and Qwen activation-patched models, we generated synthetic toxic counterparts for neutral texts from ParaDetox and SST-2 datasets. Our experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data, with a drop in performance of up to 30% in joint metrics. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity. These findings highlight the limitations of current LLMs in this domain and emphasize the continued importance of diverse, human-annotated data for building robust detoxification systems.

pdf bib abs
Anthropomorphizing AI: A Multi-Label Analysis of Public Discourse on Social Media
Muhammad Owais Raza | Areej Fatemah Meghji

As the anthropomorphization of AI in public discourse usually reflects a complex interplay of metaphors, media framing, and societal perceptions, it is increasingly being used to shape and influence public perception on a variety of topics. To explore public perception and investigate how AI is personified, emotionalized, and interpreted in public discourse, we develop a custom multi-labeled dataset from the title and description of YouTube videos discussing artificial intelligence (AI) and large language models (LLMs). This was accomplished using a hybrid annotation pipeline that combined human-in-the-loop validation with AI assisted pre-labeling. This research introduces a novel taxonomy of narrative and epistemic dimensions commonly found in social media content on AI / LLM. Employing two modeling techniques based on traditional machine learning and transformer-based models for classification, the experimental results indicate that the fine-tuned transformer models, particularly AnthroRoBERTa and AnthroDistilBERT, generally outperform traditional machine learning approaches in anthropomorphization focused classification.

pdf bib abs
Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs
Jonathan Hvithamar Rystrøm | Hannah Rose Kirk | Scott Hale

Large Language Models (LLMs) are becoming increasingly capable across global languages. However, the ability to communicate across languages does not necessarily translate to appropriate cultural representations. A key concern is US-centric bias, where LLMs reflect US rather than local cultural values. We propose a novel methodology that compares LLM-generated response distributions against population-level opinion data from the World Value Survey across four languages (Danish, Dutch, English, and Portuguese). Using a rigorous linear mixed-effects regression framework, we compare three families of models: Google’s Gemma models (2B-27B parameters), AI2’s OLMo models (7B-32B parameters), and successive iterations of OpenAI’s turbo-series. Across the families of models, we find no consistent relationships between language capabilities and cultural alignment. While the Gemma models have a positive correlation between language capability and cultural alignment across all languages, the OpenAI and OLMo models are inconsistent. Our results demonstrate that achieving meaningful cultural alignment requires dedicated effort beyond improving general language capabilities.

pdf bib abs
Learn, Achieve, Predict, Propose, Forget, Suffer: Analysing and Classifying Anthropomorphisms of LLMs
Matthew Shardlow | Ashley Williams | Charlie Roadhouse | Filippos Karolos Ventirozos | Piotr Przybyła

Anthropomorphism is a literary device where human-like characteristics are used to refer to non-human entities. However, the use of anthropomorphism in the scientific description and public communication of large language models could lead to misunderstanding amongst scientists and lay-people regarding the technical capabilities and limitations of these models. In this study, we present an analysis of anthropomorphised language commonly used to describe LLMs, showing that the presence of terms such as ‘learn’, ‘achieve’, ‘predict’ and ‘can’ are typically correlated with human labels of anthropomorphism. We also perform experiments to develop a classification system for anthropomorphic descriptions of LLMs in scientific writing at the sentence level. We find that whilst a supervised Roberta-based system identifies anthropomorphisms with F1-score of 0.564, state-of-the-art LLM-based approaches regularly overfit to the task.

pdf bib abs
Leveraging the Scala type system for secure LLM-generated code
Alexander Sternfeld | Ljiljana Dolamic | Andrei Kucharavy

Large language models (LLMs) have shown remarkable proficiency in code generation tasks across various programming languages. However, their outputs often contain subtle but critical vulnerabilities, posing significant risks when deployed in security-sensitive or mission-critical systems. This paper introduces an agentic AI framework designed to enhance the security and robustness of LLM-generated code by leveraging strongly typed and verifiable languages, using Scala as a representative example. We evaluate the effectiveness of our approach in two settings: formal verification with the Stainless framework and general-purpose secure code generation. Our experiments with leading open-source LLMs reveal that while direct code generation often fails to enforce safety constraints, just as naive prompting for more secure code, our type-focused agentic pipeline substantially mitigates input validation and injection vulnerabilities. The results demonstrate the potential of structured, type-guided LLM workflows to improve the SotA of the trustworthiness of automated code generation in high-assurance domains.

pdf (full)
bib (full) Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models

pdf bib
Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models
Alicia Picazo-Izquierdo | Ernesto Luis Estevanell-Valladares | Ruslan Mitkov | Rafael Muñoz Guillena | Raúl García Cerdá

pdf bib abs
A Comparative Study of Vision Transformers and Multimodal Language Models for Violence Detection in Videos
Tomas Ditchfield-Ogle | Ruslan Mitkov

This project compares methods for de- tecting violent videos, which are crucial for ensuring real-time safety in surveil- lance and digital moderation. It evaluates four approaches: a random forest classi- fier, a transformer model, and two multi- modal vision-language models. The pro- cess involves preprocessing datasets, train- ing models, and assessing accuracy, inter- pretability, scalability, and real-time suit- ability. Results show that traditional meth- ods are simple but less effective. The trans- former model achieved high accuracy, and the multimodal models offered high vio- lence recall with descriptive justifications. The study highlights trade-offs and pro- vides practical insights for the deployment of automated violence detection.

pdf bib abs
Detection of AI-generated Content in Scientific Abstracts
Ernesto Luis Estevanell-Valladares | Alicia Picazo-Izquierdo | Ruslan Mitkov

The growing use of generative AI in academic writing raises urgent questions about authorship and the integrity of scientific communication. This study addresses the detection of AI-generated scientific abstracts by constructing a temporally anchored dataset of paired abstracts—each with a human-written version that contains scientific abstracts of works published before 2021 and a synthetic version generated using GPT-4.1. We evaluate three approaches to authorship classification: zero-shot large language models (LLMs), fine-tuned encoder-based transformers, and traditional machine learning classifiers. Results show that LLMs perform near chance level, while a LoRA-fine-tuned DistilBERT and a PassiveAggressive classifier achieve near-perfect performance. These findings suggest that shallow lexical or stylistic patterns still differentiate human and AI writing, and that supervised learning is key to capturing these signals.

pdf bib abs
A Comparative Study of Hyperbole Detection Methods: From Rule-Based Approaches through Deep Learning Models to Large Language Models
Silvia Gargova | Nevena Grigorova | Ruslan Mitkov

We address hyperbole detection as a binary classification task, comparing rule-based methods, fine-tuned transformers (BERT, RoBERTa), and large language models (LLMs) in zero-shot and few-shot prompting (Gemini, LLaMA). Fine-tuned transformers achieved the best overall performance, with RoBERTa attaining an F1-score of 0.82. Rule-based methods performed lower (F1 = 0.58) but remain effective in constrained linguistic contexts. LLMs showed mixed results: zero-shot performance was variable, while few-shot prompting notably improved outcomes, reaching F1-scores up to 0.79 without task-specific training data. We discuss the trade-offs between interpretability, computational cost, and data requirements across methods. Our results highlight the promise of LLMs in low-resource scenarios and suggest future work on hybrid models and broader figurative language tasks.

pdf bib abs
Evaluating the Performance of Transformers in Translating Low-Resource Languages through Akkadian
Daniel A. Jones | Ruslan Mitkov

In this paper, we evaluate the performance of various fine-tuned, transformer-based models in translating Akkadian into English. Using annotated Akkadian data, we seek to establish potential considerations when developing models for other low-resource languages, which do not yet have as robust data. The results of this study show the potency, but also cost inefficiency, of Large Language Models compared to smaller Neural Machine Translation models. Significant evidence was also found demonstrating the importance of fine-tuning machine translation models from related languages.

We propose a fine-tuning strategy for English Multi-class Hope Speech Detection using Mistral, leveraging two complementary datasets: PolyHope and CDB, a new unified framework for hope speech detection. While the former provides nuanced hope-related categories such as GENERALIZED, REALISTIC, and UNREALISTIC HOPE, the later introduces linguistically grounded dimensions including COUNTERFACTUAL, DESIRE, and BELIEF. By fine-tuning Mistral on both datasets, we enable the model to capture deeper semantic representations of hope. In addition to fine-tuning, we developed advanced prompting strategies which provide interpretable, zero-shot alternatives and further inform annotation and classification designs. Our approach achieved third place in the multi-class (Macro F1=71.77) and sixth in the binary (Macro F1=85.35) settings.

pdf bib abs
Does Anaphora Resolution Improve LLM Fine-Tuning for Summarisation?
Yi Chun Lo | Ruslan Mitkov

This study investigates whether adding anaphora resolution as a preprocessing step before fine-tuning the text summarisation application in LLM can improve the quality of summary output. Two sets of training with the T5-base model and BART-large model using the SAMSum dataset were conducted. One uses the original text and the other uses the text processed by a simplified version of MARS (Mitkov’s Anaphora Resolution System). The experiment reveals that when T5-base model is fine-tuned on the anaphora-resolved inputs, the ROUGE metrics are improved. In contrast, BART-large model only has a slight improvement after fine-tuning under the same conditions, which is not statistically significant. Further analysis of the generated summaries indicates that anaphora resolution is helpful in semantic alignment.

pdf bib abs
Transformers and Large Language Models for Hope Speech Detection A Multilingual Approach
Diana Patricia Madera-Espíndola | Zoe Caballero-Domínguez | Valeria J. Ramírez-Macías | Sabur Butt | Hector G. Ceballos

With the rise of Generative AI (GenAI) models in recent years, it is necessary to understand how they performed compared with other Deep Learning techniques, across tasks and across different languages. In this study, we benchmark ChatGPT-4 and XML-RoBERTa, a multilingual transformer-based model, as part of the Multilingual Binary and Multiclass Hope Speech Detection within the PolyHope-M 2025 competition. Furthermore, we explored prompting techniques and data augmentation to determine which approach yields the best performance. In our experiments, XML-RoBERTa frequently outperformed ChatGPT-4. It also attained F1 scores of 0.86 for English, 0.83 for Spanish, 0.86 for German, and 0.94 for Urdu in Task 1, while achieving 0.73 for English, 0.70 for Spanish, 0.69 for German, and 0.60 for Urdu in Task 2.

pdf bib abs
Beyond BLEU: Ethical Risks of Misleading Evaluation in Domain-Specific QA with LLMs
Ayoub Nainia | Régine Vignes-Lebbe | Hajar Mousannif | Jihad Zahir

Large Language Models (LLMs) are increasingly used in scientific question answering (QA), including high-stakes fields such as biodiversity informatics. However, standard evaluation metrics such as BLEU, ROUGE, Exact Match (EM), and BERTScore remain poorly aligned with the factual and domain-specific requirements of these tasks. In this work, we investigate the gap between automatic metrics and expert judgment in botanical QA by comparing metric scores with human ratings across five dimensions: accuracy, completeness, relevance, fluency, and terminology usage. Our results show that standard metrics often misrepresent response quality, particularly in the presence of paraphrasing, omission, or domain-specific language. Through both quantitative analysis and qualitative examples, we show that high-scoring responses may still exhibit critical factual errors or omissions. These findings highlight the need for domain-aware evaluation frameworks that incorporate expert feedback and raise important ethical concerns about the deployment of LLMs in scientific contexts.

pdf bib abs
From Zero to Hero: Building Serbian NER from Rules to LLMs
Milica Ikonić Nešić | Sasa Petalinkar | Ranka Stanković | Ruslan Mitkov

Named Entity Recognition (NER) presents specific challenges in Serbian, a morphologically rich language. To address these challenges, a comparative evaluation of distinct model paradigms across diverse text genres was conducted. A rule-based system (SrpNER), a traditional deep learning model (Convolutional Neural Network – CNN), fine-tuned transformer architectures (Jerteh and Tesla), and Large Language Models (LLMs), specifically ChatGPT 4.0 Nano and 4.1 Mini, were evaluated and compared. For the LLMs, a one-shot prompt engineering approach was employed, using prompt instructions aligned with the entity type definitions used in the manual annotation guidelines. Evaluation was performed on three Serbian datasets representing varied domains: newspaper articles, history textbook excerpts, and a sample of literary texts from the srpELTeC collection. The highest performance was consistently achieved by the fine-tuned transformer models, with F1 scores ranging from 0.78 on newspaper articles to 0.96 on primary school history textbook sample.

pdf bib abs
Enhancing the Performance of Spoiler Review Detection by a LLM with Hints
Genta Nishi | Einoshin Suzuki

We investigate the effects of various hints including an introduction text, a few examples, and prompting techniques to enhance the performance of a Large-Language Model (LLM) in detecting a spoiler review of a movie. Detecting a spoiler review of a movie represents an important Natural Language Processing (NLP) task which resists the Deep Learning (DL) approach due to its highly subjective nature and scarcity in data. The highly subjective nature is also the main reason of the poor performance of LLMs-based methods, which explains their scarcity for the target problem. We address this problem by providing the LLM with an introduction text of the movie and a few reviews with their class labels as well as equipping it with a prompt that selects and exploits spoiler types with reasoning. Experiments using 400 manually labeled reviews and about 3200 LLM-labeled reviews show that our CAST (Clue And Select Types prompting) outperforms (0.05 higher) or is on par with (only 0.01 lower) cutting-edge LLM-based methods in three out of four movies in ROC-AUC. We believe our study represents an evidence of a target problem in which the knowledge intensive approach outperforms the learning-based approach.

pdf bib abs
Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets
Julian Oestreich | Lydia Müller

We present a comprehensive evaluation of structured decoding for text-to-table generation with large language models (LLMs). While previous work has primarily focused on unconstrained generation of tables, the impact of enforcing structural constraints during generation remains underexplored. We systematically compare schema-guided (structured) decoding to standard one-shot prompting across three diverse benchmarks - E2E, Rotowire, and Livesum - using open-source LLMs of up to 32B parameters, assessing the performance of table generation approaches in resource-constrained settings. Our experiments cover a wide range of evaluation metrics at cell, row, and table levels. Results demonstrate that structured decoding significantly enhances the validity and alignment of generated tables, particularly in scenarios demanding precise numerical alignment (Rotowire), but may degrade performance in contexts involving densely packed textual information (E2E) or extensive aggregation over lengthy texts (Livesum). We further analyze the suitability of different evaluation metrics and discuss the influence of model size.

pdf bib abs
Evaluating the LLM and NMT Models in Translating Low-Resourced Languages
Julita JP Pucinskaite | Ruslan Mitkov

Machine translation has significantly advanced due to the development of transformer architecture, which is utilised by many modern deep-learning models. However, low-resource languages, such as Lithuanian, still face challenges stemming from the limited availability of training data and resource constraints. This study examines the translation capabilities of Neural Machine Translation (NMT) models and Large Language Models (LLMs), comparing their performance in low-resource translation tasks. Furthermore, it assesses the impact of parameter scaling and fine-tuning on their effectiveness in enhancing model performance. The evaluation showed that while LLMs demonstrated proficiency in low-resource translation, their results were lower compared to NMT models, which remained consistent across smaller variants. However, as model size increased, the lead was not as prominent, correlating with automatic and human evaluations. The effort to enhance translation accuracy through fine-tuning proved to be an effective strategy, demonstrating improvements in vocabulary expansion and structural coherence in both architectures. These findings highlight the importance of diverse datasets, comprehensive model design, and fine-tuning techniques in addressing the challenges of low-resourced language translation. This project, one of the first studies to focus on the low-resourced Lithuanian language, aims to contribute to the broader discourse and ongoing efforts to enhance accessibility and inclusivity in Natural Language Processing.

pdf bib abs
KGEIR: Knowledge Graph-Enhanced Iterative Reasoning for Multi-Hop Question Answering
Tianda Sun | Dimitar Kazakov

Multi-hop question answering (MHQA) requires systems to retrieve and connect information across multiple documents, a task where large language models often struggle. We introduce Knowledge Graph-Enhanced Iterative Reasoning (KGEIR), a framework that dynamically constructs and refines knowledge graphs during question answering to enhance multi-hop reasoning. KGEIR identifies key entities from questions, builds an initial graph from retrieved paragraphs, reasons over this structure, identifies information gaps, and iteratively retrieves additional context to refine the graph until sufficient information is gathered. Evaluations on HotpotQA, 2WikiMultiHopQA, and MuSiQue benchmarks show competitive or superior performance to state-of-the-art methods. Ablation studies confirm that structured knowledge representations significantly outperform traditional prompting approaches like Chain-of-Thought and Tree-of-Thought. KGEIR’s ability to explicitly model entity relationships while addressing information gaps through targeted retrieval offers a promising direction for integrating symbolic and neural approaches to complex reasoning tasks.

pdf bib abs
From Handcrafted Features to LLMs: A Comparative Study in Native Language Identification
Aliyah C. Vanterpool | Katsiaryna Aharodnik

This study compares a traditional machine learning feature-engineering approach to a large language models (LLMs) fine-tuning method for Native Language Identification (NLI). We explored the COREFL corpus, which consists of L2 English narratives produced by Spanish and German L1 speakers with lower-advanced English proficiency (C1) (Lozano et al., 2020). For the feature-engineering approach, we extracted language productivity, linguistic diversity, and n-gram features for Support Vector Machine (SVM) classification. We also looked at sentence embeddings with SVM and logistic regression. For the LLM approach, we evaluated BERT-like models and GPT-4. The feature-engineering approach, particularly n-grams, outperformed the LLMs. Sentence-BERT embeddings with SVM achieved the second-highest accuracy (93%), while GPT-4 reached an average accuracy of 90.4% across three runs when prompted with labels. These findings suggest that feature engineering remains a robust method for NLI, especially for smaller datasets with subtle linguistic differences between classes. This study contributes to the comparative analysis of traditional machine learning and transformer-based LLMs, highlighting current LLM limitations in handling domain-specific data and their need for larger training resources.

pdf bib abs
Systematic Evaluation of Rule-Based Analytics for LLM-Driven Graph Data Modelling
Fabio Antonio Yanez | Andrés Montoyo | Armando Suárez | Alejandro Piad-Morffis | Yudivián Almeida Cruz

This paper presents a novel multi-agent system for automatically generating graph database schemas from tabular data, strategically integrating rule-based analytics with large language models (LLMs). The framework leverages a lightweight rule system to select the most suitable analytic methods based on column data types, providing targeted insights that guide schema generation.

pdf bib abs
Improved Contrastive Learning over Commonsense Knowledge Graphs for Unsupervised Reasoning
Rongwen Zhao | Jeffrey Flanigan

Knowledge-augmented methods leverage external resources such as commonsense knowledge graphs (CSKGs) to improve downstream reasoning tasks. Recent work has explored contrastive learning over relation-aware sequence pairs derived from CSKG triples to inject commonsense knowledge into pre-trained language models (PLMs). However, existing approaches suffer from two key limitations: they rely solely on randomly sampled in-batch negatives, overlooking more informative hard negatives, and they ignore additional plausible positives that could strengthen training. Both factors limit the effectiveness of contrastive knowledge learning. In this paper, we propose an enhanced contrastive learning framework for CSKGs that integrates hard negative sampling and positive set expansion. Hard negatives are dynamically selected based on semantic similarity to ensure the model learns from challenging distinctions, while positive set expansion exploits the property that similar head entities often share overlapping tail entities, allowing the recovery of missing positives. We evaluate our method on unsupervised commonsense question answering and inductive CSKG completion using ConceptNet and ATOMIC. Experimental results demonstrate consistent improvements over strong baselines, confirming that our approach yields richer commonsense-aware representations and more effective knowledge injection into PLMs.