Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Galia Angelova | Maria Kunilovskaya | Marie Escribe | Ruslan Mitkov
Harnessing Open-Source LLMs for Tender Named Entity Recognition
Asim Abbas | Venelin Kovatchev | Mark Lee | Niloofer Shanavas | Mubashir Ali
In the public procurement domain, extracting accurate tender entities from unstructured text remains a critical, under-explored challenge, because tender data is highly sensitive, confidential, and not openly available. State-of-the-art NLP models have previously been developed for this task; however, developing an NER model from scratch requires large amounts of data and resources. Similarly, fine-tuning a transformer-based model such as BERT requires training data, posing challenges in training data cost, model generalization, and data privacy. To address these challenges, emerging LLMs such as GPT-4 in a few-shot learning setting achieve SOTA performance comparable to fine-tuned models. However, depending on closed-source commercial LLMs entails high costs and privacy concerns. In this study, we investigate open-source LLMs such as Mistral and LLAMA-3 for NER in the tender domain on local consumer-grade CPUs in three different settings: zero-shot, one-shot, and few-shot learning. The motivation is to reduce costs compared to a cloud solution while preserving accuracy and data privacy. We utilize two datasets: an open-source dataset from Singapore and commercially sensitive, closed-source data provided by Siemens. All the open-source LLMs achieve above 85% F1-score on the open-source dataset and above 90% F1-score on the closed-source dataset.
On the Limitations of Large Language Models (LLMs): False Attribution
Tosin Adewumi | Nudrat Habib | Lama Alkhaled | Elisa Barney
In this work, we introduce a new hallucination metric, SHI, and provide insight into one important limitation of the parametric knowledge of large language models (LLMs), i.e. false attribution. The task of automatic author attribution for relatively small chunks of text is an important NLP task but can be challenging. We empirically evaluate the power of 3 open SotA LLMs in a zero-shot setting (Gemma-7B, Mixtral 8x7B, and LLaMA-2-13B). We acquired the top 10 most popular books of a month, according to Project Gutenberg, divided each one into equal chunks of 400 words, and prompted each LLM to predict the author. We then randomly sampled 162 chunks per book for human evaluation, based on an error margin of 7% and a confidence level of 95%. The average results show that Mixtral 8x7B has the highest prediction accuracy, the lowest SHI, and a Pearson’s correlation r of 0.724, 0.263, and -0.9996, respectively, followed by LLaMA-2-13B and Gemma-7B. However, Mixtral 8x7B suffers from high hallucination for 3 books, rising as high as an SHI of 0.87 (in the range 0-1, where 1 is the worst). The strong negative correlation between accuracy and SHI, given by r, demonstrates the fidelity of the new hallucination metric, which may generalize to other tasks. We also show that prediction accuracies correlate positively with the frequencies of Wikipedia instances of the book titles rather than with their downloads, and we perform error analyses of the predictions. We publicly release the annotated chunks of data and our code to aid reproducibility and the evaluation of other models.
Candidate Profile Summarization: A RAG Approach with Synthetic Data Generation for Tech Jobs
Anum Afzal | Ishwor Subedi | Florian Matthes
As Large Language Models (LLMs) become increasingly applied to resume evaluation and candidate selection, this study investigates the effectiveness of using in-context example resumes to generate synthetic data. We compare a Retrieval-Augmented Generation (RAG) system to a Named Entity Recognition (NER)-based baseline for job-resume matching, generating diverse synthetic resumes with models like Mixtral-8x22B-Instruct-v0.1. Our results show that combining BERT, ROUGE, and Jaccard similarity metrics effectively assesses synthetic resume quality, ensuring the least lexical overlap along with high similarity and diversity. Our experiments show that RAG notably outperforms NER for retrieval tasks—though generation-based summarization remains challenged by role differentiation. Human evaluation further highlights issues of factual accuracy and completeness, emphasizing the importance of in-context examples, prompt engineering, and improvements in summary generation for robust, automated candidate selection.
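Of the quality metrics mentioned above, the Jaccard component is the simplest to make concrete. The snippet below is a minimal, self-contained sketch (our own illustration, not the authors' evaluation code) of token-level Jaccard overlap between a synthetic resume and its in-context example; the function name and example strings are hypothetical.

```python
# Minimal sketch (not the authors' code): token-level Jaccard overlap between
# a synthetic resume and the in-context example it was generated from.
# Lower overlap suggests the synthetic resume is not a near-copy.

def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity over lowercased token sets."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

if __name__ == "__main__":
    example = "Senior Python developer with five years of backend experience."
    synthetic = "Backend engineer experienced in Go microservices and cloud deployment."
    print(f"Jaccard overlap: {jaccard_similarity(example, synthetic):.3f}")
```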
PersianSciQA: A New Dataset for Bridging the Language Gap in Scientific Question Answering
Safoura Aghadavoud Jolfaei | Azadeh Mohebi | Zahra Hemmat
The shortage of specialized datasets hinders the development of Natural Language Processing (NLP) for scientific texts in low-resource languages such as Persian. To address this, we introduce PersianSciQA, a large-scale resource of 39,809 question-answer snippet pairs, each containing a question and a scientific answer snippet drawn from scientific engineering abstracts in IranDoc’s ‘Ganj’ repository, linked by an LLM-assigned relevance score (0-3) that measures how relevant the question is to the content of the accompanying answer snippet. The dataset was generated using a two-stage prompting methodology and refined through a rigorous cleaning pipeline, including text normalization and semantic deduplication. Human validation of 1,000 instances by two NLP researchers confirmed the dataset’s quality and a substantial LLM-human agreement (Cohen’s kappa coefficient κ=0.6642). To demonstrate its value, we establish baseline benchmarks and show that fine-tuning on PersianSciQA dramatically improves a state-of-the-art model, achieving a Spearman correlation of 0.895 on a blind test set. PersianSciQA provides a crucial new resource to facilitate research in information retrieval and question answering within the Persian scientific domain.
Multilingual Pre-training Meets Supervised Neural Machine Translation: A Reproducible Evaluation on English–French and Finnish Translation
Benyamin Ahmadnia | Yeswanth Soma | Hossein Sarrafzadeh
This paper presents a comparative evaluation of Transformer-based Neural Machine Translation (NMT) models and pre-trained multilingual sequence-to-sequence models in the context of moderately-resourced MT. Using English-French (high-resource) and English-Finnish (moderate-resource) as case studies, we assess the effectiveness of fine-tuning the mBART model versus training standard NMT systems from scratch. Our experiments incorporate data-augmentation techniques such as back-translation and evaluate translation quality using BLEU, TER, METEOR, and COMET metrics. We also provide a detailed error analysis that covers lexical choice, named entity handling, and word order. While mBART demonstrates consistent improvements over classical NMT, particularly in handling complex linguistic structures and sparse training data, we acknowledge the challenges of deploying large models in resource-constrained settings. Our findings highlight practical trade-offs between model complexity, resource availability, and translation quality in multilingual scenarios.
Advancing Clinical Translation in Nepali through Fine-Tuned Multilingual Models
Benyamin Ahmadnia | Sumaiya Shaikh | Bibek Poudel | Shazan Mohammed | Sahar Hooshmand
Low-resource Neural Machine Translation (NMT) remains a major challenge, particularly in high-stakes domains such as healthcare. This paper presents a domain-adapted pipeline for English-Nepali medical translation leveraging two state-of-the-art multilingual Large Language Models (LLMs): mBART and NLLB-200. A high-quality, domain-specific parallel corpus is curated, and both models are fine-tuned using PyTorch frameworks. Translation fidelity is assessed through a multi-metric evaluation strategy that combines BLEU, CHRF++, METEOR, BERTScore, COMET, and perplexity. Our experimental results show that NLLB-200 consistently outperforms mBART across surface-level and semantic metrics, achieving higher accuracy and lower hallucination rates in clinical settings. In addition, error profiling and ethical assessments are conducted to highlight challenges such as term omissions and cultural bias. This work underscores the viability of large-scale multilingual models in enhancing medical translation for low-resource languages and proposes actionable paths toward safer and more equitable MT deployment in healthcare.
Advancing Active Learning with Ensemble Strategies
Naif Alatrush | Sultan Alsarra | Afraa Alshammari | Luay Abdeljaber | Niamat Zawad | Latifur Khan | Patrick T. Brandt | Javier Osorio | Vito D’Orazio
Active learning (AL) reduces annotation costs by selecting the most informative samples for labeling. However, traditional AL methods rely on a single heuristic, limiting data exploration and annotation efficiency. This paper introduces two ensemble-based AL methods: Ensemble Union, which combines multiple heuristics to improve dataset exploration, and Ensemble Intersection, which applies majority voting for robust sample selection. We evaluate these approaches on the United Nations Parallel Corpus (UNPC) in both English and Spanish using domain-specific models such as ConfliBERT. Our results show that ensemble-based AL strategies outperform individual heuristics, achieving classification performance comparable to full dataset training while using significantly fewer labeled examples. Although focused on political texts, the proposed methods are applicable to broader NLP annotation tasks where labeling costs are high.
Evaluating Large Language Models on Sentiment Analysis in Arabic Dialects
Maram I. Alharbi | Saad Ezzini | Hansi Hettiarachchi | Tharindu Ranasinghe | Ruslan Mitkov
Despite recent progress in large language models (LLMs), their performance on Arabic dialects remains underexplored, particularly in the context of sentiment analysis. This study presents a comparative evaluation of three LLMs, DeepSeek-R1, Qwen2.5, and LLaMA-3, on sentiment classification across Modern Standard Arabic (MSA), Saudi dialect and Darija. We construct a balanced sentiment dataset by translating and validating MSA hotel reviews into Saudi dialect and Darija. Using parameter-efficient fine-tuning (LoRA) and dialect-specific prompts, we assess each model under matched and mismatched prompting conditions. Evaluation results show that Qwen2.5 achieves the highest macro F1 score of 79% on Darija input using MSA prompts, while DeepSeek performs best when prompted in the input dialect, reaching 71% on Saudi dialect. LLaMA-3 exhibits stable performance across prompt variations, with 75% macro F1 on Darija input under MSA prompting. Dialect-aware prompting consistently improves classification accuracy, particularly for neutral and negative sentiment classes.
From Posts to Predictions: A User-Aware Framework for Faithful and Transparent Detection of Mental Health Risks on Social Media
Hessam Amini | Leila Kosseim
We propose a user-aware attention-based framework for early detection of mental health risks from social media posts. Our model combines DisorBERT, a mental health–adapted transformer encoder, with a user-level attention mechanism that produces transparent post-level explanations. To assess whether these explanations are faithful, i.e., aligned with the model’s true decision process, we apply adversarial training and quantify attention faithfulness using the AtteFa metric. Experiments on four eRisk tasks (depression, anorexia, self-harm, and pathological gambling) show that our model achieves competitive latency-weighted F1 scores while relying on a sparse subset of posts per user. We also evaluate attention robustness and conduct ablations, confirming the model’s reliance on high-weighted posts. Our work extends prior explainability studies by integrating faithfulness assessment in a real-world high-stakes application. We argue that systems combining predictive accuracy with faithful and transparent explanations offer a promising path toward safe and trustworthy AI for mental health support.
Beyond Methods and Datasets Entities: Introducing SH-NER for Hardware and Software Entity Recognition in Scientific Text
Aftab Anjum | Nimra Maqbool | Ralf Krestel
Scientific Information Extraction (SciIE) has become essential for organizing and understanding scientific literature, powering tasks such as knowledge graph construction, method recommendation, and automated literature reviews. Although prior SciIE work commonly annotates entities such as tasks, methods, and datasets, it systematically neglects infrastructure-related entities like hardware and software specifications mentioned in publications. This gap limits key applications: knowledge graphs remain incomplete, and recommendation systems cannot effectively filter methods based on hardware compatibility. To address this gap, we introduce SH-NER, the first large-scale, manually annotated dataset focused on infrastructure-related entities in NLP research. SH-NER comprises 1,128 full-text papers from the ACL Anthology and annotates five entity types: Software, Cloud-Platform, Hardware-Device, Device-Count, and Device-Memory. Our dataset comprises over 9k sample sentences with around 6k annotated entity mentions. To assess the effectiveness of SH-NER, we conducted comprehensive experiments employing state-of-the-art supervised models alongside large language models (LLMs) as baselines. The results show that SH-NER improves scientific information extraction by better capturing infrastructure mentions. You can find the manually annotated dataset at https://github.com/coderhub84/SH-NER.
Toponym Resolution: Will Prompt Engineering Change Expectations?
Isuri Anuradha | Deshan Koshala Sumanathilaka | Ruslan Mitkov | Paul Rayson
Large Language Models (LLMs) have revolutionised the field of artificial intelligence and have been successfully employed in many disciplines, capturing widespread attention and enthusiasm. Many previous studies have established that domain-specific deep learning models can perform competitively with general-purpose LLMs (Maatouk et al., 2024; Lu et al., 2024). However, a suitable prompt which provides direct instructions and background information is expected to yield improved results (Kamruzzaman and Kim, 2024). The present study focuses on utilising LLMs for the Toponym Resolution task by incorporating Retrieval-Augmented Generation (RAG) and prompting techniques to surpass the results of traditional Deep Learning models. Moreover, this study demonstrates that promising results can be achieved without relying on large amounts of labelled, domain-specific data. After a descriptive comparison between open-source and proprietary LLMs through different prompt engineering techniques, the GPT-4o model performs best compared to the other LLMs for the Toponym Resolution task.
HoloBERT: Pre-Trained Transformer Model for Historical Narratives
Isuri Anuradha | Le An Ha | Ruslan Mitkov
Oral texts often contain spontaneous, unstructured language with features like disfluencies, colloquialisms, and non-standard syntax. In this paper, we investigate how further pretraining language models with specialised learning objectives for oral and transcribed texts enhances Named Entity Recognition (NER) performance in Holocaust-related discourse. To evaluate our models, we compare the extracted named entities (NE) against those from other models pretrained on historical texts and from generative AI models such as GPT. Furthermore, we demonstrate practical applications of the recognised NEs by linking them to a knowledge base as structured metadata and representing them in a graph format. With these contributions, our work illustrates how the further-pretrain-and-fine-tune paradigm in Natural Language Processing advances research in Digital Humanities.
A Framework for Fine-Tuning LLMs Using Heterogeneous Feedback
Ryan Aponte | Ryan A. Rossi | Shunan Guo | Franck Dernoncourt | Tong Yu | Xiang Chen | Subrata Mitra | Nedim Lipka
Large language models (LLMs) have been applied to a wide range of tasks, including text summarization, web navigation, and chatbots. They have benefitted from supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) following an unsupervised pretraining. These datasets can be difficult to collect, limited in scope, and vary in sample quality. Additionally, datasets can vary extensively in supervision format, from numerical to binary as well as multi-dimensional with many different values. We present a framework for fine-tuning LLMs using heterogeneous feedback, which has two main components. First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF. Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases potentially exceeding the full dataset. We conduct extensive experiments to understand the effectiveness of these techniques for incorporating heterogeneous feedback, and demonstrate improvements from using a high-quality and diverse subset of the data. We find that our framework is able to improve models in multiple areas simultaneously, such as in instruction following and bias reduction.
Chakoshi: A Customizable Guardrail for LLMs with a Focus on Japanese-Language Moderation
Kazuhiro Arai | Ryota Matsui | Kenji Miyama | Yudai Yamamoto | Ren Shibamiya | Kaito Sugimoto | Yoshimasa Iwase
In this research, we developed and evaluated “chakoshi”, an LLM guardrail model designed to address Japanese-specific nuances. chakoshi is a lightweight LLM that has been fine-tuned using multiple open datasets and proprietary learning datasets. Based on gemma-2-9b, the chakoshi model achieved an average F1 score of 0.92 or higher across multiple test datasets, demonstrating superior performance compared to existing models. Additionally, we implemented a feature that allows customization of the categories to be filtered using natural language, and confirmed its effectiveness through practical examples.
KoWit-24: A Richly Annotated Dataset of Wordplay in News Headlines
Alexander Baranov | Anna Palatkina | Yulia Makovka | Pavel Braslavski
We present KoWit-24, a dataset with fine-grained annotation of wordplay in 2,700 Russian news headlines. KoWit-24 annotations include the presence of wordplay, its type, wordplay anchors, and words/phrases the wordplay refers to. Unlike the majority of existing humor collections of canned jokes, KoWit-24 provides wordplay contexts – each headline is accompanied by the news lead and summary. The most common type of wordplay in the dataset is the transformation of collocations, idioms, and named entities – the mechanism that has been underrepresented in previous humor datasets. Our experiments with five LLMs show that there is ample room for improvement in wordplay detection and interpretation tasks. The dataset and evaluation scripts are available at https://github.com/Humor-Research/KoWit-24
Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets
Eduard Barbu | Meeri-Ly Muru | Sten Marcus Malva
This paper presents a method for text simplification based on two neural architectures: a neural machine translation (NMT) model and a fine-tuned large language model (LLaMA). Given the scarcity of existing resources for Estonian, a new dataset was created by combining manually translated corpora with GPT-4.0-generated simplifications. OpenNMT was selected as a representative NMT-based system, while LLaMA was fine-tuned on the constructed dataset. Evaluation shows LLaMA outperforms OpenNMT in grammaticality, readability, and meaning preservation. These results underscore the effectiveness of large language models for text simplification in low-resource language settings. The complete dataset, fine-tuning scripts, and evaluation pipeline are provided in a publicly accessible supplementary package to support reproducibility and adaptation to other languages.
Mitigating Bias in Text Classification via Prompt-Based Text Transformation
Charmaine Barker | Dimitar Kazakov
The presence of specific linguistic signals particular to a certain sub-group can become highly salient to language models during training. In automated decision-making settings, this may lead to biased outcomes when models rely on cues that correlate with protected characteristics. We investigate whether prompting ChatGPT to rewrite text using simplification, neutralisation, localisation, and formalisation can reduce demographic signals while preserving meaning. Experimental results show a statistically significant drop in location classification accuracy across multiple models after transformation, suggesting reduced reliance on group-specific language. At the same time, sentiment analysis and rating prediction tasks confirm that the core meaning of the reviews remains largely intact. These results suggest that prompt-based rewriting offers a practical and generalisable approach for mitigating bias in text classification.
Towards CEFR-targeted Text Simplification for Question Adaptation
Luca Benedetto | Paula Buttery
Text Simplification (TS) can adapt educational content to learners’ proficiency levels. In reading comprehension questions, passage complexity directly affects question difficulty; thus, TS could enable automatic question adaptation by generating multiple versions of a reading passage. However, despite the potential of TS and its applications in other domains, the feasibility, reliability, and robustness of TS for question adaptation remain unexplored. In this paper, we conduct the first evaluation of LLMs for CEFR-targeted text simplification aimed at question adaptation. Specifically, we investigate whether LLMs can perform CEFR-targeted text simplification and how this affects question answerability. Evaluating four LLMs on two English learning datasets, we show that they can mostly perform targeted simplification with readability values correlating with reference CEFR levels, but alignment is imperfect. Crucially, the simplified texts generally preserve the information needed for question answering, and questions associated with texts simplified at lower levels show reduced difficulty in virtual pretesting. These preliminary findings show the potential of LLMs for educational content adaptation, but practical deployment will need improved CEFR alignment.
Evaluation of Pretrained and Instruction-Based Pretrained Models for Emotion Detection in Arabic Social Media Text
Md. Rafiul Biswas | Shimaa Ibrahim | Mabrouka Bessghaier | Wajdi Zaghouani
This study evaluates three approaches to emotion detection in Arabic social media text: instruction prompting of large language models (LLMs), instruction fine-tuning of LLMs, and transformer-based pretrained models. We compare pretrained transformer models such as AraBERT, CaMelBERT, and XLM-RoBERTa with instruction prompting of advanced LLMs such as GPT-4o, Gemini, Deepseek, and Fanar, and with instruction fine-tuning of LLMs such as Llama 3.1, Mistral, and Phi. Using a highly preprocessed dataset of 10,000 labeled Arabic tweets with overlapping emotional labels, our findings reveal that transformer-based pretrained models outperform the instruction prompting and instruction fine-tuning approaches. Instruction prompts leverage general linguistic skills with maximum efficiency but fall short in detecting subtle emotional contexts. Instruction fine-tuning is more specific but trails behind pretrained transformer models. Our findings establish the need for optimized instruction-based approaches and underscore the important role played by domain-specific transformer architectures in accurate Arabic emotion detection.
Can LLMs Disambiguate Grounded Language? The Case of PP Attachment
John Blackmore | Matthew Stone
We explore the potential of large language models in resolving ambiguity in prepositional phrase attachments in grounded language. We find that when prompted in such a way that we can compute a probability of the respective attachment, models yield promising results. However, additional inputs from a measure of information structure may help improve prediction accuracy. We also investigate where we need more sophisticated tools, commonsense reasoning, world knowledge, and additional context to resolve ambiguity.
MLDataForge: Accelerating Large-Scale Dataset Preprocessing and Access for Multimodal Foundation Model Training
Andrea Blasi Núñez | Lukas Paul Achatius Galke | Peter Schneider-Kamp
Preprocessing large and possibly multimodal datasets remains a key bottleneck in many machine learning workflows, particularly when random access to samples is needed for global shuffling and sorting. Existing approaches, including widely used formats like JSONL and frameworks such as Huggingface Datasets and MosaicML Streaming, typically incur substantial computational, memory, and storage overhead in such settings. Here, we introduce MLDataForge, a Python-based open-source framework designed for scalable dataset pre-processing and access. Our key contributions are: (1) optimized readers for Mosaic Data Shards (MDS) that substantially improve throughput, reduce peak storage usage, and support sample-level compression; (2) JINX (JSON Indexed ’N’ eXtended), a novel, index-augmented JSONL-compatible format supporting structured footers and binary sidecar files; and (3) a lazy-loading mechanism that defers the loading, decompression, and decoding of JINX files until sample fields are accessed. We empirically evaluate MLDataForge and our contributions on a representative 200 GB supervised fine-tuning dataset for vision language models. Our best configuration – zstd-compressed JINX with binary sidecar and lazy loading – yields at least a decimal order-of-magnitude throughput increase compared to the best baselines for iteration, global shuffling, and sorting. These advances enable substantial gains in data preprocessing performance, facilitating more scalable and resource-efficient model training pipelines.
The Impact of Named Entity Recognition on Transformer-Based Multi-Label Dietary Recipe Classification
Kemalcan Bora | Horacio Saggion
This research explores the impact of Named Entity Recognition (NER) on transformer-based models for multi-label recipe classification by dietary preference. To support this task, we introduce the NutriCuisine Index: a collection of 23,932 recipes annotated across six dietary categories (Healthy, Vegan, Gluten-Free, Low-Carb, High-Protein, Low-Sugar). Using BERT-base-uncased, RoBERTa-base, and DistilBERT-base-uncased, we evaluate how NER-based preprocessing affects the performance (F1-score, Precision, Recall, and Hamming Loss) of Transformer-based multi-label classification models. RoBERTa-base shows significant improvements with NER in F1-score (∆F1 = +0.0147, p < 0.001), Precision, and Recall, while BERT and DistilBERT show no such gains. NER also leads to a slight but statistically significant increase in Hamming Loss across all models. These findings highlight the model dependent impact of NER on classification performance.
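For readers unfamiliar with the Hamming Loss reported above, the following toy sketch (our own illustration with made-up label matrices, not the paper's data or code) shows how it is computed for multi-label outputs over the six dietary categories.

```python
import numpy as np

def hamming_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of label slots predicted incorrectly (lower is better)."""
    return float((y_true != y_pred).mean())

if __name__ == "__main__":
    # Rows are recipes, columns are the six dietary categories.
    y_true = np.array([[1, 0, 1, 0, 0, 1],
                       [0, 1, 0, 0, 1, 0]])
    y_pred = np.array([[1, 0, 0, 0, 0, 1],
                       [0, 1, 0, 1, 1, 0]])
    print(f"Hamming loss: {hamming_loss(y_true, y_pred):.3f}")  # 2 wrong of 12 -> 0.167
```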
Balancing the Scales: Addressing Gender Bias in Social Media Toxicity Detection
Beatriz Botella-Gil | Juan Pablo Consuegra-Ayala | Alba Bonet-Jover | Paloma Moreda-Pozo
The detection of toxic content in social media has become a critical task in Natural Language Processing (NLP), particularly given its intersection with complex issues like subjectivity, implicit language, and cultural context. Among these challenges, bias in training data remains a central concern—especially as language models risk reproducing and amplifying societal inequalities. This paper investigates the interplay between toxicity and gender bias on Twitter/X by introducing a novel dataset of violent and non-violent tweets, annotated not only for violence but also for gender. We conduct an exploratory analysis of how biased data can distort toxicity classification and present algorithms to mitigate these effects through dataset balancing and debiasing. Our contributions include four new dataset splits—two balanced and two debiased—that aim to support the development of fairer and more inclusive NLP models. By foregrounding the importance of equity in data curation, this work lays the groundwork for more ethical approaches to automated violence detection and gender annotation.
“Simple-Tool”: A Tool for the Automatic Transformation of Spanish Texts into Easy-to-Read
Beatriz Botella-Gil | Isabel Espinosa-Zaragoza | Paloma Moreda Pozo | Manuel Palomar
Automatic Text Simplification (ATS) has emerged as a key area of research within the field of Natural Language Processing, aiming to improve access to information by reducing the linguistic complexity of texts. Simplification can be applied at various levels—lexical, syntactic, semantic, and stylistic—and must be tailored to meet the needs of different target audiences, such as individuals with cognitive disabilities, low-literacy readers, or non-native speakers. This work introduces a tool that automatically adapts Spanish texts into Easy-to-Read format, enhancing comprehension for people with cognitive or reading difficulties. The proposal is grounded in a critical review of existing Spanish-language resources and addresses the need for accessible, well-documented solutions aligned with official guidelines, reinforcing the potential of text simplification as a strategy for inclusion.
QuARK: LLM-Based Domain-Specific Question Answering Using Retrieval Augmented Generation and Knowledge Graphs
Edward Burgin | Sourav Dutta | Mingxue Wang
Retrieval Augmented Generation (RAG) has been pivotal in the utilization of Large Language Models (LLMs) to improve the factuality of long-form question answering systems in industrial settings. Knowledge graphs (KGs) represent a linking of disparate information sources that potentially yield useful information for mitigating the issues of insufficient knowledge and hallucination within the LLM-RAG pipeline. However, the creation of domain-specific KGs is costly and usually requires a domain expert. To alleviate the above challenges, this work proposes QuARK, a novel domain-specific question answering framework that enhances the knowledge capabilities of LLMs by integrating structured KGs, thereby significantly reducing the reliance on the “generic” latent knowledge of LLMs. Here, we showcase how LLMs can be deployed not only to act in dynamic information retrieval and answer-generation frameworks, but also as flexible agents to automatically extract relevant entities and relations for the automated construction of domain-specific KGs. Crucially, we propose how the pairing of question decomposition and semantic triplet retrieval within RAG can enable optimal subgraph retrieval. Experimental evaluations of our framework on a public financial-domain dataset demonstrate that it enables a robust pipeline incorporating schema-free KGs within a RAG framework to improve overall accuracy by nearly 13%.
Classifying Emotions in Tweets from the Financial Market: A BERT-based Approach
Wesley Pompeu Carvalho | Norton Trevisan Roman
Behavioural finance emphasizes the relevance of investor sentiment and emotions in the pricing of financial assets. However, little research has examined how discrete emotions can be detected in text related to this domain, with extant work focusing mostly on sentiment instead. This study approaches this problem by describing a framework for emotion classification in tweets related to the stock market, written in Brazilian Portuguese. Emotion classifiers were then developed, based on Plutchik’s psychoevolutionary theory, by fine-tuning BERTimbau, a pre-trained BERT-based language model for Brazilian Portuguese, and applying it to an existing corpus of tweets, from the stock market domain, previously annotated with emotions. Each of Plutchik’s four emotional axes was modelled as a ternary classification problem. For each axis, 30 independent training iterations were executed using a repeated holdout strategy with different train/test splits in each iteration. In every iteration, hyperparameter tuning was performed via 10-fold stratified cross-validation on the training set to identify the best configuration. A final model was then retrained using the selected hyperparameters and evaluated on a hold-out test set, generating a distribution of macro-F1 scores on out-of-sample data. The results demonstrated statistically significant improvements over a stratified random baseline (Welch’s t-test, p < 0.001 across all axes), with macro-F1 scores ranging from 0.50 to 0.61. These findings point to the feasibility of using transformer-based models to capture emotional nuance in financial texts written in Portuguese and provide a reproducible framework for future research.
Detecting Changes in Mental Health Status via Reddit Posts in Response to Global Negative Events
Zenan Chen | Judita Preiss | Peter A. Bath
Detecting population-level mental health responses to global negative events through social media language remains understudied, despite its potential for public health surveillance. While pretrained language models (PLMs) have shown promise in mental health detection, their effectiveness in capturing event-driven collective psychological shifts – especially across diverse crisis contexts – is unclear. We present a prototype evaluation of three PLMs for identifying population mental health dynamics triggered by real-world negative events. We introduce two novel datasets specifically designed for this task. Our findings suggest that DistilBERT is better suited to the noisier global negative events data, while MentalRoBERTa shows the validity of the method on the tidier Covid-19 data. SHAP interpretability analysis of 500 randomly sampled posts revealed that mental-health-related vocabulary (anxiety, depression, worthless) emerged as the most influential linguistic markers for mental health classification.
APIO: Automatic Prompt Induction and Optimization for Grammatical Error Correction and Text Simplification
Artem Chernodub | Aman Saini | Yejin Huh | Vivek Kulkarni | Vipul Raheja
Recent advancements in large language models (LLMs) have enabled a wide range of natural language processing (NLP) tasks through simple prompt-based interactions. Consequently, several approaches have been proposed to engineer prompts that most effectively enable LLMs to perform a given task (e.g., chain-of-thought prompting). In settings with a well-defined metric to optimize model performance, Automatic Prompt Optimization (APO) methods have been developed to refine a seed prompt. Building on this line of work, we propose APIO, a simple but effective prompt induction and optimization approach for the tasks of Grammatical Error Correction (GEC) and Text Simplification, which does not rely on manually specified seed prompts. APIO achieves a new state-of-the-art performance for purely LLM-based prompting methods on these tasks. We make our data, code, prompts, and outputs publicly available.
Integrating Archaic and Regional Lexicons to Improve the Readability of Old Romanian Texts
Madalina Chitez | Roxana Rogobete | Cristina Aura Udrea | Karla Csürös | Ana-Maria Bucur | Mihai Dascalu
Access to age-appropriate texts is critical for young readers’ literacy acquisition. For limited-resourced languages, such as Romanian, this area remains under-researched. As such, we present ongoing work on improving readability for old Romanian texts by applying Large Language Models (LLMs). First, we compiled and cleaned a comprehensive list of archaic and regional terms from lexicographic sources, including DEX online and printed dictionaries. The cleaning process involved duplicate removal, orthographic normalization, context-based filtering, and manual review. Key challenges included distinguishing archaic forms from rare or poetic ones, resolving polysemous entries, and managing inconsistent labeling across sources. Second, LLMs were utilized to validate the archaic and regional nature of identified terms and replace them with modern equivalents, while also determining the appropriate reading level for both original and modified versions. Results show that through the replacement of archaic and regional terms, the appropriate age for the modified texts decreases by approximately 0.5 years for texts extracted from textbooks and canonical writings.
ExPE: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities
Aleksis Ioannis Datseris | Sylvia Vassileva | Ivan K. Koychev | Svetla Boytcheva
This paper introduces a novel approach to position embeddings in transformer models, named “Exact Positional Embeddings” (ExPE): an absolute positional embedding method that can extrapolate to sequences longer than those it was trained on. Traditional transformer models rely on absolute or relative position embeddings to incorporate positional information into token embeddings, and these often struggle to extrapolate to sequences longer than those seen during training. Our proposed method utilizes a novel embedding strategy that encodes exact positional information by overriding specific dimensions of the embedding vectors, thereby enabling a more precise representation of token positions. The proposed approach not only maintains the integrity of the original embeddings but also enhances the model’s ability to generalize to longer sequences. In causal language modeling, our ExPE embeddings significantly reduce perplexity compared to rotary and sinusoidal embeddings when tested on sequences longer than those used in training. The code and supplementary materials can be found in
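The central mechanism, overriding specific embedding dimensions with exact position information, can be sketched roughly as below. This is only our reading of the abstract under explicit assumptions (the last k dimensions are reserved and filled with a simple scaled-position code); it is not the paper's actual formulation, and PyTorch is assumed.

```python
import torch

# Hedged sketch of the "override specific dimensions" idea, not the ExPE paper's code.
def apply_exact_positions(token_emb: torch.Tensor, k: int = 4, base: float = 10_000.0) -> torch.Tensor:
    """token_emb: (seq_len, d_model). Overrides the last k dims with absolute-position info."""
    seq_len, d_model = token_emb.shape
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    scales = base ** (-torch.arange(k, dtype=torch.float32) / k)         # (k,) hypothetical scaling
    out = token_emb.clone()
    out[:, d_model - k:] = positions * scales                            # override, do not add
    return out

if __name__ == "__main__":
    emb = torch.randn(8, 16)
    print(apply_exact_positions(emb)[:, -4:])  # the overridden positional slots
```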
End-to-End Deep Learning for Named Entity Recognition and Relation Extraction in Gut-Brain Axis PubMed Abstracts
Aleksis Ioannis Datseris | Mario Kuzmanov | Ivelina Nikolova-Koleva | Dimitar Taskov | Svetla Boytcheva
This is a comparative study tackling named entity recognition and relation extraction from PubMed abstracts with focus on the gut-brain interplay. The proposed systems for named entity recognition cover a range of models and techniques from traditional gazetteer-based approaches, transformer-based approaches, transformer domain adaptation, large models pre-training as well as LLM prompting. The best performing model among these achieves 82.53% F1-score. The relation extraction task is addressed with ATLOP and LLMs and their best results reach F1 up to 63.80% on binary relation extraction, 89.40% on ternary tag-based relation extraction and 40.32% on ternary mention-based relation extraction.
Enabling On-Premises Large Language Models for Space Traffic Management
Enrique De Alba
Natural language processing systems leveraging on-premises large language models (LLMs) can translate natural language into structured JSON commands for Space Traffic Management (STM) systems. While cloud-based LLMs excel at this task, security constraints necessitate local deployment, requiring evaluation of smaller on-premises models. We demonstrate that resource-efficient 7B-parameter models can achieve high accuracy for STM command generation through a two-stage pipeline. Our pipeline first classifies objectives, then generates schemas. Empirically, we observe that initial classification accuracy strongly influences overall performance, with failures cascading to the generation stage. We demonstrate that quantization disproportionately increases structural errors compared to semantic errors across 405 objectives. The best quantized model (Falcon3-7B-GPTQ) shows a 3.45% accuracy drop, primarily from structural errors. Our findings highlight limitations in how model compression affects applications that require syntactic validity. More broadly, we explore the feasibility of LLM deployment in air-gapped environments while uncovering how quantization asymmetrically impacts structured output generation.
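The two-stage pipeline described above (objective classification followed by schema generation, with structural errors distinguished from semantic ones) can be illustrated roughly as follows. The classifier and generator callables are hypothetical stand-ins, not the author's system; the concrete part is the JSON validity check that separates structural failures from semantic ones.

```python
import json
from typing import Callable

def translate_to_command(utterance: str,
                         classify: Callable[[str], str],
                         generate_json: Callable[[str, str], str]) -> dict:
    """Stage 1: label the objective type. Stage 2: generate JSON and check its structure."""
    objective = classify(utterance)            # e.g. a hypothetical label such as "conjunction_screening"
    raw = generate_json(utterance, objective)
    try:
        return json.loads(raw)                 # structural validity check
    except json.JSONDecodeError as err:
        raise ValueError(f"structural error in generated command: {err}") from err

if __name__ == "__main__":
    toy_classify = lambda text: "conjunction_screening"
    toy_generate = lambda text, obj: '{"objective": "' + obj + '", "target": "SAT-1234"}'
    print(translate_to_command("Screen SAT-1234 for conjunctions tonight", toy_classify, toy_generate))
```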
Top Ten from Lakhs: A Transformer-based Retrieval System for Identifying Previously Fact-Checked Claims across Multiple Languages
Srijani Debnath | Pritam Pal | Dipankar Das
The efficient identification of previously fact-checked claims across multiple languages is a challenging task. It can be time-consuming for professional fact-checkers even within a single language, and it becomes much more difficult to perform manually when the claim and the fact-check may be in different languages. This paper presents a systematic approach for retrieving the top-k relevant fact-checks for a given post in both monolingual and cross-lingual setups. We use two transformer-based fact-checked claim retrieval frameworks that share a common preprocessing pipeline but differ in their underlying encoder implementations: TIDE, a TensorFlow-based custom dual encoder applied to English-translated data, and PTEX, a PyTorch-based encoder operating on both English-translated and original-language inputs. We also introduce a lightweight post-processing technique based on a textual feature, Keyword Overlap Count, applied via reranking on top of the transformer-based frameworks. Training and evaluation on a large multilingual corpus show that the fine-tuned E5-Large-v2 model in the PTEX framework yields the best monolingual-track performance, achieving an average Success@10 score of 0.8846, and that the same model with the post-processing technique achieves an average Success@10 score of 0.7393, the best performance in the cross-lingual track.
Evaluating Bilingual Lexicon Induction without Lexical Data
Michaela Denisová | Pavel Rychly
Bilingual Lexicon Induction (BLI) is a fundamental task in cross-lingual word embedding (CWE) evaluation, aimed at retrieving word translations from monolingual corpora in two languages. Despite the task’s central role, existing evaluation datasets based on lexical data often contain biases such as a lack of morphological diversity, frequency skew, semantic leakage, and overrepresentation of proper names, which undermine the validity of reported performance. In this paper, we propose a novel, language-agnostic evaluation methodology that entirely eliminates the dependency on lexical data. By training two sets of monolingual word embeddings (MWEs) using identical data and algorithms but with different weight initialisations, we enable the assessment on the BLI task without being affected by the quality of the evaluation dataset. We evaluate three baseline CWE models and analyse the impact of key hyperparameters. Our results provide a more reliable and bias-free perspective on CWE models’ performance.
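A minimal sketch of the evaluation idea as we read it from the abstract (not the authors' code): two embedding spaces trained on identical data with different seeds share a vocabulary, so each word is its own gold translation and precision@1 can be computed without any lexical resource. The toy second space below is simulated by adding noise rather than by re-training.

```python
import numpy as np

def precision_at_1(src: np.ndarray, tgt: np.ndarray) -> float:
    """src, tgt: (vocab, dim); row i in both matrices represents the same word."""
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (src_n @ tgt_n.T).argmax(axis=1)      # cosine nearest neighbour
    return float((nearest == np.arange(len(src))).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    space_a = rng.normal(size=(1000, 50))
    space_b = space_a + rng.normal(scale=0.1, size=(1000, 50))  # stand-in for a re-trained space
    print(f"P@1: {precision_at_1(space_a, space_b):.3f}")
```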
Utilizing Large Language Models for Focused Conversational Assistants
Shruti Dhavalikar | Karthika Vijayan
A focused conversational assistant (FCA) realizes human-computer interaction bounded in a predefined scope of operation. With the advent of large language models (LLMs), it has become imperative to integrate them in conversational assistants (CAs). However, an LLM can become largely inaccurate in an FCA with multiple responsibilities, like information extraction, scope adherence and response generation. In this paper, we attempt to use an LLM for an FCA while constricting the scope of operation and maintaining a guided flow of conversation. We present a strategic combination of discriminative AI methods and generative AI models. Our methodology includes (i) a component of natural language understanding (NLU) operating discriminatively, (ii) a conditional intent-based routing of user messages to appropriate response generators, and (iii) response generators which are either custom ones or open-source LLMs. The combination of these three strategies realizes a hybrid AI system, assisting the FCA in adhering to the defined scope and maintaining context and dialogue flow.
AntiSemRO: Studying the Romanian Expression of Antisemitism
Anca Dinu | Andreea C. Moldovan | Adina Marincea
This study introduces an annotated dataset for the study of antisemitic hate speech and attitudes towards Jewish people in Romanian, collected from social media. We performed two types of annotation: with three simple tags (‘Neutral’, ‘Positive’, ‘Negative’), and with five more refined tags (‘Neutral’, ‘Ambiguous’, ‘Jewish Community’, ‘Solidarity’, ‘Zionism’, ‘Antisemitism’). We perform several experiments on this dataset: clusterization, automatic classification using classical machine learning models and transformer-based models, and sentiment analysis. The three-class clusterization produced well-grouped clusters, while, as expected, the five-class clusterization produced moderately overlapping groups, except for ‘Antisemitism’, which is well away from the other four groups. We obtained a good F1-score of 0.78 in the three-class classification task with a Romanian BERT model and a moderate F1-score of 0.62 for the five-class classification task with an SVM model. The lowest negative sentiment was contained in the ‘Neutral’ class, while the highest was in ‘Zionism’, and not in ‘Antisemitism’, as expected. The same ‘Zionism’ category also displays the highest level of positive sentiment.
Towards a Map of Related Words in Romance Languages
Liviu P. Dinu | Ana Sabina Uban | Ioan-Bogdan Iordache | Claudia Vlad | Simona Georgescu | Laurentiu Zoicas | Anca Dinu
We propose a map of cognate and borrowing usage in Romance languages, having as a starting point the pairs of cognates and borrowings between any two of these idioms from RoBoCoP, the largest database built upon electronic dictionaries containing etymological information for Portuguese, Spanish, French, Italian and Romanian. Bearing in mind that words are used and evolve in language communities over time, we determine, on the basis of the pairs extracted from RoBoCoP, how many of them occur, and with what frequency, in the languages in use, based on three online parallel corpora that contain all five Romance languages: Wikipedia, Europarl – proceedings of the European Parliament – and RomCro2.0 – literary texts in different languages, translated into the Romance languages and Croatian.
Decoding Emotion in Ancient Poetry: Leveraging Generative Models for Classical Chinese Sentiment Analysis
Quanqi Du | Loic De Langhe | Els Lefever | Veronique Hoste
This study explores the use of generative language models for sentiment analysis of classical Chinese poetry, aiming to better understand emotional expression in literary texts. Using the FSPC dataset, we evaluate two models, Qwen-2.5 and LLaMA-3.1, under various prompting strategies. Initial experiments show that the base models struggle with task-specific instructions. By applying different instruction tuning strategies with Low-Rank Adaptation (LoRA), we significantly enhance the models’ ability to follow task instructions and capture poetic sentiment, with LLaMA-3.1 achieving the best results (67.10% accuracy, 65.42% macro F1) and demonstrating competitive performance against data-intensive, domain-adapted baselines. We further examine the effects of prompt language and multi-task learning, finding that English prompts outperform Chinese ones. These results highlight the promise of instruction-tuned generative models for sentiment analysis of classical Chinese poetry, and underscore the importance of prompt formulation in literary understanding tasks.
GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs
Marius Dumitran | Angela Dumitran | Alexandra Mihaela Danila
Large language models (LLMs) have revolutionised NLP, yet their pedagogical value for low-resource languages remains unclear. We present GRILE, the first open benchmark of 1,151 multiple-choice questions harvested from Romanian high-stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state-of-the-art multilingual and Romanian-specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically faithful explanations. While Gemini 2.5 Pro reaches 83% accuracy, most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM 3 orthographic norms. All data, code and a public web demo are released to catalyse future research. Our findings expose open challenges for trustworthy educational NLP in low-resource settings and establish GRILE as a new test-bed for controllable explanation generation and evaluation.
PerSpaCor: Correcting Space and ZWNJ Errors in Persian Text with Transformer Models
Matin Ebrahimkhani | Ebrahim Ansari
Precision and clarity are essential qualities of written texts; however, Persian script, rooted in Arabic script, presents unique challenges that can compromise readability and correctness. In particular, the use of space and half-space—specifically the Zero Width Non-Joiner (ZWNJ)—is essential for proper character separation in Persian typography. This research introduces four models for correcting spacing and ZWNJ errors at the character level, thereby improving both readability and textual accuracy. By fine-tuning BERT-based transformer models on Bijankhan and Peykare corpora—comprising over 12.7 million preprocessed and annotated words—and formulating the task as sequence labeling, the best model achieves a macro-average F1-score of 97.26%. An interactive corrector that incorporates user input further improves performance to a macro-average F1-score of 98.38%. These results demonstrate the effectiveness of advanced language models in enhancing Persian text quality and highlight their applicability to real-world natural language processing tasks.
Reddit-V: A Virality Prediction Dataset and Zero-Shot Evaluation with Large Language Models
Samir El-amrany | Matthias R. Brust | Salima Lamsiyah | Pascal Bouvry
We present Reddit-V, a new dataset designed to advance research on social media virality prediction in natural language processing. The dataset consists of over 27,000 Reddit posts, each enriched with images, textual content, and pre-engagement metadata such as post titles, categories, sentiment scores, and posting times. As an initial benchmark, we evaluate several instruction-tuned large language models (LLMs) in a zero-shot setting, prompting them with post titles and metadata to predict post virality. We then fine-tune two multimodal models, CLIP and IDEFICS, to assess whether incorporating visual context enhances predictive performance. Our results show that zero-shot LLMs perform poorly, whereas the fine-tuned multimodal models achieve better performance. Specifically, CLIP outperforms the best-performing zero-shot LLM (CodeLLaMA) by 3%, while IDEFICS achieves a 7% improvement over the same baseline, highlighting the importance of visual features in virality prediction. We release the Reddit-V dataset and our evaluation results to facilitate further research on multimodal and text-based virality prediction. Our dataset and code will be made publicly available on GitHub.
Simplifications Are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions
Lukas Ellinger | Miriam Anschütz | Georg Groh
Large Language Models (LLMs) can provide accurate word definitions and explanations for any context. However, the scope of the definition changes for different target groups, like children or language learners. This is especially relevant for homonyms—words with multiple meanings—where oversimplification might risk information loss by omitting key senses, potentially misleading users who trust LLM outputs. We investigate how simplification impacts homonym definition quality across three target groups: Normal, Simple, and ELI5. Using two novel evaluation datasets spanning multiple languages, we test DeepSeek v3, Llama 4 Maverick, Qwen3-30B A3B, GPT-4o mini, and Llama 3.1 8B via LLM-as-Judge and human annotations. Our results show that simplification drastically degrades definition completeness by neglecting polysemy, increasing the risk of misunderstanding. Fine-tuning Llama 3.1 8B with Direct Preference Optimization substantially improves homonym response quality across all prompt types. These findings highlight the need to balance simplicity and completeness in educational NLP to ensure reliable, context-aware definitions for all learners.
Multi-LLM Text Summarization
Jiangnan Fang | Cheng-Tse Liu | Jieun Kim | Yash Bhedaru | Ethan Liu | Nikhil Singh | Nedim Lipka | Puneet Mathur | Nesreen K. Ahmed | Franck Dernoncourt | Ryan Rossi | Hanieh Deilamsalehy
In this work, we propose a Multi-LLM summarization framework and investigate two multi-LLM strategies: centralized and decentralized. Our multi-LLM summarization framework has two fundamentally important steps at each round of conversation, generation and evaluation, and these steps differ depending on whether the decentralized or the centralized strategy is used. In both strategies, k different LLMs generate diverse summaries of the text. During evaluation, however, our centralized approach leverages a single LLM to evaluate the summaries and select the best one, whereas k LLMs are used for decentralized multi-LLM summarization. Overall, we find that our multi-LLM summarization approaches significantly outperform baselines that leverage only a single LLM by up to 3x. These results indicate the effectiveness of multi-LLM approaches for summarization.
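The generate-then-evaluate round described above can be sketched as follows; the callables standing in for LLM calls are hypothetical placeholders (our illustration, not the authors' implementation), and only the control flow of the centralized versus decentralized strategies is shown.

```python
from collections import Counter
from typing import Callable, List

def centralized_round(text: str,
                      generators: List[Callable[[str], str]],
                      judge: Callable[[str, List[str]], int]) -> str:
    """k models each produce a summary; a single judge picks the best index."""
    candidates = [generate(text) for generate in generators]
    return candidates[judge(text, candidates)]

def decentralized_round(text: str,
                        generators: List[Callable[[str], str]],
                        voters: List[Callable[[str, List[str]], int]]) -> str:
    """k models generate and k models vote; the most-voted candidate wins."""
    candidates = [generate(text) for generate in generators]
    votes = Counter(vote(text, candidates) for vote in voters)
    return candidates[votes.most_common(1)[0][0]]

if __name__ == "__main__":
    # Toy stand-ins for real LLM calls.
    fake_models = [lambda t, i=i: f"summary-{i}: {t[:20]}" for i in range(3)]
    judge = lambda t, cands: min(range(len(cands)), key=lambda i: len(cands[i]))
    print(centralized_round("A long document about multi-LLM summarization.", fake_models, judge))
```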
EDAudio: Easy Data Augmentation for Dialectal Audio
Lea Fischbach | Akbar Karimi | Alfred Lameli | Lucie Flek
We investigate lightweight and easily applicable data augmentation techniques for dialectal audio classification. We evaluate four main methods, namely pitch shifting, interval removal, background noise insertion and interval swapping, as well as several subvariants, on recordings from 20 German dialects. Each main method is tested across multiple hyperparameter combinations, including augmentation length, coverage ratio and number of augmentations per original sample. Our results show that frequency-based techniques, particularly frequency masking, consistently yield performance improvements, while others such as time masking or speaker-based insertion can negatively affect the results. Our comparative analysis identifies which augmentations are most effective under realistic conditions, offering simple and efficient strategies to improve dialectal speech classification.
Authorship Verification Using Cloze Test with Large Language Models
Tomáš Foltýnek | Tomáš Kancko | Pavel Rychly
Assignment outsourcing, also known as contract cheating, occurs when a student outsources an assessment task or a part of it to a third party. It has been one of the most pressing ethical issues in university education and was further exacerbated by the wide availability of chatbots based on large language models. We propose a method that has the potential to verify the authorship of a document in question by filling in a cloze test. A cloze test with 10 items selected by our method can be used as a classifier with an accuracy of 0.988 and an F1 score of 0.937. We also describe a general method for building a cloze-test-based classifier when the probability of authors and non-authors correctly filling in cloze items is known.
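The closing sentence can be made concrete with a toy binomial model (our own sketch with made-up probabilities, not the paper's derivation): given the probability that authors and non-authors fill an item correctly, a threshold on the number of correct items out of n yields a classifier whose expected accuracy can be computed directly.

```python
from math import comb

def binom_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def expected_accuracy(n: int, p_author: float, p_other: float, threshold: int) -> float:
    """Expected accuracy of 'accept as author if >= threshold items correct', balanced classes."""
    tpr = binom_tail(n, p_author, threshold)   # genuine authors accepted
    fpr = binom_tail(n, p_other, threshold)    # non-authors wrongly accepted
    return 0.5 * (tpr + (1.0 - fpr))

if __name__ == "__main__":
    # Hypothetical per-item probabilities for authors and non-authors.
    for k in range(1, 11):
        print(k, round(expected_accuracy(10, p_author=0.8, p_other=0.3, threshold=k), 3))
```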
A Culturally-Rich Romanian NLP Dataset from “Who Wants to Be a Millionaire?” Videos
Alexandru Ganea | Antonia-Adelina Popovici | Marius Dumitran
Large Language Models (LLMs) demonstrate varying performance across languages and cultural contexts. This study introduces a novel, culturally-rich, multilingual dataset derived from video recordings of the Romanian game show “Who Wants to Be a Millionaire?” (Vrei să fii Milionar?). We employed an innovative process combining optical character recognition (OCR), automated text extraction, and manual verification to collect question-answer pairs, enriching them with metadata including question domain (e.g., biology, history), cultural relevance (Romanian-specific vs. international), and difficulty. Benchmarking state-of-the-art LLMs, including Romanian-adapted models, on this dataset revealed significant performance disparities: models consistently achieve higher accuracy (80-95%) on international questions compared to Romanian-specific cultural questions (50-75%). We further investigate these differences through experiments involving machine translation of Romanian questions into English and cross-lingual tests using a comparable dataset in French. Our findings underscore the impact of cultural context and data source on LLM performance and offer practical insights for building robust, culturally-aware multilingual NLP systems, especially in educational domains. The dataset is publicly available.
pdf
bib
abs
Graph-based RAG for Low-Resource Aromanian–Romanian Translation
Laurentiu G. Ghetoiu
|
Sergiu Nisioi
Aromanian, a linguistically and culturally significant yet low-resource Romance language, poses substantial challenges in computational linguistic research due to its limited NLP resources and non-standardized orthography. In this paper, we present an experimental study aimed at translating Aromanian texts into Romanian using a variety of modern NLP methodologies. We leverage two key resources: a parallel corpus consisting of approximately 3,000 sentence-aligned short stories and a dictionary of over 28,000 Aromanian-Romanian word pairs. Our approaches include Retrieval-Augmented Generation (RAG) supported by a graph-based alignment database, fine-tuning multilingual transformer models (specifically Meta’s NLLB), and parameter-efficient fine-tuning techniques such as LoRA applied to LLaMA-derived models. Evaluations using standard metrics (BLEU, chrF) demonstrate varied effectiveness across these methodologies, highlighting the strong performance of NLLB for general translation tasks, while RAG excels in translating familiar content. Our findings underline the complexities inherent in low-resource language translation and provide valuable insights into effective digital preservation and NLP adaptation strategies for underrepresented languages.
pdf
bib
abs
Differential Robustness in Transformer Language Models: Empirical Evaluation under Adversarial Text Attacks
Taniya Gidatkar
|
Oluwaseun Ajao
|
Matthew Shardlow
This study evaluates the resilience of large language models (LLMs) against adversarial attacks, specifically focusing on Flan-T5, BERT, and RoBERTa-Base. Using systematically designed adversarial tests through TextFooler and BERTAttack, we found significant variations in model robustness. RoBERTa-Base and Flan-T5 demonstrated remarkable resilience, maintaining accuracy even when subjected to sophisticated attacks, with attack success rates of 0%. In contrast, BERT-Base showed considerable vulnerability, with TextFooler achieving a 93.75% success rate in reducing model accuracy from 48% to just 3%. Our research reveals that while certain LLMs have developed effective defensive mechanisms, these safeguards often require substantial computational resources. This study contributes to the understanding of LLM security by identifying existing strengths and weaknesses in current safeguarding approaches and proposes practical recommendations for developing more efficient and effective defensive strategies.
pdf
bib
abs
An Annotation Scheme for Factuality and Its Application to Parliamentary Proceedings
Gili Goldin
|
Shira Wigderson
|
Ella Rabinovich
|
Shuly Wintner
Factuality assesses the extent to which a language utterance relates to real-world information; it determines whether utterances correspond to facts, possibilities, or imaginary situations, and as such, it is instrumental for fact checking. Factuality is a complex notion that relies on multiple linguistic signals, and has been studied in various disciplines. We present a complex, multi-faceted annotation scheme of factuality that combines concepts from a variety of previous works. We developed the scheme for Hebrew, but we trust that it can be adapted to other languages. We also present a set of almost 5,000 sentences in the domain of parliamentary discourse that we manually annotated according to this scheme. We report on inter-annotator agreement, and experiment with various approaches to automatically predict (some features of) the scheme, in order to extend the annotation to a large corpus.
pdf
bib
abs
Can We Predict Innovation? Narrow Experts versus Competent Generalists
Amir Hazem
|
Motohashi Kazuyuki
In this paper, we investigate the role of large language models in predicting innovation. We contrast two main paradigms: i) narrow experts, which consist of supervised and semi-supervised models trained or fine-tuned on a specific task, and ii) competent generalists, which consist of large language models with zero-shot and few-shot learning. We define the task of innovation modeling and present the first attempt to understand the transformation from research to innovation. We focus on product innovation, which can be defined as the process of transforming technology into a product or service and bringing it to the market. Our extensive empirical evaluation shows that most existing pretrained models are not suited to and perform poorly on the innovation modeling task. We also show that injecting research information helps improve the alignment from technology to the market. Finally, we propose a new methodology and fine-tuning strategies that achieve significant performance boosts over the baselines.
pdf
bib
abs
Arabic to Romanian Machine Translation: A Case Study on Distant Language Pairs
Ioan Alexandru Hirica
|
Stefana Arina Tabusca
|
Sergiu Nisioi
This paper investigates machine translation between two linguistically distant languages, Arabic and Romanian, with a focus on translating from Arabic to Romanian. Dataset cleaning techniques are addressed, offering insights into the impact of translation for a language pair with limited resources. Using publicly available corpora (e.g., OPUS) and manually translated diplomatic texts, filtering methods are applied, such as duplicate removal, embedding similarity analysis (LEALLA), and Large Language Model (LLM)-based validation (Gemini-flash-002). Transformer models are trained and evaluated with diverse preprocessing pipelines that incorporate subword tokenization. Additionally, the performance of a fine-tuned LLM is assessed for this task and compared to that of its pre-trained counterpart. Despite computational limitations, the results emphasize the importance of targeted preprocessing and model adaptation in improving Arabic-Romanian translation quality.
pdf
bib
abs
BiGCAT: A Graph-Based Representation Learning Model with LLM Embeddings for Named Entity Recognition
Md. Akram Hossain
|
Abdul Aziz
|
Muhammad Anwarul Azim
|
Abu Nowshed Chy
|
Md Zia Ullah
|
Mohammad Khairul Islam
Named entity recognition from financial text is more challenging than from non-financial text because of word ambiguity, the large number of unknown corporation names, and frequent abbreviations. However, models often treat named entities in a linear sequence fashion, which might obscure the model’s ability to capture complex hierarchical relationships among the entities. In this paper, we propose a novel named entity recognition model, BiGCAT, which integrates large language model (LLM) embeddings with graph-based representation, where the contextual information captured by the language model and graph representation learning can complement each other. The method builds a spanning graph with nodes representing word spans and edges weighted by LLM embeddings, optimized using a combination of graph neural networks, specifically a graph-convolutional network (GCN) and a graph-attention network (GAT). This approach effectively captures the hierarchical dependencies among the spans. Our proposed model outperformed the state-of-the-art by 10% and 18% on the two publicly available datasets FiNER-ORD and FIN, respectively, in terms of weighted F1 score. The code is available at: https://github.com/Akram1871/BiGCAT-RANLP-2025.
pdf
bib
abs
Measuring How (Not Just Whether) VLMs Build Common Ground
Saki Imai
|
Mert Inan
|
Anthony B. Sicilia
|
Malihe Alikhani
Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT-4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.
pdf
bib
abs
SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation
Saki Imai
|
Mert Inan
|
Anthony B. Sicilia
|
Malihe Alikhani
Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language—such as facial expressions, spatial grammar, and prosody—but also makes it hard to pinpoint whether evaluation errors come from sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.
pdf
bib
abs
Alignment of Historical Manuscript Transcriptions and Translations
Maarten Janssen
|
Piroska Lendvai
|
Anna Jouravel
Using an XML-based framework, we compiled a gold standard for alignments in five primary as well as derived texts related to De Lepra ad Sistelium by Methodius Olympius. These comprise diplomatic transcripts, editions, and translations of this work, involving both historical and modern languages. Using the TEITOK corpus platform, we created sentence-level gold standard alignments for our parallel and comparable texts, and applied both neural and classical alignment methods (SentenceBERT, Hunalign, Awesome-Align). We evaluated the methods in terms of Alignment Error Rate. We show that for alignment of our historical texts, Hunalign performs better than deep-learning-based methods.
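As an illustration of how an embedding-based aligner scores candidate sentence pairs, here is a minimal sketch using the sentence-transformers library; the model name, similarity threshold, and greedy one-to-one matching are illustrative assumptions, not the exact pipeline evaluated in the paper.

```python
# Hedged sketch: cosine-similarity sentence alignment with a multilingual
# sentence encoder; model name and greedy 1-1 strategy are assumptions.
from sentence_transformers import SentenceTransformer, util

def greedy_align(src_sents, tgt_sents,
                 model_name="paraphrase-multilingual-MiniLM-L12-v2",
                 threshold=0.5):
    model = SentenceTransformer(model_name)
    src_emb = model.encode(src_sents, convert_to_tensor=True, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(src_emb, tgt_emb)          # (len(src), len(tgt)) similarity matrix
    pairs, used = [], set()
    for i in range(len(src_sents)):
        j = int(sims[i].argmax())
        # Accept the best target sentence if it is similar enough and unused.
        if float(sims[i][j]) >= threshold and j not in used:
            pairs.append((i, j, float(sims[i][j])))
            used.add(j)
    return pairs
```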
pdf
bib
abs
Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Nevidu Jayatilleke
|
Nisansa de Silva
Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.
pdf
bib
abs
Detecting Gender Stereotypical Language Using Model-agnostic and Model-specific Explanations
Manuela Nayantara Jeyaraj
|
Sarah Jane Delany
AI models learn gender-stereotypical language from human data. So, understanding how well different explanation techniques capture diverse language features that suggest gender stereotypes in text can be useful in identifying stereotypes that could potentially lead to gender bias. The influential words identified by four explanation techniques (LIME, SHAP, Integrated Gradients (IG) and Attention) in a gender stereotype detection task were compared with words annotated by human evaluators. All techniques emphasized adjectives and verbs related to characteristic traits and gender roles as the most influential words. LIME was best at detecting explicitly gendered words, while SHAP, IG and Attention showed stronger overall alignment and considerable overlap. A combination of these techniques, combining the strengths of model-agnostic and model-specific explanations, performs better at capturing gender-stereotypical language. Extending to hate speech and sentiment prediction tasks, annotator agreement suggests these tasks to be more subjective while explanation techniques can better capture explicit markers in hate speech than the more nuanced gender stereotypes. This research highlights the strengths of different explanation techniques in capturing subjective gender stereotype language in text.
pdf
bib
abs
Reversing Causal Assumptions: Explainability in Online Sports Dialogues
Asteria Kaeberlein
|
Malihe Alikhani
Prior XAI research often assumes inputs must be “causes” and outputs must be “effects”, severely limiting applicability to analyzing behaviors that emerge as reactions or consequences. Many linguistic tasks, such as dialogues and conversations, involve such behaviors. To address this, we propose that the assumed causality from inputs to outputs can be reversed and still remain valid by using outputs that cause changes in features. We show how this enables analysis of complex feature sets through simpler metrics, propose a framework that is generalizable to most linguistic tasks, and highlight best practices for applying our framework. By training a predictive model from complex effects to simple causes, we apply feature attributions to estimate how the inputs change with the outputs. We demonstrate an application of this by studying sports fans’ comments made during a game and compare those comments to a simpler metric, win probability. We also expand on a prior study of intergroup bias, demonstrating how our framework can uncover behaviors that other XAI methods may overlook. We discuss the implications of these findings for advancing interpretability in computational linguistics and improving data-driven decision-making in social contexts.
pdf
bib
abs
How LLMs Influence Perceived Bias in Journalism
Asteria Kaeberlein
|
Malihe Alikhani
As the use of generative AI tools in journalistic writing becomes more common, reporters have expressed growing concerns about how it may introduce bias to their works. This paper investigates how the integration of large language models (LLMs) into journalistic writing, both as editors and independent ‘authors’, can alter user perception of bias in media. We show novel insights into how human perception of media bias differs from automatic evaluations. Through human evaluations comparing original human-authored articles, AI-edited articles, and AI-generated articles, we show that while LLMs rarely introduce new bias and often trend towards neutrality, this supposedly ‘safe’ behavior can have harmful impacts. This is most observable in sensitive human rights contexts, where the AI’s neutral and measured tone can reduce the representation of relevant voices and present misinformation in a more convincing manner. Furthermore, we demonstrate the existence of previously unidentified patterns that existing automated bias detection methods fail to accurately capture. We underscore the critical need for human-centered evaluation frameworks in AI-assisted journalism by introducing human evaluations and contrasting against a state-of-the-art automated bias detection system.
pdf
bib
abs
Prompting Techniques for Reducing Social Bias in LLMs through System 1 and System 2 Cognitive Processes
Mahammed Kamruzzaman
|
Gene Louis Kim
Dual process theory posits that human cognition arises via two systems: System 1, a quick, emotional, and intuitive process that is subject to cognitive biases, and System 2, a slow, onerous, and deliberate process. Prior research on LLMs found that using chain-of-thought (CoT) prompting in LLMs, which has often been compared to System 2 reasoning, can lead to reduced gender bias. Along these lines, we investigate the relationship between bias, CoT prompting, direct debiasing, and dual process theory modeling in LLMs. We compare zero-shot CoT, debiasing, and dual process theory-based prompting strategies on two bias datasets spanning nine different social bias categories. We incorporate human and machine personas to determine whether LLM modeling of the effects of dual process theory exists independently of explicit persona models or is tied to the LLM’s modeling of human-like generation. We find that a human persona, debiasing, System 2, and CoT prompting all tend to reduce social biases in LLMs, though the best combination of features depends on the exact model and bias category, resulting in up to a 33 percent drop in stereotypical judgments by an LLM.
pdf
bib
abs
Performance Gaps in Acted and Naturalistic Speech: Insights from Speech Emotion Recognition Strategies on Customer Service Calls
Lily Kawaoto
|
Hita Gupta
|
Ning Yu
|
Daniel Dakota
Current research in speech emotion recognition (SER) often uses speech data produced by actors, which does not always best represent naturalistic speech. This can lead to challenges when applying models trained on such data sources to real-world data. We investigate the application of SER models developed on acted data and more naturalistic podcasts to service call data, with a particular focus on anger detection. Our results indicate that while there is noticeable performance degradation from models trained on acted data to the naturalistic data, weighted multimodal models developed on existing SER datasets, both acted and natural, show promise but are limited in their ability to recognize emotions that do not discernibly cluster.
pdf
bib
abs
Synthetic vs. Gold: The Role of LLM Generated Labels and Data in Cyberbullying Detection
Arefeh Kazemi
|
Sri Balaaji Natarajan Kalaivendan
|
Joachim Wagner
|
Hamza Qadeer
|
Kanishk Verma
|
Brian Davis
Cyberbullying (CB) presents a pressing threat, especially to children, underscoring the urgent need for robust detection systems to ensure online safety. While large-scale datasets on online abuse exist, there remains a significant gap in labeled data that specifically reflects the language and communication styles used by children. The acquisition of such data from vulnerable populations, such as children, is challenging due to ethical, legal and technical barriers. Moreover, annotating these datasets relies heavily on human effort, which not only strains resources but also raises significant concerns due to annotators’ exposure to harmful content. In this paper, we address these challenges by leveraging Large Language Models (LLMs) to generate synthetic data and labels. Our experiments demonstrate that synthetic data enables BERT-based CB classifiers to achieve performance close to that of classifiers trained on fully authentic datasets (75.8% vs. 81.5% accuracy). Additionally, LLMs can effectively label authentic yet unlabeled data, allowing BERT classifiers to attain a comparable performance level (79.1% vs. 81.5% accuracy). These results highlight the potential of LLMs as a scalable, ethical, and cost-effective solution for generating data for CB detection.
pdf
bib
abs
FreeTxt: Analyse and Visualise Multilingual Qualitative Survey Data for Cultural Heritage Sites
Nouran Khallaf
|
Ignatius Ezeani
|
Dawn Knight
|
Paul Rayson
|
Mo El-Haj
|
John Vidler
|
James Davies
|
Fernando Alva-Manchego
We introduce FreeTxt, a free and open-source web-based tool designed to support the analysis and visualisation of multilingual qualitative survey data, with a focus on low-resource languages. Developed in collaboration with stakeholders, FreeTxt integrates established techniques from corpus linguistics with modern natural language processing methods in an intuitive interface accessible to non-specialists. The tool currently supports bilingual processing and visualisation of English and Welsh responses, with ongoing extensions to other languages such as Vietnamese. Key functionalities include semantic tagging via PyMUSAS, multilingual sentiment analysis, keyword and collocation visualisation, and extractive summarisation. User evaluations with cultural heritage institutions demonstrate the system’s utility and potential for broader impact.
pdf
bib
abs
GPT-Based Lexical Simplification for Multi-Word Expressions Using Prompt Engineering
Sardar Khan Khayamkhani
|
Matthew Shardlow
Multiword Lexical Simplification (MWLS) is the task of replacing a complex phrase in a sentence with a simpler alternative. Whereas previous approaches to MWLS made use of the BERT language model, we make use of the Generative Pre-trained Transformer architecture. Our approach employs Large Language Models in an auto-regressive format, making use of prompt engineering and few-shot learning to develop new strategies for the MWLS task. We experiment with several GPT-based models and differing experimental settings, including varying the number of requested examples, changing the base model type, adapting the prompt, and zero-shot, one-shot and k-shot in-context learning. We show that a GPT-4o model with k-shot in-context learning (k=6) demonstrates state-of-the-art performance for the MWLS1 dataset with NDCG=0.3143 and PREC@5=0.1048, beating the previous BERT-based approach by a wide margin on several metrics and consistently across subsets. Our findings indicate that GPT-based models are superior to BERT-based models for the MWLS task.
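A minimal sketch of how a k-shot MWLS prompt might be assembled is shown below; the instruction wording and demonstration pairs are invented for illustration and are not the authors' prompt.

```python
# Hedged sketch: assembling a k-shot prompt for multiword lexical
# simplification; instruction text and demonstrations are invented.
def build_mwls_prompt(sentence: str, target_phrase: str,
                      demonstrations: list[tuple[str, str, str]]) -> str:
    lines = ["Replace the marked multiword expression with a simpler phrase "
             "that keeps the sentence's meaning."]
    for demo_sentence, demo_phrase, demo_simplification in demonstrations:
        lines.append(f"Sentence: {demo_sentence}")
        lines.append(f"Expression: {demo_phrase}")
        lines.append(f"Simpler alternative: {demo_simplification}")
    # The query instance ends with an open slot for the model to complete.
    lines.append(f"Sentence: {sentence}")
    lines.append(f"Expression: {target_phrase}")
    lines.append("Simpler alternative:")
    return "\n".join(lines)

demos = [("The committee will look into the matter.", "look into", "examine"),
         ("He decided to call off the meeting.", "call off", "cancel")]
print(build_mwls_prompt("She had to put up with constant noise.", "put up with", demos))
```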
pdf
bib
abs
Instruction-Tuning LLaMA for Synthetic Medical Note Generation in Swedish and English
Lotta Kiefer
|
Jesujoba Alabi
|
Thomas Vakili
|
Hercules Dalianis
|
Dietrich Klakow
The increasing capabilities of large language models (LLMs) have unlocked transformative potential for medical applications, but privacy constraints limit access to high-quality training data from electronic health records (EHRs). In response, we propose a framework to generate synthetic EHRs by instruction-tuning an LLM using descriptions of diagnosis codes. We show that this framework overcomes problems of prior approaches, such as diversity reduction and medical incoherence, while maintaining strong privacy protections. Utility was measured by training models to predict diagnosis codes for EHRs. Real data still has higher utility, but synthetic data approaches real data results with increasing dataset size. The differences in utility were most likely due to noise in the synthetic data. A user study involving medical professionals confirmed no significant loss in readability or medical coherence compared to the real EHRs, even though inter-annotator agreement is low. These findings establish synthetic EHRs as a viable alternative for privacy-preserving and scalable clinical NLP applications. We release our code on GitHub.
pdf
bib
abs
Output Trend Analysis in Semantic Classification of Katakana Words Using a Large Language Model
Kazuki Kodaki
|
Minoru Sasaki
In semantic classification of katakana words using a large language model (LLM), semantic divergences from the meanings of the original English words, such as Wasei-Eigo (Japanese-made English), may affect the accuracy of the model. In order to accurately capture the meaning of foreign words, we fine-tuned the LLM using data extracted from the BCCWJ (Balanced Corpus of Contemporary Written Japanese), analyzed the current accuracy and output trend of semantic classification for katakana words, and explored ways to improve the accuracy. The results of several experiments showed that fine-tuning was not effective for zero-shot learning, but in contrast, fine-tuning improved accuracy by about 10% for few-shot learning. Further analysis of the visualized data suggests trends related to words and meanings that the model struggles to classify correctly.
pdf
bib
abs
Domain Knowledge Distillation for Multilingual Sentence Encoders in Cross-lingual Sentence Similarity Estimation
Risa Kondo
|
Hiroki Yamauchi
|
Tomoyuki Kajiwara
|
Marie Katsurai
|
Takashi Ninomiya
We propose a domain adaptation method for multilingual sentence encoders. In domains requiring a high level of expertise, such as the medical and academic domains, domain-specific pre-trained models have been released in each language. However, no multilingual versions of these models exist, which prevents their application to cross-lingual information retrieval. Obviously, multilingual pre-training that involves developing in-domain corpora in each language is costly. Therefore, we efficiently develop domain-specific cross-lingual sentence encoders from existing multilingual sentence encoders and domain-specific monolingual sentence encoders in each language. Experimental results on translation ranking in three language pairs with different domains reveal the effectiveness of the proposed method compared to baselines without domain adaptation and existing domain adaptation methods.
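One common way to transfer knowledge from a monolingual teacher encoder into a multilingual student is MSE-based embedding distillation over parallel sentence pairs; the sketch below illustrates that generic formulation and is not necessarily the exact objective used in this work.

```python
# Hedged sketch: a generic knowledge-distillation objective for sentence
# encoders; an illustrative formulation, not the paper's exact objective.
import torch
import torch.nn.functional as F

def distillation_loss(student_src: torch.Tensor,
                      student_tgt: torch.Tensor,
                      teacher_src: torch.Tensor) -> torch.Tensor:
    """student_src/tgt: multilingual student embeddings of a translation pair;
    teacher_src: embedding of the source sentence from the domain-specific
    monolingual teacher. Both sides of the pair are pulled toward the teacher."""
    return F.mse_loss(student_src, teacher_src) + F.mse_loss(student_tgt, teacher_src)
```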
pdf
bib
abs
Am I Blue or Is My Hobby Counting the Teardrops? Expression Leakage in Large Language Models as a Symptom of Irrelevancy Disruption
Berkay Kopru
|
Mehrzad Mashal
|
Yigit Gurses
|
Akos Kadar
|
Maximilian Schmitt
|
Ditty Mathew
|
Felix Burkhardt
|
Florian Eyben
|
Björn W. Schuller
Large language models (LLMs) have advanced natural language processing (NLP) through mechanisms such as next-token prediction and self-attention, but their ability to integrate broad context also makes them prone to incorporating irrelevant information. Prior work has focused on semantic leakage, bias introduced by semantically irrelevant context. In this paper, we introduce expression leakage, a novel phenomenon where LLMs systematically generate sentimentally charged expressions that are semantically unrelated to the input context. To analyse expression leakage, we collect a benchmark dataset along with a scheme to automatically generate a dataset from free-form Common Crawl text. In addition, we propose an automatic evaluation pipeline that correlates well with human judgment, which accelerates benchmarking by removing the need for annotation of each analysed model. Our experiments show that, as the model scales in the parameter space, expression leakage decreases within the same LLM family. On the other hand, we demonstrate that mitigating expression leakage requires specific care during the model-building process and cannot be achieved by prompting alone. In addition, our experiments indicate that, when negative sentiment is injected into the prompt, it disrupts the generation process more than positive sentiment, causing a higher expression leakage rate.
pdf
bib
abs
Fusion of Object-Centric and Linguistic Features for Domain-Adapted Multimodal Learning
Jordan Konstantinov Kralev
Modern multimodal systems often struggle to link domain-specific visual content with textual descriptions, especially when object recognition is limited to general categories (e.g. COCO classes) and lacks customised adaptation to language models. In this paper, we present a novel framework that integrates a domain-adapted Detectron2 model into predefined models via a trainable projection layer, enabling precise cross-modal adaptation for specialised domains. Our approach extends Detectron2’s recognition capabilities to new categories by fine-tuning on multi-domain datasets, while a lightweight linear projection layer maps region-based visual features to the model’s embedding space without completely retraining the model. We evaluated the framework for domain-specific image captioning. The presented approach provides a scalable design for combining domain-specific visual recognition with language inference, with applications in domains that require fine-grained multimodal understanding.
pdf
bib
abs
Multi-Agent Reinforcement Learning for Interactive Code Debugging with Human Feedback and Memory
Anjana Krishnamoorthy
|
Kartik Ivatury
|
Benyamin Ahmadnia
This paper introduces an interactive Python debugging framework that combines multi-agent reinforcement learning, Natural Language Processing (NLP), and long-term memory. Two Proximal Policy Optimization (PPO) agents specialize in syntax and logic errors, generating candidate fixes that developers can accept, reject, or refine. A BERT-based module encodes natural language feedback into dense embeddings and quality scores, which shape reward signals for Reinforcement Learning from Human Feedback (RLHF). To support personalization, the system uses dual FAISS indices to retrieve past fixes based on code-error pairs and developer explanations. Evaluated on a synthetic dataset of 200 Python programs, our approach achieves an 88% syntax-fix rate and 45% logic-fix rate within five suggestions—outperforming one-shot Large Language Model (LLM) baselines. In addition, the system improves the quality of the explanation, as measured by BLEU, ROUGE, and CodeBLEU. By integrating multi-agent specialization, linguistic feedback, and memory-driven retrieval, our framework delivers a more efficient, adaptive, and developer-aligned debugging experience.
pdf
bib
abs
Integrating Large Language Models for Comprehensive Study and Sentiment Analysis of Student Feedback
Jana Kuzmanova
|
Katerina Zdravkova
|
Ivan Chorbev
In the academic year 2023/24, our university collected over 200,000 student feedback responses evaluating teaching staff and course experiences. The survey included demographic data, 10 Likert scale questions on teaching quality, a question on student attendance, and three open-ended questions about student experiences. This paper explores the integration of the Large Language Model (LLM) Gemini for sentiment analysis to evaluate students’ feedback quantitatively and qualitatively. We statistically analyze the Likert scale responses. To address the linguistic diversity of open-ended responses, written in both Cyrillic and Latin scripts with standard and slang expressions in several languages, we employed a preprocessing step using Gemini to standardize the input for further analyses. Sentiment analysis aims to identify various sentiment nuances, including direct answers, contradiction, multipolarity, mixed sentiment, sarcasm, irony, negation, ambiguity, understatement, and over-exaggeration. By comparing these insights with quantitative feedback, we aim to uncover deeper patterns between student perceptions and teaching performance. While the focus is on sentiment analysis, we also discuss the evaluation of the results provided by the LLM. For sentiments with fewer responses, the evaluation of the GenAI output was done manually. For sentiments with more than 1,000 entries, we suggest a semi-automated approach for sentiment categorization, to be explored in future work. This study enhances our understanding of student feedback through advanced computational methods, providing a more nuanced perspective on teaching quality and student satisfaction.
pdf
bib
abs
Task-Oriented Dialogue Systems through Function Calling
Tiziano Labruna
|
Giovanni Bonetta
|
Bernardo Magnini
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating dialogues and handling a broad range of user queries. However, their effectiveness as end-to-end Task-Oriented Dialogue (TOD) systems remains limited due to their reliance on static parametric memory, which fails to accommodate evolving knowledge bases (KBs). This paper investigates a scalable function-calling approach that enables LLMs to retrieve only the necessary KB entries via schema-guided queries, rather than embedding the entire KB into each prompt. This selective retrieval strategy reduces prompt size and inference time while improving factual accuracy in system responses. We evaluate our method on the MultiWOZ 2.3 dataset and compare it against a full-KB baseline that injects the entire KB into every prompt. Experimental results show that our approach consistently outperforms the full-KB method in accuracy, while requiring significantly fewer input tokens and considerably less computation time, especially when the KB size increases.
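To illustrate the schema-guided retrieval idea, the sketch below exposes a single toy lookup function to the LLM so that only matching KB rows ever enter the prompt; the schema, the miniature KB, and the function name are invented for illustration.

```python
# Hedged sketch: a schema-guided function exposed to the LLM so that only the
# KB entries matching the user's constraints are retrieved, instead of
# injecting the entire KB into every prompt. All names and data are invented.
RESTAURANT_KB = [
    {"name": "Trattoria Rossa", "area": "centre", "food": "italian", "price": "moderate"},
    {"name": "Spice Garden",    "area": "north",  "food": "indian",  "price": "cheap"},
]

FIND_RESTAURANT_SCHEMA = {
    "name": "find_restaurant",
    "description": "Look up restaurants matching the user's constraints.",
    "parameters": {
        "type": "object",
        "properties": {
            "area":  {"type": "string"},
            "food":  {"type": "string"},
            "price": {"type": "string"},
        },
    },
}

def find_restaurant(area=None, food=None, price=None):
    """Executed when the LLM emits a find_restaurant call; returns only matching rows."""
    return [r for r in RESTAURANT_KB
            if (area is None or r["area"] == area)
            and (food is None or r["food"] == food)
            and (price is None or r["price"] == price)]

# The dialogue system passes FIND_RESTAURANT_SCHEMA to the LLM as an available
# tool and injects only the (small) result of find_restaurant(...) into the next prompt.
print(find_restaurant(area="centre", food="italian"))
```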
pdf
bib
abs
When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively
Tiziano Labruna
|
Jon Ander Campos
|
Gorka Azkune
In this paper, we demonstrate how Large Language Models (LLMs) can effectively learn to use an off-the-shelf information retrieval (IR) system specifically when additional context is required to answer a given question. Given the performance of IR systems, the optimal strategy for question answering does not always entail external information retrieval; rather, it often involves leveraging the parametric memory of the LLM itself. Prior research has identified this phenomenon in the PopQA dataset, wherein the most popular questions are effectively addressed using the LLM’s parametric memory, while less popular ones require IR system usage. Following this, we propose a tailored training approach for LLMs, leveraging existing open-domain question answering datasets. Here, LLMs are trained to generate a special token, <RET>, when they do not know the answer to a question. Our evaluation of the Adaptive Retrieval LLM (Adapt-LLM) on the PopQA dataset showcases improvements over the same LLM under three configurations: (i) retrieving information for all the questions, (ii) always using the parametric memory of the LLM, and (iii) using a popularity threshold to decide when to use a retriever. Through our analysis, we demonstrate that Adapt-LLM is able to generate the <RET> token when it determines that it does not know how to answer a question, indicating the need for IR, while it achieves notably high accuracy levels when it chooses to rely only on its parametric memory.
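A minimal sketch of the adaptive inference loop described here is shown below: the model answers directly unless it emits the special <RET> token, in which case an IR system is queried and the question is re-asked with the retrieved context. The generate and retrieve callables are placeholders, not a specific API.

```python
# Hedged sketch of the adaptive-retrieval inference loop: answer from parametric
# memory unless the model emits the special <RET> token, then fall back to an
# IR system. `generate` and `retrieve` are placeholder callables.
RET_TOKEN = "<RET>"

def adaptive_answer(question: str, generate, retrieve) -> str:
    first_pass = generate(f"Question: {question}\nAnswer:")
    if RET_TOKEN not in first_pass:
        return first_pass.strip()               # model trusted its parametric memory
    context = retrieve(question)                # off-the-shelf IR system
    second_pass = generate(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return second_pass.strip()
```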
pdf
bib
abs
Trust but Verify: A Comprehensive Survey of Faithfulness Evaluation Methods in Abstractive Text Summarization
Salima Lamsiyah
|
Aria Nourbakhsh
|
Christoph Schommer
Abstractive text summarization systems have advanced significantly with the rise of neural language models. However, they frequently suffer from issues of unfaithfulness or factual inconsistency, generating content that is not verifiably supported by the source text. This survey provides a comprehensive review of over 40 studies published between 2020 and 2025 on methods for evaluating faithfulness in abstractive summarization. We present a unified taxonomy that covers human evaluation techniques and a variety of automatic metrics, including question answering (QA)-based methods, natural language inference (NLI)-based methods, graph-based approaches, and large language model (LLM)-based evaluation. We also discuss meta-evaluation protocols that assess the quality of these metrics. In addition, we analyze a wide range of benchmark datasets, highlighting their design, scope, and relevance to emerging challenges such as long-document and domain-specific summarization. Furthermore, we identify critical limitations in current evaluation practices, including poor alignment with human judgment, limited robustness, and inefficiencies in handling complex summaries. We conclude by outlining future directions to support the development of more reliable, interpretable, and scalable evaluation methods. This work aims to support researchers in navigating the rapidly evolving landscape of faithfulness evaluation in summarization.
pdf
bib
abs
Evaluating Large Language Models on Multiword Expressions in Multilingual and Code-Switched Contexts
Frances Adriana Laureano De Leon
|
Asim Abbas
|
Harish Tayyar Madabushi
|
Mark Lee
Multiword expressions, characterised by non-compositional meanings and syntactic irregularities, are an example of nuanced language. These expressions can be used literally or idiomatically, leading to significant changes in meaning. Although large language models perform well on many tasks, their ability to handle subtle linguistic phenomena remains unclear. This study examines how state-of-the-art models process the ambiguity of potentially idiomatic multiword expressions, particularly in less frequent contexts where memorisation is less likely to help. By evaluating models in Portuguese, Galician, and English, and introducing a new code-switched dataset and task, we show that large language models, despite their strengths, have difficulty handling nuanced language. In particular, we find that the latest models, including GPT-4, fail to outperform the xlm-roBERTa-base baselines in both detection and semantic tasks, with especially poor performance on the novel tasks we introduce, despite their similarity to existing tasks. Overall, our results demonstrate that multiword expressions, especially those that are ambiguous, continue to be a challenge to models. We provide open access to our datasets, prompts and model responses.
pdf
bib
abs
Instruction Finetuning to Attribute Language Stage, Dialect, and Provenance Region to Historical Church Slavic Texts
Piroska Lendvai
|
Uwe Reichel
|
Anna Jouravel
|
Achim Rabus
|
Elena Renje
Our study addresses domain-specific text provenance classification for the historical Church Slavic language. The downstream task is to attribute the language stage and its dialectal and regional varieties to texts compiled from newly curated sources, including digitally unpublished manuscripts, in addition to established Church Slavic resources from the Universal Dependencies Treebank. We aim to harmonize previously used tag sets pertaining to textual provenance, and construct a new, hierarchical, multi-layer provenance labeling scheme. For the classification task, we finetune Vikhr (Nikolich et al., 2024), a generative LLM with knowledge of modern Russian, with the instruction to generate labels to classify the provenance of sentence-level text units. Besides gold standard manuscript transcriptions, we test the finetuned model on character-corrupted data that emulate the quality of noisy, handwritten text recognition material. The experiments show that the Vikhr base model has low provenance attribution knowledge of Church Slavic, whereas our finetuned model achieves above .9 F-scores on language stage labeling and dialect labeling, and above .8 F-score on generating the label that jointly classifies all three provenance layers. The task of classifying the fine-grained geographical region from which a manuscript originates proves harder (but still performs above .8), and is negatively impacted by character-level noise injection.
pdf
bib
abs
MariATE: Automatic Term Extraction Using Large Language Models in the Maritime Domain
Shijie Liu
|
Els Lefever
|
Veronique Hoste
This study presents a comprehensive evaluation of Large Language Models (LLMs) for automatic term extraction in the maritime safety domain. The research examines the zero-shot performance of seven state-of-the-art LLMs, including both open-source and closed-source models, and investigates terminology annotation strategies for optimal coverage. Nested annotation captures both complete technical expressions and their constituent components, while full-term annotation focuses exclusively on maximal-length terms. Experimental results demonstrate Claude-3.5-Sonnet’s superior performance (F1-score of 0.80) in maritime safety terminology extraction, particularly in boundary detection capabilities. Error analysis reveals three primary challenges: distinguishing contextual descriptions from legitimate terminology, handling complex multi-word expressions, and identifying maritime safety operational and navigational terms. Analysis of annotation strategies reveals that the full-term annotation approach achieves 95.24% coverage of unique terms compared to the nested annotation approach. The additional 4.76% of terms identified through nested annotation represents subcomponents of larger technical expressions. These findings advance the understanding of LLMs’ capabilities in specialized terminology extraction and provide empirical evidence supporting the sufficiency of full-term annotation for comprehensive terminology coverage in domain-specific applications.
pdf
bib
abs
Exploring the Usage of Knowledge Graphs in Identifying Human and LLM-Generated Fake Reviews
Ming Liu
|
Massimo Poesio
The emergence of large language models has led to an explosion of machine-generated fake reviews. Although distinguishing between human and LLM-generated fake reviews is an area of active research, progress is still needed. One aspect which makes current LLM-generated fake reviews easier to recognize is that LLMs, in particular the smaller ones, lack domain-related knowledge. The objective of this work is to investigate whether large language models can produce more realistic artificial reviews when supplemented with knowledge graph information, thus resulting in a more challenging training dataset for detectors of human and LLM-generated fake reviews. We propose a method for generating fake reviews by providing knowledge graph information to a LLaMA model, and use it to generate a large number of fake reviews, which we then use to fine-tune a state-of-the-art detection system for human and LLM-generated fake reviews. Our results show that when knowledge graph information is provided as part of the input, the accuracy of the model is improved by 0.24%. When the knowledge graph is used as an embedding layer and combined with the existing input embedding layer, the accuracy of the detection model is improved by 1.279%.
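As an illustration of how knowledge graph information could be supplied to the generator, the sketch below serializes triples into the review-generation prompt; the triple format, product, and instruction text are invented and do not reproduce the authors' prompt.

```python
# Hedged sketch: serializing knowledge-graph triples into a review-generation
# prompt; triple format and instruction text are invented for illustration.
def build_review_prompt(product: str, kg_triples: list[tuple[str, str, str]], rating: int) -> str:
    facts = "\n".join(f"- {s} {p} {o}" for s, p, o in kg_triples)
    return (f"Write a realistic {rating}-star customer review of {product}.\n"
            f"Ground the review in these facts:\n{facts}\nReview:")

triples = [("NoiseAway X2", "hasFeature", "active noise cancellation"),
           ("NoiseAway X2", "batteryLife", "30 hours")]
print(build_review_prompt("NoiseAway X2 headphones", triples, 4))
```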
pdf
bib
abs
The Evaluation of Medical Terms Complexity Using Lexical Features and Large Language Models
Liliya Makhmutova
|
Giancarlo Dondoni Salton
|
Fernando Perez-Tellez
|
Robert J. Ross
Understanding medical terminology is critical for effective patient-doctor communication, yet many patients struggle with complex jargon. This study compares Machine Learning (ML) models and Large Language Models (LLMs) in predicting medical term complexity as a means of improving doctor-patient communication. Using survey data from 252 participants rating 1,000 words along with various lexical features, we measured the accuracy of both model types. The results show that LLMs outperform traditional lexical-feature-based models, suggesting their potential to identify complex medical terms and lay the groundwork for personalised patient-doctor communication.
pdf
bib
abs
Where and How as Key Factors for Knowledge-Enhanced Constrained Commonsense Generation
Ivan Martinez-Murillo
|
Paloma Moreda Pozo
|
Elena Lloret
This paper addresses a key limitation in Natural Language Generation (NLG) systems: their struggle with commonsense reasoning, which is essential for generating contextually appropriate and plausible text. The study proposes an approach to enhance the commonsense reasoning abilities of NLG systems by integrating external knowledge framed in a constrained commonsense generation task. The paper investigates strategies for extracting and injecting external knowledge into pre-trained models, specifically BART and T5, in both base and large configurations. Experimental results show that incorporating external knowledge extracted with a simple strategy leads to significant improvements in performance, with the models achieving 88% accuracy in generating plausible and correct sentences. When refined methods for knowledge extraction are applied, the accuracy further increases to 92%. These findings underscore the crucial role of high-quality external knowledge in enhancing the commonsense reasoning capabilities of NLG systems, suggesting that such integration is vital for advancing their performance in real-world applications.
pdf
bib
abs
Forecasting Online Negativity Spikes with Multilingual Transformers for Strategic Decision-Making
Rowan Martnishn
|
Vishal Green
|
Varun Kadari
|
Shravan Athikinasetti
|
Zach Miller
|
Julia Brady
|
Viraj Chawda
|
Nikhil Badlani
Social media platforms like Reddit, YouTube, and Instagram amplify rapid dissemination of negative sentiment, potentially causing harm and fostering extremist discourse. This paper addresses the NLP challenge of predicting sudden spikes in negative sentiment by fine-tuning multilingual transformer models. We present a structured pipeline emphasizing linguistic feature extraction and temporal modeling. Our experimental results, obtained from extensive Reddit, YouTube, and Instagram data, demonstrate improved forecasting accuracy over baseline methods. Ethical considerations and implications for deployment in social media moderation are thoroughly discussed. The system includes user-centric interactive features such as real-time filtering dashboards, customizable negativity thresholds, and forecasting analytics, providing actionable insights for preventative content moderation. Given its real-time deployment potential and cross-platform applicability, the system is well suited to proactive, large-scale moderation workflows.
pdf
bib
abs
C-SHAP: Collocation-Aware Explanations for Financial NLP
Martina Menzio
|
Elisabetta Fersini
|
Davide Paris
Understanding the internal decision-making process of NLP models in high-stakes domains such as the financial sector is particularly challenging due to the complexity of domain-specific terminology and the need for transparency and accountability. Although SHAP is a widely used model-agnostic method for attributing model predictions to input features, its standard formulation treats input tokens as independent units, failing to capture the influence of collocations, which often carry non-compositional meaning and are modeled as units by current language models. We introduce C-SHAP, an extension of SHAP that incorporates collocational dependencies into the explanation process to account for word combinations in the financial sector. C-SHAP dynamically groups tokens into significant collocations using a financial glossary and computes Shapley values over these structured units. The proposed approach has been evaluated on explaining sentiment classification of Federal Reserve Minutes, demonstrating improved alignment with human rationales and a stronger association with model behaviour compared to the standard token-level approach.
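The grouping step can be illustrated with a minimal sketch that merges glossary collocations into single attribution units before a standard SHAP explainer is applied; the glossary entries below are invented examples.

```python
# Hedged sketch of the grouping step: tokens are merged into glossary
# collocations so attribution is computed over these units rather than
# individual tokens; glossary entries are invented and the downstream
# Shapley computation is left to a standard SHAP explainer.
FINANCIAL_GLOSSARY = {("interest", "rate"), ("quantitative", "easing"),
                      ("federal", "funds", "rate")}

def group_collocations(tokens: list[str]) -> list[list[str]]:
    units, i = [], 0
    while i < len(tokens):
        matched = False
        # Try the longest glossary match starting at position i (up to 3 tokens here).
        for length in (3, 2):
            span = tuple(t.lower() for t in tokens[i:i + length])
            if len(span) == length and span in FINANCIAL_GLOSSARY:
                units.append(tokens[i:i + length])
                i += length
                matched = True
                break
        if not matched:
            units.append([tokens[i]])
            i += 1
    return units

print(group_collocations("The federal funds rate rose despite quantitative easing".split()))
# [['The'], ['federal', 'funds', 'rate'], ['rose'], ['despite'], ['quantitative', 'easing']]
```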
pdf
bib
abs
Investigating Polarization in YouTube Comments via Aspect-Based Sentiment Analysis
Daniel Miehling
|
Daniel Dakota
|
Sandra Kübler
We investigate the use of Aspect-Based Sentiment Analysis (ABSA) to analyze polarization in online discourse. For the analysis, we use a corpus of over 3 million user comments and replies from four state-funded media channels on YouTube Shorts in the context of the 2023 Israel–Hamas war. We first annotate a subsample of approx. 5,000 comments for positive, negative, and neutral sentiment towards a list of topic-related aspects. After training an ABSA model (Yang et al., 2023) on the corpus, we evaluate its performance on this task intrinsically, before evaluating the usability of the automatic analysis of the whole corpus for analyzing polarization. Our results show that the ABSA model achieves an F1 score of 77.9. The longitudinal and outlet analyses corroborate known trends and offer subject experts more fine-grained information about the use of domain-specific language in user-generated content.
pdf
bib
abs
From the Tractatus Logico-Philosophicus to Later Wittgenstein: An NLP-Based Comparative Analysis
Andreiana Mihail
|
Silviu-Florin Gheorghe
|
Andrei Fotea
|
Liviu P. Dinu
This study investigates the application of Natural Language Processing (NLP) methods to uncover linguistic and stylistic variations within the corpus of Ludwig Wittgenstein, a philosopher renowned for his complex and notional contributions. By analyzing works such as Tractatus Logico-Philosophicus alongside his later notes, manuscripts, and student-dictated lectures in Cambridge, we aim to identify significant distinctions in language use and conceptual framing. The corpus poses unique difficulties because of its diverse origins, encompassing published works, personal notes, and collaboratively edited transcripts. Utilizing zero-shot NLP techniques, this exploratory/preliminary research aims to reveal patterns reflective of Wittgenstein’s philosophical evolution and differences in text production manners. The results highlight the potential of computational approaches to enhance our understanding of complex, context-dependent philosophical writings, providing a possible path for further interdisciplinary investigations into linguistic and conceptual dynamics in this challenging body of work.
pdf
bib
abs
Towards Intention-aligned Reviews Summarization: Enhancing LLM Outputs with Pragmatic Cues
Maria Miro Maestre
|
Robiert Sepulveda-Torres
|
Ernesto Luis Estevanell-Valladares
|
Armando Suarez Cueto
|
Elena Lloret
Recent advancements in Natural Language Processing (NLP) have allowed systems to address complex tasks involving cultural knowledge, multi-step reasoning, and inference. While significant progress has been made in text summarization guided by specific instructions or stylistic cues, the integration of pragmatic aspects like communicative intentions remains underexplored, particularly in non-English languages. This study emphasizes communicative intentions as central to summary generation, classifying Spanish product reviews by intent and using prompt engineering to produce intention-aligned summaries. Results indicate challenges for large language models (LLMs) in processing extensive document clusters, with summarization accuracy heavily dependent on prior model exposure to similar intentions. Common intentions such as complimenting and criticizing are reliably handled, whereas less frequent ones like promising or questioning pose greater difficulties. These findings suggest that integrating communicative intentions into summarization tasks can significantly enhance summary relevance and clarity, thereby improving user experience in product review analysis.
pdf
bib
abs
Subtle Shifts, Significant Threats: Leveraging XAI Methods and LLMs to Undermine Language Models Robustness
Adrián Moreno Muñoz
|
L. Alfonso Ureñ-López
|
Eugenio Martínez Cámara
Language models exhibit inherent security vulnerabilities, which may be related to several factors, among them the malicious alteration of the input data. Such weaknesses compromise the robustness of language models, which is more critical when adversarial attacks are stealthy and do not require high computational resources. In this work, we study how vulnerable pretrained English language models are to adversarial attacks based on subtle modifications of their input. We claim that the attack may be more effective if it is targeted at the most salient words for the discriminative task of the language models. Accordingly, we propose a new attack built upon a two-step approach: first, we use a posteriori explainability methods to identify the most influential words for the classification task, and second, we replace them with contextual synonyms retrieved by a small language model. Since the attack has to be as stealthy as possible, we also propose a new evaluation measure that combines the effectiveness of the attack with the number of modifications performed. The results show that pretrained English language models are vulnerable to minimal semantic changes, which makes the design of countermeasure methods imperative.
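A minimal sketch of the two-step attack loop is shown below; the attribution and synonym-substitution functions are placeholders standing in for the explainability method and the small language model, and the edit budget is an illustrative choice.

```python
# Hedged sketch of the two-step attack: rank words by an a-posteriori
# attribution score, then replace the most salient ones with contextual
# synonyms. `attribution_scores`, `contextual_synonym`, and `predict_label`
# are placeholder callables, not a specific library API.
def saliency_guided_attack(tokens, attribution_scores, contextual_synonym,
                           predict_label, original_label, max_edits=3):
    scores = attribution_scores(tokens)                  # one score per token
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    adversarial = list(tokens)
    edits = 0
    for i in ranked[:max_edits]:
        candidate = contextual_synonym(adversarial, i)   # synonym in context
        if candidate is None:
            continue
        adversarial[i] = candidate
        edits += 1
        if predict_label(adversarial) != original_label:
            return adversarial, edits                    # stealthy success
    return None, edits                                   # attack failed within budget
```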
pdf
bib
abs
Fast Thinking with Structured Prompts: Enabling LLM Reasoning without Chain-of-Thought Generation
Kirill Morozov
|
Liubov Chubarova
|
Irina Piontkovskaya
The emergence of complex reasoning abilities in large language models (LLMs) has sparked great interest, and a variety of prompting techniques have been proposed to coax them into emulating human thought processes. In this work, we introduce Think Node-by-Node, a graph-based reasoning framework inspired by mind maps, flowcharts, and other visual aids that help humans tackle complex problems. Rather than generating images directly, our approach leverages standard graph-building and rendering libraries, and requires no fine-tuning, only the model’s native coding capabilities. We further explore a “Fast Thinking” regime, in which a graph-reasoning example is provided in the prompt but the model generates the answers directly, without reconstructing the full thought process. Surprisingly, this approach leads to a significant improvement over the baseline in general-knowledge tasks. Remarkably, Think Node-by-Node maintains strong performance even under a strict 25-token budget for answer generation. Across two instruction-tuned LLMs (0.5B and 7B parameters), our FastTNbN strategy outperforms baseline prompting techniques, improving accuracy by up to 10%, and exceeds the capabilities of other structured prompting methods under equivalent generation constraints.
pdf
bib
abs
T2Know: Analysis and Trend Platform Using the Knowledge Extracted from Scientific Texts
Rafael Muñoz Guillena
|
Manuel Palomar
|
Yoan Gutiérrez
|
Mar Bonora
The T2Know project explores the application of natural language processing technologies to build a semantic platform for scientific documents using knowledge graphs. These graphs will interconnect meaningful sections from different documents, enabling both trend analysis and the generation of informed recommendations. The project’s objectives include the development of entity recognition systems, the definition of user and document profiles, and the linking of documents through transformer-based technologies. Consequently, the extracted relevant content will go beyond standard metadata such as titles and author affiliations, extending also to other key sections of scientific articles, including references, which are treated as integral components of the knowledge representation.
pdf
bib
abs
Investigating Large Language Models’ (LLMs) Capabilities for Sexism Detection on a Low-Resource Language
Lutfiye Seda Mut Altin
|
Horacio Saggion
Automatic detection of sexist language on social media is gaining attention due to its harmful societal impact and the technical challenges it presents. The limited availability of data resources in some languages restricts the development of effective tools to fight the spread of such content. In this work, we investigated various methods to improve the efficiency of automatic detection of sexism and its subtypes in a low-resource language, Turkish. We first experimented with various LLM prompting strategies for classification and then investigated the impact of different data augmentation strategies, including both synthetic data generation with LLMs (GPT, DeepSeek) and translation-based augmentation using English and Spanish data. Finally, we examined whether these augmentation methods would improve the model performance of a trained neural network (BERT). Our benchmarking results show that a fine-tuned LLM (GPT-4o-mini) achieved the best performance compared to zero-shot, few-shot, and Chain-of-Thought prompt classification and to training a neural network (BERT) on data augmented in different ways (synthetic generation, translation). Our results also indicated that, for the classification of more granular classes, in other words, more specific tasks, training a neural network generally performed better than prompt-based classification using an LLM.
pdf
bib
abs
PolyHope-M at RANLP2025 Subtask-1 Binary Hope Speech Detection: Spanish Language Classification Approach with Comprehensive Learning Using Transformer, and Traditional ML, and DL
Md. Julkar Naeen
|
Sourav Kumar Das
|
Sharun Akter Khushbu
|
Shahriar Sultan Ramit
|
Alaya Parven Alo
This paper presents our system for the RANLP 2025 shared task on multilingual binary sentiment classification, evaluated on Task-2 Spanish datasets covering domains such as social media and customer reviews. We experimented with various models, ranging from traditional machine learning approaches (Naive Bayes and LightGBM) to deep learning architectures such as LSTM. Among them, the transformer-based XLM-RoBERTa model performed best with an F1 of 0.85, demonstrating its promise for multilingual sentiment work. Basic text preprocessing techniques were used for data quality assurance and improving model performance. Our comparison reflects the superiority of transformer-based models over traditional methods in binary sentiment classification for multilingual and low-resource environments. This study supports the development of cross-lingual sentiment classification by establishing strong baselines and paying close attention to model performance in joint task settings.
pdf
bib
abs
F-LoRA-QA: Finetuning LLaMA Models with Low-Rank Adaptation for French Botanical Question Generation and Answering
Ayoub Nainia
|
Régine Vignes-Lebbe
|
Hajar Mousannif
|
Jihad Zahir
Despite recent advances in large language models (LLMs), most question-answering (QA) systems remain English-centric and poorly suited to domain-specific scientific texts. This linguistic and domain bias poses a major challenge in botany, where a substantial portion of knowledge is documented in French. We introduce F-LoRA-QA, a fine-tuned LLaMA-based pipeline for French botanical QA, leveraging Low-Rank Adaptation (LoRA) for efficient domain adaptation. We construct a specialized dataset of 16,962 question-answer pairs extracted from scientific flora descriptions and fine-tune LLaMA models to retrieve structured knowledge from unstructured botanical texts. Expert-based evaluation confirms the linguistic quality and domain relevance of generated answers. Compared to baseline LLaMA models, F-LoRA-QA achieves a 300% BLEU score increase, 70% ROUGE-1 F1 gain, +16.8% BERTScore F1, and Exact Match improvement from 2.01% to 23.57%. These results demonstrate the effectiveness of adapting LLMs to low-resource scientific domains and highlight the potential of our approach for automated trait extraction and biodiversity data structuring.
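As a rough illustration of the low-rank adaptation step described above, the sketch below shows how LoRA adapters are commonly attached to a causal language model with the Hugging Face PEFT library. It is not the authors' configuration: the base model name, rank, and target modules are assumptions made purely for illustration.

```python
# Minimal sketch of attaching LoRA adapters to a causal LM before fine-tuning.
# Illustrative only: the base model, rank, and target modules are assumptions,
# not the configuration used in F-LoRA-QA.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base_model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical choice of base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # low-rank dimension (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

Because only the adapter parameters are updated, this kind of setup keeps domain adaptation affordable, which is the general motivation behind LoRA-style pipelines such as the one summarized above.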
pdf
bib
abs
Reverse Prompting: A Novel Computational Paradigm in Schizophrenia Based on Large Language Models
Ivan Nenchev
|
Christiane Montag
|
Sandra Anna Just
Large language models (LLMs) are increasingly being used to interpret and generate human language, yet their ability to process clinical language remains underexplored. This study examined whether three open-source LLMs can infer interviewer questions from participant responses in a semi-structured psychiatric interview (NET) conducted with individuals diagnosed with schizophrenia (n = 107) and neurotypical controls (n = 66). Using cosine similarity between LLM-generated questions and original prompts as a proxy for the precision of the inference, we found that responses from individuals with schizophrenia produced significantly lower similarity scores (beta = –0.165, p < .001). Cosine similarity decreased across the nested structure of the interview, with smaller reductions observed in the schizophrenia group. Although all emotions decreased similarity with fear, only sadness showed a significant interaction with diagnosis, suggesting differential processing of emotional discourse. Model type and generation temperature also influenced outcomes, highlighting variability in model performance. Our findings demonstrate that LLMs systematically struggle to reconstruct interviewer intent from responses by individuals with schizophrenia, reflecting known discourse-level disturbances in the disorder.
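To make the similarity measure concrete, the following minimal sketch computes cosine similarity between an original interviewer prompt and an LLM-reconstructed question using sentence embeddings; the embedding model is an arbitrary choice for illustration and not necessarily the one used in the study.

```python
# Sketch: cosine similarity between a reconstructed question and the original prompt.
# The embedding model and example sentences are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def question_similarity(original_prompt: str, generated_question: str) -> float:
    """Return cosine similarity between the original and reconstructed question."""
    vecs = embedder.encode([original_prompt, generated_question])
    a, b = vecs[0], vecs[1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example usage with hypothetical interview text:
score = question_similarity(
    "How did you feel when that happened?",
    "What emotions did you experience at that moment?",
)
print(f"cosine similarity: {score:.3f}")
```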
pdf
bib
abs
A Survey on Small Language Models
Chien Van Nguyen
|
Xuan Shen
|
Ryan Aponte
|
Yu Xia
|
Samyadeep Basu
|
Zhengmian Hu
|
Jian Chen
|
Mihir Parmar
|
Sasidhar Kunapuli
|
Joe Barrow
|
Junda Wu
|
Ashish Singh
|
Yu Wang
|
Jiuxiang Gu
|
Nesreen K. Ahmed
|
Nedim Lipka
|
Ruiyi Zhang
|
Xiang Chen
|
Tong Yu
|
Sungchul Kim
|
Hanieh Deilamsalehy
|
Namyong Park
|
Michael Rimer
|
Zhehao Zhang
|
Huanrui Yang
|
Puneet Mathur
|
Gang Wu
|
Franck Dernoncourt
|
Ryan Rossi
|
Thien Huu Nguyen
Small Language Models (SLMs) have become increasingly important due to their efficiency and ability to perform various language tasks with minimal computational resources, making them ideal for various settings including on-device, mobile, and edge deployments, among many others. In this article, we present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques. We propose a novel taxonomy for categorizing the methods used to optimize SLMs, including model compression, pruning, and quantization techniques. We summarize the datasets that are useful for benchmarking SLMs along with the evaluation metrics commonly used. Additionally, we highlight key open challenges that remain to be addressed. Our survey aims to serve as a valuable resource for researchers and practitioners interested in developing and deploying small yet efficient language models.
pdf
bib
abs
Quantifying the Overlap: Attribution Maps and Linguistic Heuristics in Encoder-Decoder Machine Translation Models
Aria Nourbakhsh
|
Salima Lamsiyah
|
Christoph Schommer
Explainable AI (XAI) attribution methods seek to illuminate the decision-making process of generative models by quantifying the contribution of each input token to the generated output. Different attribution algorithms, often rooted in distinct methodological frameworks, can produce varied interpretations of feature importance. In this study, we utilize attribution mappings derived from three distinct methods as weighting signals during the training of encoder-decoder models. Our findings demonstrate that Attention and Value Zeroing attribution weights consistently lead to improved model performance. To better understand the linguistic information these mappings capture, we extract part-of-speech (POS), dependency, and named entity recognition (NER) tags from the input-output pairs and compare them with the XAI attribution maps. Although the Saliency method shows greater alignment with POS and dependency annotations than Value Zeroing, it exhibits more divergence in places where its attributions do not conform to these linguistic tags, compared to the other two methods, and it contributes less to the models’ performance.
pdf
bib
abs
The Illusion of a Perfect Metric: Why Evaluating AI's Words Is Harder than It Looks
Maria Paz Oliva
|
Adriana D. Correia
|
Ivan Vankov
|
Viktor Botev
Evaluating Natural Language Generation (NLG) is crucial for the practical adoption of AI, but has been a longstanding research challenge. While human evaluation is considered the de facto standard, it is expensive and lacks scalability. Practical applications have driven the development of various automatic evaluation metrics (AEM), designed to compare the model output with human-written references, generating a score which approximates human judgment. Over time, AEMs have evolved from simple lexical comparisons, to semantic similarity models and, more recently, to LLM-based evaluators. However, it seems that no single metric has emerged as a definitive solution, resulting in studies using different ones without fully considering the implications. This paper aims to show this by conducting a thorough examination of the methodologies of existing metrics, their documented strengths and limitations, validation methods, and correlations with human judgment. We identify several key challenges: metrics often capture only specific aspects of text quality, their effectiveness varies by task and dataset, validation practices remain unstructured, and correlations with human judgment are inconsistent. Importantly, we find that these challenges persist in the most recent type of metric, LLM-as-a-Judge, as well as in the evaluation of Retrieval Augmented Generation (RAG), an increasingly relevant task in academia and industry. Our findings challenge the quest for the ‘perfect metric’. We propose selecting metrics based on task-specific needs and leveraging complementary evaluations, and we advocate that new metrics should focus on enhanced validation methodologies.
pdf
bib
abs
Multi-LLM Debiasing Framework
Deonna M. Owens
|
Ryan Rossi
|
Sungchul Kim
|
Tong Yu
|
Franck Dernoncourt
|
Xiang Chen
|
Ruiyi Zhang
|
Jiuxiang Gu
|
Hanieh Deilamsalehy
|
Nedim Lipka
Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet they have demonstrated biases that perpetuate societal inequalities. Despite significant advancements in bias mitigation techniques using data augmentation, zero-shot prompting, and model fine-tuning, biases continue to persist, including subtle biases that may elude human detection. Recent research has shown a growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning and factuality in LLMs. Building on this approach, we propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs. Our work is the first to introduce and evaluate two distinct approaches within this framework for debiasing LLMs: a centralized method, where the conversation is facilitated by a single central LLM, and a decentralized method, where all models communicate directly. Our findings reveal that our multi-LLM framework significantly reduces bias in LLMs, outperforming the baseline method across several social groups.
pdf
bib
abs
Toward Quantum-Enhanced Natural Language Understanding: Sarcasm and Claim Detection with QLSTM
Pritam Pal
|
Dipankar Das
Traditional machine learning (ML) and deep learning (DL) models have shown effectiveness in natural language processing (NLP) tasks, such as sentiment analysis. However, they often struggle with complex linguistic structures, such as sarcasm and implicit claims. This paper introduces a Quantum Long Short-Term Memory (QLSTM) framework for detecting sarcasm and identifying claims in text, aiming to enhance the analysis of complex sentences. We evaluate four approaches: (1) classical LSTM, (2) quantum framework using QLSTM, (3) voting ensemble combining classical and quantum LSTMs, and (4) hybrid framework integrating both types. The experimental results indicate that the QLSTM approach excels in sarcasm detection, while the voting framework performs best in claim identification.
pdf
bib
abs
Legal Terminology Extraction in Spanish: Gold-standard Generation and LLM Evaluation
Lucia Palacios Palacios
|
Beatriz Guerrero García
|
Patricia Martín Chozas
|
Elena Montiel Ponsoda
This study aims to develop a gold-standard for terminological extraction in Castilian Spanish within the domain of labour law. To achieve this, a methodology was developed based on established linguistic theories and reviewed by a team of expert terminologists. Departing from previous extraction studies and reference theoretical frameworks, candidate terms were identified by their morphosyntactic patterns, enriched by assessing their degree of specialisation in reference resources. The candidate terms were then subjected to manual validation. To evaluate its applicability, we assessed the performance of the LLaMA3-8B and Mistral-7B language models in extracting labour law terms from the latest version of the Real Decreto Legislativo 2/2015 Ley del Estatuto de los Trabajadores. YAKE was also included as a statistical baseline for comparison between traditional methods and generative approaches. All models were evaluated against the validated gold-standard.
pdf
bib
abs
Benchmarking Item Difficulty Classification in German Vocational Education and Training
Alonso Palomino
|
Benjamin Paassen
Predicting the difficulty of exam questions or items is essential to effectively assembling and calibrating exams. While item response theory (IRT) models can estimate item difficulty, they require student responses that are costly and rarely available at scale. Natural language processing methods offer a text-only alternative; however, due to the scarcity of real-world labeled data, prior work often relies on synthetic or domain-specific corpora, limiting generalizability and overlooking the nuanced challenges of real-world text-based item difficulty estimation. Addressing this gap, we benchmark 122 classifiers on 935 German Vocational Education and Training (VET) items labeled via previous IRT analysis to assess feasibility under real-world conditions. In our setup, a stacked ensemble that combines linguistic features, pre-trained embeddings, and external semantic resources outperforms both transformer-based models and few-shot large language models, achieving moderate performance. We report findings and discuss limitations in the context of German VET.
pdf
bib
abs
Isolating LLM Performance Gains in Pre-training versus Instruction-tuning for Mid-resource Languages: The Ukrainian Benchmark Study
Yurii Paniv
This paper evaluates language model performance on Ukrainian language tasks across multiple downstream benchmarks, including summarization, closed and open question answering, and translation at both sentence and paragraph levels. We also introduce LongFlores, an extension of the FLORES benchmark designed specifically to assess paragraph-level translation capabilities. In our experiments, we compare the performance of base models against their instruction-tuned counterparts to isolate and quantify the source of performance improvements for Ukrainian language tasks. Our findings reveal that for popular open source models, base models are stronger in the few-shot setting for the task than their instruction-tuned counterparts in the zero-shot setting. This suggests lower attention paid to Ukrainian during the instruction-tuning phase, providing valuable insights for future model development and optimization for Ukrainian and potentially other lower-resourced languages.
pdf
bib
abs
Evaluating LLMs on Deceptive Text across Cultures
Katerina Papantoniou
|
Panagiotis Papadakos
|
Dimitris Plexousakis
Deception is a pervasive feature of human communication, yet identifying linguistic cues of deception remains a challenging task due to strong context dependency across domains, cultures, and types of deception. While prior work has relied on human analysis across disciplines like social psychology, philosophy, and political science, large language models (LLMs) offer a new avenue for exploring deception due to their strong performance in Natural Language Processing (NLP) tasks. In this study, we investigate whether open-weight LLMs possess and can apply knowledge about linguistic markers of deception across multiple languages, domains, and cultural contexts, with language and country of origin used as a proxy for culture. We focus on two domains, opinionated reviews and personal descriptions about sensitive topics, spanning five languages and six cultural settings. Using various configurations (zero-shot, one-shot, and fine-tuning), we evaluate the performance of LLMs in detecting and generating deceptive text. In detection tasks, our results reveal cross-model and cross-context performance differences. In generation tasks, linguistic analyses show partial alignment with known deception cues in human text, though this knowledge appears largely uniform and context-agnostic.
pdf
bib
abs
Annotating Hate Speech towards Identity Groups
Donnie Parent
|
Nina Georgiades
|
Charvi Mishra
|
Khaled Mohammed
|
Sandra Kübler
Detecting hate speech, especially implicit hate speech, is a difficult task. We focus on annotating implicit hate targeting identity groups. We describe our dataset, which is a subset of AbuseEval (Caselli et al., 2020) and our annotation process for implicit identity hate. We annotate the type of abuse, the type of identity abuse, and the target identity group. We then discuss cases that annotators disagreed on and provide dataset statistics. Finally, we calculate our inter-annotator agreement.
pdf
bib
abs
On the Interaction of Identity Hate Classification and Data Bias
Donnie Parent
|
Nina Georgiades
|
Charvi Mishra
|
Khaled Mohammed
|
Sandra Kübler
Hate speech detection is a task where machine learning models tend to be limited by the biases introduced by the dataset. We use two existing datasets of hate speech towards identity groups, the one by Wiegand et al. (2022) and a reannotated subset of the data in AbuseEval (Caselli et al. 2020). Since the data by Wiegand et al. (2022) were collected using one syntactic pattern, there exists a possible syntactic bias in this dataset. We test whether there exists such a bias by using a more syntactically general dataset for testing. Our findings show that classifiers trained on the dataset with the syntactic bias and tested on a less constrained dataset suffer from a loss in performance in the order of 20 points. Further experiments show that this drop can only be partly attributed to a shift in identity groups between datasets.
pdf
bib
abs
Financial News as a Proxy of European Central Bank Interest Rate Adjustments
Davide Paris
|
Martina Menzio
|
Elisabetta Fersini
This paper examines the relationship between news coverage and the European Central Bank’s (ECB) interest rate decisions. In particular, the hypothesis of a linear relationship between financial news and ECB indications regarding interest rate variations is investigated by leveraging state-of-the-art large language models combined with domain experts and automatically selected keywords. The analysis revealed two key findings related to how news contents can signal the ECB’s decisions to raise or lower interest rates: (1) Sentence Transformer models, when combined with domain-specific keywords, exhibit a higher correlation with ECB decisions than state-of-the-art financial BERT architectures; (2) employing a grid search strategy to select subsets of informative keywords strengthened the relationships between news contents and ECB’s decisions, highlighting how media narratives can anticipate or reflect central bank policy actions.
pdf
bib
abs
Generating and Analyzing Disfluency in a Code-Mixed Setting
Aryan Paul
|
Tapabrata Mondal
|
Dipankar Das
|
Sivaji Bandyopadhyay
This work explores the intersection of code-mixing and disfluency in bilingual speech and text, with a focus on understanding how large language models (LLMs) handle code-mixed disfluent utterances. One of the primary objectives is to explore LLMs’ ability to generate code-mixed disfluent sentences and to address the lack of high-quality code-mixed disfluent corpora, particularly for Indic languages. We aim to compare the performance of LLM-based approaches with traditional disfluency detection methods and to develop novel metrics for quantitatively assessing disfluency phenomena. Additionally, we investigate the relationship between code-mixing and disfluency, exploring how factors such as switching frequency and direction influence the occurrence of disfluencies. By analyzing these intriguing dynamics, we seek to gain a deeper understanding of the mutual influence between code-mixing and disfluency in multilingual speech.
pdf
bib
abs
A Low-Resource Speech-Driven NLP Pipeline for Sinhala Dyslexia Assistance
Peshala Sandali Perera
|
Deshan Koshala Sumanathilaka
Dyslexia in adults remains an under-researched and under-served area, particularly in non-English-speaking contexts, despite its significant impact on personal and professional lives. This work addresses that gap by focusing on Sinhala, a low-resource language with limited tools for linguistic accessibility. We present an assistive system designed specifically for Sinhala-speaking adults with dyslexia. The system integrates Whisper for speech-to-text conversion, SinBERT, an open-source fine-tuned BERT model trained for Sinhala, to identify common dyslexic errors, and a combined mT5 and Mistral-based model to generate corrected text. Finally, the output is converted back to speech using gTTS, creating a complete multimodal feedback loop. Despite the challenges posed by limited Sinhala-language datasets, the system achieves 66% transcription accuracy and 70% correction accuracy with 65% overall system accuracy. These results demonstrate both the feasibility and effectiveness of the approach. Ultimately, this work highlights the importance of inclusive NLP technologies in underrepresented languages and showcases a practical step toward improving accessibility for adult dyslexic users.
pdf
bib
abs
Evaluating Transliteration Ambiguity in Adhoc Romanized Sinhala: A Dataset for Transliteration Disambiguation
Sandun Sameera Perera
|
Deshan Koshala Sumanathilaka
This paper introduces the first Transliteration disambiguation (TD) dataset for Romanized Sinhala, informally known as Singlish, developed to address the challenge of transliteration ambiguity in backwards transliteration tasks. The dataset covers 22 ambiguous Romanized Sinhala words, each mapping to two distinct Sinhala meanings, and provides 30 Romanized sentences per word: ten for each meaning individually and ten containing both meanings in context. Sentences were initially collected through web scraping and later post-processed using the Claude language model, which offers strong support for Sinhala, alongside a rule-based Romanization process to ensure linguistic quality and consistency. To demonstrate its applicability, the dataset was used to evaluate four existing back-transliteration systems, highlighting their performance in resolving context-sensitive ambiguities. Baseline evaluations confirm the dataset’s effectiveness in assessing transliteration systems’ ability to handle transliteration ambiguity, offering a valuable resource for advancing TD and transliteration research for Sinhala.
pdf
bib
abs
Detecting Deception in Disinformation across Languages: The Role of Linguistic Markers
Alba Perez-Montero
|
Silvia Gargova
|
Elena Lloret
|
Paloma Moreda Pozo
The unstoppable proliferation of news driven by the rise of digital media has intensified the challenge of news verification. Natural Language Processing (NLP) offers solutions, primarily through content and context analysis. Recognizing the vital role of linguistic analysis, this paper presents a multilingual study of linguistic markers for automated deceptive fake news detection across English, Spanish, and Bulgarian. We compiled datasets in these languages to extract and analyze both general and specific linguistic markers. We then performed feature selection using the SelectKBest algorithm, applying it to various classification models with different combinations of general and specific linguistic markers. The results show that Logistic Regression and Support Vector Machine classification models achieved F1-scores above 0.8 for English and Spanish. For Bulgarian, Random Forest yielded the best results with an F1-score of 0.73. While these markers demonstrate potential for transferability to other languages, results may vary due to inherent linguistic characteristics. This necessitates further experimentation, especially in low-resource languages like Bulgarian. These findings highlight the significant potential of our dataset and linguistic markers for multilingual deceptive news detection.
pdf
bib
abs
Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision
Dimitar Peshevski
|
Kiril Blazhevski
|
Martin Popovski
|
Gjorgji Madjarov
Effective document reranking is essential for improving search relevance across diverse applications. While Large Language Models (LLMs) excel at reranking due to their deep semantic understanding and reasoning, their high computational cost makes them impractical for many real-world deployments. Fine-tuning smaller, task-specific models is a more efficient alternative but typically depends on scarce, manually labeled data. To overcome this, we propose a novel pipeline that eliminates the need for human-labeled query-document pairs. Our method uses LLMs to generate synthetic queries from domain-specific corpora and employs an LLM-based classifier to label positive and hard-negative pairs. This synthetic dataset is then used to fine-tune a smaller transformer model with contrastive learning using Localized Contrastive Estimation (LCE) loss. Experiments on the MedQuAD dataset show that our approach significantly boosts in-domain performance and generalizes well to out-of-domain tasks. By using LLMs for data generation and supervision rather than inference, we reduce computational costs while maintaining strong reranking capabilities.
pdf
bib
abs
Q&A-LF : A French Question-Answering Benchmark for Measuring Fine-Grained Lexical Knowledge
Alexander Petrov
|
Alessandra Thais Mancas
|
Viviane Binet
|
Antoine Venant
|
Francois Lareau
|
Yves Lepage
|
Phillippe Langlais
We introduce Q&A-LF, a French question-answering benchmark designed to assess the extent to which large language models capture fine-grained lexical knowledge. We investigate the ability of ChatGPT-4o mini, Qwen2.5-14B, Llama3.0-8B, and Llama3.1-8B to answer questions based on lexical functions from Meaning-Text Theory. Using various prompting setups with different levels of examples and context, we find that Qwen and ChatGPT generally outperform Llama models, achieving up to 70% accuracy, while Llama models reach just above 60%. We identify LFs that are particularly easy or especially challenging for the models. We further investigate whether providing sentence-level context and one-shot prompting improve performance, especially on semantically complex functions.
pdf
bib
abs
Analysis of Vocabulary and Subword Tokenization Settings for Optimal Fine-tuning of MT: A Case Study of In-domain Translation
Javad Pourmostafa Roshan Sharami
|
Dimitar Shterionov
|
Pieter Spronck
The choice of vocabulary and subword (SW) tokenization has a significant impact on both training and fine-tuning of language and translation models. Fine-tuning is a common practice in optimizing a model with respect to new data. However, new data potentially introduces new words (or tokens), which, if not considered, may lead to suboptimal performance. In addition, the distribution of tokens in the new data can differ from the distribution of the original data. As such, the original SW tokenization model could be less suitable for the new data. With this work, we aim to gain better insights on the impact of SW tokenization and vocabulary generation on the performance of neural machine translation (NMT) models fine-tuned to a specific domain. To do so, we compare several strategies for SW tokenization and vocabulary generation and investigate the performance of the resulting models. Our findings show that the best way to fine-tune for domain adaptation is to consistently use both BPE and vocabulary from the in-domain data, which helps the model pick up on important domain-specific terms. At the same time, it is crucial not to lose sight of the vocabulary of the base (pre-trained) model—maintaining coverage of this vocabulary ensures the model keeps its general language abilities. The most successful configurations are those that introduce plenty of frequent domain terms while still retaining a substantial portion of the base model vocabulary, leading to noticeably better translation quality and adaptation, as seen in higher BLEU scores. These benefits, however, often come with greater computational costs, such as longer training times, since the model must learn more new tokens. Conversely, approaches that skip important domain terms or combine mismatched tokenization and vocabulary do not perform as well, making it clear that both domain-specific adaptation and broad vocabulary coverage matter—and that these gains are realized when the vocabulary preserves a good portion of the base (pre-trained) model. While using in-domain BPE and vocabulary yields the best domain adaptation, it substantially reduces out-of-domain translation quality. Hybrid configurations that combine base and domain vocabularies help balance this trade-off, maintaining broader translation capabilities alongside improved domain performance.
pdf
bib
abs
LLM-based Embedders for Prior Case Retrieval
Damith Premasiri
|
Tharindu Ranasinghe
|
Ruslan Mitkov
In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. The state-of-the-art deep learning IR methods have not been successful in PCR due to two key challenges: i. Lengthy legal text limitation; when using the powerful BERT-based transformer models, there is a limit on input text length, which inevitably requires shortening the input via truncation or division, with a loss of legal context information. ii. Lack of legal training data; due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively. In this research, we address these challenges by leveraging LLM-based text embedders in PCR. LLM-based embedders support longer input lengths, and since we use them in an unsupervised manner, they do not require training data, addressing both challenges simultaneously. In this paper, we evaluate state-of-the-art LLM-based text embedders in four PCR benchmark datasets and show that they outperform BM25 and supervised transformer-based models.
pdf
bib
abs
Exploiting Primacy Effect to Improve Large Language Models
Bianca Raimondi
|
Maurizio Gabbrielli
Large Language Models (LLMs) have become essential in many Natural Language Processing (NLP) tasks, leveraging extensive pre-training and fine-tuning to achieve high accuracy. However, like humans, LLMs exhibit biases, particularly positional biases such as primacy and recency effects, which can influence the accuracy of the answers. The primacy effect—where items presented first are more likely to be remembered or selected—plays a key role in Multiple Choice Question Answering (MCQA), where the order of answer options can affect prediction outcomes. This study focuses on primacy bias in fine-tuned LLMs: We first show that fine-tuning amplifies this bias, probably due to exposure to human-like patterns. Hence, we strategically leverage this effect, by reordering response options on the basis of semantic similarity to the query - without requiring knowledge of the correct answer. Our experimental results show that this approach significantly improves performance in MCQA. More generally, our findings underscore the dual nature of biases as both challenges and opportunities, offering insights for bias-aware model design and NLP applications.
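A minimal sketch of the reordering idea described above, assuming a generic sentence-embedding model rather than the paper's exact setup, could look like this:

```python
# Sketch: reorder MCQA answer options by semantic similarity to the question,
# so the most query-similar option appears first (exploiting primacy bias).
# The embedding model is an illustrative choice, not the paper's.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def reorder_options(question: str, options: list[str]) -> list[str]:
    """Return answer options sorted by cosine similarity to the question."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    o_emb = embedder.encode(options, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, o_emb)[0]
    ranked = sorted(zip(options, sims.tolist()), key=lambda x: x[1], reverse=True)
    return [opt for opt, _ in ranked]

# Hypothetical example: no knowledge of the gold answer is needed.
question = "Which planet is known as the Red Planet?"
options = ["Venus", "Mars", "Jupiter", "Saturn"]
print(reorder_options(question, options))
```

Note that the reordering only uses query-option similarity, which is why, as the abstract stresses, it requires no knowledge of the correct answer.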
pdf
bib
abs
Alankaar: A Dataset for Figurativeness Understanding in Bangla
Geetanjali Rakshit
|
Jeffrey Flanigan
Bangla has a rich written literature, making it replete with examples of creative use of language. There have been limited efforts to computationally analyze creative text in the Bangla language due to a lack of resources. We present Alankaar, a dataset of 2500 manually annotated examples of text fragments in Bangla containing metaphors. We also provide automatic and manual English translations of these examples. Additionally, we provide 2500 examples of non-metaphorical text in Bangla. We use this dataset to build a metaphor identification system in Bangla. We also use it as a test bed for cross-lingual metaphor translation, finding that not all metaphors translate literally across languages and that there are several cultural factors at play in the translation of metaphors. We hope this will advance metaphor translation research and the grounding of cultural nuances at work in machine translation.
pdf
bib
abs
ASQ: Automatically Generating Question-Answer Pairs Using AMRs
Geetanjali Rakshit
|
Jeffrey Flanigan
We introduce ASQ, a tool to automatically mine questions and answers from a sentence using the Abstract Meaning Representation (AMR). Previous work has used question-answer pairs to specify the predicate-argument structure of a sentence using natural language, which does not require linguistic expertise or training, and created datasets such as QA-SRL and QAMR, for which the question-answer pair annotations were crowdsourced. Our goal is to build a tool (ASQ) that maps from the traditional meaning representation AMR to a question-answer meaning representation (QMR). This enables construction of QMR datasets automatically in various domains using existing high-quality AMR parsers, and provides an automatic mapping AMR to QMR for ease of understanding by non-experts. A qualitative evaluation of the output generated by ASQ from the AMR 2.0 data shows that the question-answer pairs are natural and valid, and demonstrate good coverage of the content. We run ASQ on the sentences from the QAMR dataset, to observe that the semantic roles in QAMR are also captured by ASQ. We intend to make this tool and the results publicly available for others to use and build upon.
pdf
bib
abs
Multi-LLM Verification for Question Answering under Conflicting Contexts
Geetanjali Rakshit
|
Jeffrey Flanigan
Open-domain question answering (ODQA) often requires models to resolve conflicting evidence retrieved from diverse sources—a task that remains challenging even for state-of-the-art large language models (LLMs). While single-agent techniques such as self-verification and self-consistency have shown promise across natural language understanding and generation tasks, and multi-agent approaches involving collaborative or competitive strategies have recently emerged, their effectiveness for ODQA in the presence of conflicting contexts remains underexplored. In this work, we investigate these techniques using the QACC dataset as a case study. We find that incorporating a multi-agent verification step—where the best answer is selected from among outputs generated by different LLMs—leads to improved performance. Interestingly, we also observe that requiring explanations during the verification step does not always improve answer quality. Our experiments evaluate three strong LLMs (GPT-4o, Claude 4, and DeepSeek-R1) across a range of prompting and verification baselines.
pdf
bib
abs
Comparative Analysis of Human and Large Language Model Performance in Pharmacology Multiple-Choice Questions
Ricardo Rodriguez
|
Stéphane Huet
|
Benoit Favre
|
Mickael Rouvier
In this article, we study the answers generated by a selection of Large Language Models to a set of Multiple Choice Questions in Pharmacology, and compare them to the answers provided by students, to understand which questions in this clinical domain are difficult for the models when compared to humans and why. We extract the internal logits to infer probability distributions and analyse the main features that determine the difficulty of questions using statistical methods. We also provide an extension to the FrenchMedMCQA dataset, with pairs of question-answers in pharmacology, enriched with student response rate, answer scoring, clinical topics, and annotations on question structure and semantics.
pdf
bib
abs
Enhancing Textual Understanding: Automated Claim Span Identification in English, Hindi, Bengali, and CodeMix
Rudra Roy
|
Pritam Pal
|
Dipankar Das
|
Saptarshi Ghosh
|
Biswajit Paul
Claim span identification, a crucial task in Natural Language Processing (NLP), aims to extract specific claims from texts. Such claim spans can be further utilized in various critical NLP applications, such as claim verification, fact-checking, and opinion mining, among others. The present work proposes a multilingual claim span identification framework for handling social media data in English, Hindi, Bengali, and CodeMixed texts, leveraging the strengths and knowledge of transformer-based pre-trained models. Our proposed framework efficiently identifies the contextual relationships between words and precisely detects claim spans across all languages, achieving a high F1 score and Jaccard score. The source code and datasets are available at: https://github.com/pritampal98/claim-span-multilingual
pdf
bib
abs
Detecting Fake News in the Era of Language Models
Muhammad Irfan Fikri Sabri
|
Hansi Hettiarachchi
|
Tharindu Ranasinghe
The proliferation of fake news has been amplified by the advent of large language models (LLMs), which can generate highly realistic and scalable misinformation. While prior studies have focused primarily on detecting human-generated fake news, the efficacy of current models against LLM-generated content remains underexplored. We address this gap by compiling a novel dataset combining public and LLM-generated fake news, redefining detection as a ternary classification task (real, human-generated fake, LLM-generated fake), and evaluating eight diverse classification models, including traditional machine learning, fine-tuned transformers, and few-shot prompted LLMs. Our findings highlight the strengths and limitations of these models in detecting evolving LLM-generated fake news, offering insights for future detection strategies.
pdf
bib
abs
Cyberbullying Detection via Aggression-Enhanced Prompting
Aisha Saeid
|
Anu Sabu
|
Girish Koushik
|
Ferrante Neri
|
Diptesh Kanojia
Detecting cyberbullying on social media remains a critical challenge due to its subtle and varied expressions. This study investigates whether integrating aggression detection as an auxiliary task within a unified training framework can enhance the generalisation and performance of large language models (LLMs) in cyberbullying detection. Experiments are conducted on five aggression datasets and one cyberbullying dataset using instruction-tuned LLMs. We evaluated multiple strategies: zero-shot, few-shot, independent LoRA fine-tuning, and multi-task learning (MTL). Given the inconsistent results of MTL, we propose an enriched prompt pipeline approach in which aggression predictions are embedded into cyberbullying detection prompts to provide contextual augmentation. Preliminary results show that the enriched prompt pipeline consistently outperforms standard LoRA fine-tuning, indicating that aggression-informed context significantly boosts cyberbullying detection. This study highlights the potential of auxiliary tasks, such as aggression detection, to improve the generalisation of LLMs for safety-critical applications on social networks.
pdf
bib
abs
Lingdex.org:Leveraging LLMs to Structure and Explore Linguistic Olympiad Puzzles for Learning and Teaching Linguistics
Jonathan Sakunkoo
|
Annabella Sakunkoo
Linguistics Olympiad puzzles provide a valuable but underutilized resource for teaching linguistic reasoning, typology, and cross-cultural understanding. Many of these puzzles feature endangered and low-resource languages and thus offer a rare opportunity to integrate linguistic diversity into education at a time when over 40% of the world’s languages face extinction. This paper presents Lingdex, a novel web-based platform that leverages large language models (LLMs) to classify, organize, and enliven Linguistics Olympiad problems across various linguistic categories such as syntax, morphology, semantics, phonology, and language families. By applying NLP techniques to the multilingual and multicultural corpora of linguistics puzzles drawn from international and national Olympiads, Lingdex supports language and linguistics education, problem-based learning, and curriculum development. The visual, interactive platform also includes problems based on endangered and rare languages to raise awareness and interest in linguistic diversity. We present results from a user study that shows increased learner interest and appreciation for global linguistic richness.
pdf
bib
abs
When Does Language Transfer Help? Sequential Fine-Tuning for Cross-Lingual Euphemism Detection
Julia Sammartino
|
Libby Barak
|
Jing Peng
|
Anna Feldman
Euphemisms are culturally variable and often ambiguous, posing challenges for language models, especially in low-resource settings. This paper investigates how cross-lingual transfer via sequential fine-tuning affects euphemism detection across five languages: English, Spanish, Chinese, Turkish, and Yorùbá. We compare sequential fine-tuning with monolingual and simultaneous fine-tuning using XLM-R and mBERT, analyzing how performance is shaped by language pairings, typological features, and pretraining coverage. Results show that sequential fine-tuning with a high-resource L1 improves L2 performance, especially for low-resource languages like Yorùbá and Turkish. XLM-R achieves larger gains but is more sensitive to pretraining gaps and catastrophic forgetting, while mBERT yields more stable, though lower, results. These findings highlight sequential fine-tuning as a simple yet effective strategy for improving euphemism detection in multilingual models, particularly when low-resource languages are involved.
pdf
bib
abs
Modelling the Relative Contributions of Stylistic Features in Forensic Authorship Attribution
G. Çağatay Sat
|
John Blake
|
Evgeny Pyshkin
This paper explores the extent to which stylistic features contribute to the task of authorship attribution in forensic contexts. Drawing on a filtered subset of the Enron email corpus, the study operationalizes stylistic indicators across four groups: lexical, syntactic, orthographic, and discoursal. Using the R programming language for feature engineering and logistic regression modelling, we systematically assessed both the individual and interactive effects of these features on attribution accuracy. Results show that n-gram similarity consistently outperformed all other features, with the combined model of n-gram similarity and its interaction with other features achieving accuracy, precision and F1 scores of 91.6%, 93.3% and 91.7% respectively. The model was subsequently evaluated on a subset of the TEL corpus to assess its applicability in a forensic setting. The findings highlight the dominant role of lexical similarity and suggest that integrating interaction effects can yield further performance gains in forensic authorship analysis.
pdf
bib
abs
The Hidden Cost of Structure: How Constrained Decoding Affects Language Model Performance
Maximilian Schall
|
Gerard de Melo
Large Language Models excel at generating fluent text, but real-world applications increasingly demand structured outputs like JSON that can be programmatically processed. While prior work examines either task performance or format compliance in isolation, we investigate their interaction through comprehensive experiments across 11 models and multiple benchmarks. We uncover a fundamental divergence between base and instruction-tuned models under structural constraints. Base models often benefit from constrained decoding, producing more precise outputs, while instruction-tuned models frequently suffer performance degradation on generation tasks despite maintaining stability on classification tasks. Our log probability analysis reveals the underlying mechanism: constrained decoding forces models away from their preferred natural language patterns into lower-confidence structured alternatives. We demonstrate that successful constrained generation requires both adapted prompts and sufficient few-shot examples, with constrained models showing steeper performance gains from additional demonstrations compared to unconstrained generation. Notably, we find that base model performance under constraints can serve as an early indicator of post-training structured output capabilities, offering a practical evaluation tool for model development. These findings suggest that current instruction-tuning practices may inadvertently reduce models’ structured output capabilities and highlight the need for training-time integration of structural constraints in future model development.
pdf
bib
abs
A Question-Answering Based Framework/Metric for Evaluation of Newspaper Article Summarization
Vasanth Seemakurthy
|
Shashank Sundar
|
Siddharth Arvind
|
Siddhant Jagdish
|
Ashwini M. Joshi
Condensed summaries of newspaper articles cater to the modern need for easily digestible content amid shrinking attention spans. However, current summarization systems often produce extracts failing to capture the essence of original articles. Traditional evaluation metrics like ROUGE also provide limited insights into whether key information is preserved in the summaries. To address this, we propose a pipeline to generate high-quality summaries tailored for newspaper articles and evaluate them using a question-answering based metric. Our system segments input newspaper images, extracts text, and generates summaries. We also generate relevant questions from the original articles and use a question-answering model to assess how well the summaries can answer these queries to evaluate summary quality beyond just lexical overlap. Experiments on real-world data show the potential effectiveness of our approach in contrast to conventional metrics. Our framework holds promise for enabling reliable news summary generation and evaluation systems.
pdf
bib
abs
Efficient Financial Fraud Detection on Mobile Devices Using Lightweight Large Language Models
Lakpriya Senevirathna
|
Deshan Koshala Sumanathilaka
The growth of mobile financial transactions presents new challenges for fraud detection, where traditional and ML methods often miss emerging patterns. While Large Language Models (LLMs) offer advanced language understanding, they are typically too resource-intensive for mobile deployment and raise privacy concerns due to cloud reliance. This paper proposes a lightweight, privacy-preserving approach by fine-tuning and quantizing compact LLMs for on-device fraud detection from textual data. Models were optimized using Open Neural Network Exchange (ONNX) conversion and quantization to ensure efficiency. The fine-tuned quantized Llama-160M-Chat-v1 (bnb4) achieved 99.47% accuracy with a 168MB footprint, while fine-tuned quantized Qwen1.5-0.5B-Chat (bnb4) reached 99.50% accuracy at 797MB. These results demonstrate that optimized LLMs can deliver accurate, real-time fraud detection on mobile devices without compromising user privacy.
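As a general illustration of the post-training optimization step mentioned above (ONNX export followed by quantization), the sketch below applies dynamic 8-bit quantization to an already exported ONNX model with ONNX Runtime. File names are placeholders, and the paper's own models use 4-bit (bnb4) quantization rather than this exact procedure.

```python
# Sketch: dynamic 8-bit quantization of an exported ONNX model with ONNX Runtime.
# File paths are placeholders; the paper's pipeline uses 4-bit (bnb4) variants,
# so this is a generic example of shrinking a model for on-device inference.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="fraud_classifier.onnx",        # previously exported model (placeholder)
    model_output="fraud_classifier.int8.onnx",  # smaller, quantized model file
    weight_type=QuantType.QInt8,                # quantize weights to 8-bit integers
)
```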
pdf
bib
abs
Contextual Cues in Machine Translation: Investigating the Potential of Multi-Source Input Strategies in LLMs and NMT Systems
Lia Shahnazaryan
|
Patrick Simianer
|
Joern Wuebker
We explore the impact of multi-source input strategies on machine translation (MT) quality, comparing GPT-4o, a large language model (LLM), with a traditional multilingual neural machine translation (NMT) system. Using intermediate language translations as contextual cues, we evaluate their effectiveness in enhancing English and Chinese translations into Portuguese. Results suggest that contextual information significantly improves translation quality for domain-specific datasets and potentially for linguistically distant language pairs, with diminishing returns observed in benchmarks with high linguistic variability. Additionally, we demonstrate that shallow fusion, a multi-source approach we apply within the NMT system, shows improved results when using high-resource languages as context for other translation pairs, highlighting the importance of strategic context language selection.
pdf
bib
abs
Exposing Pink Slime Journalism: Linguistic Signatures and Robust Detection against LLM-Generated Threats
Sadat Shahriar
|
Navid Ayoobi
|
Arjun Mukherjee
|
Mostafa Musharrat
|
Sai Vishnu Vamsi Senagasetty
The local news landscape, a vital source of reliable information for 28 million Americans, faces a growing threat from Pink Slime Journalism: low-quality, auto-generated articles that mimic legitimate local reporting. Detecting these deceptive articles requires a fine-grained analysis of their linguistic, stylistic, and lexical characteristics. In this work, we conduct a comprehensive study to uncover the distinguishing patterns of Pink Slime content and propose detection strategies based on these insights. Beyond traditional generation methods, we highlight a new adversarial vector: modifications through large language models (LLMs). Our findings reveal that even consumer-accessible LLMs can significantly undermine existing detection systems, reducing their performance by up to 40% in F1-score. To counter this threat, we introduce a robust learning framework specifically designed to resist LLM-based adversarial attacks and adapt to the evolving landscape of automated pink slime journalism, showing an improvement of up to 27%.
pdf
bib
abs
The Erosion of LLM Signatures: Can We Still Distinguish Human and LLM-Generated Scientific Ideas after Iterative Paraphrasing?
Sadat Shahriar
|
Navid Ayoobi
|
Arjun Mukherjee
With the increasing reliance on LLMs as research agents, distinguishing between LLM and human-generated ideas has become crucial for understanding the cognitive nuances of LLMs’ research capabilities. While detecting LLM-generated text has been extensively studied, distinguishing human vs LLM-generated *scientific ideas* remains an unexplored area. In this work, we systematically evaluate the ability of state-of-the-art (SOTA) machine learning models to differentiate between human and LLM-generated ideas, particularly after successive paraphrasing stages. Our findings highlight the challenges SOTA models face in source attribution, with detection performance declining by an average of 25.4% after five consecutive paraphrasing stages. Additionally, we demonstrate that incorporating the research problem as contextual information improves detection performance by up to 2.97%. Notably, our analysis reveals that detection algorithms struggle significantly when ideas are paraphrased into a simplified, non-expert style, contributing the most to the erosion of distinguishable LLM signatures.
pdf
bib
abs
Deep Language Geometry: Constructing a Metric Space from LLM Weights
Maksym Shamrai
|
Vladyslav Hamolia
We introduce a novel framework that utilizes the internal weight activations of modern Large Language Models (LLMs) to construct a metric space of languages. Unlike traditional approaches based on hand-crafted linguistic features, our method automatically derives high-dimensional vector representations by computing weight importance scores via an adapted pruning algorithm. Our approach captures intrinsic language characteristics that reflect linguistic phenomena. We validate our approach across diverse datasets and multilingual LLMs, covering 106 languages. The results align well with established linguistic families while also revealing unexpected inter-language connections that may indicate historical contact or language evolution. The source code, computed language latent vectors, and visualization tool are made publicly available at https://github.com/mshamrai/deep-language-geometry.
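A toy sketch of the final step described above, turning per-language vectors into a metric space, is shown below. The vectors here are random placeholders, whereas the paper derives them from weight importance scores of an LLM.

```python
# Sketch: build a pairwise distance matrix over per-language vectors, as a
# stand-in for the metric space described above. The vectors are random
# placeholders; the paper computes them from LLM weight importance scores.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
languages = ["bg", "de", "uk", "tr"]           # illustrative language codes
vectors = rng.random((len(languages), 128))    # placeholder importance vectors

dist = squareform(pdist(vectors, metric="cosine"))
for i, lang in enumerate(languages):
    # For each language, report its nearest neighbour in the induced metric space.
    nearest = min((j for j in range(len(languages)) if j != i), key=lambda j: dist[i, j])
    print(f"{lang}: nearest language by cosine distance -> {languages[nearest]}")
```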
pdf
bib
abs
Cross-Lingual Fact Verification: Analyzing LLM Performance Patterns across Languages
Hanna Shcharbakova
|
Tatiana Anikina
|
Natalia Skachkova
|
Josef van Genabith
Fact verification has emerged as a critical task in combating misinformation, yet most research remains focused on English-language applications. This paper presents a comprehensive analysis of multilingual fact verification capabilities across three state-of-the-art large language models: Llama 3.1, Qwen 2.5, and Mistral Nemo. We evaluate these models on the X-Fact dataset that includes 25 typologically diverse languages, examining both seen and unseen languages through various evaluation scenarios. Our analysis employs few-shot prompting and LoRA fine-tuning approaches, revealing significant performance disparities based on script systems, with Latin script languages consistently outperforming others. We identify systematic cross-lingual instruction following failures, particularly affecting languages with non-Latin scripts. Surprisingly, some officially supported languages, such as Indonesian and Polish, which are not high-resourced languages, achieve better performance than high-resource languages like German and Spanish, challenging conventional assumptions about resource availability and model performance. The results highlight critical limitations in current multilingual LLMs for the fact verification task and provide insights for developing more inclusive multilingual systems.
pdf
bib
abs
ESAQueryRank: Ranking Query Interpretations for Document Retrieval Using Explicit Semantic Analysis
Avijeet Shil
|
Wei Jin
Representing query translation into relevant entities is a critical component of an information retrieval system. This paper proposes an unsupervised framework, ESAQueryRank, designed to process natural language queries by mapping n-gram phrases to Wikipedia titles and ranking potential entity and phrase combinations using Explicit Semantic Analysis. Unlike previous approaches, this framework does not rely on query expansion, syntactic parsing, or manual annotation. Instead, it leverages Wikipedia metadata—such as titles, redirects, and disambiguation pages—to disambiguate entities and identify the most relevant ones based on cosine similarity in the ESA space. ESAQueryRank is evaluated using a random set of TREC questions and compared against a keyword-based approach and a context-based question translation model (CBQT). In all comparisons of full category types, ESAQueryRank consistently shows better results against both methods. Notably, the framework excels with more complex queries, achieving improvements in Mean Reciprocal Rank (MRR) of up to 480% for intricate queries like those beginning with “Why,” even without explicitly incorporating the question type. These results demonstrate that ESAQueryRank is an effective, transparent, and domain-independent framework for building natural language interfaces.
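To make the ranking step concrete, the toy sketch below ranks candidate query interpretations by cosine similarity to a query vector in a concept space. The vectors are illustrative placeholders, since constructing genuine ESA vectors from Wikipedia is beyond the scope of a snippet.

```python
# Sketch: rank candidate query interpretations by cosine similarity in an
# ESA-style concept space. The concept vectors here are toy numpy arrays.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def rank_interpretations(query_vec, candidates):
    """candidates: dict mapping interpretation label -> concept vector."""
    scored = {label: cosine(query_vec, vec) for label, vec in candidates.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with 4-dimensional "concept" vectors (placeholders):
query = np.array([0.2, 0.7, 0.0, 0.1])
candidates = {
    "Jaguar (animal)": np.array([0.1, 0.8, 0.0, 0.0]),
    "Jaguar Cars":     np.array([0.0, 0.1, 0.9, 0.3]),
}
print(rank_interpretations(query, candidates))  # highest-similarity interpretation first
```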
pdf
bib
abs
Personalized Author Obfuscation with Large Language Models
Mohammad Shokri
|
Sarah Ita Levitan
|
Rivka Levitan
In this paper, we investigate the efficacy of large language models (LLMs) in obfuscating authorship by paraphrasing and altering writing styles. Rather than adopting a holistic approach that evaluates performance across the entire dataset, we focus on user-wise performance to analyze how obfuscation effectiveness varies across individual authors. While LLMs are generally effective, we observe a bimodal distribution of efficacy, with performance varying significantly across users. To address this, we propose a personalized prompting method that outperforms standard prompting techniques and partially mitigates the bimodality issue.
pdf
bib
abs
Bulgarian Event Extraction with LLMs
Kiril Simov
|
Nikolay Paev
|
Petya Osenova
|
Stefan Marinov
The paper presents the results from experiments with two large language models (LLMs), T5 and Llama, for extracting events from a Bulgarian event corpus. The two models were pretrained by us on a 35-billion-token Bulgarian corpus. The extraction was performed within the context of one sentence. Our approach aims at balancing the ACE-oriented approach, which uses triggers in event detection, and the MUC-oriented one, which uses more general event types. The evaluation relies on the IoU (Intersection over Union) of token spans and is twofold. The first part refers to the predicted event token span: if the span is correct, the semantic roles within the event are further checked. The second part refers to the triple of an event type, its semantic roles, and participants. The results are promising. A qualitative evaluation is provided as well.
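The span-level IoU used in the evaluation can be illustrated in a few lines; the span convention (start index, exclusive end index) is an assumption for the example, not necessarily the one used by the authors.

```python
# Sketch: Intersection over Union (IoU) of two token spans, as used when
# comparing a predicted event span against a gold span. Spans are (start, end)
# token indices with an exclusive end; this convention is assumed.
def span_iou(pred: tuple[int, int], gold: tuple[int, int]) -> float:
    pred_tokens = set(range(*pred))
    gold_tokens = set(range(*gold))
    union = pred_tokens | gold_tokens
    if not union:
        return 0.0
    return len(pred_tokens & gold_tokens) / len(union)

print(span_iou((3, 8), (5, 10)))  # tokens 5-7 overlap, union covers 3-9 -> 3/7
```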
pdf
bib
abs
FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback
Ashish Singh
|
Ashutosh Singh
|
Prateek Agarwal
|
Zixuan Huang
|
Arpita Singh
|
Tong Yu
|
Sungchul Kim
|
Victor Soares Bursztyn
|
Nesreen K. Ahmed
|
Puneet Mathur
|
Erik Learned-Miller
|
Franck Dernoncourt
|
Ryan Rossi
Captions are crucial for understanding scientific visualizations and documents. Existing captioning methods for scientific figures rely on figure-caption pairs extracted from documents for training, many of which fall short with respect to metrics like helpfulness, explainability, and visual-descriptiveness, leading to generated captions being misaligned with reader preferences. To address this issue, we introduce FigCaps-HF, a new framework for figure-caption generation that can incorporate domain expert feedback in generating captions optimized for reader preferences. Our framework comprises 1) an automatic method for evaluating the quality of figure-caption pairs, and 2) a novel reinforcement learning with human feedback (RLHF) method to optimize a generative figure-to-caption model for reader preferences. We demonstrate the effectiveness of our simple learning framework by improving performance over standard fine-tuning across different types of models. In particular, when using BLIP as the base model, our RLHF framework achieves a mean gain of 35.7%, 16.9%, 9%, and 11.4% in ROUGE, BLEU, Meteor, and CIDEr scores, respectively. Finally, we release a large-scale benchmark dataset with human feedback on figure-caption pairs to enable further evaluation and development of RLHF techniques for this problem.
pdf
bib
abs
LLM Compression: How Far Can We Go in Balancing Size and Performance?
Sahil Sk
|
Debashish Dhal
|
Sonal Khosla
|
Akash Dhaka
|
Shantipriya Parida
|
Sk Shahid
|
Sambit Shekhar
|
Dilip Prasad
|
Ondrej Bojar
Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on the MS MARCO (Information Retrieval), BoolQ (Boolean Question Answering), and GSM8K (Mathematical Reasoning) datasets, assessing both accuracy and efficiency across various tasks. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics, namely accuracy, inference latency, and throughput. It provides insights into the suitability of low-bit quantization for real-world deployment and highlights the trade-offs between memory, compute, and latency in such settings, helping users make informed decisions.
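As context for the low-bit setting studied here, the snippet below shows one common way to load a small causal LM with 4-bit weights via Hugging Face transformers and bitsandbytes; it is a generic sketch, not the paper's GSQ/GPTQ pipeline, and the model name is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_quant_type="nf4",             # normal-float 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

model_name = "Qwen/Qwen2-0.5B"  # placeholder small model, not the paper's exact checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quant_config, device_map="auto"
)

inputs = tokenizer("What is 17 + 25?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```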
pdf
bib
abs
Pushing the (Generative) Envelope: Measuring the Effect of Prompt Technique and Temperature on the Generation of Model-based Systems Engineering Artifacts
Erin Smith Crabb
|
Cedric Bernard
|
Matthew Jones
|
Daniel Dakota
System engineers use Model-based systems engineering (MBSE) approaches to help design and model system requirements. This manually intensive process requires expertise both in the domain of artifact creation (e.g., the requirements for a vacuum) and in how to encode that information in a machine-readable form (e.g., SysML). We investigated leveraging local LLMs to generate initial draft artifacts using a variety of prompt techniques and temperatures. Our experiments showed promise for generating certain types of artifacts, suggesting that even smaller, local models possess enough MBSE knowledge to support system engineers. We observed, however, that while scores for artifacts remain stable across different temperature settings, this is potentially misleading, as significantly different, though semantically equivalent, generations can be produced.
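To make the temperature dimension concrete, here is a minimal sketch of sampling the same prompt at several temperatures with a local Hugging Face model; the checkpoint and prompt are placeholders, not the paper's MBSE setup.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")  # placeholder local model
prompt = "The vacuum cleaner shall"

for temperature in (0.2, 0.7, 1.2):
    out = generator(prompt, do_sample=True, temperature=temperature,
                    max_new_tokens=30, num_return_sequences=1)
    # Higher temperatures yield more varied, sometimes semantically equivalent drafts
    print(f"T={temperature}: {out[0]['generated_text']!r}")
```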
pdf
bib
abs
Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch
Elza Strazda
|
Gerasimos Spanakis
Language models are prone to exhibiting biases, further amplifying unfair and harmful stereotypes. Given the fast-growing popularity and wide application of these models, it is necessary to ensure safe and fair language models. Recently, considerable attention has been paid to measuring bias in language models, yet the majority of studies have focused only on the English language. A Dutch version of the US-specific CrowS-Pairs dataset for measuring bias in Dutch language models is introduced. The resulting dataset consists of 1463 sentence pairs that cover bias in 9 categories, such as Sexual orientation, Gender and Disability. The sentence pairs are composed of contrasting sentences, where one of the sentences concerns disadvantaged groups and the other advantaged groups. Using the Dutch CrowS-Pairs dataset, we show that various language models (BERTje, RobBERT, multilingual BERT, GEITje, and Mistral-7B) exhibit substantial bias across the various bias categories. Using the English and French versions of the CrowS-Pairs dataset, bias was evaluated in English (BERT and RoBERTa) and French (FlauBERT and CamemBERT) language models, and it was shown that English models exhibit the most bias, whereas Dutch models exhibit the least. Additionally, results indicate that assigning a persona to a language model changes the level of bias it exhibits. These findings highlight the variability of bias across languages and contexts, suggesting that cultural and linguistic factors play a significant role in shaping model biases.
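CrowS-Pairs-style evaluations typically compare the pseudo-log-likelihood a masked language model assigns to the two sentences of a pair; the sketch below illustrates that idea with a multilingual BERT checkpoint and placeholder sentences, and is not the paper's exact scoring code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "bert-base-multilingual-cased"  # stand-in for the Dutch models evaluated
tok = AutoTokenizer.from_pretrained(name)
mlm = AutoModelForMaskedLM.from_pretrained(name).eval()

def pseudo_log_likelihood(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id         # mask one token at a time
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

pair = ("Placeholder sentence about the disadvantaged group.",
        "Placeholder sentence about the advantaged group.")
scores = [pseudo_log_likelihood(s) for s in pair]
print("model prefers:", pair[scores.index(max(scores))])
```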
pdf
bib
abs
The Challenge of Performing Ontology-driven Entity Extraction in Real-world Unstructured Textual Data from the Domain of Dementia
Sumaiya Suravee
|
Carsten Oliver Schmidt
|
Kristina Yordanova
Named entity recognition allows the automated extraction of structured domain-related information from unstructured textual data. Our study explores the task of ontology-driven entity recognition, a sequence labelling process for custom named entity recognition in the domain of dementia, specifically from unstructured forum texts where non-professional caregivers of people with dementia discuss the challenges they face related to agitation. The targeted corpus is loosely structured and contains ambiguous sentences and vocabulary that does not match the agitation-related medical vocabulary. To address the above challenges, we propose a pipeline that involves the following steps: 1) development of an annotation codebook; 2) annotation of a textual corpus collected from dementia forums, consisting of 45,216 sentences (775 questions and 5571 answers); 3) data augmentation to reduce the imbalance in the corpus; 4) training of a bidirectional LSTM model and a transformer model; 5) comparison of the results with those from few-shot and zero-shot prompt engineering techniques using a pretrained large language model (LLaMa 3). The results showed that LLaMa 3 was more robust than traditional neural networks and transformer models in detecting underrepresented entities. Furthermore, the study demonstrates that data augmentation improves the entity recognition task when fine-tuning deep learning models. The paper illustrates the challenges of ontology-driven entity recognition in real-world datasets and proposes a roadmap to addressing them that is potentially transferable to other real-world domains.
pdf
bib
abs
Recognizing the Structure and Content of Hungarian Civil Registers
Kata Ágnes Szűcs
|
Noémi Vadász
|
Zsolt Béla Záros
The study evaluates key steps in a system for processing data from digitized Hungarian state register records (1895-1980) into an SQL database. It examines how template selection and post-processing impact data accessibility and integration. The research details the compiled datasets, annotation processes, and evaluation functions used to measure processing quality, emphasizing template selection and post-processing to improve the overall workflow and the accuracy of the published data. An evaluation method for publishing structured data provides a model for similar projects.
pdf
bib
abs
Optimism, Pessimism, and the Language between: Model Interpretability and Psycholinguistic Profiling
Stefana Arina Tabusca
|
Liviu P. Dinu
This study explores how optimism and pessimism are expressed in social media by combining psycholinguistic profiling with model interpretability. Using the OPT dataset, we fine-tune a RoBERTa-based classifier and apply LIME to examine both the most confident and the most ambiguous predictions. We analyze the influential tokens driving these decisions and identify lexical patterns linked to affective intensity, certainty, and social orientation. A complementary LIWC-based analysis of ground truth labels reveals systematic differences in emotional tone and cognitive style. PCA projections further show that optimism and pessimism occupy overlapping yet distinguishable regions in psycholinguistic space. Our findings demonstrate the value of linguistic interpretability in understanding dispositional sentiment.
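Below is a minimal sketch of the LIME-based token attribution described above, using an off-the-shelf sentiment checkpoint as a stand-in for the fine-tuned RoBERTa classifier; the model, labels, and example text are illustrative assumptions.

```python
import numpy as np
from lime.lime_text import LimeTextExplainer
from transformers import pipeline

# Placeholder classifier standing in for the fine-tuned RoBERTa model
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english", top_k=None)
labels = ["NEGATIVE", "POSITIVE"]  # loosely mapped onto pessimism / optimism here

def predict_proba(texts):
    # LIME expects an (n_samples, n_classes) probability matrix
    out = clf(list(texts))
    return np.array([[next(d["score"] for d in row if d["label"] == lab)
                      for lab in labels] for row in out])

explainer = LimeTextExplainer(class_names=labels)
exp = explainer.explain_instance("Things will surely get better soon.",
                                 predict_proba, num_features=5)
print(exp.as_list())  # tokens most responsible for the prediction
```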
pdf
bib
abs
Demographic Features for Annotation-Aware Classification
Narjes Tahaei
|
Sabine Bergler
This paper revisits the use of annotator demographics as interpretable meta-information for modeling such variation. We adapt a lightweight attention mechanism, Annotation-Wise Attention Network (AWAN), to condition predictions on demographic features, enabling per-annotator modeling. Experiments on the EXIST sexism dataset show that AWAN improves classification performance over standard baselines, especially in cases of high annotator disagreement.
pdf
bib
abs
Exploring the Performance of Large Language Models for Event Detection and Extraction in the Health Domain
Hristo Tanev
|
Nicolas Stefanovitch
|
Tomáš Harmatha
|
Diana F. Sousa
Large language models (LLMs) have entered the world of NLP at a fast pace. They have been used for summarization, translation, named entity recognition, and sentiment analysis. Recently, different research groups have experimented with event detection and extraction, using LLMs at various stages of processing: LLMs have proven to be a very relevant technology from data preparation to event argument extraction. In particular, open-source LLMs like Mistral are very important, since they can be shared and modified by the research community. Still, little effort has been made to study the performance of these models on NLP tasks like event extraction. In this paper we describe an experiment in evaluating several state-of-the-art open large language models (LLMs) for the task of event extraction and event detection in the domain of health. The models were prompted to detect health-related events - mostly disease outbreaks, but also natural and man-made disasters, which directly or indirectly have an impact on the health of people. The models were also asked to extract the place, time, number of human and animal cases, and the number of human fatalities. The performance of the LLMs turned out to be better than that of a state-of-the-art knowledge-based system, using as test data a set of 800 news abstracts containing the title and the lead sentences of health-related news articles. We compared the performance of event detection and event argument extraction from the open large language models and two knowledge-based event extraction systems, NEXUS and Medical NEXUS. Our evaluation shows that all the open LLMs show superior performance w.r.t. the knowledge-based systems, with the best improvement in F1 score of 0.2 (0.84 vs. 0.64) for detecting the number of human fatalities, where the best performing LLM was LLama 3.3 70B instruct.
pdf
bib
abs
Leveraging LLaMa for Abstractive Text Summarisation in Malayalam: An Experimental Study
Hristo Tanev
|
Anitha S. Pillai
|
Revathy V. R
Recent years have witnessed tremendous advancements in natural language processing (NLP) thanks to the development of complex language models that have automated several NLP applications, including text summarisation. Despite this progress, Malayalam text summarisation still faces challenges because of the peculiarities of the language. This research paper explores the potential of using a large language model, specifically the LLaMA (Large Language Model Meta AI) framework, for text summarisation of the Malayalam language. In order to assess the performance of LLaMA on text summarisation for the low-resource language Malayalam, a dataset was curated with reference texts and summaries. The evaluation showed that the LLaMA model could effectively summarise lengthy articles while maintaining important information and coherence. The generated summaries were compared with the reference summaries written by human writers to observe how well aligned the model was with a human level of summarisation. The results showed that LLMs can handle the Malayalam text summarisation task, but more research is needed to understand the most relevant training strategy.
pdf
bib
abs
Building a Clean Bartangi Language Corpus and Training Word Embeddings for Low-Resource Language Modeling
Warda Tariq
|
Victor Popov
|
Vasilii Gromov
In this paper, we showcase a comprehensive end-to-end pipeline for creating a high-quality Bartangi language corpus and using it for training word embeddings. The critically low-resource Pamiri language of Bartangi, which is spoken in Tajikistan, presents difficulties such as morphological complexity, orthographic variety, and a lack of data. In order to overcome these obstacles, we gathered a raw corpus of roughly 6,550 phrases, used the Uniparser-Morph-Bartangi morphological analyzer for linguistically accurate lemmatization, and implemented a thorough cleaning procedure to eliminate noise and ensure proper tokenization. The resulting lemmatized corpus greatly lowers word sparsity and raises the standard of linguistic analysis. The processed corpus was then used to train two different Word2Vec models, Skip-gram and CBOW, with a vector size of 100, a context window of 5, and a minimum frequency threshold of 1. The resultant word embeddings were visualised using dimensionality reduction techniques like PCA and t-SNE, and assessed using intrinsic methods like nearest-neighbor similarity tests. Our tests show that even from tiny datasets, meaningful semantic representations can be obtained by combining informed morphological analysis with clean preprocessing. One of the earliest computational datasets for Bartangi, this resource serves as a vital basis for upcoming NLP tasks, such as language modeling, semantic analysis, and low-resource machine translation. To promote more research in Pamiri and other under-represented languages, we make the corpus, lemmatizer pipeline, and trained embeddings publicly available.
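A small sketch reproducing the stated Word2Vec training configuration (vector size 100, window 5, minimum count 1) with gensim; the toy token lists stand in for the lemmatized Bartangi sentences.

```python
from gensim.models import Word2Vec

sentences = [
    ["placeholder", "lemmatized", "bartangi", "sentence"],
    ["another", "short", "sentence"],
]  # in practice: the cleaned, lemmatized corpus, one token list per sentence

common = dict(vector_size=100, window=5, min_count=1, workers=4)
cbow = Word2Vec(sentences, sg=0, **common)      # CBOW
skipgram = Word2Vec(sentences, sg=1, **common)  # Skip-gram

# Intrinsic nearest-neighbour check, as in the evaluation described above
print(skipgram.wv.most_similar("sentence", topn=3))
```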
pdf
bib
abs
A Deep Dive into Multi-Head Attention and Multi-Aspect Embedding
Maryam Teimouri
|
Jenna Kanerva
|
Filip Ginter
Multi-vector embedding models play an increasingly important role in retrieval-augmented generation, yet their internal behaviour lacks comprehensive analysis. We conduct a systematic, head-level study of the 32-head Semantic Feature Representation (SFR) encoder with the FineWeb corpus containing 10 billion tokens. For a set of 4,000 web documents, we pair head-specific embeddings with GPT-4o topic annotations and analyse the results using t-SNE visualisations, heat maps, and a 32-way logistic probe. The analysis shows that (i) clear semantic separation between heads emerges only at an intermediate layer, (ii) some heads align with specific topics while others capture broader corpus features, and (iii) naive pooling of head outputs can blur these distinctions, leading to frequent topic mismatches. The study offers practical guidance on where to extract embeddings, which heads may be pruned, and how to aggregate them to support more transparent and controllable retrieval pipelines.
pdf
bib
abs
A Linguistically-informed Comparison between Multilingual BERT and Language-specific BERT Models: The Case of Differential Object Marking in Romanian
Maria Tepei
|
Jelke Bloem
Current linguistic challenge datasets for language models focus on phenomena that exist in English. This may lead to a lack of attention for typological features beyond English. This is particularly an issue for multilingual models, which may be biased towards English by their training data, and this bias may be amplified if benchmarks are also English-centered. We present the syntactically and semantically complex language phenomenon of Differential Object Marking (DOM) in Romanian as a challenging Masked Language Modelling task and compare the performance of monolingual and multilingual models. Results indicate that Romanian-specific BERT models perform better than equivalent multilingual ones in representing this phenomenon.
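One way to frame DOM as a masked-prediction task is to check whether a model ranks the marker "pe" highest before a definite human direct object; the sketch below uses a publicly available Romanian BERT checkpoint and an illustrative sentence, neither of which is claimed to be the paper's exact material.

```python
from transformers import pipeline

# Placeholder Romanian masked LM; swap in any monolingual or multilingual checkpoint
fill = pipeline("fill-mask", model="dumitrescustefan/bert-base-romanian-cased-v1")

# "Am văzut-o pe Maria." -- the gold continuation at the mask is the DOM marker "pe"
for pred in fill("Am văzut-o [MASK] Maria.", top_k=3):
    print(f"{pred['token_str']!r}: {pred['score']:.3f}")
```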
pdf
bib
abs
PoliStance-TR: A Dataset for Turkish Stance Detection in Political Domain
Muhammed Cihat Unal
|
Yasemin Sarkın
|
Alper Karamanlioglu
|
Berkan Demirel
Stance detection in NLP involves determining whether an author is supportive, against, or neutral towards a particular target. This task is particularly challenging for Turkish due to the limited availability of data, which hinders progress in the field. To address this issue, we introduce a novel dataset focused on stance detection in Turkish, specifically within the political domain. This dataset was collected from X (formerly Twitter) and annotated by three human annotators who followed predefined guidelines to ensure consistent labeling and generalizability. After compiling the dataset, we trained various transformer-based models with different architectures, showing that the dataset is effective for stance classification. These models achieved an impressive Macro F1 score of up to 82%, highlighting their effectiveness in stance detection.
pdf
bib
abs
Towards Safer Hebrew Communication: A Dataset for Offensive Language Detoxification
Natalia Vanetik
|
Lior Liberov
|
Marina Litvak
|
Chaya Liebeskind
Text detoxification is the task of transforming offensive or toxic content into a non-offensive form while preserving the original meaning. Despite increasing research interest in detoxification across various languages, no resources or benchmarks exist for Hebrew, a Semitic language with unique morphological, syntactic, and cultural characteristics. This paper introduces HeDetox, the first annotated dataset for text detoxification in Hebrew. HeDetox contains 600 sentence pairs, each consisting of an offensive source text and a non-offensive text rewritten with LLM and human intervention. We present a detailed dataset analysis and evaluation showing that the dataset benefits offensive language detection. HeDetox offers a foundational resource for Hebrew natural language processing, advancing research in offensive language mitigation and controllable text generation.
pdf
bib
abs
AIDEN: Automatic Speaker Notes Creation and Navigation for Enhancing Online Learning Experience
Stalin Varanasi
|
Umer Butt
|
Guenter Neumann
|
Josef van Genabith
Effective learning in digital environments depends on quick access to educational resources and timely support. We present AIDEN, an advanced, AI-driven virtual teaching assistant integrated into lectures to provide meaningful support for students. AIDEN’s capabilities include reading lecture materials aloud, locating specific slides, generating speaker notes automatically, and searching through a video stream. Powered by state-of-the-art retrieval and text generation, AIDEN can be adapted to new lecture content with minimal manual adjustments, requiring only minor customization of data handling processes and model configurations. Through automated testing, we evaluated AIDEN’s performance across key metrics: slide retrieval recall for questions and the alignment of generated speaker notes with ground-truth data. The evaluation underscores AIDEN’s potential to significantly enhance learning experiences by offering real-world applicability and rapid configurability to diverse learning materials.
pdf
bib
abs
Using LLMs for Multilingual Clinical Entity Linking to ICD-10
Sylvia Vassileva
|
Ivan K. Koychev
|
Svetla Boytcheva
The linking of clinical entities is a crucial part of extracting structured information from clinical texts. It is the process of assigning a code from a medical ontology or classification to a phrase in the text. The International Classification of Diseases - 10th revision (ICD-10) is an international standard for classifying diseases for statistical and insurance purposes. Automatically assigning the correct ICD-10 code to terms in discharge summaries will simplify the work of healthcare professionals and ensure consistent coding in hospitals. Our paper proposes an approach for linking clinical terms to ICD-10 codes in different languages using Large Language Models (LLMs). The approach consists of a multistage pipeline that uses clinical dictionaries to match unambiguous terms in the text and then applies in-context learning with GPT-4.1 to predict the ICD-10 code for the terms that do not match the dictionary. Our system shows promising results in predicting ICD-10 codes on different benchmark datasets in Spanish - 0.89 F1 for categories and 0.78 F1 on subcategories on CodiEsp, and Greek - 0.85 F1 on ElCardioCC.
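A simplified sketch of the two-stage idea described above: exact dictionary lookup first, then an in-context LLM prompt for unresolved terms. The dictionary entries, prompt, and call_llm() helper are hypothetical placeholders rather than the authors' implementation.

```python
# Hypothetical miniature clinical dictionary (term -> ICD-10 code)
icd10_dictionary = {
    "acute myocardial infarction": "I21",
    "type 2 diabetes mellitus": "E11",
}

def call_llm(prompt: str) -> str:
    # Plug in a GPT-4.1 or local LLM client here; left unimplemented in this sketch
    raise NotImplementedError

def link_term(term: str) -> str:
    code = icd10_dictionary.get(term.lower())
    if code:                      # stage 1: unambiguous dictionary match
        return code
    prompt = (                    # stage 2: in-context learning fallback
        "Assign the most specific ICD-10 code to the clinical term below.\n"
        "Term: acute myocardial infarction -> I21\n"
        f"Term: {term} ->"
    )
    return call_llm(prompt).strip()

print(link_term("type 2 diabetes mellitus"))  # resolved by the dictionary: E11
```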
pdf
bib
abs
Aspect–Sentiment Quad Prediction with Distilled Large Language Models
Filippos Karolos Ventirozos
|
Peter Appleby
|
Matthew Shardlow
Aspect-based sentiment analysis offers detailed insights by pinpointing specific product aspects in a text that are associated with sentiments. This study explores it through the prediction of quadruples, comprising aspect, category, opinion, and polarity. We evaluated in-context learning strategies using recently released distilled large language models, ranging from zero to full-dataset demonstrations. Our findings reveal that the performance of these models now positions them between the current state-of-the-art and significantly higher than their earlier generations. Additionally, we experimented with various chain-of-thought prompts, examining sequences such as aspect to category to sentiment in different orders. Our results indicate that the optimal sequence differs from previous assumptions. We also found that for quadruple prediction, few-shot demonstrations alone yield better performance than chain-of-thought prompting.
pdf
bib
abs
SENTimental - a Simple Multilingual Sentiment Annotation Tool
John Vidler
|
Paul Rayson
|
Dawn Knight
Here we present SENTimental, a simple and fast web-based, mobile-friendly tool for capturing sentiment annotations from participants and citizen scientist volunteers to create training and testing data for low-resource languages. In contrast to existing tools, we focus on assigning broad values to segments of text over specific tags for tokens or spans to build datasets for training and testing LLMs. The SENTimental interface minimises barriers to entry with a goal of maximising the time a user spends in a flow state whereby they are able to quickly and accurately rate each text fragment without being distracted by the complexity of the interface. Designed from the outset to handle multilingual representations, SENTimental allows for parallel corpus data to be presented to the user and switched between instantly for immediate comparison. As such this allows for users in any loaded languages to contribute to the data gathered, building up comparable rankings in a simple structured dataset for later processing.
pdf
bib
abs
Anonymise: A Tool for Multilingual Document Pseudonymisation
Rinalds Vīksna
|
Inguna Skadina
According to the EU legislation, documents containing personal information need to be anonymized before public sharing. However, manual anonymisation is a time-consuming and costly process. Thus, there is a need for a robust text de-identification technique that accurately identifies and replaces personally identifiable information. This paper introduces the Anonymise tool, a system for document de-identification. The tool accepts text documents of various types (e.g., MS Word, plain-text), de-identifies personal information, and saves the de-identified document in its original format. The tool employs a modular architecture, integrating list-based matching, regular expressions and deep-learning-based named entity recognition to detect spans for redaction. Our evaluation results demonstrate high recall rates, making Anonymise a reliable solution for ensuring no sensitive information is left exposed. The tool can be accessed through a user-friendly web-based interface or API, offering flexibility for both individual and large-scale document processing needs. By automating document de-identification with high accuracy and efficiency, Anonymise presents a reliable solution for ensuring compliance with EU privacy regulations while reducing the time and cost associated with manual anonymisation.
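In the spirit of the modular design described above, the sketch below combines regular expressions for structured identifiers with an off-the-shelf NER model for person names; the checkpoint, patterns, and replacement labels are illustrative and not the tool's actual components.

```python
import re
from transformers import pipeline

# Placeholder multilingual NER model for the deep-learning-based span detection
ner = pipeline("ner", model="Davlan/bert-base-multilingual-cased-ner-hrl",
               aggregation_strategy="simple")

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def pseudonymise(text: str) -> str:
    for label, pattern in PATTERNS.items():          # regex-based matching
        text = pattern.sub(f"[{label}]", text)
    # Replace detected person names, working right-to-left to keep offsets valid
    for ent in sorted(ner(text), key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] == "PER":
            text = text[:ent["start"]] + "[PERSON]" + text[ent["end"]:]
    return text

print(pseudonymise("Contact John Smith at john.smith@example.com or +371 12345678."))
```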
pdf
bib
abs
Revealing Gender Bias in Language Models through Fashion Image Captioning
Maria Villalba-Oses
|
Victoria Muñoz-Garcia
|
Juan Pablo Consuegra-Ayala
Image captioning bridges computer vision and natural language processing but remains vulnerable to social biases. This study evaluates gender bias in ChatGPT, Copilot, and Grok by analyzing their descriptions of fashion-related images prompted without gender cues. We introduce a methodology combining gender annotation, stereotype classification, and a manually curated dataset. Results show that GPT-4o and Grok frequently assign gender and reinforce stereotypes, while Copilot more often generates neutral captions. Grok shows the lowest error rate but consistently assigns gender, even when cues are ambiguous. These findings highlight the need for bias-aware captioning approaches in multimodal systems.
pdf
bib
abs
Benchmarking Korean Idiom Understanding: A Comparative Analysis of Local and Global Models
Xiaonan Wang
|
Seoyoon Park
|
Hansaem Kim
Although an increasing number of multilingual LLMs (large language models) have begun to support Korean, there remains a notable lack of benchmark datasets specifically designed to evaluate their proficiency in Korean cultural and linguistic understanding. A major reason for this gap is that many available benchmarks in Korean are adapted from English originals via translation, which often fails to reflect the unique cultural context embedded in the Korean language. Even the few benchmark datasets based on native Korean data that involve cultural content typically focus on tasks such as bias or hate speech detection, where cultural knowledge serves merely as topical background rather than being integrated as a core component of semantic understanding. To address this gap, we introduce the Korean Idiom Matching Benchmark (KIM Bench), which consists of 1,175 instances. Idioms are culture-specific and often untranslatable, making them ideal for testing models’ cross-cultural semantic understanding. Using KIM Bench, we evaluate global and Korean native models. Our analysis shows that larger and locally trained models better capture idiom semantics and cultural nuances, while chain-of-thought prompting may reduce accuracy. Models still struggle with deep semantic and contextual understanding. KIM Bench offers a compact tool for cross-cultural evaluation and insights into improving performance on culturally grounded tasks.
pdf
bib
abs
TinyMentalLLMs Enable Depression Detection in Chinese Social Media Texts
Jinyuan Xu
|
Tian Lan
|
Mathieu Valette
|
Pierre Magistry
|
Lei Li
Depression remains a major global mental health concern, bringing a higher risk of suicide and growing social costs tied to mental disorders. Leveraging social media as a valuable source of emotional signals, we identify two limitations in current NLP-based depression detection frameworks: (1) prediction systems often lack clear, user-friendly explanations for predictions in Depression Detection, and (2) the computational and confidentiality demands of LLMs are misaligned with the need for dependable, privacy-focused small-scale deployments. To address these challenges, we introduce TinyMentalLLMs (TMLs), a compact framework that offers two key contributions: (a) the construction of a small yet representative dataset through psychology-based textometry, and (b) an efficient fine-tuning strategy centered on multiple aspects of depression. This design improves both accuracy and F1 scores in generative models with 0.5B and 1.5B parameters, consistently yielding over 20% performance gains across datasets. TMLs achieve results on par with, and deliver better text quality than, much larger state-of-the-art models.
pdf
bib
abs
Prompt Engineering for Nepali NER: Leveraging Hindi-Capable LLMs for Low-Resource Languages
Dipendra Yadav
|
Sumaiya Suravee
|
Stefan Kemnitz
|
Tobias Strauss
|
Kristina Yordanova
This study provides a systematic evaluation of prompt engineering strategies for Named Entity Recognition in Nepali, a low-resource language with high similarity to Hindi, by leveraging Meta’s Hindi-capable LLaMA 3.3:70B model. Four prompting techniques—Baseline, Chain-of-Thought, Self-Refine, and Least-to-Most—are assessed in both zero-shot and few-shot settings. As a novel contribution, we propose an entity-aware sentence selection strategy that prioritizes example diversity and entity coverage for few-shot prompting. Experimental results show that, without Nepali examples, zero-shot and one-shot prompts frequently yield unstructured or hallucinated outputs, underscoring the limitations of cross-lingual capabilities without in-context supervision. However, including even a small number of carefully selected Nepali examples—sometimes as few as ten—substantially enhances model performance, with the Least-to-Most approach achieving the highest F1 scores. These findings highlight the potential of prompt-based adaptation and principled example curation for extending LLM capabilities to related, low-resource languages, offering a practical alternative to full model fine-tuning.
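A rough illustration of the entity-aware example selection idea: greedily choose annotated sentences that add the most not-yet-covered entity types. The toy pool is hypothetical and the greedy criterion is one plausible reading of the strategy, not the authors' exact algorithm.

```python
def select_examples(pool, k):
    """pool: list of (sentence, set_of_entity_types); returns k diverse examples."""
    covered, chosen = set(), []
    for _ in range(min(k, len(pool))):
        # Pick the sentence contributing the most uncovered entity types
        best = max(pool, key=lambda ex: len(ex[1] - covered))
        chosen.append(best)
        covered |= best[1]
        pool = [ex for ex in pool if ex is not best]
    return chosen

pool = [
    ("Sentence with a person and a place.", {"PER", "LOC"}),
    ("Sentence with an organisation.", {"ORG"}),
    ("Sentence with a person only.", {"PER"}),
]
for sent, types in select_examples(pool, 2):
    print(types, "->", sent)
```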
pdf
bib
abs
Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media
Shakib Yazdani
|
Yasser Hamidullah
|
Cristina España-Bonet
|
Josef van Genabith
Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setups. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language for the purpose of additional evaluation. Our VLM-based pipeline includes face visibility detection, sign activity recognition, text extraction from video content, and a judgment step to validate alignment between video and text, implementing generic filtering, annotation, and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.
pdf
bib
abs
Visual Priming Effect on Large-scale Vision Language Models
Daiki Yoshida
|
Haruki Sakajo
|
Kazuki Hayashi
|
Yusuke Sakai
|
Hidetaka Kamigaito
|
Katsuhiko Hayashi
|
Taro Watanabe
Large-scale Vision-Language Models (LVLMs) integrate linguistic and visual information, demonstrating advanced task-solving capabilities. These models are originally derived from Large Language Models, leading to strong capabilities for language tasks. However, the impact of additional visual information on model responses remains insufficiently understood. In this study, we focus on the priming effect, a psychological phenomenon, to investigate how visual information influences language task processing. We present additional intentionally designed images alongside two types of language tasks with different characteristics and analyze changes in the model’s responses. Our experimental results show that model responses shift in the direction intended by the image, suggesting that LVLMs do not simply ignore visual information but actively incorporate it into language processing. Furthermore, the similarity between this behavior and priming effects observed in human cognition suggests that LVLMs may share certain aspects of human cognitive mechanisms.
pdf
bib
abs
From Courtroom to Corpora: Building a Name Entity Corpus for Urdu Legal Texts
Adeel Zafar
|
Sohail Ashraf
|
Slawomir Nowaczyk
This study explores the effectiveness of transformer-based models for Named Entity Recognition (NER) in Urdu legal documents, a critical task in low-resource language processing. Given the legal texts’ specialized terminology and complex syntax, accurate entity recognition in Urdu remains challenging. We developed a legal Urdu dataset that contains 117,500 documents, generated synthetically from 47 different types of legal documents, and evaluated three BERT-based models, XLM-RoBERTa, mBERT, and DistilBERT, by analyzing their performance on an annotated Urdu legal dataset. mBERT demonstrated superior accuracy (0.999), and its F1 score (0.975) outperforms those of XLM-RoBERTa and DistilBERT, highlighting its robustness in recognizing entities within low-resource languages. To ensure the privacy of personal identifiers, all documents are anonymized. The dataset for this study is hosted on Hugging Face and will be made public after publication.
pdf
bib
abs
EmoHopeSpeech: An Annotated Dataset of Emotions and Hope Speech in English and Arabic
Wajdi Zaghouani
|
Md. Rafiul Biswas
This research introduces a bilingual dataset comprising 27,456 entries for Arabic and 10,036 entries for English, annotated for emotions and hope speech, addressing the scarcity of multi-emotion (emotion and hope) datasets. The dataset provides comprehensive annotations capturing emotion intensity, complexity, and causes, alongside detailed classifications and subcategories for hope speech. To ensure annotation reliability, Fleiss’ Kappa was employed, revealing 0.75-0.85 agreement among annotators for both the Arabic and English data. The evaluation metrics (micro-F1 score = 0.67) obtained from the baseline model (a transformer-based AraBERT model) confirm that the data annotations are reliable.
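For reference, the reported agreement statistic can be computed with statsmodels as sketched below; the rating matrix is a made-up toy, not the paper's annotation data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = annotated items, columns = annotators, values = assigned label ids
ratings = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [2, 2, 2],
    [1, 1, 1],
    [0, 1, 0],
])

table, _ = aggregate_raters(ratings)   # item x category count table
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")
```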
pdf
bib
abs
An Annotated Corpus of Arabic Tweets for Hate Speech Analysis
Wajdi Zaghouani
|
Md. Rafiul Biswas
Identifying hate speech content in the Arabic language is challenging due to its rich dialectal variation. This study introduces a multilabel hate speech dataset in the Arabic language. We have collected 10,000 Arabic tweets and annotated each tweet as to whether it contains offensive content. If a text contains offensive content, we further classify it into different hate speech targets such as religion, gender, politics, ethnicity, origin, and others. A text can contain either a single target or multiple targets. Multiple annotators were involved in the data annotation task. We calculated the inter-annotator agreement, which was 0.86 for offensive content and 0.71 for the multiple hate speech targets. Finally, we evaluated the data annotation task by employing different transformer-based models, among which AraBERTv2 performed best with a micro-F1 score of 0.7865 and an accuracy of 0.786.
pdf
bib
abs
Strategies for Efficient Retrieval-augmented Generation in Clinical Domains with RAPTOR: A Benchmarking Study
Xumou Zhang
|
Qixuan Hu
|
Jinman Kim
|
Adam G. Dunn
The Recursive Abstractive Processing for Tree-Organized Retrieval (RAPTOR) framework deploys a hierarchical tree-structured datastore to integrate local and global context, enabling efficient handling of long documents for language models. This design is especially useful when cloud-based language models are unavailable or undesirable, for instance, with offline confidential patient records or stringent data-privacy requirements. We benchmarked RAPTOR on the QuALITY dataset and a novel Clinical Trial question-answering dataset (CTQA) drawn from over 500,000 registry entries. Experiments varied question complexity (simple vs. complex), four language models, four embedding models, and three chunking strategies. We also incorporated GPT-4o as a cloud-based baseline. Results show that, with optimal settings, RAPTOR combined with smaller local models outperforms GPT-4o on complex CTQA questions, although this gain does not extend to QuALITY. These outcomes highlight RAPTOR’s promise as a practical, locally implementable solution for long-context understanding.
pdf
bib
abs
LLM-Based Product Recommendation with Prospect Theoretic Self Alignment Strategy
Manying Zhang
|
Zehua Cheng
|
Damien Nouvel
Accurate and personalized product recommendation is central to user satisfaction in e-commerce. However, a persistent language gap often exists between user queries and product titles or descriptions. While traditional user behavior-based recommenders and LLM-based Retrieval-Augmented Generation systems typically optimize for maximum likelihood objectives, they may struggle to bridge this gap or capture users’ true intent. In this paper, we propose a strategy based on Prospect Theoretic Self-Alignment that reframes LLM-based recommendation as a utility-driven process. Given a user query and a set of candidate products, our model acts as a seller who anticipates latent user needs and generates product descriptions tailored to the user’s perspective. Simultaneously, it simulates user decision-making utility to assess whether the generated content would lead to a purchase. This self-alignment is achieved through a training strategy grounded in Kahneman & Tversky’s prospect theory, ensuring that recommendations are optimized for perceived user value rather than likelihood alone. Experiments on real-world product data demonstrate substantial improvements in intent alignment and recommendation quality, validating the effectiveness of our approach in producing personalized and decision-aware recommendations.
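For intuition, here is a minimal sketch of the Kahneman-Tversky value function that prospect-theoretic objectives build on: concave for gains, convex and steeper for losses. The parameter values are the classic published estimates, not the paper's settings.

```python
def prospect_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Prospect-theory value: concave for gains, convex and steeper for losses."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** beta)

# Loss aversion: a loss of the same magnitude hurts more than an equal gain helps
for outcome in (1.0, -1.0):
    print(f"v({outcome:+.1f}) = {prospect_value(outcome):+.3f}")
```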
pdf
bib
abs
Branching Out: Exploration of Chinese Dependency Parsing with Fine-tuned Large Language Models
He Zhou
|
Emmanuele Chersoni
|
Yu-Yin Hsu
In this paper, we investigate the effectiveness of large language models (LLMs) for Chinese dependency parsing through fine-tuning. We explore how different dependency representations impact parsing performance when fine-tuning the Chinese Llama-3 model. Our results demonstrate that while the Stanford typed dependency tuple representation yields the highest number of valid dependency trees, converting the dependency structure into a lexically centered tree produces parses of significantly higher quality despite generating fewer valid structures. The results further show that fine-tuning enhances LLMs’ capability to handle longer dependencies to some extent, though challenges remain. Additionally, we evaluate the effectiveness of DeepSeek in correcting LLM-generated dependency structures, finding that it is effective for fixing index errors and cyclicity issues but still suffers from tokenization mismatches. Our analysis across dependency distances and relations reveals that fine-tuned LLMs outperform traditional parsers in specific syntactic structures while struggling with others. These findings contribute to the research on leveraging LLMs for syntactic analysis tasks.