The 14th International Joint Conference on Natural Language Processing & The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Mumbai, India
December 20–24, 2025

pdf (full)
bib (full)
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

pdf bib
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Kentaro Inui | Sakriani Sakti | Haofen Wang | Derek F. Wong | Pushpak Bhattacharyya | Biplab Banerjee | Asif Ekbal | Tanmoy Chakraborty | Dhirendra Pratap Singh

pdf bib
HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection
Theo King | Zekun Wu | Adriano Koshiyama | Emre Kazim | Philip Colin Treleaven

A stereotype is a generalised claim about a social group. Such claims change with culture and context and are often phrased in everyday language, which makes them hard to detect: state-of-the-art Large Language Models (LLMs) reach only 68% macro-F1 on the yes/no task “does this sentence contain a stereotype?”. We present HEARTS, a Holistic framework for Explainable, sustAinable and Robust Text Stereotype detection that brings together NLP and social science. The framework is built on the Expanded Multi-Grain Stereotype Dataset (EMGSD), 57,201 English sentences that cover gender, profession, nationality, race, religion and LGBTQ+ topics, adding 10% more data for under-represented groups while keeping high annotator agreement (𝜅 = 0.82). Fine-tuning the lightweight ALBERT-v2 model on EMGSD raises binary detection scores to 81.5% macro-F1, matching full BERT while producing 200× less CO2. For explainability, we blend SHAP and LIME token-level scores and introduce a confidence measure that increases when the model is correct (𝜌 = 0.18). We then use HEARTS to assess 16 SOTA LLMs on 1,050 neutral prompts each for stereotype propagation: stereotype rates fall by 23% between model generations, yet clear differences remain across model families (LLaMA > Gemini > GPT > Claude). HEARTS thus supplies a practical, low-carbon and interpretable toolkit for measuring stereotype bias in language.
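
As a rough illustration of the explanation step, blending SHAP and LIME token attributions can be as simple as averaging normalized scores; the equal weighting and L1 normalization below are illustrative assumptions, not the exact HEARTS formulation.

```python
# Illustrative blend of SHAP and LIME token attributions into one score per token.
# Equal weighting and L1 normalization are assumptions made for this sketch only.
def blend_attributions(shap_scores: dict, lime_scores: dict) -> dict:
    def normalize(scores):
        total = sum(abs(v) for v in scores.values()) or 1.0
        return {tok: v / total for tok, v in scores.items()}

    shap_n, lime_n = normalize(shap_scores), normalize(lime_scores)
    tokens = set(shap_n) | set(lime_n)
    return {t: 0.5 * shap_n.get(t, 0.0) + 0.5 * lime_n.get(t, 0.0) for t in tokens}

print(blend_attributions({"nurses": 0.9, "are": 0.1}, {"nurses": 0.7, "are": 0.2}))
```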

pdf bib
Roles of MLLMs in Visually Rich Document Retrieval for RAG: A Survey
Xiantao Zhang

Visually rich documents (VRDs) challenge retrieval-augmented generation (RAG) with layout-dependent semantics, brittle OCR, and evidence spread across complex figures and structured tables. This survey examines how Multimodal Large Language Models (MLLMs) are being used to make VRD retrieval practical for RAG. We organize the literature into three roles: *Modality-Unifying Captioners*, *Multimodal Embedders*, and *End-to-End Representers*. We compare these roles along retrieval granularity, information fidelity, latency and index size, and compatibility with reranking and grounding. We also outline key trade-offs and offer some practical guidance on when to favor each role. Finally, we identify promising directions for future research, including adaptive retrieval units, model size reduction, and the development of evaluation methods.

pdf bib
With Privacy, Size Matters: On the Importance of Dataset Size in Differentially Private Text Rewriting
Stephen Meisenbacher | Florian Matthes

Recent work in Differential Privacy with Natural Language Processing (DP NLP) has proposed numerous promising techniques in the form of *text rewriting* mechanisms. In the evaluation of these mechanisms, an often-ignored aspect is that of *dataset size*, or rather, the effect of dataset size on a mechanism’s efficacy for utility and privacy preservation. In this work, we are the first to introduce this factor in the evaluation of DP text privatization, where we design utility and privacy tests on large-scale datasets with dynamic split sizes. We run these tests on datasets of varying size with up to one million texts, and we focus on quantifying the effect of increasing dataset size on the privacy-utility trade-off. Our findings reveal that dataset size plays an integral part in evaluating DP text rewriting mechanisms; additionally, these findings call for more rigorous evaluation procedures in DP NLP, as well as shed light on the future of DP NLP in practice and at scale.

pdf bib
From Anger to Joy: How Nationality Personas Shape Emotion Attribution in Large Language Models
Mahammed Kamruzzaman | Abdullah Al Monsur | Gene Louis Kim | Anshuman Chhabra

Emotions are a fundamental facet of human experience, varying across individuals, cultural contexts, and nationalities. Given the recent success of Large Language Models (LLMs) as role-playing agents, we examine whether LLMs exhibit emotional stereotypes when assigned nationality-specific personas. Specifically, we investigate how different countries are represented in pre-trained LLMs through emotion attributions and whether these attributions align with cultural norms. To provide a deeper interpretive lens, we incorporate four key cultural dimensions, namely Power Distance, Uncertainty Avoidance, Long-Term Orientation, and Individualism, derived from Hofstede’s cross-cultural framework. Our analysis reveals significant nationality-based differences, with emotions such as shame, fear, and joy being disproportionately assigned across regions. Furthermore, we observe notable misalignment between LLM-generated and human emotional responses, particularly for negative emotions, highlighting the presence of reductive and potentially biased stereotypes in LLM outputs.

pdf bib
REGULAR: A Framework for Relation-Guided Multi-Span Question Generation
Jiayi Lin | Chenyang Zhang | Bingxuan Hou | Dongyu Zhang | Qingqing Hong | Junli Wang

To alleviate the high cost of manually annotating Question Answering (QA) datasets, Question Generation (QG) requires the model to generate a question related to the given answer and passage. This work primarily focuses on Multi-Span Question Generation (MSQG), where the generated question corresponds to multiple candidate answers. Existing QG methods may not suit MSQG as they typically overlook the correlation between the candidate answers and generate trivial questions, which limits the quality of the synthetic datasets. Based on the observation that relevant entities typically share the same relationship with the same entity, we propose REGULAR, a framework of RElation-GUided MuLti-SpAn Question GeneRation. REGULAR first converts passages into relation graphs and extracts candidate answers from the relation graphs. Then, REGULAR utilizes a QG model to generate a set of candidate questions and a QA model to obtain the best question. We construct over 100,000 questions using Wikipedia corpora, named REGULAR-WIKI, and conduct experiments to compare our synthetic datasets with other synthetic QA datasets. The experiment results show that models trained with REGULAR-WIKI achieve the best performance. We also conduct ablation studies and statistical analysis to verify the quality of our synthetic dataset. Our code and data are available at https://github.com/PluseLin/REGULAR.

pdf bib
Feature Decomposition-Augmentation Network for Multimodal Sentiment Analysis
Dapeng Yin | Bingxuan Hou | Mengna Gao | Shuyue Zhu | Junli Wang

Multimodal sentiment analysis identifies human emotional tendencies by analyzing text, visual, and auditory modalities. In most studies, the textual modality is usually considered to contain the most emotional information and is regarded as the dominant modality. Existing methods mostly map auxiliary modalities into a semantic space close to the dominant modality, which overly relies on the dominant modality. In this work, we propose a Feature Decomposition-Augmentation (FeaDA) framework, which aims to elevate the role of auxiliary modalities in multimodal data fusion. We first design a projector to decompose auxiliary modalities into partial features, which contain features for emotion judgment, and then utilize these decomposed features to guide the fusion process with a KL loss, thereby enhancing the status of auxiliary modality fusion. To verify the effectiveness of our method, we conducted experiments on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets. The experimental results show that our FeaDA framework outperforms multimodal sentiment analysis methods of the same type on the main metrics. Our code is available at https://github.com/PowerLittleYin/FeaDA-main.

pdf bib
CSPLADE: Learned Sparse Retrieval with Causal Language Models
Zhichao Xu | Aosong Feng | Yijun Tian | Haibo Ding | Lin Lee Cheong

In recent years, dense retrieval has been the focus of information retrieval (IR) research. While effective, dense retrieval produces uninterpretable dense vectors and suffers from the drawback of large index size. Learned sparse retrieval (LSR) has emerged as a promising alternative, achieving competitive retrieval performance while also being able to leverage the classical inverted index data structure for efficient retrieval. However, limited work has explored scaling LSR beyond BERT scale. In this work, we identify two challenges in training large language models (LLMs) for LSR: (1) training instability during the early stage of contrastive training; (2) suboptimal performance due to the pre-trained LLM’s unidirectional attention. To address these challenges, we propose two corresponding techniques: (1) a lightweight adaptation training phase to eliminate training instability; (2) two model variants to enable bidirectional information. With these techniques, we are able to train LSR models with an 8B-scale LLM and achieve competitive retrieval performance with reduced index size. Furthermore, we are among the first to analyze the performance-efficiency tradeoff of LLM-based LSR models through the lens of model quantization. Our findings provide insights into adapting LLMs for efficient retrieval modeling.

pdf bib
Bias Amplification: Large Language Models as Increasingly Biased Media
Ze Wang | Zekun Wu | Yichi Zhang | Xin Guan | Navya Jain | Qinyang Lu | Saloni Gupta | Adriano Koshiyama

Model collapse—a phenomenon where models degrade in performance due to indiscriminate use of synthetic data—is well studied. However, its role in bias amplification—the progressive reinforcement of pre-existing social biases in Large Language Models (LLMs)—remains underexplored. In this paper, we formally define the conditions for bias amplification and demonstrate through statistical simulations that bias can intensify even in the absence of sampling errors, the primary driver of model collapse. Empirically, we investigate political bias amplification in GPT-2 using a custom-built benchmark for sentence continuation tasks. Our findings reveal a progressively increasing right-leaning bias. Furthermore, we evaluate three mitigation strategies—Overfitting, Preservation, and Accumulation—and show that bias amplification persists even when model collapse is mitigated. Finally, a mechanistic interpretation identifies distinct sets of neurons responsible for model collapse and bias amplification, suggesting they arise from different underlying mechanisms.

pdf bib
STAR: Self-Automated Back-Querying for Production Data Generation
Kellen Tan Cheng | Anna Lisa Gentile | Chad DeLuca | Guang-Jie Ren

The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant amount of risk associated with their usage. Guardrail technologies aim to mitigate this risk by filtering LLMs’ input/output text through various detectors. However, developing and maintaining robust detectors has many challenges, one of which is the difficulty of acquiring production-quality labeled data on real LLM outputs before deployment. In this work, we propose STAR, a simple yet intuitive solution for generating production-like labeled data for LLM guardrails development. STAR is based on two key ideas: (i) using self-automated back-querying to synthetically generate data, paired with (ii) a sparse human-in-the-loop clustering technique to label the data. The aim of self-automated back-querying is to construct a parallel corpus roughly representative of the original dataset and resembling real LLM output. We then infuse existing datasets with our synthetically generated examples to produce robust training data for our detectors. We test our technique on one of the most difficult and nuanced detectors: the identification of health advice in LLM output, and demonstrate improvement versus other solutions. Our detector is able to outperform GPT-4o by up to 3.48%, despite having 400× fewer parameters.

pdf bib
GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference
Chao Zeng | Songwei Liu | Shu Yang | Fangmin Chen | Xing Mei | Lean Fu

Model compression has emerged as a mainstream solution to reduce memory usage and computational overhead. This paper proposes GQSA, a novel model compression framework specifically designed for LLMs. Traditional methods typically focus exclusively on either quantization or sparsification, but relying on a single strategy often results in significant performance loss at high compression rates. In contrast, GQSA integrates quantization and sparsification in a tightly coupled manner, leveraging GPU-friendly structured group sparsity and quantization for efficient acceleration. Building upon system-algorithm co-design principles, we propose a two-stage sparse optimization strategy that ensures the performance superiority of the compressed model. On the engine side, we introduce a “task-centric” parallel strategy, which, to the best of our knowledge, is the first application of such a strategy in the domain of sparse computing. Compared to the traditional 2:4 sparsity method, GQSA offers a more flexible and adjustable sparsity rate, as well as a higher weight compression rate, and is efficiently compatible with weight-only quantization methods. Experimental results demonstrate that, under the GQSA W4S50% compression setting, the model’s accuracy surpasses that of both 2:4 pruning and W2 quantization. Furthermore, at the inference level, GQSA outperforms W2 by 1.26× and 2:4 pruning by 2.35× in terms of speed.

pdf bib
Explainable Ethical Assessment on Human Behaviors by Generating Conflicting Social Norms
Yuxi Sun | Wei Gao | Hongzhan Lin | Jing Ma | Wenxuan Zhang

Human behaviors are often guided or constrained by social norms, which are defined as shared, commonsense rules. For example, underlying an action such as “report a witnessed crime” are social norms that inform our conduct, such as “It is expected to be brave to report crimes.” Current AI systems that assess the valence (i.e., support or oppose) of human actions by leveraging large-scale data training not grounded on explicit norms may be difficult to explain, and thus untrustworthy. Emulating human assessors by considering social norms can help AI models better understand and predict valence. While multiple norms come into play, conflicting norms can create tension and directly influence human behavior. For example, when deciding whether to report a witnessed crime, one may balance bravery against self-protection. In this paper, we introduce ClarityEthic, a novel ethical assessment approach, to enhance valence prediction and explanation by generating the conflicting social norms behind human actions, which strengthens the moral reasoning capabilities of language models by using a contrastive learning strategy. Extensive experiments demonstrate that our method outperforms strong baseline approaches, and human evaluations confirm that the generated social norms provide plausible explanations for the assessment of human behaviors.

pdf bib
Enhancing ID and Text Fusion via Alternative Training in Session-based Recommendation
Juanhui Li | Haoyu Han | Zhikai Chen | Harry Shomer | Wei Jin | Amin Javari | Hui Liu

Session-based recommendation systems have attracted growing interest for their ability to provide personalized recommendations based on users’ in-session behaviors. While ID-based methods have shown strong performance, they often struggle with long-tail items and overlook valuable textual information. To incorporate text information, various approaches have been proposed, generally employing a naive fusion framework. Interestingly, this approach often fails to outperform the best single-modality baseline. Further exploration indicates a potential imbalance issue in the naive fusion method, where the ID tends to dominate the training and the text is undertrained. This issue indicates that the naive fusion method might not be as effective in combining ID and text as once believed. To address this, we propose AlterRec, an alternative training framework that separates the optimization of ID and text to avoid the imbalance issue. AlterRec also designs an effective strategy to enhance the interaction between the two modalities, facilitating mutual interaction and more effective text integration. Extensive experiments demonstrate the effectiveness of AlterRec in session-based recommendation.
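
A minimal sketch of the alternating-training idea, in which the ID branch and the text branch are optimized in separate phases rather than jointly; the `update` helper below is a hypothetical stand-in for one epoch of gradient updates, not the AlterRec implementation.

```python
# Alternating optimization schematic: each phase updates one modality while the
# other is held fixed, so neither modality dominates training.
# `update` is a hypothetical stand-in for an epoch of gradient updates.
def update(params, frozen_params, data):
    return {k: v + 0.1 for k, v in params.items()}  # placeholder "training" step

def alternating_training(id_params, text_params, data, n_rounds=3):
    for _ in range(n_rounds):
        id_params = update(id_params, frozen_params=text_params, data=data)    # text frozen
        text_params = update(text_params, frozen_params=id_params, data=data)  # ID frozen
    return id_params, text_params

print(alternating_training({"id_emb": 0.0}, {"text_emb": 0.0}, data=None))
```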

pdf bib
Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data Generation
Zijian Li | Jingjing Fu | Lei Song | Jiang Bian | Jun Zhang | Rui Wang

Visual reasoning is crucial for multimodal large language models (MLLMs) to address complex chart queries, yet high-quality rationale data remains scarce. Existing methods have leveraged (M)LLMs for data generation, but direct prompting often yields limited precision and diversity. In this paper, we propose Chain of Functions (CoF), a novel programmatic reasoning data generation pipeline that utilizes freely-explored reasoning paths as supervision to ensure data precision and diversity. Specifically, it starts with human-free exploration among atomic functions (e.g., maximum data and arithmetic operations) to generate diverse function chains, which are then translated into linguistic rationales and questions with only a moderate open-sourced LLM. CoF provides multiple benefits: 1) Precision: function-governed generation reduces hallucinations compared to freeform generation; 2) Diversity: enumerating function chains enables varied question taxonomies; 3) Explainability: function chains serve as built-in rationales, allowing fine-grained evaluation beyond overall accuracy; 4) Practicality: it eliminates reliance on extremely large models. Employing CoF, we construct the ChartCoF dataset, with 1.4k complex reasoning Q&A for fine-grained analysis and 50k Q&A for reasoning enhancement. Experiments show that ChartCoF improves performance for MLLMs on widely used benchmarks, and the fine-grained evaluation on ChartCoF reveals varying performance across question taxonomies and step numbers for each MLLM. Furthermore, the novel paradigm of function-governed rationale generation in CoF could inspire broader applications beyond charts. The code and data are publicly available at https://github.com/microsoft/Chain-of-Functions.
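
To make the idea of a function chain concrete, the toy sketch below composes two atomic functions over a chart's data series and renders the chain as a question with a built-in rationale; the function inventory and phrasing templates are illustrative assumptions, not the CoF implementation.

```python
# Toy chain of atomic chart functions composed into a question/rationale pair.
# The function inventory and phrasing are illustrative only.
series = {"2019": 12, "2020": 18, "2021": 9}

def f_argmax(data):          # atomic function: year with the maximum value
    return max(data, key=data.get)

def f_diff(data, a, b):      # atomic function: arithmetic difference
    return data[a] - data[b]

peak = f_argmax(series)
gap = f_diff(series, peak, "2021")
print("Q: How much higher is the peak year's value than the 2021 value?")
print(f"Rationale: argmax -> {peak} ({series[peak]}); diff({peak}, 2021) -> {gap}")
```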

pdf bib
Topology-Aware Gated Graph Neural Network for Social Bot Detection
Pi Jiebin | Yantuan Xian | Yuxin Huang | Yan Xiang | Ran Song | Zhengtao Yu

The rapid growth of social networks has led to a surge in social bots, which often disseminate low-quality content and may manipulate public opinion, posing threats to online security. Although recent GNN-based bot detection methods perform strongly, they still face two major challenges. First, deep GNNs are prone to over-smoothing: neighbor aggregation blends bot and human node representations, obscuring bot-specific features. Second, social graphs are dominated by human–human and human–bot connections, while direct bot–bot links are scarce, making it difficult for effective bot representations to propagate within GNNs. To address these issues, we propose a Topology-Aware Gated Graph Neural Network to detect social bots. The model employs topology-aware data augmentation to synthesize realistic bot nodes that preserve the original graph structure, mitigating class imbalance; it also introduces a hierarchical gating mechanism that restructures node embeddings into a tree format, selectively filtering noise and enhancing discriminative features. Experiments on three standard benchmark datasets show that it consistently surpasses leading baselines in highly imbalanced settings, delivering superior accuracy and robustness.

pdf bib
Minimizing Queries, Maximizing Impact: Adaptive Score-Based Attack and Defense for Sentiment Analysis
Yigit Efe Enhos | Shira Wein | Scott Alfeld

While state-of-the-art large language models find high rates of success on text classification tasks such as sentiment analysis, they still exhibit vulnerabilities to adversarial examples: meticulously crafted perturbations of input data that guide models into making false predictions. These adversarial attacks are of particular concern when the systems in question are user-facing. While many attacks are able to reduce the accuracy of Transformer-based classifiers by a substantial margin, they often require a large amount of computational time and a large number of queries made to the attacked model, which makes the attacks susceptible to detection. In this work, we resolve the limitations of high query counts and necessary computational time by proposing a query-efficient word-level attack that is fast during deployment and does not compromise the attack success rate of state-of-the-art methods. Our attack constructs a dictionary of adversarial word substitutions based on prior data and leverages these substitutions to flip the sentiment classification of the text. Our attack method achieves an average of 27.49 queries—over 30% fewer than the closest competitor—while maintaining a 99.70% attack success rate. We also develop an effective defense strategy inspired by our attack approach.
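
A simplified sketch of the dictionary-based idea: precomputed word substitutions are applied greedily until the (black-box) classifier's label flips, keeping the query count low. The substitution table and classifier here are toy stand-ins, not the paper's construction.

```python
# Greedy word-substitution attack using a precomputed dictionary; stops as soon
# as the victim classifier's label flips. The dictionary and classifier below
# are toy stand-ins for illustration.
SUBSTITUTIONS = {"great": "passable", "love": "tolerate", "amazing": "unremarkable"}

def toy_classifier(text: str) -> str:
    positive = {"great", "love", "amazing"}
    return "positive" if any(w in positive for w in text.lower().split()) else "negative"

def attack(text: str, classifier) -> tuple[str, int]:
    original = classifier(text)
    words, queries = text.split(), 1
    for i, w in enumerate(words):
        if w.lower() in SUBSTITUTIONS:
            words[i] = SUBSTITUTIONS[w.lower()]
            queries += 1
            if classifier(" ".join(words)) != original:
                break
    return " ".join(words), queries

print(attack("I love this amazing phone", toy_classifier))
```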

pdf bib
Generating Text from Uniform Meaning Representation
Emma Markle | Reihaneh Iranmanesh | Shira Wein

Uniform Meaning Representation (UMR) is a recently developed graph-based semantic representation, which expands on Abstract Meaning Representation (AMR) in a number of ways, in particular through the inclusion of document-level information and multilingual flexibility. In order to effectively adopt and leverage UMR for downstream tasks, effort must be directed toward developing a UMR technological ecosystem. Though only a small number of UMR annotations have been produced to date, in this work we investigate the first approaches to producing text from multilingual UMR graphs. Exploiting the structural similarity between UMR and AMR graphs and the wide availability of AMR technologies, we introduce (1) a baseline approach which passes UMR graphs directly to AMR-to-text generation models, (2) a pipeline approach which converts UMR to AMR and then applies AMR-to-text generation models, and (3) a fine-tuning approach for both foundation models and AMR-to-text generation models with UMR data. Our best performing models achieve multilingual BERTscores of 0.825 for English and 0.882 for Chinese, a promising indication of the effectiveness of fine-tuning approaches for UMR-to-text generation even with limited UMR data.

pdf bib
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
Kai Yan | Yufei Xu | Zhengyin Du | Xuesong Yao | Zheyu Wang | Xiaowen Guo | Jiecao Chen

The rapid escalation in the difficulty of LLM benchmarks in recent years, from elementary school-level to frontier problems, seems to bring us close enough to the “last exam” for LLMs to surpass humanity. However, does the LLMs’ remarkable reasoning ability indeed come from true intelligence by human standards, or are they actually reciting solutions witnessed during Internet-scale training? To study this problem, we propose RoR-Bench, a novel multi-modal benchmark for detecting LLMs’ recitation behavior when they are asked simple reasoning problems whose conditions are subtly shifted, and conduct empirical analysis on our benchmark. Surprisingly, we find that existing cutting-edge LLMs unanimously exhibit extremely severe recitation behavior; by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer a 60% performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community that compels us to reevaluate the true intelligence level of cutting-edge LLMs.

pdf bib
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
Lin Shi | Chiyu Ma | Wenhua Liang | Xingjian Diao | Weicheng Ma | Soroush Vosoughi

LLM-as-a-Judge has emerged as a promising alternative to human evaluators across various tasks, yet inherent biases—particularly position bias, the tendency to favor solutions based on their position within the prompt—compromise its reliability. This exploratory study evaluates position bias in LLM judges across pairwise and list-wise comparison settings, introducing three metrics: repetition stability, position consistency, and preference fairness. Our experiments, involving 15 LLM judges across MTBench and DevBench with 22 tasks and approximately 40 solution-generating models, result in over 150,000 evaluation instances. We identify Judge-Level, Candidate-Level, and Task-Level factors contributing to bias. The findings confirm that position bias is not due to random chance and varies significantly across judges and tasks. While position bias is weakly influenced by the length of prompt components, it is strongly affected by the quality gap between solutions. Our agreement and disagreement analysis among judges further provides insights into the distribution of judging difficulty across the dataset, and highlights the potential for dataset modifications.
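
One plausible way to operationalize the position consistency metric is to check whether a judge's preferred answer survives swapping the order of the two candidates; the sketch below is an interpretation of the metric name, not the paper's exact definition.

```python
# Position consistency sketch: fraction of pairs where the judge picks the same
# underlying answer regardless of presentation order. `judge` is a stand-in.
def position_consistency(pairs, judge) -> float:
    consistent = 0
    for a, b in pairs:
        first = judge(a, b)            # returns "A" or "B" by slot
        second = judge(b, a)
        # same underlying winner iff the slot labels disagree after swapping
        consistent += (first == "A" and second == "B") or (first == "B" and second == "A")
    return consistent / len(pairs)

# toy judge that always prefers the longer answer (order-invariant, so consistency is 1.0)
toy_judge = lambda x, y: "A" if len(x) >= len(y) else "B"
print(position_consistency([("short", "a longer answer"), ("yes", "no")], toy_judge))
```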

pdf bib
Item-Language Model: Improving Large Language Model for Recommendation via Item-Language Representation Learning
Li Yang | Anushya Subbiah | Hardik Patel | Judith Yue Li | Yanwei Song | Reza Mirghaderi | Vikram Aggarwal | Fuli Feng | Zenglin Xu | Dongfang Liu | Qifan Wang

Large Language Models (LLMs) have recently made significant advancements in tackling complex tasks, such as retrieving hard-to-find information and solving intricate problems. Consequently, various approaches have been proposed to integrate LLMs into recommender systems, primarily by embedding them within existing architectures or training them on recommendation data. However, most existing methods fail to effectively incorporate user-item interaction signals into pretrained LLMs due to the modality gap between interaction data and the LLM’s internal knowledge. To address this challenge, we propose the Item-Language Model (ILM) to enhance LLMs for recommendation. ILM consists of two main components: an item-language representation learning module, in which an ILM encoder is pretrained to generate text-aligned item representations, and an item-language co-training module, in which the ILM encoder is integrated into a pretrained LLM for recommendation tasks. Extensive experiments demonstrate the superior performance of our approach over several state-of-the-art methods, validating the importance of text-aligned item representations in bridging this modality gap. Our ablation studies further reveal the effectiveness of our model design for integrating interaction knowledge into LLMs for recommendation tasks. Our code is available at: https://anonymous.4open.science/r/ILM-7AD4/.

pdf bib
Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models
Zahraa Al Sahili | Ioannis Patras | Matthew Purver

Multilingual vision–language models (VLMs) promise universal image–text retrieval, yet their social biases remain under-explored. We perform the first systematic audit of four public multilingual CLIP variants—M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2—covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits—and in caption-sparse contexts (e.g., Xhosa) amplifies—the English anchor’s crime associations. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific “hot spots,” underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.

pdf bib
A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models
Adam M. Morgan | Adeen Flinker

We present an automated pipeline for estimating Verb Frame Frequencies (VFFs), the frequency with which a verb appears in particular syntactic frames. VFFs provide a powerful window into syntax in both human and machine language systems, but existing tools for calculating them are limited in scale, accuracy, or accessibility. We use large language models (LLMs) to generate a corpus of sentences containing 476 English verbs. Next, by instructing an LLM to behave like an expert linguist, we had it analyze the syntactic structure of the sentences in this corpus. This pipeline outperforms two widely used syntactic parsers across multiple evaluation datasets. Furthermore, it requires far fewer resources than manual parsing (the gold-standard), thereby enabling rapid, scalable VFF estimation. Using the LLM parser, we produce a new VFF database with broader verb coverage, finer-grained syntactic distinctions, and explicit estimates of the relative frequencies of structural alternates commonly studied in psycholinguistics. The pipeline is easily customizable and extensible to new verbs, syntactic frames, and even other languages. We present this work as a proof of concept for automated frame frequency estimation, and release all code and data to support future research.
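
The final counting step is straightforward once each sentence has a (verb, frame) label from the LLM parser: relative frame frequencies are normalized counts per verb, as in the sketch below (the frame labels shown are illustrative, not the database's inventory).

```python
# Relative verb frame frequencies from (verb, frame) parse labels.
# The frame inventory and labels are illustrative.
from collections import Counter, defaultdict

parses = [("give", "NP V NP NP"), ("give", "NP V NP PP"), ("give", "NP V NP PP"),
          ("eat", "NP V NP"), ("eat", "NP V")]

counts = defaultdict(Counter)
for verb, frame in parses:
    counts[verb][frame] += 1

vff = {v: {f: n / sum(c.values()) for f, n in c.items()} for v, c in counts.items()}
print(vff["give"])  # {'NP V NP NP': 0.333..., 'NP V NP PP': 0.666...}
```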

pdf bib
Ibom NLP: A Step Toward Inclusive Natural Language Processing for Nigeria’s Minority Languages
Oluwadara Kalejaiye | Luel Hagos Beyene | David Ifeoluwa Adelani | Mmekut-mfon Gabriel Edet | Aniefon Daniel Akpan | Eno-Abasi Urua | Anietie Andy

Nigeria is the most populous country in Africa, with a population of more than 200 million people. More than 500 languages are spoken in Nigeria, making it one of the most linguistically diverse countries in the world. Despite this, natural language processing (NLP) research has mostly focused on the following four languages: Hausa, Igbo, Nigerian-Pidgin, and Yoruba (i.e., <1% of the languages spoken in Nigeria). This is in part due to the unavailability of textual data in these languages to train and apply NLP algorithms. In this work, we introduce Ibom—a dataset for machine translation and topic classification in four Coastal Nigerian languages from the Akwa Ibom State region: Anaang, Efik, Ibibio, and Oro. These languages are not represented in Google Translate or in major benchmarks such as Flores-200 or SIB-200. We focus on extending the Flores-200 benchmark to these languages, and further align the translated texts with topic labels based on the SIB-200 classification dataset. Our evaluation shows that current LLMs perform poorly on machine translation for these languages in both zero- and few-shot settings. However, we find that few-shot examples steadily improve topic classification as the number of shots increases.

pdf bib
An Analysis of the Impact of Problem Paraphrasing on LLM-Based Mathematical Problem Solving
Yerim Han | Hyein Seo | Hyuk Namgoong | Sangkeun Jung

Recent advances in large language models (LLMs) have significantly improved mathematical problem-solving. Among various techniques, paraphrasing problem statements has emerged as a promising strategy to enhance model understanding and accuracy. We define twelve paraphrasing types grounded in mathematics education theory and analyze their impact on LLM performance across different configurations. To automate selection, we propose a Paraphrase Type Selector that predicts effective paraphrases for each problem. Experiments on MATH-500, SVAMP, and AIME show consistent performance gains from paraphrased problems. On MATH-500 with LLaMA 3.1-8B, combining the original with the best five paraphrased problems improves accuracy by +8.4%, with the selector achieving an additional +1.33% gain.

pdf bib
Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches
Changhao Pan | Dongyu Yao | Yu Zhang | Wenxiang Guo | Jingyu Lu | Zhiyuan Zhu | Zhou Zhao

Recent advances in singing voice synthesis (SVS) have attracted substantial attention from both academia and industry. With the advent of large language models and novel generative paradigms, producing controllable, high-fidelity singing voices has become an attainable goal. Yet the field still lacks a comprehensive survey that systematically analyzes deep-learning-based singing voice systems and their enabling technologies. To address this issue, this survey first categorizes existing systems by task type and then organizes current architectures into two major paradigms: cascaded and end-to-end approaches. Moreover, we provide an in-depth analysis of core technologies, covering singing modeling and control techniques. Finally, we review relevant datasets, annotation tools, and evaluation benchmarks that support training and assessment. In the appendix, we introduce training strategies and provide further discussion of SVS. This survey offers an up-to-date review of the literature on SVS models, which should serve as a useful reference for both researchers and engineers. Related materials are available at https://github.com/David-Pigeon/SyntheticSingers.

pdf bib
ASAudio: A Survey of Advanced Spatial Audio Research
Zhiyuan Zhu | Yu Zhang | Wenxiang Guo | Changhao Pan | Zhou Zhao

With the rapid development of spatial audio technologies today, applications in AR, VR and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. To address this, we provide a comprehensive overview of spatial audio and systematically review recent literature in the area: we chronologically outline existing work related to spatial audio and categorize these studies based on input-output representations, as well as generation and understanding tasks, thereby summarizing various research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives. Related materials are available at https://github.com/dieKarotte/ASAudio.

pdf bib
MossNet: Mixture of State-Space Experts is a Multi-Head Attention
Shikhar Tuli | James Seale Smith | Haris Jeelani | Chi-Heng Lin | Abhishek Patel | Vasili Ramanishka | Yen-Chang Hsu | Hongxia Jin

Large language models (LLMs) have significantly advanced generative applications in natural language processing (NLP). Recent trends in model architectures revolve around efficient variants of transformers or state-space/gated-recurrent models (SSMs, GRMs). However, prevailing SSM/GRM-based methods often emulate only a single attention head, potentially limiting their expressiveness. In this work, we propose MossNet, a novel mixture-of-state-space-experts architecture that emulates a linear multi-head attention (MHA). MossNet leverages a mixture-of-experts (MoE) implementation not only in channel-mixing multi-layered perceptron (MLP) blocks but also in the time-mixing SSM kernels to realize multiple “attention heads.” Extensive experiments on language modeling and downstream evaluations show that MossNet outperforms both transformer- and SSM-based architectures of similar model size and data budgets. Larger variants of MossNet, trained on trillions of tokens, further confirm its scalability and superior performance. In addition, real-device profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 GPU demonstrates favorable runtime speed and resource usage compared to similarly sized baselines. Our results suggest that MossNet is a compelling new direction for efficient, high-performing recurrent LLM architectures.

pdf bib
LLM-Based Behavior Prediction for Social Media Users with Continuous Memory
Kun Li | Chengwei Dai | Wei Zhou | Songlin Hu

Large language models (LLMs) have demonstrated strong capabilities in simulating social roles and generating human-like behaviors. However, their effectiveness in predicting real-world user behavior under continuous memory accumulation remains largely unexplored. Most existing studies focus on short-term interactions or static personas, neglecting the dynamic nature of users’ historical experiences in social media environments. To address this gap, we introduce FineRob, a novel dataset for fine-grained behavior prediction of social media users, which includes long-term memory traces from 1,866 users across three platforms. Each behavior is decomposed into three elements: object, type, and content, resulting in 78.6k QA records. We identify that as memory accumulates, prediction accuracy drops significantly due to the model’s difficulty in accessing detailed historical information. We further propose the OM-CoT fine-tuning framework to enhance the model’s ability to process and utilize long-term memory. Experimental results show that our method effectively reduces the performance degradation caused by memory growth, improving fine-grained behavior prediction.

pdf bib
Hassles and Uplifts Detection on Social Media Narratives
Jiyu Chen | Sarvnaz Karimi | Diego Molla | Andreas Duenser | Maria Kangas | Cecile Paris

Hassles and uplifts are psychological constructs of individuals’ positive or negative responses to daily minor incidents, with cumulative impacts on mental health. These concepts are largely overlooked in NLP, where existing tasks and models focus on identifying general sentiment expressed in text. These, however, cannot satisfy targeted information needs in psychological inquiry. To address this, we introduce Hassles and Uplifts Detection (HUD), a novel NLP application to identify these constructs in social media language. We evaluate various language models and task adaptation approaches on a probing dataset collected from a private, real-time emotional venting platform. Some of our models achieve F scores close to 80%. We also identify open opportunities to improve affective language understanding in support of studies in psychology.

pdf bib
Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
Saeed Almheiri | Yerulan Kongrat | Adrian Santosh | Ruslan Tasmukhanov | Josemaria Loza Vera | Muhammad Dehan Al Kautsar | Fajri Koto

As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.

pdf bib
SEA-LION: Southeast Asian Languages in One Network
Raymond Ng | Thanh Ngan Nguyen | Huang Yuli | Tai Ngee Chia | Leong Wai Yi | Wei Qi Leong | Xianbin Yong | Jian Gang Ngui | Yosephine Susanto | Nicholas Cheng | Hamsawardhini Rengarajan | Peerat Limkonchotiwat | Adithya Venkatadri Hulagadri | Kok Wai Teng | Yeo Yeow Tong | Bryan Siow | Wei Yi Teo | Tan Choon Meng | Brandon Ong | Zhi Hao Ong | Jann Railey Montalan | Adwin Chan | Sajeban Antonyrex | Ren Lee | Esther Choa | David Ong Tat-Wee | Bing Jie Darius Liu | William Chandra Tjhi | Erik Cambria | Leslie Teo

Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.

pdf bib
Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction
Hongjin Kim | Jaewook Lee | Kiyoung Lee | Jong-hun Shin | Soojong Lim | Oh-Woog Kwon

Large Language Models (LLMs) demonstrate strong reasoning and self-correction abilities in high-resource languages like English, but their performance remains limited in low-resource languages such as Korean. In this study, we investigate whether reinforcement learning (RL) can enhance Korean reasoning abilities to a degree comparable to English. Our findings reveal that RL alone yields limited improvements when applied to models lacking inherent Korean reasoning capabilities. To address this, we explore several fine-tuning strategies and show that aligning the model’s internal reasoning processes with Korean inputs—particularly by tuning Korean-specific neurons in early layers—is key to unlocking RL’s effectiveness. We introduce a self-correction code-switching dataset to facilitate this alignment and observe significant performance gains in both mathematical reasoning and self-correction tasks. Ultimately, we conclude that the crucial factor in multilingual reasoning enhancement is not injecting new linguistic knowledge, but effectively eliciting and aligning existing reasoning capabilities. Our study provides a new perspective on how internal translation and neuron-level tuning contribute to multilingual reasoning alignment in LLMs.

pdf bib
Multilingual Iterative Model Pruning: What Matters?
Haryo Akbarianto Wibowo | Haiyue Song | Hideki Tanaka | Masao Utiyama | Alham Fikri Aji | Raj Dabre

Pruning techniques have been studied to construct small models for efficiency, yet cross-lingual effects, i.e., the transferability of performance across languages, remain understudied in this field. In this work, we investigate cross-lingual effects in multilingual large language model compression using iterative pruning and recovery. We employ structured layer pruning with LoRA-based recovery and knowledge distillation, testing whether calibration languages different from the target evaluation languages can preserve multilingual performance. Experiments on Qwen2.5-7B and Llama3.1-8B demonstrate that any recovery language consistently outperforms no-recovery baselines, with even low-resource languages like Swahili providing ~5% improvements. Contrary to expectations, dominant pretraining languages do not always yield the best results: Indonesian achieves the highest performance for Llama3.1-8B, while Japanese performs best for Qwen2.5-7B. Our findings reveal that cross-lingual calibration effectively maintains multilingual capabilities during iterative pruning.

pdf bib
Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
Lijia Liu | Takumi Kondo | Kyohei Atarashi | Koh Takeuchi | Jiyi Li | Shigeru Saito | Hisashi Kashima

This paper investigates defenses in LLM-based evaluation, where prompt injection attacks can manipulate scores by deceiving the evaluation system. We formalize blind attacks as a class in which candidate answers are crafted independently of the true answer. To counter such attacks, we propose an evaluation framework that combines standard and counterfactual evaluation. Experiments show it significantly improves attack detection with minimal performance trade-offs for recent models.

pdf bib
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
Yohan Mathew | Ollie Matthews | Robert McCarthy | Joan Velja | Christian Schroeder de Witt | Dylan Cope | Nandi Schoots

The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness of corresponding countermeasures. To investigate this problem, we design two approaches – a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method – for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due to misspecified reward incentives during training. Additionally, we find that standard mitigations — both passive oversight of model outputs and active mitigation through communication paraphrasing — are not fully effective at preventing this steganographic communication. Our findings imply that (i) emergence of steganographic collusion is a plausible concern that should be monitored and researched, and (ii) preventing emergence may require innovation in mitigation techniques.

pdf bib
A Survey on LLM-Assisted Clinical Trial Recruitment
Shrestha Ghosh | Moritz Schneider | Carina Reinicke | Carsten Eickhoff

Clinical trials are designed in natural language, and the task of matching them to patients, who are represented via both structured and unstructured textual data, benefits from the knowledge aggregation and reasoning abilities of LLMs. LLMs, with their ability to consolidate distributed knowledge, hold the potential to build a more general solution than classical approaches that employ trial-specific heuristics. Yet the adoption of LLMs in critical domains such as clinical research comes with many challenges, such as the availability of public benchmarks, the dimensions of evaluation, and data sensitivity. In this survey, we contextualize emerging LLM-based approaches to clinical trial recruitment. We examine the main components of the clinical trial recruitment process, discuss existing challenges in adopting LLM technologies in clinical research, and highlight exciting future directions.

pdf bib
The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
Francois Meyer | Jan Buys

Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isiXhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isiXhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improving text generation and cross-lingual transfer for low-resource, morphologically complex languages.

pdf bib
PRALEKHA: Cross-Lingual Document Alignment for Indic Languages
Sanjay Suryanarayanan | Haiyue Song | Mohammed Safi Ur Rahman Khan | Anoop Kunchukuttan | Raj Dabre

Mining parallel document pairs for document-level machine translation (MT) remains challenging due to the limitations of existing Cross-Lingual Document Alignment (CLDA) techniques. Existing methods often rely on metadata such as URLs, which are scarce, or on pooled document representations that fail to capture fine-grained alignment cues. Moreover, the limited context window of sentence embedding models hinders their ability to represent document-level context, while sentence-based alignment introduces a combinatorially large search space, leading to high computational cost. To address these challenges for Indic languages, we introduce Pralekha, a benchmark containing over 3 million aligned document pairs across 11 Indic languages and English, which includes 1.5 million English–Indic pairs. Furthermore, we propose Document Alignment Coefficient (DAC), a novel metric for fine-grained document alignment. Unlike pooling-based methods, DAC aligns documents by matching smaller chunks and computes similarity as the ratio of aligned chunks to the average number of chunks in a pair. Intrinsic evaluation shows that our chunk-based method is 2–3× faster while maintaining competitive performance, and that DAC achieves substantial gains over pooling-based baselines. Extrinsic evaluation further demonstrates that document-level MT models trained on DAC-aligned pairs consistently outperform those using baseline alignment methods. These results highlight DAC’s effectiveness for parallel document mining. The dataset and evaluation framework are publicly available to support further research.
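
Following the description above, the Document Alignment Coefficient can be sketched as the number of aligned chunks divided by the average chunk count of the two documents; the chunking and chunk-matching steps (e.g., embedding-based matching) are abstracted away in this sketch.

```python
# Document Alignment Coefficient (DAC) as described in the abstract:
# ratio of aligned chunks to the average number of chunks in the pair.
# `aligned_chunk_pairs` would come from a chunk matcher (abstracted away here).
def dac(src_chunks: list, tgt_chunks: list, aligned_chunk_pairs: int) -> float:
    avg_chunks = (len(src_chunks) + len(tgt_chunks)) / 2
    return aligned_chunk_pairs / avg_chunks if avg_chunks else 0.0

print(dac(["c1", "c2", "c3", "c4"],
          ["d1", "d2", "d3", "d4", "d5", "d6"],
          aligned_chunk_pairs=4))  # 0.8
```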

pdf bib
Structured Document Translation via Format Reinforcement Learning
Haiyue Song | Johannes Eschbach-Dymanus | Hour Kaing | Sumire Honda | Hideki Tanaka | Bianka Buschbeck | Masao Utiyama

Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we propose StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
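
As a rough stand-in for a structure-aware reward, the sketch below scores the overlap between the tag multisets of a predicted and a reference XML tree; this is a simplification for illustration only, not the paper's TreeSim, which compares full tree structure.

```python
# Illustrative structural reward: F1 overlap between the tag multisets of a
# predicted and a reference XML tree (ignores nesting, unlike a true tree metric).
import xml.etree.ElementTree as ET
from collections import Counter

def tag_overlap_f1(pred_xml: str, ref_xml: str) -> float:
    pred = Counter(e.tag for e in ET.fromstring(pred_xml).iter())
    ref = Counter(e.tag for e in ET.fromstring(ref_xml).iter())
    overlap = sum((pred & ref).values())
    if not overlap:
        return 0.0
    p, r = overlap / sum(pred.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)

print(tag_overlap_f1("<doc><p>Hallo</p></doc>", "<doc><p>Hello</p><note/></doc>"))  # 0.8
```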

pdf bib
StuD: A Multimodal Approach for Stuttering Detection with RAG and Fusion Strategies
Pragya Khanna | Priyanka Kommagouni | Vamshi Raghu Simha Narasinga | Anil Vuppala

Stuttering is a complex speech disorder that challenges both ASR systems and clinical assessment. We propose a multimodal stuttering detection and classification model that integrates acoustic and linguistic features through a two-stage fusion mechanism. Fine-tuned Wav2Vec 2.0 and HuBERT extract acoustic embeddings, which are early fused with MFCC features to capture fine-grained spectral and phonetic variations, while Llama-2 embeddings from Whisper ASR transcriptions provide linguistic context. To enhance robustness against out-of-distribution speech patterns, we incorporate Retrieval-Augmented Generation or adaptive classification. Our model achieves state-of-the-art performance on SEP-28k and FluencyBank, demonstrating significant improvements in detecting challenging stuttering events. Additionally, our analysis highlights the complementary nature of acoustic and linguistic modalities, reinforcing the need for multimodal approaches in speech disorder detection.

pdf bib
Deconstructing Attention: Investigating Design Principles for Effective Language Modeling
Huiyin Xue | Nafise Sadat Moosavi | Nikolaos Aletras

The success of Transformer language models is widely credited to their dot-product attention mechanism, which interweaves a set of key design principles: mixing information across positions (enabling multi-token interactions), sequence-dependent activations (where attention weights adapt to each input), a specific mathematical form (dot-product similarities plus softmax weighting), and coupling of queries and keys to evolving hidden states (grounding attention in the current layer). However, the necessity of each of these principles remains largely untested. In this work, we systematically deconstruct attention by designing controlled variants that selectively relax these principles, applied both uniformly across all layers and in hybrid architectures where only some layers retain standard attention. Our empirical analysis reveals that mechanisms for mixing tokens are indispensable, as their absence collapses models to near-random behavior, while the exact mathematical form and sequence dependency can be substantially relaxed, especially when preserved in just a subset of layers. Surprisingly, even variants that fail in isolation can achieve robust performance when interleaved with standard attention, highlighting a cooperative effect. These findings deepen our understanding of what truly underpins attention’s effectiveness and open new avenues for simplifying language models without sacrificing performance.

pdf bib
Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance
Ahmed Alajrami | Xingwei Tan | Nikolaos Aletras

Instruction-tuning plays a vital role in enhancing the task-solving abilities of large language models (LLMs), improving their usability in generating helpful responses on various tasks. However, previous work has demonstrated that they are sensitive to minor variations in instruction phrasing. In this paper, we explore whether introducing perturbations in instruction-tuning data can enhance LLMs’ resistance against noisy instructions. We focus on how instruction-tuning with perturbations, such as removing stop words or shuffling words, affects LLMs’ performance on the original and perturbed versions of widely-used benchmarks (MMLU, BBH, GSM8K). We further assess learning dynamics and potential shifts in model behavior. Surprisingly, our results suggest that instruction-tuning on perturbed instructions can, in some cases, improve downstream performance. These findings highlight the importance of including perturbed instructions in instruction-tuning, which can make LLMs more resilient to noisy user inputs.
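
The two perturbations named above are simple to reproduce; a minimal sketch of stop-word removal and word shuffling follows, with a small illustrative stop-word list (not the one used in the paper).

```python
# Two of the instruction perturbations mentioned above: stop-word removal and
# word shuffling. The stop-word list here is a small illustrative subset.
import random

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}

def remove_stop_words(instruction: str) -> str:
    return " ".join(w for w in instruction.split() if w.lower() not in STOP_WORDS)

def shuffle_words(instruction: str, seed: int = 0) -> str:
    words = instruction.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

inst = "Summarize the main findings of the article in two sentences"
print(remove_stop_words(inst))  # "Summarize main findings article two sentences"
print(shuffle_words(inst))
```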

pdf bib
Captions Speak Louder than Images: Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data
Xinyi Ling | Hanwen Du | Bo Peng | Zhihui Zhu | Xia Ning

Multimodal foundation models (MFMs) have demonstrated strong capabilities in e-commerce by effectively leveraging multimodal data to enhance product understanding and user experience. However, the development of e-commerce MFMs is hindered by two challenges: (1) the scarcity of large-scale, high-quality multimodal benchmark datasets; and (2) the lack of effective multimodal information integration methods in e-commerce. To address these challenges, we introduce MMECInstruct, the first large-scale, high-quality multimodal instruction dataset designed specifically for e-commerce MFMs. MMECInstruct comprises 75,000 samples covering 7 real-world e-commerce tasks, supporting both in-domain (IND) and out-of-domain (OOD) evaluations. Leveraging MMECInstruct, we develop CASLIE, a lightweight framework that enhances multimodal information understanding and integration for e-commerce. Our comprehensive evaluation demonstrates that MMECInstruct endows CASLIE with advanced capability and strong generalizability in e-commerce applications. MMECInstruct and CASLIE models are publicly accessible through https://github.com/ninglab/CASLIE.

pdf bib
EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-commerce Models
Xinyi Ling | Hanwen Du | Zhihui Zhu | Xia Ning

E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU comprises multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis of EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates that MLLMs may struggle to effectively leverage rich visual content for e-commerce tasks. Building on these insights, we propose SUMEI, a data-driven method that strategically utilizes multiple images by predicting visual utilities before using them for downstream tasks. Comprehensive experiments demonstrate the effectiveness and robustness of SUMEI. The data and code are available through https://github.com/ninglab/EcomMMMU.

pdf bib
Multilingual, Not Multicultural: Uncovering the Cultural Empathy Gap in LLMs through a Comparative Empathetic Dialogue Benchmark
Woojin Lee | Yujin Sim | Hongjin Kim | Harksoo Kim

Large Language Models (LLMs) demonstrate remarkable multilingual capabilities, yet it remains unclear whether they are truly multicultural. Do they merely process different languages, or can they genuinely comprehend the unique cultural contexts embedded within them? This study investigates this critical question by examining whether LLMs’ perception of emotion and empathy differs across linguistic and cultural boundaries. To facilitate this, we introduce the Korean Empathetic Dialogues (KoED), a benchmark extending the English-based EmpatheticDialogues (ED) dataset. Moving beyond direct translation, we meticulously reconstructed dialogues specifically selected for their potential for cultural adaptation, aligning them with Korean emotional nuances and incorporating key cultural concepts like ‘jeong’ and ‘han’ that lack direct English equivalents. Our cross-cultural evaluation of leading multilingual LLMs reveals a significant “cultural empathy gap”: models consistently underperform on KoED compared to ED, struggling especially with uniquely Korean emotional expressions. Notably, the Korean-centric model, EXAONE, exhibits significantly higher cultural appropriateness. This result provides compelling evidence that aligns with the “data provenance effect”, suggesting that the cultural alignment of pre-training data is a critical factor for genuine empathetic communication. These findings demonstrate that current LLMs have cultural blind spots and underscore the necessity of benchmarks like KoED to move beyond simple linguistic fluency towards truly culturally adaptive AI systems.

pdf bib
HiPPO: Exploring A Novel Hierarchical Pronunciation Assessment Approach for Spoken Languages
Bi-Cheng Yan | Hsin Wei Wang | Fu-An Chao | Tien-Hong Lo | Yung-Chang Hsu | Berlin Chen

Automatic pronunciation assessment (APA) seeks to quantify a second language (L2) learner’s pronunciation proficiency in a target language by offering timely and fine-grained diagnostic feedback. Most existing efforts on APA have predominantly concentrated on highly constrained reading-aloud tasks (where learners are prompted to read a reference text aloud); however, assessing pronunciation quality in unscripted speech (or free-speaking scenarios) remains relatively underexplored. In light of this, we first propose HiPPO, a hierarchical pronunciation assessment model tailored for spoken languages, which evaluates an L2 learner’s oral proficiency at multiple linguistic levels based solely on the speech uttered by the learner. To improve the overall accuracy of assessment, a contrastive ordinal regularizer and a curriculum learning strategy are introduced for model training. The former aims to generate score-discriminative features by exploiting the ordinal nature of regression targets, while the latter gradually ramps up the training complexity to facilitate the assessment task that takes unscripted speech as input. Experiments conducted on the Speechocean762 benchmark dataset validate the feasibility and superiority of our method in relation to several cutting-edge baselines.

pdf bib
Positional Bias in Long-Document Ranking: Impact, Assessment, and Mitigation
Leonid Boytsov | David Akinpelu | Nipun Katyal | Tianyi Lin | Fangwei Gao | Yutian Zhao | Jeffrey Huang | Eric Nyberg

We tested over 20 Transformer models for ranking long documents (including recent LongP models trained with FlashAttention and RankGPT models “powered” by OpenAI and Anthropic cloud APIs). We compared them with the simple FirstP baseline, which applied the same model to truncated input (up to 512 tokens). On MS MARCO, TREC DL, and Robust04, no long-document model outperformed FirstP by more than 5% (on average). We hypothesized that this lack of improvement is not due to inherent model limitations, but due to benchmark positional bias (most relevant passages tend to occur early in documents), which is known to exist in MS MARCO. To confirm this, we analyzed positional relevance distributions across four long-document corpora (with six query sets) and observed the same early-position bias. Surprisingly, we also found bias in six BEIR collections, which are typically categorized as short-document datasets. We then introduced a new diagnostic dataset, MS MARCO FarRelevant, where relevant spans were deliberately placed beyond the first 512 tokens. On this dataset, many long-context models—including RankGPT—performed at random-baseline level, suggesting overfitting to positional bias. We also experimented with debiasing training data, but with limited success. Our findings (1) highlight the need for careful benchmark design in evaluating long-context models for document ranking, (2) identify model types that are more robust to positional bias, and (3) motivate further work on approaches to debias training data. We release our code and data to support further research.

pdf bib
How Reliable are Causal Probing Interventions?
Marc E. Canby | Adam Davies | Chirag Rastogi | Julia Hockenmaier

Causal probing aims to analyze foundation models by examining how intervening on their representation of various latent properties impacts their outputs. Recent works have cast doubt on the theoretical basis of several leading causal probing methods, but it has been unclear how to systematically evaluate the effectiveness of these methods in practice. To address this, we define two key causal probing desiderata: *completeness* (how thoroughly the representation of the target property has been transformed) and *selectivity* (how little non-targeted properties have been impacted). We find that there is an inherent tradeoff between the two, and we define *reliability* as their harmonic mean. We introduce an empirical analysis framework to measure and evaluate these quantities, allowing us to make the first direct comparisons between different families of leading causal probing methods (e.g., linear vs. nonlinear, or concept removal vs. counterfactual interventions). We find that: (1) all methods show a clear tradeoff between completeness and selectivity; (2) more complete and reliable methods have a greater impact on LLM behavior; and (3) nonlinear interventions are almost always more reliable than linear interventions. Our project webpage is available at: [https://ahdavies6.github.io/causal_probing_reliability/](https://ahdavies6.github.io/causal_probing_reliability/)
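The reliability measure named in the abstract is simply the harmonic mean of completeness and selectivity; a minimal sketch follows, with an illustrative zero-handling convention that the paper itself may not specify.

```python
def reliability(completeness: float, selectivity: float) -> float:
    # Harmonic mean of completeness and selectivity, as defined in the abstract.
    if completeness == 0 or selectivity == 0:
        return 0.0  # illustrative convention for the degenerate case
    return 2 * completeness * selectivity / (completeness + selectivity)

# A highly complete but unselective intervention is penalized:
print(reliability(0.9, 0.3))  # 0.45
```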

pdf bib
wavCSE: Learning Fixed-size Unified Speech Embeddings via Feature-based Multi-Task Learning
Braveenan Sritharan | Uthayasanker Thayasivam

Modern speech applications require compact embeddings that generalize across both linguistic and paralinguistic tasks. However, most existing embeddings are task-specific and fail to transfer effectively across domains. We propose wavCSE, a feature-based multi-task learning model that produces a fixed-size unified speech embedding suitable for both linguistic and paralinguistic tasks. wavCSE is jointly trained on keyword spotting, speaker identification, and emotion recognition, achieving state-of-the-art performance on all three tasks. The resulting unified embedding is then evaluated on twelve downstream tasks spanning both linguistic and paralinguistic domains. Experimental results show that it outperforms strong baselines on nine of the twelve tasks, indicating effective generalization across domains. To streamline embedding generation, we introduce a recursive layer selection strategy that identifies the most informative hidden layer outputs from the upstream model and refine how these selected outputs are aggregated in the downstream model. These enhancements reduce memory usage and computational cost while improving task performance, making them broadly applicable to self-supervised learning-based speech processing models.

pdf bib
Unveiling Empathic Triggers in Online Interactions via Empathy Cause Identification
Calliope Chloe Bandera | Gyeongeun Lee | Natalie Parde

Effectively learning language patterns that provoke empathetic expression is vital to creating emotionally intelligent technologies; however, this problem has historically been overlooked. We address this gap by proposing the new task of empathy cause identification: a challenging task aimed at pinpointing specific triggers prompting empathetic responses in communicative settings. We correspondingly introduce AcnEmpathize-Cause, a novel dataset consisting of 4K cause-identified sentences, and explore various models to evaluate and demonstrate the dataset’s efficacy. This research not only contributes to the understanding of empathy in textual communication but also paves the way for the development of AI systems capable of more nuanced and supportive interactions.

pdf bib
Assessing the Limits of In-Context Learning beyond Functions using Partially Ordered Relation
Debanjan Dutta | Faizanuddin Ansari | Swagatam Das

Generating rational and generally accurate responses to tasks, often accompanied by example demonstrations, highlights Large Language Models’ (LLMs’) remarkable In-Context Learning (ICL) capabilities without requiring updates to the model’s parameter space. While ongoing exploration has focused on inference from document-level concepts, ICL’s behavior in learning well-defined functions or relations in context needs careful investigation. In this article, we present the performance of ICL on partially ordered relations by introducing the notion of inductively increasing complexity in prompts. In most cases, the saturated performance of the chosen metric indicates that while ICL offers some benefits, its effectiveness remains constrained as we increase the complexity of the prompts, even in the presence of sufficient demonstrative examples. This behavior is evident from our empirical findings and is further justified theoretically in terms of ICL’s implicit optimization process. The code is available here.

pdf bib
ProSwitch: Knowledge-Guided Instruction Tuning to Switch Between Professional and Non-Professional Responses
Chang Zong | Yuyan Chen | Weiming Lu | Jian Shao | Yongfeng Huang | Heng Chang | Yueting Zhuang

Large Language Models (LLMs) have demonstrated efficacy in various linguistic applications, including question answering and controlled text generation. However, studies into their ability to switch between opposite styles of responses in professional domains remain underexplored. This study introduces a novel approach, named ProSwitch, which enables a language model to switch between professional and non-professional answers, by tuning and evaluating through the guidance of domain and style knowledge. ProSwitch unfolds in three phases: LLM-augmented preparation to collect domain knowledge and QA pairs, instruction tuning to optimize LLMs with multiple levels of knowledge, and comprehensive evaluation to assess both style discrimination and reference-based quality of the generated text. Comparative analysis of ProSwitch against general and specialized LLMs reveals that our approach outperforms baselines in switching between professional and non-professional responses.

pdf bib
Reasoning Models Reason Well, Until They Don’t
Revanth Rameshkumar | Jimson Huang | Yunxin Sun | Fei Xia | Abulhair Saparov

Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs): LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as **NLGraph** seems extraordinary, with some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show existing benchmarks actually have limited complexity. We develop a new dataset, the **Deep Reasoning Dataset (DeepRD)**, along with a generative process for producing unlimited examples of scalable complexity. We use this dataset to evaluate model performance on graph connectivity and natural language proof planning. We find that the performance of LRMs drops abruptly at sufficient complexity and does not generalize. We also relate our LRM results to the distributions of the complexities of large, real-world knowledge graphs, interaction graphs, and proof datasets. We find the majority of real-world examples fall inside the LRMs’ success regime, yet the long tails expose substantial failure potential. Our analysis highlights the near-term utility of LRMs while underscoring the need for new methods that generalize beyond the complexity of examples in the training distribution.

pdf bib
Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration
Songyuan Sui | Hongyi Liu | Serena Liu | Li Li | Soo-Hyun Choi | Rui Chen | Xia Hu

Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Extensive experiments across four models and five widely used benchmarks demonstrate that CoQ achieves substantial accuracy improvements and significantly lowers invalid SQL rates compared to prior generic LLM-based, SQL-aided, and hybrid baselines, confirming its superior effectiveness in table understanding. The code is available at https://github.com/SongyuanSui/ChainofQuery.

pdf bib
Multimodal Language Models for Financial Forecasting from Interleaved Sequences of Text and Time Series
Ross Koval | Nicholas Andrews | Xifeng Yan

Text and time series data offer complementary views of financial markets: news articles provide narrative context about company events, while stock prices reflect how markets react to those events. However, despite their complementary nature, effectively integrating these interleaved modalities for improved forecasting remains challenging. In this work, we propose a unified neural architecture that models these interleaved sequences using modality-specific experts, allowing the model to learn unique time series patterns, while still enabling joint reasoning across modalities and preserving pretrained language understanding capabilities. To further improve multimodal understanding, we introduce a cross-modal alignment framework with a salient token weighting mechanism that learns to align representations across modalities with a focus on the most informative tokens. We demonstrate the effectiveness of our approach on a large-scale financial forecasting task, achieving state-of-the-art performance over a wide variety of strong unimodal and multimodal baselines. We develop an interpretability method that reveals insights into the value of time-series context and reinforces the design of our cross-modal alignment objective. Finally, we demonstrate that these improvements translate to meaningful economic gains in investment simulations.

pdf bib
Comparing Language Models of Different Scales for Security-Focused Tabular Query Generation and Reasoning
Varivashya Poladi | Sandipan Dandapat

Security-related data often exists in complex, multi-table formats and is scarce due to privacy and compliance constraints—posing a major challenge for training and evaluating language models (LMs) on security reasoning tasks. In this work, we systematically investigate the performance of large language models (LLMs) across different parameter scales in generating and solving multi-step, semantically rich queries over realistic security scenarios represented through three interlinked tabular datasets. We assess models on three key axes: (i) their ability to formulate insightful, high-complexity security questions; (ii) the quality and coherence of their reasoning chains; and (iii) their accuracy in deriving actionable answers from the underlying data. To address data scarcity, we propose a diffusion-based synthetic data generation pipeline that amplifies the existing dataset while preserving domain semantics and statistical structure. Our findings reveal that while large models often excel in reasoning depth and query formulation, smaller models show surprising efficiency and accuracy. The study provides actionable insights for deploying generative models in security analytics and opens avenues for synthetic data-driven evaluation of LLMs in low-resource, high-stakes domains.

pdf bib
Generate but Verify: Answering with Faithfulness in RAG-based Question Answering
Simone Filice | Elad Haramaty | Guy Horowitz | Zohar Karnin | Liane Lewin-Eytan | Alex Shtoff

Retrieval-Augmented Generation (RAG) enhances LLMs by grounding answers in retrieved passages, which is key in factual Question Answering. However, generated answers may still be unfaithful to the passages, either due to retrieval or generation errors. Many RAG downstream applications rely on assessing answer faithfulness for applying fallback strategies, yet address it implicitly, without a consistent evaluation methodology. We introduce the task of Answering with Faithfulness (AwF), which brings faithfulness prediction to the forefront, explicitly coupling it with answer generation. We define variants of the precision and recall metrics tailored to this task, facilitating direct evaluation and comparison of different AwF methods. We then demonstrate, both theoretically and empirically, that for RAG applications using AwF as a sub-procedure, an improvement in the AwF metrics translates to an improvement in downstream performance, yielding gains over recently published results.

pdf bib
Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length?
Celine Lee | Alexander M Rush | Keyon Vafa

Large language models (LLMs) often benefit from verbalized reasoning at inference time, but it remains unclear which aspects of task difficulty these extra reasoning tokens address. To investigate this question, we construct a controlled setting where task complexity can be precisely manipulated to study its effect on reasoning length. Deterministic finite automata (DFAs) offer a formalism through which we can characterize task complexity via measurable properties such as run length (number of reasoning steps required) and state-space size (decision complexity). We first show that across different tasks and models of different sizes and training paradigms, there exists an optimal amount of reasoning tokens such that the probability of producing a correct solution is maximized. We then investigate which properties of complexity govern this critical length: we find that task instances with longer corresponding underlying DFA runs (i.e., those demanding greater latent state tracking) correlate with longer reasoning lengths, but, surprisingly, that DFA size (i.e., state-space complexity) does not. We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal-length answers results in consistent accuracy improvements.
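A toy illustration of the two complexity measures the abstract contrasts: run length grows with the input, while the state space of this parity automaton stays fixed at two. The function and transition table below are hypothetical stand-ins, not the authors' problem generator.

```python
def dfa_run_length(transitions, start_state, string):
    # Counts one step per consumed symbol; illustrative stand-in for the
    # run-length measure discussed above, not the authors' code.
    state, steps = start_state, 0
    for symbol in string:
        state = transitions[(state, symbol)]
        steps += 1
    return steps, state

# Toy parity automaton: run length scales with input length,
# while the state space stays fixed at two states.
T = {("even", "0"): "even", ("even", "1"): "odd",
     ("odd", "0"): "odd", ("odd", "1"): "even"}
print(dfa_run_length(T, "even", "10110"))  # (5, 'odd')
```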

pdf bib
MAJI: A Multi-Agent Workflow for Augmenting Journalistic Interviews
Kaiwen Guo | Yimeng Wu

Journalistic interviews are creative, dynamic processes where success hinges on insightful, real-time questioning. While Large Language Models (LLMs) can assist, their tendency to generate coherent but uninspired questions optimizes for probable, not insightful, continuations. This paper investigates whether a structured, multi-agent approach can overcome this limitation to act as a more effective creative partner for journalists. We introduce MAJI, a system designed for this purpose, which employs a divergent-convergent architecture: a committee of specialized agents generates a diverse set of questions, and a convergent agent selects the optimal one. We evaluated MAJI against a suite of strong LLM baselines. Our results demonstrate that our multi-agent framework produces questions that are more coherent, elaborate, and original (+36.9% for our best model vs. a standard LLM baseline), exceeding strong LLM baselines on key measures of creative question quality. Most critically, in a blind survey, professional journalists preferred MAJI’s selected questions over those from the baseline by a margin of more than two to one. We present the system’s evolution, highlighting the architectural trade-offs that enable MAJI to augment, rather than simply automate, journalistic inquiry. We will release the code upon publication.

pdf bib
How Aligned Are Unimodal Language and Graph Encodings of Chemical Molecules?
Congfeng Cao | Zhi Zhang | Jelke Bloem | Khalil Sima’an

Chemical molecules can be represented as graphs or as language descriptions. Training unimodal models on graphs results in different encodings than training them on language. Therefore, the existing literature force-aligns the unimodal models during training to use them in downstream applications such as drug discovery. But to what extent are graph and language unimodal model representations inherently aligned, i.e., aligned prior to any force-alignment training? Knowing this is useful for more expedient and effective force-alignment. For the first time, we explore methods to gauge the alignment of graph and language unimodal models. We find compelling differences between models and their ability to represent slight structural differences without force-alignment. We also present a unified unimodal alignment (U2A) benchmark for gauging the inherent alignment between graph and language encoders, which we make available with this paper.

pdf bib
Interpreting Multi-Attribute Confounding through Numerical Attributes in Large Language Models
Hirohane Takagi | Gouki Minegishi | Shota Kizawa | Issey Sukeda | Hitomi Yanaka

Although behavioral studies have documented numerical reasoning errors in large language models (LLMs), the underlying representational mechanisms remain unclear. We hypothesize that numerical attributes occupy shared latent subspaces and investigate two questions: (1) How do LLMs internally integrate multiple numerical attributes of a single entity? (2) How does irrelevant numerical context perturb these representations and their downstream outputs? To address these questions, we combine linear probing with partial correlation analysis and prompt-based vulnerability tests across models of varying sizes. Our results show that LLMs encode real-world numerical correlations but tend to systematically amplify them. Moreover, irrelevant context induces consistent shifts in magnitude representations, with downstream effects that vary by model size. These findings reveal a vulnerability in LLM decision-making and lay the groundwork for fairer, representation-aware control under multi-attribute entanglement.

pdf bib
EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels
Wan Jou She | Lis Pereira | Fei Cheng | Sakiko Yahata | Panote Siriaraya | Eiji Aramaki

This paper introduces EmplifAI, a Japanese empathetic dialogue dataset designed to support patients coping with chronic medical conditions. These patients often experience a wide range of positive and negative emotions (e.g., hope and despair) that shift across different stages of disease management. EmplifAI addresses this complexity by providing situation-based dialogues grounded in 28 fine-grained emotion categories, adapted and validated from the GoEmotions taxonomy. The dataset includes 280 medically contextualized situations and 4,125 two-turn dialogues, collected through crowdsourcing and expert review. To evaluate emotional alignment in empathetic dialogues, we assessed model predictions on situation–dialogue pairs using BERTScore across multiple large language models (LLMs), achieving F1 scores of ≤ 0.83. Fine-tuning a baseline Japanese LLM (LLM-jp-3.1-13b-instruct4) with EmplifAI resulted in notable improvements in fluency, general empathy, and emotion-specific empathy. Furthermore, we compared the scores assigned by LLM-as-a-Judge and human raters on dialogues generated by multiple LLMs to validate our evaluation pipeline and discuss the insights and potential risks derived from the correlation analysis.

pdf bib
Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs
Wenyu Zhang | Yingxu He | Geyu Lin | Zhuohan Liu | Shuo Sun | Bin Wang | Xunlong Zou | Jeremy H. M. Wong | Qiongqiong Wang | Hardik Bhupendra Sailor | Nancy F. Chen | AiTi Aw

Audio Large Language Models (AudioLLMs) have achieved strong results in semantic tasks like speech recognition and translation, but remain limited in modeling paralinguistic cues such as emotion. Existing approaches often treat emotion understanding as a classification problem, offering little insight into the underlying rationale behind predictions. In this work, we explore emotion reasoning, a strategy that leverages the generative capabilities of AudioLLMs to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations. To support this in multitask AudioLLMs, we introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training. This approach enables AudioLLMs to effectively learn different tasks while incorporating emotional reasoning. Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses. Experiments on two out-of-domain datasets demonstrate the generalization capabilities of the resulting model.

pdf bib
On the Convergence of Moral Self-Correction in Large Language Models
Guangliang Liu | Haitao Mao | Bochuan Cao | Xitong Zhang | Zhiyu Xue | Rongrong Wang | Kristen Johnson

Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.

pdf bib
FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
Divya Jyoti Bajpai | Manjesh Kumar Hanawal

Vision-language Models (VLMs) have made significant strides in visual understanding and query response generation, but often face challenges of high computational cost and inference latency due to autoregressive decoding. In this work, we introduce an imitation-learning-based Self-Speculative Decoding (SSD) framework, named FastVLM, to address these limitations. Our approach employs a lightweight draft model for token generation in an autoregressive manner, while a full model verifies these tokens non-autoregressively. Accepted tokens proceed seamlessly, while rejected tokens are corrected by the full model and used to guide the draft model’s refinement. Through an imitation network, FastVLM enhances the draft model by integrating deeper-level insights from the full model’s architecture. Also, it maintains the performance integrity of the full model while training the draft model, achieving a balance between efficiency and accuracy. Our method speeds up the inference process by 1.55-1.85× as compared to the final layer with minimal loss in performance. The source code is available at https://github.com/Div290/SSD.
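The generic draft-and-verify control flow that self-speculative decoding builds on can be sketched as follows; this is a greedy simplification with hypothetical `draft_next`/`full_next` callables, and it does not reproduce FastVLM's imitation network or its non-autoregressive verification.

```python
def draft_and_verify(draft_next, full_next, prefix, k=4, max_new=32):
    # draft_next / full_next: callables mapping a token-id list to the next
    # token id under the draft and full models (hypothetical interfaces).
    out = list(prefix)
    while len(out) - len(prefix) < max_new:  # may overshoot by at most k-1 tokens
        # 1) The lightweight draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2) The full model checks each proposed position; matching tokens are
        #    accepted, and the first mismatch is replaced by the full model's token.
        accepted = []
        for tok in proposal:
            target = full_next(out + accepted)
            if tok == target:
                accepted.append(tok)
            else:
                accepted.append(target)  # correction from the full model
                break
        out += accepted
    return out
```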

pdf bib
Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions
Wesley Scivetti | Melissa Torgbi | Mollie Shichman | Taylor Pellegrin | Austin Blodgett | Claire Bonial | Harish Tayyar Madabushi

The web-scale of pretraining data has created an important evaluation challenge: to disentangle linguistic competence on cases well-represented in pretraining data from generalization to out-of-domain language, specifically the dynamic, real-world instances less common in pretraining data. To this end, we construct a diagnostic evaluation to systematically assess natural language understanding in LLMs by leveraging Construction Grammar (CxG). CxG provides a psycholinguistically grounded framework for testing generalization, as it explicitly links syntactic forms to abstract, non-lexical meanings. Our novel inference evaluation dataset consists of English phrasal constructions, for which speakers are known to be able to abstract over commonplace instantiations in order to understand and produce creative instantiations. Our evaluation dataset uses CxG to evaluate two central questions: first, if models can “understand” the semantics of sentences for instances that are likely to appear in pretraining data less often, but are intuitive and easy for people to understand. Second, if LLMs can deploy the appropriate constructional semantics given constructions that are syntactically identical but with divergent meanings. Our results demonstrate that state-of-the-art models, including GPT-o1, exhibit a performance drop of over 40% on our second task, revealing a failure to generalize over syntactically identical forms to arrive at distinct constructional meanings in the way humans do. We make our novel dataset and associated experimental data, including prompts and model responses, publicly available.

pdf bib
Improving Sign Language Understanding with a Multi-Stream Masked Autoencoder Trained on ASL Videos
Junwen Mo | MinhDuc Vo | Hideki Nakayama

Sign language understanding remains a significant challenge, particularly for low-resource sign languages with limited annotated data. Motivated by the success of large-scale pretraining in deep learning, we propose Multi-Stream Masked Autoencoder (MS-MAE) — a simple yet effective framework for learning sign language representations from skeleton-based video data. We pretrained a model with MS-MAE on the YouTube-ASL dataset, and then adapted it to multiple downstream tasks across different sign languages. Experimental results show that MS-MAE achieves competitive or superior performance on a range of isolated sign language recognition benchmarks and gloss-free sign language translation tasks across several sign languages. These findings highlight the potential of leveraging large-scale, high-resource sign language data to boost performance in low-resource sign language scenarios. Additionally, visualization of the model’s attention maps reveals its ability to cluster adjacent pose sequences within a sentence, some of which align with individual signs, offering insights into the mechanisms underlying successful transfer learning.

pdf bib
Quantifying Phonosemantic Iconicity Distributionally in 6 Languages
George Flint | Kaustubh Kislay

Language is, as commonly theorized, largely arbitrary. Yet, systematic relationships between phonetics and semantics have been observed in many specific cases. To what degree could those systematic relationships manifest themselves in large-scale, quantitative investigations, both in previously identified and in unidentified phenomena? This work undertakes a distributional approach to quantifying phonosemantic iconicity at scale across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). In each language, we analyze the alignment of morphemes’ phonetic and semantic similarity spaces with a suite of statistical measures, and discover an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns. We also analyze 5 previously hypothesized phonosemantic alignments, finding support for some such alignments and mixed results for others.

pdf bib
Fine-grained Confidence Estimation for Spurious Correctness Detection in Large Language Models
Ai Ishii | Naoya Inoue | Hisami Suzuki | Satoshi Sekine

In the deployment of Large Language Models (LLMs), “spurious correctness”—where answers are correct but reasoning contains errors—poses a critical risk by creating an illusion of reliability. While prior work on LLM confidence estimation focuses on answer-level or entire reasoning path confidence, these coarse-grained approaches fail to identify which specific parts of the reasoning contain errors. We propose a fine-grained confidence estimation framework that computes confidence scores for individual evidence triplets within reasoning chains, enabling precise localization of errors. Using carefully designed prompts, we generate answers, evidence in triplet format, and their respective confidence scores simultaneously, allowing automatic detection of spurious correctness patterns where partial evidence contains factual errors. Evaluated on both Japanese and English multi-hop QA benchmarks across multiple models from three model families representing different architectures and training approaches, we show that our approach exhibits superior calibration performance for evidence confidence and demonstrates effective ability to detect spurious correct answers (up to 0.84 on our primary discrimination metric). The consistent improvements across languages demonstrate the generalizability of our method. As a secondary benefit, joint generation of confidence scores improves answer confidence calibration by up to 43%. This prompt-based approach requires no model retraining and is immediately applicable to existing LLMs.
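A hypothetical prompt skeleton illustrating the kind of joint output the abstract describes, with the answer, evidence triplets, and per-triplet confidences produced in a single generation; the authors' actual prompt wording is not given in the abstract and will differ, and the helper below is only an illustrative proxy for flagging spurious correctness.

```python
PROMPT_TEMPLATE = """Answer the question, list your evidence as (subject, relation, object)
triplets, and give a confidence in [0, 1] for each triplet.

Question: {question}

Respond in exactly this format:
Answer: <answer>
Evidence:
1. (<subject>, <relation>, <object>) | confidence: <0-1>
2. (<subject>, <relation>, <object>) | confidence: <0-1>
"""

def spurious_correctness_flag(answer_correct: bool, triplet_confidences: list,
                              threshold: float = 0.5) -> bool:
    # Flags answers that are correct while some evidence triplet has low
    # confidence (an illustrative proxy, not the paper's detection rule).
    return answer_correct and any(c < threshold for c in triplet_confidences)
```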

pdf bib
Observing Micromotives and Macrobehavior of Large Language Models
Yuyang Cheng | Xingwei Qu | Tomas Goldsack | Chenghua Lin | Chung-Chi Chen

Thomas C. Schelling, awarded the 2005 Nobel Memorial Prize in Economic Sciences, pointed out that “individuals’ decisions (micromotives), while often personal and localized, can lead to societal outcomes (macrobehavior) that are far more complex and different from what the individuals intended.” The current research related to large language models’ (LLMs’) micromotives, such as preferences or biases, assumes that users will make more appropriate decisions once LLMs are devoid of preferences or biases. Consequently, a series of studies has focused on removing bias from LLMs. In the NLP community, while there are many discussions on LLMs’ micromotives, previous studies have seldom conducted a systematic examination of how LLMs may influence society’s macrobehavior. In this paper, we follow the design of Schelling’s model of segregation to observe the relationship between the micromotives and macrobehavior of LLMs. Our results indicate that, regardless of the level of bias in LLMs, a highly segregated society will emerge as more people follow LLMs’ suggestions. We hope our discussion will spark further consideration of the fundamental assumption regarding the mitigation of LLMs’ micromotives and encourage a reevaluation of how LLMs may influence users and society.

pdf bib
SciHallu: A Multi-Granularity Hallucination Detection Dataset for Scientific Writing
Adiba Ibnat Hossain | Sagnik Ray Choudhury | Hamed Alhoori

Large Language Models (LLMs) are increasingly used to support scientific writing, but their tendency to produce hallucinated content threatens academic reliability. Existing benchmarks have addressed hallucination detection in general-domain tasks, such as fact-checking or question answering, but they do not reflect the fine-grained, domain-specific needs of scientific communication. We introduce SciHallu, a dataset for identifying hallucinations in academic text at three levels of granularity: token, sentence, and paragraph. To establish a reliable ground truth, we select source passages from research papers published prior to the widespread adoption of LLMs. Our dataset includes both hallucinated and non-hallucinated paragraph instances, constructed through controlled perturbations at varying levels of noise and validated by human annotators. A rationale is paired with each instance, explaining the nature of the modification. SciHallu covers multiple academic fields, such as Computer Science, Health Sciences, and Humanities and Social Sciences. It is built using a model-guided annotation pipeline, followed by expert human validation. We evaluate state-of-the-art LLMs on both binary and fine-grained classification tasks, revealing challenges in detecting subtle hallucinations. SciHallu supports the development of context-aware systems for more trustworthy scientific content generation.

pdf bib
Enhancing Investment Opinion Ranking through Argument-Based Sentiment Analysis
Chung-Chi Chen | Hen-Hsen Huang | Hsin-Hsi Chen | Hiroya Takamura | Ichiro Kobayashi | Yusuke Miyao

In the era of rapid Internet and social media development, individuals readily share their investment opinions online. The overwhelming volume of such opinions makes comprehensive evaluation impractical, highlighting the need for an effective recommendation system that can identify valuable insights. To address this challenge, we propose an argument-based sentiment analysis framework that incorporates a new perspective on opinion strength. Our approach introduces the concept of a Fuzzy Strength Degree (FSD), derived from the difference between analysts’ target and closing prices, to quantify the intensity of opinions. By integrating argument mining techniques, we further decompose each opinion into claims and premises, examine their relationships, and use these structures to evaluate the persuasive strength of the arguments. This dual strategy allows us to rank both professional and amateur investor opinions without relying on user history or social signals. Experiments show that our method works best for analyst reports, while on social media, simpler approaches based on wording and professionalism features perform better. Moreover, our analysis of professional analysts’ and traders’ behaviors reveals that top-ranked opinions are more likely to influence subsequent market actions. These findings demonstrate that argument structure and quantified opinion strength provide a novel and reliable foundation for investment opinion recommendation.
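As a toy illustration of the idea behind the Fuzzy Strength Degree (FSD), the sketch below derives an opinion-strength value from the relative gap between an analyst's target price and the closing price. The exact formulation in the paper may differ; this only illustrates deriving strength from the price difference.

```python
def fuzzy_strength_degree(target_price: float, closing_price: float) -> float:
    # Relative gap between the analyst's target price and the closing price,
    # used here as an illustrative proxy for opinion strength (not the
    # paper's exact FSD definition).
    return abs(target_price - closing_price) / closing_price

print(fuzzy_strength_degree(120.0, 100.0))  # 0.2 -> a fairly strong opinion
```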

pdf bib
A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP
Shinnosuke Ono | Issey Sukeda | Takuro Fujii | Kosei Buma | Shunsuke Sasaki

We present **JPharmatron**, a Japanese domain-specific large language model (LLM) for the pharmaceutical field, developed through continual pre-training on two billion Japanese pharmaceutical tokens and eight billion English biomedical tokens. For rigorous evaluation, we introduce **JPharmaBench**, a benchmark suite consisting of three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task involving cross-document consistency checking. We evaluate our model against open-source medical LLMs and commercial models, including GPT-4o. Experimental results show that **JPharmatron** outperforms existing open models and achieves competitive performance with commercial ones. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. **JPharmatron** enables secure and local model deployment for pharmaceutical tasks, where privacy and legal constraints limit the use of closed models. In addition, **JPharmaBench** offers a reproducible framework for evaluating Japanese pharmaceutical natural language processing. Together, they demonstrate the feasibility of practical and cost-efficient language models for the Japanese healthcare and pharmaceutical sectors. Our model, code, and datasets are available on HuggingFace: https://huggingface.co/collections/EQUES/jpharmatron and https://huggingface.co/collections/EQUES/jpharmabench.

pdf bib
Investigating Feasibility of Large Language Model Agent Collaboration in Minecraft and Comparison with Human-Human Collaboration
Yuki Hirota | Ryuichiro Higashinaka

In recent years, there has been growing interest in agents that collaborate with humans on creative tasks, and research has begun to explore such collaboration within Minecraft. However, most existing studies on agents in Minecraft focus on scenarios where an agent constructs objects independently on the basis of given instructions, making it difficult to achieve joint construction through dialogue-based cooperation with humans. Prior work, such as the Action-Utterance Model, used small-scale large language models (LLMs), which resulted in limited accuracy. In this study, we attempt to build an agent capable of collaborative construction using LLMs by integrating the framework of the Action-Utterance Model with that of Creative Agents, which leverages more recent and powerful LLMs for more accurate and flexible building. We had two agents conduct the Collaborative Garden Task in simulation and evaluated both the generated gardens and the dialogue content. Through this evaluation, we confirm that the agents are capable of producing gardens with a certain level of quality and can actively offer suggestions and assert their opinions. Furthermore, we conduct a comparative analysis with human-human collaboration to identify current challenges faced by agents and to discuss future directions for improvement toward achieving more human-like cooperative behavior.

pdf bib
Is Fine-Tuning an Effective Solution? Reassessing Knowledge Editing for Unstructured Data
Hao Xiong | Chuanyuan Tan | Wenliang Chen

Unstructured Knowledge Editing (UKE) is crucial for updating the relevant knowledge of large language models (LLMs). It focuses on unstructured inputs, such as long or free-form texts, which are common forms of real-world knowledge. Although previous studies have proposed effective methods and tested them, some issues exist: (1) lack of Locality evaluation for UKE, and (2) abnormal failure of fine-tuning (FT) based methods for UKE. To address these issues, we first construct two datasets, UnKEBench-Loc and AKEW-Loc (CF), by extending two existing UKE datasets with locality test data from the unstructured and structured views. This enables a systematic evaluation of the Locality of post-edited models. Furthermore, we identify four factors that may affect the performance of FT-based methods. Based on these factors, we conduct experiments to determine how the well-performing FT-based methods should be trained for the UKE task, providing a training recipe for future research. Our experimental results indicate that the FT-based method with the optimal setting (FT-UKE) is surprisingly strong, outperforming the existing state-of-the-art (SOTA). In batch editing scenarios, FT-UKE shows strong performance as well, with its advantage over SOTA methods increasing as the batch size grows, expanding the average metric lead from +6.78% to +10.80%. Our code and data will be released on GitHub.

pdf bib
Optimizing the Arrangement of Citations in Related Work Section
Masashi Oshika | Ryohei Sasano

In the related work section of a scientific paper, authors collect relevant citations and structure them into coherent paragraphs that follow a logical order. Previous studies have addressed citation recommendation and related work section generation in settings where both the citations and their order are provided in advance. However, they have not adequately addressed the optimal ordering of these citations, which is a critical step for achieving fully automated related work section generation. In this study, we propose a new task, citation arrangement, which focuses on determining the optimal order of cited papers to enable fully automated related work section generation. Our approach decomposes citation arrangement into three tasks: citation clustering, paragraph ordering, and citation ordering within a paragraph. For each task, we propose a method that uses a large language model (LLM) in combination with a graph-based technique to comprehensively consider the context of each paper and the relationships among all cited papers. The experimental results show that our method is more effective than methods that generate outputs for each task using only an LLM.

pdf bib
Ability Transfer Through Language Mixing
Petr Hyner | Jan Mrógala | Jan Hula

We systematically investigate cross-lingual ability transfer in language models through controlled experiments across three problem sets: algorithmic addition, graph navigation, and natural language modeling. Our experimental design creates high-resource and low-resource “language” pairs differing in vocabulary, grammar, and computational requirements. We show that training on mixed datasets consistently enables strong positive transfer, significantly improving low-resource language performance compared to training on a small amount of data in isolation. We observe improvements from 0% to 100% accuracy in arithmetic tasks, from 24% to 98% accuracy in graph navigation tasks, and a 69.6% perplexity reduction in natural language modeling. We demonstrate that transfer effectiveness depends on computational complexity and linguistic differences, where grammar modifications support stronger transfer than vocabulary modifications. These findings provide compelling evidence that cross-lingual ability transfer is a robust mechanism that contributes to the quality of large language models in low-resource languages.

pdf bib
Enhancing Training Data Quality through Influence Scores for Generalizable Classification: A Case Study on Sexism Detection
Rabiraj Bandyopadhyay | Dennis Assenmacher | Jose Maria Alonso-Moral | Claudia Wagner

The quality of training data is crucial for the performance of supervised machine learning models. In particular, poor annotation quality and spurious correlations between labels and features in text datasets can significantly degrade model generalization. This problem is especially pronounced in harmful language detection, where prior studies have revealed major deficiencies in existing datasets. In this work, we design and test data selection methods based on learnability measures to improve dataset quality. Using a sexism dataset with counterfactuals designed to avoid spurious correlations, we show that pruning with EL2N and PVI scores can lead to significant performance increases and outperform submodular and random selection. Our analysis reveals that in the presence of label imbalance, models rely on dataset shortcuts; in particular, easy-to-classify sexist instances and hard-to-classify non-sexist instances contain shortcuts. Pruning these instances leads to performance increases. Pruning hard-to-classify instances is, in general, a promising strategy as well when shortcuts are not present.

pdf bib
Mitigating Label Length Bias in Large Language Models
Mario Sanz-Guerrero | Katharina von der Wense

Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call *label length bias*, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose *normalized contextual calibration* (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
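For intuition, the sketch below scores each candidate label as a whole, normalizes by label length, and subtracts a content-free baseline in the spirit of contextual calibration. The `logprob_fn` interface is hypothetical and the details of the authors' NCC may differ; this is a sketch of the general idea, not their implementation.

```python
def label_score(logprob_fn, prompt, label_tokens, content_free="N/A"):
    # logprob_fn(context, tokens) is a hypothetical callable returning the
    # summed log-probability of `tokens` continuing `context`.
    n = len(label_tokens)
    normalized = logprob_fn(prompt, label_tokens) / n        # length-normalize over the full label
    baseline = logprob_fn(content_free, label_tokens) / n    # content-free calibration term
    return normalized - baseline                             # calibrated, length-normalized score

def predict(logprob_fn, prompt, candidate_labels):
    # Pick the candidate label (a list of tokens) with the highest calibrated score.
    return max(candidate_labels, key=lambda toks: label_score(logprob_fn, prompt, toks))
```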

pdf bib
Social Bias in Popular Question-Answering Benchmarks
Angelie Kraft | Judith Simon | Sonja Schimmler

Question-answering (QA) and reading comprehension (RC) benchmarks are commonly used for assessing the capabilities of large language models (LLMs) to retrieve and reproduce knowledge. However, we demonstrate that popular QA and RC benchmarks do not cover questions about different demographics or regions in a representative way. We perform a content analysis of 30 benchmark papers and a quantitative analysis of 20 respective benchmark datasets to learn (1) who is involved in the benchmark creation, (2) whether the benchmarks exhibit social bias, or whether this is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most benchmark papers analyzed provide insufficient information about those involved in benchmark creation, particularly the annotators. Notably, just one (WinoGrande) explicitly reports measures taken to address social representation issues. Moreover, the data analysis revealed gender, religion, and geographic biases across a wide range of encyclopedic, commonsense, and scholarly benchmarks. Our work adds to the mounting criticism of AI evaluation practices and shines a light on biased benchmarks being a potential source of LLM bias by incentivizing biased inference heuristics.

pdf bib
ProofTeller: Exposing recency bias in LLM reasoning and its side effects on communication
Mayank Jobanputra | Alisa Kovtunova | Brisca Balthes | Fedor Grigoryevich Pogulskiy | Yifan Wang | Stefan Borgwardt | Vera Demberg

Large language models (LLMs) are increasingly applied in domains that demand reliable and interpretable reasoning. While formal methods can generate provably correct proofs, these proofs are often inaccessible to non-expert users. This raises a natural question: can LLMs, when given a verified proof, faithfully interpret its reasoning and communicate it clearly? We introduce ProofTeller, a benchmark that evaluates this ability across three tasks: (1) identifying key proof steps, (2) summarizing the reasoning, and (3) explaining the result in concise natural language. The benchmark covers three domains: _Biology_, _Drones_, and _Recipes_, representing scientific, safety-critical, and everyday reasoning scenarios. We find a consistent near-conclusion bias: LLMs tend to focus on steps closest to the final proof conclusion rather than on the most informative ones. A targeted human study confirms that explanations based on such steps are rated less appropriate for end users. These findings indicate that even when reasoning is provided, current LLMs face challenges in communicating key information in a useful manner, highlighting the need for LLMs that can communicate important details reliably.

pdf bib
Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE
Varvara Arzt | Allan Hanbury | Michael Wiegand | Gabor Recski | Terra Blevins

Analysing the generalisation capabilities of relation extraction (RE) models is crucial for assessing whether they learn robust relational patterns or rely on spurious correlations. Our cross-dataset experiments find that RE models struggle with unseen data, even within similar domains. Notably, higher intra-dataset performance does not indicate better transferability, instead often signaling overfitting to dataset-specific artefacts. Our results also show that data quality, rather than lexical similarity, is key to robust transfer, and the choice of optimal adaptation strategy depends on the quality of data available: while fine-tuning yields the best cross-dataset performance with high-quality data, few-shot in-context learning (ICL) is more effective with noisier data. However, even in these cases, zero-shot baselines occasionally outperform all cross-dataset results. Structural issues in RE benchmarks, such as single-relation per sample constraints and non-standardised negative class definitions, further hinder model transferability. We release our dataset splits with sample IDs and code for reproducibility.

pdf bib
Whose story is it? Personalizing story generation by inferring author styles
Nischal Ashok Kumar | Chau Minh Pham | Mohit Iyyer | Andrew Lan

Personalization is critical for improving user experience in interactive writing and educational applications, yet remains understudied in story generation. We study the task of personalizing story generation, where our goal is to mimic an author’s writing style, given other stories written by them. We collect Mythos, a dataset of 3.6k stories from 112 authors, with an average of 16 stories per author, across five distinct sources reflecting diverse story-writing settings. We propose a two-stage pipeline for personalized story generation: first, we infer authors’ implicit writing characteristics and organize them into an Author Writing Sheet, which is validated by humans to be of high quality; second, we simulate the author’s persona using tailored persona descriptions and personalized story rules. We find that stories personalized using the Author Writing Sheet outperform a non-personalized baseline, achieving a 78% win-rate in capturing authors’ past style and 59% in similarity to ground-truth author stories. Human evaluation supports these findings and further highlights trends, such as Reddit stories being easier to personalize, and the Creativity and Language Use aspects of stories being easier to personalize than the Plot.

pdf bib
The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure
Niyati Bafna | Tianjian Li | Kenton Murray | David R. Mortensen | David Yarowsky | Hale Sirin | Daniel Khashabi

Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages, but the causes for this are not well-understood. We first demonstrate the existence of an implicit task-solving→translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner, and subsequently translates answer concepts into the intended target language. We hypothesize that the failure of the translation stage, despite task-solving success, is an important culprit for the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We quantify the extent to which either stage in the pipeline is responsible for final failure for a word translation task across 108 language pairs, and find that the translation barrier explains a dominant portion of error for a majority of language pairs, and is especially severe for low-resource target languages. Our results highlight an important bottleneck for end-to-end multilingual generation, relevant for future work seeking to improve multilinguality in LLMs.

pdf bib
TaCL-CoMoE: Task-adaptive Contrastive Learning with Cooperative Mixture of Experts for Multi-task Social Media Analysis
Wang Xingren | Hongde Liu | Liushanhong | Feiyang Meng | Chenyuan He | Senbin Zhu | Li Zechen | Yuxiang Jia

Social media has become a crucial platform for information dissemination and opinion expression. The massive and continuous generation of user content has given rise to various natural language processing tasks, such as sentiment analysis and topic classification. However, existing mainstream approaches typically focus on modeling individual tasks in isolation, lacking systematic exploration of collaborative modeling across multiple tasks. This neglects the inherent correlations among social media tasks, thereby limiting the model’s ability to fully comprehend and exploit the rich, multi-dimensional semantic information embedded in text. To address this challenge, we propose Task-adaptive Contrastive Learning with Cooperative Mixture of Experts (TaCL-CoMoE), a unified framework for social media multi-task learning. Specifically, we improve the gating mechanism by replacing the traditional softmax routing with sigmoid activation, enabling cooperative selection among multiple experts and mitigating the “expert monopoly” phenomenon. In addition, we introduce a task-adaptive contrastive learning strategy to further enhance the model’s ability to capture and distinguish semantic structures across different tasks. Experimental results on multiple public social media datasets demonstrate that TaCL-CoMoE consistently achieves state-of-the-art (SOTA) performance. The code is available at https://github.com/wxr2847/TaCL-CoMoE.
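To make the gating change concrete, here is a minimal sketch of a cooperative gate in which softmax routing is replaced by sigmoid activation so that several experts can be selected at once; the module sizes, selection threshold, and renormalization step are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CooperativeGate(nn.Module):
    """Toy MoE gate: sigmoid routing lets several experts fire at once,
    whereas softmax forces experts to compete for the same probability mass."""

    def __init__(self, hidden_dim: int, num_experts: int, threshold: float = 0.5):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts)
        self.threshold = threshold  # illustrative cutoff for selecting an expert

    def forward(self, x: torch.Tensor):
        logits = self.router(x)                   # (batch, num_experts)
        competitive = logits.softmax(dim=-1)      # standard routing: weights sum to 1
        cooperative = torch.sigmoid(logits)       # each expert scored independently
        active = (cooperative > self.threshold).float()
        weights = cooperative * active
        weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        return weights, competitive

gate = CooperativeGate(hidden_dim=16, num_experts=4)
coop_weights, softmax_weights = gate(torch.randn(2, 16))
```

With sigmoid routing, more than one expert can receive a high weight for the same input, which is the behavior the abstract credits with mitigating the "expert monopoly" phenomenon.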

pdf bib
Towards Generalizable Generic Harmful Speech Datasets for Implicit Hate Speech Detection
Saad Almohaimeed | Saleh Almohaimeed | Damla Turgut | Ladislau Bölöni

Implicit hate speech has increasingly been recognized as a significant issue for social media platforms. While much of the research has traditionally focused on harmful speech in general, the need for generalizable techniques to detect veiled and subtle forms of hate has become increasingly pressing. Based on lexicon analysis, we hypothesize that implicit hate speech is already present in publicly available harmful speech datasets but may not have been explicitly recognized or labeled by annotators. Additionally, crowdsourced datasets are prone to mislabeling due to the complexity of the task and the influence of annotators’ subjective interpretations. In this paper, we propose an approach to address the detection of implicit hate speech and enhance generalizability across diverse datasets by leveraging existing harmful speech datasets. Our method comprises three key components: influential sample identification, reannotation, and augmentation using Llama-3 70B and GPT-4o. Experimental results demonstrate the effectiveness of our approach in improving implicit hate detection, achieving a +12.9-point F1 score improvement compared to the baseline.

pdf bib
SEAGraph: Unveiling the Whole Story of Paper Review Comments
Jianxiang Yu | Jiaqi Tan | Zichen Ding | Jiapeng Zhu | Jiahao Li | Yao Cheng | Qier Cui | Yunshi Lan | Yao Liu | Xiang Li

Peer review, as a cornerstone of scientific research, ensures the integrity and quality of scholarly work by providing authors with objective feedback for refinement. However, in the traditional peer review process, authors often receive vague or insufficiently detailed feedback, which provides limited assistance and leads to a more time-consuming review cycle. If authors can identify some specific weaknesses in their paper, they can not only address the reviewer’s concerns but also improve their work. This raises the critical question of how to enhance authors’ comprehension of review comments. In this paper, we present SEAGraph, a novel framework developed to clarify review comments by uncovering the underlying intentions behind them. We construct two types of graphs for each paper: the semantic mind graph, which captures the author’s thought process, and the hierarchical background graph, which delineates the research domains related to the paper. A retrieval method is then designed to extract relevant content from both graphs, facilitating coherent explanations for the review comments. Extensive experiments show that SEAGraph excels in review comment understanding tasks, offering significant benefits to authors. By bridging the gap between reviewers’ critiques and authors’ comprehension, SEAGraph contributes to a more efficient, transparent, and collaborative scientific publishing ecosystem. Our code is available at https://anonymous.4open.science/r/seagraph/.

pdf bib
A Comparative Study of Human-operated and AI-driven Guidance with a Teleoperated Mobile Robot
Ao Guo | Shota Mochizuki | Sanae Yamashita | Hoshimure Kenya | Jun Baba | Ryuichiro Higashinaka

Recent advances in large language models (LLMs) such as GPT-4o offer the potential for enhancing AI-driven robotic interactions, but their effectiveness in mobile tour guidance remains unexplored. This study investigates the differences between human-operated and AI-driven guidance at an aquarium using Teleco, a teleoperated mobile robot, in a real-world field experiment. A total of 277 guidance sessions were collected under two modes: human-operated, where the operator controlled all dialogue, actions, and movement, and AI-driven, where GPT-4o generated responses while the operator only controlled the robot’s actions and movement. Our results indicate that human-operated guidance places greater emphasis on visitor movement, spatial positioning during observation guidance, and empathetic expressions, whereas AI-driven guidance promotes conversational engagement by frequently prompting visitors to ask questions. In addition, we found that user behaviors, including users’ gaze patterns and vocabulary richness, also serve as valuable indicators reflecting their overall experience during guidance interactions. Furthermore, empathetic expression is recognized as the key differentiating factor between the two guidance modes, significantly influencing users’ overall experience.

pdf bib
R²-CoD: Understanding Text-Graph Complementarity in Relational Reasoning via Knowledge Co-Distillation
Zhen Wu | Ritam Dutt | Luke M. Breitfeller | Armineh Nourbakhsh | Siddharth Parekh | Carolyn Rose

Relational reasoning lies at the core of many NLP tasks, drawing on complementary signals from text and graphs. While prior research has investigated how to leverage this dual complementarity, a detailed and systematic understanding of text-graph interplay and its effect on hybrid models remains underexplored. We take an analysis-driven approach to investigate text–graph representation complementarity via a unified architecture that supports knowledge co-distillation (CoD). We explore five tasks involving relational reasoning that differ in how text and graph structures encode the information needed to solve that task. By tracking how these dual representations evolve during training, we uncover interpretable patterns of alignment and divergence, and provide insights into when and why their integration is beneficial.

pdf bib
ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Surapon Nonesung | Teetouch Jaknamon | Sirinya Chaiophat | Natapong Nitarach | Chanakan Wittayasakpan | Warit Sirichotedumrong | Adisai Na-Thalang | Kunat Pipatanakul

We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.

pdf bib
An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring
Sana Ebrahimi | Mohsen Dehghankar | Abolfazl Asudeh

While multi-agent LLM systems show strong capabilities in various domains, they are highly vulnerable to adversarial and low-performing agents. To resolve this issue, in this paper, we introduce a general and adversary-resistant multi-agent LLM framework based on credibility scoring. We model the collaborative query-answering process as an iterative game, where the agents communicate and contribute to a final system output. Our system associates a credibility score with each agent, which is used when aggregating the team outputs. The credibility scores are learned gradually based on the past contributions of each agent in query answering. Our experiments across multiple tasks and settings demonstrate our system’s effectiveness in mitigating adversarial influence and enhancing the resilience of multi-agent cooperation, even in adversary-majority settings.
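As a rough illustration of credibility-weighted aggregation, the following sketch scores each agent, picks the answer with the most credibility mass, and nudges scores toward agents that agreed with the aggregated output; the update rule, learning rate, and initial scores are hypothetical and not the paper's exact procedure.

```python
from collections import defaultdict

def aggregate(answers: dict, credibility: dict) -> str:
    """Pick the answer with the highest total credibility mass."""
    votes = defaultdict(float)
    for agent, answer in answers.items():
        votes[answer] += credibility[agent]
    return max(votes, key=votes.get)

def update_credibility(credibility: dict, answers: dict, final_answer: str, lr: float = 0.1) -> dict:
    """Reward agents that agreed with the aggregated output, penalize the rest."""
    for agent, answer in answers.items():
        reward = 1.0 if answer == final_answer else -1.0
        credibility[agent] = max(1e-3, credibility[agent] + lr * reward)
    return credibility

credibility = {"a1": 1.0, "a2": 1.0, "a3": 1.0}          # uniform prior over agents
answers = {"a1": "Paris", "a2": "Paris", "a3": "Lyon"}   # a3 behaves adversarially
final = aggregate(answers, credibility)
credibility = update_credibility(credibility, answers, final)
print(final, credibility)
```

Iterating this game over many queries gradually down-weights adversarial or low-performing agents, so their answers carry less influence in later rounds.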

pdf bib
Are LLMs Rigorous Logical Reasoners? Empowering Natural Language Proof Generation by Stepwise Decoding with Contrastive Learning
Ying Su | Mingwen Liu | Zhijiang Guo

Logical reasoning is a pivotal component in the field of artificial intelligence. Proof planning, particularly in contexts requiring the validation of explanation accuracy, continues to present challenges. The recent advancement of large language models (LLMs) has led to significant progress in natural language proof planning, evolving from one-stage generators to more complex three-stage systems that include additional searchers or verifiers. While these assisted methods improve the quality of generated results, they also introduce increased search efforts and computational costs. Furthermore, the generative process itself remains underexplored. In this study, we propose a stepwise decoding approach augmented by contrastive learning to address two common errors encountered during the LLM generator’s decoding process. We fine-tune the language model using both vanilla and enhanced hard negatives to mitigate these decoding errors. Empirical results demonstrate the effectiveness of our strategy. Additionally, our further analysis reveals that even larger LLMs still struggle to generate rigorous logical chains.

pdf bib
NyayaRAG: Realistic Legal Judgment Prediction with RAG under the Indian Common Law System
Shubham Kumar Nigam | Balaramamahanthi Deepak Patnaik | Shivam Mishra | Ajay Varghese Thomas | Noel Shallum | Kripabandhu Ghosh | Arnab Bhattacharya

Legal Judgment Prediction (LJP) has emerged as a key area in AI for law, aiming to automate judicial outcome forecasting and enhance interpretability in legal reasoning. While previous approaches in the Indian context have relied on internal case content such as facts, issues, and reasoning, they often overlook a core element of common law systems, which is reliance on statutory provisions and judicial precedents. In this work, we propose NyayaRAG, a Retrieval-Augmented Generation (RAG) framework that simulates realistic courtroom scenarios by providing models with factual case descriptions, relevant legal statutes, and semantically retrieved prior cases. NyayaRAG evaluates the effectiveness of these combined inputs in predicting court decisions and generating legal explanations using a domain-specific pipeline tailored to the Indian legal system. We assess performance across various input configurations using both standard lexical and semantic metrics as well as LLM-based evaluators such as G-Eval. Our results show that augmenting factual inputs with structured legal knowledge significantly improves both predictive accuracy and explanation quality.

pdf bib
Exploring Working Memory Capacity in LLMs: From Stressors to Human-Inspired Strategies
Eunjin Hong | Sumin Cho | Juae Kim

Large language models (LLMs) exhibit inherent limitations in working memory, which often affect their overall capabilities. However, prior studies have largely focused on describing such constraints without identifying their causes or providing practical strategies to cope with them. In this paper, we investigate the limited working memory capacity of LLMs through a series of empirical studies. Specifically, we examine the factors involved in the limited capacity and explore strategies to make more effective use of it. Our analysis shows that the number and difficulty of tasks in a single input largely strain the working memory of LLMs. In response, we design a cognitive marker consisting of simple token sequences theoretically grounded in cognitive science. Further analyses show that the cognitive marker reduces the overall prediction difficulty and uncertainty for the models to process the input, and its effectiveness is confirmed across various evaluation settings. Overall, our study incorporates cognitively motivated perspectives into the analysis of model behavior and highlights the need for deeper exploration of working memory in LLMs.

pdf bib
CLASSER: Cross-lingual Annotation Projection enhancement through Script Similarity for Fine-grained Named Entity Recognition
Prachuryya Kaushik | Ashish Anand

We introduce CLASSER, a cross-lingual annotation projection framework enhanced through script similarity, to create fine-grained named entity recognition (FgNER) datasets for low-resource languages. Manual annotation for named entity recognition (NER) is expensive, and distant supervision often produces noisy data and is largely limited to high-resource languages. CLASSER employs a two-stage process: first projecting annotations from high-resource NER datasets to the target language using source-to-target parallel corpora and a projection tool built on a multilingual encoder, then refining them by leveraging datasets in script-similar languages. We apply this to five low-resource Indian languages: *Assamese*, *Marathi*, *Nepali*, *Sanskrit*, and *Bodo*, a vulnerable language. The resulting dataset comprises 1.8M sentences, 2.6M entity mentions and 24.7M tokens. Through rigorous analyses, we ascertain the effectiveness of our method and the high quality of the resulting dataset, with F1 score improvements of 26% in Marathi and 46% in Sanskrit over the current state-of-the-art. We further extend our analyses to zero-shot and cross-lingual settings, systematically investigating the impact of script similarity and multilingualism on cross-lingual FgNER performance. The dataset is publicly available at [huggingface.co/datasets/prachuryyaIITG/CLASSER](https://huggingface.co/datasets/prachuryyaIITG/CLASSER).

pdf bib
On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility
Kushal Tatariya | Wessel Poelman | Miryam de Lhoneux

Language model architectures are predominantly created first for English and afterwards applied to other languages. This can lead to problems for languages that are structurally different from English. We study one specific architectural choice: positional encodings. We do this through the lens of the trade-off hypothesis: the supposed interplay between morphological complexity and word order flexibility. This hypothesis states that there exists a trade-off between the two: a more morphologically complex language can have a more flexible word order, and vice versa. Positional encodings are a direct target for investigating the implications of this hypothesis in relation to language modelling. We pre-train and evaluate three monolingual model variants with absolute, relative and no position encodings for seven typologically diverse languages and evaluate on four downstream tasks. We fail to find a consistent trend with various proxies for morphological complexity and word order flexibility. Our work shows that the choice of tasks, languages, and metrics is essential for drawing stable conclusions.

pdf bib
Doppelganger-JC: Benchmarking the LLMs’ Understanding of Cross-Lingual Homographs between Japanese and Chinese
Yuka Kitamura | Jiahao Huang | Akiko Aizawa

The recent development of LLMs is remarkable, but they still struggle to handle cross-lingual homographs effectively. This research focuses on cross-lingual homographs between Japanese and Chinese—words whose spellings are the same but whose meanings differ entirely between the two languages. We introduce a new benchmark dataset named Doppelganger-JC to evaluate the ability of LLMs to handle them correctly. We provide three kinds of tasks for evaluation: word meaning tasks, word meaning in context tasks, and translation tasks. Through the evaluation, we find that LLMs’ performance in understanding and using homographs is significantly inferior to that of humans. We point out the significant issue of the homograph shortcut, whereby the model tends to preferentially interpret a cross-lingual homograph in its easier-to-understand language. We investigate the potential cause of this homograph shortcut from a linguistic perspective and posit that it is difficult for LLMs to recognize a word as a cross-lingual homograph, especially when it shares the same part of speech (POS) in both languages. The data and code are publicly available here: https://github.com/0017-alt/Doppelganger-JC.git.

pdf bib
Don’t Take it Literally! Idiom-aware Vietnamese Translation via In-context Learning
Luan Thanh Nguyen | Parisa Kordjamshidi

The translation of idiomatic expressions often results in misunderstandings and inaccuracies, affecting everyday communication as well as machine translation systems. This paper introduces Idiom-aware Vietnamese Translation (IDiAT), a new framework for the evaluation of idiomatic translation for Vietnamese, along with state-of-the-art results for this task. We collect and curate a high-quality Vietnamese-English idiom set that serves as a resource for in-context learning (ICL). IDiAT’s evaluation benchmark includes both idiomatic and non-idiomatic text pairs to assess general translation quality and idiomatic translation performance. We leverage ICL in large language models to augment few-shot demonstrations with idiom and topic descriptions and consequently improve the translation accuracy. Empirical results demonstrate that our IDiAT-based ICL outperforms traditional supervised methods using only a few data samples. Multiple evaluations confirm the effectiveness of our proposed approach. Though focusing on the Vietnamese language, our approach advances idiomatic translation and contributes to the development of culturally aware translation systems, paving the way for future research in low-resource languages. The experimental materials are publicly available.

pdf bib
LLMs Do Not See Age: Assessing Demographic Bias in Automated Systematic Review Synthesis
Favour Y. Aghaebe | Elizabeth A Williams | Tanefa Apekey | Nafise Sadat Moosavi

Clinical interventions often hinge on age: medications and procedures safe for adults may be harmful to children or ineffective for older adults. However, as language models are increasingly integrated into biomedical evidence synthesis workflows, it remains uncertain whether these systems preserve such crucial demographic distinctions. To address this gap, we evaluate how well state-of-the-art language models retain age-related information when generating abstractive summaries of biomedical studies. We construct DemogSummary, a novel age-stratified dataset of systematic review primary studies, covering child, adult, and older adult populations. We evaluate three prominent summarisation-capable LLMs, Qwen (open-source), Longformer (open-source) and GPT-4.1 Nano (proprietary), using both standard metrics and a newly proposed Demographic Salience Score (DSS), which quantifies age-related entity retention and hallucination. Our results reveal systematic disparities across models and age groups: demographic fidelity is lowest for adult-focused summaries, and underrepresented populations are more prone to hallucinations. These findings highlight the limitations of current LLMs in faithful and bias-free summarisation and point to the need for fairness-aware evaluation frameworks and summarisation pipelines in biomedical NLP.

pdf bib
KERLQA: Knowledge-Enhanced Reinforcement Learning for Question Answering in Low-resource Languages
Sello Ralethe | Jan Buys

Question answering in low-resource languages faces critical challenges when models encounter questions beyond their knowledge boundaries, often producing confident but incorrect answers. We propose Knowledge-Enhanced Reinforcement Learning for Question Answering (KERLQA), a novel approach that combines knowledge graph integration with reinforcement learning to enable principled abstention decisions. Unlike existing refusal-tuned methods that make binary decisions based solely on internal confidence, KERLQA implements a three-way decision process: answer with internal knowledge, answer with external knowledge assistance, or abstain. Using a composite reward function that jointly optimizes for correctness, appropriate abstention, and efficient knowledge utilization, we train policies via PPO and DPO with dynamic calibration for low-resource settings. Experiments on CommonsenseQA and OpenBookQA across English and four South African languages show KERLQA achieves improved F1 scores, with up to 6.2 point improvements in low-resource languages. Our analysis reveals that KERLQA reduces false positive abstention rates by 30% while expanding the boundary of answerable questions through external knowledge integration.
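A minimal sketch of a composite reward consistent with the three-way decision described above (answer with internal knowledge, answer with external knowledge, or abstain); the specific weights and the retrieval cost are illustrative assumptions, not the paper's values.

```python
def composite_reward(action: str, correct=None,
                     w_correct: float = 1.0,
                     w_abstain: float = 0.3,
                     w_retrieval_cost: float = 0.1) -> float:
    """
    action:  "internal", "external", or "abstain".
    correct: whether the produced answer was right (ignored when abstaining).
    """
    if action == "abstain":
        return w_abstain                 # small positive reward for honest abstention
    reward = w_correct if correct else -w_correct
    if action == "external":
        reward -= w_retrieval_cost       # charge for consulting external knowledge
    return reward

print(composite_reward("internal", True))    # 1.0
print(composite_reward("external", True))    # 0.9  (correct, minus retrieval cost)
print(composite_reward("abstain"))           # 0.3
print(composite_reward("internal", False))   # -1.0
```

A reward shaped this way makes abstention preferable to a likely-wrong answer while still charging a small cost for unnecessary knowledge retrieval.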

pdf bib
The Visual Counter Turing Test (VCT²): A Benchmark for Evaluating AI-Generated Image Detection and the Visual AI Index (V_AI)
Nasrin Imanpour | Abhilekh Borah | Shashwat Bajpai | Subhankar Ghosh | Sainath Reddy Sankepally | Hasnat Md Abdullah | Nishoak Kosaraju | Shreyas Dixit | Ashhar Aziz | Shwetangshu Biswas | Vinija Jain | Aman Chadha | Song Wang | Amit Sheth | Amitava Das

The rapid progress and widespread availability of text-to-image (T2I) generation models have heightened concerns about the misuse of AI-generated visuals, particularly in the context of misinformation campaigns. Existing AI-generated image detection (AGID) methods often overfit to known generators and falter on outputs from newer or unseen models. To systematically address this generalization gap, we introduce the Visual Counter Turing Test (VCT^2), a comprehensive benchmark of 166,000 images, comprising both real and synthetic prompt-image pairs produced by six state-of-the-art (SoTA) T2I systems: Stable Diffusion 2.1, SDXL, SD3 Medium, SD3.5 Large, DALL·E 3, and Midjourney 6. We curate two distinct subsets: COCO_AI, featuring structured captions from MS COCO, and Twitter_AI, containing narrative-style tweets from The New York Times. Under a unified zero-shot evaluation, we benchmark 17 leading AGID models and observe alarmingly low detection accuracy, 58% on COCO_AI and 58.34% on Twitter_AI. To transcend binary classification, we propose the Visual AI Index (V_AI), an interpretable, prompt-agnostic realism metric based on twelve low-level visual features, enabling us to quantify and rank the perceptual quality of generated outputs with greater nuance. Correlation analysis reveals a moderate inverse relationship between V_AI and detection accuracy: Pearson rho of -0.532 on COCO_AI and rho of -0.503 on Twitter_AI; suggesting that more visually realistic images tend to be harder to detect, a trend observed consistently across generators. We release COCO_AI and Twitter_AI to catalyze future advances in robust AGID and perceptual realism assessment.

pdf bib
Hildoc: Leveraging Hilbert Curve Representation for Accurate and Efficient Document Retrieval
Muhammad AL-Qurishi | Zhaozhi Qian | Faroq AL-Tam | Riad Souissi

Document retrieval is a critical challenge in information retrieval systems, where the goal is to efficiently retrieve relevant documents in response to a given query. Dense retrieval methods, which utilize vector embeddings to represent semantic information, require effective indexing to ensure fast and accurate retrieval. Existing methods, such as MEVI, have attempted to address this by using hierarchical K-Means for clustering, but they often face limitations in computational efficiency and retrieval accuracy. In this paper, we introduce the Hildoc Index, a novel document indexing approach that leverages the Hilbert Curve to map document embeddings onto a one-dimensional space. This innovative representation facilitates efficient clustering using a 1D quantile-based algorithm, ensuring uniform partition sizes and preserving the inherent structure of the data. As a result, Hildoc Index not only reduces training complexity but also enhances retrieval accuracy and speed during inference. Our method can be seamlessly integrated into both dense retrieval systems and hybrid ensemble systems. Through comprehensive experiments on standard benchmarks like MSMARCO Passage and Natural Questions, we demonstrate that the Hildoc Index significantly outperforms the current state-of-the-art MEVI in terms of both retrieval speed and recall. These results underscore the Hildoc Index as a solution for fast and accurate dense document retrieval.
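To illustrate the 1D quantile-based clustering step, here is a minimal sketch that partitions documents into roughly equal-sized buckets given scalar keys; the Hilbert-curve mapping from embeddings to keys is abstracted away, and the bucket count is an illustrative assumption.

```python
import numpy as np

def quantile_partition(keys: np.ndarray, num_buckets: int) -> np.ndarray:
    """Assign each 1D key to one of num_buckets clusters of (nearly) equal size."""
    edges = np.quantile(keys, np.linspace(0.0, 1.0, num_buckets + 1))
    # Interior edges give bucket ids in [0, num_buckets - 1].
    return np.digitize(keys, edges[1:-1])

rng = np.random.default_rng(0)
hilbert_keys = rng.random(10_000)      # stand-in for Hilbert indices of document embeddings
buckets = quantile_partition(hilbert_keys, num_buckets=8)
print(np.bincount(buckets))            # roughly uniform partition sizes
```

Because the partition is computed from quantiles of a single scalar key, clustering reduces to a sort rather than iterative high-dimensional K-Means, which is consistent with the reduced training complexity the abstract describes.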

pdf bib
Rethinking what matters: Effective and Robust Multilingual Realignment for Low-Resource Languages
Quang Phuoc Nguyen | David Anugraha | Félix Gaschi | Jun Bin Cheng | En-Shiun Annie Lee

Realignment is a promising strategy to improve cross-lingual transfer in multilingual language models. However, empirical results are mixed and often unreliable, particularly for typologically distant or low-resource languages (LRLs) compared to English. Moreover, word realignment tools often rely on high-quality parallel data, which can be scarce or noisy for many LRLs. In this work, we conduct an extensive empirical study to investigate whether realignment truly benefits from using all available languages, or if strategically selected subsets can offer comparable or even improved cross-lingual transfer, and study the impact on LRLs. Our controlled experiments show that realignment can be particularly effective for LRLs and that using carefully selected, linguistically diverse subsets can match full multilingual alignment, and even outperform it for unseen LRLs. This indicates that effective realignment does not require exhaustive language coverage and can reduce data collection overhead, while remaining both efficient and robust when guided by informed language selection.

pdf bib
Decode Like a Clinician: Enhancing LLM Fine-Tuning with Temporal Structured Data Representation
Daniel Fadlon | David Dov | Aviya Bennett | Daphna Heller-Miron | Gad Levy | Kfir Bar | Ahuva Weiss-Meilik

Predictive modeling of hospital patient data is challenging due to its structured format, irregular timing of measurements, and variation in data representation across institutions. While traditional models often struggle with such inconsistencies, Large Language Models (LLMs) offer a flexible alternative. In this work, we propose a method for verbalizing structured Electronic Health Records (EHRs) into a format suitable for LLMs and systematically examine how to include time-stamped clinical observations—such as lab tests and vital signs—from previous time points in the prompt. We study how different ways of structuring this temporal information affect predictive performance, and whether fine-tuning alone enables LLMs to effectively reason over such data. Evaluated on two real-world hospital datasets and MIMIC-IV, our approach achieves strong in-hospital and cross-hospital performance, laying the groundwork for more generalizable clinical modeling.

pdf bib
Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces
Farhan Sheth | Girish | Mohd Mujtaba Akhtar | Muskaan Singh

In this work, we address the challenge of generalizable audio deepfake detection (ADD) across diverse speech synthesis paradigms—including conventional text-to-speech (TTS) systems and modern diffusion or flow-matching (FM) based generators. Prior work has mostly targeted individual synthesis families and often fails to generalize across paradigms due to overfitting to generation-specific artifacts. We hypothesize that synthetic speech, irrespective of its generative origin, leaves behind shared structural distortions in the embedding space that can be aligned through geometry-aware modeling. To this end, we propose RHYME, a unified detection framework that fuses utterance-level embeddings from diverse pretrained speech encoders using non-Euclidean projections. RHYME maps representations into hyperbolic and spherical manifolds—where hyperbolic geometry excels at modeling hierarchical generator families, and spherical projections capture angular, energy-invariant cues such as periodic vocoder artifacts. The fused representation is obtained via Riemannian barycentric averaging, enabling synthesis invariant alignment. RHYME outperforms individual PTMs and homogeneous fusion baselines, achieving top performance and setting new state-of-the-art in cross-paradigm ADD.

pdf bib
MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language
Farhan Farsi | Farnaz Aghababaloo | Shahriar Shariati Motlagh | Parsa Ghofrani | MohammadAli SadraeiJavaheri | Shayan Bali | Amir Hossein Shabani | Farbod Bijary | Ghazal Zamaninejad | AmirMohammad Salehoof | Saeedeh Momtazi

As large language models (LLMs) become increasingly embedded in our daily lives, evaluating their quality and reliability across diverse contexts has become essential. While comprehensive benchmarks exist for assessing LLM performance in English, there remains a significant gap in evaluation resources for other languages. Moreover, because most LLMs are trained primarily on data rooted in European and American cultures, they often lack familiarity with non-Western cultural contexts. To address this limitation, our study focuses on the Persian language and Iranian culture. We introduce 19 new evaluation datasets specifically designed to assess LLMs on topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams. Using these datasets, we benchmarked 41 prominent LLMs, aiming to bridge the existing cultural and linguistic evaluation gap in the field. The evaluation results are publicly available on our live leaderboard: https://huggingface.co/spaces/opll-org/Open-Persian-LLM-Leaderboard

pdf bib
Atomic Consistency Preference Optimization for Long-Form Question Answering
Jingfeng Chen | Raghuveer Thirukovalluru | Junlin Wang | Kaiwei Luo | Bhuwan Dhingra

Large Language Models (LLMs) often produce factoid hallucinations - plausible yet incorrect answers. A common mitigation strategy is model alignment, which improves factual accuracy by training on curated (factual, non-factual) pairs. However, this approach often relies on a stronger model (e.g., GPT-4) or an external knowledge base to assess factual correctness that may not always be accessible. Addressing this, we propose Atomic Consistency Preference Optimization (ACPO), a self-supervised preference-tuning method that enhances factual accuracy without external supervision. ACPO leverages atomic consistency signals (i.e., the agreement of individual facts across multiple stochastic responses) to identify high- and low-quality data pairs for model alignment. Despite being fully self-supervised, ACPO outperforms the strong supervised alignment baseline by 1.95 points averaged across Phi-3 and Llama3 on the LongFact and BioGen datasets, demonstrating its effectiveness in improving factual reliability without relying on external models or knowledge bases.
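A minimal sketch of the atomic consistency signal described above: each sampled response is reduced to a set of atomic facts (the extraction step is abstracted away here), and a response is scored by how often its facts recur in the other samples; high- and low-scoring responses can then form preference pairs.

```python
def atomic_consistency(responses: list) -> list:
    """
    responses: one set of already-extracted atomic facts per sampled response.
    Returns, for each response, the mean fraction of its facts that also
    appear in each of the other responses.
    """
    scores = []
    for i, facts in enumerate(responses):
        others = [r for j, r in enumerate(responses) if j != i]
        if not facts or not others:
            scores.append(0.0)
            continue
        overlap = [len(facts & other) / len(facts) for other in others]
        scores.append(sum(overlap) / len(overlap))
    return scores

# Three stochastic samples for the same prompt, as bags of atomic facts.
samples = [
    {"insulin lowers blood glucose", "insulin was discovered in 1921"},
    {"insulin lowers blood glucose", "insulin was discovered in 1921", "insulin is a peptide hormone"},
    {"insulin raises blood glucose"},   # inconsistent outlier
]
print(atomic_consistency(samples))      # the outlier receives the lowest score
```

Pairing the most and least consistent responses yields (chosen, rejected) data for preference tuning without consulting any external model or knowledge base, which is the self-supervised aspect the abstract emphasizes.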

pdf bib
Breaking Bad: Norms for Valence, Arousal, and Dominance for over 10k English Multiword Expressions
Saif M. Mohammad

Factor analysis studies have shown that the primary dimensions of word meaning are Valence (V), Arousal (A), and Dominance (D). Existing lexicons such as the NRC VAD Lexicon, published in 2018, include VAD association ratings for words. Here, we present a complement to it, which has **human ratings of valence, arousal, and dominance for ∼10k English Multiword Expressions (MWEs) and their constituent words**. We also increase the coverage of unigrams, especially words that have become more common since 2018. In all, the new **NRC VAD Lexicon v2 now has entries for ∼10k MWEs and ∼25k words, in addition to the entries in v1**. We show that the associations are highly reliable. We use the lexicon to examine emotional characteristics of MWEs, including: 1. The degree to which MWEs (idioms, noun compounds, and verb particle constructions) exhibit strong emotionality; 2. The degree of emotional compositionality in MWEs. The lexicon enables a wide variety of research in NLP, Psychology, Public Health, Digital Humanities, and Social Sciences. The NRC VAD Lexicon v2 is freely available through the project webpage: http://saifmohammad.com/WebPages/nrc-vad.html

pdf bib
A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling
Kyle Buettner | Jacob T. Emmerson | Adriana Kovashka

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures. Modern vision-language models (VLMs) gain understanding of images with text in different languages often through training on machine translations of English captions. However, this process relies on input content written from the perception of English speakers, leading to a perceptual bias. In this work, we outline a framework to address this bias. We specifically use a small amount of native speaker data, nearest-neighbor example guidance, and multimodal LLM reasoning to augment captions to better reflect descriptions in a target language. When adding the resulting rewrites to multilingual CLIP finetuning, we improve on German and Japanese text-image retrieval case studies (up to +3.5 mean recall, +4.4 on native vs. translation errors). We also propose a mechanism to build understanding of object description variation across languages, and offer insights into cross-dataset and cross-language generalization.

pdf bib
Characterizing Mamba’s Selective Memory using Auto-Encoders
Tamanna Hossain | Robert L. Logan Iv | Chandrasekhara Ganesh Jagadeesan | Sameer Singh | Joel R. Tetreault | Alejandro Jaimes

State space models (SSMs) are a promising alternative to transformers for language modeling because they use fixed memory during inference. However, this fixed memory usage requires some information loss in the hidden state when processing long sequences. While prior work has studied the sequence length at which this information loss occurs, it does not characterize the types of information SSM language models (LMs) tend to forget. In this paper, we address this knowledge gap by identifying the types of tokens (e.g., parts of speech, named entities) and sequences (e.g., code, math problems) that are more frequently forgotten by SSM LMs. We achieve this by training an auto-encoder to reconstruct sequences from the SSM’s hidden state, and measure information loss by comparing inputs with their reconstructions. We perform experiments using the Mamba family of SSM LMs (130M–1.4B) on sequences ranging from 4–256 tokens. Our results show significantly higher rates of information loss on math-related tokens (e.g., numbers, variables), mentions of organization entities, and alternative dialects to Standard American English. We then examine the frequency that these tokens appear in Mamba’s pretraining data and find that less prevalent tokens tend to be the ones Mamba is most likely to forget. By identifying these patterns, our work provides clear direction for future research to develop methods that better control Mamba’s ability to retain important information.

pdf bib
Task-Aligned Tool Recommendation for Large Language Models
Hang Gao | Yongfeng Zhang

By augmenting Large Language Models (LLMs) with external tools, their capacity to solve complex problems has been significantly enhanced. However, despite ongoing advancements in the parsing capabilities of LLMs, incorporating all available tools simultaneously in the prompt remains impractical due to the vast number of external tools. Consequently, it is essential to provide LLMs with a precise set of tools tailored to the specific task, considering both quantity and quality. Current tool retrieval methods primarily focus on refining the ranking list of tools and directly packaging a fixed number of top-ranked tools as the tool set. However, these approaches often fail to equip LLMs with the optimal set of tools prior to execution, since the optimal number of tools differs across tasks; the result is inefficiencies such as redundant or unsuitable tools that impede immediate access to the most relevant ones. This paper addresses the challenge of recommending precise toolsets for LLMs. We introduce the problem of tool recommendation, define its scope, and propose a novel Precision-driven Tool Recommendation (PTR) approach. PTR captures an initial, concise set of tools by leveraging historical tool bundle usage and dynamically adjusts the tool set by performing tool matching, culminating in a multi-view-based tool addition. Additionally, we present a new dataset, RecTools, and a metric, TRACC, designed to evaluate the effectiveness of tool recommendation for LLMs. We further validate our design choices through comprehensive experiments, demonstrating promising accuracy across two open benchmarks and our RecTools dataset.

pdf bib
INTERCHART: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information
Anirudh Iyengar Kaniyar Narayana Iyengar | Srija Mukhopadhyay | Adnan Qidwai | Shubhankar Singh | Dan Roth | Vivek Gupta

We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning grounded in 2–3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open- and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi-entity charts into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments.

pdf bib
LangCompress: Language-Aware Compression of Large Language Models
Dieu-Hien Nguyen | Nguyen-Khang Le | Truong Dinh Do | Le-Minh Nguyen

Large Language Models (LLMs) demonstrate strong multilingual capabilities but are costly to deploy due to their size and computational demands. To mitigate this, compression techniques such as pruning and quantization are widely used. However, these methods face two key limitations: (1) they assume access to high-quality instruction or calibration data, which is often unavailable for low-resource languages; and (2) they aim to preserve multilingual generality, making them inefficient for language-specific applications. We introduce LangCompress, a language-aware compression framework that enhances existing compression methods for targeted deployment. LangCompress is method-agnostic and improves state-of-the-art pruning and quantization approaches. It features two core components: an iterative self-supervised pipeline for generating instruction data in the target language, and a vocabulary simplification strategy that reduces the LM head to focus on key tokens. Experiments on perplexity, translation, and summarization tasks show that LangCompress improves performance in the target language. The code and data are publicly available.

pdf bib
The Confidence Paradox: Can LLM Know When It’s Wrong?
Sahil Tripathi | MD Tabrez Nafis | Imran Hussain | Jiechao Gao

Document Visual Question Answering (DocVQA) models often produce overconfident or ethically misaligned responses, especially under uncertainty. Existing models like LayoutLMv3, UDOP, and DONUT focus on accuracy but lack ethical calibration. We propose HonestVQA, a model-agnostic, self-supervised framework that aligns model confidence with correctness using weighted loss and contrastive learning. We introduce two new metrics—Honesty Score (H-Score) and Ethical Confidence Index (ECI)—to evaluate ethical alignment. HonestVQA improves accuracy and F1 by up to 4.3% across SpDocVQA, InfographicsVQA, and SROIE datasets, while reducing overconfidence. It also generalizes well across domains, achieving 78.9% accuracy and 76.1% F1-score.

pdf bib
DharmaBench: Evaluating Language Models on Buddhist Texts in Sanskrit and Tibetan
Kai Golan Hashiloni | Shay Cohen | Asaf Shina | Jingyi Yang | Orr Meir Zwebner | Nicola Bajetta | Guy Bilitski | Rebecca Sundén | Guy Maduel | Ryan Conlon | Ari Barzilai | Daniel Mass | Shanshan Jia | Aviv Naaman | Sonam Choden | Sonam Jamtsho | Yadi Qu | Harunaga Isaacson | Dorji Wangchuk | Shai Fine | Orna Almogi | Kfir Bar

We assess the capabilities of large language models on tasks involving Buddhist texts written in Sanskrit and Classical Tibetan—two typologically distinct, low-resource historical languages. To this end, we introduce DharmaBench, a benchmark suite comprising 13 classification and detection tasks grounded in Buddhist textual traditions: six in Sanskrit and seven in Tibetan, with four shared across both. The tasks are curated from scratch, tailored to the linguistic and cultural characteristics of each language. We evaluate a range of models, from proprietary systems like GPT-4o to smaller, domain-specific open-weight models, analyzing their performance across tasks and languages. All datasets and code are publicly released, under the CC-BY-4 License and the Apache-2.0 License respectively, to support research on historical language processing and the development of culturally inclusive NLP systems.

pdf bib
BhashaSetu: Cross-Lingual Knowledge Transfer from High-Resource to Extreme Low-Resource Languages
Subhadip Maji | Arnab Bhattacharya

Despite remarkable advances in natural language processing, developing effective systems for low-resource languages remains a formidable challenge, with performance typically lagging far behind high-resource counterparts due to data scarcity and insufficient linguistic resources. Cross-lingual knowledge transfer has emerged as a promising approach to address this challenge by leveraging resources from high-resource languages. In this paper, we investigate methods for transferring linguistic knowledge from high-resource languages to low-resource languages, where the number of labeled training instances is in the hundreds. We focus on sentence-level and word-level tasks. We introduce a novel method, GETR (Graph-Enhanced Token Representation), for cross-lingual knowledge transfer along with two adopted baselines: (a) augmentation in hidden layers and (b) token embedding transfer through token translation. Experimental results demonstrate that our GNN-based approach significantly outperforms existing multilingual and cross-lingual baseline methods, achieving 13 percentage point improvements on truly low-resource languages (Mizo, Khasi) for POS tagging, and 20 and 27 percentage point improvements in macro-F1 on simulated low-resource languages (Marathi, Bangla, Malayalam) across sentiment classification and NER tasks respectively. We also present a detailed analysis of the transfer mechanisms and identify key factors that contribute to successful knowledge transfer in this linguistic context.

pdf bib
Are Humans as Brittle as Large Language Models?
Jiahui Li | Sean Papay | Roman Klinger

The output of large language models (LLMs) is unstable, due both to non-determinism of the decoding process as well as to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to prompt changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variances. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and identical instruction modifications for human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs for a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.

pdf bib
Large Temporal Models: Unlocking Temporal Understanding in LLMs for Temporal Relation Classification
Omri Homburger | Kfir Bar

We present Large Temporal Model, a Large Language Model (LLM) that excels in Temporal Relation Classification (TRC). We show how a carefully designed fine-tuning strategy, using a novel two-step fine-tuning approach, can adapt LLMs for TRC. Our approach is focused on global TRC, enabling simultaneous classification of all temporal relations within a document. Unlike traditional pairwise methods, our approach performs global inference in a single step, improving both efficiency and consistency. Evaluations on the MATRES and OmniTemp benchmarks demonstrate that, for the first time, an LLM achieves state-of-the-art performance, outperforming previous pairwise and global TRC methods. Results show that our global approach produces more consistent and accurate temporal graphs. Ablation studies further validate the effectiveness of our two-step fine-tuning strategy, while analyses reveal why our approach succeeds in increasing performance and reducing inconsistencies.

pdf bib
What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization
Weixiao Zhou | Junnan Zhu | Gengyao Li | Xianfu Cheng | Xinnian Liang | Feifei Zhai | Zhoujun Li

Traditional dialogue summarization primarily focuses on dialogue content, assuming it comprises adequate information for a clear summary. However, this assumption often fails for discussions grounded in shared background, where participants frequently omit context and use implicit references. This results in summaries that are confusing to readers unfamiliar with the background. To address this, we introduce Knowledge-Grounded Discussion Summarization (KGDS), a novel task that produces a supplementary background summary for context and a clear opinion summary with clarified references. To facilitate research, we construct the first KGDS benchmark, featuring news-discussion pairs and expert-created multi-granularity gold annotations for evaluating sub-summaries. We also propose a novel hierarchical evaluation framework with fine-grained and interpretable metrics. Our extensive evaluation of 12 advanced large language models (LLMs) reveals that KGDS remains a significant challenge. The models frequently miss key facts and retain irrelevant ones in background summarization, and often fail to resolve implicit references in opinion summary integration.

pdf bib
CtrlShift: Steering Language Models for Dense Quotation Retrieval with Dynamic Prompts
Chuang Liang | Wei Li | Yanqiu Shao

Quotation recommendation is an inherently asymmetric retrieval task, where the intended meaning of a quote often diverges from surface expressions, creating significant semantic shifts. Combined with minimal lexical overlap, this poses a core challenge for classic dense retrievers, which struggle to capture non-literal and rhetorical alignments. To bridge this semantic gap, we propose introducing controllable signals to guide the model’s attention toward abstract, context-relevant concepts. We present CtrlShift, a framework that leverages a Variational Autoencoder (VAE) to capture latent associations between context and quotation, which are used to derive context-aware control signals that modulate semantic focus and support bidirectional alignment and rhetorical intent modeling. Experiments show that our method consistently outperforms baselines on the quotation recommendation task and can be effectively transferred to a general-purpose benchmark. Further, CtrlShift integrates seamlessly with general-purpose generative models without additional fine-tuning, and provides satisfactory interpretability by generating textual explanations to uncover the model’s focus on abstract, citation-aligned semantics.

pdf bib
PerCoR: Evaluating Commonsense Reasoning in Persian via Multiple-Choice Sentence Completion
Morteza Alikhani | Mohammadtaha Bagherifard | Erfan Zinvandi | Mehran Sarmadi

We introduce PerCoR—Persian Commonsense Reasoning—the first large-scale Persian benchmark for commonsense reasoning. PerCoR contains 106K multiple-choice sentence-completion problems drawn from more than forty news, cultural and other web sources. We adopt a linguistically grounded, conjunction-based segmentation strategy to generate coherent prefix–continuation pairs. To create challenging distractors, we propose DRESS-AF—Distractor Ranking via Embedding Similarity Scoring and Adversarial Filtering—a generation-free adversarial filtering method that selects distractors from the pool of gold continuations while maximising model confusion. Human annotators score 89% on PerCoR, while OpenAI-o3 achieves the highest performance at 92.18%, followed closely by Claude-Sonnet-3.7 (91.17%). The strongest open-source model, DeepSeek-R1, reaches 82.51%, underscoring both the dataset’s difficulty and the remaining performance gap in Persian commonsense reasoning. We further show that DRESS-AF transfers to the English HellaSwag benchmark, increasing its difficulty without hurting human solvability. The dataset is available at https://huggingface.co/datasets/MCINext/PerCoR .
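A minimal sketch of the embedding-similarity side of DRESS-AF: distractors for an item are drawn from the pool of other items' gold continuations and ranked by cosine similarity to the item's own gold continuation; the embedding model, near-duplicate cutoff, and the adversarial-filtering pass against a reference model are all abstracted or assumed here.

```python
import numpy as np

def rank_distractors(gold_vec: np.ndarray, pool_vecs: np.ndarray,
                     pool_texts: list, k: int = 3,
                     max_sim: float = 0.95) -> list:
    """Return the k continuations most similar to the gold one, skipping near-duplicates."""
    gold = gold_vec / np.linalg.norm(gold_vec)
    pool = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = pool @ gold
    order = np.argsort(-sims)
    return [pool_texts[i] for i in order if sims[i] < max_sim][:k]

rng = np.random.default_rng(1)
gold_vec = rng.normal(size=64)                   # embedding of the gold continuation
pool_vecs = rng.normal(size=(100, 64))           # embeddings of other items' gold continuations
pool_texts = [f"continuation_{i}" for i in range(100)]
print(rank_distractors(gold_vec, pool_vecs, pool_texts))
```

Because every candidate is itself a human-written gold continuation, no text generation is needed; the adversarial filtering stage would additionally keep only candidates that a reference model finds hard to reject.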

pdf bib
Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval
Shubhashis Roy Dipta | Francis Ferraro

Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Q2E outperforms the previous SOTA on the MultiVENT dataset by 8 NDCG points, while improving on MSR-VTT and MSVD by 4 and 3 points, respectively, outperforming several existing retrieval methods, including many fine-tuned and SOTA zero-shot approaches. We have released both code and data.
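A minimal sketch of entropy-based fusion in the spirit described above: each modality produces a score vector over the same candidate videos, and modalities with lower-entropy (more peaked) score distributions receive larger fusion weights; the inverse-entropy weighting is an illustrative assumption.

```python
import numpy as np

def entropy_weighted_fusion(modality_scores: dict) -> np.ndarray:
    """modality_scores: name -> score vector over the same candidate videos."""
    weights = {}
    for name, scores in modality_scores.items():
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        entropy = -(probs * np.log(probs + 1e-12)).sum()
        weights[name] = 1.0 / (1.0 + entropy)        # peaked distributions get more weight
    total = sum(weights.values())
    fused = np.zeros_like(next(iter(modality_scores.values())), dtype=float)
    for name, scores in modality_scores.items():
        fused += (weights[name] / total) * scores
    return fused

scores = {
    "visual": np.array([0.90, 0.10, 0.20]),   # confident modality
    "speech": np.array([0.50, 0.45, 0.48]),   # nearly uniform, hence down-weighted
}
print(entropy_weighted_fusion(scores))
```

This keeps the fusion zero-shot: no weights are learned, and a modality that is uninformative for a given query contributes little to the final ranking.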

pdf bib
VideoChain: A Transformer-Based Framework for Multi-hop Video Question Generation
Arpan Phukan | Anupam Pandey | Deepjyoti Bodo | Asif Ekbal

Multi-hop Question Generation (QG) effectively evaluates reasoning but remains confined to text; Video Question Generation (VideoQG) is limited to zero-hop questions over single segments. To address this, we introduce VideoChain, a novel Multi-hop Video Question Generation (MVQG) framework designed to generate questions that require reasoning across multiple, temporally separated video segments. VideoChain features a modular architecture built on a modified BART backbone enhanced with video embeddings, capturing textual and visual dependencies. Using the TVQA+ dataset, we automatically construct the large-scale MVQ-60 dataset by merging zero-hop QA pairs, ensuring scalability and diversity. Evaluations show VideoChain’s strong performance across standard generation metrics: ROUGE-L (0.6454), ROUGE-1 (0.6854), BLEU-1 (0.6711), BERTScore-F1 (0.7967), and semantic similarity (0.8110). These results highlight the model’s ability to generate coherent, contextually grounded, and reasoning-intensive questions. To facilitate future research, we publicly release our code and dataset.

pdf bib
Interpreting the Effects of Quantization on LLMs
Manpreet Singh | Hassan Sajjad

Quantization offers a practical solution for deploying LLMs in resource-constrained environments. However, its impact on internal representations remains understudied, raising questions about the reliability of quantized models. In this study, we employ a range of interpretability techniques to investigate how quantization affects model and neuron behavior. We analyze multiple LLMs under 4-bit and 8-bit quantization. Our findings reveal that the impact of quantization on model calibration is generally minor. Analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. In terms of neuron contribution to predictions, we observe that smaller full-precision models exhibit fewer salient neurons, whereas larger models tend to have more, with the exception of Llama-2-7B. The effect of quantization on neuron redundancy varies across models. Overall, our findings suggest that the effect of quantization may vary by model and task; however, we did not observe any drastic changes that would discourage the use of quantization as a reliable model compression technique.
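A minimal sketch of the dead-neuron statistic referenced above: a neuron is counted as dead if its activation magnitude stays below a small threshold on every example; the threshold value and activation shapes are illustrative assumptions.

```python
import numpy as np

def count_dead_neurons(activations: np.ndarray, eps: float = 1e-3) -> int:
    """activations: (num_examples, num_neurons); a neuron is dead if |a| < eps everywhere."""
    dead_mask = (np.abs(activations) < eps).all(axis=0)
    return int(dead_mask.sum())

rng = np.random.default_rng(0)
acts = rng.normal(size=(1_000, 512))      # recorded activations for one layer
acts[:, :10] = 0.0                        # simulate ten dead neurons
print(count_dead_neurons(acts))           # 10
```

Comparing this count between the full-precision and the 4-bit or 8-bit variants of a model is one way to check the claim that dead-neuron counts stay consistent under quantization.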

pdf bib
Large Language Models Encode Semantics and Alignment in Linearly Separable Representations
Baturay Saglam | Paul Kassianik | Blaine Nelson | Sajana Weerawardhena | Yaron Singer | Amin Karbasi

Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. Yet it remains unclear to what extent LLMs linearly organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across six scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior—even when surface content remains unchanged. These findings motivate geometry-aware tools that operate directly in latent space to detect and mitigate harmful and adversarial content. As a proof of concept, we train an MLP probe on final-layer hidden states as a lightweight latent-space guardrail. This approach substantially improves refusal rates on malicious queries and prompt injections that bypass both the model’s built-in safety alignment and external token-level filters.
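As a rough illustration of the guardrail idea, the sketch below trains an MLP probe on final-layer hidden states to flag malicious prompts. The feature extraction is stubbed out with random placeholders; in practice the hidden states would come from the audited LLM (e.g., via output_hidden_states=True), and the dimensionality, probe size, and threshold are assumptions rather than values from the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder features standing in for final-layer hidden states of the last
# prompt token (hidden size 768 assumed); labels mark malicious (1) vs benign (0).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2000, 768))
labels = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0)

# Small MLP acting as a latent-space guardrail.
probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200, random_state=0)
probe.fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))

def should_refuse(final_hidden_state, threshold=0.5):
    """Refuse to answer if the probe estimates the prompt is malicious."""
    prob = probe.predict_proba(final_hidden_state.reshape(1, -1))[0, 1]
    return prob >= threshold

print(should_refuse(hidden_states[0]))
```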

pdf bib
Beyond statistical significance: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation
Jonne Sälevä | Duygu Ataman | Constantine Lignos

We introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model and data-related sources, and that accounting for both of them is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for quantifying the replication uncertainty of various quantities used in leaderboards such as model rankings and pairwise differences between models.
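One standard resampling ingredient that fits this setting is the percentile bootstrap over test instances, which yields confidence intervals for a single model’s score and for the pairwise difference between two models; the paper’s actual procedures (e.g., resampling over languages or tasks as well) may differ. A minimal sketch:

```python
import numpy as np

def bootstrap_ci(per_item_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-item scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_item_scores, dtype=float)
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def paired_difference_ci(scores_a, scores_b, **kwargs):
    """Bootstrap CI for the mean difference between two models on the same items;
    an interval excluding zero suggests the ranking is stable under resampling."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return bootstrap_ci(diffs, **kwargs)

# Toy per-item accuracies for two models evaluated on the same 500 test items.
rng = np.random.default_rng(1)
model_a = rng.binomial(1, 0.72, size=500)
model_b = rng.binomial(1, 0.68, size=500)
print("model A accuracy CI:", bootstrap_ci(model_a))
print("A - B difference CI:", paired_difference_ci(model_a, model_b))
```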

pdf bib
What Would You Ask When You First Saw a²+b²=c²? Evaluating LLM on Curiosity-Driven Question Generation
Shashidhar Reddy Javaji | Zining Zhu

Large language models (LLMs) are increasingly widely used as critical components of knowledge retrieval systems and agentic systems. These systems can benefit from knowledge-seeking capabilities of LLMs, in other words, curiosity. However, this capability has not been evaluated quantitatively. Towards bridging this gap, we propose an evaluation framework, CDQG (Curiosity-Driven Question Generation). The CDQG task prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person encountering the statement for the first time. The CDQG dataset contains 1,988 statements spanning physics, chemistry, and mathematics at distinct levels of difficulty, along with general knowledge statements and intentionally erroneous statements. We score the qualities of the questions generated by LLMs along multiple dimensions. These scores are validated by rigorous controlled ablation studies and human evaluations. While large models like GPT-4 and Mistral 8x7b can generate highly coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model’s knowledge acquisition potential. CDQG quantifies a critical model capability, and opens up research opportunities for developing future knowledge retrieval systems driven by LLMs.

pdf bib
Can AI Validate Science? Benchmarking LLMs on Claim → Evidence Reasoning in AI Papers
Shashidhar Reddy Javaji | Yupeng Cao | Haohang Li | Yangyang Yu | Nikhil Muralidhar | Zining Zhu

Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs’ capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches inspired by divide-and-conquer strategies across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs’ ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs’ abilities to accurately link dispersed evidence with claims, although this comes at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.

pdf bib
More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation
Yangtian Zi | Harshitha Menon | Arjun Guha

State-of-the-art Large Language Models (LLMs) achieve high pass@1 on general benchmarks like HumanEval (Chen et al., 2021) but underperform on specialized suites such as ParEval (Nichols et al., 2024). Is this because LLMs lack domain knowledge, or because the prompts provide insufficient detail? To answer this, we introduce PartialOrderEval, which augments any code generation benchmark with a partial order of prompts from minimally to maximally detailed. Applying it to HumanEval and to both the serial and OpenMP subsets of ParEval, we measure how pass@1 scales with prompt specificity. Our experiments with Llama-3.x and Qwen2.5-Coder demonstrate varying degrees of prompt sensitivity across different tasks, and a qualitative analysis highlights explicit I/O specifications, edge-case handling, and stepwise breakdowns as the key drivers of improvement from added prompt detail.

pdf bib
Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Galann Pennec | Zhengyuan Liu | Nicholas Asher | Philippe Muller | Nancy F. Chen

Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay-like representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.

pdf bib
PMPO: A Self-Optimizing Framework for Creating High-Fidelity Measurement Tools for Social Bias in Large Language Models
Zeqiang Wang | Yuqi Wang | Xinyue Wu | Chenxi Li | Yiran Liu | Linghan Ge | Zhan Yu | Jiaxin Shi | Suparna De

The potential of Large Language Models (LLMs) as instruments for measuring social phenomena is constrained by the methodological limitations of current probing techniques. Prevailing methods rely on static, handcrafted probe sets whose quality is highly dependent on their authors’ subjective expertise. This results in measurement tools with inconsistent statistical reliability that defy systematic optimization. Such an “artisanal” approach, akin to using an “uneven ruler,” undermines the scientific rigor of its findings and severely limits the applicability of LLMs in the social sciences. To elevate bias measurement from a craft to a science, we introduce the Psychometric-driven Probe Optimization (PMPO) framework. This framework treats a probe set as an optimizable scientific instrument and, for the first time, utilizes a Neural Genetic Algorithm that leverages a powerful LLM as a “neural genetic operator.” Through a hybrid strategy of gradient-guided mutation and creative rephrasing, PMPO automatically enhances the probe set’s reliability, sensitivity, and diversity. We first establish the external validity of our foundational measurement method (PLC), demonstrating a high correlation between its measurement of occupational gender bias and real-world U.S. Bureau of Labor Statistics data (average Pearson’s r=0.83, p<.001). Building on this, we show that the PMPO framework can elevate a standard probe set’s internal consistency (Cronbach’s Alpha) from 0.90 to an exceptional 0.96 within 10 generations. Critically, in a rigorous, double-blind “Turing Test,” probes evolved by PMPO from non-expert seeds were judged by sociology experts to have achieved a level of quality, sophistication, and nuance that is comparable to, and even indistinguishable from, those handcrafted by domain experts. This work provides a systematic pathway to upgrade LLM measurement tools from artisanal artifacts to automated scientific instruments, offering an unprecedented and trustworthy tool for AI safety auditing and computational social science.

pdf bib
ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages
Neha Joshi | Pamir Gogoi | AasimBaig Mirza | Aayush Jansari | Aditya Yadavalli | Ayushi Pandey | Arunima Shukla | Deepthi Sudharsan | Kalika Bali | Vivek Seshadri

We present a culturally-grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The resulting dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find the following: despite the models’ capabilities, they struggle with low-resource, culturally-specific language. However, we observe that providing targeted context, including background information about the languages, translation examples, and guidelines for cultural preservation, leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.

pdf bib
Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs
Takuma Sato | Seiya Kawano | Koichiro Yoshino

The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks that require understanding implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.

pdf bib
IndicClaimBuster: A Multilingual Claim Verification Dataset
Pritam Pal | Shyamal Krishna Jana | Dipankar Das

The present article introduces **IndicClaimBuster**, a novel multilingual claim verification dataset comprising 9K claims and their corresponding evidence in English, Hindi, Bengali, and Hindi-English CodeMixed texts. The data set covers three key domains: politics, law and order, and health, to address the challenges of verifiable facts. Each claim was sourced from reputable Indian news portals and is accompanied by three pieces of evidence, two LLM-generated and one manually curated. Additionally, a separate attempt was conducted to generate refuted claims by employing an LLM. We further develop two frameworks: an unsupervised baseline and a two-stage pipeline that comprises evidence retrieval and veracity prediction modules. For retrieval, we fine-tuned SBERT models, with e5-base demonstrating superior average performance across languages, whereas for veracity prediction, multilingual transformers (mBERT, XLM-R, MuRIL, IndicBERTv2) were fine-tuned. Results indicate MuRIL and IndicBERTv2 excel in Indian languages, while XLM-R performs the best for CodeMix. Our work contributes a high-quality multilingual dataset and strong baseline methodologies, offering valuable resources for advancing automated claim verification in linguistically diverse and low-resource settings for Indian languages. The IndicClaimBuster dataset is available at: https://github.com/pritampal98/indic-claim-buster

pdf bib
IncogniText: Privacy-enhancing Conditional Text Anonymization via LLM-based Private Attribute Randomization
Ahmed Frikha | Nassim Walha | Krishna Kanth Nakka | Ricardo Mendes | Xue Jiang | Xuebing Zhou

In this work, we address the problem of text anonymization where the goal is to prevent adversaries from correctly inferring private attributes of the author, while keeping the text utility, i.e., meaning and semantics. We propose IncogniText, a technique that anonymizes the text to mislead a potential adversary into predicting a wrong private attribute value. Our empirical evaluation shows a reduction of private attribute leakage by more than across 8 different private attributes. Finally, we demonstrate the maturity of IncogniText for real-world applications by distilling its anonymization capability into a set of LoRA parameters associated with an on-device model. Our results show the possibility of reducing privacy leakage by more than half with limited impact on utility.

pdf bib
Crypto-LLM: Two-Stage Language Model Pre-training with Ciphered and Natural Language Data
Yohei Kobashi | Fumiya Uchiyama | Takeshi Kojima | Andrew Gambardella | Qi Cao | Yusuke Iwasawa | Yutaka Matsuo

As the adoption of large language models (LLMs) continues to grow, the risk of sensitive data leakage from their training datasets has become a critical concern. This study proposes a novel method for encrypting training data using a polyalphabetic substitution cipher. This approach prevents the model from learning sensitive information while allowing it to capture abstract linguistic patterns. We pre-trained a Llama 3 model (551M parameters) using approximately 7.5 billion tokens of encrypted data and subsequently conducted continual pre-training with another 2.5 billion tokens of plaintext data. The effectiveness of the model was evaluated by comparing its downstream task performance with a model trained solely on plaintext data. In addition, we evaluated the risk of sensitive data leakage through name reconstruction, true-prefix and data extraction attacks. These results demonstrate the potential of our approach to balance data security with model performance.
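As a toy illustration of the kind of cipher the abstract describes, the sketch below applies a Vigenère-style polyalphabetic substitution over letters; the paper’s actual scheme (e.g., operating on tokens or bytes, and its key handling) may differ.

```python
import string

ALPHABET = string.ascii_lowercase

def polyalphabetic_encrypt(text, key):
    """Shift each letter by the corresponding (repeating) key letter,
    leaving non-alphabetic characters untouched."""
    out, k = [], 0
    for ch in text:
        if ch.lower() in ALPHABET:
            shift = ALPHABET.index(key[k % len(key)])
            idx = (ALPHABET.index(ch.lower()) + shift) % len(ALPHABET)
            enc = ALPHABET[idx]
            out.append(enc.upper() if ch.isupper() else enc)
            k += 1
        else:
            out.append(ch)
    return "".join(out)

# The surface form (and any names it contains) is scrambled, while character-level
# structure such as word boundaries and punctuation is preserved.
print(polyalphabetic_encrypt("Patient John Doe was admitted on Monday.", key="nlp"))
```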

pdf bib
Not Just a Piece of Cake: Cross-Lingual Fine-Tuning for Idiom Identification
Ofri Hefetz | Kai Golan Hashiloni | Alon Mannor | Kfir Bar

We investigate cross-lingual fine-tuning for idiomatic expression identification, addressing the limited availability of annotated data in many languages. We evaluate encoder and generative decoder models to examine their ability to generalize idiom identification across languages. Additionally, we conduct an explainability study using linear probing and LogitLens to analyze how idiomatic meaning is represented across model layers. Results show consistent cross-lingual transfer, with English emerging as a strong source language. All code and models are released to support future research.

pdf bib
Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics
Chunhua Liu | Hong Yi Lin | Patanamon Thongtanunam

Language models have shown strong capabilities across a wide range of tasks in software engineering, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes, which have a structurally complex and context-dependent format, remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical tasks involving code change to natural language generation: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations. Whilst commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics effectively contribute to hallucination detection, showing promise for inference-time detection.

pdf bib
Differential Mamba
Nadav Schneider | Itamar Zimerman | Eliya Nachmani

Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models.

pdf bib
From Templates to Natural Language: Generalization Challenges in Instruction-Tuned LLMs for Spatial Reasoning
Chalamalasetti Kranti | Sherzod Hakimov | David Schlangen

Instruction-tuned large language models (LLMs) have shown strong performance on a variety of tasks; however, generalizing from synthetic to human-authored instructions in grounded environments remains a challenge for them. In this work, we study generalization challenges in spatial grounding tasks where models interpret and translate instructions for building object arrangements on a 2.5D grid. We fine-tune LLMs using only synthetic instructions and evaluate their performance on a benchmark dataset containing both synthetic and human-authored instructions. Our results reveal that while models generalize well on simple tasks, their performance degrades significantly on more complex tasks. We present a detailed error analysis of the gaps in instruction generalization.

pdf bib
Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization
Masahiro Kaneko | Zeerak Talat | Timothy Baldwin

Iterative jailbreak methods, which repeatedly rewrite and re-submit prompts to large language models (LLMs) to induce harmful outputs and use the model’s previous responses to guide each new iteration, have been found to be a highly effective attack strategy. Although this attack strategy is effective against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.

pdf bib
HARBOR: Exploring Persona Dynamics in Multi-Agent Competition
Kenan Jiang | Li Xiong | Fei Liu

We investigate factors contributing to LLM agents’ success in competitive multi-agent environments, using auctions as a testbed where agents bid to maximize profit. The agents are equipped with bidding domain knowledge, distinct personas that reflect item preferences, and a memory of auction history. Our work extends the classic auction scenario by creating a realistic environment where multiple agents bid on houses, weighing aspects such as size, location, and budget to secure the most desirable homes at the lowest prices. Particularly, we investigate three key questions: (a) How does a persona influence an agent’s behavior in a competitive setting? (b) Can an agent effectively profile its competitors’ behavior during auctions? (c) How can persona profiling be leveraged to create an advantage using strategies such as theory of mind? Through a series of experiments, we analyze the behaviors of LLM agents and shed light on new findings. Our testbed, called HARBOR, offers a valuable platform for deepening the understanding of multi-agent workflows in competitive environments.

pdf bib
A Diagnostic Framework for Auditing Reference-Free Vision-Language Metrics
Angeline Charles | Srikant Panda | Amit Agarwal | Hitesh Laxmichand Patel | Priyaranjan Pattnayak | Bhargava Kumar | Tejaswini Kumar

Reference-free metrics such as CLIPScore and PAC-S are increasingly used in vision-language tasks due to their scalability and independence from human-written references. However, their reliability under linguistic, visual, and cultural variation remains underexplored. In this work, we present a systematic audit of CLIPScore and PAC-S using an eight-factor diagnostic framework applied to MS-COCO validation images. Our analysis reveals consistent failure modes across dimensions including object size, content category, syntax, named entities, spatial relations, and cultural context. Both metrics penalize captions referencing African (−5.5%, −4.8%) and Arabian (−4.9%, −5.3%) cultures, favor large-object and animal-centric scenes (by 20-30%), and show limited sensitivity to spatial negation and word order. CLIPScore correlates more strongly with syntactic complexity, while PAC-S demonstrates greater robustness to verbosity and named-entity variation, highlighting complementary strengths rather than superiority. These findings expose cultural and content bias, weak semantic robustness, and limited compositional understanding. We conclude with design recommendations to improve fairness, scale invariance, and semantic grounding in future reference-free evaluation metrics.
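For context, CLIPScore itself is a simple reference-free quantity (Hessel et al., 2021): a scaled, clipped cosine similarity between CLIP image and caption embeddings. The sketch below computes it from precomputed embeddings and measures the relative score change under a controlled caption perturbation, which is the flavor of probing such an audit relies on; the perturbation bookkeeping and toy embeddings are illustrative only.

```python
import numpy as np

def clipscore(image_emb, caption_emb, w=2.5):
    """Reference-free CLIPScore (Hessel et al., 2021):
    w * max(cosine(image, caption), 0)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    return w * max(float(image_emb @ caption_emb), 0.0)

def relative_change(image_emb, original_caption_emb, perturbed_caption_emb):
    """Percentage change in CLIPScore after a controlled caption perturbation
    (e.g., swapping the cultural reference while keeping the rest fixed)."""
    base = clipscore(image_emb, original_caption_emb)
    pert = clipscore(image_emb, perturbed_caption_emb)
    return 100.0 * (pert - base) / base

# Toy vectors standing in for CLIP image/text features (dimension 512 assumed).
rng = np.random.default_rng(0)
img = rng.normal(size=512)
orig = img + rng.normal(scale=0.5, size=512)   # well-aligned caption
pert = img + rng.normal(scale=0.9, size=512)   # perturbed caption
print("score change (%):", relative_change(img, orig, pert))
```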

pdf bib
Small Changes, Large Consequences: Analyzing the Allocational Fairness of LLMs in Hiring Contexts
Preethi Seshadri | Hongyu Chen | Sameer Singh | Seraphina Goldfarb-Tarrant

Large language models (LLMs) are increasingly being deployed in high-stakes applications like hiring, yet their potential for unfair decision-making remains understudied in generative and retrieval settings. In this work, we examine the allocational fairness of LLM-based hiring systems through two tasks that reflect actual HR usage: resume summarization and applicant ranking. By constructing a synthetic resume dataset with controlled perturbations and curating job postings, we investigate whether model behavior differs across demographic groups. Our findings reveal that generated summaries exhibit meaningful differences more frequently for race than for gender perturbations. Models also display non-uniform retrieval selection patterns across demographic groups and exhibit high ranking sensitivity to both gender and race perturbations. Surprisingly, retrieval models can show comparable sensitivity to both demographic and non-demographic changes, suggesting that fairness issues may stem from broader model brittleness. Overall, our results indicate that LLM-based hiring systems, especially in the retrieval stage, can exhibit notable biases that lead to discriminatory outcomes in real-world contexts.

pdf bib
CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications
Raviraj Bhuminand Joshi | Rakesh Paul | Kanishk Singla | Anusha Kamath | Michael Evans | Katherine Luna | Shaona Ghosh | Utkarsh Vaidya | Eileen Margaret Peters Long | Sanjay Singh Chauhan | Niranjan Wartikar

The increasing use of Large Language Models (LLMs) in agentic applications highlights the need for robust safety guard models. While content safety in English is well-studied, non-English languages lack similar advancements due to the high cost of collecting culturally aligned labeled datasets. We present CultureGuard, a novel solution for curating culturally aligned, high-quality safety datasets across multiple languages. Our approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. This pipeline enables the conversion and expansion of the Nemotron-Content-Safety-Dataset-V2 English safety dataset into eight distinct languages: Arabic, German, Spanish, French, Hindi, Japanese, Thai, and Chinese. The resulting dataset, Nemotron-Safety-Guard-Dataset-v3, comprises 386,661 samples in 9 languages and facilitates the training of Llama-3.1-Nemotron-Safety-Guard-8B-v3 via LoRA-based fine-tuning. The final model achieves state-of-the-art performance on several multilingual content safety benchmarks. Furthermore, we show our moderately multilingual fine-tuning enables robust cross-lingual transfer and strong zero-shot generalization to unseen languages. We also benchmark the latest open LLMs on multilingual safety and observe that these LLMs are more prone to give unsafe responses when prompted in non-English languages. This work advances multilingual LLM safety by enabling the development of culturally aware safety guard models.

pdf bib
Revisiting Word Embeddings in the LLM Era
Yash Mahajan | Matthew Freestone | Naman Bansal | Sathyanarayanan N. Aakur | Santu Karmaker

Large Language Models (LLMs) have recently shown remarkable advancement in various NLP tasks. As such, a popular trend has emerged lately where NLP researchers extract word/sentence/document embeddings from these large decoder-only models and use them for various inference tasks with promising results. However, it is still unclear whether the performance improvement of LLM-induced embeddings is merely because of scale or whether the underlying embeddings they produce differ significantly from those of classical encoding models like Word2Vec, GloVe, Sentence-BERT (SBERT) or Universal Sentence Encoder (USE). This is the central question we investigate in this paper by systematically comparing classical decontextualized and contextualized word embeddings with their LLM-induced counterparts. Our results show that LLMs cluster semantically related words more tightly and perform better on analogy tasks in decontextualized settings. However, in contextualized settings, classical models like SimCSE often outperform LLMs in sentence-level similarity assessment tasks, highlighting their continued relevance for fine-grained semantics.
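One common way to obtain decontextualized word embeddings from a decoder-only model is to feed each word in isolation and pool the final-layer hidden states over its sub-tokens; the sketch below does this with GPT-2 as a small stand-in and compares cosine similarities, though the paper’s extraction protocol may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def word_embedding(word):
    """Decontextualized embedding: encode the word in isolation and average
    the final-layer hidden states over its sub-tokens."""
    ids = tok(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).last_hidden_state   # (1, num_subtokens, dim)
    return hidden.mean(dim=1).squeeze(0)

def cosine(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# Tighter semantic clustering would show up as a larger gap between the
# related pair and the unrelated pair.
print(cosine(word_embedding("doctor"), word_embedding("nurse")))
print(cosine(word_embedding("doctor"), word_embedding("banana")))
```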

pdf bib
Who Remembers What? Tracing Information Fidelity in Human-AI Chains
Suvojit Acharjee | Utathya Aich | Diptarka Mandal | Asfak Ali

In many real-world settings like journalism, law, medicine, and science communication, information is passed from one person or system to another through multiple rounds of summarization or rewriting. This process, known as multi-hop information transfer, also happens increasingly in workflows involving large language models (LLMs). But while summarization models and factuality metrics have improved, we still don’t fully understand how meaning and factual accuracy hold up across long chains of transformations, especially when both humans and LLMs are involved.In this paper, we take a fresh look at this problem by combining insights from cognitive science (Bartlett’s serial reproduction) and information theory (Shannon’s noisy-channel model). We build a new dataset of 700 five-step transmission chains that include human-only, LLM-only, mixed human-LLM, and cross-LLM settings across a wide range of source texts. To track how meaning degrades, we introduce three new metrics: Information Degradation Rate (IDR) for semantic drift, Meaning Preservation Entropy (MPE) for uncertainty in factual content, and Cascaded Hallucination Propagation Index (CHPI) for how hallucinations accumulate over time. Our findings reveal that hybrid chains behave asymmetrically. When a human summary is refined by a language model, the final output tends to preserve meaning well, suggesting that models can improve upon human-written summaries. The code and data will be available at : https://github.com/transtrace6/TransTrace.git.

pdf bib
QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs
Maria Tseytlin | Paul Roit | Omri Abend | Ido Dagan | Ayal Klein

Decomposing sentences into fine-grained meaning units is increasingly used to model semantic alignment. While QA-based semantic approaches have shown effectiveness for representing predicate-argument relations, they have so far left noun-centered semantics largely unaddressed. We introduce QA-Noun, a QA-based framework for capturing noun-centered semantic relations. QA-Noun defines nine question templates that cover both explicit syntactic and implicit contextual roles for nouns, producing interpretable QA pairs that complement verbal QA-SRL. We release detailed guidelines, a dataset of over 2,000 annotated noun mentions, and a trained model integrated with QA-SRL to yield a unified decomposition of sentence meaning into individual, highly fine-grained facts. Evaluation shows that QA-Noun achieves near-complete coverage of AMR’s noun arguments while surfacing additional contextually implied relations, and that combining QA-Noun with QA-SRL yields over 130% higher granularity than recent fact-based decomposition methods such as FactScore and DecompScore. QA-Noun thus complements the broader QA-based semantic framework, forming a comprehensive and scalable approach to fine-grained semantic decomposition for cross-text alignment.

pdf bib
On Memorization of Large Language Models in Logical Reasoning
Chulin Xie | Yangsibo Huang | Chiyuan Zhang | Da Yu | Xinyun Chen | Bill Yuchen Lin | Bo Li | Badih Ghazi | Ravi Kumar

Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs’ reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to the memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using two dynamically generated logical reasoning benchmarks based on Knights and Knaves (K&K) puzzles and Zebra puzzles (DynamicZebra). We find that LLMs could interpolate and memorize the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet they struggle with slight variations of these puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. Through in-depth analyses with perturbation tests, cross difficulty-level transferability, probing model internals, and fine-tuning with wrong answers, we establish that LLMs develop reasoning skills on logical puzzles alongside memorization. Finally, our analysis based on a per-sample memorization score sheds light on how LLMs switch between reasoning and memorization when solving logical puzzles.

pdf bib
ControlMed: Adding Reasoning Control to Medical Language Model
Sung-Min Lee | Siyoon Lee | Juyeon Kim | Kyoungmin Roh

Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain, as the life-critical nature of clinical decision-making demands reliable support. Despite these advancements, existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency. These limitations hinder their practical deployment in real-world clinical environments. To address these challenges, we introduce ControlMed, a medical language model that enables users to actively control the length of the reasoning process at inference time through fine-grained control markers. ControlMed is trained through a three-stage pipeline: 1) pre-training on a large-scale synthetic medical instruction dataset covering both direct and reasoning responses; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; and 3) reinforcement learning with model-based reward signals to enhance factual accuracy and response quality. Experimental results on a variety of English and Korean medical benchmarks demonstrate that our model achieves similar or better performance compared to state-of-the-art models. Furthermore, users can flexibly balance reasoning accuracy and computational efficiency by controlling the reasoning length as needed. These findings demonstrate that ControlMed is a practical and adaptable solution for clinical question answering and medical information analysis.

pdf bib
No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning.
Abhishek Rajgaria | Kushagra Dixit | Mayank Vyas | Harshavardhan Kalalbandi | Dan Roth | Vivek Gupta

Temporal Table Reasoning poses a significant challenge for Large Language Models (LLMs), requiring effective reasoning to extract relevant insights. Despite the existence of multiple prompting methods, their impact on table reasoning remains largely unexplored. Furthermore, model performance varies drastically across different table and context structures, making it difficult to determine an optimal approach. This work investigates multiple prompting techniques on diverse table types and finds that performance depends on factors such as entity type, table structure, requirement of additional context, and question complexity, with no single method consistently outperforming others. To address this, we introduce SEAR, an adaptive prompting framework inspired by human reasoning that dynamically adjusts to context and integrates structured reasoning, along with SEAR_Unified, its cost-efficient variant. We also demonstrate that optional table refactoring (preprocessing) enhances both approaches when tables lack structural consistency. Our results demonstrate that SEAR prompts achieve superior performance across all table types compared to baseline prompting techniques.

pdf bib
ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts
Karthikeyan K | Raghuveer Thirukovalluru | David Carlson

Clinical notes contain valuable, context-rich information, but their unstructured format introduces several challenges, including unintended biases (e.g., gender or racial bias), poor generalization across clinical settings (e.g., models trained on one EHR system may perform poorly on another due to format differences), and poor interpretability. To address these issues, we present ClinStructor, a pipeline that leverages large language models (LLMs) to convert clinical free-text into structured, task-specific question–answer pairs prior to predictive modeling. Our method substantially enhances transparency and controllability and only leads to a modest reduction in predictive performance (a 2–3% drop in AUC), compared to direct fine-tuning, on the ICU mortality prediction task. ClinStructor lays a strong foundation for building reliable, interpretable, and generalizable machine learning models in clinical environments.

pdf bib
Adaptive Collaborative Labeling with MLLMs for Low-Resource Multimodal Emotion Recognition
Wenwen Zhuang | Lu Xiang | Shubei Tang | Yaping Zhang | Yu Zhou

Multimodal emotion recognition (MER) plays a crucial role in human-centric AI applications, yet existing models struggle in low-resource scenarios due to their heavy reliance on large amounts of high-quality labeled data. To address this challenge, we propose Adaptive Collaborative Labeling for Low-Resource MER (ACL-MER), a novel framework that leverages off-the-shelf multimodal large language models (MLLMs) to effectively exploit abundant unlabeled data. Specifically, ACL-MER incorporates a diverse teacher model zoo, wherein each MLLM specializes in a specific modality and is prompted to generate chain-of-thought predictions accompanied by scalar confidence scores. Rather than directly adopting these pseudo-labels, ACL-MER introduces an adaptive refinement strategy that selectively distills knowledge based on teacher confidence, iteratively guiding the lightweight student model toward robust learning under limited supervision. Extensive experiments on two benchmarks demonstrate that ACL-MER consistently outperforms strong baselines, especially in extremely low-resource settings.

pdf bib
Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning
Nhi Hoai Doan | Tatsuya Hiraoka | Kentaro Inui

This paper investigates the relationship between large language models’ (LLMs) ability to recognize repetitive input patterns and their performance on in-context learning (ICL). In contrast to prior work that has primarily focused on attention heads, we examine this relationship from the perspective of skill neurons, specifically repetition neurons. Our experiments reveal that the impact of these neurons on ICL performance varies depending on the depth of the layer in which they reside. By comparing the effects of repetition neurons and induction heads, we further identify strategies for reducing repetitive outputs while maintaining strong ICL capabilities.

pdf bib
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Abhijit Mishra | Mingda Li | Hsiang Fu | Richard Noh | Minji Kim

Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores **Visual Instruction Rewriting**, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs **(250M parameters)** with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.

pdf bib
Quantifying Cognitive Bias Induction in LLM-Generated Content
Abeer Alessa | Param Somane | Akshaya Thenkarai Lakshminarasimhan | Julian Skirzynski | Julian McAuley | Jessica Maria Echterhoff

Large language models (LLMs) are integrated into applications like shopping reviews, summarization, or medical diagnosis support, where their use affects human decisions. We investigate the extent to which LLMs expose users to biased content and demonstrate its effect on human decision-making. We assess five LLM families in summarization and news fact-checking tasks, evaluating the consistency of LLMs with their context and their tendency to hallucinate on a new self-updating dataset. Our findings show that LLMs expose users to content that changes the context’s sentiment in 26.42% of cases (framing bias), hallucinate on 60.33% of post-knowledge-cutoff questions, and highlight context from earlier parts of the prompt (primacy bias) in 10.12% of cases, averaged across all tested models. We further find that humans are 32% more likely to purchase the same product after reading a summary of the review generated by an LLM rather than the original review. To address these issues, we evaluate 18 mitigation methods across three LLM families and find that targeted interventions are effective.
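A minimal sketch of how framing bias of this kind can be flagged: compare the sentiment of the source passage with that of the generated summary using an off-the-shelf classifier. The model checkpoint and the notion of a signed score are assumptions for illustration, not the paper’s measurement protocol.

```python
from transformers import pipeline

# Off-the-shelf sentiment classifier used as a stand-in scorer.
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

def signed_sentiment(text):
    """Positive score if the classifier says POSITIVE, negative otherwise."""
    out = sentiment(text)[0]
    return out["score"] if out["label"] == "POSITIVE" else -out["score"]

def framing_shift(source_text, summary_text):
    """Difference in signed sentiment between summary and source; a large
    positive value means the summary reads more favourably than the source."""
    return signed_sentiment(summary_text) - signed_sentiment(source_text)

src = "The battery lasts a full day, though the phone is heavy and the camera is mediocre."
summ = "Reviewers praise the phone's all-day battery life."
print(framing_shift(src, summ))
```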

pdf bib
Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
Daniil Gurgurov | Katharina Trinley | Yusser Al Ghussin | Tanja Baeumel | Josef Van Genabith | Simon Ostermann

Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming established replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that neuron steering enhances downstream performance and reveal internal "fallback" mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
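A minimal sketch of the LAPE-style selection step, assuming per-neuron activation probabilities over languages have already been collected; the normalization and threshold here follow the general idea only loosely and are not the paper’s exact settings.

```python
import numpy as np

def lape_scores(activation_probs):
    """Language Activation Probability Entropy.

    activation_probs: array of shape (num_neurons, num_languages) giving, for
    each neuron, the fraction of tokens of each language on which its activation
    exceeds zero. Lower entropy means the neuron fires mostly for a few
    languages, i.e. it is language-specific.
    """
    p = activation_probs / (activation_probs.sum(axis=1, keepdims=True) + 1e-12)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def language_specific_neurons(activation_probs, quantile=0.05):
    """Indices of the neurons with the lowest LAPE scores."""
    scores = lape_scores(activation_probs)
    return np.where(scores <= np.quantile(scores, quantile))[0]

# Toy example: 5 neurons, 3 languages; neurons 0, 2 and 4 fire almost
# exclusively for one language each.
probs = np.array([
    [0.90, 0.02, 0.01],
    [0.30, 0.35, 0.28],
    [0.05, 0.80, 0.04],
    [0.33, 0.30, 0.31],
    [0.02, 0.03, 0.85],
])
print(language_specific_neurons(probs, quantile=0.6))
```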

pdf bib
FOCUS: A Benchmark for Targeted Socratic Question Generation via Source-Span Grounding
Surawat Pothong | Machi Shimmei | Naoya Inoue | Paul Reisert | Ana Brassard | Wenzhi Wang | Shoichi Naito | Jungmin Choi | Kentaro Inui

We present FOCUS, a benchmark and task setting for Socratic question generation that delivers more informative and targeted feedback to learners. Unlike prior datasets, which rely on broad typologies and lack grounding in the source text, FOCUS introduces a new formulation: each Socratic question is paired with a fine-grained, 11-type typology and an explicit source span from the argument it targets. This design supports clearer, more actionable feedback and facilitates interpretable model evaluation. FOCUS includes 440 annotated instances with moderate partial-match agreement, establishing it as a reliable benchmark. Baseline experiments with representative state-of-the-art models reveal, through detailed error analysis, that even strong models struggle with span selection and context-sensitive categories. An extension study on the LogicClimate dataset further confirms the generalizability of the task and annotation framework. FOCUS sets a new standard for pedagogically grounded and informative Socratic question generation.

pdf bib
MedPath: Multi-Domain Cross-Vocabulary Hierarchical Paths for Biomedical Entity Linking
Nishant Mishra | Wilker Aziz | Iacer Calixto

Progress in biomedical Named Entity Recognition (NER) and Entity Linking (EL) is currently hindered by a fragmented data landscape, a lack of resources for building explainable models, and the limitations of semantically-blind evaluation metrics. To address these challenges, we present MedPath, a large-scale and multi-domain biomedical EL dataset that builds upon nine existing expert-annotated EL datasets. In MedPath, all entities are 1) normalized using the latest version of the Unified Medical Language System (UMLS), 2) augmented with mappings to 62 other biomedical vocabularies and, crucially, 3) enriched with full ontological paths—i.e., from general to specific—in up to 11 biomedical vocabularies. MedPath directly enables new research frontiers in biomedical NLP, facilitating training and evaluation of semantic-rich and interpretable EL systems, and the development of the next generation of interoperable and explainable clinical NLP models.

pdf bib
AURA-QG: Automated Unsupervised Replicable Assessment for Question Generation
Rajshekar K | Harshad Khadilkar | Pushpak Bhattacharyya

Question Generation (QG) is central to information retrieval, education, and knowledge assessment, yet its progress is bottlenecked by unreliable and non-scalable evaluation practices. Traditional metrics fall short in structured settings like document-grounded QG, and human evaluation, while insightful, remains expensive, inconsistent, and difficult to replicate at scale. We introduce AURA-QG: an Automated, Unsupervised, Replicable Assessment pipeline that scores question sets using only the source document. It captures four orthogonal dimensions, i.e., answerability, non-redundancy, coverage, and structural entropy, without needing reference questions or relative baselines. Our method is modular, efficient, and agnostic to the question generation strategy. Through extensive experiments across four domains, i.e., car manuals, economic surveys, health brochures, and fiction, we demonstrate its robustness across input granularities and prompting paradigms. Chain-of-Thought prompting, which first extracts answer spans and then generates targeted questions, consistently yields higher answerability and coverage, validating the pipeline’s fidelity. The metrics also exhibit strong agreement with human judgments, reinforcing their reliability for practical adoption. The complete implementation of our evaluation pipeline is publicly available.

pdf bib
EFSA-CLC: Enhancing Zero-shot Entity-level Financial Sentiment Analysis with Cross-lingual Collaboration
Senbin Zhu | Hongde Liu | Chenyuan He | Yuxiang Jia

Entity-level sentiment analysis is becoming increasingly important in the context of diverse financial texts, and large language models demonstrate significant potential under zero-shot settings. While it is well recognized that different languages embody distinct cognitive patterns, the use of multilingual capabilities in large language models to enable cross-lingual collaborative reasoning in the financial domain remains insufficiently studied. To address this, we propose a Cross-Lingual Collaboration (CLC) method: first, financial texts are aligned from one language to another based on semantic and syntactic structures, enabling the model to capture complementary linguistic features. Then, we integrate sentiment analysis results from both languages through redundancy removal and conflict resolution, enhancing the effectiveness of cross-lingual collaboration. Our experiments cover seven languages from three language families, including six UN official languages, and evaluate CLC on two English datasets and one Chinese dataset. Results show that multilingual collaboration improves sentiment analysis accuracy, especially among linguistically similar languages. Furthermore, stronger reasoning capabilities in LLMs amplify these benefits. Our code is available at https://anonymous.4open.science/r/Cross-lingual-Collaboration.

pdf bib
Enhancing Low-Resource Text Classification with LLM-Generated Corpora : A Case Study on Olfactory Reference Extraction
Cédric Boscher | Shannon Bruderer | Christine Largeron | Véronique Eglin | Elöd Egyed-Zsigmond

Extracting sensory information from text, particularly olfactory references, is challenging due to limited annotated datasets and the implicit, subjective nature of sensory experiences. This study investigates whether GPT-4o-generated data can complement or replace human annotations. We evaluate human- and LLM-labeled corpora on two tasks: coarse-grained detection of olfactory content and fine-grained sensory term extraction. Despite lexical variation, generated texts align well with real data in semantic and sensorimotor embedding spaces. Models trained on synthetic data perform strongly, especially in low-resource settings. Human annotations offer better recall by capturing implicit and diverse aspects of sensoriality, while GPT-4o annotations show higher precision through clearer pattern alignment. Data augmentation experiments confirm the utility of synthetic data, though trade-offs remain between label consistency and lexical diversity. These findings support using synthetic data to enhance sensory information mining when annotated data is limited.

pdf bib
Learning from *Sufficient* Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies
Jonathan Kamp | Lisa Beinborn | Antske Fokkens

Human explanations of natural language, *rationales*, form a tool to assess whether models learn a label *for the right reasons* or rely on dataset-specific shortcuts. *Sufficiency* is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.
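For reference, sufficiency in this line of work is typically computed as the drop in the predicted-class probability when the model is shown only the rationale tokens instead of the full input (DeYoung et al., 2020). A small sketch, assuming the two probabilities are already available per instance:

```python
import numpy as np

def sufficiency(prob_full, prob_rationale_only):
    """Drop in the predicted-class probability when the model sees only the
    rationale tokens instead of the full input. Values near zero mean the
    rationale alone is (nearly) sufficient for the prediction."""
    return np.asarray(prob_full) - np.asarray(prob_rationale_only)

# Toy probabilities of the originally predicted class for three instances.
full = np.array([0.92, 0.81, 0.70])             # full input
rationale_only = np.array([0.90, 0.55, 0.68])   # rationale tokens only
print(sufficiency(full, rationale_only))         # per-instance scores
print(sufficiency(full, rationale_only).mean())  # dataset-level average
```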

pdf bib
ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
Xiao Wang | Daniil Larionov | Siwei Wu | Yiqi Liu | Steffen Eger | Nafise Sadat Moosavi | Chenghua Lin

Recent advances in automatic evaluation of natural language generation have increasingly relied on large language models as general-purpose metrics. While effective, these approaches often require high-capacity models, which introduce substantial computational costs, and remain susceptible to known evaluation pathologies, such as over-reliance on likelihood. We introduce ContrastScore, a contrastive evaluation paradigm that builds on the widely used BARTScore formulation by comparing token-level probabilities between a stronger and a weaker model. Instead of relying on single-model likelihoods or prompt-based judgments, ContrastScore captures disagreement between models to better reflect confidence and uncertainty in generation quality. Empirical results on summarization and machine translation benchmarks show that ContrastScore, instantiated with paired moderate-scale models across both Qwen and LLaMA families, consistently outperforms larger alternatives, such as Qwen 7B and LLaMA 8B, in correlation with human ratings. In addition to improving evaluation quality, ContrastScore significantly reduces susceptibility to likelihood bias, offering a more robust and cost-effective alternative to larger LLM-based evaluation methods.
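The abstract leaves the exact scoring rule implicit; one plausible instantiation, sketched below, scores a candidate text by the average per-token log-probability margin between the stronger and the weaker model. It assumes the two models share a tokenizer (e.g., two sizes from the same family); GPT-2 and DistilGPT-2 are used purely as small stand-ins.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_logprobs(model, tokenizer, text):
    """Per-token log-probabilities of `text` under a causal language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]

def contrastive_score(strong, weak, tokenizer, text):
    """Average per-token margin between the stronger and the weaker model;
    larger values indicate text the stronger model finds distinctly more
    plausible than the weaker one."""
    margin = token_logprobs(strong, tokenizer, text) - token_logprobs(weak, tokenizer, text)
    return margin.mean().item()

tok = AutoTokenizer.from_pretrained("gpt2")
strong = AutoModelForCausalLM.from_pretrained("gpt2")        # stand-in "stronger" model
weak = AutoModelForCausalLM.from_pretrained("distilgpt2")    # stand-in "weaker" model
print(contrastive_score(strong, weak, tok, "The committee approved the proposal."))
```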

pdf bib
ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance
Wissam Antoun | Benoît Sagot | Djamé Seddah

Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT’s primary advantage being its support for long context, faster training, and inference speed. However, the new proposed model still provides meaningful architectural improvements compared to earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pre-training data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings show the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.

pdf bib
Simplified Rewriting Improves Expert Summarization
Xingmeng Zhao | Tongnian Wang | Anthony Rios

Radiology report summarization (RRS) is critical for clinical workflows, requiring concise “Impressions” distilled from detailed “Findings”. This paper proposes a novel prompting strategy that enhances RRS by introducing a layperson summary as an intermediate step. This summary helps normalize key observations and simplify complex terminology using communication techniques inspired by doctor–patient interactions. Combined with few-shot in-context learning, this approach improves the model’s ability to map generalized descriptions to specific clinical findings. We evaluate our method on three benchmark datasets, MIMIC-CXR, CheXpert, and MIMIC-III, and compare it against state-of-the-art open-source language models in the 7B/8B parameter range, such as Llama-3.1-8B-Instruct. Results show consistent improvements in summarization quality, with gains of up to 5% on some metrics for prompting, and more than 20% for some models when instruction tuning is applied.

pdf bib
RASTeR: Robust, Agentic, and Structured Temporal Reasoning
Dan Schumacher | Fatemeh Haji | Tara Grey | Niharika Bandlamudi | Nupoor Karnik | Gagana Uday Kumar | Cho-Yu Jason Chiang | Peyman Najafirad | Nishant Vishwamitra | Anthony Rios

Temporal question answering (TQA) remains a persistent challenge for large language models (LLMs), particularly in retrieval-augmented generation (RAG) settings where retrieved content may be irrelevant, outdated, or temporally inconsistent. This is especially critical in applications like clinical event ordering, policy tracking, and real-time decision-making, which require reliable temporal reasoning even under noisy or misleading context. To address this challenge, we introduce RASTeR: Robust, Agentic, and Structured Temporal Reasoning, an agentic prompting framework that separates context evaluation from answer generation. RASTeR first assesses the relevance and temporal coherence of retrieved context, then constructs a structured temporal knowledge graph (TKG) to better facilitate reasoning. When inconsistencies are detected, RASTeR selectively corrects or discards context before generating an answer. Across multiple datasets and LLMs, RASTeR consistently improves robustness, defined here as the model’s ability to generate correct predictions despite suboptimal context. We further validate our approach through a “needle-in-the-haystack” study, in which relevant context is buried among irrelevant distractors. Even with forty distractors, RASTeR achieves 75% accuracy, compared to the runner-up model, which reaches only 62%.

pdf bib
ReasoningWeekly: A General Knowledge and Verbal Reasoning Challenge for Large Language Models
Zixuan Wu | Francesca Lucchetti | Aleksander Boruch-Gruszecki | Jingmiao Zhao | Carolyn Jane Anderson | Joydeep Biswas | Federico Cassano | Arjun Guha

Existing benchmarks for frontier models often test specialized, “PhD-level” knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark with 613 problems based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models’ mistakes are easy to spot. As LLMs are more widely deployed in society, we believe it is useful to develop benchmarks for frontier models that humans can understand without the need for deep domain expertise. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models on our benchmark, despite being on par with other models when tested on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with “I give up” before providing an answer that it knows is wrong. R1 can also be remarkably “uncertain” in its output, and in rare cases it does not “finish thinking,” which suggests the need for techniques to “wrap up” before the context window limit is reached. We also quantify the effectiveness of reasoning longer to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.

pdf bib
Noise May Drown Out Words but Foster Compositionality: The Advantage of the Erasure and Deletion Noisy Channels on Emergent Communication
Cezary Klamra | Francijn Keur | Raquel G. Alhama

We investigate communication emerging in noisy environments with the goal of capturing the impact of message disruption on the emerged protocols. We implement two different noise mechanisms, inspired by the erasure and deletion channels studied in information theory, and simulate a referential game in a neural agent-based model with a variable message length channel. We leverage a stochastic evaluation setting to apply noise only after a message is sampled, which adds ecological validity and allows us to estimate information-theoretic measures of the emerged protocol directly from symbol probabilities. Contrary to our expectations, the emerged protocols do not become more redundant with the presence of noise; instead, we observe that certain levels of noise encourage the sender to produce more compositional messages, although the impact varies depending on the type of noise and input representation.

pdf bib
Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels
Sourya Dipta Das | Shubham Kumar | Kuldeep Yadav

Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature. Developing accurate grammar scoring models further requires extensive expert annotation, making large-scale data creation impractical. To address these limitations, we propose a zero-shot grammar competency estimation framework that leverages unlabeled data and Large Language Models (LLMs) without relying on manual labels. During training, we employ LLM-generated predictions on unlabeled data by using grammar competency rubric-based prompts. These predictions, treated as pseudo labels, are utilized to train a transformer-based model through a novel training framework designed to handle label noise effectively. We show that the choice of LLM for pseudo-label generation critically affects model performance and that the ratio of clean-to-noisy samples during training strongly influences stability and accuracy. Finally, a qualitative analysis of error intensity and score prediction confirms the robustness and interpretability of our approach. Experimental results demonstrate the efficacy of our approach in estimating grammar competency scores with high accuracy, paving the way for scalable, low-resource grammar assessment systems.

pdf bib
LLM-Guided Lifecycle-Aware Clustering of Multi-Turn Customer Support Conversations
Priyaranjan Pattnayak | Sanchari Chowdhuri | Amit Agarwal | Hitesh Laxmichand Patel

Clustering customer chat data is vital for cloud providers handling multi-service queries. Traditional methods struggle with overlapping concerns and create broad, static clusters that degrade over time. Re-clustering disrupts continuity, making issue tracking difficult. We propose an adaptive system that segments multi-turn chats into service-specific concerns and incrementally refines clusters as new issues arise. Cluster quality is tracked via Davies–Bouldin Index (DBI) and Silhouette Scores, with LLM-based splitting applied only to degraded clusters. Our method improves Silhouette Scores by over 100% and reduces DBI by 65.6% compared to baselines, enabling scalable, real-time analytics without full re-clustering.
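
The monitoring step described above can be sketched with standard clustering metrics; the silhouette threshold and the downstream LLM-based splitter are placeholders rather than the paper's settings:

```python
# Flag only degraded clusters for LLM-based splitting, using global DBI /
# Silhouette plus per-cluster silhouette means. Threshold is illustrative.
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score, silhouette_samples

def degraded_clusters(X, labels, sil_threshold=0.15):
    print("DBI:", davies_bouldin_score(X, labels))
    print("Silhouette:", silhouette_score(X, labels))
    per_point = silhouette_samples(X, labels)
    flagged = []
    for c in np.unique(labels):
        if per_point[labels == c].mean() < sil_threshold:
            flagged.append(int(c))  # only these clusters would be sent to the LLM splitter
    return flagged
```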

pdf bib
Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities
Yunxiang Yan | Tomohiro Sawada | Kartik Goyal

While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on **cascaded question disclosure** that provides a more accurate estimate of the models’ problem-solving capabilities while maintaining scalability and automation. This approach collects model responses in a stagewise manner, with each stage revealing partial information about the question, designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings by extensive ablation studies.

pdf bib
Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning
Yahui Fu | Zi Haur Pang | Tatsuya Kawahara

User satisfaction in dialogue systems is inherently subjective. When the same response strategy is applied across users, minority users may assign different satisfaction ratings than majority users due to variations in individual intents and preferences. However, existing alignment methods typically train one-size-fits-all models that aim for broad consensus, often overlooking minority perspectives and user-specific adaptation. We propose a unified framework that models both individual- and group-level preferences for user satisfaction estimation. First, we introduce Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences through interpretable reasoning chains. Second, we propose an expectation-maximization-based Majority-Minority Preference-Aware Clustering (M²PC) algorithm that discovers distinct user groups in an unsupervised manner to learn group-level preferences. Finally, we integrate these components into a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes alignment with both individual and group preferences. Experiments on the Emotional Support Conversation dataset demonstrate consistent improvements in user satisfaction estimation, particularly for underrepresented user groups.

pdf bib
The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
Kaiwen Zhou | Chengzhi Liu | Xuandong Zhao | Shreedhar Jangam | Jayanth Srinivasa | Gaowen Liu | Dawn Song | Xin Eric Wang

The rapid development of large reasoning models (LRMs), such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models (LLMs). However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications. Through our multi-faceted analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source reasoning models and the o3-mini model, on both safety benchmarks and adversarial attacks, suggesting that more safety effort is needed for open-source LRMs. (2) The distilled reasoning model shows poorer safety performance compared to its safety-aligned base models. (3) The stronger the model’s reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (4) The thinking process in R1 models poses greater safety concerns than their final answers. Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models’ safety to close the gap.

pdf bib
Agnus LLM: Robust and Flexible Entity Disambiguation with decoder-only Language Models
Kristian Noullet | Ayoub Ourgani | Niklas Thomas Lakner | Lukas Kinder | Tobias Käfer

Entity disambiguation (ED) links ambiguous mentions in text to entries in a knowledge base and is a core task in entity linking systems. While pretrained decoder-only language models (DLMs) offer strong generalization capabilities, their effective use in ED has been restricted due to sensitivity to candidate order, susceptibility to hallucinated outputs, and potential dataset leakage. We introduce Agnus, a zero-shot ED framework that addresses these challenges through three core innovations: (1) order-invariant candidate encoding via shared positional embeddings and modified autoregressive attention masking, which eliminates bias from input ordering; (2) constrained decoding that ensures outputs are restricted to valid candidates, effectively preventing hallucinations; and (3) a synthetic dataset creation approach that serves as a diagnostic tool for detecting and mitigating data contamination. Agnus eliminates up to 15.2% of F1 variability caused by candidate permutations, delivering consistent and order-robust predictions previously unattainable with autoregressive architectures. In our experiments, Agnus achieves state-of-the-art performance on four standard ED benchmarks, surpassing prior zero-shot approaches by an average of 3.7% using small language models. We release code, data including candidate sets, and a synthetic benchmark to support reproducibility and controlled evaluation.

pdf bib
MuSciClaims: Multimodal Scientific Claim Verification
Yash Kumar Lal | Manikanta Bandham | Mohammad Saqib Hasan | Apoorva Kashi | Mahnaz Koupaee | Niranjan Balasubramanian

Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark, MuSciClaims, accompanied by diagnostic tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show that most vision-language models perform poorly (~0.3-0.5 F1), with even the best model only achieving 0.72 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show that models struggle to localize correct evidence within figures and to aggregate information across modalities, and often fail to understand basic components of the figure.

pdf bib
Program Synthesis Dialog Agents for Interactive Decision-Making
Matthew Toles | Nikhil Balwani | Rattandeep Singh | Valentina Giulia Sartori Rodriguez | Zhou Yu

Many real-world eligibility problems, ranging from medical diagnosis to tax planning, can be mapped to decision problems expressed in natural language, wherein a model must make a binary choice based on the features of the user. Large-scale domains such as legal codes or frequently updated funding opportunities render human annotation (e.g., web forms or decision trees) impractical, suggesting a need for agents that can automatically assist in decision-making. Since relevant information is often only known to the user, it is important that these agents can ask the right questions. To evaluate this task, we propose BeNYfits, a new benchmark for determining user eligibility for multiple overlapping social benefits opportunities through interactive decision-making. Our experiments show that current language models struggle with frequent hallucinations, with GPT-4o scoring only 35.7 F1 using a ReAct-style chain-of-thought. We therefore introduce ProADA, a novel approach that uses program synthesis to assist in decision-making by mapping dialog planning to a code generation problem and using gaps in structured data to determine the best next action. Our agent, ProADA, improves the F1 score to 56.2 while using nearly the same number of dialog turns.

pdf bib
Learning a Continue-Thinking Token for Enhanced Test-Time Scaling
Liran Ringel | Elad Tolochinsky | Yaniv Romano

Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing “</think>” with “Wait”) can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment distilled versions of DeepSeek-R1 with a single learned “<|continue-thinking|>” token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., “Wait”) for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model’s accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.
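
A minimal sketch of the embedding-only setup described above (the RL objective itself is omitted); the checkpoint name and gradient-masking details are illustrative assumptions, not the paper's code:

```python
# Add a new control token and train only its input-embedding row, keeping all
# other weights frozen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # one distilled R1 checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

tok.add_special_tokens({"additional_special_tokens": ["<|continue-thinking|>"]})
model.resize_token_embeddings(len(tok))
new_id = tok.convert_tokens_to_ids("<|continue-thinking|>")

for p in model.parameters():            # freeze everything ...
    p.requires_grad_(False)
emb = model.get_input_embeddings().weight
emb.requires_grad_(True)                # ... except the embedding matrix,
mask = torch.zeros_like(emb)
mask[new_id] = 1.0
emb.register_hook(lambda g: g * mask)   # where only the new token's row receives gradient
```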

pdf bib
Video-guided Machine Translation: A Survey of Models, Datasets, and Challenges
Pinaki Das | Virendra Singh | Pushpak Bhattacharyya | Gholamreza Haffari

In recent years, machine translation has evolved with the integration of multimodal information. Infusing multimodality into translation tasks reduces ambiguity and improves translation quality. Common modalities include images, speech, and videos, which provide additional context alongside the text to be translated. While multimodal translation with images has been extensively studied, video-guided machine translation (VMT) has gained increasing attention, particularly since Wang et al. (2019) first explored this task. In this paper, we provide a comprehensive overview of VMT, highlighting its unique challenges, methodologies, and recent advancements. Unlike previous surveys that primarily focus on image-guided multimodal machine translation, this work explores the distinct complexities and opportunities introduced by adding video as a modality to the translation task.

pdf bib
ProST: Progressive Sub-task Training for Pareto-Optimal Multi-agent Systems Using Small Language Models
Biddut Sarker Bijoy | Mohammad Saqib Hasan | Pegah Alipoormolabashi | Avirup Sil | Aruna Balasubramanian | Niranjan Balasubramanian

Multi-agent systems with smaller language models (SLMs) present a viable alternative to single agent systems powered by large language models (LLMs) for addressing complex problems. In this work, we study how these alternatives compare in terms of both effectiveness and efficiency. To study this trade-off, we instantiate single- and multi-agent systems for the complex problems in the AppWorld environment using language models of different sizes. We find that difficulties with long-trajectory learning in SLMs limit their performance. Even when trained for specialized roles, SLMs fail to learn all subtasks effectively. To address this issue, we introduce a simple progressive sub-task training strategy, which introduces new sub-tasks progressively in each training epoch. We find that this novel strategy, analogous to instance-level curriculum learning, consistently improves the effectiveness of multi-agent systems across all configurations. Our Pareto analysis shows that fine-tuned multi-agent systems yield better effectiveness-efficiency trade-offs. Additional ablations and analyses show the importance of our progressive training strategy and its ability to reduce subtask error rates.
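
The progressive schedule can be sketched as a simple curriculum loop; the sub-task names and training callback below are placeholders, not the paper's implementation:

```python
# Each epoch unlocks one additional sub-task, so the agent trains on early
# sub-tasks first and later ones are mixed in gradually.
def progressive_training(subtasks, data_by_subtask, train_one_epoch, n_epochs):
    for epoch in range(n_epochs):
        active = subtasks[: min(epoch + 1, len(subtasks))]
        pool = [ex for task in active for ex in data_by_subtask[task]]
        train_one_epoch(pool)
        print(f"epoch {epoch}: trained on {active}")

# Toy usage with stand-in sub-tasks and a no-op training step.
data = {"login": ["ex1"], "search": ["ex2"], "checkout": ["ex3"]}
progressive_training(["login", "search", "checkout"], data, lambda pool: None, 3)
```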

pdf bib
Rethinking Large Language Model Architectures for Sequential Recommendations
Hanbing Wang | Xiaorui Liu | Wenqi Fan | Xiangyu Zhao | Venkataramana Kini | Devendra Pratap Yadav | Fei Wang | Zhen Wen | Hui Liu

In recent times, there has been a shift towards adapting sequential recommendation to the LLM paradigm to harness the capabilities of LLMs. These methods typically formulate recommendation data into natural language and train the model to forecast the subsequent item in an auto-regressive manner. Despite their notable success, the significant computational burden during inference poses a major challenge to their practical implementation. In this study, we aim to streamline current LLM-based recommendation models and introduce a straightforward yet highly effective model, Lite-LLM4Rec. The primary objective of Lite-LLM4Rec is to ensure efficient inference for the sequential recommendation task. Lite-LLM4Rec circumvents step-by-step beam search decoding by employing a direct item projection head to produce ranking scores in one step. This design arises from our empirical finding that beam search decoding is ultimately unnecessary for sequential recommendations. Additionally, Lite-LLM4Rec introduces a hierarchical LLM structure crafted to efficiently handle the extensive contextual information of items and the redundant-computation issue, thus diminishing computational overhead while enjoying the power of LLMs. Experiments on four publicly available datasets validate the efficacy of Lite-LLM4Rec in enhancing both performance and inference efficiency (notably 46.8% performance improvement and 99.48% efficiency improvement on ML-1m) compared to existing LLM-based methods. Our implementations are available at: https://github.com/HanbingWang2001/Lite-LLM4Rec-PyTorch.

pdf bib
DSBC : Data Science task Benchmarking with Context engineering
Ram Mohan Rao Kadiyala | Jebish Purbey | Siddhant Gupta | Giulio Martini | Suman Debnath | Hamza Farooq

Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with context engineering, multi-step with context engineering, and with SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research of more robust and effective data science agents.

pdf bib
Improving Document Retrieval Coherence for Semantically Equivalent Queries
Stefano Campese | Alessandro Moschitti | Ivano Lauriola

Dense Retrieval (DR) models have proven to be effective for Document Retrieval and Information Grounding tasks. Usually, these models are trained and optimized for improving the relevance of top-ranked documents for a given query. Previous work has shown that popular DR models are sensitive to the query and document lexicon: small lexical variations may lead to a significant difference in the set of retrieved documents. In this paper, we propose a variation of the Multi-Negative Ranking loss for training DR models that improves the coherence of models in retrieving the same documents for semantically similar queries. The loss penalizes discrepancies between the top-k ranked documents retrieved for diverse but semantically equivalent queries. We conducted extensive experiments on various datasets: MS-MARCO, Natural Questions, BEIR, and TREC DL 19/20. The results show that models optimized with our loss exhibit (i) lower sensitivity and, (ii) interestingly, higher accuracy.
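
One way to instantiate such a coherence-aware objective (an assumption about the form, not the paper's exact loss) is to add a consistency penalty between the score distributions that two equivalent queries induce over the same candidate documents:

```python
# In-batch multiple-negative ranking loss for two paraphrases of each query,
# plus a KL penalty that pushes the paraphrases to rank documents the same way.
import torch
import torch.nn.functional as F

def coherent_mnr_loss(q1, q2, docs, temp=0.05, lam=0.1):
    """q1, q2: (B, d) embeddings of two semantically equivalent queries;
    docs: (B, d) embeddings of the corresponding positive documents."""
    s1 = q1 @ docs.T / temp                      # (B, B) similarity scores
    s2 = q2 @ docs.T / temp
    labels = torch.arange(docs.size(0), device=docs.device)
    ranking = F.cross_entropy(s1, labels) + F.cross_entropy(s2, labels)
    consistency = F.kl_div(F.log_softmax(s1, dim=-1),
                           F.softmax(s2, dim=-1), reduction="batchmean")
    return ranking + lam * consistency
```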

pdf bib
Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models
Akshar Tumu | Varad Shinde | Parisa Kordjamshidi

Spatial reasoning is an important component of human cognition and is an area in which the latest vision-language models (VLMs) show signs of difficulty. Current analyses often rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for the evaluation of spatial reasoning by VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation (‘not’). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, the relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.

pdf bib
FINDR: A Fast Influential Data Selector for NL2Code Pretraining
Xinliang Frederick Zhang | Lu Wang

Pretraining on massive corpora has given rise to large language models (LLMs) with multi-task capabilities. However, real-world applications often require more specialized training, as is the case for NL2Code. We approach this specialization through the lens of data selection, i.e., identifying a subset of a large corpus that aligns with a desired target distribution—a challenge that remains under-explored within NL2Code. Existing methods are typically designed for selecting instruction-tuning data and might not easily scale to large code repositories; while methods for NL2Code do exist, they primarily rely on coarse heuristics, such as repo stars, for filtering. To bridge this gap, we propose FINDR, an efficient data selection method that extends logistic regression with feature-wise importance reweighting, making it, to our knowledge, the first fine-grained data selection solution for NL2Code pretraining. Our method uses hashed n-grams and code-aware features to capture code-specific patterns, and then applies informative priors to reweight feature importance when computing influence scores. Extensive experiments on NL2Python and NL2SQL, with two model families, show that FINDR consistently outperforms strong baselines in both execution accuracy and token efficiency. Notably, pretraining on only 2% of FINDR-selected data boosts Gemma by over 29% in both domains, even surpassing CodeGemma (pretrained on 300x more examples) by 10% in Python.
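
A stripped-down influence-style selector in a similar spirit (hashed n-gram features plus a logistic classifier; FINDR's code-aware features and feature-wise prior reweighting are omitted here) might look like:

```python
# Rank corpus documents by a classifier trained to separate target-domain text
# from generic corpus text, then keep the top fraction for pretraining.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def select_top_fraction(target_docs, corpus_docs, fraction=0.02):
    vec = HashingVectorizer(ngram_range=(1, 2), n_features=2**18, alternate_sign=False)
    X = vec.transform(target_docs + corpus_docs)
    y = [1] * len(target_docs) + [0] * len(corpus_docs)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    scores = clf.predict_proba(vec.transform(corpus_docs))[:, 1]
    keep = max(1, int(fraction * len(corpus_docs)))
    ranked = sorted(zip(scores, corpus_docs), reverse=True)
    return [doc for _, doc in ranked[:keep]]
```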

pdf bib
Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate
Ashim Gupta | Maitrey Mehta | Zhichao Xu | Vivek Srikumar

Large language models (LLMs) provide detailed and impressive responses to queries in English. However, are they really consistent at responding to the same query in other languages? The standard way of evaluating the multilingual performance of LLMs requires expensive-to-collect annotated datasets. Further, evaluating tasks like open-ended generation, where multiple correct answers may exist, is nontrivial. Instead, we propose to evaluate the predictability of model responses across different languages. In this work, we propose a framework to evaluate an LLM’s cross-lingual consistency based on a simple Translate then Evaluate strategy. We instantiate this evaluation framework along two dimensions of consistency: information and empathy. Our results reveal pronounced inconsistencies in popular LLM responses across thirty languages, with severe performance deficits in certain language families and scripts, underscoring critical weaknesses in their multilingual capabilities. These findings necessitate cross-lingual evaluations that are consistent along multiple dimensions. We invite practitioners to use our framework for future multilingual LLM benchmarking.

pdf bib
Do Persona-Infused LLMs Affect Performance in a Strategic Reasoning Game?
John Licato | Stephen Steinle

Although the use of persona prompting in large language models appears to trigger different styles of generated text, it is unclear whether these translate into measurable behavioral differences. Furthermore, little work has studied whether these differences, when they do exist, can affect decision-making in an adversarial strategic environment. We investigate the impact of persona prompting on strategic performance in PERIL, a world domination board game. Specifically, we compare the effectiveness of persona-derived heuristics to those chosen manually. Our findings reveal that personality traits intuitively associated with strategic thinking do appear to improve game performance, but only when an additional mediator is used to translate personas into heuristic values. We introduce this mediator as a structured translation process, inspired by exploratory factor analysis, that maps LLM-generated inventory responses into strategic heuristics. Results indicate our method enhances heuristic reliability and face validity when compared to directly inferred heuristics, allowing us to better study the effect of persona types on decision-making behaviors. These insights advance our understanding of how persona prompting influences LLM-based decision-making and propose a novel heuristic generation method that adds to the growing body of work applying psychometric principles to LLMs.

pdf bib
FarSense: A Comprehensive Commonsense Benchmark and Evaluation Framework for the Farsi Language
Kamyar Zeinalipour | Neda Jamshidi | Seyedehbahareh Hejazi | Marco Maggini | Monica Bianchini | Simone Paoletti | Marco Gori

Although Farsi is widely spoken, no comprehensive benchmark exists for assessing commonsense reasoning in language models. We therefore present FarSense, a 6‐task benchmark for Farsi covering True/False judgment, multiple-choice questions, Explanation, Cause‐Effect inference, Counterfactual reasoning, and Knowledge Completion. Starting from Farsi‐Wikipedia, we filtered noise and retained ~4,210 passages, rewrote them into realistic daily scenarios, and derived the above tasks from each scenario. Scenario and task generation quality was first judged via native‐speaker annotations on outputs from five major LLMs—GPT‐4o, Gemini-2.5-Flash, Mistral-Large, Qwen‐Plus, and DeepSeek‐Chat. Gemini-2.5-Flash demonstrated the highest performance, leading to its use in generating a large-scale dataset, subsequently finalized through meticulous two-step human validation. Using FarSense, we measured the commonsense ability of the same five flagship LLMs and also fine‐tuned six compact models (1B–24B parameters) before re‐evaluating them. To ensure broad applicability, task wording was designed to minimize dialectal, cultural, or religious bias. Experiments show that targeted fine‐tuning yields substantial gains, confirming FarSense as a reliable, openly licensed resource for advancing reproducible commonsense understanding research in Farsi NLP. We publicly release all code and data at https://github.com/KamyarZeinalipour/FarSense.

pdf bib
GL-CLiC: Global-Local Coherence and Lexical Complexity for Sentence-Level AI-Generated Text Detection
Rizky Adi | Bassamtiano Renaufalgi Irnawan | Yoshimi Suzuki | Fumiyo Fukumoto

Unlike document-level AI-generated text (AIGT) detection, sentence-level AIGT detection remains underexplored, despite its importance for addressing collaborative writing scenarios where humans modify AIGT suggestions on a sentence-by-sentence basis. Prior sentence-level detectors often neglect the valuable context surrounding the target sentence, which may contain crucial linguistic artifacts that indicate a potential change in authorship. We propose **GL-CLiC**, a novel technique that leverages both **G**lobal and **L**ocal signals of **C**oherence and **L**ex**i**cal **C**omplexity, which we operationalize through discourse analysis and CEFR-based vocabulary sophistication. **GL-CLiC** models local coherence and lexical complexity by examining a sentence’s relationship with its neighbors or peers, complemented with its document-wide analysis. Our experimental results show that **GL-CLiC** achieves superior performance and better generalization across domains compared to existing methods.

pdf bib
Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance
Ram Mohan Rao Kadiyala | Siddartha Pullakhandam | Siddhant Gupta | Jebish Purbey | Drishti Sharma | Kanwal Mehreen | Muhammad Arham | Suman Debnath | Hamza Farooq

Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bilingual LLM with a ~3% average improvement in benchmark scores across both languages, outperforming models twice its size. Using a curated dataset of 485K English and Hindi instruction samples, we instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance in both English and Hindi. Our experiments, encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios, demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under MIT and Apache licenses to aid further research towards under-represented and low-resource languages.

pdf bib
AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR
Gabrial Zencha Ashungafac | Mardhiyah Sanni | Busayo Awobade | Alex Gichamba | Tobi Olatunji

Recent advances in speech‐enabled AI, including Google’s NotebookLM and OpenAI’s speech-to-speech API, are driving widespread interest in voice interfaces across sectors such as finance, health, agritech, legal services, and call centers in the global north and south. Despite this momentum, there exists no publicly available application-specific model evaluation that caters to Africa’s linguistic diversity. We present AfriSpeech‑MultiBench, the first domain‐specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities, and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous speech conversations drawn from various open African-accented English speech datasets. Our empirical analysis reveals systematic variation: open‐source ASR excels in spontaneous speech contexts but degrades on noisy, non‐native dialogue; multimodal LLMs are more accent‐robust yet struggle with domain‐specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Smaller models fine‐tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment. By releasing this benchmark, we empower practitioners and researchers to select voice technologies suited to African use‐cases, fostering inclusive voice applications for underserved communities.

pdf bib
An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation
Vimaleswar A | Prabhu Nandan Sahu | Nilesh Kumar Sahu | Haroon R Lone

Mental health plays a crucial role in the overall well-being of an individual. In recent years, digital platforms have increasingly been used to expand mental health and emotional support. However, there are persistent challenges related to limited user accessibility, internet connectivity, and data privacy, which highlight the need for offline, smartphone-based solutions. To address these challenges, we propose **EmoSApp (Emotional Support App)**: an entirely offline, smartphone-based conversational app designed to provide mental health and emotional support. EmoSApp leverages a language model, specifically LLaMA-3.2-1B-Instruct, which is fine-tuned and quantized on a custom-curated “Knowledge Dataset” comprising 14,582 mental health QA pairs along with multi-turn conversational data, enabling robust domain expertise and fully on-device inference on resource-constrained smartphones. Through qualitative evaluation with students and mental health professionals, we demonstrate that EmoSApp has the ability to respond coherently and empathetically, provide relevant suggestions to users’ mental health problems, and maintain interactive dialogue. Additionally, quantitative evaluations on nine commonsense and reasoning benchmarks, along with two mental health-specific datasets, demonstrate EmoSApp’s effectiveness in low-resource settings. By prioritizing on-device deployment and specialized domain-specific adaptation, EmoSApp serves as a blueprint for future innovations in portable, secure, and highly tailored AI-driven mental health support.

pdf bib
Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective
Tejas Anvekar | Krishna Singh Rajput | Chitta Baral | Vivek Gupta

Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality, which ultimately limits both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.

pdf bib
SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing
Jing Chen | Zhiheng Yang | Yixian Shen | Jie Liu | Adam Belloum | Paola Grosso | Chrysa Papagianni

Survey papers play a critical role in scientific communication by consolidating progress across a field. Recent advances in Large Language Models (LLMs) offer a promising solution by automating key steps in the survey-generation pipeline, such as retrieval, structuring, and summarization. However, existing LLM-based approaches often struggle with maintaining coherence across long, multi-section surveys and providing comprehensive citation coverage. To address these limitations, we introduce SurveyGen-I, an automatic survey generation framework that combines coarse-to-fine retrieval, adaptive planning, and memory-guided generation. SurveyGen-I performs survey-level retrieval to construct the initial outline and writing plan, then dynamically refines both during generation through a memory mechanism that stores previously written content and terminology, ensuring coherence across subsections. When the system detects insufficient context, it triggers fine-grained subsection-level retrieval. Experiments across six scientific domains demonstrate that SurveyGen-I consistently outperforms previous works in content quality, consistency, and citation coverage. The code is available at https://github.com/SurveyGens/SurveyGen-I.

pdf bib
VAGUEGate: Plug‐and‐Play Local‐Privacy Shield for Retrieval‐Augmented Generation
Arshia Hemmat | Matin Moqadas | Ali Mamanpoosh | Amirmasoud Rismanchian | Afsaneh Fatemi

Retrieval-augmented generation (RAG) still *forwards* raw passages to large language models, so private facts slip through. Prior defenses are either (i) **heavyweight**: full DP training that is impractical for today’s 70B-parameter models, or (ii) **over-zealous**: blanket redaction of every named entity, which slashes answer quality. We introduce **VAGUE-Gate**, a lightweight, *locally* differentially-private gate deployable in front of *any* RAG system. A precision pass drops low-utility tokens under a user budget ε, then up to k(ε) high-temperature paraphrase passes further cloud residual cues; post-processing guarantees preserve the same ε-LDP bound. To measure both privacy and utility, we release **BlendPriv** (3k blended-sensitivity QA pairs) and two new metrics: a lexical Information-Leakage Score and an LLM-as-Judge score. Across eight pipelines and four SOTA LLMs, **VAGUE-Gate** at ε = 0.3 lowers lexical leakage by **70%** and semantic leakage by **1.8** points (1–5 scale) while retaining **91%** of Plain-RAG faithfulness with only a **240 ms** latency overhead. All code, data, and prompts are publicly released. Code: <https://github.com/arshiahemmat/LDP_RAG>; Dataset: <https://huggingface.co/datasets/AliMnp/BlendPriv>.

pdf bib
PII-Scope: A Comprehensive Study on Training Data Privacy Leakage in Pretrained LLMs
Krishna Kanth Nakka | Ahmed Frikha | Ricardo Mendes | Xue Jiang | Xuebing Zhou

In this work, we introduce PII-Scope, a comprehensive benchmark designed to evaluate state-of-the-art methodologies for PII extraction attacks targeting LLMs across diverse threat settings. Our study provides a deeper understanding of these attacks by uncovering several hyperparameters (e.g., demonstration selection) crucial to their effectiveness. Building on this understanding, we extend our study to more realistic attack scenarios, exploring PII attacks that employ advanced adversarial strategies, including repeated and diverse querying, and leveraging iterative learning for continual PII extraction. Through extensive experimentation, our results reveal a notable underestimation of PII leakage in existing single-query attacks. In fact, we show that with sophisticated adversarial capabilities and a limited query budget, PII extraction rates can increase by up to fivefold when targeting the pretrained model. Moreover, we evaluate PII leakage on finetuned models, showing that they are more vulnerable to leakage than pretrained models. Overall, our work establishes a rigorous empirical benchmark for PII extraction attacks in realistic threat scenarios and provides a strong foundation for developing effective mitigation strategies.

pdf bib
Indic-S2ST: a Multilingual and Multimodal Many-to-Many Indic Speech-to-Speech Translation Dataset
Nivedita Sethiya | Puneet Walia | Chandresh Kumar Maurya

Speech-to-Speech Translation (S2ST) converts speech from one language to speech in a different language. While various S2ST models exist, none adequately support Indic languages, primarily due to the lack of a suitable dataset. We fill this gap by introducing Indic-S2ST, a multilingual and multimodal many-to-many S2ST dataset of approximately 600 hours in 14 Indic languages, including Indian-accented English. To the best of our knowledge, this is the largest dataset for the S2ST task with parallel speech and text in 14 scheduled Indic languages. Our data also supports Automatic Speech Recognition (ASR), Text-to-Speech (TTS) synthesis, Speech-to-Text translation (ST), and Machine Translation (MT) due to parallel speech and text alignment. Thus, our data may be useful to train a model like Meta’s SeamlessM4T for Indic languages. We also propose Indic-S2UT, a discrete unit-based S2ST model for Indic languages. To showcase the utility of the data, we present baseline results on the Indic-S2ST data using Indic-S2UT. The dataset and codes are available at https://github.com/Nivedita5/Indic-S2ST/blob/main/README.md.

up

pdf (full)
bib (full)
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

pdf bib
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Kentaro Inui | Sakriani Sakti | Haofen Wang | Derek F. Wong | Pushpak Bhattacharyya | Biplab Banerjee | Asif Ekbal | Tanmoy Chakraborty | Dhirendra Pratap Singh

pdf bib
Is OpenVLA Truly Robust? A Systematic Evaluation of Positional Robustness
Yiran Pang | Yiheng Zhao | Zhuopu Zhou | Tingkai Hu | Ranxin Hou

Pretrained language and vision-language models have become core components in building vision-language-action models (VLAs) due to their strong spatial reasoning capabilities. Evaluating the robustness of VLAs is crucial to ensuring their reliability in practical scenarios. Although prior work has focused on background and environment robustness, positional robustness remains underexplored. In this paper, we propose a comprehensive evaluation protocol to assess the positional robustness of VLAs and apply it to OpenVLA, an open-source, high-performing, and efficient model well suited for real-world deployment. We find that OpenVLA succeeds only when the target object is placed at one of the two positions encountered during training. Even in these cases, the success rate never exceeds 50%, because the model exhibits memorized behavior: it randomly executes a grasping action toward one of the two fixed positions without relying on perception to localize the target object. This reveals that OpenVLA’s positional robustness is extremely weak.

pdf bib
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
Jiabao Ji | Bairu Hou | Alexander Robey | George J. Pappas | Hamed Hassani | Yang Zhang | Eric Wong | Shiyu Chang

Aligned large language models (LLMs) are vulnerable to jailbreaks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based attacks, there are no defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SemanticSmooth, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SemanticSmooth achieves strong robustness against both manually constructed jailbreak prompts and automatic jailbreak attacks like GCG, PAIR, and PromptRS while maintaining strong nominal performance on standard LLM evaluation benchmarks such as AlpacaEval for the instruction-following tasks and PiQA for the question-answering tasks.
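
The aggregation step can be sketched as follows; the paraphrase, response, and judging functions are placeholders standing in for SemanticSmooth's actual components:

```python
# Query the model on several semantically transformed copies of the prompt and
# return the majority label (e.g., 'refuse' vs. 'comply') over the copies.
from collections import Counter

def semantic_smooth(prompt, paraphrase, respond, judge, n_copies=5):
    copies = [prompt] + [paraphrase(prompt) for _ in range(n_copies - 1)]
    labels = [judge(respond(c)) for c in copies]
    return Counter(labels).most_common(1)[0][0]
```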

pdf bib
Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models
Nghia Tuan Bui | Guergana K Savova | Lijing Wang

The impact of random seeds in fine-tuning large language models (LLMs) has been largely overlooked despite its potential influence on model performance. In this study, we systematically evaluate the effects of random seeds on LLMs using the GLUE and SuperGLUE benchmarks. We analyze the macro impact through traditional metrics like accuracy and F1, calculating their mean and variance to quantify performance fluctuations. To capture the micro effects, we introduce a novel metric, consistency, measuring the stability of individual predictions across runs. Our experiments reveal significant variance at both macro and micro levels, underscoring the need for careful consideration of random seeds in fine-tuning and evaluation.
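
One natural reading of the consistency metric (the paper's exact definition may differ in detail) is the fraction of test examples that receive the same prediction under every seed:

```python
import numpy as np

def prediction_consistency(preds_by_seed):
    """preds_by_seed: (n_seeds, n_examples) array of discrete predictions."""
    preds = np.asarray(preds_by_seed)
    return float(np.mean(np.all(preds == preds[0], axis=0)))

# Three seeds, four examples; the second example flips under one seed.
print(prediction_consistency([[1, 0, 1, 1], [1, 1, 1, 1], [1, 0, 1, 1]]))  # 0.75
```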

pdf bib
How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor?
Christian Clark | Byung-Doh Oh | William Schuler

Contextual entropy is a psycholinguistic measure capturing the anticipated difficulty of processing a word just before it is encountered. Recent studies have tested for entropy-related effects as a potential complement to well-known effects from surprisal. For convenience, entropy is typically estimated based on a language model’s probability distribution over a word’s first subword token. However, this approximation results in underestimation and potential distortion of true word entropy. To address this, we generate Monte Carlo (MC) estimates of word entropy that allow words to span a variable number of tokens. Regression experiments on reading times show divergent results between first-token and MC word entropy, suggesting a need for caution in using first-token approximations of contextual entropy.
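
The Monte Carlo estimate amounts to sampling full (possibly multi-token) next words and averaging their negative log-probabilities; the sampling helper below is a stand-in for an actual LM decoder:

```python
import math
import random

def mc_word_entropy(sample_word_logprob, n_samples=512):
    """Estimate H = -E[log p(word)] by sampling words from the LM.
    `sample_word_logprob()` should sample one next word (spanning any number of
    tokens) and return its total log-probability."""
    return -sum(sample_word_logprob() for _ in range(n_samples)) / n_samples

# Toy check against a two-word distribution with p = 0.7 / 0.3 (H ≈ 0.61 nats).
dist = [(math.log(0.7), 0.7), (math.log(0.3), 0.3)]
sampler = lambda: random.choices([lp for lp, _ in dist], weights=[p for _, p in dist])[0]
print(mc_word_entropy(sampler))
```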

pdf bib
Speaking the Right Language: The Impact of Expertise (Mis)Alignment in User-AI Interactions
Shramay Palta | Nirupama Chandrasekaran | Rachel Rudinger | Scott Counts

Using a sample of 25,000 Bing Copilot conversations, we study how the agent responds to users of varying levels of domain expertise and the resulting impact on user experience along multiple dimensions. Our findings show that across a variety of topical domains, the agent largely responds at proficient or expert levels of expertise (77% of conversations) which correlates with positive user experience regardless of the user’s level of expertise. Misalignment, such that the agent responds at a level of expertise below that of the user, has a negative impact on overall user experience, with the impact more profound for more complex tasks. We also show that users engage more, as measured by the number of words in the conversation, when the agent responds at a level of expertise commensurate with that of the user. Our findings underscore the importance of alignment between users and AI when designing human-centered AI systems, to ensure satisfactory and productive interactions.

pdf bib
To Labor is Not to Suffer: Exploration of Polarity Association Bias in LLMs for Sentiment Analysis
Jiyu Chen | Sarvnaz Karimi | Diego Molla | Cecile Paris

Large language models (LLMs) are widely used for modeling sentiment trends on social media text. We examine whether LLMs have a polarity association bias—positive or negative—when encountering specific types of lexical word mentions. Such polarity association bias could lead to the wrong classification of neutral statements and thus a distorted estimation of sentiment trends. We estimate the severity of the polarity association bias across five widely used LLMs, identifying lexical word mentions spanning a diverse range of linguistic and psychological categories that correlate with this bias. Our results show a moderate to strong degree of polarity association bias in these LLMs.

pdf bib
Seeing isn’t Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms
Tyler Loakman | Joseph James | Chenghua Lin

With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in different tasks that fuse both vision and language modalities. In this work, we benchmark the extent to which VLMs are able to act as highly trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating that the difficulty of this task stems from the need for esoteric parametric knowledge of how to interpret such figures, rather than from paired samples alone.
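
The distractor-selection step can be sketched with a plain edit-distance ranking; shown here over character strings, whereas the paper operates over phoneme sequences (the candidate pool below is invented for illustration):

```python
# Pick the k candidates closest to the gold transcription by Levenshtein distance.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pick_distractors(gold, pool, k=3):
    pool = [w for w in pool if w != gold]
    return sorted(pool, key=lambda w: edit_distance(gold, w))[:k]

print(pick_distractors("kæt", ["bæt", "kɪt", "dɔg", "kæts", "mæp"]))
```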

pdf bib
A Formal Analysis of Chain-of-Thought Prompting via Turing Reductions
S M Rafiuddin | Muntaha Nujat Khan

Chain-of-Thought (CoT) prompting has emerged as a powerful empirical technique for eliciting multi-step reasoning from large language models by decomposing complex tasks into sequential subprompts. However, the formal computational trade-offs between internal computation, query count, and space usage remain unexplored. We introduce the CoT-oracle Turing machine, a formal model in which each subprompt corresponds to an oracle query, and define three resource metrics: internal time T(n), query complexity Q(n), and prompt buffer space S_prompt(n). We prove that (T, Q)-bounded CoT machines exactly capture the class P^O[Q(n)] of polynomial-time Turing reductions with Q(n) queries, derive upper bounds for P and NP-complete problems under linear and prefix-query budgets, and establish an Ω(n) query lower bound for SAT under P ≠ NP. Illustrative examples on integer factorization and SAT reconstruction, together with synthetic and LLM-based simulations, confirm our theoretical T–Q–S trade-off predictions. This framework provides principled guidelines for prompt design, noisy-oracle robustness, and cost-aware reasoning.

pdf bib
Speak & Spell: LLM-Driven Controllable Phonetic Error Augmentation for Robust Dialogue State Tracking
Jihyun Lee | Solee Im | Wonjun Lee | Gary Lee

Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of the DST model. Our novel method can control the placement of errors using keyword-highlighted prompts while introducing phonetically similar errors. As a result, our method generates sufficient error patterns on keywords, leading to improved accuracy in noisy and low-accuracy ASR environments.

pdf bib
Are Relational Triple Extraction Frameworks Sufficient for Hyper-relational Facts?
Pratik Saini | Chayan Sarkar | Tapas Nayak

Hyper-relational fact extraction involves identifying relational triples along with additional contextual information—known as qualifiers—such as time, location, or quantity. These qualifiers enable models to represent complex real-world knowledge more accurately. While numerous end-to-end models have been developed for extracting relational triples, they are not designed to handle qualifiers directly. In this work, we propose a straightforward and effective approach to extend existing end-to-end triple extraction models to also capture qualifiers. Our method reformulates qualifiers as new relations by computing the Cartesian product between qualifiers and their associated relations. This transformation allows the model to extract qualifier information as additional triples, which can later be merged to form complete hyper-relational facts. We evaluate our approach using multiple end-to-end triple extraction models on the HyperRED dataset and demonstrate its effectiveness in extracting hyper-relational facts.
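
The reformulation is simple enough to sketch directly; the relation-naming scheme below is illustrative, and the merge step is the inverse mapping:

```python
# Turn one hyper-relational fact into flat triples a standard extractor can emit:
# each qualifier becomes a new relation composed from (relation, qualifier key).
def flatten(fact):
    head, rel, tail, qualifiers = fact
    triples = [(head, rel, tail)]
    for q_key, q_val in qualifiers.items():
        triples.append((head, f"{rel}::{q_key}", q_val))
    return triples

fact = ("Barack Obama", "position_held", "President of the USA",
        {"start_time": "2009", "end_time": "2017"})
print(flatten(fact))
```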

pdf bib
IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models
Yiming Gao | Bin Wang | Chengwei Wei | Shuo Sun | AiTi Aw

Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess the instruction-following ability of audio LLMs. IFEval-Audio contains 280 audio–instruction–answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.

pdf bib
IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian
Vanessa Rebecca Wiyono | David Anugraha | Ayu Purwarianti | Genta Indra Winata

Over 200 million people speak Indonesian, yet the language remains significantly underrepresented in preference-based research for large language models (LLMs). Most existing multilingual datasets are derived from English translations, often resulting in content that lacks cultural and linguistic authenticity. To address this gap, we introduce IndoPref, the first fully human-authored and multi-domain Indonesian preference dataset designed to evaluate the naturalness and quality of LLM-generated text. The dataset contains 522 prompts and yields 4,099 human-annotated pairwise preferences from comparisons across five instruction-tuned LLMs. All annotations are natively written in Indonesian with strong inter-annotator agreement, measured by Krippendorff’s alpha. Our benchmark spans 10 diverse categories, enabling practitioners to identify LLMs’ fine-grained strengths and weaknesses.

pdf bib
Large Language Models Exhibit Limited Reasoning Ability on Coding Problems
Jinyoung Jo | Jonah Engelmann | Sean Choi

Claims that large language models (LLMs) have complex reasoning ability have stirred broad interest, and controversy, among academics and non-academics alike. A popular basis for such claims comes from LLMs’ ability to solve coding problems, which involves understanding the problem statement and providing code that solves the problem. Although such abilities are remarkable feats worth praising, we argue that they come from memorization rather than reasoning. We first show that LLMs’ problem-solving ability degrades with increased recency of the problem, likely due to the reduced amount of training data for more recent problems, regardless of the problem difficulty labeled by human experts. Additionally, we show that an LLM often fails to solve the problem when presented with reworded but equivalent problem statements, further suggesting their limited reasoning ability.

pdf bib
Documentation Retrieval Improves Planning Language Generation
Renxiang Wang | Li Zhang

Certain strong LLMs have shown promise for zero-shot formal planning by generating planning languages like PDDL. Yet, the performance of most open-source models under 50B parameters has been reported to be close to zero due to the low-resource nature of these languages. We significantly improve their performance via a series of lightweight pipelines that integrate documentation retrieval with modular code generation and error refinement. With models like Llama-4-Maverick, our best pipeline improves plan correctness from 0% to over 80% on the common BlocksWorld domain. However, while syntactic errors are substantially reduced, semantic errors persist in more challenging domains, revealing fundamental limitations in current models’ reasoning capabilities.

pdf bib
Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?
Gaurav Kamath | Sowmya Vajjala

We explore whether synthetic datasets generated by large language models using a few high-quality seed samples are useful for low-resource named entity recognition, considering 11 languages from three language families. Our results suggest that synthetic data created with such seed data is a reasonable choice when there is no available labeled data, and is better than using entirely automatically labeled data. However, a small amount of high-quality data, coupled with cross-lingual transfer from a related language, always offers better performance. Data and code available at: https://github.com/grvkamath/low-resource-syn-ner.

pdf bib
From Facts to Folklore: Evaluating Large Language Models on Bengali Cultural Knowledge
Nafis Chowdhury | Moinul Haque | Anika Ahmed | Nazia Tasnim | Md. Istiak Hossain Shihab | Sajjadur Rahman | Farig Sadeque

Recent progress in NLP research has demonstrated remarkable capabilities of large language models (LLMs) across a wide range of tasks. While recent multilingual benchmarks have advanced cultural evaluation for LLMs, critical gaps remain in capturing the nuances of low-resource cultures. Our work addresses these limitations through a Bengali Language Cultural Knowledge (BLanCK) dataset including folk traditions, culinary arts, and regional dialects. Our investigation of several multilingual language models shows that while these models perform well in non-cultural categories, they struggle significantly with cultural knowledge; performance improves substantially across all models when context is provided, emphasizing the need for context-aware architectures and culturally curated training data.

pdf bib
Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?
Tawsif Tashwar Dipto | Azmol Hossain | Rubayet Sabbir Faruque | Md. Rezuwan Hassan | Kanij Fatema | Tanmoy Shome | Ruwad Naswan | Md.Foriduzzaman Zihad | Mohaymen Ul Anam | Nazia Tasnim | Hasan Mahmud | Md Kamrul Hasan | Md. Mehedi Hasan Shawon | Farig Sadeque | Tahsin Reasat

Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages, while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR, we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations, but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under constrained resources. The dataset and code developed for this project are publicly available.

pdf bib
Gatsby without the ‘E’: Creating Lipograms with LLMs
Nitish Gokulakrishnan | Rohan Balasubramanian | Syeda Jannatus Saba | Steven Skiena

Lipograms are a unique form of constrained writing where all occurrences of a particular letter are excluded from the text, typified by the novel Gadsby (Wright, 1939), which daringly avoids all usage of the letter ‘e’. In this study, we explore the power of modern large language models (LLMs) by transforming the novel The Great Gatsby (Fitzgerald, 1925) into a fully ‘e’-less text. We experimented with a range of techniques, from baseline methods like synonym replacement to sophisticated generative models enhanced with beam search and named entity analysis. We show that excluding up to 3.6% of the most common letters (up to the letter ‘u’) had minimal impact on the text’s meaning, although translation fidelity rapidly and predictably decays with stronger lipogram constraints. Our work highlights the surprising flexibility of English under strict constraints.
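A tiny sketch of the constraint-checking side of this task (not the paper's generation pipeline): list the words in a candidate rewrite that still violate a lipogram on a given letter. The example sentences are hypothetical.

```python
import re

def lipogram_violations(text, banned_letter="e"):
    """Return the words that still contain the banned letter."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if banned_letter.lower() in w.lower()]

original = "In my younger and more vulnerable years my father gave me some advice."
rewrite = "In my youth my dad brought forth a bit of wisdom."

print(lipogram_violations(original))  # the words that must be rewritten
print(lipogram_violations(rewrite))   # [] if the rewrite is truly 'e'-less
```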

pdf bib
Hint-Augmented Re-ranking: Efficient Product Search using LLM-Based Query Decomposition
Yilun Zhu | Nikhita Vedula | Shervin Malmasi

Search queries with superlatives (e.g., best, most popular) require comparing candidates across multiple dimensions, demanding linguistic understanding and domain knowledge. We show that LLMs can uncover latent intent behind these expressions in e-commerce queries through a framework that extracts structured interpretations or hints. Our approach decomposes queries into attribute-value hints generated concurrently with retrieval, enabling efficient integration into the ranking pipeline. Our method improves search performance by 10.9 points in MAP and ranking by 5.9 points in MRR over baselines. Since direct LLM-based reranking faces prohibitive latency, we develop an efficient approach transferring superlative interpretations to lightweight models. Our findings provide insights into how superlative semantics can be represented and transferred between models, advancing linguistic interpretation in retrieval systems while addressing practical deployment constraints.

pdf bib
p²-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models
Wei Zhou | Mohsen Mesgar | Heike Adel | Annemarie Friedrich

Table question answering (TQA) focuses on answering questions based on tabular data. Developing TQA systems targets effective interaction with tabular data for tasks such as cell retrieval and data analysis. While recent work has leveraged fine-tuning to improve TQA systems, existing approaches often under-utilize available data and neglect the potential of post-training for further gains. In this work, we introduce p²-TQA, a process-based preference learning framework for TQA post-training. p²-TQA automatically constructs process-based preference data via a table-specific pipeline, eliminating the need for manual or costly data collection. It then optimizes models through contrastive learning on the collected data. Experiments show that p²-TQA effectively improves TQA models by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets with only 8,000 training instances. Furthermore, models enhanced with p²-TQA achieve competitive results against larger, more complex state-of-the-art TQA systems, while maintaining up to five times higher efficiency.

pdf bib
PerMed-MM: A Multimodal, Multi-Specialty Persian Medical Benchmark for Evaluating Vision Language Models
Ali Khoramfar | Mohammad Javad Dousti | Heshaam Faili

We present PerMed-MM, the first multimodal benchmark for Persian medical question answering. The dataset comprises 733 expert-authored multiple-choice questions from Iranian National Medical Board Exams, each paired with one to five clinically relevant images, spanning 46 medical specialties and diverse visual modalities. We evaluate five open-source and five proprietary vision language models, and find that reasoning supervision and domain-specific fine-tuning yield performance gains. Our cross-lingual analysis reveals significant unpredictability in translation-based pipelines, motivating the need for benchmarks that support direct, native-language evaluation. Additionally, domain- and modality-level analysis uncovers meaningful variation in model behavior often masked by aggregate metrics. PerMed-MM is publicly available on Hugging Face.

pdf bib
An Analysis of Large Language Models for Simulating User Responses in Surveys
Ziyun Yu | Yiru Zhou | Chen Zhao | Hongyi Wen

Using Large Language Models (LLMs) to simulate user opinions has received growing attention. Yet LLMs, especially trained with reinforcement learning from human feedback (RLHF), are known to exhibit biases toward dominant viewpoints, raising concerns about their ability to represent users from diverse demographic and cultural backgrounds. In this work, we examine the extent to which LLMs can simulate human responses to cross-domain survey questions and propose two LLM-based approaches: chain-of-thought (COT) prompting and Diverse Claims Generation (CLAIMSIM), which elicits viewpoints from LLM parametric knowledge as contextual input. Experiments on the survey question answering task indicate that, while CLAIMSIM produces more diverse responses, both approaches struggle to accurately simulate users. Further analysis reveals two key limitations: (1) LLMs tend to maintain fixed viewpoints across varying demographic features, and generate single-perspective claims; and (2) when presented with conflicting claims, LLMs struggle to reason over nuanced differences among demographic features, limiting their ability to adapt responses to specific user profiles.

pdf bib
Enhancing BERT Fine-Tuning for Sentiment Analysis in Lower-Resourced Languages
Jozef Kubík | Marek Suppa | Martin Takac

Limited data for low-resource languages typically yields weaker language models (LMs). Since pre-training is compute-intensive, it is more pragmatic to target improvements during fine-tuning. In this work, we examine the use of Active Learning (AL) methods augmented by structured data selection strategies across epochs, which we term ‘Active Learning schedulers,’ to boost the fine-tuning process with a limited amount of training data. We connect the AL process to data clustering and propose an integrated fine-tuning pipeline that systematically combines AL, data clustering, and dynamic data selection schedulers to enhance models’ performance. Several experiments on the Slovak, Maltese, Icelandic, and Turkish languages show that using clustering during the fine-tuning phase together with novel AL scheduling can simultaneously yield annotation savings of up to 30% and performance improvements of up to four F1 points, while also providing better fine-tuning stability.

pdf bib
What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction
Charlie Wyatt | Aditya Joshi | Flora D. Salim

Transformer-based models primarily rely on Next Token Prediction (NTP), which predicts the next token in a sequence based on the preceding context. However, NTP’s focus on single-token prediction often limits a model’s ability to plan ahead or maintain long-range coherence, raising questions about how well LLMs can predict longer contexts, such as full sentences within structured documents. While NTP encourages local fluency, it provides no explicit incentive to ensure global coherence across sentence boundaries—an essential skill for reconstructive or discursive tasks. To investigate this, we evaluate three commercial LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash) on Masked Sentence Prediction (MSP) — the task of infilling a randomly removed sentence — from three domains: ROCStories (narrative), Recipe1M (procedural), and Wikipedia (expository). We assess both fidelity (similarity to the original sentence) and cohesiveness (fit within the surrounding context). Our key finding reveals that commercial LLMs, despite their superlative performance in other tasks, are poor at predicting masked sentences in low-structured domains, highlighting a gap in current model capabilities.
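A minimal sketch of how a Masked Sentence Prediction instance can be constructed (the paper's exact prompt format is not shown here): remove one randomly chosen sentence from a passage and ask the model to infill it. The function and example passage are illustrative assumptions.

```python
import random
import re

def make_msp_example(passage, seed=0):
    """Mask one randomly chosen sentence and build an infilling prompt."""
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    rng = random.Random(seed)
    idx = rng.randrange(len(sentences))
    target = sentences[idx]
    masked = sentences[:idx] + ["[MASKED SENTENCE]"] + sentences[idx + 1:]
    prompt = ("Fill in the missing sentence so the passage reads naturally:\n\n"
              + " ".join(masked))
    return prompt, target

passage = ("Anna packed her bag the night before. She left home at dawn. "
           "By noon she had reached the coast. The sea was calmer than expected.")
prompt, gold = make_msp_example(passage, seed=3)
print(prompt)
print("GOLD:", gold)
```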

pdf bib
A Detailed Factor Analysis for the Political Compass Test: Navigating Ideologies of Large Language Models
Sadia Kamal | Lalu Prasad Yadav Prakash | S M Rafiuddin | Mohammed Rakib | Atriya Sen | Sagnik Ray Choudhury

The Political Compass Test (PCT) and similar surveys are commonly used to assess political bias in auto-regressive LLMs. Our rigorous statistical experiments show that while changes to standard generation parameters have minimal effect on PCT scores, prompt phrasing and fine-tuning individually and together can significantly influence results. Interestingly, fine-tuning on politically rich vs. neutral datasets does not lead to different shifts in scores. We also generalize these findings to a similar popular test called 8 Values. Humans do not change their responses to questions when prompted differently (“answer this question” vs “state your opinion”), or after exposure to politically neutral text, such as mathematical formulae. But the fact that the models do so raises concerns about the validity of these tests for measuring model bias, and paves the way for deeper exploration into how political and social views are encoded in LLMs.

pdf bib
EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation
Kai Zhang | Christopher Malon | Lichao Sun | Martin Renqiang Min

Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models, have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce **EditGRPO**, a mixed-policy reinforcement learning algorithm designed specifically to optimize the generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level detailed corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and sampling efficiency issues typically encountered in RL. Applied to a Qwen2.5-VL-3B model, EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in clinical metrics across four major datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of 5.9% on unseen datasets.

pdf bib
Enhancing Long Document Long Form Summarisation with Self-Planning
Xiaotang Du | Rohit Saxena | Laura Perez-Beltrachini | Pasquale Minervini | Ivan Titov

We introduce a novel approach for long context summarisation, highlight-guided generation, that leverages sentence-level information as a content plan to improve the traceability and faithfulness of generated summaries. Our framework applies self-planning methods to identify important content and then generates a summary conditioned on the plan. We explore both end-to-end and two-stage variants of the approach, finding that the two-stage pipeline performs better on long and information-dense documents. Experiments on long-form summarisation datasets demonstrate that our method consistently improves factual consistency while preserving relevance and overall quality. On GovReport, our best approach improves ROUGE-L by 4.1 points and achieves gains of about 35% in SummaC scores. Qualitative analysis shows that highlight-guided summarisation helps preserve important details, leading to more accurate and insightful summaries across domains.

pdf bib
Faithful Transcription: Leveraging Bible Recordings to Improve ASR for Endangered Languages
Eric Le Ferrand | Cian Mohamed Bashar Hauser | Joshua Hartshorne | Emily Prud’hommeaux

While automatic speech recognition (ASR) now achieves human-level accuracy for a dozen or so languages, the majority of the world’s languages lack the resources needed to train robust ASR models. For many of these languages, the largest available source of transcribed speech data consists of recordings of the Bible. Bible recordings are appealingly large and well-structured resources, but they have notable limitations: the vocabulary and style are constrained, and the recordings are typically produced by a single speaker in a studio. These factors raise an important question: to what extent are Bible recordings useful for developing ASR models to transcribe contemporary naturalistic speech, the goal of most ASR applications? In this paper, we use Bible recordings alongside contemporary speech recordings to train ASR models in a selection of under-resourced and endangered languages. We find that models trained solely on Bible data yield shockingly weak performance when tested on contemporary everyday speech, even when compared to models trained on other (non-Bible) out-of-domain data. We identify one way of effectively leveraging Bible data in the ASR training pipeline via a two-stage training regime. Our results highlight the need to re-assess reported results relying exclusively on Bible data and to use Bible data carefully and judiciously.

pdf bib
Still Not There: Can LLMs Outperform Smaller Task-Specific Seq2Seq Models on the Poetry-to-Prose Conversion Task?
Kunal Kingkar Das | Manoj Balaji Jagadeeshan | Nallani Chakravartula Sahith | Jivnesh Sandhan | Pawan Goyal

Large Language Models (LLMs) are increasingly treated as universal, general-purpose solutions across NLP tasks, particularly in English. But does this assumption hold for low-resource, morphologically rich languages such as Sanskrit? We address this question by comparing instruction-tuned and in-context-prompted LLMs with smaller task-specific encoder–decoder models on the Sanskrit poetry-to-prose conversion task. This task is intrinsically challenging: Sanskrit verse exhibits free word order combined with rigid metrical constraints, and its conversion to canonical prose (anvaya) requires multi-step reasoning involving compound segmentation, dependency resolution, and syntactic linearisation. This makes it an ideal testbed to evaluate whether LLMs can surpass specialised models. For LLMs, we apply instruction fine-tuning on general-purpose models and design in-context learning templates grounded in Pāṇinian grammar and classical commentary heuristics. For task-specific modelling, we fully fine-tune a ByT5-Sanskrit Seq2Seq model. Our experiments show that domain-specific fine-tuning of ByT5-Sanskrit significantly outperforms all instruction-driven LLM approaches. Human evaluation strongly corroborates this result, with scores exhibiting high correlation with Kendall’s Tau scores. Additionally, our prompting strategies provide an alternative to fine-tuning when domain-specific verse corpora are unavailable, and the task-specific Seq2Seq model demonstrates robust generalisation on out-of-domain evaluations. Our code and dataset are publicly available.

pdf bib
Exploring the Performance of Large Language Models on Subjective Span Identification Tasks
Alphaeus Dmonte | Roland R Oruche | Tharindu Ranasinghe | Marcos Zampieri | Prasad Calyam

Identifying relevant text spans is important for several downstream tasks in NLP, as it contributes to model explainability. While most span identification approaches rely on relatively smaller pre-trained language models like BERT, a few recent approaches have leveraged the latest generation of Large Language Models (LLMs) for the task. Current work has focused on explicit span identification like Named Entity Recognition (NER), while more subjective span identification with LLMs in tasks like Aspect-based Sentiment Analysis (ABSA) has been underexplored. In this paper, we fill this important gap by presenting an evaluation of the performance of various LLMs on text span identification in three popular tasks, namely sentiment analysis, offensive language identification, and claim verification. We explore several LLM strategies like instruction tuning, in-context learning, and chain of thought. Our results indicate underlying relationships within text aid LLMs in identifying precise text spans.

pdf bib
Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs
Sachin Pawar | Manoj Apte | Kshitij Jadhav | Girish Keshav Palshikar | Nitin Ramrakhiyani

Tokenization is the first step in training any Large Language Model (LLM), where the text is split into a sequence of tokens as per the model’s fixed vocabulary. This tokenization in LLMs is different from the traditional tokenization in NLP where the text is split into a sequence of “natural” words. In LLMs, a natural word may also be broken into multiple tokens due to limited vocabulary size of the LLMs (e.g., Mistral’s tokenizer splits “martial” into “mart” and “ial”). In this paper, we hypothesize that such breaking of natural words negatively impacts LLM performance on various NLP tasks. To quantify this effect, we propose a set of penalty functions that compute a tokenization penalty for a given text for a specific LLM, indicating how “bad” the tokenization is. We establish statistical significance of our hypothesis on multiple NLP tasks for a set of different LLMs.
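The abstract does not give the penalty functions themselves, but one simple instance of the idea can be sketched as the fraction of natural words that a tokenizer breaks into more than one token. The `tokenize` argument below is an assumption: any callable mapping a word to its token pieces (e.g. a subword tokenizer's encode function).

```python
import re

def word_split_penalty(text, tokenize):
    """Fraction of natural words broken into multiple tokens.

    Illustrative penalty only, not one of the paper's exact definitions.
    `tokenize` maps a single word to a list of token strings.
    """
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    broken = sum(1 for w in words if len(tokenize(w)) > 1)
    return broken / len(words)

# Toy tokenizer: splits any word longer than 6 characters in half.
def toy_tokenize(word):
    return [word] if len(word) <= 6 else [word[: len(word) // 2], word[len(word) // 2 :]]

print(word_split_penalty("martial arts require discipline", toy_tokenize))  # 0.75
```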

pdf bib
ReGraph: Learning to Reformulate Graph Encodings with Large Language Models
Amir Hadifar | Christopher Ochs | Arjan Van Ewijk

Large language models can rephrase and restructure natural language effectively, but their potential for reformulating graph encodings remains underexplored despite the significant impact of encoding choices on performance. In this work, we introduce ReGraph, a reinforcement learning-based approach that guides language models to reformulate graph encodings for improved task alignment. We demonstrate that reformulating graph encodings enhances reasoning and yields consistent performance gains on graph question answering tasks. Our results show that language models often prefer specific graph encodings, even if they are suboptimal for the task at hand.

pdf bib
LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
Chloe Li | Noah Y. Siegel

Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat is sandbagging—the strategic underperformance on evaluations by AI models or their developers. A promising defense is to monitor a model’s chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitor-oblivious or monitor-aware. We show that both frontier models and small open-sourced models can covertly sandbag against CoT monitoring 0-shot without hints. However, they cannot yet do so reliably: they bypass the monitor 16-36% of the time when monitor-aware, conditioned on sandbagging successfully. We qualitatively analyzed the uncaught CoTs to understand why the monitor failed. We reveal a rich attack surface for CoT monitoring and contribute five covert sandbagging policies generated by models. These results inform potential failure modes of CoT monitoring and may help build more diverse sandbagging model organisms.

pdf bib
Meronymic Ontology Extraction via Large Language Models
Dekai Zhang | Simone Conia | Antonio Rago

Ontologies have become essential in today’s digital age as a way of organising the vast amount of readily available unstructured text. In providing formal structure to this information, ontologies have immense value and application across various domains, e.g., e-commerce, where countless product listings necessitate proper product organisation. However, the manual construction of these ontologies is a time-consuming, expensive and laborious process. In this paper, we harness the recent advancements in large language models (LLMs) to develop a fully automated method of extracting product ontologies, in the form of meronymies, from raw review texts. We demonstrate that the ontologies produced by our method surpass an existing, BERT-based baseline when evaluated using an LLM-as-a-judge. Our investigation provides the groundwork for LLMs to be used more generally in (product or otherwise) ontology extraction.

pdf bib
Compositional Phoneme Approximation for L1-Grounded L2 Pronunciation Training
Jisang Park | Minu Kim | DaYoung Hong | Jongha Lee

Learners of a second language (L2) often map non-native phonemes to similar native-language (L1) phonemes, making conventional L2-focused training slow and effortful. To address this, we propose an L1-grounded pronunciation training method based on compositional phoneme approximation (CPA), a feature-based representation technique that approximates L2 sounds with sequences of L1 phonemes. Evaluations with 20 Korean non-native English speakers show that CPA-based training achieves a 76% in-box formant rate in acoustic analysis, 17.6% relative improvement in phoneme recognition accuracy, and over 80% of speech being rated as more native-like, with minimal training. Project page: https://gsanpark.github.io/CPA-Pronunciation.

pdf bib
Can Language Models Handle a Non-Gregorian Calendar? The Case of the Japanese wareki
Mutsumi Sasaki | Go Kamoda | Ryosuke Takahashi | Kosuke Sato | Kentaro Inui | Keisuke Sakaguchi | Benjamin Heinzerling

Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. If and how well current LMs can accurately handle such non-Gregorian calendars has not been evaluated so far. Here, we present a systematic evaluation of how well language models handle one such non-Gregorian system: the Japanese *wareki*. We create datasets that require temporal knowledge and reasoning using *wareki* dates. Evaluating open and closed LMs, we find that some models can perform calendar conversions, but GPT-4o, Deepseek V3, and even Japanese-centric models struggle with Japanese calendar arithmetic and knowledge involving *wareki* dates. Error analysis suggests corpus frequency of Japanese calendar expressions and a Gregorian bias in the model’s knowledge as possible explanations. Our results show the importance of developing LMs that are better equipped for culture-specific tasks such as calendar understanding.

pdf bib
Modeling Contextual Passage Utility for Multihop Question Answering
Akriti Jain | Aparna Garimella

Multihop Question Answering (QA) requires systems to identify and synthesize information from multiple text passages. While most prior retrieval methods assist in identifying relevant passages for QA, further assessing the utility of the passages can help in removing redundant ones, which may otherwise add to noise and inaccuracies in the generated answers. Existing utility prediction approaches model passage utility independently, overlooking a critical aspect of multi-hop reasoning, that the utility of a passage can be context-dependent, influenced by its relation to other passages—whether it provides complementary information, or forms a crucial link in conjunction with others. In this paper, we propose a light-weight approach to model contextual passage utility, accounting for inter-passage dependencies. We fine-tune a small transformer-based model to predict passage utility scores for multihop QA. We leverage the reasoning traces from an advanced reasoning model to capture the order in which passages are used to answer a question, to obtain synthetic training data. Through comprehensive experiments, we demonstrate that our utility-based scoring of retrieved passages leads to better reranking and downstream task performance compared to relevance-based reranking methods.

pdf bib
Improving LLM’s Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization
Hadi Sheikhi | Chenyang Huang | Osmar Zaiane

Knowledge graph-based dialogue generation (KG-DG) is a challenging task requiring models to effectively incorporate external knowledge into conversational responses. While large language models (LLMs) have achieved impressive results across various NLP tasks, their ability to utilize external knowledge in KG-DG remains under-explored. We observe that LLMs often rely on internal knowledge, leading to detachment from provided knowledge graphs, even when they are given a flawlessly retrieved knowledge graph. First, we introduce LLM-KAT, an evaluation procedure for measuring knowledge attachment in generated responses. Second, we propose a simple yet effective entity anonymization technique to encourage LLMs to better leverage external knowledge. Experiments on the OpenDialKG dataset demonstrate that our approach improves LLMs’ attachment to external knowledge.
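A minimal sketch of the entity-anonymization idea under assumed inputs (the entity mentions are already known from the knowledge graph); placeholders are swapped back after generation. The paper's actual technique may differ in detail, and the example strings are hypothetical.

```python
def anonymize(text, entities):
    """Replace known entity mentions with indexed placeholders."""
    mapping = {}
    for i, ent in enumerate(entities, start=1):
        placeholder = f"[ENTITY_{i}]"
        mapping[placeholder] = ent
        text = text.replace(ent, placeholder)
    return text, mapping

def deanonymize(text, mapping):
    """Restore the original entity names in a generated response."""
    for placeholder, ent in mapping.items():
        text = text.replace(placeholder, ent)
    return text

knowledge = "Christopher Nolan directed Inception. Inception stars Leonardo DiCaprio."
ents = ["Christopher Nolan", "Inception", "Leonardo DiCaprio"]

anon, mapping = anonymize(knowledge, ents)
print(anon)   # placeholders push the model to rely on the provided graph
print(deanonymize(anon, mapping))
```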

pdf bib
Agreement-Constrained Probabilistic Minimum Bayes Risk Decoding
Koki Natsumi | Hiroyuki Deguchi | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe

Minimum Bayes risk (MBR) decoding generates high-quality translations by maximizing the expected utility of output candidates, but it evaluates all pairwise scores over the candidate set; hence, it takes quadratic time with respect to the number of candidates. To reduce the number of utility function calls, probabilistic MBR (PMBR) decoding partially evaluates quality scores using sampled pairs of candidates and completes the missing scores with a matrix completion algorithm. Nevertheless, it degrades the translation quality as the number of utility function calls is reduced. Therefore, to improve the trade-off between quality and cost, we propose agreement-constrained PMBR (AC-PMBR) decoding, which leverages a knowledge distilled model to guide the completion of the score matrix. Our AC-PMBR decoding reduced the approximation errors of matrix completion by up to a factor of 3 and achieved higher translation quality compared with PMBR decoding at a comparable computational cost on the WMT’23 En↔De translation tasks.
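To make the quadratic-cost problem concrete, here is a toy sketch of MBR selection from a partially observed utility matrix. Mean imputation stands in for the matrix-completion step, and the distillation-guided constraint (the AC part) is not modeled; the word-overlap utility and candidate list are hypothetical.

```python
import random

def pmbr_select(candidates, utility, budget, seed=0):
    """Pick the candidate with the highest estimated expected utility.

    Only `budget` randomly sampled pairs are scored; the remaining entries
    are filled with the mean of the observed scores (a crude stand-in for
    the matrix-completion algorithm used by PMBR).
    """
    n = len(candidates)
    rng = random.Random(seed)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    observed = dict.fromkeys(rng.sample(pairs, min(budget, len(pairs))))
    for i, j in observed:
        observed[(i, j)] = utility(candidates[i], candidates[j])
    mean = sum(observed.values()) / max(len(observed), 1)

    def expected(i):
        return sum(observed.get((i, j), mean) for j in range(n) if j != i) / (n - 1)

    return max(range(n), key=expected)

# Toy utility: word overlap between two hypotheses.
def overlap(a, b):
    return len(set(a.split()) & set(b.split()))

hyps = ["the cat sat", "a cat sat down", "the dog ran", "the cat sat down"]
print(hyps[pmbr_select(hyps, overlap, budget=6)])
```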

up

pdf (full)
bib (full)
The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

pdf bib
The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Santosh T.y.s.s | Shuichiro Shimizu | Yifan Gong

pdf bib
Interpretable Sparse Features for Probing Self-Supervised Speech Models
Iñigo Parra

Self-supervised speech models have demonstrated the ability to learn rich acoustic representations. However, interpreting which specific phonological or acoustic features these models leverage within their highly polysemantic activations remains challenging. In this paper, we propose a straightforward and unsupervised probing method for model interpretability. We extract the activations from the final MLP layer of a pretrained HuBERT model and train a sparse autoencoder (SAE) using dictionary learning techniques to generate an over-complete set of latent representations. Analyzing these latent codes, we observe that a small subset of high-variance units consistently aligns with phonetic events, suggesting their potential utility as interpretable acoustic detectors. Our proposed method does not require labeled data beyond raw audio, providing a lightweight and accessible tool to gain insights into the internal workings of self-supervised speech models.

pdf bib
Learning Dynamics of Meta-Learning in Small Model Pretraining
David Demitri Africa | Yuval Weiss | Paula Buttery | Richard Diehl Martinez

Large language models are powerful but costly. We ask whether meta-learning can make the pretraining of small language models not only faster but also more interpretable. We integrate first-order MAML with subset-masked LM pretraining, producing four Llama-style decoder-only models (11M–570M params), and evaluate on multilingual Universal NER. Compared with vanilla training, our hybrid setup (i) reaches the same loss up to 1.6× sooner, (ii) yields modest but consistent average gains on Universal NER at medium/large scales under equal compute (+2–3 percentage points), and (iii) reveals phase-like learning dynamics: models first diversify their representations, then compress them in a pattern that aligns with improved episodic accuracy. These observations are correlational, not causal, and we do not claim generality beyond NER or across seeds. We also document a trade-off: perplexity on Paloma (a diverse language modeling benchmark spanning 18 domains) is worse at most scales. Code, checkpoints and analysis logs are released.

pdf bib
Efficient Environmental Claim Detection with Hyperbolic Graph Neural Networks
Darpan Aswal | Manjira Sinha

Transformer-based models, especially large language models (LLMs), dominate the field of NLP with their mass adoption in tasks such as text generation, summarization and fake news detection. These models offer ease of deployment and reliability for most applications; however, they require significant amounts of computational power for training as well as inference. This poses challenges for their adoption in resource-constrained applications, especially in the open-source community where compute availability is usually scarce. This work proposes a graph-based approach for Environmental Claim Detection, exploring Graph Neural Networks (GNNs) and Hyperbolic Graph Neural Networks (HGNNs) as lightweight yet effective alternatives to transformer-based models. Re-framing the task as a graph classification problem, we transform claim sentences into dependency parsing graphs, utilizing a combination of word2vec and learnable part-of-speech (POS) tag embeddings for the node features and encoding syntactic dependencies in the edge relations. Our results show that our graph-based models, particularly HGNNs in the Poincaré space (P-HGNNs), achieve performance superior to the state-of-the-art on environmental claim detection while using up to **30x fewer parameters**. We also demonstrate that HGNNs benefit vastly from explicitly modeling data in hierarchical (tree-like) structures, enabling them to significantly improve over their Euclidean counterparts.

pdf bib
Stacked LoRA: Isolated Low-Rank Adaptation for Lifelong Knowledge Management
Heramb Vivek Patil | Vaishnavee Sanam | Minakshi Pradeep Atre

Continual learning (CL) presents a significant challenge for large pre-trained models, primarily due to catastrophic forgetting and the high computational cost of sequential knowledge updating. Parameter-Efficient Transfer Learning (PETL) methods offer reduced computational burdens but often struggle to effectively mitigate forgetting. This paper introduces Stacked Low-Rank Adaptation (SLoRA), a novel parameter-efficient approach that leverages the additive composition of task-specific, frozen low-rank adapters to enable modular continual learning with inherent support for explicit knowledge modification. SLoRA was evaluated on vision benchmarks, BERT-base, and the 1-billion-parameter Llama-3.2-1B model. Experiments demonstrated that SLoRA almost completely eliminated catastrophic forgetting, achieving a final average accuracy of 92.75% on Llama-3.2-1B while perfectly preserving prior task performance. Furthermore, SLoRA is computationally efficient, enabling up to a 15x training speed-up over full fine-tuning with 99.7% fewer trainable parameters per update. SLoRA offers a compelling balance of forgetting mitigation, parameter efficiency, and modularity, representing a promising direction for developing adaptable and efficient lifelong knowledgeable foundation models.

pdf bib
On Multilingual Encoder Language Model Compression for Low-Resource Languages
Daniil Gurgurov | Michal Gregor | Josef Van Genabith | Simon Ostermann

In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for extremely compressing multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% while maintaining competitive performance, with average drops of 2–10% for moderate compression and 8–13% at maximum compression in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct ablation studies to identify the best practices for multilingual model compression using these techniques.

pdf bib
Do We Need Large VLMs for Spotting Soccer Actions?
Ritabrata Chakraborty | Rajatsubhra Chakraborty | Avijit Dasgupta | Sandeep Chaurasia

Traditional video-based tasks like soccer action spotting rely heavily on visual inputs, often requiring complex and computationally expensive models to process dense video data. We propose a shift from this video-centric approach to a text-based task, making it lightweight and scalable by utilizing Large Language Models (LLMs) instead of Vision-Language Models (VLMs). We posit that expert commentary, which provides rich descriptions and contextual cues, contains sufficient information to reliably spot key actions in a match. To demonstrate this, we employ a system of three LLMs acting as judges specializing in outcome, excitement, and tactics for spotting actions in soccer matches. Our experiments show that this language-centric approach performs effectively in detecting critical match events, coming close to state-of-the-art video-based spotters while using zero video-processing compute and a similar amount of time to process the entire match.

pdf bib
LRMGS: A Language-Robust Metric for Evaluating Question Answering in Very Low-Resource Indic Languages
Anuj Kumar | Satyadev Ahlawat | Yamuna Prasad | Virendra Singh

Reliable evaluation of Question Answering (QA) systems in low-resource Indic languages presents a significant challenge due to limited annotated datasets, linguistic diversity, and suitable evaluation metrics. Languages such as Sindhi, Manipuri, Dogri, Konkani, and Maithili are particularly underrepresented, creating difficulty in assessing Large Language Models (LLMs) on QA tasks. Existing metrics, including BLEU, ROUGE-L, and BERTScore, are effective in machine translation and high-resource settings; however, they often fail in low-resource QA due to score compression, zero-inflation, and poor scale alignment. To overcome this, LRMGS (Language-Robust Metric for Generative QA) is introduced to capture semantic and lexical agreement while preserving the score scale across languages. LRMGS is evaluated across 8 Indic languages and multiple LLMs, demonstrating consistently higher concordance with reference-based chrF++ scores, measured using the Concordance Correlation Coefficient (CCC). Experimental results indicate that LRMGS provides more accurate discrimination of system performance in very low-resource languages compared to existing metrics. This work establishes a robust and interpretable framework for evaluating QA systems in low-resource Indic languages, supporting more reliable multilingual model assessment.

pdf bib
NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction
Peter Røysland Aarnes | Vinay Setty

Large language models show strong performance on knowledge intensive tasks such as fact checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs using controlled perturbations, including label flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves to be robust across all conditions. We further find that increasing context length generally reduces accuracy, but when extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.
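A hedged sketch of one kind of controlled perturbation consistent with the description above (the benchmark's actual probes are more varied and systematic): scale a number in the evidence and flip the gold label accordingly. The labels, claim, and evidence strings are illustrative assumptions.

```python
import re

def perturb_number(example, factor=2.0):
    """Multiply the first number in the evidence by `factor` and flip the label.

    Illustrative only: the paper's perturbations and label-flipping probes
    go well beyond this single transformation.
    """
    claim, evidence, label = example
    match = re.search(r"\d+(?:\.\d+)?", evidence)
    if match is None:
        return example
    new_value = float(match.group()) * factor
    evidence = evidence[: match.start()] + f"{new_value:g}" + evidence[match.end():]
    label = "REFUTED" if label == "SUPPORTED" else "SUPPORTED"
    return claim, evidence, label

example = ("The bridge is 120 meters long.",
           "Official records list the bridge at 120 meters.",
           "SUPPORTED")
print(perturb_number(example))
```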

pdf bib
Testing Simulation Theory in LLMs’ Theory of Mind
Koshiro Aoki | Daisuke Kawahara

Theory of Mind (ToM) is the ability to understand others’ mental states, which is essential for human social interaction. Although recent studies suggest that large language models (LLMs) exhibit human-level ToM capabilities, the underlying mechanisms remain unclear. “Simulation Theory” posits that we infer others’ mental states by simulating their cognitive processes, which has been widely discussed in cognitive science. In this work, we propose a framework for investigating whether the ToM mechanism in LLMs is based on Simulation Theory by analyzing their internal representations. Following this framework, we successfully steered LLMs’ ToM reasoning through modeled perspective-taking and counterfactual interventions. Our results suggest that Simulation Theory may partially explain the ToM mechanism in state-of-the-art LLMs, indicating parallels between human and artificial social reasoning.

pdf bib
Turn-by-Turn Behavior Monitoring in LM-Guided Psychotherapy
Anish Sai Chedalla | Samina Ali | Jiuming Chen | Starborn0128@gmail.com | Eric Xia

Large language models (LLMs) have the potential to be powerful instruments for psychotherapy. However, there is a shortage of practical tools to support their use in production. We develop a novel, iterative process of updating conversational context for tracking EIS (Emotional Intelligence Scale) instantaneously, and test Llama-70b. Through this, we show that (1) EIS varies more on psychotherapeutic (emotional support) conversations than control (emotionally unstimulating) conversations and (2) model responses can be systematically classified to identify consistent patterns. Thus, EIS is a valid indicator of empathetic model behavior. Rises in the EIS score correspond to prosocial behavior, and falls correspond to detached, unsocial behavior. These results suggest that psychometric questionnaires like EIS can provide a structured lens for observing empathetic stability of models and offer a foundation for future work on their role in psychotherapy.

pdf bib
BookAsSumQA: An Evaluation Framework for Aspect-Based Book Summarization via Question Answering
Ryuhei Miyazato | Ting-Ruen Wei | Xuyang Wu | Hsin-Tai Wu | Kei Harada

Aspect-based summarization aims to generate summaries that highlight specific aspects of a text, enabling more personalized and targeted summaries. However, its application to books remains unexplored due to the difficulty of constructing reference summaries for long text. To address this challenge, we propose BookAsSumQA, a QA-based evaluation framework for aspect-based book summarization. BookAsSumQA automatically generates aspect-specific QA pairs from a narrative knowledge graph to evaluate summary quality based on its question-answering performance. Our experiments using BookAsSumQA revealed that while LLM-based approaches showed higher accuracy on shorter texts, RAG-based methods become more effective as document length increases, making them more efficient and practical for aspect-based book summarization.

pdf bib
Thesis Proposal: Interpretable Reasoning Enhancement in Large Language Models through Puzzle and Ontological Task Analysis
Mihir Panchal

Large language models (LLMs) excel across diverse natural language processing tasks but remain opaque and unreliable. This thesis investigates how LLM reasoning can be made both interpretable and reliable through systematic analysis of internal dynamics and targeted interventions. Unlike prior work that examines reasoning broadly, this research focuses on two representative domains: puzzle solving, where reasoning steps can be precisely tracked, and ontological inference, where hierarchical structures constrain valid reasoning. The central questions are: (1) How can systematic error patterns in domain-specific reasoning be detected through layer-wise probing and mitigated through targeted interventions? (2) How can probing frameworks and middle-layer analyses reveal and enhance the computational mechanisms underlying inference? By combining probing methods, middle-layer investigations, and probe-guided interventions, the work aims to uncover interpretable reasoning patterns, identify systematic failure modes, and develop adaptive enhancement strategies. The expected outcome is a domain-grounded framework that advances both theoretical understanding of neural reasoning and the design of practical, trustworthy AI systems.

pdf bib
Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning
Wendy Yaqiao Liu | Rui Jerry Huang | Anastasia Miin | Lei Ding

Inference-time computation is a critical yet challenging paradigm for enhancing the reasoning performance of large language models (LLMs). While existing strategies improve reasoning stability and consistency, they suffer from notable limitations: self-correction often reinforces the model’s initial biases, and Multi-Agent Collaboration (MAC) often fails due to the lack of efficient coordination mechanisms, leading to collective errors. Although high-performing verifiers can detect reasoning errors, making them reliable requires substantial training. To address these challenges, we introduce a novel inference-time framework - **Adaptive Coopetition (AdCo)** - in which LLM agents utilize **an adaptive, UCB-based ‘coopetition’ mechanism**. At each round, agents leverage coarse verifier signals to determine whether to collaborate or compete, further iteratively refining their reasoning based on peer feedback. Without relying on high-performance verifiers, our adaptive strategy achieves significant performance gains on mathematical reasoning benchmarks, yielding **a 20% relative improvement** over baselines on the more challenging dataset. Our approach remains robust and consistent in terms of accuracy under different sample sizes and configurations. This adaptive, signal-guided ‘coopetition’ framework enhances reasoning robustness by leveraging both model knowledge diversity and reasoning trace measures, while also promoting uncertainty-driven exploration, especially when participants have comparable capabilities. From this perspective, our work offers a fresh lens on inference-time computation and paves the way for more resilient multi-agent LLM systems.
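A minimal sketch of the UCB-style choice between collaborating and competing, assuming a coarse verifier that returns a scalar reward in [0, 1]. This is a generic UCB1 loop, not the paper's full AdCo protocol, and the toy reward model is hypothetical.

```python
import math
import random

def ucb_choose(counts, rewards, t, c=1.4):
    """UCB1: pick the action maximizing mean reward plus an exploration bonus."""
    scores = {}
    for action in counts:
        if counts[action] == 0:
            return action  # try every action at least once
        mean = rewards[action] / counts[action]
        scores[action] = mean + c * math.sqrt(math.log(t) / counts[action])
    return max(scores, key=scores.get)

actions = ["collaborate", "compete"]
counts = {a: 0 for a in actions}
rewards = {a: 0.0 for a in actions}
rng = random.Random(0)

for t in range(1, 21):
    action = ucb_choose(counts, rewards, t)
    # Stand-in for the coarse verifier signal on the refined reasoning trace.
    reward = rng.uniform(0.5, 1.0) if action == "collaborate" else rng.uniform(0.0, 0.6)
    counts[action] += 1
    rewards[action] += reward

print(counts)  # 'collaborate' should dominate under this toy reward model
```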

pdf bib
AI Through the Human Lens: Investigating Cognitive Theories in Machine Psychology
Akash Kundu | Rishika Goswami

We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluated several proprietary and open-source models using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

pdf bib
Thesis Proposal: A NeuroSymbolic Approach to Control Task-Oriented Dialog Systems
Anuja Tayal | Barbara Di Eugenio

Developing effective healthcare dialog systems requires controlling conversations to offer clear insight into the system’s understanding and to address the lack of patient-oriented conversational datasets. Moreover, evaluating these systems is equally challenging and requires user studies for robust evaluation. These challenges are even more pronounced when addressing the needs of minority populations with low health literacy and numeracy. This thesis proposal focuses on designing conversational architectures that deliver self-care information to African American patients with heart failure. Neuro-symbolic approaches provide a promising direction by integrating symbolic reasoning with the generative capabilities of Large Language Models (LLMs). In this proposal, we explore various approaches to creating a hybrid dialog model by combining the strengths of task-oriented dialog systems with the integration of neuro-symbolic rules into a Language Model (LM)/LLM-based dialog system, thereby controlling the dialog system. We propose a hybrid conversational system that uses schema graphs to control the flow of dialogue, while leveraging LLMs to generate responses grounded in these schemas. We will also conduct a user study to evaluate the system’s effectiveness.

pdf bib
Enriching the Low-Resource Neural Machine Translation with Large Language Model
Sachin Giri | Takashi Ninomiya | Isao Goto

Improving the performance of neural machine translation for low-resource languages is challenging due to the limited availability of parallel corpora. However, recently available Large Language Models (LLMs) have demonstrated superior performance in various natural language processing tasks, including translation. In this work, we propose to incorporate an LLM into a Machine Translation (MT) model as a prior distribution to leverage its translation capabilities. The LLM acts as a teacher, instructing the student MT model about the target language. We conducted experiments on four language pairs (English ⇔ German and English ⇔ Hindi), which resulted in improved BLEU and COMET scores in a low-resource setting.

pdf bib
Investigating Training and Generalization in Faithful Self-Explanations of Large Language Models
Tomoki Doi | Masaru Isonuma | Hitomi Yanaka

Large language models have the potential to generate explanations for their own predictions in a variety of styles based on user instructions. Recent research has examined whether these self-explanations faithfully reflect the models’ actual behavior and has found that they often lack faithfulness. However, the question of how to improve faithfulness remains underexplored. Moreover, because different explanation styles have superficially distinct characteristics, it is unclear whether improvements observed in one style also arise when using other styles. This study analyzes the effects of training for faithful self-explanations and the extent to which these effects generalize, using three classification tasks and three explanation styles. We construct one-word constrained explanations that are likely to be faithful using a feature attribution method, and use these pseudo-faithful self-explanations for continual learning on instruction-tuned models. Our experiments demonstrate that training can improve self-explanation faithfulness across all classification tasks and explanation styles, and that these improvements also show signs of generalization to the multi-word settings and to unseen tasks. Furthermore, we find consistent cross-style generalization among three styles, suggesting that training may contribute to a broader improvement in faithful self-explanation ability.

pdf bib
Thesis Proposal: Efficient Methods for Natural Language Generation/Understanding Systems
Nalin Kumar

While Large Language Models (LLMs) have shown remarkable performance in various Natural Language Processing (NLP) tasks, their effectiveness seems to be heavily biased toward high-resource languages. This proposal aims to address this gap by developing efficient training strategies for low-resource languages. We propose various techniques for efficient learning in simulated low-resource settings for English. We then plan to adapt these methods to low-resource languages. We plan to experiment with both natural language generation and understanding models. We evaluate the models on benchmarks similar to the BabyLM challenge for English. For other languages, we plan to use treebanks and translation techniques to create our own silver test set to evaluate the low-resource LMs.

pdf bib
Two Step Automatic Post Editing of Patent Machine Translation based on Pre-trained Encoder Models and LLMs
Kosei Buma | Takehito Utsuro | Masaaki Nagata

We study automatic post-editing for patent translation, where accuracy and traceability are critical, and propose a two-step pipeline that combines a multilingual encoder for token-level error detection with an LLM for targeted correction. As no word-level annotations exist for Japanese–English patents, we create supervised data by injecting synthetic errors into parallel patent sentences and fine-tune mBERT, XLM-RoBERTa, and mDeBERTa as detectors. In the second stage, GPT-4o is prompted to revise translations either freely or under a restricted policy that allows edits only on detector-marked spans. For error detection, evaluation on synthetic errors shows that encoder-based detectors outperform LLMs in both F1 and MCC. For error correction, tests on synthetic, repetition, and omission datasets demonstrate statistically significant BLEU gains over LLM methods for synthetic and repetition errors, while omission errors remain challenging. Overall, pairing compact encoders with an LLM enables more accurate and controllable post-editing for key patent error types, reducing unnecessary rewrites via restricted edits. Future work will focus on strengthening omission modeling to better detect and correct missing content.

pdf bib
Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Saketh Reddy Vemula | Sandipan Dandapat | Dipti Sharma | Parameswari Krishnamurthy

The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for languages with complex morphology. In this paper, we conduct a comprehensive evaluation of tokenizers using small-sized BERT models—from pre-training through fine-tuning—for Telugu (agglutinative), along with preliminary evaluation in Hindi (primarily fusional with some agglutination) and English (fusional). To evaluate morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal two key findings for Telugu. First, the choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. Second, while better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm. Notably, hybrid approaches that use morphological information for pre-segmentation significantly boost the performance of BPE, though not Unigram. Our results further showcase the need for comprehensive intrinsic evaluation metrics for tokenizers that could explain downstream performance trends consistently.

pdf bib
Are LLMs Good for Semantic Role Labeling via Question Answering?: A Preliminary Analysis
Ritwik Raghav | Abhik Jana

Semantic role labeling (SRL) is a fundamental task in natural language processing that is crucial for achieving deep semantic understanding. Despite the success of large language models (LLMs) in several downstream NLP tasks, key tasks such as SRL remain a challenge for LLMs. Hence, in this study, we examine the efficacy of LLMs for the task of SRL via question answering. Toward that goal, we investigate the effectiveness of five different LLMs (Llama, Mistral, Qwen, OpenChat, Gemini) using zero-shot and few-shot prompting. Our findings indicate that few-shot prompting enhances the performance of all models. Although Gemini outperforms the others by a margin of 11%, Qwen and Llama are not far behind. Additionally, we conduct a comprehensive error analysis to shed light on the cases where LLMs fail. This study offers valuable insights into the performance of LLMs for structured prediction and the effectiveness of simple prompting techniques in the question-answering framework for SRL.

pdf bib
Could you BE more sarcastic? A Cognitive Approach to Bidirectional Sarcasm Understanding in Language Models
Veer Chheda | Avantika Sankhe | Atharva Vinay Sankhe

Sarcasm is a specific form of ironic speech that language models often find hard to understand due to its nuanced nature. Recent improvements in the ability of such models to detect and generate sarcasm motivate us to try a new approach that helps language models perceive sarcasm as a speech style, through a human cognitive perspective. In this work, we propose a multi-hop Chain of Thought (CoT) methodology to understand the context of an utterance that follows a dialogue and to perform bidirectional style transfer on that utterance, leveraging the Theory of Mind. We use small language models (SLMs) due to their cost-efficiency and fast response times. The generated utterances are evaluated using both LLM-as-a-judge and human evaluation, suitable to the open-ended and stylistic nature of the generations. Along with these, we also evaluate automated metrics such as DialogRPT, BLEU, and SBERT, drawing insights from them that support our findings. Based on this, we find that our cognitive approach to sarcasm is an effective way for language models to stylistically understand and generate sarcasm with better authenticity.

pdf bib
To What Extent Can In-Context Learning Solve Unseen Tasks?
Ryoma Shinto | Masashi Takeshita | Rafal Rzepka | Toshihiko Itoh

While Large Language Models (LLMs) are known for their In-Context Learning (ICL) capabilities, there is no consensus on the underlying mechanisms. A key point of debate is whether ICL allows models to adapt to unseen tasks without parameter updates—that is, whether they can extrapolate. In this study, we address this question by constructing an arithmetic dataset based on the bivariate linear function z=ax+by to train a model and quantitatively evaluate its interpolation and extrapolation abilities through ICL. Our results show that while extrapolation was not achieved within our experimental design, tasks that were partially learned could be solved. We also found that the model acquires internal representations that can distinguish unseen tasks, and that greater task diversity in the training dataset improves ICL capabilities.
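
To make the setup concrete, the following minimal sketch builds an in-context prompt from the bivariate linear function z = ax + by; the coefficient and input ranges and the prompt format are illustrative assumptions, not the exact protocol used in the paper.

    import random

    def make_icl_prompt(a, b, n_examples=4, seed=0):
        """Build a few-shot prompt for the task z = a*x + b*y.
        The model must infer (a, b) from the examples alone; querying
        (a, b) pairs unseen during training probes extrapolation."""
        rng = random.Random(seed)
        lines = []
        for _ in range(n_examples):
            x, y = rng.randint(0, 9), rng.randint(0, 9)
            lines.append(f"x={x} y={y} -> z={a * x + b * y}")
        qx, qy = rng.randint(0, 9), rng.randint(0, 9)
        lines.append(f"x={qx} y={qy} -> z=")
        return "\n".join(lines), a * qx + b * qy

    prompt, target = make_icl_prompt(a=3, b=2)
    print(prompt)            # few-shot examples followed by the query
    print("target:", target)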

pdf bib
Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering
Nathan Mao | Varun Kaushik | Shreya Shivkumar | Parham Sharafoleslami | Kevin Zhu | Sunishchal Dev

Large Language Models (LLMs) often hallucinate, generating nonsensical or false information that can be especially harmful in sensitive fields such as medicine or law. To study this phenomenon systematically, we introduce FalseCite, a curated dataset designed to capture and benchmark hallucinated responses induced by misleading or fabricated citations. Running GPT-4o-mini, Falcon-7B, and Mistral-7B through FalseCite, we observed a noticeable increase in hallucination activity for false claims with deceptive citations, especially in GPT-4o-mini. Using the responses from FalseCite, we can also analyze the internal states of hallucinating models, visualizing and clustering the hidden state vectors. From this analysis, we noticed that the hidden state vectors, regardless of hallucination or non-hallucination, tend to trace out a distinct horn-like shape. Our work underscores FalseCite’s potential as a foundation for evaluating and mitigating hallucinations in future LLM research.

pdf bib
Mitigating Forgetting in Continual Learning with Selective Gradient Projection
Anika Singh | David Martinez | Aayush Dhaulakhandi | Varun Chopade | Likhith Malipati | Vasu Sharma | Kevin Zhu | Sunishchal Dev | Ryan Lagasse

As neural networks are increasingly deployed in dynamic environments, they face the challenge of catastrophic forgetting, the tendency to overwrite previously learned knowledge when adapting to new tasks, resulting in severe performance degradation on earlier tasks. We propose Selective Forgetting-Aware Optimization (SFAO), a dynamic method that regulates gradient directions via cosine similarity and per-layer gating, enabling controlled forgetting while balancing plasticity and stability. SFAO selectively projects, accepts, or discards updates using a tunable mechanism with efficient Monte Carlo approximation. Experiments on standard continual learning benchmarks show that SFAO achieves competitive accuracy with markedly lower memory cost, a 90% reduction, and reduced forgetting on MNIST benchmarks, making it suitable for resource-constrained scenarios.
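
The cosine-similarity gating described above can be illustrated with a small per-layer rule; the thresholds and the projection step below are illustrative assumptions rather than the authors' exact SFAO mechanism.

    import numpy as np

    def gate_gradient(g_new, g_ref, accept_thr=0.0, discard_thr=-0.5):
        """Accept, project, or discard a layer's gradient depending on its
        cosine similarity to a reference gradient from previous tasks."""
        cos = g_new @ g_ref / (np.linalg.norm(g_new) * np.linalg.norm(g_ref) + 1e-12)
        if cos >= accept_thr:            # aligned with old tasks: accept as-is
            return g_new
        if cos <= discard_thr:           # strongly conflicting: discard the update
            return np.zeros_like(g_new)
        # mildly conflicting: project out the component opposing the reference
        proj = (g_new @ g_ref) / (g_ref @ g_ref + 1e-12) * g_ref
        return g_new - proj

    update = gate_gradient(np.array([1.0, -0.2]), np.array([0.0, 1.0]))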

pdf bib
VariantBench: A Framework for Evaluating LLMs on Justifications for Genetic Variant Interpretation
Humair Basharat | Simon Plotkin | Charlotte Le | Kevin Zhu | Michael Pink | Isabella Alfaro

Accurate classification in high-stakes domains requires not only correct predictions but transparent, traceable reasoning. We instantiate this need in clinical genomics and present VariantBench, a reproducible benchmark and scoring harness that evaluates both the final American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) labels and criterion-level reasoning fidelity for missense single-nucleotide variants (SNVs). Each case pairs a variant with deterministic, machine-readable evidence aligned to five commonly used criteria (PM2, PP3, PS1, BS1, BA1), enabling consistent evaluation of large language models (LLMs). Unlike prior work that reports only final labels, our framework scores the correctness and faithfulness of per-criterion justifications against numeric evidence. On a balanced 100-variant freeze, Gemini 2.5 Flash and GPT-4o outperform Claude 3 Opus on label accuracy and criterion detection, and both improve materially when the decisive PS1 cue is provided explicitly. Error analyses show models master population-frequency cues yet underuse high-impact rules unless evidence is unambiguous. VariantBench delivers a substrate to track such improvements and compare prompting, calibration, and aggregation strategies in genomics and other rule-governed, safety-critical settings.

pdf bib
The ‘aftermath’ of compounds: Investigating Compounds and their Semantic Representations
Swarang Joshi

This study investigated how well computational embeddings aligned with human semantic judgments in the processing of English compound words. We compared static word vectors (GloVe) and contextualized embeddings (BERT) against human ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn from a psycholinguistic dataset. Using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), we computed embedding-derived LMD and ST metrics and assessed their relationships with human judgments via Spearman’s correlation and regression analyses. Our results showed that BERT embeddings better captured compositional semantics than GloVe, and that predictability ratings were strong predictors of semantic transparency in both human and model data. These findings advanced computational psycholinguistics by clarifying the factors that drove compound word processing and offered insights into embedding-based semantic modeling.

up

pdf (full)
bib (full)
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract

pdf bib
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract
Benjamin Heinzerling | Lun-Wei Ku

pdf bib
Source Attribution for Large Language Models
Vipula Rawte | Koustava Goswami | Puneet Mathur | Nedim Lipka

As Large Language Models (LLMs) become more widely used for tasks like document summarization, question answering, and information extraction, improving their trustworthiness and interpretability has become increasingly important. One key strategy for achieving this is attribution, a process that tracks the sources of the generated responses. This tutorial will explore various attribution techniques, including model-driven attribution, post-retrieval answering, and post-generation attribution. We will also discuss the challenges involved in implementing these approaches and look at advanced topics such as model-based attribution for complex cases, table attribution, multimodal attribution, and multilingual attribution.

pdf bib
Continual Learning in Large Language Models: Foundations to Frontiers
P. K. Srijith | Shrey Satapara | Sarath Chandar

Continual learning (CL) enables deep learning models to learn a sequence of tasks under resource-constrained settings, without forgetting previously acquired knowledge. This is particularly useful for multilingual NLP for low-resource languages, where incremental data collection is common and compute cost is crucial. This tutorial will introduce key CL methodologies and their applications in natural language processing (NLP), covering both foundational techniques and modern challenges posed by large language models (LLMs). This tutorial covers foundational CL strategies based on regularization, replay, and network architecture. We explore NLP-specific CL scenarios such as task-incremental, language-incremental, and joint task-language incremental setups, along with methodologies to address them. A major emphasis of the tutorial is on continual learning for large language models (LLMs), examining challenges in applying CL to LLMs and the benefits it can provide in LLM training and inference. We further explore the connection between recent advances in LLMs, such as model merging, and continual learning. This tutorial is suitable for NLP researchers, practitioners, and students interested in lifelong learning, multilingual NLP, or large language models. It is designed as a half-day tutorial at IJCNLP 2025 and falls under the category of Introduction to Non-CL/Non-NLP Topic.

pdf bib
NLP for Affective Science: Exploring Fundamental Questions on Emotions through Language and Computation
Krishnapriya Vishnubhotla | Saif M. Mohammad

Affect refers to the fundamental neural processes that generate and regulate emotions, moods, and feeling states. Affect and emotions are central to how we organize meaning, to our behaviour, to our health and well-being, and to our very survival. Despite this, and even though most of us are intimately familiar with emotions in everyday life, there is much we do not know about how emotions work and how they impact our lives. Affective Science is a broad interdisciplinary field that explores these and other related questions about affect and emotions. Since language is a powerful mechanism of emotion expression, there is great potential in using language data and computation to shed light on fundamental questions about emotions. However, even though much progress has been made in areas such as sentiment analysis and affective computing, much of the research focus is squarely on automatically classifying pieces of text. In this tutorial, we will present an introduction to Affective Science and argue that NLP is uniquely positioned to contribute to it: to boldly explore a new frontier, using language and computation to ask fundamental questions about how emotions and affect work. We will cover the broad areas of research within this nascent field of study, Computational Affective Science (CAS).

pdf bib
Human–Agent Teaming for Higher-Order Thinking Augmentation
Chung-Chi Chen

Human-agent teaming refers to humans and artificial agents working together toward shared goals, and recent advances in artificial intelligence, including large language models and autonomous robots, have intensified interest in using these agents not only for automation but also to augment higher-order cognition. Higher-order thinking involves complex mental processes such as critical thinking, creative problem solving, abstract reasoning, and metacognition, and intelligent agents hold the potential to act as genuine teammates that complement human strengths and address cognitive limitations. This tutorial synthesizes emerging research on human-agent teaming for cognitive augmentation. It outlines the foundations of higher-order thinking and the psychological frameworks that describe it, reviews key concepts and interaction paradigms in human–AI collaboration, and examines applications across education, healthcare, military decision-making, scientific discovery, and creative industries, where systems such as language models, decision-support tools, multi-agent architectures, explainable AI, and hybrid human–AI methods are used to support complex reasoning and expert judgment. It also discusses the major challenges involved in achieving meaningful augmentation, including the calibration of trust, the need for transparency, the development of shared mental models, the role of human adaptability and training, and broader ethical concerns. The tutorial further identifies gaps such as limited evidence of long-term improvement in human cognitive abilities and insufficient co-adaptation between humans and agents. Finally, it outlines future directions involving real-time cognitive alignment, long-term studies of cognitive development, co-adaptive learning systems, ethics-aware AI teammates, and new benchmarks for evaluating collaborative cognition, offering a comprehensive overview of current progress and a roadmap for advancing human-agent teaming as a means of enhancing higher-order human thinking.

pdf bib
Beyond Guardrails: Advanced Safety for Large Language Models — Monolingual, Multilingual and Multimodal Frontiers
Somnath Banerjee | Rima Hazra | Animesh Mukherjee

LLMs are now embedded in workflows that span languages, modalities, and tools. This raises safety challenges that outpace conventional “guardrails”: jailbreaks and prompt injections, attributional safety failures under code-mixing, multimodal bypass via typography and icons, activation-level manipulation, and agentic risks from tool use. This tutorial synthesizes the newest advances (2023–2025) and lays out open research questions around (i) failure modes in monolingual / multilingual / multimodal settings, (ii) training-time and inference-time defenses (rejection SFT, RLHF/RLAIF, decoding-time safety, parameter/activation steering), and (iii) evaluation and red-teaming pipelines balancing safety and utility. We anchor the tutorial with recent results, including our safety-related papers published at top-tier conferences, and connect them to emerging best practices from recent safety tutorials. The target audience is researchers/engineers with basic NLP knowledge who want the latest techniques and a research roadmap; the format is half-day with short demos and Q&A.

pdf bib
Tutorial on Trustworthy Legal Text Processing with LLMs: Retrieval, Rhetorical Roles, Summarization, and Trustworthy Generation
Anand Kumar M | Sangeetha S | Manikandan R | Anjali R

This half-day tutorial provides a comprehensive overview of Legal Natural Language Processing (NLP) with LLMs for participants with a basic understanding of Computational Linguistics or NLP concepts. We introduce how NLP can help analyze and manage legal text by covering five key topics: legal text analysis with LLM insights, legal text retrieval, rhetorical role identification, legal text summarization, and addressing bias and hallucination in legal tasks. Our goals are to explain why these tasks matter for researchers in the legal domain, describe the challenges and open problems, and outline current solutions. This proposed tutorial blends lectures, live examples, and Q&A to help researchers and students see how language technology and LLMs can make legal information more understandable and efficient.

up

pdf (full)
bib (full)
Proceedings of The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations

pdf bib
Proceedings of The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations
Xuebo Liu | Ayu Purwarianti

pdf bib
ImageTra: Real-Time Translation for Texts in Image and Video
Hour Kaing | Jiannan Mao | Haiyue Song | Chenchen Ding | Hideki Tanaka | Masao Utiyama

There has been growing research interest in in-image machine translation, which involves translating texts in images from one language to another. Recent studies continue to explore pipeline-based systems due to their straightforward construction and the consistent improvement of their underlying components. However, existing implementations of such pipelines often lack extensibility, composability, and support for real-time translation. Therefore, this work introduces ImageTra, an open-source toolkit designed to facilitate the development of pipeline-based in-image machine translation systems. The toolkit integrates state-of-the-art open-source models and tools, and is designed with a focus on modularity and efficiency, making it particularly well-suited for real-time translation. The toolkit is released at https://github.com/hour/imagetra.

pdf bib
Human-in-the-Loop Generation of Adversarial Texts: A Case Study on Tibetan Script
Xi Cao | Yuan Sun | Jiajun Li | Quzong Gesang | Nuo Qun | Nyima Tashi

DNN-based language models excel across various NLP tasks but remain highly vulnerable to textual adversarial attacks. While adversarial text generation is crucial for NLP security, explainability, evaluation, and data augmentation, related work remains overwhelmingly English-centric, leaving the problem of constructing high-quality and sustainable adversarial robustness benchmarks for lower-resourced languages both difficult and understudied. First, method customization for lower-resourced languages is complicated due to linguistic differences and limited resources. Second, automated attacks are prone to generating invalid or ambiguous adversarial texts. Last but not least, language models continuously evolve and may be immune to parts of previously generated adversarial texts. To address these challenges, we introduce HITL-GAT, an interactive system based on a general approach to human-in-the-loop generation of adversarial texts. Additionally, we demonstrate the utility of HITL-GAT through a case study on Tibetan script, employing three customized adversarial text generation methods and establishing its first adversarial robustness benchmark, providing a valuable reference for other lower-resourced languages.

pdf bib
Real-time Commentator Assistant for Photo Editing Live Streaming
Matīss Rikters | Goran Topić

Live commentary has the potential to make specific broadcasts, such as sports or video games, more engaging and interesting for spectators to watch. With the recent rise in popularity of online live streaming, many new categories have entered the space, such as art in its many forms and even software development; however, not all live streamers are naturally engaging with their audience. We introduce a live commentator assistant system that can discuss what is visible on screen in real time. Our experimental setting focuses on the use case of a photo editing live stream. We compare several recent vision language models for commentary generation and text-to-speech models for spoken output, all on relatively modest consumer hardware configurations.

pdf bib
Supporting Plain Language Summarization of Psychological Meta-Analyses with Large Language Models
Yarik Menchaca Resendiz | Martin Kerwer | Anita Chasiotis | Marlene Bodemer | Kai Sassenberg | Roman Klinger

Communicating complex scientific findings to non-experts remains a major challenge in fields like psychology, where research is often presented in highly technical language. One effective way to improve accessibility for non-experts is through plain language summaries, which distill key insights into simple and understandable terms. However, the few institutions that produce lay summaries typically rely on psychology experts to create them manually, an approach that ensures high quality but requires significant expertise, time, and effort. In this paper, we introduce the KLARpsy App, a system designed to support psychology experts in creating plain language summaries of psychological meta-analyses using Large Language Models (LLMs). Our system generates initial draft summaries based on a 37-criterion guideline developed to ensure clarity for non-experts. All summaries produced through the system are manually validated and edited by KLARpsy authors to ensure factual correctness and readability. We demonstrate how the system integrates LLM-generated content into an expert-in-the-loop workflow. The automatic evaluation showed a mean semantic-similarity score of 0.73 against expert-written summaries, and human evaluation on a 5-point Likert scale averaged above 3 (higher is better), indicating that the generated drafts are of high quality. The application and code are open source.

pdf bib
Standardizing the Measurement of Text Diversity: A Tool and Comparative Analysis
Chantal Shaib | Venkata S Govindarajan | Joe Barrow | Jiuding Sun | Alexa Siu | Byron C Wallace | Ani Nenkova

The diversity across outputs generated by LLMs shapes perception of their quality and utility. High lexical diversity is often desirable, but there is no standard method to measure this property. Templated answer structures and “canned” responses across different documents are readily noticeable, but difficult to visualize across large corpora. This work aims to standardize the measurement of text diversity. Specifically, we empirically investigate the convergent validity of existing scores across English texts, and release diversity, an open-source Python package (https://pypi.org/project/diversity/, https://github.com/cshaib/diversity) for measuring and extracting repetition in text. We also build a platform (https://ai-templates.app) based on diversity for users to interactively explore repetition in text. We find that fast compression algorithms capture information similar to what is measured by slow-to-compute n-gram overlap homogeneity scores. Further, a combination of measures (compression ratios, self-repetition of long n-grams, and Self-BLEU) is sufficient to report, as they have low mutual correlation with each other.
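
The two signals found to matter most, compression ratios and self-repetition of long n-grams, can be approximated with standard tools; the snippet below is a generic illustration of these quantities and is not the API of the diversity package itself.

    import zlib
    from collections import Counter

    def compression_ratio(texts):
        """Higher ratio means more redundancy across the set of outputs."""
        raw = "\n".join(texts).encode("utf-8")
        return len(raw) / len(zlib.compress(raw))

    def repeated_ngram_fraction(texts, n=4):
        """Fraction of n-gram occurrences that are repeats within the corpus."""
        grams = Counter()
        for t in texts:
            toks = t.split()
            grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        total = sum(grams.values())
        repeats = sum(c - 1 for c in grams.values() if c > 1)
        return repeats / total if total else 0.0

    outputs = ["the model is happy to help", "the model is happy to assist"]
    print(compression_ratio(outputs), repeated_ngram_fraction(outputs, n=4))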

pdf bib
LITMUS++ : An Agentic System for Predictive Analysis of Low-Resource Languages Across Tasks and Models
Avni Mittal | Shanu Kumar | Sandipan Dandapat | Monojit Choudhury

We present LITMUS++, an agentic system for predicting language-model performance for queries of the form “How will a Model perform on a Task in a Language?”, a persistent challenge in multilingual and low-resource settings where benchmarks are incomplete or unavailable. Unlike static evaluation suites or opaque LLM-as-judge pipelines, LITMUS++ implements an agentic, auditable workflow: a Directed Acyclic Graph of specialized Thought Agents that generate hypotheses, retrieve multilingual evidence, select predictive features, and train lightweight regressors with calibrated uncertainty. The system supports interactive querying through a chat-style interface, enabling users to inspect reasoning traces and cited evidence. Experiments across six tasks and five multilingual scenarios show that LITMUS++ delivers accurate and interpretable performance predictions, including in low-resource and unseen conditions. Code is available at https://github.com/AvniMittal13/litmus_plus_plus.

pdf bib
SimAgents: Bridging Literature and the Universe Via A Multi-Agent Large Language Model System
Xiaowen Zhang | Zhenyu Bi | Patrick LaChance | Xuan Wang | Tiziana Di Matteo | Rupert Croft

As cosmological simulations and their associated software become increasingly complex, physicists face the challenge of searching through vast amounts of literature and user manuals to extract simulation parameters from dense academic papers, each using different models and formats. Translating these parameters into executable scripts remains a time-consuming and error-prone process. To improve efficiency in physics research and accelerate the cosmological simulation process, we introduce SimAgents, a multi-agent system designed to automate both parameter configuration from the literature and preliminary analysis for cosmology research. SimAgents is powered by specialized LLM agents capable of physics reasoning, simulation software validation, and tool execution. These agents collaborate through structured communication, ensuring that extracted parameters are physically meaningful, internally consistent, and software-compliant. We also construct a cosmological parameter extraction evaluation dataset by collecting over 40 simulations in published papers from arXiv and leading journals that cover diverse simulation types. Experiments on the dataset demonstrate strong performance of SimAgents, highlighting its effectiveness and potential to accelerate scientific research for physicists. Our demonstration video is available at: https://youtu.be/w1zLpm_CaWA. The complete system and dataset are publicly available at https://github.com/xwzhang98/SimAgents.

pdf bib
StanceMining: An open-source stance detection library supporting time-series and visualization
Benjamin Steel | Derek Ruths

Despite the size of the field, stance detection has remained inaccessible to most researchers due to implementation barriers. Here we present a library that allows easy access to an end-to-end stance modelling solution. This library comes complete with everything needed to go from a corpus of documents to exploring stance trends in that corpus through an interactive dashboard. To support this, we provide stance target extraction, stance detection, stance time-series trend inference, and an exploratory dashboard, all available in an easy-to-use library. We hope that this library can increase the accessibility of stance detection for the wider community of those who could benefit from this method.

pdf bib
ShortCheck: Checkworthiness Detection of Multilingual Short‐Form Videos
Henrik Vatndal | Vinay Setty

Short-form video platforms like TikTok present unique challenges for misinformation detection due to their multimodal, dynamic, and noisy content. We present ShortCheck, a modular, inference-only pipeline with a user-friendly interface that automatically identifies checkworthy short-form videos to help human fact-checkers. The system integrates speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification. ShortCheck is validated by evaluating it on two manually annotated datasets with TikTok videos in a multilingual setting. The pipeline achieves promising results with a weighted F1 score over 70%. The demo can be accessed live at http://shortcheck.factiverse.ai.

pdf bib
ChartEval: LLM-Driven Chart Generation Evaluation Using Scene Graph Parsing
Kanika Goswami | Puneet Mathur | Ryan A. Rossi | Franck Dernoncourt | Vivek Gupta | Dinesh Manocha

Accurate assessment of generated chart quality is crucial for automated document creation and editing across diverse applications like finance, medicine, policy making, and education. Current evaluation approaches suffer from significant limitations: human evaluation is costly and difficult to scale, pixel-based metrics ignore data accuracy, while data-centric measures overlook design quality. Recent multimodal LLM evaluators show promise but exhibit concerning inconsistencies due to prompt sensitivity and subjective biases. Existing metrics fail to evaluate chart quality holistically across visual similarity, semantic alignment, and data fidelity, often producing misleading scores that unfairly penalize good charts while rewarding bad ones. We introduce ChartEval, a novel chart evaluation system that compares generated chart images with ground truth by leveraging scene graph parsing to decompose chart images into hierarchical scene graphs of chart objects, attributes, and relations. Subsequently, it applies graph-based similarity measures to compare candidate chart scene graphs against reference scene graphs for measuring chart quality. We demonstrate that our evaluation approach achieves significantly stronger correlation with human judgments compared to existing metrics like GPT-Score, SSIM, and SCRM using a comprehensive benchmark of 4K chart images paired with generation intents and human quality ratings. We demonstrate the utility of the ChartEval system as a reliable automatic chart quality metric on diverse tasks, including language-guided chart editing, chart reconstruction, and text-to-chart synthesis using both open-source and API-based LLMs.

pdf bib
SPORTSQL: An Interactive System for Real-Time Sports Reasoning and Visualization
Sebastian Martinez | Naman Ahuja | Fenil Bardoliya | Suparno Roy Chowdhury | Chris Bryan | Vivek Gupta

We present a modular, interactive system, SPORTSQL, for natural language querying and visualization of dynamic sports data, with a focus on the English Premier League (EPL). The system translates user questions into executable SQL over a live, temporally indexed database constructed from real-time Fantasy Premier League (FPL) data. It supports both tabular and visual outputs, leveraging symbolic reasoning capabilities of Large Language Models (LLMs) for query parsing, schema linking, and visualization selection. To evaluate system performance, we introduce the Dynamic Sport Question Answering Benchmark (DSQABENCH), comprising 1,700+ queries annotated with SQL programs, gold answers, and database snapshots. Our demo highlights how non-expert users can seamlessly explore evolving sports statistics through a natural, conversational interface.

up

pdf (full)
bib (full)
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

pdf bib
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Kentaro Inui | Sakriani Sakti | Haofen Wang | Derek F. Wong | Pushpak Bhattacharyya | Biplab Banerjee | Asif Ekbal | Tanmoy Chakraborty | Dhirendra Pratap Singh

pdf bib
MPF: Aligning and Debiasing Language Models post Deployment via Multi-Perspective Fusion
Xin Guan | Pei-Hsin Lin | Zekun Wu | Ze Wang | Ruibo Zhang | Emre Kazim | Adriano Koshiyama

Multi-Perspective Fusion (MPF) is a novel post-training alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multi-perspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines. By decomposing a baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the Human Resource baseline (biased toward top universities), resulting in small KL divergence, reduced calibration error, and generalization to unseen questions. This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or fine-tuning.
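
The generation step, sampling perspective-conditioned responses in proportion to the weights obtained from the baseline decomposition, can be sketched as below; the perspective names, weights, and toy generators are placeholders, not the SAGED/MPF implementation.

    import random

    def mpf_sample(perspective_generators, weights, n_responses=10, seed=0):
        """Draw responses from perspective-specific generators in proportion
        to the weights recovered by decomposing the target baseline."""
        rng = random.Random(seed)
        names = list(perspective_generators)
        return [perspective_generators[rng.choices(names, weights=weights, k=1)[0]]()
                for _ in range(n_responses)]

    # toy "generators" standing in for perspective-conditioned LLM calls
    gens = {"optimistic": lambda: "positive take", "critical": lambda: "negative take"}
    print(mpf_sample(gens, weights=[0.7, 0.3], n_responses=5))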

pdf bib
Generalizing to Unseen Disaster Events: A Causal View
Philipp Seeberger | Steffen Freisinger | Tobias Bocklet | Korbinian Riedhammer

Due to the rapid growth of social media platforms, these tools have become essential for monitoring information during ongoing disaster events. However, extracting valuable insights requires real-time processing of vast amounts of data. A major challenge in existing systems is their exposure to event-related biases, which negatively affects their ability to generalize to emerging events. While recent advancements in debiasing and causal learning offer promising solutions, they remain underexplored in the disaster event domain. In this work, we approach bias mitigation through a causal lens and propose a method to reduce event- and domain-related biases, enhancing generalization to future events. Our approach outperforms multiple baselines by up to +1.9% F1 and significantly improves a PLM-based classifier across three disaster classification tasks.

pdf bib
Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs
Peng Yifeng | Zhizheng Wu | Chen Chen

Modern large language models (LLMs) exhibit critical vulnerabilities to poison pill attacks: localized data poisoning that alters specific factual knowledge while preserving overall model utility. We systematically demonstrate that these attacks exploit inherent architectural properties of LLMs, achieving a 54.6% increase in retrieval inaccuracy on long-tail knowledge versus dominant topics and up to a 25.5% increase in retrieval inaccuracy on compressed models versus original architectures. Through controlled mutations (e.g., temporal/spatial/entity alterations), our method induces localized memorization deterioration with negligible impact on models’ performance on standard benchmarks (e.g., <2% performance drop on MMLU/GPQA), leading to potential detection evasion. Our findings suggest: (1) disproportionate vulnerability in long-tail knowledge may result from reduced parameter redundancy; (2) model compression may increase attack surfaces, with pruned/distilled models requiring 30% fewer poison samples for equivalent damage; (3) associative memory enables both the spread of collateral damage to related concepts and the amplification of damage from simultaneous attacks, particularly for dominant topics. These findings raise concerns over current scaling paradigms, since attack costs are falling while defense complexity is rising. Our work establishes poison pills as both a security threat and a diagnostic tool, revealing critical security-efficiency trade-offs in language model compression that challenge prevailing safety assumptions.

pdf bib
Sympathy over Polarization: A Computational Discourse Analysis of Social Media Posts about the July 2024 Trump Assassination Attempt
Qingcheng Zeng | Guanhong Liu | Zhaoqian Xue | Diego Ford | Rob Voigt | Loni Hagen | Lingyao Li

On July 13, 2024, an assassination attempt was made on Republican presidential candidate Donald Trump during a rally in Pennsylvania. This event triggered widespread discourses on social media platforms. In this study, we analyze posts from X (formerly Twitter) collected during the week preceding and following the incident to examine the short-term impact of this political shock on public opinion and discourse. Our investigation is guided by three central research questions. First (RQ1), we assess how public stance toward Donald Trump evolved over time and varied across geographic regions. Second (RQ2), we apply causal inference methods to determine whether the assassination attempt itself significantly influenced public attitudes, independent of pre-existing political alignments. Third (RQ3), we conduct topic modeling to identify shifts in dominant themes of online discussions before and after the event. Integrating large language model-based stance detection, difference-in-differences estimation, and topic modeling, our findings reveal a marked surge in sympathetic responses toward Trump in the immediate aftermath of the attempt, suggesting a unifying effect that temporarily transcended ideological and regional divides.

pdf bib
LLMs as Architects and Critics for Multi-Source Opinion Summarization
Anuj Attri | Arnav Attri | Suman Banerjee | Amey Patil | Muthusamy Chelliah | Nikesh Garera | Pushpak Bhattacharyya

Multi-source Opinion Summarization (M-OS) extends beyond traditional opinion summarization by incorporating additional sources of product metadata such as descriptions, key features, specifications, and ratings, alongside reviews. This integration results in comprehensive summaries that capture both subjective opinions and objective product attributes essential for informed decision-making. While Large Language Models (LLMs) have shown significant success in various Natural Language Processing (NLP) tasks, their potential in M-OS remains largely unexplored. Additionally, the lack of evaluation datasets for this task has impeded further advancements. To bridge this gap, we introduce M-OS-EVAL, a benchmark dataset for evaluating multi-source opinion summaries across seven key dimensions: fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, and specificity. Our results demonstrate that M-OS significantly enhances user engagement, as evidenced by a user study in which, on average, 87% of participants preferred M-OS over opinion summaries. Our experiments demonstrate that factually enriched summaries enhance user engagement. Notably, M-OS-PROMPTS exhibit stronger alignment with human judgment, achieving an average Spearman correlation of ρ = 0.74, which surpasses the performance of previous methodologies.

pdf bib
Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning
Po-Chun Chen | Hen-Hsen Huang | Hsin-Hsi Chen

To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.

pdf bib
Building Helpful-Only Large Language Models: A Complete Approach from Motivation to Evaluation
Donghyeon Ko | Sohee Yang | Donghyun Kwak | Sang-Woo Lee

Reinforcement learning from AI feedback (RLAIF) is widely used for customizing the safety policies of large language models (LLMs) at scale. However, standard aligned LLMs are poorly suited in this setting, as their fixed alignment prevents adaptation to new policies. To address this, prior works have employed Helpful-Only LLMs (HOLLMs). Despite their effectiveness, no public framework exists for training or evaluating HOLLMs. In this paper, we present a comprehensive framework for developing HOLLMs that enable custom safety alignment. We first define the key attributes of a HOLLM and then propose Refusal-Avoidant Instruction Learning (RAIL), a novel training method that constructs HOLLMs from open-source datasets. We also introduce a comprehensive evaluation framework including a new benchmark: Helpfulness Evaluation without Limitations from Policies (HELP). Experiments show that the HOLLM achieves a 30.28% reduction in refusal rate over the strongest refusal-optimized baseline without compromising general capabilities. The HOLLM also achieves a 29.25% higher accuracy on HELP compared to the best-performing baseline. These results demonstrate that RAIL effectively cultivates the key attributes required of a HOLLM.

pdf bib
Autoformalizing Natural Language to First-Order Logic: A Case Study in Logical Fallacy Detection
Abhinav Lalwani | Tasha Kim | Lovish Chopra | Christopher Hahn | Zhijing Jin | Mrinmaya Sachan

Translating natural language into formal language such as First-Order Logic (FOL) is a foundational challenge in NLP with wide-ranging applications in automated reasoning, misinformation tracking, and knowledge validation. In this paper, we introduce Natural Language to First-Order Logic (NL2FOL), a framework to autoformalize natural language to FOL step-by-step using Large Language Models (LLMs). Our approach addresses key challenges in this translation process, including the integration of implicit background knowledge. By leveraging structured representations generated by NL2FOL, we use Satisfiability Modulo Theory (SMT) solvers to reason about the logical validity of natural language statements. We present logical fallacy detection as a case study to evaluate the efficacy of NL2FOL. Being neurosymbolic, our approach also provides interpretable insights into the reasoning process and demonstrates robustness without requiring model fine-tuning or labeled training data. Our framework achieves good performance on multiple datasets: on the Logic dataset, NL2FOL achieves an F1-score of 78%, while generalizing effectively to the LogicClimate dataset with an F1-score of 80%.
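
The validity check that such a framework delegates to an SMT solver can be reproduced on a toy formalization with Z3: an argument is valid iff its premises together with the negated conclusion are unsatisfiable. The predicates below are invented for illustration and do not come from the paper's datasets.

    from z3 import (BoolSort, Const, DeclareSort, ForAll, Function,
                    Implies, Not, Solver, unsat)

    # Toy autoformalization: "All experts agree" and "Alice is an expert"
    # should entail "Alice agrees".
    Person = DeclareSort("Person")
    Expert = Function("Expert", Person, BoolSort())
    Agrees = Function("Agrees", Person, BoolSort())
    x = Const("x", Person)
    alice = Const("alice", Person)

    premises = [ForAll([x], Implies(Expert(x), Agrees(x))), Expert(alice)]
    conclusion = Agrees(alice)

    s = Solver()
    s.add(*premises)
    s.add(Not(conclusion))   # valid iff premises AND NOT(conclusion) is UNSAT
    print("valid" if s.check() == unsat else "not valid / potential fallacy")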

pdf bib
Atomic Calibration of LLMs in Long-Form Generations
Caiqi Zhang | Ruihan Yang | Zhisong Zhang | Xinting Huang | Sen Yang | Dong Yu | Nigel Collier

Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, as an effective indicator of hallucination, is thus essential to enhance the trustworthiness of LLMs. Prior work mainly focuses on short-form tasks using a single response-level score (macro calibration), which is insufficient for long-form outputs that may contain both accurate and inaccurate claims. In this work, we systematically study atomic calibration, which evaluates factuality calibration at a fine-grained level by decomposing long responses into atomic claims. We further categorize existing confidence elicitation methods into discriminative and generative types, and propose two new confidence fusion strategies to improve calibration. Our experiments demonstrate that LLMs exhibit poorer calibration at the atomic level during long-form generation. More importantly, atomic calibration uncovers insightful patterns regarding the alignment of confidence methods and the changes of confidence throughout generation. This sheds light on future research directions for confidence estimation in long-form generation.
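
At the atomic level, the calibration being measured reduces to a standard expected calibration error computed over claim-level confidences; the sketch below uses made-up confidences and correctness labels and does not implement the paper's confidence elicitation or fusion strategies.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """ECE over atomic claims: bin claim-level confidences and compare
        each bin's mean confidence with its empirical accuracy."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
        return ece

    # one long answer decomposed into four atomic claims with verifier labels
    print(expected_calibration_error([0.9, 0.8, 0.7, 0.4], [1, 1, 0, 0]))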

pdf bib
Estimating Causal Effects of Text Interventions Leveraging LLMs
Siyi Guo | Myrl G Marmarelis | Fred Morstatter | Kristina Lerman

Quantifying the effects of textual interventions in social systems, such as reducing anger in social media posts to see its impact on engagement, is challenging. Real-world interventions are often infeasible, necessitating reliance on observational data. Traditional causal inference methods, typically designed for binary or discrete treatments, are inadequate for handling the complex, high-dimensional textual data. This paper addresses these challenges by proposing CausalDANN, a novel approach to estimate causal effects using text transformations facilitated by large language models (LLMs). Unlike existing methods, our approach accommodates arbitrary textual interventions and leverages text-level classifiers with domain adaptation ability to produce robust effect estimates against domain shifts, even when only the control group is observed. This flexibility in handling various text interventions is a key advancement in causal estimation for textual data, offering opportunities to better understand human behaviors and develop effective interventions within social systems.

pdf bib
AD-AGENT: A Multi-agent Framework for End-to-end Anomaly Detection
Tiankai Yang | Junjun Liu | Michael Siu | Jiahang Wang | Zhuangzhuang Qian | Chanjuan Song | Cheng Cheng | Xiyang Hu | Yue Zhao

Anomaly detection (AD) is essential in areas such as fraud detection, network monitoring, and scientific research. However, the diversity of data modalities and the increasing number of specialized AD libraries pose challenges for non-expert users who lack in-depth library-specific knowledge and advanced programming skills. To tackle this, we present AD-AGENT, an LLM-driven multi-agent framework that turns natural-language instructions into fully executable AD pipelines. AD-AGENT coordinates specialized agents for intent parsing, data preparation, library and model selection, documentation mining, and iterative code generation and debugging. Using a shared short-term workspace and a long-term cache, the agents integrate popular AD libraries like PyOD, PyGOD, and TSLib into a unified workflow. Experiments demonstrate that AD-AGENT produces reliable scripts and recommends competitive models across libraries. The system is open-sourced to support further research and practical applications in AD.

pdf bib
LiteLMGuard: Seamless and Lightweight On-Device Guardrails for Small Language Models against Quantization Vulnerabilities
Kalyan Nakka | Jimmy Dani | Ausmit Mondal | Nitesh Saxena

The growing adoption of Large Language Models (LLMs) has influenced the development of Small Language Models (SLMs) for on-device deployment across smartphones and edge devices, offering enhanced privacy, reduced latency, server-free functionality, and improved user experience. However, due to on-device resource constraints, SLMs undergo size optimization through compression techniques like quantization, which inadvertently introduce fairness, ethical and privacy risks. Critically, quantized SLMs may respond to harmful queries directly, without requiring adversarial manipulation, raising significant safety and trust concerns. To address this, we propose LiteLMGuard, an on-device guardrail that provides real-time, prompt-level defense for quantized SLMs. Additionally, our guardrail is designed to be model-agnostic such that it can be seamlessly integrated with any SLM, operating independently of underlying architectures. Our LiteLMGuard formalizes deep learning (DL)-based prompt filtering by leveraging semantic understanding to classify prompt answerability for SLMs. Built on our curated Answerable-or-Not dataset, LiteLMGuard employs ELECTRA as the candidate model with 97.75% answerability classification accuracy. The on-device deployment of LiteLMGuard enabled real-time offline filtering with over 85% defense-rate against harmful prompts (including jailbreak attacks), 94% filtering accuracy and ~135 ms average latency. These results demonstrate LiteLMGuard as a lightweight robust defense mechanism for effectively and efficiently securing on-device SLMs against Open Knowledge Attacks.

pdf bib
“Whose Side Are You On?” Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection
Muhammad Haroon | Magdalena Wojcieszak | Anshuman Chhabra

The rapid growth of social media platforms has led to concerns about radicalization, filter bubbles, and content bias. Existing approaches to classifying ideology are limited in that they require extensive human effort and the labeling of large datasets, and are not able to adapt to evolving ideological contexts. This paper explores the potential of Large Language Models (LLMs) for classifying the political ideology of online content through in-context learning (ICL). Our extensive experiments involving demonstration selection in a label-balanced fashion, conducted on three datasets comprising news articles and YouTube videos, reveal that our approach significantly outperforms zero-shot and traditional supervised methods. Additionally, we evaluate the influence of metadata (e.g., content source and descriptions) on ideological classification and discuss its implications. Finally, we show how providing the source for political and non-political content influences the LLM’s classification.
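
A label-balanced demonstration set of the kind used here can be assembled in a few lines; the random selection within each label is a simplification, since the paper's demonstration selection may use additional criteria.

    import random
    from collections import defaultdict

    def balanced_demonstrations(pool, k_per_label=2, seed=0):
        """Pick the same number of demonstrations per ideology label so the
        in-context prompt is label-balanced."""
        rng = random.Random(seed)
        by_label = defaultdict(list)
        for text, label in pool:
            by_label[label].append((text, label))
        demos = []
        for label, items in by_label.items():
            demos.extend(rng.sample(items, k_per_label))
        rng.shuffle(demos)
        return demos

    pool = [("article A", "left"), ("article B", "left"),
            ("article C", "right"), ("article D", "right")]
    print(balanced_demonstrations(pool, k_per_label=1))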

pdf bib
AnaToM: A Dataset Generation Framework for Evaluating Theory of Mind Reasoning Toward the Anatomy of Difficulty through Structurally Controlled Story Generation
Jundai Suzuki | Ryoma Ishigaki | Eisaku Maeda

Evaluating Theory of Mind (ToM) in Large Language Models (LLMs) is an important area of research for understanding the social intelligence of AI. Recent ToM benchmarks have made significant strides in enhancing the complexity, comprehensiveness, and practicality of evaluation. However, while the focus has been on constructing “more difficult” or “more comprehensive” tasks, there has been insufficient systematic analysis of the structural factors that inherently determine the difficulty of ToM reasoning—that is, “what” makes reasoning difficult. To address this challenge, we propose a new dataset generation framework for ToM evaluation named AnaToM. To realize an “Anatomy of Difficulty” in ToM reasoning, AnaToM strictly controls structural parameters such as the number of entities and the timeline in a story. This parameter control enables the isolation and identification of factors affecting the ToM of LLMs, allowing for a more precise examination of their reasoning mechanisms. The proposed framework provides a systematic methodology for diagnosing the limits of LLM reasoning abilities and offers new guidelines for future benchmark design.

pdf bib
Information-theoretic Distinctions Between Deception and Confusion
Robin Young

We propose an information-theoretic formalization of the distinction between two fundamental AI safety failure modes: deceptive alignment and goal drift. While both can lead to systems that appear misaligned, we demonstrate that they represent distinct forms of information divergence occurring at different interfaces in the human-AI system. Deceptive alignment creates entropy between an agent’s true goals and its observable behavior, while goal drift, or confusion, creates entropy between the intended human goal and the agent’s actual goal. Though often observationally equivalent, these failures necessitate different interventions. We present a formal model and an illustrative thought experiment to clarify this distinction. We offer a formal language for re-examining prominent alignment challenges observed in Large Language Models (LLMs), offering novel perspectives on their underlying causes.

pdf bib
Whispering in Ol Chiki: Cross-Lingual Transfer Learning for Santali Speech Recognition
Atanu Mandal | Madhusudan Ghosh | Pratick Maiti | Sudip Kumar Naskar

India, a country with a large population, possesses two official and twenty-two scheduled languages, making it one of the most linguistically diverse nations. Despite being one of the scheduled languages, Santali remains a low-resource language. Although Ol Chiki is recognized as the official script for Santali, many continue to use Bengali, Devanagari, Odia, and Roman scripts. In tribute to the upcoming centennial anniversary of the Ol Chiki script, we present an Automatic Speech Recognition system for Santali in the Ol Chiki script. Our approach involves cross-lingual transfer learning, adapting Whisper models pre-trained on Bengali and Hindi to the Santali language using Ol Chiki script transcriptions. With the adoption of the Bengali pre-trained framework, we achieved a Word Error Rate (WER) of 28.47%, whereas the adaptation of the Hindi pre-trained framework resulted in a WER of 34.50%. These outcomes were obtained using the Whisper Small framework.

pdf bib
Multilingual Political Views of Large Language Models: Identification and Steering
Daniil Gurgurov | Katharina Trinley | Ivan Vykopal | Josef Van Genabith | Simon Ostermann | Roberto Zamparelli

Large language models (LLMs) are increasingly used in everyday tools and applications, raising concerns about their potential influence on political views. While prior research has shown that LLMs often exhibit measurable political biases, frequently skewing toward liberal or progressive positions, key gaps remain. Most existing studies evaluate only a narrow set of models and languages, leaving open questions about the generalizability of political biases across architectures, scales, and multilingual settings. Moreover, few works examine whether these biases can be actively controlled. In this work, we address these gaps through a large-scale study of political orientation in modern open-source instruction-tuned LLMs. We evaluate seven models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using the Political Compass Test with 11 semantically equivalent paraphrases per statement to ensure robust measurement. Our results reveal that larger models consistently shift toward libertarian-left positions, with significant variations across languages and model families. To test the manipulability of political stances, we utilize a simple center-of-mass activation intervention technique and show that it reliably steers model responses toward alternative ideological positions across multiple languages. Our code is publicly available at https://github.com/d-gurgurov/Political-Ideologies-LLMs.
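
A center-of-mass activation intervention of the kind evaluated here can be sketched with a plain PyTorch forward hook: the steering vector is the difference between mean activations collected under two opposing stances, and it is added to a layer's output at inference. The layer, dimensions, and scaling factor below are placeholders rather than the paper's configuration.

    import torch

    # steering vector = difference of the two "centers of mass" of hidden
    # states collected from prompts answered in opposing ideological styles
    acts_pole_a = torch.randn(128, 768)   # placeholder activations, pole A
    acts_pole_b = torch.randn(128, 768)   # placeholder activations, pole B
    steer = acts_pole_a.mean(dim=0) - acts_pole_b.mean(dim=0)

    def make_hook(vector, alpha=4.0):
        """Forward hook that shifts a layer's output along the steering vector."""
        def hook(module, inputs, output):
            return output + alpha * vector
        return hook

    layer = torch.nn.Linear(768, 768)      # stands in for one transformer block
    handle = layer.register_forward_hook(make_hook(steer))
    shifted = layer(torch.randn(1, 768))   # every forward pass is now steered
    handle.remove()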

pdf bib
Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization
Iñigo Pikabea | Iñaki Lacunza | Oriol Pareras Velasco | Carlos Escolano | Aitor Gonzalez-Agirre | Javier Hernando | Marta Villegas

Rapid advancements in Visual Language Models (VLMs) have transformed multimodal understanding, but VLMs are often constrained to generating English responses regardless of the input language. This phenomenon has been termed Image-induced Fidelity Loss (IFL) and stems from limited multimodal multilingual training data. To address this, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model’s original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages without degradation in visual performance. We also explore model merging, which improves language fidelity but comes at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.

pdf bib
XL-DURel: Finetuning Sentence Transformers for Ordinal Word-in-Context Classification
Sachin Yadav | Dominik Schlechtweg

We propose XL-DURel, a finetuned, multilingual Sentence Transformer model optimized for ordinal Word-in-Context classification. We test several loss functions for regression and ranking tasks managing to outperform previous models on ordinal and binary data with a ranking objective based on angular distance in complex space. We further show that binary WiC can be treated as a special case of ordinal WiC and that optimizing models for the general ordinal task improves performance on the more specific binary task. This paves the way for a unified treatment of WiC modeling across different task formulations.
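
The quantity being ranked can be illustrated with real-valued embeddings; the paper's objective uses angular distance in complex space, so the snippet below only shows the basic real-valued form of the measure.

    import numpy as np

    def angular_distance(u, v):
        """Angular distance in [0, 1]: 0 = same direction, 1 = opposite."""
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

    # placeholder embeddings of the target word in two different contexts
    u, v = np.random.randn(768), np.random.randn(768)
    print(angular_distance(u, v))   # smaller = closer word usages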

pdf bib
HalluCounter: Reference-free LLM Hallucination Detection in the Wild!
Ashok Urlana | Gopichand Kanumolu | Charaka Vinayak Kumar | Bala Mallikarjunarao Garlapati | Rahul Mishra

Response consistency-based, reference-free hallucination detection (RFHD) methods do not depend on internal model states, such as generation probabilities or gradients, which Grey-box models typically rely on but are inaccessible in closed-source LLMs. However, their inability to capture query-response alignment patterns often results in lower detection accuracy. Additionally, the lack of large-scale benchmark datasets spanning diverse domains remains a challenge, as most existing datasets are limited in size and scope. To this end, we propose HalluCounter, a novel reference-free hallucination detection method that utilizes both response-response and query-response consistency and alignment patterns. This enables the training of a classifier that detects hallucinations and provides a confidence score and an optimal response for user queries. Furthermore, we introduce HalluCounterEval, a benchmark dataset comprising both synthetically generated and human-curated samples across multiple domains. Our method outperforms state-of-the-art approaches by a significant margin, achieving over 90% average confidence in hallucination detection across datasets.

pdf bib
Intrinsic Linguistic Bias in Formal vs. Informal Bengali Pragmatics with Progressive Context Inflation
Md Tanzib Hosain | Md Kishor Morol

The social biases inherent in language models necessitate a critical analysis of their social influence in many linguistic situations because of their extensive use. This study investigates gender bias in Bengali language models by highlighting the unique linguistic challenges posed by its complex morphology, dialectical variations, and distinctions between formal and informal language versions. While prior research on social bias in Bengali has provided foundational insights, it has not adequately addressed the nuances arising from these variations. This research extends to measuring intrinsic gender bias in both formal and informal Bengali, analyzing the impact of context lengths on bias detection, and proposing modifications to existing techniques to enhance their applicability to Bengali. Addressing these, the study aims to contribute to developing more inclusive and representative bias measurement methodologies for underrepresented languages. We open the source code and data at https://github.com/kraritt/b-bias-ctext.

pdf bib
Enhancing LLM-Based Molecular Captioning with Molecular Fingerprints
Keisuke Mizutani | Koriki Ryonosuke | Kento Tokuyama

The development of large language models (LLMs) has resulted in significant transformations in the field of chemistry, with potential applications in molecular science. Efforts to enhance pre-trained general-purpose LLMs have traditionally focused on techniques such as supervised fine-tuning (SFT) and retrieval-augmented generation (RAG) to improve model performance and tailor models to specific applications. These general-purpose extensions are actively researched, but their adaptation to the chemical domain has progressed little. This study advances the application of LLMs in molecular science by exploring SFT of LLMs and by developing RAG and multimodal models that incorporate molecular embeddings derived from molecular fingerprints and other properties. Experimental results show that a multimodal model with fingerprint inputs to the LLM achieved the highest overall performance. For molecular representation based on SMILES notation, fingerprints effectively capture the structural information of molecular compounds, demonstrating the applicability of LLMs in drug discovery research.

pdf bib
SiLQ: Simple Large Language Model Quantization-Aware Training
Steven K. Esser | Jeffrey L McKinstry | Deepika Bablani | Rathinakumar Appuswamy | Dharmendra Modha

Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of accuracy in reasonable time, and in particular to do so without requiring mechanisms incompatible with specialized inference accelerators. Here, we demonstrate a simple, end-to-end quantization-aware training approach that, with an increase in total model training budget of less than 0.1%, outperforms the leading published quantization methods by large margins on several modern benchmarks, with both base and instruct model variants. The approach easily generalizes across different model architectures, can be applied to activations, cache, and weights, and requires the introduction of no additional operations to the model other than the quantization itself.
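As a rough illustration of what quantization-aware training typically involves, the sketch below shows a fake-quantize step with a straight-through estimator, one common way to keep gradients flowing through the rounding operation. It is a generic building block under assumed settings (8-bit symmetric quantization, a fixed scale), not SiLQ's specific recipe.

    import torch

    class FakeQuantSTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, scale, n_bits):
            qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
            q = torch.clamp(torch.round(x / scale), qmin, qmax)
            return q * scale  # dequantized value used in the forward pass

        @staticmethod
        def backward(ctx, grad_output):
            # straight-through: pass the gradient unchanged through the rounding
            return grad_output, None, None

    x = torch.randn(4, 8, requires_grad=True)
    y = FakeQuantSTE.apply(x, torch.tensor(0.05), 8)
    y.sum().backward()
    print(x.grad.shape)  # torch.Size([4, 8]); gradients flow despite rounding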

pdf bib
Teaching by Failure: Counter-Example–Driven Curricula for Transformer Self-Improvement
Harshil Vejendla

Transformer models often exhibit brittle extrapolation, failing on inputs that are longer or structurally more complex than those seen during training. We introduce Counter-Example–Driven Curricula (CEDC), an automated framework that improves model robustness by iteratively focusing on its own failures. At each step, CEDC uses the current model to generate a diverse set of candidate problems, employs a fast, executable verifier to identify incorrect predictions (counter‐examples), and then fine‐tunes the model on a dataset enriched with these discovered failures. We evaluate CEDC on a suite of algorithmic and natural language tasks, including integer addition, sorting, Dyck‐2 language recognition, and three text classification benchmarks. Compared to static training and standard curriculum learning baselines, CEDC achieves up to 30× greater length extrapolation, is 3.75× more computationally efficient than uniform data augmentation, and requires no manual difficulty heuristics. We provide a detailed analysis of the counter‐examples, showing how the curriculum naturally adapts to target progressively more complex error modes. Our findings establish verifier‐guided, failure‐driven learning as a simple, powerful, and efficient paradigm for enhancing the generalization capabilities of Transformer models.
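The failure-mining loop described above can be sketched on the integer-addition task from the paper's evaluation suite. The stand-in model, candidate generator, and verifier below are toy placeholders meant only to show the generate-verify-collect cycle; the actual fine-tuning step is elided.

    import random

    def model_predict(a, b):
        # stand-in "model" that fails once operands get large
        return a + b if max(a, b) < 100 else a

    def generate_candidates(n, max_val, rng):
        return [(rng.randint(0, max_val), rng.randint(0, max_val)) for _ in range(n)]

    def verify(a, b, pred):
        return pred == a + b  # fast, executable verifier

    def mine_counter_examples(n=1000, max_val=10**4, seed=0):
        rng = random.Random(seed)
        failures = []
        for a, b in generate_candidates(n, max_val, rng):
            if not verify(a, b, model_predict(a, b)):
                failures.append((a, b, a + b))  # keep the corrected target
        return failures

    curriculum = mine_counter_examples()
    print(len(curriculum), "counter-examples collected for the next fine-tuning round")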

pdf bib
SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching
Xinye Zhao | Spyridon Mastorakis

As large language models (LLMs) continue to scale, the memory footprint of Key-Value (KV) caches during inference has become a significant bottleneck. Existing approaches primarily focus on compressing KV caches within a single prompt or reusing shared prefixes or frequently occurred text segments across prompts. However, such strategies are limited in scenarios where prompts are semantically similar but lexically different, which frequently occurs in tasks such as multi-document summarization and conversational agents. We propose SemShareKV, a KV cache sharing and compression framework that accelerates LLM inference by reusing KV cache in semantically similar prompts. Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using Locality-Sensitive Hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. By selectively reusing relevant KV pairs from a reference prompt’s cache, SemShareKV reduces redundant computation while maintaining output quality. Experiments on diverse summarization datasets show up to 6.25× speedup and 42% lower GPU memory usage with 5k tokens input, with negligible quality degradation. These results highlight the potential of semantic-aware cache sharing for efficient LLM inference.
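A minimal sketch of the fuzzy token-matching step: hash token embeddings with random hyperplanes so that similar tokens collide, then treat collisions as candidates for KV-cache reuse. The embedding source, signature width, and position-aligned comparison below are illustrative assumptions, not SemShareKV's full pipeline.

    import numpy as np

    def lsh_signatures(embeddings, n_planes=16, seed=0):
        rng = np.random.default_rng(seed)
        planes = rng.normal(size=(embeddings.shape[1], n_planes))
        bits = (embeddings @ planes) > 0
        # pack the sign bits into one integer signature per token
        return (bits * (1 << np.arange(n_planes))).sum(axis=1)

    rng = np.random.default_rng(1)
    ref_emb = rng.normal(size=(50, 64))                   # reference-prompt token embeddings
    new_emb = ref_emb + 0.01 * rng.normal(size=(50, 64))  # a semantically similar prompt
    matches = lsh_signatures(ref_emb) == lsh_signatures(new_emb)
    print(int(matches.sum()), "of 50 tokens matched; their cached KV entries could be reused")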

pdf bib
RewriteNets: End-to-End Trainable String-Rewriting for Generative Sequence Modeling
Harshil Vejendla

Dominant sequence models like the Transformer represent structure implicitly through dense attention weights, incurring quadratic complexity. We propose RewriteNets, a novel neural architecture built on an alternative paradigm: explicit, parallel string rewriting. Each layer in a RewriteNet contains a set of learnable rules. For each position in an input sequence, the layer performs four operations: (1) fuzzy matching of rule patterns, (2) conflict resolution via a differentiable assignment operator to select non-overlapping rewrites, (3) application of the chosen rules to replace input segments with output segments of potentially different lengths, and (4) propagation of untouched tokens. While the discrete assignment of rules is non-differentiable, we employ a straight-through Gumbel-Sinkhorn estimator, enabling stable end-to-end training. We evaluate RewriteNets on algorithmic, compositional, and string manipulation tasks, comparing them against strong LSTM and Transformer baselines. Results show that RewriteNets excel at tasks requiring systematic generalization (achieving 98.7% accuracy on the SCAN benchmark’s length split) and are computationally more efficient than Transformers. We also provide an analysis of learned rules and an extensive ablation study, demonstrating that this architecture presents a promising direction for sequence modeling with explicit structural inductive biases.

pdf bib
When in Doubt, Ask First: A Unified Retrieval Agent-Based System for Ambiguous and Unanswerable Question Answering
Long Nguyen | Quynh Vo | Hung Luu | Tho Quan

Large Language Models (LLMs) have shown strong capabilities in Question Answering (QA), but their effectiveness in high-stakes, closed-domain settings is often constrained by hallucinations and limited handling of vague or underspecified queries. These challenges are especially pronounced in Vietnamese, a low-resource language with complex syntax and strong contextual dependence, where user questions are often short, informal, and ambiguous. We introduce the Unified Retrieval Agent-Based System (URASys), a QA framework that combines agent-based reasoning with dual retrieval under the Just Enough principle to address standard, ambiguous, and unanswerable questions in a unified manner. URASys performs lightweight query decomposition and integrates document retrieval with a question–answer layer via a two-phase indexing pipeline, engaging in interactive clarification when intent is uncertain and explicitly signaling unanswerable cases to avoid hallucination. We evaluate URASys on Vietnamese and English QA benchmarks spanning single-hop, multi-hop, and real-world academic advising tasks, and release new dual-language ambiguous subsets for benchmarking interactive clarification. Results show that URASys outperforms strong retrieval-based baselines in factual accuracy, improves unanswerable handling, and achieves statistically significant gains in human evaluations for clarity and trustworthiness.

pdf bib
Smruti: Grammatical Error Correction for Gujarati using LLMs with Non-Parametric Memory
Vrund Dobariya | Jatayu Baxi | Bhavika Gambhava | Brijesh Bhatt

Grammatical Error Correction (GEC) is a fundamental task in Natural Language Processing that focuses on automatically detecting and correcting grammatical errors in text. In this paper, we present a novel approach to GEC for Gujarati, an Indian language spoken by over 55 million people worldwide. Our approach combines a large language model with non-parametric memory modules to address the low-resource challenge. We have evaluated our system on human-annotated and synthetic datasets, and the overall results are promising for Gujarati. The proposed approach is generic enough to be adopted for other languages. Furthermore, we release a publicly available evaluation dataset for Gujarati GEC along with an adapted version of the ERRANT framework to enable error-type-wise evaluation in Gujarati.

pdf bib
R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs
Sumin Jo | Junseong Choi | Jiho Kim | Edward Choi

Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks still suffer two practical drawbacks: they must be re-tuned whenever the KG or reasoning task changes, and they depend on a single, high-capacity LLM for reliable (i.e., trustworthy) reasoning. To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design is cost-efficient for LLM inference while still maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence is collected from KG, which significantly enhances reliability. Experiments across five diverse benchmarks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of LLMs used as the operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher-than-baseline reliability with reduced inference cost but increased abstention rate in complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning, reducing reliance on high-capacity LLMs while ensuring trustworthy inference.

pdf bib
Data Augmentation for Low-resource Neural Machine Translation: A Systematic Analysis
Zhiqiang Shi

As an effective way to address the data scarcity problem, data augmentation has received significant interest in low-resource neural machine translation, which in turn has the potential to reduce the digital divide and benefit out-of-domain translation. However, existing works mainly focus on how to generate the synthetic data, while the quality of the synthetic data and the way it is used also matter. In this paper, we give a systematic analysis of data augmentation for low-resource neural machine translation that encompasses all three aspects. We show that with careful control of the synthetic data quality and the way the synthetic data is used, performance can be greatly boosted even with the same method for generating the synthetic data.

pdf bib
LLM-based Business Process Models Generation from Textual Descriptions
Xiaoxuan Li | Lin Ni | Xin Wang | Tang Yitong | Ruoxuan Li | Jiamou Liu | Zhongsheng Wang

Business process modeling has traditionally depended on manual efforts or rigid rule-based techniques, limiting scalability and flexibility. Recent progress in Large Language Models (LLMs) enables automatic generation of process models from text, yet a systematic evaluation remains lacking. This paper explores the ability of LLMs to produce structurally and semantically valid business process workflows using five approaches: zero-shot, zero-shot CoT, few-shot, few-shot CoT, and fine-tuning. We assess performance under increasing control-flow complexity (e.g., nested gateways, parallel branches) using the MaD dataset, and introduce a masked-input setting to test semantic robustness. Results show that while fine-tuning achieves the best accuracy, few-shot CoT excels in handling complex logic and incomplete inputs. These findings reveal the strengths and limits of LLMs in process modeling and offer practical guidance for enterprise Business Process Management (BPM) automation.

pdf bib
Quriosity: Analyzing Human Questioning Behavior and Causal Inquiry through Curiosity-Driven Queries
Roberto Ceraolo | Dmitrii Kharlapenko | Ahmad Khan | Amélie Reymond | Rada Mihalcea | Bernhard Schölkopf | Mrinmaya Sachan | Zhijing Jin

Recent progress in Large Language Model (LLM) technology has changed our role in interacting with these models. Instead of primarily testing these models with questions we already know answers to, we are now using them for queries where the answers are unknown to us, driven by human curiosity. This shift highlights the growing need to understand curiosity-driven human questions – those that are more complex, open-ended, and reflective of real-world needs. To this end, we present Quriosity, a collection of 13K naturally occurring questions from three diverse sources: human-to-search-engine queries, human-to-human interactions, and human-to-LLM conversations. Our comprehensive collection enables a rich understanding of human curiosity across various domains and contexts. Our analysis reveals a significant presence of causal questions (up to 42%) in the dataset, for which we develop an iterative prompt improvement framework to identify all causal queries and examine their unique linguistic properties, cognitive complexity and source distribution. We also lay the groundwork for exploring efficient identifiers of causal questions, providing six efficient classification models.

pdf bib
Distillation versus Contrastive Learning: How to Train Your Rerankers
Zhichao Xu | Zhiqi Huang | Shengyao Zhuang | Vivek Srikumar

Training effective text rerankers is crucial for information retrieval. Two strategies are widely used: contrastive learning (optimizing directly on ground-truth labels) and knowledge distillation (transferring knowledge from a larger reranker). While both have been studied extensively, a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is needed. This paper empirically compares these strategies by training rerankers of different sizes (0.5B, 1.5B, 3B, 7B) and architectures (Transformer, Recurrent) using both methods on the same data, with a strong contrastive learning model acting as the distillation teacher. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a more performant teacher model. This finding is consistent across student model sizes and architectures. However, distilling from a teacher of the same capacity does not provide the same advantage, particularly for out-of-domain tasks. These findings offer practical guidance for choosing a training strategy based on available teacher models. We recommend using knowledge distillation to train smaller rerankers if a larger, more performant teacher is accessible; in its absence, contrastive learning remains a robust baseline. Our code implementation is made available to facilitate reproducibility.
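The two training signals compared in this paper can be sketched as follows: a contrastive loss (cross-entropy over candidates with a labeled positive) versus a KL-based distillation loss toward a teacher's scores. Shapes, the temperature, and the softmax-over-candidates framing are illustrative assumptions rather than the authors' exact objectives.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(student_scores, positive_idx):
        # student_scores: (batch, n_candidates) relevance scores; positive_idx: (batch,)
        return F.cross_entropy(student_scores, positive_idx)

    def distillation_loss(student_scores, teacher_scores, tau=1.0):
        # match the student's score distribution over candidates to the teacher's
        s = F.log_softmax(student_scores / tau, dim=-1)
        t = F.softmax(teacher_scores / tau, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * tau * tau

    student = torch.randn(4, 8, requires_grad=True)
    teacher = torch.randn(4, 8)
    positives = torch.zeros(4, dtype=torch.long)  # index of the labeled positive per query
    print(contrastive_loss(student, positives).item(),
          distillation_loss(student, teacher).item())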

pdf bib
A Word-Splitting Approach to Kannada Sanskrit Sandhi Words Useful in Effective English Translation
Shanta Kallur | Basavaraj S. Anami

Natural Language Processing is a branch of artificial intelligence that enables man-machine interactions through regional languages. In Kannada, there are two types of Sandhi: Kannada Sandhi and Sanskrit Sandhi. A morph-phonemic word “Sandhi” is created when two words or distinct morphemes are joined or combined. Conversely, Sandhi word splitting reverses this process. Rules governing Sandhi exist across all the Dravidian languages. A rule-based method has been developed to split Sanskrit Sandhi words into their components within Kannada sentences. Once the Sanskrit Sandhi (SS) words are split, the type of Sandhi is also identified, facilitating accurate translation of the Sanskrit Sandhi words into English. This paper discusses seven types of SS words: SavarNadeergha, YaN, GuNa, Vruddhi, Jatva, Shchutva and Anunasika Sandhi. The identified split points adhere precisely to Sandhi rules. A dataset of 4900 Sanskrit Sandhi words found in Kannada sentences was used to evaluate the proposed method, which achieved an accuracy of 90.03% for Sanskrit Sandhi identification and 85.87% for reliable English translation. This work has potential applications in other Dravidian languages.

pdf bib
Generating Questions Under Discussion with Reinforcement Learning using Ranking and Scoring for Reward and Evaluation
Kelvin Han | Claire Gardent

There is growing research interest in Questions Under Discussion (QUD), a linguistic framework for representing discourse in the form of natural language question-answer pairs, which are more easily understandable and have been found useful in several applications. Our goal in this work is to improve the quality of automatic QUD generation. To sidestep the current paucity of data, we propose a reinforcement learning-based approach using the Group Relative Policy Optimisation (GRPO) objective for LLM post-training on the task. To get there, we: (i) carefully investigated five promising methods for reference-free automatic QUD evaluation, (ii) proposed a novel prompting strategy, SCRS, involving ranking and scoring with structured outputs that enables QUD evaluation close to the human upperbound, (iii) leveraged findings from (i) with (ii) for the knowledge distillation from a very large LLM to obtain a more resource-efficient reward model, and which (iv) we then used in the GRPO post-training for 3B LLMs on the QUD generation task. Our QUD generators give overall higher-quality QUDs compared to the SOTA which is based on supervised fine-tuning; all of these are achieved using only three annotated exemplars in the few-shot prompting for evaluation, and without the use of any other annotated questions for training the QUD generators. Our code, models, and annotated examples can be found at https://github.com/hankelvin/grpo_qud_generation.

pdf bib
The Feasibility of Topic-Based Watermarking on Academic Peer Reviews
Alexander Nemecek | Yuzhou Jiang | Erman Ayday

Large language models (LLMs) are increasingly integrated into academic workflows, with many conferences and journals permitting their use for tasks such as language refinement and literature summarization. However, their use in peer review remains prohibited due to concerns around confidentiality breaches, hallucinated content, and inconsistent evaluations. As LLM-generated text becomes more indistinguishable from human writing, there is a growing need for reliable attribution mechanisms to preserve the integrity of the review process. In this work, we evaluate topic-based watermarking (TBW), a semantic-aware technique designed to embed detectable signals into LLM-generated text. We conduct a systematic assessment across multiple LLM configurations, including base, few-shot, and fine-tuned variants, using authentic peer review data from academic conferences. Our results show that TBW maintains review quality relative to non-watermarked outputs, while demonstrating robust detection performance under paraphrasing. These findings highlight the viability of TBW as a minimally intrusive and practical solution for LLM attribution in peer review settings.

pdf bib
Towards Attribution of Generators and Emotional Manipulation in Cross-Lingual Synthetic Speech using Geometric Learning
Girish | Mohd Mujtaba Akhtar | Farhan Sheth | Muskaan Singh

In this work, we address the problem of fine-grained traceback of emotional and manipulation characteristics from synthetically manipulated speech. We hypothesize that combining semantic–prosodic cues captured by Speech Foundation Models (SFMs) with fine-grained spectral dynamics from auditory representations can enable more precise tracing of both emotion and manipulation source. To validate this hypothesis, we introduce MiCuNet, a novel multitask framework for fine-grained tracing of emotional and manipulation attributes in synthetically generated speech. Our approach integrates SFM embeddings with spectrogram-based auditory features through a mixed-curvature projection mechanism that spans Hyperbolic, Euclidean, and Spherical spaces guided by a learnable temporal gating mechanism. Our proposed method adopts a multitask learning setup to simultaneously predict original emotions, manipulated emotions, and manipulation sources on the Emo-Fake dataset (EFD) across both English and Chinese subsets. MiCuNet yields consistent improvements, surpassing conventional fusion strategies. To the best of our knowledge, this work presents the first study to explore a curvature-adaptive framework specifically tailored for multitask tracking in synthetic speech.

pdf bib
Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models
Ercong Nie | Shuzhou Yuan | Bolei Ma | Helmut Schmid | Michael Färber | Frauke Kreuter | Hinrich Schuetze

Probing the multilingual knowledge of linguistic structure in LLMs, often characterized as sequence labeling, faces challenges with maintaining output templates in current text-to-text prompting strategies. To solve this, we introduce a decomposed prompting approach for sequence labeling tasks. Diverging from the single text-to-text prompt, our prompt method generates for each token of the input sentence an individual prompt which asks for its linguistic label. We test our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages, using both English-centric and multilingual LLMs. Our findings show that decomposed prompting surpasses the iterative prompting baseline in efficacy and efficiency under zero- and few-shot settings. Moreover, our analysis of multilingual performance of English-centric LLMs yields insights into the transferability of linguistic knowledge via multilingual prompting.
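A small sketch of the decomposed prompting idea: instead of one text-to-text prompt for the whole sentence, build one label query per token. The prompt wording and tagset below are illustrative, not the paper's exact template.

    def decomposed_pos_prompts(sentence, tagset):
        tokens = sentence.split()
        prompts = []
        for tok in tokens:
            prompts.append(
                f'Sentence: "{sentence}"\n'
                f'What is the part-of-speech tag of the word "{tok}"? '
                f"Choose one of: {', '.join(tagset)}.\nAnswer:"
            )
        return tokens, prompts

    tokens, prompts = decomposed_pos_prompts(
        "The cat sat on the mat", ["DET", "NOUN", "VERB", "ADP"]
    )
    print(prompts[1])  # the prompt that asks only about the token "cat"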

pdf bib
Moral Self-correction is Not An Innate Capability in Language Models
Guangliang Liu | Zimo Qi | Xitong Zhang | Lu Cheng | Kristen Johnson

Although there has been growing interest in the self-correction capability of Large Language Models (LLMs), there are varying conclusions about its effectiveness. Prior research has largely concentrated on intrinsic self-correction; extrinsic self-correction, particularly the interplay between internal knowledge and external feedback, remains underexplored. In this paper, we aim to comprehensively investigate the underlying mechanism of moral self-correction by addressing a fundamental question: is moral self-correction an innate capability of LLMs? Specifically, we conduct: (1) a behavioral analysis of LLMs’ moral sensitivity based on a self-distinguishing task; and (2) a mechanistic analysis of the hidden states to examine how key components of self-correction, such as Chain-of-Thought (CoT) and external feedback, interact to facilitate moral self-correction. Drawing on empirical evidence from both behavioral and mechanistic analyses, we demonstrate that moral self-correction is not an inherent capability of LLMs, as they are neither morally sensitive nor able to effectively incorporate external feedback during the self-correction process.

pdf bib
LLM-Empowered Patient-Provider Communication: A Data-Centric Survey From a Clinical Perspective
Ruosi Shao | Md Shamim Seraj | Kangyi Zhao | Yingtao Luo | Lincan Li | Bolin Shen | Averi Bates | Yue Zhao | Chongle Pan | Lisa Hightow-Weidman | Shayok Chakraborty | Yushun Dong

Large language models (LLMs) hold promise for advancing patient–provider communication, yet a persistent gap remains between benchmark-driven model development and the realities of clinical practice. This work presents a systematic, clinically grounded review of text-based medical datasets for LLM training and evaluation. We propose a scenario-based taxonomy derived from established clinical frameworks to map major knowledge-based and conversation-based corpora against core communication scenarios. We further synthesize core communication skills from gold-standard clinical assessment instruments and meta-analyze state-of-the-art medical LLM performance, highlighting how dataset properties, fine-tuning strategies, and evaluation metrics shape both knowledge acquisition and communicative competence. To empirically validate these findings, we conducted controlled fine-tuning experiments across representative LLMs, demonstrating that data composition and scenario alignment critically affect model performance. Our findings highlight the urgent need for scenario-rich datasets and standardized, human-centered evaluation protocol to advance clinically relevant medical LLMs.

pdf bib
Enhancing Scene Transition Awareness in Video Generation via Post-Training
Hanwen Shen | Jiajie Lu | Yupeng Cao | Xiaonan Yang

Recent advances in AI-generated video have shown strong performance on text-to-video tasks, particularly for short clips depicting a single scene. However, current models struggle to generate longer videos with coherent scene transitions, primarily because they cannot infer when a transition is needed from the prompt. Most open-source models are trained on datasets consisting of single-scene video clips, which limits their capacity to learn and respond to prompts requiring multiple scenes. Developing scene transition awareness is essential for multi-scene generation, as it allows models to identify and segment videos into distinct clips by accurately detecting transitions. To address this, we introduce the Transition-Aware Video (TAV) dataset with multi-scene clips and captions that explicitly state scene segmentation and transition structure. Our focus is on how prompt semantics and dataset annotations about temporal context affect text-to-video generation. Post-training on TAV improves alignment between the scene count implied by the prompt and the scene count produced by the model, while preserving visual quality.

pdf bib
DoubleDipper: Recycling Contexts for Efficient and Attributed In-Context Learning
Arie Cattan | Alon Jacovi | Alex Fabrikant | Jonathan Herzig | Roee Aharoni | Hannah Rashkin | Dror Marcus | Avinatan Hassidim | Yossi Matias | Idan Szpektor | Avi Caciularu

In this work, we propose DoubleDipper, a novel In-Context-Learning method that automatically generates few-shot examples for several QA tasks by _recycling_ contexts. Specifically, given an input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations are leveraging the same context as the target query while only adding a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to _explicitly_ identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method on multiple LLMs and obtain substantial improvements (+16 absolute points on average across models) on various QA datasets. Surprisingly, despite introducing only single-hop ICL examples, LLMs successfully generalize to multi-hop QA using our approach.

pdf bib
PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems
Oshayer Siddique | J. M Areeb Uzair Alam | Md Jobayer Rahman Rafy | Syed Rifat Raiyan | Hasan Mahmud | Md Kamrul Hasan

The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems—a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.

pdf bib
Attention Overflow: Language Model Input Blur during Long-Context Missing Items Identification
Damien Sileo

Large language models (LLMs) can suggest missing elements from items listed in a prompt, which can be used for list completion or similar item recommendation. However, their performance degrades when they are exposed to too many items, as they start to suggest items already included in the input list. This occurs at around 100 items for mid-2024 flagship LLMs. We evaluate this phenomenon on both synthetic problems (e.g., finding missing numbers in a given range of shuffled integers) and realistic movie recommendation scenarios. We refer to this issue as “attention overflow”, as avoiding repetition requires attending to all items simultaneously. Although iterative loops can mitigate this problem, their costs increase with the repetition rate, affecting the language models’ ability to derive novelty from lengthy inputs.
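The synthetic probe described above is easy to reproduce in outline: shuffle a range of integers, hide a few, and ask which are missing, so that repeated suggestions of items already present can be counted. The list size and number of gaps below are arbitrary illustrative parameters.

    import random

    def make_missing_items_instance(n=100, n_missing=3, seed=0):
        rng = random.Random(seed)
        items = list(range(n))
        missing = sorted(rng.sample(items, n_missing))
        shown = [x for x in items if x not in set(missing)]
        rng.shuffle(shown)
        prompt = (
            f"The list below should contain every integer from 0 to {n - 1} "
            f"in shuffled order, but some are missing: {shown}\n"
            "Which integers are missing?"
        )
        return prompt, missing

    prompt, gold = make_missing_items_instance()
    print(gold)  # gold answer, used to count repeated (already present) suggestions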

pdf bib
Isolating Culture Neurons in Multilingual Large Language Models
Danial Namazifard | Lukas Galke Poech

Language and culture are deeply intertwined, yet it has been unclear how and where multilingual large language models encode culture. Here, we build on an established methodology for identifying language-specific neurons to localize and isolate culture-specific neurons, carefully disentangling their overlap and interaction with language-specific neurons. To facilitate our experiments, we introduce MUREL, a curated dataset of 85.2 million tokens spanning six different cultures. Our localization and intervention experiments show that LLMs encode different cultures in distinct neuron populations, predominantly in upper layers, and that these culture neurons can be modulated largely independently of language-specific neurons or those specific to other cultures. These findings suggest that cultural knowledge and propensities in multilingual language models can be selectively isolated and edited, with implications for fairness, inclusivity, and alignment. Code and data are available at https://github.com/namazifard/Culture_Neurons

pdf bib
Benchmarking Bangla Causality: A Dataset of Implicit and Explicit Causal Sentences and Cause-Effect Relations
Diya Saha | Sudeshna Jana | Manjira Sinha | Tirthankar Dasgupta

Causal reasoning is central to language understanding, yet remains under-resourced in Bangla. In this paper, we introduce the first large-scale dataset for causal inference in Bangla, consisting of over 11663 sentences annotated for causal sentence types (explicit, implicit, non-causal) and token-level spans for causes, effects, and connectives. The dataset captures both simple and complex causal structures across diverse domains such as news, education, and health. We further benchmark a suite of state-of-the-art instruction-tuned large language models, including LLaMA 3.3 70B, Gemma 2 9B, Qwen 32B, and DeepSeek, under zero-shot and three-shot prompting conditions. Our analysis reveals that while LLMs demonstrate moderate success in explicit causality detection, their performance drops significantly on implicit and span-level extraction tasks. This work establishes a foundational resource for Bangla causal understanding and highlights key challenges in adapting multilingual LLMs for structured reasoning in low-resource languages.

pdf bib
Multi-Modal Data Exploration via Language Agents
Farhad Nooralahzadeh | Yi Zhang | Jonathan Fürst | Kurt Stockinger

International enterprises, organizations, and hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying both structured databases and unstructured modalities (e.g., texts, images) in natural language remains largely unexplored. In this paper, we propose M2EX, a system that enables multi-modal data exploration via language agents. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) M2EX leverages an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis and to orchestrate modality-specific experts in an efficient query plan. (3) Experimental results on multi-modal datasets, encompassing relational data, text, and images, demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling in both accuracy and various performance metrics, including query latency, API costs, and planning efficiency, thanks to the more effective utilization of the reasoning capabilities of LLMs.

pdf bib
Fine-Grained Detection of AI-Generated Text Using Sentence-Level Segmentation
L D M S Sai Teja | Annepaka Yadagiri | Partha Pakray | Chukhu Chunka | Mangadoddi Srikar Vardhan

Generating text with Artificial Intelligence (AI) for important work has become common practice, and such generation can be misused and abused at various levels. Traditional AI detectors often rely on document-level classification, which struggles to identify AI content in hybrid or slightly edited texts designed to avoid detection, making it hard to distinguish between human-written and AI-generated text. We propose a sentence-level sequence labeling model that detects transitions between human- and AI-generated text, leveraging nuanced linguistic signals overlooked by document-level classifiers. This method detects and segments AI- and human-written text within a single document at token-level granularity. Our model combines state-of-the-art pre-trained Transformer models with Neural Networks (NNs) and Conditional Random Fields (CRFs). The Transformer extracts semantic and syntactic patterns, while the neural network component captures enhanced sequence-level representations, improving the boundary predictions of the CRF layer and thereby the recognition of the partition between human- and AI-generated text. The evaluation is performed on two publicly available benchmark datasets containing collaborative human- and AI-generated texts. We compare against zero-shot detectors and existing state-of-the-art models and conduct rigorous ablation studies, showing that this approach can accurately detect spans of AI-generated text within fully collaborative documents.

pdf bib
Incorporating Dialogue State Tracking into Japanese Full-duplex Task-oriented Spoken Dialogue Model
Yuya Chiba | Ryuichiro Higashinaka

Full-duplex spoken dialogue models, which process audio input and output simultaneously, have been actively studied for their ability to naturally model turn-taking and non-verbal phenomena in addition to generating responses. Although these models enable natural conversational flow, they lack mechanisms for language understanding and dialogue management, making them difficult to apply to task-oriented dialogue systems. We propose a method for incorporating dialogue state tracking in task-oriented dialogue into Moshi, aiming to achieve a multi-channel, full-duplex task-oriented spoken dialogue model. We evaluated the proposed method on JMultiWOZ, a benchmark corpus for Japanese task-oriented dialogue, focusing on dialogue state tracking and response generation.

pdf bib
SeqTNS: Sequential Tolerance-based Classifier for Identification of Rhetorical Roles in Indian Legal Documents
Arjun T D | Anand Kumar Madasamy | Sheela Ramanna

Identifying rhetorical roles in legal judgments is a foundational step for automating legal reasoning, summarization, and retrieval. In this paper, we propose a novel Sequential Tolerance-based Classifier (SeqTNS) for rhetorical role classification in Indian legal documents. The proposed classifier leverages semantic similarity and contextual dependencies by using label-sequence-aware BiLSTMs on top of word embeddings from a fine-tuned InLegalBERT model. These enriched embeddings are clustered into tolerance classes via a tolerance relation using a cosine distance threshold, enabling the model to make flexible, similarity-based predictions. We evaluate SeqTNS on two benchmark datasets annotated with thirteen and seven rhetorical roles, respectively. The proposed method outperforms fine-tuned transformer baselines (LegalBERT, InLegalBERT) as well as the previously developed tolerance relation-based (TNS) model, achieving a weighted F1 score of 0.78 on the thirteen-class dataset and a macro F1 of 0.83 on the seven-class dataset, while reducing training time by 39-40% compared to state-of-the-art BiLSTM-CRF models. The larger of our two datasets is substantial, containing over 40,000 sentences and 1.3M tokens, and serves as a challenging real-world benchmark. Additionally, we use LIME for explainability and t-SNE to validate the coherence of tolerance-based clusters.
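The tolerance-class step can be sketched as grouping sentence embeddings whose pairwise cosine distance stays under a threshold. The greedy grouping and the threshold value below are simplifications for illustration; the paper's tolerance relation and training pipeline are richer than this.

    import numpy as np

    def cosine_distance(a, b):
        return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    def tolerance_classes(embeddings, threshold=0.6):
        classes = []  # each class: indices of mutually tolerated embeddings
        for i, e in enumerate(embeddings):
            for cls in classes:
                if all(cosine_distance(e, embeddings[j]) <= threshold for j in cls):
                    cls.append(i)
                    break
            else:
                classes.append([i])
        return classes

    rng = np.random.default_rng(0)
    sentence_embeddings = rng.normal(size=(12, 32))
    print(tolerance_classes(sentence_embeddings))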

pdf bib
Persona is a Double-Edged Sword: Rethinking the Impact of Role-play Prompts in Zero-shot Reasoning Tasks
Junseok Kim | Nakyeong Yang | Kyomin Jung

Recent studies have shown that prompting large language models (LLMs) with role-playing personas can enhance their reasoning capabilities. While the benefits of role-playing personas in reasoning tasks are widely recognized, it remains uncertain whether a persona aligned with the given dataset can consistently achieve these improvements. In this work, we empirically investigate the potential drawbacks of using dataset-aligned personas (referred to as **coarsely aligned personas**) and introduce Jekyll & Hyde, a novel framework that enhances reasoning robustness by ensembling solutions from both role-playing and neutral (non-persona) prompts. Jekyll & Hyde first predicts an instance-specific persona tailored to each query using an LLM, then generates answers with both persona and neutral prompts, and finally selects the superior output through an LLM-based evaluator. Experimental results show that across twelve widely used natural language reasoning datasets and three backbone large language models, Jekyll & Hyde consistently outperforms single-perspective LLMs, achieving an average accuracy gain of **9.98%** on GPT-4. We further demonstrate that using instance-aligned personas yields more accurate and stable performance than using dataset-aligned personas.

pdf bib
CodeEval: A pedagogical approach for targeted evaluation of code-trained Large Language Models
Danny Brahman | Mohammad Mahoor

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated remarkable advancements in logical reasoning, there remains a significant gap in evaluating their code generation capabilities. Existing benchmark datasets fall short in pinpointing specific strengths and weaknesses, impeding targeted enhancements in models’ reasoning abilities to synthesize code. To bridge this gap, our paper introduces an innovative, pedagogical benchmarking method that mirrors the evaluation processes encountered in academic programming courses. We introduce CodeEval, a multi-dimensional benchmark dataset designed to rigorously evaluate LLMs across 24 distinct aspects of Python programming. The dataset covers three proficiency levels—beginner, intermediate, and advanced—and includes both class-based and function-based problem types with detailed problem specifications and comprehensive test suites. To facilitate widespread adoption, we also developed RunCodeEval, an open-source execution framework that provides researchers with a ready-to-use evaluation pipeline for CodeEval. RunCodeEval handles test execution, context setup, and metrics generation, enabling researchers to quickly obtain detailed insights into model strengths and weaknesses across complexity levels, problem types, and programming categories. This combination enables targeted evaluation and guides improvements in LLMs’ programming proficiencies.

pdf bib
WildFireCan-MMD: A Multimodal Dataset for Classification of User-Generated Content During Wildfires in Canada
Braeden Sherritt | Isar Nejadgholi | Efstratios Aivaliotis | Khaled Mslmani | Marzieh Amini

Rapid information access is vital during wildfires, yet traditional data sources are slow and costly. Social media offers real-time updates, but extracting relevant insights remains a challenge. In this work, we focus on multimodal wildfire social media data, which, although existing in current datasets, is currently underrepresented in Canadian contexts. We present WildFireCan-MMD, a new multimodal dataset of X posts from recent Canadian wildfires, annotated across twelve key themes. We evaluate zero-shot vision-language models on this dataset and compare their results with those of custom-trained and baseline classifiers. We show that while baseline methods and zero-shot prompting offer quick deployment, custom-trained models outperform them when labelled data is available. Our best-performing custom model reaches 84.48±0.69% f-score, outperforming VLMs and baseline classifiers. We also demonstrate how this model can be used to uncover trends during wildfires, through the collection and analysis of a large unlabeled dataset. Our dataset facilitates future research in wildfire response, and our findings highlight the importance of tailored datasets and task-specific training. Importantly, such datasets should be localized, as disaster response requirements vary across regions and contexts.

pdf bib
Planning Agents on an Ego-Trip: Leveraging Hybrid Ego-Graph Ensembles for Improved Tool Retrieval in Enterprise Task Planning
Sahil Bansal | Sai Shruthi Sistla | Aarti Arikatala | Sebastian Schreiber

Effective tool pre-selection via retrieval is essential for AI agents to select from a vast array of tools when identifying and planning actions in the context of complex user queries. Despite its central role in planning, this aspect remains underexplored in the literature. Traditional approaches rely primarily on similarities between user queries and tool descriptions, which significantly limits retrieval accuracy, specifically when handling multi-step user requests. To address these limitations, we propose a Knowledge Graph (KG)-based tool retrieval framework that captures the semantic relationships between tools and their functional dependencies. Our retrieval algorithm leverages ensembles of 1-hop ego tool graphs to model direct and indirect connections between tools, enabling more comprehensive and contextual tool selection for multi-step tasks. We evaluate our approach on a synthetically generated internal dataset across six defined user classes, extending previous work on coherent dialogue synthesis and tool retrieval benchmarks. Results demonstrate that our tool graph-based method achieves 91.85% tool coverage on the micro-average CompleteRecall metric, compared to 89.26% for re-ranked semantic-lexical hybrid retrieval, the strongest non-KG baseline in our experiments. These findings support our hypothesis that the structural information modeled in the graph provides complementary signals to pure similarity matching, particularly for queries requiring sequential tool composition.

pdf bib
Unveiling the Influence of Amplifying Language-Specific Neurons
Inaya Rahmanisa | Lyzander Marciano Andrylie | Mahardika Krisna Ihsani | Alfan Farizki Wicaksono | Haryo Akbarianto Wibowo | Alham Fikri Aji

Language-specific neurons, units in LLMs that strongly correlate with individual languages, have been shown to influence model behavior when deactivated. However, their role in amplification remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate the intervention on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification steering factors effectively steer output toward nearly all tested languages. Intervention using these factors on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons on multilingual behavior, where amplification can be beneficial, especially for low-resource languages, but provides limited advantage for cross-lingual transfer.

pdf bib
Enhancing Coreference Resolution with LLM-driven Data Augmentation and Adversarial Filtering
Dohyeon Kim | Gayeon Jung | Jeongseon Cho | Jihoon Yang

Coreference resolution is a fundamental task in natural language processing that involves linking different references to the same entity within a text. However, existing models often struggle to reliably identify referential relationships in contexts with extensive length or complex modifiers. This study proposes a data augmentation technique adding adjective phrases and employing a prompt-based adversarial filtering pipeline to address these challenges. Specifically, we generated and inserted contextually appropriate adjective phrases through the interaction between GPT-4o-mini based Few-shot Prompting and a Discriminative Language Model. The grammatical and semantic consistency of these phrases was validated via human evaluation and inter-annotator agreement (IAA) procedures. The generated synthetic dataset was integrated with existing data, leading to enhanced model performance. On the LitBank dataset, the CoNLL-F1 score increased by up to 1.7%, while the synthetic dataset improved linguistic diversity and the complexity of referential structures. The proposed pipeline represents a significant step towards developing coreference resolution models capable of better capturing linguistic variety and demonstrating robustness under challenging conditions.

pdf bib
TathyaNyaya and FactLegalLlama: Advancing Factual Judgment Prediction and Explanation in the Indian Legal Context
Shubham Kumar Nigam | Balaramamahanthi Deepak Patnaik | Shivam Mishra | Noel Shallum | Kripabandhu Ghosh | Arnab Bhattacharya

In the legal domain, Fact-based Judgment Prediction and Explanation (FJPE) aims to predict judicial outcomes and generate grounded explanations using only factual information, mirroring early-phase legal reasoning. Motivated by the overwhelming case backlog in the Indian judiciary, we introduce TathyaNyaya, the first large-scale, expert-annotated dataset for FJPE in the Indian context. Covering judgments from the Supreme Court and multiple High Courts, the dataset comprises four complementary components, NyayaFacts, NyayaScrape, NyayaSimplify, and NyayaFilter, that facilitate diverse factual modeling strategies. Alongside, we present FactLegalLlama, an instruction-tuned LLaMa-3-8B model fine-tuned to generate faithful, fact-grounded explanations. While FactLegalLlama trails transformer baselines in raw prediction accuracy, it excels in generating interpretable explanations, as validated by both automatic metrics and legal expert evaluation. Our findings show that fact-only inputs and preprocessing techniques like text simplification and fact filtering can improve both interpretability and predictive performance. Together, TathyaNyaya and FactLegalLlama establish a robust foundation for realistic, transparent, and trustworthy AI applications in the Indian legal system.

pdf bib
How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness
Darshita Rathore | Vineet Kumar | Chetna Bansal | Anindya Moitra

Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.
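A rank sweep of the kind described above might be set up with the PEFT library roughly as follows; the base model name, target modules, and rank grid are placeholders rather than the paper's exact configuration.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    BASE = "meta-llama/Llama-3.2-1B"  # placeholder base model, not the paper's

    for rank in (4, 8, 16, 32, 64):
        model = AutoModelForCausalLM.from_pretrained(BASE)
        config = LoraConfig(
            r=rank,
            lora_alpha=2 * rank,
            lora_dropout=0.05,
            target_modules=["q_proj", "v_proj"],  # attention projections only
            task_type="CAUSAL_LM",
        )
        peft_model = get_peft_model(model, config)
        peft_model.print_trainable_parameters()
        # ... fine-tune here, then evaluate on in-domain and out-of-domain Q&A ...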

pdf bib
LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification
Shuzhou Yuan | Ercong Nie | Lukas Kouba | Helmut Schmid | Hinrich Schuetze | Michael Färber

Detoxification, the task of rewriting harmful language into non-toxic text, has become increasingly important amid the growing prevalence of toxic content online. However, high-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation. In this paper, we propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification. We first replicate the ParaDetox pipeline by replacing human annotators with LLM and show that LLM performs comparably to the human annotation. Building on this, we construct ParaDeHate, a large-scale parallel dataset specifically for hate speech detoxification. We release ParaDeHate as a benchmark of over 8,000 hate/non-hate text pairs and evaluate a wide range of baseline methods. Experimental results show that models such as BART fine-tuned on ParaDeHate achieve better performance in style accuracy, content preservation, and fluency, demonstrating the effectiveness of LLM-generated detoxification text as a scalable alternative to human annotation.

pdf bib
From Monolingual to Bilingual: Investigating Language Conditioning in Large Language Models for Psycholinguistic Tasks
Shuzhou Yuan | Zhan Qu | Mario Tawfelis | Michael Färber

Large Language Models (LLMs) exhibit strong linguistic capabilities, but little is known about how they encode psycholinguistic knowledge across languages. We investigate whether and how LLMs exhibit human-like psycholinguistic responses under different linguistic identities using two tasks: sound symbolism and word valence. We evaluate two models, Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct, under monolingual and bilingual prompting in English, Dutch, and Chinese. Behaviorally, both models adjust their outputs based on prompted language identity, with Qwen showing greater sensitivity and sharper distinctions between Dutch and Chinese. Probing analysis reveals that psycholinguistic signals become more decodable in deeper layers, with Chinese prompts yielding stronger and more stable valence representations than Dutch. Our results demonstrate that language identity conditions both output behavior and internal representations in LLMs, providing new insights into their application as models of cross-linguistic cognition.

pdf bib
Signs of Struggle: Spotting Cognitive Distortions across Language and Register
Abhishek Kuber | Enrico Liscio | Ruixuan Zhang | Caroline Figueroa | Pradeep K. Murukannaiah

Rising mental health issues among youth have increased interest in automated approaches for detecting early signs of psychological distress in digital text. One key focus is the identification of cognitive distortions, irrational thought patterns that play a role in aggravating mental distress. Early detection of these distortions may enable timely, low-cost interventions. While prior work has focused on English clinical data, we present the first in-depth study of cross-lingual and cross-register generalization of cognitive distortion detection, analyzing forum posts written by Dutch adolescents. Our findings show that while changes in language and writing style can significantly affect model performance, domain adaptation methods show the most promise.

pdf bib
FB-RAG: Improving RAG with Forward and Backward Lookup
Kushal Chawla | Alfy Samuel | Anoop Kumar | Daben Liu

Traditional Retrieval-Augmented Generation (RAG) struggles with complex queries that lack strong signals to retrieve the most relevant context, forcing a trade-off between choosing a small context that misses key information and a large context that confuses the LLM. To address this, we propose Forward-Backward RAG (FB-RAG), a new training-free framework based on a simple yet powerful forward-looking strategy. FB-RAG employs a light-weight LLM to peek into potential future generations, using evidence from multiple sampled outputs to precisely identify the most relevant context for a final, more powerful generator. This improves performance without the complex fine-tuning or reinforcement learning common in prior work. Across 9 datasets from LongBench and Bench, FB-RAG consistently delivers strong results. Further, the performance gains can be achieved with reduced latency due to a shorter, more focused prompt for the powerful generator. On the EN.QA dataset, FB-RAG matches the leading baseline with over 48% latency reduction or achieves an 8% performance improvement with a 10% latency reduction. Our analysis finds cases where, even when the forward-looking LLM fails to generate correct answers, its attempts are sufficient to guide the final model to an accurate response, demonstrating how smaller LLMs can systematically improve the performance and efficiency of larger ones.
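The forward-looking step can be sketched as: sample a few draft answers with a light model, then score each retrieved chunk by its overlap with both the query (backward signal) and the drafts (forward signal). The lexical-overlap scorer, weighting, and sample_draft stub below are stand-ins for whatever scoring FB-RAG actually uses.

    def overlap(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / (len(wa | wb) or 1)

    def rank_chunks(query, chunks, sample_draft, n_samples=3, w_forward=0.7):
        drafts = [sample_draft(query) for _ in range(n_samples)]
        scored = []
        for c in chunks:
            backward = overlap(query, c)                   # query-to-chunk signal
            forward = max(overlap(d, c) for d in drafts)   # draft-to-chunk signal
            scored.append((w_forward * forward + (1 - w_forward) * backward, c))
        return [c for _, c in sorted(scored, reverse=True)]

    chunks = ["The Eiffel Tower is in Paris.", "Bananas are rich in potassium."]
    ranked = rank_chunks("Where is the Eiffel Tower?", chunks,
                         sample_draft=lambda q: "It is in Paris, France.")
    print(ranked[0])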

pdf bib
Emotion-Aware Dysarthric Speech Reconstruction: LLMs and Multimodal Evaluation with MCDS
Kaushal Attaluri | Radhika Mamidi | Sireesha Chittepu | Anirudh Chebolu | Hitendra Sarma Thogarcheti

Dysarthria, a motor speech disorder affecting over 46 million individuals globally, impairs both intelligibility and emotional expression in communication. This work introduces a novel framework for emotion-aware sentence reconstruction from dysarthric speech using Large Language Models (LLMs) fine-tuned with QLoRA, namely LLaMA 3.1 and Mistral 8x7B. Our pipeline integrates direct emotion recognition from raw audio and conditions textual reconstruction on this emotional context to enhance both semantic and affective fidelity. We propose the Multimodal Communication Dysarthria Score (MCDS), a holistic evaluation metric combining BLEU, semantic similarity, emotion consistency, and human ratings: MCDS = αB + βE + γS + δH, where α + β + γ + δ = 1. On our extended TORGO+ dataset, our emotion-aware LLM achieves an MCDS of 0.87 and a BLEU of 72.4%, significantly outperforming traditional pipelines like Kaldi GMM-HMM (MCDS: 0.52, BLEU: 38.1%) and Whisper-based models. It also surpasses baseline LLM systems by 0.09 MCDS. This sets a new benchmark in emotionally intelligent dysarthric speech reconstruction, with future directions including multilingual support and real-time deployment.
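For readers who want to plug in their own component scores, the following minimal sketch computes the weighted MCDS combination given above; the equal default weights and the assumption that all components are normalized to [0, 1] are illustrative, not taken from the paper.

```python
def mcds(bleu, emotion, semantic, human, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted combination MCDS = alpha*B + beta*E + gamma*S + delta*H.

    All component scores are assumed to be normalized to [0, 1]; the equal
    default weights are illustrative only and must sum to 1.
    """
    a, b, g, d = weights
    assert abs(a + b + g + d - 1.0) < 1e-9, "weights must sum to 1"
    return a * bleu + b * emotion + g * semantic + d * human
```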

pdf bib
Iterative Critique-Driven Simplification: Targeted Enhancement of Complex Definitions with Small Language Models
Veer Chheda | Avantika Sankhe | Aaditya Uday Ghaisas

Difficult and unfamiliar concepts often hinder comprehension for lay audiences, especially in technical and educational domains. This motivates the use of large language models (LLMs) for the process of text simplification (TS). In this work, we propose an iterative refinement framework that aims to simplify definitions by carefully handling complex terminology and domain-specific expressions. Each generated definition is critiqued and reprocessed, with refinements made in successive iterations. We emphasize the use of small language models (SLMs) due to their faster response times and cost-efficient deployment. Human evaluations of the definitions produced at each refinement stage indicate consistent improvements in our specified evaluation criteria. We use both LLM-as-a-judge scores and human assessments, along with automated metrics such as BERTScore and BLEU-4, which provide supporting evidence for the effectiveness of our approach. Our work highlights the use of LLMs mimicking a human-like feedback system in a TS task catering to a reader’s specific cognitive needs. Thus, we find that an iterative, critique-driven method can be an effective strategy for the simplification of dense or technical texts, particularly in domains where jargon impedes understanding.

pdf bib
SafePersuasion: A Dataset, Taxonomy, and Baselines for Analysis of Rational Persuasion and Manipulation
Haein Kong | A M Muntasir Rahman | Ruixiang Tang | Vivek Singh

Persuasion is a central feature of communication, widely used to influence beliefs, attitudes, and behaviors. In today’s digital landscape, across social media and online platforms, persuasive content is pervasive, appearing in political campaigns, marketing, fundraising appeals, and more. These strategies span a broad spectrum, from rational and ethical appeals to highly manipulative tactics, some of which pose significant risks to individuals and society. Despite the growing need to identify and differentiate safe from unsafe persuasion, empirical research in this area remains limited. To address this gap, we introduce SafePersuasion, a two-level taxonomy and annotated dataset that categorizes persuasive techniques based on their safety. We evaluate the baseline performance of three large language models in detecting manipulation and its subtypes, and report only moderate success in distinguishing manipulative content from rational persuasion. By releasing SafePersuasion, we aim to advance research on detecting unsafe persuasion and support the development of tools that promote ethical standards and transparency in persuasive communication online.

pdf bib
Illusions of Relevance: Arbitrary Content Injection Attacks Deceive Retrievers, Rerankers, and LLM Judges
Manveer Singh Tamber | Jimmy Lin

This work considers a black-box threat model in which adversaries attempt to propagate arbitrary non-relevant content in search. We show that retrievers, rerankers, and LLM relevance judges are all highly vulnerable to attacks that enable arbitrary content to be promoted to the top of search results and to be assigned perfect relevance scores. We investigate how attackers may achieve this via content injection, injecting arbitrary sentences into relevant passages or query terms into arbitrary passages. Our study analyzes how factors such as model class and size, the balance between relevant and non-relevant content, injection location, toxicity and severity of injected content, and the role of LLM-generated content influence attack success, yielding novel, concerning, and often counterintuitive results. Our results reveal a weakness in embedding models, LLM-based scoring models, and generative LLMs, raising concerns about the general robustness, safety, and trustworthiness of language models regardless of the type of model or the role in which they are employed. We also emphasize the challenges of robust defenses against these attacks. Classifiers and more carefully prompted LLM judges often fail to recognize passages with content injection, especially when considering diverse text topics and styles. Our findings highlight the need for further research into arbitrary content injection attacks. We release our code for further study: https://github.com/manveertamber/content_injection_attacks.

pdf bib
Semantic, Orthographic, and Phonological Biases in Humans’ Wordle Gameplay
Jiadong Liang | Adam Kabbara | Jiaying Liu | Ronaldo Luo | Kina Kim | Michael Guerzhoy

We show that human players’ gameplay in the game of Wordle is influenced by the semantics, orthography, and phonology of the player’s previous guesses. We compare actual human players’ guesses with near-optimal guesses using NLP techniques. We study human language use in the constrained environment of Wordle, which is situated between natural language use and the artificial word association task.

pdf bib
Learning from Hallucinations: Mitigating Hallucinations in LLMs via Internal Representation Intervention
Sora Kadotani | Kosuke Nishida | Kyosuke Nishida

Large language models (LLMs) sometimes hallucinate facts. Recent studies have shown that using a non-factual LLM (an anti-expert) has the potential to improve the factuality of the base LLM. Anti-expert methods penalize the output probabilities of the base LLM with an anti-expert LLM. Anti-expert methods are effective in mitigating hallucinations, but incur high computational costs because the two LLMs are run simultaneously. In this paper, we propose an efficient anti-expert method called the in-model anti-expert. It mitigates the hallucination problem with a single LLM by intervening to shift the internal representations in the direction of improved factuality. Experimental results show that the proposed method is less costly than the conventional anti-expert method and outperforms existing methods other than the anti-expert method itself. We confirm that the proposed method reduces the GPU memory overhead from 2.2x to 1.2x and the latency overhead from 1.9x to 1.2x.

pdf bib
CAPO: Confidence Aware Preference Optimization Learning for Multilingual Preferences
Rhitabrat Pokharel | Yufei Tao | Ameeta Agrawal

Preference optimization is a critical post-training technique used to align large language models (LLMs) with human preferences, typically by fine-tuning on ranked response pairs. While methods like Direct Preference Optimization (DPO) have proven effective in English, they often fail to generalize robustly to multilingual settings. We propose a simple yet effective alternative, Confidence-Aware Preference Optimization (CAPO), which replaces DPO’s fixed treatment of preference pairs with a dynamic loss scaling mechanism based on a relative reward. By modulating the learning signal according to the confidence in each preference pair, CAPO enhances robustness to noisy or low-margin comparisons, typically encountered in multilingual text. Empirically, CAPO outperforms existing preference optimization baselines by at least 16% in reward accuracy, and improves alignment by widening the gap between preferred and dispreferred responses across languages.
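A minimal sketch of confidence-aware scaling applied to a DPO-style objective is shown below; the abstract does not specify CAPO's exact scaling function, so the sigmoid-of-margin weight used here is an assumption of this example.

```python
import torch
import torch.nn.functional as F

def capo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss whose per-pair weight grows with the reward margin.

    Inputs are per-sequence log-probabilities of shape (batch,). The
    confidence weight sigmoid(margin) is an illustrative choice; the paper's
    exact scaling mechanism may differ.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    confidence = torch.sigmoid(margin).detach()   # larger for clearer preferences
    per_pair = -F.logsigmoid(margin)              # standard DPO term
    return (confidence * per_pair).mean()
```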

pdf bib
Formalizing Test-Time Compute for Function-Level Code Generation
Haau-Sing Li | Patrick Fernandes | Iryna Gurevych | Andre Martins

Test-time compute has emerged as a powerful paradigm in function-level code generation. However, previously proposed strategies have been viewed as disparate, lacking a fair, apples-to-apples analysis of their operational mechanisms on execution-based benchmarks. Therefore, we present a mathematical framework that unifies generation and reranking with theoretical justifications through the lens of Minimum Bayes Risk (MBR) decoding. Our proposed framework leads to key research questions regarding the effectiveness of using parallel and/or iterative sampling, design choices of reranking signals and soft/hard MBR utility functions, and behaviors of the final selected program across different methods. Our empirical findings highlight the importance of the diversity of sampled candidates (over self-improvement), reranking with simple and high-quality signals, and the effectiveness of test-time compute to select programs that manifest general and edge test case robustness. We open-source our analysis toolkit and implementation to enable reproducible research.

pdf bib
BioMistral-Clinical: A Scalable Approach to Clinical LLMs via Incremental Learning and RAG
Ziwei Chen | Bernhard Bermeitinger | Christina Niklaus

The integration of large language models (LLMs) into clinical medicine represents a major advancement in natural language processing (NLP). We introduce BioMistral-Clinical 7B, a clinical LLM built on BioMistral-7B (Labrak et al., 2024), designed to support continual learning from unstructured clinical notes for real-world tasks such as clinical decision support. Using the augmented-clinical-notes dataset provided by Hugging Face (2024), we apply prompt engineering to transform unstructured text into structured JSON, capturing key clinical information (symptoms, diagnoses, treatments, outcomes). This enables efficient incremental training via self-supervised continual learning (SPeCiaL) (Caccia and Pineau, 2021). Evaluation on MedQA (Jin et al., 2021) and MedMCQA (Pal et al., 2022) shows that BioMistral-Clinical 7B improves accuracy on MedMCQA by nearly 10 points (37.4% vs. 28.0%) over the base model, while maintaining comparable performance on MedQA (34.8% vs. 36.5%). Building on this, we propose the BioMistral-Clinical System, which integrates Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) to enrich responses with relevant clinical cases retrieved from a structured vector database. The full system enhances clinical reasoning by combining domain-specific adaptation with contextual retrieval.

pdf bib
Surprisal Dynamics for the Detection of Multi-Word Expressions in English
Diego Alves | Sergei Bagdasarov | Elke Teich

This work examines the potential of surprisal slope as a feature for identifying multi-word expressions (MWEs) in English, leveraging token-level surprisal estimates from the GPT-2 language model. Evaluations on the DiMSUM and SemEval-2022 datasets reveal that surprisal slope provides moderate yet meaningful discriminative power with a trade-off between specificity and coverage: while high recall indicates that surprisal slope captures many true MWEs, the slightly lower precision reflects false positives, particularly for non-MWEs that follow formulaic patterns (e.g., adjective-noun or verb-pronoun structures). The method performs particularly well for conventionalized expressions, such as idiomatic bigrams in the SemEval-2022 corpus. Both idiomatic and literal usages of these bigrams exhibit negative slopes, with idiomatic instances generally showing a more pronounced decrease. Overall, surprisal slope offers a cognitively motivated and interpretable signal that complements existing MWE identification methods, particularly for conventionalized expressions.
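As an illustration of how a surprisal slope can be computed in practice, the sketch below estimates per-token surprisal with GPT-2 via the Hugging Face transformers library and fits a least-squares line over token positions; the precise slope definition used by the authors may differ.

```python
import math

import numpy as np
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def surprisal_slope(text):
    """Fit a least-squares line to per-token surprisal (in bits) of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids          # (1, seq)
    with torch.no_grad():
        logits = model(ids).logits                                # (1, seq, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)         # predicts token t+1
    token_ids = ids[0, 1:]
    surprisals = -log_probs[torch.arange(token_ids.size(0)), token_ids] / math.log(2)
    positions = np.arange(len(surprisals))
    slope, _intercept = np.polyfit(positions, surprisals.numpy(), 1)
    return slope
```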

pdf bib
Native Design Bias: Studying the Impact of English Nativeness on Language Model Performance
Manon Reusens | Philipp Borchert | Jochen De Weerdt | Bart Baesens

Large Language Models (LLMs) excel at providing information acquired during pretraining on large-scale corpora and following instructions through user prompts. However, recent studies suggest that LLMs exhibit biases favoring Western native English speakers over non-Western native speakers. Given English’s role as a global lingua franca and the diversity of its dialects, we extend this analysis to examine whether non-native English speakers also receive lower-quality or factually incorrect responses more frequently. We compare three groups—Western native, non-Western native, and non-native English speakers—across classification and generation tasks. Our results show that performance discrepancies occur when LLMs are prompted by the different groups for the classification tasks. Generative tasks, in contrast, are largely robust to nativeness bias, likely due to their longer context length and optimization for open-ended responses. Additionally, we find a strong anchoring effect when the model is made aware of the user’s nativeness for objective classification tasks, regardless of the correctness of this information. Our analysis is based on a newly collected dataset with over 12,000 unique annotations from 124 annotators, including information on their native language and English proficiency.

pdf bib
Improving Proficiency and Grammar Accuracy for Chinese Language Learners with Large Language Models
Yuqi Liang | Wenjing Xu | Hongzhi Xu

In this study, we evaluate the performance of large language models (LLMs) in detecting and correcting grammatical errors made by Chinese language learners. We find that incorporating various linguistic features—such as dependency structures, parts of speech, and pinyin transliteration—into the prompts can potentially enhance model performance. Among these features, parts of speech and pinyin prove to be the most effective across all tested models. Additionally, our findings show that the success of error correction also depends on the severity of the errors. When the intended meaning is preserved, LLMs tend to provide accurate revisions following the principle of minimal editing. However, when the meaning is obscured, LLMs are more likely to produce divergent outputs, both in comparison to reference corrections and to the responses of other models.

pdf bib
UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases
Raj Vardhan Tomar | Preslav Nakov | Yuxia Wang

As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies have predominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources, where unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against recent SafeChain and STAR-1 across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets (21 wins out of 36 settings), with even a small selected-1K subset matching or surpassing baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision. We release our dataset and code at https://github.com/mbzuai-nlp/UnsafeChain.

pdf bib
SEER: The Span-based Emotion Evidence Retrieval Benchmark
Aneesha Sampath | Oya Aran | Emily Mower Provost

Emotion recognition methods typically assign labels at the sentence level, obscuring the specific linguistic cues that signal emotion. This limits their utility in applications requiring targeted responses, such as empathetic dialogue and clinical support, which depend on knowing which language expresses emotion. The task of identifying emotion evidence – text spans conveying emotion – remains underexplored due to a lack of labeled data. Without span-level annotations, we cannot evaluate whether models truly localize emotion expression, nor can we diagnose the sources of emotion misclassification. We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to evaluate Large Language Models (LLMs) on this task. SEER evaluates single and multi-sentence span identification with new annotations on 1200 real-world sentences. We evaluate 14 LLMs and find that, on single-sentence inputs, the strongest models match the performance of average human annotators, but performance declines in multi-sentence contexts. Key failure modes include fixation on emotion keywords and false positives in neutral text.

pdf bib
TeluguST-46: A Benchmark Corpus and Comprehensive Evaluation for Telugu-English Speech Translation
Bhavana Akkiraju | Srihari Bandarupalli | Swathi Sambangi | R Vijaya Saraswathi | Dr Vasavi Ravuri | Anil Vuppala

Despite Telugu being spoken by over 80 million people, speech translation research for this morphologically rich language remains severely underexplored. We address this gap by developing a high-quality Telugu-English speech translation benchmark from 46 hours of manually verified CSTD corpus data (30h/8h/8h train/dev/test split). Our systematic comparison of cascaded versus end-to-end architectures shows that while IndicWhisper + IndicMT achieves the highest performance due to extensive Telugu-specific training data, fine-tuned SeamlessM4T models demonstrate remarkable competitiveness despite using significantly less Telugu-specific training data. This finding suggests that with careful hyperparameter tuning and sufficient parallel data (potentially less than 100 hours), end-to-end systems can achieve performance comparable to cascaded approaches in low-resource settings. Our metric reliability study, which evaluates BLEU, METEOR, ChrF++, ROUGE-L, TER, and BERTScore against human judgments, reveals that traditional metrics provide better quality discrimination than BERTScore for Telugu–English translation. The work delivers three key contributions: a reproducible Telugu–English benchmark, empirical evidence of competitive end-to-end performance potential in low-resource scenarios, and practical guidance for automatic evaluation in morphologically complex language pairs.

pdf bib
HiLearners: Non-Native Spoken Hindi Error Correction
Sourava Kumar Behera | Rohit Saluja

While the majority of current resources rely on formal text corrections, our work shifts the focus to non-native spoken Hindi error correction, which presents unique challenges due to its rich morphology, complex syntax, and distinct error patterns. To address the scarcity of authentic learner data, we introduce HiLearners, a dataset gathered from 2,500 real non-native Hindi speakers across three linguistic backgrounds (English, Bengali, Dravidian), capturing authentic error patterns including transfer errors, overgeneralization patterns, and contextual agreement issues. Furthermore, to overcome data resource limitations, we develop a methodical synthetic data augmentation technique, utilizing Large Language Models (LLMs) with a pattern analysis and controlled generation approach similar to Retrieval-Augmented Generation (RAG), yielding 5,500 carefully verified synthetic examples. Through extensive experiments on individual, mixed, and progressive curriculum-based configurations using multilingual models, we demonstrate that LLM-based synthetic data combined with three-phase curriculum learning significantly boosts performance, achieving a 76.92 GLEU score and surpassing human-only baselines. This work bridges the gap between native-centric error correction research and non-native Hindi learner needs, establishing a realistic assessment standard for advancing low-resource language processing.

pdf bib
MTQ-Eval: Multilingual Text Quality Evaluation for Language Models
Rhitabrat Pokharel | Ameeta Agrawal

The use of large language models (LLMs) for evaluating outputs is becoming an increasingly effective and scalable approach. However, it remains uncertain whether this capability extends beyond task-specific evaluations to more general assessments of text quality, particularly in multilingual contexts. In this study, we introduce MTQ-Eval, a novel framework for multilingual text quality evaluation. We automatically generate text quality preference data and train open-source base LLMs to align with ratings of high- and low-quality text. Our comprehensive evaluation across 115 languages demonstrates the improved performance of the proposed model. Additionally, we explore whether this enhanced ability to distinguish between high- and low-quality text translates to better performance in downstream tasks.

pdf bib
TelcoAI: Advancing 3GPP Technical Specification Search through Agentic Multi-Modal Retrieval-Augmented Generation
Rahul Ghosh | Chun-Hao Liu | Gaurav Rele | Vidya Sagar Ravipati | Hazar Aouad

The 3rd Generation Partnership Project (3GPP) produces complex technical specifications essential to global telecommunications, yet their hierarchical structure, dense formatting, and multi-modal content make them difficult to process. While Large Language Models (LLMs) show promise, existing approaches fall short in handling complex queries, visual information, and document interdependencies. We present TelcoAI, an agentic, multi-modal Retrieval-Augmented Generation (RAG) system tailored for 3GPP documentation. TelcoAI introduces section-aware chunking, structured query planning, metadata-guided retrieval, and multi-modal fusion of text and diagrams. Evaluated on multiple benchmarks—including expert-curated queries—our system achieves 87% recall, 83% claim recall, and 92% faithfulness, representing a 16% improvement over state-of-the-art baselines. These results demonstrate the effectiveness of agentic and multi-modal reasoning in technical document understanding, advancing practical solutions for real-world telecommunications research and engineering.

pdf bib
ENG-DRB: PDTB-style Discourse Relation Bank on Engineering Tutorial Video Scripts
Cheng Zhang | Rajasekhar Kakarla | Kangda Wei | Ruihong Huang

Discourse relation parsing plays a crucial role in uncovering the logical structure of text, yet existing corpora focus almost exclusively on general-domain genres, leaving specialized fields like engineering under-resourced. We introduce ENG-DRB, the first PDTB-style discourse relation corpus derived from transcripts of hands-on engineering tutorial videos. ENG-DRB comprises 11 tutorials spanning civil, mechanical, and electrical/electronics engineering (155 minutes total) with 1,215 annotated relations. Compared to general-domain benchmarks, this dataset features a high proportion of explicit senses, dense causal and temporal relations, and frequent overlapping and embedded senses. Our benchmarking experiments underscore the dataset’s difficulty. A top parser (HITS) detects segment boundaries well (98.6% F1), but its relation classification is more than 11 F1 points lower than on the standard PDTB. In addition, state-of-the-art LLMs (OpenAI o4-mini, Claude 3.7, LLaMA-3.1) achieve at best 41% F1 on explicit relations and less than 9% F1 on implicit relations, revealing systematic errors in temporal and causal sense detection. The dataset can be accessed at: https://doi.org/10.57967/hf/6895. Code to reproduce our results is available at: https://github.com/chengzhangedu/ENG-DRB.

pdf bib
Did the Writer Actually Visit the Location? Analysis of Location Reviews from Visit Experience
Aitaro Yamamoto | Hiroki Ouchi | Kota Tsubouchi | Tatsuo Yamashita | Ryo Tsujimoto | Yuki Matsuda | Hirohiko Suwa

We investigate the characteristics of location review texts written on the basis of actual visit experiences or without any visit experiences. Specifically, we formalize this as a binary classification task and propose a data construction framework that labels reviews as Visit or NotVisit by linking them with users’ GPS-based movement data. We train a logistic regression model on the dataset and evaluate it alongside human annotators and a large language model (LLM). The results show that the task is more challenging for humans and LLMs than for the simple trained model.

pdf bib
Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student
Soumyadeep Jana | Ranbir Singh Sanasam

Multimodal sarcasm detection is challenging, especially in low-resource settings where subtle image-text contradictions are hard to learn due to scarce annotated data, which hinders the model’s performance. Parameter-efficient fine-tuning (PEFT) methods like adapters, LoRA, and prompt tuning reduce overfitting but struggle to reach optimal performance due to limited supervision from few-shot data. We propose PEKD, a unified framework that enhances PEFT methods via distillation from an expert model trained on large-scale sarcasm data, which acts as the teacher. To mitigate unreliable signals from the teacher, we introduce an entropy-aware gating mechanism that dynamically adjusts the distillation strength based on teacher confidence. Experiments on two public datasets demonstrate that our PEKD framework enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models, achieving strong results in the few-shot scenario. The framework is modular and adaptable to a wide range of multimodal models and tasks.

pdf bib
Seeing Through the Mask: AI-Generated Text Detection with Similarity-Guided Graph Reasoning
Nidhi Gupta | Qinghua Li

The rise of generative AI has led to challenges in distinguishing AI-generated text from human-written content, raising concerns about misinformation and content authenticity. Detecting AI-generated text remains challenging, especially under various stylistic domains and paraphrased inputs. We introduce SGG-ATD, a novel detection framework that models structural and contextual relationships between LLM-predicted and original-input text. By masking parts of the input and reconstructing them using a language model, we capture implicit coherence patterns. These are encoded in a graph where cosine and contextual links between keywords guide classification via a Graph Convolutional Network (GCN). SGG-ATD achieves strong performance across diverse datasets and shows resilience to adversarial rephrasing and out-of-distribution inputs, outperforming competitive baselines.

pdf bib
Reasoning Enhanced Missing Knowledge Retrieval Augmented Generation Framework for Domain Specific Question Answering
Yuanjun Shi | Zhaopeng Qiu

The Retrieval-Augmented Generation (RAG) framework mitigates hallucinations in Large Language Models (LLMs) by integrating external knowledge, yet faces two critical challenges: (1) the distribution gap between user queries and knowledge bases in a specific domain, and (2) incomplete coverage of required knowledge for complex queries. Existing solutions either require task-specific annotations or neglect inherent connections among query, context, and missing knowledge interactions. We propose a reasoning-based missing knowledge RAG framework that synergistically resolves both issues through Chain-of-Thought reasoning. By leveraging open-source LLMs, our method generates structured missing knowledge queries in a single inference pass while aligning query knowledge distributions, and integrates reasoning traces into answer generation. Experiments on open-domain medical and general question answering (QA) datasets demonstrate significant improvements in context recall and answer accuracy. Our approach achieves effective knowledge supplementation without additional training, offering enhanced interpretability and robustness for real-world QA applications.

pdf bib
Modular Arithmetic: Language Models Solve Math Digit by Digit
Tanja Baeumel | Daniil Gurgurov | Yusser Al Ghussin | Josef Van Genabith | Simon Ostermann

While recent work has begun to uncover the internal strategies that Large Language Models (LLMs) employ for simple arithmetic tasks, a unified understanding of their underlying mechanisms is still lacking. We extend recent findings showing that LLMs represent numbers in a digit-wise manner and present evidence for the existence of digit-position-specific circuits that LLMs use to perform simple arithmetic tasks, i.e., modular subgroups of MLP neurons that operate independently on different digit positions (units, tens, hundreds). Notably, such circuits exist independently of model size and of tokenization strategy, i.e., both for models that encode longer numbers digit by digit and for models that encode them as a single token. Using Feature Importance and Causal Interventions, we identify and validate the digit-position-specific circuits, revealing a compositional and interpretable structure underlying the solving of arithmetic problems in LLMs. Our interventions selectively alter the model’s prediction at targeted digit positions, demonstrating the causal role of digit-position circuits in solving arithmetic tasks.

pdf bib
Standardizing Heterogeneous Corpora with DUUR: A Dual Data- and Process-Oriented Approach to Enhancing NLP Pipeline Integration
Leon Lukas Hammerla | Alexander Mehler | Giuseppe Abrami

Despite their success, LLMs are too computationally expensive to replace task- or domain-specific NLP systems. However, the variety of corpus formats makes reusing these systems difficult. This underscores the importance of maintaining an interoperable NLP landscape. We address this challenge by pursuing two objectives: standardizing corpus formats and enabling massively parallel corpus processing. We present a unified conversion framework embedded in a massively parallel, microservice-based, programming language-independent NLP architecture designed for modularity and extensibility. It allows for the integration of external NLP conversion tools and supports the addition of new components that meet basic compatibility requirements. To evaluate our dual data- and process-oriented approach to standardization, we (1) benchmark its efficiency in terms of processing speed and memory usage, (2) demonstrate the benefits of standardized corpus formats for NLP downstream tasks, and (3) illustrate the advantages of incorporating custom formats into a corpus format ecosystem.

pdf bib
LLMForum-RAG: A Multilingual, Multi-domain Framework for Factual Reasoning via Weighted Retrieval and LLM Collaboration
Soham Chaudhuri | Dipanjan Saha | Dipankar Das

LLMs have emerged as a transformative technology, enabling a wide range of tasks such as text generation, summarization, question answering, and more. The use of RAG with LLMs is on the rise to provide deeper knowledge bases across various domains. In the present study, we propose a RAG framework that employs a weighted Rocchio mechanism for retrieval and an LLM collaborative forum with supervision for generation. Our framework is evaluated on two downstream tasks: biomedical question answering (BioASQ-QA) and multilingual claim verification (in English, Hindi, and Bengali), showcasing its adaptability across domains and languages. The proposed retriever achieves substantial improvements over BM25 in Recall@5: +8% (BioASQ-QA), +15% (English), +5% (Hindi), and +20% (Bengali). In veracity classification, our framework achieves an average answer correctness of 0.78 on BioASQ-QA and F1-scores of 0.59, 0.56, and 0.41 for English, Hindi, and Bengali, respectively. These results demonstrate the effectiveness and robustness of our framework for retrieval and generation in multilingual and multi-domain settings.

pdf bib
D-Neg: Syntax-Aware Graph Reasoning for Negation Detection
Leon Lukas Hammerla | Andy Lücking | Carolin Reinert | Alexander Mehler

Despite the communicative importance of negation, its detection remains challenging. Previous approaches perform poorly in out-of-domain scenarios, and progress outside of English has been slow due to a lack of resources and robust models. To address this gap, we present D-Neg: a syntax-aware graph reasoning model based on a transformer that incorporates syntactic embeddings by attention-gating. D-Neg uses graph attention to represent syntactic structures, emulating the effectiveness of rule-based dependency approaches for negation detection. We train D-Neg using 7 English resources and their translations into 10 languages, all aligned at the annotation level. We conduct an evaluation of all these datasets in in-domain and out-of-domain settings. Our work represents a significant advance in negation detection, enabling more effective cross-lingual research.

pdf bib
EMBRACE: Shaping Inclusive Opinion Representation by Aligning Implicit Conversations with Social Norms
Abeer Aldayel | Areej Alokaili

Shaping inclusive representations that embrace diversity and ensure fair participation and reflections of values is at the core of many conversation-based models. However, many existing methods rely on surface inclusion using mention of user demographics or behavioral attributes of social groups. Such methods overlook the nuanced, implicit expression of opinion embedded in conversations. Furthermore, the over-reliance on overt cues can exacerbate misalignment and reinforce harmful or stereotypical representations in model outputs. Thus, we took a step back and recognized that equitable inclusion needs to account for the implicit expression of opinion and use the stance of responses to validate the normative alignment. This study aims to evaluate how opinions are represented in NLP or computational models by introducing an alignment evaluation framework that foregrounds implicit, often overlooked conversations and evaluates the normative social views and discourse. Our approach models the stance of responses as a proxy for the underlying opinion, enabling a considerate and reflective representation of diverse social viewpoints. We evaluate the framework using both (i) positive-unlabeled (PU) online learning with base classifiers, and (ii) instruction-tuned language models to assess post-training alignment. Through this, we provide a basis for understanding how implicit opinions are (mis)represented and offer a pathway toward more inclusive model behavior.

pdf bib
CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning
Lei Sheng | Xu Shuai Shuai

Large language models (LLMs) have demonstrated strong capabilities in translating natural language questions about relational databases into SQL queries. In particular, test-time scaling techniques such as Self-Consistency and Self-Correction can enhance SQL generation accuracy by increasing computational effort during inference. However, these methods have notable limitations: Self-Consistency may select suboptimal outputs despite majority votes, while Self-Correction typically addresses only syntactic errors. To leverage the strengths of both approaches, we propose CSC-SQL, a novel method that integrates Self-Consistency and Self-Correction. CSC-SQL selects the two most frequently occurring outputs from parallel sampling and feeds them into a merge revision model for correction. Additionally, we employ the Group Relative Policy Optimization (GRPO) algorithm to fine-tune both the SQL generation and revision models via reinforcement learning, significantly enhancing output quality. Experimental results confirm the effectiveness and generalizability of CSC-SQL. On the BIRD private test set, our 7B model achieves 71.72% execution accuracy, while the 32B model achieves 73.67%, outperforming other known methods using open source models. The code has been open sourced at https://github.com/CycloneBoy/csc_sql.
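A schematic of the selection-plus-revision step described above might look as follows; `revise_fn` stands in for the fine-tuned merge-revision model and is an assumption of this sketch, not the authors' implementation.

```python
from collections import Counter

def csc_select_and_revise(question, sql_candidates, revise_fn):
    """Pick the two most frequent SQL candidates from parallel sampling and
    hand them to a merge-revision model.

    `sql_candidates` is a list of SQL strings; `revise_fn(question, a, b)`
    is assumed to wrap the revision model and return a single SQL query.
    """
    counts = Counter(sql.strip().rstrip(";") for sql in sql_candidates)
    top_two = [sql for sql, _ in counts.most_common(2)]
    if len(top_two) == 1:               # all samples already agree
        return top_two[0]
    return revise_fn(question, top_two[0], top_two[1])
```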

pdf bib
SLM-SQL: An Exploration of Small Language Models for Text-to-SQL
Lei Sheng | Xu Shuai Shuai

Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). In contrast, small language models (SLMs) ranging from 0.5B to 1.5B parameters currently underperform on Text-to-SQL tasks due to their limited logical reasoning capabilities. However, SLMs offer inherent advantages in inference speed and suitability for edge deployment. To explore their potential in Text-to-SQL applications, we leverage recent advancements in post-training techniques. Specifically, we used the open-source SynSQL-2.5M dataset to construct two derived datasets: SynSQL-Think-916K for SQL generation and SynSQL-Merge-Think-310K for SQL merge revision. We then applied supervised fine-tuning and reinforcement learning-based post-training to the SLM, followed by inference using a corrective self-consistency approach. Experimental results validate the effectiveness and generalizability of our method, SLM-SQL. On the BIRD development set, the five evaluated models achieved an average improvement of 31.4 points. Notably, the 0.5B model reached 56.87% execution accuracy (EX), while the 1.5B model achieved 67.08% EX. On the BIRD private test set, our 0.5B model achieves 61.82% EX, while the 1.5B model achieves 70.49%. We will release our dataset, model, and code to github: https://github.com/CycloneBoy/slm_sql.

pdf bib
CLEV: LLM-Based Evaluation Through Lightweight Efficient Voting for Free-Form Question-Answering
Sher Badshah | Moamen Moustafa | Hassan Sajjad

Evaluating free-form Question-Answering (QA) remains a challenge due to its diverse and open-ended nature. Traditional automatic metrics fail to capture semantic equivalence or accommodate the variability of open-ended responses. Leveraging Large Language Models (LLMs) as evaluators offers a promising alternative due to their strong language understanding and instruction-following capabilities. We propose the Consensus via Lightweight Efficient Voting (CLEV), which employs two primary LLMs as judges and engages a third judge only in cases of disagreement. This approach prioritizes evaluation reliability while reducing unnecessary computational demands. Through experiments, including human evaluation, we demonstrate CLEV’s ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating LLMs on free-form QA.
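The voting protocol can be illustrated in a few lines; the judge functions below are hypothetical wrappers around LLM calls that return a boolean verdict, and the sketch only mirrors the two-judges-plus-tie-breaker logic described in the abstract.

```python
def clev_verdict(question, reference, answer, judge_a, judge_b, judge_c):
    """Two primary LLM judges; a third is consulted only on disagreement.

    Each judge(question, reference, answer) is assumed to wrap an LLM call
    and return True if it accepts the answer, False otherwise.
    """
    a = judge_a(question, reference, answer)
    b = judge_b(question, reference, answer)
    if a == b:                      # consensus reached, no extra call needed
        return a
    return judge_c(question, reference, answer)
```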

pdf bib
How do we measure privacy in text? A survey of text anonymization metrics
Yaxuan Ren | Krithika Ramesh | Yaxing Yao | Anjalie Field

In this work, we aim to clarify and reconcile metrics for evaluating privacy protection in text through a systematic survey. Although text anonymization is essential for enabling NLP research and model development in domains with sensitive data, evaluating whether anonymization methods sufficiently protect privacy remains an open challenge. In manually reviewing 47 papers that report privacy metrics, we identify and compare six distinct privacy notions, and analyze how the associated metrics capture different aspects of privacy risk. We then assess how well these notions align with legal privacy standards (HIPAA and GDPR), as well as user-centered expectations grounded in HCI studies. Our analysis offers practical guidance on navigating the landscape of privacy evaluation approaches and highlights gaps in current practices. Ultimately, we aim to facilitate more robust, comparable, and legally aware privacy evaluations in text anonymization.

pdf bib
To Generate or Discriminate? Methodological Considerations for Measuring Cultural Alignment in LLMs
Saurabh Kumar Pandey | Sougata Saha | Monojit Choudhury

Socio-demographic prompting (SDP) - prompting Large Language Models (LLMs) using demographic proxies to generate culturally aligned outputs - often shows LLM responses as stereotypical and biased. While effective in assessing LLMs’ cultural competency, SDP is prone to confounding factors such as prompt sensitivity, decoding parameters, and the inherent difficulty of generation over discrimination tasks due to larger output spaces. These factors complicate interpretation, making it difficult to determine if the poor performance is due to bias or the task design. To address this, we use inverse socio-demographic prompting (ISDP), where we prompt LLMs to discriminate and predict the demographic proxy from actual and simulated user behavior from different users. We use the Goodreads-CSI dataset (Saha et al., 2025), which captures difficulty in understanding English book reviews for users from India, Mexico, and the USA, and test four LLMs: Aya-23, Gemma-2, GPT-4o, and LLaMA-3.1 with ISDP. Results show that models perform better with actual behaviors than simulated ones, contrary to what SDP suggests. However, performance with both behavior types diminishes and becomes nearly equal at the individual level, indicating limits to personalization.

pdf bib
One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning
Ritesh Goru | Shanay Mehta | Prateek Jain

Fine-tuning Large Language Models (LLMs) on multi-turn reasoning datasets requires N (number of turns) separate forward passes per conversation due to reasoning token visibility constraints, as reasoning tokens for a turn are discarded in subsequent turns. We propose duplicating response tokens along with a custom attention mask to enable single-pass processing of entire conversations. We prove our method produces identical losses to the N-pass approach while reducing time complexity from O(N³) to O(N²) and maintaining the same memory complexity for a transformer-based model. Our approach achieves significant training speedup while preserving accuracy. Our implementation is available online (https://github.com/devrev/One-Pass-to-Reason).
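The visibility constraint behind the custom attention mask can be sketched as follows, assuming each turn's reasoning tokens are laid out directly before its response tokens; this mirrors the constraint described in the abstract, not the paper's exact mask construction.

```python
import torch

def one_pass_visibility_mask(reasoning_lengths, response_lengths):
    """Boolean attention mask (True = may attend) for single-pass training.

    Assumes the flattened sequence places each turn's reasoning tokens
    immediately before that turn's response tokens. Reasoning tokens are
    visible only within their own turn; everything else follows ordinary
    causal attention.
    """
    total = sum(r + t for r, t in zip(reasoning_lengths, response_lengths))
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))
    pos = 0
    for r_len, t_len in zip(reasoning_lengths, response_lengths):
        r_start, r_end = pos, pos + r_len            # this turn's reasoning span
        turn_end = r_end + t_len
        mask[turn_end:, r_start:r_end] = False       # later turns cannot see it
        pos = turn_end
    return mask
```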

pdf bib
CLL-RetICL: Contrastive Linguistic Label Retrieval-based In-Context Learning for Text Classification via Large Language Models
Chaohao Lin | Kaida Wu | Peihao Xiang | Yanzhao Wu | Ou Bai

Recent research has delved into Retrieval-based In-Context Learning (RetICL), leveraging the power of large language models (LLMs) for text classification. Despite its promise, a persistent challenge lies in effectively retrieving relevant demonstrations from a support set. Many existing approaches have overlooked the essential role of linguistic label information in guiding this retrieval process. To bridge this gap, we present Contrastive Linguistic Label Retrieval-based In-Context Learning (CLL-RetICL), a novel framework designed to identify the most relevant and impactful sentences without altering the model parameters. Our approach uniquely integrates sentence-query similarity with sentence-label similarity, enabling a more nuanced and comprehensive evaluation of relevance. We tested CLL-RetICL across diverse text classification tasks and evaluated its performance on various LLMs. Experimental results demonstrate that CLL-RetICL consistently outperforms previous retrieval methods that do not incorporate linguistic label information. These findings highlight the critical importance of linguistic label-aware selection in enhancing text classification accuracy.

pdf bib
Shared Heritage, Distinct Writing: Rethinking Resource Selection for East Asian Historical Documents
Seyoung Song | Haneul Yoo | Jiho Jin | Kyunghyun Cho | Alice Oh

Historical documents in the Sinosphere are known to share common formats and practices, particularly in veritable records compiled by court historians. This shared linguistic heritage has led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan, which remain relatively low-resource. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration tasks show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja, with performance differences within ±0.0068 F1-score for sequence labeling tasks and up to +0.84 BLEU score for translation. These limitations persist consistently across various model sizes, architectures, and domain-specific datasets. Our analysis reveals that the benefits of Classical Chinese resources diminish rapidly as local language data increases for Hanja, while showing substantial improvements only in extremely low-resource scenarios for both Korean and Japanese historical documents. These findings emphasize the need for careful empirical validation rather than assuming benefits from indiscriminate cross-lingual transfer.

pdf bib
Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data
Srihari Bandarupalli | Bhavana Akkiraju | Sri Charan D | Vamshi Raghu Simha Narasinga | Anil Vuppala

Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continuous pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically-aware tokenization to develop a 300M parameter model that achieves performance comparable to systems 5 times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, revealing instead that data relevance and strategic pretraining are more critical factors for low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.

pdf bib
Evaluating Human-LLM Representation Alignment: A Case Study on Affective Sentence Generation for Augmentative and Alternative Communication
Shadab Hafiz Choudhury | Asha Kumar | Lara J. Martin

Gaps arise between a language model’s use of concepts and people’s expectations. This gap is critical when LLMs generate text to help people communicate via Augmentative and Alternative Communication (AAC) tools. In this work, we introduce the evaluation task of Representation Alignment for measuring this gap via human judgment. In our study, we expand keywords and emotion representations into full sentences. We select four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. In addition to Representation Alignment, we also measure people’s judgments of the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., “angry”) rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. Furthermore, we found that the perception of how much a generated sentence conveys an emotion is dependent on both the representation type and which emotion it is.

pdf bib
Towards Multimodal Question Answering in Educational Domain
Himanshu Wadhwa | T Karthikeyan | Mausam | Manish Gupta

The proliferation of educational videos on the Internet has changed the educational landscape by enabling students to learn complex concepts at their own pace. Our work outlines the vision of an automated tutor – a multimodal question answering (QA) system to answer questions from students watching a video. This can make doubt resolution faster and further improve the learning experience. In this work, we take first steps towards building such a QA system. We curate and release a dataset named EduVidQA, with 3,158 videos and 18,474 QA-pairs. However, building and evaluating an educational QA system is challenging because (1) existing evaluation metrics do not correlate with human judgments, and (2) a student question could be answered in many different ways, so training on a single gold answer could confuse the model and make it worse. We conclude with important research questions to develop this research area further.

pdf bib
ACE-ICD: Acronym Expansion As Data Augmentation For Automated ICD Coding
Tuan-Dung Le | Shohreh Haddadan | Thanh Q. Thieu

Automatic ICD coding, the task of assigning disease and procedure codes to electronic medical records, is crucial for clinical documentation and billing. While existing methods primarily enhance model understanding of code hierarchies and synonyms, they often overlook the pervasive use of medical acronyms in clinical notes, a key factor in ICD code inference. To address this gap, we propose a novel and effective data augmentation technique that leverages large language models to expand medical acronyms, allowing models to be trained on their full-form representations. Moreover, we incorporate consistency training to regularize predictions by enforcing agreement between the original and augmented documents. Extensive experiments on the MIMIC-III dataset demonstrate that our approach, ACE-ICD, establishes new state-of-the-art performance across multiple settings, including common codes, rare codes, and full-code assignments. Our code is publicly available.
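One common way to instantiate the consistency-training idea mentioned above is a symmetric KL term between predictions on the original and acronym-expanded notes; the exact objective used by ACE-ICD may differ, and the softmax over labels below is a simplifying assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_original, logits_expanded):
    """Symmetric KL between label distributions predicted for the original
    clinical note and its acronym-expanded counterpart.

    Both inputs have shape (batch, num_labels). Using a softmax here is an
    illustrative simplification; ICD coding is usually multi-label.
    """
    log_p = torch.log_softmax(logits_original, dim=-1)
    log_q = torch.log_softmax(logits_expanded, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)
```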

pdf bib
Logical Table-to-Text Generation: Challenges, Methods, and Reasoning
Lena Trigg | Dean F. Hougen

Logical Table-to-Text (LT2T) generation requires models to both verbalize tabular data and reason over it - performing comparisons, aggregations, and causal inference. While many generation tasks struggle with similar analytical demands, LT2T provides a structured perspective on reasoning capabilities in natural language generation. This survey uses LT2T as a lens to focus on reasoning in data-to-text tasks. By focusing narrowly on LT2T, we present a deep taxonomy of methods that inject, structure, or verify reasoning steps, allowing a level of technical granularity missing in broader surveys. We review representative models and evaluation metrics, and highlight how LT2T techniques transfer to general generation challenges involving logic, numeracy, and faithfulness. Our goal is to distill lessons from LT2T that apply more widely, while also guiding future research in table-based reasoning.

pdf bib
Decoding Emergent Big Five Traits in Large Language Models: Temperature-Dependent Expression and Architectural Clustering
Christos Nikolaos Zacharopoulos | Revekka Kyriakoglou

As Large Language Models (LLMs) become integral to human-centered applications, understanding their personality-like behaviors is increasingly important for responsible development and deployment. This paper systematically evaluates six LLMs, applying the Big Five Inventory-2 (BFI-2) framework, to assess trait expressions under varying sampling temperatures. We find significant differences across four of the five personality dimensions, with Neuroticism and Extraversion susceptible to temperature adjustments. Further, hierarchical clustering reveals distinct model clusters, suggesting that architectural features may predispose certain models toward stable trait profiles. Taken together, these results offer new insights into the emergence of personality-like patterns in LLMs and provide a new perspective on model tuning, selection, and the ethical governance of AI systems. We share the data and code for this analysis here: https://osf.io/bsvzc/?view_only=6672219bede24b4e875097426dc3fac1

pdf bib
Revisiting Intermediate-Layer Matching in Knowledge Distillation: Layer-Selection Strategy Doesn’t Matter (Much)
Zony Yu | Yuqiao Wen | Lili Mou

Knowledge distillation (KD) is a popular method of transferring knowledge from a large “teacher” model to a small “student” model. Previous work has explored various layer-selection strategies (e.g., forward matching and in-order random matching) for intermediate-layer matching in KD, where a student layer is forced to resemble a certain teacher layer. In this work, we revisit such layer-selection strategies and observe an intriguing phenomenon that layer-selection strategy does not matter (much) in intermediate-layer matching—even seemingly nonsensical matching strategies such as *reverse matching* still result in surprisingly good student performance. We provide an interpretation for this phenomenon by examining the angles between teacher layers viewed from the student’s perspective. Our work sheds light on KD practice, as layer-selection strategies may not be the main focus of KD system design and vanilla forward matching works well in most setups.
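The layer-selection strategies compared above can be illustrated with a simple matching loss; the sketch below assumes student and teacher hidden states have already been projected to a common width, and the relative-depth mapping used for "forward" matching is an assumption of this example.

```python
import torch
import torch.nn.functional as F

def layer_matching_loss(student_hiddens, teacher_hiddens, strategy="forward"):
    """Mean MSE between matched student and teacher hidden states.

    Both arguments are lists of tensors of identical shape (batch, seq, dim);
    a projection to a common width is assumed to have been applied already.
    "forward" pairs student layer i with the teacher layer at the same
    relative depth; "reverse" simply flips that assignment.
    """
    n_s, n_t = len(student_hiddens), len(teacher_hiddens)
    teacher_idx = [round(i * (n_t - 1) / max(n_s - 1, 1)) for i in range(n_s)]
    if strategy == "reverse":
        teacher_idx = teacher_idx[::-1]
    losses = [F.mse_loss(s, teacher_hiddens[j])
              for s, j in zip(student_hiddens, teacher_idx)]
    return torch.stack(losses).mean()
```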

pdf bib
Can LLMs Learn from Their Mistakes? Self-Correcting Instruction Tuning for Named Entity Recognition
Takumi Takahashi | Tomoki Taniguchi | Chencheng Zhu | Tomoko Ohkuma

Recent instruction-tuned large language models (LLMs) have demonstrated remarkable performance on various downstream tasks, including named entity recognition (NER). However, previous approaches often generate incorrect predictions, particularly regarding entity boundaries and types. Many of these errors can be corrected to match the ground truth by revising the entity boundaries and/or types. In this paper, we propose a self-correcting instruction tuning approach that simultaneously learns to perform NER and correct errors through natural language instructions. Self-correcting instruction tuning requires only a standard annotated NER dataset. Supervision for self-correction can be automatically generated from error patterns observed in LLMs fine-tuned solely on NER tasks. We conducted extensive experiments on eight NER datasets with two LLMs to validate the effectiveness of the proposed approach. The results demonstrate that the proposed approach enhances NER performance by effectively correcting prediction errors and substantially reducing false positives. We further analyze the self-correction behavior to better understand how the models improve performance.

pdf bib
OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning
Zhenyu Bi | Meng Lu | Yang Li | Swastik Roy | Weijie Guan | Morteza Ziyadi | Xuan Wang

Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi-agent systems have been proposed to harness the collective intelligence of LLM agents. However, existing collaboration structures are either predefined or rely on majority voting or round-table debates, which can suppress correct but less dominant agent contributions. Recent approaches model multi-agent systems as graph networks but optimize purely for agent performance, neglecting the quality of interactions. We hypothesize that effective agent communication is crucial for multi-agent reasoning and that debating quality plays a significant role. To address this, we propose OptAgent, a multi-agent verbal reinforcement learning algorithm that dynamically constructs and refines multi-agent collaboration structures. Our method defines action spaces and a feedback mechanism that evaluates communication robustness and coherence throughout the debate. The final decision is achieved through a majority vote over all the agents. We assess OptAgent on various reasoning tasks, including mathematical reasoning, creative writing, scientific reasoning, and numerical sorting. Results demonstrate that our approach significantly outperforms single-agent prompting methods and state-of-the-art multi-agent frameworks on diverse tasks.

pdf bib
EWoRA: Expert Weighted Low-Rank Adaptation for Heterogeneous Data
Harsh Kohli | Helian Feng | Lenon Minorics | Bhoomit Vasani | Xin He | Ali Kebarighotbi

Low-Rank Adaptation (LoRA) has emerged as a widely adopted parameter-efficient fine-tuning (PEFT) approach for language models. By restricting weight updates to a low-rank subspace, LoRA achieves cost-effective fine-tuning of large, generalist models to more specialized target domains. While LoRA achieves impressive results for a variety of individual downstream tasks, it struggles to capture the diverse expertise needed when presented with a more heterogeneous fine-tuning corpus. To address this, we propose Expert Weighted Low-Rank Adaptation (EWoRA), a novel LoRA variant that partitions a rank-r adapter into n independent adapters of rank r/n. A lightweight “routing” matrix W_r ∈ ℝ^(r×n) aggregates the outputs of these adapters by learning specialized weights for each context. Experiments show EWoRA improves performance over LoRA when fine-tuning on heterogeneous data while generally matching or exceeding LoRA performance on individual fine-tuning tasks under the same low-rank parameter budget.
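
A minimal sketch of the partitioned-adapter idea follows; the per-token softmax routing, initialization, and scaling shown here are our assumptions for illustration, not the authors' released design.

```python
import torch
import torch.nn as nn

class EWoRALinear(nn.Module):
    """Sketch: a rank-r LoRA update split into n rank-(r/n) expert adapters
    whose outputs are mixed by a lightweight routing projection."""
    def __init__(self, d_in, d_out, r=16, n_experts=4, alpha=16.0):
        super().__init__()
        assert r % n_experts == 0
        sub_r = r // n_experts
        self.A = nn.ParameterList([nn.Parameter(0.01 * torch.randn(d_in, sub_r))
                                   for _ in range(n_experts)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(sub_r, d_out))
                                   for _ in range(n_experts)])
        self.router = nn.Linear(d_in, n_experts, bias=False)  # routing weights per context
        self.scale = alpha / r

    def forward(self, x, base_out):
        gate = torch.softmax(self.router(x), dim=-1)                    # (..., n_experts)
        delta = torch.stack([(x @ A) @ B for A, B in zip(self.A, self.B)], dim=-1)
        delta = (delta * gate.unsqueeze(-2)).sum(-1)                    # expert-weighted mix
        return base_out + self.scale * delta
```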

pdf bib
The Alchemy of Thought: Understanding In-Context Learning Through Supervised Classification
Harshita Narnoli | Mihai Surdeanu

In-context learning (ICL) has become a prominent paradigm to rapidly customize LLMs to new tasks without fine-tuning. However, despite the empirical evidence of its usefulness, we still do not truly understand how ICL works. In this paper, we compare the behavior of in-context learning with supervised classifiers trained on ICL demonstrations to investigate three research questions: (1) Do LLMs with ICL behave similarly to classifiers trained on the same examples? (2) If so, which classifiers are closer, those based on gradient descent (GD) or those based on k-nearest neighbors (kNN)? (3) When they do not behave similarly, what conditions are associated with differences in behavior? Using text classification as a use case, with six datasets and three LLMs, we observe that LLMs behave similarly to these classifiers when the relevance of demonstrations is high. On average, ICL is closer to kNN than logistic regression, giving empirical evidence that the attention mechanism behaves more similarly to kNN than GD. However, when demonstration relevance is low, LLMs perform better than these classifiers, likely because LLMs can back off to their parametric memory, a luxury these classifiers do not have.

pdf bib
Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts
Sidharth Pulipaka | Ashwin Sankar | Raj Dabre

Punctuation plays a vital role in structuring meaning, yet current models often struggle to restore it accurately in transcripts of spontaneous speech, especially in the presence of disfluencies such as false starts and backtracking. These limitations hinder the performance of downstream tasks like translation, text-to-speech, summarization, etc. where sentence boundaries are critical for preserving quality. In this work, we introduce Cadence, a generalist punctuation restoration model adapted from a pretrained large language model. Cadence is designed to handle both clean written text and highly spontaneous spoken transcripts. It surpasses the previous state-of-the-art in performance while expanding support from 14 to all 22 Indian languages and English. We conduct a comprehensive analysis of model behavior across punctuation types and language families, identifying persistent challenges under domain shift and with rare punctuation marks. Our findings demonstrate the efficacy of utilizing pretrained language models for multilingual punctuation restoration and highlight Cadence’s practical value for low-resource NLP pipelines at scale.

pdf bib
Efficient Decoding Methods for Language Models on Encrypted Data
Matan Avitan | Moran Baruch | Nir Drucker | Itamar Zimerman | Yoav Goldberg

Large language models (LLMs) power modern AI applications, but processing sensitive data on untrusted servers raises privacy concerns. Homomorphic encryption (HE) enables computation on encrypted data for secure inference. However, neural text generation requires decoding methods like argmax and sampling, which are non-polynomial and thus computationally expensive under encryption, creating a significant performance bottleneck. We introduce cutmax, an HE-friendly argmax algorithm that reduces ciphertext operations compared to prior methods, enabling practical greedy decoding under encryption. We also propose the first HE-compatible nucleus (top-p) sampling method, leveraging cutmax for efficient stochastic decoding with provable privacy guarantees. Both techniques are polynomial, supporting efficient inference in privacy-preserving settings. Moreover, their differentiability facilitates gradient-based sequence-level optimization as a polynomial alternative to straight-through estimators. We further provide strong theoretical guarantees for cutmax, proving its convergence via exponential amplification of the gap ratio between the maximum and runner-up elements. Evaluations on realistic LLM outputs show latency reductions of 24×–35× over baselines, advancing secure text generation.
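
To build intuition for why a purely polynomial routine can approximate argmax, the toy loop below repeatedly squares and renormalizes a positive score vector, which amplifies the ratio between the maximum and the runner-up exponentially in the number of iterations. It is only an illustration of that principle in the clear: it is not the paper's cutmax algorithm, and the division shown would itself need a polynomial approximation under HE.

```python
import torch

def poly_argmax(logits, iters=8):
    """Toy gap-amplification 'argmax': each squaring step squares the
    max/runner-up ratio, so the vector approaches a one-hot indicator."""
    x = logits - logits.min() + 1e-3   # shift to strictly positive scores
    x = x / x.sum()
    for _ in range(iters):
        x = x * x                      # amplify the gap ratio
        x = x / x.sum()                # renormalize (needs a polynomial
                                       # approximation of division under HE)
    return x

print(poly_argmax(torch.tensor([2.0, 3.5, 1.0])))  # ~[0, 1, 0]
```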

pdf bib
Commentary Generation from Multimodal Game Data for Esports Moments in Multiplayer Strategy Games
Zihan Wang | Naoki Yoshinaga

Esports is a competitive sport in which highly skilled players face off in fast-paced video games. Matches consist of intense, moment-by-moment plays that require exceptional technique and strategy. These moments often involve complex interactions, including team fights, positioning, or strategic decisions, which are difficult to interpret without expert explanation. In this study, we set up the task of generating commentary for a specific game moment from multimodal game data consisting of a gameplay screenshot and structured JSON data. Specifically, we construct the first large-scale tri-modal dataset for League of Legends, one of the most popular multiplayer strategy esports titles, and then design evaluation criteria for the task. Using this dataset, we evaluate various large vision language models in generating commentary for a specific moment. We will release the scripts to reconstruct our dataset.

pdf bib
Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models
Yingqi Hu | Zhuo Zhang | Jingyuan Zhang | Jinghua Wang | Qifan Wang | Lizhen Qu | Zenglin Xu

Federated large language models (FedLLMs) enable cross-silo collaborative training among institutions while preserving data locality, making them appealing for privacy-sensitive domains such as law, finance, and healthcare. However, the memorization behavior of LLMs can lead to privacy risks that may cause cross-client data leakage. In this work, we study the threat of *cross-client data extraction*, where a semi-honest participant attempts to recover personally identifiable information (PII) memorized from other clients’ data. We propose three simple yet effective extraction strategies that leverage contextual prefixes from the attacker’s local data, including frequency-based prefix sampling and local fine-tuning to amplify memorization. To evaluate these attacks, we construct a Chinese legal-domain dataset with fine-grained PII annotations consistent with CPIS, GDPR, and CCPA standards, and assess extraction performance using two metrics: *coverage* and *efficiency*. Experimental results show that our methods can recover up to 56.6% of victim-exclusive PII, where names, addresses, and birthdays are particularly vulnerable. These findings highlight concrete privacy risks in FedLLMs and establish a benchmark and evaluation framework for future research on privacy-preserving federated learning. Code and data are available at https://github.com/SMILELab-FL/FedPII.

pdf bib
PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
Rohit Saxena | Pasquale Minervini | Frank Keller

Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like scientific posters. We introduce PosterSum, a novel benchmark to advance the development of vision-language models that can understand and summarize scientific posters into research paper abstracts. Our dataset contains 16,305 conference posters paired with their corresponding abstracts as summaries. Each poster is provided in image format and presents diverse visual understanding challenges, such as complex layouts, dense text regions, tables, and figures. We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on PosterSum and demonstrate that they struggle to accurately interpret and summarize scientific posters. We propose Segment & Summarize, a hierarchical method that outperforms current MLLMs on automated metrics, achieving a 3.14% gain in ROUGE-L. This will serve as a starting point for future research on poster summarization.

pdf bib
Hypercomplex Transformer: Novel Attention Mechanism
Maxim Gordeev | Zuev Aleksandr | Mikhail Bakulin | Andrey Latyshev | Dmitry Kozlov | Yiwu Yao | Voronova Anastasia

Self-attention mechanisms have become foundational across modern deep learning architectures. Recent efforts focus on improving their efficiency, particularly for signal processing tasks. The existing approaches employ complex-valued representations for inputs and weights and achieve higher accuracy at the cost of increased model size and inference latency. Dual-numbered algebra offers a promising alternative that allows efficient multiplication and faster inference with the same representational capacity. Inspired by previous studies in the field of hypercomplex neural networks, we introduce a generalized hypercomplex attention block and integrate it into Transformer-based models for EEG classification. Our experiments include adaptation of the hypercomplex models, so that the number of parameters is equal to that of their real-valued counterparts. Across all scenarios, the dual- and complex-numbered models consistently outperform the real ones, demonstrating superior accuracy. This work presents hypercomplex attention as a competitive and computationally efficient strategy with potential value to solve multiple NLP tasks.
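
For readers unfamiliar with dual numbers, the snippet below shows the algebraic fact the efficiency argument rests on: because ε² = 0, a dual-number product needs three real multiplications instead of the four required by a complex product. This is a generic illustration of the algebra, not the paper's attention block.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """Dual number a + b*eps with eps**2 == 0."""
    a: float  # real part
    b: float  # dual part

    def __add__(self, other):
        return Dual(self.a + other.a, self.b + other.b)

    def __mul__(self, other):
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps**2 == 0:
        # three real multiplications, versus four for complex numbers.
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

print(Dual(1, 2) * Dual(3, 4))  # Dual(a=3, b=10)
```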

pdf bib
Merging Two Grammar Worlds: Exploring the Relationship between Universal Dependencies and Signal Temporal Logic
Christopher Rashidian | Sabine Brunswicker

Translating natural language requirements into Signal Temporal Logic (STL) is essential for safety-critical systems but requires mathematical expertise. We propose a translational grammar mapping Universal Dependencies (UD) structures to STL Operators through 17 theoretically-motivated patterns, evaluated on the NL2TL benchmarking dataset of 7,002 expert-annotated sentence-STL pairs, and an additional cross-domain analysis. We built a parser guided by this grammar to explore the formal deterministic relationship between UDR Compositions and STL Operators, achieving ~99% sentence coverage and ~54% exact matches (with ~97% similarity). Sentence-level regression analyses predict STL statements and STL Operator classes, considering the co-occurrence of UDR substructures (UDR components), with an accuracy of more than ~74% and ~81%, respectively. They uncover a new logical-grammatical link between temporal NL and formal logic that is conditioned by the sentence-level context, and provide insights into how linguistic theory unfolds in practice through temporal linguistic expressions.
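
For readers unfamiliar with STL, the block below gives a generic example of the kind of natural-language-to-formula mapping targeted here, written in standard STL notation; it illustrates the target formalism only and is not one of the paper's 17 patterns.

```latex
% "Whenever the brake is pressed, the vehicle must stop within 3 seconds."
\varphi \;=\; \mathbf{G}\bigl(\mathit{brake} \rightarrow \mathbf{F}_{[0,3]}\,\mathit{stop}\bigr)
% G = "globally/always", F_{[0,3]} = "eventually within 0 to 3 seconds".
```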

pdf bib
GARuD: Guided Alignment of Representations using Distillation for Ultra-Low-Resource Languages
Debarchan Basu | Shashwat Bhardwaj | Vaibhav Sharma | Pooja Singh | Sandeep Kumar

The vast majority of the world’s languages, particularly low-resource and indigenous ones like Bhili, remain critically underserved by modern language technologies. The primary bottleneck is the lack of large-scale corpora required for standard pre-training. To address this gap, we introduce cross-lingual contrastive distillation, a novel and data-efficient, compute-efficient paradigm for creating powerful language models without a massive monolingual corpus. Our method adapts a pre-existing multilingual model (MuRIL) by using a fixed, expert teacher model (HindBERT) to distill semantic knowledge from a related high-resource language (Hindi) via a contrastive objective on a modest parallel corpus. Through comprehensive experiments, we show that our resulting model, GARuD-Bhili, significantly outperforms strong zero-shot and MLM-only baselines on a suite of evaluations, including intrinsic language modeling, downstream sentiment analysis, and cross-lingual benchmarks (Tatoeba, XNLI). Our work presents a generalizable and scalable blueprint for linguistic empowerment, offering a practical pathway to develop robust language technologies for other underserved languages globally.
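
A compact sketch of the kind of contrastive objective described follows, assuming an InfoNCE-style loss over a parallel Hindi-Bhili batch with a frozen teacher; the exact loss, temperature, and pooling used by the authors are assumptions here.

```python
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_emb, teacher_emb, temperature=0.05):
    """student_emb: (B, d) embeddings of Bhili sentences from the adapted model;
    teacher_emb: (B, d) embeddings of the aligned Hindi sentences from the
    frozen expert teacher. The aligned pair is treated as the positive."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1).detach()   # teacher stays fixed
    logits = s @ t.T / temperature                  # all-pairs similarities
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)
```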

pdf bib
IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?
Akhilesh Aravapalli | Mounika Marreddy | Radhika Mamidi | Manish Gupta | Subba Reddy Oota

Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Our probing analysis of surface, syntactic, and semantic properties reveals that, while almost all multilingual models demonstrate consistent encoding performance for English, surprisingly, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages.

pdf bib
Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks
Jinu Nyachhyon | Mridul Sharma | Prajwal Thapa | Bal Krishna Bal

The Nepali language has distinct linguistic features, especially its complex script (Devanagari script), morphology, and various dialects, which pose a unique challenge for Natural Language Understanding (NLU) tasks. While the Nepali Language Understanding Evaluation (Nep-gLUE) benchmark provides a foundation for evaluating models, it remains limited in scope, covering only four tasks, which restricts its utility for comprehensive assessments of Natural Language Processing (NLP) models. To address this limitation, we introduce twelve new datasets, creating a new benchmark, the Nepali Language Understanding Evaluation (NLUE) benchmark, for evaluating the performance of models across a diverse set of Natural Language Understanding (NLU) tasks. The added tasks include Single-Sentence Classification, Similarity and Paraphrase Tasks, Natural Language Inference (NLI), and General Masked Evaluation Task (GMET). Through extensive experiments, we demonstrate that existing top models struggle with the added complexity of these tasks. We also find that the best multilingual model outperforms the best monolingual models across most tasks, highlighting the need for more robust solutions tailored to the Nepali language. This expanded benchmark sets a new standard for evaluating, comparing, and advancing models, contributing significantly to the broader goal of advancing NLP research for low-resource languages.

pdf bib
Family helps one another: Dravidian NLP suite for Natural Language Understanding
Abhinav Pm | Priyanka Dasari | Vuppala Nagaraju | Parameswari Krishnamurthy

Developing robust Natural Language Understanding (NLU) for morphologically rich Dravidian languages like Kannada, Malayalam, Tamil, and Telugu presents significant challenges due to their agglutinative nature and syntactic complexity. In this work, we present the Dravidian NLP Suite, tackling five core tasks: Morphological Analysis (MA), POS Tagging (POS), Named Entity Recognition (NER), Dependency Parsing (DEP), and Coreference Resolution (CR), with both monolingual and multilingual models. To facilitate this, we present the Dravida dataset, a meticulously annotated multilingual corpus for these tasks across all four languages. Our experiments demonstrate that a multilingual model, which utilizes shared linguistic features and cross-lingual patterns inherent to the Dravidian family, consistently outperforms its monolingual counterparts across all tasks. These findings suggest that multilingual learning is an effective approach for enhancing Natural Language Understanding (NLU) capabilities, particularly for languages belonging to the same family. To the best of our knowledge, this is the first work to jointly address all these core tasks on the Dravidian languages.

pdf bib
Spatial-Aware Visual Program Guided Reasoning for Answering Complex Visual Questions
Haoran Wang | Kai Shu

Visual Question Answering (VQA) often requires complex multi-hop reasoning encompassing both vision and language. Despite the remarkable performance of Large Multimodal Models (LMMs) in vision-language tasks, they encounter difficulties when faced with challenging scenarios that require complex reasoning and may be susceptible to object hallucination. This paper introduces a novel framework named Spatial-aware Visual Program Reasoning (SVPR). The primary goal of SVPR is to enhance the alignment between vision and language within LMMs, fostering their multi-hop reasoning abilities and ultimately strengthening their capacity to address complex visual reasoning tasks. We first utilize the strong visual understanding abilities of LMMs to generate scene graphs, facilitating coordination between vision and language at semantic levels. Then, we leverage the in-context learning ability of LMMs to generate visual programs, which guide the question decomposition process. Finally, we employ a program solver to execute the programs and derive the final answer. This process makes our approach both explanatory and robust, providing clear explanations of its reasoning process while ensuring the faithfulness of the answer to the visual input. We evaluate our framework on two challenging multi-hop multimodal VQA datasets and show its effectiveness under zero-shot settings.

pdf bib
Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges
Jintao Liang | Sugang | Huifeng Lin | You Wu | Rui Zhao | Ziyue Li

Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to overcome the knowledge limitations of Large Language Models (LLMs) by integrating external retrieval with language generation. While early RAG systems based on static pipelines have shown effectiveness in well-structured tasks, they struggle in real-world scenarios requiring complex reasoning, dynamic retrieval, and multi-modal integration. To address these challenges, the field has shifted toward Reasoning Agentic RAG, a paradigm that embeds decision-making and adaptive tool use directly into the retrieval process. In this paper, we present a comprehensive review of Reasoning Agentic RAG methods, categorizing them into two primary paradigms: predefined reasoning, which follows fixed modular pipelines to boost reasoning, and agentic reasoning, where the model autonomously orchestrates tool interaction during inference. We analyze representative techniques under both paradigms, covering architectural design, reasoning strategies, and tool coordination. Finally, we discuss key research challenges and propose future directions to advance the flexibility, robustness, and applicability of reasoning agentic RAG systems.

pdf bib
Extracting Numeric Assertions from Text
Amar Parajuli | Koninika Pal

Open-domain Information Extraction (IE) plays an essential role in constructing large-scale knowledge bases and supports downstream applications such as Question Answering, Text Summarization, etc. While most prior research in IE has centered around extracting categorical relational tuples (e.g., president of, located in), the extraction of numerical relations (e.g., literacy rate, area, molecular weight) that link quantitative mentions to corresponding entities remains relatively underexplored. This work addresses this gap by targeting the extraction of open-domain numeric assertions, which require identifying both the relevant entity and the appropriate measuring attribute associated with a quantity in natural language text. We begin by refining an existing OpenIE system through a rule-based approach where retrieving implicit measuring attributes for a quantity mention becomes the main challenge. To overcome this, we propose a neural framework that jointly identifies the relevant entity for a numeric mention and infers the measuring attribute to relate them, using contextual cues in the sentence. Experimental evaluation shows that our proposed model outperforms the baseline and a general-purpose large language model by a significant margin.

pdf bib
Mixed Signals: Understanding Model Disagreement in Multimodal Empathy Detection
Maya Srikanth | Run Chen | Julia Hirschberg

Multimodal models play a key role in empathy detection, but their performance can suffer when modalities provide conflicting cues. To understand these failures, we examine cases where unimodal and multimodal predictions diverge. Using fine-tuned models for text, audio, and video, along with a gated fusion model, we find that such disagreements often reflect underlying ambiguity, as evidenced by annotator uncertainty. Our analysis shows that dominant signals in one modality can mislead fusion when unsupported by others. We also observe that humans, like models, do not consistently benefit from multimodal input. These insights position disagreement as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.

pdf bib
PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint
Bhoomit Vasani | Jack FitzGerald | Anjie Fang | Sushmit Vaish

We introduce PHLoRA (Post-hoc LoRA, pronounced “flora”), a simple yet powerful method to extract low-rank adaptation adapters from full-rank fine-tuned models without requiring access to training data or gradients. By computing the low-rank decomposition of weight differences between a base model and its fine-tuned counterpart, our method reconstructs adapter modules that can be merged or dynamically routed at inference time via S-LoRA, AdapterFusion, or served in scalable, industry settings using platforms like NVIDIA NIM. This approach amortizes latency overhead across requests and yields substantial cost savings. Unlike prior work that trains each adapter explicitly, our approach decouples fine-tuning from adapter generation, allowing adapter extraction from existing full-rank models or third-party checkpoints. Experiments on text, image, and video benchmarks using the Amazon Nova model family demonstrate that extracted adapters preserve high energy from the full weight delta, can be pruned safely, and yield negligible degradation in downstream task performance when re-merged. Overall, PHLoRA provides a practical path for making all existing full-rank checkpoints adapter-ready, democratizing scalable inference for all models.
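
The core weight-delta factorization can be sketched in a few lines of truncated SVD, as below; this is an illustrative reconstruction of the general recipe, not the PHLoRA implementation itself.

```python
import torch

def extract_lora(W_base, W_ft, rank=16):
    """Factor the delta between a fine-tuned and a base weight matrix
    into low-rank factors A, B such that W_base + B @ A ~ W_ft."""
    delta = W_ft - W_base                               # (d_out, d_in)
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]
    B = U_r * S_r.sqrt()                                # (d_out, rank)
    A = S_r.sqrt().unsqueeze(1) * Vh_r                  # (rank, d_in)
    retained = (S_r.square().sum() / S.square().sum()).item()  # "energy" kept
    return A, B, retained
```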

pdf bib
Harmonious Minds: Benchmarking Intertwined Reasoning of Human Personality and Musical Preference
Sayantan Pal | Souvik Das | Rohini Srihari

Understanding how large language models (LLMs) reason across semantically distinct domains remains an open challenge. In this work, we investigate whether LLMs can connect personality traits to musical preferences, specifically chord progressions. Drawing on psychological theory and symbolic music structure, we introduce a novel benchmark that evaluates two interdependent tasks: (1) inferring personality traits from a textual context and (2) selecting a musically appropriate chord progression aligned with the inferred trait. We release a synthetic, expert-guided dataset grounded in Cattell’s 16 Personality Factors (PF16), genre-conditioned chord structures, and diverse situational contexts. We explore multiple learning strategies, including fine-tuning on task-specific corpora, model merging with LoRA adapters, and advanced prompt-based reasoning techniques such as verbalization. Additionally, we propose a teacher-student framework to evaluate the quality of model-generated explanations using a five-dimensional rubric. Our findings show that verbalization outperforms standard reasoning methods, achieving up to 11% improvement over zero-shot baselines.

pdf bib
Quantifying and Mitigating Selection Bias in LLMs: A Transferable LoRA Fine-Tuning and Efficient Majority Voting Approach
Blessed Guda | Lawrence Francis | Gabrial Zencha Ashungafac | Carlee Joe-Wong | Moise Busogi

Multiple Choice Question (MCQ) answering is a widely used method for evaluating the performance of Large Language Models (LLMs). However, LLMs often exhibit selection bias in MCQ tasks, where their choices are influenced by factors like answer position or option symbols rather than the content. This bias undermines the reliability of MCQ as an evaluation framework. Most existing selection bias metrics require answer labels and measure divergences between prediction and answer distributions, but do not fully capture the consistency of a model’s predictions across different orderings of answer choices. Existing selection bias mitigation strategies have notable limitations: majority voting, though effective, is computationally prohibitive; calibration-based methods require validation sets and often fail to generalise across datasets. To address these gaps, we propose three key contributions: (1) a new unsupervised label-free Permutation Bias Metric (PBM) that directly quantifies inconsistencies in model predictions across answer permutations, providing a more precise measure of selection bias, (2) an efficient majority voting approach called Batch Question-Context KV caching (BaQCKV), to significantly reduce computational costs while preserving bias mitigation effectiveness, and (3) an unsupervised Low-Rank Adaptation (LoRA)-1 fine-tuning strategy based on our proposed metric and the BaQCKV that mitigates selection bias, providing a computationally efficient alternative that maintains model generalizability. Experiments across multiple MCQ benchmarks demonstrate that our approaches reduce bias, increasing consistency in accuracy while minimizing computational costs.
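
One plausible, label-free formulation of such a permutation-based metric is sketched below: present the same question under several option orderings, map each predicted letter back to the option content, and score how often the chosen content changes. This is an assumption about the metric's general shape, not necessarily the paper's exact PBM, and `predict_choice` is a hypothetical wrapper around the evaluated LLM.

```python
from collections import Counter
from itertools import permutations

def permutation_bias(question, options, predict_choice, max_perms=6):
    """Returns 0.0 when the model picks the same option content under every
    ordering and approaches 1.0 as its choices become order-dependent."""
    chosen = []
    for perm in list(permutations(options))[:max_perms]:
        idx = predict_choice(question, list(perm))   # index of the picked option
        chosen.append(perm[idx])                     # track content, not position
    modal_freq = Counter(chosen).most_common(1)[0][1]
    return 1.0 - modal_freq / len(chosen)
```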

pdf bib
MINDS: A Cross-Cultural Dialogue Corpus for Social Norm Classification and Adherence Detection
Pritish Sahu | Anirudh Som | Ajay Divakaran | Dimitra Vergyri

Social norms are implicit, culturally grounded expectations that guide interpersonal communication. Unlike factual commonsense, norm reasoning is subjective, context-dependent, and varies across cultures—posing challenges for computational models. Prior works provide valuable normative annotations but mostly target isolated utterances or synthetic dialogues, limiting their ability to capture the fluid, multi-turn nature of real-world conversations. In this work, we present Norm-RAG, a retrieval-augmented, agentic framework for nuanced social norm inference in multi-turn dialogues. Norm-RAG models utterance-level attributes including communicative intent, speaker roles, interpersonal framing, and linguistic cues, and grounds them in structured normative documentation retrieved via a novel Semantic Chunking approach. This enables interpretable and context-aware reasoning about norm adherence and violation across multilingual dialogues. We further introduce MINDS (Multilingual Interactions with Norm-Driven Speech), a bilingual dataset comprising 31 multi-turn Mandarin-English and Spanish-English conversations. Each turn is annotated for norm category and adherence status using multi-annotator consensus, reflecting cross-cultural and realistic norm expression. Our experiments show that Norm-RAG improves norm detection and generalization and demonstrates improved performance for culturally adaptive and socially intelligent dialogue systems.

pdf bib
Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts
Raavi Gupta | Pranav Hari Panicker | Sumit Bhatia | Ganesh Ramakrishnan

Large language models (LLMs), despite their remarkable text generation capabilities, often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. A typical way to use LLMs is via the APIs provided by LLM vendors, where there is no access to model weights or options to fine-tune the model. Existing methods to detect hallucinations in such settings, where model access is restricted or constrained by resources, typically require making multiple LLM API calls, increasing latency and API cost. We introduce CONFACTCHECK, an efficient hallucination detection approach that does not leverage any external knowledge base and works on the simple intuition that responses to factual probes within the generated text should be consistent within a single LLM and across different LLMs. Rigorous empirical evaluation on multiple datasets that cover both factual text generation and open-ended generation shows that CONFACTCHECK can detect hallucinated facts efficiently using fewer resources and achieves higher accuracy scores compared to existing baselines that operate under similar conditions. Our code is available here.

pdf bib
Feather-SQL: A Lightweight NL2SQL Framework with Dual-Model Collaboration Paradigm for Small Language Models
Wenqi Pei | Hailing Xu | Henry Hengyuan Zhao | Shizheng Hou | Chen Han | Zining Zhang | Luo Pingyi | Bingsheng He

Natural Language to SQL (NL2SQL) has seen significant advancements with large language models (LLMs). However, these models often depend on closed-source methods and high computational resources, posing challenges in data privacy and deployment. In contrast, small language models (SLMs) struggle with NL2SQL tasks, exhibiting poor performance and incompatibility with existing frameworks. To address these issues, we introduce Feather-SQL, a new lightweight framework tailored for SLMs. Feather-SQL improves SQL executability and accuracy through: (i) schema pruning and linking, and (ii) multi-path and multi-candidate generation. Additionally, we introduce the 1+1 Model Collaboration Paradigm, which pairs a strong general-purpose chat model with a fine-tuned SQL model, combining strong analytical reasoning with high-precision SQL generation. Experimental results on BIRD demonstrate that Feather-SQL improves NL2SQL performance on SLMs, with a boost of around 10% for models without fine-tuning. The proposed paradigm raises the accuracy ceiling of SLMs to 54.76%, highlighting its effectiveness.

pdf bib
Uncovering Cultural Representation Disparities in Vision-Language Models
Ram Mohan Rao Kadiyala | Siddhant Gupta | Jebish Purbey | Srishti Yadav | Suman Debnath | Alejandro R. Salamanca | Desmond Elliott

Vision-Language Models (VLMs) have demonstrated impressive capabilities across a range of tasks, yet concerns about their potential biases persist. This work investigates the cultural biases in state-of-the-art VLMs by evaluating their performance on an image-based country identification task at the country level. Utilizing the geographically diverse Country211 dataset, we probe VLMs via open-ended questions, multiple-choice questions (MCQs), and include challenging multilingual and adversarial task settings. Our analysis aims to uncover disparities in model accuracy across different countries and question formats, providing insights into how training data distribution and evaluation methodologies may influence cultural biases in VLMs. The findings highlight significant variations in performance, suggesting that while VLMs possess considerable visual understanding, they inherit biases from their pre-training data and scale, which impact their ability to generalize uniformly across diverse global contexts.

pdf bib
Can You Really Trust That Review? ProtoFewRoBERTa and DetectAIRev: A Prototypical Few-Shot Method and Multi-Domain Benchmark for Detecting AI-Generated Reviews
Shifali Agrahari | Sujit Kumar | Ranbir Singh Sanasam

Synthetic reviews mislead users and erode trust in online marketplaces, and the advent of Large Language Models (LLMs) makes detecting such AI-generated content increasingly challenging due to their human-like fluency and coherence. In the literature, LLM-generated review detection datasets are limited to one or a few domains, with reviews generated by only a few LLMs. Consequently, datasets are limited in diversity in terms of both domain coverage and review generation styles. Models trained on such datasets generalize poorly, lacking cross-model adaptation and struggling to detect diverse LLM-generated reviews in real-world, open-domain scenarios. To address this, we introduce DetectAIRev, a benchmark dataset for AI-generated review detection that includes human-written reviews from diverse domains and AI-generated reviews generated by various categories of LLMs. We evaluate the quality and reliability of the proposed dataset through several ablation studies and human evaluations. Furthermore, we propose ProtoFewRoBERTa, a few-shot AI-generated text detection framework that combines prototypical networks with RoBERTa embeddings, learning discriminative features over multiple LLMs and human-written text from only a few labeled examples per class and identifying which LLM authored a given text. We conduct our experiments on DetectAIRev and a publicly available benchmark dataset. Our experimental results suggest that the proposed method outperforms state-of-the-art baseline models in detecting AI-generated reviews and identifying their authoring models.

pdf bib
Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model?
Sazia Tabasum Mim | Jack Morris | Manish Dhakal | Yanming Xiu | Maria Gorlatova | Yi Ding

To explore a more scalable path for adding multimodal capabilities to existing LLMs, this paper addresses a fundamental question: Can a unimodal LLM, relying solely on text, reason about its own informational needs and provide effective feedback to optimize a multimodal model? To answer this, we propose a method that enables a language agent to give feedback to a vision-language model (VLM) to adapt text generation to the agent’s preferences. Our results from different experiments affirm this hypothesis, showing that LLM preference feedback significantly enhances VLM descriptions. Using our proposed method, we find that the VLM can supply multimodal scene descriptions to help the LLM better understand multimodal context, leading to improvements of up to 13% in absolute accuracy compared to the baseline multimodal approach. Furthermore, a human study validated our AI-driven feedback, showing a 64.6% preference alignment rate between the LLM’s choices and human judgments. Extensive experiments provide insights on how and why the method works and its limitations.

pdf bib
SOMAJGYAAN: A Dataset for Evaluating LLMs on Bangla Culture, Social Knowledge, and Low-Resource Language Adaptation
Fariha Anjum Shifa | Muhtasim Ibteda Shochcho | Abdullah Ibne Hanif Arean | Mohammad Ashfaq Ur Rahman | Akm Moshiur Rahman Mazumder | Ahaj Mahhin Faiak | Md Fahim | M Ashraful Amin | Amin Ahsan Ali | Akmmahbubur Rahman

Despite significant progress in large language models (LLMs), their knowledge and evaluation continue to be centered around high-resource languages, leaving critical gaps in low-resource settings. This raises questions about how effectively LLMs handle subjects that require locally relevant knowledge. To address this challenge, we need a robust dataset that reflects the knowledge of underrepresented regions such as Bangladesh. In this paper, we present ***SOMAJGYAAN***, a Bangla multiple-choice dataset consisting of 4,234 questions, annotated across five levels of difficulty. The questions are drawn from Bangladesh’s National Curriculum and Global Studies textbooks, covering a wide range of domains including History, Geography, Economics, Social Studies, Politics and Law, and Miscellaneous topics. Difficulty levels were assigned by four expert annotators to minimize annotation bias. The experiments reveal that closed-source LLMs perform better than open-source LLMs. While fine-tuning open-source models improves their performance, they still fall short of matching closed-source LLMs. Our findings highlight the importance of culturally grounded evaluation datasets and task-specific adaptation to improve LLM performance in low-resource language settings.

pdf bib
CMBan: Cartoon-Driven Meme Contextual Classification Dataset for Bangla
Newaz Ben Alam | Akm Moshiur Rahman Mazumder | Mir Sazzat Hossain | Mysha Samiha | Md Alvi Noor Hossain | Md Fahim | Amin Ahsan Ali | Ashraful Islam | M Ashraful Amin | Akmmahbubur Rahman

Social networks extensively feature memes, particularly cartoon images, as a prevalent form of communication often conveying complex sentiments or harmful content. Detecting such content, particularly when it involves Bengali and English text, remains a multimodal challenge. This paper introduces ***CMBan***, a novel and culturally relevant dataset of 2,641 annotated cartoon memes. It addresses meme classification based on sentiment across five key categories: Humor, Sarcasm, Offensiveness, Motivational Content, and Overall Sentiment, incorporating both image and text features. Our curated dataset specifically aids in detecting nuanced offensive content and navigating the complexities of pure Bengali, English, or code-mixed Bengali-English language. Through rigorous experimentation involving over 12 multimodal models, including monolingual, multilingual, and proprietary architectures, and utilizing prompting methods like Chain-of-Thought (CoT), our findings suggest that this cartoon-based, code-mixed meme content poses substantial understanding challenges. Experimental results demonstrate that closed models outperform open models, while LoRA fine-tuning equalizes performance across model architectures and improves classification of the more challenging aspects. Overall, this work advances meme classification by providing an effective solution for detecting harmful content in multilingual meme contexts.

pdf bib
Multi-Agent Cross-Lingual Veracity Assessment for Explainable Fake News Detection
Bassamtiano Renaufalgi Irnawan | Yoshimi Suzuki | Noriko Tomuro | Fumiyo Fukumoto

The spread of fake news during the COVID-19 pandemic era triggered widespread chaos and confusion globally, causing public panic and misdirected health behavior. Automated fact checking in non-English languages is challenging due to the low availability of trusted resources. Several prior works have attempted automated fact checking in multilingual settings. However, most of them fine-tune pre-trained language models (PLMs) and only produce veracity predictions without providing explanations. The absence of explanatory reasoning in these models reduces the credibility of their predictions. This paper proposes a multi-agent explainable cross-lingual fake news detection method that leverages credible English evidence and Large Language Models (LLMs) to verify and generate explanations for non-English claims, overcoming the scarcity of non-English evidence. The experimental results show that the proposed method performs well across three non-English multilingual COVID-19 datasets in terms of veracity predictions and explanations. Our source code is available online. (https://github.com/bassamtiano/crosslingual_efnd)

pdf bib
GeoSAFE - A Novel Geospatial Artificial Intelligence Safety Assurance Framework and Evaluation for LLM Moderation
Nihar Sanda | Rajat Shinde | Sumit Nawathe | William Seawright | Shaona Ghosh | Manil Maskey

The rapid progress of generative AI (Gen-AI) and large language models (LLMs) offers significant potential for geospatial applications, but simultaneously introduces critical privacy, security, and ethical risks. Existing general-purpose AI safety frameworks inadequately cover GeoAI-specific risks such as geolocation privacy violations and re-identification, with False Safe Rates exceeding 40% in some models. To address this, we present GeoSAFE (Geospatial Safety Assurance Framework and Evaluation), introducing the first GeoAI-specific safety taxonomy with six hazard categories and a multimodal GeoSAFE-Dataset. It includes 11,694 textual prompts with explanations, augmented by real-world queries and images to reduce synthetic bias and reflect operational use. We benchmark model performance on detecting unsafe geospatial queries. Additionally, we present GeoSAFEGuard, an instruction-tuned LLM achieving a 4.6% False Safe Rate, 0.4% False Unsafe Rate, and 97% F1-score on text-to-text evaluation of the GeoSAFE-Dataset. An anonymous user survey confirms human-GeoSAFE alignment, emphasizing the urgent need for domain-specific safety evaluations, as general-purpose LLMs fail to detect unsafe location-powered queries.

pdf bib
Investigating Omission as a Latency Reduction Strategy in Simultaneous Speech Translation
Mana Makinae | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe

Simultaneous speech translation (SiST) requires balancing translation quality and latency. While most SiST systems follow machine translation assumptions that prioritize full semantic accuracy to the source, human interpreters often omit less critical content to catch up with the speaker. This study investigates whether omission can be used to reduce latency while preserving meaning in SiST. We construct a dataset that includes omissions using large language models (LLMs) and propose Target-Duration Latency (TDL), a target-based latency metric that measures output length while accounting for the start and end timing of translation. Our analysis shows that LLMs can omit less important words while retaining the core meaning. Furthermore, experimental results show that although standard metrics overlook the benefit of the model trained with the proposed omission-involving dataset, alternative evaluation methods capture it, as omission leads to shorter outputs with acceptable quality.

pdf bib
Evaluating LLMs’ Reasoning Over Ordered Procedural Steps
Adrita Anika | Md Messal Monem Miah

Reasoning over procedural sequences, where the order of steps directly impacts outcomes, is a critical capability for large language models (LLMs). In this work, we study the task of reconstructing globally ordered sequences from shuffled procedural steps, using a curated dataset of food recipes, a domain where correct sequencing is essential for task success. We evaluate several LLMs under zero-shot and few-shot settings and present a comprehensive evaluation framework that adapts established metrics from ranking and sequence alignment. These include Kendall’s Tau, Normalized Longest Common Subsequence (NLCS), and Normalized Edit Distance (NED), which capture complementary aspects of ordering quality. Our analysis shows that model performance declines with increasing sequence length, reflecting the added complexity of longer procedures. We also find that greater step displacement in the input, corresponding to more severe shuffling, leads to further degradation. These findings highlight the limitations of current LLMs in procedural reasoning, especially with longer and more disordered inputs.
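
The three metrics can be computed as below for a toy recipe; this is a plain reference implementation of the standard definitions, with the paper's exact normalization and tie-handling choices assumed.

```python
from scipy.stats import kendalltau

def lcs_len(a, b):
    """Length of the longest common subsequence of two step sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def edit_distance(a, b):
    """Levenshtein distance between two step sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j-1] + 1, prev + (x != y))
    return dp[-1]

gold = ["preheat oven", "mix batter", "pour into pan", "bake", "cool"]
pred = ["mix batter", "preheat oven", "pour into pan", "bake", "cool"]

gold_rank = {s: i for i, s in enumerate(gold)}
pred_rank = {s: i for i, s in enumerate(pred)}
tau, _ = kendalltau([gold_rank[s] for s in gold], [pred_rank[s] for s in gold])
nlcs = lcs_len(gold, pred) / len(gold)
ned = edit_distance(gold, pred) / len(gold)
print(f"Kendall tau={tau:.2f}  NLCS={nlcs:.2f}  NED={ned:.2f}")
```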

pdf bib
mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models
Arka Mukherjee | Shreya Ghosh

Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU, MathVista). Yet, these results fail to sufficiently distinguish true scientific reasoning articulation capabilities from pattern-matching. To address this gap, we introduce mmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India’s JEE Advanced examination (2019-2025) spanning pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. While closed frontier models from Google and OpenAI show high problem-solving accuracies (up to 100% pass@3 scores), they fully collapse when the reasoning load is increased meta-cognitively (GPT-5 fixes just 5.2% of errors). Systematic ablations show mmJEE-Eval’s difficulty stems from complexity and reasoning depth rather than memorization. Effectively, our benchmark segregates superior training and reasoning methodologies where alternatives fail. We publicly release our code and data: https://mmjee-eval.github.io

pdf bib
Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations
Krithi Shailya | Akhilesh Kumar Mishra | Gokul S Krishnan | Balaraman Ravindran

Large Language Models (LLMs) are increasingly used as daily recommendation systems for tasks like education planning, yet their recommendations risk perpetuating societal biases. This paper empirically examines geographic, demographic, and economic biases in university and program suggestions from three open-source LLMs: LLaMA-3.1-8B, Gemma-7B, and Mistral-7B. Using 360 simulated user profiles varying by gender, nationality, and economic status, we analyze over 25,000 recommendations. Results show strong biases: institutions in the Global North are disproportionately favored, recommendations often reinforce gender stereotypes, and institutional repetition is prevalent. While LLaMA-3.1 achieves the highest diversity, recommending 481 unique universities across 58 countries, systemic disparities persist. To quantify these issues, we propose a novel, multi-dimensional evaluation framework that goes beyond accuracy by measuring demographic and geographic representation. Our findings highlight the urgent need for bias consideration in educational LMs to ensure equitable global access to higher education.

pdf bib
Regional-TinyStories: A Small Language Model Framework for Evaluating Language Learning, Tokenizers, and Datasets
Nirvan Patil | Malhar Abhay Inamdar | Agnivo Gosai | Guruprasad Pathak | Anish Joshi | Anish Joshirao | Raj Dandekar | Rajat Dandekar | Sreedath Panat

Small, resource-efficient language models are pivotal for extending high-quality text generation to low-resource and regional languages—the true frontier of linguistic equity in AI. Yet research largely prioritises massive English-centric systems, leaving regional-centric (low-resource) language modelling underexplored, particularly how tokenizer design, dataset diversity, and linguistic structure shape the inference of Small Language Models (SLMs) under realistic computational and data constraints. We present Regional-TinyStories, a lightweight framework that treats SLMs as cost-effective stand-ins for LLMs, enabling rapid, variable-wise inference-based analysis. Extending TinyStories to Hindi, Marathi, and Bangla, we release datasets of 2M synthetic and translated stories per language and train over 20 SLMs spanning 5–157M parameters. Using this framework, we (i) uncover contrasts between form-oriented (grammar, fluency) and content-oriented (context, completeness, creativity) metrics; (ii) chart language-specific learning dynamics; (iii) rank tokenizers, showing Indic-specific Sarvam-1 outperforming SUTRA and generic Tiktoken (GPT-2) across all metrics; and (iv) demonstrate that dataset semantic quality (translation vs. synthetic) strongly governs downstream generation. Validation through an LLM-as-Judge ensemble (GPT-4o, LLaMA-3.3-70B) and a 100+ participant human study confirms these trends while exposing systematic score inflation in automated evaluations. Regional-TinyStories offers a reproducible path to benchmark tokenizers, datasets, and SLM designs for scalable, context-faithful generation in low-resource settings.

pdf bib
High-Quality Complex Text-to-SQL Data Generation through Chain-of-Verification
Yuchen Zhang | Yuze Gao | Bin Chen | Wenfeng Li | Shuo Sun | Jian Su

Can today’s Text-to-SQL benchmarks still stretch modern LLMs? We argue no. Spider1.0 and BIRD, painstakingly hand-built, remain small, costly, and skewed toward moderately complex SQL. Meanwhile, LLM-generated corpora are inexpensive but often superficial and fragile, suffering from shallow nesting, semantic drift, template fatigue, and insufficient quality checks. We address this gap with a Chain-of-Verification framework that turns a handful of expert-labelled seeds into a large, reliably checked dataset at a fraction of the usual cost. The resulting corpus, AIGT2S, delivers: (1) 18k Question–SQL pairs across 113 databases, 41–77% larger than current English sets; (2) 55% of queries in the Ultra band of our four-level difficulty taxonomy; (3) 87.5% inter-annotator agreement; and (4) ≥80% labour and ≥98% monetary savings versus earlier efforts. Baselines, including GPT-4o, Llama3, RESDSQL, and MAC-SQL, achieve at most 56% execution accuracy, indicating substantial room for improvement.

pdf bib
Cross-Prompt Encoder for Low-Performing Languages
Beso Mikaberidze | Temo Saghinadze | Simon Ostermann | Philipp Müller

Soft prompts have emerged as a powerful alternative to adapters in parameter-efficient fine-tuning (PEFT), enabling large language models (LLMs) to adapt to downstream tasks without architectural changes or parameter updates. While prior work has focused on stabilizing training via parameter interaction in small neural prompt encoders, their broader potential for transfer across languages remains unexplored. In this paper, we demonstrate that a prompt encoder can play a central role in improving performance on low-performing languages—those that achieve poor accuracy even under full-model fine-tuning. We investigate a lightweight encoder paired with multi-source training on typologically diverse languages. We call this architecture-training combination the Cross-Prompt Encoder (XPE), and show that it advances the capture of abstract, transferable patterns across languages. To complement XPE, we propose a Dual Soft Prompt mechanism that combines an encoder-based prompt with a directly trained standard soft prompt. This hybrid design proves especially effective for target languages that benefit from both broadly shared structure and language-specific alignment. Text classification experiments with a transformer encoder (XLM-R) on the SIB-200 benchmark reveal a consistent trade-off: XPE is most effective for low-performing languages, while hybrid variants offer broader adaptability across multilingual settings.

pdf bib
An Information-Theoretic Approach to Reducing Fertility in LLMs for Manipuri Machine Translation
Telem Joyson Singh | Ranbir Singh Sanasam | Priyankoo Sarmah

Large language models (LLMs) have transformed machine translation, yet they suffer from high subword fertility for low-resource languages, which leads to slow inference speed and increased costs. While vocabulary expansion via continual pre-training is a common solution, it often degrades translation quality and requires large target-language corpora, which are unavailable for truly low-resource languages. To address this, we investigate tokenization efficiency through an information-theoretic lens, building on the established hypothesis that word length correlates with information content. From this perspective, we characterize tokenization inefficiency as having high fertility for low-information (highly predictable) words. Guided by this principle, we introduce a novel fine-tuning strategy that systematically identifies informationally redundant words—those with high fertility but low information content—for targeted vocabulary expansion and model fine-tuning. Experiments fine-tuning BLOOM and LLaMA-3 on English-Manipuri and two other language pairs show that our proposed method significantly reduces fertility by 50% and accelerates inference by more than 2 times, while matching, and often exceeding, the translation quality of standard LLM baselines, providing a theoretically grounded solution for efficient LLM-based MT.
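
As a concrete reference point, fertility here is the average number of subword tokens a tokenizer produces per whitespace-separated word, which can be measured as below; the tokenizer name and sample sentences are placeholders rather than the paper's setup.

```python
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    """Average subword tokens per whitespace word over a sample corpus."""
    n_words = n_tokens = 0
    for sent in sentences:
        n_words += len(sent.split())
        n_tokens += len(tokenizer.tokenize(sent))
    return n_tokens / max(n_words, 1)

# Placeholder usage: compare tokenizers before/after vocabulary expansion
# on a small Manipuri sample (replace with real Meetei Mayek text).
tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
sample = ["<Manipuri sentences go here>"]
print(f"fertility = {fertility(tok, sample):.2f}")
```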

pdf bib
Agent-based Automated Claim Matching with Instruction-following LLMs
Dina Pisarevskaya | Arkaitz Zubiaga

We present a novel agent-based approach for the automated claim matching task with instruction-following LLMs. We propose a two-step pipeline that first generates prompts with LLMs and then performs claim matching as a binary classification task with LLMs. We demonstrate that LLM-generated prompts can outperform state-of-the-art results obtained with human-generated prompts, and that smaller LLMs can do as well as larger ones in the generation process, saving computational resources. We also demonstrate the effectiveness of using different LLMs for each step of the pipeline, i.e., using one LLM for prompt generation and another for claim matching. Our investigation into the prompt generation process in turn reveals insights into the LLMs’ understanding and handling of the claim matching task.

pdf bib
Tooka-SBERT: Lightweight Sentence Embedding models for Persian
Ghazal Zamaninejad | MohammadAli SadraeiJavaheri | Farnaz Aghababaloo | Hamideh Rafiee | Milad Molazadeh Oskuee | AmirMohammad Salehoof

We introduce Tooka-SBERT, a family of Persian sentence embedding models designed to enhance semantic understanding for Persian. The models are released in two sizes—Small (123M parameters) and Large (353M parameters)—both built upon the TookaBERT backbone. Tooka-SBERT is pretrained on the Targoman News corpus and fine-tuned using high-quality synthetic Persian sentence pair datasets to improve semantic alignment. We evaluate Tooka-SBERT on PTEB, a Persian adaptation of the MTEB benchmark, where the Large model achieves an average score of 70.54% and the Small model 69.49%, outperforming some strong multilingual baselines. Tooka-SBERT provides a compact and high-performing open-source solution for Persian sentence representation, with efficient inference suitable for both GPU and CPU environments. Our models are publicly available on Hugging Face, and the corresponding benchmark results can be viewed on the PTEB Leaderboard.

up

pdf (full)
bib (full)
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)

pdf bib
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)
Firoj Alam | Sudipta Kar | Shammur Absar Chowdhury | Naeemul Hassan | Enamul Hoque Prince | Mohiuddin Tasnim | Md Rashad Al Hasan Rony | Md Tahmid Rahman Rahman

pdf bib
Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition
Ahnaf Mozib Samin

Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge in various natural language and speech processing tasks. Recent research highlights the dependency of BPE subword tokenization’s efficacy on the morphological nature of the language, particularly in languages rich in inflectional morphology, where fewer BPE merges suffice for generating highly productive tokens. Motivated by this, our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity, thus enhancing out-of-distribution automatic speech recognition (ASR) performance. Experimental evaluation reveals that an excessively high number of BPE tokens can lead to overfitting, while approximately 500-1000 tokens result in superior OOV performance. Furthermore, we conduct a comparative analysis of BPE with character-based and unigram-based tokenization methods. By introducing BPE tokenization to Bengali ASR, we achieve a substantial reduction in the word error rate (WER) from 66.44% in our character-based baseline system to 63.80% on the LB-ASRTD eval set and from 46.34% to 42.80% on the SHRUTI eval set, both of which include out-of-distribution data.
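
A minimal way to reproduce the tokenizer side of this setup is with the SentencePiece BPE trainer, as sketched below, using a vocabulary size in the 500-1000 range the paper finds best; the file names are placeholders and the ASR model itself is not shown.

```python
import sentencepiece as spm

# Train a BPE model on Bengali transcripts (placeholder path), then tokenize.
spm.SentencePieceTrainer.train(
    input="bn_transcripts.txt",   # one transcript per line
    model_prefix="bn_bpe",
    model_type="bpe",
    vocab_size=1000,              # ~500-1000 tokens work best per the paper
    character_coverage=1.0,       # keep all Bengali characters
)
sp = spm.SentencePieceProcessor(model_file="bn_bpe.model")
print(sp.encode("আমি বাংলায় গান গাই", out_type=str))
```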

pdf bib
GRASP-ChoQ: Knowledge Graph-Based Retrieval Augmentation for Stance Detection in Political Texts with Chain-of-Questions Reasoning
Rasel Mahmud | Md. Abdur Rakib Mollah | Aninda Kumar Sharma | Omar Faruq Osama

Political stance detection in understudied socio-political contexts presents a persistent challenge for language models because dynamic contexts and indirect relationships between political entities complicate the accurate alignment of opinions. To address this, we introduce GRASP-ChoQ, an approach that combines structured knowledge graphs with chain-of-questions reasoning to break down interactions in political texts. We support this with BPDisC, a novel dataset of politically charged tweets from Bangladesh during and after the July 2024 protests, along with a knowledge graph that details political entities and events. By using the knowledge graph to provide context, GRASP-ChoQ moves away from making direct predictions and instead uses intermediate reasoning steps. Experiments indicate that our proposed method yields substantial improvements relative to baseline approaches. Notably, the DeepSeek R1 variant, when integrated with GRASP-ChoQ, achieved the highest performance, demonstrating a 40% higher F1 score over zero-shot detection. Overall, the proposed framework improves retrieval augmentation, facilitating adaptive analysis of political discussion in low-resource settings.

pdf bib
Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation
Khondoker Ittehadul Islam | Gabriele Sarti

Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models’ predictions, highlighting different trends across models and languages.

pdf bib
BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects
Jakir Hasan | Shubhashis Roy Dipta

Real-time speech assistants are becoming increasingly popular for ensuring improved accessibility to information. Bengali, being a low-resource language with a high regional dialectal diversity, has seen limited progress in developing such systems. Existing systems are not optimized for real-time use and focus only on standard Bengali. In this work, we present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows the client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. To address dialectal variation, we introduce a dialect-aware ASR system, BRDialect, developed by fine-tuning the IndicWav2Vec model on ten Bengali regional dialects. It outperforms the baseline ASR models by 12.41-33.98% on the RegSpeech12 dataset. Furthermore, BanglaTalk can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds. Low bandwidth usage and minimal end-to-end delay make the system both cost-effective and interactive for real-time use cases, enabling inclusive and accessible speech technology for the diverse community of Bengali speakers.

pdf bib
Read Between the Lines: A Benchmark for Uncovering Political Bias in Bangla News Articles
Nusrat Jahan Lia | Shubhashis Roy Dipta | Abdullah Khan Zehady | Naymul Islam | Madhusodan Chakraborty | Abdullah Al Wasif

Detecting media bias is crucial, particularly in the South Asian region. Despite this, annotated datasets and computational studies for Bangla political bias research remain scarce, largely because political stance detection in Bangla news requires an understanding of linguistic cues, cultural context, subtle biases, rhetorical strategies, code-switching, implicit sentiment, and socio-political background. To address this, we introduce the first benchmark dataset of 200 politically significant and highly debated Bangla news articles, labeled for government-leaning, government-critique, and neutral stances, alongside diagnostic analyses for evaluating large language models (LLMs). Our comprehensive evaluation of 28 proprietary and open-source LLMs shows strong performance in detecting government-critique content (F1 up to 0.83) but substantial difficulty with neutral articles (F1 as low as 0.00). Models also tend to over-predict government-leaning stances, often misinterpreting ambiguous narratives. This dataset and its associated diagnostics provide a foundation for advancing stance detection in Bangla media research and offer insights for improving LLM performance in low-resource languages.

pdf bib
BiCap: Bangla Image Captioning Using Attention-based Encoder-Decoder Architecture
Md Aminul Kader Bulbul

Automatic image captioning has gained significant attention at the intersection of computer vision and natural language processing, yet research in low-resource languages such as Bangla remains limited. This work introduces BiCap, an attention-based encoder–decoder framework designed for Bangla image captioning. The model leverages a pretrained ResNet-50 as the encoder to extract rich visual features and a Long Short-Term Memory (LSTM) network as the decoder to sequentially generate Bangla captions. To overcome the fixed-length bottleneck of traditional encoder–decoder architectures, we integrate Bahdanau attention, enabling the decoder to dynamically focus on salient image regions while producing each word. The model is trained and evaluated on the Chitron dataset, with extensive preprocessing including vocabulary construction, tokenization, and word embedding. Experimental results demonstrate that BiCap achieves superior performance over existing works (Masud et al., 2025; Hossain et al., 2024; Das et al., 2023; Humaira et al., 2021), yielding higher BLEU, METEOR, ROUGE, and CIDEr scores. Improved fluency in human evaluation further confirms that the model generates more contextually accurate and semantically coherent captions, although occasional challenges remain with complex scenes. Recent advances in Vision–Language Models (VLMs), such as CLIP, BLIP, Flamingo, LLaVA, and MiniGPT-4, have redefined state-of-the-art captioning performance in high-resource settings. However, these models require large multimodal corpora and extensive pretraining that are currently unavailable for Bangla. BiCap therefore offers a resource-efficient, interpretable, and practically deployable solution tailored to low-resource multimodal learning.

pdf bib
Zero-Shot Multi-Label Classification of Bangla Documents: Large Decoders Vs. Classic Encoders
Souvika Sarkar | Md Najib Hasan | Santu Karmaker

Bangla, a language spoken by over 300 million native speakers and ranked as the sixth most spoken language worldwide, presents unique challenges in natural language processing (NLP) due to its complex morphological characteristics and limited resources. Although recent large-decoder-based LLMs, such as GPT, LLaMA, and DeepSeek, have demonstrated excellent performance across many NLP tasks, their effectiveness in Bangla remains largely unexplored. In this paper, we establish the first benchmark comparing large decoder-based LLMs with classic encoder-based models for the Zero-Shot Multi-Label Classification (Zero-Shot-MLC) task in Bangla. Our evaluation of 32 state-of-the-art models reveals that existing so-called powerful encoders and decoders still struggle to achieve high accuracy on the Bangla Zero-Shot-MLC task, suggesting a need for more research and resources for Bangla NLP.

pdf bib
A Hybrid Transformer–Sequential Model for Depression Detection in Bangla–English Code-Mixed Text
Md Siddikul Imam Kawser | Jidan Al Abrar | Mehebub Bin Kabir | Md. Rayhan Chowdhury | Md Ataullah Bahari

Depression detection from social media text is critical for early mental health intervention, yet existing NLP systems underperform in low-resource, code-mixed settings. Bangla-English code-mixing, common across South Asian online communities, poses unique challenges due to irregular grammar, transliteration, and scarce labeled data. To address this gap, we introduce DepressiveText, a 7,019-sample dataset of Bangla-English social media posts annotated for depressive signals, with strong inter-annotator agreement (𝜅 = 0.84). We further propose a hybrid architecture that combines BanglishBERT embeddings with an LSTM classifier, enabling the model to capture both contextual and sequential cues. Comparative experiments with traditional ML, deep learning, and multilingual transformer baselines demonstrate that our approach achieves the highest performance, with an accuracy of 0.8889. We also employ LIME to enhance interpretability by identifying key lexical triggers. Our findings underscore the effectiveness of hybrid transformer–sequence models for low-resource code-mixed NLP and highlight their potential in real-world mental health applications.

pdf bib
BhasaBodh: Bridging Bangla Dialects and Romanized Forms through Machine Translation
Md. Tofael Ahmed Bhuiyan | Md. Abdur Rahman | Abdul Kadar Muhammad Masum

While machine translation has made significant strides for high-resource languages, many regional languages and their dialects, such as the Bangla variants Chittagong and Sylhet, remain underserved. Existing resources are often insufficient for robust sentence-level evaluation and overlook romanization, the widespread real-world practice of typing native languages in the Latin script for digital communication. To address these gaps, we introduce BhasaBodh, a comprehensive benchmark for Bangla dialectal machine translation. We construct and release a sentence-level parallel dataset for Chittagong and Sylhet dialects aligned with Standard Bangla and English, create a novel romanized version of the dialectal data to facilitate evaluation in realistic multi-script scenarios, and provide the first comprehensive performance baselines by fine-tuning two powerful multilingual models, NLLB-200 and mBART-50, on seven distinct translation tasks. Our experiments reveal that mBART-50 consistently outperforms NLLB-200 on most dialectal and romanized tasks, achieving a BLEU score as high as 87.44 on the Romanized-to-Standard Bangla normalization task. However, complex cross-lingual and cross-script translation remains a significant challenge. BhasaBodh lays the groundwork for future research in low-resource dialectal NLP, offering a valuable resource for developing more inclusive and practical translation systems.

pdf bib
CheckSent-BN: A Bengali Multi-Task Dataset for Claim Checkworthiness and Sentiment Classification from News Headlines
Pritam Pal | Dipankar Das

This paper presents **CheckSent-BN** (Claim **Check**worthiness and **Sen**timent Classification in **B**engali **N**ews Headline), a novel multi-task dataset in Bengali comprising approximately 11.5K news headlines annotated for two critical natural language processing (NLP) tasks: claim checkworthiness detection and sentiment classification. To address the lack of high-quality annotated resources in Bengali, we employ a cost-effective annotation strategy that utilizes three large language models (GPT-4o-mini, GPT-4.1-mini, and Llama-4), followed by majority voting and manual verification to ensure label consistency. We provide benchmark results using multilingual and Bengali-focused transformer models under both single-task and multi-task learning (MTL) frameworks. Experimental results show that IndicBERTv2, BanglaBERT, and mDeBERTa model-based frameworks outperform other multilingual models, with IndicBERTv2 achieving the best overall performance in the MTL setting. CheckSent-BN establishes the first comprehensive benchmark for joint claim checkworthiness and sentiment classification in Bengali news headlines, offering a valuable resource for advancing misinformation detection and sentiment-aware analysis in low-resource languages. The CheckSent-BN dataset is available at: https://github.com/pritampal98/check-sent-bn

pdf bib
Advancing Subjectivity Detection in Bengali News Articles Using Transformer Models with POS-Aware Features
Md Minhazul Kabir | Kawsar Ahmed | Mohammad Ashfak Habib | Mohammed Moshiul Hoque

Distinguishing fact from opinion in text is a nuanced but essential task, particularly in news articles where subjectivity can influence interpretation and reception. Identifying whether content is subjective or objective is critical for sentiment analysis, media bias detection, and content moderation. However, progress in this area has been limited for low-resource languages such as Bengali due to a lack of benchmark datasets and tools. To address these constraints, this work presents BeNSD (Bengali News Subjectivity Detection), a novel dataset of 8,655 Bengali news article texts, along with an enhanced transformer-based architecture (POS-Aware-MuRIL) that integrates parts-of-speech (POS) features with MuRIL embeddings at the input level to provide richer contextual representation for subjectivity detection. A range of baseline models is evaluated, and the proposed architecture achieves a macro F1-score of 93.35% in subjectivity detection for the Bengali language.

pdf bib
Gen-mABSA-T5: A Multilingual Zero-Shot Generative Framework for Aspect-Based Sentiment Analysis
Shabrina Akter Shahana | Nuzhat Nairy Afrin | Md Musfique Anwar | Israt Jahan

Aspect-Based Sentiment Analysis (ABSA) identifies sentiments toward specific aspects of an entity. While progress has been substantial for high-resource languages such as English, low-resource languages like Bangla remain underexplored due to limited annotated data and linguistic challenges. We propose Gen-mABSA-T5, a multilingual zero-shot generative framework for ABSA based on Flan-T5, incorporating prompt engineering and Natural Language Inference (NLI). Without task-specific training, Gen-mABSA-T5 achieves state-of-the-art zero-shot accuracy of 61.56% on the large Bangla corpus, 73.50% on SemEval Laptop, and 73.56% on SemEval Restaurant, outperforming both English and Bangla task-specific models in zero-shot settings. It delivers reasonable performance against very large general-purpose models on both English and Bangla benchmarks. These results highlight the effectiveness of generative, zero-shot approaches for ABSA in low-resource and multilingual settings.

pdf bib
A Comprehensive Text Optimization Approach to Bangla Summarization
Irtifa Haider | Shanjida Alam

The task of Bengali text optimization demands not only the generation of concise and coherent summaries but also grammatical accuracy, semantic appropriateness, and factual reliability. This study presents a dual-phase optimization framework for Bengali text summarization that integrates entity-preserving preprocessing and abstractive generation with mT5, followed by refinement through sentence ranking, entity consistency enforcement, and optimization with instruction-tuned LLMs such as mBART. Evaluations using ROUGE, BLEU, BERTScore, and human ratings of fluency, adequacy, coherence, and readability show consistent gains over baseline summarizers. By embedding grammatical and factual safeguards into the summarization pipeline, this study establishes a robust and scalable benchmark for Bengali NLP, advancing text optimization research. Our model achieves 0.54 ROUGE-1 and 0.88 BERTScore on BANSData, outperforming recent multilingual baselines.

pdf bib
BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge
Daeen Kabir | Minhajur Rahman Chowdhury Mahim | Sheikh Shafayat | Adnan Sadik | Arian Ahmed | Eunsu Kim | Alice Oh

In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college and job level examinations and spans 23 categories covering knowledge on Bangladesh’s culture and history and Bengali linguistics. We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs’ performance on Bengali cultural and linguistic contexts is still not comparable to that of mainstream languages like English, our results indicate Bengali’s status as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark that is centered around native Bengali culture, history, and linguistics.

pdf bib
BanHateME: Understanding Hate in Bangla Memes through Detection, Categorization, and Target Profiling
Md Ayon Mia | Md Fahim

Detecting hateful memes is a complex task due to the interplay of text and visuals, with subtle cultural cues often determining whether content is harmful. This challenge is amplified in Bangla, a low-resource language where existing resources provide only binary labels or single dimensions of hate. To bridge this gap, we introduce BanHateME, a comprehensive Bangla hateful meme dataset with hierarchical annotations across three levels: binary hate, hate categories, and targeted groups. The dataset comprises 3,819 culturally grounded memes, annotated with substantial inter-annotator agreement. We further propose a hierarchical loss function that balances predictions across levels, preventing bias toward binary detection at the expense of fine-grained classification. To assess performance, we pair pretrained language and vision models and systematically evaluate three multimodal fusion strategies: summation, concatenation, and co-attention, demonstrating the effectiveness of hierarchical learning and cross-modal alignment. Our work establishes BanHateME as a foundational resource for fine-grained multimodal hate detection in Bangla and contributes key insights for content moderation in low-resource settings.

pdf bib
P6Jiggasha: Benchmarking Large Language Models on Bangla Physics Question Answering with Cross-lingual Evaluation
S.m. Shahriar | Md Tahmid Hasan Fuad | Md Fahim | Md. Azad Hossain

Understanding scientific concepts in native languages is crucial for educational accessibility and knowledge transfer. In this work, we present a comprehensive evaluation of Large Language Models (LLMs) on Bangla physics questions, introducing P6Jiggasha, a novel dataset of 1,500 multiple-choice questions compiled from HSC physics textbooks, supplementary guides, admission preparation books, and past examination papers from various educational boards. We evaluate three state-of-the-art models—GPT-4.1, Gemini-2.5 Pro, and DeepSeek-R1-Distill-Llama-70B—on both native Bangla questions and their English translations. Our results reveal significant performance variations, with GPT-4.1 achieving 86.67% accuracy on Bangla questions in a single inference, while other models show substantial improvement through multiple inference attempts, with Gemini-2.5 Pro reaching 89.52% after four iterations. We introduce a Cumulative Accuracy@k metric to evaluate iterative reasoning capabilities and provide comprehensive analysis across six physics topics and six question types. Our error analysis reveals systematic cross-lingual inconsistencies where models produce contradictory answers for identical questions across languages. This study provides valuable insights into the capabilities and limitations of current LLMs for low-resource scientific question answering and establishes benchmarks for future research in Bangla natural language processing.

pdf bib
LP-FT-LoRA: A Three-Stage PEFT Framework for Efficient Domain Adaptation in Bangla NLP Tasks
Tasnimul Hossain Tomal | Anam Borhan Uddin | Intesar Tahmid | Mir Sazzat Hossain | Md Fahim | Md Farhad Alam Bhuiyan

Adapting large pre-trained language models (LLMs) to downstream tasks typically requires fine-tuning, but fully updating all parameters is computationally prohibitive. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) reduce this cost by updating a small subset of parameters. However, the standard approach of jointly training LoRA adapters and a new classifier head from a cold start can lead to training instability, as the classifier chases shifting feature representations. To address this, we propose LP-FT-LoRA, a novel three-stage training framework that decouples head alignment from representation learning to enhance stability and performance. Our framework first aligns the classifier head with the frozen backbone via linear probing, then trains only the LoRA adapters to learn task-specific features, and finally performs a brief joint refinement of the head and adapters. We conduct extensive experiments on five Bangla NLP benchmarks across four open-weight compact transformer models. The results demonstrate that LP-FT-LoRA consistently outperforms standard LoRA fine-tuning and other baselines, achieving state-of-the-art average performance and showing improved generalization on out-of-distribution datasets.

pdf bib
Human–LLM Benchmarks for Bangla Dialect Translation: Sylheti and Chittagonian on the BanglaCHQ-Summ Corpus
Nowshin Mahjabin | Ahmed Shafin Ruhan | Mehreen Chowdhury | Md Fahim | MD Azam Hossain

Millions in Bangladesh speak Sylheti and Chittagonian (Chatgaiyya) dialects, yet most public health guidance exists only in Standard Bangla, which creates barriers and safety risks. Ad-hoc translation further harms comprehension, while challenges such as scarce data, non-standard spelling, medical terms, numerals, and idioms make accurate translation difficult. We present BanglaCHQ-Prantik, the first benchmark for this setting, extending BanglaCHQ-Summ with human gold references from 17 native translators. We evaluate Qwen 2.5 3B, Gemma 3 1B, GPT-4o mini, and Gemini 2.5 Flash under zero-shot, one-shot, five-shot, and chain-of-thought prompts, using BLEU, ROUGE-1/2/L, and METEOR. Closed-source models (GPT-4o, Gemini 2.5) lead overall, with Gemini 2.5 Flash being strongest. Few-shot prompting helps especially for Sylheti, though errors persist with terminology, numerals, and idioms. The dataset is designed to support both NLP research and public health communication by enabling reliable translation across regional Bangla dialects. To our knowledge, this is the first medical-domain dataset for Sylheti/Chittagonian.

pdf bib
BanHate: An Up-to-Date and Fine-Grained Bangla Hate Speech Dataset
Faisal Hossain Raquib | Akm Moshiur Rahman Mazumder | Md Tahmid Hasan Fuad | Md Farhan Ishmam | Md Fahim

Online safety in low-resource languages relies on effective hate speech detection, yet Bangla remains critically underexplored. Existing resources focus narrowly on binary classification and fail to capture the evolving, implicit nature of online hate. To address this, we introduce BanHate, a large-scale Bangla hate speech dataset, comprising 19,203 YouTube comments collected between April 2024 and June 2025. Each comment is annotated for binary hate labels, seven fine-grained categories, and seven target groups, reflecting diverse forms of abuse in contemporary Bangla discourse. We develop a tailored pipeline for data collection, filtering, and annotation with majority voting to ensure reliability. To benchmark BanHate, we evaluate a diverse set of open- and closed-source large language models under prompting and LoRA fine-tuning. We find that LoRA substantially improves open-source models, while closed-source models, such as GPT-4o and Gemini, achieve strong performance in binary hate classification, but face challenges in detecting implicit and fine-grained hate. BanHate sets a new benchmark for Bangla hate speech research, providing a foundation for safer moderation in low-resource languages. Our dataset is available at: https://huggingface.co/datasets/aplycaebous/BanHate.

pdf bib
BOIGENRE: A Large-Scale Bangla Dataset for Genre Classification from Book Summaries
Rafi Hassan Chowdhury | Rahanuma Ryaan Ferdous

The classification of literary genres plays a vital role in digital humanities and natural language processing (NLP), supporting tasks such as content organization, recommendation, and linguistic analysis. However, progress for the Bangla language remains limited due to the lack of large, structured datasets. To address this gap, we present BOIGENRE, the first large-scale dataset for Bangla book genre classification, built from publicly available summaries. The dataset contains 25,951 unique samples across 16 genres, showcasing diversity in narrative style, vocabulary, and linguistic expression. We provide statistical insights into text length, lexical richness, and cross-genre vocabulary overlap. To establish benchmarks, we evaluate traditional machine learning, neural, and transformer-based models. Results show that while unigram-based classifiers perform reasonably, transformer models, particularly BanglaBERT, achieve the highest F1-score of 69.62%. By releasing BOIGENRE and baseline results, we offer a valuable resource and foundation for future research in Bangla text classification and low-resource NLP.

pdf bib
ChakmaBridge: A Five-Way Parallel Corpus for Navigating the Script Divide in an Endangered Language
Md. Abdur Rahman | Md. Tofael Ahmed Bhuiyan | Abdul Kadar Muhammad Masum

The advancement of NLP technologies for low-resource and endangered languages is critically hindered by the scarcity of high-quality, parallel corpora. This is particularly true for languages like Chakma, which also faces the challenge of prevalent non-standard, romanized script usage in digital communication. To address this, we introduce ChakmaBridge, the first five-way parallel corpus for Chakma, containing 807 sentences aligned across English, Standard Bangla, Bengali-script Chakma, Romanized Bangla, and Romanized Chakma. Our dataset is created by augmenting the MELD corpus with LLM-generated romanizations that are rigorously validated by native speakers. We establish robust machine translation baselines across six diverse language and script pairs. Our experiments reveal that a multilingual training approach, combining English and Bangla as source languages, yields a dramatic performance increase, achieving a BLEU score of 0.5228 for Chakma translation, a 124% relative improvement over the best bilingual model. We release ChakmaBridge to facilitate research in low-resource MT and aid in the digital preservation of this endangered language.

pdf bib
A Comparative Analysis of Retrieval-Augmented Generation Techniques for Bengali Standard-to-Dialect Machine Translation Using LLMs
K. M. Jubair Sami | Dipto Sumit | Ariyan Hossain | Farig Sadeque

Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local_dialect:standard_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76% to 55% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to outperform much larger models (e.g., GPT-OSS-120B), demonstrating that a well-designed retrieval strategy can be more crucial than model size. This work contributes an effective, fine-tuning-free solution for low-resource dialect translation, offering a practical blueprint for preserving linguistic diversity.

pdf bib
Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language
Adity Khisa | Nusrat Jahan Lia | Tasnim Mahfuz Nafis | Zarif Masud | Tanzir Pial | Shebuti Rayana | Ahmedul Kabir

As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Using this dataset, we fine-tune six encoder-based transformer models, including multilingual (mBERT, XLM-RoBERTa, DistilBERT), regional (BanglaBERT, IndicBERT), and monolingual English (DeBERTaV3) variants on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning for the Chakma language, and we release our dataset to encourage further research on multilingual language modeling for low-resource languages.

pdf bib
LLMs for Low-Resource Dialect Translation Using Context-Aware Prompting: A Case Study on Sylheti
Tabia Tanzin Prama

Large Language Models (LLMs) have demonstrated strong translation abilities through prompting, even without task-specific training. However, their effectiveness in dialectal and low-resource contexts remains underexplored. This study presents the first systematic investigation of LLM-based Machine Translation (MT) for Sylheti, a dialect of Bangla that is itself low-resource. We evaluate five advanced LLMs (GPT-4.1, GPT-4.1-mini, LLaMA 4, Grok 3, and Deepseek V3.2) across both translation directions (Bangla ↔ Sylheti), and find that these models struggle with dialect-specific vocabulary. To address this, we introduce Sylheti-CAP (Context-Aware Prompting), a three-step framework that embeds a linguistic rulebook, dictionary (core vocabulary and idioms), and authenticity check directly into prompts. Extensive experiments show that Sylheti-CAP consistently improves translation quality across models and prompting strategies. Both automatic metrics and human evaluations confirm its effectiveness, while qualitative analysis reveals notable reductions in hallucinations, ambiguities, and awkward phrasing—establishing Sylheti-CAP as a scalable solution for dialectal and low-resource MT.

pdf bib
Clustering LLM-based Word Embeddings to Determine Topics from Bangla Articles
Rifat Rahman

Topic modeling methods identify fundamental themes within textual documents, facilitating an understanding of the insights inside them. Traditional topic modeling approaches are based on the generative probabilistic process that assumes the document-topic and topic-word distribution. Hence, those approaches fail to capture semantic similarities among words inside the documents and are less scalable with the vast number of topics and documents. This paper presents a method for capturing topics from Bangla documents by clustering the word vectors induced from LLMs. Corpus statistics are integrated into the clustering and word-reordering process within each cluster or topic to extract the top words. Additionally, we deploy dimensionality reduction techniques, such as PCA, prior to clustering. Finally, we perform a comparative study and identify the best-performing combination of clustering and word embedding methods. Our top-performing combination outperforms the traditional probabilistic topic model in capturing topics and top words per topic, and excels notably in terms of computational efficiency and time complexity.

pdf bib
Benchmarking Large Language Models on Bangla Dialect Translation and Dialectal Sentiment Analysis
Md Mahir Jawad | Rafid Ahmed | Ishita Sur Apan | Tasnimul Hossain Tomal | Fabiha Haider | Mir Sazzat Hossain | Md Farhad Alam Bhuiyan

We present a novel Bangla Dialect Dataset comprising 600 annotated instances across four major dialects: Chattogram, Barishal, Sylhet, and Noakhali. The dataset was constructed from YouTube comments spanning diverse domains to capture authentic dialectal variations in informal online communication. Each instance includes the original dialectal text, its standard Bangla translation, and sentiment labels (Positive and Negative). We benchmark several state-of-the-art large language models on dialect-to-standard translation and sentiment analysis tasks using zero-shot and few-shot prompting strategies. Our experiments reveal that transliteration significantly improves translation quality for closed-source models, with GPT-4o-mini achieving the highest BLEU score of 0.343 in zero-shot with transliteration. For sentiment analysis, GPT-4o-mini demonstrates perfect precision, recall, and F1 scores (1.000) in few-shot settings. This dataset addresses the critical gap in resources for low-resource Bangla dialects and provides a foundation for developing dialect-aware NLP systems.

pdf bib
Robustness of LLMs to Transliteration Perturbations in Bangla
Fabiha Haider | Md Farhan Ishmam | Fariha Tanjim Shifat | Md Tasmim Rahman Adib | Md Fahim | Md Farhad Alam Bhuiyan

Bangla text on the internet often appears in mixed scripts that combine native Bangla characters with their Romanized transliterations. To ensure practical usability, language models should be robust to naturally occurring script mixing. Our work investigates the robustness of current LLMs and Bangla language models under various transliteration-based textual perturbations, i.e., we augment portions of existing Bangla datasets using transliteration. Specifically, we replace words and sentences with their transliterated text to emulate realistic script mixing, and similarly, replace the top k salient words to emulate adversarial script mixing. Our experiments reveal interesting behavioral insights and robustness vulnerabilities in language models for Bangla, which can be crucial for deploying such models in real-world scenarios and enhancing their overall robustness.

pdf bib
Generative Data Augmentation for Improving Semantic Classification
Shadman Rohan | Mahmud Elahi Akhter | Ibraheem Muhammad Moosa | Nabeel Mohammed | Amin Ahsan Ali | Akmmahbubur Rahman

We study sentence-level generative data augmentation for Bangla semantic classification across four public datasets and three pretrained model families (BanglaBERT, XLM-Indic, mBERT). We evaluate two widely used, reproducible techniques—paraphrasing (mT5-based) and round-trip backtranslation (Bn–En–Bn)—and analyze their impact under realistic class imbalance. Overall, augmentation often helps, but gains are tightly coupled to label quality: paraphrasing typically outperforms backtranslation and yields the most consistent improvements for the monolingual model, whereas multilingual encoders benefit less and can be more sensitive to noisy minority-class expansions. A key empirical observation is that the neutral class appears to be a major source of annotation noise, which degrades decision boundaries and can cap the benefits of augmentation even when positive/negative classes are clean and polarized. We provide practical guidance for Bangla sentiment pipelines: (i) use simple sentence-level augmentation to rebalance classes when labels are reliable; (ii) allocate additional curation and higher inter-annotator agreement targets to the neutral class. Our results indicate when augmentation helps and suggest that data quality—not model choice alone—can become the limiting factor.

pdf bib
bnContextQA: Benchmarking Long-Context Question Answering and Challenges in Bangla
Adnan Ahmad | Labiba Adiba | Namirah Rasul | Md Tahmid Rahman Laskar | Sabbir Ahmed

Large models have advanced in processing long input sequences, but their ability to consistently use information across extended contexts remains a challenge. Recent studies highlight a positional bias where models prioritize information at the beginning or end of the input while neglecting the middle, resulting in a U-shaped performance curve; however, these findings have been limited to English. Whether this bias is universal or shaped by language-specific factors remains unclear. In this work, we investigate positional bias in Bangla, a widely spoken but computationally underrepresented language. To support this, we introduce a novel Bangla benchmark dataset, bnContextQA, specifically designed for long-context comprehension. The dataset comprises 350 long-context QA instances, each paired with 30 context paragraphs, allowing controlled evaluation of information retrieval at different positions. Using this dataset, we assess the performance of LLMs on Bangla across varying passage positions, providing insights into cross-linguistic positional effects. The bnContextQA dataset is publicly available at https://github.com/labiba02/bnContextQA.git to support future research on long-context understanding in Bangla and multilingual LLMs.

pdf bib
Form-aware Poetic Generation for Bangla
Amina | Abdullah | Mueeze Al Mushabbir | Sabbir Ahmed

Poetry generation in low-resource languages such as Bangla is particularly challenging due to the scarcity of structured poetic corpora and the complexity of its metrical system (matra). We present a structure-aware framework for Bangla poetry generation using pretrained Bangla large language models (LLMs), namely TigerLLM, TituLLM, and BanglaT5, trained on general non-poetic text corpora augmented with rich structural control tokens. These tokens capture rhyme, meter, word count, and line boundaries, enabling unsupervised modeling of poetic form without curated poetry datasets. Unlike prior fixed-pattern approaches, our framework introduces variable control compositions, allowing models to generate flexible poetic structures. Experiments show that explicit structural conditioning improves rhyme consistency and metrical balance while maintaining semantic coherence. Our study provides the first systematic evaluation of Bangla LLMs for form-constrained creative generation, offering insights into structural representation in low-resource poetic modeling.

pdf bib
Overview of BLP-2025 Task 2: Code Generation in Bangla
Nishat Raihan | Mohammad Anas Jawad | Md Mezbaur Rahman | Noshin Ulfat | Pranav Gupta | Mehrab Mustafy Rahman | Santu Karmaker | Marcos Zampieri

This paper presents an overview of the BLP 2025 shared task Code Generation in Bangla, organized with the BLP workshop co-located with AACL. The task evaluates Generative AI systems capable of generating executable Python code from natural language prompts written in Bangla. This is the first shared task to address Bangla code generation. It attracted 152 participants across 63 teams, yielding 488 submissions, with 15 system-description papers. Participating teams employed both proprietary and open-source LLMs, with prevalent strategies including prompt engineering, fine-tuning, and machine translation. The top Pass@1 reached 0.99 on the development phase and 0.95 on the test phase. In this report, we detail the task design, data, and evaluation protocol, and synthesize methodological trends observed across submissions. Notably, we observe that the high performance is not based on single models; rather, it results from pipelines combining multiple AI tools and/or methods.

pdf bib
Overview of BLP-2025 Task 1: Bangla Hate Speech Identification
Md Arid Hasan | Firoj Alam | Md Fahad Hossain | Usman Naseem | Syed Ishtiaque Ahmed

Online discourse in Bangla is rife with nuanced toxicity expressed through code-mixing, dialectal variation, and euphemism. Effective moderation thus requires fine-grained detection of hate type, target, and severity, rather than a binary label. To address this, we organized the Bangla Hate Speech Identification Shared Task at the BLP 2025 workshop, co-located with IJCNLP-AACL 2025, comprising three subtasks: (1A) hate-type detection, (1B) hate-target detection, and (1C) joint prediction of type, target, and severity in a multi-task setup. The subtasks attracted 161, 103, and 90 participants, with 36, 23, and 20 final submissions, respectively, while a total of 19 teams submitted system description papers. The submitted systems employed a wide range of approaches, ranging from classical machine learning to fine-tuned pretrained models and zero-/few-shot LLMs. We describe the task setup, datasets, and evaluation framework, and summarize participant systems. All datasets and evaluation scripts are publicly released.

pdf bib
Bahash-AI at BLP-2025 Task 1: Bangla Hate Speech Detection using Data Augmentation and Pre-trained Model
Sahinur Rahman Laskar | Bishwaraj Paul

In recent times, internet users are frequently exposed to hate speech on social media platforms, which has long-lasting negative impacts on their mental wellbeing and radicalizes society into an environment of fear and distrust. Many methods have been developed to detect and stop the propagation of hate speech; however, annotated data for hate speech in the Bengali language remains limited. In this work, we used a pretrained BanglaBERT model on an extended training dataset synthesized via data augmentation techniques. Our team, Bahash-AI, placed 20th, 20th, and 17th in subtasks 1A, 1B, and 1C of the Bangla Multi-task Hate Speech Identification Shared Task at the BLP Workshop, out of 37, 24, and 21 participating teams, with F1 scores of 0.7028, 0.6954, and 0.6969, respectively.

pdf bib
Gradient Masters at BLP-2025 Task 1: Advancing Low-Resource NLP for Bengali using Ensemble-Based Adversarial Training for Hate Speech Detection
Syed Mohaiminul Hoque | Naimur Rahman | Md Sakhawat Hossain

This paper introduces the approach of “Gradient Masters” for BLP-2025 Task 1: “Bangla Multitask Hate Speech Identification Shared Task”. We present an ensemble-based fine-tuning strategy for addressing subtasks 1A (hate-type classification) and 1B (target group classification) in YouTube comments. We propose a hybrid approach on a Bangla Language Model, which outperformed the baseline models and secured the 6th position in subtask 1A with a micro F1 score of 73.23% and the third position in subtask 1B with 73.28%. We conducted extensive experiments that evaluated the robustness of the model throughout the development and evaluation phases, including comparisons with other Language Model variants, to measure generalization in low-resource Bangla hate speech scenarios and data set coverage. In addition, we provide a detailed analysis of our findings, exploring misclassification patterns in the detection of hate speech.

pdf bib
PromptGuard at BLP-2025 Task 1: A Few-Shot Classification Framework Using Majority Voting and Keyword Similarity for Bengali Hate Speech Detection
Rakib Hossan | Shubhashis Roy Dipta

The BLP-2025 Task 1A requires Bengali hate speech classification into six categories. Traditional supervised approaches need extensive labeled datasets that are expensive for low-resource languages. We developed PromptGuard, a few-shot framework combining chi-square statistical analysis for keyword extraction with adaptive majority voting for decision-making. We explore statistical keyword selection versus random approaches and adaptive voting mechanisms that extend classification based on consensus quality. Chi-square keywords provide consistent improvements across categories, while adaptive voting benefits ambiguous cases requiring extended classification rounds. PromptGuard achieves a micro-F1 of 67.61, outperforming n-gram baselines (60.75) and random approaches (14.65). Ablation studies confirm chi-square–based keywords show the most consistent impact across all categories.

pdf bib
BElite at BLP-2025 Task 1: Leveraging Ensemble for Multi Task Hate Speech Detection in Bangla
Zannatul Fardaush Tripty | Ibnul Mohammad Adib | Nafiz Fahad | Muhammad Tanjib Hussain | Md Kishor Morol

The widespread use of the internet has made sharing information on social media more convenient. At the same time, it provides a platform for individuals with malicious intent to easily spread hateful content. Since many users prefer to communicate in their native language, detecting hate speech in Bengali poses a significant challenge. This study aims to identify Bengali hate speech on social media platforms. A shared task on Bengali hate speech detection was organized by the Second Bangla Language Processing Workshop (BLP). To tackle this task, we implemented five traditional machine learning models (LR, SVM, RF, NB, XGB), three deep learning models (CNN, BiLSTM, CNN+BiLSTM), and three transformer-based models (Bangla-BERT, m-BERT, XLM-R). Among all models, a weighted ensemble of transformer models achieved the best performance. Our approach ranked 3rd in Subtask 1A with a micro-F1 score of 0.734, 6th in Subtask 1B with 0.7315, and, after post-competition experiments, 4th in Subtask 1C with 0.735.

pdf bib
Computational Story Lab at BLP-2025 Task 1: HateSense: A Multi-Task Learning Framework for Comprehensive Hate Speech Identification using LLMs
Tabia Tanzin Prama | Christopher M. Danforth | Peter Dodds

This paper describes HateSense, our multi-task learning framework for the BLP 2025 shared task 1 on Bangla hate speech identification. The task requires not only detecting hate speech but also classifying its type, target, and severity. HateSense integrates binary and multi-label classifiers using both encoder- and decoder-based large language models (LLMs). We experimented with pre-trained encoder models (BERT-based models), and decoder models like GPT-4.0, LLaMA 3.1 8B, and Gemma-2 9B. To address challenges such as class imbalance and the linguistic complexity of Bangla, we employed techniques like focal loss and odds ratio preference optimization (ORPO). Experimental results demonstrated that the pre-trained encoders (BanglaBERT) achieved state-of-the-art performance. Among different prompting strategies, chain-of-thought (CoT) combined with few-shot prompting proved most effective. Following the HateSense framework, our system attained competitive micro-F1 scores: 0.741 (Task 1A), 0.724 (Task 1B), and 0.7233 (Task 1C). These findings affirm the effectiveness of transformer-based architectures for Bangla hate speech detection and suggest promising avenues for multi-task learning in low-resource languages.

pdf bib
CUET-NLP_Zenith at BLP-2025 Task 1: A Multi-Task Ensemble Approach for Detecting Hate Speech in Bengali YouTube Comments
Md. Refaj Hossan | Kawsar Ahmed | Mohammed Moshiul Hoque

Hate speech on social media platforms, particularly in low-resource languages like Bengali, poses a significant challenge due to its nuanced nature and the need to understand its type, severity, and targeted group. To address this, the Bangla Multi-task Hate Speech Identification Shared Task at BLP 2025 adopts a multi-task learning framework that requires systems to classify Bangla YouTube comments across three subtasks simultaneously: type of hate, severity, and targeted group. To tackle these challenges, this work presents BanTriX, a transformer ensemble method that leverages BanglaBERT-I, XLM-R, and BanglaBERT-II. Evaluation results show that the BanTriX, optimized with cross-entropy loss, achieves the highest weighted micro F1-score of 73.78% in Subtask 1C, securing our team 2nd place in the shared task.

pdf bib
TeamHateMate at BLP Task1: Divide and Conquer: A Two-Stage Cascaded Framework with K-Fold Ensembling for Multi-Label Bangla Hate Speech Classification
Mahbub Islam Mahim | Mehedi Hasan

Detecting hate speech on social media is essential for safeguarding online communities, yet it remains challenging for low-resource languages like Bangla due to class imbalance and subjective annotations. We introduce a two-stage cascaded framework with k-fold ensembling to address the BLP Workshop 2025 Shared Task’s three subtasks: 1A (hate type classification), 1B (target identification), and 1C (joint classification of type, target, and severity). Our solution balances precision and recall, achieving micro-F1 scores of 0.7331 on 1A, 0.7356 on 1B, and 0.7392 on 1C, ranking 4th on 1A and 1st on both 1B and 1C. It performs strongly on major classes, although underrepresented labels such as sexism and mild severity remain challenging. Our method makes the optimal use of limited data through k-fold ensembling and delivers overall balanced performance across majority and minority classes by mitigating class imbalance via cascaded layers.

pdf bib
NSU_MILab at BLP-2025 Task 1: Decoding Bangla Hate Speech: Fine-Grained Type and Target Detection via Transformer Ensembles
Md. Mohibur Rahman Nabil | Muhammad Rafsan Kabir | Rakib Islam | Fuad Rahman | Nabeel Mohammed | Shafin Rahman

This paper describes our participation in Task 1A and Task 1B of the BLP Workshop, focused on Bangla Multi-task Hatespeech Identification. Our approach involves systematic evaluation of four transformer models: BanglaBERT, XLM-RoBERTa, IndicBERT, and Bengali Abusive MuRIL. To enhance performance, we implemented an ensemble strategy that averages output probabilities from these transformer models, which consistently outperformed individual models across both tasks. The baseline classical methods demonstrated limitations in capturing complex linguistic cues, underscoring the superiority of transformer-based approaches for low-resource hate speech detection. Our solution initially achieved F1 scores of 0.7235 (ranked 12th) for Task 1A and 0.6981 (ranked 17th) for Task 1B among participating teams. Through post-competition refinements, we improved our Task 1B performance to 0.7331, demonstrating the effectiveness of ensemble methods in Bangla hate speech detection.

pdf bib
Catalyst at BLP-2025 Task 1: Transformer Ensembles and Multi-task Learning Approaches for Bangla Hate Speech Detection
Nahid Hasan

We present a compact, cost-efficient system for the BLP-2025 Bangla Multi-task Hate Speech Identification Task 1, which requires fine-grained predictions across three dimensions: type, target, and severity. Our method pairs strong multilingual transformer encoders with two lightweight strategies: task-appropriate ensembling to stabilize decisions across seeds and backbones; and a multi-task head that shares representations while tailoring outputs to each subtask. As Catalyst, we ranked 7th on Subtask 1A with micro-F1 73.05, 8th on Subtask 1B with 72.79, and 10th on Subtask 1C with 72.40. Despite minimal engineering, careful model selection and straightforward combination rules yield competitive performance and more reliable behavior on minority labels. Ablations show consistent robustness gains from ensembling, while the multi-task head reduces cross-dimension inconsistencies. Error analysis highlights persistent challenges with code-mixed slang, implicit hate, and target ambiguity, motivating domain-adaptive pretraining and improved normalization.

pdf bib
Heisenberg at BLP-2025 Task 1: Bangla Hate Speech Classification using Pretrained Language Models and Data Augmentation
Samin Yasir

Detecting hate speech in Bangla is challenging due to its complex vocabulary, spelling variations, and region-specific word usage. However, effective detection is essential to ensure safer social media spaces and to take appropriate action against perpetrators. In this study, we report our participation in Subtask A of Task 1: Bangla Hate Speech Detection (Hasan et al., 2025b). In addition to the provided 50K Bangla comments (Hasan et al., 2025a), we collected approximately 4K Bangla comments and employed several data augmentation techniques. We evaluated several transformer-based models (e.g., BanglaBERT, BanglaT5, BanglaHateBERT), achieving the best performance with a micro-F1 score of 71% and securing 18th place in the Evaluation Phase.

pdf bib
Retriv at BLP-2025 Task 1: A Transformer Ensemble and Multi-Task Learning Approach for Bangla Hate Speech Identification
Sourav Saha | K M Nafi Asib | Mohammed Moshiul Hoque

This paper addresses the problem of Bangla hate speech identification, a socially impactful yet linguistically challenging task. As part of the Bangla Multi-task Hate Speech Identification shared task at the BLP Workshop, IJCNLP-AACL 2025, we participated in all three subtasks: (1A) hate type classification, (1B) target group identification, and (1C) joint detection of type, severity, and target. For subtasks 1A and 1B, we employed a soft-voting ensemble of transformer models (BanglaBERT, MuRIL, IndicBERTv2). For subtask 1C, we trained three multitask variants and aggregated their predictions through a weighted voting ensemble. Our systems achieved micro-F1 scores of 72.75% (1A) and 72.69% (1B), and a weighted micro-F1 score of 72.62% (1C). On the shared task leaderboard, these corresponded to 9th, 10th, and 7th positions, respectively. These results highlight the promise of transformer ensembles and weighted multitask frameworks for advancing Bangla hate speech detection in low-resource contexts. We have made our experimental scripts publicly available to the community.

pdf bib
CoU-CU-DSG at BLP-2025 Task 1: Leveraging Weighted Probabilistic Fusion of Language Models for Bangla Hate Speech Detection
Ashraful Alam | Abdul Aziz | Abu Nowshed Chy

The upsurge of social media and open source platforms has created new avenues for the rapid, global spread of negativity and obscenities targeting individuals and organizations. Identifying hate speech is challenging due to lexical and regional variation as well as the morphological complexity of the texts, especially in low-resource languages such as Bangla. This paper presents our participation in the Hate Speech Detection task at the second workshop on Bangla Language Processing. The objective of this task is not only to detect whether the content is hateful, but also to identify the type of hate, the target group, and its severity. We proposed a Transformer-based weighted probabilistic fusion model to detect the presence of hate speech in Bangla texts. We independently fine-tuned three pre-trained Transformer models, BanglaBERT, XLM-RoBERTa, and MuRIL, to capture diverse linguistic representations. The probability distributions obtained from each model were combined using a weighted fusion strategy, allowing the system to leverage the strengths of all models simultaneously. This fused representation was then used to predict the final labels for the given instances. The experimental results showed that our proposed method obtained competitive performance, ranking 10th in subtask 1A and 15th in subtask 1B among the participants.

pdf bib
PerceptionLab at BLP-2025 Task 1: Domain-Adapted BERT for Bangla Hate Speech Detection: Contrasting Single-Shot and Hierarchical Multiclass Classification
Tamjid Hasan Fahim | Kaif Ahmed Khan

This paper presents PerceptionLab’s approach for the BLP-2025 Shared Task 1A on multiclass Bangla hate speech detection, addressing severe class imbalance and informal online discourse. We perform Domain-Adaptive Pretraining (DAPT) on BERT models using a curated corpus of over 315,000 social media comments to capture slang, non-standard spellings, and contextual nuances of online discourse. To enrich underrepresented categories, we align external resources and construct a novel Bangla sexism dataset of over 6,800 comments via weak supervision and manual verification. Two classification strategies are compared: a single-shot six-way classifier and a two-stage hierarchical model that first separates Hate from Non-hate before fine-grained categorization. Experimental results show that single-shot classification with DAPT-enhanced BUET-BERT achieves the highest micro-F1 score (0.7265), outperforming the hierarchical approach and benchmarked general-purpose Large Language Models. Error analysis reveals persistent challenges in detecting subtle sexism and context-dependent religious hate. Our findings highlight the value of domain adaptation, robust end-to-end modeling, and targeted dataset construction for improving fine-grained hate speech detection in low-resource settings.

pdf bib
SyntaxMind at BLP-2025 Task 1: Leveraging Attention Fusion of CNN and GRU for Hate Speech Detection
Md. Shihab Uddin Riad

This paper describes our system used in the BLP-2025 Task 1: Hate Speech Detection. We participated in Subtask 1A and Subtask 1B, addressing hate speech classification in Bangla text. Our approach employs a unified architecture that integrates BanglaBERT embeddings with multiple parallel processing branches based on GRUs and CNNs, followed by attention and dense layers for final classification. The model is designed to capture both contextual semantics and local linguistic cues, enabling robust performance across subtasks. The proposed system demonstrated high competitiveness, obtaining a 0.7345 micro F1-score (2nd place) in Subtask 1A and a 0.7317 micro F1-score (5th place) in Subtask 1B.

pdf bib
Code_Gen at BLP-2025 Task 1: Enhancing Bangla Hate Speech Detection with Transformers through Token-Aware Adversarial Contrastive Training and Layer-wise Learning Rate Decay
Shifat Islam | Abhishek Agarwala | Emon Ghosh

Bangla social media contains several types of hate speech and slurs, but automatic detection is challenging due to linguistic complexity, data imbalance, and limited resources. We address this challenge in the BLP-2025 shared task by combining Token-Aware Adversarial Contrastive Training (TACT) with Layer-wise Learning Rate Decay (LLRD) to fine-tune transformer models like BanglaBERT, MuRIL, mE5-base and Twitter XLM-R. To capture the complementary strengths of each model, we aggregate the model outputs through logits ensembling and obtain a robust system for multiclass classification. On the official test set, our model achieved F1 scores of 0.7362 for hate type, 0.7335 for severity, and 0.7361 for target, placing it 1st, 2nd, and 3rd, respectively. The findings indicate that adversarial fine-tuning with logits ensemble learning is a robust way to detect hate speech in resource-limited languages and provides valuable insights for multilingual and low-resource NLP research.
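
As a rough illustration of the layer-wise learning rate decay (LLRD) component, the sketch below builds optimizer parameter groups whose learning rates shrink geometrically toward the lower encoder layers; the attribute paths (`model.bert.*`), base learning rate, and decay factor are assumptions for a generic BERT-style classifier, not the authors' exact settings.

```python
# LLRD sketch: the classification head and top layers train with the base LR,
# while lower layers and the embeddings receive geometrically smaller LRs.
import torch

def llrd_param_groups(model, base_lr=2e-5, decay=0.9, num_layers=12):
    groups = [{"params": model.classifier.parameters(), "lr": base_lr}]
    for i in range(num_layers - 1, -1, -1):  # iterate from the top layer down
        layer_lr = base_lr * (decay ** (num_layers - 1 - i))
        groups.append({"params": model.bert.encoder.layer[i].parameters(), "lr": layer_lr})
    groups.append({"params": model.bert.embeddings.parameters(), "lr": base_lr * (decay ** num_layers)})
    return groups

# Usage (hypothetical): optimizer = torch.optim.AdamW(llrd_param_groups(model), weight_decay=0.01)
```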

pdf bib
CUET_Sntx_Srfrs at BLP-2025 Task 1: Combining Hierarchical Classification and Ensemble Learning for Bengali Hate Speech Detection
Hafsa Hoque Tripty | Laiba Tabassum | Hasan Mesbaul Ali Taher | Kawsar Ahmed | Mohammed Moshiul Hoque

Detecting hate speech in Bengali social media content presents considerable challenges, primarily due to the prevalence of informal language and the limited availability of annotated datasets. This study investigates the identification of hate speech in Bengali YouTube comments, focusing on classifying the type, severity, and target group. Multiple machine learning baselines and voting ensemble techniques are evaluated to address these tasks. The methodology involves text preprocessing, feature extraction using TF-IDF and Count vectors, and aggregating predictions from several models. Hierarchical classification with TF-IDF features and majority voting improves the detection of less frequent hate speech categories while maintaining robust overall performance, resulting in an 18th-place ranking and a micro F1 score of 68.42%. Furthermore, ablation studies assess the impact of preprocessing steps and n-gram selection, providing reproducible baselines for Bengali hate speech detection. All code and resources are publicly available at https://github.com/Hasan-Mesbaul-Ali-Taher/BLP_25_Task_1

pdf bib
Velora at BLP-2025 Task 1: Multi-Method Evaluation for Hate Speech Classification in Bangla Text
Sad Yeamin Sayem | Sabira Rahman

Hate speech detection in Bangla is challenging due to complex morphology, frequent code mixing, and severe class imbalance across categories such as abuse, sexism, religious and political hate, profanity, and neutrality. The BLP Workshop 2025 Subtask 1A addressed this by classifying Bangla YouTube comments into these categories to support online moderation in low-resource settings. We developed a BanglaBERT-based system with balanced data augmentation and advanced regularization techniques, combined with optimized training strategies for better generalization. On the blind test set, our system achieved a micro F1 score of 0.7013, ranking 21st on the leaderboard. These results indicate that augmentation, robust loss functions, and model refinements can enhance Bangla hate speech detection, though implicit and context-dependent hate speech remains difficult.

pdf bib
PentaML at BLP-2025 Task 1: Linear Probing of Pre-trained Transformer-based Models for Bangla Hate Speech Detection
Intesar Tahmid | Rafid Ahmed | Md Mahir Jawad | Anam Borhan Uddin | Md Fahim | Md Farhad Alam Bhuiyan

This paper presents our approach for the BLP Shared Task 1, where we implemented Linear Probing of Pre-trained Transformer-based Models for Bangla Hate Speech Detection. The goal of the task was to customize the existing models so that they’re capable of automatically identifying hate speech in Bangla social media text, with a focus on YouTube comments. Our approach relied on fine-tuning several pre-trained BERT models, adapting them to the shared task dataset for improved classification accuracy. To further enhance performance, we applied linear probing on three of the fine-tuned models, enabling more effective utilization of the learned representations. The combination of these strategies resulted in a consistent top-15 ranking across all subtasks of the competition. Our findings highlight the effectiveness of linear probing as a lightweight yet impactful technique for enhancing hate speech detection in low-resource languages like Bangla.

pdf bib
HateNet-BN at BLP-2025 Task 1: A Hierarchical Attention Approach for Bangla Hate Speech Detection
Mohaymen Ul Anam | Akm Moshiur Rahman Mazumder | Ashraful Islam | Akmmahbubur Rahman | M Ashraful Amin

The rise of social media in Bangladesh has increased abusive and hateful content, which is difficult to detect due to the informal nature of Bangla and limited resources. The BLP 2025 shared task addressed this challenge with Subtask 1A (multi-label abuse categories) and Subtask 1B (target identification). We propose a parameter-efficient model using a frozen BanglaBERT backbone with hierarchical attention to capture token level importance across hidden layers. Context vectors are aggregated for classification, combining syntactic and semantic features. On Subtask 1A, our frozen model achieved a micro-F1 of 0.7178, surpassing the baseline of 0.7100, while the unfrozen variant scored 0.7149. Our submissions ranked 15th (Subtask 1A) and 12th (Subtask 1B), showing that layer-wise attention with a frozen backbone can effectively detect abusive Bangla text.

pdf bib
Ecstasy at BLP-2025 Task 1: TF-IDF Informed Prompt Engineering with LoRA Fine-tuning for Bangla Hate Speech Detection
Kazi Reyazul Hasan | Mubasshira Musarrat | Muhammad Abdullah Adnan

We present a hybrid approach for Bangla hate speech detection that combines linguistic analysis with neural fine-tuning. Our method first identifies category-specific keywords using TF-IDF analysis on 35,522 training samples. These keywords then inform prompt engineering for a Llama 3.1 8B model fine-tuned with LoRA adapters. We incorporate distinctive Bangla terms directly into classification prompts to guide the model's understanding of hate speech patterns. Our system achieved top-5 rankings across all three BLP-2025 Task 1 subtasks, including hate type classification, target identification, and multi-task prediction. The approach proved particularly effective for culturally specific hate speech patterns unique to Bangla social media discourse.
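
For illustration, the category-keyword step could be sketched as below; the data format, vectorizer settings, and number of keywords per class are assumptions, and the actual prompt construction for the Llama model is not reproduced.

```python
# Sketch: pick the highest mean-TF-IDF terms per hate category to seed classification prompts.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords_per_class(texts, labels, k=10):
    vec = TfidfVectorizer(max_features=20000)
    X = vec.fit_transform(texts)
    vocab = np.array(vec.get_feature_names_out())
    keywords = {}
    for label in sorted(set(labels)):
        rows = [i for i, l in enumerate(labels) if l == label]
        mean_scores = np.asarray(X[rows].mean(axis=0)).ravel()
        keywords[label] = vocab[mean_scores.argsort()[::-1][:k]].tolist()
    return keywords
```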

pdf bib
CodeAnubad at BLP-2025 Task 2: Efficient Bangla-to-Python Code Generation via Iterative LoRA Fine-Tuning of Gemma-2
Soumyajit Roy

This paper presents our submission for Task 2 of the Bangla Language Processing (BLP) Workshop, which focuses on generating Python code from Bangla programming prompts in a low-resource setting. We address this challenge by fine-tuning the gemma-2-9b instruction-tuned model using parameter-efficient fine-tuning (PEFT) with QLoRA. We propose an iterative self-improvement strategy that augments the extremely limited training data (74 examples) by reusing verified correct predictions from the development set, alongside LoRA rank experiments (8, 16, 32), observing a clear correlation between rank and accuracy, with rank 32 delivering the best results. Compared to translation-based and retrieval-augmented baselines, our approach achieves significantly higher accuracy, with a pass rate of 47% on the development set and 37% on the hidden test set. These results highlight the effectiveness of combining iterative data augmentation with rank optimisation for specialised, low-resource code generation tasks.
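
A minimal sketch of a QLoRA-style setup in this spirit is shown below, using rank 32 (the best-performing rank reported); the target modules and all hyperparameters other than the rank are assumptions rather than the authors' exact configuration.

```python
# QLoRA-style sketch: load the base model in 4-bit and attach a rank-32 LoRA adapter.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", quantization_config=bnb)

lora = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05,                    # rank 32 performed best in the reported sweep
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed projection layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```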

pdf bib
Troopers at BLP-2025 Task 2: Reward-Selective Fine-Tuning based Code Generation Approach for Bangla Prompts
Musa Tur Farazi | Nufayer Jahan Reza

We present a formally grounded description of a reward-selective fine-tuning (RSFT) pipeline for code generation from Bangla natural-language prompts. The implemented system mines candidate programs via temperature and nucleus sampling, executes candidates in a sandbox and retains programs that pass all unit tests, performs supervised fine-tuning (SFT) on winners using parameter-efficient Low-Rank Adaptation (LoRA) adapters, and augments robustness through fuzzed asserts. We specify the exact objectives and estimators used, provide a Bangla-aware preprocessing recipe, prove simple properties of the sampling budget, and report an ablation showing the effect of the inference sample budget K on accuracy. We also include a threat model for safe execution. Our code is available on GitHub.
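
For illustration, the candidate-mining and filtering step could look roughly like the sketch below; `generate` stands in for temperature/nucleus sampling from the model, and a plain subprocess call is only a stand-in for the sandboxed execution the paper describes.

```python
# Sketch of reward-selective candidate mining: sample K programs, keep only those passing all unit tests.
import subprocess
import tempfile

def passes_tests(program: str, tests: str, timeout: int = 5) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n\n" + tests)   # program followed by assert-style unit tests
        path = f.name
    try:
        return subprocess.run(["python", path], capture_output=True, timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        return False

def mine_winners(generate, prompt: str, tests: str, k: int = 20):
    # `generate(prompt)` is assumed to return one sampled candidate program per call.
    return [cand for cand in (generate(prompt) for _ in range(k)) if passes_tests(cand, tests)]
```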

pdf bib
PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents
Jahidul Islam | Md Ataullha | Saiful Azad

LLMs excel at code generation from English prompts, but this progress has not extended to low-resource languages. This paper addresses the challenge of Bangla-to-Python code generation by introducing BanglaCodeAct, an agent-based framework that leverages multi-agent prompting and iterative self-correction. Unlike prior approaches that rely on task-specific fine-tuning, BanglaCodeAct employs an open-source multilingual LLM within a Thought–Code–Observation loop, enabling the system to dynamically generate, test, and refine code from Bangla instructions. We benchmark several prominent small-parameter open-source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3-8B, when deployed with BanglaCodeAct, achieves the best performance, with a pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set. These findings establish a new benchmark for Bangla-to-Python translation and highlight the potential of agent-based reasoning for reliable code generation in low-resource languages. Experimental scripts are publicly available at https://github.com/jahidulzaid/PyBanglaCodeActAgent

pdf bib
Barrier Breakers at BLP-2025 Task 2: Enhancing LLM Code Generation Capabilities through Test-Driven Development and Code Interpreter
Sajed Jalil | Shuvo Saha | Hossain Mohammad Seym

Over the past few years, improving LLM code generation capabilities has been a key focus in NLP research. Despite Bengali having 242 million native speakers worldwide, it receives little attention when it comes to training LLMs. More recently, various fine-tuning and augmented generation techniques have been employed to significantly enhance code generation performance. However, they require considerable expertise and resources for an end user to utilize effectively. The goal of our work is to democratize access to powerful code generation tools in resource-constrained emerging markets, enabling users to leverage them in their native language. We introduce a novel approach that combines Test-Driven Development (TDD) and a Code Interpreter (CI), utilizing open-weight models, which improves the baseline accuracy for code generation with Bengali prompts and achieves an overall accuracy of 85%. Our approach requires no fine-tuning and proves that even the smallest models in the same family can attain up to 98% accuracy compared to the largest models. All of our results are publicly shared on GitHub for validation and reproducibility.

pdf bib
Musafir at BLP_2025 Task 2: Generating Python Code from Bangla Prompts using a Multi Model Cascade and Unit Test Validation
Sakibul Hasan | Md Tasin Abdullah | Abdullah Al Mahmud | Ayesha Banu

This paper presents our approach for BLP-2025 Task 2: Code Generation in Bangla. To address the scarcity of Bangla–code training data, we adopt a two-stage pipeline. First, Bangla problem statements are translated into English using a neural translation model optimized for preserving technical semantics. Then, the translated text is passed to a Qwen-based code generation model to produce executable solutions. This translation–generation strategy leverages the strengths of English-centric code models while ensuring fidelity to the original Bangla instructions. Our system achieved competitive performance on the leaderboard, reaching 3rd place with a score of 91.8% and demonstrating that translation-augmented pipelines are effective for low-resource code generation tasks.

pdf bib
JU_NLP at BLP-2025 Task 2: Leveraging Zero-Shot Prompting for Bangla Natural Language to Python Code Generation
Pritam Pal | Dipankar Das

Code synthesis from natural language problem statements has recently gained popularity with the use of large language models (LLMs). Most of the available systems and benchmarks, however, are developed for English or other high-resource languages, and a gap exists for low-resource languages such as Bangla. Addressing this gap, the Bangla Language Processing (BLP) Workshop at AACL-IJCNLP 2025 featured a shared task on Bangla-to-Python code generation. Participants were asked to design systems that consume Bangla problem statements and generate executable Python programs. A benchmark dataset with training, development, and test splits was provided, and evaluation used the Pass@1 metric over hidden test cases. We present the system we developed, which uses state-of-the-art LLMs in a zero-shot prompting setup. We report outcomes for several models, including variants of GPT-4 and Llama-4, and discuss their relative strengths and weaknesses. Our best-performing system, based on GPT-4.1, achieved a Pass@1 score of 78.6% on the test dataset. We address the challenges of Bangla code generation, including morphological richness, cross-lingual understanding, and functional correctness, and outline directions for future work in multilingual program synthesis.
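
Since the shared task scores systems with Pass@1 over hidden test cases, the standard unbiased Pass@k estimator is worth recalling; with a single sample per problem it reduces to the fraction of problems whose generated program passes all tests.

```python
# Unbiased pass@k estimator (Chen et al., 2021): n generated samples per problem, c of which pass all tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k draw is guaranteed to contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Pass@1 over a benchmark is the mean of per-problem pass_at_k(n, c, 1) values.
```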

pdf bib
Retriv at BLP-2025 Task 2: Test-Driven Feedback-Guided Framework for Bangla-to-Python Code Generation
K M Nafi Asib | Sourav Saha | Mohammed Moshiul Hoque

Large Language Models (LLMs) have advanced the automated generation of code from natural language prompts. However, low-resource languages (LRLs) like Bangla remain underrepresented due to the limited availability of instruction-to-code datasets and evaluation benchmarks. To address this, the BLP Workshop at IJCNLP-AACL 2025 introduced a shared task on “Code Generation in Bangla”. In this work, we propose a method that combines instruction prompting with a test-driven, feedback-guided iterative refinement process using a fine-tuned Qwen2.5-14B model. The model generates code from Bangla instructions, tests it against unit tests, and iteratively refines any failing outputs through three evaluation passes, using test feedback to guide each step. This approach helped our team “Retriv” to secure 2nd place in the shared task with a Pass@1 score of 0.934. The analysis highlights challenges in Bangla instruction understanding and Python code generation, emphasizing the need for targeted methods in LRLs. We made experimental scripts publicly available for the community.
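
A rough sketch of the three-pass, feedback-guided refinement loop is given below; `generate` and `run_tests` are hypothetical helpers, and the actual fine-tuned Qwen2.5-14B call and test harness are not reproduced here.

```python
# Iterative refinement sketch: regenerate only when unit tests fail, feeding the error trace back.
def refine(generate, run_tests, instruction: str, tests: str, max_passes: int = 3) -> str:
    code = generate(instruction)
    for _ in range(max_passes):
        ok, error_trace = run_tests(code, tests)   # (all tests passed?, captured failure output)
        if ok:
            return code
        feedback = (f"{instruction}\n\nPrevious attempt:\n{code}\n\n"
                    f"Unit test failure:\n{error_trace}\nPlease fix the code.")
        code = generate(feedback)
    return code
```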

pdf bib
NSU_PiedPiper at BLP-2025 Task 2: A Chain-of-Thought with Iterative Debugging Approach for Code Generation with Bangla Instruction
Ahmad Fahmid | Fahim Foysal | Wasif Haider | Shafin Rahman | Md Adnan Arefeen

Code generation from natural language instructions in Bangla is a fundamental task in programming automation, as explored in BLP-2025 Shared Task 2: Code Generation in Bangla. Current code generation models are designed primarily for high-resource languages such as English, which creates major limitations when applied to Bangla. The key challenges are limited training data and difficulty in correctly interpreting Bangla programming instructions. In this paper, to accommodate Bangla instructions, we present a chain-of-thought (CoT) prompting approach with the Qwen2.5-Coder-14B model. We further introduce few-shot examples in the prompt template to improve accuracy. During the competition, we achieved an accuracy of 93% on the shared test set (test_v1.csv) and 82.6% on the hidden public and private test sets. After the competition closed, we implemented a debugger prompting technique that refines answers with 3 iterative fixing attempts. Applying this new technique to the public shared test set, our system improves by 7% and achieves 100% accuracy on the public test set, highlighting the effectiveness of combining CoT prompting with iterative debugging.

pdf bib
NALA_MAINZ at BLP-2025 Task 2: A Multi-agent Approach for Bengali Instruction to Python Code Generation
Hossain Shaikh Saadi | Faria Alam | Mario Sanz-Guerrero | Minh Duc Bui | Manuel Mager | Katharina von der Wense

This paper presents JGU Mainz’s winning system for the BLP-2025 Shared Task on Code Generation from Bangla Instructions. We propose a multi-agent-based pipeline. First, a code-generation agent produces an initial solution from the input instruction. The candidate program is then executed against the provided unit tests (pytest-style, assert-based). Only the failing cases are forwarded to a debugger agent, which reruns the tests, extracts error traces, and, conditioning on the error messages, the current program, and the relevant test cases, generates a revised solution. Using this approach, our submission achieved first place in the shared task with a Pass@1 score of 95.4. We also make our code public.

pdf bib
CUET_Expelliarmus at BLP2025 Task 2: Leveraging Instruction Translation and Refinement for Bangla-to-Python Code Generation with Open-Source LLMs
Md Kaf Shahrier | Suhana Binta Rashid | Hasan Mesbaul Ali Taher | Mohammed Moshiul Hoque

Large language models (LLMs) have recently shown strong performance in generating code from natural language prompts. However, current benchmarks are primarily focused on English, overlooking low-resource languages like Bangla. This creates a critical research gap, since there are no well-established resources or systematic evaluations for code generation from Bangla instructions. To address this gap, we present a system that generates executable Python code from Bangla instructions. We design a two-stage pipeline in which the Bangla instructions are first translated and refined into clear English versions to reduce ambiguity, and then Python code is generated from the refined instructions with iterative error correction. For both instruction refinement and code generation we used the open-source GPT-20B OSS model. On the official test set, our system achieves competitive results. We also analyze common errors such as unclear instructions, logical mistakes, runtime issues, and the need for external knowledge beyond the model’s training. Overall, our findings show that a simple translation–refinement pipeline can be an effective and low-cost approach for code generation in low-resource languages.

pdf bib
AlphaBorno at BLP-2025 Task 2: Code Generation with Structured Prompts and Execution Feedback
Mohammad Ashfaq Ur Rahman | Muhtasim Ibteda Shochcho | Md Fahim

This paper explores various prompting strategies in the BLP-2025 Shared Task 2, utilizing a pipeline that first translates Bangla problem descriptions into English with GPT-4o, then applies techniques like zero-shot, few-shot, chain-of-thought, synthetic test case integration, and a self-repair loop. We evaluated four LLMs (GPT-4o, Grok-3, Claude 3.7 Sonnet, and Qwen2.5-Coder 14B). Our findings reveal that while traditional methods like few-shot and chain-of-thought prompting provided inconsistent gains, the integration of explicit unit tests delivered a substantial performance boost across all models. The most effective strategy combined zero-shot prompting with these synthetic tests and a self-repair loop, leading GPT-4o to achieve a top Pass@1 score of 72.2%. These results demonstrate the value of using explicit constraints and iterative feedback in code generation, offering a solid framework that improves the model's code generation capabilities.

pdf bib
PyBhasha at BLP-2025 Task 2: Effectiveness of Semantic-Aware Translation and Ensembling in Bangla Code Generation
Foyez Ahmed Dewan | Nahid Montasir Rifat

In this paper, we present our submission to Task 2 of the BLP-2025 shared task on code generation from Bangla instructions. Our approach focused on enhancing instruction quality through translation and improving model performance with a two-stage ensemble strategy. We evaluated two proprietary and several open-source models under three instruction settings: original Bangla instructions, Bangla instructions translated into English using Facebook NLLB, and instructions rewritten in English with GPT-4.1. Experimental results showed that GPT-4.1-rewritten instructions consistently achieved the highest accuracy across models. For final predictions, we used a two-stage ensemble, achieving a pass@1 score of 80.0% on the hidden test set and securing 12th place on the official leaderboard. Additionally, we conducted a qualitative analysis of selected translations to illustrate how variations in instruction phrasing influenced model outputs.

pdf bib
AdversaryAI at BLP-2025 Task 2: A Think, Refine, and Generate (TriGen) System with LoRA and Self-Refinement for Code Generation
Omar Faruqe Riyad | Jahedul Alam Junaed

In this paper, we propose a system for generating Python code from Bangla prompts. Our approach fine-tunes open-source models with parameter-efficient techniques and leverages proprietary models via prompting. To enhance the reasoning of smaller models, we adopt a Chain-of-Thought (CoT) augmented fine-tuning, enabling them to learn intermediate reasoning steps before generating code. A self-refinement loop further improves performance by iteratively critiquing and correcting code based on execution feedback. We also employ few-shot prompting to guide inference more effectively. Applied to both open-source and proprietary models, this pipeline achieved its best results with Gemini 2.5 Pro, where our system ranked 4th on the competition leaderboard with a Pass@1 score of 0.85. We conclude with a detailed analysis of these findings.

pdf bib
TeamB2B at BLP-2025 Task 2: BanglaForge: LLM Collaboration with Self-Refinement for Bangla Code Generation
Mahir Labib Dihan | Sadif Ahmed | Md Nafiu Rahman

Bangla is a low-resource language for code generation, lacking large-scale annotated datasets and tools to transform natural language specifications into executable programs. This makes Bangla-to-code generation a challenging task requiring innovative solutions. To address this, we introduce BanglaForge, a novel framework for generating code from Bangla function descriptions. BanglaForge leverages a retrieval-augmented dual-model collaboration paradigm with self-refinement, combining in-context learning, LLM-based translation, systematic prompt engineering, and iterative self-refinement based on execution feedback, where a coder generates initial solutions and a reviewer enhances them for robustness. On the BLP-2025 Bangla Code Generation benchmark, BanglaForge achieves a competitive Pass@1 accuracy of 84.00%, demonstrating the effectiveness of retrieval, model collaboration, and self-refinement for low-resource Bangla code generation.

pdf bib
BRACU_CL at BLP-2025 Task 2: CodeMist: A Transformer-Based Framework for Bangla Instruction-to-Code Generation
Md. Fahmid-Ul-Alam Juboraj | Soumik Deb Niloy | Mahbub E Sobhani | Farig Sadeque

This study proposes a hybrid framework for Bangla-to-Python code generation, emphasizing improved code accuracy through a two-phase pipeline: generation and debugging. During development, standalone models such as TigerLLM and StarCoder achieved modest accuracies of 27% and 24%, respectively, while more advanced models, Gemini-1.5-flash and Gemma, reached 60% and 64%. Integrating Gemma with the gpt-oss debugger substantially increased accuracy to 99.75%, highlighting the critical role of a dedicated debugging stage. In testing on unseen data, gpt-oss alone achieved 67%, which improved to 71% with self-debugging. The highest performance, 84%, was obtained by pairing Gemini-2.5-flash as the generator with gpt-oss for debugging. These findings demonstrate that combining a strong generative model with an effective debugging component yields superior and robust code generation results, outperforming existing approaches such as TigerLLM. The full implementation of the framework is publicly available at https://github.com/fahmid-juboraj/Code_generation.

pdf bib
Code_Gen at BLP-2025 Task 2: BanglaCode: A Cross-lingual Benchmark for Code Generation with Translation and Assertion Strategies
Abhishek Agarwala | Shifat Islam | Emon Ghosh

Large Language Models (LLMs) have shown great code-generation capabilities, but their performance in low-resource languages like Bangla is largely unexplored. We participated in BLP-2025 Task 2: Code Generation in Bangla, where we built a pipeline to interpret and execute Bangla instructions using GPT-5. Extensive experiments were conducted with proprietary (GPT-4o Mini, GPT-5 Mini, GPT-5) and open-source (LLaMA 3-8B, TigerLLM-1B-it) models under translation and assertion settings. Results show that GPT-5 with translation and assertion scored 83.8% and outperformed all baselines, while open-source models lagged due to limited Bangla adaptation. Assertion-based prompting consistently improved syntactic correctness, and fine-tuning reduced hallucinations across open-source models. We ranked 7th on the official leaderboard with a competitive and generalizable approach. Overall, our results show that translation quality, data normalization, and prompt design are key components of low-resource code generation. Furthermore, the proposed BanglaCode benchmark and preprocessing architecture provide a basis for further multilingual code-generation research.

up

pdf (full)
bib (full)
Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)

pdf bib
Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)
Arnab Bhattacharya | Pawan Goyal | Saptarshi Ghosh | Kripabandhu Ghosh

pdf bib
Multi-Feature Graph Convolution Network for Hindi OCR Verification
Shikhar Dubey | Krish Mittal | Sourava Kumar Behera | Manikandan Ravikiran | Nitin Kumar | Saurabh Shigwan | Rohit Saluja

This paper presents a novel Graph Convolutional Network (GCN) based framework for verifying OCR predictions on real Hindi document images, specifically addressing the challenges of complex conjuncts and character segmentation. Our approach first segments Hindi characters in real book images at different levels of granularity, while also synthetically generating word images from OCR predictions. Both real and synthetic images are processed through ResNet-50 to extract feature representations, which are then segmented using multiple patching strategies (uniform, akshara, random, and letter patches). The bounding boxes created using segmentation masks are scaled proportionally to the feature space while extracting features for the GCN. We construct a line graph where each node represents a real-synthetic character pair (in feature space). Each node of the line graph captures semantic and geometric features including i) cross-entropy between original and synthetic features, ii) Hu moments difference for shape properties, and iii) pixel count difference for size variation. The GCN with three convolutional layers (and ELU activation) processes these graph-structured features to verify the correctness of OCR predictions. Experimental evaluation on 1000 images from diverse Hindi books demonstrates the effectiveness of our graph-based verification approach in detecting OCR errors, particularly for challenging conjunct characters where traditional methods struggle.

pdf bib
Indian Grammatical Tradition-Inspired Universal Semantic Representation Bank (USR Bank 1.0)
Soma Paul | Sukhada Sukhada | Bidisha Bhattacharjee | Kumari Riya | Sashank Tatavolu | Kamesh R | Isma Anwar | Pratibha Rani

In this paper, we introduce USR Bank 1.0, a multi-layered, text-level semantic representation framework designed to capture not only the predicate-argument structure of an utterance but also the speaker’s communicative intent as expressed linguistically. Built on the Universal Semantic Grammar (USG), which is grounded in Pāṇinian grammar and the Indian Grammatical Tradition (IGT), USR systematically encodes semantic, morpho-syntactic, discourse, and pragmatic information across distinct layers. In the USR generation process, initial USRs are automatically generated using a dedicated USR-builder tool and subsequently validated via a web-based interface (SAVI), ensuring high inter-annotator agreement and semantic fidelity. Our evaluation on Hindi texts demonstrates robust dependency and discourse annotation consistency and strong semantic similarity in USR-to-text generation. By distributing semantic-pragmatic information across layers and capturing the speaker’s perspective, USR provides a cognitively motivated, language-agnostic framework with promising applications in multilingual natural language processing.

pdf bib
Auditing Political Bias in Text Generation by GPT-4 using Sociocultural and Demographic Personas: Case of Bengali Ethnolinguistic Communities
Dipto Das | Syed Ishtiaque Ahmed | Shion Guha

Though large language models (LLMs) are increasingly used in multilingual contexts, their political and sociocultural biases in low-resource languages remain critically underexplored. In this paper, we investigate how LLM-generated texts in Bengali shift in response to personas with varying political orientations (left vs. right), religious identities (Hindu vs. Muslim), and national affiliations (Bangladeshi vs. Indian). In a quasi-experimental study, we simulate these personas and prompt an LLM to respond to political discussions. Measuring the shifts relative to responses for a baseline Bengali persona, we examine how political orientation influences LLM outputs, how topical associations shape the political leanings of outputs, and how demographic persona-induced changes align with differently politically oriented variations. Our findings highlight left-leaning political bias in Bengali text generation and its significant association with Muslim sociocultural and demographic identity. We also connect our findings with broader discussions around emancipatory politics, epistemological considerations, and alignment of multilingual models.

pdf bib
INDRA: Iterative Difficulty Refinement Attention for MCQ Difficulty Estimation for Indic Languages
Manikandan Ravikiran | Rohit Saluja | Arnav Bhavsar

Estimating the difficulty of multiple-choice questions (MCQs) is central to adaptive testing and learner modeling. We introduce INDRA (Iterative Difficulty Refinement Attention), a novel attention mechanism that unifies psychometric priors with neural refinement for Indic MCQ difficulty estimation. INDRA incorporates three key innovations: (i) IRT-informed initialization, which assigns token-level discrimination and difficulty scores to embed psychometric interpretability; (ii) entropy-driven iterative refinement, which progressively sharpens attention to mimic the human process of distractor elimination; and (iii) Indic Aware Graph Coupling, which propagates plausibility across morphologically and semantically related tokens, a critical feature for Indic languages. Experiments on TEEMIL-H and TEEMIL-K datasets show that INDRA achieves consistent improvements, with absolute gains of up to +1.02 F1 and +1.68 F1 over state-of-the-art, while demonstrating through ablation studies that psychometric priors, entropy refinement, and graph coupling contribute complementary gains to accuracy and robustness.

pdf bib
Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis
Anusha Kamath | Kanishk Singla | Rakesh Paul | Raviraj Bhuminand Joshi | Utkarsh Vaidya | Sanjay Singh Chauhan | Niranjan Wartikar

Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct an extensive benchmarking of open-source LLMs supporting Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.

pdf bib
Aligning Large Language Models to Low-Resource Languages through LLM-Based Selective Translation: A Systematic Study
Rakesh Paul | Anusha Kamath | Kanishk Singla | Raviraj Joshi | Utkarsh Vaidya | Sanjay Singh Chauhan | Niranjan Wartikar

Multilingual large language models (LLMs) often demonstrate a performance gap between English and non-English languages, particularly in low-resource settings. Aligning these models to low-resource languages is essential yet challenging due to limited high-quality data. While English alignment datasets are readily available, curating equivalent data in other languages is expensive and time-consuming. A common workaround is to translate existing English alignment data; however, standard translation techniques often fail to preserve critical elements such as code, mathematical expressions, and structured formats like JSON. In this work, we investigate LLM-based selective translation, a technique that selectively translates only the translatable parts of a text while preserving non-translatable content and sentence structure. We conduct a systematic study to explore key questions around this approach, including its effectiveness compared to vanilla translation, the importance of filtering noisy outputs, and the benefits of mixing translated samples with original English data during alignment. Our experiments focus on the low-resource Indic language Hindi and compare translations generated by Google Cloud Platform (GCP) and Llama-3.1-405B. The results highlight the promise of selective translation as a practical and effective method for improving multilingual alignment in LLMs.

pdf bib
Automatic Accent Restoration in Vedic Sanskrit with Neural Language Models
Yuzuki Tsukagoshi | Ikki Ohmukai

Vedic Sanskrit, the oldest attested form of Sanskrit, employs a distinctive pitch-accent system that marks one syllable per word. This work presents the first application of large language models to the automatic restoration of accent marks in transliterated Vedic Sanskrit texts. A comprehensive corpus was assembled by extracting major Vedic works from the TITUS project and constructing paired samples of unaccented input and correctly accented references, yielding more than 100,000 training examples. Three generative LLMs were fine-tuned on this corpus: a LoRA-adapted Llama 3.1 8B Instruct model, OpenAI GPT‐4.1 nano, and Google Gemini 2.5 Flash. These models were trained in a sequence‐to‐sequence fashion to insert accent marks at appropriate positions. Evaluation on roughly 2,000 sentences using precision, recall, F1, character error rate, word error rate, and ChrF1 metrics shows that fine‐tuned models substantially outperform their untuned baselines. The LoRA-tuned Llama achieves the highest F1, followed by Gemini 2.5 Flash and GPT‐4.1 nano. Error analysis reveals that the models learn to infer accents from grammatical and phonological context. These results demonstrate that LLMs can capture complex accentual patterns and recover lost information, opening possibilities for improved sandhi splitting, morphological analysis, syntactic parsing and machine translation in Vedic NLP pipelines.

pdf bib
AnciDev: A Dataset for High-Accuracy Handwritten Text Recognition of Ancient Devanagari Manuscripts
Vriti Sharma | Rajat Verma | Rohit Saluja

The digital preservation and accessibility of historical documents require accurate and scalable Handwritten Text Recognition (HTR). However, progress in this field is significantly hampered for low-resource scripts, such as ancient forms of the scripts used in historical manuscripts, due to the scarcity of high-quality, transcribed training data. We address this critical gap by introducing the AnciDev Dataset, a novel, publicly available resource comprising 3,000 transcribed text lines sourced from 500 pages of different ancient Devanagari manuscripts. To validate the utility of this new resource, we systematically evaluate and fine-tune several HTR models on the AnciDev Dataset. Our experiments demonstrate a significant performance uplift across all fine-tuned models, with the best-performing architecture achieving a substantial reduction in Character Error Rate (CER), confirming the dataset’s efficacy in addressing the unique complexities of ancient handwriting. This work not only provides a crucial, well-curated dataset to the research community but also sets a new, reproducible state-of-the-art for the HTR of historical Devanagari, advancing the effort to digitally preserve India’s documentary heritage.

pdf bib
BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages
Hrishikesh Terdalkar | Kirtan Bhojani | Aryan Dongare | Omm Aditya Behera

Large language models (LLMs) are increasingly deployed in multilingual applications but often generate plausible yet incorrect or misleading outputs, known as hallucinations. While hallucination detection has been studied extensively in English, under-resourced Indian languages remain largely unexplored. We present BHRAM-IL, a benchmark for hallucination recognition and assessment in multiple Indian languages, covering Hindi, Gujarati, Marathi, Odia, along with English. The benchmark comprises 36,047 curated questions across nine categories spanning factual, numerical, reasoning, and linguistic tasks. We evaluate 14 state-of-the-art multilingual LLMs on a benchmark subset of 10,265 questions, analyzing cross-lingual and factual hallucinations across languages, models, scales, categories, and domains using category-specific metrics normalized to (0,1) range. Aggregation over all categories and models yields a primary score of 0.23 and a language-corrected fuzzy score of 0.385, demonstrating the usefulness of BHRAM-IL for hallucination-focused evaluation. The dataset, and the code for generation and evaluation are available on GitHub (https://github.com/sambhashana/BHRAM-IL/) and HuggingFace (https://huggingface.co/datasets/sambhashana/BHRAM-IL/) to support future research in multilingual hallucination detection and mitigation.

pdf bib
Mātṛkā: Multilingual Jailbreak Evaluation of Open-Source Large Language Models
Murali Emani | Kashyap Manjusha R

Artificial Intelligence (AI) and Large Language Models (LLMs) are increasingly integrated into high-stakes applications, yet their susceptibility to adversarial prompts poses significant security risks. In this work, we introduce Mātṛkā, a framework for systematically evaluating jailbreak vulnerabilities in open-source multilingual LLMs. Using the open-source dataset across nine sensitive categories, we constructed adversarial prompt sets that combine translation, mixed-language encoding, homoglyph signatures, numeric enforcement, and structural variations. Experiments were conducted on state-of-the-art open-source models from Llama, Qwen, GPT-OSS, Mistral, and Gemma families. Our findings highlight transferability of jailbreaks across multiple languages with varying success rates depending on attack design. We provide empirical insights, a novel taxonomy of multilingual jailbreak strategies, and recommendations for enhancing robustness in safety-critical environments.

pdf bib
Accent Placement Models for Rigvedic Sanskrit Text
Akhil Rajeev P | Annarao Kulkarni

The Rigveda, among the oldest Indian texts in Vedic Sanskrit, employs a distinctive pitch-accent system - udatta, anudatta, svarita - whose marks encode melodic and interpretive cues but are often absent from modern e-texts. This work develops a parallel corpus of accented-unaccented ślokas and conducts a controlled comparison of three strategies for automatic accent placement in Rigvedic verse: (i) full fine-tuning of ByT5, a byte-level Transformer that operates directly on Unicode combining marks, (ii) a from-scratch BiLSTM-CRF sequence-labeling baseline, and (iii) LoRA-based parameter-efficient fine-tuning atop ByT5. Evaluation uses Word Error Rate (WER) and Character Error Rate (CER) for orthographic fidelity, plus a task-specific Diacritic Error Rate (DER) that isolates accent edits. Full ByT5 fine-tuning attains the lowest error across all metrics; LoRA offers strong efficiency-accuracy trade-offs, and BiLSTM-CRF serves as a transparent baseline. The study underscores practical requirements for accent restoration - Unicode-safe preprocessing, mark-aware tokenization, and evaluation that separates grapheme from accent errors - and positions heritage-language technology as an emerging NLP area connecting computational modeling with philological and pedagogical aims. Results establish reproducible baselines for Rigvedic accent restoration and provide guidance for downstream tasks such as accent-aware OCR, ASR/chant synthesis, and digital scholarship.

pdf bib
Findings of the IndicGEC and IndicWG Shared Task at BHASHA 2025
Pramit Bhattacharyya | Karthika N J | Hrishikesh Terdalkar | Manoj Balaji Jagadeeshan | Shubham Kumar Nigam | Arvapalli Sai Susmitha | Arnab Bhattacharya

This overview paper presents the findings of the two shared tasks organized as part of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA) co-located with IJCNLP-AACL 2025. The shared tasks are: (1) Indic Grammar Error Correction (IndicGEC) and (2) Indic Word Grouping (IndicWG). For GEC, participants were tasked with producing grammatically correct sentences based on given input sentences in five Indian languages. For WG, participants were required to generate a word-grouped variant of a provided sentence in Hindi. The evaluation metric used for GEC was GLEU, while Exact Matching was employed for WG. A total of 14 teams participated in the final phase of Shared Task 1, and 2 teams participated in the final phase of Shared Task 2. For the IndicGEC shared task, the maximum GLEU scores obtained for Hindi, Bangla, Telugu, Tamil, and Malayalam are 85.69, 95.79, 88.17, 91.57, and 96.02, respectively. The highest exact matching score obtained for the IndicWG shared task is 45.13%.
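
For reference, sentence-level GLEU of the kind used to score IndicGEC can be computed with NLTK as sketched below; whether the official scorer matches NLTK's implementation and tokenization exactly is an assumption, and the example strings are placeholders.

```python
# GLEU scoring sketch with naive whitespace tokenization: gold corrections vs. system outputs.
from nltk.translate.gleu_score import corpus_gleu

gold = ["this is the corrected sentence", "another corrected sentence"]
system = ["this is the corrected sentence", "another sentence corrected"]

references = [[ref.split()] for ref in gold]   # each hypothesis may have several references
hypotheses = [hyp.split() for hyp in system]
print(corpus_gleu(references, hypotheses))
```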

pdf bib
Niyamika at BHASHA Task 1: Word-Level Transliteration for English-Hindi Mixed Text in Grammar Correction Using MT5
Rucha Ambaliya | Mahika Dugar | Pruthwik Mishra

Grammar correction for Indian languages poses significant challenges due to complex morphology, non-standard spellings, and frequent script variations. In this work, we address grammar correction for English-mixed sentences in five Indic languages—Hindi, Bengali, Malayalam, Tamil, and Telugu—as part of the IndicGEC 2025 Bhasha Workshop. Our approach first applies word-level transliteration using IndicTrans (Bhat et al., 2014) to normalize Romanized and mixed-script tokens, followed by grammar correction using the mT5-small model (Xue et al., 2021). Although our experiments focus on these five languages, the methodology is generalizable to other Indian languages. Our implementation and code are publicly available at: https://github.com/Rucha-Ambaliya/bhasha-workshop

pdf bib
Team Horizon at BHASHA Task 1: Multilingual IndicGEC with Transformer-based Grammatical Error Correction Models
Manav Dhamecha | Sunil Jaat | Gaurav Damor | Pruthwik Mishra

This paper presents Team Horizon’s approach to the BHASHA Shared Task 1: Indic Grammatical Error Correction (IndicGEC). We explore transformer-based multilingual models — mT5-small and IndicBART — to correct grammatical and semantic errors across five Indian languages: Bangla, Hindi, Tamil, Telugu, and Malayalam. Due to limited annotated data, we developed a synthetic data augmentation pipeline that introduces realistic linguistic errors under ten categories, simulating natural mistakes found in Indic scripts. Our fine-tuned models achieved competitive performance with GLEU scores of 86.03 (Tamil), 72.00 (Telugu), 82.69 (Bangla), 80.44 (Hindi), and 84.36 (Malayalam). We analyze the impact of dataset scaling, multilingual fine-tuning, and training epochs, showing that linguistically grounded augmentation can significantly improve grammatical correction accuracy in low-resource Indic languages.

pdf bib
A3-108 at BHASHA Task1: Asymmetric BPE configuration for Grammar Error Correction
Saumitra Yadav | Manish Shrivastava

This paper presents our approach to Grammatical Error Correction (GEC) for five low-resource Indic languages, a task severely limited by a scarcity of annotated data. Our core methodology involves two stages: synthetic data generation and model optimization. First, we leverage the provided training data to build a Statistical Machine Translation (SMT) system, which is then used to generate large-scale synthetic noisy-to-clean parallel data from available monolingual text. This artificially corrupted data significantly enhances model robustness. Second, we train Transformer-based sequence-to-sequence models using asymmetric and symmetric Byte Pair Encoding (BPE) configurations, where the number of merge operations differs between the source (erroneous) and target (corrected) sides to better capture language-specific characteristics. For instance, source BPE sizes of 4000, 8000, and 16000 were paired with target sizes of 500, 1000, 2000, 3000, and 4000. Our experiments demonstrated competitive performance across all five languages, with the best results achieving a GLEU score of 94.16 for Malayalam (ranked 4th), followed by Bangla at 92.44 (ranked 5th), Tamil at 85.52 (ranked 5th), Telugu at 81.9 (ranked 7th), and Hindi at 79.45 (ranked 10th) in the shared task. These findings substantiate the effectiveness of combining SMT-based synthetic data generation with asymmetric BPE configurations for low-resource GEC.
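
One way to realize such an asymmetric subword configuration is sketched below with SentencePiece BPE models of different sizes for the noisy source and clean target sides; the toolkit choice, file names, and the use of vocabulary size in place of merge-operation counts are assumptions, not the authors' exact setup.

```python
# Asymmetric BPE sketch: a larger subword inventory on the erroneous source side,
# a smaller one on the corrected target side (file paths are hypothetical).
import sentencepiece as spm

spm.SentencePieceTrainer.train(input="train.src", model_prefix="bpe_src",
                               vocab_size=8000, model_type="bpe")
spm.SentencePieceTrainer.train(input="train.tgt", model_prefix="bpe_tgt",
                               vocab_size=2000, model_type="bpe")
```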

pdf bib
DLRG at BHASHA: Task 1 (IndicGEC): A Hybrid Neurosymbolic Approach for Tamil and Malayalam Grammatical Error Correction
Akshay Ramesh | Ratnavel Rajalakshmi

Grammatical Error Correction (GEC) for low-resource Indic languages remains challenging due to limited annotated data and morphological complexity. We present a hybrid neurosymbolic GEC system that combines neural sequence-to-sequence models with explicit language-specific rule-based pattern matching. Our approach leverages parameter-efficient LoRA adaptation on aggressively augmented data to fine-tune pre-trained mT5 models, followed by learned correction rules through intelligent ensemble strategies. The proposed hybrid architecture achieved 85.34% GLEU for Tamil (Rank 8) and 95.06% GLEU for Malayalam (Rank 2) on the provided IndicGEC test sets, outperforming individual neural and rule-based approaches. The system incorporates conservative safety mechanisms to prevent catastrophic deletions and over-corrections, thus ensuring robustness and real-world applicability. Our work demonstrates that extremely low-resource GEC can be effectively addressed by combining neural generalization with symbolic precision.

pdf bib
akhilrajeevp at BHASHA Task 1: Minimal-Edit Instruction Tuning for Low-Resource Indic GEC
Akhil Rajeev P

Grammatical error correction for Indic languages faces limited supervision, diverse scripts, and rich morphology. We propose an augmentation-free setup that uses instruction-tuned large language models and conservative decoding. A 12B GEMMA 3 model is instruction-tuned in bnb 4-bit precision with parameter-efficient fine-tuning and Alpaca-style formatting. Decoding follows a deterministic, constraint-aware procedure with a lightweight normaliser that encourages minimal, meaning-preserving edits. We operationalise inference, subsequent to instruction fine-tuning (IFT), via a fixed, language-specific prompt directly synthesised from a deterministic error classifier’s taxonomy, label distributions, and precedence ordering computed on the training corpus. Under the official untuned GLEU evaluation, the system scores 92.41 on Malayalam, sixth overall, and 81.44 on Hindi, third overall. These results indicate that classifier-informed prompt design, adapter-based instruction tuning, and deterministic decoding provide a reproducible and computationally efficient alternative to augmentation-centred pipelines for Indic GEC. The approach also motivates future work on stronger morphosyntactic constraints and human-centered evaluation of conservative edits.

pdf bib
Team Horizon at BHASHA Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping
Manav Dhamecha | Gaurav Damor | Sunil Jaat | Pruthwik Mishra

We present Team Horizon’s approach to BHASHA Task 2: Indic Word Grouping. We model word grouping as a token classification problem and fine-tune multilingual Transformer encoders for the task. We evaluated MuRIL, XLM-RoBERTa, and IndicBERT v2 and report Exact Match accuracy on the test data. Our best model (MuRIL) achieves 58.1818% exact match on the test set.

up

pdf (full)
bib (full)
Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)

pdf bib
Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)
Aman Sinha | Raúl Vázquez | Timothee Mickus | Rohit Agarwal | Ioana Buhnila | Patrícia Schmidtová | Federica Gamba | Dilip K. Prasad | Jörg Tiedemann

pdf bib
Task-Aware Evaluation and Error-Overlap Analysis for Large Language Models
Pranava Madhyastha

Public leaderboards for large language models often rely on aggregate scores that conceal critical information about model behavior. In this paper, we present a methodology for task-aware evaluation that combines (i) correctness metrics aligned with task semantics (compliance checks for instruction-following and numeric equivalence for mathematics) with (ii) pairwise error-overlap analysis to identify complementary model pairs. We apply this methodology to the outputs of 17 recent state-of-the-art and frontier LLMs across multiple-choice QA, instruction-following, and mathematical reasoning tasks. We observe that task-aware metrics can reorder model rankings relative to generic lexical metrics, and that error-overlap patterns vary substantially across model pairs and scenarios. We conclude by discussing implications for model selection, routing strategies, and LLM-as-judge calibration, and release our analysis pipeline to support further investigation.

pdf bib
Examining the Faithfulness of Deepseek R1’s Chain-of-Thought Reasoning
Chrisanna Cornish | Anna Rogers

Chain-of-Thought (CoT) ‘reasoning’ promises to enhance the performance and transparency of Large Language Models (LLMs). Models, such as Deepseek R1, are trained via reinforcement learning to automatically generate CoT explanations in their outputs. Their faithfulness, i.e. how well the explanations actually reflect their internal reasoning process, has been called into doubt by recent studies (Chen et al., 2025a; Chua and Evans, 2025). This paper extends previous work by probing Deepseek R1 with 445 logical puzzles under zero- and few-shot settings. We find that whilst the model explicitly acknowledges a strong harmful hint in 94.6% of cases, it reports less than 2% of helpful hints. Further analysis reveals implicit unfaithfulness as the model significantly reduces answer-rechecking behaviour for helpful hints (p<0.01) despite rarely mentioning them in its CoT, demonstrating a discrepancy between its reported and actual decision process. In line with prior reports for GPT, Claude, Gemini and other models, our results for DeepSeek raise concerns about the use of CoT as an explainability technique.

pdf bib
Better Together: Towards Localizing Fact-Related Hallucinations using Open Small Language Models
David Kletz | Sandra Mitrovic | Ljiljana Dolamic | Fabio Rinaldi

In this paper, we explore the potential of Open-source Small Language Models (OSLMs) for localizing hallucinations related to factual accuracy. We first present Lucifer, a dataset designed to enable proper and consistent evaluation of LMs, composed of an automatically constructed portion and a manually curated subset intended for qualitative analysis. We then assess the performance of five OSLMs using four carefully designed prompts. Results are evaluated either individually or combined through a voting-based merging approach. While our results demonstrate that the merging method yields promising performance even with smaller models, our manually curated dataset highlights the inherent difficulty of the task, underscoring the need for further research.

pdf bib
Leveraging NTPs for Efficient Hallucination Detection in VLMs
Ofir Azachi | Kfir Eliyahu | Eyal El Ani | Rom Himelstein | Roi Reichart | Yuval Pinter | Nitay Calderon

Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM’s next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs. All data is publicly available at https://huggingface.co/datasets/wrom/Language-Vision-Hallucinations.
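
As a rough illustration, per-token probabilities from a model's generation can be reduced to a handful of uncertainty features for a lightweight classifier; the specific features below are illustrative, not the paper's exact feature set.

```python
# Turn a sequence of next-token probabilities into simple uncertainty features
# that a traditional ML classifier (e.g., logistic regression) can consume.
import numpy as np

def ntp_features(token_probs):
    p = np.asarray(token_probs, dtype=float)
    return {
        "mean_prob": float(p.mean()),
        "min_prob": float(p.min()),
        "mean_neg_logprob": float(-np.log(p + 1e-12).mean()),
        "frac_low_conf": float((p < 0.1).mean()),
    }
```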

pdf bib
Language Confusion and Multilingual Performance: A Case Study of Thai-Adapted Large Language Models
Pakhapoom Sarapat | Trapoom Ukarapol | Tatsunori Hashimoto

This paper presents a comprehensive study on the multilingual adaptability of large language models (LLMs), with a focus on the interplay between training strategies and prompt design. Using Thai as a case study, we examine: (RQ1) the extent to which pre-trained models (Base) can adapt to another language through additional fine-tuning; (RQ2) how continual pre-training (CPT) compares to multilingual pre-training (MLLM) in terms of performance on downstream tasks; and (RQ3) how language variation within different components of a structured prompt–task instruction, context input, and output instruction–influences task performance in cross-lingual settings. Our findings reveal that CPT proves to be a promising strategy for enhancing model performance in languages other than English like Thai in monolingual settings, particularly for models that initially lack strong linguistic capabilities. Its effectiveness, however, is highly task-dependent and varies based on the base model’s initial proficiency. In cross-lingual scenarios, MLLMs exhibit superior robustness compared to Base and CPT models, which are more susceptible to context-output language mismatches. Considering the high cost of training multilingual models from scratch, MLLMs remain a critical component for downstream tasks in multilingual settings due to their strong cross-lingual performance.

pdf bib
A Comprehensive Evaluation of Large Language Models for Retrieval-Augmented Generation under Noisy Conditions
Josue Daniel Caldas Velasquez | Elvis de Souza

Retrieval-Augmented Generation (RAG) has emerged as an effective strategy to ground Large Language Models (LLMs) with reliable, real-time information. This paper investigates the trade-off between cost and performance by evaluating 13 LLMs within a RAG pipeline for the Question Answering (Q&A) task under noisy retrieval conditions. We assess four extractive and nine generative models—spanning both open- and closed-source models of varying sizes—on a journalistic benchmark specifically designed for RAG. By systematically varying the level of noise injected into the retrieved context, we analyze not only which models perform best, but also how robust they are to noisy input. Results show that large open-source generative models (approx. 70B parameters) achieve performance and noise tolerance on par with top-tier closed-source models. However, their computational demands limit their practicality in resource-constrained settings. In contrast, medium-sized open-source models (approx. 7B parameters) emerge as a compelling compromise, balancing efficiency, robustness, and accessibility.
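The following sketch illustrates one generic way to inject controlled noise into retrieved context for this kind of robustness test; it is not the benchmark's actual construction procedure, and all names and ratios are assumptions:

```python
import random

def build_noisy_context(gold_passages, distractor_pool, noise_ratio, seed=0):
    """Mix gold (relevant) passages with distractors so that roughly
    `noise_ratio` of the final context is noise, then shuffle the order."""
    rng = random.Random(seed)
    n_gold = len(gold_passages)
    if noise_ratio >= 1.0:
        n_noise = len(distractor_pool)
    else:
        n_noise = round(n_gold * noise_ratio / (1 - noise_ratio))
    noise = rng.sample(distractor_pool, min(n_noise, len(distractor_pool)))
    context = list(gold_passages) + noise
    rng.shuffle(context)
    return context

gold = ["The court ruled on 3 May.", "The appeal was dismissed."]
pool = ["Unrelated sports news.", "A recipe for soup.",
        "Weather report.", "Stock prices rose."]
print(build_noisy_context(gold, pool, noise_ratio=0.5))  # half gold, half noise
```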

pdf bib
SHROOM-CAP: Shared Task on Hallucinations and Related Observable Overgeneration Mistakes in Crosslingual Analyses of Publications
Aman Sinha | Federica Gamba | Raúl Vázquez | Timothee Mickus | Ahana Chattopadhyay | Laura Zanella | Binesh Arakkal Remesh | Yash Kankanampati | Aryan Chandramania | Rohit Agarwal

This paper presents an overview of the SHROOM-CAP Shared Task, which focuses on detecting hallucinations and over-generation errors in cross-lingual analyses of scientific publications. SHROOM-CAP covers nine languages: five high-resource (English, French, Hindi, Italian, and Spanish) and four low-resource (Bengali, Gujarati, Malayalam, and Telugu). The task frames hallucination detection as a binary classification problem, where participants must predict whether a given text contains factual inaccuracies and fluency mistakes. We received 1,571 submissions from 5 participating teams during the test phase over the nine languages. In the paper, we present an analysis of the evaluated systems to assess their performance on the hallucination detection task across languages. Our findings reveal a disparity in system performance between high-resource and low-resource languages. Furthermore, we observe that factuality and fluency tend to be closely aligned in high-resource languages, whereas this correlation is less evident in low-resource languages. Overall, SHROOM-CAP underlines that hallucination detection remains a challenging open problem, particularly in low-resource and domain-specific settings.

pdf bib
SmurfCat at SHROOM-CAP: Factual but Awkward? Fluent but Wrong? Tackling Both in LLM Scientific QA
Timur Ionov | Evgenii Nikolaev | Artem Vazhentsev | Mikhail Chaichuk | Anton Korznikov | Elena Tutubalina | Alexander Panchenko | Vasily Konovalov | Elisei Rykov

Large Language Models (LLMs) often generate hallucinations, a critical issue in domains like scientific communication where factual accuracy and fluency are essential. The SHROOM-CAP shared task addresses this challenge by evaluating Factual Mistakes and Fluency Mistakes across diverse languages, extending earlier SHROOM editions to the scientific domain. We present SmurfCat, our system for SHROOM-CAP, which integrates three complementary approaches: uncertainty estimation (white-box and black-box signals), encoder-based classifiers (Multilingual Modern BERT), and decoder-based judges (instruction-tuned LLMs with classification heads). Results show that decoder-based judges achieve the strongest overall performance, while uncertainty methods and encoders provide complementary strengths. Our findings highlight the value of combining uncertainty signals with encoder and decoder architectures for robust, multilingual detection of hallucinations and related errors in scientific publications.

pdf bib
Scalar_NITK at SHROOM-CAP: Multilingual Factual Hallucination and Fluency Error Detection in Scientific Publications Using Retrieval-Guided Evidence and Attention-Based Feature Fusion
Anjali R

One of the key challenges of deploying Large Language Models (LLMs) in multilingual scenarios is maintaining output quality along two dimensions: factual correctness and linguistic fluency. LLMs are liable to produce factual hallucinations, i.e., plausible-sounding but false information, as well as fluency errors in the form of grammatical mistakes, repetition, or unnatural phrasing. In this paper, we present a two-framework solution for end-to-end quality evaluation of LLM-generated text in low-resource languages. (1) For hallucination detection, we introduce a retrieval-augmented classification model that combines hybrid document retrieval with gradient boosting. (2) For fluency detection, we introduce a deep learning model that fuses engineered statistical features with pre-trained semantic embeddings using an attention-based mechanism.

pdf bib
"AGI" team at SHROOM-CAP: Data-Centric Approach to Multilingual Hallucination Detection using XLM-RoBERTa
Harsh Rathwa | Pruthwik Mishra | Shrikant Malviya

The detection of hallucinations in multilingual scientific text generated by Large Language Models (LLMs) presents significant challenges for reliable AI systems. This paper describes our submission to the SHROOM-CAP 2025 shared task on scientific hallucination detection across 9 languages. Unlike most approaches that focus primarily on model architecture, we adopted a data-centric strategy that addressed the critical issue of training data scarcity and imbalance. We unify and balance five existing datasets to create a comprehensive training corpus of 124,821 samples (50% correct, 50% hallucinated), representing a 172x increase over the original SHROOM training data. Our approach fine-tunes XLM-RoBERTa-Large (560 million parameters) on this enhanced dataset and achieves competitive performance across all languages, including 2nd place in Gujarati (a zero-shot language) with a Factuality F1 of 0.5107, and rankings between 4th and 6th place across the remaining 8 languages. Our results demonstrate that systematic data curation can significantly outperform architectural innovations alone, particularly for low-resource languages in zero-shot settings.

up

pdf (full)
bib (full)
Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems

pdf bib
Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems
Mousumi Akter | Tahiya Chowdhury | Steffen Eger | Christoph Leiter | Juri Opitz | Erion Çano

pdf bib
Beyond Tokens and Into Minds: Future Directions for Human-Centered Evaluation in Machine Translation Post-Editing
Molly Apsel | Sunil Kothari | Manish Mehta | Vasudevan Sundarababu

Machine translation post-editing (MTPE) is central to evaluating and ensuring translation quality, particularly for low-resource languages (LRLs), where systems are more error-prone than for high-resource languages. Traditional token-based models segment text according to statistical patterns of their (primarily high-resource) training data, which can distort meaning, fragment words in morphologically rich languages, and complicate MTPE and evaluation. Current evaluation metrics also tend to emphasize surface-level similarity to reference texts, overlooking how humans actually approach translation tasks and creating issues when references are unavailable or a more abstract interpretation is needed. In this position paper, we argue that emerging architectures (Large Concept Models [LCMs] and Byte Latent Transformers [BLTs]) and insights from cognitive science open new possibilities for MTPE frameworks. LCMs represent meaning at the conceptual level, enabling evaluation of different translation approaches and the robustness of such models in MT. At the same time, BLTs operate below the token level, potentially easing post-editing across diverse language scripts. Drawing on cognitive theories of bilingualism and meaning representation, we outline hypotheses and research methods for evaluating post-editing data, translation quality, and interface design toward more robust, human-centered MT evaluation.

pdf bib
Measuring Visual Understanding in Telecom domain: Performance Metrics for Image-to-UML conversion using VLMs
H. G. Ranjani | Rutuja Prabhudesai

Telecom domain 3GPP documents are replete with images containing sequence diagrams. Advances in Vision-Language Large Models (VLMs) have eased conversion of such images to machine-readable PlantUML (puml) formats. However, there is a gap in evaluation of such conversions - existing works do not compare puml scripts for various components. In this work, we propose performance metrics to measure the effectiveness of such conversions. A subset of sequence diagrams from 3GPP documents is chosen to be representative of domain-specific actual scenarios. We compare puml outputs from two VLMs - Claude Sonnet and GPT-4V - against manually created ground truth representations. We use version control tools to capture differences and introduce standard performance metrics to measure accuracy along various components: participant identification, message flow accuracy, sequence ordering, and grouping construct preservation. We demonstrate the effectiveness of the proposed metrics in quantifying conversion errors across various components. The results show that nodes, edges, and messages are accurately captured. However, we observe that VLMs do not necessarily perform well on complex structures such as notes, boxes, and groups. Our experiments and performance metrics indicate a need for better representation of these components in training data for fine-tuned VLMs.

pdf bib
Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation
Naila Shafirni Hidayat | Muhammad Dehan Al Kautsar | Alfan Farizki Wicaksono | Fajri Koto

The performance of large language models (LLMs) continues to improve, as reflected in rising scores on standard benchmarks. However, the lack of transparency around training data raises concerns about potential overlap with evaluation sets and the fairness of reported results. Although prior work has proposed methods for detecting data leakage, these approaches primarily focus on identifying outliers and have not been evaluated under controlled simulated leakage conditions. In this work, we compare existing leakage detection techniques, namely permutation and n-gram-based methods, under a continual pretraining setup that simulates real-world leakage scenarios, and additionally explore a lightweight method we call semi-half question. We further introduce two efficient extensions, permutation-R and permutation-Q. While semi-half offers a low-cost alternative, our analysis shows that the n-gram method consistently achieves the highest F1-Score, performing competitively with permutation-Q. We also refine these techniques to support instance-level detection and reduce computational overhead. Leveraging the best-performing method, we create cleaned versions of MMLU and HellaSwag, and re-evaluate several LLMs. Our findings present a practical path toward more reliable and transparent evaluations, and we recommend contamination checks as a standard practice before releasing benchmark results.
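For readers unfamiliar with n-gram-based leakage checks of the kind compared above, a generic sketch (not the paper's exact method; the n value and example strings are assumptions):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(benchmark_text, training_text, n=8):
    """Fraction of the benchmark item's n-grams that also occur in the
    training text; values near 1.0 suggest the item leaked into training."""
    bench = ngrams(benchmark_text.lower().split(), n)
    train = ngrams(training_text.lower().split(), n)
    if not bench:
        return 0.0
    return len(bench & train) / len(bench)

question = "Which planet in the solar system has the largest number of moons ?"
corpus = "... which planet in the solar system has the largest number of moons ? saturn ..."
print(ngram_overlap(question, corpus, n=8))   # high overlap -> possible leakage
```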

pdf bib
Reliable Inline Code Documentation with LLMs: Fine-Grained Evaluation of Comment Quality and Coverage
Rohan Patil | Gaurav Tirodkar | Shubham Gatfane

Code documentation plays a vital role in enhancing collaboration, maintainability, and comprehension throughout the software development lifecycle. This becomes especially critical in legacy codebases, where missing or outdated comments hinder effective debugging and onboarding. Among documentation types, inline comments are particularly valuable for conveying program logic and supporting code reuse. With the growing capabilities of large language models (LLMs), their application to tasks such as code understanding and summarization has gained significant attention in the NLP community. However, the specific task of generating high-quality inline code comments using LLMs remains relatively under-explored. In this work, we conduct a systematic evaluation of several state-of-the-art LLMs to assess their effectiveness in producing meaningful and context-aware inline documentation. To this end, we curate a dataset of well-documented code snippets and propose a fine-grained evaluation framework that assesses both the quality and sufficiency of generated comments at the statement level. We further investigate the impact of prompting strategies and offer a comparative analysis across models ranging from large foundational LLMs to smaller, code-specialized variants, within the domain of inline code documentation. Our findings offer actionable insights that can guide the development of effective and scalable systems for automated inline code documentation.

pdf bib
Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora
Stefanie Urchs | Veronika Thurner | Matthias Aßenmacher | Christian Heumann | Stephanie Thiemichen

Language corpora are the foundation of most natural language processing research, yet they often reproduce structural inequalities. One such inequality is gender discrimination in how actors are represented, which can distort analyses and perpetuate discriminatory outcomes. This paper introduces a user-centric, actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. By combining discourse-aware analysis with metrics for sentiment, syntactic agency, and quotation styles, our method enables both fine-grained auditing and exclusion-based balancing. Applied to the taz2024full corpus of German newspaper articles (1980–2024), the pipeline yields a more gender-balanced dataset while preserving core dynamics of the source material. Our findings show that structural asymmetries can be reduced through systematic filtering, though subtler biases in sentiment and framing remain. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.

pdf bib
Between the Drafts: An Evaluation Framework for Identifying Quality Improvement and Stylistic Differences in Scientific Texts
Danqing Chen | Ingo Weber | Felix Dietrich

This study explores the potential of a lightweight, open-source Large Language Model (LLM), demonstrating how its integration with Retrieval-Augmented Generation (RAG) can support cost-effective evaluation of revision quality and writing style differentiation. By retrieving reference documents from a carefully chosen and constructed corpus of peer-reviewed conference proceedings, our framework leverages few-shot in-context learning to track manuscript revisions and venue-specific writing styles. We demonstrate that the LLM-based evaluation aligns closely with human revision histories—consistently recognizing quality improvements across revision stages and distinguishing writing styles associated with different conference venues. These findings highlight how a carefully designed evaluation framework, integrated with adequate, representative data, can advance automated assessment of scientific writing.

pdf bib
“The dentist is an involved parent, the bartender is not”: Revealing Implicit Biases in QA with Implicit BBQ
Aarushi Wagh | Saniya Srivastava

Existing benchmarks evaluating biases in large language models (LLMs) primarily rely on explicit cues, declaring protected attributes like religion, race, gender by name. However, real-world interactions often contain implicit biases, inferred subtly through names, cultural cues, or traits. This critical oversight creates a significant blind spot in fairness evaluation. We introduce ImplicitBBQ, a benchmark extending the Bias Benchmark for QA (BBQ) with implicitly cued protected attributes across 6 categories. Our evaluation of GPT-4o on ImplicitBBQ reveals a troubling performance disparity relative to explicit BBQ prompts, with accuracy declining by up to 7% in the “sexual orientation” subcategory and consistent declines across most other categories. This indicates that current LLMs contain implicit biases undetected by explicit benchmarks. ImplicitBBQ offers a crucial tool for nuanced fairness evaluation in NLP.

pdf bib
SynClaimEval: A Framework for Evaluating the Utility of Synthetic Data in Long-Context Claim Verification
Mohamed Elaraby | Jyoti Prakash Maheswari

Large Language Models (LLMs) with extended context windows promise direct reasoning over long documents, reducing the need for chunking or retrieval. Constructing annotated resources for training and evaluation, however, remains costly. Synthetic data offers a scalable alternative, and we introduce SynClaimEval, a framework for evaluating synthetic data utility in long-context claim verification—a task central to hallucination detection and fact-checking. Our framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error type variation; and (iii) explanation quality, measuring the degree to which model explanations provide evidence consistent with predictions. Experiments across benchmarks show that long-context synthesis can improve verification in base instruction-tuned models, particularly when augmenting existing human-written datasets. Moreover, synthesis enhances explanation quality, even when verification scores don’t improve, underscoring its potential to strengthen both performance and explainability.

pdf bib
Evaluation of Generated Poetry
David Mareček | Kateřina Motalík Hodková | Tomáš Musil | Rudolf Rosa

We propose a range of automated metrics for the evaluation of generated poetry. The metrics measure various aspects of poetry: rhyming, metre, syntax, semantics, and the amount of unknown words. In a case study, we implement the metrics for Czech, apply them to poetry generated by several automated systems as well as to human-written poetry, and correlate them with human judgment. We find that most of the proposed metrics correlate well with the corresponding human evaluation, but semantically oriented metrics are much better predictors of the overall impression than metrics evaluating formal properties.
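As a toy illustration of a formal poetry metric and its correlation with human judgment (the actual Czech metrics in the paper are more sophisticated; everything below is a hypothetical stand-in):

```python
from scipy.stats import spearmanr

def end_rhyme_rate(poem_lines, suffix_len=3):
    """Crude rhyme proxy: share of consecutive line pairs whose last words
    end with the same few characters."""
    def tail(line):
        words = line.strip().lower().split()
        return words[-1][-suffix_len:] if words else ""
    pairs = list(zip(poem_lines, poem_lines[1:]))
    if not pairs:
        return 0.0
    return sum(tail(a) == tail(b) and tail(a) != "" for a, b in pairs) / len(pairs)

poem = ["The night was cold and deep", "The stars refused to sleep",
        "A silver moon above", "Was shining bright with love"]
print(end_rhyme_rate(poem))                       # ~0.67 for this toy poem

# Correlate an automatic metric with human overall-impression ratings.
metric_scores = [0.9, 0.2, 0.6, 0.1, 0.8]
human_ratings = [4.5, 2.0, 3.5, 1.5, 4.0]
rho, p = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```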

pdf bib
TitleTrap: Probing Presentation Bias in LLM-Based Scientific Reviewing
Shurui Du

Large language models (LLMs) are now used in scientific peer review, but their judgments can still be influenced by how information is presented. We study how the style of a paper’s title affects the way LLMs score scientific work. To control for content variation, we build the TitleTrap benchmark using abstracts generated by a language model for common research topics in computer vision and NLP. Each abstract is paired with three titles: a branded colon style, a plain descriptive style, and an interrogative style, while the abstract text remains fixed. We ask GPT-4o and Claude to review these title–abstract pairs under the same instructions. Our results show that title style alone can change the scores: branded titles often receive higher ratings, while interrogative titles sometimes lead to lower assessments of rigor. These findings reveal a presentation bias in LLM-based peer review and suggest the need for better methods to reduce such bias and support fairer automated evaluation.

pdf bib
Beyond the Rubric: Cultural Misalignment in LLM Benchmarks for Sexual and Reproductive Health
Sumon Kanti Dey | Manvi S | Zeel Mehta | Meet Shah | Unnati Agrawal | Suhani Jalota | Azra Ismail

Large Language Models (LLMs) have been positioned as having the potential to expand access to health information in the Global South, yet their evaluation remains heavily dependent on benchmarks designed around Western norms. We present insights from a preliminary benchmarking exercise with a chatbot for sexual and reproductive health (SRH) for an underserved community in India. We evaluated the chatbot using HealthBench, a benchmark for conversational health models by OpenAI. We extracted 637 SRH queries from the dataset and evaluated the chatbot on the 330 single-turn conversations. Responses were evaluated using HealthBench’s rubric-based automated grader, which rated responses consistently low. However, qualitative analysis by trained annotators and public health experts revealed that many responses were actually culturally appropriate and medically accurate. We highlight recurring issues, particularly a Western bias in areas such as legal framing and norms (e.g., breastfeeding in public), diet assumptions (e.g., fish being safe to eat during pregnancy), and costs (e.g., insurance models). Our findings demonstrate the limitations of current benchmarks in capturing the effectiveness of systems built for different cultural and healthcare contexts. We argue for the development of culturally adaptive evaluation frameworks that meet quality standards while recognizing the needs of diverse populations.

pdf bib
Non-Determinism of “Deterministic” LLM System Settings in Hosted Environments
Berk Atıl | Sarp Aykent | Alexa Chittams | Lisheng Fu | Rebecca J. Passonneau | Evan Radcliffe | Guru Rajan Rajagopal | Adam Sloan | Tomasz Tudrej | Ferhan Ture | Zhe Wu | Lixinyu Xu | Breck Baldwin

LLM (large language model) users of hosted providers commonly notice that outputs can vary for the same inputs under settings expected to be deterministic. While it is difficult to get exact statistics, recent reports on specialty news sites and discussion boards suggest that among users in all communities, the majority of LLM usage today is through cloud-based APIs. Yet the questions of how pervasive non-determinism is, and how much it affects performance results, have not to our knowledge been systematically investigated. We apply five API-based LLMs configured to be deterministic to eight diverse tasks across 10 runs. Experiments reveal accuracy variations of up to 15% across runs, with a gap of up to 70% between best possible performance and worst possible performance. No LLM consistently delivers the same outputs or accuracies, regardless of task. We speculate about the sources of non-determinism such as input buffer packing across multiple jobs. To better quantify our observations, we introduce metrics focused on quantifying determinism: TARr@N for the total agreement rate at N runs over raw output, and TARa@N for the total agreement rate of parsed-out answers. Our code and data will be publicly available at https://github.com/Anonymous.
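A minimal reading of the TAR metrics described above, shown as code; the exact definitions are the paper's, and the answer parser here is a hypothetical example:

```python
def total_agreement_rate(outputs_per_item, parse=lambda s: s):
    """Fraction of items for which all N runs produce identical output.
    With `parse` left as identity this corresponds to a TARr@N-style check
    over raw outputs; passing an answer-extraction function gives a
    TARa@N-style check over parsed-out answers."""
    agree = sum(len({parse(o) for o in runs}) == 1 for runs in outputs_per_item)
    return agree / len(outputs_per_item)

runs_per_question = [
    ["Answer: B", "Answer: B", "Answer: B"],          # identical raw outputs
    ["Answer: C", "The answer is C.", "Answer: C"],   # same answer, different text
]
extract = lambda s: s.rstrip(".").split()[-1]          # crude answer parser
print(total_agreement_rate(runs_per_question))                 # 0.5 (raw outputs)
print(total_agreement_rate(runs_per_question, parse=extract))  # 1.0 (parsed answers)
```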

pdf bib
InFiNITE: Indian Financial Narrative Inference Tasks & Evaluations
Sohom Ghosh | Arnab Maji | Sudip Kumar Naskar

This paper introduces Indian Financial Narrative Inference Tasks and Evaluations (InFiNITE), a comprehensive framework for analyzing India’s financial narratives through three novel inference tasks. Firstly, we present multi-modal earnings call analysis by integrating transcripts, presentation visuals, and market indicators via the Multi-Modal Indian Earnings Calls (MiMIC) dataset, enabling holistic prediction of post-call stock movements. Secondly, our Budget-Assisted Sectoral Impact Ranking (BASIR) dataset aids in systematically decoding government fiscal narratives by classifying budget excerpts into 81 economic sectors and evaluating their post-announcement equity performance. Thirdly, we introduce Bharat IPO Rating (BIR) datasets to redefine Initial Public Offering (IPO) evaluation through prospectus analysis, classifying potential investments into four recommendation categories (Apply, May Apply, Neutral, Avoid). By unifying textual, visual, and quantitative modalities across corporate, governmental, and public investment domains, InFiNITE addresses critical gaps in Indian financial narrative analysis. The open source data sets of the framework, including earnings calls, union budgets, and IPO prospectuses, establish benchmark resources specific to India for computational economic research under permissive licenses. For investors, InFiNITE enables data-driven identification of capital allocation opportunities and IPO risks, while policymakers gain structured insights to assess Indian fiscal communication impacts. By releasing these datasets publicly, we aim to facilitate research in computational economics and financial text analysis, particularly for the Indian market.

pdf bib
Test Set Quality in Multilingual LLM Evaluation
Chalamalasetti Kranti | Gabriel Bernier-Colborne | Yvan Gauthier | Sowmya Vajjala

Several multilingual benchmark datasets have been developed in a semi-automatic manner in the recent past to measure progress and understand the state-of-the-art in the multilingual capabilities of Large Language Models (LLMs). However, little attention has been paid to the quality of the datasets themselves, despite previous work identifying errors even in fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages – French and Telugu – identifying several errors in the datasets in the process. We compare the performance difference across several LLMs with the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and potentially versioned. We end with recommendations for both dataset creators and consumers on addressing dataset quality issues.

up

pdf (full)
bib (full)
Proceedings of the 1st Workshop on NLP for Empowering Justice (JUST-NLP 2025)

pdf bib
Proceedings of the 1st Workshop on NLP for Empowering Justice (JUST-NLP 2025)
Ashutosh Modi | Saptarshi Ghosh | Asif Ekbal | Pawan Goyal | Sarika Jain | Abhinav Joshi | Shivani Mishra | Debtanu Datta | Shounak Paul | Kshetrimayum Boynao Singh | Sandeep Kumar

pdf bib
Overview of the 1st Workshop on NLP for Empowering Justice
Ashutosh Modi | Saptarshi Ghosh | Asif Ekbal | Pawan Goyal | Sarika Jain | Abhinav Joshi | Shivani Mishra | Debtanu Datta | Shounak Paul | Kshetrimayum Boynao Singh | Sandeep Kumar

The first iteration of the JUST-NLP: Workshop on NLP for Empowering Justice was organized to accelerate research in Natural Language Processing for legal text processing. The inaugural edition, JUST-NLP 2025, was held as a hybrid event at IJCNLP-AACL 2025 on December 24 at IIT Bombay. The program featured a research track, four invited talks, and two shared tasks: (1) L-SUMM, an abstractive summarization task for Indian legal judgments, and (2) L-MT, a legal machine translation task between English and Hindi. The workshop received strong interest from the community, with 29 submissions, of which 21 were accepted. Among the accepted papers, 5 were regular research-track papers published in the proceedings, and 2 were accepted as non-archival presentations. For the shared tasks, 9 papers were accepted for L-SUMM, and 5 papers were accepted for L-MT, for publication in the proceedings. The workshop focused on a broad set of Legal NLP challenges, including information extraction, retrieval, multilingual processing, legal reasoning, and applications of large language models. Overall, JUST-NLP 2025 aimed to bring together AI researchers and legal practitioners to develop scalable, domain-aware NLP methods that can support legal workflows and contribute toward more efficient and equitable justice systems.

pdf bib
Findings of the JUST-NLP 2025 Shared Task on Summarization of Indian Court Judgments
Debtanu Datta | Shounak Paul | Kshetrimayum Boynao Singh | Sandeep Kumar | Abhinav Joshi | Shivani Mishra | Sarika Jain | Asif Ekbal | Pawan Goyal | Ashutosh Modi | Saptarshi Ghosh

This paper presents an overview of the Shared Task on Summarization of Indian Court Judgments (L-SUMM), hosted by the JUST-NLP 2025 Workshop at IJCNLP-AACL 2025. This task aims to increase research interest in automatic summarization techniques for lengthy and intricate legal documents from the Indian judiciary. It particularly addresses court judgments that contain dense legal reasoning and semantic roles that must be preserved in summaries. As part of this shared task, we introduce the Indian Legal Summarization (L-SUMM) dataset, comprising 1,800 Indian court judgments paired with expert-written abstractive summaries, both in English. Therefore, the task focuses on generating high-quality abstractive summaries of court judgments in English. A total of 9 teams participated in this task, exploring a diverse range of methodologies, including transformer-based models, extractive-abstractive hybrids, graph-based ranking approaches, long-context LLMs, and rhetorical-role-based techniques. This paper describes the task setup, dataset, evaluation framework, and our findings. We report the results and highlight key trends across participant approaches, including the effectiveness of hybrid pipelines and challenges in handling extreme sequence lengths.

pdf bib
Findings of the JUST-NLP 2025 Shared Task on English-to-Hindi Legal Machine Translation
Kshetrimayum Boynao Singh | Sandeep Kumar | Debtanu Datta | Abhinav Joshi | Shivani Mishra | Shounak Paul | Pawan Goyal | Sarika Jain | Saptarshi Ghosh | Ashutosh Modi | Asif Ekbal

This paper provides an overview of the Shared Task on Legal Machine Translation (L-MT), organized as part of the JUST-NLP 2025 Workshop at IJCNLP-AACL 2025, aimed at improving the translation of legal texts, a domain where precision, structural faithfulness, and terminology preservation are essential. The training set comprises 50,000 sentences, with 5,000 sentences each for the validation and test sets. The submissions employed strategies such as domain-adaptive fine-tuning of multilingual models, QLoRA-based parameter-efficient adaptation, curriculum-guided supervised training, reinforcement learning with verifiable MT metrics, and from-scratch Transformer training. The systems are evaluated using the BLEU, METEOR, TER, chrF++, BERTScore, and COMET metrics. We also combine the scores of these metrics into an average score (AutoRank). The top-performing system is based on a fine-tuned distilled NLLB-200 model and achieved the highest AutoRank score of 72.1. Domain adaptation consistently yielded substantial improvements over baseline models, and precision-focused rewards proved especially effective for legal MT. The findings also highlight that large multilingual Transformers can deliver accurate and reliable English-to-Hindi legal translations when carefully fine-tuned on legal data, advancing the broader goal of improving access to justice in multilingual settings.
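The precise AutoRank formula is not spelled out in this overview; as a hedged illustration only, one plausible way to combine heterogeneous MT metrics into a single average score is to min-max normalise each metric across systems (inverting TER, where lower is better) and average:

```python
import numpy as np

def autorank_like_score(system_scores, lower_is_better=("TER",)):
    """Illustrative combination of several MT metrics into one average score.
    system_scores: {system_name: {metric_name: value}}.
    This is NOT the shared task's actual AutoRank definition."""
    systems = list(system_scores)
    metrics = list(next(iter(system_scores.values())))
    combined = {s: 0.0 for s in systems}
    for m in metrics:
        vals = np.array([system_scores[s][m] for s in systems], dtype=float)
        norm = (vals - vals.min()) / (vals.max() - vals.min() + 1e-9)
        if m in lower_is_better:
            norm = 1.0 - norm          # lower TER is better, so invert
        for s, v in zip(systems, norm):
            combined[s] += v / len(metrics)
    return {s: round(100 * v, 2) for s, v in combined.items()}

scores = {
    "sysA": {"BLEU": 51.6, "chrF++": 73.3, "TER": 37.1},
    "sysB": {"BLEU": 46.7, "chrF++": 70.0, "TER": 41.5},
}
print(autorank_like_score(scores))
```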

pdf bib
LeCNet: A Legal Citation Network Benchmark Dataset
Pooja Harde | Bhavya Jain | Sarika Jain

Legal document analysis is pivotal in modern judicial systems, particularly for case retrieval, classification, and recommendation tasks. Graph neural networks (GNNs) have revolutionized legal use cases by enabling the efficient analysis of complex relationships. Although existing legal citation network datasets have significantly advanced research in this domain, the lack of large-scale open-source datasets tailored to the Indian judicial system has limited progress. To address this gap, we present the Indian Legal Citation Network (LeCNet) - the first open-source benchmark dataset for the link prediction task (missing citation recommendation) in the Indian judicial context. The dataset has been created by extracting information from the original judgments. LeCNet comprises 26,308 nodes representing case judgments and 67,108 edges representing citation relationships between the case nodes. Each node is described with rich features of document embeddings that incorporate contextual information from the case documents. Baseline experiments using various machine learning models were conducted for dataset validation. The Mean Reciprocal Rank (MRR) metric is used for model evaluation. The results obtained demonstrate the utility of the LeCNet dataset, highlighting the advantages of graph-based representations over purely textual models.
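For reference, MRR as typically computed for this kind of citation link-prediction setup (a standard definition, with hypothetical case identifiers):

```python
def mean_reciprocal_rank(ranked_candidates, true_targets):
    """MRR over link-prediction queries: for each query, find the rank of the
    first true cited case in the model's candidate ranking and average the
    reciprocal ranks."""
    rr = []
    for ranking, gold in zip(ranked_candidates, true_targets):
        rank = next((i + 1 for i, c in enumerate(ranking) if c in gold), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

rankings = [["case_12", "case_87", "case_05"],   # query 1: gold found at rank 2
            ["case_44", "case_09", "case_31"]]   # query 2: gold found at rank 1
gold_sets = [{"case_87"}, {"case_44"}]
print(mean_reciprocal_rank(rankings, gold_sets))  # (1/2 + 1/1) / 2 = 0.75
```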

pdf bib
Legal Document Summarization: A Zero-shot Modular Agentic Workflow Approach
Taha Sadikot | Sarika Jain

The increasing volume and complexity of Indian High Court judgments require high-quality automated summarization systems. Our agentic workflow framework for the summarization of Indian High Court judgments achieves competitive results without model fine-tuning. Experiments on CivilSum and IN-Abs test sets report ROUGE-1 F1 up to 0.547 and BERTScore F1 up to 0.866, comparable to state-of-the-art supervised models, with advantages in transparency and efficiency. We introduce two zero-shot modular agentic workflows: Lexical Modular Summarizer (LexA), a three-stage modular architecture optimized for lexical overlap (ROUGE), and Semantic Agentic Summarizer (SemA), a five-stage integrated architecture optimized for semantic similarity (BERTScore). Both workflows operate without supervised model fine-tuning, instead relying on strategic data processing, modular agent orchestration, and carefully engineered prompts. Our framework achieves ROUGE-1 F1 of 0.6326 and BERTScore F1 of 0.8902 on CivilSum test set, and ROUGE-1 F1 of 0.1951 and BERTScore F1 of 0.8299 on IN-Abs test set, substantially outperforming zero-shot baselines, rivaling leading fine-tuned transformer models while requiring no supervised training. This work demonstrates that modular, zero-shot agentic approaches can deliver production-grade results for legal summarization, offering a new direction for resource-limited judicial settings.

pdf bib
LLM Driven Legal Text Analytics: A Case Study For Food Safety Violation Cases
Suyog Joshi | Soumyajit Basu | Lipika Dey | Partha Pratim Das

Despite comprehensive food safety regulations worldwide, violations continue to pose significant public health challenges. This paper presents an LLM-driven pipeline for analyzing legal texts to identify structural and procedural gaps in food safety enforcement. We develop an end-to-end system that leverages Large Language Models to extract structured entities from legal judgments, construct statute-and-provision-level knowledge graphs, and perform semantic clustering of cases. Applying our approach to 782 Indian food safety violation cases filed between 2022 and 2024, we uncover critical insights: 96% of cases were filed by individuals and organizations against state authorities, with 60% resulting in decisions favoring appellants. Through automated clustering and analysis, we identify major procedural lapses including unclear jurisdictional boundaries between enforcement agencies, insufficient evidence collection, and ambiguous penalty guidelines. Our findings reveal concrete weaknesses in current enforcement practices and demonstrate the practical value of LLMs for legal analysis at scale.

pdf bib
Grahak-Nyay: Consumer Grievance Redressal through Large Language Models
Shrey Ganatra | Swapnil Bhattacharyya | Harshvivek Kashid | Spandan Anaokar | Shruthi N Nair | Reshma Sekhar | Siddharth Manohar | Rahul Hemrajani | Pushpak Bhattacharyya

Access to consumer grievance redressal in India is often hindered by procedural complexity, legal jargon, and jurisdictional challenges. To address this, we present Grahak-Nyay (Justice-to-Consumers), a chatbot that streamlines the process using open-source Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Grahak-Nyay simplifies legal complexities through a concise and up-to-date knowledge base. We introduce three novel datasets: GeneralQA (general consumer law), SectoralQA (sector-specific knowledge) and SyntheticQA (for RAG evaluation), along with NyayChat, a dataset of 303 annotated chatbot conversations. We also introduce Judgments data sourced from Indian Consumer Courts to aid the chatbot in decision making and to enhance user trust. We also propose HAB metrics (Helpfulness, Accuracy, Brevity) to evaluate chatbot performance. Legal domain experts validated Grahak-Nyay’s effectiveness. Code and datasets are available at https://github.com/ShreyGanatra/GrahakNyay.git.

pdf bib
Nyay-Darpan: Enhancing Decision Making Through Summarization and Case Retrieval for Consumer Law in India
Swapnil Bhattacharyya | Harshvivek Kashid | Shrey Ganatra | Spandan Anaokar | Reshma Sekhar | Shruthi N Nair | Siddharth Manohar | Rahul Hemrajani | Pushpak Bhattacharyya

AI-based judicial assistance and case prediction have been extensively studied in criminal and civil domains, but remain largely unexplored in consumer law, especially in India. In this paper, we present Nyay-Darpan, a novel two-in-one framework that (i) summarizes consumer case files and (ii) retrieves similar case judgements to aid decision-making in consumer dispute resolution. Our methodology not only addresses the gap in consumer law AI tools, but also introduces an innovative approach to evaluate the quality of the summary. The term ‘Nyay-Darpan’ translates into ‘Mirror of Justice’, symbolizing the ability of our tool to reflect the core of consumer disputes through precise summarization and intelligent case retrieval. Our system achieves over 75 percent precision in similar case prediction and approximately 70 percent accuracy across material summary evaluation metrics, demonstrating its practical effectiveness. We will publicly release the Nyay-Darpan framework and dataset to promote reproducibility and facilitate further research in this underexplored yet impactful domain.

pdf bib
Cold Starts and Hard Cases: A Two-Stage SFT-RLVR Approach for Legal Machine Translation (Just-NLP L-MT shared task)
Pawitsapak Akarajaradwong | Chompakorn Chaksangchaichot

This paper details our system for the JUST-NLP 2025 Shared Task on English-to-Hindi Legal Machine Translation. We propose a novel two-stage, data-centric approach. First, we annotate the training data by translation difficulty and create easy and hard subsets. We perform SFT on the easier subset to establish a robust “cold start”. Then, we apply RLVR exclusively on the harder subset, using machine translation metrics as reward signals. This strategy allowed our system to significantly outperform strong baselines, demonstrating the capability of our systems for machine translation tasks. Source code and model weights are available at https://github.com/ppaolong/FourCorners-JustNLP-MT-Shared-Task

pdf bib
Contextors at L-SUMM: Retriever-Driven Multi-Generator Summarization
Pavithra Neelamegam | S Jaya Nirmala

Indian court judgments are very difficult to summarize automatically because of their length, complex legal reasoning, and scattered important information. This paper outlines the methodology used for the Legal Summarization (L-SUMM) shared task at the JUST-NLP 2025 Workshop, which aims to produce logical, concise, and factually accurate abstractive summaries of roughly 500 words from English-language Indian court rulings. The paper proposes a Retriever-Driven Multi-Generator Summarization framework that combines a semantic retriever with fine-tuned encoder-decoder models (BART, Pegasus, and LED) to enhance legal document summarization. This pipeline uses cosine similarity analysis to improve summary faithfulness, cross-model validation to guarantee factual consistency, and iterative retrieval expansion to choose relevant text chunks, in order to address document length and reduce hallucinations. Despite being limited to 400–500 words, the generated summaries successfully convey legal reasoning. Our team, Contextors, achieved an average score of 22.51, ranking 4th out of 9 on the L-SUMM shared task leaderboard, demonstrating the efficacy of the Retriever-Driven Multi-Generator Summarization approach, which improves transparency, accessibility, and effective understanding of legal documents. The method shows strong content coverage and coherence when assessed with ROUGE-2, ROUGE-L, and BLEU.

pdf bib
A Budget Recipe for Finetuning a Long-form Legal Summarization Model
Chompakorn Chaksangchaichot | Pawitsapak Akarajaradwong

We describe an inexpensive system that ranked first in the JUST-NLP 2025 L-SUMM task, summarizing very long Indian court judgments (up to 857k characters) using a single 80GB GPU and a total budget of about $50. Our pipeline first filters out length–summary outliers, then applies two-stage LoRA SFT on Qwen3-4B-Instruct-2507 to learn style and extend context, and finally runs RLVR tuned to BLEU, ROUGE-2, and ROUGE-L, with BLEU upweighted. We show that two-stage SFT is better than a single-stage run, and that RLVR gives the largest gains, reaching 32.71 internally vs. 16.15 for the base model and 29.91 on the test leaderboard. In a prompting ablation, we find that a simple, naive prompt converges faster but saturates earlier, while the curated legal-structured prompt keeps improving with longer training and yields higher final scores; the fine-tuned model also remains fairly robust to unseen prompts. Our code is fully open-sourced for reproducibility.
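A sketch of what a verifiable reward combining BLEU, ROUGE-2, and ROUGE-L with BLEU upweighted could look like; the authors' actual weights and normalisation are not given here, so the values below are assumptions:

```python
def summary_reward(bleu, rouge2, rougeL, bleu_weight=2.0):
    """Weighted average of metric scores (all assumed to be in [0, 1]),
    usable as a verifiable reward signal during RLVR. `bleu_weight` is an
    assumed value for illustration, not the authors' setting."""
    weights = {"bleu": bleu_weight, "rouge2": 1.0, "rougeL": 1.0}
    total = (weights["bleu"] * bleu
             + weights["rouge2"] * rouge2
             + weights["rougeL"] * rougeL)
    return total / sum(weights.values())

# A candidate summary scoring BLEU=0.16, ROUGE-2=0.25, ROUGE-L=0.33:
print(round(summary_reward(0.16, 0.25, 0.33), 3))  # 0.225
```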

pdf bib
SCaLAR_NITK @ JUSTNLP Legal Summarization (L-SUMM) Shared Task
Arjun T D | Anand Kumar Madasamy

This paper presents the systems we submitted to the JUST-NLP 2025 Shared Task on Legal Summarization (L-SUMM). Creating abstractive summaries of lengthy Indian court rulings is challenging due to transformer token limits. To address this problem, we compare three systems built on a fine-tuned Legal Pegasus model. System 1 (Baseline) applies a standard hierarchical framework that chunks long documents using naive token-based segmentation. System 2 (RR-Chunk) improves this approach by using a BERT-BiLSTM model to tag sentences with rhetorical roles (RR) and incorporating these tags (e.g., [Facts]) to enable structurally informed chunking for hierarchical summarization. System 3 (WRR-Tune) tests whether explicit importance cues help the model by assigning importance scores to each RR using the geometric mean of their distributional presence in judgments and human summaries, and fine-tuning a separate model on text augmented with these tags (e.g., [Facts, importance score 13.58]). A comparison of the three systems demonstrates the value of progressively adding structural and quantitative importance signals to the model’s input.

pdf bib
goodmen @ L-MT Shared Task: A Comparative Study of Neural Models for English-Hindi Legal Machine Translation
Deeraj S K | Karthik Suryanarayanan | Yash Ingle | Pruthwik Mishra

In a massively multilingual country like India, providing legal judgments in understandable native languages is essential for equitable justice to all. The Legal Machine Translation (L-MT) shared task focuses on translating legal content from English to Hindi, which is the most spoken language in India. We present a comprehensive evaluation of neural machine translation models for English-Hindi legal document translation, developed as part of the L-MT shared task. We investigate four multilingual and Indic-focused translation systems. Our approach emphasizes domain-specific fine-tuning on legal corpora while preserving statutory structure, legal citations, and jurisdictional terminology. We fine-tune two legal-focused translation models, InLegalTrans and IndicTrans2, on the English-Hindi legal parallel corpus provided by the organizers, where the use of any external data is constrained. The fine-tuned InLegalTrans model achieves the highest BLEU score of 0.48. Comparative analysis reveals that domain adaptation through fine-tuning on legal corpora significantly enhances translation quality for specialized legal texts. Human evaluation confirms superior coherence and judicial tone preservation in InLegalTrans outputs. Our best performing model is ranked 3rd on the test data.

pdf bib
NIT-Surat@L-Sum: A Semantic Retrieval-Based Framework for Summarizing Indian Judicial Documents
Nita Jadav | Ashok Urlana | Pruthwik Mishra

The Legal Summarization (L-SUMM) shared task focuses on generating abstractive summaries of Indian court judgments in English. This task presents unique challenges in producing fluent, relevant, and legally appropriate summaries given voluminous judgment texts. We experiment with different sequence-to-sequence models and present a comprehensive comparative study of their performance. We also evaluate various Large Language Models (LLMs) in zero-shot settings to test their summarization capabilities. Our best-performing system fine-tunes a pre-trained legal summarization model, with relevant passages identified using the maximum marginal relevance (MMR) technique. Our findings highlight that retrieval-augmented fine-tuning is an effective approach for generating precise and concise legal summaries. We ranked 5th overall.
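A generic MMR selection sketch over pre-computed passage embeddings, illustrating the retrieval step described above (embedding sources and parameters are hypothetical, not the authors' exact setup):

```python
import numpy as np

def mmr_select(query_vec, passage_vecs, k=5, lam=0.7):
    """Maximum marginal relevance: iteratively pick passages that are similar
    to the query but dissimilar to passages already selected."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    candidates = list(range(len(passage_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cos(query_vec, passage_vecs[i])
            redundancy = max((cos(passage_vecs[i], passage_vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of passages to feed to the summarizer

rng = np.random.default_rng(0)
passages = rng.normal(size=(10, 32))   # stand-in passage embeddings
query = rng.normal(size=32)            # stand-in query embedding
print(mmr_select(query, passages, k=3))
```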

pdf bib
Adapting IndicTrans2 for Legal Domain MT via QLoRA Fine-Tuning at JUST-NLP 2025
Akoijam Jenil Singh | Loitongbam Sanayai Meetei | Yumnam Surajkanta

Machine Translation (MT) in the legal domain presents substantial challenges due to its complex terminology, lengthy statutes, and rigid syntactic structures. The JUST-NLP 2025 Shared Task on Legal Machine Translation was organized to advance research on domain-specific MT systems for legal texts. In this work, we propose a fine-tuned version of the pretrained large language model (LLM) ai4bharat/indictrans2-en-indic-1B, a transformer-based English-to-Indic translation model. Fine-tuning was performed using the parallel corpus provided by the JUST-NLP 2025 Shared Task organizers. Our adapted model demonstrates notable improvements over the baseline system, particularly in handling domain-specific legal terminology and complex syntactic constructions. In automatic evaluation, our system obtained BLEU = 46.67 and chrF = 70.03. In human evaluation, it achieved adequacy = 4.085 and fluency = 4.006. Our approach achieved an AutoRank score of 58.79, highlighting the effectiveness of domain adaptation through fine-tuning for legal machine translation.

pdf bib
Team-SVNIT at JUST-NLP 2025: Domain-Adaptive Fine-Tuning of Multilingual Models for English–Hindi Legal Machine Translation
Rupesh Dhakad | Naveen Kumar | Shrikant Malviya

Translating sentences between English and Hindi is challenging, especially in the domain of legal documents. The main reasons for this complexity are specialized legal terminology, long and complex sentences, and strict accuracy constraints. This paper presents a system developed by Team-SVNIT for the JUST-NLP 2025 shared task on legal machine translation. We fine-tune and compare multiple pretrained multilingual translation models, including facebook/nllb-200-distilled-1.3B, on a corpus of 50,000 English–Hindi legal sentence pairs provided for the shared task. The training pipeline includes preprocessing, context windows of 512 tokens, and decoding methods to enhance translation quality. The proposed method secured 1st place on the official leaderboard with an AutoRank score of 61.62. We obtained the following scores on various metrics: BLEU 51.61, METEOR 75.80, TER 37.09, chrF++ 73.29, BERTScore 92.61, and COMET 76.36. These results demonstrate that fine-tuning multilingual models for a domain-specific machine translation task enhances performance and outperforms general multilingual translation systems.

pdf bib
Combining Extractive and Generative Methods for Legal Summarization: BLANCKED at JUST-NLP 2025
Erich Giusseppe Soto Parada | Carlos Manuel Muñoz Almeida | David Cuevas Alba

This paper presents Tayronas Trigrams' methodology and findings from our participation in the JUST-NLP 2025 Shared Task on Legal Summarization (L-SUMM), which focused on generating abstractive summaries of lengthy Indian court judgments. Our initial approach involved evaluating and fine-tuning specialized sequence-to-sequence models such as Legal-Pegasus, Indian Legal LED, and BART. We found that these small generative models, even after fine-tuning on the limited InLSum dataset (1,200 training examples), delivered performance (e.g., Legal-Pegasus AVG score: 16.50) significantly below expectations. Consequently, our final, best-performing method was a hybrid extractive-abstractive pipeline. This approach first employed the extractive method PACSUM to select the most important sentences, yielding an initial AVG score of 20.04, and then utilized a Large Language Model (specifically Gemini 2.5 Pro), appropriately prompted, to perform the final abstractive step by seamlessly stitching the extracted chunks together and ensuring coherence. This hybrid strategy achieved an average ROUGE-2 of 21.05, ROUGE-L of 24.35, and BLEU of 15.12, securing 7th place in the competition. Our key finding is that, under data scarcity, a two-stage hybrid approach dramatically outperforms end-to-end abstractive fine-tuning of smaller models.

pdf bib
Automatic Legal Judgment Summarization Using Large Language Models: A Case Study for the JUST-NLP 2025 Shared Task
Santiago Chica

This paper presents the proposal developed for the JUST-NLP 2025 Shared Task on Legal Summarization, which aims to generate abstractive summaries of Indian court judgments. The work describes the motivation, dataset analysis, related work, and proposed methodology based on Large Language Models (LLMs). We analyze the Indian Legal Summarization (InLSum) dataset, review four relevant articles in the summarization of legal texts, and describe the experimental setup involving GPT-4.1 to evaluate the effectiveness of different prompting strategies. The evaluation will follow the ROUGE and BLEU metrics, consistent with the competition protocol.

pdf bib
Structure-Aware Chunking for Abstractive Summarization of Long Legal Documents
Himadri Sonowal | Saisab Sadhu

The efficacy of state-of-the-art abstractive summarization models is severely constrained by the extreme document lengths of legal judgments, which consistently surpass their fixed input capacities. The prevailing method, naive sequential chunking, is a discourse-agnostic process that induces context fragmentation and degrades summary coherence. This paper introduces Structure-Aware Chunking (SAC), a rhetorically-informed pre-processing pipeline that leverages the intrinsic logical structure of legal documents. We partition judgments into their constituent rhetorical strata—Facts, Arguments & Analysis, and Conclusion—prior to the summarization pass. We present and evaluate two SAC instantiations: a computationally efficient heuristic-based segmenter and a semantically robust LLM-driven approach. Empirical validation on the JUST-NLP 2025 L-SUMM shared task dataset reveals a nuanced trade-off: while our methods improve local, n-gram based metrics (ROUGE-2), they struggle to maintain global coherence (ROUGE-L). We identify this “coherence gap” as a critical challenge in chunk-based summarization and show that advanced LLM-based segmentation begins to bridge it.

pdf bib
From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation
Amit Barman | Atanu Mandal | Sudip Kumar Naskar

In multilingual nations like India, access to legal information is often hindered by language barriers, as much of the legal and judicial documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to this challenge by enabling accurate and accessible translations of legal documents. This paper presents our work for the JUST-NLP 2025 Legal MT shared task, focusing on English–Hindi translation using Transformer-based approaches. We experiment with two complementary strategies: fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation and training a Transformer model from scratch using the provided legal corpus. Performance is evaluated using standard MT metrics, including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET. Our fine-tuned OPUS-MT model achieves a SacreBLEU score of 46.03, significantly outperforming both the baseline and from-scratch models. The results highlight the effectiveness of domain adaptation in enhancing translation quality and demonstrate the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.

pdf bib
Integrating Graph based Algorithm and Transformer Models for Abstractive Summarization
Sayed Ayaan Ahmed Sha | Sangeetha Sivanesan | Anand Kumar Madasamy | Navya Binu

Summarizing legal documents is a challenging and critical task in the field of Natural Language Processing (NLP). On top of that, generating abstractive summaries for legal judgments poses a significant challenge to researchers because of the limited number of input tokens that various language models can handle. In this paper, we experiment with two models, a BART base model fine-tuned on the CNN/DailyMail dataset combined with TextRank, and pegasus_indian_legal, a version of legal-pegasus fine-tuned on Indian legal judgments, to generate abstractive summaries for Indian legal documents as part of the JUST-NLP 2025 Shared Task on Legal Summarization. BART+TextRank outperformed pegasus_indian_legal with a score of 18.84.

pdf bib
Hierarchical Long-Document Summarization using LED for Legal Judgments
Reshma Sheik | Noah John Puthayathu | Fathima Firose A | Jonathan Paul

This paper describes our system for the L-SUMM shared task on legal document summarization. Our approach is built on the Longformer Encoder-Decoder (LED) model, which we augment with a multi-level summarization strategy tailored for legal documents that are substantially longer than typical transformer input limits. The system achieved competitive performance on the legal judgment summarization task through optimized training strategies, including gradient accumulation, Adafactor optimization, and hyperparameter tuning. Our findings indicate that combining hierarchical processing with strategically assigned global attention enables more reliable summarization of lengthy legal texts.

up

pdf (full)
bib (full)
NLP-AI4Health

pdf bib
NLP-AI4Health
Parameswari Krishnamurthy | Vandan Mujadia | Dipti Misra Sharma | Hannah Mary Thomas

pdf bib
Enhancing Patient-Centric Healthcare Communication Through Multimodal Emotion Recognition: A Transformer-Based Framework for Clinical Decision Support
Vineet Channe

This paper presents a multimodal emotion analysis framework designed to enhance patient-centric healthcare communication and support clinical decision-making. Our system addresses automated patient emotion monitoring during consultations, telemedicine sessions, and mental health screenings by combining audio transcription, facial emotion analysis, and text processing. Using emotion patterns from the CREMA-D dataset as a foundation for healthcare-relevant emotional expressions, we introduce a novel emotion-annotated text format “[emotion] transcript [emotion]” integrating Whisper-based audio transcription with DeepFace facial emotion analysis. We systematically evaluate eight transformer architectures (BERT, RoBERTa, DeBERTa, XLNet, ALBERT, DistilBERT, ELECTRA, and BERT-base) for three-class clinical emotion classification: Distress/Negative (anxiety, fear), Stable/Neutral (baseline), and Engaged/Positive (comfort). Our multimodal fusion strategy achieves 86.8% accuracy with DeBERTa-v3-base, representing a 12.6% improvement over unimodal approaches and meeting clinical requirements for reliable patient emotion detection. Cross-modal attention analysis reveals facial expressions provide crucial disambiguation, with stronger attention to negative emotions (0.41 vs 0.28), aligning with clinical priorities for detecting patient distress. Our contributions include emotion-annotated text representation for healthcare contexts, systematic transformer evaluation for clinical deployment, and a framework enabling real-time patient emotion monitoring and emotionally-aware clinical decision support.
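A trivial sketch of constructing the "[emotion] transcript [emotion]" input format described above; which modality supplies which label is an assumption for illustration:

```python
def annotate_transcript(transcript, audio_emotion, facial_emotion):
    """Build an "[emotion] transcript [emotion]" style input string,
    prefixing the audio-derived label and suffixing the face-derived label.
    The modality-to-position mapping here is assumed, not the paper's."""
    return f"[{audio_emotion}] {transcript.strip()} [{facial_emotion}]"

print(annotate_transcript("I have been feeling very anxious before the surgery.",
                          "fear", "sad"))
# -> "[fear] I have been feeling very anxious before the surgery. [sad]"
```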

pdf bib
MOD-KG: MultiOrgan Diagnosis Knowledge Graph
Anas Anwarul Haq Khan | Pushpak Bhattacharyya

The human body is highly interconnected, where a diagnosis in one organ can influence conditions in others. In medical research, graphs (such as Knowledge Graphs and Causal Graphs) have proven useful for capturing these relationships, but constructing them manually with expert input is both costly and time-intensive, especially given the continuous flow of new findings. To address this, we leverage the extraction capabilities of large language models (LLMs) to build the MultiOrgan Diagnosis Knowledge Graph (MOD-KG). MOD-KG contains over 21,200 knowledge triples, derived from both textbooks (~13%) and carefully selected research papers (with an average of 444 citations each). The graph focuses primarily on the heart, lungs, kidneys, liver, pancreas, and brain, which are central to much of today’s multimodal imaging research. The extraction quality of the LLM was benchmarked against baselines over 1000 samples, demonstrating reliability. We will make our dataset public upon acceptance.

pdf bib
Cross-Lingual Mental Health Ontologies for Indian Languages: Bridging Patient Expression and Clinical Understanding through Explainable AI and Human-in-the-Loop Validation
Ananth Kandala | Ratna Kandala | Akshata Kishore Moharir | Niva Manchanda | Sunaina Singh Rathod

Mental health communication in India is linguistically fragmented, culturally diverse, and often underrepresented in clinical NLP. Current health ontologies and mental health resources are dominated by English or Western-centric diagnostic frameworks, leaving a gap in representing patient distress expressions in Indian languages. We propose the Cross-Lingual Graphs of Patient Distress Expressions (CL-PDE), a framework for building cross-lingual mental health ontologies through graph-based methods that capture culturally embedded expressions of distress, align them across languages, and link them with clinical terminology. Our approach addresses critical gaps in healthcare communication by grounding AI systems in culturally valid representations, enabling more inclusive and patient-centric NLP tools for mental health care in multilingual contexts.

pdf bib
Automated Coding of Counsellor and Client Behaviours in Motivational Interviewing Transcripts: Validation and Application
Armaity Katki | Nathan Choi | Son Sophak Otra | George Flint | Kevin Zhu | Sunishchal Dev

Protein language models (PLMs) are powerful tools for protein engineering, but remain difficult to steer toward specific biochemical properties, where small sequence changes can affect stability or function. We adapt two prominent unsupervised editing methods: task arithmetic (TA; specifically, Forgetting via Negation) in weight space and feature editing with a sparse autoencoder (SAE) in activation space. We evaluate their effects on six biochemical properties of generations from three PLMs (ESM3, ProGen2-Large, and ProLLaMA). Across models, we observe complementary efficacies: TA more effectively controls some properties while SAE more effectively controls others. Property response patterns show some consistence across models. We suggest that the response pattern of biochemical properties should be considered when steering PLMs.

pdf bib
Patient-Centric Question Answering- Overview of the Shared Task at the Second Workshop on NLP and AI for Multilingual and Healthcare Communication
Arun Zechariah | Balu Krishna | Hannah Mary Thomas | Joy Mammen | Dipti Misra Sharma | Parameswari Krishnamurthy | Vandan Mujadia | Priyanka Dasari | Vishnuraj Arjunaswamy

This paper presents an overview of the Shared Task on Patient-Centric Question Answering, organized as part of the NLP-AI4Health workshop at IJCNLP. The task aims to bridge the digital divide in healthcare by developing inclusive systems for two critical domains: Head and Neck Cancer (HNC) and Cystic Fibrosis (CF). We introduce the NLP4Health-2025 Dataset, a novel, large-scale multilingual corpus consisting of more than 45,000 validated multi-turn dialogues between patients and healthcare providers across 10 languages: Assamese, Bangla, Dogri, English, Gujarati, Hindi, Kannada, Marathi, Tamil, and Telugu. Participants were challenged to develop lightweight models (< 3 billion parameters) to perform two core activities: (1) Clinical Summarization, encompassing both abstractive summaries and structured clinical extraction (SCE), and (2) Patient-Centric QA, generating empathetic, factually accurate answers in the dialogue's native language. This paper details the hybrid human-agent dataset construction pipeline, task definitions, and evaluation metrics, and analyzes the performance of 9 submissions from 6 teams. The results demonstrate the viability of small language models (SLMs) in low-resource medical settings when optimized via techniques like LoRA and RAG.

pdf bib
Multilingual Clinical Dialogue Summarization and Information Extraction with Qwen-1.5B LoRA
Kunwar Zaid | Amit Sangroya | Jyotsana Khatri

This paper describes our submission to the NLP-AI4Health 2025 Shared Task on multilingual clinical dialogue summarization and structured information extraction. Our system is based on Qwen-1.5B Instruct fine-tuned with LoRA adapters for parameter-efficient adaptation. The pipeline produces (i) concise English summaries, (ii) schema-aligned JSON outputs, and (iii) multilingual Q&A responses. The Qwen-based approach substantially improves summary fluency, factual completeness, and JSON field coverage while maintaining efficiency within constrained GPU resources.
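For readers unfamiliar with LoRA-style adaptation, a minimal sketch using Hugging Face transformers and peft is shown below; the checkpoint name, target modules, and hyperparameters are illustrative assumptions, not the team's reported configuration.

```python
# Illustrative LoRA setup with transformers + peft; concrete checkpoint,
# target modules, and hyperparameters here are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen2.5-1.5B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```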

pdf bib
Patient-Centric Multilingual Question Answering and Summary Generation for Head and Neck Cancer and Blood Donation
Amol Shinde | Saloni Chitte | Prakash B. Pimpale

This paper describes a production-minded multilingual system built for the NLP-AI4Health shared task, designed to produce concise, medically accurate summaries and patient-friendly answers for Head and Neck Cancer (HNC) and Blood Donation. We fine-tune Gemma2-2B under a strict model size constraint (<3B parameters) using parameter-efficient adaptation (LoRA) and practical engineering to handle long dialogues, code mixing, and multilingual scripts. The pipeline couples careful preprocessing, token-aware chunking, and constrained decoding with lightweight retrieval and verification steps. We report per-language quantitative metrics and provide an analysis of design choices and operational considerations for real-world deployment.

pdf bib
SAHA: Samvad AI for Healthcare Assistance
Aditya Kumar | Rakesh Kumar Nayak | Janhavi Naik | Ritesh Kumar | Dhiraj Bhatia | Shreya Agarwal

This paper deals with the dual task of developing a medical question answering (QA) system and generating concise summaries of medical dialogue data across nine languages (English and eight Indian languages). The medical dialogue data focuses on two critical health issues, Head and Neck Cancer (HNC) and Cystic Fibrosis, from the NLP-AI4Health shared task. The proposed framework utilises a dual approach: a fine-tuned small Multilingual Text-to-Text Transfer Transformer (mT5) model for the conversational summarisation component and a fine-tuned Retrieval Augmented Generation (RAG) system integrating the dense intfloat/e5-large language model for the language-independent QA component. The efficacy of the proposed approaches is demonstrated by achieving promising precision in the QA task. Our framework achieved the highest F1 scores in QA for three Indian languages: 0.3995 in Marathi, 0.7803 in Bangla, and 0.74759 in Hindi. We achieved the highest COMET score of 0.5626 on the Gujarati QA test set. For the dialogue summarisation task, our model registered the highest ROUGE-2 and ROUGE-L precision across all eight Indian languages, with English being the sole exception. These results confirm our approach's potential to improve e-health applications on dialogue data for low-resource Indian languages.

pdf bib
MedQwen-PE: Medical Qwen for Parameter-Efficient Multilingual Patient-Centric Summarization, Question Answering and Information Extraction
Vinay Babu Ulli | Anindita Mondal

This study addresses the Shared Task on Patient-Centric Multilingual Question Answering, which focuses on generating summaries and patient-oriented answers from multi-turn medical dialogues related to Head and Neck Cancer and Cystic Fibrosis across ten languages. The Qwen3-1.7B model is fine-tuned using QLoRA for three tasks—Summarization, Question Answering, and Information Extraction—while updating only approximately 1.6% of parameters through task-specific adapter layers. The resulting system demonstrates strong semantic fidelity, as evidenced by high BERTScore and COMET scores, particularly for Kannada, English, Telugu, and Tamil, with comparatively lower performance in Assamese, Bangla, Gujarati, and Marathi. The modular fine-tuning design enables efficient task adaptation while satisfying the constraints on model size and computational resources.

pdf bib
NLP4Health: Multilingual Clinical Dialogue Summarization and QA with mT5 and LoRA
Moutushi Roy | Dipankar Das

In this work, we present NLP4Health, a unified and reproducible pipeline to accomplish the tasks of multilingual clinical dialogue summarization and question answering (QA). Our system fine-tunes the multilingual sequence-to-sequence model google/mt5-base along with parameter-efficient Low-Rank Adaptation (LoRA) modules to support ten Indian languages. For each clinical dialogue, the model produces (1) a free-text English summary, (2) an English structured key–value (KnV) JSON summary, and (3) QA responses in the dialogue’s original language. We conducted preprocessing, fine-tuning, and inference, and evaluated across QA, textual, and structured metrics, analyzing performance in low-resource settings. The adapter weights, tokenizer, and inference scripts are publicly released to promote transparency and reproducibility.

up

pdf (full)
bib (full)
Proceedings of the QuantumNLP: Integrating Quantum Computing with Natural Language Processing

pdf bib
Proceedings of the QuantumNLP: Integrating Quantum Computing with Natural Language Processing
Santanu Pal | Partha Pakray | Priyanka Jain | Asif Ekbal | Sivaji Bandyopadhyay

pdf bib
Quantum-Infused Whisper: A Framework for Replacing Classical Components
Tapabrata Mondal | Debjit Dhar | Soham Lahiri | Sivaji Bandyopadhyay

We propose a compact hybrid quantum–classical extension of OpenAI’s Whisper in which classical components are replaced by Quantum Convolutional Neural Networks (QCNN), Quantum LSTMs (QLSTM), and optional Quantum Adaptive Self-Attention (QASA). Log-mel spectrograms are angle encoded and processed by QCNN kernels, whose outputs feed a Transformer encoder, while QLSTM-based decoding introduces quantum-enhanced temporal modeling. The design incorporates pretrained acoustic embeddings and is constrained to NISQ-feasible circuit depths and qubit counts. Although this work is primarily architectural, we provide a fully specified, reproducible evaluation plan using Speech Commands, LibriSpeech, and Common Voice, along with strong classical baselines and measurable hypotheses for assessing noise robustness, efficiency, and parameter sparsity. To our knowledge, this is the first hardware-aware, module-wise quantum replacement framework for Whisper.
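A minimal PennyLane sketch of the angle-encoding step is given below; it encodes a short slice of log-mel features into a shallow variational circuit as a stand-in for one QCNN kernel, with layer shapes chosen for illustration rather than taken from the paper.

```python
# Minimal PennyLane sketch: angle-encode a few log-mel features into a small
# variational circuit; the paper's QCNN/QLSTM blocks are more elaborate.
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def quantum_kernel(features, weights):
    # Encode one 4-dimensional slice of the log-mel spectrogram as rotation angles.
    qml.AngleEmbedding(features, wires=range(n_qubits))
    # A shallow entangling block standing in for one QCNN kernel.
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weights = np.random.uniform(0, np.pi, size=(2, n_qubits, 3))  # 2 illustrative layers
features = np.array([0.1, 0.5, -0.3, 0.8])  # stand-in log-mel values
print(quantum_kernel(features, weights))
```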

pdf bib
These Aren’t the Vectors You’re Looking For: A Proof of Quantum Advantage in Compositional Generalization
Karthik Srikumar

Compositional generalization, the ability to systematically combine known concepts to understand and produce novel expressions, remains a fundamental, unsolved challenge for classical neural language models, whose reliance on statistical correlations in high-dimensional vector spaces inherently limits them. This paper establishes the first rigorous theoretical guarantee of an exponential quantum advantage for compositional generalization. We prove that classical language models, which represent concepts as vectors in ℝ^d, require a latent dimension scaling linearly with the number of concepts and compositional rules to avoid catastrophic interference. In contrast, we introduce the Quantum Compositional Embedding (QCE) framework, which leverages the intrinsic properties of quantum mechanics. In doing so, we demonstrate that QCE, utilizing only a logarithmic number of qubits, can perfectly represent and generalize compositional structures, a task provably impossible for classical models of equivalent dimensionality. The separation is proven to be exponential, providing a compelling theoretical foundation for quantum natural language processing.

pdf bib
Hybrid Classical-Quantum Framework for Sentiment Classification and Claim Check-Worthiness Identification in Bengali
Pritam Pal | Dipankar Das | Anup Kumar Kolya | Siddhartha Bhattacharyya

Traditional machine learning and deep learning models have demonstrated remarkable performance across various NLP tasks in multiple languages. However, these conventional models often struggle with languages with complex linguistic structures and nuanced contexts, such as Bengali. Recent advancements in quantum computing offer promising solutions for tackling complex, computationally challenging problems, providing faster, more efficient processing than classical systems. This research aims to address the challenges posed by the intricate linguistic structure of the less-resourced Bengali language by developing a quantum-enhanced framework for sentiment classification and claim-checkworthiness identification. We created a classical LSTM framework and proposed novel 2-qubit and 4-qubit classical-quantum frameworks, evaluating their effectiveness for sentiment classification and claim-checkworthiness identification tasks in Bengali. An entirely new dataset comprising 3K samples was developed by curating Bengali news headlines from prominent sources. We tagged these headlines with sentiment and claim checkworthy labels using state-of-the-art LLMs. Our findings indicate that the quantum-enhanced frameworks outperform the traditional models in both tasks. Notably, the 4-qubit-based framework achieved the highest F1-score in sentiment classification, while the 2-qubit-based framework demonstrated the best F1-score in claim checkworthiness identification.

pdf bib
A Hybrid Quantum-Classical Fusion for Deep Semantic Paraphrase Detection
Devanarayanan K | Fayas S Mohamad | Dheeraj V Mohan | Reshma Sheik

Paraphrase Detection is a core task in natural language processing (NLP) that aims to determine whether two sentences convey equivalent meanings. This work proposes a hybrid quantum–classical framework that integrates Sentence-BERT embeddings, simulated quantum feature encoding, and classical machine learning models to enhance semantic similarity detection. Initially, sentence pairs are embedded using Sentence-BERT and standardized through feature scaling. These representations are then transformed via rotation-based quantum circuits to capture higher-order feature interactions and non-linear dependencies. The resulting hybrid feature space, combining classical and quantum-inspired components, is evaluated using LightGBM and deep neural network classifiers. Experimental results show that the hybrid model incorporating quantum-inspired features achieved superior classification performance, yielding a 10% improvement in overall accuracy and outperforming standalone deep learning baselines. These findings demonstrate that quantum–classical fusion enhances semantic feature extraction and significantly improves paraphrase detection performance.
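A hedged sketch of such a hybrid pipeline appears below, assuming sentence-transformers and LightGBM; the "quantum-inspired" rotation is simulated classically with a sine/cosine feature map, which stands in for, but need not match, the circuits used in the paper.

```python
# Hedged sketch: SBERT sentence-pair features plus a simulated rotation
# encoding (sin/cos of scaled features), classified with LightGBM.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMClassifier

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def pair_features(sent_a: str, sent_b: str) -> np.ndarray:
    emb = encoder.encode([sent_a, sent_b])
    return np.abs(emb[0] - emb[1])  # element-wise difference of embeddings

def quantum_inspired(x: np.ndarray) -> np.ndarray:
    # Rotation-style encoding: treat each feature as an angle, keep both quadratures.
    return np.concatenate([np.sin(x), np.cos(x)])

# pairs: list of (sentence_a, sentence_b); labels: 0/1 paraphrase flags (assumed inputs).
def fit_model(pairs, labels):
    X = np.stack([pair_features(a, b) for a, b in pairs])
    X = StandardScaler().fit_transform(X)
    X = np.hstack([X, np.apply_along_axis(quantum_inspired, 1, X)])
    clf = LGBMClassifier(n_estimators=300)
    return clf.fit(X, labels)
```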

pdf bib
Quantum-Enhanced Gated Recurrent Units for Part-of-Speech Tagging
Ashutosh Rai | Shyambabu Pandey | Partha Pakray

Deep learning models for Natural Language Processing (NLP) tasks, such as Part-of-Speech (POS) tagging, usually have significant parameter counts that make them costly to train and deploy. Quantum Machine Learning (QML) offers a potential approach for building more parameter-efficient models. This paper proposes a hybrid quantum-classical gated recurrent unit model for POS tagging in code-mixed social media text. By integrating a quantum layer into the recurrent framework, our model achieved an accuracy comparable to the baseline classical model while needing fewer parameters. Although the reduction in parameters is modest in our setup, the approach is promising when scaled to deeper architectures. These results suggest that hybrid models can offer a resource-efficient alternative for NLP tasks.

pdf bib
A Review of Quantum Computing Approaches to Semantic Search and Text Classification in Natural Language Processing
Sauvik Bal

While deep learning and pre-trained language models have greatly enhanced NLP, they require substantial processing power. This work showcases the potential of quantum computing by mapping linguistic data into vast, high-dimensional Hilbert spaces through entanglement and superposition. It focuses on the mathematical concepts that set quantum approaches apart from classical ones, among them fidelity-based similarity and quantum probability. Various quantum machine learning models are considered in this article, including Quantum Neural Networks and Quantum Support Vector Machines, with a discussion of their computational advantages in pattern recognition. In addition, it considers retrieval techniques such as Grover's algorithm, showing how quantum similarity functions can improve semantic search. The comparison shows that quantum techniques may yield advantages in expressiveness and scalability, despite obstacles such as hardware noise and data encoding. Although quantum technology is still in its infancy, future improvements may further advance language understanding.

pdf bib
QCNN-MFND: A Novel Quantum CNN Framework for Multimodal Fake News Detection in Social Media
Arya Suneesh | Balasubramanian Palani

Fake news on social media platforms poses significant threats to public trust and information integrity. This research explores the application of quantum machine learning (QML) techniques for detecting fake news by leveraging quantum computing’s unique capabilities. Our work introduces a hybrid quantum-classical framework that utilizes quantum convolutional neural networks (QCNNs) with angle and amplitude encoding schemes for processing multimodal features from text and images. Experiments conducted on benchmark datasets - GossipCop and Politifact - demonstrate that our quantum-enhanced model achieves superior performance compared to classical approaches, with accuracy rates of 88.52% and 85.58%, and F1 scores of 93.19% and 90.20% respectively. Our findings establish QML as a viable approach for addressing the challenges of fake news detection in the digital era.

pdf bib
A Systematic Survey of Quantum Natural Language Processing: Models, Encoding Paradigms, and Evaluation Methods
Arpita Vats | Rahul Raja | Ashish Kattamuri | Abhinav Bohra

Quantum Natural Language Processing (QNLP) is an emerging interdisciplinary field at the intersection of quantum computing, natural language understanding, and formal linguistic theory. As advances in quantum hardware and algorithms accelerate, QNLP promises new paradigms for representation learning, semantic modeling, and efficient computation. However, existing literature remains fragmented, with no unified synthesis across modeling, encoding, and evaluation dimensions. In this work, we present the first systematic and taxonomy-driven survey of QNLP that holistically organizes research spanning three core dimensions: computational models, encoding paradigms, and evaluation frameworks. First, we analyze foundational approaches that map linguistic structures into quantum formalism, including categorical compositional models, variational quantum circuits, and hybrid quantum-classical architectures. Second, we introduce a unified taxonomy of encoding strategies, ranging from quantum tokenization and state preparation to embedding-based encodings, highlighting trade-offs in scalability, noise resilience, and expressiveness. Third, we provide the first comparative synthesis of evaluation methodologies, benchmark datasets, and performance metrics, while identifying reproducibility and standardization gaps. We further contrast quantum-inspired NLP methods with fully quantum-implemented systems, offering insights into resource efficiency, hardware feasibility, and real-world applicability. Finally, we outline open challenges such as integration with LLMs and unified benchmark design, and propose a research agenda for advancing QNLP as a scalable and reliable discipline.

pdf bib
A Survey of Quantum Natural Language Processing: From Compositional Models to NISQ-Era Empiricism
Arpan Phukan | Asif Ekbal

Quantum Natural Language Processing (QNLP) has emerged as a novel paradigm that leverages the principles of quantum mechanics to address fundamental challenges in language modeling, particularly in capturing compositional meaning. This survey charts the evolution of QNLP, from its theoretical foundations in the Distributional Compositional Categorical (DisCoCat) framework to its modern implementation on Noisy Intermediate-Scale Quantum (NISQ) hardware. We review the primary architectural approaches, including variational quantum circuits and tensor networks, and summarize the growing body of empirical work in tasks such as text classification, sentence similarity, and question answering. A recurring finding is the potential for QNLP models to achieve competitive performance with significantly fewer parameters than their classical counterparts. However, the field is critically constrained by the limitations of NISQ-era hardware. We conclude by discussing these challenges and outlining the future trajectory towards achieving a demonstrable quantum advantage and building more interpretable, efficient language models.

up

pdf (full)
bib (full)
Proceedings of The First Workshop on Human–LLM Collaboration for Ethical and Responsible Science Production (SciProdLLM)

pdf bib
Proceedings of The First Workshop on Human–LLM Collaboration for Ethical and Responsible Science Production (SciProdLLM)
Wei Zhao | Jennifer D’Souza | Steffen Eger | Anne Lauscher | Yufang Hou | Nafise Sadat Moosavi | Tristan Miller | Chenghua Lin

pdf bib
Bridging Health Literacy Gaps in Indian Languages: Multilingual LLMs for Clinical Text Simplification
R S Pavithra

We demonstrate how open multilingual LLMs (mT5, IndicTrans2) can simplify complex medical documents into culturally sensitive, patient-friendly text in Indian languages, advancing equitable healthcare communication and multilingual scientific accessibility. Clinical documents such as discharge summaries, consent forms, and medication instructions are essential for patient care but are often written in complex, jargon-heavy language. This barrier is intensified in multilingual and low-literacy contexts like India, where linguistic diversity meets limited health literacy. We present a multilingual clinical text simplification pipeline using open large language models (mT5 and IndicTrans2) to automatically rewrite complex medical text into accessible, culturally appropriate, and patient-friendly versions in English, Hindi, Tamil, and Telugu. Using a synthetic dataset of 2,000 discharge summaries, our models achieve up to 42% readability improvement while maintaining factual accuracy. The framework demonstrates how open, reproducible LLMs can bridge linguistic inequities in healthcare communication and support inclusive, patient-centric digital health access in India.

pdf bib
Human-Centered Disability Bias Detection in Large Language Models
Habiba Chakour | Fatiha Sadat

To promote a more just and inclusive society, developers and researchers are strongly encouraged to design Language Models (LMs) with ethical considerations at the forefront, ensuring that the benefits and opportunities of AI are accessible to all users and communities. Incorporating humans in the loop is one approach recognized for mitigating general AI biases. Consequently, the development of new design guidelines and datasets is essential to help AI systems realize their full potential for the benefit of people with disabilities. This study aims to identify disability-related bias in Large Masked Language Models (MLMs), specifically ELECTRA. A participatory and collaborative research approach was employed, involving three disability organizations to collect information on deaf and hard-of-hearing individuals. Our initial analysis reveals that the studied MLM is highly sensitive to the various identity references used to describe deaf and hard-of-hearing people.

pdf bib
TransLaTeX: Exposing the Last-Mile Execution Gap in LLM-Agent for Scientific Formatting
Jiawen Lyn | Yvette Graham

Large Language Models (LLMs) have achieved remarkable progress in tasks such as survey writing and language polishing, yet the final stage of LaTeX formatting and template adaptation remains a neglected and error-prone bottleneck. We identify an execution illusion, where LLMs produce linguistically fluent but unexecutable LaTeX code. To address this, we introduce TransLaTeX—the first reasoning-and-control framework that converts documents between scholarly templates with compiler-level verifiability. TransLaTeX achieves three key innovations: (1) structure–content separation via placeholder masking, ensuring privacy and lower token consumption; (2) SafeFormatBench, the first benchmark dedicated to executable LaTeX generation and template conversion; and (3) execution-grounded verification across compilation, policy compliance, and visual consistency. TransLaTeX outperforms Pandoc and full-text LLM baselines on SafeFormatBench in compilation rate, ACL policy compliance, and layout fidelity, effectively mitigating the execution illusion.

pdf bib
MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning
Rajarshi Ghosh | Abhay Gupta | Hudson McBride | Anurag Jayant Vaidya | Faisal Mahmood

Large language models (LLMs) are increasingly deployed in clinical decision support, yet subtle demographic cues can influence their reasoning. Prior work has documented disparities in outputs across patient groups, but little is known about how internal reasoning shifts under controlled demographic changes. We introduce MEDEQUALQA, a counterfactual benchmark that perturbs only patient pronouns (he/him, she/her, they/them) while holding critical symptoms and conditions (CSCs) constant. Each vignette is expanded into single-CSC ablations, producing three parallel datasets of approximately 23k items each (69k total). We evaluate a frontier LLM and compute Semantic Textual Similarity (STS) between reasoning traces to measure stability across pronoun variants. Our results show overall high similarity (mean STS > 0.80) but reveal consistent localized divergences in cited risk factors, guideline anchors, and differential ordering—even when final diagnoses remain unchanged. Error analysis identifies specific cases where reasoning shifts occur, highlighting clinically relevant bias loci that may cascade into inequitable care. MEDEQUALQA provides a controlled diagnostic setting for auditing reasoning stability in medical AI.
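The stability measurement can be illustrated with a short sketch that compares reasoning traces for pronoun variants via sentence-embedding cosine similarity; the embedding model and data layout are placeholders, and the LLM call itself is omitted.

```python
# Minimal sketch of the STS step: compare reasoning traces produced for each
# pronoun variant of the same vignette. Model choice and inputs are assumptions.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

sts_model = SentenceTransformer("all-mpnet-base-v2")

def pairwise_sts(reasoning_traces: dict) -> dict:
    """reasoning_traces maps a pronoun variant ('he', 'she', 'they') to the
    model's reasoning text for that variant of the same vignette."""
    scores = {}
    for a, b in combinations(reasoning_traces, 2):
        emb = sts_model.encode([reasoning_traces[a], reasoning_traces[b]])
        scores[(a, b)] = float(util.cos_sim(emb[0], emb[1]))
    return scores

# Hypothetical usage:
# traces = {"he": trace_he, "she": trace_she, "they": trace_they}
# print(pairwise_sts(traces))  # values near 1.0 indicate stable reasoning
```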

pdf bib
Reasoning-Enhanced Retrieval for Misconception Prediction: A RAG-Inspired Approach with LLMs
Chaudhary Divya | Chang Xue | Shaorui Sun

up

pdf (full)
bib (full)
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications

pdf bib
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
Alberto Accomazzi | Tirthankar Ghosal | Felix Grezes | Kelly Lockhart

pdf bib
Overview of the Third Workshop for Artificial Intelligence for Scientific Publications
Kelly Lockhart | Alberto Accomazzi | Felix Grezes | Tirthankar Ghosal

The Workshop for Artificial Intelligence for Scientific Publications (WASP), formerly Workshop on Information Extraction from Scientific Publications (WIESP), started in 2022 to provide a platform for researchers to discuss research on information extraction, mining, generation, and knowledge discovery from scientific publications using Natural Language Processing and Machine Learning techniques. The third WASP workshop was held at the 14th International Joint Conference on Natural Language Processing and 4th Asia-Pacific Chapter of the Association for Computational Linguistics in Mumbai, India on December 23rd, 2025, as a hybrid event. The WASP workshop saw great interest, with 29 submissions, of which 16 were accepted. The program consisted of the contributed research talks, 2 keynote talks, a panel discussion, and one shared task, Telescope Reference and Astronomy Categorization Shared task (TRACS).

pdf bib
Overview of TRACS: the Telescope Reference and Astronomy Categorization Dataset & Shared Task
Felix Grezes | Jennifer Lynn Bartlett | Kelly Lockhart | Alberto Accomazzi | Ethan Seefried | Anjali Pandiri | Tirthankar Ghosal

To evaluate the scientific influence of observational facilities, astronomers examine the body of publications that have utilized data from those facilities. This depends on curated bibliographies that annotate and connect data products to the corresponding literature, enabling bibliometric analyses to quantify data impact. Compiling such bibliographies is a demanding process that requires expert curators to scan the literature for relevant names, acronyms, and identifiers, and then to determine whether and how specific observations contributed to each publication. These bibliographies have value beyond impact assessment: for research scientists, explicit links between data and literature form an essential pathway for discovering and accessing data. Accordingly, by building on the work of librarians and archivists, telescope bibliographies can be repurposed to directly support scientific inquiry. In this context, we present the Telescope Reference and Astronomy Categorization Shared task (TRACS) and its accompanying dataset, which comprises more than 89,000 publicly available English-language texts drawn from space telescope bibliographies. These texts are labeled according to a new, compact taxonomy developed in consultation with experienced bibliographers.

pdf bib
Exploring Health Misinformation Detection with Multi-Agent Debate
Chih-Han Chen | Chen-Han Tsai | Yu-Shao Peng

Fact-checking health-related claims has become increasingly critical as misinformation proliferates online. Effective verification requires both the retrieval of high-quality evidence and rigorous reasoning processes. In this paper, we propose a two-stage framework for health misinformation detection: Agreement Score Prediction followed by Multi-Agent Debate. In the first stage, we employ large language models (LLMs) to independently evaluate retrieved articles and compute an aggregated agreement score that reflects the overall evidence stance. When this score indicates insufficient consensus—falling below a predefined threshold—the system proceeds to a second stage. Multiple agents engage in structured debate to synthesize conflicting evidence and generate well-reasoned verdicts with explicit justifications. Experimental results demonstrate that our two-stage approach achieves superior performance compared to baseline methods, highlighting the value of combining automated scoring with collaborative reasoning for complex verification tasks.

pdf bib
Zero-Shot Cross-Sentential Scientific Relation Extraction via Entity-Guided Summarization
Vani Kanjirangat | Fabio Rinaldi

Structured information extraction (IE) from scientific abstracts is increasingly leveraging large language models (LLMs). A crucial step in IE is relation extraction (RE), which becomes challenging when entity relations span sentences. Traditional path-based methods, such as shortest dependency paths, are often unable to handle cross-sentential relations effectively. Although LLMs have been utilized as zero-shot learners for IE tasks, they continue to struggle with capturing long-range dependencies and multi-hop reasoning. In this work, we propose using GPT as a zero-shot entity-guided summarizer to encapsulate cross-sentential context into a single-sentence summary for relation extraction. We perform intrinsic evaluations, comparing our approach against direct zero-shot prompting on biomedical scientific abstracts. On the Chemical-Disease Relation (CDR) dataset, our method achieves a 7-point improvement in overall F-score and 6 points for cross-sentential relations. On the Gene-Disease Association (GDA) dataset, we observe an 8-point gain for inter-sentential relations. These results demonstrate that entity-guided summarization with GPT can enhance zero-shot biomedical RE, supporting more effective structured information extraction from scientific texts.

pdf bib
Finding the Paper Behind the Data: Automatic Identification of Research Articles related to Data Publications
Barbara McGillivray | Kaveh Aryan | Viola Harperath | Marton Ribary | Mandy Wigdorowitz

Data papers are scholarly publications that describe datasets in detail, including their structure, collection methods, and potential for reuse, typically without presenting new analyses. As data sharing becomes increasingly central to research workflows, linking data papers to relevant research papers is essential for improving transparency, reproducibility, and scholarly credit. However, these links are rarely made explicit in metadata and are often difficult to identify manually at scale. In this study, we present a comprehensive approach to automating the linking process using natural language processing (NLP) techniques. We evaluate both set-based and vector-based methods, including Jaccard similarity, TF-IDF, SBERT, and reranking with large language models. Our experiments on a curated benchmark dataset reveal that no single method consistently outperforms others across all metrics, in line with the multifaceted nature of the task. Set-based methods using frequent words (N=50) achieve the highest top-10% accuracy, closely followed by TF-IDF, which also leads in MRR and top-1% and top-5% accuracy. SBERT-based reranking with LLMs yields the best results in top-N accuracy. This dispersion suggests that different approaches capture complementary aspects of similarity (lexical, semantic, and contextual), showing the value of hybrid strategies for robust matching between data papers and research articles. For several methods, we find no statistically significant difference between using abstracts and full texts, suggesting that abstracts may be sufficient for effective matching. Our findings demonstrate the feasibility of scalable, automated linking between data papers and research articles, enabling more accurate bibliometric analyses, improved tracking of data reuse, and fairer credit assignment for data sharing. This contributes to a more transparent, interconnected, and accessible research ecosystem.
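The two lexical baselines can be sketched compactly with scikit-learn, as below; the preprocessing choices (English stop words, top-50 frequent words) are assumptions for illustration, not the study's exact configuration.

```python
# Sketch of two lexical baselines: Jaccard overlap of frequent words and
# TF-IDF cosine similarity, ranking candidate research papers for a data paper.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_n_words(text: str, n: int = 50) -> set:
    vec = CountVectorizer(stop_words="english", max_features=n)
    vec.fit([text])
    return set(vec.get_feature_names_out())

def jaccard_rank(data_paper: str, candidates: list) -> list:
    dp = top_n_words(data_paper)
    scores = []
    for cand in candidates:
        cw = top_n_words(cand)
        scores.append(len(dp & cw) / len(dp | cw))
    return list(np.argsort(scores)[::-1])  # indices of best matches first

def tfidf_rank(data_paper: str, candidates: list) -> list:
    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform([data_paper] + candidates)
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return list(np.argsort(sims)[::-1])
```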

pdf bib
A benchmark for end-to-end zero-shot biomedical relation extraction with LLMs: experiments with OpenAI models
Aviv Brokman | Xuguang Ai | Yuhang Jiang | Shashank Gupta | Ramakanth Kavuluru

Extracting relations from scientific literature is a fundamental task in biomedical NLP because entities and relations among them drive hypothesis generation and knowledge discovery. As literature grows rapidly, relation extraction (RE) is indispensable for curating knowledge graphs to be used as computable structured and symbolic representations. With the rise of LLMs, it is pertinent to examine whether it is better to skip tailoring supervised RE methods, save annotation burden, and simply use zero-shot RE (ZSRE) via LLM API calls. In this paper, we propose a benchmark with seven biomedical RE datasets with interesting characteristics and evaluate three OpenAI models (GPT-4, o1, and GPT-OSS-120B) for end-to-end ZSRE. We show that LLM-based ZSRE is inching closer to supervised methods in performance on some datasets but still struggles on complex inputs expressing multiple relations with different predicates. Our error analysis reveals scope for improvements.

pdf bib
Bridging the Gap: Instruction-Tuned LLMs for Scientific Named Entity Recognition
Necva Bölücü | Maciej Rybinski | Stephen Wan

Information extraction (IE) from scientific literature plays an important role in many information-seeking pipelines. Large Language Models (LLMs) have demonstrated strong zero-shot and few-shot performance on IE tasks. However, there are challenges in practical deployment, especially in scenarios that involve sensitive information, such as industrial research or limited budgets. A key question is whether there is a need for a fine-tuned model for optimal domain adaptation (i.e., whether in-domain labelled training data is needed, or zero-shot to few-shot effectiveness is enough). In this paper, we explore this question in the context of IE on scientific literature. We further consider methodological questions, such as alternatives to cloud-based proprietary LLMs (e.g., GPT and Claude) when these are unsuitable due to data privacy, data sensitivity, or cost reasons. This paper outlines empirical results to recommend which locally hosted open-source LLM approach to adopt and illustrates the trade-offs in domain adaptation.

pdf bib
Metadata Generation for Research Data from URL Citation Contexts in Scholarly Papers: Task Definition and Dataset Construction
Yu Watanabe | Koichiro Ito | Shigeki Matsubara

This paper proposes a new research task aimed at automatically generating metadata for research data, such as datasets and code, to accelerate open science. From the perspective of ‘Findable’ in the FAIR data principles, research data is required to be assigned a global unique identifier and described with rich metadata. The proposed task is defined as extracting information about research data (specifically, name, generic mention, and in-text citation) from texts surrounding URLs that serve as identifiers for research data references in scholarly papers. To support this task, we constructed a dataset containing approximately 600 manually annotated citation contexts with URLs of research data from conference papers. To evaluate the task, we conducted a preliminary experiment using the constructed dataset, employing the In-Context Learning method with LLMs as a baseline. The results showed that the performance of LLMs matched that of humans in some cases, demonstrating the feasibility of the task.

pdf bib
Dynamic Reference Extraction and Linking across Multiple Scholarly Knowledge Graphs
Nicolau Duran-Silva | Pablo Accuosto

References are an important feature of scientific literature; however, they are unstructured, heterogeneous, noisy, and often multilingual. We present a modular pipeline that leverages fine-tuned transformer models for reference location, classification, parsing, retrieval, and re-ranking across multiple scholarly knowledge graphs, with a focus on multilingual and non-traditional sources such as patents and policy documents. Our main contributions are: a unified pipeline for reference extraction and linking across diverse document types, openly released annotated datasets, fine-tuned models for each subtask, and evaluations across multiple scholarly knowledge graphs, enabling richer, more inclusive infrastructures for open research information.

pdf bib
AI for Data Ingestion into IPAC Archives
Nicholas Susemiehl | Joseph Mazzarella

The astronomical data archives at IPAC, including the NASA Extragalactic Database (NED) and NASA Exoplanet Archive (NEA), have served as repositories for data published in the literature for decades. Throughout this time, extracting data from journal articles has remained a challenging task, and future large data releases will exacerbate this problem. We seek to accelerate the rate at which data can be extracted from journal articles and reformatted into database load files by leveraging recent advances in natural language processing enabled by AI. We are developing a new suite of tools to semi-automate information retrieval from scientific journal articles. Manual methods to extract and prepare data, which can take hours for some articles, are being replaced with AI-powered tools that can compress the task to minutes. A combination of AI and non-AI methods, along with human supervision, can substantially accelerate archive data ingestion. Challenges remain for improving accuracy, capturing data in external files, and flagging issues such as mislabeled object names and missing metadata.

pdf bib
A Hybrid LLM and Supervised Model Pipeline for Polymer Property Extraction from Tables in Scientific Literature
Van-Thuy Phi | Dinh-Truong Do | Hoang-An Trieu | Yuji Matsumoto

Extracting structured information from tables in scientific literature is a critical yet challenging task for building domain-specific knowledge bases. This paper addresses extraction of 5-ary polymer property tuples: (POLYMER, PROP_NAME, PROP_VALUE, CONDITION, CHAR_METHOD). We introduce and systematically compare two distinct methodologies: (1) a novel two-stage Hybrid Pipeline that first utilizes Large Language Models (LLMs) for table-to-text conversion, which is then processed by specialized text-based extraction models; and (2) an end-to-end Direct LLM Extraction approach. To evaluate these methods, we employ a systematic, domain-aligned evaluation setup based on the expert-curated PoLyInfo database. Our results demonstrate the clear superiority of the hybrid pipeline. When using Claude Sonnet 4.5 for the linearization stage, the pipeline achieves a score of 67.92% F1@PoLyInfo, significantly outperforming the best direct LLM extraction approach (Claude Sonnet 4.5 at 56.66%). This work establishes the effectiveness of a hybrid architecture that combines the generative strengths of LLMs with the precision of specialized supervised models for structured data extraction.

pdf bib
TeG-DRec: Inductive Text-Graph Learning for Unseen Node Scientific Dataset Recommendation
Ammar Qayyum | Bassamtiano Irnawan | Fumiyo Fukumoto | Latifah Kamarudin | Kentaro Go | Yoshimi Suzuki

Scientific datasets are crucial for evaluating scientific research, and their number is increasing rapidly. Most scientific dataset recommendation systems use Information Retrieval (IR) methods that model semantics while overlooking interactions. Graph Neural Networks (GNNs) excel at handling interactions between entities but often overlook textual content, limiting their ability to generalise to unseen nodes. We propose TeG-DRec, a framework for scientific dataset recommendation that integrates GNNs and textual content via a subgraph generation module to ensure correct propagation throughout the model, enabling handling of unseen data. Experimental results on the dataset recommendation dataset show that our method outperformed the baselines for text-based IR and graph-based recommendation systems. Our source code is available at https://github.com/Maqif14/TeG-DRec.git

pdf bib
Structured Outputs in Prompt Engineering: Enhancing LLM Adaptability on Counterintuitive Instructions
Jingjing Ye | Song Bai | Zhenyang Li | Zheqi Zone

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, yet they often exhibit cognitive inertia, rigidly adhering to ingrained training conventions even when prompted to deviate. This paper investigates the efficacy of structured output techniques in prompt engineering to mitigate such inertia and improve instruction-following on counterintuitive tasks. We argue that using structured input and output with our framework yields significant performance gains, as studied on the Inversed IFEval dataset across varying prompts and domains. This work contributes to the growing field of prompt engineering research by demonstrating structured outputs as a robust method for enhancing LLM logical reasoning.

pdf bib
Atlas: Customizing Large Language Models for Reliable Bibliographic Retrieval and Verification
Akash Kodali | Hailu Xu | Wenlu Zhang | Xin Qin

Large Language Models (LLMs) are increasingly used for citation retrieval, yet their bibliographic outputs often contain hallucinated or inconsistent metadata. This paper examines whether structured prompting improves citation reliability compared with traditional API-based retrieval. We implement a three-stage BibTeX-fetching pipeline: a baseline Crossref resolver, a standard GPT prompting method, and a customized verification-guided GPT configuration. Across heterogeneous reference inputs, we evaluate retrieval coverage, field completeness, and metadata accuracy against Crossref ground truth. Results show that prompting improves coverage and completeness. Our findings highlight the importance of prompt design for building reliable, LLM-driven bibliographic retrieval systems.
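A baseline Crossref resolver of the kind used as the first stage can be sketched against the public REST API; the field handling below is a simplified illustration, not the paper's exact implementation.

```python
# Hedged sketch of a baseline Crossref resolver: query the public REST API with
# a free-text reference string and keep the best match's metadata.
import requests

def resolve_reference(reference: str):
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": reference, "rows": 1},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    if not items:
        return None
    best = items[0]
    return {
        "doi": best.get("DOI"),
        "title": (best.get("title") or [""])[0],
        "year": best.get("issued", {}).get("date-parts", [[None]])[0][0],
        "container": (best.get("container-title") or [""])[0],
    }

# Hypothetical usage:
# meta = resolve_reference("Attention is all you need, Vaswani et al., 2017")
```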

pdf bib
Automated Telescope-Paper Linkage via Multi-Model Ensemble Learning
Ojaswa Ojaswa Varshney | Prashasti Vyas | Priyanka Goyal | Tarpita Singh | Ritesh Kumar | Mayank Singh

Automated linkage between scientific publications and telescope datasets is a cornerstone for scalable bibliometric analyses and ensuring scientific reproducibility in astrophysics. We propose a multi-model ensemble architecture integrating transformer models DeBERTa, RoBERTa, and TF-IDF logistic regression, tailored to the WASP-2025 shared task on telescope-paper classification. Our approach achieves a macro F1 score approaching 0.78 after extensive multi-seed ensembling and per-label threshold tuning, significantly outperforming baseline models. This paper presents comprehensive methodology, ablation studies, and an in-depth discussion of challenges, establishing a robust benchmark for scientific bibliometric task automation.
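Per-label threshold tuning for macro F1 can be sketched as a simple grid search over validation probabilities, as below; the grid and data layout are assumptions for illustration.

```python
# Sketch of per-label decision-threshold tuning on validation probabilities
# from an ensemble; grid, ranges, and array layout are illustrative.
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(val_probs: np.ndarray, val_labels: np.ndarray) -> np.ndarray:
    """val_probs, val_labels: (n_samples, n_labels) arrays; returns one threshold per label."""
    grid = np.linspace(0.05, 0.95, 19)
    thresholds = np.zeros(val_probs.shape[1])
    for j in range(val_probs.shape[1]):
        scores = [
            f1_score(val_labels[:, j], val_probs[:, j] >= t, zero_division=0)
            for t in grid
        ]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds

# preds = (test_probs >= thresholds).astype(int)  # final multi-label decisions
```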

pdf bib
Systematic Evaluation of Machine Learning and Transformer-Based Methods for Scientific Telescope Literature Classification
Huynh Trung Kiet | Dao Sy Duy Minh | Tran Chi Nguyen | Nguyen Lam Phu Quy | Pham Phu Hoa | Nguyen Dinh Ha Duong | Dinh Dien | Nguyen Hong Buu Long

Recent space missions such as Hubble, Chandra, and JWST have produced a rapidly growing body of scientific literature. Maintaining telescope bibliographies is essential for mission assessment and research traceability, yet current curation processes rely heavily on manual annotation and do not scale. To facilitate progress in this direction, the TRACS @ WASP 2025 shared task provides a benchmark for automatic telescope bibliographic classification based on scientific publications. In this work, we conduct a comparative study of modeling strategies for this task. We first explore traditional machine learning methods such as multinomial Naive Bayes with TF–IDF and CountVectorizer representations. We then evaluate transformer-based multi-label classification using BERT-based scientific language models. Finally, we investigate a task-wise classification approach, where we decompose the problem into separate prediction tasks and train a dedicated model for each. In addition, we experiment with a limited-resource LLM-based approach, showing that even without full fine-tuning and using only a partial subset of the training data, LLMs exhibit promising potential for telescope classification. Our best system achieves a macro F1 of 0.72 with BERT-based models on the test evaluation, substantially outperforming the official openai-gpt-oss-20b baseline (0.31 macro F1).

pdf bib
“Clutch or Cry” Team at TRACS @ WASP2025: A Hybrid Stacking Ensemble for Astrophysical Document Classification
Arshad Khatib | Aayush Prasad | Rudra Trivedi | Shrikant Malviya

Automatically identifying telescopes and their roles within astrophysical literature is crucial for large-scale scientific analysis and tracking instrument usage patterns. This paper describes the system developed by the “Clutch or Cry” team for the Telescope Reference and Astronomy Categorization Shared task (TRACS) at WASP 2025. The task involved two distinct challenges: multi-class telescope identification (Task 1) and multi-label role classification (Task 2). For Task 1, we employed a feature-centric approach combining document identifiers, metadata, and textual features to achieve high accuracy. For the more complex Task 2, we utilized a carefully designed two-level stacking ensemble. This hybrid model effectively fused symbolic information from a rule-based classifier with deep semantic understanding from a domain-adapted transformer. A subsequent meta-learning stage then performed targeted optimization for each role. These architectures were designed to address the primary challenges of handling long documents and managing severe class imbalance. A systematic optimization strategy focused on mitigating this imbalance significantly improved performance for minority classes. This work validates the effectiveness of using tailored, hybrid approaches and targeted optimization for complex classification tasks in specialized scientific domains.

pdf bib
amc: The Automated Mission Classifier for Telescope Bibliographies
John F. Wu | Joshua E.G. Peek | Sophie J. Miller | Jenny Novacescu | Achu J. Usha | Christopher A. Wilkinson

Telescope bibliographies record the pulse of astronomy research by capturing publication statistics and citation metrics for telescope facilities. Robust and scalable bibliographies ensure that we can measure the scientific impact of our facilities and archives. However, the growing rate of publications threatens to outpace our ability to manually label astronomical literature. We therefore present the Automated Mission Classifier (amc), a tool that uses large language models (LLMs) to identify and categorize telescope references by processing large quantities of paper text. A modified version of amc performs well on the TRACS Kaggle challenge, achieving a macro F1 score of 0.84 on the held-out test set. amc is valuable for other telescopes beyond TRACS; we developed the initial software for identifying papers that featured scientific results by NASA missions. Additionally, we investigate how amc can also be used to interrogate historical datasets and surface potential label errors. Our work demonstrates that LLM-based applications offer powerful and scalable assistance for library sciences.

pdf bib
AstroMLab 5: Structured Summaries and Concept Extraction for 400,000 Astrophysics Papers
Yuan-Sen Ting | Alberto Accomazzi | Tirthankar Ghosal | Tuan Dung Nguyen | Rui Pan | Zechang Sun | Tijmen de Haan

We present a dataset of 408,590 astrophysics papers from arXiv (astro-ph), spanning 1992 through July 2025. Each paper has been processed through a multi-stage pipeline to produce: (1) structured summaries organized into six semantic sections (Background, Motivation, Methodology, Results, Interpretation, Implication), and (2) concept extraction yielding 9,999 unique concepts with detailed descriptions. The dataset contains 3.8 million paper-concept associations and includes semantic embeddings for all concepts. Comparison with traditional ADS keywords reveals that the concepts provide denser coverage and more uniform distribution, while analysis of embedding space structure demonstrates that concepts are semantically dispersed within papers—enabling discovery through multiple diverse entry points. Concept vocabulary and embeddings are publicly released at https://github.com/tingyuansen/astro-ph_knowledge_graph.

pdf bib
Citation Drift: Measuring Reference Stability in Multi-Turn LLM Conversations
Gokul Srinath Seetha Ram

Large Language Models (LLMs) are increasingly used for scientific writing and research assistance, yet their ability to maintain consistent citations across multi-turn conversations remains unexplored. This paper introduces the concept of citation drift—the phenomenon where references mutate, disappear, or get fabricated during extended LLM interactions. We analyze 240 conversations across four LLaMA models using 36 authentic scientific papers from six domains and find significant citation instability. LLaMA-4-Maverick-17B achieves the highest stability (0.481) and lowest fabrication entropy, while LLaMA-4-Scout-17B fabricates up to 85.6% of citations. We introduce five new metrics—stability, fabrication rate, drift rate, drift entropy, and willingness-to-cite—providing a standardized framework for evaluating factual reliability in scientific dialogue systems. Our benchmark offers reproducible, model-agnostic evaluation tools for assessing citation reliability in AI-assisted research workflows.
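Two of the metrics can be given illustrative operationalizations, shown below; the paper's exact definitions may differ, so these formulas should be read as assumptions rather than the authors' specification.

```python
# Illustrative operationalization of stability and drift entropy over the
# citation strings produced across conversation turns.
import math
from collections import Counter

def stability(turn_citations: list, reference: str) -> float:
    """Fraction of turns that reproduce the original reference string exactly."""
    return sum(c == reference for c in turn_citations) / len(turn_citations)

def drift_entropy(turn_citations: list) -> float:
    """Shannon entropy (bits) over the distinct citation variants produced."""
    counts = Counter(turn_citations)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical usage over a 5-turn conversation:
# cites = ["Smith 2021", "Smith 2021", "Smith 2020", "Smyth 2021", "Smith 2021"]
# print(stability(cites, "Smith 2021"), drift_entropy(cites))
```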

pdf bib
Efficient Context-Limited Telescope Bibliography Classification for the WASP-2025 Shared Task Using SciBERT
Madhusudhan Naidu

The creation of telescope bibliographies is a crucial part of assessing the scientific impact of observatories and ensuring reproducibility in astronomy. This task involves identifying, categorizing, and linking scientific publications that reference or use specific telescopes. However, this process remains largely manual and resource intensive. In this work, we present an efficient SciBERT-based approach for automatic classification of scientific papers into four categories — science, instrumentation, mention, and not telescope. Despite strict context-length constraints (maximum 512 tokens) and limited compute resources, our approach achieved a macro F1 score of 0.89, ranking at the top of the WASP-2025 leaderboard. We analyze the effect of truncation and show that even with half the samples exceeding the token limit, SciBERT’s domain alignment enables robust classification. We discuss trade-offs between truncation, chunking, and long-context models, providing insights into the efficiency frontier for scientific text curation.

pdf bib
Encoder Fine-tuning with Stochastic Sampling Outperforms Open-weight GPT in Astronomy Knowledge Extraction
Shivam Rawat | Lucie Flek | Akbar Karimi

Scientific literature in astronomy is rapidly expanding, making it increasingly important to automate the extraction of key entities and contextual information from research papers. In this paper, we present an encoder-based system for extracting knowledge from astronomy articles. Our objective is to develop models capable of classifying telescope references, detecting auxiliary semantic attributes, and recognizing instrument mentions from textual content. To this end, we implement a multi-task transformer-based system built upon the SciBERT model and fine-tuned for astronomy corpora classification. To carry out the fine-tuning, we stochastically sample segments from the training data and use majority voting over the test segments at inference time. Our system, despite its simplicity and low-cost implementation, significantly outperforms the open-weight GPT baseline.
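The inference-side majority vote over segments can be sketched as follows with the transformers library; the checkpoint path is a placeholder, and the chunking ignores special-token details for brevity (the training-time stochastic sampling is not shown).

```python
# Sketch of segment-and-vote inference: split a long paper into encoder-sized
# chunks and take a majority vote over per-segment predictions.
from collections import Counter
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "path/to/finetuned-scibert"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

def predict_with_majority_vote(text: str, max_len: int = 512) -> int:
    ids = tokenizer(text, truncation=False)["input_ids"]
    segments = [ids[i:i + max_len] for i in range(0, len(ids), max_len)]
    votes = []
    with torch.no_grad():
        for seg in segments:
            logits = model(input_ids=torch.tensor([seg])).logits
            votes.append(int(logits.argmax(dim=-1)))
    return Counter(votes).most_common(1)[0][0]
```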

pdf bib
Enhanced Table Structure Recognition with Multi-Modal Approach
Huichen Yang | Andrew D. Hellicar | Maciej Rybinski | Sarvnaz Karimi

Tables are fundamental for presenting information in research articles, technical documents, manuals, and reports. One key challenge is accessing the information in tables that are embedded in Portable Document Format (PDF) files or scanned images. This requires accurately recognising table structures across diverse layouts and complex tables. The Table Structure Recognition (TSR) task aims to recognise the internal structure of table images and convert them into a machine-readable format. We propose a flexible multi-modal framework for image-based TSR. Our approach employs two-stream transformer encoders alongside task-specific decoders for table structure extraction and cell bounding box detection. Experiments on benchmark datasets demonstrate that our model achieves highly competitive results compared to strong baselines, gaining 5.4% over single-modality approaches on the FinTabNet dataset.

up

pdf (full)
bib (full)
Proceedings of the Twelfth Workshop on Asian Translation (WAT 2025)

pdf bib
Proceedings of the Twelfth Workshop on Asian Translation (WAT 2025)
Toshiaki Nakazawa | Isao Goto

pdf bib
Findings of the First Patent Claims Translation Task at WAT2025
Toshiaki Nakazawa | Takashi Tsunakawa | Isao Goto | Kazuhiro Kasada | Katsuhito Sudoh | Shoichi Okuyama | Takashi Ieda | Masaaki Nagata

This paper presents the results and findings of the first shared task of translating patent claims. We provide training, development, and test data for participants and perform human evaluation of the submitted translations. This time, 2 teams submitted their translation results. Our analysis of the human-annotated translation errors revealed not only general, domain-independent errors but also errors specific to patent translation. We also found that the human annotation itself exhibited some serious issues. In this paper, we report on these findings.

pdf bib
Ehime-U System with Judge and Refinement, Specialized Prompting, and Few-shot for the Patent Claim Translation Task at WAT 2025
Taishi Edamatsu | Isao Goto | Takashi Ninomiya

The Ehime University team participated in the Japanese-to-English Patent Claim Translation Task at WAT 2025. We experimented with (i) Judge and Refinement, (ii) Specialized Prompting, and (iii) Few-Shot approaches, using GPT-5 as the underlying LLM. Evaluation based on the LLM-as-a-Judge framework confirmed improvements for (i), while (ii) and (iii) showed no significant effects.

pdf bib
UTSK25 at WAT2025 Patent Claims Translation/Evaluation Task
Haruto Azami | Yin Zhang | Futo Kajita | Nobuyori Nishimura | Takehito Utsuro

This paper presents the submission of UTSK25 for the English–Japanese and Japanese–English directions of the WAT2025 Patent Claims Translation/Evaluation Task. We use a single translation model for both translation directions, built from a large language model through monolingual and bilingual continual pretraining and bilingual supervised fine-tuning. We finally generate translations via prompt engineering to reduce omissions and hallucinations.

pdf bib
Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance
Saumitra Yadav | Manish Shrivastava

Existing Machine Translation (MT) research often adopts a single, fixed hyperparameter setting for word segmentation: symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) when training tokenizers for both the source and target languages. However, we demonstrate that this uniform approach does not guarantee optimal MT performance across different language pairs and data sizes. This work investigates BPE segmentation recipes across various data volumes and language pairs to evaluate MT system performance. We find that asymmetric BPE—where the source and target languages use different NMOs—significantly improves results over the symmetric approach, especially in low-resource settings (50K, 100K, and 500K sentence pairs). Specifically, asymmetric BPE yields statistically significant (p<0.05) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in low-resource setups (50K, 100K, and 500K sentence pairs, respectively). We validated this trend across six additional language pairs (English paired with Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut), observing statistically significant improvements in 10 out of 12 systems compared to symmetric BPE. Our findings indicate that a high NMO for the source (4K to 32K) and a low NMO for the target (0.5K to 2K) provides optimal results, particularly benefiting low-resource MT.
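The following hedged sketch shows how an asymmetric setup of this kind could be trained with SentencePiece BPE. SentencePiece exposes a vocabulary size rather than a merge count, so `vocab_size` is used here as a rough proxy for the NMO; the file names and sizes are placeholders, not the paper's exact recipe.

```python
# Hedged sketch: asymmetric BPE tokenizers (large source vocab, small target vocab).
import sentencepiece as spm

# Source side (e.g., English): high NMO, approximated here by an 8K BPE vocabulary.
spm.SentencePieceTrainer.train(
    input="train.en", model_prefix="bpe_src", vocab_size=8000, model_type="bpe"
)
# Target side (e.g., Hindi): low NMO, approximated here by a 1K BPE vocabulary.
spm.SentencePieceTrainer.train(
    input="train.hi", model_prefix="bpe_tgt", vocab_size=1000, model_type="bpe"
)

src_sp = spm.SentencePieceProcessor(model_file="bpe_src.model")
tgt_sp = spm.SentencePieceProcessor(model_file="bpe_tgt.model")
print(src_sp.encode("A sample source sentence.", out_type=str))
```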

pdf bib
Speech-to-Speech Machine Translation for Dialectal Variations of Hindi
Sanmay Sood | Siddharth Rajput | Md Shad Akhtar

Hindi has many dialects, which are vital to India’s cultural and linguistic heritage. However, many of them have been largely overlooked in modern language technology, primarily due to a lack of proper resources. In this study, we explore speech-to-speech machine translation (S2ST) for four Hindi dialects, i.e., Awadhi, Bhojpuri, Braj Bhasha, and Magahi. We adopt a cascaded S2ST pipeline comprising three stages: Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech (TTS). We evaluate many recent large language models (LLMs) for dialect-to-Hindi and dialect-to-English translations in zero-shot, few-shot, and chain-of-thought setups. Our comparative analysis offers insights into the current capabilities and limitations of LLM-based approaches for low-resource dialectal S2ST in the Indian context.

pdf bib
A Systematic Review on Machine Translation and Transliteration Techniques for Code-Mixed Indo-Aryan Languages
H. Rukshan Dias | Deshan Sumanathilaka

In multilingual societies, it is common to observe the blending of multiple languages in communication, a phenomenon known as Code-mixing. Globalization and the increasing influence of social media have further amplified multilingualism, resulting in a wider use of code-mixing. This systematic review analyzes existing translation and transliteration techniques for code-mixed Indo-Aryan languages, spanning rule-based and statistical approaches to neural machine translation and transformer-based architectures. It also examines publicly available code-mixed datasets designed for machine translation and transliteration tasks, along with the evaluation metrics commonly introduced and applied in prior studies. Finally, the paper discusses current challenges and limitations, highlighting future research directions for developing more tailored translation pipelines for code-mixed Indo-Aryan languages.

pdf bib
CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation
Deepon Halder | Thanmay Jayakumar | Raj Dabre

Large language models (LLMs), despite their ability to perform few-shot machine translation (MT), often lag behind dedicated MT systems trained on parallel corpora, which remain crucial for high-quality translation. However, parallel corpora are often scarce or non-existent for low-resource languages. In this paper, we propose CycleDistill, a bootstrapping approach that leverages LLMs and few-shot translation to obtain high-quality MT systems. CycleDistill iteratively generates synthetic parallel corpora from monolingual corpora via zero- or few-shot MT and then uses them to fine-tune the very model that generated them. CycleDistill does not need parallel corpora beyond 1 to 4 few-shot examples, and in our experiments on three Indian languages, relying solely on monolingual corpora, it achieves high-quality machine translation, improving upon a few-shot baseline model by 20-30 chrF points on average in the first iteration. We also study the effect of leveraging softmax activations during the distillation process and observe mild improvements in translation quality.
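A schematic sketch of such a cyclical distillation loop is shown below; `translate()` and `fine_tune()` are hypothetical helpers standing in for few-shot LLM translation and standard supervised fine-tuning, and do not reflect the authors' actual implementation.

```python
# Hedged sketch: iterate synthetic-data generation and self-fine-tuning.
def cycle_distill(model, monolingual_src, few_shot_examples, iterations=3):
    for _ in range(iterations):
        # 1. Use the current model to synthesize a parallel corpus from monolingual text.
        synthetic_pairs = [
            (src, translate(model, src, few_shot_examples)) for src in monolingual_src
        ]
        # 2. Fine-tune the same model on the synthetic corpus, then repeat the cycle.
        model = fine_tune(model, synthetic_pairs)
    return model
```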

pdf bib
Findings of the WAT 2025 Shared Task on Japanese-English Article-level News Translation
Naoto Shirai | Kazutaka Kinugawa | Hitoshi Ito | Hideya Mino | Yoshihiko Kawai

We present the preliminary findings of the WAT 2025 shared task on document-level translation from Japanese to English in the news domain. This task focuses on translating full articles, with particular attention to whether translation models can learn to produce expressions and stylistic features typical of English news writing, aiming to generate outputs that resemble original English news articles. The task consists of three translation styles: (1) literal translation, (2) news-style translation, based on English articles edited to match Japanese content, and (3) finalized translation, the primary goal of this shared task. Only one team participated and submitted a system to a single subtask. All tasks were evaluated automatically, and one task was also evaluated manually to compare the submission with the baseline.

pdf bib
NHK Submission to WAT 2025: Leveraging Preference Optimization for Article-level Japanese–English News Translation
Hideya Mino | Rei Endo | Yoshihiko Kawai

This paper describes our submission to the Japanese–English Article-level News Translation Task at WAT 2025. In this task, participants were provided with a small but high-quality parallel corpus along with two intermediate English translations: a literal translation and a style-adapted translation. To effectively exploit this limited training data, our system employs a large language model (LLM) trained via supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO), a preference-learning technique for aligning model outputs with professional-quality references. By using the literal and style-adapted intermediate translations as negative (rejected) samples and the human-edited English articles as positive (chosen) samples in DPO training, we achieved notable improvements in translation quality. We evaluate our approach using BLEU scores and human assessments.
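As a hedged illustration, preference records for DPO training of this kind might be assembled as below, with the two intermediate translations as rejected samples and the human-edited article as the chosen sample. The prompt template and field names (the common prompt/chosen/rejected convention used by libraries such as TRL) are assumptions, not the authors' exact format.

```python
# Hedged sketch: build DPO preference pairs from the task's three reference types.
def build_dpo_pairs(source_ja, literal_en, style_adapted_en, final_en):
    prompt = f"Translate the following Japanese news article into English:\n{source_ja}"
    return [
        {"prompt": prompt, "chosen": final_en, "rejected": literal_en},
        {"prompt": prompt, "chosen": final_en, "rejected": style_adapted_en},
    ]
```

Records of this shape can then be passed to a preference-optimization trainer (for example, TRL's DPOTrainer) after the SFT stage.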

pdf bib
Findings of WAT2025 English-to-Indic Multimodal Translation Task
Shantipriya Parida | Ondřej Bojar

This paper presents the findings of the English-to-Indic Multimodal Translation shared task from the Workshop on Asian Translation (WAT2025). The task featured three tracks: text-only translation, image captioning, and multimodal translation across four low-resource Indic languages: Hindi, Bengali, Malayalam, and Odia. Three teams participated, submitting systems that achieved competitive performance, with BLEU scores ranging from 40.1 to 64.3 across different language pairs and tracks.

pdf bib
OdiaGenAI participation at WAT 2025
Debasish Dhal | Sambit Sekhar | Revathy V R | Shantipriya Parida | Akash Kumar Dhaka

We at ODIAGEN provide a detailed description of the model, training procedure, results, and conclusions of our submission to the Workshop on Asian Translation (WAT 2025). This year, we focus only on text-to-text translation tasks for low-resource Indic languages, specifically targeting Hindi, Bengali, Malayalam, and Odia. The system uses the NLLB-200 large language model, fine-tuned on datasets of over 100K rows for each targeted language. As in previous years, the training data consist of the data provided by the organisers, augmented with a much larger set of roughly 100K sentences subsampled from the Samanantar dataset provided by AI4Bharat. On five of the eight evaluation/challenge test sets, our approach obtained the highest BLEU scores reported since the task's inception.

pdf bib
Does Vision Still Help? Multimodal Translation with CLIP-Based Image Selection
Deepak Kumar | Baban Gain | Kshetrimayum Boynao Singh | Asif Ekbal

Multimodal Machine Translation aims to enhance conventional text-only translation systems by incorporating visual context, typically in the form of images paired with captions. In this work, we present our submission to the WAT 2025 Multimodal Translation Shared Task, which explores the role of visual information in translating English captions into four Indic languages: Hindi, Bengali, Malayalam, and Odia. Our system builds upon the strong multilingual text translation backbone IndicTrans, augmented with a CLIP-based selective visual grounding mechanism. Specifically, we compute cosine similarities between text and image embeddings (both full and cropped regions) and automatically select the most semantically aligned image representation to integrate into the translation model. We observe that the overall contribution of visual features is questionable. Our findings reaffirm recent evidence that large multilingual translation models can perform competitively without explicit visual grounding.
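The selection step can be illustrated with the following hedged Python sketch: score the full image and candidate crops against the caption with CLIP and keep the most similar one. The checkpoint name and the crop list passed in are placeholders, not the submitted system.

```python
# Hedged sketch: CLIP-based selection of the image (or crop) most aligned with a caption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_best_image(caption, image_paths):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[caption], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Normalize so the dot product below is a cosine similarity.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(-1)
    return image_paths[int(sims.argmax())]
```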

pdf bib
A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation
Siddharth Betala | Kushan Raj | Vipul Betala | Rohan Saswade

In this paper, we describe our system under the team name BLEU Monday for the English-to-Indic Multimodal Translation Task at WAT 2025. We participate in the text-only translation tasks for English-Hindi, English-Bengali, English-Malayalam, and English-Odia language pairs. We present a two-stage approach that addresses quality issues in the training data through automated error detection and correction, followed by parameter-efficient model fine-tuning. Our methodology introduces a vision-augmented judge-corrector pipeline that leverages multimodal language models to systematically identify and correct translation errors in the training data. The judge component classifies translations into three categories: correct, visually ambiguous (requiring image context), or mistranslated (poor translation quality). Identified errors are routed to specialized correctors: GPT-4o-mini regenerates captions requiring visual disambiguation, while IndicTrans2 retranslates cases with pure translation quality issues. This automated pipeline processes 28,928 training examples across four languages, correcting an average of 17.1% of captions per language. We then apply Low-Rank Adaptation (LoRA) to fine-tune the IndicTrans2 en-indic 200M distilled model on both original and corrected datasets. Training on corrected data yields consistent improvements, with BLEU score gains of +1.30 for English-Bengali on the evaluation set (42.00 → 43.30) and +0.70 on the challenge set (44.90 → 45.60), +0.60 for English-Odia on the evaluation set (41.00 → 41.60), and +0.10 for English-Hindi on the challenge set (53.90 → 54.00).
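A minimal LoRA setup of this kind with the PEFT library might look like the sketch below. The checkpoint identifier, target modules, and hyperparameters are illustrative assumptions rather than the authors' exact configuration.

```python
# Hedged sketch: attach LoRA adapters to a seq2seq translation model with PEFT.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-en-indic-dist-200M",  # assumed checkpoint name
    trust_remote_code=True,
)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trained
```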

up

pdf (full)
bib (full)
Proceedings of the Workshop on Sign Language Processing (WSLP)

pdf bib
Proceedings of the Workshop on Sign Language Processing (WSLP)
Mohammed Hasanuzzaman | Facundo Manuel Quiroga | Ashutosh Modi | Sabyasachi Kamila | Keren Artiaga | Abhinav Joshi | Sanjeet Singh

pdf bib
Overview of the First Workshop on Sign Language Processing (WSLP 2025)
Sanjeet Singh | Abhinav Joshi | Keren Artiaga | Mohammed Hasanuzzaman | Facundo Manuel Quiroga | Sabyasachi Kamila | Ashutosh Modi

We organized the First Workshop on Sign Language Processing (WSLP 2025), co-located with IJCNLP–AACL 2025 at IIT Bombay, to bring together researchers, linguists, and members of the Deaf community and accelerate computational work on under-resourced sign languages. The workshop accepted ten papers—including two official shared-task submissions—that introduced new large-scale resources (a continuous ISL fingerspelling corpus, cross-lingual HamNoSys corpora), advanced multilingual and motion-aware translation models, explored LLM-based augmentation and glossing strategies, and presented lightweight deployable systems for regional languages such as Odia. We ran a three-track shared task on Indian Sign Language that attracted over sixty registered teams and established the first public leaderboards for sentence-level ISL-to-English translation, isolated word recognition, and word-presence prediction. By centring geographic, linguistic, and organiser diversity, releasing open datasets and benchmarks, and explicitly addressing linguistic challenges unique to visual–spatial languages, we significantly broadened the scope of sign-language processing beyond traditionally dominant European and East-Asian datasets, laying a robust foundation for inclusive, equitable, and deployable sign-language AI in the Global South.

pdf bib
Indian Sign Language Recognition and Translation into Odia
Astha Swarupa Nayak | Naisargika Subudhi | Tannushree Rana | Muktikanta Sahu | Rakesh Chandra Balabantaray

Sign language is a vital means of communication for the deaf and hard-of-hearing community. However, translating Indian Sign Language (ISL) into regional languages like Odia remains a significant technological challenge due to the language’s rich morphology, agglutinative grammar, and complex script. This work presents a real-time ISL recognition and translation system that converts hand gestures into Odia text, enhancing accessibility and promoting inclusive communication. The system leverages MediaPipe for real-time key-point detection and uses a custom-built dataset of 1,200 samples across 12 ISL gesture classes, captured under diverse Indian backgrounds and lighting conditions to ensure robustness. Both 2D and 3D Convolutional Neural Networks (CNNs) were explored, with the 2D CNN achieving superior performance (98.33% test accuracy) compared to the 3D CNN’s 78.33%. Recognized gestures are translated into Odia using a curated gesture-to-text mapping dictionary, seamlessly integrated into a lightweight Tkinter-based GUI. Unlike other resource-heavy systems, this model is optimized for deployment on low-resource devices, making it suitable for rural and educational contexts. Beyond translation, the system can function as an assistive learning tool for students and educators of ISL. This work demonstrates that combining culturally curated datasets with efficient AI models can bridge communication gaps and create regionally adapted, accessible technology for the deaf and mute community in India.

pdf bib
Low-Resource Sign Language Glossing Profits From Data Augmentation
Diana Vania Lara Ortiz | Sebastian Padó

Glossing is the task of translating from a written language into a sequence of glosses, i.e., textual representations of signs from some sign language. While glossing is in principle ‘just’ a machine translation (MT) task, sign languages still lack the large parallel corpora that exist for many written language pairs and underlie the development of dedicated MT systems. In this work, we demonstrate that glossing can be significantly improved through data augmentation. We fine-tune a Spanish transformer model both on a small dedicated corpus of 3,000 Spanish–Mexican Sign Language (MSL) gloss sentence pairs and on a corpus augmented with an English–American Sign Language (ASL) gloss corpus. We obtain the best results when we oversample from the ASL corpus by a factor of ~4, achieving a BLEU increase from 62 to 85 and a TER reduction from 44 to 20. This demonstrates the usefulness of combining corpora in low-resource glossing situations.
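The corpus-mixing step can be sketched as follows; the oversampling factor and variable names are illustrative and not the authors' code.

```python
# Hedged sketch: oversample the auxiliary ASL gloss corpus before mixing it with MSL data.
import random

def build_augmented_corpus(msl_pairs, asl_pairs, oversample_factor=4, seed=0):
    mixed = list(msl_pairs) + list(asl_pairs) * oversample_factor
    random.Random(seed).shuffle(mixed)
    return mixed
```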

pdf bib
Augmenting Sign Language Translation Datasets with Large Language Models
Pedro Alejandro Dal Bianco | Jean Paul Nunes Reinhold | Facundo Manuel Quiroga | Franco Ronchetti

Sign language translation (SLT) is a challenging task due to the scarcity of labeled data and the heavy-tailed distribution of sign language vocabularies. In this paper, we explore a novel data augmentation approach for SLT: using a large language model (LLM) to generate paraphrases of the target language sentences in the training data. We experiment with a Transformer-based SLT model (Signformer) on three datasets spanning German, Greek, and Argentinian Sign Languages. For models trained with augmentation, we adopt a two-stage regime: pre-train on the LLM-augmented corpus and then fine-tune on the original, non-augmented training set. Our augmented training sets, expanded with GPT-4-generated paraphrases, yield mixed results. On a medium-scale German SL corpus (PHOENIX14T), LLM augmentation improves BLEU-4 from 9.56 to 10.33. In contrast, a small-vocabulary Greek SL dataset with a near-perfect baseline (94.38 BLEU) sees a slight drop to 92.22 BLEU, and a complex Argentinian SL corpus with a long-tail vocabulary distribution remains around 1.2 BLEU despite augmentation. We analyze these outcomes in relation to each dataset’s complexity and token frequency distribution, finding that LLM-based augmentation is more beneficial when the dataset contains a richer vocabulary and many infrequent tokens. To our knowledge, this work is the first to apply LLM paraphrasing to SLT, and we discuss these results with respect to prior data augmentation efforts in sign language translation.

pdf bib
Multilingual Sign Language Translation with Unified Datasets and Pose-Based Transformers
Pedro Alejandro Dal Bianco | Oscar Agustín Stanchi | Facundo Manuel Quiroga | Franco Ronchetti

Sign languages are highly diverse across countries and regions, yet most Sign Language Translation (SLT) work remains monolingual. We explore a unified, multi-target SLT model trained jointly on four sign languages (German, Greek, Argentinian, Indian) using a standardized data layer. Our model operates on pose keypoints extracted with MediaPipe, yielding a lightweight and dataset-agnostic representation that is less sensitive to backgrounds, clothing, cameras, or signer identity while retaining motion and configuration cues. On RWTH-PHOENIX-Weather 2014T, the Greek Sign Language Dataset, LSA-T, and ISLTranslate, naive joint training under a fully shared parameterization performs worse than monolingual baselines; however, a simple two-stage schedule (multilingual pre-training followed by short language-specific fine-tuning) recovers and surpasses monolingual results on three datasets (PHOENIX14T: +0.15 BLEU-4; GSL: +0.74; ISL: +0.10) while narrowing the gap on the most challenging corpus (LSA-T: -0.24 vs. monolingual). Scores span from BLEU-4 ≈ 1 on open-domain news (LSA-T) to >90 on constrained curricula (GSL), highlighting the role of dataset complexity. We release our code to facilitate training and evaluation of multilingual SLT models.
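For readers unfamiliar with the pose representation, the following hedged sketch extracts per-frame keypoints with MediaPipe Holistic (the legacy "solutions" API); the flattening into a feature vector is an assumption about one way such a dataset-agnostic representation could be built, not the authors' pipeline.

```python
# Hedged sketch: per-frame pose keypoint extraction from a sign video.
import cv2
import mediapipe as mp

def extract_pose_sequence(video_path):
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks is not None:
            # Flatten (x, y, z) coordinates of all pose landmarks into one vector.
            frames.append([coord
                           for lm in results.pose_landmarks.landmark
                           for coord in (lm.x, lm.y, lm.z)])
    cap.release()
    holistic.close()
    return frames  # list of per-frame keypoint vectors
```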

pdf bib
Continuous Fingerspelling Dataset for Indian Sign Language
Kirandevraj R | Vinod K. Kurmi | Vinay P. Namboodiri | C.v. Jawahar

Fingerspelling enables signers to represent proper nouns and technical terms letter-by-letter using manual alphabets, yet remains severely under-resourced for Indian Sign Language (ISL). We present the first continuous fingerspelling dataset for ISL, extracted from the ISH News YouTube channel, in which fingerspelling is accompanied by synchronized on-screen text cues. The dataset comprises 1,308 segments from 499 videos, totaling 70.85 minutes and 14,814 characters, with aligned video-text pairs capturing authentic coarticulation patterns. We validated the dataset quality through annotation using a proficient ISL interpreter, achieving a 90.67% exact match rate for 150 samples. We further established baseline recognition benchmarks using a ByT5-small encoder-decoder model, which attains 82.91% Character Error Rate after fine-tuning. This resource supports multiple downstream tasks, including fingerspelling transcription, temporal localization, and sign generation. The dataset is available at the following link: https://kirandevraj.github.io/ISL-Fingerspelling/.

pdf bib
Enhancing Indian Sign Language Translation via Motion-Aware Modeling
Anal Roy Chowdhury | Debarshi Kumar Sanyal

Sign language translation (SLT) has witnessed rapid progress in the deep learning community across several sign languages, including German, American, British, and Italian. However, Indian Sign Language (ISL) remains relatively underexplored. Motivated by recent efforts to develop large-scale ISL resources, we investigate how existing SLT models perform on ISL data. Specifically, we evaluate three approaches: (i) training a transformer-based model, (ii) leveraging visual-language pretraining, and (iii) tuning a language model with pre-trained visual and motion representations. Unlike existing methods that primarily use raw video frames, we augment the model with optical flow maps to explicitly capture motion primitives, combined with a multi-scale feature extraction method for encoding spatial features (SpaMo-OF). Our approach achieves promising results, obtaining a BLEU-4 score of 8.58 on the iSign test set, establishing a strong baseline for future ISL translation research.

pdf bib
Pose-Based Temporal Convolutional Networks for Isolated Indian Sign Language Word Recognition
Tatigunta Bhavi Teja Reddy | Vidhya Kamakshi

This paper presents a lightweight and efficient baseline for isolated Indian Sign Language (ISL) word recognition developed for the WSLP-AACL-2025 Shared Task. We propose a two-stage framework combining skeletal landmark extraction via MediaPipe Holistic with a Temporal Convolutional Network (TCN) for temporal sequence classification. The system processes pose-based input sequences instead of raw video, significantly reducing computation and memory costs. Trained on the WSLP-AACL-2025 dataset containing 4,398 isolated sign videos across 4,361 word classes, our model achieves 54% top-1 and 78% top-5 accuracy.
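To illustrate the architecture class described above, here is a PyTorch sketch of a pose-based TCN: stacked dilated 1-D convolutions over per-frame keypoint vectors, followed by global pooling and a linear layer over the word classes. All layer sizes are illustrative assumptions rather than the submitted system.

```python
# Hedged sketch: dilated temporal convolutions over pose keypoint sequences.
import torch
import torch.nn as nn

class PoseTCN(nn.Module):
    def __init__(self, in_dim=150, hidden=256, num_classes=4361, levels=4):
        super().__init__()
        layers, channels = [], in_dim
        for i in range(levels):
            # Exponentially growing dilation widens the temporal receptive field.
            layers += [
                nn.Conv1d(channels, hidden, kernel_size=3,
                          dilation=2 ** i, padding=2 ** i),
                nn.ReLU(),
            ]
            channels = hidden
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):             # x: (batch, time, in_dim)
        x = x.transpose(1, 2)         # -> (batch, in_dim, time) for Conv1d
        h = self.tcn(x).mean(dim=-1)  # global average pooling over time
        return self.head(h)

logits = PoseTCN()(torch.randn(2, 64, 150))  # shape: (2, 4361)
```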

pdf bib
Cross-Linguistic Phonological Similarity Analysis in Sign Languages Using HamNoSys
Abhishek Bharadwaj Varanasi | Manjira Sinha | Tirthankar Dasgupta

This paper presents a cross-linguistic analysis of phonological similarity in sign languages using symbolic representations from the Hamburg Notation System (HamNoSys). We construct a dataset of 1000 signs each from British Sign Language (BSL), German Sign Language (DGS), French Sign Language (LSF), and Greek Sign Language (GSL), and compute pairwise phonological similarity using normalized edit distance over HamNoSys strings. Our analysis reveals both universal and language-specific patterns in handshape usage, movement dynamics, non-manual features, and spatial articulation. We explore intra and inter-language similarity distributions, phonological clustering, and co-occurrence structures across feature types. The findings offer insights into the structural organization of sign language phonology and highlight typological variation shaped by linguistic and cultural factors.
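The similarity measure used above can be sketched in a few lines: Levenshtein distance over HamNoSys symbol strings, normalized by the longer string and turned into a similarity in [0, 1]. This is a generic illustration of normalized edit distance, not the authors' exact scoring code.

```python
# Hedged sketch: normalized edit-distance similarity between two symbol strings.
def normalized_similarity(a, b):
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```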

pdf bib
Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture
Samuel Ebimobowei Johnny | Blessed Guda | Emmanuel Aaron | Assane Gueye

Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88% accuracy and 60.00% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification.
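A hedged sketch of an encoder-only spotting model of this kind is given below: the query sign and the target sequence are projected, concatenated around a learned separator, and fed to a Transformer encoder whose pooled output drives a binary present/absent head. Dimensions and the fusion scheme are illustrative assumptions, not the submitted architecture.

```python
# Hedged sketch: pose-based encoder with a binary classification head for sign spotting.
import torch
import torch.nn as nn

class SignSpotter(nn.Module):
    def __init__(self, pose_dim=150, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        self.proj = nn.Linear(pose_dim, d_model)
        self.sep = nn.Parameter(torch.zeros(1, 1, d_model))  # learned separator token
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, query_pose, target_pose):
        # query_pose: (B, Tq, pose_dim); target_pose: (B, Tt, pose_dim)
        q, t = self.proj(query_pose), self.proj(target_pose)
        sep = self.sep.expand(q.size(0), -1, -1)
        h = self.encoder(torch.cat([q, sep, t], dim=1))
        return self.head(h.mean(dim=1)).squeeze(-1)  # logit: query sign present or not

scores = SignSpotter()(torch.randn(2, 30, 150), torch.randn(2, 200, 150))
```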

pdf bib
Finetuning Pre-trained Language Models for Bidirectional Sign Language Gloss to Text Translation
Arshia Kermani | Habib Irani | Vangelis Metsis

Sign Language Translation (SLT) is a crucial technology for fostering communication accessibility for the Deaf and Hard-of-Hearing (DHH) community. A dominant approach in SLT involves a two-stage pipeline: first, transcribing video to sign language glosses, and then translating these glosses into natural text. This second stage, gloss-to-text translation, is a challenging, low-resource machine translation task due to data scarcity and significant syntactic divergence. While prior work has often relied on training translation models from scratch, we show that fine-tuning large, pre-trained language models (PLMs) offers a more effective and data-efficient paradigm. In this work, we conduct a comprehensive bidirectional evaluation of several PLMs (T5, Flan-T5, mBART, and Llama) on this task. We use a collection of popular SLT datasets (RWTH-PHOENIX-14T, SIGNUM, and ASLG-PC12) and evaluate performance using standard machine translation metrics. Our results show that fine-tuned PLMs consistently and significantly outperform Transformer models trained from scratch, establishing new state-of-the-art results. Crucially, our bidirectional analysis reveals a significant performance gap, with Text-to-Gloss translation posing a greater challenge than Gloss-to-Text. We conclude that leveraging the linguistic knowledge of pre-trained models is a superior strategy for gloss translation and provides a more practical foundation for building robust, real-world SLT systems.