uppdf
bib
Findings of the Association for Computational Linguistics: NAACL 2025
Luis Chiruzzo
|
Alan Ritter
|
Lu Wang
pdf
bib
From Lazy to Prolific: Tackling Missing Labels in Open Vocabulary Extreme Classification by Positive-Unlabeled Sequence Learning
Ranran Haoran Zhang
|
Bensu Uçar
|
Soumik Dey
|
Hansi Wu
|
Binbin Li
|
Rui Zhang
pdf
bib
abs
DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization
Pucheng Dang
|
Xing Hu
|
Dong Li
|
Rui Zhang
|
Qi Guo
|
Kaidi Xu
Current text-to-image (T2I) synthesis diffusion models raise misuse concerns, particularly in creating prohibited or not-safe-for-work (NSFW) images. To address this, various safety mechanisms and red teaming attack methods are proposed to enhance or expose the T2I model’s capability to generate unsuitable content. However, many red teaming attack methods assume knowledge of the text encoders, limiting their practical usage. In this work, we rethink the case of purely black-box attacks without prior knowledge of the T2l model. To overcome the unavailability of gradients and the inability to optimize attacks within a discrete prompt space, we propose DiffZOO which applies Zeroth Order Optimization to procure gradient approximations and harnesses both C-PRV and D-PRV to enhance attack prompts within the discrete prompt domain. We evaluated our method across multiple safety mechanisms of the T2I diffusion model and online servers. Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works, hence its promise as a practical red teaming tool for T2l models.
pdf
bib
abs
MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens
Yongqi Fan
|
Hongli Sun
|
Kui Xue
|
Xiaofan Zhang
|
Shaoting Zhang
|
Tong Ruan
Numerous advanced Large Language Models (LLMs) now support context lengths up to 128K, and some extend to 200K. Some benchmarks in the generic domain have also followed up on evaluating long-context capabilities. In the medical domain, tasks are distinctive due to the unique contexts and need for domain expertise, necessitating further evaluation. However, despite the frequent presence of long texts in medical scenarios, evaluation benchmarks of long-context capabilities for LLMs in this field are still rare. In this paper, we propose MedOdyssey, the first medical long-context benchmark with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: the medical-context “needles in a haystack” task and a series of tasks specific to medical applications, together comprising 10 datasets. The first component includes challenges such as counter-intuitive reasoning and novel (unknown) facts injection to mitigate knowledge leakage and data contamination of LLMs. The second component confronts the challenge of requiring professional medical expertise. Especially, we design the ‘“Maximum Identical Context” principle to improve fairness by guaranteeing that different LLMs observe as many identical contexts as possible. Our experiment evaluates advanced proprietary and open-source LLMs tailored for processing long contexts and presents detailed performance analyses. This highlights that LLMs still face challenges and need for further research in this area. Our code and data are released in the repository:
https://github.com/JOHNNY-fans/MedOdyssey.
pdf
bib
abs
Can LLMs Learn Macroeconomic Narratives from Social Media?
Almog Gueta
|
Amir Feder
|
Zorik Gekhman
|
Ariel Goldstein
|
Roi Reichart
This study empirically tests the Narrative Economics hypothesis, which posits that narratives (ideas that are spread virally and affect public beliefs) can influence economic fluctuations. We introduce two curated datasets containing posts from X (formerly Twitter) which capture economy-related narratives (Data will be shared upon paper acceptance). Employing Natural Language Processing (NLP) methods, we extract and summarize narratives from the tweets. We test their predictive power for macroeconomic forecasting by incorporating the tweets’ or the extracted narratives’ representations in downstream financial prediction tasks. Our work highlights the challenges in improving macroeconomic models with narrative data, paving the way for the research community to realistically address this important challenge. From a scientific perspective, our investigation offers valuable insights and NLP tools for narrative extraction and summarization using Large Language Models (LLMs), contributing to future research on the role of narratives in economics.
pdf
bib
abs
Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency
Leonidas Gee
|
Milan Gritta
|
Gerasimos Lampouras
|
Ignacio Iacobacci
Code Language Models have been trained togenerate accurate solutions, typically with noregard for runtime. On the other hand, previousworks that explored execution optimisationhave observed corresponding drops infunctional correctness. To that end, we introduceCode-Optimise, a framework that incorporatesboth correctness (passed, failed) andruntime (quick, slow) as learning signals viaself-generated preference data. Our frameworkis both lightweight and robust as it dynamicallyselects solutions to reduce overfitting whileavoiding a reliance on larger models for learningsignals. Code-Optimise achieves significantimprovements in pass@k while decreasingthe competitive baseline runtimes by anadditional 6% for in-domain data and up to3% for out-of-domain data. As a by-product,the average length of the generated solutionsis reduced by up to 48% on MBPP and 23%on HumanEval, resulting in faster and cheaperinference. The generated data and codebaseis open-sourced at https://github.com/huawei-noah/HEBO/tree/Code_Optimise.
pdf
bib
abs
People will agree what I think: Investigating LLM’s False Consensus Effect
Junhyuk Choi
|
Yeseon Hong
|
Bugeun Kim
Large Language Models (LLMs) have been recently adopted in interactive systems requiring communication. As the false belief in a model can harm the usability of such systems, LLMs should not have cognitive biases that humans have. Psychologists especially focus on the False Consensus Effect (FCE), a cognitive bias where individuals overestimate the extent to which others share their beliefs or behaviors, because FCE can distract smooth communication by posing false beliefs. However, previous studies have less examined FCE in LLMs thoroughly, which needs more consideration of confounding biases, general situations, and prompt changes. Therefore, in this paper, we conduct two studies to examine the FCE phenomenon in LLMs. In Study 1, we investigate whether LLMs have FCE. In Study 2, we explore how various prompting styles affect the demonstration of FCE. As a result of these studies, we identified that popular LLMs have FCE. Also, the result specifies the conditions when FCE becomes more or less prevalent compared to normal usage.
pdf
bib
abs
LawInstruct: A Resource for Studying Language Model Adaptation to the Legal Domain
Joel Niklaus
|
Lucia Zheng
|
Arya D. McCarthy
|
Christopher Hahn
|
Brian M Rosen
|
Peter Henderson
|
Daniel E. Ho
|
Garrett Honke
|
Percy Liang
|
Christopher D Manning
Instruction tuning is an important step in making language models useful for direct user interaction. However, the legal domain is underrepresented in typical instruction datasets (e.g., only 10 out of 1600+ tasks in Super-NaturalInstructions). To study whether instruction tuning on legal datasets is necessary for strong legal reasoning, we aggregate 58 annotated legal datasets and write instructions for each, creating LawInstruct. LawInstruct covers 17 global jurisdictions, 24 languages and a total of 12M examples across diverse tasks such as legal QA, summarization of court cases, and legal argument mining. We evaluate our models on LegalBench, measuring legal reasoning across five categories in 162 challenging and realistic legal tasks, and MMLU, to measure potential drops in general reasoning capabilities. We find that legal-specific instruction tuning on Flan-T5 – yielding FLawN-T5 – improves performance on LegalBench across all model sizes, with an aggregate increase of 15 points or 50% over Flan-T5 for the base size. No model size shows performance drops in MMLU. We publish LawInstruct as a resource for further study of instruction tuning in the legal domain.
pdf
bib
abs
Stephanie: Step-by-Step Dialogues for Mimicking Human Interactions in Social Conversations
Hao Yang
|
Hongyuan Lu
|
Xinhua Zeng
|
Yang Liu
|
Xiang Zhang
|
Haoran Yang
|
Yumeng Zhang
|
Shan Huang
|
Yiran Wei
|
Wai Lam
In the rapidly evolving field of natural language processing, dialogue systems primarily employ a single-step dialogue paradigm. Although this paradigm is commonly adopted, it lacks the depth and fluidity of human interactions and does not appear natural. We introduce a novel **Step**-by-Step Dialogue Paradigm (Stephanie), designed to mimic the ongoing dynamic nature of human conversations. By employing a dual learning strategy and a further-split post-editing method, we generated and utilized a high-quality step-by-step dialogue dataset to fine-tune existing large language models, enabling them to perform step-by-step dialogues. We thoroughly present Stephanie. Tailored automatic and human evaluations are conducted to assess its effectiveness compared to the traditional single-step dialogue paradigm. We will release code, Stephanie datasets, and Stephanie LLMs to facilitate the future of chatbot eras.
pdf
bib
abs
ConShift: Sense-based Language Variation Analysis using Flexible Alignment
Clare Arrington
|
Mauricio Gruppi
|
Sibel Adali
We introduce ConShift, a family of alignment-based algorithms that enable semantic variation analysis at the sense-level. Using independent senses of words induced from the context of tokens in two corpora, sense-enriched word embeddings are aligned using self-supervision and a flexible matching mechanism. This approach makes it possible to test for multiple sense-level language variations such as sense gain/presence, loss/absence and broadening/narrowing, while providing explanation of the changes through visualization of related concepts. We illustrate the utility of the method with sense- and word-level semantic shift detection results for multiple evaluation datasets in diachronic settings and dialect variation in the synchronic setting.
pdf
bib
Breaking the Stigma! Unobtrusively Probe Symptoms in Depression Disorder Diagnosis Dialogue
Jieming Cao
|
Chen Huang
|
Yanan Zhang
|
Ruibo Deng
|
Jincheng Zhang
|
Wenqiang Lei
pdf
bib
abs
ToVo: Toxicity Taxonomy via Voting
Tinh Son Luong
|
Thanh-Thien Le
|
Thang Viet Doan
|
Linh Ngo Van
|
Thien Huu Nguyen
|
Nguyen Thi Ngoc Diep
Existing toxic detection models face significant limitations, such as lack of transparency, customization, and reproducibility. These challenges stem from the closed-source nature of their training data and the paucity of explanations for their evaluation mechanism. To address these issues, we propose a dataset creation mechanism that integrates voting and chain-of-thought processes, producing a high-quality open-source dataset for toxic content detection. Our methodology ensures diverse classification metrics for each sample and includes both classification scores and explanatory reasoning for the classifications.We utilize the dataset created through our proposed mechanism to train our model, which is then compared against existing widely-used detectors. Our approach not only enhances transparency and customizability but also facilitates better fine-tuning for specific use cases. This work contributes a robust framework for developing toxic content detection models, emphasizing openness and adaptability, thus paving the way for more effective and user-specific content moderation solutions.
pdf
bib
abs
HALLUCANA: Fixing LLM Hallucination with A Canary Lookahead
Tianyi Li
|
Erenay Dayanik
|
Shubhi Tyagi
|
Andrea Pierleoni
In this paper, we present HALLUCANA, a canary lookahead to detect and correct factual hallucinations of Large Language Models (LLMs) in long-form generation. HALLUCANA detects and intervenes as soon as traces of hallucination emerge, during and even before generation. To support timely detection, we exploit the internal factuality representation in the LLM hidden space, where we investigate various proxies to the LLMs’ factuality self-assessment, and discuss its relation to the models’ context familiarity from their pre-training. On biography generation, our method improves generation quality by up to 2.5x, while consuming over 6 times less compute.
pdf
bib
abs
Enhancing Adversarial Transferability in Visual-Language Pre-training Models via Local Shuffle and Sample-based Attack
Xin Liu
|
Aoyang Zhou
|
Kun He
Visual-Language Pre-training (VLP) models have achieved significant performance across various downstream tasks. However, they remain vulnerable to adversarial examples. While prior efforts focus on improving the adversarial transferability of multimodal adversarial examples through cross-modal interactions, these approaches suffer from overfitting issues, due to a lack of input diversity by relying excessively on information from adversarial examples in one modality when crafting attacks in another. To address this issue, we draw inspiration from strategies in some adversarial training methods and propose a novel attack called Local Shuffle and Sample-based Attack (LSSA). LSSA randomly shuffles one of the local image blocks, thus expanding the original image-text pairs, generating adversarial images, and sampling around them. Then, it utilizes both the original and sampled images to generate the adversarial texts. Extensive experiments on multiple models and datasets demonstrate that LSSA significantly enhances the transferability of multimodal adversarial examples across diverse VLP models and downstream tasks. Moreover, LSSA outperforms other advanced attacks on Large Vision-Language Models.
pdf
bib
abs
Dis2Dis: Explaining Ambiguity in Fact-Checking
Ieva Staliunaite
|
Andreas Vlachos
Ambiguity is a linguistic tool for encoding information efficiently, yet it also causes misunderstandings and disagreements. It is particularly relevant to the domain of misinformation, as fact-checking ambiguous claims is difficult even for experts. In this paper we argue that instead of predicting a veracity label for which there is genuine disagreement, it would be more beneficial to explain the ambiguity. Thus, this work introduces claim disambiguation, a constrained generation task, for explaining ambiguous claims in fact-checking. This involves editing them to spell out an interpretation that can then be unequivocally supported by the given evidence. We collect a dataset of 1501 such claim revisions and conduct experiments with sequence-to-sequence models. The performance is compared to a simple copy baseline and a Large Language Model baseline. The best results are achieved by employing Minimum Bayes Decoding, with a BertScore F1 of 92.22. According to human evaluation, the model successfully disambiguates the claims 72% of the time.
pdf
bib
abs
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Xiyao Wang
|
Jiuhai Chen
|
Zhaoyang Wang
|
Yuhang Zhou
|
Yiyang Zhou
|
Huaxiu Yao
|
Tianyi Zhou
|
Tom Goldstein
|
Parminder Bhatia
|
Taha Kass-Hout
|
Furong Huang
|
Cao Xiao
Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there remains significant room for improvement in aligning visual and language modalities. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. In this paper, we propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies. SIMA leverages existing vision instruction tuning datasets to self-generate responses, incorporating an in-context self-critic mechanism that constructs preference pairs for tuning. Crucially, our approach allows LVLMs to act as critics by designing effective critic prompts, eliminating the need for additional fine-tuning with external instruction data. We introduce three novel visual metrics within the self-critic process to guide judgement, significantly improving the accuracy of self-critic. Through extensive experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA significantly improves LVLM’s performance and outperforms previous approaches, achieving superior modality alignment.
pdf
bib
abs
RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process
Peiran Wang
|
Xiaogeng Liu
|
Chaowei Xiao
In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pre-training and fine-tuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process involves integrating the decomposition of the jailbreak prompt into the user’s original query into a one-shot learning example to effectively teach the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user’s prompt in a manner that aligns with its ethical guidelines. RePD is versatile and compatible with a variety of open-source LLMs acting as agents. Through comprehensive experimentation with both harmful and benign prompts, we have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.
pdf
bib
abs
ChatCRS: Incorporating External Knowledge and Goal Guidance for LLM-based Conversational Recommender Systems
Chuang Li
|
Yang Deng
|
Hengchang Hu
|
Min-Yen Kan
|
Haizhou Li
This paper aims to efficiently enable large language models (LLMs) to use external knowledge and goal guidance in conversational recommender system (CRS) tasks. Advanced LLMs (e.g., ChatGPT) are limited in domain-specific CRS tasks for 1) generating grounded responses with recommendation-oriented knowledge, or 2) proactively leading the conversations through different dialogue goals. In this work, we first analyze those limitations through a comprehensive evaluation, showing the necessity of external knowledge and goal guidance which contribute significantly to the recommendation accuracy and language quality. In light of this finding, we propose a novel ChatCRS framework to decompose the complex CRS task into several sub-tasks through the implementation of 1) a knowledge retrieval agent using a tool-augmented approach to reason over external Knowledge Bases and 2) a goal-planning agent for dialogue goal prediction. Experimental results on two multi-goal CRS datasets reveal that ChatCRS sets new state-of-the-art benchmarks, improving language quality of informativeness by 17% and proactivity by 27%, and achieving a tenfold enhancement in recommendation accuracy.
pdf
bib
abs
Data-Efficiently Learn Large Language Model for Universal 3D Scene Perception
Zehan Wang
|
Haifeng Huang
|
Yang Zhao
|
Ziang Zhang
|
Tao Jin
|
Zhou Zhao
3D scene understanding has gained significant attention due to its wide range of applications. However, existing methods for 3D scene understanding are limited to specific downstream tasks, which hinders their practicality in real-world applications. This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs to achieve the first universal dialogue systems for 3D scenes. Specifically, we align 3D representations into the feature space of LLMs, thus enabling LLMs to perceive the 3D world. Given the scarcity of 3D scene-text data, we propose a three-stage training strategy to efficiently utilize the available data for better alignment. To enhance the reasoning ability and develop a user-friendly interaction scheme, we further construct a high-quality object-centric 3D instruction dataset and design an associated object-centric prompt. With limited data, Chat-3D achieves a 82.2% relative score compared with GPT-4 on the constructed instruction dataset, and comparable performance to state-of-the-art LLM-based methods.
pdf
bib
abs
UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model
Zhaowei Li
|
Wei Wang
|
YiQing Cai
|
Qi Xu
|
Pengyu Wang
|
Dong Zhang
|
Hang Song
|
Botian Jiang
|
Zhida Huang
|
Tao Wang
Significant advancements has recently been achieved in the field of multi-modal large language models (MLLMs), demonstrating their remarkable capabilities in understanding and reasoning across diverse tasks. However, these models are often trained for specific tasks and rely on task-specific input-output formats, limiting their applicability to a broader range of tasks. This raises a fundamental question: Can we develop a unified approach to represent and handle different multi-modal tasks to maximize the generalizability of MLLMs? In this paper, we propose UnifiedMLLM, a comprehensive model designed to represent various tasks using a unified representation. Our model exhibits strong capabilities in comprehending the implicit intent of user instructions and preforming reasoning. In addition to generating textual responses, our model also outputs task tokens and grounding tokens, serving as indicators of task types and task granularity. These outputs are subsequently routed through the task router and directed to specific expert models for task completion. To train our model, we construct a task-specific dataset and an 100k multi-task dataset encompassing complex scenarios. Employing a three-stage training strategy, we equip our model with robust reasoning and task processing capabilities while preserving its generalization capacity and knowledge reservoir. Extensive experiments showcase the impressive performance of our unified representation approach across various tasks, surpassing existing methodologies. Furthermore, our approach exhibits exceptional scalability and generality.
pdf
bib
abs
PEMV: Improving Spatial Distribution for Emotion Recognition in Conversations Using Proximal Emotion Mean Vectors
Chen Lin
|
Fei Li
|
Donghong Ji
|
Chong Teng
Emotion Recognition in Conversation (ERC) aims to identify the emotions expressed in each utterance within a dialogue. Existing research primarily focuses on the analysis of contextual structure in dialogue and the interactions between different emotions. Nonetheless, ERC datasets often contain difficult-to-classify samples and suffer from imbalanced label distributions, which pose challenges to the spatial distribution of dialogue features. To tackle this issue, we propose a method that generates Proximal Emotion Mean Vectors (PEMV) based on emotion feature queues to optimize the spatial representation of text features. We design a Center Loss based on PEMVs to pull hard-to-classify samples closer to their respective category centers and employ Angle Loss to maximize the angular separation between different PEMVs. Furthermore, we utilize PEMV as a classifier to better adapt to the spatial structure of dialogue features. Extensive experiments on three widely used benchmark datasets demonstrate that our method achieves state-of-the-art performance and validates its effectiveness in optimizing feature space representations.
pdf
bib
abs
DiscoverGPT: Multi-task Fine-tuning Large Language Model for Related Table Discovery
Xuming Hu
|
Xiao Qin
|
Chuan Lei
|
Asterios Katsifodimos
|
Zhengyuan Shen
|
Balasubramaniam Srinivasan
|
Huzefa Rangwala
Natural language understanding over tabular data has played a significant role in data discovery tasks such as joinable and unionable table search. State-of-the-art approaches adopt large language models (LLMs) pre-trained over massive text corpora to learn and evaluate the table semantic relatedness. Existing methods typically follow a pretrain-and-finetune paradigm, namely fine-tuning an LLM using tabular data with table relatedness labels. To enhance model’s understanding of tabular data, recent studies include auxiliary tasks such as entity resolution and column type classification in the fine-tuning phase. In spite of achieving performance gain from these supervisions, there is a lack of study on how these supervisions complement or even contrast each other, leading to a subpar performance on the final data discovery tasks. In this paper, we propose a simple yet effective multi-task fine-tuning framework named DiscoverGPT that holistically discovers and leverages the intricate relationships among the supervisions to optimize the performance on the data discovery task. Moreover, DiscoverGPT is plug-and-play that allows a broad range of open-domain auxiliary tasks to be incorporated, by utilizing the generative power of LLMs. We demonstrate the usability and effectiveness of DiscoverGPT with baseline comparisons and ablation studies. DiscoverGPT outperforms the best performing baseline by up to 7% in F1 score.
pdf
bib
abs
Can GPT-4 Sway Experts’ Investment Decisions?
Takehiro Takayanagi
|
Hiroya Takamura
|
Kiyoshi Izumi
|
Chung-Chi Chen
In the post-Turing era, evaluating large language models (LLMs) involves assessing generated text based on readers’ decisions rather than merely its indistinguishability from human-produced content. This paper explores how LLM-generated text impacts readers’ decisions, focusing on both amateur and expert audiences. Our findings indicate that GPT-4 can generate persuasive analyses affecting the decisions of both amateurs and professionals. Furthermore, we evaluate the generated text from the aspects of grammar, convincingness, logical coherence, and usefulness. The results highlight a high correlation between real-world evaluation through audience decisions and the current multi-dimensional evaluators commonly used for generative models. Overall, this paper shows the potential and risk of using generated text to sway human decisions and also points out a new direction for evaluating generated text, i.e., leveraging the decisions of readers. We release our dataset to assist future research.
pdf
bib
abs
PolyJoin: Semantic Multi-key Joinable Table Search in Data Lakes
Xuming Hu
|
Chuan Lei
|
Xiao Qin
|
Asterios Katsifodimos
|
Christos Faloutsos
|
Huzefa Rangwala
Given a query table, how can we effectively discover multi-key joinable tables on the web? This can be seen as a retrieval task, where users can lookup on the web for tables related to an existing one. Searching and discovering such joinable tables is critical to data analysts and data scientists for reporting, establishing correlations and training machine learning models. Existing joinable table search methods have mostly focused on single key (unary) joins, where a single column is the join key. However, these methods are ineffective when dealing with join keys composed of multiple columns (n-ary joins), which are prevalent on web table corpora. In this paper, we introduce PolyJoin, which finds multi-key semantically-joinable tables on the web, given a query table. PolyJoin employs a multi-key encoder and a novel self-supervised training method to generate the representations of multiple join keys, preserving the alignment across multiple columns. In particular, PolyJoin is equipped with a hierarchical contrastive learning technique to further enhance the model’s semantic understanding of multi-key joinable tables. PolyJoin outperforms the state-of-the-art methods by 2.89% and 3.67% with respect to MAP@30 and R@30 on two real-world web table benchmarks, respectively.
pdf
bib
abs
Marrying LLMs with Dynamic Forecasting: A Graph Mixture-of-expert Perspective
Dapeng Jiang
|
Xiao Luo
Dynamical system modeling is a crucial area of research in machine learning with extensive applications in physics and social science. Recent data-driven approaches often employ graph neural networks (GNNs) to learn relationships in dynamical systems using message passing mechanisms. Despite their advancements, these methods often suffer from performance degradation when it comes to potential environmental change with distribution shifts in real-world applications. In this work, we propose a new perspective which leverages large language models (LLMs) to enhance the generalization capabilities of dynamical system modeling. In particular, we develop a novel framework named LLM Judge with Graph Mixture-of-expert LEGO which incorporates multiple graph experts to learn diverse dynamics within the systems. More importantly, LEGO utilizes LLMs with hierarchical prompts at object, edge, and system levels as a context-aware routing function to determine which experts carry the most relevant information to different environments. The whole framework is optimized by updating the weights and expert parameters in an alternative fashion. Extensive experiments across various datasets demonstrate the effectiveness of our proposed LEGO in comparison to extensive baselines.
pdf
bib
abs
DialogGen: Multi-modal Interactive Dialogue System with Multi-turn Text-Image Generation
Minbin Huang
|
Yanxin Long
|
Xinchi Deng
|
Ruihang Chu
|
Jiangfeng Xiong
|
Xiaodan Liang
|
Hong Cheng
|
Qinglin Lu
|
Wei Liu
Text-to-image (T2I) generation models have significantly advanced in recent years. However, effective interaction with these models is challenging for average users due to the need for specialized prompt engineering knowledge and the inability to perform multi-turn image generation, hindering a dynamic and iterative creation process. Recent attempts have tried to equip Multi-modal Large Language Models (MLLMs) with T2I models to bring the user’s natural language instructions into reality. Hence, the output modality of MLLMs is extended, and the multi-turn generation quality of T2I models is enhanced thanks to the strong multi-modal comprehension ability of MLLMs. However, many of these works face challenges in identifying correct output modalities and generating coherent images accordingly as the number of output modalities increases and the conversations go deeper. Therefore, we propose DialogGen, an effective pipeline to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System (MIDS) for multi-turn Text-to-Image generation. It is composed of drawing prompt alignment, careful training data curation, and error correction. Moreover, as the field of MIDS flourishes, comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms of output modality correctness and multi-modal output coherence. To address this issue, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a comprehensive bilingual benchmark designed to assess the ability of MLLMs to generate accurate and coherent multi-modal content that supports image editing. It contains two evaluation metrics to measure the model’s ability to switch modalities and the coherence of the output images. Our extensive experiments on DialogBen and user study demonstrate the effectiveness of DialogGen in producing correct output modalities and coherent multi-modal outputs compared with other State-of-the-Art models. We hope that DialogBen can contribute to the community for building more powerful MIDS.
pdf
bib
abs
RELexED: Retrieval-Enhanced Legal Summarization with Exemplar Diversity
Santosh T.y.s.s
|
Chen Jia
|
Patrick Goroncy
|
Matthias Grabmair
This paper addresses the task of legal summarization, which involves distilling complex legal documents into concise, coherent summaries. Current approaches often struggle with content theme deviation and inconsistent writing styles due to their reliance solely on source documents. We propose RELexED, a retrieval-augmented framework that utilizes exemplar summaries along with the source document to guide the model. RELexED employs a two-stage exemplar selection strategy, leveraging a determinantal point process to balance the trade-off between similarity of exemplars to the query and diversity among exemplars, with scores computed via influence functions. Experimental results on two legal summarization datasets demonstrate that RELexED significantly outperforms models that do not utilize exemplars and those that rely solely on similarity-based exemplar selection.
pdf
bib
abs
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
Shangda Wu
|
Yashan Wang
|
Ruibin Yuan
|
Guo Zhancheng
|
Xu Tan
|
Ge Zhang
|
Monan Zhou
|
Jing Chen
|
Xuefeng Mu
|
Yuejie Gao
|
Yuanliang Dong
|
Jiafeng Liu
|
Xiaobing Li
|
Feng Yu
|
Maosong Sun
Challenges in managing linguistic diversity and integrating various musical modalities are faced by current music information retrieval systems. These limitations reduce their effectiveness in a global, multimodal music environment. To address these issues, we introduce CLaMP 2, a system compatible with 101 languages that supports both ABC notation (a text-based musical notation format) and MIDI (Musical Instrument Digital Interface) for music information retrieval. CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text triplets, includes a multilingual text encoder and a multimodal music encoder aligned via contrastive learning. By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale, significantly reducing textual noise and balancing language distribution. Our experiments show that CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities, thus establishing a new standard for inclusive and global music information retrieval.
pdf
bib
abs
LogRules: Enhancing Log Analysis Capability of Large Language Models through Rules
Xin Huang
|
Ting Zhang
|
Wen Zhao
Currently, large language models (LLMs) have achieved impressive performance in natural language processing tasks. However, LLMs still exhibit many hallucinations when analyzing system logs, which is due to the implicit knowledge and rules in logs that LLMs cannot capture. Based on this, we propose LogRules, a lightweight log analysis framework that generates and utilizes rules through LLMs. LogRules consists of three stages: an induction stage, an alignment stage, and a reasoning stage. Firstly, in the induction stage, an strong LLM (e.g., GPT-4o-mini) is tasked with generating a series of rules related to logs, which are then validated on the training set. When the rules are confirmed to produce correct reasoning results, they are added to a rule repository. Secondly, considering that the LLMs with small size (≈8B parameters) still face challenges in utilizing rules, we design an alignment method based on rule-case contrastive preference optimization (CPO) to effectively enhance the rule reasoning capabilities of these LLMs. Finally, in the reasoning stage, the LLM constructs prompt using the rule repository and performs log analysis on the test set. Experiments show that LogRules outperforms LLM-based methods in log parsing and anomaly detection tasks, and achieves better performance compared to case-based methods.
pdf
bib
abs
Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies
Yingqiang Gao
|
Lukas Fischer
|
Alexa Lintner
|
Sarah Ebling
Audio descriptions (ADs) function as acoustic commentaries designed to assist blind persons and persons with visual impairments in accessing digital media content on television and in movies, among other settings. As an accessibility service typically provided by trained AD professionals, the generation of ADs demands significant human effort, making the process both time-consuming and costly. Recent advancements in natural language processing (NLP) and computer vision (CV), particularly in large language models (LLMs) and vision-language models (VLMs), have allowed for getting a step closer to automatic AD generation. This paper reviews the technologies pertinent to AD generation in the era of LLMs and VLMs: we discuss how state-of-the-art NLP and CV technologies can be applied to generate ADs and identify essential research directions for the future.
pdf
bib
abs
Adaptive Retrieval-Augmented Generation for Conversational Systems
Xi Wang
|
Procheta Sen
|
Ruizhe Li
|
Emine Yilmaz
With the success of integrating large language models into the development of conversational systems, many studies have shown the effectiveness of retrieving and augmenting external knowledge for informative responses. While many existing studies agree on the necessity of Retrieval Augmented Generation (RAG), further investigation into the necessity and value of applying RAG to every turn of the conversation is needed. In this study, we propose to investigate the need for each turn of system response to be augmented with external knowledge. In particular, by leveraging human judgements on the binary choice of adaptive augmentation, we develop RAGate, a gating model, which models conversation context and relevant inputs to predict if a conversational system requires RAG for improved responses. We conduct extensive experiments on devising and applying RAGate to conversational models, joined with well-rounded analyses of various conversational scenarios. Our experimental results and analysis indicate the effective application of RAGate in RAG-based conversational systems in identifying if system responses require RAG to generate high-quality responses with high confidence. This study also identifies and shows the correlation between the generation’s confidence level and the relevance of the augmented knowledge. We have also released the implementation code and resources in https://github.com/wangxieric/RAGate.
pdf
bib
abs
Multimodal Generation with Consistency Transferring
Junxiang Qiu
|
Jinda Lu
|
Shuo Wang
Multimodal content generation has become an area of considerable interest. However, existing methods are hindered by limitations related to model constraints and training strategies: (1) Most current approaches rely on training models from scratch, resulting in inefficient training processes when extending these models; (2) There is a lack of constraints on adjacent steps within the models, leading to slow sampling and poor generation stability across various sampling methods. To address the issues, we introduce Multimodal Generation with Consistency Transferring (MGCT). The method introduces two key improvements: (1) A Model Consistency Transferring (MCT) strategy to acquire low-cost prior knowledge, increasing training efficiency and avoiding error accumulation; (2) A Layer Consistency Transferring (LCT) between adjacent steps, enhancing denoising capabilities at each step and improving model stability across various generation methods. These strategies ensure the consistency of jointly generated multimodal content and improving training efficiency. Experiments show that the algorithm enhances the model’s ability to capture actions and depict backgrounds more effectively. In both the AIST++ and Landscape datasets, it improves video generation speed by approximately 40% and quality by about 39.3%, while also achieving a slight 3% improvement in audio quality over the baseline.
pdf
bib
abs
On the Impact of Noise in Differentially Private Text Rewriting
Stephen Meisenbacher
|
Maulik Chevli
|
Florian Matthes
The field of text privatization often leverages the notion of *Differential Privacy* (DP) to provide formal guarantees in the rewriting or obfuscation of sensitive textual data. A common and nearly ubiquitous form of DP application necessitates the addition of calibrated noise to vector representations of text, either at the data- or model-level, which is governed by the privacy parameter 𝜀. However, noise addition almost undoubtedly leads to considerable utility loss, thereby highlighting one major drawback of DP in NLP. In this work, we introduce a new sentence infilling privatization technique, and we use this method to explore the effect of noise in DP text rewriting. We empirically demonstrate that non-DP privatization techniques excel in utility preservation and can find an acceptable empirical privacy-utility trade-off, yet cannot outperform DP methods in empirical privacy protections. Our results highlight the significant impact of noise in current DP rewriting mechanisms, leading to a discussion of the merits and challenges of DP in NLP as well as the opportunities that non-DP methods present.
pdf
bib
abs
Teaching Large Language Models Number-Focused Headline Generation With Key Element Rationales
Zhen Qian
|
Xiuzhen Zhang
|
Xiaofei Xu
|
Feng Xia
Number-focused headline generation is a summarization task requiring both high textual quality and precise numerical accuracy, which poses a unique challenge for Large Language Models (LLMs). Existing studies in the literature focus only on either textual quality or numerical reasoning and thus are inadequate to address this challenge. In this paper, we propose a novel chain-of-thought framework for using rationales comprising key elements of the Topic, Entities, and Numerical reasoning (TEN) in news articles to enhance the capability for LLMs to generate topic-aligned high-quality texts with precise numerical accuracy. Specifically, a teacher LLM is employed to generate TEN rationales as supervision data, which are then used to teach and fine-tune a student LLM. Our approach teaches the student LLM automatic generation of rationales with enhanced capability for numerical reasoning and topic-aligned numerical headline generation. Experiments show that our approach achieves superior performance in both textual quality and numerical accuracy.
pdf
bib
abs
Zero-Shot Strategies for Length-Controllable Summarization
Fabian Retkowski
|
Alexander Waibel
Large language models (LLMs) struggle with precise length control, particularly in zero-shot settings. We conduct a comprehensive study evaluating LLMs’ length control capabilities across multiple measures and propose practical methods to improve controllability. Our experiments with LLaMA 3 reveal stark differences in length adherence across measures and highlight inherent biases of the model. To address these challenges, we introduce a set of methods: length approximation, target adjustment, sample filtering, and automated revisions. By combining these methods, we demonstrate substantial improvements in length compliance while maintaining or enhancing summary quality, providing highly effective zero-shot strategies for precise length control without the need for model fine-tuning or architectural changes. With our work, we not only advance our understanding of LLM behavior in controlled text generation but also pave the way for more reliable and adaptable summarization systems in real-world applications.
pdf
bib
abs
SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials
Wonjoong Kim
|
Sangwu Park
|
Yeonjun In
|
Seokwon Han
|
Chanyoung Park
Recently, interpreting complex charts with logical reasoning has emerged as challenges due to the development of vision-language models. A prior state-of-the-art (SOTA) model has presented an end-to-end method that leverages the vision-language model to convert charts into table format utilizing Large Language Model (LLM) for reasoning. However, unlike natural images, charts contain a mix of essential and irrelevant information required for chart reasoning, and we discover that this characteristic can lower the performance of chart-to-table extraction. In this paper, we introduce SIMPLOT, a method designed to extract only the elements necessary for chart reasoning. The proposed method involves two steps: 1) training to mimic a simple plot that contains only the essential information from a complex chart for table extraction, followed by 2) performing reasoning based on the table. Our model enables accurate chart reasoning without the need for additional annotations or datasets, and its effectiveness is demonstrated through various experiments.
pdf
bib
abs
InstructAny2Pix: Image Editing with Multi-Modal Prompts
Shufan Li
|
Harkanwar Singh
|
Aditya Grover
Image Editing has made incredible progress in recent years. Earliest work only supported caption-guided editing. Recently, free-form text instructions and reference images are incorporated to allow more flexibility. However, existing methods still struggle with complicated editing instructions involving multiple objects or reference images. We present InstructAny2Pix, a novel image editing model that leverages a multi-modal LLM to execute complicated edit instructions. Compared with previous, works, InstructAny2Pix extends the flexibility of edit instructions in three ways: First, it can perform complex instructions involving multiple object edits; Second, it supports interleaving text instructions with multiple reference images; Third, it supports audio and music inputs as part of edit prompts, unlocking many creative applications, such as album cover generation and music-inspired merchandise design. To evaluate the effectiveness of InstructAny2Pix, we propose two new benchmark datasets MM-Inst and Dream-booth++ consisting of human written, multi-modal prompts. InstructAny2Pix outperforms baselines in these two proposed multi-modal benchmarks, as well as conventional image editing benchmarks such as InstructPix2Pix.
pdf
bib
abs
Lost in Overlap: Exploring Logit-based Watermark Collision in LLMs
Yiyang Luo
|
Ke Lin
|
Chao Gu
|
Jiahui Hou
|
Lijie Wen
|
Luo Ping
The proliferation of large language models (LLMs) in generating content raises concerns about text copyright. Watermarking methods, particularly logit-based approaches, embed imperceptible identifiers into text to address these challenges. However, the widespread usage of watermarking across diverse LLMs has led to an inevitable issue known as watermark collision during common tasks, such as paraphrasing or translation.In this paper, we introduce watermark collision as a novel and general philosophy for watermark attacks, aimed at enhancing attack performance on top of any other attacking methods. We also provide a comprehensive demonstration that watermark collision poses a threat to all logit-based watermark algorithms, impacting not only specific attack scenarios but also downstream applications.
pdf
bib
abs
Prompt-Guided Selective Masking Loss for Context-Aware Emotive Text-to-Speech
Yejin Jeon
|
Youngjae Kim
|
Jihyun Lee
|
Gary Lee
Emotional dialogue speech synthesis (EDSS) aims to generate expressive speech by leveraging the dialogue context between interlocutors. This is typically done by concatenating global representations of previous utterances as conditions for text-to-speech (TTS) systems. However, such approaches overlook the importance of integrating localized acoustic cues that convey emotion. To address this, we introduce a novel approach that utilizes a large language model (LLM) to generate holistic emotion tags based on prior dialogue context, while also pinpointing key words in the target utterance that align with the predicted emotional state. Furthermore, we enhance the emotional richness of synthesized speech by incorporating concentrated acoustic features of these key words through a novel selective audio masking loss function. This methodology not only improves emotional expressiveness, but also facilitates automatic emotion speech generation during inference by eliminating the need for manual emotion tag selection. Comprehensive subjective and objective evaluations and analyses demonstrate the effectiveness of the proposed approach.
pdf
bib
abs
Identifying and Mitigating Social Bias Knowledge in Language Models
Ruizhe Chen
|
Yichen Li
|
Jianfei Yang
|
Yang Feng
|
Joey Tianyi Zhou
|
Jian Wu
|
Zuozhu Liu
Generating fair and accurate predictions plays a pivotal role in deploying pre-trained language models (PLMs) in the real world. However, existing debiasing methods may inevitably generate incorrect or nonsensical predictions as they are designed and evaluated to achieve parity across different social groups but leave aside individual commonsense facts, resulting in modified knowledge that elicits unreasonable or undesired predictions. This paper introduces a novel debiasing framework that first identifies the encoding locations of biases within language models and then applies the Fairness-Stamp (FAST). FAST focuses on fine-grained, individual bias mitigation and integrates a lightweight network into PLMs, specifically targeting identified biases while preserving essential knowledge and maintaining factual integrity. We also present BiaScope, a new benchmark comprising datasets and metrics designed to evaluate the retention of commonsense knowledge and the generalization across paraphrased social biases. Our extensive experiments across multiple datasets demonstrate that FAST surpasses state-of-the-art baselines with superior debiasing performance while not compromising the overall model capability for knowledge retention and downstream predictions. This highlights the potential of fine-grained debiasing strategies to achieve fairness in PLMs. Code will be publicly available.
pdf
bib
DiaSynth: Synthetic Dialogue Generation Framework for Low Resource Dialogue Applications
Sathya Krishnan Suresh
|
Wu Mengjun
|
Tushar Pranav
|
EngSiong Chng
pdf
bib
abs
Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs
Duygu Nur Yaldiz
|
Yavuz Faruk Bakman
|
Baturalp Buyukates
|
Chenyang Tao
|
Anil Ramakrishna
|
Dimitrios Dimitriadis
|
Jieyu Zhao
|
Salman Avestimehr
Uncertainty estimation (UE) of generative large language models (LLMs) is crucial for evaluating the reliability of generated sequences. A significant subset of UE methods utilize token probabilities to assess uncertainty, aggregating multiple token probabilities into a single UE score using a scoring function. Existing scoring functions for probability-based UE, such as length-normalized scoring and semantic contribution-based weighting, are designed to solve certain aspects of the problem but exhibit limitations, including the inability to handle biased probabilities and complex semantic dependencies between tokens. To address these issues, in this work, we propose Learnable Response Scoring (LARS) function, a novel scoring function that leverages supervised data to capture complex dependencies between tokens and probabilities, thereby producing more reliable and calibrated response scores in computing the uncertainty of LLM generations. Our comprehensive experiments across question-answering and arithmetical reasoning tasks with various datasets demonstrate that LARS significantly outperforms existing scoring functions, achieving improvements of up to 16% AUROC score.
pdf
bib
abs
Joint Learning Event-Specific Probe and Argument Library with Differential Optimization for Document-Level Multi-Event Extraction
Jianpeng Hu
|
Chao Xue
|
Chunqing Yu
|
JiaCheng Xu
|
Chengxiang Tan
Document-level multi-event extraction aims to identify a list of event types and corresponding arguments from the document. However, most of the current methods neglect the fine-grained difference among events in multi-event documents, which leads to event confusion and missing. This is also one of the reasons why the recall and F1-score of multi-event recognition are lower compared to single-event recognition. In this paper, we propose an event-specific probe-based method to sniff multiple events by querying each corresponding argument library, which uses a novel probe-label alignment method for differential optimization. In addition, the role contrastive loss and probe consistent loss are designed to fine-tune the fine-grained role differences and probe differences in each event. The experimental results on two general datasets show that our method outperforms the state-of-the-art method in the F1-score, especially in the recall of multi-events.
pdf
bib
abs
Synonym-unaware Fast Adversarial Training against Textual Adversarial Attacks
Yichen Yang
|
Xin Liu
|
Kun He
Numerous adversarial defense methods have been proposed to strengthen the robustness of Natural Language Processing (NLP) models against adversarial attacks. However, many of these methods rely on predetermined linguistic knowledge and assume that attackers’ synonym candidates are known, which is often unrealistic. In this work, we investigate adversarial training in the embedding space and introduce a Fast Adversarial Training (FAT) method to improve the model robustness without requiring synonym awareness. FAT leverages single-step perturbation generation and effective perturbation initialization based on two key insights: (1) adversarial perturbations generated by single-step and multi-step gradient ascent are similar, and (2) perturbations generated on the same training sample across successive epochs exhibit resemblance. By employing single-step gradient ascent and leveraging historical perturbation information, FAT not only expedites the training process but also efficiently initializes perturbations. Extensive experiments demonstrate that FAT significantly enhances the robustness of popular NLP models under scenarios where synonyms are unknown, outperforming other defense baselines under various character-level and word-level attacks.
pdf
bib
abs
Tethering Broken Themes: Aligning Neural Topic Models with Labels and Authors
Mayank Nagda
|
Phil Ostheimer
|
Sophie Fellenz
Topic models are a popular approach for extracting semantic information from large document collections. However, recent studies suggest that the topics generated by these models often do not align well with human intentions. Although metadata such as labels and authorship information are available, it has not yet been effectively incorporated into neural topic models. To address this gap, we introduce FANToM, a novel method to align neural topic models with both labels and authorship information. FANToM allows for the inclusion of this metadata when available, producing interpretable topics and author distributions for each topic. Our approach demonstrates greater expressiveness than conventional topic models by learning the alignment between labels, topics, and authors. Experimental results show that FANToM improves existing models in terms of both topic quality and alignment. Additionally, it identifies author interests and similarities.
pdf
bib
abs
Towards Zero-Shot Multimodal Machine Translation
Matthieu Futeral
|
Cordelia Schmid
|
Benoît Sagot
|
Rachel Bawden
Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e sentences with their translations and accompanying images), which is costly to collect and prevents the extension of MMT to language pairs with no such data. We propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method ( ZeroMMT) consists in adapting a strong text-only machine translation (MT) model by training it jointly on two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original MT and new MMT outputs. We evaluate on standard MMT benchmarks and on CoMMuTE, a contrastive test set designed to evaluate how well models use images to disambiguate translations. ZeroMMT obtains disambiguation results close to state-of-the-art MMT models trained on fully supervised examples. To prove that ZeroMMT generalizes to languages with no fully supervised training data, we extend CoMMuTE to three new languages: Arabic, Russian and Chinese. We also show that we can control the trade-off between disambiguation capabilities and translation fidelity at inference time using classifier-free guidance and without any additional data. Our code, data and trained models are publicly accessible.
pdf
bib
abs
Large-Scale Corpus Construction and Retrieval-Augmented Generation for Ancient Chinese Poetry: New Method and Data Insights
Yang Liu
|
Lan Lan
|
Jiahuan Cao
|
Hiuyi Cheng
|
Kai Ding
|
Lianwen Jin
Ancient Chinese Poetry (ACP), a critical aspect of Chinese cultural heritage, presents unique challenges for Large Language Models (LLMs). One of the most pressing challenges is the significant hallucination issues faced by LLMs due to data scarcity and limited ability of general LLMs when dealing with ACP. To address these challenges, this paper constructs the ACP-Corpus, which encompasses 1.1 million ancient poems and 990K related texts, designed to enhance the training and performance of LLMs. Alongside this, we develop the ACP-QA dataset, comprising over 12 million question-answer pairs across 24 task categories, and the ACP-Eval dataset for rigorous evaluation purposes, containing 7,050 entries. Building on this resources, we propose the ACP-RAG framework, a specialized Retrieval-Augmented Generation (RAG) approach that significantly improves the performance of LLMs in the domain of ancient poetry from 49.2% to 89.0%. The ACP-RAG contains five modules of semantic coarse-grained retrieval, semantic fine-grained retrieval, keyword retrieval, keyword matching, and context filtering. Experiments show that ACP-RAG achieves a promising response accuracy of 89.0%, surpassing existing LLMs by a remarkable margin. We believe this work not only advances the capabilities of LLMs in processing ancient Chinese poetry but also contributes to the preservation and innovative development within this rich literary tradition. The datasets and code are available at https://github.com/SCUT-DLVCLab/ACP-RAG.
pdf
bib
abs
OpenBioNER: Lightweight Open-Domain Biomedical Named Entity Recognition Through Entity Type Description
Alessio Cocchieri
|
Giacomo Frisoni
|
Marcos Martínez Galindo
|
Gianluca Moro
|
Giuseppe Tagliavini
|
Francesco Candoli
Biomedical Named Entity Recognition (BioNER) faces significant challenges in real-world applications due to limited annotated data and the constant emergence of new entity types, making zero-shot learning capabilities crucial. While Large Language Models (LLMs) possess extensive domain knowledge necessary for specialized fields like biomedicine, their computational costs often make them impractical. To address these challenges, we introduce OpenBioNER, a lightweight BERT-based cross-encoder architecture that can identify any biomedical entity using only its description, eliminating the need for retraining on new, unseen entity types. Through comprehensive evaluation on established biomedical benchmarks, we demonstrate that OpenBioNER surpasses state-of-the-art baselines, including specialized 7B NER LLMs and GPT-4o, achieving up to 10% higher F1 scores while using 110M parameters only. Moreover, OpenBioNER outperforms existing small-scale models that match textual spans with entity types rather than descriptions, both in terms of accuracy and computational efficiency.
pdf
bib
abs
Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum
Ryan Soh-Eun Shim
|
Barbara Plank
There is increasing interest in looking at dialects in NLP. However, most work to date still treats dialects as discrete categories. For instance, evaluative work in variation-oriented NLP for English often works with Indian English or African-American Venacular English as homogeneous categories, yet even within one variety there is substantial variation. We examine within-dialect variation and show that performance critically varies within categories. We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity. This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety. We cross-examine our results against dialectometry methods, and interpret the performance disparity to be due to a bias towards dialects that are more similar to the standard variety in the speech-to-text model examined. We additionally leverage geostatistical methods to predict zero-shot performance at unseen sites, and find the incorporation of geographical information to substantially improve prediction performance, indicating there to be geographical structure in the performance distribution.
pdf
bib
abs
Linguistically Grounded Analysis of Language Models using Shapley Head Values
Marcell Fekete
|
Johannes Bjerva
Understanding how linguistic knowledge is encoded in language models is crucial for improving their generalisation capabilities. In this paper, we investigate the processing of morphosyntactic phenomena, by leveraging a recently proposed method for probing language models via Shapley Head Values (SHVs). Using the English language BLiMP dataset, we test our approach on two widely used models, BERT and RoBERTa, and compare how linguistic constructions such as anaphor agreement and filler-gap dependencies are handled. Through quantitative pruning and qualitative clustering analysis, we demonstrate that attention heads responsible for processing related linguistic phenomena cluster together. Our results show that SHV-based attributions reveal distinct patterns across both models, providing insights into how language models organize and process linguistic information. These findings support the hypothesis that language models learn subnetworks corresponding to linguistic theory, with potential implications for cross-linguistic model analysis and interpretability in Natural Language Processing (NLP).
pdf
bib
abs
How Do Large Language Models Perform in Dynamical System Modeling
Xiao Luo
|
Binqi Chen
|
Haixin Wang
|
Zhiping Xiao
|
Ming Zhang
|
Yizhou Sun
This paper studies the problem of dynamical system modeling, which involves the evolution of multiple interacting objects. Recent data-driven methods often utilize graph neural networks (GNNs) to learn these interactions by optimizing the neural network in an end-to-end fashion. While large language models (LLMs) have shown exceptional zero-shot performance across various applications, their potential for modeling dynamical systems has not been extensively explored. In this work, we design prompting techniques for dynamical system modeling and systematically evaluate the capabilities of LLMs on two tasks, including dynamic forecasting and relational reasoning. An extensive benchmark LLM4DS across nine datasets is built for performance comparison. Our extensive experiments yield several key findings: (1) LLMs demonstrate competitive performance without training compared to state-of-the-art methods in dynamical system modeling. (2) LLMs effectively infer complex interactions among objects to capture system evolution. (3) Prompt engineering plays a crucial role in enabling LLMs to accurately understand and predict the evolution of systems.
pdf
bib
abs
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Kaichen Zhang
|
Bo Li
|
Peiyuan Zhang
|
Fanyi Pu
|
Joshua Adrian Cahyono
|
Kairui Hu
|
Shuai Liu
|
Yuanhan Zhang
|
Jingkang Yang
|
Chunyuan Li
|
Ziwei Liu
The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH that utilizes continuously updating news and online forums to assess models’ generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs.
pdf
bib
abs
Pairwise Prompt-Based Tuning with Parameter Efficient Fast Adaptation for Generalized Zero-Shot Intent Detection
Xiaotong Zhang
|
Qianru Zhou
|
Han Liu
|
Hong Yu
Generalized zero-shot intent detection (GZID) aims to recognize the labels of utterances from both seen and unseen intents by utilizing the knowledge learned from seen intents. Enhancing the generalization ability from seen intents to unseen intents is a key challenge in the GZID setting. Existing methods attempt to tackle this challenge by distinguishing unseen intents from seen intents or focusing on enhancing the model discriminability. However, the challenge is not solved substantially as they ignore to promote the representation learning ability of the model itself and neglect to strengthen the model adaptability to new tasks, resulting in overfitting on the seen intents. In this paper, we propose a pairwise prompt-based tuning model with parameter efficient fast adaptation which involves two training steps. In the first step, we leverage hybrid contrastive learning in discriminant space and masked language modeling to make predictions at both sentence and token levels, which can enhance the model discriminability and representation learning ability respectively. In the second step, we design a pipeline for generating and filtering unseen data by only providing unseen intent labels, and utilize parameter-efficient fine-tuning to quickly adapt to unseen intents. Experiments on four intent detection datasets demonstrate that our two-step training method has better comprehension and generalization capabilities.
pdf
bib
abs
FaithfulPersona: Balancing Faithfulness and Personalization in Code Explanations through Self-Critique
Zhuang Luo
|
Yichuan Li
|
Zexing Xu
|
Kyumin Lee
|
S. Rasoul Etesami
Code explanations are crucial in real-world life, from educating students to aligning technical projects with business goals. However, existing approaches face challenges balancing faithfulness to the original code and personalization for diverse user needs. This paper addresses these challenges by introducing a novel benchmark and method for generating faithful personalized code explanations. Our benchmark, FaithfulPersonaCodeX, incorporates code samples and user profiles, employing various evaluation metrics to evaluate both faithfulness and personalization. We propose DISCO, a new method that uses a self-critique mechanism and two-stage optimization to balance faithfulness and personalization in code explanations, addressing the limitations of current large language model approaches. Our proposed model, DISCO, achieves a notable 3.7% improvement in Pass@5 compared to the strong baseline method, Self-Consistency, while maintaining high personalization with a 61.08% win rate in the LLM-as-a-Judge evaluation, effectively balancing faithfulness and user-specific needs in code explanations.
pdf
bib
abs
Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering
Wei Zhou
|
Mohsen Mesgar
|
Annemarie Friedrich
|
Heike Adel
Complex table question answering (TQA) aims to answer questions that require complex reasoning, such as multi-step or multi-category reasoning, over data represented in tabular form. Previous approaches demonstrate notable performance by leveraging either closed-source large language models (LLMs) or fine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality training data, which is costly to obtain. The use of closed-source LLMs poses accessibility challenges and leads to reproducibility issues. In this paper, we propose Multi Agent Collaboration with Tool use (MACT), a framework that requires neither fine-tuning nor closed-source models. In MACT, a planning agent and a coding agent that also make use of tools collaborate for TQA. MACT outperforms previous SoTA systems on three out of four benchmarks and performs comparably to the larger and more expensive closed-source model GPT-4 on two benchmarks, even when using only open-weight models without any fine-tuning. Our extensive analyses prove the effectiveness of MACT’s multi-agent collaboration in TQA. We release our code publicly.
pdf
bib
abs
Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation
Sirui Xia
|
Xintao Wang
|
Jiaqing Liang
|
Yifei Zhang
|
Weikang Zhou
|
Jiaji Deng
|
Fei Yu
|
Yanghua Xiao
Retrieval-Augmented Generation (RAG) has been widely adopted to enhance Large Language Models (LLMs) in knowledge-intensive tasks. To enhance credibility and verifiability in RAG systems, Attributed Text Generation (ATG) is proposed, which provides citations to retrieval knowledge in LLM-generated responses. Prior methods mainly adopt coarse-grained attributions, with passage-level or paragraph-level references or citations, which fall short in verifiability. This paper proposes ReClaim(Refer & Claim), a fine-grained ATG method that alternates the generation of references and answers step by step. Different from previous coarse-grained attribution, ReClaim provides sentence-level citations in long-form question-answering tasks. With extensive experiments, we verify the effectiveness of ReClaim in extensive settings, achieving a citation accuracy rate of 90%.
pdf
bib
abs
Understanding the Role of Mental Models in User Interaction with an Adaptive Dialog Agent
Lindsey Morgan Vanderlyn
|
Dirk Väth
|
Thang Vu
Mental models play an important role in whether user interactions with intelligent systems, such as dialog agents, are successful. Adaptive dialog systems present the opportunity to align a dialog agent’s behavior with heterogeneous user expectations. However, there has been little research into what mental models users form when interacting with a task-oriented dialog system, how these models affect users’ interactions, or what role system adaptation can play in this process. This can make it challenging to avoid damage to human-AI partnership. In this work, we collect a new publicly available dataset for exploring user mental models of information seeking dialog systems. We demonstrate that users have a variety of conflicting mental models about such systems, the validity of which directly impacts the success and perception of their interactions. Furthermore, we show that adapting a dialog agent’s behavior to better align with users’ mental models, even when done implicitly, can improve dialog efficiency, success, and user perception of the interaction. This shows that implicit adaptation can be beneficial for task-oriented dialog systems, so long as developers understand the mental models of their users.
pdf
bib
abs
CoPERLex: Content Planning with Event-based Representations for Legal Case Summarization
Santosh T.y.s.s
|
Youssef Farag
|
Matthias Grabmair
Legal professionals often struggle with lengthy judgments and require efficient summarization for quick comprehension. To address this challenge, we investigate the need for structured planning in legal case summarization, particularly through event-centric representations that reflect the narrative nature of legal case documents. We propose our framework, CoPERLex, which operates in three stages: first, it performs content selection to identify crucial information from the judgment; second, the selected content is utilized to generate intermediate plans through event-centric representations modeled as Subject-Verb-Object tuples; and finally, it generates coherent summaries based on both the content and the structured plan. Our experiments on four legal summarization datasets demonstrate the effectiveness of integrating content selection and planning components, highlighting the advantages of event-centric plans over traditional entity-centric approaches in the context of legal judgements.
pdf
bib
abs
DisComp: A Two-Stage Prompt Optimization Framework Combining Task-Agnostic and Task-Aware Compression
Liu Quancai
|
Haihui Fan
|
Jinchao Zhang
|
Lixiangfang Lixiangfang
|
Lichuanrong Lichuanrong
|
Bo Li
Large language models (LLMs) exhibit exceptional performance across a wide range of natural language processing tasks, often relying on lengthy prompts to harness their full capabilities. However, extended prompts can lead to substantial computational overhead and increased hardware demands, limiting the scalability and efficiency of such models. In this paper, we propose DisComp, a two-stage prompt compression framework based on knowledge distillation that combines task-agnostic and task-aware strategies, designed to efficiently compress prompt length without compromising performance.In the first stage, task-agnostic compression is achieved through knowledge distillation, transferring the summarization capabilities of a LLM to a smaller, more efficient model. The distillation process combines cross-entropy loss and keyword matching loss to ensure the smaller model generates concise and informative summaries. In the second stage, sentence-level pruning is applied, where sentences are ranked by relevance to the query, and irrelevant sentences are pruned to retain only task-critical information. We evaluate our method on three benchmark datasets, LongBench , ZeroSCROLLS and NaturalQuestions. The results show that DisComp significantly outperforms previous task-agnostic and task-specific compression approaches, and it is up to 6.56× faster at inference compared to the best token-level compression method.
pdf
bib
abs
A Large-Scale Benchmark for Vietnamese Sentence Paraphrases
Sang Quang Nguyen
|
Kiet Van Nguyen
This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original–paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models like BART and T5, as well as large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks. The dataset is available for research purposes at
https://github.com/ngwgsang/ViSP.
pdf
bib
abs
RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering
Yang Bai
|
Christan Grant
|
Daisy Zhe Wang
Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various permutations. Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant improvements over strong baselines, highlighting the effectiveness of our approach. Data and code will be made public once the paper is accepted.
pdf
bib
abs
MultiCAT: Multimodal Communication Annotations for Teams
Adarsh Pyarelal
|
John M Culnan
|
Ayesha Qamar
|
Meghavarshini Krishnaswamy
|
Yuwei Wang
|
Cheonkam Jeong
|
Chen Chen
|
Md Messal Monem Miah
|
Shahriar Hormozi
|
Jonathan Tong
|
Ruihong Huang
Successful teamwork requires team members to understand each other and communicate effectively, managing multiple linguistic and paralinguistic tasks at once. Because of the potential for interrelatedness of these tasks, it is important to have the ability to make multiple types of predictions on the same dataset. Here, we introduce Multimodal Communication Annotations for Teams (MultiCAT), a speech- and text-based dataset consisting of audio recordings, automated and hand-corrected transcriptions. MultiCAT builds upon data from teams working collaboratively to save victims in a simulated search and rescue mission, and consists of annotations and benchmark results for the following tasks: (1) dialog act classification, (2) adjacency pair detection, (3) sentiment and emotion recognition, (4) closed-loop communication detection, and (5) vocal (phonetic) entrainment detection. We also present exploratory analyses on the relationship between our annotations and team outcomes. We posit that additional work on these tasks and their intersection will further improve understanding of team communication and its relation to team performance. Code & data: https://doi.org/10.5281/zenodo.14834835
pdf
bib
abs
Prototype Tuning: A Meta-Learning Approach for Few-Shot Document-Level Relation Extraction with Large Language Models
Dinghao Pan
|
Yuanyuan Sun
|
Bo Xu
|
Jiru Li
|
Zhihao Yang
|
Ling Luo
|
Hongfei Lin
|
Jian Wang
Few-Shot Document-Level Relation Extraction (FSDLRE) aims to develop models capable of generalizing to new categories with minimal support examples. Although Large Language Models (LLMs) demonstrate exceptional In-Context Learning (ICL) capabilities on many few-shot tasks, their performance on FSDLRE tasks remains suboptimal due to the significant gap between the task format and the intrinsic capabilities of language models, coupled with the complexity of ICL prompts for document-level text. To address these challenges, we introduce a novel meta-training approach for LLMs termed Prototype Tuning. We construct simulated episodes using data with relation types that do not overlap with the test corpus, fundamentally enhancing the ICL capabilities of LLMs in FSDLRE through meta-learning. To further enhance the effects of meta-learning, we innovatively integrate the concept of prototype into the fine-tuning process of LLMs. This involves aggregating entity pairs from support documents into prototypes within the prompts and altering the way of determining relation categories to identifying the closest prototype. Experimental results demonstrate that our LLMs trained with this approach outperform all baselines. Our proposed approach markedly improves the ICL capabilities of LLMs in FSDLRE and mitigates the impact of relation semantic discrepancies between the training corpus and the test corpus on model performance.
pdf
bib
abs
LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification
Shubham Kumar Nigam
|
Tanmay Dubey
|
Govind Sharma
|
Noel Shallum
|
Kripabandhu Ghosh
|
Arnab Bhattacharya
In this paper, we address the task of semantic segmentation of legal documents through rhetorical role classification, with a focus on Indian legal judgments. We introduce **LegalSeg**, the largest annotated dataset for this task, comprising over 7,000 documents and 1.4 million sentences, labeled with 7 rhetorical roles. To benchmark performance, we evaluate multiple state-of-the-art models, including Hierarchical BiLSTM-CRF, TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and Role-Aware Transformers, alongside an exploratory **RhetoricLLaMA**, an instruction-tuned large language model. Our results demonstrate that models incorporating broader context, structural relationships, and sequential sentence information outperform those relying solely on sentence-level features. Additionally, we conducted experiments using surrounding context and predicted or actual labels of neighboring sentences to assess their impact on classification accuracy. Despite these advancements, challenges persist in distinguishing between closely related roles and addressing class imbalance. Our work underscores the potential of advanced techniques for improving legal document understanding and sets a strong foundation for future research in legal NLP.
pdf
bib
abs
Claim-Guided Textual Backdoor Attack for Practical Applications
Minkyoo Song
|
Hanna Kim
|
Jaehan Kim
|
Youngjin Jin
|
Seungwon Shin
Recent advances in natural language processing and the increased use of large language models have exposed new security vulnerabilities, such as backdoor attacks. Previous backdoor attacks require input manipulation after model distribution to activate the backdoor, posing limitations in real-world applicability. Addressing this gap, we introduce a novel Claim-Guided Backdoor Attack (CGBA), which eliminates the need for such manipulations by utilizing inherent textual claims as triggers. CGBA leverages claim extraction, clustering, and targeted training to trick models to misbehave on targeted claims without affecting their performance on clean data. CGBA demonstrates its effectiveness and stealthiness across various datasets and models, significantly enhancing the feasibility of practical backdoor attacks. Our code and data will be available at https://github.com/minkyoo9/CGBA.
pdf
bib
abs
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
Jiarui Lu
|
Thomas Holleis
|
Yizhe Zhang
|
Bernhard Aumayer
|
Feng Nan
|
Haoping Bai
|
Shuang Ma
|
Shen Ma
|
Mengyu Li
|
Guoli Yin
|
Zirui Wang
|
Ruoming Pang
Recent large language models (LLMs) advancements sparked a growing research interest in tool assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on either evaluating over stateless web services (RESTful API), based on a single turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation and a dynamic evaluation strategy for intermediate and final milestones over arbitrary trajectory. We show that open source and proprietary models has a significant performance gap, and complex tasks like State Dependency, Canonicalization and Insufficient Information defined in ToolSandbox are challenging even the most capable SOTA LLMs, providing brand-new insights to tool-use LLM capabilities. Datasets and evaluation scripts of ToolSandbox are released at <placeholder>.
pdf
bib
abs
SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation
Qilong Wu
|
Xiaoneng Xiang
|
Huang Hejia
|
Xuan Wang
|
Yeo Wei Jie
|
Ranjan Satapathy
|
Ricardo Shirota Filho
|
Bharadwaj Veeravalli
The rapid growth of the financial sector and the increasing focus on Environmental, Social, and Governance (ESG) considerations have created a pressing need for advanced natural language processing (NLP) tools. Despite recent advancements, there is still a notable absence of open-source Large Language Models (LLMs) that are proficient across both general finance and ESG domains, such as generating ESG reports. To address this gap, we introduce SusGen-30k, a high-quality, category-balanced dataset comprising seven financial NLP tasks. In addition, we propose TCFD-Bench, a benchmark designed to improve the evaluation of sustainability report generation. Our data-centric approach led to the development of a suite of models, SusGen-GPT, trained on the curated dataset. These models were evaluated across six adapted tasks and two off-the-shelf tasks, showing state-of-the-art performance, surpassing all other models except GPT-4. Remarkably, SusGen-GPT achieved an average score only 0.02 below GPT-4, despite using models with only 7-8B parameters compared to much larger GPT-4. This demonstrates the efficiency of our approach in delivering high performance with significantly fewer resources, addressing existing challenges and fostering further advancements in the financial and ESG research community.
pdf
bib
GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge
Daniil Gurgurov
|
Rishu Kumar
|
Simon Ostermann
pdf
bib
abs
In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation
Armel Randy Zebaze
|
Benoît Sagot
|
Rachel Bawden
The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. In this paper, we focus on machine translation (MT), a task that has been shown to benefit from in-context translation examples. However no systematic studies have been published on how best to select examples, and mixed results have been reported on the usefulness of similarity-based selection over random selection, although these results have mainly been shown for high-resource languages only. We provide a study covering multiple LLMs and in-context example retrieval strategies. Contrarily to previously published results, we find that retrieval based on sentence embedding similarity can improve MT, especially for low-resource language directions, and we also discuss the balance between selection pool diversity and quality. Code and outputs will be made freely available.
pdf
bib
abs
Self-Training Large Language Models for Tool-Use Without Demonstrations
Ne Luo
|
Aryo Pradipta Gema
|
Xuanli He
|
Emile Van Krieken
|
Pietro Lesci
|
Pasquale Minervini
Large language models (LLMs) remain prone to factual inaccuracies and computational errors, including hallucinations and mistakes in mathematical reasoning. Recent work augmented LLMs with tools to mitigate these shortcomings, but often requires curated gold tool-use demonstrations. In this paper, we investigate whether LLMs can learn to use tools without demonstrations. First, we analyse zero-shot prompting strategies to guide LLMs in tool utilisation. Second, we propose a self-training method to synthesise tool-use traces using the LLM itself. We compare supervised fine-tuning and preference fine-tuning techniques for fine-tuning the model on datasets constructed using existing Question Answering (QA) datasets, i.e., TriviaQA and GSM8K. Experiments show that tool-use enhances performance on a long-tail knowledge task: 3.7% on PopQA, which is used solely for evaluation, but leads to mixed results on other datasets, i.e., TriviaQA, GSM8K, and NQ-Open. Our findings highlight the potential and challenges of integrating external tools into LLMs without demonstrations.
pdf
bib
abs
Can Large Language Models Generate High-quality Patent Claims?
Lekang Jiang
|
Caiqi Zhang
|
Pascal A. Scherz
|
Stefan Goetz
Large language models (LLMs) have shown exceptional performance across various text generation tasks, but remain under-explored in the patent domain, which offers highly structured and precise language. This paper constructs a dataset to investigate the performance of current LLMs in patent claim generation. Our results demonstrate that generating claims based on patent descriptions outperforms previous research relying on abstracts. Interestingly, current patent-specific LLMs perform much worse than state-of-the-art general LLMs, highlighting the necessity for future research on in-domain LLMs. We also find that LLMs can produce high-quality first independent claims, but their performances markedly decrease for subsequent dependent claims. Moreover, fine-tuning can enhance the completeness of inventions’ features, conceptual clarity, and feature linkage. Among the tested LLMs, GPT-4 demonstrates the best performance in comprehensive human evaluations by patent experts, with better feature coverage, conceptual clarity, and technical coherence. Despite these capabilities, comprehensive revision and modification are still necessary to pass rigorous patent scrutiny and ensure legal robustness.
pdf
bib
abs
Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm
Jaehan Kim
|
Minkyoo Song
|
Seung Ho Na
|
Seungwon Shin
Parameter-efficient fine-tuning (PEFT) has become a key training strategy for large language models. However, its reliance on fewer trainable parameters poses security risks, such as task-agnostic backdoors. Despite their severe impact on a wide range of tasks, there is no practical defense solution available that effectively counters task-agnostic backdoors within the context of PEFT. In this study, we introduce Obliviate, a PEFT-integrable backdoor defense. We develop two techniques aimed at amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens. Our evaluations across three major PEFT architectures show that our method can significantly reduce the attack success rate of the state-of-the-art task-agnostic backdoors (83.6%↓). Furthermore, our method exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks. Source code will be obtained at https://github.com/jaehanwork/Obliviate.
pdf
bib
abs
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmented Generation
Yiruo Cheng
|
Kelong Mao
|
Ziliang Zhao
|
Guanting Dong
|
Hongjin Qian
|
Yongkang Wu
|
Tetsuya Sakai
|
Ji-Rong Wen
|
Zhicheng Dou
Retrieval-Augmented Generation (RAG) has become a powerful paradigm for enhancing large language models (LLMs) through external knowledge retrieval. Despite its widespread attention, existing academic research predominantly focuses on single-turn RAG, leaving a significant gap in addressing the complexities of multi-turn conversations found in real-world applications. To bridge this gap, we introduce CORAL, a large-scale benchmark designed to assess RAG systems in realistic multi-turn conversational settings. CORAL includes diverse information-seeking conversations automatically derived from Wikipedia and tackles key challenges such as open-domain coverage, knowledge intensity, free-form responses, and topic shifts. It supports three core tasks of conversational RAG: passage retrieval, response generation, and citation labeling. We propose a unified framework to standardize various conversational RAG methods and conduct a comprehensive evaluation of these methods on CORAL, demonstrating substantial opportunities for improving existing approaches.
pdf
bib
abs
Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs
Itai Mondshine
|
Tzuf Paz-Argaman
|
Reut Tsarfaty
Despite advances in the multilingual capabilities of Large Language Models (LLMs) across diverse tasks, English remains the dominant language for LLM research and development. So, when working with a different language, this has led to the widespread practice of pre-translation, i.e., translating the task prompt into English before inference. Selective pre-translation, a more surgical approach, focuses on translating specific prompt components. However, its current use is sporagic and lacks a systematic research foundation. Consequently, the optimal pre-translation strategy for various multilingual settings and tasks remains unclear. In this work, we aim to uncover the optimal setup for pre-translation by systematically assessing its use. Specifically, we view the prompt as a modular entity, composed of four functional parts: instruction, context, examples, and output, either of which could be translated or not. We evaluate pre-translation strategies across 35 languages covering both low and high-resource languages, on various tasks including Question Answering (QA), Natural Language Inference (NLI), Named Entity Recognition (NER), and Abstractive Summarization. Our experiments show the impact of factors as similarity to English, translation quality and the size of pre-trained data, on the model performance with pre-translation. We suggest practical guidelines for choosing optimal strategies in various multilingual settings
pdf
bib
abs
QuaLLM: An LLM-based Framework to Extract Quantitative Insights from Online Forums
Varun Nagaraj Rao
|
Eesha Agarwal
|
Samantha Dalal
|
Dana Calacci
|
Andrés Monroy-Hernández
Online discussion forums provide crucial data to understand the concerns of a wide range of real-world communities. However, the typical qualitative and quantitative methodologies used to analyze those data, such as thematic analysis and topic modeling, are infeasible to scale or require significant human effort to translate outputs to human readable forms. This study introduces QuaLLM, a novel LLM-based framework to analyze and extract quantitative insights from text data on online forums. The framework consists of a novel prompting and human evaluation methodology. We applied this framework to analyze over one million comments from two of Reddit’s rideshare worker communities, marking the largest study of its type. We uncover significant worker concerns regarding AI and algorithmic platform decisions, responding to regulatory calls about worker insights. In short, our work sets a new precedent for AI-assisted quantitative data analysis to surface concerns from online forums.
pdf
bib
abs
The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection
Tomáš Horych
|
Christoph Mandl
|
Terry Ruas
|
Andre Greiner-Petter
|
Bela Gipp
|
Akiko Aizawa
|
Timo Spinde
High annotation costs from hiring or crowdsourcing complicate the creation of large, high-quality datasets needed for training reliable text classifiers. Recent research suggests using Large Language Models (LLMs) to automate the annotation process, reducing these costs while maintaining data quality. LLMs have shown promising results in annotating downstream tasks like hate speech detection and political framing. Building on the success in these areas, this study investigates whether LLMs are viable for annotating a complex task of media bias detection and whether a downstream media bias classifier can be trained on such data. We create Annolexical, the first large-scale dataset for media bias classification with over 48k synthetically annotated examples. Our classifier fine-tuned on it surpasses all of the annotator LLMs by 5-9% in Mathew’s Correlation Coefficient (MCC) and performs close to or outperforms the model trained on human-labeled data when evaluated on two media bias benchmark datasets (BABE and BASIL). This study demonstrates how our approach significantly reduces the cost of dataset creation in the media bias domain and, by extension - the development of the classifiers, while our subsequent behavioral stress-testing reveals some of its current limitations and trade-offs.
pdf
bib
abs
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning
Lin Zhang
|
Lijie Hu
|
Di Wang
Transformer-based language models have achieved significant success; however, their internal mechanisms remain largely opaque due to the complexity of non-linear interactions and high-dimensional operations. While previous studies have demonstrated that these models implicitly embed reasoning trees, humans typically employ various distinct logical reasoning mechanisms to complete the same task. It is still unclear which multi-step reasoning mechanisms are used by language models to solve such tasks. In this paper, we aim to address this question by investigating the mechanistic interpretability of language models, particularly in the context of multi-step reasoning tasks. Specifically, we employ circuit analysis and self-influence functions to evaluate the changing importance of each token throughout the reasoning process, allowing us to map the reasoning paths adopted by the model. We apply this methodology to the GPT-2 model on a prediction task (IOI) and demonstrate that the underlying circuits reveal a human-interpretable reasoning process used by the model.
pdf
bib
abs
Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models
Yuyi Huang
|
Runzhe Zhan
|
Derek F. Wong
|
Lidia S. Chao
|
Ailin Tao
Large language models (LLMs) have significantly influenced various industries but suffer from a critical flaw, the potential sensitivity of generating harmful content, which poses severe societal risks. We developed and tested novel attack strategies on popular LLMs to expose their vulnerabilities in generating inappropriate content. These strategies, inspired by psychological phenomena such as the “Priming Effect”, “Safe Attention Shift”, and “Cognitive Dissonance”, effectively attack the models’ guarding mechanisms. Our experiments achieved an attack success rate (ASR) of 100% on various open-source models, including Meta’s Llama-3.2, Google’s Gemma-2, Mistral’s Mistral-NeMo, Falcon’s Falcon-mamba, Apple’s DCLM, Microsoft’s Phi3, and Qwen’s Qwen2.5, among others. Similarly, for closed-source models such as OpenAI’s GPT-4o, Google’s Gemini-1.5, and Claude-3.5, we observed an ASR of at least 95% on the AdvBench dataset, which represents the current state-of-the-art. This study underscores the urgent need to reassess the use of generative models in critical applications to mitigate potential adverse societal impacts.
pdf
bib
abs
AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts
Soichiro Murakami
|
Peinan Zhang
|
Hidetaka Kamigaito
|
Hiroya Takamura
|
Manabu Okumura
Effective linguistic choices that attract potential customers play crucial roles in advertising success. This study aims to explore the linguistic features of ad texts that influence human preferences. Although the creation of attractive ad texts is an active area of research, progress in understanding the specific linguistic features that affect attractiveness is hindered by several obstacles. First, human preferences are complex and influenced by multiple factors, including their content, such as brand names, and their linguistic styles, making analysis challenging. Second, publicly available ad text datasets that include human preferences are lacking, such as ad performance metrics and human feedback, which reflect people’s interests. To address these problems, we present AdParaphrase, a paraphrase dataset that contains human preferences for pairs of ad texts that are semantically equivalent but differ in terms of wording and style. This dataset allows for preference analysis that focuses on the differences in linguistic features. Our analysis revealed that ad texts preferred by human judges have higher fluency, longer length, more nouns, and use of bracket symbols. Furthermore, we demonstrate that an ad text-generation model that considers these findings significantly improves the attractiveness of a given text. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase.
pdf
bib
abs
Token Weighting for Long-Range Language Modeling
Falko Helm
|
Nico Daheim
|
Iryna Gurevych
Many applications of large language models (LLMs) require long-context understanding, but models continue to struggle with such tasks. We hypothesize that conventional next-token prediction training could contribute to this, because each token is assigned equal weight. Yet, intuitively, the amount of context needed to predict the next token accurately varies greatly across different data. To reflect this, we propose various novel token-weighting schemes that assign different weights to each training token in the loss, thereby generalizing existing works. For this, we categorize token-weighting methods using a two-step framework which compares the confidences of a long-context and short-context model to score tokens. We evaluate all methods on multiple long-context understanding tasks and show that non-uniform loss weights are helpful to improve the long-context abilities of LLMs.Different short-context models can be used effectively for token scoring, including models that are much smaller than the long-context model that is trained.All in all, this work contributes to a better understanding of the trade-offs long-context language modeling faces and provides guidelines for model steering via loss-weighting based on empirical evidence. The code can be found on [Github](https://github.com/UKPLab/naacl2025-token-weighting).
pdf
bib
abs
Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
Takyoung Kim
|
Kyungjae Lee
|
Young Rok Jang
|
Ji Yong Cho
|
Gangwoo Kim
|
Minseok Cho
|
Moontae Lee
Interactions with large language models (LLMs) often yield long and detailed responses, leveraging both parametric knowledge and retrieval-augmented generation (RAG). While these responses can provide rich insights, they often include redundant or less engaging content not aligned with user interests. This issue becomes apparent when users specify particular subtopics to include or exclude – termed **coverage-conditioned (C2)** queries – as LLMs often struggle to provide tailored responses. To address this challenge, we investigate the role of query outlines, sequences of subqueries designed to guide LLMs in generating responses that meet specific user requirements. To systematically create and evaluate these outlines, we introduce **QTree**, a dataset of 10K hierarchical sets of information-seeking subqueries that define structured boundaries for outline creation and evaluation in C2 scenarios. Additionally, we develop **QPlanner**, a 7B language model trained to generate customized outlines within boundaries of QTree. We evaluate the effectiveness of the generated outlines through automatic and human judgements, focusing on their impact within retrieval-augmented generation (RAG) systems. Experimental results demonstrate that QPlanner, especially when trained with alignment techniques like DPO, generates higher-quality outlines that better fulfill diverse user needs.
pdf
bib
abs
LayAlign: Enhancing Multilingual Reasoning in Large Language Models via Layer-Wise Adaptive Fusion and Alignment Strategy
Zhiwen Ruan
|
Yixia Li
|
He Zhu
|
Longyue Wang
|
Weihua Luo
|
Kaifu Zhang
|
Yun Chen
|
Guanhua Chen
Despite being pretrained on multilingual corpora, large language models (LLMs) exhibit suboptimal performance on low-resource languages. Recent approaches have leveraged multilingual encoders alongside LLMs by introducing trainable parameters connecting the two models. However, these methods typically focus on the encoder’s output, overlooking valuable information from other layers. We propose Layer-Wise Adaptive Fusion and Alignment Strategy (LayAlign), a framework that integrates representations from all encoder layers, coupled with the adaptive fusion-enhanced attention mechanism to enable layer-wise interaction between the LLM and the multilingual encoder. Extensive experiments on multilingual reasoning tasks, along with analyses of learned representations, show that our approach consistently outperforms existing baselines.
pdf
bib
abs
On the Impacts of Contexts on Repository-Level Code Generation
Nam Le Hai
|
Dung Manh Nguyen
|
Nghi D. Q. Bui
CodeLLMs are widely used for code generation, yet their ability to handle repository-level dependencies remains underexplored. We introduce RepoExec, a benchmark for evaluating repository-level code generation, focusing on executability, functional correctness, and dependency utilization. Our study evaluates 18 models, revealing that retaining full dependency context yields the best performance, while smaller context sizes can be misleading. Pretrained LLMs excel in correctness but often reimplement dependencies, while instruction-tuned models better utilize dependencies but sometimes introduce unnecessary complexity. We propose an instruction-tuning dataset that improves dependency handling and introduce a new metric, Dependency Invocation Rate (DIR), to measure context utilization. Experiments show that instruction-tuned models improve DIR by over 10%, and multi-round debugging further enhances both correctness and dependency use. RepoExec provides a comprehensive framework to advance CodeLLMs for real-world applications. The dataset and source code are available at
https://github.com/FSoft-AI4Code/RepoExec.
pdf
bib
abs
From Argumentation to Deliberation: Perspectivized Stance Vectors for Fine-grained (Dis)agreement Analysis
Moritz Plenz
|
Philipp Heinisch
|
Janosch Gehring
|
Philipp Cimiano
|
Anette Frank
Debating over conflicting issues is a necessary first step towards resolving conflicts. However, intrinsic perspectives of an arguer are difficult to overcome by persuasive argumentation skills. Proceeding from a debate to a deliberative process, where we can identify actionable options for resolving a conflict requires a deeper analysis of arguments and the perspectives they are grounded in - as it is only from there that one can derive mutually agreeable resolution steps. In this work we develop a framework for a deliberative analysis of arguments in a computational argumentation setup. We conduct a fine-grained analysis of perspectivized stances expressed in the arguments of different arguers or stakeholders on a given issue, aiming not only to identify their opposing views, but also shared perspectives arising from their attitudes, values or needs. We formalize this analysis in Perspectivized Stance Vectors that characterize the individual perspectivized stances of all arguers on a given issue. We construct these vectors by determining issue- and argument-specific concepts, and predict an arguer’s stance relative to each of them. The vectors allow us to measure a modulated (dis)agreement between arguers, structured by perspectives, which allows us to identify actionable points for conflict resolution, as a first step towards deliberation.
pdf
bib
abs
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression
Souvik Kundu
|
Anahita Bhiwandiwalla
|
Sungduk Yu
|
Phillip Howard
|
Tiep Le
|
Sharath Nittur Sridhar
|
David Cobbley
|
Hao Kang
|
Vasudev Lal
Despite recent efforts in understanding the compression impact on Large Language Models (LLMs) in terms of their downstream task performance and trustworthiness on relatively simpler uni-modal benchmarks (e.g. question answering, common sense reasoning), their detailed study on multi-modal Large Vision Language Models (LVLMs) is yet to be unveiled. Towards mitigating this gap, we present LVLM-Compress-Bench, a framework to first thorough study on the broad impact of compression on the generative performance of LVLMs on multi-modal input driven tasks. In specific, we consider two major classes of compression for autoregressive models, namely KV cache and weight compression, for the dynamically growing intermediate cache and static weights, respectively. We use four LVLM variants of the popular LLaVA framework to present our analysis to integrate various state-of-the-art KV and weight compression methods including uniform, outlier-reduced, and group quantization. With this framework we demonstrate on ten different multi-modal datasets with varied capabilities including recognition, knowledge, language generation, spatial awareness, visual reasoning, hallucination and visual illusion identification, toxicity, stereotypes and bias. In specific, our framework demonstrates the compression impact on both general and ethically critical metrics leveraging a combination of real world and synthetic datasets to encompass diverse societal intersectional attributes. Extensive experimental evaluations yield diverse and intriguing observations on the behavior of LVLMs at different quantization budget of KV and weights, in both maintaining and losing performance as compared to the baseline model with FP16 data format. We believe LVLM-Compress-Bench would help the community to have a deeper insight on the parting impact of compression and the societal impact the compressed models may pose. Code will be released soon.
pdf
bib
abs
Does Generative AI speak Nigerian-Pidgin?: Issues about Representativeness and Bias for Multilingualism in LLMs
David Ifeoluwa Adelani
|
A. Seza Doğruöz
|
Iyanuoluwa Shode
|
Anuoluwapo Aremu
Nigeria is a multilingual country with 500+ languages. Naija is a Nigerian Pidgin spoken by approximately 120M speakers and it is a mixed language (e.g., English, Portuguese, Yoruba, Hausa and Igbo). Although it has mainly been a spoken language until recently, there are some online platforms (e.g., Wikipedia), publishing in written Naija as well. West African Pidgin English (WAPE) is also spoken in Nigeria and it is used by BBC to broadcast news on the internet to a wider audience not only in Nigeria but also in other West African countries (e.g., Cameroon and Ghana). Through statistical analyses and Machine Translation experiments, our paper shows that these two pidgin varieties do not represent each other (i.e., there are linguistic differences in word order and vocabulary) and Generative AI operates only based on WAPE. In other words, Naija is underrepresented in Generative AI, and it is hard to teach LLMs with few examples. In addition to the statistical analyses, we also provide historical information on both pidgins as well as insights from the interviews conducted with volunteer Wikipedia contributors in Naija.
pdf
bib
abs
A Guide To Effectively Leveraging LLMs for Low-Resource Text Summarization: Data Augmentation and Semi-supervised Approaches
Gaurav Sahu
|
Olga Vechtomova
|
Issam H. Laradji
Existing approaches for low-resource text summarization primarily employ large language models (LLMs) like GPT-3 or GPT-4 at inference time to generate summaries directly; however, such approaches often suffer from inconsistent LLM outputs and are difficult to adapt to domain-specific data in low-resource scenarios. In this work, we propose two novel methods to effectively utilize LLMs for low-resource text summarization: 1) MixSumm, an LLM-based data augmentation regime that synthesizes high-quality documents (short and long) for few-shot text summarization, and 2) PPSL, a prompt-based pseudolabeling strategy for sample-efficient semi-supervised text summarization. Specifically, MixSumm leverages the open-source LLaMA-3-70b-Instruct model to generate new documents by mixing topical information derived from a small seed set, and PPSL leverages the LLaMA-3-70b-Instruct model to generate high-quality pseudo-labels in a semi-supervised learning setup. We evaluate our methods on the TweetSumm, WikiHow, and ArXiv/PubMed datasets and use L-Eval, a LLaMA-3-based evaluation metric, and ROUGE scores to measure the quality of generated summaries. Our experiments on extractive and abstractive summarization show that MixSumm and PPSL achieve competitive ROUGE scores as a fully supervised method with 5% of the labeled data. We release our codebase here: https://github.com/ServiceNow/text-summarization-with-llms/
pdf
bib
abs
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
Aashiq Muhamed
|
Mona T. Diab
|
Virginia Smith
Understanding and mitigating the potential risks associated with foundation models (FMs) hinges on developing effective interpretability methods. Sparse Autoencoders (SAEs) have emerged as a promising tool for disentangling FM representations, but they struggle to capture rare, yet crucial concepts in the data. We introduce Specialized Sparse Autoencoders (SSAEs), designed to illuminate these elusive dark matter features by focusing on specific subdomains. We present a practical recipe for training SSAEs, demonstrating the efficacy of dense retrieval for data selection and the benefits of Tilted Empirical Risk Minimization as a training objective to improve concept recall. Our evaluation of SSAEs on standard metrics, such as downstream perplexity and L0 sparsity, show that they effectively capture subdomain tail concepts, exceeding the capabilities of general-purpose SAEs. We showcase the practical utility of SSAEs in a case study on the Bias in Bios dataset, where SSAEs achieve a 12.5% increase in worst-group classification accuracy over the pretrained general-purpose SAE when applied to remove spurious gender information. SSAEs provide a powerful new lens for peering into the inner workings of FMs in subdomains.
pdf
bib
abs
MAiDE-up: Multilingual Deception Detection of AI-generated Hotel Reviews
Oana Ignat
|
Xiaomeng Xu
|
Rada Mihalcea
Deceptive reviews are becoming increasingly common, especially given the increase in performance and the prevalence of LLMs. While work to date has addressed the development of models to differentiate between truthful and deceptive human reviews, much less is known about the distinction between real reviews and AI-authored fake reviews. Moreover, most of the research so far has focused primarily on English, with very little work dedicated to other languages. In this paper, we compile and make publicly available the MAiDE-up dataset, consisting of 10,000 real and 10,000 AI-generated fake hotel reviews, balanced across ten languages. Using this dataset, we conduct extensive linguistic analyses to (1) compare the AI fake hotel reviews to real hotel reviews, and (2) identify the factors that influence the deception detection model performance. We explore the effectiveness of several models for deception detection in hotel reviews across three main dimensions: sentiment, location, and language. We find that these dimensions influence how well we can detect AI-generated fake reviews.
pdf
bib
abs
LeCoPCR: Legal Concept-guided Prior Case Retrieval for European Court of Human Rights cases
Santosh T.y.s.s
|
Isaac Misael Olguín Nolasco
|
Matthias Grabmair
Prior case retrieval (PCR) is crucial for legal practitioners to find relevant precedent cases given the facts of a query case. Existing approaches often overlook the underlying semantic intent in determining relevance with respect to the query case. In this work, we propose LeCoPCR, a novel approach that explicitly generate intents in the form of legal concepts from a given query case facts and then augments the query with these concepts to enhance models understanding of semantic intent that dictates relavance. To overcome the unavailability of annotated legal concepts, We employ a weak supervision approach to extract key legal concepts from the reasoning section using Determinantal Point Process (DPP) to balance quality and diversity. Experimental results on the ECtHR-PCR dataset demonstrate the effectiveness of leveraging legal concepts and DPP-based key concept extraction.
pdf
bib
abs
How much do contextualized representations encode long-range context?
Simeng Sun
|
Cheng-Ping Hsieh
We analyze contextual representations in neural autoregressive language models, emphasizing long-range contexts that span several thousand tokens. Our methodology employs a perturbation setup and the metric Anisotropy-Calibrated Cosine Similarity, to capture the degree of contextualization of long-range patterns from the perspective of representation geometry. We begin the analysis with a case study on standard decoder-only Transformers, demonstrating that similar perplexity can exhibit markedly different downstream task performance, which can be explained by the difference in contextualization of long-range content. Next, we extend the analysis to other models, covering recent novel architectural designs and various training configurations. The representation-level results illustrate a reduced capacity for high-complexity (i.e., less compressible) sequences across architectures, and that fully recurrent models rely heavily on local context, whereas hybrid models more effectively encode the entire sequence structure. Finally, preliminary analysis of model size and training configurations on the encoding of long-range context suggest potential directions for improving existing language models.
pdf
bib
abs
Data Poisoning for In-context Learning
Pengfei He
|
Han Xu
|
Yue Xing
|
Hui Liu
|
Makoto Yamada
|
Jiliang Tang
In-context learning (ICL) has emerged as a capability of large language models (LLMs), enabling them to adapt to new tasks using provided examples. While ICL has demonstrated its strong effectiveness, there is limited understanding of its vulnerability against potential threats. This paper examines ICL’s vulnerability to data poisoning attacks. We introduce ICLPoison, an attacking method specially designed to exploit ICL’s unique learning mechanisms by identifying discrete text perturbations that influence LLM hidden states. We propose three representative attack strategies, evaluated across various models and tasks. Our experiments, including those on GPT-4, show that ICL performance can be significantly compromised by these attacks, highlighting the urgent need for improved defense mechanisms to protect LLMs’ integrity and reliability.
pdf
bib
abs
Synthetic Audio Helps for Cognitive State Tasks
Adil Soubki
|
John Murzaku
|
Peter Zeng
|
Owen Rambow
The NLP community has broadly focused on text-only approaches of cognitive state tasks, but audio can provide vital missing cues through prosody. We posit that text-to-speech models learn to track aspects of cognitive state in order to produce naturalistic audio, and that the signal audio models implicitly identify is orthogonal to the information that language models exploit. We present Synthetic Audio Data fine-tuning (SAD), a framework where we show that 7 tasks related to cognitive state modeling benefit from multimodal training on both text and zero-shot synthetic audio data from an off-the-shelf TTS system. We show an improvement over the text-only modality when adding synthetic audio data to text-only corpora. Furthermore, on tasks and corpora that do contain gold audio, we show our SAD framework achieves competitive performance with text and synthetic audio compared to text and gold audio.
pdf
bib
BioEL: A Comprehensive Python Package for Biomedical Entity Linking
Prasanth Bathala
|
Christophe Ye
|
Batuhan Nursal
|
Shubham Lohiya
|
David Kartchner
|
Cassie S. Mitchell
pdf
bib
abs
PairScale: Analyzing Attitude Change with Pairwise Comparisons
Rupak Sarkar
|
Patrick Y. Wu
|
Kristina Miler
|
Alexander Miserlis Hoyle
|
Philip Resnik
We introduce a text-based framework for measuring attitudes in communities toward issues of interest, going beyond the pro/con/neutral of conventional stance detection to characterize attitudes on a continuous scale using both implicit and explicit evidence in language. The framework exploits LLMs both to extract attitude-related evidence and to perform pairwise comparisons that yield unidimensional attitude scores via the classic Bradley-Terry model. We validate the LLM-based steps using human judgments, and illustrate the utility of the approach for social science by examining the evolution of attitudes on two high-profile issues in U.S. politics in two political communities on Reddit over the period spanning from the 2016 presidential campaign to the 2022 mid-term elections. WARNING: Potentially sensitive political content.
pdf
bib
abs
Semantic Consistency-Based Uncertainty Quantification for Factuality in Radiology Report Generation
Chenyu Wang
|
Weichao Zhou
|
Shantanu Ghosh
|
Kayhan Batmanghelich
|
Wenchao Li
Radiology report generation (RRG) has shown great potential in assisting radiologists by automating the labor-intensive task of report writing. While recent advancements have improved the quality and coherence of generated reports, ensuring their factual correctness remains a critical challenge. Although generative medical Vision Large Language Models (VLLMs) have been proposed to address this issue, these models are prone to hallucinations and can produce inaccurate diagnostic information. To address these concerns, we introduce a novel Semantic Consistency-Based Uncertainty Quantification framework that provides both report-level and sentence-level uncertainties. Unlike existing approaches, our method does not require modifications to the underlying model or access to its inner state, such as output token logits, thus serving as a plug-and-play module that can be seamlessly integrated with state-of-the-art models. Extensive experiments demonstrate the efficacy of our method in detecting hallucinations and enhancing the factual accuracy of automatically generated radiology reports. By abstaining from high-uncertainty reports, our approach improves factuality scores by 10%, achieved by rejecting 20% of reports on the MIMIC-CXR dataset. Furthermore, sentence-level uncertainty flags the lowest-precision sentence in each report with an 82.9% success rate. Our implementation is open-source and available at https://github.com/BU-DEPEND-Lab/SCUQ-RRG.
pdf
bib
abs
RewardBench: Evaluating Reward Models for Language Modeling
Nathan Lambert
|
Valentina Pyatkin
|
Jacob Morrison
|
LJ Miranda
|
Bill Yuchen Lin
|
Khyathi Chandu
|
Nouha Dziri
|
Sachin Kumar
|
Tom Zick
|
Yejin Choi
|
Noah A. Smith
|
Hannaneh Hajishirzi
Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. Resources for reward model training and understanding are sparse in the nascent open-source community around them. To enhance scientific understanding of reward models, we present RewardBench, a benchmark dataset and code-base for evaluation. The RewardBench dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We create specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate RMs trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO). We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.
pdf
bib
abs
Evaluating Vision-Language Models for Emotion Recognition
Sree Bhattacharyya
|
James Z. Wang
Large Vision-Language Models (VLMs) have achieved unprecedented success in several objective multimodal reasoning tasks. However, to further enhance their capabilities of empathetic and effective communication with humans, improving how VLMs process and understand emotions is crucial. Despite significant research attention on improving affective understanding, there is a lack of detailed evaluations of VLMs for emotion-related tasks, which can potentially help inform downstream fine-tuning efforts. In this work, we present the first comprehensive evaluation of VLMs for recognizing evoked emotions from images. We create a benchmark for the task of evoked emotion recognition and study the performance of VLMs for this task, from perspectives of correctness and robustness. Through several experiments, we demonstrate important factors that emotion recognition performance depends on, and also characterize the various errors made by VLMs in the process. Finally, we pinpoint potential causes for errors through a human evaluation study. We use our experimental results to inform recommendations for the future of emotion research in the context of VLMs.
pdf
bib
abs
Tomato, Tomahto, Tomate: Do Multilingual Language Models Understand Based on Subword-Level Semantic Concepts?
Crystina Zhang
|
Jing Lu
|
Vinh Q. Tran
|
Tal Schuster
|
Donald Metzler
|
Jimmy Lin
Human understanding of text depends on general semantic concepts of words rather than their superficial forms. To what extent does our human intuition transfer to language models? In this work, we study the degree to which current multilingual language models (mLMs) understand based on subword-level semantic concepts. To this end, we form “semantic tokens” by merging the semantically similar subwords and their embeddings, and evaluate the updated mLMs on five heterogeneous multilingual downstream tasks. Results show that the general shared semantics could get the models a long way in making the predictions on mLMs with different tokenizers and model sizes. Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities, including synonyms and translations across many languages and scripts. Lastly, we find that the zero-shot results with semantic tokens are on par with or even better than the original models on certain classification tasks, suggesting that the shared subword-level semantics may serve as the anchors for cross-lingual transfer.
pdf
bib
abs
Open Domain Question Answering with Conflicting Contexts
Siyi Liu
|
Qiang Ning
|
Kishaloy Halder
|
Zheng Qi
|
Wei Xiao
|
Phu Mon Htut
|
Yi Zhang
|
Neha Anna John
|
Bonan Min
|
Yassine Benajiba
|
Dan Roth
Open domain question answering systems frequently rely on information retrieved from large collections of text (such as the Web) to answer questions. However, such collections of text often contain conflicting information, and indiscriminately depending on this information may result in untruthful and inaccurate answers. To understand the gravity of this problem, we collect a human-annotated dataset, Question Answering with Conflicting Contexts (QACC), and find that as much as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search. We evaluate and benchmark three powerful Large Language Models (LLMs) with our dataset QACC and demonstrate their limitations in effectively addressing questions with conflicting information. To explore how humans reason through conflicting contexts, we request our annotators to provide explanations for their selections of correct answers. We demonstrate that by finetuning LLMs to explain their answers, we can introduce richer information into their training that guide them through the process of reasoning with conflicting contexts. We publicly release our dataset and code to promote research along this line.
pdf
bib
abs
The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models
Artem Kirsanov
|
Chi-Ning Chou
|
Kyunghyun Cho
|
SueYeon Chung
Decoder-only language models have the ability to dynamically switch between various computational tasks based on input prompts. Despite many successful applications of prompting, there is very limited understanding of the internal mechanism behind such flexibility. In this work, we investigate how different prompting methods affect the geometry of representations in these models. Employing a framework grounded in statistical physics, we reveal that various prompting techniques, while achieving similar performance, operate through distinct representational mechanisms for task adaptation. Our analysis highlights critical geometric effects of input distribution samples and label semantics in few-shot in-context learning. We also demonstrate evidence of synergistic and interfering interactions between different tasks on the representational level. Our work contributes to the theoretical understanding of large language models and lays the groundwork for developing more effective, representation-aware prompting strategies.
pdf
bib
abs
Biases in Opinion Dynamics in Multi-Agent Systems of Large Language Models: A Case Study on Funding Allocation
Pedro Cisneros-Velarde
We study the evolution of opinions inside a population of interacting large language models (LLMs). Every LLM needs to decide how much funding to allocate to an item with three initial possibilities: full, partial, or no funding. We identify biases that drive the exchange of opinions based on the LLM’s tendency to find consensus with the other LLM’s opinion, display caution when specifying funding, and consider ethical concerns in its opinion. We find these biases are affected by the perceived absence of compelling reasons for opinion change, the perceived willingness to engage in discussion, and the distribution of allocation values. Moreover, tensions among biases can lead to the survival of funding for items with negative connotations. We also find that the final distribution of full, partial, and no funding opinions is more diverse when an LLM freely forms its opinion after an interaction than when its opinion is a multiple-choice selection among the three allocation options. In the latter case, consensus is mostly attained. When agents are aware of past opinions, they seek to maintain consistency with them, changing the opinion dynamics. Our study is performed using Llama 3 and Mistral LLMs.
pdf
bib
abs
CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions
Mourad Heddaya
|
Kyle MacMillan
|
Hongyuan Mei
|
Chenhao Tan
|
Anup Malani
This paper introduces CaseSumm, a novel dataset for long-context summarization in the legal domain that addresses the need for longer and more complex datasets for summarization evaluation. We collect 25.6K U.S. Supreme Court (SCOTUS) opinions and their official summaries, known as “syllabuses.” Our dataset is the largest open legal case summarization dataset, and is the first to include summaries of SCOTUS decisions dating back to 1815.We also present a comprehensive evaluation of LLM-generated summaries using both automatic metrics and expert human evaluation, revealing discrepancies between these assessment methods. Our evaluation shows Mistral 7b, a smaller open-source model, outperforms larger models on most automatic metrics and successfully generates syllabus-like summaries. In contrast, human expert annotators indicate that Mistral summaries contain hallucinations. The annotators consistently rank GPT-4 summaries as clearer and exhibiting greater sensitivity and specificity. We find that LLM-based evaluations are not more correlated with human evaluations than traditional automatic metrics. Furthermore, our analysis identifies specific hallucinations in generated summaries, including precedent citation errors and misrepresentations of case facts. These findings demonstrate the limitations of current automatic evaluation methods for legal summarization and highlight the critical role of human evaluation in assessing summary quality, particularly in complex, high-stakes domains.
pdf
bib
Chasing Random: Instruction Selection Strategies Fail to Generalize
Harshita Diddee
|
Daphne Ippolito
pdf
bib
abs
Can’t Hide Behind the API: Stealing Black-Box Commercial Embedding Models
Manveer Singh Tamber
|
Jasper Xian
|
Jimmy Lin
Embedding models that generate dense vector representations of text are widely used and hold significant commercial value. Companies such as OpenAI and Cohere offer proprietary embedding models via paid APIs, but despite being “hidden” behind APIs, these models are not protected from theft. We present, to our knowledge, the first effort to “steal” these models for retrieval by training thief models on text–embedding pairs obtained from the APIs. Our experiments demonstrate that it is possible to replicate the retrieval effectiveness of commercial embedding models with a cost of under $300. Notably, our methods allow for distilling from multiple teachers into a single robust student model, and for distilling into presumably smaller models with fewer dimension vectors, yet competitive retrieval effectiveness. Our findings raise important considerations for deploying commercial embedding models and suggest measures to mitigate the risk of model theft.
pdf
bib
abs
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
Sara Ghaboura
|
Ahmed Heakl
|
Omkar Thawakar
|
Ali Husain Salem Abdulla Alharthi
|
Ines Riahi
|
Abduljalil Radman
|
Jorma Laaksonen
|
Fahad Shahbaz Khan
|
Salman Khan
|
Rao Muhammad Anwer
Recent years have witnessed a significant interest in developing large multi-modal models (LMMs) capable of performing various visual reasoning and understanding tasks. This has led to the introduction of multiple LMM benchmarks to evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language to represent a large population of over 400 million speakers. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains including, multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding to evaluate broad scenario generalizability. Our CAMEL-Bench comprises around 29,036 questions that are filtered from a larger pool of samples, where the quality is manually verified by native speakers to ensure reliable model assessment. We conduct evaluations of both closed-source, including GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the bestopen-source models, with even the closed-source GPT-4o achieving an overall score of 62%. Our benchmark will be publicly released.
pdf
bib
abs
ProxyLM: Predicting Language Model Performance on Multilingual Tasks via Proxy Models
David Anugraha
|
Genta Indra Winata
|
Chenyue Li
|
Patrick Amadeus Irawan
|
En-Shiun Annie Lee
Performance prediction is a method to estimate the performance of Language Models (LMs) on various Natural Language Processing (NLP) tasks, mitigating computational costs associated with model capacity and data for fine-tuning. Our paper presents ProxyLM, a scalable task- and language-agnostic framework designed to predict the performance of LMs using proxy models. These proxy models act as surrogates, approximating the performance of the LM of interest. By leveraging these proxy models, ProxyLM significantly reduces computational overhead in task evaluations, achieving up to a 37.08x speedup over traditional methods, even with our smallest proxy models. Our results across multiple multilingual NLP tasks and various robustness tests demonstrate that ProxyLM not only adapts well to previously unseen languages in pre-trained LMs, but also generalizes effectively across different datasets, outperforming the state-of-the-art by at least 1.78x in terms of root-mean-square error (RMSE).
pdf
bib
abs
SimSMoE: Toward Efficient Training Mixture of Experts via Solving Representational Collapse
Giang Do
|
Hung Le
|
Truyen Tran
Sparse mixture of experts (SMoE) have emerged as an effective approach for scaling large language models while keeping a constant computational cost. Regardless of several notable successes of SMoE, effective training such architecture remains elusive due to the representation collapse problem, which in turn harms model performance and causes parameter redundancy. In this work, we present Similarity-based Sparse Mixture of Experts (SimSMoE), a novel similarity of neural network algorithm, that guarantees a solution to address the representation collapse issue between experts given a fixed FLOPs budget. We conduct extensive empirical evaluations on three large language models for both Pre-training and Fine-tuning tasks to illustrate the efficacy, robustness, and scalability of our method. The results demonstrate that SimSMoE significantly enhances existing routing policy and outperforms other SMoE routing methods in performance for the tasks. Our implementation is publicly available at https://github.com/giangdip2410/SimSMoE.
pdf
bib
abs
UniRAG: Universal Retrieval Augmentation for Large Vision Language Models
Sahel Sharifymoghaddam
|
Shivani Upadhyay
|
Wenhu Chen
|
Jimmy Lin
Recently, Large Vision Language Models (LVLMs) have unlocked many complex use cases that require Multi-Modal (MM) understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities. To further improve the output fidelity of LVLMs we introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference. Unlike the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models like GPT-4o and Gemini-Pro and smaller open-source models like LLaVA, LaVIT, and Emu2 significantly enhance their generation quality when their input prompts are augmented with relevant information retrieved by Vision-Language (VL) retrievers like UniIR models. All the necessary code to reproduce our results is available at https://github.com/castorini/UniRAG.
pdf
bib
abs
Evaluating the Performance of Large Language Models via Debates
Behrad Moniri
|
Hamed Hassani
|
Edgar Dobriban
Large Language Models (LLMs) are rapidly evolving and impacting various fields, necessitating the development of effective methods to evaluate and compare their performance. Most current approaches for performance evaluation are either based on fixed, domain-specific questions that lack the flexibility required in many real-world applications, or rely on human input, making them unscalable. To address these issues, we propose an automated benchmarking framework based on debates between LLMs, judged by another LLM. This method assesses not only domain knowledge, but also skills such as argumentative reasoning and inconsistency recognition. We evaluate the performance of various state-of-the-art LLMs using the debate framework and achieve rankings that align closely with popular rankings based on human input, eliminating the need for costly human crowdsourcing.
pdf
bib
abs
CausalGraph2LLM: Evaluating LLMs for Causal Queries
Ivaxi Sheth
|
Bahare Fatemi
|
Mario Fritz
Causality is essential in scientific research, enabling researchers to interpret true relationships between variables. These causal relationships are often represented by causal graphs, which are directed acyclic graphs. With the recent advancements in Large Language Models (LLMs), there is an increasing interest in exploring their capabilities in causal reasoning and their potential use to hypothesize causal graphs. These tasks necessitate the LLMs to encode the causal graph effectively for subsequent downstream tasks. In this paper, we introduce CausalGraph2LLM, a comprehensive benchmark comprising over 700k queries across diverse causal graph settings to evaluate the causal reasoning capabilities of LLMs. We categorize the causal queries into two types: graph-level and node-level queries. We benchmark both open-sourced and closed models for our study. Our findings reveal that while LLMs show promise in this domain, they are highly sensitive to the encoding used. Even capable models like GPT-4 and Gemini-1.5 exhibit sensitivity to encoding, with deviations of about 60%. We further demonstrate this sensitivity for downstream causal intervention tasks. Moreover, we observe that LLMs can often display biases when presented with contextual information about a causal graph, potentially stemming from their parametric memory.
pdf
bib
abs
PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction
Hammad Ayyubi
|
Xuande Feng
|
Junzhang Liu
|
Xudong Lin
|
Zhecan Wang
|
Shih-Fu Chang
The task of predicting time and location from images is challenging and requires complex human-like puzzle-solving ability over different clues. In this work, we formalize this ability into core skills and implement them using different modules in an expert pipeline called PuzzleGPT. PuzzleGPT consists of a perceiver to identify visual clues, a reasoner to deduce prediction candidates, a combiner to combinatorially combine information from different clues, a web retriever to get external knowledge if the task can’t be solved locally, and a noise filter for robustness. This results in a zero-shot, interpretable, and robust approach that records state-of-the-art performance on two datasets – TARA and WikiTilo. PuzzleGPT outperforms large VLMs such as BLIP-2, InstructBLIP, LLaVA, and even GPT-4V, as well as automatically generated reasoning pipelines like VisProg, by at least 32% and 38%, respectively. It even rivals or surpasses finetuned models.
pdf
bib
abs
SAFR: Neuron Redistribution for Interpretability
Ruidi Chang
|
Chunyuan Deng
|
Hanjie Chen
Superposition refers to encoding representations of multiple features within a single neuron, which is common in deep neural networks. This property allows neurons to combine and represent multiple features, enabling the model to capture intricate information and handle complex tasks. Despite promising performance, the model’s interpretability has been diminished. This paper presents a novel approach to enhance model interpretability by regularizing feature superposition. We introduce SAFR, which simply applies regularizations to the loss function to promote monosemantic representations for important tokens while encouraging polysemanticity for correlated token pairs, where important tokens and correlated token pairs are identified via VMASK and attention weights respectively. We evaluate SAFR with a transformer model on two classification tasks. Experiments demonstrate the effectiveness of SAFR in improving model interpretability without compromising prediction performance. Besides, SAFR provides explanations by visualizing the neuron allocation within the intermediate layers.
pdf
bib
abs
GPT-4V Cannot Generate Radiology Reports Yet
Yuyang Jiang
|
Chacha Chen
|
Dang Nguyen
|
Benjamin M. Mervak
|
Chenhao Tan
GPT-4’s purported strong multimodal abilities raise interests in using it to automate radiology report writing, but there lacks thorough evaluations. In this work, we perform a systematic evaluation of GPT-4 (4o and vision-preview) in generating radiology reports across three chest X-ray report benchmarks: MIMIC-CXR, CheXpert Plus, and IU X-Ray. We attempt to directly generate reports with different prompting strategies and find that the models fail terribly in both lexical metrics and clinical efficacy metrics. To understand the low performance, we decompose the task into two steps: 1) the **medical image reasoning** step of predicting medical condition labels from images; and 2) the **report synthesis** step of generating reports from (groundtruth) conditions. We show that GPT-4’s performance in image reasoning is consistently low across different prompts. In fact, the distributions of model-predicted labels remain constant regardless of which groundtruth conditions are present on the image, suggesting that the model is not interpreting chest X-rays meaningfully. Even when given groundtruth conditions in report synthesis, its generated reports are less correct and less natural-sounding than a finetuned Llama. Altogether, our findings cast doubt on the viability of using GPT-4 in a radiology workflow.
pdf
bib
abs
Is Semantic Chunking Worth the Computational Cost?
Renyi Qu
|
Ruixuan Tu
|
Forrest Sheng Bao
Recent advances in Retrieval-Augmented Generation (RAG) systems have popularized semantic chunking, which aims to improve retrieval performance by dividing documents into semantically coherent segments. Despite its growing adoption, the actual benefits over simpler fixed-size chunking, where documents are split into consecutive, fixed-size segments, remain unclear. This study systematically evaluates the effectiveness of semantic chunking using three common retrieval-related tasks: document retrieval, evidence retrieval, and retrieval-based answer generation. The results show that the computational costs associated with semantic chunking are not justified by consistent performance gains. These findings challenge the previous assumptions about semantic chunking and highlight the need for more efficient chunking strategies in RAG systems.
pdf
bib
abs
On Using Arabic Language Dialects in Recommendation Systems
Abdulla Alshabanah
|
Murali Annavaram
While natural language processing (NLP) techniques have been applied to user reviews in recommendation systems, the potential of leveraging Arabic dialects in this context remains unexplored. Arabic is spoken by over 420 million people, with significant dialectal variation across regions. These dialects, often classified as low-resource languages, present both challenges and opportunities for machine learning applications. This paper represents the first attempt to incorporate Arabic dialects as a signal in recommendation systems. We explore both explicit and implicit approaches for integrating Arabic dialect information from user reviews, demonstrating its impact on improving recommendation performance. Our findings highlight the potential for leveraging dialectal diversity in Arabic to enhance recommendation systems and encourage further research at the intersection of NLP and recommendation systems within the Arab multicultural world.
pdf
bib
abs
Assessing LLMs for Zero-shot Abstractive Summarization Through the Lens of Relevance Paraphrasing
Hadi Askari
|
Anshuman Chhabra
|
Muhao Chen
|
Prasant Mohapatra
Large Language Models (LLMs) have achieved state-of-the-art performance at zero-shot generation of abstractive summaries for given articles. However, little is known about the robustness of such a process of zero-shot summarization.To bridge this gap, we propose *relevance paraphrasing*, a simple strategy that can be used to measure the robustness of LLMs as summarizers. The relevance paraphrasing approach identifies the most *relevant* sentences that contribute to generating an ideal summary, and then *paraphrases* these inputs to obtain a minimally perturbed dataset. Then, by evaluating model performance for summarization on both the original and perturbed datasets, we can assess the LLM’s one aspect of robustness. We conduct extensive experiments with relevance paraphrasing on 4 diverse datasets, as well as 4 LLMs of different sizes (GPT-3.5-Turbo, Llama-2-13B, Mistral-7B-v1, and Dolly-v2-7B). Our results indicate that LLMs are not consistent summarizers for the minimally perturbed articles, necessitating further improvements.
pdf
bib
Beyond Silent Letters: Amplifying LLMs in Emotion Recognition with Vocal Nuances
Zehui Wu
|
Ziwei Gong
|
Lin Ai
|
Pengyuan Shi
|
Kaan Donbekci
|
Julia Hirschberg
pdf
bib
abs
DomainSum: A Hierarchical Benchmark for Fine-Grained Domain Shift in Abstractive Text Summarization
Haohan Yuan
|
Haopeng Zhang
Most research on abstractive summarization focuses on single-domain applications, often neglecting how domain shifts between documents affect performance and the generalization ability of summarization models. To address this issue, we introduce DomainSum, a hierarchical benchmark designed to capture fine-grained domain shifts in abstractive summarization. We categorize these shifts into three levels: genre, style, and topic, and demonstrate through comprehensive benchmark analysis that they follow a hierarchical structure. Furthermore, we evaluate the domain generalization capabilities of commonly used pre-trained language models (PLMs) and large language models (LLMs) in both in-domain and cross-domain settings. Our benchmark and source code are released at https://github.com/hpzhang94/DomainSum.
pdf
bib
abs
Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
Wenjie Jacky Mo
|
Jiashu Xu
|
Qin Liu
|
Jiongxiao Wang
|
Jun Yan
|
Hadi Askari
|
Chaowei Xiao
|
Muhao Chen
Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of testing time defense. This gap becomes pronounced in the context of Large Language Models (LLMs) deployed as Web Services, which typically offer only black-box access, rendering training-time defenses impractical. To bridge this gap, this study critically examines the use of demonstrations as a defense mechanism against backdoor attacks in black-box LLMs. With an identified task, we retrieve task-relevant demonstrations from a clean data pool and integrate them with user queries during testing. Importantly, this approach does not necessitate modifications or tuning of the model, nor does it require insight into the model’s internal architecture. The alignment properties inherent in in-context learning play a pivotal role in mitigating the impact of backdoor triggers, effectively recalibrating the behavior of compromised models. Our experimental analysis demonstrates that this method robustly defends against both instance-level and instruction-level backdoor attacks, outperforming existing defense baselines across most evaluation scenarios.
pdf
bib
abs
“All that Glitters”: Techniques for Evaluations with Unreliable Model and Human Annotations
Michael Hardy
“Gold” and “ground truth” human-mediated labels have error. This error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model ratings of the quality of classroom teaching from two LLM architecture families–encoders and GPT decoders. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality. The encoder family of models achieve state-of-the-art, even “super-human”, results across all classroom annotation tasks using standard metrics. However, evaluation techniques accounting for unreliable labels reveal important flaws, including spurious correlations and nonrandom racial biases across models and humans. We estimate that if models were used in a human-in-the-loop context, the variance contributed by GPT model labels would worsen ratings. These techniques also highlight tasks where encoders could offer 80% reduction in human costs while also reducing bias.
pdf
bib
abs
KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus
Xiaoming Shi
|
Zeming Liu
|
Yiming Lei
|
Chenkai Zhang
|
Haitao Leng
|
Chuan Wang
|
Qingjie Liu
|
Wanxiang Che
|
Yunhong Wang
Video-based dialogue systems have compelling application value, such as education assistants, thereby garnering growing interest. However, the current video-based dialogue systems are limited by their reliance on a single dialogue type, which hinders their versatility in practical applications across a range of scenarios, including question-answering and emotionally dialog, etc. In this paper, we identify this challenge as how to generate video-driven multilingual mixed-type dialogues. To mitigate this challenge, we propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus, termed KwaiChat, containing a total of 93,209 videos and 246,080 dialogues, across 4 dialogue types, 30 domains, 4 languages, and 13 topics. Additionally, we establish baseline models on KwaiChat. An extensive analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance but still cannot perform well in this situation even with the help of in-context learning and fine-tuning, which indicates that the task is not trivial and needs further research.
pdf
bib
abs
GenEOL: Harnessing the Generative Power of LLMs for Training-Free Sentence Embeddings
Raghuveer Thirukovalluru
|
Bhuwan Dhingra
Training-free embedding methods directly leverage pretrained large language models (LLMs) to embed text, bypassing the costly and complex procedure of contrastive learning. Previous training-free embedding methods have mainly focused on optimizing embedding prompts and have overlooked the benefits of utilizing the generative abilities of LLMs. We propose a novel method, GenEOL, which uses LLMs to generate diverse transformations of a sentence that preserve its meaning, and aggregates the resulting embeddings of these transformations to enhance the overall sentence embedding. GenEOL significantly outperforms the existing training-free embedding methods by an average of 2.85 points across several LLMs on the sentence semantic text similarity (STS) benchmark. GenEOL also achieves notable gains in clustering, reranking, and pair-classification tasks from the MTEB benchmark. Additionally, GenEOL stabilizes representation quality across LLM layers and remains robust to perturbations of embedding prompts.
pdf
bib
abs
Attention Tracker: Detecting Prompt Injection Attacks in LLMs
Kuo-Han Hung
|
Ching-Yun Ko
|
Ambrish Rawat
|
I-Hsin Chung
|
Winston H. Hsu
|
Pin-Yu Chen
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring original instructions and executing designated action. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
pdf
bib
Unsupervised Speech-text word-level alignment with Dynamic Programming
Tianshu Yu
|
Zihan Gong
|
Minghuan Tan
|
Guhong Chen
|
Min Yang
pdf
bib
abs
SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis
Hengxing Cai
|
Xiaochen Cai
|
Junhan Chang
|
Sihang Li
|
Lin Yao
|
Wang Changxin
|
Zhifeng Gao
|
Hongshuai Wang
|
Li Yongge
|
Mujie Lin
|
Shuwen Yang
|
Jiankun Wang
|
Mingjun Xu
|
Jin Huang
|
Xi Fang
|
Jiaxi Zhuang
|
Yuqi Yin
|
Yaqi Li
|
Changhong Chen
|
Zheng Cheng
|
Zifeng Zhao
|
Linfeng Zhang
|
Guolin Ke
Recent breakthroughs in Large Language Models (LLMs) have revolutionized scientific literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency of LLMs in this domain, particularly in scenarios requiring higher-level abilities beyond mere memorization and the handling of multimodal data.In response to this gap, we introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis. It aims to thoroughly assess the efficacy of LLMs by evaluating their capabilities in Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3). It encompasses a variety of tasks drawn from diverse scientific fields, including biology, chemistry, material, and medicine.To ensure the reliability of SciAssess, rigorous quality control measures have been implemented, ensuring accuracy, anonymization, and compliance with copyright standards. SciAssess evaluates 11 LLMs, highlighting their strengths and areas for improvement. We hope this evaluation supports the ongoing development of LLM applications in scientific literature analysis.SciAssess and its resources are available at
https://github.com/sci-assess/SciAssess.
pdf
bib
abs
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
Samuele Poppi
|
Zheng Xin Yong
|
Yifei He
|
Bobbie Chern
|
Han Zhao
|
Aobo Yang
|
Jianfeng Chi
Recent advancements in Large Language Models (LLMs) have sparked widespread concerns about their safety. Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning with a few adversarially chosen instruction-following examples, i.e., fine-tuning attacks. We take a further step to understand fine-tuning attacks in multilingual LLMs. We first discover cross-lingual generalization of fine-tuning attacks: using a few adversarially chosen instruction-following examples in one language, multilingual LLMs can also be easily compromised (e.g., multilingual LLMs fail to refuse harmful prompts in other languages). Motivated by this finding, we hypothesize that safety-related information is language-agnostic and propose a new method termed Safety Information Localization (SIL) to identify the safety-related information in the model parameter space. Through SIL, we validate this hypothesis and find that only changing 20% of weight parameters in fine-tuning attacks can break safety alignment across all languages. Furthermore, we provide evidence to the alternative pathways hypothesis for why freezing safety-related parameters does not prevent fine-tuning attacks, and we demonstrate that our attack vector can still jailbreak LLMs adapted to new languages.
pdf
bib
abs
MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows
Xingjian Zhang
|
Yutong Xie
|
Jin Huang
|
Jinge Ma
|
Zhaoying Pan
|
Qijia Liu
|
Ziyang Xiong
|
Tolga Ergen
|
Dongsub Shim
|
Honglak Lee
|
Qiaozhu Mei
Scientific innovation relies on detailed workflows, which include critical steps such as contextualizing literature, generating ideas, validating ideas, interpreting results, and planning new research. Scientific publications that document these workflows are extensive and unstructured, making it difficult to effectively navigate and explore the space of scientific innovation. To meet this challenge, we introduce **MASSW**, a comprehensive dataset of **M**ulti-**A**spect **S**ummarization of **S**cientific **W**orkflows. MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years. Using Large Language Models (LLMs), we automatically extract five core aspects from these publications – *context, key idea, method, outcome*, and *projected impact* – which correspond to five key steps in a research workflow. We show that these LLM-extract summaries have a comparable quality to human annotations, and they facilitate a variety of downstream tasks, corresponding to different types of predictions and recommendations along the scientific workflow. Overall, MASSW demonstrates decent utility as a pre-computed and trustful resource for the AI4Science community to create and benchmark a wide-range of new AI methods for optimizing scientific workflows and fostering scientific innovation. Our code and datasets are made available anonymously: [link](https://osf.io/7ygrq/?view_only=3d8261a0ea09489fa67ece2c68235afa).
pdf
bib
abs
Neuro-symbolic Training for Reasoning over Spatial Language
Tanawan Premsri
|
Parisa Kordjamshidi
Spatial reasoning based on natural language expressions is essential for everyday human tasks. This reasoning ability is also crucial for machines to interact with their environment in a human-like manner. However, recent research shows that even state-of-the-art language models struggle with spatial reasoning over text, especially when facing nesting spatial expressions. This is attributed to not achieving the right level of abstraction required for generalizability.To alleviate this issue, we propose training language models with neuro-symbolic techniques that exploit the spatial logical rules as constraints, providing additional supervision to improve spatial reasoning and question answering.Training language models to adhere to spatial reasoning rules guides them in making more effective and general abstractions for transferring spatial knowledge to various domains. We evaluate our approach on existing spatial question-answering benchmarks. Our results indicate the effectiveness of our proposed technique in improving language models in complex multi-hop spatial reasoning over text.
pdf
bib
abs
On Localizing and Deleting Toxic Memories in Large Language Models
Anubrata Das
|
Manoj Kumar
|
Ninareh Mehrabi
|
Anil Ramakrishna
|
Anna Rumshisky
|
Kai-Wei Chang
|
Aram Galstyan
|
Morteza Ziyadi
|
Rahul Gupta
Warning: This paper contains offensive language.Ensuring that large language models (LLMs) do not generate harmful text is critical for their safe deployment. A common failure mode involves producing toxic responses to otherwise innocuous prompts. While various detoxification methods have been proposed, the underlying mechanisms that drive toxic generation in LLMs are not yet fully understood. Our work aims to provide a mechanistic understanding of toxic generation against innocuous-seeming adversarial prompts through the lens of memory localization. We find evidence of localization of toxic memories in the early Multilayer Perceptron (MLP) layers of GPT-2-XL. We further investigate the effects of editing and deleting these toxic memories in MLP layers to reduce toxic generation. Editing significantly reduces toxic generation, from 62.86% to 28.61%. However, this reduction comes with a trade-off in generation quality as perplexity increases from 78.18 on GPT2-XL against the adversarial prompts to 106.06 after editing. Localization-informed deletion achieves a better toxicity-perplexity tradeoff compared to random early layer editing, which reduces toxicity but leads to greater perplexity increases.
pdf
bib
abs
DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility
Yifan Liu
|
Yu Fang
|
Zhouhan Lin
Video-to-speech (V2S) synthesis, the task of generating speech directly from silent video input, is inherently more challenging than other speech synthesis tasks due to the need to accurately reconstruct both speech content and speaker characteristics from visual cues alone. Recently, audio-visual pretraining has eliminated the need for additional acoustic hints in V2S, which previous methods often relied on to ensure training convergence. However, even with pretraining, existing methods continue to face challenges in achieving a balance between acoustic intelligibility and the preservation of speaker-specific characteristics. We analyzed this limitation and were motivated to introduce DiVISe (Direct Vsual-Input Speech Synthesis), an end-to-end V2S model that predicts Mel-spectrograms directly from video frames alone. Despite not taking any acoustic hints, DiVISe effectively preserves speaker characteristics in the generated audio, and achieves superior performance on both objective and subjective metrics across the LRS2 and LRS3 datasets. Our results demonstrate that DiVISe not only outperforms existing V2S models in acoustic intelligibility but also scales more effectively with increased data and model parameters. Code and weights will be made publicly available after acceptance of this paper.
pdf
bib
abs
GraphICL: Unlocking Graph Learning Potential in LLMs through Structured Prompt Design
Yuanfu Sun
|
Zhengnan Ma
|
Yi Fang
|
Jing Ma
|
Qiaoyu Tan
The growing importance of textual and relational systems has driven interest in enhancing large language models (LLMs) for graph-structured data, particularly Text-Attributed Graphs (TAGs), where samples are represented by textual descriptions interconnected by edges. While research has largely focused on developing specialized graph LLMs through task-specific instruction tuning, a comprehensive benchmark for evaluating LLMs solely through prompt design remains surprisingly absent. Without such a carefully crafted evaluation benchmark, most if not all, tailored graph LLMs are compared against general LLMs using simplistic queries (e.g., zero-shot reasoning with LLaMA), which can potentially camouflage many advantages as well as unexpected predicaments of them. To achieve more general evaluations and unveil the true potential of LLMs for graph tasks, we introduce Graph In-context Learning (GraphICL) Benchmark, a comprehensive benchmark comprising novel prompt templates designed to capture graph structure and handle limited label knowledge. Our systematic evaluation shows that general-purpose LLMs equipped with our GraphICL outperform state-of-the-art specialized graph LLMs and graph neural network models in resource-constrained settings and out-of-domain tasks. These findings highlight the significant potential of prompt engineering to enhance LLM performance on graph learning tasks without training and offer a strong baseline for advancing research in graph LLMs.
pdf
bib
abs
FIDELITY: Fine-grained Interpretable Distillation for Effective Language Insights and Topic Yielding
Divyansh Singh
|
Brodie Mather
|
Demi Zhang
|
Patrick Lehman
|
Justin Ho
|
Bonnie J Dorr
The rapid expansion of text data has increased the need for effective methods to distill meaningful information from large datasets. Traditional and state-of-the-art approaches have made significant strides in topic modeling, yet they fall short in generating contextually specific and semantically intuitive topics, particularly in dynamic environments and low-resource languages. Additionally, multi-document summarization systems often struggle with issues like redundancy, scalability, and maintaining readability. We introduce FIDELITY (Fine-grained Interpretable Distillation for Effective Language Insights and Topic Yielding), a hybrid method that combines topic modeling and text summarization to produce fine-grained, semantically rich, and contextually relevant output. FIDELITY enhances dataset accessibility and interpretability, outperforming traditional models in topic diversity, similarity, and in the ability to process new, unseen documents. Additionally, it demonstrates robust multilingual capabilities, effectively handling low-resource languages like Tagalog. This makes FIDELITY a powerful tool for distilling and understanding complex textual data, providing detailed insights while maintaining the necessary granularity for practical applications.
pdf
bib
abs
Classic4Children: Adapting Chinese Literary Classics for Children with Large Language Model
Jiali Chen
|
Xusen Hei
|
Yuqi Xue
|
Zihan Wu
|
Jiayuan Xie
|
Yi Cai
Chinese literary classics hold significant cultural and educational value, offering deep insights into morality, history, and human nature. These works often include classical Chinese and complex narratives, making them difficult for children to read. To bridge this gap, we introduce a child-friendly literary adaptation (CLA) task to adapt the Chinese literary classic into engaging and accessible text for children. However, recent large language models (LLMs) overlook children’s reading preferences (i.e., vivid character portrayals, concise narrative structures, and appropriate readability with simpler words and sentences), which poses challenges in CLA. In this paper, we propose a method called InstructChild, which augments the LLM with these preferences for adaptation. Specifically, we first obtain the characters’ personalities and narrative structure as additional information for fine-grained instruction tuning. Then, we devise a readability metric as the reward to align the LLM with the children’s reading level. Finally, a lookahead decoding strategy is applied to improve the readability of the generated text during inference. To support the evaluation of CLA task, we construct the Classic4Children dataset, which comprises both the original and child-friendly versions of the Four Great Classical Novels of Chinese literature. Experimental results show that our InstructChild significantly improves performance in automatic and human evaluation.
pdf
bib
abs
Considering Length Diversity in Retrieval-Augmented Summarization
Juseon-Do
|
Jaesung Hwang
|
Jingun Kwon
|
Hidetaka Kamigaito
|
Manabu Okumura
This study investigates retrieval-augmented summarization by specifically examining the impact of exemplar summary lengths because previous methods have not considered length constraints. We propose a Diverse Length-aware Maximal Marginal Relevance (DL-MMR) algorithm to better control summary lengths. This algorithm combines the query relevance with diverse target lengths in retrieval-augmented summarization. Unlike previous methods that necessitate exhaustive exemplar-exemplar relevance comparisons using MMR, DL-MMR considers the exemplar target length as well and avoids comparing exemplars to each other, thereby reducing computational cost and conserving memory during the construction of an exemplar pool. Experimental results showed the effectiveness of DL-MMR, which considers length diversity, compared to the original MMR algorithm. DL-MMR additionally showed the effectiveness in memory saving of 781,513 times and computational cost reduction of 500,092 times, while maintaining the same level of informativeness.
pdf
bib
abs
LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models
Zhenyue Qin
|
Yu Yin
|
Dylan Campbell
|
Xuansheng Wu
|
Ke Zou
|
Ninghao Liu
|
Yih Chung Tham
|
Xiuzhen Zhang
|
Qingyu Chen
The prevalence of vision-threatening eye diseases is a significant global burden, with many cases remaining undiagnosed or diagnosed too late for effective treatment. Large vision-language models (LVLMs) have the potential to assist in understanding anatomical information, diagnosing eye diseases, and drafting interpretations and follow-up plans, thereby reducing the burden on clinicians and improving access to eye care. However, limited benchmarks are available to assess LVLMs’ performance in ophthalmology-specific applications. In this study, we introduce LMOD, a large-scale multimodal ophthalmology benchmark consisting of 21,993 instances across (1) five ophthalmic imaging modalities: optical coherence tomography, color fundus photographs, scanning laser ophthalmoscopy, lens photographs, and surgical scenes; (2) free-text, demographic, and disease biomarker information; and (3) primary ophthalmology-specific applications such as anatomical information understanding, disease diagnosis, and subgroup analysis. In addition, we benchmarked 13 state-of-the-art LVLM representatives from closed-source, open-source, and medical domains. The results demonstrate a significant performance drop for LVLMs in ophthalmology compared to other domains. Systematic error analysis further identified six major failure modes: misclassification, failure to abstain, inconsistent reasoning, hallucination, assertions without justification, and lack of domain-specific knowledge. In contrast, supervised neural networks specifically trained on these tasks as baselines demonstrated high accuracy. These findings underscore the pressing need for benchmarks in the development and validation of ophthalmology-specific LVLMs.
pdf
bib
abs
Syntriever: How to Train Your Retriever with Synthetic Data from LLMs
Minsang Kim
|
Seung Jun Baek
LLMs have boosted progress in many AI applications. Recently, there were attempts to distill the vast knowledge of LLMs into information retrieval systems. Those distillation methods mostly use output probabilities of LLMs which are unavailable in the latest black-box LLMs. We propose Syntriever, a training framework for retrievers using synthetic data from black-box LLMs. Syntriever consists of two stages. Firstly in the distillation stage, we synthesize relevant and plausibly irrelevant passages and augmented queries using chain-of-thoughts for the given queries. LLM is asked to self-verify the synthetic data for possible hallucinations, after which retrievers are trained with a loss designed to cluster the embeddings of relevant passages. Secondly in the alignment stage, we align the retriever with the preferences of LLMs. We propose a preference modeling called partial Plackett-Luce ranking to learn LLM preferences with regularization which prevents the model from deviating excessively from that trained in the distillation stage. Experiments show that Syntriever achieves state-of-the-art performances on benchmark datasets from various domains in nDCG@K. the source code is available in https://github.com/kmswin1/Syntriever
pdf
bib
abs
DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition
Qi Zhang
|
Huitong Pan
|
Zhijia Chen
|
Longin Jan Latecki
|
Cornelia Caragea
|
Eduard Dragut
Distantly Supervised Named Entity Recognition (DS-NER) has attracted attention due to its scalability and ability to automatically generate labeled data. However, distant annotation introduces many mislabeled instances, limiting its performance. Most of the existing work attempt to solve this problem by developing intricate models to learn from the noisy labels. An alternative approach is to attempt to clean the labeled data, thus increasing the quality of distant labels. This approach has received little attention for NER. In this paper, we propose a training dynamics-based label cleaning approach, which leverages the behavior of a model as training progresses to characterize the distantly annotated samples. We also introduce an automatic threshold estimation strategy to locate the errors in distant labels. Extensive experimental results demonstrate that: (1) models trained on our cleaned DS-NER datasets, which were refined by directly removing identified erroneous annotations, achieve significant improvements in F1-score, ranging from 3.18% to 8.95%; and (2) our method outperforms numerous advanced DS-NER approaches across four datasets.
pdf
bib
abs
An Efficient Rehearsal Scheme for Catastrophic Forgetting Mitigation during Multi-stage Fine-tuning
Andrew Bai
|
Chih-Kuan Yeh
|
Cho-Jui Hsieh
|
Ankur Taly
Incrementally fine-tuning foundational models on new tasks or domains is now the de facto approach in NLP. A known pitfall of this approach is the catastrophic forgetting of prior knowledge that happens during fine-tuning. A common approach to alleviate such forgetting is to rehearse samples from prior tasks during fine-tuning. Several existing works assume a fixed memory buffer to store prior task examples, while relying on inferences (forward passes) with the model at hand for choosing examples for rehearsal from the buffer. However, given the increasing computational cost of model inference, and decreasing cost of data storage, we focus on the setting to rehearse samples with a fixed computational budget instead of a fixed memory budget. We propose a sampling scheme, mix-cd, that prioritizes rehearsal of “collateral damage” samples, which are samples predicted correctly by the prior model but forgotten by the incrementally tuned one. The crux of our scheme is a procedure to efficiently estimate the density of collateral damage samples without incurring additional model inferences. Our approach is computationally efficient, easy to implement, and outperforms several leading continual learning methods in compute-constrained settings. All the code will be publicly available at https://github.com/jybai/mix-cd-rehearsal.
pdf
bib
abs
COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis
Weiqing Yang
|
Hanbin Wang
|
Zhenghao Liu
|
Xinze Li
|
Yukun Yan
|
Shuo Wang
|
Yu Gu
|
Minghe Yu
|
Zhiyuan Liu
|
Ge Yu
Code debugging is a vital stage of software development, essential for ensuring the reliability and performance of Large Language Models (LLMs) in the code generation task. Human debugging typically follows a multi-stage process, which includes Bug Localization, Bug Identification, Code Repair, and Code Recognition. However, existing code debugging benchmarks predominantly focus on the Code Repair stage, which offers only a limited perspective on evaluating the debugging capabilities of LLMs. In this paper, we introduce DEBUGEVAL, a comprehensive benchmark for evaluating the debugging abilities of LLMs by emulating the multi-stage human debugging process. Through evaluating on DEBUGEVAL, we observe that 7B-scale models consistently underperform compared to their larger counterparts, highlighting their limitations in comprehending code semantics. In this case, we propose the COmmunicative Agent-based data SynThesis (COAST) framework, which employs a multi-agent system to generate high-quality training data for supervised fine-tuning (SFT). Experimental results demonstrate that COAST-generated data outperform human-curated and GPT-4-generated data, enabling 7B-scale LLMs to achieve debugging performance comparable to GPT-3.5. All data and codes are available at https://github.com/NEUIR/COAST.
pdf
bib
abs
Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step
Zezhong Wang
|
Xingshan Zeng
|
Weiwen Liu
|
Yufei Wang
|
Liangyou Li
|
Yasheng Wang
|
Lifeng Shang
|
Xin Jiang
|
Qun Liu
|
Kam-Fai Wong
Current research found the issue of Early Answering in large language models (LLMs), where the models already have an answer before generating the Chain-of-Thought (CoT). This phenomenon suggests a potential lack of necessary dependency between the predicted answer and the reasoning process. Consequently, two important questions arise: (1) Is CoT still necessary if the model already has an answer? (2) Can the correctness of the answer serve as valid evidence for the correctness of CoT? To address these questions, we propose a method, namely Chain-of-Probe (CoP), to probe changes in confidence during the model’s reasoning. The probing results show that in a significant number of question-answer cases, CoT appears to be unnecessary, and this necessity correlates with the simplicity of the task, defined by the reasoning steps required. Furthermore, by analyzing patterns in confidence change, we examine the correctness of the model’s reasoning. Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process. To this end, we propose a strategic approach based on CoP to prioritize answers with correct reasoning among multiple candidates, thereby bolstering the reliability of the model’s reasoning.
pdf
bib
abs
INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages
Abhishek Kumar Singh
|
Vishwajeet Kumar
|
Rudra Murthy
|
Jaydeep Sen
|
Ashish Mittal
|
Ganesh Ramakrishnan
Large Language Models (LLMs) perform well on unseen tasks in English, but their abilities in non-English languages are less explored due to limited benchmarks and training data. To bridge this gap, we introduce the Indic-QA Benchmark, a large dataset for context-grounded question answering in 11 major Indian languages, covering both extractive and abstractive tasks. Evaluations of multilingual LLMs, including instruction fine-tuned versions, revealed weak performance in low-resource languages due to a strong English-language bias in their training data. We also investigated the Translate-Test paradigm,where inputs are translated to English for processing and the results are translated back into the source language for output. This approach outperformed multilingual LLMs, particularly in low-resource settings. By releasing Indic-QA, we aim to promote further research into LLMs’ question-answering capabilities in low-resource languages. This benchmark offers a critical resource to address existing limitations and foster multilingual understanding.
pdf
bib
abs
Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data
Juanhui Li
|
Sreyashi Nag
|
Hui Liu
|
Xianfeng Tang
|
Sheikh Muhammad Sarwar
|
Limeng Cui
|
Hansu Gu
|
Suhang Wang
|
Qi He
|
Jiliang Tang
In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets. However, the large size and high computation demands of LLMs limit their practicality in many applications, especially when further fine-tuning is required. To address these limitations, smaller models are typically preferred for deployment. However, their training is hindered by the scarcity of labeled data. In contrast, unlabeled data is often readily which can be leveraged by using LLMs to generate pseudo-labels for training smaller models. This enables the smaller models (student) to acquire knowledge from LLMs (teacher) while reducing computational costs. This process introduces challenges, such as potential noisy pseudo-labels. % and the high computational expense of processing large unlabeled datasets. Selecting high-quality and informative data is therefore critical to enhance model performance while improving the efficiency of data utilization. To address this, we propose LLKD that enables Learning with Less computational resources and less data for Knowledge Distillation from LLMs. LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student. Specifically, it prioritizes samples where the teacher demonstrates high confidence in its labeling, indicating reliable labels, and where the student exhibits a high information need, identifying challenging samples that require further learning. Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
pdf
bib
abs
LSDC: An Efficient and Effective Large-Scale Data Compression Method for Supervised Fine-tuning of Large Language Models
Zhaoguang Long
|
Yuhao Zhou
|
Shangqing Zhao
|
Yupei Ren
|
Li Cai
|
Chenghao Jia
|
Zhe Chen
|
Zhe Fang
|
Yuxiang Song
|
Man Lan
With the scale of Large Language Models(LLMs) and the size of the training data continuing to expand, the computational costs required for training or tuning have significantly increased as well. In this work we propose an efficient and effective Large-Scale Data Compression (LSDC) method to substantially reduce the size of training data and thus enhance the training efficiency without compromising the performance of LLMs through a bifurcated quantization strategy. Specifically, our method first segments the dataset into multiple clusters, significantly reducing the time and memory requirements for data compression. Then, during the second phase of coreset selection, the diversity of samples is ensured by maximizing the submodular gain in order to avoid performance degradation. The comparative experiments showed that the performance of LLMs fine-tuned on a 20% compressed subset of the Alpaca dataset using LSDC outperformed those on the full dataset. Moreover,on a domain-specific instruction dataset of millions of samples, the LLMs fine-tuned on a 10% compressed dataset using LSDC outperformed those on the entire dataset, which dramatically enhances the domain-adaption capabilities of LLMs. This provides a promising potential of LSDC in training bigger LLMs from scratch and supervised fine-tuning as well.
pdf
bib
abs
What Is Missing in Multilingual Visual Reasoning and How to Fix It
Yueqi Song
|
Simran Khanuja
|
Graham Neubig
NLP models today strive for supporting multiple languages and modalities, improving accessibility for diverse users. In this paper, we evaluate their multilingual, multimodal capabilities by testing on a visual reasoning task. We observe that proprietary systems like GPT-4V obtain the best performance on this task now, but open models lag in comparison. Surprisingly, GPT-4V exhibits similar performance between English and other languages, indicating the potential for equitable system development across languages. Our analysis on model failures reveals three key aspects that make this task challenging: multilinguality, complex reasoning, and multimodality. To address these challenges, we propose three targeted interventions including a translate-test approach to tackle multilinguality, a visual programming approach to break down complex reasoning, and a method that leverages image captioning to address multimodality. Our interventions achieve the best open performance on this task in a zero-shot setting, boosting open models LLaVA-v1.5-13B by 13.4%, LLaVA-v1.6-34B by 20.3%, and Qwen-VL by 16.7%, while also minorly improving GPT-4V’s performance.
pdf
bib
abs
Enhancing the Prototype Network with Local-to-Global Optimization for Few-Shot Relation Extraction
Hui Sun
|
Rongxin Chen
Few-Shot Relation Extraction (FSRE) aims to achieve high classification performance by training relation classification models with a small amount of labeled data. Prototypical networks serve as a straightforward and efficient method for optimizing model performance by combining similarity evaluation and contrastive learning. However, directly integrating these methods can introduce unpredictable noise, such as information redundancy, which hinders classification performance and negatively affects embedding space learning. The technique presented in this paper applies Local-To-Global optimization to enhance prototypical networks in few-shot relation extraction. Specifically, this paper develops a local optimization strategy that indirectly optimizes the prototypes by optimizing the other information contained within the prototypes. It considers relation prototypes as global anchors and incorporates the techniques introduced in this paper, such as information alignment, local contrastive learning, and a local adaptive focal loss function, to address the issues of information redundancy. This approach enables the model to learn a unified and effective embedding space. We conduct extensive experiments on the FewRel 1.0 and FewRel 2.0 datasets to validate the effectiveness of the proposed model.
pdf
bib
LLMs for Mathematical Modeling: Towards Bridging the Gap between Natural and Mathematical Languages
Xuhan Huang
|
Qingning Shen
|
Yan Hu
|
Anningzhe Gao
|
Benyou Wang
pdf
bib
abs
Advancing Persian LLM Evaluation
Sara Bourbour Hosseinbeigi
|
Behnam Rohani
|
Mostafa Masoudi
|
Mehrnoush Shamsfard
|
Zahra Saaberi
|
Mostafa Karimi Manesh
|
Mohammad Amin Abbasi
Evaluation of large language models (LLMs) in low-resource languages like Persian has received less attention than in high-resource languages like English. Existing evaluation approaches for Persian LLMs generally lack comprehensive frameworks, limiting their ability to assess models’ performance over a wide range of tasks requiring considerable cultural and contextual knowledge, as well as a deeper understanding of Persian literature and style. This paper first aims to fill this gap by providing two new benchmarks, PeKA and PK-BETS, on topics such as history, literature, and cultural knowledge, as well as challenging the present state-of-the-art models’ abilities in a variety of Persian language comprehension tasks. These datasets are meant to reduce data contamination while providing an accurate assessment of Persian LLMs. The second aim of this paper is the general evaluation of LLMs across the current Persian benchmarks to provide a comprehensive performance overview. By offering a structured evaluation methodology, we hope to promote the examination of LLMs in the Persian language.
pdf
bib
abs
Supportiveness-based Knowledge Rewriting for Retrieval-augmented Language Modeling
Zile Qiao
|
Wei Ye
|
Yong Jiang
|
Tong Mo
|
Pengjun Xie
|
Weiping Li
|
Fei Huang
|
Shikun Zhang
Retrieval-augmented language models (RALMs) have recently shown great potential in mitigating the limitations of implicit knowledge in LLMs, such as untimely updating of the latest expertise and unreliable retention of long-tail knowledge. However, since the external knowledge base, as well as the retriever, can not guarantee reliability, potentially leading to the knowledge retrieved not being helpful or even misleading for LLM generation. In this paper, we introduce Supportiveness-based Knowledge Rewriting (SKR), a robust and pluggable knowledge rewriter inherently optimized for LLM generation. Specifically, we introduce the novel concept of “supportiveness”—which represents how effectively a knowledge piece facilitates downstream tasks. Based on supportiveness, we first design a training data curation strategy for our rewriter model, effectively identifying and filtering out poor or irrelevant rewrites to improve data efficacy. We then introduce the direct preference optimization (DPO) algorithm to align the generated rewrites to optimal supportiveness, guiding the rewriter model to summarize augmented content that better improves the final response. Comprehensive evaluations across six popular knowledge-intensive tasks and four LLMs have demonstrated the effectiveness and superiority of SKR. With only 7B parameters, SKR has shown better knowledge rewriting capability over GPT-4.
pdf
bib
abs
Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models
Jiatao Li
|
Xinyu Hu
|
Xunjian Yin
|
Xiaojun Wan
The integration of documents generated by LLMs themselves (Self-Docs) alongside retrieved documents has emerged as a promising strategy for retrieval-augmented generation systems. However, previous research primarily focuses on optimizing the use of Self-Docs, with their inherent properties remaining underexplored. To bridge this gap, we first investigate the overall effectiveness of Self-Docs, identifying key factors that shape their contribution to RAG performance (RQ1). Building on these insights, we develop a taxonomy grounded in Systemic Functional Linguistics to compare the influence of various Self-Docs categories (RQ2) and explore strategies for combining them with external sources (RQ3). Our findings reveal which types of Self-Docs are most beneficial and offer practical guidelines for leveraging them to achieve significant improvements in knowledge-intensive question answering tasks.
pdf
bib
abs
PREMISE: Matching-based Prediction for Accurate Review Recommendation
Wei Han
|
Hui Chen
|
Soujanya Poria
We present PREMISE, a new architecture for the matching-based learning in the multimodal fields for the MRHP task. Distinct to previous fusion-based methods which obtains multimodal representations via cross-modal attention for downstream tasks, PREMISE computes the multi-scale and multi-field representations, filters duplicated semantics, and then obtained a set of matching scores as feature vectors for the downstream recommendation task. This new architecture significantly boosts the performance for such multimodal tasks whose context matching content are highly correlated to the targets of that task, compared to the state-of-the-art fusion-based methods. Experimental results on two publicly available datasets show that PREMISE achieves promising performance with less computational cost.
pdf
bib
abs
Semi-supervised Fine-tuning for Large Language Models
Junyu Luo
|
Xiao Luo
|
Xiusi Chen
|
Zhiping Xiao
|
Wei Ju
|
Ming Zhang
Supervised fine-tuning (SFT) is crucial in adapting large language models (LLMs) to a specific domain or task. However, only a limited amount of labeled data is available in practical applications, which poses a severe challenge for SFT in yielding satisfactory results. Therefore, a data-efficient framework that can fully exploit labeled and unlabeled data for LLM fine-tuning is highly anticipated.Towards this end, we introduce a **semi-supervised fine-tuning (SemiFT)** task and a framework named **SemiEvol** for LLM alignment from a propagate-and-select manner. For knowledge propagation, SemiEvol adopts a bi-level approach, propagating knowledge from labeled data to unlabeled data through both in-weight and in-context methods. For knowledge selection, SemiEvol incorporates a collaborative learning mechanism, selecting higher-quality pseudo-response samples. We conducted experiments using GPT-4o-mini and Llama-3.1 on seven general or domain-specific datasets, demonstrating significant improvements in model performance on target data. Furthermore, we compared SemiEvol with SFT and self-evolution methods, highlighting its practicality in hybrid data scenarios. Github Repository: [https://github.com/luo-junyu/SemiEvol](https://github.com/luo-junyu/SemiEvol).
pdf
bib
abs
CALM: Unleashing the Cross-Lingual Self-Aligning Ability of Language Model Question Answering
Yumeng Wang
|
Zhiyuan Fan
|
Qingyun Wang
|
Yi R. Fung
|
Heng Ji
Large Language Models (LLMs) are pretrained on extensive multilingual corpora to acquire both language-specific cultural knowledge and general knowledge. Ideally, while LLMs should provide consistent responses to culture-independent questions across languages, we observe significant performance disparities. To address this, we explore the **C**ross-Lingual Self-**A**ligning ability of **L**anguage **M**odels (**CALM**) to align knowledge across languages. Specifically, for a given question, we sample multiple responses across different languages and select the most self-consistent response as the target, leaving the remaining responses as negative examples. We then employ direct preference optimization (DPO) to align the model’s knowledge across different languages. Evaluations on the MEDQA and X-CSQA datasets demonstrate CALM’s effectiveness in enhancing cross-lingual knowledge question answering, both in zero-shot and retrieval-augmented settings. We also found that increasing the number of languages involved in CALM training leads to higher accuracy and consistency. We offer a qualitative analysis of how cross-lingual consistency can enhance knowledge alignment and explore the method’s generalizability.
pdf
bib
abs
Towards Prompt Generalization: Grammar-aware Cross-Prompt Automated Essay Scoring
Heejin Do
|
Taehee Park
|
Sangwon Ryu
|
Gary Lee
In automated essay scoring (AES), recent efforts have shifted toward cross-prompt settings that score essays on unseen prompts for practical applicability. However, prior methods trained with essay-score pairs of specific prompts pose challenges in obtaining prompt-generalized essay representation. In this work, we propose a grammar-aware cross-prompt trait scoring (GAPS), which internally captures prompt-independent syntactic aspects to learn generic essay representation. We acquire grammatical error-corrected information in essays via the grammar error correction technique and design the AES model to seamlessly integrate such information. By internally referring to both the corrected and the original essays, the model can focus on generic features during training. Empirical experiments validate our method’s generalizability, showing remarkable improvements in prompt-independent and grammar-related traits. Furthermore, GAPS achieves notable QWK gains in the most challenging cross-prompt scenario, highlighting its strength in evaluating unseen prompts.
pdf
bib
abs
MedEureka: A Medical Domain Benchmark for Multi-Granularity and Multi-Data-Type Embedding-Based Retrieval
Yongqi Fan
|
Nan Wang
|
Kui Xue
|
Jingping Liu
|
Tong Ruan
Embedding-based retrieval (EBR), the mainstream approach in information retrieval (IR), aims to help users obtain relevant information and plays a crucial role in retrieval-augmented generation (RAG) techniques of large language models (LLMs). Numerous methods have been proposed to significantly improve the quality of retrieved content and many generic benchmarks are proposed to evaluate the retrieval abilities of embedding models. However, texts in the medical domain present unique contexts, structures, and language patterns, such as terminology, doctor-patient dialogue, and electronic health records (EHRs). Despite these unique features, specific benchmarks for medical context retrieval are still lacking. In this paper, we propose MedEureka, an enriched benchmark designed to evaluate medical-context retrieval capabilities of embedding models with multi-granularity and multi-data types. MedEureka includes four levels of granularity and six types of medical texts, encompassing 18 datasets, incorporating granularity and data type description to prompt instruction-fine-tuned text embedding models for embedding generation. We also provide the MedEureka Toolkit to support evaluation on the MedEureka test set. Our experiments evaluate state-of-the-art open-source and proprietary embedding models, and fine-tuned classical baselines, providing a detailed performance analysis. This underscores the challenges of using embedding models for medical domain retrieval and the need for further research. Our code and data are released in the repository:
https://github.com/JOHNNY-fans/MedEureka.
pdf
bib
abs
A Federated Framework for LLM-based Recommendation
Jujia Zhao
|
Wenjie Wang
|
Chen Xu
|
See-Kiong Ng
|
Tat-Seng Chua
Large Language Models (LLMs) have showcased their potential in building generative recommendation systems through fine-tuning user behavior data. However, utilizing the user behavior data may pose significant privacy risks like in the traditional recommender models, potentially leading to ethical dilemmas and violations of data protection regulations. To address the privacy concerns, Federated Learning for Recommendation (Fed4Rec) has been identified as a promising solution. However, directly applying Fed4Rec in the LLM context introduces two challenges: 1) exacerbated client performance imbalance, which ultimately impacts the system’s long-term effectiveness, and 2) substantial client resource costs, posing a high demand for clients’ both computational and storage capability to locally train and infer LLMs.To tackle these challenges, we propose a federated framework for LLM-based recommendation (shorted as FELLRec). Generally, FELLRec designs two key strategies. 1) Dynamic balance strategy, which designs dynamic parameter aggregation and learning speed for different clients during training, aiming to ensure relatively balanced performance across clients. 2) Flexible storage strategy, which selectively retains certain sensitive LLM layers on the client side, while offloading other layers to the server, aiming to preserve privacy while saving resources. Specifically, FELLRec flexibly maintains those input and output layers on the client side to ensure the protection of all sensitive information. Experiment results show that FELLRec can achieve a more balanced client performance and improved overall performance in a computational and storage-efficient way while safeguarding user privacy well.
pdf
bib
abs
WaterSeeker: Pioneering Efficient Detection of Watermarked Segments in Large Documents
Leyi Pan
|
Aiwei Liu
|
Yijian Lu
|
Zitian Gao
|
Yichen Di
|
Lijie Wen
|
Irwin King
|
Philip S. Yu
Watermarking algorithms for large language models (LLMs) have attained high accuracy in detecting LLM-generated text. However, existing methods primarily focus on distinguishing fully watermarked text from non-watermarked text, overlooking real-world scenarios where LLMs generate only small sections within large documents. In this scenario, balancing time complexity and detection performance poses significant challenges. This paper presents WaterSeeker, a novel approach to efficiently detect and locate watermarked segments amid extensive natural text. It first applies an efficient anomaly extraction method to preliminarily locate suspicious watermarked regions. Following this, it conducts a local traversal and performs full-text detection for more precise verification. Theoretical analysis and experimental results demonstrate that WaterSeeker achieves a superior balance between detection accuracy and computational efficiency. Moreover, its localization capability lays the foundation for building interpretable AI detection systems. Our code is available at https://github.com/THU-BPM/WaterSeeker.
pdf
bib
abs
MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation
Chanhee Park
|
Hyeonseok Moon
|
Chanjun Park
|
Heuiseok Lim
Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs) through the incorporation of external knowledge. However, the evaluation of RAG systems remains a challenge, due to the intricate interplay between retrieval and generation components. This limitation has resulted in a scarcity of benchmarks that facilitate a detailed, component-specific assessment. In this work, we present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks. We also introduce novel evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions such as noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Through comprehensive experiments across various retriever-LLM configurations, we provide new insights into the optimal alignment of model pairs and the nuanced dynamics within RAG systems. The dataset and evaluation code are publicly available, allowing for seamless integration and customization in diverse research settings.
pdf
bib
abs
FIRE: Fact-checking with Iterative Retrieval and Verification
Zhuohan Xie
|
Rui Xing
|
Yuxia Wang
|
Jiahui Geng
|
Hasan Iqbal
|
Dhruv Sahnan
|
Iryna Gurevych
|
Preslav Nakov
Fact-checking long-form text is challenging, and it is therefore common practice to break it down into multiple atomic claims. The typical approach to fact-checking these atomic claims involves retrieving a fixed number of pieces of evidence, followed by a verification step. However, this method is usually not cost-effective, as it underutilizes the verification model’s internal knowledge of the claim and fails to replicate the iterative reasoning process in human search strategies. To address these limitations, we propose FIRE, a novel agent-based framework that integrates evidence retrieval and claim verification in an iterative manner. Specifically, FIRE employs a unified mechanism to decide whether to provide a final answer or generate a subsequent search query, based on its confidence in the current judgment. We compare FIRE with other strong fact-checking frameworks and find that it achieves slightly better performance while reducing large language model (LLM) costs by an average of 7.6 times and search costs by 16.5 times. These results indicate that FIRE holds promise for application in large-scale fact-checking operations.
pdf
bib
abs
Lessons from a User Experience Evaluation of NLP Interfaces
Eduardo Calò
|
Lydia Penkert
|
Saad Mahamood
Human evaluations lay at the heart of evaluations within the field of Natural Language Processing (NLP). Seen as the “golden standard” of evaluations, questions are being asked on whether these evaluations are both reproducible and repeatable. One overlooked aspect is the design choices made by researchers when designing user interfaces (UIs). In this paper, four UIs used in past NLP human evaluations are assessed by UX experts, based on standardized human-centered interaction principles. Building on these insights, we derive several recommendations that the NLP community should apply when designing UIs, to enable more consistent human evaluation responses.
pdf
bib
abs
TrendSim: Simulating Trending Topics in Social Media Under Poisoning Attacks with LLM-based Multi-agent System
Zeyu Zhang
|
Jianxun Lian
|
Chen Ma
|
Yaning Qu
|
Ye Luo
|
Lei Wang
|
Rui Li
|
Xu Chen
|
Yankai Lin
|
Le Wu
|
Xing Xie
|
Ji-Rong Wen
Trending topics have become a significant part of modern social media, attracting users to participate in discussions of breaking events. However, they also bring in a new channel for poisoning attacks, resulting in negative impacts on society. Therefore, it is urgent to study this critical problem and develop effective strategies for defense. In this paper, we propose TrendSim, an LLM-based multi-agent system to simulate trending topics in social media under poisoning attacks. Specifically, we create a simulation environment for trending topics that incorporates a time-aware interaction mechanism, centralized message dissemination, and an interactive system. Moreover, we develop LLM-based humanoid agents to simulate users in social media, and propose prototype-based attackers to replicate poisoning attacks. Besides, we evaluate TrendSim from multiple aspects to validate its effectiveness. Based on TrendSim, we conduct simulation experiments to study four critical problems about poisoning attacks on trending topics.
pdf
bib
abs
ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval
Abdelrahman Abdallah
|
Jamshid Mozafari
|
Bhawna Piryani
|
Adam Jatowt
Retrieval-Augmented Generation (RAG) models have drawn considerable attention in modern open-domain question answering. The effectiveness of RAG depends on the quality of the top retrieved documents. However, conventional retrieval methods sometimes fail to rank the most relevant documents at the top. In this paper, we introduce ASRANK, a new re-ranking method based on scoring retrieved documents using zero-shot answer scent which relies on a pre-trained large language model to compute the likelihood of the document-derived answers aligning with the answer scent. Our approach demonstrates marked improvements across several datasets, including NQ, TriviaQA, WebQA, ArchivalQA, HotpotQA, and Entity Questions. Notably, ASRANK increases Top-1 retrieval accuracy on NQ from 19.2% to 46.5% for MSS and 22.1% to 47.3% for BM25. It also shows strong retrieval performance on several datasets compared to state-of-the-art methods (47.3 Top-1 by ASRANK vs 35.4 by UPR by BM25).
pdf
bib
abs
DSQG-Syn: Synthesizing High-quality Data for Text-to-SQL Parsing by Domain Specific Question Generation
Shaoming Duan
|
Youxuan Wu
|
Chuanyi Liu
|
Yuhao Zhang
|
Zirui Wang
|
Peiyi Han
|
Shengyuan Yu
|
Liang Yan
|
Yingwei Liang
Synthetic data has recently proven effective in enhancing the accuracy of Text-to-SQL parsers. However, existing methods generate SQL queries first by randomly sampling tables and columns based on probability and then synthesize natural language questions (NLQs). This approach often produces a large number of NLQ-SQL pairs that are irrelevant to the target domain and inconsistent in query intent, significantly diminishing the fine-tuning effectiveness of LLMs. In this paper, we introduce DSQG-Syn, a novel text-to-SQL data synthesis framework that based on domain-specific question generation. Specifically, we design a question generation method that creates domain-relevant questions based on predefined question types, ensuring coverage of major SQL operations. Guided by these questions, we synthesize NLQ-SQL pairs that are both domain-relevant and intent-consistent. To further enhance data quality, we filter out noisy samples from the generated pairs. When popular open-source LLMs are fine-tuned on our high-quality synthesized dataset, they achieve significant accuracy improvements, surpassing the performance of closed-source LLM-based approaches. Moreover, we demonstrate that our method outperforms existing state-of-the-art (SOTA) data synthesis techniques.
pdf
bib
abs
EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild
Junhyeok Kim
|
Min Soo Kim
|
Jiwan Chung
|
Jungbin Cho
|
Jisoo Kim
|
Sungwoong Kim
|
Gyeongbo Sim
|
Youngjae Yu
Predicting when to initiate speech in real-world environments remains a fundamental challenge for conversational agents. We introduce , a novel framework for real-time speech initiation prediction in egocentric streaming video. By modeling the conversation from the speaker’s first-person viewpoint, is tailored for human-like interactions in which a conversational agent must continuously observe its environment and dynamically decide when to talk.Our approach bridges the gap between simplified experimental setups and complex natural conversations by integrating four key capabilities: (1) first-person perspective, (2) RGB processing, (3) online processing, and (4) untrimmed video processing. We also present YT-Conversation, a diverse collection of in-the-wild conversational videos from YouTube, as a resource for large-scale pretraining. Experiments on EasyCom and Ego4D demonstrate that outperforms random and silence-based baselines in real time. Our results also highlight the importance of multimodal input and context length in effectively deciding when to speak. Code and data are available at website.
pdf
bib
abs
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
Chengyue Wu
|
Zhixuan Liang
|
Yixiao Ge
|
Qiushan Guo
|
Zeyu Lu
|
Jiahao Wang
|
Ying Shan
|
Ping Luo
Multi-modal Large Language Models have shown remarkable progress in visual contexts, yet their ability to convert visual figures into executable code remains underexplored. To address this, we introduce Plot2Code, a comprehensive benchmark designed to assess MLLMs’ visual coding capabilities. Plot2Code includes 132 high-quality matplotlib plots across six plot types, as well as an additional 150 and 86 plots from Python’s and R’s plotly libraries respectively, totaling 368 plots. Each plot is paired with its source code and a descriptive instruction generated by GPT-4, enabling thorough evaluation across diverse inputs. Furthermore, we propose three automatic evaluation metrics—code pass rate, text-match ratio, and GPT-4V rating judgement—to assess the quality of generated code and rendered images. Notably, the GPT-4V rating demonstrates strong reliability, as it correlates well with human evaluations, particularly for datasets of a certain size. Cross-validation across MLLMs (GPT-4V, Gemini-1.5-Pro, and Claude-3-Opus) also shows high consistency in ratings, which likely stems from the fact that ratings are based on rendered images rather than direct MLLM outputs, indicating minimal bias for this metric. Our evaluation of 14 MLLMs, including both proprietary and open-source models, highlights significant challenges in visual coding, particularly for text-dense plots, where MLLMs heavily rely on textual instructions. We believe these findings will advance future development of MLLMs.
pdf
bib
abs
FunnelRAG: A Coarse-to-Fine Progressive Retrieval Paradigm for RAG
Xinping Zhao
|
Yan Zhong
|
Zetian Sun
|
Xinshuo Hu
|
Zhenyu Liu
|
Dongfang Li
|
Baotian Hu
|
Min Zhang
Retrieval-Augmented Generation (RAG) prevails in Large Language Models. It mainly consists of retrieval and generation. The retrieval modules (a.k.a. retrievers) aim to find useful information used to facilitate the generation modules (a.k.a. generators). As such, generators’ performance largely depends on the effectiveness and efficiency of retrievers. However, the widely used retrieval paradigm remains flat. It treats retrieval procedures as a one-off deal with constant granularity. Despite effectiveness, we argue that they suffer from two limitations: (1) flat retrieval exerts a significant burden on one retriever; (2) constant granularity limits the ceiling of retrieval performance. In this work, we propose a progressive retrieval paradigm with coarse-to-fine granularity for RAG, termed FunnelRAG, so as to balance effectiveness and efficiency. Specifically, FunnelRAG establishes a progressive retrieval pipeline by collaborating coarse-to-fine granularity, large-to-small quantity, and low-to-high capacity, which can relieve the burden on one retriever and also promote the ceiling of retrieval performance. Extensive experiments manifest that FunnelRAG achieves comparable retrieval performance while the time overhead is reduced by nearly 40 percent.
pdf
bib
abs
The Power of Bullet Lists: A Simple Yet Effective Prompting Approach to Enhancing Spatial Reasoning in Large Language Models
Ikhyun Cho
|
Changyeon Park
|
Julia Hockenmaier
While large language models (LLMs) are dominating the field of natural language processing, it remains an open question how well these models can perform spatial reasoning. Contrary to recent studies suggesting that LLMs struggle with spatial reasoning tasks, we demonstrate in this paper that a novel prompting technique, termed Patient Visualization of Thought (Patient-VoT), can boost LLMs’ spatial reasoning abilities. The core idea behind Patient-VoT is to explicitly integrate *bullet lists, coordinates, and visualizations* into the reasoning process. By applying Patient-VoT, we achieve a significant boost in spatial reasoning performance compared to prior prompting techniques. We also show that integrating bullet lists into reasoning is effective in planning tasks, highlighting its general effectiveness across different applications.
pdf
bib
abs
Overcoming both Domain Shift and Label Shift for Referring Video Segmentation
Hai Huang
|
Sashuai Zhou
|
Yan Xia
Open-set domain generalization (OSDG) aims to enhance the robustness of the model when facing both domain shift and label shift, highlighting a wide range of potential in real-world applications. However, previous OSDG methods can only recognize seen objects and mark all unseen objects as “unknown” categories during inference, which is far from satisfactory. In this paper, we explore the scenario of referring video segmentation to study how to make the model maintain good segmentation ability for unknown objects under OSDG setting. To bridge the huge gap caused by label shift, we propose CLIP-based Reasoning Prompt (CRPrompt), which can combine text and visual prompts together to improve text-object matching ability of CLIP, transferring the segmentation ability to unseen classes based on the knowledge learned from seen classes and large-scale text-image pairs, i.e., color, shape, spatial relationships. Meanwhile, to improve the robustness of CRPrompt, we propose Retrieval-augmented Instance Normalization (RaIN), which can effectively enhance the robustness of the model by retrieving visual objects with similar semantic concepts through input query and performing Instance Norm among them. Extensive experiments on open-set and zero-shot domain generalization tasks demonstrate the effectiveness of our approach.
pdf
bib
abs
Language Modeling with Editable External Knowledge
Belinda Z. Li
|
Emmy Liu
|
Alexis Ross
|
Abbas Zeitoun
|
Graham Neubig
|
Jacob Andreas
When the world changes, so does the text that people write about it. How do we build language models that can be easily updated to reflect these changes? One popular approach is retrieval-augmented generation (RAG), in which new documents are inserted into a knowledge base and retrieved during prediction for downstream tasks. Most prior work on RAG has focused on improving model behavior during *prediction* through better retrieval or reasoning. This paper introduces ERASE, which instead improves model behavior **when new documents are acquired**, by incrementally deleting or rewriting other entries in the knowledge base each time a document is added. In two new datasets evaluating models’ ability to answer questions about a stream of news articles or conversations, ERASE improves accuracy relative to conventional retrieval-augmented generation by 7-13% (Mixtral-8x7B) and 6-10% (Llama-3-8B) absolute. This improvement is complementary to improved retrieval or reasoning for RAG: we demonstrate an 11% improvement by applying ERASE to SelfRAG.
pdf
bib
abs
Beyond Excess and Deficiency: Adaptive Length Bias Mitigation in Reward Models for RLHF
Yuyan Bu
|
Liangyu Huo
|
Yi Jing
|
Qing Yang
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models (LLMs) with human values. However, it has been noted that reward models in RLHF often exhibit unintended biases, such as an overemphasis on response length based on the erroneous assumption that longer responses are universally preferred. This “length bias” can lead to excessively verbose responses that compromise the quality of LLMs alignment. Previous efforts to mitigate length bias in reward models have inadvertently decreased their accuracy by neglecting the legitimate influence of response length on human preferences. In this work, we argue that response length is a context-specific factor in human evaluations, with different queries naturally eliciting varying preferences for response length. We propose an adaptive approach to modeling length preference that dynamically adjusts the influence of response length in reward evaluations according to the context of the query. Experimental results demonstrate that our adaptive approach effectively balances the mitigation of undesired length hacking and alignment accuracy, reducing unnecessary verbosity while improving overall response quality.
pdf
bib
abs
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification
Vishnu Kabir Chhabra
|
Ding Zhu
|
Mohammad Mahdi Khalili
Previous research has shown that fine-tuning language models on general tasks enhance their underlying mechanisms. However, the impact of fine-tuning on poisoned data and the resulting changes in these mechanisms are poorly understood. This study investigates the changes in a model’s mechanisms during toxic fine-tuning and identifies the primary corruption mechanisms. We also analyze the changes after retraining a corrupted model on the original dataset and observe neuroplasticity behaviors, where the model relearns original mechanisms after fine-tuning the corrupted model. Our findings indicate that; (i) Underlying mechanisms are amplified across task-specific fine-tuning which can be generalized to longer epochs, (ii) Model corruption via toxic fine-tuning is localized to specific circuit components, (iii) Models exhibit neuroplasticity when retraining corrupted models on clean dataset, reforming the original model mechanisms.
pdf
bib
abs
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
Hanan Gani
|
Rohit Bharadwaj
|
Muzammal Naseer
|
Fahad Shahbaz Khan
|
Salman Khan
The recent advancements in Large Language Models (LLMs) have greatly influenced the development of Large Multi-modal Video Models (Video-LMMs), significantly enhancing our ability to interpret and analyze video data. Despite their impressive capabilities, current Video-LMMs have not been evaluated for anomaly detection tasks, which is critical to their deployment in practical scenarios e.g., towards identifying deepfakes, manipulated video content, traffic accidents and crimes. In this paper, we introduce VANE-Bench, a benchmark designed to assess the proficiency of Video-LMMs in detecting and localizing anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models, encompassing a variety of subtle anomalies and inconsistencies grouped into five categories: unnatural transformations, unnatural appearance, pass-through, disappearance and sudden appearance. Additionally, our benchmark features real-world samples from existing anomaly detection datasets, focusing on crime-related irregularities, atypical pedestrian behavior, and unusual events. The task is structured as a visual question-answering challenge to gauge the models’ ability to accurately detect and localize the anomalies within the videos. We evaluate nine existing Video-LMMs, both open and closed sources, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies. In conclusion, our research offers significant insights into the current capabilities of Video-LMMs in the realm of anomaly detection, highlighting the importance of our work in evaluating and improving these models for real-world applications. Our code and data is publicly available at https://github.com/rohit901/VANE-Bench.
pdf
bib
abs
Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models
Jiachen Ma
|
Yijiang Li
|
Zhiqing Xiao
|
Anda Cao
|
Jie Zhang
|
Chao Ye
|
Junbo Zhao
Text-to-image (T2I) models can be maliciously used to generate harmful content such as sexually explicit, unfaithful, and misleading or Not-Safe-for-Work (NSFW) images. Previous attacks largely depend on the availability of the diffusion model or involve a lengthy optimization process. In this work, we investigate a more practical and universal attack that does not require the presence of a target model and demonstrate that the high-dimensional text embedding space inherently contains NSFW concepts that can be exploited to generate harmful images. We present the Jailbreaking Prompt Attack (JPA). JPA first searches for the target malicious concepts in the text embedding space using a group of antonyms generated by ChatGPT. Subsequently, a prefix prompt is optimized in the discrete vocabulary space to align malicious concepts semantically in the text embedding space.We further introduce a soft assignment with gradient masking technique that allows us to perform gradient ascent in the discrete vocabulary space.We perform extensive experiments with open-sourced T2I models, e.g. stable-diffusion-v1-4 and closed-sourced online services, e.g. DALL·E 2 and Midjourney with black-box safety checkers. Results show that (1) JPA bypasses both text and image safety checkers, (2) while preserving high semantic alignment with the target prompt. (3) JPA demonstrates a much faster speed than previous methods and can be executed in a fully automated manner. These merits render it a valuable tool for robustness evaluation in future text-to-image generation research.
pdf
bib
abs
Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description
Mahshid Dehghani
|
Amirahmad Shafiee
|
Ali Shafiei
|
Neda Fallah
|
Farahmand Alizadeh
|
Mohammad Mehdi Gholinejad
|
Hamid Behroozi
|
Jafar Habibi
|
Ehsaneddin Asgari
3D facial emotion modeling has important applications in areas such as animation design, virtual reality, and emotional human-computer interaction (HCI). However, existing models are constrained by limited emotion classes and insufficient datasets. To address this, we introduce Emo3D, an extensive “Text-Image-Expression dataset” that spans a wide spectrum of human emotions, each paired with images and 3D blendshapes. Leveraging Large Language Models (LLMs), we generate a diverse array of textual descriptions, enabling the capture of a broad range of emotional expressions. Using this unique dataset, we perform a comprehensive evaluation of fine-tuned language-based models and vision-language models, such as Contrastive Language-Image Pretraining (CLIP), for 3D facial expression synthesis. To better assess conveyed emotions, we introduce Emo3D metric, a new evaluation metric that aligns more closely with human perception than traditional Mean Squared Error (MSE). Unlike MSE, which focuses on numerical differences, Emo3D captures emotional nuances in visual-text alignment and semantic richness. Emo3D dataset and metric hold great potential for advancing applications in animation and virtual reality.
pdf
bib
abs
Task-wrapped Continual Learning in Task-Oriented Dialogue Systems
Min Zeng
|
Haiqin Yang
|
Xi Chen
|
Yike Guo
Continual learning is vital for task-oriented dialogue systems (ToDs), and AdapterCL, equipped with residual adapters, has proven effectiveness in this domain. However, its performance is limited by training separate adapters for each task, preventing global knowledge sharing. To address this, we propose **Task-wrapped Continual Learning (TCL)**, a novel framework that employs **Task-Wrapped Adapters (TWAs)**, to simultaneously learn both global and task-specific information through parameter sharing. TCL leverages task-conditioned hypernetworks to transfer global knowledge across tasks, enabling TWAs to start from more informed initialization, efficiently learning task-specific details while reducing model parameters. Additionally, the simple, linear structure of both hypernetworks and TWAs ensure stable training, with task-free inference supported through effective loss utilization. Across 37 ToD domains, TCL consistently outperforms AdapterCL, significantly reducing forgetting. Remarkably, by setting the task embedding dimension to 1, TCL achieves a 4.76% improvement over AdapterCL while using only 46% of the parameters. These findings position TWA as a lightweight, powerful alternative to traditional adapters, offering a promising solution for continual learning in ToDs. The code is availableat https://github.com/cloversjtu/TCL.
pdf
bib
abs
Untangling Hate Speech Definitions: A Semantic Componential Analysis Across Cultures and Domains
Katerina Korre
|
Arianna Muti
|
Federico Ruggeri
|
Alberto Barrón-Cedeño
Hate speech relies heavily on cultural influences, leading to varying individual interpretations. For that reason, we propose a Semantic Componential Analysis (SCA) framework for a cross-cultural and cross-domain analysis of hate speech definitions. We create the first dataset of hate speech definitions encompassing 493 definitions from more than 100 cultures, drawn from five key domains: online dictionaries, academic research, Wikipedia, legal texts, and online platforms. By decomposing these definitions into semantic components,our analysis reveals significant variation across definitions, yet many domains borrow definitions from one another without taking into account the target culture. We conduct zero-shot model experiments using our proposed dataset, employing three popular open-sourced LLMs to understand the impact of different definitions on hate speech detection. Our findings indicate that LLMs are sensitive to definitions: responses for hate speech detection change according to the complexity of definitions used in the prompt.
pdf
bib
abs
CodeRAG-Bench: Can Retrieval Augment Code Generation?
Zora Zhiruo Wang
|
Akari Asai
|
Xinyan Velocity Yu
|
Frank F. Xu
|
Yiqing Xie
|
Graham Neubig
|
Daniel Fried
While language models (LMs) excel at generating code, many programs are difficult to generate using only parametric knowledge. Despite the success of retrieval-augmented generation (RAG) in text-centric tasks, its potential for code generation remains under-explored. This work introduces CodeRAG-bench, a holistic retrieval-augmented code generation benchmark covering tasks like basic programming, open-domain, and repository-level problems and provides reproducible evaluations on both retrieval and end-to-end code generation performance. We further create a diverse, open datastore for code retrieval, aggregating sources such as competition solutions, tutorials, library documentation, StackOverflow posts, and GitHub repositories. Based on CodeRAG-bench, we conduct large-scale evaluations of 10 retrievers and 10 LMs and systematically analyze when retrieval can benefit code generation models and identify remaining challenges. We find that while retrieving high-quality contexts improves code generation, retrievers often struggle to fetch useful contexts, and generators face limitations in using those contexts effectively. We hope CodeRAG-bench encourages further development in code-oriented RAG methods.
pdf
bib
abs
Multi-Condition Guided Diffusion Network for Multimodal Emotion Recognition in Conversation
Wenjin Tian
|
Xianying Huang
|
Shihao Zou
Emotion recognition in conversation (ERC) involves identifying emotional labels associated with utterances within a conversation, a task that is essential for developing empathetic robots. Current research emphasizes contextual factors, the speaker’s influence, and extracting complementary information across different modalities. However, it often overlooks the cross-modal noise at the semantic level and the redundant information brought by the features themselves. This study introduces a diffusion-based approach designed to effectively address the challenges posed by redundant information and unexpected noise while robustly capturing shared semantics, thus facilitating the learning of compact and representative features from multimodal data. Specifically, we present the Multi-Condition Guided Diffusion Network (McDiff). McDiff employs a modal prior knowledge extraction strategy to derive the prior distribution for each modality, thereby enhancing the regional attention of each modality and applying the generated prior distribution at each diffusion step. Furthermore, we propose a method to learn the mutual information of each modality through a specific objective constraints approach prior to the forward process, which aims to improve inter-modal interaction and mitigate the effects of noise and redundancy. Comprehensive experiments conducted on two multimodal datasets, IEMOCAP and MELD, demonstrate that McDiff significantly surpasses existing state-of-the-art methodologies, thereby affirming the generalizability and efficacy of the proposed model.
pdf
bib
abs
Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Senses
Samuel Cahyawijaya
|
Ruochen Zhang
|
Jan Christian Blaise Cruz
|
Holy Lovenia
|
Elisa Gilbert
|
Hiroki Nomoto
|
Alham Fikri Aji
Multilingual large language models (LLMs) have gained prominence, but concerns arise regarding their reliability beyond English. This study addresses the gap in cross-lingual semantic evaluation by introducing a novel benchmark for cross-lingual sense disambiguation, StingrayBench. In this paper, we demonstrate using false friends—words that are orthographically similar but have completely different meanings in two languages— as a possible approach to pinpoint the limitation of cross-lingual sense disambiguation in LLMs. We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German; and challenge LLMs to distinguish the use of them in context. In our analysis of various models, we observe they tend to be biased toward higher-resource languages. We also propose new metrics for quantifying the cross-lingual sense bias and comprehension based on our benchmark. Our work contributes to developing more diverse and inclusive language modeling, promoting fairer access for the wider multilingual community.
pdf
bib
abs
Atoxia: Red-teaming Large Language Models with Target Toxic Answers
Yuhao Du
|
Zhuo Li
|
Pengyu Cheng
|
Xiang Wan
|
Anningzhe Gao
Despite the substantial advancements in artificial intelligence, large language models (LLMs) remain being challenged by generation safety. With adversarial jailbreaking prompts, one can effortlessly induce LLMs to output harmful content, causing unexpected negative social impacts. This vulnerability highlights the necessity for robust LLM red-teaming strategies to identify and mitigate such risks before large-scale application. To detect specific types of risks, we propose a novel red-teaming method that **A**ttacks LLMs with **T**arget **Toxi**c **A**nswers (**Atoxia**). Given a particular harmful answer, Atoxia generates a corresponding user query and a misleading answer opening to examine the internal defects of a given LLM. The proposed attacker is trained within a reinforcement learning scheme with the LLM outputting probability of the target answer as the reward. We verify the effectiveness of our method on various red-teaming benchmarks, such as AdvBench and HH-Harmless. The empirical results demonstrate that Atoxia can successfully detect safety risks in not only open-source models but also state-of-the-art black-box models such as GPT-4o.
pdf
bib
abs
A Practical Method for Generating String Counterfactuals
Matan Avitan
|
Ryan Cotterell
|
Yoav Goldberg
|
Shauli Ravfogel
Interventions targeting the representation space of language models (LMs) have emerged as an effective means to influence model behavior. Such methods are employed, for example, to eliminate or alter the encoding of demographic information such as gender within the model’s representations and, in so doing, create a counterfactual representation. However, because the intervention operates within the representation space, understanding precisely what aspects of the text it modifies poses a challenge. In this paper, we give a method to convert representation counterfactuals into string counterfactuals. We demonstrate that this approach enables us to analyze the linguistic alterations corresponding to a given representation space intervention and to interpret the features utilized to encode a specific concept. Moreover, the resulting counterfactuals can be used to mitigate bias in classification through data augmentation.
pdf
bib
abs
Probing-RAG: Self-Probing to Guide Language Models in Selective Document Retrieval
Ingeol Baek
|
Hwan Chang
|
ByeongJeong Kim
|
Jimin Lee
|
Hwanhee Lee
Retrieval-Augmented Generation (RAG) enhances language models by retrieving and incorporating relevant external knowledge. However, traditional retrieve-and-generate processes may not be optimized for real-world scenarios, where queries might require multiple retrieval steps or none at all. In this paper, we propose a Probing-RAG, which utilizes the hidden state representations from the intermediate layers of language models to adaptively determine the necessity of additional retrievals for a given query. By employing a pre-trained prober, Probing-RAG effectively captures the model’s internal cognition, enabling reliable decision-making about retrieving external documents. Experimental results across five open-domain QA datasets demonstrate that Probing-RAG outperforms previous methods while reducing the number of redundant retrieval steps.
pdf
bib
abs
Extracting Military Event Temporal Relations via Relative Event Time Prediction and Virtual Adversarial Training
Jie Gong
|
Qiwang Hu
Extracting temporal relationships between events in the text is crucial for understanding how events unfold over time, especially in the information-dense and precision-demanding military field. Existing models for extracting event temporal relations typically compare the relative times of events directly, neglecting the contextual information between event pairs. This can lead to difficulties in handling uncertain temporal boundaries expressed in text. In this paper, we propose an event temporal relationship extraction model for the military field, based on relative event time prediction and virtual adversarial training, MFRV. The relative event time prediction as an auxiliary task enhances the model’s ability to capture and infer temporal relationships. Virtual adversarial training increases the model’s generalization by generating adversarial samples. Additionally, we adopt the MoCo (Multi-objective gradient correction) method to balance the losses from relative event time prediction and virtual adversarial training, effectively resolving the gradient bias issue in multi-objective optimization. Furthermore, we have constructed a new dataset, TRMF, specifically for event temporal relationship extraction in the military field. Experiments conducted on TRMF, as well as widely used public datasets MATRES and TCR, demonstrate the effectiveness of MFRV.
pdf
bib
abs
Unlocking the Planning Capabilities of Large Language Models with Maximum Diversity Fine-tuning
Wenjun Li
|
Changyu Chen
|
Pradeep Varakantham
Large language models (LLMs) have demonstrated impressive task-solving capabilities through prompting techniques and system designs, including solving planning tasks (e.g., math proofs, basic travel planning) when sufficient data is available online and used during pre-training. However, for planning tasks with limited prior data (e.g., blocks world, advanced travel planning), the performance of LLMs, including proprietary models like GPT and Gemini, is poor. This paper investigates the impact of fine-tuning on the planning capabilities of LLMs, revealing that LLMs can achieve strong performance in planning through substantial (tens of thousands of specific examples) fine-tuning. Yet, this process incurs high economic, time, and computational costs for each planning problem variation. To address this, we propose Clustering-Based Maximum Diversity Sampling (CMDS), which selects diverse and representative data to enhance sample efficiency and the model’s generalization capability. Extensive evaluations demonstrate that CMDS-l, a baseline method combining CMDS with language embeddings, outperforms random sampling. Furthermore, we introduce a novel algorithm, CMDS-g, which encodes planning task instances with their graph representations into the embedding space. Empirical results show that CMDS-g consistently outperforms baseline methods across various scales and multiple benchmark domains.
pdf
bib
abs
Continuous Speech Tokenizer in Text To Speech
Yixing Li
|
Ruobing Xie
|
Xingwu Sun
|
Yu Cheng
|
Zhanhui Kang
The fusion of speech and language in the era of large language models has garnered significant attention. Discrete speech token is often utilized in text-to-speech tasks for speech compression and portability, which is convenient for joint training with text and have good compression efficiency. However, we found that the discrete speech tokenizer still suffers from information loss. Therefore, we propose a simple yet effective continuous speech tokenizer named Cont-SPT, and a text-to-speech model based on continuous speech tokens. Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MoS). This enhancement is attributed to better information preservation rate of the continuous speech tokenizer across both low and high frequencies in the frequency domain. The code and resources for Cont-SPT can be found in https://github.com/Yixing-Li/Continuous-Speech-Tokenizer.
pdf
bib
abs
Efficient Annotator Reliability Assessment and Sample Weighting for Knowledge-Based Misinformation Detection on Social Media
Owen Cook
|
Charlie Grimshaw
|
Ben Peng Wu
|
Sophie Dillon
|
Jack Hicks
|
Luke Jones
|
Thomas Smith
|
Matyas Szert
|
Xingyi Song
Misinformation spreads rapidly on social media, confusing the truth and targeting potentially vulnerable people. To effectively mitigate the negative impact of misinformation, it must first be accurately detected before applying a mitigation strategy, such as X’s community notes, which is currently a manual process. This study takes a knowledge-based approach to misinformation detection, modelling the problem similarly to one of natural language inference. The EffiARA annotation framework is introduced, aiming to utilise inter- and intra-annotator agreement to understand the reliability of each annotator and influence the training of large language models for classification based on annotator reliability. In assessing the EffiARA annotation framework, the Russo-Ukrainian Conflict Knowledge-Based Misinformation Classification Dataset (RUC-MCD) was developed and made publicly available. This study finds that sample weighting using annotator reliability performs the best, utilising both inter- and intra-annotator agreement and soft label training. The highest classification performance achieved using Llama-3.2-1B was a macro-F1 of 0.757 and 0.740 using TwHIN-BERT-large.
pdf
bib
abs
Challenges in Trustworthy Human Evaluation of Chatbots
Wenting Zhao
|
Alexander M Rush
|
Tanya Goyal
Recently, open community-driven platforms like Chatbot Arena that collect user preference data from site visitors have gained reputation as trustworthy publicly available benchmarks for LLM performance. While gold standard, it is often tricky to implement the required guardrails to collect high-quality annotations from humans. In this paper, we demonstrate that different source of bad annotations, both malicious and otherwise, can corrupt the reliability of open leaderboard rankings. In particular, we show that only 10% of poor quality votes by apathetic (site visitors not appropriately incentivized to give correct votes) or adversarial (bad actors seeking to inflate the ranking of a target model) annotators can change the rankings of models by up to 5 places on the leaderboard. Finally, we discuss open challenges in ensuring high quality human annotations.
pdf
bib
abs
RATSD: Retrieval Augmented Truthfulness Stance Detection from Social Media Posts Toward Factual Claims
Zhengyuan Zhu
|
Zeyu Zhang
|
Haiqi Zhang
|
Chengkai Li
Social media provides a valuable lens for assessing public perceptions and opinions. This paper focuses on the concept of truthfulness stance, which evaluates whether a textual utterance affirms, disputes, or remains neutral or indifferent toward a factual claim. Our systematic analysis fills a gap in the existing literature by offering the first in-depth conceptual framework encompassing various definitions of stance. We introduce RATSD (Retrieval Augmented Truthfulness Stance Detection), a novel method that leverages large language models (LLMs) with retrieval-augmented generation (RAG) to enhance the contextual understanding of tweets in relation to claims. RATSD is evaluated on TSD-CT, our newly developed dataset containing 3,105 claim-tweet pairs, along with existing benchmark datasets. Our experiment results demonstrate that RATSD outperforms state-of-the-art methods, achieving a significant increase in Macro-F1 score on TSD-CT. Our contributions establish a foundation for advancing research in misinformation analysis and provide valuable tools for understanding public perceptions in digital discourse.
pdf
bib
abs
FACT: Examining the Effectiveness of Iterative Context Rewriting for Multi-fact Retrieval
Jinlin Wang
|
Suyuchen Wang
|
Ziwen Xia
|
Sirui Hong
|
Yun Zhu
|
Bang Liu
|
Chenglin Wu
Large Language Models (LLMs) are proficient at retrieving single facts from extended contexts, yet they struggle with tasks requiring the simultaneous retrieval of multiple facts, especially during generation. This paper identifies a novel “lost-in-the-middle” phenomenon, where LLMs progressively lose track of critical information throughout the generation process, resulting in incomplete or inaccurate retrieval. To address this challenge, we introduce Find All Crucial Texts (FACT), an iterative retrieval method that refines context through successive rounds of rewriting. This approach enables models to capture essential facts incrementally, which are often overlooked in single-pass retrieval. Experiments demonstrate that FACT substantially enhances multi-fact retrieval performance across various tasks, though improvements are less notable in general-purpose QA scenarios. Our findings shed light on the limitations of LLMs in multi-fact retrieval and underscore the need for more resilient long-context retrieval strategies.
pdf
bib
abs
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding
Xingjian Diao
|
Chunhui Zhang
|
Weiyi Wu
|
Zhongyu Ouyang
|
Peijun Qing
|
Ming Cheng
|
Soroush Vosoughi
|
Jiang Gui
Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences—an essential requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model’s limited capacity, enhancing its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at https://github.com/xid32/NAACL_2025_TWM.
pdf
bib
abs
Investigating the Transferability of Code Repair for Low-Resource Programming Languages
Kyle Wong
|
Alfonso Amayuelas
|
Liangming Pan
|
William Yang Wang
Large language models (LLMs) have shown remarkable performance on code generation tasks. A recent use case is iterative code repair, where an LLM fixes an incorrect program by rationalizing about errors and generating new code. Recent works augment the code repair process by integrating modern techniques such as chain-of-thought reasoning or distillation, but only study their benefits on high-resource languages like Python, and ignore low-resource languages like Perl. To address this gap of knowledge, we investigate the benefits of distilling code repair for both high and low resource languages to determine if the techniques that are effective in a high resource setting are also applicable in a low resource setting. Our evaluation shows that distilling the ability to repair code has language dependent benefits. To explain this behavior, we perform a further analysis and find that contrary to preexisting beliefs, the correlation between reasoning ability and code correction ability is weak. We hypothesize this weak correlation is magnified in low-resource settings where base models lack deep knowledge of a programming language, leading to wavering benefits of code repair.
pdf
bib
abs
Multilingual Blending: Large Language Model Safety Alignment Evaluation with Language Mixture
Jiayang Song
|
Yuheng Huang
|
Zhehua Zhou
|
Lei Ma
As safety remains a crucial concern throughout the development lifecycle of Large Language Models (LLMs), researchers and industrial practitioners have increasingly focused on safeguarding and aligning LLM behaviors with human preferences and ethical standards. LLMs, trained on extensive multilingual corpora, exhibit powerful generalization abilities across diverse languages and domains. However, current safety alignment practices predominantly focus on single-language scenarios, which leaves their effectiveness in complex multilingual contexts, especially for those complex mixed-language formats, largely unexplored. In this study, we introduce Multilingual Blending, a mixed-language query-response scheme designed to evaluate the safety alignment of various state-of-the-art LLMs (e.g., GPT-4o, GPT 3.5, Llama3) under sophisticated, multilingual conditions. We further investigate language patterns such as language availability, morphology, and language family that could impact the effectiveness of Multilingual Blending in compromising the safeguards of LLMs. Our experimental results show that, without meticulously crafted prompt templates, Multilingual Blending significantly amplifies the detriment of malicious queries, leading to dramatically increased bypass rates in LLM safety alignment (67.23% on GPT-3.5 and 40.34% on GPT-4o), far exceeding those of single-language baselines. Moreover, the performance of Multilingual Blending varies notably based on intrinsic linguistic properties, with languages of different morphology and from diverse families being more prone to evading safety alignments. These findings underscore the necessity of evaluating LLMs and developing corresponding safety alignment strategies in a complex, multilingual context to align with their superior cross-language generalization capabilities.
pdf
bib
abs
Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting
Jiarui Wu
|
Zhuo Liu
|
Hangfeng He
Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), leading to generate incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) bidirectional constraint, which ensures consistency in pairwise object relations, and (2) transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely-used spatial relation datasets, demonstrating performance improvements over existing approaches. Additionally, a systematic analysis of various bidirectional relation analysis choices and transitivity reference selections highlights greater possibilities of our methods in incorporating constraints to mitigate spatial relation hallucinations.
pdf
bib
abs
Concise and Organized Perception Facilitates Reasoning in Large Language Models
Junjie Liu
|
Shaotian Yan
|
Chen Shen
|
Zhengdong Xiao
|
Liang Xie
|
Wenxiao Wang
|
Jieping Ye
Exploiting large language models (LLMs) to tackle reasoning has garnered growing attention. It still remains highly challenging to achieve satisfactory results in complex logical problems, characterized by plenty of premises within the context and requiring multi-hop reasoning. In particular, the reasoning capabilities of LLMs are brittle to disorder and distractibility. In this work, we first examine the mechanism from the perspective of information flow and reveal that LLMs confront difficulties akin to human-like cognitive biases when dealing with disordered and irrelevant content in reasoning tasks. However, in contrast to LLMs, disordered and irrelevant content does not significantly decrease human performance, as humans have a propensity to distill the most relevant information and systematically organize their thoughts, aiding them in responding to questions.Stem from that, we further propose a novel reasoning approach named Concise and Organized Perception (COP). COP carefully analyzes the given statements to identify the most pertinent information while eliminating redundancy efficiently. It then prompts the LLMs in a more organized form that adapts to the model’s inference process. By perceiving concise and organized context, the reasoning abilities of LLMs can be better elicited. Extensive experimental results on several popular logical benchmarks (ProofWriter, PrOntoQA, PrOntoQA-OOD, and FOLIO) and mathematical benchmark (DI-GSM) show that COP significantly outperforms previous state-of-the-art methods.
pdf
bib
abs
Verifiable Format Control for Large Language Model Generations
Zhaoyang Wang
|
Jinqi Jiang
|
Huichi Zhou
|
Wenhao Zheng
|
Xuchao Zhang
|
Chetan Bansal
|
Huaxiu Yao
Recent Large Language Models (LLMs) have demonstrated satisfying general instruction following ability. However, small LLMs with about 7B parameters still struggle fine-grained format following (e.g., JSON format), which seriously hinder the advancements of their applications. Most existing methods focus on benchmarking general instruction following while overlook how to improve the specific format following ability for small LLMs. Besides, these methods often rely on evaluations based on advanced LLMs (e.g., GPT-4), which can introduce the intrinsic bias of LLMs and be costly due to the API calls. In this paper, we first curate a fully verifiable format following dataset VFF. In contrast to existing works often adopting external LLMs for instruction-following validations, every sample of VFF can be easily validated with a Python function. Further, we propose to leverage this verifiable feature to synthesize massive data for progressively training small LLMs, in order to improve their format following abilities. Experimental results highlight the prevalent limitations in the format following capabilities of 7B level open-source LLMs and demonstrate the effectiveness of our method in enhancing this essential ability.
pdf
bib
abs
Taxonomy and Analysis of Sensitive User Queries in Generative AI Search System
Hwiyeol Jo
|
Taiwoo Park
|
Hyunwoo Lee
|
Nayoung Choi
|
Changbong Kim
|
Ohjoon Kwon
|
Donghyeon Jeon
|
Eui-Hyeon Lee
|
Kyoungho Shin
|
Sun Suk Lim
|
Kyungmi Kim
|
Jihye Lee
|
Sun Kim
Although there has been a growing interest among industries in integrating generative LLMs into their services, limited experience and scarcity of resources act as a barrier in launching and servicing large-scale LLM-based services. In this paper, we share our experiences in developing and operating generative AI models within a national-scale search engine, with a specific focus on the sensitiveness of user queries. We propose a taxonomy for sensitive search queries, outline our approaches, and present a comprehensive analysis report on sensitive queries from actual users. We believe that our experiences in launching generative AI search systems can contribute to reducing the barrier in building generative LLM-based services.
pdf
bib
SynGhost: Invisible and Universal Task-agnostic Backdoor Attack via Syntactic Transfer
Pengzhou Cheng
|
Wei Du
|
Zongru Wu
|
Fengwei Zhang
|
Libo Chen
|
Zhuosheng Zhang
|
Gongshen Liu
pdf
bib
abs
TESTEVAL: Benchmarking Large Language Models for Test Case Generation
Wenhan Wang
|
Chenyuan Yang
|
Zhijie Wang
|
Yuheng Huang
|
Zhaoyang Chu
|
Da Song
|
Lingming Zhang
|
An Ran Chen
|
Lei Ma
For program languages, testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program under test. Recently, researchers have recognized the potential of large language models (LLMs) in software testing. However, there remains a lack of fair comparisons between different LLMs in terms of test case generation capabilities.In this paper, we propose TestEval, a novel benchmark for test case generation with LLMs. We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. We further evaluate 17 popular LLMs, including both commercial and open-source ones, on TestEval. We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs, indicating a lack of ability to comprehend program logic and execution paths.
pdf
bib
abs
Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Models
Siyin Wang
|
Xingsong Ye
|
Qinyuan Cheng
|
Junwen Duan
|
Shimin Li
|
Jinlan Fu
|
Xipeng Qiu
|
Xuanjing Huang
As Artificial General Intelligence (AGI) becomes increasingly integrated into various facets of human life, ensuring the safety and ethical alignment of such systems is paramount. Previous studies primarily focus on single-modality threats, which may not suffice given the integrated and complex nature of cross-modality interactions. We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (*SIUO*) to evaluate cross-modality safety alignment. Specifically, it considers cases where single modalities are safe independently but could potentially lead to unsafe or unethical outputs when combined. To empirically investigate this problem, we developed the *SIUO*, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, such as GPT-4V and LLaVA, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
pdf
bib
abs
FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models
Dahyun Jung
|
Seungyoon Lee
|
Hyeonseok Moon
|
Chanjun Park
|
Heuiseok Lim
Recent advancements in Large Language Models (LLMs) have significantly enhanced interactions between users and models. These advancements concurrently underscore the need for rigorous safety evaluations due to the manifestation of social biases, which can lead to harmful societal impacts. Despite these concerns, existing benchmarks may overlook the intrinsic weaknesses of LLMs, which can generate biased responses even with simple adversarial instructions. To address this critical gap, we introduce a new benchmark, Fairness Benchmark in LLM under Extreme Scenarios (FLEX), designed to test whether LLMs can sustain fairness even when exposed to prompts constructed to induce bias. To thoroughly evaluate the robustness of LLMs, we integrate prompts that amplify potential biases into the fairness assessment. Comparative experiments between FLEX and existing benchmarks demonstrate that traditional evaluations may underestimate the inherent risks in models. This highlights the need for more stringent LLM evaluation benchmarks to guarantee safety and fairness.
pdf
bib
abs
When and How to Augment Your Input: Question Routing Helps Balance the Accuracy and Efficiency of Large Language Models
Shufan Chen
|
He Zheng
|
Lei Cui
Although large language models rely on parametric knowledge to achieve exceptional performance across various question-answering tasks, they still face challenges when addressing knowledge-based long-tail questions. Augmented generation techniques, such as chain-of-thought prompting and retrieval augmentation, can effectively enhance the ability of these models to answer long-tail questions. However, improving accuracy through augmented generation often results in significant latency within question-answering systems. This paper addresses the issue of “when and how to augment the input” by proposing an adaptive question routing framework. This framework employs a query router to select the most appropriate augmentation path at the right time, thereby enhancing both the accuracy and efficiency of question-answering systems. Extensive comparative experiments on benchmarks such as AmbigNQ, HotpotQA, MMLU-STEM, and PopQA demonstrate that our method surpasses existing approaches in both accuracy and efficiency. Furthermore, this paper introduces two metrics for evaluating adaptive question augmentation methods and presents a new benchmark for adaptive question augmentation, aiming to advance the field.
pdf
bib
abs
GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration
Ziwen Li
|
Xiang Chen
|
Youngseung Jeon
Drug discovery (DD) has tremendously contributed to maintaining and improving public health. Hypothesizing that inhibiting protein misfolding can slow disease progression, researchers focus on target identification (Target ID) to find protein structures for drug binding. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug discovery, integrating models into cohesive workflows remains challenging. We conducted a user study with drug discovery researchers to identify the applicability of LLMs and RAGs in Target ID. We identified two main findings: 1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on an initial protein and protein candidates that have a therapeutic impact; 2) the model must provide the PPI and relevant explanations for better understanding. Based on these observations, we identified three limitations on previous approaches for Target ID: 1) semantic ambiguity, 2) lack of explainability, and 3) short retrieval units. To address these issues, we propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve agent pipeline RAG framework to support large-scale PPI signaling pathway exploration in understanding therapeutic impacts by decomposing the analysis of entire PPI pathways into sub-tasks focused on the analysis of PPI edges.
pdf
bib
abs
From Curiosity to Clarity : Exploring the Impact of Consecutive Why-Questions
Geonyeong Son
|
Jaeyoung Lee
|
Misuk Kim
Humans attempt to understand the real world by asking the fundamental question ”Why?” when faced with incomprehensible situations in everyday life. Such why-questions provide essential knowledge that can help in understanding these situations. In this study, we conducted an end-to-end process to verify the utility of consecutive why-questions, from constructing a large language model (LLM)-based dataset to performing quantitative evaluation and analysis. Firstly, we created a WHY-Chain dataset, consisting of answers generated by an LLM in response to chain-of-why-questions, including a validity check. We also incorporated objectives that effectively capture the ”consecutive” characteristic of the data. Using the WHY-Chain dataset and two types of self-supervised objectives, we trained the pre-trained model. As a result, the refined model demonstrated improved performance on downstream tasks that require commonsense reasoning. Additionally, we conducted various ablation studies to assess the impact of different factors, confirming the scalability of the proposed approach. Lastly, we confirmed the consistency of the logical information by reasoning chain analysis of the answers generated from consecutive why-questions.
pdf
bib
abs
CollabStory: Multi-LLM Collaborative Story Generation and Authorship Analysis
Saranya Venkatraman
|
Nafis Irtiza Tripto
|
Dongwon Lee
The rise of unifying frameworks that enable seamless interoperability of Large Language Models (LLMs) has made LLM-LLM collaboration for open-ended tasks a possibility. Despite this, there have not been efforts to explore such collaborative writing. We take the next step beyond human-LLM collaboration to explore this multi-LLM scenario by generating the first exclusively LLM-generated collaborative stories dataset called CollabStory. We focus on single-author to multi-author (up to 5 LLMs) scenarios, where multiple LLMs co-author stories. We generate over 32k stories using open-source instruction-tuned LLMs. Further, we take inspiration from the PAN tasks that have set the standard for human-human multi-author writing tasks and analysis. We extend their authorship-related tasks for multi-LLM settings and present baselines for LLM-LLM collaboration. We find that current baselines are not able to handle this emerging scenario. Thus, CollabStory is a resource that could help propel an understanding as well as the development of new techniques to discern the use of multiple LLMs. This is crucial to study in the context of writing tasks since LLM-LLM collaboration could potentially overwhelm ongoing challenges related to plagiarism detection, credit assignment, maintaining academic integrity in educational settings, and addressing copyright infringement concerns. We make our dataset and code available at https://github.com/saranya-venkatraman/CollabStory.
pdf
bib
abs
NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models
Pranshu Pandya
|
Vatsal Gupta
|
Agney S Talwarr
|
Tushar Kataria
|
Dan Roth
|
Vivek Gupta
Cognitive textual and visual reasoning tasks, including puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. Due to extensive training on vast amounts of human-curated data, large language models (LLMs) and vision language models (VLMs) excel in common-sense reasoning tasks, but still struggle with more complex reasoning that demands deeper cognitive understanding. We introduce NTSEBENCH, a new dataset designed to evaluate cognitive multimodal reasoning and problem-solving skills of large models. The dataset contains 2,728 multiple-choice questions, accompanied by a total of 4,642 images, spanning 26 categories. These questions are drawn from the nationwide NTSE examination in India and feature a mix of visual and textual general aptitude challenges, designed to assess intelligence and critical thinking skills beyond mere rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open-source and propriety models, we propose four distinct modeling strategies to handle different modalities—text and images—in the dataset instances.
pdf
bib
abs
KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents
Yuqi Zhu
|
Shuofei Qiao
|
Yixin Ou
|
Shumin Deng
|
Shiwei Lyu
|
Yue Shen
|
Lei Liang
|
Jinjie Gu
|
Huajun Chen
|
Ningyu Zhang
Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges, especially when interacting with environments through generating executable actions. This inadequacy primarily stems from the lack of built-in action knowledge in language agents, which fails to effectively guide the planning trajectories during task solving and results in planning hallucination. To address this issue, we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents. Experimental results on HotpotQA and ALFWorld based on various backbone models demonstrate that KnowAgent can achieve comparable or superior performance to existing baselines. Further analysis indicates the effectiveness of KnowAgent in terms of planning hallucinations mitigation.
pdf
bib
abs
SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models
Jahyun Koo
|
Yerin Hwang
|
Yongil Kim
|
Taegwan Kang
|
Hyunkyung Bae
|
Kyomin Jung
Despite the success of Large Language Models (LLMs), they still face challenges related to high inference costs and memory requirements. To address these issues, Knowledge Distillation (KD) has emerged as a popular method for model compression, with the use of student-generated outputs (SGOs) as training data being particularly notable for reducing the mismatch between training and inference. However, SGOs often produce noisy and biased sequences, which can lead to misguidance from the teacher model, especially in long sequences. To mitigate these challenges, we propose SWITCH (Studying With Teacher for Knowledge Distillation), a novel approach that strategically incorporates the teacher model during the student’s sequence generation. SWITCH identifies discrepancies between the token probabilities of the teacher and student models, allowing the teacher to intervene selectively, particularly in long sequences that are more prone to teacher misguidance. Extensive experimental results across three model families and five instruction-following datasets show that SWITCH surpasses traditional KD methods, particularly excelling in the generation of long sequential data.
pdf
bib
abs
Thought2Text: Text Generation from EEG Signal using Large Language Models (LLMs)
Abhijit Mishra
|
Shreya Shukla
|
Jose Torres
|
Jacek Gwizdka
|
Shounak Roychowdhury
Decoding and expressing brain activity in a comprehensible form is a challenging frontier in AI. This paper presents *Thought2Text*, which uses instruction-tuned Large Language Models (LLMs) fine-tuned with EEG data to achieve this goal. The approach involves three stages: (1) training an EEG encoder for visual feature extraction, (2) fine-tuning LLMs on image and text data, enabling multimodal description generation, and (3) further fine-tuning on EEG embeddings to generate text directly from EEG during inference. Experiments on a public EEG dataset collected for six subjects with image stimuli and text captions demonstrate the efficacy of multimodal LLMs (*LLaMA-v3*, *Mistral-v0.3*, *Qwen2.5*), validated using traditional language generation evaluation metrics, as well as *fluency* and *adequacy* measures. This approach marks a significant advancement towards portable, low-cost “thoughts-to-text” technology with potential applications in both neuroscience and natural language processing.
pdf
bib
abs
A Comprehensive Survey of Contemporary Arabic Sentiment Analysis: Methods, Challenges, and Future Directions
Zhiqiang Shi
|
Ruchit Agrawal
Sentiment Analysis, a popular subtask of Natural Language Processing, employs computational methods to extract sentiment, opinions, and other subjective aspects from linguistic data. Given its crucial role in understanding human sentiment, research in sentiment analysis has witnessed significant growth in the recent years. However, the majority of approaches are aimed at the English language, and research towards Arabic sentiment analysis remains relatively unexplored. This paper presents a comprehensive and contemporary survey of Arabic Sentiment Analysis, identifies the challenges and limitations of existing literature in this field and presents avenues for future research. We present a systematic review of Arabic sentiment analysis methods, focusing specifically on research utilizing deep learning. We then situate Arabic Sentiment Analysis within the broader context, highlighting research gaps in Arabic sentiment analysis as compared to general sentiment analysis. Finally, we outline the main challenges and promising future directions for research in Arabic sentiment analysis.
pdf
bib
abs
Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models
Shintaro Ozaki
|
Kazuki Hayashi
|
Yusuke Sakai
|
Hidetaka Kamigaito
|
Katsuhiko Hayashi
|
Taro Watanabe
As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow. However, pre-training of Vision Encoder and the integrated training of LLMs with Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can completely handle their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks that create datasets using machine translation have cultural differences and biases, remaining issues for use as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset that takes into account nuances and country-specific phrases was then used to evaluate the generation explanation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English compared to English. In addition, it was observed that LVLMs struggle to effectively manage the knowledge learned from English data.
pdf
bib
abs
Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis
Yiyi Chen
|
Qiongxiu Li
|
Russa Biswas
|
Johannes Bjerva
Language Confusion is a phenomenon where Large Language Models (LLMs) generate text that is neither in the desired language, nor in a contextually appropriate language. This phenomenon presents a critical challenge in text generation by LLMs, often appearing as erratic and unpredictable behavior. We hypothesize that there are linguistic regularities to this inherent vulnerability in LLMs and shed light on patterns of language confusion across LLMs. We introduce a novel metric, Language Confusion Entropy, designed to directly measure and quantify this confusion, based on language distributions informed by linguistic typology and lexical variation. Comprehensive comparisons with the Language Confusion Benchmark (Marchisio et al., 2024) confirm the effectiveness of our metric, revealing patterns of language confusion across LLMs. We further link language confusion to LLM security, and find patterns in the case of multilingual embedding inversion attacks. Our analysis demonstrates that linguistic typology offers theoretically grounded interpretation, and valuable insights into leveraging language similarities as a prior for LLM alignment and security.
pdf
bib
abs
Huatuo-26M, a Large-scale Chinese Medical QA Dataset
Xidong Wang
|
Jianquan Li
|
Shunian Chen
|
Yuxuan Zhu
|
Xiangbo Wu
|
Zhiyi Zhang
|
Xiaolong Xu
|
Junying Chen
|
Jie Fu
|
Xiang Wan
|
Anningzhe Gao
|
Benyou Wang
Large Language Models infuse newfound vigor into the advancement of the medical domain, yet the scarcity of data poses a significant bottleneck hindering community progress. In this paper, we release the largest ever medical Question Answering (QA) dataset with 26 Million QA pairs named Huatuo-26M. We benchmark many existing approaches in our dataset in terms of both retrieval and generation. We also experimentally show the benefit of the proposed dataset in many aspects: (i) it serves as a fine-tuning data for training medical Large Language Models (LLMs); (ii) it works as an external knowledge source for retrieval-augmented generation (RAG); (iii) it demonstrates transferability by enhancing zero-shot performance on other QA datasets; and (iv) it aids in training biomedical model as a pre-training corpus. Our empirical findings substantiate the dataset’s utility in these domains, thereby confirming its significance as a resource in the medical QA landscape.
pdf
bib
abs
SEP-MLDC: A Simple and Effective Paradigm for Multi-Label Document Classification
Han Liu
|
Shuqin Li
|
Xiaotong Zhang
|
Yuanyuan Wang
|
Feng Zhang
|
Hongyang Chen
|
Hong Yu
Multi-label document classification (MLDC) aims to allocate more than one label to each document and attracts increasing attention in many practical applications. However, previous studies have failed to pay sufficient attention to the lack of semantic information on labels and the long-tail problem prevalent in the datasets. Additionally, most existing methods focus on optimizing document features, overlooking the potential of high-quality label features to enhance classification performance. In this paper, we propose a simple and effective paradigm for MLDC. Regarding the problem of insufficient label information and imbalance in the sample size of categories, we utilize large language models (LLMs) to semantically expand the label content and generate pseudo-samples for the tail categories. To optimize the features of both documents and labels, we design the contrastive learning boosted feature optimization module facilitated by the similarity matrices. Finally, we construct a label-guided feature selection module to incorporate the optimized label features into the input features to provide richer semantic information for the classifier. Extensive experiments have demonstrated that our proposed method significantly outperforms state-of-the-art baselines.
pdf
bib
abs
Improving Pre-trained Language Models with Knowledge Enhancement and Filtering Framework
Qi Zhao
|
Qi Song
|
Tian Xie
|
Haiyue Zhang
|
Hongyu Yang
|
Xiangyang Li
Pre-trained language models (PLMs) are widely used in NLP but struggle with capturing entity knowledge. To address this, knowledge enhancement techniques have been proposed. However, existing methods rely heavily on external knowledge bases embedding and often introduce noisy entity representations. In this work, we propose a novel **K**nowledge **E**nhancement **F**iltering **F**ramework named KEFF, which contains both knowledge enhancement and knowledge enhancement filtering modules for PLM. We find that there are certain redundant bits in the embedding space of PLMs. Building on this insight, we implement knowledge-enhanced mapping of redundant bit values in entity span tokens. In order to solve the knowledge enhancement problem of existing methods that introduce noisy entity representation knowledge, we further propose a novel knowledge enhancement filter based on our knowledge enhancement method. Finally, experiments on four knowledge-driven NLP tasks show that our method effectively improves the ability of PLMs on downstream tasks. Compared to state-of-the-art approachs, our method achieves the highest F1-score and accuracy, while reducing the computational cost by 1.7-2.5x.
pdf
bib
abs
Using Review Combination and Pseudo-Tokens for Aspect Sentiment Quad Prediction
Jiazhou Chen
|
Xu Jia
|
RuiQiang Guo
Aspect Sentiment Quad Prediction (ASQP) aims to identify quadruples consisting of an aspect term, aspect category, opinion term, and sentiment polarity from a given sentence, which is the most representative and challenging task in aspect-based sentiment analysis. A major challenge arises when implicit sentiment is present, as existing models often confuse implicit and explicit sentiment, making it difficult to extract the quadruples effectively. To tackle this issue, we propose a framework that leverages distinct labeled features from diverse reviews and incorporates pseudo-token prompts to harness the semantic knowledge of pre-trained models, effectively capturing both implicit and explicit sentiment expressions. Our approach begins by categorizing reviews based on the presence of implicit sentiment elements. We then build new samples that combine those with implicit sentiment and those with explicit sentiment. Next, we employ prompts with pseudo-tokens to guide the model in distinguishing between implicit and explicit sentiment expressions. Extensive experimental results show that our proposed method enhances the model’s ability across four public datasets, averaging 1.99% F1 improvement, particularly in instances involving implicit sentiment. We release our code at https://github.com/chienarmor/absa-implicit.
pdf
bib
abs
DDGIP: Radiology Report Generation Through Disease Description Graph and Informed Prompting
Chentao Huang
|
Guangli Li
|
Xinjiong Zhou
|
Yafeng Ren
|
Hongbin Zhang
Automatic radiology report generation has attracted considerable attention with the rise of computer-aided diagnostic systems. Due to the inherent biases in medical imaging data, generating reports with precise clinical details is challenging yet crucial for accurate diagnosis. To this end, we design a disease description graph that encapsulates comprehensive and pertinent disease information. By aligning visual features with the graph, our model enhances the quality of the generated reports. Furthermore, we introduce a novel informed prompting method which increases the accuracy of short-gram predictions, acting as an implicit bag-of-words planning for surface realization. Notably, this informed prompt succeeds with a three-layer decoder, reducing the reliance on conventional prompting methods that require extensive model parameters. Extensive experiments on two widely-used datasets, IU-Xray and MIMIC-CXR, demonstrate that our method outperforms previous state-of-the-art models.
pdf
bib
abs
Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
Sukmin Cho
|
Sangjin Choi
|
Taeho Hwang
|
Jeongyeon Seo
|
Soyeong Jeong
|
Huije Lee
|
Hoyun Song
|
Jong C. Park
|
Youngjin Kwon
Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.
pdf
bib
abs
Improve Decoding Factuality by Token-wise Cross Layer Entropy of Large Language Models
Jialiang Wu
|
Yi Shen
|
Sijia Liu
|
Yi Tang
|
Sen Song
|
Xiaoyi Wang
|
Longjun Cai
Despite their impressive capacities, Large language models (LLMs) often struggle with the hallucination issue of generating inaccurate or fabricated content even when they possess correct knowledge. In this paper, we extend the exploration of the correlation between hidden-state prediction changes and output factuality into a deeper, token-wise level. Based on the insights , we propose cross-layer Entropy eNhanced Decoding (END), a decoding method that mitigates hallucinations without requiring extra training. END leverages inner probability changes across layers to individually quantify the factual knowledge required for each candidate token, and adjusts the final predicting distribution to prioritize tokens with higher factuality. Experiments on both hallucination and QA benchmarks demonstrate that END significantly enhances the truthfulness and informativeness of generation while maintaining robust QA accuracy. Moreover, our work provides a deeper perspective of understanding the correlations between inherent knowledge and output factuality.
pdf
bib
abs
TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement
Zhaopeng Feng
|
Yan Zhang
|
Hao Li
|
Bei Wu
|
Jiayu Liao
|
Wenqiang Liu
|
Jun Lang
|
Yang Feng
|
Jian Wu
|
Zuozhu Liu
Large Language Models (LLMs) have achieved impressive results in Machine Translation (MT). However, human evaluations reveal that LLM-generated translations still contain various errors. Notably, feeding the error information back into the LLMs can facilitate self-refinement, leading to enhanced translation quality. Motivated by these findings, we introduce TEaR (Translate, Estimate, and Refine), a systematic LLM-based self-refinement framework aimed at bootstrapping translation performance. Our key results show that: 1) TEaR framework enables LLMs to improve their translation quality relying solely on self-feedback, measured by both automatic metrics and Multidimensional Quality Metrics (MQM) scores; 2) TEaR autonomously selects improvements, ensuring a robust translation quality baseline while outperforming both internal refinement and external feedback methods. Error analysis and iterative refinement experiments show its ability to continuously reduce translation errors and enhance overall translation quality. Our code and data are publicly available at https://github.com/fzp0424/self_correct_mt.
pdf
bib
abs
Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety
Yiwei Wang
|
Muhao Chen
|
Nanyun Peng
|
Kai-Wei Chang
Previous research on jailbreak attacks has mainly focused on optimizing the adversarial snippet content injected into input prompts to expose LLM security vulnerabilities. A significant portion of this research focuses on developing more complex, less readable adversarial snippets that can achieve higher attack success rates. In contrast to this trend, our research investigates the impact of the adversarial snippet’s position on the effectiveness of jailbreak attacks. We find that placing a simple and readable adversarial snippet at the beginning of the output effectively exposes LLM safety vulnerabilities, leading to much higher attack success rates than the input suffix attack or prompt-based output jailbreaks. Precisely speaking, we discover that directly enforcing the user’s target embedded output prefix is an effective method to expose LLMs’ safety vulnerabilities.
pdf
bib
abs
ImaRA: An Imaginative Frame Augmented Method for Low-Resource Multimodal Metaphor Detection and Explanation
Yuan Tian
|
Minzheng Wang
|
Nan Xu
|
Wenji Mao
Multimodal metaphor detection is an important and challenging task in multimedia computing, which aims to distinguish between metaphorical and literal multimodal expressions. Existing studies mainly utilize typical multimodal computing approaches for detection, neglecting the unique cross-domain and cross-modality characteristics underlying multimodal metaphor understanding. According to Conceptual Metaphor Theory (CMT), the inconsistency between source and target domains and their attribute similarity are essential to infer the intricate meanings implied in metaphors. In practice, the scarcity of the annotated multimodal metaphorical contents in the real world brings additional difficulty to the detection task and further complicates the understanding of multimodal metaphors. To address the above challenges, in this paper, we propose a novel Imaginative FRame Augmented (ImaRA) method for low-resource multimodal metaphor detection and explanation inspired by CMT. Specifically, we first identify imaginative frame as an associative structure to stimulate the imaginative thinking of multimodal metaphor detection and understanding. We then construct a cross-modal imagination dataset rich in multimodal metaphors and corresponding imaginative frames, and retrieve an augmented instance from this imagination dataset using imaginative frames mined from the input. This augmented instance serves as the demonstration exemplar to boost the metaphor reasoning ability of the multimodal large language model (MLLM) in low-resource multimodal scenarios. Experiments on two publicly available datasets show that our method consistently achieves robust results compared to MLLM-based methods for both multimodal metaphor detection and explanation in low-resource scenarios and meanwhile surpasses existing multimodal metaphor detection methods with full training data.
pdf
bib
abs
XAMPLER: Learning to Retrieve Cross-Lingual In-Context Examples
Peiqin Lin
|
Andre Martins
|
Hinrich Schuetze
Recent studies indicate that leveraging off-the-shelf or fine-tuned retrievers, capable of retrieving relevant in-context examples tailored to the input query, enhances few-shot in-context learning of English. However, adapting these methods to other languages, especially low-resource ones, poses challenges due to the scarcity of cross-lingual retrievers and annotated data. Thus, we introduce XAMPLER: Cross-Lingual Example Retrieval, a method tailored to tackle the challenge of cross-lingual in-context learning using only annotated English data. XAMPLER first trains a retriever based on Glot500, a multilingual small language model, using positive and negative English examples constructed from the predictions of a multilingual large language model, i.e., MaLA500. Leveraging the cross-lingual capacity of the retriever, it can directly retrieve English examples as few-shot examples for in-context learning of target languages. Experiments on two multilingual text classification benchmarks, namely SIB200 with 176 languages and MasakhaNEWS with 16 languages, demonstrate that XAMPLER substantially improves the in-context learning performance across languages.
pdf
bib
abs
Evaluating Cultural and Social Awareness of LLM Web Agents
Haoyi Qiu
|
Alexander Fabbri
|
Divyansh Agarwal
|
Kung-Hsiang Huang
|
Sarah Tan
|
Nanyun Peng
|
Chien-Sheng Wu
As large language models (LLMs) expand into performing as agents for real-world applications beyond traditional NLP tasks, evaluating their robustness becomes increasingly important. However, existing benchmarks often overlook critical dimensions like cultural and social awareness. To address these, we introduce CASA, a benchmark designed to assess LLM agents’ sensitivity to cultural and social norms across two web-based tasks: online shopping and social discussion forums. Our approach evaluates LLM agents’ ability to detect and appropriately respond to norm-violating user queries and observations. Furthermore, we propose a comprehensive evaluation framework that measures awareness coverage, helpfulness in managing user queries, and the violation rate when facing misleading web content. Experiments show that current LLMs perform significantly better in non-agent than in web-based agent environments, with agents achieving less than 10% awareness coverage and over 40% violation rates. To improve performance, we explore two methods: prompting and fine-tuning, and find that combining both methods can offer complementary advantages – fine-tuning on culture-specific datasets significantly enhances the agents’ ability to generalize across different regions, while prompting boosts the agents’ ability to navigate complex tasks. These findings highlight the importance of constantly benchmarking LLM agents’ cultural and social awareness during the development cycle.
pdf
bib
abs
GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation
Runchuan Zhu
|
Xinke Jiang
|
Jiang Wu
|
Zhipeng Ma
|
Jiahe Song
|
Fengshuo Bai
|
Dahua Lin
|
Lijun Wu
|
Conghui He
Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse responses to questions beyond their knowledge, thereby reducing hallucinations and improving reliability. Effective RAIT must address two key challenges: firstly, effectively reject unknown questions to minimize hallucinations; secondly, avoid over-refusal to ensure questions that can be correctly answered are not rejected, thereby maintain the helpfulness of LLM outputs. In this paper, we address the two challenges by deriving insightful observations from the gradient-based perspective, and proposing the Gradient-driven Refusal Aware Instruction Tuning Framework GRAIT: (1) employs gradient-driven sample selection to effectively minimize hallucinations and (2) introduces an adaptive weighting mechanism during fine-tuning to reduce the risk of over-refusal, achieving the balance between accurate refusals and maintaining useful responses. Experimental evaluations on open-ended and multiple-choice question answering tasks demonstrate that GRAIT significantly outperforms existing RAIT methods in the overall performance. The source code and data will be available at https://github.com/opendatalab/GRAIT .
pdf
bib
abs
Entity Pair-guided Relation Summarization and Retrieval in LLMs for Document-level Relation Extraction
Fu Zhang
|
Hongsen Yu
|
Jingwei Cheng
|
Huangming Xu
Document-level relation extraction (DocRE) aims to extract relations between entities in a document. While previous research has primarily focused on traditional small models, recent studies have extended the scope to large language models (LLMs). Current LLM-based methods typically focus on filtering all potential relations (candidate relations) within a document at one time and then performing triplet fact extraction. However, most approaches for candidate relation filtering are based on the document level, which results in insufficient correlation between candidate relations and entity pairs. In addition, the data imbalance problem caused by a large amount of no-relation data (NA problem) is another important reason for the suboptimal performance of LLM-based methods. To address these issues, we propose an entity pair-guided relation summarization and retrieval model (EP-RSR) for DocRE, which introduces an innovative LLM-based document-level relation extraction paradigm, EPRF (Entity Pair-Relation-Fact), along with an entity pair-level candidate relation filtering method. Our approach first selects entity pairs that potentially contain relations and uses them to guide relation summarization and retrieval for extracting relation facts. This enhances the relevance between candidate relations and entity pairs while alleviating the issue of imbalanced NA data. Benchmark testing on three datasets demonstrates that our approach achieves state-of-the-art (SOTA) performance for LLM-based models. Our code is available at https://github.com/LookingYu/EP-RSR.
pdf
bib
abs
A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models
Peiqin Lin
|
Andre Martins
|
Hinrich Schuetze
Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus with just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.
pdf
bib
abs
Omni-Chart-600K: A Comprehensive Dataset of Chart Types for Chart Understanding
Shulei Wang
|
Shuai Yang
|
Wang Lin
|
Zirun Guo
|
Sihang Cai
|
Hai Huang
|
Ye Wang
|
Jingyuan Chen
|
Tao Jin
To address the deficiencies in chart types and the limited scope of chart tasks in existing datasets, we conducted a comprehensive review of current data collection methodologies. By integrating manual annotation with data generation leveraging GPT-4, we developed a dataset that includes 21 diverse chart types and a broad spectrum of tasks, such as data retrieval and mathematical reasoning. Our analysis of existing models revealed that capabilities in information extraction, mathematical reasoning, and understanding of multiple chart types are essential for performing a variety of chart tasks. To overcome the limitations in these areas, we devised a two-stage training strategy and a method for jointly training the vision encoder tailored for multi-type charts. In the first stage, we designed several tasks to enhance the model’s general understanding of charts, aligning multimodal large models pre-trained on natural images to chart tasks. To further improve the model’s capability to understand various chart tasks and enhance its reasoning abilities, we employed Chain-of-Thought data for training in the second stage. Through two-stage training on our proposed dataset, the pre-trained multimodal large language model achieved state-of-the-art performance across multiple chart understanding tasks, demonstrating the superiority of our data and methods.
pdf
bib
abs
Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection
Yassine El Kheir
|
Younes Samih
|
Suraj Maharjan
|
Tim Polzehl
|
Sebastian Möller
This paper conducts a comprehensive layer-wise analysis of self-supervised learning (SSL) models for audio deepfake detection across diverse contexts, including multilingual datasets (English, Chinese, Spanish), partial, song, and scene-based deepfake scenarios. By systematically evaluating the contributions of different transformer layers, we uncover critical insights into model behavior and performance. Our findings reveal that lower layers consistently provide the most discriminative features, while higher layers capture less relevant information. Notably, all models achieve competitive equal error rate (EER) scores even when employing a reduced number of layers. This indicates that we can reduce computational costs and increase the inference speed of detecting deepfakes by utilizing only a few lower layers. This work enhances our understanding of SSL models in deepfake detection, offering valuable insights applicable across varied linguistic and contextual settings. Our models and code are publicly available at https://github.com/Yaselley/SSL_Layerwise_Deepfake.
pdf
bib
abs
Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax
Iuliia Zaitova
|
Vitalii Hirak
|
Badr M. Abdullah
|
Dietrich Klakow
|
Bernd Möbius
|
Tania Avgustinova
This study analyzes the attention patterns of fine-tuned encoder-only models based on the BERT architecture (BERT-based models) towards two distinct types of Multiword Expressions (MWEs): idioms and microsyntactic units (MSUs). Idioms present challenges in semantic non-compositionality, whereas MSUs demonstrate unconventional syntactic behavior that does not conform to standard grammatical categorizations. We aim to understand whether fine-tuning BERT-based models on specific tasks influences their attention to MWEs, and how this attention differs between semantic and syntactic tasks. We examine attention scores to MWEs in both pre-trained and fine-tuned BERT-based models. We utilize monolingual models and datasets in six Indo-European languages — English, German, Dutch, Polish, Russian, and Ukrainian. Our results show that fine-tuning significantly influences how models allocate attention to MWEs. Specifically, models fine-tuned on semantic tasks tend to distribute attention to idiomatic expressions more evenly across layers. Models fine-tuned on syntactic tasks show an increase in attention to MSUs in the lower layers, corresponding with syntactic processing requirements.
pdf
bib
abs
Perception Compressor: A Training-Free Prompt Compression Framework in Long Context Scenarios
Jiwei Tang
|
Jin Xu
|
Tingwei Lu
|
Zhicheng Zhang
|
YimingZhao YimingZhao
|
LinHai LinHai
|
Hai-Tao Zheng
Large language models (LLMs) demonstrate exceptional capabilities in various scenarios. However, they suffer from much redundant information and are sensitive to the position of key information in long context scenarios. To address these challenges, we present Perception Compressor, a training-free prompt compression framework. It includes a perception retriever that leverages guiding questions and instruction to retrieve the most relevant demonstrations, a dual-slope ratio allocator to dynamically allocate compression ratios and open-book ratios, and a semi-guided iterative compression that retains key information at the token level while removing tokens that distract the LLM. We conduct extensive experiments on long context benchmarks, i.e., NaturalQuestions, LongBench, and MuSiQue. Experiment results show that Perception Compressor outperforms existing methods by a large margin, achieving state-of-the-art performance.
pdf
bib
abs
MojoBench: Language Modeling and Benchmarks for Mojo
Nishat Raihan
|
Joanna C. S. Santos
|
Marcos Zampieri
The recently introduced Mojo programming language (PL) by Modular, has received significant attention in the scientific community due to its claimed significant speed boost over Python. Despite advancements in code Large Language Models (LLMs) across various PLs, Mojo remains unexplored in this context. To address this gap, we introduce MojoBench, the first framework for Mojo code generation. MojoBench includes HumanEval-Mojo, a benchmark dataset designed for evaluating code LLMs on Mojo, and Mojo-Coder, the first LLM pretrained and finetuned for Mojo code generation, which supports instructions in 5 natural languages (NLs). Our results show that Mojo-Coder achieves a 30-35% performance improvement over leading models like GPT-4o and Claude-3.5-Sonnet. Furthermore, we provide insights into LLM behavior with underrepresented and unseen PLs, offering potential strategies for enhancing model adaptability. MojoBench contributes to our understanding of LLM capabilities and limitations in emerging programming paradigms fostering more robust code generation systems.
pdf
bib
abs
VLind-Bench: Measuring Language Priors in Large Vision-Language Models
Kang-il Lee
|
Minbeom Kim
|
Seunghyun Yoon
|
Minsung Kim
|
Dongryeol Lee
|
Hyukhun Koh
|
Kyomin Jung
Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors but also involves a series of tests to evaluate more basic capabilities such as commonsense knowledge, visual perception, and commonsense biases. For each instance in our benchmark, we ensure that all these basic tests are passed before evaluating the language priors, thereby minimizing the influence of other factors on the assessment. The evaluation and analysis of recent LVLMs in our benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge in the field.
pdf
bib
abs
GRAG: Graph Retrieval-Augmented Generation
Yuntong Hu
|
Zhihan Lei
|
Zheng Zhang
|
Bo Pan
|
Chen Ling
|
Liang Zhao
Naive Retrieval-Augmented Generation (RAG) focuses on individual documents during retrieval and, as a result, falls short in handling networked documents which are very popular in many applications such as citation graphs, social media, and knowledge graphs. To overcome this limitation, we introduce Graph Retrieval-Augmented Generation (GRAG), which tackles the fundamental challenges in retrieving textual subgraphs and integrating the joint textual and topological information into Large Language Models (LLMs) to enhance its generation. To enable efficient textual subgraph retrieval, we propose a novel divide-and-conquer strategy that retrieves the optimal subgraph structure in linear time. To achieve graph context-aware generation, incorporate textual graphs into LLMs through two complementary views—the text view and the graph view—enabling LLMs to more effectively comprehend and utilize the graph context. Extensive experiments on graph reasoning benchmarks demonstrate that in scenarios requiring multi-hop reasoning on textual graphs, our GRAG approach significantly outperforms current state-of-the-art RAG methods. Our datasets as well as codes of GRAG are available at https://github.com/HuieL/GRAG.
pdf
bib
abs
Sequence-level Large Language Model Training with Contrastive Preference Optimization
Zhili Feng
|
Dhananjay Ram
|
Cole Hawkins
|
Aditya Rawal
|
Jinman Zhao
|
Sheng Zha
The next token prediction loss is the dominant self-supervised training objective for large language models and has achieved promising results in a variety of downstream tasks. However, upon closer investigation of this objective, we find that it lacks an understanding of sequence-level signals, leading to a mismatch between training and inference processes. To bridge this gap, we introduce a contrastive preference optimization (CPO) procedure that can inject sequence-level information into the language model at any training stage without expensive human labeled data. Our experiments show that the proposed objective surpasses the next token prediction in terms of win rate in the instruction-following and text generation tasks.
pdf
bib
abs
Scaling Up Membership Inference: When and How Attacks Succeed on Large Language Models
Haritz Puerto
|
Martin Gubri
|
Sangdoo Yun
|
Seong Joon Oh
Membership inference attacks (MIA) attempt to verify the membership of a given data sample in the training set for a model. MIA has become relevant in recent years, following the rapid development of large language models (LLM). Many are concerned about the usage of copyrighted materials for training them and call for methods for detecting such usage. However, recent research has largely concluded that current MIA methods do not work on LLMs. Even when they seem to work, it is usually because of the ill-designed experimental setup where other shortcut features enable “cheating.” In this work, we argue that MIA still works on LLMs, but only when multiple documents are presented for testing. We construct new benchmarks that measure the MIA performances at a continuous scale of data samples, from sentences (n-grams) to a collection of documents (multiple chunks of tokens). To validate the efficacy of current MIA approaches at greater scales, we adapt a recent work on Dataset Inference (DI) for the task of binary membership detection that aggregates paragraph-level MIA features to enable document- and dataset-level MIA. This baseline achieves the first successful MIA on pre-trained and fine-tuned LLMs.
pdf
bib
abs
Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding
Kyungmin Min
|
Minbeom Kim
|
Kang-il Lee
|
Dongryeol Lee
|
Kyomin Jung
Large Vision-Language Models (LVLMs) demonstrate impressive capabilities in generating detailed and coherent responses from visual inputs.However, they are prone to generate hallucinations due to an over-reliance on language priors. To address this issue, we investigate the language priors in LVLMs and make two key observations: (1) Even when predicting the tokens associated with image-related part-of-speech (POS), models increasingly rely on linguistic priors as the token sequences grow, thereby amplifying hallucinations. (2) Methods that directly calibrate LVLM’s output distribution to mitigate language priors can lead to a degradation in text quality or even exacerbate hallucinations.Based on these findings, we propose a novel method, Summary-Guided Decoding (SumGD). This method naturally encourages the model to focus more on image information by reducing the text context through summaries, while controlling only the image-related POS tokens to maintain text quality.Through experiments, we demonstrate that SumGD achieves state-of-the-art performance on object hallucination benchmarks. Furthermore, in terms of the trade-off between precision and recall, SumGD achieves Pareto optimality among the existing methods.Lastly, we observe that although existing methods struggle to balance the reduction of object hallucinations with maintaining text quality, SumGD demonstrates robustness in handling this challenge.
pdf
bib
abs
Exploring Hybrid Sampling Inference for Aspect-based Sentiment Analysis
Xiaoyi Bao
|
Minjie Qiang
|
Jinghang Gu
|
Zhongqing Wang
|
Chu-Ren Huang
As the training of large language models (LLMs) will encounter high computational costs, massive works are now focusing on inference. Their methods can be generally summarised as re-sampling the target multiple times and performing a vote upon the outputs. Despite bringing significant performance improvements, it is a high-cost method that requires multiple sampling with the preset size. In this paper, we propose a simple yet efficient inference strategies named __Hybrid Sampling__ that combining both multiple and single sampling to greatly reduce the cost of multiple sampling without sacrificing performance. __Hybrid Sampling__ could dynamically choose the essential part of generated sequence for multiple sampling and proceed the rest with single sampling, achieving a performance-cost balance. Extensive experiments in several benchmarks underscore the robustness and effectiveness of our proposed Hybrid Sampling and more importantly, it is much faster.
pdf
bib
FeRG-LLM : Feature Engineering by Reason Generation Large Language Models
Jeonghyun Ko
|
Gyeongyun Park
|
Donghoon Lee
|
Kyunam Lee
pdf
bib
abs
Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs
Abdellah El Mekki
|
Muhammad Abdul-Mageed
Large Language Models (LLMs) have demonstrated impressive performance on a wide range of natural language processing (NLP) tasks, primarily through in-context learning (ICL). In ICL, the LLM is provided with examples that represent a given task such that it learns to generate answers for test inputs. However, access to these in-context examples is not guaranteed especially for low-resource or massively multilingual tasks. In this work, we propose an unsupervised approach to mine in-context examples for machine translation (MT), enabling unsupervised MT (UMT) across different languages. Our approach begins with word-level mining to acquire word translations that are then used to perform sentence-level mining. As the quality of mined parallel pairs may not be optimal due to noise or mistakes, we introduce a filtering criterion to select the optimal in-context examples from a pool of unsupervised parallel sentences. We evaluate our approach using two multilingual LLMs on 288 directions from the FLORES-200 dataset (CITATION) and analyze the impact of various linguistic features on performance. Our findings demonstrate the effectiveness of our unsupervised approach in mining in-context examples for MT, leading to better or comparable translation performance as translation with regular in-context samples (extracted from human-annotated data), while also outperforming the other state-of-the-art UMT methods by an average of 7 BLEU points.
pdf
bib
abs
GPT-NER: Named Entity Recognition via Large Language Models
Shuhe Wang
|
Xiaofei Sun
|
Xiaoya Li
|
Rongbin Ouyang
|
Fei Wu
|
Tianwei Zhang
|
Jiwei Li
|
Guoyin Wang
|
Chen Guo
Despite the fact that large-scale Language Models (LLM) have achieved SOTA performances on a variety of NLP tasks, its performance on NER is still significantly below supervised baselines. This is due to the gap between the two tasks the NER and LLMs: the former is a sequence labeling task in nature while the latter is a text-generation model.In this paper, we propose GPT-NER to resolve this issue. GPT-NER bridges the gap by transforming the sequence labeling task to a generation task that can be easily adapted by LLMs e.g., the task of finding location entities in the input text “Columbus is a city” is transformed to generate the text sequence "@@Columbus## is a city”, where special tokens @@## marks the entity to extract. To efficiently address the hallucination issue of LLMs, where LLMs have a strong inclination to over-confidently label NULL inputs as entities, we propose a self-verification strategy by prompting LLMs to ask itself whether the extracted entities belong to a labeled entity tag.We conduct experiments on five widely adopted NER datasets, and GPT-NER achieves comparable performances to fully supervised baselines, which is the first time as far as we are concerned. More importantly, we find that GPT-NER exhibits a greater ability in the low-resource and few-shot setups, when the amount of training data is extremely scarce, GPT-NER performs significantly better than supervised models. This demonstrates the capabilities of GPT-NER in real-world NER applications where the number of labeled examples is limited.
pdf
bib
abs
QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models
Changhai Zhou
|
Yuhua Zhou
|
Yibin Wang
|
Shijie Han
|
Qian Qiao
|
Hongguang Li
The rise of large language models (LLMs) has significantly advanced various natural language processing (NLP) tasks. However, the resource demands of these models pose substantial challenges. Structured pruning is an effective approach to reducing model size, but it often results in significant accuracy degradation, necessitating parameter updates to adapt. Unfortunately, such fine-tuning requires substantial memory, which limits its applicability. To address these challenges, we introduce quantization into the structured pruning framework to reduce memory consumption during both fine-tuning and inference. However, the combined errors from pruning and quantization increase the difficulty of fine-tuning, requiring a more refined quantization scheme. To this end, we propose QPruner, a novel framework that employs structured pruning to reduce model size, followed by a layer-wise mixed-precision quantization scheme. Quantization precisions are assigned to each layer based on their importance to the target task, and Bayesian optimization is employed to refine precision allocation strategies, ensuring a balance between model accuracy and memory efficiency. Extensive experiments on benchmark datasets demonstrate that QPruner significantly outperforms existing methods in memory savings while maintaining or improving model performance.
pdf
bib
abs
MES-RAG: Bringing Multi-modal, Entity-Storage, and Secure Enhancements to RAG
Pingyu Wu
|
Daiheng Gao
|
Jing Tang
|
Huimin Chen
|
Wenbo Zhou
|
Weiming Zhang
|
Nenghai Yu
Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by using external knowledge, but it struggles with precise entity information retrieval. Our proposed **MES-RAG** framework enhances entity-specific query handling and provides accurate, secure, and consistent responses. MES-RAG introduces proactive security measures that ensure system integrity by applying protections prior to data access. Additionally, the system supports real-time multi-modal outputs, including text, images, audio, and video, seamlessly integrating into existing RAG architectures. Experimental results demonstrate that MES-RAG significantly improves both accuracy and recall, highlighting its effectiveness in advancing the security and utility of question-answering, increasing accuracy to **0.83 (+0.25)** on targeted task. Our code and data are available at https://github.com/wpydcr/MES-RAG.
pdf
bib
abs
LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models
Yizheng Sun
|
Yanze Xin
|
Hao Li
|
Jingyuan Sun
|
Chenghua Lin
|
Riza Batista-Navarro
Multi-modal Large Language Models (MLLMs) have achieved remarkable success by integrating visual and textual modalities. However, they incur significant computational overhead due to the large number of vision tokens processed, limiting their practicality in resource-constrained environments. We introduce Language-Guided Vision Token Pruning (LVPruning) for MLLMs, an effective yet simple method that significantly reduces the computational burden while preserving model performance. LVPruning employs cross-attention modules to compute the importance of vision tokens based on their interaction with language tokens, determining which to prune. Importantly, LVPruning can be integrated without modifying the original MLLM parameters, which makes LVPruning simple to apply or remove. Our experiments show that LVPruning can effectively reduce up to 90% of vision tokens by the middle layer of LLaVA-1.5, resulting in a 62.1% decrease in inference Tera Floating-Point Operations Per Second (TFLOPs), with an average performance loss of just 0.45% across nine multi-modal benchmarks.
pdf
bib
abs
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Sergey Pletenev
|
Maria Marina
|
Daniil Moskovskiy
|
Vasily Konovalov
|
Pavel Braslavski
|
Alexander Panchenko
|
Mikhail Salnikov
The performance of Large Language Models (LLMs) on many tasks is greatly limited by the knowledge learned during pre-training and stored in the model’s parameters. Low-rank adaptation (LoRA) is a popular and efficient training technique for updating or domain-specific adaptation of LLMs. In this study, we investigate how new facts can be incorporated into the LLM using LoRA without compromising the previously learned knowledge. We fine-tuned Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our experiments have shown that the best results are obtained when the training data contains a mixture of known and new facts. However, this approach is still potentially harmful because the model’s performance on external question-answering benchmarks declines after such fine-tuning. When the training data is biased towards certain entities, the model tends to regress to few overrepresented answers. In addition, we found that the model becomes more confident and refuses to provide an answer in only few cases. These findings highlight the potential pitfalls of LoRA-based LLM updates and underscore the importance of training data composition and tuning parameters to balance new knowledge integration and general model capabilities.
pdf
bib
abs
TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning
Xinyuan Lu
|
Liangming Pan
|
Yubo Ma
|
Preslav Nakov
|
Min-Yen Kan
Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering and table-based fact verification. To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table–tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-sourced LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios. Both code and data are openly available at https://github.com/XinyuanLu00/TART.
pdf
bib
abs
Enhancing Text-to-SQL with Question Classification and Multi-Agent Collaboration
Zhihui Shao
|
Shubin Cai
|
Rongsheng Lin
|
Zhong Ming
Large Language Models (LLMs) have recently demonstrated remarkable performance in Text-to-SQL tasks. However, existing research primarily focuses on the optimization of prompts and improvements in workflow, with few studies delving into the exploration of the questions. In this paper, we propose a Text-to-SQL framework based on question classification and multi-agent collaboration (QCMA-SQL). Specifically, we first employ multiple cross-attention mechanisms to train a schema selector to classify questions and select the most suitable database schema. Subsequently, we employ the appropriate agents based on the varying difficulty levels of the questions to generate preliminary SQL queries. Moreover, we implement syntax validation and execution optimization steps to generate final SQL queries. Experimental results on the Spider dataset show that the QCMA-SQL framework achieves an execution accuracy of 87.4%, outperforming state-of-the-art methods. Through ablation studies, we find that classifying the questions ultimately leads to a 2.8% increase in execution accuracy.
pdf
bib
abs
Efficient Nearest Neighbor based Uncertainty Estimation for Natural Language Processing Tasks
Wataru Hashimoto
|
Hidetaka Kamigaito
|
Taro Watanabe
Trustworthiness in model predictions is crucial for safety-critical applications in the real world. However, deep neural networks often suffer from the issues of uncertainty estimation, such as miscalibration. In this study, we propose k-Nearest Neighbor Uncertainty Estimation (kNN-UE), which is a new uncertainty estimation method that uses not only the distances from the neighbors, but also the ratio of labels in the neighbors. Experiments on sentiment analysis, natural language inference, and named entity recognition show that our proposed method outperforms the baselines and recent density-based methods in several calibration and uncertainty metrics. Moreover, our analyses indicate that approximate nearest neighbor search techniques reduce the inference overhead without significantly degrading the uncertainty estimation performance when they are appropriately combined.
pdf
bib
abs
BitAbuse: A Dataset of Visually Perturbed Texts for Defending Phishing Attacks
Hanyong Lee
|
Chaelyn Lee
|
Yongjae Lee
|
Jaesung Lee
Phishing often targets victims through visually perturbed texts to bypass security systems. The noise contained in these texts functions as an adversarial attack, designed to deceive language models and hinder their ability to accurately interpret the content. However, since it is difficult to obtain sufficient phishing cases, previous studies have used synthetic datasets that do not contain real-world cases. In this study, we propose the BitAbuse dataset, which includes real-world phishing cases, to address the limitations of previous research. Our dataset comprises a total of 325,580 visually perturbed texts. The dataset inputs are drawn from the raw corpus, consisting of visually perturbed sentences and sentences generated through an artificial perturbation process. Each input sentence is labeled with its corresponding ground truth, representing the restored, non-perturbed version. Language models trained on our proposed dataset demonstrated significantly better performance compared to previous methods, achieving an accuracy of approximately 96%. Our analysis revealed a significant gap between real-world and synthetic examples, underscoring the value of our dataset for building reliable pre-trained models for restoration tasks. We release the BitAbuse dataset, which includes real-world phishing cases annotated with visual perturbations, to support future research in adversarial attack defense.
pdf
bib
abs
Unfolding the Headline: Iterative Self-Questioning for News Retrieval and Timeline Summarization
Weiqi Wu
|
Shen Huang
|
Yong Jiang
|
Pengjun Xie
|
Fei Huang
|
Hai Zhao
In the fast-changing realm of information, the capacity to construct coherent timelines from extensive event-related content has become increasingly significant and challenging. The complexity arises in aggregating related documents to build a meaningful event graph around a central topic. This paper proposes CHRONOS - Causal Headline Retrieval for Open-domain News Timeline SummarizatiOn via Iterative Self-Questioning, which offers a fresh perspective on the integration of Large Language Models (LLMs) to tackle the task of Timeline Summarization (TLS). By iteratively reflecting on how events are linked and posing new questions regarding a specific news topic to gather information online or from an offline knowledge base, LLMs produce and refresh chronological summaries based on documents retrieved in each round. Furthermore, we curate Open-TLS, a novel dataset of timelines on recent news topics authored by professional journalists to evaluate open-domain TLS where information overload makes it impossible to find comprehensive relevant documents from the web. Our experiments indicate that CHRONOS is not only adept at open-domain timeline summarization but also rivals the performance of existing state-of-the-art systems designed for closed-domain applications, where a related news corpus is provided for summarization.
pdf
bib
abs
RetrieverGuard: Empowering Information Retrieval to Combat LLM-Generated Misinformation
Chuwen Chen
|
Shuai Zhang
Large language models (LLMs) have demonstrated impressive capabilities in generating human-like text and have been shown to store factual knowledge within their extensive parameters. However, models like ChatGPT can still actively or passively generate false or misleading information, increasing the challenge of distinguishing between human-created and machine-generated content. This poses significant risks to the authenticity and reliability of digital communication. This work aims to enhance retrieval models’ ability to identify the authenticity of texts generated by large language models, with the goal of improving the truthfulness of retrieved texts and reducing the harm of false information in the era of large models. Our contributions include: (1) we construct a diverse dataset of authentic human-authored texts and highly deceptive AI-generated texts from various domains; (2) we propose a self-supervised training method, RetrieverGuard, that enables the model to capture textual rules and styles of false information from the corpus without human-labelled data, achieving higher accuracy and robustness in identifying misleading and highly deceptive AI-generated content.
pdf
bib
abs
Unified Automated Essay Scoring and Grammatical Error Correction
SeungWoo Song
|
Junghun Yuk
|
ChangSu Choi
|
HanGyeol Yoo
|
HyeonSeok Lim
|
KyungTae Lim
|
Jungyeul Park
This study explores the integration of automated writing evaluation (AWE) and grammatical error correction (GEC) through multitask learning, demonstrating how combining these distinct tasks can enhance performance in both areas. By leveraging a shared learning framework, we show that models trained jointly on AWE and GEC outperform those trained on each task individually. To support this effort, we introduce a dataset specifically designed for multitask learning using AWE and GEC. Our experiments reveal significant synergies between tasks, leading to improvements in both writing assessment accuracy and error correction precision. This research represents a novel approach for optimizing language learning tools by unifying writing evaluation and correction tasks, offering insights into the potential of multitask learning in educational applications.
pdf
bib
abs
A Closer Look into Mixture-of-Experts in Large Language Models
Ka Man Lo
|
Zeyu Huang
|
Zihan Qiu
|
Zili Wang
|
Jie Fu
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of four popular MoE-based models and reveal some intriguing observations, including 1) Neurons act like fine-grained experts; 2) The router of MoE usually selects experts with larger output norms; 3) The expert diversity increases as the layer increases, while the last layer is an outlier, which is further validated by an initial experiment. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.
pdf
bib
abs
CDB: A Unified Framework for Hope Speech Detection Through Counterfactual, Desire and Belief
Tulio Ferreira Leite Da Silva
|
Gonzalo Freijedo Aduna
|
Farah Benamara
|
Alda Mari
|
Zongmin Li
|
Li Yue
|
Jian Su
Computational modeling of user-generated desires on social media can significantly aid decision-makers across various fields. Initially explored through wish speech,this task has evolved into a nuanced examination of hope speech. To enhance understanding and detection, we propose a novel scheme rooted in formal semantics approaches to modality, capturing both future-oriented hopes through desires and beliefs and the counterfactuality of past unfulfilled wishes and regrets. We manually re-annotated existing hope speech datasets and built a new one which constitutes a new benchmark in the field. We also explore the capabilities of LLMs in automatically detecting hope speech, relying on several prompting strategies. To the best of our knowledge, this is the first attempt towards a language-driven decomposition of the notional category hope and its automatic detection in a unified setting.
pdf
bib
abs
How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models
Jiyue Jiang
|
Pengan Chen
|
Liheng Chen
|
Sheng Wang
|
Qinghang Bao
|
Lingpeng Kong
|
Yu Li
|
Chuan Wu
The rapid evolution of large language models (LLMs) has transformed the competitive landscape in natural language processing (NLP), particularly for English and other data-rich languages. However, underrepresented languages like Cantonese, spoken by over 85 million people, face significant development gaps, which is particularly concerning given the economic significance of the Guangdong-Hong Kong-Macau Greater Bay Area, and in substantial Cantonese-speaking populations in places like Singapore and North America. Despite its wide use, Cantonese has scant representation in NLP research, especially compared to other languages from similarly developed regions. To bridge these gaps, we outline current Cantonese NLP methods and introduce new benchmarks designed to evaluate LLM performance in factual generation, mathematical logic, complex reasoning, and general knowledge in Cantonese, which aim to advance open-source Cantonese LLM technology. We also propose future research directions and recommended models to enhance Cantonese LLM development.
pdf
bib
abs
Improving Reward Models with Synthetic Critiques
Zihuiwen Ye
|
Fraser David Greenlee
|
Max Bartolo
|
Phil Blunsom
|
Jon Ander Campos
|
Matthias Gallé
Reward models (RMs) play a critical role in aligning language models through the process of reinforcement learning from human feedback. RMs are trained to predict a score reflecting human preference, which requires significant time and cost for human annotation. Additionally, RMs tend to quickly overfit on superficial features in the training set, hindering their generalization performance on unseen distributions. We propose a novel approach using synthetic natural language critiques generated by large language models to provide additional feedback, evaluating aspects such as instruction following, correctness, and style. This offers richer signals and more robust features for RMs to assess and score on. We demonstrate that high-quality critiques improve the performance and data efficiency of RMs initialized from different pretrained models, reducing the reliance on costly human annotations. Furthermore, incorporating critiques improves both the interpretability and robustness of RM training.
pdf
bib
abs
Rethinking Smoothness for Fast and Adaptable Entity Alignment Decoding
Yuanyi Wang
|
Han Li
|
Haifeng Sun
|
Lei Zhang
|
Bo He
|
Wei Tang
|
Tianhao Yan
|
Qi Qi
|
Jingyu Wang
Entity alignment (EA) is crucial for integrating multi-source knowledge graphs (KGs), aiming to identify equivalent entities across different graphs. However, most existing EA decoding methods rely on both entity and relation embeddings, limiting their generalizability and efficiency, especially in GNN-based models. To address these challenges, we propose Triple Feature Propagation (TFP), an adaptable and fast EA decoding framework that only utilizes entity embeddings. TFP reconstructs KG representation by maximizing the smoothness of entity embeddings. The discretized smoothness-maximization process yields the explicit Euler solution of TFP. We also generalize multi-view matrices: entity-to-entity, entity-to-relation, relation-to-entity, and relation-to-triple, to capture structural diversity. Extensive experiments on public datasets demonstrate that TFP is fast and adaptable to various encoders, achieving comparable results to state-of-the-art methods in under 6 seconds, and surpassing them in many cases.
pdf
bib
abs
Lost in the Distance: Large Language Models Struggle to Capture Long-Distance Relational Knowledge
Meiyun Wang
|
Takeshi Kojima
|
Yusuke Iwasawa
|
Yutaka Matsuo
Large language models (LLMs) have demonstrated impressive capabilities in handling long contexts, but challenges remain in capturing relational knowledge spread far apart within text. Connecting long-distance knowledge is important for solving tasks as the context length increases: imagine reading a lengthy detective novel where seemingly trivial information introduced early on often becomes essential during the climactic reveal of the culprit. In this study, we expose the ”Lost in the Distance” phenomenon, where LLM performance of capturing the relational knowledge degrades significantly when the relational knowledge is separated by noise, i.e., unrelated sentences to solve a task. Specifically, we design an experiment in which we insert artificial noise between two related elements and observe model performance as the distance between them increases. Our findings show that while LLMs can handle edge noise with little impact, their ability to reason about distant relationships declines sharply as the intervening noise grows. These findings are consistent in both forward-looking prediction and backward-looking prediction settings. We validate this across various models (GPT-4, Gemini-1.5-pro, GPT-4o-mini, Gemini-1.5-flash, Claude-3.5-Sonnet) and tasks (causal reasoning and knowledge extraction). These results reveal a significant limitation in how LLMs process relational knowledge over long contexts. We release our code and data to support further research.
pdf
bib
abs
FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking
Jabez Magomere
|
Elena Kochkina
|
Samuel Mensah
|
Simerjot Kaur
|
Charese Smiley
We introduce FinNLI, a benchmark dataset for Financial Natural Language Inference (FinNLI) across diverse financial texts like SEC Filings, Annual Reports, and Earnings Call transcripts. Our dataset framework ensures diverse premise-hypothesis pairs while minimizing spurious correlations. FinNLI comprises 21,304 pairs, including a high-quality test set of 3,304 instances annotated by finance experts. Evaluations show that domain shift significantly degrades general-domain NLI performance. The highest Macro F1 scores for pre-trained (PLMs) and large language models (LLMs) baselines are 74.57% and 78.62%, respectively, highlighting the dataset’s difficulty. Surprisingly, instruction-tuned financial LLMs perform poorly, suggesting limited generalizability. FinNLI exposes weaknesses in current LLMs for financial reasoning, indicating room for improvement.
pdf
bib
abs
Music for All: Representational Bias and Cross-Cultural Adaptability of Music Generation Models
Atharva Mehta
|
Shivam Chauhan
|
Amirbek Djanibekov
|
Atharva Kulkarni
|
Gus Xia
|
Monojit Choudhury
The advent of Music-Language Models has greatly enhanced the automatic music generation capability of AI systems, but they are also limited in their coverage of the musical genres and cultures of the world. We present a study of the datasets and research papers for music generation and quantify the bias and under-representation of genres. We find that only 5.7% of the total hours of existing music datasets come from non-Western genres, which naturally leads to disparate performance of the models across genres.We then investigate the efficacy of Parameter-Efficient Fine-Tuning (PEFT) techniques in mitigating this bias. Our experiments with two popular models – MusicGen and Mustango, for two underrepresented non-Western music traditions – Hindustani Classical and Turkish Makam music, highlight the promises as well as the non-triviality of cross-genre adaptation of music through small datasets, implying the need for more equitable baseline music-language models that are designed for cross-cultural transfer learning.
pdf
bib
abs
SFMSS: Service Flow aware Medical Scenario Simulation for Conversational Data Generation
Zhijie Bao
|
Qingyun Liu
|
Xuanjing Huang
|
Zhongyu Wei
Medical-specific Large Language Models (LLMs) have demonstrated impressive performance on medical-related exams and tasks. Despite their success in single-turn question and answering, instruction-tuned LLMs often falter in real-world healthcare applications, highlighting a disconnect between existing instruction datasets and practical contexts. To address this issue, we propose Service Flow aware Medical Scenario Simulation (SFMSS), a simulation framework designed for medical conversational data generation. SFMSS employs three key strategies to ensure the quality of the data generation. the use of Authentic Seed Data ensures alignment of real-world distributions. Diverse Patient Simulation enables simulated patients to exhibit distinct communication styles and complex behavioral logic. Service Flow Control ensures that conversations progress in alignment with medical objectives. We construct a dataset targeting on outpatient reception through SFMSS, named SFMSS-CD. Building on this dataset, we develop a model called SFMSS-Nurse. We conduct both automatic and human evaluations, involving 15 users and 15 clinical experts, to assess the effectiveness of SFMSS. The results demonstrate that SFMSS-Nurse outperforms all baselines, including the current state-of-the-art model GPT-4o, and aligns with human preferences and clinical demands.
pdf
bib
abs
Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
Mingqi Gao
|
Yixin Liu
|
Xinyu Hu
|
Xiaojun Wan
|
Jonathan Bragg
|
Arman Cohan
Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human evaluations, an automatic LLM bencher (i.e., an automatic evaluation framework that aims to rank LLMs based on their alignment with human preferences) is indispensable. An automatic LLM bencher consists of four components: the input set (e.g., a user instruction), the evaluation model (e.g., an LLM), the evaluation type (e.g., pairwise comparison), and the aggregation method (e.g., the ELO rating system). However, previous work has not thoroughly explored how to select these components or how their different combinations influence the results. In this work, through controlled experiments, we provide a series of recommendations on how to choose each component to better automate the evaluation of LLMs. Furthermore, we discovered that when evaluating LLMs with similar performance, the performance of the automatic LLM bencher declines sharply, underscoring the limitations of current benchers and calling for future work. Lastly, we found that the evaluation models’ performance at the instance level (e.g., the accuracy of selecting the best output) does not always align with their effectiveness when used as a component of a bencher, highlighting the importance of dedicated system-level evaluation of benchers.
pdf
bib
abs
GuideQ: Framework for Guided Questioning for progressive informational collection and classification
Priya Mishra
|
Suraj Racha
|
Kaustubh Ponkshe
|
Adit Akarsh
|
Ganesh Ramakrishnan
The veracity of a factoid is largely independent of the language it is written in. However, language models are inconsistent in their ability to answer the same factual question across languages. This raises questions about how LLMs represent a given fact across languages. We explore multilingual factual knowledge through two aspects: the model’s ability to answer a query consistently across languages, and the ability to ”store” answers in a shared representation for several languages. We propose a methodology to measure the extent of representation sharing across languages by repurposing knowledge editing methods. We examine LLMs with various multilingual configurations using a new multilingual dataset. We reveal that high consistency does not necessarily imply shared representation, particularly for languages with different scripts. Moreover, we find that script similarity is a dominant factor in representation sharing. Finally, we observe that if LLMs could fully share knowledge across languages, their accuracy in their best-performing language could benefit an increase of up to 150% on average. These findings highlight the need for improved multilingual knowledge representation in LLMs and suggest a path for the development of more robust and consistent multilingual LLMs.
pdf
bib
abs
Richer Output for Richer Countries: Uncovering Geographical Disparities in Generated Stories and Travel Recommendations
Kirti Bhagat
|
Kinshuk Vasisht
|
Danish Pruthi
While a large body of work inspects language models for biases concerning gender, race, occupation and religion, biases of geographical nature are relatively less explored. Some recent studies benchmark the degree to which large language models encode geospatial knowledge. However, the impact of the encoded geographical knowledge (or lack thereof) on real-world applications has not been documented. In this work, we examine large language models for two common scenarios that require geographical knowledge: (a) travel recommendations and (b) geo-anchored story generation. Specifically, we study five popular language models, and across about 100K travel requests, and 200K story generations, we observe that travel recommendations corresponding to poorer countries are less unique with fewer location references, and stories from these regions more often convey emotions of hardship and sadness compared to those from wealthier nations.
pdf
bib
abs
Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks
Gagan Bhatia
|
El Moatez Billah Nagoudi
|
Abdellah El Mekki
|
Fakhraddin Alwajih
|
Muhammad Abdul-Mageed
In this paper, we introduce Swan, a family of embedding models centred around the Arabic language, addressing both small-scale and large-scale use cases. Swan includes two variants: Swan-Small, based on ARBERTv2, and Swan-Large, built on ArMistral, a pretrained Arabic large language model. To evaluate these models, we propose ArabicMTEB, a comprehensive benchmark suite that assesses cross-lingual, multi-dialectal, multi-domain, and multi-cultural Arabic text embedding performance, covering eight diverse tasks and spanning 94 datasets. Swan-Large achieves state-of-the-art results, outperforming Multilingual-E5-large in most Arabic tasks, while the Swan-Small consistently surpasses Multilingual-E5-base. Our extensive evaluations demonstrate that Swan models are dialectally and culturally aware, excelling across various Arabic domains while offering significant monetary efficiency. This work significantly advances the field of Arabic language modelling and provides valuable resources for future research and applications in Arabic natural language processing. Our models and benchmarks will be made publicly accessible for research.
pdf
bib
abs
TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data
Jipeng Zhang
|
Yaxuan Qin
|
Renjie Pi
|
Weizhong Zhang
|
Rui Pan
|
Tong Zhang
Instruction tuning has achieved unprecedented success in NLP, turning large language models into versatile chatbots. However, the increasing variety and volume of instruction datasets demand significant computational resources. To address this, it is essential to extract a small and highly informative subset (i.e., Coreset) that achieves comparable performance to the full dataset. Achieving this goal poses non-trivial challenges: 1) data selection requires accurate data representations that reflect the training samples’ quality, 2) considering the diverse nature of instruction datasets, and 3) ensuring the efficiency of the coreset selection algorithm for large models. To address these challenges, we propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS). Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection. Experimental results show that our algorithm, selecting only 5% of the data, surpasses other unsupervised methods and achieves performance close to that of the full dataset.
pdf
bib
abs
From Text to Emoji: How PEFT-Driven Personality Manipulation Unleashes the Emoji Potential in LLMs
Navya Jain
|
Zekun Wu
|
Cristian Enrique Munoz Villalobos
|
Airlie Hilliard
|
Xin Guan
|
Adriano Koshiyama
|
Emre Kazim
|
Philip Colin Treleaven
The manipulation of the personality traits of large language models (LLMs) has emerged as a key area of research. Methods like prompt-based In-Context Knowledge Editing (IKE) and gradient-based Model Editor Networks (MEND) have been explored but show irregularity and variability; IKE depends on the prompt, leading to variability and sensitivity, while MEND yields inconsistent and gibberish outputs. To address this, we employed Opinion QA Based Parameter-Efficient Fine-Tuning (PEFT), specifically Quantized Low-Rank Adaptation (QLoRA), to manipulate the Big Five personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. After PEFT, models such as Mistral-7B-Instruct and LLaMA-2-7B-chat showed a latent behaviour by generating emojis for certain traits, despite no emojis being present in the PEFT data. For instance, LLaMA-2-7B-chat generated emojis in 99.5% of extraversion-related test instances, while Mistral-7B-Instruct did so in 92.5% of openness-related test instances. ICL Explainability analysis indicated that the LLMs used emojis intentionally to express these traits. Mechanistic Interpretability analysis showed that this latent behaviour of LLMs could be traced to specific neurons that became activated or amplified after PEFT. This paper provides a number of novel contributions. First, introducing an Opinion QA dataset for PEFT-driven personality manipulation; second, developing metric models to benchmark LLM personality traits; third, demonstrating PEFT’s superiority over IKE in personality manipulation; and finally, analysing and validating emoji usage through explainability methods such as Mechanistic Interpretability and In-context learning Explainability methods.
pdf
bib
abs
Decoding Fatphobia: Examining Anti-Fat and Pro-Thin Bias in AI-Generated Images
Jane Warren
|
Gary M. Weiss
|
Fernando Martinez
|
Annika Guo
|
Yijun Zhao
Existing studies have shown that AI-generated images tend to reinforce social biases, including those related to race and gender. However, no studies have investigated weight bias, or fatphobia, in AI-generated images. This study utilizes DALL-E 3 to determine the extent to which anti-fat and pro-thin biases are present in AI-generated images, and examines stereotypical associations between moral character and body weight. Four-thousand images are generated using twenty pairs of positive and negative textual prompts. These images are then manually labeled with weight information and analyzed to determine the extent to which they reflect fatphobia. The findings and their impact are discussed and related to existing research on weight bias.
pdf
bib
abs
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Guoli Yin
|
Haoping Bai
|
Shuang Ma
|
Feng Nan
|
Yanchao Sun
|
Zhaoyang Xu
|
Shen Ma
|
Jiarui Lu
|
Xiang Kong
|
Aonan Zhang
|
Dian Ang Yap
|
Yizhe Zhang
|
Karsten Ahnert
|
Vik Kamath
|
Mathias Berglund
|
Dominic Walsh
|
Tobias Gindele
|
Juergen Wiest
|
Zhengfeng Lai
|
Xiaoming Simon Wang
|
Jiulong Shan
|
Meng Cao
|
Ruoming Pang
|
Zirui Wang
Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluate models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics, and covering five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 20 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance.
pdf
bib
abs
Improving Consistency in LLM Inference using Probabilistic Tokenization
Ashutosh Sathe
|
Divyanshu Aggarwal
|
Sunayana Sitaram
Prior research has demonstrated noticeable performance gains through the use of probabilistic tokenizations, an approach that involves employing multiple tokenizations of the same input string during the training phase of a language model. Despite these promising findings, modern large language models (LLMs) have yet to be trained using probabilistic tokenizations. Interestingly, while the tokenizers of these contemporary LLMs have the capability to generate multiple tokenizations, this property remains underutilized.In this work, we propose a novel method to leverage the multiple tokenization capabilities of modern LLM tokenizers, aiming to enhance the self-consistency of LLMs in reasoning tasks. Our experiments indicate that when utilizing probabilistic tokenizations, LLMs generate logically diverse reasoning paths, moving beyond mere surface-level linguistic diversity. We carefully study probabilistic tokenization and offer insights to explain the self consistency improvements it brings through extensive experimentation on 5 LLM families and 4 reasoning benchmarks.
pdf
bib
abs
WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response
Tianrong Zhang
|
Bochuan Cao
|
Yuanpu Cao
|
Lu Lin
|
Prasenjit Mitra
|
Jinghui Chen
The recent breakthrough in large language models (LLMs) such as ChatGPT has revolutionized every industry at an unprecedented pace. Alongside this progress also comes mounting concerns about LLMs’ susceptibility to jailbreaking attacks, which leads to the generation of harmful or unsafe content. While safety alignment measures have been implemented in LLMs to mitigate existing jailbreak attempts and force them to become increasingly complicated, it is still far from perfect. In this paper, we analyze the common pattern of the current safety alignment and show that it is possible to exploit such patterns for jailbreaking attacks by simultaneous obfuscation in queries and responses. Specifically, we propose WordGame attack, which replaces malicious words with word games to break down the adversarial intent of a query and encourage benign content regarding the games to precede the anticipated harmful content in the response, creating a context that is hardly covered by any corpus used for safety alignment. Extensive experiments demonstrate that WordGame attack can break the guardrails of the current leading proprietary and open-source LLMs, including the latest Claude 3, GPT 4, and Llama 3 models more effectively than existing attacks efficiently. The attack also remains powerful when external defenses are adopted. Further ablation studies on such simultaneous obfuscation in query and response provide evidence of the merits of the attack strategy beyond an individual attack.
pdf
bib
abs
Human and LLM-Based Resume Matching: An Observational Study
Swanand Vaishampayan
|
Hunter Leary
|
Yoseph Berhanu Alebachew
|
Louis Hickman
|
Brent A. Stevenor
|
Weston Beck
|
Chris Brown
Resume matching assesses the extent to which candidates qualify for jobs based on the content of resumes. This process increasingly uses natural language processing (NLP) techniques to automate parsing and rating tasks—saving time and effort. Large language models (LLMs) are increasingly used for this purpose—thus, we explore their capabilities for resume matching in an observational study. We compare zero-shot GPT-4 and human ratings for 736 resumes submitted to job openings from diverse fields using real-world evaluation criteria. We also study the effects of prompt engineering techniques on GPT-4 ratings and compare differences in GPT-4 and human ratings across racial and gender groups. Our results show: LLM scores correlate minorly with humans, suggesting they are not interchangeable; prompt engineering such as CoT improves the quality of LLM ratings; and LLM scores do not show larger group differences (i.e., bias) than humans. Our findings provide implications for LLM-based resume rating to promote more fair and NLP-based resume matching in a multicultural world.
pdf
bib
abs
A Practical Examination of AI-Generated Text Detectors for Large Language Models
Brian Tufts
|
Xuandong Zhao
|
Lei Li
The proliferation of large language models has raised growing concerns about their misuse, particularly in cases where AI-generated text is falsely attributed to human authors. Machine-generated content detectors claim to effectively identify such text under various conditions and from any language model. This paper critically evaluates these claims by assessing several popular detectors (RADAR, Wild, T5Sentinel, Fast-DetectGPT, PHD, LogRank, Binoculars) on a range of domains, datasets, and models that these detectors have not previously encountered. We employ various prompting strategies to simulate practical adversarial attacks, demonstrating that even moderate efforts can significantly evade detection. We emphasize the importance of the true positive rate at a specific false positive rate (TPR@FPR) metric and demonstrate that these detectors perform poorly in certain settings, with TPR@.01 as low as 0%. Our findings suggest that both trained and zero-shot detectors struggle to maintain high sensitivity while achieving a reasonable true positive rate.
pdf
bib
abs
Robust Bias Detection in MLMs and its Application to Human Trait Ratings
Ingroj Shrestha
|
Louis Tay
|
Padmini Srinivasan
There has been significant prior work using templates to study bias against demographic attributes in MLMs. However, these have limitations: they overlook random variability of templates and target concepts analyzed, assume equality amongst templates, and overlook bias quantification. Addressing these, we propose a systematic statistical approach to assess bias in MLMs, using mixed models to account for random effects, pseudo-perplexity weights for sentences derived from templates and quantify bias using statistical effect sizes. Replicating prior studies, we match on bias scores in magnitude and direction with small to medium effect sizes.Next, we explore the novel problem of gender bias in the context of *personality* and *character* traits, across seven MLMs (base and large). We find that MLMs vary; ALBERT is unbiased for binary gender but the most biased for non-binary *neo*, while RoBERTa-large is the most biased for binary gender but shows small to no bias for *neo*. There is some alignment of MLM bias and findings in psychology (human perspective) - in *agreeableness* with RoBERTa-large and *emotional stability* with BERT-large. There is general agreement for the remaining 3 personality dimensions: both sides observe at most small differences across gender. For character traits, human studies on gender bias are limited thus comparisons are not feasible.
pdf
bib
abs
How Inclusively do LMs Perceive Social and Moral Norms?
Michael Galarnyk
|
Agam Shah
|
Dipanwita Guhathakurta
|
Poojitha Nandigam
|
Sudheer Chava
**This paper discusses and contains offensive content.** Language models (LMs) are used in decision-making systems and as interactive assistants. However, how well do these models making judgements align with the diversity of human values, particularly regarding social and moral norms? In this work, we investigate how inclusively LMs perceive norms across demographic groups (e.g., gender, age, and income). We prompt 11 LMs on rules-of-thumb (RoTs) and compare their outputs with the existing responses of 100 human annotators. We introduce the Absolute Distance Alignment Metric (ADA-Met) to quantify alignment on ordinal questions. We find notable disparities in LM responses, with younger, higher-income groups showing closer alignment, raising concerns about the representation of marginalized perspectives. Our findings highlight the importance of further efforts to make LMs more inclusive of diverse human values. The code and prompts are available on GitHub under the CC BY-NC 4.0 license.
pdf
bib
abs
Jailbreaking with Universal Multi-Prompts
Yu-Ling Hsu
|
Hsuan Su
|
Shang-Tse Chen
Large language models (LLMs) have seen rapid development in recent years, revolutionizing various applications and significantly enhancing convenience and productivity. However, alongside their impressive capabilities, ethical concerns and new types of attacks, such as jailbreaking, have emerged. While most prompting techniques focus on optimizing adversarial inputs for individual cases, resulting in higher computational costs when dealing with large datasets. Less research has addressed the more general setting of training a universal attacker that can transfer to unseen tasks. In this paper, we introduce JUMP, a prompt-based method designed to jailbreak LLMs using universal multi-prompts. We also adapt our approach for defense, which we term DUMP. Experimental results demonstrate that our method for optimizing universal multi-prompts outperforms existing techniques.
pdf
bib
abs
Echoes of Discord: Forecasting Hater Reactions to Counterspeech
Xiaoying Song
|
Sharon Lisseth Perez
|
Xinchen Yu
|
Eduardo Blanco
|
Lingzi Hong
Hate speech (HS) erodes the inclusiveness of online users and propagates negativity and division. Counterspeech has been recognized as a way to mitigate the harmful consequences. While some research has investigated the impact of user-generated counterspeech on social media platforms, few have examined and modeled haters’ reactions toward counterspeech, despite the immediate alteration of haters’ attitudes being an important aspect of counterspeech. This study fills the gap by analyzing the impact of counterspeech from the hater’s perspective, focusing on whether the counterspeech leads the hater to reenter the conversation and if the reentry is hateful. We compile the Reddit Echoes of Hate dataset (ReEco), which consists of triple-turn conversations featuring haters’ reactions, to assess the impact of counterspeech. To predict haters’ behaviors, we employ two strategies: a two-stage reaction predictor and a three-way classifier. The linguistic analysis sheds insights on the language of counterspeech to hate eliciting different haters’ reactions. Experimental results demonstrate that the 3-way classification model outperforms the two-stage reaction predictor, which first predicts reentry and then determines the reentry type. We conclude the study with an assessment showing the most common errors identified by the best-performing model.
pdf
bib
abs
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
Athiya Deviyani
|
Fernando Diaz
Meta-evaluation of automatic evaluation metrics—assessing evaluation metrics themselves—is crucial for accurately benchmarking natural language processing systems and has implications for scientific inquiry, production model development, and policy enforcement. While existing approaches to metric meta-evaluation focus on general statements about the absolute and relative quality of metrics across arbitrary system outputs, in practice, metrics are applied in highly contextual settings, often measuring the performance for a highly constrained set of system outputs. For example, we may only be interested in evaluating a specific model or class of models. We introduce a method for contextual metric meta-evaluation by comparing the local metric accuracy of evaluation metrics. Across translation, speech recognition, and ranking tasks, we demonstrate that the local metric accuracies vary both in absolute value and relative effectiveness as we shift across evaluation contexts. This observed variation highlights the importance of adopting context-specific metric evaluations over global ones.
pdf
bib
abs
Advocating Character Error Rate for Multilingual ASR Evaluation
Thennal D K
|
Jesin James
|
Deepa Padmini Gopinath
|
Muhammed Ashraf K
Automatic speech recognition (ASR) systems have traditionally been evaluated using English datasets, with the word error rate (WER) serving as the predominant metric. WER’s simplicity and ease of interpretation have contributed to its widespread adoption, particularly for English. However, as ASR systems expand to multilingual contexts, WER fails in various ways, particularly with morphologically complex languages or those without clear word boundaries. Our work documents the limitations of WER as an evaluation metric and advocates for the character error rate (CER) as the primary metric in multilingual ASR evaluation. We show that CER avoids many of the challenges WER faces and exhibits greater consistency across writing systems. We support our proposition by conducting human evaluations of ASR transcriptions in three languages—Malayalam, English, and Arabic—which exhibit distinct morphological characteristics. We show that CER correlates more closely with human judgments than WER, even for English. To facilitate further research, we release our human evaluation dataset for future benchmarking of ASR metrics. Our findings suggest that CER should be prioritized, or at least supplemented, in multilingual ASR evaluations to account for the varying linguistic characteristics of different languages.
pdf
bib
abs
Enhancing Temporal Understanding in LLMs for Semi-structured Tables
Irwin Deng
|
Kushagra Dixit
|
Dan Roth
|
Vivek Gupta
Temporal reasoning over tabular data presents substantial challenges for large language models (LLMs), as evidenced by recent research. In this study, we conduct a comprehensive analysis of temporal datasets to pinpoint the specific limitations of LLMs. Our investigation leads to enhancements in TempTabQA, a benchmark specifically designed for tabular temporal question answering. We provide critical insights for enhancing LLM performance in temporal reasoning tasks with tabular data. Furthermore, we introduce a novel approach, C.L.E.A.R to strengthen LLM capabilities in this domain. Our findings demonstrate that our method improves evidence-based reasoning across various models. Additionally, our experimental results reveal that indirect supervision with auxiliary unstructured data (TRAM) substantially boosts model performance in these tasks. This work contributes to a deeper understanding of LLMs’ temporal reasoning abilities over tabular data and promotes advancements in their application across diverse fields.
pdf
bib
abs
BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting
Mohammad Jahid Ibna Basher
|
Md Kowsher
|
Md Saiful Islam
|
Rabindra Nath Nandi
|
Nusrat Jahan Prottasha
|
Mehadi Hasan Menon
|
Tareq Al Muntasir
|
Shammur Absar Chowdhury
|
Firoj Alam
|
Niloofar Yousefi
|
Ozlem Garibay
This paper introduces BnTTS (Bangla Text-To-Speech), the first framework for Bangla speaker adaptation-based TTS, designed to bridge the gap in Bangla speech synthesis using minimal training data. Building upon the XTTS architecture, our approach integrates Bangla into a multilingual TTS pipeline, with modifications to account for the phonetic and linguistic characteristics of the language. We pretrain BnTTS on 3.85k hours of Bangla speech dataset with corresponding text labels and evaluate performance in both zero-shot and few-shot settings on our proposed test dataset. Empirical evaluations in few-shot settings show that BnTTS significantly improves the naturalness, intelligibility, and speaker fidelity of synthesized Bangla speech. Compared to state-of-the-art Bangla TTS systems, BnTTS exhibits superior performance in Subjective Mean Opinion Score (SMOS), Naturalness, and Clarity metrics.
pdf
bib
abs
Playing with Voices: Tabletop Role-Playing Game Recordings as a Diarization Challenge
Lian Remme
|
Kevin Tang
This paper provides a proof of concept that audio of tabletop role-playing games (TTRPG) could serve as a challenge for diarization systems. TTRPGs are carried out mostly by conversation. Participants often alter their voices to indicate that they are talking as a fictional character. Audio processing systems are susceptible to voice conversion with or without technological assistance. TTRPG present a conversational phenomenon in which voice conversion is an inherent characteristic for an immersive gaming experience. This could make it more challenging for diarizers to pick the real speaker and determine that impersonating is just that. We present the creation of a small TTRPG audio dataset and compare it against the AMI and the ICSI corpus. The performance of two diarizers, pyannote.audio and wespeaker, were evaluated. We observed that TTRPGs’ properties result in a higher confusion rate for both diarizers.Additionally, wespeaker strongly underestimates the number of speakers in the TTRPG audio files.We propose TTRPG audio as a promising challenge for diarization systems.
pdf
bib
abs
Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias
Yuen Chen
|
Vethavikashini Chithrra Raghuram
|
Justus Mattern
|
Rada Mihalcea
|
Zhijing Jin
Generated texts from large language models (LLMs) have been shown to exhibit a variety of harmful, human-like biases against various demographics. These findings motivate research efforts aiming to understand and measure such effects. This paper introduces a causal formulation for bias measurement in generative language models. Based on this theoretical foundation, we outline a list of desiderata for designing robust bias benchmarks. We then propose a benchmark called OccuGender, with a bias-measuring procedure to investigate occupational gender bias. We test several state-of-the-art open-source LLMs on OccuGender, including Llama, Mistral, and their instruction-tuned versions. The results show that these models exhibit substantial occupational gender bias. Lastly, we discuss prompting strategies for bias mitigation and an extension of our causal formulation to illustrate the generalizability of our framework.
pdf
bib
abs
OLMES: A Standard for Language Model Evaluations
Yuling Gu
|
Oyvind Tafjord
|
Bailey Kuehl
|
Dany Haddad
|
Jesse Dodge
|
Hannaneh Hajishirzi
Progress in AI is often demonstrated by new models claiming improved performance on tasks measuring model capabilities. Evaluating language models can be particularly challenging, as choices of how a model is evaluated on a task can lead to large changes in measured performance. There is no common standard setup, so different models are evaluated on the same tasks in different ways, leading to claims about which models perform best not being reproducible. We propose OLMES, a completely documented, practical, open standard for reproducible LLM evaluations. In developing this standard, we identify and review the varying factors in evaluation practices adopted by the community - such as details of prompt formatting, choice of in-context examples, probability normalizations, and task formulation. In particular, OLMES supports meaningful comparisons between smaller base models that require the unnatural “cloze” formulation of multiple-choice questions against larger models that can utilize the original formulation. OLMES includes well-considered, documented recommendations guided by results from existing literature as well as new experiments resolving open questions.
pdf
bib
abs
Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning
Joy Crosbie
|
Ekaterina Shutova
Large language models (LLMs) have shown a remarkable ability to learn and perform complex tasks through in-context learning (ICL). However, a comprehensive understanding of its internal mechanisms is still lacking. This paper explores the role of induction heads in a few-shot ICL setting. We analyse two state-of-the-art models, Llama-3-8B and InternLM2-20B on abstract pattern recognition and NLP tasks. Our results show that even a minimal ablation of induction heads leads to ICL performance decreases of up to ~32% for abstract pattern recognition tasks, bringing the performance close to random. For NLP tasks, this ablation substantially decreases the model’s ability to benefit from examples, bringing few-shot ICL performance close to that of zero-shot prompts. We further use attention knockout to disable specific induction patterns, and present fine-grained evidence for the role that the induction mechanism plays in ICL.
pdf
bib
abs
MoLA: MoE LoRA with Layer-wise Expert Allocation
Chongyang Gao
|
Kezhen Chen
|
Jinmeng Rao
|
Ruibo Liu
|
Baochen Sun
|
Yawen Zhang
|
Daiyi Peng
|
Xiaoyuan Guo
|
Vs Subrahmanian
Recent efforts to integrate low-rank adaptation (LoRA) with the Mixture-of-Experts (MoE) have managed to achieve performance comparable to full-parameter fine-tuning by tuning much fewer parameters. Despite promising results, research on improving the efficiency and expert analysis of LoRA with MoE is still in its early stages. Recent studies have shown that experts in the MoE architecture have different strengths and also exhibit some redundancy. Does this statement also apply to parameter-efficient MoE? In this paper, we introduce a novel parameter-efficient MoE method,
MoE-LoRA with Layer-wise Expert Allocation (MoLA) for Transformer-based models, where each model layer uses a varying number of LoRA experts. We investigate several architectures with varying layer-wise expert configurations. Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines on top of both LLAMA-2, Mistral, and Gemma. We find that allocating more LoRA experts to middle layers further enhances the effectiveness of models with a certain number of experts in total. The redundancy of the experts is more obvious in the lower layers. With much fewer parameters, this allocation strategy outperforms the setting with the same number of experts in every layer. This work can be widely used as a plug-and-play parameter-efficient tuning approach for various applications. The code has been made available at
https://github.com/GCYZSL/MoLA.
pdf
bib
CodeSim: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
Md. Ashraful Islam
|
Mohammed Eunus Ali
|
Md Rizwan Parvez
pdf
bib
abs
On the Feasibility of In-Context Probing for Data Attribution
Cathy Jiao
|
Weizhen Gao
|
Aditi Raghunathan
|
Chenyan Xiong
Data attribution methods are used to measure the contribution of training data towards model outputs, and have several important applications in areas such as dataset curation and model interpretability. However, many standard data attribution methods, such as influence functions, utilize model gradients and are computationally expensive. In our paper, we show in-context probing (ICP) – prompting a LLM – can serve as a fast proxy for gradient-based data attribution for data selection under conditions contingent on data similarity. We study this connection empirically on standard NLP tasks, and show that ICP and gradient-based data attribution are well-correlated in identifying influential training data for tasks that share similar task type and content as the training data. Additionally, fine-tuning models on influential data selected by both methods achieves comparable downstream performance, further emphasizing their similarities. We then examine the connection between ICP and gradient-based data attribution using synthetic data on linear regression tasks. Our synthetic data experiments show similar results with those from NLP tasks, suggesting that this connection can be isolated in simpler settings, which offers a pathway to bridging their differences.
pdf
bib
abs
Evaluation of Multilingual Image Captioning: How far can we get with CLIP models?
Goncalo Emanuel Cavaco Gomes
|
Chrysoula Zerva
|
Bruno Martins
The evaluation of image captions, looking at both linguistic fluency and semantic correspondence to visual contents, has witnessed a significant effort. Still, despite advancements such as the CLIPScore metric, multilingual captioning evaluation has remained relatively unexplored. This work presents several strategies, and extensive experiments, related to evaluating CLIPScore variants in multilingual settings. To address the lack of multilingual test data, we consider two different strategies: (1) using quality aware machine-translated datasets with human judgements, and (2) re-purposing multilingual datasets that target semantic inference and reasoning. Our results highlight the potential of finetuned multilingual models to generalize across languages and to handle complex linguistic challenges. Tests with machine-translated data show that multilingual CLIPScore models can maintain a high correlation with human judgements across different languages, and additional tests with natively multilingual and multicultural data further attest to the high-quality assessments.
pdf
bib
abs
Avoiding Copyright Infringement via Large Language Model Unlearning
Guangyao Dou
|
Zheyuan Liu
|
Qing Lyu
|
Kaize Ding
|
Eric Wong
Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities but also pose risks by learning and generating copyrighted material, leading to significant legal and ethical concerns. In real-world scenarios, model owners need to continuously address copyright infringement as new requests for content removal emerge at different time points. This leads to the need for sequential unlearning, where copyrighted content is removed sequentially as new requests arise. Despite its practical relevance, sequential unlearning in the context of copyright infringement has not been rigorously explored in existing literature. To address this gap, we propose Stable Sequential Unlearning (SSU), a novel framework designed to unlearn copyrighted content from LLMs over multiple time steps. Our approach works by identifying and removing specific weight updates in the model’s parameters that correspond to copyrighted content. We improve unlearning efficacy by introducing random labeling loss and ensuring the model retains its general-purpose knowledge by adjusting targeted parameters. Experimental results show that SSU achieves an effective trade-off between unlearning efficacy and general-purpose language abilities, outperforming existing baselines.
pdf
bib
abs
A Context-Aware Contrastive Learning Framework for Hateful Meme Detection and Segmentation
Xuanyu Su
|
Yansong Li
|
Diana Inkpen
|
Nathalie Japkowicz
Amidst the rise of Large Multimodal Models (LMMs) and their widespread application in generating and interpreting complex content, the risk of propagating biased and harmful memes remains significant. Current safety measures often fail to detect subtly integrated hateful content within “Confounder Memes”. To address this, we introduce HateSieve, a new framework designed to enhance the detection and segmentation of hateful elements in memes. HateSieve features a novel Contrastive Meme Generator that creates semantically correlated memes, a customized triplet dataset for contrastive learning, and an Image-Text Alignment module that produces context-aware embeddings for accurate meme segmentation. Empirical experiments show that HateSieve not only surpasses existing LMMs in performance with fewer trainable parameters but also offers a robust mechanism for precisely identifying and isolating hateful content. Caution: Contains academic discussions of hate speech; viewer discretion advised.
pdf
bib
abs
LLM-Generated Passphrases That Are Secure and Easy to Remember
Jie S. Li
|
Jonas Geiping
|
Micah Goldblum
|
Aniruddha Saha
|
Tom Goldstein
Automatically generated passwords and passphrases are a cornerstone of IT security. Yet, these passphrases are often hard to remember and see only limited adoption. In this work, we use large language models to generate passphrases with rigorous security guarantees via the computation of the entropy of the output as a metric of the security of the passphrase. We then present a range of practical methods to generate language model outputs with sufficient entropy: raising entropy through in-context examples and generation through a new top-q truncation method. We further verify the influence of prompt construction in steering the output topic and grammatical structure. Finally, we conduct user studies to determine the adoption rates for these LLM-generated passphrases in practice.
pdf
bib
abs
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
|
Ozlem Uzuner
|
Meliha Yetisgen
|
Fei Xia
Large language models (LLMs) have demonstrated great performance across various benchmarks, showing potential as general-purpose task solvers. However, as LLMs are typically trained on vast amounts of data, a significant concern in their evaluation is data contamination, where overlap between training data and evaluation datasets inflates performance assessments. Multiple approaches have been developed to identify data contamination. These approaches rely on specific assumptions that may not hold universally across different settings. To bridge this gap, we systematically review 50 papers on data contamination detection, categorize the underlying assumptions, and assess whether they have been rigorously validated. We identify and analyze eight categories of assumptions and test three of them as case studies. Our case studies focus on detecting direct, instance-level data contamination, which is also referred to as Membership Inference Attacks (MIA). Our analysis reveals that MIA approaches based on these three assumptions can have similar performance to random guessing, on datasets used in LLM pretraining, suggesting that current LLMs might learn data distributions rather than memorizing individual instances. Meanwhile, MIA can easily fail when there are data distribution shifts between the seen and unseen instances.
pdf
bib
abs
Representation-to-Creativity (R2C): Automated Holistic Scoring Model for Essay Creativity
Deokgi Kim
|
Joonyoung Jo
|
Byung-Won On
|
Ingyu Lee
Despite active research on Automated Essay Scoring (AES), there is a noticeable scarcity of studies focusing on predicting creativity scores for essays. In this study, we develop a new essay rubric specifically designed for assessing creativity in essays. Leveraging this rubric, we construct ground truth data consisting of 5,048 essays. Furthermore, we propose a novel self-supervised learning model that recognizes cluster patterns within the essay embedding space and leverages them for creativity scoring. This approach aims to automatically generate a high-quality training set, thereby facilitating the training of diverse language models. Our experimental findings indicated a substantial enhancement in the assessment of essay creativity, demonstrating an increase in F1-score up to 58% compared to the primary state-of-the-art models across the ASAP and AIHUB datasets.
pdf
bib
abs
From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization
Catarina G Belém
|
Pouya Pezeshkpour
|
Hayate Iso
|
Seiji Maekawa
|
Nikita Bhutani
|
Estevam Hruschka
Although many studies have investigated and reduced hallucinations in large language models (LLMs) for single-document tasks, research on hallucination in multi-document summarization (MDS) tasks remains largely unexplored. Specifically, it is unclear how the challenges arising from handling multiple documents (e.g., repetition and diversity of information) affect models outputs. In this work, we investigate how hallucinations manifest in LLMs when summarizing topic-specific information from a set of documents. Since no benchmarks exist for investigating hallucinations in MDS, we leverage existing news and conversation datasets, annotated with topic-specific insights, to create two novel multi-document benchmarks. When evaluating 5 LLMs on our benchmarks, we observe that on average, up to 75% of the content in LLM-generated summary is hallucinated, with hallucinations more likely to occur towards the end of the summaries. Moreover, when summarizing non-existent topic-related information, GPT-3.5-turbo and GPT-4o still generate summaries about 79.45% and 44% of the time, raising concerns about their tendency to fabricate content. To better understand the characteristics of these hallucinations, we conduct a human evaluation of 700+ insights and discover that most errors stem from either failing to follow instructions or producing overly generic insights. Motivated by these observations, we investigate the efficacy of simple post-hoc baselines in mitigating hallucinations but find them only moderately effective. Our results underscore the need for more effective approaches that systematically mitigate hallucinations in MDS.
pdf
bib
abs
Aligning to Constraints for Data-Efficient Language Model Customization
Fei Wang
|
Chao Shang
|
Shuai Wang
|
Sarthak Jain
|
Qiang Ning
|
Bonan Min
|
Vittorio Castelli
|
Yassine Benajiba
|
Dan Roth
General-purpose language models (LMs) are aligned to diverse user intents, but fall short when it comes to specific applications. While finetuning is the default method for customized alignment, human annotations are often unavailable in various customization scenarios. Based on the observation that one of the main issues of LM customization is constraint adherence, we investigate the feasibility of using constraints as a bridge from general LMs to customized ones. We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments, and propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints. Specifically, ACT uses constraint verifiers, which are typically easy to implement in practice, to compute constraint satisfaction rate (CSR) of each response. It samples multiple responses for each prompt and collect preference labels based on their CSR automatically. Subsequently, ACT adapts the LM to the target task through a ranking-based learning process. Experiments on fine-grained entity typing, abstractive summarization, and temporal question answering show that ACT is able to enhance LMs’ capability to adhere to different classes of constraints, thereby improving task performance comparable to or approaching that of finetuning with labeled data.
pdf
bib
abs
Where is this coming from? Making groundedness count in the evaluation of Document VQA models
Armineh Nourbakhsh
|
Siddharth Parekh
|
Pranav Shetty
|
Zhao Jin
|
Sameena Shah
|
Carolyn Rose
Document Visual Question Answering (VQA) models have evolved at an impressive rate over the past few years, coming close to or matching human performance on some benchmarks. We argue that common evaluation metrics used by popular benchmarks do not account for the semantic and multimodal groundedness of a model’s outputs. As a result, hallucinations and major semantic errors are treated the same way as well-grounded outputs, and the evaluation scores do not reflect the reasoning capabilities of the model. In response, we propose a new evaluation methodology that accounts for the groundedness of predictions with regard to the semantic characteristics of the output as well as the multimodal placement of the output within the input document. Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences. We validate our scoring methodology using human judgment and show its potential impact on existing popular leaderboards. Through extensive analyses, we demonstrate that our proposed method produces scores that are a better indicator of a model’s robustness and tends to give higher rewards to better-calibrated answers.
pdf
bib
abs
Transformer-based Causal Language Models Perform Clustering
Xinbo Wu
|
Lav R. Varshney
Even though large language models (LLMs) have demonstrated remarkable capability in solving various natural language tasks, the capability of an LLM to follow human instructions is still an area of active development. Recent works (Ouyang et al., 2022; Rafailov et al., 2023; Zhang et al., 2023) have shown great improvements in instruction-following capability through additional training for instruction-following tasks. However, the mechanisms responsible for effective instruction-following capabilities remain inadequately understood. Here, we introduce a simplified instruction-following task and use synthetic datasets to analyze a Transformer-based causal language model. Our findings suggest that the model learns task-specific information by clustering data within its hidden space, with this clustering process evolving dynamically during learning. We also demonstrate how this phenomenon assists the model in handling unseen instances, and validate our results in a more realistic setting. We further present applications in pre-training and alignment, inspired by clustering.
pdf
bib
abs
Towards Better Multi-task Learning: A Framework for Optimizing Dataset Combinations in Large Language Models
Zaifu Zhan
|
Rui Zhang
To efficiently select optimal dataset combinations for enhancing multi-task learning (MTL) performance in large language models, we proposed a novel framework that leverages a neural network to predict the best dataset combinations. The framework iteratively refines the selection, greatly improving efficiency, while being model-, dataset-, and domain-independent. Through experiments on 12 biomedical datasets across four tasks—named entity recognition, relation extraction, event extraction, and text classification—we demonstrate that our approach effectively identifies better combinations, even for tasks that may seem unpromising from a human perspective. This verifies that our framework provides a promising solution for maximizing MTL potential.
pdf
bib
abs
Gender Bias in Instruction-Guided Speech Synthesis Models
Chun-Yi Kuan
|
Hung-yi Lee
Recent advancements in controllable expressive speech synthesis, especially in text-to-speech (TTS) models, have allowed for the generation of speech with specific styles guided by textual descriptions, known as style prompts. While this development enhances the flexibility and naturalness of synthesized speech, there remains a significant gap in understanding how these models handle vague or abstract style prompts. This study investigates the potential gender bias in how models interpret occupation-related prompts, specifically examining their responses to instructions like “Act like a nurse”. We explore whether these models exhibit tendencies to amplify gender stereotypes when interpreting such prompts. Our experimental results reveal the model’s tendency to exhibit gender bias for certain occupations. Moreover, models of different sizes show varying degrees of this bias across these occupations.
pdf
bib
abs
ResoFilter: Fine-grained Synthetic Data Filtering for Large Language Models through Data-Parameter Resonance Analysis
Zeao Tu
|
Xiangdi Meng
|
Yu He
|
Zihan Yao
|
Tianyu Qi
|
Jun Liu
|
Ming Li
Large language models (LLMs) have shown remarkable effectiveness across various domains, with data augmentation methods utilizing GPT for synthetic data generation becoming prevalent. However, the quality and utility of augmented data remain questionable, and current methods lack clear metrics for evaluating data characteristics. To address these challenges, we propose ResoFilter, a novel method that integrates models, data, and tasks to refine datasets. ResoFilter leverages the fine-tuning process to obtain Data-Parameter features for data selection, offering improved interpretability by representing data characteristics through model weights. Our experiments demonstrate that ResoFilter achieves comparable results to full-scale fine-tuning using only half the data in mathematical tasks and exhibits strong generalization across different models and domains. This method provides valuable insights for constructing synthetic datasets and evaluating high-quality data, offering a promising solution for enhancing data augmentation techniques and improving training dataset quality for LLMs. For reproducibility, we will release our code and data upon acceptance.
pdf
bib
abs
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
Yuzhe Yang
|
Yifei Zhang
|
Yan Hu
|
Yilin Guo
|
Ruoli Gan
|
Yueru He
|
Mingcong Lei
|
Xiao Zhang
|
Haining Wang
|
Qianqian Xie
|
Jimin Huang
|
Honghai Yu
|
Benyou Wang
This paper introduces the UCFE: User-Centric Financial Expertise benchmark, an innovative framework designed to evaluate the ability of large language models (LLMs) to handle complex real-world financial tasks. UCFE benchmark adopts a hybrid approach that combines human expert evaluations with dynamic, task-specific interactions to simulate the complexities of evolving financial scenarios. Firstly, we conducted a user study involving 804 participants, collecting their feedback on financial tasks. Secondly, based on this feedback, we created our dataset that encompasses a wide range of user intents and interactions. This dataset serves as the foundation for benchmarking 11 LLMs services using the LLM-as-Judge methodology. Our results show a significant alignment between benchmark scores and human preferences, with a Pearson correlation coefficient of 0.78, confirming the effectiveness of the UCFE dataset and our evaluation approach. UCFE benchmark not only reveals the potential of LLMs in the financial domain but also provides a robust framework for assessing their performance and user satisfaction.
pdf
bib
abs
BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression
Yuankai Li
|
Jia-Chen Gu
|
Di Wu
|
Kai-Wei Chang
|
Nanyun Peng
Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. However, as the number of retrieved documents increases, the input length to LLMs grows linearly, causing a dramatic increase in latency and a degradation in long-context understanding. This is particularly serious for multi-hop questions that require a chain of reasoning across documents. To accelerate inference, reduce costs, and minimize distractions, this paper presents BRIEF (Bridging Retrieval and Inference through Evidence Fusion), a lightweight approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context RAG. To enable learning compression for multi-hop reasoning, we curate synthetic data by extracting atomic propositions that encapsulate distinct factoids from the source documents to compose synthetic summaries. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries and enables a range of LLMs to achieve exceptional open-domain question answering (QA) performance. For example, on HotpotQA, BRIEF improves the compression rate by 2 times compared to the state-of-the-art baseline, while outperforming it by 3.00% EM and 4.16% F1 with Flan-UL2 as the reader model. It also generates more concise summaries than proprietary GPT-3.5, while demonstrating nearly identical QA performance.
pdf
bib
abs
An Optimizable Suffix Is Worth A Thousand Templates: Efficient Black-box Jailbreaking without Affirmative Phrases via LLM as Optimizer
Weipeng Jiang
|
Zhenting Wang
|
Juan Zhai
|
Shiqing Ma
|
Zhengyu Zhao
|
Chao Shen
Despite prior safety alignment efforts, LLMs can still generate harmful and unethical content when subjected to jailbreaking attacks. Existing jailbreaking methods fall into two main categories: template-based and optimization-based methods. The former requires significant manual effort and domain knowledge, while the latter, exemplified by GCG, which seeks to maximize the likelihood of harmful LLM outputs through token-level optimization, also encounters several limitations: requiring white-box access, necessitating pre-constructed affirmative phrase, and suffering from low efficiency. This paper introduces ECLIPSE, a novel and efficient black-box jailbreaking method with optimizable suffixes. We employ task prompts to translate jailbreaking objectives into natural language instructions, guiding LLMs to generate adversarial suffixes for malicious queries. A harmfulness scorer provides continuous feedback, enabling LLM self-reflection and iterative optimization to autonomously produce effective suffixes. Experimental results demonstrate that ECLIPSE achieves an average attack success rate (ASR) of 0.92 across three open-source LLMs and GPT-3.5-Turbo, significantly outperforming GCG by 2.4 times. Moreover, ECLIPSE matches template-based methods in ASR while substantially reducing average attack overhead by 83%, offering superior attack efficiency.
pdf
bib
abs
Multi-Stage LLM Fine-Tuning with a Continual Learning Setting
Changhao Guan
|
Chao Huang
|
Hongliang Li
|
You Li
|
Ning Cheng
|
Zihe Liu
|
Yufeng Chen
|
Jinan Xu
|
Jian Liu
In recent years, large language models (LLMs) have made significant progress in knowledge-intensive applications. However, when adapting them to specific domains, we may encounter a multi-stage continuous learning scenario, especially in cases where domain knowledge evolves rapidly.This issue severely limits traditional fine-tuning approaches for LLMs.To overcome this limitation, we propose a new learning paradigm designed specifically for multi-stage continuous learning. This paradigm includes a preference-based learning bias to identify potential knowledge conflicts, as well as a self-distillation-based data augmentation strategy to expand and enrich the training corpus, thereby improving the integration of knowledge-compatible information.In the experiments, we show that our proposed method achieves a significant improvement in accuracy after 7 stages of fine-tuning compared to previous methods, while also demonstrating excellent performance in preserving general knowledge.We have released our code and dataset at Multi-Stage-Learning.
pdf
bib
abs
Constraining Sequential Model Editing with Editing Anchor Compression
Hao-Xiang Xu
|
Jun-Yu Ma
|
Zhen-Hua Ling
|
Ningyu Zhang
|
Jia-Chen Gu
Large language models (LLMs) struggle with hallucinations due to false or outdated knowledge. Given the high resource demands of retraining these models, there is an increasing focus on developing model editing. However, the general abilities of LLMs across downstream tasks are prone to significant degradation during sequential editing. This paper statistically observes that the parameter matrix after editing exhibits a significant deviation compared to its previous state as the number of edits increases. This serious deviation affects the original knowledge associations within LLMs and leads to the degradation of their general abilities. To this end, a framework termed Editing Anchor Compression (EAC) is proposed to constrain the deviation of the parameter matrix during sequential editing. It compresses the editing information by selecting editing anchors that are important in encoding new relations without deviating too much from the original matrix, thereby preserving the general abilities. Experiments of applying EAC to two popular editing methods on three LLMs across four tasks are conducted. Evaluation results show that EAC effectively minimizes unreasonable deviations caused by model editing, preserving over 70% of the general abilities while better retaining the editing knowledge compared to the original counterpart methods.
pdf
bib
abs
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Zayd Muhammad Kawakibi Zuhri
|
Muhammad Farid Adilazuarda
|
Ayu Purwarianti
|
Alham Fikri Aji
Auto-regressive inference of transformers benefit greatly from Key-Value (KV) caching, but can lead to major memory bottlenecks as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach extending KV sharing across transformer layers to reduce memory usage beyond what was possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, reducing KV cache size down to a factor of 6x compared to MQA. These results highlight MLKV’s potential for efficient deployment of transformer models at scale.
pdf
bib
abs
Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs
Michael JQ Zhang
|
Eunsol Choi
In this work, we explore the challenges of developing interactive assistants that resolve ambiguity by asking their users clarifying questions. Specifically, we develop a task-agnostic framework for evaluating a system’s ability to determine when to ask for clarification. Determining when to ask for clarification is a challenging task that requires systems to consider the demands of the individual user (i.e., how much they prioritize speed and usability versus carefulness) and the distribution of interpretations for a given request (i.e., whether an ambiguous request has one dominant, inferable interpretation). Using this framework, we evaluate systems for determining when to clarify across three NLP applications: QA, MT, and NLI. Finally, we introduce present a novel uncertainty estimation approach, IntentSim, that determines the utility of asking a clarifying question by estimating the entropy over user intents. Our method consistently outperforms existing uncertainty estimation approaches at identifying predictions that will benefit from clarification. Furthermore, we find that IntentSim is robust, demonstrating improvements across a wide range of NLP tasks and LMs. Together, our work lays foundation for further studies on clarifying interactions with LM assistants.
pdf
bib
abs
DOLFIN - Document-Level Financial Test-Set for Machine Translation
Mariam Nakhle
|
Marco Dinarelli
|
Raheel Qader
|
Emmanuelle Esperança-Rodier
|
Hervé Blanchon
Despite the strong research interest in document-level Machine Translation (MT), the test-sets dedicated to this task are still scarce. The existing test-sets mainly cover topics from the general domain and fall short on specialised domains, such as legal and financial. Also, despite their document-level aspect, they still follow a sentence-level logic that doesn’t allow for including certain linguistic phenomena such as information reorganisation. In this work, we aim to fill this gap by proposing a novel test-set : DOLFIN. The dataset is built from specialised financial documents and it makes a step towards true document-level MT by abandoning the paradigm of perfectly aligned sentences, presenting data in units of sections rather than sentences. The test-set consists of an average of 1950 aligned sections for five language pairs. We present the detailed data collection pipeline that can serve as inspiration for aligning new document-level datasets. We demonstrate the usefulness and the quality of this test-set with the evaluation of a series of models. Our results show that the test-set is able to discriminate between context-sensitive and context-agnostic models and shows the weaknesses when models fail to accurately translate financial texts. The test-set will be made public for the community.
pdf
bib
Are Large Language Models Effective in Clinical Trial Design? A Study on Baseline Feature Generation
Nafis Neehal
|
Bowen Wang
|
Shayom Debopadhaya
|
Corey Curran
|
Keerthiram Murugesan
|
Soham Dan
|
Vibha Anand
|
Kristin Bennett
pdf
bib
abs
Lightweight Contenders: Navigating Semi-Supervised Text Mining through Peer Collaboration and Self Transcendence
Qianren Mao
|
Weifeng Jiang
|
Junnan Liu
|
Chenghua Lin
|
Qian Li
|
Xianqing Wen
|
Jianxin Li
|
Jinhu Lu
The semi-supervised learning (SSL) strategy in lightweight models requires reducing annotated samples and facilitating cost-effective inference. However, the constraint on model parameters, imposed by the scarcity of training labels, limits the SSL performance. In this paper, we introduce PS-NET, a novel framework tailored for semi-supervised text mining with lightweight models. PS-NET incorporates online distillation to train lightweight student models by imitating the Teacher model. It also integrates an ensemble of student peers that collaboratively instruct each other. Additionally, PS-NET implements a constant adversarial perturbation schema to further self-augmentation by progressive generalizing. Our PS-NET, equipped with a 2-layer distilled BERT, exhibits notable performance enhancements over SOTA lightweight SSL frameworks of FLiText and Disco in SSL text classification with extremely rare labelled data.
pdf
bib
abs
Language-based Valence and Arousal Expressions between the United States and China: a Cross-Cultural Examination
Young Min Cho
|
Dandan Pang
|
Stuti Thapa
|
Garrick Sherman
|
Lyle Ungar
|
Louis Tay
|
Sharath Chandra Guntuku
While affective expressions on social media have been extensively studied, most research has focused on the Western context. This paper explores cultural differences in affective expressions by comparing valence and arousal on Twitter/X (geolocated to the US) and Sina Weibo (in Mainland China). Using the NRC-VAD lexicon to measure valence and arousal, we identify distinct patterns of emotional expression across both platforms. Our analysis reveals a functional representation between valence and arousal, showing a negative offset in contrast to traditional lab-based findings which suggest a positive offset. Furthermore, we uncover significant cross-cultural differences in arousal, with US users displaying higher emotional intensity than Chinese users, regardless of the valence of the content. Finally, we conduct a comprehensive language analysis correlating n-grams and LDA topics with affective dimensions to deepen our understanding of how language and culture shape emotional expression. These findings contribute to a more nuanced understanding of affective communication across cultural and linguistic contexts on social media.
pdf
bib
abs
Chain-of-Rank: Enhancing Large Language Models for Domain-Specific RAG in Edge Device
Juntae Lee
|
Jihwan Bang
|
Kyuhong Shim
|
Seunghan Yang
|
Simyung Chang
Retrieval-augmented generation (RAG) with large language models (LLMs) is especially valuable in specialized domains, where precision is critical. To more specialize the LLMs into a target domain, domain-specific RAG has recently been developed by allowing the LLM to access the target domain early via finetuning. The domain-specific RAG makes more sense in resource-constrained environments like edge devices, as they should perform a specific task (e.g. personalization) reliably using only small-scale LLMs. While the domain-specific RAG is well-aligned with edge devices in this respect, it often relies on widely-used reasoning techniques like chain-of-thought (CoT). The reasoning step is useful to understand the given external knowledge, and yet it is computationally expensive and difficult for small-scale LLMs to learn it. Tackling this, we propose the Chain of Rank (CoR) which shifts the focus from intricate lengthy reasoning to simple ranking of the reliability of input external documents. Then, CoR reduces computational complexity while maintaining high accuracy, making it particularly suited for resource-constrained environments. We attain the state-of-the-art (SOTA) results in benchmarks, and analyze its efficacy.
pdf
bib
abs
MALoRA: Mixture of Asymmetric Low-Rank Adaptation for Enhanced Multi-Task Learning
Xujia Wang
|
Haiyan Zhao
|
Shuo Wang
|
Hanqing Wang
|
Zhiyuan Liu
Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA have significantly improved the adaptation of LLMs to downstream tasksin a resource-efficient manner. However, in multi-task scenarios, challenges such as training imbalance and the seesaw effect frequently emerge. Mixture-of-LoRA (MoLoRA), which combines LoRA with sparse Mixture-of-Experts, mitigates some of these issues by promoting task-specific learning among experts. Despite this, MoLoRA remains inefficient in terms of training speed, parameter utilization, and overall multi-task performance. In this paper, we propose Mixture of Asymmetric Low-Rank Adaptaion (MALoRA), a flexible fine-tuning framework that leverages asymmetric optimization among LoRA experts. MALoRA reduces the number of trainable parameters by 30% to 48%, increases training speed by 1.2x, and matches the computational efficiency of single-task LoRA models. Additionally, MALoRA addresses overfitting issues commonly seen in high-rank configurations, enhancing performance stability. Extensive experiments across diverse multi-task learning scenarios demonstrate that MALoRA consistently outperforms all baseline methods in both inter-domain and intra-domain tasks.
pdf
bib
abs
LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content
Mohamed Bayan Kmainasi
|
Ali Ezzat Shahroor
|
Maram Hasanain
|
Sahinur Rahman Laskar
|
Naeemul Hassan
|
Firoj Alam
Large Language Models (LLMs) have demonstrated remarkable success as general-purpose task solvers across various fields. However, their capabilities remain limited when addressing domain-specific problems, particularly in downstream NLP tasks. Research has shown that models fine-tuned on instruction-based downstream NLP datasets outperform those that are not fine-tuned. While most efforts in this area have primarily focused on resource-rich languages like English and broad domains, little attention has been given to multilingual settings and specific domains. To address this gap, this study focuses on developing a specialized LLM, LlamaLens, for analyzing news and social media content in a multilingual context. To the best of our knowledge, this is the first attempt to tackle both domain specificity and multilinguality, with a particular focus on news and social media. Our experimental setup includes 18 tasks, represented by 52 datasets covering Arabic, English, and Hindi. We demonstrate that LlamaLens outperforms the current state-of-the-art (SOTA) on 23 testing sets, and achieves comparable performance on 8 sets. We make the models and resources publicly available for the research community (https://huggingface.co/QCRI).
pdf
bib
abs
LLMs are Biased Teachers: Evaluating LLM Bias in Personalized Education
Iain Weissburg
|
Sathvika Anand
|
Sharon Levy
|
Haewon Jeong
With the increasing adoption of large language models (LLMs) in education, concerns about inherent biases in these models have gained prominence. We evaluate LLMs for bias in the personalized educational setting, specifically focusing on the models’ roles as “teachers.” We reveal significant biases in how models generate and select educational content tailored to different demographic groups, including race, ethnicity, sex, gender, disability status, income, and national origin. We introduce and apply two bias score metrics—Mean Absolute Bias (MAB) and Maximum Difference Bias (MDB)—to analyze 9 open and closed state-of-the-art LLMs. Our experiments, which utilize over 17,000 educational explanations across multiple difficulty levels and topics, uncover that models potentially harm student learning by both perpetuating harmful stereotypes and reversing them. We find that bias is similar for all frontier models, with the highest MAB along income levels while MDB is highest relative to both income and disability status. For both metrics, we find the lowest bias exists for sex/gender and race/ethnicity.
pdf
bib
abs
Preserving Zero-shot Capability in Supervised Fine-tuning for Multi-label Text Classification
Si-An Chen
|
Hsuan-Tien Lin
|
Chih-Jen Lin
Zero-shot multi-label text classification (ZMTC) requires models to predict multiple labels for a document, including labels unseen during training. Previous work assumes that models leveraging label descriptions ensures zero-shot capability. However, we find that supervised methods, despite achieving strong overall performance, lose their zero-shot capability during training, revealing a trade-off between overall and zero-shot performance. To address the issue, we propose OF-DE and OF-LAN, which preserve the zero-shot capabilities of powerful dual encoder and label-wise attention network architectures by freezing the label encoder. Additionally, we introduce a self-supervised auxiliary loss to further improve zero-shot performance. Experiments demonstrate that our approach significantly improves zero-shot performance of supervised methods while maintaining strong overall accuracy.
pdf
bib
abs
Data-centric NLP Backdoor Defense from the Lens of Memorization
Zhenting Wang
|
Zhizhi Wang
|
Mingyu Jin
|
Mengnan Du
|
Juan Zhai
|
Shiqing Ma
Backdoor attack is a severe threat to the trustworthiness of DNN-based language models. In this paper, we first extend the definition of memorization of language models from sample-wise to more fine-grained sentence element-wise (e.g., word, phrase, structure, and style), and then point out that language model backdoors are a type of element-wise memorization. Through further analysis, we find that the strength of such memorization is positively correlated to the frequency of duplicated elements in the training dataset. In conclusion, duplicated sentence elements are necessary for successful backdoor attacks. Based on this, we propose a data-centric defense. We first detect trigger candidates in training data by finding memorizable elements, i.e., duplicated elements, and then confirm real triggers by testing if the candidates can activate backdoor behaviors (i.e., malicious elements). Results show that our method outperforms state-of-the-art defenses in defending against different types of NLP backdoors.
pdf
bib
abs
Neuro-Symbolic Integration Brings Causal and Reliable Reasoning Proofs
Sen Yang
|
Xin Li
|
Leyang Cui
|
Lidong Bing
|
Wai Lam
Two lines of approaches are adopted for complex reasoning with LLMs. One line of work prompts LLMs with various reasoning structures, while the structural outputs can be naturally regarded as intermediate reasoning steps. Another line of work adopt LLM-free declarative solvers to do the reasoning task, rendering higher reasoning accuracy but lacking interpretability due to the black-box nature of the solvers. Aiming to resolve the trade-off between answer accuracy and interpretability, we present a simple extension to the latter line of work. Specifically, we showcase that the intermediate search logs generated by Prolog interpreters can be accessed and interpreted into human-readable reasoning proofs. As long as LLMs correctly translate problem descriptions into Prolog representations, the corresponding reasoning proofs are ensured to be causal and reliable. On two logical reasoning and one arithmetic reasoning datasets, our framework obtains significant improvements in terms of both answer accuracy and reasoning proof accuracy. We released our code at
https://github.com/DAMO-NLP-SG/CaRing for future research regarding better reasoning proofs using LLMs.
pdf
bib
abs
Infogent: An Agent-Based Framework for Web Information Aggregation
Revanth Gangi Reddy
|
Sagnik Mukherjee
|
Jeonghwan Kim
|
Zhenhailong Wang
|
Dilek Hakkani-Tür
|
Heng Ji
Despite seemingly performant web agents on the task-completion benchmarks, most existing methods evaluate the agents based on a presupposition: the web navigation task consists of a linear sequence of actions with an end state that marks task completion. In contrast, our work focuses on web navigation for information aggregation, wherein the agent must explore different websites to gather information for a complex query. We consider web information aggregation from two different perspectives: i) Direct API-driven Access relies on a text-only view of the Web, leveraging external tools such as Google Search API to navigate the Web and a scraper to extract website contents. (ii) Interactive Visual Access uses screenshots of the webpages and requires interaction with the browser to navigate and access information. Motivated by these diverse information access settings, we introduce Infogent, a novel modular framework for web information aggregation involving three distinct components: Navigator, Extractor, and Aggregator. Experiments on different information access settings demonstrate that Infogent beats an existing SOTA multi-agent search framework by 7% under Direct API-Driven Access on FRAMES and improves over an existing information-seeking web agent by 4.3% under Interactive Visual Access on AssistantBench.
pdf
bib
abs
On the Role of Key Phrases in Argument Mining
Nilmadhab Das
|
Vijaya V Saradhi
|
Ashish Anand
Argument mining (AM) focuses on analyzing argumentative structures such as Argument Components (ACs) and Argumentative Relations (ARs). Modeling dependencies between ACs and ARs is challenging due to the complex interactions between ACs. Existing approaches often overlook crucial conceptual links, such as key phrases that connect two related ACs, and tend to rely on cartesian product methods to model these dependencies, which can result in class imbalances. To extract key phrases from the AM benchmarks, we employ a prompt-based strategy utilizing an open-source Large Language Model (LLM). Building on this, we propose a unified text-to-text generation framework that leverages Augmented Natural Language (ANL) formatting and integrates the extracted key phrases inside the ANL itself to efficiently solve multiple AM tasks in a joint formulation. Our method sets new State-of-the-Art (SoTA) on three structurally distinct standard AM benchmarks, surpassing baselines by up to 9.5% F1 score, demonstrating its strong potential.
pdf
bib
abs
TabComp: A Dataset for Visual Table Reading Comprehension
Somraj Gautam
|
Abhishek Bhandari
|
Gaurav Harit
Reaching a human-level understanding of real-world documents necessitates effective machine reading comprehension, yet recent developments in this area often struggle with table images. In response, we introduce the Visual Table Reading Comprehension (TabComp) dataset, which includes table images, questions, and generative answers designed to evaluate OCR-free models. Unlike general Visual Question Answering (VQA) datasets, TabComp uniquely focuses on table images, fostering the development of systems which obviate the use of optical character recognition (OCR) technology, which often struggles with complex table layouts. Our findings reveal that current OCR-free models perform poorly on TabComp, highlighting the need for robust, specialized models for accurate table reading comprehension. We propose TabComp as a benchmark for evaluating OCR-free models in table reading comprehension and encourage the research community to collaborate on developing more effective solutions. The code and data are available at - https://github.com/dialabiitj/TabComp/
pdf
bib
abs
RankAdaptor: Hierarchical Rank Allocation for Efficient Fine-Tuning Pruned LLMs via Performance Model
Changhai Zhou
|
Shijie Han
|
Lining Yang
|
Yuhua Zhou
|
Xu Cheng
|
Yibin Wang
|
Hongguang Li
The efficient compression of large language models (LLMs) has become increasingly popular. However, recovering the performance of compressed LLMs remains a major challenge. The current practice in LLM compression entails the implementation of structural pruning, complemented by a recovery phase that leverages the Low-Rank Adaptation (LoRA) algorithm. Structural pruning’s uneven modification of model architecture, coupled with standard LoRA’s fixed configuration allocation across layers in an online pipeline, leads to suboptimal performance in various downstream tasks for pruned models. To address this challenge, we introduce RankAdaptor, a hierarchical rank allocation method that enables efficient fine-tuning of pruned LLMs according to layerwise specific recovery requirements. We employ a performance model that conducts offline meta-learning and online incremental learning to explore optimal rank values for each layer. Comprehensive experiments on popular benchmarks show that RankAdaptor consistently outperforms state-of-the-art methods across a variety of pruning settings and LLM architectures, with improvements ranging from 0.7% to 5.5%.
pdf
bib
abs
Rationale Behind Essay Scores: Enhancing S-LLM’s Multi-Trait Essay Scoring with Rationale Generated by LLMs
SeongYeub Chu
|
Jong Woo Kim
|
Bryan Wong
|
Mun Yong Yi
Existing automated essay scoring (AES) has solely relied on essay text without using explanatory rationales for the scores, thereby forgoing an opportunity to capture the specific aspects evaluated by rubric indicators in a fine-grained manner. This paper introduces Rationale-based Multiple Trait Scoring (RMTS), a novel approach for multi-trait essay scoring that integrates prompt-engineering-based large language models (LLMs) with a fine-tuning-based essay scoring model using a smaller large language model (S-LLM). RMTS uses an LLM-based trait-wise rationale generation system where a separate LLM agent generates trait-specific rationales based on rubric guidelines, which the scoring model uses to accurately predict multi-trait scores. Extensive experiments on benchmark datasets, including ASAP, ASAP++, and Feedback Prize, show that RMTS significantly outperforms state-of-the-art models and vanilla S-LLMs in trait-specific scoring. By assisting quantitative assessment with fine-grained qualitative rationales, RMTS enhances the trait-wise reliability, providing partial explanations about essays. The code is available at
https://github.com/BBeeChu/RMTS.git.
pdf
bib
abs
MTPChat: A Multimodal Time-Aware Persona Dataset for Conversational Agents
Wanqi Yang
|
Yanda Li
|
Meng Fang
|
Ling Chen
Understanding temporal dynamics is critical for conversational agents, enabling effective content analysis and informed decision-making. However, time-aware datasets, particularly for persona-grounded conversations, are still limited, which narrows their scope and diminishes their complexity. To address this gap, we introduce MTPChat, a multimodal, time-aware persona dialogue dataset that integrates linguistic, visual, and temporal elements within dialogue and persona memory. Leveraging MTPChat, we propose two time-sensitive tasks: Temporal Next Response Prediction (TNRP) and Temporal Grounding Memory Prediction (TGMP), both designed to assess a model’s ability to understand implicit temporal cues and dynamic interactions. Additionally, we present an innovative framework featuring an adaptive temporal module to effectively integrate multimodal streams and capture temporal dependencies. Experimental results validate the challenges posed by MTPChat and demonstrate the effectiveness of our framework in multimodal time-sensitive scenarios.
pdf
bib
abs
MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time
Mozhi Zhang
|
Pengyu Wang
|
Chenkun Tan
|
Mianqiu Huang
|
Dong Zhang
|
Yaqian Zhou
|
Xipeng Qiu
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from extensive text corpora, making them powerful tools for various applications. To make LLMs more usable, aligning them with human preferences is essential. Existing alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), typically embed predefined preferences directly within the model’s parameters. These methods, however, often result in a static alignment that can not account for the diversity of human preferences in practical applications.In response to this challenge, we propose an effective method, MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time. Experimental results show that LLMs optimized on our meticulously constructed MetaAlign Dataset can effectively align with any preferences specified at the inference stage, validating the feasibility of MetaAlign. We hope that our work can provide some insights into the alignment of language models.
pdf
bib
abs
MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty
Yongjin Yang
|
Haneul Yoo
|
Hwaran Lee
Despite the massive advancements in large language models (LLMs), they still suffer from producing plausible but incorrect responses. To improve the reliability of LLMs, recent research has focused on uncertainty quantification to predict whether a response is correct or not. However, most uncertainty quantification methods have been evaluated on single-labeled questions, which removes data uncertainty—the irreducible randomness often present in user queries, which can arise from factors like multiple possible answers. This limitation may cause uncertainty quantification results to be unreliable in practical settings. In this paper, we investigate previous uncertainty quantification methods under the presence of data uncertainty. Our contributions are two-fold: 1) proposing a new Multi-Answer Question Answering dataset, **MAQA**, consisting of world knowledge, mathematical reasoning, and commonsense reasoning tasks to evaluate uncertainty quantification regarding data uncertainty, and 2) assessing 5 uncertainty quantification methods of diverse white- and black-box LLMs. Our findings show that previous methods relatively struggle compared to single-answer settings, though this varies depending on the task. Moreover, we observe that entropy- and consistency-based methods effectively estimate model uncertainty, even in the presence of data uncertainty.
pdf
bib
abs
Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning
Hyundong Justin Cho
|
Karishma Sharma
|
Nicolaas Paul Jedema
|
Leonardo F. R. Ribeiro
|
Jonathan May
|
Alessandro Moschitti
Language models are aligned to the collective voice of many, resulting in generic outputs that do not align with specific users’ styles. In this work, we present Trial-Error-Explain In-Context Learning (TICL), a tuning-free method that personalizes language models for text generation tasks with fewer than 10 examples per user. TICL iteratively expands an in-context learning prompt via a trial-error-explain process, adding model-generated negative samples and explanations that provide fine-grained guidance towards a specific user’s style. TICL achieves favorable win rates on pairwise comparisons with LLM-as-a-judge up to 91.5% against the previous state-of-the-art and outperforms competitive tuning-free baselines for personalized alignment tasks of writing emails, essays and news articles. Both lexical and qualitative analyses show that the negative samples and explanations enable language models to learn stylistic context more effectively and overcome the bias towards structural and formal phrases observed in their zero-shot outputs. By front-loading inference compute to create a user-specific in-context learning prompt that does not require extra generation steps at test time, presents a novel yet simple approach for personalized alignment.
pdf
bib
abs
Causal Inference with Large Language Model: A Survey
Jing Ma
Causal inference has been a pivotal challenge across diverse domains such as medicine and economics, demanding a complicated integration of human knowledge, mathematical reasoning, and data mining capabilities. Recent advancements in natural language processing (NLP), particularly with the advent of large language models (LLMs), have introduced promising opportunities for traditional causal inference tasks. This paper reviews recent progress in applying LLMs to causal inference, encompassing various tasks spanning different levels of causation. We summarize the main causal problems and approaches, and present a comparison of their evaluation results in different causal scenarios. Furthermore, we discuss key findings and outline directions for future research, underscoring the potential implications of integrating LLMs in advancing causal inference methodologies.
pdf
bib
abs
Ask Optimal Questions: Aligning Large Language Models with Retriever’s Preference in Conversation
Chanwoong Yoon
|
Gangwoo Kim
|
Byeongguk Jeon
|
Sungdong Kim
|
Yohan Jo
|
Jaewoo Kang
Conversational search, unlike single-turn retrieval tasks, requires understanding the current question within a dialogue context. The common approach of rewrite-then-retrieve aims to decontextualize questions to be self-sufficient for off-the-shelf retrievers, but most existing methods produce sub-optimal query rewrites due to the limited ability to incorporate signals from the retrieval results. To overcome this limitation, we present a novel framework RetPO (Retriever’s Preference Optimization), which is designed to optimize a language model (LM) for reformulating search queries in line with the preferences of the target retrieval systems. The process begins by prompting a large LM to produce various potential rewrites and then collects retrieval performance for these rewrites as the retrievers’ preferences. Through the process, we construct a large-scale dataset called RF collection, containing Retrievers’ Feedback on over 410K query rewrites across 12K conversations. Furthermore, we fine-tune a smaller LM using this dataset to align it with the retrievers’ preferences as feedback. The resulting model demonstrates superiority on two benchmarks, surpassing the previous state-of-the-art performance of rewrite-then-retrieve approaches, including GPT-3.5.
pdf
bib
abs
Systematic Knowledge Injection into Large Language Models via Diverse Augmentation for Domain-Specific RAG
Kushagra Bhushan
|
Yatin Nandwani
|
Dinesh Khandelwal
|
Sonam Gupta
|
Gaurav Pandey
|
Dinesh Raghu
|
Sachindra Joshi
Retrieval-Augmented Generation (RAG) has emerged as a prominent method for incorporating domain knowledge into Large Language Models (LLMs). While RAG enhances response relevance by incorporating retrieved domain knowledge in the context, retrieval errors can still lead to hallucinations and incorrect answers. To recover from retriever failures, domain knowledge is injected by fine-tuning the model to generate the correct response, even in the case of retrieval errors. However, we observe that without systematic knowledge augmentation, fine-tuned LLMs may memorize new information but still fail to extract relevant domain knowledge, leading to poor performance. In this work, we present a novel framework that significantly enhances the fine-tuning process by augmenting the training data in two ways – context augmentation and knowledge paraphrasing. In context augmentation, we create multiple training samples for a given QA pair by varying the relevance of the retrieved information, teaching the model when to ignore and when to rely on retrieved content. In knowledge paraphrasing, we finetune with multiple answers to the same question, enabling LLMs to better internalize specialized knowledge. To mitigate catastrophic forgetting due to fine-tuning, we add a domain-specific identifier to a question and also utilize a replay buffer containing general QA pairs. Experimental results demonstrate the efficacy of our method over existing techniques, achieving up to 10% relative gain in token-level recall while preserving the LLM’s generalization capabilities.
pdf
bib
abs
Find the Intention of Instruction: Comprehensive Evaluation of Instruction Understanding for Large Language Models
Hyeonseok Moon
|
Jaehyung Seo
|
Seungyoon Lee
|
Chanjun Park
|
Heuiseok Lim
Through numerous endeavors, large language models (LLMs) have witnessed significant advancements in their instruction-following capability. However, we discern that LLMs are prone to generate responses to instruction-formatted statements in an instinctive manner, rather than comprehending the underlying user intention reside within the given instructions. We also recognize that the significance of instruction understanding capability is largely overlooked in most of LLM evaluation benchmarks. To ensure more comprehensive evaluation on the instruction understanding capability of LLM, we propose Intention of Instruction (IntInst) benchmark, which primary objective is to distinguish the appropriate instruction that accurately instruct to generate a given context. IntInst presents four instruction candidates and requires LLMs to select one among them. Through extensive experiments with several instruction-tuned LLMs, we reveal that most LLMs struggle to grasp the actual intention concealed in the instruction and thoroughly analyze the factors influencing instruction understanding.
pdf
bib
abs
Long-Tail Crisis in Nearest Neighbor Language Models
Yuto Nishida
|
Makoto Morishita
|
Hiroyuki Deguchi
|
Hidetaka Kamigaito
|
Taro Watanabe
The k-nearest-neighbor language model (kNN-LM), one of the retrieval-augmented language models, improves the perplexity for given text by directly accessing a large datastore built from any text data during inference.A widely held hypothesis for the success of kNN-LM is that its explicit memory, i.e., the datastore, enhances predictions for long-tail phenomena.However, prior works have primarily shown its ability to retrieve long-tail contexts, leaving the model’s performance remain underexplored in estimating the probabilities of long-tail target tokens during inference.In this paper, we investigate the behavior of kNN-LM on low-frequency tokens, examining prediction probability, retrieval accuracy, and token distribution in the datastore.Our experimental results reveal that kNN-LM does not improve prediction performance for low-frequency tokens but mainly benefits high-frequency tokens regardless of long-tail contexts in the datastore.
pdf
bib
abs
Keep Guessing? When Considering Inference Scaling, Mind the Baselines
Gal Yona
|
Or Honovich
|
Omer Levy
|
Roee Aharoni
Scaling inference compute in large language models (LLMs) through repeated sampling consistently increases the coverage (fraction of problems solved) as the number of samples increases. We conjecture that this observed improvement is partially due to the answer distribution of standard evaluation benchmarks, which is skewed towards a relatively small set of common answers. To test this conjecture, we define a baseline that enumerates answers according to their prevalence in the training set. Experiments spanning two domains – mathematical reasoning and factual knowledge – reveal that this baseline outperforms repeated model sampling for some LLMs, while the coverage for others is on par with that of a mixture strategy that obtains k answers by using only 10 model samples and similarly guessing the remaining k-10 attempts via enumeration. Our baseline enables a more accurate measurement of how much repeated sampling improves coverage in such settings beyond prompt-agnostic guessing.
pdf
bib
abs
Large Language Models for Anomaly and Out-of-Distribution Detection: A Survey
Ruiyao Xu
|
Kaize Ding
Detecting anomalies or out-of-distribution (OOD) samples is critical for maintaining the reliability and trustworthiness of machine learning systems. Recently, Large Language Models (LLMs) have demonstrated their effectiveness not only in natural language processing but also in broader applications due to their advanced comprehension and generative capabilities. The integration of LLMs into anomaly and OOD detection marks a significant shift from the traditional paradigm in the field. This survey focuses on the problem of anomaly and OOD detection under the context of LLMs. We propose a new taxonomy to categorize existing approaches into two classes based on the role played by LLMs. Following our proposed taxonomy, we further discuss the related work under each of the categories and finally discuss potential challenges and directions for future research in this field. We also provide an up-to-date reading list of relevant papers: https://github.com/rux001/Awesome-LLM-Anomaly-OOD-Detection.
pdf
bib
abs
Time-aware ReAct Agent for Temporal Knowledge Graph Question Answering
Qianyi Hu
|
Xinhui Tu
|
Cong Guo
|
Shunping Zhang
Temporal knowledge graph question answering (TKGQA) addresses time-sensitive queries using knowledge bases. Although large language models (LLMs) and LLM-based agents such as ReAct have shown potential for TKGQA, they often lack sufficient temporal constraints in the retrieval process. To tackle this challenge, we propose TempAgent, a novel autonomous agent framework built on LLMs that enhances their ability to conduct temporal reasoning and comprehension. By integrating temporal constraints into information retrieval, TempAgent effectively discards irrelevant material and concentrates on extracting pertinent temporal and factual information. We evaluate our framework on the MultiTQ dataset, a real-world multi-granularity TKGQA benchmark, using a fully automated setup. Our experimental results reveal the remarkable effectiveness of our approach: TempAgent achieves a 41.3% improvement over the baseline model and a 32.2% gain compared to the Abstract Reasoning Induction (ARI) method. Moreover, our method attains an accuracy of 70.2% on the @hit1 metric, underscoring its substantial advantage in addressing time-aware TKGQA tasks.
pdf
bib
abs
SG-FSM: A Self-Guiding Zero-Shot Prompting Paradigm for Multi-Hop Question Answering Based on Finite State Machine
Xiaochen Wang
|
Junqing He
|
Liang Chen
|
Gholamreza Haffari
|
Yiru Wang
|
Zhe Yang
|
Xiangdi Meng
|
Kunhao Pan
|
Zhifang Sui
Large Language Models with chain-of-thought prompting, such as OpenAI-o1, have shown impressive capabilities in natural language inference tasks. However, Multi-hop Question Answering (MHQA) remains challenging for many existing models due to issues like hallucination, error propagation, and limited context length. To address these challenges and enhance LLMs’ performance on MHQA, we propose the Self-Guiding prompting Finite State Machine (SG-FSM), designed to strengthen multi-hop reasoning abilities. Unlike traditional chain-of-thought methods, SG-FSM tackles MHQA by iteratively breaking down complex questions into sub-questions, correcting itself to improve accuracy. It processes one sub-question at a time, dynamically deciding the next step based on the current context and results, functioning much like an automaton. Experiments across various benchmarks demonstrate the effectiveness of our approach, outperforming strong baselines on challenging datasets such as Musique. SG-FSM reduces hallucination, enabling recovery of the correct final answer despite intermediate errors. It also improves adherence to specified output formats, simplifying evaluation significantly.
pdf
bib
abs
Dynamic Strategy Planning for Efficient Question Answering with Large Language Models
Tanmay Parekh
|
Pradyot Prakash
|
Alexander Radovic
|
Akshay Shekher
|
Denis Savenkov
Research has shown an effectiveness of reasoning (e.g. Chain-of-Thought), planning (e.g. SelfAsk) and retrieval augmented generation strategies to improve performance of Large Language Models (LLMs) on various tasks, such as question answering. However, using a single fixed strategy for answering all different kinds of questions is sub-optimal in performance and inefficient in terms of generated tokens and retrievals. In our work, we propose a novel technique, DyPlan, to induce a dynamic strategy selection process in LLMs for cost-effective question-answering. DyPlan incorporates an initial decision step to select the most suitable strategy conditioned on the input question and guides the LLM’s response generation accordingly. We extend DyPlan to DyPlan-verify, adding an internal verification and correction process to further enrich the generated answer. Experimentation on three prominent multi-hop question answering (MHQA) datasets reveals how DyPlan can improve model performance by 7-13% while reducing the cost by 11-32% relative to the best baseline model.
pdf
bib
abs
Can I Introduce My Boyfriend to My Grandmother? Evaluating Large Language Models Capabilities on Iranian Social Norm Classification
Hamidreza Saffari
|
Mohammadamin Shafiei
|
Donya Rooein
|
Francesco Pierri
|
Debora Nozza
Creating globally inclusive AI systems demands datasets reflecting diverse social norms. Iran, with its unique cultural blend, offers an ideal case study, with Farsi adding linguistic complexity. In this work, we introduce the Iranian Social Norms (ISN) dataset, a novel collection of 1,699 Iranian social norms, including environments, demographic features, and scope annotation, alongside English translations. Our evaluation of 6 Large Language Models (LLMs) in classifying Iranian social norms, using a variety of prompts, uncovered critical insights into the impact of geographic and linguistic context. Results revealed a substantial performance gap in LLMs’ comprehension of Iranian norms. Notably, while the geographic context in English prompts enhanced the performance, this effect was absent in Farsi, pointing to nuanced linguistic challenges. Particularly, performance was significantly worse for Iran-specific norms, emphasizing the importance of culturally tailored datasets. As the first Farsi dataset for social norm classification, ISN will facilitate crucial cross-cultural analyses, shedding light on how values differ across contexts and cultures.
pdf
bib
abs
PLD+: Accelerating LLM Inference by Leveraging Language Model Artifacts
Shwetha Somasundaram
|
Anirudh Phukan
|
Apoorv Saxena
To reduce the latency associated with autoretrogressive LLM inference, speculative decoding has emerged as a novel decoding paradigm, where future tokens are drafted and verified in parallel. However, the practical deployment of speculative decoding is hindered by its requirements for additional computational resources and fine-tuning, which limits its out-of-the-box usability. To address these challenges, we present PLD+, a suite of novel algorithms developed to accelerate the inference process of LLMs, particularly for input-guided tasks. These tasks, which include code editing, text editing, summarization, etc., often feature outputs with substantial overlap with their inputs—an attribute PLD+ is designed to exploit. PLD+ also leverages the artifacts (attention and hidden states) generated during inference to accelerate inference speed. We test our approach on five input-guided tasks and through extensive experiments we find that PLD+ outperforms all tuning-free approaches. In the greedy setting, it even outperforms the state-of-the-art tuning-dependent approach EAGLE on four of the tasks. (by a margin of upto 2.31 in terms of avg. speedup). Our approach is tuning free, does not require any additional compute and can easily be used for accelerating inference of any LLM.
pdf
bib
abs
Adapting LLM Agents with Universal Communication Feedback
Kuan Wang
|
Yadong Lu
|
Michael Santacroce
|
Yeyun Gong
|
Chao Zhang
|
Yelong Shen
Recent advances in large language models (LLMs) have demonstrated potential for LLM agents. To facilitate the training for these agents with both linguistic feedback and non-linguistic reward signals, we introduce Learning through Communication (LTC). We design a universal buffer to store all the feedback, and an iterative pipeline to enable an LLM agent to explore and update its policy in an given environment. To optimize agent interactions for task-specific learning with our universal buffer and pipeline, we introduce diverse communication patterns tailored for both single-agent and multi-agent environments. We evaluate the efficacy of our LTC approach on four diverse datasets: ALFWorld (single-agent), HotpotQA (multi-agent collaboration), Chameleon (multi-agent competition), and GSM8k (multi-agent teacher-student). On these data sets, LTC outperforms the supervised instruction fine-tuning baselines by 3.6% to 12%. These results highlight the versatility and efficiency of LTC in facilitating online adaptation for LLM agents.
pdf
bib
abs
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
Jean Vassoyan
|
Nathanaël Beau
|
Roman Plaud
The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploration with LLMs is difficult, as a balance has to be struck between discovering new solutions and staying close enough to the pre-trained model, so as not to degrade basic capabilities. This is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we investigate the exploration dynamics of a small language model on a simple arithmetic task. We show how varying degrees of pre-training influence exploration and demonstrate the importance of “critical tokens” which have a dramatic impact on the final outcome. Consequently, we introduce a simple modification to the KL penalty that favors exploration on critical tokens, increasing the efficiency of the RL fine-tuning stage.
pdf
bib
abs
SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
Chaoqun Liu
|
Wenxuan Zhang
|
Jiahao Ying
|
Mahani Aljunied
|
Anh Tuan Luu
|
Lidong Bing
This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated benchmarks. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.
pdf
bib
abs
Learning to Search Effective Example Sequences for In-Context Learning
Xiang Gao
|
Ankita Sinha
|
Kamalika Das
Large language models (LLMs) demonstrate impressive few-shot learning capabilities, but their performance varies widely based on the sequence of in-context examples. Key factors influencing this include the sequence’s length, composition, and arrangement, as well as its relation to the specific query. Existing methods often tackle these factors in isolation, overlooking their interdependencies. Moreover, the extensive search space for selecting optimal sequences complicates the development of a holistic approach. In this work, we introduce Beam Search-based Example Sequence Constructor (BESC), a novel method for learning to construct optimal example sequences. addresses all key factors involved in sequence selection by considering them jointly during inference, while incrementally building the sequence. This design enables the use of beam search to significantly reduce the complexity of the search space. Experiments across various datasets and language models show notable improvements in performance.
pdf
bib
abs
From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models
Harsh Nishant Lalai
|
Aashish Anantha Ramakrishnan
|
Raj Sanjay Shah
|
Dongwon Lee
With the rapid growth of Large Language Models (LLMs), safeguarding textual content against unauthorized use is crucial. Watermarking offers a vital solution, protecting both - LLM-generated and plain text sources. This paper presents a unified overview of different perspectives behind designing watermarking techniques through a comprehensive survey of the research literature. Our work has two key advantages: (1) We analyze research based on the specific intentions behind different watermarking techniques, evaluation datasets used, and watermarking addition and removal methods to construct a cohesive taxonomy. (2) We highlight the gaps and open challenges in text watermarking to promote research protecting text authorship. This extensive coverage and detailed analysis sets our work apart, outlining the evolving landscape of text watermarking in Language Models.
pdf
bib
abs
M-IFEval: Multilingual Instruction-Following Evaluation
Antoine Dussolle
|
A. Cardeña
|
Shota Sato
|
Peter Devine
Instruction following is a core capability of modern Large language models (LLMs), making evaluating this capability essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages.We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to 8 state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural context.
pdf
bib
abs
Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language
Zhiqiang Zhong
|
Simon Sataa-Yu Larsen
|
Haoyu Guo
|
Tao Tang
|
Kuangyu Zhou
|
Davide Mottin
Recent advancements in AI for biological research focus on integrating molecular data with natural language to accelerate drug discovery. However, the scarcity of high-quality annotations limits progress in this area. This paper introduces LA3, a Language-based Automatic Annotation Augmentation framework that leverages large language models to augment existing datasets, thereby improving AI training. We demonstrate the effectiveness of LA3 by creating an enhanced dataset, LaChEBI-20, where we systematically rewrite the annotations of molecules from an established dataset. These rewritten annotations preserve essential molecular information while providing more varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations.Experimental results on text-based *de novo* molecule generation and molecule captioning demonstrate that LaMolT5 outperforms state-of-the-art models. Notably, incorporating LA3 leads to improvements of up to 301% over the benchmark architecture. Furthermore, we validate the effectiveness of LA3 notable applications in *image*, *text* and *graph* tasks, affirming its versatility and utility.
pdf
bib
abs
Let Modalities Teach Each Other: Modal-Collaborative Knowledge Extraction and Fusion for Multimodal Knowledge Graph Completion
Guoliang Zhu
|
Tao Ren
|
Dandan Wang
|
Jun Hu
Multimodal knowledge graph completion (MKGC) aims to predict missing triples in MKGs using multimodal information. Recent research typically either extracts information from each modality separately to predict, then ensembles the predictions at the decision stage, or projects multiple modalities into a unified feature space to learn multimodal representations for prediction. However, these methods usually overlook the intrinsic correlation between modalities in MKGs which should be leveraged in both unimodal knowledge extraction and multimodal knowledge fusion. Motivated by this, we propose a noval Modal-collaborative knowledge learning (Moodle) framework for MKGC, the key idea of which is to foster mutual guidance and collaboration during unimodal knowledge extraction, to let each modality acquire distinct and complementary knowledge that subsequently enhances the multimodal knowledge fusion. Specifically, Moodle preserves the representations of different modalities to learn unimodal knowledge while modeling the mutual guidance through multi-task learning. Furthermore, Moodle performs multimodal knowledge fusion and prediction guided by unimodal knowledge, capturing their synergistic relationships and acquire fine-grained semantic knowledge through contrastive learning. Extensive experiments on three real-world datasets demonstrate the advantages of Moodle over state-of-the-art methods.
pdf
bib
abs
Modeling the Differential Prevalence of Online Supportive Interactions in Private Instant Messages of Adolescents
Ondrej Sotolar
|
Michał Tkaczyk
|
Jaromír Plhák
|
David Smahel
This paper focuses on modeling gender-based and pair-or-group disparities in online supportive interactions among adolescents. To address the limitations of conventional social science methods in handling large datasets, this research employs language models to detect supportive interactions based on the Social Support Behavioral Code and to model their distribution. The study conceptualizes detection as a classification task, constructs a new dataset, and trains predictive models. The novel dataset comprises 196,772 utterances from 2165 users collected from Instant Messenger apps. The results show that the predictions of language models can be used to effectively model the distribution of supportive interactions in private online dialogues. As a result, this study provides new computational evidence that supports the theory that supportive interactions are more prevalent in online female-to-female conversations. The findings advance our understanding of supportive interactions in adolescent communication and present methods to automate the analysis of large datasets, opening new research avenues in computational social science.
pdf
bib
abs
Dynamic Feature Fusion for Sign Language Translation Using HyperNetworks
Ruiquan Zhang
|
Rui Zhao
|
Zhicong Wu
|
Liang Zhang
|
Haoqi Zhang
|
Yidong Chen
This paper presents an efficient dual-stream early fusion method for sign language translation. Inspired by the brain’s ability to process color, shape, and motion simultaneously, the method explores complex dependencies between RGB and keypoint streams, improving speed and efficiency. A key challenge is extracting complementary features from both streams while ensuring global semantic consistency to avoid conflicts and improve generalization. To address this issue, we propose a hypernetwork-based fusion strategy that effectively extracts salient features from RGB and keypoint streams, alongside a partial shortcut connection training method to strengthen the complementary information between the dual streams. Additionally, we introduce self-distillation and SST contrastive learning to maintain feature advantages while aligning the global semantic space. Experiments show that our method achieves state-of-the-art performance on two public sign language datasets, reducing model parameters by about two-thirds.
pdf
bib
abs
Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models
Sonam Gupta
|
Yatin Nandwani
|
Asaf Yehudai
|
Dinesh Khandelwal
|
Dinesh Raghu
|
Sachindra Joshi
Fine-tuning Large Language Models (LLMs) on specific datasets is a common practice to improve performance on target tasks. However, this performance gain often leads to overfitting, where the model becomes too specialized in either the task or the characteristics of the training data, resulting in a loss of generalization. This paper introduces Selective Self-to-Supervised Fine-Tuning (S3FT), a fine-tuning approach that achieves better performance than the standard supervised fine-tuning (SFT) while improving generalization.S3FT leverages the existence of multiple valid responses to a query.By utilizing the model’s correct responses, S3FT reduces model specialization during the fine-tuning stage. S3FT first identifies the correct model responses from the training set by deploying an appropriate judge. Then, it fine-tunes the model using the correct model responses and the gold response (or its paraphrase) for the remaining samples.The effectiveness of S3FT is demonstrated through experiments on mathematical reasoning, Python programming and reading comprehension tasks. The results show that standard SFT can lead to an average performance drop of up to 4.4 on multiple benchmarks, such as MMLU and TruthfulQA. In contrast, S3FT reduces this drop by half, i.e. 2.5, indicating better generalization capabilities than SFT while performing significantly better on the fine-tuning tasks.
pdf
bib
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding
Israel Abebe Azime
|
Atnafu Lambebo Tonja
|
Tadesse Destaw Belay
|
Yonas Chanie
|
Bontu Fufa Balcha
|
Negasi Haile Abadi
|
Henok Biadglign Ademtew
|
Mulubrhan Abebe Nerea
|
Debela Desalegn Yadeta
|
Derartu Dagne Geremew
|
Assefa Atsbiha Tesfu
|
Philipp Slusallek
|
Thamar Solorio
|
Dietrich Klakow
pdf
bib
abs
MRE-MI: A Multi-image Dataset for Multimodal Relation Extraction in Social Media Posts
Shizhou Huang
|
Bo Xu
|
Changqun Li
|
Yang Yu
|
Xin Alex Lin
Despite recent advances in Multimodal Relation Extraction (MRE), existing datasets and approaches primarily focus on single-image scenarios, overlooking the prevalent real-world cases where relationships are expressed through multiple images alongside text. To address this limitation, we present MRE-MI, a novel human-annotated dataset that includes both multi-image and single-image instances for relation extraction. Beyond dataset creation, we establish comprehensive baselines and propose a simple model named Global and Local Relevance-Modulated Attention Model (GLRA) to address the new challenges in multi-image scenarios. Our extensive experiments reveal that incorporating multiple images substantially improves relation extraction in multi-image scenarios. Furthermore, GLRA achieves state-of-the-art results on MRE-MI, demonstrating its effectiveness. The datasets and source code can be found at https://github.com/JinFish/MRE-MI.
pdf
bib
abs
Discrete Diffusion Language Model for Efficient Text Summarization
Do Huu Dat
|
Duc Anh Do
|
Anh Tuan Luu
|
Wray Buntine
While diffusion models excel at conditionally generating high-quality images, prior works in discrete diffusion models were not evaluated on conditional long-text generation. This work addresses the limitations of prior discrete diffusion models for conditional long-text generation, particularly in the long abstractive summarization task. Despite faster decoding speeds compared to autoregressive methods, previous discrete diffusion models failed on the abstractive summarization task due to the incompatibility between the backbone architectures and the random noising process. To overcome these challenges, we introduce a novel semantic-aware noising process that enables Transformer backbones to handle long sequences effectively. Additionally, we propose CrossMamba, an adaptation of the Mamba model to the encoder-decoder paradigm, which integrates seamlessly with the random absorbing noising process. Our approaches outperform existing discrete diffusion models on three benchmark summarization datasets: Gigaword, CNN/DailyMail, and Arxiv, while also achieving much faster inference speed compared to autoregressive models.
pdf
bib
abs
CAPE: A Chinese Dataset for Appraisal-based Emotional Generation in Large Language Models
June M. Liu
|
He Cao
|
Renliang Sun
|
Rui Wang
|
Yu Li
|
Jiaxing Zhang
Generating emotionally appropriate responses in conversations with large language models presents a significant challenge due to the complexities of human emotions and cognitive processes, which remain largely underexplored in their critical role in social interactions. In this study, we introduce a two-stage automatic data generation framework to create CAPE, a Chinese dataset named Cognitive Appraisal theory-based Emotional corpus. This corpus facilitates the generation of dialogues with contextually appropriate emotional responses by accounting for diverse personal and situational factors. We propose two tasks utilizing this dataset: emotion prediction and next utterance prediction. Both automated and human evaluations demonstrate that agents trained on our dataset can deliver responses that are more aligned with human emotional expressions. Our study shows the potential for advancing emotional expression in conversational agents, paving the way for more nuanced and meaningful human-computer interactions.
pdf
bib
abs
Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models
Hongbang Yuan
|
Yubo Chen
|
Pengfei Cao
|
Zhuoran Jin
|
Kang Liu
Large language models (LLMs) have achieved remarkable success but still tend to generate factually erroneous responses, a phenomenon known as hallucination. A recent trend is to use preference learning to fine-tune models to align with factuality. However, existing work primarily evaluates fine-tuned models on in-domain (ID) datasets and the factuality on out-of-domain (OOD) datasets remains underexplored. In this paper, we conduct a comprehensive evaluation of the factuality of different models tuned by various preference learning algorithms and demonstrate that their performance on OOD datasets either increases minimally or decreases. Subsequently, we reveal that the main cause of model’s failure to uphold factuality under a distribution shift is under-alignment, rather than over-alignment, by analyzing the token distribution shift of the models before and after tuning. Finally, we propose APEFT (Atomic Preference Enhanced Factuality Tuning), a framework that enhances model’s awareness of factuality at the granularity of individual facts. Extensive experiments demonstrate that APEFT improves model performance by an average of on both ID and OOD datasets, which is highly effective.
pdf
bib
abs
Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference
Go Kamoda
|
Benjamin Heinzerling
|
Tatsuro Inaba
|
Keito Kudo
|
Keisuke Sakaguchi
|
Kentaro Inui
According to the stages-of-inference hypothesis, early layers of language models map their subword-tokenized input, which does not necessarily correspond to a linguistically meaningful segmentation, to more meaningful representations that form the model’s “inner vocabulary”.Prior analysis of this *detokenization* stage has predominantly relied on probing and interventions such as path patching, which involve selecting particular inputs, choosing a subset of components that will be patched, and then observing changes in model behavior.Here, we show that several important aspects of the detokenization stage can be understood purely by analyzing model weights, without performing any model inference steps.Specifically, we introduce an analytical decomposition of first-layer attention in GPT-2.Our decomposition yields interpretable terms that quantify the relative contributions of position-related, token-related, and mixed effects.By focusing on terms in this decomposition, we discover weight-based explanations of attention bias toward close tokens and attention for detokenization.
pdf
bib
abs
DiPT: Enhancing LLM Reasoning through Diversified Perspective-Taking
Hoang Anh Just
|
Mahavir Dabas
|
Lifu Huang
|
Ming Jin
|
Ruoxi Jia
Existing work on improving language model reasoning typically explores a single solution path, which can be prone to errors. Inspired by perspective-taking in social studies, this paper introduces DiPT, a novel approach that complements current reasoning methods by explicitly incorporating diversified viewpoints. This approach allows the model to gain a deeper understanding of the problem’s context and identify the most effective solution path during the inference stage. Additionally, it provides a general data-centric AI recipe for augmenting existing data to improve their quality for fine-tuning. Our empirical results demonstrate that DiPT can be flexibly integrated into existing methods that focus on a single reasoning approach, enhancing their reasoning performance and stability when presented with paraphrased problems. Furthermore, we illustrate improved context understanding by maintaining the model’s safe outputs against “jailbreaking” prompts intentionally designed to bypass safeguards built into deployed models. Lastly, we show that fine-tuning with data enriched with diverse perspectives can boost the reasoning capabilities of the model compared to fine-tuning with raw data alone.
pdf
bib
abs
SOLID: Self-seeding and Multi-intent Self-instructing LLMs for Generating Intent-aware Information-Seeking Dialogs
Arian Askari
|
Roxana Petcu
|
Chuan Meng
|
Mohammad Aliannejadi
|
Amin Abolghasemi
|
Evangelos Kanoulas
|
Suzan Verberne
Intent prediction in information-seeking dialogs is challenging and requires a substantial amount of data with human-labeled intents for effective model training. While Large Language Models (LLMs) have demonstrated effectiveness in generating synthetic data, existing methods typically rely on human feedback and are tailored to structured, task-oriented intents. In this paper, we leverage LLMs for zero-shot generation of large-scale, open-domain, intent-aware information-seeking dialogs to serve as training data for intent prediction models. We introduce SOLID, a method that generates dialogs turn by turn using novel self-seeding and multi-intent self-instructing strategies. Additionally, we propose SOLID-RL, a finetuned version that generates an entire dialog in one step using data created with SOLID. SOLID and SOLID-RL are each used to generate over 300k intent-aware dialogs, significantly surpassing the size of existing datasets. Experiments show that intent prediction models trained on sampled dialogs generated by SOLID and SOLID-RL outperform those trained solely on human-generated dialogs. Our findings demonstrate the potential of LLMs to expand training datasets, as they provide valuable resources for conversational agents across multiple tasks. Our self-seeding and self-instructing approaches are adaptable to various conversational data types and languages with minimal modifications.
pdf
bib
CollagePrompt: A Benchmark for Budget-Friendly Visual Recognition with GPT-4V
Siyu Xu
|
Yunke Wang
|
Daochang Liu
|
Bo Du
|
Chang Xu
pdf
bib
abs
ARISE: Iterative Rule Induction and Synthetic Data Generation for Text Classification
Yaswanth M
|
Vaibhav Singh
|
Ayush Maheshwari
|
Amrith Krishna
|
Ganesh Ramakrishnan
We propose ARISE, a framework that iteratively induces rules and generates synthetic data for text classification. We combine synthetic data generation and automatic rule induction, via bootstrapping, to iteratively filter the generated rules and data. We induce rules via inductive generalisation of syntactic-ngrams, enabling us to capture a complementary source of supervision. These rules alone lead to performance gains in both, in-context learning (ICL) and fine-tuning (FT) settings. Similarly, use of augmented data from ARISE alone improves the performance for a model, outperforming configurations that rely on complex methods like contrastive learning. Further, our extensive experiments on various datasets covering three full-shot, eight few-shot and seven multilingual variant settings demonstrate that the rules and data we generate lead to performance improvements across these diverse domains and languages.
pdf
bib
abs
Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context
Sangwon Yu
|
Ik-hwan Kim
|
Jongyoon Song
|
Saehyung Lee
|
Junsung Park
|
Sungroh Yoon
Multi-hop reasoning, which requires multi-step reasoning based on the supporting documents within a given context, remains challenging for large language models (LLMs). LLMs often struggle to filter out irrelevant documents within the context, and their performance is sensitive to the absolute position of supporting documents within that context. In this paper, we identify an additional challenge: LLMs’ performance is also sensitive to the order, relative position, in which the supporting documents are presented. We refer to this as the misordered context problem. To address this issue, based on the theoretical approach, we propose a simple yet effective method called context repetition (CoRe), which involves prompting the model by repeatedly presenting the context. This ensures that certain contiguous reasoning segments within supporting documents are presented in the optimal order, effectively guiding the model’s reasoning in the appropriate direction. Applying CoRe, we improve the F1 score by up to 30%p on multi-hop QA tasks and increase accuracy by up to 70%p on a synthetic task. Additionally, CoRe helps mitigate the well-known “lost-in-the-middle” problem in LLMs and can be effectively combined with retrieval-based approaches utilizing Chain-of-Thought (CoT) reasoning.
pdf
bib
abs
Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis
Angelina Parfenova
|
Andreas Marfurt
|
Jürgen Pfeffer
|
Alexander Denzler
This paper investigates the automation of qualitative data analysis, focusing on inductive coding using large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research investigates the inductive process where labels emerge from the data. The study evaluates the performance of six open-source LLMs compared to human experts. As part of the evaluation, experts rated the perceived difficulty of the quotes they coded. The results reveal a peculiar dichotomy: human coders consistently perform well when labeling complex sentences but struggle with simpler ones, while LLMs exhibit the opposite trend. Additionally, the study explores systematic deviations in both human and LLM-generated labels by comparing them to the golden standard from the test set. While human annotations may sometimes differ from the golden standard, they are often rated more favorably by other humans. In contrast, some LLMs demonstrate closer alignment with the true labels but receive lower evaluations from experts.
pdf
bib
abs
Investigating the Zone of Proximal Development of Language Models for In-Context Learning
Peng Cui
|
Mrinmaya Sachan
In this paper, we introduce a learning analytics framework to analyze the in-context learning (ICL) behavior of large language models (LLMs) through the lens of the Zone of Proximal Development (ZPD), an established theory in educational psychology. ZPD delineates the range of tasks a learner can accomplish with appropriate guidance but not yet independently. We adapt this concept to ICL, measuring the ZPD of LLMs based on model performance on individual examples in different settings. Furthermore, we propose an item response theory (IRT) model to predict the distribution of zones for LLMs. Our findings reveal a series of intricate and multifaceted behaviors of ICL, providing new insights into understanding and leveraging this technique. Finally, we demonstrate how our framework can enhance LLM in both inference and fine-tuning scenarios: (1) By predicting a model’s zone distribution, we selectively apply ICL to queries that are most likely to benefit from demonstrations, achieving a better balance between inference cost and performance; (2) We propose a human-like curriculum for fine-tuning, which prioritizes examples within the model’s ZPD. The curriculum results in improved performance, and we explain its effectiveness through an analysis of the training dynamics of LLMs.
pdf
bib
abs
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In
Itay Nakash
|
George Kour
|
Guy Uziel
|
Ateret Anaby Tavor
Following the advancement of large language models (LLMs), the development of LLM-based autonomous agents has become prevalent.As a result, the need to understand the security vulnerabilities of these agents has become a critical task. We examine how ReAct agents can be exploited using a straightforward yet effective method we refer to as the foot-in-the-door attack.Our experiments show that indirect prompt injection attacks, prompted by harmless and unrelated requests (such as basic calculations) can significantly increase the likelihood of the agent performing subsequent malicious actions.Our results show that once a ReAct agent’s thought includes a specific tool or action, the likelihood of executing this tool in the subsequent steps increases significantly, as the agent seldom re-evaluates its actions. Consequently, even random, harmless requests can establish a ‘foot-in-the-door’, allowing an attacker to embed malicious instructions into the agent’s thought process, making it more susceptible to harmful directives.To mitigate this vulnerability, we propose implementing a simple reflection mechanism that prompts the agent to reassess the safety of its actions during execution, which can help reduce the success of such attacks.
pdf
bib
abs
As easy as PIE: understanding when pruning causes language models to disagree
Pietro Tropeano
|
Maria Maistro
|
Tuukka Ruotsalo
|
Christina Lioma
Language Model (LM) pruning compresses the model by removing weights, nodes, or other parts of its architecture. Typically, pruning focuses on the resulting efficiency gains at the cost of effectiveness.However, when looking at how individual data pointsare affected by pruning, it turns out that a particular subset of data points always bears most of the brunt (in terms of reduced accuracy) when pruning,but this effect goes unnoticed when reporting the mean accuracy of all data points. These data points are called PIEs and have been studied in image processing, but not in NLP.In a study of various NLP datasets, pruning methods, and levels of compression, we find that PIEs impact inference quality considerably, regardless of class frequency, andthat BERT is more prone to this than BiLSTM. We also find that PIEs contain a high amount of data points that have the largest influence on how well the model generalises to unseen data. This means that when pruning, with seemingly moderate loss to accuracy across all data points, we in fact hurt tremendously those data points that matter the most. We trace what makes PIEs both hard and impactful to inference to their overall longer and more semantically complex text. These findings are novel and contribute to understanding how LMs are affected by pruning. The code is available at: https://github.com/pietrotrope/AsEasyAsPIE
pdf
bib
abs
Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction
Shengbin Yue
|
Ting Huang
|
Zheng Jia
|
Siyuan Wang
|
Shujun Liu
|
Yun Song
|
Xuanjing Huang
|
Zhongyu Wei
Large Language Models (LLMs) have significantly advanced legal intelligence, but the scarcity of scenario data impedes the progress toward interactive legal scenarios. This paper introduces a Multi-agent Legal Simulation Driver (MASER) to scalably generate synthetic data by simulating interactive legal scenarios. Leveraging real-legal case sources, MASER ensures the consistency of legal attributes between participants and introduces a supervisory mechanism to align participants’ characters and behaviors as well as addressing distractions. A Multi-stage Interactive Legal Evaluation (MILE) benchmark is further constructed to evaluate LLMs’ performance in dynamic legal scenarios. Extensive experiments confirm the effectiveness of our framework.
pdf
bib
abs
Exploring Backward Reasoning in Large Language Models
Leonardo Ranaldi
|
Giulia Pucci
Multi-step reasoning through in-context learning strategies have been extensively explored, highlighting the abilities of Large Language Models (LLMs) to generate answers derived from step-by-step reasoning. These studies focus the attention on LLMs’ forward reasoning abilities epitomised in a series of general premises leading to a final solution. In this paper, by taking the reverse perspective, we study the backward reasoning abilities of LLMs, namely the inference that leads to the causal hypothesis. Behind formalising the backward problems, we analyse whether the LLMs are able to reason about the conclusion and reconstruct the original question that led to the delivery of the final answer. Operating with question-answering tasks involving symbolic reasoning, understanding, and commonsense abilities, we observe that the proposed models reveal robust comprehension capabilities managing different kinds of input; however, they are not always able to reason in the backward direction. Finally, to challenge this limitation, we demonstrate that instructing LLMs to generate the answer by reconsidering the structure of the problem allows for improved backward reasoning direction.
pdf
bib
abs
MMLF: Multi-query Multi-passage Late Fusion Retrieval
Yuan-Ching Kuo
|
Yi Yu
|
Chih-Ming Chen
|
Chuan-Ju Wang
Leveraging large language models (LLMs) for query expansion has proven highly effective across diverse tasks and languages. Yet, challenges remain in optimizing query formatting and prompting, often with less focus on handling retrieval results. In this paper, we introduce Multi-query Multi-passage Late Fusion (MMLF), a straightforward yet potent pipeline that generates sub-queries, expands them into pseudo-documents, retrieves them individually, and aggregates results using reciprocal rank fusion. Our experiments demonstrate that MMLF exhibits superior performance across five BEIR benchmark datasets, achieving an average improvement of 4% and a maximum gain of up to 8% in both Recall@1k and nDCG@10 compared to state of the art across BEIR information retrieval datasets.
pdf
bib
abs
Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models
Weidi Luo
|
He Cao
|
Zijing Liu
|
Yu Wang
|
Aidan Wong
|
Bin Feng
|
Yuan Yao
|
Yu Li
With the extensive deployment of Large Language Models (LLMs), ensuring their safety has become increasingly critical. However, existing defense methods often struggle with two key issues: (i) inadequate defense capabilities, particularly in domain-specific scenarios like chemistry, where a lack of specialized knowledge can lead to the generation of harmful responses to malicious queries. (ii) over-defensiveness, which compromises the general utility and responsiveness of LLMs. To mitigate these issues, we introduce a multi-agents-based defense framework, Guide for Defense (G4D), which leverages accurate external information to provide an unbiased summary of user intentions and analytically grounded safety response guidance. Extensive experiments on popular jailbreak attacks and benign datasets show that our G4D can enhance LLM’s robustness against jailbreak attacks on general and domain-specific scenarios without compromising the model’s general functionality.
pdf
bib
abs
kNN For Whisper And Its Effect On Bias And Speaker Adaptation
Maya K. Nachesa
|
Vlad Niculae
Speech recognition performance varies by language, domain, and speaker characteristics such as accent, but fine-tuning a model on any of these categories may lead to catastrophic forgetting. Token-level k nearest neighbor search (kNN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that instead adapts using inference-time search in an external datastore, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from kNN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
pdf
bib
abs
VisualCoder: Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning
Cuong Le Chi
|
Chau Truong Vinh Hoang
|
Phan Nhật Huy
|
Dung D. Le
|
Tien N Nguyen
|
Nghi D. Q. Bui
Predicting program behavior and reasoning about code execution remain significant challenges in software engineering, particularly for large language models (LLMs) designed for code analysis. While these models excel at understanding static syntax, they often struggle with dynamic reasoning tasks. We introduce VisualCoder, a simple yet effective approach that enhances code reasoning by integrating multimodal Chain-of-Thought (CoT) reasoning with a visual Control Flow Graph (CFG). By aligning code snippets with their corresponding CFGs, VisualCoder provides deeper insights into execution flows. We address challenges in multimodal CoT integration through a reference mechanism, ensuring consistency between code and its execution path, thereby improving performance in program behavior prediction, error detection, and output generation.
pdf
bib
abs
Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
Luca Moroni
|
Giovanni Puccetti
|
Pere-Lluís Huguet Cabot
|
Andrei Stefan Bejgu
|
Alessio Miaschi
|
Edoardo Barba
|
Felice Dell’Orletta
|
Andrea Esuli
|
Roberto Navigli
The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token “fertility”) and slower inference speed.In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.
pdf
bib
abs
Beyond the Mode: Sequence-Level Distillation of Multilingual Translation Models for Low-Resource Language Pairs
Aarón Galiano-Jiménez
|
Juan Antonio Pérez-Ortiz
|
Felipe Sánchez-Martínez
|
Víctor M. Sánchez-Cartagena
This paper delves into sequence-level knowledge distillation (KD) of multilingual pre-trained translation models. We posit that, beyond the approximated mode obtained via beam search, the whole output distribution of the teacher contains valuable insights for students. We explore the potential of n-best lists from beam search to guide student’s learning and then investigate alternative decoding methods to address observed issues like low variability and under-representation of infrequent tokens. Our research in data-limited scenarios reveals that although sampling methods can slightly compromise the translation quality of the teacher output compared to beam search based methods, they enrich the generated corpora with increased variability and lexical richness, ultimately enhancing student model performance and reducing the gender bias amplification commonly associated with KD.
pdf
bib
abs
LLMs for Extremely Low-Resource Finno-Ugric Languages
Taido Purason
|
Hele-Andra Kuulmets
|
Mark Fishel
The advancement of large language models (LLMs) has predominantly focused on high-resource languages, leaving low-resource languages, such as those in the Finno-Ugric family, significantly underrepresented. This paper addresses this gap by focusing on Võro, Livonian, and Komi. We cover almost the entire cycle of LLM creation, from data collection to instruction tuning and evaluation. Our contributions include developing multilingual base and instruction-tuned models; creating evaluation benchmarks, including the smugri-MT-bench multi-turn conversational benchmark; and conducting human evaluation. We intend for this work to promote linguistic diversity, ensuring that lesser-resourced languages can benefit from advancements in NLP.
pdf
bib
abs
LOFT: Scalable and More Realistic Long-Context Evaluation
Jinhyuk Lee
|
Anthony Chen
|
Zhuyun Dai
|
Dheeru Dua
|
Devendra Singh Sachan
|
Michael Boratko
|
Yi Luan
|
Séb Arnold
|
Vincent Perot
|
Siddharth Dalmia
|
Hexiang Hu
|
Xudong Lin
|
Panupong Pasupat
|
Aida Amini
|
Jeremy R. Cole
|
Sebastian Riedel
|
Iftekhar Naim
|
Ming-Wei Chang
|
Kelvin Guu
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs’ ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs’ performance on in-context retrieval and reasoning. Our findings reveal LCLMs’ surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their capabilities to tackle existing paradigms.
pdf
bib
abs
On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems
Juraj Vladika
|
Florian Matthes
Retrieval-augmented generation (RAG) has emerged as an approach to augment large language models (LLMs) by reducing their reliance on static knowledge and improving answer factuality. RAG retrieves relevant context snippets and generates an answer based on them. Despite its increasing industrial adoption, systematic exploration of RAG components is lacking, particularly regarding the ideal size of provided context, and the choice of base LLM and retrieval method. To help guide development of robust RAG systems, we evaluate various context sizes, BM25 and semantic search as retrievers, and eight base LLMs. Moving away from the usual RAG evaluation with short answers, we explore the more challenging long-form question answering in two domains, where a good answer has to utilize the entire context. Our findings indicate that final QA performance improves steadily with up to 15 snippets but stagnates or declines beyond that. Finally, we show that different general-purpose LLMs excel in the biomedical domain than the encyclopedic one, and that open-domain evidence retrieval in large corpora is challenging.
pdf
bib
abs
Aligning Black-box Language Models with Human Judgments
Gerrit J.j. Van Den Burg
|
Gen Suzuki
|
Wei Liu
|
Murat Sensoy
Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks, where relying on human evaluators can be costly, time-consuming, and unscalable. LLMs offer an efficient solution for continuous, automated evaluation. However, since the systems that are built and improved with these judgments are ultimately designed for human use, it is crucial that LLM judgments align closely with human evaluators to ensure such systems remain human-centered. On the other hand, aligning LLM judgments with human evaluators is challenging due to individual variability and biases in human judgments. We propose a simple yet effective framework to align LLM judgments with individual human evaluators or their aggregated judgments, without retraining or fine-tuning the LLM. Our approach learns a linear mapping between the LLM’s outputs and human judgments, achieving over 142% average improvement in agreement across 29 tasks with only a small number of calibration examples used for training. Notably, our method works in zero-shot and few-shot settings, exceeds inter-human agreement on four out of six tasks, and enables smaller LLMs to achieve performance comparable to that of larger models.
pdf
bib
abs
Guideline Compliance in Task-Oriented Dialogue: The Chained Prior Approach
Xiangyu Wen
|
Jianyuan Zhong
|
Zhijian Xu
|
Qiang Xu
Task-oriented dialogue (TOD) systems are widely used across various domains, including customer service, appointment scheduling, and technical support. In real-world scenarios, such systems must adhere to given operational guidelines. However, existing solutions based on large language models often cannot achieve strict guideline compliance, even when fine-tuned with domain knowledge. To address this issue, we introduce a novel TOD system named GuidedTOD, which explicitly considers domain-specific guidelines by integrating a policy module. This module employs a Markov Chain, termed Chained Prior, to efficiently encode and dynamically update guideline knowledge. During inference, the Chained Prior re-ranks outputs from the domain-expert language model using beam search, ensuring guideline adherence. Experimental results show that GuidedTOD significantly improves guideline compliance, achieving approximately 20% better action prediction accuracy than state-of-the-art solutions. Code is available here: https://github.com/cure-lab/GuidedTOD.
pdf
bib
abs
AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization via Multi-LLMs
Jiawei Chen
|
Xiao Yang
|
Zhengwei Fang
|
Yu Tian
|
Yinpeng Dong
|
Zhaoxia Yin
|
Hang Su
Recent studies show that large language models (LLMs) are vulnerable to jailbreak attacks, which can bypass their defense mechanisms. However, existing jailbreak research often exhibits limitations in universality, validity, and efficiency. Therefore, we rethink jailbreaking LLMs and define three key properties to guide the design of effective jailbreak methods. We introduce AutoBreach, a novel black-box approach that uses wordplay-guided mapping rule sampling to create universal adversarial prompts. By leveraging LLMs’ summarization and reasoning abilities, AutoBreach minimizes manual effort. To boost jailbreak success rates, we further suggest sentence compression and chain-of-thought-based mapping rules to correct errors and wordplay misinterpretations in target LLMs. Also, we propose a two-stage mapping rule optimization that initially optimizes mapping rules before querying target LLMs to enhance efficiency. Experimental results indicate AutoBreach efficiently identifies security vulnerabilities across various LLMs (Claude-3, GPT-4, etc.), achieving an average success rate of over 80% with fewer than 10 queries. Notably, the adversarial prompts generated by AutoBreach for GPT-4 can directly bypass the defenses of the advanced commercial LLM GPT o1-preview, demonstrating strong transferability and universality.
pdf
bib
𝒮2IT: Stepwise Syntax Integration Tuning for Large Language Models in Aspect Sentiment Quad Prediction
Bingfeng Chen
|
Chenjie Qiu
|
Yifeng Xie
|
Boyan Xu
|
Ruichu Cai
|
Zhifeng Hao
pdf
bib
abs
BanNERD: A Benchmark Dataset and Context-Driven Approach for Bangla Named Entity Recognition
Md. Motahar Mahtab
|
Faisal Ahamed Khan
|
Md. Ekramul Islam
|
Md. Shahad Mahmud Chowdhury
|
Labib Imam Chowdhury
|
Sadia Afrin
|
Hazrat Ali
|
Mohammad Mamun Or Rashid
|
Nabeel Mohammed
|
Mohammad Ruhul Amin
In this study, we introduce
BanNERD, the most extensive human-annotated and validated
Bangla
Named
Entity
Recognition
Dataset to date, comprising over 85,000 sentences. BanNERD is curated from a diverse array of sources, spanning over 29 domains, thereby offering a comprehensive range of generalized contexts. To ensure the dataset’s quality, expert linguists developed a detailed annotation guideline tailored to the Bangla language. All annotations underwent rigorous validation by a team of validators, with final labels being determined via majority voting, thereby ensuring the highest annotation quality and a high IAA score of 0.88. In a cross-dataset evaluation, models trained on BanNERD consistently outperformed those trained on four existing Bangla NER datasets. Additionally, we propose a method named
BanNERCEM (Bangla NER context-ensemble Method) which outperforms existing approaches on Bangla NER datasets and performs competitively on English datasets using lightweight Bangla pretrained LLMs. Our approach passes each context separately to the model instead of previous concatenation-based approaches achieving the highest average macro F1 score of 81.85% across 10 NER classes, outperforming previous approaches and ensuring better context utilization. We are making the code and datasets publicly available at
https://github.com/eblict-gigatech/BanNERD in order to contribute to the further advancement of Bangla NLP.
pdf
bib
abs
Large Language Models Reflect Human Citation Patterns with a Heightened Citation Bias
Andres Algaba
|
Carmen Mazijn
|
Vincent Holst
|
Floriano Tori
|
Sylvia Wenmackers
|
Vincent Ginis
Citation practices are crucial in shaping the structure of scientific knowledge, yet they are often influenced by contemporary norms and biases. The emergence of Large Language Models (LLMs) introduces a new dynamic to these practices. Interestingly, the characteristics and potential biases of references recommended by LLMs that entirely rely on their parametric knowledge, and not on search or retrieval-augmented generation, remain unexplored. Here, we analyze these characteristics in an experiment using a dataset from AAAI, NeurIPS, ICML, and ICLR, published after GPT-4’s knowledge cut-off date. In our experiment, LLMs are tasked with suggesting scholarly references for the anonymized in-text citations within these papers. Our findings reveal a remarkable similarity between human and LLM citation patterns, but with a more pronounced high citation bias, which persists even after controlling for publication year, title length, number of authors, and venue. The results hold for both GPT-4, and the more capable models GPT-4o and Claude 3.5 where the papers are part of the training data. Additionally, we observe a large consistency between the characteristics of LLM’s existing and non-existent generated references, indicating the model’s internalization of citation patterns. By analyzing citation graphs, we show that the references recommended are embedded in the relevant citation context, suggesting an even deeper conceptual internalization of the citation networks. While LLMs can aid in citation generation, they may also amplify existing biases, such as the Matthew effect, and introduce new ones, potentially skewing scientific knowledge dissemination.
pdf
bib
abs
What can Large Language Models Capture about Code Functional Equivalence?
Nickil Maveli
|
Antonio Vergari
|
Shay B Cohen
Code-LLMs, LLMs pre-trained on large code corpora, have shown great progress in learning rich representations of the structure and syntax of code, successfully using it to generate or classify code fragments. At the same time, understanding if they are able to do so because they capture code semantics, and how well, is still an open question. In this paper, we tackle this problem by introducing SeqCoBench, a benchmark for systematically assessing how Code-LLMs can capture code functional equivalence. SeqCoBench contains over 20 code transformations that either preserve or alter the semantics of Python programs. We conduct extensive evaluations in different settings, including zero-shot and parameter-efficient finetuning methods on state-of-the-art (Code)-LLMs to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench. We find that the performance gap between these LLMs and classical match-based retrieval scores is minimal, with both approaches showing a concerning lack of depth in understanding code semantics.
pdf
bib
abs
Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning
Xinglin Wang
|
Shaoxiong Feng
|
Yiwei Li
|
Peiwen Yuan
|
Yueqi Zhang
|
Chuyi Tan
|
Boyuan Pan
|
Yao Hu
|
Kan Li
Self-consistency (SC), a widely used decoding strategy for chain-of-thought reasoning, shows significant gains across various multi-step reasoning tasks but comes with a high cost due to multiple sampling with the preset size. Its variants, Adaptive self-consistency (ASC) and Early-stopping self-consistency (ESC), dynamically adjust the number of samples based on the posterior distribution of a set of pre-samples, reducing the cost of SC with minimal impact on performance. Both methods, however, do not exploit the prior information about question difficulty. It often results in unnecessary repeated sampling for easy questions that could be accurately answered with just one attempt, wasting resources. To tackle this problem, we propose Difficulty-Adaptive Self-Consistency (DSC), which leverages the difficulty information of batch queries from both prior and posterior perspectives to adaptively allocate inference resources, further reducing the overall cost of SC. To demonstrate the effectiveness of DSC, we conduct extensive experiments on three popular categories of reasoning tasks: arithmetic, commonsense and symbolic reasoning on six benchmarks. The empirical results show that DSC consistently surpasses the strong baseline ASC and ESC in terms of costs by a significant margin, while attaining comparable performances.
pdf
bib
abs
Large Language Models Are Better Logical Fallacy Reasoners with Counterargument, Explanation, and Goal-Aware Prompt Formulation
Jiwon Jeong
|
Hyeju Jang
|
Hogun Park
The advancement of Large Language Models (LLMs) has greatly improved our ability to process complex language. However, accurately detecting logical fallacies remains a significant challenge. This study presents a novel and effective prompt formulation approach for logical fallacy detection, applicable in both supervised (fine-tuned) and unsupervised (zero-shot) settings. Our method enriches input text by incorporating implicit contextual information—counterarguments, explanations, and goals—which we query for validity within the argument’s context. We then rank these queries based on confidence scores to inform classification. We evaluate our approach across multiple datasets from 5 domains, covering 29 distinct fallacy types, using models from GPT and LLaMA series. The results show substantial improvements over state-of-the-art models: up to a 0.57 increase in F1-score in zero-shot settings and up to 0.45 in fine-tuned models. Extensive analyses further illustrate why and how our method excels.
pdf
bib
abs
MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing
Vlad Andrei Negru
|
Robert Vacareanu
|
Camelia Lemnaru
|
Mihai Surdeanu
|
Rodica Potolea
We introduce MorphNLI, a modular step-by-step approach to natural language inference (NLI). When classifying the premise-hypothesis pairs into entailment, contradiction, neutral, we use a language model to generate the necessary edits to incrementally transform (i.e., morph) the premise into the hypothesis. Then, using an off-the-shelf NLI model we track how the entailment progresses with these atomic changes, aggregating these intermediate labels into a final output. We demonstrate the advantages of our proposed method particularly in realistic cross-domain settings, where our method always outperforms strong baselines with improvements up to 12.6% (relative). Further, our proposed approach is explainable as the atomic edits can be used to understand the overall NLI label.
pdf
bib
abs
Unmasking Database Vulnerabilities: Zero-Knowledge Schema Inference Attacks in Text-to-SQL Systems
Đorđe Klisura
|
Anthony Rios
Text-to-SQL systems empower users to interact with databases using natural language, automatically translating queries into executable SQL code. However, their reliance on database schema information for SQL generation exposes them to significant security vulnerabilities, particularly schema inference attacks that can lead to unauthorized data access or manipulation. In this paper, we introduce a novel zero-knowledge framework for reconstructing the underlying database schema of text-to-SQL models without any prior knowledge of the database. Our approach systematically probes text-to-SQL models with specially crafted questions and leverages a surrogate GPT-4 model to interpret the outputs, effectively uncovering hidden schema elements—including tables, columns, and data types. We demonstrate that our method achieves high accuracy in reconstructing table names, with F1 scores of up to .99 for generative models and .78 for fine-tuned models, underscoring the severity of schema leakage risks. We also show that our attack can steal prompt information in non-text-to-SQL models. Furthermore, we propose a simple protection mechanism for generative models and empirically show its limitations in mitigating these attacks.
pdf
bib
abs
Media of Langue: Exploring Word Translation Network
Goki Muramoto
|
Atsuki Sato
|
Takayoshi Koyama
In the human activity of word translation, two languages face each other, mutually searching their own language system for the semantic place of words in the other language. We discover the huge network formed by the chain of these mutual translations as *Word Translation Network*, a network where words are nodes, and translation volume is represented as edges, and propose *Word Translation Map*, a novel interface for exploring this network. *Word Translation Map* points to the semantic configurations of many words in multiple languages at once, containing the information of existing dictionaries such as bilingual and synonym dictionaries. We have also implemented and published this interface as a web application, focusing on seven language pairs. This paper first defines the *Word Translation Network* and describes how to actually construct the network from bilingual corpora, followed by an analysis of the properties of the network. Next, we explain how to design a *Word Translation Map* using this network, and finally, we analyze the features of the *Word Translation Map* as a dictionary. The web application is publicly accessible at www.media-of-langue.org.
pdf
bib
abs
Tackling Social Bias against the Poor: a Dataset and a Taxonomy on Aporophobia
Georgina Curto
|
Svetlana Kiritchenko
|
Muhammad Hammad Fahim Siddiqui
|
Isar Nejadgholi
|
Kathleen C. Fraser
Eradicating poverty is the first goal in the U.N. Sustainable Development Goals. However, aporophobia – the societal bias against people living in poverty – constitutes a major obstacle to designing, approving and implementing poverty-mitigation policies. This work presents an initial step towards operationalizing the concept of aporophobia to identify and track harmful beliefs and discriminative actions against poor people on social media. In close collaboration with non-profits and governmental organizations, we conduct data collection and exploration. Then we manually annotate a corpus of English tweets from five world regions for the presence of (1) direct expressions of aporophobia, and (2) statements referring to or criticizing aporophobic views or actions of others, to comprehensively characterize the social media discourse related to bias and discrimination against the poor. Based on the annotated data, we devise a taxonomy of categories of aporophobic attitudes and actions expressed through speech on social media. Finally, we train several classifiers and identify the main challenges for automatic detection of aporophobia in social networks. This work paves the way towards identifying, tracking, and mitigating aporophobic views on social media at scale.
pdf
bib
abs
The American Sign Language Knowledge Graph: Infusing ASL Models with Linguistic Knowledge
Lee Kezar
|
Nidhi Munikote
|
Zian Zeng
|
Zed Sehyr
|
Naomi Caselli
|
Jesse Thomason
Sign language models could make modern language technologies more accessible to those who sign, but the supply of accurately labeled data struggles to meet the demand associated with training large, end-to-end neural models. As an alternative to this approach, we explore how knowledge about the linguistic structure of signs may be used as inductive priors for learning sign recognition and comprehension tasks. We first construct the American Sign Language Knowledge Graph (ASLKG) from 11 sources of linguistic knowledge, with emphasis on features related to signs’ phonological and lexical-semantic properties. Then, we use the ASLKG to train neuro-symbolic models on ASL video input tasks, achieving accuracies of 91% for isolated sign recognition, 14% for predicting the semantic features of unseen signs, and 36% for classifying the topic of Youtube-ASL videos.
pdf
bib
abs
Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting
Mohamed Salim Aissi
|
Clément Romac
|
Thomas Carta
|
Sylvain Lamprier
|
Pierre-Yves Oudeyer
|
Olivier Sigaud
|
Laure Soulier
|
Nicolas Thome
Reinforcement learning (RL) is a promising approach for aligning large language models (LLMs) knowledge with sequential decision-making tasks. However, few studies have thoroughly investigated the impact on LLM agents capabilities of fine-tuning them with RL in a specific environment. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. Our findings reveal that the performance of LLMs degrades when faced with prompt formulations different from those used during the RL training phase. Besides, we analyze the source of this sensitivity by examining the model’s internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs.
pdf
bib
abs
An empirical study of validating synthetic data for formula generation
Usneek Singh
|
José Cambronero
|
Sumit Gulwani
|
Aditya Kanade
|
Anirudh Khatry
|
Vu Le
|
Mukul Singh
|
Gust Verbruggen
Large language models (LLMs) can be leveraged to help write formulas in spreadsheets, but formula data resources are scarce, impacting both the base performance of pre-trained models and limiting the ability to fine-tune them. Given a corpus of formulas, we can use another model to generate synthetic natural language utterances for fine-tuning. However, it is important to validate whether the natural language (NL) generated by the LLM is accurate for it to be beneficial for fine-tuning. In this paper, we provide empirical results on the impact of validating these synthetic training examples with surrogate objectives that evaluate the accuracy of the synthetic annotations. We demonstrate that validation improves performance over raw data across four models (2 open and 2 closed weight). Interestingly, we show that although validation tends to prune more challenging examples, it increases the complexity of problems that models can solve after being fine-tuned on validated data.
pdf
bib
abs
TeCoFeS: Text Column Featurization using Semantic Analysis
Ananya Singha
|
Mukul Singh
|
Ashish Tiwari
|
Sumit Gulwani
|
Vu Le
|
Chris Parnin
Extracting insights from text columns can bechallenging and time-intensive. Existing methods for topic modeling and feature extractionare based on syntactic features and often overlook the semantics. We introduce the semantictext column featurization problem, and presenta scalable approach for automatically solvingit. We extract a small sample smartly, use alarge language model (LLM) to label only thesample, and then lift the labeling to the wholecolumn using text embeddings. We evaluateour approach by turning existing text classification benchmarks into semantic categorization benchmarks. Our approach performs better than baselines and naive use of LLMs.
pdf
bib
abs
CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation
Xi Xu
|
Wenda Xu
|
Siqi Ouyang
|
Lei Li
Simultaneous speech translation (SimulST) systems must balance translation quality with response time, making latency measurement crucial for evaluating their real-world performance. However, there has been a longstanding belief that current metrics yield unrealistically high latency measurements in unsegmented streaming settings. In this paper, we investigate this phenomenon, revealing its root cause in a fundamental misconception underlying existing latency evaluation approaches. We demonstrate that this issue affects not only streaming but also segment-level latency evaluation across different metrics. Furthermore, we propose a modification to correctly measure computation-aware latency for SimulST systems, addressing the limitations present in existing metrics.
pdf
bib
abs
Augmented Adversarial Trigger Learning
Zhe Wang
|
Yanjun Qi
Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA improves the negative log-likelihood loss used by previous studies into a weighted loss formulation that encourages the learned adversarial triggers to optimize more towards response format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair and the learned trigger generalizes well to other similar queries. We further design a variation to augment trigger optimization with an auxiliary loss that suppresses evasive responses. We showcase how to use ATLA to learn adversarial suffixes jailbreaking LLMs and to extract hidden system prompts. Empirically we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% success in attacking while requiring 80% fewer queries. ATLA learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs.
pdf
bib
abs
Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents
Qiusi Zhan
|
Richard Fang
|
Henil Shalin Panchal
|
Daniel Kang
Large Language Model (LLM) agents exhibit remarkable performance across diverse applications by using external tools to interact with environments. However, integrating external tools introduces security risks, such as indirect prompt injection (IPI) attacks. Despite defenses designed for IPI attacks, their robustness remains questionable due to insufficient testing against adaptive attacks.In this paper, we evaluate eight different defenses and bypass all of them using adaptive attacks, consistently achieving an attack success rate of over 50%.This reveals critical vulnerabilities in current defenses. Our research underscores the need for adaptive attack evaluation when designing defenses to ensure robustness and reliability.The code is available at https://github.com/uiuc-kang-lab/AdaptiveAttackAgent.
pdf
bib
abs
Flaming-hot Initiation with Regular Execution Sampling for Large Language Models
Weizhe Chen
|
Zhicheng Zhang
|
Guanlin Liu
|
Renjie Zheng
|
Wenlei Shi
|
Chen Dun
|
Zheng Wu
|
Xing Jin
|
Lin Yan
Since the release of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities across various domains. A key challenge in developing these general capabilities is efficiently sourcing diverse, high-quality data. This becomes especially critical in reasoning-related tasks with sandbox checkers, such as math or code, where the goal is to generate correct solutions to specific problems with higher probability. In this work, we introduce Flaming-hot Initiation with Regular Execution (FIRE) sampling, a simple yet highly effective method to efficiently find good responses. Our empirical findings show that FIRE sampling enhances inference-time generation quality and also benefits training in the alignment stage. Furthermore, we explore how FIRE sampling improves performance by promoting diversity and analyze the impact of employing FIRE at different positions within a response.
pdf
bib
abs
HEISIR: Hierarchical Expansion of Inverted Semantic Indexing for Training-free Retrieval of Conversational Data using LLMs
Sangyeop Kim
|
Hangyeul Lee
|
Yohan Lee
The growth of conversational AI services has increased demand for effective information retrieval from dialogue data. However, existing methods often face challenges in capturing semantic intent or require extensive labeling and fine-tuning. This paper introduces HEISIR (Hierarchical Expansion of Inverted Semantic Indexing for Retrieval), a novel framework that enhances semantic understanding in conversational data retrieval through optimized data ingestion, eliminating the need for resource-intensive labeling or model adaptation.HEISIR implements a two-step process: (1) Hierarchical Triplets Formulation and (2) Adjunct Augmentation, creating semantic indices consisting of Subject-Verb-Object-Adjunct (SVOA) quadruplets. This structured representation effectively captures the underlying semantic information from dialogue content. HEISIR achieves high retrieval performance while maintaining low latency during the actual retrieval process. Our experimental results demonstrate that HEISIR outperforms fine-tuned models across various embedding types and language models. Beyond improving retrieval capabilities, HEISIR also offers opportunities for intent and topic analysis in conversational data, providing a versatile solution for dialogue systems.
pdf
bib
abs
“Women do not have heart attacks!” Gender Biases in Automatically Generated Clinical Cases in French
Fanny Ducel
|
Nicolas Hiebel
|
Olivier Ferret
|
Karën Fort
|
Aurélie Névéol
Healthcare professionals are increasingly including Language Models (LMs) in clinical practice. However, LMs have been shown to exhibit and amplify stereotypical biases that can cause life-threatening harm in a medical context. This study aims to evaluate gender biases in automatically generated clinical cases in French, on ten disorders. Using seven LMs fine-tuned for clinical case generation and an automatic linguistic gender detection tool, we measure the associations between disorders and gender. We unveil that LMs over-generate cases describing male patients, creating synthetic corpora that are not consistent with documented prevalence for these disorders. For instance, when prompts do not specify a gender, LMs generate eight times more clinical cases describing male (vs. female patients) for heart attack. We discuss the ideal synthetic clinical case corpus and establish that explicitly mentioning demographic information in generation instructions appears to be the fairest strategy. In conclusion, we argue that the presence of gender biases in synthetic text raises concerns about LM-induced harm, especially for women and transgender people.
pdf
bib
abs
NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
Mingni Tang
|
Jiajia Li
|
Lu Yang
|
Zhiqiang Zhang
|
Jinhao Tian
|
Zuchao Li
|
Lefei Zhang
|
Ping Wang
Symbolic music is represented in two distinct forms: two-dimensional, visually intuitive score images, and one-dimensional, standardized text annotation sequences. While large language models have shown extraordinary potential in music, current research has primarily focused on unimodal symbol sequence text. Existing general-domain visual language models still lack the ability of music notation understanding. Recognizing this gap, we propose NOTA, the first large-scale comprehensive multimodal music notation dataset. It consists of 1,019,237 records, from 3 regions of the world, and contains 3 tasks. Based on the dataset, we trained NotaGPT, a music notation visual large language model. Specifically, we involve a pre-alignment training phase for cross-modal alignment between the musical notes depicted in music score images and their textual representation in ABC notation. Subsequent training phases focus on foundational music information extraction, followed by training on music score notation analysis. Experimental results demonstrate that our NotaGPT-7B achieves significant improvement on music understanding, showcasing the effectiveness of NOTA and the training pipeline.
pdf
bib
abs
Exploring Large Language Models for Hate Speech Detection in Rioplatense Spanish
Juan Manuel Pérez
|
Paula Miguel
|
Viviana Cotik
Hate speech detection deals with many language variants, slang, slurs, expression modalities, and cultural nuances. This outlines the importance of working with specific corpora, when addressing hate speech within the scope of Natural Language Processing, recently revolutionized by the irruption of Large Language Models. This work presents a brief analysis of the performance of large language models in the detection of Hate Speech for Rioplatense Spanish. We performed classification experiments leveraging chain-of-thought reasoning with ChatGPT 3.5, Mixtral, and Aya, comparing their results with those of a state-of-the-art BERT classifier. These experiments outline that, even if large language models show a lower precision compared to the fine-tuned BERT classifier and, in some cases, they find hard-to-get slurs or colloquialisms, they still are sensitive to highly nuanced cases (particularly, homophobic/transphobic hate speech). We make our code and models publicly available for future research.
pdf
bib
abs
An Annotated Dataset of Errors in Premodern Greek and Baselines for Detecting Them
Creston Brooks
|
Johannes Haubold
|
Charlie Cowen-Breen
|
Jay White
|
Desmond DeVaul
|
Frederick Riemenschneider
|
Karthik R Narasimhan
|
Barbara Graziosi
As premodern texts are passed down over centuries, errors inevitably accrue. These errors can be challenging to identify, as some have survived undetected for so long precisely because they are so elusive. While prior work has evaluated error detection methods on artificially-generated errors, we introduce the first dataset of real errors in premodern Greek, enabling the evaluation of error detection methods on errors that genuinely accumulated at some stage in the centuries-long copying process. To create this dataset, we use metrics derived from BERT conditionals to sample 1,000 words more likely to contain errors, which are then annotated and labeled by a domain expert as errors or not. We then propose and evaluate new error detection methods and find that our discriminator-based detector outperforms all other methods, improving the true positive rate for classifying real errors by 5%. We additionally observe that scribal errors are more difficult to detect than print or digitization errors. Our dataset enables the evaluation of error detection methods on real errors in premodern texts for the first time, providing a benchmark for developing more effective error detection algorithms to assist scholars in restoring premodern works.
pdf
bib
abs
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
João Matos
|
Shan Chen
|
Siena Kathleen V. Placino
|
Yingya Li
|
Juan Carlos Climent Pardo
|
Daphna Idan
|
Takeshi Tohyama
|
David Restrepo
|
Luis Filipe Nakayama
|
José María Millet Pascual-Leone
|
Guergana K Savova
|
Hugo Aerts
|
Leo Anthony Celi
|
An-Kwok Ian Wong
|
Danielle Bitterman
|
Jack Gallifant
Multimodal/vision language models (VLMs) are increasingly being deployed in healthcare settings worldwide, necessitating robust benchmarks to ensure their safety, efficacy, and fairness. Multiple-choice question and answer (QA) datasets derived from national medical examinations have long served as valuable evaluation tools, but existing datasets are largely text-only and available in a limited subset of languages and countries. To address these challenges, we present WorldMedQA-V, an updated multilingual, multimodal benchmarking dataset designed to evaluate VLMs in healthcare. WorldMedQA-V includes 568 labeled multiple-choice QAs paired with 568 medical images from four countries (Brazil, Israel, Japan, and Spain), covering original languages and validated English translations by native clinicians, respectively. Baseline performance for common open- and closed-source models are provided in the local language and English translations, and with and without images provided to the model. The WorldMedQA-V benchmark aims to better match AI systems to the diverse healthcare environments in which they are deployed, fostering more equitable, effective, and representative applications.
pdf
bib
abs
BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla
Fabiha Haider
|
Fariha Tanjim Shifat
|
Md Farhan Ishmam
|
Md Sakib Ul Rahman Sourove
|
Deeparghya Dutta Barua
|
Md Fahim
|
Md Farhad Alam Bhuiyan
The proliferation of transliterated texts in digital spaces has emphasized the need for detecting and classifying hate speech in languages beyond English, particularly in low-resource languages. As online discourse can perpetuate discrimination based on target groups, e.g. gender, religion, and origin, multi-label classification of hateful content can help in understanding hate motivation and enhance content moderation. While previous efforts have focused on monolingual or binary hate classification tasks, no work has yet addressed the challenge of multi-label hate speech classification in transliterated Bangla. We introduce BanTH, the first multi-label transliterated Bangla hate speech dataset. The samples are sourced from YouTube comments, where each instance is labeled with one or more target groups, reflecting the regional demographic. We propose a novel translation-based LLM prompting strategy that translates or transliterates under-resourced text to higher-resourced text before classifying the hate group(s). Experiments reveal further pre-trained encoders achieving state-of-the-art performance on the BanTH dataset while translation-based prompting outperforms other strategies in the zero-shot setting. We address a critical gap in Bangla hate speech and set the stage for further exploration into code-mixed and multi-label classification in underrepresented languages.
pdf
bib
abs
Mutual Reinforcement of LLM Dialogue Synthesis and Summarization Capabilities for Few-Shot Dialogue Summarization
Yen-Ju Lu
|
Ting-Yao Hu
|
Hema Swetha Koppula
|
Hadi Pouransari
|
Jen-Hao Rick Chang
|
Yin Xia
|
Xiang Kong
|
Qi Zhu
|
Xiaoming Simon Wang
|
Oncel Tuzel
|
Raviteja Vemulapalli
In this work, we propose Mutual Reinforcing Data Synthesis (MRDS) within LLMs to improve few-shot dialogue summarization task. Unlike prior methods that require external knowledge, we mutually reinforce the LLM’s dialogue synthesis and summarization capabilities, allowing them to complement each other during training and enhance overall performances. The dialogue synthesis capability is enhanced by directed preference optimization with preference scoring from summarization capability. The summarization capability is enhanced by the additional high quality dialogue-summary paired data produced by the dialogue synthesis capability. By leveraging the proposed MRDS mechanism, we elicit the internal knowledge of LLM in the format of synthetic data, and use it to augment the few-shot real training dataset. Empirical results demonstrate that our method improves dialogue summarization, achieving a 1.5% increase in ROUGE scores and a 0.3% improvement in BERT scores in few-shot settings. Furthermore, our method attains the highest average scores in human evaluations, surpassing both the pre-trained models and the baselines fine-tuned solely for summarization tasks.
pdf
bib
abs
UNLEARN Efficient Removal of Knowledge in Large Language Models
Tyler Lizzo
|
Larry Heck
Large Language Models (LLMs) excel in many Natural Language Processing tasks but are outperformed by specialized tools for certain tasks. This raises the question: Can we reduce redundant LLM parameters when using these tools? Given the size and high training costs of LLMs, it is essential to efficiently forget specific knowledge without retraining. This paper introduces UNLEARN, a novel method that uses subspace techniques to selectively remove knowledge without access to the original training data, without retraining, and with minimal impact to other tasks. Our results show that UNLEARN significantly outperforms previous methods for forgetting targeted (unwanted) knowledge while also preserving related (wanted) knowledge. We also propose LEARN, a complementary approach for targeted knowledge addition, which achieves fine-tuning accuracy comparable to Low-Rank Adaptation (LoRA) without degrading related task performance.
pdf
bib
Adaptive Parameter Compression for Language Models
Jeremias Bohn
|
Frederic Mrozinski
|
Georg Groh
pdf
bib
abs
Personalize Your LLM: Fake it then Align it
Yijing Zhang
|
Dyah Adila
|
Changho Shin
|
Frederic Sala
Personalizing large language models (LLMs) is essential for delivering tailored interactions that improve user experience. Many existing personalization methods require fine-tuning LLMs for each user, rendering them prohibitively expensive for widespread adoption. Although retrieval-based approaches offer a more compute-efficient alternative, they still depend on large, high-quality datasets that are not consistently available for all users. To address this challenge, we propose Chameleon, a scalable and efficient personalization approach that uses (1) self-generated personal preference data and (2) representation editing to enable quick and cost-effective personalization. Our experiments on various tasks, including those from the LaMP personalization benchmark, show that Chameleon efficiently adapts models to personal preferences, improving instruction-tuned models and outperforms two personalization baselines by an average of 40% across two model architectures.
pdf
bib
abs
A Survey to Recent Progress Towards Understanding In-Context Learning
Haitao Mao
|
Guangliang Liu
|
Yao Ma
|
Rongrong Wang
|
Kristen Johnson
|
Jiliang Tang
In-Context Learning (ICL) empowers Large Language Models (LLMs) with the ability to learn from a few examples provided in the prompt, enabling downstream generalization without the requirement for gradient updates. Despite encouragingly empirical success, the underlying mechanism of ICL remains unclear. Existing research remains ambiguous with various viewpoints, utilizing intuition-driven and ad-hoc technical solutions to interpret ICL. In this paper, we leverage a data generation perspective to reinterpret recent efforts from a systematic angle, demonstrating the potential broader usage of these popular technical solutions. For a conceptual definition, we rigorously adopt the terms of skill recognition and skill learning. Skill recognition selects one learned data generation function previously seen during pre-training while skill learning can learn new data generation functions from in-context data. Furthermore, we provide insights into the strengths and weaknesses of both abilities, emphasizing their commonalities through the perspective of data generation. This analysis suggests potential directions for future research. The corresponding paper list can be found here.
pdf
bib
abs
Inference Scaling for Bridging Retrieval and Augmented Generation
Youngwon Lee
|
Seung-won Hwang
|
Daniel F Campos
|
Filip Graliński
|
Zhewei Yao
|
Yuxiong He
Retrieval-augmented generation (RAG) has emerged as a popular approach to steering the output of a large language model (LLM) by incorporating retrieved contexts as inputs. However, existing work observed the generator bias, such that improving the retrieval results may negatively affect the outcome. In this work, we show such bias can be mitigated, from inference scaling, aggregating inference calls from the permuted order of retrieved contexts. The proposed Mixture-of-Intervention (MoI) explicitly models the debiased utility of each passage with multiple forward passes to construct a new ranking. We also show that MoI can leverage the retriever’s prior knowledge to reduce the computational cost by minimizing the number of permutations considered and lowering the cost per LLM call. We showcase the effectiveness of MoI on diverse RAG tasks, improving ROUGE-L on MS MARCO and EM on HotpotQA benchmarks by ~7 points.
pdf
bib
abs
GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models
Aditya Sharma
|
Aman Dalmia
|
Mehran Kazemi
|
Amal Zouaq
|
Christopher Pal
Geometry problem-solving demands advanced reasoning abilities to process multimodal inputs and employ mathematical knowledge effectively. Vision-language models (VLMs) have made significant progress in various multimodal tasks. Yet, they still struggle with geometry problems and are significantly limited by their inability to perform mathematical operations not seen during pre-training, such as calculating the cosine of an arbitrary angle, and by difficulties in correctly applying relevant geometry formulas. To overcome these challenges, we present GeoCoder, which leverages modular code-finetuning to generate and execute code using a predefined geometry function library. By executing the code, we achieve accurate and deterministic calculations, contrasting the stochastic nature of autoregressive token prediction, while the function library minimizes errors in formula usage. We also propose a multimodal retrieval-augmented variant of GeoCoder, named RAG-GeoCoder, which incorporates a non-parametric memory module for retrieving functions from the geometry library, thereby reducing reliance on parametric memory. Our modular code-finetuning approach enhances the geometric reasoning capabilities of VLMs, yielding an average improvement of over 16% across various question complexities on the GeomVerse dataset compared to other fine-tuning methods.
pdf
bib
abs
SEEval: Advancing LLM Text Evaluation Efficiency and Accuracy through Self-Explanation Prompting
Meng-Chen Wu
|
Md Mosharaf Hossain
|
Tess Wood
|
Shayan Ali Akbar
|
Si-Chi Chin
|
Erwin Cornejo
Large language models (LLMs) have achieved remarkable success in various natural language generation (NLG) tasks, but their performance in automatic text evaluation is not yet ready as human replacements. In this paper, we propose SEEval (Self-Explanation in Evaluation), a novel prompt-based text evaluator. Inspired by educational psychology, SEEval incorporates self-explanation, a metacognitive strategy, to enhance automatic text evaluation. Our experimental results show that SEEval, without probability normalization, is able to achieve competitive and often superior performance compared to the two state-of-the-art baselines – G-Eval and Analyze-Rate – across all evaluation dimensions and is 20 times more efficient in terms of run-time. The SEEval method is also generalizable as its results are consistent across three other selected LLMs – Claude 3.5 Sonnet, Command R+, and Mistral-Large 2.
pdf
bib
abs
When natural language is not enough: The limits of in-context learning demonstrations in multilingual reasoning
Leonardo Ranaldi
|
Barry Haddow
|
Alexandra Birch
Previous studies have demonstrated the effectiveness of reasoning methods in eliciting multi-step reasoned answers from Large Language Models (LLMs) by leveraging in-context demonstrations. These methods, exemplified by Chain-of-Thought (CoT) and Program-Aided Language Models (PAL), have been shown to perform well in monolingual contexts, primarily in English. There has, however, been limited exploration of their abilities in other languages.To gain a deeper understanding of the role of reasoning methods for in-context demonstrations, we investigate how well CoT and PAL perform across languages for arithmetic and symbolic reasoning tasks. Our findings indicate that the effectiveness of reasoning methods varies significantly across different languages and models. Specifically, CoT, which relies on natural language demonstrations, tends to be more accurate in high-resource than in low-resource languages. Conversely, the structured nature of PAL demonstrations facilitates multilingual comprehension, enabling LLMs to generate programmatic answers in both high- and low-resource languages and leading to significant performance improvements over CoT concerning the accuracy of the generated responses.
pdf
bib
abs
Uncovering Latent Arguments in Social Media Messaging by Employing LLMs-in-the-Loop Strategy
Tunazzina Islam
|
Dan Goldwasser
The widespread use of social media has led to a surge in popularity for automated methods of analyzing public opinion. Supervised methods are adept at text categorization, yet the dynamic nature of social media discussions poses a continual challenge for these techniques due to the constant shifting of the focus. On the other hand, traditional unsupervised methods for extracting themes from public discourse, such as topic modeling, often reveal overarching patterns that might not capture specific nuances. Consequently, a significant portion of research into social media discourse still depends on labor-intensive manual coding techniques and a human-in-the-loop approach, which are both time-consuming and costly. In this work, we study the problem of discovering arguments associated with a specific theme. We propose a generic **LLMs-in-the-Loop** strategy that leverages the advanced capabilities of Large Language Models (LLMs) to extract latent arguments from social media messaging. To demonstrate our approach, we apply our framework to contentious topics. We use two publicly available datasets: (1) the climate campaigns dataset of 14k Facebook ads with 25 themes and (2) the COVID-19 vaccine campaigns dataset of 9k Facebook ads with 14 themes. Additionally, we design a downstream task as stance prediction by leveraging talking points in climate debates. Furthermore, we analyze demographic targeting and the adaptation of messaging based on real-world events.
pdf
bib
abs
AcrosticSleuth: Probabilistic Identification and Ranking of Acrostics in Multilingual Corpora
Aleksandr Fedchin
|
Isabel Cooperman
|
Pramit Chaudhuri
|
Joseph P. Dexter
For centuries, writers have hidden messages as acrostics, in which initial letters of consecutive lines or paragraphs form meaningful words or phrases. Scholars searching for acrostics manually can only focus on a few authors at a time and often favor qualitative arguments about whether a given acrostic is accidental or intentional. Here we describe AcrosticSleuth, a first-of-its-kind approach to identify acrostics automatically and rank them by the probability that the corresponding sequence of characters does not occur by chance. Since acrostics are rare, we formalize the problem as a binary classification task in the presence of extreme class imbalance. To evaluate AcrosticSleuth, we present the Acrostic Identification Dataset (AcrostID), a collection of acrostics from the WikiSource online database. Despite the class imbalance, AcrosticSleuth achieves F1 scores of 0.39, 0.59, and 0.66 on the French, English, and Russian subdomains of WikiSource, respectively. We further demonstrate that AcrosticSleuth can identify previously unknown instances of wordplay in high-profile literary contexts, including the English philosopher Thomas Hobbes’ signature in the opening paragraphs of The Elements of Law.
pdf
bib
abs
MedThink: A Rationale-Guided Framework for Explaining Medical Visual Question Answering
Xiaotang Gai
|
Chenyi Zhou
|
Jiaxiang Liu
|
Yang Feng
|
Jian Wu
|
Zuozhu Liu
Medical Visual Question Answering (Med-VQA), which offers language responses to image-based medical inquiries, represents a challenging task and significant advancement in healthcare. It assists medical experts to swiftly interpret medical images, thereby enabling faster and more accurate diagnoses. However, the model interpretability and transparency of existing Med-VQA solutions are often limited, posing challenges in understanding their decision-making processes. To address this issue, we devise a semi-automated annotation process to streamline data preparation and build new benchmark Med-VQA datasets R-RAD, R-SLAKE and R-Path. These datasets provide intermediate medical decision-making rationales generated by multimodal large language models and human annotations for question-answering pairs in existing Med-VQA datasets, i.e., VQA-RAD, SLAKE and PathVQA. Moreover, we design a novel framework, MedThink, which finetunes lightweight pretrained generative models by incorporating medical decision-making rationales. MedThink includes three distinct strategies to generate decision outcomes and corresponding rationales, clearly showcasing the medical decision-making process during reasoning. Our comprehensive experiments show that our method achieves an accuracy of 83.5% on R-RAD, 86.3% on R-SLAKE and 87.2% on R-Path. These results significantly exceed those of existing state-of-the-art models with comparable parameters. Datasets and code are available at https://github.com/Tang-xiaoxiao/Medthink.
pdf
bib
abs
How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation
Yan Meng
|
Di Wu
|
Christof Monz
The massive amounts of web-mined parallel data often contain large amounts of noise. Semantic misalignment, as the primary source of the noise, poses a challenge for training machine translation systems. In this paper, we first introduce a process for simulating misalignment controlled by semantic similarity, which closely resembles misaligned sentences in real-world web-crawled corpora. Under our simulated misalignment noise settings, we quantitatively analyze its impact on machine translation and demonstrate the limited effectiveness of widely used pre-filters for noise detection. This underscores the necessity of more fine-grained ways to handle hard-to-detect misalignment noise. By analyzing the reliability of the model’s self-knowledge for distinguishing misaligned and clean data at the token level, we propose self-correction—an approach that gradually increases trust in the model’s self-knowledge to correct the supervision signal during training. Comprehensive experiments show that our method significantly improves translation performance both in the presence of simulated misalignment noise and when applied to real-world, noisy web-mined datasets, across a range of translation tasks.
pdf
bib
abs
Rejected Dialects: Biases Against African American Language in Reward Models
Joel Mire
|
Zubin Trivadi Aysola
|
Daniel Chechelnitsky
|
Nicholas Deas
|
Chrysoula Zerva
|
Maarten Sap
Preference alignment via reward models helps build safe, helpful, and reliable large language models (LLMs). However, subjectivity in preference judgments and the lack of representative sampling in preference data collection can introduce new biases, hindering reward models’ fairness and equity. In this work, we introduce a framework for evaluating dialect biases in reward models and conduct a case study on biases against African American Language (AAL) through several experiments comparing reward model preferences and behavior on paired White Mainstream English (WME) and both machine-translated and human-written AAL corpora. We show that reward models are less aligned with human preferences when processing AAL texts vs. WME ones (-4% accuracy on average), frequently disprefer AAL-aligned texts vs. WME-aligned ones, and steer conversations toward WME, even when prompted with AAL texts. Our findings provide a targeted analysis of anti-AAL biases at a relatively understudied stage in LLM development, highlighting representational harms and ethical questions about the desired behavior of LLMs concerning AAL.
pdf
bib
abs
Do Large Language Models Align with Core Mental Health Counseling Competencies?
Viet Cuong Nguyen
|
Mohammad Taher
|
Dongwan Hong
|
Vinicius Konkolics Possobom
|
Vibha Thirunellayi Gopalakrishnan
|
Ekta Raj
|
Zihang Li
|
Heather J. Soled
|
Michael L. Birnbaum
|
Srijan Kumar
|
Munmun De Choudhury
The rapid evolution of Large Language Models (LLMs) presents a promising solution to the global shortage of mental health professionals. However, their alignment with essential counseling competencies remains underexplored. We introduce CounselingBench, a novel NCMHCE-based benchmark evaluating 22 general-purpose and medical-finetuned LLMs across five key competencies. While frontier models surpass minimum aptitude thresholds, they fall short of expert-level performance, excelling in Intake, Assessment & Diagnosis but struggling with Core Counseling Attributes and Professional Practice & Ethics. Surprisingly, medical LLMs do not outperform generalist models in accuracy, though they provide slightly better justifications while making more context-related errors. These findings highlight the challenges of developing AI for mental health counseling, particularly in competencies requiring empathy and nuanced reasoning. Our results underscore the need for specialized, fine-tuned models aligned with core mental health counseling competencies and supported by human oversight before real-world deployment. Code and data associated with this manuscript can be found at: https://github.com/cuongnguyenx/CounselingBench
pdf
bib
abs
Uncertainty Quantification for Clinical Outcome Predictions with (Large) Language Models
Zizhang Chen
|
Peizhao Li
|
Xiaomeng Dong
|
Pengyu Hong
To facilitate healthcare delivery, language models (LMs) have significant potential for clinical prediction tasks using electronic health records (EHRs). However, in these high-stakes applications, unreliable decisions can result in significant costs due to compromised patient safety and ethical concerns, thus increasing the need for good uncertainty modelling of automated clinical predictions. To address this, we consider uncertainty quantification of LMs for EHR tasks in both white-box and black-box settings. We first quantify uncertainty in white-box models, where we have access to model parameters and output logits. We show that an effective reduction of model uncertainty can be achieved by using the proposed multi-tasking and ensemble methods in EHRs. Continuing with this idea, we extend our approach to black-box settings, including popular proprietary LMs such as GPT-4. We validate our framework using longitudinal clinical data from over 6,000 patients across ten clinical prediction tasks. Results show that ensembling methods and multi-task prediction prompts reduce uncertainty across different scenarios. These findings increase model transparency in white-box and black-box settings, thereby advancing reliable AI healthcare.
pdf
bib
abs
Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents
Shrinidhi Kumbhar
|
Venkatesh Mishra
|
Kevin Coutinho
|
Divij Handa
|
Ashif Iquebal
|
Chitta Baral
Materials discovery and design are essential for advancing technology across various industries by enabling the development of application-specific materials. Recent research has leveraged Large Language Models (LLMs) to accelerate this process. We explore the potential of LLMs to generate viable hypotheses that, once validated, can expedite materials discovery. Collaborating with materials science experts, we curated a novel dataset from recent journal publications, featuring real-world goals, constraints, and methods for designing real-world applications. Using this dataset, we test LLM-based agents that generate hypotheses for achieving given goals under specific constraints. To assess the relevance and quality of these hypotheses, we propose a novel scalable evaluation metric that emulates the process a materials scientist would use to evaluate a hypothesis critically. Our curated dataset, proposed method, and evaluation framework aim to advance future research in accelerating materials discovery and design with LLMs.
pdf
bib
abs
Aligning to What? Limits to RLHF Based Alignment
Logan Barnhart
|
Reza Akbarian Bafghi
|
Stephen Becker
|
Maziar Raissi
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models (LLMs) with human preferences. However, the effectiveness of RLHF in addressing underlying biases remains unclear. This study investigates the relationship between RLHF and both covert and overt biases in LLMs, particularly focusing on biases against African Americans. We applied various RLHF techniques (DPO, ORPO, and RLOO) to Llama 3 8B and evaluated the covert and overt biases of the resulting models using matched-guise probing and explicit bias testing. We performed additional tests with DPO on different base models and datasets; among several implications, we found that SFT before RLHF calcifies model biases. Additionally, we extend the tools for measuring biases to multi-modal models. Through our experiments we collect evidence that indicates that current alignment techniques are inadequate for nebulous tasks such as mitigating covert biases, highlighting the need for capable datasets, data curating techniques, or alignment tools.
pdf
bib
abs
Beyond Words: Exploring Cultural Value Sensitivity in Multimodal Models
Srishti Yadav
|
Zhi Zhang
|
Daniel Hershcovich
|
Ekaterina Shutova
Investigating value alignment in Large Language Models (LLMs) based on cultural context has become a critical area of research. However, similar biases have not been extensively explored in large vision-language models (VLMs). As the scale of multimodal models continues to grow, it becomes increasingly important to assess whether images can serve as reliable proxies for culture and how these values are embedded through the integration of both visual and textual data. In this paper, we conduct a thorough evaluation of multimodal model at different scales, focusing on their alignment with cultural values. Our findings reveal that, much like LLMs, VLMs exhibit sensitivity to cultural values, but their performance in aligning with these values is highly context-dependent. While VLMs show potential in improving value understanding through the use of images, this alignment varies significantly across contexts highlighting the complexities and underexplored challenges in the alignment of multimodal models.
pdf
bib
abs
Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning
Jeffrey Olmo
|
Jared Wilson
|
Max Forsey
|
Bryce Hepner
|
Thomas Vincent Howe
|
David Wingate
Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network’s internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available to learn features, and biases the autoencoder towards neglecting features which are represented with small activation values but strongly influence model outputs.To address this, we introduce Gradient SAEs (g-SAEs), which modify the k-sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the k elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated through the network.Additionally, we find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts.By considering the downstream effects of activations, our approach leverages the dual nature of neural network features as both representations, retrospectively, and actions, prospectively. While previous methods have approached the problem of feature discovery primarily focused on the former aspect, g-SAEs represent a step towards accounting for the latter as well.
pdf
bib
abs
Tooling or Not Tooling? The Impact of Tools on Language Agents for Chemistry Problem Solving
Botao Yu
|
Frazier N. Baker
|
Ziru Chen
|
Garrett Herb
|
Boyu Gou
|
Daniel Adu-Ampratwum
|
Xia Ning
|
Huan Sun
To enhance large language models (LLMs) for chemistry problem solving, several LLM-based agents augmented with tools have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemAgent, an enhanced chemistry agent over ChemCrow, and conduct a comprehensive evaluation of its performance on both specialized chemistry tasks and general chemistry questions. Surprisingly, ChemAgent does not consistently outperform its base LLMs without tools. Our error analysis with a chemistry expert suggests that: For specialized chemistry tasks, such as synthesis prediction, we should augment agents with specialized tools; however, for general chemistry questions like those in exams, agents’ ability to reason correctly with chemistry knowledge matters more, and tool augmentation does not always help.
pdf
bib
abs
RusCode: Russian Cultural Code Benchmark for Text-to-Image Generation
Viacheslav Vasilev
|
Julia Agafonova
|
Nikolai Gerasimenko
|
Alexander Kapitanov
|
Polina Mikhailova
|
Evelina Mironova
|
Denis Dimitrov
Text-to-image generation models have gained popularity among users around the world. However, many of these models exhibit a strong bias toward English-speaking cultures, ignoring or misrepresenting the unique characteristics of other language groups, countries, and nationalities. The lack of cultural awareness can reduce the generation quality and lead to undesirable consequences such as unintentional insult, and the spread of prejudice. In contrast to the field of natural language processing, cultural awareness in computer vision has not been explored as extensively. In this paper, we strive to reduce this gap. We propose a RusCode benchmark for evaluating the quality of text-to-image generation containing elements of the Russian cultural code. To do this, we form a list of 19 categories that best represent the features of Russian visual culture. Our final dataset consists of 1250 text prompts in Russian and their translations into English. The prompts cover a wide range of topics, including complex concepts from art, popular culture, folk traditions, famous people’s names, natural objects, scientific achievements, etc. We present the results of a human evaluation of the side-by-side comparison of Russian visual concepts representations using popular generative models.
pdf
bib
abs
Evaluation of LLMs-based Hidden States as Author Representations for Psychological Human-Centered NLP Tasks
Nikita Soni
|
Pranav Chitale
|
Khushboo Singh
|
Niranjan Balasubramanian
|
H. Schwartz
Like most of NLP, models for human-centered NLP tasks—tasks attempting to assess author-level information—predominantly use rep-resentations derived from hidden states of Transformer-based LLMs. However, what component of the LM is used for the representation varies widely. Moreover, there is a need for Human Language Models (HuLMs) that implicitly model the author and provide a user-level hidden state. Here, we systematically evaluate different ways of representing documents and users using different LM and HuLM architectures to predict task outcomes as both dynamically changing states and averaged trait-like user-level attributes of valence, arousal, empathy, and distress. We find that representing documents as an average of the token hidden states performs the best generally. Further, while a user-level hidden state itself is rarely the best representation, we find its inclusion in the model strengthens token or document embeddings used to derive document- and user-level representations resulting in best performances.
pdf
bib
abs
Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey
Xiaoyu Liu
|
Paiheng Xu
|
Junda Wu
|
Jiaxin Yuan
|
Yifan Yang
|
Yuhang Zhou
|
Fuxiao Liu
|
Tianrui Guan
|
Haoliang Wang
|
Tong Yu
|
Julian McAuley
|
Wei Ai
|
Furong Huang
Causal inference has demonstrated significant potential to enhance Natural Language Processing (NLP) models in areas such as predictive accuracy, fairness, robustness, and explainability by capturing causal relationships among variables. The rise of generative Large Language Models (LLMs) has greatly impacted various language processing tasks. This survey focuses on research that evaluates or improves LLMs from a causal view in the following areas: reasoning capacity, fairness and safety issues, explainability, and handling multimodality. Meanwhile, LLMs can assist in causal inference tasks, such as causal relationship discovery and causal effect estimation, by leveraging their generation ability and knowledge learned during pre-training. This review explores the interplay between causal inference frameworks and LLMs from both perspectives, emphasizing their collective potential to further the development of more advanced and robust artificial intelligence systems.
pdf
bib
abs
ThoughtSculpt: Reasoning with Intermediate Revision and Search
Yizhou Chi
|
Kevin Yang
|
Dan Klein
We present THOUGHTSCULPT, a general reasoning and search method for tasks with outputs that can be decomposed into components. THOUGHTSCULPT explores a search tree of potential solutions using Monte Carlo Tree Search (MCTS), building solutions one action at a time and evaluating according to any domain-specific heuristic, which in practice is often simply an LLM evaluator. Critically, our action space includes revision actions: THOUGHTSCULPT may choose to revise part of its previous output rather than continuing to build the rest of its output. Empirically, THOUGHTSCULPT outperforms state-of-the-art reasoning methods across three challenging tasks: Story Outline Improvement (up to +30% interestingness), Mini-Crosswords Solving (up to +16% word success rate), and Constrained Generation (up to +10% concept coverage).
pdf
bib
abs
Optimizing Hidden Markov Language Models: An Empirical Study of Reparameterization and Initialization Techniques
Ivan Lee
|
Taylor Berg-Kirkpatrick
Hidden Markov models (HMMs) are valuable for their ability to provide exact and tractable inference. However, learning an HMM in an unsupervised manner involves a non-convex optimization problem that is plagued by poor local optima. Recent work on scaling-up HMMs to perform competitively as language models has indicated that this challenge only increases with larger hidden state sizes. Several techniques to address this problem have been proposed, but have not be evaluated comprehensively. This study provides a comprehensive empirical analysis of two recent strategies that use neural networks to enhance HMM optimization: neural reparameterization and neural initialization. We find that (1) these techniques work effectively for scaled HMM language modeling, (2) linear reparameterizations can be as effective as non-linear ones, and (3) the strategies are complementary.
pdf
bib
abs
Using Linguistic Entrainment to Evaluate Large Language Models for Use in Cognitive Behavioral Therapy
Mina Kian
|
Kaleen Shrestha
|
Katrin Fischer
|
Xiaoyuan Zhu
|
Jonathan Ong
|
Aryan Trehan
|
Jessica Wang
|
Gloria Chang
|
Séb Arnold
|
Maja Mataric
Entrainment, the responsive communication between interacting individuals, is a crucial process in building a strong relationship between a mental health therapist and their client, leading to positive therapeutic outcomes. However, so far entrainment has not been investigated as a measure of efficacy of large language models (LLMs) delivering mental health therapy. In this work, we evaluate the linguistic entrainment of an LLM (ChatGPT 3.5-turbo) in a mental health dialog setting. We first validate computational measures of linguistic entrainment with two measures of the quality of client self-disclosures: intimacy and engagement (p < 0.05). We then compare the linguistic entrainment of the LLM to trained therapists and non-expert online peer supporters in a cognitive behavioral therapy (CBT) setting. We show that the LLM is outperformed by humans with respect to linguistic entrainment (p < 0.001). These results support the need to be cautious in using LLMs out-of-the-box for mental health applications.
pdf
bib
abs
Analysis of LLM as a grammatical feature tagger for African American English
Rahul Porwal
|
Alice Rozet
|
Jotsna Gowda
|
Pryce Houck
|
Kevin Tang
|
Sarah Moeller
African American English (AAE) presents unique challenges in natural language processing (NLP) This research systematically compares the performance of available NLP models—rule-based, transformer-based, and large language models (LLMs)—capable of identifying key grammatical features of AAE, namely Habitual Be and Multiple Negation. These features were selected for their distinct grammatical complexity and frequency of occurrence. The evaluation involved sentence-level binary classification tasks, using both zero-shot and few-shot strategies. The analysis reveals that while LLMs show promise compared to the baseline, they are influenced by biases such as recency and unrelated features in the text such as formality. This study highlights the necessity for improved model training and architectural adjustments to better accommodate AAE’s unique linguistic characteristics. Data and code are available.
pdf
bib
abs
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
Anton Razzhigaev
|
Matvey Mikhalchuk
|
Temurbek Rahmatullaev
|
Elizaveta Goncharova
|
Polina Druzhinina
|
Ivan Oseledets
|
Andrey Kuznetsov
We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens — especially stopwords, articles, and commas — consistently degrades performance on MMLU and BABILong-4k, even if removing only irrelevant tokens. Our analysis also shows a strong correlation between contextualization and linearity, where linearity measures how closely the transformation from one layer’s embeddings to the next can be approximated by a single linear mapping. These findings underscore the hidden importance of “filler” tokens in maintaining context. For further exploration, we present LLM-Microscope, an open-source toolkit that assesses token-level nonlinearity, evaluates contextual memory, visualizes intermediate layer contributions (via an adapted Logit Lens), and measures the intrinsic dimensionality of representations. This toolkit illuminates how seemingly trivial tokens can be critical for long-range understanding.
pdf
bib
abs
On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation
Xiaonan Jing
|
Srinivas Billa
|
Danny Godbout
Hallucination has been a popular topic in natural language generation (NLG). In real-world applications, unfaithful content can result in poor data quality or loss of trust from end users. Thus, it is crucial to fact-check before adopting NLG for production usage, which can be expensive if done manually. In this paper, we investigate automated faithfulness evaluation in guided NLG. We developed a rubric template and used large language models (LLMs) to score the generation on quantifiable scales. We compared popular LLMs as well as widely adopted natural language inference (NLI) models in scoring quality and sensitivity. In addition, we developed methods for the generation of synthetic unfaithful data, as well as heuristics to quantify the percentage of hallucination. Our results on 4 travel-domain industry dataset show that GPT-4 can provide accurate judgement and explanation of whether a source and a generation are factually consistent. Furthermore, we found that tuning NLI models on synthetic data can improve performance. Lastly, we present insights on the latency and cost of deploying such a system.
pdf
bib
abs
LITERA: An LLM Based Approach to Latin-to-English Translation
Paul Rosu
This paper introduces an LLM-based Latin-to-English translation platform designed to address the challenges of translating Latin texts. We named the model LITERA, which stands for Latin Interpretation and Translations into English for Research Assistance. Through a multi-layered translation process utilizing a fine-tuned version of GPT-4o-mini and GPT-4o, LITERA offers an unprecedented level of accuracy, showcased by greatly improved BLEU scores, particularly in classical Latin, along with improved BLEURT scores. The development of LITERA involved close collaboration with Duke University’s Classical Studies Department, which was instrumental in creating a small, high-quality parallel Latin-English dataset. This paper details the architecture, fine-tuning methodology, and prompting strategies used in LITERA, emphasizing its ability to produce literal translations.
pdf
bib
abs
Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning
Venkatesh Mishra
|
Bimsara Pathiraja
|
Mihir Parmar
|
Sat Chidananda
|
Jayanth Srinivasa
|
Gaowen Liu
|
Ali Payani
|
Chitta Baral
Reasoning abilities of LLMs have been a key focus in recent years. One challenging reasoning domain with interesting nuances is legal reasoning, which requires careful application of rules, and precedents while balancing deductive and analogical reasoning, and conflicts between rules. Although there have been a few works on using LLMs for legal reasoning, their focus has been on overall accuracy. In this paper, we dig deeper to do a step-by-step analysis and figure out where they commit errors. We use the college-level Multiple Choice Question-Answering (MCQA) task from the Civil Procedure dataset and propose a new error taxonomy derived from initial manual analysis of reasoning chains with respect to several LLMs, including two objective measures: soundness and correctness scores. We then develop an LLM-based automated evaluation framework to identify reasoning errors and evaluate the performance of LLMs. The computation of soundness and correctness on the dataset using the auto-evaluator framework reveals several interesting insights. Furthermore, we show that incorporating the error taxonomy as feedback in popular prompting techniques marginally increases LLM performance. Our work will also serve as an evaluation framework that can be used in detailed error analysis of reasoning chains for logic-intensive complex tasks.
pdf
bib
abs
Towards Long Context Hallucination Detection
Siyi Liu
|
Kishaloy Halder
|
Zheng Qi
|
Wei Xiao
|
Nikolaos Pappas
|
Phu Mon Htut
|
Neha Anna John
|
Yassine Benajiba
|
Dan Roth
Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, they are prone to contextual hallucination, generating information that is either unsubstantiated or contradictory to the given context. Although many studies have investigated contextual hallucinations in LLMs, addressing them in long-context inputs remains an open problem. In this work, we take an initial step toward solving this problem by constructing a dataset specifically designed for long-context hallucination detection. Furthermore, we propose a novel architecture that enables pre-trained encoder models, such as BERT, to process long contexts and effectively detect contextual hallucinations through a decomposition and aggregation mechanism. Our experimental results show that the proposed architecture significantly outperforms previous models of similar size as well as LLM-based models across various metrics, while providing substantially faster inference. We publicly release our dataset and code to promote research along the same line.
pdf
bib
abs
How to Talk to Language Models: Serialization Strategies for Structured Entity Matching
Haoteng Yin
|
Jinha Kim
|
Prashant Mathur
|
Krishanu Sarker
|
Vidit Bansal
Entity matching (EM), which identifies whether two data records refer to the same real-world entity, is crucial for knowledge base construction and enhancing data-driven AI systems. Recent advances in language models (LMs) have shown great potential in resolving entities with rich textual attributes. However, their performance heavily depends on how structured entities are “talked” through serialized text. The impact of this serialization process remains underexplored, particularly for entities with complex relations in knowledge graphs (KGs). In this work, we systematically study entity serialization by benchmarking the effect of common schemes with LMs of different sizes on diverse tabular matching datasets. We apply our findings to propose a novel serialization scheme for KG entities based on random walks and utilize LLMs to encode sampled semantic walks for matching. Using this lightweight approach with open-source LLMs, we achieve a leading performance on EM in canonical and highly heterogeneous KGs, demonstrating significant throughput increases and superior robustness compared to GPT-4-based methods. Our study on serialization provides valuable insights for the deployment of LMs in real-world EM tasks.
pdf
bib
abs
Accounting for Sycophancy in Language Model Uncertainty Estimation
Anthony Sicilia
|
Mert Inan
|
Malihe Alikhani
Effective human-machine collaboration requires machine learning models to externalize uncertainty, so users can reflect and intervene when necessary. For language models, these representations of uncertainty may be impacted by sycophancy bias: proclivity to agree with users, even if they are wrong. For instance, models may be over-confident in (incorrect) problem solutions suggested by a user. We study the relationship between sycophancy and uncertainty estimation for the first time. We propose a generalization of the definition of sycophancy bias to measure downstream impacts on uncertainty estimation, and also propose a new algorithm (SyRoUP) to account for sycophancy in the uncertainty estimation process. Unlike previous works, we study a broad array of user behaviors, varying both correctness and confidence of user suggestions to see how model answers (and their certainty) change. Our experiments across conversation forecasting and question-answering tasks show that user confidence plays a critical role in modulating the effects of sycophancy, and that SyRoUP can better predict these effects. From these results, we argue that externalizing both model and user uncertainty can help to mitigate the impacts of sycophancy bias.
pdf
bib
abs
Zero-Shot Keyphrase Generation: Investigating Specialized Instructions and Multi-sample Aggregation on Large Language Models
Jishnu Ray Chowdhury
|
Jayanth Mohan
|
Tomas Malik
|
Cornelia Caragea
Keyphrases are the essential topical phrases that summarize a document. Keyphrase generation is a long-standing NLP task for automatically generating keyphrases for a given document. While the task has been comprehensively explored in the past via various models, only a few works perform some preliminary analysis of Large Language Models (LLMs) for the task. Given the impact of LLMs in the field of NLP, it is important to conduct a more thorough examination of their potential for keyphrase generation. In this paper, we attempt to meet this demand with our research agenda. Specifically, we focus on the zero-shot capabilities of open-source instruction-tuned LLMs (Phi-3, Llama-3) and the closed-source GPT-4o for this task. We systematically investigate the effect of providing task-relevant specialized instructions in the prompt. Moreover, we design task-specific counterparts to self-consistency-style strategies for LLMs and show significant benefits from our proposals over the baselines.
pdf
bib
abs
Meta-Reasoning Improves Tool Use in Large Language Models
Lisa Alazraki
|
Marek Rei
External tools help large language models succeed at tasks where they would otherwise typically fail. In existing frameworks, choosing tools at test time relies on naive greedy decoding, regardless of whether the model has been fine-tuned on tool-annotated data or prompted with in-context examples. In contrast, we find that gathering and choosing among a suitable set of candidate tools has greater potential to lead to an optimal selection. We present Tool selECTion via meta-reasONing (TECTON), a two-phase system that first *reasons* over a task and outputs candidate tools using a custom fine-tuned language modelling head. Then, with the custom head disabled, it *meta-reasons* (i.e., it reasons over the previous reasoning process) to make a final choice. We show that TECTON results in substantial gains—both in-distribution and out-of-distribution—on a range of math reasoning datasets.
pdf
bib
abs
CLERC: A Dataset for U. S. Legal Case Retrieval and Retrieval-Augmented Analysis Generation
Abe Bohan Hou
|
Orion Weller
|
Guanghui Qin
|
Eugene Yang
|
Dawn Lawrie
|
Nils Holzenberger
|
Andrew Blair-Stanek
|
Benjamin Van Durme
Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligence systems assisting legal professionals in writing such documents provide great benefits but are challenging to design. Such systems need to help locate, summarize, and reason over salient precedents in order to be useful. To enable systems for such tasks, we work with legal professionals to create a colossal dataset. supporting two important backbone tasks: information retrieval (IR) and retrieval-augmented generation (RAG). This dataset **CLERC** (Case Law Evaluation and Retrieval Corpus), is constructed for training and evaluating models on their ability to (1) find corresponding citations for a given piece of legal analysis and to (2) compile the text of these citations (as well as previous context) into a cogent analysis that supports a reasoning goal. We benchmark state-of-the-art models on CLERC, showing that current approaches still struggle: GPT-4o generates analyses with the highest ROUGE F-scores but hallucinates the most, while zero-shot IR models only achieve 48.3% recall@1000.
pdf
bib
abs
GAIfE: Using GenAI to Improve Literacy in Low-resourced Settings
Allahsera Auguste Tapo
|
Nouhoum Coulibaly
|
Seydou Diallo
|
Sebastien Diarra
|
Christopher M Homan
|
Mamadou K. Keita
|
Michael Leventhal
Illiteracy is a predictor of many negative social and personal outcomes. Illiteracy rates are particularly high in countries with underresourced languages, where few books exist that are suitable for children to learn to read from. We present GAIfE (Generative AI for Education), a toolchain and workflow developed through empirical methods, that demonstrates how existing tools can be adapted to address low literacy for an underresourced language. We used GAIfE (a play on the Bambara word for “book”) to construct materials for developing children’s reading competence in Bambara, the vehicular language of Mali. Our approach to the generation and post-generation editing of content skewed by the Global-North-centric bias of available LLMs, enabled us to rapidly multiply the content in Bambara available online by 10 times while maintaining high standards of attractiveness of the material to maintain high engagement, accurate representation of the Malian culture and physical and social environment and language quality. Using our materials, pilot reading programs achieved a 67% reduction in the number of children unable to read Bambara. Our approach demonstrated the power of bias-aware application of generative AI to the problem domain as well as the potential impact the application of this technology could have on reducing illiteracy and improving learning outcomes through native language education.
pdf
bib
abs
Hard Emotion Test Evaluation Sets for Language Models
Tiberiu Sosea
|
Cornelia Caragea
Language models perform well on emotion datasets but it remains unclear whether these models indeed understand emotions expressed in text or simply exploit supperficial lexical cues (e.g., emotion words). In this paper, we present two novel test evaluation sets sourced from two existing datasets that allow us to evaluate whether language models make real inferential decisions for emotion detection or not. Our human-annotated test sets are created by iteratively rephrasing input texts to gradually remove explicit emotion cues (while preserving the semantic similarity and the emotions) until a strong baseline BERT model yields incorrect predictions. Using our new test sets, we carry out a comprehensive analysis into the capabilities of small and large language models to predict emotions. Our analysis reveals that all models struggle to correctly predict emotions when emotion lexical cues become scarcer and scarcer, but large language models perform better than small pre-trained language models and push the performance by 14% over the 5% BERT baseline. We make our evaluation test sets and code publicly available.
pdf
bib
abs
UCL-Bench: A Chinese User-Centric Legal Benchmark for Large Language Models
Ruoli Gan
|
Duanyu Feng
|
Chen Zhang
|
Zhihang Lin
|
Haochen Jia
|
Hao Wang
|
Zhenyang Cai
|
Lei Cui
|
Qianqian Xie
|
Jimin Huang
|
Benyou Wang
Existing legal benchmarks focusing on knowledge and logic effectively evaluate LLMs on various tasks in legal domain. However, few have explored the practical application of LLMs by actual users. To further assess whether LLMs meet the specific needs of legal practitioners in real-world scenarios, we introduce UCL-Bench, a Chinese User-Centric Legal Benchmark, comprising 22 tasks across 5 distinct legal scenarios.To build the UCL-Bench, we conduct a user survey targeting legal professionals to understand their needs and challenges. Based on the survey results, we craft tasks, verified by legal professionals, and categorized them according to Bloom’s taxonomy. Each task in UCL-Bench mirrors real-world legal scenarios, and instead of relying on pre-defined answers, legal experts provide detailed answer guidance for each task, incorporating both “information” and “needs” elements to mimic the complexities of legal practice. With the guidance, we use GPT-4 as the user simulator and evaluator, enabling multi-turn dialogues as a answer guidance based evaluation framework. Our findings reveal that many recent open-source general models achieve the highest performance, suggesting that they are well-suited to address the needs of legal practitioners. However, these legal LLMs do not outperform ChatGPT, indicating a need for training strategies aligned with users’ needs. Furthermore, we find that the most effective models are able to address legal issues within fewer dialogue turns, highlighting the importance of concise and accurate responses in achieving high performance. The code and dataset are available at https://github.com/wittenberg11/UCL-bench.
pdf
bib
abs
MIDAS: Multi-level Intent, Domain, And Slot Knowledge Distillation for Multi-turn NLU
Yan Li
|
So-Eon Kim
|
Seong-Bae Park
|
Caren Han
Although Large Language Models (LLMs) can generate coherent text, they often struggle to recognise user intent behind queries. In contrast, Natural Language Understanding (NLU) models interpret the purpose and key information of user input for responsive interactions. Existing NLU models typically map utterances to a dual-level semantic frame, involving sentence-level intent (SI) and word-level slot (WS) labels. However, real-life conversations primarily consist of multi-turn dialogues, requiring the interpretation of complex and extended exchanges. Researchers encounter challenges in addressing all facets of multi-turn dialogue using a unified NLU model. This paper introduces MIDAS, a novel approach leveraging multi-level intent, domain, and slot knowledge distillation for multi-turn NLU. We construct distinct teachers for SI detection, WS filling, and conversation-level domain (CD) classification, each fine-tuned for specific knowledge. A multi-teacher loss is proposed to facilitate the integration of these teachers, guiding a student model in multi-turn dialogue tasks. Results demonstrate the efficacy of our model in improving multi-turn conversation understanding, showcasing the potential for advancements in NLU through multi-level dialogue knowledge distillation. Our implementation is open-sourced on GitHub (https://github.com/adlnlp/Midas).
pdf
bib
abs
A Practical Analysis of Human Alignment with *PO
Kian Ahrabian
|
Xihui Lin
|
Barun Patra
|
Vishrav Chaudhary
|
Alon Benhaim
|
Jay Pujara
|
Xia Song
At the forefront of state-of-the-art human alignment methods are preference optimization methods (*PO). Prior research has often concentrated on identifying the best-performing method, typically involving a grid search over hyperparameters, which can be impractical for general practitioners. In this paper, we examine the robustness of existing state-of-the-art methods to varying hyperparameters in a realistic out-of-distribution (OOD) scenario that mirrors real-world applications of human alignment. Our goal is to empirically find the method that increases the likelihood of achieving better results through the lens of various metrics, such as KL divergence and response length. We also introduce LN-DPO, a simple length-normalized version of DPO that is more stable across hyperparameters, effectively reduces the average response length, and improves performance. Our analysis of state-of-the-art reference-free (i.e., SimPO) and reference-dependent (i.e., DPO and LN-DPO) methods reveals that they perform similarly at their peak (i.e., best possible scenario). However, we uncover that the pattern of change in performance greatly varies as we move away from the best possible scenario.
pdf
bib
abs
Understanding Reference Policies in Direct Preference Optimization
Yixin Liu
|
Pengfei Liu
|
Arman Cohan
Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO – its dependency on the reference model or policy. Such reference policies, typically instantiated as the model to be further fine-tuned, are important since they can impose an upper limit on DPO’s effectiveness. Therefore, we address three related research questions in this work. First, we explore the optimal strength of the KL divergence constraint in DPO, which penalizes deviations from the reference policy, and find that DPO is sensitive to this strength. Next, we examine the necessity of the KL-constraint from the reference policies in DPO by providing both theoretical and empirical comparisons between DPO and related learning objectives, demonstrating DPO’s superiority in this controlled setting. Additionally, we investigate whether DPO benefits from stronger reference policies, finding that a stronger reference policy can lead to improved performance, but only when it is similar to the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO and offer insights for best practices, while also identifying open research questions for future studies.
pdf
bib
abs
LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models
Saaket Agashe
|
Yue Fan
|
Anthony Reyna
|
Xin Eric Wang
Large Language Models (LLMs) have demonstrated emergent common-sense reasoning and Theory of Mind (ToM) capabilities, making them promising candidates for developing coordination agents. This study introduces the LLM-Coordination Benchmark, a novel benchmark for analyzing LLMs in the context of Pure Coordination Settings, where agents must cooperate to maximize gains. Our benchmark evaluates LLMs through two distinct tasks. The first is Agentic Coordination, where LLMs act as proactive participants in four pure coordination games. The second is Coordination Question Answering (CoordQA), which tests LLMs on 198 multiple-choice questions across these games to evaluate three key abilities: Environment Comprehension, ToM Reasoning, and Joint Planning. Results from Agentic Coordination experiments reveal that LLM-Agents excel in multi-agent coordination settings where decision-making primarily relies on environmental variables but face challenges in scenarios requiring active consideration of partners’ beliefs and intentions. The CoordQA experiments further highlight significant room for improvement in LLMs’ Theory of Mind reasoning and joint planning capabilities. Zero-Shot Coordination (ZSC) experiments in the Agentic Coordination setting demonstrate that LLM agents, unlike RL methods, exhibit robustness to unseen partners. These findings indicate the potential of LLMs as Agents in pure coordination setups and underscore areas for improvement.
pdf
bib
abs
AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation
Vaishnavi Pulavarthi
|
Deeksha Nandal
|
Soham Dan
|
Debjit Pal
Assertions have been the de facto collateral for hardware for over a decade. The verification quality, i.e., detection and diagnosis of corner-case design bugs, is critically dependent on the assertion quality. There has been a considerable amount of research to generate high-quality assertions from hardware design source code and design execution trace data. With recent advent of generative AI techniques such as Large-Language Models (LLMs), there has been a renewed interest in deploying LLMs for assertion generation. However, there is little effort to quantitatively establish the effectiveness and suitability of various LLMs for assertion generation. In this paper, we present AssertionBench, a novel benchmark to evaluate LLMs’ effectiveness for assertion generation quantitatively. AssertionBench contains 100 curated Verilog hardware designs from OpenCores and formally verified assertions for each design, generated from GoldMine and HARM. We use AssertionBench to compare state-of-the-art LLMs, e.g., GPT-3.5, GPT-4o, CodeLLaMa-2, and LLaMa3-70B, to assess their effectiveness in inferring functionally correct assertions for hardware designs. Our experiments comprehensively demonstrate how LLMs perform relative to each other, the benefits of using more in-context exemplars in generating a higher fraction of functionally correct assertions, and the significant room for improvement for LLM-based assertion generators.
pdf
bib
abs
On Reference (In-)Determinacy in Natural Language Inference
Sihao Chen
|
Chaitanya Malaviya
|
Alex Fabrikant
|
Hagai Taitelbaum
|
Tal Schuster
|
Senaka Buthpitiya
|
Dan Roth
We revisit the reference determinacy (RD) assumption in the task of natural language inference (NLI), i.e., the premise and hypothesis are assumed to refer to the same context when human raters annotate a label. While RD is a practical assumption for constructing a new NLI dataset, we observe that current NLI models—which are typically trained solely on hypothesis-premise pairs created with the RD assumption—fail in downstream applications such as fact verification, where the input premise and hypothesis may refer to different contexts. To highlight the impact of this phenomenon in real-world use cases, we introduce RefNLI, a diagnostic benchmark for identifying reference ambiguity in NLI examples. In RefNLI, the premise is retrieved from a knowledge source (i.e. Wikipedia) and does not necessarily refer to the same context as the hypothesis. With RefNLI, we demonstrate that finetuned NLI models and few-shot prompted LLMs both fail to recognize context mismatch, leading to >80% false contradiction and >50% entailment predictions. We discover that the existence of reference ambiguity in NLI examples can in part explain the inherent human disagreements in NLI, and provide insight into how the RD assumption impacts NLI dataset creation process.
pdf
bib
abs
DHP Benchmark: Are LLMs Good NLG Evaluators?
Yicheng Wang
|
Jiayi Yuan
|
Yu-Neng Chuang
|
Zhuoer Wang
|
Yingchi Liu
|
Mark Cusick
|
Param Kulkarni
|
Zhengping Ji
|
Yasser Ibrahim
|
Xia Hu
Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks; this is often referred to as “LLM-as-a-judge” paradigm. However, the capabilities of LLMs in evaluating NLG quality remain underexplored. Current studies depend on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which provides quantitative discernment scores for LLMs. This framework leverages hierarchically perturbed text data and statistical tests to systematically measure the NLG evaluation capabilities of LLMs. We re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Our comprehensive benchmarking of five major LLM families provides critical insight into their strengths and limitations as NLG evaluators. Our dataset is available at https://huggingface.co/datasets/YCWANGVINCE/DHP_Benchmark.
pdf
bib
abs
GraphEval36K: Benchmarking Coding and Reasoning Capabilities of Large Language Models on Graph Datasets
Qiming Wu
|
Zichen Chen
|
Will Corcoran
|
Misha Sra
|
Ambuj Singh
Large language models (LLMs) have achieved remarkable success in natural language processing (NLP), demonstrating significant capabilities in processing and understanding text data. However, recent studies have identified limitations in LLMs’ ability to manipulate, program, and reason about structured data, especially graphs. We introduce GraphEval36K, the first comprehensive graph dataset, comprising 40 graph coding problems and 36,900 test cases to evaluate the ability of LLMs on graph problem-solving. Our dataset is categorized into eight primary and four sub-categories to ensure a thorough evaluation across different types of graphs. We benchmark eight LLMs, finding that private models outperform open-source ones, though the gap is narrowing. We also analyze the performance of LLMs across directed vs undirected graphs, different kinds of graph concepts, and network models. Furthermore, to improve the usability of our evaluation framework, we propose Structured Symbolic Decomposition (SSD), an instruction-based method designed to enhance LLM performance on complex graph tasks. Results show that SSD improves the average passing rate of GPT-4, GPT-4o, Gemini-Pro and Claude-3-Sonnet by 8.38%, 6.78%, 29.28% and 25.28%, respectively.
pdf
bib
abs
SimulBench: Evaluating Language Models with Creative Simulation Tasks
Qi Jia
|
Xiang Yue
|
Tuney Zheng
|
Jie Huang
|
Bill Yuchen Lin
We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation tasks, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM’s general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest using a fixed LLM as a user agent to engage with an LLM to collect dialogues first under different tasks. Then, challenging dialogue scripts are extracted for evaluating different target LLMs. To facilitate automatic assessment on SimulBench, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by the target LLMs given multi-turn dialogue scripts. Our comprehensive experiments indicate that these creative simulation tasks continue to pose a significant challenge with their unique natures and show the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55% more cases.
pdf
bib
abs
ReasoningRec: Bridging Personalized Recommendations and Human-Interpretable Explanations through LLM Reasoning
Millennium Bismay
|
Xiangjue Dong
|
James Caverlee
This paper presents ReasoningRec, a reasoning-based recommendation framework that leverages Large Language Models (LLMs) to bridge the gap between recommendations and human-interpretable explanations. In contrast to conventional recommendation systems that rely on implicit user-item interactions, ReasoningRec employs LLMs to model users and items, focusing on preferences, aversions, and explanatory reasoning. The framework utilizes a larger LLM to generate synthetic explanations for user preferences, subsequently used to fine-tune a smaller LLM for enhanced recommendation accuracy and human-interpretable explanation. Our experimental study investigates the impact of reasoning and contextual information on personalized recommendations, revealing that the quality of contextual and personalized data significantly influences the LLM’s capacity to generate plausible explanations. Empirical evaluations demonstrate that ReasoningRec surpasses state-of-the-art methods by up to 12.5% in recommendation prediction while concurrently providing human-intelligible explanations.
pdf
bib
abs
2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision
Shilong Li
|
Yancheng He
|
Hui Huang
|
Xingyuan Bu
|
Jiaheng Liu
|
Hangyu Guo
|
Weixun Wang
|
Jihao Gu
|
Wenbo Su
|
Bo Zheng
Recent advancements in Direct Preference Optimization (DPO) have significantly enhanced the alignment of Large Language Models (LLMs) with human preferences, owing to its simplicity and effectiveness. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference of DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide the response into sentences and assign scores to each segment. For the aspect dimension, we meticulously design several criteria covering the response quality rubrics. With the 2-dimensional signals as feedback, we develop a 2D-DPO framework, decomposing the overall objective into multi-segment and multi-aspect objectives. Extensive experiments on popular benchmarks demonstrate that 2D-DPO performs better than methods that optimize for scalar or 1-dimensional preferences.
pdf
bib
abs
Demystifying the Power of Large Language Models in Graph Generation
Yu Wang
|
Ryan A. Rossi
|
Namyong Park
|
Nesreen K. Ahmed
|
Danai Koutra
|
Franck Dernoncourt
|
Tyler Derr
Despite the unprecedented success of applying Large Language Models (LLMs) to graph discriminative tasks such as node classification and link prediction, its potential for graph structure generation remains largely unexplored. To fill this crucial gap, this paper presents a systematic investigation into the capability of LLMs for graph structure generation. Specifically, we design prompts triggering LLMs to generate codes that optimize network properties by injecting domain expertise from network science. Since graphs in different domains exhibit unique structural properties captured by various metrics (e.g., clustering coefficient capturing triangles in social networks while squares reflecting road segments in transportation networks), we first evaluate the capability of LLMs to generate graphs satisfying each structural property in different domains. After that, we select the optimal property configurations and benchmark the graph structure generation performance of LLMs against established graph generative models across multiple domains. Our findings shed light on generating graph structures from an LLM perspective. Our code is publically available https://github.com/yuwvandy/LLM-GraphGen.
pdf
bib
abs
COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Yuelin Bai
|
Xeron Du
|
Yiming Liang
|
Leo Jin
|
Junting Zhou
|
Ziqiang Liu
|
Feiteng Fang
|
Mingshan Chang
|
Tianyu Zheng
|
Xincheng Zhang
|
Nuo Ma
|
Zekun Moore Wang
|
Ruibin Yuan
|
Haihong Wu
|
Hongquan Lin
|
Wenhao Huang
|
Jiajun Zhang
|
Chenghua Lin
|
Jie Fu
|
Min Yang
|
Shiwen Ni
|
Ge Zhang
Remarkable progress on large language models (LLMs), particularly in English, has facilitated impressive capabilities in following human instructions. However, there remains a noticeable gap in instruction fine-tuning for Chinese, where the complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well-aligned with Chinese users’ interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world data resources and undergoing comprehensive human verification. We conduct extensive experiments on COIG-CQIA, and compare them with strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance in diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data mixing strategies. Our dataset are available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
pdf
bib
abs
Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation
Yu Wang
|
Jiaxin Zhang
|
Xiang Gao
|
Wendi Cui
|
Peng Li
|
Kamalika Das
In tasks such as summarization and open-book question answering (QA), Large Language Models (LLMs) frequently experience “contextual hallucination”, where they generate irrelevant or incorrect responses despite having access to accurate information in the input. This issue often stems from the models’ propensity to prioritize self-generated content over input context, leading to a disregard for pertinent details. To address this challenge, we introduce, Guided Attention Map Editing (GAME), an innovative approach that dynamically adjusts attention maps to enhance contextual relevance. During inference, GAME employs a trained classifier to identify attention maps likely to induce hallucinations and implements targeted interventions. These interventions, guided by gradient-informed “edit directions”, strategically redistribute attention weights across various heads to efficiently mitigate hallucination. Extensive evaluations on challenging summarization and open-book QA tasks demonstrate that GAME consistently and significantly reduces hallucinations across diverse open-source models, thereby improving the reliability and applicability of LLMs.
pdf
bib
abs
Alleviating Hallucinations of Large Language Models through Induced Hallucinations
Yue Zhang
|
Leyang Cui
|
V. W.
|
Shuming Shi
Despite their impressive capabilities, large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information, a phenomenon commonly known as hallucination. In this work, we propose a simple Induce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations. We first construct a factually weak LLM by inducing hallucinations from the original LLMs. Then, we penalize these induced hallucinations during decoding to enhance the factuality of the generated content. Concretely, we determine the final next-token predictions by amplifying the predictions from the original model and downplaying the induced untruthful predictions via contrastive decoding. Experimental results on both discrimination-based and generation-based hallucination evaluation benchmarks, such as TruthfulQA and FActScore, demonstrate that our proposed ICD methods can effectively enhance the factuality of LLMs across various task formats, model sizes, and model families. For example, when equipped with ICD, Llama2-7B-Chat and Mistral-7B-Instruct achieve performance comparable to ChatGPT and GPT4 on TruthfulQA, respectively, without compromising their generalization capabilities on other tasks.
pdf
bib
abs
MoDE: Effective Multi-task Parameter Efficient Fine-Tuning with a Mixture of Dyadic Experts
Lin Ning
|
Harsh Lara
|
Meiqi Guo
|
Abhinav Rastogi
Parameter-efficient fine-tuning techniques like Low-Rank Adaptation (LoRA) have revolutionized the adaptation of large language models (LLMs) to diverse tasks. Recent efforts have explored mixtures of LoRA modules for multi-task settings. However, our analysis reveals redundancy in the down-projection matrices of these architectures. This observation motivates our proposed method, Mixture of Dyadic Experts (MoDE), which introduces a novel design for efficient multi-task adaptation. This is done by sharing the down-projection matrix across tasks and employing atomic rank-one adapters, coupled with routers that allow more sophisticated task-level specialization. Our design allows for more fine-grained mixing, thereby increasing the model’s ability to jointly handle multiple tasks. We evaluate MoDE on the Supernatural Instructions (SNI) benchmark consisting of a diverse set of 700+ tasks and demonstrate that it outperforms state-of-the-art multi-task parameter-efficient fine-tuning (PEFT) methods, without introducing additional parameters. Our findings contribute to a deeper understanding of parameter efficiency in multi-task LLM adaptation and provide a practical solution for deploying high-performing, lightweight models.
pdf
bib
abs
Unsupervised Sentence Representation Learning with Syntactically Aligned Negative Samples
Zhilan Wang
|
Zekai Zhi
|
Rize Jin
|
Kehui Song
|
He Wang
|
Da-Jung Cho
Sentence representation learning benefits from data augmentation strategies to improve model performance and generalization, yet existing approaches often encounter issues such as semantic inconsistencies and feature suppression. To address these limitations, we propose a method for generating Syntactically Aligned Negative (SAN) samples through a semantic importance-aware Masked Language Model (MLM) approach. Our method quantifies semantic contributions of individual words to produce negative samples that have substantial textual overlap with the original sentences while conveying different meanings. We further introduce Hierarchical-InfoNCE (HiNCE), a novel contrastive learning objective employing differential temperature weighting to optimize the utilization of both in-batch and syntactically aligned negative samples. Extensive evaluations across seven semantic textual similarity benchmarks demonstrate consistent improvements over state-of-the-art models.
pdf
bib
abs
Hierarchical Speculative Decoding with Dynamic Window
Shensian Syu
|
Hung-yi Lee
Speculative Decoding (SD) utilizes an efficient draft model to generate multiple tokens, which are subsequently verified in parallel by a target model. This approach has shown significant potential for accelerating inference in large language models (LLMs), with performance heavily reliant on the hyperparameter K—the window size. However, previous methods often depend on simple heuristics to select K or dynamically adjust the window size, which may necessitate additional training or careful resource management to avoid competition.To address these challenges, we propose Hierarchical Speculative Decoding with Dynamic Window (HSDDW), a straightforward framework that eliminates the need for additional training. Specifically, we introduce a self-verify mechanism that enables the draft model to autonomously decide when to stop generating tokens. Additionally, by integrating a hierarchical structure that leverages the capabilities of models of different sizes, we significantly enhance the overall speed of the system.HSDDW demonstrates competitive performance across four datasets, achieving notable speedups of 2.91× on MT-Bench and 2.99× on Alpaca, outperforming existing state-of-the-art methods.
pdf
bib
abs
Q-FAKER: Query-free Hard Black-box Attack via Controlled Generation
CheolWon Na
|
YunSeok Choi
|
Jee-Hyong Lee
Many adversarial attack approaches are proposed to verify the vulnerability of language models. However, they require numerous queries and the information on the target model. Even black-box attack methods also require the target model’s output information. They are not applicable in real-world scenarios, as in hard black-box settings where the target model is closed and inaccessible. Even the recently proposed hard black-box attacks still require many queries and demand extremely high costs for training adversarial generators. To address these challenges, we propose Q-faker (Query-free Hard Black-box Attacker), a novel and efficient method that generates adversarial examples without accessing the target model. To avoid accessing the target model, we use a surrogate model instead. The surrogate model generates adversarial sentences for a target-agnostic attack. During this process, we leverage controlled generation techniques. We evaluate our proposed method on eight datasets. Experimental results demonstrate our method’s effectiveness including high transferability and the high quality of the generated adversarial examples, and prove its practical in hard black-box settings.
pdf
bib
abs
PRDetect: Perturbation-Robust LLM-generated Text Detection Based on Syntax Tree
Xiang Li
|
Zhiyi Yin
|
Hexiang Tan
|
Shaoling Jing
|
Du Su
|
Yi Cheng
|
Huawei Shen
|
Fei Sun
As LLM-generated text becomes increasingly prevalent on the internet, often containing hallucinations or biases, detecting such content has emerged as a critical area of research.Recent methods have demonstrated impressive performance in detecting text generated entirely by LLMs.However, in real-world scenarios, users often introduce perturbations to the LLM-generated text, and the robustness of existing detection methods against these perturbations has not been sufficiently explored.This paper empirically investigates this challenge and finds that even minor perturbations can severely degrade the performance of current detection methods. To address this issue, we find that the syntactic tree is minimally affected by disturbances and exhibits distinct differences between human-written and LLM-generated text.Therefore, we propose a detection method based on syntactic trees, which can capture features invariant to perturbations.It demonstrates significantly improved robustness against perturbation on the HC3 and GPT-3.5-mixed datasets.Moreover, it also has the shortest time expenditure.We provide the code and data at https://github.com/thulx18/PRDetect.
pdf
bib
abs
Enabling Natural Zero-Shot Prompting on Encoder Models via Statement-Tuning
Ahmed Elshabrawy
|
Yongxin Huang
|
Iryna Gurevych
|
Alham Fikri Aji
While Large Language Models (LLMs) exhibit remarkable capabilities in zero-shot and few-shot scenarios, they often require computationally prohibitive sizes. Conversely, smaller Masked Language Models (MLMs) like BERT and RoBERTa achieve state-of-the-art results through fine-tuning but struggle with extending to few-shot and zero-shot settings due to their architectural constraints. Hence, we propose Statement-Tuning, a technique that models discriminative tasks as a set of finite statements and trains an encoder model to discriminate between the potential statements to determine the label. We do Statement-Tuning on multiple tasks to enable cross-task generalization. Experimental results demonstrate that Statement-Tuning achieves competitive performance compared to state-of-the-art LLMs with significantly fewer parameters. Furthermore, we compare with previous encoder-based methodology and show that our method is more accurate and more robust to spurious patterns. Moreover, the study investigates the impact of several design choices on few-shot and zero-shot generalization, revealing that Statement-Tuning can achieve strong performance with modest training data and benefits from task and statement diversity for unseen task generalizability. We release all the code used to generate statement data, train and evaluate our Statement-Tuned models.
pdf
bib
abs
Faster Machine Translation Ensembling with Reinforcement Learning and Competitive Correction
Kritarth Prasad
|
Mohammadi Zaki
|
Pratik Rakesh Singh
|
Pankaj Wasnik
Ensembling neural machine translation (NMT) models to produce higher-quality translations than the L individual models has been extensively studied. Recent methods typically employ a candidate selection block (CSB) and an encoder-decoder fusion block (FB), requiring inference across all candidate models, leading to significant computational overhead, generally 𝛺(L). This paper introduces SmartGen, a reinforcement learning (RL)-based strategy that improves the CSB by selecting a small, fixed number of candidates and identifying optimal groups to pass to the fusion block for each input sentence. Furthermore, previously, the CSB and FB were trained independently, leading to suboptimal NMT performance. Our DQN-based SmartGen addresses this by using feedback from the FB block as a reward during training. We also resolve a key issue in earlier methods, where candidates were passed to the FB without modification, by introducing a Competitive Correction Block (CCB). Finally, we validate our approach with extensive experiments on English-Hindi translation tasks in both directions as well as English to Chinese and English to German.
pdf
bib
abs
Evaluating Numeracy of Language Models as a Natural Language Inference Task
Rahmad Mahendra
|
Damiano Spina
|
Lawrence Cavedon
|
Karin Verspoor
While recent advancements in large language models (LLMs) have enhanced their capabilities to solve mathematical problems, other aspects of numeracy remain underexplored. In this paper, we propose a benchmark to evaluate the ability of language models to perform basic numeracy tasks. We frame numeracy as a Natural Language Inference (NLI) task to assess the models’ ability to understand both numbers and language contexts. We evaluate 49 language models (LMs), including fine-tuned LMs on NLI datasets, instruction-tuned LLMs, and specialized math-LLMs. Our findings reveal three main insights: (1) LLMs only clearly outperform smaller LMs in arithmetic tasks, indicating that mathematical reasoning cannot be generalized to other numeracy skills such as number comparison and normalization; (2) while most language models achieve fair to good accuracy for NLI entailment cases, they still struggle to predict contradiction and neutral cases; and (3) the robustness of language models’ numeracy capabilities needs improvement, particularly in understanding the semantics and pragmatics of numbers in linguistic contexts.
pdf
bib
abs
Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages
Poulami Ghosh
|
Raj Dabre
|
Pushpak Bhattacharyya
Pre-trained language models (PLMs) are known to be susceptible to perturbations to the input text, but existing works do not explicitly focus on linguistically grounded attacks, which are subtle and more prevalent in nature. In this paper, we study whether PLMs are agnostic to linguistically grounded attacks or not. To this end, we offer the first study addressing this, investigating different Indic languages and various downstream tasks. Our findings reveal that although PLMs are susceptible to linguistic perturbations, when compared to non-linguistic attacks, PLMs exhibit a slightly lower susceptibility to linguistic attacks. This highlights that even constrained attacks are effective. Moreover, we investigate the implications of these outcomes across a range of languages, encompassing diverse language families and different scripts.
pdf
bib
abs
Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
Seungbeen Lee
|
Seungwon Lim
|
Seungju Han
|
Giyeong Oh
|
Hyungjoo Chae
|
Jiwan Chung
|
Minju Kim
|
Beong-woo Kwak
|
Yeonsoo Lee
|
Dongha Lee
|
Jinyoung Yeo
|
Youngjae Yu
Recent advancements in Large Language Models (LLMs) have led to their adaptation in various domains as conversational agents. We wonder: can personality tests be applied to these agents to analyze their behavior, similar to humans? We introduce TRAIT, a new benchmark consisting of 8K multi-choice questions designed to assess the personality of LLMs. TRAIT is built on two psychometrically validated small human questionnaires, Big Five Inventory (BFI) and Short Dark Triad (SD-3), enhanced with the ATOMIC-10X knowledge graph to a variety of real-world scenarios. TRAIT also outperforms existing personality tests for LLMs in terms of reliability and validity, achieving the highest scores across four key metrics: Content Validity, Internal Validity, Refusal Rate, and Reliability. Using TRAIT, we reveal two notable insights into personalities of LLMs: 1) LLMs exhibit distinct and consistent personality, which is highly influenced by their training data (e.g., data used for alignment tuning), and 2) current prompting techniques have limited effectiveness in eliciting certain traits, such as high psychopathy or low conscientiousness, suggesting the need for further research in this direction.
pdf
bib
abs
Tell Me What You Know About Sexism: Expert-LLM Interaction Strategies and Co-Created Definitions for Zero-Shot Sexism Detection
Myrthe Reuver
|
Indira Sen
|
Matteo Melis
|
Gabriella Lapesa
This paper investigates hybrid intelligence and collaboration between researchers of sexism and Large Language Models (LLMs), with afour-component pipeline. First, nine sexism researchers answer questions about their knowledge of sexism and of LLMs. They then participate in two interactive experiments involving an LLM (GPT3.5). The first experiment has experts assessing the model’s knowledgeabout sexism and suitability for use in research. The second experiment tasks them with creating three different definitions of sexism: anexpert-written definition, an LLM-written one, and a co-created definition. Lastly, zero-shot classification experiments use the three definitions from each expert in a prompt template for sexism detection, evaluating GPT4o on 2.500 texts sampled from five sexism benchmarks. We then analyze the resulting 67.500 classification decisions. The LLM interactions lead to longer and more complex definitions of sexism. Expert-written definitions on average perform poorly compared to LLM-generated definitions. However, some experts do improve classification performance with their co-created definitions of sexism, also experts who are inexperienced in using LLMs.
pdf
bib
abs
The Role of Prosody in Spoken Question Answering
Jie Chi
|
Maureen de Seyssel
|
Natalie Schluter
Spoken language understanding research to date has generally carried a heavy text perspective. Most datasets are derived from text, which is then subsequently synthesized into speech, and most models typically rely on automatic transcriptions of speech. This is to the detriment of prosody–additional information carried by the speech signal beyond the phonetics of the words themselves and difficult to recover from text alone. In this work, we investigate the role of prosody in Spoken Question Answering. By isolating prosodic and lexical information on the SLUE-SQA-5 dataset, which consists of natural speech, we demonstrate that models trained on prosodic information alone can perform reasonably well by utilizing prosodic cues. However, we find that when lexical information is available, models tend to predominantly rely on it. Our findings suggest that while prosodic cues provide valuable supplementary information, more effective integration methods are required to ensure prosody contributes more significantly alongside lexical features.
pdf
bib
abs
Target-Augmented Shared Fusion-based Multimodal Sarcasm Explanation Generation
Palaash Goel
|
Dushyant Singh Chauhan
|
Md Shad Akhtar
Sarcasm is a linguistic phenomenon that intends to ridicule a target (e.g., entity, event, or person) in an inherent way. Multimodal Sarcasm Explanation (MuSE) aims at revealing the intended irony in a sarcastic post using a natural language explanation. Though important, existing systems overlooked the significance of the target of sarcasm in generating explanations. In this paper, we propose a Target-aUgmented shaRed fusion-Based sarcasm explanatiOn model, aka. TURBO. We design a novel shared-fusion mechanism to leverage the inter-modality relationships between an image and its caption. TURBO assumes the target of the sarcasm and guides the multimodal shared fusion mechanism in learning intricacies of the intended irony for explanations. We evaluate our proposed TURBO model on the MORE+ dataset. Comparison against multiple baselines and state-of-the-art models signifies the performance improvement of TURBO by an average margin of +3.3%. Moreover, we explore LLMs in zero and one-shot settings for our task and observe that LLM-generated explanation, though remarkable, often fails to capture the critical nuances of the sarcasm. Furthermore, we supplement our study with extensive human mevaluation on TURBO’s generated explanations and find them out to be comparatively better than other systems.
pdf
bib
abs
Seeds of Discourse: A Multilingual Corpus of Direct Quotations from African Media on Agricultural Biotechnologies
Patricia Chiril
|
Trevor Spreadbury
|
Joeva Sean Rock
|
Brian Dowd-Uribe
|
David Uminsky
Direct quotations play a crucial role in journalism by substantiating claims and enhancing persuasive communication. This makes news articles a rich resource for opinion mining, providing valuable insights into the topics they cover. This paper presents the first multilingual corpora (English and French) featuring both manually annotated (1,657) and automatically extracted (102,483) direct quotations related to agricultural biotechnologies from a curated list of Africa-based news sources. In addition, we provide 665 instances annotated for Aspect-Based Sentiment Analysis, enabling a fine-grained examination of sentiment toward key aspects of agricultural biotechnologies. These corpora are freely available to the research community for future work on media discourse surrounding agricultural biotechnologies.
pdf
bib
abs
Position Really Matters: Towards a Holistic Approach for Prompt Tuning
Xianjun Yang
|
Wei Cheng
|
Xujiang Zhao
|
Wenchao Yu
|
Linda Ruth Petzold
|
Haifeng Chen
Prompt tuning is highly effective in efficiently extracting knowledge from foundation models, encompassing both language, vision, and vision-language models. However, the efficacy of employing fixed soft prompts with a predetermined position for concatenation with inputs for all instances, irrespective of their inherent disparities, remains uncertain. Variables such as the position, length, and representations of prompts across diverse instances and tasks can substantially influence the performance of prompt tuning. We first provide a theoretical analysis, revealing that optimizing the position of the prompt to encompass the input can capture additional semantic information that traditional prefix or postfix prompt tuning methods fail to capture. Then, we present a holistic parametric prompt tuning strategy that dynamically determines different factors of prompts based on specific tasks or instances. Experimental results underscore the significant performance improvement achieved by dynamic prompt tuning across a wide range of tasks, including NLP, vision recognition, and vision-language tasks. Furthermore, we establish the universal applicability of our approach under full-data, few-shot, and multitask settings.