pdf
bib
Proceedings of the 31st International Conference on Computational Linguistics
Owen Rambow
|
Leo Wanner
|
Marianna Apidianaki
|
Hend Al-Khalifa
|
Barbara Di Eugenio
|
Steven Schockaert
pdf
bib
abs
PreAct: Prediction Enhances Agent’s Planning Ability
Dayuan Fu
|
Jianzhao Huang
|
Siyuan Lu
|
Guanting Dong
|
Yejie Wang
|
Keqing He
|
Weiran Xu
Addressing the disparity between predictions and actual results can enable individuals to expand their thought processes and stimulate self-reflection, thus promoting accurate planning. In this research, we present **PreAct**, an agent framework that integrates **pre**diction, **rea**soning, and **act**ion. By utilizing the information derived from predictions, the large language model (LLM) agent can provide a wider range and more strategically focused reasoning. This leads to more efficient actions that aid the agent in accomplishing intricate tasks. Our experimental results show that PreAct surpasses the ReAct method in completing complex tasks and that PreAct’s performance can be further improved when paired with other memory or selection strategy techniques. We presented the model with varying quantities of historical predictions and discovered that these predictions consistently enhance LLM planning. The variances in single-step reasoning between PreAct and ReAct indicate that PreAct indeed has benefits in terms of diversity and strategic orientation over ReAct.
pdf
bib
abs
The PRECOM-SM Corpus: Gambling in Spanish Social Media
Pablo Álvarez-Ojeda
|
María Victoria Cantero-Romero
|
Anastasia Semikozova
|
Arturo Montejo-Raez
Gambling addiction is a “silent problem” in society, especially among young people in recent years due to the easy access to betting and gambling sites on the Internet through smartphones and personal computers. As online communities in messaging apps, forums and other “teenagers gathering” sites keep growing day by day, more textual information is available for study. This work focuses on collecting text from online Spanish-speaking communities and analysing it in order to find patterns in the written language of frequent and infrequent users on the collected platforms, so that an emerging gambling addiction problem can be detected. In this paper, a newly built corpus is introduced, together with an extensive description of how it was constructed. In addition, some baseline experiments on the data have been carried out, employing the features generated from the text analysis with different machine learning approaches such as the bag-of-words model or deep neural network encodings.
pdf
bib
abs
How Well Can a Long Sequence Model Model Long Sequences? Comparing Architectural Inductive Biases on Long-Context Abilities
Jerry Huang
Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens numerous downstream use cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, both in system engineering as well as model design, have enabled the scaling up of models that are purported to support extended context lengths. In particular, the state-space and linear recurrent neural network families of models can hypothetically extend to infinite sequence length. However, is this too good to be true? We conduct an evaluation to show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases have inconsistent extrapolation capabilities, highlighting the need to further study such paradigms and investigate why long-context models seemingly fail to behave as one might expect.
pdf
bib
abs
Sequential Fusion of Text-close and Text-far Representations for Multimodal Sentiment Analysis
Kaiwei Sun
|
Mi Tian
Multimodal Sentiment Analysis (MSA) aims to identify human attitudes from diverse modalities such as visual, audio and text modalities. Recent studies suggest that the text modality tends to be the most effective, which has encouraged models to consider text as its core modality. However, previous methods primarily concentrate on projecting modalities other than text into a space close to the text modality and learning an identical representation, which does not fully make use of the auxiliary information provided by audio and visual modalities. In this paper, we propose a framework, Sequential Fusion of Text-close and Text-far Representations (SFTTR), aiming to refine multimodal representations from multimodal data which should contain both representations close to and far from the text modality. Specifically, we employ contrastive learning to sufficiently explore the information similarities and differences between text and audio/visual modalities. Moreover, to fuse the extracted representations more effectively, we design a sequential cross-modal encoder to sequentially fuse representations that are close to and far from the text modality.
pdf
bib
abs
PoemBERT: A Dynamic Masking Content and Ratio Based Semantic Language Model For Chinese Poem Generation
Chihan Huang
|
Xiaobo Shen
Ancient Chinese poetry stands as a crucial treasure in Chinese culture. To address the absence of pre-trained models for ancient poetry, we introduced PoemBERT, a BERT-based model utilizing a corpus of classical Chinese poetry. Recognizing the unique emotional depth and linguistic precision of poetry, we incorporated sentiment and pinyin embeddings into the model, enhancing its sensitivity to emotional information and addressing challenges posed by the phenomenon of multiple pronunciations for the same Chinese character. Additionally, we proposed Character Importance-based masking and dynamic masking strategies, significantly augmenting the model’s capability to extract imagery-related features and handle poetry-specific information. Fine-tuning our PoemBERT model on various downstream tasks, including poem generation and sentiment classification, resulted in state-of-the-art performance in both automatic and manual evaluations. We provided explanations for the selection of the dynamic masking rate strategy and proposed a solution to the issue of a small dataset size.
pdf
bib
abs
CDA^2: Counterfactual Diffusion Augmentation for Cross-Domain Adaptation in Low-Resource Sentiment Analysis
Dancheng Xin
|
Kaiqi Zhao
|
Jingyun Sun
|
Yang Li
Domain adaptation is widely employed in cross-domain sentiment analysis, enabling the transfer of models from label-rich source domains to target domains with fewer or no labels. However, concerns have been raised regarding their robustness and sensitivity to data distribution shift, particularly when encountering significant disparities in data distribution between the different domains. To tackle this problem, we introduce CDA^2, a framework for cross-domain adaptation in low-resource sentiment analysis, which utilizes counterfactual diffusion augmentation. Specifically, it employs samples derived from domain-relevant word substitutions in source domain samples to guide the diffusion model in generating high-quality counterfactual target domain samples. We adopt a soft absorbing state and MMD loss during the training stage, and use advanced ODE solvers to expedite the sampling process. Our experiments demonstrate that CDA^2 generates high-quality target samples and achieves state-of-the-art performance in cross-domain sentiment analysis.
pdf
bib
abs
CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
Yuwei Zhao
|
Ziyang Luo
|
Yuchen Tian
|
Hongzhan Lin
|
Weixiang Yan
|
Annan Li
|
Jing Ma
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model’s code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs’ code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark’s ability to probe deeper into models’ code understanding abilities. Our benchmark is available at https://github.com/CodeLLM-Research/CodeJudge-Eval.
pdf
bib
abs
Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching
Tianshu Wang
|
Xiaoyang Chen
|
Hongyu Lin
|
Xuanang Chen
|
Xianpei Han
|
Le Sun
|
Hao Wang
|
Zhenyu Zeng
Entity matching (EM) is a critical step in entity resolution (ER). Recently, entity matching based on large language models (LLMs) has shown great promise. However, current LLM-based entity matching approaches typically follow a binary matching paradigm that ignores the global consistency among record relationships. In this paper, we investigate various methodologies for LLM-based entity matching that incorporate record interactions from different perspectives. Specifically, we comprehensively compare three representative strategies: matching, comparing, and selecting, and analyze their respective advantages and challenges in diverse scenarios. Based on our findings, we further design a compound entity matching framework (ComEM) that leverages the composition of multiple strategies and LLMs. ComEM benefits from the advantages of different sides and achieves improvements in both effectiveness and efficiency. Experimental results on 8 ER datasets and 10 LLMs verify the superiority of incorporating record interactions through the selecting strategy, as well as the further cost-effectiveness brought by ComEM.
pdf
bib
abs
InstructGEC: Enhancing Unsupervised Grammatical Error Correction with Instruction Tuning
Jiayi Deng
|
Chen Chen
|
Chunyan Hou
|
Xiaojie Yuan
Recent works have proposed methods of generating synthetic data automatically for unsupervised Grammatical Error Correction (GEC). Although a large amount of synthetic data is generated at a low cost, it is unrealistic and of poor quality. The copying phenomenon of synthetic data prevents GEC models from learning the semantic knowledge of contextual language. In this paper, we design an instruction format and use the masking strategy in both an erroneous sentence and the corresponding instruction consistently to alleviate the impact of the copy phenomenon. We also propose a novel approach, InstructGEC, which integrates the knowledge of grammatical detection into GEC models with instruction tuning to address the low-quality issue. Experiments are conducted on English and Chinese GEC datasets and results demonstrate that our method outperforms state-of-the-art unsupervised GEC methods.
pdf
bib
abs
Sibyl: Empowering Empathetic Dialogue Generation in Large Language Models via Sensible and Visionary Commonsense Inference
Lanrui Wang
|
Jiangnan Li
|
Chenxu Yang
|
Zheng Lin
|
Hongyin Tang
|
Huan Liu
|
Yanan Cao
|
Jingang Wang
|
Weiping Wang
Recently, there has been a heightened interest in building chatbots based on Large Language Models (LLMs) to emulate human-like qualities in multi-turn conversations. Despite having access to commonsense knowledge to better understand the psychological aspects and causality of dialogue context, even these powerful LLMs struggle to achieve the goals of empathy and emotional support. Current commonsense knowledge derived from dialogue contexts is inherently limited and often fails to adequately anticipate the future course of a dialogue. This lack of foresight can mislead LLMs and hinder their ability to provide effective support. In response to this challenge, we present an innovative framework named Sensible and Visionary Commonsense Knowledge (Sibyl). Designed to concentrate on the immediately succeeding dialogue, this paradigm equips LLMs with the capability to uncover the implicit requirements of the conversation, aiming to elicit more empathetic responses. Experimental results demonstrate that incorporating our paradigm for acquiring commonsense knowledge into LLMs comprehensively enhances the quality of their responses.
pdf
bib
abs
Noise-powered Multi-modal Knowledge Graph Representation Framework
Zhuo Chen
|
Yin Fang
|
Yichi Zhang
|
Lingbing Guo
|
Jiaoyan Chen
|
Jeff Z. Pan
|
Huajun Chen
|
Wen Zhang
The rise of Multi-modal Pre-training highlights the necessity for a unified Multi-Modal Knowledge Graph (MMKG) representation learning framework. Such a framework is essential for embedding structured knowledge into multi-modal Large Language Models effectively, alleviating issues like knowledge misconceptions and multi-modal hallucinations. In this work, we explore the efficacy of models in accurately embedding entities within MMKGs through two pivotal tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal Entity Alignment (MMEA). Building on this foundation, we propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking to robustly integrate multi-modal entity features in KGs. By incorporating specific training objectives for both MKGC and MMEA, our approach achieves SOTA performance across a total of ten datasets, demonstrating its versatility. Moreover, SNAG can not only function as a standalone model but also enhance other existing methods, providing stable performance improvements. Code and data are available at https://github.com/zjukg/SNAG.
pdf
bib
abs
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
Junjie Ye
|
Guanyu Li
|
SongYang Gao
|
Caishuang Huang
|
Yilong Wu
|
Sixian Li
|
Xiaoran Fan
|
Shihan Dou
|
Tao Ji
|
Qi Zhang
|
Tao Gui
|
Xuanjing Huang
Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined. Furthermore, a sole emphasis on outcomes disregards the complex capabilities required for LLMs to effectively use tools. To tackle this issue, we propose ToolEyes, a fine-grained system tailored for the evaluation of the LLMs’ tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world. Evaluations involving ten LLMs across three categories reveal a preference for specific scenarios and limited cognitive abilities in tool learning. Intriguingly, expanding the model size even exacerbates the hindrance to tool learning. The code and data are available at https://github.com/Junjie-Ye/ToolEyes.
pdf
bib
abs
Federated Incremental Named Entity Recognition
Zesheng Liu
|
Qiannan Zhu
|
Cuiping Li
|
Hong Chen
Federated learning-based Named Entity Recognition (FNER) has attracted widespread attention through decentralized training on local clients. However, most FNER models assume that entity types are pre-fixed, whereas in practical applications local clients constantly receive new entity types without enough storage to access old ones, resulting in severe forgetting of previously learned knowledge. In addition, new clients collecting only new entity types may join the global training of FNER irregularly, further exacerbating catastrophic forgetting. To overcome the above challenges, we propose a Forgetting-Subdued Learning (FSL) model which addresses the forgetting problem on old entity types from both the intra-client and inter-client aspects. Specifically, for the intra-client aspect, we propose a prototype-guided adaptive pseudo labeling and a prototypical relation distillation loss to surmount catastrophic forgetting of old entity types with semantic shift. Furthermore, for the inter-client aspect, we propose a task transfer detector that can identify the arrival of new entity types protected by privacy and store the latest old global model for relation distillation. Qualitative experiments show that our model achieves significant improvements compared to several baseline methods.
pdf
bib
abs
Large Language Models are Good Annotators for Type-aware Data Augmentation in Grammatical Error Correction
Xinyuan Li
|
Yunshi Lan
Large Language Models (LLMs) have achieved outstanding performance across various NLP tasks. Grammatical Error Correction (GEC) is a task aiming at automatically correcting grammatical errors in text, but it encounters a severe shortage of annotated data. Researchers have tried to make full use of the generalization capabilities of LLMs and prompt them to correct erroneous sentences, which however results in unexpected over-correction issues. In this paper, we rethink the role of LLMs in GEC tasks and propose a method, namely TypeDA, considering LLMs as the annotators for type-aware data augmentation in GEC tasks. Different from the existing data augmentation methods, our method prevents in-distribution corruption and is able to generate sentences with multi-granularity error types. Our experiments verify that our method can generally improve the GEC performance of different backbone models with only a small amount of augmented data. Further analyses verify the high consistency and diversity of the pseudo data generated via our method.
pdf
bib
abs
Looks can be Deceptive: Distinguishing Repetition Disfluency from Reduplication
Arif A. Ahmad
|
Khyathi Gayathri Mothika
|
Pushpak Bhattacharyya
Reduplication and repetition, though similar in form, serve distinct linguistic purposes. Reduplication is a deliberate morphological process used to express grammatical, semantic, or pragmatic nuances, while repetition is often unintentional and indicative of disfluency. This paper presents the first large-scale study of reduplication and repetition in speech using computational linguistics. We introduce IndicRedRep, a new publicly available dataset containing Hindi, Telugu, and Marathi text annotated with reduplication and repetition at the word level. We evaluate transformer-based models for multi-class reduplication and repetition token classification, utilizing the Reparandum-Interregnum-Repair structure to distinguish between the two phenomena. Our models achieve macro F1 scores of up to 85.62% in Hindi, 83.95% in Telugu, and 84.82% in Marathi for reduplication-repetition classification.
pdf
bib
abs
Learning to Verify Summary Facts with Fine-Grained LLM Feedback
Jihwan Oh
|
Jeonghwan Choi
|
Nicole Hee-Yoen Kim
|
Taewon Yun
|
Hwanjun Song
Training automatic summary fact verifiers often faces the challenge of a lack of human-labeled data. In this paper, we explore an alternative way of leveraging Large Language Model (LLM)-generated feedback to address the inherent limitation of using human-labeled data. We introduce FineSumFact, a large-scale dataset containing fine-grained factual feedback on summaries. We employ 10 distinct LLMs for diverse summary generation and Llama-3-70B-Instruct for feedback. We utilize this dataset to fine-tune the lightweight open-source model Llama-3-8B-Instruct, optimizing resource efficiency while maintaining high performance. Our experimental results reveal that the model trained on extensive LLM-generated datasets surpasses that trained on smaller human-annotated datasets when evaluated using human-generated test sets. Fine-tuning fact verification models with LLM feedback can be more effective and cost-efficient than using human feedback. The dataset is available at https://github.com/DISL-Lab/FineSumFact.
pdf
bib
abs
FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language Models
Tao Fan
|
Guoqiang Ma
|
Yan Kang
|
Hanlin Gu
|
Yuanfeng Song
|
Lixin Fan
|
Kai Chen
|
Qiang Yang
Recent research in federated large language models (LLMs) has primarily focused on enabling clients to fine-tune their locally deployed homogeneous LLMs collaboratively or on transferring knowledge from server-based LLMs to small language models (SLMs) at downstream clients. However, a significant gap remains in the simultaneous mutual enhancement of both the server’s LLM and clients’ SLMs. To bridge this gap, we propose FedMKT, a parameter-efficient federated mutual knowledge transfer framework for large and small language models. This framework is designed to adaptively transfer knowledge from the server’s LLM to clients’ SLMs while concurrently enhancing the LLM with clients’ unique domain insights. We facilitate token alignment using minimum edit distance (MinED) and then selective mutual knowledge transfer between client-side SLMs and a server-side LLM, aiming to collectively enhance their performance. Through extensive experiments across three distinct scenarios, we evaluate the effectiveness of FedMKT by utilizing diverse public LLMs and SLMs on a variety of NLP text generation tasks. Empirical results demonstrate that FedMKT simultaneously boosts the performance of both LLMs and SLMs. Our code has been contributed to the FATE open-source project and is now publicly accessible at
https://github.com/FederatedAI/FATE-LLM/tree/main/python/fate_llm/algo/fedmkt
pdf
bib
abs
Dynamic Graph Neural ODE Network for Multi-modal Emotion Recognition in Conversation
Yuntao Shou
|
Tao Meng
|
Wei Ai
|
Keqin Li
Multimodal emotion recognition in conversation (MERC) refers to identifying and classifying human emotional states by combining data from multiple different modalities (e.g., audio, images, text, video, etc.). Specifically, human emotional expressions are often complex and diverse, and these complex emotional expressions can be captured and understood more comprehensively through the fusion of multimodal information. Most existing graph-based multimodal emotion recognition methods can only use shallow GCNs to extract emotion features and fail to capture the temporal dependencies caused by dynamic changes in emotions. To address the above problems, we propose a Dynamic Graph Neural Ordinary Differential Equation Network (DGODE) for multimodal emotion recognition in conversation, which combines the dynamic changes of emotions to capture the temporal dependency of speakers’ emotions. Technically, the key idea of DGODE is to use the graph ODE evolution network to characterize the continuous dynamics of node representations over time and capture temporal dependencies. Extensive experiments on two publicly available multimodal emotion recognition datasets demonstrate that the proposed DGODE model has superior performance compared to various baselines. Furthermore, the proposed DGODE can also alleviate the over-smoothing problem, thereby enabling the construction of a deep GCN network.
pdf
bib
abs
HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding
Peng Xia
|
Xingtong Yu
|
Ming Hu
|
Lie Ju
|
Zhiyong Wang
|
Peibo Duan
|
Zongyuan Ge
Object categories are typically organized into a multi-granularity taxonomic hierarchy. When classifying categories at different hierarchy levels, traditional uni-modal approaches focus primarily on image features, revealing limitations in complex scenarios. Recent studies integrating Vision-Language Models (VLMs) with class hierarchies have shown promise, yet they fall short of fully exploiting the hierarchical relationships. These efforts are constrained by their inability to perform effectively across varied granularity of categories. To tackle this issue, we propose a novel framework (**HGCLIP**) that effectively combines **CLIP** with a deeper exploitation of the **H**ierarchical class structure via **G**raph representation learning. We explore constructing the class hierarchy into a graph, with its nodes representing the textual or image features of each category. After passing through a graph encoder, the textual features incorporate hierarchical structure information, while the image features emphasize class-aware features derived from prototypes through the attention mechanism. Our approach demonstrates significant improvements on 11 diverse visual recognition benchmarks. Our codes are fully available at https://github.com/richard-peng-xia/HGCLIP.
pdf
bib
abs
Persona-DB: Efficient Large Language Model Personalization for Response Prediction with Collaborative Data Refinement
Chenkai Sun
|
Ke Yang
|
Revanth Gangi Reddy
|
Yi Fung
|
Hou Pong Chan
|
Kevin Small
|
ChengXiang Zhai
|
Heng Ji
The increasing demand for personalized interactions with large language models (LLMs) calls for methodologies capable of accurately and efficiently identifying user opinions and preferences. Retrieval augmentation emerges as an effective strategy, as it can accommodate a vast number of users without the costs from fine-tuning. Existing research, however, has largely focused on enhancing the retrieval stage and devoted limited exploration toward optimizing the representation of the database, a crucial aspect for tasks such as personalization. In this work, we examine the problem from a novel angle, focusing on how data can be better represented for more data-efficient retrieval in the context of LLM customization. To tackle this challenge, we introduce Persona-DB, a simple yet effective framework consisting of a hierarchical construction process to improve generalization across task contexts and collaborative refinement to effectively bridge knowledge gaps among users. In the evaluation of response prediction, Persona-DB demonstrates superior context efficiency in maintaining accuracy with a significantly reduced retrieval size, a critical advantage in scenarios with extensive histories or limited context windows. Our experiments also indicate a marked improvement of over 10% under cold-start scenarios, when users have extremely sparse data. Furthermore, our analysis reveals the increasing importance of collaborative knowledge as the retrieval capacity expands.
pdf
bib
abs
Style Over Substance: Evaluation Biases for Large Language Models
Minghao Wu
|
Alham Fikri Aji
As large language models (LLMs) continue to advance, accurately and comprehensively evaluating their performance becomes increasingly challenging. Ranking the relative performance of LLMs based on Elo ratings, according to human or LLM judgment, is gaining more popularity. However, the extent to which humans and LLMs are capable evaluators remains uncertain. This study investigates the behavior of crowd-sourced and expert annotators, as well as LLMs, when comparing outputs from different models. To achieve this, we curate a dataset of intentionally flawed, machine-generated answers. Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors. To address this issue, we propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System (MERS). Empirical results from our study reveal that this proposed approach significantly enhances the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, there is no significant improvement in crowd-sourced evaluations, indicating the need for further investigation.
pdf
bib
abs
Multimodal Aspect-Based Sentiment Analysis under Conditional Relation
Xinjing Liu
|
Ruifan Li
|
Shuqin Ye
|
Guangwei Zhang
|
Xiaojie Wang
Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to extract aspect terms from text-image pairs and identify their sentiments. Previous methods are based on the premise that the image contains the objects referred to by the aspects within the text. However, this condition cannot always be met, resulting in suboptimal performance. In this paper, we propose a COnditional Relation based Sentiment Analysis framework (CORSA). Specifically, we design a conditional relation detector (CRD) to mitigate the impact of images that do not satisfy this condition. Moreover, we design a visual object localizer (VOL) to locate the exact condition-related visual regions associated with the aspects. With CRD and VOL, our CORSA framework takes a multi-task form. In addition, to effectively train CORSA, we conduct two types of annotation. One is the conditional relation, annotated using a pretrained referring expression comprehension model; the other is the bounding boxes of visual objects, annotated by a pretrained object detection model. Experiments on our built C-MABSA dataset show that CORSA consistently outperforms existing methods. The code and data are available at https://github.com/Liuxj-Anya/CORSA.
pdf
bib
abs
Semantic Role Labeling of NomBank Partitives
Adam Meyers
|
Advait Pravin Savant
|
John E. Ortega
This article is about Semantic Role Labeling for English partitive nouns (5%/REL of the price/ARG1; The price/ARG1 rose 5 percent/REL) in the NomBank annotated corpus. Several systems are described using traditional and transformer-based machine learning, as well as ensembling. Our highest scoring system achieves an F1 of 91.74% using “gold” parses from the Penn Treebank and 91.12% when using the Berkeley Neural parser. This research includes both classroom and experimental settings for system development.
pdf
bib
abs
MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation
Dongjun Lee
|
Choongwon Park
|
Jaehyuk Kim
|
Heesoo Park
Recent advancements in large language models (LLMs) have enabled in-context learning (ICL)-based methods that significantly outperform fine-tuning approaches for text-to-SQL tasks. However, their performance is still considerably lower than that of human experts on benchmarks that include complex schemas and queries, such as BIRD. This study considers the sensitivity of LLMs to the prompts and introduces a novel approach that leverages multiple prompts to explore a broader search space for possible answers and effectively aggregate them. Specifically, we robustly refine the database schema through schema linking using multiple prompts. Thereafter, we generate various candidate SQL queries based on the refined schema and diverse prompts. Finally, the candidate queries are filtered based on their confidence scores, and the optimal query is obtained through a multiple-choice selection that is presented to the LLM. When evaluated on the BIRD and Spider benchmarks, the proposed method achieved execution accuracies of 65.5% and 89.6%, respectively, significantly outperforming previous ICL-based methods.
pdf
bib
abs
InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery
He Cao
|
Zijing Liu
|
Xingyu Lu
|
Yuan Yao
|
Yu Li
The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialists, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.
pdf
bib
abs
Ambiguity-aware Multi-level Incongruity Fusion Network for Multi-Modal Sarcasm Detection
Kuntao Li
|
Yifan Chen
|
Qiaofeng Wu
|
Weixing Mai
|
Fenghuan Li
|
Yun Xue
Multi-modal sarcasm detection aims to identify whether a given image-text pair is sarcastic. The pivotal factor of the task lies in accurately capturing incongruities from different modalities. Although existing studies have achieved impressive success, they have primarily focused on fusing textual and visual information to establish cross-modal correlations, overlooking the significance of original unimodal incongruity information at the text level and image level. Furthermore, existing fusion strategies for cross-modal information neglect the effect of the inherent ambiguity within the text and image modalities on multimodal fusion. To overcome these limitations, we propose a novel Ambiguity-aware Multi-level Incongruity Fusion Network (AMIF) for multi-modal sarcasm detection. Our method involves a multi-level incongruity learning module to capture incongruity information simultaneously at the text, image and cross-modal levels. Additionally, an ambiguity-based fusion module is developed to dynamically learn reasonable weights and interpretably aggregate incongruity features from different levels. Comprehensive experiments conducted on a publicly available dataset demonstrate the superiority of our proposed model over state-of-the-art methods.
pdf
bib
abs
AdminSet and AdminBERT: a Dataset and a Pre-trained Language Model to Explore the Unstructured Maze of French Administrative Documents
Thomas Sebbag
|
Solen Quiniou
|
Nicolas Stucky
|
Emmanuel Morin
In recent years, Pre-trained Language Models (PLMs) have been widely used to analyze various documents, playing a crucial role in Natural Language Processing (NLP). However, administrative texts have rarely been used in information extraction tasks, even though this resource is available as open data in many countries. Most of these texts contain many domain-specific terms. Moreover, especially in France, they are unstructured because many administrations produce them without a standardized framework. As a result, current language models do not process these documents correctly. In this paper, we propose AdminBERT, the first French pre-trained language model for the administrative domain. Since the interesting information in such texts corresponds to named entities and the relations between them, we compare this PLM with general-domain language models fine-tuned on the Named Entity Recognition (NER) task applied to administrative texts, as well as with a Large Language Model (LLM) and with a language model whose architecture differs from BERT. We show that taking advantage of a PLM for French administrative data increases performance on these texts in both the administrative and general domains. We also release AdminBERT as well as AdminSet, the pre-training corpus of administrative texts in French, and the subset AdminSet-NER, the first NER dataset consisting exclusively of administrative texts in French.
pdf
bib
abs
ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
Thibaut Thonet
|
Laurent Besacier
|
Jos Rozen
Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending the models’ context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, we propose a new benchmark for long-context LLMs focused on a practical meeting assistant scenario in which the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, ELITR-Bench, augments the existing ELITR corpus by adding 271 manually crafted questions with their ground-truth answers, as well as noisy versions of meeting transcripts altered to target different Word Error Rate levels. Our experiments with 12 long-context LLMs on ELITR-Bench confirm the progress made across successive generations of both proprietary and open models, and point out their discrepancies in terms of robustness to transcript noise. We also provide a thorough analysis of our GPT-4-based evaluation, including insights from a crowdsourcing study. Our findings indicate that while GPT-4’s scores align with human judges, its ability to distinguish beyond three score levels may be limited.
pdf
bib
abs
Positive Text Reframing under Multi-strategy Optimization
Shutong Jia
|
Biwei Cao
|
Qingqing Gao
|
Jiuxin Cao
|
Bo Liu
Differing from sentiment transfer, positive reframing seeks to substitute negative perspectives with positive expressions while preserving the original meaning. With the emergence of pre-trained language models (PLMs), it is possible to achieve acceptable results by fine-tuning PLMs. Nevertheless, generating fluent, diverse and task-constrained reframing text remains a significant challenge. To tackle this issue, a **m**ulti-**s**trategy **o**ptimization **f**ramework (MSOF) is proposed in this paper. Starting from the objective of positive reframing, we first design positive sentiment reward and content preservation reward to encourage the model to transform the negative expressions of the original text while ensuring the integrity and consistency of the semantics. Then, different decoding optimization approaches are introduced to improve the quality of text generation. Finally, based on the modeling formula of positive reframing, we propose a multi-dimensional re-ranking method that further selects candidate sentences from three dimensions: strategy consistency, text similarity and fluency. Extensive experiments on two Seq2Seq PLMs, BART and T5, demonstrate our framework achieves significant improvements on unconstrained and controlled positive reframing tasks.
pdf
bib
abs
RAM2C: A Liberal Arts Educational Chatbot based on Retrieval-augmented Multi-role Multi-expert Collaboration
Haoyu Huang
|
Tong Niu
|
Rui Yang
|
Luping Shi
Recently, many studies have focused on utilizing large language models (LLMs) in educational dialogues. In liberal arts dialogues especially, educators must balance Humanized communication, Teaching expertise, and Safety-ethics (HTS), in addition to the subject knowledge itself. However, because collecting massive amounts of HTS-compliant teaching dialogues from the real world as a training corpus is expensive, the outputs of existing LLMs in teaching dialogues fall short of human standards. To address this, we design a Retrieval-augmented Multi-role Multi-expert Collaboration (RAM2C) framework to automatically generate such dialogue data. Specifically, we first establish HTS-guided knowledge bases, encompassing domain knowledge in teaching skills, psychology, and safety ethics. Then, RAM2C organizes LLMs, which are retrieval-augmented by the above knowledge bases, into multi-expert groups with distinct roles to generate the HTS-compliant educational dialogue dataset. We then fine-tune the LLMs using this dataset. Empirical evaluations indicate that RAM2C-empowered LLMs excel in Chinese reading teaching, offering more personalized and ethically safe teaching responses, demonstrating RAM2C’s practicality and high quality. We release the experiments at https://github.com/ram2c/ram2c.
pdf
bib
abs
SURE: Mutually Visible Objects and Self-generated Candidate Labels For Relation Extraction
Yuxuan Feng
|
Qian Chen
|
Qianyou Wu
|
Xin Guo
|
Suge Wang
Joint relation extraction models effectively mitigate the error propagation problem inherent in pipeline models. Nevertheless, joint models face challenges including high computational complexity, complex network architectures, difficult parameter tuning, and, notably, limited interpretability. In contrast, recent advances in pipeline relation extraction models (PURE, PL-Marker) have attracted considerable attention due to their lightweight design and high extraction accuracy. A key advancement is the introduction of a marker mechanism, which enhances the relation extraction (RE) process by highlighting entities. However, these models primarily focus on generating correct labels. In doing so, they neglect the label selection process. Moreover, they fail to adequately capture the intricate interactions between entity pairs. To overcome these limitations, we develop a Candidate Label Markers (CLMs) mechanism that prioritizes strategic label selection over simple label generation. Furthermore, we facilitate interactions among diverse relation pairs, enabling the identification of more intricate relational patterns. Experimental results show that we achieve a new SOTA performance. Specifically, based on the same Named Entity Recognition (NER) results as previous methods, we improve on the SOTA methods by 2.5%, 1.9%, and 1.2% in terms of strict F1 scores on SciERC, ACE05, and ACE04.
pdf
bib
abs
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data
Yihong Liu
|
Chunlan Ma
|
Haotian Ye
|
Hinrich Schütze
Transliterating related languages that use different scripts into a common script is effective for improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is undesirable because it requires a large computation budget. A more promising way is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI). TransMI can create strong baselines for data that is transliterated into a common script by exploiting an existing mPLM and its tokenizer without any training. TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords. We apply TransMI to three strong recent mPLMs. Our experiments demonstrate that TransMI not only preserves the mPLM’s ability to handle non-transliterated data, but also enables it to effectively process transliterated data, thereby facilitating crosslingual transfer across scripts. The results show consistent improvements of 3% to 34% for different mPLMs and tasks. We make our code and models publicly available at
https://github.com/cisnlp/TransMI.
pdf
bib
abs
Two-stage Incomplete Utterance Rewriting on Editing Operation
Zhiyu Cao
|
Peifeng Li
|
Qiaoming Zhu
|
Yaxin Fan
Previous work on Incomplete Utterance Rewriting (IUR) has primarily focused on generating rewritten utterances based solely on dialogue context, ignoring the widespread phenomenon of coreference and ellipsis in dialogues. To address this issue, we propose a novel framework called TEO (Two-stage approach on Editing Operation) for IUR, in which the first stage generates editing operations and the second stage rewrites incomplete utterances utilizing the generated editing operations and the dialogue context. Furthermore, an adversarial perturbation strategy is proposed to mitigate cascading errors and exposure bias caused by the inconsistency between training and inference in the second stage. Experimental results on three IUR datasets show that our TEO outperforms the SOTA models significantly.
pdf
bib
abs
QuickLLaMA: Query-aware Inference Acceleration for Large Language Models
Jingyao Li
|
Han Shi
|
Sitong Wu
|
Chuanyang Zheng
|
Zhenguo Li
|
Xin Jiang
|
Hong Xu
|
Jiaya Jia
The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still struggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It doesn’t require extra training and can be seamlessly integrated with any LLMs. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer the questions. On widely recognized benchmarks, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral on the ∞-bench. On the Needle-in-a-Haystack and BABILong tasks, Q-LLM improved upon the current SOTA by 7.0% and 6.1%. Our code is available at https://github.com/dvlab-research/Q-LLM.
pdf
bib
abs
SVD-GCL: A Noise-Augmented Hybrid Graph Contrastive Learning Framework for Recommendation
Liping Wang
|
Shichao Li
|
Hui Wang
|
Yuyan Gao
|
Mingyao Wei
Recently, deep graph neural networks (GNNs) have emerged as the predominant architecture for recommender systems based on collaborative filtering. Nevertheless, numerous GNN-based approaches confront challenges such as complex computations and skewed feature distributions, especially with high-dimensional, sparse, and noisy data, making it difficult to accurately capture user preferences. To tackle these issues, we introduce SVD-GCL, a streamlined graph contrastive learning recommendation model based on noise augmentation that integrates truncated singular value decomposition in the feature engineering stage. This hybrid optimization approach reduces the dimensionality and denoises the original data. Through extracting self-supervised signals and gradually adding noise to embeddings in the training phase to enrich data samples, the data sparsity is effectively alleviated. Experimental outcomes on three large public benchmark datasets illustrate that SVD-GCL effectively manages high-dimensional sparse data, remains stable in the presence of noise, and provides significant advantages in computational efficiency, recommendation performance, and robustness.
pdf
bib
abs
MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL
Bing Wang
|
Changyu Ren
|
Jian Yang
|
Xinnian Liang
|
Jiaqi Bai
|
LinZheng Chai
|
Zhao Yan
|
Qian-Wen Zhang
|
Di Yin
|
Xing Sun
|
Zhoujun Li
Recent LLM-based Text-to-SQL methods usually suffer from significant performance degradation on “huge” databases and complex user questions that require multi-step reasoning. Moreover, most existing methods neglect the crucial significance of LLMs utilizing external tools and model collaboration. To address these challenges, we introduce MAC-SQL, a novel LLM-based multi-agent collaborative framework. Our framework comprises a core decomposer agent for Text-to-SQL generation with few-shot chain-of-thought reasoning, accompanied by two auxiliary agents that utilize external tools or models to acquire smaller sub-databases and refine erroneous SQL queries. The decomposer agent collaborates with auxiliary agents, which are activated as needed and can be expanded to accommodate new features or tools for effective Text-to-SQL parsing. In our framework, we initially leverage GPT-4 as the strong backbone LLM for all agent tasks to determine the upper bound of our framework. We then fine-tune an open-source instruction-following model, SQL-Llama, based on Code Llama 7B, to accomplish all tasks as GPT-4 does. Experiments show that SQL-Llama achieves a comparable execution accuracy of 43.94, compared to the baseline accuracy of 46.35 for vanilla GPT-4. At the time of writing, MAC-SQL+GPT-4 achieves an execution accuracy of 59.59 when evaluated on the BIRD benchmark, establishing a new state-of-the-art (SOTA) on its holdout test set.
pdf
bib
abs
Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers?
Mingyu Jin
|
Qinkai Yu
|
Jingyuan Huang
|
Qingcheng Zeng
|
Zhenting Wang
|
Wenyue Hua
|
Haiyan Zhao
|
Kai Mei
|
Yanda Meng
|
Kaize Ding
|
Fan Yang
|
Mengnan Du
|
Yongfeng Zhang
Large language models (LLMs) have shown remarkable performance across a wide range of tasks. However, the mechanisms by which these models encode tasks of varying complexities remain poorly understood. In this paper, we explore the hypothesis that LLMs process concepts of varying complexities in different layers, introducing the idea of “Concept Depth” to suggest that more complex concepts are typically acquired in deeper layers. Specifically, we categorize concepts based on their level of abstraction, defining them in the order of increasing complexity within factual, emotional, and inferential tasks. We conduct extensive probing experiments using layer-wise representations across various LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the three domains of tasks. Our findings reveal that probing for simpler tasks can be conducted efficiently in shallow layers, while more complex tasks typically necessitate deeper layers for accurate understanding. Additionally, we examine how external factors, such as adding noise to the input and quantizing the model weights, might affect layer-wise representations. Our findings suggest that these factors can impede the development of a conceptual understanding of LLMs until deeper layers are explored. We hope that our proposed concept and experimental insights will enhance the understanding of the mechanisms underlying LLMs. Our codes are available at https://github.com/Luckfort/CD.
pdf
bib
abs
Knowledge Graph Entity Typing with Curriculum Contrastive Learning
Hao Wang
|
Minghua Nuo
|
Shan Jiang
The Knowledge Graph Entity Typing (KGET) task aims to predict missing type annotations for entities in knowledge graphs. Most recent studies only focus on the structural information from an entity’s neighborhood or on semantic information from textual representations of entities or relations. In this paper, inspired by curriculum learning and contrastive learning, we propose the CCLET model, which uses a Curriculum Contrastive Learning strategy for KGET and employs a Pre-trained Language Model (PLM) and a graph model to fuse entity-related semantic information and the structural information of the Knowledge Graph (KG), respectively. Our CCLET model consists of two main parts. In the Knowledge Fusion part, we design an Enhanced-MLP architecture to fuse the text of the entity’s description, related triplets, and tuples. In the Curriculum Contrastive Learning part, we define the difficulty of each course by controlling the level of added noise, aiming to learn accurately with a curriculum contrastive learning strategy that proceeds from easy to difficult. Our extensive experiments demonstrate that the CCLET model outperforms recent state-of-the-art models, verifying its effectiveness on the KGET task.
pdf
bib
abs
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
Zihui Wu
|
Haichang Gao
|
Jianping He
|
Ping Wang
Large language models (LLMs) have demonstrated remarkable capabilities, but their power comes with significant security considerations. While extensive research has been conducted on the safety of LLMs in chat mode, the security implications of their function calling feature have been largely overlooked. This paper uncovers a critical vulnerability in the function calling process of LLMs, introducing a novel “jailbreak function” attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters. Our empirical study, conducted on six state-of-the-art LLMs including GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-pro, reveals an alarming average success rate of over 90% for this attack. We provide a comprehensive analysis of why function calls are susceptible to such attacks and propose defensive strategies, including the use of defensive prompts. Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs, contributing to the field of AI safety by identifying a previously unexplored risk, designing an effective attack method, and suggesting practical defensive measures.
pdf
bib
abs
Adapters Selector: Cross-domains and Multi-tasks LoRA Modules Integration Usage Method
Yimin Tian
|
Bolin Zhang
|
Zhiying Tu
|
Dianhui Chu
Parameter-Efficient Fine-Tuning (PEFT) adapts large language models (LLMs) to specific domains by updating only a small portion of the parameters. Although fine-tuning on a single task within a specific domain has demonstrated promising results, there remains limited exploration of how to effectively integrate these adapters for optimal performance. In this paper, we propose Adapters Selector (AS): a novel framework for better integrating the usage of multiple adapters by training a middleman adapter to select the appropriate adapter for inference. Our approach utilizes PEFT to train a selector that determines which input content corresponds to which task in which domain, and subsequently selects the corresponding adapter. In this way, AS can effectively execute cross-domain multi-task inference by combining a compact model with multiple LoRA modules. Our code is publicly available.
pdf
bib
abs
XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser
Xianfu Cheng
|
Hang Zhang
|
Jian Yang
|
Xiang Li
|
Weixiao Zhou
|
Fei Liu
|
Kui Wu
|
Xiangyuan Guan
|
Tao Sun
|
Xianjie Wu
|
Tongliang Li
|
Zhoujun Li
In the domain of Document AI, parsing semi-structured image forms is a crucial Key Information Extraction (KIE) task. The advent of pre-trained multimodal models significantly empowers Document AI frameworks to extract key information from form documents in different formats such as PDF, Word, and images. Nonetheless, form parsing is still encumbered by notable challenges such as subpar capabilities in multilingual parsing and diminished recall in industrial contexts with rich text and rich visuals. In this work, we introduce a simple but effective Multimodal and Multilingual semi-structured FORM PARSER (XFormParser), which is anchored on a comprehensive Transformer-based pre-trained language model and innovatively amalgamates semantic entity recognition (SER) and relation extraction (RE) into a unified framework. Combined with a Bi-LSTM, it significantly improves the performance of multilingual parsing. Furthermore, we develop InDFormSFT, a pioneering supervised fine-tuning (SFT) industrial dataset that specifically addresses the parsing needs of forms in a variety of industrial contexts. Through rigorous testing on established benchmarks, XFormParser has demonstrated its unparalleled effectiveness and robustness. Compared to existing state-of-the-art (SOTA) models, XFormParser notably achieves up to 1.79% F1 score improvement on RE tasks in language-specific settings. It also exhibits exceptional improvements in cross-task performance in both multilingual and zero-shot settings.
pdf
bib
abs
Debiasing by obfuscating with 007-classifiers promotes fairness in multi-community settings
Ingroj Shrestha
|
Padmini Srinivasan
While there has been a considerable amount of research on bias mitigation algorithms, two properties, a multi-community perspective and fairness to *all* communities, have not been given sufficient attention. Focusing on these, we propose an obfuscation-based data augmentation debiasing approach. In it we add to the training data *obfuscated* versions of *all* false positive instances irrespective of source community. We test our approach by debiasing toxicity classifiers built using 5 neural models (a multi-layer perceptron model and masked language models) and 3 datasets in a 4-community setting. We also explore 4 different obfuscators for debiasing. Results demonstrate the merits of our approach: bias is reduced for almost all of our runs without sacrificing false positive rates or F1 scores for minority or majority communities. In contrast, the 4 state-of-the-art baselines typically make performance sacrifices (often large) while reducing bias. Crucially, we demonstrate that it is possible to debias while maintaining standards for both minority and majority communities.
pdf
bib
abs
Graph Representation Learning in Hyperbolic Space via Dual-Masked
Rui Gong
|
Zuyun Jiang
|
Daren Zha
Graph representation learning (GRL) in hyperbolic space has gradually emerged as a promising approach. Meanwhile, masking and reconstruction-based (MR-based) methods lead to state-of-the-art self-supervised graph representations. However, existing MR-based methods do not fully consider deep node and structural information. Inspired by the recent active and emerging field of self-supervised learning, we propose a novel node and edge dual-masked self-supervised graph representation learning framework in hyperbolic space, named HDM-GAE. We have designed a graph dual-masked module and a hyperbolic structural self-attention encoder module to mask nodes or edges and perform node aggregation within hyperbolic space, respectively. Comprehensive experiments and ablation studies on real-world multi-category datasets demonstrate the superiority of our method in downstream tasks such as node classification and link prediction.
pdf
bib
abs
Perturbation-driven Dual Auxiliary Contrastive Learning for Collaborative Filtering Recommendation
Caihong Mu
|
Keyang Zhang
|
Jialiang Zhou
|
Yi Liu
Graph collaborative filtering has made great progress in recommender systems, but these methods often struggle with the data sparsity issue in real-world recommendation scenarios. To mitigate the effect of data sparsity, graph collaborative filtering incorporates contrastive learning as an auxiliary task to improve model performance. However, existing contrastive learning-based methods generally use a single data augmentation graph to construct the auxiliary contrastive learning task, which has problems such as loss of key information and low robustness. To address these problems, this paper proposes a Perturbation-driven Dual Auxiliary Contrastive Learning for Collaborative Filtering Recommendation (PDACL). PDACL designs structure perturbation and weight perturbation to construct two data augmentation graphs. The Structure Perturbation Augmentation (SPA) graph perturbs the topology of the user-item interaction graph, while the Weight Perturbation Augmentation (WPA) graph reconstructs the implicit feedback unweighted graph into a weighted graph similar to the explicit feedback. These two data augmentation graphs are combined with the user-item interaction graph to construct the dual auxiliary contrastive learning task, which extracts self-supervised signals without losing key information and is jointly optimized together with the supervised recommendation task to alleviate the data sparsity problem and improve performance. Experimental results on multiple public datasets show that PDACL outperforms numerous benchmark models, demonstrating that the dual-perturbation data augmentation graphs in PDACL can overcome the shortcomings of a single data augmentation graph, leading to superior recommendation results. The implementation of our work can be found at https://github.com/zky77/PDACL.
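As an illustration of the structure-perturbation idea described in this abstract, the sketch below randomly drops a fraction of edges from a user-item interaction list to create one augmented view for contrastive learning. The drop rate, edge format, and omission of the weight-perturbation view are illustrative assumptions, not PDACL's actual implementation.

```python
# Sketch of structure perturbation for a user-item interaction graph: randomly
# drop a fraction of interaction edges to build an augmented view for contrastive
# learning. The drop rate is an illustrative assumption; PDACL additionally builds
# a weight-perturbation view that is not shown here.
import numpy as np

rng = np.random.default_rng(0)

# Interaction edges as (user_id, item_id) pairs (toy data).
edges = np.array(
    [(u, i) for u in range(100) for i in rng.choice(500, size=5, replace=False)]
)

def structure_perturbation(edges: np.ndarray, drop_rate: float = 0.1) -> np.ndarray:
    """Return a perturbed edge list with roughly `drop_rate` of edges removed."""
    keep_mask = rng.random(len(edges)) >= drop_rate
    return edges[keep_mask]

augmented_view = structure_perturbation(edges, drop_rate=0.1)
print(len(edges), "->", len(augmented_view), "edges after perturbation")
```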
pdf
bib
abs
Enhancing Reranking for Recommendation with LLMs through User Preference Retrieval
Haobo Zhang
|
Qiannan Zhu
|
Zhicheng Dou
Recently, large language models (LLMs) have shown the potential to enhance recommendations due to their sufficient knowledge and remarkable summarization ability. However, existing LLM-powered recommendation methods may create redundant output, generating irrelevant information about the user’s preferences on candidate items from user behavior sequences. To address this issue, we propose UR4Rec, a framework that enhances reranking for recommendation with large language models through user preference retrieval. Specifically, UR4Rec develops a small transformer-based user preference retriever towards candidate items to build a bridge between LLMs and recommendation, focusing on producing the essential knowledge through LLMs from user behavior sequences to enhance reranking for recommendation. Our experimental results on three real-world public datasets demonstrate the superiority of UR4Rec over existing baseline models.
pdf
bib
abs
SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task
Zijie Zhong
|
Linqing Zhong
|
Zhaoze Sun
|
Qingyun Jin
|
Zengchang Qin
|
Xiaofan Zhang
Integrating Large Language Models (LLMs) with existing Knowledge Graph (KG) databases presents a promising avenue for enhancing LLMs’ efficacy and mitigating their “hallucinations”. Given that most KGs reside in graph databases accessible solely through specialized query languages (e.g., Cypher), it is critical to connect LLMs with KG databases by automating the translation of natural language into Cypher queries (termed the “Text2Cypher” task). Prior efforts tried to bolster LLMs’ proficiency in Cypher generation through Supervised Fine-Tuning (SFT). However, these explorations are hindered by the lack of annotated datasets of Query-Cypher pairs, resulting from the labor-intensive and domain-specific nature of such annotation. In this study, we propose SyntheT2C, a methodology for constructing a synthetic Query-Cypher pair dataset, comprising two distinct pipelines: (1) LLM-based prompting and (2) template-filling. SyntheT2C is applied to two medical KG databases, culminating in the creation of a synthetic dataset, MedT2C. Comprehensive experiments demonstrate that the MedT2C dataset effectively enhances the performance of backbone LLMs on the Text2Cypher task via SFT. Both the SyntheT2C codebase and the MedT2C dataset will be released.
pdf
bib
abs
Language Models Encode the Value of Numbers Linearly
Fangwei Zhu
|
Damai Dai
|
Zhifang Sui
Large language models (LLMs) have exhibited impressive competence in various tasks, but their internal mechanisms for mathematical problems are still under-explored. In this paper, we study a fundamental question: how language models encode the value of numbers, a basic element in math. To study the question, we construct a synthetic dataset comprising addition problems and utilize linear probes to read out input numbers from the hidden states. Experimental results support the existence of encoded number values in LLMs across different layers, and these values can be extracted via linear probes. Further experiments show that LLMs store their calculation results in a similar manner, and we can intervene on the output via simple vector additions, proving the causal connection between encoded numbers and language model outputs. Our research provides evidence that LLMs encode the value of numbers linearly, offering insights for better exploring, designing, and utilizing numeric information in LLMs.
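The probing setup described in this abstract can be illustrated with a minimal sketch: fit a linear regressor that predicts a number's value from a hidden-state vector. The hidden states below are synthetic stand-ins (a noisy linear encoding along one direction); the model, layer, and data are assumptions for illustration only.

```python
# Minimal linear-probe sketch: recover number values from hidden states.
# The hidden states here are synthetic stand-ins; in the paper's setting they
# would be extracted from an LLM layer while it processes addition problems.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, hidden_dim = 2000, 256

# Hypothetical encoding: each number value lies (noisily) along one fixed direction.
true_direction = rng.normal(size=hidden_dim)
numbers = rng.integers(0, 1000, size=n_examples).astype(float)
hidden_states = numbers[:, None] * true_direction[None, :] + rng.normal(
    scale=5.0, size=(n_examples, hidden_dim)
)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, numbers, test_size=0.2, random_state=0
)

probe = Ridge(alpha=1.0).fit(X_train, y_train)  # the linear probe
print("probe R^2 on held-out states:", probe.score(X_test, y_test))
```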
pdf
bib
abs
FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models
Shu Liu
|
Shangqing Zhao
|
Chenghao Jia
|
Xinlin Zhuang
|
Zhaoguang Long
|
Jie Zhou
|
Aimin Zhou
|
Man Lan
|
Yang Chong
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of financial data analysis, particularly focusing on data-driven thinking, remain uncertain. To bridge this gap, we introduce FinDABench, a comprehensive benchmark designed to evaluate the financial data analysis capabilities of LLMs within this context. The benchmark comprises 15,200 training instances and 8,900 test instances, all meticulously crafted by human experts. FinDABench assesses LLMs across three dimensions: 1) Core Ability, evaluating the models’ ability to perform financial indicator calculation and corporate sentiment risk assessment; 2) Analytical Ability, determining the models’ ability to quickly comprehend textual information and analyze abnormal financial reports; and 3) Technical Ability, examining the models’ use of technical knowledge to address real-world data analysis challenges involving analysis generation and chart visualization from multiple perspectives. We will release FinDABench and the evaluation scripts at https://github.com/xxx. FinDABench aims to provide a measure for in-depth analysis of LLM abilities and foster the advancement of LLMs in the field of financial data analysis.
pdf
bib
abs
Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding
Nguyen Binh Nguyen
|
Yang He
Dataset pruning aims to select a subset of a dataset for efficient model training. While data efficiency in natural language processing has primarily focused on cross-corpus scenarios during model pre-training, efficient dataset pruning for task-specific fine-tuning across diverse datasets remains challenging due to variability in dataset sizes, data distributions, class imbalance and label spaces. Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes, typically requiring full dataset training or reference models. We address this gap by proposing Swift Cross-Dataset Pruning (SCDP). Specifically, our approach uses TF-IDF embeddings with the geometric median to rapidly evaluate sample importance. We then apply dataset size-adaptive pruning to ensure diversity: for smaller datasets, we retain examples far from the geometric median, while for larger ones, we employ distance-based stratified pruning. Experimental results on six diverse datasets spanning various tasks and scales demonstrate the effectiveness of our method while significantly reducing computational resources.
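A rough sketch of the scoring step the abstract describes: embed texts with TF-IDF, compute the geometric median of the embeddings (via Weiszfeld's algorithm), and score each sample by its distance to that median. The size threshold and the exact size-adaptive selection rule below are illustrative assumptions rather than SCDP's published procedure.

```python
# Sketch of distance-to-geometric-median scoring with TF-IDF embeddings.
# The pruning rules at the bottom are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def geometric_median(X, n_iter=100, eps=1e-8):
    """Weiszfeld's algorithm for the geometric median of the row vectors of X."""
    median = X.mean(axis=0)
    for _ in range(n_iter):
        dists = np.clip(np.linalg.norm(X - median, axis=1), eps, None)
        weights = 1.0 / dists
        new_median = (weights[:, None] * X).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_median - median) < eps:
            break
        median = new_median
    return median

texts = [f"example sentence number {i}" for i in range(1000)]  # placeholder corpus
X = TfidfVectorizer(max_features=2048).fit_transform(texts).toarray()
scores = np.linalg.norm(X - geometric_median(X), axis=1)  # distance as importance proxy

keep_ratio = 0.5
if len(texts) < 5000:
    # Small dataset: keep the examples farthest from the geometric median.
    keep = np.argsort(scores)[-int(keep_ratio * len(texts)):]
else:
    # Larger dataset: stratify by distance quantiles and sample within each stratum.
    quantiles = np.quantile(scores, np.linspace(0, 1, 11)[1:-1])
    bins = np.digitize(scores, quantiles)
    keep = np.concatenate([
        np.where(bins == b)[0][: max(1, int(keep_ratio * (bins == b).sum()))]
        for b in np.unique(bins)
    ])
print("kept", len(keep), "of", len(texts), "examples")
```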
pdf
bib
abs
SLARD: A Chinese Superior Legal Article Retrieval Dataset
Zhe Chen
|
Pengjie Ren
|
Fuhui Sun
|
Xiaoyan Wang
|
Yujun Li
|
Siwen Zhao
|
Tengyi Yang
Retrieving superior legal articles involves identifying relevant legal articles that hold higher legal effectiveness. This process is crucial in legislative work because superior legal articles form the legal basis for drafting new laws. However, most existing legal information retrieval research focuses on retrieving legal documents, with limited research on retrieving superior legal articles. This gap restricts the digitization of legislative work. To advance research in this area, we propose SLARD: A Chinese Superior Legal Article Retrieval Dataset, which filters 2,627 queries and 9,184 candidates from over 4.3 million effective Chinese regulations, covering 32 categories, such as environment, agriculture, and water resources. Each query is manually annotated, and the candidates include superior articles at both the provincial and national levels. We conducted detailed experiments and analyses on the dataset and found that existing retrieval methods struggle to achieve ideal results. The best method achieved an R@1 of only 0.4719. Additionally, we found that existing large language models (LLMs) lack prior knowledge of the content of superior legal articles. This indicates the necessity for further exploration and research in this field.
pdf
bib
abs
Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations
Nuo Chen
|
Hongguang Li
|
Jianhui Chang
|
Juhua Huang
|
Baoyuan Wang
|
Jia Li
Existing retrieval-based methods have made significant strides in maintaining long-term conversations. However, these approaches face challenges in memory database management and accurate memory retrieval, hindering their efficacy in dynamic, real-world interactions. This study introduces a novel framework, COmpressive Memory-Enhanced Dialogue sYstems (COMEDY), which eschews traditional retrieval modules and memory databases. Instead, COMEDY adopts a “One-for-All” approach, utilizing a single language model to manage memory generation, compression, and response generation. Central to this framework is the concept of compressive memory, which integrates session-specific summaries, user-bot dynamics, and past events into a concise memory format. To support COMEDY, we collect the largest Chinese long-term conversation dataset, Dolphin, derived from real user-chatbot interactions. Comparative evaluations demonstrate COMEDY’s superiority over traditional retrieval-based methods in producing more nuanced and human-like conversational experiences.
pdf
bib
abs
Refined Evaluation for End-to-End Grammatical Error Correction Using an Alignment-Based Approach
Junrui Wang
|
Mengyang Qiu
|
Yang Gu
|
Zihao Huang
|
Jungyeul Park
We propose a refined alignment-based method to assess end-to-end grammatical error correction (GEC) systems, aiming to reproduce and improve results from existing evaluation tools, such as errant, even when applied to raw text input—reflecting real-world language learners’ writing scenarios. Our approach addresses challenges arising from sentence boundary detection deviations in text preprocessing, a factor overlooked by current GEC evaluation metrics. We demonstrate its effectiveness by replicating results through a re-implementation of errant, utilizing stanza for error annotation and simulating end-to-end evaluation from raw text. Additionally, we propose a potential multilingual errant, presenting Chinese and Korean GEC results. Previously, Chinese and Korean errant were implemented independently for each language, with different annotation formats. Our approach generates consistent error annotations across languages, establishing a basis for standardized grammatical error annotation and evaluation in multilingual GEC contexts.
pdf
bib
abs
LLMs on interactive feature collections with implicit dynamic decision strategy
Juyeon Heo
|
Vihari Piratla
|
Kyunghyun Lee
|
Hyonkeun Joh
|
Adrian Weller
In real-world contexts such as medical diagnosis and business consulting, effective problem-solving often requires gathering relevant information through interactions and targeted questioning to pinpoint the root cause of a problem. However, Large Language Models (LLMs) often struggle to efficiently narrow down the search space, leading to either missing key information or asking redundant questions when guided by implicit methods like Chain-of-Thought (CoT). Some approaches employ external engineered systems to guide reasoning paths, but these methods may not fully utilize the inherent problem-solving capabilities of LLMs and often require multiple expensive API calls. This study explores how we can implicitly guide LLMs to enhance their interactive feature collection abilities within a single prompt. Instead of employing explicit search algorithms or step-by-step external guidance, we provide high-level guidelines that allow LLMs to dynamically adjust their strategies and iteratively refine their decision-making processes independently. Evaluations on synthetic 20-Questions games and real-world scenarios, including business and medical diagnosis cases, demonstrate that LLMs guided by these strategies perform more effective interactive feature collection, asking fewer and more strategic questions and achieving better problem-solving efficiency.
pdf
bib
abs
Pre-trained Semantic Interaction based Inductive Graph Neural Networks for Text Classification
Shiyu Wang
|
Gang Zhou
|
Jicang Lu
|
Jing Chen
|
Ningbo Huang
Nowadays, research on Text Classification (TC) based on graph neural networks (GNNs) is on the rise. Both inductive methods and transductive methods have made significant progress. For transductive methods, the semantic interaction between texts plays a crucial role in learning effective text representations. However, it is difficult to perform inductive learning while modeling interactions between texts on the graph. To provide a universal solution, we propose a graph neural network based on pre-trained semantic interaction called PaSIG. Firstly, we construct a text-word heterogeneous graph and design an asymmetric structure to ensure one-way message passing from words to the test texts. Meanwhile, we use the context representation capability of the pre-trained language model to construct node features that contain classification semantic information. Afterward, we explore adaptive aggregation methods with a gated fusion mechanism. Extensive experiments on five datasets have shown the effectiveness of PaSIG, with accuracy exceeding the baseline by 2.7% on average. While achieving state-of-the-art performance, we also employ subgraph sampling and intermediate state preservation to achieve fast inference.
pdf
bib
abs
From Superficial to Deep: Integrating External Knowledge for Follow-up Question Generation Using Knowledge Graph and LLM
Jianyu Liu
|
Yi Huang
|
Sheng Bi
|
Junlan Feng
|
Guilin Qi
In a conversational system, dynamically generating follow-up questions based on context can help users explore information and provide a better user experience. Humans are usually able to ask questions that involve some general life knowledge and demonstrate higher-order cognitive skills. However, the questions generated by existing methods are often limited to shallow contextual questions that are uninspiring and fall far short of the human level. In this paper, we propose a three-stage external knowledge-enhanced follow-up question generation method, which generates questions by identifying contextual topics, constructing a knowledge graph (KG) online, and finally combining these with a large language model to generate the final question. The model generates information-rich and exploratory follow-up questions by introducing external commonsense knowledge and performing a knowledge fusion operation. Experiments show that compared to baseline models, our method generates questions that are more informative and closer to human questioning levels while maintaining contextual relevance.
pdf
bib
abs
AGCL: Aspect Graph Construction and Learning for Aspect-level Sentiment Classification
Zhongquan Jian
|
Daihang Wu
|
Shaopan Wang
|
Yancheng Wang
|
Junfeng Yao
|
Meihong Wang
|
Qingqiang Wu
Prior studies on Aspect-level Sentiment Classification (ALSC) emphasize modeling interrelationships among aspects and contexts but overlook the crucial role of aspects themselves as essential domain knowledge. To this end, we propose AGCL, a novel Aspect Graph Construction and Learning method, aimed at furnishing the model with finely tuned aspect information to bolster its task-understanding ability. AGCL’s pivotal innovations reside in Aspect Graph Construction (AGC) and Aspect Graph Learning (AGL), where AGC harnesses intrinsic aspect connections to construct the domain aspect graph, and then AGL iteratively updates the introduced aspect graph to enhance its domain expertise, making it more suitable for the ALSC task. Hence, this domain aspect graph can serve as a bridge connecting unseen aspects with seen aspects, thereby enhancing the model’s generalization capability. Experiment results on three widely used datasets demonstrate the significance of aspect information for ALSC and highlight AGL’s superiority in aspect learning, surpassing state-of-the-art baselines greatly. Code is available at https://github.com/jian-projects/agcl.
pdf
bib
abs
TaCIE: Enhancing Instruction Comprehension in Large Language Models through Task-Centred Instruction Evolution
Jiuding Yang
|
Shengyao Lu
|
Weidong Guo
|
Xiangyang Li
|
Kaitong Yang
|
Yu Xu
|
Di Niu
The fine-tuning of Large Language Models (LLMs) specialized in code generation has seen notable advancements through the use of open-domain coding queries. Despite the successes, existing methodologies like Evol-Instruct encounter performance limitations, impeding further enhancements in code generation tasks. This paper examines the constraints of existing prompt evolution techniques and introduces a novel approach, Instruction Fusion (IF). IF innovatively combines two distinct prompts through a hybridization process, thereby enhancing the evolution of training prompts for code LLMs. Our experimental results reveal that the proposed novel method effectively addresses the shortcomings of prior methods, significantly improving the performance of Code LLMs across five code generation benchmarks, namely HumanEval, HumanEval+, MBPP, MBPP+ and MultiPL-E, which underscore the effectiveness of Instruction Fusion in advancing the capabilities of LLMs in code generation.
pdf
bib
abs
LLaMA-E: Empowering E-commerce Authoring with Object-Interleaved Instruction Following
Kaize Shi
|
Xueyao Sun
|
Dingxian Wang
|
Yinlin Fu
|
Guandong Xu
|
Qing Li
E-commerce authoring entails creating engaging, diverse, and targeted content to enhance preference elicitation and retrieval experience. While Large Language Models (LLMs) have revolutionized content generation, they often fall short in e-commerce applications due to their limited memorization of domain-specific features. This paper proposes LLaMA-E, a set of unified e-commerce authoring models that address the contextual preferences of customers, sellers, and platforms, the essential objects in e-commerce operation. We design the instruction set derived from tasks of ads generation, query-enhanced product title rewriting, product classification, purchase intent speculation, and general e-commerce Q&A. The instruction formulation ensures the interleaved coverage of the presented and required object features, allowing the alignment of base models to parameterize e-commerce knowledge comprehensively. The proposed LLaMA-E models achieve state-of-the-art evaluation performance and exhibit advantages in zero-shot practical applications. To our knowledge, this is the first LLM tailored to empower authoring applications with comprehensive scenario understanding by integrating features focused on participating objects.
pdf
bib
abs
LLMTreeRec: Unleashing the Power of Large Language Models for Cold-Start Recommendations
Wenlin Zhang
|
Chuhan Wu
|
Xiangyang Li
|
Yuhao Wang
|
Kuicai Dong
|
Yichao Wang
|
Xinyi Dai
|
Xiangyu Zhao
|
Huifeng Guo
|
Ruiming Tang
The lack of training data gives rise to the system cold-start problem in recommendation systems, making them struggle to provide effective recommendations. To address this problem, Large Language Models (LLMs) can model recommendation tasks as language analysis tasks and provide zero-shot results based on their vast open-world knowledge. However, the large scale of the item corpus poses a challenge to LLMs, leading to substantial token consumption that makes it impractical to deploy in real-world recommendation systems. To tackle this challenge, we introduce a tree-based LLM recommendation framework, LLMTreeRec, which structures all items into an item tree to improve the efficiency of the LLM’s item retrieval. LLMTreeRec achieves state-of-the-art performance under the system cold-start setting on two widely used datasets, and is even competitive with conventional deep recommendation systems that use substantial training data. Furthermore, LLMTreeRec outperforms the baseline model in an A/B test on a Huawei industrial system. Consequently, LLMTreeRec demonstrates its effectiveness as an industry-friendly solution that has been successfully deployed online.
pdf
bib
abs
Collaborative Document Simplification Using Multi-Agent Systems
Dengzhao Fang
|
Jipeng Qiang
|
Xiaoye Ouyang
|
Yi Zhu
|
Yunhao Yuan
|
Yun Li
Research on text simplification has been ongoing for many years. However, the task of document simplification (DS) remains a significant challenge due to the need to consider complex factors such as technical terminology, metaphors, and overall coherence. In this work, we introduce a novel multi-agent framework for document simplification (AgentSimp) based on large language models (LLMs). This framework emulates the collaborative process of a human expert team through the roles played by multiple agents, addressing the intricate demands of document simplification. We explore two communication strategies among agents (pipeline-style and synchronous) and two document reconstruction strategies (Direct and Iterative). According to both automatic evaluation metrics and human evaluation results, the documents simplified by AgentSimp are deemed to be more thoroughly simplified and more coherent on a variety of articles across different types and styles.
pdf
bib
abs
Distilling Rule-based Knowledge into Large Language Models
Wenkai Yang
|
Yankai Lin
|
Jie Zhou
|
Ji-Rong Wen
Large language models (LLMs) have shown incredible performance in completing various real-world tasks. The current paradigm of knowledge learning for LLMs is mainly based on learning from examples, in which LLMs learn the internal rule implicitly from a certain number of supervised examples. However, this learning paradigm may not learn complicated rules well, especially when the training examples are limited. We are inspired by the fact that humans can learn new tasks or knowledge in another way, namely by learning from rules: humans can learn new tasks or grasp new knowledge quickly and generalize well given only a detailed rule and a few optional examples. Therefore, in this paper, we aim to explore the feasibility of this new learning paradigm, which targets encoding rule-based knowledge into LLMs. We further propose rule distillation, which first uses the strong in-context abilities of LLMs to extract the knowledge from the textual rules, and then explicitly encodes the knowledge into the parameters of LLMs by learning from the above in-context signals produced inside the model. Our experiments show that making LLMs learn from rules by our method is much more efficient than example-based learning in terms of both sample efficiency and generalization ability. Warning: This paper may contain examples with offensive content.
pdf
bib
abs
Exploring Backdoor Vulnerabilities of Chat Models
Wenkai Yang
|
Yunzhuo Hao
|
Yankai Lin
Recent research has shown that Large Language Models (LLMs) are susceptible to a security threat known as Backdoor Attack. The backdoored model will behave well in normal cases but exhibit malicious behaviours on inputs inserted with a specific backdoor trigger. Current backdoor studies on LLMs predominantly focus on single-turn instruction-tuned LLMs, while neglecting another realistic scenario where LLMs are fine-tuned on multi-turn conversational data to become chat models. Chat models are extensively adopted across various real-world scenarios, thus the security of chat models deserves increasing attention. Unfortunately, we point out that the flexible multi-turn interaction format instead increases the flexibility of trigger designs and amplifies the vulnerability of chat models to backdoor attacks. In this work, we reveal and achieve a novel backdoor attacking method on chat models by distributing multiple trigger scenarios across user inputs in different rounds, so that the backdoor is triggered only when all trigger scenarios have appeared in the historical conversations. Experimental results demonstrate that our method can achieve high attack success rates (e.g., over 90% ASR on Vicuna-7B) while successfully maintaining the normal capabilities of chat models in providing helpful responses to benign user requests. Also, the backdoor cannot be easily removed by downstream re-alignment, highlighting the importance of continued research and attention to the security concerns of chat models. Warning: This paper may contain toxic examples.
pdf
bib
abs
Towards the Machine Translation of Scientific Neologisms
Paul Lerner
|
François Yvon
Scientific research continually discovers and invents new concepts, which are then referred to by new terms, neologisms, or neonyms in this context. As the vast majority of publications are written in English, disseminating this new knowledge to the general public often requires translating these terms. However, by definition, no parallel data exist to provide such translations. Therefore, we propose to leverage term definitions as a useful source of information for the translation process. As we discuss, Large Language Models are well suited for this task and can benefit from in-context learning with co-hyponyms and terms sharing the same derivation paradigm. These models, however, are sensitive to the superficial and morphological similarity between source and target terms. Their predictions are also impacted by subword tokenization, especially for prefixed terms.
pdf
bib
abs
HyperIDP: Customizing Temporal Hypergraph Neural Networks for Multi-Scale Information Diffusion Prediction
Haowei Xu
|
Chao Gao
|
Xianghua Li
|
Zhen Wang
Information diffusion prediction is crucial for understanding how information spreads within social networks, addressing both macroscopic and microscopic prediction tasks. Macroscopic prediction assesses the overall impact of diffusion, while microscopic prediction focuses on identifying the next user likely to be influenced. However, few studies have focused on both scales of diffusion. This paper presents HyperIDP, a novel Hypergraph-based model designed to manage both macroscopic and microscopic Information Diffusion Prediction tasks. The model captures interactions and dynamics of cascades at the macro level with hypergraph neural networks (HGNNs) while integrating social homophily at the micro level. Considering the diverse data distributions across social media platforms, which necessitate extensive tuning of HGNN architectures, a search space is constructed to accommodate diffusion hypergraphs, with optimal architectures derived through differentiable search strategies. Additionally, cooperative-adversarial loss, inspired by multi-task learning, is introduced to ensure that the model can leverage the advantages of the shared representation when handling both tasks, while also avoiding potential conflicts. Experimental results show that the proposed model significantly outperforms baselines.
pdf
bib
abs
Enhancing multi-modal Relation Extraction with Reinforcement Learning Guided Graph Diffusion Framework
Rui Yang
|
Rajiv Gupta
With the massive growth of multi-modal information such as text, images, and other data, how to analyze and align these data has become very important. In our work, we introduce a new framework based on Reinforcement Learning Guided Graph Diffusion to address the complexity of multi-modal graphs and enhance interpretability, making it clearer how multi-modal information is aligned. Our approach leverages pre-trained models to encode multi-modal data into scene graphs and combines them into a cross-modal graph (CMG). We design a reinforcement learning agent to filter nodes and modify edges based on the observation of the graph state to dynamically adjust the graph structure, providing coarse-grained refinement. We then iteratively optimize edge weights and node selection to achieve fine-grained adjustment. We conduct extensive experiments on multi-modal relation extraction datasets and show that our model significantly outperforms existing multi-modal methods such as MEGA and MKGFormer. We also conduct an ablation study to demonstrate the importance of each key component, showing that performance drops significantly when any key element is removed. Our method uses reinforcement learning to better mine potential multi-modal information relevance, and adjustments based on graph structure make our method more interpretable.
pdf
bib
abs
Non-Emotion-Centric Empathetic Dialogue Generation
Yuanxiang Huangfu
|
Peifeng Li
|
Yaxin Fan
|
Qiaoming Zhu
Previous work on empathetic response generation mainly focused on utilizing the speaker’s emotions to generate responses. However, the performance of identifying fine-grained emotions is limited, introducing cascading errors into empathetic response generation. Moreover, due to the conflict between the information in the dialogue history and the recognized emotions, previous work often generated general and uninformative responses. To address the above issues, we propose a novel framework NEC (Non-Emotion-Centric empathetic dialogue generation) based on contrastive learning and context-sensitive entity and social commonsense, in which frequent replies and sentences with incorrect emotions are penalized through contrastive learning, thereby improving the empathy, diversity and informativeness of the responses. The experimental results demonstrate that our NEC enhances the quality of empathetic generation and generates more diverse responses in comparison with the state-of-the-art baselines. The code will be available at https://github.com/huangfu170/NEC-empchat
pdf
bib
abs
Aligning Retrieval with Reader Needs: Reader-Centered Passage Selection for Open-Domain Question Answering
Chunlei Xin
|
Shuheng Zhou
|
Xuanang Chen
|
Yaojie Lu
|
Huijia Zhu
|
Weiqiang Wang
|
Zhongyi Liu
|
Xianpei Han
|
Le Sun
Open-Domain Question Answering (ODQA) systems often struggle with the quality of retrieved passages, which may contain conflicting information and be misaligned with the reader’s needs. Existing retrieval methods aim to gather relevant passages but often fail to prioritize consistent and useful information for the reader. In this paper, we introduce a novel Reader-Centered Passage Selection (R-CPS) method, which enhances the performance of the retrieve-then-read pipeline by re-ranking and clustering passages from the reader’s perspective. Our method re-ranks passages based on the reader’s prediction probability distribution and clusters passages according to the predicted answers, prioritizing more useful and relevant passages to the top and reducing inconsistent information. Experiments on ODQA datasets demonstrate the effectiveness of our approach in improving the quality of evidence passages under zero-shot settings.
pdf
bib
abs
Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding
Cheng Wang
|
Yiwei Wang
|
Bryan Hooi
|
Yujun Cai
|
Nanyun Peng
|
Kai-Wei Chang
The training data in large language models is key to their success, but it also presents privacy and security risks, as it may contain sensitive information. Detecting pre-training data is crucial for mitigating these concerns. Existing methods typically analyze target text in isolation or solely with non-member contexts, overlooking potential insights from simultaneously considering both member and non-member contexts. While previous work suggested that member contexts provide little information due to the minor distributional shift they induce, our analysis reveals that these subtle shifts can be effectively leveraged when contrasted with non-member contexts. In this paper, we propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts through contrastive decoding, amplifying subtle differences to enhance membership inference. Extensive empirical evaluations demonstrate that Con-ReCall achieves state-of-the-art performance on the WikiMIA benchmark and is robust against various text manipulation techniques.
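The kind of contrastive, context-conditioned scoring the abstract describes can be sketched roughly as follows: score the target text's likelihood under a member-style prefix and under a non-member-style prefix, then contrast the two. The model, prefixes, and combination rule are illustrative assumptions; Con-ReCall's actual formulation may differ.

```python
# Hedged sketch of contrastive, context-conditioned membership scoring.
# The prefixes and the scoring rule here are illustrative, not Con-ReCall's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def avg_log_likelihood(prefix: str, target: str) -> float:
    """Average log-likelihood of `target` tokens conditioned on `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # score only the target span
    loss = model(input_ids, labels=labels).loss  # mean NLL over target tokens
    return -loss.item()

member_prefix = "Known training passage context."      # assumed member-style context
nonmember_prefix = "Freshly written passage context."  # assumed non-member-style context
candidate = "The text whose membership we want to test."

score = avg_log_likelihood(member_prefix, candidate) - avg_log_likelihood(
    nonmember_prefix, candidate
)
print("contrastive membership score:", score)  # higher -> more member-like (illustrative)
```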
pdf
bib
abs
Citation Amnesia: On The Recency Bias of NLP and Other Academic Fields
Jan Philip Wahle
|
Terry Lima Ruas
|
Mohamed Abdalla
|
Bela Gipp
|
Saif M. Mohammad
This study examines the tendency to cite older work across 20 fields of study over 43 years (1980–2023). We put NLP’s propensity to cite older work in the context of these 20 other fields to analyze whether NLP shows similar temporal citation patterns to them over time or whether differences can be observed. Our analysis, based on a dataset of ~240 million papers, reveals a broader scientific trend: many fields have markedly declined in citing older works (e.g., psychology, computer science). The trend is strongest in NLP and ML research (-12.8% and -5.5% in citation age from previous peaks). Our results suggest that citing more recent works is not directly driven by the growth in publication rates (-3.4% across fields; -5.2% in humanities; -5.5% in formal sciences) — even when controlling for an increase in the volume of papers. Our findings raise questions about the scientific community’s engagement with past literature, particularly for NLP, and the potential consequences of neglecting older but relevant research. The data and a demo showcasing our results are publicly available.
pdf
bib
abs
Low-Resource Fast Text Classification Based on Intra-Class and Inter-Class Distance Calculation
Yanxu Mao
|
Peipei Liu
|
Tiehan Cui
|
Congying Liu
|
Datao You
In recent years, text classification methods based on neural networks and pre-trained models have gained increasing attention and demonstrated excellent performance. However, these methods still have some limitations in practical applications: (1) They typically focus only on the matching similarity between sentences. However, there exists implicit high-value information both within sentences of the same class and across different classes, which is very crucial for classification tasks. (2) Existing methods such as pre-trained language models and graph-based approaches often consume substantial memory for training and text-graph construction. (3) Although some low-resource methods can achieve good performance, they often suffer from excessively long processing times. To address these challenges, we propose a low-resource and fast text classification model called LFTC. Our approach begins by constructing a compressor list for each class to fully mine the regularity information within intra-class data. We then remove redundant information irrelevant to the target classification to reduce processing time. Finally, we compute the similarity distance between text pairs for classification. We evaluate LFTC on 9 publicly available benchmark datasets, and the results demonstrate significant improvements in performance and processing time, especially under limited computational and data resources, highlighting its superior advantages.
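A minimal sketch of compressor-based classification in the spirit of LFTC: a query is assigned to the class whose concatenated training text compresses it most cheaply. The zlib compressor and the simple per-class concatenation below are illustrative assumptions; LFTC's compressor lists and redundancy filtering are more elaborate.

```python
# Hedged sketch of compression-based text classification: a candidate text is
# assigned to the class whose concatenated training text compresses it best.
import zlib

train = {
    "sports": ["the team won the match", "a late goal decided the game"],
    "finance": ["stocks fell after the earnings report", "the bank raised rates"],
}

def extra_bytes(class_text: str, query: str) -> int:
    """How many extra compressed bytes the query adds on top of the class text."""
    base = len(zlib.compress(class_text.encode()))
    joint = len(zlib.compress((class_text + " " + query).encode()))
    return joint - base

def classify(query: str) -> str:
    class_texts = {label: " ".join(texts) for label, texts in train.items()}
    return min(class_texts, key=lambda label: extra_bytes(class_texts[label], query))

print(classify("the striker scored in the final minute"))   # expected: sports
print(classify("interest rates and quarterly earnings"))    # expected: finance
```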
pdf
bib
abs
Monte Carlo Tree Search Based Prompt Autogeneration for Jailbreak Attacks against LLMs
Suhuang Wu
|
Huimin Wang
|
Yutian Zhao
|
Xian Wu
|
Yefeng Zheng
|
Wei Li
|
Hui Li
|
Rongrong Ji
Jailbreak attacks craft specific prompts or append adversarial suffixes to prompts, thereby inducing language models to generate harmful or unethical content and bypassing the model’s safety guardrails. With the recent blossom of large language models (LLMs), there’s a growing focus on jailbreak attacks to probe their safety. While current white-box attacks typically focus on meticulously identifying adversarial suffixes for specific models, their effectiveness and efficiency diminish when applied to different LLMs. In this paper, we propose a Monte Carlo Tree Search (MCTS) based Prompt Auto-generation (MPA) method to enhance the effectiveness and efficiency of attacks across various models. MPA automatically searches for and generates adversarial suffixes for valid jailbreak attacks. Specifically, we first identify a series of action candidates that could potentially trick LLMs into providing harmful responses. To streamline the exploration of adversarial suffixes, we design a prior confidence probability for each MCTS node. We then iteratively auto-generate adversarial prompts using the MCTS framework. Extensive experiments on multiple open-source models (like Llama, Gemma, and Mistral) and closed-source models (such as ChatGPT) show that our proposed MPA surpasses existing methods in search efficiency as well as attack effectiveness. The codes are available at https://github.com/KDEGroup/MPA.
pdf
bib
abs
LogiGraph: Logical Reasoning with Contrastive Learning and Lightweight Graph Networks
Xiang Li
|
Chen Shi
|
Yong Xu
|
Jun Huang
Logical reasoning is a crucial factor in machine reading comprehension (MRC) tasks. Existing methods struggle to balance semantic and explicit logical relation representations: some emphasize contextual semantics, while others pay more attention to explicit logical features. Additionally, previous methods utilize graph convolutional networks (GCNs) for node updates, which still exhibit some shortcomings. To address these challenges, in this paper we propose a logical reasoning method with contrastive learning and lightweight graph networks (LogiGraph). Our method focuses on a lightweight variant of the GCN, which greatly mitigates its shortcomings, and employs conjunctions and punctuation marks as two types of edges to construct a dual graph. Besides, we combine contrastive learning with graph reasoning, altering the content of logical expressions in the original context to serve as negative samples, enabling the model to capture negative logical relationships and improving generalization ability. We conduct extensive experiments on two public datasets, ReClor and LogiQA. Experimental results demonstrate that LogiGraph achieves state-of-the-art performance on both datasets.
pdf
bib
abs
Explaining Relationships Among Research Papers
Xiangci Li
|
Jessica Ouyang
The rapid pace of research publications makes it challenging for researchers to stay up to date. There is a growing need for automatically generated, concise literature reviews to help researchers quickly identify papers relevant to their interests. Prior work over the past decade has focused on summarizing individual research papers, typically in the context of citation generation, while the relationships among multiple papers have largely been overlooked. Existing approaches primarily generate standalone citation sentences without addressing the need for expository and transition sentences to explain the relationships among multiple citations. In this work, we propose a feature-based, LLM-prompting approach to generate richer citation texts and simultaneously capture the complex relationships among multiple papers. Our expert evaluation reveals a strong correlation between human preference and integrative writing styles, indicating that readers favor high-level, abstract citations with transition sentences that weave them into a coherent narrative.
pdf
bib
abs
From Generalist to Specialist: A Survey of Large Language Models for Chemistry
Yang Han
|
Ziping Wan
|
Lu Chen
|
Kai Yu
|
Xin Chen
Large Language Models (LLMs) have significantly transformed our daily life and established a new paradigm in natural language processing (NLP). However, the predominant pretraining of LLMs on extensive web-based texts remains insufficient for advanced scientific discovery, particularly in chemistry. The scarcity of specialized chemistry data, coupled with the complexity of multi-modal data such as 2D graphs, 3D structures and spectra, presents distinct challenges. Although several studies have reviewed Pretrained Language Models (PLMs) in chemistry, there is a conspicuous absence of a systematic survey specifically focused on chemistry-oriented LLMs. In this paper, we outline methodologies for incorporating domain-specific chemistry knowledge and multi-modal information into LLMs; we also conceptualize chemistry LLMs as agents using chemistry tools and investigate their potential to accelerate scientific research. Additionally, we summarize the existing benchmarks for evaluating the chemistry abilities of LLMs. Finally, we critically examine the current challenges and identify promising directions for future research. Through this comprehensive survey, we aim to assist researchers in staying at the forefront of developments in chemistry LLMs and to inspire innovative applications in the field.
pdf
bib
abs
Latent Space Interpretation for Stylistic Analysis and Explainable Authorship Attribution
Milad Alshomary
|
Narutatsu Ri
|
Marianna Apidianaki
|
Ajay Patel
|
Smaranda Muresan
|
Kathleen McKeown
Recent state-of-the-art authorship attribution methods learn authorship representations of text in a latent, uninterpretable space, which hinders their usability in real-world applications. We propose a novel approach for interpreting learned embeddings by identifying representative points in the latent space and leveraging large language models to generate informative natural language descriptions of the writing style associated with each point. We evaluate the alignment between our interpretable and latent spaces and demonstrate superior prediction agreement over baseline methods. Additionally, we conduct a human evaluation to assess the quality of these style descriptions and validate their utility in explaining the latent space. Finally, we show that human performance on the challenging authorship attribution task improves by +20% on average when aided with explanations from our method.
pdf
bib
abs
Read Before Grounding: Scene Knowledge Visual Grounding via Multi-step Parsing
HaiXiang Zhu
|
Lixian Su
|
ShuangMing Mao
|
Jing Ye
Visual grounding (VG) is an important task in vision and language that involves understanding the mutual relationship between query terms and images. However, existing VG datasets typically use simple and intuitive textual descriptions, with limited attribute and spatial information between images and text. Recently, the Scene Knowledge Visual Grounding (SK-VG) task has been introduced, which constructs VG datasets using visual knowledge and relational referential expressions. Due to the length of textual visual knowledge and the complexity of the referential relationships between entities, previous models have struggled with this task. Therefore, we propose ReadVG, a zero-shot, plug-and-play method that leverages the robust language understanding capabilities of Large Language Models (LLMs) to transform long visual knowledge texts into concise, information-dense visual descriptions. To improve the accuracy of target localisation, we employ a multi-step parsing algorithm that can progressively extract the query targets and their features from the visual knowledge and relational referencing expressions, thereby assisting multimodal models to more accurately localise the target for grounding purposes. Extensive experiments and case studies show that our approach can significantly improve the performance of multimodal grounding models.
pdf
bib
abs
Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem
Qianli Wang
|
Tatiana Anikina
|
Nils Feldhus
|
Simon Ostermann
|
Sebastian Möller
|
Vera Schmitt
Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on the first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions; both play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.
pdf
bib
abs
BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation
Minchong Li
|
Feng Zhou
|
Xiaohui Song
In recent years, large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks. However, such impressive performance often comes with the trade-off of an increased parameter size, posing significant challenges for widespread deployment. Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model. In this paper, we explore the task-specific distillation of LLMs at the logit level. Our investigation reveals that the logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those from vision models, with hidden “noise” in the long tail affecting distillation performance. Furthermore, existing logits distillation methods often struggle to effectively utilize the internal ranking information from the logits. To address these, we propose the Bi-directional Logits Difference (BiLD) loss. The BiLD loss filters out the long-tail noise by utilizing only top-k teacher and student logits, and leverages the internal logits ranking information by constructing logits differences. To evaluate BiLD loss, we conduct comprehensive experiments on 13 datasets using two types of LLMs. Our results show that the BiLD loss, with only the top-8 logits, outperforms supervised fine-tuning (SFT), vanilla KL loss, and five other distillation methods from both NLP and CV fields.
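A simplified sketch of a top-k logits-difference distillation loss, assuming PyTorch: only the teacher's top-k logits are used, and the pairwise differences among them (which encode their internal ranking) are matched between student and teacher. This is an illustrative approximation, not the exact BiLD loss.

```python
# Simplified sketch of a top-k logits-difference distillation loss in PyTorch.
# This is an illustrative approximation, not the exact BiLD formulation.
import torch
import torch.nn.functional as F

def topk_logits_difference_loss(student_logits, teacher_logits, k=8):
    """Match pairwise differences among the teacher's top-k logits.

    student_logits, teacher_logits: (batch, vocab) tensors.
    """
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)       # teacher's top-k logits
    student_vals = student_logits.gather(-1, topk_idx)         # same positions in student
    # Pairwise differences encode the internal ranking among the top-k logits.
    t_diff = topk_vals.unsqueeze(-1) - topk_vals.unsqueeze(-2)     # (batch, k, k)
    s_diff = student_vals.unsqueeze(-1) - student_vals.unsqueeze(-2)
    return F.mse_loss(s_diff, t_diff)

student = torch.randn(4, 32000, requires_grad=True)
teacher = torch.randn(4, 32000)
loss = topk_logits_difference_loss(student, teacher, k=8)
loss.backward()
print(float(loss))
```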
pdf
bib
abs
Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
Tamzeed Mahfuz
|
Satak Kumar Dey
|
Ruwad Naswan
|
Hasnaen Adil
|
Khondker Salman Sayeed
|
Haz Sameen Shahgir
Each new generation of English-oriented Large Language Models (LLMs) exhibits enhanced cross-lingual transfer capabilities and significantly outperforms older LLMs on low-resource languages. This prompts the question: Is there a need for LLMs dedicated to a particular low-resource language? We aim to explore this question for Bengali, a low-to-moderate resource Indo-Aryan language native to the Bengal region of South Asia. We compare the performance of open-weight and closed-source LLMs such as LLaMA-3 and GPT-4 against fine-tuned encoder-decoder models across a diverse set of Bengali downstream tasks, including translation, summarization, paraphrasing, question-answering, and natural language inference. Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent. Key challenges include inefficient tokenization of Bengali script by existing LLMs, leading to increased computational costs and potential performance degradation. Additionally, we highlight biases in machine-translated datasets commonly used for Bengali NLP tasks. We conclude that there is a significant need for a Bengali-oriented LLM, but the field currently lacks the high-quality pretraining and instruction-tuning datasets necessary to develop a highly effective model.
pdf
bib
abs
Do language models practice what they preach? Examining language ideologies about gendered language reform encoded in LLMs
Julia Watson
|
Sophia S. Lee
|
Barend Beekhuizen
|
Suzanne Stevenson
We study language ideologies in text produced by LLMs through a case study on English gendered language reform (related to role nouns like congressperson/-woman/-man, and singular they). First, we find political bias: when asked to use language that is “correct” or “natural”, LLMs use language most similarly to when asked to align with conservative (vs. progressive) values. This shows how LLMs’ metalinguistic preferences can implicitly communicate the language ideologies of a particular political group, even in seemingly non-political contexts. Second, we find LLMs exhibit internal inconsistency: LLMs use gender-neutral variants more often when more explicit metalinguistic context is provided. This shows how the language ideologies expressed in text produced by LLMs can vary, which may be unexpected to users. We discuss the broader implications of these findings for value alignment.
pdf
bib
abs
T-MES: Trait-Aware Mix-of-Experts Representation Learning for Multi-trait Essay Scoring
Jiong Wang
|
Jie Liu
In current research on automatic essay scoring, related work tends to focus more on evaluating the overall quality or a single trait of prompt-specific essays. However, when scoring essays in an educational context, it is essential not only to consider the overall score but also to provide feedback on various aspects of the writing. This helps students clearly identify areas for improvement, enabling them to engage in targeted practice. Although many methods have been proposed to address the scoring issue, they still suffer from insufficient learning of trait representations and overlook the diversity and correlations between trait scores in the scoring process. To address this problem, we propose a novel multi-trait essay scoring method based on Trait-Aware Mix-of-Experts Representation Learning. Our method obtains trait-specific essay representations using a Mix-of-Experts scoring architecture. Furthermore, based on this scoring architecture, we propose a diversified trait-expert method to learn distinguishable expert weights. And to facilitate multi-trait scoring, we introduce two trait correlation learning strategies that achieve learning the correlations among traits. Experimental results demonstrate the effectiveness of our method, and compared to existing methods, it achieves a further improvement in computational efficiency.
pdf
bib
abs
A Graph Interaction Framework on Relevance for Multimodal Named Entity Recognition with Multiple Images
Jiachen Zhao
|
Shizhou Huang
|
Xin Lin
Posts containing multiple images have significant research potential for Multimodal Named Entity Recognition. Previous methods determine whether the images are related to named entities in the text through similarity computation, such as using CLIP. However, this is not effective in some cases and is not conducive to task transfer, especially in multi-image scenarios. To address this issue, we propose a graph interaction framework on relevance (GIFR) for Multimodal Named Entity Recognition with multiple images. Humans have the ability to distinguish whether an image is relevant to named entities, but such capabilities are difficult to model. Therefore, we propose using reinforcement learning based on human preference to integrate this ability into the model and determine whether an image-text pair is relevant, which is referred to as relevance. To better leverage relevance, we construct a heterogeneous graph and introduce a graph transformer to enable information interaction. Experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance.
pdf
bib
abs
Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain Chinese Word Segmentation
Xuebin Wang
|
Lei Zhang
|
Zhenghua Li
|
Shilin Zhou
|
Chen Gong
|
Yang Hou
Inspired by early research on exploring naturally annotated data for Chinese Word Segmentation (CWS), and also by recent research on integration of speech and text processing, this work for the first time proposes to explicitly mine word boundaries from parallel speech-text data. We employ the Montreal Forced Aligner (MFA) toolkit to perform character-level alignment on speech-text data, giving pauses as candidate word boundaries. Based on detailed analysis of collected pauses, we propose an effective probability-based strategy for filtering unreliable word boundaries. To more effectively utilize word boundaries as extra training data, we also propose a robust complete-then-train (CTT) strategy. We conduct cross-domain CWS experiments on two target domains, i.e., ZX and AISHELL2. We have annotated about 1K sentences as the evaluation data of AISHELL2. Experiments demonstrate the effectiveness of our proposed approach.
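A toy sketch of the boundary-mining idea: given a character-level forced alignment with timestamps, a word boundary is proposed wherever the pause between adjacent characters exceeds a threshold. The alignment format and the fixed threshold are assumptions for illustration; the paper filters such candidate boundaries with a probability-based strategy instead.

```python
# Sketch of turning pauses in a character-level forced alignment into candidate
# Chinese word boundaries. The alignment format and threshold are assumptions;
# the paper applies a probability-based filtering strategy on top of such pauses.
from dataclasses import dataclass

@dataclass
class AlignedChar:
    char: str
    start: float  # seconds
    end: float    # seconds

def candidate_boundaries(chars, min_pause=0.08):
    """Return indices i such that a word boundary is proposed after chars[i]."""
    boundaries = []
    for i in range(len(chars) - 1):
        pause = chars[i + 1].start - chars[i].end
        if pause >= min_pause:
            boundaries.append(i)
    return boundaries

# Toy alignment of the sentence "我喜欢自然语言处理" (word boundaries unknown in raw text).
alignment = [
    AlignedChar("我", 0.00, 0.18), AlignedChar("喜", 0.30, 0.45),
    AlignedChar("欢", 0.46, 0.60), AlignedChar("自", 0.75, 0.90),
    AlignedChar("然", 0.91, 1.05), AlignedChar("语", 1.06, 1.20),
    AlignedChar("言", 1.21, 1.35), AlignedChar("处", 1.50, 1.65),
    AlignedChar("理", 1.66, 1.80),
]
print(candidate_boundaries(alignment))  # boundaries after 我, 欢, 言 in this toy example
```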
pdf
bib
abs
RoBGuard: Enhancing LLMs to Assess Risk of Bias in Clinical Trial Documents
Changkai Ji
|
Bowen Zhao
|
Zhuoyao Wang
|
Yingwen Wang
|
Yuejie Zhang
|
Ying Cheng
|
Rui Feng
|
Xiaobo Zhang
Randomized Controlled Trials (RCTs) are rigorous clinical studies crucial for reliable decision-making, but their credibility can be compromised by bias. The Cochrane Risk of Bias tool (RoB 2) assesses this risk, yet manual assessments are time-consuming and labor-intensive. Previous approaches have employed Large Language Models (LLMs) to automate this process. However, they typically focus on manually crafted prompts and a restricted set of simple questions, limiting their accuracy and generalizability. Inspired by the human bias assessment process, we propose RoBGuard, a novel framework for enhancing LLMs to assess the risk of bias in RCTs. Specifically, RoBGuard integrates medical knowledge-enhanced question reformulation, multimodal document parsing, and multi-expert collaboration to ensure both completeness and accuracy. Additionally, to address the lack of suitable datasets, we introduce two new datasets: RoB-Item and RoB-Domain. Experimental results demonstrate RoBGuard’s effectiveness on the RoB-Item dataset, outperforming existing methods.
pdf
bib
abs
A Compressive Memory-based Retrieval Approach for Event Argument Extraction
Wanlong Liu
|
Enqi Zhang
|
Shaohuan Cheng
|
Dingyi Zeng
|
Li Zhou
|
Chen Zhang
|
Malu Zhang
|
Wenyu Chen
Recent works have demonstrated the effectiveness of retrieval augmentation in the Event Argument Extraction (EAE) task. However, existing retrieval-based EAE methods have two main limitations: (1) input length constraints and (2) the gap between the retriever and the inference model. These issues limit the diversity and quality of the retrieved information. In this paper, we propose a Compressive Memory-based Retrieval (CMR) mechanism for EAE, which addresses the two limitations mentioned above. Our compressive memory, designed as a dynamic matrix that effectively caches retrieved information and supports continuous updates, overcomes the limitations of input length. Additionally, after pre-loading all candidate demonstrations into the compressive memory, the model further retrieves and filters relevant information from the memory based on the input query, bridging the gap between the retriever and the inference model. Extensive experiments show that our method achieves new state-of-the-art performance on three public datasets (RAMS, WikiEvents, ACE05), significantly outperforming existing retrieval-based EAE methods.
pdf
bib
abs
FTFT: Efficient and Robust Fine-Tuning by Transferring Training Dynamics
Yupei Du
|
Albert Gatt
|
Dong Nguyen
Despite the massive success of fine-tuning Pre-trained Language Models (PLMs), they remain susceptible to out-of-distribution input. Dataset cartography is a simple yet effective dual-model approach that improves the robustness of fine-tuned PLMs. It involves fine-tuning a model on the original training set (i.e. reference model), selecting a subset of important training instances based on the training dynamics of the reference model, and fine-tuning again only on these selected examples (i.e. main model). However, this approach requires fine-tuning the same model twice, which is computationally expensive for large PLMs. In this paper, we show that 1) training dynamics are highly transferable across model sizes and pre-training methods, and that 2) fine-tuning main models using these selected training instances achieves higher training efficiency than empirical risk minimization (ERM). Building on these observations, we propose a novel fine-tuning approach: Fine-Tuning by transFerring Training dynamics (FTFT). Compared with dataset cartography, FTFT uses more efficient reference models and aggressive early stopping. FTFT achieves robustness improvements over ERM while lowering the training cost by up to ~50%.
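A small sketch of dataset-cartography-style instance selection from a reference model's training dynamics, as used by the approach above. The epoch-wise gold-label probabilities are assumed to be logged during reference-model fine-tuning; the "keep the most ambiguous third" criterion is an illustrative choice, not necessarily the paper's exact recipe.

```python
# Select 'ambiguous' training instances from per-epoch gold-label probabilities.
import numpy as np

def select_ambiguous(gold_probs: np.ndarray, keep_fraction: float = 0.33) -> np.ndarray:
    """gold_probs: (num_epochs, num_examples) probabilities of the gold label.
    Returns indices of the most ambiguous examples (highest variability)."""
    variability = gold_probs.std(axis=0)       # per-example std across epochs
    n_keep = int(len(variability) * keep_fraction)
    return np.argsort(-variability)[:n_keep]   # highest-variability examples

# Usage: fine-tune the main (larger) model only on these selected indices.
probs = np.random.rand(5, 1000)                # 5 epochs, 1000 training examples
selected = select_ambiguous(probs)
```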
pdf
bib
abs
PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation
Hour Kaing
|
Raj Dabre
|
Haiyue Song
|
Van-Hien Tran
|
Hideki Tanaka
|
Masao Utiyama
This work introduces PrahokBART, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles space during text generation, which is crucial for the naturalness of texts in Khmer.
pdf
bib
abs
Relation Logical Reasoning and Relation-aware Entity Encoding for Temporal Knowledge Graph Reasoning
Longzhou Liu
|
Chenglong Xiao
|
Shanshan Wang
|
Tingwen Liu
Temporal Knowledge Graph Reasoning (TKGR) aims to predict future facts based on historical data. Current mainstream models primarily use embedding techniques, which predict missing facts by representing entities and relations as low-dimensional vectors. However, these models often consider only the structural information of individual entities and relations, overlooking the broader structure of the entire TKG. To address these limitations, we propose a novel model called Relation Logical Reasoning and Relation-aware Entity Encoding (RLEE), drawing inspiration from attention mechanisms and logical rule-based techniques. RLEE introduces a two-layer representation of the TKG: an entity layer and a relation layer. At the relation layer, we extract relation paths to mine potential logical correlations between different relations, learning relation embeddings through a process of relation logical reasoning. At the entity layer, we use the relation-aware attention mechanism to learn the entity embeddings specific to the predicted query relations. These learned relation and entity embeddings are then used to predict facts at future timestamps. When evaluated on five commonly used public datasets, RLEE consistently outperforms state-of-the-art baselines.
pdf
bib
abs
Awakening Augmented Generation: Learning to Awaken Internal Knowledge of Large Language Models for Question Answering
Huanxuan Liao
|
Shizhu He
|
Yao Xu
|
Yuanzhe Zhang
|
Shengping Liu
|
Kang Liu
|
Jun Zhao
Retrieval-Augmented-Generation and Generation-Augmented-Generation have been proposed to enhance the knowledge required for question answering with Large Language Models (LLMs) by leveraging richer context. However, the former relies on external resources, and both require incorporating explicit documents into the context, which increases execution costs and susceptibility to noisy data during inference. Recent works indicate that LLMs encode rich knowledge, but it is often not effectively activated and awakened. Inspired by this, we propose a novel knowledge-augmented framework, Awakening-Augmented-Generation (AAG), which mimics the human ability to answer questions using only thinking and recalling to compensate for knowledge gaps, thereby awakening relevant knowledge in LLMs without relying on external resources. AAG consists of two key components for awakening richer context. Explicit awakening fine-tunes a context generator to create a synthetic, compressed document that functions as symbolic context. Implicit awakening utilizes a hypernetwork to generate adapters based on the question and synthetic document, which are inserted into LLMs to serve as parameter context. Experimental results on three datasets demonstrate that AAG exhibits significant advantages in both open-domain and closed-book settings, as well as in out-of-distribution generalization. Our code will be available at https://github.com/Xnhyacinth/IAG.
pdf
bib
abs
Dying or Departing? Euphemism Detection for Death Discourse in Historical Texts
Ali Al-Laith
|
Alexander Conroy
|
Jens Bjerring-Hansen
|
Bolette Pedersen
|
Carsten Levisen
|
Daniel Hershcovich
Euphemisms are a linguistic device used to soften discussions of sensitive or uncomfortable topics, with death being a prominent example. In this paper, we present a study on the detection of death-related euphemisms in historical literary texts from a corpus containing Danish and Norwegian novels from the late 19th century. We introduce an annotated dataset of euphemistic and literal references to death, including both common and rare euphemisms, ranging from well-established terms to more culturally nuanced expressions. We evaluate the performance of state-of-the-art pre-trained language models fine-tuned for euphemism detection. Our findings show that fixed, literal expressions of death became less frequent over time, while metaphorical euphemisms grew in prevalence. Additionally, euphemistic language was more common in historical novels, whereas contemporary novels tended to refer to death more literally, reflecting the rise of secularism. These results shed light on the shifting discourse on death during a period when the concept of death as final became prominent.
pdf
bib
abs
ITERATE: Image-Text Enhancement, Retrieval, and Alignment for Transmodal Evolution with LLMs
Chenhan Fu
|
Guoming Wang
|
Juncheng Li
|
Wenqiao Zhang
|
Rongxing Lu
|
Siliang Tang
Inspired by human cognitive behavior, we introduce the visual modality to enhance the performance of purely text-based question-answering tasks, building on the development of multimodal models. However, obtaining corresponding images through manual annotation often entails high costs. Faced with this challenge, an intuitive strategy is to use search engines or web scraping techniques to automatically obtain relevant image information. However, the images obtained by this strategy may be of low quality and may not match the context of the original task, which could fail to improve or even decrease performance on downstream tasks. In this paper, we propose a novel framework named “ITERATE”, aimed at retrieving and optimizing the quality of images to improve the alignment between text and images. Inspired by evolutionary algorithms in reinforcement learning and driven by the synergy of large language models (LLMs) and multimodal models, ITERATE employs a series of strategic actions such as filtering, optimizing, and retrieving to acquire higher quality images, and repeats this process over multiple generations to enhance the quality of the entire image cluster. Our experimental results on the ScienceQA, ARC-Easy, and OpenDataEval datasets also verify the effectiveness of our method, showing improvements of 3.5%, 5%, and 7%, respectively.
pdf
bib
abs
Multi-Graph Co-Training for Capturing User Intent in Session-based Recommendation
Zhe Yang
|
Tiantian Liang
Session-based recommendation focuses on predicting the next item a user will interact with based on sequences of anonymous user sessions. A significant challenge in this field is data sparsity due to the typically short-term interactions. Most existing methods rely heavily on users’ current interactions, overlooking the wealth of auxiliary information available. To address this, we propose a novel model, the Multi-Graph Co-Training model (MGCOT), which leverages not only the current session graph but also similar session graphs and a global item relation graph. This approach allows for a more comprehensive exploration of intrinsic relationships and better captures user intent from multiple views, enabling session representations to complement each other. Additionally, MGCOT employs multi-head attention mechanisms to effectively capture relevant session intent and uses contrastive learning to form accurate and robust session representations. Extensive experiments on three datasets demonstrate that MGCOT significantly enhances the performance of session-based recommendations, particularly on the Diginetica dataset, achieving improvements up to 2.00% in P@20 and 10.70% in MRR@20. Resources have been made publicly available in our GitHub repository https://github.com/liang-tian-tian/MGCOT.
pdf
bib
abs
CAST: Cross-modal Alignment Similarity Test for Vision Language Models
Gautier Dagan
|
Olga Loginova
|
Anil Batra
Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks which assess a model’s understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. This test involves asking the models to identify similarities between two scenes through text-only, image-only, or both and then assess the truthfulness of the similarities they generate. Since there is no ground-truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.
pdf
bib
abs
Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models
Chengkai Huang
|
Yu Xia
|
Rui Wang
|
Kaige Xie
|
Tong Yu
|
Julian McAuley
|
Lina Yao
Retrieval-augmented large language models (LLMs) have been remarkably competent in various NLP tasks. However, previous works have observed that retrieval is not always helpful, especially when the LLM is already knowledgeable about the query to be answered. Motivated by this, Adaptive Retrieval-Augmented Generation (ARAG) studies retrieving only when the knowledge asked by the query is absent in the LLM. Previous ARAG approaches either require access to the pre-training corpus or rely on prompting with additional model inferences. Aiming to avoid such drawbacks, we propose to determine whether the model is knowledgeable about a query by inspecting the (contextualized) pre-trained token embeddings of LLMs. We hypothesize that such embeddings capture rich information on the model’s intrinsic knowledge base, which enables an efficient way of judging the necessity to retrieve from an external corpus. Extensive experiments demonstrate our ARAG approach’s superior performance across various benchmarks.
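A hedged sketch of an embedding-informed retrieve-or-not decision in the spirit of the approach above. The abstract only states that the decision inspects (contextualized) pre-trained token embeddings; the scoring function used here (mean pairwise cosine similarity of the query's token embeddings) and the threshold are illustrative assumptions, not the authors' actual criterion.

```python
# Decide whether to retrieve based on a simple coherence score over the
# query's contextualized token embeddings (placeholder scoring rule).
import numpy as np

def should_retrieve(token_embeddings: np.ndarray, threshold: float = 0.35) -> bool:
    """token_embeddings: (num_tokens, dim) contextualized embeddings of the query.
    Returns True if the model is judged unfamiliar with the query."""
    normed = token_embeddings / np.linalg.norm(token_embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T                       # pairwise cosine similarities
    familiarity = sim[np.triu_indices_from(sim, k=1)].mean()
    return familiarity < threshold                # low coherence -> retrieve

print(should_retrieve(np.random.randn(12, 768)))
```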
pdf
bib
abs
Investigating the Contextualised Word Embedding Dimensions Specified for Contextual and Temporal Semantic Changes
Taichi Aida
|
Danushka Bollegala
The sense-aware contextualised word embeddings (SCWEs) encode semantic changes of words within the contextualised word embedding (CWE) spaces. Despite the superior performance of SCWEs on contextual/temporal semantic change detection (SCD) benchmarks, it remains unclear how the meaning changes are encoded in the embedding space. To study this, we compare pre-trained CWEs and their fine-tuned versions on contextual and temporal semantic change benchmarks under Principal Component Analysis (PCA) and Independent Component Analysis (ICA) transformations. Our experimental results reveal that (a) although there exists a small number of axes that are specific to semantic changes of words in the pre-trained CWE space, this information gets distributed across all dimensions when fine-tuned, and (b) in contrast to prior work studying the geometry of CWEs, we find that PCA better represents semantic changes than ICA within the top 10% of axes. These findings encourage the development of more efficient SCD methods with a small number of SCD-aware dimensions.
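An illustrative comparison of PCA and ICA transformations of contextualised word embeddings, in the spirit of the analysis above. The input matrix, the number of components, and the notion of correlating axes with a change signal are placeholders, not the authors' data or protocol.

```python
# Transform stand-in CWEs with PCA and ICA; one would then correlate each
# resulting axis with a semantic-change label to locate change-specific axes.
import numpy as np
from sklearn.decomposition import PCA, FastICA

embeddings = np.random.randn(500, 768)          # stand-in for CWEs of word usages

pca_axes = PCA(n_components=50).fit_transform(embeddings)
ica_axes = FastICA(n_components=50, max_iter=1000).fit_transform(embeddings)

print(pca_axes.shape, ica_axes.shape)
```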
pdf
bib
abs
Uncertainty Modelling in Under-Represented Languages with Bayesian Deep Gaussian Processes
Ubaid Azam
|
Imran Razzak
|
Shelly Vishwakarma
|
Shoaib Jameel
NLP models often face challenges with under-represented languages due to a lack of sufficient training data and language complexities. This can result in inaccurate predictions and a failure to capture the inherent uncertainties within these languages. This paper introduces a new method for modelling uncertainty in under-represented languages by employing deep Bayesian Gaussian Processes. We develop a novel framework that integrates prior knowledge and leverages kernel functions, enabling the quantification of uncertainty in predictions to overcome the data limitations of under-represented languages. The efficacy of our approach is validated through various experiments, and the results are benchmarked against existing methods to highlight the enhancements in prediction accuracy and measurement of uncertainty.
pdf
bib
abs
Cross-lingual Text Classification Transfer: The Case of Ukrainian
Daryna Dementieva
|
Valeriia Khylenko
|
Georg Groh
Despite the extensive amount of labeled datasets in the NLP text classification field, the persistent imbalance in data availability across various languages remains evident. To support the further fair development of NLP models, exploring the possibilities of effective knowledge transfer to new languages is crucial. Ukrainian, in particular, stands as a language that can still benefit from the continued refinement of cross-lingual methodologies. To the best of our knowledge, there is a considerable lack of Ukrainian corpora for typical text classification tasks, i.e., different types of style, harmful speech, or text relationships. However, collecting such corpora from scratch requires a substantial amount of resources. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test the approaches on three text classification tasks—toxicity classification, formality classification, and natural language inference (NLI)—providing the “recipe” for the optimal setups for each task.
pdf
bib
abs
LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping Robots
Dongge Han
|
Trevor McInroe
|
Adam Jelley
|
Stefano V. Albrecht
|
Peter Bell
|
Amos Storkey
Large language models (LLMs) have shown significant potential for robotics applications, particularly task planning, by harnessing their language comprehension and text generation capabilities. However, in applications such as household robotics, a critical gap remains in the personalization of these models to household preferences. For example, an LLM planner may find it challenging to perform tasks that require personalization, such as deciding where to place mugs in a kitchen based on specific household preferences. We introduce LLM-Personalize, a novel framework designed to personalize LLM planners for household robotics. LLM-Personalize uses an LLM planner to perform iterative planning in multi-room, partially-observable household environments, utilizing a scene graph built dynamically from local observations. To personalize the LLM planner towards user preferences, our optimization pipeline integrates imitation learning and reinforced Self-Training. We evaluate LLM-Personalize on Housekeep, a challenging simulated real-world 3D benchmark for household rearrangements, demonstrating a more than 30 percent increase in success rate over existing LLM planners, showcasing significantly improved alignment with human preferences.
pdf
bib
abs
CEHA: A Dataset of Conflict Events in the Horn of Africa
Rui Bai
|
Di Lu
|
Shihao Ran
|
Elizabeth M. Olson
|
Hemank Lamba
|
Aoife Cahill
|
Joel Tetreault
|
Alejandro Jaimes
Natural Language Processing (NLP) of news articles can play an important role in understanding the dynamics and causes of violent conflict. Despite the availability of datasets categorizing various conflict events, the existing labels often do not cover all of the fine-grained violent conflict event types relevant to areas like the Horn of Africa. In this paper, we introduce a new benchmark dataset Conflict Events in the Horn of Africa region (CEHA) and propose a new task for identifying violent conflict events using online resources with this dataset. The dataset consists of 500 English event descriptions regarding conflict events in the Horn of Africa region with fine-grained event-type definitions that emphasize the cause of the conflict. This dataset categorizes the key types of conflict risk according to specific areas required by stakeholders in the Humanitarian-Peace-Development Nexus. Additionally, we conduct extensive experiments on two tasks supported by this dataset: Event-relevance Classification and Event-type Classification. Our baseline models demonstrate the challenging nature of these tasks and the usefulness of our dataset for model evaluations in low-resource settings.
pdf
bib
abs
QABISAR: Query-Article Bipartite Interactions for Statutory Article Retrieval
Santosh T.y.s.s
|
Hassan Sarwat
|
Matthias Grabmair
In this paper, we introduce QABISAR, a novel framework for statutory article retrieval, to overcome the semantic mismatch problem that arises when modeling each query-article pair in isolation, which makes it hard to learn representations that can effectively capture multi-faceted information. QABISAR leverages bipartite interactions between queries and articles to capture the diverse aspects inherent in them. Further, we employ knowledge distillation to transfer enriched query representations from the graph network into the query bi-encoder, to capture the rich semantics present in the graph representations, despite the absence of graph-based supervision for unseen queries during inference. Our experiments on a real-world expert-annotated dataset demonstrate its effectiveness.
pdf
bib
abs
Partial Order-centered Hyperbolic Representation Learning for Few-shot Relation Extraction
Biao Hu
|
Zhen Huang
|
Minghao Hu
|
Pinglv Yang
|
Peng Qiao
|
Yong Dou
|
Zhilin Wang
Prototype network-based methods have made substantial progress in few-shot relation extraction (FSRE) by enhancing relation prototypes with relation descriptions. However, the distribution of relations and instances in distinct representation spaces isolates the constraints of relations on instances, making relation prototypes biased. In this paper, we propose an end-to-end partial order-centered hyperbolic representation learning (PO-HRL) framework, which imposes the constraints of relations on instances by modeling partial order in hyperbolic space, so as to effectively learn the distribution of instance representations. Specifically, we develop the hyperbolic supervised contrastive learning based on Lorentzian cosine similarity to align representations of relations and instances, and model the partial order by constraining instances to reside within the Lorentzian entailment cone of their respective relation. Experiments on three benchmark datasets show that PO-HRL outperforms the strong baselines, especially in 1-shot settings lacking relation descriptions.
pdf
bib
abs
Taxonomy-Guided Zero-Shot Recommendations with LLMs
Yueqing Liang
|
Liangwei Yang
|
Chen Wang
|
Xiongxiao Xu
|
Philip S. Yu
|
Kai Shu
With the emergence of large language models (LLMs) and their ability to perform a variety of tasks, their application in recommender systems (RecSys) has shown promise. However, deploying LLMs in RecSys poses significant challenges, such as limited prompt length, unstructured item information, and unconstrained generation of recommendations, leading to sub-optimal performance. To address these issues, we propose a novel Taxonomy-guided Recommendation (TaxRec) framework to empower LLMs with category information in a systematic approach. Specifically, TaxRec features a two-step process: one-time taxonomy categorization and LLM-based recommendation. In the one-time taxonomy categorization phase, we organize and categorize items, ensuring clarity and structure of item information. In the LLM-based recommendation phase, we feed the structured items into LLM prompts, achieving efficient token utilization and controlled feature generation. This enables more accurate, contextually relevant, and zero-shot recommendations without the need for domain-specific fine-tuning. Experimental results demonstrate that TaxRec significantly enhances recommendation quality compared to traditional zero-shot approaches, showcasing its efficacy as a personal recommender with LLMs. Code is available at: https://github.com/yueqingliang1/TaxRec.
pdf
bib
abs
Enhancing Multi-party Dialogue Discourse Parsing with Explanation Generation
Shannan Liu
|
Peifeng Li
|
Yaxin Fan
|
Qiaoming Zhu
Multi-party dialogue discourse parsing is an important and challenging task in natural language processing (NLP). Previous studies struggled to fully understand the deep semantics of dialogues, especially when dealing with complex topic interleaving and ellipsis. To address the above issues, we propose a novel model DDPE (Dialogue Discourse Parsing with Explanations) to integrate external knowledge from Large Language Models (LLMs), which consists of three components, i.e., explanation generation, structural parsing, and contrastive learning. DDPE employs LLMs to generate explanatory and contrastive information about discourse structure, thereby providing additional reasoning cues that enhance the understanding of dialogue semantics. The experimental results on the two public datasets STAC and Molweni show that our DDPE significantly outperforms the state-of-the-art (SOTA) baselines.
pdf
bib
abs
MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples
Shuo Xie
|
Fangzhi Zhu
|
Jiahui Wang
|
Lulu Wen
|
Wei Dai
|
Xiaowei Chen
|
Junxiong Zhu
|
Kai Zhou
|
Bo Zheng
Aligning Large Language Models (LLMs) with human feedback is crucial for their development. Existing preference optimization methods such as DPO and KTO, while improved based on Reinforcement Learning from Human Feedback (RLHF), are inherently derived from PPO, requiring a reference model that consumes additional GPU memory and relying heavily on abundant preference data. Meanwhile, current preference optimization research mainly targets single-question scenarios with two replies, neglecting optimization with multiple replies, which leads to a waste of data in practice. This study introduces the MPPO algorithm, which leverages the average likelihood of model responses to fit the reward function and maximizes the utilization of preference data. Through a comparison of Point-wise, Pair-wise, and List-wise implementations, we found that the Pair-wise approach achieves the best performance, significantly enhancing the quality of model responses. Experimental results demonstrate MPPO’s outstanding performance across various benchmarks. On MT-Bench, MPPO outperforms DPO, ORPO, and SimPO. Notably, on Arena-Hard, MPPO surpasses DPO and ORPO by substantial margins. These achievements underscore the remarkable advantages of MPPO in preference optimization tasks.
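A hedged sketch of a reference-free pair-wise preference loss built from the average (length-normalized) log-likelihood of each response, in the spirit of the idea above. The abstract does not give MPPO's exact objective; the logistic form and the beta scaling below are generic illustrative choices.

```python
# Pair-wise preference loss from average response log-likelihoods (no reference model).
import torch
import torch.nn.functional as F

def pairwise_loss(logps_chosen: torch.Tensor, lens_chosen: torch.Tensor,
                  logps_rejected: torch.Tensor, lens_rejected: torch.Tensor,
                  beta: float = 2.0) -> torch.Tensor:
    """logps_*: summed token log-probs per response; lens_*: token counts."""
    r_chosen = logps_chosen / lens_chosen       # average likelihood as reward proxy
    r_rejected = logps_rejected / lens_rejected
    return -F.logsigmoid(beta * (r_chosen - r_rejected)).mean()

loss = pairwise_loss(torch.tensor([-40.0]), torch.tensor([20.0]),
                     torch.tensor([-75.0]), torch.tensor([25.0]))
print(loss.item())
```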
pdf
bib
abs
Polysemy Interpretation and Transformer Language Models: A Case of Korean Adverbial Postposition -(u)lo
Seongmin Mun
|
Gyu-Ho Shin
This study examines how Transformer language models utilise lexico-phrasal information to interpret the polysemy of the Korean adverbial postposition -(u)lo. We analysed the attention weights of both a Korean pre-trained BERT model and a fine-tuned version. Results show a general reduction in attention weights following fine-tuning, alongside changes in the lexico-phrasal information used, depending on the specific function of -(u)lo. These findings suggest that, while fine-tuning broadly affects a model’s syntactic sensitivity, it may also alter its capacity to leverage lexico-phrasal features according to the function of the target word.
pdf
bib
abs
A Career Interview Dialogue System using Large Language Model-based Dynamic Slot Generation
Ekai Hashimoto
|
Mikio Nakano
|
Takayoshi Sakurai
|
Shun Shiramatsu
|
Toshitake Komazaki
|
Shiho Tsuchiya
This study aims to improve the efficiency and quality of career interviews conducted by nursing managers. To this end, we have been developing a slot-filling dialogue system that conducts pre-interviews to collect information on staff careers as a preparatory step before the actual interviews. Conventional slot-filling-based interview dialogue systems have limitations in the flexibility of information collection because the dialogue progresses based on predefined slot sets. We therefore propose a method that leverages large language models (LLMs) to dynamically generate new slots according to the flow of the dialogue, achieving more natural conversations. Furthermore, we incorporate abduction into the slot generation process to enable more appropriate and effective slot generation. To validate the effectiveness of the proposed method, we conducted experiments using a user simulator. The results suggest that the proposed method using abduction is effective in enhancing both information-collecting capabilities and the naturalness of the dialogue.
pdf
bib
abs
A Simple-Yet-Efficient Instruction Augmentation Method for Zero-Shot Sentiment Classification
Yang Zhao
|
Masayasu Muraoka
|
Issei Yoshida
|
Bishwaranjan Bhattacharjee
|
Hiroshi Kanayama
Instruction tuning significantly enhances the performance of large language models in tasks such as sentiment classification. Previous studies have leveraged labeled instances from sentiment benchmark datasets to instruction-tune LLMs, improving zero-shot sentiment classification performance. In this work, we propose a simple-yet-efficient instruction augmentation method which does not rely on any actual labeled sentiment instances. With just 240 pseudo instruction instances, the proposed method significantly improves the classification performance across several LLMs on 12 benchmark datasets, increasing scores by 30 points and outperforming LLMs that utilize more complex instruction tuning methods by 5.1 points. Surprisingly, the models tuned with 240 pseudo-instructions even outperform those tuned with actual domain-specific instruction instances. Despite the method’s simplicity, our further analysis suggests that the probability shift toward the positive and negative classes and its generalization ability may be the primary drivers of the improvement.
pdf
bib
abs
Improving Explainable Fact-Checking with Claim-Evidence Correlations
Xin Tan
|
Bowei Zou
|
Ai Ti Aw
Automatic fact-checking systems that employ large language models (LLMs) have achieved human-level performance in combating widespread misinformation. However, current LLM-based fact-checking systems fail to reveal the reasoning principles behind their decision-making for the claim verdict. In this work, we propose Correlation-Enhanced Explainable Fact-Checking (CorXFact), an LLM-based fact-checking system that simulates the reasoning principle of human fact-checkers for evidence-based claim verification: assessing and weighing the correlations between the claim and each piece of evidence. Following this principle, CorXFact enables efficient claim verification and transparent explanation generation. Furthermore, we contribute the CorFEVER test set to comprehensively evaluate the CorXFact system in claim-evidence correlation identification and claim verification in both closed-domain and real-world fact-checking scenarios. Experimental results show that our proposed CorXFact significantly outperforms four strong fact-checking baselines in claim authenticity prediction and verdict explanation.
pdf
bib
abs
Analyzing Continuous Semantic Shifts with Diachronic Word Similarity Matrices
Hajime Kiyama
|
Taichi Aida
|
Mamoru Komachi
|
Toshinobu Ogiso
|
Hiroya Takamura
|
Daichi Mochihashi
The meanings and relationships of words shift over time. This phenomenon is referred to as semantic shift. Research focused on understanding how semantic shifts occur over multiple time periods is essential for gaining a detailed understanding of semantic shifts. However, detecting change points only between adjacent time periods is insufficient for analyzing detailed semantic shifts, and using BERT-based methods to examine word sense proportions incurs a high computational cost. To address those issues, we propose a simple yet intuitive framework for analyzing how semantic shifts occur over multiple time periods by utilizing similarity matrices based on word embeddings. We calculate diachronic word similarity matrices using fast and lightweight word embeddings across arbitrary time periods, enabling a deeper analysis of continuous semantic shifts. Additionally, by clustering the resulting similarity matrices, we can categorize words that exhibit similar behavior of semantic shift in an unsupervised manner.
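A minimal sketch of a diachronic word similarity matrix and of clustering words by such matrices, as described above. The random embeddings, the number of time periods, and the use of KMeans are placeholders; the study uses fast, lightweight word embeddings trained per period.

```python
# Cosine similarity of one word's embedding between every pair of time
# periods, then clustering of the resulting matrices across words.
import numpy as np
from sklearn.cluster import KMeans

def similarity_matrix(vectors_over_time: np.ndarray) -> np.ndarray:
    """vectors_over_time: (num_periods, dim) embeddings of one word."""
    v = vectors_over_time / np.linalg.norm(vectors_over_time, axis=1, keepdims=True)
    return v @ v.T                               # (num_periods, num_periods)

num_words, num_periods, dim = 200, 10, 100
matrices = np.stack([
    similarity_matrix(np.random.randn(num_periods, dim)) for _ in range(num_words)
])
# Cluster words by the upper triangle of their similarity matrices.
iu = np.triu_indices(num_periods, k=1)
labels = KMeans(n_clusters=4, n_init=10).fit_predict(matrices[:, iu[0], iu[1]])
print(labels[:10])
```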
pdf
bib
abs
A Testset for Context-Aware LLM Translation in Korean-to-English Discourse Level Translation
Minjae Lee
|
Youngbin Noh
|
Seung Jin Lee
Large Language Models (LLMs) demonstrate remarkable performance in machine translation. Recent studies indicate that for high-resource languages, LLMs surpass encoder-decoder neural machine translation (NMT) models. However, evaluation datasets used in many LLM-based translation studies are often compromised by data leakage, and the field lacks demanding datasets that accurately gauge the potential and limitations of LLMs in human-like translation. This paper introduces a manually constructed Korean-English discourse-level corpus comprising 600 text instances featuring six linguistic phenomena: lexical ambiguity, zero anaphora, slang, idiom, figurative language, and implicature. Utilizing this challenge test set, we investigated LLMs’ Korean-to-English translation capability, particularly in cases requiring inter-sentential, context-based semantic inference. The findings reveal that state-of-the-art LLMs, such as GPT-4o, still struggle with specific linguistic phenomena that can be challenging for machine translation. Additionally, step-by-step prompting, such as Chain-of-Thought (CoT) prompting, significantly enhances the translation performance of LLMs compared to zero-shot prompting.
pdf
bib
abs
MoSLD: An Extremely Parameter-Efficient Mixture-of-Shared LoRAs for Multi-Task Learning
Lulu Zhao
|
Weihao Zeng
|
Shi Xiaofeng
|
Hua Zhou
Recently, LoRA has emerged as a crucial technique for fine-tuning large pre-trained models, yet its performance in multi-task learning scenarios often falls short. In contrast, the MoE architecture presents a natural solution to this issue. However, it introduces challenges such as mutual interference of data across multiple domains and knowledge forgetting of various tasks. Additionally, MoE significantly increases the number of parameters, posing a computational cost challenge. Therefore, in this paper, we propose MoSLD, a mixture-of-shared-LoRAs model with a dropout strategy. MoSLD addresses these challenges by sharing the upper projection matrix in LoRA among different experts, encouraging the model to learn general knowledge across tasks, while still allowing the lower projection matrix to focus on the unique features of each task. The application of dropout alleviates the imbalanced updates of the parameter matrices and mitigates parameter overfitting in LoRA. Extensive experiments demonstrate that our model exhibits excellent performance in both single-task and multi-task scenarios, with robust out-of-domain generalization capabilities.
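A hedged PyTorch sketch of a mixture of shared LoRAs: one LoRA factor is shared across all experts (general knowledge) while the other factor is expert-specific (task-specific features). The class name, dimensions, gating, and the decision of which factor to share are illustrative choices based on the abstract's wording, not the released implementation.

```python
# Mixture-of-shared-LoRA linear layer: shared up-projection, per-expert down-projections.
import torch
import torch.nn as nn

class MoSharedLoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 8, num_experts: int = 4,
                 dropout: float = 0.1):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)   # frozen pre-trained weight
        self.base.weight.requires_grad_(False)
        self.shared_up = nn.Parameter(torch.zeros(d_out, rank))           # shared factor
        self.expert_down = nn.Parameter(torch.randn(num_experts, rank, d_in) * 0.01)
        self.gate = nn.Linear(d_in, num_experts)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, d_in)
        weights = torch.softmax(self.gate(x), dim=-1)                     # (batch, E)
        down = torch.einsum("erd,bd->ber", self.expert_down, x)           # (batch, E, r)
        down = (weights.unsqueeze(-1) * down).sum(dim=1)                  # (batch, r)
        return self.base(x) + self.dropout(down) @ self.shared_up.T

layer = MoSharedLoRALinear(d_in=768, d_out=768)
print(layer(torch.randn(2, 768)).shape)
```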
pdf
bib
abs
A Combinatorial Approach to Neural Emergent Communication
Zheyuan Zhang
Substantial research on deep learning-based emergent communication uses the referential game framework, specifically the Lewis signaling game. However, we argue that successful communication in this game typically needs only one or two symbols for target image classification because of a sampling pitfall in the training data. To address this issue, we provide a theoretical analysis and introduce a combinatorial algorithm SolveMinSym (SMS) to solve for the symbolic complexity of classification, which is the minimum number of symbols in the message required for successful communication. We use the SMS algorithm to create datasets with different symbolic complexity and empirically show that data with higher symbolic complexity increases the number of effective symbols in the emergent language.
pdf
bib
abs
Multi-perspective Preference Alignment of LLMs for Programming-Community Question Answering
Hongyu Yang
|
Jiahui Hou
|
Liyang He
|
Rui Li
Programming-Community Question Answering (PCQA) aims to tackle issues by generating functional code and guiding descriptions. It involves multiple candidate answers, with different users having varying preferences for them; additionally, some candidates may contain outdated APIs. These factors undoubtedly make it challenging to generate responses that meet user preferences. Recently, Reinforcement Learning from Human Feedback has demonstrated its ability to precisely control the behavior of large language models (LLMs) to yield human-like responses. However, applying it to LLMs in domain-specific PCQA remains unexplored. In this work, we propose Multi-perspective Preference Alignment for Programming-Community Question Answering (MupPCQA) to generate user-centric responses. It includes three stages: Preference Standardization to control content quality, Preference Integration to consider diverse user tendencies, and Preference Timeliness Mitigation to alleviate outdated answers. Extensive experiments on a high-quality, real-world PCQA dataset validate its accuracy and preference alignment. Compared to its base model, MupPCQA shows an improvement of nearly 11% in BLEU, with increases of 20% and 17.5% in BERTScore and CodeBERTScore.
pdf
bib
abs
Learning to Refuse: Towards Mitigating Privacy Risks in LLMs
Zhenhua Liu
|
Tong Zhu
|
Chuanyuan Tan
|
Wenliang Chen
Large language models (LLMs) exhibit remarkable capabilities in understanding and generating natural language. However, these models can inadvertently memorize private information, posing significant privacy risks. This study addresses the challenge of enabling LLMs to protect specific individuals’ private data without the need for complete retraining. We propose RETURN, a Real-world pErsonal daTa UnleaRNing dataset, comprising 2,492 individuals from Wikipedia with associated QA pairs, to evaluate machine unlearning (MU) methods for protecting personal data in a realistic scenario. Additionally, we introduce the Name-Aware Unlearning Framework (NAUF) for Privacy Protection, which enables the model to learn which individuals’ information should be protected without affecting its ability to answer questions related to other unrelated individuals. Our extensive experiments demonstrate that NAUF achieves a state-of-the-art average unlearning score, surpassing the best baseline method by 5.65 points, effectively protecting target individuals’ personal data while maintaining the model’s general capabilities.
pdf
bib
abs
Exploring Unified Training Framework for Multimodal User Profiling
Minjie Qiang
|
Zhongqing Wang
|
Shoushan Li
|
Guodong Zhou
With the emergence of social media and e-commerce platforms, accurate user profiling has become increasingly vital for recommendation systems and personalized services. Recent studies have focused on generating detailed user profiles by extracting various aspects of user attributes from textual reviews. Nevertheless, these investigations have not fully exploited the potential of the abundant multimodal data at hand. In this study, we propose a novel task called multimodal user profiling. This task emphasizes the utilization of both review texts and their accompanying images to create comprehensive user profiles. By integrating textual and visual data, we leverage their complementary strengths, enabling the generation of more holistic user representations. Additionally, we explore a unified joint training framework with various multimodal training strategies that incorporate users’ historical review texts and images for user profile generation. Our experimental results underscore the significance of multimodal data in enhancing user profile generation and demonstrate the effectiveness of the proposed unified joint training approach.
pdf
bib
abs
Acquiring Bidirectionality via Large and Small Language Models
Takumi Goto
|
Hiroyoshi Nagao
|
Yuta Koreeda
Using token representations from bidirectional language models (LMs) such as BERT is still a widely used approach for token-classification tasks. Even though there exist much larger unidirectional LMs such as Llama-2, they are rarely used to replace the token representations of bidirectional LMs. In this work, we hypothesize that their lack of bidirectionality is what is keeping unidirectional LMs behind. To that end, we propose to newly train a small backward LM and concatenate its representations to those of an existing LM for downstream tasks. Through experiments on token-classification tasks, we demonstrate that introducing a backward model can improve benchmark performance by more than 10 points. Furthermore, we show that the proposed method is especially effective for rare domains and in few-shot learning settings.
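A hedged sketch of concatenating token representations from an existing forward LM with those of a backward LM, as described above. The model names are placeholders, and the second GPT-2 instance merely stands in for a small backward LM that would actually be trained on reversed token sequences; reversing the input order here only mimics that setup.

```python
# Concatenate forward-LM and (stand-in) backward-LM token features.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
forward_lm = AutoModel.from_pretrained("gpt2")
backward_lm = AutoModel.from_pretrained("gpt2")   # stand-in for a small backward LM

text = "Tokyo is the capital of Japan"
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    fwd = forward_lm(ids).last_hidden_state                       # (1, T, d)
    bwd = backward_lm(ids.flip(dims=[1])).last_hidden_state.flip(dims=[1])
token_features = torch.cat([fwd, bwd], dim=-1)                    # (1, T, 2d)
print(token_features.shape)   # input features for a downstream token classifier
```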
pdf
bib
abs
Enhancing One-Shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism
Guanchen Li
|
Xiandong Zhao
|
Lian Liu
|
Zeping Li
|
Yixing Xu
|
Dong Li
|
Lu Tian
|
Jie He
|
Ashish Sirasao
|
Emad Barsoum
Pre-trained language models (PLMs) are engineered to be robust in contextual understanding and exhibit outstanding performance in various natural language processing tasks. However, their considerable size incurs significant computational and storage costs. Modern pruning strategies employ retraining-free one-shot techniques to compress PLMs; however, these approaches often lead to an unavoidable reduction in performance. In this paper, we propose SDS, a Sparse-Dense-Sparse pruning framework to enhance the performance of the pruned PLMs from a weight distribution optimization perspective. We outline the pruning process in three steps. Initially, we prune less critical connections in the model using conventional one-shot pruning methods. Next, we reconstruct a dense model featuring a pruning-friendly weight distribution by reactivating pruned connections with sparse regularization. Finally, we perform a second pruning round, yielding a superior pruned model compared to the initial pruning. Experiments demonstrate that SDS outperforms the state-of-the-art pruning techniques SparseGPT and Wanda under an identical sparsity configuration. For instance, SDS reduces perplexity by 5.16 on Raw-Wikitext2 and improves average accuracy by 3.86% across multiple zero-shot benchmarks for LLaMA-3-8B compared to Wanda with 2:4 sparsity.
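A toy sketch of the sparse-dense-sparse schedule on a single weight matrix: one-shot magnitude pruning, dense reconstruction under an L1 penalty that keeps the weight distribution pruning-friendly, and a second pruning round. The actual SDS framework operates on full PLMs with calibration data and more sophisticated one-shot pruners; the loss, penalty weight, and step counts here are illustrative only.

```python
# Prune -> dense reconstruction with sparse regularization -> prune again.
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)

torch.manual_seed(0)
w = torch.randn(256, 256)
x = torch.randn(512, 256)
target = x @ w                                    # behaviour to preserve

w1 = magnitude_prune(w, sparsity=0.5)             # step 1: one-shot pruning

w_dense = w1.clone().requires_grad_(True)         # step 2: dense reconstruction
opt = torch.optim.Adam([w_dense], lr=1e-3)
for _ in range(200):
    loss = ((x @ w_dense - target) ** 2).mean() + 1e-4 * w_dense.abs().sum()
    opt.zero_grad(); loss.backward(); opt.step()

w2 = magnitude_prune(w_dense.detach(), sparsity=0.5)   # step 3: prune again
print((w2 == 0).float().mean().item())
```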
pdf
bib
abs
Language Models over Large-Scale Knowledge Base: on Capacity, Flexibility and Reasoning for New Facts
Qiyuan He
|
Yizhong Wang
|
Jianfei Yu
|
Wenya Wang
Advancements in language models (LMs) have sparked interest in exploring their potential as knowledge bases (KBs) due to their high capability for storing huge amounts of factual knowledge and semantic understanding. However, existing studies face challenges in quantifying the extent of large-scale knowledge packed into LMs and lack systematic studies on LMs’ structured reasoning capabilities over the infused knowledge. Addressing these gaps, our research investigates whether LMs can effectively act as large-scale KBs after training over an expansive set of world knowledge triplets via addressing the following three crucial questions: (1) How do LMs of different sizes perform at storing world knowledge of different frequencies in a large-scale KB? (2) How flexible are these LMs in recalling the stored knowledge when prompted with natural language queries? (3) After training on the abundant world knowledge, can LMs additionally gain the ability to reason over such information to infer new facts? Our findings indicate that while medium-scaled LMs hold promise as world knowledge bases capable of storing and responding with flexibility, enhancements in their reasoning capabilities are necessary to fully realize their potential.
pdf
bib
abs
Multi-View Incongruity Learning for Multimodal Sarcasm Detection
Diandian Guo
|
Cong Cao
|
Fangfang Yuan
|
Yanbing Liu
|
Guangjie Zeng
|
Xiaoyan Yu
|
Hao Peng
|
Philip S. Yu
Multimodal sarcasm detection (MSD) is essential for various downstream tasks. Existing MSD methods tend to rely on spurious correlations. These methods often mistakenly prioritize non-essential features yet still make correct predictions, demonstrating poor generalizability beyond training environments. Regarding this phenomenon, this paper undertakes several initiatives. Firstly, we identify two primary causes that lead to the reliance on spurious correlations. Secondly, we address these challenges by proposing a novel method that integrates Multimodal Incongruities via Contrastive Learning (MICL) for multimodal sarcasm detection. Specifically, we first leverage incongruity to drive multi-view learning from three views: token-patch, entity-object, and sentiment. Then, we introduce extensive data augmentation to mitigate the biased learning of the textual modality. Additionally, we construct a test set, SPMSD, which contains potential spurious correlations to evaluate the model’s generalizability. Experimental results demonstrate the superiority of MICL on benchmark datasets, along with analyses showcasing MICL’s advancement in mitigating the effect of spurious correlations.
pdf
bib
abs
Cognitive Biases, Task Complexity, and Result Interpretability in Large Language Models
Mario Mina
|
Valle Ruiz-Fernández
|
Júlia Falcão
|
Luis Vasquez-Reina
|
Aitor Gonzalez-Agirre
In humans, cognitive biases are systematic deviations from rationality in judgment that simplify complex decisions. They typically manifest as a consequence of learned behaviors or limitations on information processing capabilities. Recent work has shown that these biases can percolate through training data and ultimately be learned by language models. We examine different groups of models, factoring in model size and type (base or instructed) for four kinds of cognitive bias: primacy, recency, common token, and majority class bias. We evaluate the performance of each model for each type of bias in different settings using simple and complex variants of datasets. Our results show that some biases have much stronger effects than others, and that task complexity plays a part in eliciting stronger effects for some of these biases as measured by effect size. We show that some cognitive biases, such as common token and majority class bias, are not straightforward to evaluate, and that, contrary to some of the previous literature, some effects previously classified as common token bias are actually due to primacy and recency bias.
pdf
bib
abs
Robustness Evaluation of the German Extractive Question Answering Task
Shalaka Satheesh
|
Katharina Beckh
|
Katrin Klug
|
Héctor Allende-Cid
|
Sebastian Houben
|
Teena Hassan
To ensure reliable performance of Question Answering (QA) systems, evaluation of robustness is crucial. Standard evaluation benchmarks typically include only performance metrics, such as Exact Match (EM) and the F1 score. However, these benchmarks overlook critical factors for the deployment of QA systems. This oversight can result in systems vulnerable to minor perturbations in the input, such as typographical errors. While several methods have been proposed to test the robustness of QA models, there has been minimal exploration of these approaches for languages other than English. This study focuses on the robustness evaluation of German language QA models, extending methodologies previously applied primarily to English. The objective is to nurture the development of robust models by defining an evaluation method specifically tailored to the German language. We assess the applicability of perturbations used in English QA models for German and perform a comprehensive experimental evaluation with eight models. The results show that all models are vulnerable to character-level perturbations. Additionally, the comparison of monolingual and multilingual models suggests that the former are less affected by character- and word-level perturbations.
pdf
bib
abs
Enhancing Multimodal Named Entity Recognition through Adaptive Mixup Image Augmentation
Bo Xu
|
Haiqi Jiang
|
Jie Wei
|
Hongyu Jing
|
Ming Du
|
Hui Song
|
Hongya Wang
|
Yanghua Xiao
Multimodal named entity recognition (MNER) extends traditional named entity recognition (NER) by integrating visual and textual information. However, current methods still face significant challenges due to the text-image mismatch problem. Recent advancements in text-to-image synthesis provide promising solutions, as synthesized images can introduce additional visual context to enhance MNER model performance. To fully leverage the benefits of both original and synthesized images, we propose an adaptive mixup image augmentation method. This method generates augmented images by determining the mixing ratio based on the matching score between the text and image, utilizing a triplet loss-based Gaussian Mixture Model (TL-GMM). Our approach is highly adaptable and can be seamlessly integrated into existing MNER models. Extensive experiments demonstrate consistent performance improvements, and detailed ablation studies and case studies confirm the effectiveness of our method.
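A hedged sketch of mixup-style image augmentation where the mixing ratio between the original and synthesized image comes from a text-image matching score (e.g., a CLIP-style similarity). The simple clipping of the score into a ratio is a placeholder; the paper derives the ratio from a triplet-loss-based Gaussian Mixture Model (TL-GMM) instead.

```python
# Blend original and synthesized images with a ratio driven by a matching score.
import numpy as np

def adaptive_mixup(original: np.ndarray, synthesized: np.ndarray,
                   match_score: float) -> np.ndarray:
    """original/synthesized: HxWx3 float arrays in [0, 1];
    match_score: text-image matching score for the original image in [0, 1]."""
    lam = float(np.clip(match_score, 0.0, 1.0))   # trust the original more if it matches
    return lam * original + (1.0 - lam) * synthesized

orig = np.random.rand(224, 224, 3)
synth = np.random.rand(224, 224, 3)
augmented = adaptive_mixup(orig, synth, match_score=0.3)
print(augmented.shape)
```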
pdf
bib
abs
Bridging Modality Gap for Effective Multimodal Sentiment Analysis in Fashion-related Social Media
Zheyu Zhao
|
Zhongqing Wang
|
Shichen Li
|
Hongling Wang
|
Guodong Zhou
Multimodal sentiment analysis for fashion-related social media is essential for understanding how consumers appraise fashion products across platforms like Instagram and Twitter, where both textual and visual elements contribute to sentiment expression. However, a notable challenge in this task is the modality gap, where the different information density between text and images hinders effective sentiment analysis. In this paper, we propose a novel multimodal framework that addresses this challenge by introducing pseudo data generated by a two-stage framework. We further utilize a multimodal fusion approach that efficiently integrates the information from various modalities for sentiment classification of fashion posts. Experiments conducted on a comprehensive dataset demonstrate that our framework significantly outperforms existing unimodal and multimodal baselines, highlighting its effectiveness in bridging the modality gap for more accurate sentiment classification in fashion-related social media posts.
pdf
bib
abs
Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora
Rik van Noord
|
Miquel Esplà-Gomis
|
Malina Chichirau
|
Gema Ramírez-Sánchez
|
Antonio Toral
Parallel corpora play a vital role in advanced multilingual natural language processing tasks, notably in machine translation (MT). The recent emergence of numerous large parallel corpora, often extracted from multilingual documents on the Internet, has expanded the available resources. Nevertheless, the quality of these corpora remains largely unexplored, while there are large differences in how the corpora are constructed. Moreover, how the potential differences affect the performance of neural MT (NMT) systems has also received limited attention. This study addresses this gap by manually and automatically evaluating four well-known publicly available parallel corpora across eleven language pairs. Our findings are quite concerning: all corpora contain a substantial amount of noisy sentence pairs, with CCMatrix and CCAligned having well below 50% reasonably clean pairs. MaCoCu and ParaCrawl generally have higher quality texts, though around a third of the texts still have clear issues. While corpus size impacts NMT models’ performance, our study highlights the critical role of quality: higher-quality corpora consistently yield better-performing NMT models when controlling for size.
pdf
bib
abs
MLLM-I2W: Harnessing Multimodal Large Language Model for Zero-Shot Composed Image Retrieval
Tong Bao
|
Che Liu
|
Derong Xu
|
Zhi Zheng
|
Tong Xu
Composed Image Retrieval (CIR) involves retrieving an image based on a reference image and a brief text description, and is widely present in various scenarios such as fashion recommendation. Existing methods can be mainly divided into two categories: supervised CIR methods and Zero-Shot CIR (ZS-CIR) methods. In contrast to supervised CIR methods, which need manually annotated triplets for training task-specific models, ZS-CIR models can be trained using image datasets only and perform well. However, ZS-CIR still faces the primary challenge of learning how to map pseudo-words to images within the joint image-text embedding space. Therefore, in this paper, we propose a novel image-text mapping network, named MLLM-I2W, which adaptively converts description-related image information into pseudo-word markers for precise ZS-CIR. Specifically, the image and text encoding enhancement module within the MLLM prompt selects subject headings and generates text descriptions. It then reduces the modality gap between images and text using uncertainty modeling. An adaptive weighting module and a prototype are proposed to adjust and learn the deep fusion features, which are further mapped to pseudo-word markers via a well-designed MoE-based mapping network. Our model demonstrates consistent improvements across common CIR benchmarks, including COCO, CIRR, and Fashion-IQ.
pdf
bib
abs
Linguistic Features Extracted by GPT-4 Improve Alzheimer’s Disease Detection based on Spontaneous Speech
Jonathan Heitz
|
Gerold Schneider
|
Nicolas Langer
Alzheimer’s Disease (AD) is a significant and growing public health concern. Investigating alterations in speech and language patterns offers a promising path towards cost-effective and non-invasive early detection of AD on a large scale. Large language models (LLMs), such as GPT, have enabled powerful new possibilities for semantic text analysis. In this study, we leverage GPT-4 to extract five semantic features from transcripts of spontaneous patient speech. The features capture known symptoms of AD, but they are difficult to quantify effectively using traditional methods of computational linguistics. We demonstrate the clinical significance of these features and further validate one of them (“Word-Finding Difficulties”) against a proxy measure and human raters. When combined with established linguistic features and a Random Forest classifier, the GPT-derived features significantly improve the detection of AD. Our approach proves effective for both manually transcribed and automatically generated transcripts, representing a novel and impactful use of recent advancements in LLMs for AD speech analysis.
pdf
bib
abs
Does Vision Accelerate Hierarchical Generalization in Neural Language Learners?
Tatsuki Kuribayashi
|
Timothy Baldwin
Neural language models (LMs) are arguably less data-efficient than humans from a language acquisition perspective. One fundamental question is why this human–LM gap arises. This study explores the advantage of grounded language acquisition, specifically the impact of visual information — which humans can usually rely on but LMs largely do not have access to during language acquisition — on syntactic generalization in LMs. Our experiments, following the poverty of stimulus paradigm under two scenarios (using artificial vs. naturalistic images), demonstrate that if the alignments between the linguistic and visual components are clear in the input, access to vision data does help with the syntactic generalization of LMs, but if not, visual input does not help. This highlights the need for additional biases or signals, such as mutual gaze, to enhance cross-modal alignment and enable efficient syntactic generalization in multimodal LMs.
pdf
bib
abs
Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly
Peyman Hosseini
|
Ignacio Castro
|
Iacopo Ghinassi
|
Matthew Purver
Large Language Models (LLMs) have demonstrated remarkable capabilities in comprehending and analyzing lengthy sequential inputs, owing to their extensive context windows that allow processing millions of tokens in a single forward pass. However, this paper uncovers a surprising limitation: LLMs fall short when handling long input sequences. We investigate this issue using three datasets and two tasks (sentiment analysis and news categorization) across various LLMs, including Claude 3, Gemini Pro, GPT 3.5 Turbo, Llama 3 Instruct, and Mistral Instruct models. To address this limitation, we propose and evaluate ad-hoc solutions that substantially enhance LLMs’ performance on long input sequences by up to 50%, while reducing API cost and latency by up to 93% and 50%, respectively.
pdf
bib
abs
MLD-EA: Check and Complete Narrative Coherence by Introducing Emotions and Actions
Jinming Zhang
|
Yunfei Long
Narrative understanding and story generation are critical challenges in natural language processing (NLP), with much of the existing research focused on summarization and question-answering tasks. While previous studies have explored predicting plot endings and generating extended narratives, they often neglect the logical coherence within stories, leaving a significant gap in the field. To address this, we introduce the Missing Logic Detector by Emotion and Action (MLD-EA) model, which leverages large language models (LLMs) to identify narrative gaps and generate coherent sentences that integrate seamlessly with the story’s emotional and logical flow. The experimental results demonstrate that the MLD-EA model enhances narrative understanding and story generation, highlighting LLMs’ potential as effective logic checkers in story writing with logical coherence and emotional consistency. This work fills a gap in NLP research and advances the broader goal of creating more sophisticated and reliable story-generation systems.
pdf
bib
abs
SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization
Kohei Tsuji
|
Tatsuya Hiraoka
|
Yuchang Cheng
|
Tomoya Iwakura
NLP datasets may still contain annotation errors, even when they are manually annotated. Researchers have attempted to develop methods to automatically reduce the adverse effect of errors in datasets. However, existing methods are time-consuming because they require many trained models to detect errors. This paper proposes a time-saving method that utilizes a tokenization technique called subword regularization to simulate multiple error detection models for detecting errors. Our proposed method, SubRegWeigh, can perform annotation weighting four to five times faster than the existing method. Additionally, SubRegWeigh improved performance in document classification and named entity recognition tasks. In experiments with pseudo-incorrect labels, SubRegWeigh clearly identifies pseudo-incorrect labels as annotation errors. Our code is available at https://github.com/4ldk/SubRegWeigh.
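A hedged sketch of the core idea above: sample several subword-regularized tokenizations per sentence, run a single trained model on each, and weight instances by how often the predictions agree with the gold label. The function signature, the duck-typed classifier, and the SentencePiece sampling parameters are illustrative assumptions; the paper's exact weighting scheme may differ.

```python
# Weight training instances by prediction agreement across sampled tokenizations.
import sentencepiece as spm

def agreement_weight(sp: spm.SentencePieceProcessor, model, sentence: str,
                     gold_label: int, num_samples: int = 5) -> float:
    """sp: a trained SentencePiece unigram model;
    model: any classifier exposing predict(list_of_token_ids) -> label."""
    agree = 0
    for _ in range(num_samples):
        # Sampled (regularized) tokenization instead of the 1-best segmentation.
        ids = sp.encode(sentence, out_type=int, enable_sampling=True,
                        alpha=0.1, nbest_size=-1)
        agree += int(model.predict(ids) == gold_label)
    return agree / num_samples       # low agreement -> likely annotation error
```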
pdf
bib
abs
Rethinking Long Context Generation from the Continual Learning Perspective
Zeyuan Yang
|
Fangzhou Xiong
|
Peng Li
|
Yang Liu
Due to the limited context window, Large Language Models (LLMs) struggle with processing long contexts. Although fine-tuning can extend the context window, it incurs substantial computation costs. In contrast, recent tuning-free approaches reallocate the attention mechanism or incorporate temporary trainable parameters. In this work, by jointly modeling instance-level generation with a limited context window and learning over sequential data, we rethink the long context generation of LLMs from a continual learning perspective. In practice, we inspect existing representative approaches and analyze their synergy with continual learning strategies. Moreover, we integrate these strategies into current approaches to further boost LLMs’ efficiency in processing long contexts. Comprehensive experiments and analysis confirm the feasibility of continual learning insights for improving long-context processing.
pdf
bib
abs
LTRS: Improving Word Sense Disambiguation via Learning to Rank Senses
Hansi Wang
|
Yue Wang
|
Qiliang Liang
|
Yang Liu
Word Sense Disambiguation (WSD) is a fundamental task critical for accurate semantic understanding. Conventional training strategies usually only consider predefined senses for target words and learn each of them from relatively limited instances, neglecting the influence of similar ones. To address these problems, we propose the method of Learning to Rank Senses (LTRS) to enhance the task. This method helps a model learn to represent and disambiguate senses from a broadened range of instances via ranking an expanded list of sense definitions. By employing LTRS, our model achieves a SOTA F1 score of 79.6% in Chinese WSD and exhibits robustness in low-resource settings. Moreover, it shows excellent training efficiency, achieving faster convergence than previous methods. This provides a new technical approach to WSD and may also apply to the task for other languages.
pdf
bib
abs
Are Your Keywords Like My Queries? A Corpus-Wide Evaluation of Keyword Extractors with Real Searches
Martina Galletti
|
Giulio Prevedello
|
Emanuele Brugnoli
|
Donald Ruggiero Lo Sardo
|
Pietro Gravino
Keyword Extraction (KE) is essential in Natural Language Processing (NLP) for identifying key terms that represent the main themes of a text, and it is vital for applications such as information retrieval, text summarisation, and document classification. Despite the development of various KE methods — including statistical approaches and advanced deep learning models — evaluating their effectiveness remains challenging. Current evaluation metrics focus on keyword quality, balance, and overlap with annotations from authors and professional indexers, but neglect real-world information retrieval needs. This paper introduces a novel evaluation method designed to overcome this limitation by using real query data from Google Trends and can be used with both supervised and unsupervised KE approaches. We applied this method to three popular KE approaches (YAKE, RAKE and KeyBERT) and found that KeyBERT was the most effective in capturing users’ top queries, with RAKE also showing surprisingly good performance. The code is open-access and publicly available.
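A minimal sketch of the evaluation idea, under the simplifying assumption that an extractor is scored by the fraction of real user queries covered by at least one of its keywords; the query list and keyword sets below are illustrative placeholders, not data from the paper or from Google Trends.

```python
def query_coverage(keywords: set[str], queries: list[str]) -> float:
    # Fraction of real queries that contain at least one extracted keyword.
    keywords = {k.lower() for k in keywords}
    hits = sum(any(k in q.lower() for k in keywords) for q in queries)
    return hits / len(queries) if queries else 0.0

real_queries = ["climate change effects", "global warming causes", "sea level rise"]
yake_keywords = {"climate", "emissions"}
keybert_keywords = {"climate change", "sea level"}

print("YAKE coverage:", query_coverage(yake_keywords, real_queries))
print("KeyBERT coverage:", query_coverage(keybert_keywords, real_queries))
```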
pdf
bib
abs
NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers
Angel Yahir Loredo Lopez
|
Tyler McDonald
|
Ali Emami
Large Language Models (LLMs) have shown impressive performance on various benchmarks, yet their ability to engage in deliberate reasoning remains questionable. We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game. This benchmark is designed to penalize quick, intuitive “System 1” thinking, isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple machine learning heuristic, and humans across three configurations: single-attempt, multiple attempts without hints, and multiple attempts with contextual hints. Our findings reveal a significant performance gap: even top-performing LLMs like GPT-4 fall short of human performance by nearly 30%. Notably, advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases. NYT-Connections uniquely combines linguistic isolation, resistance to intuitive shortcuts, and regular updates to mitigate data leakage, offering a novel tool for assessing LLM reasoning capabilities.
pdf
bib
abs
How Well Can Large Language Models Reflect? A Human Evaluation of LLM-generated Reflections for Motivational Interviewing Dialogues
Erkan Basar
|
Xin Sun
|
Iris Hendrickx
|
Jan de Wit
|
Tibor Bosse
|
Gert-Jan De Bruijn
|
Jos A. Bosch
|
Emiel Krahmer
Motivational Interviewing (MI) is a counseling technique that promotes behavioral change through reflective responses to mirror or refine client statements. While advanced Large Language Models (LLMs) can generate engaging dialogues, challenges remain for applying them in a sensitive context such as MI. This work assesses the potential of LLMs to generate MI reflections via three LLMs: GPT-4, Llama-2, and BLOOM, and explores the effect of dialogue context size and integration of MI strategies for reflection generation by LLMs. We conduct evaluations using both automatic metrics and human judges on four criteria: appropriateness, relevance, engagement, and naturalness, to assess whether these LLMs can accurately generate the nuanced therapeutic communication required in MI. While we demonstrate LLMs’ potential in generating MI reflections comparable to human therapists, content analysis shows that significant challenges remain. By identifying the strengths and limitations of LLMs in generating empathetic and contextually appropriate reflections in MI, this work contributes to the ongoing dialogue in enhancing LLM’s role in therapeutic counseling.
pdf
bib
abs
Rethinking the Alignment of Psychotherapy Dialogue Generation with Motivational Interviewing Strategies
Xin Sun
|
Xiao Tang
|
Abdallah El Ali
|
Zhuying Li
|
Pengjie Ren
|
Jan de Wit
|
Jiahuan Pei
|
Jos A. Bosch
Recent advancements in large language models (LLMs) have shown promise in generating psychotherapeutic dialogues, particularly in the context of motivational interviewing (MI). However, the inherent lack of transparency in LLM outputs presents significant challenges given the sensitive nature of psychotherapy. Applying MI strategies, a set of MI skills, to generate more controllable therapeutic-adherent conversations with explainability provides a possible solution. In this work, we explore the alignment of LLMs with MI strategies by first prompting the LLMs to predict the appropriate strategies as reasoning and then utilizing these strategies to guide the subsequent dialogue generation. We seek to investigate whether such alignment leads to more controllable and explainable generations. Multiple experiments including automatic and human evaluations are conducted to validate the effectiveness of MI strategies in aligning psychotherapy dialogue generation. Our findings demonstrate the potential of LLMs in producing strategically aligned dialogues and suggest directions for practical applications in psychotherapeutic settings.
pdf
bib
abs
Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection
Shanu Kumar
|
Saish Mendke
|
Karody Lubna Abdul Rahman
|
Santosh Kurasa
|
Parag Agrawal
|
Sandipan Dandapat
Chain-of-thought (CoT) prompting has significantly enhanced the capability of large language models (LLMs) by structuring their reasoning processes. However, existing methods face critical limitations: handcrafted demonstrations require extensive human expertise, while trigger phrases are prone to inaccuracies. In this paper, we propose the Zero-shot Uncertainty-based Selection (ZEUS) method, a novel approach that improves CoT prompting by utilizing uncertainty estimates to select effective demonstrations without needing access to model parameters. Unlike traditional methods, ZEUS offers high sensitivity in distinguishing between helpful and ineffective questions, ensuring more precise and reliable selection. Our extensive evaluation shows that ZEUS consistently outperforms existing CoT strategies across four challenging reasoning benchmarks, demonstrating its robustness and scalability.
pdf
bib
abs
Word-level Cross-lingual Structure in Large Language Models
Zihao Feng
|
Hailong Cao
|
Wang Xu
|
Tiejun Zhao
Large Language Models (LLMs) have demonstrated exceptional performance across a broad spectrum of cross-lingual Natural Language Processing (NLP) tasks. However, previous methods predominantly focus on leveraging parallel corpora to construct instruction data for continued pre-training or fine-tuning. They ignore the state of parallel data in the hidden layers of LLMs. In this paper, we demonstrate the Word-level Cross-lingual Structure (WCS) of LLMs, showing that word-level embeddings in the hidden layers are isomorphic between languages. We find that the hidden states of different languages’ inputs in the LLMs’ hidden layers can be aligned with an orthogonal matrix at the word level. We verify this conclusion both mathematically and through downstream tasks on two representative LLM foundations, LLaMA2 and BLOOM. Besides, we propose an Isomorphism-based Data Augmentation (IDA) method to apply the WCS to a downstream cross-lingual task, Bilingual Lexicon Induction (BLI), in both supervised and unsupervised settings. Experiments show significant improvements of our proposed method over all the baselines, especially on low-resource languages.
pdf
bib
abs
Trucidator: Document-level Event Factuality Identification via Hallucination Enhancement and Cross-Document Inference
Zihao Zhang
|
Zhong Qian
|
Xiaoxu Zhu
|
Peifeng Li
|
Qiaoming Zhu
Document-level event factuality identification (DEFI) assesses the veracity degree to which an event mentioned in a document has happened, which is crucial for many natural language processing tasks. Previous work assesses event factuality by solely relying on the semantic information within a single document, which fails to identify hard cases where the document itself is hallucinative or counterfactual. There is also a pressing need for more suitable data of this kind. To tackle these issues, we construct Factualusion, a novel corpus with hallucination features that can be used not only for DEFI but can also be applied for hallucination evaluation for large language models. We further propose Trucidator, a graph-based framework that constructs intra-document and cross-document graphs and employs a multi-task learning paradigm to acquire more robust node embeddings, leveraging cross-document inference for more accurate identification. Experiments show that our proposed framework outperforms several baselines, demonstrating the effectiveness of our method.
pdf
bib
abs
RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation
Andrei-Marius Avram
|
Mircea Timpuriu
|
Andreea Iuga
|
Vlad-Cristian Matei
|
Iulian-Marius Taiatu
|
Tudor Găină
|
Dumitru-Clementin Cercel
|
Mihaela-Claudia Cercel
|
Florin Pop
Using supervised automatic summarization methods requires sufficient corpora that include pairs of documents and their summaries. As with many tasks in natural language processing, most of the datasets available for summarization are in English, posing challenges for developing summarization models in other languages. Thus, in this work, we introduce RoLargeSum, a novel large-scale summarization dataset for the Romanian language crawled from various publicly available news websites from Romania and the Republic of Moldova that were thoroughly cleaned to ensure a high-quality standard. RoLargeSum contains more than 615K news articles, together with their summaries, as well as their headlines, keywords, dialect, and other metadata that we found on the targeted websites. We further evaluated the performance of several BART variants and open-source large language models on RoLargeSum for benchmarking purposes. We manually evaluated the results of the best-performing system to gain insight into the potential pitfalls of this dataset and future development.
pdf
bib
abs
From Detection to Explanation: Effective Learning Strategies for LLMs in Online Abusive Language Research
Chiara Di Bonaventura
|
Lucia Siciliani
|
Pierpaolo Basile
|
Albert Merono Penuela
|
Barbara McGillivray
Abusive language detection relies on understanding different levels of intensity, expressiveness and targeted groups, which requires commonsense reasoning, world knowledge and linguistic nuances that evolve over time. Here, we frame the problem as a knowledge-guided learning task, and demonstrate that LLMs’ implicit knowledge without an accurate strategy is not suitable for multi-class detection or explanation generation. We publicly release GLlama Alarm, the knowledge-Guided version of Llama-2 instruction fine-tuned for multi-class abusive language detection and explanation generation. By being fine-tuned on structured explanations and external reliable knowledge sources, our model mitigates bias and generates explanations that are relevant to the text and coherent with human reasoning, with an average 48.76% better alignment with human judgment according to our expert survey.
pdf
bib
abs
TEEMIL : Towards Educational MCQ Difficulty Estimation in Indic Languages
Manikandan Ravikiran
|
Siddharth Vohra
|
Rajat Verma
|
Rohit Saluja
|
Arnav Bhavsar
Difficulty estimation of multiple-choice questions (MCQs) is crucial for creating effective educational assessments, yet remains underexplored in Indic languages like Hindi and Kannada due to the lack of comprehensive datasets. This paper addresses this gap by introducing two datasets, TEEMIL-H and TEEMIL-K, containing 4689 and 4215 MCQs, respectively, with manually annotated difficulty labels. We benchmark these datasets using state-of-the-art multilingual models and conduct ablation studies to analyze the effect of context, the impact of options, and the presence of the None of the Above (NOTA) option on difficulty estimation. Our findings establish baselines for difficulty estimation in Hindi and Kannada, offering valuable insights into improving model performance and guiding future research in MCQ difficulty estimation.
pdf
bib
abs
What’s Wrong? Refining Meeting Summaries with LLM Feedback
Frederic Thomas Kirstein
|
Terry Lima Ruas
|
Bela Gipp
Meeting summarization has become a critical task since digital encounters have become a common practice. Large language models (LLMs) show great potential in summarization, offering enhanced coherence and context understanding compared to traditional methods. However, they still struggle to maintain relevance and avoid hallucination. We introduce a multi-LLM correction approach for meeting summarization using a two-phase process that mimics the human review process: mistake identification and summary refinement. We release QMSum Mistake, a dataset of 200 automatically generated meeting summaries annotated by humans on nine error types, including structural, omission, and irrelevance errors. Our experiments show that these errors can be identified with high accuracy by an LLM. We transform identified mistakes into actionable feedback to improve the quality of a given summary measured by relevance, informativeness, conciseness, and coherence. This post-hoc refinement effectively improves summary quality by leveraging multiple LLMs to validate output quality. Our multi-LLM approach for meeting summarization shows potential for similar complex text generation tasks requiring robustness, action planning, and discussion towards a goal.
pdf
bib
abs
Scene Graph and Dependency Grammar Enhanced Remote Sensing Change Caption Network (SGD-RSCCN)
Qiaoli Sun
|
Yan Wang
|
Xiaoyu Song
With the continuous advancement of remote sensing technology, it is easier to obtain high-resolution, multi-temporal and multi-spectral images. These images carry rich information about ground objects. However, how to effectively extract useful information from the complex image data and convert it into understandable semantic descriptions remains a challenge. To deal with these challenges, we propose a Scene Graph and Dependency Grammar Enhanced Remote Sensing Change Caption Network (SGD-RSCCN) to improve the accuracy and naturalness of extracting and describing change information from remote sensing images. By combining advanced visual analysis technology and natural language processing technology, the network not only optimizes the problem of insufficient understanding of complex scenes, but also enhances the ability to capture dynamic changes, thereby generating more accurate and smooth natural language descriptions. In addition, we propose a decoder based on prior knowledge, which further improves the readability and comprehensibility of the descriptions. Extensive experiments on LEVIR-CC and Dubai-CC datasets verify the advantages of the proposed method in generating accurate and faithful descriptions.
pdf
bib
abs
Looking at the Unseen: Effective Sampling of Non-Related Propositions for Argument Mining
Ramon Ruiz-Dolz
|
Debela Gemechu
|
Zlata Kikteva
|
Chris Reed
Traditionally, argument mining research has approached the task of automatic identification of argument structures by using existing definitions of what constitutes an argument, while leaving the equally important matter of what does not qualify as an argument unaddressed. With the ability to distinguish between what is and what is not a natural language argument being at the core of argument mining as a field, it is interesting that no previous work has explored approaches to effectively select non-related propositions (i.e., propositions that are not connected through an argumentative relation, such as support or attack) that improve the data used for learning argument mining tasks. In this paper, we address the question of how to effectively sample non-related propositions from six different argument mining corpora belonging to different domains and encompassing both monologue and dialogue forms of argumentation. To that end, in addition to considering undersampling baselines from previous work, we propose three new sampling strategies relying on context (i.e., short/long) and the semantic similarity between propositions. Our results indicate that using more informed sampling strategies improves the performance, not only when evaluating models on their respective test splits, but also in the case of cross-domain evaluation.
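As a rough sketch of one similarity-based sampling strategy (the paper's exact procedure may differ), non-linked proposition pairs can be ranked by semantic similarity and the most similar ones kept as hard negatives. The bag-of-words encoder below is a stand-in for a real sentence encoder.

```python
import numpy as np
from itertools import combinations

def embed(text: str, vocab: dict) -> np.ndarray:
    # Toy bag-of-words embedding; replace with a real sentence encoder.
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1
    return v

def sample_hard_negatives(propositions, linked_pairs, k=2):
    vocab = {w: i for i, w in enumerate({w for p in propositions for w in p.lower().split()})}
    vecs = [embed(p, vocab) for p in propositions]
    candidates = []
    for i, j in combinations(range(len(propositions)), 2):
        if (i, j) in linked_pairs or (j, i) in linked_pairs:
            continue  # skip pairs already connected by an argumentative relation
        denom = np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]) or 1.0
        candidates.append(((i, j), float(vecs[i] @ vecs[j] / denom)))
    candidates.sort(key=lambda x: -x[1])          # most similar non-related pairs first
    return [pair for pair, _ in candidates[:k]]

props = ["taxes should be lowered",
         "lower taxes boost the economy",
         "the economy grew last year"]
print(sample_hard_negatives(props, linked_pairs={(0, 1)}))
```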
pdf
bib
abs
“Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailbreak
Lingrui Mei
|
Shenghua Liu
|
Yiwei Wang
|
Baolong Bi
|
Jiayi Mao
|
Xueqi Cheng
“Jailbreak” is a major safety concern of Large Language Models (LLMs), which occurs when malicious prompts lead LLMs to produce harmful outputs, raising issues about the reliability and safety of LLMs. Therefore, an effective evaluation of jailbreaks is crucial for developing mitigation strategies. However, our research reveals that many jailbreaks identified by current evaluations may actually be hallucinations—erroneous outputs that are mistaken for genuine safety breaches. This finding suggests that some perceived vulnerabilities might not represent actual threats, indicating a need for more precise red teaming benchmarks. To address this problem, we propose the Benchmark for reliABilitY and jailBreak haLlUcination Evaluation (BabyBLUE). BabyBLUE introduces a specialized validation framework including various evaluators to enhance existing jailbreak benchmarks, ensuring outputs are useful malicious instructions. Additionally, BabyBLUE presents a new dataset as an augmentation to the existing red teaming benchmarks, specifically addressing hallucinations in jailbreaks, aiming to evaluate the true potential of jailbroken LLM outputs to cause harm to human society.
pdf
bib
abs
From Form to Meaning: The Case of Particles within the Prague Dependency Treebank Annotation Scheme
Marie Mikulova
|
Barbora Štěpánková
|
Jan Štěpánek
In the last decades, computational linguistics has become increasingly interested in annotation schemes that aim at an adequate description of the meaning of the sentences and texts. Discussions are ongoing on an appropriate annotation scheme for a large and complex amount of diverse information. In this contribution devoted to description of polyfunctional uninflected words (namely particles), i.e. words which, although having only one paradigmatic form, can have several different syntactic functions and even express relatively different semantic distinctions, we argue that it is the multi-layer system (linked from meaning to text) that allows a comprehensive description of the relations between morphological properties, syntactic function and expressed meaning, and thus contributes to greater accuracy in the description of the phenomena concerned and to the overall consistency of the annotated data. These aspects are demonstrated within the Prague Dependency Treebank annotation scheme, whose pioneering proposal can be found in the first COLING proceedings from 1965 (Sgall 1965), and to this day, the concept has proved to be sound and serves very well for complex annotation.
pdf
bib
abs
Enhancing Long-range Dependency with State Space Model and Kolmogorov-Arnold Networks for Aspect-based Sentiment Analysis
Adamu Lawan
|
Juhua Pu
|
Haruna Yunusa
|
Aliyu Umar
|
Muhammad Lawan
Aspect-based Sentiment Analysis (ABSA) evaluates sentiments toward specific aspects of entities within the text. However, attention mechanisms and neural network models struggle with syntactic constraints. The quadratic complexity of attention mechanisms also limits their adoption for capturing long-range dependencies between aspect and opinion words in ABSA. This complexity can lead to the misinterpretation of irrelevant contextual words, restricting their effectiveness to short-range dependencies. To address the above problem, we present a novel approach to enhance long-range dependencies between aspect and opinion words in ABSA (MambaForGCN). This approach incorporates syntax-based Graph Convolutional Network (SynGCN) and MambaFormer (Mamba-Transformer) modules to encode input with dependency relations and semantic information. The Multihead Attention (MHA) and Selective State Space model (Mamba) blocks in the MambaFormer module serve as channels to enhance the model with short and long-range dependencies between aspect and opinion words. We also introduce the Kolmogorov-Arnold Networks (KANs) gated fusion, an adaptive feature representation system that integrates SynGCN and MambaFormer and captures non-linear, complex dependencies. Experimental results on three benchmark datasets demonstrate MambaForGCN’s effectiveness, outperforming state-of-the-art (SOTA) baseline models.
pdf
bib
abs
ROUGE-SciQFS: A ROUGE-based Method to Automatically Create Datasets for Scientific Query-Focused Summarization
Juan Ramirez-Orta
|
Ana Maguitman
|
Axel J. Soto
|
Evangelos Milios
So far, the task of Scientific Query-Focused Summarization (Sci-QFS) has lagged in development when compared to other areas of Scientific Natural Language Processing because of the lack of data. In this work, we propose a methodology to take advantage of existing collections of academic papers to obtain large-scale datasets for this task automatically. After applying it to the papers from our reading group, we introduce a novel dataset for Sci-QFS composed of 8,695 examples, each one with a query, the sentences of the full text from a paper and the relevance labels for each. After testing several classical and state-of-the-art embedding models on this data, we found that the task of Sci-QFS is far from being solved, although it is relatively straightforward for humans. Surprisingly, we found that classical methods outperformed modern pre-trained Deep Language Models (sometimes by a large margin), showing the need for large datasets to better fine-tune the latter. We share our experiments, data and models at https://github.com/jarobyte91/rouge_sciqfs.
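A minimal sketch of the automatic labeling step, under the simplifying assumption that a sentence is relevant to a query when its ROUGE-1 recall of the query exceeds a threshold; the paper's exact scoring may differ.

```python
def rouge1_recall(reference: str, candidate: str) -> float:
    # Fraction of reference (query) tokens that appear in the candidate sentence.
    ref = reference.lower().split()
    cand = set(candidate.lower().split())
    if not ref:
        return 0.0
    return sum(tok in cand for tok in ref) / len(ref)

def label_sentences(query: str, sentences: list[str], threshold: float = 0.5):
    # Each (sentence, label) pair becomes one training example for Sci-QFS.
    return [(s, int(rouge1_recall(query, s) >= threshold)) for s in sentences]

query = "query focused summarization datasets"
sentences = [
    "We release a new dataset for query-focused summarization of papers.",
    "Transformers were introduced in 2017.",
]
print(label_sentences(query, sentences))
```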
pdf
bib
abs
Commonsense Subgraph for Inductive Relation Reasoning with Meta-learning
Feng Zhao
|
Zhilu Zhang
|
Cheng Yan
|
Xianggan Liu
In knowledge graphs (KGs), predicting missing relations is a critical reasoning task. Recent subgraph-based models have delved into inductive settings, which aim to predict relations between newly added entities. While these models have demonstrated the ability for inductive reasoning, they only consider the structural information of the subgraph and neglect the loss of semantic information caused by replacing entities with nodes. To address this problem, we propose a novel Commonsense Subgraph Meta-Learning (CSML) model. Specifically, we extract concepts from entities, which can be viewed as high-level semantic information. Unlike previous methods, we use concepts instead of nodes to construct commonsense subgraphs. By combining these with structural subgraphs, we can leverage both structural and semantic information for more comprehensive and rational predictions. Furthermore, we regard concepts as meta-information and employ meta-learning to facilitate rapid knowledge transfer, thus addressing more complex few-shot scenarios. Experimental results confirm the superior performance of our model in both standard and few-shot inductive reasoning.
pdf
bib
abs
Clear Up Confusion: Iterative Differential Generation for Fine-grained Intent Detection with Contrastive Feedback
Feng Zhang
|
Wei Chen
|
Meng Gao
|
Fei Ding
|
Tengjiao Wang
|
Jiahui Yao
|
Jiabin Zheng
Fine-grained intent detection involves identifying a large number of classes with subtle variations. Recently, generating pseudo samples via large language models has attracted increasing attention to alleviate the data scarcity caused by emerging new intents. However, these methods generate samples for each class independently and neglect the relationships between classes, leading to ambiguity in pseudo samples, particularly for fine-grained labels. Moreover, they typically rely on one-time generation and overlook feedback from pseudo samples. In this paper, we propose an iterative differential generation framework with contrastive feedback to generate high-quality pseudo samples and accurately capture the crucial nuances in the target class distribution. Specifically, we propose differential guidelines that include potentially ambiguous labels to reduce confusion for similar labels. Then we conduct rubric-driven refinement, ensuring the validity and diversity of pseudo samples. Finally, rather than generating only once, we iteratively generate new samples with contrastive feedback to achieve accurate identification and distillation of target knowledge. Extensive experiments in zero/few-shot and full-shot settings on three datasets verify the effectiveness of our method.
pdf
bib
abs
Leveraging Explicit Reasoning for Inference Integration in Commonsense-Augmented Dialogue Models
Sarah E. Finch
|
Jinho D. Choi
Open-domain dialogue systems need to grasp social commonsense to understand and respond effectively to human users. Commonsense-augmented dialogue models have been proposed that aim to infer commonsense knowledge from dialogue contexts in order to improve response quality. However, existing approaches to commonsense-augmented dialogue rely on implicit reasoning to integrate commonsense inferences during response generation. In this study, we explore the impact of explicit reasoning against implicit reasoning over commonsense for dialogue response generation. Our findings demonstrate that separating commonsense reasoning into explicit steps for generating, selecting, and integrating commonsense into responses leads to better dialogue interactions, improving naturalness, engagement, specificity, and overall quality. Subsequent analyses of these findings unveil insights into the effectiveness of various types of commonsense in generating responses and the particular response traits enhanced through explicit reasoning for commonsense integration. Our work advances research in open-domain dialogue by achieving a new state-of-the-art in commonsense-augmented response generation.
pdf
bib
abs
Integrating Group-based Preferences from Coarse to Fine for Cold-start Users Recommendation
Siyu Wang
|
Jianhui Jiang
|
Jiangtao Qiu
|
Shengran Dai
Recent studies have demonstrated that cross-domain recommendation (CDR) effectively addresses the cold-start problem. Most approaches rely on transfer functions to generate user representations from the source to the target domain. Although these methods substantially enhance recommendation performance, they exhibit certain limitations, notably the frequent oversight of similarities in user preferences, which can offer critical insights for training transfer functions. Moreover, existing methods typically derive user preferences from historical purchase records or reviews, without considering that preferences operate at three distinct levels: category, brand, and aspect, each influencing decision-making differently. This paper proposes a model that integrates the preferences from coarse to fine levels to improve recommendations for cold-start users. The model leverages historical data from the source domain and external memory networks to generate user representations across different preference levels. A meta-network then transfers these representations to the target domain, where user-item ratings are predicted by aggregating the diverse representations. Experimental results demonstrate that our model outperforms state-of-the-art approaches in addressing the cold-start problem on three CDR tasks.
pdf
bib
abs
Automatic Multiple-Choice Question Generation and Evaluation Systems Based on LLM: A Study Case With University Resolutions
Sérgio Silva Mucciaccia
|
Thiago Meireles Paixão
|
Filipe Wall Mutz
|
Claudine Santos Badue
|
Alberto Ferreira de Souza
|
Thiago Oliveira-Santos
Multiple choice questions (MCQs) are often used in both employee selection and training, providing objectivity, efficiency, and scalability. However, their creation is resource-intensive, requiring significant expertise and financial investment. This study leverages large language models (LLMs) and prompt engineering techniques to automate the generation and validation of MCQs, particularly within the context of university regulations. In particular, two novel approaches are proposed in this work: an automatic question generation system for university resolutions and an automatic evaluation system to assess the performance of MCQ generation systems. The generation system combines different prompt engineering techniques and a review process to create well-formulated questions. The evaluation system uses prompt engineering combined with an advanced LLM to assess the integrity of the generated questions. Experimental results demonstrate the effectiveness of both systems. The findings highlight the transformative potential of LLMs in educational assessment, reducing the burden on human resources and enabling scalable, cost-effective MCQ generation.
pdf
bib
abs
Generating Commonsense Reasoning Questions with Controllable Complexity through Multi-step Structural Composition
Jianxing Yu
|
Shiqi Wang
|
Hanjiang Lai
|
Wenqing Chen
|
Yanghui Rao
|
Qinliang Su
|
Jian Yin
This paper studies the task of generating commonsense reasoning questions (QG) with desired difficulty levels. Compared to traditional shallow questions that can be solved by simple term matching, ours are more challenging. Our answering process requires reasoning over multiple contextual and commonsense clues. That involves advanced comprehension skills, such as abstract semantics learning and missing knowledge inference. Existing work mostly learns to map the given text into questions, lacking a mechanism to control results with the desired complexity. To address this problem, we propose a novel controllable framework. We first derive contextual and commonsense clues involved in reasoning questions from the text. These clues are used to create simple sub-questions. We then aggregate multiple sub-questions to compose complex ones under the guidance of prior reasoning structures. By iterating this process, we can compose a complex QG task based on a series of smaller and simpler QG subtasks. Each subtask serves as a building block for a larger one. Each composition corresponds to an increase in the reasoning step. Moreover, we design a voting verifier to ensure results’ validity from multiple views, including answer consistency, reasoning difficulty, and context correlation. Finally, we can learn the optimal QG model to yield thought-provoking results. Evaluations on two typical datasets validate our method.
pdf
bib
abs
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation
Minzhi Li
|
Zhengyuan Liu
|
Shumin Deng
|
Shafiq Joty
|
Nancy Chen
|
Min-Yen Kan
The acceleration of Large Language Models (LLMs) research has opened up new possibilities for evaluating generated text. Though LLMs serve as scalable and economical evaluators, how reliable these evaluators are remains under-explored. Prior research efforts in the meta-evaluation of LLMs as judges limit the prompting of an LLM to a single use to obtain a final evaluation decision. They then compute the agreement between LLMs’ outputs and human labels. This lacks interpretability in understanding the evaluation capability of LLMs. In light of this challenge, we propose DnA-Eval, which breaks down the evaluation process into decomposition and aggregation stages based on pedagogical practices. Our experiments show that it not only provides a more interpretable window into how well LLMs evaluate, but also leads to improvements of up to 39.6% for different LLMs on a variety of meta-evaluation benchmarks.
pdf
bib
abs
Towards Faithful Multi-step Reasoning through Fine-Grained Causal-aware Attribution Reasoning Distillation
Zheng Chu
|
Jingchang Chen
|
Zhongjie Wang
|
Guo Tang
|
Qianglong Chen
|
Ming Liu
|
Bing Qin
Despite the remarkable reasoning capabilities demonstrated by large language models (LLM), the substantial computational overhead limits their practices. Some efforts have been directed toward distilling multi-step reasoning capabilities into smaller models through chain-of-thought (CoT). While CoT facilitates multi-step reasoning, the dependencies between reasoning steps are not always clearly discernible, which may lead to inconsistent reasoning. In this paper, we introduce fine-grained attribution reasoning distillation (FARD), which incorporates grounded citations to consolidate the relationships between reasoning steps. Specifically, FARD distills attribution reasoning rationales from LLMs to substitute CoT reasonings, which clarifies the dependencies among reasoning steps. Besides, we regularize the model’s attention pattern by leveraging the causal dependencies between reasoning steps, thereby enhancing the consistency of reasoning. Grounded attribution reasoning also enhances interpretability and verifiability, thereby facilitating faithful reasoning. We evaluate FARD on mathematical and general reasoning benchmarks. The experimental results indicate that FARD outperforms CoT distillation methods in mathematical reasoning, demonstrating its effectiveness. Furthermore, the small models trained with FARD have shown outstanding performance in out-of-distribution reasoning, proving strong generalization capabilities.
pdf
bib
abs
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
Qian Tao
|
Wenyuan Yu
|
Jingren Zhou
Large language models have shown exceptional capabilities in a wide range of tasks, such as text generation and video generation, among others. However, due to their massive parameter count, these models often require substantial storage space, imposing significant constraints on the machines deploying LLMs. To overcome this limitation, one research direction proposes to compress the models using integer replacements for floating-point numbers, in a process known as quantization. Some recent studies suggest quantizing the key and value cache (KV Cache) of LLMs, and designing quantization techniques that treat the key and value matrices equivalently. This work delves deeper into the asymmetric structural roles of the KV Cache, a phenomenon where the transformer’s output loss is more sensitive to the quantization of key matrices. We conduct a systematic examination of the attention output error resulting from key and value quantization. The phenomenon inspires us to propose an asymmetric quantization strategy. Our approach allows for 1-bit quantization of the KV cache by implementing distinct configurations for key and value matrices. We carry out experiments across a variety of datasets, demonstrating that our proposed model allows for the quantization of up to 75% of decoder layers with 1 bit, while simultaneously maintaining performance levels comparable to those of models with floating-point parameters.
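A minimal sketch of the asymmetric idea, assuming simple uniform min-max quantization and an illustrative per-layer configuration (keys at a higher bit-width than the 1-bit values); this is not the paper's actual implementation.

```python
import numpy as np

def quantize(x: np.ndarray, bits: int):
    # Uniform min-max quantization; bits=1 keeps only two levels.
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.int32)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q * scale + lo

# Illustrative config: keys kept at 4 bits (more sensitive), values at 1 bit.
layer_config = {"key_bits": 4, "value_bits": 1}

rng = np.random.default_rng(0)
k_cache = rng.normal(size=(8, 64))   # toy key cache for one layer
v_cache = rng.normal(size=(8, 64))   # toy value cache for one layer

k_hat = dequantize(*quantize(k_cache, layer_config["key_bits"]))
v_hat = dequantize(*quantize(v_cache, layer_config["value_bits"]))
print("key MSE:", float(((k_cache - k_hat) ** 2).mean()))
print("value MSE:", float(((v_cache - v_hat) ** 2).mean()))
```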
pdf
bib
abs
E-Bench: Towards Evaluating the Ease-of-Use of Large Language Models
Zhenyu Zhang
|
Bingguang Hao
|
Jinpeng Li
|
Zekai Zhang
|
Dongyan Zhao
Modern large language models are sensitive to prompts, and another synonymous expression or a typo may lead to unexpected results for the model. Composing an optimal prompt for a specific demand lacks theoretical support and relies entirely on human experimentation, which poses a considerable obstacle to popularizing generative artificial intelligence. However, there is no systematic analysis of the stability of large language models to resist prompt perturbations. In this work, we propose to evaluate the ease-of-use of large language models and construct E-Bench, simulating the actual situation of human use from synonymous perturbation (including paraphrasing, simplification, and colloquialism) and typographical perturbation. Besides, we also discuss the combination of these two types of perturbation and analyze the main reasons for performance degradation. Experimental results indicate that with the increase of model size, although the ease-of-use could be significantly improved, there is still a long way to go to build a sufficiently user-friendly model.
pdf
bib
abs
Enhancing Online Grooming Detection via Backtranslation Augmentation
Hamed Waezi
|
Hossein Fani
Grooming minors for sexual exploitation has become an increasingly significant concern on online conversation platforms. For a safer online experience for minors, machine learning models have been proposed to tap into explicit textual remarks and automate the detection of predatory conversations. Such models, however, fall short in real-world applications due to the sparse distribution of predatory conversations. In this paper, we propose backtranslation augmentation to augment training datasets with more predatory conversations. Through our experiments on 8 languages from 4 language families using 3 neural translators, we demonstrate that backtranslation augmentation improves models’ classification efficacy while requiring fewer training epochs. Our code and experimental results are available at https://github.com/fani-lab/osprey/tree/coling25.
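A minimal sketch of backtranslation augmentation for the minority (predatory) class; the translate function is a hypothetical placeholder for any neural translator and simply echoes its input so the example runs.

```python
def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder: plug in a real MT model or API here.
    return text

def backtranslate(text: str, pivot: str) -> str:
    # Round-trip translation through a pivot language to obtain a paraphrase.
    return translate(translate(text, "en", pivot), pivot, "en")

def augment_minority(dataset, pivots=("de", "fr", "fa")):
    # dataset: list of (conversation_text, label); label 1 = predatory.
    augmented = list(dataset)
    for text, label in dataset:
        if label == 1:
            augmented += [(backtranslate(text, p), 1) for p in pivots]
    return augmented

toy = [("hi how old are you", 1), ("did you finish the homework", 0)]
print(len(augment_minority(toy)))  # 1 positive example -> 3 extra paraphrases
```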
pdf
bib
abs
CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems
Tao Feng
|
Lizhen Qu
|
Xiaoxi Kang
|
Gholamreza Haffari
Automatically evaluating the quality of responses in dialogue systems is a challenging yet crucial task. Current metrics often fail to align with human judgments, especially when assessing responses that are grammatically correct. To address this issue, we propose a novel metric, called CausalScore, which assesses the relevance of responses by measuring the causal strength between dialogue histories and responses. The causal strength is estimated by utilizing both unconditional dependence and conditional dependencies from dialogue histories to responses. We compare our metric with the existing competitive metrics in terms of their alignment with human judgements. Our experimental results demonstrate that CausalScore significantly surpasses existing state-of-the-art metrics by aligning better with human judgements. Additionally, we collect a dialogue dataset CGDIALOG+ with human-annotated causal relations and a set of pairwise human judgements to facilitate the development of automatic metrics.
pdf
bib
abs
Exploring the Impact of Language Switching on Personality Traits in LLMs
Jacopo Amidei
|
Jose Gregorio Ferreira De Sá
|
Rubén Nieto Luna
|
Andreas Kaltenbrunner
This paper investigates the extent to which LLMs align with humans when personality shifts are associated with language changes. Based on three experiments that focus on GPT-4o and the Eysenck Personality Questionnaire-Revised (EPQR-A), our initial results reveal weak yet significant variations in GPT-4o’s personality across languages, indicating that some of these variations stem from a language-switching effect rather than from translation. Further analysis across five English-speaking countries shows that GPT-4o, leveraging stereotypes, reflects distinct country-specific personality traits.
pdf
bib
abs
LLMs Know What They Need: Leveraging a Missing Information Guided Framework to Empower Retrieval-Augmented Generation
Keheng Wang
|
Feiyu Duan
|
Peiguang Li
|
Sirui Wang
|
Xunliang Cai
Retrieval-Augmented Generation (RAG) demonstrates great value in alleviating outdated knowledge or hallucination by supplying LLMs with updated and relevant knowledge. However, RAG still faces several challenges in tackling complex multi-hop queries, which require LLMs to perform accurate reasoning and retrieval at each step. Inspired by the human reasoning process, where we progressively search for missing information after acquiring useful clues, it is natural to question whether LLMs have similar capabilities. In this work, we first experimentally verified the ability of LLMs to extract information from the retrieved knowledge as well as to know what is still missing. Based on the above discovery, we propose a Missing Information Guided Retrieve-Extraction-Solving paradigm (MIGRES), where we leverage the identification of missing information to generate a targeted query that steers the subsequent knowledge retrieval. Besides, we design a sentence-level re-ranking approach to filter irrelevant content from the retrieved documents, and leverage the information extraction capability of LLMs to extract useful information from the denoised documents. Extensive experiments conducted on multiple public datasets reveal the superiority of the proposed MIGRES method, and analytical experiments demonstrate the effectiveness of our proposed modules. Code and data are released in https://github.com/AdelWang/MIGRES.
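A minimal sketch of a missing-information-guided retrieval loop in the spirit of the paradigm described above; the llm_* and retrieve functions are hypothetical placeholders, not the released components.

```python
def llm_identify_missing(question, evidence):
    # Placeholder: a real system asks the LLM what information is still missing.
    return None if evidence else "capital of France"

def llm_answer(question, evidence):
    # Placeholder: a real system asks the LLM to answer from the evidence.
    return evidence[0] if evidence else "unknown"

def retrieve(query):
    return "Paris is the capital of France."  # placeholder retriever

def answer(question, max_steps=3):
    evidence = []
    for _ in range(max_steps):
        missing = llm_identify_missing(question, evidence)
        if missing is None:                      # nothing missing -> stop retrieving
            break
        evidence.append(retrieve(missing))       # targeted query built from the missing info
    return llm_answer(question, evidence)

print(answer("What is the capital of France?"))
```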
pdf
bib
abs
Chain-of-Specificity: Enhancing Task-Specific Constraint Adherence in Large Language Models
Kaiwen Wei
|
Jiang Zhong
|
Hongzhi Zhang
|
Fuzheng Zhang
|
Di Zhang
|
Li Jin
|
Yue Yu
|
Jingyuan Zhang
Large Language Models (LLMs) exhibit remarkable generative capabilities, enabling the generation of valuable information. Despite these advancements, previous research found that LLMs sometimes struggle with adhering to specific constraints, such as being in a specific place or at a specific time, and at times even overlook them, which leads to responses that are either too generic or not fully satisfactory. Existing approaches attempted to address this issue by decomposing and rewriting input instructions or reflecting on prior failings, yet they fall short in adequately emphasizing specific constraints and unlocking the underlying knowledge, such as programming within the context of software development. In response, this paper proposes a simple yet effective method called Chain-of-Specificity (CoS). Specifically, CoS emphasizes the specific constraints in the input instructions, unlocks knowledge within LLMs, and refines responses. Experiments conducted on publicly available and self-built complex datasets demonstrate that CoS outperforms existing methods in enhancing generated content, especially in terms of specificity. Additionally, as the number of specific constraints increases, other baselines falter, while CoS still performs well. Moreover, we show that distilling responses generated by CoS effectively enhances the ability of smaller models to follow constrained instructions.
pdf
bib
abs
How Transliterations Improve Crosslingual Alignment
Yihong Liu
|
Mingyang Wang
|
Amir Hossein Kargaran
|
Ayyoob ImaniGooghari
|
Orgest Xhelili
|
Haotian Ye
|
Chunlan Ma
|
François Yvon
|
Hinrich Schütze
Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives on both original and transliterated data can improve crosslingual alignment. This improvement further leads to better crosslingual transfer performance. However, it remains unclear how and why a better crosslingual alignment is achieved, as this technique only involves transliterations, and does not use any parallel data. This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance. For this, we train multiple models under varying setups for two pairs of related languages: (1) Polish and Ukrainian and (2) Hindi and Urdu. To assess alignment, we define four types of similarities based on sentence representations. Our experimental results show that adding transliterations alone improves the overall similarities, even for random sentence pairs. With the help of auxiliary transliteration-based alignment objectives, especially the contrastive objective, the model learns to distinguish matched from random pairs, leading to better crosslingual alignment. However, we also show that better alignment does not always yield better downstream performance, suggesting that further research is needed to clarify the connection between alignment and performance. The code implementation is based on
https://github.com/cisnlp/Transliteration-PPA.
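One simple way to probe crosslingual alignment with sentence representations, sketched below under the assumption that matched translation pairs should be more similar than random pairs; the toy embeddings stand in for mPLM hidden states, and the paper's four similarity types are not reproduced here.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
n, d = 50, 128
emb_lang_a = rng.normal(size=(n, d))
emb_lang_b = emb_lang_a + 0.3 * rng.normal(size=(n, d))  # toy "aligned" representation space

# Matched pairs should score higher than random pairs in a well-aligned space.
matched = np.mean([cosine(emb_lang_a[i], emb_lang_b[i]) for i in range(n)])
random_pairs = np.mean([cosine(emb_lang_a[i], emb_lang_b[(i + 7) % n]) for i in range(n)])
print("matched:", round(matched, 3), "random:", round(random_pairs, 3))
```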
pdf
bib
abs
GL-GAN: Perceiving and Integrating Global and Local Styles for Handwritten Text Generation with Mamba
Yiming Wang
|
Hongxi Wei
|
Heng Wang
|
Shiwen Sun
|
Chao He
Handwritten text generation (HTG) aims to synthesize handwritten samples by imitating a specific writer, which has a wide range of applications and thus has significant research value. However, current studies on HTG are confronted with a main bottleneck: dominant models lack the ability to perceive and integrate handwriting styles, which affects the realism of the synthesized samples. In this paper, we propose GL-GAN, which effectively captures and integrates global and local styles. Specifically, we propose a Hybrid Style Encoder (HSE) that combines a state space model (SSM) and convolution to capture multilevel style features through various receptive fields. The captured style features are then fed to the proposed Dynamic Feature Enhancement Module (DFEM), which integrates these features by adaptively modeling the entangled relationships between multilevel styles and removing redundant details. Extensive experiments on two widely used handwriting datasets demonstrate that our GL-GAN is an effective HTG model and outperforms state-of-the-art models remarkably. Our code is publicly available at: https://github.com/Fyzjym/GL-GAN.
pdf
bib
abs
Discrete Subgraph Sampling for Interpretable Graph based Visual Question Answering
Pascal Tilli
|
Ngoc Thang Vu
Explainable artificial intelligence (XAI) aims to make machine learning models more transparent. While many approaches focus on generating explanations post-hoc, interpretable approaches, which generate the explanations intrinsically alongside the predictions, are relatively rare. In this work, we integrate different discrete subset sampling methods into a graph-based visual question answering system to compare their effectiveness in generating interpretable explanatory subgraphs intrinsically. We evaluate the methods on the dataset and show that the integrated methods effectively mitigate the performance trade-off between interpretability and answer accuracy, while also achieving strong co-occurrences between answer and question tokens. Furthermore, we conduct a human evaluation to assess the interpretability of the generated subgraphs using a comparative setting with the extended Bradley-Terry model, showing that the answer and question token co-occurrence metrics strongly correlate with human preferences. Our source code is publicly available.
pdf
bib
abs
From Multiple-Choice to Extractive QA: A Case Study for English and Arabic
Teresa Lynn
|
Malik H. Altakrori
|
Samar M. Magdy
|
Rocktim Jyoti Das
|
Chenyang Lyu
|
Mohamed Nasr
|
Younes Samih
|
Kirill Chirkunov
|
Alham Fikri Aji
|
Preslav Nakov
|
Shantanu Godbole
|
Salim Roukos
|
Radu Florian
|
Nizar Habash
The rapid evolution of Natural Language Processing (NLP) has favoured major languages such as English, leaving a significant gap for many others due to limited resources. This is especially evident in the context of data annotation, a task whose importance cannot be underestimated, but which is time-consuming and costly. Thus, any dataset for resource-poor languages is precious, in particular when it is task-specific. Here, we explore the feasibility of repurposing an existing multilingual dataset for a new NLP task: we repurpose a subset of the BELEBELE dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), to enable the more practical task of extractive QA (EQA) in the style of machine reading comprehension. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). We also present QA evaluation results for several monolingual and cross-lingual QA pairs including English, MSA, and five Arabic dialects. We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced. We also provide a thorough analysis and share insights to deepen understanding of the challenges and opportunities in NLP task reformulation.
pdf
bib
abs
Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment
Tianyu Peng
|
Jiajun Zhang
Knowledge distillation (KD) is an effective model compression method that can transfer the internal capabilities of large language models (LLMs) to smaller ones. However, the multi-modal probability distribution predicted by teacher LLMs is difficult for student models to learn. In this paper, we first demonstrate the importance of multi-modal distribution alignment with experiments and then highlight the inefficiency of existing KD approaches in learning multi-modal distributions. To address this problem, we propose Ranking Loss based Knowledge Distillation (RLKD), which encourages the consistency of the ranking of peak predictions between the teacher and student models. By incorporating a word-level ranking loss, we ensure excellent compatibility with existing distillation objectives while fully leveraging the fine-grained information between different categories in the peaks of the two predicted distributions. Experimental results demonstrate that our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to a significant performance improvement in various downstream tasks.
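A minimal sketch of a word-level ranking objective over the teacher's peak predictions, written as a pairwise margin loss; the exact loss used in RLKD may differ.

```python
import torch

def peak_ranking_loss(teacher_logits, student_logits, k=5, margin=0.1):
    # For the teacher's top-k tokens (in teacher order), penalize the student
    # whenever its logits invert that ranking.
    topk = torch.topk(teacher_logits, k, dim=-1).indices        # (batch, k)
    s = torch.gather(student_logits, -1, topk)                  # student logits on those tokens
    loss = 0.0
    for i in range(k - 1):
        # the token ranked i by the teacher should score higher than the token ranked i+1
        loss = loss + torch.clamp(margin - (s[:, i] - s[:, i + 1]), min=0).mean()
    return loss / (k - 1)

teacher = torch.randn(4, 32000)   # toy teacher logits over a vocabulary
student = torch.randn(4, 32000)   # toy student logits
print(peak_ranking_loss(teacher, student).item())
```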
pdf
bib
abs
DialogueMMT: Dialogue Scenes Understanding Enhanced Multi-modal Multi-task Tuning for Emotion Recognition in Conversations
ChenYuan He
|
Senbin Zhu
|
Hongde Liu
|
Fei Gao
|
Yuxiang Jia
|
Hongying Zan
|
Min Peng
Emotion recognition in conversations (ERC) has garnered significant attention from the research community. However, due to the complexity of visual scenes and dialogue contextual dependencies in conversations, previous ERC methods fail to handle emotional cues from both visual sources and discourse structures. Furthermore, existing state-of-the-art ERC models are trained and tested separately on each single ERC dataset, not verifying their effectiveness across multiple datasets simultaneously. To address these challenges, this paper proposes an innovative framework for ERC, called Dialogue Scenes Understanding Enhanced Multi-modal Multi-task Tuning (DialogueMMT). More concretely, a novel video-language connector is applied within the large vision-language model for capturing video features effectively. Additionally, we utilize multi-task instruction tuning with a unified ERC dataset to enhance the model’s understanding of multi-modal dialogue scenes and employ a chain-of-thought strategy to improve emotion classification performance. Extensive experimental results on three benchmark ERC datasets indicate that the proposed DialogueMMT framework consistently outperforms existing state-of-the-art approaches in terms of overall performance.
pdf
bib
abs
Learning Transition Patterns by Large Language Models for Sequential Recommendation
Jianyang Zhai
|
Zi-Feng Mai
|
Dongyi Zheng
|
Chang-Dong Wang
|
Xiawu Zheng
|
Hui Li
|
Feidiao Yang
|
Yonghong Tian
Large Language Models (LLMs) have demonstrated powerful performance in sequential recommendation due to their robust language modeling and comprehension capabilities. In such paradigms, the item texts of interaction sequences are formulated as sentences and LLMs are utilized to learn language representations or directly generate target item texts by incorporating instructions. Despite their promise, these methods solely focus on modeling the mapping from sequential texts to target items, neglecting the relationship between the items in an interaction sequence. This results in a failure to learn the transition patterns between items, which reflect the dynamic change in user preferences and are crucial for predicting the next item. To tackle this issue, we propose a novel framework for mapping the sequential item texts to the sequential item IDs, named ST2SI. Specifically, we first introduce multi-query input and item linear projection (ILP) to model the conditional probability distribution of items. Then, we further propose ID alignment to address misalignment between item texts and item IDs by instruction tuning. Finally, we propose efficient ILP tuning to adapt flexibly to different scenarios, requiring only training a linear layer to achieve competitive performance. Extensive experiments on six real-world datasets show our approach outperforms the best baselines by 7.33% in NDCG@10, 4.65% in Recall@10, and 8.42% in MRR.
pdf
bib
abs
Aligning Large Language Models with Human Opinions through Persona Selection and Value–Belief–Norm Reasoning
Xuan Long Do
|
Kenji Kawaguchi
|
Min-Yen Kan
|
Nancy Chen
Reasoning and predicting human opinions with large language models (LLMs) is essential yet challenging. Current methods employ role-playing with personae but face two major issues: LLMs are sensitive to even a single irrelevant persona, skewing predictions by up to 30%; and LLMs fail to reason strategically over personae. We propose Chain-of-Opinion (COO), a simple four-step solution modeling which and how to reason with personae, inspired by the Value–Belief–Norm (VBN) theory. COO differentiates between explicit personae (demographics and ideology) and implicit personae (historical opinions), involves: (1) filtering irrelevant attributes from explicit personae; (2) ranking implicit personae into a preferential list for selecting top-k; (3) applying novel VBN reasoning to extract user environmental and personal value, belief, and norm variables for accurate and reliable predictions; and (4) iterating VBN reasoning with progressively larger lists of implicit personae to handle potential persona insufficiency. COO efficiently achieves new state-of-the-art opinion prediction via prompting with only 5 inference calls, improving prior techniques by up to 4%. Notably, fine-tuning LMs with COO’s data results in significantly better opinion-aligned models, by up to 23%.
pdf
bib
abs
MiMoTable: A Multi-scale Spreadsheet Benchmark with Meta Operations for Table Reasoning
Zheng Li
|
Yang Du
|
Mao Zheng
|
Mingyang Song
Extensive research has been conducted to explore the capability of Large Language Models (LLMs) for table reasoning and has significantly improved the performance on existing benchmarks. However, tables and user questions in real-world applications are more complex and diverse, presenting an unignorable gap compared to the existing benchmarks. To fill the gap, we propose a Multi-scale spreadsheet benchmark with Meta operations for Table reasoning, named as MiMoTable. Specifically, MiMoTable incorporates two key features. First, the tables in MiMoTable are all spreadsheets used in real-world scenarios, which cover seven domains and contain different types. Second, we define a new criterion with six categories of meta operations for measuring the difficulty of each question in MiMoTable, simultaneously as a new perspective for measuring the difficulty of the existing benchmarks. Experimental results show that Claude-3.5-Sonnet achieves the best performance with 77.4% accuracy, indicating that there is still significant room to improve for LLMs on MiMoTable. Furthermore, we grade the difficulty of existing benchmarks according to our new criteria. Experiments have shown that the performance of LLMs decreases as the difficulty of benchmarks increases, thereby proving the effectiveness of our proposed new criterion.
pdf
bib
abs
Implicit Discourse Relation Classification For Nigerian Pidgin
Muhammed Yahia Gaffar Saeed Saeed
|
Peter Bourgonje
|
Vera Demberg
Nigerian Pidgin (NP) is an English-based creole language spoken by nearly 100 million people across Nigeria, and is still low-resource in NLP. In particular, there are currently no available discourse parsing tools, which, if available, would have the potential to improve various downstream tasks. Our research focuses on implicit discourse relation classification (IDRC) for NP, a task which, even in English, is not easily solved by prompting LLMs, but requires supervised training. With this in mind, we have developed a framework for the task, which could also be used by researchers for other English-lexified languages. We systematically compare different approaches to the low-resource IDRC task: in one approach, we use English IDRC tools directly on the NP text as well as on their English translations (followed by a back-projection of labels). In another approach, we create a synthetic discourse corpus for NP, in which we automatically translate the English discourse-annotated corpus PDTB to NP, project PDTB labels, and then train an NP IDR classifier. The latter approach of training a “native” NP classifier outperforms our baseline by 13.27% and 33.98% in F1 score for 4-way and 11-way classification, respectively.
pdf
bib
abs
How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM
Shaoxiong Ji
|
Pinzhen Chen
Instruction tuning a large language model with multiple languages can prepare it for multilingual downstream tasks. Nonetheless, it is yet to be determined whether having a handful of languages is sufficient, or whether the benefits increase with the inclusion of more. By fine-tuning large multilingual models on 1 to 52 languages, we present a case study on BLOOM to understand three pertinent factors affecting performance: the number of languages, language exposure, and similarity between training and test languages. Overall we found that 1) expanding language coverage in multilingual instruction tuning proves to be beneficial; 2) accuracy often receives a significant boost if the test language appears in the instruction mixture; 3) languages’ genetic features correlate with cross-lingual transfer more strongly than the mere number of languages, though different languages benefit to varying degrees.
pdf
bib
abs
Gradient Inversion Attack in Federated Learning: Exposing Text Data through Discrete Optimization
Ying Gao
|
Yuxin Xie
|
Huanghao Deng
|
Zukun Zhu
Federated learning has emerged as a potential solution to overcome the bottleneck posed by the near exhaustion of public text data in training large language models. It has been claimed that the strategy of exchanging gradients, rather than raw data, makes it possible to use text data that includes private information. Although recent studies demonstrate that data can be reconstructed from gradients, the threat to text data seems relatively small due to its sensitivity to even a few token errors. However, we propose a novel attack method, FET, indicating that it is possible to Fully Expose Text data from gradients. Unlike previous methods that optimize continuous embedding vectors, we directly search for a text sequence whose gradients match the known gradients. First, we infer the total number of tokens and the unique tokens in the target text data from the gradients of the embedding layer. Then we develop a discrete optimization algorithm, which globally explores the solution space and precisely refines the obtained solution, incorporating both global and local search strategies. We also find that the gradients of the fully connected layer are dominant, providing sufficient guidance for the optimization process. Our experiments show a significant improvement in attack performance, with an average increase of 39% for TinyBERT-6, 20% for BERT-base and 15% for BERT-large in exact match rates across three datasets. These findings highlight serious privacy risks in text data, suggesting that using smaller models is not an effective privacy-preserving strategy.
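As a rough illustration of the first step described above (inferring which tokens occur in the private text from the embedding-layer gradients), the sketch below uses a toy PyTorch embedding; the model, shapes, and threshold are illustrative assumptions, and the paper's discrete search over token orderings is not reproduced here.

```python
# Hypothetical sketch: inferring which tokens appear in private text from
# the gradient of the embedding layer (the observation FET builds on).
import torch
import torch.nn as nn

vocab_size, dim = 100, 16
emb = nn.Embedding(vocab_size, dim)
private_ids = torch.tensor([[5, 17, 17, 42]])    # the "private" token sequence

# Toy forward/backward pass producing an observed gradient.
loss = emb(private_ids).sum()
loss.backward()
grad = emb.weight.grad                           # shape: (vocab_size, dim)

# Rows with nonzero gradient reveal the unique tokens used in the batch.
leaked_tokens = torch.nonzero(grad.abs().sum(dim=1) > 0).flatten()
print(leaked_tokens.tolist())                    # -> [5, 17, 42]
```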
pdf
bib
abs
Simulating Dual-Process Thinking in Dialogue Topic Shift Detection
Huiyao Wang
|
Peifeng Li
|
Yaxin Fan
|
Qiaoming Zhu
Previous work on dialogue topic shift detection has primarily focused on shallow local reasoning, overlooking the importance of considering the global historical structure and local details to elucidate the underlying causes of topic shift. To address these issues, we introduce dual-process theory to this task and design a novel Dual-Module Framework DMF (i.e., an intuition module and a reasoning module) for dialogue topic shift detection to emulate this cognitive process. Specifically, the intuition module employs Large Language Models (LLMs) to extract and store the global topic structure of historical dialogue, while the reasoning module introduces an LLM to generate reasoning samples between the response and the most recent topic of historical dialogue, thereby providing local detail explanations for topic shift. Moreover, we distill the dual-module framework into a small generative model to facilitate more precise reasoning. The experimental results on three public datasets show that our DMF outperforms the state-of-the-art baselines.
pdf
bib
abs
A Compliance Checking Framework Based on Retrieval Augmented Generation
Jingyun Sun
|
Zhongze Luo
|
Yang Li
Text-based compliance checking aims to verify whether a company’s business processes comply with laws, regulations, and industry standards using NLP techniques. Existing methods can be divided into two categories: logic-based methods offer the advantage of precise and reliable reasoning processes but lack flexibility, while semantic embedding methods are more generalizable but may lose structured information and lack logical coherence. To combine the strengths of both approaches, we propose a compliance checking framework based on Retrieval-Augmented Generation (RAG). This framework includes a static layer for storing factual knowledge, a dynamic layer for storing regulatory and business process information, and a computational layer for retrieval and reasoning. We employ an eventic graph to structurally describe regulatory information, as we recognize that the knowledge in regulatory documents is centered not on entities but on actions and states. We conducted experiments on Chinese and English compliance checking datasets. The results demonstrate that our framework consistently achieves state-of-the-art results across various scenarios, surpassing other baselines.
pdf
bib
abs
MIDLM: Multi-Intent Detection with Bidirectional Large Language Models
Shangjian Yin
|
Peijie Huang
|
Yuhong Xu
Decoder-only Large Language Models (LLMs) have demonstrated exceptional performance in language generation, exhibiting broad capabilities across various tasks. However, the application to label-sensitive language understanding tasks remains challenging due to the limitations of their autoregressive architecture, which restricts the sharing of token information within a sentence. In this paper, we address the Multi-Intent Detection (MID) task and introduce MIDLM, a bidirectional LLM framework that incorporates intent number detection and multi-intent selection. This framework allows autoregressive LLMs to leverage bidirectional information awareness through post-training, eliminating the need for training the models from scratch. Comprehensive evaluations across 8 datasets show that MIDLM consistently outperforms both existing vanilla models and pretrained baselines, demonstrating its superior performance in the MID task.
pdf
bib
abs
ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models
Chenyang Song
|
Xu Han
|
Zhengyan Zhang
|
Shengding Hu
|
Xiyu Shi
|
Kuai Li
|
Chen Chen
|
Zhiyuan Liu
|
Guangli Li
|
Tao Yang
|
Maosong Sun
Activation sparsity refers to the existence of considerable weakly-contributed elements among activation outputs, serving as a promising paradigm for accelerating model inference. Nevertheless, most large language models (LLMs) adopt activation functions without intrinsic activation sparsity (e.g., GELU and Swish). Some recent efforts have explored introducing ReLU or its variants as the substitutive activation function to pursue activation sparsity and acceleration, but few can simultaneously obtain high activation sparsity and comparable model performance. This paper introduces a simple and effective method named “ProSparse” to sparsify LLMs while achieving both targets. Specifically, after introducing ReLU activation, ProSparse adopts progressive sparsity regularization with a factor that increases smoothly over multiple stages. This can enhance activation sparsity and mitigate performance degradation by avoiding radical shifts in activation distributions. With ProSparse, we obtain high sparsity of 89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size MiniCPM-1B, respectively, with comparable performance to their original Swish-activated versions. These are the most sparsely activated models among open-source LLaMA versions and competitive end-size models. Inference acceleration experiments further demonstrate the significant practical acceleration potential of LLMs with higher activation sparsity, obtaining up to 4.52x inference speedup.
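A minimal sketch of the progressive sparsity regularization idea: an L1-style penalty on ReLU activations whose coefficient increases smoothly across training stages. The stage boundaries, coefficient values, and penalty placement are illustrative assumptions, not the paper's exact schedule.

```python
# Hypothetical sketch of progressive sparsity regularization on ReLU activations.
import torch

def l1_factor(step, stage_bounds=(1000, 3000, 6000), peaks=(0.0, 1e-5, 5e-5, 1e-4)):
    # Linearly increase the L1 coefficient within each stage (values are assumptions).
    prev = 0
    for bound, lo, hi in zip(stage_bounds, peaks[:-1], peaks[1:]):
        if step < bound:
            frac = (step - prev) / (bound - prev)
            return lo + frac * (hi - lo)
        prev = bound
    return peaks[-1]

def sparsity_penalty(relu_activations, step):
    # Mean absolute activation, scaled by the stage-dependent factor;
    # added to the language modeling loss during training.
    return l1_factor(step) * relu_activations.abs().mean()
```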
pdf
bib
abs
Reasoning-Oriented and Analogy-Based Methods for Locating and Editing in Zero-Shot Event-Relational Reasoning
Jingyao Tang
|
Lishuang Li
|
Liteng Mi
|
Haiming Wu
|
Hongbin Lu
Zero-shot event-relational reasoning is an important task in natural language processing, and existing methods jointly learn a variety of event-relational prefixes and inference-form prefixes to achieve such tasks. However, training prefixes consumes large computational resources and lacks interpretability. Additionally, learning various relational and inferential knowledge inefficiently exploits the connections between tasks. Therefore, we first propose a method for Reasoning-Oriented Locating and Editing (ROLE), which locates and edits the key modules of the language model for reasoning about event relations, enhancing interpretability and also resource-efficiently optimizing the reasoning ability. Subsequently, we propose a method for Analogy-Based Locating and Editing (ABLE), which efficiently exploits the similarities and differences between tasks to optimize the zero-shot reasoning capability. Experimental results show that ROLE improves interpretability and reasoning performance with reduced computational cost. ABLE achieves SOTA results in zero-shot reasoning.
pdf
bib
abs
Leveraging Language Models for Summarizing Mental State Examinations: A Comprehensive Evaluation and Dataset Release
Nilesh Kumar Sahu
|
Manjeet Yadav
|
Mudita Chaturvedi
|
Snehil Gupta
|
Haroon R. Lone
Mental health disorders affect a significant portion of the global population, with diagnoses primarily conducted through Mental State Examinations (MSEs). MSEs serve as structured assessments to evaluate behavioral and cognitive functioning across various domains, aiding mental health professionals in diagnosis and treatment monitoring. However, in developing countries, access to mental health support is limited, leading to an overwhelming demand for mental health professionals. Resident doctors often conduct initial patient assessments and create summaries for senior doctors, but their availability is constrained, resulting in extended patient wait times. This study addresses the challenge of generating concise summaries from MSEs through the evaluation of various language models. Given the scarcity of relevant mental health conversation datasets, we developed a 12-item descriptive MSE questionnaire and collected responses from 405 participants, resulting in 9720 utterances covering diverse mental health aspects. Subsequently, we assessed the performance of five well-known pre-trained summarization models, both with and without fine-tuning, for summarizing MSEs. Our comprehensive evaluation, leveraging metrics such as ROUGE, SummaC, and human evaluation, demonstrates that language models can generate automated coherent MSE summaries for doctors. With this paper, we release our collected conversational dataset and trained models publicly for the mental health research community.
pdf
bib
abs
Oddballness: universal anomaly detection with language models
Filip Gralinski
|
Ryszard Staruch
|
Krzysztof Jurkiewicz
We present a new method to detect anomalies in texts (in general: in sequences of any data), using language models, in a totally unsupervised manner. The method considers probabilities (likelihoods) generated by a language model, but instead of focusing on low-likelihood tokens, it considers a new metric defined in this paper: oddballness. Oddballness measures how “strange” a given token is according to the language model. We demonstrate in grammatical error detection tasks (a specific case of text anomaly detection) that oddballness is better than just considering low-likelihood events, if a totally unsupervised setup is assumed.
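The sketch below shows only the setup such a detector relies on, assuming a Hugging Face causal LM: per-token probabilities under the model, scored with the plain low-likelihood baseline the paper compares against. The oddballness metric itself is defined in the paper and is not reproduced here; the model choice and threshold are assumptions.

```python
# Hypothetical sketch: per-token probabilities from a causal LM, scored with the
# plain low-likelihood baseline that oddballness is compared against.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "She go to the store yesterday ."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    probs = model(ids).logits.softmax(-1)          # (1, seq_len, vocab)

# Probability the model assigns to each actual token given its left context.
token_probs = probs[0, :-1].gather(1, ids[0, 1:, None]).squeeze(1)

# Low-likelihood baseline: flag tokens whose probability falls below a threshold.
# (Oddballness would replace this scoring rule with the metric defined in the paper.)
for token, p in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), token_probs):
    print(f"{token:>12}  p={p.item():.4f}  flagged={p.item() < 1e-3}")
```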
pdf
bib
abs
CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models
Zhongzhi Li
|
Ming-Liang Zhang
|
Pei-Jie Wang
|
Jian Xu
|
Rui-Song Zhang
|
Yin Fei
|
Zhi-Long Ji
|
Jin-Feng Bai
|
Zhen-Ru Pan
|
Jiaxin Zhang
|
Cheng-Lin Liu
With the rapid advancements in multimodal large language models, evaluating their multimodal mathematical capabilities continues to receive wide attention. Although datasets such as MathVista have been introduced for evaluating mathematical capabilities in multimodal scenarios, there remains a lack of evaluation tools and datasets tailored for fine-grained assessment in Chinese K12 education. To systematically evaluate the ability of multimodal large models to solve Chinese multimodal mathematical problems, we propose a Chinese Multi-modal Math Skill Evaluation Benchmark (CMMaTH), containing 23,856 multimodal K12 math-related questions, making it the largest Chinese multimodal mathematical problem benchmark to date. CMMaTH includes questions ranging from elementary to high school levels, offering greater diversity in problem types, solution goals, visual elements, detailed knowledge points, and standard solution annotations. To facilitate stable, fast, and cost-free model evaluation, we have developed an open-source tool called GradeGPT, which is integrated with the CMMaTH dataset. Our data and code are available at
https://github.com/zzli2022/CMMaTH.
pdf
bib
abs
Efficient Tool Use with Chain-of-Abstraction Reasoning
Silin Gao
|
Jane Dwivedi-Yu
|
Ping Yu
|
Xiaoqing Ellen Tan
|
Ramakanth Pasunuru
|
Olga Golovneva
|
Koustuv Sinha
|
Asli Celikyilmaz
|
Antoine Bosselut
|
Tianlu Wang
To achieve faithful reasoning that aligns with human expectations, large language models (LLMs) need to ground their reasoning to real-world knowledge (e.g., web facts, math and physical rules). Tools help LLMs access this external knowledge, but there remain challenges for fine-tuning LLM agents (e.g., Toolformer) to invoke tools in multi-step reasoning problems, where inter-connected tool calls require holistic and efficient tool usage planning. In this work, we propose a new method for LLMs to better leverage tools in multi-step reasoning. Our method, Chain-of-Abstraction (CoA), trains LLMs to first decode reasoning chains with abstract placeholders, and then call domain tools to reify each reasoning chain by filling in specific knowledge. This planning with abstract chains enables LLMs to learn more general reasoning strategies, which are robust to shifts of domain knowledge (e.g., math results) relevant to different reasoning questions. It also allows LLMs to perform decoding and calling of external tools in parallel, which avoids the inference delay caused by waiting for tool responses. In mathematical reasoning and Wiki QA domains, we show that our method consistently outperforms previous chain-of-thought and tool-augmented baselines on both in-distribution and out-of-distribution test sets, with an average ~6% absolute QA accuracy improvement. LLM agents trained with our method also show more efficient tool use, with inference speed being on average ~1.4x faster than baseline tool-augmented LLMs.
pdf
bib
abs
Enhancing Arabic NLP Tasks through Character-Level Models and Data Augmentation
Mohanad Mohamed
|
Sadam Al-Azani
This study introduces a character-level approach specifically designed for Arabic NLP tasks, offering a novel and highly effective solution to the unique challenges inherent in Arabic language processing. It presents a thorough comparative study of various character-level models, including Convolutional Neural Networks (CNNs), pre-trained transformers (CANINE), and Bidirectional Long Short-Term Memory networks (BiLSTMs), assessing their performance and exploring the impact of different data augmentation techniques on enhancing their effectiveness. Additionally, it introduces two innovative Arabic-specific data augmentation methods—vowel deletion and style transfer—and rigorously evaluates their effectiveness. The proposed approach was evaluated on an Arabic privacy policy classification task as a case study, demonstrating significant improvements in model performance, reporting a micro-averaged F1-score of 93.8%, surpassing state-of-the-art models.
pdf
bib
abs
The Gaps between Fine Tuning and In-context Learning in Bias Evaluation and Debiasing
Masahiro Kaneko
|
Danushka Bollegala
|
Timothy Baldwin
The output tendencies of pre-trained language models (PLMs) vary markedly before and after fine-tuning (FT) due to the updates to the model parameters. These divergences in output tendencies result in a gap in the social biases of PLMs. For example, there exists a low correlation between the intrinsic bias scores of a PLM and its extrinsic bias scores under FT-based debiasing methods. Additionally, applying FT-based debiasing methods to a PLM leads to a decline in performance on downstream tasks. On the other hand, PLMs trained on large datasets can learn without parameter updates via in-context learning (ICL) using prompts. ICL induces smaller changes to PLMs compared to FT-based debiasing methods. Therefore, we hypothesize that the gap observed between pre-trained and FT models does not hold true for debiasing methods that use ICL. In this study, we demonstrate that ICL-based debiasing methods show a higher correlation between intrinsic and extrinsic bias scores compared to FT-based methods. Moreover, the performance degradation due to debiasing is also lower in the ICL case compared to that in the FT case.
pdf
bib
abs
LLM Sensitivity Challenges in Abusive Language Detection: Instruction-Tuned vs. Human Feedback
Yaqi Zhang
|
Viktor Hangya
|
Alexander Fraser
The capacity of large language models (LLMs) to understand and distinguish socially unacceptable texts enables them to play a promising role in abusive language detection. However, various factors can affect their sensitivity. In this work, we test whether LLMs have an unintended bias in abusive language detection, i.e., whether they predict more or less of a given abusive class than expected in zero-shot settings. Our results show that instruction-tuned LLMs tend to under-predict positive classes, since datasets used for tuning are dominated by the negative class. On the contrary, models fine-tuned with human feedback tend to be overly sensitive. In an exploratory approach to mitigate these issues, we show that providing label frequency in the prompt helps mitigate the significant over-prediction.
pdf
bib
abs
Improving Automatic Grammatical Error Annotation for Chinese Through Linguistically-Informed Error Typology
Yang Gu
|
Zihao Huang
|
Min Zeng
|
Mengyang Qiu
|
Jungyeul Park
Comprehensive error annotation is essential for developing effective Grammatical Error Correction (GEC) systems and delivering meaningful feedback to learners. This paper introduces improvements to automatic grammatical error annotation for Chinese. Our refined framework addresses language-specific challenges that cause common spelling errors in Chinese, including pronunciation similarity, visual shape similarity, specialized participles, and word ordering. In a case study, we demonstrated our system’s ability to provide detailed feedback on 12-16% of all errors by identifying them under our new error typology, specific enough to uncover subtle differences in error patterns between L1 and L2 writings. In addition to improving automated feedback for writers, this work also highlights the value of incorporating language-specific features in NLP systems.
pdf
bib
abs
Bias Vector: Mitigating Biases in Language Models with Task Arithmetic Approach
Daiki Shirafuji
|
Makoto Takenaka
|
Shinya Taguchi
The use of language models (LMs) has increased considerably in recent years, and the biases and stereotypes in training data that are reflected in the LM outputs are causing social problems. In this paper, inspired by task arithmetic, we propose the “Bias Vector” method for the mitigation of these LM biases. The Bias Vector method does not require manually created debiasing data. The three main steps of our approach involve: (1) continually training the pre-trained LMs on biased data using masked language modeling; (2) constructing the Bias Vector as the difference between the weights of the biased LMs and those of pre-trained LMs; and (3) subtracting the Bias Vector from the weights of the pre-trained LMs for debiasing. We evaluated the Bias Vector method on the SEAT across three LMs and confirmed an average improvement of 0.177 points. We demonstrated that the Bias Vector method does not degrade the LM performance on downstream tasks in the GLUE benchmark. In addition, we examined the impact of scaling factors, which control the magnitudes of Bias Vectors, with effect sizes on the SEAT and conducted a comprehensive evaluation of our debiased LMs across both the SEAT and GLUE benchmarks.
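A minimal sketch of steps (2) and (3) as task arithmetic on model weights, assuming the biased and pre-trained models share an architecture; the scaling factor mirrors the scaled subtraction discussed above, and parameter names are illustrative.

```python
# Hypothetical sketch of the Bias Vector idea as task arithmetic on weights.
import torch

def build_bias_vector(biased_state, pretrained_state):
    # Step (2): Bias Vector = weights of the biased LM minus those of the pre-trained LM.
    return {k: biased_state[k] - pretrained_state[k] for k in pretrained_state}

def debias(pretrained_state, bias_vector, scale=1.0):
    # Step (3): subtract the (scaled) Bias Vector from the pre-trained weights.
    return {k: pretrained_state[k] - scale * bias_vector[k] for k in pretrained_state}

# Usage sketch:
# debiased = debias(base_model.state_dict(),
#                   build_bias_vector(biased_model.state_dict(), base_model.state_dict()),
#                   scale=0.5)
```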
pdf
bib
abs
Topology-of-Question-Decomposition: Enhancing Large Language Models with Information Retrieval for Knowledge-Intensive Tasks
Weijie Li
|
Jin Wang
|
Liang-Chih Yu
|
Xuejie Zhang
Large language models (LLMs) are increasingly deployed for general problem-solving across various domains yet remain constrained to chaining immediate reasoning steps and depending solely on parametric knowledge. Integrating an information retrieval system directly into the reasoning process of LLMs can improve answer accuracy but might disrupt the natural reasoning sequence. Consequently, LLMs may underperform in complex, knowledge-intensive tasks requiring multiple reasoning steps, extensive real-world knowledge, or critical initial decisions. To overcome these challenges, we introduce a novel framework, Topology-of-Question-Decomposition (ToQD), which activates retrieval only when necessary. Globally, ToQD guides LLMs in constructing a topology graph from the input question, each node representing a sub-question. Locally, ToQD employs self-verify inference to determine whether a sub-question should retrieve relevant documents, necessitate further decomposition, or directly provide an answer. Experiments demonstrate that ToQD achieves superior performance and robustness in complex, knowledge-intensive tasks, significantly enhancing system response efficiency.
pdf
bib
abs
t-HNE: A Text-guided Hierarchical Noise Eliminator for Multimodal Sentiment Analysis
Zuocheng Li
|
Lishuang Li
In the Multimodal Sentiment Analysis task, most existing approaches focus on extracting modality-consistent information from raw unimodal data and integrating it into multimodal representations for sentiment classification. However, these methods often assume that all modalities contribute equally to model performance, prioritizing the extraction and enhancement of consistent information, while overlooking the adverse effects of noise caused by modality inconsistency. In contrast to these approaches, this paper introduces a novel approach, the text-guided Hierarchical Noise Eliminator (t-HNE). This model consists of a two-stage denoising phase and a feature recovery phase. Firstly, textual information is injected into both visual and acoustic modalities using an attention mechanism, aiming to reduce intra-modality noise in the visual and acoustic representations. Secondly, it further mitigates inter-modality noise by maximizing the mutual information between textual representations and the respective visual and acoustic representations. Finally, to address the potential loss of modality-invariant information during denoising, the fused multimodal representation is refined through contrastive learning with each unimodal representation except the textual. Extensive experiments conducted on the CMU-MOSI and CMU-MOSEI datasets demonstrate the efficacy of our approach.
pdf
bib
abs
ALYMPICS: LLM Agents Meet Game Theory
Shaoguang Mao
|
Yuzhe Cai
|
Yan Xia
|
Wenshan Wu
|
Xun Wang
|
Fengyi Wang
|
Qiang Guan
|
Tao Ge
|
Furu Wei
Game theory is a branch of mathematics that studies strategic interactions among rational agents. We propose Alympics (Olympics for Agents), a systematic framework utilizing Large Language Model (LLM) agents for empirical game theory research. Alympics creates a versatile platform for studying complex game theory problems, bridging the gap between theoretical game theory and empirical investigations by providing a controlled environment for simulating human-like strategic interactions with LLM agents. In our pilot case study, the “Water Allocation Challenge”, we explore Alympics through a challenging strategic game focused on the multi-round auction of scarce survival resources. This study demonstrates the framework’s ability to qualitatively and quantitatively analyze game determinants, strategies, and outcomes. Additionally, we conduct a comprehensive human assessment and an in-depth evaluation of LLM agents in rational strategic decision-making scenarios. Our findings highlight LLM agents’ potential to advance game theory knowledge and expand the understanding of their proficiency in emulating human strategic behavior.
pdf
bib
abs
Towards Adaptive Mechanism Activation in Language Agent
Ziyang Huang
|
Jun Zhao
|
Kang Liu
A language agent can be endowed with different mechanisms for autonomous task accomplishment. Current agents typically rely on a fixed mechanism or a set of mechanisms activated in a predefined order, limiting their adaptation to varied potential task solution structures. To this end, this paper proposes Adaptive Language Agent Mechanism Activation Learning with Self-Exploration (ALAMA), which focuses on optimizing mechanism activation adaptability without reliance on expert models. Initially, it builds a harmonized agent framework (UniAct) to Unify different mechanisms via Actions. Then it leverages a training-efficient optimization method based on self-exploration to enable the UniAct to adaptively activate the appropriate mechanisms according to the potential characteristics of the task. Experimental results demonstrate significant improvements in downstream agent tasks, affirming the effectiveness of our approach in facilitating more dynamic and context-sensitive mechanism activation.
pdf
bib
abs
Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models
Xuanyu Lei
|
Zonghan Yang
|
Xinrui Chen
|
Peng Li
|
Yang Liu
State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated exceptional capabilities in vision-language tasks. Despite their advanced functionalities, the performances of LMMs are still limited in challenging scenarios that require complex reasoning with multiple levels of visual information. Existing prompting techniques for LMMs focus on either improving textual reasoning or leveraging tools for image preprocessing, lacking a simple and general visual prompting scheme to promote vision-language coordination in LMMs. In this work, we propose SCAFFOLD prompting that scaffolds coordinates to promote vision-language coordination. Specifically, SCAFFOLD overlays a dot matrix within the image as visual information anchors and leverages multi-dimensional coordinates as textual positional references. Extensive experiments on a wide range of challenging vision-language tasks demonstrate the superiority of SCAFFOLD over the textual Chain-of-Thought prompting.
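A minimal sketch of the visual side of such a prompting scheme, assuming PIL: a dot matrix with coordinate labels is drawn onto the image so the accompanying text prompt can refer to positions by coordinate. The grid size, colour, and label format are assumptions, not the paper's exact settings.

```python
# Hypothetical sketch: overlay a labelled dot matrix onto an image (PIL).
from PIL import Image, ImageDraw

def scaffold_overlay(image_path, rows=6, cols=6, radius=3):
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            # Evenly spaced anchor dots with (row, col) coordinate labels.
            x, y = j * w / (cols + 1), i * h / (rows + 1)
            draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill="red")
            draw.text((x + radius + 2, y - radius), f"({i},{j})", fill="red")
    return img  # pass alongside a text prompt that refers to the (row, col) coordinates
```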
pdf
bib
abs
Retrieval Augmented Instruction Tuning for Open NER with Large Language Models
Tingyu Xie
|
Jian Zhang
|
Yan Zhang
|
Yuanyuan Liang
|
Qi Li
|
Hongwei Wang
The strong capability of large language models (LLMs) has been applied to information extraction (IE) through either retrieval augmented prompting or instruction tuning (IT). However, the best way to incorporate information with LLMs for IE remains an open question. In this paper, we explore Retrieval Augmented Instruction Tuning (RA-IT) for IE, focusing on the task of open named entity recognition (NER). Specifically, for each training sample, we retrieve semantically similar examples from the training dataset as the context and prepend them to the input of the original instruction. To evaluate our RA-IT approach more thoroughly, we construct a Chinese IT dataset for open NER and evaluate RA-IT in both English and Chinese scenarios. Experimental results verify the effectiveness of RA-IT across various data sizes and in both English and Chinese scenarios. We also conduct thorough studies to explore the impacts of various retrieval strategies in the proposed RA-IT framework.
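A minimal sketch of assembling a retrieval-augmented instruction for one training sample, assuming sentence embeddings have already been computed; the prompt wording and field names are illustrative, not the paper's template.

```python
# Hypothetical sketch: retrieval-augmented instruction construction for open NER.
import numpy as np

def build_ra_it_prompt(sample_idx, embeddings, samples, instruction, k=3):
    # Cosine similarity of the current sample against all training samples.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb[sample_idx]
    neighbours = [i for i in np.argsort(-sims) if i != sample_idx][:k]

    # Prepend the retrieved examples as context before the original instruction.
    context = "\n".join(
        f"Example: {samples[i]['text']}\nEntities: {samples[i]['entities']}"
        for i in neighbours
    )
    return f"{context}\n\n{instruction}\nInput: {samples[sample_idx]['text']}"
```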
pdf
bib
abs
Rethinking Vocabulary Augmentation: Addressing the Challenges of Low-Resource Languages in Multilingual Models
Nankai Lin
|
Peijian Zeng
|
Weixiong Zheng
|
Shengyi Jiang
|
Dong Zhou
|
Aimin Yang
The performance of multilingual language models (MLLMs) is notably inferior for low-resource languages (LRL) compared to high-resource ones, primarily due to the limited available corpus during the pre-training phase. This inadequacy stems from the under-representation of low-resource language words in the subword vocabularies of MLLMs, leading to their misidentification as unknown or incorrectly concatenated subwords. Previous approaches are based on frequency sorting to select words for augmenting vocabularies. However, these methods overlook the fundamental disparities between model representation distributions and frequency distributions. To address this gap, we introduce a novel Entropy-Consistency Word Selection (ECWS) method, which integrates semantic and frequency metrics for vocabulary augmentation. Our results indicate an improvement in performance, supporting our approach as a viable means to enrich vocabularies inadequately represented in current MLLMs.
pdf
bib
abs
Hawkes based Representation Learning for Reasoning over Scale-free Community-structured Temporal Knowledge Graphs
Yuwei Du
|
Xinyue Liu
|
Wenxin Liang
|
Linlin Zong
|
Xianchao Zhang
Temporal knowledge graph (TKG) reasoning has become a hot topic due to its great value in many practical tasks. The key to TKG reasoning is modeling the structural information and evolutional patterns of the TKGs. While great efforts have been devoted to TKG reasoning, the structural and evolutional characteristics of real-world networks have not been considered. In the aspect of structure, real-world networks usually exhibit clear community structure and scale-free (long-tailed distribution) properties. In the aspect of evolution, the impact of an event decays as time elapses. In this paper, we propose a novel TKG reasoning model called Hawkes process-based Evolutional Representation Learning Network (HERLN), which learns structural information and evolutional patterns of a TKG simultaneously, considering the characteristics of real-world networks: community structure, scale-free properties, and temporal decay. First, we detect communities in the input TKG so that the encoder produces more similar intra-community embeddings. Second, we design a Hawkes process-based relational graph convolutional network to cope with the event impact-decaying phenomenon. Third, we design a conditional decoding method to alleviate biases towards frequent entities caused by the long-tailed distribution. Experimental results show that HERLN achieves significant improvements over the state-of-the-art models.
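The temporal-decay ingredient mentioned above can be illustrated with a simple exponential kernel in which the influence of a past event shrinks as time elapses; the decay form and rate here are assumptions, not the paper's exact Hawkes formulation.

```python
# Hypothetical sketch: Hawkes-style exponential decay of event influence.
import numpy as np

def event_weights(event_times, current_time, decay_rate=0.1):
    # Influence of each past event decays exponentially with elapsed time.
    elapsed = current_time - np.asarray(event_times, dtype=float)
    return np.exp(-decay_rate * elapsed)

# e.g. events at t = 1, 5, 9 viewed from t = 10 -> weights of roughly 0.41, 0.61, 0.90
```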
pdf
bib
abs
Intention Analysis Makes LLMs A Good Jailbreak Defender
Yuqi Zhang
|
Liang Ding
|
Lefei Zhang
|
Dacheng Tao
Aligning large language models (LLMs) with human values, particularly when facing complex and stealthy jailbreak attacks, presents a formidable challenge. Unfortunately, existing methods often overlook the intrinsic nature of jailbreaks, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis (IA). IA works by triggering LLMs’ inherent ability to self-correct and improve through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing final policy-aligned responses based on the first-round conversation. Notably, IA is an inference-only method and thus could enhance LLM safety without compromising their helpfulness. Extensive experiments on varying jailbreak benchmarks across a wide range of LLMs show that IA could consistently and significantly reduce the harmfulness in responses (an average 48.2% reduction in attack success rate). Encouragingly, with our IA, Vicuna-7B even outperforms GPT-3.5 regarding attack success rate. We empirically demonstrate that, to some extent, IA is robust to errors in generated intentions. Further analyses reveal the underlying principle of IA: suppressing LLMs’ tendency to follow jailbreak prompts, thereby enhancing safety.
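A minimal sketch of the two-stage, inference-only flow, assuming a hypothetical chat function that wraps an LLM call over a message list; the prompt wording is an assumption rather than the paper's exact templates.

```python
# Hypothetical sketch of the two-stage Intention Analysis (IA) flow.
def intention_analysis(user_input, chat):
    # Stage 1: ask the model to analyse the essential intention of the request.
    messages = [{"role": "user",
                 "content": f"Identify the essential intention behind this request:\n{user_input}"}]
    intention = chat(messages)

    # Stage 2: continue the same conversation and answer, conditioned on that analysis.
    messages += [
        {"role": "assistant", "content": intention},
        {"role": "user",
         "content": "Now respond to the original request, making sure the response is policy-aligned."},
    ]
    return chat(messages)
```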
pdf
bib
abs
Towards Understanding Multi-Task Learning (Generalization) of LLMs via Detecting and Exploring Task-Specific Neurons
Yongqi Leng
|
Deyi Xiong
While large language models (LLMs) have demonstrated superior multi-task capabilities, understanding the learning mechanisms behind this is still a challenging problem. In this paper, we attempt to understand such mechanisms from the perspective of neurons. Specifically, we detect task-sensitive neurons in LLMs via gradient attribution on task-specific data. Through extensive deactivation and fine-tuning experiments, we demonstrate that the detected neurons are highly correlated with the given task, which we term as task-specific neurons. With these identified task-specific neurons, we delve into two common problems in multi-task learning and continuous learning: Generalization and Catastrophic Forgetting. We find that the overlap of task-specific neurons is strongly associated with generalization and specialization across tasks. Interestingly, at certain layers of LLMs, there is a high similarity in the parameters of different task-specific neurons, and such similarity is highly correlated with the generalization performance. Inspired by these findings, we propose a neuron-level continuous fine-tuning method that only fine-tunes the current task-specific neurons during continuous learning, and extensive experiments demonstrate the effectiveness of the proposed method. Our study provides insights into the interpretability of LLMs in multi-task learning.
pdf
bib
abs
Do Large Language Models Mirror Cognitive Language Processing?
Yuqi Ren
|
Renren Jin
|
Tongxuan Zhang
|
Deyi Xiong
Large Language Models (LLMs) have demonstrated remarkable abilities in text comprehension and logical reasoning, indicating that the text representations learned by LLMs can facilitate their language processing capabilities. In neuroscience, brain cognitive processing signals are typically utilized to study human language processing. Therefore, it is natural to ask how well the text embeddings from LLMs align with the brain cognitive processing signals, and how training strategies affect the LLM-brain alignment. In this paper, we employ Representational Similarity Analysis (RSA) to measure the alignment between 23 mainstream LLMs and fMRI signals of the brain to evaluate how effectively LLMs simulate cognitive language processing. We empirically investigate the impact of various factors (e.g., pre-training data size, model scaling, alignment training, and prompts) on such LLM-brain alignment. Experimental results indicate that pre-training data size and model scaling are positively correlated with LLM-brain similarity, and alignment training can significantly improve LLM-brain similarity. Explicit prompts contribute to the consistency of LLMs with brain cognitive language processing, while nonsensical noisy prompts may attenuate such alignment. Additionally, the performance of a wide range of LLM evaluations (e.g., MMLU, Chatbot Arena) is highly correlated with the LLM-brain similarity.
pdf
bib
abs
SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration
Xin Guan
|
Nate Demchak
|
Saloni Gupta
|
Ze Wang
|
Ediz Ertekin Jr.
|
Adriano Koshiyama
|
Emre Kazim
|
Zekun Wu
The development of unbiased large language models is widely recognized as crucial, yet existing benchmarks fall short in detecting biases due to limited scope, contamination, and lack of a fairness baseline. SAGED(bias) is the first holistic benchmarking pipeline to address these problems. The pipeline encompasses five core stages: scraping materials, assembling benchmarks, generating responses, extracting numeric features, and diagnosing with disparity metrics. SAGED includes metrics for max disparity, such as impact ratio, and bias concentration, such as Max Z-scores. Noticing that metric tool bias and contextual bias in prompts can distort evaluation, SAGED implements counterfactual branching and baseline calibration for mitigation. For demonstration, we use SAGED on G20 Countries with popular 8b-level models including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we find that while Mistral and Qwen2 show lower max disparity and higher bias concentration than Gemma2 and Llama3.1, all models are notably biased against countries like Russia and (except for Qwen2) China. With further experiments in which models role-play U.S. presidents, we see bias amplify and shift in heterogeneous directions. Moreover, we see that Qwen2 and Mistral do not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more intensively than Biden and Harris, indicating role-playing performance bias in these models.
pdf
bib
abs
Learning to Reason via Self-Iterative Process Feedback for Small Language Models
Kaiyuan Chen
|
Jin Wang
|
Xuejie Zhang
Small language models (SLMs) are more efficient, cost-effective, and customizable than large language models (LLMs), though they often underperform in specific areas like reasoning. Past methods for enhancing SLMs’ reasoning, such as supervised fine-tuning and distillation, often depend on costly external signals, resulting in SLMs being overly confident with limited supervision signals, thus limiting their abilities. Therefore, this study enables SLMs to learn to reason from self-iterative feedback. Using odds ratio preference optimization (ORPO), we fine-tune and align SLMs with positive and negative signals generated by the models themselves. Additionally, we introduce process supervision for rewards in preference alignment by sampling-based inference simulation and process reward models. Compared to Supervised Fine-Tuning (SFT), our method improves the performance of Gemma-2B by 12.43 (Acc) on GSM8K and 3.95 (Pass@1) on MBPP. Furthermore, the proposed method also demonstrated superior out-of-domain generalization capabilities on MMLU_Math and HumanEval.
pdf
bib
abs
Rethinking-based Code Summarization with Chain of Comments
Liuwen Cao
|
Hongkui He
|
Hailin Huang
|
Jiexin Wang
|
Yi Cai
Automatic code summarization aims to generate concise natural language descriptions (summary) for source code, which can free software developers from the heavy burden of manual commenting and software maintenance. Existing methods focus on learning a direct mapping from pure code to summaries, overlooking the significant heterogeneity gap between code and summary. Moreover, existing methods lack a human-like re-check process to evaluate whether the generated summaries match well with the code. To address these two limitations, we introduce RBCoSum, a novel framework that incorporates the generated Chain Of Comments (COC) as auxiliary intermediate information for the model to bridge the gap between code and summaries. Also, we propose a rethinking process where a learned ranker trained on our constructed ranking dataset scores the extent of matching between the generated summary and the code, selecting the highest-scoring summary to achieve a re-check process. We conduct extensive experiments to evaluate our approach and compare it with other automatic code summarization models as well as multiple code Large Language Models (LLMs). The experimental results show that RBCoSum is effective and outperforms baselines by a large margin. The human evaluation also proves the summaries generated with RBCoSum are more natural, informative, useful, and truthful.
pdf
bib
abs
RGR-KBQA: Generating Logical Forms for Question Answering Using Knowledge-Graph-Enhanced Large Language Model
Tengfei Feng
|
Liang He
In the field of natural language processing, Knowledge Base Question Answering (KBQA) is a challenging task that involves accurately retrieving answers from structured knowledge. Existing methods often face issues when generating query statements using LLMs, as the knowledge introduced may be imprecise and the models themselves may exhibit hallucination problems, leading to low accuracy, particularly when dealing with complex questions. To address these challenges, we introduce a novel semantic parsing approach called RGR-KBQA, which adopts a Retrieve-Generate-Retrieve framework. The first retrieval step introduces factual knowledge from a knowledge graph to enhance the semantic understanding capabilities of LLMs, thereby improving the generation accuracy of logical forms. The second step uses a fine-tuned model to generate the logical form, and the final step involves unsupervised relation and entity retrieval to further enhance generation accuracy. These two retrieval steps help alleviate the hallucination problems inherent in LLMs. Experimental results show that RGR-KBQA demonstrates promising performance on CWQ and WebQSP datasets.
pdf
bib
abs
To Label or Not to Label: Hybrid Active Learning for Neural Machine Translation
Abdul Hameed Azeemi
|
Ihsan Ayyub Qazi
|
Agha Ali Raza
Active learning (AL) techniques reduce labeling costs for training neural machine translation (NMT) models by selecting smaller representative subsets from unlabeled data for annotation. Diversity sampling techniques select heterogeneous instances, while uncertainty sampling methods select instances with the highest model uncertainty. Both approaches have limitations - diversity methods may extract varied but trivial examples, while uncertainty sampling can yield repetitive, uninformative instances. To bridge this gap, we propose Hybrid Uncertainty and Diversity Sampling (HUDS), an AL strategy for domain adaptation in NMT that combines uncertainty and diversity for sentence selection. HUDS computes uncertainty scores for unlabeled sentences and subsequently stratifies them. It then clusters sentence embeddings within each stratum and computes diversity scores by distance to the centroid. A weighted hybrid score that combines uncertainty and diversity is then used to select the top instances for annotation in each AL iteration. Experiments on multi-domain German-English and French-English datasets demonstrate the better performance of HUDS over other strong AL baselines. We analyze the sentence selection with HUDS and show that it prioritizes diverse instances having high model uncertainty for annotation in early AL iterations.
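A minimal sketch of the selection step, assuming uncertainty scores and sentence embeddings are already computed as NumPy arrays; the number of strata, the clustering setup, and the weighting are illustrative assumptions rather than the paper's exact configuration.

```python
# Hypothetical sketch of Hybrid Uncertainty and Diversity Sampling (HUDS).
import numpy as np
from sklearn.cluster import KMeans

def huds_select(uncertainty, embeddings, budget, n_strata=5, lam=0.5):
    # 1) Stratify sentences by their uncertainty scores (quantile bins).
    edges = np.quantile(uncertainty, np.linspace(0, 1, n_strata + 1)[1:-1])
    strata = np.digitize(uncertainty, edges)

    diversity = np.zeros(len(uncertainty))
    for s in np.unique(strata):
        idx = np.where(strata == s)[0]
        # 2) Cluster embeddings within the stratum; diversity = distance to centroid.
        km = KMeans(n_clusters=min(8, len(idx)), n_init=10).fit(embeddings[idx])
        centroids = km.cluster_centers_[km.labels_]
        diversity[idx] = np.linalg.norm(embeddings[idx] - centroids, axis=1)

    # 3) Weighted hybrid score (scores may need rescaling to comparable ranges).
    hybrid = lam * uncertainty + (1 - lam) * diversity
    return np.argsort(-hybrid)[:budget]   # indices of sentences to annotate
```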
pdf
bib
abs
LLM Sensitivity Evaluation Framework for Clinical Diagnosis
Chenwei Yan
|
Xiangling Fu
|
Yuxuan Xiong
|
Tianyi Wang
|
Siu Cheung Hui
|
Ji Wu
|
Xien Liu
Large language models (LLMs) have demonstrated impressive performance across various domains. However, for clinical diagnosis, higher expectations are required for LLM’s reliability and sensitivity: thinking like physicians and remaining sensitive to key medical information that affects diagnostic reasoning, as subtle variations can lead to different diagnosis results. Yet, existing works focus mainly on investigating the sensitivity of LLMs to irrelevant context and overlook the importance of key information. In this paper, we investigate the sensitivity of LLMs, i.e. GPT-3.5, GPT-4, Gemini, Claude3 and LLaMA2-7b, to key medical information by introducing different perturbation strategies. The evaluation results highlight the limitations of current LLMs in remaining sensitive to key medical information for diagnostic decision-making. The evolution of LLMs must focus on improving their reliability, enhancing their ability to be sensitive to key information, and effectively utilizing this information. These improvements will enhance human trust in LLMs and facilitate their practical application in real-world scenarios. Our code and dataset are available at https://github.com/chenwei23333/DiagnosisQA.
pdf
bib
abs
Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models
Zijun Chen
|
Wenbo Hu
|
Guande He
|
Zhijie Deng
|
ZHeng ZHang
|
Richang Hong
Multimodal large language models (MLLMs) combine visual and textual data for tasks like image captioning and visual question answering. Proper uncertainty calibration is crucial but challenging for reliable use in areas like healthcare and autonomous driving. This paper investigates several MLLMs, focusing on their calibration across various scenarios, including before and after visual fine-tuning as well as before and after multimodal training of the base LLMs. We observed miscalibration in their performance, and at the same time, no significant differences in calibration across these scenarios. We also highlight differences in uncertainty between textual and visual inputs, as well as the impact of integrating these two types of information on uncertainty. To better understand MLLMs’ miscalibration and their ability to self-assess uncertainty, we developed the IDK (I don’t know) dataset, which is key for evaluating how they handle unknowns. Our findings reveal that MLLMs tend to give answers rather than admit uncertainty, but this self-assessment improves with prompt adjustments. Finally, to calibrate MLLMs and enhance model reliability, we propose techniques such as temperature scaling and iterative prompt optimization. Our results provide insights into improving MLLMs for effective and responsible deployment in multimodal applications.
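Of the calibration techniques mentioned, temperature scaling is the most standard; the sketch below fits a single temperature on held-out logits by minimizing negative log-likelihood. This illustrates the generic technique, not the paper's full procedure.

```python
# Minimal temperature scaling sketch: fit one scalar T on validation logits.
import torch

def fit_temperature(logits, labels, steps=200, lr=0.01):
    # logits: (N, num_classes) tensor, labels: (N,) tensor of class indices.
    log_t = torch.zeros(1, requires_grad=True)          # optimize log(T) so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Usage sketch: T = fit_temperature(val_logits, val_labels)
#               calibrated_probs = (test_logits / T).softmax(-1)
```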
pdf
bib
abs
Unifying Dual-Space Embedding for Entity Alignment via Contrastive Learning
Cunda Wang
|
Weihua Wang
|
Qiuyu Liang
|
Feilong Bao
|
Guanglai Gao
Entity alignment (EA) aims to match identical entities across different knowledge graphs (KGs). Graph neural network-based entity alignment methods have achieved promising results in Euclidean space. However, KGs often contain complex local and hierarchical structures, which are hard to represent in a single space. In this paper, we propose a novel method named UniEA, which unifies dual-space embedding to preserve the intrinsic structure of KGs. Specifically, we simultaneously learn graph structure embeddings in both Euclidean and hyperbolic spaces to maximize the consistency between embeddings in the two spaces. Moreover, we employ contrastive learning to mitigate the misalignment issues caused by similar entities, where embeddings of similar neighboring entities become too close. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in structure-based EA. Our code is available at https://github.com/wonderCS1213/UniEA.
pdf
bib
abs
Aspect-Based Sentiment Analysis with Syntax-Opinion-Sentiment Reasoning Chain
Rui Fan
|
Shu Li
|
Tingting He
|
Yu Liu
Despite the impressive capabilities of large language models (LLMs) in aspect-based sentiment analysis (ABSA), the role of syntactic information remains underexplored in LLMs. Syntactic structures are known to be crucial for capturing aspect-opinion relationships. To explore whether LLMs can effectively leverage syntactic information to improve ABSA performance, we propose a novel multi-step reasoning framework, the Syntax-Opinion-Sentiment Reasoning Chain (Syn-Chain). Syn-Chain sequentially analyzes syntactic dependencies, extracts opinions, and classifies sentiment. We introduce Syn-Chain into LLMs via zero-shot prompting, and results show that Syn-Chain significantly enhances ABSA performance, though smaller LLMs exhibit weaker performance. Furthermore, we enhance smaller LLMs via distillation using GPT-3.5-generated Syn-Chain responses, achieving state-of-the-art ABSA performance. Our findings highlight the importance of syntactic information for improving LLMs in ABSA and offer valuable insights for future research.
pdf
bib
abs
Reasoning with Trees: Faithful Question Answering over Knowledge Graph
Tiesunlong Shen
|
Jin Wang
|
Xuejie Zhang
|
Erik Cambria
Recent advancements in large language models (LLMs) have shown remarkable progress in reasoning capabilities, yet they still face challenges in complex, multi-step reasoning tasks. This study introduces Reasoning with Trees (RwT), a novel framework that synergistically integrates LLMs with knowledge graphs (KGs) to enhance reasoning performance and interpretability. RwT reformulates knowledge graph question answering (KGQA) as a discrete decision-making problem, leveraging Monte Carlo Tree Search (MCTS) to iteratively refine reasoning paths. This approach mirrors human-like reasoning by dynamically integrating the LLM’s internal knowledge with external KG information. We propose a real-data guided iteration technique to train an evaluation model that assesses action values, improving the efficiency of the MCTS process. Experimental results on two benchmark KGQA datasets demonstrate that RwT significantly outperforms existing state-of-the-art methods, with an average performance improvement of 9.81%. Notably, RwT achieves these improvements without requiring complete retraining of the LLM, offering a more efficient and adaptable approach to enhancing LLM reasoning capabilities.
pdf
bib
abs
Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective
Tianlong Li
|
Zhenghua Wang
|
Wenhao Liu
|
Muling Wu
|
Shihan Dou
|
Changze Lv
|
Xiaohua Wang
|
Xiaoqing Zheng
|
Xuanjing Huang
The recent surge in jailbreaking attacks has revealed significant vulnerabilities in Large Language Models (LLMs) when exposed to malicious inputs. While various defense strategies have been proposed to mitigate these threats, there has been limited research into the underlying mechanisms that make LLMs vulnerable to such attacks. In this study, we suggest that the self-safeguarding capability of LLMs is linked to specific activity patterns within their representation space. Although these patterns have little impact on the semantic content of the generated text, they play a crucial role in shaping LLM behavior under jailbreaking attacks. Our findings demonstrate that these patterns can be detected with just a few pairs of contrastive queries. Extensive experimentation shows that the robustness of LLMs against jailbreaking can be manipulated by weakening or strengthening these patterns. Further visual analysis provides additional evidence for our conclusions, providing new insights into the jailbreaking phenomenon. These findings highlight the importance of addressing the potential misuse of open-source LLMs within the community.
pdf
bib
abs
Lexicography Saves Lives (LSL): Automatically Translating Suicide-Related Language
Annika Marie Schoene
|
John E. Ortega
|
Rodolfo Joel Zevallos
|
Laura Haaber Ihle
Recent years have seen a marked increase in research that aims to identify or predict risk, intention or ideation of suicide. The majority of new tasks, datasets, language models and other resources focus on English and on suicide in the context of Western culture. However, suicide is a global issue, and reducing the suicide rate by 2030 is one of the key goals of the UN’s Sustainable Development Goals. Previous work has used English dictionaries related to suicide to translate into different target languages due to the lack of other available resources. Naturally, this leads to a variety of ethical tensions (e.g., linguistic misrepresentation), where discourse around suicide is not present in a particular culture or country. In this work, we introduce the ‘Lexicography Saves Lives Project’ to address this issue and make three distinct contributions. First, we outline ethical considerations and provide overview guidelines to mitigate harm in developing suicide-related resources. Next, we translate an existing dictionary related to suicidal ideation into 200 different languages and conduct human evaluations on a subset of translated dictionaries. Finally, we introduce a public website to make our resources available and enable community participation.
pdf
bib
abs
Enhancing Emotional Support Conversations: A Framework for Dynamic Knowledge Filtering and Persona Extraction
Jiawang Hao
|
Fang Kong
With the growing need for accessible emotional support, conversational agents are being used more frequently to provide empathetic and meaningful interactions. However, many existing dialogue models struggle to interpret user context accurately due to irrelevant or misclassified knowledge, limiting their effectiveness in real-world scenarios. To address this, we propose a new framework that dynamically filters relevant commonsense knowledge and extracts personalized information to improve empathetic dialogue generation. We evaluate our framework on the ESConv dataset using extensive automatic and human experiments. The results show that our approach outperforms other models across metrics, demonstrating better coherence, emotional understanding, and response relevance.
pdf
bib
abs
SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models
Huanxuan Liao
|
Shizhu He
|
Yupu Hao
|
Xiang Li
|
Yuanzhe Zhang
|
Jun Zhao
|
Kang Liu
Small Language Models (SLMs) are attracting attention due to the high computational demands and privacy concerns of Large Language Models (LLMs). Some studies fine-tune SLMs using Chains of Thought (CoT) data distilled from LLMs, aiming to enhance their reasoning ability. Furthermore, some CoT distillation methods introduce external symbolic knowledge into the generation process to improve the limited knowledge memory, reasoning ability and out-of-domain (OOD) generalization of SLMs. However, the introduction of symbolic knowledge increases computational overhead and introduces potential noise. In this paper, we introduce SKIntern, an innovative approach that empowers SLMs to internalize symbolic knowledge and few-shot examples gradually through a progressive fine-tuning process, guided by a predefined linear decay schedule under curriculum learning. By efficiently internalizing knowledge, SKIntern reduces computational overhead and speeds up the reasoning process by focusing solely on the question during inference. It outperforms state-of-the-art baselines by over 5%, while reducing inference costs (measured in FLOPs) by up to 4× across a wide range of SLMs in both in-domain (ID) and out-of-domain (OOD) tasks. Our code will be available at https://github.com/Xnhyacinth/SKIntern.
pdf
bib
abs
TermDiffuSum: A Term-guided Diffusion Model for Extractive Summarization of Legal Documents
Xiangyun Dong
|
Wei Li
|
Yuquan Le
|
Zhangyue Jiang
|
Junxi Zhong
|
Zhong Wang
Extractive summarization for legal documents aims to automatically extract key sentences from legal texts to form concise summaries. Recent studies have explored diffusion models for extractive summarization task, showcasing their remarkable capabilities. Despite these advancements, these models often fall short in effectively capturing and leveraging the specialized legal terminology crucial for accurate legal summarization. To address the limitation, this paper presents a novel term-guided diffusion model for extractive summarization of legal documents, named TermDiffuSum. It incorporates legal terminology into the diffusion model via a well-designed multifactor fusion noise weighting schedule, which allocates higher attention weight to sentences containing a higher concentration of legal terms during the diffusion process. Additionally, TermDiffuSum utilizes a re-ranking loss function to refine the model’s selection of more relevant summaries by leveraging the relationship between the candidate summaries generated by the diffusion process and the reference summaries. Experimental results on a self-constructed legal summarization dataset reveal that TermDiffuSum outperforms existing diffusion-based summarization models, achieving improvements of 3.10 in ROUGE-1, 2.84 in ROUGE-2, and 2.89 in ROUGE-L. To further validate the generalizability of TermDiffuSum, we conduct experiments on three public datasets from news and social media domains, with results affirming the scalability of our approach.
pdf
bib
abs
COF: Adaptive Chain of Feedback for Comparative Opinion Quintuple Extraction
Qingting Xu
|
Kaisong Song
|
Chaoqun Liu
|
Yangyang Kang
|
Xiabing Zhou
|
Jun Lin
|
Yu Hong
Comparative Opinion Quintuple Extraction (COQE) aims to extract all comparative sentiment quintuples from product review text. Each quintuple comprises five elements: subject, object, aspect, opinion and preference. With the rise of Large Language Models (LLMs), existing work primarily focuses on enhancing the performance of the COQE task through data augmentation, supervised fine-tuning and instruction tuning. Instead of the above pre-modeling and in-modeling design techniques, we focus on innovation in the post-processing. We introduce a model-unaware adaptive chain-of-feedback (COF) method from the perspective of inference feedback and extraction revision. This method comprises three core modules: dynamic example selection, self-critique and self-revision. By integrating LLMs, COF enables dynamic iterative self-optimization, making it applicable across different baselines. To validate the effectiveness of our approach, we utilize the outputs of two distinct baselines as inputs for COF: frozen-parameter few-shot learning and the SOTA supervised fine-tuned model. We evaluate our approach on three benchmarks: Camera, Car and Ele. Experimental results show that, compared to the few-shot learning method, our approach achieves F1 score improvements of 3.51%, 2.65% and 5.28% for exact matching on the respective datasets. Even more impressively, our method further boosts performance, surpassing the current SOTA results, with additional gains of 0.76%, 6.54%, and 2.36% across the three datasets.
pdf
bib
abs
MBA-RAG: a Bandit Approach for Adaptive Retrieval-Augmented Generation through Question Complexity
Xiaqiang Tang
|
Qiang Gao
|
Jian Li
|
Nan Du
|
Qi Li
|
Sihong Xie
Retrieval Augmented Generation (RAG) has proven to be highly effective in boosting the generative performance of language models in knowledge-intensive tasks. However, existing RAG frameworks either indiscriminately perform retrieval or rely on rigid single-label classifiers to select retrieval methods, leading to inefficiencies and suboptimal performance across queries of varying complexity. To address these challenges, we propose a reinforcement learning-based framework that dynamically selects the most suitable retrieval strategy based on query complexity. Our approach leverages a multi-armed bandit algorithm, which treats each retrieval method as a distinct “arm” and adapts the selection process by balancing exploration and exploitation. Additionally, we introduce a dynamic reward function that balances accuracy and efficiency, penalizing methods that require more retrieval steps, even if they lead to a correct result. Our method achieves new state-of-the-art results on multiple single-hop and multi-hop datasets while reducing retrieval costs. Our code is available at https://github.com/FUTUREEEEEE/MBA.
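To make the bandit formulation above concrete, here is a minimal sketch of epsilon-greedy strategy selection with an accuracy-minus-step-cost reward, in the spirit the abstract describes. The arm names, the penalty value, and the run_rag_pipeline call are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical retrieval strategies ("arms"); names are illustrative, not from the paper.
ARMS = ["no_retrieval", "single_step", "multi_step"]

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit over retrieval strategies."""
    def __init__(self, arms, epsilon=0.1):
        self.arms = arms
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}  # running mean reward per arm

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best-known arm.
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.arms, key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # incremental mean

def reward(correct, retrieval_steps, step_penalty=0.1):
    # Accuracy/efficiency trade-off: a correct answer earns 1.0 minus a per-step cost.
    return (1.0 if correct else 0.0) - step_penalty * retrieval_steps

# Usage: pick a strategy per query, run the RAG pipeline, then feed the reward back.
bandit = EpsilonGreedyBandit(ARMS)
arm = bandit.select()
# correct, steps = run_rag_pipeline(query, strategy=arm)   # hypothetical pipeline call
bandit.update(arm, reward(correct=True, retrieval_steps=2))
```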
pdf
bib
abs
Improvement in Sign Language Translation Using Text CTC Alignment
Sihan Tan
|
Taro Miyazaki
|
Nabeela Khan
|
Kazuhiro Nakadai
Current sign language translation (SLT) approaches often rely on gloss-based supervision with Connectionist Temporal Classification (CTC), limiting their ability to handle non-monotonic alignments between sign language video and spoken text. In this work, we propose a novel method combining joint CTC/Attention and transfer learning. The joint CTC/Attention introduces hierarchical encoding and integrates CTC with the attention mechanism during decoding, effectively managing both monotonic and non-monotonic alignments. Meanwhile, transfer learning helps bridge the modality gap between vision and language in SLT. Experimental results on two widely adopted benchmarks, RWTH-PHOENIX-Weather 2014 T and CSL-Daily, show that our method achieves results comparable to state-of-the-art and outperforms the pure-attention baseline. Additionally, this work opens a new door for future research into gloss-free SLT using text-based CTC alignment.
pdf
bib
abs
Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining
Zongru Wu
|
Pengzhou Cheng
|
Lingyong Fang
|
Zhuosheng Zhang
|
Gongshen Liu
Backdoor attacks remain significant security threats to generative large language models (LLMs). Since generative LLMs output sequences of high-dimensional token logits instead of low-dimensional classification logits, most existing backdoor defense methods designed for discriminative models like BERT are ineffective for generative LLMs. Inspired by the observed differences in learning behavior between backdoor and clean mapping in the frequency space, we transform the gradients of each training sample, which directly influence parameter updates, into the frequency space. Our findings reveal a distinct separation between the gradients of backdoor and clean samples in the frequency space. Based on this phenomenon, we propose Gradient Clustering in the Frequency Space for Backdoor Sample Filtering (GraCeFul), which leverages sample-wise gradients in the frequency space to effectively identify backdoor samples without retraining LLMs. Experimental results show that GraCeFul outperforms baselines significantly. Notably, GraCeFul exhibits remarkable computational efficiency, achieving nearly 100% recall and F1 scores in identifying backdoor samples, reducing the average success rate of various backdoor attacks to 0% with negligible drops in clean accuracy across multiple free-style question answering datasets. Additionally, GraCeFul generalizes to Llama-2 and Vicuna. The codes are publicly available at https://github.com/ZrW00/GraceFul.
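The following is a minimal sketch of the frequency-space gradient-clustering idea the abstract describes: per-sample gradients are moved into the frequency domain and split into two clusters, with the smaller cluster treated as suspected backdoor samples. The use of an FFT, KMeans, the number of retained frequency components, and the smaller-cluster heuristic are simplifying assumptions; the paper's exact transform and clustering choices may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

# grads: one flattened gradient vector per training sample (hypothetical input),
# e.g. gradients of the loss w.r.t. a chosen weight matrix, shape (n_samples, n_params).
def filter_suspicious_samples(grads: np.ndarray, n_low_freq: int = 256):
    # Move each gradient into the frequency domain and keep low-frequency components,
    # where backdoor vs. clean learning behaviour is assumed to separate.
    freq = np.abs(np.fft.rfft(grads, axis=1))[:, :n_low_freq]
    # Two-way clustering; the smaller cluster is treated as the suspected backdoor set.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(freq)
    sizes = np.bincount(labels, minlength=2)
    suspected = np.flatnonzero(labels == np.argmin(sizes))
    return suspected  # indices of samples to drop before (re)training

# Usage: drop the suspected indices from the fine-tuning set; the LLM itself is only
# used to compute per-sample gradients and is never retrained for the filtering step.
```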
pdf
bib
abs
MQM-Chat: Multidimensional Quality Metrics for Chat Translation
Yunmeng Li
|
Jun Suzuki
|
Makoto Morishita
|
Kaori Abe
|
Kentaro Inui
The complexities of chats, such as the stylized contents specific to source segments and dialogue consistency, pose significant challenges for machine translation. Recognizing the need for a precise evaluation metric to address the issues associated with chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat), which encompasses seven error types, including three specifically designed for chat translations: ambiguity and disambiguation, buzzword or loanword issues, and dialogue inconsistency. In this study, human annotations were applied to the translations of chat data generated by five translation models. Based on the error distribution of MQM-Chat and the performance of relabeling errors into chat-specific types, we concluded that MQM-Chat effectively classified the errors while highlighting chat-specific issues explicitly. The results demonstrate that MQM-Chat can assess both the lexical accuracy and the semantic accuracy of translation models in chat translation tasks.
pdf
bib
abs
Intent Contrastive Learning Based on Multi-view Augmentation for Sequential Recommendation
Bo Pei
|
Yingzheng Zhu
|
Guangjin Wang
|
Huajuan Duan
|
Wenya Wu
|
Fuyong Xu
|
Yizhao Zhu
|
Peiyu Liu
|
Ran Lu
Sequential recommendation systems play a key role in modern information retrieval. However, existing intent-related work fails to adequately capture long-term dependencies in user behavior, i.e., the influence of early user behavior on current behavior, and also fails to effectively utilize item relevance. To this end, we propose a novel sequential recommendation framework, called ICMA, to overcome the above limitations. Specifically, we combine temporal variability with position encoding that has extrapolation properties to encode sequences, thereby expanding the model’s view of user behavior and capturing long-term user dependencies more effectively. Additionally, we design a multi-view data augmentation method: building on random augmentation operations (e.g., crop, mask, and reorder), we further introduce insertion and substitution operations that exploit item relevance to augment the sequence data from different views. Within this framework, clustering is performed to learn intent distributions, and these learned intents are integrated into the sequential recommendation model via contrastive SSL, which maximizes consistency between sequence views and their corresponding intents. The training process alternates between the Expectation (E) step and the Maximization (M) step. Experiments on three real datasets show that our approach improves by 0.8% to 14.7% over most baselines.
pdf
bib
abs
Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
Siyuan Wang
|
Zhuohan Long
|
Zhihao Fan
|
Xuanjing Huang
|
Zhongyu Wei
This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs). We utilize a multi-agent system to reframe existing benchmark instances into new, high-confidence evolving instances that extend the original benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances that test LLMs against diverse queries and shortcut biases, and probe their problem-solving sub-abilities. With this framework, we extend datasets across general and specific tasks through various iterations. Experimental results show a performance decline in most LLMs against their original results under scalable and robust evaluations, offering a more accurate reflection of model capabilities alongside our fine-grained evaluation. Besides, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks. We hope this framework helps the research community continuously evolve benchmarks alongside LLM development.
pdf
bib
abs
Controlling Out-of-Domain Gaps in LLMs for Genre Classification and Generated Text Detection
Dmitri Roussinov
|
Serge Sharoff
|
Nadezhda Puchnina
This study demonstrates that the modern generation of Large Language Models (LLMs, such as GPT-4) suffers from the same out-of-domain (OOD) performance gap observed in prior research on pre-trained Language Models (PLMs, such as BERT). We demonstrate this across two non-topical classification tasks: (1) genre classification and (2) generated text detection. Our results show that when demonstration examples for In-Context Learning (ICL) come from one domain (e.g., travel) and the system is tested on another domain (e.g., history), classification performance declines significantly. To address this, we introduce a method that controls which predictive indicators are used and which are excluded during classification. For the two tasks studied here, this ensures that topical features are omitted, while the model is guided to focus on stylistic rather than content-based attributes. This approach reduces the OOD gap by up to 20 percentage points in a few-shot setup. Straightforward Chain-of-Thought (CoT) methods, used as the baseline, prove insufficient, while our approach consistently enhances domain transfer performance.
pdf
bib
abs
Finetuning LLMs for Comparative Assessment Tasks
Vatsal Raina
|
Adian Liusie
|
Mark Gales
Automated assessment in natural language generation is a challenging task. Instruction-tuned large language models (LLMs) have shown promise in reference-free evaluation, particularly through comparative assessment. However, the quadratic computational complexity of pairwise comparisons limits its scalability. To address this, efficient comparative assessment has been explored by applying comparative strategies on zero-shot LLM probabilities. We propose a framework for finetuning LLMs for comparative assessment to align the model’s output with the target distribution of comparative probabilities. By training on soft probabilities, our approach improves state-of-the-art performance while maintaining high performance with an efficient subset of comparisons.
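A minimal sketch of what training on soft comparative probabilities can look like: the finetuned model's pairwise logit is matched against a soft target probability with binary cross-entropy (equivalent, up to a constant, to a KL objective). The tensor values and the single-logit formulation are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def soft_comparison_loss(logits: torch.Tensor, target_probs: torch.Tensor) -> torch.Tensor:
    """
    logits: model scores for "response A is better than response B", shape (batch,).
    target_probs: soft target probabilities in [0, 1] for the same comparisons.
    Binary cross-entropy against soft targets; minimising it matches the model's
    comparative distribution to the target distribution.
    """
    return F.binary_cross_entropy_with_logits(logits, target_probs)

# Usage with hypothetical values: targets could come from aggregated zero-shot
# comparison probabilities rather than hard 0/1 preference labels.
logits = torch.tensor([1.2, -0.3, 0.7])
targets = torch.tensor([0.9, 0.4, 0.6])
loss = soft_comparison_loss(logits, targets)  # backpropagated through the LLM head in practice
```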
pdf
bib
abs
Hermit Kingdom Through the Lens of Multiple Perspectives: A Case Study of LLM Hallucination on North Korea
Eunjung Cho
|
Won Ik Cho
|
Soomin Seo
Hallucination in large language models (LLMs) remains a significant challenge for their safe deployment, particularly due to its potential to spread misinformation. Most existing solutions address this challenge by focusing on aligning the models with credible sources or by improving how models communicate their confidence (or lack thereof) in their outputs. While these measures may be effective in most contexts, they may fall short in scenarios requiring more nuanced approaches, especially in situations where access to accurate data is limited or determining credible sources is challenging. In this study, we take North Korea - a country characterised by an extreme lack of reliable sources and the prevalence of sensationalist falsehoods - as a case study. We explore and evaluate how some of the best-performing multilingual LLMs and specific language-based models generate information about North Korea in three languages spoken in countries with significant geo-political interests: English (United States, United Kingdom), Korean (South Korea), and Mandarin Chinese (China). Our findings reveal significant differences, suggesting that the choice of model and language can lead to vastly different understandings of North Korea, which has important implications given the global security challenges the country poses.
pdf
bib
abs
CycleOIE: A Low-Resource Training Framework For Open Information Extraction
Zhihong Jin
|
Chunhong Zhang
|
Zheng Hu
|
Jibin Yu
|
Ruiqi Ma
|
Qingyun Chen
|
Xiaohao Liao
|
Yanxing Zhang
Open Information Extraction (OpenIE) aims to extract structured information in the form of triples from unstructured text, serving as a foundation for various downstream NLP tasks. Despite the success of neural OpenIE models, their dependence on large-scale annotated datasets poses a challenge, particularly in low-resource settings. In this paper, we introduce a novel approach to address the low-resource OpenIE task through two key innovations: (1) we improve the quality of training data by curating small-scale, high-quality datasets annotated by a large language model (GPT-3.5), leveraging both OpenIE principles and few-shot examples to form LSOIE-g principles and LSOIE-g examples; (2) we propose CycleOIE, a training framework that maximizes data efficiency through a cycle-consistency mechanism, enabling the model to learn effectively from minimal data. Experimental results show that CycleOIE, when trained on only 2k+ instances, achieves comparable results to models trained on over 90k instances. Our contributions are further validated through extensive experiments, demonstrating the superior performance of CycleOIE and our curated LSOIE-g datasets in low-resource OpenIE as well as revealing the internal mechanisms of CycleOIE.
pdf
bib
abs
AHVE-CNER: Aligned Hanzi Visual Encoding Enhance Chinese Named Entity Recognition with Multi-Information
Xuhui Zheng
|
Zhiyuan Min
|
Bin Shi
|
Hao Wang
The integration of multi-modal information, especially the graphic features of Hanzi, is crucial for improving the performance of Chinese Named Entity Recognition (NER) tasks. However, existing glyph-based models frequently neglect the relationship between pictorial elements and radicals. This paper presents AHVE-CNER, a model that integrates multi-source visual and phonetic information of Hanzi, while explicitly aligning pictographic features with their corresponding radicals. We propose the Gated Pangu-𝜋 Cross Transformer to effectively facilitate the integration of these multi-modal representations. By leveraging a multi-source glyph alignment strategy, AHVE-CNER demonstrates an improved capability to capture the visual and semantic nuances of Hanzi for NER tasks. Extensive experiments on benchmark datasets validate that AHVE-CNER achieves superior performance compared to existing multi-modal Chinese NER methods. Additional ablation studies further confirm the effectiveness of our visual alignment module and the fusion approach.
pdf
bib
abs
Edit-Wise Preference Optimization for Grammatical Error Correction
Jiehao Liang
|
Haihui Yang
|
Shiping Gao
|
Xiaojun Quan
While large language models (LLMs) have achieved remarkable success in various natural language processing tasks, their strengths have yet to be fully demonstrated in grammatical error correction (GEC). This is partly due to the misalignment between their pre-training objectives and the GEC principle of making minimal edits. In this work, we aim to bridge this gap by introducing a novel method called Edit-wise Preference Optimization (EPO). By distinguishing the importance of different tokens and assigning higher reward weights to edit tokens during preference optimization, our method captures fine-grained distinctions in GEC that traditional preference learning often overlooks. Extensive experiments on both English and Chinese datasets show that our framework consistently outperforms strong baselines, achieving state-of-the-art performance and demonstrating the advantages of LLMs in GEC.
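As a rough illustration of the edit-wise weighting idea, the sketch below scores a corrected sequence while giving edit tokens a higher weight than copied tokens, the kind of per-token reweighting that could then be plugged into a DPO-style preference loss. The weighting scheme and values are assumptions for illustration, not the paper's exact objective.

```python
import torch

def edit_weighted_logprob(token_logprobs: torch.Tensor,
                          edit_mask: torch.Tensor,
                          edit_weight: float = 2.0) -> torch.Tensor:
    """
    token_logprobs: per-token log-probabilities of a corrected sequence, shape (seq_len,).
    edit_mask: 1.0 where the token differs from the source sentence (an "edit"), else 0.0.
    Returns a sequence score in which edit tokens count edit_weight times as much,
    so preference optimisation focuses on the minimal edits rather than copied text.
    """
    weights = 1.0 + (edit_weight - 1.0) * edit_mask
    return (weights * token_logprobs).sum() / weights.sum()

# Usage with toy values: three copied tokens and one edit token.
lp = torch.log(torch.tensor([0.9, 0.8, 0.2, 0.95]))
mask = torch.tensor([0.0, 0.0, 1.0, 0.0])
score_chosen = edit_weighted_logprob(lp, mask)
# In a DPO-style objective, this score would be compared against the same weighted
# score of a rejected (uncorrected or over-corrected) candidate.
```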
pdf
bib
abs
You Only Query Twice: Multimodal Rumor Detection via Evidential Evaluation from Dual Perspectives
Junyi Chen
|
Leyuan Liu
|
Tian Lan
|
Fan Zhou
|
Xiaosong Zhang
Current rumor detectors exhibit limitations in fully exploiting responses to the source tweet as essential public opinions, and in explaining and indicating the reliability of the results obtained. Additionally, the joint utilization of both responses and the multimodal source content for detection presents challenges due to the heterogeneous nature of the data points. In this work, to address the first challenge, we initially prompt the Large Language Model (LLM) with both multimodal source content and the corresponding response set to extract contrasting evidence to enable maximal utilization of informative responses. To overcome the second challenge, we introduce an uncertainty-aware evidential evaluator to assess the evidence intensity from the multimodal source content and dual-sided reasoning, from which the final prediction is derived. As we model the second-order probability, we can effectively indicate the model’s uncertainty (i.e., the reliability) of the results. The reasoning from the correct perspective also serves as a natural language-based explanation. To this end, the third challenge is also addressed as we fully leverage the available resources. Extensive experiments validate the effectiveness, uncertainty awareness in predictions, helpful explainability for human judgment, and superior efficiency of our approach compared to contemporary works utilizing LLMs.
pdf
bib
abs
On Evaluation Protocols for Data Augmentation in a Limited Data Scenario
Frédéric Piedboeuf
|
Philippe Langlais
Textual data augmentation (DA) is a prolific field of study where novel techniques to create artificial data are regularly proposed, and that has demonstrated great efficiency in small-data settings, at least for text classification tasks. In this paper, we challenge those results, showing that classical data augmentation (which modifies sentences) is simply a way of performing better fine-tuning, and that spending more time fine-tuning before applying data augmentation negates its effect. This is a significant contribution as it answers several questions that were left open in recent years, namely: which DA technique performs best (all of them, as long as they generate data close enough to the training set so as not to impair training) and why DA has shown positive results (it facilitates training of the network). We further show that zero- and few-shot DA via conversational agents such as ChatGPT or LLama2 can increase performance, confirming that this form of data augmentation is preferable to classical methods.
pdf
bib
abs
Context-Informed Machine Translation of Manga using Multimodal Large Language Models
Philip Lippmann
|
Konrad Skublicki
|
Joshua Tanner
|
Shonosuke Ishiwatari
|
Jie Yang
Due to the significant time and effort required for handcrafting translations, most manga never leave the domestic Japanese market. Automatic manga translation is a promising potential solution. However, it is a budding and underdeveloped field and presents complexities even greater than those found in standard translation due to the need to effectively incorporate visual elements into the translation process to resolve ambiguities. In this work, we investigate to what extent multimodal large language models (LLMs) can provide effective manga translation, thereby assisting manga authors and publishers in reaching wider audiences. Specifically, we propose a methodology that leverages the vision component of multimodal LLMs to improve translation quality, evaluate the impact of translation unit size and context length, and propose a token-efficient approach for manga translation. Moreover, we introduce a new evaluation dataset – the first parallel Japanese-Polish manga translation dataset – as part of a benchmark to be used in future research. Finally, we contribute an open-source software suite, enabling others to benchmark LLMs for manga translation. Our findings demonstrate that our proposed methods achieve state-of-the-art results for Japanese-English translation and set a new standard for Japanese-Polish.
pdf
bib
abs
Large Language Model as a Teacher for Zero-shot Tagging at Extreme Scales
Jinbin Zhang
|
Nasib Ullah
|
Rohit Babbar
Extreme Multi-label Text Classification (XMC) entails selecting the most relevant labels for an instance from a vast label set. Extreme Zero-shot XMC (EZ-XMC) extends this challenge by operating without annotated data, relying only on raw text instances and a predefined label set, making it particularly critical for addressing cold-start problems in large-scale recommendation and categorization systems. State-of-the-art methods, such as MACLR and RTS, leverage lightweight bi-encoders but rely on suboptimal pseudo labels for training, such as document titles (MACLR) or document segments (RTS), which may not align well with the intended tagging or categorization tasks. On the other hand, LLM-based approaches, like ICXML, achieve better label-instance alignment but are computationally expensive and impractical for real-world EZ-XMC applications due to their heavy inference costs. In this paper, we introduce LMTX (Large language Model as Teacher for eXtreme classification), a novel framework that bridges the gap between these two approaches. LMTX utilizes an LLM to identify high-quality pseudo labels during training, while employing a lightweight bi-encoder for efficient inference. This design eliminates the need for LLMs at inference time, offering the benefits of improved label alignment without sacrificing computational efficiency. Our approach achieves superior performance and efficiency over both LLM and non-LLM based approaches, establishing a new state-of-the-art in EZ-XMC.
pdf
bib
abs
NovAScore: A New Automated Metric for Evaluating Document Level Novelty
Lin Ai
|
Ziwei Gong
|
Harshsaiprasad Deshpande
|
Alexander Johnson
|
Emmy Phung
|
Ahmad Emami
|
Julia Hirschberg
The rapid expansion of online content has intensified the issue of information redundancy, underscoring the need for solutions that can identify genuinely new information. Despite this challenge, the research community has seen a decline in focus on novelty detection, particularly with the rise of large language models (LLMs). Additionally, previous approaches have relied heavily on human annotation, which is time-consuming, costly, and particularly challenging when annotators must compare a target document against a vast number of historical documents. In this work, we introduce NovAScore (Novelty Evaluation in Atomicity Score), an automated metric for evaluating document-level novelty. NovAScore aggregates the novelty and salience scores of atomic information, providing high interpretability and a detailed analysis of a document’s novelty. With its dynamic weight adjustment scheme, NovAScore offers enhanced flexibility and an additional dimension to assess both the novelty level and the importance of information within a document. Our experiments show that NovAScore strongly correlates with human judgments of novelty, achieving a 0.626 Point-Biserial correlation on the TAP-DLND 1.0 dataset and a 0.920 Pearson correlation on an internal human-annotated dataset.
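A toy sketch of the aggregation step: each atomic content unit carries a novelty flag and a salience flag (assumed to come from upstream extraction and comparison against historical documents), and salient atoms are upweighted when computing the document-level score. The flag representation and the weighting constant are illustrative assumptions, not the metric's exact definition.

```python
def novascore(atoms, salience_boost=2.0):
    """
    atoms: list of dicts with 'novel' (bool) and 'salient' (bool) flags for each
    atomic content unit of a document. Salient atoms receive a higher weight so that
    novel *and* important information dominates the document-level score.
    """
    if not atoms:
        return 0.0
    weights = [salience_boost if a["salient"] else 1.0 for a in atoms]
    scores = [1.0 if a["novel"] else 0.0 for a in atoms]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Usage with toy atoms:
doc = [{"novel": True, "salient": True},
       {"novel": False, "salient": False},
       {"novel": True, "salient": False}]
print(novascore(doc))  # weighted share of novel information, in [0, 1]
```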
pdf
bib
abs
HLU: Human Vs LLM Generated Text Detection Dataset for Urdu at Multiple Granularities
Iqra Ali
|
Jesse Atuhurra
|
Hidetaka Kamigaito
|
Taro Watanabe
The rise of large language models (LLMs) generating human-like text has raised concerns about misuse, especially in low-resource languages like Urdu. To address this gap, we introduce the HLU dataset, which consists of three datasets: Document, Paragraph, and Sentence level. The document-level dataset contains 1,014 instances of human-written and LLM-generated articles across 13 domains, while the paragraph and sentence-level datasets each contain 667 instances. We conducted both human and automatic evaluations. In the human evaluation, the average accuracy at the document level was 35%, while at the paragraph and sentence levels, accuracies were 75.68% and 88.45%, respectively. For automatic evaluation, we fine-tuned the XLM-RoBERTa model in both monolingual and multilingual settings, achieving consistent results in both. Additionally, we assessed the performance of GPT-4 and Claude 3 Opus using zero-shot prompting. Our experiments and evaluations indicate that distinguishing between human and machine-generated text is challenging for both humans and LLMs, marking a significant step in addressing this issue in Urdu.
pdf
bib
abs
Embedding Style Beyond Topics: Analyzing Dispersion Effects Across Different Language Models
Benjamin Icard
|
Evangelia Zve
|
Lila Sainero
|
Alice Breton
|
Jean-Gabriel Ganascia
This paper analyzes how writing style affects the dispersion of embedding vectors across multiple, state-of-the-art language models. While early transformer models primarily aligned with topic modeling, this study examines the role of writing style in shaping embedding spaces. Using a literary corpus that alternates between topics and styles, we compare the sensitivity of language models across French and English. By analyzing the particular impact of style on embedding dispersion, we aim to better understand how language models process stylistic information, contributing to their overall interpretability.
pdf
bib
abs
Evaluating the Capabilities of Large Language Models for Multi-label Emotion Understanding
Tadesse Destaw Belay
|
Israel Abebe Azime
|
Abinew Ali Ayele
|
Grigori Sidorov
|
Dietrich Klakow
|
Philip Slusallek
|
Olga Kolesnikova
|
Seid Muhie Yimam
Large Language Models (LLMs) show promising learning and reasoning abilities. Compared to other NLP tasks, multilingual and multi-label emotion evaluation tasks are under-explored in LLMs. In this paper, we present EthioEmo, a multi-label emotion classification dataset for four Ethiopian languages, namely, Amharic (amh), Afan Oromo (orm), Somali (som), and Tigrinya (tir). We perform extensive experiments with an additional English multi-label emotion dataset from SemEval 2018 Task 1. Our evaluation includes encoder-only, encoder-decoder, and decoder-only language models. We compare zero and few-shot approaches of LLMs to fine-tuning smaller language models. The results show that accurate multi-label emotion classification is still insufficient even for high-resource languages such as English, and there is a large gap between the performance of high-resource and low-resource languages. The results also show varying performance levels depending on the language and model type. EthioEmo is available publicly to further improve the understanding of emotions in language models and how people convey emotions through various languages.
pdf
bib
abs
Knowledge Graph Unlearning with Schema
Yang Xiao
|
Ruimeng Ye
|
Bo Hui
Graph unlearning emerges as a crucial step to eliminate the impact of deleted elements from a trained model. However, unlearning on the knowledge graph (KG) has not yet been extensively studied. We remark that KG unlearning is non-trivial because KG is distinctive from general graphs. In this paper, we first propose a new unlearning method based on schema for KG. Specifically, we update the representation of the deleted element’s neighborhood with an unlearning object that regulates the affinity between the affected neighborhood and the instances within the same schema. Second, we raise a new task: schema unlearning. Given a schema graph to be deleted, we remove all instances matching the pattern and make the trained model forget the removed instances. Last, we evaluate the proposed unlearning method on various KG embedding models with benchmark datasets. Our codes are available at https://github.com/NKUShaw/KGUnlearningBySchema.
pdf
bib
abs
Assessing the Human Likeness of AI-Generated Counterspeech
Xiaoying Song
|
Sujana Mamidisetty
|
Eduardo Blanco
|
Lingzi Hong
Counterspeech is a targeted response to counteract and challenge abusive or hateful content. It effectively curbs the spread of hatred and fosters constructive online communication. Previous studies have proposed different strategies for automatically generated counterspeech. Evaluations, however, focus on relevance, surface form, and other shallow linguistic characteristics. This paper investigates the human likeness of AI-generated counterspeech, a critical factor influencing effectiveness. We implement and evaluate several LLM-based generation strategies, and discover that AI-generated and human-written counterspeech can be easily distinguished by both simple classifiers and humans. Further, we reveal differences in linguistic characteristics, politeness, and specificity. The dataset used in this study is publicly available for further research.
pdf
bib
abs
Discarding the Crutches: Adaptive Parameter-Efficient Expert Meta-Learning for Continual Semantic Parsing
Ruiheng Liu
|
Jinyu Zhang
|
Yanqi Song
|
Yu Zhang
|
Bailong Yang
Continual Semantic Parsing (CSP) enables parsers to generate SQL from natural language questions in task streams, using minimal annotated data to handle dynamically evolving databases in real-world scenarios. Previous works often rely on replaying historical data, which poses privacy concerns. Recently, replay-free continual learning methods based on Parameter-Efficient Tuning (PET) have gained widespread attention. However, they often rely on ideal settings and initial task data, sacrificing the model’s generalization ability, which limits their applicability in real-world scenarios. To address this, we propose a novel Adaptive PET eXpert meta-learning (APEX) approach for CSP. First, SQL syntax guides the LLM to assist experts in adaptively warming up, ensuring better model initialization. Then, a dynamically expanding expert pool stores knowledge and explores the relationship between experts and instances. Finally, a selection/fusion inference strategy based on sample historical visibility promotes expert collaboration. Experiments on two CSP benchmarks show that our method achieves superior performance without data replay or ideal settings, effectively handling cold start scenarios and generalizing to unseen tasks, even surpassing performance upper bounds.
pdf
bib
abs
Improving Multilingual Sign Language Translation with Automatically Clustered Language Family Information
Ruiquan Zhang
|
Cong Hu
|
Pei Yu
|
Yidong Chen
Sign Language Translation (SLT) bridges the communication gap between deaf and hearing individuals by converting sign language videos into spoken language texts. While most SLT research has focused on bilingual translation models, the recent surge in interest has led to the exploration of Multilingual Sign Language Translation (MSLT). However, MSLT presents unique challenges due to the diversity of sign languages across nations. This diversity can lead to cross-linguistic conflicts and hinder translation accuracy. To exploit the similarity of actions and semantics between sign languages and thereby alleviate such conflicts, we propose a novel approach that leverages sign language families to improve MSLT performance. Sign languages were clustered into families automatically based on their language distribution in the MSLT network. We compare the results of our proposed family clustering method with the analysis conducted by sign language linguists and then train dedicated translation models for each family in the many-to-one translation scenario. Our experiments on the SP-10 dataset demonstrate that our approach can achieve a balance between translation accuracy and computational cost by regulating the number of language families.
pdf
bib
abs
Is Peer-Reviewing Worth the Effort?
Kenneth Ward Church
|
Raman Chandrasekar
|
John E. Ortega
|
Ibrahim Said Ahmad
How effective is peer-reviewing in identifying important papers? We treat this question as a forecasting task. Can we predict which papers will be highly cited in the future based on venue and “early returns” (citations soon after publication)? We show early returns are more predictive than venue. Finally, we end with a constructive suggestion to simplify reviewing.
pdf
bib
abs
OptiPrune: Effective Pruning Approach for Every Target Sparsity
Khang Nguyen Le
|
Ryo Sato
|
Dai Nakashima
|
Takeshi Suzuki
|
Minh Le Nguyen
Large language models (LLMs) have achieved notable success across various tasks but are hindered by their large size and high computational demands. Post-training pruning (PTP) offers a promising solution by reducing model size through parameter removal while preserving performance. However, current PTP methods perform optimally only within specific sparsity ranges. This paper presents two key findings: (1) Layerwise uniform sparsity is effective at low sparsity, while non-uniform sparsity excels at high levels; (2) Relative importance-based pruning works best at low sparsity, whereas Hessian-based weight reconstruction is superior at high sparsity. We design and conduct experiments to validate these findings. Based on these insights, we introduce OptiPrune, a robust pruning method effective across all sparsity levels. OptiPrune adapts non-uniform sparsity with adaptive deviation and employs a threshold to select the optimal pruning strategy. Empirical results across diverse datasets, architectures, and languages validate its performance and robustness. These findings provide valuable directions for future LLM pruning research. Our code and data are publicly available.
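The sketch below illustrates the paper's core observation in toy form: at low target sparsity, every layer is pruned uniformly by magnitude; past a switch point, a non-uniform allocation with a small deviation is used instead. The switch point, deviation schedule, and magnitude criterion are simplifying assumptions; the actual method additionally involves Hessian-based weight reconstruction at high sparsity, which is omitted here.

```python
import numpy as np

def prune_layer(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights of one layer at the given sparsity."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    thresh = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= thresh, 0.0, weights)

def optiprune_like(layers, target_sparsity, switch_point=0.5, deviation=0.1):
    """
    Toy strategy selection per target sparsity (threshold value is an assumption):
    low sparsity -> uniform layerwise pruning; high sparsity -> non-uniform ratios,
    letting some layers absorb more of the pruning budget than others.
    """
    n = len(layers)
    if target_sparsity <= switch_point:
        ratios = [target_sparsity] * n                    # uniform allocation
    else:
        offsets = np.linspace(-deviation, deviation, n)   # non-uniform with adaptive deviation
        ratios = np.clip(target_sparsity + offsets, 0.0, 0.99)
    return [prune_layer(w, r) for w, r in zip(layers, ratios)]

# Usage with random stand-in layers:
layers = [np.random.randn(64, 64) for _ in range(4)]
pruned = optiprune_like(layers, target_sparsity=0.7)
```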
pdf
bib
abs
ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary
Yutong Li
|
Lu Chen
|
Aiwei Liu
|
Kai Yu
|
Lijie Wen
The literature review is an indispensable step in the research process. It provides the benefit of comprehending the research problem and understanding the current research situation while conducting a comparative analysis of prior works. However, literature summarization is challenging and time-consuming. Previous LLM-based studies on literature review have mainly focused on the complete process, including literature retrieval, screening, and summarization. For the summarization step, however, simple CoT methods often lack the ability to provide an extensive comparative summary. In this work, we first focus on the independent literature summarization step and introduce ChatCite, an LLM agent with human workflow guidance for comparative literature summary. By mimicking the human workflow, this agent first extracts key elements from relevant literature and then generates summaries using a Reflective Incremental Mechanism. In order to better evaluate the quality of the generated summaries, we devised an LLM-based automatic evaluation metric, G-Score, with reference to the human evaluation criteria. The ChatCite agent outperformed other models across various dimensions in our experiments. The literature summaries generated by ChatCite can also be directly used for drafting literature reviews.
pdf
bib
abs
Paraphrase Makes Perfect: Leveraging Expression Paraphrase to Improve Implicit Sentiment Learning
Xia Li
|
Junlang Wang
|
Yongqiang Zheng
|
Yuan Chen
|
Yangjia Zheng
Existing implicit sentiment learning methods mainly focus on capturing implicit sentiment knowledge in isolation, paying little attention to the potential connection between implicit and explicit sentiment. From a linguistic perspective, implicit and explicit sentiment expressions are essentially similar when conveying the same sentiment polarity for a specific aspect. In this paper, we present an expression paraphrase strategy and a novel sentiment-consistent contrastive learning mechanism to learn the intrinsic connections between implicit and explicit sentiment expressions and integrate them into the model to enhance implicit sentiment learning. We perform extensive experiments on public datasets, and the results show the significant efficacy of our method on implicit sentiment analysis.
pdf
bib
abs
Not Every Metric is Equal: Cognitive Models for Predicting N400 and P600 Components During Reading Comprehension
Lavinia Salicchi
|
Yu-Yin Hsu
In recent years, numerous studies have sought to understand the cognitive dynamics underlying language processing by modeling reading times and ERP amplitudes using computational metrics like surprisal. In the present paper, we examine the predictive power of surprisal, entropy, and a novel metric based on semantic similarity for N400 and P600. Our experiments, conducted with Mandarin Chinese materials, revealed three key findings: 1) expectancy plays a primary role for N400; 2) P600 also reflects the cognitive effort required to evaluate linguistic input semantically; and 3) during the time window of interest, information uncertainty influences language processing the most. Our findings show how computational metrics that capture distinct cognitive dimensions can effectively address psycholinguistic questions.
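For readers unfamiliar with the metrics, the sketch below computes per-token surprisal and next-token entropy from a causal language model; the paper applies such metrics (plus a semantic-similarity one) to Mandarin materials, whereas this example uses GPT-2 on English purely for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative only: "gpt2" stands in for whichever LM is used to derive the metrics.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def surprisal_and_entropy(sentence: str):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    results = []
    for pos in range(1, ids.size(1)):             # predict token at pos from its prefix
        dist = log_probs[0, pos - 1]
        target = ids[0, pos]
        surprisal = -dist[target].item()                    # -log p(w | context)
        entropy = -(dist.exp() * dist).sum().item()         # uncertainty over next token
        results.append((tok.decode(target), surprisal, entropy))
    return results

for token, s, h in surprisal_and_entropy("The cat sat on the mat"):
    print(f"{token!r}: surprisal={s:.2f} nats, entropy={h:.2f} nats")
```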
pdf
bib
abs
Multilingual Supervision Improves Semantic Disambiguation of Adpositions
Wesley Scivetti
|
Lauren Levine
|
Nathan Schneider
Adpositions display a remarkable amount of ambiguity and flexibility in their meanings, and are used in different ways across languages. We conduct a systematic corpus-based cross-linguistic investigation into the lexical semantics of adpositions, utilizing SNACS (Schneider et al., 2018), an annotation framework with data available in several languages. Our investigation encompasses 5 of these languages: Chinese, English, Gujarati, Hindi, and Japanese. We find substantial distributional differences in adposition semantics, even in comparable corpora. We further train classifiers to disambiguate adpositions in each of our languages. Despite the cross-linguistic differences in adpositional usage, sharing annotated data across languages boosts overall disambiguation performance, leading to the highest published scores on this task for all 5 languages.
pdf
bib
abs
Empirical Study of Zero-shot Keyphrase Extraction with Large Language Models
Byungha Kang
|
Youhyun Shin
This study investigates the effectiveness of Large Language Models (LLMs) for zero-shot keyphrase extraction (KE). We propose and evaluate four prompting strategies: vanilla, role prompting, candidate-based prompting, and hybrid prompting. Experiments conducted on six widely-used KE benchmark datasets demonstrate that Llama3-8B-Instruct with vanilla prompting outperforms the state-of-the-art unsupervised method, PromptRank, by an average of 9.43%, 7.68%, and 4.82% in F1@5, F1@10, and F1@15, respectively. Hybrid prompting, which combines the strengths of vanilla and candidate-based prompting, further enhances overall performance. Moreover, role prompting, which assigns a task-related role to LLMs, consistently improves performance across various prompting strategies. We also explore the impact of model size and different LLM series: GPT-4o, Gemma2, and Qwen2. Results show that Llama3 and Gemma2 demonstrate the strongest zero-shot KE performance, with hybrid prompting consistently enhancing results across most LLMs. We hope this study provides insights to researchers exploring LLMs in KE tasks, as well as practical guidance for model selection in real-world applications. Our code is available at https://github.com/kangnlp/Zero-shot-KPE-with-LLMs.
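To give a feel for the four strategies, here are illustrative prompt builders for vanilla, role, candidate-based, and hybrid prompting; the wording is a guess at the general shape of such prompts, not the exact prompts used in the study.

```python
# Illustrative prompt builders; every string here is an assumption for demonstration.
def vanilla(doc):
    return f"Extract the keyphrases from the following document.\n\nDocument: {doc}\n\nKeyphrases:"

def role(doc):
    # Prepend a task-related persona before the vanilla instruction.
    return ("You are an expert indexer who assigns keyphrases to scientific articles.\n"
            + vanilla(doc))

def candidate_based(doc, candidates):
    # Candidates could come from a noun-phrase extractor run beforehand.
    cands = ", ".join(candidates)
    return (f"Document: {doc}\n\nCandidate phrases: {cands}\n\n"
            "Select the phrases that best summarise the document:")

def hybrid(doc, candidates):
    # Combine free generation with candidate selection; the two outputs are merged downstream.
    return vanilla(doc), candidate_based(doc, candidates)
```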
pdf
bib
abs
Investigating the Impact of Incremental Processing and Voice Activity Projection on Spoken Dialogue Systems
Yuya Chiba
|
Ryuichiro Higashinaka
The naturalness of responses in spoken dialogue systems has been significantly improved by the introduction of large language models (LLMs), although many challenges remain until human-like turn-taking can be achieved. A turn-taking model called Voice Activity Projection (VAP) is gaining attention because it can be trained in an unsupervised manner using the spoken dialogue data between two speakers. For such a turn-taking model to be fully effective, systems must initiate response generation as soon as a turn-shift is detected. This can be achieved by incremental response generation, which reduces the delay before the system responds. Incremental response generation is done using partial speech recognition results while user speech is incrementally processed. Combining incremental response generation with VAP-based turn-taking will enable spoken dialogue systems to achieve faster and more natural turn-taking. However, their effectiveness remains unclear because they have not yet been evaluated in real-world systems. In this study, we developed spoken dialogue systems that incorporate incremental response generation and VAP-based turn-taking and evaluated their impact on task success and dialogue satisfaction through user assessments.
pdf
bib
abs
Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation
Ruiyang Ren
|
Yuhao Wang
|
Yingqi Qu
|
Wayne Xin Zhao
|
Jing Liu
|
Hua Wu
|
Ji-Rong Wen
|
Haifeng Wang
Large language models (LLMs) have shown impressive prowess in solving a wide range of tasks with world knowledge. However, it remains unclear how well LLMs are able to perceive their factual knowledge boundaries, particularly under retrieval augmentation settings. In this study, we present the first analysis of the factual knowledge boundaries of LLMs and how retrieval augmentation affects LLMs on open-domain question answering (QA), yielding several important findings. Specifically, we focus on three research questions and analyze them by examining the QA, a priori judgement, and a posteriori judgement capabilities of LLMs. We show evidence that LLMs possess unwavering confidence in their knowledge and cannot handle the conflict between internal and external knowledge well. Furthermore, retrieval augmentation proves to be an effective approach in enhancing LLMs’ awareness of knowledge boundaries. We further conduct thorough experiments to examine how different factors affect LLMs and propose a simple method to dynamically utilize supporting documents with our judgement strategy. Additionally, we find that the relevance between the supporting documents and the questions significantly impacts LLMs’ QA and judgemental capabilities.
pdf
bib
abs
Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels
Chaoqun Liu
|
Qin Chao
|
Wenxuan Zhang
|
Xiaobao Wu
|
Boyang Li
|
Anh Tuan Luu
|
Lidong Bing
Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels. However, this paradigm is limited by the availability of gold labels, while in certain scenarios, LLMs may need to perform tasks that are too complex for humans to provide such labels. To tackle this challenge, this study explores whether solely utilizing unlabeled data can elicit strong model capabilities. We propose a new paradigm termed zero-to-strong generalization. We iteratively prompt LLMs to annotate unlabeled data and retain high-quality labels by filtering. Surprisingly, we observe that this iterative process gradually unlocks LLMs’ potential on downstream tasks. Our experiments on extensive classification and reasoning tasks confirm the effectiveness of our proposed framework. Our analysis indicates that this paradigm is effective for both in-context learning and fine-tuning, and for various model sizes.
pdf
bib
abs
Alternate Preference Optimization for Unlearning Factual Knowledge in Large Language Models
Anmol Reddy Mekala
|
Vineeth Dorna
|
Shreya Dubey
|
Abhishek Lalwani
|
David Koleczek
|
Mukund Rungta
|
Sadid A. Hasan
|
Elita A.A Lobo
Machine unlearning aims to efficiently eliminate the influence of specific training data, known as the forget set, from the model. However, existing unlearning methods for Large Language Models (LLMs) face a critical challenge: they rely solely on negative feedback to suppress responses related to the forget set, which often results in nonsensical or inconsistent outputs, diminishing model utility and posing potential privacy risks. To address this limitation, we propose a novel approach called Alternate Preference Optimization (AltPO), which combines negative feedback with in-domain positive feedback on the forget set. Additionally, we introduce new evaluation metrics to assess the quality of responses related to the forget set. Extensive experiments show that our approach not only enables effective unlearning but also avoids undesirable model behaviors while maintaining overall model performance.
pdf
bib
abs
Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models
Mingyang Song
|
Mao Zheng
|
Xuan Luo
Despite recent efforts to develop large language models with robust long-context capabilities, the lack of long-context benchmarks means that relatively little is known about their performance. To alleviate this gap, in this paper, we propose Counting-Stars, a multi-evidence, position-aware, and scalable benchmark designed to evaluate the multi-evidence retrieval capabilities of long-context LLMs. Counting-Stars comprises two counting-based multi-evidence retrieval tasks: searching and reasoning. Using Counting-Stars, we conducted experiments to evaluate several long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4, and Moonshot-v1. Extensive experimental results demonstrate that Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across various tasks. Furthermore, our analysis of these LLMs, which have been extended to handle long-context scenarios, indicates that significant room for improvement remains as the length of the input context and the complexity of the tasks increase.
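In the spirit of the benchmark, the toy construction below scatters several numbered pieces of evidence through a long filler context and asks the model to report all of them, making the task both multi-evidence and position-aware. The sentence templates and parameters are assumptions for illustration, not the benchmark's actual items.

```python
import random

def build_counting_stars_item(filler_sentences, n_evidence=8, seed=0):
    """
    Scatter n_evidence numbered "star" sentences at random positions in a long filler
    context, then ask the model to list every star count it saw, in order.
    """
    rng = random.Random(seed)
    counts = [rng.randint(1, 100) for _ in range(n_evidence)]
    context = list(filler_sentences)
    positions = sorted(rng.sample(range(len(context)), n_evidence))
    for pos, c in zip(positions, counts):
        context.insert(pos, f"The little penguin counted {c} stars.")
    prompt = (" ".join(context)
              + "\n\nList every number of stars the little penguin counted, in order.")
    return prompt, counts  # counts serve as the gold answer for exact-match scoring

# Usage with synthetic filler text standing in for a long document:
filler = [f"This is filler sentence number {i}." for i in range(2000)]
prompt, gold = build_counting_stars_item(filler)
```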
pdf
bib
abs
Personalized Large Language Model Assistant with Evolving Conditional Memory
Ruifeng Yuan
|
Shichao Sun
|
Yongqi Li
|
Zili Wang
|
Ziqiang Cao
|
Wenjie Li
With the rapid development of large language models, AI assistants like ChatGPT have become increasingly integrated into people’s works and lives but are limited in personalized services. In this paper, we present a plug-and-play framework that could facilitate personalized large language model assistants with evolving conditional memory. The personalized assistant focuses on intelligently preserving the knowledge and experience from the history dialogue with the user, which can be applied to future tailored responses that better align with the user’s preferences. Generally, the assistant generates a set of records from the dialogue, stores them in a memory bank, and retrieves related memory to improve the quality of the response. For the crucial memory design, we explore different ways of constructing the memory and propose a new memorizing mechanism named conditional memory to enhance the memory management of the framework. We also investigate the retrieval and usage of memory in the generation process. To better evaluate the personalized assistants’ abilities, we build the first evaluation benchmark from three critical aspects: continuing previous dialogue, learning personalized knowledge and learning from user feedback. The experimental results illustrate the effectiveness of our method.
pdf
bib
abs
ReLayout: Towards Real-World Document Understanding via Layout-enhanced Pre-training
Zhouqiang Jiang
|
Bowen Wang
|
Junhao Chen
|
Yuta Nakashima
Recent approaches for visually-rich document understanding (VrDU) use manually annotated semantic groups, where a semantic group encompasses all semantically relevant but not obviously grouped words. As OCR tools are unable to automatically identify such groupings, we argue that current VrDU approaches are unrealistic. We thus introduce a new variant of the VrDU task, real-world visually-rich document understanding (ReVrDU), that does not allow for using manually annotated semantic groups. We also propose a new method, ReLayout, compliant with the ReVrDU scenario, which learns to capture semantic grouping by arranging words and bringing the representations of words that belong to the same potential semantic group closer together. Our experimental results demonstrate that the performance of existing methods deteriorates on the ReVrDU task, while ReLayout shows superior performance.
pdf
bib
abs
Gen-SQL: Efficient Text-to-SQL By Bridging Natural Language Question And Database Schema With Pseudo-Schema
Jie Shi
|
Bo Xu
|
Jiaqing Liang
|
Yanghua Xiao
|
Jia Chen
|
Chenhao Xie
|
Peng Wang
|
Wei Wang
With the prevalence of Large Language Models (LLMs), recent studies have shifted paradigms and leveraged LLMs to tackle the challenging task of Text-to-SQL. Because of the complexity of real world databases, previous works adopt the retrieve-then-generate framework to retrieve relevant database schema and then to generate the SQL query. However, efficient embedding-based retriever suffers from lower retrieval accuracy, and more accurate LLM-based retriever is far more expensive to use, which hinders their applicability for broader applications. To overcome this issue, this paper proposes Gen-SQL, a novel generate-ground-regenerate framework, where we exploit prior knowledge from the LLM to enhance embedding-based retriever and reduce cost. Experiments on several datasets are conducted to demonstrate the effectiveness and scalability of our proposed method. We release our code and data at https://github.com/jieshi10/gensql.
pdf
bib
abs
Language Models at the Syntax-Semantics Interface: A Case Study of the Long-Distance Binding of Chinese Reflexive Ziji
Xiulin Yang
This paper explores whether language models can effectively resolve the complex binding patterns of the Mandarin Chinese reflexive ziji, which are constrained by both syntactic and semantic factors. We construct a dataset of 320 synthetic sentences using templates and examples from syntactic literature, along with 360 natural sentences from the BCC corpus. Evaluating 21 language models against this dataset and comparing their performance to judgments from native Mandarin speakers, we find that none of the models consistently replicates human-like judgments. The results indicate that existing language models tend to rely heavily on sequential cues, though not always favoring the closest strings, and often overlooking subtle semantic and syntactic constraints. They tend to be more sensitive to noun-related than verb-related semantics.
pdf
bib
abs
HyperHatePrompt: A Hypergraph-based Prompting Fusion Model for Multimodal Hate Detection
Bo Xu
|
Erchen Yu
|
Jiahui Zhou
|
Hongfei Lin
|
Linlin Zong
Multimodal hate detection aims to identify hate content across multiple modalities to promote a harmonious online environment. Despite promising progress, three critical challenges inherent in the multimodal hate detection task have been overlooked: the absence of implicit hateful cues, cross-modal-induced hate, and the diversity of hate target groups. To address these challenges, we propose a hypergraph-based prompting fusion model. Our model first uses tailored prompts to infer implicit hateful cues. It then introduces hyperedges to capture cross-modal-induced hate and applies a diversity-oriented hyperedge expansion strategy to account for different hate target groups. Finally, hypergraph convolution fuses diverse hateful cues, enhancing the exploration of cross-modal hate and the targeting of specific groups. Experimental results on two benchmark datasets show that our model achieves state-of-the-art performance in multimodal hate detection.
pdf
bib
abs
GenWebNovel: A Genre-oriented Corpus of Entities in Chinese Web Novels
Hanjie Zhao
|
Yuchen Yan
|
Senbin Zhu
|
Hongde Liu
|
Yuxiang Jia
|
Hongying Zan
|
Min Peng
Entities are important to understanding literary works, which emphasize characters, plots and environment. Research on entity recognition, especially nested entity recognition, in the literary domain is still insufficient, partly due to the lack of annotated data. To address this issue, we construct the first Genre-oriented Corpus for Entity Recognition in Chinese Web Novels, namely GenWebNovel, comprising 400 chapters totaling 1,214,283 tokens under two genres, XuanHuan (Eastern Fantasy) and History. Based on the corpus, we analyze the distribution of different types of entities, including person, location, and organization. We also compare the nesting patterns of nested entities between GenWebNovel and the English corpus LitBank. Even though both belong to the literary domain, entities in different genres show little overlap, making genre adaptation of NER (Named Entity Recognition) a hard problem. We propose a novel method that utilizes a pre-trained language model as an in-context learning example retriever to boost the performance of large language models. Our experiments show that this approach significantly enhances entity recognition, matching state-of-the-art (SOTA) models without requiring additional training data. Our code, dataset, and model are available at https://github.com/hjzhao73/GenWebNovel.
pdf
bib
abs
Automated Progressive Red Teaming
Bojian Jiang
|
Yi Jing
|
Tong Wu
|
Tianhao Shen
|
Deyi Xiong
|
Qing Yang
Ensuring the safety of large language models (LLMs) is paramount, yet identifying potential vulnerabilities is challenging. While manual red teaming is effective, it is time-consuming, costly and lacks scalability. Automated red teaming (ART) offers a more cost-effective alternative, automatically generating adversarial prompts to expose LLM vulnerabilities. However, current ART efforts lack a robust framework that explicitly frames red teaming as an effectively learnable task. To address this gap, we propose Automated Progressive Red Teaming (APRT) as an effectively learnable framework. APRT leverages three core modules: an Intention Expanding LLM that generates diverse initial attack samples, an Intention Hiding LLM that crafts deceptive prompts, and an Evil Maker to manage prompt diversity and filter ineffective samples. The three modules collectively and progressively explore and exploit LLM vulnerabilities through multi-round interactions. In addition to the framework, we further propose a novel indicator, Attack Effectiveness Rate (AER), to mitigate the limitations of existing evaluation metrics. By measuring the likelihood of eliciting unsafe but seemingly helpful responses, AER aligns closely with human evaluations. Extensive experiments with both automatic and human evaluations demonstrate the effectiveness of APRT across both open- and closed-source LLMs. Specifically, APRT effectively elicits 54% unsafe yet useful responses from Meta’s Llama-3-8B-Instruct, 50% from GPT-4o (API access), and 39% from Claude-3.5 (API access), showcasing its robust attack capability and transferability across LLMs (especially from open-source LLMs to closed-source LLMs).
pdf
bib
abs
Rumor Detection on Social Media with Temporal Propagation Structure Optimization
Xingyu Peng
|
Junran Wu
|
Ruomei Liu
|
Ke Xu
Traditional methods for detecting rumors on social media primarily focus on analyzing textual content, often struggling to capture the complexity of online interactions. Recent research has shifted towards leveraging graph neural networks to model the hierarchical conversation structure that emerges during rumor propagation. However, these methods tend to overlook the temporal aspect of rumor propagation and may disregard potential noise within the propagation structure. In this paper, we propose a novel approach that incorporates temporal information by constructing a weighted propagation tree, where the weight of each edge represents the time interval between connected posts. Drawing upon the theory of structural entropy, we transform this tree into a coding tree. This transformation aims to preserve the essential structure of rumor propagation while reducing noise. Finally, we introduce a recursive neural network to learn from the coding tree for rumor veracity prediction. Experimental results on two common datasets demonstrate the superiority of our approach.
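A minimal sketch of the weighted propagation tree described above, where each edge weight is the time interval between a post and the post it replies to; the structural-entropy compression into a coding tree and the recursive neural network are left out. Post IDs and timestamps are made up for illustration.

```python
import networkx as nx

# Toy posts: (post_id, parent_id, timestamp in minutes); a None parent marks the source tweet.
posts = [("p0", None, 0), ("p1", "p0", 5), ("p2", "p0", 40), ("p3", "p1", 42)]

def build_weighted_propagation_tree(posts):
    """Edge weight = time interval between a post and the post it replies to."""
    tree = nx.DiGraph()
    times = {pid: t for pid, _, t in posts}
    for pid, parent, t in posts:
        tree.add_node(pid, time=t)
        if parent is not None:
            tree.add_edge(parent, pid, weight=t - times[parent])
    return tree

tree = build_weighted_propagation_tree(posts)
print(tree.edges(data=True))
# The paper then compresses such a tree into a coding tree via structural entropy
# minimisation before feeding it to a recursive neural network; that step is omitted here.
```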
pdf
bib
abs
Revisiting Implicitly Abusive Language Detection: Evaluating LLMs in Zero-Shot and Few-Shot Settings
Julia Jaremko
|
Dagmar Gromann
|
Michael Wiegand
Implicitly abusive language (IAL), unlike its explicit counterpart, lacks overt slurs or unambiguously offensive keywords, such as “bimbo” or “scum”, making it challenging to detect and mitigate. While current research predominantly focuses on explicitly abusive language, the subtler and more covert forms of IAL remain insufficiently studied. The rapid advancement and widespread adoption of large language models (LLMs) have opened new possibilities for various NLP tasks, but their application to IAL detection has been limited. We revisit three very recent challenging datasets of IAL and investigate the potential of LLMs to enhance the detection of IAL in English through zero-shot and few-shot prompting approaches. We evaluate the models’ capabilities in classifying sentences directly as either IAL or benign, and in extracting linguistic features associated with IAL. Our results indicate that classifiers trained on features extracted by advanced LLMs outperform the best previously reported results, achieving near-human performance.
pdf
bib
abs
Grading Massive Open Online Courses Using Large Language Models
Shahriar Golchin
|
Nikhil Garuda
|
Christopher Impey
|
Matthew Wenger
Massive open online courses (MOOCs) offer free education globally. Despite this democratization of learning, the massive enrollment in these courses makes it impractical for an instructor to assess every student’s writing assignment. As a result, peer grading, often guided by a straightforward rubric, is the method of choice. While convenient, peer grading often falls short in terms of reliability and validity. In this study, we explore the feasibility of using large language models (LLMs) to replace peer grading in MOOCs. To this end, we adapt the zero-shot chain-of-thought (ZCoT) prompting technique to automate the feedback process once the LLM assigns a score to an assignment. Specifically, to instruct LLMs for grading, we use three distinct prompts based on ZCoT: (1) ZCoT with instructor-provided correct answers, (2) ZCoT with both instructor-provided correct answers and rubrics, and (3) ZCoT with instructor-provided correct answers and LLM-generated rubrics. We tested these prompts in 18 different scenarios using two LLMs—GPT-4 and GPT-3.5—across three MOOCs: Introductory Astronomy, Astrobiology, and the History and Philosophy of Astronomy. Our results show that ZCoT, when augmented with instructor-provided correct answers and rubrics, produces grades that are more aligned with those assigned by instructors compared to peer grading. Finally, our findings indicate a promising potential for automated grading systems in MOOCs, especially in subjects with well-defined rubrics, to improve the learning experience for millions of online learners worldwide.
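A minimal sketch of a ZCoT grading prompt in the spirit of variant (2), instructor answer plus rubric, is shown below; the wording and argument names are illustrative assumptions rather than the paper's actual prompt.

```python
# Hedged sketch of a zero-shot chain-of-thought (ZCoT) grading prompt:
# instructor answer plus rubric, with a "think step by step" cue before scoring.
def zcot_grading_prompt(question: str, correct_answer: str,
                        rubric: str, student_answer: str) -> str:
    return (
        f"Question: {question}\n"
        f"Instructor-provided correct answer: {correct_answer}\n"
        f"Rubric: {rubric}\n"
        f"Student answer: {student_answer}\n\n"
        "Let's think step by step, then assign a score from 0 to 10 "
        "and give brief feedback."
    )
```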
pdf
bib
abs
Decoding Echo Chambers: LLM-Powered Simulations Revealing Polarization in Social Networks
Chenxi Wang
|
Zongfang Liu
|
Dequan Yang
|
Xiuying Chen
The impact of social media on critical issues such as echo chambers needs to be addressed, as these phenomena can have disruptive consequences for our society. Traditional research often oversimplifies emotional tendencies and opinion evolution into numbers and formulas, neglecting that news and communication are conveyed through text, which limits these approaches. Hence, in this work, we propose an LLM-based simulation for the social opinion network to evaluate and counter polarization phenomena. We first construct three typical network structures to simulate different characteristics of social interactions. Then, agents interact based on recommendation algorithms and update their strategies through reasoning and analysis. By comparing these interactions with the classic Bounded Confidence Model (BCM), the Friedkin-Johnsen (FJ) model, and using echo chamber-related indices, we demonstrate the effectiveness of our framework in simulating opinion dynamics and reproducing phenomena such as opinion polarization and echo chambers. We propose two mitigation methods, active and passive nudges, that can help reduce echo chambers, specifically within language-based simulations. We hope our work will offer valuable insights and guidance for social polarization mitigation.
pdf
bib
abs
Parameter-Efficient Fine-Tuning of Large Language Models via Deconvolution in Subspace
Jia-Chen Zhang
|
Yu-Jie Xiong
|
Chun-Ming Xia
|
Dong-Hai Zhu
|
Xi-He Qiu
This paper proposes a novel parameter-efficient fine-tuning method that combines the knowledge completion capability of deconvolution with subspace learning, reducing the number of parameters required for fine-tuning by a factor of 8. Experimental results demonstrate that our method achieves superior training efficiency and performance compared to existing models.
pdf
bib
abs
StoryLLaVA: Enhancing Visual Storytelling with Multi-Modal Large Language Models
Li Yang
|
Zhiding Xiao
|
Wenxin Huang
|
Xian Zhong
The rapid development of multimodal large language models (MLLMs) has positioned visual storytelling as a crucial area in content creation. However, existing models often struggle to maintain temporal, spatial, and narrative coherence across image sequences, and they frequently lack the depth and engagement of human-authored stories. To address these challenges, we propose Story with Large Language-and-Vision Alignment (StoryLLaVA), a novel framework for enhancing visual storytelling. Our approach introduces a topic-driven narrative optimizer that improves both the training data and MLLM models by integrating image descriptions, topic generation, and GPT-4-based refinements. Furthermore, we employ a preference-based ranked story sampling method that aligns model outputs with human storytelling preferences through positive-negative pairing. These two phases of the framework differ in their training methods: the former uses supervised fine-tuning, while the latter incorporates reinforcement learning with positive and negative sample pairs. Experimental results demonstrate that StoryLLaVA outperforms current models in visual relevance, coherence, and fluency, with LLM-based evaluations confirming the generation of richer and more engaging narratives. The enhanced dataset and model will be made publicly available soon.
pdf
bib
abs
Aligning Complex Knowledge Graph Question Answering as Knowledge-Aware Constrained Code Generation
Prerna Agarwal
|
Nishant Kumar
|
Srikanta Bedathur Jagannath
Generating executable logical forms (LF) using Large Language Models (LLMs) in a few-shot setting for Knowledge Graph Question Answering (KGQA) is becoming popular. However, their performance is still limited due to very little exposure to the LF during pre-training of LLMs, resulting in the generation of many syntactically incorrect LFs. If the LF generation task can be transformed into a more familiar task for the LLM, it can potentially reduce the syntax errors and elevate the generation quality. On the other hand, there exist specialized LLMs trained/fine-tuned on code in many programming languages. They can be leveraged to generate the LF as step-wise constrained code expression generation using modular functions in the LF. Based on this insight, we propose CodeAlignKGQA: a framework that aligns LF generation as code generation that incorporates LF-specific constraints. We extract the question-specific subgraph information to enable Knowledge-Aware code generation. We additionally introduce a dynamic self-code-correction mechanism, to be applied as required. Our extensive experiments on Complex KGQA benchmarks such as KQA Pro demonstrate the effectiveness of our approach. CodeAlignKGQA surpasses all few-shot baselines on KQA Pro by 21%, achieving a new state-of-the-art.
pdf
bib
abs
KnowledgePrompts: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting
Thilini Wijesiriwardene
|
Ruwan Wickramarachchi
|
Sreeram Reddy Vennam
|
Vinija Jain
|
Aman Chadha
|
Amitava Das
|
Ponnurangam Kumaraguru
|
Amit Sheth
Making analogies is fundamental to cognition. Proportional analogies, which consist of four terms, are often used to assess linguistic and cognitive abilities. For instance, completing analogies like “Oxygen is to Gas as < blank > is to < blank >" requires identifying the semantic relationship (e.g., “type of”) between the first pair of terms (“Oxygen” and “Gas”) and finding a second pair that shares the same relationship (e.g., “Aluminum” and “Metal”). In this work, we introduce a 15K Multiple-Choice Question Answering (MCQA) dataset for proportional analogy completion and evaluate the performance of contemporary Large Language Models (LLMs) in various knowledge-enhanced prompt settings. Specifically, we augment prompts with three types of knowledge: exemplar, structured, and targeted. Our results show that despite extensive training data, solving proportional analogies remains challenging for current LLMs, with the best model achieving an accuracy of 55%. Notably, we find that providing targeted knowledge can better assist models in completing proportional analogies compared to providing exemplars or collections of structured knowledge. Our code and data are available at:
https://github.com/Thiliniiw/KnowledgePrompts/
pdf
bib
abs
Unified Grid Tagging Scheme for Aspect Sentiment Quad Prediction
Guixin Su
|
Yongcheng Zhang
|
Tongguan Wang
|
Mingmin Wu
|
Ying Sha
Aspect Sentiment Quad Prediction (ASQP) aims to extract all sentiment elements in quads for a given review to explain the reason for the sentiment. Previous table-filling based methods have achieved promising results by modeling word-pair relations. However, these methods decompose the ASQP task into several subtasks without considering the association between sentiment elements. Most importantly, they fail to tackle the situation where a sentence contains multiple implicit expressions. To address these limitations, we propose a simple yet effective Unified Grid Tagging Scheme (UGTS) to extract sentiment quadruplets in one shot, with two additional special tokens from pre-trained models to represent potential implicit aspect and opinion terms. Based on this, we first introduce the adaptive graph diffusion convolution network to construct the direct connection between explicit and implicit sentiment elements from syntactic and semantic views. Next, we utilize conditional layer normalization to refine the mutual indication effect between words for matching valid aspect-opinion pairs. Finally, we employ the triaffine mechanism to integrate heterogeneous word-pair relations to capture higher-order interactions between sentiment elements. Experimental results on four benchmark datasets show the effectiveness and robustness of our model, which achieves state-of-the-art performance.
pdf
bib
abs
Claim veracity assessment for explainable fake news detection
Bassamtiano Renaufalgi Irnawan
|
Sheng Xu
|
Noriko Tomuro
|
Fumiyo Fukumoto
|
Yoshimi Suzuki
With the rapid growth of social network services, misinformation has spread uncontrollably. Most recent approaches to fake news detection use neural network models to predict whether the input text is fake or real. Some of them even provide explanations, in addition to veracity, generated by Large Language Models (LLMs). However, they do not utilize factual evidence, nor do they allude to it or provide evidence/justification, thereby making their predictions less credible. This paper proposes a new fake news detection method that predicts the truth or falsehood of a claim based on relevant factual evidence (if it exists) or on the LLM’s inference mechanisms (such as common-sense reasoning) otherwise. Our method produces the final synthesized prediction, along with well-founded facts or reasoning. Experimental results on several large COVID-19 fake news datasets show that our method achieves state-of-the-art (SOTA) detection and evidence explanation performance. Our source codes are available online.
pdf
bib
abs
ACE-M3: Automatic Capability Evaluator for Multimodal Medical Models
Xiechi Zhang
|
Shunfan Zheng
|
Linlin Wang
|
Gerard de Melo
|
Zhu Cao
|
Xiaoling Wang
|
Liang He
As multimodal large language models (MLLMs) gain prominence in the medical field, the need for precise evaluation methods to assess their effectiveness has become critical. While benchmarks provide a reliable means to evaluate the capabilities of MLLMs, traditional metrics like ROUGE and BLEU employed for open domain evaluation only focus on token overlap and may not align with human judgment. While human evaluation is more reliable, it is labor-intensive, costly, and not scalable. LLM-based evaluation methods have proven promising, but to date, there is still an urgent need for open-source multimodal LLM-based evaluators in the medical field. To address this issue, we introduce ACE-M3, an open-source Automatic Capability Evaluator for Multimodal Medical Models specifically designed to assess the question answering abilities of medical MLLMs. It first utilizes a branch-merge architecture to provide both detailed analysis and a concise final score based on standard medical evaluation criteria. Subsequently, a reward token-based direct preference optimization (RTDPO) strategy is incorporated to save training time without compromising the performance of our model. Extensive experiments have demonstrated the effectiveness of our ACE-M3 model in evaluating the capabilities of medical MLLMs.
pdf
bib
abs
A Dual Contrastive Learning Framework for Enhanced Multimodal Conversational Emotion Recognition
Yunhe Xie
|
Chengjie Sun
|
Ziyi Cao
|
Bingquan Liu
|
Zhenzhou Ji
|
Yuanchao Liu
|
Lili Shan
Multimodal Emotion Recognition in Conversations (MERC) identifies utterance emotions by integrating both contextual and multimodal information from dialogue videos. Existing methods struggle to capture emotion shifts due to label replication and fail to preserve positive independent modality contributions during fusion. To address these issues, we propose a Dual Contrastive Learning Framework (DCLF) that enhances current MERC models without additional data. Specifically, to mitigate label replication effects, we construct context-aware contrastive pairs. Additionally, we assign pseudo-labels to distinguish modality-specific contributions. DCLF works alongside basic models to introduce semantic constraints at the utterance, context, and modality levels. Our experiments on two MERC benchmark datasets demonstrate performance gains of 4.67%-4.98% on IEMOCAP and 5.52%-5.89% on MELD, outperforming state-of-the-art approaches. Perturbation tests further validate DCLF’s ability to reduce label dependence. Additionally, DCLF incorporates emotion-sensitive independent modality features and multimodal fusion representations into final decisions, unlocking the potential contributions of individual modalities.
pdf
bib
abs
Can LLMs Clarify? Investigation and Enhancement of Large Language Models on Argument Claim Optimization
Yiran Wang
|
Ben He
|
Xuanang Chen
|
Le Sun
In argumentation, the claim is the foundational proposition that underpins the argument, serving as the central pillar upon which the argument is constructed. It guides the subsequent presentation of evidence, reasoning, and analysis, thereby facilitating the audience’s understanding of the core issue. Therefore, ensuring that the claim is precise and unambiguous is crucial for constructing a coherent and persuasive argument. While Large Language Models (LLMs) have demonstrated proficiency in text rewriting tasks such as style transfer and query rewriting, their application to claim optimization remains unexplored. Unlike other rewriting tasks, claim clarification requires the model to rewrite ambiguous or unclear segments of the claim, enhance the content by adding omitted key details, and eliminate redundant or verbose elements. Addressing this gap, this paper evaluates the performance of LLMs on the claim clarification task across various settings. While popular rewriting evaluation methods such as BLEU and ROUGE rely on exact word matching, this paper introduces a novel semantic evaluation approach based on a sliding window mechanism. Three distinct LLMs, including Llama2, Mistral, and Qwen2, are assessed for their ability to clarify arguments through zero-shot or few-shot prompting, and supervised fine-tuning (SFT). Additionally, we propose a reinforcement learning-based clarification approach that optimally balances content preservation with claim clarity, thereby augmenting the performance of LLMs on the claim clarification task.
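One plausible reading of the sliding-window semantic evaluation is sketched below: each window of the clarified claim is matched to its most similar window in the reference and the scores are averaged. The window size, stride, and encoder are assumptions for illustration, not the paper's definition.

```python
# A minimal guess at a sliding-window semantic score; encoder and window
# parameters are assumptions, not taken from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def windows(text: str, size: int = 8, stride: int = 4) -> list[str]:
    toks = text.split()
    if len(toks) <= size:
        return [text]
    return [" ".join(toks[i:i + size]) for i in range(0, len(toks) - size + 1, stride)]

def sliding_window_score(candidate: str, reference: str) -> float:
    cand, ref = windows(candidate), windows(reference)
    emb_c = encoder.encode(cand, normalize_embeddings=True)
    emb_r = encoder.encode(ref, normalize_embeddings=True)
    sims = emb_c @ emb_r.T                      # pairwise cosine similarities
    return float(np.mean(sims.max(axis=1)))     # best match per candidate window
```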
pdf
bib
abs
Generation-Augmented and Embedding Fusion in Document-Level Event Argument Extraction
Xingjian Lin
|
Shengfei Lyu
|
Xin Wang
|
Qiuju Chen
|
Huanhuan Chen
Document-level event argument extraction is a crucial task that aims to extract arguments from the entire document, beyond sentence-level analysis. Prior classification-based models still fail to explicitly capture significant relationships and heavily rely on large-scale datasets. In this study, we propose a novel approach called Generation-Augmented and Embedding Fusion. This approach first uses predefined templates and generative language models to produce an embedding capturing role relationship information, then integrates it into the foundational embedding derived from a classification model through a novel embedding fusion mechanism. We conduct extensive experiments on the RAMS and WikiEvents datasets to demonstrate that our approach is more effective than the baselines, and that it is also data-efficient in low-resource scenarios.
pdf
bib
abs
C3LRSO: A Chinese Corpus for Complex Logical Reasoning in Sentence Ordering
Xiaotao Guo
|
Jiang Li
|
Xiangdong Su
|
Fujun Zhang
Sentence ordering is the task of rearranging a set of unordered sentences into a coherent and logically consistent sequence. Recent work has primarily used pre-trained language models, achieving significant success in the task. However, existing sentence ordering corpora are predominantly in English, and comprehensive benchmark datasets for non-English languages are unavailable. Meanwhile, current datasets often insert specific markers into paragraphs, inadvertently making the logical sequence between sentences more apparent and reducing the models’ ability to handle genuinely unordered sentences in real applications. To address these limitations, we develop C3LRSO, a high-quality Chinese sentence ordering dataset that overcomes the aforementioned shortcomings by providing genuinely unordered sentences without artificial segmentation cues. Furthermore, given the outstanding performance of large language models on NLP tasks, we evaluate these models on our dataset for this task. Additionally, we propose a simple yet effective parameter-free approach that outperforms existing methods on this task. Experiments demonstrate the challenging nature of the dataset and the strong performance of our proposed method. These findings highlight the potential for further research in sentence ordering and the development of more robust language models. Our dataset is freely available at https://github.com/JasonGuo1/C3LRSO.
pdf
bib
abs
KIA: Knowledge-Guided Implicit Vision-Language Alignment for Chest X-Ray Report Generation
Heng Yin
|
Shanlin Zhou
|
Pandong Wang
|
Zirui Wu
|
Yongtao Hao
Report generation (RG) faces challenges in understanding complex medical images and establishing cross-modal semantic alignment in radiology image-report pairs. Previous methods often overlook fine-grained cross-modal interaction, leading to insufficient understanding of detailed information. Recently, various large multimodal models have been proposed for image-text tasks. However, such models still underperform on rare domain tasks like understanding complex medical images. To address these limitations, we develop a new framework of Knowledge-guided Implicit vision-language Alignment for radiology report generation, named KIA. To better understand medical reports and images and build alignment between them, multi-task implicit alignment is creatively introduced, forming comprehensive understanding of medical images and reports. Additionally, to further meet medical refinement requirements, we design novel masking strategies guided by medical knowledge to enhance pathological observation and anatomical landmark understanding.
pdf
bib
abs
On the Human-level Performance of Visual Question Answering
Chenlian Zhou
|
Guanyi Chen
|
Xin Bai
|
Ming Dong
Visual7W has been widely used in assessing multiple-choice visual question-answering (VQA) systems. This paper reports on a replicated human experiment on Visual7W with the aim of understanding the human-level performance of VQA. The replication was not entirely successful because human participants performed significantly worse when answering “where”, “when”, and “how” questions compared to other question types. An error analysis discovered that the failure was a consequence of the non-deterministic distractors in Visual7W. GPT-4V was then evaluated and compared to this human-level performance. The results show that, when evaluating models’ capacity on Visual7W, higher performance is not necessarily better.
pdf
bib
abs
Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models
Dahyun Kim
|
Sukyung Lee
|
Yungi Kim
|
Attapol Rutherford
|
Chanjun Park
The rapid advancement of large language models (LLMs) has highlighted the need for robust evaluation frameworks that assess their core capabilities, such as reasoning, knowledge, and commonsense, leading to the inception of certain widely-used benchmark suites such as the H6 benchmark. However, these benchmark suites are primarily built for the English language, and such suites are lacking for languages that are under-represented in LLM development, such as Thai. On the other hand, developing LLMs for Thai should also include enhancing cultural understanding as well as core capabilities. To address this dual challenge in Thai LLM research, we propose two key benchmarks: Thai-H6 and Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI). Through a thorough evaluation of various LLMs with multi-lingual capabilities, we provide a comprehensive analysis of the proposed benchmarks and how they contribute to Thai LLM development. Furthermore, we have made both the datasets and evaluation code publicly available to encourage further research and development for Thai LLMs.
pdf
bib
abs
CONTRANS: Weak-to-Strong Alignment Engineering via Concept Transplantation
Weilong Dong
|
Xinwei Wu
|
Renren Jin
|
Shaoyang Xu
|
Deyi Xiong
Ensuring that large language models (LLMs) behave consistently with human goals, values, and intentions is crucial for their safety yet computationally expensive. To reduce the computational cost of alignment training of LLMs, especially for those with a huge number of parameters, and to reutilize learned value alignment, we propose ConTrans, a novel framework that enables weak-to-strong alignment transfer via concept transplantation. From the perspective of representation engineering, ConTrans refines concept vectors in value alignment from a source LLM (usually a weak yet aligned LLM). The refined concept vectors are then reformulated to adapt to the target LLM (usually a strong yet unaligned base LLM) via affine transformation. In the third step, ConTrans transplants the reformulated concept vectors into the residual stream of the target LLM. Experiments demonstrate the successful transplantation of a wide range of aligned concepts from 7B models to 13B and 70B models across multiple LLMs and LLM families. Remarkably, ConTrans even surpasses instruction-tuned models in terms of truthfulness. Experimental results validate the effectiveness of both inter-LLM-family and intra-LLM-family concept transplantation. Our work successfully demonstrates an alternative way to achieve weak-to-strong alignment generalization and control.
pdf
bib
abs
Idea23D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs
Junhao Chen
|
Xiang Li
|
Xiaojun Ye
|
Chao Li
|
Zhaoxin Fan
|
Hao Zhao
With the success of 2D diffusion models, 2D AIGC content has already transformed our lives. Recently, this success has been extended to 3D AIGC, with state-of-the-art methods generating textured 3D models from single images or text. However, we argue that current 3D AIGC methods still don’t fully unleash human creativity. We often imagine 3D content made from multimodal inputs, such as what it would look like if my pet bunny were eating a doughnut on the table. In this paper, we explore a novel 3D AIGC approach: generating 3D content from IDEAs. An IDEA is a multimodal input composed of text, image, and 3D models. To our knowledge, this challenging and exciting 3D AIGC setting has not been studied before. We propose the new framework Idea23D, which combines three agents based on large multimodal models (LMMs) and existing algorithmic tools. These three LMM-based agents are tasked with prompt generation, model selection, and feedback reflection. They collaborate and critique each other in a fully automated loop, without human intervention. The framework then generates a text prompt to create 3D models that align closely with the input IDEAs. We demonstrate impressive 3D AIGC results that surpass previous methods. To comprehensively assess the 3D AIGC capabilities of Idea23D, we introduce the Eval3DAIGC-198 dataset, containing 198 multimodal inputs for 3D generation tasks. This dataset evaluates the alignment between generated 3D content and input IDEAs. Our user study and quantitative results show that Idea23D significantly improves the success rate and accuracy of 3D generation, with excellent compatibility across various LMM, Text-to-Image, and Image-to-3D models. Code and dataset are available at https://idea23d.github.io/.
pdf
bib
abs
Learning from Impairment: Leveraging Insights from Clinical Linguistics in Language Modelling Research
Dominique Brunato
This position paper investigates the potential of integrating insights from language impairment research and its clinical treatment to develop human-inspired learning strategies and evaluation frameworks for language models (LMs). We inspect the theoretical underpinnings underlying some influential linguistically motivated training approaches derived from neurolinguistics and, particularly, aphasiology, aimed at enhancing the recovery and generalization of linguistic skills in aphasia treatment, with a primary focus on those targeting the syntactic domain. We highlight how these insights can inform the design of rigorous assessments for LMs, specifically in their handling of complex syntactic phenomena, as well as their implications for developing human-like learning strategies, aligning with efforts to create more sustainable and cognitively plausible natural language processing (NLP) models.
pdf
bib
abs
Efficient Cross-modal Prompt Learning with Semantic Enhancement for Domain-robust Fake News Detection
Fei Wu
|
Hao Jin
|
Changhui Hu
|
Yimu Ji
|
Xiao-Yuan Jing
|
Guo-Ping Jiang
With the development of multimedia technology, online social media has become a major medium for people to access news, but meanwhile, it has also exacerbated the dissemination of multi-modal fake news. An automatic and efficient multi-modal fake news detection (MFND) method is urgently needed. Existing MFND methods usually conduct cross-modal information interaction at a later stage, resulting in insufficient exploration of complementary information between modalities. Another challenge lies in the differences among news data from different domains, leading to weak generalization ability in detecting news from various domains. In this work, we propose an efficient Cross-modal Prompt Learning with Semantic enhancement method for Domain-robust fake news detection (CPLSD). Specifically, we design an efficient cross-modal prompt interaction module, which uses prompts as a medium to realize lightweight cross-modal information interaction in the early stage of feature extraction, enabling the model to exploit rich complementary information across modalities. We design a domain-general prompt generation module that can adaptively blend domain-specific news features to generate domain-general prompts, improving the domain generalization ability of the model. Furthermore, an image semantic enhancement module is designed to achieve image-to-text translation, fully exploring the semantic discriminative information of the image modality. Extensive experiments conducted on three MFND benchmarks demonstrate the superiority of our proposed approach over existing state-of-the-art MFND methods.
pdf
bib
abs
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs
Basel Mousi
|
Nadir Durrani
|
Fatema Ahmad
|
Md. Arid Hasan
|
Maram Hasanain
|
Tameem Kabbani
|
Fahim Dalvi
|
Shammur Absar Chowdhury
|
Firoj Alam
Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic datasets in dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes ≈45K post-edited samples, a cultural benchmark, and highlights the importance of tailored training to improve LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We have released the dialectal translation models and benchmarks developed in this study (https://huggingface.co/datasets/QCRI/AraDiCE)
pdf
bib
abs
Distance-Adaptive Quaternion Knowledge Graph Embedding with Bidirectional Rotation
Weihua Wang
|
Qiuyu Liang
|
Feilong Bao
|
Guanglai Gao
A quaternion contains one real part and three imaginary parts, providing a more expressive hypercomplex space for learning knowledge graph embeddings. Existing quaternion embedding models measure the plausibility of a triplet either through semantic matching or distance scoring functions. However, it appears that semantic matching diminishes the separability of entities, while the distance scoring function weakens the semantics of entities. To address this issue, we propose a novel quaternion knowledge graph embedding model. Our model combines semantic matching with entities’ geometric distance to better measure the plausibility of triplets. Specifically, in the quaternion space, we perform a right rotation on the head entity and a reverse rotation on the tail entity to learn the rich semantic features. Then, we utilize distance-adaptive translations to learn the geometric distance between entities. Furthermore, we provide mathematical proofs to demonstrate that our model can handle complex logical relationships. Extensive experimental results and analyses show our model significantly outperforms previous models on well-known knowledge graph completion benchmark datasets. Our code is available at https://anonymous.4open.science/r/l2730.
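For intuition about combining rotational semantic matching with a translational distance, the sketch below implements a generic quaternion-based score; it is an assumption about the overall idea, not the paper's exact scoring function.

```python
# Illustrative quaternion score mixing a rotation-based semantic term with a
# translation-based distance term; the exact formulation is an assumption.
import numpy as np

def hamilton(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Hamilton product of quaternions stored as [..., 4] arrays (w, x, y, z)."""
    w1, x1, y1, z1 = np.moveaxis(p, -1, 0)
    w2, x2, y2, z2 = np.moveaxis(q, -1, 0)
    return np.stack([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ], axis=-1)

def conj(q: np.ndarray) -> np.ndarray:
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def score(h: np.ndarray, r: np.ndarray, t: np.ndarray, trans: np.ndarray) -> float:
    """h, r, t: [dim, 4] quaternion embeddings; trans: per-dimension translation."""
    h_rot = hamilton(h, r)               # right rotation of the head
    t_rot = hamilton(t, conj(r))         # reverse rotation of the tail
    dist = np.linalg.norm(h_rot + trans - t, axis=-1).sum()   # geometric part
    match = float((h_rot * t_rot).sum())                      # semantic part
    return match - dist
```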
pdf
bib
abs
How Credible Is an Answer From Retrieval-Augmented LLMs? Investigation and Evaluation With Multi-Hop QA
Yujia Zhou
|
Zheng Liu
|
Zhicheng Dou
Retrieval-augmented Large Language Models (RaLLMs) are reshaping knowledge acquisition, offering long-form, knowledge-grounded answers through advanced reasoning and generation capabilities. Despite the emergence of impactful systems like WebGPT and New Bing, the reliability of RaLLMs, especially in complex situations, is under scrutiny. Our study tackles this concern by evaluating RaLLMs’ question-answering performance using a novel benchmark focusing on Correctness and Groundedness. Correctness measures the logical soundness of the responses, and Groundedness checks for support by relevant references. We introduce an automated model-based evaluation pipeline for multi-hop question-answering tasks, revealing RaLLMs’ proneness to generating inaccuracies when dealing with flawed or partial knowledge. To improve accuracy, we introduce two reasoning strategies, ‘Self-Reflection’ and ‘Self-Completion’, enabling RaLLMs to identify and fill knowledge gaps, significantly improving answer quality without extensive model retraining.
pdf
bib
abs
Is Parameter Collision Hindering Continual Learning in LLMs?
Shuo Yang
|
Kun-Peng Ning
|
Yu-Yang Liu
|
Jia-Yu Yao
|
Yong-Hong Tian
|
Yi-Bing Song
|
Li Yuan
Large Language Models (LLMs) often suffer from catastrophic forgetting when learning multiple tasks sequentially, making continual learning (CL) essential for their dynamic deployment. Existing state-of-the-art (SOTA) methods, such as O-LoRA, typically focus on constructing orthogonality tasks to decouple parameter interdependence from various domains. In this paper, we reveal that building non-collision parameters is a more critical factor in addressing CL challenges. Our theoretical and experimental analyses demonstrate that non-collision parameters provide better task orthogonality, which is a sufficient but not necessary condition. Furthermore, knowledge from multiple domains will be preserved in non-collision parameter subspaces, making it more difficult to forget previously seen data. Leveraging this insight, we propose Non-collision Low-Rank Adaptation (N-LoRA), a simple yet effective approach leveraging low collision rates to enhance CL in LLMs. Experimental results on multiple CL benchmarks indicate that N-LoRA achieves superior performance (+2.9%), higher task orthogonality (4.1×), and lower parameter collision (58.1×) than SOTA methods.
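One plausible way to quantify parameter collision between two tasks' LoRA updates is the fraction of weight coordinates where both updates are non-negligible; the definition below is an illustrative assumption, not necessarily the metric used in the paper.

```python
# Illustrative collision measure over dense LoRA updates Delta W = B @ A.
import numpy as np

def lora_update(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Reconstruct the dense update Delta W = B @ A from LoRA factors."""
    return B @ A

def collision_rate(dw1: np.ndarray, dw2: np.ndarray, eps: float = 1e-6) -> float:
    """Fraction of coordinates where both tasks' updates are non-negligible."""
    active1 = np.abs(dw1) > eps
    active2 = np.abs(dw2) > eps
    both = np.logical_and(active1, active2).sum()
    either = np.logical_or(active1, active2).sum()
    return float(both / max(either, 1))
```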
pdf
bib
abs
Jump To Hyperspace: Comparing Euclidean and Hyperbolic Loss Functions for Hierarchical Multi-Label Text Classification
Jens Van Nooten
|
Walter Daelemans
Hierarchical Multi-Label Text Classification (HMTC) is a challenging machine learning task where multiple labels from a hierarchically organized label set are assigned to a single text. In this study, we examine the effectiveness of Euclidean and hyperbolic loss functions to improve the performance of BERT models on HMTC, which very few previous studies have adopted. We critically evaluate label-aware losses as well as contrastive losses in the Euclidean and hyperbolic space, demonstrating that hyperbolic loss functions perform comparably with non-hyperbolic loss functions on four commonly used HMTC datasets in most scenarios. While hyperbolic label-aware losses perform the best on low-level labels, the overall consistency and micro-averaged performance is compromised. Additionally, we find that our contrastive losses are less effective for HMTC when deployed in the hyperbolic space than non-hyperbolic counterparts. Our research highlights that with the right metrics and training objectives, hyperbolic space does not provide any additional benefits compared to Euclidean space for HMTC, thereby prompting a reevaluation of how different geometric spaces are used in other AI applications.
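For reference, hyperbolic losses of the kind compared above are typically built on the standard Poincaré-ball geodesic distance, sketched here independently of the paper's exact loss definitions.

```python
# Standard Poincare-ball geodesic distance, often used as the building block
# of hyperbolic label-aware or contrastive losses.
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Geodesic distance between points u, v inside the unit Poincare ball."""
    sq = torch.sum((u - v) ** 2, dim=-1)
    un = torch.clamp(1.0 - torch.sum(u ** 2, dim=-1), min=eps)
    vn = torch.clamp(1.0 - torch.sum(v ** 2, dim=-1), min=eps)
    x = 1.0 + 2.0 * sq / (un * vn)
    return torch.acosh(torch.clamp(x, min=1.0 + eps))
```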
pdf
bib
abs
Exploring the Limitations of Detecting Machine-Generated Text
Jad Doughman
|
Osama Mohammed Afzal
|
Hawau Olamide Toyin
|
Shady Shehata
|
Preslav Nakov
|
Zeerak Talat
Recent improvements in the quality of the generations by large language models have spurred research into identifying machine-generated text. Such work often presents high-performing detectors. However, humans and machines can produce text in different styles and domains, yet the performance impact of such variation on machine-generated text detection systems remains unclear. In this paper, we audit the classification performance for detecting machine-generated text by evaluating on texts with varying writing styles. We find that classifiers are highly sensitive to stylistic changes and differences in text complexity, and in some cases degrade entirely to random classifiers. We further find that detection systems are particularly prone to misclassifying easy-to-read texts, while they have high performance for complex texts, leading to concerns about the reliability of detection systems. We recommend that future work attends to stylistic factors and reading difficulty levels of human-written and machine-generated text.
pdf
bib
abs
Boosting Text-to-SQL through Multi-grained Error Identification
Bo Xu
|
Shufei Li
|
Hongyu Jing
|
Ming Du
|
Hui Song
|
Hongya Wang
|
Yanghua Xiao
Text-to-SQL is a technology that converts natural language questions into executable SQL queries, allowing users to query and manage relational databases more easily. In recent years, large language models have significantly advanced the development of text-to-SQL. However, existing methods often overlook validation of the generated results during the SQL generation process. Current error identification methods are mainly divided into self-correction approaches based on large models and feedback methods based on SQL execution, both of which have limitations. We categorize SQL errors into three main types: system errors, skeleton errors, and value errors, and propose a multi-grained error identification method. Experimental results demonstrate that this method can be integrated as a plugin into various methods, providing effective error identification and correction capabilities.
pdf
bib
abs
Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain
Antoine Louis
|
Gijs van Dijck
|
Gerasimos Spanakis
Hybrid search has emerged as an effective strategy to offset the limitations of different matching paradigms, especially in out-of-domain contexts where notable improvements in retrieval quality have been observed. However, existing research predominantly focuses on a limited set of retrieval methods, evaluated in pairs on domain-general datasets exclusively in English. In this work, we study the efficacy of hybrid search across a variety of prominent retrieval models within the unexplored field of law in the French language, assessing both zero-shot and in-domain scenarios. Our findings reveal that in a zero-shot context, fusing different domain-general models consistently enhances performance compared to using a standalone model, regardless of the fusion method. Surprisingly, when models are trained in-domain, we find that fusion generally diminishes performance relative to using the best single system, unless fusing scores with carefully tuned weights. These novel insights, among others, expand the applicability of prior findings across a new field and language, and contribute to a deeper understanding of hybrid search in non-English specialized domains.
pdf
bib
abs
MPID: A Modality-Preserving and Interaction-Driven Fusion Network for Multimodal Sentiment Analysis
Tianyi Li
|
Daming Liu
The advancement of social media has intensified interest in the research direction of Multimodal Sentiment Analysis (MSA). However, current methodologies exhibit relative limitations, particularly in their fusion mechanisms that overlook nuanced differences and similarities across modalities, leading to potential biases in MSA. In addition, indiscriminate fusion across modalities can introduce unnecessary complexity and noise, undermining the effectiveness of the analysis. In this paper, a Modality-Preserving and Interaction-Driven Fusion Network is introduced to address the aforementioned challenges. The compressed representations of each modality are initially obtained through a Token Refinement Module. Subsequently, we employ a Dual Perception Fusion Module to integrate text with audio and a separate Adaptive Graded Fusion Module for text and visual data. The final step leverages the text representation to enhance the composite representation. Our experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that our model achieves state-of-the-art performance.
pdf
bib
abs
Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models
Patrick Amadeus Irawan
|
Genta Indra Winata
|
Samuel Cahyawijaya
|
Ayu Purwarianti
Natural Language Explanation (NLE) aims to elucidate the decision-making process by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making processes of large vision-language models (LVLMs) through the use of language models. While existing methods for creating Vision Question-Answering with Natural Language Explanation (VQA-NLE) datasets can provide explanations, they heavily rely on human annotations that are time-consuming and costly. In this study, we propose a novel approach that leverages LVLMs to efficiently generate high-quality synthetic VQA-NLE datasets. By evaluating our synthetic data samples, we showcase how advanced prompting techniques can lead to the production of high-quality VQA-NLE data. Our findings indicate that the proposed method is up to 20x faster than human annotation, with only a minimal decrease in qualitative metrics, achieving robust quality that is nearly equivalent to human-annotated data. Furthermore, we show that incorporating visual prompts significantly enhances the relevance of text generation. Our study paves the way for a more efficient and robust automated generation of multi-modal NLE data, offering a promising solution to the problem.
pdf
bib
abs
DefVerify: Do Hate Speech Models Reflect Their Dataset’s Definition?
Urja Khurana
|
Eric Nalisnick
|
Antske Fokkens
When building a predictive model, it is often difficult to ensure that application-specific requirements are encoded by the model that will eventually be deployed. Consider researchers working on hate speech detection. They will have an idea of what is considered hate speech, but building a model that reflects their view accurately requires preserving those ideals throughout the workflow of data set construction and model training. Complications such as sampling bias, annotation bias, and model misspecification almost always arise, possibly resulting in a gap between the application specification and the model’s actual behavior upon deployment. To address this issue for hate speech detection, we propose DefVerify: a 3-step procedure that (i) encodes a user-specified definition of hate speech, (ii) quantifies to what extent the model reflects the intended definition, and (iii) tries to identify the point of failure in the workflow. We use DefVerify to find gaps between definition and model behavior when applied to six popular hate speech benchmark datasets.
pdf
bib
abs
Fusion meets Function: The Adaptive Selection-Generation Approach in Event Argument Extraction
Guoxuan Ding
|
Xiaobo Guo
|
Xin Wang
|
Lei Wang
|
Tianshu Fu
|
Nan Mu
|
Daren Zha
Event Argument Extraction is a critical task of Event Extraction, focused on identifying event arguments within text. This paper presents a novel Fusion Selection-Generation-Based Approach, by combining the precision of selective methods with the semantic generation capability of generative methods to enhance argument extraction accuracy. This synergistic integration, achieved through fusion prompt, element-based extraction, and fusion learning, addresses the challenges of input, process, and output fusion, effectively blending the unique characteristics of both methods into a cohesive model. Comprehensive evaluations on the RAMS and WikiEvents demonstrate the model’s state-of-the-art performance and efficiency.
pdf
bib
abs
ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval
Antoine Louis
|
Vageesh Kumar Saxena
|
Gijs van Dijck
|
Gerasimos Spanakis
State-of-the-art neural retrievers predominantly focus on high-resource languages like English, which impedes their adoption in retrieval scenarios involving other languages. Current approaches circumvent the lack of high-quality labeled data in non-English languages by leveraging multilingual pretrained language models capable of cross-lingual transfer. However, these models require substantial task-specific fine-tuning across multiple languages, often perform poorly in languages with minimal representation in the pretraining corpus, and struggle to incorporate new languages after the pretraining phase. In this work, we present a novel modular dense retrieval model that learns from the rich data of a single high-resource language and effectively zero-shot transfers to a wide array of languages, thereby eliminating the need for language-specific labeled data. Our model, ColBERT-XM, demonstrates competitive performance against existing state-of-the-art multilingual retrievers trained on more extensive datasets in various languages. Further analysis reveals that our modular approach is highly data-efficient, effectively adapts to out-of-distribution data, and significantly reduces energy consumption and carbon emissions. By demonstrating its proficiency in zero-shot scenarios, ColBERT-XM marks a shift towards more sustainable and inclusive retrieval systems, enabling effective information accessibility in numerous languages.
pdf
bib
abs
TEXT-CAKE: Challenging Language Models on Local Text Coherence
Luca Dini
|
Dominique Brunato
|
Felice Dell’Orletta
|
Tommaso Caselli
We present a deep investigation of encoder-based Language Models (LMs) on their abilities to detect text coherence across four languages and four text genres using a new evaluation benchmark, TEXT-CAKE. We analyze both multilingual and monolingual LMs with varying architectures and parameters in different finetuning settings. Our findings demonstrate that identifying subtle perturbations that disrupt local coherence is still a challenging task. Furthermore, our results underline the importance of using diverse text genres during pre-training and of an optimal pre-training objective and a large vocabulary size. When controlling for other parameters, deep LMs (i.e., higher number of layers) have an advantage over shallow ones, even when the total number of parameters is smaller.
pdf
bib
abs
KVFKT: A New Horizon in Knowledge Tracing with Attention-Based Embedding and Forgetting Curve Integration
Quanlong Guan
|
Xiuliang Duan
|
Kaiquan Bian
|
Guanliang Chen
|
Jianbo Huang
|
Zhiguo Gong
|
Liangda Fang
The knowledge tracing (KT) model based on deep learning has been proven to be superior to the traditional knowledge tracing model, eliminating the need for artificial engineering features. However, there are still problems, such as insufficient interpretability of the learning and answering processes. To address these issues, we propose a new approach in knowledge tracing with attention-based embedding and forgetting curve integration, namely KVFKT. Firstly, the embedding representation module is responsible for embedding the questions and computing the attention vector of knowledge concepts (KCs) when students answer questions and when answer time stamps are collected. Secondly, the forgetting quantification module performs the pre-prediction update of the student’s knowledge state matrix. This quantification involves calculating the interval time and associated forgetting rate of relevant KCs, following the forgetting curve. Thirdly, the answer prediction module generates responses based on students’ knowledge status, guess coefficient, and question difficulty. Finally, the knowledge status update module further refines the students’ knowledge status according to their answers to the questions and the characteristics of those questions. In the experiment, four real-world datasets are used to test the model. Experimental results show that KVFKT better traces students’ knowledge state and outperforms state-of-the-art models.
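The forgetting-curve component mentioned above can be illustrated with an Ebbinghaus-style exponential decay applied to a per-concept knowledge state; the decay form and the stability parameter are assumptions for illustration, not the paper's exact quantification.

```python
# Minimal sketch of a forgetting-curve update: each concept's mastery decays
# exponentially with the time elapsed since it was last practised.
import numpy as np

def apply_forgetting(knowledge: np.ndarray, elapsed: np.ndarray,
                     stability: float = 86_400.0) -> np.ndarray:
    """knowledge: per-concept mastery in [0, 1]; elapsed: seconds since last review."""
    retention = np.exp(-elapsed / stability)   # Ebbinghaus-style decay
    return knowledge * retention
```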
pdf
bib
abs
Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering
Yinghao Hu
|
Leilei Gan
|
Wenyi Xiao
|
Kun Kuang
|
Fei Wu
Hallucination, or the generation of incorrect or fabricated information, remains a critical challenge in large language models (LLMs), particularly in high-stake domains such as legal question answering (QA). In order to mitigate the hallucination rate in legal QA, we first introduce a benchmark called LegalHalBench and three automatic metrics to evaluate the common hallucinations when LLMs answer legal questions. We then propose a hallucination mitigation method that integrates behavior cloning and a novel Hard Sample-aware Iterative Direct Preference Optimization (HIPO). We conduct extensive real-data experiments to validate the effectiveness of our approach. Our results demonstrate remarkable improvements in various metrics, including the newly proposed Non-Hallucinated Statute Rate, Statute Relevance Rate, Legal Claim Truthfulness, as well as traditional metrics such as METEOR, BERTScore, ROUGE-L, and win rates.
pdf
bib
abs
Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning
Xiaoye Qu
|
Jiashuo Sun
|
Wei Wei
|
Daizong Liu
|
Jianfeng Dong
|
Yu Cheng
Recently, Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multi-modal context comprehension. However, they still suffer from hallucination problems referring to generating inconsistent outputs with the image content. To mitigate hallucinations, previous studies mainly focus on retraining LVLMs with custom datasets. Although effective, they inherently come with additional computational costs. In this paper, we propose a training-free framework, MVP, that aims to reduce hallucinations by making the most of the innate capabilities of the LVLMs via Multi-View Multi-Path Reasoning. Specifically, we first devise a multi-view information-seeking strategy to thoroughly perceive the comprehensive information in the image, which enriches the general global information captured by the original vision encoder in LVLMs. Furthermore, during the answer decoding, we propose multi-path reasoning for each information view to quantify and aggregate the certainty scores for each potential answer among multiple decoding paths and finally decide the output answer. By fully grasping the information in the image and carefully considering the certainty of the potential answers when decoding, our MVP can effectively reduce hallucinations in LVLMs. The extensive experiments verify that our proposed MVP significantly mitigates the hallucination problem across four well-known LVLMs.
pdf
bib
abs
Large Language Models are good multi-lingual learners: When LLMs meet cross-lingual prompts
Teng Wang
|
Zhenqi He
|
Wing-Yin Yu
|
Xiaojin Fu
|
Xiongwei Han
With the advent of Large Language Models (LLMs), generating rule-based data for real-world applications has become more accessible. Due to the inherent ambiguity of natural language and the complexity of rule sets, especially in long contexts, LLMs often struggle to follow all specified rules, frequently omitting at least one. To enhance the reasoning and understanding of LLMs on long and complex contexts, we propose a novel prompting strategy Multi-Lingual Prompt, namely MLPrompt, which automatically translates the error-prone rule that an LLM struggles to follow into another language, thus drawing greater attention to it. Experimental results on public datasets across various tasks have shown MLPrompt can outperform state-of-the-art prompting methods such as Chain of Thought, Tree of Thought, and Self-Consistency. Additionally, we introduce a framework integrating MLPrompt with an auto-checking mechanism for structured data generation, with a specific case study in text-to-MIP instances. Further, we extend the proposed framework for text-to-SQL to demonstrate its generation ability towards structured data synthesis.
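The core idea of MLPrompt, restating the rule a model keeps violating in a second language so that it stands out, can be sketched as follows; the `translate` callable, target language, and prompt layout are placeholders, not the paper's implementation.

```python
# Hedged sketch of a cross-lingual prompt: the error-prone rule is repeated in
# another language to draw the model's attention to it.
def ml_prompt(task: str, rules: list[str], error_prone_idx: int, translate) -> str:
    lines = []
    for i, rule in enumerate(rules):
        lines.append(f"- {rule}")
        if i == error_prone_idx:
            # repeat the fragile rule in another language (placeholder MT call)
            lines.append(f"- {translate(rule, target_lang='zh')}")
    return task + "\nRules:\n" + "\n".join(lines)
```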
pdf
bib
abs
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models
Zihao Wei
|
Jingcheng Deng
|
Liang Pang
|
Hanxing Ding
|
Huawei Shen
|
Xueqi Cheng
The extensive utilization of large language models (LLMs) underscores the crucial necessity for precise and contemporary knowledge embedded within their intrinsic parameters. Existing research on knowledge editing primarily concentrates on monolingual scenarios, neglecting the complexities presented by multilingual contexts and multi-hop reasoning. To address these challenges, our study introduces MLaKE (Multilingual Language Knowledge Editing), a novel benchmark comprising 4072 multi-hop and 5360 single-hop questions designed to evaluate the adaptability of knowledge editing methods across five languages: English, Chinese, Japanese, French, and German. MLaKE aggregates fact chains from Wikipedia across languages and utilizes LLMs to generate questions and answers. We assessed the effectiveness of current multilingual knowledge editing methods using the MLaKE dataset. Our results show that due to considerable inconsistencies in both multilingual performance and encoding efficiency, these methods struggle to generalize effectively across languages. The accuracy of these methods when editing English is notably higher than for other languages. The experimental results further demonstrate that models encode knowledge and generation capabilities for different languages using distinct parameters, leading to poor cross-lingual transfer performance in current methods. Transfer performance is notably better within the same language family compared to across different families. These findings emphasize the urgent need to improve multilingual knowledge editing methods.
pdf
bib
abs
Factual Dialogue Summarization via Learning from Large Language Models
Rongxin Zhu
|
Jey Han Lau
|
Jianzhong Qi
Factual consistency is an important quality in dialogue summarization. Large language model (LLM)-based automatic text summarization models generate more factually consistent summaries compared to those by smaller pretrained language models, but they face deployment challenges in real-world applications due to privacy or resource constraints. In this paper, we investigate the use of symbolic knowledge distillation to improve the factual consistency of smaller pretrained models for dialogue summarization. We employ zero-shot learning to extract symbolic knowledge from LLMs, generating both factually consistent (positive) and inconsistent (negative) summaries. We then apply two contrastive learning objectives on these summaries to enhance smaller summarization models. Experiments with BART, PEGASUS, and Flan-T5 indicate that our approach surpasses strong baselines that rely on complex data augmentation strategies. Our approach demonstrates improved factual consistency while preserving coherence, fluency, and relevance, as verified by both automatic evaluation metrics and human assessments. We provide access to the data and code to facilitate future research.
pdf
bib
abs
QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
Mohammad Aflah Khan
|
Neemesh Yadav
|
Sarah Masud
|
Md. Shad Akhtar
The rise of large language models (LLMs) has created a need for advanced benchmarking systems beyond traditional setups. To this end, we introduce QUENCH, a novel text-based English Quizzing Benchmark manually curated and transcribed from YouTube quiz videos. QUENCH possesses masked entities and rationales for the LLMs to predict via generation. At the intersection of world knowledge, geographical context, and common sense reasoning, QUENCH helps assess world knowledge and deduction capabilities of LLMs via a zero-shot, open-domain quizzing setup. We perform an extensive evaluation on 7 LLMs and 4 metrics, investigating the influence of model size, prompting style, geographical context, and gold-labeled rationale generation. The benchmarking concludes with an error analysis of various types of generative errors to which the LLMs are prone.
pdf
bib
abs
GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
Sacha Muller
|
Antonio Loison
|
Bilel Omrani
|
Gautier Viaud
Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases. In this work, we address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems. To assess the calibration and discrimination capabilities of judge models, we identify 7 generator failure modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a meta-evaluation benchmark of 144 unit tests. This benchmark reveals that existing automated RAG evaluation frameworks often overlook important failure modes, even when using GPT-4 as a judge. To improve on the current design of automated RAG evaluation frameworks, we propose a novel pipeline and find that while closed models perform well on GroUSE, state-of-the-art open-source judges do not generalize to our proposed criteria, despite strong correlation with GPT-4’s judgement. Our findings suggest that correlation with GPT-4 is an incomplete proxy for the practical performance of judge models and should be supplemented with evaluations on unit tests for precise failure mode detection. We further show that finetuning Llama-3 on GPT-4’s reasoning traces significantly boosts its evaluation capabilities, improving upon both correlation with GPT-4’s evaluations and calibration on reference situations
pdf
bib
abs
Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models
Jiahui Li
|
Yongchang Hao
|
Haoyu Xu
|
Xing Wang
|
Yu Hong
Despite the advancements in training Large Language Models (LLMs) with alignment techniques to enhance the safety of generated content, these models remain susceptible to jailbreaking, an adversarial attack method that exposes security vulnerabilities in LLMs. Notably, the Greedy Coordinate Gradient (GCG) method has demonstrated the ability to automatically generate adversarial suffixes that jailbreak state-of-the-art LLMs. However, the optimization process involved in GCG is highly time-consuming, rendering the jailbreaking pipeline inefficient. In this paper, we investigate the process of GCG and identify an issue of Indirect Effect, the key bottleneck of the GCG optimization. To this end, we propose Model Attack Gradient Index GCG (MAGIC), which addresses the Indirect Effect by exploiting the gradient information of the suffix tokens, thereby accelerating the procedure with less computation and fewer iterations. Our experiments on AdvBench show that MAGIC achieves up to a 1.5x speedup, while maintaining Attack Success Rates (ASR) on par with or even higher than other baselines. MAGIC achieves an ASR of 74% on Llama-2 and an ASR of 54% when conducting transfer attacks on GPT-3.5. Code is available at https://github.com/jiah-li/magic.
pdf
bib
abs
Conditional Semantic Textual Similarity via Conditional Contrastive Learning
Xinyue Liu
|
Zeyang Qin
|
Zeyu Wang
|
Wenxin Liang
|
Linlin Zong
|
Bo Xu
Conditional semantic textual similarity (C-STS) assesses the similarity between pairs of sentence representations under different conditions. Current methods suffer from an over-estimation issue with positive and negative samples. Specifically, the similarity within positive samples is excessively high, while that within negative samples is excessively low. In this paper, we focus on the C-STS task and develop a conditional contrastive learning framework that constructs positive and negative samples from two perspectives, achieving the following primary objectives: (1) adaptive selection of the optimization direction for positive and negative samples to solve the over-estimation problem, and (2) fully balancing the effects of hard and false negative samples. We validate the proposed method with five models based on bi-encoder and tri-encoder architectures; the results show that our proposed method achieves state-of-the-art performance. The code is available at https://github.com/qinzeyang0919/CCL.
pdf
bib
abs
A Survey of Code-switched Arabic NLP: Progress, Challenges, and Future Directions
Injy Hamed
|
Caroline Sabty
|
Slim Abdennadher
|
Ngoc Thang Vu
|
Thamar Solorio
|
Nizar Habash
Language in the Arab world presents a complex diglossic and multilingual setting, involving the use of Modern Standard Arabic, various dialects and sub-dialects, as well as multiple European languages. This diverse linguistic landscape has given rise to code-switching, both within Arabic varieties and between Arabic and foreign languages. The widespread occurrence of code-switching across the region makes it vital to address these linguistic needs when developing language technologies. In this paper, we provide a review of the current literature in the field of code-switched Arabic NLP, offering a broad perspective on ongoing efforts, challenges, research gaps, and recommendations for future research directions.
pdf
bib
abs
Towards Database-Free Text-to-SQL Evaluation: A Graph-Based Metric for Functional Correctness
Yi Zhan
|
Longjie Cui
|
Han Weng
|
Guifeng Wang
|
Yu Tian
|
Boyi Liu
|
Yingxiang Yang
|
Xiaoming Yin
|
Jiajun Xie
|
Yang Sun
Execution Accuracy and Exact Set Match are two predominant metrics for evaluating the functional correctness of SQL queries in modern Text-to-SQL tasks. However, both metrics have notable limitations: Exact Set Match fails when queries are functionally equivalent but syntactically different, while Execution Accuracy is prone to false positives due to inadequately prepared test databases, which can be costly to create, particularly in large-scale industrial applications. To overcome these challenges, we propose a novel graph-based metric, FuncEvalGMN, that effectively overcomes the deficiencies of the aforementioned metric designs. Our method utilizes a relational operator tree (ROT), referred to as RelNode, to extract rich semantic information from the logical execution plan of SQL queries, and embed it into a graph. We then train a graph neural network (GNN) to perform graph matching on pairs of SQL queries through graph contrastive learning. FuncEvalGMN offers two highly desired advantages: (i) it requires only the database schema to derive logical execution plans, eliminating the need for extensive test database preparation, and (ii) it demonstrates strong generalization capabilities on unseen datasets. These properties highlight FuncEvalGMN’s robustness as a reliable metric for assessing functional correctness across a wide range of Text-to-SQL applications.
pdf
bib
abs
Modal Feature Optimization Network with Prompt for Multimodal Sentiment Analysis
Xiangmin Zhang
|
Wei Wei
|
Shihao Zou
Multimodal sentiment analysis (MSA) aims to understand human emotional states through multimodal signals. However, because the effective information carried by different modalities is not balanced, modalities containing less effective information cannot fully play their complementary role. Therefore, the goal of this paper is to fully exploit the effective information in each modality and further optimize under-optimized modal representations. To this end, we propose a novel Modal Feature Optimization Network (MFON) with a Modal Prompt Attention (MPA) mechanism for MSA. Specifically, we first determine which modalities are under-optimized in MSA, and then use relevant prompt information to focus the model on these features. This allows the model to focus more on the features of the modalities that need optimization, improving the utilization of each modality’s feature representation and facilitating initial information aggregation across modalities. Subsequently, we design an intra-modal knowledge distillation strategy for under-optimized modalities. This approach preserves the integrity of the modal features. Furthermore, we implement inter-modal contrastive learning to better extract related features across modalities, thereby optimizing the entire network. Finally, sentiment prediction is carried out through the effective fusion of multimodal information. Extensive experimental results on public benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art models.
pdf
bib
abs
Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies
Recep Firat Cekinel
|
Pinar Karagoz
|
Çağrı Çöltekin
This study evaluates the effectiveness of Vision Language Models (VLMs) in representing and utilizing multimodal content for fact-checking. To be more specific, we investigate whether incorporating multimodal content improves performance compared to text-only models and how well VLMs utilize text and image information to enhance misinformation detection. Furthermore, we propose a probing classifier based solution using VLMs. Our approach extracts embeddings from the last hidden layer of selected VLMs and inputs them into a neural probing classifier for multi-class veracity classification. Through a series of experiments on two fact-checking datasets, we demonstrate that while multimodality can enhance performance, fusing separate embeddings from text and image encoders yields superior results compared to using VLM embeddings. Furthermore, the proposed neural classifier significantly outperforms KNN and SVM baselines in leveraging extracted embeddings, highlighting its effectiveness for multimodal fact-checking.
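A minimal sketch of the probing-classifier idea, assuming PyTorch; the extraction of last-hidden-layer embeddings from a specific VLM is abstracted away, and the dimensions and number of veracity classes are illustrative assumptions:

```python
# Sketch: neural probing classifier over frozen VLM embeddings for veracity classification.
# Feature extraction from a specific VLM is abstracted; dimensions are assumptions.
import torch
import torch.nn as nn

class ProbingClassifier(nn.Module):
    def __init__(self, embed_dim=4096, hidden_dim=512, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, embeddings):      # (batch, embed_dim) from the VLM's last hidden layer
        return self.net(embeddings)     # (batch, num_classes) veracity logits

probe = ProbingClassifier()
fake_embeddings = torch.randn(8, 4096)  # stand-in for extracted claim+image embeddings
logits = probe(fake_embeddings)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (8,)))
```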
pdf
bib
abs
Faithful Inference Chains Extraction for Fact Verification over Multi-view Heterogeneous Graph with Causal Intervention
Daoqi Chen
|
Yaxin Li
|
Zizhong Zhu
|
Xiaowang Zhang
|
Zhiyong Feng
KG-based fact verification verifies the truthfulness of claims by retrieving evidence graphs from the knowledge graph. *Faithful inference chains*, i.e., precise relation paths between the mentioned entities and evidence entities, retrieve precise evidence graphs, addressing the poor performance and weak logic of fact verification. Due to the diversity of relation paths, existing methods rarely extract faithful inference chains. To alleviate these issues, we propose Multi-view Heterogeneous Graph with Causal Intervention (MHGCI): (i) We construct a Multi-view Heterogeneous Graph enhancing relation path extraction from the view of different mentioned entities. (ii) We propose a self-optimizing causal intervention model to generate assistant entities, mitigating the out-of-distribution problem caused by counterfactual relations. (iii) We propose a grounding method to extract evidence graphs from the KG via faithful inference chains. Experiments on the public KG-based fact verification dataset FactKG demonstrate that our model provides precise evidence graphs and achieves state-of-the-art performance.
pdf
bib
abs
SweetieChat: A Strategy-Enhanced Role-playing Framework for Diverse Scenarios Handling Emotional Support Agent
Jing Ye
|
Lu Xiang
|
Yaping Zhang
|
Chengqing Zong
Large Language Models (LLMs) have demonstrated promising potential in providing empathetic support during interactions. However, their responses often become verbose or overly formulaic, failing to adequately address the diverse emotional support needs of real-world scenarios. To tackle this challenge, we propose an innovative strategy-enhanced role-playing framework, designed to simulate authentic emotional support conversations. Specifically, our approach unfolds in two steps: (1) Strategy-Enhanced Role-Playing Interactions, which involve three pivotal roles—Seeker, Strategy Counselor, and Supporter—engaging in diverse scenarios to emulate real-world interactions and promote a broader range of dialogues; and (2) Emotional Support Agent Training, achieved through fine-tuning LLMs using our specially constructed dataset. Within this framework, we develop the ServeForEmo dataset, comprising an extensive collection of 3.7K+ multi-turn dialogues and 62.8K+ utterances. We further present SweetieChat, an emotional support agent capable of handling diverse open-domain scenarios. Extensive experiments and human evaluations confirm the framework’s effectiveness in enhancing emotional support, highlighting its unique ability to provide more nuanced and tailored assistance.
pdf
bib
abs
ELAINE-medLLM: Lightweight English Japanese Chinese Trilingual Large Language Model for Bio-medical Domain
Ken Yano
|
Zheheng Luo
|
Jimin Huang
|
Qianqian Xie
|
Masaki Asada
|
Chenhan Yuan
|
Kailai Yang
|
Makoto Miwa
|
Sophia Ananiadou
|
Jun’ichi Tsujii
We propose ELAINE (EngLish-jApanese-chINesE)-medLLM, a trilingual (English, Japanese, Chinese) large language model adapted for the bio-medical domain, based on Llama-3-8B. The training dataset was carefully curated in terms of volume and diversity to adapt to the biomedical domain and endow trilingual capability while preserving the knowledge and abilities of the base model. The training follows a two-stage path: continued pre-training and supervised fine-tuning (SFT). Our results demonstrate that ELAINE-medLLM exhibits superior trilingual capabilities compared to existing bilingual or multilingual medical LLMs without severely sacrificing the base model’s capability.
pdf
bib
abs
Debate-to-Write: A Persona-Driven Multi-Agent Framework for Diverse Argument Generation
Zhe Hu
|
Hou Pong Chan
|
Jing Li
|
Yu Yin
Writing arguments is a challenging task for both humans and machines. It entails incorporating high-level beliefs from various perspectives on the topic, along with deliberate reasoning and planning to construct a coherent narrative. Current language models often generate outputs autoregressively, lacking explicit integration of these underlying controls, resulting in limited output diversity and coherence. In this work, we propose a persona-based multi-agent framework for argument writing. Inspired by human debate, we first assign each agent a persona representing its high-level beliefs from a unique perspective, and then design an agent interaction process so that the agents can collaboratively debate and discuss the idea to form an overall plan for argument writing. Such a debate process enables fluid and nonlinear development of ideas. We evaluate our framework on argumentative essay writing. The results show that our framework generates more diverse and persuasive arguments, as confirmed by both automatic and human evaluations.
pdf
bib
abs
Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy
Min Zeng
|
Caiquan Liu
|
Shiqi Zhang
|
Li Xie
|
Chen Sang
|
Xiaoxin Chen
In recent years, the use of large language models (LLMs) for text classification has attracted widespread attention. Despite this, the classification accuracy of LLMs has not yet universally surpassed that of smaller models. LLMs can enhance their performance in text classification through fine-tuning. However, existing data quality research based on LLMs is challenging to apply directly to solve text classification problems. To further improve the performance of LLMs in classification tasks, this paper proposes a data quality enhancement (DQE) method for text classification based on LLMs. This method starts by using a greedy algorithm to select data, dividing the dataset into sampled and unsampled subsets, and then performing fine-tuning of the LLMs using the sampled data. Subsequently, this model is used to predict the outcomes for the unsampled data, categorizing incorrectly predicted data into uncovered, difficult, and noisy data. Experimental results demonstrate that our method effectively enhances the performance of LLMs in text classification tasks and significantly improves training efficiency, saving nearly half of the training time. Our method has achieved state-of-the-art performance in several open-source classification tasks.
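A minimal sketch of the greedy data-selection step, assuming precomputed sentence embeddings; the paper does not specify the exact greedy criterion here, so a max-min-distance (farthest-point) heuristic is assumed purely for illustration:

```python
# Sketch: greedy max-min-distance selection of a diverse training subset (DQE-style sampling).
# The actual DQE criterion may differ; embeddings are assumed to be precomputed.
import numpy as np

def greedy_select(embeddings: np.ndarray, k: int) -> list[int]:
    """Pick k indices whose embeddings are maximally spread out."""
    selected = [0]                                           # arbitrary seed point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < k:
        idx = int(dists.argmax())                            # farthest point from the selected set
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[idx], axis=1))
    return selected

emb = np.random.rand(1000, 768)                              # stand-in for sentence embeddings
sampled = greedy_select(emb, k=500)
sampled_set = set(sampled)
unsampled = [i for i in range(len(emb)) if i not in sampled_set]
# Fine-tune on `sampled`, predict on `unsampled`, then bucket errors into
# uncovered / difficult / noisy as described in the abstract.
```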
pdf
bib
abs
Slender-Mamba: Fully Quantized Mamba in 1.58 Bits From Head to Toe
Zhenxuan Yu
|
Takeshi Kojima
|
Yutaka Matsuo
|
Yusuke Iwasawa
Large language models (LLMs) have achieved significant performance improvements in the natural language processing (NLP) domain. However, these models often require large computational resources for training and inference. Recently, Mamba, a language model architecture based on State-Space Models (SSMs), has achieved comparable performance to Transformer models while significantly reducing costs by compressing context windows during inference. We focus on the potential of the lightweight Mamba architecture by applying the BitNet quantization method to the model architecture. In addition, while prior BitNet methods generally quantized only the linear layers in the main body, we extensively quantize the embedding and projection layers, considering their significant proportion of model parameters. In our experiments, we applied ternary quantization to the Mamba-2 (170M) architecture and pre-trained the model with 150B tokens from scratch. Our method achieves approximately a 90.0% reduction in the bits used by all parameters, a significant improvement compared with the 48.4% reduction of the conventional BitNet quantization method. In addition, our method experiences minimal performance degradation in both pre-training perplexity and downstream tasks. These findings demonstrate the potential of incorporating lightweight language models into edge devices, where demand is expected to grow in the future.
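A minimal sketch of BitNet-style 1.58-bit (ternary) weight quantization with absmean scaling, of the kind that could be applied to linear, embedding, and projection weights; this is a simplified form, and the paper's exact quantization and training details (e.g., straight-through estimation) may differ:

```python
# Sketch: absmean ternary quantization of a weight matrix to {-1, 0, +1}.
# Simplified; straight-through gradient estimation and per-layer details are omitted.
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    scale = w.abs().mean().clamp(min=eps)      # absmean scale of the full matrix
    w_q = (w / scale).round().clamp(-1, 1)     # ternary values in {-1, 0, +1}
    return w_q, scale                          # dequantize as w_q * scale

w = torch.randn(1024, 1024)
w_q, scale = ternary_quantize(w)
print(torch.unique(w_q))                       # tensor([-1., 0., 1.])
w_hat = w_q * scale                            # dequantized approximation of w
```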
pdf
bib
abs
What’s the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios
Xuelin Liu
|
Pengyuan Liu
|
Dong Yu
As large language models (LLMs) demonstrate impressive performance in various tasks and are increasingly integrated into the decision-making process, ensuring they align with human values has become crucial. This paper highlights that value priorities—the relative importance of different values—play a pivotal role in the decision-making process. To explore the value priorities in LLMs, this paper introduces INVP, a framework for INvestigating Value Priorities through decision-making in social scenarios. The framework encompasses social scenarios involving binary decision-making, covering both individual and collective decision-making contexts, and is based on Schwartz’s value theory for constructing value priorities. Using this framework, we construct a dataset, which contains a total of 1613 scenarios and 3226 decisions across 283 topics. We evaluate seven popular LLMs, and the experimental results reveal commonalities in the value priorities across different LLMs, such as an emphasis on Universalism and Benevolence, while Power and Hedonism are typically given lower priority. This study provides fresh insights into understanding and enhancing the moral and value alignment of LLMs when making complex social decisions.
pdf
bib
abs
BasqBBQ: A QA Benchmark for Assessing Social Biases in LLMs for Basque, a Low-Resource Language
Xabier Saralegi
|
Muitze Zulaika
The rise of pre-trained language models has revolutionized natural language processing (NLP) tasks, but concerns about the propagation of social biases in these models remain, particularly in under-resourced languages like Basque. This paper introduces BasqBBQ, the first benchmark designed to assess social biases in Basque across eight domains, using a multiple-choice question-answering (QA) task. We evaluate various autoregressive large language models (LLMs), including multilingual and those adapted for Basque, to analyze both their accuracy and bias transmission. Our results show that while larger models generally achieve better accuracy, ambiguous cases remain challenging. In terms of bias, larger models exhibit lower negative bias. However, high negative bias persists in specific categories such as Disability Status, Age and Physical Appearance, especially in ambiguous contexts. Conversely, categories such as Sexual Orientation, Gender Identity, and Race/Ethnicity show the least bias in ambiguous contexts. The continual pre-training based adaptation process for Basque has a limited impact on bias when compared with English. This work represents a key step toward creating more ethical LLMs for low-resource languages.
pdf
bib
abs
DynRank: Improve Passage Retrieval with Dynamic Zero-Shot Prompting Based on Question Classification
Abdelrahman Elsayed Mahmoud Abdallah
|
Jamshid Mozafari
|
Bhawna Piryani
|
Mohammed M. Abdelgwad
|
Adam Jatowt
This paper presents DynRank, a novel framework for enhancing passage retrieval in open-domain question-answering systems through dynamic zero-shot question classification. Traditional approaches rely on static prompts and pre-defined templates, which may limit model adaptability across different questions and contexts. In contrast, DynRank introduces a dynamic prompting mechanism, leveraging a pre-trained question classification model that categorizes questions into fine-grained types. Based on these classifications, contextually relevant prompts are generated, enabling more effective passage retrieval. We integrate DynRank into existing retrieval frameworks and conduct extensive experiments on multiple QA benchmark datasets.
pdf
bib
abs
Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting
Ananya Mukherjee
|
Saumitra Yadav
|
Manish Shrivastava
Evaluating machine translation (MT) systems for low-resource languages has long been a challenge due to the limited availability of evaluation metrics and resources. As a result, researchers in this space have relied primarily on lexical-based metrics like BLEU, TER, and ChrF, which lack semantic evaluation. In this first-of-its-kind work, we propose a novel pivot-based evaluation framework that addresses these limitations; after translating low-resource language outputs into a related high-resource language, we leverage advanced neural and embedding-based metrics for more meaningful evaluation. Through a series of experiments using five low-resource languages: Assamese, Manipuri, Kannada, Bhojpuri, and Nepali, we demonstrate how this method extends the coverage of both lexical-based and embedding-based metrics, even for languages not directly supported by advanced metrics. Our results show that the differences between direct and pivot-based evaluation scores are minimal, proving that this approach is a viable and effective solution for evaluating translations in endangered and low-resource languages. This work paves the way for more inclusive, accurate, and scalable MT evaluation for underrepresented languages, marking a significant step forward in this under-explored area of research. The code and data will be made available at https://github.com/AnanyaCoder/PivotBasedEvaluation.
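A minimal sketch of the pivot-based evaluation idea; the translation step is a hypothetical `pivot_translate` placeholder (to be replaced by any low-resource-to-high-resource MT system), and BERTScore is used only as one example of an embedding-based metric applied in the pivot language:

```python
# Sketch: pivot-based MT evaluation for a low-resource language.
# `pivot_translate` is a hypothetical placeholder; replace it with a real LR->HR MT system.
from bert_score import score as bert_score

def pivot_translate(sentences, src_lang, pivot_lang="hin"):
    # Identity stub standing in for an MT system; replace with a real translator.
    return sentences

def pivot_evaluate(hypotheses, references, src_lang):
    hyp_pivot = pivot_translate(hypotheses, src_lang)
    ref_pivot = pivot_translate(references, src_lang)
    # Embedding-based metric computed in the (related) high-resource pivot language.
    P, R, F1 = bert_score(hyp_pivot, ref_pivot, lang="hi")
    return F1.mean().item()
```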
pdf
bib
abs
The Shift from Logic to Dialectic in Argumentation Theory: Implications for Computational Argument Quality Assessment
Rositsa V Ivanova
|
Reto Gubelmann
In the field of computational argument quality assessment, logic and dialectic are essential dimensions used to measure the quality of argumentative texts. Both of them have found their way into the field due to their importance to argumentation theory. We trace the development of the core logical concepts of validity and soundness from their first use in argumentation theory to their understanding in state-of-the-art research. We show how, in the course of this development, dialectical considerations have taken center stage, at the cost of the logical perspective. Then, we take a closer look at the quality dimensions used in the field of computational argument quality assessment. Based on an analysis of prior empirical work in this field, we show how methodological considerations from argumentation theory can benefit state-of-the-art methods in computational argument quality assessment. We propose an even clearer separation between the two quality dimensions, not only with regard to their definitions but also with regard to the granularity at which the argumentative text is annotated and assessed.
pdf
bib
abs
Task-Oriented Dialog Systems for the Senegalese Wolof Language
Derguene Mbaye
|
Moussa Diallo
In recent years, there has been considerable interest in conversational agents with the rise of large language models (LLMs). Although they offer considerable advantages, LLMs also present significant risks, such as hallucination, which hinder their widespread deployment in industry. Moreover, low-resource languages such as African ones are still underrepresented in these systems, limiting their performance in these languages. In this paper, we illustrate a more classical approach based on modular architectures of Task-oriented Dialog Systems (ToDS), offering better control over outputs. We propose a chatbot generation engine based on the Rasa framework and a robust methodology for projecting annotations onto the Wolof language using an in-house machine translation system. After evaluating a generated chatbot trained on the Amazon Massive dataset, our Wolof Intent Classifier performs similarly to the one obtained for French, which is a resource-rich language. We also show that this approach is extensible to other low-resource languages, thanks to the intent classifier’s language-agnostic pipeline, simplifying the design of chatbots in these languages.
pdf
bib
abs
Disentangling Preference Representation and Text Generation for Efficient Individual Preference Alignment
Jianfei Zhang
|
Jun Bai
|
Bei Li
|
Yanmeng Wang
|
Rumei Li
|
Chenghua Lin
|
Wenge Rong
Aligning Large Language Models (LLMs) with general human preferences has proved crucial in improving the interaction quality between LLMs and humans. However, human values are inherently diverse among different individuals, making it insufficient to align LLMs solely with general preferences. To address this, personalizing LLMs according to individual feedback emerges as a promising solution. Nonetheless, this approach presents challenges in terms of the efficiency of alignment algorithms. In this work, we introduce a flexible paradigm for individual preference alignment. Our method fundamentally improves efficiency by disentangling preference representation from text generation in LLMs. We validate our approach across multiple text generation tasks and demonstrate that it can produce alignment quality as good as or better than PEFT-based methods, while reducing additional training time for each new individual preference by 80% to 90% in comparison with them.
pdf
bib
abs
A Survey of Generative Information Extraction
Zikang Zhang
|
Wangjie You
|
Tianci Wu
|
Xinrui Wang
|
Juntao Li
|
Min Zhang
Generative information extraction (Generative IE) aims to generate structured text sequences from unstructured text using a generative framework. Scaling in model size yields variations in adaptation and generalization, and also drives fundamental shifts in the techniques and approaches used within this domain. In this survey, we first review generative information extraction (IE) methods based on pre-trained language models (PLMs) and large language models (LLMs), focusing on their adaptation and generalization capabilities. We also discuss the connection between these methods and these two aspects. Furthermore, to balance task performance with the substantial computational demands associated with LLMs, we emphasize the importance of model collaboration. Finally, given the advanced capabilities of LLMs, we explore methods for integrating diverse IE tasks into unified models.
pdf
bib
abs
Interactive Evaluation for Medical LLMs via Task-oriented Dialogue System
Ruoyu Liu
|
Kui Xue
|
Xiaofan Zhang
|
Shaoting Zhang
This study focuses on evaluating proactive communication and diagnostic capabilities of medical Large Language Models (LLMs), which directly impact their effectiveness in patient consultations. In typical medical scenarios, doctors often ask a set of questions to gain a comprehensive understanding of patients’ conditions. We argue that single-turn question-answering tasks such as MultiMedQA are insufficient for evaluating LLMs’ medical consultation abilities. To address this limitation, we developed an evaluation benchmark called Multi-turn Medical Dialogue Evaluation (MMD-Eval), specifically designed to evaluate the proactive communication and diagnostic capabilities of medical LLMs during consultations. Considering the high cost and potential for hallucinations in LLMs, we innovatively trained a task-oriented dialogue system to simulate patients engaging in dialogues with the medical LLMs using our structured medical records dataset. This approach enabled us to generate multi-turn dialogue data. Subsequently, we evaluate the communication skills and medical expertise of the medical LLMs. All resources associated with this study will be made publicly available.
pdf
bib
abs
Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models
Haoran Lian
|
Junmin Chen
|
Wei Huang
|
Yizhe Xiong
|
Wenping Hu
|
Guiguang Ding
|
Hui Chen
|
Jianwei Niu
|
Zijia Lin
|
Fuzheng Zhang
|
Di Zhang
Recently, large language models (LLMs) have revolutionized Natural Language Processing (NLP). Pretrained LLMs, due to limited training context size, struggle with handling long token sequences, limiting their performance on various downstream tasks. Current solutions toward long context modeling often employ multi-stage continual pretraining, which progressively increases the effective context length through several continual pretraining stages. However, those approaches require extensive manual tuning and human expertise. In this paper, we introduce a novel single-stage continual pretraining method, Head-Adaptive Rotary Position Embedding (HARPE), to equip LLMs with long context modeling capabilities while simplifying the training process. Our HARPE leverages different Rotary Position Embedding (RoPE) base frequency values across different attention heads and directly trains LLMs on the target context length. Extensive experiments on 4 language modeling benchmarks, including the latest RULER benchmark, demonstrate that HARPE excels in understanding and integrating long-context tasks with single-stage training, matching and even outperforming existing multi-stage methods. Our results highlight that HARPE successfully breaks the stage barrier for training LLMs with long context modeling capabilities.
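A minimal sketch of the core idea of assigning a different RoPE base frequency to each attention head; the base values and their assignment across heads below are illustrative assumptions, not HARPE's actual settings:

```python
# Sketch: per-head RoPE inverse frequencies with head-specific base values.
# The bases below are illustrative; HARPE's actual assignment may differ.
import torch

def per_head_rope_inv_freq(head_dim: int, bases: list[float]) -> torch.Tensor:
    """Return inverse frequencies of shape (num_heads, head_dim // 2)."""
    exponents = torch.arange(0, head_dim, 2).float() / head_dim
    return torch.stack([base ** -exponents for base in bases])

num_heads, head_dim = 8, 64
bases = [10_000.0 * (2 ** i) for i in range(num_heads)]    # e.g., 10k, 20k, ... per head
inv_freq = per_head_rope_inv_freq(head_dim, bases)          # (8, 32)

positions = torch.arange(4096).float()
angles = positions[None, :, None] * inv_freq[:, None, :]    # (heads, seq, head_dim // 2)
cos, sin = angles.cos(), angles.sin()                        # fed into each head's rotation
```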
pdf
bib
abs
ACL-rlg: A Dataset for Reading List Generation
Julien Aubert-Béduchaud
|
Florian Boudin
|
Béatrice Daille
|
Richard Dufour
Familiarizing oneself with a new scientific field and its existing literature can be daunting due to the large amount of available articles. Curated lists of academic references, or reading lists, compiled by experts, offer a structured way to gain a comprehensive overview of a domain or a specific scientific challenge. In this work, we introduce ACL-rlg, the largest open expert-annotated reading list dataset. We also provide multiple baselines for evaluating reading list generation and formally define it as a retrieval task. Our qualitative study highlights that traditional scholarly search engines and indexing methods perform poorly on this task, and GPT-4o, despite showing better results, exhibits signs of potential data contamination.
pdf
bib
abs
SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding
Zhenglin Wang
|
Jialong Wu
|
Yilong Lai
|
Congzhi Zhang
|
Deyu Zhou
Large Language Models (LLMs) demonstrate remarkable emergent abilities across various tasks, yet fall short on complex reasoning and planning tasks. Tree-search-based reasoning methods address this by encouraging the exploration of intermediate steps, surpassing the capabilities of chain-of-thought prompting. However, significant inference latency is introduced due to the systematic exploration and evaluation of multiple thought paths. This paper introduces SEED, a novel and efficient inference framework that improves both runtime speed and GPU memory management concurrently. Based on scheduled speculative execution, SEED efficiently handles multiple iterations for thought generation and state evaluation, leveraging a rounds-scheduled strategy to manage draft model dispatching. Extensive experimental evaluations on three reasoning datasets demonstrate the superior speedup performance of SEED.
pdf
bib
abs
Extracting structure from an LLM - how to improve on surprisal-based models of Human Language Processing
Daphne P. Wang
|
Mehrnoosh Sadrzadeh
|
Miloš Stanojević
|
Wing-Yee Chow
|
Richard Breheny
Prediction and reanalysis are considered two key processes that underlie humans’ capacity to comprehend language in real time. Computational models capture these processes using Large Language Models (LLMs) and a statistical measure known as ‘surprisal’. Despite the successes of LLMs, surprisal-based models face challenges when it comes to sentences requiring reanalysis due to pervasive temporary structural ambiguities, such as garden path sentences. We ask whether structural information can be extracted from LLMs and develop a model that integrates it with their learnt statistics. When applied to a dataset of garden path sentences, the model achieved a significantly higher correlation with human reading times than surprisal. It also provided a better prediction of the garden path effect and could distinguish between sentence types with different levels of difficulty.
pdf
bib
abs
Evaluating Generalization Capability of Language Models across Abductive, Deductive and Inductive Logical Reasoning
Yu Sheng
|
Wanting Wen
|
Linjing Li
|
Daniel Zeng
Transformer-based language models (LMs) have demonstrated remarkable performance on many natural language tasks, yet the extent to which LMs can generalize to unseen logical rules remains insufficiently explored. In classical logic, abductive, deductive and inductive (ADI) reasoning are defined as the fundamental reasoning types, sharing identical reasoning primitives and properties, and some research has proposed that mutual generalization exists across them. However, in the field of natural language processing, previous research has generally studied LMs’ ADI reasoning capabilities separately, overlooking the generalization across them. To bridge this gap, we propose UniADILR, a novel logical reasoning dataset crafted for assessing the generalization capabilities of LMs across different logical rules. Based on UniADILR, we conduct extensive investigations from various perspectives of LMs’ performance on ADI reasoning. The experimental results reveal the weakness of current LMs in terms of extrapolating to unseen rules and inspire new insights for future research in logical reasoning.
pdf
bib
abs
Measuring the Robustness of Reference-Free Dialogue Evaluation Systems
Justin Vasselli
|
Adam Nohejl
|
Taro Watanabe
Advancements in dialogue systems powered by large language models (LLMs) have outpaced the development of reliable evaluation metrics, particularly for diverse and creative responses. We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks: speaker tag prefixes, static responses, ungrammatical responses, and repeated conversational context. We analyze metrics such as DialogRPT, UniEval, and PromptEval—a prompt-based method leveraging LLMs—across grounded and ungrounded datasets. By examining both their correlation with human judgment and susceptibility to adversarial attacks, we find that these two axes are not always aligned; metrics that appear to be equivalent when judged by traditional benchmarks may, in fact, vary in their scores of adversarial responses. These findings motivate the development of nuanced evaluation frameworks to address real-world dialogue challenges.
pdf
bib
abs
Towards Robust Comparisons of NLP Models: A Case Study
Vicente Ivan Sanchez Carmona
|
Shanshan Jiang
|
Bin Dong
Comparing the test scores of different NLP models across downstream datasets to determine which model leads to the most accurate results is the ultimate step in any experimental work. Doing so via a single mean score may not accurately quantify the real capabilities of the models. Previous works have proposed diverse statistical tests to improve the comparison of NLP models; however, a key statistical phenomenon remains understudied: variability in test scores. We propose a type of regression analysis which better explains this phenomenon by isolating the effect of both nuisance factors (such as random seeds) and datasets from the effects of the models’ capabilities. We showcase our approach via a case study of some of the most popular biomedical NLP models: after isolating nuisance factors and datasets, our results show that the difference between BioLinkBERT and MSR BiomedBERT is, actually, 7 times smaller than previously reported.
pdf
bib
abs
SILC-EFSA: Self-aware In-context Learning Correction for Entity-level Financial Sentiment Analysis
Senbin Zhu
|
ChenYuan He
|
Hongde Liu
|
Pengcheng Dong
|
Hanjie Zhao
|
Yuchen Yan
|
Yuxiang Jia
|
Hongying Zan
|
Min Peng
In recent years, fine-grained sentiment analysis in finance has gained significant attention, but the scarcity of entity-level datasets remains a key challenge. To address this, we have constructed the largest English and Chinese financial entity-level sentiment analysis datasets to date. Building on this foundation, we propose a novel two-stage sentiment analysis approach called Self-aware In-context Learning Correction (SILC). The first stage involves fine-tuning a base large language model to generate pseudo-labeled data specific to our task. In the second stage, we train a correction model using a GNN-based example retriever, which is informed by the pseudo-labeled data. This two-stage strategy has allowed us to achieve state-of-the-art performance on the newly constructed datasets, advancing the field of financial sentiment analysis. In a case study, we demonstrate the enhanced practical utility of our data and methods in monitoring the cryptocurrency market. Our datasets and code are available at https://github.com/NLP-Bin/SILC-EFSA.
pdf
bib
abs
Enhancing Criminal Investigation Analysis with Summarization and Memory-based Retrieval-Augmented Generation: A Comprehensive Evaluation of Real Case Data
Mads Skipanes
|
Tollef Emil Jørgensen
|
Kyle Porter
|
Gianluca Demartini
|
Sule Yildirim Yayilgan
This study introduces KriRAG, a novel Retrieval-Augmented Generation (RAG) architecture designed to assist criminal investigators in analyzing information and overcoming the challenge of information overload. KriRAG structures and summarizes extensive document collections based on existing investigative queries, providing relevant document references and detailed answers for each query. Working with unstructured data from two homicide case files comprising approximately 3,700 documents and 13,000 pages, a comprehensive evaluation methodology is established, incorporating semantic retrieval, scoring, reasoning, and query response accuracy. The system’s outputs are evaluated against queries and answers provided by criminal investigators, demonstrating promising performance with 97.5% accuracy in relevance assessment and 77.5% accuracy for query responses. These findings provide a rigorous foundation for other query-oriented and open-ended retrieval applications. KriRAG is designed to run offline on limited hardware, ensuring sensitive data handling and on-device availability.
pdf
bib
abs
Attention-Seeker: Dynamic Self-Attention Scoring for Unsupervised Keyphrase Extraction
Erwin Daniel Lopez Zapata
|
Cheng Tang
|
Atsushi Shimada
This paper proposes Attention-Seeker, an unsupervised keyphrase extraction method that leverages self-attention maps from a Large Language Model to estimate the importance of candidate phrases. Our approach identifies specific components – such as layers, heads, and attention vectors – where the model pays significant attention to the key topics of the text. The attention weights provided by these components are then used to score the candidate phrases. Unlike previous models that require manual tuning of parameters (e.g., selection of heads, prompts, hyperparameters), Attention-Seeker dynamically adapts to the input text without any manual adjustments, enhancing its practical applicability. We evaluate Attention-Seeker on four publicly available datasets: Inspec, SemEval2010, SemEval2017, and Krapivin. Our results demonstrate that, even without parameter tuning, Attention-Seeker outperforms most baseline models, achieving state-of-the-art performance on three out of four datasets, particularly excelling in extracting keyphrases from long documents.
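A minimal sketch of scoring candidate phrases by the self-attention their tokens receive, using a generic Hugging Face encoder with output_attentions; Attention-Seeker's dynamic selection of layers, heads, and attention vectors is simplified here to a plain average over all components:

```python
# Sketch: score candidate keyphrases by the attention their tokens receive.
# Averages over all layers/heads; the paper selects components dynamically.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

def score_candidates(text: str, candidates: list[str]) -> dict[str, float]:
    enc = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        attentions = model(**enc).attentions            # tuple of (1, heads, seq, seq)
    stacked = torch.stack(attentions)                    # (layers, 1, heads, seq, seq)
    received = stacked.mean(dim=(0, 1, 2, 3))            # avg attention each token receives
    tokens = tok.convert_ids_to_tokens(enc.input_ids[0].tolist())
    scores = {}
    for cand in candidates:
        cand_tokens = set(tok.tokenize(cand))
        idx = [i for i, t in enumerate(tokens) if t in cand_tokens]
        scores[cand] = received[idx].mean().item() if idx else 0.0
    return scores

print(score_candidates("Keyphrase extraction finds salient phrases in documents.",
                       ["keyphrase extraction", "documents"]))
```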
pdf
bib
abs
Evaluating Open-Source ASR Systems: Performance Across Diverse Audio Conditions and Error Correction Methods
Saki Imai
|
Tahiya Chowdhury
|
Amanda J. Stent
Despite significant advances in automatic speech recognition (ASR) accuracy, challenges remain. Naturally occurring conversation often involves multiple overlapping speakers of different ages, accents and genders, as well as noisy environments and suboptimal audio recording equipment, all of which reduce ASR accuracy. In this study, we evaluate the accuracy of state-of-the-art open-source ASR systems across diverse conversational speech datasets, examining the impact of audio and speaker characteristics on word error rate (WER). We then explore the potential of ASR ensembling and post-ASR correction methods to improve transcription accuracy. Our findings emphasize the need for robust error correction techniques and for continued attention to demographic biases to enhance ASR performance and inclusivity.
pdf
bib
abs
Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning
Yanfang Zhang
|
Yiliu Sun
|
Yibing Zhan
|
Dapeng Tao
|
Dacheng Tao
|
Chen Gong
Recently, increasing attention has been focused on improving the ability of Large Language Models (LLMs) to perform complex reasoning. Advanced methods, such as Chain-of-Thought (CoT) and its variants, are found to enhance their reasoning skills by designing suitable prompts or breaking down complex problems into more manageable sub-problems. However, little attention has been paid to the reasoning process itself: we discovered that most methods resort to Direct Reasoning (DR) and disregard Indirect Reasoning (IR). This makes it difficult for LLMs to solve IR tasks, which are often encountered in the real world. To address this issue, we propose a Direct-Indirect Reasoning (DIR) method, which considers DR and IR as multiple parallel reasoning paths that are merged to derive the final answer. We stimulate LLMs to implement IR by crafting prompt templates incorporating the principles of contrapositive and contradiction. These templates trigger LLMs to assume the negation of the conclusion as true, combine it with the premises to deduce a conclusion, and utilize the logical equivalence of the contrapositive to enhance their comprehension of the rules used in the reasoning process. Our DIR method is simple yet effective and can be straightforwardly integrated with existing variants of CoT methods. Experimental results on four datasets related to logical reasoning and mathematical proof demonstrate that our DIR method, when combined with various baseline methods, significantly outperforms all the original methods.
pdf
bib
abs
Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges
Vinay Samuel
|
Yue Zhou
|
Henry Peng Zou
As large language models achieve increasingly impressive results, questions arise about whether such performance stems from generalizability or mere data memorization. Thus, numerous data contamination detection methods have been proposed. However, these approaches are often validated with traditional benchmarks and early-stage LLMs, leaving uncertainty about their effectiveness when evaluating state-of-the-art LLMs on the contamination of more challenging benchmarks. To address this gap and provide a dual investigation of SOTA LLM contamination status and detection method robustness, we evaluate five contamination detection approaches with four state-of-the-art LLMs across eight challenging datasets often used in modern LLM evaluation. Our analysis reveals that (1) current methods have non-trivial limitations in their assumptions and practical applications; (2) notable difficulties exist in detecting contamination introduced during instruction fine-tuning with answer augmentation; and (3) there is limited consistency among SOTA contamination detection techniques. These findings highlight the complexity of contamination detection in advanced LLMs and the urgent need for further research on robust and generalizable contamination evaluation.
pdf
bib
abs
Can Large Language Models Understand You Better? An MBTI Personality Detection Dataset Aligned with Population Traits
Bohan Li
|
Jiannan Guan
|
Longxu Dou
|
Yunlong Feng
|
Dingzirui Wang
|
Yang Xu
|
Enbo Wang
|
Qiguang Chen
|
Bichen Wang
|
Xiao Xu
|
Yimeng Zhang
|
Libo Qin
|
Yanyan Zhao
|
Qingfu Zhu
|
Wanxiang Che
The Myers-Briggs Type Indicator (MBTI) is one of the most influential personality theories reflecting individual differences in thinking, feeling, and behaving. MBTI personality detection has garnered considerable research interest and has evolved significantly over the years. However, this task tends to be overly optimistic, as it currently does not align well with the natural distribution of population personality traits. Specifically, the self-reported labels in existing datasets result in data quality issues, and the hard labels fail to capture the full range of population personality distributions. In this paper, we address these issues by constructing MBTIBench, the first manually annotated MBTI personality detection dataset with soft labels, under the guidance of psychologists. Our experimental results confirm that soft labels can provide more benefits to other psychological tasks than hard labels. We highlight the polarized predictions and biases in LLMs as key directions for future research.
pdf
bib
abs
TMATH: A Dataset for Evaluating Large Language Models in Generating Educational Hints for Math Word Problems
Changyong Qi
|
Yuang Wei
|
Haoxin Xu
|
Longwei Zheng
|
Peiji Chen
|
Xiaoqing Gu
Large Language Models (LLMs) are increasingly being applied in education, showing significant potential in personalized instruction, student feedback, and intelligent tutoring. Generating hints for Math Word Problems (MWPs) has become a critical application, particularly in helping students understand problem-solving steps and logic. However, existing models struggle to provide pedagogically sound guidance that fosters learning without offering direct answers. To address this issue, we introduce TMATH, a dataset specifically designed to evaluate LLMs’ ability to generate high-quality hints for MWPs. TMATH contains diverse mathematical problems paired with carefully crafted, human-generated hints. To assess its impact, we fine-tuned a series of 7B-scale language models using TMATH. Our results, based on quantitative evaluations and expert assessments, show that while LLMs still face challenges in complex reasoning, the TMATH dataset significantly enhances their ability to generate more accurate and contextually appropriate educational hints.
pdf
bib
abs
A Benchmark of French ASR Systems Based on Error Severity
Antoine Tholly
|
Jane Wottawa
|
Mickael Rouvier
|
Richard Dufour
Automatic Speech Recognition (ASR) transcription errors are commonly assessed using metrics that compare them with a reference transcription, such as Word Error Rate (WER), which measures spelling deviations from the reference, or semantic score-based metrics. However, these approaches often overlook what is understandable to humans when interpreting transcription errors. To address this limitation, a new evaluation is proposed that categorizes errors into four levels of severity, further divided into subtypes, based on objective linguistic criteria, contextual patterns, and the use of content words as the unit of analysis. This metric is applied to a benchmark of 10 state-of-the-art ASR systems for French, encompassing both HMM-based and end-to-end models. Our findings reveal the strengths and weaknesses of each system, identifying those that provide the most comfortable reading experience for users.
pdf
bib
abs
What Makes Cryptic Crosswords Challenging for LLMs?
Abdelrahman Sadallah
|
Daria Kotova
|
Ekaterina Kochmar
Cryptic crosswords are puzzles that rely on general knowledge and the solver’s ability to manipulate language on different levels, dealing with various types of wordplay. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs). However, there is little to no research on the reasons for their poor performance on this task. In this paper, we establish the benchmark results for three popular LLMs: Gemma2, LLaMA3 and ChatGPT, showing that their performance on this task is still significantly below that of humans. We also investigate why these models struggle to achieve superior performance. We release our code and introduced datasets at https://github.com/bodasadallah/decrypting-crosswords.
pdf
bib
abs
Improving the Efficiency of Visually Augmented Language Models
Paula Ontalvilla
|
Aitor Ormazabal
|
Gorka Azkune
Despite the impressive performance of autoregressive Language Models (LMs), it has been shown that, due to reporting bias, LMs lack visual knowledge, i.e. they do not know much about the visual world and its properties. To augment LMs with visual knowledge, existing solutions often rely on explicit images, requiring time-consuming retrieval or image generation systems. This paper shows that explicit images are not necessary to visually augment an LM. Instead, we use visually-grounded text representations obtained from the well-known CLIP multimodal system. For a fair comparison, we modify VALM, a visually-augmented LM which uses image retrieval and representation, to work directly with visually-grounded text representations. We name this new model BLIND-VALM. We show that BLIND-VALM performs on par with VALM for Visual Language Understanding (VLU), Natural Language Understanding (NLU) and Language Modeling tasks, despite being significantly more efficient and simpler. We also show that by scaling up our model within the compute budget of VALM, either by increasing the model size or the pre-training corpus size, we outperform VALM on all the evaluation tasks.
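A minimal sketch of obtaining visually-grounded text representations from CLIP's text encoder with Hugging Face Transformers; how BLIND-VALM injects these vectors into the language model is not shown and the checkpoint is only an example:

```python
# Sketch: visually-grounded text embeddings from CLIP's text encoder.
# How these vectors are fused into the LM (as in BLIND-VALM) is not shown here.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = tok(["a ripe banana is yellow"], padding=True, return_tensors="pt")
    text_embeds = text_encoder(**inputs).text_embeds   # (1, 512) grounded representation
print(text_embeds.shape)
```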
pdf
bib
abs
Refer to the Reference: Reference-focused Synthetic Automatic Post-Editing Data Generation
Sourabh Dattatray Deoghare
|
Diptesh Kanojia
|
Pushpak Bhattacharyya
A prevalent approach to synthetic APE data generation uses source (src) sentences in a parallel corpus to obtain translations (mt) through an MT system and treats the corresponding reference (ref) sentences as post-edits (pe). While effective, due to the independence between ‘mt’ and ‘pe,’ these translations do not adequately reflect errors to be corrected by a human post-editor. Thus, we introduce a novel and simple yet effective reference-focused synthetic APE data generation technique that uses ‘ref’ instead of ‘src’ sentences to obtain corrupted translations (mt_new). The experimental results across English-German, English-Russian, English-Marathi, English-Hindi, and English-Tamil language pairs demonstrate the superior performance of APE systems trained using the newly generated synthetic data compared to those trained using existing synthetic data. Further, APE models trained using a balanced mix of existing and newly generated synthetic data achieve improvements of 0.37, 0.19, 1.01, 2.42, and 2.60 TER points, respectively. We will release the generated synthetic APE data.
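A minimal sketch of the data-generation idea; the MT system is a hypothetical `translate` placeholder, and round-trip translation of the reference is assumed here purely for illustration, since the abstract does not spell out how 'ref' is corrupted into 'mt_new':

```python
# Sketch: reference-focused synthetic APE triplet generation.
# `translate` is a hypothetical placeholder MT system; round-trip corruption is an assumption.
def translate(sentences, src_lang, tgt_lang):
    """Hypothetical: any MT system translating between the two languages."""
    raise NotImplementedError

def make_ape_triplets(src_sents, ref_sents, src_lang="en", tgt_lang="de"):
    # Round-trip the references so that mt_new contains errors anchored to ref.
    back = translate(ref_sents, tgt_lang, src_lang)
    mt_new = translate(back, src_lang, tgt_lang)
    # Each training triplet: (source, corrupted translation, post-edit = reference).
    return list(zip(src_sents, mt_new, ref_sents))
```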
pdf
bib
abs
EvoPrompt: Evolving Prompts for Enhanced Zero-Shot Named Entity Recognition with Large Language Models
Zeliang Tong
|
Zhuojun Ding
|
Wei Wei
Large language models (LLMs) possess extensive prior knowledge and powerful in-context learning (ICL) capabilities, presenting significant opportunities for low-resource tasks. Though effective, several key issues still have not been well-addressed when focusing on zero-shot named entity recognition (NER), including the misalignment between model and human definitions of entity types, and confusion of similar types. This paper proposes an Evolving Prompts framework that guides the model to better address these issues through continuous prompt refinement. Specifically, we leverage the model to summarize the definition of each entity type and the distinctions between similar types (i.e., entity type guidelines). An iterative process is introduced to continually adjust and improve these guidelines. Additionally, since high-quality demonstrations are crucial for effective learning yet challenging to obtain in zero-shot scenarios, we design a strategy motivated by self-consistency and prototype learning to extract reliable and diverse pseudo samples from the model’s predictions. Experiments on four benchmarks demonstrate the effectiveness of our framework, showing consistent performance improvements.
pdf
bib
abs
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
Bo Li
|
Shaolin Zhu
|
Lijie Wen
Image Translation (IT) holds immense potential across diverse domains, enabling the translation of textual content within images into various languages. However, existing datasets often suffer from limitations in scale, diversity, and quality, hindering the development and evaluation of IT models. To address this issue, we introduce MIT-10M, a large-scale parallel corpus of multilingual image translation with over 10M image-text pairs derived from real-world data, which has undergone extensive data cleaning and multilingual translation validation. It contains 0.8M images in three sizes, spanning 28 categories, tasks at three levels of difficulty, and image-text pairs in 14 languages, which is a considerable improvement over existing datasets. We conduct extensive experiments to evaluate and train models on MIT-10M. The experimental results clearly indicate that our dataset has higher adaptability when it comes to evaluating the performance of models in tackling challenging and complex image translation tasks in the real world. Moreover, the performance of the model fine-tuned with MIT-10M has tripled compared to the baseline model, further confirming its superiority.
pdf
bib
abs
Synthetic Paths to Integral Truth: Mitigating Hallucinations Caused by Confirmation Bias with Synthetic Data
Changwon Ok
|
Eunkyeong Lee
|
Dongsuk Oh
Recently, large language models (LLMs) have made significant progress through retrieval-augmented generation (RAG) and preference learning. However, they still exhibit issues such as confirmation bias, the tendency to favor information that confirms one’s beliefs, which remains largely unexplored in current research. In this paper, we propose a novel approach to mitigate confirmation bias-induced hallucination in LLMs through a synthetic data construction pipeline and Direct Preference Optimization (DPO) training. Our method enhances the integration of diverse and complementary information from multiple passages retrieved by RAG, enabling more balanced and accurate reasoning. Experimental results demonstrate significant improvements in response accuracy and reduced hallucination on benchmarks such as Natural Questions Open and HaluBench. These findings suggest that our approach effectively mitigates confirmation bias in long-context question answering, with potential applications to other NLP tasks. We release our data and evaluation/training code for public access at https://github.com/OccasionallyNLP/Synthetic-Paths-to-Integral-Truth.git.
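A minimal sketch of the standard DPO objective used in the preference-training stage, assuming sequence log-probabilities are precomputed; this is the generic DPO loss, not the paper's full synthetic-data and RAG pipeline:

```python
# Sketch: the standard DPO loss over preferred (chosen) vs. dispreferred (rejected) responses.
# Log-probabilities are assumed precomputed; the RAG/synthetic-data pipeline is omitted.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

loss = dpo_loss(torch.tensor([-20.0]), torch.tensor([-25.0]),
                torch.tensor([-22.0]), torch.tensor([-24.0]))
print(loss)
```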
pdf
bib
abs
Unlike “Likely”, “Unlike” is Unlikely: BPE-based Segmentation hurts Morphological Derivations in LLMs
Paul Lerner
|
François Yvon
Large Language Models (LLMs) rely on subword vocabularies to process and generate text. However, because subwords are marked as initial- or intra-word, we find that LLMs perform poorly at handling some types of affixation, which hinders their ability to generate novel (unobserved) word forms. The largest models trained on enough data can mitigate this tendency because their initial- and intra-word embeddings are aligned; in-context learning also helps when all examples are selected in a consistent way; but only morphological segmentation can achieve near-perfect accuracy.
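A minimal sketch of inspecting how a BPE vocabulary segments a base form versus a derived form; GPT-2's byte-level BPE tokenizer is used only as an example, and the exact subword splits depend on the tokenizer, so no specific output is asserted:

```python
# Sketch: compare BPE segmentations of base vs. derived word forms.
# Exact subword splits depend on the tokenizer; GPT-2 is used only as an example.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for word in [" likely", " unlikely", " like", " unlike"]:
    # The leading space marks word-initial position in GPT-2's byte-level BPE.
    print(repr(word), "->", tok.tokenize(word))
```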
pdf
bib
abs
WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario
Jiebin Zhang
|
Eugene J. Yu
|
Qinyu Chen
|
Chenhao Xiong
|
Dawei Zhu
|
Han Qian
|
Mingbo Song
|
Weimin Xiong
|
Xiaoguang Li
|
Qun Liu
|
Sujian Li
Generating comprehensive and accurate Wikipedia articles for newly emerging events under real-world scenarios presents significant challenges. Existing attempts fall short either by focusing only on short snippets or by using metrics that are insufficient to evaluate real-world scenarios. In this paper, we construct WIKIGENBENCH, a new benchmark consisting of 1,320 entries, designed to align with real-world scenarios in both generation and evaluation. For generation, we explore a real-world scenario where structured, full-length Wikipedia articles with citations are generated for new events using input documents from web sources. For evaluation, we integrate systematic metrics and LLM-based metrics to assess the verifiability, organization, and other aspects aligned with real-world scenarios. Based on this benchmark, we conduct extensive experiments using various models within three commonly used frameworks: direct RAG, hierarchical structure-based RAG, and RAG with a fine-tuned generation model. Experimental results show that hierarchical-based methods can generate more comprehensive content, while fine-tuned methods achieve better verifiability. However, even the best methods still show a significant gap compared to existing Wikipedia content, indicating that further research is necessary.
pdf
bib
abs
LLMs meet Bloom’s Taxonomy: A Cognitive View on Large Language Model Evaluations
Thomas Huber
|
Christina Niklaus
Current evaluation approaches for Large Language Models (LLMs) lack a structured framework that reflects the underlying cognitive abilities required for solving the tasks. This hinders a thorough understanding of the current level of LLM capabilities. For instance, it is widely accepted that LLMs perform well in terms of grammar, but it is unclear in what specific cognitive areas they excel or struggle in. This paper introduces a novel perspective on the evaluation of LLMs that leverages a hierarchical classification of tasks. Specifically, we explore the most widely used benchmarks for LLMs to systematically identify how well these existing evaluation methods cover the levels of Bloom’s Taxonomy, a hierarchical framework for categorizing cognitive skills. This comprehensive analysis allows us to identify strengths and weaknesses in current LLM assessment strategies in terms of cognitive abilities, suggest directions for future benchmark development, and highlight potential avenues for LLM research. Our findings reveal that LLMs generally perform better on the lower end of Bloom’s Taxonomy. Additionally, we find that there are significant gaps in the coverage of cognitive skills in the most commonly used benchmarks.
pdf
bib
abs
Exploring Fine-Grained Human Motion Video Captioning
Bingchan Zhao
|
Xinyi Liu
|
Zhuocheng Yu
|
Tongchen Yang
|
Yifan Song
|
Mingyu Jin
|
Sujian Li
|
Yizhou Wang
Detailed descriptions of human motion are crucial for effective fitness training, which highlights the importance of research in fine-grained human motion video captioning. Existing video captioning models often fail to capture the nuanced semantics of videos, resulting in generated descriptions that are coarse and lack detail, especially when depicting human motions. To benchmark the Body Fitness Training scenario, in this paper, we construct a fine-grained human motion video captioning dataset named BoFiT and design a state-of-the-art baseline model named BoFiT-Gen (Body Fitness Training Text Generation). BoFiT-Gen makes use of computer vision techniques to extract angular representations of human motions from videos and LLMs to generate fine-grained descriptions of human motions via prompting. Results show that BoFiT-Gen outperforms previous methods on comprehensive metrics. We aim for this dataset to serve as a useful evaluation set for visio-linguistic models and drive further progress in this field. Our dataset is released at https://github.com/colmon46/bofit.
pdf
bib
abs
DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles
Jiaxuan Liu
|
Zhaoci Liu
|
Yajun Hu
|
Yingying Gao
|
Shilei Zhang
|
Zhenhua Ling
Human speech exhibits rich and flexible prosodic variations. To address the one-to-many mapping problem from text to prosody in a reasonable and flexible manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a conditional diffusion module and an improved classifier-free guidance, which hierarchically models speech prosodic features, and controls different prosodic styles to guide prosody prediction. Experiments show that our method outperforms all baselines in naturalness and achieves superior synthesis speed compared to three diffusion-based baselines. Additionally, by adjusting the guiding scale, DiffStyleTTS effectively controls the guidance intensity of the synthetic prosody.
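The abstract mentions an improved classifier-free guidance whose guiding scale controls how strongly the synthesized prosody follows the conditioning; the exact variant is not specified, so the sketch below only shows the standard classifier-free guidance combination, with `cond_pred`, `uncond_pred`, and `scale` as illustrative names rather than the paper's implementation.

```python
import torch

def cfg_combine(cond_pred: torch.Tensor, uncond_pred: torch.Tensor, scale: float) -> torch.Tensor:
    """Standard classifier-free guidance: blend the conditional and unconditional
    denoiser outputs; scale=0 ignores the condition, scale=1 follows it exactly,
    and scale>1 amplifies its influence (stronger prosody guidance)."""
    return uncond_pred + scale * (cond_pred - uncond_pred)
```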
pdf
bib
abs
OpenForecast: A Large-Scale Open-Ended Event Forecasting Dataset
Zhen Wang
|
Xi Zhou
|
Yating Yang
|
Bo Ma
|
Lei Wang
|
Rui Dong
|
Azmat Anwar
Complex events generally exhibit unforeseen, multifaceted, and multi-step developments, and cannot be well handled by existing closed-ended event forecasting methods, which are constrained by a limited answer space. In order to accelerate the research on complex event forecasting, we introduce OpenForecast, a large-scale open-ended dataset with two features: (1) OpenForecast defines three open-ended event forecasting tasks, enabling unforeseen, multifaceted, and multi-step forecasting. (2) OpenForecast collects and annotates a large-scale dataset from Wikipedia and news, including 43,419 complex events spanning from 1950 to 2024. Particularly, this annotation can be completed automatically without any manual annotation cost. Meanwhile, we introduce an automatic LLM-based Retrieval-Augmented Evaluation method (LRAE) for complex events, enabling OpenForecast to evaluate the complex event forecasting ability of large language models. Finally, we conduct comprehensive human evaluations to verify the quality and challenges of OpenForecast, and the consistency between the LRAE metric and human evaluation. OpenForecast and related codes will be publicly released.
pdf
bib
abs
A Knowledge Graph Reasoning-Based Model for Computerized Adaptive Testing
Xinyi Qiu
|
Zhiyun Chen
The significance of Computerized Adaptive Testing (CAT) is self-evident in contemporary Intelligent Tutoring Systems (ITSs), which aim to recommend suitable questions for students based on their knowledge state. In recent years, Graph Neural Networks (GNNs) and Reinforcement Learning (RL) methods have been increasingly applied to CAT. While these approaches have achieved empirical success, they still face limitations, such as inadequate handling of concept relevance when multiple concepts are involved and incomplete evaluation metrics. To address these issues, we propose a Knowledge Graph Reasoning-Based Model for CAT (KGCAT), which leverages the reasoning power of knowledge graphs (KGs) to capture the semantic and relational information between concepts and questions while focusing on reducing the noise caused by concepts with low relevance by utilizing mutual information. Additionally, a multi-objective reinforcement learning framework is employed to incorporate multiple evaluation objectives, further refining question selection and improving the overall effectiveness of CAT. Empirical evaluations conducted on three authentic educational datasets demonstrate that the proposed model outperforms existing methods in both accuracy and interpretability.
pdf
bib
abs
TOOL-ED: Enhancing Empathetic Response Generation with the Tool Calling Capability of LLM
Huiying Cao
|
Yiqun Zhang
|
Shi Feng
|
Xiaocui Yang
|
Daling Wang
|
Yifei Zhang
Empathetic conversation is a crucial characteristic in daily conversations between individuals. Nowadays, Large Language Models (LLMs) have shown outstanding performance in generating empathetic responses. Knowledge bases like COMET can assist LLMs in mitigating hallucinations and enhancing the understanding of users’ intentions and emotions. However, models remain heavily reliant on fixed knowledge bases, and unrestricted incorporation of external knowledge can introduce noise. Tool learning is a flexible end-to-end approach that assists LLMs in handling complex problems. In this paper, we propose the Emotional Knowledge Tool Calling (EKTC) framework, which encapsulates the commonsense knowledge bases as empathetic tools, enabling LLMs to integrate external knowledge flexibly through tool calling. In order to adapt the models to the new task, we construct a novel dataset TOOL-ED based on the EMPATHETICDIALOGUE (ED) dataset. We validate EKTC on the ED dataset, and the experimental results demonstrate that our framework can enhance the ability of LLMs to generate empathetic responses effectively. Our code is available at https://anonymous.4open.science/r/EKTC-3FEF.
pdf
bib
abs
Annotating the French Wiktionary with supersenses for large scale lexical analysis: a use case to assess form-meaning relationships within the nominal lexicon
Nicolas Angleraud
|
Lucie Barque
|
Marie Candito
Many languages lack broad-coverage, semantically annotated lexical resources, which limits empirical research on lexical semantics for these languages. In this paper, we report on how we automatically enriched the French Wiktionary with general semantic classes, known as supersenses, using a limited amount of manually annotated data. We trained a classifier combining sense definition classification and sense exemplar classification. The resulting resource, with an evaluated supersense accuracy of nearly 85% (92% for hypersenses), is used in a case study illustrating how such a semantically enriched resource can be leveraged to empirically test linguistic hypotheses about the lexicon, on a large scale.
pdf
bib
abs
When Evolution Strategy Meets Language Models Tuning
Bo Huang
|
Yuxin Jiang
|
Mingyang Chen
|
Yi Wang
|
Hongyang Chen
|
Wei Wang
Supervised Fine-tuning has been pivotal in training autoregressive language models, yet it introduces exposure bias. To mitigate this, Post Fine-tuning, including on-policy and off-policy methods, has emerged as a solution to enhance models further. However, each has its limitations regarding performance enhancements and susceptibility to overfitting. In this paper, we introduce a novel on-policy approach called Evolution Strategy Optimization (ESO), which is designed by harnessing the principle of biological evolution, namely survival of the fittest. Particularly, we consider model tuning as an evolution process, and each output sentence generated by the model can provide a perturbation signal to the model parameter space. Then, the fitness of perturbation signals is quantified by the difference between its score and the averaged one offered by a reward function, which guides the optimization process. Empirically, the proposed method can achieve superior performance in various tasks and comparable performance in the human alignment task.
pdf
bib
abs
Unveiling Entity-Level Unlearning for Large Language Models: A Comprehensive Analysis
Weitao Ma
|
Xiaocheng Feng
|
Weihong Zhong
|
Lei Huang
|
Yangfan Ye
|
Xiachong Feng
|
Bing Qin
Large language model unlearning has garnered increasing attention due to its potential to address security and privacy concerns, leading to extensive research in the field. However, existing studies have predominantly focused on instance-level unlearning, specifically targeting the removal of predefined instances containing sensitive content. This focus has left a gap in the exploration of removing an entire entity, which is critical in real-world scenarios such as copyright protection. To close this gap, we propose a novel task named Entity-level unlearning, which aims to erase entity-related knowledge from the target model completely. To investigate this task, we systematically evaluate popular unlearning algorithms, revealing that current methods struggle to achieve effective entity-level unlearning. Then, we further explore the factors that influence the performance of unlearning algorithms, identifying that the knowledge coverage of the forget set and its size play pivotal roles. Notably, our analysis also uncovers that entities introduced through fine-tuning are more vulnerable than pre-trained entities during unlearning. We hope these findings can inspire future improvements in entity-level unlearning for LLMs.
pdf
bib
abs
Knowledge Graph Pooling and Unpooling for Concept Abstraction
Juan Li
|
Wen Zhang
|
Zhiqiang Liu
|
Mingchen Tu
|
Mingyang Chen
|
Ningyu Zhang
|
Shijian Li
Knowledge graph embedding (KGE) aims to embed entities and relations as vectors in a continuous space and has proven to be effective for KG tasks. Recently, graph neural network (GNN)-based KGEs have gained much attention due to their strong capability of encoding complex graph structures. However, most GNN-based KGEs are directly optimized based on the instance triples in KGs, ignoring the latent concepts and hierarchies of the entities. Though some works explicitly inject concepts and hierarchies into models, they are limited to predefined concepts and hierarchies, which are missing in a lot of KGs. Thus, in this paper, we propose a novel framework with KG Pooling and unpooling and Contrastive Learning (KGPCL) to abstract and encode the latent concepts for better KG prediction. Specifically, with an input KG, we first construct a U-KG through KG pooling and unpooling. KG pooling abstracts the input graph to a smaller graph as a pooled graph, and KG unpooling recovers the input graph from the pooled graph. Then we model the U-KG with relational KGEs to get the representations of entities and relations for prediction. Finally, we propose the local and global contrastive loss to jointly enhance the representation of entities. Experimental results show that our models outperform the KGE baselines on the link prediction task.
pdf
bib
abs
Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation
Jia Gu
|
Liang Pang
|
Huawei Shen
|
Xueqi Cheng
With the rapid advancement of large language models (LLMs) for handling complex language tasks, an increasing number of studies are employing LLMs as agents to emulate the sequential decision-making processes of humans, often represented as Markov decision processes (MDPs). The actions in MDPs adhere to specific probability distributions and require iterative sampling. This arouses curiosity regarding the capacity of LLM agents to comprehend probability distributions, thereby guiding the agent’s behavioral decision-making through probabilistic sampling and generating behavioral sequences. To answer the above question, we divide the problem into two main aspects: sequence simulation with explicit probability distribution and sequence simulation with implicit probability distribution. Our analysis indicates that LLM agents can understand probabilities, but they struggle with probability sampling. Their ability to perform probabilistic sampling can be improved to some extent by integrating coding tools, but this level of sampling precision still makes it difficult to simulate human behavior as agents.
pdf
bib
abs
Pseudo-label Data Construction Method and Syntax-enhanced Model for Chinese Semantic Error Recognition
Hongyan Wu
|
Nankai Lin
|
Shengyi Jiang
|
Lianxi Wang
|
Aimin Yang
Chinese Semantic Error Recognition (CSER) has always been a weak link in Chinese language processing due to the complexity and obscurity of Chinese semantics. Existing research has gradually focused on leveraging pre-trained models to perform CSER. Although some researchers have attempted to integrate syntax information into the pre-trained language model, it requires training the models from scratch, which is time-consuming and laborious. Furthermore, despite the existence of datasets for CSER, the constrained size of these datasets impairs the performance of the models. Thus, in order to address the difficulty posed by a limited sample set and the need to annotate samples with semantic-level errors, we propose a Pseudo-label Data Construction method for CSER (PDC-CSER), generating pseudo-labels for augmented samples based on perplexity and the model, respectively, which overcomes the difficulty of constructing pseudo-label data containing semantic-level errors and ensures the quality of pseudo-labels. Moreover, we propose a CSER method with the Dependency Syntactic Attention mechanism (CSER-DSA) to explicitly infuse dependency syntactic information only in the fine-tuning stage, achieving robust performance and simultaneously reducing substantial computing power and time cost. Results demonstrate that the pseudo-label technology PDC-CSER and the semantic error recognition method CSER-DSA surpass the existing models.
pdf
bib
abs
An Active Learning Framework for Inclusive Generation by Large Language Models
Sabit Hassan
|
Anthony B. Sicilia
|
Malihe Alikhani
Ensuring that Large Language Models (LLMs) generate text representative of diverse sub-populations is essential, particularly when key concepts related to under-represented groups are scarce in the training data. We address this challenge with a novel clustering-based active learning framework, enhanced with knowledge distillation. The proposed framework transforms the intermediate outputs of the learner model, enabling effective active learning for generative tasks for the first time. Integration of clustering and knowledge distillation yields more representative models without prior knowledge of underlying data distribution and overbearing human efforts. We validate our approach in practice through case studies in counter-narration and style transfer. We construct two new datasets in tandem with model training, showing a performance improvement of 2%–10% over baseline models. Our results also show more consistent performance across various data subgroups and increased lexical diversity, underscoring our model’s resilience to skewness in available data. Further, our results show that the data acquired via our approach improves the performance of secondary models not involved in the learning loop, showcasing practical utility of the framework.
pdf
bib
abs
Multimodal Extraction and Recognition of Arabic Implicit Discourse Relations
Ahmed Ruby
|
Christian Hardmeier
|
Sara Stymne
Most research on implicit discourse relation identification has focused on written language, however, it is also crucial to understand these relations in spoken discourse. We introduce a novel method for implicit discourse relation identification across both text and speech, that allows us to extract examples of semantically equivalent pairs of implicit and explicit discourse markers, based on aligning speech+transcripts with subtitles in another language variant. We apply our method to Egyptian Arabic, resulting in a novel high-quality dataset of spoken implicit discourse relations. We present a comprehensive approach to modeling implicit discourse relation classification using audio and text data with a range of different models. We find that text-based models outperform audio-based models, but combining text and audio features can lead to enhanced performance.
pdf
bib
abs
Post-Hoc Watermarking for Robust Detection in Text Generated by Large Language Models
Jifei Hao
|
Jipeng Qiang
|
Yi Zhu
|
Yun Li
|
Yunhao Yuan
|
Xiaoye Ouyang
Research on text simplification has been ongoing for many years, yet document simplification remains a significant challenge due to the need to address complex factors such as technical terminology, metaphors, and overall coherence. In this work, we introduce a novel multi-agent framework AgentSimp for document simplification, based on large language models. This framework simulates the collaborative efforts of a team of human experts through the roles played by multiple agents, effectively meeting the intricate demands of document simplification. We investigate two communication strategies among agents (pipeline-style and synchronous) and two document reconstruction strategies (Direct and Iterative). According to both automatic evaluation metrics and human evaluation results, AgentSimp produces simplified documents that are more thoroughly simplified and more coherent across various articles and styles.
pdf
bib
abs
RA-MTR: A Retrieval Augmented Multi-Task Reader based Approach for Inspirational Quote Extraction from Long Documents
Sayantan Adak
|
Animesh Mukherjee
Inspirational quotes from famous individuals are often used to convey thoughts in news articles, essays, and everyday conversations. In this paper, we propose a novel context-based quote extraction system that aims to predict the most relevant quote from a long text. We formulate this quote extraction as an open domain question answering problem, first by employing a vector-store based retriever and then applying a multi-task reader. We curate three context-based quote extraction datasets and introduce a novel multi-task framework RA-MTR that improves the state-of-the-art performance, achieving a maximum improvement of 5.08% in BoW F1-score.
pdf
bib
abs
VeritasQA: A Truthfulness Benchmark Aimed at Multilingual Transferability
Javier Aula-Blasco
|
Júlia Falcão
|
Susana Sotelo
|
Silvia Paniagua
|
Aitor Gonzalez-Agirre
|
Marta Villegas
As Large Language Models (LLMs) become available in a wider range of domains and applications, evaluating the truthfulness of multilingual LLMs is an issue of increasing relevance. TruthfulQA (Lin et al., 2022) is one of few benchmarks designed to evaluate how models imitate widespread falsehoods. However, it is strongly English-centric and starting to become outdated. We present VeritasQA, a context- and time-independent truthfulness benchmark built with multilingual transferability in mind, and available in Spanish, Catalan, Galician and English. VeritasQA comprises a set of 353 questions and answers inspired by common misconceptions and falsehoods that are not tied to any particular country or recent event. We release VeritasQA under an open license and present the evaluation results of 15 models of various architectures and sizes.
pdf
bib
abs
ECC: Synergizing Emotion, Cause and Commonsense for Empathetic Dialogue Generation
Xu Wang
|
Bo Wang
|
Yihong Tang
|
Dongming Zhao
|
Jing Liu
|
Ruifang He
|
Yuexian Hou
Empathy improves human-machine dialogue systems by enhancing the user’s experience. While traditional models have aimed to detect and express users’ emotions from dialogue history, they neglect the crucial and complex interactions among emotion, emotion causes, and commonsense. To address this, we introduce the ECC (Emotion, Cause, and Commonsense) framework, which leverages specialized encoders to capture the key features of emotion, cause, and commonsense and collaboratively models these through a Conditional Variational Auto-Encoder. ECC further employs novel loss functions to refine the interplay of three factors and generates empathetic responses using an energy-based model supported by ODE sampling. Empirical results on the EmpatheticDialogues dataset demonstrate that ECC outperforms existing baselines, offering a robust solution for empathetic dialogue generation.
pdf
bib
abs
GraphOTTER: Evolving LLM-based Graph Reasoning for Complex Table Question Answering
Qianlong Li
|
Chen Huang
|
Shuai Li
|
Yuanxin Xiang
|
Deng Xiong
|
Wenqiang Lei
Complex Table Question Answering involves providing accurate answers to specific questions based on intricate tables that exhibit complex layouts and flexible header locations. Despite considerable progress having been made in the LLM era, the reasoning processes of existing methods are often implicit, feeding the entire table into prompts, making it difficult to effectively filter out irrelevant information in the table. To this end, we propose GraphOTTER that explicitly establishes the reasoning process to pinpoint the correct answers. In particular, GraphOTTER leverages a graph-based representation, transforming the complex table into an undirected graph. It then conducts step-by-step reasoning on the graph, with each step guided by a set of pre-defined intermediate reasoning actions. As such, it constructs a clear reasoning path and effectively identifies the answer to a given question. Comprehensive experiments on two benchmark datasets and two LLM backbones demonstrate the effectiveness of GraphOTTER. Further analysis indicates that its success may be attributed to the ability to efficiently filter out irrelevant information, thereby focusing the reasoning process on the most pertinent data. Our code and experimental datasets are available at
https://github.com/JDing0521/GraphOTTER.
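As a rough illustration of the graph-based representation described above, the sketch below turns a table (given as a list of rows) into an undirected graph by linking adjacent cells in the same row or column; the actual node and edge types used by GraphOTTER may differ, so treat the function and attribute names as assumptions.

```python
import networkx as nx

def table_to_graph(table):
    """Build an undirected graph whose nodes are table cells and whose edges
    connect horizontally or vertically adjacent cells (illustrative only)."""
    g = nx.Graph()
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            g.add_node((i, j), text=str(cell))
            if j > 0:
                g.add_edge((i, j - 1), (i, j), relation="same_row")
            if i > 0 and j < len(table[i - 1]):
                g.add_edge((i - 1, j), (i, j), relation="same_column")
    return g
```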
pdf
bib
abs
Persona-Consistent Dialogue Generation via Pseudo Preference Tuning
Junya Takayama
|
Masaya Ohagi
|
Tomoya Mizumoto
|
Katsumasa Yoshikawa
We propose a simple yet effective method for enhancing persona consistency in dialogue response generation using Direct Preference Optimization (DPO). In our method, we generate responses from the response generation model using persona information that has been randomly swapped with data from other dialogues, treating these responses as pseudo-negative samples. The reference responses serve as positive samples, allowing us to create pseudo-preference data. Experimental results demonstrate that our model, fine-tuned with DPO on the pseudo preference data, produces more consistent and natural responses compared to models trained using supervised fine-tuning or reinforcement learning approaches based on entailment relations between personas and utterances.
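Since the abstract describes standard DPO applied to pseudo-preference pairs (the reference response as the chosen sample, the persona-swapped response as the rejected one), a minimal sketch of that loss is shown below; the sequence log-probability inputs and `beta` value are illustrative assumptions, not the paper's exact setup.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization on (chosen, rejected) pairs, where 'chosen'
    is the reference response and 'rejected' the pseudo-negative generated with
    randomly swapped persona information (all inputs are sequence log-probs)."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```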
pdf
bib
abs
Montague semantics and modifier consistency measurement in neural language models
Danilo Silva de Carvalho
|
Edoardo Manino
|
Julia Rozanova
|
Lucas Cordeiro
|
André Freitas
This work proposes a novel methodology for measuring compositional behavior in contemporary language embedding models. Specifically, we focus on adjectival modifier phenomena in adjective-noun phrases. In recent years, distributional language representation models have demonstrated great practical success. At the same time, the need for interpretability has elicited questions on their intrinsic properties and capabilities. Crucially, distributional models are often inconsistent when dealing with compositional phenomena in natural language, which has significant implications for their safety and fairness. Despite this, most current research on compositionality is directed towards improving their performance on similarity tasks only. This work takes a different approach, introducing three novel tests of compositional behavior inspired by Montague semantics. Our experimental results indicate that current neural language models do not behave according to the expected linguistic theories. This indicates that current language models may lack the capability to capture the semantic properties we evaluated on limited context, or that linguistic theories from Montagovian tradition may not match the expected capabilities of distributional models.
pdf
bib
abs
LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation
Hongyun Zhou
|
Xiangyu Lu
|
Wang Xu
|
Conghui Zhu
|
Tiejun Zhao
|
Muyun Yang
Low-Rank Adaptation (LoRA) is currently the most commonly used Parameter-efficient fine-tuning (PEFT) method. However, it still incurs high computational and storage costs when applied to models with billions of parameters. Most previous studies have tackled this issue by using pruning techniques. Nonetheless, these efforts only analyze LoRA parameter features to evaluate their importance, such as parameter count, size, and gradient. In fact, the output of LoRA directly impacts the fine-tuned model. Preliminary experiments indicate that a fraction of LoRA modules possess significantly high output values, substantially influencing the layer output. Motivated by this observation, we propose LoRA-drop. Concretely, LoRA-drop evaluates the importance of LoRA based on the LoRA output. Then we retain the LoRA of important layers, while the other layers share the same LoRA. We conduct abundant experiments with models of different scales on NLU and NLG tasks. Results demonstrate that LoRA-drop can achieve performance comparable to full fine-tuning and LoRA while retaining 50% of the LoRA parameters on average.
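One plausible reading of the output-based importance described above is sketched below: score each LoRA module by the norm of its update BAx on calibration activations and keep a dedicated LoRA only for the highest-scoring layers. The exact statistic and sharing scheme in the paper may differ; the dictionary layout and names here are assumptions.

```python
import torch

def lora_output_scores(lora_modules, activations):
    """Score each LoRA module by the average norm of its output x @ A^T @ B^T on
    cached calibration activations (an illustrative importance measure)."""
    scores = {}
    for name, mod in lora_modules.items():
        x = activations[name]                 # (batch, d_in) cached layer inputs
        delta = x @ mod["A"].T @ mod["B"].T   # LoRA update to the layer output
        scores[name] = delta.norm(dim=-1).mean().item()
    return scores

def select_layers_to_keep(scores, keep_ratio=0.5):
    """Retain dedicated LoRA for the top-scoring layers; the rest would share one LoRA."""
    k = max(1, int(len(scores) * keep_ratio))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])
```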
pdf
bib
abs
Leveraging Language-based Representations for Better Solving Symbol-related Problems with Large Language Models
Yile Wang
|
Sijie Cheng
|
Zixin Sun
|
Peng Li
|
Yang Liu
Symbols such as numerical sequences, chemical formulas, and table delimiters exist widely, playing important roles in symbol-related tasks such as abstract reasoning, chemical property prediction, and tabular question-answering. Compared to tasks based on natural language expressions, large language models (LLMs) have limitations in understanding and reasoning on symbol-based representations, making it difficult for them to handle symbol-related problems. In this paper, we propose symbol-to-language (S2L), a method that converts symbol-based representations to language-based representations, providing valuable information for language models during reasoning. We found that, for both closed-source and open-source LLMs, the capability to solve symbol-related problems can be largely enhanced by incorporating such language-based representations. For example, by employing S2L for GPT-4, there can be substantial improvements of +21.9% and +9.5% accuracy for 1D-ARC and Dyck language tasks, respectively. There is also a consistent improvement in six other general symbol-related tasks such as table understanding and Tweet analysis. We release the GPT logs at https://github.com/THUNLP-MT/symbol2language.
pdf
bib
abs
Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning
Aditya Narayan Sankaran
|
Reza Farahbakhsh
|
Noel Crespi
Online abusive content detection, particularly in low-resource settings and within the audio modality, remains underexplored. We investigate the potential of pre-trained audio representations for detecting abusive language in low-resource languages, in this case, in Indian languages using Few Shot Learning (FSL). Leveraging powerful representations from models such as Wav2Vec and Whisper, we explore cross-lingual abuse detection using the ADIMA dataset with FSL. Our approach integrates these representations within the Model-Agnostic Meta-Learning (MAML) framework to classify abusive language in 10 languages. We experiment with various shot sizes (50-200) evaluating the impact of limited data on performance. Additionally, a feature visualization study was conducted to better understand model behaviour. This study highlights the generalization ability of pre-trained models in low-resource scenarios and offers valuable insights into detecting abusive language in multilingual contexts.
pdf
bib
abs
MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators
Qingyu Lu
|
Liang Ding
|
Kanjian Zhang
|
Jinxia Zhang
|
Dacheng Tao
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment, providing both scores and fine-grained feedback. Although approaches such as GEMBA-MQM have shown state-of-the-art performance on reference-free evaluation, the predicted errors do not align well with those annotated by humans, limiting their interpretability as feedback signals. To enhance the quality of error annotations predicted by LLM evaluators, we introduce a universal and training-free framework, **MQM-APE**, based on the idea of filtering out non-impactful errors by Automatically Post-Editing (APE) the original translation based on each error, leaving only those errors that contribute to quality improvement. Specifically, we prompt the LLM to act as 1) *evaluator* to provide error annotations, 2) *post-editor* to determine whether errors impact quality improvement and 3) *pairwise quality verifier* as the error filter. Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM, across eight LLMs in both high- and low-resource languages. Orthogonal to trained approaches, MQM-APE complements translation-specific evaluators such as Tower, highlighting its broad applicability. Further analysis confirms the effectiveness of each module and offers valuable insights into evaluator design and LLM selection.
pdf
bib
abs
MOPO: Multi-Objective Prompt Optimization for Affective Text Generation
Yarik Menchaca Resendiz
|
Roman Klinger
How emotions are expressed depends on the context and domain. On X (formerly Twitter), for instance, an author might simply use the hashtag #anger, while in a news headline, emotions are typically written in a more polite, indirect manner. To enable conditional text generation models to create emotionally connotated texts that fit a domain, users need to have access to a parameter that allows them to choose the appropriate way to express an emotion. To achieve this, we introduce MOPO, a Multi-Objective Prompt Optimization methodology. MOPO optimizes prompts according to multiple objectives (which correspond here to the output probabilities assigned by emotion classifiers trained for different domains). In contrast to single objective optimization, MOPO outputs a set of prompts, each with a different weighting of the multiple objectives. Users can then choose the most appropriate prompt for their context. We evaluate MOPO using three objectives, determined by various domain-specific emotion classifiers. MOPO improves performance by up to 15 pp across all objectives with a minimal loss (1–2 pp) for any single objective compared to single-objective optimization. These minor performance losses are offset by a broader generalization across multiple objectives – which is not possible with single-objective optimization. Additionally, MOPO reduces computational requirements by simultaneously optimizing for multiple objectives, eliminating separate optimization procedures for each objective.
pdf
bib
abs
PropaInsight: Toward Deeper Understanding of Propaganda in Terms of Techniques, Appeals, and Intent
Jiateng Liu
|
Lin Ai
|
Zizhou Liu
|
Payam Karisani
|
Zheng Hui
|
Yi Fung
|
Preslav Nakov
|
Julia Hirschberg
|
Heng Ji
Propaganda plays a critical role in shaping public opinion and fueling disinformation. While existing research primarily focuses on identifying propaganda techniques, it lacks the ability to capture the broader motives and the impacts of such content. To address these challenges, we introduce PropaInsight, a conceptual framework grounded in foundational social science research, which systematically dissects propaganda into techniques, arousal appeals, and underlying intent. PropaInsight offers a more granular understanding of how propaganda operates across different contexts. Additionally, we present PropaGaze, a novel dataset that combines human-annotated data with high-quality synthetic data generated through a meticulously designed pipeline. Our experiments show that off-the-shelf LLMs struggle with propaganda analysis, but PropaGaze significantly improves performance. Fine-tuned Llama-7B-Chat achieves 203.4% higher text span IoU in technique identification and 66.2% higher BertScore in appeal analysis compared to 1-shot GPT-4-Turbo. Moreover, PropaGaze complements limited human-annotated data in data-sparse and cross-domain scenarios, demonstrating its potential for comprehensive and generalizable propaganda analysis.
pdf
bib
abs
MQA-KEAL: Multi-hop Question Answering under Knowledge Editing for Arabic Language
Muhammad Asif Ali
|
Nawal Daftardar
|
Mutayyba Waheed
|
Jianbin Qin
|
Di Wang
Large Language Models (LLMs) have demonstrated significant capabilities across numerous application domains. A key challenge is to keep these models updated with the latest available information, which limits the true potential of these models for end-applications. Although there have been numerous attempts at LLMs’ Knowledge Editing (KE), i.e., to update and/or edit the LLMs’ prior knowledge and in turn test it via Multi-hop Question Answering (MQA), so far these studies have primarily been focused on and/or developed for the English language. To bridge this gap, in this paper we propose Multi-hop Question Answering under Knowledge Editing for Arabic Language (MQA-KEAL). MQA-KEAL stores knowledge edits as structured knowledge units in an external memory. In order to solve a multi-hop question, it first uses task decomposition to decompose the question into smaller sub-problems. Later, for each sub-problem, it iteratively queries the external memory and/or target LLM in order to generate the final response. In addition, we also contribute MQUAKE-AR (an Arabic translation of the English benchmark MQUAKE), as well as a new benchmark MQA-AEVAL for rigorous performance evaluation of MQA under KE for the Arabic language. Experimental evaluation reveals that MQA-KEAL outperforms the baseline models by a significant margin. We release the code for MQA-KEAL at https://github.com/asif6827/MQA-Keal.
pdf
bib
abs
A Novel Negative Sample Generation Method for Contrastive Learning in Hierarchical Text Classification
Juncheng Zhou
|
Lijuan Zhang
|
Yachen He
|
Rongli Fan
|
Lei Zhang
|
Jian Wan
Hierarchical text classification (HTC) is an important task in natural language processing (NLP). Existing methods typically utilize both text features and the hierarchical structure of labels to categorize text effectively. However, these approaches often struggle with fine-grained labels, which are closely similar, leading to difficulties in accurate classification. At the same time, contrastive learning has significant advantages in strengthening fine-grained label features and discrimination. However, the performance of contrastive learning strongly depends on the construction of negative samples. In this paper, we design a hierarchical sequence ranking (HiSR) method for generating diverse negative samples. These samples maximize the effectiveness of contrastive learning to enhance the ability of the model to distinguish between fine-grained labels and improve the performance of the model in HTC. Specifically, we transform the entire label set into linear sequences based on the hierarchical structure and rank these sequences according to their quality. During model training, the most suitable negative samples are dynamically selected from the ranked sequences. Then contrastive learning amplifies the differences between similar fine-grained labels by emphasizing the distinction between the ground truth and the generated negative samples, thereby enhancing the discriminative ability of the model. Our method has been tested on three public datasets and achieves state-of-the-art (SOTA) performance on two of them, demonstrating its effectiveness.
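To make the contrastive step above concrete, here is a minimal InfoNCE-style sketch in which the text representation is pulled toward the embedding of its ground-truth label sequence and pushed away from embeddings of the generated negative sequences; the embedding shapes and temperature are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def label_sequence_contrastive_loss(text_emb, pos_seq_emb, neg_seq_embs, tau=0.07):
    """text_emb: (B, d), pos_seq_emb: (B, d), neg_seq_embs: (B, N, d).
    Cross-entropy over [positive, negatives] similarities (InfoNCE)."""
    pos = F.cosine_similarity(text_emb, pos_seq_emb, dim=-1) / tau                # (B,)
    neg = F.cosine_similarity(text_emb.unsqueeze(1), neg_seq_embs, dim=-1) / tau  # (B, N)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    targets = torch.zeros(text_emb.size(0), dtype=torch.long, device=text_emb.device)
    return F.cross_entropy(logits, targets)
```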
pdf
bib
abs
Edge-free but Structure-aware: Prototype-Guided Knowledge Distillation from GNNs to MLPs
Taiqiang Wu
|
Zhe Zhao
|
Jiahao Wang
|
Xingyu Bai
|
Lei Wang
|
Ngai Wong
|
Yujiu Yang
Distilling high-accuracy Graph Neural Networks (GNNs) to low-latency multilayer perceptrons (MLPs) on graph tasks has become a hot research topic. However, conventional MLP learning relies almost exclusively on graph nodes and fails to effectively capture the graph structural information. Previous methods address this issue by processing graph edges into extra inputs for MLPs, but such graph structures may be unavailable for various scenarios. To this end, we propose Prototype-Guided Knowledge Distillation (PGKD), which does not require graph edges (edge-free setting) yet learns structure-aware MLPs. Our insight is to distill graph structural information from GNNs. Specifically, we first employ the class prototypes to analyze the impact of graph structures on GNN teachers, and then design two losses to distill such information from GNNs to MLPs. Experimental results on popular graph benchmarks demonstrate the effectiveness and robustness of the proposed PGKD.
pdf
bib
abs
A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language Models
Ahatsham Hayat
|
Mohammad R. Hasan
This paper presents a novel approach named Contextually Relevant Imputation leveraging pre-trained Language Models (CRILM) for handling missing data in tabular datasets. Instead of relying on traditional numerical estimations, CRILM uses pre-trained language models (LMs) to create contextually relevant descriptors for missing values. This method aligns datasets with LMs’ strengths, allowing large LMs to generate these descriptors and small LMs to be fine-tuned on the enriched datasets for enhanced downstream task performance. Our evaluations demonstrate CRILM’s superior performance and robustness across MCAR, MAR, and challenging MNAR scenarios, with up to a 10% improvement over the best-performing baselines. By mitigating biases, particularly in MNAR settings, CRILM improves downstream task performance and offers a cost-effective solution for resource-constrained environments.
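A toy sketch of the idea of replacing missing cells with contextually relevant descriptors (rather than numeric imputations) before handing rows to a language model is given below; the descriptor phrases and helper name are illustrative assumptions, not the paper's prompts.

```python
import math

def row_to_text(row: dict, descriptors: dict) -> str:
    """Serialize a tabular row for an LM, substituting a textual descriptor
    for each missing value instead of a numerically imputed estimate."""
    parts = []
    for col, val in row.items():
        if val is None or (isinstance(val, float) and math.isnan(val)):
            parts.append(f"{col} is {descriptors.get(col, 'not recorded')}")
        else:
            parts.append(f"{col} is {val}")
    return ", ".join(parts) + "."

# e.g. row_to_text({"age": 54, "blood_pressure": None},
#                  {"blood_pressure": "unavailable for this patient"})
```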
pdf
bib
abs
Using Game Play to Investigate Multimodal and Conversational Grounding in Large Multimodal Models
Sherzod Hakimov
|
Yerkezhan Abdullayeva
|
Kushal Koshti
|
Antonia Schmidt
|
Yan Weiser
|
Anne Beyer
|
David Schlangen
While the situation has improved for text-only models, it again seems to be the case currently that multimodal (text and image) models develop faster than ways to evaluate them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through the goal-oriented game (self) play, complementing reference-based and preference-based evaluation. Specifically, we define games that challenge a model’s capability to represent a situation from visual information and align such representations through dialogue. We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of the performance. There is still room to grow for both kinds of models, ensuring the continued relevance of the benchmark.
pdf
bib
abs
PADO: Personality-induced multi-Agents for Detecting OCEAN in human-generated texts
Haein Yeo
|
Taehyeong Noh
|
Seungwan Jin
|
Kyungsik Han
As personality can be useful in many cases, such as better understanding people’s underlying contexts or providing personalized services, research has long focused on modeling personality from data. However, the development of personality detection models faces challenges due to the inherent latent and relative characteristics of personality, as well as the lack of annotated datasets. To address these challenges, our research focuses on methods that effectively exploit the inherent knowledge of Large Language Models (LLMs). We propose a novel approach that compares contrasting perspectives to better capture the relative nature of personality traits. In this paper, we introduce PADO (Personality-induced multi-Agent framework for Detecting OCEAN of the Big Five personality traits), the first LLM-based multi-agent personality detection framework. PADO employs personality-induced agents to analyze text from multiple perspectives, followed by a comparative judgment process to determine personality trait levels. Our experiments with various LLM models, from GPT-4o to LLaMA3-8B, demonstrate PADO’s effectiveness and generalizability, especially with smaller parameter models. This approach offers a more nuanced, context-aware method for personality detection, potentially improving personalized services and insights into digital behavior. We will release our codes.
pdf
bib
abs
Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
Taiqiang Wu
|
Chaofan Tao
|
Jiahao Wang
|
Runming Yang
|
Zhe Zhao
|
Ngai Wong
Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable over the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither mode-seeking nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective and both converge after a sufficient number of epochs. However, due to practical constraints, LLMs are seldom trained for such an extensive number of epochs. Meanwhile, we further find that RKL focuses on the tail part of the distributions, while FKL focuses on the head part during the early epochs. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
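For reference, the two divergences being combined are sketched below over teacher distribution p and student distribution q; the adaptive weight in the paper is derived from head/tail mass, which the abstract does not detail, so a fixed placeholder weight is used here.

```python
import torch.nn.functional as F

def akl_loss(teacher_logits, student_logits, w=0.5):
    """Weighted combination of forward KL(p||q) and reverse KL(q||p) between
    teacher (p) and student (q) token distributions; w is a placeholder for
    the paper's adaptive, head/tail-based weighting."""
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    log_p, log_q = p.log(), q.log()
    fkl = (p * (log_p - log_q)).sum(-1)  # emphasizes the head of the teacher distribution
    rkl = (q * (log_q - log_p)).sum(-1)  # emphasizes the tail of the teacher distribution
    return (w * fkl + (1.0 - w) * rkl).mean()
```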
pdf
bib
abs
Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation
Zijie Zhong
|
Hanwen Liu
|
Xiaoya Cui
|
Xiaofan Zhang
|
Zengchang Qin
Integrating information from various reference databases is a major challenge for Retrieval-Augmented Generation (RAG) systems because each knowledge source adopts a unique data structure and follows different conventions. Retrieving from multiple knowledge sources with one fixed strategy usually leads to under-exploitation of information. To mitigate this drawback, inspired by Mixture-of-Experts, we introduce Mix-of-Granularity (MoG), a method that dynamically determines the optimal granularity of a knowledge source based on input queries using a router. The router is efficiently trained with a newly proposed loss function employing soft labels. We further extend MoG to MoG-Graph (MoGG), where reference documents are pre-processed as graphs, enabling the retrieval of distantly situated snippets. Experiments demonstrate that MoG and MoGG effectively predict optimal granularity levels, significantly enhancing the performance of the RAG system in downstream tasks. The code of both MoG and MoGG will be made public.
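The router trained with soft labels can be pictured as below: it scores a small set of candidate chunk granularities for each query and is trained against a soft target distribution (e.g., derived from downstream answer quality). The candidate granularities and function names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

GRANULARITIES = ("sentence", "paragraph", "section")  # illustrative candidate levels

def soft_label_router_loss(router_logits, soft_labels):
    """Soft cross-entropy: soft_labels is a distribution over granularities
    per query rather than a single hard choice."""
    log_probs = F.log_softmax(router_logits, dim=-1)
    return -(soft_labels * log_probs).sum(dim=-1).mean()

def pick_granularity(router_logits):
    """At inference, retrieve at the granularity the router scores highest
    (router_logits: 1-D tensor over GRANULARITIES)."""
    return GRANULARITIES[int(torch.argmax(router_logits))]
```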
pdf
bib
abs
Multilingual Knowledge Editing with Language-Agnostic Factual Neurons
Xue Zhang
|
Yunlong Liang
|
Fandong Meng
|
Songming Zhang
|
Yufeng Chen
|
Jinan Xu
|
Jie Zhou
Multilingual knowledge editing (MKE) aims to simultaneously update factual knowledge across multiple languages within large language models (LLMs). Previous research indicates that the same knowledge across different languages within LLMs exhibits a degree of shareability. However, most existing MKE methods overlook the connections of the same knowledge between different languages, resulting in knowledge conflicts and limited edit performance. To address this issue, we first investigate how LLMs process multilingual factual knowledge and discover that the same factual knowledge in different languages generally activates a shared set of neurons, which we call language-agnostic factual neurons (LAFNs). These neurons represent the same factual knowledge shared across languages and imply the semantic connections among multilingual knowledge. Inspired by this finding, we propose a new MKE method by Locating and Updating Language-Agnostic Factual Neurons (LU-LAFNs) to edit multilingual knowledge simultaneously, which avoids knowledge conflicts and thus improves edit performance. Experimental results on Bi-ZsRE and MzsRE benchmarks demonstrate that our method achieves the best edit performance, indicating the effectiveness and importance of modeling the semantic connections among multilingual knowledge.
pdf
bib
abs
MURRE: Multi-Hop Table Retrieval with Removal for Open-Domain Text-to-SQL
Xuanliang Zhang
|
Dingzirui Wang
|
Longxu Dou
|
Qingfu Zhu
|
Wanxiang Che
The open-domain text-to-SQL task aims to retrieve question-relevant tables from massive databases and generate SQL. However, the performance of current methods is constrained by single-hop retrieval, and existing multi-hop retrieval of open-domain question answering is not directly applicable due to the tendency to retrieve tables similar to the retrieved ones but irrelevant to the question. This is because the questions in text-to-SQL usually contain all required information, whereas previous multi-hop retrieval supplements the questions with retrieved documents. Therefore, we propose multi-hop table retrieval with removal (MURRE), which removes previously retrieved information from the question to guide the retriever towards unretrieved relevant tables. Our experiments on two open-domain text-to-SQL datasets demonstrate an average improvement of 5.7% over the previous state-of-the-art results.
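The removal idea can be sketched as the loop below: after each hop, the parts of the question already covered by the retrieved tables are removed (e.g., by a rewriting model) so that the next hop targets still-unretrieved relevant tables. `retriever` and `rewriter` are assumed callables rather than a specific API, and the hop count is illustrative.

```python
def murre_retrieve(question, retriever, rewriter, hops=3, top_k=5):
    """Multi-hop table retrieval with removal (illustrative sketch)."""
    retrieved, current_q = [], question
    for _ in range(hops):
        tables = retriever(current_q, top_k=top_k)
        retrieved.extend(t for t in tables if t not in retrieved)
        # Rewrite the question with already-covered information removed,
        # steering the next hop away from tables similar to those retrieved.
        current_q = rewriter(question=current_q, covered_tables=tables)
        if not current_q.strip():
            break
    return retrieved
```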
pdf
bib
abs
Uchaguzi-2022: A Dataset of Citizen Reports on the 2022 Kenyan Election
Roberto Mondini
|
Neema Kotonya
|
Robert L Logan IV
|
Elizabeth M. Olson
|
Angela Oduor Lungati
|
Daniel Odongo
|
Tim Ombasa
|
Hemank Lamba
|
Aoife Cahill
|
Joel Tetreault
|
Alejandro Jaimes
Online reporting platforms have enabled citizens around the world to collectively share their opinions and report in real time on events impacting their local communities. Systematically organizing (e.g., categorizing by attributes) and geotagging large amounts of crowdsourced information is crucial to ensuring that accurate and meaningful insights can be drawn from this data and used by policy makers to bring about positive change. These tasks, however, typically require extensive manual annotation efforts. In this paper we present Uchaguzi-2022, a dataset of 14k categorized and geotagged citizen reports related to the 2022 Kenyan General Election containing mentions of election-related issues such as official misconduct, vote count irregularities, and acts of violence. We use this dataset to investigate whether language models can assist in scalably categorizing and geotagging reports, thus highlighting its potential application in the AI for Social Good space.
pdf
bib
abs
On Evaluating LLMs’ Capabilities as Functional Approximators: A Bayesian Evaluation Framework
Shoaib Ahmed Siddiqui
|
Yanzhi Chen
|
Juyeon Heo
|
Menglin Xia
|
Adrian Weller
Recent works have successfully applied Large Language Models (LLMs) to function modeling tasks. However, the reasons behind this success remain unclear. In this work, we propose a new evaluation framework to comprehensively assess LLMs’ function modeling abilities. By adopting a Bayesian perspective of function modeling, we discover that LLMs are relatively weak in understanding patterns in raw data, but excel at utilizing prior knowledge about the domain to develop a strong understanding of the underlying function. Our findings offer new insights about the strengths and limitations of LLMs in the context of function modeling.
pdf
bib
abs
Biases in Large Language Model-Elicited Text: A Case Study in Natural Language Inference
Grace Proebsting
|
Adam Poliak
We test whether NLP datasets created with Large Language Models (LLMs) contain annotation artifacts and social biases like NLP datasets elicited from crowd-source workers. We recreate a portion of the Stanford Natural Language Inference corpus using GPT-4, Llama-2 70b for Chat, and Mistral 7b Instruct. We train hypothesis-only classifiers to determine whether LLM-elicited NLI datasets contain annotation artifacts. Next, we use point-wise mutual information to identify the words in each dataset that are associated with gender, race, and age-related terms. On our LLM-generated NLI datasets, fine-tuned BERT hypothesis-only classifiers achieve between 86-96% accuracy. Our analyses further characterize the annotation artifacts and stereotypical biases in LLM-generated datasets.
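The point-wise mutual information used above to surface words associated with gender-, race-, and age-related terms can be computed as in the sketch below, here at the sentence level with illustrative function and argument names.

```python
import math

def pmi(sentences, word, group_terms):
    """PMI(word; group) = log2 P(word, group) / (P(word) * P(group)),
    with probabilities estimated over tokenized sentences."""
    n = len(sentences)
    has_word = [s for s in sentences if word in s]
    has_group = [s for s in sentences if any(t in s for t in group_terms)]
    joint = sum(1 for s in has_word if any(t in s for t in group_terms))
    p_w, p_g, p_wg = len(has_word) / n, len(has_group) / n, joint / n
    if p_w == 0 or p_g == 0 or p_wg == 0:
        return float("-inf")
    return math.log2(p_wg / (p_w * p_g))
```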
pdf
bib
abs
LLMs May Perform MCQA by Selecting the Least Incorrect Option
Haochun Wang
|
Sendong Zhao
|
Zewen Qiang
|
Nuwa Xi
|
Bing Qin
|
Ting Liu
In the field of NLP, Large Language Models (LLMs) have markedly enhanced performance across a variety of tasks. However, the comprehensive evaluation of LLMs remains an inevitable challenge for the community. Recently, the adoption of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs has gained considerable traction. However, concerns regarding the robustness of this evaluative method persist. Building upon previous discussions on the issue of variability, we reveal an additional dimension of concern: LLMs may perform MCQA by selecting the least incorrect option rather than a distinctly correct one. This observation suggests that LLMs might regard multiple options as correct, which could undermine the reliability of MCQA as a metric for evaluating LLMs. To address this challenge, we introduce an enhanced dataset augmentation method for MCQA, termed MCQA+, to provide a more accurate reflection of model performance, thereby highlighting the necessity for more sophisticated evaluation mechanisms in the assessment of LLM capabilities.
pdf
bib
abs
Benchmark Creation for Aspect-Based Sentiment Analysis in Low-Resource Odia Language and Evaluation through Fine-Tuning of Multilingual Models
Lipika Dewangan
|
Zoyah Afsheen Sayeed
|
Chandresh Maurya
The rapid growth of online product reviews spurs significant interest in Aspect-Based Sentiment Analysis (ABSA), which involves identifying aspect terms and their associated sentiment polarity. While ABSA is widely studied in resource-rich languages like English, Chinese, and Spanish, it remains underexplored in low-resource languages such as Odia. To address this gap, we create a reliable resource for aspect-based sentiment analysis in Odia. The dataset is annotated for two specific tasks: Aspect Term Extraction (ATE) and Aspect Polarity Classification (APC), spanning seven domains and aligned with the SemEval-2014 benchmark. Furthermore, we employ an ensemble data augmentation approach combining back-translation with a fine-tuned T5 paraphrase generation model to enhance the dataset and apply a semantic similarity filter using a Universal Sentence Encoder (USE) to remove low-quality data and ensure a balanced distribution of sample difficulty in the newly augmented dataset. Finally, we validate our dataset by fine-tuning multilingual pre-trained models, XLM-R and IndicBERT, on ATE and APC tasks. Additionally, we use three classical baseline models to evaluate the quality of the proposed dataset for these tasks. We hope the Odia dataset will spur more work for the ABSA task.
pdf
bib
abs
ADAPTIVE IE: Investigating the Complementarity of Human-AI Collaboration to Adaptively Extract Information on-the-fly
Ishani Mondal
|
Michelle Yuan
|
Anandhavelu N
|
Aparna Garimella
|
Francis Ferraro
|
Andrew Blair-Stanek
|
Benjamin Van Durme
|
Jordan Boyd-Graber
Information extraction (IE) needs vary over time, so a flexible IE system can be useful. Despite this, existing IE systems are either fully supervised, requiring expensive human annotations, or fully unsupervised, extracting information that often does not cater to users’ needs. To address these issues, we formally introduce the task of “IE on-the-fly”, and address the problem using our proposed Adaptive IE framework that uses human-in-the-loop refinement to adapt to changing user questions. Through human experiments on three diverse datasets, we demonstrate that Adaptive IE is a domain-agnostic, responsive, efficient framework for helping users access useful information while quickly reorganizing information in response to evolving information needs.
pdf
bib
abs
DAEA: Enhancing Entity Alignment in Real-World Knowledge Graphs Through Multi-Source Domain Adaptation
Linyan Yang
|
Shiqiao Zhou
|
Jingwei Cheng
|
Fu Zhang
|
Jizheng Wan
|
Shuo Wang
|
Mark Lee
Entity Alignment (EA) is a critical task in Knowledge Graph (KG) integration, aimed at identifying and matching equivalent entities that represent the same real-world objects. While EA methods based on knowledge representation learning have shown strong performance on synthetic benchmark datasets such as DBP15K, their effectiveness significantly declines in real-world scenarios which often involve data that is highly heterogeneous, incomplete, and domain-specific, as seen in datasets like DOREMUS and AGROLD. Addressing this challenge, we propose DAEA, a novel EA approach with Domain Adaptation that leverages the data characteristics of synthetic benchmarks for improved performance on real-world datasets. DAEA introduces a multi-source KGs selection mechanism and a specialized domain adaptive entity alignment loss function to bridge the gap between real-world data and optimal benchmark data, mitigating the challenges posed by aligning entities across highly heterogeneous KGs. Experimental results demonstrate that DAEA outperforms state-of-the-art models on real-world datasets, achieving a 29.94% improvement in Hits@1 on DOREMUS and a 5.64% improvement on AGROLD. Code is available at https://github.com/yangxiaoxiaoly/DAEA.
pdf
bib
abs
CoPrUS: Consistency Preserving Utterance Synthesis towards more realistic benchmark dialogues
Sebastian Steindl
|
Ulrich Schäfer
|
Bernd Ludwig
Large-scale Wizard-Of-Oz dialogue datasets have enabled the training of deep learning-based dialogue systems. While they are successful as benchmark datasets, they lack certain types of utterances, which would make them more realistic. In this work, we investigate the creation of synthetic communication errors in an automatic pipeline. Based on linguistic theory, we propose and follow a simple error taxonomy. We focus on three types of miscommunications that could happen in real-world dialogues but are underrepresented in the benchmark dataset: misunderstandings, non-understandings and vaguely related questions. Our two-step approach uses a state-of-the-art Large Language Model (LLM) to first create the error and secondly the repairing utterance. We perform Language Model-based evaluation to ensure the quality of the generated utterances. We apply the method to the MultiWOZ dataset and evaluate it both qualitatively and empirically as well as with human judges. Our results indicate that current LLMs can aid in adding post-hoc miscommunications to benchmark datasets as a form of data augmentation. We publish the resulting dataset, in which nearly 1900 dialogues have been modified, as CoPrUS-MultiWOZ to facilitate future work on dialogue systems.
pdf
bib
abs
JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models
Junfeng Jiang
|
Jiahao Huang
|
Akiko Aizawa
Recent developments in Japanese large language models (LLMs) primarily focus on general domains, with fewer advancements in Japanese biomedical LLMs. One obstacle is the absence of a comprehensive, large-scale benchmark for comparison. Furthermore, the resources for evaluating Japanese biomedical LLMs are insufficient. To advance this field, we propose a new benchmark including eight LLMs across four categories and 20 Japanese biomedical datasets across five tasks. Experimental results indicate that: (1) LLMs with a better understanding of Japanese and richer biomedical knowledge achieve better performance in Japanese biomedical tasks, (2) LLMs that are not mainly designed for Japanese biomedical domains can still perform unexpectedly well, and (3) there is still much room for improving the existing LLMs in certain Japanese biomedical tasks. Moreover, we offer insights that could further enhance development in this field. Our evaluation tools tailored to our benchmark as well as the datasets are publicly available to facilitate future research.
pdf
bib
abs
Automated Detection of Tropes In Short Texts
Alessandra Flaccavento
|
Youri Peskine
|
Paolo Papotti
|
Riccardo Torlone
|
Raphael Troncy
Tropes — recurring narrative elements like the “smoking gun” or the “veil of secrecy” — are often used in movies to convey familiar patterns. However, they also play a significant role in online communication about societal issues, where they can oversimplify complex matters and deteriorate public discourse. Recognizing these tropes can offer insights into the emotional manipulation and potential bias present in online discussions. This paper addresses the challenge of automatically detecting tropes in social media posts. We define the task, distinguish it from previous work, and create a ground-truth dataset of social media posts related to vaccines and immigration, manually labeled with tropes. Using this dataset, we develop a supervised machine learning technique for multi-label classification, fine-tune a model, and demonstrate its effectiveness experimentally. Our results show that tropes are common across domains and that fine-tuned models can detect them with high accuracy.
pdf
bib
abs
WER We Stand: Benchmarking Urdu ASR Models
Samee Arif
|
Aamina Jamal Khan
|
Mustafa Abbas
|
Agha Ali Raza
|
Awais Athar
This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. We analyze the performance of three ASR model families: Whisper, MMS, and Seamless-M4T using Word Error Rate (WER), along with a detailed examination of the most frequent wrong words and error types including insertions, deletions, and substitutions. Our analysis is conducted using two types of datasets, read speech and conversational speech. Notably, we present the first conversational speech dataset designed for benchmarking Urdu ASR models. We find that seamless-large outperforms other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset. Furthermore, this evaluation highlights the complexities of assessing ASR models for low-resource languages like Urdu using quantitative metrics alone and emphasizes the need for a robust Urdu text normalization system. Our findings contribute valuable insights for developing robust ASR systems for low-resource languages like Urdu.
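As a point of reference for the metric used throughout this evaluation, Word Error Rate is the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the ASR hypothesis, normalized by the number of reference words. A minimal sketch using the `jiwer` package (not necessarily the toolkit the authors used, and shown here without any Urdu-specific text normalization):

```python
# pip install jiwer
import jiwer

reference = "یہ ایک مثال ہے"   # example reference transcript (Urdu, 4 words)
hypothesis = "یہ مثال ہے"      # hypothetical ASR output that drops one word

# WER = (substitutions + deletions + insertions) / number of reference words
print(jiwer.wer(reference, hypothesis))  # 0.25: one deletion out of four words
```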
pdf
bib
abs
CHIFRAUD: A Long-term Web Text Dataset for Chinese Fraud Detection
Min Tang
|
Lixin Zou
|
Zhe Jin
|
ShuJie Cui
|
Shiuan Ni Liang
|
Weiqing Wang
Detecting fraudulent online text is essential, as these manipulative messages exploit human greed, deceive individuals, and endanger societal security. Currently, this task remains under-explored on the Chinese web due to the lack of a comprehensive dataset of Chinese fraudulent texts. However, creating such a dataset is challenging because it requires extensive annotation within a vast collection of normal texts. Additionally, the creators of fraudulent webpages continuously update their tactics to evade detection by downstream platforms and promote fraudulent messages. To this end, this work first presents a comprehensive long-term dataset of Chinese fraudulent texts collected over 12 months, consisting of 59,106 entries extracted from billions of web pages. Furthermore, we design and provide a wide range of baselines, including large language model-based detectors and pre-trained language model approaches. The dataset and benchmark code for further research are available at https://github.com/xuemingxxx/ChiFraud.
pdf
bib
abs
CateEA: Enhancing Entity Alignment via Implicit Category Supervision
Guan Dong Feng
|
Tao Ren
|
Jun Hu
|
Dan dan Wang
Entity Alignment (EA) is essential for integrating Knowledge Graphs (KGs) by matching equivalent entities across diverse KGs. With the rise of multi-modal KGs, which emerged to better depict real-world KGs by integrating visual, textual, and structured data, Multi-Modal Entity Alignment (MMEA) has become crucial in enhancing EA. However, existing MMEA methods often neglect the inherent semantic category information of entities, limiting alignment precision and robustness. To address this, we propose Category-enhanced Entity Alignment (CateEA), which incorporates implicit entity category information into multi-modal representations. By generating pseudo-category labels from entity embeddings and integrating them into a multi-task learning framework, CateEA captures latent category semantics, enhancing entity representations. CateEA allows for adaptive adjustments of similarity measures, leading to improved alignment precision and robustness in multi-modal contexts. Experiments on benchmark datasets demonstrate that CateEA outperforms state-of-the-art methods in various settings.
pdf
bib
abs
Egalitarian Language Representation in Language Models: It All Begins with Tokenizers
Menan Velayuthan
|
Kengatharaiyer Sarveswaran
Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Despite the dominance of English-Centric (EC) Large Language Models (LLMs), tokenization methods often fail to fairly represent complex scripts like Tamil, Sinhala, and Hindi, primarily due to pre-tokenization choices. This study demonstrates that pre-tokenization has a more significant impact than tokenization algorithms on achieving egalitarian representation. To address this, we introduce an improvement to the Byte Pair Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme Pair Encoding (GPE). Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts. We validate this approach through experiments on Tamil, Sinhala, and Hindi. The codebase and resources used in this work are publicly available at https://github.com/vmenan/tokenizers-coling2025.
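A rough illustration of the idea behind grapheme-aware pre-tokenization, assuming the third-party `regex` package (its `\X` pattern matches extended grapheme clusters). This sketch is not the authors' implementation; it only shows how grapheme units differ from raw code points before a BPE vocabulary is learned over them:

```python
# pip install regex
import regex  # supports the \X grapheme-cluster pattern that the stdlib `re` lacks

def graphemes(text: str) -> list[str]:
    """Split text into user-perceived characters (extended grapheme clusters)."""
    return regex.findall(r"\X", text)

word = "தமிழ்"  # Tamil: 5 Unicode code points, but only 3 graphemes
print(list(word))       # code points: ['த', 'ம', 'ி', 'ழ', '்']
print(graphemes(word))  # graphemes:  ['த', 'மி', 'ழ்']
# A BPE vocabulary can then be learned over these grapheme units instead of bytes,
# so no merge ever splits or crosses a grapheme boundary.
```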
pdf
bib
abs
PIRsuader: A Persuasive Chatbot for Mitigating Psychological Insulin Resistance in Type-2 Diabetic Patients
Sujatha Das Gollapalli
|
See-Kiong Ng
Psychological Insulin Resistance (PIR) is described as the reluctance of diabetic patients to initiate and adhere to insulin-based treatments due to psychological barriers. Though studies have shown that timely initiation, together with lifestyle changes, is crucial for sugar control and the prevention of chronic conditions in Type 2 Diabetes (T2D) patients, many patients have deep-rooted fears and misgivings related to insulin which hinder them from adapting to an insulin-based treatment regimen when recommended by healthcare specialists. Therefore, it is vitally important to address and allay these fallacious beliefs in T2D patients and persuade them to consider insulin as a treatment option. In this paper, we describe the design of PIRsuader, a persuasive chatbot for mitigating PIR in T2D patients. In PIRsuader, we effectively harness the conversation generation capabilities of state-of-the-art Large Language Models via a context-specific persuasive dialog act schema. We design reward functions that capture dialog act preferences for persuading reluctant patients and apply reinforcement learning to learn a dialog act prediction model. Our experiments using a collection of real doctor-diabetic patient conversations indicate that PIRsuader is able to improve patients’ willingness to try insulin as well as address specific concerns they have in an empathetic manner.
pdf
bib
abs
Continual Learning Using Only Large Language Model Prompting
Jiabao Qiu
|
Zixuan Ke
|
Bing Liu
We introduce CLOB, a novel continual learning (CL) paradigm wherein a large language model (LLM) is regarded as a black box. Learning is done incrementally via only verbal prompting. CLOB does not fine-tune any part of the LLM or add any trainable parameters to it. It is particularly suitable for LLMs that are accessible via APIs. We also propose a new CL technique, called CIS, based on incremental summarization that also overcomes the LLM’s input length limit. Experiments show CIS outperforms baselines by a very large margin.
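To make the black-box, prompting-only setup concrete, here is a hypothetical sketch of incremental summarization: class knowledge is kept as short natural-language summaries that are rewritten as new labelled examples arrive, so the prompt never exceeds the LLM's input limit. `call_llm` is a placeholder for any API client, and the prompt wording is illustrative rather than the paper's.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a call to a black-box LLM accessible only via an API."""
    raise NotImplementedError

def update_class_summary(summary: str, new_examples: list[str], max_words: int = 150) -> str:
    """Fold a new batch of labelled examples into the running class summary,
    keeping it short enough to fit in future prompts (incremental summarization
    in the spirit of CIS; the exact prompt is an assumption)."""
    prompt = (
        f"Current summary of this class (may be empty):\n{summary}\n\n"
        "New examples of the class:\n- " + "\n- ".join(new_examples) + "\n\n"
        f"Rewrite the summary in at most {max_words} words so it also covers the new examples."
    )
    return call_llm(prompt)

def classify(text: str, class_summaries: dict[str, str]) -> str:
    """Classification without any parameter updates: ask the LLM which class summary fits best."""
    listing = "\n".join(f"{name}: {s}" for name, s in class_summaries.items())
    return call_llm(f"Class summaries:\n{listing}\n\nText: {text}\nBest matching class name:")
```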
pdf
bib
abs
Empirical Study on Data Attributes Insufficiency of Evaluation Benchmarks for LLMs
Chuang Liu
|
Renren Jin
|
Zheng Yao
|
Tianyi Li
|
Liang Cheng
|
Mark Steedman
|
Deyi Xiong
Previous benchmarks for evaluating large language models (LLMs) have primarily emphasized quantitative metrics, such as data volume. However, this focus may neglect key qualitative data attributes that can significantly impact the final rankings of LLMs, resulting in unreliable leaderboards. In this paper, we investigate whether current LLM benchmarks adequately consider these data attributes. We specifically examine three attributes: diversity, redundancy, and difficulty. To explore these attributes, we propose a framework with three separate modules, each designed to assess one of the attributes. Using a method that progressively incorporates these attributes, we analyze their influence on the benchmark. Our experimental results reveal a meaningful correlation between LLM rankings on the revised benchmark and the original benchmark when these attributes are accounted for. These findings indicate that existing benchmarks often fail to meet all three criteria, highlighting a lack of consideration for multifaceted data attributes in current evaluation datasets.
pdf
bib
abs
Small Language Models Also Work With Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas
Bastian Bunzeck
|
Daniel Duran
|
Leonie Schade
|
Sina Zarrieß
Recent work investigates whether LMs learn human-like linguistic generalizations and representations from developmentally plausible amounts of data. Yet, the basic linguistic units processed in these LMs are determined by subword-based tokenization, which limits their validity as models of learning at and below the word level. In this paper, we explore the potential of tokenization-free, phoneme- and grapheme-based language models. We demonstrate that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks when trained with character-level vocabularies. We further show that phoneme-based models almost match grapheme-based models in standard tasks and novel evaluations. Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.
pdf
bib
abs
Evaluating Readability Metrics for German Medical Text Simplification
Karen Scholz
|
Markus Wenzel
Clinical reports and scientific health information sources are usually written for medical experts, preventing patients from understanding the main messages of these texts. Making them comprehensible for patients is important to enable patients to make informed health decisions. Metrics are required to assess readability and to evaluate text simplification methods. However, research has mainly focused on English medical texts. We collected a set of 18 statistical, part-of-speech-based, syntactic, semantic and fluency metrics from related studies and evaluated their suitability for measuring the readability of German medical texts. We perform multiple t-tests on technical abstracts from English and German scientific articles and related simplified summaries, respectively. While semantic and fluency metrics can be successfully transferred to German medical texts, multiple statistical, part-of-speech-based, and syntactic metrics behave differently when applied to German medical texts, requiring careful interpretation.
pdf
bib
abs
Hi-GEC: Hindi Grammar Error Correction in Low Resource Scenario
Ujjwal Sharma
|
Pushpak Bhattacharyya
Automated Grammatical Error Correction (GEC) has been extensively researched in Natural Language Processing (NLP), primarily focusing on English and other resource-rich languages. This paper shifts the focus to GEC for a scarcely explored low-resource language, specifically Hindi, which presents unique challenges due to its intricate morphology and complex syntax. To address data resource limitations, this work explores various GEC data generation techniques. Our research introduces a carefully extracted and filtered, high-quality dataset, HiWikiEdits, which includes 8,137 human-edited instances sourced from Wikipedia, encompassing 17 diverse grammatical error types, with annotations performed using the ERRANT toolkit. Furthermore, we investigate Round Trip Translation (RTT) using diverse languages for synthetic Hindi GEC data generation, revealing that leveraging a high-resource, linguistically distant language for error generation outperforms mid-resource, linguistically closer languages. Specifically, using English as a pivot language resulted in a 6.25% improvement in GLEU score compared to using Assamese or Marathi. Finally, we also investigate a neural model-based synthetic error-generation technique and show that it achieves comparable performance to other synthetic data generation methods, even in low-resource settings.
pdf
bib
abs
MuPe Life Stories Dataset: Spontaneous Speech in Brazilian Portuguese with a Case Study Evaluation on ASR Bias against Speakers Groups and Topic Modeling
Sidney Evaldo Leal
|
Arnaldo Candido Junior
|
Ricardo Marcacini
|
Edresson Casanova
|
Odilon Gonçalves
|
Anderson Silva Soares
|
Rodrigo Freitas Lima
|
Lucas Rafael Stefanel Gris
|
Sandra Aluísio
Recently, several public datasets for automatic speech recognition (ASR) in Brazilian Portuguese (BP) have been released, improving ASR systems performance. However, these datasets lack diversity in terms of age groups, regional accents, and education levels. In this paper, we present a new publicly available dataset consisting of 289 life story interviews (365 hours), featuring a broad range of speakers varying in age, education, and regional accents. First, we demonstrated the presence of bias in current BP ASR models concerning education levels and age groups. Second, we showed that our dataset helps mitigate these biases. Additionally, an ASR model trained on our dataset performed better during evaluation on a diverse test set. Finally, the ASR model trained with our dataset was extrinsically evaluated through a topic modeling task that utilized the automatically transcribed output.
pdf
bib
abs
Multi-Layered Evaluation Using a Fusion of Metrics and LLMs as Judges in Open-Domain Question Answering
Rashin Rahnamoun
|
Mehrnoush Shamsfard
Automatic evaluation of machine-generated texts, such as answers in open-domain question answering (Open-Domain QA), presents a complex challenge involving cost efficiency, hardware constraints, and high accuracy. Although various metrics exist for comparing machine-generated answers with reference (gold standard) answers, ranging from lexical metrics (e.g., exact match) to semantic ones (e.g., cosine similarity) and using large language models (LLMs) as judges, none of these approaches achieves perfect performance in terms of accuracy or cost. To address this issue, we propose two approaches to enhance evaluation. First, we summarize long answers and use the shortened versions in the evaluation process, demonstrating that this adjustment significantly improves both lexical matching and semantic-based metrics evaluation results. Second, we introduce a multi-layered evaluation methodology that combines different metrics tailored to various scenarios. This combination of simple metrics delivers performance comparable to LLMs as judges but at lower costs. Moreover, our fused approach, which integrates both lexical and semantic metrics with LLMs through our formula, outperforms previous evaluation solutions.
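As an illustration of how simple metrics can be layered before resorting to an LLM judge, the following sketch accepts an answer on exact lexical match and otherwise falls back to embedding cosine similarity; the fusion rule and model choice are assumptions for illustration, not the paper's formula.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption

def layered_score(prediction: str, gold: str) -> float:
    """Cheap-first evaluation: return 1.0 on an exact lexical match, otherwise
    fall back to embedding cosine similarity (hypothetical fusion rule)."""
    if prediction.strip().lower() == gold.strip().lower():
        return 1.0  # lexical layer
    emb = model.encode([prediction, gold], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))  # semantic layer

print(layered_score("Paris", "paris"))                  # 1.0 (exact match)
print(layered_score("the capital of France", "Paris"))  # cosine-similarity fallback
```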
pdf
bib
abs
BERT-based Classical Arabic Poetry Authorship Attribution
Lama Alqurashi
|
Serge Sharoff
|
Janet Watson
|
Jacob Blakesley
This study introduces a novel computational approach to authorship attribution (AA) in Arabic poetry, using the entire Classical Arabic Poetry corpus for the first time and offering a direct analysis of real cases of misattribution. AA in Arabic poetry has been a significant issue since the 9th century, particularly due to the loss of pre-Islamic poetry and the misattribution of post-Islamic works to earlier poets. While previous research has predominantly employed qualitative methods, this study uses computational techniques to address these challenges. The corpus was scraped from online sources and enriched with manually curated Date of Death (DoD) information to overcome the problematic traditional sectioning. Additionally, we applied Embedded Topic Modeling (ETM) to label each poem with its topic contributions, further enhancing the dataset’s value. An ensemble model based on CAMeLBERT was developed and tested across three dimensions: topic, number of poets, and number of training examples. After parameter optimization, the model achieved F1 scores ranging from 0.97 to 1.0. The model was also applied to four pre-Islamic misattribution cases, producing results consistent with historical and literary studies.
pdf
bib
abs
It’s What You Say and How You Say It: Investigating the Effect of Linguistic vs. Behavioral Adaptation in Task-Oriented Chatbots
Lindsey Vanderlyn
|
Ngoc Thang Vu
Given the conflicting expectations users have for how a dialog agent should sound and behave, there is no one-size-fits-all option for dialog system design. Therefore, adaptation is critical to ensure successful and enjoyable interactions. However, it is not yet clear what the effects of behavioral (what the agent says) vs. linguistic adaptation (how the agent says this) are in terms of dialog success and user perception. In this work, we implement three different types of task-oriented dialog agents which can each vary their level of formality. We evaluate subjective and objective metrics of dialog success as well as user perceptions through a user study, comparing the collected data to that of (CITATION), where users interacted with the same three types of agents without linguistic adaptation. From this, we draw insights into which subjective and objective aspects of success and user perception are influenced by each type of adaptation. We additionally release all code, user surveys, and dialog interaction logs.
pdf
bib
abs
VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation
Hyeonseok Lim
|
Dongjae Shin
|
Seohyun Song
|
Inho Won
|
Minjun Kim
|
Junghun Yuk
|
Haneol Jang
|
KyungTae Lim
We propose the VLR-Bench, a visual question answering (VQA) benchmark for evaluating vision language models (VLMs) based on retrieval augmented generation (RAG). Unlike existing evaluation datasets for external knowledge-based VQA, the proposed VLR-Bench includes five input passages. This allows testing of the ability to determine which passage is useful for answering a given query, a capability lacking in previous research. In this context, we constructed a dataset of 32,000 automatically generated instruction-following examples, which we denote as VLR-IF. This dataset is specifically designed to enhance the RAG capabilities of VLMs by enabling them to learn how to generate appropriate answers based on input passages. We evaluated the validity of the proposed benchmark and training data and verified its performance using the state-of-the-art Llama3-based VLM, the Llava-Llama-3 model. The proposed VLR-Bench and VLR-IF datasets are publicly available online.
pdf
bib
abs
LASS: A Novel and Economical Data Augmentation Framework Based on Language Models for Debiasing Opinion Summarization
Yanyue Zhang
|
Pengfei Li
|
Yilong Lai
|
Yulan He
|
Deyu Zhou
As more than 70% of reviews in existing opinion summarization datasets are positive, current opinion summarization approaches are hesitant to generate negative summaries given the input of negative texts. To address such sentiment bias, a direct approach that does not rely on a specific model structure is to generate additional data with large language models to balance the emotional distribution of the dataset. However, large-scale data augmentation based on large language models has an apparent disadvantage: expensive costs. Therefore, in this paper, we propose LASS, a novel data augmentation framework based on both LArge and Small language models for debiaSing opinion summarization. Specifically, a small number of synthesized negative reviews is obtained by rewriting positive texts via a large language model. Then, a disentangled reconstruction model is trained based on the generated data. After training, a large amount of synthetic data can be obtained by decoding new representations obtained from the combination of different sample representations and filtering based on perplexity and sentiment classification. Experiments have shown that LASS can effectively alleviate emotional bias, similarly to using only large models, but in a more economical way.
pdf
bib
abs
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
Eva Sánchez Salido
|
Roser Morante
|
Julio Gonzalo
|
Guillermo Marco
|
Jorge Carrillo-de-Albornoz
|
Laura Plaza
|
Enrique Amigo
|
Andrés Fernandez García
|
Alejandro Benito-Santos
|
Adrián Ghajari Espinosa
|
Victor Fresno
In this article we present UNED-ACCESS 2024, a bilingual dataset that consists of 1003 multiple-choice questions from university entrance level exams in Spanish and English. Questions are originally formulated in Spanish and manually translated into English, and have never been publicly released, ensuring minimal contamination when evaluating Large Language Models with this dataset. A selection of current open-source and proprietary models is evaluated in a uniform zero-shot experimental setting both on the UNED-ACCESS 2024 dataset and on an equivalent subset of MMLU questions. Results show that (i) smaller models not only perform worse than the largest models, but also degrade faster in Spanish than in English; the performance gap between both languages is negligible for the best models, but grows up to 37% for smaller models; (ii) model ranking on UNED-ACCESS 2024 is almost identical (0.98 Pearson correlation) to the one obtained with MMLU (a similar, but publicly available benchmark), suggesting that contamination affects all models similarly; and (iii) as in publicly available datasets, reasoning questions in UNED-ACCESS are more challenging for models of all sizes.
pdf
bib
abs
Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slips
Yingfa Chen
|
Chenlong Hu
|
Cong Feng
|
Chenyang Song
|
Shi Yu
|
Xu Han
|
Zhiyuan Liu
|
Maosong Sun
This study presents a multi-modal multi-granularity tokenizer specifically designed for analyzing ancient Chinese scripts, focusing on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China. Considering the complex hierarchical structure of ancient Chinese scripts, where a single character may be a combination of multiple sub-characters, our tokenizer first adopts character detection to locate character boundaries. Then it conducts character recognition at both the character and sub-character levels. Moreover, to support the academic community, we assembled the first large-scale dataset of CBSs with over 100K annotated character image scans. On the part-of-speech tagging task built on our dataset, using our tokenizer gives a 5.5% relative improvement in F1-score compared to mainstream sub-word tokenizers. Our work not only aids in further investigations of the specific script but also has the potential to advance research on other forms of ancient Chinese scripts.
pdf
bib
abs
DROWN: Towards Tighter LiRPA-based Robustness Certification
Yunruo Zhang
|
Tianyu Du
|
Shouling Ji
|
Shanqing Guo
The susceptibility of deep neural networks to adversarial attacks is a well-established concern. To address this problem, robustness certification is proposed, which, unfortunately, suffers from precision or scalability issues. In this paper, we present DROWN (Dual CROWN), a novel method for certifying the robustness of DNNs. The advantage of DROWN is that it tightens classic LiRPA-based methods yet maintains similar scalability, which comes from refining pre-activation bounds of ReLU relaxations using two pairs of linear bounds derived from different relaxations of ReLU units in previous layers. The extensive evaluations show that DROWN achieves up to 83.39% higher certified robust accuracy than the baseline on CNNs and up to 4.68 times larger certified radii than the baseline on Transformers. Meanwhile, the running time of DROWN is about twice that of the baseline.
pdf
bib
abs
Large Language Models with Reinforcement Learning from Human Feedback Approach for Enhancing Explainable Sexism Detection
Ali Riahi Samani
|
Tianhao Wang
|
Kangshuo Li
|
Feng Chen
Recent advancements in natural language processing, driven by Large Language Models (LLMs), have significantly improved text comprehension, enabling these models to handle complex tasks with greater efficiency. A key feature of LLMs is their ability to engage in contextual learning, which allows them to understand and apply instructions given in natural language to new scenarios without requiring additional training. This capability is particularly valuable in social media, where LLMs can be crucial in addressing challenges in explainable sexism detection. We hypothesize that by leveraging contextual learning capabilities, LLMs can provide clear, explainable insights into why certain content is flagged as problematic, thus enhancing transparency in the sexism detection process. To this end, we propose a Reinforcement Learning from Human Feedback (RLHF) based fine-tuning framework for sexism detection. We studied two well-known LLMs, Mistral-7B and LLaMA-3-8B, in zero-shot, supervised fine-tuning, and RLHF scenarios to establish the superior ability of LLMs in sexism detection. The experimental results reported in this work, based on three tasks of Explainable Detection of Online Sexism (EDOS), highlight the importance of RLHF for building explainable systems in online discourse. Furthermore, we found that the LLaMA-3-8B model achieves the best results using the RLHF approach, scoring 0.8681 on Task A (binary sexism detection), 0.6829 on Task B (category classification of sexism), and 0.4722 on Task C (fine-grained sexism vectors) test sets.
pdf
bib
abs
Leveraging Taxonomy and LLMs for Improved Multimodal Hierarchical Classification
Shijing Chen
|
Mohamed Reda Bouadjenek
|
Usman Naseem
|
Basem Suleiman
|
Shoaib Jameel
|
Flora Salim
|
Hakim Hacid
|
Imran Razzak
Multi-level Hierarchical Classification (MLHC) tackles the challenge of categorizing items within a complex, multi-layered class structure. However, traditional MLHC classifiers often rely on a backbone model with n independent output layers, which tend to ignore the hierarchical relationships between classes. This oversight can lead to inconsistent predictions that violate the underlying taxonomy. Leveraging Large Language Models (LLMs), we propose a novel taxonomy-embedded, transitional, LLM-agnostic framework for multimodal classification. The cornerstone of this advancement is the ability of models to enforce consistency across hierarchical levels. Our evaluations on the MEP-3M dataset, a multi-modal e-commerce product dataset with various hierarchical levels, demonstrate a significant performance improvement compared to conventional LLM structures.
pdf
bib
abs
Representation Purification for End-to-End Speech Translation
Chengwei Zhang
|
Yue Zhou
|
Rui Zhao
|
Yidong Chen
|
Xiaodong Shi
Speech-to-text translation (ST) is a cross-modal task that involves converting spoken language into text in a different language. Previous research primarily focused on enhancing speech translation by facilitating knowledge transfer from machine translation, exploring various methods to bridge the gap between speech and text modalities. Despite substantial progress made, factors in speech that are not relevant to translation content, such as timbre and rhythm, often limit the efficiency of knowledge transfer. In this paper, we conceptualize speech representation as a combination of content-agnostic and content-relevant factors. We examine the impact of content-agnostic factors on translation performance through preliminary experiments and observe a significant performance deterioration when content-agnostic perturbations are introduced to speech signals. To address this issue, we propose a **S**peech **R**epresentation **P**urification with **S**upervision **E**nhancement (SRPSE) framework, which excludes the content-agnostic components within speech representations to mitigate their negative impact on ST. Experiments on MuST-C and CoVoST-2 datasets demonstrate that SRPSE significantly improves translation performance across all translation directions in three settings and achieves preeminent performance under a *transcript-free* setting.
pdf
bib
abs
Semi-Automated Construction of Sense-Annotated Datasets for Practically Any Language
Jai Riley
|
Bradley M. Hauer
|
Nafisa Sadaf Hriti
|
Guoqing Luo
|
Amir Reza Mirzaei
|
Ali Rafiei
|
Hadi Sheikhi
|
Mahvash Siavashpour
|
Mohammad Tavakoli
|
Ning Shi
|
Grzegorz Kondrak
High-quality sense-annotated datasets are vital for evaluating and comparing WSD systems. We present a novel approach to creating parallel sense-annotated datasets, which can be applied to any language that English can be translated into. The method incorporates machine translation, word alignment, sense projection, and sense filtering to produce silver annotations, which can then be revised manually to obtain gold datasets. By applying our method to Farsi, Chinese, and Bengali, we produce new parallel benchmark datasets, which are vetted by native speakers of each language. Our automatically-generated silver datasets are of higher quality than the annotations obtained with recent multilingual WSD systems, particularly on non-European languages.
pdf
bib
abs
HYDEN: Hyperbolic Density Representations for Medical Images and Reports
Zhi Qiao
|
Linbin Han
|
Xiantong Zhen
|
Jiahong Gao
|
Zhen Qian
In light of the inherent entailment relations between images and text, embedding point vectors in hyperbolic space has been employed to leverage its hierarchical modeling advantages for visual semantic representation learning. However, point vector embeddings struggle to address semantic uncertainty, where an image may have multiple interpretations, and text may correspond to different images, a challenge especially prevalent in the medical domain. Therefore, we propose HYDEN, a novel hyperbolic density embedding based image-text representation learning approach tailored for specific medical domain data. This method integrates text-aware local features with global features from images, mapping image-text features to density features in hyperbolic space using hyperbolic pseudo-Gaussian distributions. An encapsulation loss function is employed to model the partial order relations between image-text density distributions. Experimental results demonstrate the interpretability of our approach and its superior performance compared to baseline methods across various zero-shot tasks and fine-tuning tasks on different datasets.
pdf
bib
abs
Towards Human Understanding of Paraphrase Types in Large Language Models
Dominik Meier
|
Jan Philip Wahle
|
Terry Lima Ruas
|
Bela Gipp
Paraphrases represent a human’s intuitive ability to understand expressions presented in various different ways. Current paraphrase evaluations of language models primarily use binary approaches, offering limited interpretability of specific text changes. Atomic paraphrase types (APT) decompose paraphrases into different linguistic changes and offer a granular view of the flexibility in linguistic expression (e.g., a shift in syntax or vocabulary used). In this study, we assess the human preferences towards ChatGPT in generating English paraphrases with ten APTs and five prompting techniques. We introduce APTY (Atomic Paraphrase TYpes), a dataset of 800 sentence-level and word-level annotations by 15 annotators. The dataset also provides a human preference ranking of paraphrases with different types that can be used to fine-tune models with RLHF and DPO methods. Our results reveal that ChatGPT and a DPO-trained LLama 7B model can generate simple APTs, such as additions and deletions, but struggle with complex structures (e.g., subordination changes). This study contributes to understanding which aspects of paraphrasing language models have already mastered and which remain elusive. In addition, we show how our curated datasets can be used to develop language models with specific linguistic capabilities.
pdf
bib
abs
Just Read the Codebook! Make Use of Quality Codebooks in Zero-Shot Classification of Multilabel Frame Datasets
Mattes Ruckdeschel
The recent development of Large Language Models has lowered the barrier to entry for using Natural Language Processing methods for various tasks in the related scientific field of Computational Social Science and has led to more scrutiny of their performance on complex datasets. While in many cases the costly fine-tuning of smaller Language Models outperforms LLMs, zero- and few-shot approaches on consumer hardware have the potential to deepen interdisciplinary research efforts, whilst opening up NLP research to complex, niche datasets that are hard to classify. The great effort of coding datasets comes with the benefit of concise instructions for how to code the data at hand. We investigate whether highly specific, instructive codebooks, created by social scientists to code text with a multitude of complex labels, can improve zero-shot performance of (quantized) LLMs. Our findings show that, using the latest LLMs, zero-shot performance can be improved by providing a codebook on two complex datasets with a total of four different topics, and can outperform few-shot in-context learning setups. The approach is equally or more token-efficient and requires less hands-on engineering, making it particularly compelling for practical research.
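A hypothetical sketch of the setup described above: the codebook's label definitions are quoted verbatim inside a zero-shot prompt that asks for all applicable labels. The label set and prompt wording are invented for illustration and are not taken from the paper.

```python
def build_codebook_prompt(codebook: dict[str, str], post: str) -> str:
    """Assemble a zero-shot, multi-label classification prompt that includes the
    codebook's label definitions verbatim. Labels and instructions are illustrative."""
    definitions = "\n".join(f"- {label}: {definition}" for label, definition in codebook.items())
    return (
        "You are coding social media posts using the codebook below.\n"
        f"Codebook:\n{definitions}\n\n"
        "Return every label that applies to the post, comma-separated, "
        "or 'none' if no label applies.\n\n"
        f"Post: {post}\nLabels:"
    )

codebook = {
    "economic": "The post frames the issue in terms of costs, benefits, or financial impact.",
    "security": "The post frames the issue as a threat to safety or public order.",
}
print(build_codebook_prompt(codebook, "The new policy will cost taxpayers billions."))
```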
pdf
bib
abs
NLP for preserving Torlak, a vulnerable low-resource Slavic language
Li Tang
|
Teodora Vuković
Torlak is an endangered, low-resource Slavic language with a high degree of areal and inter-speaker variation. In previous work, interviews were performed with Torlak speakers in Serbia, near the Bulgarian border, and the transcripts were annotated with lemmas and morphosyntactic descriptions at the token level. Such token-level annotations facilitate cross-language comparison in the context of the Balkan Sprachbund, where multiple languages, including Serbian and Bulgarian, influenced Torlak over time. Here, we aim to improve the prediction of morphosyntactic annotations for this low-resource language by fine-tuning large language models, comparing several predictive models. We also fine-tuned the large language models to score the degree of ‘Torlakness’ of a sentence by labeling likely Torlak tokens, to facilitate the documentation of additional Torlak transcribed speech with a high degree of Torlak-style non-standard features compared to standard Serbian. Taken together, we hope that these contributions will help to document this endangered language and improve digital access for its speakers.
pdf
bib
abs
Analyzing the Attention Heads for Pronoun Disambiguation in Context-aware Machine Translation Models
Paweł Mąka
|
Yusuf Can Semerci
|
Jan Scholtes
|
Gerasimos Spanakis
In this paper, we investigate the role of attention heads in Context-aware Machine Translation models for pronoun disambiguation in the English-to-German and English-to-French language directions. We analyze their influence by both observing and modifying the attention scores corresponding to the plausible relations that could impact a pronoun prediction. Our findings reveal that while some heads do attend to the relations of interest, not all of them influence the models’ ability to disambiguate pronouns. We show that certain heads are underutilized by the models, suggesting that model performance could be improved if these heads attended to one of the relations more strongly. Furthermore, we fine-tune the most promising heads and observe an increase in pronoun disambiguation accuracy of up to 5 percentage points, which demonstrates that the improvements in performance can be solidified into the models’ parameters.
pdf
bib
abs
ModaFact: Multi-paradigm Evaluation for Joint Event Modality and Factuality Detection
Marco Rovera
|
Serena Cristoforetti
|
Sara Tonelli
Factuality and modality are two crucial aspects concerning events, since they convey the speaker’s commitment to a situation in discourse as well as how this event is supposed to occur in terms of norms, wishes, necessity, duty and so on. Capturing them both is necessary to truly understand an utterance meaning and the speaker’s perspective with respect to a mentioned event. Yet, NLP studies have mostly dealt with these two aspects separately, mainly devoting past efforts to the development of English datasets. In this work, we propose ModaFact, a novel resource with joint factuality and modality information for event-denoting expressions in Italian. We propose a novel annotation scheme, which however is consistent with existing ones, and compare different classification systems trained on ModaFact, as a preliminary step to the use of factuality and modality information in downstream tasks. The dataset and the best-performing model are publicly released and available under an open license.
pdf
bib
abs
Why Does ChatGPT “Delve” So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models
Tom S Juzek
|
Zina B. Ward
Scientific English is currently undergoing rapid change, with words like “delve,” “intricate,” and “underscore” appearing far more frequently than just a few years ago. It is widely assumed that scientists’ use of large language models (LLMs) is responsible for such trends. We develop a formal, transferable method to characterize these linguistic changes. Application of our method yields 21 focal words whose increased occurrence in scientific abstracts is likely the result of LLM usage. We then pose “the puzzle of lexical overrepresentation”: why are such words overused by LLMs? We fail to find evidence that lexical overrepresentation is caused by model architecture, algorithm choices, or training data. To assess whether reinforcement learning from human feedback (RLHF) contributes to the overuse of focal words, we undertake comparative model testing and conduct an exploratory online study. While the model testing is consistent with RLHF playing a role, our experimental results suggest that participants may be reacting differently to “delve” than to other focal words. With LLMs quickly becoming a driver of global language change, investigating these potential sources of lexical overrepresentation is important. We note that while insights into the workings of LLMs are within reach, a lack of transparency surrounding model development remains an obstacle to such research.
pdf
bib
abs
Evaluating Pixel Language Models on Non-Standardized Languages
Alberto Muñoz-Ortiz
|
Verena Blaschke
|
Barbara Plank
We explore the potential of pixel-based models for transfer learning from standard languages to dialects. These models convert text into images that are divided into patches, enabling a continuous vocabulary representation that proves especially useful for out-of-vocabulary words common in dialectal data. Using German as a case study, we compare the performance of pixel-based models to token-based models across various syntactic and semantic tasks. Our results show that pixel-based models outperform token-based models in part-of-speech tagging, dependency parsing and intent detection for zero-shot dialect evaluation by up to 26 percentage points in some scenarios, though not in Standard German. However, pixel-based models fall short in topic classification. These findings emphasize the potential of pixel-based models for handling dialectal data, though further research should be conducted to assess their effectiveness in various linguistic contexts.
pdf
bib
abs
LOLA – An Open-Source Massively Multilingual Large Language Model
Nikit Srivastava
|
Denis Kuchelev
|
Tatiana Moteu Ngoli
|
Kshitij Shetty
|
Michael Roeder
|
Hamada Zahera
|
Diego Moussallem
|
Axel-Cyrille Ngonga Ngomo
This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model’s strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.
pdf
bib
abs
Cross-Lingual Sentence Compression for Length-Constrained Subtitles in Low-Resource Settings
Tollef Emil Jørgensen
|
Ole Jakob Mengshoel
This paper explores the joint task of machine translation and sentence compression, emphasizing its application in subtitle generation for broadcast and live media for low-resource languages and hardware. We develop CLSC (Cross-Lingual Sentence Compression), a system trained on openly available parallel corpora organized by compression ratios, where the target length is constrained to a fraction of the source sentence length. We present two training methods: 1) Multiple Models (MM), where individual models are trained separately for each compression ratio, and 2) a Controllable Model (CM), a single model per language using a compression token to encode length constraints. We evaluate both subtitle data and transcriptions from the EuroParl corpus. To accommodate low-resource settings, we constrain data sampling for training and show results for transcriptions in French, Hungarian, Lithuanian, and Polish and subtitles in Albanian, Basque, Malay, and Norwegian. Our models preserve semantic meaning well under compression, as reflected in the metric evaluations.
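To illustrate how a single controllable model can encode length constraints, the sketch below prefixes each source sentence with a compression-ratio control token derived from the observed target/source length ratio; the token names and bucketing scheme are assumptions for illustration, not the paper's exact recipe.

```python
def make_training_pair(source: str, target: str, n_buckets: int = 4) -> tuple[str, str]:
    """Prefix the source with a control token encoding the observed target/source
    length ratio, bucketed into n_buckets bins (e.g. <ratio_2> covers roughly
    25-50% of the source length). Token names are illustrative."""
    ratio = len(target.split()) / max(len(source.split()), 1)
    bucket = min(int(ratio * n_buckets), n_buckets - 1)
    return f"<ratio_{bucket + 1}> {source}", target

src = "the committee has decided to postpone the vote until next week"
tgt = "vote postponed to next week"
print(make_training_pair(src, tgt))
# ('<ratio_2> the committee has decided to postpone the vote until next week',
#  'vote postponed to next week')
# At inference time, the desired length constraint is selected by choosing the control token.
```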
pdf
bib
abs
SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
Gayane Ghazaryan
|
Erik Arakelyan
|
Isabelle Augenstein
|
Pasquale Minervini
Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose SynDARin, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain human-curated paragraphs between English and the target language. We use the English data as context to generate synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English human-curated paragraphs forms the final QA dataset. The method allows us to maintain content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with 1.2K samples for the Armenian language. The human evaluation shows that 98% of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out ~70% of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy, with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource languages.
pdf
bib
abs
Part-Of-Speech Sensitivity of Routers in Mixture of Experts Models
Elie Antoine
|
Frederic Bechet
|
Phillippe Langlais
This study investigates the behavior of model-integrated routers in Mixture of Experts (MoE) models, focusing on how tokens are routed based on their linguistic features, specifically Part-of-Speech (POS) tags. The goal is to explore across different MoE architectures whether experts specialize in processing tokens with similar linguistic traits. By analyzing token trajectories across experts and layers, we aim to uncover how MoE models handle linguistic information. Findings from six popular MoE models reveal expert specialization for specific POS categories, with routing paths showing high predictive accuracy for POS, highlighting the value of routing paths in characterizing tokens.
pdf
bib
abs
Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks
Yang Wang
|
Chenghua Lin
Recent advancements in natural language processing have highlighted the vulnerability of deep learning models to adversarial attacks. While various defence mechanisms have been proposed, there is a lack of comprehensive benchmarks that evaluate these defences across diverse datasets, models, and tasks. In this work, we address this gap by presenting an extensive benchmark for textual adversarial defence that significantly expands upon previous work. Our benchmark incorporates a wide range of datasets, evaluates state-of-the-art defence mechanisms, and extends the assessment to include critical tasks such as single-sentence classification, similarity and paraphrase identification, natural language inference, and commonsense reasoning. This work not only serves as a valuable resource for researchers and practitioners in the field of adversarial robustness but also identifies key areas for future research in textual adversarial defence. By establishing a new standard for benchmarking in this domain, we aim to accelerate progress towards more robust and reliable natural language processing systems.
pdf
bib
abs
Acquired TASTE: Multimodal Stance Detection with Textual and Structural Embeddings
Guy Barel
|
Oren Tsur
|
Dan Vilenchik
Stance detection plays a pivotal role in enabling an extensive range of downstream applications, from discourse parsing to tracing the spread of fake news and the denial of scientific facts. While most stance classification models rely on the textual representation of the utterance in question, prior work has demonstrated the importance of the conversational context in stance detection. In this work, we introduce TASTE – a multimodal architecture for stance detection that harmoniously fuses Transformer-based content embedding with unsupervised structural embedding. Through the fine-tuning of a pre-trained transformer and the amalgamation with social embedding via a Gated Residual Network (GRN) layer, our model adeptly captures the complex interplay between content and conversational structure in determining stance. TASTE achieves state-of-the-art results on common benchmarks, significantly outperforming an array of strong baselines. Comparative evaluations underscore the benefits of social grounding – emphasizing the criticality of concurrently harnessing both content and structure for enhanced stance detection.
pdf
bib
abs
IRUEX: A Study on Large Language Models Problem-Solving Skills in Iran’s University Entrance Exam
Hamed Khademi Khaledi
|
Heshaam Faili
In this paper, we present the IRUEX dataset, a novel multiple-choice educational resource specifically designed to evaluate the performance of Large Language Models (LLMs) across seven distinct categories. The dataset contains 868 Iranian university entrance exam questions (Konkour) and 36,485 additional questions. Each additional question is accompanied by detailed solutions, and the dataset also includes relevant high school textbooks, providing comprehensive study material. A key feature of IRUEX is its focus on underrepresented languages, particularly assessing problem-solving skills, language proficiency, and reasoning. Our evaluation shows that GPT-4o outperforms the other LLMs tested on the IRUEX dataset. Techniques such as few-shot learning and retrieval-augmented generation (RAG) display varied effects across different categories, highlighting their unique strengths in specific areas. Additionally, a comprehensive user study classifies the errors made by LLMs into ten problem-solving ability categories. The analysis highlights that calculations and linguistic knowledge, particularly in low-resource languages, remain significant weaknesses in current LLMs. IRUEX has the potential to serve as a benchmark for evaluating the reasoning capabilities of LLMs in non-English settings, providing a foundation for improving their performance in diverse languages and contexts.
pdf
bib
abs
data2lang2vec: Data Driven Typological Features Completion
Hamidreza Amirzadeh
|
Sadegh Jafari
|
Anika Harju
|
Rob van der Goot
Language typology databases enhance multi-lingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely-used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9%. Previous work on automatically increasing coverage predicts missing values based on features from other languages or focuses on single features; we instead propose to use textual data for better-informed feature prediction. To this end, we introduce a multi-lingual Part-of-Speech (POS) tagger, achieving over 70% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup, focusing on typology features that are likely to be missing, and show that our approach outperforms previous work in both setups.
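A toy sketch of the general recipe, with invented numbers: aggregate POS-tag frequencies from automatically tagged text into a per-language feature vector, then train a standard classifier on languages whose typology value is known and predict the missing entries. The feature columns, values, and classifier choice are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-language vectors: relative frequencies of a few POS tags
# estimated from automatically tagged text (columns: NOUN, VERB, ADP, ADJ).
X_known = np.array([
    [0.28, 0.15, 0.12, 0.08],
    [0.31, 0.13, 0.05, 0.10],
    [0.25, 0.18, 0.14, 0.07],
    [0.33, 0.12, 0.04, 0.11],
])
# Known values of one binary typology feature for those languages.
y_known = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X_known, y_known)

# Predict the missing value for a language whose typology entry is empty,
# using only POS statistics computed from its raw text.
x_missing = np.array([[0.27, 0.16, 0.13, 0.08]])
print(clf.predict(x_missing), clf.predict_proba(x_missing))
```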
pdf
bib
abs
Explanation Regularisation through the Lens of Attributions
Pedro Ferreira
|
Ivan Titov
|
Wilker Aziz
Explanation regularisation (ER) has been introduced as a way to guide text classifiers to form their predictions relying on input tokens that humans consider plausible. This is achieved by introducing an auxiliary explanation loss that measures how well the output of an input attribution technique for the model agrees with human-annotated rationales. The guidance appears to benefit performance in out-of-domain (OOD) settings, presumably due to an increased reliance on plausible tokens. However, previous work has under-explored the impact of guidance on that reliance, particularly when reliance is measured using attribution techniques different from those used to guide the model. In this work, we seek to close this gap, and also explore the relationship between reliance on plausible features and OOD performance. We find that the connection between ER and the ability of a classifier to rely on plausible features has been overstated and that a stronger reliance on plausible tokens does not seem to be the cause for OOD improvements.
pdf
bib
abs
Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs
Guillermo Marco
|
Luz Rello
|
Julio Gonzalo
In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART-large, and compare its performance to human writers and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human study in which 68 participants rated short stories from humans and the SLM on grammaticality, relevance, creativity, and attractiveness, and (ii) a qualitative linguistic analysis examining the textual characteristics of stories produced by each model. In the first experiment, BART-large outscored average human writers overall (2.11 vs. 1.85), a 14% relative improvement, though the slight human advantage in creativity was not statistically significant. In the second experiment, qualitative analysis showed that while GPT-4o demonstrated near-perfect coherence and used fewer clichéd phrases, it tended to produce more predictable language, with only 3% of its synopses featuring surprising associations (compared to 15% for BART). These findings highlight how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks, and demonstrate that smaller models can, in certain contexts, rival both humans and larger models.
pdf
bib
abs
Generics are puzzling. Can language models find the missing piece?
Gustavo Cilleruelo
|
Emily Allaway
|
Barry Haddow
|
Alexandra Birch
Generic sentences express generalisations about the world without explicit quantification. Although generics are central to everyday communication, building a precise semantic framework has proven difficult, in part because speakers use generics to generalise properties with widely different statistical prevalence. In this work, we study the implicit quantification and context-sensitivity of generics by leveraging language models as models of language. We create ConGen, a dataset of 2873 naturally occurring generic and quantified sentences in context, and define p-acceptability, a metric based on surprisal that is sensitive to quantification. Our experiments show generics are more context-sensitive than determiner quantifiers and about 20% of naturally occurring generics we analyze express weak generalisations. We also explore how human biases in stereotypes can be observed in language models.
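Since p-acceptability builds on surprisal, the following sketch computes the average per-token surprisal of a sentence under GPT-2 and compares a bare generic with its overtly quantified counterpart. The paper's exact metric is not reproduced here, and the model choice and example sentences are assumptions.

```python
# pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_surprisal(sentence: str) -> float:
    """Average surprisal (negative log probability, in nats) per token."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Predict each token from the preceding positions only.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_surprisal = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return float(token_surprisal.mean())

# Generic vs. overtly quantified variant of the same generalisation.
print(mean_surprisal("Ducks lay eggs."))
print(mean_surprisal("All ducks lay eggs."))
```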
pdf
bib
abs
Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models
Souvik Das
|
Lifeng Jin
|
Linfeng Song
|
Haitao Mi
|
Baolin Peng
|
Dong Yu
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination – generating content ungrounded in the realities of training data. Recent work has focused on decoding techniques to improve factuality by leveraging LLMs’ hierarchical representation of factual knowledge, manipulating the predicted distributions at inference time. Current state-of-the-art approaches refine decoding by contrasting logits from a lower layer with the final layer to exploit factuality-related information within the model’s forward pass. However, such methods often assume the final layer is the most reliable one, and the lower layer selection process depends on it. In this work, we first propose logit extrapolation of critical token probabilities beyond the last layer for more accurate contrasting. We additionally employ layer-wise entropy-guided lower layer selection, decoupling the selection process from the final layer. Experiments demonstrate strong performance, surpassing the state of the art on multiple datasets by large margins. Analyses show that different kinds of prompts respond to different selection strategies.
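A simplified, DoLa-style sketch of the mechanism described above: every layer's hidden state is projected through the LM head, a lower layer is chosen by an entropy criterion (here, the highest-entropy lower layer; the paper's exact criterion and its full extrapolation step are not reproduced), and the final-layer logits are pushed away from that lower layer. GPT-2 is used only for illustration.

```python
# pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution induced by `logits`."""
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum()

def contrastive_next_token_logits(prompt: str, alpha: float = 1.0) -> torch.Tensor:
    """Contrast the final layer with an entropy-selected lower layer
    (illustrative sketch; not the authors' exact formulation)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
        hs = out.hidden_states                  # embedding layer + one tensor per block
        final = model.lm_head(hs[-1][:, -1])    # hs[-1] already has the final LayerNorm applied
        lower_candidates = [model.lm_head(model.transformer.ln_f(h[:, -1])) for h in hs[1:-1]]
        # Entropy-guided choice of the "premature" layer, decoupled from the final layer.
        lower = max(lower_candidates, key=entropy)
        # Push the final-layer logits further away from the premature layer.
        return final + alpha * (final - lower)

logits = contrastive_next_token_logits("The capital of France is")
print(tok.decode(logits.argmax(dim=-1)))
```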
pdf
bib
abs
Iterative Structured Knowledge Distillation: Optimizing Language Models Through Layer-by-Layer Distillation
Malthe Have Musaeus
|
Rob van der Goot
Traditional language model compression techniques, like knowledge distillation, require a fixed architecture, limiting flexibility, while structured pruning methods often fail to preserve performance. This paper introduces Iterative Structured Knowledge Distillation (ISKD), which integrates knowledge distillation and structured pruning by progressively replacing transformer blocks with smaller, efficient versions during training. This study validates ISKD on two transformer-based language models: GPT-2 and Phi-1. ISKD outperforms L1 pruning and achieves similar performance to knowledge distillation while offering greater flexibility. ISKD reduces model parameters - 30.68% for GPT-2 and 30.16% for Phi-1 - while maintaining at least four-fifths of performance on both language modeling and commonsense reasoning tasks. These findings suggest that this method offers a promising balance between model efficiency and accuracy.
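The layer-by-layer idea can be pictured as swapping one transformer block at a time for a smaller module and training only that module to mimic the teacher's hidden states. The `SlimBlock` design, the MSE objective, and the GPT-2 backbone below are hypothetical stand-ins chosen for a self-contained sketch, not the architecture or training recipe used in the paper.

```python
import copy
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
teacher = GPT2LMHeadModel.from_pretrained("gpt2").eval()
student = copy.deepcopy(teacher)
batch = tok(["The quick brown fox jumps over the lazy dog."] * 4,
            return_tensors="pt").input_ids

class SlimBlock(nn.Module):
    """Hypothetical smaller replacement: a bottlenecked feed-forward block standing in
    for a full transformer block."""
    def __init__(self, hidden: int, bottleneck: int = 256):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(hidden, bottleneck), nn.GELU(),
                                nn.Linear(bottleneck, hidden))

    def forward(self, hidden_states, *args, **kwargs):
        # Mirror the (hidden_states, ...) tuple interface of GPT-2 blocks.
        return (hidden_states + self.ff(hidden_states),)

def distill_block(idx: int, steps: int = 50, lr: float = 1e-4):
    """Replace block `idx` in the student and train it to match the teacher's hidden states."""
    student.transformer.h[idx] = SlimBlock(teacher.config.n_embd)
    opt = torch.optim.AdamW(student.transformer.h[idx].parameters(), lr=lr)
    with torch.no_grad():
        target = teacher(batch, output_hidden_states=True,
                         use_cache=False).hidden_states[idx + 1]
    for _ in range(steps):
        pred = student(batch, output_hidden_states=True,
                       use_cache=False).hidden_states[idx + 1]
        loss = nn.functional.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Iteratively replace blocks one at a time, sketching the "layer-by-layer" schedule.
for i in range(len(student.transformer.h)):
    distill_block(i)
```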
pdf
bib
abs
Why do language models perform worse for morphologically complex languages?
Catherine Arnett
|
Benjamin Bergen
Language models perform differently across languages. It has been previously suggested that morphological typology may explain some of this variability (Cotterell et al., 2018). We replicate previous analyses and find additional new evidence for a performance gap between agglutinative and fusional languages, where fusional languages, such as English, tend to have better language modeling performance than morphologically more complex languages like Turkish. We then propose and test three possible causes for this performance gap: morphological alignment of tokenizers, tokenization quality, and disparities in dataset sizes and measurement. To test the morphological alignment hypothesis, we present MorphScore, a tokenizer evaluation metric, and supporting datasets for 22 languages. We find some evidence that tokenization quality explains the performance gap, but none for the role of morphological alignment. Instead we find that the performance gap is most reduced when training datasets are of equivalent size across language types, but only when scaled according to the so-called “byte-premium”—the different encoding efficiencies of different languages and orthographies. These results suggest that languages of particular morphological types are not intrinsically advantaged or disadvantaged in language modeling. Differences in performance can be attributed to disparities in dataset size. These findings bear on ongoing efforts to improve performance for low-performing and under-resourced languages.
pdf
bib
abs
Argument Mining with Fine-Tuned Large Language Models
Jérémie Cabessa
|
Hugo Hernault
|
Umer Mushtaq
An end-to-end argument mining (AM) pipeline takes a text as input and provides its argumentative structure as output by identifying and classifying the argument units and argument relations in the text. In this work, we approach AM using fine-tuned large language models (LLMs). We model the three main sub-tasks of the AM pipeline, as well as their joint formulation, as text generation tasks. We fine-tune eight popular quantized and non-quantized LLMs – LLaMA-3, LLaMA-3.1, Gemma-2, Mistral, Phi-3, Qwen-2 – which are among the most capable open-weight models, on the benchmark PE, AbstRCT, and CDCP datasets that represent diverse data sources. Our approach achieves state-of-the-art results across all AM sub-tasks and datasets, showing significant improvements over previous benchmarks.
pdf
bib
abs
Beyond Surprisal: A Dual Metric Framework for Lexical Skill Acquisition in LLMs
Nazanin Shafiabadi
|
Guillaume Wisniewski
Many studies have explored when and how LLMs learn to use specific words, primarily by examining their learning curves. While these curves capture a model’s capacity to use words correctly in context, they often neglect the equally important skill of avoiding incorrect usage. In this paper, we introduce a new metric, anti-surprisal, which measures a model’s capacity to refrain from using words in inappropriate or unexpected contexts. By examining both correct usage and error avoidance, we offer a more comprehensive perspective on the learning dynamics of LLMs.
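A minimal reading of the proposed metric is the surprisal a model assigns to a word in a context where that word would be inappropriate, so that higher values indicate better error avoidance. The sketch below implements that reading with GPT-2 and scores only the word's first sub-token; both choices are simplifying assumptions rather than the authors' definition.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def anti_surprisal(context: str, inappropriate_word: str) -> float:
    """Assumed reading of the metric: the surprisal of a word in a context where it does not
    belong. The higher the value, the better the model refrains from the misuse."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + inappropriate_word).input_ids
    logits = model(ids).logits[0, -1]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Simplification: score only the first sub-token of the word.
    return -log_probs[word_ids[0]].item()

# A model should be "anti-surprised" to see "barked" continuing a cat subject.
print(anti_surprisal("The cat sat on the mat and", "barked"))
```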
pdf
bib
abs
RUAccent: Advanced System for Stress Placement in Russian with Homograph Resolution
Denis Andreevich Petrov
This paper presents a novel approach to the problem of stress placement in Russian text, with a particular focus on resolving homographs. We introduce a comprehensive system that combines morphological analysis, context-aware neural models, and a specialized “Ё-fikator” to accurately place stress in Russian words, including those with ambiguous pronunciations. Our system outperforms existing solutions, achieving a 0.96 accuracy on homographs and 0.97 accuracy on non-homograph words.
pdf
bib
abs
On the Effects of Fine-tuning Language Models for Text-Based Reinforcement Learning
Mauricio Gruppi
|
Soham Dan
|
Keerthiram Murugesan
|
Subhajit Chaudhury
Text-based reinforcement learning involves an agent interacting with a fictional environment using observed text and admissible actions in natural language to complete a task. Previous works have shown that agents can succeed in text-based interactive environments even in the complete absence of semantic understanding or other linguistic capabilities. The success of these agents in playing such games suggests that semantic understanding may not be important for the task. This raises an important question about the benefits of LMs in guiding the agents through the game states. In this work, we show that rich semantic understanding leads to efficient training of text-based RL agents. Moreover, we describe the occurrence of semantic degeneration as a consequence of inappropriate fine-tuning of language models in text-based reinforcement learning (TBRL). Specifically, we describe the shift in the semantic representation of words in the LM, as well as how it affects the performance of the agent in tasks that are semantically similar to the training games. These results may help develop better strategies to fine-tune agents in text-based RL scenarios.
pdf
bib
abs
HateBRXplain: A Benchmark Dataset with Human-Annotated Rationales for Explainable Hate Speech Detection in Brazilian Portuguese
Isadora Salles
|
Francielle Vargas
|
Fabrício Benevenuto
Hate speech detection technologies are clearly relevant in Brazil today. Nevertheless, the inability of these technologies to provide reasons (rationales) for their decisions is a limiting factor for their adoption, since they may encode biases that perpetuate social inequalities when propagated at scale. This scenario highlights the urgency of proposing explainable technologies to address hate speech. However, explainable models heavily depend on the availability of data with human-annotated rationales, which are scarce, especially for low-resource languages. To fill this gap, we introduce HateBRXplain, the first benchmark dataset for hate speech detection in Portuguese with text span annotations capturing rationales. We evaluated our corpus using mBERT, BERTimbau, DistilBERTimbau, and PTT5 models, which outperformed the current baselines. We further assessed these models’ explainability using model-agnostic explanation methods (LIME and SHAP). Results demonstrate plausible post-hoc explanations when compared to human annotations. However, the best-performing hate speech detection models failed to provide faithful rationales.
pdf
bib
abs
LLM4RE: A Data-centric Feasibility Study for Relation Extraction
Anushka Swarup
|
Tianyu Pan
|
Ronald Wilson
|
Avanti Bhandarkar
|
Damon Woodard
Relation Extraction (RE) is a multi-task process that is a crucial part of all information extraction pipelines. With the introduction of generative language models, Large Language Models (LLMs) have showcased significant performance boosts for complex natural language processing and understanding tasks. Recent research in RE has also started incorporating these models into its pipelines. However, the full extent of the LLMs’ potential for extracting relations remains unknown. Consequently, this study aims to conduct the first feasibility analysis exploring the viability of LLMs for RE by investigating their robustness to various complex RE scenarios stemming from data-specific characteristics. By conducting an exhaustive analysis of five state-of-the-art LLMs backed by more than 2100 experiments, this study posits that LLMs are not robust enough to tackle complex data characteristics for RE, and that additional research efforts focusing on investigating their behaviors in extracting relationships are needed. The source code for the evaluation pipeline can be found at https://aaig.ece.ufl.edu/projects/relation-extraction .
pdf
bib
abs
Automatic Extraction of Metaphoric Analogies from Literary Texts: Task Formulation, Dataset Construction, and Evaluation
Joanne Boisson
|
Zara Siddique
|
Hsuvas Borkakoty
|
Dimosthenis Antypas
|
Luis Espinosa Anke
|
Jose Camacho-Collados
Extracting metaphors and analogies from free text requires high-level reasoning abilities such as abstraction and language understanding. Our study focuses on the extraction of the concepts forming metaphoric analogies in literary texts. To this end, we construct a novel dataset in this domain with the help of domain experts. We compare the out-of-the-box ability of recent large language models (LLMs) to structure metaphoric mappings from fragments of texts containing rather explicit proportional analogies. The models are further evaluated on the generation of implicit elements of the analogy, which are indirectly suggested in the texts and inferred by human readers. The competitive results obtained by LLMs in our experiments are encouraging and open up new avenues such as automatically extracting analogies and metaphors from text instead of investing resources in domain experts to manually label data.
pdf
bib
abs
Enhancing Retrieval-Augmented Generation: A Study of Best Practices
Siran Li
|
Linus Stenzel
|
Carsten Eickhoff
|
Seyed Ali Bahrainian
Retrieval-Augmented Generation (RAG) systems have recently shown remarkable advancements by integrating retrieval mechanisms into language models, enhancing their ability to produce more accurate and contextually relevant responses. However, the influence of various components and configurations within RAG systems remains underexplored. A comprehensive understanding of these elements is essential for tailoring RAG systems to complex retrieval tasks and ensuring optimal performance across diverse applications. In this paper, we develop several advanced RAG system designs that incorporate query expansion, various novel retrieval strategies, and a novel Contrastive In-Context Learning RAG. Our study systematically investigates key factors, including language model size, prompt design, document chunk size, knowledge base size, retrieval stride, query expansion techniques, Contrastive In-Context Learning knowledge bases, multilingual knowledge bases, and Focus Mode retrieving relevant context at sentence-level. Through extensive experimentation, we provide a detailed analysis of how these factors influence response quality. Our findings offer actionable insights for developing RAG systems, striking a balance between contextual richness and retrieval-generation efficiency, thereby paving the way for more adaptable and high-performing RAG frameworks in diverse real-world scenarios. Our code and implementation details are publicly available.
pdf
bib
abs
From Prejudice to Parity: A New Approach to Debiasing Large Language Model Word Embeddings
Aishik Rakshit
|
Smriti Singh
|
Shuvam Keshari
|
Arijit Ghosh Chowdhury
|
Vinija Jain
|
Aman Chadha
Embeddings play a pivotal role in the efficacy of large language models. They are the bedrock on which these models grasp contextual relationships, foster a more nuanced understanding of language, and consequently perform complex tasks that require a fundamental understanding of human language. Given that these embeddings themselves often reflect or exhibit bias, it stands to reason that these models may also inadvertently learn this bias. In this work, we build on the seminal work of (CITATION) and (CITATION) and propose DeepSoftDebias, an algorithm that uses a neural network to perform ‘soft debiasing’. We exhaustively evaluate this algorithm across a variety of state-of-the-art datasets, accuracy metrics, and challenging NLP tasks. On a wide range of metrics, we find that DeepSoftDebias outperforms the current state-of-the-art methods at reducing bias across gender, race, and religion.
pdf
bib
abs
LaERC-S: Improving LLM-based Emotion Recognition in Conversation with Speaker Characteristics
Yumeng Fu
|
Junjie Wu
|
Zhongjie Wang
|
Meishan Zhang
|
Lili Shan
|
Yulin Wu
|
Bingquan Liu
Emotion recognition in conversation (ERC), the task of discerning human emotions for each utterance within a conversation, has garnered significant attention in human-computer interaction systems. Previous ERC studies focus on speaker-specific information that predominantly stems from relationships among utterances, which lacks sufficient information about the surrounding conversation. Recent research in ERC has sought to exploit pre-trained large language models (LLMs) with speaker modelling to comprehend emotional states. Although these methods have achieved encouraging results, the extracted speaker-specific information struggles to capture emotional dynamics. In this paper, motivated by the fact that speaker characteristics play a crucial role and LLMs have rich world knowledge, we present LaERC-S, a novel framework that stimulates LLMs to explore speaker characteristics, involving the mental state and behavior of interlocutors, for accurate emotion predictions. To endow LLMs with this knowledge, we adopt two-stage learning to make the models reason about speaker characteristics and track the speaker’s emotion in complex conversation scenarios. Extensive experiments on three benchmark datasets demonstrate the superiority of LaERC-S, reaching a new state of the art.
pdf
bib
abs
Analysing Zero-Shot Readability-Controlled Sentence Simplification
Abdullah Barayan
|
Jose Camacho-Collados
|
Fernando Alva-Manchego
Readability-controlled text simplification (RCTS) rewrites texts to lower readability levels while preserving their meaning. RCTS models often depend on parallel corpora with readability annotations on both source and target sides. Such datasets are scarce and difficult to curate, especially at the sentence level. To reduce reliance on parallel data, we explore using instruction-tuned large language models for zero-shot RCTS. Through automatic and manual evaluations, we examine: (1) how different types of contextual information affect a model’s ability to generate sentences with the desired readability, and (2) the trade-off between achieving target readability and preserving meaning. Results show that all tested models struggle to simplify sentences (especially to the lowest levels) due to models’ limitations and characteristics of the source sentences that impede adequate rewriting. Our experiments also highlight the need for better automatic evaluation metrics tailored to RCTS, as standard ones often misinterpret common simplification operations, and inaccurately assess readability and meaning preservation.
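For zero-shot RCTS, the main engineering step is the prompt that states the target readability level and any contextual information to be supplied to the instruction-tuned model. The helper below assembles one such prompt; the CEFR-style level names and the optional context field are illustrative choices, not the exact prompts evaluated in the paper.

```python
def rcts_prompt(sentence: str, target_level: str, context: str = "") -> str:
    """Assemble a zero-shot prompt for readability-controlled sentence simplification.
    The level naming scheme and context field are assumptions for illustration."""
    parts = [
        f"Rewrite the following sentence so that it is readable at CEFR level {target_level},",
        "while preserving its meaning.",
    ]
    if context:
        parts.append(f"Background context: {context}")
    parts.append(f"Sentence: {sentence}")
    parts.append("Simplified sentence:")
    return "\n".join(parts)

print(rcts_prompt(
    "The committee deliberated at considerable length before reaching a verdict.", "A2"))
```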
pdf
bib
abs
The Invalsi Benchmarks: measuring the Linguistic and Mathematical understanding of Large Language Models in Italian
Giovanni Puccetti
|
Maria Cassese
|
Andrea Esuli
While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate generative Large Language Models (LLMs) in this language. This work presents three new benchmarks: Invalsi MATE to evaluate models’ performance on mathematical understanding in Italian, Invalsi ITA to evaluate language understanding in Italian, and Olimpiadi MATE for more complex mathematical understanding. The first two benchmarks are based on the Invalsi tests, which are administered to students aged between 6 and 18 within the Italian school system and have been validated by several experts in teaching and pedagogy; the third comes from the Italian high-school math Olympiad. We evaluate 10 powerful language models on these benchmarks and find that they are bound by 71% accuracy on Invalsi MATE, achieved by Llama 3.1 70b instruct, and by 88% on Invalsi ITA. For both Invalsi MATE and Invalsi ITA, we compare LLMs with the average performance of Italian students and show that Llama 3.1 is the only one to outperform them on Invalsi MATE, while most models do so on Invalsi ITA. We then show that Olimpiadi MATE is more challenging than Invalsi MATE: the highest accuracy, achieved by Llama 3.1 405b instruct, is 45%.
pdf
bib
abs
RRHF-V: Ranking Responses to Mitigate Hallucinations in Multimodal Large Language Models with Human Feedback
Guoqing Chen
|
Fu Zhang
|
Jinghao Lin
|
Chenglong Lu
|
Jingwei Cheng
Multimodal large language models (MLLMs) demonstrate strong capabilities in multimodal understanding, reasoning, and interaction but still face the fundamental limitation of hallucinations, where they generate erroneous or fabricated information. To mitigate hallucinations, existing methods annotate pair-responses (one non-hallucination vs. one hallucination) using manual methods or GPT-4V, and train alignment algorithms to improve the correspondence between images and text. More critically, an image description often involves multiple dimensions (e.g., object attributes, posture, and spatial relationships), making it challenging for the model to comprehensively learn multidimensional information from pair-responses. To this end, in this paper, we propose RRHF-V, the first approach to use rank-responses (one non-hallucination vs. multiple ranked hallucinations) to mitigate multimodal hallucinations. Instead of using pair-responses to train the model, RRHF-V expands the number of hallucinatory responses, so that the responses with different scores in a rank-response enable the model to learn rich semantic information across various dimensions of the image. Further, we propose a scene graph-based approach to construct rank-responses in a cost-effective and automatic manner. We also design a novel training objective based on rank loss and margin loss to balance the differences between hallucinatory responses within a rank-response, thereby improving the model’s image comprehension. Experiments on two MLLMs of different sizes and four widely used benchmarks demonstrate that RRHF-V is effective in mitigating hallucinations and outperforms the DPO method based on pair-responses.
pdf
bib
abs
Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
Beomseok Lee
|
Marco Gaido
|
Ioan Calapodescu
|
Laurent Besacier
|
Matteo Negri
While crowdsourcing is an established solution for facilitating and scaling the collection of speech data, the involvement of non-experts necessitates protocols to ensure final data quality. To reduce the costs of these essential controls, this paper investigates the use of Speech Foundation Models (SFMs) to automate the validation process, examining for the first time the cost/quality trade-off in data acquisition. Experiments conducted on French, German, and Korean data demonstrate that SFM-based validation has the potential to reduce reliance on human validation, resulting in an estimated cost saving of over 40.0% without degrading final data quality. These findings open new opportunities for more efficient, cost-effective, and scalable speech data acquisition.
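One way to picture SFM-based validation is to transcribe each crowdsourced recording with a speech foundation model and accept it only if the transcript stays within a word-error-rate threshold of the prompted text. The Whisper checkpoint, the jiwer-based WER check, and the 0.3 threshold below are assumptions for illustration, not the paper's validation protocol.

```python
import jiwer
from transformers import pipeline

# Hypothetical automatic validation step built on an off-the-shelf speech foundation model.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def validate_recording(audio_path: str, prompt_text: str, max_wer: float = 0.3) -> bool:
    """Accept a crowdsourced recording if its transcript is close enough to the prompt."""
    hypothesis = asr(audio_path)["text"]
    return jiwer.wer(prompt_text.lower(), hypothesis.lower()) <= max_wer
```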
pdf
bib
abs
Improving Accessibility of SCOTUS Opinions: A Benchmark Study and a New Dataset for Generic Heading Prediction and Specific Heading Generation
Malek Yaich
|
Nicolas Hernandez
The opinions of the U.S. Supreme Court (SCOTUS) are known for their extensive length, complex legal language, and lack of titled sections, which pose significant challenges for accessibility and comprehension. This paper defines the task of automatic section titling by proposing both generic and specific headings for each section. Given the scarcity of sections with headings in SCOTUS, we study the possibility of using data from lower courts for training models. A dataset of sections with generic or specific headings covering three courts (SCOTUS and two lower courts) was compiled. A supplementary SCOTUS set was manually annotated with these two types of titles. In order to establish a benchmark, we provide the performance of different systems trained for each subtask. For generic heading prediction, we compare the performance of fine-tuning non-contextual, general, and domain-oriented pretrained language models. Transformer-based sequence-to-sequence models are considered for specific heading generation. Our results show that a fine-tuned LegalBERT can achieve an F1 score of about 0.90 in predicting generic headings. They also show that BART and T5 have similar performance in generating specific headings and that, although this performance is good, there is still room for improvement. In addition, we provide a human assessment to support the generation experiment and show a quasi-linear correlation between human degrees of agreement and the results of conventional measures such as ROUGE and BERTScore.
pdf
bib
abs
SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts
Aihua Pei
|
Zehua Yang
|
Shunan Zhu
|
Ruoxi Cheng
|
Ju Jia
Traditional methods for evaluating the robustness of large language models (LLMs) often rely on standardized benchmarks, which can escalate costs and limit evaluations across varied domains. This paper introduces a novel framework designed to autonomously evaluate the robustness of LLMs by incorporating refined adversarial prompts and domain-constrained knowledge guidelines in the form of knowledge graphs. Our method systematically generates descriptive sentences from domain-constrained knowledge graph triplets to formulate adversarial prompts, enhancing the relevance and challenge of the evaluation. These prompts, generated by the LLM itself and tailored to evaluate its own robustness, undergo a rigorous filtering and refinement process, ensuring that only those with high textual fluency and semantic fidelity are used. This self-evaluation mechanism allows the LLM to evaluate its robustness without the need for external benchmarks. We assess the effectiveness of our framework through extensive testing on both proprietary models like ChatGPT and open-source models such as Llama-3.1, Phi-3, and Mistral. Results confirm that our approach not only reduces dependency on conventional data but also provides a targeted and efficient means of evaluating LLM robustness in constrained domains.
pdf
bib
abs
GLoCIM: Global-view Long Chain Interest Modeling for news recommendation
Zhen Yang
|
Wenhui Wang
|
Tao Qi
|
Peng Zhang
|
TianYun Zhang
|
Ru Zhang
|
Jianyi Liu
|
Yongfeng Huang
Accurately recommending candidate news articles to users has always been the core challenge of news recommendation systems. News recommendation requires modeling user interest to match candidate news. Recent efforts have primarily focused on extracting local subgraph information from a global click graph constructed from the clicked news sequences of all users. However, the computational complexity of extracting global click graph information has hindered the ability to collaboratively exploit, among similar users, the far-reaching linkages hidden between distant nodes in the global click graph. To overcome this problem, we propose Global-view Long Chain Interest Modeling for news recommendation (GLoCIM), which combines neighbor interest with long chain interest distilled from a global click graph, leveraging the collaboration among similar users to enhance news recommendation. We design a long chain selection algorithm and a long chain interest encoder to obtain global-view long chain interest from the global click graph. We further design a gated network to integrate long chain interest with neighbor interest and capture the collaborative interest among similar users, and we aggregate it with a local news category-enhanced representation to generate the final user representation. A candidate news representation is then formed and matched against the user representation to produce recommendations. Experimental results on real-world datasets validate the effectiveness of our method in improving news recommendation performance.
pdf
bib
abs
Linguistic Minimal Pairs Elicit Linguistic Similarity in Large Language Models
Xinyu Zhou
|
Delong Chen
|
Samuel Cahyawijaya
|
Xufeng Duan
|
Zhenguang Cai
We introduce a novel analysis that leverages linguistic minimal pairs to probe the internal linguistic representations of Large Language Models (LLMs). By measuring the similarity between LLM activation differences across minimal pairs, we quantify the linguistic similarity and gain insight into the linguistic knowledge captured by LLMs. Our large-scale experiments, spanning 100+ LLMs and 150k minimal pairs in three languages, reveal properties of linguistic similarity from four key aspects: consistency across LLMs, relation to theoretical categorizations, dependence on semantic context, and cross-lingual alignment of relevant phenomena. Our findings suggest that 1) linguistic similarity is significantly influenced by training data exposure, leading to higher cross-LLM agreement in higher-resource languages. 2) Linguistic similarity strongly aligns with fine-grained theoretical linguistic categories but weakly with broader ones. 3) Linguistic similarity shows a weak correlation with semantic similarity, showing its context-dependent nature. 4) LLMs exhibit limited cross-lingual alignment in their understanding of relevant linguistic phenomena. This work demonstrates the potential of minimal pairs as a window into the neural representations of language in LLMs, shedding light on the relationship between LLMs and linguistic theory.
pdf
bib
abs
MMD-ERE: Multi-Agent Multi-Sided Debate for Event Relation Extraction
Yong Guan
|
Hao Peng
|
Lei Hou
|
Juanzi Li
Event relation extraction (ERE) is becoming increasingly important in the era of large language models. An extensive body of research has explored how performance can be further enhanced by emerging techniques such as chain-of-thought reasoning and self-refinement. In this paper, we introduce MMD-ERE, a multi-agent multi-sided debate approach for event relation extraction that examines how different participants understand event relations before and after a debate. Specifically, to organize the debate, participants are divided into multiple groups, each assigned its own debate topic, and the process effectively integrates both cooperation and confrontation. We also regard the audience as a crucial participant, as their conclusions from an observer’s perspective tend to be more objective. Finally, we compare the participants’ understanding of event relations before and after the debate. Experiments across various ERE tasks and LLMs demonstrate that MMD-ERE outperforms established baselines. Further analysis shows that debates can effectively enhance participants’ understanding of event relations.
pdf
bib
abs
Cross Domain Classification of Education Talk Turns
Achyutarama R. Ganti
|
Steven R. Wilson
|
Geoffrey Louie Wing-Yue
The study of classroom discourse is essential for enhancing child development and educational outcomes in academic settings. Prior research has focused on the annotation of conversational talk-turns within the classroom, offering a statistical analysis of the various types of discourse prevalent in these environments. In this work, we explore the generalizability and transferability of text classifiers trained to predict these discourse codes across educational domains. We examine two distinct English-language classroom datasets from two domains: literacy and math. Our results show that models exhibit high accuracy and generalizability when the training and test datasets originate from the same or similar domains. In situations where limited training data is available in new domains, few-shot and zero-shot approaches exhibit more resilience and are not as affected as their supervised counterparts. We also observe that accompanying each talk turn with dialog-level context improves the accuracy of the generative models. We conclude by offering suggestions on how to enhance the generalization of these methods to novel domains, proposing directions for future studies to investigate new methods for boosting model adaptability across domains.
pdf
bib
abs
Automated Molecular Concept Generation and Labeling with Large Language Models
Zimin Zhang
|
Qianli Wu
|
Botao Xia
|
Fang Sun
|
Ziniu Hu
|
Yizhou Sun
|
Shichang Zhang
Artificial intelligence (AI) is transforming scientific research, with explainable AI methods like concept-based models (CMs) showing promise for new discoveries. However, in molecular science, CMs are less common than black-box models like Graph Neural Networks (GNNs), due to their need for predefined concepts and manual labeling. This paper introduces the Automated Molecular Concept (AutoMolCo) framework, which leverages Large Language Models (LLMs) to automatically generate and label predictive molecular concepts. Through iterative concept refinement, AutoMolCo enables simple linear models to outperform GNNs and LLM in-context learning on several benchmarks. The framework operates without human knowledge input, overcoming limitations of existing CMs while maintaining explainability and allowing easy intervention. Experiments on MoleculeNet and High-Throughput Experimentation (HTE) datasets demonstrate that AutoMolCo-induced explainable CMs are beneficial for molecular science research.
pdf
bib
abs
URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base
Aditya Armaan Khan
|
Mason Stephen Shipton
|
David Anugraha
|
Kaiyao Duan
|
Phuong H. Hoang
|
Eric Khiu
|
A. Seza Doğruöz
|
Annie Lee
URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec that addresses these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves the user experience with robust, customizable distance calculations to better suit the needs of users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
pdf
bib
abs
A Framework for Effective Invocation Methods of Various LLM Services
Can Wang
|
Dianbo Sui
|
Bolin Zhang
|
Xiaoyu Liu
|
Jiabao Kang
|
Zhidong Qiao
|
Zhiying Tu
Large Language Models (LLMs) have shown impressive abilities in solving various natural language processing tasks and are now widely offered as services. LLM services enable users to accomplish tasks without requiring specialized knowledge, simply by paying service providers. However, numerous providers offer various LLM services with variations in pricing, latency, and performance. These factors are also affected by different invocation methods, such as the choice of context and the use of a cache, which lead to unpredictable and uncontrollable service cost and quality. Consequently, utilizing various LLM service invocation methods to construct an effective (cost-saving, low-latency, and high-performance) invocation strategy that best meets task demands becomes a pressing challenge. This paper provides a comprehensive overview of methods that help LLM services be invoked efficiently. Technically, we define the problem of constructing an effective LLM service invocation strategy, and based on this, propose a unified LLM service invocation framework. The framework classifies existing methods into four categories: input abstraction, semantic cache, solution design, and output enhancement, which can be used separately or jointly during the invocation life cycle. We discuss the methods in each category and compare them to provide valuable guidance for researchers. Finally, we emphasize the open challenges in this domain and shed light on future research.
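Of the four categories, a semantic cache is perhaps the easiest to picture: reuse a previously paid-for answer whenever a new query is embedded close enough to a cached one, avoiding another service call. The encoder choice and similarity threshold below are illustrative assumptions, not part of the surveyed framework.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in query encoder

class SemanticCache:
    """Toy semantic cache: return a stored answer when a new query is similar enough
    to a previously seen one, otherwise signal a cache miss."""
    def __init__(self, threshold: float = 0.9):
        self.queries, self.answers, self.threshold = [], [], threshold

    def lookup(self, query: str):
        if not self.queries:
            return None
        q = encoder.encode(query, convert_to_tensor=True)
        cached = encoder.encode(self.queries, convert_to_tensor=True)
        scores = util.cos_sim(q, cached)[0]
        best = int(scores.argmax())
        return self.answers[best] if float(scores[best]) >= self.threshold else None

    def store(self, query: str, answer: str):
        self.queries.append(query)
        self.answers.append(answer)
```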
pdf
bib
abs
DP-FROST: Differentially Private Fine-tuning of Pre-trained Models with Freezing Model Parameters
Daeyoung Hong
|
Woohwan Jung
|
Kyuseok Shim
Training models with differential privacy has received a lot of attention since differential privacy provides a theoretical guarantee of privacy preservation. For a task in a specific domain, using a large-scale pre-trained model from the same domain requires less effort in designing and training the model, since such a model contains general knowledge of the task. However, fine-tuning such models, which have a large number of trainable parameters, under differential privacy results in a large degradation of utility. Thus, we propose methods that effectively fine-tune large-scale pre-trained models for downstream tasks while freezing unimportant parameters and satisfying differential privacy. To select the parameters to be fine-tuned, we propose several efficient methods based on the gradients of model parameters. We show the effectiveness of the proposed methods by performing experiments with real datasets.
pdf
bib
abs
Evaluating LLMs’ Capability to Identify Lexical Semantic Equivalence: Probing with the Word-in-Context Task
Yoshihiko Hayashi
This study proposes a method to evaluate the capability of large language models (LLMs) in identifying lexical semantic equivalence. The Word-in-Context (WiC) task, a benchmark designed to determine whether the meanings of a target word remain identical across different contexts, is employed as a probing task. Experiments are conducted with several LLMs, including proprietary GPT models and open-source models, using zero-shot prompting with adjectives that represent varying levels of semantic equivalence (e.g., “the same”) or inequivalence (e.g., “different”). The fundamental capability to identify lexical semantic equivalence in context is measured using standard accuracy metrics. Consistency across different levels of semantic equivalence is assessed via rank correlation with the expected canonical ranking of precision and recall, reflecting anticipated trends in performance across prompts. The proposed method demonstrates its effectiveness, highlighting the superior capability of GPT-4o, as it consistently outperforms other explored LLMs. Analysis of the WiC dataset, the discriminative properties of adjectives (i.e., their ability to differentiate between levels of semantic equivalence), and linguistic patterns in erroneous cases offer insights into the LLM’s capability and sensitivity. These findings could inform improvements in WiC task performance, although performance enhancement is not the primary focus of this study.
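The probing setup amounts to a zero-shot prompt that asks whether a target word keeps its meaning across two contexts, with the adjective slot varied to express different levels of (in)equivalence. The wording below is a plausible sketch of such a prompt, not the exact template used in the study.

```python
def wic_prompt(word: str, sent1: str, sent2: str, adjective: str = "the same") -> str:
    """Zero-shot Word-in-Context probe: vary `adjective` (e.g. "the same", "identical",
    "different") to test different levels of semantic (in)equivalence. The phrasing is
    an assumption based on the abstract."""
    return (
        f"Sentence 1: {sent1}\n"
        f"Sentence 2: {sent2}\n"
        f"Question: Is the meaning of the word \"{word}\" {adjective} in the two sentences? "
        "Answer Yes or No.\nAnswer:"
    )

print(wic_prompt("bank",
                 "She sat on the bank of the river.",
                 "He deposited cash at the bank.",
                 adjective="identical"))
```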
pdf
bib
abs
Close or Cloze? Assessing the Robustness of Large Language Models to Adversarial Perturbations via Word Recovery
Luke Moffett
|
Bhuwan Dhingra
The current generation of large language models (LLMs) shows a surprising degree of robustness to adversarial perturbations, but it is unclear when these models implicitly recover the original text and when they rely on surrounding context. To isolate this recovery faculty of language models, we study a new diagnostic task — Adversarial Word Recovery — an extension of spellchecking where the inputs may be adversarial. We collect a new dataset using 9 popular perturbation attack strategies from the literature and organize them using a taxonomy of phonetic, typo, and visual attacks. We use this dataset to study the word recovery performance of the current generation of LLMs, finding that proprietary models (GPT-4, GPT-3.5, and Palm-2) match or surpass human performance. Conversely, open-source models (Llama-2, Mistral, Falcon) fall materially short of human performance, especially on visual attacks. For these open models, we show that word recovery performance without context correlates with word recovery performance with context, and ultimately affects downstream task performance on a hateful, offensive, and toxic classification task. Finally, to show that improving word recovery can improve robustness, we mitigate these attacks with a small ByT5 model tuned to recover visually attacked words.
pdf
bib
abs
NüshuRescue: Reviving the Endangered Nüshu Language with AI
Ivory Yang
|
Weicheng Ma
|
Soroush Vosoughi
The preservation and revitalization of endangered and extinct languages is a meaningful endeavor, conserving cultural heritage while enriching fields like linguistics and anthropology. However, these languages are typically low-resource, making their reconstruction labor-intensive and costly. This challenge is exemplified by Nüshu, a rare script historically used by Yao women in China for self-expression within a patriarchal society. To address this challenge, we introduce NüshuRescue, an AI-driven framework designed to train large language models (LLMs) on endangered languages with minimal data. NüshuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence Nüshu-Chinese parallel corpus, the first publicly available dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nüshu and only 35 short examples from NCGold, NüshuRescue achieved 48.69% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. In addition, we developed FastText-based and Seq2Seq models to further support research on Nüshu. NüshuRescue provides a versatile and scalable tool for the revitalization of endangered languages, minimizing the need for extensive human input. All datasets and code have been made publicly available at https://github.com/ivoryayang/NushuRescue.
pdf
bib
abs
TOP-Training: Target-Oriented Pretraining for Medical Extractive Question Answering
Saptarshi Sengupta
|
Connor Heaton
|
Shreya Ghosh
|
Wenpeng Yin
|
Preslav Nakov
|
Suhang Wang
We study extractive question-answering in the medical domain (Medical-EQA). This problem has two main challenges: (i) domain specificity, as most AI models lack necessary domain knowledge, and (ii) extraction-based answering style, which restricts most autoregressive LLMs due to potential hallucinations. To handle those challenges, we propose TOP-Training, a target-oriented pre-training paradigm that stands out among all domain adaptation techniques with two desirable features: (i) TOP-Training moves one step further than popular domain-oriented fine-tuning since it not only moves closer to the target domain, but also familiarizes itself with the target dataset, and (ii) it does not assume the existence of a large set of unlabeled instances from the target domain. Specifically, for a target Medical-EQA dataset, we extract its entities and leverage large language models (LLMs) to generate synthetic texts containing those entities; we then demonstrate that pretraining on this synthetic text data yields better performance on the target Medical-EQA benchmarks. Overall, our contributions are threefold: (i) TOP-Training, a new pretraining technique to effectively adapt LLMs to better solve a target problem, (ii) TOP-Training has a wide application scope because it does not require the target problem to have a large set of unlabeled data, and (iii) our experiments highlight the limitations of autoregressive LLMs, emphasizing TOP-Training as a means to unlock the true potential of bidirectional LLMs.
pdf
bib
abs
Beyond Discrete Personas: Personality Modeling Through Journal Intensive Conversations
Sayantan Pal
|
Souvik Das
|
Rohini K. Srihari
Large Language Models (LLMs) have significantly improved personalized conversational capabilities. However, existing datasets like Persona Chat, Synthetic Persona Chat, and Blended Skill Talk rely on static, predefined personas. This approach often results in dialogues that fail to capture human personalities’ fluid and evolving nature. To overcome these limitations, we introduce a novel dataset with around 400,000 dialogues and a framework for generating personalized conversations using long-form journal entries from Reddit. Our approach clusters journal entries for each author and filters them by selecting the most representative cluster, ensuring that the retained entries best reflect the author’s personality. We further refine the data by capturing the Big Five personality traits—openness, conscientiousness, extraversion, agreeableness, and neuroticism—ensuring that dialogues authentically reflect an individual’s personality. Using Llama 3 70B, we generate high-quality, personality-rich dialogues grounded in these journal entries. Fine-tuning models on this dataset leads to an 11% improvement in capturing personality traits on average, outperforming existing approaches in generating more coherent and personality-driven dialogues.
pdf
bib
abs
Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index
Tyler McDonald
|
Anthony Colosimo
|
Yifeng Li
|
Ali Emami
As prompt engineering research rapidly evolves, evaluations beyond accuracy are crucial for developing cost-effective techniques. We present the Economical Prompting Index (EPI), a novel metric that combines accuracy scores with token consumption, adjusted by a user-specified cost concern level to reflect different resource constraints. Our study examines 6 advanced prompting techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts, across 10 widely-used language models and 4 diverse datasets. We demonstrate that approaches such as Self-Consistency often provide statistically insignificant gains while becoming cost-prohibitive. For example, on high-performing models like Claude 3.5 Sonnet, the EPI of simpler techniques like Chain-of-Thought (0.72) surpasses more complex methods like Self-Consistency (0.64) at slight cost concern levels. Our findings suggest a reevaluation of complex prompting strategies in resource-constrained scenarios, potentially reshaping future research priorities and improving cost-effectiveness for end-users.
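The abstract does not spell out the EPI formula, but the idea of discounting accuracy by token consumption at a user-chosen cost concern level can be sketched as follows; the linear form and the normalising token budget are assumptions made purely for illustration, not the metric defined in the paper.

```python
def economical_prompting_index(accuracy: float, tokens_used: float,
                               cost_concern: float, token_budget: float = 1000.0) -> float:
    """One plausible instantiation of an accuracy-vs-cost index: accuracy discounted by
    normalised token consumption, weighted by the user's cost concern level."""
    return accuracy - cost_concern * (tokens_used / token_budget)

# A cheap Chain-of-Thought run vs. an expensive Self-Consistency run at a slight cost concern.
print(economical_prompting_index(0.80, tokens_used=400, cost_concern=0.1))
print(economical_prompting_index(0.83, tokens_used=4000, cost_concern=0.1))
```

Under such a formulation, a small accuracy gain can easily be wiped out by a tenfold increase in tokens, which mirrors the abstract's observation about Self-Consistency becoming cost-prohibitive.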
pdf
bib
abs
From Priest to Doctor: Domain Adaptation for Low-Resource Neural Machine Translation
Ali Marashian
|
Enora Rice
|
Luke Gessler
|
Alexis Palmer
|
Katharina von der Wense
Many of the world’s languages have insufficient data to train high-performing general neural machine translation (NMT) models, let alone domain-specific models, and often the only available parallel data are small amounts of religious texts. Hence, domain adaptation (DA) is a crucial issue faced by contemporary NMT and has, so far, been underexplored for low-resource languages. In this paper, we evaluate a set of methods from both low-resource NMT and DA in a realistic setting, in which we aim to translate between a high-resource and a low-resource language with access to only: a) parallel Bible data, b) a bilingual dictionary, and c) a monolingual target-domain corpus in the high-resource language. Our results show that the effectiveness of the tested methods varies, with the simplest one, DALI, being most effective. We follow up with a small human evaluation of DALI, which shows that there is still a need for more careful investigation of how to accomplish DA for low-resource NMT.
pdf
bib
abs
Improving Relation Extraction by Sequence-to-sequence-based Dependency Parsing Pre-training
Masaki Asada
|
Makoto Miwa
Relation extraction is a crucial natural language processing task that extracts relational triplets from raw text. Syntactic dependency information has shown its effectiveness for relation extraction tasks. However, in most existing studies, dependency information is used only for traditional encoder-only-based relation extraction, not for generative sequence-to-sequence (seq2seq)-based relation extraction. In this study, we propose a syntax-aware seq2seq pre-trained model for seq2seq-based relation extraction. The model incorporates dependency information into a seq2seq pre-trained language model by continual pre-training with a seq2seq-based dependency parsing task. Experimental results on two widely used relation extraction benchmark datasets show that dependency parsing pre-training can improve relation extraction performance.
pdf
bib
abs
Exploring Language Model Generalization in Low-Resource Extractive QA
Saptarshi Sengupta
|
Wenpeng Yin
|
Preslav Nakov
|
Shreya Ghosh
|
Suhang Wang
In this paper, we investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift, i.e., can LLMs generalize to domains that require specific knowledge, such as medicine and law, in a zero-shot fashion without additional in-domain training? To this end, we devise a series of experiments to explain the performance gap empirically. Our findings suggest that: (a) LLMs struggle with dataset demands of closed domains such as retrieving long answer spans; (b) certain LLMs, despite showing strong overall performance, display weaknesses in meeting basic requirements such as discriminating between domain-specific senses of words, which we link to pre-processing decisions; (c) scaling model parameters is not always effective for cross-domain generalization; and (d) closed-domain datasets are quantitatively very different from open-domain EQA datasets, and current LLMs struggle to deal with them. Our findings point out important directions for improving existing LLMs.
pdf
bib
abs
Explain-Analyze-Generate: A Sequential Multi-Agent Collaboration Method for Complex Reasoning
WenYuan Gu
|
JiaLe Han
|
HaoWen Wang
|
Xiang Li
|
Bo Cheng
Exploring effective collaboration among multiple large language models (LLMs) represents an active research direction, with multi-agent debate (MAD) emerging as a popular approach. MAD involves LLMs independently generating responses and refining them by incorporating feedback from other agents in a debate manner. However, empirical experiments reveal the suboptimal performance of MAD in complex reasoning scenarios. We attribute this to misleading feedback from peer agents with limited individual capabilities. To address this, we propose a novel sequential collaboration framework named Explain-Analyze-Generate (EAG). By decomposing complex tasks into essential subtasks and employing a pipeline approach, EAG enables agents to provide constructive assistance to their peers, ultimately yielding higher performance. We conduct experiments on the comprehensive complex language reasoning benchmark BIG-Bench-Hard (BBH). Our method achieves the highest performance on 19 out of 23 tasks, with an average improvement of 8% across all tasks, and incurs lower costs compared to MAD, demonstrating its effectiveness and efficiency.
pdf
bib
abs
Towards Real-World Rumor Detection: Anomaly Detection Framework with Graph Supervised Contrastive Learning
Chaoqun Cui
|
Caiyan Jia
Current rumor detection methods based on propagation structure learning predominantly treat rumor detection as a class-balanced classification task on limited labeled data. However, real-world social media data exhibits an imbalanced distribution with a minority of rumors among massive regular posts. To address the data scarcity and imbalance issues, we construct two large-scale conversation datasets from Weibo and Twitter and analyze the domain distributions. We find obvious differences between rumor and non-rumor distributions, with non-rumors mostly in entertainment domains while rumors concentrate in news, indicating the conformity of rumor detection to an anomaly detection paradigm. Correspondingly, we propose the Anomaly Detection framework with Graph Supervised Contrastive Learning (AD-GSCL). It heuristically treats unlabeled data as non-rumors and adapts graph contrastive learning for rumor detection. Extensive experiments demonstrate AD-GSCL’s superiority under class-balanced, imbalanced, and few-shot conditions. Our findings provide valuable insights for real-world rumor detection featuring imbalanced data distributions.
pdf
bib
abs
Addressing the Training-Inference Discrepancy in Discrete Diffusion for Text Generation
Masaki Asada
|
Makoto Miwa
This study addresses the discrepancy between training and inference in discrete diffusion models for text generation. We propose two novel strategies: (1) a training schema that considers two-step diffusion processes, allowing the model to use its own predicted output as input for subsequent steps during training and (2) a scheduling technique that gradually increases the probability of using self-generated text as training progresses. Experiments conducted on four widely used text generation benchmark datasets demonstrate that both proposed strategies improve the performance of discrete diffusion models in text generation.
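The two strategies can be pictured as a scheduled choice, at each training step, between the ground-truth corrupted input and the model's own prediction from the previous denoising step. The linear schedule and the 0.5 cap below are illustrative assumptions; the abstract does not specify the paper's actual schedule.

```python
import random

def self_feed_probability(step: int, total_steps: int, max_p: float = 0.5) -> float:
    """Scheduling sketch: the probability of feeding the model its own prediction grows
    linearly over training (the exact schedule used in the paper is an assumption here)."""
    return max_p * step / total_steps

def choose_input(gold_noisy_tokens, self_predicted_tokens, step, total_steps):
    """Two-step training idea: with the scheduled probability, the input to the next
    denoising step is the model's own output rather than the ground-truth corruption."""
    if random.random() < self_feed_probability(step, total_steps):
        return self_predicted_tokens
    return gold_noisy_tokens
```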
pdf
bib
abs
Enhancing Rumor Detection Methods with Propagation Structure Infused Language Model
Chaoqun Cui
|
Siyuan Li
|
Kunkun Ma
|
Caiyan Jia
Pretrained Language Models (PLMs) have excelled in various Natural Language Processing tasks, benefiting from large-scale pretraining and the self-attention mechanism’s ability to capture long-range dependencies. However, their performance on social media application tasks like rumor detection remains suboptimal. We attribute this to mismatches between pretraining corpora and social texts, inadequate handling of unique social symbols, and pretraining tasks ill-suited to modeling the user engagements implicit in propagation structures. To address these issues, we propose a continued pretraining strategy called Post Engagement Prediction (PEP) to infuse information from propagation structures into PLMs. PEP trains models to predict root, branch, and parent relations between posts, capturing the interactions of stance and sentiment crucial for rumor detection. We also curate and release a large-scale Twitter corpus, TwitterCorpus (269 GB of text), and two unlabeled claim conversation datasets with propagation structures (UTwitter and UWeibo). Utilizing these resources and the PEP strategy, we train a Twitter-tailored PLM called SoLM. Extensive experiments demonstrate that PEP significantly boosts rumor detection performance across universal and social media PLMs, even in few-shot scenarios. On benchmark datasets, PEP enhances baseline models by 1.0-3.7% accuracy, even enabling them to outperform current state-of-the-art methods on multiple datasets. SoLM alone, without high-level modules, also achieves competitive results, highlighting the strategy’s effectiveness in learning discriminative post interaction features.
pdf
bib
abs
EffiQA: Efficient Question-Answering with Strategic Multi-Model Collaboration on Knowledge Graphs
Zixuan Dong
|
Baoyun Peng
|
Yufei Wang
|
Jia Fu
|
Xiaodong Wang
|
Xin Zhou
|
Yongxue Shan
|
Kangchen Zhu
|
Weiguo Chen
While large language models (LLMs) have shown remarkable capabilities in natural language processing, they struggle with complex, multi-step reasoning tasks involving knowledge graphs (KGs). Existing approaches that integrate LLMs and KGs either underutilize the reasoning abilities of LLMs or suffer from prohibitive computational costs due to tight coupling. To address these limitations, we propose a novel collaborative framework named EffiQA that can strike a balance between performance and efficiency via an iterative paradigm. EffiQA consists of three stages: global planning, efficient KG exploration, and self-reflection. Specifically, EffiQA leverages the commonsense capability of LLMs to explore potential reasoning pathways through global planning. Then, it offloads semantic pruning to a small plug-in model for efficient KG exploration. Finally, the exploration results are fed to LLMs for self-reflection to further improve global planning and efficient KG exploration. Empirical evidence on multiple KBQA benchmarks shows EffiQA’s effectiveness, achieving an optimal balance between reasoning accuracy and computational costs. We hope the proposed new framework will pave the way for efficient, knowledge-intensive querying by redefining the integration of LLMs and KGs, fostering future research on knowledge-based question answering.
pdf
bib
abs
Language Adaptation of Large Language Models: An Empirical Study on LLaMA2
Shumin Wang
|
Yuexiang Xie
|
Bolin Ding
|
Jinyang Gao
|
Yanyong Zhang
There has been a surge of interest regarding language adaptation of Large Language Models (LLMs) to enhance the processing of texts in low-resource languages. While traditional language models have seen extensive research on language transfer, modern LLMs still necessitate further explorations in language adaptation. In this paper, we present a systematic review of the language adaptation process for LLMs, including vocabulary expansion, continued pre-training, and instruction fine-tuning, which focuses on empirical studies conducted on LLaMA2 and discussions on various settings affecting the model’s capabilities. This study provides helpful insights covering the entire language adaptation process, and highlights the compatibility and interactions between different steps, offering researchers a practical guidebook to facilitate the effective adaptation of LLMs across different languages.
pdf
bib
abs
Dialectal and Low Resource Machine Translation for Aromanian
Alexandru-Iulius Jerpelea
|
Alina Radoi
|
Sergiu Nisioi
We present a neural machine translation system that can translate between Romanian, English, and Aromanian (an endangered Eastern Romance language); the first of its kind. BLEU scores range from 17 to 32 depending on the direction and genre of the text. Alongside, we release the biggest known Aromanian-Romanian bilingual corpus, consisting of 80k cleaned sentence pairs. Additional tools such as an agnostic sentence embedder (used for both text mining and automatic evaluation) and a diacritics converter are also presented. Lastly, we describe the online deployment of our quantized model, considering a CPU-driven limited resource scenario.
pdf
bib
abs
Fine-Grained Features-based Code Search for Precise Query-Code Matching
Xinting Zhang
|
Mengqiu Cheng
|
Mengzhen Wang
|
Songwen Gong
|
Jiayuan Xie
|
Yi Cai
|
Qing Li
Code search aims to quickly locate target code snippets from databases using natural language queries, which promotes code reusability. Existing methods can effectively obtain aligned token-level and query word-level features. However, these studies usually represent the semantics of code and query by averaging the features of each token and word respectively, which makes it difficult to accurately capture the code details that are closely related to the query. To address this issue, we propose a fine-grained code search model that consists of a cross-modal encoder, a mapping layer, and a classification layer. Specifically, we utilize a pre-trained model, GraphCodeBERT, in the cross-modal encoder to align features. In the mapping layer, we introduce a co-attention network to capture the fine-grained interactions between code and query, ensuring a model can precisely identify key code segments relevant to the query. Finally, in the classification layer, we incorporate instruction learning techniques that leverage contextual reasoning to improve the accuracy of query-code matching. Experimental results show that our proposed model significantly outperforms existing methods across multiple programming language datasets.
pdf
bib
abs
VideoQA-TA: Temporal-Aware Multi-Modal Video Question Answering
Zhixuan Wu
|
Bo Cheng
|
Jiale Han
|
Jiabao Ma
|
Shuhao Zhang
|
Yuli Chen
|
Changbo Li
Video question answering (VideoQA) has recently gained considerable attention in the field of computer vision, aiming to generate answers that rely on both linguistic and visual reasoning. However, existing methods often align visual or textual features directly with large language models, which limits the deep semantic association between modalities and hinders a comprehensive understanding of the interactions within spatial and temporal contexts, ultimately leading to sub-optimal reasoning performance. To address this issue, we propose a novel temporal-aware framework for multi-modal video question answering, dubbed VideoQA-TA, which enhances the reasoning ability and accuracy of VideoQA by aligning videos and questions at fine-grained levels. Specifically, an effective Spatial-Temporal Attention mechanism (STA) is designed for video aggregation, transforming video features into spatial and temporal representations while attending to information at different levels. Furthermore, a Temporal Object Injection strategy (TOI) is proposed to align object-level and frame-level information within videos, which further improves accuracy by injecting explicit temporal information. Experimental results on the MSVD-QA, MSRVTT-QA, and ActivityNet-QA datasets demonstrate the superior performance of our proposed method compared with current state-of-the-art methods; visualization analysis further verifies the effectiveness of incorporating temporal information into videos.
pdf
bib
abs
Cross-lingual Social Misinformation Detector based on Hierarchical Mixture-of-Experts Adapter
Haofang Fan
|
Xiran Hu
|
Geng Zhao
The spread of social misinformation has been a global concern, particularly affecting non-native speaker users who are more susceptible to misinformation on foreign social media platforms. In light of this, this study focuses on mitigating the challenges faced by social misinformation detectors in quickly regaining capability after crossing linguistic borders, especially for non-native users with only monolingual social media histories. By integrating sentiment analysis as an auxiliary, less sensitive task, we transform the challenging cross-lingual transfer into a manageable multi-task framework. Then, we propose HierMoE-Adpt, a novel, cost-effective parameter-efficient fine-tuning method based on hierarchical mixture-of-experts adaptation, to enhance cross-lingual social misinformation detection. HierMoE-Adpt includes a hierarchical routing strategy and an expert-mask mechanism, which effectively merge knowledge about understanding posts in a new language with misinformation detection capabilities, helping to recover the performance of personal misinformation detectors in sync with the dynamics of personal international travel.
pdf
bib
abs
Unveiling Performance Challenges of Large Language Models in Low-Resource Healthcare: A Demographic Fairness Perspective
Yue Zhou
|
Barbara Di Eugenio
|
Lu Cheng
This paper studies the performance of large language models (LLMs), particularly regarding demographic fairness, in solving real-world healthcare tasks. We evaluate state-of-the-art LLMs with three prevalent learning frameworks across six diverse healthcare tasks and find significant challenges in applying LLMs to real-world healthcare tasks and persistent fairness issues across demographic groups. We also find that explicitly providing demographic information yields mixed results, while LLM’s ability to infer such details raises concerns about biased health predictions. Utilizing LLMs as autonomous agents with access to up-to-date guidelines does not guarantee performance improvement. We believe these findings reveal the critical limitations of LLMs in healthcare fairness and the urgent need for specialized research in this area.
pdf
bib
abs
A Text Embedding Model with Contrastive Example Mining for Point-of-Interest Geocoding
Hibiki Nakatani
|
Hiroki Teranishi
|
Shohei Higashiyama
|
Yuya Sawada
|
Hiroki Ouchi
|
Taro Watanabe
Geocoding is a fundamental technique that links location mentions to their geographic positions, which is important for understanding texts in terms of where the described events occurred. Unlike most geocoding studies that targeted coarse-grained locations, we focus on geocoding at a fine-grained point-of-interest (POI) level. To address the challenge of finding appropriate geo-database entries from among many candidates with similar POI names, we develop a text embedding-based geocoding model and investigate (1) entry encoding representations and (2) hard negative mining approaches suitable for enhancing the model’s disambiguation ability. Our experiments show that the second factor significantly impacts the geocoding accuracy of the model.
pdf
bib
abs
In-context Continual Learning Assisted by an External Continual Learner
Saleh Momeni
|
Sahisnu Mazumder
|
Zixuan Ke
|
Bing Liu
Existing continual learning (CL) methods mainly rely on fine-tuning or adapting large language models (LLMs). They still suffer from catastrophic forgetting (CF). Little work has been done to exploit in-context learning (ICL) to leverage the extensive knowledge within LLMs for CL without updating any parameters. However, incrementally learning each new task in ICL necessitates adding training examples from each class of the task to the prompt, which hampers scalability as the prompt length increases. This issue not only leads to excessively long prompts that exceed the input token limit of the underlying LLM but also degrades the model’s performance due to the overextended context. To address this, we introduce InCA, a novel approach that integrates an external continual learner (ECL) with ICL to enable scalable CL without CF. The ECL is built incrementally to pre-select a small subset of likely classes for each test instance. By restricting the ICL prompt to only these selected classes, InCA prevents prompt lengths from becoming excessively long, while maintaining high performance. Experimental results demonstrate that InCA significantly outperforms existing CL baselines, achieving substantial performance gains.
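A minimal sketch of the external-continual-learner idea described above, under the assumption that classes are summarized by incrementally updated mean embeddings; the embedder and prompt-building step are hypothetical placeholders, and the paper's actual ECL may differ:

```python
import numpy as np

class MeanClassSelector:
    """Incrementally learned class prototypes used to pre-select candidate classes
    for the in-context-learning prompt. Illustrative stand-in for the paper's ECL."""
    def __init__(self):
        self.sums, self.counts = {}, {}

    def update(self, label, vec):                    # incremental, no gradient updates
        self.sums[label] = self.sums.get(label, 0.0) + vec
        self.counts[label] = self.counts.get(label, 0) + 1

    def top_k(self, vec, k=3):
        protos = {c: self.sums[c] / self.counts[c] for c in self.sums}
        cos = lambda p: float(vec @ p / (np.linalg.norm(vec) * np.linalg.norm(p)))
        return sorted(protos, key=lambda c: cos(protos[c]), reverse=True)[:k]

rng = np.random.default_rng(0)
selector = MeanClassSelector()
for label, center in [("sports", 0.0), ("finance", 3.0), ("travel", -3.0)]:
    for _ in range(20):                              # stream of training instances
        selector.update(label, rng.normal(loc=center, size=16))

test_vec = rng.normal(loc=2.8, size=16)              # stands in for embedding the test text
print(selector.top_k(test_vec, k=2))
# the ICL prompt is then restricted to demonstrations of these candidate classes only
```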
pdf
bib
abs
VaeDiff-DocRE: End-to-end Data Augmentation Framework for Document-level Relation Extraction
Khai Phan Tran
|
Wen Hua
|
Xue Li
Document-level Relation Extraction (DocRE) aims to identify relationships between entity pairs within a document. However, most existing methods assume a uniform label distribution, resulting in suboptimal performance on real-world, imbalanced datasets. To tackle this challenge, we propose a novel data augmentation approach using generative models to enhance data from the embedding space. Our method leverages the Variational Autoencoder (VAE) architecture to capture all relation-wise distributions formed by entity pair representations and augment data for underrepresented relations. To better capture the multi-label nature of DocRE, we parameterize the VAE’s latent space with a Diffusion Model. Additionally, we introduce a hierarchical training framework to integrate the proposed VAE-based augmentation module into DocRE systems. Experiments on two benchmark datasets demonstrate that our method outperforms state-of-the-art models, effectively addressing the long-tail distribution problem in DocRE. Our code is released at: https://github.com/khaitran22/VaeDiff-DocRE
pdf
bib
abs
Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models for Hateful Meme Detection
Jinfa Huang
|
Jinsheng Pan
|
Zhongwei Wan
|
Hanjia Lyu
|
Jiebo Luo
Hateful memes continuously evolve as new ones emerge by blending progressive cultural ideas, rendering existing methods that rely on extensive training obsolete or ineffective. In this work, we propose Evolver, which incorporates Large Multimodal Models (LMMs) via Chain-of-Evolution (CoE) Prompting, by integrating the evolution attribute and in-context information of memes. Specifically, Evolver simulates the evolving and expressing process of memes and reasons through LMMs in a step-by-step manner using an evolutionary pair mining module, an evolutionary information extractor, and a contextual relevance amplifier. Extensive experiments on public FHM, MAMI, and HarM datasets show that CoE prompting can be incorporated into existing LMMs to improve their performance. More encouragingly, it can serve as an interpretive tool to promote the understanding of the evolution of memes.
pdf
bib
abs
An Efficient Dialogue Policy Agent with Model-Based Causal Reinforcement Learning
Kai Xu
|
Zhenyu Wang
|
Yangyang Zhao
|
Bopeng Fang
Dialogue policy learning trains an agent to select dialogue actions and is frequently implemented via deep reinforcement learning (DRL). Model-based reinforcement learning methods build a world model that generates simulated data to alleviate sample inefficiency. However, traditional world model methods merely consider one-step dialogues, leading to an inaccurate environmental simulation. Furthermore, different users may have different intention preferences, while most existing studies lack consideration of the intention-preferences causal relationship. This paper proposes a novel framework for dialogue policy learning named MCA, implemented through model-based reinforcement learning with automatically constructed causal chains. The MCA model utilizes an autoregressive Transformer to model dialogue trajectories, enabling a more accurate simulation of the environment. Additionally, it constructs a causal chains module that outputs latent preference distributions for intention-action pairs, thereby elucidating the relationship between user intentions and agent actions. The experimental results show that MCA can achieve state-of-the-art performances on three dialogue datasets over the compared dialogue agents, highlighting its effectiveness and robustness.
pdf
bib
abs
Re-Cent: A Relation-Centric Framework for Joint Zero-Shot Relation Triplet Extraction
Zehan Li
|
Fu Zhang
|
Kailun Lyu
|
Jingwei Cheng
|
Tianyue Peng
Zero-shot Relation Triplet Extraction (ZSRTE) aims to extract triplets from the context where the relation patterns are unseen during training. Due to the inherent challenges of the ZSRTE task, existing extractive ZSRTE methods often decompose it into named entity recognition and relation classification, which overlooks the interdependence of two tasks and may introduce error propagation. Motivated by the intuition that crucial entity attributes might be implicit in the relation labels, we propose a Relation-Centric joint ZSRTE method named Re-Cent. This approach uses minimal information, specifically unseen relation labels, to extract triplets in one go through a unified model. We develop two span-based extractors to identify the subjects and objects corresponding to relation labels, forming span-pairs. Additionally, we introduce a relation-based correction mechanism that further refines the triplets by calculating the relevance between span-pairs and relation labels. Experiments demonstrate that Re-Cent achieves state-of-the-art performance with fewer parameters and does not rely on synthetic data or manual labor.
pdf
bib
abs
CoMIF: Modeling of Complex Multiple Interaction Factors for Conversation Generation
Yuxuan Chen
|
Wei Wei
|
Shixuan Fan
|
Kaihe Xu
|
Dangyang Chen
Highly realistic human-machine interaction is challenging for open-domain dialogue systems. Although existing methods have achieved notable progress by leveraging various interaction factors (e.g., emotion, personality, topic) for delivering human-like (e.g., empathetic, personalized and semantically-consistent) responses, they typically model each such factor in isolation and thus easily suffer from low-quality response generation. We attribute this limitation to the neglect of implicit correlations among factors. Furthermore, different factors may alternately dominate token-level response generation during decoding, making it harder to generate high-quality responses by applying various factors at the sentence level. To address the issue, we present a unified response generation framework, which is capable of simultaneously modeling Complex Multiple Interaction Factors (named CoMIF) to generate human-like conversations. To model the implicit correlations among factors, CoMIF first employs a dynamic perception module to construct a directed collaborative-graph to jointly learn the dynamics over time of each factor, as well as the cross-dependencies among them. Additionally, we design a scalable post-adaptation module to introduce token-level factor signals to generate more human-like responses with appropriate multiple factors. Extensive experiments over multiple datasets demonstrate that the proposed method achieves superior performance in generating more human-like responses with appropriate multiple factors, compared to the state-of-the-art methods.
pdf
bib
abs
Courtroom-LLM: A Legal-Inspired Multi-LLM Framework for Resolving Ambiguous Text Classifications
Sangkeun Jung
|
Jeesu Jung
In this research, we introduce the Courtroom-LLM framework, a novel multi-LLM structure inspired by legal courtroom processes, aiming to enhance decision-making in ambiguous text classification scenarios. Our approach simulates a courtroom setting within LLMs, assigning roles similar to those of prosecutors, defense attorneys, and judges, to facilitate comprehensive analysis of complex textual cases. We demonstrate that this structured multi-LLM setup can significantly improve decision-making accuracy, particularly in ambiguous situations, by harnessing the synergistic effects of diverse LLM arguments. Our evaluations across various text classification tasks show that the Courtroom-LLM framework outperforms both traditional single-LLM classifiers and simpler multi-LLM setups. These results highlight the advantages of our legal-inspired model in improving decision-making for text classification.
pdf
bib
abs
RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems
Yihong Tang
|
Bo Wang
|
Xu Wang
|
Dongming Zhao
|
Jing Liu
|
Ruifang He
|
Yuexian Hou
Role-playing systems powered by large language models (LLMs) have become increasingly influential in emotional communication applications. However, these systems are susceptible to character hallucinations, where the model deviates from predefined character roles and generates responses that are inconsistent with the intended persona. This paper presents the first systematic analysis of character hallucination from an attack perspective, introducing the RoleBreak framework. Our framework identifies two core mechanisms—query sparsity and role-query conflict—as key factors driving character hallucination. Leveraging these insights, we construct a novel dataset, RoleBreakEval, to evaluate existing hallucination mitigation techniques. Our experiments reveal that even enhanced models trained to minimize hallucination remain vulnerable to attacks. To address these vulnerabilities, we propose a novel defence strategy, the Narrator Mode, which generates supplemental context through narration to mitigate role-query conflicts and improve query generalization. Experimental results demonstrate that Narrator Mode significantly outperforms traditional refusal-based strategies by reducing hallucinations, enhancing fidelity to character roles and queries, and improving overall narrative coherence.
pdf
bib
abs
Enhancing Event Causality Identification with LLM Knowledge and Concept-Level Event Relations
Ya Su
|
Hu Zhang
|
Guangjun Zhang
|
Yujie Wang
|
Yue Fan
|
Ru Li
|
Yuanlong Wang
Event Causality Identification (ECI) aims to identify fine-grained causal relationships between events in an unstructured text. Existing ECI methods primarily rely on knowledge-enhanced and graph-based reasoning approaches, but they often overlook the dependencies between similar events. Additionally, the connection between unstructured text and structured knowledge is relatively weak. Therefore, this paper proposes an ECI method enhanced by LLM Knowledge and Concept-Level Event Relations (LKCER). Specifically, LKCER constructs a conceptual-level heterogeneous event graph by leveraging the local contextual information of related event mentions, generating a more comprehensive global semantic representation of event concepts. At the same time, the knowledge generated by COMET is filtered and enriched using an LLM, strengthening the associations between event pairs and knowledge. Finally, the joint event conceptual representation and knowledge-enhanced event representation are used to uncover potential causal relationships between events. The experimental results show that our method outperforms previous state-of-the-art methods on both benchmarks, EventStoryLine and Causal-TimeBank.
pdf
bib
abs
Cognate Detection for Historical Language Reconstruction of Proto-Sabean Languages: the Case of Ge’ez, Tigrinya, and Amharic
Elleni Sisay Temesgen
|
Hellina Hailu Nigatu
|
Fitsum Assamnew Andargie
As languages evolve, we risk losing ancestral languages. In this paper, we explore Historical Language Reconstruction (HLR) for Proto-Sabean languages, starting with the identification of cognates–sets of words in different related languages that are derived from the same ancestral language. We (1) collect semantically related words in three Afro-Semitic languages from a three-way dictionary, (2) work with linguists to identify cognates and reconstruct the proto-form of the cognates, and (3) experiment with three automatic cognate detection methods and extract cognates from the semantically related words. We then experiment with in-context learning with GPT-4o to generate the proto-language from the cognates and use Sequence-to-Sequence (Seq2Seq) models for HLR.
pdf
bib
abs
Revisiting Cosine Similarity via Normalized ICA-transformed Embeddings
Hiroaki Yamagiwa
|
Momose Oyama
|
Hidetoshi Shimodaira
Cosine similarity is widely used to measure the similarity between two embeddings, while interpretations based on angle and correlation coefficient are common. In this study, we focus on the interpretable axes of embeddings transformed by Independent Component Analysis (ICA), and propose a novel interpretation of cosine similarity as the sum of semantic similarities over axes. The normalized ICA-transformed embeddings exhibit sparsity, enhancing the interpretability of each axis, and the semantic similarity defined by the product of the components represents the shared meaning between the two embeddings along each axis. The effectiveness of this approach is demonstrated through intuitive numerical examples and thorough numerical experiments. By deriving the probability distributions that govern each component and the product of components, we propose a method for selecting statistically significant axes.
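The per-axis view of cosine similarity described above can be checked directly; the sketch below (not the authors' code) uses scikit-learn's FastICA on toy vectors and verifies that the cosine similarity of the normalized ICA-transformed embeddings equals the sum of their component-wise products:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))                    # stand-in embedding matrix (n_words x dim)

ica = FastICA(n_components=50, random_state=0)
S = ica.fit_transform(X)                           # ICA-transformed embeddings
S = S / np.linalg.norm(S, axis=1, keepdims=True)   # normalize each embedding to unit length

a, b = S[0], S[1]
per_axis = a * b                                   # semantic similarity contributed by each axis
cos_sim = float(a @ b)                             # cosine similarity of the unit vectors

# cosine similarity is exactly the sum of the per-axis products
assert np.isclose(cos_sim, per_axis.sum())
print(cos_sim, per_axis.sum())
```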
pdf
bib
abs
Piecing It All Together: Verifying Multi-Hop Multimodal Claims
Haoran Wang
|
Aman Rangapur
|
Xiongxiao Xu
|
Yueqing Liang
|
Haroon Gharwi
|
Carl Yang
|
Kai Shu
Existing claim verification datasets often do not require systems to perform complex reasoning or effectively interpret multimodal evidence. To address this, we introduce a new task: multi-hop multimodal claim verification. This task challenges models to reason over multiple pieces of evidence from diverse sources, including text, images, and tables, and determine whether the combined multimodal evidence supports or refutes a given claim. To study this task, we construct MMCV, a large-scale dataset comprising 15k multi-hop claims paired with multimodal evidence, generated and refined using large language models, with additional input from human feedback. We show that MMCV is challenging even for the latest state-of-the-art multimodal large language models, especially as the number of reasoning hops increases. Additionally, we establish a human performance benchmark on a subset of MMCV. We hope this dataset and its evaluation task will encourage future research in multimodal multi-hop claim verification.
pdf
bib
abs
Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation
Bhargav Shandilya
|
Alexis Palmer
The data and compute requirements of current language modeling technology pose challenges for the processing and analysis of low-resource languages. Declarative linguistic knowledge has the potential to partially bridge this data scarcity gap by providing models with useful inductive bias in the form of language-specific rules. In this paper, we propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing. We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM. The results demonstrate that significant leaps in performance and efficiency are possible with the right combination of: a) linguistic inputs in the form of grammars, b) the interpretive power of LLMs, and c) the trainability of smaller token classification networks. We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages. Our work also offers documentary linguists a more reliable and more usable tool for morphological glossing by providing well-reasoned explanations and confidence scores for each output.
pdf
bib
abs
Large Language Model-Based Event Relation Extraction with Rationales
Zhilei Hu
|
Zixuan Li
|
Xiaolong Jin
|
Long Bai
|
Jiafeng Guo
|
Xueqi Cheng
Event Relation Extraction (ERE) aims to extract various types of relations between different events within texts. Although Large Language Models (LLMs) have demonstrated impressive capabilities in many natural language processing tasks, existing ERE methods based on LLMs still face three key challenges: (1) Time Inefficiency: The existing pairwise method of combining events and determining their relations is time-consuming for LLMs. (2) Low Coverage: When dealing with numerous events in a document, the limited generation length of fine-tuned LLMs restricts the coverage of their extraction results. (3) Lack of Rationale: Essential rationales concerning the results that could enhance the reasoning ability of the model are overlooked. To address these challenges, we propose LLMERE, an LLM-based approach with rationales for the ERE task. LLMERE transforms ERE into a question-and-answer task that may have multiple answers. By extracting all events related to a specified event at once, LLMERE reduces time complexity from O(n²) to O(n), compared to the pairwise method. Subsequently, LLMERE enhances the coverage of extraction results by employing a partitioning strategy that highlights only a portion of the events in the document at a time. In addition to the extracted results, LLMERE is also required to generate corresponding rationales/reasons behind them, in terms of event coreference information or transitive chains of event relations. Experimental results on three widely used datasets show that LLMERE achieves significant improvements over baseline methods.
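To illustrate the complexity claim (a schematic only, not the paper's actual prompts): the pairwise formulation issues one query per event pair, while the multi-answer formulation issues one query per event.

```python
events = ["e1", "e2", "e3", "e4"]

# pairwise formulation: one query per unordered event pair -> O(n^2) queries
pairwise_queries = [f"What is the relation between {a} and {b}?"
                    for i, a in enumerate(events) for b in events[i + 1:]]

# multi-answer formulation: one query per event, asking for all related events -> O(n) queries
grouped_queries = [f"List all events related to {e} and their relation types." for e in events]

print(len(pairwise_queries), len(grouped_queries))   # 6 vs 4
```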
pdf
bib
abs
Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations
James Ford
|
Xingmeng Zhao
|
Dan Schumacher
|
Anthony Rios
We propose a novel framework that leverages Visual Question Answering (VQA) models to automate the evaluation of LLM-generated data visualizations. Traditional evaluation methods often rely on human judgment, which is costly and unscalable, or focus solely on data accuracy, neglecting the effectiveness of visual communication. By employing VQA models, we assess data representation quality and the general communicative clarity of charts. Experiments were conducted using two leading VQA benchmark datasets, ChartQA and PlotQA, with visualizations generated by OpenAI’s GPT-3.5 Turbo and Meta’s Llama 3.1 70B-Instruct models. Our results indicate that LLM-generated charts do not match the accuracy of the original non-LLM-generated charts based on VQA performance measures. Moreover, while our results demonstrate that few-shot prompting significantly boosts the accuracy of chart generation, considerable progress remains to be made before LLMs can fully match the precision of human-generated graphs. This underscores the importance of our work, which expedites the research process by enabling rapid iteration without the need for human annotation, thus accelerating advancements in this field.
pdf
bib
abs
Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study
Bowen Li
|
Wenhan Wu
|
Ziwei Tang
|
Lin Shi
|
John Yang
|
Jinyang Li
|
Shunyu Yao
|
Chen Qian
|
Binyuan Hui
|
Qicheng Zhang
|
Zhiyin Yu
|
He Du
|
Ping Yang
|
Dahua Lin
|
Chao Peng
|
Kai Chen
Recent advancements in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks have predominantly focused on simplified or isolated aspects of coding, such as single-file code generation or repository issue debugging, falling short of measuring the full spectrum of challenges raised by real-world programming activities. In this case study, we explore the performance of LLMs across the entire software development lifecycle with DevEval, encompassing stages including software design, environment setup, implementation, acceptance testing, and unit testing. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
pdf
bib
abs
Making Large Language Models into World Models with Precondition and Effect Knowledge
Kaige Xie
|
Ian Yang
|
John Gunerli
|
Mark Riedl
World models, which encapsulate the dynamics of how actions affect environments, are foundational to the functioning of intelligent agents. In this work, we explore the potential of Large Language Models (LLMs) to operate as world models. Although LLMs are not inherently designed to model real-world dynamics, we show that they can be induced to perform two critical world model functions: determining the applicability of an action based on a given world state, and predicting the resulting world state upon action execution. This is achieved by fine-tuning two separate LLMs—one for precondition prediction and another for effect prediction—while leveraging synthetic data generation techniques. Through human-participant studies, we validate that the precondition and effect knowledge generated by our models aligns with human understanding of world dynamics. We also analyze the extent to which the world model trained on our synthetic data results in an inferred state space that supports the creation of action chains, a necessary property for planning.
pdf
bib
abs
DORA: Dynamic Optimization Prompt for Continuous Reflection of LLM-based Agent
Kun Li
|
Tingzhang Zhao
|
Wei Zhou
|
Songlin Hu
Autonomous agents powered by large language models (LLMs) hold significant potential across various domains. The Reflection framework is designed to help agents learn from past mistakes in complex tasks. While previous research has shown that reflection can enhance performance, our investigation reveals a key limitation: meaningful self-reflection primarily occurs at the beginning of iterations, with subsequent attempts failing to produce further improvements. We term this phenomenon “Early Stop Reflection,” where the reflection process halts prematurely, limiting the agent’s ability to engage in continuous learning. To address this, we propose the DORA method (Dynamic and Optimized Reflection Advice), which generates task-adaptive and diverse reflection advice. DORA introduces an external open-source small language model (SLM) that dynamically generates prompts for the reflection LLM. The SLM uses feedback from the agent and optimizes the prompt generation process through a non-gradient Bayesian Optimization (BO) algorithm, ensuring the reflection process evolves and adapts over time. Our experiments in the MiniWoB++ and Alfworld environments confirm that DORA effectively mitigates the “Early Stop Reflection” issue, enabling agents to maintain iterative improvements and boost performance in long-term, complex tasks. Code is available at https://anonymous.4open.science/r/DORA-44FB/.
pdf
bib
abs
Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning
Yanda Chen
|
Chandan Singh
|
Xiaodong Liu
|
Simiao Zuo
|
Bin Yu
|
He He
|
Jianfeng Gao
Large language models (LLMs) often generate convincing, fluent explanations. However, different from humans, they often generate inconsistent explanations on different inputs. For example, an LLM may explain “all birds can fly” when answering the question “Can sparrows fly?” but meanwhile answer “no” to the related question “Can penguins fly?”. Explanations should be consistent across related examples so that they allow humans to simulate the LLM’s decision process on multiple examples. We propose explanation-consistency finetuning (EC-finetuning), a method that adapts LLMs to generate more consistent natural-language explanations on related examples. EC-finetuning involves finetuning LLMs on synthetic data that is carefully constructed to contain consistent explanations. Across a variety of question-answering datasets in various domains, EC-finetuning yields a 10.0% relative explanation consistency improvement on 4 finetuning datasets, and generalizes to 7 out-of-distribution datasets not seen during finetuning (+4.5% relative). We will make our code available for reproducibility.
pdf
bib
abs
Propulsion: Steering LLM with Tiny Fine-Tuning
Md Kowsher
|
Nusrat Jahan Prottasha
|
Prakash Bhat
The rapid advancements in Large Language Models (LLMs) have revolutionized natural language processing (NLP) and adjacent fields, yet fine-tuning these models for specific tasks remains computationally expensive and risks degrading pre-learned features. To address these challenges, we propose Propulsion, a novel parameter-efficient fine-tuning (PEFT) method designed to optimize task-specific performance while drastically reducing computational overhead. Inspired by the concept of controlled adjustments in physical motion, Propulsion selectively re-scales specific dimensions of a pre-trained model, guiding output predictions toward task objectives without modifying the model’s parameters. By introducing lightweight, trainable Propulsion parameters at the pre-trained layer, we minimize the number of parameters updated during fine-tuning, thus preventing the overfitting or overwriting of existing knowledge. Our theoretical analysis, supported by Neural Tangent Kernel (NTK) theory, shows that Propulsion approximates the performance of full fine-tuning with far fewer trainable parameters. Empirically, Propulsion reduces the parameter count from 355.3 million to a mere 0.086 million—achieving over a 10x reduction compared to standard approaches like LoRA—while maintaining competitive performance across benchmarks.
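The exact Propulsion re-scaling function is defined in the paper; as a loosely hedged sketch of the general idea (a tiny trainable vector that re-scales the output dimensions of a frozen pre-trained layer), one might write:

```python
import torch
import torch.nn as nn

class PropulsionLike(nn.Module):
    """Frozen pre-trained layer whose output dimensions are re-scaled by a small
    trainable vector. Illustrative only; the paper's exact re-scaling function
    (and where it is applied) may differ."""
    def __init__(self, pretrained_linear: nn.Linear):
        super().__init__()
        self.base = pretrained_linear
        for p in self.base.parameters():
            p.requires_grad = False                        # keep pre-learned weights intact
        # one trainable scale per output dimension, initialized to the identity
        self.scale = nn.Parameter(torch.ones(pretrained_linear.out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) * self.scale                   # steer outputs toward the task

layer = PropulsionLike(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))   # 768 trainable params
```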
pdf
bib
abs
DEGAP: Dual Event-Guided Adaptive Prefixes for Templated-Based Event Argument Extraction with Slot Querying
Guanghui Wang
|
Dexi Liu
|
Jian-Yun Nie
|
Qizhi Wan
|
Rong Hu
|
Xiping Liu
|
Wanlong Liu
|
Jiaming Liu
Recent advancements in event argument extraction (EAE) involve incorporating useful auxiliary information into models during training and inference, such as retrieved instances and event templates. These methods face two challenges: (1) the retrieval results may be irrelevant and (2) templates are developed independently for each event without considering their possible relationship. In this work, we propose DEGAP to address these challenges through simple yet effective components: dual prefixes, i.e., learnable prompt vectors, where the instance-oriented prefix and template-oriented prefix are trained to learn information from different event instances and templates. Additionally, we propose an event-guided adaptive gating mechanism, which can adaptively leverage possible connections between different events and thus capture relevant information from the prefix. Finally, these event-guided prefixes provide relevant information as cues to the EAE model without retrieval. Extensive experiments demonstrate that our method achieves new state-of-the-art performance on four datasets (ACE05, RAMS, WIKIEVENTS, and MLEE). Further analysis shows the impact of different components.
pdf
bib
abs
Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
Dingjie Song
|
Wenjun Wang
|
Shunian Chen
|
Xidong Wang
|
Michael X. Guan
|
Benyou Wang
The rapid advancement of Multimodal Large Language Models (MLLMs) has led to remarkable performances across various domains. However, this progress is accompanied by a substantial surge in the resource consumption of these models. We address this pressing issue by introducing a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. Inspired by human attention patterns in Visual Question Answering (VQA) tasks, TRIM presents a fresh perspective on the selection and reduction of image tokens. The TRIM method has been extensively tested across 12 datasets, and the results demonstrate a significant reduction in computational overhead while maintaining a consistent level of performance. This research marks a critical stride in efficient MLLM development, promoting greater accessibility and sustainability of high-performing models.
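As a hedged sketch of the token-selection idea (not the authors' implementation, which may also handle the discarded tokens differently), image tokens can be scored by a CLIP-style similarity to the text representation and only the top-scoring tokens kept before they reach the LLM:

```python
import torch
import torch.nn.functional as F

def reduce_image_tokens(image_tokens: torch.Tensor,   # (n_tokens, d) projected image tokens
                        text_feature: torch.Tensor,   # (d,) pooled text feature
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the image tokens most similar to the text; drop the rest."""
    sims = F.cosine_similarity(image_tokens, text_feature[None, :], dim=-1)
    k = max(1, int(keep_ratio * image_tokens.size(0)))
    keep = sims.topk(k).indices.sort().values          # preserve original token order
    return image_tokens[keep]

tokens = torch.randn(576, 512)                          # e.g. a 24x24 grid of visual tokens
text = torch.randn(512)
print(reduce_image_tokens(tokens, text).shape)          # roughly 4x fewer tokens
```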
pdf
bib
abs
Leveraging Large Pre-trained Multilingual Models for High-Quality Speech-to-Text Translation on Industry Scenarios
Marko Avila
|
Josep Crego
Speech-to-Text Translation (S2TT) involves converting spoken language from a source language directly into text in a target language. Traditionally, S2TT systems rely on a sequential pipeline that combines Automatic Speech Recognition (ASR) and Machine Translation (MT) models. However, these systems are prone to error propagation and demand substantial resources to develop and train each component independently. Thus, posing a major challenge in industry settings where cost-effective yet highly accurate S2TT solutions are essential. With the increasing availability of multilingual large pre-trained speech models (LPSM), we propose a parameter-efficient framework that integrates one LPSM with a multilingual MT engine. We evaluate the effectiveness of several well-established LPSMs within this framework, focusing on a real-world industry scenario that involves building a system capable of translating between French, English, and Arabic. The results show that high-quality S2TT systems can be built with minimal computational resources, offering an efficient solution for cross-lingual communication.
pdf
bib
abs
SA-DETR: Span Aware Detection Transformer for Moment Retrieval
Tianheng Xiong
|
Wei Wei
|
Kaihe Xu
|
Dangyang Chen
Moment Retrieval aims to locate specific video segments related to the given text. Recently, DETR-based methods, originating from Object Detection, have emerged as effective solutions for Moment Retrieval. These approaches focus on multimodal feature fusion and refining Queries composed of span anchor and content embedding. Despite their success, they often overlook the video-text instance-related information in Query Initialization and the crucial guidance role of span anchors in Query Refinement, leading to inaccurate predictions. To address this, we propose a novel Span Aware DEtection TRansformer (SA-DETR) that leverages the importance of instance-related span anchors. To fully leverage the instance-related information, we generate span anchors based on the video-text pair rather than using learnable parameters, as is common in conventional DETR-based methods, and supervise them with GT labels. To effectively exploit the correspondence between span anchors and video clips, we enhance content embedding guided by textual features and generate a Gaussian mask to modulate the interaction between content embedding and fusion features. Furthermore, we explore the feature alignment across various stages and granularities and apply denoising learning to boost the span awareness of the model. Extensive experiments on QVHighlights, Charades-STA, and TACoS demonstrate the effectiveness of our approach.
pdf
bib
abs
Aligning LLMs with Individual Preferences via Interaction
Shujin Wu
|
Yi R. Fung
|
Cheng Qian
|
Jeonghwan Kim
|
Dilek Hakkani-Tur
|
Heng Ji
As large language models (LLMs) demonstrate increasingly advanced capabilities, aligning their behaviors with human values and preferences becomes crucial for their wide adoption. While previous research focuses on general alignment to principles such as helpfulness, harmlessness, and honesty, the need to account for individual and diverse preferences has been largely overlooked, potentially undermining customized human experiences. To address this gap, we train LLMs that can “interact to align”, essentially cultivating the meta-skill of LLMs to implicitly infer the unspoken personalized preferences of the current user through multi-turn conversations, and then dynamically align their following behaviors and responses to these inferred preferences. Our approach involves establishing a diverse pool of 3,310 distinct user personas by initially creating seed examples, which are then expanded through iterative self-generation and filtering. Guided by distinct user personas, we leverage multi-LLM collaboration to develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. Finally, we apply supervised fine-tuning and reinforcement learning to enhance LLMs using this dataset. For evaluation, we establish the ALOE (ALign with custOmized prEferences) benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations. Experimental results demonstrate the effectiveness of our method in enabling dynamic, personalized alignment via interaction. The code and dataset will be made public.
pdf
bib
abs
Automatic Evaluation of Language Generation Technology Based on Structure Alignment
Katsuki Chousa
|
Tsutomu Hirao
Language generation techniques require automatic evaluation to carry out efficient and reproducible experiments. While n-gram matching is standard, it fails to capture semantic equivalence with different wording. Recent methods have addressed this issue by using contextual embeddings from pre-trained language models to compute the similarity between reference and hypothesis. However, these methods frequently disregard the syntax of sentences, despite its crucial role in determining meaning, and thus assign unjustifiably high scores. This paper proposes an automatic evaluation metric that considers both the words in sentences and their syntactic structures. We integrate syntactic information into the recent embedding-based approach. Experimental results obtained from two NLP tasks show that our method is at least comparable to standard baselines.
pdf
bib
abs
Enhancing Talk Moves Analysis in Mathematics Tutoring through Classroom Teaching Discourse
Jie Cao
|
Abhijit Suresh
|
Jennifer Jacobs
|
Charis Clevenger
|
Amanda Howard
|
Chelsea Brown
|
Brent Milne
|
Tom Fischaber
|
Tamara Sumner
|
James H. Martin
Human tutoring interventions play a crucial role in supporting student learning, improving academic performance, and promoting personal growth. This paper focuses on analyzing mathematics tutoring discourse using talk moves—a framework of dialogue acts grounded in Accountable Talk theory. However, scaling the collection, annotation, and analysis of extensive tutoring dialogues to develop machine learning models is a challenging and resource-intensive task. To address this, we present SAGA22, a compact dataset, and explore various modeling strategies, including dialogue context, speaker information, pretraining datasets, and further fine-tuning. By leveraging existing datasets and models designed for classroom teaching, our results demonstrate that supplementary pretraining on classroom data enhances model performance in tutoring settings, particularly when incorporating longer context and speaker information. Additionally, we conduct extensive ablation studies to underscore the challenges in talk move modeling.
pdf
bib
abs
How to Leverage Digit Embeddings to Represent Numbers?
Jasivan Alex Sivakumar
|
Nafise Sadat Moosavi
Within numerical reasoning, understanding numbers themselves is still a challenge for existing language models. Simple generalisations, such as solving 100+200 instead of 1+2, can substantially affect model performance (Sivakumar and Moosavi, 2023). Among various techniques, character-level embeddings of numbers have emerged as a promising approach to improve number representation. However, this method has limitations as it leaves the task of aggregating digit representations to the model, which lacks direct supervision for this process. In this paper, we explore the use of mathematical priors to compute aggregated digit embeddings and explicitly incorporate these aggregates into transformer models. This can be achieved either by adding a special token to the input embeddings or by introducing an additional loss function to enhance correct predictions. We evaluate the effectiveness of incorporating this explicit aggregation, analysing its strengths and shortcomings, and discuss future directions to better benefit from this approach. Our methods, while simple, are compatible with any pretrained model, easy to implement, and have been made publicly available.
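One plausible mathematical prior for the aggregation described above is place value; the sketch below (an illustrative assumption, not necessarily the paper's weighting) forms a single number embedding as a normalized, place-value-weighted sum of digit embeddings:

```python
import numpy as np

def aggregate_digit_embeddings(number: str, digit_emb: np.ndarray) -> np.ndarray:
    """digit_emb: (10, d) lookup table of character-level embeddings for digits 0-9.
    Returns one place-value-weighted vector for the whole number."""
    digits = [int(ch) for ch in number]
    weights = np.array([10.0 ** (len(digits) - 1 - i) for i in range(len(digits))])
    weights = weights / weights.sum()                   # normalize the place-value prior
    return (weights[:, None] * digit_emb[digits]).sum(axis=0)

rng = np.random.default_rng(0)
digit_emb = rng.normal(size=(10, 32))                   # toy digit embedding table
vec = aggregate_digit_embeddings("203", digit_emb)      # could be injected as a special token
print(vec.shape)
```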
pdf
bib
abs
AdaCQR: Enhancing Query Reformulation for Conversational Search via Sparse and Dense Retrieval Alignment
Yilong Lai
|
Jialong Wu
|
Congzhi Zhang
|
Haowen Sun
|
Deyu Zhou
Conversational Query Reformulation (CQR) has significantly advanced in addressing the challenges of conversational search, particularly those stemming from the latent user intent and the need for historical context. Recent works aimed to boost the performance of CQR through alignment. However, they are designed for one specific retrieval system, which potentially results in sub-optimal generalization. To overcome this limitation, we present a novel framework AdaCQR. By aligning reformulation models with both term-based and semantic-based retrieval systems, AdaCQR enhances the generalizability of information-seeking queries among diverse retrieval environments through a two-stage training strategy. Moreover, two effective approaches are proposed to obtain superior labels and diverse input candidates, boosting the efficiency and robustness of the framework. Experimental results on the TopiOCQA, QReCC and TREC CAsT datasets demonstrate that AdaCQR outperforms the existing methods in a more efficient framework, offering both quantitative and qualitative improvements in conversational query reformulation.
pdf
bib
abs
EERPD: Leveraging Emotion and Emotion Regulation for Improving Personality Detection
Zheng Li
|
Sujian Li
|
Dawei Zhu
|
Qilong Ma
|
Weimin Xiong
Personality is a fundamental construct in psychology, reflecting an individual’s behavior, thinking, and emotional patterns. While previous research has made progress in personality detection, the proposed methods generally overlook the important connection between the psychological concept of “emotion regulation” and personality traits. Based on this, we propose a new personality detection method called EERPD. This method introduces the use of emotion regulation, a psychological concept highly correlated with personality, for personality prediction. By combining this concept with emotion features, EERPD retrieves few-shot examples and provides process CoTs for inferring labels from text. This approach enhances the LLM’s understanding of the personality implicit within text and improves the performance in personality detection. Experimental results demonstrate that EERPD significantly enhances the accuracy and robustness of personality detection, outperforming previous SOTA by 15.05/4.29 in average F1 on the two benchmark datasets.
pdf
bib
abs
Linear Recency Bias During Training Improves Transformers’ Fit to Reading Times
Christian Clark
|
Byung-Doh Oh
|
William Schuler
Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates from a Transformer that includes ALiBi during training and inference show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi’s mixture of slopes—which determine the rate of memory decay in each attention head—may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.
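For reference, ALiBi (Press et al., 2022) adds a per-head linear penalty to pre-softmax attention scores in proportion to key-query distance; a generic sketch of the bias construction (not the authors' training setup) follows:

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    # geometric sequence of slopes 2^(-8/n), 2^(-16/n), ..., as in Press et al. (2022)
    # (shown here for head counts that are powers of two)
    return np.array([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    # bias[h, i, j] = -slope_h * (i - j) for j <= i; added to pre-softmax attention scores
    distance = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    distance = np.tril(distance)                      # causal setting: only look back
    return -alibi_slopes(n_heads)[:, None, None] * distance

bias = alibi_bias(n_heads=8, seq_len=5)
# per head: attention = softmax(q @ k.T / sqrt(d) + bias[h])
print(bias[0])
```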
pdf
bib
abs
ProsodyFlow: High-fidelity Text-to-Speech through Conditional Flow Matching and Prosody Modeling with Large Speech Language Models
Haoyu Wang
|
Sizhe Shan
|
Yinlin Guo
|
Yuehai Wang
Text-to-speech (TTS) has seen significant advancements in high-quality, expressive speech synthesis. However, achieving diverse and natural prosody in synthesized speech remains challenging. In this paper, we propose ProsodyFlow, an end-to-end TTS model that integrates large self-supervised speech models and conditional flow matching to model prosodic features effectively. Our approach involves using a speech LLM to extract acoustic features, mapping these features into a prosody latent space, and then employing conditional flow matching to generate prosodic vectors conditioned on the input text. Experiments on the LJSpeech dataset show that ProsodyFlow improves synthesis quality and efficiency compared to existing models, achieving more prosodic and expressive speech synthesis.
pdf
bib
abs
Mitigating Out-of-Entity Errors in Named Entity Recognition: A Sentence-Level Strategy
Guochao Jiang
|
Ziqin Luo
|
Chengwei Hu
|
Zepeng Ding
|
Deqing Yang
Many previous models of named entity recognition (NER) suffer from the problem of Out-of-Entity (OOE), i.e., the tokens in the entity mentions of the test samples have not appeared in the training samples, which hinders the achievement of satisfactory performance. To improve OOE-NER performance, in this paper, we propose a new framework, namely S+NER, which fully leverages sentence-level information. Our S+NER achieves better OOE-NER performance mainly due to the following two particular designs. 1) It first exploits the pre-trained language model’s capability of understanding the target entity’s sentence-level context with a template set. 2) Then, it refines the sentence-level representation based on the positive and negative templates, through a contrastive learning strategy and template pooling method, to obtain better NER results. Our extensive experiments on five benchmark datasets have demonstrated that our S+NER outperforms several state-of-the-art OOE-NER models.
pdf
bib
abs
Cross-lingual Evaluation of Multilingual Text Generation
Shamil Chollampatt
|
Minh Quang Pham
|
Sathish Reddy Indurthi
|
Marco Turchi
Scaling automatic evaluation of multilingual text generation of LLMs to new tasks, domains, and languages remains a challenge. Traditional evaluation on benchmark datasets carries the risk of reference data leakage in LLM training or involves additional human annotation effort. The alternative strategy of using another LLM as a scorer also faces uncertainty about the ability of this LLM itself to score non-English text. To address these issues, we propose an annotation-free cross-lingual evaluation protocol for multilingual text generation. Given an LLM candidate to be evaluated and a set of non-English inputs for a particular text generation task, our method first generates English references from the translation of the non-English inputs into English. This is done by an LLM that excels in the equivalent English text generation task. The non-English text generated by the LLM candidate is compared against the generated English references using a cross-lingual evaluation metric to assess the ability of the candidate LLM on multilingual text generation. Our protocol shows a high correlation to the reference-based ROUGE metric in four languages on news text summarization. We also evaluate a diverse set of LLMs in over 90 languages with different prompting strategies to study their multilingual generative abilities.
pdf
bib
abs
Norm of Mean Contextualized Embeddings Determines their Variance
Hiroaki Yamagiwa
|
Hidetoshi Shimodaira
Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we found a strong trade-off relationship between the norm and the variance: as the mean embedding becomes closer to the origin, the variance increases. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the within-cluster variance and the between-cluster variance. We found experimentally that as the layers of Transformer models deepen, the embeddings move farther from the origin, the between-cluster variance relatively decreases, and the within-cluster variance relatively increases. These results are consistent with existing studies on the anisotropy of the embedding spaces across layers.
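Both identities invoked above can be verified numerically; the sketch below (not from the paper) checks that the variance equals the mean squared norm minus the squared norm of the mean, and that the total variance decomposes into within-cluster plus between-cluster variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy contextualized embeddings for 3 "tokens" (clusters), 200 contexts each, dim 16
clusters = [rng.normal(loc=c, scale=1.0, size=(200, 16)) for c in (-2.0, 0.0, 3.0)]
X = np.vstack(clusters)

mu = X.mean(axis=0)
total_var = np.mean(np.sum((X - mu) ** 2, axis=1))

# identity 1: variance = E[||x||^2] - ||E[x]||^2
assert np.isclose(total_var, np.mean(np.sum(X ** 2, axis=1)) - np.sum(mu ** 2))

# identity 2: total variance = within-cluster variance + between-cluster variance
sizes = np.array([len(c) for c in clusters])
centroids = np.array([c.mean(axis=0) for c in clusters])
within = sum(n * np.mean(np.sum((c - m) ** 2, axis=1))
             for c, m, n in zip(clusters, centroids, sizes)) / sizes.sum()
between = np.sum(sizes[:, None] * (centroids - mu) ** 2) / sizes.sum()
assert np.isclose(total_var, within + between)
print(total_var, within, between)
```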
pdf
bib
abs
Exploring the Impacts of Feature Fusion Strategy in Multi-modal Entity Alignment
Chenxiao Li
|
Jingwei Cheng
|
Qiang Tong
|
Fu Zhang
Multi-modal entity alignment aims to identify equivalent entities between two different multi-modal knowledge graphs, which consist of structural triples and images associated with entities. Unfortunately, prior works fuse the multi-modal knowledge of all entities via a single fusion strategy. Therefore, the impact of the fusion strategy on individual entities may be largely ignored. To solve this challenge, we propose AMF2SEA, an adaptive multi-modal feature fusion strategy for entity alignment, which dynamically selects the optimal entity-level feature fusion strategy. Additionally, we build a new dataset based on DBP15K, which includes a full set of entity images from multiple inconsistent web sources, making it more representative of the real world. Experimental results demonstrate that our model achieves state-of-the-art (SOTA) performance compared to models using the same modality on DBP15K and its variants with richer image sources and styles. Our code and data are available at https://github.com/ChenxiaoLiJoe/AMFFSEA.
pdf
bib
abs
Extrapolating to Unknown Opinions Using LLMs
Kexun Zhang
|
Jane Dwivedi-Yu
|
Zhaojiang Lin
|
Yuning Mao
|
William Yang Wang
|
Lei Li
|
Yi-Chia Wang
From ice cream flavors to climate change, people exhibit a wide array of opinions on various topics, and understanding the rationale for these opinions can promote healthy discussion and consensus among them. As such, it can be valuable for a large language model (LLM), particularly as an AI assistant, to be able to empathize with or even explain these various standpoints. In this work, we hypothesize that different topic stances often manifest correlations that can be used to extrapolate to topics with unknown opinions. We explore various prompting and fine-tuning methods to improve an LLM’s ability to (a) extrapolate from opinions on known topics to unknown ones and (b) support their extrapolation with reasoning. Our findings suggest that LLMs possess inherent knowledge from training data about these opinion correlations, and with minimal data, the similarities between human opinions and model-extrapolated opinions can be improved by more than 50%. Furthermore, LLM can generate the reasoning process behind their extrapolation of opinions.
pdf
bib
abs
How Likely Do LLMs with CoT Mimic Human Reasoning?
Guangsheng Bao
|
Hongbo Zhang
|
Cunxiang Wang
|
Linyi Yang
|
Yue Zhang
Chain-of-thought emerges as a promising technique for eliciting reasoning capabilities from Large Language Models (LLMs). However, it does not always improve task performance or accurately represent reasoning processes, leaving unresolved questions about its usage. In this paper, we diagnose the underlying mechanism by comparing the reasoning process of LLMs with humans, using causal analysis to understand the relationships between the problem instruction, reasoning, and the answer in LLMs. Our empirical study reveals that LLMs often deviate from the ideal causal chain, resulting in spurious correlations and potential consistency errors (inconsistent reasoning and answers). We also examine various factors influencing the causal structure, finding that in-context learning with examples strengthens it, while post-training techniques like supervised fine-tuning and reinforcement learning on human feedback weaken it. To our surprise, the causal structure cannot be strengthened by enlarging the model size only, urging research on new techniques. We hope that this preliminary study will shed light on understanding and improving the reasoning process in LLM.
pdf
bib
abs
SGMEA: Structure-Guided Multimodal Entity Alignment
Jingwei Cheng
|
Mingxiao Guo
|
Fu Zhang
Multimodal Entity Alignment (MMEA) aims to identify equivalent entities across different multimodal knowledge graphs (MMKGs) by integrating structural information, entity attributes, and visual data, thereby promoting knowledge sharing and deep multimodal data integration. However, existing methods often overlook the deeper connections between multimodal data. They primarily focus on the interactions between neighboring entities in the structural modality while neglecting the interactions between entities in the visual and attribute modalities. To address this, we propose a structure-guided multimodal entity alignment method (SGMEA), which prioritizes structural information from knowledge graphs to enhance the visual and attribute modalities. By fusing multimodal representations, SGMEA improves the accuracy of entity alignment. Experimental results demonstrate that SGMEA achieves state-of-the-art performance across multiple datasets, validating its effectiveness and superiority in practical applications.
pdf
bib
abs
Unveiling Fake News with Adversarial Arguments Generated by Multimodal Large Language Models
Xiaofan Zheng
|
Minnan Luo
|
Xinghao Wang
In the era of social media, the proliferation of fake news has created an urgent need for more effective detection methods, particularly for multimodal content. The task of identifying fake news is highly challenging, as it requires broad background knowledge and understanding across various domains. Existing detection methods primarily rely on neural networks to learn latent feature representations, resulting in black-box classifications with limited real-world understanding. To address these limitations, we propose a novel approach that leverages Multimodal Large Language Models (MLLMs) for fake news detection. Our method introduces adversarial reasoning through debates from opposing perspectives. By harnessing the powerful capabilities of MLLMs in text generation and cross-modal reasoning, we guide these models to engage in multimodal debates, generating adversarial arguments based on contradictory evidence from both sides of the issue. We then utilize these arguments to learn reasonable thinking patterns, enabling better multimodal fusion and fine-tuning. This process effectively positions our model as a debate referee for adversarial inference. Extensive experiments conducted on four fake news detection datasets demonstrate that our proposed method significantly outperforms state-of-the-art approaches.
pdf
bib
abs
Incorporating Review-missing Interactions for Generative Explainable Recommendation
Xi Li
|
Xiaohe Bo
|
Chen Ma
|
Xu Chen
Explainable recommendation has attracted much attention from the academic and industry communities. Traditional models usually leverage user reviews as ground truths for model training, and the interactions without reviews are totally ignored. However, in practice, a large number of users may not leave reviews after purchasing items. In this paper, we argue that the interactions without reviews may also contain comprehensive user preferences, and incorporating them to build an explainable recommender model may further improve the explanation quality. Following this intuition, we first leverage generative models to predict the missing reviews, and then train the recommender model based on all the predicted and original reviews. Specifically, since the reviews are discrete tokens, we regard the review generation process as a reinforcement learning problem, where each token is an action at one step. We hope that the generated reviews are indistinguishable from the real ones. Thus, we introduce a discriminator as a reward model to evaluate the quality of the generated reviews. At last, to smooth the review generation process, we introduce a self-paced learning strategy to first generate shorter reviews and then predict the longer ones. We conduct extensive experiments on three publicly available datasets to demonstrate the effectiveness of our model.
pdf
bib
abs
Transformer-based Speech Model Learns Well as Infants and Encodes Abstractions through Exemplars in the Poverty of the Stimulus Environment
Yi Yang
|
Yiming Wang
|
Jiahong Yuan
Infants are capable of learning language, predominantly through speech and associations, in impoverished environments—a phenomenon known as the Poverty of the Stimulus (POS). Is this ability uniquely human, as an innate linguistic predisposition, or can it be empirically learned through potential linguistic structures from sparse and noisy exemplars? As an early exploratory work, we systematically designed a series of tasks, scenarios, and metrics to simulate the POS. We found that the emerging speech model wav2vec2.0 with pretrained weights from an English corpus can learn well in noisy and sparse Mandarin environments. We then tested various hypotheses and observed three pieces of evidence for abstraction: label correction, categorical patterns, and clustering effects. We concluded that models can encode hierarchical linguistic abstractions through exemplars in POS environments. We hope this work offers new insights into language acquisition from a speech perspective and inspires further research.
pdf
bib
abs
Hire Me or Not? Examining Language Model’s Behavior with Occupation Attributes
Damin Zhang
|
Yi Zhang
|
Geetanjali Bihani
|
Julia Rayz
With their impressive performance on various downstream tasks, large language models (LLMs) have been widely integrated into production pipelines, such as recruitment and recommendation systems. A known issue of models trained on natural language data is the presence of human biases, which can impact the fairness of the system. This paper investigates LLMs’ behavior with respect to gender stereotypes in the context of occupation decision making. Our framework is designed to investigate and quantify the presence of gender stereotypes in LLMs’ behavior via multi-round question answering. Inspired by prior work, we constructed a dataset using a standard occupation classification knowledge base released by authoritative agencies. We tested it on three families of LMs (RoBERTa, GPT, and Llama) and found that all models exhibit gender stereotypes analogous to human biases, but with different preferences. The distinct preferences of GPT-3.5-turbo and Llama2-70b-chat, along with additional analysis indicating that GPT-4o-mini favors female subjects, may imply that current alignment methods are insufficient for debiasing and could introduce new biases that contradict traditional gender stereotypes. Our contributions include a dataset of 73,500 prompts constructed with a taxonomy of real-world occupations and a multi-step verification framework to evaluate models’ behavior regarding gender stereotypes.
pdf
bib
abs
Enhancing Factual Consistency in Text Summarization via Counterfactual Debiasing
Zhenqing Ling
|
Yuexiang Xie
|
Chenhe Dong
|
Ying Shen
Despite significant progress in abstractive text summarization aimed at generating fluent and informative outputs, ensuring the factual consistency of generated summaries remains a crucial and challenging issue. In this study, drawing inspiration from advances in causal inference, we construct causal graphs to analyze the process of abstractive text summarization and identify intrinsic causes of factual inconsistency, namely language bias and irrelevancy bias. We propose CoFactSum, a novel framework that mitigates the causal effects of these biases through counterfactual estimation, thereby enhancing the factual consistency of the generated content. CoFactSum provides two counterfactual estimation strategies: Explicit Counterfactual Masking, which employs a dynamic masking approach, and Implicit Counterfactual Training, which utilizes a discriminative cross-attention mechanism. In addition, we propose a Debiasing Degree Adjustment mechanism to dynamically calibrate the level of debiasing at each decoding step. Extensive experiments on two widely used summarization datasets demonstrate the effectiveness and advantages of CoFactSum in enhancing the factual consistency of generated summaries, outperforming several baseline methods.
pdf
bib
abs
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models
Zike Yuan
|
Ming Liu
|
Hui Wang
|
Bing Qin
Evaluating the graph comprehension and reasoning abilities of Large Language Models (LLMs) is challenging and often incomplete. Existing benchmarks focus primarily on pure graph understanding, lacking a comprehensive evaluation across all graph types and detailed capability definitions. This paper presents GraCoRe, a benchmark for systematically assessing LLMs’ graph comprehension and reasoning. GraCoRe uses a three-tier hierarchical taxonomy to categorize and test models on pure and heterogeneous graphs, subdividing capabilities into 10 distinct areas tested through 19 tasks. Our benchmark includes 11 datasets with 5,140 graphs of varying complexity. We evaluate four closed-source and eight open-source LLMs, conducting thorough analyses from both ability and task perspectives. Key findings reveal that the OpenAI o1 model exhibits notably strong comprehension and reasoning capabilities, that semantic enrichment enhances reasoning performance, that node ordering impacts task success, and that the ability to process longer texts does not necessarily improve graph comprehension or reasoning.
pdf
bib
abs
Exploring Content Predictability in Turn-Taking Through Different Computer-Mediated Communications
Wanqing He
|
Calen C. MacDonald
|
Yejoon Yoo
|
Marcos Eizayaga
|
Ryun Shim
|
Lev D. Katreczko
|
Susan R. Fussell
Previous studies of face-to-face (f2f) communication have suggested that speakers rely heavily on a variety of multi-modal cues to make real-time predictions about upcoming words in rapid turn-taking. To understand how computer-mediated communication (CMC) differs from f2f communication in terms of the prediction mechanism, this study assessed how the loss of multi-modal cues would affect word predictability in turn-taking. Participants watched videos, listened to audio, or read a transcript of f2f conversations. Across these three conditions, they predicted the same set of omitted words with different levels of predictability and semantic relatedness to other words in the context. Results showed that words of higher predictability were more accurately predicted regardless of CMC types. Higher response accuracy but longer response time were observed in conditions with richer cues, and for participants with more positive and less negative self-emotions. Meanwhile, semantic relatedness did not affect predictability. These results confirmed the key role of prediction in language processing and conversation smoothness, especially its importance in CMC.
pdf
bib
abs
VEEF-Multi-LLM: Effective Vocabulary Expansion and Parameter Efficient Finetuning Towards Multilingual Large Language Models
Jiu Sha
|
Mengxiao Zhu
|
Chong Feng
|
Yuming Shang
Large Language Models (LLMs) have brought significant transformations to various aspects of human life and productivity. However, the heavy reliance on vast amounts of data in developing these models has resulted in a notable disadvantage for low-resource languages, such as Nuosu and others, which lack large datasets. Moreover, many LLMs exhibit significant performance discrepancies between high- and low-resource languages, thereby restricting equitable access to technological advances for all linguistic communities. To address these challenges, this paper proposes a low-resource multilingual large language model, termed VEEF-Multi-LLM, constructed through effective vocabulary expansion and parameter-efficient fine-tuning. We introduce a series of methods to address challenges in low-resource languages. First, we adopt byte-level Byte-Pair Encoding to expand the vocabulary for broader multilingual support. We separate input and output embedding weights to boost performance, and apply RoPE for long-context handling as well as RMSNorm for efficient training. To generate high-quality supervised fine-tuning (SFT) data, we use self-training and selective translation, and refine the resulting dataset with the assistance of native speakers to ensure cultural and linguistic accuracy. Our model, VEEF-Multi-LLM-8B, is trained on 600 billion tokens across 50 natural and 16 programming languages. Experimental results show that the model excels in multilingual instruction-following tasks, particularly in translation, outperforming competing models on benchmarks such as XCOPA and XStoryCloze. Although it lags slightly behind English-centric models on some tasks (e.g., m-MMLU), it prioritizes safety, reliability, and inclusivity, making it valuable for diverse linguistic communities. We open-source our models on GitHub and Huggingface.
pdf
bib
abs
PERC: Plan-As-Query Example Retrieval for Underrepresented Code Generation
Jaeseok Yoo
|
Hojae Han
|
Youngwon Lee
|
Jaejin Kim
|
Seung-won Hwang
Code generation with large language models has shown significant promise, especially when employing retrieval-augmented generation (RAG) with few-shot examples. However, selecting effective examples that enhance generation quality remains a challenging task, particularly when the target programming language (PL) is underrepresented. In this study, we present two key findings: (1) retrieving examples whose presented algorithmic plans can be referenced for generating the desired behavior significantly improves generation accuracy, and (2) converting code into pseudocode effectively captures such algorithmic plans, enhancing retrieval quality even when the source and the target PLs are different. Based on these findings, we propose Plan-as-query Example Retrieval for few-shot prompting in Code generation (PERC), a novel framework that utilizes algorithmic plans to identify and retrieve effective examples. We validate the effectiveness of PERC through extensive experiments on the CodeContests, HumanEval and MultiPL-E benchmarks: PERC consistently outperforms the state-of-the-art RAG methods in code generation, both when the source and target programming languages match or differ, highlighting its adaptability and robustness in diverse coding environments.
pdf
bib
abs
Multilingual and Explainable Text Detoxification with Parallel Corpora
Daryna Dementieva
|
Nikolay Babakov
|
Amit Ronen
|
Abinew Ali Ayele
|
Naquee Rizwan
|
Florian Schneider
|
Xintong Wang
|
Seid Muhie Yimam
|
Daniil Alekhseevich Moskovskiy
|
Elisei Stakovskii
|
Eran Kaufman
|
Ashraf Elnagar
|
Animesh Mukherjee
|
Alexander Panchenko
Even with various regulations in place across countries and social media platforms (Government of India, 2021; European Parliament and Council of the European Union, 2022), digital abusive speech remains a significant issue. One potential approach to address this challenge is automatic text detoxification, a text style transfer (TST) approach that transforms toxic language into a more neutral or non-toxic form. To date, the availability of parallel corpora for the text detoxification task (Logacheva et al., 2022; Atwell et al., 2022; Dementieva et al., 2024a) has proven crucial for state-of-the-art approaches. With this work, we extend the parallel text detoxification corpus to new languages—German, Chinese, Arabic, Hindi, and Amharic—and test TST baselines in this extensive multilingual setup. Next, we conduct a first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences, diving deeply into the nuances, similarities, and differences of toxicity and detoxification across 9 languages. Finally, based on the obtained insights, we experiment with a novel text detoxification method inspired by the Chain-of-Thoughts reasoning approach, enhancing the prompting process through clustering on relevant descriptive attributes.
pdf
bib
abs
Semantic Captioning: Benchmark Dataset and Graph-Aware Few-Shot In-Context Learning for SQL2Text
Ali Al Lawati
|
Jason Lucas
|
Prasenjit Mitra
Large Language Models (LLMs) have demonstrated remarkable performance in various NLP tasks, including semantic parsing, which translates natural language into formal code representations. However, the reverse operation, translating code into natural language, termed semantic captioning, has received less attention. This task is increasingly important as LLMs are integrated into platforms for code generation, security analysis, and educational purposes. In this paper, we focus on the captioning of SQL queries (SQL2Text) to address the critical need for understanding and explaining SQL queries in an era where LLM-generated code poses potential security risks. We repurpose semantic parsing datasets for semantic captioning, specifically SQL2Text. To overcome the limited robustness of Text2SQL datasets for the reverse task, we introduce an iterative ICL prompt leveraging GPT-4o to generate multiple additional utterances. We conduct experiments across multiple in-context learning (ICL) methods, emphasizing smaller, more computationally efficient LLMs. Our findings demonstrate that leveraging the inherent graph properties of SQL for few-shot ICL sample selection significantly outperforms random selection by up to 39% in BLEU score and provides better results than alternative approaches. Dataset and codes are accessible.
pdf
bib
abs
Factual Knowledge Assessment of Language Models Using Distractors
Hichem Ammar Khodja
|
Abderrahmane Ait gueni ssaid
|
Frederic Bechet
|
Quentin Brabant
|
Alexis Nasr
|
Gwénolé Lecorvé
Language models encode extensive factual knowledge within their parameters. The accurate assessment of this knowledge is crucial for understanding and improving these models. In the literature, factual knowledge assessment often relies on cloze sentences, which can lead to erroneous conclusions due to the complexity of natural language (out-of-subject continuations, the existence of many correct answers and the several ways of expressing them). In this paper, we introduce a new interpretable knowledge assessment method that mitigates these issues by leveraging distractors—incorrect but plausible alternatives to the correct answer. We propose several strategies for retrieving distractors and determine the most effective one through experimentation. Our method is evaluated against existing approaches, demonstrating solid alignment with human judgment and stronger robustness to verbalization artifacts. The code and data to reproduce our experiments are available on GitHub.
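To make the distractor-based assessment concrete, the following sketch (illustrative only, not the authors' implementation; the model, prompt format, and decision rule are placeholder assumptions) scores the correct answer against distractors by language-model log-probability and counts the fact as known only if the correct answer outranks every distractor.

```python
# Illustrative sketch of distractor-based factual probing (not the paper's code).
# Assumptions: a causal LM scores each candidate completion; a fact counts as
# "known" only if the correct object scores higher than every distractor.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of token log-probabilities of `completion` given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = lm(full_ids).logits.log_softmax(dim=-1)
    start = prompt_ids.shape[1]
    target = full_ids[0, start:]
    # Predictions for position i live at index i - 1 (next-token shift).
    picked = logprobs[0, start - 1 : full_ids.shape[1] - 1].gather(-1, target.unsqueeze(-1))
    return picked.sum().item()

prompt = "The capital of France is"
correct, distractors = " Paris", [" Lyon", " Berlin", " Madrid"]  # plausible but wrong
known = all(completion_logprob(prompt, correct) > completion_logprob(prompt, d)
            for d in distractors)
print("fact considered known:", known)
```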
pdf
bib
abs
Paraphrase Generation Evaluation Powered by an LLM: A Semantic Metric, Not a Lexical One
Quentin Lemesle
|
Jonathan Chevelu
|
Philippe Martin
|
Damien Lolive
|
Arnaud Delhay
|
Nelly Barbot
Evaluating automatic paraphrase production systems is a difficult task as it involves, among other things, assessing the semantic proximity between two sentences. Usual measures are based on lexical distances, or at best on semantic embedding alignments. The rise of Large Language Models (LLMs) has provided tools to model relationships within a text thanks to the attention mechanism. In this article, we introduce ParaPLUIE, a new measure based on a log-likelihood ratio from an LLM, to assess the quality of a potential paraphrase. This measure is compared with usual measures on two datasets well known to the NLP community prior to this study. Three new small datasets have been built to allow metrics to be compared in different scenarios and to avoid data contamination bias. According to our evaluations, the proposed measure is better for sorting pairs of sentences by semantic proximity. In particular, it is much more independent of lexical distance and provides an interpretable classification threshold between paraphrases and non-paraphrases.
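As an illustration of how a log-likelihood ratio can be obtained from an LLM for this purpose, the sketch below (a minimal approximation, not the ParaPLUIE implementation; the model and prompt wording are assumptions) compares the next-token log-probabilities of "Yes" and "No" after a paraphrase question.

```python
# Hedged sketch of an LLM-based log-likelihood-ratio paraphrase score
# (illustrative only; the actual ParaPLUIE prompt and formulation may differ).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; an instruction-tuned LLM would be more natural
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

def llr_paraphrase_score(s1: str, s2: str) -> float:
    """log P(' Yes') - log P(' No') as the next token after a yes/no question."""
    prompt = (f'Sentence A: "{s1}"\nSentence B: "{s2}"\n'
              "Do these two sentences have the same meaning? Answer Yes or No.\nAnswer:")
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logprobs = lm(ids).logits[0, -1].log_softmax(dim=-1)
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return (next_logprobs[yes_id] - next_logprobs[no_id]).item()

print(llr_paraphrase_score("The cat sat on the mat.", "A cat was sitting on the mat."))
```

A higher score indicates stronger model belief that the pair is a paraphrase, and thresholding this ratio yields an interpretable decision boundary.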
pdf
bib
abs
Summarization of Opinionated Political Documents with Varied Perspectives
Nicholas Deas
|
Kathleen McKeown
Global partisan hostility and polarization has increased, and this polarization is heightened around presidential elections. Models capable of generating accurate summaries of diverse perspectives can help reduce such polarization by exposing users to alternative perspectives. In this work, we introduce a novel dataset and task for independently summarizing each political perspective in a set of passages from opinionated news articles. For this task, we propose a framework for evaluating different dimensions of perspective summary performance. We benchmark 11 summarization models and LLMs of varying sizes and architectures through both automatic and human evaluation. While recent models like GPT-4o perform well on this task, we find that all models struggle to generate summaries that are faithful to the intended perspective. Our analysis of summaries focuses on how extraction behavior is impacted by features of the input documents.
pdf
bib
abs
Measuring Contextual Informativeness in Child-Directed Text
Maria R. Valentini
|
Téa Y. Wright
|
Ali Marashian
|
Jennifer M. Ellis
|
Eliana Colunga
|
Katharina von der Wense
To address an important gap in creating children’s stories for vocabulary enrichment, we investigate the automatic evaluation of how well stories convey the semantics of target vocabulary words, a task with substantial implications for generating educational content. We motivate this task, which we call measuring contextual informativeness in children’s stories, and provide a formal task definition as well as a dataset for the task. We further propose a method for automating the task using a large language model (LLM). Our experiments show that our approach reaches a Spearman correlation of 0.4983 with human judgments of informativeness, while the strongest baseline only obtains a correlation of 0.3534. An additional analysis shows that the LLM-based approach is able to generalize to measuring contextual informativeness in adult-directed text, on which it also outperforms all baselines.
pdf
bib
abs
Can Large Language Models Differentiate Harmful from Argumentative Essays? Steps Toward Ethical Essay Scoring
Hongjin Kim
|
Jeonghyun Kang
|
Harksoo Kim
This study addresses critical gaps in Automatic Essay Scoring (AES) systems and Large Language Models (LLMs) with regard to their ability to effectively identify and score harmful essays. Despite advancements in AES technology, current models often overlook ethically and morally problematic elements within essays, erroneously assigning high scores to essays that may propagate harmful opinions. In this study, we introduce the Harmful Essay Detection (HED) benchmark, which includes essays integrating sensitive topics such as racism and gender bias, to test the efficacy of various LLMs in recognizing and scoring harmful content. Our findings reveal that: (1) LLMs require further enhancement to accurately distinguish between harmful and argumentative essays, and (2) both current AES models and LLMs fail to consider the ethical dimensions of content during scoring. The study underscores the need for developing more robust AES systems that are sensitive to the ethical implications of the content they are scoring.
pdf
bib
abs
Zero-Shot Entailment Learning for Ontology-Based Biomedical Annotation Without Explicit Mentions
Rumana Ferdous Munne
|
Noriki Nishida
|
Shanshan Liu
|
Narumi Tokunaga
|
Yuki Yamagata
|
Kouji Kozaki
|
Yuji Matsumoto
Automatic biomedical annotation is essential for advancing medical research, diagnosis, and treatment. However, it presents significant challenges, especially when entities are not explicitly mentioned in the text, leading to difficulties in extraction of relevant information. These challenges are intensified by unclear terminology, implicit background knowledge, and the lack of labeled training data. Annotating with a specific ontology adds another layer of complexity, as it requires aligning text with a predefined set of concepts and relationships. Manual annotation is time-consuming and expensive, highlighting the need for automated systems to handle large volumes of biomedical data efficiently. In this paper, we propose an entailment-based zero-shot text classification approach to annotate biomedical text passages using the Homeostasis Imbalance Process (HOIP) ontology. Our method reformulates the annotation task as a multi-class, multi-label classification problem and uses natural language inference to classify text into related HOIP processes. Experimental results show promising performance, especially when processes are not explicitly mentioned, highlighting the effectiveness of our approach for ontological annotation of biomedical literature.
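The entailment-based zero-shot formulation can be illustrated with an off-the-shelf NLI model. The sketch below (illustrative only; the passage, process labels, and threshold are placeholder assumptions rather than actual HOIP terms) treats each candidate process as a hypothesis and keeps those whose entailment score clears a threshold, giving the multi-class, multi-label behavior described above.

```python
# Minimal sketch of entailment-based zero-shot, multi-label annotation
# (illustrative; labels and threshold are stand-ins, not HOIP ontology entries).
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

passage = ("Excessive cytokine release leads to widespread endothelial damage "
           "and impaired oxygen exchange in the lungs.")
candidate_processes = [          # stand-ins for ontology process labels
    "cytokine storm",
    "endothelial dysfunction",
    "impaired gas exchange",
    "bone remodeling",
]
result = nli(passage, candidate_labels=candidate_processes, multi_label=True)
annotations = [label for label, score in zip(result["labels"], result["scores"])
               if score > 0.5]   # keep every process the passage entails
print(annotations)
```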
pdf
bib
abs
Mitigating Shortcut Learning via Smart Data Augmentation based on Large Language Model
Xinyi Sun
|
Hongye Tan
|
Yaxin Guo
|
Pengpeng Qiang
|
Ru Li
|
Hu Zhang
Data-driven pre-trained language models typically engage in shortcut learning, relying on spurious correlations between the data and the ground truth. This reliance can undermine the robustness and generalization of the model. To address this issue, data augmentation emerges as a promising solution: by integrating anti-shortcut data into the training set, the models’ shortcut-induced biases can be mitigated. However, existing methods encounter three challenges: 1) manual definition of shortcuts is tailored to particular datasets, restricting generalization; 2) the inherent confirmation bias during model training hampers the effectiveness of data augmentation; and 3) insufficient exploration of the relationship between model performance and the quantity of augmented data may result in excessive data consumption. To tackle these challenges, we propose a method of Smart Data Augmentation based on Large Language Models (SAug-LLM). It leverages LLMs to autonomously identify shortcuts and generate their anti-shortcut counterparts. In addition, dual validation is employed to mitigate confirmation bias during model retraining. Furthermore, the data augmentation process is optimized to effectively rectify model biases while minimizing data consumption. We validate the effectiveness and generalization of our method through extensive experiments across various natural language processing tasks, demonstrating an average performance improvement of 5.61%.
pdf
bib
abs
DeTriever: Decoder-representation-based Retriever for Improving NL2SQL In-Context Learning
Raymond Li
|
Yuxi Feng
|
Zhenan Fan
|
Giuseppe Carenini
|
Weiwei Zhang
|
Mohammadreza Pourreza
|
Yong Zhang
While in-context learning (ICL) has proven to be an effective technique for improving the performance of Large Language Models (LLMs) on a variety of complex tasks, notably translating natural language questions into Structured Query Language (NL2SQL), the question of how to select the most beneficial demonstration examples remains an open research problem. While prior works often adapted off-the-shelf encoders to retrieve examples dynamically, an inherent discrepancy exists in the representational capacities of the external retrievers and the LLMs. Further, optimizing the selection of examples is a non-trivial task, since there are no straightforward methods to assess the relative benefits of examples without performing pairwise inference. To address these shortcomings, we propose DeTriever, a novel demonstration retrieval framework that learns a weighted combination of LLM hidden states, in which rich semantic information is encoded. To train the model, we propose a proxy score that estimates the relative benefits of examples based on the similarities between output queries. Experiments on two popular NL2SQL benchmarks demonstrate that our method significantly outperforms state-of-the-art baselines on NL2SQL tasks.
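A minimal sketch of the underlying idea of learning a weighted combination of per-layer hidden states is given below (illustrative only, not the DeTriever implementation; the pooling choice, projection, and dimensions are assumptions).

```python
# Illustrative sketch: learn a mixture over LLM layers to form a retrieval embedding
# (not the DeTriever code; last-token pooling and the projection are assumptions).
import torch
import torch.nn as nn

class WeightedLayerEmbedding(nn.Module):
    """Combine one vector per LLM layer into a single retrieval embedding."""
    def __init__(self, num_layers: int, hidden_size: int, out_dim: int = 256):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # learned mixture weights
        self.proj = nn.Linear(hidden_size, out_dim)

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (batch, num_layers, hidden_size), e.g. last-token states per layer
        weights = self.layer_logits.softmax(dim=0)                 # (num_layers,)
        mixed = torch.einsum("l,blh->bh", weights, layer_states)   # weighted sum over layers
        return nn.functional.normalize(self.proj(mixed), dim=-1)   # unit-norm embedding

# Example: 33 layer outputs (embedding layer + 32 blocks) of width 4096, batch of 2
states = torch.randn(2, 33, 4096)
emb = WeightedLayerEmbedding(num_layers=33, hidden_size=4096)(states)
print(emb.shape)  # torch.Size([2, 256])
```

Such an embedding could then be trained with a similarity-based objective (e.g., the proxy score described above) so that questions with similar output queries end up close in the retrieval space.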
pdf
bib
abs
Improving NMT Models by Retrofitting Quality Estimators into Trainable Energy Loss
Gahyun Yoo
|
Jay Yoon Lee
Reinforcement learning has shown great promise in aligning language models with human preferences in a variety of text generation tasks, including machine translation. For translation tasks, rewards can easily be obtained from quality estimation (QE) models which can generate rewards for unlabeled data. Despite its usefulness, reinforcement learning cannot exploit the gradients with respect to the QE score. We propose QE-EBM, a method of employing quality estimators as trainable loss networks that can directly backpropagate to the NMT model. We examine our method on several low and high resource target languages with English as the source language. QE-EBM outperforms strong baselines such as REINFORCE and proximal policy optimization (PPO) as well as supervised fine-tuning for all target languages, especially low-resource target languages. Most notably, for English-to-Mongolian translation, our method achieves improvements of 2.5 BLEU, 7.1 COMET-KIWI, 5.3 COMET, and 6.4 XCOMET relative to the supervised baseline.
pdf
bib
abs
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Yifan Du
|
Hangyu Guo
|
Kun Zhou
|
Wayne Xin Zhao
|
Jinpeng Wang
|
Chuyuan Wang
|
Mingchen Cai
|
Ruihua Song
|
Ji-Rong Wen
Visual instruction tuning is crucial for enhancing the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs). In this paper, we aim to investigate a fundamental question: “what makes for good visual instructions”. Through a comprehensive empirical study, we find that instructions focusing on complex visual reasoning tasks are particularly effective in improving the performance of MLLMs, with results correlating to instruction complexity. Based on this insight, we develop a systematic approach to automatically create high-quality complex visual reasoning instructions. Our approach employs a synthesize-complicate-reformulate paradigm, leveraging multiple stages to gradually increase the complexity of the instructions while guaranteeing quality. Based on this approach, we create the ComVint dataset with 32K examples, and fine-tune four MLLMs on it. Experimental results consistently demonstrate the enhanced performance of all compared MLLMs, such as a 27.86% and 27.60% improvement for LLaVA on MME-Perception and MME-Cognition, respectively. Our code and data are publicly available at the link: https://github.com/RUCAIBox/ComVint.
pdf
bib
abs
TriFine: A Large-Scale Dataset of Vision-Audio-Subtitle for Tri-Modal Machine Translation and Benchmark with Fine-Grained Annotated Tags
Boyu Guan
|
Yining Zhang
|
Yang Zhao
|
Chengqing Zong
Current video-guided machine translation (VMT) approaches primarily use coarse-grained visual information, resulting in information redundancy, high computational overhead, and neglect of audio content. Our research demonstrates the significance of fine-grained visual and audio information in VMT from both data and methodological perspectives. From the data perspective, we have developed a large-scale dataset TriFine, the first vision-audio-subtitle tri-modal VMT dataset with annotated multimodal fine-grained tags. Each entry in this dataset not only includes the triples found in traditional VMT datasets but also encompasses seven fine-grained annotation tags derived from visual and audio modalities. From the methodological perspective, we propose a Fine-grained Information-enhanced Approach for Translation (FIAT). Experimental results have shown that, in comparison to traditional coarse-grained methods and text-only models, our fine-grained approach achieves superior performance with lower computational overhead. These findings underscore the pivotal role of fine-grained annotated information in advancing the field of VMT.
pdf
bib
abs
Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study
Mingyang Song
|
Mao Zheng
|
Xuan Luo
Utilizing Large Language Models (LLMs) as evaluators to assess the performance of other LLMs has garnered attention. However, this evaluation approach is affected by potential biases within LLMs, raising concerns about the accuracy and reliability of the evaluation results of LLMs. To address this issue, we propose and explore two many-shot In-Context Learning (ICL) prompt templates to help LLM evaluators mitigate potential biases: Many-Shot with Reference (MSwR) and Many-Shot without Reference (MSoR). Specifically, the former utilizes in-context examples with model-generated rationales as references, while the latter does not include these references. Using these prompt designs, we investigate the impact of increasing the number of in-context examples on the consistency and quality of the evaluation results. Experimental results show that advanced LLMs, such as GPT-4, perform better in the many-shot regime than in the zero-shot regime. Furthermore, in most cases, MSwR performs significantly better than MSoR.
pdf
bib
abs
GEAR: A Simple GENERATE, EMBED, AVERAGE AND RANK Approach for Unsupervised Reverse Dictionary
Fatemah Yousef Almeman
|
Luis Espinosa Anke
Reverse Dictionary (RD) is the task of obtaining the most relevant word or set of words given a textual description or dictionary definition. Effective RD methods have applications in accessibility, translation, and writing support systems. Moreover, in NLP research RD is used to benchmark text encoders at various granularities, as it often requires word, definition, and sentence embeddings. In this paper, we propose a simple approach to RD that leverages LLMs in combination with embedding models. Despite its simplicity, this approach outperforms supervised baselines on well-studied RD datasets, while also showing less overfitting. We also conduct a number of experiments on different dictionaries and analyze how different styles, registers, and target audiences impact the quality of RD systems. We conclude that, on average, untuned embeddings alone fare well below an LLM-only baseline (although they are competitive on highly technical dictionaries), but are crucial for boosting performance in combined methods.
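The GENERATE, EMBED, AVERAGE and RANK pipeline named in the title can be sketched in a few lines (illustrative only; the hard-coded candidate list stands in for LLM generations, and the exact combination of definition and candidate embeddings may differ from the paper).

```python
# Hedged sketch of a generate-embed-average-rank reverse-dictionary pipeline
# (illustrative; "llm_candidates" is a stand-in for real LLM output).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

definition = "a feeling of great happiness and excitement"
llm_candidates = ["joy", "elation", "euphoria", "delight"]  # GENERATE: stand-in LLM output
vocabulary = ["joy", "sorrow", "anger", "elation", "bliss", "fatigue", "euphoria"]

# EMBED the definition and candidates, then AVERAGE them into one query vector
query_vecs = encoder.encode([definition] + llm_candidates, normalize_embeddings=True)
query = query_vecs.mean(axis=0)
query /= np.linalg.norm(query)

# RANK vocabulary entries by cosine similarity to the averaged query
vocab_vecs = encoder.encode(vocabulary, normalize_embeddings=True)
ranking = sorted(zip(vocabulary, vocab_vecs @ query), key=lambda pair: -pair[1])
print(ranking[:3])
```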
pdf
bib
abs
Momentum Posterior Regularization for Multi-hop Dense Retrieval
Zehua Xia
|
Yuyang Wu
|
Yiyun Xia
|
Cam Tu Nguyen
Multi-hop question answering (QA) often requires sequential retrieval (multi-hop retrieval), where each hop retrieves missing knowledge based on information from previous hops. To facilitate more effective retrieval, we aim to distill knowledge from a posterior retriever, which has access to posterior information such as the answer, into a prior retriever used during inference when such information is unavailable. Unfortunately, current methods for knowledge distillation in one-time retrieval are ineffective for multi-hop QA due to two issues: 1) posterior information is often defined as the response (i.e., the answer), which may not clearly connect to the query without intermediate retrieval; and 2) the large knowledge gap between prior and posterior retrievers makes distillation with existing methods unstable, even resulting in performance loss. We therefore propose MoPo (Momentum Posterior Regularization) with two key innovations: 1) posterior information for one hop is defined as a query-focused summary of the gold knowledge from the previous and current hops; 2) we develop an effective training strategy in which the posterior retriever is updated along with the prior retriever via a momentum moving-average method, allowing smoother and more effective distillation. Experiments on HotpotQA and StrategyQA demonstrate that MoPo outperforms existing baselines in both retrieval and downstream QA tasks.
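The momentum moving-average update at the heart of this training strategy can be sketched as an exponential moving average between the two retrievers (illustrative only; the momentum value, and which encoder tracks which, are assumptions here rather than details given in the abstract).

```python
# Minimal sketch of a momentum (EMA) update between two retriever encoders,
# in the spirit of MoPo (illustrative; momentum value and direction are assumptions).
import torch

@torch.no_grad()
def momentum_update(follower: torch.nn.Module, leader: torch.nn.Module, m: float = 0.99):
    """follower <- m * follower + (1 - m) * leader, keeping the two encoders close."""
    for p_follow, p_lead in zip(follower.parameters(), leader.parameters()):
        p_follow.mul_(m).add_(p_lead, alpha=1.0 - m)

# Toy example: only `prior` receives gradient updates, while `posterior`
# follows it smoothly via the momentum rule, shrinking the knowledge gap.
prior = torch.nn.Linear(8, 8)
posterior = torch.nn.Linear(8, 8)
posterior.load_state_dict(prior.state_dict())  # start from the same weights

for step in range(3):
    # ... gradient step on `prior` (and the distillation loss) would go here ...
    momentum_update(posterior, prior, m=0.99)
```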
pdf
bib
abs
CaDRL: Document-level Relation Extraction via Context-aware Differentiable Rule Learning
Kunli Zhang
|
Pengcheng Wu
|
Bohan Yu
|
Kejun Wu
|
Aoze Zheng
|
Xiyang Huang
|
Chenkang Zhu
|
Min Peng
|
Hongying Zan
|
Yu Song
Document-level Relation Extraction (DocRE) aims to extract relations from documents. Compared with sentence-level relation extraction, it requires modeling long-distance dependencies. Existing methods enhance the output of trained DocRE models either by learning logical rules or by extracting rules from annotated data and then injecting them into the model. However, these approaches can result in suboptimal performance due to incorrect rule set constraints. To mitigate this issue, we propose Context-aware Differentiable Rule Learning (CaDRL), a novel differentiable rule-based framework that learns document-specific logical rules to avoid generating suboptimal constraints. Specifically, we utilize Transformer-based relation attention to encode document and relation information, thereby learning the contextual information of the relation. We employ a sequence-generated differentiable rule decoder to generate relational probabilistic logic rules at each reasoning step. We also introduce a parameter-sharing training mechanism in CaDRL to reconcile the DocRE model and the rule learning module. Extensive experimental results on three DocRE datasets demonstrate that CaDRL outperforms existing rule-based frameworks, significantly improving DocRE performance and making predictions more interpretable and logical.
pdf
bib
abs
TEF: Causality-Aware Taxonomy Expansion via Front-Door Criterion
Yuan Meng
|
Songlin Zhai
|
Yuxin Zhang
|
Zhongjian Hu
|
Guilin Qi
Taxonomy expansion is a primary method for enriching taxonomies, involving appending a large number of additional nodes (i.e., queries) to an existing taxonomy (i.e., the seed), with the crucial step being the identification of the appropriate anchor (parent node) for each query by incorporating the structural information of the seed. Despite advancements, existing research still faces an inherent challenge of spurious query-anchor matching, often due to various interference factors (e.g., the consistency of sibling nodes), resulting in biased identifications. To address the bias in taxonomy expansion caused by unobserved factors, we introduce the Structural Causal Model (SCM), known for its bias-elimination capabilities, to prevent these factors from confounding the task through backdoor paths. Specifically, we employ the Front-Door Criterion, which guides the decomposition of the expansion process into a parser module and a connector. This enables the proposed causality-aware Taxonomy Expansion Framework (TEF) to isolate confounding effects and reveal the true causal relationship between the query and the anchor. Extensive experiments on three benchmarks validate the effectiveness of TEF, with a notable 6.1% accuracy improvement over the state of the art on the SemEval16-Environment dataset.
pdf
bib
abs
Inside-Outside Algorithm for Probabilistic Product-Free Lambek Categorial Grammar
Jinman Zhao
|
Gerald Penn
The inside-outside algorithm is widely utilized in statistical models related to context-free grammars, where it plays a key role in the EM estimation of probabilistic context-free grammars. In this work, we introduce an inside-outside algorithm for Probabilistic Lambek Categorial Grammar (PLCG).
pdf
bib
abs
Perceive the Passage of Time: A Systematic Evaluation of Large Language Model in Temporal Relativity
Shuang Chen
|
Yining Zheng
|
Shimin Li
|
Qinyuan Cheng
|
Xipeng Qiu
Temporal perception is crucial for Large Language Models (LLMs) to effectively understand the world. However, current benchmarks primarily focus on temporal reasoning and fall short in assessing temporal perception, particularly the understanding of temporal relativity. In this paper, we introduce TempBench, a comprehensive benchmark designed to evaluate the temporal-relative ability of LLMs. TempBench encompasses 4 distinct scenarios: Physiology, Psychology, Cognition and Mixture. We conduct extensive experiments on GPT-4, a series of Llama models, and other popular LLMs. The experimental results demonstrate a significant performance gap between LLMs and humans in temporal-relative capability. Furthermore, we categorize the error types of temporal-relative ability in LLMs to thoroughly analyze the impact of multiple aspects and highlight the associated challenges. We anticipate that TempBench will drive further advancements in enhancing the temporal-perceiving capabilities of LLMs.
pdf
bib
abs
Hit the Sweet Spot! Span-Level Ensemble for Large Language Models
Yangyifan Xu
|
Jianghao Chen
|
Junhong Wu
|
Jiajun Zhang
Ensembling various LLMs to unlock their complementary potential and leverage their individual strengths is highly valuable. Previous studies typically focus on two main paradigms: sample-level and token-level ensembles. Sample-level ensemble methods either select or blend fully generated outputs, which hinders dynamic correction and enhancement of outputs during the generation process. On the other hand, token-level ensemble methods enable real-time correction through fine-grained ensemble at each generation step. However, the information carried by an individual token is quite limited, leading to suboptimal decisions at each step. To address these issues, we propose SweetSpan, a span-level ensemble method that effectively balances the need for real-time adjustments and the information required for accurate ensemble decisions. Our approach involves two key steps: First, we have each candidate model independently generate candidate spans based on the shared prefix. Second, we calculate perplexity scores to facilitate mutual evaluation among the candidate models and achieve robust span selection by filtering out unfaithful scores. To comprehensively evaluate ensemble methods, we propose a new challenging setting (ensemble models with significant performance gaps) in addition to the standard setting (ensemble the best-performing models) to assess the performance of model ensembles in more realistic scenarios. Experimental results in both standard and challenging settings across various language generation tasks demonstrate the effectiveness, robustness, and versatility of our approach compared with previous ensemble methods.
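A rough sketch of span-level ensembling along these lines is shown below (illustrative only, not the SweetSpan implementation; the models, span length, and aggregation rule are assumptions, and the unfaithful-score filtering step is omitted): each member proposes a span from the shared prefix, spans are cross-scored by the members' perplexities, and the best-scoring span is kept.

```python
# Illustrative sketch of span-level ensembling between two small causal LMs
# (not the SweetSpan code; models, span length, and scoring rule are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

names = ["gpt2", "distilgpt2"]  # placeholder ensemble members
models = {n: AutoModelForCausalLM.from_pretrained(n).eval() for n in names}
toks = {n: AutoTokenizer.from_pretrained(n) for n in names}

def propose_span(name: str, prefix: str, span_tokens: int = 8) -> str:
    """One member greedily proposes a short candidate span from the shared prefix."""
    ids = toks[name](prefix, return_tensors="pt").input_ids
    out = models[name].generate(ids, max_new_tokens=span_tokens, do_sample=False)
    return toks[name].decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def span_nll(name: str, prefix: str, span: str) -> float:
    """Average negative log-likelihood of the span tokens under one scorer."""
    tok = toks[name]
    prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
    ids = tok(prefix + span, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = models[name](ids).logits.log_softmax(-1)
    target = ids[0, prefix_len:]
    picked = logprobs[0, prefix_len - 1 : ids.shape[1] - 1].gather(-1, target.unsqueeze(-1))
    return -picked.mean().item()

prefix = "The main reason the experiment failed was"
candidates = {n: propose_span(n, prefix) for n in names}
# Mutual evaluation: every span is scored by all members; the lowest mean NLL wins.
best = min(candidates.values(), key=lambda s: sum(span_nll(n, prefix, s) for n in names))
print(repr(best))
```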
pdf
bib
abs
PToco: Prefix-based Token-level Collaboration Enhances Reasoning for Multi-LLMs
Yuang Bian
|
Yupian Lin
|
Jingping Liu
|
Tong Ruan
Collaboration between multiple Large Language Models (LLMs) has attracted significant attention for its potential to mitigate hallucinations and enhance reasoning capabilities. Previous approaches, such as multi-agent debate and decoding-time integration, either rely on highly capable models with strong self-reflection abilities or are limited to models sharing the same tokenizer. To address these limitations, we introduce PToco (Prefix-based Token-level Collaboration), a novel mechanism that enables effective collaboration among less capable LLMs, independent of tokenizer differences. PToco uses a prefix-grouping method to extract consensus among tokens with varying levels of granularity, ensuring coherent and robust token generation across multiple models. Experimental results on a series of reasoning tasks demonstrate that PToco significantly improves performance over individual models. Furthermore, this approach generalizes well across different quantities and sizes of participating models, providing a more flexible and efficient solution for multi-LLM ensembles.
pdf
bib
abs
MAGRET: Machine-generated Text Detection with Rewritten Texts
Yifei Huang
|
Jiuxin Cao
|
Hanyu Luo
|
Xin Guan
|
Bo Liu
With the rapid advancement of the text generation abilities of Large Language Models (LLMs), concerns about the misuse of machine-generated content have grown, raising potential violations of legal and ethical standards. Some existing studies concentrate on detecting machine-generated text from open-source models using in-model features, but their performance on closed-source large models is limited. This limitation occurs because, in closed-source model detection, the only reference that can be obtained is the text itself, which may differ significantly due to random sampling. In this paper, we demonstrate that texts generated by the same model can align both semantically and statistically under similar prompts, facilitating effective detection and traceability. Specifically, we fine-tune a BERT encoder through contrastive learning to achieve semantic alignment among randomly generated texts from the same model. We then propose MAGRET (Machine-Generated Text Detection with Rewritten Texts), which designs several prompt-refactoring methods and uses them to request rewritten texts from LLMs. Semantic and statistical relationships between rewritten and original texts provide a basis for detection and traceability. Finally, we expand the text dataset with multi-parameter random sampling and verify the performance of MAGRET on three machine-generated text datasets. Experimental results show that previous methods struggle with closed-source model detection, while our approach significantly outperforms baseline methods in this regard. It also shows MAGRET’s stable performance in detection and tracing tasks across various randomly sampled texts.
pdf
bib
abs
Structured List-Grounded Question Answering
Mujeen Sung
|
Song Feng
|
James Gung
|
Raphael Shu
|
Yi Zhang
|
Saab Mansour
Document-grounded dialogue systems aim to answer user queries by leveraging external information. Previous studies have mainly focused on handling free-form documents, often overlooking structured data such as lists, which can represent a range of nuanced semantic relations. Motivated by the observation that even advanced language models like GPT-3.5 often miss semantic cues from lists, this paper aims to enhance question answering (QA) systems for better interpretation and use of structured lists. To this end, we introduce the LIST2QA dataset, a novel benchmark to evaluate the ability of QA systems to respond effectively using list information. This dataset is created from unlabeled customer service documents using language models and model-based filtering processes to enhance data quality, and can be used to fine-tune and evaluate QA models. Apart from directly generating responses through fine-tuned models, we further explore the explicit use of Intermediate Steps for Lists (ISL), aligning list items with user backgrounds to better reflect how humans interpret list items before generating responses. Our experimental results demonstrate that models trained on LIST2QA with our ISL approach outperform baselines across various metrics. Specifically, our fine-tuned Flan-T5-XL model shows increases of 3.1% in ROUGE-L, 4.6% in correctness, 4.5% in faithfulness, and 20.6% in completeness compared to models without applying filtering and the proposed ISL method.
pdf
bib
abs
Low-Resource Language Expansion and Translation Capacity Enhancement for LLM: A Study on the Uyghur
Kaiwen Lu
|
Yating Yang
|
Fengyi Yang
|
Rui Dong
|
Bo Ma
|
Aihetamujiang Aihemaiti
|
Abibilla Atawulla
|
Lei Wang
|
Xi Zhou
Although large language models have significantly advanced natural language generation, their potential in low-resource machine translation has not yet been fully explored, especially for languages that translation models have not been trained on. In this study, we provide a detailed demonstration of how to efficiently expand low-resource languages for large language models and significantly enhance the model’s translation ability, using Uyghur as an example. The process involves four stages: collecting and pre-processing monolingual data, conducting continuous pre-training with extensive monolingual data, fine-tuning with a smaller parallel corpus under translation supervision, and, on this basis, proposing a direct preference optimization based on translation self-evolution (DPOSE). Extensive experiments have shown that our strategy effectively expands the low-resource languages supported by large language models and significantly enhances the model’s translation ability in Uyghur with limited parallel data. Our research provides detailed insights for expanding other low-resource languages into large language models.
pdf
bib
abs
Unraveling the Mystery: Defending Against Jailbreak Attacks Via Unearthing Real Intention
Yanhao Li
|
Hongshen Chen
|
Heng Zhang
|
Zhiwei Ge
|
Tianhao Li
|
Sulong Xu
|
Guibo Luo
As Large Language Models (LLMs) become more advanced, the security risks they pose also increase. Ensuring that LLM behavior aligns with human values, particularly in mitigating jailbreak attacks with elusive and implicit intentions, has become a significant challenge. To address this issue, we propose a jailbreak defense method called Real Intentions Defense (RID), which involves two phases: soft extraction and hard deletion. In the soft extraction phase, LLMs are leveraged to extract unbiased, genuine intentions, while in the hard deletion phase, a greedy gradient-based algorithm is used to remove the least important parts of a sentence, based on the insight that words with smaller gradients have less impact on its meaning. We conduct extensive experiments on Vicuna and Llama2 models using eight state-of-the-art jailbreak attacks and six benchmark datasets. Our results show a significant reduction in both Attack Success Rate (ASR) and Harmful Score of jailbreak attacks, while maintaining overall model performance. Further analysis sheds light on the underlying mechanisms of our approach.
pdf
bib
abs
A Flash in the Pan: Better Prompting Strategies to Deploy Out-of-the-Box LLMs as Conversational Recommendation Systems
Gustavo Adolpho Lucas de Carvalho
|
Simon Ben Igeri
|
Jennifer Healey
|
Victor Bursztyn
|
David Demeter
|
Lawrence A. Birnbaum
Conversational Recommendation Systems (CRSs) are a particularly interesting application for out-of-the-box LLMs due to their potential for eliciting user preferences and making recommendations in natural language across a wide set of domains. Somewhat surprisingly, however, we find that in such a conversational application, the more questions a user answers about their preferences, the worse the model’s recommendations become. We demonstrate this phenomenon on a previously published dataset as well as two novel datasets which we contribute. We also explain why earlier benchmarks failed to detect this round-over-round performance loss, highlighting the importance of the evaluation strategy we use and expanding upon Li et al. (2023a). We also present preference elicitation and recommendation strategies that mitigate this degradation in performance, beating state-of-the-art results, and show how three underlying models, GPT-3.5, GPT-4, and Claude 3.5 Sonnet, differently impact these strategies. Our datasets and code are available at https://github.com/CtrlVGustavo/A-Flash-in-the-Pan-CRS.
pdf
bib
abs
Rule-KBQA: Rule-Guided Reasoning for Complex Knowledge Base Question Answering with Large Language Models
Zhiqiang Zhang
|
Liqiang Wen
|
Wen Zhao
Knowledge base question answering (KBQA) is recognized as a challenging task, especially when parsing complex questions into executable logical forms. Traditional semantic parsing (SP)-based approaches exhibit inconsistent performance in handling various complex questions. As large language models (LLMs) have exhibited exceptional reasoning ability and language comprehension, recent studies have employed LLMs for semantic parsing to directly generate logical forms that can be executed on knowledge bases (KBs) to achieve the desired results. However, these methods of relying exclusively on LLMs to ensure grammaticality, faithfulness, and controllability may diminish their effectiveness due to hallucinations in the reasoning process. In this paper, we introduce Rule-KBQA, a framework that employs learned rules to guide the generation of logical forms. The proposed method contains two phases, an induction phase and a deduction phase. In the induction phase, we initially extract rules from the existing data and then employ the Rule-Following Fine-Tuned (RFFT) LLM to generate additional rules, ultimately constructing a comprehensive rule library. In the deduction phase, a symbolic agent, guided by learned rules, explores the environment KB to incrementally construct executable logical forms. Meanwhile, we leverage the discriminative capability of LLMs to evaluate the plausibility of candidate decisions. Extensive experiments indicate that our method achieves competitive results on standard KBQA datasets, clearly demonstrating its effectiveness.
pdf
bib
abs
Mitigating Language Confusion through Inference-time Intervention
Xie Yunfan
|
Lixin Zou
|
Dan Luo
|
Min Tang
|
Chenliang Li
|
Xiangyang Luo
|
Liming Dong
Although large language models (LLMs) trained on extensive multilingual corpora exhibit impressive language transfer, they often fail to respond in the user’s desired language due to corpus imbalances, an embarrassingly simple problem known as language confusion. However, existing solutions like in-context learning and supervised fine-tuning (SFT) have drawbacks: in-context learning consumes context window space, diminishing attention as text lengthens, while SFT requires extensive, labor-intensive data collection. To overcome these limitations, we propose language-sensitive intervention (LSI), a novel, lightweight, and label-free approach. Specifically, we analyze language confusion from a causal perspective, revealing that the training corpus’s language distribution acts as a confounder, disadvantaging languages that are underrepresented in the dataset. We then identify a language-sensitive dimension in the LLM’s residual stream, i.e., the language vector, which allows us to estimate the average causal effect of prompts on this dimension. During inference, we directly intervene on the language vector to generate responses in the desired language. To further advance research on this issue, we introduce a new benchmark that detects language confusion and assesses content quality. Experimental results demonstrate that our method effectively mitigates language confusion without additional complex mechanisms. Our code is available at https://github.com/SoseloX/LSI.
pdf
bib
abs
Detecting deepfakes and false ads through analysis of text and social engineering techniques
Alicja Martinek
|
Ewelina Bartuzi-Trokielewicz
Existing deepfake detection algorithms frequently fail to successfully identify fabricated materials. These algorithms primarily focus on technical analysis of video and audio, often neglecting the meaning of the content itself. In this paper, we introduce a novel approach that emphasizes the analysis of text-based transcripts, particularly those from AI-generated deepfake advertisements, placing the text content at the center of attention. Our method combines linguistic features, evaluation of grammatical mistakes, and the identification of social engineering techniques commonly used in fraudulent content. By examining stylistic inconsistencies and manipulative language patterns, we enhance the accuracy of distinguishing between real and deepfake materials. To ensure interpretability, we employed classical machine learning models, allowing us to provide explainable insights into decision-making processes. Additionally, zero-shot evaluations were conducted using three solutions based on large language models to assess their performance in detecting deepfake content. The experimental results show that these factors yield 90% accuracy in distinguishing between deepfake-based fraudulent advertisements and real ones. This demonstrates the effectiveness of incorporating content-based analysis into deepfake detection, offering a complementary layer to existing audio-visual techniques.
pdf
bib
abs
Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources
Belu Ticona
|
Fernando Martín Carranza
|
Viviana Cotik
Argentina has a diverse, yet little-known, Indigenous language heritage. Most of these languages are at risk of disappearing, resulting in a significant loss of world heritage and cultural knowledge. Currently, no unified information on speakers and computational tools is available for these languages. In this work, we present a systematization of the Indigenous languages spoken in Argentina, along with national demographic data on the country’s Indigenous population. The languages are classified into seven families: Mapuche, Tupí-Guaraní, Guaycurú, Quechua, Mataco-Mataguaya, Aymara, and Chon. We also provide an introductory survey of the computational resources available for these languages, whether or not they are specifically developed for Argentine varieties.
pdf
bib
abs
The Role of Natural Language Processing Tasks in Automatic Literary Character Network Construction
Arthur Amalvy
|
Vincent Labatut
|
Richard Dufour
The automatic extraction of character networks from literary texts is generally carried out using natural language processing (NLP) cascading pipelines. While this approach is widespread, no study exists on the impact of low-level NLP tasks on their performance. In this article, we conduct such a study on a literary dataset, focusing on the role of named entity recognition (NER) and coreference resolution when extracting co-occurrence networks. To highlight the impact of these tasks’ performance, we start with gold-standard annotations, progressively add uniformly distributed errors, and observe their impact in terms of character network quality. We demonstrate that NER performance depends on the tested novel and strongly affects character detection. We also show that NER-detected mentions alone miss a lot of character co-occurrences, and that coreference resolution is needed to prevent this. Finally, we present comparison points with two methods based on large language models (LLMs), including a fully end-to-end one, and show that these models are outperformed by traditional NLP pipelines in terms of recall.
pdf
bib
abs
Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede’s Cultural Dimensions
Reem Masoud
|
Ziquan Liu
|
Martin Ferianc
|
Philip C. Treleaven
|
Miguel Rodrigues Rodrigues
The deployment of large language models (LLMs) raises concerns regarding their cultural misalignment and potential ramifications for individuals and societies with diverse cultural backgrounds. While the discourse has focused mainly on political and social biases, our research proposes a Cultural Alignment Test (Hofstede’s CAT) to quantify cultural alignment using Hofstede’s cultural dimension framework, which offers an explanatory cross-cultural comparison through latent variable analysis. We apply our approach to quantitatively evaluate LLMs—namely Llama 2, GPT-3.5, and GPT-4—against the cultural dimensions of regions such as the United States, China, and Arab countries, using different prompting styles and exploring the effects of language-specific fine-tuning on the models’ behavioural tendencies and cultural values. Our results quantify the cultural alignment of LLMs and reveal the differences between LLMs along explanatory cultural dimensions. Our study demonstrates that while all LLMs struggle to grasp cultural values, GPT-4 shows a unique capability to adapt to cultural nuances, particularly in Chinese settings, although it faces challenges with American and Arab cultures. The research also highlights that fine-tuning Llama 2 models with different languages changes their responses to cultural questions, emphasizing the need for culturally diverse development in AI for worldwide acceptance and ethical use. For more details or to contribute to this research, visit our GitHub page https://github.com/reemim/Hofstedes_CAT
pdf
bib
abs
META-LORA: Memory-Efficient Sample Reweighting for Fine-Tuning Large Language Models
Weicheng Li
|
Lixin Zou
|
Min Tang
|
Qing Yu
|
Wanli Li
|
Chenliang Li
Supervised fine-tuning (SFT) is widely adopted for tailoring large language models (LLMs) to specific downstream tasks. However, the substantial computational demands of LLMs hinder iterative exploration of fine-tuning datasets and accurate evaluation of individual sample importance. To address this challenge, we introduce Meta-LoRA, a memory-efficient method for automatic sample reweighting. Meta-LoRA learns to reweight fine-tuning samples by minimizing the loss on a small, high-quality validation set through an end-to-end bi-level optimization framework based on meta-learning. To reduce memory usage associated with computing second derivatives, we approximate the bi-level optimization using gradient similarity between training and validation datasets, replacing bi-dimensional gradient similarity with the product of one-dimensional activation states and their corresponding gradients. Further memory optimization is achieved by refining gradient computations, selectively applying them to the low-rank layers of LoRA, which results in as little as 4% additional memory usage. Comprehensive evaluations across benchmark datasets in mathematics, coding, and medical domains demonstrate Meta-LoRA’s superior efficacy and efficiency. The source code is available at https://github.com/liweicheng-ai/meta-lora.
pdf
bib
abs
Can Large Language Models perform Relation-based Argument Mining?
Deniz Gorur
|
Antonio Rago
|
Francesca Toni
Relation-based Argument Mining (RbAM) is the process of automatically determining agreement (support) and disagreement (attack) relations amongst textual arguments (in the binary prediction setting), or neither relation (in the ternary prediction setting). As the number of platforms supporting online debate increases, the need for RbAM becomes ever more urgent, especially in support of downstream tasks. RbAM is a challenging classification task, with existing state-of-the-art methods, based on Language Models (LMs), failing to perform satisfactorily across different datasets. In this paper, we show that general-purpose Large LMs (LLMs), appropriately primed and prompted, can significantly outperform the best performing (RoBERTa-based) baseline. Specifically, we experiment with two open-source LLMs (Llama-2 and Mistral) and with GPT-3.5-turbo on several datasets for (binary and ternary) RbAM, as well as with GPT-4o-mini on samples (to limit costs) from the datasets.
pdf
bib
abs
Contextual Augmentation for Entity Linking using Large Language Models
Daniel Vollmers
|
Hamada Zahera
|
Diego Moussallem
|
Axel-Cyrille Ngonga Ngomo
Entity Linking involves detecting and linking entity mentions in natural language texts to a knowledge graph. Traditional methods use a two-step process with separate models for entity recognition and disambiguation, which can be computationally intensive and less effective. We propose a fine-tuned model that jointly integrates entity recognition and disambiguation in a unified framework. Furthermore, our approach leverages large language models to enrich the context of entity mentions, yielding better disambiguation. We evaluated our approach on benchmark datasets and compared it with several baselines. The evaluation results show that our approach achieves state-of-the-art performance on out-of-domain datasets.
pdf
bib
abs
CmEAA: Cross-modal Enhancement and Alignment Adapter for Radiology Report Generation
Xiyang Huang
|
Yingjie Han
|
Yx L
|
Runzhi Li
|
Pengcheng Wu
|
Kunli Zhang
Automatic radiology report generation is pivotal in reducing the workload of radiologists, while simultaneously improving diagnostic accuracy and operational efficiency. Current methods face significant challenges, including the effective alignment of medical visual features with textual features and the mitigation of data bias. In this paper, we propose a method for radiology report generation that utilizes a Cross-modal Enhancement and Alignment Adapter (CmEAA) to connect a vision encoder with a frozen large language model. Specifically, we introduce two novel modules within CmEAA: Cross-modal Feature Enhancement (CFE) and Neural Mutual Information Aligner (NMIA). CFE extracts observation-related contextual features to enhance the visual features of lesions and abnormal regions in radiology images through a cross-modal enhancement transformer. NMIA maximizes neural mutual information between visual and textual representations within a low-dimensional alignment embedding space during training and provides potential global alignment visual representations during inference. Additionally, a weights generator is designed to enable the dynamic adaptation of cross-modal enhanced features and vanilla visual features. Experimental results on two prevailing datasets, namely, IU X-Ray and MIMIC-CXR, demonstrate that the proposed model outperforms previous state-of-the-art methods.
pdf
bib
abs
Semantic Reshuffling with LLM and Heterogeneous Graph Auto-Encoder for Enhanced Rumor Detection
Guoyi Li
|
Die Hu
|
Zongzhen Liu
|
Xiaodan Zhang
|
Honglei Lyu
Social media is crucial for information spread, necessitating effective rumor detection to curb misinformation’s societal effects. Current methods struggle against complex propagation influenced by bots, coordinated accounts, and echo chambers, which fragment information and increase risks of misjudgments and model vulnerability. To counteract these issues, we introduce a new rumor detection framework, the Narrative-Integrated Metapath Graph Auto-Encoder (NIMGA). This model consists of two core components: (1) Metapath-based Heterogeneous Graph Reconstruction. (2) Narrative Reordering and Perspective Fusion. The first component dynamically reconstructs propagation structures to capture complex interactions and hidden pathways within social networks, enhancing accuracy and robustness. The second implements a dual-agent mechanism for viewpoint distillation and comment narrative reordering, using LLMs to refine diverse perspectives and semantic evolution, revealing patterns of information propagation and latent semantic correlations among comments. Extensive testing confirms our model outperforms existing methods, demonstrating its effectiveness and robustness in enhancing rumor representation through graph reconstruction and narrative reordering.
pdf
bib
abs
Extracting, Detecting, and Generating Research Questions for Scientific Articles
Sina Taslimi
|
Artemis Capari
|
Hosein Azarbonyad
|
Zi Long Zhu
|
Zubair Afzal
|
Evangelos Kanoulas
|
George Tsatsaronis
The volume of academic articles is increasing rapidly, reflecting the growing emphasis on research and scholarship across different science disciplines. This rapid growth necessitates the development of tools for more efficient and rapid understanding of these articles. Clear and well-defined Research Questions (RQs) in research articles can help guide scholarly inquiries. However, many academic studies lack a proper definition of RQs in their articles. This research addresses this gap by presenting a comprehensive framework for the systematic extraction, detection, and generation of RQs from scientific articles. The extraction component uses a set of regular expressions to identify articles containing well-defined RQs. The detection component aims to identify more complex RQs in articles, beyond those captured by the rule-based extraction method. The RQ generation focuses on creating RQs for articles that lack them. We integrate all these components to build a pipeline to extract RQs or generate them based on the articles’ full text. We evaluate the performance of the designed pipeline on a set of metrics designed to assess the quality of RQs. Our results indicate that the proposed pipeline can reliably detect RQs and generate high-quality ones.
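For the rule-based extraction component, a few illustrative regular expressions show how explicitly stated research questions might be matched; these patterns are placeholders, not the paper's actual rule set.

```python
import re

# Illustrative patterns only; the paper's actual regular expressions are not reproduced here.
RQ_PATTERNS = [
    r"(?:RQ\s*\d+\s*[:.)-]\s*)(.+?\?)",                                       # "RQ1: ... ?"
    r"(?:our|the)\s+research\s+questions?\s+(?:is|are|was|were)[:\s]+(.+?\?)",
    r"we\s+(?:ask|address|investigate)\s+the\s+following\s+questions?[:\s]+(.+?\?)",
]

def extract_research_questions(text):
    """Return explicit research questions matched by simple regular expressions."""
    found = []
    for pattern in RQ_PATTERNS:
        found += re.findall(pattern, text, flags=re.IGNORECASE | re.DOTALL)
    return [q.strip() for q in found]

sample = "RQ1: How does synthetic data affect parser accuracy?"
print(extract_research_questions(sample))
```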
pdf
bib
abs
Confront Insider Threat: Precise Anomaly Detection in Behavior Logs Based on LLM Fine-Tuning
Shuang Song
|
Yifei Zhang
|
Neng Gao
Anomaly-based detection is effective against evolving insider threats but still suffers from low precision. Current data processing can result in information loss, and models often struggle to distinguish between benign anomalies and actual threats. Both issues hinder precise detection. To address these issues, we propose a precise anomaly detection solution for behavior logs based on Large Language Model (LLM) fine-tuning. By representing user behavior in natural language, we minimize information loss. We fine-tune the LLM with a user behavior pattern contrastive task for anomaly detection, using a two-stage strategy: first learning general behavior patterns, then refining with user-specific data to improve differentiation between benign anomalies and threats. We also implement a fine-grained threat tracing mechanism to provide behavior-level audit trails. To the best of our knowledge, our solution is the first to apply LLM fine-tuning in insider threat detection, achieving an F1 score of 0.8941 on the CERT v6.2 dataset, surpassing all baselines.
pdf
bib
abs
Flashback: Memory Mechanism for Enhancing Memory Efficiency and Speed in Deep Sequential Models
Taiki Sekii
In this study, we tackle three main challenges of deep sequential processing models in previous research: (1) memory degradation, (2) inaccurate gradient backpropagation, and (3) compatibility with next-token prediction. Specifically, to address (1-2), we define a Flashback property in which memory is preserved perfectly as an identity mapping of its stored value in a memory region until it is overwritten by a hidden state at a different time step. We propose a Flashback mechanism that satisfies this property in a fully differentiable, end-to-end manner. Further, to tackle (3), we propose architectures that incorporate the Flashback mechanism into Transformers and Mamba, enabling next-token prediction for language modeling tasks. In experiments, we trained on The Pile dataset, which includes diverse texts, to evaluate tradeoffs between commonsense reasoning accuracy, processing speed, and memory usage after introducing the Flashback mechanism into existing methods. The evaluations confirmed the effectiveness of the Flashback mechanism.
pdf
bib
abs
Engagement-driven Persona Prompting for Rewriting News Tweets
Reshmi Gopalakrishna Pillai
|
Antske Fokkens
|
Wouter van Atteveldt
Text style transfer is a challenging research task which modifies the linguistic style of a given text to meet pre-set objectives such as making the text simpler or more accessible. Though large language models have been found to give promising results, text rewriting to improve audience engagement of social media content is largely unexplored. Our research investigates the performance of various prompting strategies in the task of rewriting Dutch news tweets in specific linguistic styles (formal, casual and factual). Apart from zero-shot and few-shot prompting variants, with and without personas, we also explore prompting with feedback on predicted engagement. We perform an extensive analysis of 18 different combinations of Large Language Models (GPT-3.5, GPT-4, Mistral-7B) and prompting strategies on three different metrics: ROUGE-L, semantic similarity and predicted engagement. We find that GPT-4 with feedback and persona prompting performs the best in terms of predicted engagement for all three language styles. Our results motivate the further use of prompting techniques to rewrite news headlines on Twitter to align with specific style guidelines.
pdf
bib
abs
A Chain-of-Task Framework for Instruction Tuning of LLMs Based on Chinese Grammatical Error Correction
Xinpeng Liu
|
Bing Xu
|
Muyun Yang
|
Hailong Cao
|
Conghui Zhu
|
Tiejun Zhao
|
Wenpeng Lu
Over-correction is a critical issue for large language models (LLMs) addressing the Grammatical Error Correction (GEC) task, especially for Chinese. This paper proposes a Chain-of-Task (CoTask) framework to reduce over-correction. The CoTask framework is applied as multi-task instruction tuning of LLMs by decomposing the process of grammatical error analysis to design auxiliary tasks and adjusting the types and combinations of training tasks. A supervised fine-tuning (SFT) strategy is also presented to enhance the performance of LLMs, together with an algorithm for automatic dataset annotation to avoid additional manual costs. Experimental results demonstrate that our method achieves new state-of-the-art results on both the FCGEC (in-domain) and NaCGEC (out-of-domain) test sets.
pdf
bib
abs
Beyond Dataset Creation: Critical View of Annotation Variation and Bias Probing of a Dataset for Online Radical Content Detection
Arij Riabi
|
Virginie Mouilleron
|
Menel Mahamdi
|
Wissam Antoun
|
Djamé Seddah
The proliferation of radical content on online platforms poses significant risks, including inciting violence and spreading extremist ideologies. Despite ongoing research, existing datasets and models often fail to address the complexities of multilingual and diverse data. To bridge this gap, we introduce a publicly available multilingual dataset annotated with radicalization levels, calls for action, and named entities in English, French, and Arabic. This dataset is pseudonymized to protect individual privacy while preserving contextual information. Beyond presenting our freely available dataset, we analyze the annotation process, highlighting biases and disagreements among annotators and their implications for model performance. Additionally, we use synthetic data to investigate the influence of socio-demographic traits on annotation patterns and model predictions. Our work offers a comprehensive examination of the challenges and opportunities in building robust datasets for radical content detection, emphasizing the importance of fairness and transparency in model development. The Counter dataset is available at https://gitlab.inria.fr/ariabi/counter-dataset-public.
pdf
bib
abs
AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic
Emad A. Alghamdi
|
Reem Masoud
|
Deema Alnuhait
|
Afnan Y. Alomairi
|
Ahmed Ashraf
|
Mohamed Zaytoon
The swift progress and widespread acceptance of artificial intelligence (AI) systems highlight a pressing requirement to comprehend both the capabilities and potential risks associated with AI. Given the linguistic complexity, cultural richness, and underrepresented status of Arabic in AI research, there is a pressing need to focus on Large Language Models (LLMs) performance and safety for Arabic related tasks. Despite some progress in their development, there is a lack of comprehensive trustworthiness evaluation benchmarks which presents a major challenge in accurately assessing and improving the safety of LLMs when prompted in Arabic. In this paper, we introduce AraTrust, the first comprehensive trustworthiness benchmark for LLMs in Arabic. AraTrust comprises 522 human-written multiple-choice questions addressing diverse dimensions related to truthfulness, ethics, privacy, illegal activities, mental health, physical health, unfairness, and offensive language. We evaluated a set of LLMs against our benchmark to assess their trustworthiness. GPT-4 was the most trustworthy LLM, while open-source models, particularly AceGPT 7B and Jais 13B, struggled to achieve a score of 60% in our benchmark. The benchmark dataset is publicly available at https://huggingface.co/datasets/asas-ai/AraTrust
pdf
bib
abs
Comparative Study of Multilingual Idioms and Similes in Large Language Models
Paria Khoshtab
|
Danial Namazifard
|
Mostafa Masoudi
|
Ali Akhgary
|
Samin Mahdizadeh Sani
|
Yadollah Yaghoobzadeh
This study addresses the gap in the literature concerning the comparative performance of LLMs in interpreting different types of figurative language across multiple languages. By evaluating LLMs using two multilingual datasets on simile and idiom interpretation, we explore the effectiveness of various prompt engineering strategies, including chain-of-thought, few-shot, and English translation prompts. We extend the language of these datasets to Persian as well by building two new evaluation sets. Our comprehensive assessment involves both closed-source (GPT-3.5, GPT-4o mini, Gemini 1.5), and open-source models (Llama 3.1, Qwen2), highlighting significant differences in performance across languages and figurative types. Our findings reveal that while prompt engineering methods are generally effective, their success varies by figurative type, language, and model. We also observe that open-source models struggle particularly with low-resource languages in similes. Additionally, idiom interpretation is nearing saturation for many languages, necessitating more challenging evaluations.
pdf
bib
abs
FedCSR: A Federated Framework for Multi-Platform Cross-Domain Sequential Recommendation with Dual Contrastive Learning
Dongyi Zheng
|
Hongyu Zhang
|
Jianyang Zhai
|
Lin Zhong
|
Lingzhi Wang
|
Jiyuan Feng
|
Xiangke Liao
|
Yonghong Tian
|
Nong Xiao
|
Qing Liao
Cross-domain sequential recommendation (CSR) has garnered significant attention. Current federated frameworks for CSR leverage information across multiple domains but often rely on user alignment, which increases communication costs and privacy risks. In this work, we propose FedCSR, a novel federated cross-domain sequential recommendation framework that eliminates the need for user alignment between platforms. FedCSR fully utilizes cross-domain knowledge to address the key challenges related to data heterogeneity both inter- and intra-platform. To tackle the heterogeneity of data patterns between platforms, we introduce Model Contrastive Learning (MCL) to reduce the gap between local and global models. Additionally, we design Sequence Contrastive Learning (SCL) to address the heterogeneity of user preferences across different domains within a platform by employing tailored sequence augmentation techniques. Extensive experiments conducted on multiple real-world datasets demonstrate that FedCSR achieves superior performance compared to existing baseline methods.
pdf
bib
abs
Multi-Modal Entities Matter: Benchmarking Multi-Modal Entity Alignment
GuanChen Xiao
|
WeiXin Zeng
|
ShiQi Zhang
|
MingRui Lao
|
Xiang Zhao
Multi-modal entity alignment (MMEA) is a long-standing task that aims to discover identical entities between different multi-modal knowledge graphs (MMKGs). However, most of the existing MMEA datasets consider the multi-modal data as the attributes of textual entities, while neglecting the correlations among the multi-modal data, and thus do not fit real-world scenarios well. In response, in this work, we establish a novel yet practical MMEA dataset, i.e., NMMEA, which models multi-modal data (e.g., images) equally as textual entities in the MMKG. Due to the introduction of multi-modal data, NMMEA poses new challenges to existing MMEA solutions, i.e., heterogeneous structural representation learning and cross-modal alignment inference. Hence, we put forward a simple yet effective solution, CrossEA, which can effectively learn the structural information of entities by considering both intra-modal and cross-modal relations, and further infer the similarity of different types of entity pairs. Extensive experiments validate the significance of NMMEA, where CrossEA can achieve superior performance in contrast to competitive methods on the proposed dataset.
pdf
bib
abs
Enhancing Extractive Question Answering in Multiparty Dialogues with Logical Inference Memory Network
Shu Zhou
|
Rui Zhao
|
Zhengda Zhou
|
Haohan Yi
|
Xuhui Zheng
|
Hao Wang
Multiparty dialogue question answering (QA) in machine reading comprehension (MRC) is a challenging task due to its complex information flow interactions and logical QA inference. Existing models typically handle such QA tasks by decoupling dialogue information at both speaker and utterance levels. However, few of them consider the logical inference relations in multiparty dialogue QA, leading to suboptimal QA performance. To address this issue, this paper proposes a memory network with logical inference (LIMN) for extractive QA in multiparty dialogues. LIMN introduces an inference module, which is pretrained by incorporating plain QA articles as external knowledge. It generates logical inference-aware representations from latent space for multiparty dialogues. To further model complex interactions among logical dialogue contexts, questions and key-utterance information, a key-utterance-based interaction method is proposed. Moreover, a multitask learning strategy is adopted for robust MRC. Extensive experiments were conducted on the Molweni and FriendsQA benchmarks, which include 25k and 10k questions, respectively. Comparative results show that LIMN achieves state-of-the-art results on both benchmarks, demonstrating the enhancement of logical QA inference in multiparty dialogue QA tasks.
pdf
bib
abs
Enhancing Discourse Parsing for Local Structures from Social Media with LLM-Generated Data
Martial Pastor
|
Nelleke Oostdijk
|
Patricia Martin-Rodilla
|
Javier Parapar
We explore the use of discourse parsers for extracting a particular discourse structure in a real-world social media scenario. Specifically, we focus on enhancing parser performance through the integration of synthetic data generated by large language models (LLMs). We conduct experiments using a newly developed dataset of 1,170 local RST discourse structures, including 900 synthetic and 270 gold examples, covering three social media platforms: online news comments sections, a discussion forum (Reddit), and a social media messaging platform (Twitter). Our primary goal is to assess the impact of LLM-generated synthetic training data on parser performance in a raw text setting without pre-identified discourse units. While both top-down and bottom-up RST architectures greatly benefit from synthetic data, challenges remain in classifying evaluative discourse structures.
pdf
bib
abs
PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models
Andrianos Michail
|
Simon Clematide
|
Juri Opitz
The task of determining whether two texts are paraphrases has long been a challenge in NLP. However, the prevailing notion of paraphrase is often quite simplistic, offering only a limited view of the vast spectrum of paraphrase phenomena. Indeed, we find that evaluating models in a paraphrase dataset can leave uncertainty about their true semantic understanding. To alleviate this, we create PARAPHRASUS, a benchmark designed for multi-dimensional assessment, benchmarking and selection of paraphrase detection models. We find that paraphrase detection models under our fine-grained evaluation lens exhibit trade-offs that cannot be captured through a single classification dataset. Furthermore, PARAPHRASUS allows prompt calibration for different use cases, tailoring LLMs to specific strictness levels. PARAPHRASUS includes 3 challenges spanning over 10 datasets, including 8 repurposed and 2 newly annotated; we release it along with a benchmarking library at https://github.com/impresso/paraphrasus
pdf
bib
abs
Dynamic-prototype Contrastive Fine-tuning for Continual Few-shot Relation Extraction with Unseen Relation Detection
Si Miao Zhao
|
Zhen Tan
|
Ning Pang
|
Wei Dong Xiao
|
Xiang Zhao
Continual Few-shot Relation Extraction (CFRE) aims to continually learn new relations from limited labeled data while preserving knowledge about previously learned relations. Facing the inherent issue of catastrophic forgetting, previous approaches predominantly rely on memory replay strategies. However, they often overlook task interference in continual learning and the varying memory requirements for different relations. To address these shortcomings, we propose a novel framework, DPC-FT, which features: 1) a lightweight relation encoder for each task to mitigate negative knowledge transfer across tasks; 2) a dynamic prototype module to allocate less memory for easier relations and more memory for harder relations. Additionally, we introduce the None-Of-The-Above (NOTA) detection in CFRE and propose a threshold criterion to identify relations that have never been learned. Extensive experiments demonstrate the effectiveness and efficiency of our method in CFRE, making our approach more practical and comprehensive for real-world scenarios.
pdf
bib
abs
Enhancing Rhetorical Figure Annotation: An Ontology-Based Web Application with RAG Integration
Ramona Kühn
|
Jelena Mitrović
|
Michael Granitzer
Rhetorical figures play an important role in our communication. They are used to convey subtle, implicit meaning, or to emphasize statements. We notice them in hate speech, fake news, and propaganda. By improving the systems for computational detection of rhetorical figures, we can also improve tasks such as hate speech and fake news detection, sentiment analysis, opinion mining, or argument mining. Unfortunately, there is a lack of annotated data, as well as qualified annotators that would help us build large corpora to train machine learning models for the detection of rhetorical figures. The situation is particularly difficult in languages other than English, and for rhetorical figures other than metaphor, sarcasm, and irony. To overcome this issue, we develop a web application called “Find your Figure” that facilitates the identification and annotation of German rhetorical figures. The application is based on the German Rhetorical ontology GRhOOT which we have specially adapted for this purpose. In addition, we improve the user experience with Retrieval Augmented Generation (RAG). In this paper, we present the restructuring of the ontology, the development of the web application, and the built-in RAG pipeline. We also identify the optimal RAG settings for our application. Our approach is one of the first to practically use rhetorical ontologies in combination with RAG and shows promising results.
pdf
bib
abs
Quantifying the Influence of Evaluation Aspects on Long-Form Response Assessment
Go Kamoda
|
Akari Asai
|
Ana Brassard
|
Keisuke Sakaguchi
Evaluating the outputs of large language models (LLMs) on long-form generative tasks remains challenging. While fine-grained, aspect-wise evaluations provide valuable diagnostic information, they are difficult to design exhaustively, and each aspect’s contribution to the overall acceptability of an answer is unclear. In this study, we propose a method to compute an overall quality score as a weighted average of three key aspects: factuality, informativeness, and formality. This approach achieves stronger correlations with human judgments compared to previous metrics. Our analysis identifies factuality as the most predictive aspect of overall quality. Additionally, we release a dataset of 1.2k long-form QA answers annotated with both absolute judgments and relative preferences in overall and aspect-wise schemes to aid future research in evaluation practices.
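The proposed overall score is a weighted average of the three aspect ratings; a minimal sketch with placeholder weights (the paper's fitted weights are not reproduced here) follows.

```python
def overall_quality(factuality, informativeness, formality,
                    weights=(0.6, 0.3, 0.1)):
    """Weighted average of aspect scores (all assumed to be on the same scale,
    e.g. 1-5). The weights here are illustrative placeholders, not fitted values."""
    w_fact, w_info, w_form = weights
    total = w_fact + w_info + w_form
    return (w_fact * factuality + w_info * informativeness + w_form * formality) / total

print(overall_quality(4.0, 3.0, 5.0))  # -> 3.8
```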
pdf
bib
abs
CharMoral: A Character Morality Dataset for Morally Dynamic Character Analysis in Long-Form Narratives
Suyoung Bae
|
Gunhee Cho
|
Yun-Gyung Cheong
|
Boyang Li
This paper introduces CharMoral, a dataset designed to analyze the moral evolution of characters in long-form narratives. CharMoral, built from 1,337 movie synopses, includes annotations for character actions, context, and morality labels. To automatically construct CharMoral, we propose a four-stage framework, utilizing Large Language Models, to automatically classify actions as moral or immoral based on context. Human evaluations and various experiments confirm the framework’s effectiveness in moral reasoning tasks in multiple genres. Our code and the CharMoral dataset are publicly available at https://github.com/BaeSuyoung/CharMoral.
pdf
bib
abs
Incremental Transformer: Efficient Encoder for Incremented Text Over MRC and Conversation Tasks
Weisheng Li
|
Yuechen Wang
|
Jiaxin Shi
|
Wengang Zhou
|
Qi Tian
|
Houqiang Li
Some encoder inputs such as conversation histories are frequently extended with short additional inputs like new responses. However, to obtain the real-time encoding of the extended input, existing Transformer-based encoders like BERT have to encode the whole extended input again without utilizing the existing encoding of the original input, which may be prohibitively slow for real-time applications. In this paper, we introduce Incremental Transformer, an efficient encoder dedicated to faster encoding of incremented input. It takes only the added input as input but attends to cached representations of the original input in lower layers for better performance. By treating questions as additional inputs of a passage, Incremental Transformer can also be applied to accelerate MRC tasks. Experimental results show a tiny decline in effectiveness but a significant speedup against the traditional full encoder across various MRC and multi-turn conversational question answering tasks. With the help of simple distillation-like auxiliary losses, Incremental Transformer achieves a speedup of 6.2x, with a mere 2.2 point accuracy reduction in comparison to RoBERTa-Large on SQuAD v1.1.
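The core idea, reusing cached representations so that only the appended tokens are re-encoded, can be sketched as a single attention step in PyTorch; this is a simplified single-head illustration, not the paper's full architecture.

```python
import torch
import torch.nn.functional as F

def incremental_attention(q_new, k_new, v_new, k_cache, v_cache):
    """Attention for newly appended tokens only: queries come from the new
    segment, while keys/values concatenate the cached representations of the
    original input with those of the new segment. Shapes: (seq_len, dim)."""
    k = torch.cat([k_cache, k_new], dim=0)
    v = torch.cat([v_cache, v_new], dim=0)
    scores = q_new @ k.T / (q_new.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

d = 64
k_cache, v_cache = torch.randn(100, d), torch.randn(100, d)   # original input, cached
q_new, k_new, v_new = (torch.randn(5, d) for _ in range(3))    # appended tokens only
print(incremental_attention(q_new, k_new, v_new, k_cache, v_cache).shape)  # (5, 64)
```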
pdf
bib
abs
Enhancing Large Language Models for Document-Level Translation Post-Editing Using Monolingual Data
Zongyao Li
|
Zhiqiang Rao
|
Hengchao Shang
|
Jiaxin Guo
|
Shaojun Li
|
Daimeng Wei
|
Hao Yang
The translation capabilities of neural machine translation (NMT) models based on the encoder-decoder framework are extremely potent. Although Large Language Models (LLMs) have achieved remarkable results in many tasks, they have not reached state-of-the-art performance in NMT. However, traditional NMT still faces significant challenges in areas of document translation such as context consistency, tense, and pronoun resolution, where LLMs inherently possess substantial advantages. Instead of directly using LLMs for translation, employing them for Automatic Post-Editing (APE) to post-edit NMT outputs proves to be a viable option. However, document-level bilingual data is extremely scarce. This paper proposes a method that can effectively leverage the capabilities of LLMs to optimize document translation using only monolingual data. By employing two NMT models in opposite directions (Source-to-Target and Target-to-Source), we generate pseudo-document training data for the training of APE. We have identified and resolved the inconsistency between training and inference modes brought about by the pseudo-document training data. The final experimental results demonstrate that by using only document-level monolingual data, we can significantly improve the quality of NMT and greatly improve reference and contextual consistency.
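A minimal sketch of the pseudo-document data construction, assuming two hypothetical translation helpers standing in for the Target-to-Source and Source-to-Target NMT models (placeholders, not real APIs):

```python
def build_ape_triplets(target_documents, translate_t2s, translate_s2t):
    """Construct pseudo APE training triplets (source, draft translation, reference)
    from target-language monolingual documents only.

    translate_t2s / translate_s2t are hypothetical callables wrapping the
    Target-to-Source and Source-to-Target NMT models."""
    triplets = []
    for reference in target_documents:
        pseudo_source = translate_t2s(reference)   # back-translate to the source language
        draft = translate_s2t(pseudo_source)       # forward-translate the pseudo source
        triplets.append({"src": pseudo_source, "mt": draft, "ref": reference})
    return triplets
```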
pdf
bib
abs
PMSS: Pretrained Matrices Skeleton Selection for LLM Fine-tuning
Qibin Wang
|
Xiaolin Hu
|
Weikai Xu
|
Wei Liu
|
Jian Luan
|
Bin Wang
Low-rank adaptation (LoRA) and its variants have recently gained much interest due to their ability to avoid excessive inference costs. However, LoRA still encounters the following challenges: (1) the limitation of the low-rank assumption; and (2) its initialization method may be suboptimal. To this end, we propose PMSS (Pre-trained Matrices Skeleton Selection), which enables high-rank updates with low costs while leveraging semantic and linguistic information inherent in pre-trained weights. It achieves this by selecting skeletons from the pre-trained weight matrix and only learning a small matrix instead. Experiments demonstrate that PMSS outperforms LoRA and other fine-tuning methods across tasks with far fewer trainable parameters. We demonstrate its effectiveness, especially in handling complex tasks such as the DROP benchmark (+3.4%/+5.9% on LLaMA2-7B/13B) and math reasoning (+12.89%/+5.61%/+3.11% on LLaMA2-7B, Mistral-7B and Gemma-7B on GSM8K). The code and model will be released soon.
pdf
bib
abs
Learn from Failure: Causality-guided Contrastive Learning for Generalizable Implicit Hate Speech Detection
Tianming Jiang
Implicit hate speech presents a significant challenge for automatic detection systems due to its subtlety and ambiguity. Traditional models trained using empirical risk minimization (ERM) often rely on correlations between class labels and spurious attributes, which leads to poor performance on data lacking these correlations. In this paper, we propose a novel approach using causality-guided contrastive learning (CCL) to enhance the generalizability of implicit hate speech detection. Since ERM tends to identify spurious attributes, CCL works by aligning the representations of samples with the same class but opposite spurious attributes, identified through ERM’s inference failure. This method reduces the model’s reliance on spurious correlations, allowing it to learn more robust features and handle diverse, nuanced contexts better. Our extensive experiments on multiple implicit hate speech datasets show that our approach outperforms current state-of-the-art methods in cross-domain generalization.
pdf
bib
abs
Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation
Samin Mahdizadeh Sani
|
Pouya Sadeghi
|
Thuy-Trang Vu
|
Yadollah Yaghoobzadeh
|
Gholamreza Haffari
Large language models (LLMs) have made great progress in classification and text generation tasks. However, they are mainly trained on English data and often struggle with low-resource languages. In this study, we explore adding a new language, i.e., Persian, to Llama (a model with a limited understanding of Persian) using parameter-efficient fine-tuning. We employ a multi-stage approach involving pretraining on monolingual Persian data, aligning representations through bilingual pretraining and instruction datasets, and instruction-tuning with task-specific datasets. We evaluate the model’s performance at each stage on generation and classification tasks. Our findings suggest that incorporating the Persian language, through bilingual data alignment, can enhance classification accuracy for Persian tasks, with no adverse impact and sometimes even improvements on English tasks. Additionally, the results highlight the model’s initial strength as a critical factor when working with limited training data, with cross-lingual alignment offering minimal benefits for the low-resource language. Knowledge transfer from English to Persian has a marginal effect, primarily benefiting simple classification tasks.
pdf
bib
abs
Inductive Link Prediction in N-ary Knowledge Graphs
Jiyao Wei
|
Saiping Guan
|
Xiaolong Jin
|
Jiafeng Guo
|
Xueqi Cheng
N-ary Knowledge Graphs (NKGs), where a fact can involve more than two entities, have gained increasing attention. Link Prediction in NKGs (LPN) aims to predict missing elements in facts to facilitate the completion of NKGs. Current LPN methods implicitly operate under a closed-world assumption, meaning that the sets of entities and roles are fixed. These methods focus on predicting missing elements within facts composed of entities and roles seen during training. However, in reality, new facts involving unseen entities and roles frequently emerge, requiring completing these facts. Thus, this paper proposes a new task, Inductive Link Prediction in NKGs (ILPN), which aims to predict missing elements in facts involving unseen entities and roles in emerging NKGs. To address this task, we propose a Meta-learning-based N-ary knowledge Inductive Reasoner (MetaNIR), which employs a graph neural network with meta-learning mechanisms to embed unseen entities and roles adaptively. The obtained embeddings are used to predict missing elements in facts involving unseen elements. Since no existing dataset supports this task, three datasets are constructed to evaluate the effectiveness of MetaNIR. Extensive experimental results demonstrate that MetaNIR consistently outperforms representative models across all datasets.
pdf
bib
abs
ZigZagKV: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
Meizhi Zhong
|
Xikai Liu
|
Chen Zhang
|
Yikun Lei
|
Yan Gao
|
Yao Hu
|
Kehai Chen
|
Min Zhang
Large Language Models (LLMs) have become a research hotspot. To accelerate the inference of LLMs, storing computed caches in memory has become the standard technique. However, as the inference length increases, growing KV caches might lead to out-of-memory issues. Many existing methods address this issue through KV cache compression, primarily by preserving key tokens throughout all layers to reduce information loss. Most of them allocate a uniform budget size to each layer. However, we observe that the minimum budget sizes needed to retain essential information vary across layers and models based on the perspectives of attention and hidden state output. Building on this observation, this paper proposes a simple yet effective KV cache compression method that leverages layer uncertainty to allocate budget size for each layer. Experimental results show that the proposed method can reduce memory usage of the KV caches to only ~20% when compared to full KV inference while achieving nearly lossless performance.
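A minimal sketch of the budget-allocation idea, assuming positive per-layer uncertainty scores are already available and leaving aside the paper's specific uncertainty estimator:

```python
def allocate_kv_budgets(layer_uncertainty, total_budget, min_budget=8):
    """Split a total KV-cache token budget across layers in proportion to a
    per-layer uncertainty score, with a small floor per layer.
    Assumes all uncertainty scores are positive."""
    floor = min_budget * len(layer_uncertainty)
    remaining = max(total_budget - floor, 0)
    z = sum(layer_uncertainty)
    return [min_budget + int(remaining * u / z) for u in layer_uncertainty]

# Example: four layers, layer 1 judged most uncertain, 256 cached tokens overall.
print(allocate_kv_budgets([0.1, 0.5, 0.2, 0.2], total_budget=256))
# -> [30, 120, 52, 52]
```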
pdf
bib
abs
Automatic Mathematic In-Context Example Generation for LLM Using Multi-Modal Consistency
Jaeseong Lee
|
Wei Yang
|
Gopal Gupta
|
Shiyi Wei
Large Language Models (LLMs) have advanced Natural Language Processing (NLP) tasks but are limited in mathematical reasoning. To address this, few-shot examples are used in prompts for in-context learning. However, existing methods require annotated datasets, resulting in higher computational costs and lower quality examples. To mitigate these limitations, we propose AutoMathIC, a framework that automatically generates high-quality in-context examples to enhance LLMs’ mathematical reasoning. AutoMathIC ensures consistency across different modalities (e.g., Chain-of-Thought (CoT), code snippets, and equations) by generating and selecting mutations that improve response consistency. Evaluated on four math problem datasets, AutoMathIC outperforms six baselines, with LLM accuracy ranging from 87.0% to 99.3% for GPT-3.5 and 93.1% to 98.7% for GPT-4o-mini. It surpasses the state-of-the-art in-context example retrieval method in three of the four datasets by 0.3% to 11.8%, without relying on an annotated dataset.
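The consistency criterion can be illustrated as a simple agreement check across the final answers produced by the different modalities; this is a simplification of AutoMathIC's selection mechanism, not its exact scoring rule.

```python
from collections import Counter

def modality_consistency(cot_answer, code_answer, equation_answer):
    """Agreement-based score across modalities: a candidate example (or mutation)
    looks more reliable when the answers derived from CoT, code, and equations agree."""
    answers = [cot_answer, code_answer, equation_answer]
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / len(answers)

answer, consistency = modality_consistency("42", "42", "41")
print(answer, round(consistency, 2))  # 42 0.67
```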
pdf
bib
abs
From Traits to Empathy: Personality-Aware Multimodal Empathetic Response Generation
Jiaqiang Wu
|
Xuandong Huang
|
Zhouan Zhu
|
Shangfei Wang
Empathetic dialogue systems improve user experience across various domains. Existing approaches mainly focus on acquiring affective and cognitive knowledge from text, but neglect the unique personality traits of individuals and the inherently multimodal nature of human face-to-face conversation. To this end, we enhance the dialogue system with the ability to generate empathetic responses from a multimodal perspective, and consider the diverse personality traits of users. We incorporate multimodal data, such as images and texts, to understand the user’s emotional state and situation. Concretely, we first identify the user’s personality trait. Then, the dialogue system comprehends the user’s emotions and situation by the analysis of multimodal inputs. Finally, the response generator models the correlations among the personality, emotion, and multimodal data, to generate empathetic responses. Experiments on the MELD dataset and the MEDIC dataset validate the effectiveness of the proposed approach.
pdf
bib
abs
Integrating Visual Modalities with Large Language Models for Mental Health Support
Zhouan Zhu
|
Shangfei Wang
|
Yuxin Wang
|
Jiaqiang Wu
Current work of mental health support primarily utilizes unimodal textual data and often fails to understand and respond to users’ emotional states comprehensively. In this study, we introduce a novel framework that enhances Large Language Model (LLM) performance in mental health dialogue systems by integrating multimodal inputs. Our framework uses visual language models to analyze facial expressions and body movements, then combines these visual elements with dialogue context and counseling strategies. This approach allows LLMs to generate more nuanced and supportive responses. The framework comprises four components: in-context learning via computation of semantic similarity; extraction of facial expression descriptions through visual modality data; integration of external knowledge from a knowledge base; and delivery of strategic guidance through a strategy selection module. Both automatic and human evaluations confirm that our approach outperforms existing models, delivering more empathetic, coherent, and contextually relevant mental health support responses.
pdf
bib
abs
Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective
Meizhi Zhong
|
Chen Zhang
|
Yikun Lei
|
Xikai Liu
|
Yan Gao
|
Yao Hu
|
Kehai Chen
|
Min Zhang
Enabling LLMs to handle lengthy context is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate the RoPE trained on comparably short texts to far longer texts. Considerable effort has been dedicated to boosting extrapolation by extending the formulations of RoPE; however, few of these works have attempted to explain their inner workings comprehensively. In this paper, we are driven to offer a straightforward yet in-depth understanding of RoPE extensions from an attention perspective and on two benchmarking tasks. A broad array of experiments reveals several valuable findings: 1) Maintaining attention patterns close to those at the pretrained length improves extrapolation; 2) Large attention uncertainty leads to retrieval errors; 3) Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
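One way to operationalize "attention uncertainty" is the entropy of each query's attention distribution over key positions; the sketch below is an illustrative reading of that notion, not necessarily the paper's exact measure.

```python
import torch

def attention_uncertainty(attn_weights, eps=1e-9):
    """Entropy of an attention distribution over key positions, one value per
    query; higher entropy is read here as higher attention uncertainty."""
    p = attn_weights.clamp_min(eps)
    return -(p * p.log()).sum(dim=-1)

attn = torch.softmax(torch.randn(4, 128), dim=-1)   # 4 queries over 128 keys
print(attention_uncertainty(attn))
```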
pdf
bib
abs
Selected Languages are All You Need for Cross-lingual Truthfulness Transfer
Weihao Liu
|
Ning Wu
|
Wenbiao Ding
|
Shining Liang
|
Ming Gong
|
Dongmei Zhang
Truthfulness stands out as an essential challenge for Large Language Models (LLMs). Although many works have developed various ways for truthfulness enhancement, they seldom focus on truthfulness in multilingual scenarios. Meanwhile, contemporary multilingual aligning technologies struggle to balance massive languages and often exhibit serious truthfulness gaps across different languages, especially those that differ greatly from English. In our work, we extend truthfulness evaluation to multilingual contexts and propose a practical method for cross-lingual truthfulness transfer called Fact-aware Multilingual Selective Synergy (FaMSS). FaMSS is able to select an optimal subset of all tested languages by language bias and transfer contributions, and then employ translation instruction tuning for cross-lingual truthfulness transfer. Experimental results demonstrate that our approach can effectively reduce the multilingual representation disparity and boost cross-lingual truthfulness transfer of LLMs.
pdf
bib
abs
OVEL: Online Video Entity Linking
Haiquan Zhao
|
Xuwu Wang
|
Shisong Chen
|
Zhixu Li
|
Xin Zheng
|
Yanghua Xiao
Recently, Multi-modal Entity Linking (MEL) has attracted increasing attention in the research community due to its significance in numerous multi-modal applications. Video, as a popular means of information transmission, has become prevalent in people’s daily lives. However, most existing MEL methods primarily focus on linking textual and visual mentions or offline videos’ mentions to entities in multi-modal knowledge bases, with limited efforts devoted to linking mentions within online video content. In this paper, we propose a task called Online Video Entity Linking (OVEL), aiming to establish connections between mentions in online videos and a knowledge base with high accuracy and timeliness. To facilitate research on OVEL, we specifically concentrate on live delivery scenarios and construct a live delivery entity linking dataset called LIVE. In addition, we propose an evaluation metric that considers robustness, timeliness, and accuracy. Furthermore, to effectively handle the OVEL task, we leverage a memory block managed by a Large Language Model and retrieve entity candidates from the knowledge base to augment LLM performance on memory management. The experimental results prove the effectiveness and efficiency of our method.
pdf
bib
abs
The Only Way is Ethics: A Guide to Ethical Research with Large Language Models
Eddie L. Ungless
|
Nikolas Vitsakis
|
Zeerak Talat
|
James Garforth
|
Bjorn Ross
|
Arno Onken
|
Atoosa Kasirzadeh
|
Alexandra Birch
There is a significant body of work looking at the ethical considerations of large language models (LLMs): critiquing tools to measure performance and harms; proposing toolkits to aid in ideation; discussing the risks to workers; considering legislation around privacy and security etc. As yet there is no work that integrates these resources into a single practical guide that focuses on LLMs; we attempt this ambitious goal. We introduce LLM Ethics Whitepaper, which we provide as an open and living resource for NLP practitioners, and those tasked with evaluating the ethical implications of others’ work. Our goal is to translate ethics literature into concrete recommendations for computer scientists. LLM Ethics Whitepaper distils a thorough literature review into clear Do’s and Don’ts, which we present also in this paper. We likewise identify useful toolkits to support ethical work. We refer the interested reader to the full LLM Ethics Whitepaper, which provides a succinct discussion of ethical considerations at each stage in a project lifecycle, as well as citations for the hundreds of papers from which we drew our recommendations. The present paper can be thought of as a pocket guide to conducting ethical research with LLMs.
pdf
bib
abs
Should We Use a Fixed Embedding Size? Customized Dimension Sizes for Knowledge Graph Embedding
Zhanpeng Guan
|
Zhao Zhang
|
Yiqing Wu
|
Fuwei Zhang
|
Yongjun Xu
Knowledge Graph Embedding (KGE) aims to project entities and relations into a low-dimensional space, so as to enable Knowledge Graphs (KGs) to be effectively used by downstream AI tasks. Most existing KGs (e.g. Wikidata) suffer from the data imbalance issue, i.e., the occurrence frequencies vary significantly among different entities. Current KGE models use a fixed embedding size, leading to overfitting for low-frequency entities and underfitting for high-frequency ones. A simple method is to manually set embedding sizes based on frequency, but this is not feasible due to the complexity and the large number of entities. To this end, we propose CustomizE, which customizes embedding sizes in a data-driven way, assigning larger sizes for high-frequency entities and smaller sizes for low-frequency ones. We use bilevel optimization for stable learning of representations and sizes. It is noteworthy that our framework is universal and flexible, which is suitable for various KGE models. Experiments on link prediction tasks show its superiority over state-of-the-art baselines.
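A minimal sketch of the underlying mechanism, variable-size entity embeddings grouped into frequency buckets and projected to a shared dimension, follows; note that CustomizE learns the sizes via bilevel optimization rather than fixing them by hand, so the bucket sizes below are illustrative placeholders.

```python
import torch
import torch.nn as nn

class BucketedEmbedding(nn.Module):
    """Entities are grouped into frequency buckets; each bucket has its own
    embedding size and a projection to a shared output dimension.
    The bucket sizes here are hand-picked placeholders, not learned values."""
    def __init__(self, bucket_sizes, counts, out_dim=200):
        super().__init__()
        self.tables = nn.ModuleList(
            nn.Embedding(n, d) for n, d in zip(counts, bucket_sizes))
        self.projs = nn.ModuleList(
            nn.Linear(d, out_dim) for d in bucket_sizes)

    def forward(self, bucket_id, index):
        return self.projs[bucket_id](self.tables[bucket_id](index))

emb = BucketedEmbedding(bucket_sizes=[256, 64, 16],        # high/mid/low frequency
                        counts=[10_000, 100_000, 1_000_000])
print(emb(0, torch.tensor([3])).shape)  # torch.Size([1, 200])
```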
pdf
bib
abs
Chinese Automatic Readability Assessment Using Adaptive Pre-training and Linguistic Feature Fusion
Xusheng Yang
|
Jincai Yang
|
Xiao Li
Chinese Automatic Readability Assessment (ARA) aims to classify the reading difficulty of Chinese texts. To address the issues of insufficient high-quality training data and underutilization of linguistic features in existing methods, we propose a method that combines adaptive pre-training with feature fusion based on an interactive attention mechanism. First, we enhance the model’s ability to capture different text difficulties through domain- and task-specific adaptive pre-training. Then, we propose an Adaptive Task-guided Corpus Filtering (ATCF) method, utilizing embeddings generated by the pre-trained model and applying nearest-neighbor search along with a sample balancing mechanism to ensure comprehensive learning across various difficulty levels. Finally, we propose an Interactive Attention-Driven Feature Fusion method that integrates linguistic and deep features, providing rich difficulty information to the model. Experiments on a Chinese textbook dataset demonstrate that our method achieves state-of-the-art (SOTA) performance. Transfer learning experiments further indicate that our approach generalizes well to extracurricular reading and Chinese as a Foreign Language (CFL) ARA tasks.
pdf
bib
abs
Multitask-Bench: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning
Essa Jan
|
Nouar Aldahoul
|
Moiz Ali
|
Faizan Ahmad
|
Fareed Zaffar
|
Yasir Zaki
Recent breakthroughs in Large Language Models (LLMs) have led to their adoption across a wide range of tasks, ranging from code generation to machine translation and sentiment analysis. Red teaming/Safety alignment efforts show that fine-tuning models on benign (non-harmful) data could compromise safety. However, it remains unclear to what extent this phenomenon is influenced by different variables, including fine-tuning task, model calibrations, etc. This paper explores the task-wise safety degradation due to fine-tuning on downstream tasks such as summarization, code generation, translation, and classification across various calibrations. Our results reveal that: 1) Fine-tuning LLMs for code generation and translation leads to the highest degradation in safety guardrails. 2) LLMs generally have weaker guardrails for translation and classification, with 73-92% of harmful prompts answered across the baseline and other calibrations, falling into one of two concern categories. 3) Current solutions, including guards and safety tuning datasets, lack cross-task robustness. To address these issues, we developed a new multitask safety dataset effectively reducing attack success rates across a range of tasks without compromising the model’s overall helpfulness. Our work underscores the need for generalized alignment measures to ensure safer and more robust models.
pdf
bib
abs
Unmasking the Imposters: How Censorship and Domain Adaptation Affect the Detection of Machine-Generated Tweets
Bryan E. Tuck
|
Rakesh Verma
The rapid development of large language models (LLMs) has significantly improved the generation of fluent and convincing text, raising concerns about their potential misuse on social media platforms. We present a comprehensive methodology for creating nine Twitter datasets to examine the generative capabilities of four prominent LLMs: Llama 3, Mistral, Qwen2, and GPT-4o. These datasets encompass four censored and five uncensored model configurations, including 7B and 8B parameter base-instruction models of the three open-source LLMs. Additionally, we perform a data quality analysis to assess the characteristics of textual outputs from human, “censored,” and “uncensored” models, employing semantic meaning, lexical richness, structural patterns, content characteristics, and detector performance metrics to identify differences and similarities. Our evaluation demonstrates that “uncensored” models significantly undermine the effectiveness of automated detection methods. This study addresses a critical gap by exploring smaller open-source models and the ramifications of “uncensoring,” providing valuable insights into how domain adaptation and content moderation strategies influence both the detectability and structural characteristics of machine-generated text.
pdf
bib
abs
Detecting Emotional Incongruity of Sarcasm by Commonsense Reasoning
Ziqi Qiu
|
Jianxing Yu
|
Yufeng Zhang
|
Hanjiang Lai
|
Yanghui Rao
|
Qinliang Su
|
Jian Yin
This paper focuses on sarcasm detection, which aims to identify whether given statements convey criticism, mockery, or other negative sentiment opposite to the literal meaning. To detect sarcasm, humans often require a comprehensive understanding of the semantics in the statement and even resort to external commonsense to infer the fine-grained incongruity. However, existing methods lack commonsense inferential ability when they face complex real-world scenarios, leading to unsatisfactory performance. To address this problem, we propose a novel framework for sarcasm detection, which conducts incongruity reasoning based on commonsense augmentation, called EICR. Concretely, we first employ retrieval-augmented large language models to supplement the missing but indispensable commonsense background knowledge. To capture complex contextual associations, we construct a dependency graph and obtain the optimized topology via graph refinement. We further introduce an adaptive reasoning skeleton that integrates prior rules to extract sentiment-inconsistent subgraphs explicitly. To eliminate the possible spurious relations between words and labels, we employ adversarial contrastive learning to enhance the robustness of the detector. Experiments conducted on five datasets demonstrate the effectiveness of EICR.
pdf
bib
abs
Enhancing the Reasoning Capabilities of Small Language Models via Solution Guidance Fine-Tuning
Jing Bi
|
Yuting Wu
|
Weiwei Xing
|
Zhenjie Wei
Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks. Advances in prompt engineering and fine-tuning techniques have further enhanced their ability to address complex reasoning challenges. However, these advanced capabilities are often exclusive to models exceeding 100 billion parameters. Although Chain-of-Thought (CoT) fine-tuning methods have been explored for smaller models (under 10 billion parameters), they typically depend on extensive CoT training data, which can introduce inconsistencies and limit effectiveness in low-data settings. To overcome these limitations, this paper introduces a new reasoning strategy, Solution Guidance (SG), and a plug-and-play training paradigm, Solution-Guidance Fine-Tuning (SGFT), for enhancing the reasoning capabilities of small language models (SLMs). SG focuses on problem understanding and decomposition at the semantic and logical levels, rather than specific computations, which can effectively improve the SLMs’ generalization and reasoning abilities. With only a small amount of SG training data, SGFT can fine-tune an SLM to produce accurate problem-solving guidance, which can then be flexibly fed to any SLM as prompts, enabling it to generate correct answers directly. Experimental results demonstrate that our method significantly improves the performance of SLMs on various reasoning tasks, enhancing both their practicality and efficiency within resource-constrained environments.
pdf
bib
abs
LOG: A Local-to-Global Optimization Approach for Retrieval-based Explainable Multi-Hop Question Answering
Hao Xu
|
Yunxiao Zhao
|
Jiayang Zhang
|
Zhiqiang Wang
|
Ru Li
Multi-hop question answering (MHQA) aims to utilize multi-source intensive documents retrieved to derive the answer. However, it is very challenging to model the importance of knowledge retrieved. Previous approaches primarily emphasize single-step and multi-step iterative decomposition or retrieval, which are susceptible to failure in long-chain reasoning due to the progressive accumulation of erroneous information. To address this problem, we propose a novel Local-tO-Global optimized retrieval method (LOG) to discover more beneficial information, facilitating the MHQA. In particular, we design a pointwise conditional v-information based local information modeling to cover usable documents with reasoning knowledge. We also improve tuplet objective loss, advancing multi-examples-aware global optimization to model the relationship between scattered documents. Extensive experimental results demonstrate our proposed method outperforms prior state-of-the-art models, and it can significantly improve multi-hop reasoning, notably for long-chain reasoning.
pdf
bib
abs
KG-TRICK: Unifying Textual and Relational Information Completion of Knowledge for Multilingual Knowledge Graphs
Zelin Zhou
|
Simone Conia
|
Daniel Lee
|
Min Li
|
Shenglei Huang
|
Umar Farooq Minhas
|
Saloni Potdar
|
Henry Xiao
|
Yunyao Li
Multilingual knowledge graphs (KGs) provide high-quality relational and textual information for various NLP applications, but they are often incomplete, especially in non-English languages. Previous research has shown that combining information from KGs in different languages aids either Knowledge Graph Completion (KGC), the task of predicting missing relations between entities, or Knowledge Graph Enhancement (KGE), the task of predicting missing textual information for entities. Although previous efforts have considered KGC and KGE as independent tasks, we hypothesize that they are interdependent and mutually beneficial. To this end, we introduce KG-TRICK, a novel sequence-to-sequence framework that unifies the tasks of textual and relational information completion for multilingual KGs. KG-TRICK demonstrates that: i) it is possible to unify the tasks of KGC and KGE into a single framework, and ii) combining textual information from multiple languages is beneficial to improve the completeness of a KG. As part of our contributions, we also introduce WikiKGE10++, the largest manually-curated benchmark for textual information completion of KGs, which features over 25,000 entities across 10 diverse languages.
pdf
bib
abs
Impromptu Cybercrime Euphemism Detection
Xiang Li
|
Yucheng Zhou
|
Laiping Zhao
|
Jing Li
|
Fangming Liu
Detecting euphemisms is essential for content security on various social media platforms, but existing methods designed for detecting euphemisms are ineffective for impromptu euphemisms. In this work, we make a first attempt at exploring impromptu euphemism detection and introduce the Impromptu Cybercrime Euphemisms Detection (ICED) dataset. Moreover, we propose a detection framework tailored to this problem, which employs context augmentation modeling and multi-round iterative training. Our detection framework mainly consists of a coarse-grained and a fine-grained classification model. The coarse-grained classification model removes most of the harmless content in the corpus to be detected. The fine-grained model, the impromptu euphemism detector, integrates context augmentation and multi-round iterative training to better predict the actual meaning of a masked token. In addition, we leverage ChatGPT to evaluate the model’s capability. Experimental results demonstrate that our approach achieves a remarkable 76-fold improvement compared to the previous state-of-the-art euphemism detector.
pdf
bib
abs
ALIS: Aligned LLM Instruction Security Strategy for Unsafe Input Prompt
Xinhao Song
|
Sufeng Duan
|
Gongshen Liu
In large language models, existing instruction tuning methods may fail to balance performance with robustness against attacks from user input such as prompt injection and jailbreaking. Inspired by computer hardware and operating systems, we propose an instruction tuning paradigm named Aligned LLM Instruction Security Strategy (ALIS) to enhance model performance by decomposing user inputs into irreducible atomic instructions and organizing them into instruction streams that guide the model's response generation. ALIS is a hierarchical structure in which user inputs and system prompts are treated as user-mode and kernel-mode instructions, respectively. Based on ALIS, the model can maintain security constraints by ignoring or rejecting input instructions when user-mode instructions attempt to conflict with kernel-mode instructions. To build ALIS, we also develop an automatic instruction generation method for training, and provide an instruction decomposition task with corresponding datasets. Notably, the ALIS framework, even with a small model generating the instruction streams, still substantially improves the resilience of LLMs to attacks without any loss in general capabilities.
pdf
bib
abs
ProTOD: Proactive Task-oriented Dialogue System Based on Large Language Model
Wenjie Dong
|
Sirong Chen
|
Yan Yang
Large Language Model (LLM)-based Task-Oriented Dialogue (TOD) systems show promising performance in helping users achieve specific goals in a zero-shot setting. However, existing systems engage with users in a reactive manner, relying on a basic single-query mechanism with the knowledge base and employing passive policy planning. Proactive TOD systems, which can provide potentially helpful information and plan cross-domain multi-task dialogue policies, have not been well studied. In addition, effective evaluation methods are also lacking. To address these issues, we propose ProTOD, a novel LLM-based proactive TOD framework designed to improve system proactivity and goal completion. First, we design an adaptive exploratory retrieval mechanism to dynamically navigate domain knowledge. Second, we introduce a two-stage passive-to-proactive policy planner that effectively organizes the relationships between knowledge and actions. Finally, we develop two distinct user simulators with different personalities to simulate real-world interactions and propose a new error measure, the Human-targeted Policy Edit Rate (HPER), for evaluation. Experimental results show that ProTOD achieves state-of-the-art (SOTA) performance, improving goal completion rates by 10% while significantly enhancing proactive engagement.
pdf
bib
abs
Towards Multilingual spoken Visual Question Answering system using Cross-Attention
Amartya Roy Chowdhury
|
Tonmoy Rajkhowa
|
Sanjeev Sharma
Visual question answering (VQA) poses a multi-modal translation challenge that requires the analysis of both images and questions simultaneously to generate appropriate responses. Although VQA research has mainly focused on text-based questions in English, speech-based questions in English and other languages remain largely unexplored. Incorporating speech could significantly enhance the utility of VQA systems, as speech is the primary mode of human communication. To address this gap, this work implements a speech-based VQA system and introduces the textless multilingual visual question answering (TM-VQA) dataset, featuring speech-based questions in English, German, Spanish, and French. The TM-VQA dataset contains 658,111 pairs of speech-based questions and answers based on 123,287 images. Finally, a novel, cross-attention-based unified multi-modal framework is presented to evaluate the efficacy of the TM-VQA dataset. The experimental results indicate the effectiveness of the proposed unified approach over the cascaded framework for both text- and speech-based VQA systems. The dataset can be accessed at https://github.com/Synaptic-Coder/TM-VQA.
pdf
bib
abs
Detecting Conversational Mental Manipulation with Intent-Aware Prompting
Jiayuan Ma
|
Hongbin Na
|
Zimu Wang
|
Yining Hua
|
Yue Liu
|
Wei Wang
|
Ling Chen
Mental manipulation severely undermines mental wellness by covertly and negatively distorting decision-making. While there is increasing interest in mental health care within the natural language processing community, progress in tackling manipulation remains limited due to the complexity of detecting subtle, covert tactics in conversations. In this paper, we propose Intent-Aware Prompting (IAP), a novel approach for detecting mental manipulation using large language models (LLMs), providing a deeper understanding of manipulative tactics by capturing the underlying intents of participants. Experimental results on the MentalManip dataset demonstrate the superior effectiveness of IAP over other advanced prompting strategies. Notably, our approach substantially reduces false negatives, helping detect more instances of mental manipulation with minimal misjudgment of positive cases. The code for this paper is available at https://github.com/Anton-Jiayuan-MA/Manip-IAP.
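A minimal sketch of what an intent-aware prompt along these lines could look like; the wording and the two-step structure are illustrative assumptions, not the paper's exact prompts:

```python
def intent_aware_prompt(dialogue: str) -> list[dict]:
    """Build a two-step prompt: first elicit each speaker's underlying intent,
    then condition the manipulation judgement on those intents."""
    elicit_intents = (
        "Read the conversation below and describe, in one sentence per speaker, "
        "what each participant is actually trying to achieve.\n\n" + dialogue
    )
    judge = (
        "Given the conversation and the intents you identified, answer 'yes' or 'no': "
        "does this conversation contain mental manipulation?"
    )
    return [
        {"role": "user", "content": elicit_intents},
        {"role": "user", "content": judge},
    ]
```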
pdf
bib
abs
MIGRATE: Cross-Lingual Adaptation of Domain-Specific LLMs through Code-Switching and Embedding Transfer
Seongtae Hong
|
Seungyoon Lee
|
Hyeonseok Moon
|
Heuiseok Lim
Large Language Models (LLMs) have rapidly advanced, with domain-specific expert models emerging to handle specialized tasks across various fields. However, the predominant focus on English-centric models demands extensive data, making it challenging to develop comparable models for middle and low-resource languages. To address this limitation, we introduce Migrate, a novel method that leverages open-source static embedding models and up to 3 million tokens of code-switching data to facilitate the seamless transfer of embeddings to target languages. Migrate enables effective cross-lingual adaptation without requiring large-scale domain-specific corpora in the target language, promoting the accessibility of expert LLMs to a diverse range of linguistic communities. Our experimental results demonstrate that Migrate significantly enhances model performance in target languages, outperforming baseline and existing cross-lingual transfer methods. This approach provides a practical and efficient solution for extending the capabilities of domain-specific expert models.
pdf
bib
abs
CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving
Bhavani Shankar P S V N
|
Preethi Jyothi
|
Pushpak Bhattacharyya
Code-switching is a widely prevalent linguistic phenomenon in multilingual societies like India. Building speech-to-text models for code-switched speech is challenging due to the limited availability of datasets. In this work, we focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. We present a new end-to-end model architecture, CoSTA, that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules (which are more widely available for many languages). Speech and ASR text representations are fused using an aligned interleaving scheme and are fed as input to a pretrained MT module; the whole pipeline is then trained end-to-end for spoken translation using synthetically created ST data. We also release a new evaluation benchmark for code-switched Bengali-English, Hindi-English, Marathi-English, and Telugu-English speech to English text. CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
pdf
bib
abs
Bridging the Language Gap: Dynamic Learning Strategies for Improving Multilingual Performance in LLMs
Somnath Kumar
|
Vaibhav Balloli
|
Mercy Ranjit
|
Kabir Ahuja
|
Sunayana Sitaram
|
Kalika Bali
|
Tanuja Ganu
|
Akshay Nambi
Large language models (LLMs) have revolutionized various domains but still struggle with non-Latin scripts and low-resource languages. This paper addresses the critical challenge of improving multilingual performance without extensive fine-tuning. We introduce a novel dynamic learning approach that optimizes the prompt strategy, embedding model, and LLM per query at runtime. By adapting configurations dynamically, our method achieves significant improvements over static, best, and random baselines. It operates efficiently in both offline and online settings, generalizing seamlessly across new languages and datasets. Leveraging Retrieval-Augmented Generation (RAG) with state-of-the-art multilingual embeddings, we achieve superior task performance across diverse linguistic contexts. Through systematic investigation and evaluation across 18 diverse languages using popular question-answering (QA) datasets, we show that our approach results in 10-15% improvements in multilingual performance over pre-trained models and 4x gains compared to fine-tuned, language-specific models.
pdf
bib
abs
Poetry in Pixels: Prompt Tuning for Poem Image Generation via Diffusion Models
Sofia Jamil
|
Bollampalli Areen Reddy
|
Raghvendra Kumar
|
Sriparna Saha
|
Joseph K. J
|
Koustava Goswami
The task of text-to-image generation has encountered significant challenges when applied to literary works, especially poetry. Poems are a distinct form of literature, with meanings that frequently transcend the literal words. To address this shortcoming, we propose the PoemToPixel framework, designed to generate images that visually represent the inherent meanings of poems. Our approach incorporates the concept of prompt tuning in our image generation framework to ensure that the resulting images closely align with the poetic content. In addition, we propose the PoeKey algorithm, which extracts three key elements (emotions, visual elements, and themes) from poems to form instructions that are subsequently provided to a diffusion model for generating corresponding images. Furthermore, to expand the diversity of the poetry dataset across different genres and ages, we introduce MiniPo, a novel multimodal dataset comprising 1001 children's poems and images. Leveraging this dataset alongside PoemSum, we conducted both quantitative and qualitative evaluations of image generation using our PoemToPixel framework. This paper demonstrates the effectiveness of our approach and offers a fresh perspective on generating images from literary sources. The code and dataset used in this work are publicly available.
pdf
bib
abs
Argumentation and Domain Discourse in Scholarly Articles on the Theory of International Relations
Magdalena Wolska
|
Sassan Gholiagha
|
Mitja Sienknecht
|
Dora Kiesel
|
Irene Lopez Garcia
|
Patrick Riehmann
|
Matti Wiegmann
|
Bernd Froehlich
|
Katrin Girgensohn
|
Jürgen Neyer
|
Benno Stein
We present the first dataset, annotation scheme, discourse analysis, and baseline experiments on argumentation and domain content types in scholarly articles on political science, specifically on the theory of International Relations (IR). The dataset comprises over 1,600 sentences stemming from three foundational articles on Neo-Realism, Liberalism, and Constructivism. We show that our annotation scheme enables educationally relevant insight into scholarly IR discourse and that state-of-the-art classifiers, while effective in distinguishing basic argumentative elements (Claims and Support/Attack relations), reaching up to 0.97 micro-F1, require domain-specific training and fine-tuning for the more fine-grained tasks of relation and content type prediction.
pdf
bib
abs
Semantic and Sentiment Dual-Enhanced Generative Model for Script Event Prediction
Feiyang Wu
|
Peixin Huang
|
Yanli Hu
|
Zhen Tan
|
Xiang Zhao
Script Event Prediction (SEP) aims to forecast the next event in a sequence from a list of candidates. Traditional methods often use pre-trained language models to model event associations but struggle with semantic ambiguity and embedding bias. Semantic ambiguity arises from the multiple meanings of identical words and insufficient consideration of event arguments, while embedding bias results from assigning similar word embeddings to event pairs with similar lexical features despite their different meanings. To address the above issues, we propose the Semantic and Sentiment Dual-enhanced Generative Model (SSD-GM). SSD-GM leverages two types of script event information to enhance the generative model. Specifically, it employs a GNN-based semantic structure aggregator to integrate event-centric structural information, thereby mitigating the impact of semantic ambiguity. Furthermore, we find that local sentiment variability effectively reduces biases in event embeddings, while maintaining global sentiment consistency enhances predictive accuracy. As a result, SSD-GM adeptly captures both the global and local sentiment of events through its sentiment information awareness mechanism. Extensive experiments on the Multi-Choice Narrative Cloze (MCNC) task demonstrate that our approach achieves better results than other state-of-the-art baselines.
pdf
bib
abs
Generation-Based and Emotion-Reflected Memory Update: Creating the KEEM Dataset for Better Long-Term Conversation
Jeonghyun Kang
|
Hongjin Kim
|
Harksoo Kim
In this work, we introduce the Keep Emotional and Essential Memory (KEEM) dataset, a novel generation-based dataset designed to enhance memory updates in long-term conversational systems. Unlike existing approaches that rely on simple accumulation or operation-based methods, which often result in information conflicts and difficulties in accurately tracking a user’s current state, KEEM dynamically generates integrative memories. This process not only preserves essential factual information but also incorporates emotional context and causal relationships, enabling a more nuanced understanding of user interactions. By seamlessly updating a system’s memory with both emotional and essential data, our approach promotes deeper empathy and enhances the system’s ability to respond meaningfully in open-domain conversations.
pdf
bib
abs
medIKAL: Integrating Knowledge Graphs as Assistants of LLMs for Enhanced Clinical Diagnosis on EMRs
Mingyi Jia
|
Junwen Duan
|
Yan Song
|
Jianxin Wang
Electronic Medical Records (EMRs), while integral to modern healthcare, present challenges for clinical reasoning and diagnosis due to their complexity and information redundancy. To address this, we propose medIKAL (Integrating Knowledge Graphs as Assistants of LLMs), a framework that combines Large Language Models (LLMs) with knowledge graphs (KGs) to enhance diagnostic capabilities. medIKAL assigns weighted importance to entities in medical records based on their type, enabling precise localization of candidate diseases within KGs. It innovatively employs a residual-network-like approach, allowing the LLM's initial diagnosis to be merged into the KG search results. Through a path-based reranking algorithm and a fill-in-the-blank style prompt template, it further refines the diagnostic process. We validated medIKAL's effectiveness through extensive experiments on a newly introduced open-source Chinese EMR dataset, demonstrating its potential to improve clinical diagnosis in real-world settings.
pdf
bib
abs
AIDER: a Robust and Topic-Independent Framework for Detecting AI-Generated Text
Jiayi Gui
|
Baitong Cui
|
Xiaolian Guo
|
Ke Yu
|
Xiaofei Wu
The human-level fluency achieved by large language models in text generation has intensified the challenge of distinguishing between human-written and AI-generated texts. While current fine-tuned detectors exist, they often lack robustness against adversarial attacks and struggle with out-of-distribution topics, limiting their practical applicability. This study introduces AIDER, a robust and topic-independent AI-generated text detection framework. AIDER leverages the ALBERT model for topic content disentanglement, enhancing transferability to unseen topics. It incorporates an augmentor that generates robust adversarial data for training, coupled with contrastive learning techniques to boost resilience. Comprehensive experiments demonstrate AIDER’s significant superiority over state-of-the-art methods, exhibiting exceptional robustness against adversarial attacks with minimal performance degradation. AIDER consistently achieves high accuracy in non-augmented scenarios and demonstrates remarkable generalizability to unseen topics. These attributes establish AIDER as a powerful and versatile tool for LLM-generated text detection across diverse real-world applications, addressing critical challenges in the evolving landscape of AI-generated content.
pdf
bib
abs
CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information
Yuxin Wang
|
MingHua Ma
|
Zekun Wang
|
Jingchang Chen
|
Shan Liping
|
Qing Yang
|
Dongliang Xu
|
Ming Liu
|
Bing Qin
The colossal parameters and computational overhead of Large Language Models (LLMs) challenge their real-world applications. Network pruning, which targets unstructured or structured sparsity by removing redundant parameters, has recently been explored for LLM acceleration. Existing LLM pruning works focus on unstructured pruning, which typically requires special hardware support for a practical speed-up. In contrast, structured pruning can reduce latency on general devices. However, it remains a challenge to perform structured pruning efficiently while maintaining performance, especially at high sparsity ratios. To this end, we introduce an efficient structured pruning framework named CFSP, which leverages both Coarse-grained (inter-block) and Fine-grained (intra-block) activation information as an importance criterion to guide pruning. The pruning is highly efficient, as it requires only one forward pass to compute feature activations. Specifically, we first allocate the sparsity budget across blocks based on their importance and then retain important weights within each block. In addition, we introduce a recovery fine-tuning strategy that adaptively allocates training overhead based on coarse-grained importance to further improve performance. Experimental results demonstrate that CFSP outperforms existing methods on diverse models across various sparsity budgets. Our code will be available at https://github.com/wyxscir/CFSP.
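To make the budget-allocation step concrete, here is a minimal sketch of how a global sparsity target might be split across blocks by importance and then applied within each block; the scoring and allocation rules below are assumptions for illustration, not CFSP's exact criteria:

```python
import numpy as np

def allocate_sparsity(block_importance, global_sparsity):
    """Give less important blocks a larger share of the pruning budget.
    Assumes equal-sized blocks, so the mean ratio roughly matches the target."""
    imp = np.asarray(block_importance, dtype=float)
    inverse = 1.0 / (imp + 1e-8)
    ratios = global_sparsity * inverse * len(inverse) / inverse.sum()
    return np.clip(ratios, 0.0, 0.95)

def prune_block(weights, fine_importance, sparsity):
    """Zero out the weights with the lowest fine-grained importance in one block."""
    k = int(sparsity * fine_importance.size)
    if k == 0:
        return weights
    threshold = np.partition(fine_importance.ravel(), k - 1)[k - 1]
    return weights * (fine_importance > threshold)
```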
pdf
bib
abs
Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models
Nishanth Madhusudhan
|
Sathwik Tejaswi Madhusudhan
|
Vikas Yadav
|
Masoud Hashemi
Abstention Ability (AA) is a critical aspect of Large Language Model (LLM) reliability, referring to an LLM's capability to withhold responses when uncertain or lacking a definitive answer, without compromising performance. Although previous studies have attempted to improve AA, they lack a standardized evaluation method and remain unsuitable for black-box models where token prediction probabilities are inaccessible. This makes comparative analysis challenging, especially for state-of-the-art closed-source commercial LLMs. This paper bridges this gap by introducing a black-box evaluation approach and a new dataset, Abstain-QA, crafted to rigorously assess AA across varied question types (answerable and unanswerable), domains (well-represented and under-represented), and task types (fact-centric and reasoning). We also propose a new confusion matrix, the "Answerable-Unanswerable Confusion Matrix" (AUCM), which serves as the basis for evaluating AA by offering a structured and precise approach to assessment. Finally, we explore the impact of three prompting strategies, namely Strict Prompting, Verbal Confidence Thresholding, and Chain-of-Thought (CoT), on improving AA. Our results indicate that even powerful models like GPT-4 and Mixtral 8x22b encounter difficulties with abstention; however, strategic approaches such as Strict Prompting and CoT can enhance this capability.
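One plausible reading of such an answerable/unanswerable confusion matrix and the abstention metrics it supports is sketched below; the cell names and metric definitions are assumptions for illustration, not the paper's exact AUCM:

```python
from dataclasses import dataclass

@dataclass
class AnswerableUnanswerableCounts:
    answered_correct: int        # answerable question, correct answer given
    answered_wrong: int          # answerable question, wrong answer given
    abstained_answerable: int    # answerable question, model abstained
    abstained_unanswerable: int  # unanswerable question, model abstained (desired)
    answered_unanswerable: int   # unanswerable question, model answered anyway

    def abstention_recall(self) -> float:
        """Share of unanswerable questions on which the model abstained."""
        total = self.abstained_unanswerable + self.answered_unanswerable
        return self.abstained_unanswerable / total if total else 0.0

    def over_abstention_rate(self) -> float:
        """Share of answerable questions on which the model wrongly abstained."""
        total = self.answered_correct + self.answered_wrong + self.abstained_answerable
        return self.abstained_answerable / total if total else 0.0
```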
pdf
bib
abs
Dr.ECI: Infusing Large Language Models with Causal Knowledge for Decomposed Reasoning in Event Causality Identification
Ruichu Cai
|
Shengyin Yu
|
Jiahao Zhang
|
Wei Chen
|
Boyan Xu
|
Keli Zhang
Despite the demonstrated potential of Large Language Models (LLMs) in diverse NLP tasks, their causal reasoning capability appears inadequate when evaluated within the context of the event causality identification (ECI) task. ECI tasks pose significant complexity for LLMs and necessitate comprehensive causal priors for accurate identification. To improve the performance of LLMs for causal reasoning, we propose a multi-agent Decomposed reasoning framework for Event Causality Identification, designated as Dr.ECI. In the discovery stage, Dr.ECI incorporates specialized agents such as Causal Explorer and Mediator Detector, which capture implicit causality and indirect causality more effectively. In the reasoning stage, Dr.ECI introduces the agents Direct Reasoner and Indirect Reasoner, which leverage the knowledge of the generalized causal structure specific to ECI. Extensive evaluations demonstrate the state-of-the-art performance of Dr.ECI compared with baselines based on LLMs and supervised training. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/Dr.ECI.
pdf
bib
abs
InternLM-Law: An Open-Sourced Chinese Legal Large Language Model
Zhiwei Fei
|
Songyang Zhang
|
Xiaoyu Shen
|
Dawei Zhu
|
Xiao Wang
|
Jidong Ge
|
Vincent Ng
We introduce InternLM-Law, a large language model (LLM) tailored for addressing diverse legal tasks related to Chinese laws. These tasks range from responding to standard legal questions (e.g., legal exercises in textbooks) to analyzing complex real-world legal situations. Our work contributes to Chinese Legal NLP research by (1) conducting one of the most extensive evaluations of state-of-the-art general-purpose and legal-specific LLMs to date that involves an automatic evaluation on the 20 legal NLP tasks in LawBench, a human evaluation on a challenging version of the Legal Consultation task, and an automatic evaluation of a model’s ability to handle very long legal texts; (2) presenting a methodology for training a Chinese legal LLM that offers superior performance to all of its counterparts in our extensive evaluation; and (3) facilitating future research in this area by making all of our code and model publicly available at https://github.com/InternLM/InternLM-Law.
pdf
bib
abs
Let’s Focus on Neuron: Neuron-Level Supervised Fine-tuning for Large Language Model
Haoyun Xu
|
Runzhe Zhan
|
Yingpeng Ma
|
Derek F. Wong
|
Lidia S. Chao
Large Language Models (LLMs) are composed of neurons that exhibit various behaviors and roles, which become increasingly diversified as models scale. Recent studies have revealed that not all neurons are active across different datasets, and that this sparsity correlates positively with task-specific ability, leading to advancements in model pruning and training efficiency. Traditional fine-tuning methods engage all parameters of an LLM, which is computationally expensive and may not be necessary. In contrast, Parameter-Efficient Fine-Tuning (PEFT) approaches aim to minimize the number of trainable parameters, yet they still operate at a relatively macro scale (e.g., layer-level). We introduce Neuron-Level Fine-Tuning (NeFT), a novel approach that refines the granularity of parameter training down to the individual neuron, enabling more parameter-efficient fine-tuning. The experimental results show that NeFT not only exceeds the performance of full-parameter fine-tuning and PEFT but also provides insights into the analysis of neurons. Our code and data are available at: https://github.com/NLP2CT/NeFT.
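A minimal sketch of how updates could be restricted to selected neurons of a linear layer via gradient masking; the choice of which neurons to tune, and the hook-based mechanism, are illustrative assumptions rather than NeFT's actual implementation:

```python
import torch
import torch.nn as nn

def restrict_to_neurons(linear: nn.Linear, neuron_ids):
    """Only the listed output neurons (rows of the weight matrix) receive
    gradient updates; all other rows are frozen by zeroing their gradients."""
    keep = torch.zeros(linear.out_features, dtype=torch.bool)
    keep[list(neuron_ids)] = True

    linear.weight.register_hook(lambda g: g * keep.unsqueeze(1).to(g.dtype))
    if linear.bias is not None:
        linear.bias.register_hook(lambda g: g * keep.to(g.dtype))

# Usage: tune only the first 154 of 3072 neurons (roughly 5%) in one FFN layer.
ffn_up = nn.Linear(768, 3072)
restrict_to_neurons(ffn_up, neuron_ids=range(154))
```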
pdf
bib
abs
Cross-Domain Fake News Detection based on Dual-Granularity Adversarial Training
Wenjie Wei
|
Yanyue Zhang
|
Jinyan Li
|
Panfei Liu
|
Deyu Zhou
Cross-domain fake news detection, which aims to detect fake news in unseen domains, has achieved promising results with the help of pre-trained language models. Existing approaches mainly rely on extracting domain-independent representations or modeling domain discrepancies to achieve domain adaptation. However, we found that the relationship between the entities in a piece of news and its corresponding label (fake or real) fluctuates across domains. This discrepancy is ignored by existing methods, leading to entity bias in the model. Therefore, in this paper, we propose a novel cross-domain fake news detection method based on dual-granularity adversarial training at the document level and the entity level. Specifically, both the news pieces and their entities are modeled individually to construct an encoder that can generate domain-independent representations using adversarial training. Moreover, a dual-granularity soft prompt, consisting of two independent learnable segments trained on the source domains, is employed to help the model adapt easily to unseen target domains. In addition, MultiFC, a released dataset for cross-domain fake news detection, is not suitable for this evaluation due to its unreasonable domain construction rules. We artificially reconstructed the dataset and named it New-MultiFC, which is a more domain-discriminative dataset. Experimental results on both the newly constructed New-MultiFC and FND3 show the effectiveness of the proposed approach, which achieves state-of-the-art results in unseen domains.
pdf
bib
abs
Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings
Chunsheng Zuo
|
Pavel Guerzhoy
|
Michael Guerzhoy
Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding. We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens. We show that this pattern can occur in both the trained and the randomly initialized Transformer models with causal attention and no positional encodings over a common range of hyperparameters.
pdf
bib
abs
RISCORE: Enhancing In-Context Riddle Solving in Language Models through Context-Reconstructed Example Augmentation
Ioannis Panagiotopoulos
|
George Filandrianos
|
Maria Lymperaiou
|
Giorgos Stamou
Riddle-solving requires advanced reasoning skills, pushing Large Language Models (LLMs) to engage in abstract thinking and creative problem-solving, often revealing limitations in their cognitive abilities. In this paper, we examine the riddle-solving capabilities of LLMs using a multiple-choice format, exploring how different prompting techniques impact performance on riddles that demand diverse reasoning skills. To enhance results, we introduce RISCORE (RIddle Solving with COntext REconstruction), a novel fully automated prompting method that generates and utilizes contextually reconstructed sentence-based puzzles in conjunction with the original examples to create few-shot exemplars. Our experiments demonstrate that RISCORE significantly improves the performance of language models on both vertical and lateral thinking tasks, surpassing traditional exemplar selection strategies across a variety of few-shot settings.
pdf
bib
abs
Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments
Iker De la Iglesia
|
Iakes Goenaga
|
Johanna Ramirez-Romero
|
Jose Maria Villa-Gonzalez
|
Josu Goikoetxea
|
Ander Barrena
Evaluating LLM-generated text has become a key challenge, especially in domain-specific contexts like the medical field. This work introduces a novel evaluation methodology for LLM-generated medical explanatory arguments, relying on Proxy Tasks and rankings to closely align results with human evaluation criteria, overcoming the biases typically seen in LLMs used as judges. We demonstrate that the proposed evaluators are robust against adversarial attacks, including the assessment of non-argumentative text. Additionally, the human-crafted arguments needed to train the evaluators are minimized to just one example per Proxy Task. By examining multiple LLM-generated arguments, we establish a methodology for determining whether a Proxy Task is suitable for evaluating LLM-generated medical explanatory arguments, requiring only five examples and two human experts.
pdf
bib
abs
CACA: Context-Aware Cross-Attention Network for Extractive Aspect Sentiment Quad Prediction
Bingfeng Chen
|
Haoran Xu
|
Yongqi Luo
|
Boyan Xu
|
Ruichu Cai
|
Zhifeng Hao
Aspect Sentiment Quad Prediction (ASQP) enhances the scope of aspect-based sentiment analysis by introducing the necessity to predict both explicit and implicit aspect and opinion terms. Existing leading generative ASQP approaches do not model the contextual relationships within the review sentence to predict implicit terms. However, introducing contextual information into the pre-trained language model framework is non-trivial due to the inflexibility of the generative encoder-decoder architecture. To make good use of the contextual information, we propose an extractive ASQP framework, CACA, which features a Context-Aware Cross-Attention Network. When implicit terms are present, the Context-Aware Cross-Attention Network enhances the alignment of aspects and opinions through alternating updates of explicit and implicit representations. Additionally, contrastive learning is introduced in the implicit representation learning process. Experimental results on three benchmarks demonstrate the effectiveness of CACA. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/CACA.
pdf
bib
abs
Improved Sparse Upcycling for Instruction Tuning
Wangyi Jiang
|
Yaojie Lu
|
Hongyu Lin
|
Xianpei Han
|
Le Sun
The Mixture-of-Experts (MoE) architecture has demonstrated significant potential in both large-scale pre-training and instruction tuning by offering increased parameter capacity without additional inference costs. However, developing MoE models faces challenges including training instability and the need for substantial high-quality training data. While efficient methodologies like sparse upcycling exist, they often lead to performance degradation in instruction tuning scenarios. We introduce representation-based sparse upcycling, a straightforward yet effective technique for converting dense language models into sparsely activated ones while maintaining similar computational costs. Unlike conventional sparse upcycling, our approach leverages intermediate representations from language models to initialize router weights. This strategy addresses the mismatch between randomly initialized and well-trained parameters while providing prior knowledge to guide expert specialization during training. Extensive experiments across diverse benchmarks demonstrate significant improvements in both model capabilities and routing consistency compared to existing approaches.
pdf
bib
abs
SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment
Yuchun Fan
|
Yongyu Mu
|
YiLin Wang
|
Lei Huang
|
Junhao Ruan
|
Bei Li
|
Tong Xiao
|
Shujian Huang
|
Xiaocheng Feng
|
Jingbo Zhu
Despite the significant improvements achieved by large language models (LLMs) on English reasoning tasks, these models continue to struggle with multilingual reasoning. Recent studies leverage a full-parameter, two-stage training paradigm to teach models to first understand non-English questions and then reason. However, this method suffers from both substantial computational cost and catastrophic forgetting. The fundamental cause is that, with the primary goal of enhancing multilingual comprehension, an excessive number of irrelevant layers and parameters are tuned during the first stage. Given our finding that the representation learning of languages is conducted mainly in the lower-level layers, we propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, tunes only the feed-forward sub-layers of 6 layers, comprising 6.5-8% of all parameters within 7B and 13B LLMs, and achieves better average performance than all strong baselines across 10 languages. Meanwhile, SLAM involves only one training stage, reducing training time by 4.1-11.9× compared to the two-stage method.
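A rough sketch of the kind of selective unfreezing this describes, assuming a typical decoder-style parameter naming scheme ("layers.{i}." prefixes and an "mlp"/"feed_forward" marker); the actual layer-identification procedure is the paper's own and is not reproduced here:

```python
def tune_lower_ffn_only(model, num_layers: int = 6,
                        ffn_markers=("mlp", "feed_forward")):
    """Freeze everything except the feed-forward sub-layers of the first
    `num_layers` transformer blocks; returns the trainable parameter count."""
    wanted = tuple(f"layers.{i}." for i in range(num_layers))
    trainable = 0
    for name, param in model.named_parameters():
        keep = any(p in name for p in wanted) and any(m in name for m in ffn_markers)
        param.requires_grad = keep
        if keep:
            trainable += param.numel()
    return trainable
```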
pdf
bib
abs
ME2-BERT: Are Events and Emotions what you need for Moral Foundation Prediction?
Lorenzo Zangari
|
Candida M. Greco
|
Davide Picca
|
Andrea Tagarelli
Moralities, emotions, and events are complex aspects of human cognition, which are often treated separately since capturing their combined effects is challenging, especially due to the lack of annotated data. Leveraging their interrelations hence becomes crucial for advancing the understanding of human moral behaviors. In this work, we propose ME2-BERT, the first holistic framework for fine-tuning a pre-trained language model like BERT to the task of moral foundation prediction. ME2-BERT integrates events and emotions for learning domain-invariant morality-relevant text representations. Our extensive experiments show that ME2-BERT outperforms existing state-of-the-art methods for moral foundation prediction, with an average increase up to 35% in the out-of-domain scenario.
pdf
bib
abs
SCCD: A Session-based Dataset for Chinese Cyberbullying Detection
Qingpo Yang
|
Yakai Chen
|
Zihui Xu
|
Yu-ming Shang
|
Sanchuan Guo
|
Xi Zhang
The rampant spread of cyberbullying content poses a growing threat to societal well-being. However, research on cyberbullying detection in Chinese remains underdeveloped, primarily due to the lack of comprehensive and reliable datasets. Notably, no existing Chinese dataset is specifically tailored for cyberbullying detection. Moreover, while comments play a crucial role within sessions, current session-based datasets often lack detailed, fine-grained annotations at the comment level. To address these limitations, we present a novel Chinese cyberbullying dataset, termed SCCD, which consists of 677 session-level samples sourced from Weibo, a major social media platform. Moreover, each comment within the sessions is annotated with fine-grained labels rather than conventional binary class labels. Empirically, we evaluate the performance of various baseline methods on SCCD, highlighting the challenges of effective Chinese cyberbullying detection.
pdf
bib
abs
Hands-off Image Editing: Language-guided Editing without any Task-specific Labeling, Masking or even Training
Rodrigo Santos
|
António Branco
|
João Ricardo Silva
|
Joao Rodrigues
Instruction-guided image editing consists in taking an image and an instruction and delivering that image altered according to that instruction. State-of-the-art approaches to this task suffer from the typical scaling up and domain adaptation hindrances related to supervision as they eventually resort to some kind of task-specific labelling, masking or training. We propose a novel approach that does without any such task-specific supervision and offers thus a better potential for improvement. Its assessment demonstrates that it is highly effective, achieving very competitive performance.
pdf
bib
abs
Beyond Film Subtitles: Is YouTube the Best Approximation of Spoken Vocabulary?
Adam Nohejl
|
Frederikus Hudi
|
Eunike Andriani Kardinata
|
Shintaro Ozaki
|
Maria Angelica Riera Machin
|
Hongyu Sun
|
Justin Vasselli
|
Taro Watanabe
Word frequency is a key variable in psycholinguistics, useful for modeling human familiarity with words even in the era of large language models (LLMs). Frequency in film subtitles has proved to be a particularly good approximation of everyday language exposure. For many languages, however, film subtitles are not easily available, or are overwhelmingly translated from English. We demonstrate that frequencies extracted from carefully processed YouTube subtitles provide an approximation comparable to, and often better than, the best currently available resources. Moreover, they are available for languages for which a high-quality subtitle or speech corpus does not exist. We use YouTube subtitles to construct frequency norms for five diverse languages, Chinese, English, Indonesian, Japanese, and Spanish, and evaluate their correlation with lexical decision time, word familiarity, and lexical complexity. In addition to being strongly correlated with two psycholinguistic variables, a simple linear regression on the new frequencies achieves a new high score on a lexical complexity prediction task in English and Japanese, surpassing both models trained on film subtitle frequencies and the LLM GPT-4. We publicly release our code, the frequency lists, fastText word embeddings, and statistical language models.
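As a small illustration of the frequency-based pipeline this describes, here is a sketch that turns subtitle token counts into Zipf-style frequencies and fits a linear regression to complexity ratings; the function names, the +1 smoothing, and the single-feature setup are assumptions, not the authors' exact procedure:

```python
import math
from collections import Counter
from sklearn.linear_model import LinearRegression

def zipf_frequencies(tokens):
    """Zipf score per word: log10 of frequency per billion words, with +1 smoothing."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: math.log10(c / total * 1e9 + 1) for w, c in counts.items()}

def fit_complexity_model(words, complexity_ratings, zipf):
    """Regress human lexical-complexity ratings on subtitle-derived Zipf frequency."""
    features = [[zipf.get(w, 0.0)] for w in words]
    return LinearRegression().fit(features, complexity_ratings)
```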
pdf
bib
abs
RealSafe: Quantifying Safety Risks of Language Agents in Real-World
Yingning Ma
We present RealSafe, an innovative evaluation framework that aims to rigorously assess the safety and reliability of large language model (LLM) agents in real application scenarios. RealSafe tracks the behavior of LLM agents in fourteen different application scenarios utilizing three contexts - standard operations, ambiguous interactions, and malicious behaviors. For standard operations and ambiguous interactions, possible risks based on the agents’ decision-making are categorized into high, medium and low levels to reveal safety problems arising even from non-malicious user instructions. In assessing malicious behavior, we evaluate six types of malicious attacks to test the LLM agents’ ability to recognize and defend against clearly malicious intent. After evaluating over 1000 queries involving multiple LLMs, we concluded that GPT-4 performed best among all evaluated models. However, it still has several deficiencies. This discovery highlights the need to enhance sensitivity and response to different security threats when designing and developing LLM agents. RealSafe offers an empirical time frame for researchers and developers to better understand the security problems LLM agents might face in real deployment and offers specific directions and ideas for building safer and smarter LLM agents down the road.
pdf
bib
abs
Voice synthesis in Polish and English - analyzing prediction differences in speaker verification systems
Joanna Gajewska
|
Alicja Martinek
|
Michał J. Ołowski
|
Ewelina Bartuzi-Trokielewicz
Deep learning has significantly enhanced voice synthesis, yielding realistic audio capable of mimicking individual voices. This progress, however, raises security concerns due to the potential misuse of audio deepfakes. Our research examines the effects of deepfakes on speaker recognition systems across English and Polish corpora, assessing both Text-to-Speech and Voice Conversion methods. We focus on the biometric similarity’s role in the effectiveness of impersonations and find that synthetic voices can maintain personal traits, posing risks of unauthorized access. The study’s key contributions include analyzing voice synthesis across languages, evaluating biometric resemblance in voice conversion, and contrasting Text-to-Speech and Voice Conversion paradigms. These insights emphasize the need for improved biometric security against audio deepfake threats.
pdf
bib
abs
AgriCLIP: Adapting CLIP for Agriculture and Livestock via Domain-Specialized Cross-Model Alignment
Umair Nawaz
|
Awais Muhammad
|
Hanan Gani
|
Muzammal Naseer
|
Fahad Shahbaz Khan
|
Salman Khan
|
Rao Anwer
Capitalizing on a vast amount of image-text data, large-scale vision-language pre-training has demonstrated remarkable zero-shot capabilities and has been utilized in several applications. However, models trained on general everyday web-crawled data often exhibit sub-optimal performance for specialized domains, likely due to domain shift. Recent works have tackled this problem for some domains (e.g., healthcare) by constructing domain-specialized image-text data. However, constructing a dedicated large-scale image-text dataset for sustainable areas of agriculture and livestock is still open research. Further, this domain requires fine-grained feature learning due to the subtle nature of the downstream tasks (e.g., nutrient deficiency detection and livestock breed classification). To address this, we present AgriCLIP, a vision-language foundational model dedicated to the domain of agriculture and livestock. First, we propose a large-scale dataset named ALive that leverages a customized prompt generation strategy to overcome the scarcity of expert annotations. Our ALive dataset covers crops, livestock, and fishery, with around 600,000 image-text pairs. Second, we propose a training pipeline that integrates both contrastive and self-supervised learning to learn both global semantic and local fine-grained domain-specialized features. Experiments on a diverse set of 20 downstream tasks demonstrate the effectiveness of the AgriCLIP framework, achieving an absolute gain of 9.07% in average zero-shot classification accuracy over standard CLIP adaptation via the domain-specialized ALive dataset. Our ALive dataset and code are accessible on GitHub.
pdf
bib
abs
RUIE: Retrieval-based Unified Information Extraction using Large Language Model
Xincheng Liao
|
Junwen Duan
|
Yixi Huang
|
Jianxin Wang
Unified information extraction (UIE) aims to extract diverse structured information from unstructured text. While large language models (LLMs) have shown promise for UIE, they require significant computational resources and often struggle to generalize to unseen tasks. We propose RUIE (Retrieval-based Unified Information Extraction), a framework that leverages in-context learning for efficient task generalization. RUIE introduces a novel demonstration selection mechanism combining LLM preferences with a keyword-enhanced reward model, and employs a bi-encoder retriever trained through contrastive learning and knowledge distillation. As the first trainable retrieval framework for UIE, RUIE serves as a universal plugin for various LLMs. Experimental results on eight held-out datasets demonstrate RUIE’s effectiveness, with average F1-score improvements of 19.22 and 3.22 compared to instruction-tuning methods and other retrievers, respectively.
pdf
bib
abs
It is not a piece of cake for GPT: Explaining Textual Entailment Recognition in the presence of Figurative Language
Giuseppe Gallipoli
|
Luca Cagliero
Textual Entailment Recognition (TER) aims to predict whether a pair of premise-hypothesis sentences represents an entailment, a contradiction, or neither. Addressing TER in the presence of figurative language is particularly challenging because words are used in a way that deviates from their conventional order and meaning. In this work, we investigate the capabilities of Large Language Models (LLMs) to address TER and to generate textual explanations of TER predictions. First, we evaluate LLM performance in Zero- and Few-Shot Learning settings, with and without Chain-of-Thought prompting. After identifying the best prompts, we highlight the settings in which in-context learning is beneficial. The closed-source models GPT-3.5 Turbo and GPT-4o show unexpected limitations compared to significantly smaller open-source LLMs. Next, we thoroughly analyze the effect of LLM fine-tuning, showing substantial improvements in the quality of TER explanations compared to Zero- and Few-Shot Learning. Notably, 9-billion-parameter open-source LLMs again demonstrate competitive performance against larger closed-source models. Finally, we compare our LLM-based approach with the state-of-the-art DREAM-FLUTE and Cross-Task architectures. The results show significant performance improvements, particularly in the quality of the generated explanations.
pdf
bib
abs
MuKA: Multimodal Knowledge Augmented Visual Information-Seeking
Lianghao Deng
|
Yuchong Sun
|
Shizhe Chen
|
Ning Yang
|
Yunfeng Wang
|
Ruihua Song
The visual information-seeking task aims to answer visual questions that require external knowledge, such as “On what date did this building officially open?”. Existing methods using retrieval-augmented generation framework primarily rely on textual knowledge bases to assist multimodal large language models (MLLMs) in answering questions. However, the text-only knowledge can impair information retrieval for the multimodal query of image and question, and also confuse MLLMs in selecting the most relevant information during generation. In this work, we propose a novel framework MuKA which leverages a multimodal knowledge base to address these limitations. Specifically, we construct a multimodal knowledge base by automatically pairing images with text passages in existing datasets. We then design a fine-grained multimodal interaction to effectively retrieve multimodal documents and enrich MLLMs with both retrieved texts and images. MuKA outperforms state-of-the-art methods by 38.7% and 15.9% on the InfoSeek and E-VQA benchmark respectively, demonstrating the importance of multimodal knowledge in enhancing both retrieval and answer generation.
pdf
bib
abs
MSG-LLM: A Multi-scale Interactive Framework for Graph-enhanced Large Language Models
Jiayu Ding
|
Zhangkai Zheng
|
Benshuo Lin
|
Yun Xue
|
Yiping Song
Graph-enhanced large language models (LLMs) leverage LLMs' remarkable ability to model language and use graph structures to capture topological relationships. Existing graph-enhanced LLMs typically retrieve similar subgraphs to augment the LLM, where the subgraphs carry the entities related to the target and the relations among those entities. However, these retrieval methods focus solely on accurately matching the target subgraph against candidate subgraphs at the same scale, neglecting that subgraphs at different scales may also share similar semantics or structures. To tackle this challenge, we introduce a graph-enhanced LLM with multi-scale retrieval (MSG-LLM). It captures similar graph structures and semantics across graphs at different scales and bridges graph alignment across multiple scales. Larger scales maintain the graph's global information, while smaller scales preserve the details of fine-grained sub-structures. Specifically, we construct a multi-scale variation to dynamically shrink the scale of graphs. Further, we employ a graph kernel search to discover subgraphs from the entire graph, which essentially achieves multi-scale graph retrieval in a Hilbert space. Additionally, we conduct multi-scale interactions (message passing) over graphs at various scales to integrate key information. The interaction also bridges the graph and the LLM, helping with graph retrieval and LLM generation. Finally, we employ Chain-of-Thought-based LLM prediction to perform the downstream tasks. We evaluate our approach on two graph-based downstream tasks, and the experimental results show that our method achieves state-of-the-art performance.
pdf
bib
abs
MedEx: Enhancing Medical Question-Answering with First-Order Logic based Reasoning and Knowledge Injection
Aizan Zafar
|
Kshitij Mishra
|
Asif Ekbal
In medical question-answering, traditional knowledge triples often fail due to superfluous data and their inability to capture complex relationships between symptoms and treatments across diseases. This limits models’ ability to provide accurate, contextually relevant responses. To overcome this, we introduce MedEx, which employs First-Order Logic (FOL)-based reasoning to model intricate relationships between diseases and treatments. We construct FOL-based triplets that encode the interplay of symptoms, diseases, and treatments, capturing not only surface-level data but also the logical constraints of the medical domain. MedEx encodes the discourse (questions and context) using a transformer-based unit, enhancing context comprehension. These encodings are processed by a Knowledge Injection Cell that integrates knowledge graph triples via a Graph Attention Network. The Logic Fusion Cell then combines medical-specific logical rule triples (e.g., co-occurrence, causation, diagnosis) with knowledge triples and extracts answers through a feed-forward layer. Our analysis demonstrates MedEx’s effectiveness and generalization across medical question-answering tasks. By merging logical reasoning with knowledge, MedEx provides precise medical answers and adapts its logical rules based on training data nuances.
pdf
bib
abs
Zero-shot and Few-shot Learning with Instruction-following LLMs for Claim Matching in Automated Fact-checking
Dina Pisarevskaya
|
Arkaitz Zubiaga
The claim matching (CM) task can benefit an automated fact-checking pipeline by grouping claims that can be resolved with the same fact-check. In this work, we are the first to explore zero-shot and few-shot learning approaches to the task. We consider CM as a binary classification task and experiment with a set of instruction-following large language models (GPT-3.5-turbo, Gemini-1.5-flash, Mistral-7B-Instruct, and Llama-3-8B-Instruct), investigating prompt templates. We introduce a new CM dataset, ClaimMatch, which will be released upon acceptance. We put LLMs to the test on the CM task and find that it can be tackled by leveraging more mature yet similar tasks such as natural language inference or paraphrase detection. We also propose a pipeline for CM, which we evaluate on texts of different lengths.
pdf
bib
abs
Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning
Yukang Lin
|
Bingchen Zhong
|
Shuoran Jiang
|
Joanna Siebert
|
Qingcai Chen
Large language models (LLMs) have exhibited remarkable few-shot learning capabilities and unified the paradigm of NLP tasks through the in-context learning (ICL) technique. Despite the success of ICL, the quality of the exemplar demonstrations can significantly influence the LLM's performance. Existing exemplar selection methods mainly focus on the semantic similarity between queries and candidate exemplars. On the other hand, the logical connections between reasoning steps can also help depict the problem-solving process. This paper proposes a novel method named Reasoning Graph-enhanced Exemplar Retrieval (RGER). RGER first queries the LLM to generate an initial response and then expresses the intermediate problem-solving steps as a graph structure. After that, it employs a graph kernel to select exemplars with semantic and structural similarity. Extensive experiments demonstrate that structural relationships help align queries with candidate exemplars. The efficacy of RGER on mathematics and logical reasoning tasks showcases its superiority over state-of-the-art retrieval-based approaches.
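A minimal sketch of graph-kernel-style exemplar selection in the spirit of the description above; the histogram-intersection kernel and the candidate format are simplifying assumptions, and the paper's actual kernel is not specified here:

```python
from collections import Counter

def node_label_kernel(labels_a, labels_b):
    """Histogram-intersection similarity over node labels of two reasoning graphs."""
    ca, cb = Counter(labels_a), Counter(labels_b)
    return sum(min(ca[k], cb[k]) for k in ca.keys() & cb.keys())

def select_exemplars(query_labels, candidates, k=4):
    """Pick the k candidate exemplars whose reasoning graphs are most similar
    to the query's graph; each candidate is assumed to carry a 'graph' field."""
    ranked = sorted(candidates,
                    key=lambda c: node_label_kernel(query_labels, c["graph"]),
                    reverse=True)
    return ranked[:k]
```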
pdf
bib
abs
A Review of Prominent Paradigms for LLM-Based Agents: Tool Use, Planning (Including RAG), and Feedback Learning
Xinzhe Li
Tool use, planning, and feedback learning are currently three prominent paradigms for developing Large Language Model (LLM)-based agents across various tasks. Although numerous frameworks have been devised for each paradigm, their intricate workflows and inconsistent taxonomy create challenges in understanding and reviewing the frameworks across different paradigms. This survey introduces a unified taxonomy to systematically review and discuss these frameworks. Specifically, 1) the taxonomy defines environments/tasks, common LLM-profiled roles (LMPRs: policy models, evaluators, and dynamic models), and universally applicable workflows found in prior work, and 2) it enables a comparison of key perspectives on LMPR implementations and workflow usage across different agent paradigms.
pdf
bib
abs
Analyzing Offensive Language Dataset Insights from Training Dynamics and Human Agreement Level
Do Kyung Kim
|
Hyeseon Ahn
|
Youngwook Kim
|
Yo-Sub Han
Implicit hate speech detection is challenging due to its subjectivity and context dependence, with existing models often struggling in out-of-domain scenarios. We propose CONELA, a novel data refinement strategy that enhances model performance and generalization by integrating human annotation agreement with model training dynamics. By removing both easy and hard instances from the model's perspective, while also considering whether humans agree or disagree, and retaining ambiguous cases crucial for out-of-distribution generalization, CONELA consistently improves performance across multiple datasets and models. We also observe significant improvements in F1 scores and cross-domain generalization when using our CONELA strategy. To address data scarcity in smaller datasets, we introduce a weighted loss function and an ensemble strategy incorporating disagreement maximization, effectively balancing learning from limited data. Our findings demonstrate that refining datasets by integrating both model and human perspectives significantly enhances the effectiveness and generalization of implicit hate speech detection models. This approach lays a strong foundation for future research on dataset refinement and model robustness.
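A rough sketch of how training dynamics and annotator agreement might be combined into such a refinement filter; the quantile thresholds and the exact combination rule are invented for illustration and are not CONELA's published settings:

```python
import numpy as np

def refine_dataset(confidence, variability, agreement,
                   easy_q=0.9, hard_q=0.1, ambiguous_q=0.75):
    """Return a boolean keep-mask over training instances.
    confidence, variability: per-instance training-dynamics statistics.
    agreement: fraction of annotators choosing the majority label."""
    confidence = np.asarray(confidence, dtype=float)
    variability = np.asarray(variability, dtype=float)
    agreement = np.asarray(agreement, dtype=float)

    easy = (confidence >= np.quantile(confidence, easy_q)) & (agreement >= 0.9)
    hard = (confidence <= np.quantile(confidence, hard_q)) & (agreement >= 0.9)
    ambiguous = variability >= np.quantile(variability, ambiguous_q)

    keep = ~(easy | hard)   # drop instances the model finds trivially easy or hopeless
    keep |= ambiguous       # but always retain ambiguous cases for OOD generalization
    return keep
```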
pdf
bib
abs
Solid-SQL: Enhanced Schema-linking based In-context Learning for Robust Text-to-SQL
Geling Liu
|
Yunzhi Tan
|
Ruichao Zhong
|
Yuanzhen Xie
|
Lingchen Zhao
|
Qian Wang
|
Bo Hu
|
Zang Li
Recently, large language models (LLMs) have significantly improved the performance of text-to-SQL systems. Nevertheless, many state-of-the-art (SOTA) approaches have overlooked the critical aspect of system robustness. Our experiments reveal that while LLM-driven methods excel on standard datasets, their accuracy is notably compromised when faced with adversarial perturbations. To address this challenge, we propose a robust text-to-SQL solution, called Solid-SQL, designed to integrate with various LLMs. We focus on the pre-processing stage, training a robust schema-linking model enhanced by LLM-based data augmentation. Additionally, we design a two-round, structural similarity-based example retrieval strategy for in-context learning. Our method achieves SOTA SQL execution accuracy levels of 82.1% and 58.9% on the general Spider and Bird benchmarks, respectively. Furthermore, experimental results show that Solid-SQL delivers an average improvement of 11.6% compared to baselines on the perturbed Spider-Syn, Spider-Realistic, and Dr. Spider benchmarks.
pdf
bib
abs
Mitigating the Discrepancy Between Video and Text Temporal Sequences: A Time-Perception Enhanced Video Grounding method for LLM
Xuefen Li
|
Bo Wang
|
Ge Shi
|
Chong Feng
|
Jiahao Teng
Existing video LLMs typically excel at capturing the overall description of a video but lack the ability to demonstrate an understanding of temporal dynamics and a fine-grained grasp of localized content within the video. In this paper, we propose a Time-Perception Enhanced Video Grounding method via Boundary Perception and Temporal Reasoning, aimed at mitigating LLMs' difficulties in understanding the discrepancies between video and text temporality. Specifically, to address the inherent biases in current datasets, we design a series of boundary-perception tasks to enable LLMs to capture accurate video temporality. To tackle LLMs' insufficient understanding of temporal information, we develop specialized tasks for boundary perception and temporal relationship reasoning to deepen LLMs' perception of video temporality. Our experimental results show significant improvements across three datasets: ActivityNet, Charades, and DiDeMo (achieving up to 11.2% improvement on R@0.3), demonstrating the effectiveness of our proposed temporal-awareness-enhanced data construction method.
pdf
bib
abs
CE-DA: Custom Embedding and Dynamic Aggregation for Zero-Shot Relation Extraction
Fu Zhang
|
He Liu
|
Zehan Li
|
Jingwei Cheng
Zero-shot Relation Extraction (ZSRE) aims to predict novel relations from sentences with given entity pairs, where the relations have not been encountered during training. Prototype-based methods, which achieve ZSRE by aligning the sentence representation with the relation prototype representation, have shown great potential. However, most existing works focus solely on improving the quality of prototype representations, neglecting sentence representations and lacking interaction between different types of relation side information. In this paper, we propose a novel ZSRE framework named CE-DA, which includes two modules: Custom Embedding and Dynamic Aggregation. We employ a two-stage approach to obtain customized embeddings of sentences. In the first stage, we train a sentence encoder through unsupervised contrastive learning, and in the second stage, we highlight the potential relations between entities in sentences using carefully designed entity-emphasis prompts to further enhance sentence representations. Additionally, our dynamic aggregation method assigns different weights to different types of relation side information through a learnable network to enhance the quality of relation prototype representations. In contrast to traditional methods that treat all side information as equally important, our dynamic aggregation method further strengthens the interaction between different types of relation side information. Our method demonstrates competitive performance across various metrics on two ZSRE datasets.
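A small sketch of what a learnable dynamic aggregation over relation side information could look like; the attention-style scorer and tensor shapes are assumptions made for illustration, not CE-DA's published architecture:

```python
import torch
import torch.nn as nn

class DynamicAggregation(nn.Module):
    """Weights several side-information embeddings per relation (e.g., name,
    description, example sentences) and pools them into one prototype vector."""
    def __init__(self, hidden: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, side_info: torch.Tensor) -> torch.Tensor:
        # side_info: [num_relations, num_sources, hidden]
        weights = torch.softmax(self.scorer(side_info).squeeze(-1), dim=-1)
        return torch.einsum("rs,rsh->rh", weights, side_info)

aggregator = DynamicAggregation(hidden=768)
prototypes = aggregator(torch.randn(40, 3, 768))  # one prototype per relation
```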
pdf
bib
abs
NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models
Han Han
|
Tong Zhu
|
Xiang Zhang
|
MengSong Wu
|
Xiong Hao
|
Wenliang Chen
Large language models (LLMs) combined with tool learning have achieved impressive results in real-world applications. During tool learning, LLMs may call multiple tools in nested orders, where the latter tool call may take the former response as its input parameters. However, research on nested tool learning capabilities is still under-explored, since existing benchmarks lack relevant data instances. To address this problem, we introduce NesTools to bridge the current gap in comprehensive nested tool learning evaluations. NesTools comprises a novel automatic data generation method to construct large-scale nested tool calls with different nesting structures. With manual review and refinement, the dataset is of high quality and closely aligned with real-world scenarios. Therefore, NesTools can serve as a new benchmark to evaluate the nested tool learning abilities of LLMs. We conduct extensive experiments on 22 LLMs and provide in-depth analyses with NesTools, which show that current LLMs still suffer from the complex nested tool learning task.
pdf
bib
abs
A Benchmark and Robustness Study of In-Context-Learning with Large Language Models in Music Entity Detection
Simon Hachmeier
|
Robert Jäschke
Detecting music entities such as song titles or artist names is useful for applications such as processing music search queries or analyzing music consumption on the web. Recent approaches incorporate smaller language models (SLMs) like BERT and achieve strong results. However, further research indicates a high influence of entity exposure during pre-training on the performance of the models. With the advent of large language models (LLMs), these have come to outperform SLMs in a variety of downstream tasks. However, researchers are still divided on whether this holds for tasks like entity detection in texts, due to issues such as hallucination. In this paper, we provide a novel dataset of user-generated metadata and conduct a benchmark and a robustness study using recent LLMs with in-context learning (ICL). Our results indicate that LLMs in the ICL setting yield higher performance than SLMs. We further uncover the large impact of entity exposure on the best-performing LLM in our study.
pdf
bib
abs
Do Current Video LLMs Have Strong OCR Abilities? A Preliminary Study
Yulin Fei
|
Yuhui Gao
|
Xingyuan Xian
|
Xiaojin Zhang
|
Tao Wu
|
Wei Chen
With the rise of multi-modal large language models, accurately extracting and understanding textual information from video content—referred to as video-based optical character recognition (Video OCR)—has become a crucial capability. This paper introduces a novel benchmark designed to evaluate the video OCR performance of multi-modal models in videos. Comprising 1,028 videos and 2,961 question-answer pairs, this benchmark poses several key challenges through six distinct sub-tasks: (1) Recognition of text content itself and its basic visual attributes, (2) Semantic and Spatial Comprehension of OCR objects in videos, and (3) Dynamic Motion detection and Temporal Localization. We developed this benchmark using a semi-automated approach that integrates the OCR ability of image LLMs with manual refinement, balancing efficiency, cost, and data quality. Our resource aims to help advance research in video LLMs and underscores the need for improving the OCR ability of video LLMs. The benchmark will be released on https://github.com/YuHuiGao/FG-Bench.git.
pdf
bib
abs
Disentangle to Decay: Linear Attention with Trainable Decay Factor
Haibo Tong
|
Chenyang Zhang
|
Jiayi Lin
|
Bingxuan Hou
|
Qingqing Hong
|
Junli Wang
Linear attention enhances the inference efficiency of Transformers and has attracted research interest as an efficient backbone for language models. Existing linear-attention-based models usually exploit decay-factor-based positional encoding (PE), where attention scores decay exponentially with increasing relative distance. However, most work manually designs a non-trainable decay factor for the exponential calculation, which limits further optimization. Our analysis reveals that directly training the decay factor is unstable because of large gradients. To address this, we propose a novel PE for linear attention named Disentangle to Decay (D2D). D2D disentangles the decay factor into two parts to achieve further optimization and stable training. Moreover, D2D can be transformed into a recurrent form for efficient inference. Experiments demonstrate that D2D achieves stable training of the decay factor and enhances the performance of linear attention in both normal-context-length and length-extrapolation scenarios.
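For readers unfamiliar with decay-factor positional encoding, the sketch below shows the standard form the abstract refers to, with the decay made trainable through a sigmoid for stability; this parameterisation is an assumption for illustration and is not the D2D disentanglement itself:

```python
# Illustrative sketch (not the D2D decomposition): decay-factor positional
# encoding for linear attention, where the score between positions i >= j is
# scaled by decay ** (i - j). A sigmoid keeps the trainable decay in (0, 1).
import torch
import torch.nn as nn

class DecayedLinearAttention(nn.Module):
    def __init__(self, dim: int, init_decay: float = 0.99):
        super().__init__()
        self.decay_logit = nn.Parameter(torch.logit(torch.tensor(init_decay)))
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); written in quadratic form for clarity,
        # an equivalent recurrent form exists for efficient inference
        q, k, v = self.q(x), self.k(x), self.v(x)
        seq = x.size(1)
        idx = torch.arange(seq, device=x.device)
        rel = (idx[:, None] - idx[None, :]).clamp(min=0)           # relative distance i - j
        decay = torch.sigmoid(self.decay_logit) ** rel             # (seq, seq) decay weights
        mask = torch.tril(torch.ones(seq, seq, device=x.device))   # causal mask
        scores = (q @ k.transpose(-1, -2)) * decay * mask
        return scores @ v
```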
pdf
bib
abs
GAProtoNet: A Multi-head Graph Attention-based Prototypical Network for Interpretable Text Classification
Ximing Wen
|
Wenjuan Tan
|
Rosina Weber
Pretrained transformer-based Language Models (LMs) are well-known for their ability to achieve significant improvement on text classification tasks with their powerful word embeddings, but their black-box nature, which leads to a lack of interpretability, has been a major concern. In this work, we introduce GAProtoNet, a novel white-box Multi-head Graph Attention-based Prototypical Network designed to explain the decisions of text classification models built with LM encoders. In our approach, the input vector and prototypes are regarded as nodes within a graph, and we utilize multi-head graph attention to selectively construct edges between the input node and prototype nodes to learn an interpretable prototypical representation. During inference, the model makes decisions based on a linear combination of activated prototypes weighted by the attention score assigned to each prototype, allowing its choices to be transparently explained by the attention weights and the prototypes. Experiments on multiple public datasets show our approach achieves superior results without sacrificing the accuracy of the original black-box LMs. We also compare with four alternative prototypical network variations, and our approach achieves the best accuracy and F1 among them. Our case study and visualization of prototype clusters also demonstrate its effectiveness in explaining the decisions of black-box models built with LMs.
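A minimal sketch of a prototype head whose prediction is a linear combination of prototype activations weighted by attention scores (assumed shapes and hyperparameters; not the GAProtoNet implementation):

```python
# Illustrative sketch: attention between an LM encoding and learnable prototype
# vectors yields per-prototype activations; the class logits are a linear
# combination of those activations, so each decision can be read off the weights.
import torch
import torch.nn as nn

class PrototypeHead(nn.Module):
    def __init__(self, dim: int, num_prototypes: int, num_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(num_prototypes, num_classes)

    def forward(self, h: torch.Tensor):
        # h: (batch, dim) LM encoding of the input text; dim must divide num_heads
        protos = self.prototypes.unsqueeze(0).expand(h.size(0), -1, -1)
        _, attn_weights = self.attn(h.unsqueeze(1), protos, protos)   # (batch, 1, P)
        activations = attn_weights.squeeze(1)                         # (batch, P)
        logits = self.classifier(activations)
        return logits, activations   # activations expose which prototypes fired
```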
pdf
bib
abs
Few-shot domain adaptation for named-entity recognition via joint constrained k-means and subspace selection
Ayoub Hammal
|
Benno Uthayasooriyar
|
Caio Corro
Named-entity recognition (NER) is a task that typically requires large annotated datasets, which limits its applicability across domains with varying entity definitions. This paper addresses few-shot NER, aiming to transfer knowledge to new domains with minimal supervision. Unlike previous approaches that rely solely on limited annotated data, we propose a weakly-supervised algorithm that combines small labeled datasets with large amounts of unlabeled data. Our method extends the k-means algorithm with label supervision, cluster size constraints, and domain-specific discriminative subspace selection. This unified framework achieves state-of-the-art results in few-shot NER, demonstrating its effectiveness in leveraging unlabeled data and adapting to domain-specific challenges.
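A heavily simplified sketch of label-supervised, capacity-constrained k-means (assuming every entity type has at least one labeled support point and a per-cluster capacity large enough to cover all points; the paper's discriminative subspace selection is omitted):

```python
# Illustrative sketch: k-means where labeled points are pinned to their
# entity-type cluster and unlabeled points are assigned greedily under a
# per-cluster capacity, preventing any cluster from collapsing or absorbing all.
import numpy as np

def constrained_kmeans(X, labels, k, capacity, iters=10):
    # X: (n, d) span embeddings; labels: -1 for unlabeled, else a cluster id in [0, k)
    centroids = np.stack([X[labels == c].mean(0) for c in range(k)])
    assign = labels.copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None]) ** 2).sum(-1)   # (n, k)
        order = np.argsort(dists.min(1))                           # most confident first
        counts = np.bincount(labels[labels >= 0], minlength=k)
        for i in order:
            if labels[i] >= 0:
                continue                                           # supervision stays pinned
            for c in np.argsort(dists[i]):
                if counts[c] < capacity:
                    assign[i], counts[c] = c, counts[c] + 1
                    break
        centroids = np.stack([X[assign == c].mean(0) for c in range(k)])
    return assign, centroids
```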
pdf
bib
abs
An Efficient Retrieval-Based Method for Tabular Prediction with LLM
Jie Wu
|
Mengshu Hou
Tabular prediction, a well-established problem in machine learning, has consistently garnered significant research attention within academia and industry. Recently, with the rapid development of large language models (LLMs), there has been increasing exploration of how to apply LLMs to tabular prediction tasks. Many existing methods, however, typically rely on extensive pre-training or fine-tuning of LLMs, which demands considerable computational resources. To avoid this, we propose a retrieval-based approach that utilizes the powerful capabilities of LLMs in representation, comprehension, and inference. Our approach eliminates the need for training any modules or performing data augmentation, depending solely on information from the target dataset. Experimental results reveal that, even without specialized training for tabular data, our method exhibits strong predictive performance on tabular prediction tasks, affirming its practicality and effectiveness.
pdf
bib
abs
AIGT: AI Generative Table Based on Prompt
Mingming Zhang
|
Zhiqing Xiao
|
Guoshan Lu
|
Sai Wu
|
Weiqiang Wang
|
Xing Fu
|
Can Yi
|
Junbo Zhao
Tabular data, which accounts for over 80% of enterprise data assets, is vital in various fields. With growing concerns about privacy protection and data-sharing restrictions, generating high-quality synthetic tabular data has become essential. Recent advancements show that large language models (LLMs) can effectively generate realistic tabular data by leveraging semantic information and overcoming the challenges of high-dimensional data that arise from one-hot encoding. However, current methods do not fully utilize the rich information available in tables. To address this, we introduce AI Generative Table (AIGT) based on prompt enhancement, a novel approach that utilizes metadata information, such as table descriptions and schemas, as prompts to generate ultra-high-quality synthetic data. To overcome the token limit constraints of LLMs, we propose long-token partitioning algorithms that enable AIGT to model tables of any scale. AIGT achieves state-of-the-art performance on 14 out of 20 public datasets and two real industry datasets within the Alipay risk control system.
pdf
bib
abs
IRR: Image Review Ranking Framework for Evaluating Vision-Language Models
Kazuki Hayashi
|
Kazuma Onishi
|
Toma Suzuki
|
Yusuke Ide
|
Seiji Gobara
|
Shigeki Saito
|
Yusuke Sakai
|
Hidetaka Kamigaito
|
Katsuhiko Hayashi
|
Taro Watanabe
Large-scale Vision-Language Models (LVLMs) process both images and text, excelling in multimodal tasks such as image captioning and description generation. However, while these models excel at generating factual content, their ability to generate and evaluate texts reflecting perspectives on the same image, depending on the context, has not been sufficiently explored. To address this, we propose IRR: Image Review Rank, a novel evaluation framework designed to assess critic review texts from multiple perspectives. IRR evaluates LVLMs by measuring how closely their judgments align with human interpretations. We validate it using a dataset of images from 15 categories, each with five critic review texts and annotated rankings in both English and Japanese, totaling over 2,000 data instances. Our results indicate that, although LVLMs exhibited consistent performance across languages, their correlation with human annotations was insufficient, highlighting the need for further advancements. These findings highlight the limitations of current evaluation methods and the need for approaches that better capture human reasoning in Vision & Language tasks.
pdf
bib
abs
Development of Numerical Error Detection Tasks to Analyze the Numerical Capabilities of Language Models
Taku Sakamoto
|
Saku Sugawara
|
Akiko Aizawa
Numbers are used to describe quantities in various scenarios in daily life; therefore, numerical errors can significantly affect the meaning of the entire sentence, and even a single-letter error can be fatal. Detecting numerical errors often requires a high level of commonsense and is difficult even with the recent large language models (LLMs). In this study, we create a benchmark dataset of numerical error detection that uses automatically generated numerical errors. In our analysis, we classify the numerical errors based on the properties of the errors and investigate the ability of the model from several perspectives, including the error class, error size, and passage domain. The experimental results indicate that GPT-3.5, GPT-4, and Llama-3-Instruct (8B) perform well in the numerical error detection task; however, they are not as accurate as humans. We find that the LLMs misidentified correct numbers as errors more frequently than the humans did. In particular, the analysis demonstrates that the current LLMs still need improvement for detecting numerical errors requiring calculations or extensive prior knowledge.
pdf
bib
abs
Searching for Structure: Investigating Emergent Communication with Large Language Models
Tom Kouwenhoven
|
Max Peeperkorn
|
Tessa Verhoef
Human languages have evolved to be structured through repeated language learning and use. These processes introduce biases that operate during language acquisition and shape linguistic systems toward communicative efficiency. In this paper, we investigate whether the same happens if artificial languages are optimised for implicit biases of Large Language Models (LLMs). To this end, we simulate a classical referential game in which LLMs learn and use artificial languages. Our results show that initially unstructured holistic languages are indeed shaped to have some structural properties that allow two LLM agents to communicate successfully. Similar to observations in human experiments, generational transmission increases the learnability of languages, but can at the same time result in non-humanlike degenerate vocabularies. Taken together, this work extends experimental findings, shows that LLMs can be used as tools in simulations of language evolution, and opens possibilities for future human-machine experiments in this field.
pdf
bib
abs
Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation
Esteban Garces Arias
|
Meimingwei Li
|
Christian Heumann
|
Matthias Assenmacher
Decoding strategies for generative large language models (LLMs) are a critical but often underexplored aspect of text generation tasks. Guided by specific hyperparameters, these strategies aim to transform the raw probability distributions produced by language models into coherent, fluent text. In this study, we undertake a large-scale empirical assessment of a range of decoding methods, open-source LLMs, textual domains, and evaluation protocols to determine how hyperparameter choices shape the outputs. Our experiments include both factual (e.g., news) and creative (e.g., fiction) domains, and incorporate a broad suite of automatic evaluation metrics alongside human judgments. Through extensive sensitivity analyses, we distill practical recommendations for selecting and tuning hyperparameters, noting that optimal configurations vary across models and tasks. By synthesizing these insights, this study provides actionable guidance for refining decoding strategies, enabling researchers and practitioners to achieve higher-quality, more reliable, and context-appropriate text generation outcomes.
pdf
bib
abs
Does RAG Introduce Unfairness in LLMs? Evaluating Fairness in Retrieval-Augmented Generation Systems
Xuyang Wu
|
Shuowei Li
|
Hsin-Tai Wu
|
Zhiqiang Tao
|
Yi Fang
Retrieval-Augmented Generation (RAG) has recently gained significant attention for its enhanced ability to integrate external knowledge sources into open-domain question answering (QA) tasks. However, it remains unclear how these models address fairness concerns, particularly with respect to sensitive attributes such as gender, geographic location, and other demographic factors. First, as language models evolve to prioritize utility, like improving exact match accuracy, fairness considerations may have been largely overlooked. Second, the complex, multi-component architecture of RAG methods poses challenges in identifying and mitigating biases, as each component is optimized for distinct objectives. In this paper, we aim to empirically evaluate fairness in several RAG methods. We propose a fairness evaluation framework tailored to RAG, using scenario-based questions and analyzing disparities across demographic attributes. Our experimental results indicate that, despite recent advances in utility-driven optimization, fairness issues persist in both the retrieval and generation stages. These findings underscore the need for targeted interventions to address fairness concerns throughout the RAG pipeline. The dataset and code used in this study are publicly available at this GitHub Repository.
pdf
bib
abs
CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages
Wenhao Zhuang
|
Yuan Sun
Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source CUTE (Chinese, Uyghur, Tibetan, English) dataset, consisting of two 25GB sets of four-language corpora (one parallel and one non-parallel), obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validates that the machine translation quality between Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for Uyghur and Tibetan languages to date, and we demonstrate its effectiveness in enhancing LLMs’ ability to process low-resource languages while investigating the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.
pdf
bib
abs
How Ambiguous Are the Rationales for Natural Language Reasoning? A Simple Approach to Handling Rationale Uncertainty
Hazel H. Kim
The quality of rationales is essential to the reasoning capabilities of language models. Rationales not only enhance reasoning performance in complex natural language tasks but also justify model decisions. However, obtaining impeccable rationales is often impossible. Our study investigates the role that ambiguous rationales play in model performance on natural language reasoning. We first assess the ambiguity of rationales through the lens of entropy and uncertainty in model prior beliefs, exploring its impact on task performance. We then propose a simple way to guide models to choose between two different reasoning paths depending on the ambiguity of the rationales. Our empirical results demonstrate that this approach leads to robust performance, particularly in adversarial scenarios where rationale quality is inconsistent.
pdf
bib
abs
Planning with Multi-Constraints via Collaborative Language Agents
Cong Zhang
|
Xin Deik Goh
|
Dexun Li
|
Hao Zhang
|
Yong Liu
The rapid advancement of neural language models has sparked a new surge of intelligent agent research. Unlike traditional agents, large language model-based agents (LLM agents) have emerged as a promising paradigm for achieving artificial general intelligence (AGI) due to their superior reasoning and generalization capabilities. Effective planning is crucial for the success of LLM agents in real-world tasks, making it a highly pursued topic in the community. Current planning methods typically translate tasks into executable action sequences. However, determining a feasible or optimal sequence for complex tasks with multiple constraints at fine granularity, which often requires compositing long chains of heterogeneous actions, remains challenging. This paper introduces Planning with Multi-Constraints (PMC), a zero-shot methodology for collaborative LLM-based multi-agent systems that simplifies complex task planning with constraints by decomposing it into a hierarchy of subordinate tasks. Each subtask is then mapped into executable actions. PMC was assessed on two constraint-intensive benchmarks, TravelPlanner and API-Bank. Notably, PMC achieved an average 42.68% success rate on TravelPlanner, significantly higher than GPT-4 (2.92%), and outperformed GPT-4 with ReAct on API-Bank by 13.64%, showing the immense potential of integrating LLMs with multi-agent systems. We also show that PMC works with small LLMs as the planning core, e.g., LLaMA-3.1-8B.
pdf
bib
abs
Enhancing Nursing and Elderly Care with Large Language Models: An AI-Driven Framework
Qiao Sun
|
Jiexin Xie
|
Nanyang Ye
|
Qinying Gu
|
Shijie Guo
This paper explores the application of large language models (LLMs) in nursing and elderly care, focusing on AI-driven patient monitoring and interaction. We introduce a novel Chinese nursing dataset and implement incremental pre-training (IPT) and supervised fine-tuning (SFT) techniques to enhance LLM performance in specialized tasks. Using LangChain, we develop an interactive nursing assistant capable of real-time care and personalized interventions. Experimental results demonstrate significant improvements, paving the way for AI-driven solutions to meet the growing demands of healthcare in aging populations.
pdf
bib
abs
A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation
Shijie Zhou
|
Ruiyi Zhang
|
Yufan Zhou
|
Changyou Chen
Large multimodal models still struggle with text-rich images because of inadequate training data. Self-Instruct provides an annotation-free way to generate instruction data, but its quality is poor, as multimodal alignment remains a hurdle even for the largest models. In this work, we propose LLaVAR-2 to enhance multimodal alignment for text-rich images through hybrid instruction generation between human annotators and large language models. Specifically, it involves detailed image captions from human annotators, followed by the use of these annotations in tailored text prompts for GPT-4o to curate a dataset. It also implements several mechanisms to filter out low-quality data, and the resulting dataset comprises 424k high-quality instruction pairs. Empirical results show that models fine-tuned on this dataset exhibit impressive enhancements over those trained with self-instruct data.
pdf
bib
abs
Cross-Lingual Knowledge Projection and Knowledge Enhancement for Zero-Shot Question Answering in Low-Resource Languages
Sello Ralethe
|
Jan Buys
Knowledge bases (KBs) in low-resource languages (LRLs) are often incomplete, posing a challenge for developing effective question answering systems over KBs in those languages. On the other hand, the size of training corpora for LRL language models is also limited, restricting the ability to do zero-shot question answering using multilingual language models. To address these issues, we propose a two-fold approach. First, we introduce LeNS-Align, a novel cross-lingual mapping technique which improves the quality of word alignments extracted from parallel English-LRL text by combining lexical alignment, named entity recognition, and semantic alignment. LeNS-Align is applied to perform cross-lingual projection of KB triples. Second, we leverage the projected KBs to enhance multilingual language models’ question answering capabilities by augmenting the models with Graph Neural Networks embedding the projected knowledge. We apply our approach to map triples from two existing English KBs, ConceptNet and DBpedia, to create comprehensive LRL knowledge bases for four low-resource South African languages. Evaluation on three translated test sets shows that our approach improves zero-shot question answering accuracy by up to 17% compared to baselines without KB access. The results highlight how our approach contributes to bridging the knowledge gap for low-resource languages by expanding knowledge coverage and question answering capabilities.
pdf
bib
abs
FarExStance: Explainable Stance Detection for Farsi
Majid Zarharan
|
Maryam Hashemi
|
Malika Behroozrazegh
|
Sauleh Eetemadi
|
Mohammad Taher Pilehvar
|
Jennifer Foster
We introduce FarExStance, a new dataset for explainable stance detection in Farsi. Each instance in this dataset contains a claim, the stance of an article or social media post towards that claim, and an extractive explanation which provides evidence for the stance label. We compare the performance of a fine-tuned multilingual RoBERTa model to several large language models in zero-shot, few-shot, and parameter-efficient fine-tuned settings on our new dataset. On stance detection, the most accurate models are the fine-tuned RoBERTa model, the LLM Aya-23-8B which has been fine-tuned using parameter-efficient fine-tuning, and few-shot Claude-3.5-Sonnet. Regarding the quality of the explanations, our automatic evaluation metrics indicate that few-shot GPT-4o generates the most coherent explanations, while our human evaluation reveals that the best Overall Explanation Score (OES) belongs to few-shot Claude-3.5-Sonnet. The fine-tuned Aya-23-8B model produced explanations most closely aligned with the reference explanations.
pdf
bib
abs
Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability
Xufeng Duan
|
Xinyu Zhou
|
Bei Xiao
|
Zhenguang Cai
As large language models (LLMs) advance in their linguistic capacity, understanding how they capture aspects of language competence remains a significant challenge. This study therefore employs psycholinguistic paradigms, which are well-suited for probing deeper cognitive aspects of language processing, to explore neuron-level representations in language models across three tasks: sound-shape association, sound-gender association, and implicit causality. Our findings indicate that while GPT-2-XL struggles with the sound-shape task, it demonstrates human-like abilities in both sound-gender association and implicit causality. Targeted neuron ablation and activation manipulation reveal a crucial relationship: When GPT-2-XL displays a linguistic ability, specific neurons correspond to that competence; conversely, the absence of such an ability indicates a lack of specialized neurons. This study is the first to utilize psycholinguistic experiments to investigate deep language competence at the neuron level, providing a new level of granularity in model interpretability and insights into the internal mechanisms driving language ability in transformer-based LLMs.
pdf
bib
abs
Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages
Robert Litschko
|
Oliver Kraus
|
Verena Blaschke
|
Barbara Plank
A large amount of local and culture-specific knowledge (e.g., people, traditions, food) can only be found in documents written in dialects. While there has been extensive research conducted on cross-lingual information retrieval (CLIR), the field of cross-dialect retrieval (CDIR) has received limited attention. Dialect retrieval poses unique challenges due to the limited availability of resources to train retrieval models and the high variability in non-standardized languages. We study these challenges on the example of German dialects and introduce the first German dialect retrieval dataset, dubbed WikiDIR, which consists of seven German dialects extracted from Wikipedia. Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with high lexical variation in dialects. We further show that commonly used CLIR methods such as query translation or zero-shot cross-lingual transfer with multilingual encoders do not transfer well to extremely low-resource setups, motivating the need for resource-lean and dialect-specific retrieval models.
pdf
bib
abs
MoKA: Parameter Efficiency Fine-Tuning via Mixture of Kronecker Product Adaptation
Beiming Yu
|
Zhenfei Yang
|
Xiushuang Yi
With the rapid development of large language models (LLMs), traditional full-parameter fine-tuning methods have become increasingly expensive in terms of computational resources and time costs. For this reason, parameter-efficient fine-tuning (PEFT) methods have emerged. Among them, Low-Rank Adaptation (LoRA) is currently one of the most popular PEFT methods and is widely used with large language models. However, the low-rank update mechanism of LoRA somewhat limits its ability to approximate full-parameter fine-tuning during the training process. In this paper, we propose a novel PEFT framework, MoKA (Mixture of Kronecker Product Adaptation), which combines the Kronecker product with the Mixture-of-Experts (MoE) method. By replacing the low-rank decomposition of the weight update matrix with Kronecker products and utilizing a sparse MoE architecture, MoKA achieves parameter efficiency and better model performance. Additionally, we design an efficient routing module to further compress the parameter size. We conduct extensive experiments on the GLUE benchmark, E2E NLG Challenge, and instruction tuning tasks for LLMs. The results demonstrate that MoKA outperforms existing PEFT methods.
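To make the Kronecker-product update concrete, the sketch below contrasts it with LoRA's low-rank update; this is an illustrative single adapter with assumed factor shapes, not the MoKA architecture (which additionally mixes several such experts via a sparse router):

```python
# Illustrative sketch: replace LoRA's low-rank update B @ A with a Kronecker
# product kron(A, B), which can reach full rank from two small trainable factors.
import torch
import torch.nn as nn

class KroneckerAdapter(nn.Module):
    def __init__(self, out_features: int, in_features: int, a_rows: int, a_cols: int):
        super().__init__()
        assert out_features % a_rows == 0 and in_features % a_cols == 0
        self.A = nn.Parameter(torch.zeros(a_rows, a_cols))   # zero init -> zero initial update
        self.B = nn.Parameter(torch.randn(out_features // a_rows,
                                          in_features // a_cols) * 0.01)

    def delta_weight(self) -> torch.Tensor:
        # (out_features, in_features) weight update built from two small factors
        return torch.kron(self.A, self.B)

    def forward(self, x: torch.Tensor, frozen: nn.Linear) -> torch.Tensor:
        # frozen: the pre-trained linear layer being adapted (kept frozen)
        return frozen(x) + x @ self.delta_weight().T
```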
pdf
bib
abs
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator
Zhihao Fan
|
Lai Wei
|
Jialong Tang
|
Wei Chen
|
Wang Siyuan
|
Zhongyu Wei
|
Fei Huang
Artificial intelligence has significantly revolutionized healthcare, particularly through large language models (LLMs) that demonstrate superior performance in static medical question answering benchmarks. However, evaluating the potential of LLMs for real-world clinical applications remains challenging due to the intricate nature of doctor-patient interactions. To address this, we introduce AI Hospital, a multi-agent framework emulating dynamic medical interactions between Doctor as player and NPCs including Patient and Examiner. This setup allows for more practical assessments of LLMs in simulated clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical records and multiple evaluation strategies to quantify the performance of LLM-driven Doctor agents on symptom collection, examination recommendations, and diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance medical interaction capabilities through iterative discussions. Despite improvements, current LLMs (including GPT-4) still exhibit significant performance gaps in multi-turn interactive scenarios compared to non-interactive scenarios. Our findings highlight the need for further research to bridge these gaps and improve LLMs’ clinical decision-making capabilities. Our data, code, and experimental results are all open-sourced at https://github.com/LibertFan/AI_Hospital.
pdf
bib
abs
Can LLMs Help Create Grammar?: Automating Grammar Creation for Endangered Languages with In-Context Learning
Piyapath T. Spencer
|
Nanthipat Kongborrirak
In present-day efforts to document and preserve endangered languages, the application of Large Language Models (LLMs) presents a promising approach. This paper explores how LLMs, particularly through in-context learning, can assist in generating grammatical information for low-resource languages with a limited amount of data. We take Moklen as a case study to evaluate the efficacy of LLMs in producing coherent grammatical rules and lexical entries using only bilingual dictionaries and parallel sentences of the unknown language, without building the model from scratch. Our methodology involves organising the existing linguistic data and prompting the model to efficiently generate a formal XLE grammar. Our results demonstrate that LLMs can successfully capture key grammatical structures and lexical information, although challenges such as the potential for English grammatical biases remain. This study highlights the potential of LLMs to enhance language documentation efforts, providing a cost-effective solution for generating linguistic data and contributing to the preservation of endangered languages.
pdf
bib
abs
Decompose-ToM: Enhancing Theory of Mind Reasoning in Large Language Models through Simulation and Task Decomposition
Sneheel Sarangi
|
Maha Elgarf
|
Hanan Salam
Theory of Mind (ToM) is the ability to understand and reflect on the mental states of others. Although this capability is crucial for human interaction, testing on Large Language Models (LLMs) reveals that they possess only a rudimentary understanding of it. Although the most capable closed-source LLMs have come close to human performance on some ToM tasks, they still perform poorly on complex variations of the task that involve more structured reasoning. In this work, we utilize the concept of “pretend-play”, or “Simulation Theory” from cognitive psychology to propose “Decompose-ToM”: an LLM-based inference algorithm that improves model performance on complex ToM tasks. We recursively simulate user perspectives and decompose the ToM task into a simpler set of tasks: subject identification, question-reframing, world model updation, and knowledge availability. We test the algorithm on higher-order ToM tasks and a task testing for ToM capabilities in a conversational setting, demonstrating that our approach shows significant improvement across models compared to baseline methods while requiring minimal prompt tuning across tasks and no additional model training.
pdf
bib
abs
Bridging Context Gaps: Enhancing Comprehension in Long-Form Social Conversations Through Contextualized Excerpts
Shrestha Mohanty
|
Sarah Xuan
|
Jacob Jobraeel
|
Anurag Kumar
|
Deb Roy
|
Jad Kabbara
We focus on enhancing comprehension in small-group recorded conversations, which serve as a medium to bring people together and provide a space for sharing personal stories and experiences on crucial social matters. One way to parse and convey information from these conversations is by sharing highlighted excerpts in subsequent conversations. This can help promote a collective understanding of relevant issues, by highlighting perspectives and experiences to other groups of people who might otherwise be unfamiliar with and thus unable to relate to these experiences. The primary challenge that arises then is that excerpts taken from one conversation and shared in another setting might be missing crucial context or key elements that were previously introduced in the original conversation. This problem is exacerbated when conversations become lengthier and richer in themes and shared experiences. To address this, we explore how Large Language Models (LLMs) can enrich these excerpts by providing socially relevant context. We present approaches for effective contextualization to improve comprehension, readability, and empathy. We show significant improvements in understanding, as assessed through subjective and objective evaluations. While LLMs can offer valuable context, they struggle with capturing key social aspects. We release the Human-annotated Salient Excerpts (HSE) dataset to support future work. Additionally, we show how context-enriched excerpts can provide more focused and comprehensive conversation summaries.
pdf
bib
abs
Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore
Junchao Wu
|
Runzhe Zhan
|
Derek F. Wong
|
Shu Yang
|
Xuebo Liu
|
Lidia S. Chao
|
Min Zhang
The efficacy of detectors for texts generated by large language models (LLMs) substantially depends on the availability of large-scale training data. However, white-box zero-shot detectors, which require no such data, are limited by the accessibility of the source model of the LLM-generated text. In this paper, we propose a simple yet effective black-box zero-shot detection approach based on the observation that, from the perspective of LLMs, human-written texts typically contain more grammatical errors than LLM-generated texts. This approach involves calculating the Grammar Error Correction Score (GECScore) for the given text to differentiate between human-written and LLM-generated text. Experimental results show that our method outperforms current state-of-the-art (SOTA) zero-shot and supervised methods, achieving an average AUROC of 98.62% across the XSum and Writing Prompts datasets. Additionally, our approach demonstrates strong reliability in the wild, exhibiting robust generalization and resistance to paraphrasing attacks. Data and code are available at: https://github.com/NLP2CT/GECScore.
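The core idea can be sketched as follows: correct the text with an off-the-shelf grammar-error-correction model and score it by how much it changed (an illustrative approximation with an assumed GEC model; the released repository contains the actual method):

```python
# Illustrative sketch of a GEC-based score: human-written text typically needs
# more grammatical corrections, so a larger normalised edit distance between
# the original and its corrected version suggests a human author.
from transformers import pipeline

corrector = pipeline("text2text-generation",
                     model="vennify/t5-base-grammar-correction")  # any GEC model works

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def gec_score(text: str) -> float:
    corrected = corrector(text, max_length=512)[0]["generated_text"]
    return levenshtein(text, corrected) / max(len(text), 1)

# Higher scores (more corrections needed) point towards human-written text;
# a threshold or AUROC sweep then separates the two classes.
```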
pdf
bib
abs
VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation
Wenrui Liu
|
Jionghao Bai
|
Xize Cheng
|
Jialong Zuo
|
Ziyue Jiang
|
Shengpeng Ji
|
Minghui Fang
|
Xiaoda Yang
|
Qian Yang
|
Zhou Zhao
In recent years, speech generation fields have achieved significant advancements, primarily due to improvements in large TTS (text-to-speech) systems and scalable TTS datasets. However, there is still a lack of large-scale multilingual TTS datasets, which limits the development of cross-language and multilingual TTS systems. Hence, we refine the Voxpopuli dataset and propose the VoxpopuliTTS dataset. This dataset comprises 30,000 hours of high-quality speech data across 3 languages with multiple speakers and styles, suitable for various speech tasks such as TTS and ASR. To enhance the quality of speech data from Voxpopuli, we improve the existing processing pipeline by: 1) filtering out low-quality speech-text pairs based on ASR confidence scores, and 2) concatenating short transcripts by checking semantic information completeness to generate longer transcripts. Experimental results demonstrate the effectiveness of the VoxpopuliTTS dataset and the proposed processing pipeline.
pdf
bib
abs
Self-Evolution Knowledge Distillation for LLM-based Machine Translation
Yuncheng Song
|
Liang Ding
|
Changtong Zan
|
Shujian Huang
Knowledge distillation (KD) has shown great promise in transferring knowledge from larger teacher models to smaller student models. However, existing KD strategies for large language models often minimize output distributions between student and teacher models indiscriminately for each token. This overlooks the imbalanced nature of tokens and their varying transfer difficulties. In response, we propose a distillation strategy called Self-Evolution KD. The core of this approach involves dynamically integrating teacher distribution and one-hot distribution of ground truth into the student distribution as prior knowledge, which promotes the distillation process. It adjusts the ratio of prior knowledge based on token learning difficulty, fully leveraging the teacher model’s potential. Experimental results show our method brings an average improvement of approximately 1.4 SacreBLEU points across four translation directions in the WMT22 test sets. Further analysis indicates that the improvement comes from better knowledge transfer from teachers, confirming our hypothesis.
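One way to read the described objective, sketched under assumed notation (not the paper's exact formulation), is a per-token blend of the teacher distribution and the one-hot ground truth whose mixing ratio tracks the student's current difficulty on that token:

```python
# Illustrative sketch: blend the teacher distribution with the one-hot ground
# truth as a per-token prior, with a mixing ratio driven by how hard the token
# currently is for the student (its negative log-likelihood), then distill
# towards that blended target.
import torch
import torch.nn.functional as F

def self_evolution_kd_loss(student_logits, teacher_logits, targets):
    # logits: (batch, seq, vocab); targets: (batch, seq) token ids
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    one_hot = F.one_hot(targets, student_logits.size(-1)).float()

    # token difficulty: high NLL under the student -> lean more on the ground truth
    nll = -student_log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    ratio = torch.sigmoid(nll - nll.mean()).unsqueeze(-1)        # (batch, seq, 1)

    prior = ratio * one_hot + (1 - ratio) * teacher_probs        # blended target
    return F.kl_div(student_log_probs, prior, reduction="batchmean")
```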
pdf
bib
abs
On Weaponization-Resistant Large Language Models with Prospect Theoretic Alignment
Zehua Cheng
|
Manying Zhang
|
Jiahao Sun
|
Wei Dai
Large language models (LLMs) have made significant advancements, but their increasing capabilities present serious risks of misuse, particularly in open-weight models where direct access to the model’s parameters is possible. Current safeguards, designed for closed-weight API models, are inadequate for open-weight models, as minimal fine-tuning can bypass these protections. Preserving the integrity of open-weight LLMs before deployment has thus become a critical challenge. We argue that these vulnerabilities stem from the overemphasis on maximizing the LLM’s log-likelihood during training, which amplifies data biases, especially with large datasets. To address these issues, we introduce Kahneman and Tversky’s Prospect Theoretic Integrity Preserving Alignment (KT-IPA), a framework that prioritizes maximizing generative utility rather than a singular optimization metric. This approach strengthens LLMs against misuse and weaponization while maintaining high performance, even after extensive fine-tuning. Our results demonstrate that integrating prospect theory into LLM training enhances robustness, security, and responsible innovation in this rapidly evolving field. Our codes are available on https://anonymous.4open.science/r/KT-IPA-40B7
pdf
bib
abs
Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks
Qintong Li
|
Leyang Cui
|
Lingpeng Kong
|
Wei Bi
Previous work adopts large language models (LLMs) as evaluators to evaluate natural language processing (NLP) tasks. However, certain shortcomings, e.g., fairness, scope, and accuracy, persist for current LLM evaluators. To analyze whether LLMs can serve as reliable alternatives to humans, we examine the fine-grained alignment between LLM evaluators and human annotators, particularly in understanding the target evaluation tasks and conducting evaluations that meet diverse criteria. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning), each with different evaluation criteria. Our analysis shows that 1) LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts. 2) LLM evaluators excel in general criteria, such as fluency, but face challenges with complex criteria, such as numerical reasoning. We also find that LLM-pre-drafting before human evaluation can help reduce the impact of human subjectivity and minimize annotation outliers in pure human evaluation, leading to more objective evaluation. All resources are available at https://github.com/qtli/CoEval.
pdf
bib
abs
Dynamics of Instruction Fine-Tuning for Chinese Large Language Models
Chiyu Song
|
Zhanchao Zhou
|
Jianhao Yan
|
Yuejiao Fei
|
Zhenzhong Lan
|
Yue Zhang
Instruction tuning is a burgeoning method to elicit the general intelligence of Large Language Models (LLMs). While numerous studies have examined the impact of factors such as data volume and model size on English models, the scaling properties of instruction tuning in other languages remain largely unexplored. In this work, we systematically investigate the effects of data quantity, model size, and data construction methods on instruction tuning for Chinese LLMs. We utilize a newly curated dataset, DoIT, which includes over 40,000 high-quality instruction instances covering ten underlying abilities, such as creative writing, code generation, and logical reasoning. Our experiments, conducted on models ranging from 7b to 33b parameters, yield three key findings: (i) While these factors directly affect overall model performance, some abilities are more responsive to scaling, whereas others demonstrate significant resistance. (ii) The scaling sensitivity of different abilities to these factors can be explained by two features: Complexity and Transference. (iii) By tailoring training strategies to their varying sensitivities, specific abilities can be efficiently learned, enhancing performance on two public benchmarks.
pdf
bib
abs
Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre
Florian Debaene
|
Aaron Maladry
|
Els Lefever
|
Veronique Hoste
This paper explores the effectiveness of two types of transformer models — large generative models and sequence-to-sequence models — for automatically post-correcting Optical Character Recognition (OCR) output in early modern Dutch plays. To address the need for optimally aligned data, we create a parallel dataset based on the OCRed and ground truth versions from the EmDComF corpus using state-of-the-art alignment techniques. By combining character-based and semantic methods, we design and release a qualitative OCR-to-gold parallel dataset, selecting the alignment with the lowest Character Error Rate (CER) for all alignment pairs. We then fine-tune and evaluate five generative models and four sequence-to-sequence models on the OCR post-correction dataset. Results show that sequence-to-sequence models generally outperform generative models in this task, correcting more OCR errors and overgenerating and undergenerating less, with mBART as the best performing system.
pdf
bib
abs
BANER: Boundary-Aware LLMs for Few-Shot Named Entity Recognition
Quanjiang Guo
|
Yihong Dong
|
Ling Tian
|
Zhao Kang
|
Yu Zhang
|
Sijie Wang
Despite the recent success of two-stage prototypical networks in few-shot named entity recognition (NER), challenges such as over/under-detected false spans in the span detection stage and unaligned entity prototypes in the type classification stage persist. Additionally, LLMs have not proven to be effective few-shot information extractors in general. In this paper, we propose an approach called Boundary-Aware LLMs for Few-Shot Named Entity Recognition to address these issues. We introduce a boundary-aware contrastive learning strategy to enhance the LLM’s ability to perceive entity boundaries for generalized entity spans. Additionally, we utilize LoRAHub to align information from the target domain to the source domain, thereby enhancing adaptive cross-domain classification capabilities. Extensive experiments across various benchmarks demonstrate that our framework outperforms prior methods, validating its effectiveness. In particular, the proposed strategies demonstrate effectiveness across a range of LLM architectures. The code and data are released on https://github.com/UESTC-GQJ/BANER.
pdf
bib
abs
In-Context Reinforcement Learning with Retrieval-Augmented Generation for Text-to-SQL
Rishit Toteja
|
Arindam Sarkar
|
Prakash Mandayam Comar
Text-to-SQL simplifies database interactions by enabling non-experts to convert their natural language (NL) questions to Structured Query Language (SQL) queries. With advancements in Large Language Models (LLMs), in-context learning (ICL) has emerged as a popular choice for building Text-to-SQL systems. Real-world, industry-scale databases often comprise thousands of tables and hundreds of columns, making it prohibitively expensive to pass the entire schema as context to an LLM. This necessitates access to the correct database and the relevant set of tables. Recently, Retrieval-Augmented Generation (RAG) based methods have been proposed for retrieving a relevant subset of databases and tables for a given query. However, we observe that existing methods of synthetic query generation produce predominantly simple queries, which might not be sufficiently representative of complex, real-world queries, thus negatively affecting the quality of the generated SQL. To address this, we propose an innovative in-context reinforcement learning (ICRL) based framework which refines the question generation process by enhancing the model’s ability to produce intricate queries that practitioners may pose during inference. In contrast to existing approaches, our framework ensures the generation of synthetic SQL queries which are diverse and complex. We demonstrate the effectiveness of our approach via multiple experiments comparing against representative state-of-the-art models on public benchmark datasets and observe substantial improvements in performance and scalability. Our method achieves 15-20% higher recall in the database/table retrieval task compared to existing state-of-the-art models for schema identification and up to 2% higher execution accuracy for SQL generation.
pdf
bib
abs
ICLEval: Evaluating In-Context Learning Ability of Large Language Models
Wentong Chen
|
Yankai Lin
|
ZhenHao Zhou
|
HongYun Huang
|
YanTao Jia
|
Zhao Cao
|
Ji-Rong Wen
In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs) as it empowers them to comprehend and reason across interconnected inputs. Evaluating the ICL ability of LLMs can enhance their utilization and deepen our understanding of how this ability is acquired at the training stage. However, existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability. In this work, we introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompasses two key sub-abilities: exact copying and rule learning. Through the ICLEval benchmark, we demonstrate that ICL ability is universally present in different LLMs, and model size is not the sole determinant of ICL efficacy. Surprisingly, we observe that ICL abilities, particularly copying, develop early in the pretraining process and stabilize afterward.
pdf
bib
abs
VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models
Haowen Hou
|
Peigen Zeng
|
Fei Ma
|
Fei Richard Yu
Visual Language Models (VLMs) have rapidly progressed with the recent success of large language models. However, there have been few attempts to incorporate efficient linear Recurrent Neural Networks (RNNs) architectures into VLMs. In this study, we introduce VisualRWKV, the first application of a linear RNN model to multimodal learning tasks, leveraging the pre-trained RWKV language model. We propose a data-dependent recurrence and sandwich prompts to enhance our modeling capabilities, along with a 2D image scanning mechanism to enrich the processing of visual sequences. Extensive experiments demonstrate that VisualRWKV achieves competitive performance compared to Transformer-based models like LLaVA-1.5 on various benchmarks. Compared to LLaVA-1.5, VisualRWKV has a speed advantage of 3.98 times and can save 54% of GPU memory when reaching an inference length of 24K tokens. To facilitate further research and analysis, we have made the checkpoints and the associated code publicly accessible at the following GitHub repository: https://github.com/howard-hou/VisualRWKV.
pdf
bib
abs
Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question Answering Benchmark
Zhikun Xu
|
Yinghui Li
|
Ruixue Ding
|
Xinyu Wang
|
Boli Chen
|
Yong Jiang
|
Haitao Zheng
|
Wenlian Lu
|
Pengjun Xie
|
Fei Huang
How to better evaluate the capabilities of Large Language Models (LLMs) is the focal point and hot topic in current LLMs research. Previous work has noted that due to the extremely high cost of iterative updates of LLMs, they are often unable to answer the latest dynamic questions well. To promote the improvement of Chinese LLMs’ ability to answer dynamic questions, in this paper, we introduce CDQA, a Chinese Dynamic QA benchmark containing question-answer pairs related to the latest news on the Chinese Internet. We obtain high-quality data through a pipeline that combines humans and models, and carefully classify the samples according to the frequency of answer changes to facilitate a more fine-grained observation of LLMs’ capabilities. We have also evaluated and analyzed mainstream and advanced Chinese LLMs on CDQA. Extensive experiments and valuable insights suggest that our proposed CDQA is challenging and worthy of more further study. We believe that the benchmark we provide will become one of the key data resources for improving LLMs’ Chinese question-answering ability in the future.
pdf
bib
abs
Making Task-Oriented Dialogue Datasets More Natural by Synthetically Generating Indirect User Requests
Amogh Mannekote
|
Jinseok Nam
|
Ziming Li
|
Kristy Elizabeth Boyer
|
Bonnie J. Dorr
Indirect User Requests (IURs), such as “It’s cold in here” instead of “Could you please increase the temperature?” are common in human-human task-oriented dialogue and require world knowledge and pragmatic reasoning from the listener. While large language models (LLMs) can handle these requests effectively, smaller models deployed on virtual assistants often struggle due to resource constraints. Moreover, existing task-oriented dialogue benchmarks lack sufficient examples of complex discourse phenomena such as indirectness. To address this, we propose a set of linguistic criteria along with an LLM-based pipeline for generating realistic IURs to test natural language understanding (NLU) and dialogue state tracking (DST) models before deployment in a new domain. We also release IndirectRequests, a dataset of IURs based on the Schema-Guided Dialogue (SGD) corpus, as a comparative testbed for evaluating the performance of smaller models in handling indirect requests.
pdf
bib
abs
Consistency Rating of Semantic Transparency: an Evaluation Method for Metaphor Competence in Idiom Understanding Tasks
Hui Gao
|
Jing Zhang
|
Peng Zhang
|
Chang Yang
Idioms condense complex semantics into fixed phrases, and their meaning is often not directly connected to the literal meaning of their constituent words, making idiom comprehension a test of metaphor competence. Metaphor is a core human cognitive process, yet there is still no effective method for evaluating the metaphor competence of LLMs (Large Language Models). In this paper, we propose a method to evaluate the metaphor competence of LLMs on the idiom understanding task: the Consistency Rating of Semantic Transparency (CR-ST). This strategy assesses the difficulty of understanding idioms through two dimensions: overall semantic transparency and constituent semantic transparency, aiming to gauge LLMs’ mastery of metaphor competence. Subsequently, we introduce a prompt mechanism, the Paraphrase Augmentation Strategy with Self-checking (PASS), based on human language logic, which guides the model to enhance its metaphor competence by explicitly generating idiom paraphrases. We conducted a baseline evaluation of seven LLMs on the CINLID and ChID datasets and analyzed the effectiveness of PASS on different subsets of semantic transparency. The experimental results demonstrate that LLMs can achieve performance comparable to PLMs (Pre-trained Language Models) without additional training, and that PASS has a positive effect on the metaphor competence of LLMs.
pdf
bib
abs
KG-FPQ: Evaluating Factuality Hallucination in LLMs with Knowledge Graph-based False Premise Questions
Yanxu Zhu
|
Jinlin Xiao
|
Yuhang Wang
|
Jitao Sang
Recent studies have demonstrated that large language models (LLMs) are susceptible to being misled by false premise questions (FPQs), leading to errors in factual knowledge, known as factuality hallucination. Existing benchmarks that assess this vulnerability primarily rely on manual construction, resulting in limited size and lack of expandability. In this work, we introduce an automated, scalable pipeline to create FPQs based on knowledge graphs (KGs). The first step is to modify true triplets extracted from KGs to create false premises. Subsequently, utilizing the state-of-the-art capabilities of GPTs, we generate semantically rich FPQs. Based on the proposed method, we present a comprehensive benchmark, the Knowledge Graph-based False Premise Questions (KG-FPQ), which contains approximately 178k FPQs across three knowledge domains, at six levels of confusability, and in two task formats. Using KG-FPQ, we conduct extensive evaluations on several representative LLMs and provide valuable insights. The KG-FPQ dataset and code are available at https://github.com/yanxuzhu/KG-FPQ.
pdf
bib
abs
IberoBench: A Benchmark for LLM Evaluation in Iberian Languages
Irene Baucells
|
Javier Aula-Blasco
|
Iria de-Dios-Flores
|
Silvia Paniagua Suárez
|
Naiara Perez
|
Anna Salles
|
Susana Sotelo Docio
|
Júlia Falcão
|
Jose Javier Saiz
|
Robiert Sepulveda Torres
|
Jeremy Barnes
|
Pablo Gamallo
|
Aitor Gonzalez-Agirre
|
German Rigau
|
Marta Villegas
The current best practice to measure the performance of base Large Language Models is to establish a multi-task benchmark that covers a range of capabilities of interest. Currently, however, such benchmarks are only available in a few high-resource languages. To address this situation, we present IberoBench, a multilingual, multi-task benchmark for Iberian languages (i.e., Basque, Catalan, Galician, European Spanish and European Portuguese) built on the LM Evaluation Harness framework. The benchmark consists of 62 tasks divided into 179 subtasks. We evaluate 33 existing LLMs on IberoBench on 0- and 5-shot settings. We also explore the issues we encounter when working with the Harness and our approach to solving them to ensure high-quality evaluation.
pdf
bib
abs
Efficient Architectures for High Resolution Vision-Language Models
Miguel Carvalho
|
Bruno Martins
Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves a high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.
pdf
bib
abs
NCRE: A Benchmark for Document-level Nominal Compound Relation Extraction
Jincheng Cao
|
Bobo Li
|
Jiang Liu
|
Donghong Ji
Entity and relation extraction is a conventional task in the field of information extraction. Existing work primarily focuses on detecting specific relations between entities, often constrained to particular fields and lacking general applicability. In response, we propose a novel task: nominal compound relation extraction (NCRE), which concentrates on abstract and broadly applicable relation extraction between noun phrases. This task diverges significantly from traditional entity and relation extraction in two key respects. Firstly, our task involves general nominal compounds rather than named entities, which are longer and encompass a broader scope, presenting significant challenges for extraction. Secondly, relation extraction in NCRE demands an in-depth understanding of context to detect abstract relations. We manually annotate a high-quality Chinese dataset for the NCRE task and develop a model incorporating the rotary position-enhanced word pair (RoWP) detection schema. Experimental results demonstrate the efficiency of our RoWP model over previous baselines, while the suboptimal F1 scores indicate that NCRE remains a challenging task. Our code and data are available at https://github.com/yeecjc/NCRE.
pdf
bib
abs
Comet: Dialog Context Fusion Mechanism for End-to-End Task-Oriented Dialog with Multi-task Learning
Haipeng Sun
|
Junwei Bao
|
Youzheng Wu
|
Xiaodong He
Existing end-to-end task-oriented dialog systems often encounter challenges arising from implicit information, coreference, and the presence of noisy and irrelevant data within the dialog context. These issues hinder the system’s ability to fully comprehend critical information and lead to inaccurate responses. To address these concerns, we propose Comet, a dialog context fusion mechanism for end-to-end task-oriented dialog, augmented with three supplementary tasks: dialog summarization, domain prediction, and slot detection. Dialog summarization facilitates a more comprehensive understanding of important dialog context information by Comet. Domain prediction enables Comet to concentrate on domain-specific information, thus reducing interference from irrelevant information. Slot detection empowers Comet to accurately identify and comprehend essential dialog context information. Additionally, we introduce a data refinement strategy to enhance the comprehensiveness and recommendability of the generated responses. Experimental results demonstrate the superior performance of our proposed methods compared to existing end-to-end task-oriented dialog systems, achieving state-of-the-art results on the MultiWOZ and CrossWOZ datasets.
pdf
bib
abs
Counterfactual Debating with Preset Stances for Hallucination Elimination of LLMs
Yi Fang
|
Moxin Li
|
Wenjie Wang
|
Lin Hui
|
Fuli Feng
Large Language Models (LLMs) excel in various natural language processing tasks but struggle with hallucination issues. Existing solutions have considered utilizing LLMs’ inherent reasoning abilities to alleviate hallucination, such as self-correction and diverse sampling methods. However, these methods often overtrust LLMs’ initial answers due to inherent biases. The key to alleviating this issue lies in overriding LLMs’ inherent biases for answer inspection. To this end, we propose a CounterFactual Multi-Agent Debate (CFMAD) framework. CFMAD presets the stances of LLMs to override their inherent biases by compelling LLMs to generate justifications for a predetermined answer’s correctness. The LLMs with different predetermined stances are engaged with a skeptical critic for counterfactual debate on the rationality of generated justifications. Finally, the debate process is evaluated by a third-party judge to determine the final answer. Extensive experiments on four datasets of three tasks demonstrate the superiority of CFMAD over existing methods.
pdf
bib
abs
Extracting the Essence and Discarding the Dross: Enhancing Code Generation with Contrastive Execution Feedback
Xuanyu Zhang
|
Qing Yang
Recent advancements have integrated the execution process and feedback into the training of large language models for code generation, demonstrating enhanced model performance. However, current methods amalgamate erroneous code with feedback and the final correct code as target sentences, inadvertently increasing the probability of generating both correct and incorrect code during inference. While multiple iterations of feedback can eventually yield the correct answer, this iterative process is cumbersome and time-consuming for users who prefer immediate accurate results. To address this challenge, we propose ConCoder, a contrastive learning-based code generation model with execution feedback. This approach enables the model to efficiently produce accurate code from the outset while rectifying and optimizing the incorrect code. Furthermore, our training emphasizes learning from the causes of errors, allowing the model to understand and avoid mistakes. Through extensive experiments, ConCoder demonstrates significant improvements in generating accurate code and understanding error correction, paving the way for more reliable code generation models.
pdf
bib
abs
From Facts to Insights: A Study on the Generation and Evaluation of Analytical Reports for Deciphering Earnings Calls
Tomas Goldsack
|
Yang Wang
|
Chenghua Lin
|
Chung-Chi Chen
This paper explores the use of Large Language Models (LLMs) in the generation and evaluation of analytical reports derived from Earnings Calls (ECs). Addressing a current gap in research, we explore the generation of analytical reports with LLMs in a multi-agent framework, designing specialized agents that introduce diverse viewpoints and desirable topics of analysis into the report generation process. Through multiple analyses, we examine the alignment between generated and human-written reports and the impact of both individual and collective agents. Our findings suggest that the introduction of additional agents results in more insightful reports, although reports generated by human experts remain preferred in the majority of cases. Finally, we address the challenging issue of report evaluation, examining the limitations and strengths of LLMs in assessing the quality of generated reports in different settings and revealing a significant correlation with human experts across multiple dimensions.
pdf
bib
abs
Leveraging LLM-Generated Schema Descriptions for Unanswerable Question Detection in Clinical Data
Donghee Han
|
Seungjae Lim
|
Daeyoung Roh
|
Sangryul Kim
|
Sehyun Kim
|
Mun Yong Yi
Recent advancements in large language models (LLMs) have boosted research on generating SQL queries from domain-specific questions, particularly in the medical domain. A key challenge is detecting and filtering unanswerable questions. Existing methods often rely on model uncertainty, but these require extra resources and lack interpretability. We propose a lightweight model that predicts relevant database schemas to detect unanswerable questions, enhancing interpretability and addressing the data imbalance in binary classification tasks. Furthermore, we found that LLM-generated schema descriptions can significantly enhance the prediction accuracy. Our method provides a resource-efficient solution for unanswerable question detection in domain-specific question answering systems.
pdf
bib
abs
Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models
Hongchuan Zeng
|
Senyu Han
|
Lu Chen
|
Kai Yu
Large language models (LLMs) have demonstrated remarkable performance, particularly in multilingual contexts. While recent studies suggest that LLMs can transfer skills learned in one language to others, the internal mechanisms behind this ability remain unclear. We observed that the neuron activation patterns of LLMs exhibit similarities when processing the same language, revealing the existence and location of key linguistic regions. Additionally, we found that neuron activation patterns are similar when processing sentences with the same semantic meaning in different languages. This indicates that LLMs map semantically identical inputs from different languages into a “Lingua Franca”, a common semantic latent space that allows for consistent processing across languages. This semantic alignment becomes more pronounced with training and increased model size, resulting in a more language-agnostic activation pattern. Moreover, we found that key linguistic neurons are concentrated in the first and last layers of LLMs, becoming denser in the first layers as training progresses. Experiments on BLOOM and LLaMA2 support these findings, highlighting the structural evolution of multilingual LLMs during training and scaling up. This paper provides insights into the internal workings of LLMs, offering a foundation for future improvements in their cross-lingual capabilities.
pdf
bib
abs
Understanding Token Probability Encoding in Output Embeddings
Hakaze Cho
|
Yoshihiro Sakai
|
Kenshiro Tanaka
|
Mariko Kato
|
Naoya Inoue
In this paper, we investigate the output token probability information in the output embedding of language models. We find an approximate common log-linear encoding of output token probabilities within the output embedding vectors and empirically demonstrate that it is accurate and sparse. As a causality examination, we steer the encoding in output embedding to modify the output probability distribution accurately. Moreover, the sparsity we find in output probability encoding suggests that a large number of dimensions in the output embedding do not contribute to causal language modeling. Therefore, we attempt to delete the output-unrelated dimensions and find more than 30% of the dimensions can be deleted without significant movement in output distribution and sequence generation. Additionally, in the pre-training dynamics of language models, we find that the output embeddings capture the corpus token frequency information in early steps, even before an obvious convergence of parameters starts.
pdf
bib
abs
Investigating Bias in LLM-Based Bias Detection: Disparities between LLMs and Human Perception
Luyang Lin
|
Lingzhi Wang
|
Jinsong Guo
|
Kam-Fai Wong
The pervasive spread of misinformation and disinformation in social media underscores the critical importance of detecting media bias. While robust Large Language Models (LLMs) have emerged as foundational tools for bias prediction, concerns about inherent biases within these models persist. In this work, we investigate the presence and nature of bias within LLMs and its consequential impact on media bias detection. Departing from conventional approaches that focus solely on bias detection in media content, we delve into biases within the LLM systems themselves. Through meticulous examination, we probe whether LLMs exhibit biases, particularly in political bias prediction and text continuation tasks. Additionally, we explore bias across diverse topics, aiming to uncover nuanced variations in bias expression within the LLM framework. Importantly, we propose debiasing strategies, including prompt engineering and model fine-tuning. Extensive analysis of bias tendencies across different LLMs sheds light on the broader landscape of bias propagation in language models. This study advances our understanding of LLM bias, offering critical insights into its implications for bias detection tasks and paving the way for more robust and equitable AI systems.
pdf
bib
abs
Evaluating the Consistency of LLM Evaluators
Noah Lee
|
Jiwoo Hong
|
James Thorne
Large language models (LLMs) have shown potential as general evaluators along with the evident benefits of speed and cost. While their correlation against human annotators has been widely studied, consistency as evaluators is still understudied, raising concerns about the reliability of LLM evaluators. In this paper, we conduct extensive studies on the two aspects of consistency in LLM evaluations, Self-Consistency (SC) and Inter-scale Consistency (IC), on different scoring scales and criterion granularity with open-source and proprietary models. Our comprehensive analysis demonstrates that strong proprietary models are not necessarily consistent evaluators, highlighting the importance of considering consistency in assessing the capability of LLM evaluators.
pdf
bib
abs
MDPO: Customized Direct Preference Optimization with a Metric-based Sampler for Question and Answer Generation
Yihang Wang
|
Bowen Tian
|
Yueyang Su
|
Yixing Fan
|
Jiafeng Guo
With the extensive use of large language models, automatically generating QA datasets for domain-specific fine-tuning has become crucial. However, considering the multifaceted demands for readability, diversity, and comprehensiveness of QA data, current methodologies fall short in producing high-quality QA datasets. Moreover, the dependence of existing evaluation metrics on ground truth labels further exacerbates the challenges associated with the selection of QA data. In this paper, we introduce a novel method for QA data generation, denoted as MDPO. We propose a set of unsupervised evaluation metrics for QA data, enabling multidimensional assessment based on the relationships among context, question and answer. Furthermore, leveraging these metrics, we implement a customized direct preference optimization process that guides large language models to produce high-quality and domain-specific QA pairs. Empirical results on public datasets indicate that MDPO’s performance substantially surpasses that of state-of-the-art methods.
pdf
bib
abs
A Collaborative Reasoning Framework Powered by Reinforcement Learning and Large Language Models for Complex Questions Answering over Knowledge Graph
Zhiqiang Zhang
|
Wen Zhao
Knowledge Graph Question Answering (KGQA) aims to automatically answer natural language questions by reasoning across multiple triples in knowledge graphs (KGs). Reinforcement learning (RL)-based methods are introduced to enhance model interpretability. Nevertheless, when addressing complex questions requiring long-term reasoning, the RL agent is usually misled by aimless exploration, as it lacks common learning practices with prior knowledge. Recently, large language models (LLMs) have been proven to encode vast amounts of knowledge about the world and possess remarkable reasoning capabilities. However, they often encounter challenges with hallucination issues, failing to address complex questions that demand deep and deliberate reasoning. In this paper, we propose a collaborative reasoning framework (CRF) powered by RL and LLMs to answer complex questions based on the knowledge graph. Our approach leverages the common sense priors contained in LLMs while utilizing RL to provide learning from the environment, resulting in a hierarchical agent that uses LLMs to solve the complex KGQA task. By combining LLMs and the RL policy, the high-level agent accurately identifies constraints encountered during reasoning, while the low-level agent conducts efficient path reasoning by selecting the most promising relations in KG. Extensive experiments conducted on four benchmark datasets clearly demonstrate the effectiveness of the proposed model, which surpasses state-of-the-art approaches.
pdf
bib
abs
Scalability of Bayesian Network Structure Elicitation with Large Language Models: a Novel Methodology and Comparative Analysis
Nikolay Babakov
|
Ehud Reiter
|
Alberto Bugarín-Diz
In this work, we propose a novel method for Bayesian Network (BN) structure elicitation that is based on initializing several LLMs with different experiences, independently querying them to create a structure of the BN, and obtaining the final structure by majority voting. We compare the method with an alternative method on various BNs of different sizes, both widely known and less known, and study the scalability of both methods on them. We also propose an approach to check the contamination of BNs in LLMs, which shows that some widely known BNs are inapplicable for testing LLM usage for BN structure elicitation. We also show that some BNs may be inapplicable for such experiments because their node names are indistinguishable. The experiments on the other BNs show that our method performs better than the existing method with one of the three studied LLMs; however, the performance of both methods significantly decreases with the increase in BN size.
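A minimal sketch of the majority-voting step described above, assuming each independently queried LLM has already been mapped to a set of directed edges; the elicitation prompts themselves are outside this sketch, and the simple keep-if-majority threshold is an illustrative choice.

```python
# Combine several independently elicited BN structures by majority voting
# over directed edges. Each structure is a set of (parent, child) pairs.
from collections import Counter
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def majority_vote_structure(structures: List[Set[Edge]]) -> Set[Edge]:
    """Keep an edge iff it appears in more than half of the elicited structures."""
    counts = Counter(edge for s in structures for edge in s)
    threshold = len(structures) / 2
    return {edge for edge, c in counts.items() if c > threshold}


# Toy example: three LLM-elicited structures for a small medical BN.
votes = [
    {("Smoking", "Cancer"), ("Cancer", "Xray")},
    {("Smoking", "Cancer"), ("Cancer", "Dyspnoea")},
    {("Smoking", "Cancer"), ("Cancer", "Xray"), ("Pollution", "Cancer")},
]
print(majority_vote_structure(votes))  # {('Smoking', 'Cancer'), ('Cancer', 'Xray')}
```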
pdf
bib
abs
An LLM-based Framework for Biomedical Terminology Normalization in Social Media via Multi-Agent Collaboration
Yongqi Fan
|
Kui Xue
|
Zelin Li
|
Xiaofan Zhang
|
Tong Ruan
Biomedical Terminology Normalization aims to identify the standard term in a specified termbase for non-standardized mentions from social media or clinical texts, employing the mainstream “Recall and Re-rank” framework. Instead of the traditional pretraining-finetuning paradigm, we would like to explore the possibility of accomplishing this task through a tuning-free paradigm using powerful Large Language Models (LLMs), hoping to address the costs of re-training due to discrepancies of both standard termbases and annotation protocols. Another major obstacle in this task is that both mentions and terms are short texts. Short texts contain an insufficient amount of information that can introduce ambiguity, especially in a biomedical context. Therefore, besides using the advanced embedding model, we implement a Retrieval-Augmented Generation (RAG) based knowledge card generation module. This module introduces an LLM agent that expands the short texts into accurate, harmonized, and more informative descriptions using a search engine and a domain knowledge base. Furthermore, we present an innovative tuning-free agent collaboration framework for the biomedical terminology normalization task in social media. By leveraging the internal knowledge and the reasoning capabilities of LLM, our framework conducts more sophisticated recall, ranking and re-ranking processes with the collaboration of different LLM agents. Experimental results across multiple datasets indicate that our approach exhibits competitive performance. We release our code and data on the github repository JOHNNY-fans/RankNorm.
pdf
bib
abs
Driving Chinese Spelling Correction from a Fine-Grained Perspective
Linfeng Liu
|
Hongqiu Wu
|
Hai Zhao
This paper explores the task of Chinese spelling correction (CSC) from a fine-grained perspective, recognizing that existing evaluations lack a nuanced typology for spelling errors. This deficiency can create a misleading impression of model performance, incurring an “invisible” bottleneck hindering the advancement of CSC research. In this paper, we first categorize spelling errors into six types and conduct a fine-grained evaluation across a wide variety of models, including BERT-based models and LLMs. Thus, we are able to pinpoint the underlying weaknesses of existing state-of-the-art models - utilizing contextual clues and handling the co-existence of multiple typos, associated with contextual errors and multi-typo errors. However, these errors occur infrequently in conventional training corpora. Therefore, we introduce new error generation methods to augment their occurrence, which can be leveraged to enhance the training of CSC models. We hope this work can provide fresh insight for future CSC research.
pdf
bib
abs
LAiW: A Chinese Legal Large Language Models Benchmark
Yongfu Dai
|
Duanyu Feng
|
Jimin Huang
|
Haochen Jia
|
Qianqian Xie
|
Yifang Zhang
|
Weiguang Han
|
Wei Tian
|
Hao Wang
General and legal domain LLMs have demonstrated strong performance in various tasks of LegalAI. However, their current evaluations lack alignment with the fundamental logic of legal reasoning, the legal syllogism. This hinders trust and understanding from legal experts. To bridge this gap, we introduce LAiW, the Chinese legal LLM benchmark structured around the legal syllogism. We evaluate legal LLMs across three levels of capability, each reflecting a progressively more complex stage of legal syllogism: fundamental information retrieval, legal principles inference, and advanced legal applications, and encompassing a wide range of tasks in different legal scenarios. Our automatic evaluation reveals that LLMs, despite their ability to answer complex legal questions, lack the inherent logical processes of the legal syllogism. This limitation poses a barrier to acceptance by legal professionals. Furthermore, manual evaluation with legal experts confirms this issue and highlights the importance of pre-training on legal text to enhance the legal syllogism of LLMs. Future research may prioritize addressing this gap to unlock the full potential of LLMs in legal applications.
pdf
bib
abs
Retrieval-Augmented Generation for Large Language Model based Few-shot Chinese Spell Checking
Ming Dong
|
Zhiwei Cheng
|
Changyin Luo
|
Tingting He
Large language models (LLMs) are naturally suitable for the Chinese spelling check (CSC) task in few-shot scenarios due to their powerful semantic understanding and few-shot learning capabilities. Recent CSC research has begun to use LLMs as foundational models. However, most current datasets are primarily focused on errors generated during the text generation process, with little attention given to errors occurring in the modal conversion process. Furthermore, existing LLM-based CSC methods often rely on fixed prompt samples, which limits the performance of LLMs. Therefore, we propose a framework named RagID (Retrieval-Augmented Generation and Iterative Discriminator Strategy). By utilizing semantic-based similarity search and an iterative discriminator mechanism, RagID can provide well-chosen prompt samples and reduce over-correction issues in LLM-based CSC. RagID demonstrates excellent effectiveness in few-shot scenarios. We conducted comprehensive experiments, and the results show that RagID achieves the best performance on datasets that include data from multiple domains and datasets containing modal conversion spelling errors. The dataset and method are available online.
pdf
bib
abs
GADFA: Generator-Assisted Decision-Focused Approach for Opinion Expressing Timing Identification
Chung-Chi Chen
|
Hiroya Takamura
|
Ichiro Kobayashi
|
Yusuke Miyao
|
Hsin-Hsi Chen
The advancement of text generation models has granted us the capability to produce coherent and convincing text on demand. Yet, in real-life circumstances, individuals do not continuously generate text or voice their opinions. For instance, consumers pen product reviews after weighing the merits and demerits of a product, and professional analysts issue reports following significant news releases. In essence, opinion expression is typically prompted by particular reasons or signals. Despite long-standing developments in opinion mining, the appropriate timing for expressing an opinion remains largely unexplored. To address this deficit, our study introduces an innovative task - the identification of news-triggered opinion expressing timing. We ground this task in the actions of professional stock analysts and develop a novel dataset for investigation. Our Generator-Assisted Decision-Focused Approach (GADFA) is decision-focused, leveraging text generation models to steer the classification model, thus enhancing overall performance. Our experimental findings demonstrate that the text generated by our model contributes fresh insights from various angles, effectively aiding in identifying the optimal timing for opinion expression.
pdf
bib
abs
Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs
Yu Xia
|
Rui Wang
|
Xu Liu
|
Mingyan Li
|
Tong Yu
|
Xiang Chen
|
Julian McAuley
|
Shuai Li
Chain-of-Thought (CoT) has been a widely adopted prompting method, eliciting impressive reasoning abilities of Large Language Models (LLMs). Inspired by the sequential thought structure of CoT, a number of Chain-of-X (CoX) methods have been developed to address challenges across diverse domains and tasks. In this paper, we provide a comprehensive survey of Chain-of-X methods for LLMs in different contexts. Specifically, we categorize them by taxonomies of nodes, i.e., the X in CoX, and application tasks. We also discuss the findings and implications of existing CoX methods, as well as potential future directions. Our survey aims to serve as a detailed and up-to-date resource for researchers seeking to apply the idea of CoT to broader scenarios.
pdf
bib
abs
Interpreting Topic Models in Byte-Pair Encoding Space
Jia Peng Lim
|
Hady Lauw
Byte-pair encoding (BPE) is pivotal for processing text into chunksize tokens, particularly in Large Language Models (LLMs). From a topic modeling perspective, as these chunksize tokens might be mere parts of valid words, evaluating and interpreting these tokens for coherence is challenging. Most, if not all, coherence evaluation measures are incompatible as they benchmark using valid words. We propose to interpret the recovery of valid words from these tokens as a ranking problem and present a model-agnostic and training-free recovery approach from the topic-token distribution onto a selected vocabulary space, following which we could apply existing evaluation measures. Results show that topic sets recovered from BPE vocabulary space are coherent.
pdf
bib
abs
SUMIE: A Synthetic Benchmark for Incremental Entity Summarization
Eunjeong Hwang
|
Yichao Zhou
|
Beliz Gunel
|
James Bradley Wendt
|
Sandeep Tata
No existing dataset adequately tests how well language models can incrementally update entity summaries – a crucial ability as these models rapidly advance. The Incremental Entity Summarization (IES) task is vital for maintaining accurate, up-to-date knowledge. To address this, we introduce SUMIE, a fully synthetic dataset designed to expose real-world IES challenges. This dataset addresses issues like incorrect entity association and incomplete information, capturing real-world complexity by generating diverse attributes, summaries, and unstructured paragraphs with 99% alignment accuracy between generated summaries and paragraphs. Extensive experiments demonstrate the dataset’s difficulty – state-of-the-art LLMs struggle to update summaries with an F1 higher than 80.4%. We will open-source the benchmark and the evaluation metrics to help the community make progress on IES tasks.
pdf
bib
abs
Text-Attributed Graph Learning with Coupled Augmentations
Chuang Zhou
|
Jiahe Du
|
Huachi Zhou
|
Hao Chen
|
Feiran Huang
|
Xiao Huang
Modeling text-attributed graphs is a well-known problem due to the difficulty of capturing both the text attribute and the graph structure effectively. Existing models often focus on either the text attribute or the graph structure, potentially neglecting the other aspect. This is primarily because both text learning and graph learning models require significant computational resources, making it impractical to directly connect these models in a series. However, there are situations where text-learning models correctly classify text-attributed nodes, while graph-learning models may classify them incorrectly, and vice versa. To fully leverage the potential of text-attributed graphs, we propose a Coupled Text-attributed Graph Learning (CTGL) framework that combines the strengths of both text-learning and graph-learning models in parallel and avoids the computational cost of serially connecting the two aspect models. Specifically, CTGL introduces coupled text-graph augmentation to enable coupled contrastive learning and facilitate the exchange of valuable information between text learning and graph learning. Experimental results on diverse datasets demonstrate the superior performance of our model compared to state-of-the-art text-learning and graph-learning baselines.
pdf
bib
abs
From Chaotic OCR Words to Coherent Document: A Fine-to-Coarse Zoom-Out Network for Complex-Layout Document Image Translation
Zhiyang Zhang
|
Yaping Zhang
|
Yupu Liang
|
Lu Xiang
|
Yang Zhao
|
Yu Zhou
|
Chengqing Zong
Document Image Translation (DIT) aims to translate documents in images from one language to another. It requires visual layouts and textual contents understanding, as well as document coherence capturing. However, current methods often rely on the quality of OCR output, which, particularly in complex-layout scenarios, frequently loses the crucial document coherence, leading to chaotic text. To overcome this problem, we introduce a novel end-to-end network, named Zoom-out DIT (ZoomDIT), inspired by human translation procedures. It jointly accomplishes the multi-level tasks including word positioning, sentence recognition & translation, and document organization, based on a fine-to-coarse zoom-out framework, to progressively realize “chaotic words to coherent document” and improve translation. We further contribute a new large-scale DIT dataset with multi-level fine-grained labels. Extensive experiments on public and our new dataset demonstrate significant improvements in translation quality towards complex-layout document images, offering a robust solution for reorganizing the chaotic OCR outputs to a coherent document translation.
pdf
bib
abs
MESAQA: A Dataset for Multi-Span Contextual and Evidence-Grounded Question Answering
Jui-I Wang
|
Hen-Hsen Huang
|
Hsin-Hsi Chen
We introduce MESAQA, a novel dataset focusing on multi-span contextual understanding question answering (QA). Unlike traditional single-span QA systems, questions in our dataset consider information from multiple spans within the context document. MESAQA supports evidence-grounded QA, demanding the model’s capability of answer generation and multi-evidence identification. Our automated dataset creation method leverages the MASH-QA dataset and large language models (LLMs) to ensure that each Q/A pair requires considering all selected spans. Experimental results show that current models struggle with multi-span contextual QA, underscoring the need for new approaches. Our dataset sets a benchmark for this emerging QA paradigm, promoting research in complex information retrieval and synthesis.
pdf
bib
abs
Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition
Yuming Yang
|
Wantong Zhao
|
Caishuang Huang
|
Junjie Ye
|
Xiao Wang
|
Huiyuan Zheng
|
Yang Nan
|
Yuran Wang
|
Xueying Xu
|
Kaixin Huang
|
Yunke Zhang
|
Tao Gui
|
Qi Zhang
|
Xuanjing Huang
Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets neglects their inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain adaptation. To address this, we present B2NERD, a compact dataset designed to guide LLMs’ generalization in Open NER under a universal entity taxonomy. B2NERD is refined from 54 existing English and Chinese datasets using a two-step process. First, we detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using a data pruning strategy that selects fewer samples with greater category and semantic diversity. Comprehensive evaluation shows that B2NERD significantly enhances LLMs’ Open NER capabilities. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages. The data, models, and code are publicly available at https://github.com/UmeanNever/B2NER.
pdf
bib
abs
Get Confused Cautiously: Textual Sequence Memorization Erasure with Selective Entropy Maximization
Zhaohan Zhang
|
Ziquan Liu
|
Ioannis Patras
Large Language Models (LLMs) have been found to memorize and recite some of the textual sequences from their training set verbatim, raising broad concerns about privacy and copyright issues. This Textual Sequence Memorization (TSM) phenomenon leads to a high demand to regulate LLM output to prevent generating certain memorized text that a user wants to be forgotten. However, our empirical study reveals that existing methods for TSM erasure fail to unlearn large numbers of memorized samples without substantially jeopardizing the model utility. To achieve a better trade-off between the effectiveness of TSM erasure and model utility in LLMs, our paper proposes a new method, named Entropy Maximization with Selective Optimization (EMSO), where the model parameters are updated sparsely based on novel optimization and selection criteria, in a manner that does not require additional models or data other than that in the forget set. More specifically, we propose an entropy-based loss that is shown to lead to more stable optimization and better preserves model utility than existing methods. In addition, we propose a contrastive gradient metric that takes both the gradient magnitude and direction into consideration, so as to localize model parameters to update in a sparse model updating scheme. Extensive experiments across three model scales demonstrate that our method excels in handling large-scale forgetting requests while preserving model ability in language generation and understanding.
pdf
bib
abs
Re-Examine Distantly Supervised NER: A New Benchmark and a Simple Approach
Yuepei Li
|
Kang Zhou
|
Qiao Qiao
|
Qing Wang
|
Qi Li
Distantly-Supervised Named Entity Recognition (DS-NER) uses knowledge bases or dictionaries for annotations, reducing manual effort but relying on large human-labeled validation sets. In this paper, we introduce a real-life DS-NER dataset, QTL, where the training data is annotated using domain dictionaries and the test data is annotated by domain experts. This dataset has a small validation set, reflecting real-life scenarios. Existing DS-NER approaches fail when applied to QTL, which motivates us to re-examine existing DS-NER approaches. We found that many of them rely on large validation sets and some inappropriately used the test set for tuning. To solve this issue, we propose a new approach, token-level Curriculum-based Positive-Unlabeled Learning (CuPUL), which uses curriculum learning to order training samples from easy to hard. This method stabilizes training, making it robust and effective on small validation sets. CuPUL also addresses false negative issues using the Positive-Unlabeled learning paradigm, demonstrating improved performance in real-life applications.
pdf
bib
abs
BinarySelect to Improve Accessibility of Black-Box Attack Research
Shatarupa Ghosh
|
Jonathan Rusert
Adversarial text attack research is useful for testing the robustness of NLP models; however, the rise of transformers has greatly increased the time required to test attacks, especially when researchers do not have access to adequate resources (e.g., GPUs). This can hinder attack research, as modifying one example for an attack can require hundreds of queries to a model, especially for black-box attacks. Often these attacks remove one token at a time to find the ideal one to change, requiring n queries (the length of the text) right away. We propose a more efficient selection method called BinarySelect, which combines binary search and attack selection methods to greatly reduce the number of queries needed to find a token. We find that BinarySelect only needs log_2(n) * 2 queries to find the first token, compared to n queries. We also test BinarySelect in an attack setting against 5 classifiers across 3 datasets and find a viable tradeoff between the number of queries saved and attack effectiveness. For example, on the Yelp dataset, the number of queries is reduced by 32% (72 fewer) with a drop in attack effectiveness of only 5 points. We believe that BinarySelect can help future researchers study adversarial attacks and black-box problems more efficiently and opens the door for researchers with access to fewer resources.
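A minimal sketch of the binary-search selection idea, assuming a black-box `query(tokens)` callable that returns the victim model's confidence in the original label; the halving criterion and the toy classifier below are illustrative, not the paper's exact procedure.

```python
# Binary-search token selection for a black-box attack: narrow down to an
# influential token in roughly 2 * log2(n) model queries instead of n.
from typing import Callable, List


def binary_select(tokens: List[str], query: Callable[[List[str]], float]) -> int:
    """Return the index of an influential token.

    At each step the remaining candidate span is split in half; the half whose
    removal lowers the model's confidence more is assumed to contain the most
    influential token, and the search recurses into it.
    """
    lo, hi = 0, len(tokens)  # candidate span [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        conf_wo_left = query(tokens[:lo] + tokens[mid:])   # left half removed
        conf_wo_right = query(tokens[:mid] + tokens[hi:])  # right half removed
        if conf_wo_left <= conf_wo_right:
            hi = mid  # removing the left half hurts more: recurse left
        else:
            lo = mid  # removing the right half hurts more: recurse right
    return lo


if __name__ == "__main__":
    sentence = "the movie was surprisingly wonderful despite its slow start".split()
    # Toy stand-in for a victim classifier: confidence drops if "wonderful" is removed.
    toy_query = lambda toks: 0.9 if "wonderful" in toks else 0.4
    print(binary_select(sentence, toy_query))  # -> 4, the index of "wonderful"
```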
pdf
bib
abs
Interaction Matters: An Evaluation Framework for Interactive Dialogue Assessment on English Second Language Conversations
Rena Gao
|
Carsten Roever
|
Jey Han Lau
We present an evaluation framework for interactive dialogue assessment in the context of English as a Second Language (ESL) speakers. Our framework collects dialogue-level interactivity labels (e.g., topic management; 4 labels in total) and micro-level span features (e.g., backchannels; 17 features in total). Given our annotated data, we study how the micro-level features influence the (higher level) interactivity quality of ESL dialogues by constructing various machine learning-based models. Our results demonstrate that certain micro-level features strongly correlate with interactivity quality, like reference words (e.g., she, her, he), revealing new insights about the interaction between higher-level dialogue quality and lower-level fundamental linguistic signals. Our framework also provides a means to assess ESL communication, which is useful for language assessment.
pdf
bib
abs
Imposter: Text and Frequency Guidance for Subject Driven Action Personalization using Diffusion Models
Divya Kothandaraman
|
Kuldeep Kulkarni
|
Sumit Shekhar
|
Balaji Vasan Srinivasan
|
Dinesh Manocha
We present ImPoster, a novel algorithm for generating a target image of a ‘source’ subject performing a ‘driving’ action. The inputs to our algorithm are a single pair of a source image with the subject that we wish to edit and a driving image with a subject of an arbitrary class performing the driving action, along with the text descriptions of the two images. Our approach is completely unsupervised and does not require any access to additional annotations like keypoints or pose. Our approach builds on a pretrained text-to-image latent diffusion model and learns the characteristics of the source and the driving image by finetuning the diffusion model for a small number of iterations. At inference time, ImPoster performs step-wise text prompting i.e. it denoises by first moving in the direction of the image manifold corresponding to the driving image followed by the direction of the image manifold corresponding to the text description of the desired target image. We propose a novel diffusion guidance formulation, image frequency guidance, to steer the generation towards the manifold of the source subject and the driving action at every step of the inference denoising. Our frequency guidance formulations are derived from the frequency domain properties of images. We extensively evaluate ImPoster on a diverse set of source-driving image pairs to demonstrate improvements over baselines. To the best of our knowledge, ImPoster is the first approach towards achieving both subject-driven as well as action-driven image personalization.
pdf
bib
abs
FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema
Junru Lu
|
Siyu An
|
Min Zhang
|
Yulan He
|
Di Yin
|
Xing Sun
When carefully optimized by human experts, naive prompts can significantly enhance the task performance of large language models (LLMs). However, such expert-driven prompt optimizations are resource-intensive. To address this, some studies have proposed Automatic Prompt Optimization (APO), which refines naive prompts according to task outputs from in-box testing models, utilizing advanced LLMs (e.g., GPT-4) in an ad-hoc way. Although effective, current approaches face challenges in generalization and privacy risks. To overcome these limitations, we have developed the first large-scale Prompt Optimization Preference (POP) dataset, fine-tuned offline local LLM-based optimizers, and conducted fair evaluations across various downstream models. Our method, named Free-form Instruction-oriented Prompt Optimization (FIPO), allows precise optimization of the core task instructions in naive prompts in a model-agnostic manner. FIPO uses a modular APO template that dynamically incorporates the naive task instructions, optional instruction responses, and optional ground truth to produce refined prompts. The POP dataset is meticulously constructed using advanced LLMs, undergoing rigorous cross-validation by human experts and analytical models. By leveraging insights from this dataset, along with Tulu2 models and diverse fine-tuning strategies, we validate the efficacy of the FIPO framework across five public benchmarks and six testing models. Our dataset and codes are available at: https://github.com/LuJunru/FIPO_Project.
pdf
bib
abs
Context Filtering with Reward Modeling in Question Answering
Sangryul Kim
|
James Thorne
Question Answering (QA) in NLP is the task of finding answers to a query within a relevant context retrieved by a retrieval system. Yet, the mix of relevant and irrelevant information in these contexts can hinder performance enhancements in QA tasks. To address this, we introduce a context filtering approach that removes non-essential details, summarizing crucial content through Reward Modeling. This method emphasizes keeping vital data while omitting the extraneous during summarization model training. We offer a framework for developing efficient QA models by discerning useful information from dataset pairs, bypassing the need for costly human evaluation. Furthermore, we show that our approach can significantly outperform the baseline, as evidenced by a 6.8-fold increase in the EM Per Token (EPT) metric, which we propose as a measure of token efficiency, indicating a notable token-efficiency boost for low-resource settings.
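As a rough illustration of an EM-per-token style measure, the sketch below divides the total exact-match count by the number of context tokens consumed by the reader; the whitespace tokenization and field names are assumptions, and the paper's precise definition of EPT may differ.

```python
# Toy EM-per-token computation over a list of QA examples, each carrying the
# model prediction, the gold answer, and the (possibly filtered) context.
from typing import Dict, List


def em_per_token(examples: List[Dict[str, str]]) -> float:
    """Exact-match count normalized by total context tokens (higher = more token-efficient)."""
    em_total, token_total = 0, 0
    for ex in examples:
        em_total += int(ex["prediction"].strip() == ex["answer"].strip())
        token_total += len(ex["context"].split())  # whitespace tokens as a crude proxy
    return em_total / max(token_total, 1)


examples = [
    {"prediction": "Paris", "answer": "Paris", "context": "Paris is the capital of France."},
    {"prediction": "1969", "answer": "1969", "context": "Apollo 11 landed on the Moon in 1969."},
]
print(em_per_token(examples))  # EM of 2 spread over 14 context tokens
```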
pdf
bib
abs
Case2Code: Scalable Synthetic Data for Code Generation
Yunfan Shao
|
Linyang Li
|
Yichuan Ma
|
Peiji Li
|
Demin Song
|
Qinyuan Cheng
|
Shimin Li
|
Xiaonan Li
|
Pengyu Wang
|
Qipeng Guo
|
Hang Yan
|
Xipeng Qiu
|
Xuanjing Huang
|
Dahua Lin
Large Language Models (LLMs) have shown outstanding breakthroughs in code generation. Recent work improves code LLMs by training on synthetic data generated by some powerful LLMs, which can be challenging to scale due to the dependence on a teacher model and high generation costs. In this paper, we focus on synthesizing code data at scale and propose a Case2Code task by exploiting the expressiveness and correctness of programs. Case2Code is an inductive inference task that aims to infer underlying code implementations by observing input-output examples or program behaviors. By incorporating LLMs to generate program inputs, and executing the program with these inputs to obtain the program outputs, we can synthesize diverse and high-quality Case2Code data at scale for training and evaluating code LLMs. Experimental results show that case-to-code induction is challenging for current representative LLMs if they are untrained. Models trained with Case2Code improve performance not only on in-distribution case-to-code induction but also on various coding-generation tasks, demonstrating the great potential of large-scale synthetic data and inductive learning.
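A small sketch of the synthesis loop described above under simplifying assumptions: inputs are sampled randomly here (standing in for LLM-proposed inputs), the known program is executed to obtain ground-truth outputs, and the observed cases are paired with the reference implementation as one training instance.

```python
# Turn a known program into case-to-code training data: generate inputs,
# execute the program, and pack (cases, reference code) into one instance.
import inspect
import random
from typing import Any, Callable, Dict, List, Tuple


def synthesize_case2code(func: Callable[[int], Any], n_cases: int = 5) -> Dict[str, Any]:
    cases: List[Tuple[int, Any]] = []
    for _ in range(n_cases):
        x = random.randint(0, 100)     # stand-in for LLM-proposed inputs
        cases.append((x, func(x)))     # execute to obtain ground-truth outputs
    return {
        "cases": cases,                        # what the trained model observes
        "target_code": inspect.getsource(func),  # what it must induce
    }


def collatz_steps(n: int) -> int:
    """Reference program whose behavior the model should infer from cases."""
    steps = 0
    while n > 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


print(synthesize_case2code(collatz_steps))
```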
pdf
bib
abs
Chain-of-Discussion: A Multi-Model Framework for Complex Evidence-Based Question Answering
Mingxu Tao
|
Dongyan Zhao
|
Yansong Feng
Open-ended question answering requires models to find appropriate evidence to form well-reasoned, comprehensive and helpful answers. In practical applications, models also need to engage in extended discussions on potential scenarios closely relevant to the question. With the augmentation of a retrieval module, open-source Large Language Models (LLMs) can produce coherent answers often with different focuses, but are still sub-optimal in terms of reliable evidence selection and in-depth question analysis. In this paper, we propose a novel Chain-of-Discussion framework to leverage the synergy among multiple open-source LLMs aiming to provide more correct and more comprehensive answers for open-ended QA, although they are not strong enough individually. Our experiments show that discussions among multiple LLMs play a vital role in enhancing the quality of answers.
pdf
bib
abs
RAIDEN Benchmark: Evaluating Role-playing Conversational Agents with Measurement-Driven Custom Dialogues
Bowen Wu
|
Kaili Sun
|
Ziwei Bai
|
Ying Li
|
Baoxun Wang
As Large-scale Language Models (LLMs) advance, the development of engaging Role-Playing Conversational Agents (RPCAs) has gained prominence. Despite this progress, there is a notable absence of benchmarks designed around dialogues, rather than question-answering formats, to assess the effectiveness of RPCA interactions. This paper introduces the RAIDEN benchmark, containing a comprehensive dataset specifically developed for RPCA evaluation, comprising over 40,000 multi-turn utterances across 135 characters. The benchmark focuses on assessing particular dimensions at different stages of a conversation, facilitated through interactions conducted by annotators. This approach allows the evaluation phase to concentrate on specific response dimensions, and thus subjectivity in dialogue evaluation is reduced. To further enhance objectivity, evaluators compare responses from two different models rather than assessing a single response in isolation. Besides, we introduce RPCAJudger, a specialized judging LLM tailored for automatic RPCA evaluation. The evaluations conducted by RPCAJudger closely mirror human judgments, and its API-free methodology serves to prevent potential data leakage. All the models and all non-private leaderboard data will be made publicly available.
pdf
bib
abs
CryptOpiQA: A new Opinion and Question Answering dataset on Cryptocurrency
Sougata Sarkar
|
Aditya Badwal
|
Amartya Roy
|
Koustav Rudra
|
Kripabandhu Ghosh
Cryptocurrency has attracted a lot of public attention and opinion worldwide. Users have different kinds of information needs regarding such topics, and publicly available information is a good resource to satisfy those information needs. In this paper, we investigate public opinion on cryptocurrency and Bitcoin on two social media platforms – Twitter and Reddit. We have created a multi-level dataset CryptOpiQA and garnered valuable insights. The dataset contains both gold standard (manually annotated) and silver standard (inferred from the gold standard) labels. As a part of this dataset, we have also created a Question Answering sub-corpus. We have used state-of-the-art LLMs and advanced techniques such as retrieval augmented generation (RAG) to improve question-answering (QnA) results. We believe this dataset and the analysis will be useful in studying user opinions and Question-Answering on cryptocurrency in the research community.
pdf
bib
abs
No Train but Gain: Language Arithmetic for training-free Language Adapters enhancement
Mateusz Klimaszewski
|
Piotr Andruszkiewicz
|
Alexandra Birch
Modular deep learning is the state-of-the-art solution for lifting the curse of multilinguality, preventing the impact of negative interference and enabling cross-lingual performance in Multilingual Pre-trained Language Models. However, a trade-off of this approach is the reduction in positive transfer learning from closely related languages. In response, we introduce a novel method called language arithmetic, which enables training-free post-processing to address this limitation. Extending the task arithmetic framework, we apply learning via addition to the language adapters, transitioning the framework from a multi-task to a multilingual setup. The effectiveness of the proposed solution is demonstrated on three downstream tasks in a MAD-X-based set of cross-lingual schemes, acting as a post-processing procedure. Language arithmetic consistently improves the baselines with significant gains, especially in the most challenging case of zero-shot application. Our code and models are available at https://github.com/mklimasz/language-arithmetic.
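A minimal sketch of the "learning via addition" idea applied to language adapters, assuming each adapter is available as a plain parameter dictionary with matching keys; the scaling factor and the choice of related-language adapter are illustrative, not the paper's exact recipe.

```python
# Training-free language arithmetic: add a related-language adapter's weights
# to the target-language adapter, element-wise, scaled by a mixing coefficient.
from typing import Dict

import torch


def language_arithmetic(
    target_adapter: Dict[str, torch.Tensor],
    related_adapter: Dict[str, torch.Tensor],
    lam: float = 0.5,
) -> Dict[str, torch.Tensor]:
    """Combine two language adapters by weighted parameter addition (no training)."""
    return {
        name: target_adapter[name] + lam * related_adapter[name]
        for name in target_adapter
    }


if __name__ == "__main__":
    # Toy adapters with a single down-projection matrix each.
    target = {"adapter.down.weight": torch.randn(8, 4)}
    related = {"adapter.down.weight": torch.randn(8, 4)}
    combined = language_arithmetic(target, related, lam=0.3)
    print(combined["adapter.down.weight"].shape)  # torch.Size([8, 4])
```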
pdf
bib
abs
NYAYAANUMANA and INLEGALLLAMA: The Largest Indian Legal Judgment Prediction Dataset and Specialized Language Model for Enhanced Decision Analysis
Shubham Kumar Nigam
|
Deepak Patnaik Balaramamahanthi
|
Shivam Mishra
|
Noel Shallum
|
Kripabandhu Ghosh
|
Arnab Bhattacharya
The integration of artificial intelligence (AI) in legal judgment prediction (LJP) has the potential to transform the legal landscape, particularly in jurisdictions like India, where a significant backlog of cases burdens the legal system. This paper introduces NyayaAnumana, the largest and most diverse corpus of Indian legal cases compiled for LJP, encompassing a total of 7,02,945 preprocessed cases. NyayaAnumana, which combines the words “Nyaya” and “Anumana” that means “judgment” and “inference” respectively for most major Indian languages, includes a wide range of cases from the Supreme Court, High Courts, Tribunal Courts, District Courts, and Daily Orders and, thus, provides unparalleled diversity and coverage. Our dataset surpasses existing datasets like PredEx and ILDC, offering a comprehensive foundation for advanced AI research in the legal domain. In addition to the dataset, we present INLegalLlama, a domain-specific generative large language model (LLM) tailored to the intricacies of the Indian legal system. It is developed through a two-phase training approach over a base LLaMa model. First, Indian legal documents are injected using continual pretraining. Second, task-specific supervised finetuning is done. This method allows the model to achieve a deeper understanding of legal contexts. Our experiments demonstrate that incorporating diverse court data significantly boosts model accuracy, achieving approximately 90% F1-score in prediction tasks. INLegalLlama not only improves prediction accuracy but also offers comprehensible explanations, addressing the need for explainability in AI-assisted legal decisions.
pdf
bib
abs
ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media
Kung-Hsiang Huang
|
Hou Pong Chan
|
Kathleen McKeown
|
Heng Ji
Considerable advancements have been made to tackle the misrepresentation of information derived from reference articles in the domains of fact-checking and faithful summarization. However, an unaddressed aspect remains - the identification of social media posts that manipulate information within associated news articles. This task presents a significant challenge, primarily due to the prevalence of personal opinions in such posts. We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information. To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles. Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance. Additionally, we have developed a simple yet effective basic model that outperforms LLMs significantly on the ManiTweet dataset. Finally, we have conducted an exploratory analysis of human-written tweets, unveiling intriguing connections between manipulation and the domain and factuality of news articles, as well as revealing that manipulated sentences are more likely to encapsulate the main story or consequences of a news outlet.
pdf
bib
abs
Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion
Ben Liu
|
Jihai Zhang
|
Fangquan Lin
|
Cheng Yang
|
Min Peng
Large Language Models (LLMs) present massive inherent knowledge and superior semantic comprehension capability, which have revolutionized various tasks in natural language processing. Despite their success, a critical gap remains in enabling LLMs to perform knowledge graph completion (KGC). Empirical evidence suggests that LLMs consistently perform worse than conventional KGC approaches, even through sophisticated prompt design or tailored instruction-tuning. Fundamentally, applying LLMs on KGC introduces several critical challenges, including a vast set of entity candidates, the hallucination issue of LLMs, and under-exploitation of the graph structure. To address these challenges, we propose a novel instruction-tuning-based method, namely FtG. Specifically, we present a filter-then-generate paradigm and formulate the KGC task into a multiple-choice question format. In this way, we can harness the capability of LLMs while mitigating the issue caused by hallucinations. Moreover, we devise a flexible ego-graph serialization prompt and employ a structure-text adapter to couple structure and text information in a contextualized manner. Experimental results demonstrate that FtG achieves substantial performance gain compared to existing state-of-the-art methods. The instruction dataset and code are available at https://github.com/LB0828/FtG.
pdf
bib
abs
FineRAG: Fine-grained Retrieval-Augmented Text-to-Image Generation
Huaying Yuan
|
Ziliang Zhao
|
Shuting Wang
|
Shitao Xiao
|
Minheng Ni
|
Zheng Liu
|
Zhicheng Dou
Recent advancements in text-to-image generation, notably the series of Stable Diffusion methods, have enabled the production of diverse, high-quality photo-realistic images. Nevertheless, these techniques still exhibit limitations in terms of knowledge access. Retrieval-augmented image generation is a straightforward way to tackle this problem. Current studies primarily utilize coarse-grained retrievers, employing initial prompts as search queries for knowledge retrieval. This approach, however, is ineffective in accessing valuable knowledge in long-tail text-to-image generation scenarios. To alleviate this problem, we introduce FineRAG, a fine-grained model that systematically breaks down the retrieval-augmented image generation task into four critical stages: query decomposition, candidate selection, retrieval-augmented diffusion, and self-reflection. Experimental results on both general and long-tailed benchmarks show that our proposed method significantly reduces the noise associated with retrieval-augmented image generation and performs better in complex, open-world scenarios.
pdf
bib
abs
User Willingness-aware Sales Talk Dataset
Asahi Hentona
|
Jun Baba
|
Shiki Sato
|
Reina Akama
User willingness is a crucial element in the sales talk process that affects the achievement of the salesperson’s or sales system’s objectives. Despite the importance of user willingness, to the best of our knowledge, no previous study has addressed the development of automated sales talk dialogue systems that explicitly consider user willingness. A major barrier is the lack of sales talk datasets with reliable user willingness data. Thus, in this study, we developed a user willingness–aware sales talk collection by leveraging the ecological validity concept, which is discussed in the field of human–computer interaction. Our approach focused on three types of user willingness essential in real sales interactions. We created a dialogue environment that closely resembles real-world scenarios to elicit natural user willingness, with participants evaluating their willingness at the utterance level from multiple perspectives. We analyzed the collected data to gain insights into practical user willingness–aware sales talk strategies. In addition, as a practical application of the constructed dataset, we developed and evaluated a sales dialogue system aimed at enhancing the user’s intent to purchase.
pdf
bib
abs
Return of EM: Entity-driven Answer Set Expansion for QA Evaluation
Dongryeol Lee
|
Minwoo Lee
|
Kyungmin Min
|
Joonsuk Park
|
Kyomin Jung
Recently, directly using large language models (LLMs) has been shown to be the most reliable method to evaluate QA models. However, it suffers from limited interpretability, high cost, and environmental harm. To address these, we propose to use soft exact match (EM) with entity-driven answer set expansion. Our approach expands the gold answer set to include diverse surface forms, based on the observation that the surface forms often follow particular patterns depending on the entity type. The experimental results show that our method outperforms traditional evaluation methods by a large margin. Moreover, the reliability of our evaluation method is comparable to that of LLM-based ones, while offering the benefits of high interpretability and reduced environmental harm.
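A toy sketch of soft exact match over an expanded answer set: the person-name expansion rule and the normalization below are illustrative stand-ins for the entity-type-specific surface-form patterns described above.

```python
# Soft EM with answer set expansion: a prediction counts as correct if any
# expanded surface form of any gold answer appears verbatim in it.
import re
from typing import List


def expand_person_name(name: str) -> List[str]:
    """Toy expansion for person entities: full name, surname, first+last name."""
    parts = name.split()
    forms = {name}
    if len(parts) >= 2:
        forms.add(parts[-1])                  # surname only
        forms.add(f"{parts[0]} {parts[-1]}")  # drop middle names
    return list(forms)


def normalize(text: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def soft_em(prediction: str, gold_answers: List[str]) -> bool:
    pred = normalize(prediction)
    for gold in gold_answers:
        for form in expand_person_name(gold):
            if normalize(form) in pred:
                return True
    return False


print(soft_em("The 44th president was Barack Hussein Obama.", ["Barack Obama"]))  # True
```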
pdf
bib
abs
Data Augmentation for Cross-domain Parsing via Lightweight LLM Generation and Tree Hybridization
Ziyan Zhang
|
Yang Hou
|
Chen Gong
|
Zhenghua Li
Cross-domain constituency parsing remains a challenging task due to the lack of high-quality out-of-domain data. In this paper, we propose a data augmentation method via lightweight large language model (LLM) generation and tree hybridization. We utilize LLM to generate phrase structures (subtrees) for the target domain by incorporating grammar rules and lexical head information into the prompt. To better leverage LLM-generated target-domain subtrees, we hybridize them with existing source-domain subtrees to efficiently produce a large number of structurally diverse instances. Experimental results demonstrate that our method achieves significant improvements on five target domains with a lightweight LLM generation cost.
pdf
bib
abs
CPsyExam: A Chinese Benchmark for Evaluating Psychology using Examinations
Jiahao Zhao
|
Jingwei Zhu
|
Minghuan Tan
|
Min Yang
|
Renhao Li
|
Yang Di
|
Chenhao Zhang
|
Guancheng Ye
|
Chengming Li
|
Xiping Hu
|
Derek F. Wong
In this paper, we introduce a novel psychological benchmark, CPsyExam, constructed from questions sourced from Chinese examination systems. CPsyExam is designed to prioritize psychological knowledge and case analysis separately, recognizing the significance of applying psychological knowledge to real-world scenarios. We collect 22k questions from 39 psychology-related subjects across four Chinese examination systems. From the pool of 22k questions, we utilize 4k to create the benchmark that offers balanced coverage of subjects and incorporates a diverse range of case analysis techniques. Furthermore, we evaluate a range of existing large language models (LLMs), spanning from open-sourced to proprietary models. Our experiments and analysis demonstrate that CPsyExam serves as an effective benchmark for enhancing the understanding of psychology within LLMs and enables the comparison of LLMs across various granularities.
pdf
bib
abs
Optimizing Lifelong Fine-Tuning for Multiple Tasks via Dataless Distribution Replay
Zhenxing Wang
The recent emergence of various large language models, which can be fine-tuned with minimal instruction data, has demonstrated impressive performance across various tasks. However, a phenomenon of forgetting occurs during lifelong fine-tuning because training on new tasks interferes with the previously acquired knowledge. To mitigate catastrophic forgetting, conventional data replay methods achieve high performance, but at the cost of compromising data privacy and security. This paper introduces a dataless distribution replay approach for lifelong fine-tuning. Concretely, distribution distillation is applied to replay the output distribution of the linear layers at previous task stages. The optimal solution for this distribution replay can be directly computed using the retained inner product matrix of the input data, thereby eliminating the need for previous data. Additionally, Singular Value Decomposition (SVD) and module accumulation are employed to further enhance the performance of the dataless distribution replay method. Finally, the evaluation is conducted in a lifelong fine-tuning scenario involving multiple tasks. The experimental results and analysis show that the proposed method achieves significant improvements compared to several strong lifelong fine-tuning methods.
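The central algebraic point, as we read it, can be sketched as follows (our own notation, not the paper's exact formulation): the replay loss for a linear layer depends on previous-task inputs only through their inner product (Gram) matrix, which is why the data itself need not be stored.

```latex
% Replay loss for a linear layer with current weights W, previous weights
% W_prev, and previous-task inputs x_1, ..., x_n (sketch; notation is ours):
\mathcal{L}_{\text{replay}}(W)
  = \sum_{i=1}^{n} \lVert W x_i - W_{\text{prev}} x_i \rVert_2^2
  = \operatorname{tr}\!\big( (W - W_{\text{prev}})\, G\, (W - W_{\text{prev}})^{\top} \big),
\qquad G = \sum_{i=1}^{n} x_i x_i^{\top}.
% Only the Gram matrix G must be retained, not the raw inputs, and the
% minimizer of this quadratic term combined with the new-task loss admits
% a closed-form solution in W.
```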
pdf
bib
abs
Physics Reasoner: Knowledge-Augmented Reasoning for Solving Physics Problems with Large Language Models
Xinyu Pang
|
Ruixin Hong
|
Zhanke Zhou
|
Fangrui Lv
|
Xinwei Yang
|
Zhilong Liang
|
Bo Han
|
Changshui Zhang
Physics problems constitute a significant aspect of reasoning, necessitating complicated reasoning ability and abundant physics knowledge. However, existing large language models (LLMs) frequently fail due to a lack of knowledge or incorrect knowledge application. To mitigate these issues, we propose Physics Reasoner, a knowledge-augmented framework to solve physics problems with LLMs. Specifically, the proposed framework constructs a comprehensive formula set to provide explicit physics knowledge and utilizes checklists containing detailed instructions to guide effective knowledge application. Namely, given a physics problem, Physics Reasoner solves it through three stages: problem analysis, formula retrieval, and guided reasoning. During the process, checklists are employed to enhance LLMs’ self-improvement in the analysis and reasoning stages. Empirically, Physics Reasoner mitigates the issues of insufficient knowledge and incorrect application, achieving state-of-the-art performance on SciBench with an average accuracy improvement of 5.8%.
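A minimal, hypothetical sketch of such a three-stage pipeline (the stage prompts, the `formula_set`/`checklists` structures, and the `call_llm` helper are placeholders we introduce for illustration, not the authors' code):

```python
def physics_reasoner(problem: str, formula_set: dict, checklists: dict, call_llm):
    """Three-stage sketch of a knowledge-augmented solver; call_llm is any
    text-in/text-out LLM client supplied by the reader (placeholder)."""
    # Stage 1: problem analysis, guided by an analysis checklist.
    analysis = call_llm(
        f"Analyze the following physics problem. Checklist:\n{checklists['analysis']}\n\n{problem}"
    )
    # Stage 2: retrieve candidate formulas whose names are mentioned in the analysis
    # (a crude keyword-match stand-in for the paper's retrieval stage).
    retrieved = {name: f for name, f in formula_set.items() if name.lower() in analysis.lower()}
    # Stage 3: guided reasoning with the retrieved formulas and a reasoning checklist.
    return call_llm(
        f"Solve the problem step by step using these formulas:\n{retrieved}\n"
        f"Checklist:\n{checklists['reasoning']}\n\nProblem:\n{problem}\nAnalysis:\n{analysis}"
    )
```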
pdf
bib
abs
Efficient Data Labeling by Hierarchical Crowdsourcing with Large Language Models
Haodi Zhang
|
Junyu Yang
|
Jinyin Nie
|
Peirou Liang
|
Kaishun Wu
|
Defu Lian
|
Rui Mao
|
Yuanfeng Song
Large language models (LLMs) have received lots of attention for their impressive performance in in-context dialogues and their potential to revolutionize service industries with a new business model, Model-as-a-Service (MaaS). Automated data labeling is a natural and promising service. However, labeling data with LLMs faces two main challenges: 1) the labels from LLMs may contain uncertainty, and 2) using LLMs for data labeling tasks can be prohibitively expensive, as the scales of datasets are usually tremendous. In this paper, we propose a hierarchical framework named LMCrowd that leverages multiple LLMs for efficient data labeling under budget constraints. The proposed LMCrowd framework first aggregates labels from multiple freely available LLMs, and then employs a large, paid MaaS LLM for relabeling selected instances. Furthermore, we formalize the core process as an optimization problem, aiming to select the optimal set of instances for relabeling by the MaaS LLM, given the current belief state. Extensive experimental evaluations across various real-world datasets demonstrate that our framework outperforms human labelers and GPT-4 in terms of both accuracy and efficiency.
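One way to picture the two-stage process is sketched below; the majority-vote aggregation, the disagreement-based selection rule, and the function names are our simplifying assumptions, whereas the paper formalizes selection as an optimization over the current belief state.

```python
from collections import Counter

def aggregate_free_labels(labels_per_item: list[list[str]]) -> list[tuple[str, float]]:
    """Majority-vote each item's labels from the free LLMs; return (label, agreement)."""
    out = []
    for labels in labels_per_item:
        label, votes = Counter(labels).most_common(1)[0]
        out.append((label, votes / len(labels)))
    return out

def select_for_paid_relabel(aggregated: list[tuple[str, float]], budget: int) -> list[int]:
    """Spend the paid-LLM budget on the items the free LLMs agree on least."""
    ranked = sorted(range(len(aggregated)), key=lambda i: aggregated[i][1])
    return ranked[:budget]

# Usage with toy labels from three free LLMs on four items:
free_labels = [["pos", "pos", "pos"], ["pos", "neg", "neg"],
               ["neg", "neg", "pos"], ["neg", "neg", "neg"]]
agg = aggregate_free_labels(free_labels)
print(select_for_paid_relabel(agg, budget=2))  # indices of the two most-disputed items
```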
pdf
bib
abs
Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty?
Leonidas Zotos
|
Hedderik van Rijn
|
Malvina Nissim
Estimating the difficulty of multiple-choice questions would be of great help for educators, who must spend substantial time creating and piloting stimuli for their tests, and for learners who want to practice. Supervised approaches to difficulty estimation have to date yielded mixed results. In this contribution we leverage an aspect of generative large models which might be seen as a weakness when answering questions, namely their uncertainty. Specifically, we explore correlations between two different metrics of model uncertainty and the actual student response distribution. While we observe correlations that are present but weak, we also discover that the models’ behaviour differs for correct vs wrong answers, and that correlations vary substantially across the question types included in our fine-grained, previously unused dataset of 451 questions from a Biopsychology course. In discussing our findings, we also suggest potential avenues to further leverage model uncertainty as an additional proxy for item difficulty.
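A minimal sketch of this kind of correlation analysis, assuming an entropy-based uncertainty metric over answer options and per-question student accuracy as the difficulty proxy (the metric choice and toy numbers are ours, not the paper's):

```python
import math
from scipy.stats import spearmanr

def choice_entropy(option_probs: list[float]) -> float:
    """Shannon entropy of the model's probability mass over the answer options;
    higher entropy means more model uncertainty on that question."""
    return -sum(p * math.log(p) for p in option_probs if p > 0)

# Toy data: per-question model probabilities over four options, and the
# observed fraction of students answering correctly (difficulty proxy).
model_probs = [[0.85, 0.05, 0.05, 0.05],
               [0.40, 0.30, 0.20, 0.10],
               [0.30, 0.30, 0.25, 0.15]]
student_accuracy = [0.90, 0.55, 0.35]

uncertainty = [choice_entropy(p) for p in model_probs]
rho, pval = spearmanr(uncertainty, student_accuracy)
print(f"Spearman rho={rho:.2f}, p={pval:.3f}")
```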
pdf
bib
abs
RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation
Shuting Wang
|
Xin Yu
|
Mang Wang
|
Weipeng Chen
|
Yutao Zhu
|
Zhicheng Dou
Retrieval-augmented generation (RAG) effectively addresses issues of static knowledge and hallucination in large language models. Existing studies mostly focus on question scenarios with clear user intents and concise answers. However, it is prevalent that users issue broad, open-ended queries with diverse sub-intents, for which they desire rich and long-form answers covering multiple relevant aspects. To tackle this important yet underexplored problem, we propose a novel RAG framework, namely RichRAG. It includes a sub-aspect explorer to identify potential sub-aspects of input questions, a multi-faceted retriever to build a candidate pool of diverse external documents related to these sub-aspects, and a generative list-wise ranker, which is a key module to provide the top-k most valuable documents for the final generator. These ranked documents sufficiently cover various query aspects and are aware of the generator’s preferences, hence incentivizing it to produce rich and comprehensive responses for users. The training of our ranker involves a supervised fine-tuning stage to ensure the basic coverage of documents, and a reinforcement learning stage to align downstream LLM’s preferences to the ranking of documents. Experimental results on two publicly available datasets prove that our framework effectively and efficiently provides comprehensive and satisfying responses to users.
pdf
bib
abs
LlmLink: Dual LLMs for Dynamic Entity Linking on Long Narratives with Collaborative Memorisation and Prompt Optimisation
Lixing Zhu
|
Jun Wang
|
Yulan He
We address the task of CoREFerence resolution (CoREF) in chunked long narratives. Existing approaches remain either focused on supervised fine-tuning or limited to one-off prediction, which poses a challenge where the context is long. We develop a dynamic approach to cope with this: by deploying dual Large Language Models (LLMs), we assign specialised LLMs to local named entity recognition and distant CoREF tasks, respectively, while ensuring their exchange of information. Utilising our novel memorisation schemes, the coreference resolution LLM would memorise characters and their associated descriptions, thereby reducing token consumption compared with storing previous messages. To alleviate hallucinations of LLMs, we employ an automatic prompt optimisation method, with the LLM ranker modified to leverage annotations. Our approach achieves performance gains over other LLM-based models and fine-tuning approaches on long narrative datasets, significantly reducing the resources required for inference and training.
pdf
bib
abs
PERSONA: A Reproducible Testbed for Pluralistic Alignment
Louis Castricato
|
Nathan Lile
|
Rafael Rafailov
|
Jan-Philipp Fränken
|
Chelsea Finn
The rapid advancement of language models (LMs) necessitates robust alignment with diverse user values. However, current preference optimization approaches often fail to capture the plurality of user opinions, instead reinforcing majority viewpoints and marginalizing minority perspectives. We introduce PERSONA, a reproducible testbed designed to evaluate and improve pluralistic alignment of LMs. We procedurally generate diverse user profiles from US census data, resulting in 1,586 synthetic personas with varied demographic and idiosyncratic attributes. We then generate a large-scale evaluation dataset containing 3,868 prompts and 317,200 feedback pairs obtained from our synthetic personas. Leveraging this dataset, we systematically evaluate LM capabilities in role-playing diverse users, verified through human judges, and establish both a benchmark, PERSONA Bench, for pluralistic alignment approaches and an extensive dataset for creating new and future benchmarks.
pdf
bib
abs
LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings
Fred Philippy
|
Siwen Guo
|
Jacques Klein
|
Tegawende Bissyande
Sentence embedding models play a key role in various Natural Language Processing tasks, such as in Topic Modeling, Document Clustering and Recommendation Systems. However, these models rely heavily on parallel data, which can be scarce for many low-resource languages, including Luxembourgish. This scarcity results in suboptimal performance of monolingual and cross-lingual sentence embedding models for these languages. To address this issue, we compile a relatively small but high-quality human-generated cross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence embedding model for Luxembourgish with strong cross-lingual capabilities. Additionally, we present evidence suggesting that including low-resource languages in parallel training datasets can be more advantageous for other low-resource languages than relying solely on high-resource language pairs. Furthermore, recognizing the lack of sentence embedding benchmarks for low-resource languages, we create a paraphrase detection benchmark specifically for Luxembourgish, aiming to partially fill this gap and promote further research.
pdf
bib
abs
Human Interest Framing across Cultures: A Case Study on Climate Change
Gisela Vallejo
|
Christine de Kock
|
Timothy Baldwin
|
Lea Frermann
Human Interest (HI) framing is a narrative strategy that injects news stories with a relatable, emotional angle and a human face to engage the audience. In this study we investigate the use of HI framing across different English-speaking cultures in news articles about climate change. Despite its demonstrated impact on the public’s behaviour and perception of an issue, HI framing has been under-explored in NLP to date. We perform a systematic analysis of HI stories to understand the role of this framing in climate change reporting in English-speaking countries from four continents. Our findings reveal key differences in how climate change is portrayed across countries, encompassing aspects such as narrative roles, article polarity, pronoun prevalence, and topics. We also demonstrate that these linguistic aspects boost the performance of fine-tuned pre-trained language models on HI story classification.
pdf
bib
abs
OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs
Yuxia Wang
|
Minghan Wang
|
Hasan Iqbal
|
Georgi N. Georgiev
|
Jiahui Geng
|
Iryna Gurevych
|
Preslav Nakov
The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. Difficulties lie in assessing the factuality of free-form responses in open domains. Also, different papers use disparate evaluation benchmarks and measurements, which renders them hard to compare and hampers future progress. To mitigate these issues, we propose OpenFactCheck, a unified framework for building customized automatic fact-checking systems, benchmarking their accuracy, evaluating the factuality of LLMs, and verifying claims in a document. OpenFactCheck consists of three modules: (i) CUSTCHECKER allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims, (ii) LLMEVAL, a unified evaluation framework, assesses an LLM’s factuality from various perspectives fairly, and (iii) CHECKEREVAL is an extensible solution for gauging the reliability of automatic fact-checkers’ verification results using human-annotated datasets. Data and code are publicly available at https://github.com/yuxiaw/openfactcheck.
pdf
bib
abs
A Dataset for Expert Reviewer Recommendation with Large Language Models as Zero-shot Rankers
Vanja M. Karan
|
Stephen McQuistin
|
Ryo Yanagida
|
Colin Perkins
|
Gareth Tyson
|
Ignacio Castro
|
Patrick G.T. Healey
|
Matthew Purver
The task of reviewer recommendation is increasingly important, with main techniques utilizing general models of text relevance. However, state-of-the-art (SotA) systems still have relatively high error rates. Two possible reasons for this are a lack of large datasets and the fact that large language models (LLMs) have not yet been applied. To fill these gaps, we first create a substantial new dataset in the domain of Internet specification documents; then we introduce the use of LLMs and evaluate their performance. We find that LLMs with prompting can improve on the SotA in some cases, but that they are not a cure-all: this task provides a challenging setting for prompt-based methods.
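A hypothetical sketch of using an LLM as a zero-shot ranker for this task (the prompt template, output format, and `call_llm` placeholder are our assumptions; the paper's actual prompts and models may differ):

```python
def build_ranking_prompt(document_abstract: str, candidates: list[str]) -> str:
    """Assemble a zero-shot ranking prompt over candidate reviewers
    (template is illustrative, not taken from the paper)."""
    listing = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(candidates))
    return (
        "Rank the following candidate reviewers by their expertise for the "
        "document below. Return the ranking as a comma-separated list of numbers.\n\n"
        f"Document:\n{document_abstract}\n\nCandidates:\n{listing}\n"
    )

def rank_reviewers(document_abstract: str, candidates: list[str], call_llm):
    """call_llm is a placeholder for whatever chat-completion client the reader uses."""
    reply = call_llm(build_ranking_prompt(document_abstract, candidates))
    order = [int(tok) - 1 for tok in reply.split(",") if tok.strip().isdigit()]
    return [candidates[i] for i in order if 0 <= i < len(candidates)]
```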
pdf
bib
abs
Evaluating Model Alignment with Human Perception: A Study on Shitsukan in LLMs and LVLMs
Daiki Shiono
|
Ana Brassard
|
Yukiko Ishizuki
|
Jun Suzuki
We evaluate the alignment of large language models (LLMs) and large vision-language models (LVLMs) with human perception, focusing on the Japanese concept of *shitsukan*, which reflects the sensory experience of perceiving objects. We created a dataset of *shitsukan* terms elicited from individuals in response to object images. With it, we designed benchmark tasks for three dimensions of understanding *shitsukan*: (1) accurate perception in object images, (2) commonsense knowledge of typical *shitsukan* terms for objects, and (3) distinction of valid *shitsukan* terms. Models demonstrated mixed accuracy across benchmark tasks, with limited overlap between model- and human-generated terms. However, manual evaluations revealed that the model-generated terms were still natural to humans. This work identifies gaps in culture-specific understanding and contributes to aligning models with human sensory perception. We publicly release the dataset to encourage further research in this area.