Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)

Anaelia Ovalle, Kai-Wei Chang, Yang Trista Cao, Ninareh Mehrabi, Jieyu Zhao, Aram Galstyan, Jwala Dhamala, Anoop Kumar, Rahul Gupta (Editors)

Anthology ID:: 2024.trustnlp-1
Month:: June
Year:: 2024
Address:: Mexico City, Mexico
Venues:: TrustNLP | WS
SIG:
Publisher:: Association for Computational Linguistics
URL:: https://aclanthology.org/2024.trustnlp-1/
DOI:
Bib Export formats:: BibTeX MODS XML EndNote
PDF:: https://aclanthology.org/2024.trustnlp-1.pdf

PDF (full) BibTeX Search

pdf bib abs
Beyond Turing: A Comparative Analysis of Approaches for Detecting Machine-Generated Text
Muhammad Adilazuarda

Significant progress has been made on text generation by pre-trained language models (PLMs), yet distinguishing between human and machine-generated text poses an escalating challenge. This paper offers an in-depth evaluation of three distinct methods used to address this task: traditional shallow learning, Language Model (LM) fine-tuning, and Multilingual Model fine-tuning. These approaches are rigorously tested on a wide range of machine-generated texts, providing a benchmark of their competence in distinguishing between human-authored and machine-authored linguistic constructs. The results reveal considerable differences in performance across methods, thus emphasizing the continued need for advancement in this crucial area of NLP. This study offers valuable insights and paves the way for future research aimed at creating robust and highly discriminative models.

pdf bib abs
Automated Adversarial Discovery for Safety Classifiers
Yash Kumar Lal | Preethi Lahoti | Aradhana Sinha | Yao Qin | Ananth Balashankar

Safety classifiers are critical in mitigating toxicity on online forums such as social media and in chatbots. Still, they continue to be vulnerable to emergent, and often innumerable, adversarial attacks.Traditional automated adversarial data generation methods, however, tend to produce attacks that are not diverse, but variations of previously observed harm types.We formalize the task of automated adversarial discovery for safety classifiers - to find new attacks along previously unseen harm dimensions that expose new weaknesses in the classifier.We measure progress on this task along two key axes (1) adversarial success: does the attack fool the classifier? and (2) dimensional diversity: does the attack represent a previously unseen harm type?Our evaluation of existing attack generation methods on the CivilComments toxicity task reveals their limitations: Word perturbation attacks fail to fool classifiers, while prompt-based LLM attacks have more adversarial success, but lack dimensional diversity.Even our best-performing prompt-based method finds new successful attacks on unseen harm dimensions of attacks only 5% of the time.Automatically finding new harmful dimensions of attack is crucial and there is substantial headroom for future research on our new task.

pdf bib abs
FairBelief - Assessing Harmful Beliefs in Language Models
Mattia Setzu | Marta Marchiori Manerba | Pasquale Minervini | Debora Nozza

Language Models (LMs) have been shown to inherit undesired biases that might hurt minorities and underrepresented groups if such systems were integrated into real-world applications without careful fairness auditing.This paper proposes FairBelief, an analytical approach to capture and assess beliefs, i.e., propositions that an LM may embed with different degrees of confidence and that covertly influence its predictions. With FairBelief, we leverage prompting to study the behavior of several state-of-the-art LMs across different previously neglected axes, such as model scale and likelihood, assessing predictions on a fairness dataset specifically designed to quantify LMs’ outputs’ hurtfulness.Finally, we conclude with an in-depth qualitative assessment of the beliefs emitted by the models.We apply FairBelief to English LMs, revealing that, although these architectures enable high performances on diverse natural language processing tasks, they show hurtful beliefs about specific genders. Interestingly, training procedure and dataset, model scale, and architecture induce beliefs of different degrees of hurtfulness.

pdf bib abs
The Trade-off between Performance, Efficiency, and Fairness in Adapter Modules for Text Classification
Minh Duc Bui | Katharina Von Der Wense

Current natural language processing (NLP) research tends to focus on only one or, less frequently, two dimensions – e.g., performance, interpretability, or efficiency – at a time, which may lead to suboptimal conclusions. Work on adapter modulesfocuses on improving performance and efficiency, with no investigation of unintended consequences on other aspects such as fairness. To address this gap, we conduct experiments on three text classification datasets by either (1) finetuning all parameters or (2) using adapter modules. Regarding performance and efficiency, we confirm prior findings that the accuracy of adapter-enhanced models is roughly on par with that of fully finetuned models, while training time is substantially reduced. Regarding fairness, we show that adapter modules result in mixed fairness across sensitive groups. Further investigation reveals that, when the standard finetuned model exhibits limited biases, adapter modules typically do not introduce extra bias. On the other hand, when the finetuned model exhibits increased bias, the use of adapter modules poses the potential danger of amplifying these biases to a significant extent. Our findings highlight the need for a case-by-case evaluation rather than a one-size-fits-all judgment.

pdf bib abs
When XGBoost Outperforms GPT-4 on Text Classification: A Case Study
Matyas Bohacek | Michal Bravansky

Large language models (LLMs) are increasingly used for applications beyond text generation, ranging from text summarization to instruction following. One popular example of exploiting LLMs’ zero- and few-shot capabilities is the task of text classification. This short paper compares two popular LLM-based classification pipelines (GPT-4 and LLAMA 2) to a popular pre-LLM-era classification pipeline on the task of news trustworthiness classification, focusing on performance, training, and deployment requirements. We find that, in this case, the pre-LLM-era ensemble pipeline outperforms the two popular LLM pipelines while being orders of magnitude smaller in parameter size.

pdf bib abs
Towards Healthy AI: Large Language Models Need Therapists Too
Baihan Lin | Djallel Bouneffouf | Guillermo Cecchi | Kush Varshney

Recent advances in large language models (LLMs) have led to the development of powerful chatbots capable of engaging in fluent human-like conversations. However, these chatbots may be harmful, exhibiting manipulation, gaslighting, narcissism, and other toxicity. To work toward safer and more well-adjusted models, we propose a framework that uses psychotherapy to identify and mitigate harmful chatbot behaviors. The framework involves four different artificial intelligence (AI) agents: the Chatbot whose behavior is to be adjusted, a User, a Therapist, and a Critic that can be paired with reinforcement learning-based LLM tuning. We illustrate the framework with a working example of a social conversation involving four instances of ChatGPT, showing that the framework may mitigate the toxicity in conversations between LLM-driven chatbots and people. Although there are still several challenges and directions to be addressed in the future, the proposed framework is a promising approach to improving the alignment between LLMs and human values.

pdf bib abs
Exploring Causal Mechanisms for Machine Text Detection Methods
Kiyoon Yoo | Wonhyuk Ahn | Yeji Song | Nojun Kwak

The immense attraction towards text generation garnered by ChatGPT has spurred the need for discriminating machine-text from human text. In this work, we provide preliminary evidence that the scores computed by existing zero-shot and supervised machine-generated text detection methods are not solely determined by the generated texts, but are affected by prompts and real texts as well. Using techniques from causal inference, we show the existence of backdoor paths that confounds the relationships between text and its detection score and how the confounding bias can be partially mitigated. We open up new research directions in identifying other factors that may be interwoven in the detection of machine text. Our study calls for a deeper investigation into which kinds of prompts make the detection of machine text more difficult or easier

pdf bib abs
FactAlign: Fact-Level Hallucination Detection and Classification Through Knowledge Graph Alignment
Mohamed Rashad | Ahmed Zahran | Abanoub Amin | Amr Abdelaal | Mohamed Altantawy

This paper proposes a novel black-box approach for fact-level hallucination detection and classification by transforming the problem into a knowledge graph alignment task. This approach allows us to classify detected hallucinations as either intrinsic or extrinsic. The paper starts by discussing the field of hallucination detection and introducing several approaches to related work. Then, we introduce the proposed FactAlign approach for hallucination detection and discuss how we can use it to classify hallucinations as either intrinsic or extrinsic. Experiments are carried out to evaluate the proposed method against state-of-the-art methods on the hallucination detection task using the WikiBio GPT-3 hallucination dataset, and on the hallucination type classification task using the XSum hallucination annotations dataset. The experimental results show that our method achieves a 0.889 F1 score for the hallucination detection and 0.825 F1 for the hallucination type classification, without any further training, fine-tuning, or producing multiple samples of the LLM response.

Recent studies reveal that Large Language Models (LLMs) face challenges in balancing safety with utility, particularly when processing long texts for NLP tasks like summarization and translation. Despite defenses against malicious short questions, the ability of LLMs to safely handle dangerous long content, such as manuals teaching illicit activities, remains unclear. Our work aims to develop robust defenses for LLMs in processing malicious documents alongside benign NLP task queries. We introduce a defense dataset comprised of safety-related examples and propose single-task and mixed-task losses for instruction tuning. Our empirical results demonstrate that LLMs can significantly enhance their capacity to safely manage dangerous content with appropriate instruction tuning. Additionally, strengthening the defenses of tasks most susceptible to misuse is effective in protecting LLMs against processing harmful information. We also observe that trade-offs between utility and safety exist in defense strategies, where Llama2, utilizing our proposed approach, displays a significantly better balance compared to Llama1.

pdf bib abs
On the Interplay between Fairness and Explainability
Stephanie Brandl | Emanuele Bugliarello | Ilias Chalkidis

In order to build reliable and trustworthy NLP applications, models need to be both fair across different demographics and explainable. Usually these two objectives, fairness and explainability, are optimized and/or examined independently of each other. Instead, we argue that forthcoming, trustworthy NLP systems should consider both.In this work, we perform a first study to understand how they influence each other: do fair(er) models rely on more plausible explanations? and vice versa. To this end, we conduct experiments on two English multi-class text classification datasets, BIOS and ECtHR, that provide information on gender and nationality, respectively, as well as human-annotated rationales. We fine-tune pre-trained language models with several methods for (i) bias mitigation, which aims to improve fairness; (ii) rationale extraction, which aims to produce plausible explanations.We find that bias mitigation algorithms do not always lead to fairer models. Moreover, in our analysis, we see that empirical fairness and explainability are orthogonal.

pdf bib abs
Holistic Evaluation of Large Language Models: Assessing Robustness, Accuracy, and Toxicity for Real-World Applications
David Cecchini | Arshaan Nazir | Kalyan Chakravarthy | Veysel Kocaman

Large Language Models (LLMs) have been widely used in real-world applications. However, as LLMs evolve and new datasets are released, it becomes crucial to build processes to evaluate and control the models’ performance. In this paper, we describe how to add Robustness, Accuracy, and Toxicity scores to model comparison tables, or leaderboards. We discuss the evaluation metrics, the approaches considered, and present the results of the first evaluation round for model Robustness, Accuracy, and Toxicity scores. Our results show that GPT 4 achieves top performance on robustness and accuracy test, while Llama 2 achieves top performance on the toxicity test. We note that newer open-source models such as open chat 3.5 and neural chat 7B can perform well on these three test categories. Finally, domain-specific tests and models are also planned to be added to the leaderboard to allow for a more detailed evaluation of models in specific areas such as healthcare, legal, and finance.

pdf bib abs
HGOT: Hierarchical Graph of Thoughts for Retrieval-Augmented In-Context Learning in Factuality Evaluation
Yihao Fang | Stephen Thomas | Xiaodan Zhu

With the widespread adoption of large language models (LLMs) in numerous applications, the challenge of factuality and the propensity for hallucinations has emerged as a significant concern. To address this issue, particularly in retrieval-augmented in-context learning, we introduce the hierarchical graph of thoughts (HGOT), a structured, multi-layered graph approach designed to enhance the retrieval of pertinent passages during in-context learning. The framework utilizes the emergent planning capabilities of LLMs, employing the divide-and-conquer strategy to break down complex queries into manageable sub-queries. It refines self-consistency majority voting for answer selection, which incorporates the recently proposed citation recall and precision metrics to assess the quality of thoughts, linking an answer’s credibility intrinsically to the thought’s quality. This methodology introduces a weighted system in majority voting, prioritizing answers based on the citation quality of their thoughts. Additionally, we propose a scoring mechanism for evaluating retrieved passages, considering factors such as citation frequency and quality, self-consistency confidence, and the retrieval module’s ranking. Experiments indicate that HGOT excels as a versatile approach, outperforming competing models in FEVER by up to 7% and matching leading models such as Retrieve-then-Read in Open-SQuAD, and DSP in HotPotQA, demonstrating its efficacy in enhancing LLMs’ factuality.

pdf bib abs
Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models
Tobias Groot | Matias Valdenegro - Toro

Language and Vision-Language Models (LLMs/VLMs) have revolutionized the field of AI by their ability to generate human-like text and understand images, but ensuring their reliability is crucial. This paper aims to evaluate the ability of LLMs (GPT4, GPT-3.5, LLaMA2, and PaLM 2) and VLMs (GPT4V and Gemini Pro Vision) to estimate their verbalized uncertainty via prompting. We propose the new Japanese Uncertain Scenes (JUS) dataset, aimed at testing VLM capabilities via difficult queries and object counting, and the Net Calibration Error (NCE) to measure direction of miscalibration.Results show that both LLMs and VLMs have a high calibration error and are overconfident most of the time, indicating a poor capability for uncertainty estimation. Additionally we develop prompts for regression tasks, and we show that VLMs have poor calibration when producing mean/standard deviation and 95% confidence intervals.

pdf bib abs
Tweak to Trust: Assessing the Reliability of Summarization Metrics in Contact Centers via Perturbed Summaries
Kevin Patel | Suraj Agrawal | Ayush Kumar

In the dynamic realm of call center communications, the potential of abstractive summarization to transform information condensation is evident. However, evaluating the performance of abstractive summarization systems within contact center domain poses a significant challenge. Traditional evaluation metrics prove inadequate in capturing the multifaceted nature of call center conversations, characterized by diverse topics, emotional nuances, and dynamic contexts. This paper uses domain-specific perturbed summaries to scrutinize the robustness of summarization metrics in the call center domain. Through extensive experiments on call center data, we illustrate how perturbed summaries uncover limitations in existing metrics. We additionally utilize perturbation as data augmentation strategy to train domain-specific metrics. Our findings underscore the potential of perturbed summaries to complement current evaluation techniques, advancing reliable and adaptable summarization solutions in the call center domain.

pdf bib abs
Flatness-Aware Gradient Descent for Safe Conversational AI
Leila Khalatbari | Saeid Hosseini | Hossein Sameti | Pascale Fung

As generative dialog models become ubiquitous in real-world applications, it is paramount to ensure a harmless generation. There are two major challenges when enforcing safety to open-domain chatbots. Firstly, it is impractical to provide training data reflecting the desired response to all emerging forms of toxicity (generalisation challenge). Secondly, implementing safety features may compromise the quality of the conversation (trade-off challenge). To tackle the challenges, this paper introduces a regularized fine-tuning approach called FlatGD. By employing a safety-tailored loss, we translate better optimization to more safety. To ensure better optimization, FlatGD penalizes sharp trajectories of loss curve, encouraging flatness of the converged local minima. Experimental results on datasets of “BAD” and “prosocial dialog” demonstrate that our model outperforms the current baselines in reducing toxicity while preserving the conversation quality. Moreover, compared to other baselines, FlatGD can better generalize to unseen toxic data.

pdf bib abs
Introducing GenCeption for Multimodal LLM Benchmarking: You May Bypass Annotations
Lele Cao | Valentin Buchner | Zineb Senane | Fangkai Yang

Multimodal Large Language Models (MLLMs) are commonly evaluated using costly annotated multimodal benchmarks. However, these benchmarks often struggle to keep pace with the rapidly advancing requirements of MLLM evaluation. We propose GenCeption, a novel and annotation-free MLLM evaluation framework that merely requires unimodal data to assess inter-modality semantic coherence and inversely reflects the models’ inclination to hallucinate. Analogous to the popular DrawCeption game, GenCeption initiates with a non-textual sample and undergoes a series of iterative description and generation steps. Semantic drift across iterations is quantified using the GC@T metric. Our empirical findings validate GenCeption’s efficacy, showing strong correlations with popular MLLM benchmarking results. GenCeption may be extended to mitigate training data contamination by utilizing ubiquitous, previously unseen unimodal data.

pdf bib abs
Semantic-Preserving Adversarial Example Attack against BERT
Chongyang Gao | Kang Gu | Soroush Vosoughi | Shagufta Mehnaz

Adversarial example attacks against textual data have been drawing increasing attention in both the natural language processing (NLP) and security domains. However, most of the existing attacks overlook the importance of semantic similarity and yield easily recognizable adversarial samples. As a result, the defense methods developed in response to these attacks remain vulnerable and could be evaded by advanced adversarial examples that maintain high semantic similarity with the original, non-adversarial text. Hence, this paper aims to investigate the extent of textual adversarial examples in maintaining such high semantic similarity. We propose Reinforce attack, a reinforcement learning-based framework to generate adversarial text that preserves high semantic similarity with the original text. In particular, the attack process is controlled by a reward function rather than heuristics, as in previous methods, to encourage higher semantic similarity and lower query costs. Through automatic and human evaluations, we show that our generated adversarial texts preserve significantly higher semantic similarity than state-of-the-art attacks while achieving similar attack success rates (outperforming at times), thus uncovering novel challenges for effective defenses.

pdf bib abs
Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
Bibek Upadhayay | Vahid Behzadan

A significant challenge in reliable deployment of Large Language Models (LLMs) is malicious manipulation via adversarial prompting techniques such as jailbreaks. Employing mechanisms such as safety training have proven useful in addressing this challenge. However, in multilingual LLMs, adversaries can exploit the imbalanced representation of low-resource languages in datasets used for pretraining and safety training. In this paper, we introduce a new black-box attack vector called the Sandwich Attack: a multi-language mixture attack, which manipulates state-of-the-art LLMs into generating harmful and misaligned responses. Our experiments with five different models, namely Bard, Gemini Pro, LLaMA-2-70-B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS, show that this attack vector can be used by adversaries to elicit harmful responses from these models. By detailing both the mechanism and impact of the Sandwich attack, this paper aims to guide future research and development towards more secure and resilient LLMs, ensuring they serve the public good while minimizing potential for misuse. Content Warning: This paper contains examples of harmful language.

Large language models incorporate world knowledge and present breakthrough performances on zero-shot learning. However, these models capture societal bias (e.g., gender or racial bias) due to bias during the training process which raises ethical concerns or can even be potentially harmful. The issue is more pronounced in multi-modal settings, such as image captioning, as images can also add onto biases (e.g., due to historical non-equal representation of genders in different occupations). In this study, we investigate the removal of potentially problematic knowledge from multi-modal models used for image captioning. We relax the gender bias issue in captioning models by degenderizing generated captions through the use of a simple linear mask, trained via adversarial training. Our proposal makes no assumption on the architecture of the model and freezes the model weights during the procedure, which also enables the mask to be turned off. We conduct experiments on COCO caption datasets using our masking solution. The results suggest that the proposed mechanism can effectively mask the targeted biased knowledge, by replacing more than 99% gender words with neutral ones, and maintain a comparable captioning quality performance with minimal (e.g., -1.4 on BLEU4 and ROUGE) impact to accuracy metrics.

Language models, pre-trained on large amounts of unmoderated content, have been shown to contain societal biases. Mitigating such biases typically requires access to model parameters and training schemas. In this work, we address bias mitigation at inference time, such that it can be applied to any black-box model. To this end, we propose a belief generation and augmentation framework, BELIEVE, that demonstrates effective bias mitigation for natural language generation by augmenting input prompts with automatically generated instruction-based beliefs. Our framework eases the bottleneck required for manually crafting these instruction-based beliefs, by extending a recently proposed iterative in-context learning framework to automatically generate beliefs via a language model. We assess the impact of this system on fairness, and demonstrate effective bias mitigation on pretrained and instruction-tuned models for both sentiment and regard with respect to multiple protected classes including race, gender, and political ideology.

pdf bib abs
Tell Me Why: Explainable Public Health Fact-Checking with Large Language Models
Majid Zarharan | Pascal Wullschleger | Babak Behkam Kia | Mohammad Taher Pilehvar | Jennifer Foster

This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, examining their performance in both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria through human evaluation. Our automatic evaluation indicates that, within the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity to not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals yet more nuance as well as indicating potential problems with the gold explanations.