2024
pdf
bib
abs
Generating and Evaluating Plausible Explanations for Knowledge Graph Completion
Antonio Di Mauro
|
Zhao Xu
|
Wiem Ben Rim
|
Timo Sztyler
|
Carolin Lawrence
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Explanations for AI should aid human users, yet this ultimate goal remains under-explored. This paper aims to bridge this gap by investigating the specific explanatory needs of human users in the context of Knowledge Graph Completion (KGC) systems. In contrast to the prevailing approaches that primarily focus on mathematical theories, we recognize the potential limitations of explanations that may end up being overly complex or nonsensical for users. Through in-depth user interviews, we gain valuable insights into the types of KGC explanations users seek. Building upon these insights, we introduce GradPath, a novel path-based explanation method designed to meet human-centric explainability constraints and enhance plausibility. Additionally, GradPath harnesses the gradients of the trained KGC model to maintain a certain level of faithfulness. We verify the effectiveness of GradPath through well-designed human-centric evaluations. The results confirm that our method provides explanations that users consider more plausible than previous ones.
pdf
bib
abs
A Human-Centric Evaluation Platform for Explainable Knowledge Graph Completion
Zhao Xu
|
Wiem Ben Rim
|
Kiril Gashteovski
|
Timo Sztyler
|
Carolin Lawrence
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Explanations for AI are expected to help human users understand AI-driven predictions. Evaluating plausibility, the helpfulness of the explanations, is therefore essential for developing eXplainable AI (XAI) that can really aid human users. Here we propose a human-centric evaluation platform to measure plausibility of explanations in the context of eXplainable Knowledge Graph Completion (XKGC). The target audience of the platform are researchers and practitioners who want to 1) investigate real needs and interests of their target users in XKGC, 2) evaluate the plausibility of the XKGC methods. We showcase these two use cases in an experimental setting to illustrate what results can be achieved with our system.
pdf
bib
Proceedings of the Eighth Widening NLP Workshop
Atnafu Lambebo Tonja
|
Alfredo Gomez
|
Chanjun Park
|
Hellina Hailu Nigatu
|
Santosh T.Y.S.S
|
Tanvi Anand
|
Wiem Ben Rim
Proceedings of the Eighth Widening NLP Workshop
2023
pdf
bib
abs
Walking a Tightrope – Evaluating Large Language Models in High-Risk Domains
Chia-Chien Hung
|
Wiem Ben Rim
|
Lindsay Frost
|
Lars Bruckner
|
Carolin Lawrence
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP
High-risk domains pose unique challenges that require language models to provide accurate and safe responses. Despite the great success of large language models (LLMs), such as ChatGPT and its variants, their performance in high-risk domains remains unclear. Our study delves into an in-depth analysis of the performance of instruction-tuned LLMs, focusing on factual accuracy and safety adherence. To comprehensively assess the capabilities of LLMs, we conduct experiments on six NLP datasets including question answering and summarization tasks within two high-risk domains: legal and medical. Further qualitative analysis highlights the existing limitations inherent in current LLMs when evaluating in high-risk domains. This underscores the essential nature of not only improving LLM capabilities but also prioritizing the refinement of domain-specific metrics, and embracing a more human-centric approach to enhance safety and factual reliability. Our findings advance the field toward the concerns of properly evaluating LLMs in high-risk domains, aiming to steer the adaptability of LLMs in fulfilling societal obligations and aligning with forthcoming regulations, such as the EU AI Act.
2022
pdf
bib
abs
KGxBoard: Explainable and Interactive Leaderboard for Evaluation of Knowledge Graph Completion Models
Haris Widjaja
|
Kiril Gashteovski
|
Wiem Ben Rim
|
Pengfei Liu
|
Christopher Malon
|
Daniel Ruffinelli
|
Carolin Lawrence
|
Graham Neubig
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Knowledge Graphs (KGs) store information in the form of (head, predicate, tail)-triples. To augment KGs with new knowledge, researchers proposed models for KG Completion (KGC) tasks such as link prediction; i.e., answering (h; p; ?) or (?; p; t) queries. Such models are usually evaluated with averaged metrics on a held-out test set. While useful for tracking progress, averaged single-score metrics cannotreveal what exactly a model has learned — or failed to learn. To address this issue, we propose KGxBoard: an interactive framework for performing fine-grained evaluation on meaningful subsets of the data, each of which tests individual and interpretable capabilities of a KGC model. In our experiments, we highlight the findings that we discovered with the use of KGxBoard, which would have been impossible to detect with standard averaged single-score metrics.
2020
pdf
bib
abs
SWAGex at SemEval-2020 Task 4: Commonsense Explanation as Next Event Prediction
Wiem Ben Rim
|
Naoaki Okazaki
Proceedings of the Fourteenth Workshop on Semantic Evaluation
We describe the system submitted by the SWAGex team to the SemEval-2020 Commonsense Validation and Explanation Task. We use multiple methods on the pre-trained language model BERT (Devlin et al., 2018) for tasks that require the system to recognize sentences against commonsense and justify the reasoning behind this decision. Our best performing model is BERT trained on SWAG and fine-tuned for the task. We investigate the ability to transfer commonsense knowledge from SWAG to SemEval-2020 by training a model for the Explanation task with Next Event Prediction data