Jonathan Weill
2024
Improving LLM Attributions with Randomized Path-Integration
Oren Barkan | Yehonatan Elisha | Yonatan Toib | Jonathan Weill | Noam Koenigstein
Findings of the Association for Computational Linguistics: EMNLP 2024
We present Randomized Path-Integration (RPI), a path-integration method for explaining language models by randomizing the integration path over the attention information in the model. RPI integrates internal attention scores and their gradients along a randomized path, which is dynamically established between a baseline representation and the model's attention scores. The randomness of the integration path stems from modeling the baseline representation as a tensor drawn at random from a Gaussian diffusion process. As a consequence, RPI generates diverse baselines, yielding a set of candidate attribution maps from which the most effective map can be selected for the specific metric at hand. We present an extensive evaluation encompassing 11 explanation methods and 5 language models, including the Llama2 and Mistral models. Our results demonstrate that RPI outperforms the latest state-of-the-art methods across 4 datasets and 5 evaluation metrics.
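The integration scheme described above can be illustrated with a short sketch. The snippet below is a minimal sketch assuming a PyTorch setting in which the model's score is a differentiable function of its attention scores: it interpolates between a randomly perturbed baseline and the actual attention scores, accumulates gradients along the path, and returns one candidate attribution map per random baseline. The names rpi_attribution, score_fn, noise_level, and num_baselines are illustrative and not taken from the paper's code.

```python
import torch

def rpi_attribution(score_fn, attn, num_steps=20, noise_level=0.5, num_baselines=4):
    """Integrate attention gradients along a path from a random baseline to `attn`.

    score_fn : callable mapping an attention tensor to a scalar model score.
    attn     : the model's attention scores (any shape).
    Returns a list of candidate attribution maps, one per random baseline.
    """
    candidates = []
    for _ in range(num_baselines):
        # Baseline drawn as a noisy (diffusion-like) perturbation of the attention scores.
        baseline = (1.0 - noise_level) * attn + noise_level * torch.randn_like(attn)
        total_grad = torch.zeros_like(attn)
        for step in range(1, num_steps + 1):
            alpha = step / num_steps
            point = (baseline + alpha * (attn - baseline)).detach().requires_grad_(True)
            grad, = torch.autograd.grad(score_fn(point), point)
            total_grad += grad
        # Riemann sum: average gradient along the path times the path direction.
        candidates.append((attn - baseline) * total_grad / num_steps)
    return candidates

# Toy usage with a stand-in "model" whose score depends on the attention tensor.
if __name__ == "__main__":
    attn = torch.rand(1, 4, 8, 8)                 # (batch, heads, tokens, tokens)
    weight = torch.randn_like(attn)
    score_fn = lambda a: (a * weight).sum()       # placeholder for a real prediction head
    maps = rpi_attribution(score_fn, attn)
    best = max(maps, key=lambda m: m.abs().sum().item())  # pick a map by the metric at hand
    print(best.shape)
```

The candidate maps returned by the sketch can then be ranked by whichever evaluation metric is of interest, mirroring the selection step described in the abstract.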
LLM Explainability via Attributive Masking Learning
Oren Barkan | Yonatan Toib | Yehonatan Elisha | Jonathan Weill | Noam Koenigstein
Findings of the Association for Computational Linguistics: EMNLP 2024
In this paper, we introduce Attributive Masking Learning (AML), a method designed for explaining language model predictions by learning input masks. AML trains an attribution model to identify influential tokens in the input for a given language model’s prediction. The central concept of AML is to train an auxiliary attribution model to simultaneously 1) mask as much input data as possible while ensuring that the language model’s prediction closely aligns with its prediction on the original input, and 2) ensure a significant change in the model’s prediction when applying the inverse (complement) of the same mask to the input. This dual-masking approach further enables the optimization of the explanation w.r.t. the metric of interest. We demonstrate the effectiveness of AML on both encoder-based and decoder-based language models, showcasing its superiority over a variety of state-of-the-art explanation methods on multiple benchmarks.
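As a rough illustration of the dual-masking idea, the sketch below assumes a PyTorch classification setting and trains an attribution model against a loss with three terms: keep the prediction when only the retained (influential) tokens remain, change it when the complement of the mask is applied, and keep the number of retained tokens small. The names aml_loss, lm_logits_fn, mask_net, and the loss weights are illustrative placeholders rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def aml_loss(lm_logits_fn, mask_net, embeds, sparsity_weight=1.0, flip_weight=1.0):
    """Dual-masking loss: keep the prediction under the mask, break it under the complement.

    lm_logits_fn : callable mapping token embeddings (batch, seq, dim) to class logits.
    mask_net     : attribution model producing per-token mask logits (batch, seq).
    embeds       : token embeddings of the input.
    """
    with torch.no_grad():
        original = F.softmax(lm_logits_fn(embeds), dim=-1)         # reference prediction

    # `keep` is the fraction of each token that survives; minimizing its mean
    # masks as much of the input as possible.
    keep = torch.sigmoid(mask_net(embeds)).unsqueeze(-1)           # (batch, seq, 1)
    kept_logits = lm_logits_fn(embeds * keep)                      # influential tokens retained
    dropped_logits = lm_logits_fn(embeds * (1.0 - keep))           # complement of the mask

    keep_term = F.kl_div(F.log_softmax(kept_logits, dim=-1), original, reduction="batchmean")
    flip_term = -F.kl_div(F.log_softmax(dropped_logits, dim=-1), original, reduction="batchmean")
    sparsity = keep.mean()

    return keep_term + flip_weight * flip_term + sparsity_weight * sparsity

# Toy usage with stand-in components.
if __name__ == "__main__":
    dim, num_classes = 16, 3
    classifier = nn.Linear(dim, num_classes)
    mask_proj = nn.Linear(dim, 1)
    lm_logits_fn = lambda e: classifier(e.mean(dim=1))             # pool tokens, then classify
    mask_net = lambda e: mask_proj(e).squeeze(-1)                  # per-token mask logits
    embeds = torch.randn(2, 10, dim)
    print(aml_loss(lm_logits_fn, mask_net, embeds))
```

Minimizing keep_term preserves the original prediction under the learned mask, while the negated flip_term pushes the prediction to change under the complement, which is the dual-masking behavior the abstract describes.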
InterrogateLLM: Zero-Resource Hallucination Detection in LLM-Generated Answers
Yakir Yehuda | Itzik Malkiel | Oren Barkan | Jonathan Weill | Royi Ronen | Noam Koenigstein
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the many advances of Large Language Models (LLMs) and their unprecedented rapid evolution, their impact and integration into every facet of our daily lives remain limited for various reasons. One critical factor hindering their widespread adoption is the occurrence of hallucinations, where LLMs invent answers that sound realistic yet drift away from factual truth. In this paper, we present a novel method for detecting hallucinations in large language models, tackling a critical issue in the adoption of these models in real-world scenarios. Through extensive evaluations across multiple datasets and LLMs, including Llama-2, we study the hallucination levels of various recent LLMs and demonstrate the effectiveness of our method in automatically detecting them. Notably, we observe up to 87% hallucinations for Llama-2 in a specific experiment, where our method achieves a Balanced Accuracy of 81%, all without relying on external knowledge.
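The abstract does not detail the detection mechanism itself, so the sketch below shows only one plausible zero-resource strategy consistent with the title: reconstruct the question from the generated answer several times and flag answers whose reconstructions disagree with the original question. The functions generate_fn and embed_fn, the prompt, and the threshold are all illustrative assumptions, not the paper's documented protocol.

```python
from typing import Callable, List
import numpy as np

def detect_hallucination(question: str, answer: str,
                         generate_fn: Callable[[str], str],
                         embed_fn: Callable[[str], np.ndarray],
                         num_samples: int = 5, threshold: float = 0.7) -> bool:
    """Return True if the answer is suspected to be a hallucination."""
    prompt = (
        "Given the following answer, write the question it answers.\n"
        f"Answer: {answer}\nQuestion:"
    )
    reconstructions: List[str] = [generate_fn(prompt) for _ in range(num_samples)]

    q_vec = embed_fn(question)
    sims = []
    for rec in reconstructions:
        r_vec = embed_fn(rec)
        sims.append(float(np.dot(q_vec, r_vec) /
                          (np.linalg.norm(q_vec) * np.linalg.norm(r_vec) + 1e-8)))
    # If the reconstructed questions do not resemble the original one, the answer
    # has likely drifted from it and is flagged as a potential hallucination.
    return float(np.mean(sims)) < threshold
```

Because the check uses only the model's own generations and a sentence embedder, it requires no external knowledge base, matching the zero-resource setting named in the title.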