Alex Terentowicz


2025

Explainable NLG metrics are becoming a popular research topic; however, the faithfulness of the explanations they provide is typically not evaluated. In this work, we propose a testbed for assessing the faithfulness of span-based metrics by performing controlled perturbations of their explanations and observing changes in the final score. We show that several popular LLM evaluators do not consistently produce faithful explanations.
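As a rough illustration of the kind of check described above, the following minimal sketch deletes one explanation span at a time and tests whether the final score moves in the expected direction. It is not the testbed itself: the span format, the `faithfulness_rate` function, the severity weights, and both toy scorers are hypothetical stand-ins, and a real setup would replace the scorer with a call to the metric or LLM evaluator under test.

```python
from typing import Callable, List, Tuple

# Hypothetical span format: (start offset, end offset, severity label).
Span = Tuple[int, int, str]

# Assumed severity weights, chosen only for this illustration.
WEIGHTS = {"minor": 1.0, "major": 5.0}


def faithfulness_rate(
    spans: List[Span], scorer: Callable[[List[Span]], float]
) -> float:
    """Fraction of single-span deletions that raise the final score.

    If the explanation really drives the score, removing an error span
    should make the score go up. Returns 0.0 when there are no spans.
    """
    if not spans:
        return 0.0
    base = scorer(spans)
    consistent = 0
    for i in range(len(spans)):
        # Controlled perturbation: drop exactly one span, keep the rest.
        perturbed = spans[:i] + spans[i + 1:]
        if scorer(perturbed) > base:
            consistent += 1
    return consistent / len(spans)


def faithful_scorer(spans: List[Span]) -> float:
    """Toy scorer whose score is fully determined by the error spans."""
    return -sum(WEIGHTS[severity] for _, _, severity in spans)


def unfaithful_scorer(spans: List[Span]) -> float:
    """Toy scorer that ignores the explanation entirely."""
    return -3.0


example_spans: List[Span] = [(0, 4, "minor"), (10, 18, "major"), (25, 30, "minor")]
rate_faithful = faithfulness_rate(example_spans, faithful_scorer)
rate_unfaithful = faithfulness_rate(example_spans, unfaithful_scorer)
```

Under these assumptions, the scorer that depends on its spans responds to every deletion, while the constant scorer responds to none, which is the pattern an unfaithful explanation would produce.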