CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems

Tao Feng, Lizhen Qu, Xiaoxi Kang, Gholamreza Haffari


Abstract
Automatically evaluating the quality of responses in dialogue systems is a challenging yet crucial task. Current metrics often fail to align with human judgements, especially when assessing responses that are grammatically correct. To address this issue, we propose a novel metric, CausalScore, which assesses the relevance of responses by measuring the causal strength between dialogue histories and responses. The causal strength is estimated from both the unconditional dependence and the conditional dependencies from dialogue histories to responses. We compare our metric with existing competitive metrics in terms of their alignment with human judgements. Our experimental results demonstrate that CausalScore aligns significantly better with human judgements than existing state-of-the-art metrics. Additionally, we collect CGDIALOG+, a dialogue dataset with human-annotated causal relations, and a set of pairwise human judgements to facilitate the development of automatic metrics.
Anthology ID: 2025.coling-main.161
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 2351–2369
URL: https://aclanthology.org/2025.coling-main.161/
Cite (ACL): Tao Feng, Lizhen Qu, Xiaoxi Kang, and Gholamreza Haffari. 2025. CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2351–2369, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems (Feng et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.161.pdf