Evaluating N-best Calibration of Natural Language Understanding for Dialogue Systems

Ranim Khojah, Alexander Berman, Staffan Larsson


Abstract
A Natural Language Understanding (NLU) component can be used in a dialogue system to perform intent classification, returning an N-best list of hypotheses with corresponding confidence estimates. We perform an in-depth evaluation of five NLUs, focusing on confidence estimation. We measure and visualize calibration for the 10 best hypotheses at the model level and the rank level, and also measure classification performance. The results indicate a trade-off between calibration and performance. In particular, Rasa (with the Sklearn classifier) had the best calibration but the lowest performance scores, while Watson Assistant had the best performance but poor calibration.
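To illustrate what measuring calibration at the rank level and the model level can look like, here is a minimal sketch. It is not the authors' implementation. It assumes expected calibration error (ECE) with equal-width bins as the metric, which the abstract does not specify. The function names and the N-best layout (a list of (intent, confidence) pairs per utterance, best hypothesis first) are hypothetical, and pooling all ranks for the model-level figure is likewise an assumption.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: one N-best list per utterance, best hypothesis first.
NBest = List[Tuple[str, float]]


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


def rank_level_ece(
    nbest_lists: Sequence[NBest], gold_intents: Sequence[str], rank: int, n_bins: int = 10
) -> float:
    """ECE restricted to the hypothesis at one rank (0 = top hypothesis)."""
    confidences, correct = [], []
    for nbest, gold in zip(nbest_lists, gold_intents):
        if rank < len(nbest):  # skip utterances with shorter N-best lists
            intent, conf = nbest[rank]
            confidences.append(conf)
            correct.append(intent == gold)
    return expected_calibration_error(confidences, correct, n_bins)


def model_level_ece(
    nbest_lists: Sequence[NBest], gold_intents: Sequence[str], n_bins: int = 10
) -> float:
    """ECE over all hypotheses pooled across ranks (an assumed definition)."""
    confidences, correct = [], []
    for nbest, gold in zip(nbest_lists, gold_intents):
        for intent, conf in nbest:
            confidences.append(conf)
            correct.append(intent == gold)
    return expected_calibration_error(confidences, correct, n_bins)
```

Under these assumptions, a lower value means the confidence estimates track the observed accuracy more closely. Calling `rank_level_ece` for ranks 0 through 9 would give one figure per position in a 10-best list.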
Anthology ID:
2022.sigdial-1.54
Volume:
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month:
September
Year:
2022
Address:
Edinburgh, UK
Editors:
Oliver Lemon, Dilek Hakkani-Tur, Junyi Jessy Li, Arash Ashrafzadeh, Daniel Hernández Garcia, Malihe Alikhani, David Vandyke, Ondřej Dušek
Venue:
SIGDIAL
SIG:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
582–594
URL:
https://aclanthology.org/2022.sigdial-1.54
DOI:
10.18653/v1/2022.sigdial-1.54
Cite (ACL):
Ranim Khojah, Alexander Berman, and Staffan Larsson. 2022. Evaluating N-best Calibration of Natural Language Understanding for Dialogue Systems. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 582–594, Edinburgh, UK. Association for Computational Linguistics.
Cite (Informal):
Evaluating N-best Calibration of Natural Language Understanding for Dialogue Systems (Khojah et al., SIGDIAL 2022)
PDF:
https://aclanthology.org/2022.sigdial-1.54.pdf
Video:
https://youtu.be/VW97fUNgUw8
Code:
ranimkhojah/confidence-estimation-benchmark