Making Heads and Tails of Models with Marginal Calibration for Sparse Tagsets

Michael Kranzlein, Nelson F. Liu, Nathan Schneider


Abstract
For interpreting the behavior of a probabilistic model, it is useful to measure a model’s calibration—the extent to which it produces reliable confidence scores. We address the open problem of calibration for tagging models with sparse tagsets, and recommend strategies to measure and reduce calibration error (CE) in such models. We show that several post-hoc recalibration techniques all reduce calibration error across the marginal distribution for two existing sequence taggers. Moreover, we propose tag frequency grouping (TFG) as a way to measure calibration error in different frequency bands. Further, recalibrating each group separately promotes a more equitable reduction of calibration error across the tag frequency spectrum.
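The two measurements the abstract describes — calibration error over confidence bins, and error measured separately per tag-frequency band — can be illustrated with a minimal sketch. All function names, the equal-size frequency partition, and the toy data below are illustrative assumptions, not the paper's actual implementation or its TFG grouping criterion:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Bin predictions by confidence; average |accuracy - mean confidence|
    per bin, weighted by the fraction of predictions in that bin."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def frequency_groups(tag_counts, n_groups=2):
    """Rank tags by training frequency and split head-to-tail into
    roughly equal-size groups (one simple grouping scheme)."""
    ranked = sorted(tag_counts, key=tag_counts.get, reverse=True)
    return [list(g) for g in np.array_split(ranked, n_groups)]

# Toy predictions: (predicted tag, confidence, 1 if correct else 0).
preds = [("NOUN", 0.95, 1), ("NOUN", 0.90, 1), ("VERB", 0.80, 0),
         ("X", 0.70, 0), ("X", 0.60, 1), ("SYM", 0.55, 0)]
counts = {"NOUN": 1000, "VERB": 800, "X": 12, "SYM": 3}

# Report calibration error separately for head and tail tags.
for group in frequency_groups(counts):
    rows = [(c, ok) for tag, c, ok in preds if tag in group]
    if rows:
        confs, oks = zip(*rows)
        print(group, round(expected_calibration_error(confs, oks), 3))
```

Measuring per group in this way exposes when a model is well calibrated on frequent (head) tags but miscalibrated on rare (tail) tags — a gap a single aggregate score would hide; recalibrating each group separately then targets each band's own error.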
Anthology ID:
2021.findings-emnlp.423
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
4919–4928
URL:
https://aclanthology.org/2021.findings-emnlp.423
DOI:
10.18653/v1/2021.findings-emnlp.423
Cite (ACL):
Michael Kranzlein, Nelson F. Liu, and Nathan Schneider. 2021. Making Heads and Tails of Models with Marginal Calibration for Sparse Tagsets. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4919–4928, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Making Heads and Tails of Models with Marginal Calibration for Sparse Tagsets (Kranzlein et al., Findings 2021)
PDF:
https://aclanthology.org/2021.findings-emnlp.423.pdf
Video:
https://aclanthology.org/2021.findings-emnlp.423.mp4
Code:
nert-nlp/calibration_tfg
Data:
MNIST