Pre-trained language models evaluating themselves - A comparative study

Philipp Koch, Matthias Aßenmacher, Christian Heumann


Abstract
Evaluating generated text has received renewed attention with the introduction of model-based metrics in recent years. These new metrics show a higher correlation with human judgments and seemingly overcome many issues of previous n-gram-based metrics from the symbolic age. In this work, we examine the recently introduced metrics BERTScore, BLEURT, NUBIA, MoverScore, and Mark-Evaluate (Petersen). We investigate their sensitivity to different types of semantic deterioration (part-of-speech drop and negation), word order perturbations, word drop, and the common problem of repetition. No metric showed appropriate behaviour for negation, and none of them was consistently sensitive to the other issues mentioned above.
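The perturbation types the abstract mentions (word drop, word order perturbation, and negation) can be sketched with a few stdlib-only helpers. This is a minimal illustrative sketch, not the authors' implementation: the function names and the naive auxiliary-based negation rule are assumptions for demonstration only.

```python
import random

def drop_words(sentence: str, k: int = 1, seed: int = 0) -> str:
    """Word drop: remove k randomly chosen tokens."""
    rng = random.Random(seed)
    tokens = sentence.split()
    for _ in range(min(k, len(tokens) - 1)):
        tokens.pop(rng.randrange(len(tokens)))
    return " ".join(tokens)

def shuffle_words(sentence: str, seed: int = 0) -> str:
    """Word order perturbation: shuffle all tokens."""
    rng = random.Random(seed)
    tokens = sentence.split()
    rng.shuffle(tokens)
    return " ".join(tokens)

def negate(sentence: str) -> str:
    """Naive negation: insert 'not' after the first auxiliary verb.
    Illustrative rule only; real negation handling is more involved."""
    aux = {"is", "are", "was", "were", "can", "will", "does", "do"}
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in aux:
            return " ".join(tokens[: i + 1] + ["not"] + tokens[i + 1:])
    return sentence  # no auxiliary found; leave unchanged

ref = "The model is robust to small perturbations"
print(negate(ref))          # "The model is not robust to small perturbations"
print(shuffle_words(ref))
print(drop_words(ref, k=2))
```

A perturbed candidate like these would then be scored against the original reference with each metric (e.g., BERTScore or BLEURT); a sensitive metric should assign the negated sentence a clearly lower score.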
Anthology ID:
2022.insights-1.25
Volume:
Proceedings of the Third Workshop on Insights from Negative Results in NLP
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venues:
ACL | insights
Publisher:
Association for Computational Linguistics
Pages:
180–187
URL:
https://aclanthology.org/2022.insights-1.25
DOI:
10.18653/v1/2022.insights-1.25
Cite (ACL):
Philipp Koch, Matthias Aßenmacher, and Christian Heumann. 2022. Pre-trained language models evaluating themselves - A comparative study. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, pages 180–187, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Pre-trained language models evaluating themselves - A comparative study (Koch et al., insights 2022)
PDF:
https://aclanthology.org/2022.insights-1.25.pdf
Code
 lazerlambda/metricscomparison