A Study on Manual and Automatic Evaluation for Text Style Transfer: The Case of Detoxification

Varvara Logacheva, Daryna Dementieva, Irina Krotova, Alena Fenogenova, Irina Nikishina, Tatiana Shavrina, Alexander Panchenko


Abstract
It is often difficult to reliably evaluate models that generate text. Among such tasks, text style transfer is particularly difficult to evaluate because its success depends on a number of parameters. We evaluate a large number of models on a detoxification task. We explore the relations between manual and automatic metrics and find that there is only a weak correlation between them, which depends on the type of model that generated the text. Automatic metrics tend to be less reliable for better-performing models. However, our findings suggest that the ChrF and BertScore metrics can be used as a proxy for human evaluation of text detoxification to some extent.
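As an illustration of what comparing manual and automatic metrics can look like in practice, the sketch below computes a rank correlation between human judgments and an automatic metric. It is a minimal sketch under stated assumptions: the abstract does not say which correlation coefficient the authors use, so Spearman's rho is an assumption here, and the function names and score lists are hypothetical values invented for the example, not data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-system scores (assumption: invented for illustration only).
human_scores = [0.9, 0.4, 0.7, 0.2, 0.6]        # e.g. averaged manual ratings
metric_scores = [0.80, 0.50, 0.55, 0.30, 0.75]  # e.g. an automatic metric

rho = spearman(human_scores, metric_scores)
print(round(rho, 2))
```

With real data, the same computation could be repeated separately for each group of systems to see whether the correlation changes with the type of model, which is the kind of dependence the abstract describes.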
Anthology ID:
2022.humeval-1.8
Volume:
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Anya Belz, Maja Popović, Ehud Reiter, Anastasia Shimorina
Venue:
HumEval
Publisher:
Association for Computational Linguistics
Pages:
90–101
URL:
https://aclanthology.org/2022.humeval-1.8
DOI:
10.18653/v1/2022.humeval-1.8
Cite (ACL):
Varvara Logacheva, Daryna Dementieva, Irina Krotova, Alena Fenogenova, Irina Nikishina, Tatiana Shavrina, and Alexander Panchenko. 2022. A Study on Manual and Automatic Evaluation for Text Style Transfer: The Case of Detoxification. In Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval), pages 90–101, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
A Study on Manual and Automatic Evaluation for Text Style Transfer: The Case of Detoxification (Logacheva et al., HumEval 2022)
PDF:
https://aclanthology.org/2022.humeval-1.8.pdf
Video:
https://aclanthology.org/2022.humeval-1.8.mp4
Data
CoLA