Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics

Christoph Wolfgang Leiter


Abstract
Many modern machine translation evaluation metrics like BERTScore, BLEURT, COMET, MonoTransquest or XMoverScore are based on black-box language models. Hence, it is difficult to explain why these metrics return certain scores. This year’s Eval4NLP shared task tackles this challenge by searching for methods that can extract feature importance scores that correlate well with human word-level error annotations. In this paper, we show that unsupervised metrics that are based on token matching can intrinsically provide such scores. The submitted system interprets the similarities of the contextualized word embeddings that are used to compute (X)BERTScore as word-level importance scores.
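The token-matching idea in the abstract can be illustrated with a minimal sketch. It assumes greedy cosine matching in the style of BERTScore and takes the contextualized embeddings as plain lists of floats; the function names `cosine` and `token_matching_scores` are hypothetical and are not taken from the paper's code. Reading a low per-token similarity as a sign of a likely translation error is likewise an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def token_matching_scores(src_emb, hyp_emb):
    """Greedy token matching between source and hypothesis embeddings.

    Returns (sentence_f1, hyp_token_scores, src_token_scores).
    Each token score is the similarity to its best match on the other side.
    """
    # Pairwise similarities: one row per hypothesis token, one column per source token.
    sim = [[cosine(h, s) for s in src_emb] for h in hyp_emb]

    # Word level: every token keeps the similarity of its best-matching partner.
    hyp_token_scores = [max(row) for row in sim]
    src_token_scores = [
        max(sim[i][j] for i in range(len(hyp_emb))) for j in range(len(src_emb))
    ]

    # Sentence level: average the word-level scores and combine them into an F1.
    precision = sum(hyp_token_scores) / len(hyp_token_scores)
    recall = sum(src_token_scores) / len(src_token_scores)
    total = precision + recall
    f1 = 2 * precision * recall / total if total != 0 else 0.0
    return f1, hyp_token_scores, src_token_scores
```

In this sketch the same similarity matrix yields both outputs: its row and column maxima are the word-level scores, and their averages give the sentence-level score, so no separate explanation step is needed.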
Anthology ID:
2021.eval4nlp-1.16
Volume:
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Yang Gao, Steffen Eger, Wei Zhao, Piyawat Lertvittayakumjorn, Marina Fomicheva
Venue:
Eval4NLP
Publisher:
Association for Computational Linguistics
Pages:
157–164
URL:
https://aclanthology.org/2021.eval4nlp-1.16
DOI:
10.18653/v1/2021.eval4nlp-1.16
Cite (ACL):
Christoph Wolfgang Leiter. 2021. Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics. In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, pages 157–164, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics (Leiter, Eval4NLP 2021)
PDF:
https://aclanthology.org/2021.eval4nlp-1.16.pdf
Video:
https://aclanthology.org/2021.eval4nlp-1.16.mp4
Code:
gringham/wordandsentscoresfromtokenmatching
Data:
MLQE-PE