MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation

Dominik Macháček, Ondřej Bojar, Raj Dabre


Abstract
There have been several meta-evaluation studies on the correlation between human ratings and offline machine translation (MT) evaluation metrics such as BLEU, chrF2, BERTScore and COMET. These metrics have been used to evaluate simultaneous speech translation (SST), but their correlations with human ratings of SST, which have recently been collected as Continuous Ratings (CR), are unclear. In this paper, we leverage the evaluations of candidate systems submitted to the English-German SST task at IWSLT 2022 and conduct an extensive correlation analysis of CR and the aforementioned metrics. Our study reveals that the offline metrics are well correlated with CR and can be reliably used for evaluating machine translation in simultaneous mode, with some limitations on the test set size. We conclude that, given the current quality levels of SST, these metrics can be used as proxies for CR, alleviating the need for large-scale human evaluation. Additionally, we observe that the correlations of the metrics are significantly higher when a translation is used as the reference than when simultaneous interpreting is used, and thus we recommend the former for reliable evaluation.
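As a rough illustration of the kind of correlation analysis the abstract describes, the sketch below computes a Pearson correlation coefficient between per-document metric scores and human ratings. The abstract does not say which correlation statistic or granularity the paper uses, so Pearson's r is an assumption here, and the function name `pearson` and all numbers are hypothetical toy values, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one automatic metric score and one averaged
# human rating per document. These values are invented for illustration.
metric_scores = [18.2, 21.5, 24.1, 26.3, 29.0]
continuous_ratings = [2.1, 2.4, 2.9, 3.0, 3.6]

r = pearson(metric_scores, continuous_ratings)
print(f"Pearson r = {r:.3f}")
```

With real data, the two lists would hold the metric scores and the human ratings for the same set of documents or systems, and a small sample such as this one would give an unreliable estimate, which is in line with the abstract's caveat about test set size.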
Anthology ID:
2023.iwslt-1.12
Volume:
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada (in-person and online)
Editors:
Elizabeth Salesky, Marcello Federico, Marine Carpuat
Venue:
IWSLT
SIG:
SIGSLT
Publisher:
Association for Computational Linguistics
Pages:
169–179
URL:
https://aclanthology.org/2023.iwslt-1.12
DOI:
10.18653/v1/2023.iwslt-1.12
Cite (ACL):
Dominik Macháček, Ondřej Bojar, and Raj Dabre. 2023. MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 169–179, Toronto, Canada (in-person and online). Association for Computational Linguistics.
Cite (Informal):
MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation (Macháček et al., IWSLT 2023)
PDF:
https://aclanthology.org/2023.iwslt-1.12.pdf