Towards Multiple References Era – Addressing Data Leakage and Limited Reference Diversity in Machine Translation Evaluation

Xianfeng Zeng, Yijin Liu, Fandong Meng, Jie Zhou


Abstract
Recent research has shown a weak correlation between n-gram-based metrics and human evaluations in machine translation tasks, particularly when evaluating large language models (LLMs). Additionally, the risk of data leakage in LLMs may cause an overestimation problem when evaluating LLMs on downstream tasks. In this work, we identify the limited diversity of references as the primary cause of both the inferior performance of n-gram-based metrics and the overestimation problem. To address this issue, we propose to utilize multiple references generated by LLMs, coupled with an effective selection strategy focused on accuracy and diversity, to improve the alignment between automatic metrics and human evaluations. We validate our approach on the WMT22 Metrics benchmark with 4 languages and observe a maximum accuracy gain of 9.5% in F200spBLEU, which makes it on par with computationally expensive neural-based metrics. We also show that using multiple references with n-gram-based metrics significantly alleviates the overestimation problem when evaluating LLMs with data leakage. Further analysis explores the factors that affect the quality of generated references, offering insights into data synthesis by LLMs.
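To illustrate why additional references can help an n-gram-based metric, the following is a minimal sketch of multi-reference clipped n-gram precision, the building block of standard BLEU. It is not the authors' implementation and does not reproduce F200spBLEU or the paper's reference selection strategy; the function names `ngrams` and `clipped_precision`, the whitespace tokenization, and the example sentences are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n=1):
    """Modified n-gram precision of one hypothesis against several references.

    Each hypothesis n-gram count is clipped by the maximum count of that
    n-gram in any single reference, as in standard multi-reference BLEU.
    Tokenization is plain whitespace splitting (an illustrative assumption).
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    # For every n-gram, keep the largest count seen in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical example sentences, not taken from the paper.
hypothesis = "the cat sat on the mat"
single = ["a cat is sitting on the mat"]
multi = single + ["the cat sat on a mat"]

print(clipped_precision(hypothesis, single))  # one reference
print(clipped_precision(hypothesis, multi))   # two references
```

In this toy case the second reference covers the word "sat", which the first reference phrases differently, so a valid translation is penalized less when more references are available.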
Anthology ID:
2024.findings-acl.710
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11939–11951
URL:
https://aclanthology.org/2024.findings-acl.710
Cite (ACL):
Xianfeng Zeng, Yijin Liu, Fandong Meng, and Jie Zhou. 2024. Towards Multiple References Era – Addressing Data Leakage and Limited Reference Diversity in Machine Translation Evaluation. In Findings of the Association for Computational Linguistics ACL 2024, pages 11939–11951, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Towards Multiple References Era – Addressing Data Leakage and Limited Reference Diversity in Machine Translation Evaluation (Zeng et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.710.pdf