Digging Errors in NMT: Evaluating and Understanding Model Errors from Partial Hypothesis Space

Jianhao Yan; Chenming Wu; Fandong Meng; Jie Zhou (周洁)

doi:10.18653/v1/2022.emnlp-main.827

Abstract

Solid evaluation of neural machine translation (NMT) is key to understanding and improving it. Current evaluation of an NMT system usually combines a heuristic decoding algorithm (e.g., beam search) with an evaluation metric that assesses the similarity between the translation and the gold reference. However, this system-level evaluation framework is limited: it evaluates only the single best hypothesis, and it is affected by the search errors that heuristic decoding algorithms introduce. To better understand NMT models, we propose a novel evaluation protocol that defines model errors in terms of the model's ranking capability over the hypothesis space. To tackle the problem of the exponentially large space, we propose two approximation methods: top region evaluation, together with an exact top-k decoding algorithm that finds the top-ranked hypotheses in the whole hypothesis space, and Monte Carlo sampling evaluation, which simulates the hypothesis space from a broader perspective. To quantify errors, we define NMT model errors by measuring the distance between the hypothesis array ranked by the model and the ideally ranked hypothesis array. After confirming a strong correlation with human judgment, we apply our evaluation to various NMT benchmarks and model architectures. We show that state-of-the-art Transformer models face serious ranking issues and perform only at the level of random chance in the top region. We further analyze model errors on architectures with different depths and widths, as well as with different data-augmentation techniques, showing how these factors affect model errors. Finally, we connect model errors with search algorithms and report findings on the inductive bias of beam search and on the correlation with Minimum Bayes Risk (MBR) decoding.
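The idea of measuring the distance between a model-ranked hypothesis array and an ideally ranked one can be illustrated with a small sketch. The abstract does not specify which distance the paper uses, so the pairwise-disagreement rate below (a Kendall-tau-style measure) is an assumption chosen for illustration. The function name `ranking_error` and the toy scores are hypothetical.

```python
from itertools import combinations


def ranking_error(model_scores, quality_scores):
    """Fraction of hypothesis pairs ordered differently by the model and by quality.

    ASSUMPTION: a pairwise-disagreement rate stands in for the paper's
    distance measure, which the abstract does not define.
    0.0 = the model ranking matches the ideal ranking on every pair,
    1.0 = the model ranking is fully reversed.
    Pairs tied on either score are skipped.
    """
    discordant = 0
    comparable = 0
    for i, j in combinations(range(len(model_scores)), 2):
        model_diff = model_scores[i] - model_scores[j]
        quality_diff = quality_scores[i] - quality_scores[j]
        if model_diff == 0 or quality_diff == 0:
            continue  # ties carry no ordering information
        comparable += 1
        if (model_diff > 0) != (quality_diff > 0):
            discordant += 1
    return discordant / comparable if comparable else 0.0


# Hypothetical toy data: model log-probabilities and a quality score
# (for example, a sentence-level metric against the reference).
model_log_probs = [-1.0, -2.0, -3.0, -4.0]
quality = [0.9, 0.2, 0.5, 0.1]
error = ranking_error(model_log_probs, quality)
```

In this toy example only one of the six hypothesis pairs is ordered differently by the model and by the quality score, so `error` is 1/6. Under this measure, a value near 0.5 would correspond to ranking at the level of random chance.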

Anthology ID: 2022.emnlp-main.827
Volume: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates
Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 12067–12085
URL: https://aclanthology.org/2022.emnlp-main.827/
DOI: 10.18653/v1/2022.emnlp-main.827
Cite (ACL): Jianhao Yan, Chenming Wu, Fandong Meng, and Jie Zhou. 2022. Digging Errors in NMT: Evaluating and Understanding Model Errors from Partial Hypothesis Space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 12067–12085, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): Digging Errors in NMT: Evaluating and Understanding Model Errors from Partial Hypothesis Space (Yan et al., EMNLP 2022)
PDF: https://aclanthology.org/2022.emnlp-main.827.pdf