Delving into Evaluation Metrics for Generation: A Thorough Assessment of How Metrics Generalize to Rephrasing Across Languages

Yixuan Wang, Qingyan Chen, Duygu Ataman


Abstract
Language generation has been an important task in natural language processing (NLP), with an increasing variety of applications, especially in recent years. The evaluation of generative language models typically relies on automatic heuristics that search for word- or phrase-level overlaps between generated outputs and hand-crafted references in the given language, which may range from single sentences to entire documents. Language, on the other hand, is productive by nature: the same concept can potentially be expressed in many different lexical or phrasal forms, which makes the assessment of generated outputs a very difficult task. Many studies have pointed out the potential hazards of relying on heuristics that match generated language against selected references, and the limitations this setting imposes on the development of robust generative models. This paper undertakes an in-depth analysis of evaluation metrics used for generative models, specifically investigating their responsiveness to various syntactic structures and how these characteristics vary across languages with different morphosyntactic typologies. Preliminary findings indicate that while certain metrics exhibit robustness in particular linguistic contexts, a discernible variance emerges in their performance across distinct syntactic forms. Through this exploration, we highlight the need for more nuanced and comprehensive evaluation strategies for generative models, advocating for metrics that are sensitive to the multifaceted nature of languages.
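To make the overlap-based evaluation setting described in the abstract concrete, below is a minimal illustrative sketch (not the paper's method) of a clipped n-gram precision of the kind used in BLEU-style metrics. The example sentences and the helper name ngram_overlap_precision are hypothetical; the point is only to show how a valid paraphrase can receive a low score against a single surface-form reference.

```python
from collections import Counter

def ngram_overlap_precision(hypothesis, reference, n=2):
    """Fraction of hypothesis n-grams that also appear in the reference,
    with clipped counts (as in BLEU's modified n-gram precision)."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    if not hyp_ngrams:
        return 0.0
    # Each hypothesis n-gram is credited at most as many times as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[ng]) for ng, count in hyp_ngrams.items())
    return matched / sum(hyp_ngrams.values())

reference = "the cat sat on the mat"
paraphrase = "a cat was sitting on the mat"   # same meaning, different surface form
print(ngram_overlap_precision(paraphrase, reference, n=2))  # low score despite equivalent meaning
```

Because the paraphrase shares few exact bigrams with the single reference, its score is low even though the meaning is preserved, which is the behavior the paper examines across syntactic variants and languages.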
Anthology ID:
2023.eval4nlp-1.3
Volume:
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems
Month:
November
Year:
2023
Address:
Bali, Indonesia
Editors:
Daniel Deutsch, Rotem Dror, Steffen Eger, Yang Gao, Christoph Leiter, Juri Opitz, Andreas Rücklé
Venues:
Eval4NLP | WS
Publisher:
Association for Computational Linguistics
Pages:
23–31
URL:
https://aclanthology.org/2023.eval4nlp-1.3
DOI:
10.18653/v1/2023.eval4nlp-1.3
Cite (ACL):
Yixuan Wang, Qingyan Chen, and Duygu Ataman. 2023. Delving into Evaluation Metrics for Generation: A Thorough Assessment of How Metrics Generalize to Rephrasing Across Languages. In Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems, pages 23–31, Bali, Indonesia. Association for Computational Linguistics.
Cite (Informal):
Delving into Evaluation Metrics for Generation: A Thorough Assessment of How Metrics Generalize to Rephrasing Across Languages (Wang et al., Eval4NLP-WS 2023)
PDF:
https://aclanthology.org/2023.eval4nlp-1.3.pdf