Auditing the Evaluators: How Far Can Automatic Evaluation Go in Assessing Portuguese Financial Texts?

Marina Ramalhete Masid, Gabriel Assis, Daniela Vianna, Aline Paes, Altigran Soares da Silva


Abstract
Automatic metrics are widely used to evaluate text quality across natural language processing tasks. Despite their convenience and scalability, the extent to which these metrics reliably reflect textual quality remains an open challenge. The LLM-as-a-judge paradigm has recently emerged, using LLMs themselves as evaluators to align more closely with human judgments. However, such evaluations remain underexplored for specific domains and languages, as most prior work focuses on generic English-language benchmarks. In this paper, we examine the robustness of both traditional automatic metrics and the LLM-as-a-judge approach for assessing the quality of financial commentaries in Portuguese, a task and language largely neglected in previous work. We introduce fine-grained perturbations into texts written by specialists to analyze which types of noise most significantly affect evaluation outcomes, using the noise-free counterparts as references. The results highlight the weaknesses of classical metrics on this task and the limitations of even recent evaluation paradigms, underscoring the need for context- and domain-sensitive evaluation approaches.
Anthology ID:
2026.propor-1.22
Volume:
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month:
April
Year:
2026
Address:
Salvador, Brazil
Editors:
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
Pages:
222–233
URL:
https://aclanthology.org/2026.propor-1.22/
Cite (ACL):
Marina Ramalhete Masid, Gabriel Assis, Daniela Vianna, Aline Paes, and Altigran Soares da Silva. 2026. Auditing the Evaluators: How Far Can Automatic Evaluation Go in Assessing Portuguese Financial Texts?. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 222–233, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal):
Auditing the Evaluators: How Far Can Automatic Evaluation Go in Assessing Portuguese Financial Texts? (Masid et al., PROPOR 2026)
PDF:
https://aclanthology.org/2026.propor-1.22.pdf