Multilingual Promise Verification in ESG Reports with Large Language Model Performance Evaluation
Wei-Chen Huang | Hsin-Ting Lu | Wen-Ze Chen | Min-Yuh Day
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
Corporate ESG reports often contain statements that are vague or difficult to verify, creating room for potential greenwashing. Building automated systems to evaluate such claims is therefore an important research direction, yet existing analytical tools remain limited in their ability to verify sustainability promises across multiple languages, especially beyond English. This study examines how large language models (GPT-5) perform in verifying ESG-related promises in Chinese, Japanese, and English reports, aiming to provide a multilingual evaluation baseline. We assess four verification tasks on the PromiseEval datasets [1] in three languages, comparing five prompting strategies ranging from zero-shot to five-shot learning, including Chain-of-Thought reasoning. The four subtasks are Promise Identification (PI), Evidence Status Assessment (ESA), Evidence Quality Evaluation (EQE), and Verification Timeline Prediction (VTP). The five-shot setting achieved the highest overall performance (71.12% accuracy, 51.92% Macro-F1). Although accuracy appears higher for Chinese (85.12%) than for Japanese (68.94%) and English (63.62%), this difference mainly reflects class imbalance in the data, so Macro-F1 provides a fairer cross-language comparison. Among the four tasks, Evidence Quality Evaluation (EQE) remains the most difficult. While Chain-of-Thought prompting slightly lowers the overall average, it shows a selective benefit on the more complex EQE task. Overall, this work offers a clearer multilingual baseline for ESG promise verification and supports the development of language-based tools that enhance the credibility and transparency of sustainability reporting.
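To illustrate why the abstract treats Macro-F1 as the fairer metric under class imbalance, here is a minimal sketch using scikit-learn. This is a general illustration, not code or data from the paper: the toy labels and the majority-class predictor below are hypothetical.

```python
# Minimal sketch (illustrative only, not the paper's data or method):
# under class imbalance, accuracy can look strong while Macro-F1 exposes
# that the minority class is never predicted correctly.
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 9 "Yes" promises, 1 "No".
y_true = ["Yes"] * 9 + ["No"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["Yes"] * 10

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks strong
print(f1_score(y_true, y_pred, average="macro",
               labels=["Yes", "No"], zero_division=0))  # ~0.47 -- penalizes the missed class
```

Because Macro-F1 averages per-class F1 scores with equal weight, the perfect score on the majority class cannot mask the zero score on the minority class, which is why the paper's cross-language comparison relies on it rather than raw accuracy.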