Revisiting Automated Evaluation for Long-form Table Question Answering

Yuqi Wang, Lyuhao Chen, Songcheng Cai, Zhijian Xu, Yilun Zhao


Abstract
In the era of data-driven decision-making, Long-Form Table Question Answering (LFTQA) is essential for integrating structured data with complex reasoning. Despite recent advancements in Large Language Models (LLMs) for LFTQA, evaluating their effectiveness remains a significant challenge. We introduce LFTQA-Eval, a meta-evaluation dataset comprising 2,988 human-annotated examples, to rigorously assess how well current automated metrics evaluate LLM-based LFTQA systems, with a focus on faithfulness and comprehensiveness. Our findings reveal that existing automatic metrics correlate poorly with human judgments and fail to consistently differentiate between factually accurate responses and those that are coherent but factually incorrect. Additionally, our in-depth examination of the limitations of automated evaluation methods provides essential insights for improving automated evaluation of LFTQA.
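As an illustrative sketch only (not the authors' released code or the paper's exact protocol), meta-evaluation of this kind is commonly done by correlating an automatic metric's scores with human ratings over the same set of responses; the scores and ratings below are hypothetical placeholder data.

```python
# Illustrative sketch of metric meta-evaluation (hypothetical data, not LFTQA-Eval itself).
# It measures how well an automatic metric's scores track human judgments
# (e.g., faithfulness ratings) over a set of LFTQA responses.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-response scores from an automatic metric and human annotators.
metric_scores = [0.82, 0.45, 0.91, 0.30, 0.67, 0.74]
human_ratings = [4, 2, 5, 3, 3, 4]  # e.g., 1-5 Likert faithfulness judgments

pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
spearman_rho, spearman_p = spearmanr(metric_scores, human_ratings)

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```

A low correlation under such a setup would indicate that the metric's rankings diverge from human judgments, which is the kind of gap the paper reports for existing automatic metrics.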
Anthology ID:
2024.emnlp-main.815
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14696–14706
URL:
https://aclanthology.org/2024.emnlp-main.815
Cite (ACL):
Yuqi Wang, Lyuhao Chen, Songcheng Cai, Zhijian Xu, and Yilun Zhao. 2024. Revisiting Automated Evaluation for Long-form Table Question Answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14696–14706, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Revisiting Automated Evaluation for Long-form Table Question Answering (Wang et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.815.pdf