TAIL: A Toolkit for Automatic and Realistic Long-Context Large Language Model Evaluation

Gefei Gu, Yilun Zhao, Ruoxi Ning, Yanan Zheng, Arman Cohan


Abstract
As long-context large language models (LLMs) are attracting increasing attention for their ability to handle context windows exceeding 128k tokens, the need for effective evaluation methods for these models becomes critical. Existing evaluation methods, however, fall short: needle-in-a-haystack (NIAH) and its variants are overly simplistic, while creating realistic benchmarks is prohibitively expensive due to extensive human annotation requirements. To bridge this gap, we propose TAIL, an automatic toolkit for creating realistic evaluation benchmarks and assessing the performance of long-context LLMs. With TAIL, users can customize the building of a long-context, document-grounded QA benchmark and obtain visualized performance metrics of evaluated models. TAIL has the advantage of requiring minimal human annotation and generating natural questions based on user-provided long-context documents. We apply TAIL to construct a benchmark encompassing multiple expert domains, such as finance, law, patent, and scientific literature. We then evaluate four state-of-the-art long-context LLMs using this benchmark. Results show that all LLMs experience varying degrees of performance degradation as context lengths increase.
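To make the described workflow concrete, the sketch below shows one plausible way such a pipeline could be organized: an LLM generates a question grounded in a passage of a user-provided document, and the model under evaluation is then queried at increasing context lengths. This is a minimal illustration, not TAIL's actual interface; the prompt wording, file path, model names, and function names are assumptions, and answer scoring is omitted.

```python
# Hypothetical sketch of a document-grounded long-context QA evaluation loop.
# NOT TAIL's real API; names, prompts, and models are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_question(passage: str, generator_model: str = "gpt-4o") -> str:
    """Ask an LLM to write one natural question answerable from the passage."""
    prompt = (
        "Write one factual question that can be answered solely from the "
        "passage below. Return only the question.\n\nPassage:\n" + passage
    )
    resp = client.chat.completions.create(
        model=generator_model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def answer_with_context(question: str, context: str, eval_model: str) -> str:
    """Query the model under evaluation with the question appended to a
    document context of a chosen length."""
    resp = client.chat.completions.create(
        model=eval_model,
        messages=[{"role": "user", "content": context + "\n\nQuestion: " + question}],
    )
    return resp.choices[0].message.content


# Usage: ground a question in an early passage of a long document
# (hypothetical path), then probe the same model at several context lengths
# to observe how answer quality changes as the context grows.
document = open("financial_report.txt").read()
passage = document[2000:3500]            # grounding passage, contained in all contexts below
question = generate_question(passage)
for length in (8_000, 32_000, 128_000):  # context sizes in characters, for illustration
    context = document[:length]
    print(length, answer_with_context(question, context, eval_model="gpt-4o-mini"))
```

Because the question is generated from text that occurs naturally inside the document, no synthetic "needle" has to be inserted, which is the property that distinguishes this style of benchmark from NIAH-type tests.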
Anthology ID:
2024.emnlp-demo.21
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Delia Irazu Hernandez Farias, Tom Hope, Manling Li
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
198–208
URL:
https://aclanthology.org/2024.emnlp-demo.21
Cite (ACL):
Gefei Gu, Yilun Zhao, Ruoxi Ning, Yanan Zheng, and Arman Cohan. 2024. TAIL: A Toolkit for Automatic and Realistic Long-Context Large Language Model Evaluation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 198–208, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
TAIL: A Toolkit for Automatic and Realistic Long-Context Large Language Model Evaluation (Gu et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-demo.21.pdf