ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models

Thibaut Thonet, Laurent Besacier, Jos Rozen


Abstract
Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending the models’ context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, we propose a new benchmark for long-context LLMs focused on a practical meeting assistant scenario in which the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, ELITR-Bench, augments the existing ELITR corpus by adding 271 manually crafted questions with their ground-truth answers, as well as noisy versions of meeting transcripts altered to target different Word Error Rate levels. Our experiments with 12 long-context LLMs on ELITR-Bench confirm the progress made across successive generations of both proprietary and open models, and point out their discrepancies in terms of robustness to transcript noise. We also provide a thorough analysis of our GPT-4-based evaluation, including insights from a crowdsourcing study. Our findings indicate that while GPT-4’s scores align with human judges, its ability to distinguish beyond three score levels may be limited.
Anthology ID:
2025.coling-main.28
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
407–428
URL:
https://aclanthology.org/2025.coling-main.28/
Cite (ACL):
Thibaut Thonet, Laurent Besacier, and Jos Rozen. 2025. ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 407–428, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models (Thonet et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.28.pdf