HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits

Tim Franzmeyer, Aleksandar Shtedritski, Samuel Albanie, Philip Torr, Joao F. Henriques, Jakob Foerster


Abstract
Benchmarks have been essential for driving progress in machine learning. A better understanding of LLM capabilities on real-world tasks is vital for safe development. Designing adequate LLM benchmarks is challenging: data from real-world tasks is hard to collect, public availability of static evaluation data leads to test data contamination and benchmark overfitting, and periodically generating new evaluation data is tedious and may result in temporally inconsistent results. We introduce HelloFresh, based on continuous streams of real-world data generated by intrinsically motivated human labelers. It covers recent events from X (formerly Twitter) community notes and edits of Wikipedia pages, mitigating the risk of test data contamination and benchmark overfitting. Any X user can propose an X note to add additional context to a misleading post (formerly tweet); if the community classifies it as helpful, it is shown with the post. Similarly, Wikipedia relies on community-based consensus, allowing users to edit articles or revert edits made by other users. Verifying whether an X note is helpful or whether a Wikipedia edit should be accepted are hard tasks that require grounding by querying the web. We backtest state-of-the-art LLMs supplemented with simple web search access and find that HelloFresh yields a temporally consistent ranking. To enable continuous evaluation on HelloFresh, we host a public leaderboard and periodically updated evaluation data at https://tinyurl.com/hello-fresh-LLM.
Anthology ID:
2024.findings-acl.754
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12702–12716
URL:
https://aclanthology.org/2024.findings-acl.754
DOI:
10.18653/v1/2024.findings-acl.754
Cite (ACL):
Tim Franzmeyer, Aleksandar Shtedritski, Samuel Albanie, Philip Torr, Joao F. Henriques, and Jakob Foerster. 2024. HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits. In Findings of the Association for Computational Linguistics: ACL 2024, pages 12702–12716, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits (Franzmeyer et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.754.pdf