Lewis Tunstall - ACL Anthology

Lewis Tunstall

2022

Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements
Leandro Von Werra | Lewis Tunstall | Abhishek Thakur | Sasha Luccioni | Tristan Thrush | Aleksandra Piktus | Felix Marty | Nazneen Rajani | Victor Mustar | Helen Ngo
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Evaluation is a key part of machine learning (ML), yet there is a lack of support and tooling to enable its informed and systematic practice. We introduce Evaluate and Evaluation on the Hub—a set of tools to facilitate the evaluation of models and datasets in ML. Evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models. Its goal is to support reproducibility of evaluation, centralize and document the evaluation process, and broaden evaluation to cover more facets of model performance. It includes over 50 efficient canonical implementations for a variety of domains and scenarios, interactive documentation, and the ability to easily share implementations and outcomes. The library is available at https://github.com/huggingface/evaluate. In addition, we introduce Evaluation on the Hub, a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets on the Hugging Face Hub, for free, at the click of a button. Evaluation on the Hub is available at https://huggingface.co/autoevaluate.

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann | Abhik Bhattacharjee | Abinaya Mahendiran | Alex Wang | Alexandros Papangelis | Aman Madaan | Angelina McMillan-Major | Anna Shvets | Ashish Upadhyay | Bernd Bohnet | Bingsheng Yao | Bryan Wilie | Chandra Bhagavatula | Chaobin You | Craig Thomson | Cristina Garbacea | Dakuo Wang | Daniel Deutsch | Deyi Xiong | Di Jin | Dimitra Gkatzia | Dragomir Radev | Elizabeth Clark | Esin Durmus | Faisal Ladhak | Filip Ginter | Genta Indra Winata | Hendrik Strobelt | Hiroaki Hayashi | Jekaterina Novikova | Jenna Kanerva | Jenny Chim | Jiawei Zhou | Jordan Clive | Joshua Maynez | João Sedoc | Juraj Juraska | Kaustubh Dhole | Khyathi Raghavi Chandu | Laura Perez-Beltrachini | Leonardo F. R. Ribeiro | Lewis Tunstall | Li Zhang | Mahim Pushkarna | Mathias Creutz | Michael White | Mihir Sanjay Kale | Moussa Kamal Eddine | Nico Daheim | Nishant Subramani | Ondřej Dušek | Paul Pu Liang | Pawan Sasanka Ammanamanchi | Qi Zhu | Ratish Puduppully | Reno Kriz | Rifat Shahriyar | Ronald Cardenas | Saad Mahamood | Salomey Osei | Samuel Cahyawijaya | Sanja Štajner | Sebastien Montella | Shailza Jolly | Simon Mille | Tahmid Hasan | Tianhao Shen | Tosin Adewumi | Vikas Raunak | Vipul Raheja | Vitaly Nikolaev | Vivian Tsai | Yacine Jernite | Ying Xu | Yisi Sang | Yixin Liu | Yufang Hou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Evaluations in machine learning rarely use the latest metrics, datasets, or human evaluation, favoring instead compatibility with prior work. This compatibility, often facilitated through leaderboards, leads to standardized but outdated evaluation practices. We posit that the standardization is taking place in the wrong spot. Evaluation infrastructure should enable researchers to use the latest methods, and what should be standardized instead is how to incorporate these new evaluation advances. We introduce GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark, which uses a modular infrastructure so that dataset, model, and metric developers can benefit from each other’s work. GEMv2 supports 40 documented datasets in 51 languages and ongoing online evaluation for all datasets, and its interactive tools make it easier to add new datasets to the living benchmark.

2021

Datasets: A Community Library for Natural Language Processing
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

The scale, variety, and quantity of publicly available NLP datasets have grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets and for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. The library is available at https://github.com/huggingface/datasets.

Co-authors

Pawan Sasanka Ammanamanchi 1

Chandra Bhagavatula 1

Abhik Bhattacharjee 1

Simon Brandeis 1

Samuel Cahyawijaya 1

Ronald Cardenas 1

Khyathi Raghavi Chandu 1

Julien Chaumond 1

Gunjan Chhablani 1

Pierric Cistac 1

Elizabeth Clark 1

Mathias Creutz 1

Lysandre Debut 1

Clément Delangue 1

Daniel Deutsch 1

Kaustubh Dhole 1

Mariama Drame 1

Ondřej Dušek 1

Moussa Kamal Eddine 1

Cristina Garbacea 1

Sebastian Gehrmann 1

Dimitra Gkatzia 1

Thibault Goehringer 1

Sylvain Gugger 1

Hiroaki Hayashi 1

Shailza Jolly 1

Juraj Juraska 1

Mihir Sanjay Kale 1

Jenna Kanerva 1

Faisal Ladhak 1

François Lagunas 1

Teven Le Scao 1

Quentin Lhoest 1

Paul Pu Liang 1

Sasha Luccioni 1

Saad Mahamood 1

Abinaya Mahendiran 1

Bhavitvya Malik 1

Théo Matussière 1

Joshua Maynez 1

Sebastien Montella 1

Vitaly Nikolaev 1

Jekaterina Novikova 1

Alexandros Papangelis 1

Nicolas Patry 1

Laura Perez-Beltrachini 1

Aleksandra Piktus 1

Ratish Puduppully 1

Mahim Pushkarna 1

Dragomir Radev 1

Nazneen Rajani 1

Leonardo F. R. Ribeiro 1

Alexander M. Rush 1

Philipp Schmid 1

Rifat Shahriyar 1

Hendrik Strobelt 1

Nishant Subramani 1

Craig Thomson 1

Tristan Thrush 1

Ashish Upadhyay 1

Albert Villanova del Moral 1

Leandro Von Werra 1

Michael White 1

Genta Indra Winata 1

Deyi Xiong (德意熊) 1

Bingsheng Yao 1

Patrick von Platen 1

Mario Šaško 1

Sanja Štajner 1

Venues

EMNLP 3