Roelien C. Timmer
2026
MetaLead: A Comprehensive Human-Curated Leaderboard Dataset for Transparent Reporting of Machine Learning Experiments
Roelien C. Timmer | Necva Bölücü | Stephen Wan
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Leaderboards are crucial in the machine learning (ML) domain for benchmarking and tracking progress. However, creating leaderboards traditionally demands significant manual effort. In recent years, efforts have been made to automate leaderboard generation, but existing datasets for this purpose are limited: they capture only the best results from each paper and contain little metadata. We present MetaLead, a fully human-annotated ML leaderboard dataset that captures all experimental results for result transparency and contains extra metadata, such as the experiment type of each result (baseline, proposed method, or variation of the proposed method) for experiment-type-guided comparisons, and explicitly separates train and test datasets for cross-domain assessment. This enriched structure makes MetaLead a powerful resource for more transparent and nuanced evaluations across ML research. MetaLead dataset and code repository: https://github.com/RoelTim/metalead
2025
A Position Paper on the Automatic Generation of Machine Learning Leaderboards
Roelien C. Timmer | Yufang Hou | Stephen Wan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
An important task in machine learning (ML) research is comparing prior work, which is often performed via ML leaderboards: tabular overviews of experiments with comparable conditions (e.g. the same task, dataset, and metric). However, the growing volume of literature creates challenges in creating and maintaining these leaderboards. To ease this burden, researchers have developed methods to extract leaderboard entries from research papers for automated leaderboard curation. Yet, prior work varies in problem framing, complicating comparisons and limiting real-world applicability. In this position paper, we present the first overview of Automatic Leaderboard Generation (ALG) research, identifying fundamental differences in assumptions, scope, and output formats. We propose a unified conceptual framework to standardise how the ALG task is defined. We offer ALG benchmarking guidelines, including recommendations for datasets and metrics that promote fair, reproducible evaluation. Lastly, we outline challenges and new directions for ALG, advocating for broader coverage by including all reported results and richer metadata.
2024
CSIRO at Context24: Contextualising Scientific Figures and Tables in Scientific Literature
Necva Bölücü | Vincent Nguyen | Roelien C. Timmer | Huichen Yang | Maciej Rybinski | Stephen Wan | Sarvnaz Karimi
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)
Finding evidence for claims in the experimental results of scientific articles is difficult. The evidence is often presented in the form of tables and figures, and correctly matching it to scientific claims presents automation challenges. The Context24 shared task was launched to support the development of systems able to verify claims by extracting supporting evidence from articles. We explore different facets of this shared task, modelled both as a search problem and as an information extraction task, and experiment with a range of methods in each of these categories for the two sub-tasks of evidence identification and grounding context identification.