Predicting Fine-tuned Performance on Larger Datasets Before Creating Them

Toshiki Kuramoto, Jun Suzuki


Abstract
This paper proposes a method to estimate the performance of a pretrained model fine-tuned on a larger dataset from the results obtained with a smaller one. Specifically, we demonstrate that when a pretrained model is fine-tuned, its classification performance improves at roughly the same overall rate as the number of epochs increases, regardless of the training dataset size. We then verify that an approximate formula based on this trend can predict the performance of a model trained with ten times as much training data or more, even when the initial training dataset is limited. Our results show that this approach can help resource-limited companies develop machine-learning models.
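The abstract does not reproduce the paper's approximation formula. As a rough illustration of the general idea only, the sketch below fits an assumed saturating learning curve to accuracy-per-epoch measurements from a small fine-tuning dataset and extrapolates the trend to later epochs; the curve form, data values, and parameters are all hypothetical and not taken from the paper.

# Illustrative sketch, not the paper's method: fit an assumed saturating
# curve acc(e) = a - b * exp(-c * e) to accuracy measured per epoch on a
# small dataset, then extrapolate the fitted trend.
import numpy as np
from scipy.optimize import curve_fit

def saturating_curve(epoch, a, b, c):
    # Assumed functional form: accuracy approaches the plateau a as epochs grow.
    return a - b * np.exp(-c * epoch)

# Hypothetical validation accuracies observed while fine-tuning on a small dataset.
epochs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
accuracy = np.array([0.62, 0.70, 0.74, 0.76, 0.77])

# Fit the assumed curve to the observed points.
params, _ = curve_fit(saturating_curve, epochs, accuracy, p0=[0.8, 0.3, 0.5])

# Extrapolate: if, as the paper observes, performance improves at a similar
# overall rate regardless of dataset size, the fitted trend gives a rough
# estimate of where training on a larger dataset would plateau.
for e in (10, 20):
    print(f"epoch {e}: predicted accuracy ~ {saturating_curve(e, *params):.3f}")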
Anthology ID:
2025.coling-industry.17
Volume:
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Darwish, Apoorv Agarwal
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
204–212
URL:
https://aclanthology.org/2025.coling-industry.17/
Cite (ACL):
Toshiki Kuramoto and Jun Suzuki. 2025. Predicting Fine-tuned Performance on Larger Datasets Before Creating Them. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 204–212, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Predicting Fine-tuned Performance on Larger Datasets Before Creating Them (Kuramoto & Suzuki, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-industry.17.pdf