Automatic Quality Estimation for Data Selection and Curriculum Learning

Hiep Nguyen, Lynn Yip, Justin DeBenedetto


Abstract
The size of neural models in natural language processing has increased rapidly in recent years. With this increase in model size comes an increase in the amount of training data required. While these larger models have shown strong performance, they incur added training and data costs, can be resource-prohibitive for many researchers, and require an amount of language data that is not available for all languages. This work explores quality estimation as a method of data selection or filtering. The aim is to provide models with higher quality data rather than larger amounts of data. We apply this approach to machine translation models with varying data sizes, as well as to the BabyLM Challenge. Given the 100M word dataset provided in the BabyLM Challenge, we test various strategies for selecting 10M words for pretraining and use a curriculum learning approach based on the quality estimation scores. We find small improvements in certain data settings.
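The selection-plus-curriculum idea described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each candidate document already carries a quality estimation score (the paper's actual scorer and ordering strategy are described in the full text), selects the highest-scoring documents up to a word budget, and then orders the selection by score for curriculum-style training.

```python
from typing import List, Tuple


def select_and_order(docs: List[Tuple[str, float]], word_budget: int) -> List[str]:
    """Pick the highest-QE-scored documents within a word budget, then
    order the picks from lowest to highest score (a simple curriculum).

    `docs` is a list of (text, quality_score) pairs; both the scoring
    and the ascending curriculum order are illustrative assumptions.
    """
    # Rank candidates by quality estimation score, best first,
    # and greedily fill the word budget.
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    selected, words = [], 0
    for text, score in ranked:
        n_words = len(text.split())
        if words + n_words > word_budget:
            break
        selected.append((text, score))
        words += n_words

    # Curriculum ordering: present lower-scoring selections earlier.
    selected.sort(key=lambda d: d[1])
    return [text for text, _ in selected]
```

With a 5-word budget and scored documents `[("a b c", 0.9), ("d e", 0.5), ("f g h i", 0.2)]`, the two highest-scoring documents fit the budget and are returned in ascending-score order: `["d e", "a b c"]`.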
Anthology ID:
2024.conll-babylm.18
Volume:
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Month:
November
Year:
2024
Address:
Miami, FL, USA
Editors:
Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Leshem Choshen, Ryan Cotterell, Alex Warstadt, Ethan Gotlieb Wilcox
Venues:
CoNLL | BabyLM | WS
Publisher:
Association for Computational Linguistics
Pages:
212–220
URL:
https://aclanthology.org/2024.conll-babylm.18/
Cite (ACL):
Hiep Nguyen, Lynn Yip, and Justin DeBenedetto. 2024. Automatic Quality Estimation for Data Selection and Curriculum Learning. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 212–220, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal):
Automatic Quality Estimation for Data Selection and Curriculum Learning (Nguyen et al., CoNLL-BabyLM 2024)
PDF:
https://aclanthology.org/2024.conll-babylm.18.pdf