Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning

Everlyn Asiko Chimoto; Jay Gala; Orevaoghene Ahia; Julia Kreutzer; Bruce A. Bassett; Sara Hooker

doi:10.18653/v1/2024.findings-acl.560

Abstract

Neural Machine Translation models are extremely data and compute-hungry. However, not all data points contribute equally to model training and generalization. Data pruning to remove the low-value data points has the benefit of drastically reducing the compute budget without significant drop in model performance. In this paper, we propose a new data pruning technique: Checkpoints Across Time (CAT), that leverages early model training dynamics to identify the most relevant data points for model performance. We benchmark CAT against several data pruning techniques including COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves comparable performance to using the full dataset, while pruning up to 50% of training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.

Anthology ID: 2024.findings-acl.560
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 9407–9426
URL: https://aclanthology.org/2024.findings-acl.560/
DOI: 10.18653/v1/2024.findings-acl.560
Cite (ACL): Everlyn Asiko Chimoto, Jay Gala, Orevaoghene Ahia, Julia Kreutzer, Bruce A. Bassett, and Sara Hooker. 2024. Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9407–9426, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning (Chimoto et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.560.pdf
