Barret Zoph


2024

A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Shayne Longpre | Gregory Yauney | Emily Reif | Katherine Lee | Adam Roberts | Barret Zoph | Denny Zhou | Jason Wei | Kevin Robinson | David Mimno | Daphne Ippolito
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. We pretrain models on data curated (1) at different collection times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we find that temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we measure the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Third, we empirically validate that heterogeneous data sources, like books and web, are beneficial and warrant greater prioritization. To date, these experiments constitute the single largest publicly documented empirical study of the effects of pretraining data. Spanning 28 unique 1.5 billion parameter models pretrained from scratch, these findings validate, quantify, and expose many undocumented intuitions about text pretraining, which ultimately support more informed data-centric decisions in model development.

2016

Transfer Learning for Low-Resource Neural Machine Translation
Barret Zoph | Deniz Yuret | Jonathan May | Kevin Knight
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Multi-Source Neural Translation
Barret Zoph | Kevin Knight
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies
Barret Zoph | Ashish Vaswani | Jonathan May | Kevin Knight
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

How Much Information Does a Human Translator Add to the Original?
Barret Zoph | Marjan Ghazvininejad | Kevin Knight
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing