Is Child-Directed Speech Effective Training Data for Language Models?

Steven Feng, Noah Goodman, Michael Frank
Abstract
While high-performing language models are typically trained on hundreds of billions of words, human children become fluent language users with a much smaller amount of data. What are the features of the data they receive, and how do these features support language modeling objectives? To investigate this question, we train GPT-2 and RoBERTa models on 29M words of English child-directed speech and a new matched, synthetic dataset (TinyDialogues), comparing them to models trained on OpenSubtitles, Wikipedia, and a heterogeneous blend of datasets from the BabyLM challenge. We evaluate the syntactic and semantic knowledge of these models using developmentally inspired evaluations. Through pretraining experiments, we test whether the global developmental ordering or the local discourse ordering of children’s training data supports high performance relative to other datasets. The local properties of the data affect model results, but surprisingly, global properties do not. Further, child language input is not uniquely valuable for training language models. These findings support the hypothesis that, rather than proceeding from better data, the child’s learning algorithm is substantially more data-efficient than current language modeling techniques.
Anthology ID:
2024.emnlp-main.1231
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
22055–22071
URL:
https://aclanthology.org/2024.emnlp-main.1231
Cite (ACL):
Steven Feng, Noah Goodman, and Michael Frank. 2024. Is Child-Directed Speech Effective Training Data for Language Models?. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22055–22071, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Is Child-Directed Speech Effective Training Data for Language Models? (Feng et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1231.pdf