Representing the Toddler Lexicon: Do the Corpus and Semantics Matter?

Jennifer Weber, Eliana Colunga


Abstract
Understanding child language development requires accurately representing children’s lexicons. However, much of the past work modeling children’s vocabulary development has relied on adult-based measures. The present investigation asks whether using corpora that capture the language input of young children more accurately represents children’s vocabulary knowledge. We present a newly created toddler corpus that incorporates transcripts of child-directed conversations, the text of picture books written for preschoolers, and dialog from G-rated movies to approximate the language input a North American preschooler might hear. We evaluate the utility of the new corpus for modeling children’s vocabulary development by building and analyzing different semantic network models and comparing them to vocabulary norms for toddlers in this age range. More specifically, the relations between words in our semantic networks were derived from skip-gram neural networks (Word2Vec) trained on our toddler corpus or on Google News. Results revealed that the models built from the toddler corpus were more accurate at predicting toddler vocabulary growth than those built from the adult-based corpus. These results speak to the importance of selecting a corpus that matches the population of interest.
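As a rough illustration of how word embeddings can be turned into a semantic network, the sketch below links two words whenever the cosine similarity of their vectors meets a threshold. The abstract does not say how edges were derived from the Word2Vec models, so the thresholding rule, the threshold value, the toy vectors, and the names `TOY_VECTORS`, `cosine`, and `build_network` are all assumptions made for illustration, not the authors' method.

```python
import math
from itertools import combinations

# Toy 3-dimensional vectors standing in for trained skip-gram embeddings.
# These values are invented for illustration; they are not from the paper.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "ball": [0.1, 0.9, 0.2],
    "milk": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_network(vectors, threshold):
    """Return an undirected adjacency map linking words whose
    cosine similarity is at least `threshold` (a hypothetical rule)."""
    edges = {word: set() for word in vectors}
    for w1, w2 in combinations(vectors, 2):
        if cosine(vectors[w1], vectors[w2]) >= threshold:
            edges[w1].add(w2)
            edges[w2].add(w1)
    return edges


# The threshold of 0.7 is an arbitrary choice for this toy example.
network = build_network(TOY_VECTORS, threshold=0.7)

# Degree (number of neighbors) is one simple per-word network measure.
degree = {word: len(neighbors) for word, neighbors in network.items()}
```

With real data, the toy dictionary would be replaced by vectors from a skip-gram model trained on the corpus of interest. Per-word network measures such as degree are the kind of quantity that could then be compared against toddler vocabulary norms.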
Anthology ID:
2022.lrec-1.421
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3960–3968
URL:
https://aclanthology.org/2022.lrec-1.421
Cite (ACL):
Jennifer Weber and Eliana Colunga. 2022. Representing the Toddler Lexicon: Do the Corpus and Semantics Matter?. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3960–3968, Marseille, France. European Language Resources Association.
Cite (Informal):
Representing the Toddler Lexicon: Do the Corpus and Semantics Matter? (Weber & Colunga, LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.421.pdf