WhatIf: Leveraging Word Vectors for Small-Scale Data Augmentation

Alex Lyman, Bryce Hepner


Abstract
We introduce WhatIf, a lightly supervised data augmentation technique that leverages word vectors to enhance training data for small-scale language models. Inspired by reading prediction strategies used in education, WhatIf creates new samples by substituting semantically similar words in the training data. We evaluate WhatIf on multiple datasets, demonstrating small but consistent improvements in downstream evaluation compared to baseline models. Finally, we compare WhatIf to other small-scale data augmentation techniques and find that it provides comparable quantitative results, with a potential tradeoff in qualitative evaluation.
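The abstract does not spell out the substitution procedure, but word-vector-based augmentation of this kind typically resembles the sketch below. This is a minimal illustration under stated assumptions, not the authors' implementation: the vector file, substitution rate, and neighbor count are all made up for the example, and gensim's KeyedVectors is used here only as one plausible nearest-neighbor lookup.

    # Minimal sketch of word-vector substitution augmentation in the spirit
    # of WhatIf. All file names and parameters are illustrative assumptions;
    # the paper's actual choices (vector model, filtering, swap rate) may
    # differ.
    import random

    from gensim.models import KeyedVectors

    # Assumption: pretrained word vectors stored in word2vec binary format.
    vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    def augment(sentence, swap_prob=0.1, topn=5, rng=random):
        """Return a copy of `sentence` with some words replaced by
        semantically similar neighbors from the vector space."""
        out = []
        for word in sentence.split():
            if word in vectors and rng.random() < swap_prob:
                # Substitute one of the top-n most similar words.
                neighbors = [w for w, _ in vectors.most_similar(word, topn=topn)]
                out.append(rng.choice(neighbors))
            else:
                out.append(word)
        return " ".join(out)

    print(augment("the cat sat on the mat"))

Each pass over the training data with a scheme like this yields new, slightly perturbed samples, which is how a small corpus can be expanded without additional human annotation.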
Anthology ID:
2024.conll-babylm.20
Volume:
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Month:
November
Year:
2024
Address:
Miami, FL, USA
Editors:
Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Leshem Choshen, Ryan Cotterell, Alex Warstadt, Ethan Gotlieb Wilcox
Venues:
CoNLL | BabyLM | WS
Publisher:
Association for Computational Linguistics
Pages:
229–236
URL:
https://aclanthology.org/2024.conll-babylm.20/
Cite (ACL):
Alex Lyman and Bryce Hepner. 2024. WhatIf: Leveraging Word Vectors for Small-Scale Data Augmentation. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 229–236, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal):
WhatIf: Leveraging Word Vectors for Small-Scale Data Augmentation (Lyman & Hepner, CoNLL-BabyLM 2024)
PDF:
https://aclanthology.org/2024.conll-babylm.20.pdf