Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling

Jannek Ulm, Kevin Du, Vésteinn Snæbjarnarson


Abstract
Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of *contrastive decoding* for generating synthetic data. In a controlled setting, we experiment with sampling corpora using the relative difference between a GOOD and a BAD model trained on the same original corpus of 100 million words. By amplifying the signal from the better-performing model, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more *reasoning skills*, while synthetic data from traditional sampling helps more on tasks requiring surface-level *linguistic* capabilities.
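The abstract describes scoring next tokens by the relative difference between a GOOD and a BAD model. The sketch below illustrates one common formulation of contrastive decoding (following Li et al., 2023), not necessarily the authors' exact setup: tokens are ranked by the gap in log-probability between the two models, restricted to tokens the GOOD model already finds plausible. The function name, the alpha threshold, and the sampling choice are illustrative assumptions.

import torch
import torch.nn.functional as F

def contrastive_next_token(good_logits: torch.Tensor,
                           bad_logits: torch.Tensor,
                           alpha: float = 0.1) -> int:
    """Sample the next token from the contrast between two models' logits.

    good_logits, bad_logits: shape (vocab_size,) for the current prefix.
    alpha: plausibility cutoff relative to the GOOD model's best token
           (an assumed hyperparameter, not taken from the paper).
    """
    good_logp = F.log_softmax(good_logits, dim=-1)
    bad_logp = F.log_softmax(bad_logits, dim=-1)

    # Plausibility mask: keep only tokens whose probability under the GOOD
    # model is at least alpha times that model's maximum token probability.
    cutoff = torch.log(torch.tensor(alpha)) + good_logp.max()
    mask = good_logp >= cutoff

    # Contrastive score: amplify what the GOOD model prefers over the BAD one.
    scores = good_logp - bad_logp
    scores[~mask] = float("-inf")

    # Sample from the renormalized contrastive distribution.
    probs = F.softmax(scores, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

Repeating this step token by token, with both models conditioned on the same growing prefix, yields a synthetic corpus that can then be mixed with the original training data as the abstract describes.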
Anthology ID:
2025.babylm-main.2
Volume:
Proceedings of the First BabyLM Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue:
BabyLM
Publisher:
Association for Computational Linguistics
Pages:
29–41
URL:
https://aclanthology.org/2025.babylm-main.2/
Cite (ACL):
Jannek Ulm, Kevin Du, and Vésteinn Snæbjarnarson. 2025. Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling. In Proceedings of the First BabyLM Workshop, pages 29–41, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling (Ulm et al., BabyLM 2025)
PDF:
https://aclanthology.org/2025.babylm-main.2.pdf