Jeffrey Li


2026

One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.

2025

Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) – orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.

2024

We propose a new method, instruction back-and-forth translation, to improve the quality of instruction-tuning data used for aligning large language models (LLMs). Given preprocessed texts from an initial web corpus (e.g. Dolma (Soldaini et al., 2024)), we generate synthetic instructions using the backtranslation approach proposed by Li et al., (2023), filter the generated data and rewrite the responses to improve their quality further based on the initial texts. Given similar quantities of instructions, fine-tuning Llama-2 on our (synthetic instruction, rewritten response) pairs yields better AlpacaEval win rates than using other common instruction datasets such as Humpback, ShareGPT, Open Orca, Alpaca-GPT4 and Self-instruct, at both 7B and 70B parameter scales. We also demonstrate that rewriting the responses with an LLM is different from direct distillation: the former process yields better win rate at 70B scale, and the two text distributions exhibit significant distinction in the embedding space. Besides, we provide analyses showing that our backtranslated instructions are of higher quality than other sources of synthetic instructions, while our responses are more diverse and complex than what can be obtained from distillation. Overall we find that instruction back-and-forth translation combines the best of both worlds—making use of the information diversity and quantity found on the web, while ensuring the quality of the responses which is necessary for effective alignment.