Negative Result · March 2026
Does Training Data Selection Help for LM Pre-training?
Parameter Golf is a competition to train the best small language model in 10 minutes on FineWeb (8B tokens, 80 shards). The model never sees all of it before the clock runs out. Can we pick a better subset of what it trains on?
The data comes in 80 shards. They all look the same.
The first question: are some shards more “val-like” than others? To find out, I trained a simple bigram language model on the validation set — it learns which pairs of tokens are common in val — then measured how well it predicts each training shard. A shard with low cross-entropy has token patterns that closely match val.
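A minimal sketch of such a bigram scorer, assuming tokenized data as integer NumPy arrays; the tiny vocab and random tokens below are illustrative (a real tokenizer vocab would call for sparse counts):

```python
import numpy as np

def bigram_cross_entropy(val_tokens, shard_tokens, vocab_size, alpha=1.0):
    """Score a shard by its cross-entropy under a bigram model fit on val.

    Lower CE means the shard's token-pair statistics look more like val.
    """
    # Count bigrams in the validation set, with add-alpha smoothing.
    counts = np.full((vocab_size, vocab_size), alpha)
    np.add.at(counts, (val_tokens[:-1], val_tokens[1:]), 1)
    # Conditional log-probabilities P(next | prev), in bits.
    logp = np.log2(counts / counts.sum(axis=1, keepdims=True))
    # Mean negative log-likelihood of the shard's bigrams.
    return -logp[shard_tokens[:-1], shard_tokens[1:]].mean()

# Toy demo: two unrelated random streams score near log2(vocab) bits.
rng = np.random.default_rng(0)
val = rng.integers(0, 50, size=10_000)
shard = rng.integers(0, 50, size=10_000)
print(f"{bigram_cross_entropy(val, shard, vocab_size=50):.2f} bits")
```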
Picking “better” shards at this level is like choosing English books by letter frequency. They're all English.
But zoom in, and the variation explodes.
Instead of whole shards, I scored every 32K-token chunk — 244,080 in total. The within-shard variance is 535× larger than the between-shard variance:
Per-chunk cross-entropies range from 4.2 to 9.6 bits. Averaging over 100M tokens per shard was hiding all of that spread. So I selected the lowest-CE chunks — those most similar to val — and trained on those instead:
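The chunk-level analysis can be sketched as follows, assuming per-chunk CE scores and shard IDs are already in NumPy arrays. The toy data is synthetic, built to mimic the pattern above: shards that are nearly identical on average, chunks that are not.

```python
import numpy as np

def variance_decomposition(chunk_ce, shard_ids):
    """Split per-chunk CE variance into between-shard and
    within-shard components (one-way ANOVA style)."""
    shards = np.unique(shard_ids)
    shard_means = np.array([chunk_ce[shard_ids == s].mean() for s in shards])
    between = shard_means.var()
    within = np.mean([chunk_ce[shard_ids == s].var() for s in shards])
    return between, within

def select_lowest_ce(chunk_ce, frac=0.5):
    """Indices of the lowest-CE (most val-like) chunks."""
    k = int(len(chunk_ce) * frac)
    return np.argsort(chunk_ce)[:k]

# Synthetic scores: 80 shards x 100 chunks, tiny shard-level spread,
# large chunk-level spread.
rng = np.random.default_rng(0)
shard_ids = np.repeat(np.arange(80), 100)
chunk_ce = (6.0
            + 0.01 * rng.standard_normal(80)[shard_ids]   # shard effect
            + 0.5 * rng.standard_normal(8000))            # chunk effect
between, within = variance_decomposition(chunk_ce, shard_ids)
print(within / between)  # large ratio: shards look alike, chunks do not
idx = select_lowest_ce(chunk_ce, frac=0.25)
```

The selection step is just an argsort over chunk scores; all the work is in computing the scores.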
It made things worse.
Two scorers — a bigram model and a 17M-parameter transformer — both selected “val-like” chunks. Both hurt performance.
The selected data fit faster (lower training loss) but generalized worse. The lowest-CE chunks are generically easy text — common patterns, simple syntax. The hard, diverse examples in the discarded tail are exactly what the model needs to learn the full distribution.
What about curriculum learning?
The experiments above select data most similar to val. But what if instead of matching the validation distribution, we select by difficulty? Train a small model briefly, measure its perplexity on each piece of training data, then reorder so the highest-perplexity data — what the model finds hardest — comes first. The idea: spend limited training time on what the model has the most to learn from.
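The reordering step itself is simple; a sketch, assuming per-chunk log-losses have already been computed by the partially trained scorer. Since perplexity is exp(log-loss) and therefore monotone in it, sorting by log-loss descending gives the hardest-first curriculum directly:

```python
import numpy as np

def hardest_first_order(chunk_logloss):
    """Return chunk indices sorted hardest-first.

    Perplexity = exp(log-loss) is monotone in log-loss, so sorting
    log-loss descending orders chunks by descending perplexity.
    """
    return np.argsort(-np.asarray(chunk_logloss))

losses = [2.1, 5.7, 3.3, 4.0]
print(hardest_first_order(losses))  # [1 3 2 0]
```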
On 8×H100 (3 seeds), I ranked the training data by perplexity under a partially-trained model, reordered hardest-first, and tested against our merged #1 submission.
The per-seed deltas range from −0.0019 to +0.0009 — the 95% confidence interval spans zero. That's noise. Different selection criterion, same result: whether you pick data that matches val or data the model finds hardest, it doesn't matter.
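The spans-zero check is a standard small-n t-interval. In the sketch below, only the two extreme deltas (−0.0019 and +0.0009) come from the runs above; the middle value is illustrative, since the individual seed results aren't reported here.

```python
import math

def mean_ci95(deltas):
    """95% confidence interval for the mean of per-seed deltas
    (two-sided Student's t; hard-coded critical values for small n)."""
    n = len(deltas)
    mean = sum(deltas) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (n - 1))
    t = {2: 12.706, 3: 4.303, 4: 3.182}[n]
    half = t * sd / math.sqrt(n)
    return mean - half, mean + half

# Endpoints from the post; the middle seed value is made up.
lo, hi = mean_ci95([-0.0019, -0.0004, 0.0009])
print(lo < 0 < hi)  # interval spans zero: indistinguishable from noise
```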
Others found the same on this dataset.
PR #737 (entropy curriculum): +0.021 worse. PR #623: “neutral or negative.” Sachdeva et al. (ICLR 2025) showed perplexity-based selection gives no benefit on already-filtered data — which is exactly what FineWeb is.
On FineWeb — a large, already-filtered web corpus — the model needs the weird stuff: the hard examples, the outliers, the text that looks nothing like the test set. For this kind of data, diversity beats selection. Whether the same holds on noisier, unfiltered corpora is a different question.