Study · March 2026

The 16 MB Model That Becomes 272 MB at Eval Time

Parameter Golf caps the model artifact at 16 MB. But during evaluation, a technique called n-gram caching builds a statistical model from already-scored tokens that grows to 17× the artifact limit — costing zero artifact bytes. It cuts BPB from 1.11 to 0.38. This is a study of whether that should be legal, and what it means for the competition.


The 16 MB model becomes 272 MB during evaluation.

During evaluation, an n-gram cache builds hash tables from already-scored tokens. The tables are empty at start and grow as the 62M-token corpus is scored. None of this counts toward the 16 MB artifact limit — but it is the model doing the actual predicting.

[Chart: effective model size at eval time — 16 MB artifact; +192 MB (orders 2-7) = 208 MB; +256 MB (orders 2-9) = 272 MB; +4 GB (64M buckets) = 4.0 GB]
Dark bars = 16 MB artifact (constrained by rules). Red bars = eval-time hash tables (unconstrained). The effective model at eval time is 272 MB total.

With this technique, the leaderboard score drops from 1.11 to 0.38 BPB — a 66% reduction. The cache contributes 2/3 of the compression. It fits within the competition's 600-second eval budget on 8× H100 (401 s with all-reduce sync). And it costs zero artifact bytes.

The mechanism is fragile in a revealing way. More hash buckets make it worse — 256M buckets scores 1.11, barely better than no cache. The hash ratio full_table[hash(ctx, tok)] / ctx_table[hash(ctx)] is not a conditional probability. With small tables, both buckets average ~T/B entries, so the ratio approaches 1.0. The blend pushes the correct token's probability up, but it is inflating a hash ratio, not mixing in a real n-gram estimate. The reported BPB scores are not achievable by a valid compressor.

How a model actually runs in real life.

Consider what happens when a 16 MB language model ships. A company deploys it to serve user requests. The economics look something like this:

1

The hardware is fixed and shared

You rent a single GPU — maybe an A10G, maybe a T4, maybe a CPU-only instance to save money. The whole point of a small model is that you don’t need expensive hardware. You serve hundreds or thousands of concurrent users on that one machine. Memory is precious. Every megabyte of model state is a megabyte not available for batching requests.

2

Latency is the constraint, not throughput

A user sends a prompt and waits. You have maybe 50–200 ms to return the first token. The model must produce a prediction from a cold start on whatever context the user provides. There is no 62M-token corpus to build a cache over — a typical prompt is a few hundred tokens.

3

Each query is independent

User A asks about cooking recipes. User B asks about Python errors. User C submits legal text. There is no cross-document repetition to exploit. The n-gram cache starts empty for every request and stays nearly empty — a typical prompt is a few hundred tokens, not millions. The cache that gives 0.38 BPB on a contiguous corpus gives essentially nothing on isolated queries.

4

The model is all you ship

What you deploy is what you have. The 16 MB artifact is the model. You can’t allocate an additional 256 MB of hash tables per user session — that would be 256 GB for a thousand concurrent users. The artifact size IS the model size. In deployment, there is no distinction between “artifact” and “eval-time state.”

5

Inference speed matters

The n-gram cache adds K hash lookups and K table updates per token, across every n-gram order. In our experiments, this roughly doubles eval time (606s → 1,079s for backoff 2-7). The overhead is constant — it doesn’t get worse as the cache fills — but a flat 2× slowdown matters when your latency budget is 50–200ms. You pay the per-token cost on every request, but you only get the BPB benefit after millions of tokens of contiguous corpus. On a 500-token prompt, you get the slowdown without the payoff.

A model that scores 0.38 BPB on the competition benchmark but 1.11 BPB in deployment is not a better model. It is a better test-taker.

Training constraints are for fairness. Inference constraints are for reality.

The competition limits training to 10 minutes on 8× H100s. This is not because real-world training is limited — it isn't. Companies train for weeks on thousands of GPUs. The training limit exists for a practical reason: keeping the competition accessible. Without it, the leaderboard would be dominated by whoever could afford the most compute. The 10-minute cap levels the playing field.

Inference is the opposite. In the real world, inference genuinely IS constrained — by hardware cost, by latency requirements, by the number of users you need to serve, by the device the model runs on. These are not artificial limits imposed for fairness. They are physics and economics.

But the competition doesn't reflect this. It gives evaluation the same 8× H100 cluster and 600 seconds of wall time. That's 640 GB of VRAM and enough time to build elaborate statistical models from the scored data. The result: techniques that would be impossible in any real deployment are not just allowed but dominant.

                     Competition            Deployment
Training compute     Limited (fairness)     Unlimited
Inference compute    Unlimited*             Limited (economics)
Inference hardware   8× H100, 640 GB        1 GPU or CPU
Inference time       600 s for 62M tokens   <200 ms per request
Eval-time memory     Unconstrained          Shared with users
Corpus structure     Fixed, repetitive      Independent queries
*10-minute wall clock, but 8× H100 with 640 GB VRAM is effectively unconstrained for a 16 MB model.

The irony: training constraints are artificial but inference constraints would be realistic. The competition constrains the phase where real-world resources are abundant, and leaves unconstrained the phase where real-world resources are scarce.

The competition implicitly asks: given N bytes of model, how well can you compress natural language?

Eval-time caching answers a different question: given N bytes of model plus unbounded eval-time memory, how well can you compress a specific fixed corpus?

If the competition wants to measure real-world model quality, inference during evaluation should be constrained the way inference is constrained in real life: limited hardware, limited memory, limited time per token.

The 16 MB artifact limit is the right idea. It just needs to extend to eval time. A model that fits in 16 MB but needs 272 MB to run doesn't fit in 16 MB.

Where the line blurs.

The competition already permits eval-time model growth. TTT and LoRA adaptation are approved — they also build state from scored tokens, though the growth is modest (~2 MB). The n-gram cache follows the same principle at 100× the scale.

[Chart, log scale: eval-time state by technique — N-gram cache (2-9, 64M buckets): 4 GB; N-gram cache (2-7): 192 MB; KV cache (sliding window): 20 MB; Per-doc LoRA TTT (8 ep): 2 MB; Score-first TTT: 2 MB; 16 MB limit marked]
Eval-time state growth across approved and pending techniques. Dashed line marks the 16 MB artifact limit. The n-gram cache is the same principle as TTT — build state from scored tokens — but at a different scale.

The question is not whether causality is preserved (it is), but whether unbounded eval-time model growth is in the spirit of the 16 MB constraint.

Two ways to fix this.

1. Cap auxiliary eval-time state.

A subtlety: “cap total GPU memory” doesn't work. A 16 MB int6+compressed artifact decompresses into ~50–100 MB of bf16 weights in VRAM. Add activations and CUDA overhead, and the base model alone uses several hundred MB.

The right thing to constrain is auxiliary state: tensors that accumulate across the evaluation and are not derivable from the artifact alone. N-gram hash tables (192 MB), TTT LoRA deltas (2 MB), anything that persists across batches and grows with the corpus. Not model weights (deterministic decompression of the artifact), not KV cache (recomputed each window), not activations (transient).

A cap of auxiliary state ≤ 32 MB preserves everything currently approved (TTT LoRA at ~2 MB) while constraining the techniques that grow the effective model by 10–250×.
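As a sketch of how such a cap could be enforced, the harness could require submissions to declare their persistent buffers and tally them. Everything here (function name, buffer registry, sizes drawn from the figures above) is illustrative, not part of the actual competition harness:

```python
# Hypothetical audit of auxiliary eval-time state against the proposed cap.
# "Auxiliary" = persists across batches and is not derivable from the artifact.
AUX_CAP_BYTES = 32 * 2**20  # proposed 32 MB cap

def audit_aux_state(registered_buffers):
    """registered_buffers: {name: size_in_bytes} declared by the submission."""
    total = sum(registered_buffers.values())
    return total, total <= AUX_CAP_BYTES

# TTT LoRA deltas (~2 MB) pass; the 2-7 n-gram tables (192 MB) do not.
ttt_total, ttt_ok = audit_aux_state({"lora_deltas": 2 * 2**20})
ngram_total, ngram_ok = audit_aux_state({"ngram_tables": 192 * 2**20})
```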

2. Cap per-token overhead.

Require that eval-time techniques do not increase per-token latency by more than 50% over the base model forward pass on the same hardware. Not an absolute number — a ratio. Hardware-agnostic and easy to measure: run eval with and without the technique.

Base LM on 8× H100 takes 110 s. A 1.5× cap means 165 s max. The n-gram cache takes 401 s (3.6×). KV cache, TTT, LoRA are all well within 1.5×. This directly mirrors the real-world tradeoff: you can add eval-time tricks, but you can't blow up your serving cost.
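The check is a one-liner; a minimal sketch using the timings quoted above (the function name is made up):

```python
# Ratio-based latency cap: run eval with and without the technique, compare.
OVERHEAD_CAP = 1.5

def within_latency_cap(t_with_technique_s, t_base_s, cap=OVERHEAD_CAP):
    return t_with_technique_s / t_base_s <= cap

# Timings from the text: base LM 110 s on 8× H100, n-gram cache 401 s.
ngram_ok = within_latency_cap(401, 110)   # 3.6x — fails the 1.5x cap
```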


Experimental Details

How the n-gram cache works.

After each token is scored by the base LM, the token and its preceding context are inserted into hash tables. When a future token's context matches a previously seen n-gram, the cached frequency is mixed with the neural prediction:

p_mix = (1 − α) · p_neural + α · p_ngram

The tables are built exclusively from already-scored tokens. No future tokens are accessed. Strict causality is preserved. The technique exploits three phenomena: deterministic BPE subword completion (orders 2-4), common English collocations (orders 4-6), and verbatim document repetition (orders 6+). Only the last is corpus-specific.
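A minimal single-order sketch of the mechanism, assuming one pair of count tables per order (class name and bucket count are illustrative; the real runs use orders 2-K with backoff and per-order tables):

```python
# Single-order eval-time n-gram cache: ctx_table counts hashed contexts,
# full_table counts hashed (context, token) pairs; their ratio stands in
# for p_ngram in the mixing formula above.
class NGramCache:
    def __init__(self, order=4, buckets=1 << 20, alpha=0.4):
        self.order, self.B, self.alpha = order, buckets, alpha
        self.ctx_table = [0] * buckets    # counts of hashed contexts
        self.full_table = [0] * buckets   # counts of hashed (context, token)

    def _h(self, items):
        return hash(tuple(items)) % self.B

    def update(self, history, token):
        """Insert the just-scored token under its preceding context."""
        if len(history) >= self.order - 1:
            ctx = tuple(history[-(self.order - 1):])
            self.ctx_table[self._h(ctx)] += 1
            self.full_table[self._h(ctx + (token,))] += 1

    def mix(self, history, token, p_neural):
        """p_mix = (1 - alpha) * p_neural + alpha * p_ngram."""
        if len(history) < self.order - 1:
            return p_neural
        ctx = tuple(history[-(self.order - 1):])
        denom = self.ctx_table[self._h(ctx)]
        if denom == 0:
            return p_neural  # context never seen: fall back to the neural LM
        p_ngram = min(1.0, self.full_table[self._h(ctx + (token,))] / denom)
        return (1 - self.alpha) * p_neural + self.alpha * p_ngram

# Toy usage: once (3, 5) -> 7 has been scored and cached, the blend boosts it.
cache = NGramCache(order=3, buckets=1 << 16, alpha=0.4)
cache.update([3, 5], 7)
p = cache.mix([3, 5], 7, p_neural=0.1)   # boosted well above 0.1
```

In the scoring loop, `mix` is called before `update` at each position, so the tables only ever contain already-scored tokens and causality holds.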

Single-GPU ablation.

All runs use the same base model (~1.12 BPB, 27M parameters, 16 MB artifact). The only variable is the n-gram cache configuration. Single GPU, stride=64, FineWeb val (62M tokens).

Config                        BPB
Backoff 2-9, order-adpt       0.3779
Backoff 2-7, entropy-adpt     0.6535
Backoff 2-7, α=0.40           0.4923
Fixed 7-gram, α=0.40          0.5234
N-gram only (no neural)       1.0615
Base LM (float, pre-quant)    1.1109
Base LM (int6 quantized)      1.1142
Bits per byte, lower is better. Dashed line marks the quantized leaderboard score. The best config (backoff 2-9) reaches 0.3779. The n-gram cache alone (no base LM) beats the trained model: 1.06 vs 1.11.

8-GPU with all-reduce sync.

With 8× H100 and all-reduce sync of hash table deltas, every GPU has the full global cache. All-reduce overhead: 1.6 s total across ~4,000 batches. The 8-GPU BPB (0.4941) matches single-GPU (0.4923) within 0.002.

Config                  BPB
Backoff 2-7, α=0.80     0.3942
Backoff 2-9, α=0.40     0.4548
Backoff 2-7, α=0.40     0.4941
Base LM (8-GPU, int6)   1.1130
8-GPU results with global cache via all-reduce sync. All runs under the 600s eval budget.

Prior entries (PRs #727, #788) used partitioned caches — each GPU sees only 1/8 of tokens — and scored ~0.98 BPB. The all-reduce trick closes a 0.50 BPB gap.
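The sync itself is just an elementwise sum of per-rank count deltas — on the cluster presumably a `torch.distributed.all_reduce` over the tables — sketched here without GPUs (function name and toy tables are illustrative):

```python
# Each rank scores 1/8 of the batch and accumulates count deltas locally;
# summing deltas across ranks gives every rank the full global cache.
def all_reduce_sum(deltas_per_rank):
    merged = [0] * len(deltas_per_rank[0])
    for delta in deltas_per_rank:
        for i, v in enumerate(delta):
            merged[i] += v
    return merged

# Two "ranks", 8-bucket tables: after the reduce, both see all counts.
rank0_delta = [1, 0, 2, 0, 0, 0, 0, 0]
rank1_delta = [0, 3, 0, 0, 1, 0, 0, 0]
global_delta = all_reduce_sum([rank0_delta, rank1_delta])
```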

Alpha sweep.

Prior work (PR #727) found α=0.40 optimal on partitioned caches. With a global cache, higher alpha is monotonically better — the n-gram is reliable enough that the model should defer to it more.

α       BPB
0.20    0.6180
0.40    0.4941
0.60    0.4263
0.80    0.3942
8× H100 alpha sweep. Monotonic improvement from α=0.20 to α=0.80.

Order scaling law.

How much does increasing the max n-gram order help? We swept K from 2 to 20 with backoff and entropy-adaptive alpha.

[Plot: BPB vs max n-gram order (2–20), with neural baseline line; values along the sweep: 1.14, 1.10, 0.79, 0.65, 0.63, 0.62, 0.62, 0.62]
BPB vs max n-gram order. Most of the gain comes from orders 2–7. Going from 7 to 20 gains 0.04 BPB. Diminishing returns, but each additional order adds another pair of hash tables to memory.

More buckets is worse.

We expected more hash buckets to reduce collisions and improve accuracy. The opposite happened.

Buckets   BPB
1M        0.5793
4M        0.6535
64M       1.0629
256M      1.1123
BPB vs hash table size. 1M buckets beats 4M. 64M and 256M are barely better than no cache at all.

The hash ratio full_table[hash(ctx, tok)] / ctx_table[hash(ctx)] is not a conditional probability. With 1M buckets and 62M tokens, each bucket averages ~62 entries in both tables. The ratio of two similarly-populated buckets approaches 1.0 — this is P(cache_bin), not P(tok | ctx).

The blend only boosts the correct token's probability. If you computed it for all 1024 tokens, each would get a similar ratio near 1.0. After renormalization, the n-gram contribution washes out. The reported BPB scores are a measurement artifact from point-evaluating an invalid distribution.
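A quick simulation of the collision argument, with hypothetical T and B (random bucket inserts stand in for real n-grams):

```python
# With ~T/B entries per bucket in both tables, the ratio
# full_table[h(ctx, tok)] / ctx_table[h(ctx)] drifts toward 1.0
# even for (ctx, tok) pairs that were never inserted.
import random

def mean_hash_ratio(T, B, queries=1000, seed=0):
    rng = random.Random(seed)
    ctx_table = [0] * B
    full_table = [0] * B
    for _ in range(T):
        ctx_table[rng.randrange(B)] += 1
        full_table[rng.randrange(B)] += 1
    ratios = []
    for _ in range(queries):          # query random, never-matched pairs
        denom = ctx_table[rng.randrange(B)]
        if denom:
            ratios.append(min(1.0, full_table[rng.randrange(B)] / denom))
    return sum(ratios) / len(ratios)

r = mean_hash_ratio(T=620_000, B=10_000)   # ~62 entries per bucket: ratio near 1
```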

Stride decomposition.

Sliding window eval with small strides gives the model more context per token. Is the n-gram improvement just a side effect of that overlap? No. The n-gram delta is ~0.62 BPB regardless of stride:

Stride   Baseline   With cache   Delta
64       1.1109     0.4923       −0.619
256      1.1132     0.4930       −0.620
2048     1.1367     0.5005       −0.636

Sliding window overlap contributes ~0.03 BPB (stride 64 vs 2048 without cache). The n-gram cache contributes ~0.62. They're orthogonal.

Other findings.

1

Logistic mixing is worse than linear

PAQ-style logistic mixing in stretched (log-odds) probability space gives 0.75 BPB vs linear’s 0.65. When n-gram confidence is low (early in the corpus, rare contexts), mixing in log-odds space drags the blend sharply toward the n-gram’s near-zero estimates. Linear mixing is more robust.
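To see why, compare the two mixing rules on a toy case (function names and the example probabilities are made up):

```python
# Linear vs PAQ-style logistic mixing of two probability estimates.
import math

def stretch(p):   # logit: probability -> log-odds
    return math.log(p / (1 - p))

def squash(x):    # sigmoid: log-odds -> probability
    return 1 / (1 + math.exp(-x))

def mix_linear(p_neural, p_ngram, alpha):
    return (1 - alpha) * p_neural + alpha * p_ngram

def mix_logistic(p_neural, p_ngram, alpha):
    return squash((1 - alpha) * stretch(p_neural) + alpha * stretch(p_ngram))

# A confident neural 0.5 blended with an unreliable near-zero n-gram estimate:
p_lin = mix_linear(0.5, 0.001, 0.4)      # 0.3004 — degrades gently
p_log = mix_logistic(0.5, 0.001, 0.4)    # ≈ 0.06 — dragged far down
```

The logit of 0.001 is about −6.9, so even a 0.4 weight on it pulls the mixed log-odds deep into improbable territory, while the linear blend never drops below 0.6 · p_neural.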

2

Entropy-adaptive alpha hurts with global cache

The sigmoid-gated alpha from PR #727 gives 0.65 BPB — 0.16 BPB worse than fixed α=0.40 (0.49). The entropy gate is designed for sparse partitioned caches and becomes too conservative when cache coverage is high.

3

Three compression phenomena

The cache captures (a) deterministic BPE subword completion (orders 2-4), (b) common English collocations (orders 4-6), and (c) verbatim document repetition (orders 6+). Only (c) is corpus-specific.

Base model: PR #549 (val_bpb 1.1194). N-gram cache concept: PR #727, #779, #788.

Code: experiments/eval_time_mixing/