Study · March 2026
The 16 MB Model That Becomes 272 MB at Eval Time
Parameter Golf caps the model artifact at 16 MB. But during evaluation, a technique called n-gram caching builds a statistical model from already-scored tokens that grows to 17× the artifact limit — costing zero artifact bytes. It cuts BPB from 1.11 to 0.38. This is a study of whether that should be legal, and what it means for the competition.
The 16 MB model becomes 272 MB during evaluation.
During evaluation, an n-gram cache builds hash tables from already-scored tokens. The tables are empty at start and grow as the 62M-token corpus is scored. None of this counts toward the 16 MB artifact limit — but it is the model doing the actual predicting.
With this technique, the leaderboard score drops from 1.11 to 0.38 BPB — a 66% reduction. The cache contributes 2/3 of the compression. It fits within the competition's 600-second eval budget on 8× H100 (401 s with all-reduce sync). And it costs zero artifact bytes.
The mechanism is fragile in a revealing way. More hash buckets make it worse — 256M buckets scores 1.11, barely better than no cache. The hash ratio full_table[hash(ctx, tok)] / ctx_table[hash(ctx)] is not a conditional probability. With small tables, both buckets average ~T/B entries (T tokens, B buckets), so the ratio approaches 1.0. The blend pushes the correct token's probability up, but it is inflating a hash ratio, not mixing in a real n-gram estimate. The reported BPB scores are not achievable by a valid compressor.
How a model actually runs in real life.
Consider what happens when a 16 MB language model ships. A company deploys it to serve user requests. The economics look something like this:
The hardware is fixed and shared
You rent a single GPU — maybe an A10G, maybe a T4, maybe a CPU-only instance to save money. The whole point of a small model is that you don’t need expensive hardware. You serve hundreds or thousands of concurrent users on that one machine. Memory is precious. Every megabyte of model state is a megabyte not available for batching requests.
Latency is the constraint, not throughput
A user sends a prompt and waits. You have maybe 50–200 ms to return the first token. The model must produce a prediction from a cold start on whatever context the user provides. There is no 62M-token corpus to build a cache over — a typical prompt is a few hundred tokens.
Each query is independent
User A asks about cooking recipes. User B asks about Python errors. User C submits legal text. There is no cross-document repetition to exploit. The n-gram cache starts empty for every request and stays nearly empty — a typical prompt is a few hundred tokens, not millions. The cache that gives 0.38 BPB on a contiguous corpus gives essentially nothing on isolated queries.
The model is all you ship
What you deploy is what you have. The 16 MB artifact is the model. You can’t allocate an additional 256 MB of hash tables per user session — that would be 256 GB for a thousand concurrent users. The artifact size IS the model size. In deployment, there is no distinction between “artifact” and “eval-time state.”
Inference speed matters
The n-gram cache adds K hash lookups and K table updates per token, across every n-gram order. In our experiments, this roughly doubles eval time (606 s → 1,079 s for backoff 2-7). The overhead is constant — it doesn't get worse as the cache fills — but a flat 2× slowdown matters when your latency budget is 50–200 ms. You pay the per-token cost on every request, but you only get the BPB benefit after millions of tokens of contiguous corpus. On a 500-token prompt, you get the slowdown without the payoff.
A model that scores 0.38 BPB on the competition benchmark but 1.11 BPB in deployment is not a better model. It is a better test-taker.
Training constraints are for fairness. Inference constraints are for reality.
The competition limits training to 10 minutes on 8× H100s. This is not because real-world training is limited — it isn't. Companies train for weeks on thousands of GPUs. The training limit exists for a practical reason: keeping the competition accessible. Without it, the leaderboard would be dominated by whoever could afford the most compute. The 10-minute cap levels the playing field.
Inference is the opposite. In the real world, inference genuinely IS constrained — by hardware cost, by latency requirements, by the number of users you need to serve, by the device the model runs on. These are not artificial limits imposed for fairness. They are physics and economics.
But the competition doesn't reflect this. It gives evaluation the same 8× H100 cluster and 600 seconds of wall time. That's 640 GB of VRAM and enough time to build elaborate statistical models from the scored data. The result: techniques that would be impossible in any real deployment are not just allowed but dominant.
The irony: training constraints are artificial but inference constraints would be realistic. The competition constrains the phase where real-world resources are abundant, and leaves unconstrained the phase where real-world resources are scarce.
The competition implicitly asks: given N bytes of model, how well can you compress natural language?
Eval-time caching answers a different question: given N bytes of model plus unbounded eval-time memory, how well can you compress a specific fixed corpus?
If the competition wants to measure real-world model quality, inference during evaluation should be constrained the way inference is constrained in real life: limited hardware, limited memory, limited time per token.
The 16 MB artifact limit is the right idea. It just needs to extend to eval time. A model that fits in 16 MB but needs 272 MB to run doesn't fit in 16 MB.
Where the line blurs.
The competition already permits eval-time model growth. TTT and LoRA adaptation are approved — they also build state from scored tokens, though the growth is modest (~2 MB). The n-gram cache follows the same principle at 100× the scale.
The question is not whether causality is preserved (it is), but whether unbounded eval-time model growth is in the spirit of the 16 MB constraint.
Two ways to fix this.
1. Cap auxiliary eval-time state.
A subtlety: “cap total GPU memory” doesn't work. A 16 MB int6+compressed artifact decompresses into ~50–100 MB of bf16 weights in VRAM. Add activations and CUDA overhead, and the base model alone uses several hundred MB.
The right thing to constrain is auxiliary state: tensors that accumulate across the evaluation and are not derivable from the artifact alone. N-gram hash tables (192 MB), TTT LoRA deltas (2 MB), anything that persists across batches and grows with the corpus. Not model weights (deterministic decompression of the artifact), not KV cache (recomputed each window), not activations (transient).
Capping auxiliary state at 32 MB preserves everything currently approved (TTT LoRA at ~2 MB) while constraining the techniques that grow the effective model by 10–250×.
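A minimal sketch of how an eval harness could enforce such a cap. The ledger class and its method names are hypothetical, not an existing harness API; the byte figures are the ones quoted above.

```python
# Hypothetical harness-side check: the AuxLedger API is illustrative only.
AUX_CAP = 32 * 2**20                        # 32 MB cap, in bytes


class AuxLedger:
    """Sums bytes of state that persists across batches and grows with the
    corpus (n-gram tables, TTT/LoRA deltas). Decompressed weights, KV cache,
    and activations are excluded, per the definition of auxiliary state."""

    def __init__(self):
        self.entries = {}

    def register(self, name, num_bytes):
        self.entries[name] = num_bytes
        total = sum(self.entries.values())
        if total > AUX_CAP:
            raise MemoryError(
                f"auxiliary state {total >> 20} MB exceeds 32 MB cap ({name})"
            )


ledger = AuxLedger()
ledger.register("ttt_lora_deltas", 2 * 2**20)          # ~2 MB: allowed
rejected = False
try:
    ledger.register("ngram_hash_tables", 192 * 2**20)  # 192 MB: rejected
except MemoryError:
    rejected = True
```

The key design choice is accounting by category rather than by total VRAM, which sidesteps the decompression problem described above.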
2. Cap per-token overhead.
Require that eval-time techniques do not increase per-token latency by more than 50% over the base model forward pass on the same hardware. Not an absolute number — a ratio. Hardware-agnostic and easy to measure: run eval with and without the technique.
Base LM on 8× H100 takes 110 s. A 1.5× cap means 165 s max. The n-gram cache takes 401 s (3.6×). KV cache, TTT, LoRA are all well within 1.5×. This directly mirrors the real-world tradeoff: you can add eval-time tricks, but you can't blow up your serving cost.
Experimental Details
How the n-gram cache works.
After each token is scored by the base LM, the token and its preceding context are inserted into hash tables. When a future token's context matches a previously seen n-gram, the cached frequency is mixed with the neural prediction:
p_mix = (1 − α) · p_neural + α · p_ngram

The tables are built exclusively from already-scored tokens. No future tokens are accessed. Strict causality is preserved. The technique exploits three phenomena: deterministic BPE subword completion (orders 2-4), common English collocations (orders 4-6), and verbatim document repetition (orders 6+). Only the last is corpus-specific.
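A minimal single-order sketch of the score-then-insert loop. The bucket count, toy hash, and stand-in neural probability are illustrative, not the competition implementation:

```python
import numpy as np

B = 1 << 20                                   # bucket count (illustrative)
ctx_table = np.zeros(B, dtype=np.int64)       # counts of hash(ctx)
full_table = np.zeros(B, dtype=np.int64)      # counts of hash(ctx, tok)
ALPHA = 0.4


def h(*parts):
    # Toy multiplicative hash; any stable hash of the tuple works here.
    acc = 0
    for p in parts:
        acc = (acc * 1000003 + p) % B
    return acc


def mix(p_neural, ctx, tok):
    """Blend the neural prob with the cached hash ratio for one order."""
    denom = ctx_table[h(*ctx)]
    if denom == 0:
        return p_neural                       # no cached evidence yet
    p_ngram = full_table[h(*ctx, tok)] / denom
    return (1 - ALPHA) * p_neural + ALPHA * p_ngram


def update(ctx, tok):
    """Insert an already-scored token; called only AFTER scoring it."""
    ctx_table[h(*ctx)] += 1
    full_table[h(*ctx, tok)] += 1


# Score-then-insert loop: cache starts empty, causality is strict.
tokens, K = [3, 7, 7, 3, 7, 7, 3, 7], 2
probs = []
for i in range(K, len(tokens)):
    ctx, tok = tuple(tokens[i - K:i]), tokens[i]
    probs.append(mix(0.1, ctx, tok))          # 0.1 = stand-in neural prob
    update(ctx, tok)
```

On this toy repeating sequence the first pass sees an empty cache and returns the neural probability unchanged; once each context has been inserted, the blend jumps for the repeated tokens — the same dynamic that rewards verbatim repetition in the full corpus.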
Single-GPU ablation.
All runs use the same base model (~1.12 BPB, 27M parameters, 16 MB artifact). The only variable is the n-gram cache configuration. Single GPU, stride=64, FineWeb val (62M tokens).
8-GPU with all-reduce sync.
With 8× H100 and all-reduce sync of hash table deltas, every GPU has the full global cache. All-reduce overhead: 1.6 s total across ~4,000 batches. The 8-GPU BPB (0.4941) matches single-GPU (0.4923) within 0.002.
Prior entries (PRs #727, #788) used partitioned caches — each GPU sees only 1/8 of tokens — and scored ~0.98 BPB. The all-reduce trick closes a 0.50 BPB gap.
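The sync pattern can be simulated in numpy: each rank records count deltas for its shard of a batch, the deltas are summed across ranks (in practice via torch.distributed.all_reduce with ReduceOp.SUM), and every replica folds in the sum. Sizes and the random inserts below are illustrative:

```python
import numpy as np

B, WORLD = 1 << 16, 8                     # buckets (small, for illustration), ranks


def sync_step(per_rank_deltas):
    """One batch: sum count deltas across ranks (what all_reduce(SUM) does)."""
    return per_rank_deltas.sum(axis=0)


rng = np.random.default_rng(0)
deltas = np.zeros((WORLD, B), dtype=np.int64)
for r in range(WORLD):
    idx = rng.integers(0, B, size=100)    # rank r's n-gram inserts this batch
    np.add.at(deltas[r], idx, 1)          # handles repeated bucket indices

global_table = np.zeros(B, dtype=np.int64)  # replicated on every rank
global_table += sync_step(deltas)
# Every rank now holds all 8 shards' counts, not just its 1/8 partition.
```

Because only the per-batch deltas cross the interconnect, the communication cost stays small relative to the forward pass, consistent with the 1.6 s overhead reported above.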
Alpha sweep.
Prior work (PR #727) found α=0.40 optimal on partitioned caches. With a global cache, higher alpha is monotonically better — the n-gram is reliable enough that the model should defer to it more.
Order scaling law.
How much does increasing the max n-gram order help? We swept K from 2 to 20 with backoff and entropy-adaptive alpha.
More buckets is worse.
We expected more hash buckets to reduce collisions and improve accuracy. The opposite happened.
The hash ratio full_table[hash(ctx, tok)] / ctx_table[hash(ctx)] is not a conditional probability. With 1M buckets and 62M tokens, each bucket averages ~62 entries in both tables. The ratio of two similarly-populated buckets approaches 1.0 — this is P(cache_bin), not P(tok | ctx).
The blend only boosts the correct token's probability. If you computed it for all 1024 tokens, each would get a similar ratio near 1.0. After renormalization, the n-gram contribution washes out. The reported BPB scores are a measurement artifact from point-evaluating an invalid distribution.
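The washout can be simulated directly, under simplified assumptions (uniform hashing, Poisson bucket loads, a uniform neural prior over the 1024-token vocabulary):

```python
import numpy as np

T, B, V = 62_000_000, 1_000_000, 1024
load = T / B                                  # ~62 entries per bucket

rng = np.random.default_rng(1)
# Simulated hash ratios for every candidate token: numerator and denominator
# buckets carry similar collision mass, so every ratio sits near 1.0.
ratio = rng.poisson(load, V) / rng.poisson(load, V).clip(min=1)

alpha = 0.4
p_neural = np.full(V, 1 / V)

# (a) Invalid point evaluation: blend the ratio only at the scored token.
p_point = (1 - alpha) * p_neural[0] + alpha * ratio[0]   # ~0.4 instead of ~1/V

# (b) Valid distribution: blend for every token, then renormalize.
p_full = (1 - alpha) * p_neural + alpha * ratio / ratio.sum()
p_full /= p_full.sum()                        # ~uniform: the boost washes out
```

Point-evaluating gives the scored token a probability near 0.4 (about 1.3 bits), while the renormalized distribution stays essentially uniform (about 10 bits per token) — the gap between the reported scores and what a valid compressor could achieve.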
Stride decomposition.
Sliding window eval with small strides gives the model more context per token. Is the n-gram improvement just a side effect of that overlap? No. The n-gram delta is ~0.62 BPB regardless of stride:
Sliding window overlap contributes ~0.03 BPB (stride 64 vs 2048 without cache). The n-gram cache contributes ~0.62. They're orthogonal.
Other findings.
Logistic mixing is worse than linear
PAQ-style logistic mixing in stretched probability space gives 0.75 BPB vs linear’s 0.65. When n-gram confidence is low (early in corpus, rare contexts), the logistic transform amplifies low probabilities downward. Linear mixing is more robust.
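The failure mode is easy to reproduce with a single low-confidence n-gram estimate. The probabilities below are illustrative, not measured values:

```python
import math


def stretch(p):
    """Logit transform used by PAQ-style mixers."""
    return math.log(p / (1 - p))


def squash(x):
    """Inverse logit: map a stretched value back to a probability."""
    return 1 / (1 + math.exp(-x))


def mix_linear(p_n, p_g, a):
    return (1 - a) * p_n + a * p_g


def mix_logistic(p_n, p_g, a):
    # Mix in stretched (logit) space, then squash back to a probability.
    return squash((1 - a) * stretch(p_n) + a * stretch(p_g))


# Early in the corpus the cache is sparse, so p_ngram is tiny even when the
# neural model is right; the logit of 1e-4 is a large negative number.
p_neural, p_ngram, alpha = 0.30, 1e-4, 0.4
p_lin = mix_linear(p_neural, p_ngram, alpha)     # 0.18: mild penalty (~2.5 bits)
p_log = mix_logistic(p_neural, p_ngram, alpha)   # ~0.015: heavy penalty (~6 bits)
```

Linear mixing caps the damage at a weighted average, while the logit transform lets one tiny estimate drag the blend down by orders of magnitude — consistent with the 0.75 vs 0.65 BPB gap above.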
Entropy-adaptive alpha hurts with global cache
The sigmoid-gated alpha from PR #727 gives 0.65 BPB — 0.16 BPB worse than fixed α=0.40 (0.49). The entropy gate is designed for sparse partitioned caches and becomes too conservative when cache coverage is high.
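One plausible form of such a gate, to make the failure mode concrete — the function shape and parameters here are hypothetical, not PR #727's exact gate:

```python
import math


def gated_alpha(entropy_bits, a_max=0.4, midpoint=3.0, sharpness=2.0):
    """Sigmoid gate on the neural model's entropy: defer to the n-gram more
    when the model is uncertain. Hypothetical form and parameters."""
    return a_max / (1 + math.exp(-sharpness * (entropy_bits - midpoint)))


a_confident = gated_alpha(1.0)   # low-entropy token: alpha collapses to ~0.007
a_uncertain = gated_alpha(6.0)   # high-entropy token: alpha near a_max
```

With a dense global cache, verbatim repetition makes low-entropy tokens exactly where the n-gram is most reliable, so a gate that zeroes alpha on confident tokens gives up most of the cache's benefit.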
Three compression phenomena
The cache captures (a) deterministic BPE subword completion (orders 2-4), (b) common English collocations (orders 4-6), and (c) verbatim document repetition (orders 6+). Only (c) is corpus-specific.