Study · March 2026

The 16 MB Model That Becomes 272 MB at Eval Time

Parameter Golf caps the model artifact at 16 MB. But during evaluation, a technique called n-gram caching builds a statistical model from already-scored tokens that grows to 17× the artifact limit — costing zero artifact bytes. It cuts BPB from 1.11 to 0.38. This is a study of whether that should be legal, and what it means for the competition.


The 16 MB model becomes 272 MB during evaluation.

During evaluation, an n-gram cache builds hash tables from already-scored tokens. The tables are empty at start and grow as the 62M-token corpus is scored. None of this counts toward the 16 MB artifact limit — but it is the model doing the actual predicting.

[Chart: effective model size at eval time — 16 MB artifact; +192 MB (orders 2-7) = 208 MB; +256 MB (orders 2-9) = 272 MB; +4 GB (64M buckets) = 4.0 GB]
Dark bars = 16 MB artifact (constrained by rules). Red bars = eval-time hash tables (unconstrained). The effective model at eval time is 272 MB total.

With this technique, the leaderboard score drops from 1.11 to 0.38 BPB — a 66% reduction. The cache contributes 2/3 of the compression. It fits within the competition's 600-second eval budget on 8× H100 (401 s with all-reduce sync). And it costs zero artifact bytes.

The mechanism is fragile in a revealing way. More hash buckets make it worse — 256M buckets scores 1.11, barely better than no cache. The hash ratio full_table[hash(ctx, tok)] / ctx_table[hash(ctx)] is not a conditional probability. With small tables, both buckets average ~T/B entries, so the ratio approaches 1.0. The blend pushes the correct token's probability up, but it is inflating a hash ratio, not mixing in a real n-gram estimate. The reported BPB scores are not achievable by a valid compressor.

How a model actually runs in real life.

Consider what happens when a 16 MB language model ships. A company deploys it to serve user requests. The economics look something like this:

1

The hardware is fixed and shared

You rent a single GPU — maybe an A10G, maybe a T4, maybe a CPU-only instance to save money. The whole point of a small model is that you don’t need expensive hardware. You serve hundreds or thousands of concurrent users on that one machine. Memory is precious. Every megabyte of model state is a megabyte not available for batching requests.

2

Latency is the constraint, not throughput

A user sends a prompt and waits. You have maybe 50–200 ms to return the first token. The model must produce a prediction from a cold start on whatever context the user provides. There is no 62M-token corpus to build a cache over — a typical prompt is a few hundred tokens.

3

Each query is independent

User A asks about cooking recipes. User B asks about Python errors. User C submits legal text. There is no cross-document repetition to exploit. The n-gram cache starts empty for every request and stays nearly empty — a typical prompt is a few hundred tokens, not millions. The cache that gives 0.38 BPB on a contiguous corpus gives essentially nothing on isolated queries.

4

The model is all you ship

What you deploy is what you have. The 16 MB artifact is the model. You can’t allocate an additional 256 MB of hash tables per user session — that would be 256 GB for a thousand concurrent users. The artifact size IS the model size. In deployment, there is no distinction between “artifact” and “eval-time state.”

5

Inference speed matters

The n-gram cache adds K hash lookups and K table updates per token, across every n-gram order. In our experiments, this roughly doubles eval time (606s → 1,079s for backoff 2-7). The overhead is constant — it doesn’t get worse as the cache fills — but a flat 2× slowdown matters when your latency budget is 50–200ms. You pay the per-token cost on every request, but you only get the BPB benefit after millions of tokens of contiguous corpus. On a 500-token prompt, you get the slowdown without the payoff.

A model that scores 0.38 BPB on the competition benchmark but 1.11 BPB in deployment is not a better model. It is a better test-taker.

Training constraints are for fairness. Inference constraints are for reality.

The competition limits training to 10 minutes on 8× H100s. This is not because real-world training is limited — it isn't. Companies train for weeks on thousands of GPUs. The training limit exists for a practical reason: keeping the competition accessible. Without it, the leaderboard would be dominated by whoever could afford the most compute. The 10-minute cap levels the playing field.

Inference is the opposite. In the real world, inference genuinely IS constrained — by hardware cost, by latency requirements, by the number of users you need to serve, by the device the model runs on. These are not artificial limits imposed for fairness. They are physics and economics.

But the competition doesn't reflect this. It gives evaluation the same 8× H100 cluster and 600 seconds of wall time. That's 640 GB of VRAM and enough time to build elaborate statistical models from the scored data. The result: techniques that would be impossible in any real deployment are not just allowed but dominant.

                     Competition            Deployment
Training compute     Limited (fairness)     Unlimited
Inference compute    Unlimited*             Limited (economics)
Inference hardware   8× H100, 640 GB        1 GPU or CPU
Inference time       600 s for 62M tokens   <200 ms per request
Eval-time memory     Unconstrained          Shared with users
Corpus structure     Fixed, repetitive      Independent queries
*10-minute wall clock, but 8× H100 with 640 GB VRAM is effectively unconstrained for a 16 MB model.

The irony: training constraints are artificial but inference constraints would be realistic. The competition constrains the phase where real-world resources are abundant, and leaves unconstrained the phase where real-world resources are scarce.

The competition implicitly asks: given N bytes of model, how well can you compress natural language?

Eval-time caching answers a different question: given N bytes of model plus unbounded eval-time memory, how well can you compress a specific fixed corpus?

If the competition wants to measure real-world model quality, inference during evaluation should be constrained the way inference is constrained in real life: limited hardware, limited memory, limited time per token.

The 16 MB artifact limit is the right idea. It just needs to extend to eval time. A model that fits in 16 MB but needs 272 MB to run doesn't fit in 16 MB.

Where the line blurs.

The competition already permits eval-time model growth. TTT and LoRA adaptation are approved — they also build state from scored tokens, though the growth is modest (~2 MB). The n-gram cache follows the same principle at 100× the scale.

[Chart, log scale: eval-time state by technique — N-gram cache (2-9, 64M buckets): 4 GB; N-gram cache (2-7): 192 MB; KV cache (sliding window): 20 MB; Per-doc LoRA TTT (8 ep): 2 MB; Score-first TTT: 2 MB; 16 MB limit marked]
Eval-time state growth across approved and pending techniques. Dashed line marks the 16 MB artifact limit. The n-gram cache is the same principle as TTT — build state from scored tokens — but at a different scale.

The question is not whether causality is preserved (it is), but whether unbounded eval-time model growth is in the spirit of the 16 MB constraint.

Two ways to fix this.

1. Cap auxiliary eval-time state.

A subtlety: “cap total GPU memory” doesn't work. A 16 MB int6+compressed artifact decompresses into ~50–100 MB of bf16 weights in VRAM. Add activations and CUDA overhead, and the base model alone uses several hundred MB.

The right thing to constrain is auxiliary state: tensors that accumulate across the evaluation and are not derivable from the artifact alone. N-gram hash tables (192 MB), TTT LoRA deltas (2 MB), anything that persists across batches and grows with the corpus. Not model weights (deterministic decompression of the artifact), not KV cache (recomputed each window), not activations (transient).

A cap of auxiliary state ≤ 32 MB preserves everything currently approved (TTT LoRA at ~2 MB) while constraining the techniques that grow the effective model by 10–250×.
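As a sketch of how such a cap could be enforced, the harness could require submissions to declare their persistent buffers and tally them. Everything here (function name, buffer registry, sizes drawn from the figures above) is illustrative, not part of the actual competition harness:

```python
# Hypothetical audit of auxiliary eval-time state against the proposed cap.
# "Auxiliary" = persists across batches and is not derivable from the artifact.
AUX_CAP_BYTES = 32 * 2**20  # proposed 32 MB cap

def audit_aux_state(registered_buffers):
    """registered_buffers: {name: size_in_bytes} declared by the submission."""
    total = sum(registered_buffers.values())
    return total, total <= AUX_CAP_BYTES

# TTT LoRA deltas (~2 MB) pass; the 2-7 n-gram tables (192 MB) do not.
ttt_total, ttt_ok = audit_aux_state({"lora_deltas": 2 * 2**20})
ngram_total, ngram_ok = audit_aux_state({"ngram_tables": 192 * 2**20})
```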

2. Cap per-token overhead.

Require that eval-time techniques do not increase per-token latency by more than 50% over the base model forward pass on the same hardware. Not an absolute number — a ratio. Hardware-agnostic and easy to measure: run eval with and without the technique.

Base LM on 8× H100 takes 110 s. A 1.5× cap means 165 s max. The n-gram cache takes 401 s (3.6×). KV cache, TTT, LoRA are all well within 1.5×. This directly mirrors the real-world tradeoff: you can add eval-time tricks, but you can't blow up your serving cost.
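The check is a one-liner; a minimal sketch using the timings quoted above (the function name is made up):

```python
# Ratio-based latency cap: run eval with and without the technique, compare.
OVERHEAD_CAP = 1.5

def within_latency_cap(t_with_technique_s, t_base_s, cap=OVERHEAD_CAP):
    return t_with_technique_s / t_base_s <= cap

# Timings from the text: base LM 110 s on 8× H100, n-gram cache 401 s.
ngram_ok = within_latency_cap(401, 110)   # 3.6x — fails the 1.5x cap
```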


Experimental Details

How the n-gram cache works.

After each token is scored by the base LM, the token and its preceding context are inserted into hash tables. When a future token's context matches a previously seen n-gram, the cached frequency is mixed with the neural prediction:

p_mix = (1 − α) · p_neural + α · p_ngram

The tables are built exclusively from already-scored tokens. No future tokens are accessed. Strict causality is preserved. The technique exploits three phenomena: deterministic BPE subword completion (orders 2-4), common English collocations (orders 4-6), and verbatim document repetition (orders 6+). Only the last is corpus-specific.
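A minimal single-order sketch of the mechanism, assuming one pair of count tables per order (class name and bucket count are illustrative; the real runs use orders 2-K with backoff and per-order tables):

```python
# Single-order eval-time n-gram cache: ctx_table counts hashed contexts,
# full_table counts hashed (context, token) pairs; their ratio stands in
# for p_ngram in the mixing formula above.
class NGramCache:
    def __init__(self, order=4, buckets=1 << 20, alpha=0.4):
        self.order, self.B, self.alpha = order, buckets, alpha
        self.ctx_table = [0] * buckets    # counts of hashed contexts
        self.full_table = [0] * buckets   # counts of hashed (context, token)

    def _h(self, items):
        return hash(tuple(items)) % self.B

    def update(self, history, token):
        """Insert the just-scored token under its preceding context."""
        if len(history) >= self.order - 1:
            ctx = tuple(history[-(self.order - 1):])
            self.ctx_table[self._h(ctx)] += 1
            self.full_table[self._h(ctx + (token,))] += 1

    def mix(self, history, token, p_neural):
        """p_mix = (1 - alpha) * p_neural + alpha * p_ngram."""
        if len(history) < self.order - 1:
            return p_neural
        ctx = tuple(history[-(self.order - 1):])
        denom = self.ctx_table[self._h(ctx)]
        if denom == 0:
            return p_neural  # context never seen: fall back to the neural LM
        p_ngram = min(1.0, self.full_table[self._h(ctx + (token,))] / denom)
        return (1 - self.alpha) * p_neural + self.alpha * p_ngram

# Toy usage: once (3, 5) -> 7 has been scored and cached, the blend boosts it.
cache = NGramCache(order=3, buckets=1 << 16, alpha=0.4)
cache.update([3, 5], 7)
p = cache.mix([3, 5], 7, p_neural=0.1)   # boosted well above 0.1
```

In the scoring loop, `mix` is called before `update` at each position, so the tables only ever contain already-scored tokens and causality holds.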

Single-GPU ablation.

All runs use the same base model (~1.12 BPB, 27M parameters, 16 MB artifact). The only variable is the n-gram cache configuration. Single GPU, stride=64, FineWeb val (62M tokens).

Config                        BPB
Backoff 2-9, order-adpt       0.3779
Backoff 2-7, entropy-adpt     0.6535
Backoff 2-7, α=0.40           0.4923
Fixed 7-gram, α=0.40          0.5234
N-gram only (no neural)       1.0615
Base LM (float, pre-quant)    1.1109
Base LM (int6 quantized)      1.1142
Bits per byte, lower is better. Dashed line marks the quantized leaderboard score. The best config (backoff 2-9) reaches 0.3779. The n-gram cache alone (no base LM) beats the trained model: 1.06 vs 1.11.

8-GPU with all-reduce sync.

With 8× H100 and all-reduce sync of hash table deltas, every GPU has the full global cache. All-reduce overhead: 1.6 s total across ~4,000 batches. The 8-GPU BPB (0.4941) matches single-GPU (0.4923) within 0.002.

Config                  BPB
Backoff 2-7, α=0.80     0.3942
Backoff 2-9, α=0.40     0.4548
Backoff 2-7, α=0.40     0.4941
Base LM (8-GPU, int6)   1.1130
8-GPU results with global cache via all-reduce sync. All runs under the 600s eval budget.

Prior entries (PRs #727, #788) used partitioned caches — each GPU sees only 1/8 of tokens — and scored ~0.98 BPB. The all-reduce trick closes a 0.50 BPB gap.
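The sync itself is just an elementwise sum of per-rank count deltas — on the cluster presumably a `torch.distributed.all_reduce` over the tables — sketched here without GPUs (function name and toy tables are illustrative):

```python
# Each rank scores 1/8 of the batch and accumulates count deltas locally;
# summing deltas across ranks gives every rank the full global cache.
def all_reduce_sum(deltas_per_rank):
    merged = [0] * len(deltas_per_rank[0])
    for delta in deltas_per_rank:
        for i, v in enumerate(delta):
            merged[i] += v
    return merged

# Two "ranks", 8-bucket tables: after the reduce, both see all counts.
rank0_delta = [1, 0, 2, 0, 0, 0, 0, 0]
rank1_delta = [0, 3, 0, 0, 1, 0, 0, 0]
global_delta = all_reduce_sum([rank0_delta, rank1_delta])
```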

Alpha sweep.

Prior work (PR #727) found α=0.40 optimal on partitioned caches. With a global cache, higher alpha is monotonically better — the n-gram is reliable enough that the model should defer to it more.

α       BPB
0.20    0.6180
0.40    0.4941
0.60    0.4263
0.80    0.3942
8× H100 alpha sweep. Monotonic improvement from α=0.20 to α=0.80.

Order scaling law.

How much does increasing the max n-gram order help? We swept K from 2 to 20 with backoff and entropy-adaptive alpha.

[Plot: BPB vs max n-gram order (2–20), with neural baseline line; values along the sweep: 1.14, 1.10, 0.79, 0.65, 0.63, 0.62, 0.62, 0.62]
BPB vs max n-gram order. Most of the gain comes from orders 2–7. Going from 7 to 20 gains 0.04 BPB. Diminishing returns, but each additional order adds another pair of hash tables to memory.

More buckets is worse.

We expected more hash buckets to reduce collisions and improve accuracy. The opposite happened.

Buckets   BPB
1M        0.5793
4M        0.6535
64M       1.0629
256M      1.1123
BPB vs hash table size. 1M buckets beats 4M. 64M and 256M are barely better than no cache at all.

The hash ratio full_table[hash(ctx, tok)] / ctx_table[hash(ctx)] is not a conditional probability. With 1M buckets and 62M tokens, each bucket averages ~62 entries in both tables. The ratio of two similarly-populated buckets approaches 1.0 — this is P(cache_bin), not P(tok | ctx).

The blend only boosts the correct token's probability. If you computed it for all 1024 tokens, each would get a similar ratio near 1.0. After renormalization, the n-gram contribution washes out. The reported BPB scores are a measurement artifact from point-evaluating an invalid distribution.
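A quick simulation of the collision argument, with hypothetical T and B (random bucket inserts stand in for real n-grams):

```python
# With ~T/B entries per bucket in both tables, the ratio
# full_table[h(ctx, tok)] / ctx_table[h(ctx)] drifts toward 1.0
# even for (ctx, tok) pairs that were never inserted.
import random

def mean_hash_ratio(T, B, queries=1000, seed=0):
    rng = random.Random(seed)
    ctx_table = [0] * B
    full_table = [0] * B
    for _ in range(T):
        ctx_table[rng.randrange(B)] += 1
        full_table[rng.randrange(B)] += 1
    ratios = []
    for _ in range(queries):          # query random, never-matched pairs
        denom = ctx_table[rng.randrange(B)]
        if denom:
            ratios.append(min(1.0, full_table[rng.randrange(B)] / denom))
    return sum(ratios) / len(ratios)

r = mean_hash_ratio(T=620_000, B=10_000)   # ~62 entries per bucket: ratio near 1
```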

Stride decomposition.

Sliding window eval with small strides gives the model more context per token. Is the n-gram improvement just a side effect of that overlap? No. The n-gram delta is ~0.62 BPB regardless of stride:

Stride   Baseline   With cache   Delta
64       1.1109     0.4923       −0.619
256      1.1132     0.4930       −0.620
2048     1.1367     0.5005       −0.636

Sliding window overlap contributes ~0.03 BPB (stride 64 vs 2048 without cache). The n-gram cache contributes ~0.62. They're orthogonal.

Other findings.

1

Logistic mixing is worse than linear

PAQ-style logistic mixing in stretched (log-odds) probability space gives 0.75 BPB vs linear’s 0.65. When n-gram confidence is low (early in the corpus, rare contexts), mixing in log-odds space drags the blend sharply toward the n-gram’s near-zero estimates. Linear mixing is more robust.
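To see why, compare the two mixing rules on a toy case (function names and the example probabilities are made up):

```python
# Linear vs PAQ-style logistic mixing of two probability estimates.
import math

def stretch(p):   # logit: probability -> log-odds
    return math.log(p / (1 - p))

def squash(x):    # sigmoid: log-odds -> probability
    return 1 / (1 + math.exp(-x))

def mix_linear(p_neural, p_ngram, alpha):
    return (1 - alpha) * p_neural + alpha * p_ngram

def mix_logistic(p_neural, p_ngram, alpha):
    return squash((1 - alpha) * stretch(p_neural) + alpha * stretch(p_ngram))

# A confident neural 0.5 blended with an unreliable near-zero n-gram estimate:
p_lin = mix_linear(0.5, 0.001, 0.4)      # 0.3004 — degrades gently
p_log = mix_logistic(0.5, 0.001, 0.4)    # ≈ 0.06 — dragged far down
```

The logit of 0.001 is about −6.9, so even a 0.4 weight on it pulls the mixed log-odds deep into improbable territory, while the linear blend never drops below 0.6 · p_neural.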

2

Entropy-adaptive alpha hurts with global cache

The sigmoid-gated alpha from PR #727 gives 0.65 BPB — 0.16 BPB worse than fixed α=0.40 (0.49). The entropy gate is designed for sparse partitioned caches and becomes too conservative when cache coverage is high.

3

Three compression phenomena

The cache captures (a) deterministic BPE subword completion (orders 2-4), (b) common English collocations (orders 4-6), and (c) verbatim document repetition (orders 6+). Only (c) is corpus-specific.

Base model: PR #549 (val_bpb 1.1194). N-gram cache concept: PR #727, #779, #788.

Code: experiments/eval_time_mixing/