PR #1420 · April 2026

Autopsy of a Triple-Loop U-Net GPT

An 80x range in quantization sensitivity across 67 matrices. Loop layers amplify rounding error 2.2x. The third loop pass is clearly worth it (contraction ratio 0.634), but a fourth would add only 25% of the first pass's contribution. All 8 U-Net skip connections are load-bearing — the first encoder loop pass is the most important information source for the decoder, with 3–8x higher skip weight norms than late-layer connections.

Baseline BPB

5.520

Matrices

67

Loop Layers

4, 5

Loop Passes

3 (triple)

Skip Connections

8 (U-Net)


80x Range in Quantization Sensitivity

Per-matrix rate-distortion: BPB damage per compressed KB

For each of the 67 GPTQ-quantized matrices, quantize only that matrix (rest stay float32) and measure the BPB increase per compressed KB. This “pain per byte” slope reveals which matrices are the bottleneck — and which can be compressed aggressively for free. The most sensitive matrix (block 4 c_v) costs 80x more BPB per byte than the least sensitive.
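The sweep reduces to a slope computation over (ΔBPB, compressed size) pairs. A minimal sketch follows; the matrix names and numbers are illustrative stand-ins, not the real 67-matrix measurements:

```python
def sensitivity_slopes(measurements):
    """Map each matrix name to its 'pain per byte' slope: BPB damage per compressed KB."""
    return {name: d_bpb / kb for name, (d_bpb, kb) in measurements.items()}

# Illustrative numbers only -- the real sweep quantizes one matrix at a time
# (all others kept in float32) and measures the BPB increase vs. the float32 baseline.
measurements = {
    "blocks.4.attn.c_v": (0.0120, 520.0),   # (delta BPB, compressed KB)
    "blocks.5.attn.c_k": (0.0060, 520.0),
    "blocks.2.mlp.c_fc": (0.0003, 2080.0),  # cheap to compress aggressively
}
slopes = sensitivity_slopes(measurements)
ranked = sorted(slopes, key=slopes.get, reverse=True)
print(ranked[0])  # the steepest-slope matrix is the quantization bottleneck
```

Ranking by this slope is what drives the mixed-precision allocation: steep-slope matrices earn extra bits, shallow-slope ones give them up.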

Slope = BPB damage per compressed KB. Higher = more sensitive to quantization. Baseline: 5.520323 BPB. * = loop layer. Loop layers average 3.2x higher slope than non-loop.
Average slope by matrix type across all layers (attention K/V vs. other). Value projections carry the most sensitive information per byte.

Key Insight

Value projections (c_v) are the most sensitive type at 19.7 × 10⁻⁶ avg slope — 5.5x more than queries (c_q) at 3.55 × 10⁻⁶. 6 of the top 8 most sensitive matrices are in loop layers (4, 5). For mixed-precision quantization: give int8 to the top-5 matrices and int5 to the bottom-10 to save ~0.005 BPB at the same compressed size.

Recurrence Amplifies Quantization Error 2.2x

Block-pair sensitivity + fixed-point convergence analysis

Blocks 4–5 are traversed three times in the triple-loop encoder. Does quantization error compound through iterations? We quantize block pairs in isolation and compare damage per parameter. The loop pair (4, 5) shows 2.2x higher D/Mparam than the average non-loop pair — recurrence amplifies every bit of rounding error.
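The pair comparison normalizes damage by parameter count so pairs of different sizes are comparable. A sketch with made-up damage numbers (the real analysis quantizes each pair in isolation and measures ΔBPB):

```python
def damage_per_mparam(delta_bpb, n_params):
    """Normalize quantization damage by parameter count (BPB per million params)."""
    return delta_bpb / (n_params / 1e6)

# Illustrative: assume each pair has ~6.3M params; the damages are synthetic.
pair_damage = {
    (4, 5): damage_per_mparam(0.022, 6.3e6),   # the triple-loop pair
    (0, 1): damage_per_mparam(0.011, 6.3e6),
    (8, 9): damage_per_mparam(0.009, 6.3e6),
}
non_loop = [v for k, v in pair_damage.items() if k != (4, 5)]
amplification = pair_damage[(4, 5)] / (sum(non_loop) / len(non_loop))
print(round(amplification, 1))  # 2.2 with these synthetic inputs
```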

Quantization damage per million parameters, by block pair. The loop pair (4,5) shows a 2.2x amplification factor over non-loop pairs.
Representation change per loop pass. Pass 2 moves the representation 63.4% as far as pass 1 on average (contraction ratio = 0.634). Each subsequent pass contributes diminishing returns.

Key Insight

The contraction ratio of 0.634 means each loop pass changes the representation by only 63.4% of what the previous pass changed. A hypothetical pass 4 (quad loop) would therefore add only 0.634³ ≈ 25% of pass 1's contribution, not worth the training step cost. Pass 3, contributing 0.634² ≈ 40% of pass 1's, still clearly justifies the triple loop. Loop layers also need MORE bits in mixed-precision, not equal: the 2.2x amplification means every bit of rounding error in layers 4–5 hurts 2.2x more than in other layers.
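The contraction ratio can be estimated directly from hidden states captured after each loop pass. Below is a sketch on synthetic states constructed to shrink by exactly the reported ratio of 0.634 (the real measurement would hook the model's loop layers):

```python
import numpy as np

def contraction_ratios(states):
    """states[k] = hidden state after pass k (states[0] = loop input).
    Returns ||delta_{k+1}|| / ||delta_k|| for consecutive pass deltas."""
    deltas = [np.linalg.norm(b - a) for a, b in zip(states, states[1:])]
    return [d2 / d1 for d1, d2 in zip(deltas, deltas[1:])]

rng = np.random.default_rng(0)
r = 0.634
states = [rng.standard_normal(512)]
step = rng.standard_normal(512)
for _ in range(3):                     # three loop passes
    states.append(states[-1] + step)
    step = r * step                    # each pass moves 63.4% as far as the last
print(contraction_ratios(states))      # ~[0.634, 0.634]
# Projected contribution of a hypothetical pass 4: r**3 ~ 0.255 of pass 1.
```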

All 8 Skip Connections Are Load-Bearing

U-Net skip gate values and weight norms

The U-Net architecture has 8 skip connections with learned sigmoid gates: x_new = lerp(scaled_skip, x_decoder, g). A gate near 0 means the encoder skip dominates; near 1 means the decoder ignores the skip. If any skip is dead (gate ≈ 1), it's free parameter savings.
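A minimal sketch of the gated blend, assuming a scalar (or per-dimension) learned gate logit; the helper name is illustrative:

```python
import numpy as np

def gated_skip(scaled_skip, x_decoder, gate_logit):
    """x_new = lerp(scaled_skip, x_decoder, g) with g = sigmoid(gate_logit).
    g -> 0: encoder skip dominates; g -> 1: decoder ignores the skip."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    return (1.0 - g) * scaled_skip + g * x_decoder

skip = np.ones(4)
dec = np.zeros(4)
# A gate logit near +0.5 gives g ~ 0.62, the ~35/65 encoder/decoder blend observed.
blended = gated_skip(skip, dec, 0.5)
# A very negative logit drives g -> 0, so the output collapses to the skip.
encoder_dominant = gated_skip(skip, dec, -20.0)
```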

Sigmoid gate values per skip connection. Gate = 0 means encoder dominates; gate = 1 means decoder dominates. All gates cluster around 0.61–0.70 (~35% encoder / 65% decoder blend). Error bars show per-dimension std.
Weight norms of skip projection matrices. Skips 2–3 (first encoder loop pass) have 3–8x higher norms than late-layer skips: the first pass is the most important information source for the decoder.

Key Insight

No dead skips. All 8 gates cluster at 0.61–0.70 (~35% encoder / 65% decoder blend). The critical finding is in the weight norms: skips 2–3 (first encoder loop pass → decoder) have norms of 21.9 and 19.5 — 3–8x higher than late-layer skips (2.8–4.2). The first loop pass is the dominant information conduit to the decoder. The second loop pass (skips 0–1) has lower norms (6.9–13.8), suggesting it refines rather than establishes the decoder's input.

Further Exploration

Reading the Model's Mind

50-token prompts from validation data, temp=0.8 — same prompt and seed for both models

The most direct form of interpretability: compare what the float32 and quantized models generate from identical prompts. Where they agree, the quantized representations are faithful. Where they diverge, rounding noise has pushed the model into a different region of generation space.
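Agreement between the two generations can be quantified by the length of the shared prefix before the samples drift apart. A sketch with toy token ids (the real comparison uses the sampled token ids from both models):

```python
def divergence_index(a, b):
    """First position where two token sequences differ (length of the
    shorter sequence if one is a prefix of the other)."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))

# Toy ids standing in for the float32 vs. int6 samples: identical opening,
# then quantization noise pushes sampling onto a different trajectory.
float32_tokens   = [17, 203, 45, 45, 991, 12, 7]
quantized_tokens = [17, 203, 45, 45, 310, 8, 7]
print(divergence_index(float32_tokens, quantized_tokens))  # 4
```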

Float32 vs Quantized Generation

50-token prompt from validation data, temp=0.8. Same prompt, same seed — differences come from quantization noise.

Prompt (50 tokens)
Insurance Company Declares Living Man Dead George Johannesen is very much alive. Which is why it was so surprising when the Canadian man received a letter addressed “To the Estate of George Johannesen
Ground truth
Even more surprising is that it came from his insurance company, who should really be on top of such things. Now this wouldn’t have been so terrible if Manitoba Public Insurance was giving Johannesen’s estate a fat check for his passing away. But that’s not what happened. Instead the letter was to
Float32 model
George Johannesen was a retired professional who was a lucrative investor in Canada. His estate was worth about $1 billion. His death last month at the age of 73 was a direct shock to the entire estate and he was still alive. That is why he was so shocked. In 2005 he was a member of the Canadian As
Quantized (int6)
George Johannesen was a retired professional who was a lucrative investor in Canada. His estate was worth about $1 billion. His estate had to be repaired after his death. He was still living at the estate and he was still alive. That is a very shocking story. I have never seen someone living the li

Key Insight

Both models produce coherent, topically relevant text — but diverge in specific facts and phrasing. Quantization shifts the model into parallel narratives rather than producing gibberish. The float32 and quantized outputs often share the same opening (Seq 3: identical first sentence, Seq 7: same opening clause) before drifting apart. This is consistent with the contraction ratio finding: the model converges to similar representations, but small quantization perturbations compound through autoregressive generation.

Actionable Conclusions

  1. Mixed-precision quantization is the #1 opportunity. The 80x sensitivity range means uniform int6 wastes bits. Giving int8 to the top-5 matrices (blocks.4.attn.c_v, blocks.5.attn.c_k, blocks.0.attn.c_k, blocks.8.attn.c_v, blocks.4.attn.proj) and int5 to the bottom-10 could save ~0.005 BPB at the same compressed size.
  2. Loop layers need priority in bit allocation. The 2.2x amplification factor means every bit of rounding error in layers 4–5 hurts 2.2x more than in other layers. These layers should get more bits, not equal.
  3. attn.c_v is the bottleneck type. 19.7 × 10⁻⁶ avg slope vs 3.55 × 10⁻⁶ for attn.c_q. Value projections carry the most quantization-sensitive information. Across the network, c_v deserves more bits than c_q.
  4. Triple loop is near-optimal. Contraction ratio 0.634. A 4th pass would add only 0.634³ ≈ 25% of what the 1st adds — not worth the training step cost. But the 3rd pass still adds 63% of what the 2nd does — clearly worth it.
  5. No skip connections to prune. All are active with similar gate values. The U-Net structure is load-bearing. The first loop pass is the most important skip source.
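The bit-allocation policy in point 1 can be sketched as a simple ranking over sensitivity slopes; the thresholds (top-5 to int8, bottom-10 to int5) match the text, while the slope values below are synthetic:

```python
def allocate_bits(slopes, n_hi=5, n_lo=10, hi=8, default=6, lo=5):
    """Rank matrices by sensitivity slope; the top n_hi get int8,
    the bottom n_lo get int5, and everything else stays at int6."""
    ranked = sorted(slopes, key=slopes.get, reverse=True)
    bits = {name: default for name in ranked}
    for name in ranked[:n_hi]:
        bits[name] = hi
    for name in ranked[-n_lo:]:
        bits[name] = lo
    return bits

# Synthetic slopes for 20 matrices, m0 steepest through m19 shallowest.
slopes = {f"m{i}": 1.0 / (i + 1) for i in range(20)}
bits = allocate_bits(slopes)
print(bits["m0"], bits["m7"], bits["m19"])  # 8 6 5
```

Because total size is roughly conserved (bits moved from shallow-slope to steep-slope matrices), the BPB saving comes at the same compressed footprint.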

11L GPT, 512d, Triple Loop (4,5), Parallel Residual (7+), U-Net 8-skip, Fused Kernels

GPTQ int6, Brotli, seed 1234 · Baseline BPB: 5.5203