
Autopsy of a Triple-Loop U-Net GPT

80x range in quantization sensitivity. Loop layers amplify rounding error 2.2x. The third loop pass is worth it, the fourth isn't. All 8 U-Net skips are load-bearing.

April 6, 2026

Mining 975 Expensive Training Runs

975 H100 training runs from Parameter Golf: what the data reveals about prediction, which techniques matter, and why training outcomes are visible at step 1,000.

March 30, 2026

Autopsy of a SOTA Parameter-Golf GPT, Round 2

PR #1019 said MLP was parameter-starved. We expanded it, switched to mixed int5/int6, and hit 1.1086 BPB. Calibration degraded as a side effect.

March 30, 2026

Autopsy of a SOTA Parameter-Golf GPT

Q matrices have condition numbers up to 54,000×. They're not the fragile ones. Trained a 27M GPT to 1.134 BPB on FineWeb — merged as #1 on the leaderboard — then cut it open.

March 28, 2026

How to Clean Up All the Parameter Golf Submissions

N-gram caching in a language model competition produces probability distributions that sum to 277, not 1. We computed the full distribution to prove it.

March 26, 2026

Does Training Data Selection Help for LM Pre-training?

5 experiments on FineWeb: on a well-curated dataset, diversity beats selection.

March 25, 2026

Unseen Opportunities: Upwork + GPT

Summer 2022. I quit my lucrative AI job not out of dissatisfaction, but out of exhilaration. The past couple of years had been brimming with happiness, which sparked an awakening: if I did not leap into the unknown, doing so might eventually become difficult or even impossible. Within two hours of this epiphany, I sent my company my two-week notice.

July 8, 2023

Why Does Batch Normalization Work?

Batch Normalization is often explained as a method for reducing “Internal Covariate Shift,” but growing evidence points instead to its ability to smooth the optimization landscape and make training more robust. Through interactive demos and reproducible experiments, this overview shows how Batch Normalization enhances convergence, remains effective even under artificially introduced covariate shifts, and reduces dependence on initialization—ultimately speeding up and stabilizing neural network training.
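
For context, the operation the post dissects is per-feature standardization of a batch followed by a learnable rescale and shift. A minimal NumPy sketch of that forward pass (names and the toy data are illustrative, not taken from the post's demos):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale and shift."""
    mu = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero-mean, unit-variance activations
    return gamma * x_hat + beta              # learnable scale/shift restore expressiveness

# Toy usage: a badly scaled batch of 4 examples with 3 features.
x = np.random.randn(4, 3) * 10 + 5
gamma, beta = np.ones(3), np.zeros(3)
y = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))         # roughly 0 mean, 1 std per feature
```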

January 15, 2019