Reiner Pope – The math behind how LLMs are trained and served
APR 29, 2026 · 133 MIN
Description
<p>We did a very different format with Reiner Pope: a blackboard lecture where he walks through how frontier LLMs are trained and served.</p><p>It’s shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk.</p><p>It’s a bit technical, but I encourage you to hang in there; it’s really worth it.</p><p>There are fewer than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner. It was a real delight to learn from him.</p><p>I recommend watching this one on <a target="_blank" href="https://youtu.be/xmkSf5IS-zw">YouTube</a> so you can see the chalkboard.</p><p><a target="_blank" href="https://reiner.org/">Reiner</a> is CEO of <a target="_blank" href="https://matx.com/">MatX</a>, a new chip startup (full disclosure: I’m an angel investor). He was previously at Google, where he worked on <a target="_blank" href="https://arxiv.org/abs/2211.05102">software</a> <a target="_blank" href="https://jax-ml.github.io/scaling-book/">efficiency</a>, compilers, and TPU architecture.</p><p>Download a markdown version of the transcript <a target="_blank" href="https://gist.github.com/dwarkeshsp/79100f0fdeed69d76241903bb0604dbe">here</a> to chat with an LLM.</p><p>I also wrote up some <a target="_blank" href="https://reiner-flashcards.vercel.app/">flashcards and practice problems</a> to help myself retain what Reiner taught. Hope they’re helpful to you too!</p><p><strong>Sponsors</strong></p><p>* <a target="_blank" href="https://janestreet.com/dwarkesh">Jane Street</a> needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation, which touched on everything from FPGAs to liquid cooling, was extremely helpful as I prepped to interview Reiner. You can watch the full discussion and explore Jane Street’s open roles at <a target="_blank" href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a></p><p>* <a target="_blank" href="https://goo.gle/Gemma4">Google’s Gemma 4</a> is the first open model that’s let me shut off the internet and create a fully disconnected “focus machine”. This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner’s scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at <a target="_blank" href="https://goo.gle/Gemma4">goo.gle/Gemma4</a></p><p>* <a target="_blank" href="https://cursor.com/dwarkesh">Cursor</a> helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn’t sure of the best way to visualize the concept, but Cursor’s Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in <a target="_blank" href="https://www.dwarkesh.com/p/what-i-learned-april-15">my recent blog post</a>.
And if you have something to visualize yourself, go to <a target="_blank" href="https://cursor.com/dwarkesh">cursor.com/dwarkesh</a></p><p><strong>Timestamps</strong></p><p>(00:00:00) – How batch size affects token cost and speed</p><p>(00:32:09) – How MoE models are laid out across GPU racks</p><p>(00:47:12) – How pipeline parallelism spreads model layers across racks</p><p>(01:03:37) – Why Ilya said, “As we now know, pipelining is not wise.”</p><p>(01:18:59) – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal</p><p>(01:33:02) – Deducing long context memory costs from API pricing</p><p>(02:04:02) – Convergent evolution between neural nets and cryptography</p>