Fail Small, IaC Control Planes, and Automated RCA

JAN 3, 2026 | 17 MIN
Ship It Weekly - DevOps, SRE, and Platform Engineering News

Description

<p>This week on <strong>Ship It Weekly</strong>, Brian kicks off the new year with one theme: automation is getting faster, and that makes blast radius and oversight matter more than ever.</p><p>We start with Cloudflare’s “fail small” mindset. The core idea is simple: big outages usually come from correlated failure, not one box dying. If a bad change lands everywhere at once, you’re toast. “Fail small” is about forcing problems to stay local so you can stop the bleeding before it becomes global.</p><p>Next is Pulumi’s push to be the control plane for all your IaC, including Terraform and HCL. The interesting part isn’t syntax wars. It’s the workflow layer: approvals, policy enforcement, audit trails, drift, and how teams standardize without signing up for a multi-year rewrite.</p><p>Third is Meta’s DrP, a root cause analysis platform that turns repeated incident investigation steps into software. Even if you’re not Meta, the pattern is worth stealing: automate the first 10–15 minutes of your most common incident types so on-call is consistent no matter who’s holding the pager.</p><p>In the lightning round: a follow-up on GitHub Actions direction (and a quick callback to Episode 6’s runner pricing pause), AWS ECR creating repos on push, a smarter take on incident metrics, Terraform drift visibility, and parallel “coding agent” workflows.</p><p>We wrap with a human reminder about the ironies of automation: automation doesn’t remove responsibility, it moves it. 
Faster systems require better brakes, better observability, and easier rollback.</p><p></p><p><strong>Links from this episode</strong></p><p></p><p>SRE Weekly issue 503 (source roundup - Cloudflare) <a target="_blank" rel="noopener noreferrer nofollow" href="https://sreweekly.com/sre-weekly-issue-503/">https://sreweekly.com/sre-weekly-issue-503/</a></p><p></p><p>Pulumi: all IaC, including Terraform and HCL <a target="_blank" rel="noopener noreferrer nofollow" href="https://www.pulumi.com/blog/all-iac-including-terraform-and-hcl/">https://www.pulumi.com/blog/all-iac-including-terraform-and-hcl/</a></p><p></p><p>Meta DrP: <a target="_blank" rel="noopener noreferrer nofollow" href="https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/">https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/</a></p><p></p><p>GitHub Actions: “Let’s talk about GitHub Actions” <a target="_blank" rel="noopener noreferrer nofollow" href="https://github.blog/news-insights/product-news/lets-talk-about-github-actions/">https://github.blog/news-insights/product-news/lets-talk-about-github-actions/</a></p><p></p><p>Episode 6 (GitHub runner pricing pause, Terraform Cloud limits, AI in CI) <a target="_blank" rel="noopener noreferrer nofollow" href="https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/">https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/</a></p><p></p><p>AWS ECR: create repositories on push <a target="_blank" rel="noopener noreferrer nofollow" href="https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/">https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/</a></p><p></p><p>DriftHound <a target="_blank" rel="noopener noreferrer nofollow" 
href="https://drifthound.io/">https://drifthound.io/</a></p><p></p><p>Superset <a target="_blank" rel="noopener noreferrer nofollow" href="https://superset.sh/">https://superset.sh/</a></p><p></p><p>More episodes + contact info, and more details on this episode can be found on our website: <a target="_blank" rel="noopener noreferrer nofollow" href="https://shipitweekly.fm">https://shipitweekly.fm</a></p>