Episode 72: Why Agents Solve the Wrong Problem (and What Data Scientists Do Instead)
<p><em>I often see what I would consider to be </em><strong><em>b******t evals</em></strong><em>, especially in data, like write this </em><strong><em>dumb SQL</em></strong><em>. Almost every one of these </em><strong><em>dumb SQL</em></strong><em> questions that I’ve seen for benchmarks are just so either obviously easy or overwhelmingly adversarial. They just, they </em><strong><em>don’t feel valuable</em></strong><em> as a </em><strong><em>data scientist</em></strong><em>, it’s something that you probably would never ask a real data scientist to do. So I went </em><strong><em>out of my way to create real ones. Let me read one to you.</em></strong></p><p><strong>Bryan Bischof</strong>, <strong>Head of AI</strong> at <strong>Theory Ventures</strong>, joins Hugo to talk about what happened when <strong>150 people</strong> spent <strong>six hours</strong> using <strong>AI agents</strong> to answer <strong>real data science questions</strong> across <strong>SQL tables</strong>, <strong>log files</strong>, and <strong>750,000 PDFs</strong>.</p><p><strong>They Discuss:</strong></p><p>* <strong>Failure Funnels</strong>, pinpoint where <strong>agent reasoning breaks down</strong> using causal-chain binary evaluations instead of vague 1-5 scales;</p><p>* <strong>Median Score: 23 out of 65</strong>, what happened when world-class engineers turned agents loose on real data work, and why <strong>general-purpose coding agents</strong> with human prodding beat fancy frameworks;</p><p>* <strong>Zero-Cost Submissions Kill Trust</strong>, without a penalty for wrong answers, agents <strong>hill-climb</strong> to correct submissions through brute force instead of building confidence;</p><p>* <strong>Data Science is “Zooming”</strong>, moving beyond binary decisions to iterative <strong>problem framing</strong>, refining “does our inventory suck?” into a tractable hypothesis;</p><p>* <strong>MCP as Semantic Layer</strong>, model your organization’s <strong>proprietary 
knowledge</strong> once and distribute it to whatever LLM interface your team prefers;</p><p>* <strong>The Subagent vs. Tool Debate</strong>, a distinction that adds <strong>cognitive load</strong> without hiding complexity;</p><p>* <strong>Self-Orchestration Gap</strong>, agents don’t yet realize they should trigger specialized extraction frameworks like <strong>DocETL</strong> instead of reading 750K PDFs one by one;</p><p>* <strong>The Future of Evals</strong>, from vibe checks to <strong>objective functions</strong> and continuous user feedback that lets systems converge on reliability.</p><p>You can also find the full episode on <a target="_blank" href="https://open.spotify.com/show/3yuz89gqAhcMcdy3SZPe4X?si=AKl2jvIARD2Liw1bBH2Nng&nd=1&dlsi=8dfe7221896c4fc3">Spotify</a>, <a target="_blank" href="https://podcasts.apple.com/us/podcast/vanishing-gradients/id1610318868">Apple Podcasts</a>, and <a target="_blank" href="https://youtube.com/live/seh9oVngJJQ?feature=share">YouTube</a>.</p><p><a target="_blank" href="https://notebooklm.google.com/notebook/8d091eee-7a65-4212-b04d-cb52f00ea00a">You can also interact directly with the transcript here in NotebookLM</a>. If you do, let us know what you find in the comments!</p><p>👉 <strong><em>Want to learn more about Building AI-Powered Software? Check out our </em></strong><a target="_blank" href="https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles"><strong><em>Building AI Applications course</em></strong></a>. It’s a live cohort with hands-on exercises and office hours. <strong>Our final cohort has started</strong>. Registration is still open. <strong>All sessions are recorded</strong>, so don’t worry about having missed any. Here is a <a target="_blank" href="https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgfs"><strong><em>25% discount code for readers</em></strong></a>. 
👈</p><p><strong>LINKS</strong></p><p>* <a target="_blank" href="https://x.com/BEBischof">Bryan Bischof on Twitter/X</a></p><p>* <a target="_blank" href="https://www.linkedin.com/in/bryan-bischof/">Bryan Bischof on LinkedIn</a></p><p>* <a target="_blank" href="https://theoryvc.com/">Theory Ventures</a></p><p>* <a target="_blank" href="https://theoryvc.com/blog-posts/the-hunt-for-a-trustworthy-data-agent">The Hunt for a Trustworthy Data Agent (blog post)</a></p><p>* <a target="_blank" href="https://github.com/TheoryVentures/antm">America’s Next Top Modeler GitHub repo</a></p><p>* <a target="_blank" href="https://hamel.dev/blog/posts/evals-faq/how-do-i-evaluate-agentic-workflows.html">Hamel’s evals FAQ: How do I evaluate agentic workflows?</a></p><p>* <a target="_blank" href="https://www.docetl.org/">DocETL</a></p><p>* <a target="_blank" href="https://hugobowne.substack.com/p/llm-judges-and-ai-agents-at-scale">LLM Judges and AI Agents at Scale (Hugo’s podcast with Shreya Shankar)</a></p><p>* <a target="_blank" href="https://www.cimolabs.com/blog/metrics-lying">When Your Metrics Are Lying (Cimo Labs)</a></p><p>* <a target="_blank" href="https://youtube.com/live/c0gcsprsFig?feature=share">Lessons from a Year of Building with LLMs (livestream on YouTube)</a></p><p>* <a target="_blank" href="https://www.youtube.com/watch?v=zqjnEptOn4k">Bryan Bischof: The Map is Not the Territory (YouTube)</a></p><p>* <a target="_blank" href="https://luma.com/calendar/cal-8ImWFDQ3IEIxNWk">Upcoming Events on Luma</a></p><p>* <a target="_blank" href="https://www.youtube.com/@vanishinggradients">Vanishing Gradients on YouTube</a></p><p>* <a target="_blank" href="https://youtube.com/live/seh9oVngJJQ">Watch the podcast video on YouTube</a></p>
<br/><br/>Get full access to Vanishing Gradients at <a href="https://hugobowne.substack.com/subscribe?utm_medium=podcast&utm_campaign=CTA_4">hugobowne.substack.com/subscribe</a>