Vanishing Gradients

Hugo Bowne-Anderson

Episode 72: Why Agents Solve the Wrong Problem (and What Data Scientists Do Instead)

MAR 20, 2026 · 93 MIN

Description

<p><em>I often see what I would consider to be </em><strong><em>b******t evals</em></strong><em>, especially in data, like write this </em><strong><em>dumb SQL</em></strong><em>. Almost every one of these </em><strong><em>dumb SQL</em></strong><em> questions that I’ve seen for benchmarks are just so either obviously easy or overwhelmingly adversarial. They just, they </em><strong><em>don’t feel valuable</em></strong><em> as a </em><strong><em>data scientist</em></strong><em>, it’s something that you probably would never ask a real data scientist to do. So I went </em><strong><em>out of my way to create real ones. Let me read one to you.</em></strong></p>
<p><strong>Bryan Bischof</strong>, <strong>Head of AI</strong> at <strong>Theory Ventures</strong>, joins Hugo to talk about what happened when <strong>150 people</strong> spent <strong>six hours</strong> using <strong>AI agents</strong> to answer <strong>real data science questions</strong> across <strong>SQL tables</strong>, <strong>log files</strong>, and <strong>750,000 PDFs</strong>.</p>
<p><strong>They Discuss:</strong></p>
<p>* <strong>Failure Funnels</strong>, pinpoint where <strong>agent reasoning breaks down</strong> using causal-chain binary evaluations instead of vague 1-5 scales;</p>
<p>* <strong>Median Score: 23 out of 65</strong>, what happened when world-class engineers turned agents loose on real data work, and why <strong>general-purpose coding agents</strong> with human prodding beat fancy frameworks;</p>
<p>* <strong>Zero-Cost Submissions Kill Trust</strong>, without a penalty for wrong answers, agents <strong>hill-climb</strong> to correct submissions through brute force instead of building confidence;</p>
<p>* <strong>Data Science is “Zooming”</strong>, moving beyond binary decisions to iterative <strong>problem framing</strong>, refining “does our inventory suck?” into a tractable hypothesis;</p>
<p>* <strong>MCP as Semantic Layer</strong>, model your organization’s <strong>proprietary knowledge</strong> once and distribute it to whatever LLM interface your team prefers;</p>
<p>* <strong>The Subagent vs. Tool Debate</strong>, a distinction that adds <strong>cognitive load</strong> without hiding complexity;</p>
<p>* <strong>Self-Orchestration Gap</strong>, agents don’t yet realize they should trigger specialized extraction frameworks like <strong>DocETL</strong> instead of reading 750K PDFs one by one;</p>
<p>* <strong>The Future of Evals</strong>, from vibe checks to <strong>objective functions</strong> and continuous user feedback that lets systems converge on reliability.</p>
<p>You can also find the full episode on <a target="_blank" href="https://open.spotify.com/show/3yuz89gqAhcMcdy3SZPe4X?si=AKl2jvIARD2Liw1bBH2Nng&#38;nd=1&#38;dlsi=8dfe7221896c4fc3">Spotify</a>, <a target="_blank" href="https://podcasts.apple.com/us/podcast/vanishing-gradients/id1610318868">Apple Podcasts</a>, and <a target="_blank" href="https://youtube.com/live/seh9oVngJJQ?feature=share">YouTube</a>.</p>
<p><a target="_blank" href="https://notebooklm.google.com/notebook/8d091eee-7a65-4212-b04d-cb52f00ea00a">You can also interact directly with the transcript here in NotebookLM</a>. If you do, let us know what you find in the comments!</p>
<p>👉 <strong><em>Want to learn more about Building AI-Powered Software? Check out our </em></strong><a target="_blank" href="https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles"><strong><em>Building AI Applications course</em></strong></a>. It’s a live cohort with hands-on exercises and office hours. <strong>Our final cohort has started</strong>. Registration is still open. <strong>All sessions are recorded</strong>, so don’t worry about having missed any. Here is a <a target="_blank" href="https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgfs"><strong><em>25% discount code for readers</em></strong></a>. 👈</p>
<p><strong>LINKS</strong></p>
<p>* <a target="_blank" href="https://x.com/BEBischof">Bryan Bischof on Twitter/X</a></p>
<p>* <a target="_blank" href="https://www.linkedin.com/in/bryan-bischof/">Bryan Bischof on LinkedIn</a></p>
<p>* <a target="_blank" href="https://theoryvc.com/">Theory Ventures</a></p>
<p>* <a target="_blank" href="https://theoryvc.com/blog-posts/the-hunt-for-a-trustworthy-data-agent">The Hunt for a Trustworthy Data Agent (blog post)</a></p>
<p>* <a target="_blank" href="https://github.com/TheoryVentures/antm">America’s Next Top Modeler GitHub repo</a></p>
<p>* <a target="_blank" href="https://hamel.dev/blog/posts/evals-faq/how-do-i-evaluate-agentic-workflows.html">Hamel’s evals FAQ: How do I evaluate agentic workflows?</a></p>
<p>* <a target="_blank" href="https://www.docetl.org/">DocETL</a></p>
<p>* <a target="_blank" href="https://hugobowne.substack.com/p/llm-judges-and-ai-agents-at-scale">LLM Judges and AI Agents at Scale (Hugo’s podcast with Shreya Shankar)</a></p>
<p>* <a target="_blank" href="https://www.cimolabs.com/blog/metrics-lying">When Your Metrics Are Lying (Cimo Labs)</a></p>
<p>* <a target="_blank" href="https://youtube.com/live/c0gcsprsFig?feature=share">Lessons from a Year of Building with LLMs (livestream on YouTube)</a></p>
<p>* <a target="_blank" href="https://www.youtube.com/watch?v=zqjnEptOn4k">Bryan Bischof: The Map is Not the Territory (YouTube)</a></p>
<p>* <a target="_blank" href="https://luma.com/calendar/cal-8ImWFDQ3IEIxNWk">Upcoming Events on Luma</a></p>
<p>* <a target="_blank" href="https://www.youtube.com/@vanishinggradients">Vanishing Gradients on YouTube</a></p>
<p>* <a target="_blank" href="https://youtube.com/live/seh9oVngJJQ">Watch the podcast video on YouTube</a></p>
<br/><br/>Get full access to Vanishing Gradients at <a href="https://hugobowne.substack.com/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_4">hugobowne.substack.com/subscribe</a>