Data Engineering Podcast

Tobias Macey


Details

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Recent Episodes

Holding Kafka Right: Product-Friendly Streaming with TypeStream
JUN 18, 2026
Summary&nbsp;<br />In this episode Jevin Maltais talks about the practical realities of building reliable, product-focused streaming systems with Kafka. Jevin shares lessons from roles at Zapier, Humi, and Clio, where real-time synchronization, customer data unification, and document sync at scale highlighted both the strengths and common misuses of Kafka. He digs into using events as the source of truth, materialized views with KTables, and how schema registries and type safety prevent downstream breakage. Jevin explains why teams often reach for heavyweight Kafka clusters without leveraging Streams, Connect, or interactive queries, and how his project, TypeStream, aims to make those capabilities accessible via config-as-code while keeping a thin abstraction and clear escape hatches. He also explores trade-offs across Kafka-compatible alternatives, CDC with Debezium in the real world, and where abstractions should stop so teams can scale responsibility as complexity grows.&nbsp;<br /><br /><br />Announcements&nbsp;<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>This episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. 
Go to <a href="https://datadriven.io/?utm_source=dataengineeringpodcast&amp;utm_medium=podcast&amp;utm_campaign=macey_sponsorship&amp;utm_content=ep_512" target="_blank">dataengineeringpodcast.com/datadriven</a> today to start practicing.</li><li>Your host is Tobias Macey and today I'm interviewing Jevin Maltais about the challenges of building a reliable streaming&nbsp;</li></ul><br />Interview<br />&nbsp;<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what Typestream is and the story behind it?</li><li>What are the common challenges that teams encounter when trying to build on top of Kafka?</li><li>How do those challenges/misconfigurations impact the team's ability to deliver on product goals?</li><li>What are the fundamental design aspects of Kafka that contribute to the difficulties that teams encounter when using it as an element of their architecture?</li><li>There have been numerous projects taking aim at Kafka, with varying approaches and degrees of effectiveness (e.g. RedPanda, AutoMQ, Pulsar, etc.). What are the tradeoffs that each of those approaches requires?</li><li>What makes the original Kafka project so resilient in the face of all of that competition?</li><li>Can you describe the architecture of Typestream and how each of the core elements contribute to a better user experience?</li><li>For teams who want to take advantage of streaming capabilities, but don't want to invest in becoming Kafka experts, what does the Typestream workflow look like?</li><li>If they don't want to manage the operational overhead of a Kafka cluster, how tightly coupled is Typestream to the original Kafka? 
(can someone use RedPanda or AutoMQ instead?)</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Typestream used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on Typestream?</li><li>When is Typestream the wrong choice?</li><li>What do you have planned for the future of Typestream?</li></ul><br />Contact Info<br />&nbsp;<br /><ul><li><a href="https://www.jevy.org/" target="_blank">Website</a></li></ul><br />Parting Question<br />&nbsp;<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul><br />Closing Announcements<br />&nbsp;<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email [email protected] with your story.</li></ul><br />Links<br />&nbsp;<br /><ul><li><a href="https://typestream.io/" target="_blank">Typestream</a></li><li><a href="https://zapier.com/" target="_blank">Zapier</a></li><li><a href="http://airflow.apache.org/" target="_blank">Airflow</a></li><li><a href="https://kafka.apache.org/" target="_blank">Kafka</a></li><li><a href="https://developer.confluent.io/courses/kafka-streams/ktable/" target="_blank">KTables</a></li><li><a href="https://github.com/confluentinc/ksql" target="_blank">KSQL</a></li><li><a href="https://www.redpanda.com/" target="_blank">RedPanda</a></li><li><a href="https://pulsar.apache.org/" target="_blank">Pulsar</a></li><li><a href="https://www.automq.com/" target="_blank">AutoMQ</a></li><li><a href="https://github.com/confluentinc/schema-registry" target="_blank">Kafka Schema Registry</a></li><li><a href="https://debezium.io/" target="_blank">Debezium</a></li><li><a href="https://en.wikipedia.org/wiki/Change_data_capture" target="_blank">Change Data Capture</a></li><li><a href="https://kafka.apache.org/43/kafka-connect/overview/" target="_blank">Kafka Connect</a></li><li><a href="https://developer.hashicorp.com/terraform" target="_blank">Terraform</a></li><li><a href="https://developer.confluent.io/courses/architecture/compaction/" target="_blank">Kafka Compacted Topic</a></li></ul><br />The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
49 MIN
Text to Data Products: Kaarvi’s End-to-End AI for Ingestion, Quality, and Dashboards
JUN 8, 2026
Summary&nbsp;<br />In this episode Shravan Gunda, founder and CEO of Kaarvi AI, talks about building an AI-native, agent-driven data platform designed to eliminate the janitorial work that consumes most data teams. He explores Kaarvi’s multi-agent architecture that runs queries across seven LLMs in parallel for reliability, its synthetic data generator that mirrors source schemas for quick testing, and “Hey Kaarvi” chat for text-to-SQL, text-to-transformations, and text-to-dashboard workflows. He also digs into on-prem versus SaaS deployments, domain-specialized agents for privacy and accuracy, code blocks for custom Python/SQL, and the roadmap for a marketplace and desktop assistant. Shravan highlights how Kaarvi compresses weeks of work into hours and bridges the gap between business users and data engineers by turning AI into a dependable force multiplier.&nbsp;<br /><br />Announcements&nbsp;<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>This episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. 
Go to <a href="https://datadriven.io/?utm_source=dataengineeringpodcast&amp;utm_medium=podcast&amp;utm_campaign=macey_sponsorship&amp;utm_content=ep_511" target="_blank">dataengineeringpodcast.com/datadriven</a> today to start practicing.</li><li>Your host is Tobias Macey and today I'm interviewing Shravan Gunda about building an agent-driven data platform at Kaarvi</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what Kaarvi is and the story behind it?</li><li>"AI" is a very broad term that encompasses numerous possible implementations. Can you give some more detail about the different types and applications of AI in Kaarvi's architecture?</li><li>What are some of the core assumptions of data workflows that need to be reconsidered when AI is embedded in the execution path?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Kaarvi used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kaarvi?</li><li>When is Kaarvi the wrong choice?</li><li>What do you have planned for the future of Kaarvi?</li></ul><br />Contact Info<br />&nbsp;<br /><ul><li><a href="https://www.linkedin.com/in/shravankgunda/" target="_blank">LinkedIn</a></li></ul><br />Parting Question<br />&nbsp;<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul><br />Closing Announcements<br />&nbsp;<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. 
The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.</li></ul><br />Links<br />&nbsp;<br /><ul><li><a href="https://kaarvi.ai/" target="_blank">Kaarvi</a></li><li><a href="https://mitsloan.mit.edu/ideas-made-to-matter/what-synthetic-data-and-how-can-it-help-you-competitively" target="_blank">Synthetic Data</a></li><li><a href="https://n8n.io/" target="_blank">n8n</a></li></ul><br />The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
52 MIN
Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture
JUN 1, 2026
Summary<br />In this episode Weimo Liu, co‑founder of PuppyGraph, talks about the engineering behind their “zero-copy” graph querying engine for lakehouse and database sources. He explores how PuppyGraph lets you run Cypher and Gremlin traversals and graph algorithms directly on data in Iceberg, Delta, Hudi, Hive, and even MongoDB—without loading into a separate graph store. Weimo explains their edge-sharded, vectorized, MPP architecture that tackles hub nodes, multi-hop traversals, and shuffle at scale, targeting sub-second to single-digit-second workloads. He digs into practical graph data modeling on top of normalized and denormalized tables, logical views, and flexible mappings; strategies for caching, adaptive reads, and leveraging Iceberg metadata; and how PuppyGraph’s operator-based engine unifies query and algorithms. He also covers real-world applications—from cybersecurity log analysis to entity resolution and agentic workflows—when to choose embedded or transactional graph databases instead, and what’s next for enterprise features and broader warehouse integrations.<br /><br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>This episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. 
Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to <a href="https://datadriven.io/?utm_source=dataengineeringpodcast&amp;utm_medium=podcast&amp;utm_campaign=macey_sponsorship&amp;utm_content=ep_510" target="_blank">dataengineeringpodcast.com/datadriven</a> today to start practicing.</li><li>Your host is Tobias Macey and today I'm interviewing Weimo Liu about the engineering behind PuppyGraph's zero-copy ETL for querying your lakehouse as a graph</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you start by describing what PuppyGraph is and the story behind it?</li><li>What are some of the key use cases that people are turning to PuppyGraph and graph data models for?</li><li>Graph engines have struggled to take off for several years, not least of which is due to the difficulty of scaling them to large data volumes as a result of the topological nature of the data. Can you describe the architecture of PuppyGraph and some of the ways that you are addressing that challenge of data volume for graphs?</li><li>latency/data exploration</li><li>types of traversals and limitations</li><li>lakehouse architecture pros/cons for graphs</li><li>data modeling/translation</li><li>shortcomings of zero-ETL and how transforming the underlying representation could provide benefits</li><li>For someone who is looking for a graph engine to support a connected data use case, what are the guiding questions that you would ask to lead them toward PuppyGraph vs. 
a dedicated graph database like Memgraph/Neo4J/etc.?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen PuppyGraph used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on PuppyGraph?</li><li>When is PuppyGraph the wrong choice?</li><li>What do you have planned for the future of PuppyGraph and graph data exploration on large data volumes?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/weimoliu/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email [email protected] with your story.</li></ul>Links<br /><ul><li><a href="https://www.puppygraph.com/" target="_blank">PuppyGraph</a></li><li><a href="https://www.tigergraph.com/" target="_blank">TigerGraph</a></li><li><a href="https://research.google/pubs/f1-a-distributed-sql-database-that-scales/" target="_blank">Google F1</a></li><li><a href="https://en.wikipedia.org/wiki/Graph_database" target="_blank">Graph Database</a></li><li><a href="https://research.google/pubs/pregel-a-system-for-large-scale-graph-processing/" target="_blank">Google Pregel</a></li><li><a href="https://iceberg.apache.org/" target="_blank">Iceberg</a></li><li><a href="https://aerospike.com/docs/graph/develop/query/supernodes/" target="_blank">Graph Supernode</a></li><li><a href="https://en.wikipedia.org/wiki/Massively_parallel" target="_blank">MPP == Massively Parallel Processing</a></li><li><a href="https://spark.apache.org/docs/latest/graphx-programming-guide.html" target="_blank">Spark GraphX</a></li><li><a href="https://trino.io/" target="_blank">Trino</a></li><li><a href="https://ladybugdb.com/" target="_blank">Ladybug DB</a></li><li><a href="https://github.com/lance-format/lance-graph" target="_blank">lance-graph</a></li><li><a href="https://github.com/kuzudb/kuzu" target="_blank">KuzuDB</a></li><li><a href="https://memgraph.com/" target="_blank">MemGraph</a></li><li><a href="https://en.wikipedia.org/wiki/Property_graph" target="_blank">Labelled Property Graph</a></li><li><a href="https://en.wikipedia.org/wiki/Semantic_triple" target="_blank">RDF Triples</a></li><li><a href="https://en.wikipedia.org/wiki/Cypher_(query_language)" target="_blank">Cypher Query Language</a></li><li><a href="https://en.wikipedia.org/wiki/Gremlin_(query_language)" target="_blank">Gremlin</a></li><li><a href="https://en.wikipedia.org/wiki/Change_data_capture" target="_blank">CDC == Change Data Capture</a></li><li><a href="https://neo4j.com/" target="_blank">Neo4J</a></li><li><a 
href="https://github.com/janusgraph/janusgraph" target="_blank">JanusGraph</a></li><li><a href="https://networkx.github.io/" target="_blank">NetworkX</a></li><li><a href="https://pytorch.org/" target="_blank">PyTorch</a></li><li><a href="https://duckdb.org/" target="_blank">DuckDB</a></li><li><a href="https://iceberg.apache.org/spec/#semi-structured-types" target="_blank">Iceberg Array</a></li><li><a href="https://www.lancedb.com/" target="_blank">LanceDB</a></li><li><a href="https://www.paloaltonetworks.com/" target="_blank">Palo Alto Networks</a></li><li><a href="https://arrow.apache.org/blog/2023/01/05/introducing-arrow-adbc/" target="_blank">Columnar ADBC</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a><br />
54 MIN
Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes
MAY 6, 2026
Summary<br />In this episode Robert Nishihara, co-founder of Anyscale and co-creator of Ray, talks about maximizing hardware utilization for AI and data-intensive workloads. He explores Ray’s evolution alongside Kubernetes and PyTorch, and why consolidation at these layers has enabled a new generation of complex, heterogeneous workloads. Robert explains how data preparation has shifted to GPU- and inference-heavy, multimodal pipelines; where Ray fits compared to Spark and workflow orchestrators; and why Ray excels at composing heterogeneous pools of compute, handling failures, and scaling complex systems like multi-node LLM inference and reinforcement learning. He digs into practical strategies for boosting GPU utilization across training and inference, elasticity and prioritization of workloads, topology-aware scheduling, and the importance of fast failure recovery as hardware scales from nodes to racks. If you’re wrestling with expensive GPUs, multimodal data curation, or cross-node LLM inference, this conversation offers concrete mental models and architectural guidance.<br /><br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Your host is Tobias Macey and today I'm interviewing Robert Nishihara about the challenges of maximizing the utility of your available hardware for AI applications</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you start by giving an overview of the major contributors to wasted or idle compute?</li><li>Why does it matter if the available compute isn't being maximized?</li><li>What are some of the typical ad-hoc methods that teams might use to try to get the most out of their available hardware (especially GPUs)?&nbsp;</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Ray used?</li><li>What are the most interesting, unexpected, or challenging lessons 
that you have learned while working on Ray and distributed compute for data and AI?</li><li>When is Ray the wrong choice?</li><li>What do you have planned for the future of Ray?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/robert-nishihara-b6465444/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email [email protected] with your story.</li></ul>Links<br /><ul><li><a href="https://www.anyscale.com/" target="_blank">AnyScale</a></li><li><a href="https://www.ray.io/" target="_blank">Ray</a></li><li><a href="https://en.wikipedia.org/wiki/Deep_learning" target="_blank">Deep Learning</a></li><li><a href="https://en.wikipedia.org/wiki/Computer_vision" target="_blank">Computer Vision</a></li><li><a href="https://kubernetes.io/" target="_blank">Kubernetes</a></li><li><a href="https://cursor.com/" target="_blank">Cursor</a></li><li><a href="https://code.claude.com/docs/en/overview" target="_blank">Claude Code</a></li><li><a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html" target="_blank">Kube-Ray</a></li><li><a href="https://pytorch.org/" target="_blank">PyTorch</a></li><li><a href="https://www.tensorflow.org/" target="_blank">Tensorflow</a></li><li><a href="https://github.com/theano/theano" target="_blank">Theano</a></li><li><a href="https://en.wikipedia.org/wiki/Caffe_(software)" target="_blank">Caffe</a></li><li><a href="https://vllm.ai/" target="_blank">vLLM</a></li><li><a href="https://docs.sglang.io/" target="_blank">SGLang</a></li><li><a href="https://docs.ray.io/en/latest/tune/index.html" target="_blank">Ray Tune</a></li><li><a href="https://en.wikipedia.org/wiki/Neural_network_(machine_learning)" target="_blank">Neural Network</a></li><li><a href="https://en.wikipedia.org/wiki/Learning_rate" target="_blank">Learning Rates</a></li><li><a href="https://en.wikipedia.org/wiki/Reinforcement_learning" target="_blank">Reinforcement Learning</a></li><li><a href="https://deepmind.google/research/alphago/" target="_blank">AlphaGo</a></li><li><a href="https://cursor.com/blog/composer-2" target="_blank">Cursor Composer 2</a></li><li><a href="https://en.wikipedia.org/wiki/ImageNet" target="_blank">ImageNet</a></li><li><a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning)" target="_blank">Transformer Architecture</a></li><li><a 
href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent" target="_blank">Stochastic Gradient Descent</a></li><li><a href="https://airflow.apache.org/" target="_blank">Airflow</a></li><li><a href="https://dagster.io/" target="_blank">Dagster</a></li><li><a href="https://flyte.org/" target="_blank">Flyte</a></li><li><a href="https://en.wikipedia.org/wiki/Mixture_of_experts" target="_blank">Mixture of Experts</a></li><li><a href="https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests" target="_blank">Prefill</a></li><li><a href="https://temporal.io/" target="_blank">Temporal</a></li><li><a href="https://en.wikipedia.org/wiki/Actor_model" target="_blank">Actor Framework</a></li><li><a href="https://en.wikipedia.org/wiki/Remote_direct_memory_access" target="_blank">RDMA == Remote Direct Memory Access</a></li><li><a href="https://www.cisco.com/site/us/en/learn/topics/computing/what-is-neocloud.html" target="_blank">Neoclouds</a></li><li><a href="https://www.aiengineeringpodcast.com/gpu-cloud-marketplace-episode-75" target="_blank">AI Engineering Podcast Episode</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
58 MIN
The AI-First Data Engineer: 10–50x Productivity and What Changes Next
APR 7, 2026
Summary&nbsp;<br />In this episode, I sit down with Gleb Mezhanskiy, CEO and co-founder of Datafold, to explore how agentic AI is reshaping data engineering. We unpack the leap from chat-assisted coding to truly agentic workflows where AI not only writes SQL and dbt models but also executes queries, debugs, runs tests, and ships production-ready outcomes. Gleb explains why teams that master this AI-first loop can see 10–50x gains, how security/compliance concerns can be addressed with platform-native LLM endpoints, and why the role of data engineers is shifting from code authors to operators of autonomous agents. We dig into the consolidation of the modern data stack, the economics driving more data products (Jevons paradox), and why product thinking, domain knowledge, and cross-functional skills will define the next wave of standout data professionals. We also cover practical steps for leaders and ICs: modernizing off legacy platforms, establishing safe AI adoption paths, codifying reusable “skills” and context for agents, and building validation utilities that keep the inner loop fast and trustworthy. Finally, Gleb shares how Datafold moved to fully AI-driven software delivery and why “outcomes over tools” is the emerging model for complex initiatives like data platform migrations—and how this reframes data quality for the AI era, emphasizing broad data access plus rich context over brittle human-centric tests.&nbsp;<br />Announcements&nbsp;<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. 
Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at <a href="https://www.dataengineeringpodcast.com/retool" target="_blank">dataengineeringpodcast.com/retool</a> today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests.</li><li>Your host is Tobias Macey and today I'm bringing back Gleb Mezhanskiy to talk about our predictions for the impact of AI on data engineering for 2026</li></ul><br />Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>What are the concrete steps that teams need to be taking today to take advantage of agentic AI capabilities?</li><li>What are the new guardrails/constraints/workflows that need to be in place before you let AI loose on your data systems?</li><li>How do you balance the potential cost savings and productivity increases with the up-front investment and variability in inference spend?</li></ul><br />Contact Info<br />&nbsp;<br /><ul><li><a href="https://www.linkedin.com/in/glebmezh/" target="_blank">LinkedIn</a></li></ul><br />Parting Question<br />&nbsp;<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul><br />Closing Announcements<br />&nbsp;<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. 
The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.</li></ul><br />Links<br />&nbsp;<br /><ul><li><a href="https://www.datafold.com/blog/data-engineering-in-2026-predictions/" target="_blank">Blog Post</a></li><li><a href="https://www.datafold.com/" target="_blank">Datafold</a></li><li><a href="https://www.anthropic.com/news/claude-opus-4-5" target="_blank">Claude Opus 4.5</a></li><li><a href="https://en.wikipedia.org/wiki/Muggle" target="_blank">Harry Potter - Muggles</a></li><li><a href="https://en.wikipedia.org/wiki/Jevons_paradox" target="_blank">Jevons Paradox</a></li><li><a href="https://www.ibm.com/think/topics/modern-data-stack" target="_blank">Modern Data Stack</a></li><li><a href="https://compass.dagster.io/" target="_blank">Dagster Compass</a></li><li><a href="https://www.bygravity.com/orion" target="_blank">Gravity Orion</a></li><li><a href="https://modelcontextprotocol.io/docs/getting-started/intro" target="_blank">MCP == Model Context Protocol</a></li><li><a href="https://qwen.ai/home" target="_blank">Qwen</a></li></ul><br />The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
59 MIN