Data Engineering Podcast | Podcast Guru

Recent Episodes

Simplifying Data Pipelines with Durable Execution

APR 12, 2025

Summary<br />In this episode of the Data Engineering Podcast Jeremy Edberg, CEO of DBOS, talks about durable execution and its impact on designing and implementing business logic for data systems. Jeremy explains how DBOS's serverless platform and orchestrator provide local resilience and reduce operational overhead, ensuring exactly-once execution in distributed systems through the use of the Transact library. He discusses the importance of version management in long-running workflows and how DBOS simplifies system design by reducing infrastructure needs like queues and CI pipelines, making it beneficial for data pipelines, AI workloads, and agentic AI.<br /><br /><br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today for the details.</li><li>Your host is Tobias Macey and today I'm interviewing Jeremy Edberg about durable execution and how it influences the design and implementation of business logic</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what DBOS is and the story behind it?</li><li>What is durable execution?<ul><li>What are some of the notable ways that inclusion of durable execution in an application architecture changes the ways that the rest of the application is implemented? (e.g. 
error handling, logic flow, etc.)</li></ul></li><li>Many data pipelines involve complex, multi-step workflows. How does DBOS simplify the creation and management of resilient data pipelines? </li><li>How does durable execution impact the operational complexity of data management systems?</li><li>One of the complexities in durable execution is managing code/data changes to workflows while existing executions are still processing. What are some of the useful patterns for addressing that challenge and how does DBOS help?</li><li>Can you describe how DBOS is architected?<ul><li>How have the design and goals of the system changed since you first started working on it?</li></ul></li><li>What are the characteristics of Postgres that make it suitable for the persistence mechanism of DBOS?</li><li>What are the guiding principles that you rely on to determine the boundaries between the open source and commercial elements of DBOS?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen DBOS used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on DBOS?</li><li>When is DBOS the wrong choice?</li><li>What do you have planned for the future of DBOS?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/jedberg/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. 
The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.</li></ul>Links<br /><ul><li><a href="https://www.dbos.dev/" target="_blank">DBOS</a></li><li><a href="https://dimosr.github.io/the-tale-of-exactly-once-semantics/" target="_blank">Exactly Once Semantics</a></li><li><a href="https://temporal.io/" target="_blank">Temporal</a></li><li><a href="https://en.wikipedia.org/wiki/Semaphore_(programming)" target="_blank">Semaphore</a></li><li><a href="https://www.postgresql.org/" target="_blank">Postgres</a></li><li>DBOS Transact<ul><li><a href="https://github.com/dbos-inc/dbos-transact-py" target="_blank">Python</a> </li><li><a href="https://github.com/dbos-inc/dbos-transact-ts" target="_blank">TypeScript</a> </li></ul></li><li><a href="https://medium.com/@sahintalha1/the-way-psps-such-as-paypal-stripe-and-adyen-prevent-duplicate-payment-idempotency-keys-615845c185bf" target="_blank">Idempotency Keys</a></li><li><a href="https://hbr.org/2024/12/what-is-agentic-ai-and-how-will-it-change-work" target="_blank">Agentic AI</a></li><li><a href="https://en.wikipedia.org/wiki/Finite-state_machine" target="_blank">State Machine</a></li><li><a href="https://www.yugabyte.com/" target="_blank">YugabyteDB</a><ul><li><a href="https://www.dataengineeringpodcast.com/yugabytedb-planet-scale-sql-episode-115" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://www.yugabyte.com/" target="_blank">CockroachDB</a></li><li><a href="https://supabase.com/" target="_blank">Supabase</a></li><li><a href="https://neon.tech/" target="_blank">Neon</a><ul><li><a 
href="https://www.dataengineeringpodcast.com/neon-serverless-postgres-episode-433" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://airflow.apache.org/" target="_blank">Airflow</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>

39 MIN

Overcoming Redis Limitations: The Dragonfly DB Approach

MAR 30, 2025

Summary<br />In this episode of the Data Engineering Podcast Roman Gershman, CTO and founder of Dragonfly DB, explores the development and impact of high-speed in-memory databases. Roman shares his experience creating a more efficient alternative to Redis, focusing on performance gains, scalability, and cost efficiency, while addressing limitations such as high throughput and low latency scenarios. He explains how Dragonfly DB solves operational complexities for users and delves into its technical aspects, including maintaining compatibility with Redis while innovating on memory efficiency. Roman discusses the importance of cost efficiency and operational simplicity in driving adoption and shares insights on the broader ecosystem of in-memory data stores, future directions like SSD tiering and vector search capabilities, and the lessons learned from building a new database engine.<br /><br /><br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? 
Visit <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today for the details.</li><li>Your host is Tobias Macey and today I'm interviewing Roman Gershman about building a high-speed in-memory database and the impact of the performance gains on data applications</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what DragonflyDB is and the story behind it?</li><li>What is the core problem/use case that is solved by making a "faster Redis"?</li><li>The other major player in the high performance key/value database space is Aerospike. What are the heuristics that an engineer should use to determine whether to use that vs. Dragonfly/Redis?</li><li>Common use cases for Redis involve application caches and queueing (e.g. Celery/RQ). What are some of the other applications that you have seen Redis/Dragonfly used for, particularly in data engineering use cases?</li><li>There is a piece of tribal wisdom that it takes 10 years for a database to iron out all of the kinks. At the same time, there have been substantial investments in commoditizing the underlying components of database engines. 
Can you describe how you approached the implementation of DragonflyDB to arrive at a functional and reliable implementation?</li><li>What are the architectural elements that contribute to the performance and scalability benefits of Dragonfly?<ul><li>How have the design and goals of the system changed since you first started working on it?</li></ul></li><li>For teams who migrate from Redis to Dragonfly, beyond the cost savings what are some of the ways that it changes the ways that they think about their overall system design?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Dragonfly used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on DragonflyDB?</li><li>When is DragonflyDB the wrong choice?</li><li>What do you have planned for the future of DragonflyDB?</li></ul>Contact Info<br /><ul><li><a href="https://github.com/romange/" target="_blank">GitHub</a></li><li><a href="https://www.linkedin.com/in/romange/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.</li></ul>Links<br /><ul><li><a href="https://www.dragonflydb.io/" target="_blank">DragonflyDB</a></li><li><a href="https://redis.io/" target="_blank">Redis</a></li><li><a href="https://aws.amazon.com/pm/elasticache/" target="_blank">Elasticache</a></li><li><a href="https://valkey.io/" target="_blank">ValKey</a></li><li><a href="https://aerospike.com/" target="_blank">Aerospike</a></li><li><a href="https://laravel.com/" target="_blank">Laravel</a></li><li><a href="https://sidekiq.org/" target="_blank">Sidekiq</a></li><li><a href="https://docs.celeryq.dev/en/stable/" target="_blank">Celery</a></li><li><a href="https://seastar.io/" target="_blank">Seastar Framework</a></li><li><a href="https://en.wikipedia.org/wiki/Shared-nothing_architecture" target="_blank">Shared-Nothing Architecture</a></li><li><a href="https://en.wikipedia.org/wiki/Io_uring" target="_blank">io_uring</a></li><li><a href="https://github.com/romange/midi-redis" target="_blank">midi-redis</a></li><li><a href="https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect" target="_blank">Dunning-Kruger Effect</a></li><li><a href="https://www.rust-lang.org/" target="_blank">Rust</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>

43 MIN

Bringing AI Into The Inner Loop of Data Engineering With Ascend

MAR 24, 2025

Summary<br />In this episode of the Data Engineering Podcast Sean Knapp, CEO of Ascend.io, explores the intersection of AI and data engineering. He discusses the evolution of data engineering and the role of AI in automating processes, alleviating burdens on data engineers, and enabling them to focus on complex tasks and innovation. The conversation covers the challenges and opportunities presented by AI, including the need for intelligent tooling and its potential to streamline data engineering processes. Sean and Tobias also delve into the impact of generative AI on data engineering, highlighting its ability to accelerate development, improve governance, and enhance productivity, while also noting the current limitations and future potential of AI in the field.<br /><br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today for the details. 
</li><li>Your host is Tobias Macey and today I'm interviewing Sean Knapp about how Ascend is incorporating AI into their platform to help you keep up with the rapid rate of change</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what Ascend is and the story behind it?</li><li>The last time we spoke was <a href="https://www.dataengineeringpodcast.com/ascend-data-automation-episode-320" target="_blank">August of 2022</a>. What are the most notable or interesting evolutions in your platform since then?<ul><li>In that same time "AI" has taken up all of the oxygen in the data ecosystem. How has that impacted the ways that you and your customers think about their priorities?</li></ul></li><li>The introduction of AI as an API has caused many organizations to try to leapfrog their data maturity journey and jump straight to building with advanced capabilities. How is that impacting the pressures and priorities felt by data teams?</li><li>At the same time that AI-focused product goals are straining data teams' capacities, AI also has the potential to act as an accelerator to their work. What are the roadblocks/speedbumps that are in the way of that capability?</li><li>Many data teams are incorporating AI tools into parts of their workflow, but it can be clunky and cumbersome. 
How are you thinking about the fundamental changes in how your platform works with AI at its center?</li><li>Can you describe the technical architecture that you have evolved toward that allows for AI to drive the experience rather than being a bolt-on?<ul><li>What are the concrete impacts that these new capabilities have on teams who are using Ascend?</li></ul></li><li>What are the most interesting, innovative, or unexpected ways that you have seen Ascend + AI used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on incorporating AI into the core of Ascend?</li><li>When is Ascend the wrong choice?</li><li>What do you have planned for the future of AI in Ascend?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/seanknapp" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.</li></ul>Links<br /><ul><li><a href="https://www.ascend.io/" target="_blank">Ascend</a></li><li><a href="https://www.cursor.com/" target="_blank">Cursor</a> AI Code Editor</li><li><a href="https://devin.ai/" target="_blank">Devin</a></li><li><a href="https://github.com/features/copilot" target="_blank">GitHub Copilot</a></li><li><a href="https://openai.com/index/introducing-deep-research/" target="_blank">OpenAI DeepResearch</a></li><li><a href="https://aws.amazon.com/s3/features/tables/" target="_blank">S3 Tables</a></li><li><a href="https://aws.amazon.com/glue/" target="_blank">AWS Glue</a></li><li><a href="https://aws.amazon.com/bedrock/" target="_blank">AWS Bedrock</a></li><li><a href="https://www.snowflake.com/en/product/features/snowpark/" target="_blank">Snowpark</a></li><li><a href="https://amzn.to/4iWKkcT" target="_blank">Co-Intelligence</a>: Living and Working with AI by Ethan Mollick (affiliate link)</li><li><a href="https://en.wikipedia.org/wiki/OpenAI_o3" target="_blank">OpenAI o3</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>

52 MIN

Astronomer's Role in the Airflow Ecosystem: A Deep Dive with Pete DeJoy

MAR 16, 2025

Summary<br />In this episode of the Data Engineering Podcast Pete DeJoy, co-founder and product lead at Astronomer, talks about building and managing Airflow pipelines on Astronomer and the upcoming improvements in Airflow 3. Pete shares his journey into data engineering, discusses Astronomer's contributions to the Airflow project, and highlights the critical role of Airflow in powering operational data products. He covers the evolution of Airflow, its position in the data ecosystem, and the challenges faced by data engineers, including infrastructure management and observability. The conversation also touches on the upcoming Airflow 3 release, which introduces data awareness, architectural improvements, and multi-language support, and Astronomer's observability suite, Astro Observe, which provides insights and proactive recommendations for Airflow users.<br /><br /><br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? 
Visit <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today for the details.</li><li>Your host is Tobias Macey and today I'm interviewing Pete DeJoy about building and managing Airflow pipelines on Astronomer and the upcoming improvements in Airflow 3</li></ul>Interview<br /><ul><li>Introduction</li><li>Can you describe what Astronomer is and the story behind it?</li><li>How would you characterize the relationship between Airflow and Astronomer?</li><li>Astronomer just released your State of Airflow 2025 Report yesterday and it is the largest data engineering survey ever with over 5,000 respondents. Can you talk a bit about top-level findings in the report?</li><li>What about the overall growth of the Airflow project over time?</li><li>How have the focus and features of Astronomer changed since <a href="https://www.dataengineeringpodcast.com/astronomer-with-ry-walker-episode-6" target="_blank">it was last featured on the show</a> in 2017?</li><li>Astro Observe GA’d in early February. What does the addition of pipeline observability mean for your customers? </li><li>What are other capabilities similar in scope to observability that Astronomer is looking at adding to the platform?</li><li>Why is Airflow so critical in providing an elevated observability (or cataloging, or something similar) experience in a DataOps platform? 
<ul><li>What are the notable evolutions in the Airflow project and ecosystem in that time?</li></ul></li><li>What are the core improvements that are planned for Airflow 3.0?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Astro used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airflow and Astro?</li><li>What do you have planned for the future of Astro/Astronomer/Airflow?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/pdejoy58/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.</li></ul>Links<br /><ul><li><a href="https://www.astronomer.io/" target="_blank">Astronomer</a></li><li><a href="https://airflow.apache.org/" target="_blank">Airflow</a></li><li><a href="https://www.linkedin.com/in/maximebeauchemin/" target="_blank">Maxime Beauchemin</a></li><li><a href="https://www.mongodb.com/" target="_blank">MongoDB</a></li><li><a href="https://www.databricks.com/" target="_blank">Databricks</a></li><li><a href="https://www.confluent.io/" target="_blank">Confluent</a></li><li><a href="https://spark.apache.org/" target="_blank">Spark</a></li><li><a href="https://kafka.apache.org/" target="_blank">Kafka</a></li><li><a href="https://dagster.io/" target="_blank">Dagster</a><ul><li><a href="https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://www.prefect.io/" target="_blank">Prefect</a></li><li><a href="https://www.astronomer.io/airflow/3-0/" target="_blank">Airflow 3</a></li><li><a href="https://www.freecodecamp.org/news/the-rise-of-the-data-engineer-91be18f1e603/" target="_blank">The Rise of the Data Engineer</a> blog post</li><li><a href="https://www.getdbt.com/" target="_blank">dbt</a></li><li><a href="https://jupyter.org/" target="_blank">Jupyter Notebook</a></li><li><a href="https://zapier.com/" target="_blank">Zapier</a></li><li><a href="https://www.astronomer.io/cosmos/" target="_blank">cosmos</a> library for dbt in Airflow</li><li><a href="https://docs.astral.sh/ruff/" target="_blank">Ruff</a></li><li><a href="https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html" target="_blank">Airflow Custom Operator</a></li><li><a href="https://www.snowflake.com/en/" target="_blank">Snowflake</a></li></ul>The intro and outro music is from <a 
href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>

51 MIN

Accelerated Computing in Modern Data Centers With Datapelago

MAR 8, 2025

Summary<br />In this episode of the Data Engineering Podcast Rajan Goyal, CEO and co-founder of Datapelago, talks about improving efficiencies in data processing by reimagining system architecture. Rajan explains the shift from hyperconverged to disaggregated and composable infrastructure, highlighting the importance of accelerated computing in modern data centers. He discusses the evolution from proprietary to open, composable stacks, emphasizing the role of open table formats and the need for a universal data processing engine, and outlines Datapelago's strategy to leverage existing frameworks like Spark and Trino while providing accelerated computing benefits.<br /><br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today for the details.</li><li>Your host is Tobias Macey and today I'm interviewing Rajan Goyal about how to drastically improve efficiencies in data processing by re-imagining the system architecture</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you start by outlining the main factors that contribute to performance challenges in data lake environments?</li><li>The different components of open data processing systems have evolved from different starting points with different objectives. 
In your experience, how has that un-planned and un-synchronized evolution of the ecosystem hindered the capabilities and adoption of open technologies?</li><li>The introduction of a new cross-cutting capability (e.g. Iceberg) has typically taken a substantial amount of time to gain support across different engines and ecosystems. What do you see as the point of highest leverage to improve the capabilities of the entire stack with the least amount of co-ordination?</li><li>What was the motivating insight that led you to invest in the technology that powers Datapelago?</li><li>Can you describe the system design of Datapelago and how it integrates with existing data engines?</li><li>The growth in the generation and application of unstructured data is a notable shift in the work being done by data teams. What are the areas of overlap in the fundamental nature of data (whether structured, semi-structured, or unstructured) that you are able to exploit to bridge the processing gap?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Datapelago used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datapelago?</li><li>When is Datapelago the wrong choice?</li><li>What do you have planned for the future of Datapelago?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/rajangoyal/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Links<br /><ul><li><a href="https://www.datapelago.io/" target="_blank">Datapelago</a></li><li><a href="https://en.wikipedia.org/wiki/MIPS_architecture" target="_blank">MIPS Architecture</a></li><li><a href="https://en.wikipedia.org/wiki/ARM_architecture_family" target="_blank">ARM Architecture</a></li><li><a href="https://aws.amazon.com/ec2/nitro/" target="_blank">AWS Nitro</a></li><li><a 
href="https://en.wikipedia.org/wiki/Mellanox_Technologies" target="_blank">Mellanox</a></li><li><a href="https://www.nvidia.com/" target="_blank">Nvidia</a></li><li><a href="https://en.wikipedia.org/wiki/Von_Neumann_architecture" target="_blank">Von Neumann Architecture</a></li><li><a href="https://en.wikipedia.org/wiki/Tensor_Processing_Unit" target="_blank">TPU == Tensor Processing Unit</a></li><li><a href="https://en.wikipedia.org/wiki/Field-programmable_gate_array" target="_blank">FPGA == Field-Programmable Gate Array</a></li><li><a href="https://spark.apache.org/" target="_blank">Spark</a></li><li><a href="https://trino.io/" target="_blank">Trino</a></li><li><a href="https://iceberg.apache.org/" target="_blank">Iceberg</a><ul><li><a href="https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://delta.io/" target="_blank">Delta Lake</a><ul><li><a href="https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://hudi.apache.org/" target="_blank">Hudi</a><ul><li><a href="https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://gluten.apache.org/" target="_blank">Apache Gluten</a></li><li><a href="https://en.wikipedia.org/wiki/Intermediate_representation" target="_blank">Intermediate Representation</a></li><li><a href="https://en.wikipedia.org/wiki/Turing_completeness" target="_blank">Turing Completeness</a></li><li><a href="https://llvm.org/" target="_blank">LLVM</a></li><li><a href="https://en.wikipedia.org/wiki/Amdahl%27s_law" target="_blank">Amdahl's Law</a></li><li><a href="https://en.wikipedia.org/wiki/Long_short-term_memory" target="_blank">LSTM == Long Short-Term Memory</a></li></ul>The intro and outro music is from <a 
href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>

55 MIN