Feldera: Bridging Batch and Streaming with Incremental Computation

NOV 4, 2024 · 47 MIN

Description

Summary

In this episode of the Data Engineering Podcast, the creators of Feldera talk about their incremental compute engine designed for continuous computation of data, machine learning, and AI workloads. The discussion covers the concept of incremental computation, the origins of Feldera, and its unique ability to handle both streaming and batch data seamlessly. The guests explore Feldera's architecture, applications in real-time machine learning and AI, and challenges in educating users about incremental computation. They also discuss the balance between open-source and enterprise offerings, and the broader implications of incremental computation for the future of data management, predicting a shift towards unified systems that handle both batch and streaming data efficiently.

Announcements

<ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today!</li><li>As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us you should listen to Data Citizens® Dialogues, the forward-thinking podcast from the folks at Collibra. 
You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to <a href="https://www.collibra.com/podcasts" target="_blank">Data Citizens Dialogues</a> on Apple, Spotify, YouTube, or wherever you get your podcasts.</li><li>Your host is Tobias Macey and today I'm interviewing Leonid Ryzhyk, Lalith Suresh, and Mihai Budiu about Feldera, an incremental compute engine for continuous computation of data, ML, and AI workloads</li></ul>

Interview

<ul><li>Introduction</li><li>Can you describe what Feldera is and the story behind it?</li><li>DBSP (the theory behind Feldera) has won multiple awards from the database research community. Can you explain what it is and how it solves the incremental computation problem?</li><li>Depending on which angle you look at it, Feldera has attributes of data warehouses, federated query engines, and stream processors. 
What are the unique use cases that Feldera is designed to address?<ul><li>In what situations would you replace another technology with Feldera?</li><li>When is it an additive technology?</li></ul></li><li>Can you describe the architecture of Feldera?<ul><li>How have the design and scope evolved since you first started working on it?</li></ul></li><li>What are the state storage interfaces available in Feldera?<ul><li>What are the opportunities for integrating with or building on top of open table formats like Iceberg, Lance, Hudi, etc.?</li></ul></li><li>Can you describe a typical workflow for an engineer building with Feldera?</li><li>You advertise Feldera's utility in ML and AI use cases in addition to data management. What are the features that make it conducive to those applications?</li><li>What is your philosophy toward community growth and engagement with the open source aspects of Feldera, and how are you balancing that with the sustainability of the project and business?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Feldera used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on Feldera?</li><li>When is Feldera the wrong choice?</li><li>What do you have planned for the future of Feldera?</li></ul>

Contact Info

<ul><li>Leonid<ul><li><a href="https://ryzhyk.net/" target="_blank">Website</a></li><li><a href="https://github.com/ryzhyk" target="_blank">GitHub</a></li><li><a href="https://www.linkedin.com/in/leonid-ryzhyk-0ba031b9/" target="_blank">LinkedIn</a></li></ul></li><li>Lalith<ul><li><a href="https://www.linkedin.com/in/lalith-suresh-34bb8911/" target="_blank">LinkedIn</a></li><li><a href="https://lalith.in/research/" target="_blank">Website</a></li></ul></li><li>Mihai<ul><li><a href="https://mihaibudiu.github.io/work/index.html" target="_blank">Website</a></li><li><a href="https://github.com/mihaibudiu" 
target="_blank">GitHub</a></li></ul></li></ul>

Parting Question

<ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>

Closing Announcements

<ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.</li></ul>

Links

<ul><li><a href="https://www.feldera.com/" target="_blank">Feldera</a><ul><li><a href="https://github.com/feldera/feldera" target="_blank">GitHub</a></li></ul></li><li><a href="https://arxiv.org/abs/2203.16684" target="_blank">DBSP</a> paper<ul><li><a href="https://docs.rs/dbsp/latest/dbsp/" target="_blank">Rust Crate</a></li></ul></li><li><a href="https://timelydataflow.github.io/differential-dataflow/" target="_blank">Differential Dataflow</a></li><li><a href="https://trino.io/" target="_blank">Trino</a></li><li><a href="https://flink.apache.org/" target="_blank">Flink</a></li><li><a href="https://spark.apache.org/" target="_blank">Spark</a></li><li><a href="https://materialize.com/" target="_blank">Materialize</a></li><li><a href="https://clickhouse.com/" target="_blank">ClickHouse</a><ul><li><a href="https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://duckdb.org/" target="_blank">DuckDB</a><ul><li><a 
href="https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://www.snowflake.com" target="_blank">Snowflake</a></li><li><a href="https://arrow.apache.org/" target="_blank">Arrow</a></li><li><a href="https://substrait.io/" target="_blank">Substrait</a></li><li><a href="https://datafusion.apache.org/" target="_blank">DataFusion</a></li><li><a href="https://en.wikipedia.org/wiki/Digital_signal_processing" target="_blank">DSP == Digital Signal Processing</a></li><li><a href="https://en.wikipedia.org/wiki/Change_data_capture" target="_blank">CDC == Change Data Capture</a></li><li><a href="https://prql-lang.org/" target="_blank">PRQL</a></li><li><a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" target="_blank">LSM (Log-Structured Merge) Tree</a></li><li><a href="https://iceberg.apache.org/" target="_blank">Iceberg</a><ul><li><a href="https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://delta.io/" target="_blank">Delta Lake</a><ul><li><a href="https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://www.openvswitch.org/" target="_blank">Open vSwitch</a></li><li><a href="https://en.wikipedia.org/wiki/Feature_engineering" target="_blank">Feature Engineering</a></li><li><a href="https://calcite.apache.org/" target="_blank">Calcite</a></li></ul>

The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>