Data Engineering Podcast

Tobias Macey


Details

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Recent Episodes

Accelerate Migration Of Your Data Warehouse with Datafold's AI Powered Migration Agent
OCT 27, 2024
Summary<br />Gleb Mezhanskiy, CEO and co-founder of Datafold, joins Tobias Macey to discuss the challenges and innovations in data migrations. Gleb shares his experiences building and scaling data platforms at companies like Autodesk and Lyft, and how these experiences inspired the creation of Datafold to address data quality issues across teams. He outlines the complexities of data migrations, including common pitfalls such as technical debt and the importance of achieving parity between old and new systems. Gleb also discusses Datafold's use of AI and large language models (LLMs) to automate translation and reconciliation in data migrations, reducing the time and effort they require.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? 
Learn more at <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today!</li><li>Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about Datafold's experience bringing AI to bear on the problem of migrating your data stack</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what the Data Migration Agent is and the story behind it?<ul><li>What is the core problem that you are targeting with the agent?</li></ul></li><li>What are the biggest time sinks in the process of database and tooling migration that teams run into?</li><li>Can you describe the architecture of your agent?<ul><li>What was your selection and evaluation process for the LLM that you are using?</li></ul></li><li>What were some of the main unknowns that you had to discover going into the project?<ul><li>What are some of the evolutions in the ecosystem that occurred either during the development process or since your initial launch that have caused you to second-guess elements of the design?</li></ul></li><li>In terms of SQL translation there are libraries such as SQLGlot and the work being done with SDF that aim to address that through AST parsing and subsequent dialect generation. 
What are the ways that approach is insufficient in the context of a platform migration?</li><li>How does the approach you are taking with the combination of data-diffing and automated translation help build confidence in the migration target?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen the Data Migration Agent used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI powered migration assistant?</li><li>When is the data migration agent the wrong choice?</li><li>What do you have planned for the future of applications of AI at Datafold?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/glebmezh/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email <a target="_blank">[email protected]</a> with your story.</li></ul>Links<br /><ul><li><a href="https://www.datafold.com/" target="_blank">Datafold</a></li><li><a href="https://www.datafold.com/data-migration" target="_blank">Datafold Migration Agent</a></li><li><a href="https://www.datafold.com/data-diff" target="_blank">Datafold data-diff</a></li><li><a href="https://www.dataengineeringpodcast.com/datafold-database-reconciliation-episode-417" target="_blank">Datafold Reconciliation Podcast Episode</a></li><li><a href="https://github.com/tobymao/sqlglot" target="_blank">SQLGlot</a></li><li><a href="https://github.com/lark-parser/lark" target="_blank">Lark</a> parser</li><li><a href="https://www.anthropic.com/news/claude-3-5-sonnet" target="_blank">Claude 3.5 Sonnet</a></li><li><a href="https://cloud.google.com/looker/?hl=en" target="_blank">Looker</a><ul><li><a href="https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55" target="_blank">Podcast Episode</a></li></ul></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
48 MIN
Bring Vector Search And Storage To The Data Lake With Lance
OCT 20, 2024
Summary<br />The rapid growth of generative AI applications has prompted a surge of investment in vector databases. While there are numerous engines available now, Lance is designed to integrate with data lake and lakehouse architectures. In this episode Weston Pace explains the inner workings of the Lance format for table definitions and file storage, and the optimizations that they have made to allow for fast random access and efficient schema evolution. In addition to integrating well with data lakes, Lance is also a first-class participant in the Arrow ecosystem, making it easy to use with your existing ML and AI toolchains. This is a fascinating conversation about a technology that is focused on expanding the range of options for working with vector data.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? 
Learn more at <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today!</li><li>Your host is Tobias Macey and today I'm interviewing Weston Pace about the Lance file and table format for column-oriented vector storage</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what Lance is and the story behind it?<ul><li>What are the core problems that Lance is designed to solve?<ul><li>What is explicitly out of scope?</li></ul></li></ul></li><li>The README mentions that it is straightforward to convert to Lance from Parquet. What is the motivation for this compatibility/conversion support?<ul><li>What formats does Lance replace or obviate?</li></ul></li><li>In terms of data modeling Lance obviously adds a vector type, what are the features and constraints that engineers should be aware of when modeling their embeddings or arbitrary vectors?<ul><li>Are there any practical or hard limitations on vector dimensionality?</li></ul></li><li>When generating Lance files/datasets, what are some considerations to be aware of for balancing file/chunk sizes for I/O efficiency and random access in cloud storage?</li><li>I noticed that the file specification has space for feature flags. How has that aided in enabling experimentation in new capabilities and optimizations?</li><li>What are some of the engineering and design decisions that were most challenging and/or had the biggest impact on the performance and utility of Lance?</li><li>The most obvious interface for reading and writing Lance files is through LanceDB. Can you describe the use cases that it focuses on and its notable features?<ul><li>What are the other main integrations for Lance?</li><li>What are the opportunities or roadblocks in adding support for Lance and vector storage/indexes in e.g. 
Iceberg or Delta to enable its use in data lake environments?</li></ul></li><li>What are the most interesting, innovative, or unexpected ways that you have seen Lance used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Lance format?</li><li>When is Lance the wrong choice?</li><li>What do you have planned for the future of Lance?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/weston-pace-cool-dude/" target="_blank">LinkedIn</a></li><li><a href="https://github.com/westonpace" target="_blank">GitHub</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Links<br /><ul><li><a href="https://lancedb.github.io/lance/" target="_blank">Lance Format</a></li><li><a href="https://lancedb.github.io/lancedb/" target="_blank">LanceDB</a></li><li><a href="https://substrait.io/" target="_blank">Substrait</a></li><li><a href="https://arrow.apache.org/docs/python/index.html" target="_blank">PyArrow</a></li><li><a href="https://github.com/facebookresearch/faiss" target="_blank">FAISS</a></li><li><a href="https://www.pinecone.io/" target="_blank">Pinecone</a><ul><li><a href="https://www.dataengineeringpodcast.com/pinecone-vector-database-similarity-search-episode-189/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://parquet.apache.org/" target="_blank">Parquet</a></li><li><a href="https://iceberg.apache.org/" target="_blank">Iceberg</a><ul><li><a href="https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://delta.io/" target="_blank">Delta Lake</a><ul><li><a href="https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://github.com/lancedb/lance/tree/main/python" 
target="_blank">PyLance</a></li><li><a href="https://en.wikipedia.org/wiki/Hilbert_curve" target="_blank">Hilbert Curves</a></li><li><a href="https://en.wikipedia.org/wiki/Scale-invariant_feature_transform" target="_blank">SIFT Vectors</a></li><li><a href="https://aws.amazon.com/s3/storage-classes/express-one-zone/" target="_blank">S3 Express</a></li><li><a href="https://www.weka.io/" target="_blank">Weka</a></li><li><a href="https://datafusion.apache.org/" target="_blank">DataFusion</a></li><li><a href="https://www.ray.io/" target="_blank">Ray Data</a></li><li><a href="https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#preparing-your-data-for-training-with-dataloaders" target="_blank">Torch Data Loader</a></li><li><a href="https://lancedb.github.io/lancedb/concepts/index_hnsw/" target="_blank">HNSW == Hierarchical Navigable Small Worlds</a> vector index</li><li><a href="https://lancedb.github.io/lancedb/concepts/index_ivfpq/" target="_blank">IVFPQ</a> vector index</li><li><a href="https://geojson.org/" target="_blank">GeoJSON</a></li><li><a href="https://docs.pola.rs/" target="_blank">Polars</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
58 MIN
The Role of Python in Shaping the Future of Data Platforms with DLT
OCT 13, 2024
Summary<br />In this episode of the Data Engineering Podcast, Adrian Brudaru and Marcin Rudolf, co-founders of dltHub, delve into the principles guiding dlt's development, emphasizing its role as a library rather than a platform, and its integration with lakehouse architectures and AI application frameworks. The episode explores the impact of the Python ecosystem's growth on dlt, highlighting integrations with high-performance libraries and the benefits of Arrow and DuckDB. The episode concludes with a discussion on the future of dlt, including plans for a portable data lake and the importance of interoperability in data management tools.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? 
Learn more at <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today!</li><li>Your host is Tobias Macey and today I'm interviewing Adrian Brudaru and Marcin Rudolf, cofounders at dltHub, about the growth of dlt and the numerous ways that you can use it to address the complexities of data integration</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what dlt is and how it has evolved since we last spoke (September 2023)?<ul><li>What are the core principles that guide your work on dlt and dlthub?</li></ul></li><li>You have taken a very opinionated stance against managed extract/load services. What are the shortcomings of those platforms, and when would you argue in their favor?</li><li>The landscape of data movement has undergone some interesting changes over the past year. Most notably, the growth of PyAirbyte and the rapid shifts around the needs of generative AI stacks (vector stores, unstructured data processing, etc.). How has that informed your product development and positioning?<ul><li>The Python ecosystem, and in particular data-oriented Python, has also undergone substantial evolution. What are the developments in the libraries and frameworks that you have been able to benefit from?</li></ul></li><li>What are some of the notable investments that you have made in the developer experience for building dlt pipelines?<ul><li>How have the interfaces for source/destination development improved?</li></ul></li><li>You recently published a post about the idea of a portable data lake. 
What are the missing pieces that would make that possible, and what are the developments/technologies that put that idea within reach?</li><li>What is your strategy for building a sustainable product on top of dlt?<ul><li>How does that strategy help to form a "virtuous cycle" of improving the open source foundation?</li></ul></li><li>What are the most interesting, innovative, or unexpected ways that you have seen dlt used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?</li><li>When is dlt the wrong choice?</li><li>What do you have planned for the future of dlt/dlthub?</li></ul>Contact Info<br /><ul><li>Adrian<ul><li><a href="https://www.linkedin.com/in/data-team/?originalSubdomain=de" target="_blank">LinkedIn</a></li></ul></li><li>Marcin<ul><li><a href="https://www.linkedin.com/in/marcinrudolf/?originalSubdomain=de" target="_blank">LinkedIn</a></li></ul></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email <a target="_blank">[email protected]</a> with your story.</li></ul>Links<br /><ul><li><a href="https://dlthub.com" target="_blank">dlt</a><ul><li><a href="https://www.dataengineeringpodcast.com/dlt-data-integration-library-episode-390" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://arrow.apache.org/docs/python/" target="_blank">PyArrow</a></li><li><a href="https://docs.pola.rs/" target="_blank">Polars</a></li><li><a href="https://ibis-project.org/" target="_blank">Ibis</a></li><li><a href="https://duckdb.org/" target="_blank">DuckDB</a><ul><li><a href="https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://dlthub.com/docs/general-usage/schema-contracts" target="_blank">dlt Data Contracts</a></li><li><a href="https://github.blog/ai-and-ml/generative-ai/what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/" target="_blank">RAG == Retrieval Augmented Generation</a><ul><li><a href="https://www.aiengineeringpodcast.com/retrieval-augmented-generation-implementation-episode-34" target="_blank">AI Engineering Podcast Episode</a></li></ul></li><li><a href="https://docs.airbyte.com/using-airbyte/pyairbyte/getting-started" target="_blank">PyAirbyte</a></li><li><a href="https://openai.com/o1/" target="_blank">OpenAI o1 Model</a></li><li><a href="https://lancedb.com/" target="_blank">LanceDB</a></li><li><a href="https://qdrant.tech/" target="_blank">Qdrant Embedded</a></li><li><a href="https://airflow.apache.org/" target="_blank">Airflow</a></li><li><a href="https://github.com/features/actions" target="_blank">GitHub Actions</a></li><li><a href="https://datafusion.apache.org/" target="_blank">Arrow DataFusion</a></li><li><a href="https://arrow.apache.org/" target="_blank">Apache Arrow</a></li><li><a href="https://py.iceberg.apache.org/" target="_blank">PyIceberg</a></li><li><a href="https://github.com/delta-io/delta-rs" 
target="_blank">Delta-RS</a></li><li><a href="https://dlthub.com/docs/general-usage/incremental-loading#scd2-strategy" target="_blank">SCD2 == Slowly Changing Dimensions</a></li><li><a href="https://www.sqlalchemy.org/" target="_blank">SQLAlchemy</a></li><li><a href="https://github.com/tobymao/sqlglot" target="_blank">SQLGlot</a></li><li><a href="https://github.com/fsspec/" target="_blank">FSSpec</a></li><li><a href="https://docs.pydantic.dev/latest/" target="_blank">Pydantic</a></li><li><a href="https://spacy.io/" target="_blank">Spacy</a></li><li><a href="https://en.wikipedia.org/wiki/Named-entity_recognition" target="_blank">Entity Recognition</a></li><li><a href="https://parquet.apache.org/" target="_blank">Parquet File Format</a></li><li><a href="https://book.pythontips.com/en/latest/decorators.html" target="_blank">Python Decorator</a></li><li><a href="https://dlthub.com/blog/rest-api-source-client" target="_blank">REST API Toolkit</a></li><li><a href="https://dlthub.com/docs/dlt-ecosystem/verified-sources/openapi-generator" target="_blank">OpenAPI Connector Generator</a></li><li><a href="https://github.com/sfu-db/connector-x" target="_blank">ConnectorX</a></li><li><a href="https://www.blog.pythonlibrary.org/2024/03/14/python-3-13-allows-disabling-of-the-gil-subinterpreters/" target="_blank">Python no-GIL</a></li><li><a href="https://delta.io/" target="_blank">Delta Lake</a><ul><li><a href="https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://sqlmesh.readthedocs.io/en/stable/" target="_blank">SQLMesh</a><ul><li><a href="https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://github.com/DAGWorks-Inc/hamilton" target="_blank">Hamilton</a></li><li><a href="https://www.tabular.io/" target="_blank">Tabular</a></li><li><a href="https://posthog.com/" 
target="_blank">PostHog</a><ul><li><a href="https://www.pythonpodcast.com/episodepage/open-source-product-analytics-with-posthog" target="_blank">Podcast.__init__ Episode</a></li></ul></li><li><a href="https://docs.python.org/3/library/asyncio.html" target="_blank">AsyncIO</a></li><li><a href="https://www.cursor.com/" target="_blank">Cursor.AI</a></li><li><a href="https://www.datamesh-architecture.com/" target="_blank">Data Mesh</a><ul><li><a href="https://www.dataengineeringpodcast.com/episodepage/straining-your-data-lake-through-a-data-mesh" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://fastapi.tiangolo.com/" target="_blank">FastAPI</a></li><li><a href="https://www.langchain.com/" target="_blank">LangChain</a></li><li><a href="https://neo4j.com/blog/graphrag-manifesto/" target="_blank">GraphRAG</a><ul><li><a href="https://www.aiengineeringpodcast.com/graphrag-knowledge-graph-semantic-retrieval-episode-37" target="_blank">AI Engineering Podcast Episode</a></li></ul></li><li><a href="https://en.wikipedia.org/wiki/Property_graph" target="_blank">Property Graph</a></li><li><a href="https://docs.astral.sh/uv/" target="_blank">Python uv</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
54 MIN
Build Your Data Transformations Faster And Safer With SDF
OCT 6, 2024
Summary<br />In this episode of the Data Engineering Podcast, Lukas Schulte, co-founder and CEO of SDF, explores the development and capabilities of this fast and expressive SQL transformation tool. From its origins as a solution for addressing data privacy, governance, and quality concerns in modern data management, to its unique features like static analysis and type correctness, Lukas dives into what sets SDF apart from other tools like dbt and SQLMesh. Tune in for insights on building a business around a developer tool, the importance of community and user experience in the data engineering ecosystem, and plans for future development, including supporting Python models and enhancing execution capabilities.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? 
Learn more at <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today!</li><li>Your host is Tobias Macey and today I'm interviewing Lukas Schulte about SDF, a fast and expressive SQL transformation tool that understands your schema</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what SDF is and the story behind it?<ul><li>What's the story behind the name?</li></ul></li><li>What problem are you solving with SDF?<ul><li>dbt has been the dominant player for SQL-based transformations for several years, with other notable competition in the form of SQLMesh. Can you give an overview of the venn diagram for features and functionality across SDF, dbt and SQLMesh?</li></ul></li><li>Can you describe the design and implementation of SDF?<ul><li>How have the scope and goals of the project changed since you first started working on it?</li></ul></li><li>What does the development experience look like for a team working with SDF?<ul><li>How does that differ between the open and paid versions of the product?</li></ul></li><li>What are the features and functionality that SDF offers to address intra- and inter-team collaboration?</li><li>One of the challenges for any second-mover technology with an established competitor is the adoption/migration path for teams who have already invested in the incumbent (dbt in this case). How are you addressing that barrier for SDF?<ul><li>Beyond the core migration path of the direct functionality of the incumbent product is the amount of tooling and communal knowledge that grows up around that product. 
How are you thinking about that aspect of the current landscape?</li></ul></li><li>What is your governing principle for what capabilities are in the open core and which go in the paid product?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen SDF used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on SDF?</li><li>When is SDF the wrong choice?</li><li>What do you have planned for the future of SDF?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/lukas-schulte-a6b16254/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Links<br /><ul><li><a href="https://www.sdf.com/" target="_blank">SDF</a></li><li><a href="https://www.datacamp.com/blog/semantic-layer" target="_blank">Semantic Data Warehouse</a></li><li><a href="https://asdf-vm.com/" target="_blank">asdf-vm</a></li><li><a href="https://www.getdbt.com/" target="_blank">dbt</a></li><li><a href="https://en.wikipedia.org/wiki/Lint_(software)" target="_blank">Software Linting</a></li><li><a href="https://sqlmesh.readthedocs.io/en/stable/" target="_blank">SQLMesh</a><ul><li><a href="https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://coalesce.io/" target="_blank">Coalesce</a><ul><li><a href="https://www.dataengineeringpodcast.com/coalesce-enterprise-analytics-transformations-episode-278" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://iceberg.apache.org/" target="_blank">Apache Iceberg</a><ul><li><a href="https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://duckdb.org/" target="_blank">DuckDB</a>&nbsp;<ul><li><a 
href="https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/" target="_blank">Podcast Episode</a>&nbsp;</li></ul></li><li><a href="https://docs.sdf.com/guide/basics/classifiers" target="_blank">SDF Classifiers</a></li><li><a href="https://docs.getdbt.com/docs/build/semantic-models" target="_blank">dbt Semantic Layer</a></li><li><a href="https://hub.getdbt.com/calogica/dbt_expectations/latest/" target="_blank">dbt expectations</a></li><li><a href="https://datafusion.apache.org/" target="_blank">Apache Datafusion</a></li><li><a href="https://ibis-project.org/" target="_blank">Ibis</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
42 MIN
Scaling Airbyte: Challenges and Milestones on the Road to 1.0
SEP 23, 2024
Summary<br />Airbyte is one of the most prominent platforms for data movement. Over the past 4 years they have invested heavily in solutions for scaling the self-hosted and cloud operations, as well as the quality and stability of their connectors. As a result of that hard work, they have declared their commitment to the future of the platform with a 1.0 release. In this episode Michel Tricot shares the highlights of their journey and the exciting new capabilities that are coming next.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Your host is Tobias Macey and today I'm interviewing Michel Tricot about the journey to the 1.0 launch of Airbyte and what that means for the project</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what Airbyte is and the story behind it?</li><li>What are some of the notable milestones that you have traversed on your path to the 1.0 release?</li><li>The ecosystem has gone through some significant shifts since you first launched Airbyte. How have trends such as generative AI, the rise and fall of the "modern data stack", and the shifts in investment impacted your overall product and business strategies?</li><li>What are some of the hard-won lessons that you have learned about the realities of data movement and integration?<ul><li>What are some of the most interesting/challenging/surprising edge cases or performance bottlenecks that you have had to address?</li></ul></li><li>What are the core architectural decisions that have proven to be effective?<ul><li>How has the architecture had to change as you progressed to the 1.0 release?</li></ul></li><li>A 1.0 version signals a degree of stability and commitment. 
Can you describe the decision process that you went through in committing to a 1.0 version?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Airbyte used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airbyte?</li><li>When is Airbyte the wrong choice?</li><li>What do you have planned for the future of Airbyte after the 1.0 launch?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/micheltricot/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email <a target="_blank">[email protected]</a> with your story.</li></ul>Links<br /><ul><li><a href="https://airbyte.com/" target="_blank">Airbyte</a><ul><li><a href="https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://airbyte.com/product/airbyte-cloud" target="_blank">Airbyte Cloud</a></li><li><a href="https://airbyte.com/product/connector-development-kit" target="_blank">Airbyte Connector Builder</a></li><li><a href="https://www.singer.io/" target="_blank">Singer Protocol</a></li><li><a href="https://docs.airbyte.com/understanding-airbyte/airbyte-protocol" target="_blank">Airbyte Protocol</a></li><li><a href="https://docs.airbyte.com/connector-development/cdk-python/" target="_blank">Airbyte CDK</a></li><li><a href="https://www.moderndatastack.xyz/" target="_blank">Modern Data Stack</a></li><li><a href="https://en.wikipedia.org/wiki/Extract,_load,_transform" target="_blank">ELT</a></li><li><a href="https://en.wikipedia.org/wiki/Vector_database" target="_blank">Vector Database</a></li><li><a href="https://www.getdbt.com/" target="_blank">dbt</a></li><li><a href="https://www.fivetran.com/" target="_blank">Fivetran</a><ul><li><a href="https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://meltano.com/" target="_blank">Meltano</a><ul><li><a href="https://www.dataengineeringpodcast.com/meltano-data-integration-episode-141" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://dlthub.com/docs/intro" target="_blank">dlt</a></li><li><a href="https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb" target="_blank">Reverse ETL</a></li><li><a href="https://neo4j.com/blog/graphrag-manifesto/" target="_blank">GraphRAG</a><ul><li><a href="https://www.aiengineeringpodcast.com/graphrag-knowledge-graph-semantic-retrieval-episode-37" target="_blank">AI 
Engineering Podcast Episode</a></li></ul></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
57 MIN