The Role of Python in Shaping the Future of Data Platforms with DLT
Summary<br />In this episode of the Data Engineering Podcast, Adrian Brudaru and Marcin Rudolf, co-founders of dltHub, discuss the principles guiding dlt's development, emphasizing its role as a library rather than a platform and its integration with lakehouse architectures and AI application frameworks. They explore how the growth of the Python ecosystem has shaped dlt, highlighting integrations with high-performance libraries and the benefits of Arrow and DuckDB. The episode concludes with a discussion of the future of dlt, including plans for a portable data lake and the importance of interoperability in data management tools.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? 
Learn more at <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today!</li><li>Your host is Tobias Macey and today I'm interviewing Adrian Brudaru and Marcin Rudolf, cofounders at dltHub, about the growth of dlt and the numerous ways that you can use it to address the complexities of data integration</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what dlt is and how it has evolved since we last spoke (September 2023)?<ul><li>What are the core principles that guide your work on dlt and dlthub?</li></ul></li><li>You have taken a very opinionated stance against managed extract/load services. What are the shortcomings of those platforms, and when would you argue in their favor?</li><li>The landscape of data movement has undergone some interesting changes over the past year. Most notably, the growth of PyAirbyte and the rapid shifts around the needs of generative AI stacks (vector stores, unstructured data processing, etc.). How has that informed your product development and positioning?<ul><li>The Python ecosystem, and in particular data-oriented Python, has also undergone substantial evolution. What are the developments in the libraries and frameworks that you have been able to benefit from?</li></ul></li><li>What are some of the notable investments that you have made in the developer experience for building dlt pipelines?<ul><li>How have the interfaces for source/destination development improved?</li></ul></li><li>You recently published a post about the idea of a portable data lake. 
What are the missing pieces that would make that possible, and what are the developments/technologies that put that idea within reach?</li><li>What is your strategy for building a sustainable product on top of dlt?<ul><li>How does that strategy help to form a "virtuous cycle" of improving the open source foundation?</li></ul></li><li>What are the most interesting, innovative, or unexpected ways that you have seen dlt used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?</li><li>When is dlt the wrong choice?</li><li>What do you have planned for the future of dlt/dlthub?</li></ul>Contact Info<br /><ul><li>Adrian<ul><li><a href="https://www.linkedin.com/in/data-team/?originalSubdomain=de" target="_blank">LinkedIn</a></li></ul></li><li>Marcin<ul><li><a href="https://www.linkedin.com/in/marcinrudolf/?originalSubdomain=de" target="_blank">LinkedIn</a></li></ul></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email <a target="_blank">[email protected]</a> with your story.</li></ul>Links<br /><ul><li><a href="https://dlthub.com" target="_blank">dlt</a><ul><li><a href="https://www.dataengineeringpodcast.com/dlt-data-integration-library-episode-390" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://arrow.apache.org/docs/python/" target="_blank">PyArrow</a></li><li><a href="https://docs.pola.rs/" target="_blank">Polars</a></li><li><a href="https://ibis-project.org/" target="_blank">Ibis</a></li><li><a href="https://duckdb.org/" target="_blank">DuckDB</a><ul><li><a href="https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://dlthub.com/docs/general-usage/schema-contracts" target="_blank">dlt Data Contracts</a></li><li><a href="https://github.blog/ai-and-ml/generative-ai/what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/" target="_blank">RAG == Retrieval Augmented Generation</a><ul><li><a href="https://www.aiengineeringpodcast.com/retrieval-augmented-generation-implementation-episode-34" target="_blank">AI Engineering Podcast Episode</a></li></ul></li><li><a href="https://docs.airbyte.com/using-airbyte/pyairbyte/getting-started" target="_blank">PyAirbyte</a></li><li><a href="https://openai.com/o1/" target="_blank">OpenAI o1 Model</a></li><li><a href="https://lancedb.com/" target="_blank">LanceDB</a></li><li><a href="https://qdrant.tech/" target="_blank">Qdrant Embedded</a></li><li><a href="https://airflow.apache.org/" target="_blank">Airflow</a></li><li><a href="https://github.com/features/actions" target="_blank">GitHub Actions</a></li><li><a href="https://datafusion.apache.org/" target="_blank">Arrow DataFusion</a></li><li><a href="https://arrow.apache.org/" target="_blank">Apache Arrow</a></li><li><a href="https://py.iceberg.apache.org/" target="_blank">PyIceberg</a></li><li><a href="https://github.com/delta-io/delta-rs" 
target="_blank">Delta-RS</a></li><li><a href="https://dlthub.com/docs/general-usage/incremental-loading#scd2-strategy" target="_blank">SCD2 == Slowly Changing Dimensions</a></li><li><a href="https://www.sqlalchemy.org/" target="_blank">SQLAlchemy</a></li><li><a href="https://github.com/tobymao/sqlglot" target="_blank">SQLGlot</a></li><li><a href="https://github.com/fsspec/" target="_blank">FSSpec</a></li><li><a href="https://docs.pydantic.dev/latest/" target="_blank">Pydantic</a></li><li><a href="https://spacy.io/" target="_blank">Spacy</a></li><li><a href="https://en.wikipedia.org/wiki/Named-entity_recognition" target="_blank">Entity Recognition</a></li><li><a href="https://parquet.apache.org/" target="_blank">Parquet File Format</a></li><li><a href="https://book.pythontips.com/en/latest/decorators.html" target="_blank">Python Decorator</a></li><li><a href="https://dlthub.com/blog/rest-api-source-client" target="_blank">REST API Toolkit</a></li><li><a href="https://dlthub.com/docs/dlt-ecosystem/verified-sources/openapi-generator" target="_blank">OpenAPI Connector Generator</a></li><li><a href="https://github.com/sfu-db/connector-x" target="_blank">ConnectorX</a></li><li><a href="https://www.blog.pythonlibrary.org/2024/03/14/python-3-13-allows-disabling-of-the-gil-subinterpreters/" target="_blank">Python no-GIL</a></li><li><a href="https://delta.io/" target="_blank">Delta Lake</a><ul><li><a href="https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://sqlmesh.readthedocs.io/en/stable/" target="_blank">SQLMesh</a><ul><li><a href="https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://github.com/DAGWorks-Inc/hamilton" target="_blank">Hamilton</a></li><li><a href="https://www.tabular.io/" target="_blank">Tabular</a></li><li><a href="https://posthog.com/" 
target="_blank">PostHog</a><ul><li><a href="https://www.pythonpodcast.com/episodepage/open-source-product-analytics-with-posthog" target="_blank">Podcast.__init__ Episode</a></li></ul></li><li><a href="https://docs.python.org/3/library/asyncio.html" target="_blank">AsyncIO</a></li><li><a href="https://www.cursor.com/" target="_blank">Cursor.AI</a></li><li><a href="https://www.datamesh-architecture.com/" target="_blank">Data Mesh</a><ul><li><a href="https://www.dataengineeringpodcast.com/episodepage/straining-your-data-lake-through-a-data-mesh" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://fastapi.tiangolo.com/" target="_blank">FastAPI</a></li><li><a href="https://www.langchain.com/" target="_blank">LangChain</a></li><li><a href="https://neo4j.com/blog/graphrag-manifesto/" target="_blank">GraphRAG</a><ul><li><a href="https://www.aiengineeringpodcast.com/graphrag-knowledge-graph-semantic-retrieval-episode-37" target="_blank">AI Engineering Podcast Episode</a></li></ul></li><li><a href="https://en.wikipedia.org/wiki/Property_graph" target="_blank">Property Graph</a></li><li><a href="https://docs.astral.sh/uv/" target="_blank">Python uv</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>