Data Engineering Podcast

Tobias Macey

Details

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Recent Episodes

The Role of Python in Shaping the Future of Data Platforms with DLT
OCT 13, 2024
Summary<br />In this episode of the Data Engineering Podcast, Adrian Brudaru and Marcin Rudolf, co-founders of dltHub, delve into the principles guiding dlt's development, emphasizing its role as a library rather than a platform, and its integration with lakehouse architectures and AI application frameworks. The episode explores the impact of the Python ecosystem's growth on dlt, highlighting integrations with high-performance libraries and the benefits of Arrow and DuckDB. The episode concludes with a discussion on the future of dlt, including plans for a portable data lake and the importance of interoperability in data management tools.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? 
Learn more at <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today!</li><li>Your host is Tobias Macey and today I'm interviewing Adrian Brudaru and Marcin Rudolf, cofounders at dltHub, about the growth of dlt and the numerous ways that you can use it to address the complexities of data integration</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what dlt is and how it has evolved since we last spoke (September 2023)?<ul><li>What are the core principles that guide your work on dlt and dlthub?</li></ul></li><li>You have taken a very opinionated stance against managed extract/load services. What are the shortcomings of those platforms, and when would you argue in their favor?</li><li>The landscape of data movement has undergone some interesting changes over the past year. Most notably, the growth of PyAirbyte and the rapid shifts around the needs of generative AI stacks (vector stores, unstructured data processing, etc.). How has that informed your product development and positioning?<ul><li>The Python ecosystem, and in particular data-oriented Python, has also undergone substantial evolution. What are the developments in the libraries and frameworks that you have been able to benefit from?</li></ul></li><li>What are some of the notable investments that you have made in the developer experience for building dlt pipelines?<ul><li>How have the interfaces for source/destination development improved?</li></ul></li><li>You recently published a post about the idea of a portable data lake. 
What are the missing pieces that would make that possible, and what are the developments/technologies that put that idea within reach?</li><li>What is your strategy for building a sustainable product on top of dlt?<ul><li>How does that strategy help to form a "virtuous cycle" of improving the open source foundation?</li></ul></li><li>What are the most interesting, innovative, or unexpected ways that you have seen dlt used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?</li><li>When is dlt the wrong choice?</li><li>What do you have planned for the future of dlt/dlthub?</li></ul>Contact Info<br /><ul><li>Adrian<ul><li><a href="https://www.linkedin.com/in/data-team/?originalSubdomain=de" target="_blank">LinkedIn</a></li></ul></li><li>Marcin<ul><li><a href="https://www.linkedin.com/in/marcinrudolf/?originalSubdomain=de" target="_blank">LinkedIn</a></li></ul></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email <a target="_blank">[email protected]</a> with your story.</li></ul>Links<br /><ul><li><a href="https://dlthub.com" target="_blank">dlt</a><ul><li><a href="https://www.dataengineeringpodcast.com/dlt-data-integration-library-episode-390" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://arrow.apache.org/docs/python/" target="_blank">PyArrow</a></li><li><a href="https://docs.pola.rs/" target="_blank">Polars</a></li><li><a href="https://ibis-project.org/" target="_blank">Ibis</a></li><li><a href="https://duckdb.org/" target="_blank">DuckDB</a><ul><li><a href="https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://dlthub.com/docs/general-usage/schema-contracts" target="_blank">dlt Data Contracts</a></li><li><a href="https://github.blog/ai-and-ml/generative-ai/what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/" target="_blank">RAG == Retrieval Augmented Generation</a><ul><li><a href="https://www.aiengineeringpodcast.com/retrieval-augmented-generation-implementation-episode-34" target="_blank">AI Engineering Podcast Episode</a></li></ul></li><li><a href="https://docs.airbyte.com/using-airbyte/pyairbyte/getting-started" target="_blank">PyAirbyte</a></li><li><a href="https://openai.com/o1/" target="_blank">OpenAI o1 Model</a></li><li><a href="https://lancedb.com/" target="_blank">LanceDB</a></li><li><a href="https://qdrant.tech/" target="_blank">Qdrant Embedded</a></li><li><a href="https://airflow.apache.org/" target="_blank">Airflow</a></li><li><a href="https://github.com/features/actions" target="_blank">GitHub Actions</a></li><li><a href="https://datafusion.apache.org/" target="_blank">Arrow DataFusion</a></li><li><a href="https://arrow.apache.org/" target="_blank">Apache Arrow</a></li><li><a href="https://py.iceberg.apache.org/" target="_blank">PyIceberg</a></li><li><a href="https://github.com/delta-io/delta-rs" 
target="_blank">Delta-RS</a></li><li><a href="https://dlthub.com/docs/general-usage/incremental-loading#scd2-strategy" target="_blank">SCD2 == Slowly Changing Dimensions</a></li><li><a href="https://www.sqlalchemy.org/" target="_blank">SQLAlchemy</a></li><li><a href="https://github.com/tobymao/sqlglot" target="_blank">SQLGlot</a></li><li><a href="https://github.com/fsspec/" target="_blank">FSSpec</a></li><li><a href="https://docs.pydantic.dev/latest/" target="_blank">Pydantic</a></li><li><a href="https://spacy.io/" target="_blank">Spacy</a></li><li><a href="https://en.wikipedia.org/wiki/Named-entity_recognition" target="_blank">Entity Recognition</a></li><li><a href="https://parquet.apache.org/" target="_blank">Parquet File Format</a></li><li><a href="https://book.pythontips.com/en/latest/decorators.html" target="_blank">Python Decorator</a></li><li><a href="https://dlthub.com/blog/rest-api-source-client" target="_blank">REST API Toolkit</a></li><li><a href="https://dlthub.com/docs/dlt-ecosystem/verified-sources/openapi-generator" target="_blank">OpenAPI Connector Generator</a></li><li><a href="https://github.com/sfu-db/connector-x" target="_blank">ConnectorX</a></li><li><a href="https://www.blog.pythonlibrary.org/2024/03/14/python-3-13-allows-disabling-of-the-gil-subinterpreters/" target="_blank">Python no-GIL</a></li><li><a href="https://delta.io/" target="_blank">Delta Lake</a><ul><li><a href="https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://sqlmesh.readthedocs.io/en/stable/" target="_blank">SQLMesh</a><ul><li><a href="https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://github.com/DAGWorks-Inc/hamilton" target="_blank">Hamilton</a></li><li><a href="https://www.tabular.io/" target="_blank">Tabular</a></li><li><a href="https://posthog.com/" 
target="_blank">PostHog</a><ul><li><a href="https://www.pythonpodcast.com/episodepage/open-source-product-analytics-with-posthog" target="_blank">Podcast.__init__ Episode</a></li></ul></li><li><a href="https://docs.python.org/3/library/asyncio.html" target="_blank">AsyncIO</a></li><li><a href="https://www.cursor.com/" target="_blank">Cursor.AI</a></li><li><a href="https://www.datamesh-architecture.com/" target="_blank">Data Mesh</a><ul><li><a href="https://www.dataengineeringpodcast.com/episodepage/straining-your-data-lake-through-a-data-mesh" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://fastapi.tiangolo.com/" target="_blank">FastAPI</a></li><li><a href="https://www.langchain.com/" target="_blank">LangChain</a></li><li><a href="https://neo4j.com/blog/graphrag-manifesto/" target="_blank">GraphRAG</a><ul><li><a href="https://www.aiengineeringpodcast.com/graphrag-knowledge-graph-semantic-retrieval-episode-37" target="_blank">AI Engineering Podcast Episode</a></li></ul></li><li><a href="https://en.wikipedia.org/wiki/Property_graph" target="_blank">Property Graph</a></li><li><a href="https://docs.astral.sh/uv/" target="_blank">Python uv</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
54 MIN
Build Your Data Transformations Faster And Safer With SDF
OCT 6, 2024
Summary<br />In this episode of the Data Engineering Podcast, Lukas Schulte, co-founder and CEO of SDF, explores the development and capabilities of this fast and expressive SQL transformation tool. From its origins as a solution for addressing data privacy, governance, and quality concerns in modern data management, to its unique features like static analysis and type correctness, Lukas dives into what sets SDF apart from other tools like dbt and SQLMesh. Tune in for insights on building a business around a developer tool, the importance of community and user experience in the data engineering ecosystem, and plans for future development, including supporting Python models and enhancing execution capabilities.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? 
Learn more at <a href="https://www.dataengineeringpodcast.com/datafold" target="_blank">dataengineeringpodcast.com/datafold</a> today!</li><li>Your host is Tobias Macey and today I'm interviewing Lukas Schulte about SDF, a fast and expressive SQL transformation tool that understands your schema</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what SDF is and the story behind it?<ul><li>What's the story behind the name?</li></ul></li><li>What problem are you solving with SDF?<ul><li>dbt has been the dominant player for SQL-based transformations for several years, with other notable competition in the form of SQLMesh. Can you give an overview of the venn diagram for features and functionality across SDF, dbt and SQLMesh?</li></ul></li><li>Can you describe the design and implementation of SDF?<ul><li>How have the scope and goals of the project changed since you first started working on it?</li></ul></li><li>What does the development experience look like for a team working with SDF?<ul><li>How does that differ between the open and paid versions of the product?</li></ul></li><li>What are the features and functionality that SDF offers to address intra- and inter-team collaboration?</li><li>One of the challenges for any second-mover technology with an established competitor is the adoption/migration path for teams who have already invested in the incumbent (dbt in this case). How are you addressing that barrier for SDF?<ul><li>Beyond the core migration path of the direct functionality of the incumbent product is the amount of tooling and communal knowledge that grows up around that product. 
How are you thinking about that aspect of the current landscape?</li></ul></li><li>What is your governing principle for what capabilities are in the open core and which go in the paid product?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen SDF used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on SDF?</li><li>When is SDF the wrong choice?</li><li>What do you have planned for the future of SDF?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/lukas-schulte-a6b16254/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Links<br /><ul><li><a href="https://www.sdf.com/" target="_blank">SDF</a></li><li><a href="https://www.datacamp.com/blog/semantic-layer" target="_blank">Semantic Data Warehouse</a></li><li><a href="https://asdf-vm.com/" target="_blank">asdf-vm</a></li><li><a href="https://www.getdbt.com/" target="_blank">dbt</a></li><li><a href="https://en.wikipedia.org/wiki/Lint_(software)" target="_blank">Software Linting</a></li><li><a href="https://sqlmesh.readthedocs.io/en/stable/" target="_blank">SQLMesh</a><ul><li><a href="https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://coalesce.io/" target="_blank">Coalesce</a><ul><li><a href="https://www.dataengineeringpodcast.com/coalesce-enterprise-analytics-transformations-episode-278" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://iceberg.apache.org/" target="_blank">Apache Iceberg</a><ul><li><a href="https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://duckdb.org/" target="_blank">DuckDB</a>&nbsp;<ul><li><a 
href="https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/" target="_blank">Podcast Episode</a>&nbsp;</li></ul></li><li><a href="https://docs.sdf.com/guide/basics/classifiers" target="_blank">SDF Classifiers</a></li><li><a href="https://docs.getdbt.com/docs/build/semantic-models" target="_blank">dbt Semantic Layer</a></li><li><a href="https://hub.getdbt.com/calogica/dbt_expectations/latest/" target="_blank">dbt expectations</a></li><li><a href="https://datafusion.apache.org/" target="_blank">Apache Datafusion</a></li><li><a href="https://ibis-project.org/" target="_blank">Ibis</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
42 MIN
Scaling Airbyte: Challenges and Milestones on the Road to 1.0
SEP 23, 2024
Summary<br />Airbyte is one of the most prominent platforms for data movement. Over the past 4 years they have invested heavily in solutions for scaling the self-hosted and cloud operations, as well as the quality and stability of their connectors. As a result of that hard work, they have declared their commitment to the future of the platform with a 1.0 release. In this episode Michel Tricot shares the highlights of their journey and the exciting new capabilities that are coming next.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Your host is Tobias Macey and today I'm interviewing Michel Tricot about the journey to the 1.0 launch of Airbyte and what that means for the project</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what Airbyte is and the story behind it?</li><li>What are some of the notable milestones that you have traversed on your path to the 1.0 release?</li><li>The ecosystem has gone through some significant shifts since you first launched Airbyte. How have trends such as generative AI, the rise and fall of the "modern data stack", and the shifts in investment impacted your overall product and business strategies?</li><li>What are some of the hard-won lessons that you have learned about the realities of data movement and integration?<ul><li>What are some of the most interesting/challenging/surprising edge cases or performance bottlenecks that you have had to address?</li></ul></li><li>What are the core architectural decisions that have proven to be effective?<ul><li>How has the architecture had to change as you progressed to the 1.0 release?</li></ul></li><li>A 1.0 version signals a degree of stability and commitment. 
Can you describe the decision process that you went through in committing to a 1.0 version?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Airbyte used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airbyte?</li><li>When is Airbyte the wrong choice?</li><li>What do you have planned for the future of Airbyte after the 1.0 launch?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/micheltricot/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email <a target="_blank">[email protected]</a> with your story.</li></ul>Links<br /><ul><li><a href="https://airbyte.com/" target="_blank">Airbyte</a><ul><li><a href="https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://airbyte.com/product/airbyte-cloud" target="_blank">Airbyte Cloud</a></li><li><a href="https://airbyte.com/product/connector-development-kit" target="_blank">Airbyte Connector Builder</a></li><li><a href="https://www.singer.io/" target="_blank">Singer Protocol</a></li><li><a href="https://docs.airbyte.com/understanding-airbyte/airbyte-protocol" target="_blank">Airbyte Protocol</a></li><li><a href="https://docs.airbyte.com/connector-development/cdk-python/" target="_blank">Airbyte CDK</a></li><li><a href="https://www.moderndatastack.xyz/" target="_blank">Modern Data Stack</a></li><li><a href="https://en.wikipedia.org/wiki/Extract,_load,_transform" target="_blank">ELT</a></li><li><a href="https://en.wikipedia.org/wiki/Vector_database" target="_blank">Vector Database</a></li><li><a href="https://www.getdbt.com/" target="_blank">dbt</a></li><li><a href="https://www.fivetran.com/" target="_blank">Fivetran</a><ul><li><a href="https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://meltano.com/" target="_blank">Meltano</a><ul><li><a href="https://www.dataengineeringpodcast.com/meltano-data-integration-episode-141" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://dlthub.com/docs/intro" target="_blank">dlt</a></li><li><a href="https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb" target="_blank">Reverse ETL</a></li><li><a href="https://neo4j.com/blog/graphrag-manifesto/" target="_blank">GraphRAG</a><ul><li><a href="https://www.aiengineeringpodcast.com/graphrag-knowledge-graph-semantic-retrieval-episode-37" target="_blank">AI 
Engineering Podcast Episode</a></li></ul></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
57 MIN
Enhancing Data Accessibility and Governance with Gravitino
SEP 1, 2024
Summary<br />As data architectures become more elaborate and the number of applications of data increases, it becomes increasingly challenging to locate and access the underlying data. Gravitino was created to provide a single interface to locate and query your data. In this episode Junping Du explains how Gravitino works, the capabilities that it unlocks, and how it fits into your data platform.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Your host is Tobias Macey and today I'm interviewing Junping Du about Gravitino, an open source metadata service for a unified view of all of your schemas</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what Gravitino is and the story behind it?</li><li>What problems are you solving with Gravitino?<ul><li>What are the methods that teams have relied on in the absence of Gravitino to address those use cases?</li></ul></li><li>What led to the Hive Metastore being the default for so long?<ul><li>What are the opportunities for innovation and new functionality in the metadata service?</li></ul></li><li>The documentation suggests that Gravitino has overlap with a number of tool categories such as table schema (Hive metastore), metadata repository (Open Metadata), data federation (Trino/Alluxio). 
What are the capabilities that it can completely replace, and which will require other systems for more comprehensive functionality?</li><li>What are the capabilities that you are explicitly keeping out of scope for Gravitino?</li><li>Can you describe the technical architecture of Gravitino?<ul><li>How have the design and scope evolved from when you first started working on it?</li></ul></li><li>Can you describe how Gravitino integrates into an overall data platform?<ul><li>In a typical day, what are the different ways that a data engineer or data analyst might interact with Gravitino?</li></ul></li><li>One of the features that you highlight is centralized permissions management. Can you describe the access control model that you use for unifying across underlying sources?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Gravitino used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on Gravitino?</li><li>When is Gravitino the wrong choice?</li><li>What do you have planned for the future of Gravitino?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/junping-du/" target="_blank">LinkedIn</a></li><li><a href="https://github.com/JunpingDu" target="_blank">GitHub</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. 
The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.</li></ul>Links<br /><ul><li><a href="https://gravitino.apache.org/" target="_blank">Gravitino</a></li><li><a href="https://hadoop.apache.org" target="_blank">Hadoop</a></li><li><a href="https://datastrato.ai/" target="_blank">Datastrato</a></li><li><a href="https://pytorch.org/" target="_blank">PyTorch</a></li><li><a href="https://www.ray.io/" target="_blank">Ray</a></li><li><a href="https://www.gartner.com/en/data-analytics/topics/data-fabric" target="_blank">Data Fabric</a></li><li><a href="https://hive.apache.org/" target="_blank">Hive</a></li><li><a href="https://iceberg.apache.org/" target="_blank">Iceberg</a><ul><li><a href="https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore" target="_blank">Hive Metastore</a></li><li><a href="https://trino.io/" target="_blank">Trino</a></li><li><a href="https://open-metadata.org/" target="_blank">OpenMetadata</a><ul><li><a href="https://www.dataengineeringpodcast.com/openmetadata-universal-metadata-layer-episode-237/" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://www.alluxio.io/" target="_blank">Alluxio</a></li><li><a href="https://atlan.com/" target="_blank">Atlan</a><ul><li><a href="https://www.dataengineeringpodcast.com/atlan-data-team-collaboration-episode-179" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://spark.apache.org/" 
target="_blank">Spark</a></li><li><a href="https://thrift.apache.org/" target="_blank">Thrift</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
38 MIN
The Evolution of DataOps: Insights from DataKitchen's CEO
AUG 4, 2024
Summary<br />In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Chris Bergh, CEO of DataKitchen, to discuss his ongoing mission to simplify the lives of data engineers. Chris explains the challenges faced by data engineers, such as constant system failures, the need for rapid changes, and high customer demands. He delves into the concept of DataOps, its evolution, and the misappropriation of related terms like data mesh and data observability. He emphasizes the importance of focusing on processes and systems rather than just tools to improve data engineering workflows. Chris also introduces DataKitchen's open-source tools, DataOps TestGen and DataOps Observability, designed to automate data quality validation and monitor data journeys in production.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? 
Go to <a href="https://www.dataengineeringpodcast.com/starburst" target="_blank">dataengineeringpodcast.com/starburst</a> and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.</li><li>Your host is Tobias Macey and today I'm interviewing Chris Bergh about his tireless quest to simplify the lives of data engineers</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what DataKitchen is and the story behind it?</li><li>You helped to define and popularize "DataOps", which then went through a journey of misappropriation similar to "DevOps", and has since faded in use. What is your view on the realities of "DataOps" today?</li><li>Out of the popularized wave of "DataOps" tools came subsequent trends in data observability, data reliability engineering, etc. How have those cycles influenced the way that you think about the work that you are doing at DataKitchen?</li><li>The data ecosystem went through a massive growth period over the past ~7 years, and we are now entering a cycle of consolidation. What are the fundamental shifts that we have gone through as an industry in the management and application of data?</li><li>What are the challenges that never went away?</li><li>You recently open sourced the dataops-testgen and dataops-observability tools. 
What are the outcomes that you are trying to produce with those projects?</li><li>What are the areas of overlap with existing tools and what are the unique capabilities that you are offering?</li><li>Can you talk through the technical implementation of your new observability and quality testing platform?</li><li>What does the onboarding and integration process look like?</li><li>Once a team has one or both tools set up, what are the typical points of interaction that they will have over the course of their workday?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen dataops-observability/testgen used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on promoting DataOps?</li><li>What do you have planned for the future of your work at DataKitchen?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/chrisbergh/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Links<br /><ul><li><a href="https://datakitchen.io/" target="_blank">DataKitchen</a></li><li><a href="https://www.dataengineeringpodcast.com/episodepage/datakitchen-dataops-with-chris-bergh-episode-26" target="_blank">Podcast Episode</a></li><li><a href="https://www.nasa.gov/ames/core-area-of-expertise-air-traffic-management/" target="_blank">NASA</a></li><li><a href="https://dataopsmanifesto.org/en/" target="_blank">DataOps Manifesto</a></li><li><a href="https://thenewstack.io/its-time-for-data-reliability-engineering/?utm_referrer=https%3A%2F%2Fwww.google.com%2F" target="_blank">Data Reliability Engineering</a></li><li><a href="https://www.ibm.com/topics/data-observability" target="_blank">Data Observability</a></li><li><a href="https://www.getdbt.com/" target="_blank">dbt</a></li><li><a 
href="https://itrevolution.com/product/enterprise-technology-leadership-summit-las-vegas-2024/" target="_blank">DevOps Enterprise Summit</a></li><li><a href="https://amzn.to/46BsRSo" target="_blank">Building The Data Warehouse</a> by Bill Inmon (affiliate link)</li><li><a href="https://github.com/DataKitchen/data-observability-installer" target="_blank">dataops-testgen, dataops-observability</a></li><li><a href="https://info.datakitchen.io/data-observability-and-data-quality-testing-certification" target="_blank">Free Data Quality and Data Observability Certification</a></li><li><a href="https://www.databricks.com/" target="_blank">Databricks</a></li><li><a href="https://dora.dev/" target="_blank">DORA Metrics</a></li><li><a href="https://datakitchen.io/two-downs-make-two-ups-the-only-success-metrics-that-matter-for-your-data-analytics-team/" target="_blank">DORA for data</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
53 MIN