ML Infrastructure Without The Ops: Simplifying The ML Developer Experience With Runhouse
NOV 11, 2024 · 76 MIN
Description
Summary<br />Machine learning workflows have long been complex and difficult to operationalize. They are often characterized by a period of research, resulting in an artifact that gets passed to another engineer or team to prepare for running in production. The MLOps category of tools has tried to build a new set of utilities to reduce that friction, but has instead introduced a new barrier at the team and organizational level. Donny Greenberg took the lessons that he learned on the PyTorch team at Meta and created Runhouse. In this episode he explains how, by reducing the number of opinions in the framework, he has also reduced the complexity of moving from development to production for ML systems.<br /><br /><br />Announcements<br /><ul><li>Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems</li><li>Your host is Tobias Macey and today I'm interviewing Donny Greenberg about Runhouse and the current state of ML infrastructure</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in machine learning?</li><li>What are the core elements of infrastructure for ML and AI?<ul><li>How has that changed over the past ~5 years?</li><li>For the past few years the MLOps and data engineering stacks were built and managed separately. How does the current generation of tools and product requirements influence the present and future approach to those domains?</li></ul></li><li>There are numerous projects that aim to bridge the complexity gap in running Python and ML code from your laptop up to distributed compute on clouds (e.g. Ray, Metaflow, Dask, Modin, etc.). 
How do you view the decision process for teams trying to understand which tool(s) to use for managing their ML/AI developer experience?</li><li>Can you describe what Runhouse is and the story behind it?<ul><li>What are the core problems that you are working to solve?</li><li>What are the main personas that you are focusing on? (e.g. data scientists, DevOps, data engineers, etc.)</li><li>How does Runhouse factor into collaboration across skill sets and teams?</li></ul></li><li>Can you describe how Runhouse is implemented?<ul><li>How has the focus on developer experience informed the way that you think about the features and interfaces that you include in Runhouse?</li></ul></li><li>How do you think about the role of Runhouse in the integration with the AI/ML and data ecosystem?</li><li>What does the workflow look like for someone building with Runhouse?</li><li>What is involved in managing the coordination of compute and data locality to reduce networking costs and latencies?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Runhouse used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on Runhouse?</li><li>When is Runhouse the wrong choice?</li><li>What do you have planned for the future of Runhouse?</li><li>What is your vision for the future of infrastructure and developer experience in ML/AI?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/greenbergdon/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. The <a href="https://www.dataengineeringpodcast.com" target="_blank">Data Engineering Podcast</a> covers the latest on modern data management. 
<a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used.</li><li>Visit the <a href="https://www.aiengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.</li><li>To help other people find the show please leave a review on <a href="https://podcasts.apple.com/us/podcast/the-machine-learning-podcast/id1626358243" target="_blank">iTunes</a> and tell your friends and co-workers.</li></ul>Links<br /><ul><li><a href="https://www.run.house/" target="_blank">Runhouse</a><ul><li><a href="https://github.com/run-house/runhouse" target="_blank">GitHub</a></li></ul></li><li><a href="https://pytorch.org/" target="_blank">PyTorch</a><ul><li><a href="https://www.pythonpodcast.com/pytorch-deep-learning-epsiode-202" target="_blank">Podcast.__init__ Episode</a></li></ul></li><li><a href="https://kubernetes.io/" target="_blank">Kubernetes</a></li><li><a href="https://en.wikipedia.org/wiki/Bin_packing_problem" target="_blank">Bin Packing</a></li><li><a href="https://en.wikipedia.org/wiki/Linear_regression" target="_blank">Linear Regression</a></li><li><a href="https://developers.google.com/machine-learning/decision-forests/intro-to-gbdt" target="_blank">Gradient Boosted Decision Tree</a></li><li><a href="https://en.wikipedia.org/wiki/Deep_learning" target="_blank">Deep Learning</a></li><li><a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)" target="_blank">Transformer Architecture</a></li><li><a href="https://slurm.schedmd.com/documentation.html" target="_blank">Slurm</a></li><li><a href="https://aws.amazon.com/sagemaker/" target="_blank">SageMaker</a></li><li><a href="https://cloud.google.com/vertex-ai?hl=en" target="_blank">Vertex 
AI</a></li><li><a href="https://metaflow.org/" target="_blank">Metaflow</a><ul><li><a href="https://www.pythonpodcast.com/metaflow-machine-learning-operations-episode-274" target="_blank">Podcast.__init__ Episode</a></li></ul></li><li><a href="https://mlflow.org/" target="_blank">MLflow</a></li><li><a href="https://www.dask.org/" target="_blank">Dask</a><ul><li><a href="https://www.dataengineeringpodcast.com/episode-2-dask-with-matthew-rocklin" target="_blank">Data Engineering Podcast Episode</a></li></ul></li><li><a href="https://www.ray.io/" target="_blank">Ray</a><ul><li><a href="https://www.pythonpodcast.com/ray-distributed-computing-episode-258" target="_blank">Podcast.__init__ Episode</a></li></ul></li><li><a href="https://spark.apache.org/" target="_blank">Spark</a></li><li><a href="https://www.databricks.com/" target="_blank">Databricks</a></li><li><a href="https://www.snowflake.com/en/" target="_blank">Snowflake</a></li><li><a href="https://argo-cd.readthedocs.io/en/stable/" target="_blank">Argo CD</a></li><li><a href="https://pytorch.org/tutorials/beginner/dist_overview.html" target="_blank">PyTorch Distributed</a></li><li><a href="https://horovod.ai/" target="_blank">Horovod</a></li><li><a href="https://github.com/ggerganov/llama.cpp" target="_blank">llama.cpp</a></li><li><a href="https://www.prefect.io/" target="_blank">Prefect</a><ul><li><a href="https://www.dataengineeringpodcast.com/prefect-workflow-engine-episode-86" target="_blank">Data Engineering Podcast Episode</a></li></ul></li><li><a href="https://airflow.apache.org/" target="_blank">Airflow</a></li><li><a href="https://en.wikipedia.org/wiki/Out_of_memory" target="_blank">OOM == Out of Memory</a></li><li><a href="https://wandb.ai/site/" target="_blank">Weights and Biases</a></li><li><a href="https://knative.dev/docs/" target="_blank">Knative</a></li><li><a href="https://en.wikipedia.org/wiki/BERT_(language_model)" target="_blank">BERT</a> language model</li></ul>The intro and outro music is from 
<a href="https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/" target="_blank">Hitman's Lovesong feat. Paola Graziano</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a>/<a href="https://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA 3.0</a>