Building Scalable ML Systems on Kubernetes

AUG 15, 2024 · 50 MIN
AI Engineering Podcast

Description

Summary<br />In this episode of the AI Engineering Podcast, host Tobias Macey interviews Tammer Saleh, founder of SuperOrbital, about the potentials and pitfalls of using Kubernetes for machine learning workloads. The conversation delves into the specific needs of machine learning workflows, such as model tracking, versioning, and the use of Jupyter Notebooks, and how Kubernetes can support these tasks. Tammer emphasizes the importance of a unified API for different teams and the flexibility Kubernetes provides in handling various workloads. Finally, Tammer offers advice for teams considering Kubernetes for their machine learning workloads and discusses the future of Kubernetes in the ML ecosystem, including areas for improvement and innovation.<br />Announcements<br /><ul><li>Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems</li><li>Your host is Tobias Macey and today I'm interviewing Tammer Saleh about the potentials and pitfalls of using Kubernetes for your ML workloads.</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in Kubernetes?</li><li>For someone who is unfamiliar with Kubernetes, how would you summarize it?</li><li>For the context of this conversation, can you describe the different phases of ML that we're talking about?</li><li>Kubernetes was originally designed to handle scaling and distribution of stateless processes. ML is an inherently stateful problem domain. 
What challenges does that add for K8s environments?</li><li>What are the elements of an ML workflow that lend themselves well to a Kubernetes environment?</li><li>How much Kubernetes knowledge does an ML/data engineer need to know to get their work done?</li><li>What are the sharp edges of Kubernetes in the context of ML projects?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working with Kubernetes?</li><li>When is Kubernetes the wrong choice for ML?</li><li>What are the aspects of Kubernetes (core or the ecosystem) that you are keeping an eye on which will help improve its utility for ML workloads?</li></ul>Contact Info<br /><ul><li><a target="_blank">Email</a></li><li><a href="https://www.linkedin.com/in/tammersaleh/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for ML workloads today?</li></ul>Links<br /><ul><li><a href="https://superorbital.io" target="_blank">SuperOrbital</a></li><li><a href="https://www.cloudfoundry.org/" target="_blank">CloudFoundry</a></li><li><a href="https://www.heroku.com/" target="_blank">Heroku</a></li><li><a href="https://12factor.net/" target="_blank">12 Factor Model</a></li><li><a href="https://kubernetes.io/" target="_blank">Kubernetes</a></li><li><a href="https://docs.docker.com/compose/" target="_blank">Docker Compose</a></li><li><a href="https://superorbital.io/training/core-kubernetes/" target="_blank">Core K8s Class</a></li><li><a href="https://jupyter.org/" target="_blank">Jupyter Notebook</a></li><li><a href="https://www.crossplane.io/" target="_blank">Crossplane</a></li><li><a href="https://www.dndbeyond.com/monsters/16967-ochre-jelly" target="_blank">Ochre Jelly</a></li><li><a href="https://landscape.cncf.io/" target="_blank">CNCF (Cloud Native Computing Foundation) Landscape</a></li><li><a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/" 
target="_blank">Stateful Set</a></li><li><a href="https://github.blog/ai-and-ml/generative-ai/what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/" target="_blank">RAG == Retrieval Augmented Generation</a><ul><li><a href="https://www.aiengineeringpodcast.com/retrieval-augmented-generation-implementation-episode-34" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://www.kubeflow.org/" target="_blank">Kubeflow</a></li><li><a href="https://flyte.org/" target="_blank">Flyte</a><ul><li><a href="https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291" target="_blank">Data Engineering Podcast Episode</a></li></ul></li><li><a href="https://www.pachyderm.com/" target="_blank">Pachyderm</a><ul><li><a href="https://www.dataengineeringpodcast.com/epsiode-1-pachyderm-with-daniel-whitenack" target="_blank">Data Engineering Podcast Episode</a></li></ul></li><li><a href="https://www.coreweave.com/" target="_blank">CoreWeave</a></li><li><a href="https://kubernetes.io/docs/reference/kubectl/" target="_blank">Kubectl ("koob-cuddle")</a></li><li><a href="https://helm.sh/" target="_blank">Helm</a></li><li><a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/" target="_blank">CRD == Custom Resource Definition</a></li><li><a href="https://horovod.ai/" target="_blank">Horovod</a><ul><li><a href="https://www.pythonpodcast.com/ludwig-horovod-distributed-declarative-deep-learning-episode-341" target="_blank">Podcast.__init__ Episode</a></li></ul></li><li><a href="https://temporal.io/" target="_blank">Temporal</a></li><li><a href="https://slurm.schedmd.com/overview.html" target="_blank">Slurm</a></li><li><a href="https://www.ray.io/" target="_blank">Ray</a></li><li><a href="https://www.dask.org/" target="_blank">Dask</a></li><li><a href="https://en.wikipedia.org/wiki/InfiniBand" target="_blank">Infiniband</a></li></ul>The intro and outro music is from <a 
href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>