Summary<br />In this episode Robert Nishihara, co-founder of Anyscale and co-creator of Ray, talks about maximizing hardware utilization for AI and data-intensive workloads. He explores Ray’s evolution alongside Kubernetes and PyTorch, and why consolidation at these layers has enabled a new generation of complex, heterogeneous workloads. Robert explains how data preparation has shifted to GPU- and inference-heavy, multimodal pipelines; where Ray fits compared to Spark and workflow orchestrators; and why Ray excels at composing heterogeneous pools of compute, handling failures, and scaling complex systems like multi-node LLM inference and reinforcement learning. He digs into practical strategies for boosting GPU utilization across training and inference, elasticity and prioritization of workloads, topology-aware scheduling, and the importance of fast failure recovery as hardware scales from nodes to racks. If you’re wrestling with expensive GPUs, multimodal data curation, or cross-node LLM inference, this conversation offers concrete mental models and architectural guidance.<br /><br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Your host is Tobias Macey and today I'm interviewing Robert Nishihara about the challenges of maximizing the utility of your available hardware for AI applications</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you start by giving an overview of the major contributors to wasted or idle compute?</li><li>Why does it matter if the available compute isn't being maximized?</li><li>What are some of the typical ad-hoc methods that teams might use to try to get the most out of their available hardware (especially GPUs)? 
</li><li>What are the most interesting, innovative, or unexpected ways that you have seen Ray used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ray and distributed compute for data and AI?</li><li>When is Ray the wrong choice?</li><li>What do you have planned for the future of Ray?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/robert-nishihara-b6465444/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! Email
[email protected] with your story.</li></ul>Links<br /><ul><li><a href="https://www.anyscale.com/" target="_blank">Anyscale</a></li><li><a href="https://www.ray.io/" target="_blank">Ray</a></li><li><a href="https://en.wikipedia.org/wiki/Deep_learning" target="_blank">Deep Learning</a></li><li><a href="https://en.wikipedia.org/wiki/Computer_vision" target="_blank">Computer Vision</a></li><li><a href="https://kubernetes.io/" target="_blank">Kubernetes</a></li><li><a href="https://cursor.com/" target="_blank">Cursor</a></li><li><a href="https://code.claude.com/docs/en/overview" target="_blank">Claude Code</a></li><li><a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html" target="_blank">KubeRay</a></li><li><a href="https://pytorch.org/" target="_blank">PyTorch</a></li><li><a href="https://www.tensorflow.org/" target="_blank">TensorFlow</a></li><li><a href="https://github.com/theano/theano" target="_blank">Theano</a></li><li><a href="https://en.wikipedia.org/wiki/Caffe_(software)" target="_blank">Caffe</a></li><li><a href="https://vllm.ai/" target="_blank">vLLM</a></li><li><a href="https://docs.sglang.io/" target="_blank">SGLang</a></li><li><a href="https://docs.ray.io/en/latest/tune/index.html" target="_blank">Ray Tune</a></li><li><a href="https://en.wikipedia.org/wiki/Neural_network_(machine_learning)" target="_blank">Neural Network</a></li><li><a href="https://en.wikipedia.org/wiki/Learning_rate" target="_blank">Learning Rate</a></li><li><a href="https://en.wikipedia.org/wiki/Reinforcement_learning" target="_blank">Reinforcement Learning</a></li><li><a href="https://deepmind.google/research/alphago/" target="_blank">AlphaGo</a></li><li><a href="https://cursor.com/blog/composer-2" target="_blank">Cursor Composer 2</a></li><li><a href="https://en.wikipedia.org/wiki/ImageNet" target="_blank">ImageNet</a></li><li><a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning)" target="_blank">Transformer Architecture</a></li><li><a 
href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent" target="_blank">Stochastic Gradient Descent</a></li><li><a href="https://airflow.apache.org/" target="_blank">Airflow</a></li><li><a href="https://dagster.io/" target="_blank">Dagster</a></li><li><a href="https://flyte.org/" target="_blank">Flyte</a></li><li><a href="https://en.wikipedia.org/wiki/Mixture_of_experts" target="_blank">Mixture of Experts</a></li><li><a href="https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests" target="_blank">Prefill</a></li><li><a href="https://temporal.io/" target="_blank">Temporal</a></li><li><a href="https://en.wikipedia.org/wiki/Actor_model" target="_blank">Actor Model</a></li><li><a href="https://en.wikipedia.org/wiki/Remote_direct_memory_access" target="_blank">RDMA (Remote Direct Memory Access)</a></li><li><a href="https://www.cisco.com/site/us/en/learn/topics/computing/what-is-neocloud.html" target="_blank">Neoclouds</a></li><li><a href="https://www.aiengineeringpodcast.com/gpu-cloud-marketplace-episode-75" target="_blank">AI Engineering Podcast Episode</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>