Running Generative AI Models In Production

OCT 28, 2024 · 57 MIN
AI Engineering Podcast

Description

Summary<br />In this episode, Philip Kiely from Baseten talks about the intricacies of running open models in production. Philip shares his journey into AI and ML engineering, highlighting the importance of understanding product-level requirements and selecting the right model for deployment. The conversation covers the operational aspects of deploying AI models, including model evaluation, compound AI, and model serving frameworks such as TensorFlow Serving and AWS SageMaker. Philip also discusses the challenges of model quantization, rapid model evolution, and monitoring and observability in AI systems, offering insights into future trends in AI, including local inference and the competition between open source and proprietary models.<br /><br /><br />Announcements<br /><ul><li>Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems</li><li>Your host is Tobias Macey and today I'm interviewing Philip Kiely about running open models in production</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in machine learning?</li><li>Can you start by giving an overview of the major decisions to be made when planning the deployment of a generative AI model?</li><li>How does the model selected at the beginning of the process influence the downstream choices?</li><li>In terms of application architecture, the major patterns that I've seen are RAG, fine-tuning, multi-agent, or large model. What are the most common methods that you see? (and any that I failed to mention)<ul><li>How has the rapid succession of model generations impacted the ways that teams think about their overall application? (capabilities, features, architecture, etc.)</li></ul></li><li>In terms of model serving, I know that Baseten created Truss. 
What are some of the other notable options that teams are building with?<ul><li>What is the role of the serving framework in the context of the application?</li></ul></li><li>There are also a large number of inference engines that have been released. What are the major players in that arena?<ul><li>What are the features and capabilities that they are each basing their competitive advantage on?</li></ul></li><li>For someone who is new to AI Engineering, what are some heuristics that you would recommend when choosing an inference engine?</li><li>Once a model (or set of models) is in production and serving traffic it's necessary to have visibility into how it is performing. What are the key metrics that are necessary to monitor for generative AI systems?<ul><li>In the event that one (or more) metrics are trending negatively, what are the levers that teams can pull to improve them?</li></ul></li><li>When running models constructed with e.g. linear regression or deep learning there was a common issue with "concept drift". How does that manifest in the context of large language models, particularly when coupled with performance optimization?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen teams manage the serving of open gen AI models?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working with generative AI model serving?</li><li>When is Baseten the wrong choice?</li><li>What are the future trends and technology investments that you are focused on in the space of AI model serving?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/philipkiely/" target="_blank">LinkedIn</a></li><li><a href="https://x.com/philip_kiely" target="_blank">Twitter</a></li></ul>Parting Question<br /><ul><li>From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! 
Don't forget to check out our other shows. The <a href="https://www.dataengineeringpodcast.com" target="_blank">Data Engineering Podcast</a> covers the latest on modern data management. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used.</li><li>Visit the <a href="https://www.aiengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.</li><li>To help other people find the show please leave a review on <a href="https://podcasts.apple.com/us/podcast/the-machine-learning-podcast/id1626358243" target="_blank">iTunes</a> and tell your friends and co-workers.</li></ul>Links<br /><ul><li><a href="https://www.baseten.co/" target="_blank">Baseten</a><ul><li><a href="https://www.aiengineeringpodcast.com/wrap-your-model-in-a-full-stack-application-in-an-afternoon-with-baseten" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://en.wikipedia.org/wiki/Copyleft" target="_blank">Copyleft</a></li><li><a href="https://www.llama.com/" target="_blank">Llama Models</a></li><li><a href="https://www.nomic.ai/blog/posts/nomic-embed-text-v1" target="_blank">Nomic</a></li><li><a href="https://allenai.org/olmo" target="_blank">Olmo</a></li><li><a href="https://allenai.org/" target="_blank">Allen Institute for AI</a></li><li><a href="https://www.baseten.co/library/playground-v2-aesthetic/" target="_blank">Playground 2</a></li><li><a href="https://calmfund.com/thesis#:~:text=The%20Essential%20Ingredient%3A%20The%20Peace%20Dividend%20of%20the%20SaaS%20Wars&amp;text=A%20peace%20dividend%20refers%20to,put%20it%20to%20better%20uses." 
target="_blank">The Peace Dividend Of The SaaS Wars</a></li><li><a href="https://vercel.com/" target="_blank">Vercel</a></li><li><a href="https://www.netlify.com/" target="_blank">Netlify</a></li><li><a href="https://en.wikipedia.org/wiki/Retrieval-augmented_generation" target="_blank">RAG == Retrieval Augmented Generation</a><ul><li><a href="https://www.aiengineeringpodcast.com/retrieval-augmented-generation-implementation-episode-34" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://www.baseten.co/blog/compound-ai-systems-explained/" target="_blank">Compound AI</a></li><li><a href="https://www.langchain.com/" target="_blank">LangChain</a></li><li><a href="https://github.com/dottxt-ai/outlines" target="_blank">Outlines</a> Structured output for AI systems</li><li><a href="https://docs.baseten.co/deploy/overview" target="_blank">Truss</a></li><li><a href="https://docs.baseten.co/chains/overview" target="_blank">Chains</a></li><li><a href="https://www.llamaindex.ai/" target="_blank">LlamaIndex</a></li><li><a href="https://www.ray.io/" target="_blank">Ray</a></li><li><a href="https://mlflow.org/" target="_blank">MLflow</a></li><li><a href="https://github.com/replicate/cog" target="_blank">Cog</a> (Replicate) containers for ML</li><li><a href="https://www.bentoml.com/" target="_blank">BentoML</a></li><li><a href="https://www.djangoproject.com/" target="_blank">Django</a></li><li><a href="https://wsgi.readthedocs.io/en/latest/what.html" target="_blank">WSGI</a></li><li><a href="https://uwsgi-docs.readthedocs.io/en/latest/" target="_blank">uWSGI</a></li><li><a href="https://gunicorn.org/" target="_blank">Gunicorn</a></li><li><a href="https://zapier.com/" target="_blank">Zapier</a></li><li><a href="https://github.com/vllm-project/vllm" target="_blank">vLLM</a></li><li><a href="https://github.com/NVIDIA/TensorRT-LLM" target="_blank">TensorRT-LLM</a></li><li><a href="https://developer.nvidia.com/tensorrt" target="_blank">TensorRT</a></li><li><a 
href="https://www.baseten.co/blog/introduction-to-quantizing-ml-models/" target="_blank">Quantization</a></li><li><a href="https://arxiv.org/abs/2106.09685" target="_blank">LoRA</a> Low Rank Adaptation of Large Language Models</li><li><a href="https://en.wikipedia.org/wiki/Decision_tree_pruning" target="_blank">Pruning</a></li><li><a href="https://en.wikipedia.org/wiki/Knowledge_distillation" target="_blank">Distillation</a></li><li><a href="https://grafana.com/" target="_blank">Grafana</a></li><li><a href="https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/" target="_blank">Speculative Decoding</a></li><li><a href="https://groq.com/" target="_blank">Groq</a></li><li><a href="https://www.runpod.io/" target="_blank">Runpod</a></li><li><a href="https://lambdalabs.com/" target="_blank">Lambda Labs</a></li></ul>The intro and outro music is from <a href="https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/" target="_blank">Hitman's Lovesong feat. Paola Graziano</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a>/<a href="https://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA 3.0</a>