Effortless Inference, Fine-Tuning, and RAG using Kubernetes Operators
11-11, 12:05–12:35 (MST), Theater

Deploying large OSS LLMs in public/private cloud infrastructure is a complex task. Users inevitably face challenges such as managing huge model files, provisioning GPU resources, configuring model runtime engines, and handling troublesome Day 2 operations like model upgrades or performance tuning.

In this talk, we will present Kaito, an open-source Kubernetes AI toolchain operator, which simplifies these workflows by containerizing the LLM inference service as a cloud-native application. With Kaito, model files are included in container images for better version control; new CRDs and operators streamline the process of GPU provisioning and workload lifecycle management; and “preset” configurations ease the effort of configuring the model runtime engine. Kaito also supports model customizations such as LoRA fine-tuning and RAG for prompt crafting.

Overall, Kaito enables users to manage self-owned OSS LLMs in Kubernetes easily and efficiently, whether in the cloud or on-premises Kubernetes clusters.
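
To make the workflow concrete, here is a minimal sketch of what deploying a preset model through Kaito's Workspace custom resource can look like, driven from the official Kubernetes Python client. The exact API group/version, field names, preset name, and GPU SKU below (kaito.sh/v1alpha1, resource, inference, falcon-7b) are assumptions modeled on Kaito's public examples; check the project documentation for the current schema.

```python
# Minimal sketch: create a Kaito Workspace that provisions GPU nodes and
# deploys a preset inference service. Schema details (group/version, field
# names, preset/instance names) are assumptions -- verify against Kaito docs.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

workspace = {
    "apiVersion": "kaito.sh/v1alpha1",   # assumed API version
    "kind": "Workspace",
    "metadata": {"name": "workspace-falcon-7b", "namespace": "default"},
    # Which GPU nodes to provision / select for this workload.
    "resource": {
        "instanceType": "Standard_NC12s_v3",          # assumed GPU SKU
        "labelSelector": {"matchLabels": {"apps": "falcon-7b"}},
    },
    # Preset-based inference: the controller fills in runtime settings for the model.
    "inference": {"preset": {"name": "falcon-7b"}},
}

api.create_namespaced_custom_object(
    group="kaito.sh", version="v1alpha1",
    namespace="default", plural="workspaces", body=workspace,
)
```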


From this talk, the audience can expect to take away the following insights:
Using Container Images: Container images are a practical way to manage LLM model files. They provide clear benefits for version control and for distribution across multiple nodes, and the node's image cache avoids a lengthy re-pull when a pod is recreated on the same node.

Kaito and Karpenter Integration: Kaito integrates with the Karpenter API to implement GPU auto-provisioning. By decoupling LLM workload deployment from GPU resource provisioning, Kaito becomes a vendor-agnostic solution, compatible with any cloud that has a corresponding Karpenter cloud provider installed.

Model Preset Configuration: The model preset configuration takes into account both the provisioned GPU hardware and the model architecture, optimizing GPU resource usage through proper model parallelism. This shows that the tedious task of configuring model runtime settings can be effectively offloaded from users to the controllers.
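
As an illustration of the kind of decision a preset controller has to make (not Kaito's actual logic), the following sketch picks a tensor-parallel degree from the provisioned GPU memory and an estimated model footprint; the headroom factor and power-of-two policy are assumptions.

```python
def choose_tensor_parallelism(model_mem_gib: float, gpu_mem_gib: float, gpus_per_node: int) -> int:
    """Illustrative sketch only: pick the smallest power-of-two number of GPUs
    whose combined memory fits the model weights plus headroom for the KV cache
    and activations. Real presets also account for the model architecture
    (e.g., the attention head count must be divisible by the degree)."""
    headroom = 1.25  # assumed ~25% overhead for KV cache / activations
    required = model_mem_gib * headroom
    degree = 1
    while degree < gpus_per_node and degree * gpu_mem_gib < required:
        degree *= 2
    return degree

# Example: a ~13 GiB fp16 7B model on a node with 2 x 16 GiB GPUs -> degree 2.
print(choose_tensor_parallelism(13.0, 16.0, 2))
```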

Simplified Workflows: Workflows such as LoRA fine-tuning and RAG prompt crafting can be conducted using Kaito while maintaining a simple user experience. Additionally, Kaito can serve as a building block for even more complex workflows.
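
As a sketch of the fine-tuning path, the same Workspace resource can carry a tuning section instead of an inference section. Again, the field names (tuning, method, input, output) follow Kaito's published examples but should be treated as assumptions, and the dataset URL and registry reference are hypothetical placeholders.

```python
# Sketch of a LoRA fine-tuning Workspace (schema and names are assumptions).
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

tuning_workspace = {
    "apiVersion": "kaito.sh/v1alpha1",
    "kind": "Workspace",
    "metadata": {"name": "workspace-tune-falcon-7b", "namespace": "default"},
    "resource": {
        "instanceType": "Standard_NC12s_v3",
        "labelSelector": {"matchLabels": {"apps": "tune-falcon-7b"}},
    },
    "tuning": {
        "preset": {"name": "falcon-7b"},
        "method": "lora",                                                  # LoRA adapter training
        "input": {"urls": ["https://example.com/dataset.parquet"]},        # hypothetical dataset
        "output": {"image": "registry.example.com/falcon-7b-adapter:v1"},  # hypothetical registry
    },
}

api.create_namespaced_custom_object(
    group="kaito.sh", version="v1alpha1",
    namespace="default", plural="workspaces", body=tuning_workspace,
)
```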

Future Improvements: Kaito is still in an early development phase and can be improved in many aspects. Examples include image streaming/acceleration, GPU autoscaling, and advanced inference performance tuning (e.g., enabling KV caching). We are also working closely with Kubernetes working groups such as WG Serving and WG Device Management, aiming for a unified deployment abstraction and best practices for managing serving workloads.

Last but not least, the solution applies to both cloud and on-premises Kubernetes clusters. Ultimately, we want the audience to recognize Kubernetes as a robust platform for effectively and efficiently managing self-hosted OSS LLMs.

Software Engineer at Microsoft AKS specializing in Kubernetes for AI, focusing on deploying large AI/ML models. Previously, I was an ML Engineer at Windsor.io, a YC-backed AI company, up until its acquisition. I am a recent master's graduate of the University of Illinois, where I conducted research in systems for AI. I also served as a head teaching assistant, helping hundreds of students in cloud networking and computer architecture courses.