Ishaan Sehgal

Software Engineer at Microsoft AKS specializing in Kubernetes for AI, with a focus on deploying large AI/ML models. Previously, I worked as an ML Engineer at the YC-backed AI company Windsor.io through its acquisition. I am a recent master's graduate from the University of Illinois, where I conducted research in systems for AI. I also served as a head teaching assistant, helping hundreds of students in cloud networking and computer architecture courses.

Sessions

11-11, 12:05 (30min)
Effortless Inference, Fine-Tuning, and RAG using Kubernetes Operators
Ishaan Sehgal

Deploying large OSS LLMs in public or private cloud infrastructure is a complex task. Users inevitably face challenges such as managing huge model files, provisioning GPU resources, configuring model runtime engines, and handling troublesome Day 2 operations like model upgrades and performance tuning.

In this talk, we will present Kaito, an open-source Kubernetes AI toolchain operator, which simplifies these workflows by containerizing the LLM inference service as a cloud-native application. With Kaito, model files are included in container images for better version control; new CRDs and operators streamline the process of GPU provisioning and workload lifecycle management; and “preset” configurations ease the effort of configuring the model runtime engine. Kaito also supports model customizations such as LoRA fine-tuning and RAG for prompt crafting.
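To illustrate the workflow, a Kaito Workspace manifest declares both the GPU requirement and the model preset in a single custom resource; the operator then provisions nodes and deploys the inference service. This is a sketch assuming the project's v1alpha1 API, and the field names and instance type shown here may differ across Kaito versions:

```yaml
# Sketch of a Kaito Workspace custom resource (assumes the v1alpha1 API).
# One CRD ties together GPU provisioning and the model inference workload.
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  instanceType: "Standard_NC12s_v3"   # GPU VM size the operator should provision
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: "falcon-7b"                 # preset bundles the model image and runtime config
```

Applying this manifest with `kubectl apply` is, in principle, all that is needed: the CRDs and operators described above handle node provisioning, image pulling, and runtime configuration behind the scenes.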

Overall, Kaito enables users to manage self-owned OSS LLMs easily and efficiently, whether in cloud-hosted or on-premises Kubernetes clusters.

Theater