Democratizing LLMs: Efficient LLM Inference with GPUs on Kubernetes
11-05, 17:15–17:20 (US/Central), ROOM 1

Recently, the use of GPUs for LLM training and inference has grown significantly, and Kubernetes is an ideal platform for these workloads. However, GPU instances are considerably more expensive (10x-50x) than general-purpose instances. The challenge is to build a cost-effective Kubernetes cluster tailored for LLM inference workloads at scale.

We found that running inference for a GPT-3-like model on GPUs on Kubernetes costs less than $0.001 per call, and that the deployment can automatically scale up and down using the Kubernetes Horizontal Pod Autoscaler (HPA).
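
As a rough illustration of how a per-call figure like this can be derived, here is a back-of-the-envelope calculation; the instance price, throughput, and utilization numbers below are placeholder assumptions, not figures from the talk.

```python
# Back-of-the-envelope cost-per-call estimate for GPU inference on Kubernetes.
# All inputs are illustrative assumptions; substitute your own cloud pricing
# and measured throughput.

gpu_instance_hourly_usd = 1.2   # assumed on-demand price of a single-GPU node
requests_per_second = 5.0       # assumed sustained inference throughput per node
utilization = 0.6               # assumed fraction of time the node serves traffic

calls_per_hour = requests_per_second * 3600 * utilization
cost_per_call = gpu_instance_hourly_usd / calls_per_hour

print(f"~${cost_per_call:.5f} per call")  # ~$0.00011 with these assumptions
```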

In this session, we will delve into practical cost optimization techniques, including horizontal autoscaling based on GPU and TPS metrics, predicting incoming traffic patterns to accelerate cluster autoscaling, choosing the right instance size for your workload patterns, and improving GPU utilization through time-sharing.
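
As a concrete sketch of what autoscaling on a GPU metric can look like, the snippet below creates an autoscaling/v2 HPA with the Kubernetes Python client. It assumes a custom-metrics pipeline (for example, the NVIDIA DCGM exporter plus prometheus-adapter) already exposes the per-pod DCGM_FI_DEV_GPU_UTIL metric to the HPA; the deployment name, namespace, and thresholds are illustrative, not the talk's actual configuration.

```python
# Sketch: an autoscaling/v2 HPA that scales an LLM inference Deployment on
# average per-pod GPU utilization. Assumes DCGM_FI_DEV_GPU_UTIL is served by
# a custom-metrics adapter; all names and thresholds are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="llm-inference", namespace="llm"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="llm-inference"
        ),
        min_replicas=1,
        max_replicas=10,
        metrics=[
            client.V2MetricSpec(
                type="Pods",
                pods=client.V2PodsMetricSource(
                    metric=client.V2MetricIdentifier(name="DCGM_FI_DEV_GPU_UTIL"),
                    # Scale out when average GPU utilization across pods exceeds ~70%.
                    target=client.V2MetricTarget(type="AverageValue", average_value="70"),
                ),
            )
        ],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="llm", body=hpa
)
```

A second Pods metric (for example, transactions or tokens per second) can be appended to the same metrics list; the HPA then follows whichever signal demands the larger replica count.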


Cost optimization for Kubernetes applications is a perennial challenge, and it is amplified further when workloads start to use GPUs for LLM inference. Our goal is to give developers insights on using existing open-source tools to optimize their LLM workloads on Kubernetes and, more importantly, to introduce innovative cost-saving ideas tailored specifically to LLM workloads. By doing so, we aim to inspire greater adoption of the Kubernetes ecosystem by ML/AI frameworks, leading to improved performance and cost efficiency.

Zihan Jiang is a Senior Software Engineer on the Intuit Kubernetes Service (IKS) team at Intuit. He specializes in developing robust, modern SaaS platforms utilizing a variety of cloud-native projects. Before joining Intuit, he worked at VMware, where he concentrated on constructing enterprise-ready Kubernetes runtime solutions. Zihan earned his Master's degree from Carnegie Mellon University. As an experienced hiker, he has explored over 40 national parks across the United States.
