Zihan Jiang

Zihan Jiang is a Senior Software Engineer on the Intuit Kubernetes Service (IKS) team at Intuit. He specializes in developing robust, modern SaaS platforms utilizing a variety of cloud-native projects. Before joining Intuit, he worked at VMware, where he concentrated on constructing enterprise-ready Kubernetes runtime solutions. Zihan earned his Master's degree from Carnegie Mellon University. As an experienced hiker, he has explored over 40 national parks across the United States.


Multidimensional Pod Autoscaling with Confidence
Zihan Jiang, Navin Jammula

Application pod autoscaling is a challenging problem. The solutions currently available work well individually but not as a whole, and they may not cover all use cases (e.g., peak-hour traffic, weekday/weekend differences, month-end and year-end spikes). Current solutions handle real-time scaling based on short-term analysis, but not on both short-term and long-term analysis.

We built a comprehensive solution (a.k.a. the Global Updater) that addresses most of these use cases with multiple components, covering both real-time and long-term vertical sizing (pod size) and horizontal sizing (number of pods).

The Global Updater analyzes vertical-size recommendations from VPA (open source) and an internal pod-size recommender, along with horizontal recommendations from HPA (open source) and an internal replica recommender. It then generates a fine-tune file containing both vertical and horizontal sizes, applying multiple checks and safeguards to ensure the reliability of the service is not impacted.
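The merge-and-safeguard step could look roughly like the sketch below. The function name, input fields, and the step-size cap are hypothetical illustrations, not the actual Global Updater implementation:

```python
import math

def merge_recommendations(vpa_cpu, internal_cpu, hpa_replicas,
                          internal_replicas, current_replicas,
                          max_change_pct=0.5):
    """Combine vertical and horizontal recommendations into one
    fine-tune result, with a safeguard that caps how far a single
    update can move from the current replica count.
    All names and thresholds here are illustrative."""
    # Take the more conservative (larger) vertical recommendation.
    cpu = max(vpa_cpu, internal_cpu)
    # Take the larger horizontal recommendation, then cap the step size
    # so one update cannot swing replicas by more than max_change_pct.
    replicas = max(hpa_replicas, internal_replicas)
    cap = math.ceil(current_replicas * (1 + max_change_pct))
    floor = max(1, math.floor(current_replicas * (1 - max_change_pct)))
    replicas = min(max(replicas, floor), cap)
    return {"cpu_millicores": cpu, "replicas": replicas}
```

The "take the max, then clamp the step" shape is one simple way to reconcile competing recommenders while bounding blast radius; the real safeguards in the talk may differ.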

The Global Updater runs remotely and is designed to work with multiple services across clusters, and with different environments per service, generating recommendations for a service based on metrics from multiple environments (pre-prod and prod).

This solution scales applications both vertically and horizontally without human intervention, saving time and cost while improving application reliability.

Benefits to the ecosystem:
We plan to open source the solution we built, as it will benefit not just us but also the community and companies facing similar problems. We will keep enhancing the product and rolling out changes. This talk will help the audience understand the innovative approaches we have taken to solve this complex problem.

Democratizing LLMs: Efficient LLM Inference with GPUs on Kubernetes
Zihan Jiang

Recently, the use of GPUs for LLM training and inference has grown significantly. Kubernetes is an ideal platform for these workloads. However, GPU instances are considerably more expensive (10x-50x) than general-purpose instances. The challenge is to build a cost-effective Kubernetes cluster tailored for LLM inference workloads at scale.

We found that serving a GPT-3-class model with GPUs on Kubernetes costs less than $0.001 per call, and the deployment can automatically scale up and down using the Kubernetes HPA.
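As a rough illustration of how a per-call cost like that works out, consider the back-of-the-envelope arithmetic below. The instance price and throughput are assumed round numbers for the sake of the calculation, not measured figures from our deployment:

```python
# Illustrative per-call cost for a GPU-backed inference service.
# Assumed inputs: a single-GPU cloud instance at ~$1.00/hour
# sustaining ~0.5 requests/second. Neither figure is from the talk.
instance_cost_per_hour = 1.00   # USD/hour, assumed
requests_per_second = 0.5       # sustained throughput, assumed

calls_per_hour = requests_per_second * 3600
cost_per_call = instance_cost_per_hour / calls_per_hour
print(f"${cost_per_call:.5f} per call")  # well under $0.001
```

Under these assumptions the cost lands around $0.00056 per call; keeping utilization high (via autoscaling and GPU time-sharing) is what keeps the denominator large.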

In this session, we will delve into valuable cost-optimization techniques, including horizontal autoscaling based on GPU and TPS metrics, predicting incoming traffic patterns to accelerate cluster autoscaling, choosing the right instance size based on workload patterns, and improving GPU utilization through time-sharing.
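The horizontal-autoscaling piece builds on the standard Kubernetes HPA scaling rule, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). A minimal sketch of that rule applied to an average GPU-utilization metric (the 70% target is an assumed example, not a recommendation from the talk):

```python
import math

def desired_replicas(current_replicas, current_gpu_util, target_gpu_util=0.7):
    """Standard Kubernetes HPA formula applied to a GPU-utilization
    metric: scale replicas proportionally to how far the observed
    average sits from the target. The 70% target is an assumed example."""
    return math.ceil(current_replicas * current_gpu_util / target_gpu_util)

# E.g. 4 replicas averaging 90% GPU utilization against a 70% target
# scale out; averaging 35% scales in.
print(desired_replicas(4, 0.90))  # scales out
print(desired_replicas(4, 0.35))  # scales in
```

In practice the HPA also applies tolerance bands and stabilization windows before acting, so a real controller would not flap on small metric changes the way this bare formula might.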