2025-11-08, Theater
AI agents for Kubernetes automation fail because they're trained on unrealistic, simplified scenarios. Unfortunately, realistic training data is scarce, as most companies are reluctant to publicly share cluster operations data. Moreover, even existing data from, e.g., Google or Alibaba, is not representative of usage patterns seen in smaller organizations. In this talk, we will demonstrate how to use a small “seed” of real production data from existing Kubernetes clusters to generate a large set of representative, synthetic training data for Kubernetes AI agents. We use graph-theoretic and statistical methods to generate a diverse set of training data covering failure modes, scaling events, resource contention problems, and other common scenarios found in production systems. These techniques, based on research from a team at Harvey Mudd College, allow AI Kubernetes agents to be trained on high-quality data that is tailored to your company’s production infrastructure.
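To make the seed-to-synthetic idea concrete, here is a minimal sketch under stated assumptions. It is not the method presented in the talk: it swaps in a simple first-order Markov chain, treating event-to-event transitions in the seed traces as a weighted directed graph and sampling random walks from it. The event names (`deploy`, `oom_kill`, and so on) and the helpers `fit_transitions`, `sample_trace`, and `generate` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical seed traces: each is an ordered list of cluster events.
# The event names are illustrative assumptions, not a real schema.
SEED_TRACES = [
    ["deploy", "scale_up", "pod_pending", "node_added", "pod_running"],
    ["deploy", "pod_running", "oom_kill", "pod_restart", "pod_running"],
    ["deploy", "scale_up", "pod_running", "scale_down"],
]


def fit_transitions(seed_traces):
    """Count event-to-event transitions, i.e. edge weights of a directed graph."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in seed_traces:
        for cur, nxt in zip(trace, trace[1:]):
            counts[cur][nxt] += 1
    return counts


def sample_trace(counts, start, max_len, rng):
    """Random walk over the transition graph, weighted by observed counts."""
    trace = [start]
    while len(trace) < max_len:
        successors = counts.get(trace[-1])
        if not successors:
            break  # no outgoing edges were observed for this event
        events = list(successors)
        weights = [successors[e] for e in events]
        trace.append(rng.choices(events, weights=weights, k=1)[0])
    return trace


def generate(seed_traces, n, max_len=8, seed=0):
    """Produce n synthetic traces that only use transitions seen in the seed."""
    rng = random.Random(seed)
    counts = fit_transitions(seed_traces)
    starts = [t[0] for t in seed_traces]
    return [sample_trace(counts, rng.choice(starts), max_len, rng) for _ in range(n)]


synthetic = generate(SEED_TRACES, n=100)
```

A chain like this can only recombine transitions it has already seen, so it illustrates the amplification step (few traces in, many out) rather than the diversity and representativeness that the abstract attributes to the actual techniques.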
The modern AI industry is evolving at a breakneck pace; as new techniques and models become available, the rapid (re)training of AI agents is critical for companies to remain competitive. Moreover, doing this training in a cost-effective manner provides these companies with a longer runway to get their products to market. Lastly, while AI agents trained to manage Kubernetes can often solve simple problems on small clusters, they have thus far failed to work in large, general-purpose clusters like those seen in many companies’ production infrastructure.
In this talk, attendees will learn how to build a custom set of training data for AI Kubernetes agents, tailored to their own infrastructure and based on a relatively small amount of initial data. This capability will enable them to stay competitive in a rapidly changing ecosystem while keeping costs under control. We will also provide attendees with an easy-to-use “sandbox” training environment where agents can interact with the Kubernetes API and observe the effects of these interactions on the training data.
This work was done in collaboration with a team of researchers at Harvey Mudd College, and will additionally benefit the ecosystem by facilitating the flow of knowledge from the academic community into industry.
drmorr is the founder of ACRL, and is a computer scientist, researcher, and software engineer focused on problems in optimization, scheduling, and distributed systems. He received his PhD from the University of Illinois, Urbana-Champaign in 2014, and has over a decade of industry experience (at companies like Airbnb and Yelp) as well as a strong background in academic research. In his spare time he builds Legos, plays board games, and writes fiction.