Building an SLM Platform with Karpenter, Ray Serve and Ollama
Pedro Henrique Oliveira, Matheus Oliveira
In today's enterprise landscape, organizations struggle to deploy AI infrastructure at scale, facing challenges in resource optimization and cost management. This presentation introduces a Small Language Model (SLM) platform that combines Karpenter, Ray Serve and Ollama on Kubernetes to address these challenges. We'll show how to achieve up to 20% cost reduction in GPU utilization through dynamic resource allocation and efficient workload distribution. A unified management layer simplifies model versioning and monitoring while handling concurrent model deployments and demand spikes, ensuring consistent performance with built-in audit capabilities for compliance.
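The dynamic GPU allocation described above can be sketched with a Karpenter NodePool that provisions GPU nodes only when SLM pods are pending and consolidates them when idle. This is an illustrative sketch, not the speakers' actual configuration; the pool name, GPU limit, and instance-category values are assumptions for an AWS-based cluster.

```yaml
# Hypothetical NodePool for the SLM platform (Karpenter v1 API).
# Names and limits are illustrative, not from the presentation.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: slm-gpu
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]                  # GPU instance families (e.g. g5, g6)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # prefer Spot capacity to cut GPU cost
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule             # only GPU workloads tolerate this taint
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    nvidia.com/gpu: "8"                  # cap total GPUs the pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m                 # scale GPU nodes back down when idle
```

Under this kind of policy, GPU nodes exist only while Ray Serve or Ollama pods demand them, which is one plausible source of the cost savings the abstract cites.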