Efficient Deep Learning Inferencing in the Cloud using Kubernetes with Smart Provisioning of Arm Nodes
2022-10-23, 10:20–10:50, Room 2

Deep Learning (DL) models are being successfully applied in a variety of fields. Managing DL inferencing for diverse models presents cost and operational complexity challenges. The resources needed to serve a DL model depend on its architecture, and its prediction load can vary over time, so flexible resource allocation is needed to avoid provisioning permanently for peak load. The cloud makes flexible allocation possible, but it adds operational complexity: minimum-cost resources that match each model's needs must be chosen from a large and ever-evolving set of instance types. Selecting minimum-cost cloud resources is particularly important given the high cost of x86+GPU compute instances, which are often used to serve DL models.

We describe an approach to efficient DL inferencing on cloud Kubernetes (K8s) cluster resources. The approach combines two kinds of right-sizing. The first is right-sizing the inference resources, using the Elotl Luna smart node provisioner to add right-sized compute to cloud K8s clusters when it is needed and remove it when it is not. The second is right-sizing the inference compute type, using cloud Ampere A1 Arm compute with the Ampere Optimized AI library, which can provide a price-performance advantage on DL inferencing relative to GPUs and to other CPUs.
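
As a rough illustration of the first kind of right-sizing, the sketch below picks the cheapest instance type that satisfies a pod's CPU and memory requests. It is not Luna's algorithm or API. The `InstanceType` class, the `cheapest_fit` function, the instance names, and the hourly prices are all hypothetical placeholders invented for this example, and real provisioners also consider factors such as availability, bin-packing, and GPU requirements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    """Hypothetical description of one cloud instance type."""
    name: str
    arch: str          # e.g. "x86_64" or "arm64"
    vcpus: int
    memory_gib: float
    hourly_usd: float  # placeholder price, not a real quote


# Made-up catalog for illustration only; names and prices are assumptions.
CATALOG: List[InstanceType] = [
    InstanceType("example-x86-small", "x86_64", 4, 16, 0.20),
    InstanceType("example-x86-large", "x86_64", 16, 64, 0.80),
    InstanceType("example-arm-small", "arm64", 4, 24, 0.04),
    InstanceType("example-arm-large", "arm64", 16, 96, 0.16),
]


def cheapest_fit(
    catalog: List[InstanceType],
    vcpus: int,
    memory_gib: float,
    arch: Optional[str] = None,
) -> Optional[InstanceType]:
    """Return the lowest-priced instance type that meets the requests.

    Returns None when nothing in the catalog is large enough.
    """
    candidates = [
        it
        for it in catalog
        if it.vcpus >= vcpus
        and it.memory_gib >= memory_gib
        and (arch is None or it.arch == arch)
    ]
    return min(candidates, key=lambda it: it.hourly_usd, default=None)


if __name__ == "__main__":
    # A pod requesting 4 vCPUs and 16 GiB, with no architecture constraint.
    choice = cheapest_fit(CATALOG, vcpus=4, memory_gib=16)
    print(choice.name if choice else "no instance type fits")
```

Passing `arch="arm64"` or `arch="x86_64"` restricts the search to one compute type, which loosely mirrors the second kind of right-sizing: the choice of architecture is made separately from the choice of size.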

We show the benefits of the approach using inference workloads running on auto-scaled TorchServe deployments. For cloud K8s clusters from two vendors, we compare the cost and operational complexity of right-sizing against two common non-right-sized approaches.