Anne has an ongoing interest in the intersection of resource efficiency and artificial intelligence.
She worked on Uber's Michelangelo Machine Learning platform, on the management stack for VeloCloud's
SD-WAN product, on VMware's Distributed Resource Schedulers for server and storage infrastructure,
on performance analysis for VMware's hypervisor and hosted products, on Omnishift's transparent
application and data delivery over the web to the desktop, on Transmeta's Crusoe processor performance
and power, and on Hewlett-Packard's low-level compiler optimizer. She received bachelor's and master's
degrees from Duke University and a doctorate from the University of Virginia, all in Computer Science.
Deep Learning (DL) models are being applied successfully in a variety of fields, but managing DL inferencing for diverse models presents cost and operational complexity challenges. The resources needed to serve a DL model depend on its architecture, and its prediction load can vary over time, so resources must be allocated flexibly to avoid provisioning permanently for peak load. Allocating resources flexibly in the cloud adds operational complexity of its own: minimum-cost resources that match each model's needs must be chosen from large and ever-evolving sets of instance types. Selecting minimum-cost cloud resources is particularly important given the high cost of the x86+GPU compute instances that are often used to serve DL models.
We describe an approach to efficient DL inferencing on cloud Kubernetes (K8s) cluster resources. The approach combines two kinds of right-sizing. The first is right-sizing the inference resources, using the Elotl Luna smart node provisioner to add right-sized compute to cloud K8s clusters when it is needed and remove it when it is not. The second is right-sizing the inference compute type, using cloud Ampere A1 Arm compute with the Ampere Optimized AI library, which can provide a price-performance advantage on DL inferencing relative to GPUs and to other CPUs.
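As a rough illustration of what choosing minimum-cost resources that match a model's needs can look like, the Python sketch below picks the cheapest instance type that satisfies a workload's CPU, memory, GPU, and architecture requirements. It is not Luna's algorithm or API. The `InstanceType` class, the `cheapest_fit` function, the instance names, and the hourly prices are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class InstanceType:
    """A hypothetical cloud instance type (not a real vendor offering)."""
    name: str
    arch: str          # e.g. "x86_64" or "arm64"
    vcpus: int
    memory_gib: float
    gpus: int
    hourly_usd: float  # made-up price, for illustration only


def cheapest_fit(
    catalog: Iterable[InstanceType],
    vcpus: int,
    memory_gib: float,
    gpus: int = 0,
    arch: Optional[str] = None,
) -> Optional[InstanceType]:
    """Return the lowest-priced instance type meeting the request, or None."""
    candidates = [
        it
        for it in catalog
        if it.vcpus >= vcpus
        and it.memory_gib >= memory_gib
        and it.gpus >= gpus
        and (arch is None or it.arch == arch)
    ]
    return min(candidates, key=lambda it: it.hourly_usd, default=None)


# Hypothetical catalog; names and prices are invented for this sketch.
CATALOG = [
    InstanceType("hypo-x86-small", "x86_64", 4, 16, 0, 0.20),
    InstanceType("hypo-x86-gpu", "x86_64", 8, 32, 1, 1.50),
    InstanceType("hypo-arm-small", "arm64", 4, 24, 0, 0.10),
    InstanceType("hypo-arm-large", "arm64", 16, 96, 0, 0.40),
]

if __name__ == "__main__":
    # A CPU-only inference pod requesting 4 vCPUs and 16 GiB of memory.
    choice = cheapest_fit(CATALOG, vcpus=4, memory_gib=16)
    print(choice.name if choice else "no matching instance type")
```

A real provisioner would also have to account for factors this sketch ignores, such as instance availability, bin-packing of multiple pods per node, and node startup time.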
We show the benefits of the approach using inference workloads running on auto-scaled TorchServe deployments. For cloud K8s clusters from two vendors, we compare the cost and operational complexity of right-sizing against two common non-right-sized approaches.