Cloud Native Rejekts EU (Valencia) 2022

Efficient Deep Learning with Ludwig AutoML, Ray, and Nodeless Kubernetes
2022-05-15, 16:55–17:25, Gallery

Deep Learning (DL) has been successfully applied to many fields, including computer vision, natural language, business, and science. The open-source platforms Ray and Ludwig make DL accessible to diverse users, by reducing the complexity barriers to training, scaling, deploying, and serving DL models. However, DL’s cost and operational overhead present significant challenges. The DL model dev/test/tuning cycle requires intermittent use of substantial GPU resources, which cloud vendors are well-positioned to provide, though at non-trivial prices. Given the expense, managing GPU resources judiciously is critical to the practical use of DL. Nodeless Kubernetes commoditizes compute for Kubernetes clusters. It provisions just-in-time right-sized cost-effective compute for a Kubernetes application when the application starts, and terminates the compute when the application terminates. There are no autoscaling knobs to configure/maintain and no compute shape decisions (e.g., on-demand/spot/CaaS) to be made.

This talk describes running Ray and Ludwig on cloud Kubernetes clusters, using Nodeless K8s as a smart cluster provisioner to add right-sized GPU resources to the K8s cluster when they are needed and to remove them when they are not. Experiments comparing the cost and operational overhead of using Nodeless K8s vs using fixed-size Ray clusters running directly on EC2 show sizable improvements in efficiency and usability, reducing elapsed time by 61%, computing cost by 54%, and idle Ray cluster cost by 66%, while retaining the performance quality of the AutoML results and reducing operational complexity.


Ludwig v0.4.1 was recently released, introducing its AutoML capability [Ludwig Medium AutoML blog]. The functionality was developed by meta-learning from the results of thousands of hours of model training across 12 datasets. We previously reported [Cloud Native Rejekts 2021] on an initial proof-of-concept applying Nodeless K8s resource management to this heuristic development workload.

In this talk, we describe applying Nodeless K8s resource management to the workload of validating Ludwig’s AutoML capability across an additional three datasets. Ludwig AutoML utilizes the Ray Tune distributed execution platform to perform data-informed hyperparameter search on GPU-enabled workers. The validation datasets were run using AutoML with 1 hour, 2 hour, and 4 hour Ray Tune time budgets, and the resulting model accuracy performance was compared to high-performing manually-tuned models.