Ray on Public-Cloud Kubernetes: experiments, lessons learned, and suggested best practices Cloud Native Rejekts NA (Los Angeles + Hybrid) 2021


2021-10-09 12:15–12:45, Main stage

Ray is an increasingly popular distributed execution framework for scaling applications and leveraging state-of-the-art machine learning libraries. With GPU compute shapes available on public clouds, deploying Ray there is an attractive alternative to deploying it on bespoke on-prem compute resources. This talk explores the suitability of Kubernetes on public cloud as a deployment platform for Ray and shares experiments with Ray deployed on Nodeless Kubernetes, lessons learned, and suggested best practices.

The open source project Ray provides a simple, universal API for building distributed applications, which lets end users parallelize single-machine code with little to no code changes. One of Ray’s strengths is the ability to use multiple machines in the same program. Although Ray can run on a single machine, its real power comes from running on a cluster of machines. For this reason, Kubernetes is a well-suited substrate for executing distributed Ray programs. This talk details experiments running Ray on Kubernetes on public cloud and automating cloud-agnostic cluster compute using Nodeless Kubernetes, and shares lessons learned and suggested best practices.

Madhuri Yechuri

Madhuri is a systems engineer with 19 years of experience in database server technologies (Oracle), virtualization (VMware), and container technologies (ClusterHQ), gained before founding Elotl. Madhuri received her Master's in Computer Science from Indiana University Bloomington and her Bachelor's in Computer Science from the Indian Institute of Technology Kharagpur.

Chi Su

Anne Holler
