2025-03-30, The Waterloo
In G-Research’s ML environment of over 10,000 nodes, we use Cilium as the core networking layer for on-premises, bare-metal clusters scaling to 1,000 nodes each. In this talk, we’ll discuss in detail several of the Cilium features we use:
- Network policy to enforce strict security controls for segmenting and protecting market-sensitive information
- Host firewall to remove the need for external firewall appliances
- High-performance eBPF dataplane that directly improves ML job performance
We’ll also cover what happens when you limit Cilium’s identity labels to reduce policy map pressure, how to tune conntrack garbage collection, and how different policies perform at scale. Attendees will learn how to use Cilium’s built-in tools to observe and measure large deployments, and what to look out for in large Kubernetes clusters.
James is a software engineer specialising in cloud native software, distributed systems, and networking. He's currently working as Principal Customer Success Architect, Isovalent at Cisco, and previously worked at Jetstack as a Staff Solutions Engineer and, before that, as an engineer in fintech.
James is an active contributor to Kubernetes and Cilium. He was a member of the Kubernetes release team from Kubernetes v1.18 through v1.24, culminating in being the Release Team Lead for Kubernetes 1.24 Stargazer 🔭. He also served as the Emeritus Adviser for Kubernetes 1.27 Chill Vibes 🦥.
Luigi is a seasoned Kubernetes Engineer with experience designing and implementing Kubernetes at scale in on-prem environments, with a focus on automation and scalability.