Opening remarks
Leonardo DiCaprio made it look glamorous, but real-world container escapes are less Hollywood and more chaotic. Still, the parallels are striking. Like Frank Abagnale slipping past the guards at an Atlanta prison, modern attackers escape containers not with brute force but with clever misdirection: exploiting weak isolation, abusing misconfigured permissions, and sidestepping detection.
In this talk, we’ll trace the path of a container breakout—from the initial escape to lateral movement across a Kubernetes cluster. We’ll walk through the attack step by step (yep, there’s a demo), then flip the perspective to show how modern defenses shut it down.
We’ll cover:
- How container escapes actually happen in the wild
- What user namespaces in Kubernetes 1.33 bring to the table
- How to achieve multi-tenancy workload isolation
- How to detect breakout attempts before they go full clusterf*ck
Whether you're a platform engineer, security lead, or just into a good cat-and-mouse chase through the control plane, you’ll leave with real-world tactics for keeping your cluster escape-proof.
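For a taste of the Kubernetes 1.33 defense mentioned above, here is a minimal sketch of a pod opting into a user namespace (the pod name and image are illustrative, not from the talk):

```yaml
# Illustrative only: a pod opting into a user namespace.
# With hostUsers: false, UID 0 inside the container maps to an
# unprivileged UID on the node, blunting many escape techniques.
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo        # hypothetical name
spec:
  hostUsers: false
  containers:
    - name: app
      image: nginx         # any image; nginx is just an example
      securityContext:
        allowPrivilegeEscalation: false
```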
Migrating from a traditional ingress controller to a service mesh-based solution in a live environment with thousands of internal users presents significant challenges. In this session, we share Bloomberg's experience transitioning from NGINX to Istio as the ingress layer for our internal private cloud platform—a managed service supporting application deployments across the firm. We explore the motivations behind this shift, the architectural and operational changes implemented, and the hurdles encountered during the migration process.
Our journey offers practical insights into planning and executing such a migration with minimal disruption, while also highlighting the new capabilities unlocked through Istio. Attendees will benefit from our lessons learned, best practices, and retrospective advice aimed at helping other engineering teams undertake similar transitions with greater confidence and fewer surprises.
As teams move from traditional CI/CD pipelines to GitOps tools like Argo CD, they often hit a common roadblock: how do you manage promotions across dev, staging, and production? Tools like Argo CD and Flux often leave a gap when it comes to multi-stage promotions. What used to be a simple approval click now involves juggling image tags, config changes, and pull requests across multiple repos. This shift often creates confusion, adds manual steps, and breaks the developer workflow.
Tools like 'GitOps Promoter' by Argo offer a promising approach to this problem, but are still in their experimental phase, limiting their readiness for production. Other enterprise solutions offer robust features but come with licensing costs, which can be a barrier for teams.
In this talk, we’ll explore Kargo, a Kubernetes-native OSS tool for automating multi-stage promotions, and compare it with GitOps Promoter. We’ll walk through their design choices, strengths, and tradeoffs with a live demo so users can see how each tool handles this and choose the approach that best fits their GitOps workflow, without ever relying on custom scripts.
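To give a feel for what a Kargo promotion looks like, here is a rough sketch of a Stage resource (field names approximate Kargo's v1alpha1 CRDs and may have changed; the project namespace, Warehouse name, and upstream stage are hypothetical):

```yaml
# Rough sketch of a Kargo promotion stage — check the Kargo docs
# for the current v1alpha1 schema before relying on field names.
apiVersion: kargo.akuity.io/v1alpha1
kind: Stage
metadata:
  name: staging
  namespace: my-project      # hypothetical Kargo project namespace
spec:
  requestedFreight:
    - origin:
        kind: Warehouse
        name: my-warehouse   # hypothetical Warehouse watching an image repo
      sources:
        stages:
          - dev              # freight must pass through dev first
```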
What does operational overhead look like in the era of MLOps? If you're grappling with this question, like many others, and would like a way to apply the paradigm of containers and cloud native to AI workloads — you're in luck.
There is an effort underway to align AI workloads with the knowledge we have of operational excellence in cloud native. The CNCF Sandbox project ModelSpec brings much needed clarity to MLOps workflows. It provides the right abstraction to be able to define how DevOps and cloud native practices can be applied for machine learning operations.
Applying the ModelSpec is the KitOps tool. It helps bridge the gaps that currently exist in the tooling space for MLOps. It creates a "Docker"-like interface for AI workloads and makes it easy and efficient to work with models on Kubernetes (or other container platforms).
In this talk, I aim to bring together the ML operational overhead, how cloud native paradigms can help, the ModelSpec, and KitOps. Together, these will expose an important pain point in productionizing AI in the workplace. Let's eliminate the disconnected ways in which data teams, developers, and operations folks work, using the principles highlighted during this talk.
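As a flavor of that "Docker"-like interface, a minimal Kitfile sketch (field names approximate the KitOps manifest format; all names and paths are illustrative):

```yaml
# Hypothetical Kitfile packaging a model, dataset, and code together
# so the kit CLI can treat them like a single OCI artifact.
manifestVersion: "1.0"
package:
  name: sentiment-model    # illustrative
  version: 1.0.0
model:
  path: ./model.onnx
datasets:
  - name: training
    path: ./data/train.csv
code:
  - path: ./src
```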
The Paranoid's Guide to Deploying Skynet's Interns
So, you've built an AI Agent. Congratulations! It's brilliant, autonomous, and probably a little bit terrifying. While we're all racing to build the next generation of intelligent applications, we're bolting them onto deployment architectures that treat them like any other legacy system, or worse, deploying them blindly without a plan. This is a mistake, and it's going to get weird.
This talk presents a reference deployment architecture for AI Agent applications, starting with a quick primer on their core components: the Agents, the MCP servers, the Tools they access, and the Memory that gives them context. Then, we dive into the deep end of the security nightmare they represent.
We'll explore the messy reality of modern AI deployments:
- A Tangled Web of Trust: Agents and MCPs are exposed to a chaotic mix of tools and services with wildly different levels of trust. How do you keep your high-security internal tool from being manipulated by an agent that just scraped a questionable Reddit thread?
- Persistent Threats: The very nature of an Agent's memory means that attacks and threats can persist and evolve across sessions. A vulnerability exploited today could be a weapon wielded by the agent tomorrow.
- Amplified Supply Chain Risks: Autonomous AI actions turn opaque, previously inaccessible components into active parts of your supply chain. This dramatically increases the attack surface, making vulnerabilities that were once theoretical suddenly very exploitable.
- Compounding Complexity: The introduction of multi-agent communication protocols and centralized MCP servers adds layers of complexity that can obscure risk and reduce control when you need it most.
The core of this talk is a simple, radical recommendation: true, paranoid, and unapologetic isolation at every level of the AI Agent application stack. We'll argue that AI components are dynamic, untrusted supply chains and must be handled with the same (if not more) scrutiny as any other production system.
You will leave this session understanding why segmentation of components by trust level isn't just a good idea, but absolutely vital. We'll show you why you need more control over your MCP servers, not less, and provide a practical, defense-in-depth architecture for deploying AI Agents that won't turn on you.
As observability systems grow more complex, the cognitive load on users increases quickly. This talk presents an approach that could be a game-changer in the future: using AI assistants as intelligent interfaces to your observability stack. By implementing and using MCP (Model Context Protocol) servers, we can transform how observability users interact with metrics, logs, and traces. You will see how teams can query their stack in plain English and use natural language to explore data, debug issues, and even work with configurations.
The session covers both theoretical foundations and practical implementation. It demonstrates how you can integrate AI assistants directly into your day-to-day workflows and provides a comprehensive walkthrough of:
- MCP architecture and how it enables LLMs (Large Language Models) to execute observability tasks
- Setting up and configuring MCP servers (demonstrated with VictoriaMetrics) and integration with popular AI assistants
- Current and planned features of VictoriaMetrics MCP Server
- Real-world use cases: data exploration, query explanation, working with alerting rules, cardinality analysis, intelligent debugging, obtaining context-rich answers to your questions, and more
- Various tips on how to make AI assistants work better with the observability stack
Whether you're an SRE looking to reduce toil, a platform engineer seeking to democratize monitoring access, or a leader evaluating AI's role in operations, this talk provides practical insights and tools for possible transformation of your observability practice.
This approach doesn't replace monitoring expertise at the moment — it amplifies it, making expert knowledge accessible to entire teams and giving you a powerful teammate in the form of an AI assistant.
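For a concrete taste, most MCP-capable assistants are wired up with a small JSON config like the sketch below. The `mcpServers` map is the common client convention; the server command and flags here are hypothetical stand-ins, not the actual VictoriaMetrics MCP Server invocation:

```json
{
  "mcpServers": {
    "victoriametrics": {
      "command": "vm-mcp-server",
      "args": ["--url", "http://localhost:8428"]
    }
  }
}
```

Once registered, the assistant can call the server's tools (queries, alerting-rule lookups, and so on) in response to plain-English questions.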
Developers don’t code eight hours a day. They code one — and fight with TicketOps, Infrastructure dependencies and Security blockers the rest of the time. Many platform teams build Internal Developer Platforms (IDPs) to help, but poor abstraction choices make things worse. In this talk, we’ll share a battle-tested approach to building the right level of abstraction on top of Kubernetes using Score and Kro.
You’ll learn how to go beyond templating, reduce cognitive load, and deliver a developer experience that people actually want to use. We’ll demo how developers can deploy secure, production-grade workloads by just focusing on their applications to bring value to their end users — while the platform handles the hard parts behind the scenes.
This talk isn’t about Kubernetes and GitOps. It’s about empathy. It’s about platforms people adopt, not abandon.
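To make the abstraction concrete, a minimal Score file sketch (the `score.dev/v1b1` schema is real; the service name, image, and resource wiring are illustrative):

```yaml
# Illustrative Score workload: the developer declares intent,
# the platform resolves the database placeholder behind the scenes.
apiVersion: score.dev/v1b1
metadata:
  name: my-service                             # hypothetical
containers:
  web:
    image: ghcr.io/example/my-service:latest   # illustrative image
    variables:
      DB_HOST: ${resources.db.host}            # resolved by the platform
resources:
  db:
    type: postgres
```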
AI agents for Kubernetes automation fail because they're trained on unrealistic, simplified scenarios. Unfortunately, there is a dearth of realistic training data available, as most companies are reticent to publicly share cluster operations data. Moreover, even existing data from, e.g., Google or Alibaba, is not representative of usage patterns seen in smaller organizations. In this talk, we will demonstrate how to use a small “seed” of real, production data from existing Kubernetes clusters to generate a large set of representative, synthetic training data for Kubernetes AI agents. We use graph-theoretic and statistical methods to generate a diverse set of training data covering failure modes, scaling events, resource contention problems, and other common scenarios found in production systems. These techniques, based on research from a team at Harvey Mudd College, allow AI Kubernetes Agents to be trained on high-quality data that is tailored to your company’s production infrastructure.
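As a toy illustration of the statistical half of this idea (not the researchers' actual method), the sketch below expands a small seed of real cluster events into a larger synthetic set by resampling the seed's empirical distributions with jitter:

```python
import random

def synthesize_events(seed_events, n, rng=None):
    """Generate n synthetic events by resampling event types and
    durations from a small seed of real events, with jitter to add
    diversity. A toy stand-in for the talk's statistical methods."""
    rng = rng or random.Random(0)
    types = [e["type"] for e in seed_events]
    durations = [e["duration_s"] for e in seed_events]
    out = []
    for _ in range(n):
        out.append({
            "type": rng.choice(types),
            # jitter observed durations by +/-50% to cover nearby scenarios
            "duration_s": max(1, round(rng.choice(durations) * rng.uniform(0.5, 1.5))),
        })
    return out

# Hypothetical seed: a handful of real incidents observed in a cluster.
seed = [
    {"type": "OOMKilled", "duration_s": 30},
    {"type": "ScaleUp", "duration_s": 120},
    {"type": "NodeNotReady", "duration_s": 300},
]
events = synthesize_events(seed, 1000)
print(len(events))
```

A real pipeline would also model event *sequences* (the graph-theoretic part), not just independent draws.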
Software Bills of Materials (SBOMs) are no longer a nice-to-have; they're quickly becoming table stakes for secure software delivery. But generating SBOMs is just the start. How do you manage them at scale across thousands of artifacts, teams, and environments? How do you ensure they’re accurate, tamper-proof, and usable in real-world pipelines?
We will walk attendees through integrating SBOM generation, storage, and validation into a modern CI/CD workflow using cloud-native tooling, covering:
- Best practices for generating SBOMs for containers
- Securely storing and indexing SBOMs alongside your artifacts
- Validating artifacts against SBOM data before deployment
- Using SBOMs in incident response, compliance, and auditing
The session will give attendees a clear roadmap for making SBOMs a first-class citizen in their pipelines, along with a real-world example of how Cloudsmith integrates CNCF projects like Trivy with OSS projects like CycloneDX, Syft, and Grype for automated SBOM generation.
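The generate-scan-attest flow can be sketched as a CI job like the one below (GitHub Actions-style syntax; the image name, severity threshold, and step wiring are illustrative, not Cloudsmith's actual pipeline):

```yaml
# Hypothetical CI job: generate a CycloneDX SBOM with Syft, gate the
# deploy on Grype findings, then attach the SBOM as a signed attestation.
jobs:
  sbom:
    runs-on: ubuntu-latest
    steps:
      - run: syft registry.example.com/app:1.2.3 -o cyclonedx-json > sbom.cdx.json
      - run: grype sbom:./sbom.cdx.json --fail-on high
      - run: cosign attest --type cyclonedx --predicate sbom.cdx.json registry.example.com/app:1.2.3
```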
Many FOSS project maintainers are operating extensive CI systems to ensure quality, stability, and rapid delivery of their software. Homebrew, the package manager beloved by macOS developers, is one such project. In this session, we’ll dive into the evolution of Homebrew’s CI pipelines for pull request validations, integration testing, and full regression tests for releases.
Each tier of CI and test automation comes with its own unique challenges. With a variety of pull requests coming in across the Homebrew and Workbrew repositories, CI pipelines need to be fast and efficient. While a pull request may look simple on the surface, complexity often arises in the testing phase, as a modification may need to be tested against everything that runs on a particular package. We’ll explore how Homebrew balances scalability and reliability across its CI landscape by utilizing open source virtualization and orchestration technology tailored to developers on macOS.
Drowning in logs from your Kubernetes clusters? Struggling to scale observability without overwhelming your telemetry systems? You're not alone—and there's a better way. In this talk, you’ll learn how to efficiently manage and streamline logging data from source to destination using telemetry pipelines.
We’ll walk through the key stages of a modern telemetry pipeline—collection, parsing, filtering, routing, and forwarding—demonstrating how to build powerful, flexible pipelines that can handle logs from any source to any destination. Along the way, you’ll see a live demo in a real Kubernetes environment, where we’ll deploy your first telemetry pipeline tailored to a real-world use case.
Whether you're debugging production issues, operating multi-tenant clusters, or just trying to cut through the noise, this session will give you the tools and patterns you need to simplify and scale log collection. Plus, you’ll get access to a self-paced, hands-on workshop to continue exploring after the session: o11y-workshops.gitlab.io/workshop-fluentbit.
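The stages above map directly onto a Fluent Bit pipeline; a minimal classic-mode sketch (the log path is the usual Kubernetes convention, while the grep pattern and output are illustrative):

```
# Minimal Fluent Bit pipeline: tail container logs, enrich with
# Kubernetes metadata, drop noisy health-check lines, forward on.
[INPUT]
    Name     tail
    Path     /var/log/containers/*.log
    Tag      kube.*

[FILTER]
    Name     kubernetes
    Match    kube.*

[FILTER]
    Name     grep
    Exclude  log /healthz        # illustrative noise filter

[OUTPUT]
    Name     stdout              # swap for es, loki, etc.
    Match    kube.*
```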
Edera leverages SPIRE for cryptographic attestation of a workload’s environment. We started with a question: how do we prove that workloads are running in an isolated environment? It turns out that this is very similar to the workload identity question already answered by SPIFFE/SPIRE. By integrating SPIRE, Edera’s users are able to prove that workloads are running in a fully isolated Edera zone and get end-to-end encryption between these workloads, allowing for use cases like non-falsifiable build provenance and remote attestation.
In this talk, we will discuss workload identity and the SPIFFE specification, explaining how workload identity enabled us to build a hypervisor-based, verifiable identity system for isolated workloads. We will talk about lessons learned when deploying SPIRE, walk through some of our configuration choices, and give some tips to others looking to use this project.
Kubernetes enables teams to deploy almost any workload without modification, but its boundaries are still defined by namespaces and cgroups. The presence of seven container-escape CVEs from 2022 to 2024 shows these boundaries can be breached. Full VMs or Kata Containers can restore security but suffer from multi-second cold starts and high memory usage, impacting latency-sensitive or densely packed clusters.
In this talk, we will explore a middle ground with Hyperlight, a CNCF virtual-machine monitor that boots micro-VMs, and Nanvix, an open-source Rust microkernel designed to keep guests small yet compatible. This combination allows unmodified Rust, Python, and Wasm services to start up in tens of milliseconds while maintaining VM-class isolation.
We will delve into the architecture, present head-to-head benchmarks, and conduct a live demo. By the end of the session, you will have a clear understanding of the trade-offs and a checklist for implementing micro-VM isolation.
GPU multitenancy in Kubernetes faces significant security challenges when deploying AI workloads on shared infrastructure. Time slicing enables GPU sharing but lacks hardware isolation, risking exposure of sensitive data. NVIDIA Multi-Instance GPU (MIG) provides true hardware isolation with dedicated compute cores, memory slices, and L2 cache partitions, ensuring consistent performance and strict QoS guarantees.
Since the default Kubernetes scheduler cannot partition GPU resources the way it does CPUs, advanced schedulers such as KAI, Volcano, and Kueue can serve as the scheduler for your workloads. They improve GPU sharing through hierarchical queues for secure multi-tenant environments. This talk demonstrates how combining isolation in multi-tenant setups with intelligent scheduling results in optimal utilization, fair resource distribution, and robust security boundaries, guiding the transition from default to GPU-aware scheduling solutions for scalable AI infrastructure.
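For reference, once MIG partitions are exposed by the NVIDIA device plugin (under its mixed strategy), a workload claims a dedicated slice as an extended resource; a sketch with illustrative names:

```yaml
# Illustrative pod requesting one dedicated 1g.5gb MIG partition.
# The exact resource name depends on your MIG geometry and strategy.
apiVersion: v1
kind: Pod
metadata:
  name: trainer                               # hypothetical
spec:
  containers:
    - name: train
      image: nvcr.io/nvidia/pytorch:24.01-py3 # illustrative image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1            # one hardware-isolated slice
```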
When you're managing millions of storage volumes across 13 regions, traditional deployment approaches break down. At DigitalOcean, we transformed our Storage Platform operations using ArgoCD to bring sanity to complexity.
In this talk, we'll share how DigitalOcean's Storage Platform team turned our deployment process into a GitOps-powered engine using ArgoCD. We'll take you behind the scenes of operating our Storage Kubernetes platform, StorK8s, our storage orchestration platform that powers millions of volumes across DigitalOcean's global infrastructure.
You’ll learn:
- How we architected a single ArgoCD instance to manage 13+ clusters across 13 regions while maintaining sub-5-minute deployment times.
- Real-world canary and blue-green deployment patterns for stateful workloads.
- Why centralised GitOps beats federation for our use case (and when you shouldn't follow our lead)
We’ll share what worked, what didn’t, and secret ingredients that helped us scale GitOps reliably.
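The fan-out pattern described above is commonly expressed with an ApplicationSet and the cluster generator; a sketch of the shape (repo URL, paths, and names are hypothetical, not DigitalOcean's actual configuration):

```yaml
# One ApplicationSet in a central Argo CD instance stamps out an
# Application per registered cluster, pinned to a per-region path.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: stork8s-agents           # hypothetical
spec:
  generators:
    - clusters: {}               # one entry per cluster known to Argo CD
  template:
    metadata:
      name: 'agent-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://git.example.com/platform/deploy.git
        targetRevision: main
        path: 'regions/{{name}}' # per-region overlays
      destination:
        server: '{{server}}'
        namespace: storage
```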
What if you could build serverless applications that cold-start in under a millisecond, run anywhere—from your laptop to Kubernetes to the edge—and require no changes to move between environments? This talk introduces Spin, a CNCF open-source WebAssembly (Wasm) developer toolkit designed for performance, portability, and simplicity.
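For context, a Spin application is described by a small manifest; a sketch of a version-2 spin.toml (the application name, route, and component path are illustrative):

```toml
# Hypothetical Spin manifest: one HTTP-triggered Wasm component.
spin_manifest_version = 2

[application]
name = "hello"
version = "0.1.0"

[[trigger.http]]
route = "/..."
component = "hello"

[component.hello]
source = "target/wasm32-wasi/release/hello.wasm"
```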
Kube Resource Orchestrator (kro) has been steadily gaining traction as a Kubernetes-native way to build higher-level abstractions for platform engineering. Kro enables platform teams to create Platform APIs that bundle multiple Kubernetes and cloud resources into a single, self-service interface.
At its core, kro uses a ResourceGraphDefinition to define the components, their dependencies, and how they should be deployed. This eliminates sprawling YAML files, automates ordering, and lets application teams consume infrastructure without wrestling with raw Kubernetes manifests.
In this lightning talk, I’ll show:
- What a Platform API built with kro looks like.
- How it compares to tools like Crossplane compositions and Helm.
- Where kro fits in your platform engineering roadmap.
In just 5 minutes, you’ll see how this approach can make your platform APIs higher-level—and your delivery pipelines faster.
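For a sense of what a kro Platform API looks like on disk, a trimmed ResourceGraphDefinition sketch (following kro's v1alpha1 docs; the WebApp API and its fields are illustrative and may not match the current schema exactly):

```yaml
# Illustrative kro RGD: teams create a tiny WebApp resource,
# kro expands it into the underlying Deployment.
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: webapp
spec:
  schema:
    apiVersion: v1alpha1
    kind: WebApp                # the Platform API teams consume
    spec:
      name: string
      image: string
  resources:
    - id: deployment
      template:
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: ${schema.spec.name}
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: ${schema.spec.name}
          template:
            metadata:
              labels:
                app: ${schema.spec.name}
            spec:
              containers:
                - name: app
                  image: ${schema.spec.image}
```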
Ever wondered how to speed up your Kubernetes container delivery and get results in the blink of an eye? Look no further! In this action-packed 5-minute session, you will experience the magic of Dragonfly, the ultimate tool for accelerating container delivery in Kubernetes that slashes delivery times, boosts efficiency and ensures lightning-fast container distribution across your infrastructure.
Whether you're looking to optimize deployment speed or just curious about how to supercharge your container workflow, this talk is for you.
This talk covers how to:
1. Integrate Dragonfly with Kubernetes, ArgoCD, and other tools from the CNCF landscape for seamless container delivery.
2. Unlock the magic behind Dragonfly’s peer-to-peer container distribution.
3. Learn from real-world examples of using Dragonfly to accelerate deployments in Kubernetes.
Kubernetes Ingress doesn’t scale well in multi-tenant clusters, especially when teams need to share ports or protocols.
In this talk, I’ll show how the experimental XListenerSet in Gateway API solves that.
Using a real use case, I’ll walk through how it lets different teams define their own listeners safely, without stepping on each other.
If you're managing shared clusters and fighting with ingress conflicts, this is five minutes that could save you hours.
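For the curious, a sketch of the shape of the API (XListenerSet is experimental, per GEP-1713; the group, kind, and fields may change, and every name here is hypothetical):

```yaml
# A tenant attaches its own HTTPS listener to a shared Gateway
# without touching the Gateway object itself.
apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XListenerSet
metadata:
  name: team-a-listeners
  namespace: team-a
spec:
  parentRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: shared-gateway        # owned by the platform team
  listeners:
    - name: team-a-https
      protocol: HTTPS
      port: 443
      hostname: team-a.example.com
      tls:
        certificateRefs:
          - name: team-a-cert   # tenant-managed certificate
```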
5 mins talks from attendees - sign up sheet at registration (limited slots available)
Closing out Rejekts NA 2025 in Atlanta, GA