Observing Enterprise Kubernetes Clusters At Scale
2019-05-18, 15:40–16:10, Main Hall

Observing Kubernetes clusters at scale can be challenging. While most companies operate a small number of Kubernetes clusters, Giant Swarm is responsible for hundreds. This scale makes maintaining a responsible level of observability harder.

We aim to present our observability journey, particularly with Prometheus.

The talk covers our past architectural choices, such as building tooling to manage Prometheus for on-demand Kubernetes clusters; our current usage and the drawbacks we’d like to address; and our plans for the future, such as horizontal scaling and Cortex.

We will also cover our continuous improvement process, which uses post mortems and continuous delivery to let us evolve our metrics, exporters, and alerting as we discover blind spots.

This talk presents what we have learned from handling observability at scale, with in-depth examples from our infrastructure.

Giant Swarm operates Kubernetes clusters for enterprises, offering a control plane for Kubernetes as a product, along with 24/7 support for all managed Kubernetes clusters. As part of this offering, a high degree of observability and monitoring is necessary to respond to and debug operational issues.

In this talk, we will present the learnings from our observability journey to the community, with a focus on how we’ve progressed with Prometheus. This will detail our path to our current usage of Prometheus, our current architecture and drawbacks, and our plans for the future.

As part of this discussion, we’ll cover our architectural choices for monitoring our Kubernetes management platform, before moving on to our current setup and the areas we’d like to improve. On the improvement side, we’ll go into our plans to horizontally scale our Prometheus setup and to integrate with Cortex.

We will also dive into our use of post mortems and continuous delivery to provide continuous improvement. This will touch on some related aspects, such as how our alerting has evolved over time, and how we have implemented new Prometheus exporters to improve the observability of Kubernetes, such as through exposing network and DNS metrics.

Our main aim with this talk is to present how we’re managing observability in a holistic sense. We also believe that the community would be enriched by our learnings, as they can see which of our decisions have had positive outcomes (and which ones haven’t!).