How We Made Our Availability Metrics More Meaningful With eBPF
Getting availability metrics is easy: probe your service, or calculate the ratio of failed to successful requests. These approaches are fine, but they don't necessarily reflect what users actually experience, and user experience is exactly what we want our metrics to represent. Inspired by Google's meaningful availability paper, Microsoft and SAP collaboratively implemented and open-sourced the connectivity-monitor project to do just that. It exposes meaningful availability metrics for the API server endpoints of managed Kubernetes clusters. Leveraging the power of eBPF, we capture the relevant network traffic, parse the SNI from the TLS handshake to identify which Kubernetes cluster is being connected to, and assess the encrypted TCP connection to determine whether it succeeded or failed. The connectivity exporter "annotates" time: it marks each second as failed or successful and exposes the totals as counter metrics for Prometheus to scrape, without losing the 1s granularity. All of this comes with minimal overhead, thanks to eBPF!
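
To make the idea of "annotating" time more concrete, here is a minimal Python sketch of how per-second failed/successful counters could be derived from observed connection outcomes. It is not the connectivity-monitor implementation. The `ConnEvent` shape, the `annotate_seconds` function, and the classification rule (a second counts as failed if at least one connection attempt in it failed, as successful if it saw traffic and no failures, and is not counted at all if it saw no traffic) are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnEvent:
    """One observed connection attempt (hypothetical event shape)."""
    timestamp: float  # Unix time at which the outcome was observed
    sni: str          # server name taken from the TLS handshake
    succeeded: bool   # True if the connection was judged successful


def annotate_seconds(events):
    """Classify every (sni, second) that saw traffic as failed or successful.

    Assumed rule: one failed attempt is enough to mark the whole second
    as failed. Returns {sni: {"failed": n, "successful": m}}.
    """
    # For each (sni, whole second), remember whether any attempt failed.
    any_failure = {}
    for ev in events:
        key = (ev.sni, int(ev.timestamp))
        any_failure[key] = any_failure.get(key, False) or not ev.succeeded

    # Turn the per-second verdicts into monotonically growing totals,
    # which is the shape a Prometheus counter expects.
    counters = defaultdict(lambda: {"failed": 0, "successful": 0})
    for (sni, _second), failed in any_failure.items():
        counters[sni]["failed" if failed else "successful"] += 1
    return dict(counters)
```

Because each second is classified before it is added to a counter, a scraper that polls only every 30s or 60s still sees exactly how many individual seconds failed in between. In a real exporter the totals would be kept across calls and exposed as counter metrics labeled by SNI, rather than recomputed from a list of events.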