How Thanos Almost Snapped $100,000 from our Infra Budget - Cloud Native Rejekts NA (Chicago) 2023

How Thanos Almost Snapped $100,000 from our Infra Budget

2023-11-05 10:40–11:10, ROOM 1

In a galaxy not so far away, where data is as vast as the cosmos, our team was troubled with observability data chaos.
Seeking some clarity, we sought salvation with Thanos and Fluentbit – fabled titans against our metric storage and logging issues.
Thanos empowered us with a Prometheus setup with high availability and virtually infinite historical data storage. Prometheus ascended to new heights, flawlessly scaling horizontally while Thanos Compactor's downsampling abilities promised faster results for querying older data.
Fluentbit made collecting, filtering, and outputting logs across multiple sources and destinations effortless.

But little did we know that even the most powerful tools, when not wielded correctly, could be double-edged Infinity Stones.

Join us on a thrilling tale of blunders as we recount some missteps in configuring these tools, easily missed caveats in data downsampling and log storage, and how the pursuit of seamless data handling almost cost us over $100,000.

Thanos and Fluentbit are among the most widely trusted, recommended and deployed projects in the Cloud Native space. These tools are robust 'batteries included' solutions that have made monitoring, storing and handling logs and metrics much simpler for countless organisations. However, for teams new to observability and the cloud, some of the best practices and suggested configurations may not be obvious, and they may have to burn their fingers on runaway cloud costs and lost data before these caveats become apparent.
With this talk, we aim to briefly cover why Thanos and Fluentbit are the best solutions for modern metric and logging problems and, using our firsthand mistakes as an example, illustrate how configuration and setup errors in these tools, combined with inadequate systems for detecting those errors, can cause astronomically high costs.

Ankur Rawal

CTO @ Zenduty, Reliability Advocate Everywhere Else.

Helping fast-moving orgs around the world minimise business-impacting downtime.
Love talking about observability and reliability at incremental scale and novel use cases for modern tech innovations.
Outside of work, you can find me on road trips, discovering new cuisines and photographing wildlife.

Vishwa Krishnakumar

Vishwa Krishnakumar is a co-founder of Zenduty, where he manages the engineering and product functions and helps customers implement scalable major incident response processes and site reliability engineering best practices.

He has over 14 years of experience in software engineering and architecture and follows the latest in APIs, networking, cloud architecture and site reliability.

Shubham Srivastava

Leading Developer Relations at Zenduty - an advanced incident management and response orchestration platform.
Take pride in making mistakes, learning from them and advocating for best practices for orgs setting up their DevOps, SRE and Production Engineering teams.

A zealous and eternally curious professional, fascinated by stories from DevOps, Incident Management and Product Design. An orator, gamer, writer, and hopeful comedian trying his very best to do something worth remembering every day.

Deepak Kumar

Senior Cloud Infrastructure and DevOps Engineer at Zenduty
