Description
Developers and SREs instrument applications and apply observability workflows with metrics, traces, logs, and beyond. The first service level objective (SLO) is defined - now what? Wait for the first production incident?
Think of day-2 ops: SLOs need to be well understood and simulated early in the development process, because false-positive alerts can lead to on-call fatigue. How do you simulate an incident? Add chaos to production and simulate network failures, broken apps, etc. - and validate the SLOs. Developers can add their own chaos experiments too.
Join this talk to learn how SLOs can be shifted left with chaos, and get inspired by new tools and workflows for your production environment.
Transcript

Servus! Today I want to dive into shifting left your SLOs with chaos, along with an SRE tale. We started adopting SLOs - service level agreements, objectives, and indicators - with different aspects of tracking availability, and also defining error budgets. From there it was on to the so-called golden signals defined by Google's SREs: latency, traffic, errors, and saturation as indicators. And at some point, we needed code instrumentation.
So maybe you'd say SRE solved everything? Yeah - maybe, maybe. But let's switch roles a little bit into a developer. You might remember from my talk last year, where I told you: hey, I needed to debug software that crashed a lot, but only in production.
I really would have loved to have the ops requirements met: monitoring the stack and heap memory, defining the SLI as the memory usage level, and the SLO as, say, a maximum 10% increase throughout the development process - in CI/CD, merge requests, and later on in production deployments. I was also thinking of maybe using some sort of chaos engineering, and fuzzing for API requests.
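
As a minimal sketch, such a memory SLO could be expressed as a Prometheus alerting rule. The metric and the one-hour comparison window are my assumptions here, not something prescribed in the talk; process_resident_memory_bytes is the standard client-library metric, so swap in your heap and stack metrics as needed:

    groups:
      - name: memory-slo
        rules:
          - alert: MemorySLOViolated
            # Fire when resident memory grew more than 10% compared to
            # one hour ago (assumed window; adjust to your release cadence).
            expr: |
              (process_resident_memory_bytes / (process_resident_memory_bytes offset 1h)) > 1.10
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Memory usage increased by more than 10% within one hour"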
A
In
2011
I
was
maintaining
the
it
top
level
domain
and
we
were
installing
dns
sec.
We
had
signing
hardware,
we
had
a
state
machine
typically
friday
afternoon
deployment
something
broke,
production
was
broken
over
the
weekend.
We
had
no
more
signing,
no
dns
updates
and
monitoring
was
going
wild.
So
the
thing
is,
monitoring
has
alerts
and
someone
is
getting
paged.
Someone
is
getting
called
so
the
dns
zone
serial
was
out
of
date.
A
The
first
alarm
was
fired
by
email,
nobody
read
it,
then
it
got
into
sms,
you
get
woken
up
and
at
5
am
in
the
morning.
Debugging
is
not
fun.
Today we might use infrastructure as code and GitOps workflows for that, and define the SLI and SLOs for the zone serial age: hey, when it's older than one hour, we need to alert, we need to do something about it. We could also think a little bit about chaos engineering with the name servers - denying zone updates, returning different zone serials - and dive into that. So, when thinking about these stories: how can I really get there?
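
A minimal sketch of that zone serial SLO as a Prometheus alert rule; dns_zone_serial_timestamp_seconds is a hypothetical exporter metric I'm assuming here, not a standard one:

    groups:
      - name: dns-zone-slo
        rules:
          - alert: ZoneSerialTooOld
            # Hypothetical metric: the last-update timestamp behind the
            # zone's SOA serial. Alert when the zone is older than one hour.
            expr: time() - dns_zone_serial_timestamp_seconds > 3600
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "DNS zone serial is older than one hour"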
So what is needed to really adopt SLOs? As an SRE or DevOps engineer, where do you actually start with SLOs, and then with chaos engineering for SLOs? Let's look into monitoring: we have metrics, we have keys or tags with values, and more. One of the things that got me hooked was: okay, let's use Prometheus and PromQL for queries.
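
For example, an availability SLI can be derived from HTTP request metrics. A sketch, assuming the common http_requests_total counter with a status label from typical client-library instrumentation:

    groups:
      - name: sli-recordings
        rules:
          # Ratio of non-5xx requests over the last 5 minutes - a simple
          # availability SLI, recorded for reuse in alerts and dashboards.
          - record: sli:http_availability:ratio_rate5m
            expr: |
              sum(rate(http_requests_total{status!~"5.."}[5m]))
                /
              sum(rate(http_requests_total[5m]))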
We need to add that into the code using app instrumentation. Once we have the metrics, we need to do something with them, and using PromQL for alert rules which trigger on the SLO is one way to do it: you define the allowed errors in the error budget, define when the SLO is violated, and alert. This is basically how we can adopt SLOs. Now, what can we do with an SLO?
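
Continuing the sketch above, an alert rule could fire once the recorded SLI drops below a 99% objective - that is, once errors exceed the allowed budget (the objective value is an assumed example):

    groups:
      - name: slo-alerts
        rules:
          - alert: AvailabilitySLOViolated
            # 99% objective: page when more than 1% of requests fail,
            # i.e. the error budget is being burned.
            expr: sli:http_availability:ratio_rate5m < 0.99
            for: 10m
            labels:
              severity: page
            annotations:
              summary: "Availability SLI below the 99% objective"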
We can add chaos to our staging environments, and also to our production deployments, and see how the application behaves. This brings me to shifting left with chaos. Similar to metrics and SLOs, we need to understand where to go: as an SRE, DevOps engineer, whatever your role title might be, start thinking about cloud native clusters and the deployments inside them.
One example is to use LitmusChaos, which is a Cloud Native Computing Foundation project. It helps you fail your infrastructure and your Kubernetes cluster. For example, you can see how the application behaves - whether it crashes, lags, or does something else.
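
A minimal LitmusChaos sketch: a ChaosEngine that randomly deletes pods of a target deployment. The namespace, label, and service account are placeholders; the pod-delete experiment itself ships with Litmus:

    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: app-chaos
      namespace: default            # placeholder namespace
    spec:
      appinfo:
        appns: default
        applabel: "app=my-app"      # placeholder target label
        appkind: deployment
      engineState: active
      chaosServiceAccount: litmus-admin
      experiments:
        - name: pod-delete
          spec:
            components:
              env:
                - name: TOTAL_CHAOS_DURATION   # seconds of chaos
                  value: "30"
                - name: CHAOS_INTERVAL         # seconds between kills
                  value: "10"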
You can measure the metrics and the SLOs and see if they still match, and from there you can define actions and improvements to iterate on your observability workflows and on other specific things you might be missing.
Now, thinking about the stories I told you before: what is needed to really add chaos to your deployments? As an SRE, maybe add something like a CPU overload, or HTTP requests getting blocked.
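
As a sketch, the CPU overload could reuse the ChaosEngine pattern from above with the pod-cpu-hog experiment that ships with Litmus; the core count and duration are placeholder values:

    # Excerpt: swap the experiments list in the ChaosEngine above.
    experiments:
      - name: pod-cpu-hog
        spec:
          components:
            env:
              - name: CPU_CORES            # cores to stress per target pod
                value: "1"
              - name: TOTAL_CHAOS_DURATION # seconds of induced CPU load
                value: "60"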
A
So
the
golden
signals
become
available
again
as
a
dev,
who's
struggling
with
rest,
apis
and
building
software,
maybe
simulate
something
where
api
clients
are
not
closing
the
connections
correctly
or
something
is
intercepting
the
dns
traffic
or
something
else
for
ops,
think
about
intercepting
dns
traffic,
obviously
or
just
break
it
somehow
and
ensure
that
it
doesn't
resolve
and
see
how
the
deployment,
the
application,
the
kubernetes
cluster
and
everything
else
is
behaving
super
interesting
by
the
way
and
for
devops
think
about
container
registries
not
allowing
to
pull
or
something
else.
So you want to define and measure the SLOs before they are actually deployed. You can also benefit from cloud native environments and the cloud native ecosystem, because deployments and other things happen with auto-scaling in the Kubernetes cluster, for example. There are many projects out there which integrate with each other, and you can learn from their best practices: for example, Kubernetes added Prometheus monitoring and metrics, and you can inspect the source code.
You can look at the pull requests and learn from them, and get into the documentation to see how it's being used. The same goes for OpenTelemetry, LitmusChaos, and so on.
So when you want to shift left with chaos: bring chaos into observability, see how it behaves, verify the SLOs in quality gates, ensure that reliability is there, and iterate and innovate from there.
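
What such a quality gate could look like in CI is, of course, up to your pipeline. A hypothetical sketch in GitLab CI syntax (assumed; the talk doesn't prescribe a tool) which injects chaos and then fails the job when the SLI from the earlier recording rule drops below the objective; PROMETHEUS_URL and the manifest path are placeholders:

    slo-quality-gate:
      image: alpine:3.19
      script:
        - apk add --no-cache curl jq kubectl
        - kubectl apply -f chaos/pod-delete-engine.yaml   # inject chaos
        - sleep 120                                        # let the experiment run
        - |
          SLI=$(curl -s "${PROMETHEUS_URL}/api/v1/query" \
            --data-urlencode 'query=sli:http_availability:ratio_rate5m' \
            | jq -r '.data.result[0].value[1]')
          echo "Availability SLI after chaos: ${SLI}"
          # Gate: fail the pipeline when the SLI is below the 99% objective.
          awk -v sli="$SLI" 'BEGIN { exit (sli < 0.99) ? 1 : 0 }'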
On my wish list: correlate things and use machine learning to maybe make it easy for us; add chaos out of the box and make it accessible for everyone; and ensure that OpenTelemetry gets widely adopted, for example for CI/CD observability.