From YouTube: From Monitoring to Observability: Left Shift your SLOs with Chaos - Michael Friedrich, GitLab
Hello everyone, my name is Michael, I'm a senior developer evangelist at GitLab, and today I want to dive into "From Monitoring to Observability: Left Shift your SLOs with Chaos". Before we dive in: I'm dnsmichi (pronounced dns-mi-chi) on the internet, if you want to reach out or have a chat, and I'm looking forward to actually seeing you in Hamburg at ContainerDays this week.
So this is the virtual talk I'm providing to you today, and I want to start with a little tale, building some bricks and things like that, and for that we want to turn back time a little bit. When we started with monitoring, we had traditional service monitoring, state-based black box monitoring.
There were some service level demands and reporting requirements, so state changes have been tracked over time. We then had metric data points and trends, and at some point we kind of added metrics into everything, making it white box monitoring. One of the things I discovered in the past was that Prometheus actually created this format of having /metrics exposed on an HTTP server, Docker added that natively into the application in 2016, and from there it was not far to saying: hey, a web application should actually have that.
Now, when we are defining SLAs (service level agreements), service level objectives and service level indicators, we need to keep in mind that while the agreement might be 99.5 percent availability, the objective is much higher, something like 99.9 percent, and the indicators are things like errors and latency, something which gives an indication of what is going wrong. In order to really do that, we also need to define error budgets: how much is allowed to be violated while still meeting the service level objectives.
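To make that concrete: a 99.9 percent objective over 30 days leaves an error budget of roughly 43 minutes of unavailability. A minimal sketch of how such an availability SLI could be recorded with a Prometheus recording rule, assuming a hypothetical http_requests_total counter with a status label (this metric is not part of the talk's demo):

```yaml
groups:
  - name: sli-recording
    rules:
      # Availability SLI: share of non-5xx responses over the last 5 minutes.
      - record: job:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
```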
We then iterated into the golden signals defined by the Google SRE handbook, the Google SRE book, which allow you to immediately see that something is going wrong. The four golden signals are latency, traffic, errors and saturation. But it didn't solve everything, because at some point we needed code instrumentation, and SRE didn't really solve everything either, but it was a great way to get started.
Now, speaking a little bit about myself, to bring you to the idea of why we need observability and chaos engineering, I want to start with a developer's tale of my own.
Maybe we can do that in C++ as well, we thought, and we discovered a library using coroutines, working stackless, putting the function pointer on the heap and then using stack unwinding for continuation. This is rather complex, but at some point we figured it out. The thing is, it didn't turn out that good. We had crashes happening in production, but only when there were around a thousand API clients, and the problem was that it was not reproducible on the developers' machines or in test environments. The memory was corrupted, or maybe exhausted.
There were specific crashes, there were stack guards, and the security scanner triggered the crashes, and basically it was a mess. It kind of led to burnout for myself as well. In retrospect, I would have loved to have something like a service level indicator, which could have been the memory usage or memory usage level, and the service level objective should be checked on every commit or merge request being created.
So the idea was simple: we had the signing hardware, there's a state machine of steps verifying things, and then everything gets deployed and DNS is safe. Basically, on a Friday afternoon, the script which was responsible for the deployment was changed and deployed to production, and then, for some reason, no more signing happened, which also meant that no DNS updates were being pushed: the domain delegations, the nameserver updates, the zone records and so on.
A
Well
and
we
we
had
monitoring,
but
let
me
tell
you
well,
we
also
monitored,
for
example,
the
zone
serial
h
and
the
problem
was
there
were,
like,
I
think,
25
or
30
name
servers
alerting
at
the
same
time,
so
the
first
alarm
came
at
on
saturday
at
3
am
via
email.
A
A
Now, we figured out that the change was persisted in Git, which we thought about as a corrective action, but the thing is, it was rolled directly into production, so there was no CI/CD or quality gate in place which could have helped prevent the situation. From there, again thinking about what could have been: maybe having a staging signing hardware where everything can be simulated or tested properly, and all the changes should be rolled out with infrastructure as code or GitOps workflows, with regards to service level objectives and indicators.
Using the zone serial as the indicator, the service level objective is, for example, that the zone serial is not older than one hour. With regards to adding some chaos to the observability ideas: maybe deny zone transfer updates with a proxy in the middle, or just return different zone serials, and then see how everything behaves.
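A minimal sketch of what such an alert rule could look like in Prometheus, assuming a hypothetical gauge that exposes when the zone serial last changed (the metric name is an illustration, not from the talk):

```yaml
groups:
  - name: dns-slo
    rules:
      - alert: ZoneSerialTooOld
        # Hypothetical gauge: unix timestamp of the last zone serial change.
        expr: time() - dns_zone_serial_last_change_timestamp_seconds > 3600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Zone serial has not been updated for more than one hour"
```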
It's super fun, but it doesn't stop with these problems. At some point I switched into a DevOps role, and we had containers, which also brought some problems, because when you're consuming services which are free, at some point they might not be free anymore; there might be some limits. Docker Hub announced rate limits in September 2020, and at that point we didn't know what was affected.
We thought about it: CI/CD pipelines are using containers; cloud native deployments and Kubernetes, for example, are using containers; organizations behind a NAT are effectively using a single IP address for rate limiting; and obviously cloud providers within their networks, within their deployments and everything else. The problem was that we had a known state, so the limits had been applied. There were things like an API: simulating a pull, getting a header response, parsing it, and back then I also wrote a Prometheus exporter for that.
It was pretty efficient how it worked. The only problem was: the monitoring was there, but we also had the unknown state, like Docker pulls in an environment. What happens in a Kubernetes cluster, what happens in the CI/CD pipelines when developers are waiting for the deployment, for fast feedback, for code reviews and so on, with the only option of having 429 Too Many Requests in the logs, maybe?
So this is super hard to debug, and one of the problems we also figured out was that there might be a problem with the deployment itself, because the application had been rolled out to one third of the customers, showing a different price, just because the release deployment failed, because a Docker pull failed in production.
So this could be really expensive, and I thought: maybe we could have treated it differently, with more long-term planning, knowing what's actually going on, and defining a service level indicator like the pull counts remaining, or the limit. The service level objective could have been that the remaining pull count should be more than 10 (which is an arbitrary number; in huge environments maybe 100 or something like that), combined with a quality gate.
We shouldn't even start a deployment when the pull limit is too low, because essentially those who are consuming the deployments and the CI/CD pipelines, like the developers, are blocked, and when nothing is working because of a rate limit, you need to wait for something like six hours; you cannot be productive or get anything done.
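As a sketch, such a quality gate could be backed by a Prometheus alert on the exporter's data; the metric name below is a hypothetical one for a Docker Hub rate-limit exporter, not the exact name from my exporter:

```yaml
groups:
  - name: dockerhub-slo
    rules:
      - alert: DockerHubPullLimitLow
        # Hypothetical gauge: remaining pulls reported by Docker Hub rate-limit headers.
        expr: dockerhub_limit_remaining_requests < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Fewer than 10 Docker Hub pulls remaining; hold deployments until the limit resets"
```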
Now, we've talked a lot about stories, things going wrong and going slow. Using service level objectives is another story, because how do I start, what are my service level objectives, and somehow it relates to monitoring.
Prometheus allows you to monitor things, and it also allows you to query the metrics data which is being stored, do calculations based on that, and also compare it. It's a pretty easy-to-learn format, and we can also use it to define SLOs; in the screenshot there is an example of querying an HTTP server.
Now, in order to really understand what is going on in our environments, we have different metric sources. They can come from the infrastructure, like memory, CPU and I/O, where we have the node exporter on the Kubernetes node, running in a pod inside the cluster. Then we have services which have their own Prometheus exporters, hopefully, and at some point we also need to do instrumentation in the code, adding our own metrics and exposing our own metrics.
This is something to keep in mind for defining SLOs, because when there are no metrics, we cannot define any SLOs. When it comes to describing an SLO, we can use PromQL and alert rules from Prometheus, which is pretty nifty. In order to do that, we calculate something, and then we define the condition which should be met, for example saying: hey, this needs to be actively violated for 15 minutes.
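A minimal sketch of such an alert rule, reusing the hypothetical availability SLI recorded earlier, where the alert only fires after the objective has been violated for 15 minutes:

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: AvailabilitySLOViolated
        # Fires when availability stays below the 99.9% objective for 15 minutes.
        expr: job:http_availability:ratio_rate5m < 0.999
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Availability below 99.9% for 15 minutes"
```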
In the end we can build on that. The thing is, in order to add more metrics, I would advise you to learn with playful examples: create your own Dockerfile, CI/CD, build images, container registries, inspect the Prometheus Operator and the ServiceMonitor custom resource definition, and then start inspecting metrics with Prometheus and, later on, with OpenTelemetry.
Now, my talk is also about left-shifting SLOs. So what does this actually mean? We've talked about Prometheus, we know what service level objectives are, and these can be calculated.
In CI/CD environments, we can actually deploy Prometheus in the staging environment and, for example, use a quality gate with Keptn to measure the SLOs, and when an SLO is violated, the feedback to the CI/CD pipeline is: hey, I'm in a failed state, you shouldn't be deploying, because something is going wrong within the staging deployment. Keptn on its own is an observability platform; it's not tied only to CI/CD quality gates, but one can use it this way.
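As a rough sketch of how such a gate can be expressed, this is roughly the shape of a Keptn slo.yaml; the objective names and thresholds here are illustrative, not taken from the talk's demo:

```yaml
# slo.yaml: evaluated by Keptn's quality gate against SLI queries (e.g. PromQL)
spec_version: "1.0"
comparison:
  compare_with: "single_result"
  number_of_comparison_results: 1
objectives:
  - sli: error_rate          # hypothetical SLI name, mapped to a PromQL query in sli.yaml
    pass:
      - criteria:
          - "<=1"            # pass if the error rate is at most 1%
    warning:
      - criteria:
          - "<=2"
total_score:
  pass: "90%"
  warning: "75%"
```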
Now, with a quality gate, not everything is solved. We also need to ensure that, next to continuous delivery, we kind of simulate a production incident for applications: when DNS is going wrong, what happens? And the idea is to add chaos to the staging environment and also to the production environments.
Now, how can we actually do that, left-shifting with chaos? In order to really do that, let's remember what we said about metrics: we have cloud native, we have clusters, we have deployments, and so-called chaos frameworks have been built in the open source community.
One example, one of the tools, is Litmus Chaos, which allows you to fail your infrastructure or cluster, see how the application behaves, see if the SLO is still matched, and then define your actions and improvements. It provides experiments and workflows out of the box, and it's easy to get started with. Another tool is Chaos Mesh, which we will be looking into in a little bit.
It works kind of the same way: you're failing Kubernetes or hosts, we have chaos experiments which either run once or continuously on a schedule, and failing DNS is always a good idea, because sometimes it really is always DNS. From the usability perspective, Chaos Mesh provides a UI and a CLI.
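For reference, a minimal sketch of a Chaos Mesh DNSChaos experiment, based on how I recall the DNSChaos resource; the domain pattern, namespace and duration are placeholders rather than the exact values from the demo:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-error-demo
spec:
  action: error          # return a DNS error instead of a real answer
  mode: all              # target all matching pods
  patterns:
    - "example.com"      # placeholder domain pattern to fail
  selector:
    namespaces:
      - default          # placeholder target namespace
  duration: "60s"
```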
It has a preview and some scheduling strategies, and I find it pretty straightforward to use. Just to reiterate on all the stories I talked about before, chaos engineering could help this way. For example, as an SRE we can simulate CPU overload, or stress the CPU, and maybe the HTTP requests get blocked, following the golden signals to figure out latency, errors and so on. As a developer, I want to have many, many API clients which are not closing their connections, so something is potentially leaking, and the same thinking applies to an ops role.
On the other side, when you're thinking about your own chaos, there are limits. You should be evaluating the resource usage, because when you're simulating something which leaks memory, at some point maybe you're paying too much. Also define a maintenance window, or a work window, where the chaos engineering happens, so that it doesn't harm existing workflows or teams, and also know that it won't immediately solve all the reliability issues you might be seeing.
But it can help you with new perspectives, combining chaos engineering with SRE, for example. Now to my own chaos, which was, at the beginning, as a developer: we do have this DNS connection problem which leaks memory, so I wrote some C++ code as a demo which basically leaks some memory, but only when the DNS resolution fails or doesn't provide any results. This is a special kind of thing, but it shows what happens when something is going wrong with DNS:
we are leaking memory, and this is something I would never have found out in my dev environment. For the demo system, everything is deployed into a Kubernetes cluster: kube-prometheus and the Prometheus Operator are monitoring things, Chaos Mesh is installed, and Prometheus alerts and SLOs have been defined to actually trigger something. For this demo we will look into this setup; it is currently running in a cloud-managed Kubernetes cluster, and I just want to show you what's currently going on.
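To give an idea of the kind of alert behind this demo, here is a minimal sketch of a memory usage rule on top of kube-prometheus; the pod name pattern and the 100 MiB threshold are placeholders, not the exact values used in the recording:

```yaml
groups:
  - name: demo-memory-slo
    rules:
      - alert: DemoPodMemoryUsageHigh
        # container_memory_working_set_bytes comes from cAdvisor via kube-prometheus;
        # the pod regex and the threshold below are placeholders for the leaking demo pod.
        expr: container_memory_working_set_bytes{pod=~"o11y-demo.*", container!=""} > 100 * 1024 * 1024
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Demo pod memory above 100 MiB, the DNS chaos may be leaking memory"
```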
We can switch into the terminal for a moment: kubectl get pods shows what's currently running, hopefully, yes, and then we say kubectl logs and follow one of the o11y.love demo pods, which is actually just trying to resolve this domain, and yeah.
In order to now see what else could be going on, or what is happening in my Prometheus on my Kubernetes deployment, I've prepared the Prometheus interface and the DNS chaos schedule already, so it's generating some DNS chaos. It generates an error; it doesn't drop the request, the response is an error, so we can fail everything which follows these patterns for these domains, and we want to start this one, because, yeah, we love chaos.
Now that this has been started, we should actually be seeing some errors already, which is great; I just need to scroll a little bit, oops, a little bit up, maybe, not there, yeah, over there it was. At some point the chaos experiment will stop; I think I configured it for one minute or so. And next to seeing that nothing is working, we should also be seeing an increase in memory, actually, yeah.
Actually, we are already leaking some memory, and maybe, to show this a little more in scope, we can see that the memory is going up and up, and then there is a certain limit for this memory where we should be seeing an alert. Actually, let's quickly inspect the Alertmanager over here and refresh: do we have some new alerts which need to be handled? Not yet, but maybe this is coming in fast.
Let's see if something is over here, and you can see that I've been testing before doing this talk recording, yeah, but no, it's not yet triggered. This is the demo problem all the time.
Let's see if we do have some more memory yet: it's consuming much more memory over time, so it's still failing and then continuing. What is going on now? Everything is fine again, so this schedule, the experiment, paused, and then the next experiment will be starting, and at some point we can actually see that memory will be leaked, and hopefully we can see, yeah, we have some alerts being generated, which is pretty accurate. Yeah, it's UTC, it's a little late actually to be doing this talk recording, but anyway.
We can see that we triggered the alert, and potentially we should be seeing the alert over here as well. Yeah, and for some reason it's pretty slow; it's just for testing, so I need to work on the payload and things like that. But this is an iterative process. If we navigate back to our use case, we have kind of proven that the DNS chaos is leaking memory, and we can trigger our SLO, and everything is fine.
What else can we do? This was the screenshot of what I did before. When we're moving from monitoring to observability, we also need to keep in mind that there are not only metrics: there are logs and events, there is distributed tracing, there is continuous profiling. There are so many observability event types which we need to keep track of, and we also potentially need to shift from a monolith to microservices, but you need to stop somewhere and start focusing on metrics and tracing first; everything else can follow.
Okay, traces are a little different to logs: spans with a start and end time, more metadata in the context, and code instrumentation might be needed. We do have OpenTelemetry, which is growing fast. It needs you to bring your own backend, like Jaeger, Elasticsearch, ClickHouse and so on, you can build your own distribution, and it also has auto-instrumentation for certain languages, which is great.
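As a small sketch of what "bring your own backend" means in practice, here is a minimal OpenTelemetry Collector configuration that receives OTLP traces and forwards them to a Jaeger endpoint; the endpoint and exporter choice are assumptions, since Collector exporters vary between versions:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp:
    # Placeholder endpoint; recent Jaeger releases accept OTLP directly on 4317.
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```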
Another idea is using eBPF, maybe combining it with SLOs, so implementing observability on the kernel level could also be an interesting idea; I'm just leaving this idea with you for now. And in the end, we want to shift left.
I want to see the problem, in the code or somewhere else, and for that I would recommend using boring solutions: starting with metrics, monitoring and Prometheus, adding tracing with OpenTelemetry, maybe metrics in OpenTelemetry at some later point, and collecting more observability data to really decide what is going on, to see things. Left-shifting the SLOs means observability needs to be everywhere, so we need to collect everything, we need to learn app instrumentation, and we need to educate everyone.
Team onboarding is super important; integrate that into DevOps workflows, so SLOs and alerts in CI/CD, merge requests, alert channels, incident management, and also make sure that you're using the cloud native benefits. With everything in Kubernetes clusters, we do have Litmus Chaos, we have Chaos Mesh, we have OpenTelemetry, we do have Prometheus, and we have a great community around it.
Have the observability data and chaos engineering out of the box, available for everyone to consume, and combine it with all the OpenTelemetry ideas into use cases. As a recap: we start with app instrumentation, with metrics and traces; we have PromQL and SLOs; and we have quality gates with Keptn and Prometheus.
That's about it. If you want to learn more, dive into o11y.love, which is a knowledge base I've been building in the past months, and I would love to connect with you and talk about everything around observability, chaos engineering and SRE. Thanks for listening, thanks for your attention. If there are any questions, hit me up, either now or at dnsmichi on social media. See you around, and happy observability!