A
My name is Michael, and I go by dnsmichi, which is a little hard to pronounce in English: it's DNS, M-I-C-H-I. I got that nickname about 10 years ago when I was working in Vienna at the university, and I couldn't get rid of it after a while.
A
But it's not about me today. What should you expect in the next couple of minutes? Some stories from an open source monitoring maintainer, which I was in the past; diving into metrics with Prometheus, Grafana and kube-prometheus; some things about alerts and service level objectives; and then we're diving into chaos with Kubernetes, talking a little bit about DNS again, chaos, tracing, and some ideas and stories around observability and beyond, including OpenTelemetry. And obviously you might find some Lego images between the slides, so you might catch them all. I want to start with some stories on how we approach this.
A
We have Kubernetes up and running. There is the architecture, there are many components, there are many names to understand: nodes, pods, containers, deployments, services, APIs, ports, data sources. At some point my knowledge ends, and the question was: how could I monitor that?
A
What is important to me? Someone said we need monitoring, and I'm saying: okay, maybe availability monitoring or something like that, or some performance and resource monitoring, and we want to identify slow or blocking deployments. Now, the classic host and service model with state-based polling doesn't really apply to microservices, so we might be looking into metrics and logs even more. Do we need to understand all the components which are running to really feel that everything is working? Maybe, maybe not. What are the best practices? It can get overwhelming.
A
So we really need to figure out what is important now and what can be done later on. For the first iteration: Kubernetes has many different data sources where we can use service discovery, and within the CNCF ecosystem and the wider community I found Prometheus. It scrapes metrics endpoints, it has a time series database, you can calculate trends, you have dashboards and service level objectives, and from there it's easy to go to alerts and incidents.
A
A little bit about Prometheus itself: it's a huge picture. The most important thing is that we have the Prometheus server, and it allows us to query Kubernetes by service discovery and to monitor certain things.
A
You can play kind of Lego and take many things on top, but we want to focus really on Prometheus. Also keep in mind that it has a strong, feature-rich query language which allows us to calculate the metrics and the data we want to use. We can present it, we can consume it using an API, and the format is, I wouldn't say self-explanatory, but it's straightforward to really get insight into your monitoring pretty fast.
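To give a feel for it, here is a minimal Prometheus rule file in the standard YAML rule format; the node_cpu_seconds_total metric is a common node exporter metric used here as an assumption, not something taken from the slides:

```yaml
# rules.yml - a minimal recording rule sketch, assuming node_cpu_seconds_total
# is scraped (node exporter). Loaded via rule_files in prometheus.yml.
groups:
  - name: example-recording-rules
    rules:
      # Average CPU utilisation per instance over the last 5 minutes.
      - record: instance:cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```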
A
Now, from the UI side, you might know that there are Grafana dashboards, but Prometheus also has a UI, most recently pretty much improved. One of the other things I recently found was Perses, which is in development at the moment at the CNCF: it is dashboards as code and could help automate certain things even more.
A
If you like Kubernetes and Prometheus: the Prometheus Operator is a feature-rich operator, and it also provides kube-prometheus on top, which allows you to deploy Prometheus, the node exporter, the Alertmanager, Grafana, and also certain best practices around alerts and dashboards. This is awesome, because you can immediately see something; it's not just that monitoring is deployed and it's empty.
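For rough orientation, the central piece the operator manages is a Prometheus custom resource. A minimal sketch, assuming the monitoring.coreos.com/v1 API and a pre-existing service account with scrape permissions:

```yaml
# A minimal Prometheus custom resource for the Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus-k8s   # assumed to exist with scrape RBAC
  # Pick up every ServiceMonitor in every namespace (empty selector = all).
  serviceMonitorSelector: {}
  serviceMonitorNamespaceSelector: {}
  resources:
    requests:
      memory: 400Mi
```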
A
Now, what exactly will we be seeing within the kube-prometheus deployment? Lots of metrics. We do have custom metrics for the node status, for the resource usage, for the deployments, the number of pods, the network, and even more. At some point it has gotten quite overwhelming looking at all the dashboards, but it's great to have them so you can analyze them.
A
The other thing to mention is kube-state-metrics, which is also a project that gets automatically deployed, to figure out the health of the deployments, nodes and pods. Many different things are being abstracted, so you don't need to reinvent the wheel for the monitoring; you can just use that, and it's installed by default.
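To make that concrete, here is a minimal alert rule built on two real kube-state-metrics series (kube_deployment_spec_replicas and kube_deployment_status_replicas_available), written in the plain Prometheus rule file format; the duration and labels are illustrative assumptions:

```yaml
groups:
  - name: kube-state-metrics-examples
    rules:
      # Fire when a deployment has fewer available replicas than desired
      # for 15 minutes. Threshold, severity and wording are assumptions.
      - alert: DeploymentReplicasMismatch
        expr: |
          kube_deployment_spec_replicas
            != kube_deployment_status_replicas_available
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.deployment }} has missing replicas"
```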
A
Now, when we think of defining service level objectives, or SLOs, and alerts, we need to figure out: well, metrics are nice, but what's next? We want to define the definition of failure, and we want to do something because maybe a threshold has been violated or a rule has been matched, so we want to notify and raise an alert. And who gets notified? Potentially everyone? No.
A
It should be a responsible team and identified personas. Also, it shouldn't just be an alert where you say 'yes, nice', acknowledge it and go away; you want to really act on it and provide documentation for incidents, runbooks to fix problems and analyze things, and maybe even define corrective actions within the incident. The other thing to keep in mind is to iterate on every incident and everything which happens, like reducing the mean time to respond, or to resolve, depending on how you want to abbreviate it.
A
When we think of alerts with Prometheus, there are two parts: Prometheus alert rules, which are sent to the Alertmanager, and the Alertmanager itself, which allows you to do some grouping, some inhibition and some silencing, which means kind of acknowledging things for a given time. Then you can send the alerts to specific endpoints: APIs, classic email, or pagers, or other ways to notify about things.
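Since grouping, inhibition and silencing carry most of the weight here, a minimal Alertmanager configuration sketch may help; route and receivers follow the Alertmanager config schema (matcher-style inhibit rules need a reasonably recent Alertmanager), while the webhook URL and timings are assumptions:

```yaml
# alertmanager.yml - minimal sketch
route:
  # Collapse alerts that share these labels into one notification.
  group_by: ['alertname', 'namespace']
  group_wait: 30s        # wait before sending the first notification
  group_interval: 5m     # wait before sending updates for a group
  repeat_interval: 4h    # re-notify for still-firing alerts
  receiver: team-webhook
receivers:
  - name: team-webhook
    webhook_configs:
      - url: https://chat.example.com/hooks/alerts   # assumed endpoint
inhibit_rules:
  # Mute warnings while a critical alert with the same alertname fires.
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['alertname']
```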
A
Now, for Kubernetes, there are some alerts defined by default by the Prometheus Operator and kube-prometheus. I've also found a website called Awesome Prometheus alerts (awesome-prometheus-alerts.grep.to), which has many, many more best practices, and you can easily copy and paste them. You should inspect what you really need, but it can be helpful to define infrastructure alerts for memory, CPU and even disk pressure, pod error rates, and also reachability, for example.
A
This is a lot to understand, and the way we can integrate it into the Prometheus Operator is through the so-called PrometheusRule custom resource definition, which allows you to wrap the Prometheus alert rule into the Kubernetes YAML format. This can be helpful to deploy everything in one format and not have five different ways to configure things with your Prometheus.
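A sketch of such a PrometheusRule; the 10 MB container memory threshold anticipates the demo later in this talk, while the names, labels and namespace are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dns-demo-rules
  namespace: monitoring
  labels:
    prometheus: k8s        # must match the operator's ruleSelector (assumed)
    role: alert-rules
spec:
  groups:
    - name: dns-demo
      rules:
        - alert: ContainerMemoryLeak
          # Working-set memory above 10 MiB for 2 minutes; the container
          # name is a placeholder for the demo application.
          expr: |
            container_memory_working_set_bytes{container="dns-demo"}
              > 10 * 1024 * 1024
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Possible memory leak in {{ $labels.pod }}"
```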
A
Now, for the alert receivers: a chat can be something like 'I want to immediately see something', but oftentimes it's really needed to refine things and think about grouping, so that you don't have 500 alerts flooding your inbox at 3am in the morning, where you have no idea what's going on but still need to fix the problem because the customer is calling.
A
Maybe you want to reuse external ticket and issue systems to have a way to consolidate this for everyone, maybe even mailing lists. The other thing to think about is that you might get tired of too many alerts; it's called alert fatigue, I think. This is something which can lead to burnout after a while.
A
So it's really important to think about what should actually be alerted on, and not just be a notification number in your inbox that says 20K unread or something. And the thing is: how do we get these alerts? By default everything is working, and oftentimes you don't see anything. We can trigger alerts manually by killing a pod or doing something, but manual work is a pain, so there are probably better ways to cause some chaos in the Kubernetes monitoring.
A
You might be using Kube DOOM to kill some pods in a fun way. I'm not sure if that is allowed at work, but it's still an interesting project to try out and see whether the alerts are working or not.
A
I can break DNS and see how everything stops working, but in the end you want to add some sort of automated chaos, on a schedule, on a cycle, in a way that you can use it in a staging environment but also in a production deployment: testing in production. The idea is to really trigger alerts and service level objectives, to see that everything is working as expected, and you can react on that. Now, this brings me to observability and chaos engineering.
A
You can do it the boring way, like the German 'Chaos', or you look into cloud native deployments: chaos frameworks, chaos experiments and so-called instrumentation SDKs. I still like the German 'Chaos', but yeah. The idea is really to use a framework, and one of the frameworks I've found in the CNCF community is Chaos Mesh. There is also LitmusChaos and some other projects around this area.
A
It's easy to fail Kubernetes nodes, pods, whatever, or even hosts in your cloud infrastructure, and you can play around with degrading the network or failing the network, failing HTTPS or making it slower, and playing around with time, which is quite interesting.
A
If you maybe know Nagios from 10 years ago: it cannot reschedule a check when time is not in sync. Or maybe even DNS: when something is not resolving to an IP address, or is giving a wrong response, the application or the deployments might do something weird. And from a user perspective, Chaos Mesh allows you to run experiments the way you want.
A
You define an experiment once, or you can also use it continuously as a schedule, which can be helpful when you say you want to test it every day in the morning at 9am and see how it goes; probably not on the weekend. The first steps in generating some chaos can be to kill some pods, or not to kill them but to fail pods in a more gentle way.
A
The shown screenshot kind of forces the pods to fail, and Kubernetes detects that after a while; there's a CrashLoopBackOff. Okay, this is kind of expected, but I really want to see sort of the real chaos. So I was like: maybe we should dive into more practical examples rather than trying to click in the UI and do something. I was thinking of turning back time a little bit, to when I was a developer and we wrote a distributed monitoring application, and things only went wrong when DNS was failing in the customer environment.
A
It leaked memory and caused some other troubles, and we as developers were not able to reproduce that. It was like: fix this. No idea. Okay, then fixing the code, pretending it worked, releasing it and letting the customer test in production. This kind of turned into an endless burnout cycle, but in the end it was a good experience to learn from and to say: hey, there are certain scenarios I cannot even predict would happen in a production environment. And it got me thinking.
A
Maybe I can write a short application which simulates that behavior: just using a receive buffer which is like one megabyte, doing some DNS resolving, handling all the errors, and not freeing the buffer, so we are leaking memory intentionally. The thing is, I then talked about this at the Everyone Can Contribute cafe meetup, and Nicolas was like: yeah, I have my own chaos, I can scale Kubernetes to zero replicas.
A
I was like: oh, I didn't know that, interesting. That is one way, and it is a cool way to fail everything, and it got me thinking: yeah, how can we do that in a more automated way, only failing certain things in this scenario? So with Chaos Mesh you can automate this kind of DNS failure and say: I want to choose the type DNSChaos for that experiment, and the action should be 'error'.
A
You can also choose 'random', which provides random IP addresses as a response, and see what's happening. The other thing is that you can use selectors for namespaces and labels, of course, and patterns. In this example I'm checking specifically for o11y-dot-whatever and also for containerdays. I was actually looking into a demo some hours ago to really make this fail, and looking at the clock, I think I'm good in time.
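Put together, such an experiment could look roughly like this; DNSChaos and its action, mode, selector and patterns fields are the real Chaos Mesh API, while the namespace, labels and concrete patterns are assumptions standing in for the demo values:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-error-demo
  namespace: chaos-mesh
spec:
  action: error             # or: random (returns random IP addresses)
  mode: all                 # apply to all matching pods
  selector:
    namespaces:
      - demo                # assumed target namespace
    labelSelectors:
      app: dns-demo         # assumed label on the target pods
  patterns:                 # only fail lookups matching these patterns
    - o11y.example.com
    - "*.containerdays.io"
  duration: 5m
```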
A
I can do the demo; I want to do the steps together with you now. One of the things is, I do have... oops, maybe I'm looking at it.
A
Yeah, and this one is actually resolving things. Then I need to make sure... this is the port forwarding, which is currently not working. Okay, we do have a chaos experiment over here in Chaos Mesh.
A
We do have the Prometheus query for the container memory usage, and this is currently something like 300 kilobytes, 200 kilobytes, something like that. When the experiment is starting, this should actually go up.
A
Let's create one from scratch, and this should be the DNS schedule for ContainerDays. You can also write it the same way as I did now: just create a new schedule, and you can simply upload a YAML definition, for example, and submit it. Everything gets pre-filled by default, it's called 'cd' now, and we force it to run.
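The uploaded YAML definition can look roughly like this: a Schedule wrapping the DNSChaos from before with a cron expression. The Schedule kind with a type-specific sub-spec is the real Chaos Mesh pattern; the names and the cron string are assumptions:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: cd-dns-schedule
  namespace: chaos-mesh
spec:
  schedule: "0 9 * * 1-5"   # every weekday at 9am, as suggested earlier
  type: DNSChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  dnsChaos:                 # the embedded experiment spec from before
    action: error
    mode: all
    selector:
      namespaces: [demo]
      labelSelectors:
        app: dns-demo
    patterns:
      - "*.containerdays.io"
    duration: 5m
```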
A
And if it doesn't work, yeah, then I blame it on the Wi-Fi. No?
A
Okay, we'll be getting back to that at the end of the talk. The thing I wanted to show you is that the container memory usage is going up, and I've defined an alert on a service level objective for 10 megabytes. Everything goes up while the chaos experiment is running, like in the screenshot, and before doing so I had also defined an alert which triggers, and you can see it in the Alertmanager. This is a screenshot from the first of September; I also have one from today.
A
Well, from yesterday. The idea is really to trigger an alert and see that something is failing just because it leaked memory. Now, the thing is: when you're generating lots of alerts and lots of things in your Kubernetes cluster, you might need grouping, you need additional context, and you also need to focus on the dashboards.
A
Of the kube-prometheus provided dashboards there are so many that it might be necessary to reduce the amount of data, correlate certain things, and also provide more context, to really deliver actionable insights into what's going on. Because when everything is burning, you really need focus, and you want to fix things fast enough to gain confidence and really look into things. There are many data points already available: we have metrics, we have the values, and we can create PromQL queries for alerts and service level objectives for ops.
A
It's still valuable to look at the golden signals defined by the Google SREs, like latency, traffic, errors and saturation, to get an immediate insight into whether the deployment, or the current cluster, is healthy.
A
The other thing is: document everything about customizations. Also think of onboarding, because when someone is joining the team and they don't really know about the current state of observability in Kubernetes, it can be really hard to understand it and get into the loop, basically. The goal should always be to immediately see what's important during an incident.
A
When you look into customization for kube-prometheus, you need to learn jsonnet, which can be done. You can develop your own rules and dashboards, you can monitor other namespaces, add applications, remove applications; and if you don't want to use Grafana but something else, for example, you can do that.
A
I've looked into the previous example of the container memory usage: this can be defined by adding a Grafana dashboard, defining the data sources, using the PromQL query, and just having it configured in one format and deployed into the cluster. Another advantage, another confidence thing, is that Prometheus automatically scrapes HTTP endpoints, like the /metrics endpoint: when your application provides a ServiceMonitor custom resource, kube-prometheus picks it up automatically.
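A minimal ServiceMonitor sketch; the kind and its selector/endpoints fields follow the monitoring.coreos.com/v1 API, while the labels, namespace and port name are assumptions about the application's Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dns-demo
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dns-demo        # must match the application's Service labels
  namespaceSelector:
    matchNames: [demo]
  endpoints:
    - port: metrics        # named port on the Service
      path: /metrics
      interval: 30s
```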
A
You don't need to think: hey, I need to configure a host, a service or the metric. Everything works with auto discovery, which is also a quite nice way to do it. Now, for a moment: we've talked a lot about metrics, but observability goes beyond metrics, and there are certain event types which we certainly see, like logs; we maybe have traces, profiling, error tracking, and even more event types. And then we're thinking: is this observability?
A
Maybe we want to have the look from above and see things which are the unknown unknowns, like DNS leaking some memory, or something else causing a failure which we haven't been aware of yet and never would have expected. For example, for logs: I've read many things, I've evaluated many things myself, and there are many different opinions as well. For a central log management solution with Kubernetes, you really need to figure out what is important to you.
A
Is it helpful to have a retention of logs for one year and pay, I don't know, a million dollars or something? Or is it just needed for live tailing, or for the past day, when the incident happened, to analyze something? There are various options available; this is just a random list of vendors and tools which you can look into. But in the end it's really about what is needed to solve the problems, rather than keeping the logs forever because we have endless storage, which we don't.
A
The other thing, which is in my opinion a little more interesting, is to look into tracing. A trace is like a log, but with spans that have start and end times; there is context, and there is metadata enrichment.
A
Even if I need to learn it, in the sense of adding it to my code as a developer, there are also ways to use auto instrumentation. Tracing provides a view into distributed environments with many different microservices and sources, and into how the client response gets calculated in the backend. So looking into tracing can be really helpful. One of the things which evolved over time, and I will only touch the surface now, is OpenTelemetry, a specification and framework which is vendor-neutral.
A
So everyone is working together on a specification. There is a collector, which can also be run as a sidecar, and the idea is to bring your own backend: if you want to store metrics, use Prometheus; if you want to store traces, use Jaeger tracing. On the other side, the client libraries and the ecosystem are very feature-rich, so there are many libraries already available or in development, and even auto instrumentation is available.
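For the bring-your-own-backend idea, a minimal collector configuration sketch: OTLP in, Prometheus and Jaeger out. The component names follow the collector configuration schema of that time; exact component availability depends on your collector version, and the endpoints are assumptions:

```yaml
receivers:
  otlp:
    protocols:
      grpc:                    # applications send OTLP over gRPC (port 4317)
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889     # scrape target for the Prometheus server
  jaeger:
    endpoint: jaeger-collector:14250   # assumed Jaeger gRPC endpoint
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      exporters: [jaeger]
```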
A
If you want to learn more, I would recommend checking out Dotan's talk and the lightning talks tomorrow; I think he's an expert in that area and can answer many questions. For now, for me, the focus is on traces. In Kubernetes, the components can send traces; there are certain patches upstream already available which you can enable. And the application can get instrumented, with traces being sent to the OpenTelemetry collector and stored in Jaeger, and we can use it to visualize, correlate and alert.
A
Why would I be doing that? One example could be: the client requests some data from an HTTP server; the server, the backend, queries some APIs, collects the data, and creates the client response. The idea is then to instrument nginx, or Apache, as the web server to see the traces. You can use, for example, the OpenTelemetry web server SDK and send traces to Jaeger, which is shown in the screenshot. That got me thinking.
A
How can I add chaos to that? Because I really want to see that the traces are getting longer when something is going wrong. So I thought: well, maybe add some sleep functions in my code, deploy it so that nobody sees it, behind a feature flag. And then I was like: no, that's crazy, I shouldn't be doing that.
A
Better to think of a chaos experiment for HTTP or the network, or to even put a stress test on the node or the pod to see what is going on. One of the things I did in the past week was to use Chaos Mesh to stress test the CPU and memory of a specific pod for this OpenTelemetry-instrumented nginx service, and I could see the trace time was increased by five times or something.
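A sketch of that stress test as a Chaos Mesh StressChaos resource; the stressors fields follow the real chaos-mesh.org/v1alpha1 API, while worker counts, sizes, duration and the selector are assumptions:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: otel-nginx-stress
  namespace: chaos-mesh
spec:
  mode: one                  # pick one matching pod
  selector:
    namespaces: [demo]
    labelSelectors:
      app: otel-nginx        # assumed label of the instrumented service
  stressors:
    cpu:
      workers: 2
      load: 80               # target roughly 80% load per worker
    memory:
      workers: 1
      size: 256MB
  duration: 10m
```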
A
It went from two microseconds... or rather from two milliseconds to nine milliseconds, but it was a noticeable increase. And the idea which I got from that was: hey, we can link metrics to traces. So when something is breaking in the system and we see the graph going wild, maybe it's helpful to see the traces.
A
There was a talk about the future of Jaeger, and aggregated trace metrics were mentioned, so you can create metrics from traces received in OpenTelemetry. We can then again add alerts to the metrics, and add chaos to that, to see what's going on. The other thing is, like I mentioned before, traces for Kubernetes system components, which is now also a thing with Kubernetes. Long story short:
A
There is so much more to talk about, even beyond observability. I just want to pitch some ideas to you about what else is coming, or coming soon, and what is important to maybe have a look at. We're talking a lot about eBPF, which will also allow us to perform observability on the kernel level, or on specific syscall events, using it for observability but also for security and other things. And the thing is: how does it fit?
A
How does it fit with Prometheus metrics? Can we integrate that? Does it complement it? How do eBPF, service level objectives, alerts and chaos engineering work together?
A
Also, Cilium open sourced Tetragon at KubeCon, and I thought: well, I only have like 35 minutes today, but I will be building future talk stories and demos and other things for that. So we need to learn and learn and learn.
A
The other thing I want to bring to your attention is hacking Kubernetes, so the security part: thinking of impersonating the attacker, maybe using chaos engineering principles for pen testing (I'm talking too much, yeah), and maybe having observability for better security in that regard.
A
And even DevSecOps. For your own chaos, you should know the limits of chaos: avoid chaos inception, it doesn't make sense, and you shouldn't be running all experiments all the time, because it can harm existing workflows and teams. Maybe use staging environments to prevent data loss, and also keep in mind that chaos engineering doesn't solve all the reliability issues, but it can help with new perspectives. And I'm keeping it short because I'm losing my voice already: continuous observability.
A
Think of your workflows, CI/CD with merge requests, and use chaos engineering and metrics and observability in there to get feedback fast and avoid the DNS errors I made 10 years ago. Or think of continuous delivery and run chaos experiments in production. And last but not least: bring confidence with chaos. Bring chaos into observability, chaos workflows, alerts; iterate and innovate. The last thing is: maybe combine machine learning and chaos engineering, which would be fun, maybe Skynet in the future or something, but yeah.
A
Many, many ideas, many great things to learn. If you want to learn more, I would recommend checking out the talks tomorrow, and many more, which dive deeper into observability. I have a newsletter, and I also have new GitLab stickers if you want some later on, after the questions. Thanks for your attention.
B
A
One thing you can do: the Alertmanager allows you to define webhooks, for example, or a different transport, and it should be doable if you just configure the external webhook with your system, or something else, and move on from there. But you can ping me on Twitter if you need more information about this. Okay.
A
I'm hoping for the Perses project which I showed, at github.com/perses, where there's work underway. It's Apache 2.0 licensed; it's not AGPL like Grafana is now. So I'm really looking forward to seeing progress, and I think at KubeCon North America in late October there will be more updates. I will be there as well. Follow the project, I would say. Okay.
A
We don't know if that's really dashboarding already in the UI. The Prometheus UI was replaced, I think it's React-based right now, and there is a basic view on metrics and you can query things. I'm not sure how far they will go with dashboarding, or whether they focus on the aforementioned Perses project, but I'm really looking forward to someone building an alternative to the almighty Grafana, to be honest. Okay.
B
A
To be honest, I don't know, but I will look it up. Yeah, it's a great idea. Thanks.
B
A
I'm not sure if this will happen within the Prometheus Operator project. I'm pretty sure that vendors who integrate with Prometheus will provide an observability platform, or a DevOps platform, or whatever they are thinking of. I've seen that the term AIOps is coming back within observability, so that we get assistance with the many metrics, with observability and also with alerts; maybe in the future.
B
A
Probably you shouldn't be running tests while your chaos experiments are running, but on the other hand it would be interesting to see. I would decide whether you want to have all the teams informed and knowing about this, or whether you go the security way of pen testing and say: we are the red team, we don't tell anyone, and we're just deploying our chaos experiment and seeing how things are going. I think damage control, and estimating what could go wrong in production, is really important, so that you don't have like 10,000 users affected by chaos.