From YouTube: Observability 101
A: Awesome, thank you very much. Everyone, welcome back from the break. We'll be continuing with our next session, from Richard. Richard is very well known as a personality within the open source industry, and in this session he'll be speaking to us about observability. When you are deploying your infrastructure, managing your services, or whatever you have in your enterprise, one thing that needs serious attention is observability, and he will be introducing it in Observability 101. We did some workshops, including a workshop on Grafana yesterday and last week, so this will give more context to understand it and other things around observability better. Over to you, Rich.
B: Let's get started: Observability 101, with a focus on Prometheus and beyond. Let's start with the buzzwords, because observability is an absolute buzzword, and as per usual, buzzwords tend to have a core of truth, a core of meaning, but often they're just applied to whatever you already have, which is understandable.
B: It is somewhat dangerous, though, because you need to actually understand why a term has become a buzzword and why there is so much industry attention on that term or concept. There is a concept of cargo culting, which is basically just replicating what you perceive others to be doing without actually looking into the details of it, and then not getting the outcome which you would actually like to get, which is obviously not what you want.
B: In this context, monitoring is the old term, more or less, and it has taken on a meaning of collecting a lot of data but not necessarily using it. There are extremes: either you just toss everything into a data lake and don't really use it, at least not in a monitoring and observability context, or you build full-text indexes of everything, which is just hugely expensive.
B
You
don't,
or
I
mean
you,
want
to
have
the
complex
systems
and
you
want
to
have
the
benefits
of
those
complex
systems,
but
you
still
want
to
enable
humans
to
understand
those
systems
and
also,
at
the
same
time,
you
enable
machines
to
understand
complex
systems,
which
means
you
can
automate
a
lot
of
things
like,
for
example,
alerting.
B: I distinguish between two types of complexity. One is fake complexity, which is just bad design, or legacy design, or what have you. Maybe there were design constraints before which are not there anymore; it doesn't matter. Often the things which are complex in a system are not inherent to the system, and that complexity can be reduced, and it should be reduced, because if you just have complexity for complexity's sake, you're making your own life harder and making it more expensive to run your service, which is again not what you want.
B: The other is inherent complexity, which you cannot remove; you can only move it around. We had monolithic and mainframe designs, we had client-server, we have microservices. Prometheus itself is a monolith, for example, so you can see that you can make different decisions, and even within the cloud native context it can make sense to run monolithic services like Prometheus. But you cannot make this complexity go away; you just move it.
B
It
must
be
comparison,
mentalized,
a
different
name,
for
this
is
service
boundaries,
like
your
hard
drive,
is
insanely
complex,
but
it
it
has
a
clearly
defined
interface
and
your
operating
system
in
your
main
board
can
just
address
your
hard
drive
same
for
your
cpu
and
such
those
are
also
super
complex,
but
it's
compartmentalized
away,
so
you
don't
have
to
deal
with
it
on
the
level
which
you're
dealing
with
with
whatever
service
you
have
same
for
cloud
instances
and
everything,
of
course,
and
ideally
it
should
be
distilled
in
a
meaningful
way
that
you
can
actually
extract
what
you
want
from
that
complexity
to
to
understand
what
is
happening
where
you
need
to
to
understand
it,
and
else
you
can
just
more
or
less
ignore
it
unless
you're
part
of
of
the
team,
which
is
actually
responsible
for
running
that.
B: The core meaning of SRE is to align incentives, because you want different people and different teams to actually work together and not against each other. A lot of what you can see in the Google SRE book and such, if you distill it to its essence, is basically about making people think about the same things and aligning their incentives, so that they automatically, without having to discuss and fight about it, do the same thing, or go in the same direction.
B: Hugely important here are SLI, SLO, and SLA: service level indicator, service level objective, and service level agreement. The indicator is just what you measure. The objective is what you do not want to go above. And the agreement means that where you do go above it, you actually have to pay, or you break a contract, or whatever. A specific example: if you have error budgets for your service, this allows your developers, your operations people, your product managers, everyone, to optimize for shared benefits.
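As a back-of-the-envelope illustration (my own sketch, not from the talk), here is what the error budget arithmetic looks like, assuming a simple availability SLO measured as the ratio of failed to total requests:

```python
# Minimal error-budget arithmetic for an availability SLO.

def remaining_error_budget(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget left: 1.0 = untouched, below 0 = blown."""
    budget = 1.0 - slo_target                      # e.g. 0.001 for a 99.9% SLO
    observed_error_ratio = failed_requests / total_requests
    return 1.0 - observed_error_ratio / budget

# Example: 99.9% SLO, 10M requests in the window, 4,000 of them failed.
print(remaining_error_budget(0.999, 10_000_000, 4_000))  # ~0.6: 60% budget left
```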
B: If that service is super stable, everyone gets to do their A/B testing and their new features and everything. But if that error budget is used up, the operations people can say: okay, we cannot push any updates unless they're super well tested, which puts load on the developers, and they can't ship their new features, which they don't like.
B: What can this mean specifically? Everyone using the same tools and dashboards would be a good thing. Of course, you then have the shared incentive that everyone invests in the same tooling. Everyone works on the same dashboards, so they share a language, because if you have only one dashboard, or five, the terminology is the same, so they share this language automatically. They share the understanding of how to look into the service, and since all their tools and such work the same way, that also pools your institutional knowledge.
B: Some people in your org will care more about external customers, but I would argue that internal customers within the org are just as important, because they provide services to other, external customers. So treating yourselves within the org as your own customers, between different teams and service owners, makes absolute sense in my opinion. You could also call this layering, and the internet wouldn't exist without proper layering, where you have your layer 2, your layer 3, everything, and you can fully parallelize the work on those different layers.
B: Of course, each one is a different layer with clean interfaces, so you could just do your work on one layer, and no one designing IP at the time had to think about whether there could ever be wireless. It just still works. I already talked about CPUs, hard disks and such. Even your lunch: in the common case, you will not be doing everything yourself. You won't be growing all your own wheat and blacksmithing your own tools to actually grow that wheat; you will be buying certain bits and pieces. So no matter how much you cook yourself, you still have those service interfaces everywhere in your life. Your customers don't really care about your internal things. They don't care if half of your database nodes are down; they care about their database service being up and quick, and that is how you need to think about those services.
B
You
need
to
think
about
them
from
the
perspective
of
the
paying
customer
who
doesn't
really
care
about
any
of
your
journals.
They
just
care
to,
so
that
the
service
works,
something
which
you
will
not
see
very
often
but
which
I
think
is
hugely
important.
You
need
to
discern
between
between
different
types
of
slis.
B: If a disk is filling up, deal with it during business hours; don't wake someone up for it! If your customers are not able to access the system because that disk has filled up, that is the reason to alert, not just that a disk is full or something. So, let's look at tools. First and foremost, obviously, Prometheus. Many of you will know it, but still, let's walk through the 101. It's inspired by Google's Borgmon. It's a time series database which internally stores the values as 64-bit numbers.
B: It has concepts of instrumentation and exporters. Instrumentation means modifying your own source code, or other people's source code, to emit metrics directly from within the system. Exporters are basically proxies where you can take SNMP, or a database, or something, and rewrite it into something which Prometheus can understand. It is not meant for event logging, and dashboarding happens via Grafana.
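To make instrumentation concrete (my own sketch, not from the talk), here is a minimal example using Python's prometheus_client; the service and metric names are made up for illustration:

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Hypothetical metrics for a toy web service.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being served")
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def handle_request() -> None:
    IN_FLIGHT.inc()
    with LATENCY.time():                       # records elapsed time on exit
        time.sleep(random.uniform(0.01, 0.1))  # pretend to do some work
    REQUESTS.labels(method="GET").inc()
    IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```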
B: There is service discovery, so Prometheus knows what's out there, and there's integration with pretty much every cloud provider, or at least every major cloud provider, with more coming all the time. You just point Prometheus at the endpoint of your cloud provider, and it knows about the services and starts scraping them. You don't have a hierarchical data model; you have an n-dimensional label set. So you don't have region, country, customer, where you break the hierarchical tree model as soon as you want to group by customer. You can just select by the label customer="whatever" and you're done. There's a language, PromQL, used for everything: processing, graphing, alerting, exporting, everything. You need to learn it, and it's a new language, but it is insanely powerful. Prometheus itself is quite simple to operate, and it's super efficient, most likely more efficient than anything you saw which is older than Prometheus. That is not so common anymore, but there are still people who see it as a new thing.
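For example (my own sketch, not from the talk), a label-based selection might look like this, assuming a Prometheus server on localhost:9090 and a hypothetical metric carrying a customer label:

```python
import requests

# Instead of walking a hierarchy (region -> country -> customer),
# select by the one label you care about, regardless of the others.
query = 'sum by (region) (rate(api_requests_total{customer="acme"}[5m]))'

resp = requests.get(
    "http://localhost:9090/api/v1/query",  # Prometheus HTTP query API
    params={"query": query},
)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```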
B
Other
selling
points
it's
pull
based,
which
gives
you
nicer
properties
about
certain
types
of
alerting
and
and
consistency
checks.
We
have
the
concept
of
black
box
monitoring
where
you
look
at
stuff
from
the
outside
versus
white
box.
Monitoring
where
the
box
is
is
completely
open
and
you
can
look
into
the
inside
of
of
that
box.
B: Individual events, like a function being called, are usually merged into counters or histograms, for things like latency. Changing values, like your temperature or your memory usage, are gauges, and they can go up and they can go down. Typical examples, which you have probably already read about: access rates on web servers would be a counter, temperatures would be a gauge, service latency would be a histogram. It's super easy to emit and parse.
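To illustrate how those types are queried (again my own sketch, reusing the hypothetical metrics from the instrumentation example above):

```python
import requests

# Counters are usually queried as rates; histograms via histogram_quantile
# over the per-bucket rates. Both queries assume the metrics sketched earlier.
queries = {
    "request rate": "rate(http_requests_total[5m])",
    "p99 latency": "histogram_quantile(0.99, "
                   "rate(http_request_duration_seconds_bucket[5m]))",
}

for name, query in queries.items():
    data = requests.get(
        "http://localhost:9090/api/v1/query",
        params={"query": query},
    ).json()
    print(name, data["data"]["result"])
```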
B: Kubernetes is the equivalent of Borg, which is what Google runs their services on, and Prometheus is basically the equivalent of Borgmon, though the APIs are more of the Monarch type. And while Kubernetes and Prometheus were not started with each other in mind, they are inherently designed for each other because of their shared heritage. Also, if Kubernetes changes anything about their kube-state metrics and such, that's always agreed with the Prometheus team, because we have people overlapping between the two projects. Raw numbers: the highest we know of is 2.5 million samples per second into one Prometheus server, which comes out differently depending on how you tune it.
B: In a recent test, I got 260k samples per second per core; in the test before that, we got to 100k. We can compress those 16 bytes per sample down to 1.36 bytes per sample, which says a little bit about the efficiency. The largest Prometheus we know of has 125 million active series. There are two long-term storage options: one is Thanos, one is Cortex. Historically, Thanos was easier to run and scaled its storage horizontally, whereas Cortex was harder but has become a lot easier; Cortex started out by scaling the ingesters and queriers horizontally.
B
They
experiment
differently,
but
still
they
are
closed.
I
hope
that
at
some
point
they
merge,
but
probably
not,
but
I
would
hope
so.
The
official
format
for
prometheus
is
called
openmetrics.
It's
basically
an
independent
standard
of
prometheus,
but
permeated
uses
it
as
its
official
standard.
B: That name was chosen in part for political reasons. It's also about putting all of this into the IETF, so you have a real, official, independent standard.
B
There
is
a
concept
of
three
pillars:
metrics
logs
and
traces.
Of
course,
they
usually
have
the
metrics
and
logs
are
the
easiest
and
cheapest
in
many
ways,
and
traces
are
just
where
you
go
with
your
application
monitoring,
which
is
why,
which
is
why
those
are
super
tight
super
tightly
coupled
and
in
particular,
tying
metrics
to
traces
or
lobster
traces
is
super
easy
with
with
ex-employer.
B: An exemplar is a way to attach the IDs of traces directly to your metrics or your logs. The reason is that you don't have to have the full label set on your traces; you can just use this one direct pointer, which has a few other nice properties. In particular, you already have all the context when you jump into your trace; you already know what's wrong. And yes, I'm absolutely serious about that one.
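At the instrumentation level, that can look like this (my own sketch; Python's prometheus_client accepts an exemplar on counter increments and histogram observations when metrics are exposed in the OpenMetrics format, and the trace ID here is a made-up placeholder):

```python
from prometheus_client import Counter

REQUESTS = Counter("api_requests_total", "Requests served")

def handle(trace_id: str) -> None:
    # Attach the current trace ID as an exemplar on the increment, so a
    # spike on a dashboard can link straight to one concrete trace.
    REQUESTS.inc(exemplar={"trace_id": trace_id})

handle("0af7651916cd43dd8448eb211c80319c")  # placeholder trace ID
```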
B: I did start OpenMetrics to change how the world does observability, so that you have metrics, logs, and traces all with the same data model and the same underlying assumptions, which makes it easier to jump between those things. Speaking of which: Loki. Loki is basically like Prometheus, but for logs. It has the same label-based system as Prometheus. You don't need a full-text index; you just index your labels, and everything else is an opaque string, which makes it super quick and super cheap to run.
B
Your
yeah,
your
your
logs,
would
have
the
same
label
sets
as
your
metrics.
I
already
said
that
which
makes
it
a
lot
easier
to
just
jump
between
the
two
and
you
can
also
easily
extract
metrics
from
your
logs.
If
that
looks
familiar,
that's
because
it
is,
you
have
your
timestamp,
which
is
mandatory
in
in
logging,
but
else
you
have
the
same
label
set,
and
then
you
just
have
your
your
opaque
string.
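As a hypothetical illustration (not from the talk), querying Loki looks much like querying Prometheus, assuming a Loki server on localhost:3100 and an app label shared with the metrics:

```python
import requests

# Same label-selector style as PromQL. The |= operator filters log lines,
# and rate(...) extracts a metric from the matching lines (LogQL).
log_query = '{app="api"} |= "error"'
metric_query = 'rate({app="api"} |= "error" [5m])'

for query in (log_query, metric_query):
    resp = requests.get(
        "http://localhost:3100/loki/api/v1/query_range",  # Loki HTTP query API
        params={"query": query},
    )
    print(resp.json()["data"]["resultType"])
```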
B: That leaves us with traces. Tempo is one of the implementations, and it is designed precisely for this exemplar-based world. It's backed only by an object store: you don't have to run any expensive services in the backend, you can just use an object store. It's fully compatible with OpenTelemetry tracing, Zipkin, Jaeger, all those things. Because it is so efficient, you don't need to sample your traces, so if you have an interesting trace ID, you can actually jump to it; you don't just lose it. And Prometheus, Cortex, Thanos, Loki: they all support exemplars, so you can do this jumping back and forth.
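For example (my own sketch, not from the talk), once an exemplar hands you a trace ID, the lookup is a single direct fetch, assuming a Tempo server on localhost:3200:

```python
import requests

# With exemplars you arrive with a concrete trace ID in hand, so there is
# no search step: just fetch the trace directly by its ID.
trace_id = "0af7651916cd43dd8448eb211c80319c"  # placeholder ID
resp = requests.get(f"http://localhost:3200/api/traces/{trace_id}")
print(resp.status_code, len(resp.content), "bytes of trace data")
```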
Some numbers on scaling, for what we run internally: we have 1 million samples per second and retain 100% of those, and as we go for 14-day retention with three copies stored, we have a cost of roughly 200 CPU cores, 300 gigs of RAM, and 40 terabytes of object storage, for 1 million samples per second for 14 days. We did a 10x jump recently, and we already have plans for the next 10x jump.
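As a back-of-the-envelope check (my own arithmetic, reusing the 1.36 bytes per sample figure from earlier), the raw compressed samples account for only part of that object storage; the rest would be indexes, chunk overhead, and headroom:

```python
# Rough sample-volume arithmetic using the figures quoted in the talk.
samples_per_second = 1_000_000
bytes_per_sample = 1.36      # compressed size mentioned earlier
retention_days = 14
copies = 3

raw_bytes = (samples_per_second * bytes_per_sample
             * 86_400 * retention_days * copies)
print(f"{raw_bytes / 1e12:.1f} TB of raw compressed samples")  # ~4.9 TB
```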
B
Those
numbers
are
already
a
few
weeks
old.
I
think
we
already
have
better
numbers
now
bringing
all
of
this
together.
This
allows
you
to
jump
from
your
logs
to
your
traces
directly.
This
allows
you
to
jump
from
your
metrics
to
your
traces,
from
your
traces
to
your
logs,
and
all
of
this
is
open
source.
You
can
run
it
yourself.
B: Yeah, absolutely, it's a hard requirement in my opinion. If you look at previous systems, where you had one service running on one machine or some such, you basically had a lot of the same underlying complexity, but it was well hidden behind the operating system and behind more traditional tools which already allowed you to do all that debugging. That changed with cloud native.
B: Previously, you maybe had your server, and then you got more users and you had to buy a bigger server, and you contained a lot of this within that system. But now, if you run everything in the cloud and a lot of users jump in, maybe you just scale out to two, three, ten times the amount of whatever your service is, and this leads to an absolute explosion of the information about your system as it is running. With this immense amount of data, you're no longer able, as a human, to just go through a few log lines and figure out what's happening. It's just impossible, because you have so much stuff going on at the same time. So you don't have a chance to run a service properly unless you have a chance to understand how that service is running.
B
Not
at
all
tracing
is
part
of
observability.
There's
like
there
are
different.
There
are
different
approaches
to
how
you
do
tracing
within
within
observability.
B: If you want me to talk about this, I can easily do it, but the high-level reply is that it's one of the signals which you need for proper observability, at least where you have access into your software, which in cloud native and such is pretty common. If you run more traditional services, or even servers, machines, and network routers, you usually don't even have access to those places; you just cannot trace them. But as soon as you do have access, you should absolutely make tracing part of your observability story.
A
Okay,
awesome
yeah.
Thank
you
very
much.
I
think
we
still
don't
have
any
questions.
I've
checked
the
live
stream.
Also,
there
are
no
questions,
but
I
believe
the
participants
have
seen
your
contact
details.
You
can
reach
richie
on
either
twitter
or
send
him
an
email.
If
you
have
any
questions
or
if
you
need
more
clarification
on
observability
or
tracing
he's
an
expert
in
it
and
can
definitely
point
you
in
the
right
direction.