Cloud Native Computing Foundation KCD Sri Lanka 2022, 22 Dec 2022

From YouTube: Seeing the Invisible Observability with Linkerd by Flynn

Description

Sri Lanka has a growing group of Cloud Native enthusiasts, students, professionals, and technology leaders. KCD Sri Lanka offers a platform for this community to come together and connect with other tech communities in India and neighboring countries. It provides an opportunity to experience conferences like KubeCon / CloudNativeCon together with the rich cultural heritage of Sri Lanka.

A

Hi there, my name is Flynn. I am a tech evangelist for Buoyant, the makers of Linkerd. If you're not already familiar with Linkerd, it's the only CNCF graduated service mesh. Linkerd's purpose in life is to arrange it so that every cloud native developer on the planet has access to the tools they need to build secure, reliable, easily observable cloud native applications, and for those tools to be freely available.

A

My role as a tech evangelist is to make sure that people know that, and also to make sure that people have the knowledge and resources they need to really succeed at the whole cloud native thing. To that end, today I will be talking about observability using Linkerd.

A

This is kind of the classic problem in cloud native: it's very, very hard to really see what's going on inside the cluster, even when things are going well. When things start going badly, it's even harder. Service meshes are well positioned to help with that, and for Linkerd there are two things in particular that make it really good at it. One is the Linkerd Viz tool; the other is the service profile CRD. Linkerd Viz is a tool that gives you easy visual access to a bunch of things within your cluster.

A

Here we can see the topology of our application. We can see a bunch of statistics: success rate, requests per second, there's a bunch of stuff in there. We'll look at some of that as we go through this presentation.

A

The service profile, on the other hand, is a CRD that first defines how Linkerd should be watching your application, and it also gives you access to ways in which Linkerd can help manage your application too.

A

So, for example, if a DELETE request comes in matching this URL over there, then that will get bundled up as a statistic into this bucket with a much more human-readable name. All of the deletion requests, no matter what ID they are, go into the same bucket, which makes it really easy as a human to figure out: okay, are deletions working or are they broken? Also, the really killer bit of this is that as soon as the service profile is created, Linkerd will watch that and aggregate statistics for you on its own, without you having to do anything special.

A

You'll be able to go through and look back in time to get access to those. It's a really, really wonderful tool for troubleshooting. Also, I said management: for example, you can use a service profile to do things like configure retries automatically. We're going to see some of that in the rest of the demo here. And yeah, this is pretty much it for the slides.
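
The bucketing idea described above can be sketched in a few lines of Python. This is only an illustration of the concept, not Linkerd's implementation: the route names, the path regexes, and the sample requests are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical routes, loosely modeled on a service profile's route list:
# each has a human-readable name, an HTTP method, and a path regex.
ROUTES = [
    {"name": "DELETE /books/{id}", "method": "DELETE", "path_regex": r"/books/\d+"},
    {"name": "GET /books/{id}", "method": "GET", "path_regex": r"/books/\d+"},
]


def route_name(method, path):
    """Return the name of the first route matching the request, or None."""
    for route in ROUTES:
        if route["method"] == method and re.fullmatch(route["path_regex"], path):
            return route["name"]
    return None


# Made-up traffic: two deletions with different IDs land in the same bucket.
observed = [
    ("DELETE", "/books/17"),
    ("DELETE", "/books/42"),
    ("GET", "/books/17"),
    ("GET", "/health"),
]
buckets = Counter(route_name(method, path) for method, path in observed)
print(buckets)
```

Requests that match no route end up under `None` here, which mirrors the idea that only traffic described by the profile gets a friendly per-route name.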

A

The rest of this presentation is a live demo, so let's get to it, shall we? Okay, so I have here a running Kubernetes cluster. I'm doing all this on a k3d cluster that's running on my laptop, just because I kind of like having control over everything. The first thing you'll see here is that we've got the books demo and we have the Emojivoto demo, so we have both of those running in the cluster. We can take a look. This is the books demo. You may have seen it before.

A

You can go click around and look at books and look at authors, and that's it. It's pretty simple. This is the Emojivoto application, where you can go and vote for emoji and then you can review the leaderboard, and that's all there is to it. I should point out there is a traffic generator in here. I am not just clicking endlessly on emoji all day. So let's go back here.

A

We also have Linkerd running, and I've got an ingress running in the cluster as well. I just like to be able to use domain names instead of having to do everything by port forwards and then having to remember the port. All of this I pretty much set up just using the quick starts available from the Linkerd documentation. I'll have the link at the end of this presentation, where you can go and see exactly how I set everything up.

A

It's pretty standard, pretty easy to get going. A very important thing here is that we have run linkerd check, so we can see that Linkerd is actually running cleanly on its own. We're good to go. All right, now let's suppose it's Friday night and somebody calls up and says something is wrong with the Emojivoto application. They don't know what's wrong. They just know something is wrong: it's not behaving.

A

What should we do about this? Well, I guess the first thing we can do is just look over all the namespaces in the cluster using Linkerd Viz from the command line. We can see immediately that yeah, there's some challenging stuff here. The Emojivoto application actually is not showing us 100 percent success.

A

Neither is books, but we'll have to come back to that. So, given that we can see there's something wrong in this namespace, let's drill into that a little bit. Here we're going to look just at deployments in the emojivoto namespace, because that's where we already know there is a problem.

A

So if we do that, again we can immediately see that the web deployment and the voting deployment look like there are some unhappy things going on. Those seem like pretty natural places to look, so we're going to go ahead and take a look at the web deployment. We'll use linkerd viz top for this. That's going to give us a rundown in real time of the most common requests that are going to or from the web deployment.

A

If we do that, we can see a bunch of things happening, but it's most interesting, I think, to look over at the success rate column here, which is all 100 percent. So far this is working out okay. That's... wait a minute! That's not a good sign! Okay, so that tells us that at least this time voting for a donut did not work, and it looks like it might not be working at all, come to think of it.

A

So it looks like the web deployment is talking to the voting deployment, and maybe we should go take a look at that. Let's go ahead and take a look at the voting deployment. If we look at the voting deployment with linkerd viz top, then let's see... nope, that's not working. Yeah, it looks like we have a problem with voting for donuts.

A

That's too bad; donuts are usually pretty popular. At this point we probably could hand this off to the developers and say, hey, it looks like there's a problem voting for donuts, but we can probably do better than that. So another thing we can do is, instead of running linkerd viz top, we'll run linkerd viz tap. Tap shows us real-time traffic, request by request. It just gives us a running list of everything going on.

A

It's a really nice way to get a quick look at the actual live traffic, and here I see a bunch of things that are working, right? I see a POST, it gets a 200, it's got a gRPC status of OK. So far, so good. I don't actually see any donut requests so far. Oh wait, here's one, all the way down at the bottom. You'll notice that this is gRPC status Unknown. Very important to note here: status Unknown does not mean that Linkerd doesn't know what the status is.

A

That is a gRPC error that says something went wrong and the gRPC layer can't tell us more about what happened. But we know that something is wrong here, so that means we can come back. Maybe we should just drill into voting for a donut.

A

You know, only that one, right? If we do this, we might have to wait a little bit for somebody to actually vote for a donut.

A

But if we do this, we should be able to see whether voting for a donut is always failing or only failing part of the time. Looks like it's always failing. Okay, all of these are saying gRPC status Unknown. That definitely gives us enough to go back to the developers with. We could go back and say, hey, we're seeing an error when you vote for a donut, but now we can say we know it's a gRPC error. And there's more we can do here.

A

There's one more thing: if we tack -o json onto that same command, so we're still looking at the donuts but in JSON, then instead of giving us that nice three-line summary it'll break everything out into a huge JSON block, and we'll get information on the requests and the responses. There we go, there's one. So we can see this is a request. It's going from the web deployment to the voting deployment.

A

It is, in fact, a vote for a donut. If we scroll down, we'll be able to see the response. So we come down here, there's the response, and if we scroll a bit further down, we can see, oh yeah, look: gRPC status 2. If everything were working, that would be gRPC status 0.
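
For reference, the numeric status seen in the JSON output maps onto the standard gRPC status codes, where 0 is OK and 2 is UNKNOWN. A small Python lookup makes the mapping explicit. The `describe` helper is a hypothetical name used only for this sketch.

```python
from enum import IntEnum


class GrpcStatus(IntEnum):
    """The standard gRPC status codes."""
    OK = 0
    CANCELLED = 1
    UNKNOWN = 2
    INVALID_ARGUMENT = 3
    DEADLINE_EXCEEDED = 4
    NOT_FOUND = 5
    ALREADY_EXISTS = 6
    PERMISSION_DENIED = 7
    RESOURCE_EXHAUSTED = 8
    FAILED_PRECONDITION = 9
    ABORTED = 10
    OUT_OF_RANGE = 11
    UNIMPLEMENTED = 12
    INTERNAL = 13
    UNAVAILABLE = 14
    DATA_LOSS = 15
    UNAUTHENTICATED = 16


def describe(code):
    """Turn a numeric grpc-status value into a readable summary."""
    status = GrpcStatus(code)
    verdict = "success" if status is GrpcStatus.OK else "failure"
    return f"grpc-status {int(code)} = {status.name} ({verdict})"


print(describe(0))
print(describe(2))
```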

A

We don't really get anything particularly useful in the error message, but at least we can now go back to the developers and say, right, when you try to vote for a donut, you get a gRPC status of 2. That's a problem. So overall that works out pretty well.
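
The command lines themselves are not captured in this transcript. As a rough sketch, the tap drill-down might look like the commands built below. The `tap_command` helper is hypothetical, and the resource name, namespace, and donut request path are assumptions based on the public Emojivoto demo rather than something shown in the talk. The `--namespace`, `--path`, and `--output` options belong to `linkerd viz tap`.

```python
import shlex


def tap_command(resource, namespace, path=None, output=None):
    """Build an argv list for `linkerd viz tap` (hypothetical helper)."""
    argv = ["linkerd", "viz", "tap", resource, "--namespace", namespace]
    if path is not None:
        # Only show requests whose path starts with this prefix.
        argv += ["--path", path]
    if output is not None:
        # For example "json" for the full request/response detail.
        argv += ["--output", output]
    return argv


# Assumed path of the failing donut vote in the Emojivoto demo.
DONUT_PATH = "/emojivoto.v1.VotingService/VoteDoughnut"

steps = [
    tap_command("deploy/voting", "emojivoto"),
    tap_command("deploy/voting", "emojivoto", path=DONUT_PATH),
    tap_command("deploy/voting", "emojivoto", path=DONUT_PATH, output="json"),
]
for argv in steps:
    # Printed only; pass argv to subprocess.run to execute against a real cluster.
    print(shlex.join(argv))
```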

A

On the other hand, this took a little while, and the reason it took a little while was that we needed to watch the traffic and wait for somebody to vote for a donut so we could see the problem. Then we had to wait for them to do it again so we could see if the problem persisted. There should be a better way and, as I was mentioning earlier with service profiles, there is a better way. The Emojivoto application is a gRPC application. gRPC means protobuf.

A

Protobuf means that, rather than writing service profiles by hand, we can just ask linkerd profile to read the protobuf definition and write us a service profile for it. So let's do that for the Emoji proto. There you go, there's not much to it. It's both POSTs, which is kind of interesting. You can list all the emojis, and you can find a given emoji by shortcode. That's kind of it.

A

If we needed to modify this, we could, but this is certainly a great way to get started at minimum. So let's go ahead and apply that to the cluster.

A

Same command, right? Just generate the profile from the proto and then apply it. That works out pretty well, and then we can do the same thing for the other gRPC service that's part of the Emojivoto application, the Voting proto. I'm not going to bother showing that; it's pretty much just more of the same, so we'll apply that one. Now, we really want to see some things about the web too, and that's problematic, because the web app here is not gRPC. The web app is just plain old REST, so we could write the service profile by hand.

A

If we had a Swagger definition, we could have linkerd profile just read the Swagger definition and write a profile for us. In this case we don't have either of those things. So instead, what we're going to do is this other trick where we have Linkerd just watch the traffic going by for a little while, in this case 10 seconds, and generate the profile based on what it actually sees in the traffic, which is kind of cool.

A

So let's go ahead and do that. It's going to take a few seconds here, of course, because I told it 10 seconds.

A

Let's take a look at the profile that it wrote, and yeah, you can see lists, you can see votes. It's pretty straightforward, right? So let's go ahead and apply that.
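
The idea of deriving routes from observed traffic can be sketched as follows. This is a simplified illustration, not how Linkerd actually builds the profile: the function name and the sample requests are hypothetical, and the output only borrows the `name` and `condition` (`method`, `pathRegex`) shape of service profile routes.

```python
import re


def routes_from_traffic(observed):
    """Collapse observed (method, path) pairs into unique route entries."""
    seen = set()
    routes = []
    for method, path in observed:
        key = (method, path)
        if key in seen:
            continue  # already have a route for this method and path
        seen.add(key)
        routes.append({
            "name": f"{method} {path}",
            # Escape the literal path so it is safe to use as a regex.
            "condition": {"method": method, "pathRegex": re.escape(path)},
        })
    return sorted(routes, key=lambda route: route["name"])


# Made-up sample of what might go by during a short observation window.
observed = [
    ("GET", "/api/list"),
    ("GET", "/api/vote"),
    ("GET", "/api/list"),
    ("GET", "/api/vote"),
]
for route in routes_from_traffic(observed):
    print(route)
```

One consequence of this approach is that a route only appears if a matching request happened to go by during the observation window, so a short window on a quiet service can miss routes.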

A

And now we should be set up to debug this problem with Emojivoto much more quickly than we otherwise would have. So now we're going to go check this out by looking at the dashboard.

A

So here we are at the dashboard. We have the booksapp namespace and the emojivoto namespace and, as we saw from the command line, things are not working flawlessly here. Let's go into this namespace, and you can see with the topology graph that we've got this vote bot that's generating traffic and sending it over to the web service. The web service in turn is talking to the emoji service and the voting service. All this lines up with what we saw from the command line, which is kind of nice.

A

Let's go take a look at the web deployment itself, and you'll also see a different graph here of which deployments are talking to which. But the really neat thing we can do now is click on this route metrics tab, where we can just immediately see: oh hey, here's the route that's not doing so well, GET /api/vote.

A

If we go over to the voting deployment in turn, then we can scroll down and check out its route metrics here. Let's just sort by success rate, and we can instantly see, without waiting for anything, that yeah, it's the donut that's causing us problems. If we want to drill into that, let's go back to the live calls here, and we can click on the microscope for any of these. Let's see if we get a donut. Yep, there's a donut, great.

A

It moved up because this is a top view. Let's click on that, and it fills in a tap page for us. You can see it filled it in with the donut. Click on start and, as requests for donuts come in, this will populate here. There's one, and we can click on this tab, and there you go: there's the JSON view from before. So we kind of get to have it both ways.

A

Obviously this is much faster than running all the stuff from the command line, but it's important to note that it's working on exactly the same information, so everything you can do here, you can do from the command line. All right, so at this point we would hand this over to our developers and tell them there's a problem with the donuts. We can go on and, of course, that would be the time that something comes up with the books app because, I don't know, that's the way life goes, right?

A

The nice thing about the books app is that it already has service profiles, so we don't have to go through and build them manually. We can go straight to the route metrics from the command line and see what's going on with the webapp service here in the books app. We can immediately see that, oh right, there are two things in here that seem to be failing about half the time. That looks problematic.

A

We can also do the same trick we did before, where we drill down and say: okay, show me where the webapp deployment is talking to the authors service. How is that going? You can see there's a little bit more detail here. There are calls in this list that don't show up up here, but in particular we can also note, for our debugging purposes, that all of these are working. So there's probably nothing for us to worry about there.

A

Let's check the webapp talking to the books service, and what do we see there? Okay, here we've got a couple of things that are failing about half the time. That's probably not good. Finally, up here we're not seeing traffic between the books service and the authors service, although maybe there is some. So let's take a look at that, and what do we see?

A

Yeah, there's one call. It's just a HEAD call, but it's failing about half the time. Kind of interesting. Now, we could of course do all of this from the GUI as well. Let's go ahead and do that. We'll back up to namespaces, then duck into the booksapp namespace. There's our topology graph again. We saw this at the beginning of the presentation. It's the traffic generator talking to the webapp, the webapp talking to authors and books, and books and authors talking to each other.

A

So here, let's take a look at one of these. Let's look at the webapp, shall we? Once again we get the neat little graph here, and once again we can go down and look at route metrics and immediately see, right, these are a problem. Actually, let's look at the authors deployment. If we look at the authors deployment and its route metrics, again we can immediately see: okay, this HEAD, that's got some trouble, right?

A

So, interesting thing here: HEAD requests are idempotent and they have no data, so they make a great candidate for retries. And service profiles, if you remember, are a place where we can configure retries. Here are the service profile docs: configure retries on your own services. If we look over here, for routes that are idempotent and don't have bodies, you can edit the service profile and add isRetryable to the retryable route. That's the only thing we have to do to enable retries down in the mesh. We don't have to change any application code.

A

So let's go ahead and try that out. We will do that using kubectl edit; we'll do it the really simple way. That's the authors service profile. All the way at the bottom, you can see this is the HEAD that we're talking about retrying, and we are literally going to just add isRetryable: true. Then we'll save that and quit. That's updated, and now let's go and watch and see what happens.
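
The edit itself is a single added field on one route. A minimal sketch of the change, using a Python dictionary that mirrors the route section of a service profile: the `mark_retryable` helper is hypothetical, and the route names and path regexes are illustrative guesses at the books demo profile, not copied from the talk. Only the `isRetryable` field and the `name` / `condition` layout come from the ServiceProfile resource.

```python
import copy

# Illustrative fragment of a ServiceProfile for the authors service.
profile = {
    "kind": "ServiceProfile",
    "spec": {
        "routes": [
            {
                "name": "GET /authors.json",
                "condition": {"method": "GET", "pathRegex": r"/authors\.json"},
            },
            {
                "name": "HEAD /authors/{id}.json",
                "condition": {"method": "HEAD", "pathRegex": r"/authors/[^/]*\.json"},
            },
        ]
    },
}


def mark_retryable(profile, route_name):
    """Return a copy of the profile with isRetryable set on the named route."""
    updated = copy.deepcopy(profile)
    for route in updated["spec"]["routes"]:
        if route["name"] == route_name:
            route["isRetryable"] = True
            return updated
    raise KeyError(f"no route named {route_name!r}")


updated = mark_retryable(profile, "HEAD /authors/{id}.json")
print(updated["spec"]["routes"][1])
```

Only the idempotent, body-less HEAD route is marked. The other route is left alone, which matches the guidance quoted above about which routes are safe to retry.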

A

If we tack -o wide onto this linkerd viz routes command, it will tell us the effective success rate as well as the actual success rate, and if this worked, we should see these two start to diverge. We should see the effective success rate going up, even though the actual success rate isn't doing much. And yeah, we can see that it is going up. In fact, here we see 61, 62 percent, almost. Let's give it another few seconds here. 68, yeah, so that looks like it's headed in the right direction.
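
Why the two numbers diverge can be seen with a little arithmetic. In this sketch, which assumes a simplified model and made-up traffic, each request is a list of attempt outcomes: the actual success rate counts every attempt, while the effective success rate counts only what the caller sees after retries.

```python
def success_rates(requests):
    """Return (effective, actual) success rates.

    `requests` is a list of per-request attempt outcomes, for example
    [False, True] for a request that failed once and then succeeded on retry.
    """
    attempts = [ok for outcome in requests for ok in outcome]
    # Actual: fraction of individual attempts that succeeded.
    actual = sum(attempts) / len(attempts)
    # Effective: fraction of requests whose final attempt succeeded.
    effective = sum(outcome[-1] for outcome in requests) / len(requests)
    return effective, actual


# Made-up traffic: four requests, three retries in total.
traffic = [[True], [False, True], [False, False, True], [True]]
effective, actual = success_rates(traffic)
print(f"effective={effective:.0%} actual={actual:.0%}")
```

With these sample numbers, every request eventually succeeds while only four of the seven attempts did, which is the kind of gap the wide output exposes.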

A

What we don't know yet is whether this is the only problem with our books application right now, so let's go take a look. Let's look from the webapp's point of view, I think. Yeah, webapp talking to books, that'll probably do it. And here we can see, right, we've hit 100 percent on everything at this point. Here we're only seeing the effective success rate, but the effective success rate is the one we care about from the user's point of view.

A

So at this point we've been able to use Linkerd to figure out what was failing and to put in a mitigation so that the end user is no longer affected by this problem. We still have work to do with the developers. Obviously we're going to have to figure out why exactly the thing was failing, which we don't know right now, and we're going to have to get a fix put in place to really solve the problem for real.

A

But if you remember, I started this by saying it was a Friday night, and putting in this quick change to the service mesh means that we don't have to go back and bug the developers on Friday night. Everything is working from the user's point of view. We can come back and tackle this on Monday morning. So that's about it for this demo. You can find more information about Linkerd at linkerd.io, and you're always welcome to join our Slack: go to slack.linkerd.io for that. I hope you do.

A

The source code for Linkerd is in the github.com/linkerd organization. You can also find us at @linkerd on Twitter. If you're curious about how this particular demo got set up and run, you can look in the Service Mesh Academy repo; you'll find all the details there. And you can always reach me at flynn@buoyant.io for email or at @flynn on the Linkerd Slack. Hope to hear from you. Thanks!