From YouTube: Contour demo - Envoy Shutdown Manager
Hi, I'm Steve Sloka. I work on Contour, and today I want to talk a little bit about a new feature we have in Contour called the shutdown manager. The shutdown manager is a new subcommand of Contour that lets you manage the lifecycle of Envoy.
If you're not familiar, let's go ahead and review the architecture for Contour. Here's a quick overview. Contour is an ingress controller for Kubernetes: it takes traffic from outside the cluster, and its job is to route that traffic into the cluster. The routing component that makes that happen is Envoy. Envoy is our data path component, so it handles all the actual routing of traffic on the wire. Contour's job is to be a controller, or a configuration server, for Envoy: it builds configuration based on what it sees in the cluster and passes that configuration off to Envoy.
We can manage the deployment of Contour and change its version independently of the Envoy version, but at some point you're going to have to change Envoy: roll out a new version, do an upgrade, or make some other change to it. The problem we want to solve here today is to be able to roll out that set of Envoys in a way that the users out there sending requests don't even notice you're changing it. We don't want them to see downtime, or any kind of error that's due to us rolling out an infrastructure change that shouldn't be apparent to them. The shutdown manager is going to help us do that work.
So I've hopped back here to the projectcontour.io website, and here's the documentation on how to redeploy Envoy. Here's the general overview of what we want to accomplish when we want to roll an Envoy and change it.
First, we want to stop accepting new connections to that Envoy, so we need a way to stop directing new traffic to it. Once we do that, we want to start draining connections, and we can do that by telling Envoy to fail its health check: we send a POST request to Envoy's admin interface, to the /healthcheck/fail endpoint. Once we do that, Envoy will start to drain connections, and after that we basically just have to wait.
The first thing the shutdown manager does is serve a liveness probe, which is just there to verify that the shutdown manager itself is healthy. Kubernetes' liveness probe will call this endpoint; in our example we're using /healthz. It basically asks, "hey, are you healthy?", and the shutdown manager replies, "yes, I'm healthy." If it ever doesn't, Kubernetes will go ahead and restart that container for us.
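As a rough sketch, here's what that sidecar can look like in the Envoy pod spec. This is modeled on Contour's example daemonset; the image tag and the 8090 listen port are assumptions, so check the deployment YAML that ships with your Contour version.

```yaml
# Sketch of the shutdown-manager sidecar in the Envoy pod spec.
# Image tag and port are assumptions, not canonical values.
containers:
- name: shutdown-manager
  image: docker.io/projectcontour/contour:v1.5.0   # hypothetical tag
  command: ["/bin/contour"]
  args: ["envoy", "shutdown-manager"]
  livenessProbe:
    httpGet:
      path: /healthz      # the liveness endpoint described above
      port: 8090          # assumed shutdown-manager listen port
    initialDelaySeconds: 3
    periodSeconds: 10
```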
The second thing we want to do is implement a preStop hook.
In Kubernetes, the preStop hook lets you intercept the shutdown before the container gets the SIGTERM. The classic example for this is a database: if I'm running a database and I get a signal saying "hey, shut down," then before I do that shutdown I want to make sure I clean up my connections, commit any transactions I have in memory, and get all of that cleaned up before I actually shut down. So the preStop hook lets the container decide when it's actually ready to go shut down. There are two gates around this. The first is that the container decides when it replies to the hook's HTTP GET request. The second gate is terminationGracePeriodSeconds, which is roughly the maximum time a pod will sit in Terminating before Kubernetes forcibly kills it. So regardless of how healthy we are or how many connections we've drained, if we hit that terminationGracePeriodSeconds limit, Kubernetes will just forcibly get rid of the pod.
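Here's a sketch of how those two gates show up on the Envoy container. Again this is modeled on Contour's example daemonset; the /shutdown path, port 8090, and the 300-second grace period are illustrative assumptions.

```yaml
# Sketch of the two shutdown gates on the Envoy pod.
# Path, port, and grace period are illustrative assumptions.
spec:
  terminationGracePeriodSeconds: 300   # gate 2: hard ceiling before a forced kill
  containers:
  - name: envoy
    image: docker.io/envoyproxy/envoy:v1.14.1   # hypothetical tag
    lifecycle:
      preStop:
        httpGet:               # gate 1: the kubelet blocks on this request
          path: /shutdown      # assumed to be served by the shutdown-manager sidecar
          port: 8090
```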
So the first thing that happens when we go through the shutdown procedure is we get a signal from Kubernetes that it wants to terminate the pod. The kubelet says, "hey Envoy, you're going to shut down," and it calls the preStop hook for the Envoy container.
When that happens, the shutdown manager gets that same hook, and it sends a POST request to Envoy's admin interface saying, "hey Envoy, go unhealthy and begin draining connections." As soon as this happens, the readiness probe on Envoy, which is the way Kubernetes determines whether a pod is ready or not, will start to fail, and that stops new connections coming in from outside the cluster. Once that happens, Envoy starts draining.
What the shutdown manager does then is poll for open connections on some defined interval. It looks at Envoy's Prometheus metrics endpoint, at the TLS and non-TLS listeners, and checks how many connections are open. Once that count meets a configured minimum, the shutdown manager knows Envoy has drained enough connections, so it replies to the preStop hook the kubelet sent initially and tells Kubernetes, "hey, we've drained the connections; we're good to shut down."
You can tweak these settings to figure out how long you want to wait. For you it might take ten minutes for connections to drain; it might take an hour. So we'll look at how you can turn these settings up and down to really match your scenario. All right.
So let's go ahead and do this for real. I've got a cluster running, and in my cluster we actually ship some sample dashboards for this.
This is the generic dashboard you get from Contour that looks at all the Envoy metrics, and what you see here is the Envoy open connections panel. In my cluster I've got three different Envoys on three different nodes, and right now I have no traffic going to them. So what I'm going to do is generate some traffic.
Once we have traffic running to all the nodes, we're going to go ahead and change the Envoy daemonset. We'll just change something in it to have it roll out, and then we can watch a pod go unhealthy, wait for its connections to drain, and then proceed on. I have a generic load test set up here; let me pull it up and show you.
This is a tool called bombardier, and what it's going to do is send some connections: a thousand connections for five minutes, just hammering our service with requests. So let's pop that over here and go ahead and create our job. Okay.
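For reference, a load-test Job along these lines might look like the sketch below. The Job name, image, and target URL are all assumptions; bombardier's -c and -d flags set the connection count and run duration.

```yaml
# Hypothetical Job wrapping bombardier: 1000 connections for 5 minutes.
apiVersion: batch/v1
kind: Job
metadata:
  name: bombardier-load-test               # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: bombardier
        image: example.com/bombardier:latest   # placeholder; any image with the bombardier binary
        args: ["-c", "1000", "-d", "5m", "http://my-app.example.com/"]  # hypothetical target
```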
Right, and we'll do a watch on getting our pods. Here you can see I've got my three Envoy pods, which are matched up here in the dashboard, and what we can see now is there's traffic going to each one: I have a thousand connections total, and each Envoy is getting 300-some connections. So what we'll do is go ahead and edit the daemonset.
So let's edit our Envoy daemonset; we'll change something in it to cause the rollout. This could be a new image or something like that, but I'm just going to change one of the probe settings to make it go. Let's change this periodSeconds here from 3 to 4. Again, this is a silly change, but it's enough to demonstrate the rollout. So I'll write that out.
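Concretely, the edit is a one-field change in the daemonset's pod template, along these lines. The probe path and port are assumptions taken from Contour's example manifests.

```yaml
# The trivial daemonset edit that triggers the rollout.
# Probe path/port are assumed from Contour's example manifests.
readinessProbe:
  httpGet:
    path: /ready
    port: 8002
  periodSeconds: 4    # was 3; any pod-template change rolls the daemonset
```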
So what should happen now is one of the pods gets that shutdown signal: the preStop hook gets called, the shutdown manager goes and sends the unhealthy state to Envoy, and here you can see it went unhealthy. One of its two containers is ready, so the Envoy pod is unhealthy. Now no new traffic is going to target that Envoy, and what we should see is its connections draining.
You can make these thresholds larger or smaller, and there's a delay before it starts polling while it waits for things to settle. And there you go: it must have found zero open connections, so it responded to the preStop hook, and now the pod got terminated. That one went down, and now we can see the new pod spinning up. It has to pass its readiness probe; once Envoy goes healthy, it'll get its configuration from Contour, and then things will chug along.
So we should see this pod here, the jjptq one, go away, and then we should see the new one, kkrsd, come up. And there's kkrsd, and you can see it's getting traffic now. The next one to go down is cn8zq, and it's the same story: we should see all traffic go to zero on this one, and once it goes to zero, we'll be able to go roll the next one.
Right now it's waiting for the connections to drain; we're polling through that. Again, there are different parameters down here that we can tweak. We can tweak the check interval, which by default is every five seconds. The check delay, which I talked about, is the time delay before it starts to poll; that's at a minute by default. And right now my minimum open connections is zero, so it's going to wait until all the connections drain to zero. In your environment, that may not be zero.
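Those knobs are flags on the shutdown-manager subcommand. A sketch with the defaults from the talk spelled out is below; the exact flag spellings are assumptions, so verify them against the subcommand's --help output for your Contour version.

```yaml
# Hypothetical tuning of the shutdown-manager sidecar.
# Flag spellings are assumptions based on the parameters named in the talk.
args:
- envoy
- shutdown-manager
- --check-interval=5s          # how often to poll Envoy's open-connection count
- --check-delay=60s            # delay before polling starts
- --min-open-connections=0     # drain target; may be >0 in your environment
```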
And that's this kkrsd pod. Cool, that one's going; this one just spun up thirty-three seconds ago, so that was the second one. Now here's the third one that's going to terminate; once this one goes to zero, it'll finish. And we have this log here: we did a five-minute run with a thousand connections, so we're probably getting close to that five minutes, and then we'll be able to check back and see how many requests we sent and what the error rates were.
The tricky part with doing this is that it's difficult to really get 100% of a rollout clean; there's always some chance you might still get errors from your users. But again, the job of this is to help minimize those rollout problems. All right, cool, so that one went to zero; it's getting terminated, and our new pod will spin up here in a second.
Okay, there comes the new one, t4hf2; it pops in there, and again traffic gets spread out across all three of those. Now once that gets balanced out, our job will finish here. Okay, cool, it did finish. So now you can see that for that five-minute run we sent 5.4 million requests, they were all 200-level responses, and we had zero 400s or 500s in that whole scenario. So again, that was a simple test: it's a small app, and a very stateless app.