From YouTube: 2021-01-28 GitLab.com k8s migration APAC
C: Yeah, pretty good. We're more or less living a pretty normal life at the moment, I guess, all things considered. I know that you're probably more on the opposite end of the spectrum, yep.

C: Oh man, yeah. No, we're basically... trying to think if there's any restrictions at the moment. I think there's some general restrictions, but yeah, even public events like football games and stuff are back on, and so... wow.

B: Yeah, we keep on dithering between various states of lockdown, I would say, yeah.

B: Lockdown, currently. The new strain from the UK is slowly getting into the country, so they're talking about moving us back to a full lockdown. But for now, I mean, stores are closed, but you can order online and order.
A: Cool, so we've got a few suggestions, but if there's anything else you want to cover on this as well, Graham, then happy to. Do you want to start, java, with whatever's the most interesting topic?

B: Yeah, sure, Graeme. So I'll take this opportunity, with you here, to talk a little bit about some of the issues we're facing with the websocket service. We're seeing errors on pod cycles, and I've done some testing on pre-prod and I can reproduce this, and apparently it looks like that...

B: We start to see 500 errors when old pods are terminated while new pods are ready to take traffic. Yes, maybe I can demonstrate this in real time and show you, or show everyone.
B: This was my first instinct: okay, this is just websockets. It's particularly bad because it's websockets. And I talked to the developers and they're like, well, it doesn't really matter, we handle these failures gracefully, the client can just retry and it will gracefully degrade as well. So that's why we've just silenced all of the alerts around these errors on the service. But what's interesting, though, is that I can take websockets out of the equation and I'm still...

B: Which is... which is not so nice. First question, maybe for you, which I'm still not 100% clear on: the PDB, the pod disruption budget. If we set the max... we default this to a maxUnavailable of one pod.
C: Yeah, the surge settings and everything, that kind of thing, the maxSurge and all that kind of stuff... I'm trying to think. Those settings have moved, they've moved around a bit; I think the specification for a Deployment has changed. But I'm happy to admit I'm by no means a deep expert on PDBs, and once again that's something that's changed a lot as the spec has grown. But my understanding is no, that shouldn't... the PDB should be taking effect for just the standard...
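For reference, the kind of PodDisruptionBudget being discussed looks roughly like the sketch below; the name, labels and API version are placeholders, not the actual chart output. Note that a PDB only constrains voluntary evictions (node drains and the like); a Deployment rollout is governed by its own rollingUpdate maxSurge/maxUnavailable settings.

```yaml
# Minimal sketch, assuming a policy/v1beta1 cluster (policy/v1 only exists from 1.21).
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: gitlab-websockets        # placeholder name
spec:
  maxUnavailable: 1              # at most one pod may be voluntarily evicted at a time
  selector:
    matchLabels:
      app: websockets            # placeholder label
```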
B: Okay, so yeah, because I kind of freaked out a little bit yesterday, because this service is unique in that we only have one pod running per cluster (oh really? okay), yeah, and the reason for that is that... well, we have three clusters, so we have AZ coverage, so there's...

B: And there didn't seem to be a reason... with our default spec there's no way we would scale up to more than one pod. I mean, we basically have a min replica of one and the max sitting at one. So I thought, okay, maybe this is part of the problem, so I bumped that to two yesterday, but it almost seems like we're still seeing errors. But going back to having one pod, a min replica of one, or a min and max replica of one...
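The min/max replica shape being described (effectively pinning the service to a single pod per cluster) would look roughly like this; the names and the CPU target are illustrative only, not the real autoscaler values.

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-websockets          # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-websockets
  minReplicas: 1                   # bumped to 2 during the test described above
  maxReplicas: 1                   # effectively pins the service to one pod
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # illustrative target
```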
C: Sorry, refresh my memory: the websockets is still the full Cloudflare, HAProxy, everything else? Yeah.

B: So it's a little bit different, but not that different. It goes proxy to the service endpoint directly, no NGINX ingress. Okay. And we made...

B: Yeah, Cloudflare, yes. So I'm gonna share my screen and we'll do a little live test right now. So, upper right, this is the pre-prod cluster. You can see that we have two git web service pods, one web service pod, and then here's our websockets pod. I've gone ahead and adjusted the settings by hand (I'll set them back when I'm done), just because I want to make sure that the PDB maxUnavailable... I made maxUnavailable one. I guess this doesn't matter.
B: Okay, well, I think we're good. So we have one websockets pod, and what I'm gonna do is just use this little load testing tool, which sends traffic to the service endpoint, which is directly to Workhorse.

C: Okay, so this is going to the... this is from one of the proxy nodes to the Google ILB that sits in front.

B: Exactly, exactly. So I'm sending it to this external IP on port 8181, which is, of course... so this is going... correct, this is going through. So if I just stop this, I can see that I'm getting all 200s. The rate is not very... it's 10 requests a second, the duration's 20 minutes or so. So I'm going to do that and then roll them down.
C: Just do, like, an annotation or something, maybe. Or maybe not.

B: Oops, I'll get out of here. So now we have the new pod coming up; the old pod is running, so so far everything is fine. I'm still only seeing 200s.

C: I think, you know, we can... the takeaway, if I'm hearing it correctly, is... you know, the video kind of cut out just before that pod came ready, but I think... I assumed that we were seeing 500s during...
B: Okay, so we still have this pod terminating, but you can see here, I started to see 500s as soon as the pod started terminating, even though the new pod was running. So what's going on here? I guess at first I thought, okay, the new pod, although it's running, is not really ready to take traffic.

B: That would be my first guess, but I tried doing this test with a min pod of two, which... and because of our max surge... yeah, exactly. So I still saw the problem. So I think what's happening is that somehow requests are going to the terminating pod, maybe, but I haven't validated that. And what I don't see are 502s... I'm sorry, I don't see 503s in the Workhorse log; like, I don't see these requests going to Workhorse.
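One generic mitigation for the "requests still reach a terminating pod" theory is to delay container shutdown so that the endpoints list and the ILB backends have time to catch up before the listener goes away. This is a standard Kubernetes pattern, sketched here under assumed names; it is not the actual chart configuration, and the image and timings are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-websockets                  # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: websockets
  template:
    metadata:
      labels:
        app: websockets
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: workhorse
          image: registry.example.com/workhorse:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                # keep serving while endpoint/backend updates propagate,
                # then let in-flight requests drain before SIGTERM lands
                command: ["sh", "-c", "sleep 15"]
```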
C: Like, if it's slow in syncing that, then, you know, the ILB's list of, you know, peers, I guess you could say, or endpoints, may be slower. The second thing is: is this a new service? The Service object it's pointing to... can you actually show me the Kubernetes definition for that? I want to see what...

C: ...additional annotations are on it, or something. Because, with GCP 1.17, they're starting to force all services to do pod-native... pod-native load balancing, and they're switching from using instance groups and kube-proxy to trying to push everyone to network endpoint groups. That is only for new services that are created, so that's only for new services that are created. So if this was created before, say, the 1.17 upgrade, it shouldn't be an issue. But what it does mean is they are changing their release...
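For context, the Service shape under discussion, an internal TCP load balancer pointing straight at Workhorse with no NGINX ingress in between, looks roughly like the sketch below. The annotation key shown is the newer GKE one (older clusters use cloud.google.com/load-balancer-type), and the names, labels and ports are illustrative only; the commented NEG annotation is the container-native load balancing opt-in mentioned above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: gitlab-websockets                              # placeholder name
  annotations:
    networking.gke.io/load-balancer-type: "Internal"   # provisions a Google ILB
    # opting a Service into network endpoint groups looks roughly like:
    # cloud.google.com/neg: '{"ingress": true}'
spec:
  type: LoadBalancer
  selector:
    app: websockets                                     # placeholder label
  ports:
    - name: http-workhorse
      port: 8181          # the ILB port being load-tested
      targetPort: 8181
      # without NEGs the ILB backends are the cluster nodes on an auto-assigned
      # NodePort (e.g. 30898 in the demo) and kube-proxy forwards on to the pods
```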
B: I can't say, I don't know, but, you know... I haven't really... one thing is that what's new here is that we're not going through the NGINX ingress, yep.

B: So we're going through the TCP LB, so maybe that's coming into play. Sure. So that's new, yeah. Other than that, I can't think of anything else. Yeah, that's...
C: I agree, and I would say... definitely, assuming it is an issue with the ILB and not something else, then I would definitely say this is, like, a failure, like a bug or something, because this obviously shouldn't be happening. I'm interested... look, in the interest of keeping this meeting short and not, you know, spiraling out into a debugging session... it's really good for me to see this. I'm actually interested now in going to look through Stackdriver and poking at some of the GCP... getting some of the GCP...

C: ...from the ILB that actually implements this, and just having a look and seeing if we can confirm. Because what I would like to see is, in theory, the ILB that this maps to, which we should be able to see in the GCP console... what we would expect to see is that the backends for it are all the nodes in the cluster, on whatever the NodePort is for this service. So what's the NodePort here? It should be somewhere here... a1 node, port 30898, I think, is the...
C: But yeah, I would double-check that what we're seeing on the ILB side matches, and that that configuration is not, for whatever reason, changing. I wonder if I can actually watch Stackdriver as well, to see if it thinks that the endpoints flap up and down at all, because once again they shouldn't, and if we do see that, then that would be something else that's suspicious.

C: Okay, yeah... no, it's a black screen now again.

B: Well, anyway, yeah. So yeah, I'm gonna spend a little bit more time today troubleshooting this.
B: I think there's a key difference here, right? Because in our other configurations we have an internal ILB, but it's in front of NGINX, and NGINX...

B: Yeah, but it's funny, because we were definitely seeing a lot of errors on NGINX pod churning, and what we did is we just upped the resource allocation so that we never scale NGINX, so that's very stable. But that doesn't help us here, because now we're bypassing NGINX, yeah. And so... this is kind of good.
B: It's good that we're seeing it for websockets because, like I said, errors here don't matter too much, at least that's what I've been told. But it's still something we should get to the bottom of, especially if we're gonna... my intention is to move git HTTPS to this configuration. Oh.

C: This is, yeah, crazy, because, as you said, this is completely boring... this should be an absolutely bulletproof, rock-solid configuration or setup. Like, if we're seeing these problems, whether with NGINX in front of it or without, you know, we need to make sure the pod communication is working as we expect.
B: But this is why I'm suspicious, because it's so boring. Like, okay, come on, why hasn't anyone else reported this? And this is why I think that there could be a delay between the time that a pod is ready and the time that a pod is able to accept traffic, and that would explain it, right? Like, the new pod is ready, the old pod is terminated, but if there's nowhere for traffic to go, because the new pod isn't actually ready, then that would explain the 500s.
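If the theory is "Ready but not actually able to serve yet", the readiness probe is the knob that decides when a pod joins the Service endpoints (and therefore the ILB backends). A generic sketch only; the path, port and timings are invented, not the real chart values.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-websockets            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: websockets
  template:
    metadata:
      labels:
        app: websockets
    spec:
      containers:
        - name: workhorse
          image: registry.example.com/workhorse:latest   # placeholder image
          readinessProbe:
            httpGet:
              path: /-/readiness     # placeholder health endpoint
              port: 8181
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 2
          # the pod is only added to the endpoints list once this probe passes;
          # if it passes before the listener can really accept connections,
          # requests can 500 in exactly the window described above
```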
B: So what I would love to do is be able to see the health of the backends. I think you were saying this too: see the health of the backends in real time, to see when, at the L4 layer, the layer 4 load balancer... to see when the load balancer is dropping them, right? Because the load balancer has a health check, right, or not? Yeah, it does.

C: So when traffic hits a node, kube-proxy will manipulate the iptables rules to say you can or cannot go to this pod, and kube-proxy syncs from the Kubernetes API, and in theory, like, in theory, there could be lag there. But once again, that would be a huge Kubernetes bug that a lot of people would pick up on, you would think, right? You can do... there are setups where you can do this, like...
C: Basically, you get rid of kube-proxy, which is what I've done for KAS and stuff. I set it up... because it's a new service and I had some time, I set it up using container-native load balancing and stuff, and so when you look on the load balancer, the actual pods themselves, like...
C: ...kube-proxy. That being said, I don't think we should be doing something as drastic as that for this problem, because this should just work, so we need to... we need to figure out... You could even do something interesting, like go into the ILB and manually drop every other node out, so there's only one actual physical node listening on a NodePort that traffic is going to go through, and, I don't know, see if that changes anything, see if, like... and if you have...

C: Yeah, so the whole thing is the ILB does not know how to... the whole point of kube-proxy is that the ILB only has to know about nodes, and then it can go to any node, and then it's like, oh, there's one pod and it's over on node three, I will mangle the packets and forward them on over to node three, so at least you'll get that one. All I'm trying to say is, I guess, with that, at least you'll bottleneck the incoming connection from the ILB to one node, and maybe that...
B: Okay, well, I think what I'll do is debug... or, yeah, I think I'll do some more debugging today and...

C: ...that it's coming from, because is that, like, a Service IP? Is it a pod IP? Is it, like, something else? Like, yeah... I might have... I'm going to have a look at... Let me know how you go, because I'm definitely interested in having a poke around this if we have no luck, but we should also definitely squeeze Google support for this, because this sounds like a fairly standard question they should be able to answer us on, yeah.
B: Okay, that's pretty much all I have, Amy.

D: Do you want to talk a little bit about the Kubernetes upgrade, Graeme? Yeah, I was just wondering, Graham, like, you're about...
C: Yeah, so, once again, I know, try and keep it short, keep this meeting on time. So the short answer is, it looks like we've identified at least two incidents that could be maybe not alleviated, but helped a lot, by the GKE 1.18 upgrade I'm talking about now.

C: It's actually also made me think about this problem as well, because at the moment one of the things we've highlighted, and even Google have now acknowledged, is that the TCP settings they have on all their nodes are incorrect, and so we saw that cause problems with mailroom. And I'm actually wondering (maybe this is crazy), I am actually wondering if that could also be causing some kind of weird connection issues we are seeing. But it probably wouldn't explain the 503s, so maybe not. So, basically, I've done a bunch of the prep work.
C: The only thing... so there are two main... well, there are three major things that come as part of this upgrade, and actually the next four Kubernetes upgrades are going to be more painful than the last ones. They've fully removed a bunch of API versions for deployments, pods and stuff, so unless we've got manifests that are really old and we've never updated them, we should be fine. I've identified one spot, in PlantUML, so I'll probably put a merge request up to just, you know, yeah, fix that.
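The kind of change the removed API versions force is just bumping old beta groups to their GA equivalents, for example (a hypothetical manifest, not the actual PlantUML one):

```yaml
# Before (old beta groups, removed in recent Kubernetes releases):
#   apiVersion: extensions/v1beta1
#   kind: Deployment
# After:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plantuml                 # placeholder name
spec:
  selector:                      # spec.selector is required under apps/v1
    matchLabels:
      app: plantuml
  template:
    metadata:
      labels:
        app: plantuml
    spec:
      containers:
        - name: plantuml
          image: plantuml/plantuml-server:latest   # placeholder image
```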
C: The second part is they've changed the Ingress spec. So in 1.18 they've finally solidified and got the Ingress spec out of beta, so that's going to change a bunch of stuff, and that's going to be really invasive for the GitLab chart, I believe. But we don't have to do that before the upgrade; that's going to happen after. So, you know, that's another thing we need to do. And then the final issue is they've done a nice rename and changing and removing of metrics, so I've got to...

C: I asked Anthony to get someone in his team to review, but I think, obviously, now with him...
C: ...he's probably... he hasn't picked up the ticket and he hasn't responded, but I'll get someone from Observability to basically go through all of the documentation and confirm that this isn't going to cause metric issues. And then, basically, I'm ready to green-light: I've got the MRs ready to do, like, ops and stuff, and I'm keen to do this as quickly as possible, especially if we think it's causing issues.

C: In the meantime, I have deployed into pre and staging what is essentially a workaround fix for the TCP issues, so I can actually roll that into production any day now. Basically, it's just deploying a DaemonSet that runs sysctl to change the settings. So it's like getting...
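The workaround described follows a common pattern: a privileged DaemonSet that runs sysctl on every node. A rough sketch is below; the image, the key and the value are placeholders, not the settings in the actual change request.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-tcp-sysctl            # placeholder name
spec:
  selector:
    matchLabels:
      app: node-tcp-sysctl
  template:
    metadata:
      labels:
        app: node-tcp-sysctl
    spec:
      hostNetwork: true            # so net.* sysctls apply to the node, not the pod
      containers:
        - name: sysctl
          image: alpine:3.13       # placeholder image
          securityContext:
            privileged: true       # required to write node-level sysctls
          command:
            - sh
            - -c
            - |
              # example key only; the real values come from the change request
              sysctl -w net.ipv4.tcp_keepalive_time=300
              # keep the pod alive so the DaemonSet stays healthy
              while true; do sleep 3600; done
```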
C: ...but now... so that's in pre, it's in staging. So, obviously... actually, it's not going to fix this issue, because if it's in pre and staging and we're still seeing it, then it doesn't cause it. But I'm ready to put together a change request and roll that out. So that kind of gets the one benefit, or one...

C: ...you know, corrective action from the 1.18 upgrade out there, and then, obviously, when we upgrade to 1.18 I'll just remove that fix and keep going. But yeah, besides Observability and those little bits I've talked about, which I'm pretty much okay with, I think, you know, we can basically do it whenever.
C: ...release, but I don't know, are they backporting it to 1.17? I'm not sure; I can probably ask them. But yeah, I think we don't want... we want to do the 1.18 upgrade sooner, because what's going to happen is eventually they'll force us to do it, and that's what happened with the 1.17 upgrade, and it was a little bit more, you know, nerve-wracking being forced to do it rather than doing it on our own terms.

C: Yeah, look, honestly, the biggest blocker at the moment is just getting Observability to confirm that, you know, we're not going to lose anything, because we did last time: when I did the 1.17 upgrade, suddenly a bunch of dashboards stopped working and I was scrambling to fix it. So I'd like to get ahead of that this time. Yeah, so tomorrow I'm gonna sit down and probably get most of the merge requests ready.
C: Honestly, and, you know, as soon as I get the kind of rubber stamp from Observability, I can probably do it next week. I think I can fit it in next week. Or... I'm on call next week, so... that actually doesn't work out too badly, because if things break, I like being the person on call when I do it, because, you know, at least I get the alerts and fix it. So I would like to at least get the smaller environments, like even pre or ops, or anything.
A: I'll ask... I'll have Brent look at the metrics side, so if we can get that prioritized and done, because, yeah, it'd be great to get this upgrade, you think?
C: Question... so I probably need to look at this in depth: do we actually have a policy on what Kubernetes versions we support? Because, technically speaking, all the versions we run are... like, with the exception of Red Hat and OpenShift, because, you know, they're the "we'll support it long after upstream supported it" model... I don't think we should be having to support that many old versions, like maybe 1.16 and 1.15, or 1.17 and 1.16. And I think, I think we're okay, but... you're right, I should...

C: We should double-check, or I have a way to figure that out a bit better.

B: Okay, yeah. I think we do support explicit versions, but you have to type the distribution to see.
C: I was gonna say... I don't think... just a small question: with the gitlab-com repo, we still can't take changes off master yet? Is that still being held up on this, like, the websockets and moving NGINX and all that stuff? Yeah.
B: This week we'll understand better this issue that we're seeing, and whether we want to move forward with the git HTTPS NGINX bypass. My hope is we do that, and then we can just upgrade the NGINX ingress controller, which is going to be a no-op. If we can't do that, then we need to just do a cluster-by-cluster upgrade, which is really not that bad; I mean, we've done it twice already, it's just kind of high-touch, you know, that's all.
C: Yeah, right, okay, yeah. Just curious, because, yeah, I just want to start getting some more changes in, but that's fine. The only other thing, actually... I realized we've only got five minutes, so I'll mention this briefly. Let me see if I can share my screen. So I've been spending a bit of my spare time playing around with...
C: Is this gonna work? Yeah, cool. Playing around... so we've got an issue open... if I can... basically, I kind of talked about this a few months ago: decoupling helmfile execution from syncing from Chef. So I actually had a little bit of spare time and I had a crack at implementing it, basically using jsonnet, so that the values from external sources, instead of being a Go template, are JSON, essentially, and just passing the values in and using jsonnet to pull them out.
C: So, basically, what I've got in this commit (and I can pop it in the doc)... there we go. Here's the jsonnet values file, and, as you can see here, I just basically pull in a bunch of stuff using jsonnet external variables, which is the Chef roles. So all of the Chef roles are JSON, I pull the load balancer IPs from the Google API, which is also JSON, and then it makes it so easy to just manipulate and pull out all the values. I don't need to shell out to jq.
C: I don't need to do any of the other stuff, and so you can see here all these settings. I've just got, like, the Chef rails conf, which is just, you know, the default attributes on the gitlab.rb, and so then you can just see: all of this is basically just all those settings mapped to the values we need. So it becomes a little bit nicer, a little bit easier to read, doing conditionals based off things like the Redis configuration.
C: You know, it also becomes a lot easier... there are actually other YAML files we have, like the init values.yaml and stuff, which contain a whole bunch of, you know, base... we have a lot of very awkward logic that we do in Go templating, and I'm playing around with the idea of using jsonnet, because it's a bit higher-level and has got some nice features for us to make that simpler.
C: But then the end result is it just generates JSON files that helmfile then consumes, but because they actually live in git, you basically have this process, similar to what we have in the runbooks: you do, like, a "make generate", it, you know, pulls all the values from Chef, writes out the files for every single environment, you commit it all in one commit, and then all of the pipelines run and never have to talk to Chef again. You know, the pros of that...
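On the helmfile side, the decoupled flow could look something like the sketch below: each release just reads a pre-generated values file that was written by the generate step and committed to git, instead of templated "values from external sources". The release name, chart and path are invented for illustration.

```yaml
releases:
  - name: gitlab                     # placeholder release
    namespace: gitlab
    chart: gitlab/gitlab
    values:
      # written out by `make generate` from the jsonnet + Chef/GCP inputs and
      # committed, so the deploy pipelines never talk to Chef at run time
      - generated/gprd/values.json   # one generated file per environment
```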
C: ...are: it's a lot faster; we're no longer depending on a service that might go down, causing helmfile to fail; it becomes easier for people to see and read, you know, where a setting is coming from. I actually found a whole bunch of settings, like one or two settings, in our production or staging environments that weren't set, because helmfile was pulling them incorrectly and then just failing silently and setting them to empty. So, like, Sentry...
C: ...and things like that, that weren't there. So yeah, as I said, it's something... I haven't opened an MR or anything kind of concrete yet, I'm still just kind of poking around with it, but yeah, I just wanted to put out there something I've been playing with.

B: So there's, I guess, there's like three external sources: we have the Chef repo, we have secrets, and then we have GCP. For the Chef repo, I think, yeah... you know, I mean, it's hopefully temporary, right? Like, I think once we move...
C: ...and then pass that into jsonnet as well. And that's why, like, for example, you can see here I've written, like, a function that's, like, "find gcloud address", so it takes the whole JSON object with every single address and all the settings (oh, sorry, this is probably really small), yeah, and then I can just, like, you know, find the... you can just call this function, like, get me...
C: ...the address for git HTTPS, and because I've, you know, given it the entire JSON object from Google, with every single address we have, it becomes really easy to just, you know, write these helper functions to find addresses or get extra information out of it. And likewise, you know, I wrote a function for mapping the Gitaly... so the Gitaly transformation is really ugly, which we use in jq at the moment; it's horrible. You know, it makes it a little bit easier: I can convert Gitaly entries and just map that to arrays.
C: So, ultimately, if we think this idea has legs, you know, this would just basically replace the values from external sources. But there's nothing saying that we couldn't make this more sophisticated and get it to the point where it's more or less one jsonnet document, with all the conditionals, where some variables are passed in per environment, like we do now, and it just generates the one... so basically helmfile just consumes one JSON file out of jsonnet, instead of, like, values.yaml, environment.yaml, values from external sources.

C: We just use jsonnet to do all of that complicated logic around values, and then helmfile just simply executes with: okay, I've just got one values file to consume for this environment, and I'll just consume that. And, you know, it just keeps the job of doing the helm upgrade stuff, whereas we pull the environment logic out of helmfile, maybe, yeah.
C: If we... this kind of jsonnet approach also means, you know, because I'm externalizing the execution of Chef (although, you know, we do that in helmfile anyway), we could just change this to point to, you know, whatever puts the JSON source in; it doesn't really matter. And, in fact, I think when we talk in the discussions around replacing Chef for the nodes that we are going to keep, there's a bigger discussion there about how we do things like service discovery for things.
C: Like, I would personally almost argue: should we be putting Consul... sorry, Gitaly servers in Consul? Like, should we be relying on text files and roles for, like, our service discovery for Gitaly nodes and things like that? Do you know what I mean? So, I don't know.
C: Over time this will just become easier anyway, because a lot of it is just, oh, we need to sync a list of servers we have from Chef somewhere, and it's like, well, should we actually be having that in our Chef or Ansible or whatever system at all? Should that actually be in Consul, where, you know, it's a live set? It's...
B: Yeah, yeah, I guess, but the thing is, a Gitaly server doesn't need to know... so Gitaly servers are going to be managed by Chef for the foreseeable future. At least, we don't have a plan to move them into Kubernetes, unless we switch to something... no.

C: No, no, there is this... but there's a discussion to replace the configuration management. So, basically, we've got a deadline on Chef: we either have to pay for an enterprise license or move to something else, and that's supposed to be tackled in March or whatever. So there is a... this comes out of the compliance audit or something. So all we've got to do is either pay the money for Chef Enterprise or... and...
B: But yeah, I was just making the point that the list of file servers, the list of Gitaly nodes... once we move the front end to Kubernetes, all that configuration will be... Chef will no longer need that configuration at all, because... Kubernetes, right? So, okay, but...
B: So yeah, this looks pretty cool. Like, I'm... yeah, I think we need to kind of figure out the timing, whether it might...

C: Absolutely, I agree, I think. And then once we hit that point, we kind of... because at the moment we're like, oh, Chef is the single source of truth, let's pull from Chef. But once we do that flip, we'd almost say that, well, now the Kubernetes manifest repo is perhaps the single source of truth, and I don't...
A: ...dropped it. If you've got your MR, I'll put the issue in, but yeah, it'll be great to see.