From YouTube: 2021-07-14 GitLab.com k8s migration EMEA
C
Awesome. So, I'm not sure, Jarv may jump in, but here we go. Awesome, so, Scuba, do you want to kick us off?
A
I have those tabs open. So, Consul: we're missing dashboards, and this is a relatively critical service. Any service that requires access to our database, our Postgres database, relies on Consul for service discovery, for determining who the master is and who the secondaries for the database are. That way we send writes and reads to the correct database systems. So Consul is pretty decently important, and the fact that we don't have dashboards for it is kind of concerning, so I decided that we should finally do something about it. So I have a merge request.
A
So there's something that I'm hoping Andrew could help me address in the merge request, and there's also certain data that just doesn't make sense to have; I'll get to that in a second. But we do have data coming in from all of our agents that are running inside of Kubernetes. So we've got how much CPU we're consuming, which is kind of cool, and also the amount of memory that we're consuming.
A
It is very interesting that we appear to be using a lot of CPU, but we need to go back and account for how many nodes we're running inside of production, and we run a lot of nodes. So while we only request a fraction of a CPU, it adds up, depending on how many nodes we're actually running inside of our clusters.
A
The one thing that I do have missing in this merge request is our SLI-related detail, because currently I don't know what metrics are important for determining this information. Consul is not a service where you have a request rate that's important; it's actually whether the service is available, but at that point you're monitoring the individual service behind Consul. So it's going to be more like: is the Consul cluster itself healthy? And I don't know which metrics actually give me that information.
A
I tried; I was very unsuccessful at troubleshooting, not a stack trace, but, goodness, troubleshooting a compilation failure in jsonnet is not easy for me yet. So again, there's stuff like RPS, which doesn't make sense. We obviously don't have data for saturation; we have no data for that, which we kind of should.
A
I think the only thing that we see is this open file descriptors metric, which is specific to the nodes, not Kubernetes. But we do have node-level metrics, because we do have five servers as virtual machines, and then if we go down we get the same kind of stuff I kind of showcased already, where we get our Kubernetes-related information as well.
A
And then my merge request also includes some changes to the service definition, so stuff like the CPU utilization, we'll start gathering that after that merge request gets pushed into place. So we have dashboards coming; it's an open merge request.
C
Should we have this on our roadmap, to be migrating stuff? You mentioned that we're running some VMs. Is this something that at some point should all be in Kubernetes?
A
In the future, yes. I think Consul is a decent service to have in Kubernetes, just because of its clustering capabilities, but I have not looked into it myself yet, just due to the fact that we've got enough work to do. So I think it should be on the roadmap for the future; when we do it, I don't know. Okay.
D
Starbuck, regarding your question about detecting the healthy state of the cluster itself: maybe you should take a look at the raft metrics, so basically consul_raft-something. You should be able to see the leader and the peers and make some assumptions based on those numbers, because you know how we configure the cluster, so the majority model and things like that. It kind of gives you an idea, because, on the other aspect, an unhealthy...
D
...peer should be detected as a service, if I remember correctly, because basically every Consul node is part of the service catalog, and if it's unhealthy, by its own definition it just appears there as unhealthy. But at the cluster level, knowing if the cluster is working depends on the presence of the right number of masters... leaders, actually, yeah. Just leaders, right, yeah.
A
What I couldn't figure out is what happens if one of the nodes dies; I couldn't figure that part out, because I would signal, hey, that's an unhealthy cluster, despite it working. It just means there's a problem that we should investigate: let's alert somebody, because we don't want Consul to completely go down, that kind of thing.
A
So that's something I was actually confused about, because there is a leader metric of some sort, but when I queried the cluster, or when I queried the metrics, all of them showed up as a leader, and I'm like, that doesn't make sense to me. I thought only one was supposed to show up as a leader, so I think I was either looking at something wrong or I need to refine the queries.
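For reference, a minimal sketch of what alerting on those raft numbers could look like, assuming consul_exporter-style metric names; in that exporter, consul_raft_leader is 1 on every node that currently sees a leader, which would explain all of them showing up as leaders. The names and thresholds below are illustrative, not the queries from the merge request.

# Hedged sketch: Prometheus alerting rules for Consul cluster health, assuming
# consul_exporter-style metrics (consul_raft_peers, consul_raft_leader).
groups:
  - name: consul-cluster-health
    rules:
      - alert: ConsulRaftQuorumLost
        # With 5 Consul servers, fewer than 3 raft peers means quorum is lost.
        expr: min(consul_raft_peers) < 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Consul raft cluster has lost quorum ({{ $value }} peers)"
      - alert: ConsulServerMissing
        # Any missing server is worth investigating even if the cluster still works.
        expr: max(consul_raft_peers) < 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consul raft cluster is running with fewer servers than expected"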
A
But at this point I'll save that for a future iteration; I'm just going to concentrate on this merge request, because this is better than anything else we have, which is zero, currently. So yeah, so yeah. Okay. So let's look at nginx real quick, because that's the other one that is merged, and I know those dashboards are in a nice, fun...
A
...state. Capacity: this one takes forever to load, but we have our nginx controllers and we've got our container details. So we have the CPU usage of all the containers, and their standard deviations, and the same for memory as well, and we've also got the waiting reasons and terminations. So we can see that we're scaling on a regular basis.
A
That's kind of cool. Same deal for the actual deployments, so we know how much CPU, memory and network traffic is happening through each one of our individual clusters.
A
We do have error ratios. We don't have SLIs attached to these yet, because our errors are kind of high, which is kind of surprising, but I have not really dug into this very much yet; I'm just trying to get data out there, available, so that we can look into it. I did find some interesting measures I thought were important, so I went ahead and added dashboards for those items: just total connections, and connections by state. That was actually fun during one incident.
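As a point of reference, the connections-by-state and error-ratio panels can be driven by recording rules along these lines; this is a sketch assuming the stock ingress-nginx controller metric names, not the exact queries in the merge request, so names should be checked against what the controllers actually export.

# Hedged sketch: recording rules for connection-state and error-ratio panels,
# assuming community ingress-nginx metric names.
groups:
  - name: nginx-ingress-overview
    rules:
      - record: nginx_ingress:connections:sum_by_state
        # Current connections broken down by state (active, reading, writing, waiting).
        expr: sum by (state) (nginx_ingress_controller_nginx_process_connections)
      - record: nginx_ingress:requests:error_ratio_5m
        # Share of requests answered with a 5xx over the last 5 minutes.
        expr: >
          sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
          /
          sum(rate(nginx_ingress_controller_requests[5m]))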
A
Cool. So we can see how many active replicas are operating across our cluster, which is cool, so we've got some pretty cool stuff. Oh, the other thing I wanted to show was, inside the detail view we've got this nifty link here that goes to Stackdriver, because this is where our error logs are being sent to. We decided to just ignore our access log details, so you can see we're kind of dominated by the fact that we're buffering temporary files; this is normal behavior.
A
So that we should be able to get down to the very important stuff. Okay, don't filter that out; let's just get rid of this. Do I need... no, just not that.
A
I don't quickly see it, but another item that we see a lot is the fact that we don't have TLS configured, which is a perfectly fine error that we can ignore. But, oh, here it is: service web doesn't exist. So we're getting error logs, which is our primary goal of increasing our nginx observability. So I'm happy with where that stands as of right now.
C
Awesome. Did you just go with the error logs in the end for nginx?
A
Yeah. So for access logs there are two problems that I see that are leading towards us not wanting to try to consume those, or I guess a few things. One, they're going to be kind of redundant with our HAProxy logs. The only difference that we're going to see is which backend takes care of which request: HAProxy provides that to the nginx ingress, which is always going to be the same across all endpoints, and nginx is going to be like:
A
Oh, this request is going to go to the web deployment, or this request is going to go to the API deployment. That's about the only important information I see out of that data.
A
Another item that's problematic is that we don't currently have a method to filter out secret data that gets logged by nginx, so because of that, that kind of puts us in a higher-risk situation of exposing data to the wrong engineers.
A
That's kind of lower priority, because only certain people have access to the production data anyway in the GCP console, but I don't want to deal with that from a security perspective in general, so I'm just like, screw it. And two, there's just going to be a large volume of data. Yeah.
B
A
Same thing for nginx. So because of the other two items I don't deem the cost of it worthwhile, and plus we've been dealing with this for the last two-plus years, so why bother? So I'm going to leave it that way. If we have an incident where looking at nginx logs is more important, I guess we'll tackle it then, but at this point I updated our runbook to comment that we're not doing this because of x, y, z.
C
Awesome, yep, sounds good. Nice, great, great progress. Good to have visibility of both those things, so, awesome.
A
C
Yes, exactly, that's awesome, really good, yeah. And deployments have been super smooth today, so that's a huge milestone.
A
So, yeah, web traffic is definitely going to our Kubernetes nodes, which I'm pretty stoked about, so, awesome. I think the next step that we need to accomplish is, there's a configuration file that Graham did not yet audit, so I'm going to try to do that today, and then we can start knocking out some of the other issues, like the readiness reviews and stuff that are currently open.
C
I'm not going to say, this is not in terms of timing, like, I'm not going to apologize, but I mean in terms of, like, do you want to get, like, are we going to try and get a small bit of traffic on canary and then get the readiness reviews signed off? Or do you want to get all those things wrapped up and approved before we put any traffic on canary?
A
C
We'll need to do that as well as we go along, because there is a reasonable chance that we'll get new Chef changes coming in that won't be inc... like, if anyone changes stuff in the next few weeks, it's not guaranteed that they'll do it in both places.
C
I know Graham just mentioned that he saw one that was only in Chef, so that was the thing, I think, that made him aware that we will need to watch for this, so he started sharing that out. But we might want to just, it's probably fine in canary, but at some point before we go fully into production we should just do another quick audit.
A
The only thing that I know of that recently was made as a change in Chef that we cannot take over to Kubernetes is due to a chart configuration; it's just not available in our Helm chart.
C
Okay, super, okay. Hopefully it's just that one. Great, okay, sounds good, awesome, that sounds great. And one thing I was, like, at some point, so, as we've kind of been talking about Q3 ideas and things, Pages is, like, the next big, well, it's not big, it's hopefully small, but the next stateless service.
C
I think we should just get started on that whenever we're comfortable managing the pieces. So it doesn't have to be, like, right now, but also I don't know if we necessarily have to wait for all of web to migrate. If we feel like we've got the pieces all moving along and we could move it in pre or staging, then let's get that started when we can.
C
Yeah, cool, sounds good, sounds good. Great, great progress, great progress.
C
Well, and you as well, so, like, go team work. So I'm loving this time zone handing off; it's working out really well.
C
It's really good, awesome. So I had a couple of discussion points. The main one is really about Q3 OKRs. I was wondering, kind of based off yesterday, should we try and do something to reduce the conflicts between auto-deploys and k8s-workloads changes? Like, would that make it easier to work on the clusters?
A
Yes. I'm more concerned about auto-deploys being blocked, and Graham has an idea about this, and I'm pretty sure he's got an issue logged about this as well, yeah. So I remember the proposal was to try to make auto-deploy-specific pipelines simply not check for configuration changes when upgrading the cluster, which should be doable, because right now we query our single source of truth, which is currently Chef, and our secrets, to make those necessary changes. But there should be a way we could say:
A
Hey, only do this one specific change to our clusters. Graham had an idea about this; we just need to find that issue and maybe prioritize it. I think it's either in the tech debt epic or it might...
D
A
We do use resource groups. The problem that we run into is that the diff checker might run while a configuration change may have been pushed into place, and if that happens we block auto-deploys, because auto-deploy saw a change that may not have been rolled out to that cluster yet, because it may have been blocked because the diff job was running from auto-deploy, for example.
A
So, yeah, we need to address that in some way, shape or form, and I think our best option is going to be determining how to figure out how k8s-workloads operates, or deploys things overall, which I know Graham has ideas for.
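For context, the resource groups mentioned here are GitLab CI's resource_group keyword: it serializes jobs that share a group so a diff and an apply never run concurrently, though, as described above, it does not by itself stop a diff from observing configuration that has not rolled out yet. A minimal sketch, with hypothetical job names and scripts:

# Sketch of serializing cluster-touching jobs with a resource_group in
# .gitlab-ci.yml; job names, scripts and the "production" group are illustrative.
diff:
  stage: check
  resource_group: production      # only one job holding this group runs at a time
  script:
    - ./bin/k8s-diff production   # hypothetical diff wrapper

apply:
  stage: deploy
  resource_group: production      # shares the group, so it queues behind any running diff
  script:
    - ./bin/k8s-apply production  # hypothetical apply wrapper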
C
Cool, okay, yeah, I think that makes sense. Cool, okay, so I'll add a comment on the OKR discussion issue, but I'm kind of thinking, in terms of, like, scaling-up stuff, that to me feels like it'd be a nice one to have removed, just as auto-deploys are moving faster now with bridge jobs; you get so much less opportunity to push these things out in between auto-deploys at the moment.
C
The last, well, incident aside, but over the last couple of weeks, with Graham deploying as well, we've been hitting five deploys a day, which is awesome, great for getting things out, but it's going to make it harder for you to, like, dodge these things, basically.
C
And such, exactly, so we'll have a lot less visibility of changes coming in, so, yeah, okay. So then the other one, which I'm going to mention on the issue, that we should think about is something Jarv was mentioning:
Should
we
should
we
try
and
reduce
our
dependencies
on
omnibus,
like
specifically,
post
deployment?
Migrations
could
be
a
great
one
right
following
web
fleet
migration.
C
Maybe that's a good time to actually try and work it out, because we know that registry needs this. They don't need it yet, but they know they'll need it in the future. We don't actually have a solution for applying post-deployment migrations safely on registry, and the reason, I believe, I can't remember the full thing, but I believe the reason is, is it still the fact that we can't control the order it applies in, is that right? Yeah.
D
Yep. Basically we need an operator, and no one is working on operators. So the ask is to re-implement the deployer in k8s-workloads, because basically, this is, so the baseline here is that the more we remove from the deployer, we end up in a situation where basically the box that runs the migration can simply be an image running as a job on the cluster. So we schedule a job and say: could you please run the regular migrations?
D
Then we kick in the Helm upgrade, and then, when it's done, we run another job which runs the post-deployment migrations. Which sounds like re-implementing the deployer in k8s-workloads, and the reason being is that we need an orchestrator, which usually is what an operator is made for, but we don't have one.
D
For us, it's easier to just do the workaround instead of building a full-fledged operator.
C
Yes, okay! Well, that's a good one that we should at least explore, like, whether that's an option. I mean, I guess that could be a solution, right? Somehow we need to find a solution for post-deployment migrations, and perhaps assets as well, and an operator of some variety, I guess, is going to be needed, right?
C
Yeah, so that...
D
This is not hard, right? It's just, I mean, you're changing the Ansible job with a Kubernetes job. You just schedule the job to run, and then we need two images. I think the images are already there, because the Helm chart is using those images. So it's just a matter of providing the environment variable to say, skip post-deployment migrations, or, run post-deployment migrations. So it's simple.
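A minimal sketch of the kind of one-off Job being described, assuming the chart's toolbox image and the SKIP_POST_DEPLOYMENT_MIGRATIONS flag that the Rails migration task understands; the image tag, namespace and command are placeholders to be checked against the actual chart rather than the real deployer config.

# Hedged sketch: run post-deployment migrations as a Kubernetes Job.
apiVersion: batch/v1
kind: Job
metadata:
  name: post-deployment-migrations
  namespace: gitlab
spec:
  backoffLimit: 0                # fail loudly rather than retrying migrations
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrations
          image: registry.gitlab.com/gitlab-org/build/cng/gitlab-toolbox-ee:<version>  # placeholder tag
          env:
            # Regular migrations already ran before the Helm upgrade, so this
            # run only needs to pick up the post-deployment migrations.
            - name: SKIP_POST_DEPLOYMENT_MIGRATIONS
              value: "false"
          command: ["gitlab-rake", "db:migrate"]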
D
It's more that, even when we talk about operators, an operator is a cluster concept, so within a cluster you may have an operator, but we have a multi-cluster deployment.
D
If we move away from Omnibus packages, we will increase our velocity in terms of how long it takes to deploy something, because we will remove one hour of building the images and things like that, and on top of that we may also have something interesting to explore here, which is decoupling Gitaly and the Rails deployments.
D
So there are tons of opportunities, but it doesn't really sound like we are going toward something which is streamlining the process in any way. We're just making custom, we're just removing a custom tool and building a new custom tool, which may or may not be easier. It may be faster to run, but not necessarily easier to implement and understand and work on.
D
Yeah, I mean, the point is that if we know that there is no Gitaly change, let me rephrase, if we know that there is no Gitaly API change, which means that it can run with the old Gitaly version, we can completely decouple things, and so we run the migrations in Kubernetes, the Kubernetes deployment plus the post-deployment migrations in Kubernetes, as a deployment, and then, when the Omnibus package is ready, we roll it out independently. But right now we can't do this, because we have the migrations happening on VMs.
D
Yeah, maybe something we can do in the meantime, for experimenting on this, to validate the approach, is that we remove post-deployment migrations from the deployer and we put them in k8s-workloads, so that we still have this: we still wait for Omnibus packages, and so we just run the deployment with the same steps that we are doing today, but we validate our ability to run post-deployment migrations using a Kubernetes job, just switching the job, because at that point in time we have both VMs and Kubernetes images.
C
Yeah, certainly, okay, cool, yeah, that sounds interesting. As I say, we know at some point we're going to need to solve this for registry; we didn't have a solution, but it's not... I thought that was a...
C
Yeah, we know we all need to solve this for registry; we don't have a good way right now, and I think that the concerns raised were around the way they'll roll back if it fails, or the order of things. So we can certainly work out details, but that sounds like it could be a good thing to explore.
C
Yeah, for sure, I'll add a comment onto the issue so we can discuss that. Cool. And then, is there any other stuff anyone wants to talk about on OKRs?
C
Cool, all right. And then, just to finish off, I was going to say, on that really interesting graph on the GitLab Shell memory leak: should we ping the team and get some developers to take a look at that? Like, I'm assuming that's not really just an infrastructure problem.
A
It could be there's something wrong with how I'm querying the data, or something. The second reason why I did not raise this issue is because I know at some point in time, I don't know the timeline, GitLab Shell is supposed to be changing how that service runs. So instead of it being, like, the SSH daemon that calls out to a binary, GitLab Shell itself is supposed to be the SSH daemon that's listening for requests, which is an entirely different way of running the GitLab Shell process entirely.
C
I'd suggest, once you get to the point where you're reasonably confident that data is right, just ping them on it, and you can pretty much just say exactly that, right? Like, hey, I know you're working on this stuff, this is just visibility of how things look. And then I think they can work out the timeline, and does it make sense to fix it, or are they just migrating?
A
I'm also not heavily concerned about this, because I don't see any bad things happening inside of Kubernetes; it's not killing these pods, and I have not seen people complaining about GitLab SSH. Unless there's another incident, I'm not...
C
Yeah, no, I think that makes sense. If I understood from Henry, if I heard correctly, it was the Sidekiq tuning that was causing the most pain for engineers on call, and the other two, registry and GitLab Shell, were just ones he saw in amongst the metrics, like the dashboards.
D
I have a question, Starbuck, about the GitLab Shell memory usage. So my question is: how do we schedule those pods? Are they long-running and they just get routed requests, or do they serve a number of requests and then get killed, because of the HPA just expanding and contracting the fleet?
A
Like, in this particular case we're just using the SSH daemon, and I forget the exact configuration, but theoretically each daemon will handle upwards of 200 requests before it starts denying clients; it's either 200 or 400, yeah. I don't know how many clients they actually serve, because the SSH daemon has no metrics in any way, shape or form.
D
So I was looking at the individual graphs that you linked, instead of just the trend, because the trend, obviously, is monotonically increasing, so it sounds like there's a memory leak. But on the other hand, what I was thinking is that you have a Go daemon, because I know the shell thing was written in Go, right? So you have something that is streaming data from one direction to the other, and we're talking about clones and pushes, so data-intensive operations.
D
So the way these things work, they very likely have a memory buffer that is moving information from one direction to the other. So I was thinking, because most of those things have a very short-lived metric, if you just click on some of them they're kind of very tiny, so I was wondering if we see it monotonically increasing because we have a peak: we spawn a pod, it starts serving, then the peak memory gets used, and then, basically, when it's done it gets killed.
D
I'm talking about, not the node exporter, the host exporter, sorry, the one that you can install on your machine, so that it runs on Linux and collects a set of metrics, and I'm quite sure that you can just say, I want detailed information about the process with this PID or this name or whatever, and it will get added to the exported metrics.
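What is being described here sounds like a per-process exporter; a sketch of the matching configuration such an exporter takes, assuming the commonly used process-exporter format, with illustrative process names for the sshd / gitlab-shell case rather than anything currently deployed:

# Hedged sketch of a process-exporter style config; exporter choice and
# process names are assumptions for illustration.
process_names:
  - name: sshd
    comm:
      - sshd                   # match the SSH daemon by executable name
  - name: gitlab-shell
    cmdline:
      - gitlab-shell           # regex matched against the full command line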
C
Cool. Is there anything else that we want to go through?
A
I have not... not something I really want to go through; it's more of just a question. I had a one-on-one with Marin earlier today and I was discussing what kind of stuff we want to see in the future, and we see that Gitaly is currently in discussion, but I already know that Gitaly is not going to be ready to be migrated to Kubernetes, just due to the current state of how it works right now.
A
I don't know enough about Praefect to provide an opinion about this, but I feel like, if it is a stateless service, we could spin it up, and I don't know what the expectations of Praefect deployments are supposed to be. But theoretically, if it's stateless, there should be no problem with moving it over to Kubernetes, and it is now the backing store for Gitaly, which sends this data to a virtual machine.
A
But later we could then start spinning up Gitaly pods, and then, so long as Praefect is working as it's supposed to, it doesn't matter how we've deployed Gitaly, whether it's just one or two pods that are sitting behind it. But as long as the replication is working, we should be able to have a seamless deploy with that.
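A purely illustrative sketch of what running Praefect as a stateless Deployment with a couple of replicas could look like; the image, port and config names are placeholders, and in practice the GitLab Helm chart's praefect sub-chart would be the starting point rather than a hand-written manifest.

# Hedged sketch: stateless Praefect Deployment in front of Gitaly VMs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: praefect
  namespace: gitlab
spec:
  replicas: 2                       # multiple pods purely for redundancy
  selector:
    matchLabels:
      app: praefect
  template:
    metadata:
      labels:
        app: praefect
    spec:
      containers:
        - name: praefect
          image: registry.example.com/praefect:<version>   # placeholder image
          command: ["praefect", "-config", "/etc/praefect/config.toml"]
          ports:
            - containerPort: 2305   # Praefect's default listen port
          volumeMounts:
            - name: praefect-config
              mountPath: /etc/praefect
      volumes:
        - name: praefect-config
          configMap:
            name: praefect-config   # config pointing at the Gitaly storages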
C
I mean, it's something which might be worth us just trying to test somewhere. Like, you know, where could we spin up Praefect in Kubernetes and actually be confident that we can, like, you know, we want to do some deploys, we want to see scaling, we want to check, like, logging and those sorts of things. If we can work that out, then I think it would be a great one. Like, I'm almost at the stage where, actually, I wonder, we've got lots of kind of options people are talking about, like:
C
Should we do something with Redis? Should we do something with, like, Consul, HAProxy? They're almost all pieces which we should probably just test somewhere and see what works, what doesn't work, and what would be involved.
D
In the past we have used staging for this, because of the way that projects are routed to a specific storage provider: we can just deploy something and just route one or two projects to it that we start testing manually, and then we can extend this to QA or things like that.
C
I'm going to believe that's fine, we can do that, right? Like, if we started this, I mean, it is a bit of an end run around deployments, but it's something which, if we handle it within Delivery, we could manage that conflict and the set-up and tear-down. Do you want to, like, shall we try and do a test on Praefect and see if it works?
C
I'm not aware that it is, because we created the Gitaly one just the other month, when we started talking about Gitaly, okay, so I'm not aware of a Praefect one. No, but it would be a great one to get a discussion going on, because I know Jason, and perhaps other people in Distribution, have thoughts on Praefect, and I know Andrew's mentioned it before, that it might be a great time to migrate it before it actually gets tons of users.
A
I haven't gotten around to spinning up an issue yet, because I need to learn more, but yeah, if we can make that decision, I'm happy to test this out.
C
Super, yeah, I love that. Like, I mean, we're definitely at the stage where I think there's lots of pieces we want to get into Kubernetes, but none of them are going to be quite as straightforward and obvious as, like, the web fleet, where, you know, it's sort of just there already. So, yeah, I think we probably need to plan some tests and see what we can pick over.
A
It sounds like a natural place to put Praefect, because you need to run multiple versions of them, not versions, but multiple pods, at the same time for redundancy, and because it's stateless we don't have to worry about disk storage; it's just a matter of deploying the actual image. It feels like a no-brainer, but we need to learn more.
D
Learn, yeah. What's more interesting is the upgrade procedure, how the rollout will affect it, and, yeah, but yeah, I mean, it would be more or less the same things that just happened on VMs, so...
C
Yeah, for sure, yeah, cool, well, yeah, yeah. Absolutely, please go ahead and, like, start digging on that; let's see if we can get that into something where we could test it and see what it looks like. Fantastic, cool, great. Is there anything else anyone wants to bring up today?
C
Awesome, so, yeah, good that we've covered that. I often say to people that occasionally, if you watch these demo videos all the way to the end, you get the Easter egg. You don't always; I feel like some people feel like it's really cheating: after an hour of watching us talk about, like, dashboards and things, nothing happens, we just go away. But occasionally you get, like, gold.
C
Anyways, all right, well, super. Chat to you all. Yeah, it's not working super well, but it's like you...