From YouTube: 2021-06-28 Delivery team weekly EMEA/AMER
B
Cool, I'll get started. MTTP is actually looking not too unhealthy, given the week we had: 36 hours of blockers last week. Almost all of that was down to the GCP incident, and of course we had Family and Friends day, which never plays nicely with MTTP. So, not too bad.
B
All things considered, we're still just over the current target. So, announcements: I'm going to go through all of these, but I do just want to highlight a few of them. Mainly, welcome Reuben to the team, great to have you. If you haven't all scheduled coffee chats yet, please go ahead and do that so you can all meet. And Henry's out this week.
B
A couple of things I want to highlight on items C and D. Reliability have made a few changes to their processes. On C, just an FYI: they are now grouping all of their tasks under Reliability, so it won't be the individual workstreams of Observability, Datastores and Core Infra. So if you're adding work in there, just be aware of that. And then D is one that will be more visible to us, and that we'll have to manage a little more carefully, I think: incidents will also close.
B
So previously, what tended to happen was they'd be mitigated, then resolved, and then there'd be a period of time where they'd sit with the resolved label: the issue itself would stay open and we'd do the corrective actions and things on there. All of that will still go on, but after you add the resolved label, the issue will now close.
B
So, to answer your question about the difference between resolved and mitigated: say, for example, we have a revert MR that we're waiting to pick into a deployment. Once you get it, you can mitigate the incident to get your deployment going, since the issue is more or less solved. Or, if we've taken any kind of short-term action to recover from the problem and things are generally looking fine, that would be mitigated. It's resolved at the point where it's absolutely done, this thing is finished.
B
So yeah, just be aware of that. The bit that we'll really need to keep a check on (and you can do it straight after as well) is to please make sure we're adding summaries and timelines to our incidents. Generally, I find it's a little easier to do that before I add the resolved label, so we get that bit in.
B
This is also a new process, so if you have feedback, please go ahead and add that in as well. Cool. And then Starbucks got some time off as well.
C
Okay, so today I was looking at the Prometheus Pushgateway for tracking deployments, and basically there is an underlying problem, which is that deployments happen over a very long period of time and the Pushgateway just keeps erasing information. When we push to the Pushgateway, we wipe out the old metrics and replace them with new ones, and every run can only add one deployment, because it runs at the end of a deployment.
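As a rough illustration of the problem just described, a pipeline job pushing a deployment metric to the Pushgateway with the Go client might look something like the sketch below; the URL, job name and metric name are placeholders, and Push() behaves like an HTTP PUT, replacing everything previously pushed for that group.

    // Hypothetical sketch, not our actual release-tools code: a CI job
    // reporting one deployment to the Pushgateway at the end of a pipeline.
    package main

    import (
        "log"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/push"
    )

    func main() {
        duration := prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "deployment_duration_seconds", // invented metric name
            Help: "Duration of the last deployment in seconds.",
        })
        duration.Set(7200) // e.g. a two-hour deployment

        // Push() replaces every metric previously pushed for this job and
        // grouping, which is why each run only ever shows one deployment.
        err := push.New("http://pushgateway.example.com:9091", "deployments").
            Grouping("environment", "gprd").
            Collector(duration).
            Push()
        if err != nil {
            log.Fatalf("could not push to Pushgateway: %v", err)
        }
    }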
C
So many of the metrics derived from this are not able to understand that there was a metric reset. There are some screenshots in the links and things like that. Basically, what happens is that some of them do account for the resets, so Prometheus is smart enough to understand that this is a new value, but many don't, so it basically keeps thinking that we still have the same data as before.
C
So my proposal there is to run some kind of helper daemon. I've written a simple one in Go; it's about 20 lines of code. Basically, you just inform it "I did a deployment, this was the time it took", and it publishes that information for Prometheus to scrape. It runs in memory and it never resets unless we restart it. So that's the thing. Robert, you have a comment?
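The daemon itself wasn't shown in the meeting; a minimal sketch of what such a helper could look like, assuming an HTTP endpoint that records each deployment's duration into an in-memory histogram that Prometheus then scrapes (the /record path, port and metric name are invented here).

    // Minimal sketch of the helper daemon idea described above.
    package main

    import (
        "log"
        "net/http"
        "strconv"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var deployDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "deployment_duration_seconds", // invented metric name
        Help:    "Duration of deployments in seconds.",
        Buckets: prometheus.LinearBuckets(1800, 1800, 10), // 30-minute steps
    })

    func main() {
        prometheus.MustRegister(deployDuration)

        // A deployment pipeline calls e.g. /record?seconds=7200 when it finishes.
        http.HandleFunc("/record", func(w http.ResponseWriter, r *http.Request) {
            seconds, err := strconv.ParseFloat(r.URL.Query().Get("seconds"), 64)
            if err != nil {
                http.Error(w, "invalid seconds parameter", http.StatusBadRequest)
                return
            }
            deployDuration.Observe(seconds)
            w.WriteHeader(http.StatusNoContent)
        })

        // Prometheus scrapes /metrics; the values only grow while the process
        // is up, so they behave like normal in-memory counters and histograms.
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
    }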
C
It really depends on how much we want to move away from running everything in batches in CI. The thing that I wrote in Go is simple enough to just put it somewhere and forget about it. I mean, we have plenty of Kubernetes clusters; I think it would cost pennies to run it on something that we already have, because it's just keeping numbers in memory, and if it restarts, in that case it would be no problem because you would keep counting.
C
It's just a very quick example. Let's say that instead of this we were counting the number of API calls, instead of just counting the number of deployments. Every run would have a different number of API calls, so it would be easier to understand that there was a reset, because the number is moving. But in our case it's one pipeline, one deployment, so it always pushes one.
D
Yeah, this is something that we noticed the other day with Robert, that it was resetting the value. So we did something very hackish to work around that, but I was not aware that it was also happening on the cumulative buckets of the histogram, which is also a problem.
C
Yeah, because if you look at the documentation, they just say that if you run a batch job something like every 15 minutes, they suggest you change it to a daemon; or, if you have something that runs for a short time, then they hope that the numbers you count are different, so that they can understand the result. Otherwise they just say: run a daemon.
C
Yes, but basically, Andrew was explaining this to me: histograms are kind of a hack on top of the metric system, so at the Prometheus level the parts don't really belong to the same thing. In theory you could work out that, because the sum counter actually reset, the overall thing was reset, but in the internals of Prometheus every one of them is a different metric. Each bucket is a metric; the sum and the count are all different metrics.
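For context, a single Prometheus histogram is exposed to the scraper as several independent time series, which is the point being made here; a small sketch (metric name invented) of the series one histogram produces.

    // Illustration only: one histogram turns into several separate series.
    package main

    import "github.com/prometheus/client_golang/prometheus"

    var deployDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "deployment_duration_seconds",
        Help:    "Duration of deployments in seconds.",
        Buckets: []float64{3600, 7200, 10800},
    })

    func main() {
        prometheus.MustRegister(deployDuration)
        // On scrape, this single histogram appears as separate time series:
        //   deployment_duration_seconds_bucket{le="3600"}
        //   deployment_duration_seconds_bucket{le="7200"}
        //   deployment_duration_seconds_bucket{le="10800"}
        //   deployment_duration_seconds_bucket{le="+Inf"}
        //   deployment_duration_seconds_sum
        //   deployment_duration_seconds_count
        // Each of these is its own metric, so a reset in one of them is not
        // automatically linked to the others.
    }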
C
And because we are tracking time, the only thing that changes is the total time: sometimes a deployment takes two hours, then the next takes two hours point something, so that metric is reset properly. But the other ones you're just changing by one. So you say "I did one deployment", and then six hours later you just replace that value with "I did one deployment" again. You're just pushing the old value, so it's still the same one.
C
Yeah, and then he told me that they're doing something like that in Tamland, no, in the murky customer project. They have a little program running in a cloud function, I don't even know exactly; there's something running there that in that case is scraping Elasticsearch and keeping the information in memory, because yeah, the biggest problem is histograms.
D
I don't know if that works for you, Alessi, or do you prefer to have all of this in a single issue?
B
Okay, well, will we lose data on this one? You said that if it restarts, then, I mean, we'd know about it because it's restarted, but are we creating kind of a risk around this stuff?
C
I don't think we keep this information for that long, because Prometheus has a retention policy. I think it's around, I don't know if we changed it, maybe a week; it could be a month or a couple of months. But the point is that the process has the information in memory, and when you scrape it, the Prometheus instance keeps it in its own data storage.
C
So the thing that you lose is your local information in your daemon, which is expected, because Prometheus keeps the thing stored, and that's basically it. But if the values change properly, so they just monotonically increase and then reset, then Prometheus can understand that the process restarted, and its helper functions will just do the right thing and give you the monotonic rate of increase. But if you just go from zero to one, and then again one, and then again one, and then again one, there's no way to tell.
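A rough sketch of the reasoning above, not Prometheus's actual implementation: counter-reset handling only works when a sample drops below the previous one, so repeatedly pushing "one deployment" is indistinguishable from nothing changing at all.

    // Conceptual illustration of counter-reset handling.
    package main

    import "fmt"

    // increase sums the growth of a counter, treating any drop in value as a
    // restart from zero (roughly what Prometheus does for counters).
    func increase(samples []float64) float64 {
        if len(samples) == 0 {
            return 0
        }
        total := 0.0
        prev := samples[0]
        for _, s := range samples[1:] {
            if s < prev {
                // Reset detected: count the new value from zero.
                total += s
            } else {
                total += s - prev
            }
            prev = s
        }
        return total
    }

    func main() {
        // Long-running daemon: the counter grows, restarts, then grows again.
        fmt.Println(increase([]float64{1, 2, 3, 1, 2})) // 4: the drop makes the reset visible

        // Pushgateway case: every pipeline pushes "1 deployment" again.
        fmt.Println(increase([]float64{1, 1, 1, 1})) // 0: nothing appears to change
    }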
C
That's what I was thinking about. We can probably put this in a subfolder in release-tools so that we have everything together. I just wrote it in Go because I know how to do this in Go. We could write it in Ruby; it's probably going to be more expensive to run than the Go daemon. So I'm kind of open to suggestions here.
C
So this thing just works, and I think it's easy enough to understand how it works; also, if someone is not really proficient with Go, it should be quite easy to eventually change some behaviour there. But if we don't want to go in the Go direction, we can rewrite this in Ruby, and we should also think about how we want to deploy the thing. In my mind, in the end we're going to build a Docker image and then run it somewhere.
B
So, do you want someone to put an issue together that we can look at? What I'd like, I think, is a kind of balance between getting something up and running, so we can see this data and make progress on the deployment SLO, while at the same time we know this is one iteration of many in terms of the overall metrics.
B
So if we can find a nice balance between getting what we need now, without needing to do weeks of work, and something that we know is going to be okay for a few months.
B
And then we can put input in there. Cool, okay, thanks very much. Let's work out on the issue how we're going to deploy that.
B
Cool. And then item B, I think, might be a bit of a similar type of conversation. Following that incident, we need to modify our deployment pipeline so that we're not running Gitaly deployments ahead of Rails. At the moment, in canary we do Gitaly, Praefect, Rails, and then we go into the main fleet and we do all the rest of Gitaly, the rest of Praefect and the rest of Rails.
B
That's a bit of a risk, because actually we have Rails changes ahead of the full Gitaly fleet, so we need to make a change. Java has made a proposal that we could just lift, like literally just pick up, the production Gitaly and Praefect jobs and put them next to the canary ones. So we would kick off canary Gitaly and, sorry, production Gitaly at the same time, then we'd do canary Praefect and production Praefect, then we'd do canary web, canary Rails, and then we would roll in.
B
So it certainly seems like an easy initial iteration for this. I'm wondering if, kind of longer term, we might actually want to split the Gitaly deployment off from these other ones and sort of stagger it in, but that might be a little bit more work. So I kind of have two questions. One is on the proposal that Java's come up with, which looks like a really straightforward piece of work.
B
I do think there's a slight risk, because we'll have done a full Gitaly deployment and we may not even promote that thing to production, right? There are instances of that, I guess, and then there won't be any tracking of that deployment. So it's definitely not perfect. I mean, a better solution would be to also create Gitaly tracking jobs that then sit with the actual deployment, but of course that's a bit more work.
B
Okay, I'll add some more comments on this issue. I think we need to think through what this might mean. We do need to change this, but I'm wondering if there may not be a quick hack; we might actually need to think about how this properly fits in the pipeline.
B
Okay, let's move on to your point.
E
Ever since we switched to bridge jobs, the warm-up, which used to run in parallel with the canary deploy, has stopped doing so. I was wondering if we still had an issue to address pushing that back, so that the warm-up runs at some point.
B
Yeah, we should readdress that. I'm trying to gather up all of the kind of additional, sort of random little release-tool type things that we have; there'll be lots of them. I think things will come out as we get the deployment SLO up and running, and we'll see lots of areas for improvement. Thanks, Robert.
B
All the links, thanks. Thanks, both. One thing we are getting close to is thinking about OKRs for Q3, so we might want to see if we can wrap a load of these things up into that. But if it's a small change and someone wants to just go for it, go for it; otherwise, I think in Q3 we can have a think about what changes we want to get in place to improve all the things, basically.
B
Awesome. Is there anything else anyone wants to discuss?
B
Awesome, all right, thanks for that. I will pull that onto the board, so we've got a few of these things circling around. Yeah, okay, I'll dig it out and see what we can do with that. Awesome. I shall stop the recording.