From YouTube: 2020-04-29 AMA about GitLab Releases
Description
AMA with Delivery team and Engineering Managers
https://docs.google.com/document/d/1R8OvFSacFIPdZlQ6lgmZiloDGFh7bHFpL_4Z5ZYLkCU/edit#heading=h.igt0lkjqo3a0
A: All right, well, thank you everyone for joining. This is the ask-me-anything session about GitLab releases. My name is Marin Jankovski and I am the Senior Engineering Manager in the Infrastructure department and the current interim Engineering Manager for Delivery. The Delivery team is responsible for all of our GitLab releases, deployments to GitLab.com, and also the current Kubernetes migration we have on GitLab.com.
All right, that was an intro that I rehearsed, I hope you all liked it. Awesome, thank you. Thanks for turning up. I see that Phil is not going to be able to join and he asked us to record, so this is being recorded. Phil, I'll share the recording after I finish processing it. Does anyone want to be Phil and ask his questions so I can answer? Okay.
C: After an MR is merged to master, how do we know when the change will be deployed to GitLab.com, especially in that last week before the release cutoff? We know the wording is that, for an MR to be in the release, it has to be merged by a certain time. Can you run through that process again? I know it's documented in the handbook, but it would be great just reiterating it for me. Thanks.
A: Yeah. That means that two times a week we go into master and branch out from the last-known passing build, and then that is going to be our deployment branch for the next couple of days until we get a new deploy branch created. There are multiple reasons for that. Some of them are related to the fact that we still don't have a really great story when it comes to how we do QA: we don't have enough confidence in what we actually deployed in all of our environments prior to production, and we also need some time to turn around a fix when we get an S1, which is a very high severity issue. So, in order to balance the lot of manual steps that we have in our process against safety, we decided to go with two deploy branches a week. That actually means that everything you've merged between deploy branches gets collected into the next one.
So, for example, one of the deployments we have is on Sunday. That means that everything from the previous deploy branch, which is created on Wednesday, until Sunday is being collected, and that's being deployed, let's say, Monday, though depending on your timezone it is basically Sunday evening. And then everything from Sunday evening until Wednesday is being deployed in the second part of the week. This way we kind of balance the coverage that we have, because we have to think about GitLab.com and how that gets deployed.
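As a rough illustration of the "branch out from the last-known passing build" step described above, here is a minimal sketch, not the Delivery team's actual tooling, that finds the latest successful pipeline on master and creates a deploy branch from that commit. It assumes the `requests` library, a token in GITLAB_TOKEN, and an illustrative project ID and branch-naming scheme.

```python
# Minimal sketch: create a deploy branch from the last green commit on master.
# Assumptions (not from the talk): `requests`, a token in GITLAB_TOKEN,
# an illustrative project ID; the real release tooling differs.
import os
from datetime import datetime, timezone

import requests

API = "https://gitlab.com/api/v4"
PROJECT = 278964  # illustrative project ID (placeholder)
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

def last_green_sha():
    """Return the commit SHA of the newest successful master pipeline."""
    resp = requests.get(
        f"{API}/projects/{PROJECT}/pipelines",
        params={"ref": "master", "status": "success", "per_page": 1},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()[0]["sha"]

def create_deploy_branch(sha):
    """Create a dated deploy branch pointing at the given SHA."""
    name = datetime.now(timezone.utc).strftime("%y-%m-%d-auto-deploy")
    resp = requests.post(
        f"{API}/projects/{PROJECT}/repository/branches",
        params={"branch": name, "ref": sha},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return name

if __name__ == "__main__":
    print("created", create_deploy_branch(last_green_sha()))
```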
So, in order for you to know exactly when something is going to be deployed, we would also have to have a stable platform. It's basically impossible for us right now to tell you exactly when things are going to be deployed, because there is the time it takes for the CI pipeline to pass on gitlab-rails, the time it takes for a package to build, the time it takes for us to deploy to staging,
the time it takes for QA to pass on staging, then time for the canary deployment, QA again, and then time for the production deploy. In any of those steps things can go wrong and things can get significantly delayed. For example, we might have the CI pipeline failing in the deployment branch for a known or unknown reason; that automatically means someone needs to go in and fix that spec failure. We can see QA hit a failure in canary while the same QA passed in staging; the reason for that is the completely different
data sets that we have on our canary environment and in production. So because of that, it is really hard to say that we have a cut-off. That's why I'm fighting really hard against the cutoff, because a cutoff actually means everyone will be racing under the radar, and then it becomes someone else's responsibility to ensure that this actually gets deployed and actually works on GitLab.com.
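To make the chain of steps above concrete, here is a small sketch that models the auto-deploy pipeline as an ordered list of stages and shows why a failure anywhere pushes everything downstream. The stage names follow the talk; the durations are made-up placeholders, not real measurements.

```python
# Sketch of the auto-deploy chain described above: each stage must pass before
# the next one starts, so a failure (or retry) anywhere delays every later stage.
STAGES = [
    ("CI pipeline on gitlab-rails", 60),   # minutes, illustrative only
    ("package / image build", 60),
    ("staging deploy", 30),
    ("staging QA", 45),
    ("canary deploy", 30),
    ("canary QA", 45),
    ("production deploy (after SRE approval)", 60),
]

def eta_minutes(failed_stage=None, retry_cost=120):
    """Total minutes until production, adding a retry penalty if one stage fails."""
    total = 0
    for name, minutes in STAGES:
        total += minutes
        if name == failed_stage:
            total += retry_cost  # someone has to fix the failure and re-run
    return total

print(eta_minutes())              # happy path
print(eta_minutes("staging QA"))  # one flaky QA run pushes everything back
```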
In the merge request widget you have that block that shows which environment we've deployed to, and that is actually correct up to the second. As soon as we finish a deployment to one of the environments, we have an API call that we trigger for each of the merge requests that we just deployed, and we add a link between the deployment and that merge request. So that's one item, and at the same time we also use the workflow labels to propagate where the merge request is between environments.
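For anyone who wants to query that association directly rather than reading the widget, a minimal sketch along these lines should work against the Deployments API, assuming the `requests` library, a token in GITLAB_TOKEN, and placeholder project and environment names; it lists the merge requests recorded as shipped in a given deployment.

```python
# Sketch: list the merge requests GitLab associates with one deployment.
# Assumes `requests`, a token in GITLAB_TOKEN, and placeholder IDs/names.
import os
import requests

API = "https://gitlab.com/api/v4"
PROJECT = 278964  # placeholder project ID
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

def latest_deployment(environment="gprd"):
    """Most recent deployment for an environment (name is a placeholder)."""
    resp = requests.get(
        f"{API}/projects/{PROJECT}/deployments",
        params={"environment": environment, "order_by": "created_at",
                "sort": "desc", "per_page": 1},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()[0]

def merge_requests_in(deployment_id):
    """Merge requests linked to the deployment, as surfaced in the MR widget."""
    resp = requests.get(
        f"{API}/projects/{PROJECT}/deployments/{deployment_id}/merge_requests",
        headers=HEADERS,
    )
    resp.raise_for_status()
    return [mr["web_url"] for mr in resp.json()]

deploy = latest_deployment()
print(deploy["id"], merge_requests_in(deploy["id"]))
```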
So this is basically us communicating the changes we are making. It actually allows people to discuss things and also carries a bit more context than the documentation. Part of it might be the ChatOps part; I don't know if this audience knows it, but I think I'm going to share it just so I can show you, in case you developers don't know. We have a regular process that is defined for how you use ChatOps to enable feature flags, for example.
That is very important for a lot of you, because it allows you to create a feature that might not be fully ready and experiment a bit and so on. But as part of that, we've also introduced basically an audit log. This is linked in our documentation, but I don't know if you're all aware: every interaction with ChatOps and feature flags is logged in this project, and we know exactly at which time, by whom, and on which host things got triggered, and this is the information it gives us.
It also tells us which group it was enabled on, exactly at which time, and who was the person triggering the change, and this way we can quickly backtrack. In case we have a problem on production or a problem in any of the other environments, we can find out who was responsible for what, because it gets really difficult to track those things otherwise.
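The ChatOps commands mentioned here wrap GitLab's admin-only Features API. As a rough, hedged sketch of what an audited toggle boils down to, something like the following enables a flag for a single group; the flag name, group path, instance URL, and token are placeholders, and on GitLab.com only the ChatOps path, which writes to the audit log project described above, would normally be used.

```python
# Sketch only: enable a feature flag for one group via the admin Features API.
# On GitLab.com this goes through ChatOps (e.g. `/chatops run feature set ...`)
# so every change lands in the audit log project described above.
# Flag name, group, URL, and token are placeholders.
import os
import requests

GITLAB = "https://gitlab.example.com"  # placeholder; requires admin rights
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_ADMIN_TOKEN"]}

def set_feature(name, value, feature_group=None, group=None):
    """Set a feature flag; `value` may be true/false or a percentage."""
    params = {"value": value}
    if feature_group:
        params["feature_group"] = feature_group
    if group:
        params["group"] = group  # group path, e.g. "gitlab-org"
    resp = requests.post(f"{GITLAB}/api/v4/features/{name}",
                         params=params, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

# Example: turn a (hypothetical) flag on for a single group first.
print(set_feature("my_experimental_feature", "true", group="gitlab-org"))
```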
D: So I'll speak to the commenter's question. I think the labels are fantastic; I use those a lot to track progress, to aid in conversation, and just to make sure that verification is taking place. But I'm wondering how we can get notified, to some extent. I know there were the comments, there were created notifications, but that was creating a lot of noise. Are you thinking of ways to approach that?
A: Yeah. So we were originally thinking of just creating a to-do for this, a GitLab to-do to highlight when something was done, but some of our engineers were saying that that's actually not really actionable: the fact that something moved to one of the environments, what do I do with that information? I might need to verify it, but I might not need to do anything. So developers are divided, I'll tell you that. As for the comments that we were leaving:
at the same time as we were applying the label, we used to leave a comment saying this merge request is now in this environment, but that created so much noise and I don't think anyone was a fan of it. So if you have a better suggestion on how we can actually do that and make it an actionable item, I'm happy to discuss it, but for now we've kind of abandoned the idea of doing a proactive notification in favor of the MR widget and the label.
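For context, "applying the label" is a plain merge request update rather than a comment. A hedged sketch of advancing a workflow-style label on an MR could look like the following; the project ID, MR IID, and label names are placeholders and not necessarily the exact labels the Delivery tooling uses.

```python
# Sketch: advance a workflow-style label on a merge request instead of commenting.
# Uses the MR update endpoint's add_labels/remove_labels parameters.
# Project ID, MR IID, and label names are placeholders.
import os
import requests

API = "https://gitlab.com/api/v4"
PROJECT = 278964
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

def advance_label(mr_iid, old_label, new_label):
    """Swap one environment label for the next on a merge request."""
    resp = requests.put(
        f"{API}/projects/{PROJECT}/merge_requests/{mr_iid}",
        params={"remove_labels": old_label, "add_labels": new_label},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()["labels"]

# e.g. mark a (hypothetical) MR as having reached canary
print(advance_label(12345, "workflow::staging", "workflow::canary"))
```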
If we decide as a company that this is a direction we want to take, then a bit of noise for people who don't necessarily want to know is something we'll have to live with. But we also always have to wonder whether this type of feature actually helps out our self-managed and other customers.
This is the hard balance that we have to keep striking, because some of these things we implemented outside of GitLab, mostly because there was literally no interest from any of our customers in having a feature like that. We don't want to introduce something that we will have to maintain but that in the end no one will use, and some of these things usually go down the route of: well, we need it for GitLab Inc., we'll add it.
We saw a problem because there was a workflow someone is using in a certain way on GitLab.com that we were hitting, and we only saw it there. So in order to be able to react to those things quickly, and to have a stabilization period so to speak, we have this Pick into auto-deploy label that always checks for the existence of the S2 and S1 labels. So if there is an issue, a bug, with that severity, and Pick into auto-deploy was applied to the merge request,
that is exactly when you should be using it. So: bugs, regressions, S1 and S2 fixes. Anything lower than that has lower priority and lower urgency and can usually wait an additional two days to get into any of our environments. This week, actually, we are experimenting with creating deploy branches once a day, and that's going to have a huge impact on everyone regardless of how it goes, and it would make Pick into auto-deploy a little less interesting for anyone anyway.
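As a rough illustration of the check described above, that Pick into auto-deploy should ride along with an S1/S2 bug, here is a small hedged sketch that inspects an MR's labels and flags the combination. The label spellings, project ID, and MR IID are placeholders rather than the release tooling's real rules.

```python
# Sketch: only honour "Pick into auto-deploy" when the MR also carries a
# high-severity label (exact label spellings are placeholders).
import os
import requests

API = "https://gitlab.com/api/v4"
PROJECT = 278964
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

HIGH_SEVERITY = {"severity::1", "severity::2", "S1", "S2"}

def eligible_for_auto_deploy_pick(mr_iid):
    """True if the MR has the pick label AND an S1/S2 severity label."""
    resp = requests.get(f"{API}/projects/{PROJECT}/merge_requests/{mr_iid}",
                        headers=HEADERS)
    resp.raise_for_status()
    labels = set(resp.json()["labels"])
    return "Pick into auto-deploy" in labels and bool(labels & HIGH_SEVERITY)

print(eligible_for_auto_deploy_pick(12345))  # hypothetical MR IID
```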
But the major problem we have is the fact that we still have to wait for a manual step: we click a button to deploy to production, because we need to get approval from the SRE on-call on whether we can promote to production. So we are now working on using the data from our metrics, the metrics that we collect, Sentry exceptions, Prometheus and so on, to automatically inform the decision on whether we can go to production or not, so that we're more metrics-driven there.
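A minimal sketch of the metrics-informed direction mentioned here, assuming a reachable Prometheus instance and an illustrative error-ratio query; the URL, query, and threshold are placeholders, not the team's actual promotion criteria.

```python
# Sketch: ask Prometheus whether the error ratio is low enough to promote to
# production. The endpoint, query, and threshold are illustrative placeholders.
import requests

PROM = "https://prometheus.example.com"  # placeholder Prometheus URL
QUERY = ('sum(rate(http_requests_total{status=~"5.."}[5m])) '
         '/ sum(rate(http_requests_total[5m]))')
THRESHOLD = 0.001  # 0.1% error budget, made up

def error_ratio():
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

ratio = error_ratio()
print("promote" if ratio < THRESHOLD else f"hold: error ratio {ratio:.4%}")
```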
C: I'm sure Phil will also be happy with that answer. Phil's got the next one, so I'll go for that as well: for self-managed releases, who determines, and when, whether to create a patch release after the 22nd based on the Pick into 12.10 label, and as engineers, what criteria should we use in deciding whether to apply any Pick into X.Y labels?
A: Yeah. So for self-managed releases, release managers are the ones responsible for deciding when a patch release goes out. The reason for that is that release managers know whether we have other releases in motion. We have a special type of patch release, which I'm sure all of you are fully aware of, which is the security patch release. Those are really, really disruptive and require a number of departments to work together, which is a challenge, and then we also have possible deployments going on on GitLab.com, possible exceptions.
There are a couple of things happening at the same time. So the criteria that the release managers usually use to decide whether we are creating a patch release are: how many fixes need to go into a patch release, how many merge requests are there already with the Pick into label applied, and what is the severity of the bugs that we are trying to fix? So if we go through, let me try to open that page real quick and show something; it will probably be easier to explain. All right.
If we take a look at the list we have here, I, as a release manager, am looking at severity. Something that is high priority but low severity is not going to move my dial one way or another; something that is very high severity and priority, like this S2 and P1, will. If we have more of these types of merge requests merged and we have no ongoing release, we make a decision on the spot to create a patch release. So customer impact is important here.
How much of an impact does this have on the customers? What else do we have in flight? How many other merge requests are open waiting for this release? We don't want to create too many patch releases, because that slows down our customers; our customers don't really want to upgrade every single time we create a release. So we kind of have to balance the number against everything else that is in flight.
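To make those criteria a bit more concrete, here is a hedged sketch that counts merged MRs carrying a given Pick into label and groups them by severity, roughly the kind of summary a release manager might glance at. The project ID, label name, and severity spellings are placeholders.

```python
# Sketch: summarise merged MRs waiting on a "Pick into X.Y" label by severity,
# roughly the inputs a release manager weighs before cutting a patch release.
# Project ID, label name, and severity spellings are placeholders.
import os
from collections import Counter

import requests

API = "https://gitlab.com/api/v4"
PROJECT = 278964
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

def pending_picks(pick_label="Pick into 12.10"):
    resp = requests.get(
        f"{API}/projects/{PROJECT}/merge_requests",
        params={"labels": pick_label, "state": "merged", "per_page": 100},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()

def severity_summary(mrs):
    counts = Counter()
    for mr in mrs:
        sev = next((label for label in mr["labels"]
                    if label.lower().startswith("severity")), "unknown")
        counts[sev] += 1
    return counts

mrs = pending_picks()
print(len(mrs), "MRs waiting;", dict(severity_summary(mrs)))
```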
Backports to releases older than the current stable one are not supported; we only do that for security patches. Security patches are the only releases where we backport to the two other, older releases, and we have a three-month support policy for those. Whoever labeled that merge request,
that's where we get that from, for sure. I think you can all check out the maintenance policy, where we describe some of these definitions; we describe what goes in where, why we can do some backports and can't do others, and some upgrade recommendations in general that we have for our customers. Thank you. Chase, you're up.
F: We had multiple MRs that we were trying to package up and get picked into the same patch release, but then we were surprised that the release had already been closed. Maybe you could just talk about that a little bit, just expand on that some, and then maybe give some guidance about when we should be doing this. Should we be labeling as soon as we know, or what? That would be helpful for those who are managing that process. Thank you.
A: Yeah, great question, Chase, and I fully understand why this question is coming up. By the sheer fact that we tied our shipping together, the self-managed release with the SaaS deployment, we're kind of combining two of the hardest things in software, and with the timing I explained, it means that there is a vacuum there.
That's what creates this vacuum. So basically, on the release page in the handbook, I'm trying to explain that if you have things merged early in the cycle, early in the monthly cycle, there is very little reason for you to apply a Pick into label for the current release you're targeting. So if we are talking about, let's use the example of the 22nd of this month, the 22nd of May: our active preparation is for GitLab 13.0.
Our stable release is 12.10, so if you're working on a feature or a bug fix or something that is related to 13.0 and you're merging it right now, there is no reason for you to think about the Pick into 13.0 label. If you're fixing a bug that was already introduced by 12.10, it's a bug, it's a regression, and it has a certain severity, and you want to ship it in 12.9 so that our customers can leverage it, then you apply the Pick into 12.9 label.
Now, if we get closer to the release date of the 22nd of May, say we are at the 18th of May and we are almost there with getting GitLab 13.0 ready, then depending on how stable GitLab.com deployments were up until that moment, we might still be going out and deploying all of the fixes that you are merging, even on the 18th. So basically, based on what is in your MR widget and what label is on your merge request, you can actually see whether your merge is going to go into a deployment.
In the releases channel we always post what the current guaranteed commit is that we will be deploying. When we are certain that something is going to reach production, we post it, and if your merge request is within that, you don't have to worry about things. But if we are already past that deadline and it's a question of whether things will make it or not, the best thing you can do, on the day before the release, is this:
if your merge request is not inside of the GitLab 13.0 release, you are going to apply the Pick into 13.0 label, because that's going to go into a patch release. But the important part here is whether it's a bug, a regression, or a feature: features only get shipped in dot-zero releases. We have other requirements with our Community Edition, with the SemVer we're following, and so on, where we can't ship features in patch releases. Did I answer your question?

F: Yes.
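The guidance above can be summarized as a small decision rule. The sketch below encodes it using the 13.0 and 12.10 versions from the example in the answer, purely as an illustration of the handbook guidance rather than an official tool; version numbers and label spellings are placeholders.

```python
# Sketch of the labeling guidance above: features ride the next .0 release,
# regressions in the current stable get a "Pick into <stable>" label, and
# fixes that miss the .0 cut get "Pick into <next>.0" for a patch release.
def pick_label(change_type, missed_dot_zero_cut=False,
               next_release="13.0", current_stable="12.10"):
    if change_type == "feature":
        # Features only ship in .0 releases, so no pick label is needed.
        return None
    if change_type in ("bug", "regression"):
        if missed_dot_zero_cut:
            return f"Pick into {next_release}"   # will go out in a patch release
        return f"Pick into {current_stable}"     # ship to current stable users
    return None

print(pick_label("feature"))                         # None
print(pick_label("regression"))                      # Pick into 12.10
print(pick_label("bug", missed_dot_zero_cut=True))   # Pick into 13.0
```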
A: Just to add to this: it becomes easier the faster we ship. The more frequently we ship, the more frequently we deploy to GitLab.com, and the less you need to think about when to apply a label and whether there is a cutoff or not. But that is a collective responsibility: the more stable we are with the code we are shipping, the fewer problems we are causing on any of the environments, including GitLab.com, and the quicker
we react to a problem. Meaning, if we can fix tests really quickly, if we can deploy really quickly, if we can build our packages really quickly, and the quality of all of that is great, we can deploy more frequently, which means all of these discussions we are having right now about cut-off dates and so on become almost moot.
So give me one sec to find that document; I'm going to link it very quickly for you here. We have the general description of what canary does in this document. Canary is a production environment, so that is important to note; we don't necessarily experiment there. This is our way of checking and seeing the impact of the changes that we are deploying before we roll them out to everyone else.
The problem with a true canary deployment is that we can't do that, because GitLab as an application doesn't allow us the separation where we use the same database and the same Redis instance but run different code at the same time in full isolation; it's really not allowing us to do that. So canary is actually only there on, I think, the registry and the web fleet,
so everything that you see when you load the page, plus HTTPS Git traffic, so cloning through HTTPS, and I think one or two routes to the API, a couple of those things. Which means that we are only using it to ask: are we seeing a huge amount of errors popping up based on the traffic we are generating? If we don't see any problems, we just expand it and deploy it to the rest of production. But at the point where we reach canary, we have already run the database migrations.
A
So
already
the
data
set
change
there.
So
we
that's
also
one
of
the
reasons
why
we
we
have
all
the
TEL.
We
are
trying
to
have
the
test
to
make
sure
that
we
do
backwards
compatible
codes.
So
we
wrote
we
basically
run
with
canary
ahead.
The
rest
of
production
is
behind
when
it
comes
to
code,
but
they
are
all
using
the
same
data
layer.
Yes and no. We do have that, but we've been told by the database team, frequently actually, that while we can do reversals of the regular database migrations, when it comes to post-deployment and background database migrations, which are the longer-running ones and the ones that in some cases affect our data set destructively, they are not sure how that would work at GitLab.com scale. So, to answer your question, we have a process for this, yes; the process is actually defined in this document, and I can share it.
Thank you, okay. This process is based on deploying the older code in reverse: we first handle the stateless services, then we roll back Gitaly, which is a stateful service, and then we roll back the database migrations. So we have that, and we've done it once or twice, but we've been generally warned, or generally advised, that we should always roll forward.
Basically, right now we're at the stage where we are rolling out Sidekiq to Kubernetes. We spent a couple of months actually trying to ensure that we deploy it side by side with our VMs. At the same time, we build both our Helm charts and our omnibus packages, you're probably aware of that, and we needed to connect everything, and we also needed to find a way to manage the risk of Sidekiq jobs expanding, creating a large number of pods in our Kubernetes cluster, meaning:
how do we contain it so it doesn't auto-expand to infinity and cause all sorts of problems there? We found the solution for that, and we were waiting for the Distribution and Scalability teams to complete their part of the task on redoing how Sidekiq actually functions before we can roll out the rest of Sidekiq.
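On the containment point, the usual Kubernetes knob for "don't auto-expand to infinity" is a HorizontalPodAutoscaler with an explicit ceiling. The following is a generic sketch using the official Python client, with made-up names, namespace, and numbers; it is not GitLab.com's actual configuration.

```python
# Generic sketch: cap how far a Sidekiq deployment can auto-scale by giving the
# HorizontalPodAutoscaler an explicit maxReplicas ceiling. Names, namespace,
# and numbers are made up; this is not GitLab.com's real configuration.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="sidekiq-catchall", namespace="gitlab"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="sidekiq-catchall",
        ),
        min_replicas=2,
        max_replicas=20,                       # hard ceiling on auto-expansion
        target_cpu_utilization_percentage=75,  # scale on CPU, within the ceiling
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="gitlab", body=hpa,
)
```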
A
We
have
couple
of
blockers
right
now
that
we
identified
so
with
our
sidekick
migration.
We
will
be
able
to
probably
migrate
around
80%
of
all
of
our
sidekick
workloads,
the
20%
that
we
can't
are
dependent
on
shared
state.
This
means
we
have
NFS,
for
example,
in
our
VMs
that
are
acting
as
that
shared
state
in
kubernetes.
That
can
happen
right
there,
stateless
and
our
application
so
get
lab
features
specifically,
don't
necessarily
support
stateless
in
some
cases.
A
So
some
of
our
customers
are
using
this
for
a
while.
Now
the
helm,
chart
and
installation
on
kubernetes
they're
not
running
into
these
problems
because
they
don't
have
the
scale
we
have
it
github.com.
So
we
see
these
problems
exactly
because
we
are
running
it
at
scale.
So
we've
seen
situations
where
a
feature
would
buffer
things
on
disk
fill
up
the
pod.
The
pod
would
fail
out
of
this
space
that
would
trigger
a
new
pod
in
a
node,
and
that
would
expand
to
fill
the
whole
node.
A
The
node
would
fail,
then,
because
it
will
run
out
of
disk
space,
kubernetes
will
trigger
a
new
node
and
then
it
will
start
figuring
new
nodes
and
we
will
have
a
cascading
set
of
failures.
So
right
now
we
are
focusing
only
on
safe
rollout
of
the
cyclic
use
that
we
absolutely
know
will
not
have
those
type
of
dependencies
and
then,
through
a
different
channel,
I'm
working
with
development,
to
remove
some
of
those
dependencies
that
we
have
like
some
of
the
features
we
have.
A
We
have
the
model
pins
down
for
how
we
are
doing
these
migrations.
So
as
soon
as
the
application
starts
on
blocking
us,
when
I
say
application,
I
mean
the
actual
application
that
we
all
develop
as
soon
as
we
start.
Having
that
unblocked,
we
can
speed
up
our
migration
with
rest
of
the
fates,
meaning
we
can
do
that
from
the
API
and
for
the
Wellfleet.