Jenkins Infrastructure Project, 20 Apr 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: 2021 04 20 Jenkins Infra Meeting

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Head started: olivia your meeting, hi everybody. Welcome to this new jenkins infrastructure meeting. um We have few incident to reports, so we definitely have to be too tall today. um So first, let me share the notes uh for today, as last week, we are going to use um mcmd.

A

So here are the news: if you cannot edit, the news feel free to request access um but yeah. So let's, let's start. um The first thing that I want to talk is the incident that happened yesterday with schedule jenkins that I own so as a reminder get to jenkins. That is a mirror engine that redirects every request um to download an artifact from the jenkins to a mirror um located.

A

um I mean to the closest mirror uh from your location, which means that from I mean from europe, you originated from to europe, my house and so on. um What happened um around six uh around four p.m: around sorry, around five five pm utc for some reason: um the network storage, so the azure file storage mounted into that mirror bit service uh stopped responding.

A

We got error, saying quota accident, and so because we could not um communicate with the network file. Storage, um basically mirror bits had no idea about which file can be distributed can be distributed to which mirror.

A

So that's basically what happened um so the fallback so the current the fallback. The way it was configured is, if that to get the changes that I was not was not working. It fall back to a service that was using the same network file storage. So basically, we just resent way too much. Requests to the to the file storage was which was um problematic at the time, so it took us around two hours to three hours to um understand the issue.

A

The good thing is that was the same issue that affected us in back in november 2020.

A

So we had a rough idea of what to do where to look at um several people were involved in that outage, so the the first step was to redirect the traffic to a node machine um that we named package touching the io which has every file, so that machine has the same content that what is located on the azure file storage. So the idea was just to regrade the traffic to a different machine, so we could put back that.

A

We could put that service back on track until we understand what happened um so the service was done so yesterday evening. um Europe time everything was fine. Yes,.

B

Sorry so the the redirect was that redirect of get.jenkins.io traffic. So it was a dns change to switch from exactly.

A

So we we didn't fix the the the the get the change in the layout. We just redirected traffic to a location that was working.

A

uh The temporary the temporary solution was fine, but that machine is not able to under the loads that happened during peak hours, so it was definitely just a temporary solution um until until we understand what what what's that, what was happening so on my side, what I did yesterday was to open a portion um to open a connection using powershell with the azure account to list the every open, open file connection with that azure file storage, and there is a heart limit to 2000 and what I identify is there was one sec in one session.

A

We almost filled that limit, so we opened like um 1998 connection um around I'm, not sure damien, because we we investigated wisdom in this afternoon- around 4 p.m. I think utc.

C

uh Yes, 4 pm utc, it's when we started to see the first.

A

Picks, and so once once the the azure firestore, it was not working correctly. Then you we saw a lot of side issues like um cpu usage, went crazy inside the nodes memory increased, obviously because the service was not able to to answer requests. The number of requests increased as well. So I mean we can clearly identify huge peaks that happened during that time.

A

Until now, we we could not identify the root cause of um I mean who basically opened those files. Was it an issue on azure? Was it an issue on our side? Did we have one process that just opened every connection in one time um yeah? It is not something that we could identify right now,.

C

We have identified a private ip, but we are missing some observability data to be able to know, because the iep is one of the kubernetes nodes and we don't know we are missing information. To conclude.

C

Did we cause the azure file system issue by making too much request, or was it an azure issue that caused the request to pile and to go on race condition on whatever application and keeping the file under open? And this we cannot conclude on one or the other, and we miss data to be able to conclude.

A

So that yeah, that was the main issue, any question on this.

A

No, so so, if, while we are still on that specific issue, I think what was really nice um to notice, um based on our learning back in november.

A

First, we put in place a status page in november, so we could communicate about this issue and people were able to quickly open uh the incident on the jkc status hit repository, so we were able to communicate about the incidents when it starts when closed. That was the first thing.

A

First thing: sorry, some people ask why um the monitoring and the status page did not show that get the jenkins that I o was down um the root cause of that is because the way the container is working, um the service starts and read in the directory.

A

So read the network of file storage content into a directory, so the service was working but could not read the data in a specific directory, so r il check, because we only monitor the route so get the changes that are you um or each check told us that the service was up and running, but when we tried to access a specific file, um obviously that was not working. So that was the thing. So we have to improve on monitoring, to monitor a specific file, let's say slash time, slash into or whatever file we monitor.

A

um Ideally, that file needs to be small, so we don't put pressure on the network, but um we have to improve the monitoring, something that we saw um in november and we saw the same pattern. This time was um some requests were passing and others not so we were able to download some specific files, but other were returning five or three hours in in november we had the same issue where we could access every files, except that those under the directory planes.

A

um I didn't understand why at that time the problem resolved by itself and we saw the same issue yesterday. Some requests were passing, others not get. The jenkins dollar was up and running, but um so yeah, that's, that's the that's one of the things.

A

So so yeah and we're not sure wha. What we'll do is we'll just improve the monitoring to monitor a specific file, but this will just reduce um to I mean this would just help us to to detect the issue, but um in this case we we didn't get the metering notification.

A

The second thing that we monitor that we checked was since now a while. We monitored that we can download the latest jenkins version from um package of jenkins leo. So we have a monitoring check for that and how it works. We query repo the jk.org to see. What's the latest version for the weekly and deltas, we relieved the retrieve that version, and then we query get the jenkins that I use.

A

If that check is failing for 30 minutes and then it triggers an alert and then we are notified by petroleum team. What happened here is because the service was not failing for two or two hours. Some requests were passing and others not. So when we look at the metadata dashboard, we clearly see that half of the request was were working under others. Not so that's why we didn't get notified by the monitoring um so well we we, we had a look to multiple things here, um but we we could not. Really, I mean yeah.

A

That was, that was a tough issue to detect.

A

So what I have to another thing that we have to improve here, um I I clean up. Look at the open connection, so I just had to connect with azure console and run a bunch of comments.

A

Obviously, in my case that was easy because I, the the command that I run back in november, was still in my powershell history. So I just um re-execute the same comments, um but I should put the documentation in a runbook so the next time, if it happen again, someone else can do the same.

A

So this is something that we work with damien to be sure that someone else can run the same comment again in the future, but um any question.

A

So if you have yeah, I just put um here the the kind of access that are done on that network storage. Just for to give you an id. um While you can mount the same network, the same as your file storage into multiple containers, you can write from multiple containers. You can read from multiple multiple containers. You still have a limit on the number of open files you can have at the same time, and just to give you an id, we have my turn check that tests.

A

If you can um access some specific location on the container you have, um the apache will have an apache. That means that that can return content. We have mirror bits which scan on a regular basis from every containers. um We have datadog monitoring. We have so that's it's really difficult to really have to have an a clear understanding of where those um access those file um access came from, um but yeah we are still investigating um the next topic that I want to mention the next outage. I mean that was not really a notation.

A

So any question before we move on no um another issue that happened last week, so we wanted to improve the way we deliver jenkins at our website um by directly totally rely on hem charts, um and we face an interesting challenge here.

A

We put branch protection on the git repository that contains jenkins. That are your website, so we put branch protection, so we always use pull requests to introduce change and because the new workflow implies committing to the branch and we could not identify a way to say we want to keep the brand protection but only allow a specific bot to modify that specific file.

A

A

We had a bunch of discussion here and now I'm really open to suggestion. One of them was to just remove the branch protection. So now we just allow the right person to directly commit to master.

A

Or there were also suggestions about keeping the branch protection but automatically open a pull request and automatically close that pull request.

C

So since I have a proposal here because there are, there were at least six different ways of implementing that workflow, based on changing some bits connecting automating. This means that we don't have a consensus right now, so that could be a nice ground to start writing an erp which is the same thing as a grp.

C

It's something we haven't done since a long time, so infrastructure enhancement proposal where the goal will be to state the goals list, the solution with each pro uncounts and then discuss based on that, just to be sure that everything is clear for everyone and have a specific meeting and decision, and then we can jump to implementation because it appears that the two tries at implementing it were missing tiny parts where we discovered that maybe we have to to go on a consensus or to act somewhere.

C

So that's my proposal here that we, we start a specific discussion with a written process. First, where we list the solution just to avoid a risk of a meeting where we might not understand or see all the parts.

A

I would be really happy to work on that document because, four years ago I work on the same for the current implementation and the current way to deploy it into their websites, and I mean every other website like javadoc and api plugins and so on and in four years a lot of things evolve and so yeah. I would be really glad to to re-evaluate my assumption that were done four years ago to purpose something different.

C

It was mentioned that maybe iop could be merged into the gep. uh I here I don't really care. The goal is that we get started on writing the proposal there, and if we see that we can do that exercise as a community team for one or two important topic, then we can raise the discussion of should we go to gp, but the goal is to learn to work before running here, so that that's the goal of the proposal of staying on the iop that hasn't been updated since a lot of time.

C

So, let's see how we behave as a community team and then we we can then see if we have to merge to gep, where I will say the writers and readers of gep are more at ease with that process than we are.

A

Today, that sounds a good idea and I would even go one step further, um since we are um working again on documentation, git repository to put a lot of documentation, I'm just wondering if we could just put the document there as well, so we'll just regroup that document the meeting notes outage maintenance and documentation in one.

B

Location olivia, I'm not sure. I understood the last thing that you said, could you so the idea was place, the iep in a in a different location than the than the earlier iep repository yeah.

A

B

A

The okay, so so right now um we have so we have a git repository name, documentation that was created a long time ago and the idea was to have a public documentation uh where we document everything related to infrastructure.

A

Last week we had a discussion in this meeting about, should we put acme documents in jenkins at a website or in a different location.

A

We agreed that we would push in a different git repository, which is the documentation, and so since we we, we collect the the note for the meetings and the upgrade plan and every other things, including run books. I was just suggesting to move the ip documents in the same deck in the same git repository, so we just have a bigger repository with more content.

B

That sounds great to me. Okay, thanks for the.

B

A

Documentation sounds good. um Those were the two most important uh topic that I wanted to talk today. um Damien, you were mentioning that you wanted to bring the website depicting the issues we have is really that's the I in french. So I guess that's the right time.

C

Yep so, as I mentioned before, the recordings were mentioning it for everyone.

C

Yesterday, during the incident, while we were investigating the azure console elements, we saw an alert on the kubernetes, which was not, which was only a warning when we run the diagnostic and troubleshoot integrated on the azure ui, and that that warning was about the fact that the kubernetes control plane of our aks cluster named public gates is thresholds.

C

That mean that we are making too much request at some times. So the I'm having a bad moment with language natural threshold is the correct word.

C

It's throttled, that's with the t, so if we are throttled, so that means that some of our requests are queued before being sent to avoid uh uh having peak of flow big workload on the api control plane, and this request comes from different sources, mostly all our m-file process, which takes care of doing the git ups operation to the kubernetes cluster, but also from all the jenkins instances that are spawning pods, because when a pipeline is running inside a pot template it, it run websocket command from a kubernetes client inside jenkins to run the cube.

C

Cuttle exec command in charge of the pipeline steps sh or b80 depending on linux or windows, so these requests are also been being sent directly to the kubernetes api control plane.

C

So we don't know who the culprit is, but this will explain the websocket issue we see because websocket not only is used between jenkins and its agents, but also between jenkins and container that are not the default dnlp container on the given code, when the pipeline is run for each sh, there is a cube, catal exec, which involve one web socket and most of the errors we saw were mostly related to this and are correlated in time with the peaks.

C

So we will have to push this further, maybe monitoring the amount of request or the peak of bills. I don't know which direction to go from there. That's also something we saw yesterday and that's all I don't want to have. We need more data to prove that it's related to that.

D

What do we get? What monitoring can we get out of because yeah can we are we able to see when we're exceeding the thresholds or the quotas.

C

Yeah there is some there is some monitoring currently uh integrated in azure, so we could maybe start with. With this point, um I don't know how to extract that information continuously from jenkins, though there's there are way to have a groovy script to run that will print uh instantaneous usage of the current open connection from jenkins.

C

However, I saw on the graphs. It looks like that azure is able to determine the user agent of each request, because I saw some requests that were coming from my own web browser with the operating system, the web browser and the kind of clients.

C

So maybe that one could help, because I suppose the user agent of the kubernetes client in jenkins is different than a web browser or a m-file applied from a go longer. What.

A

Would be also nice is to identify all the potential limits that you have to use aks, um because that's a common issue in azure, whatever service you use, you have limits um like the number of computers and the number of cpu you can deploy in one region and stuff. Like that, the number of files you can open on the azure file storage at the same time, and so maybe we have a limit that I mean that we need to identify.

C

And also that could be worth it to check with jenkins and jenkins six communities and maybe uh jenkins user or any variation, because I'm sure we are not alone, I mean we don't make so much request. So is it because our azure iks is meeting some issues? Is it because of the kubernetes version? Do we have other user with the same issues? Another kubernetes kind cluster, because I mean it's not uncommon and I we don't do anything that is exotic. We run pipelines on pods and the pod. Template might have two or sometimes three containers.

C

That's not a lot. So, even though there are technical limitations that also subject from the user experience in jenkins when using the kubernetes plugin.

D

Yeah, I'm wondering whether so I know on jenkins x there's a recommendation to turn down giveness external secrets from polling, because that can be quite heavy in terms of listing secrets and looking for changes.

D

But I don't think we're running that on that particular intro that stuff um my I I'm I'm guessing um that it's probably unfair related the fact that the fact that we run a full deploy or a full kind of chat- maybe maybe so often.

A

Maybe it would be nice to reach out to our elastic friends, because I know they work on drinking observability. So.

C

Yeah and so right now, the first step will be what, as garrett said, uh checking doing the existing monitoring on azure and see the breakdown between different clients and see which one is emitting bunch of pics and then from there. We could go further.

A

It was interesting to investigate and that may be damien. You will have to look at. You will get.

C

Anyone interested to pair with me is welcome from the team or from outside yep garrett. I can't you.

C

That sounds great.

C

So one quick addon in the action I move on the action points the runebox, I'm gonna, polish. The writing. The goal is to write, run box about to to fix some elements that could have helped us yesterday during the incident the goal is we tried as much as possible to not wake up olivier.

C

We already had the knowledge from the previous one, so we partly failed and partly uh and partly succeeded, because team was able to provide a fallback solution, so I've identified at least two procedures that must be written at all costs how to fold.

C

What team and mark did yesterday how to to have a temporary fallback to be sure that user can still download file for one full day slower without the mirror capability, but still they can download and how to identify and fix the azure file storage, identify we were able to follow microsoft online procedure, but we were missing some point about a powershell script that olivia mentioned earlier. This being should be a run book I mean by one book I mean a no-brainer, only the main dot. We can still do ourselves during an incident.

C

The line between the dots, but we need there was some missing element that could have helped and maybe could have helped the fixing, without bothering olivier so now that we all have the knowledge and understanding of what happened, because since it happened already in november, that's the second time. That means we will have another issue with azure file storage.

C

So if we have a runbook for that, it's a no-brainer, anyone from the team can fix it and that should also shrank the time of the issue.

A

Sounds sounds perfect and again, we'll put those does the run book in the jenkins infra documentation, repository.

B

Is it worth one more action item to investigate or suggest or discuss ways to detect that style of failure? The the failure mode was was rather strange in terms of flapping and- and you know it was on again off again some some here some there. I'm not sure that failure detection is is ultimately possible for that kind of failure, but is it? Is it worth.

A

So yeah, to be honest, identifying this issue that happened yesterday is definitely challenging because, as you says, as you say um that was flapping. um When we look at the monitoring, sometimes they check, where passing sometimes that and even worse, um we have 400 gigabytes of files. Some files were accessible, others, not, and in november those files were under the director plugins. This time it was not necessarily the same.

C

What do you think olivier? Maybe putting I don't know if it's technically possible, but the amount of file handle open on the azure file storage were a pretty good indicator that something went wrong because it's the quota we we reached and once the quota was reached, then everything got south. So do you think it's possible to have a routine that hourly that measure at least the amount of open 500 on that specific volume?

C

So if it reaches, let's say it's 2 000, so if it reached let's say 8 and 600, then we have an alert that say: okay, maybe you should check and with the run book with the process, so a human still have to operate, but the human would would not have to go through the, which is the issue on the kubernetes centrifugation.

C

We know that the issue was specifically on the azure file storage there. So.

A

So first thing: first, everything is possible uh depending on how much time we want to invest in there practically I'm not sure. If it's worthwhile, because python the checks are published to datadog, they are written in python. I I don't have any documentation, so we could probably use the python sdk for that, but those are definitely not information.

A

I don't think those information are available, um as is in data and so.

B

We definitely have to collect.

A

Those and publish them to that ourselves.

B

Would would you be okay, olivia, I think there's some interest there. If I ask datadog, do they already have a built-in monitor somewhere? That would check azure file storage used, because I would expect this to be a common thing. They've already implemented, and all we would need to do is find out how they what they did.

A

um So we definitely have um storage information uh storage file counts. I have to do that check. I mean we have information like the month, egress, ingress and stuff like that, um but we have the latency, but we don't have the information that yeah. I have to look at okay thanks, so we are running out of time, um so I propose to finish the meeting here, but before we do, um I just want to to highlight the fight that, because of the new workflow, we are going to have one document permitting.

A

um So I put the idea of this document the link to the next week uh meeting. So, if you have anything, you want to put to the agenda, feel free to add that information there and so yeah. We used the next token next week, thanks for your time have a great day bye.