Jenkins Infrastructure Project, 9 Jun 2020

From YouTube: INFRA Weekly Meeting 2020 06 09

Description

Jenkins Infrastructure Project Meeting - 2020-06-09

Notes - http://bit.ly/2T0oZ9v

A

Hi everybody, so the recording is on now, and welcome to this new Jenkins infrastructure meeting. We have a few things to cover; the week has been pretty interesting.

A

Basically, everything started last week when we had an issue with the AKS cluster: the cluster was in a broken state.

A

I tried [unclear], and basically all the nodes were disconnected from the load balancer, so it was not possible to use the cluster anymore. We investigated several solutions and none of them worked, so we decided to delete the cluster and restore everything on a new one. That was also interesting in a way, because we discovered some issues, and one of them was related to the LDAP database.

A

So basically, what we discovered is that around February the volume that stores the backups switched to read-only mode, so we have had no backups for that database since February, and we weren't able to restore all the users. The first focus was to restore all the services on that cluster, and now we are investigating the remaining issues with the users. So first, do you have any questions regarding the AKS cluster?

B

Yeah, so it would be interesting to know more about what we monitor on the cluster, and especially how to contribute to monitoring in the future.

B

It can be discussed later if there are guidelines which I need to read first.

C

I think the concern was that the cluster was running, but whenever any changes were attempted there were issues. So we hadn't been able to run the deployment job for the last couple of weeks because it kept failing, and I think Olivier was trying to do an update to the ingress records in preparation for the next version, and it was not working properly.

A

That's what happened. The cluster is monitored: we are using Datadog to monitor the cluster, and every service running on the cluster is monitored. One of the things that we could have monitored better is that database. But looking at what happened with the cluster, I'm not sure we could have handled it in a better way, except by having more clusters to spread the risk across multiple clusters. Otherwise, that was a weird issue. Something that I also tried was to open a ticket with Azure.

A

But there are two things: first, it would have taken hours to solve the issue, and second, we don't have support with Azure anymore, and if we want technical support we have to pay for it.

A

So this is also something that we discovered. I'm not sure that we really need that; I think we just had to clean up everything. That cluster had been running for two years, or, sorry, that one had been running for a year, and I don't think any monitoring could have caught this.

C

And the control plane is on the Microsoft side.

A

So we don't have access to what's broken, yeah.

B

Other backup services, for example, if you connected to Vanderbilt, directly and sort of really and Microsoft.

A

So, just to come back to the cluster and not talk about backups yet: the Microsoft cluster that we use is called AKS, and basically it's a cluster managed by Azure. They manage the master, and the only thing that we have control over is the agents.

A

And so when we want to use an updated version of Kubernetes, we just ask Microsoft to update to the next version that we want to use, and Microsoft handles the management: turning machines down and up and so on. So we don't have visibility there. Basically, what happened here is that we had a lot of timeout issues between the agents and the master, so it was not possible to, let's say, SSH into the machine and debug what's happening with etcd or whatever.

A

It was just like a black box to us. So that's why we tried to upgrade, hoping that upgrading the version, even to a minor version, would just restart the nodes on the Azure side. That's one of the things that we tried. I was also hoping that maybe Azure support would help us in this case, but it was not the case. So that's why we decided to just delete everything, because at that time it was the quickest solution to the problem.

A

I did not realize at that time that the backup was not being done anymore. Basically, the way these backups are done is this: every day there is a cron job that dumps the database onto Azure File storage, and each time we stop the container we also generate a backup on that storage. That Azure storage is replicated in multiple regions, so there is no reason for the backups to be gone. And so basically, what happened here:

A

We mounted that Azure File storage in read-only mode instead of read-write. Something that we need here is a monitoring job that just says something like "your backup is quite old": older than one week, two weeks, one day, whatever. This is something that we could have seen earlier. Something that you have to keep in mind is that the LDAP service has been running on Kubernetes for three years.
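The monitoring job described here could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names, the one-day default threshold, and the idea that backups are plain files in one directory are all hypothetical, not the project's actual setup. It reports a problem when the backup directory is not writable (the read-only mount case) or when the newest file is older than the threshold.

```python
import os
import time
from pathlib import Path

DAY = 24 * 60 * 60  # seconds


def newest_backup_age(backup_dir, now=None):
    """Return the age in seconds of the newest file in backup_dir, or None if it holds no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def check_backups(backup_dir, max_age_seconds=DAY, now=None):
    """Return a list of human-readable problems; an empty list means the backups look healthy."""
    problems = []
    # Best-effort check: a volume mounted read-only is reported as not writable.
    if not os.access(backup_dir, os.W_OK):
        problems.append("backup directory is not writable")
    age = newest_backup_age(backup_dir, now)
    if age is None:
        problems.append("no backup files found")
    elif age > max_age_seconds:
        problems.append(f"newest backup is {age / DAY:.1f} days old")
    return problems
```

A scheduled job could call `check_backups` on the mounted backup path and forward any returned messages to whatever alerting is already in place.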

A

Multiple times we had to move the container onto another cluster, because we had re-created the cluster with particular permissions. It was always almost transparent, because the service was down for only a few seconds and we were able to back up and restore very quickly. So this was really the first time that we had such issues with those backups.

B

So a retrospective with all the details would be good, and setting up some kind of monitoring for LDAP and all of the critical data backups would be quite high on our list. Yep. So two things happened, but...

A

There and other what yeah.

A

Otherwise, there are multiple things that I still have to report in the retrospective for the Kubernetes outage. We had a bunch of issues. For example, we generate an Azure public IP that we can reuse and assign to multiple machines, and this is something that we have used for a while. The kind of issue that we had was that we weren't able to reuse the public IP that we generated in the past with the new Kubernetes cluster, because things changed on the Azure side. So we had to delete it, generate a new one, and assign it to the load balancer.

A

Those kinds of issues were really small, but they meant, for example, DNS changes, so we had to wait for hours before the DNS was totally propagated, and stuff like this. I have a list of things that I changed. Most of the changes have been pushed to the jenkins-infra charts repository, but we still have to do the retrospective regarding that outage once everything is totally fixed.

A

So I propose to move to the next topic, which is the LDAP database issues.

D

So, before we leave backups: are there any other services where we should be checking, short term, that we've got a good backup?

A

No, the LDAP database is the only stateful application storing data on the cluster, so this is the only one that is at risk. Otherwise, every other service has been reviewed, and they're all working now.

B

For JIRA and Confluence, the MySQL backups: where do the backups go at the moment?

A

So all the services running on bare-metal machines, like JIRA or Confluence, are just backed up on the machines themselves. So basically, if we lose the machine, we lose the backup. We don't have any backup policy for those services.

C

One way to move to get have a shoot.

D

So that feels like a separate topic, but we ought to flag that it's worth us considering: should we put something that's not on the machine as a backup destination for the JIRA issues database? Confluence I'm less worried about. Go ahead.

A

A

Gia yes, but another one that touristy really need to be back again: Joe Jenkins, that are you yeah.

A

The thing is, we have quite a lot of services where we can retrieve some of the data, like we did with the LDAP database right now, but because we don't have a defined policy at the moment, we are still at risk on other services. So it really depends on the service.

C

Should be quite safe to back it up to a 0yw it known as soon if.

F

You never access.

C

Because you normally pay for transfer, if you have a transferring just really nice feeling up.

B

Would it be possible to make a one-time snapshot for the backup? Even without setting up a pipeline, at least we would have a recovery point.

D

Yeah, at least on JIRA we have backups that are taken, I believe, every week, but I think the point was that they are stored on the same machine.

B

Stored on the same machine, that's the problem, yeah. That's why I'm asking whether we could have a snapshot for that. So, for example, for the wiki we can just keep one snapshot forever, because we don't expect any changes to happen to the wiki now. And for JIRA it would still be useful in general, so that if anything happens, we at least have a one-time snapshot of a fairly recent version with the historic data, and then we can think about what we do next.

A

Place may be provisional, a storage, that's.

A

So decent good yeah.

B

So I will of power and predict it for that and are the best for packaging can say: oh it's a bit less trivial, but for that we exist happily oops.

C

F

I shouldn't be clear that, because most of them.

C

Are and historic already they're all that they would pull the Davian in the Red Hat one one.

A

That's what I mean by saying we don't have a real backup policy, but most of the data is kind of already duplicated in different locations.

A

So, for example, you mentioned pkg.jenkins.io. Yes, you're right, most of the packages are also on Azure, but, for example, we don't have the very old versions from before we started to upload there, especially for the release lines that are not used anymore. That's what I mean by saying most of the data can be retrieved in some way, but it really depends on the service and the data. So, for example, for [unclear], yeah.

A

We do not know how everything I think the image is 100, kilobytes click out, and otherwise it's just removed from the condom cost basis. So there is just one half called archive search in case yet at Orakei that contained everything so that.

A

Again, we don't have a proper backup policy, but if we lose that machine, it's not the end of the world.

B

In principle, if we wanted to do a cold backup, so basically just taking a historical snapshot which we put somewhere where it's not that expensive, what would be the way to do that?

A

I think the first step would just be to rsync the data to a different machine. That's easy if you just want to do one snapshot today to be safe. If you want to put in place some script to do that on a regular basis, then you have to work on the script, and you have to work on the monitoring as well.

A

In order to be sure that you are doing backups, and also that you are able to restore your backups and that you don't have corrupted data.
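The point about corrupted data can be made concrete with a small check like the one below. It is a sketch under assumptions that the meeting does not confirm: it supposes a backup is a single gzip-compressed dump, and the function name is made up for the example. It only proves that the archive can be read end to end; a real restore test would still need to load the dump into a scratch instance.

```python
import gzip
import hashlib
import zlib


def verify_gzip_backup(path, min_bytes=1):
    """Decompress the whole archive and return (ok, detail).

    On success, detail is the SHA-256 of the decompressed content, which can be
    stored and compared later. On failure, detail explains what went wrong.
    """
    digest = hashlib.sha256()
    total = 0
    try:
        with gzip.open(path, "rb") as fh:
            # Read in 1 MiB chunks so large dumps do not have to fit in memory.
            for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                digest.update(chunk)
                total += len(chunk)
    except (OSError, EOFError, zlib.error) as exc:
        # Covers non-gzip files, truncated archives, and corrupted streams.
        return False, f"cannot read archive: {exc}"
    if total < min_bytes:
        return False, f"archive only holds {total} bytes"
    return True, digest.hexdigest()
```

Running this right after each backup, and again on a schedule, would catch both an empty dump and an archive that was damaged after it was written.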

B

I think a one-time snapshot for Confluence and for pkg.jenkins.io would be more than enough. For JIRA it's a bit less trivial and a much bigger story, so just having a snapshot, at least for now, would be great.

B

For sure we have some sensitive data, they would say security, project and other things, but because you wouldn't like this data just in case but yeah for the rest, the second team, it would be good if I can see how we do.

A

So yeah, let's go back to the LDAP database. Something that you have to understand is that we use LDAP as the source of identity, but we also have multiple services that synchronize that database into their own service and so keep a local version of the users. And so basically what happened here is that we lost the LDAP database.

A

Someone is then able to create a new account, because the account no longer exists in the LDAP database, and because they re-created that user account, they can now access the multiple services

A

that use that user account. So that was a risk that we identified here. We initially reviewed all the administrators, like me, and there the risk was nil. But we realized that it was a different story for plugin maintainers, because we have something like 1,700 plugin maintainers. So basically what we did was fetch the list of users from repo.jenkins-ci.org and compare that list of users with the JIRA database.

A

We retrieved the user name, the email address, and all the information that we could, and then we re-created the different users in the LDAP database. So that's the current state now.
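The comparison described above (a list of maintainers from one system, matched against user records from another, to rebuild the missing directory entries) can be sketched as follows. The function name and the data shapes are assumptions for illustration; the script actually used is not shown in the meeting.

```python
def plan_account_restore(maintainers, ldap_users, jira_emails):
    """Work out which maintainer accounts are missing from LDAP and can be re-created.

    maintainers: iterable of usernames that hold upload permissions
    ldap_users:  set of usernames currently present in LDAP
    jira_emails: dict mapping username -> email address known to the issue tracker

    Returns (recoverable, unresolved): a dict username -> email for accounts that
    can be re-created, and a sorted list of usernames with no known email.
    """
    recoverable = {}
    unresolved = []
    for name in sorted(set(maintainers) - set(ldap_users)):
        email = jira_emails.get(name)
        if email:
            recoverable[name] = email
        else:
            unresolved.append(name)
    return recoverable, unresolved
```

The `unresolved` list is the part that needs manual follow-up, since those accounts cannot be tied back to a previously known address.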

A

So yeah, now I'm looking at how to restore the people who created an account but do not have any admin access. I checked just before the meeting, and it's around 9,000 users that were removed from the database, so now we have to bring them back into LDAP. We have to write a script for this, but that's the current state.

B

Is it a safe assumption that this stays blocked until next week at least, right?

A

I think it's a good assumption. Nobody can register an account now, so we can keep it blocked until tomorrow, and I hope to be done with all the users tomorrow.

B

So we have some super users, for example KK or Jesse Glick, who should be able to release components now using the current set of permissions. And probably we could start adding some contributors to the allow list, so that permissions are restored for the accounts that ask for it. I don't think it's the end of the world if we delay restoring these permissions in general, if we have a workaround for now.

A

And otherwise, if someone really needs to release something, there are still options to release. But right now it's easier to just block everything for everybody, so we're sure that [unclear].

E

We have identified, I think, 50 accounts that are maintainer accounts that we no longer have or that have been re-created. So what we can do is remove permissions from these accounts in the permission files in the repository-permissions-updater repository and just restore the old behavior, because we know which accounts are potentially compromised or should not have access, and everyone else is fine. So we can do that.

E

Basically, we revert my patch after we make these people no longer maintainers, essentially, of the components they registered. Then we can go through them, communicate with them using the email addresses we have, for example, in JIRA, re-create their accounts, tell them they need to request a new password, and that's about it. Or we just start by resetting any email addresses that don't match up and resetting passwords; basically the same thing.

E

But the vast majority of accounts are fine, and we know they're fine because they existed before February.

E

So I don't think we need to reintroduce the super user problem that we had with KK and Jesse, which I think I even got rid of.
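The clean-up proposed in this turn (drop the flagged accounts from every component's permission entry, leave everyone else untouched) can be sketched as below. The dictionary shape is a simplification invented for the example and does not reflect the actual layout of the permission files; the function name is hypothetical too.

```python
def strip_flagged_accounts(permissions, flagged):
    """Remove flagged accounts from every component's developer list.

    permissions: dict mapping component name -> list of developer usernames
    flagged:     set of usernames that are missing or were re-created

    Returns (cleaned, removed): the updated mapping, plus a dict listing which
    accounts were dropped from which component, for follow-up communication.
    """
    cleaned = {}
    removed = {}
    for component, developers in permissions.items():
        cleaned[component] = [d for d in developers if d not in flagged]
        dropped = [d for d in developers if d in flagged]
        if dropped:
            removed[component] = dropped
    return cleaned, removed
```

The `removed` mapping doubles as the contact list: each entry names an account that needs to be verified and re-created before its permission is restored.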

B

Okay, if we don't reintroduce it, it's fine. We just need to provide a way to release, at least partially. So we have options.

A

Okay, great. Are there any other questions regarding that?

A

Sounds right. So I guess we can move to the last big project: the work being done on the automated release. I obviously did not have the time to work on this over the last week, but just before the outage happened I merged a major PR with which we can do stable, security, and weekly releases directly from the release environment. For the stable I think it's ready, but before releasing a stable I would like to be sure that we can do a security one.

A

Right now, for the security releases, I am looking for a way to test that the full process is working correctly. So that's the current state.

A

But I think it's more topic for that. Yet.

A

Sorry, what was the question? I was just saying that we have to sit together to see how to test it and to validate that the process is working for you.

E

I mean, in principle, what we can do, once the arguments are introduced that make the entire thing configurable, is set up an environment where we would release a weekly release as if it were a security update and see whether that works. Or we just create dedicated Maven repos and pretend there's a security update happening. We can do either of these things.

A

Yeah, so basically this is something that we could test now. Right now we have two jobs: the one that does the release using the Maven release plugin, and the one that packages everything. We can also promote artifacts at the end of the release, at the end of the packaging. Everything is parameterized now, so we just have to sit together to test that it is working well.

C

So if we have time, we could release the next weekly from the security one: just do it like a security release, which is really just a real weekly.

F

B

Yeah the problem that week security ring is request, release from the old cronies infrastructure. So we cannot use the BC edging to say Oh for.

F

That right now, it's already protect things. They security apply.

A

So basically, what Tim is suggesting is that instead of creating a weekly release on Monday, we create a security release based on the weekly content. So we just fetch the data from the Jenkins master branch. We do not introduce any security fixes; we just act as if it were a security release.

D

For my time, as a kindness to me, I'd appreciate it if we did that release on Tuesday rather than Monday. But

G

if you must do it on Monday, we'll figure out a way to do it.

A

We can do it on Tuesday. It was just confusion on my side, because I missed the email where we said, okay, we are going to do the release on Tuesday. Initially I put a cron job on Monday, and the change was not taken into account. So we did release yesterday, but it will not happen anymore. We can do it on Tuesday; I prefer that as well. So it sounds like we can [unclear].

A

We can do the next weekly with the security workflow.

B

To solve issues reduce permissions ban.

A

Another thing that I also added: the release environment now has a folder called components, and under components we can now release the remoting component. So we can use the code signing certificate for the remoting component, and the next release will happen from that environment. I know there are people who were interested in using the code signing certificate to sign components. So yeah, that's one of the changes to the release environment.

E

Does it support security fixes, or would they still need to be released as before, or staged?

A

What do you mean? Are we talking about the core release or the remoting release?

E

Remoting is a component we deliver as part of Jenkins core, and there have been security fixes in Jenkins core that were actually in remoting before. So how would they be handled?

A

A

Yeah yeah very have to touch to me. I think we can.

E

I mean, every component that goes into Jenkins core could be part of a core security update, and would need to be staged and then included in the core, with the POM or whatever, and then we stage core as a security update. So this gets annoying pretty quickly.

A

Basically, for the remoting component, I just reused most of what we do for Jenkins core, except that I removed some parts, so I do not expect it to be a big piece of work. It was more a proof of concept to see if we can release remoting. So if we think that we need to introduce a staging environment, that's not a big deal.

A

So, as the last wish to an automated story, disease I mention.

C

We've got two issues on the weekly at the moment with the release process. One is that Windows is broken due to a Microsoft security update, and we need to rebuild our images; we're having issues with that. The other is that if the packaging fails, it seems the packages don't get uploaded to Azure. I've gone through it, and it looks like the Debian and the Red Hat ones are failing like that too [unclear].

A

Regarding the issue with the Debian package that is not published, that seems weird to me right now, because the way the packaging happens is that you have Debian happening at the same time as Red Hat, SUSE, and Windows, and the WAR is also published. If for some reason one of them is broken, it should still finish the other steps; for example, the Debian one should be completed. There is another thing that needs to be improved in the release process, at the end of the packaging process.

A

We synchronize the different mirrors, so we trigger a script, but that script is also triggered by a cron job, and that one synchronizes the mirrors as well. So right now, even if the Windows packaging is broken, it should still publish the Debian and Red Hat packages.
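The behavior expected here, where packaging targets run side by side and one broken target does not prevent the others from completing, can be illustrated with the sketch below. It is not the project's pipeline: the function name and the idea of modeling each target as a plain callable are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_packaging(steps):
    """Run every packaging step concurrently and report each outcome separately.

    steps: dict mapping a target name -> zero-argument callable that raises on failure

    Returns (succeeded, failed): a list of target names that completed, and a dict
    mapping failed target names to their error message.
    """
    succeeded, failed = [], {}
    # Leaving the "with" block waits for every submitted step to finish.
    with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
        futures = {name: pool.submit(step) for name, step in steps.items()}
    for name, future in futures.items():
        error = future.exception()
        if error is None:
            succeeded.append(name)
        else:
            failed[name] = str(error)
    return succeeded, failed
```

With this shape, a publishing step can act on `succeeded` alone, so a failing Windows build would be reported without holding back the other packages.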

C

[unclear] is getting updated, but the packages aren't getting uploaded to Azure. I don't know why, without being able to see what's on that machine and what's going on. [unclear]

A

C

[unclear] the weekly release failed due to the Windows one, which has meant we've had a day or two where users have been complaining that they can't download it.

H

I pushed an updated Windows image today for the one that's used, and then I'm going to look at the idea of a separate VS tools image from the normal JNLP or inbound agent. That will make it easier to update the inbound agent separately from the VS tools.

A

Just on the Windows packaging: the packaging job has been failing since we re-created the cluster. The reason is that we were using an old version of Windows on the old cluster, and since we upgraded the cluster we now have an up-to-date version of Windows on the nodes. There was a security issue in the old version of Windows, and in order to fix that security issue they introduced a breaking change, so now the old VS tools image does not work with the new Windows nodes.

A

So that's one of the things. A more general issue, which is related to Windows but also to the infrastructure, is that when we started using Windows nodes for the release process, we had one big image containing JNLP and the VS tools, and we also had to put in place specific infrastructure labels on those specific nodes. And that now affects every Jenkins instance [unclear] the Jenkins charts that we are using in the infrastructure, and so on.

A

We are putting a lot of logic in that specific Windows container in order to be able to work in our infrastructure, and this does not scale. It's also difficult to test on local machines, because we have to do a lot of specific configuration changes in order to test the Windows packaging. So that's why we really need new container images.

A

Okay, we are running out of time, so is there any last topic that you want to discuss here? Otherwise we can switch to IRC. One last thing: I don't know if you saw the news regarding the plugin site, some nice improvements.

A

So we can now have access to the issues and the release information for every plugin. I don't know if you're planning to also have issues coming from GitHub issues, or is it supposed to work already?

B

Requests for three.

B

So, for today it is an enhancement, because we do have components like Jenkins Configuration as Code using GitHub issues, and right now it goes to JIRA, which is not that relevant. I'm not sure what still needs to be implemented for that; I believe we need some updates to the metadata, so there would be pull requests from Tim for that, and then we need to apply some magic to get it rendered. But yeah, the rendered link is already there.

B

I, don't think that wishes on political and the change looks under my she different yeah I've.

E

I commented that it would be useful if changelog files were recognized, especially if the releases would be empty otherwise.

B

Result as well.

B

Plug inside- and that is a big issues for it.

F

B

I will post it in the Gitter chat or elsewhere.

B

It's a really great improvement.

A

That's the last topic for the week. Otherwise, I propose to stop the meeting here and go back to IRC. Going once, twice, three times... Thanks for your time, and see you later.