From YouTube: 2022 05 03 Jenkins Infra Meeting
A: First of all, an announcement: today is the day of the weekly release. The weekly release failed to happen because we have another service credential that expired a few days ago, four or five days ago. It failed because that credential is required to retrieve the GPG key used to sign the packages.
A: I don't know the exact details; I'm not at ease with the release process in that area. There might be an issue if the release has already been performed by Maven, and hence the artifacts have already been pushed to the JFrog Artifactory. As I remember, once Maven has performed the release, the process downloads the WAR, signs it, and uploads it again, but my memory might betray me, so I haven't spent too much time on it. I just saw this right before the meeting, so I don't know; it has to be diagnosed.
A: Okay, then let's proceed. A quick note on what you see on the screen: the tasks that we were able to finish successfully during the past milestone.
A: The collegial decision was that we don't want to connect the new service on jenkins.io to the actual LDAP; it might use GitHub SSO instead, and that's another area, because we in the infrastructure team are not admins of the jenkinsci organization on GitHub. So that one has been closed as won't fix. We were able to finish the upgrade campaign for Kubernetes 1.21: that happened last Wednesday on the AKS clusters, which were the two missing clusters, with an outage of two minutes for the LDAP, because it's not highly available.
A: A new issue has been created to tackle this: we were only monitoring the backend endpoints that Maven uses for downloading or uploading artifacts, but the web UI also provides end-user features. Some end users want to search the artifacts; they need the web UI and they need to be able to log in through that service. So we need to monitor that too, so we can avoid having the service down for at least three days again.
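The missing monitor A describes boils down to probing the UI endpoint and alerting on anything but a 200. A minimal sketch, assuming the Artifactory UI URL and plain `curl` (the production check would be configured in Datadog, not run as a script):

```shell
# Hypothetical sketch of the missing monitor: alert unless the Artifactory
# web UI answers with HTTP 200. The URL below is an assumption; the real
# check would live in Datadog rather than in a shell script.
check_ui() {
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
  if [ "$status" = "200" ]; then
    echo "OK"
  else
    echo "ALERT: web UI returned HTTP $status"
  fi
}
# usage: check_ui "https://repo.jenkins-ci.org/ui/"
```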
A: So thanks, Stefan, for doing the heavy lifting on the pipeline library there. We also documented all the accounts, which was missing information. A lot of that documentation is not public when it should be, but we weren't focused on public versus private; we were focused on writing it down somewhere, and then we can decide whether to move it from private to public in a second step.
A: Also, there has been an issue on ci.jenkins.io, caught by at least Mark and a few other people during the weekend. It was week-night time for the Europeans, so mostly the U.S. people were affected while the Europeans were sleeping, and last night, so yesterday for our American friends, there was another issue. Thanks, Basil, for jumping on that with Mark; it appears you were able to take some flame graphs and start some diagnosis, so that makes it a top priority now to focus on ci.jenkins.io.
B: Yeah, to summarize what Mark and I looked at yesterday: the CPU was saturated on all cores of this controller, and when we looked at the flame graph we saw something unexpected, which was that the JVM was spending a lot of time compiling bytecode into native code, about 30% of the CPU time during a two-minute sample that we took. During the rest of that time it was executing Java code, mostly either Git cloning of pipeline shared libraries or compilation of those pipeline shared libraries.
B: By compilation I mean the Groovy/Java code that compiles the pipeline shared libraries and parses the abstract syntax tree, etc.
B: Whether or not that Groovy compilation was related to the HotSpot compilation, I'm not completely sure, but it's possible they're related. In any case, the Git cloning was about one-third of the CPU time, and as far as I could tell, the memory usage looked pretty good. So the only problem I could see was that the CPU was saturated, and it was almost exclusively doing something with pipeline shared libraries. That's all we were able to determine in the hour or so we spent looking at this.
A: We added an exclusion to the caching rule, because the system is expected to cache the shared library; that's why the git clone part... I'm not sure I understand correctly. Do you remember if it was a git clone of the shared library itself, or a git clone of the repositories being built by the projects?

B: Oh, it was a clone of the shared library.
A: So I wonder, then, if the change that we made could be related. Let me find the issue, if I can remember it... no, it wasn't on jenkins-infra. What we did is that we were working with Stefan on trying to test it in real life before merging the pull request on the pipeline library, and after running the unit tests and some end-to-end tests in our area, we wanted to run it on ci.jenkins.io in real life.
A: We usually do that, except, since it's cached on ci.jenkins.io, we added a temporary exception, which we then persisted in the configuration as code, for every branch name matching PR-*: those should not be cached, because it's only a few jobs and only a few edge cases.
A: There has been a discussion on the closed pull request, so let me add the link; I'm adding the link to the issue. It might or might not be related, but it appeared, as we were told after the feature was released, that we shouldn't have had the issue that required us to exclude the PR-* branches, because the feature always tries to fetch the latest reference, and it's only the git clone step which is expected to be cached or not.
A: That's the reason we initially added the exception. I'm not sure if, as a safety measure, we should maybe roll that one back; I'm not sure. How do you feel about that, folks? I don't have any facts that would tell me this is the problem and we should roll back, but it's the same area, so it's a gut feeling.
B: Yeah, I mean, I think it's good to do some analysis, like you said, and I'm planning to look into the flame graph a little more closely and see if I can come to any conclusions. But the comment that was made about how the caching should have worked, that this item should not have forced you to make this manual change in the first place, I think that's a legitimate comment.
B: It does sound like a bug in this caching feature if it didn't work in this use case out of the box. So I would encourage you to raise that bug with the maintainers of the Pipeline Shared Libraries plugin, and I think the best way to do that would be to come up with a simple reproducible test case and file a Jira ticket.
A: I agree on that part; it's a matter of finding a reproduction case. My personal measurement, when it comes to Jenkins, is that it takes me two hours to find a reproduction case, even if it's only one line of pipeline.
A: That's the amount of time it takes me, so I'm totally willing to do that, and we agree that it helps. But I always have mixed feelings about asking users for a reproduction case, because it's not always easy; it's a time investment. We do this because we are the Jenkins community, so no problem, but that might be hard for other kinds of users.
B: It's always better to collect all of the state for postmortem analysis at the time of failure, and I don't think we do a great job of that overall in the Jenkins project; that's something we could improve on in a wide variety of areas. In this case, the way to reproduce this sounds like you would need an already cached shared library, and then you would need to make an update in a pull request and demonstrate that the cached version is used rather than the updated one.
B: Understood, so yeah, there are two levels to reproducing this. I mean, that complexity is probably the reason this bug has not been caught in the first place, but if it can be documented and tested, then I think there's a solution that could be developed without too much difficulty.
A: Yep, that's a good tip. Stefan, I think it would be worth an exercise that we do this as a pair, because you asked to learn more and more about being able to spawn Jenkins instances. That could be a great exercise: producing a partial reproduction on a local instance, so you get at ease with spinning up Jenkins and doing ephemeral setups like this one, and that will be a worthwhile investment of these two hours of time.
A: We have other points?

B: Yeah. In general, with newly released features like this, being one of the early adopters is going to increase the likelihood of encountering these types of bugs. So that's something you might want to consider as far as the planning goes. If you're going to adopt a new feature, I think that's great, and certainly, if it's released on the update center, then Jenkins users are going to be adopting this feature, so it should work; there's just an increased likelihood of bugs.
B: So, for example, if you're having a busy sprint and you have a lot of other things to do, it might not be a good time to adopt a new feature. I'm pointing that out just in case you didn't realize that this pipeline shared library cache was a recent addition. Well, it's not very recent; I think it was added about six months ago or something like that.
C: It's good that we are the ones who pinpoint those problems and deal with them, so that's fine. Can I just add something? I would love to know exactly which process you are using to do the flame graph and all the debug stuff. Even if it's not really my job, I would love to know how you do that.
B: Sure, and I would encourage you to do that kind of analysis any time there's an incident. I could write something down; I have written these kinds of runbooks in the past, so I'm happy to write one describing what I did in this case. I saw that Mark was referencing some kind of runbook, and I don't know which one he was referencing, but if I can find it, I'll be happy to add some additional description of what I did.
B: Basically, what I did at a high level was: I downloaded async-profiler, which is an open source Java profiling tool, and essentially ran it on the host. Mark and I ran it together because he had access to the box. So we ran this async-profiler on the host. What it does is find the Java process inside the container and then collect stack traces every couple of milliseconds.
B: I think it's every 1000 milliseconds by default, or something like that. What this tool does is collect these stack traces at every tick interval, and it does that for a long period of time, like 30 seconds or two minutes, then sorts the stack traces and creates a visualization of the stack traces that were the hottest during that time period.
B: So that's what we were looking at. This tool covers both Java stack traces and native stack traces, so we were able to see the C++ code in the JVM that was hot, which was the compilation I mentioned. This was not relevant yesterday, but the tool also shows the kernel side of the stack trace.
B: For example, if Java code is calling open to read a file, and open is in turn calling ext4_lookup to read the file from the ext4 file system, it'll show you that as well. So it's a very useful tool for this kind of analysis, and it's not very difficult to set up and use. It does require things like JVM debug symbols, but our Docker image for Jenkins already has those, so fortunately we had most of what we needed.
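The session Basil describes can be sketched roughly as follows. This is a hedged sketch, not the exact commands from the incident: the install path, output file, and process-matching pattern are assumptions, while `-e cpu`, `-d`, and `-f` are standard async-profiler flags:

```shell
# Rough sketch of the profiling session described above. Assumes
# async-profiler is unpacked under ./async-profiler on the host and the
# container's Java process is visible from the host's pid namespace.
profile_jenkins() {
  # first pid whose command line mentions jenkins.war
  pid=$(ps -eo pid=,args= | awk '/jenkins\.war/ && !/awk/ { print $1; exit }')
  # sample on-CPU stacks for two minutes and render an HTML flame graph
  ./async-profiler/profiler.sh -e cpu -d 120 -f /tmp/jenkins-flame.html "$pid"
}
```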
B: There are also a few settings you have to change: we had to run sysctl on the box to enable some debugging flags temporarily in the kernel. But it's really not too difficult to set up, and it's usually the first tool I reach for when I'm dealing with CPU issues, because it's a good way of visualizing where the CPU time is going.

A: Perfect.
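The kernel flags B alludes to are the ones async-profiler's documentation calls out for perf-based sampling. A sketch, wrapped in a function (requires root; the values revert on reboot):

```shell
# The temporary kernel tweaks typically needed for perf-based profiling,
# per async-profiler's documentation (run as root; reset at reboot).
enable_perf_profiling() {
  # allow non-root processes to collect perf events
  sysctl -w kernel.perf_event_paranoid=1
  # expose kernel symbol addresses so kernel frames resolve in the graph
  sysctl -w kernel.kptr_restrict=0
}
```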
A: I need to search; I used to have a Docker image that you run as privileged on such a machine, when you have the Docker engine, that was able to automate all these settings immediately for native things. For the JVM part, I never used flame graphs with the JVM, so I don't know.
B: For the runbooks: is there a public runbook that I could update, or maybe a private one that I could add to?
A: For the postmortem, in order to have more information, I try to keep track of what we say and then have an outcome of what we plan to do afterwards. As for today: okay, quickly jumping on the next topic, we were finally able to migrate jenkins.io from the AWS virtual machine into the Kubernetes cluster; the migration finished this morning by switching the DNS.
A: We had a minor issue: Stefan is working on replacing Blue Ocean in the default display URL. We asked the developers, and we are waiting for feedback from the community on whether we do it or not. Reminder: it's not removing Blue Ocean, it's only changing the default link when you click on a GitHub check or a generic link, to stay on the classic UI. The more people use the classic UI, the more feedback we can give to the developers and the people who are revamping that UI in the upcoming LTS and weekly.
A: We were only building Linux images, so this involves a lot of changes on the pipeline library, because we need the pipeline library to be able to handle PowerShell or bat commands. There is also a tooling aspect: we need to ensure that each tool we already use on Linux today, for the usual docker build and docker push workflow, will work the same on Windows machines. So we are working on that area.
A
We
didn't
have
time
to
work
on
sunsetting,
mirror
brain.
I
need
to
write
a
blog
post
on
that
area
and
I
had
issues
with
jenkins,
so
you
want
the
latest
oculus
for
mac
problem
solves,
so
we
can
go
back
on
that
area.
Next,
we
have
the
apply
to
docker
open
source
programs
that
will
move
out
of
the
milestone
now,
because
we
are
waiting
for
them
to
apply
the
chance,
so
we
will
benefit
the
rate.
A: Finally, one last bug: after migrating the infra reports to trusted.ci, the change on the pipeline library involved in that migration had a minor impact on the repository permission updater, so I've reopened the issue until the problem is fixed. Basically, we need to update the virtual machine template we're using for agents so that they have the Azure command line installed.
A: We have two new tasks that we need to do this week, related to Datadog. First, Datadog announced two or three weeks ago the deprecation of some of the syntax in the on-call system that was linking Datadog to PagerDuty, and we are using these handles, so we have to move to the new PagerDuty integration.
A
So
I
haven't
checked
in
details.
What
is
the
migration
path,
but
that
will
be
just
duplicated,
so
we
need
to
find
a
solution
on
that
area
and
at
a
dog
we
need
to
add
a
new
monitoring.
I
mentioned
earlier
about
the
web
ui
of
artifactory,
so
we
have
these
two
new
tasks
that
are
coming
on
the
on
the
stack.
A: There have been two other, let's say, long-term elements that are just behind the ci.jenkins.io priority, but still top priority for the infra team. First, the realignment of the repo.jenkins-ci.org mission. That's a topic started by Daniel Beck a few years ago. We have an issue with JFrog, because we are costing them quite some amount of money and bandwidth, and the usage made of repo.jenkins-ci.org is not really legitimate: it appears that a lot of external organizations are mirroring that repository while they should not.
A
It's
not
expected
to
be
a
proxy
tools
and
services
such
as
the
maven,
central
or
maybe
us
having
a
public
proxy
system
should
be
used.
But
here
the
initial
agreement
between
the
community
and
the
frog
is
that
they
sponsor
us,
so
we
can
use
it
for
ci
jenkins
io
and
for
the
developers
on
direct
for
the
plugins
developers,
but
clearly
the
top
eaters
are
artifactories
that
are
in
mirror
mode
from
outside.
A
So
now
we
are
working
with
g
frog.
We
are
waiting
for
them.
They
are
trying
to
extract
the
list
of
the
top
heater
public
ip,
so
we
can
start
searching
dns
name
and
ip
for
some
people,
but
yeah.
We
need
help,
especially
from
danielle,
about
the
legacy
things.
There
has
been
a
discussion
one
or
two
years
ago,
if
you
remember
correctly,
about
forcing
authentication
to
be
able
to
retrieve
artifact
from
this
one,
which
is
quite
a
nuclear
option,
because
that
will
yeah.
A
That
will
require
some
a
lot
of
work
and
that
could
have
an
impact
on
the
contributors
because
they
won't
be
able
to
may
even
clean,
install
a
plugin.
They
will
need
to
configure
their
local
maven
installation
and
then
do
it,
so
that
might
create
some
additional
steps
for,
let's
say,
newbie
contributors.
A
That
was
the
core
of
the
discussion,
I'm
just
trying
to
transplant
why
it
wasn't
done
like
this,
but
that
will
ensure
that
we
don't
have
a
lot
of
issue,
because
we
had
a
lot
of
performance
or
outages
issues
on
that
service,
which
is
outside
our
area
and
yeah
frog
is
hosting
us
for
free.
So
we
need
to
find
it's
impressive.
We
have
around
20
percent
of
the
requests
made
to
the
repository
that
are
http
404,
that's
20,
that's
huge!
A
So
of
course
it
create
performances
issue
when
these
peaks
arrive,
because
it
uncached
their
underlying
file
system
and
create
a
lot
of
trouble
for
them.
So
I
don't
know
what
kind
of
implementation
they
have.
They
might
have
technical
solution
on
their
area,
but
they
ask
us
if
we
could
provide
some
data
or
search
for
the
culprits,
because
we
are
not
expected
to
have
so
much
404.
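The 20% figure is the kind of number one can pull straight from access logs. A hedged sketch, assuming logs in the common/combined format where the status code is the 9th whitespace-separated field:

```shell
# Hedged sketch: compute the share of HTTP 404 responses in an access log
# in the common/combined format (status code is the 9th field).
rate_404() {
  awk '{ total++; if ($9 == 404) notfound++ }
       END { if (total) printf "%.1f%% of %d requests were 404\n",
                        100 * notfound / total, total }'
}
# usage: rate_404 < access.log
```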
A
So
there
has
been
different
solutions.
We
I'm
not
sure
if
we
will
have
the
ability
to
work
next
week,
because
we
need
action
item
that
we
don't
have
right
now.
But
we
asked
for
help
for
daniel
so
daniel.
We
spent
some
time
in
the
upcoming
days
to
point
us
on
some
direction,
but
that
that's
totally
worth
a
discussion
to
trigger
on
the
mailing
list
for
developers.
A: Alternatively, the idea that Mark and I also added, which can be complementary, is to move to Oracle, because Oracle Cloud has really cheap bandwidth: instead of 3k per month, that should be between 100 and 200 with the volume that we have, which could be totally fine. Additionally, we could benefit from better performance, because it's a simple web server serving files, and they provide ARM servers, which clearly have better cost-performance benefits than Intel for this specific use case.
A: One last thing, about the proposal that I'm mentioning here: it's still an idea that needs to be tracked. That would be splitting the Terraform Azure project into two separate projects: one that handles the network and the DNS, and one with the rest of the Azure-managed infrastructure in the actual repository.
A: I'm going to set the priority to the ci.jenkins.io related tasks, spend more time on it tomorrow and on giving access to Basil, and then we can close the current issue and start working again.