Jenkins Google Summer of Code Office Hours, 26 May 2022

From YouTube: 2022 05 26 Git Cache Maintenance

A

Welcome. This is the Git cache maintenance project. It's May the 27th, 2022. Topics for today: Hrushikesh, are there questions you have that we want to be sure we address? What topics are of concern for you?

B

I'm still concerned about the architecture itself, like how we are going to proceed.

A

Okay, so let's talk some more about that. I think you had said that you're okay with the idea that cache maintenance happens in the background.

A

Right, so it probably runs on a separate thread.

A

Yes, async inside the Jenkins controller, running a separate process,

A

a Git process, to do the work.

B

I was thinking of a queue, so that I put the tasks into the queue and then use another thread to dequeue everything from that queue and run the maintenance tasks.

A

I think that makes sense, and I believe there is a concept of queuing inside Jenkins that you can use. So use a queue, and the concept there would be to keep relatively few maintenance tasks running concurrently. We don't want to overwhelm the controller with too many maintenance tasks.
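
A minimal sketch of the queue-and-worker idea described here, written in Python for brevity rather than as Jenkins plugin code. The function name `run_maintenance_worker` and the `None` stop sentinel are assumptions made for illustration; a real implementation would live inside the Git plugin and might well use Jenkins' own queuing facilities instead.

```python
import queue
import threading


def run_maintenance_worker(tasks, run_task):
    """Start one background thread that drains `tasks` strictly in order.

    `tasks` is a queue.Queue; `run_task` is called once per dequeued item.
    Putting None on the queue tells the worker to stop (hypothetical convention).
    """
    def worker():
        while True:
            task = tasks.get()
            try:
                if task is None:
                    return  # stop sentinel received
                run_task(task)  # only one task runs at a time
            finally:
                tasks.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


# Producer side: the scheduler only enqueues; it never runs Git itself.
completed = []
pending = queue.Queue()
worker = run_maintenance_worker(pending, completed.append)
for name in ["prefetch", "commit-graph", "gc"]:
    pending.put(name)
pending.put(None)
worker.join()
print(completed)
```

With a single worker thread, at most one maintenance task runs at a time, which is the "don't overwhelm the controller" property under discussion.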

B

Basically, I think a maximum of five Git maintenance tasks will be put in the queue if all of them are scheduled at the same time.

C

So I have a question here. We want to use a queue. As far as I understand, queues are used when you want to decouple two concepts.

C

You don't want tight coupling between two things, so you use a queue: there's a producer who pushes something, and there's a consumer who takes it whenever they have the resources. The timing is in their control.

C

Since we know that the tasks we're running are going to be scheduled at some frequency that we've configured ourselves, is there a reason, or is there a possibility, that the first job or process hasn't finished and other processes have started, and that is why we need something like a queue, to make sure that doesn't happen?

C

We are managing it, and we want a queue for that? I am not clear why we would use a queue here.

B

I thought of a queue because we discussed in the meeting that we want to run the maintenance tasks sequentially. So assume a queue which has two or three maintenance tasks, like a prefetch, a commit-graph, and a gc, for example.

B

I check whether the queue is empty or not. If it isn't, I dequeue the first element and run that maintenance task on all the Git caches. After that finishes, I continue with the next maintenance task. But this happens only when all the maintenance tasks were scheduled at the same time, with the same cron schedule, like every minute or every hour. Only then does this case happen.

B

Otherwise, with different cron syntax, I think only one maintenance task would be pushed into the queue.

C

Still, my question is that, like Mark has written, we want to control resource consumption, right? So if we are not sure how much time a process takes, or we are not sure about the incoming processes that will arrive while my existing process is running on the controller, then we would want to use a queue, which would make sure that we don't overload the controller.

C

But, for example, say I have exposed an endpoint to which we're going to push, let's say, 10,000 messages, and that is how you've built your controller, to consume 10,000 messages and process them.

C

Let's say I push a million messages within a second. You need a queue or something like that to not overload your system. But when you know that the processes are going to run in finite time, and you yourself are the originator of the next set of tasks, then why would you need something like that?

A

Well, you made an assertion there, Rishabh, that I think is not precise, in that we actually don't know the duration of the maintenance tasks. git gc of the Linux kernel could be minutes, and in fact it could be much longer than that if the computer is under memory pressure or has relatively low performance, whereas fetch of most repositories is seconds. Or take the other end: fetch of the Linux kernel over a slow link could be an hour or more.

A

It's two gigabytes.

A

And so, if we had just the beginnings, the first few commits, and we did an incremental fetch, and it's now retrieving two gigabytes of data, that is a very unpredictable duration.

C

Yes, that's true. So that means I'm not sure how much time it would take, and that is why I want to keep them separate and have a layer in between, like a queue, which will make sure that I consume a task only when the previous one is

A

complete. At least that was my mental model. Hrushikesh, does that fit for you, or what were you envisioning?

B

This is what I was thinking, but then again, I was thinking of an edge case. Assume I do a gc maintenance task on all the

B

Git caches, and assume there's another maintenance task which is scheduled hourly. If this gc task and another maintenance task like a prefetch have both been set at the same time, two maintenance tasks would enter the queue. And assume the garbage collection takes another hour.

B

If there are too many repositories on the Jenkins controller, then more tasks would be added to the queue while the previous maintenance tasks still haven't been executed. I think that would be a duplication of maintenance tasks on the queue.

A

But the condition you're describing sounds like what I would expect, right? If garbage collection is processor intensive and is going to take 60 minutes, I don't think we want the prefetch that was scheduled after it to begin until the garbage collection is done. Do we?

B

Yeah, we wouldn't. But assume the 60 minutes has finished and the cron syntax fires again after 60 minutes, because we set prefetch to run hourly. Then there would be two prefetches in the queue, or another commit-graph in the queue, so there would be a duplication of tasks in the queue. That's what I was telling you.

A

I see what you're saying. Okay, so you're concerned about duplicates. I guess the question there is: do we need to handle duplicates that arrive in the queue? Now I have to make a match to my Jenkins job experience. Let's be very precise: a queued Jenkins freestyle job with its parameters will be replaced,

A

i.e., not executed,

A

if a new build (build is the correct word) is scheduled with the same parameters. So this is the way that the Jenkins freestyle job handles the duplication problem you were just describing. And so are you suggesting we may want to allow that in the queue?

A

If a particular repository has a prefetch queued and another one is scheduled, we discard the second one, because we've already got one scheduled.

A

So discard duplicates rather than add them to the queue. And I think that

B

makes sense, yeah, because it wouldn't make sense adding it to the queue.

A

Yeah. For instance, don't schedule a second gc when a gc is already scheduled.

A

A second gc of a repository when a gc is already scheduled. And would we say already scheduled or in progress?

A

If we're already doing a garbage collection, scheduling another one of the exact same repository seems unlikely to be helpful.

A

Rishabh, you look like you're perplexed. Does that make any sense?

C

No, no, I was just trying to understand asynchronous processes in Java. I was looking at different ways to do it, so I was just lost a little bit. Okay, great. But I guess that makes sense, this discarding of duplicates.

C

I was thinking that we would have a way to identify a particular job or process using a unique identifier, and that would be calculated based on the parameters that we've defined for a particular process.

C

So if the identifiers match, then we could probably discard the new one or, either way, use the identifier to discard the previous messages.

A

Okay, so hang on, let's take that. A maintenance task, we think, should be uniquely identified by the repository it is processing and the task it is performing.

A

And then the idea is only one in the queue at a time.

A

Only one, yeah. So the queue contains unique tasks.
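
A minimal sketch of a queue that holds only unique tasks, keyed by the repository and the task as discussed above. It is written in Python for brevity and is not Jenkins code; the class name `UniqueTaskQueue` and its `offer`/`poll` methods are hypothetical. It covers only the "already scheduled" case: a task leaves the pending set as soon as it is dequeued, so the "or in progress" question raised in the meeting is left open here.

```python
from collections import deque


class UniqueTaskQueue:
    """FIFO queue that silently discards a task that is already waiting."""

    def __init__(self):
        self._order = deque()   # preserves first-in, first-out order
        self._pending = set()   # (repository, task) keys currently queued

    def offer(self, repository, task):
        """Queue the task; return False if an identical one is already queued."""
        key = (repository, task)
        if key in self._pending:
            return False        # duplicate: discard instead of queuing twice
        self._pending.add(key)
        self._order.append(key)
        return True

    def poll(self):
        """Remove and return the oldest (repository, task), or None if empty."""
        if not self._order:
            return None
        key = self._order.popleft()
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._order)


tasks = UniqueTaskQueue()
tasks.offer("https://example.com/linux.git", "gc")
tasks.offer("https://example.com/linux.git", "gc")        # discarded
tasks.offer("https://example.com/linux.git", "prefetch")  # different task, kept
print(len(tasks))
```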

A

Okay, Hrushikesh, was that helpful to have that discussion, or is that just making it more confusing?

B

It's fine, but I have a doubt. Is there any way of getting the path of all the caches on the Jenkins controller?

A

Yes, there is. At least, I'm pretty sure there is a way to list all the caches on the controller.

A

If worse comes to worst, what you do is list the directory contents.

A

But I believe there is actually a cache method in, let's see. Rishabh, do you remember? Is it SCM API?

A

Or maybe it was last year's project, wasn't it, that we had an SCM cache on the controller that we were relying on as a way to answer size questions? Oh yes, hang on. I think we can find it, just a minute. You've prompted me.

A

So someplace in here there is a concept of a cache.

A

GitToolChooser, isn't this the one that knows how to find the size of a repository?

C

Yes, it would calculate the size of the repository, then choose a tool.

A

And the size, where is it? Okay, sizeOfRepo. So here we go, there's a repository size cache, and that size cache, if I remember right, knows how to look inside. Yes, here it is: AbstractGitSCMSource.getCacheEntry. So there is, inside the Git plugin, this getCacheEntry method for a given repository URL. Or here, getCacheDir, I think, is the one we want, right? We want the folder on the file system that contains the cache, and here it is. So AbstractGitSCMSource.getCacheDir will do that for a given repo.

A

Now that doesn't give us the list of all of them, but I bet inside AbstractGitSCMSource we can find that, just a minute.

A

And if not, you could add a method. It certainly knows what the caches are somehow. Okay, so

A

"Keep one lock per cache directory", getRemote, getIncludes, getBrowser.

A

Maybe we look for a File cacheDir. Okay, so

A

Okay, getCacheDir is here. Here it is, iterating over the... okay, so it says cacheDir: give me one, and if it's not a directory, create it.

A

Cache entry, oh okay! That's

C

This cache entry.

A

Yeah, so here, let's look at it.

A

I need to use Emacs instead, so we can also use it for operating system functions. So here, let's look at it on my installation.

A

So this is the directory, the folder that that code is looking in, and you can see it here. It's got a bunch of caches in it, and if we look at this one, here's a Git file with a config. This one is Jenkins pipeline utils.

A

And if we look at this one we'll hope it's not the same thing pipeline utils, private.

A

So there definitely is an AbstractGitSCMSource, and it knows the list of local caches.

A

Now, I didn't see an iterator that lets us iterate over all the caches, so for that you may have to extend this class to give it a way to walk across all of them. TreeWalkingSCMProbe, no, that's not it.

A

Yes, so I think you may have to extend AbstractGitSCMSource so that it gives us a way to ask for the next cache folder, or an iterator over the cache folders.

A

So, Hrushikesh, did that answer that question?

A

Okay, so AbstractGitSCMSource: find the caches, find a cache for a repository, AbstractGitSCMSource.

A

No obvious method to iterate all caches. You could add one.
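
A minimal sketch of the fallback mentioned earlier, listing the contents of the caches directory, written in Python rather than as an extension of AbstractGitSCMSource. The function name `list_cache_dirs` is hypothetical, and the directory names in the demo are made up; the only thing taken from the discussion is that the controller keeps its Git caches as subdirectories of a caches directory.

```python
import tempfile
from pathlib import Path


def list_cache_dirs(caches_root):
    """Return the cache directories under `caches_root`, sorted by path.

    Plain files are skipped, and a missing caches directory yields an empty
    list (for example, when no Pipeline has run on the controller yet).
    """
    root = Path(caches_root)
    if not root.is_dir():
        return []
    return sorted(entry for entry in root.iterdir() if entry.is_dir())


# Demo against a throwaway directory that stands in for the Jenkins home.
with tempfile.TemporaryDirectory() as home:
    caches = Path(home) / "caches"
    (caches / "cache-one").mkdir(parents=True)
    (caches / "cache-two").mkdir()
    (caches / "notes.txt").write_text("not a cache")
    print([entry.name for entry in list_cache_dirs(caches)])
```

Walking the file system avoids changing the plugin API, but it bypasses the per-cache locks the plugin keeps, so an iterator inside the plugin may still be the better long-term choice.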

C

So we would have a cache entry per Git SCM source, because I can see that the Git

C

That is how AbstractGitSCMSource is expecting it for populating them.

A

So you made two statements there. One was that you used the word "every", and I'm not sure that it's every. At least I think, for instance, if all I do is run a freestyle job,

A

I won't get any cache for it, because there isn't any benefit to that cache. If instead I'm running a Pipeline job, then in order to execute the Jenkinsfile of the Pipeline job, that Pipeline job has to check out a copy of the repository on the controller.

A

Now, if it's got access to the GitHub API, it can get the Jenkinsfile without requiring a full clone of the repository.

A

Likewise with Bitbucket, I believe, and Gitea. But if it's using generic Git, it doesn't have those APIs available, so it has to ask for the whole repository, and in that case I would expect it to be cached.

A

So did that answer your question, Rishabh, or is there something I missed?

C

No, it did. I was just looking at the way they are trying to get the cache entry, because that is how we get the cache directory.

C

So I was just looking at how that cache entry is being tied to the Git SCM source.

C

That's it. I was just looking at that.

A

Okay, all right.

A

Good. Okay, so Hrushikesh, are you feeling comfortable with the idea that you may have to extend AbstractGitSCMSource to give it a way to iterate over all the caches?

B

Yeah, I'll try it out once and then I can

A

Okay, great. So on a Jenkins installation, you can look and see examples of this in the caches directory.

A

Now you may say: oh, but I'm not sure I've got a caches directory. If you find that you don't have a caches directory, that may mean that you haven't run a Pipeline yet, or haven't run a multibranch Pipeline. If it would help you, you are welcome to use a toy environment that I maintain as a Docker image.

A

A wider sample of Jenkins configuration, see

A

this thing, and I'll paste it here.

A

There's a repository I keep that has things that help me, and I think it has interesting configurations in it: many job definitions, multiple plugins installed, etc. If that helps, you can use it, and you start it with docker_run

A

and build it with docker_build.

A

So it requires Docker, but what it gives you is a Jenkins controller with several interesting jobs.

A

Now I've got a much more complicated version of it that's stored with credentials embedded in it, and that one is private. I can't make it public because it's got credentials stored inside.

A

So back to iterating over the caches on the controller. Hrushikesh, you're okay with the task: look through AbstractGitSCMSource, see if you can identify how you would iterate over the entries there, and decide if more work needs to be done, etc.?

B

Yeah, I'm fine with that. I'll look into it. Okay.

C

I thought that it would be beneficial if there was a document which defined the architecture of the SCM API. I remember Mark shared that when I was doing my project initially. This was written by Stephen Connolly, I believe, and it lays out the fundamentals of how Jenkins has implemented it.

C

So I think that would be helpful. I'm just trying to find that document.

A

Yeah, I wonder if that's... okay, let's see. I thought that was in, maybe, the "Writing an SCM plugin" page. No. Well, here's... oh yes, here we go: SCM consumer guide and implementation guide. Okay, so here's this one, "Writing an SCM plugin".

A

And then, if we follow that.

A

This one is the consumer guide, so those who are calling the SCM API are covered here, and it gives an overview of the different concepts and how they are used. And then there is the implementation guide here.

A

We hope the Git plugin is a correct implementation of the SCM API. If you find something mistaken there, by all means, let us know.

A

So is that what you were alluding to, Rishabh, or is there more than that?

C

Okay, I have this. And the question that I have related to this, which I'm not sure you've answered already, is that, the way I see it, this work of handling these processes is above the layer of a particular job or a pipeline.

C

When I was working in the Git plugin, I remember that I always had this concept that I'm going to get a job, which could be a freestyle or a pipeline, and then I have to do some work on it and give my results. That was the function I had to write, and that was the environment or context I was working under. With Hrushikesh's work, what I understand is we need to go above that layer, right?

C

We need to go above that, because we are actually sitting at the place where we decide when and where the jobs are going to run, or whoever is publishing these jobs. So now my question, and I'm sorry that it's a long-winded question, is: is this within the purview of the Git plugin?

C

Can this code reside within the Git plugin, or does it have to be above that, at some core infrastructure Jenkins level?

A

Good question. Since the Git plugin has things that it configures at a global level, I think it's safe for this to be inside the Git plugin. There are things that the Git plugin configures globally, and this will be effectively another significant, large, and interesting global configuration: yes, maintenance tasks should never be more than this many running concurrently; yes, maintenance tasks should prefer to start Sunday morning; or things like that.

A

That's what I was thinking anyway. Hrushikesh, does that align with what you were thinking? Were you thinking something different?

B

But there were some other problems which I was thinking about. When we are scheduling the task, there is cron syntax where you can run the maintenance task every minute. I was thinking, can we prevent users from running maintenance tasks like that? There should be a base

B

syntax, so that you can run a task only hourly or less often. You can't run any maintenance task more often than that, like every 30 minutes or every 15 minutes or every 5 minutes. The minimum base cron syntax should start at hourly.
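
A rough sketch of the "at most hourly" check suggested here, written in Python and not taken from Jenkins. It is a heuristic only: it looks at the minute field of a five-field cron line and accepts it when that field names a single minute (a number, `H`, or `H(a-b)` in Jenkins' hash notation). The function name `runs_at_most_hourly` is hypothetical, and aliases such as `@hourly` are deliberately not handled.

```python
import re

# A minute field that selects exactly one minute per hour:
# a plain number, Jenkins' H token, or H restricted to a range like H(0-29).
_SINGLE_MINUTE = re.compile(r"^(\d{1,2}|H(\(\d{1,2}-\d{1,2}\))?)$")


def runs_at_most_hourly(cron_spec):
    """Return True if the cron line cannot fire more than once per hour."""
    fields = cron_spec.split()
    if len(fields) != 5:
        raise ValueError("expected five cron fields, got %d" % len(fields))
    # '*', lists, ranges, and step values in the minute field all mean
    # several runs within the same hour, so they are rejected.
    return bool(_SINGLE_MINUTE.match(fields[0]))


for spec in ["H * * * *", "*/15 * * * *", "0 3 * * 0"]:
    print(spec, runs_at_most_hourly(spec))
```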

A

I have no objections to that. One of my worries to go along with that is, if I imagine a large Jenkins controller like ci.jenkins.io,

A

ci.jenkins.io has, let's see, probably 2,000 multibranch Pipelines,

A

from five to fifty jobs each. So with that number we could have as many as 10,000 jobs on that large Jenkins controller.

A

If I asked to schedule even hourly,

A

10,000 maintenance tasks may still not complete, right?

A

Because if they're all the Linux kernel, or they're all variants of some large repository, it could just take a long time. So this safeguard that you're suggesting: we've got to have that form of safeguard somehow, without making it time-based, but rather making it "don't overload the Jenkins controller".

A

Am I being clear in my phrasing, or have I misstated something?

B

But here, the ten thousand scheduled maintenance tasks which are written here, all of them will be happening sequentially, right?

A

They will. At least that was my assumption. My assumption is that because we queue them, they will be sequential, so it shouldn't overload the controller. But the user may say: hey, I've got 10,000 maintenance tasks in the queue and I'm only processing a hundred maintenance tasks an hour. I will never empty the queue. That was the thought I was having, and so I think we may have to have a way to, maybe not safeguard them, but alert them.

A

Should we graph the queue length over time,

A

so admins can see if they're scheduling too many?
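
To make the "I will never empty the queue" concern concrete, here is a small arithmetic sketch in Python. It uses the figures from the discussion (10,000 tasks scheduled each hour, about 100 processed per hour) and assumes, for simplicity, that duplicates are not discarded. The function name `backlog_over_time` is hypothetical; the list it returns is the kind of series a queue-length graph would plot.

```python
def backlog_over_time(scheduled_per_hour, processed_per_hour, hours):
    """Return the queue length at the end of each hour.

    Each hour `scheduled_per_hour` tasks arrive and up to
    `processed_per_hour` tasks are completed; the length never goes below 0.
    """
    lengths = []
    backlog = 0
    for _ in range(hours):
        backlog = max(0, backlog + scheduled_per_hour - processed_per_hour)
        lengths.append(backlog)
    return lengths


# 10,000 tasks scheduled hourly against a throughput of 100 per hour:
print(backlog_over_time(10_000, 100, 3))   # [9900, 19800, 29700]
# Even a single batch of 10,000 needs 10,000 / 100 = 100 hours to drain.
print(10_000 / 100)                        # 100.0
```

A steadily rising series like this is the signal an administrator would look for; discarding duplicates caps the growth but does not let the queue drain.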

A

I don't know if that makes any sense, but we've got the concept of these graphs over time that already exist in Jenkins elsewhere. If we look at the system load graph, I can bring it up. I think I can find it here. Let's look at this one. No, I don't remember where it is.

A

Oh wait a sec, I know where it is. It's under system load statistics. Here we go, here's a good picture.

A

This shows executors that are online and busy, and how long the run queue is. Something like this might help the administrator know if they had scheduled maintenance too frequently.

A

I don't know that we have to do that, but it's an option.

A

You're quiet. I'm worried that I'm blathering, saying things that aren't helping. What can we do to help?

B

I was thinking of this situation. As we are running maintenance tasks sequentially, assume now I'm running the commit-graph maintenance task. When I'm running the commit-graph maintenance task, all the repositories on the Jenkins controller would run the commit-graph maintenance task, so I don't think it is required to put the caches in a queue.

B

Once I finish this, then I take the next maintenance task from the queue and again run it on all the Git caches present on the controller. I don't see why we need to add the Git caches to the

A

Okay, all right good, all right.

A

So what you're saying is there's no need to add the caches to the queue.

B

The queue holds just information such as the prefetch, the commit-graph, the gc, and once you remove that information from the queue, then we can iterate through all the repositories present on the Jenkins controller and run it on all of them.

A

Oh, I see. So your concept is different than the one I was envisioning. I was envisioning that the iteration was by repository, but your concept is: let's say that the task is garbage collection, and we're going to garbage collect all repositories.

A

Now, isn't there a danger there that someone may say, I don't want you to garbage collect my Linux kernel repository but once a month?

B

Ah, okay, so we need to have some kind of set of features or settings in the maintenance UI page.

A

Well, I think it's a good question: do we iterate over tasks,

A

gc, prefetch, what was the graph calculation one? Commit-graph, thank you. Commit-graph, dot dot dot. Or do we iterate over repositories?

A

Iterate over tasks and process, all repositories.

A

I mean, that has the very nice property that the user interface is much more straightforward, doesn't it? By iterating over tasks, you say how often you want to garbage collect, and it's one single setting. We're not worrying about how many repositories are cached. That's a very elegant idea. I don't know why I didn't think of that. That's very good. Okay, iterate over repositories: the problem with iterating over repositories is that it then needs a much more complicated user interface.

A

Good, very good. Okay.

A

So prefer to iterate over tasks, to perform a single task on all repositories.
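
The task-first iteration preferred here can be sketched as a nested loop with tasks on the outside and caches on the inside. This is a Python illustration, not plugin code: the function name `plan_maintenance` and the cache paths are hypothetical. It only builds the command lines and does not run Git; `git maintenance run --task=<task>` is the Git command the task names correspond to.

```python
def plan_maintenance(tasks, cache_dirs):
    """Return Git command lines ordered task by task, then cache by cache."""
    plan = []
    for task in tasks:                 # outer loop: one maintenance task
        for cache_dir in cache_dirs:   # inner loop: every cache on the controller
            plan.append(
                ["git", "-C", cache_dir, "maintenance", "run", "--task=" + task]
            )
    return plan


commands = plan_maintenance(
    ["gc", "commit-graph", "prefetch"],
    ["/var/jenkins_home/caches/cache-one", "/var/jenkins_home/caches/cache-two"],
)
for command in commands:
    print(" ".join(command))
```

Because the outer loop is over tasks, the user-facing setting stays one schedule per task, no matter how many caches exist.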

A

Good. Okay, well now, with that small set, I guess we still need a queue, don't we, even if it's going to be a relatively small queue? Go ahead, Rishabh.

C

Yes, so I think this is a great point, but what I want to understand here is that there's a trade-off, right?

C

If we're not letting the user decide the repository where they're going to run this maintenance task, then the assumption is that our system is not going to care about the size of the repository or the parameters of the repository. We're going to care about pushing a maintenance task and then executing it over all the repositories.

C

But then, I believe we discussed that there were issues with that. For example, if we start gc as the first maintenance task and then run it all over the system, that being a computationally heavy task, do we then want to have the awareness to understand

C

how to make sure of that, whatever tasks we're running? Because I understood initially that, since the user has the ability to decide the task and the resource on which that task is going to run, that is, the repository, it would be easier. It's work that we don't have to do, and that is to make sure that

C

over-utilization of the controller is not happening. But now, since the user does not have an option to decide the repositories, we need to make sure that we have the intelligence within our system to figure out the kind of tasks we're running and the kind of utilization they're going to take once they start to run, because we're going to run them across the whole controller, with all the repositories.

A

So I'm not sure I understood that. Could you repeat the last part of your sentence again? It was that the thing we will need to check is that we are running the maintenance task over all of the repositories, but that it is not overloading the system. Ask your question again, Rishabh. I'm sorry for not being able to keep up.

C

No, actually, I think I'm just thinking out loud, probably, about whether that is something we need to worry about or not. My question was that if we are not giving the user an option to decide upon which repositories they want to run these tasks, then what we're doing is just making sure that, okay, this is the schedule, I'm going to run a gc on all of the repositories,

C

or I'm going to run commit-graph on all of the repositories.

C

Okay, I guess it's not a question. I think there is no problem doing that.

B

You know, a regular expression, or the minimum size. While processing each Git cache, we can check all these conditions which the user has put, and then decide whether to run the maintenance task on that repository or skip it. That was what I was thinking.

B

Like a filter, yeah.
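
A small sketch of the filter idea: a regular expression on the repository URL plus a minimum size, checked per cache before a maintenance task runs. It is written in Python as an illustration only; `CacheInfo`, `select_caches`, and the parameter names are hypothetical, and the sizes in the demo are made-up numbers.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheInfo:
    """What the filter needs to know about one cached repository."""
    url: str
    size_kib: int


def select_caches(caches, url_pattern=None, min_size_kib=0):
    """Keep caches whose URL matches `url_pattern` and whose size is large enough.

    With no pattern and a minimum size of 0, every cache is selected, which
    matches the simple first version with no filtering at all.
    """
    regex = re.compile(url_pattern) if url_pattern else None
    return [
        cache
        for cache in caches
        if cache.size_kib >= min_size_kib
        and (regex is None or regex.search(cache.url))
    ]


caches = [
    CacheInfo("https://example.com/torvalds/linux.git", 2_000_000),
    CacheInfo("https://example.com/team/small-lib.git", 300),
]
# Only run the expensive task on repositories of at least 1,000 KiB.
print(select_caches(caches, min_size_kib=1_000))
```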

A

Okay, yeah. See, for me, I think you would already have something useful and helpful if we had a single page that said: here are the five maintenance tasks.

A

When would you like them to start, and how frequently would you like them to run? That's already something quite useful, and I think that will already help performance of operations. Then you could later extend it with the additional filtering, if you'd like, or do something more sophisticated. But for the first round: garbage collection, how often? Commit-graph, how often? Prefetch, how often? Isn't that already something that would be valuable to users? And you could do it independent of all sorts of other things.

C

I agree, an iterative approach is always best, right? It's better to, you know, open with the first thing.

A

So start with something simple. Okay, so Hrushikesh, are you okay with that idea, then? I think this was your key concept and I had missed it: iterating over tasks. I like that a lot.

B

That was what I worked on.

B

Okay, a sample UI. I haven't used the latest UI things in it, but that's something that's like a start.

A

Good. I think that's already a great thing that you could begin coding now: implement the sample UI, be ready to show it, and we could have a conversation about how the UI works, because the first experience the user has is with the UI. So I like that.

A

Good. Now, you're familiar with, or you've seen at least, the design library on weekly? Good, okay, good.

A

I like that very much.

A

Now I apologize, we're almost at the hour and it's late at night for me, and I'm a little bit weary. I'm unavailable next week because I'll be out of town. Rishabh, you'll provide the Zoom link. It'll be a different link than the one that's in our meeting, and the two of you can meet separately. Now, do you have permission to edit this? Oh yes, you do. Good, okay, so you've got a way to edit

A

the notes. Be sure you record the session, and then after I return I'll help so that we can upload it to YouTube, so that we've got it recorded publicly.

A

Next session, in a week.

A

Mark will miss the session.

A

Okay, send the URL, record the session. Great.

A

Hrushikesh, anything else that we should discuss in the last few minutes before I go to sleep?

A

Okay, now I apologize that the recording from our last session isn't available yet. I will do my best to get it ready within 24 hours.

A

Maybe what I should do is, let's end now, and I'll go upload those. At least then I can give you a pointer to them uploaded, even if I haven't finished all the processing of them, because then at least you've got access to them.

A

Would that work okay for you? Yes? Great, let's do that then. Rishabh, anything else from you? No? And Hrushikesh, anything from you?

A

Okay, then I'm going to go ahead and call our session done for today. Yes, I just, sorry, our session

C

is on Wednesday, right, not on Friday?