From YouTube: NUG meeting Feb 2021
Description
Video from our monthly NUG meeting on Feb 18, 2021. The topic of the day was "Making the most of Slurm at NERSC" with Shahzeb Siddiqui from the NERSC User Engagement Group.
A: All right, so people who have been to a few of these now will be familiar with this. Our plan is for quite an interactive meeting. We've got somewhat more than 50 people, which could potentially get a little noisy, but that's okay: we'll start off reasonably informally, and if we need to, we'll go to a slightly more formal Q&A format. The key point here is that this is not meant to be just a presentation; it's an interactive meeting, a forum, an opportunity to share things. Please participate: unmute yourself and speak when you have something to say. If it starts getting noisy, we'll ask people to raise their hands first. Also, if you haven't already seen it, we have a NERSC Users Slack, and there is a #webinars channel there. That's another good place to ask questions and continue the conversation; one of the nice things about it is that the chatter there is retained beyond the end of the Zoom meeting. The slides, in fact, are already up on the web page associated with this meeting, which is under www.nersc.gov under Events. You can use those slides to find the link, and if you haven't already got the Slack link, we can post it in the chat here.
A: Our agenda follows our normal agenda for these meetings. We'll start out with Win of the Month, which is an opportunity to tell success stories; then Today I Learned, which is an opportunity to talk about things that didn't go as smoothly, but that we can maybe learn something from. We'll have a section for announcements and calls for participation, and then we'll go into our Topic of the Day, which is around Slurm. Shahzeb Siddiqui from NERSC's User Engagement Group will give us some tips on how to work with Slurm to get the most outcome for the least wall-clock time, the least queueing time, and the least cost.
A: So let's start out with Win of the Month. The purpose of this segment is to show off an achievement, or to shout out somebody else's achievement, and it doesn't have to be big. It can be a fairly big thing, like having a paper accepted, or a relatively small thing, like solving a bug that had kept you stumped for a few days.
A: This is also a good opportunity to tell about scientific achievements. We know that NERSC users are doing some pretty amazing things. We don't always know the details of what those things are, but we really like hearing about them, and we think they're inspiring, for us and for each other. In particular, NERSC hands out awards every year for high-impact scientific achievement and for innovative use of high performance computing, and maybe you're doing some work that is a candidate for those awards; we'd love to hear about it. On that note, the early career nominations for those awards are actually due this week. We'll have an announcement about that fairly shortly, but we're especially keen to hear about those: we've got one more day before the deadline. So for any high-impact scientific work or innovative uses of high performance computing that you either are doing or know of, please nominate by the end of this week.
A: Oh, something I forgot to mention earlier: we are recording this session, and the video will be available on the website afterwards.
A: So maybe people are a little shy; I'll kick one off. I had something I was pretty pleased with how it came out, working with one of our users who has a high-throughput-computing-type workflow. This particular workflow has an extra challenge: the individual tasks are MPI tasks, and running multiple per-node MPI tasks across many nodes is a little bit tricky, because most of the workflow tools use the MPI infrastructure itself to manage the workflow. So running an MPI task inside of an MPI job, essentially inside of a larger workflow, has a few challenges. There is a known, at least in principle, solution, which is to use Shifter containers: create a Docker container on the laptop, set up the MPI workflow to happen locally to a node within that container, and then use Shifter to run that container in parallel multiple times across many nodes of Cori. It took a couple of days of backwards and forwards, but we were able to make a working workflow of it, and a variation of that will make its way as an example onto our documentation pages in the not-too-distant future. But I was pretty chuffed about that. Anybody got any other stories they'd like to share?
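[A minimal sketch of the pattern described above, assuming NERSC's Shifter integration with Slurm; the image name and per-node script are hypothetical placeholders.]

```bash
#!/bin/bash
# Run one independent containerized task per node across 4 Cori nodes.
#SBATCH --qos=regular
#SBATCH --nodes=4
#SBATCH --time=01:00:00
#SBATCH --image=docker:myuser/mpi-task:latest   # Shifter pulls this image

# Each node starts its own Shifter container instance; the MPI workflow
# runs locally inside each container rather than across the whole job.
srun --nodes=4 --ntasks-per-node=1 shifter ./run_local_workflow.sh
```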
C: We have one; it's not exactly a Win of the Month, more some I/O-related stuff that might be interesting to share. I was trying to run globally high-resolution simulations in the context of climate simulations, and I/O has been taking quite a large chunk of the time. The file sizes range from tens of gigabytes to 200 gigabytes of model output.
C: I didn't realize that I had not set the Lustre file striping properly before. For example, at the end of the simulations the model writes a restart file that's more than 100 gigabytes, and in the worst case it was taking about two hours just to write this file. But after I set the file striping to medium, I believe, it reduced to maybe 10 or 20 minutes.
C: An improvement, exactly. And the other nice thing is that once I set this property on the directory, all the files written underneath it after that inherit the property. So all the output files are naturally striped at the medium size just by setting it on the top directory. That's quite handy.
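[A sketch of the kind of striping commands being described, assuming a Lustre scratch file system; the stripe count and directory are illustrative, not NERSC-tuned values.]

```bash
# Stripe a directory widely so that new, large files created under it are
# spread across many Lustre storage targets; existing files are not restriped.
lfs setstripe --stripe-count 24 $SCRATCH/model_output
lfs getstripe $SCRATCH/model_output   # verify the layout that new files inherit

# NERSC also provides convenience wrappers (e.g. stripe_medium <dir>) with
# site-recommended settings; see the NERSC Lustre striping documentation.
```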
A: Yeah, good work; a nice outcome.
C: Yeah, and maybe another one I might want to share: this exercise was part of my preparation for running a production high-resolution simulation, so we were testing different numbers of nodes, different I/O settings, and compiler optimizations. At the end of the exercise I was calculating: in one year of real time, how many simulated years can we get? That's the purpose of the exercise, and I realized something while doing that.
C: Just increasing the number of nodes to get some improvement in the model throughput does not necessarily lead to many more simulated years being done in one year. Let me share a graph in the chat: I made a Google spreadsheet with statistics for the average queue wait time on Cori Knights Landing (KNL) over the past year.
C: Actually, I shared it in the chat, but I don't know if everyone can open it.
C: That's one reason, but also I found about a 20 to 30 percent decrease in simulation time going from, say, 100 nodes to 200, but that improvement was completely overwhelmed by the wait time difference, because the wait time difference is more than 50 percent.
C: So once I take those things into consideration, to work out how many years of simulation we can get in a given year, 100 nodes is actually better than using 200 nodes. I have to go up to 1024 nodes or more to actually get a better annual simulation throughput.
A: So it's worth learning to minimize queue wait time and find the optimal point. That's a good consideration: looking for the sweet spot, including the queue wait time, for the size and shape of your job is a worthwhile exercise.
A: Any other wins or stories to share?
F: Can I actually ask a quick question to the previous speaker? This is quite relevant to what we are doing, and I was curious about your table. Let me try to understand: is the wait time the average wait time per node, or is it sorted by the number of nodes your job requested?
C: This is the wait time. So if you request a number of nodes between 64 and 127, it depends on the month, but on average, for example, in January 2020 you would have had to wait almost 50 hours.
A
Okay,
thank
you.
So
so
that's
some
very
interesting
tips
and
we're
very
much
getting
that.
That's
that's
very
much
in
line
with
our
topic
of
the
day.
So
what
do
I
do?
Is
I
pause
this
conversation
until
we
get
up
to
that
section
and
I
think
we'll
have
an
opportunity
to
to
go
into
some
more
depth
in
detail
and
and
shows
I've
just
got
some
some
background
information
about
why
this
is
the
case
as
well.
A: Okay, great. Time is actually getting along, so we might move on to our next section, which we've sort of started covering already: it's kind of the flip side of Win of the Month, which is Today I Learned. The thinking behind this is that research happens by experimenting and getting things wrong a few times until we get them right.
A: I think we can learn not only from the points where we ourselves got stuck or tripped up, but also from the things that were challenging for other people. This doesn't have to be things that didn't work, though; it's also an opportunity to call out resources you've stumbled across or discovered recently that might be valuable to other NERSC users.
A: The tip just before, about the difference that setting striping made to his code, is a good example here.
A: Well, we've had a few discussions of things that we learned, so we can step on to the next item. Our next segment is about announcements and calls for participation. We have a couple of announcements from the NERSC side, but this is a general user forum, and we're also keen to hear about conferences, events, and so on that you know of, are perhaps organizing, or in particular are contributing to, that would be valuable to other NERSC users.
A: From the NERSC side, you hopefully saw this already in your weekly email: the NERSC early career nominations for the HPC Achievement Awards are due at the end of this week, that is to say, by tomorrow.
A: We have two categories for these awards. One is for High Impact Scientific Achievement, which recognizes work that has had, or is expected to have, an exceptional impact on scientific understanding, engineering design for scientific facilities, or a broader societal problem.
A: Yeah, society has a fairly big, invisible target in the pandemic at the moment, for that. The other award is for Innovative Use of High Performance Computing, which recognizes researchers who have used NERSC resources in innovative ways to solve a significant problem, or who have come up with a new methodology that might have a large scientific impact.
A: This can include things like using HPC in a field where it hasn't previously been used terribly much, or combining computing, data, networking, and edge services to do something new in a domain where HPC is already in use. As for the early career eligibility, this is aimed particularly at users of NERSC resources who are early in their career, so postdocs, or NERSC users who have recently received their degree.
A: You can nominate people for these awards via the link in the weekly email, or you can send us a ticket with the topic being a nomination for an award, and give us a pointer to whoever you know of that would be a good recipient for one of these awards.
A: I think that's the only announcement we have at the moment on NERSC's side, apart from a few other announcements of things going on that are in the weekly email.
A: Does our user community here have any announcements or events that would be good to know about?
A: Okay, so if we don't have anything else in the way of announcements, we can go on to our Topic of the Day. I'd like to introduce Shahzeb Siddiqui, who's part of our User Engagement Group here and has a lot of experience with Slurm. He is going to walk us through some of the details of what Slurm is doing underneath, how it works at NERSC, and how you can use that information to get more out of Slurm and spend less time in the queue. If you'd like to just say "next", I can click the slides through to the next one when you're ready.
G: Thanks, yeah, you can start. Okay, so this is going to be a quick recap for you to get up to speed on Slurm and pick up some best practices for using our cluster. When you use our cluster, you typically log in to our login nodes. Please note that these login nodes are shared with many users, so they are not meant to be used as a computational resource.
G: What you would typically do is specify your sbatch directives, for example the number of nodes, and then submit your script. What happens on the back end is that your script is processed by the Slurm server: it processes the script, finds the nodes to allocate, and fulfills your request.
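[A minimal sketch of such a batch script, using standard Slurm directives; the QOS, node count, constraint, and task count are illustrative values rather than a recommendation.]

```bash
#!/bin/bash
#SBATCH --qos=regular        # which QOS/queue to submit to
#SBATCH --nodes=2            # number of nodes requested
#SBATCH --time=00:30:00      # wall-clock limit for the job
#SBATCH --constraint=knl     # node type on Cori (knl or haswell)

# Launch the application across the allocation (68 cores per KNL node).
srun -n 136 ./my_mpi_app
```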
G: On Cori we have several login nodes; those are the icons in blue on the slide. And some of our nodes are also used as compute resources: when a job gets allocated, you may get allocated to one node, which would be the yellow icon, and if you request multiple nodes, Slurm will do that for you.
G: Those could be the red icons. Typically at the end of the job you will get an output or error file saved on disk. One thing to note: we have queues that give you exclusive nodes, so you can do that, but we also have, for instance, a shared queue, where a node is shared between multiple users.
G: Okay, so when you submit a job, through sbatch or otherwise, it will go through several job states. The first is pending: Slurm will try to figure out when it can get the resources, so the job sits in this state. Once it's done pending, Slurm will configure the nodes that are required to run the job, and then your job will actually be in the running state.
G: And finally, the job either completes successfully, in which case it will be marked completed, or, most of the rest of the time, you will see it marked failed. If you run out of time, for instance a timeout or something like that, it will also show up as failed, and if you cancel a job, that will show up as cancelled. Now, during this job lifetime, you can actually monitor the job through either squeue or scontrol.
G: One thing to note is that you use squeue and scontrol during the lifetime of the job, from pending to completed, but you can use sacct to query historical jobs. Also be aware that using these commands frequently will ping our Slurm server, so making too many requests, for example in a loop, can actually impact the server; that's generally prohibited.
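[Illustrative monitoring commands; the job ID is a placeholder, and the polling loop is shown only to emphasize sleeping between queries.]

```bash
squeue -u $USER                # my queued and running jobs
scontrol show job 1234567      # full details for one live job
sacct -j 1234567 --format=JobID,State,Elapsed,ExitCode   # after completion

# If you must poll, keep it gentle: sleep between queries.
while squeue -h -j 1234567 | grep -q .; do
    sleep 60
done
```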
G: The idea is that backfilling allows us to increase the throughput of our jobs. The intent is to allow lower-priority jobs, like short-running jobs, to be filled in wherever there are gaps in the cluster.
G: Typically, short-running jobs are the really small boxes, with either a low run time or a low number of nodes. In this diagram you should read it from the top down, so the jobs at the top were submitted a long time ago.
G: The tall rectangles are small jobs with a long running time, while the horizontal rectangles are jobs spanning many nodes, say a thousand nodes, but with a very short running time. These are generally hard to schedule.
G: This is bad for scheduling, so Slurm will try to accommodate it and put it into the schedule, but what backfilling does is allow short and small jobs to be run earlier in the schedule. For instance, the blue and the purple ones get started earlier, so that we improve the throughput. Without backfilling, Slurm would generally just do first come, first served, and that's not going to be optimal.
G: One tip is that if you have a job, you don't need to guess the wall-clock time exactly; you can give the minimum amount of time that you need. For instance, if you have a job that runs, let's say, six hours, but you want Slurm to pick it up more quickly, you can set a minimum time limit of, let's say, three hours in this example.
G: You may have a job where you don't know exactly how much time it takes, but you expect that three hours is sufficient. With min time, Slurm will actually look at it and say: okay, I have a three-hour minimum time limit, and I will fit that in, as in this diagram.
G: If you look at the diagram, the whole box is blue, and that's the six hours, but the dark blue could be the three hours, and Slurm will just try to fit that into the schedule. One thing to note is that some of our queue policies require a min time; for instance, the flex QOS is one of them, so that's just one thing to be noted.
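[A sketch of the option being described, sbatch's --time-min; the QOS and values are illustrative, following NERSC's flex queue conventions.]

```bash
#!/bin/bash
# A flexible-length job: Slurm may shrink the allocation down to
# --time-min if that lets the job fit into a backfill gap sooner.
#SBATCH --qos=flex
#SBATCH --constraint=knl
#SBATCH --nodes=4
#SBATCH --time=06:00:00       # the most wall time we would like
#SBATCH --time-min=02:00:00   # the least wall time we can make progress in

srun ./my_app
```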
G
Yeah,
I'm
just
looking
exactly
so
according
to
what
it
does.
It's
set
the
minimum
limit
on
the
job
allocation.
So
the
time
is,
it
allows
to
be
executed
earlier.
G
It
doesn't
change
the
the
actual
job
time
itself,
so
it
improves
the
back.
The
backfilling
scheduling
algorithm
so
like
allows
scheduler
to
see
that
you
know
this
job,
which
is
supposed
to
be,
let's
say
six
hours.
It
doesn't
need
all
that
time
just
needs.
Let's
say
the
amount
up
to
min
time,
which
may
be
like
two
or
three
hours
and
it
will
schedule
a
job
ahead.
A: So the time gets adjusted at the time that the job starts. If there's a gap in the schedule of, say, three and a half hours, and you have a min time of two hours on a six-hour job, Slurm will adjust your job's time limit to three and a half hours, to exactly fill that gap.
A: It adjusts the time at the moment that the job starts. Slurm has a schedule; for instance, it knows that some wide job is expected to have nodes available to it, to start in, say, three and a half hours from now.
A: Okay, sorry, I kind of interrupted there. Would you like to continue? Okay.
G: Next slide. Okay, so in terms of the queues that we have: typically you will use the regular queue for most of your workload. In terms of cost, the regular QOS, I believe, has a 48-hour time limit, so it should be sufficient for most of your workload.
G
If
you
need,
for
instance,
some
emergency
workload,
premium
queue
is,
is
the
way
to
go,
it
will
submit,
it
will
schedule
the
job
much
faster,
but
you
know
it.
It
is
more
expensive
right.
Premium
queue
is
also
special,
so
yeah
q
in
in
the
sense
that
not
all
user
automatic
access
to
it,
the
your
pi
will
have
to
grant
you
access
to
this,
and
also
the
charge
factor
gets
changed
once
you
reach.
I
believe
two
percent
is
our
is
the
rate
so
just
be
mindful
of
that.
G
Only
use
this
queue
when
you
really
need
to
debug
is
a
really
good
queue
if
you
want
to
just
submit
a
job
just
for
debugging
purpose,
so
I
think
it's
got
a
30
minute
time.
So
it's
it's.
G
You
get
a
good
turnaround
time.
If
you
need
interactive
access
to
a
node,
then
use
the
s
the
interactive
queue,
so
one
thing
to
know
is
you
need
to
use
sl
up
s
patch
is
not
going
to
work.
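[A sketch of requesting such an interactive session; the node type and time are illustrative.]

```bash
# Request a one-node interactive allocation and get a shell on the compute node.
salloc --qos=interactive --constraint=haswell --nodes=1 --time=01:00:00
```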
G
A
log
qos:
this
is
good.
If
you
have
workload,
that
really
is
not
that
important.
You
don't
mind
having
a
long
kind
of
wait
time,
so
you
you
can
use
that
and
it's
also
very
cheap.
The
flex
q
is
is
good.
If
you
support
like
flexible
wall
time,
this
queue
requires
you
to
have
the
min
time
or
time
main
option
set
to
at
least
two
hours.
It
won't
accept
the
job.
If
you
don't
set
this,
so
it
has
to
be
less
than
two
hours
and
it's
only
available
on
k.
G
L-
and
this
is
good
for
if
you
have
your
job
supports
like
restarts
or
if
you
wanted
to
be
able
to
start
if
the
job,
for
instance
gets
killed,
if
you're
going
to
use
a
shared
queue,
this
is
shared
with
other
users.
So
when
you,
when
you
submit
a
job,
you
get
into
a
node,
keep
in
mind
that
you
will
be
sharing
this
note
with
other
users.
G
So
this
is
good
if
you,
if
you
just
want
to
get
your
jobs
done,
but
you
don't
care
about
in
the
case
of
like
performance,
you
don't
need
exclusive
node.
If
you
do
need
an
exclusive
node,
then
you
should
use
something
else
or
just
use
the
exclusive
option:
the
overrun
queue.
This
is.
This
is
good
if
you
have
a
zero
project,
account
balance
and
you
still
need
to
submit
jobs.
G
Otherwise
you
can't
use
this
queue
and
the
real
view
is
only
for
a
special
purpose.
If
you
need
real
like
immediate
access,
so
yep
and
you
can
take
the
take
a
look
at
the
link
below
for
the
qos
next
slide.
G
Okay,
so,
as
I
mentioned,
you
know
you
most
of
your
workload
should
be
going
through
the
regular
queue.
You
know.
We
also
have
the
premium
queue.
That's
for
you
know,
like
you,
know,
immediate
kind
of
needs
like
if
you
have
some
kind
of
conference
that
you
need
to
submit
something
and
only
have
like
a
week
or
so
you
know
you
could
use
the
premium
queue
or
so
for
ins.
One
thing
to
note
is
the
flat
skill.
G
You
know
it
is
we
we
do
discount,
I
think
75
on
and
it's
only
on
knl
right
and
you
know
because
it's
on
node
you,
you
can
actually
pretty
much
on
the
node
itself.
So
that's
pretty
good
and
yeah.
G
The
large
kit,
like
if
you're
submitting
like
large
jobs
on
known
nodes,
then
we
also
do
discount
like
up
to
10
20
foods
with
a
50
percent
discount
so
that
it's
good
to
know
if,
if
you,
if
you
want
to
submit
like
a
larger
workload
and
and
this
is
available
on
the
regular
regular
queue,
okay
and
yep
so
just
to
summarize,
you
know,
I
think
one
thing.
One
thing
that
you
may
learn
is
you
know
you
is
the
scheduling
them.
Do
we
use
backfilling?
G
So
if
you
have
short
running
jobs,
you
know
just
try
to
use,
let's
say
mint
or
time
mission
to
get
your
jobs
too
quickly.
You
know
we
support
several
cues.
So
pick
the
right.
Cue
and
that's
you
know,
use
a
flex
queue
when
you
need
it
will
save
you
money
for
sure
users
can
if
you're
gonna
submit
large
number
of
jobs
up
to
1024.
A: Thanks, Shahzeb. So we've got about five minutes or so for Q&A, and I see that Kanupriya, I'm not sure if I'm pronouncing it correctly, has asked a question in the chat: if you ask for four nodes using the interactive queue, can you do this split between two different jobs? I think I want to clarify that question: do you mean running different sruns together, or do you mean requesting two jobs of two nodes each?
A
Yes,
you
can
do
that.
We
have
in
the
examples
page
of
our
docs,
which
is,
I
might
put
a
put
a
link
to
it
in
there
in
a
chat
in
a
moment
you
can-
and
you
can
do
this
interactively
as
well
start
an
s
run
in
the
background
where,
for
one
of
the
jobs
and
then
start
the
other
one,
what
you'll
probably
want
to
do
is
actually
I
I
assume
that
you're
looking
at,
for
instance,
comparative
debugging,
that's
correct
yep
as
the
use
case.
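[A sketch of splitting one interactive allocation between two concurrent job steps; application names and task counts are placeholders.]

```bash
# Get a 4-node interactive allocation...
salloc --qos=interactive --constraint=haswell --nodes=4 --time=00:30:00

# ...then, inside it, run two 2-node job steps side by side:
srun --nodes=2 --ntasks=64 ./app_version_a &   # first step, backgrounded
srun --nodes=2 --ntasks=64 ./app_version_b &   # second step on the other nodes
wait                                           # block until both steps finish
```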
A: Are there any other questions that people would like to ask?
B: So this is kind of related to what the earlier speaker was talking about. If you have something that you expect can backfill reasonably well, so the total number of core hours for your job is modest, then essentially, if you can divide the job up in a way such that you essentially have a rectangle in time-node space...
A: Yeah, as a general rule, a short time is much easier to find gaps for; even quite wide gaps are relatively easy to find, compared to long gaps.
D: So, just digging that up... okay.
A: So I'm just posting now in the webinars channel one of our docs pages, example job scripts, and this is a great resource (well, we hope it's a great resource) for examples of lots of different use cases, which include multiple simultaneous jobs. I'll need to dig a little further to find the link specifically for ssh-ing into a node.
A: There are a couple of ways of doing this: you can use job arrays, which use a single job script to manage a whole series of jobs with a single sbatch, or you can sbatch a whole lot of individual jobs.
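[A sketch of the job-array form; the array size, resources, and input naming are hypothetical.]

```bash
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --nodes=1
#SBATCH --time=02:00:00
#SBATCH --array=0-31          # one submission spawns 32 array tasks

# Each array task gets its own index and can pick its own input file.
srun ./process_case input_${SLURM_ARRAY_TASK_ID}.dat
```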
A: The request kind of acts as a single job, and then basically the individual jobs get pulled off the request. My understanding, and maybe Shahzeb has more information about this, is that the request will age: the entire request will reach the same priority, and then, when Slurm is scanning it, it will pull off however many jobs it can start at the moment, or at least the next job that it can start at the moment.
G: Yes. So one thing that we didn't discuss today was how Slurm does job priority. Currently we use the Slurm feature called multifactor priority: when Slurm figures out which job needs to be scheduled next, it does it based on priority, and there are multiple factors.
G: Multiple factors go into job priority, and one of them is age: the age of the job as it sits in the queue. As you may know, if a job is waiting in the queue for a longer time, the age factor will grow. That means Slurm figures out that this job has been in the queue for a long time and needs to get scheduled.
G: Likewise, if the job has been waiting a long time, that's not good either. You can take a look at the link. One of the commands that you can use to actually get job priority is sprio; we haven't documented it, but there are some useful things you can get out of it.
G
If
you
think
that
would
be
useful,
I
think
we
can.
We
can
try
to
put
some
more
documentation
into
that.
One
thing
to
note
is
that
espio
works
only
for
pending
jobs.
So
if
you
have
a
job
that's
pending
and
you
want
to
know
the
priority
of
pending
jobs,
then
that
could
help
you
can
try
to
see
your
job
and
then
also
sorted
by
the
queues
and
it.
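[Illustrative sprio usage; the job ID is a placeholder.]

```bash
sprio -j 1234567    # per-factor priority breakdown for one pending job
sprio -l            # long listing for all pending jobs

# The total priority number is also visible through squeue:
squeue -u $USER -O jobid,qos,prioritylong
```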
A: Yeah, I was going to say: sprio gives you the breakdown of the components of your job's priority. I think the total number can actually be found in squeue as well.
A: Yes. So we have a couple more questions that have come up. I might address them in reverse order, basically because of how easily answered they are. Ronnie asks: do specific projects have different priority? And the answer is, for the most part, no. Different projects have different allocations, which, I guess, influences how many jobs they can run over the entire year...
A: ...but the priority is all the same. There is kind of an exception, in that projects can request access to the realtime queue, and that's generally for sort of special cases, such as needing to synchronize with time on an instrument somewhere, for instance, superfacility-type work, and realtime queue jobs have super high priority. But that's kind of a special case; in the normal course of things...
A: ...one project and another project have the same priority in the queue. I see Shahzeb's posted an example of what the sprio output looks like.
A: Someone, if I'm pronouncing the name correctly, asks: can we explain a bit about the denial-of-service problem?
A
I
guess
you
mean
about
too
many.
Too
many
s
runs
or
too
many
sqs
requests.
A
Yes,
sir
yeah,
so
this
is
this
is
because
corey
is
quite
a
large
system
with
quite
a
lot
of
users,
but
the
the
scheduling,
I
guess,
is
a
centralized
service,
and
so
you
know
the
the
slum
daemon
basically
has
to
you
know:
answer
requests
as
well
as
do
the
scheduling
and
if
it
gets
too
many
requests,
you
know
it's,
it's
multi-threaded,
it
scales
reasonably
well,
but
it's
it's
not
omnipotent
and
it
is
entirely
possible
to
overwhelm
it
with
requests
and
they
your
one
easy
way
to
do.
A
That
is
by
calling
you
know,
sqs
or
sram
in
a
loop,
particularly
with
no
no
sort
of
a
sleep
between
them.
Yeah,
the
the
demon
will
just
sort
of
get
so
many
requests
coming
from
so
many
different
directions
that
it
starts
to
drown
under
them
a
little
bit.
So
that's
why
we
ask
users
not
to
do
that.
A
Certainly,
never,
I
think
deliberately
it's
mostly
by
mostly
when
this
happens
it's
by
users
who
have
a
a
genuine
use
case.
You
know,
I
need
to
start
a
thousand
things
in
my
high
throughput
workflow
and
they
do
the
kind
of
obvious
standing
answer
or
make
a
loop
that
runs
s
run.
A
Which,
unfortunately,
has
this
side
effect
of
hitting
the
scheduler
very
very
hard?
So
as
far
as
I'm
aware,
I
I
don't
know
of
any
malicious
attacks
like
this,
but
it
has
definitely
accidentally
happened.
A
And
oftentimes,
if
you
see
slow
response
from
slurm,
if
you
yo,
if
you
type
sqs
and
it
seems
to
take
ages
to
come
back,
it's
probably
because
slum
is
dealing
with
a
lot
of
requests.
A
There
was
another
question
further
up:
oh
for
jobs
that
finish
within
two
hours
on
k.
L.
Is
it
better
to
use
the
flex
cube?
So
that's
an
interesting
question.
If,
if
the
job
finishes
within
two
hours,
chances
are
it'll
run
fine
in
the
normal
queue,
you'll
get
started,
sort
of
straight
away.
A: You could also, for instance, if you've got a job that will run up to two hours but can run in one-hour chunks, use the flex queue for that as well.
A
We're
actually
getting
close
to
the
top
of
the
hour,
and
I
think
we've
covered
the
questions
that
have
come
in
in
the
chat.
We
might
be
able
to
continue
with
a
q
a
session
sort
of
at
the
at
the
end
of
the
meeting.
For
those
who
are
still,
you
know
around
and
available,
but
what
we
might
do
now
is
flip
through
the
last
couple
of
items
in
our
agenda
so
that
we
can
finish
the
sort
of
formal
part
of
the
meeting
before
12
o'clock,
so
the
next
one's
a
fairly
quick
and
easy
one.
A
What's
coming
up
next,
we
are
always
looking
for
topic,
requests,
suggestions
or
better
still
nominations
and
volunteers
to
host
a
topic
of
the
day.
This
is
a
a
great
opportunity
if
you
want
to
do
a
kind
of
a
relatively
short.
You
know,
lightning
talk,
kind
of
level,
overview
of
some
interesting
work
that
you're
doing
using
nurse
resources
and
some
tips
that
you've
learned
that
might
help
other
users.
You
know
we'd
love
to
hear
about
it.
A
Last
month's
numbers,
so
our
availability
was,
was
pretty
high.
In
january
we
had,
you
can
see
this
black
era
was
the
scheduled
maintenance
that
happened
during
the
allocation
year
transition.
A
We
did
have
a
very
short
schedule,
unscheduled
outage
later
in
the
month
other
than
that
corey's
scheduled
availability
was
pretty
close
to
100
and
the
storage
systems
scheduled
availability
was
at
100.
That
was
good
news.
Corey's
utilization
was
nicely
high.
Things
like
the
flex.
A
Q
are
actually
really
helping
us
here,
because
you
know
an
increasing
number
of
users
are
making
their
jobs
more
easily
fit
into
gaps,
we're
not
getting
very
much
time
of
empty
nodes,
sitting
idle
because
they're
waiting
for
other
nodes
to
be
unavailable
for
some
job,
we're
able
to
make
good
use
of
them.
A
So
that's
that's
great
to
see
we
have
a
target
of
25
percent
of
the
workload
on
corey
being
jobs
that
need
a
system
of
corey
scale,
basically
large
jobs,
things
needing
more
than
more
than
a
thousand
nodes,
and
you
know
it's
good
to
see
that
corey
is
being
well
used
for
this
use
case.
So
we
have
a
25
target
and
over
40
of
our
workload
in
january
was
these
large
jobs.
A
Tickets
are
coming
in
at
a
slightly
faster
rate
last
month
than
than
what
we
closed
them.
So
we
have
a
current
backlog
as
of
a
couple
of
weeks
ago,
of
about
620
tickets,
and
that
is
all
of
the
formal
part
of
the
meeting.
If
people
are
interested
in
sticking
around
a
little
longer
to
chat
about
tips
on
using
slurm,
I
think
I'm
available
for
a
little
longer.
I'm
not
sure
if
I'm
not
sure
what
she's
ed's
calendar
looks
like.
Are
you
available
to
stick
around
for
another
five
or
ten
minutes?
A
Okay,
so
it
might
be
only
only
a
few
minutes
then
and
then
we'll
probably
need
to
drop,
but
I
I
believe
we
have
a
few
other
nurse
people
on
the
line
as
well
so
between
us.
We
can
probably
answer
some
questions
and
I
wouldn't
be
surprised
if
some
of
our
users
online
are
also
quite
experienced
slum
users
and
can
participate
in
that
contribute.
A
Answers
other
than
that.
Yes,
thank
you
all
for
joining
us,
we'll
post
the
recording
on
the
web
page
reasonably
soon
after
the
meeting
and
look
forward
to
seeing
you
all
again
next
month,.
A: So, as people are exiting: were there any other questions that people had about making the best use of Slurm?
C
Ask
a
follow-up
questions
about
this
flex:
queue
sure
for
the
shorter
jobs.
I
was
just
curious.
If
people
you
know,
run
shorter
jobs
like
snowy
hours
and
use
flex
q
in
not
intentional
way
just
to
get
the
cheaper.
You
know
charge
factor.
A: I don't think we actually have a mechanism in place to prevent or disallow that at the moment, but it is entirely likely that we would, if we do start to see it being used as a way of getting a discount.
A: Yes, it's a very good discount. So perhaps we can talk a little bit about the intent here, and what the needs and goals are that we're balancing. One important thing to keep in mind is that, for the most part, NERSC doesn't actually allocate the hours; the Department of Energy program managers allocate the hours for projects, and so the hours allocation should, at least in theory, reflect the priority that the DOE has, as well as, I guess, the needs of different research, and we don't want to kind of undermine that.
A: If the program managers really want a lot of the time to be spent in a particular area of research, or a lot of resources to be dedicated to a particular area of research, we don't want to risk undermining that through discounts and so on. On the other hand, we don't want the system to go to waste.
A
We
would
much
rather
have
jobs
that
can
fill
gaps,
fill
those
gaps
and
do
useful
research
yeah,
rather
than
letting
those
jobs
sit
idle,
and
so
this
is
what
drives
having
some
of
these
discounts
for,
for
things
like
the
flex
queue
is
that
by
encouraging
people
to
to
work
towards
making
their
jobs
flexible
and
to
create
jobs
that
work
well
with
the
scheduler
to
sort
of
you
know
optimize
the
experience
for
everybody,
you
know,
there's
there's
benefits
all
around,
and
so
that
is
a
you
know:
a
motivation
behind
putting
a
large
discount
there.
A: So if that suits your purposes: the flex queue is filling gaps, but it is a lower priority than others. There's also the low-priority queue, which actually starts slightly higher than the flex queue, but has a 50 percent discount.
A: So if you have a project that is not rich in terms of NERSC hours, but the workload doesn't have high urgency, then you can use the low queue to stretch your hours further.
E: Yes. Oh, I see another couple of questions.
A: So this is kind of a challenge, and the best and preferred solution is for that workload to be made to support stopping and restarting, so that it can be broken up into smaller chunks.
A: We have a documentation page about variable-time jobs, where we've got some example scripts, pretty much for setting up a long workload to use the flex queue: break it up into chunks, and have it resubmit itself when it reaches the end of one time chunk, to sort of automatically do long-running jobs. But that does require the job to be capable of checkpointing.
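[A rough sketch of the variable-time-job idea; the checkpoint and application commands are stubs, and the real, maintained scripts live in NERSC's variable-time jobs documentation.]

```bash
#!/bin/bash
#SBATCH --qos=flex
#SBATCH --constraint=knl
#SBATCH --nodes=8
#SBATCH --time=12:00:00
#SBATCH --time-min=02:00:00
#SBATCH --signal=B:USR1@300    # ask Slurm to signal us 5 minutes before the end

resubmit() {
    ./save_checkpoint.sh       # application-specific checkpoint (stub)
    sbatch "$0"                # re-queue this same script for the next chunk
    exit 0
}
trap resubmit USR1             # run resubmit() when the warning signal arrives

# Run in the background so the shell can handle the trap, then wait.
srun ./long_running_app --restart-from latest &
wait
```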
A: Longer-running jobs make it more difficult to schedule things like maintenance, and if the system becomes unavailable, there's a higher probability, I guess, of an interrupt that way.
A: For jobs that don't have their own built-in checkpoint-and-restart capability, in a lot of cases, even for MPI jobs, this can allow the job to be stopped and automatically requeued, running in the flex queue for a few hours at a time, to get through work that would otherwise need the full 48 hours.
C: I was just curious, one more question, about the table I made out of this webpage: I wonder if you have any idea about the variability within each group, each column I made. For example, within the 64-to-127-node group, how much variability is there?
E: Let's share the screen with that.
A: So I have to admit this is not something I've looked into, and I don't know if anybody else at NERSC has; it's entirely possible somebody else has looked a little more into it than I have.
A: There is more variability than I expected to see, which might just imply that there is a different workload at different times. Something that's a fairly easy statistical trap to trip up on with the wait-time charts is...
A: ...there is a second chart on that page showing the number of jobs, and in some cases, when you see the wait time being particularly high or particularly low for a certain category, when you look, it's because during that period there was only a very small number of jobs, five or ten jobs, that requested that particular shape. So these statistically probably-not-too-significant jobs may be adding a lot of, if you like, false variability.
A: Okay, yeah. Oh, and this is particularly by number of nodes; that's a point as well. The number of hours would add sort of a second dimension to it.
C: Yeah, but I don't see other tables on this particular web page with the numbers as a function of hours requested; it's only represented graphically, I believe.
A
Oh
as
in
as
in
getting
getting
the
underlying
table
behind
this
chart,
yeah
yeah
yeah
that
information
should
be
available,
but
I
don't
think
we
like,
specifically,
you
know,
publish
it.
So
it's
not
it's
not
that
it's
hidden.
It's
just
not
actually
actively
published
right.
A: Although this is just by nodes, and not necessarily weighted by time, sorry, by wall time requested. So you will sometimes find that, for some of these, in this square there were only six jobs, and that is contributing to one of these boxes here; because it's not very many jobs, it's not statistically fantastic.
A
So
it
is
something
to
have
in
mind
when
you're.
Looking
at
this
another
really
useful
one.
I
find
that
in
a
way
zooms
out
a
little
bit
and-
and
I
think
can
be
a
bit
helpful
because
of
that
is
this
backlog,
and
what
this
is
showing
is
the
total
amount
of
queued
work
in
terms
of
you
know
all
of
corey
days.
A: Hopefully, if a job is a little bit short, it'll be able to backfill and start a lot sooner than that. You will notice that KNL jobs have a much lower backlog; they're in the kind of two-day range, compared to 10 days for Haswell, and that is driven partly by the fact that we have five times as many KNL nodes.
A: Okay, I see, got it. So the numbers for Cori are obviously a lot bigger, because there are somewhere in excess of 12,000 nodes, and about 10,000 of those are KNL nodes; so there's sort of 2.2 days times 10,000 nodes' worth of work currently queued for Cori KNL.
N: So, about sqs: we removed that column recently. Before, what we had was a column telling you a job was going to be starting in, say, three days. Actually, it's not that it's going to start in three days; it's that its priority is going to reach the threshold at which the scheduler will consider it in its regular pass. But then there are also times when you see a job start immediately...
N
It's
because
it's
backfilled,
so
the
column
tells
you
it's
going
to
reach
the
threshold
to
for
the
scheduler
to
consider
its
regular
path
in
how
many
days,
because
we
have
a
threshold
set
there,
but
then
all
the
jobs
in
there.
If
it's
short
enough
small
enough
and
you
can
get
a
opportunity
to
get
back
filled,
it
will
run
quickly.
N
Again
that
you're,
this
is
what
this
plot
is
going
to
tell
us.
We
have
in
our
configuration
actual
numbers.
So
if
you
submit
a
job
in
a
regular
queue,
you
come
in
with
a
start
priority,
and
then
we
have
it.
It
takes
you
three
days
to
start
to
be
scheduled,
but
actually
recently
we
removed
that
wait
time.
The
the
original
we
had
this
as
manual
wait
time
for
a
regular
job
to
be
scheduled.
N
Our
a
new
scheme
does
not
need
to
have
this
up
manual
weight
to
originally
we
had
some
some
related
bugs
that
without
this
weight
there
were
issues
related
to
the
system
over
utilization
and
newer
version
does
not
need
this
weight
anymore.
So,
basically,
a
a
job
submit
it
already
reaches
threshold.
So
that's
why
we
remove
that
column,
you're,
not
seeing
it
in
sqs
anymore,
something
like
that.
N: But returning to that diagram: the backlog just means that this is how much work is scheduled and waiting to be done. It doesn't mean that that's exactly how long it will take these jobs to finish; many of them could finish sooner, many of them could error out, and so on.
N: The backlog is just however much workload is in the system right now. A new job submitted after that, like right now, could still finish before that, because there are some jobs in the queue with priority lower than yours, and there are also some users who have more than two jobs in the queue, and those are big enough that they're not fitting into a backfill.
A: Expanding on that a little bit: your job goes into the queue at some point, and its priority basically increases over time. The change that Helen was talking about, where you used to see a three-day wait, was because the regular queue used to start out down here, and it would take three days to get up to this line, whereas now it starts right at the top. But only two jobs per user can increase past this line at a time, and above here is the part of the queue where Slurm spends a lot of time trying to find a place in the schedule for each job to start; for everything below that, it does a quick run-through and just asks: can I start this job right now?
A: So we're almost at 12:30 now, and it might be time to close out the meeting. Thanks, all, for sticking around for the further discussion, and thanks again to Shahzeb, although I think he's dropped out, for telling us about Slurm. We'll look forward to seeing you all again next month.